Biostatistics with R
The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Biostatistics_with_R
Biostatistics with R authors
License
The text of this book is released under the terms of the Creative Commons Attribution-ShareAlike 3.0 and GNU Free Documentation License. The particular version of that license that is being used can be found at:
- Wikibooks:Creative Commons Attribution-ShareAlike 3.0 Unported License
- Wikibooks:GNU Free Documentation License
Images used in this document are available under various licenses. Clicking on the image will take you to a description page where the licensing information is displayed.
Authors
List
- Hanjin Deviasse
  - First authorship: Sep, 2015
  - Contributions:
- Chainsawriot
  - First authorship: 2011
  - Contributions: 1 page
A Brief Introduction To R/The First Step in R
What is R?
How to install R
RStudio
Use R package
Data Entry to R
Some Special Values
Reference
Import
Why R for biostatistics?
R is superior to common statistical packages such as SPSS, SAS and MINITAB because it is
- powerful
- available for many platforms (Mac OS X, Windows, Linux etc.)
- programmable
- non-commercial
- extensively documented
Obtaining R/Installation
You may refer to the R FAQ.
Data Import
The data sets available on Wiley's website come in CSV, Excel, MINITAB, SAS and SPSS formats. Although the foreign package can import several MINITAB, SAS and SPSS file types into R (Excel files need a separate package such as readxl), you should download the data in CSV format, because CSV is the easiest format to process in R.
For example, suppose you would like to import the "Large Data Set" data file. The downloaded data file (LDS_C02_NCBIRTH800.csv), assumed to be stored in the directory "/Desktop", can be imported into R as a data frame called "largedataset" using the following syntax:
> largedataset <- read.csv("/Desktop/LDS_C02_NCBIRTH800.csv", header=TRUE,na.strings="NA")
If you prefer to choose the data file the standard "point-and-click" GUI way, you may use the function file.choose(), i.e.
> largedataset <- read.csv(file.choose(), header=TRUE, na.strings="NA")
You should now have imported the data from the CSV file into a data frame called "largedataset". You can look inside the data frame by calling its name:
> largedataset
You can access the variable (in computer lingo, the column) "sex" inside the largedataset data frame with
> largedataset$sex
For example, to count the frequency of each value of sex:
> table(largedataset$sex)
You can attach the data frame so that you can call the variable directly:
> attach(largedataset)
> table(sex)
> detach(largedataset) # cancel attaching
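The import steps above can be tried without the Wiley file. The following is only a sketch: it first writes a tiny made-up CSV to a temporary file, so the column names (sex, weeks) and all values are hypothetical stand-ins, not the contents of LDS_C02_NCBIRTH800.csv. It also shows with() as one possible alternative to attach() and detach().

```r
# Hypothetical stand-in for the downloaded CSV file (all values are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sex,weeks",
             "1,39",
             "2,NA",
             "1,40",
             "2,38"), csv_path)

# Same call pattern as in the text: the first row holds the column names
# and the string "NA" marks a missing value
demo_data <- read.csv(csv_path, header = TRUE, na.strings = "NA")

str(demo_data)                        # structure of the imported data frame
sex_counts <- table(demo_data$sex)    # frequency of each value of sex
print(sex_counts)

# with() evaluates an expression inside the data frame,
# so no attach()/detach() pair is needed
weeks_missing <- with(demo_data, sum(is.na(weeks)))
print(weeks_missing)
```

Because with() only affects the single expression it wraps, it avoids the risk of forgetting to detach a data frame later.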
Basic data management
R is designed to be an analysis system rather than an integrated environment such as SPSS. Unlike SPSS, R is not built around a spreadsheet-like environment for data input. Usually data are entered using other software (e.g. a database or spreadsheet software such as OpenOffice.org Calc) and then imported into R as described above. For quick one-off calculations, you can do the data entry in R. For example, if you want to calculate the mean age of ten patients (30, 31, 32, 34, 35, 36, 37, 30, 40, 45), you can enter the data into R using the c() function.
> pt_age <- c(30,31,32,34,35,36,37,30,40,45)
You may call the newly created object pt_age by its name...
> pt_age
...and then calculate the mean age of the ten patients.
> mean(pt_age)
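Put together as a script, the quick calculation above looks like the sketch below. The calls to length(), median() and summary() are additions for illustration and are not part of the original walk-through.

```r
# Ages of the ten patients from the text
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)

n_patients <- length(pt_age)   # number of observations: 10
mean_age   <- mean(pt_age)     # arithmetic mean: 35
median_age <- median(pt_age)   # middle of the ordered values: 34.5

print(mean_age)
print(summary(pt_age))         # minimum, quartiles, mean and maximum in one call
```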
Introduction to Biostatistics
REVIEW EXERCISES
1. Explain what is meant by descriptive statistics.
2. Explain what is meant by inferential statistics.
3. Define: (a) Statistics (b) Biostatistics (c) Variable (d) Quantitative variable (e) Qualitative variable (f) Random variable (g) Population (h) Finite population (i) Infinite population (j) Sample (k) Discrete variable (l) Continuous variable (m) Simple random sample (n) Sampling with replacement (o) Sampling without replacement
4. Define the word measurement.
5. List, describe, and compare the four measurement scales.
6. For each of the following variables, indicate whether it is quantitative or qualitative and specify the measurement scale that is employed when taking measurements on each: (a) Class standing of the members of this class relative to each other (b) Admitting diagnosis of patients admitted to a mental health clinic (c) Weights of babies born in a hospital during a year (d) Gender of babies born in a hospital during a year (e) Range of motion of elbow joint of students enrolled in a university health sciences curriculum (f) Under-arm temperature of day-old infants born in a hospital
7. For each of the following situations, answer questions a through e: (a) What is the sample in the study? (b) What is the population? (c) What is the variable of interest? (d) How many measurements were used in calculating the reported results? (e) What measurement scale was used? Situation A. A study of 300 households in a small southern town revealed that 20 percent had at least one school-age child present. Situation B. A study of 250 patients admitted to a hospital during the past year revealed that, on the average, the patients lived 15 miles from the hospital.
8. Consider the two situations given in Exercise 7. For Situation A describe how you would use a stratified random sample to collect the data. For Situation B describe how you would use systematic sampling of patient records to collect the data.
Descriptive Statistics
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
2.3.1 | Class interval width using Sturges’s Rule | Example | |
2.4.1 | Mean of a population | Example | |
2.4.2 | Skewness | Example | |
2.4.2 | Mean of a sample | Example | |
2.5.1 | Range | Example | |
2.5.2 | Sample variance | Example | |
2.5.3 | Population variance | Example | |
2.5.4 | Standard deviation | Example | |
2.5.5 | Coefficient of variation | Example | |
2.5.6 | Quartile location in ordered array | Example | |
2.5.7 | Interquartile range | Example | |
2.5.8 | Kurtosis | Example | |
Symbol Key |
|
Example |
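Several of the quantities named in the table have direct counterparts in base R. The sketch below reuses the ten patient ages from the introduction. The Sturges expression k = 1 + 3.322 log10(n) is assumed to match the book's formula 2.3.1, and IQR() uses R's default quantile definition, which may differ slightly from the quartile locations of formula 2.5.6.

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
n <- length(pt_age)

# Sturges's rule (assumed form): k classes, class width = range / k
k           <- 1 + 3.322 * log10(n)
value_range <- diff(range(pt_age))        # largest minus smallest value
class_width <- value_range / k

sample_mean <- mean(pt_age)
sample_var  <- var(pt_age)                # divisor n - 1
pop_var     <- sample_var * (n - 1) / n   # divisor n: data treated as a whole population
sample_sd   <- sd(pt_age)                 # square root of the sample variance
cv_percent  <- 100 * sample_sd / sample_mean   # coefficient of variation in percent
iqr_value   <- IQR(pt_age)                # R's default quantile type, see note above

print(c(mean = sample_mean, range = value_range, var = sample_var,
        sd = sample_sd, cv = cv_percent, iqr = iqr_value))
```

Note that var() and sd() always use the n - 1 divisor, so the population variance has to be rescaled by hand as shown.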
The Ordered Array
The Frequency Distribution
Example 2.2.1 details the procedure for sorting an array. The array is a series of ages of subjects who took part in two kinds of smoking cessation program. Suppose you have already imported the data set using the following command:
> SmokeCProg <- read.csv("/EXA_C01_S04_01.csv", header=TRUE, na.strings="NA")
It is better to use a descriptive name (SmokeCProg for Smoking Cessation Program) than a commonly used placeholder name such as x or y. We can obtain a sorted array of ages using the following command:
> sort(SmokeCProg$AGE)
The frequency distribution of ages shown in Table 2.3.1 can be obtained using:
> table(cut(SmokeCProg$AGE, breaks=c(0,39,49,59,69,79,89)))
 (0,39] (39,49] (49,59] (59,69] (69,79] (79,89]
     11      46      70      45      16       1
The cut() function divides the AGE variable into intervals at the break points provided (0,39,49,59,69,79,89), and table() counts the observations in each interval. Table 2.3.2 provides the frequency table of age.
As suggested by Venables et al. in the book "An Introduction to R", statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Compared to other statistical packages, R gives only minimal output. This example demonstrates that important characteristic. In the previous example, we calculated the frequency distribution of ages using the table() and cut() functions. We can store the result in an object called "AgeFreqTable" using:
> AgeFreqTable <- table(cut(SmokeCProg$AGE, breaks=c(0,39,49,59,69,79,89)))
You will get no output until you call the object "AgeFreqTable":
> AgeFreqTable
 (0,39] (39,49] (49,59] (59,69] (69,79] (79,89]
     11      46      70      45      16       1
To obtain the cumulative frequency, we can process the object "AgeFreqTable" using the cumsum() function:
> cumsum(AgeFreqTable)
 (0,39] (39,49] (49,59] (59,69] (69,79] (79,89]
     11      57     127     172     188     189
Before we move on to the calculation of relative frequency, we can obtain the total number of observations in a variable using the length() function:
> length(SmokeCProg$AGE)
[1] 189
We can calculate the relative frequency by dividing each item in the object "AgeFreqTable" by the total number of observations:
> AgeFreqTable/length(SmokeCProg$AGE)
     (0,39]     (39,49]     (49,59]     (59,69]     (69,79]     (79,89]
0.058201058 0.243386243 0.370370370 0.238095238 0.084656085 0.005291005
Similarly, the cumulative relative frequency can be calculated using:
> cumsum(AgeFreqTable)/length(SmokeCProg$AGE)
    (0,39]    (39,49]    (49,59]    (59,69]    (69,79]    (79,89]
0.05820106 0.30158730 0.67195767 0.91005291 0.99470899 1.00000000
If you would like to round the relative frequencies to 4 digits, you can use the round() function:
> round(AgeFreqTable/length(SmokeCProg$AGE), digits=4)
 (0,39] (39,49] (49,59] (59,69] (69,79] (79,89]
 0.0582  0.2434  0.3704  0.2381  0.0847  0.0053
Alternatively, you can store the relative frequencies in a new object and then process that object with the round() function:
> AgeRelFreqTable <- AgeFreqTable/length(SmokeCProg$AGE)
> round(AgeRelFreqTable, digits=4)
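The whole sequence can be rehearsed on a small vector when the data file is not at hand. In this sketch the twelve ages are made up for illustration, and prop.table() is used as an equivalent of dividing the table by length().

```r
# Hypothetical ages, standing in for SmokeCProg$AGE
ages <- c(34, 38, 41, 45, 47, 52, 53, 55, 58, 61, 66, 72)

# Same break points as in the text; each interval is closed on the right, e.g. (39,49]
age_freq <- table(cut(ages, breaks = c(0, 39, 49, 59, 69, 79, 89)))

age_cum_freq <- cumsum(age_freq)       # cumulative frequency
age_rel_freq <- prop.table(age_freq)   # relative frequency, same as age_freq / length(ages)

print(age_freq)
print(age_cum_freq)
print(round(age_rel_freq, digits = 4))
```

Storing each intermediate table in its own object, as here, makes it easy to reuse it in later steps.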
Exercise: Try to round the cumulative relative frequencies to 4 digits using an R command.
To plot a histogram, you can use the hist() function, e.g.
> hist(SmokeCProg$AGE)
You can customize the histogram by adding arguments (i.e. options); type ?hist to learn more about the arguments of the hist() function. For example, to plot a histogram with about five bars (similar to Figure 2.3.2), keeping in mind that R treats a single number given to breaks as a suggestion rather than an exact count:
> hist(SmokeCProg$AGE, breaks=5)
You can add more arguments to the hist() function, e.g.
> hist(SmokeCProg$AGE, breaks=5, ylim=c(0,70), main="Histogram of Ages of 189 subjects", col="red", xlab="Age")
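When the exact class boundaries matter, a vector of cut points can be passed to breaks instead of a single number. The sketch below uses made-up ages and first computes the histogram with plot=FALSE so that the bar counts can be inspected before drawing.

```r
# Hypothetical ages, standing in for SmokeCProg$AGE
ages <- c(34, 38, 41, 45, 47, 52, 53, 55, 58, 61, 66, 72)

# A vector of break points is used exactly as given,
# whereas a single number is only a suggestion to hist()
age_hist <- hist(ages, breaks = c(29, 39, 49, 59, 69, 79), plot = FALSE)

print(age_hist$breaks)   # class boundaries actually used
print(age_hist$counts)   # number of observations in each class

# Draw the stored histogram object with a few cosmetic arguments
plot(age_hist, main = "Histogram of ages", xlab = "Age", col = "red")
```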
Remember to consult the documentation (e.g. ?hist or help.search("histogram")) whenever you have a question. Most of the time, you can find the answer in the help pages. For example, suppose you don't know how to plot a stem-and-leaf display of your data, and you don't even know the name of the function. You can use help.search() to search for the keyword "stem", i.e.
> help.search("stem")
A function called stem() should be in the results. We can then use this function to visualize our data:
> stem(SmokeCProg$AGE)
  The decimal point is 1 digit(s) to the right of the |
  3 | 04
  3 | 577888899
  4 | 00223333334444444
  4 | 55566666677777788888889999999
  5 | 0000000011112222223333333333333333344444444444
  5 | 555666666777777788999999
  6 | 000011111111111222222233444444
  6 | 556666667888999
  7 | 0111111123
  7 | 567888
  8 | 2
Unlike in MINITAB, the stem unit is adjusted through the scale argument. The plot above uses the default scale of 1, which here is equivalent to a stem unit of 5. To change the stem unit to 10, set the scale argument to 0.5:
> stem(SmokeCProg$AGE, scale=0.5)
  The decimal point is 1 digit(s) to the right of the |
  3 | 04577888899
  4 | 0022333333444444455566666677777788888889999999
  5 | 00000000111122222233333333333333333444444444445556666667777777889999
  6 | 000011111111111222222233444444556666667888999
  7 | 0111111123567888
  8 | 2
Central Tendency
Some Basic Probability Concepts
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
3.2.1 | Classical probability | Example | |
3.2.2 | Relative frequency probability | Example | |
3.3.1–3.3.3 | Properties of probability | Example | |
3.4.1 | Multiplication rule | Example | |
3.4.2 | Conditional probability | Example | |
3.4.3 | Addition rule | Example | |
3.4.4 | Independent events | Example | |
3.4.5 | Complementary events | Example | |
3.4.6 | Marginal probability | Example | |
Sensitivity of a screening test | Example | ||
Specificity of a screening test | Example | ||
3.5.1 | Predictive value positive of a screening test | Example | |
3.5.2 | Predictive value negative of a screening test | Example | |
Symbol Key |
|
Example |
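The screening-test rows of the table can be illustrated with a small 2 × 2 layout. Everything numeric in this sketch is hypothetical: the four counts and the prevalence of 0.10 are made up, and the predictive values are computed with Bayes's theorem, which is assumed to be the form used for formulas 3.5.1 and 3.5.2.

```r
# Hypothetical screening results (counts are made up)
true_pos  <- 90    # diseased, test positive
false_neg <- 10    # diseased, test negative
false_pos <- 30    # not diseased, test positive
true_neg  <- 170   # not diseased, test negative

# Conditional probabilities estimated from the table
sensitivity <- true_pos / (true_pos + false_neg)   # P(test positive | disease)
specificity <- true_neg / (true_neg + false_pos)   # P(test negative | no disease)

# Predictive values via Bayes's theorem for an assumed disease prevalence
prevalence <- 0.10
ppv <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
npv <- (specificity * (1 - prevalence)) /
  (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)

print(c(sensitivity = sensitivity, specificity = specificity, ppv = ppv, npv = npv))
```

With these made-up numbers the test is quite sensitive, yet fewer than half of the positive results indicate disease, because the assumed prevalence is low.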
Probability Distributions
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
4.2.1 | Mean of a frequency distribution | Example | |
4.2.2 | Variance of a frequency distribution | Example | |
4.3.1 | Combination of objects | Example | |
4.3.2 | Binomial distribution function | Example | |
4.3.3–4.3.5 | Tabled binomial probability equalities | Example | |
4.4.1 | Poisson distribution function | Example | |
4.6.1 | Normal distribution function | Example | |
4.6.2 | z-transformation | Example | |
4.6.3 | Standard normal distribution function | Example | |
Symbol Key |
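Base R provides functions for most entries in this table. The sketch below uses arbitrary illustrative parameters (n = 5, p = 0.5, lambda = 2, and a normal distribution with mean 100 and standard deviation 15), and none of these values come from the book.

```r
# Combination of objects: 5 objects taken 2 at a time
n_comb <- choose(5, 2)                      # 10

# Binomial distribution: P(X = 2) with n = 5 trials and p = 0.5
p_two <- dbinom(2, size = 5, prob = 0.5)    # 0.3125

# Mean and variance of a (discrete) distribution from its probabilities
x      <- 0:5
p_x    <- dbinom(x, size = 5, prob = 0.5)
mu_x   <- sum(x * p_x)                      # sum of x * P(X = x)
var_x  <- sum((x - mu_x)^2 * p_x)           # sum of (x - mu)^2 * P(X = x)

# Equality used with cumulative tables: P(X >= 3) = 1 - P(X <= 2)
p_at_least_3 <- 1 - pbinom(2, size = 5, prob = 0.5)

# Poisson distribution: P(X = 0) with lambda = 2
p_zero <- dpois(0, lambda = 2)

# z-transformation, then the standard normal distribution function
mu <- 100; sigma <- 15; obs <- 130
z  <- (obs - mu) / sigma
p_below <- pnorm(z)                         # same as pnorm(obs, mean = mu, sd = sigma)

print(c(n_comb, p_two, mu_x, var_x, p_at_least_3, p_zero, z, p_below))
```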
Some Important Sampling Distributions
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
5.3.1 | z-transformation for sample mean | Example | |
5.4.1 | z-transformation for difference between two means | Example | |
5.5.1 | z-transformation for sample proportion | Example | |
5.5.2 | Continuity correction when x < np | Example | |
5.5.3 | Continuity correction when x > np | Example | |
5.6.1 | z-transformation for difference between two proportions | Example | |
Symbol Key |
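The z-transformations for a sample mean and a sample proportion can be sketched as below. The standard forms z = (x̄ − μ)/(σ/√n) and z = (p̂ − p)/√(p(1 − p)/n) are assumed, and all numbers are hypothetical.

```r
# Sample mean: hypothetical population with mu = 100, sigma = 15, sample of n = 25
mu <- 100; sigma <- 15; n <- 25
xbar   <- 106
z_mean <- (xbar - mu) / (sigma / sqrt(n))   # standard error of the mean is sigma / sqrt(n)
p_mean_above <- 1 - pnorm(z_mean)           # P(sample mean > 106)

# Sample proportion: hypothetical population proportion p = 0.5, sample of n = 100
p <- 0.5; n_prop <- 100
p_hat  <- 0.55
z_prop <- (p_hat - p) / sqrt(p * (1 - p) / n_prop)
p_prop_above <- 1 - pnorm(z_prop)           # P(sample proportion > 0.55)

print(c(z_mean = z_mean, p_mean_above = p_mean_above,
        z_prop = z_prop, p_prop_above = p_prop_above))
```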
Estimation
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
6.2.1 | Expression of an interval estimate | estimator ± (reliability coefficient)× standard error of the estimator | Example |
6.2.2 | Interval estimate for μ when σ is known | Example | |
6.3.1 | t-transformation | Example | |
6.3.2 | Interval estimate for μ when σ is unknown | Example | |
6.4.1 | Interval estimate for the difference between two population means when σ₁ and σ₂ are known | Example | |
6.4.2 | Pooled variance estimate | Example | |
6.4.3 | Standard error of estimate | Example | |
6.4.4 | Interval estimate for the difference between two population means when σ is unknown | Example | |
6.4.5 | Cochran’s correction for reliability coefficient when variances are not equal | Example | |
6.4.6 | Interval estimate using Cochran’s correction for t | Example | |
6.5.1 | Interval estimate for a population proportion | Example | |
6.6.1 | Interval estimate for the difference between two population proportions | Example | |
6.7.1–6.7.3 | Sample size determination when sampling with replacement | Example | |
6.7.4–6.7.5 | Sample size determination when sampling without replacement | Example | |
6.8.1 | Sample size determination for proportions when sampling with replacement | Example | |
6.8.2 | Sample size determination for proportions when sampling without replacement | Example | |
6.9.1 | Interval estimate for σ² | Example | |
6.9.2 | Interval estimate for σ | Example | |
6.10.1 | Interval estimate for the ratio of two variances | Example | |
6.10.2 | Relationship among F ratios | Example | |
Symbol Key |
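The general pattern of formula 6.2.1, estimator ± (reliability coefficient) × (standard error), can be sketched for a single mean using the ten patient ages from the introduction. The "known" σ of 5 is a made-up value used only to show the z-based version.

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
n    <- length(pt_age)
xbar <- mean(pt_age)
s    <- sd(pt_age)

# sigma unknown: reliability coefficient from the t distribution with n - 1 df
t_coef <- qt(0.975, df = n - 1)
ci_t   <- xbar + c(-1, 1) * t_coef * s / sqrt(n)

# sigma known (hypothetical value): reliability coefficient from the normal distribution
sigma_known <- 5
z_coef <- qnorm(0.975)
ci_z   <- xbar + c(-1, 1) * z_coef * sigma_known / sqrt(n)

# The built-in t.test() reports the same t-based 95% interval
ci_builtin <- as.vector(t.test(pt_age, conf.level = 0.95)$conf.int)

print(ci_t)
print(ci_z)
print(ci_builtin)
```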
Hypothesis Testing
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
7.1.1, 7.1.2, 7.2.1 | z-transformation (using either μ or μ₀) | Example | |
7.2.2 | t-transformation | Example | |
7.2.3 | Test statistic when sampling from a population that is not normally distributed | Example | |
7.3.1 | Test statistic when sampling from normally distributed populations: population variances known | Example | |
7.3.2 | Test statistic when sampling from normally distributed populations: population variances unknown and equal | Example | Example |
7.3.3, 7.3.4 | Test statistic when sampling from normally distributed populations: population variances unknown and unequal | Example | Example |
7.3.5 | Sampling from populations that are not normally distributed | Example | Example |
7.4.1 | Test statistic for paired differences when the population variance is unknown | Example | Example |
7.4.2 | Test statistic for paired differences when the population variance is known | Example | Example |
7.5.1 | Test statistic for a single population proportion | Example | Example |
7.6.1, 7.6.2 | Test statistic for the difference between two population proportions | Example | Example |
7.7.1 | Test statistic for a single population variance | Example | Example |
7.8.1 | Variance ratio | Example | Example |
7.9.1, 7.9.2 | Upper and lower critical values for x̄ | Example | Example |
7.10.1, 7.10.2 | Critical value for determining sample size to control type II errors | Example | Example |
7.10.3 | Sample size to control type II errors | Example | Example |
Symbol Key |
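As an illustration of the t-transformation (7.2.2) and the test statistic for a single proportion (7.5.1), the sketch below computes both by hand and compares the first with t.test(). The hypothesized mean of 32, the hypothesized proportion of 0.5 and the count of 60 successes in 100 trials are all made-up values.

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
n   <- length(pt_age)
mu0 <- 32                                   # hypothetical null value for the mean

# One-sample t statistic: (sample mean - mu0) / (s / sqrt(n))
t_stat  <- (mean(pt_age) - mu0) / (sd(pt_age) / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1) # two-sided p-value

# Same test with the built-in function
builtin <- t.test(pt_age, mu = mu0)

# Single proportion: z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
successes <- 60; trials <- 100; p0 <- 0.5   # hypothetical data
p_hat  <- successes / trials
z_prop <- (p_hat - p0) / sqrt(p0 * (1 - p0) / trials)

print(c(t = t_stat, p = p_value, z = z_prop))
print(builtin)
```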
Analysis of Variance
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
8.2.1 | One-way ANOVA model | Example | Example |
8.2.2 | Total sum-of-squares | Example | Example |
8.2.3 | Within-group sum-of-squares | Example | Example |
8.2.4 | Among-group sum-of-squares | Example | Example |
8.2.5 | Within-group variance | Example | Example |
8.2.6 | Among-group variance I | Example | Example |
8.2.9 | Tukey’s HSD (equal sample sizes) | Example | Example |
8.2.10 | Tukey’s HSD (unequal sample sizes) | Example | Example |
8.3.1 | Two-way ANOVA model | Example | Example |
8.3.2 | Sum-of-squares representation | Example | Example |
8.3.3 | Sum-of-squares total | Example | Example |
8.3.4 | Sum-of-squares block | Example | Example |
8.3.5 | Sum-of-squares treatments | Example | Example |
8.3.6 | Sum-of-squares error | Example | Example |
8.4.1 | Fixed-effects, additive single-factor, repeated-measures ANOVA model | Example | Example |
8.4.2 | Fixed-effects, additive two-factor, repeated-measures ANOVA model | Example | Example |
8.5.1 | Two-factor completely randomized fixed-effects factorial model | Example | Example |
8.5.2 | Probabilistic representation of α | Example | Example |
8.5.3 | Sum-of-squares total I | Example | Example |
8.5.4 | Sum-of-squares total II | Example | Example |
8.5.5 | Sum-of-squares treatment partition | Example | Example |
Symbol Key |
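The one-way sums of squares can be checked by hand against aov(). The nine observations and three groups in this sketch are made up, and TukeyHSD() is shown as the base R routine for Tukey's HSD comparisons.

```r
# Hypothetical data: three groups of three observations each
values <- c(5, 7, 6, 9, 11, 10, 14, 13, 15)
group  <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)   # one mean per group
group_sizes <- as.vector(table(group))

ss_total  <- sum((values - grand_mean)^2)                # total sum of squares
ss_within <- sum((values - ave(values, group))^2)        # deviations from own group mean
ss_among  <- sum(group_sizes * (group_means - grand_mean)^2)

# Variance ratio: among-group variance over within-group variance
df_among  <- nlevels(group) - 1
df_within <- length(values) - nlevels(group)
f_ratio   <- (ss_among / df_among) / (ss_within / df_within)

fit <- aov(values ~ group)
print(summary(fit))     # ANOVA table with the same sums of squares and F value
print(TukeyHSD(fit))    # pairwise comparisons of group means
```

The identity total = within + among is what the ANOVA table partitions, so computing the three pieces separately is a useful check on the fitted model.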
Simple Linear Regression and Correlation
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
9.2.1 | Assumption of linearity | Example | Example |
9.2.2 | Simple linear regression model | Example | Example |
9.2.3 | Error (residual) term | Example | Example |
9.3.1 | Algebraic representation of a straight line | Example | Example |
9.3.2 | Least square estimate of the slope of a regression line | Example | Example |
9.3.3 | Least square estimate of the intercept of a regression line | Example | Example |
9.4.1 | Deviation equation | Example | Example |
9.4.2 | Sum-of-squares equation | Example | Example |
9.4.3 | Estimated population coefficient of determination | Example | Example |
9.4.4–9.4.7 | Means and variances of point estimators a and b | Example | Example |
9.4.8 | z statistic for testing hypotheses about β | Example | Example |
9.4.9 | t statistic for testing hypotheses about β | Example | Example |
9.5.1 | Prediction interval for Y for a given X | Example | Example |
9.5.2 | Confidence interval for the mean of Y for a given X | Example | Example |
9.7.1–9.7.2 | Correlation coefficient | Example | Example |
9.7.3 | t statistic for correlation coefficient | Example | Example |
9.7.4 | z statistic for correlation coefficient | Example | Example |
9.7.5 | Estimated standard deviation for z statistic | Example | Example |
9.7.6 | Z statistic for correlation coefficient | Example | Example |
9.7.7 | Z statistic for correlation coefficient when n < 25 | Example | Example |
9.7.8 | Standard deviation for z* | Example | Example |
9.7.9 | Z* statistic for correlation coefficient | Example | Example |
9.7.10 | Confidence interval for ρ | Example | Example |
Symbol Key |
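The least-squares estimates and the coefficient of determination can be computed by hand and compared with lm(). The five (x, y) pairs below are made up, and the textbook forms slope = Σ(x − x̄)(y − ȳ)/Σ(x − x̄)² and intercept = ȳ − slope × x̄ are assumed.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope and intercept from the sums of squares and cross-products
slope     <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# The same fit with lm()
fit <- lm(y ~ x)
print(coef(fit))

# Correlation coefficient; its square equals the coefficient of determination
r <- cor(x, y)
r_squared <- summary(fit)$r.squared

# Interval estimates for a new x value (x = 6 is an arbitrary choice)
new_x <- data.frame(x = 6)
print(predict(fit, newdata = new_x, interval = "prediction"))   # for a single Y
print(predict(fit, newdata = new_x, interval = "confidence"))   # for the mean of Y

print(cor.test(x, y))   # t test for the correlation coefficient
```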
Multiple Regression and Correlation
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
10.2.1 | Representation of the multiple linear regression equation | Example | Example |
10.2.2 | Representation of the multiple linear regression equation with two independent variables | Example | Example |
10.2.3 | Random deviation of a point from a plane when there are two independent variables | Example | Example |
10.3.1 | Sum-of-squared residuals | Example | Example |
10.4.1 | Sum-of-squares equation | Example | Example |
10.4.2 | Coefficient of multiple determination | Example | Example |
10.4.3 | t statistic for testing hypotheses about βᵢ | Example | Example |
10.5.1 | Estimation equation for multiple linear regression | Example | Example |
10.5.2 | Confidence interval for the mean of Y for a given X | Example | Example |
10.5.3 | Prediction interval for Y for a given X | Example | Example |
10.6.1 | Multiple correlation model | Example | Example |
10.6.2 | Multiple correlation coefficient | Example | Example |
10.6.3 | F statistic for testing the multiple correlation coefficient | Example | Example |
10.6.4–10.6.6 | Partial correlation between two variables (1 and 2) after controlling for a third (3) | Example | Example |
10.6.7 | t statistic for testing hypotheses about partial correlation coefficients | Example | Example |
Symbol Key |
Regression Analysis: Some Additional Techniques
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
11.4.1–11.4.3 | Representations of the simple linear regression model | Example | Example |
11.4.4 | Simple logistic regression model | Example | Example |
11.4.5 | Alternative representation of the simple logistic regression model | Example | Example |
11.4.6 | Alternative representation of the multiple logistic regression model | Example | Example |
11.4.7 | Alternative representation of the multiple logistic regression model | Example | Example |
Symbol Key |
The Chi-Square Distribution and the Analysis of Frequencies
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
12.2.1 | Standard normal random variable | Example | Example |
12.2.2 | Chi-square distribution with n degrees of freedom | Example | Example |
12.2.3 | Chi-square probability density function | Example | Example |
12.2.4 | Chi-square test statistic | Example | Example |
12.4.1 | Chi-square calculation formula for a 2 × 2 contingency table | Example | Example |
12.4.2 | Yates’s corrected chi-square calculation for a 2 × 2 contingency table | Example | Example |
12.6.1–12.6.2 | Large-sample approximation to the chi-square | Example | Example |
12.7.1 | Relative risk estimate | Example | Example |
12.7.2 | Confidence interval for the relative risk estimate | Example | Example |
12.7.3 | Odds ratio estimate | Example | Example |
12.7.4 | Confidence interval for the odds ratio estimate | Example | Example |
12.7.5 | Expected frequency in the Mantel–Haenszel statistic | Example | Example |
12.7.6 | Stratum expected frequency in the Mantel–Haenszel statistic | Example | Example |
12.7.7 | Mantel–Haenszel test statistic | Example | Example |
12.7.8 | Mantel–Haenszel estimator of the common odds ratio | Example | Example |
Symbol Key |
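For a 2 × 2 table, the shortcut chi-square formula, Yates's correction, the relative risk and the odds ratio can be sketched as follows. The counts are hypothetical, and the layout (rows are exposure groups, first column is the outcome of interest) is an assumption made for this example.

```r
# Hypothetical 2 x 2 table: rows = exposed / unexposed, columns = outcome yes / no
a <- 20; b <- 10
c2 <- 15; d <- 35        # c2 is used instead of c to avoid masking the c() function
n  <- a + b + c2 + d
tab <- matrix(c(a, b, c2, d), nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"), outcome = c("yes", "no")))

# Shortcut formula: n(ad - bc)^2 / [(a + c)(b + d)(a + b)(c + d)]
denom       <- (a + c2) * (b + d) * (a + b) * (c2 + d)
chi_sq      <- n * (a * d - b * c2)^2 / denom
chi_sq_yate <- n * (abs(a * d - b * c2) - n / 2)^2 / denom   # Yates's correction

# Built-in test, without and with the continuity correction
test_plain <- chisq.test(tab, correct = FALSE)
test_yates <- chisq.test(tab, correct = TRUE)

# Relative risk and odds ratio for the assumed layout
relative_risk <- (a / (a + b)) / (c2 / (c2 + d))
odds_ratio    <- (a * d) / (b * c2)

print(c(chi_sq = chi_sq, chi_sq_yates = chi_sq_yate,
        rr = relative_risk, or = odds_ratio))
```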
Nonparametric and Distribution-Free Statistics
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
13.3.1 | Sign test statistic | Example | Example |
13.3.2 | Large-sample approximation of the sign test | Example | Example |
13.6.1 | Mann–Whitney test statistic | Example | Example |
13.6.2 | Large-sample approximation of the Mann–Whitney test | Example | Example |
13.6.3 | Equivalence of the Mann–Whitney and Wilcoxon two-sample statistics | Example | Example |
13.7.1–13.7.2 | Kolmogorov–Smirnov test statistic | Example | Example |
13.8.1 | Kruskal–Wallis test statistic | Example | Example |
13.8.2 | Kruskal–Wallis test statistic adjustment for ties | Example | Example |
13.9.2 | Friedman test statistic | Example | Example |
13.10.1 | Spearman rank correlation test statistic | Example | Example |
13.10.2 | Large-sample approximation of the Spearman rank correlation | Example | Example |
13.10.3–13.10.4 | Correction for tied observations in the Spearman rank correlation | Example | Example |
13.11.1 | Theil's estimator of b | Example | Example |
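Two of the statistics in the table can be checked by hand against base R. The sketch below uses made-up samples without ties. The Mann–Whitney statistic is assumed to be the rank sum of the first sample minus n(n + 1)/2, and the Spearman coefficient is assumed to be 1 − 6Σd²/(n(n² − 1)).

```r
# Mann-Whitney: two hypothetical independent samples
x <- c(1.1, 2.3, 3.8, 4.0, 5.6)
y <- c(2.0, 4.5, 6.1, 7.7, 8.9, 9.3)

ranks_all  <- rank(c(x, y))                      # ranks in the combined sample
rank_sum_x <- sum(ranks_all[seq_along(x)])       # sum of the ranks belonging to x
mw_stat    <- rank_sum_x - length(x) * (length(x) + 1) / 2

mw_test <- wilcox.test(x, y)                     # reports the same statistic as W
print(mw_test)

# Spearman rank correlation: two hypothetical paired variables
u <- c(10, 20, 30, 40, 50)
v <- c(1, 3, 2, 5, 4)

d_rank <- rank(u) - rank(v)                      # differences between paired ranks
n_pair <- length(u)
r_s    <- 1 - 6 * sum(d_rank^2) / (n_pair * (n_pair^2 - 1))

print(c(mw_stat = mw_stat, r_s = r_s))
print(cor(u, v, method = "spearman"))            # built-in equivalent
```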
Survival Analysis
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
14.2.1 | Example | Example | Example |
14.2.2 | Example | Example | Example |
14.2.3 | Example | Example | Example |
14.2.4 | Example | Example | Example |
14.2.5 | Example | Example | Example |
Vital Statistics
Summary of Formulas with R
Formula Number | Name | Formula | Formula with R |
---|---|---|---|
Example | Example | Example | Example |
Example | Example | Example | Example |
Example | Example | Example | Example |
Example | Example | Example | Example |
Example | Example | Example | Example |
Example | Example | Example | Example |
Example | Example | Example | Example |
Example | Example | Example | Example |
Example | Example | Example | Example |
Example | Example | Example | Example |
Further reading
For Biostatistics
For R programming