Biostatistics with R/Import

From Wikibooks, open books for an open world

Why R for biostatistics?

R has several advantages over common statistical packages such as SPSS, SAS and MINITAB. It is

  • powerful
  • available for many platforms (macOS, Windows, Linux etc.)
  • programmable
  • non-commercial (free and open-source software)
  • extensively documented

Obtaining R/Installation

For download and installation instructions, refer to the R FAQ.

Data Import

The data sets on Wiley's website are available in CSV, Excel, MINITAB, SAS and SPSS formats. Although data saved in MINITAB, SAS and SPSS formats can be imported into R using the foreign package (Excel files need a separate package such as readxl), you should download the data in CSV format, because CSV is the easiest format to process in R.

For example, suppose you would like to import the "Large Data Set" data file. Assuming the downloaded data file (LDS_C02_NCBIRTH800.csv) is stored in the directory "/Desktop", it can be imported into R as a data frame called "largedataset" using the following syntax:

> largedataset <- read.csv("/Desktop/LDS_C02_NCBIRTH800.csv", header=TRUE,na.strings="NA")

If you prefer to choose the data file through a standard point-and-click dialog, you can use the function file.choose(), i.e.

> largedataset <- read.csv(file.choose(), header=TRUE,na.strings="NA")
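Since the Wiley file may not be at hand, the following is a minimal sketch of the same read.csv() call on a small stand-in file written to a temporary location. The file contents and the column names (id, sex, age) are made up for illustration and are not taken from the real data set.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical data,
# not the real LDS_C02_NCBIRTH800.csv).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age",
             "1,1,30",
             "2,2,NA",
             "3,1,45"), tmp)

# header = TRUE: the first line holds the column names.
# na.strings = "NA": cells containing NA are read as missing values.
demo <- read.csv(tmp, header = TRUE, na.strings = "NA")

str(demo)             # structure: 3 observations of 3 variables
sum(is.na(demo$age))  # number of missing ages
```

If the real file marks missing values differently (for example with an empty cell or a dot), na.strings can be given a vector of markers, such as na.strings = c("NA", "", ".").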

You should now have imported the data from the CSV file into a data frame called "largedataset". You can look inside the data frame by calling its name:

> largedataset
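Printing a large data frame by name can flood the console, so it is often more convenient to inspect it in pieces. The sketch below uses a made-up stand-in data frame (the name demo and its columns are hypothetical); the same functions can be applied to largedataset.

```r
# Hypothetical stand-in for a large data frame (100 rows, made-up values).
demo <- data.frame(id  = 1:100,
                   sex = rep(c(1, 2), times = 50),
                   age = rep(c(25, 30, 35, 40), times = 25))

dim(demo)      # number of rows and columns
names(demo)    # column names
head(demo, 5)  # first five rows only
str(demo)      # type and first values of every column
summary(demo)  # per-column summary statistics
```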

You can access the variable (in computer lingo, column) "sex" inside the largedataset data frame with the $ operator:

> largedataset$sex

For example, to count how often each value of sex occurs:

> table(largedataset$sex)
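By default table() silently drops missing values. As a small sketch on made-up codes (the vector below is hypothetical, not taken from the data set), the useNA argument makes missing values visible and prop.table() turns counts into proportions:

```r
# Made-up sex codes with one missing value (hypothetical data).
sex_codes <- c(1, 2, 2, 1, 2, NA, 1, 2)

counts <- table(sex_codes)          # frequencies; NA is dropped by default
counts
table(sex_codes, useNA = "ifany")   # also counts missing values
prop.table(counts)                  # proportions instead of counts
```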

You can attach the data frame so that its variables can be called directly by name:

> attach(largedataset)
> table(sex)
> detach(largedataset) # undo the attach
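Attached data frames can cause confusion when an object with the same name as a column already exists in the workspace. A commonly used alternative is with(), which makes the columns available only for a single expression. The sketch below uses a hypothetical stand-in data frame:

```r
# Hypothetical stand-in data frame (made-up values).
demo <- data.frame(sex = c(1, 2, 2, 1, 2),
                   age = c(30, 31, 32, 34, 35))

# Evaluate an expression with the columns of demo in scope,
# without attaching anything to the search path.
sex_counts <- with(demo, table(sex))
sex_counts

mean_age <- with(demo, mean(age))
mean_age
```

Nothing needs to be detached afterwards, because with() does not change the search path.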

Basic data management

R is designed to be an analysis system rather than an integrated environment such as SPSS. Unlike SPSS, R is not built around a spreadsheet-like environment for data input. Data are usually entered in other software (e.g. a database or a spreadsheet program such as OpenOffice.org Calc) and then imported into R as described above. For quick one-off calculations, however, you can do the data entry in R itself. For example, if you want to calculate the mean age of ten patients (30, 31, 32, 34, 35, 36, 37, 30, 40, 45), you can enter the data into R using the c() function.

> pt_age <- c(30,31,32,34,35,36,37,30,40,45)

You may call the newly created object pt_age by its name...

> pt_age

...and then calculate the mean age of the ten patients.

> mean(pt_age)
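The same vector can be passed to other descriptive functions. As a short sketch using the ten ages from the text:

```r
# Ages of the ten patients from the example above.
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)

length(pt_age)   # number of patients: 10
mean(pt_age)     # arithmetic mean: 35
median(pt_age)   # middle value: 34.5
sd(pt_age)       # sample standard deviation
range(pt_age)    # smallest and largest age: 30 45
```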