Reading Data from Files

In practice, data are usually stored in files. R can read many types of data files, including free-format text files, comma-separated value (CSV) files, Excel files, and SPSS, SAS, and Stata files.

Read Data from a Free Format Text File

The most common way to get data into R is to save the data in free format in a text file and then read the file with the read.table() function. For example, let's read the data in a file called gpa.txt, which is available on the website. The content of the data file is shown below.

## GPA data
## 999 represents missing data
 id gender college   gpa weight
  1      f     yes   3.6    110
  2      m     yes   3.5    170
  3      m      no  99.0    165
  4      m      no 999.0    190
  5      f      no 999.0     95
  6      m     yes   3.7    200
  7      m     yes   3.6    150
  8      f     yes   3.8    100
  9      f     yes   3.0    130
 10      f      no 999.0    120

Note that the first two lines of the data file start with "#", which marks them as notes or comments about the data. The third line contains the variable names. After that, there are 10 lines of data.

The function read.table() can load data from a local computer or from a remote location on the Internet. Since the data file here is online, we first show how to read it directly from its URL using the code below. Note that

  • file provides the link to the data file on the Internet or the path to the file on a computer.
  • header=TRUE tells R that the first line of data contains the variable names.
  • na.string="999" tells R that missing data are coded as 999. The full name of this argument is na.strings; R accepts the shortened form through partial argument matching. Multiple missing data strings can be provided in a vector such as na.string=c("99","999").
  • comment.char = "#" lets R skip lines starting with "#" in the file.

> gpadata <- read.table(file='https://advstats.psychstat.org/data/gpa.txt', header=TRUE, na.string="999", comment.char = "#")
> gpadata
   id gender college  gpa weight
1   1      f     yes  3.6    110
2   2      m     yes  3.5    170
3   3      m      no 99.0    165
4   4      m      no   NA    190
5   5      f      no   NA     95
6   6      m     yes  3.7    200
7   7      m     yes  3.6    150
8   8      f     yes  3.8    100
9   9      f     yes  3.0    130
10 10      f      no   NA    120
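The same call works for a file stored on the computer, and na.strings accepts several missing data codes at once. The sketch below is only an illustration under stated assumptions: the rows are a made-up extract in the layout of gpa.txt, the names gpa_lines, local_path, and gpa_local are hypothetical, and a temporary file stands in for a real local data file.

```r
# Made-up extract in the same layout as gpa.txt (not the real file);
# missing values are coded as 99 and 999.
gpa_lines <- c(
  "## GPA data (made-up extract)",
  " id gender college gpa weight",
  "  1      f     yes 3.6    110",
  "  2      m     yes 3.5    170",
  "  3      m      no  99    165",
  "  4      m      no 999    190"
)

# Write the lines to a temporary file that stands in for a local data file
local_path <- tempfile(fileext = ".txt")
writeLines(gpa_lines, local_path)

# file= takes a path on the computer in the same way it takes a URL;
# na.strings (the full argument name) lists every code to treat as missing
gpa_local <- read.table(file = local_path, header = TRUE,
                        na.strings = c("99", "999"), comment.char = "#")
print(gpa_local)
```

With both codes listed, the gpa values for ids 3 and 4 are read as NA, whereas listing only "999" would leave the 99 in place as an ordinary number.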

Access data

Data that are read into R are generally saved as a data frame. Some useful operations for a data frame are listed below.

  • Type the name of the data to show all the data
  • head and tail: show the first and last few rows of data
  • names: list the variable names in the data set
  • dim: show the number of rows (sample size) and columns (number of variables) of the data
  • attach: add the data set to R's search path so that its variables can be used directly by name
  • detach: remove the data set from the search path
  • dataset$varname: extract a variable from the data set
  • dataset[i,j]: extract the values in row i and column j

> #gpadata <- read.table('https://advstats.psychstat.org/data/gpa.txt', header=TRUE, na.string="999")
> head(gpadata)
  id gender college  gpa weight
1  1      f     yes  3.6    110
2  2      m     yes  3.5    170
3  3      m      no 99.0    165
4  4      m      no   NA    190
5  5      f      no   NA     95
6  6      m     yes  3.7    200
> tail(gpadata)
   id gender college gpa weight
5   5      f      no  NA     95
6   6      m     yes 3.7    200
7   7      m     yes 3.6    150
8   8      f     yes 3.8    100
9   9      f     yes 3.0    130
10 10      f      no  NA    120
> names(gpadata)
[1] "id"      "gender"  "college" "gpa"     "weight"
> dim(gpadata)
[1] 10  5
> gpadata$weight
 [1] 110 170 165 190  95 200 150 100 130 120
> gpadata[, 2]
 [1] f m m m f m m f f f
Levels: f m
> gpadata[, 'gender']
 [1] f m m m f m m f f f
Levels: f m
> attach(gpadata)
The following object is masked _by_ .GlobalEnv:

    gender

> gender  ## prints a separate gender object already in the workspace, which masks gpadata$gender
[1] "F" "F" "M" "F" "M"
> detach(gpadata)
> gender  ## would produce an error if no other object named gender existed in the workspace
[1] "F" "F" "M" "F" "M"
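The dataset[i,j] form can also select rows, or rows and columns together. The following is a minimal sketch, assuming a small hypothetical data frame (gpa_small) built by hand with a few of the columns of gpadata so that it runs without downloading anything; the object names are illustrative only.

```r
# Hypothetical mini data frame with some of the columns of gpadata
gpa_small <- data.frame(
  id     = 1:4,
  gender = c("f", "m", "m", "m"),
  gpa    = c(3.6, 3.5, 99.0, NA),
  weight = c(110, 170, 165, 190)
)

# A single value: row 2, column 4 (weight)
one_value <- gpa_small[2, 4]

# Rows 1 and 2, all columns (leave j empty)
first_rows <- gpa_small[1:2, ]

# Rows selected by a logical condition, columns selected by name
males <- gpa_small[gpa_small$gender == "m", c("id", "weight")]

print(one_value)
print(first_rows)
print(males)
```

Leaving i or j empty keeps all rows or all columns, which is why gpadata[, 2] in the output above returns the whole second variable.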

 

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.