Reading Data from Files
In practical data analysis, data are often stored in a data file. R can read many types of data files, including free format text files, comma-separated value (CSV) files, Excel files, SPSS files, SAS files, and Stata files.
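As a brief illustration of one of these formats, the sketch below reads a comma-separated value file with read.csv() from base R. The temporary file and its three rows are an assumption made only so that the example runs on its own; they are a small excerpt in the style of the GPA data used later in this section. Other formats typically need add-on packages (for example, foreign or haven for SPSS, SAS, and Stata files, and readxl for Excel files), which are not shown here.

```r
# Write a small comma-separated file to a temporary location (illustration only)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,gender,gpa",
             "1,f,3.6",
             "2,m,3.5",
             "4,m,999"), csv_file)

# read.csv() assumes a header row and comma separators by default;
# na.strings marks the code 999 as missing
csv_data <- read.csv(csv_file, na.strings = "999")
print(csv_data)
```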
Read Data from a Free Format Text File
A common way to get data into R is to save the data in free format in a text file and then use the read.table() function to read it. For example, let's read the data in a file called gpa.txt, which is available on the website. The content of the data file is shown below.
## GPA data
## 999 represents missing data
id gender college gpa weight
1 f yes 3.6 110
2 m yes 3.5 170
3 m no 99.0 165
4 m no 999.0 190
5 f no 999.0 95
6 m yes 3.7 200
7 m yes 3.6 150
8 f yes 3.8 100
9 f yes 3.0 130
10 f no 999.0 120
Note that the first two lines of the data file start with "#"; they are notes or comments about the data. The third line contains the variable names. After that, there are 10 lines of data.
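Before choosing options for reading a file, it can help to look at its first few raw lines. The sketch below is one way to do that with readLines(). It assumes a temporary local copy holding only the first five lines of the file; the temporary file and the object names are for illustration only.

```r
# Write the first few lines of the GPA file to a temporary file (illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("## GPA data",
             "## 999 represents missing data",
             "id gender college gpa weight",
             "1 f yes 3.6 110",
             "2 m yes 3.5 170"), tmp)

# Look at the first three raw lines without parsing them
first_lines <- readLines(tmp, n = 3)
print(first_lines)

# Flag the lines that are comments starting with "#"
is_comment <- startsWith(first_lines, "#")
print(is_comment)
```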
The function read.table() can load data from a local computer or from a remote location on the Internet. Since the data file here is online, we first show how to read it directly from the web using the code below. Note that
- file provides the link to the data file on the Internet or the path to the file on a computer.
- header=TRUE tells R that there are variable names in the data file.
- na.string="999" tells R that missing data are coded as 999. Multiple missing data strings can be provided in a vector such as na.string=c("99","999"). The full name of this argument is na.strings; R accepts the shortened form through partial argument matching.
- comment.char = "#" tells R to skip lines starting with "#" in the file.
> gpadata <- read.table(file='https://advstats.psychstat.org/data/gpa.txt', header=TRUE, na.string="999", comment.char = "#")
>
> gpadata
   id gender college  gpa weight
1   1      f     yes  3.6    110
2   2      m     yes  3.5    170
3   3      m      no 99.0    165
4   4      m      no   NA    190
5   5      f      no   NA     95
6   6      m     yes  3.7    200
7   7      m     yes  3.6    150
8   8      f     yes  3.8    100
9   9      f     yes  3.0    130
10 10      f      no   NA    120
>
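The same function can read a file stored on the local computer, and more than one missing data code can be supplied at once. The following is a minimal sketch under stated assumptions: a temporary file stands in for a local copy of gpa.txt, only the first five data rows are included, and the codes are written as 99 and 999 (an assumption about how they appear in the raw file). The names local_file and gpa_local are hypothetical.

```r
# Save a partial copy of the data to a temporary local file (illustration only)
local_file <- tempfile(fileext = ".txt")
writeLines(c("## GPA data",
             "## 999 represents missing data",
             "id gender college gpa weight",
             "1 f yes 3.6 110",
             "2 m yes 3.5 170",
             "3 m no 99 165",
             "4 m no 999 190",
             "5 f no 999 95"), local_file)

# Read from the local path; both 99 and 999 are treated as missing
gpa_local <- read.table(file = local_file, header = TRUE,
                        na.strings = c("99", "999"), comment.char = "#")
print(gpa_local)

# Count the missing gpa values
n_missing <- sum(is.na(gpa_local$gpa))
print(n_missing)
```

Note that the missing data codes apply to every column, so a legitimate value of 99 in another variable would also be turned into NA.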
Access data
Data that are read into R are generally saved as a data frame. Some useful operations for a data frame are listed below.
- Type the name of the data to show all the data
- head and tail: show the first and last few rows of data
- names: list the variable names in the data set
- dim: show the number of rows (sample size) and columns (number of variables) of the data
- attach: make the variables in the data set accessible directly by name (the data frame is added to R's search path)
- detach: remove the data set from the search path so that its variables are no longer accessible by name alone
- dataset$varname: extract a variable from the data set
- dataset[i,j]: extract values by row index i and column index j
> #gpadata <- read.table('https://advstats.psychstat.org/data/gpa.txt', header=TRUE, na.string="999")
> head(gpadata)
  id gender college  gpa weight
1  1      f     yes  3.6    110
2  2      m     yes  3.5    170
3  3      m      no 99.0    165
4  4      m      no   NA    190
5  5      f      no   NA     95
6  6      m     yes  3.7    200
> tail(gpadata)
   id gender college gpa weight
5   5      f      no  NA     95
6   6      m     yes 3.7    200
7   7      m     yes 3.6    150
8   8      f     yes 3.8    100
9   9      f     yes 3.0    130
10 10      f      no  NA    120
> names(gpadata)
[1] "id"      "gender"  "college" "gpa"     "weight"
> dim(gpadata)
[1] 10  5
> gpadata$weight
 [1] 110 170 165 190  95 200 150 100 130 120
> gpadata[, 2]
 [1] f m m m f m m f f f
Levels: f m
> gpadata[, 'gender']
 [1] f m m m f m m f f f
Levels: f m
> attach(gpadata)
The following object is masked _by_ .GlobalEnv:

    gender

> gender ## prints the gender object in the global environment, which masks gpadata$gender
[1] "F" "F" "M" "F" "M"
> detach(gpadata)
> gender ## an error unless another object named gender exists (here one does in the global environment)
[1] "F" "F" "M" "F" "M"
>
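The indexing operations above can be combined with conditions, and with() offers a way to refer to variables by name without attach(), which avoids the masking seen in the output above. The sketch below is illustrative only: it types four rows of the GPA data inline so that it runs on its own, and the object names gpa_small, females, mean_gpa, and mean_weight are hypothetical.

```r
# A few rows of the GPA data, typed inline so the sketch runs on its own
gpa_small <- read.table(text = "
id gender college gpa weight
1 f yes 3.6 110
2 m yes 3.5 170
4 m no NA 190
8 f yes 3.8 100
", header = TRUE)

# Row 2, column 4: the gpa value in the second row
gpa_small[2, 4]

# All rows, two columns selected by name
gpa_small[, c("id", "gpa")]

# Rows that satisfy a condition: female students only
females <- gpa_small[gpa_small$gender == "f", ]
print(females)

# Mean gpa, ignoring missing values
mean_gpa <- mean(gpa_small$gpa, na.rm = TRUE)

# with() evaluates an expression inside the data frame without attach()
mean_weight <- with(gpa_small, mean(weight))
```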
To cite the book, use:
Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.