Reading Data from Files

In practical data analysis, data are often stored in a data file. Python can read data from many types of files, such as free-format text files, comma-separated value (CSV) files, Excel files, SPSS files, SAS files, and Stata files.
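Each of these formats has its own pandas reader function (read_csv, read_excel, read_spss, read_sas, read_stata), and all of them return a data frame. A minimal sketch using read_csv on an in-memory string (the values here are made up for illustration; a file path or URL works the same way):

```python
import pandas as pd
from io import StringIO

# read_csv is the workhorse reader; read_excel, read_spss, read_sas,
# and read_stata are used analogously for their respective formats.
csv_text = "id,gpa\n1,3.6\n2,3.5\n"
df = pd.read_csv(StringIO(csv_text))  # StringIO stands in for a file
print(df.shape)  # (2, 2)
```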

Read Data from a Free Format Text File

One common way to get data into Python is to save the data in free format in a text file and then read the file into Python. For example, let's read the data in a file called gpa.txt, which is available on the website. The content of the data file is shown below.

## GPA data
## 999 represents missing data
 id gender college   gpa weight
  1      f     yes   3.6    110
  2      m     yes   3.5    170
  3      m      no  99.0    165
  4      m      no 999.0    190
  5      f      no 999.0     95
  6      m     yes   3.7    200
  7      m     yes   3.6    150
  8      f     yes   3.8    100
  9      f     yes   3.0    130
 10      f      no 999.0    120

Note that the first two lines of the data file start with "#" and are clearly notes or comments about the data. The third line contains the variable names. After that, there are 10 lines of data.

To read the data into Python, we will use the pandas library and its read_csv function. The following arguments are commonly used:
  • filepath: the path to the data file (file path, URL, or file-like object)
  • sep: the separator of the data, a comma by default; it can be changed to any other delimiter, such as a tab (\t) or a semicolon (;). For data separated by whitespace, use r'\s+'
  • header: the row number of the header, 0 by default; use header=None if the file has no variable names
  • skiprows: the number of rows to skip before reading the data
  • na_values: the values to be treated as missing data
  • comment: the character that starts a comment
  • skipfooter: The number of lines at the bottom of the file to skip. This is helpful when the file has footer rows with metadata or summary information that you don’t want to include.
  • thousands: The character to recognize as a thousands separator. This can be useful for numeric columns with commas as thousands separators.
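Several of these arguments can be seen in action on a small in-memory file (the file content below is made up for illustration, not the gpa data):

```python
import pandas as pd
from io import StringIO

# A tiny whitespace-separated file with a comment line, a thousands
# separator, a missing-value code, and a trailing summary line.
text = (
    "# a comment line\n"
    "id salary\n"
    "1 52,000\n"
    "2 61,500\n"
    "3 999\n"
    "total 3 rows\n"
)
df = pd.read_csv(
    StringIO(text),
    sep=r"\s+",         # whitespace-separated values
    comment="#",        # drop lines starting with "#"
    skipfooter=1,       # drop the trailing summary line
    thousands=",",      # parse 52,000 as 52000
    na_values=["999"],  # treat 999 as missing
    engine="python",    # skipfooter requires the python engine
)
print(df)
```

Note that skipfooter is only supported by the slower python parsing engine, which is why engine="python" is set explicitly.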
>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120

Access data

Data read in this way are stored as a data frame. Some useful operations on a data frame are listed below.

>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> df.head(n=5)  ## first 5 rows
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
>>> df.tail(n=5)  ## last 5 rows
   id gender college  gpa  weight
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> df.shape  ## numbers of rows and columns
(10, 5)
>>> df.info()
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       10 non-null     int64
 1   gender   10 non-null     object
 2   college  10 non-null     object
 3   gpa      6 non-null      float64
 4   weight   10 non-null     int64
dtypes: float64(1), int64(2), object(2)
memory usage: 532.0+ bytes
>>> df.describe()  ## summary statistics for numerical columns
             id       gpa      weight
count  10.00000  6.000000   10.000000
mean    5.50000  3.533333  143.000000
std     3.02765  0.280476   37.505555
min     1.00000  3.000000   95.000000
25%     3.25000  3.525000  112.500000
50%     5.50000  3.600000  140.000000
75%     7.75000  3.675000  168.750000
max    10.00000  3.800000  200.000000
>>>
>>> df.isnull()  # returns a DataFrame of True/False for missing values
      id  gender  college    gpa  weight
0  False   False    False  False   False
1  False   False    False  False   False
2  False   False    False   True   False
3  False   False    False   True   False
4  False   False    False   True   False
5  False   False    False  False   False
6  False   False    False  False   False
7  False   False    False  False   False
8  False   False    False  False   False
9  False   False    False   True   False
>>> df.isnull().sum()  # count of missing values per column
id         0
gender     0
college    0
gpa        4
weight     0
dtype: int64
>>>
>>> ## select a column
>>> df['gpa']
0    3.6
1    3.5
2    NaN
3    NaN
4    NaN
5    3.7
6    3.6
7    3.8
8    3.0
9    NaN
Name: gpa, dtype: float64
>>> df.gpa
0    3.6
1    3.5
2    NaN
3    NaN
4    NaN
5    3.7
6    3.6
7    3.8
8    3.0
9    NaN
Name: gpa, dtype: float64
>>>
>>> ## select multiple columns
>>> df[['gpa', 'id']]
   gpa  id
0  3.6   1
1  3.5   2
2  NaN   3
3  NaN   4
4  NaN   5
5  3.7   6
6  3.6   7
7  3.8   8
8  3.0   9
9  NaN  10
>>>
>>> ## select rows
>>> df.iloc[0]  ## first row
id           1
gender       f
college    yes
gpa        3.6
weight     110
Name: 0, dtype: object
>>> df.iloc[0:3]  ## first three rows
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
>>>
>>> ## select both rows and columns
>>> df.loc[0:3, ['gpa', 'college']]
   gpa college
0  3.6     yes
1  3.5     yes
2  NaN      no
3  NaN      no

Modify the data

After reading the data, you may want to modify the data. For example, you may want to change the data type of a variable, add a new variable, or remove a variable.
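The session below covers adding, renaming, and dropping columns. Changing the data type of a variable, also mentioned above, can be done with astype; a minimal sketch with made-up values:

```python
import pandas as pd

# A tiny frame standing in for the gpa data (values made up here).
df = pd.DataFrame({"id": [1, 2, 3], "weight": [110, 170, 165]})

# astype changes a column's data type; here weight becomes a float column.
df["weight"] = df["weight"].astype("float64")
print(df["weight"].dtype)  # float64
```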

>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> ## add a new column
>>> df['sex'] = df['gender'].map({'m': 0, 'f': 1})
>>> df
   id gender college  gpa  weight  sex
0   1      f     yes  3.6     110    1
1   2      m     yes  3.5     170    0
2   3      m      no  NaN     165    0
3   4      m      no  NaN     190    0
4   5      f      no  NaN      95    1
5   6      m     yes  3.7     200    0
6   7      m     yes  3.6     150    0
7   8      f     yes  3.8     100    1
8   9      f     yes  3.0     130    1
9  10      f      no  NaN     120    1
>>>
>>> ## rename a column
>>> df.rename(columns={'sex': 'sexnew'}, inplace=True)
>>> df
   id gender college  gpa  weight  sexnew
0   1      f     yes  3.6     110       1
1   2      m     yes  3.5     170       0
2   3      m      no  NaN     165       0
3   4      m      no  NaN     190       0
4   5      f      no  NaN      95       1
5   6      m     yes  3.7     200       0
6   7      m     yes  3.6     150       0
7   8      f     yes  3.8     100       1
8   9      f     yes  3.0     130       1
9  10      f      no  NaN     120       1
>>>
>>> ## delete a column
>>> df.drop('sexnew', axis=1, inplace=True)
>>> df
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> ## fill missing values with the column mean
>>> df['gpa'] = df['gpa'].fillna(df['gpa'].mean())
>>> df
   id gender college       gpa  weight
0   1      f     yes  3.600000     110
1   2      m     yes  3.500000     170
2   3      m      no  3.533333     165
3   4      m      no  3.533333     190
4   5      f      no  3.533333      95
5   6      m     yes  3.700000     200
6   7      m     yes  3.600000     150
7   8      f     yes  3.800000     100
8   9      f     yes  3.000000     130
9  10      f      no  3.533333     120
>>>
>>> ## sort data
>>> df.sort_values(by=['gpa', 'weight'], ascending=[True, False])
   id gender college       gpa  weight
8   9      f     yes  3.000000     130
1   2      m     yes  3.500000     170
3   4      m      no  3.533333     190
2   3      m      no  3.533333     165
9  10      f      no  3.533333     120
4   5      f      no  3.533333      95
6   7      m     yes  3.600000     150
0   1      f     yes  3.600000     110
5   6      m     yes  3.700000     200
7   8      f     yes  3.800000     100

Handle missing data

Missing data are common in data analysis. You may want to remove the rows with missing data, replace the missing data with a specific value, or use a statistical method to impute the missing data.
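Besides filling with the mean, another simple imputation method is to fill with the median, which is more robust to outliers. A minimal sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Made-up gpa values; NaN marks the missing entries.
gpa = pd.Series([3.6, np.nan, 3.0, 3.8, np.nan])

# Median imputation: replace each NaN with the median of the
# observed values (here, the median of 3.0, 3.6, 3.8 is 3.6).
filled = gpa.fillna(gpa.median())
print(filled.tolist())  # [3.6, 3.6, 3.0, 3.8, 3.6]
```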

>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> ## find missing values
>>> df.isnull()
      id  gender  college    gpa  weight
0  False   False    False  False   False
1  False   False    False  False   False
2  False   False    False   True   False
3  False   False    False   True   False
4  False   False    False   True   False
5  False   False    False  False   False
6  False   False    False  False   False
7  False   False    False  False   False
8  False   False    False  False   False
9  False   False    False   True   False
>>> ## count the missing values
>>> df.isnull().sum()
id         0
gender     0
college    0
gpa        4
weight     0
dtype: int64
>>>
>>> ## fill missing values with the column mean
>>> df['gpa'].fillna(df['gpa'].mean())
0    3.600000
1    3.500000
2    3.533333
3    3.533333
4    3.533333
5    3.700000
6    3.600000
7    3.800000
8    3.000000
9    3.533333
Name: gpa, dtype: float64
>>>
>>> ## remove rows with missing values in the gpa column
>>> df.dropna(subset=['gpa'])
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
>>> ## remove any rows with missing data
>>> df.dropna(axis=0)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
>>> ## remove any columns with missing data
>>> df.dropna(axis=1)
   id gender college  weight
0   1      f     yes     110
1   2      m     yes     170
2   3      m      no     165
3   4      m      no     190
4   5      f      no      95
5   6      m     yes     200
6   7      m     yes     150
7   8      f     yes     100
8   9      f     yes     130
9  10      f      no     120

Save data

After modifying the data, you may want to save the data to a file. The to_csv function in the pandas library can be used. The following arguments are commonly used:

  • filepath: the path to the data file (file path, URL, or file-like object)
  • sep: the delimiter between fields in the output, a comma by default; it must be a single character, such as a tab (\t) or a semicolon (;)
  • header: whether to write the column names
  • index: whether to write row names
  • na_rep: the string representation of missing data
  • float_format: the format of the floating-point numbers
  • date_format: the format of the date
  • decimal: the character recognized as a decimal point
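The effect of na_rep and float_format can be seen on a small made-up frame; when no file path is given, to_csv returns the CSV text, which makes the formatting easy to inspect:

```python
import pandas as pd
import numpy as np

# Made-up values for illustration.
df = pd.DataFrame({"id": [1, 2], "gpa": [3.533333, np.nan]})

# na_rep writes 999 for missing values; float_format rounds
# floating-point numbers to two decimal places.
out = df.to_csv(index=False, na_rep="999", float_format="%.2f")
print(out)
```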
>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> df.to_csv('output.csv', index=False)

 

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.