Reading Data from Files

In practical data analysis, data are often stored in a data file. Python can read data from many types of files, such as free-format text files, comma-separated value (CSV) files, Excel files, SPSS files, SAS files, and Stata files.
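Each of these formats has its own pandas reader function (read_csv, read_excel, read_spss, read_sas, read_stata), and all of them return a data frame. A minimal sketch using read_csv on an in-memory string (the values here are made up for illustration; a file path or URL works the same way):

```python
import pandas as pd
from io import StringIO

# read_csv is the workhorse reader; read_excel, read_spss, read_sas,
# and read_stata are used analogously for their respective formats.
csv_text = "id,gpa\n1,3.6\n2,3.5\n"
df = pd.read_csv(StringIO(csv_text))  # StringIO stands in for a file
print(df.shape)  # (2, 2)
```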

Read Data from a Free Format Text File

One common way to get data into Python is to save the data in free format in a text file and then read the file into Python. For example, let's read the data in a file called gpa.txt, which is available on the website. The content of the data file is shown below.

## GPA data
## 999 represents missing data
 id gender college   gpa weight
  1      f     yes   3.6    110
  2      m     yes   3.5    170
  3      m      no  99.0    165
  4      m      no 999.0    190
  5      f      no 999.0     95
  6      m     yes   3.7    200
  7      m     yes   3.6    150
  8      f     yes   3.8    100
  9      f     yes   3.0    130
 10      f      no 999.0    120

Note that the first two lines of the data file start with "#" and are clearly notes or comments about the data. The third line contains the variable names. After that, there are 10 lines of data.

To read the data into Python, we will use the pandas library and its read_csv function. The following arguments are commonly used:
  • filepath: the path to the data file (file path, URL, or file-like object)
  • sep: the separator of the data, a comma by default; it can be changed to any other delimiter, such as a tab (\t) or a semicolon (;). For data separated by whitespace, use r'\s+'
  • header: the row number of the header, 0 by default; use header=None if the file has no variable names
  • skiprows: the number of rows to skip before reading the data
  • na_values: the values to be treated as missing data
  • comment: the character that starts a comment
  • skipfooter: The number of lines at the bottom of the file to skip. This is helpful when the file has footer rows with metadata or summary information that you don’t want to include.
  • thousands: The character to recognize as a thousands separator. This can be useful for numeric columns with commas as thousands separators.
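Several of these arguments can be seen in action on a small in-memory file (the file content below is made up for illustration, not the gpa data):

```python
import pandas as pd
from io import StringIO

# A tiny whitespace-separated file with a comment line, a thousands
# separator, a missing-value code, and a trailing summary line.
text = (
    "# a comment line\n"
    "id salary\n"
    "1 52,000\n"
    "2 61,500\n"
    "3 999\n"
    "total 3 rows\n"
)
df = pd.read_csv(
    StringIO(text),
    sep=r"\s+",         # whitespace-separated values
    comment="#",        # drop lines starting with "#"
    skipfooter=1,       # drop the trailing summary line
    thousands=",",      # parse 52,000 as 52000
    na_values=["999"],  # treat 999 as missing
    engine="python",    # skipfooter requires the python engine
)
print(df)
```

Note that skipfooter is only supported by the slower python parsing engine, which is why engine="python" is set explicitly.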
>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120

Access data

Data read in this way are stored as a data frame. Some useful operations on a data frame are listed below.

>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> df.head(n=5)  ## first 5 rows
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
>>> df.tail(n=5)  ## last 5 rows
   id gender college  gpa  weight
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> df.shape  ## numbers of rows and columns
(10, 5)
>>> df.info()
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       10 non-null     int64
 1   gender   10 non-null     object
 2   college  10 non-null     object
 3   gpa      6 non-null      float64
 4   weight   10 non-null     int64
dtypes: float64(1), int64(2), object(2)
memory usage: 532.0+ bytes
>>> df.describe()  ## summary statistics for numerical columns
             id       gpa      weight
count  10.00000  6.000000   10.000000
mean    5.50000  3.533333  143.000000
std     3.02765  0.280476   37.505555
min     1.00000  3.000000   95.000000
25%     3.25000  3.525000  112.500000
50%     5.50000  3.600000  140.000000
75%     7.75000  3.675000  168.750000
max    10.00000  3.800000  200.000000
>>>
>>> df.isnull()  # returns a DataFrame of True/False for missing values
      id  gender  college    gpa  weight
0  False   False    False  False   False
1  False   False    False  False   False
2  False   False    False   True   False
3  False   False    False   True   False
4  False   False    False   True   False
5  False   False    False  False   False
6  False   False    False  False   False
7  False   False    False  False   False
8  False   False    False  False   False
9  False   False    False   True   False
>>> df.isnull().sum()  # count of missing values per column
id         0
gender     0
college    0
gpa        4
weight     0
dtype: int64
>>>
>>> ## select a column
>>> df['gpa']
0    3.6
1    3.5
2    NaN
3    NaN
4    NaN
5    3.7
6    3.6
7    3.8
8    3.0
9    NaN
Name: gpa, dtype: float64
>>> df.gpa
0    3.6
1    3.5
2    NaN
3    NaN
4    NaN
5    3.7
6    3.6
7    3.8
8    3.0
9    NaN
Name: gpa, dtype: float64
>>>
>>> ## select multiple columns
>>> df[['gpa', 'id']]
   gpa  id
0  3.6   1
1  3.5   2
2  NaN   3
3  NaN   4
4  NaN   5
5  3.7   6
6  3.6   7
7  3.8   8
8  3.0   9
9  NaN  10
>>>
>>> ## select rows
>>> df.iloc[0]  ## first row
id           1
gender       f
college    yes
gpa        3.6
weight     110
Name: 0, dtype: object
>>> df.iloc[0:3]  ## first three rows
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
>>>
>>> ## select both rows and columns
>>> df.loc[0:3, ['gpa', 'college']]
   gpa college
0  3.6     yes
1  3.5     yes
2  NaN      no
3  NaN      no

Modify the data

After reading the data, you may want to modify the data. For example, you may want to change the data type of a variable, add a new variable, or remove a variable.
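The session below covers adding, renaming, and dropping columns. Changing the data type of a variable, also mentioned above, can be done with astype; a minimal sketch with made-up values:

```python
import pandas as pd

# A tiny frame standing in for the gpa data (values made up here).
df = pd.DataFrame({"id": [1, 2, 3], "weight": [110, 170, 165]})

# astype changes a column's data type; here weight becomes a float column.
df["weight"] = df["weight"].astype("float64")
print(df["weight"].dtype)  # float64
```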

>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> ## add a new column
>>> df['sex'] = df['gender'].map({'m': 0, 'f': 1})
>>> df
   id gender college  gpa  weight  sex
0   1      f     yes  3.6     110    1
1   2      m     yes  3.5     170    0
2   3      m      no  NaN     165    0
3   4      m      no  NaN     190    0
4   5      f      no  NaN      95    1
5   6      m     yes  3.7     200    0
6   7      m     yes  3.6     150    0
7   8      f     yes  3.8     100    1
8   9      f     yes  3.0     130    1
9  10      f      no  NaN     120    1
>>>
>>> ## rename a column
>>> df.rename(columns={'sex': 'sexnew'}, inplace=True)
>>> df
   id gender college  gpa  weight  sexnew
0   1      f     yes  3.6     110       1
1   2      m     yes  3.5     170       0
2   3      m      no  NaN     165       0
3   4      m      no  NaN     190       0
4   5      f      no  NaN      95       1
5   6      m     yes  3.7     200       0
6   7      m     yes  3.6     150       0
7   8      f     yes  3.8     100       1
8   9      f     yes  3.0     130       1
9  10      f      no  NaN     120       1
>>>
>>> ## delete a column
>>> df.drop('sexnew', axis=1, inplace=True)
>>> df
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> ## fill missing values with the column mean
>>> df['gpa'] = df['gpa'].fillna(df['gpa'].mean())
>>> df
   id gender college       gpa  weight
0   1      f     yes  3.600000     110
1   2      m     yes  3.500000     170
2   3      m      no  3.533333     165
3   4      m      no  3.533333     190
4   5      f      no  3.533333      95
5   6      m     yes  3.700000     200
6   7      m     yes  3.600000     150
7   8      f     yes  3.800000     100
8   9      f     yes  3.000000     130
9  10      f      no  3.533333     120
>>>
>>> ## sort data
>>> df.sort_values(by=['gpa', 'weight'], ascending=[True, False])
   id gender college       gpa  weight
8   9      f     yes  3.000000     130
1   2      m     yes  3.500000     170
3   4      m      no  3.533333     190
2   3      m      no  3.533333     165
9  10      f      no  3.533333     120
4   5      f      no  3.533333      95
6   7      m     yes  3.600000     150
0   1      f     yes  3.600000     110
5   6      m     yes  3.700000     200
7   8      f     yes  3.800000     100

Handle missing data

Missing data are common in data analysis. You may want to remove the rows with missing data, replace the missing data with a specific value, or use a statistical method to impute the missing data.
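Besides filling with the mean, another simple imputation method is to fill with the median, which is more robust to outliers. A minimal sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Made-up gpa values; NaN marks the missing entries.
gpa = pd.Series([3.6, np.nan, 3.0, 3.8, np.nan])

# Median imputation: replace each NaN with the median of the
# observed values (here, the median of 3.0, 3.6, 3.8 is 3.6).
filled = gpa.fillna(gpa.median())
print(filled.tolist())  # [3.6, 3.6, 3.0, 3.8, 3.6]
```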

>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> ## find missing values
>>> df.isnull()
      id  gender  college    gpa  weight
0  False   False    False  False   False
1  False   False    False  False   False
2  False   False    False   True   False
3  False   False    False   True   False
4  False   False    False   True   False
5  False   False    False  False   False
6  False   False    False  False   False
7  False   False    False  False   False
8  False   False    False  False   False
9  False   False    False   True   False
>>> ## count the missing values
>>> df.isnull().sum()
id         0
gender     0
college    0
gpa        4
weight     0
dtype: int64
>>>
>>> ## fill missing values with the column mean
>>> df['gpa'].fillna(df['gpa'].mean())
0    3.600000
1    3.500000
2    3.533333
3    3.533333
4    3.533333
5    3.700000
6    3.600000
7    3.800000
8    3.000000
9    3.533333
Name: gpa, dtype: float64
>>>
>>> ## remove rows with missing values in the gpa column
>>> df.dropna(subset=['gpa'])
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
>>> ## remove any rows with missing data
>>> df.dropna(axis=0)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
>>> ## remove any columns with missing data
>>> df.dropna(axis=1)
   id gender college  weight
0   1      f     yes     110
1   2      m     yes     170
2   3      m      no     165
3   4      m      no     190
4   5      f      no      95
5   6      m     yes     200
6   7      m     yes     150
7   8      f     yes     100
8   9      f     yes     130
9  10      f      no     120

Save data

After modifying the data, you may want to save the data to a file. The to_csv function in the pandas library can be used. The following arguments are commonly used:

  • filepath: the path to the data file (file path, URL, or file-like object)
  • sep: the delimiter between fields in the output, a comma by default; it must be a single character, such as a tab (\t) or a semicolon (;)
  • header: whether to write the column names
  • index: whether to write row names
  • na_rep: the string representation of missing data
  • float_format: the format of the floating-point numbers
  • date_format: the format of the date
  • decimal: the character recognized as a decimal point
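The effect of na_rep and float_format can be seen on a small made-up frame; when no file path is given, to_csv returns the CSV text, which makes the formatting easy to inspect:

```python
import pandas as pd
import numpy as np

# Made-up values for illustration.
df = pd.DataFrame({"id": [1, 2], "gpa": [3.533333, np.nan]})

# na_rep writes 999 for missing values; float_format rounds
# floating-point numbers to two decimal places.
out = df.to_csv(index=False, na_rep="999", float_format="%.2f")
print(out)
```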
>>> import pandas as pd
>>> file_path_name = "https://advstats.psychstat.org/data/gpa.txt"
>>>
>>> df = pd.read_csv(file_path_name,
...                  na_values=["99", "999"],  # treat 99 and 999 as missing
...                  comment="#",              # skip any line starting with "#"
...                  sep=r'\s+')               # whitespace-separated values
>>> print(df)
   id gender college  gpa  weight
0   1      f     yes  3.6     110
1   2      m     yes  3.5     170
2   3      m      no  NaN     165
3   4      m      no  NaN     190
4   5      f      no  NaN      95
5   6      m     yes  3.7     200
6   7      m     yes  3.6     150
7   8      f     yes  3.8     100
8   9      f     yes  3.0     130
9  10      f      no  NaN     120
>>>
>>> df.to_csv('output.csv', index=False)

 

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.