Scatter plot

A scatter plot (also called a scatter graph, scatter chart, scattergram, or scatter diagram) displays the relationship between two variables. The data are shown as a collection of points: the value of one variable determines each point's position on the horizontal axis (X-axis), and the value of the other variable determines its position on the vertical axis (Y-axis). Typically, the response (outcome, dependent) variable goes on the Y-axis, and the predictor (explanatory, independent) variable, the one we suspect may be related to the response, goes on the X-axis.

A scatter plot reveals the form, direction, and strength of the relationship or association between two variables. It can help answer questions such as:

  • Are variables X and Y related?
  • Are variables X and Y linearly related?
  • Are variables X and Y non-linearly related?
  • Are changes in Y related to changes in X?
  • Are there any outliers?

Some examples of scatter plots are given below.

Examples of scatter plots
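
Patterns of this kind can be imitated with simulated data. The following is a minimal sketch, assuming made-up data generated with numpy; the panel titles, sample size, and noise levels are illustrative choices and do not come from any real study.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data only; none of these values come from a real study.
rng = np.random.default_rng(seed=1)
n = 200
x = rng.uniform(0, 10, size=n)

# One simulated Y variable per pattern, all sharing the same X values
patterns = {
    "Positive linear": 2.0 * x + rng.normal(0, 2, size=n),
    "Negative linear": -2.0 * x + rng.normal(0, 2, size=n),
    "Nonlinear (quadratic)": (x - 5) ** 2 + rng.normal(0, 2, size=n),
    "No clear relation": rng.normal(0, 2, size=n),
}

# Draw each pattern in its own panel of a 2 x 2 grid
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (title, y) in zip(axes.flat, patterns.items()):
    ax.scatter(x, y, s=10, alpha=0.7)
    ax.set_title(title)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
fig.tight_layout()
```

Calling plt.show() afterwards displays the four panels. Changing the noise standard deviation is one way to see how the apparent strength of a relationship changes.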

Examples

To generate a scatter plot, the function scatter() from the matplotlib library or the function scatterplot() from the seaborn library can be used.
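
The two functions differ mainly in how the data are passed. The sketch below puts them side by side, assuming a small made-up data frame whose column names x and y are placeholders: scatter() takes the values directly, whereas scatterplot() takes column names together with the data frame.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data for illustration; the column names x and y are placeholders.
rng = np.random.default_rng(seed=2)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=50)})
df["y"] = 1.5 * df["x"] + rng.normal(0, 1, size=50)

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(9, 4))

# matplotlib: pass the x and y values directly
ax_mpl.scatter(df["x"], df["y"], color="blue", alpha=0.7)
ax_mpl.set_title("matplotlib scatter()")

# seaborn: name the columns and pass the data frame
sns.scatterplot(x="x", y="y", color="blue", data=df, ax=ax_sns)
ax_sns.set_title("seaborn scatterplot()")

fig.tight_layout()
```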

In the following, we plot the relationship between the age (in years) variable and the hvltt (verbal ability) variable from the ACTIVE study. The relationship between the two variables is not clear-cut, although it tends to be negative.


>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")

>>> import seaborn as sns
>>> import matplotlib.pyplot as plt

>>> # Create the scatter plot with seaborn
>>> sns.scatterplot(x='age', y='hvltt', color='blue', data=active)

>>> # Alternatively, use matplotlib
>>> # plt.scatter(active['age'], active['hvltt'], color='blue', alpha=0.7)

>>> # Add labels and title
>>> plt.title('Scatterplot using Seaborn')
Text(0.5, 1.0, 'Scatterplot using Seaborn')
>>> plt.xlabel('Age')
Text(0.5, 0, 'Age')
>>> plt.ylabel('Verbal test score')
Text(0, 0.5, 'Verbal test score')

>>> # Save the figure, then show it
>>> plt.savefig('scatter.svg', format='svg')
>>> plt.show()
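
A visual impression such as "tending to be negative" can be backed up with a correlation coefficient. The sketch below is only an illustration: it uses simulated values as a stand-in for the ACTIVE data (same column names, made-up numbers), so the printed result says nothing about the real study. To apply it to the real data, replace sim with the active data frame loaded above.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Simulated stand-in for the ACTIVE data: same column names, made-up values.
rng = np.random.default_rng(seed=3)
age = rng.uniform(65, 95, size=500)
hvltt = 40 - 0.2 * age + rng.normal(0, 5, size=500)
sim = pd.DataFrame({"age": age, "hvltt": hvltt})

# The sign of r gives the direction of the linear association,
# and its magnitude gives the strength.
r = sim["age"].corr(sim["hvltt"])  # pandas uses Pearson by default
r_scipy, p_value = pearsonr(sim["age"], sim["hvltt"])

print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
```

Pearson's r only measures linear association, so a value near zero does not rule out a nonlinear relationship; the scatter plot remains the first check.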

Add a regression line and a smoothed curve

Oftentimes, we are interested in whether two variables are linearly or nonlinearly related. We can better visualize the relationship by adding a straight regression line (linear) or a smoothed curve to the scatter plot. In Python, the regplot() function from the seaborn library draws a fitted regression line by default, and draws a LOWESS (locally weighted) smoothed curve instead when lowess=True is set. The LOWESS option requires the statsmodels package to be installed.

In the example below, we add both a regression line and a smoothed curve to the scatter plot of hvltt against age. Note that their relationship appears to be nonlinear. Some comments about the code used:

  • regplot() function fits a linear regression model and draws the fitted line; scatter=False keeps it from drawing the points a second time.
  • ax=ax option adds the line to an existing figure instead of creating a new one.
  • lowess=True option makes regplot() draw a smoothed curve instead of a straight line.
  • lw option sets the width of lines; the linestyle (or ls) option, not used here, sets the type of lines, such as solid or dashed.
  • legend() function adds a legend to the existing figure.

>>> import pandas as pd
>>> active = pd.read_csv("https://advstats.psychstat.org/data/active.csv")

>>> import seaborn as sns
>>> import matplotlib.pyplot as plt

>>> fig, ax = plt.subplots()

>>> # Scatterplot
>>> sns.scatterplot(x='age', y='hvltt', color='blue', label='Data Points', data=active, ax=ax)

>>> # Add a linear regression line
>>> sns.regplot(x='age', y='hvltt', scatter=False, color='red',
...             label='Regression Line', data=active, ax=ax)

>>> # Add smoothed curve (lowess=True requires statsmodels)
>>> sns.regplot(x='age', y='hvltt', scatter=False, color='blue', lowess=True,
...             label='Smoothed Curve', data=active, ax=ax)

>>> # Add labels and title
>>> plt.title('Scatterplot with Smoothed Curve and Regression Line')
Text(0.5, 1.0, 'Scatterplot with Smoothed Curve and Regression Line')
>>> plt.xlabel('Age')
Text(0.5, 0, 'Age')
>>> plt.ylabel('Verbal score')
Text(0, 0.5, 'Verbal score')

>>> # Show the legend, using proxy lines for the two fitted curves
>>> plt.legend([plt.Line2D([0], [0], color='blue', lw=2), plt.Line2D([0], [0], color='r', lw=2)],
...            ['Smoothed Curve', 'Regression Line'], loc='best')

>>> plt.savefig('scatter.svg', format='svg')
>>> plt.show()
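
seaborn draws the two curves without returning the fitted values. When the intercept, slope, or smoothed values are needed, they can be computed explicitly. The sketch below is one possible way to do so, assuming scipy and statsmodels are installed; it uses simulated values in place of age and hvltt, and the smoothing fraction frac=2/3 is an illustrative choice, not a setting taken from the example above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
from statsmodels.nonparametric.smoothers_lowess import lowess

# Simulated stand-in for age and hvltt: a curved, decreasing trend plus noise.
rng = np.random.default_rng(seed=4)
age = rng.uniform(65, 95, size=300)
hvltt = 30 - 0.01 * (age - 65) ** 2 + rng.normal(0, 3, size=300)

# Straight line: least-squares intercept and slope
fit = linregress(age, hvltt)
grid = np.linspace(age.min(), age.max(), 100)
line = fit.intercept + fit.slope * grid

# Smoothed curve: lowess() returns (x, fitted y) pairs sorted by x
smooth = lowess(hvltt, age, frac=2 / 3)

fig, ax = plt.subplots()
ax.scatter(age, hvltt, color="gray", alpha=0.5, label="Data points")
ax.plot(grid, line, color="red", lw=2, label="Regression line")
ax.plot(smooth[:, 0], smooth[:, 1], color="blue", lw=2, label="Smoothed curve")
ax.set_xlabel("Age")
ax.set_ylabel("Verbal score")
ax.legend(loc="best")
```

Because each artist is given its own label, ax.legend() can build the legend directly and no proxy lines are needed. Drawing the points in gray also keeps them visually distinct from the blue smoothed curve.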

To cite the book, use: Zhang, Z. & Wang, L. (2017-2025). Advanced statistics using Python. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.