Histogram

A histogram is a graphical display of the frequencies of a continuous variable over a set of contiguous intervals. The range of the variable is divided into a sequence of equal-width intervals (bins). Within each interval, the number of observations (e.g., participants), called the frequency, is counted. The frequencies are then plotted as adjacent bars, whose heights represent frequencies or relative frequencies.
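
The binning and counting described above can be sketched directly in R. This is a minimal illustration, not the book's own example: the data are simulated, and the variable name x and the interval width of 10 are assumptions chosen for readability.

```r
# Simulated scores stand in for a real variable (an assumption for illustration).
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)

# Divide the range of x into equal-width intervals of width 10.
breaks <- seq(floor(min(x) / 10) * 10, ceiling(max(x) / 10) * 10, by = 10)

# Count the observations that fall in each interval (the frequencies).
# right = TRUE and include.lowest = TRUE mirror the defaults of hist().
freq <- table(cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE))
print(freq)

# Relative frequencies are the counts divided by the number of observations.
rel_freq <- freq / sum(freq)
print(round(rel_freq, 3))

# hist() does the same counting; plot = FALSE returns the result without drawing.
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```

The counts returned by hist() should agree with the table built by cut(), which shows that a histogram is simply a bar display of these interval counts.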

The purpose of a histogram is often to graphically summarize features of the distribution of a variable, such as

  • center (i.e., the location) of the data
  • spread (i.e., the scale) of the data
  • skewness of the data
  • presence of outliers
  • presence of multiple modes in the data.

Some examples of histograms are given below.

Examples

To generate a histogram, the function hist() can be used. Below, we plot histograms of the ufov (useful field of view) variable and the reason (reasoning ability) variable from the ACTIVE study. The distribution of ufov is clearly highly skewed, while the distribution of reason is closer to normal.

> usedata('active')
> attach(active)
>
> hist(ufov)
> hist(reason)
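
For readers who do not have the ACTIVE data at hand, the same contrast can be reproduced with simulated data. This is only a sketch: the variables skewed and symmetric are hypothetical stand-ins, not the ufov and reason variables, and the distributions used to generate them are assumptions.

```r
# Simulated stand-ins (assumptions): a right-skewed and a roughly normal variable.
set.seed(10)
skewed    <- rlnorm(500, meanlog = 0, sdlog = 1)
symmetric <- rnorm(500, mean = 25, sd = 10)

# Draw the two histograms side by side.
par(mfrow = c(1, 2))
hist(skewed)
hist(symmetric)

# A simple numeric check of skewness: for a right-skewed variable,
# the mean is pulled above the median by the long right tail.
print(c(mean = mean(skewed), median = median(skewed)))
print(c(mean = mean(symmetric), median = median(symmetric)))
```

In the skewed histogram most bars pile up on the left with a long tail to the right, whereas the symmetric one is roughly bell-shaped.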


Probability density vs. frequency

By default, the histogram is a representation of frequencies, the counts within each interval of a variable. If we set the option prob=TRUE, the probability densities are plotted (so that the histogram has a total area of one).

> usedata('active')
> attach(active)
>
> par(mfrow=c(1,2))
>
> hist(reason)
> hist(reason, prob=T)
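
The relation between the two scales can be checked numerically. The sketch below uses simulated data (an assumption; x is not a variable from the ACTIVE study) and the counts, breaks, and density components of the object returned by hist().

```r
# Simulated data, for illustration only.
set.seed(2)
x <- rnorm(300, mean = 30, sd = 8)

# Compute the histogram without drawing it.
h <- hist(x, plot = FALSE)

# Width of each interval.
bin_width <- diff(h$breaks)

# A density is the count divided by (number of observations * interval width).
manual_density <- h$counts / (sum(h$counts) * bin_width)
print(all.equal(manual_density, h$density))

# The total area of the bars (height * width, summed over intervals).
total_area <- sum(h$density * bin_width)
print(total_area)
```

Because each density is a count rescaled by the sample size and the interval width, the shape of the histogram is the same on both scales when the intervals have equal width; only the vertical axis changes.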

Add an estimated density curve and a normal curve

Often, we are interested in whether the distribution of a variable is close to a normal distribution. We can compare the two visually by adding a normal curve to the histogram. In addition, a smoothed density curve can be added to approximate the distribution represented by the histogram, which makes the comparison easier. In R, the smoothed density can be estimated using the density() function, and the normal density can be computed using the dnorm() function.

In the example below, we add an estimated density curve and a normal curve to the histogram of the reason variable. Some comments about the code used:

  • The histogram has to be plotted using the density instead of the frequency.
  • na.rm=T or na.rm=TRUE will remove the missing data (represented by NA in R) before applying a function.
  • The lines() function adds a line to an existing figure. Therefore, a figure has to exist before this function is called.
  • curve() can generate a new plot or add to an existing plot. To add to an existing plot, use the option add=T (or TRUE).
  • dnorm(x,mean=mean(reason,na.rm=T), sd=sd(reason,na.rm=T)) gives the normal density with the same mean and standard deviation as the reason variable; curve() draws it.
  • Often, ylim has to be changed so that the bars and both curves fit within the plot.

> usedata('active')
> attach(active)
>
> hist(reason, prob=T, ylim=c(0, .03))
> lines(density(reason,na.rm=T))
> curve(dnorm(x,mean=mean(reason,na.rm=T),
+       sd=sd(reason,na.rm=T)), add=T, col='red')
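
Instead of finding a suitable ylim by trial and error, the upper limit can be computed from the three things being drawn. The following is a sketch under stated assumptions: score is simulated data with a few missing values (not the reason variable), and the 5% headroom factor is an arbitrary choice.

```r
# Simulated scores with some missing values (an assumption for illustration).
set.seed(3)
score <- c(rnorm(295, mean = 25, sd = 10), rep(NA, 5))

# Mean and standard deviation, ignoring missing values.
m <- mean(score, na.rm = TRUE)
s <- sd(score, na.rm = TRUE)

# Smoothed density estimate and histogram densities (computed, not yet drawn).
d <- density(score, na.rm = TRUE)
h <- hist(score, plot = FALSE)

# The highest point among the bars, the smoothed density, and the normal peak.
y_max <- max(h$density, d$y, dnorm(m, mean = m, sd = s))

# Draw the histogram on the density scale with 5% headroom, then add both curves.
hist(score, prob = TRUE, ylim = c(0, 1.05 * y_max))
lines(d)
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red")
```

Inside curve(), x refers to the grid of values that curve() generates along the horizontal axis, not to a data vector, which is why the mean and standard deviation are computed beforehand and passed in as m and s.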

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.