Histogram
A histogram is a graphical display of frequencies over a set of consecutive intervals for a continuous variable. The range of the variable is divided into a series of equal-width intervals. Within each interval, the number of participants (the frequency) is counted. The frequencies are then plotted as adjacent bars, whose heights represent frequencies or relative frequencies.
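The counting step described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are simulated (not the ACTIVE data), and the variable name x and the choice of 10 intervals are arbitrary.

```r
# Simulated data standing in for a continuous variable (an assumption;
# not the ACTIVE data used in the examples).
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)

# Boundaries of 10 equal-width intervals covering the range of x.
breaks <- seq(floor(min(x)), ceiling(max(x)), length.out = 11)

# Count the observations in each interval by hand ...
counts_manual <- as.vector(table(cut(x, breaks = breaks, include.lowest = TRUE)))

# ... and let hist() do the same counting without drawing anything.
h <- hist(x, breaks = breaks, plot = FALSE)

counts_manual
h$counts
```

The two vectors of counts should agree, because hist() by default uses the same right-closed intervals as cut() with include.lowest = TRUE.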
The purpose of a histogram is often to graphically summarize features of the distribution of a variable, such as
- center (i.e., the location) of the data
- spread (i.e., the scale) of the data
- skewness of the data
- presence of outliers
- presence of multiple modes in the data.
Examples
To generate a histogram, the function hist() can be used. The following example shows histograms for the ufov (useful field of view) variable and the reason (reasoning ability) variable of the ACTIVE study. The distribution of ufov is highly skewed, while the distribution of reason is closer to normal.
> usedata('active')
> attach(active)
>
> hist(ufov)
> hist(reason)
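For readers without access to the ACTIVE data, the contrast between a skewed and a roughly normal distribution can be reproduced with simulated data. The sketch below is an assumption-laden stand-in: the variable names skewed and symmetric, the exponential and normal distributions, and all parameter values are made up for illustration and do not describe ufov or reason.

```r
set.seed(3)
skewed    <- rexp(1000, rate = 1/20)          # right-skewed (assumed exponential)
symmetric <- rnorm(1000, mean = 30, sd = 10)  # roughly normal

# Draw the two histograms side by side, then restore the plot layout.
old_par <- par(mfrow = c(1, 2))
hist(skewed, main = "Right-skewed")
hist(symmetric, main = "Roughly symmetric")
par(old_par)

# In a right-skewed variable the mean is pulled above the median;
# in a symmetric one the two are close.
c(mean = mean(skewed), median = median(skewed))
c(mean = mean(symmetric), median = median(symmetric))
```

Comparing the mean with the median is a quick numerical check on the skewness that the histogram shows visually.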
Probability density vs. frequency
By default, a histogram shows frequencies, that is, the counts within each interval of a variable. If we set the option prob=TRUE, probability densities are plotted instead, so that the histogram has a total area of one.
> usedata('active')
> attach(active)
>
> par(mfrow=c(1,2))
>
> hist(reason)
> hist(reason, prob=T)
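The relation between the two scales can be checked numerically: each density is the count divided by the total number of observations times the interval width. The following is a minimal sketch using simulated data (an assumption, not the reason variable).

```r
set.seed(2)
x <- rnorm(500, mean = 30, sd = 8)   # simulated stand-in data (assumption)

# Compute the histogram without drawing it.
h <- hist(x, plot = FALSE)

width <- diff(h$breaks)              # interval widths
dens_manual <- h$counts / (sum(h$counts) * width)

all.equal(dens_manual, h$density)    # compare with the densities from hist()
sum(h$density * width)               # total area under the bars
```

The total area should equal one up to rounding error, which is what makes the density scale suitable for overlaying density curves.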
Add an estimated density curve and a normal curve
Often, we are interested in whether the distribution of a variable is close to a normal distribution. We can compare them visually by adding a normal curve to the histogram. In addition, a smoothed density curve can be added to approximate the distribution represented by the histogram for better comparison. In R, the smoothed density can be estimated using the density() function and the normal curve can be generated using the dnorm() function.
In the example below, we add an estimated density curve and a normal curve to the histogram of the reason variable. Some comments about the code used:
- The histogram has to be plotted using the density instead of the frequency.
- na.rm=T or na.rm=TRUE will remove the missing data (represented by NA in R) before applying a function.
- The lines() function adds a line to an existing figure. Therefore, a figure has to be there before the use of this function.
- curve() can generate a new plot or add to an existing plot. To add to an existing plot, use the option add=T (or TRUE).
- dnorm(x,mean=mean(reason,na.rm=T), sd=sd(reason,na.rm=T)) generates a curve with the same mean and standard deviation as the reason variable.
- Oftentimes, one has to change ylim to make the plot fit.
> usedata('active')
> attach(active)
>
> hist(reason, prob=T, ylim=c(0, .03))
> lines(density(reason,na.rm=T))
> curve(dnorm(x,mean=mean(reason,na.rm=T),
+       sd=sd(reason,na.rm=T)), add=T, col='red')
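Instead of finding a suitable ylim by trial and error, the upper limit can be computed from the quantities being plotted. The sketch below is one possible approach, not the book's method; it uses a simulated variable named score (hypothetical, with missing values injected on purpose) and freq = FALSE, which is equivalent to prob = TRUE.

```r
set.seed(4)
score <- rnorm(300, mean = 25, sd = 12)  # hypothetical variable, not the ACTIVE data
score[sample(300, 15)] <- NA             # inject some missing values

m <- mean(score, na.rm = TRUE)
s <- sd(score, na.rm = TRUE)
d <- density(score, na.rm = TRUE)        # kernel density estimate
h <- hist(score, plot = FALSE)           # hist() drops missing values itself

# Upper y limit: the tallest of the bars, the density curve, and the normal peak.
ymax <- max(h$density, d$y, dnorm(m, mean = m, sd = s))

hist(score, freq = FALSE, ylim = c(0, ymax * 1.05))
lines(d)
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red")
```

Computing the mean and standard deviation once and reusing them keeps the curve() call short and avoids repeating na.rm = TRUE.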
To cite the book, use:
Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.