Intro to Statistics: Part 6: Probability Density Functions

In the previous article we used a density histogram to chart the distribution of a random variable representing people's heights.  We also discussed how to interpret probability densities and how to convert them to probabilities.  In the process we caught a glimpse of a probability density function when we overlaid the sample data with a normal distribution curve.  Here's the chart (and R code) from the previous article to refresh your memory:

library(UsingR)
library(ggplot2)
data(father.son)
heights <- sample(father.son$fheight)

ggplot() + 
    geom_histogram(aes(y=..density..,x=heights), 
                   binwidth=1, fill="yellow", colour="black") +
    ggtitle("Density histogram of observed heights") + 
    ylab("Probability densities") + 
    xlab("Heights (bin width = 1 in) ") +
    geom_vline(x=mean(heights),linetype="dashed",size=1,colour="blue") +
    stat_function(aes(x=heights), fun = dnorm, colour="red",size=1,
                  arg=list(mean=mean(heights), sd=sd(heights)) )

The normal distribution curve (red line) is an example of a probability density function.  A probablity density function (sometimes referred to as a distribution curve) is a mathematical function that describes a random variable's distribution, in terms of probability densities.

A probablity density function takes an outcome value as input and returns the probability density associated with that outcome.  Remember that probability density is {probability} divided by {unit of outcome}, so in order to convert the probability density to an actual probability, we need to multiply it by the bin width, where the bin width is the range of outcomes we're interested in.

For example, if we want to know the probability of observing someone with a height of 70in, we can plug 70in into the probability density function (the normal distribution curve) that we overlaid, and it will return the corresponding probability density.  So then what is our bin width?  Well, the "bin width" for a singular outcome in a continuous distribution is 0, because when we're talking about the singular outcome 70in, we're actually talking about 70.00000000000..., and the probability of observing that precise outcome is 0.

Instead we can compute the probability of observing an outcome within a given range of outcomes, for example between 70 - 71in.   However, since our probability density function is a continuously changing function, the density of the first value in the range (70in) might be different than the density of the last value in the range (71in).  So computing the probability isn't as easy as it was with the density histogram, where the outcomes are "binned up" with constant density over the range.

Recall that the probability for a single bin is equal to the area of the bin's rectangular bar in the histogram (height * width).  In order to compute the probability of a range using the probability density function, we need to determine the area under the curve across the range of outcomes.  This is equivalent to taking an integral in calculus. 

Aside: Calculating areas under function curves is a calculus thing.  So if you've learned calculus then this should look familiar.  If you haven't learned calculus, don't worry about it, it's not that important.  You don't need to know calculus to understand statistics.  Just know that calculus provides the theories and methods that allow us to compute the area under the curve precisely (but you can let your computer do the actual number crunching).

So for example if we wanted to know the probability of observing a height between 70 - 71in, we'd compute the area under the probability density curve between 70 and 71.  The area in question is depicted in red in the chart below.

    x <- seq(70,71,0.001)
    y <- dnorm(x, mean=mean(heights), sd=sd(heights))
    shade <- data.frame( rbind(c(70,0), cbind(x,y), c(71,0)) )

ggplot() + 
    geom_histogram(aes(y=..density..,x=heights), 
                   binwidth=1, fill="yellow", colour="black") +
    ggtitle("Density histogram of observed heights") + 
    ylab("Probability densities") + 
    xlab("Heights (bin width = 1 in) ") +
    geom_vline(x=mean(heights),linetype="dashed",size=1,colour="blue") +
    stat_function(aes(x=heights), fun = dnorm, colour="red",size=1,
                  arg=list(mean=mean(heights), sd=sd(heights)) ) +
    geom_polygon(data = shade, aes(x, y), fill="red")

Notice that the area in red looks to be roughly the same as the yellow bar behind it.  The yellow bar represents our "binned up" estimate of the probability for that range.

As with the density histogram, it's worth noting that the total area under the probability density function across the entire range of outcomes is equal to 1.  This is consistent with the idea that the sum of probabilities for all outcomes in a distribution equals 1. 

 

R Functions

R provides a few handy functions for working with the probability density function of a normal distribution:   

  • dnorm: (d = density): Takes an outcome value as input and returns its probability density
  • pnorm: (p = probability): Takes an outcome value as input and returns the probability of observing an outcome less than or equal to the given outcome (i.e. the range from -infinity up to the given outcome)
  • qnorm: (q = quantile): Takes a probability value (a.k.a. "quantile") and returns the corresponding outcome such that the probability of observing an outcome less than or equal to the returned outcome is equal to the given probability (go ahead and take a second to parse that...)

qnorm is basically the inverse of pnorm.  Where pnorm takes an outcome and returns a probability, qnorm takes a probability (technically a "quantile"), and returns an outcome.  Or in mathematical terms:

    qnorm( pnorm(x) ) == x
    pnorm( qnorm(y) ) == y

where x is an outcome value and y is a probability. 

pnorm and qnorm are probably the functions you'll use most often, so take some time to get familiar with them. 

 

Probability function

We can use pnorm to determine the area of the region shaded in red above, which gives us the probability of observing a height between 70 - 71in (according to the normal distribution curve that we fitted to the data).  First we compute the pnorm result for 71in, which gives us the probability of 71 or less, then subtract away the pnorm result for 70in.  This leaves us with the probability of observing an outcome in the range 70 - 71in:

    p71.or.less <- pnorm(71, mean=mean(heights), sd=sd(heights))
    p70.or.less <- pnorm(70, mean=mean(heights), sd=sd(heights))
    round(p71.or.less - p70.or.less, 3)
    
    ## [1] 0.086

This probability value, 0.086, is equal to the area of the region shaded in red.  Notice that it's approximately the same area as the yellow bar behind it, whose height is about 0.086 (which when multiplied by the bin width (1 inch) gives the probability, 0.086, which is the same as the pnorm-computed probability).

 

Quantile Function

As mentioned above, the quantile function takes a probability and returns the corresponding outcome such that the probability of observing an outcome that is less than or equal to the returned outcome is equal to the given probability (that definition is no easier the second time around).  

For example, we can use the quantile functions to answer questions like, "what is the height at which 25% of the population is shorter than or equal to that height?"

    height <- qnorm(0.25, mean=mean(heights), sd=sd(heights));
    round(height, 1)
    
    ## [1] 65.8

This tells us that 25% of our sample heights are equal to or shorter than 65.8in.  Or in other words, if you selected a height at random from the sample, the probability that it is less than or equal to 65.8in is 0.25.

 

Recap

  1. random variable is described by the characteristics of its distribution
  2. Each outcome in the distribution has an associated probability of occurring
  3. The sum of the probabilities of all possible outcomes equals 1
  4. The expected value, E[X], of a distribution is the weighted average of all outcomes, where each outcome is weighted by its probability
  5. The variance, Var(X), is the "measure of spread" of a distribution. It's calculated by taking the weighted average of the squared differences between each outcome and the expected value.
  6. The standard deviation of a distribution is the square root of its variance
  7. A random variable's distribution is commonly plotted using probability density function.
  8. A probability density function takes an outcome value as input and returns the corresponding probability density for the given outcome
  9. The probability of observing an outcome within a given range can be determined by computing the area under the probability density function curve within the given range.
  10. sample is a subset of a population. Statistical methods and principles are applied to the sample's distribution in order to make inferences about the true distribution -- i.e. the distribution across the population as a whole
 

Intro to Statistics ... 4 . 5 . 6 . 7 . 8 ...

/