Probability Distributions

Linking Probability and Data

Robert W. Walker

2026-02-25

Probability Distributions

Distributions in R are defined by four core parts:

  • r: random variables
  • d: density/probability that Pr(X=x) or f(x)
  • p: cumulative probability (given q) Pr(X\leq q)=p
  • q: quantile (given p): x such that Pr(X\leq q)=p

A Grape Escape?

A filling process is supposed to fill jars with 16 ounces of grape jelly, according to the label, and regulations require that each jar contain between 15.95 and 16.05 ounces.

Some Questions

  1. Suppose that the uniform random process of filling in known to fill between 15.9 and 16.1 ounces uniformly.
  • Plot the histogram of 1000 simulated values. NB: unif is the noun with boundaries a (default 0) and b(default 1).
Jars <- runif(1000, 15.9, 16.1)
Jars %>% data.frame() %>% ggplot() + aes(x=Jars) + geom_histogram(binwidth=0.005)

  • What is the probability that a random jar is outside of requirements?

Exactly? 50 percent because 25 percent are between 15.9 and 15.95 and 25 percent are between 16.05 and 16.1.

table(Jars < 15.95 | Jars > 16.05) # | captures or

FALSE  TRUE 
  520   480 
  • Scale (z) the jars and summarise them.
summary(scale(Jars))
       V1          
 Min.   :-1.78517  
 1st Qu.:-0.86083  
 Median :-0.01421  
 Mean   : 0.00000  
 3rd Qu.: 0.84030  
 Max.   : 1.71533  
sd(scale(Jars))
[1] 1

More Questions

  1. The mean of the normal random process of filling is known to be 16.004 ounces with standard deviation 0.028 ounces.
  • What is the probability that a random jar is outside of requirements? NB: norm is the noun with mean (default 0) and sd (default 1).
pnorm(15.95, 16.004, 0.028) + pnorm(16.05, 16.004, 0.028, lower.tail=FALSE)
[1] 0.07709829
  • What is the probability that a random jar contains more than 16.1 ounces?
1-pnorm(16.1, 16.004, 0.028)
[1] 0.0003033834
  • What is the probability that a random jar contains less than 16.04 ounces?
pnorm(16.04, 16.004, 0.028)
[1] 0.9007286
  • 95% of jars, given a normal, will contain between XXX and XXX ounces of jelly.
qnorm(c(0.025,0.975), 16.004, 0.028)
[1] 15.94912 16.05888
  • The bottom 5% of jars contain, at most, XXX ounces of jelly.
qnorm(0.05, 16.004, 0.028)
[1] 15.95794
  • The top 25% of jars contain at least XXX ounces of jelly.
qnorm(0.75, 16.004, 0.028)
[1] 16.02289

Why Normals?

  • The Central Limit Theorem
  • They Dominate Ops [6\sigma]
  • Normal Approximations Abound

Events: The Poisson

Poisson

Poisson

Take a binomial with p very small and let n \rightarrow \infty. We get the Poisson distribution (y) given an arrival rate \lambda specified in events per period.

f(y|\lambda) = \frac{\lambda^{y}e^{-\lambda}}{y!}

Examples: The Poisson

  • Walk in customers
  • Emergency Room Arrivals
  • Births, deaths, marriages
  • Prussian Cavalry Deaths by Horse Kick
  • Fish?

Air Traffic Controllers

FAA Decision: Expend or do not expend scarce resources investigating claimed staffing shortages at the Cleveland Air Route Traffic Control Center.

Essential facts: The Cleveland ARTCC is the US’s busiest in routing cross-country air traffic. In mid-August of 1998, it was reported that the first week of August experienced 3 errors in a one week period; an error occurs when flights come within five miles of one another by horizontal distance or 2000 feet by vertical distance. The Controller’s union claims a staffing shortage though other factors could be responsible. 21 errors per year (21/52 errors per week) has been the norm in Cleveland for over a decade.

Some Questions

  1. Plot a histogram of 1000 random weeks. NB: pois is the noun with no default for \lambda – the arrival rate.
DF <- data.frame(Close.Calls = rpois(1000, 21/52))
ggplot(DF) + aes(x=Close.Calls) + geom_histogram()

ggplot(DF) + aes(x=Close.Calls) + stat_ecdf(geom="step")

  • Plot a sequence on the x axis from 0 to 5 and the probability of that or fewer incidents along the y. seq(0,5)
DF <- data.frame(x=0:5, y=ppois(0:5, 21/52))
ggplot(DF) + aes(x=x, y=y) + geom_col()

  1. What would you do and why? Not impossible

  2. After analyzing the initial data, you discover that the first two weeks of August have experienced 6 errors. What would you now decide? Well, once is 0.0081342. Twice, at random, is that squared. We have a problem.

Deaths by Horse Kick in the Prussian cavalry?

library(vcd)
data(VonBort)
head(VonBort)
  deaths year corps fisher
1      0 1875     G     no
2      0 1875     I     no
3      0 1875    II    yes
4      0 1875   III    yes
5      0 1875    IV    yes
6      0 1875     V    yes
mean(VonBort$deaths)
[1] 0.7

Bernoulli Trials

The Generic Bernoulli Trial

Suppose the variable of interest is discrete and takes only two values: yes and no. For example, is a customer satisfied with the outcomes of a given service visit?

For each individual, because the probability of yes (1) \pi and no (0) 1-\pi must sum to one, we can write:

f(x|\pi) = \pi^{x}(1-\pi)^{1-x}

Binomial Distribution

For multiple identical trials, we have the Binomial:

f(x|n,\pi) = {n \choose k} \pi^{x}(1-\pi)^{n-x} where {n \choose k} = \frac{n!}{(n-k)!}

The Binomial

BinomialR

BinomialR

Scottish Pounds

Informal surveys suggest that 15% of Essex shopkeepers will not accept Scottish pounds. There are approximately 200 shops in the general High Street square.

  1. Draw a plot of the distribution and the cumulative distribution of shopkeepers that do not accept Scottish pounds.
Scots <- data.frame(Potential.Refusers = 0:200) %>% mutate(Prob = round(pbinom(Potential.Refusers, size=200, 0.15), digits=4))
Scots %>% ggplot() + aes(x=Potential.Refusers, y=Prob) + geom_point() + labs(x="Refusers", y="Prob. of x or Less Refusers") + theme_minimal() -> Plot1
Plot1

A Nicer Plot

library(plotly)
p <- ggplotly(Plot1)
p

More Questions About Scottish Pounds

  1. What is the probability that 24 or fewer will not accept Scottish pounds?
pbinom(24, 200, 0.15)
[1] 0.1368173
  1. What is the probability that 25 or more shopkeepers will not accept Scottish pounds?
1-pbinom(24, 200, 0.15)
[1] 0.8631827
  1. With probability 0.9 [90 percent], XXX or fewer shopkeepers will not accept Scottish pounds.
qbinom(0.9, 200, 0.15)
[1] 37

Application: The Median is a Binomial with p=0.5

Interestingly, any given observation has a 50-50 chance of being over or under the median. Suppose that I have five datum.

  1. What is the probability that all are under?
pbinom(0,size=5, p=0.5)
[1] 0.03125
  1. What is the probability that all are over?
dbinom(5,size=5, p=0.5)
[1] 0.03125
  1. What is the probability that the median is somewhere in between our smallest and largest sampled values?

Everything else.

The Rule of Five

  • This is called the Rule of Five by Hubbard in his How to Measure Anything.

Geometric Distributions

How many failures before the first success? Now defined exclusively by p. In each case, (1-p) happens k times. Then, on the k+1^{th} try, p. Note 0 failures can happen…

Pr(y=k) = (1-p)^{k}p

Example: Entrepreneurs

Suppose any startup has a p=0.1 chance of success. How many failures?

Example: Entrepreneurs

Suppose any startup has a p=0.1 chance of success. How many failures for the average/median person?

qgeom(0.5,0.1)
[1] 6
  1. [Geometric] Plot 1000 random draws of “How many vendors until one refuses my Scottish pounds?”
Geoms.My <- data.frame(Vendors=rgeom(1000, 0.15))
Geoms.My %>% ggplot() + aes(x=Vendors) + geom_histogram(binwidth=1)

We could also do something like.

plot(seq(0,60), pgeom(seq(0,60), 0.15))

Negative Binomial Distributions

How many failures before the r^{th} success? In each case, (1-p) happens k times. Then, on the k+1^{th} try, we get our r^{th} p. Note 0 failures can happen…

Pr(y=k) = {k+r-1 \choose r-1}(1-p)^{k}p^{r}

Needed Sales

I need to make 5 sales to close for the day. How many potential customers will I have to have to get those five sales when each customer purchases with probability 0.2.

library(patchwork)
DF <-  data.frame(Customers = c(0:70)) %>% 
  mutate(m.Customers = dnbinom(Customers, size=5, prob=0.2), 
         p.Customers = pnbinom(Customers, size=5, prob=0.2)) 
pl1 <- DF %>% ggplot() + aes(x=Customers) + geom_line(aes(y=p.Customers)) 
pl2 <- DF %>% ggplot() + aes(x=Customers) + geom_point(aes(y=m.Customers))

Simulation: A Powerful Tool

In this last example, I was concerned with sales. I might also want to generate revenues because I know the rough mean and standard deviation of sales. Combining such things together forms the basis of a Monte Carlo simulation.

Some of the basics are covered in a swirl on simulation.

An Example

Customers arrive at a rate of 7 per hour. You convert customers to buyers at a rate of 85%. Buyers spend, on average 600 dollars with a standard deviation of 150 dollars.

Sim <- 1:1000
Customers <- rpois(1000, 7)
Buyers <- rbinom(1000, size=Customers, prob = 0.85)
Data <- data.frame(Sim, Buyers, Customers)
Data <- Data %>% group_by(Sim) %>% mutate(Revenue = sum(rnorm(Buyers, 600, 150))) %>% ungroup()

Simulation Results

A Summary

Distributions are how variables and probability relate. They are a graph that we can enter in two ways. From the probability side to solve for values or from values to solve for probability. It is always a function of the graph.

Distributions generally have to be sentences.

  • The name is a noun but it also has
  • parameters – verbs – that makes the noun tangible.