Linking Probability and Data
There is a very nice graphical user interface for parts of R called radiant that is especially useful for probability distributions. To acquire it, try:
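A minimal way to get the CRAN build (radiant documents other installation routes as well; the launch call assumes an interactive session):

install.packages("radiant")   # install from CRAN
radiant::radiant()            # launch the browser-based interface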
There is also a web instance that can be deployed.
Jaynes presents a few core ideas and requirements for his rational system. Probability emerges as the representation of circumstances in which any given realization of a process is either TRUE or FALSE; because either is possible before the fact, the uncertainty is expressed as a probability.
Sensitivity refers to the ability of a test to designate an individual with a disease as positive. Specificity refers to the ability of a test to designate an individual without a disease as negative.
The false positive rate is then the complement of specificity, and the false negative rate is the complement of sensitivity.
| Truth | Positive Test | Negative Test |
|---|---|---|
| Positive | Sensitivity | False Negative |
| Negative | False Positive | Specificity |
When we get to hypothesis testing next time, this comes up again with null and alternative hypotheses and the related decision.
| Truth | Reject Null | Accept Null |
|---|---|---|
| Alternative | Correct | Type II error |
| Null | Type I error | Correct |
I do not love the book's definition of a random variable. Technically, it is a variable whose values are generated according to some random process; the book implies that these are limited to quantities.
It is really a measurable function defined on a probability space that maps from the sample space [the set of possible outcomes] to the real numbers.
What does it mean to say something is independent of something else?
A probability distribution is of necessity two-dimensional: the values the variable can take, and the probability attached to each.
Our core concept is a probability distribution, just as above. These come in two forms for two types of variables [discrete (qualitative) and continuous (quantitative)] and can be represented either as a distribution function (cdf/cmf) or as a density/mass function (pdf/pmf).
Distributions are nouns.
Sentences are incomplete without verbs: parameters.
We need both; it is for this reason that the previous slide is true.
We do not always have a grounding for either the name or the parameter.
For now, we will work with univariate distributions though multivariate distributions do exist.
The differences are sums versus integrals. Why?
The probability of exactly any given value is zero on a true continuum.
E(X) = \sum_{x \in X} x \cdot Pr(X=x) \quad \text{(discrete)}
E(X) = \int_{x \in X} x \cdot f(x)\,dx \quad \text{(continuous)}
E[(X-\mu)^2] = \sum_{x \in X} (x-\mu)^2 \cdot Pr(X=x) \quad \text{(discrete)}
E[(X-\mu)^2] = \int_{x \in X} (x-\mu)^2 \cdot f(x)\,dx \quad \text{(continuous)}
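As a quick check on the discrete case, consider a fair six-sided die (my example, not the book's):

x <- 1:6                  # the possible outcomes
px <- rep(1/6, 6)         # Pr(X = x) for each outcome
EX <- sum(x * px)         # expectation: 3.5
sum((x - EX)^2 * px)      # variance: 35/12, about 2.92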
Probability distributions are mathematical formulae expressing likelihood for some set of qualities or quantities.
Like a proper English sentence, both parts, the distribution and its parameters, are required.
For our purposes, it is a systematic description of a phenomenon that shares important and essential features of that phenomenon. Models frequently give us leverage on problems in the absence of alternative approaches.
library(dplyr); library(ggplot2); library(patchwork)
# The continuous uniform on [0, 1]: cumulative (punif) and density (dunif) functions
Unif <- data.frame(x = seq(0, 1, by = 0.005)) %>%
  mutate(p.x = punif(x), d.x = dunif(x))
p1 <- ggplot(Unif) + aes(x = x, y = p.x) + geom_step() +
  labs(title = "Distribution Function [cdf/cmf]") + theme_minimal()
p2 <- ggplot(Unif) + aes(x = x, y = d.x) + geom_step() +
  labs(title = "Density Function [pdf/pmf]") + theme_minimal()
p2 + p1
f(x|\mu,\sigma^2 ) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp \left[ -\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2}\right]
The normal is the workhorse of statistics. Key features: it is symmetric and unimodal about \mu; the two parameters \mu and \sigma describe it completely; and roughly 68, 95, and 99.7 percent of its probability lies within one, two, and three standard deviations of the mean.
library(dplyr); library(ggplot2); library(patchwork)
# The same pair of pictures for the standard normal
Norm <- data.frame(x = seq(-4, 4, by = 0.01)) %>%
  mutate(p.x = pnorm(x), d.x = dnorm(x))
p1 <- ggplot(Norm) + aes(x = x, y = p.x) + geom_line() +
  labs(title = "Distribution Function [cdf]") + theme_minimal()
p2 <- ggplot(Norm) + aes(x = x, y = d.x) + geom_line() +
  labs(title = "Density Function [pdf]") + theme_minimal()
p2 + p1
The generic z-transformation applied to a variable x centers [mean \approx 0] and scales [std. dev. \approx variance \approx 1] to z_{x} using population parameters. In this case, two things are important.
First, this is the idea behind there being only one normal table in a statistics book.
Second, \mu and \sigma are presumed known.
z = \frac{x - \mu}{\sigma}
The scale() command in R does this for a sample.
z = \frac{x - \overline{x}}{s_{x}} where \overline{x} is the sample mean of x and s_{x} is the sample standard deviation of x.
In samples, the 0 and 1 are exact; these are features of the mean and degrees of freedom. If I know the mean and any n-1 observations, the n^{th} observation is exactly the value such that the deviations add up to zero/cancel out.
Suppose earnings in a community have mean 55,000 and standard deviation 10,000. This is in dollars. Suppose I earn 75,000 dollars. First, if we take the top part of the fraction in the z equation, we see that I earn 20,000 dollars more than the average (75000 - 55000). Finishing the calculation of z, I would divide that 20,000 dollars by 10,000 dollars per standard deviation. Let’s show that.
z = \frac{75000\ \text{dollars} - 55000\ \text{dollars}}{10000\ \text{dollars per SD}} = +2\ \text{SD}
I am 2 standard deviations above the average (the +) earnings. All z does is re-scale the original data to standard deviations with zero as the mean. The metric is the standard deviation.
Suppose I earn 35,000. That makes me 20,000 below the average and gives me a z score of -2. I am 2 standard deviations below average (the -) earnings.
z is an easy way to assess symmetry.
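The output below was presumably generated by something along these lines; this is a sketch that assumes 1,000 hypothetical incomes drawn from a normal with mean 55,000 and standard deviation 10,000, so exact values will differ from run to run:

library(dplyr)
# Draw hypothetical incomes and z-score them with the sample mean and sd
Hypo <- data.frame(Hypo.Income = rnorm(1000, mean = 55000, sd = 10000)) %>%
  mutate(z.Income = as.numeric(scale(Hypo.Income)))
head(Hypo)
table(sign(Hypo$z.Income))  # roughly half below (-1) and half above (+1) the mean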
Hypo.Income z.Income
1 48092.03 -0.6285253
2 51008.63 -0.3456873
3 28654.96 -2.5134433
4 45433.06 -0.8863804
5 32133.90 -2.1760715
6 39885.85 -1.4243233
-1 1
502 498
Distributions in R are defined by four core parts: a prefix, d for the density/mass function, p for the cumulative distribution function, q for the quantile function (the inverse of p), or r for random draws, attached to the distribution's name.
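For example, with the standard normal:

dnorm(0)       # density at x = 0
pnorm(1.96)    # Pr(X <= 1.96), about 0.975
qnorm(0.975)   # the value with 97.5 percent below it, about 1.96
rnorm(5)       # five random draws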
A filling process is supposed to fill jars with 16 ounces of grape jelly, according to the label, and regulations require that each jar contain between 15.95 and 16.05 ounces. Suppose the process actually fills jars uniformly between 15.9 and 16.1 ounces. What fraction of jars meets the regulation?
Exactly 50 percent, because 25 percent of jars fall between 15.9 and 15.95 and 25 percent fall between 16.05 and 16.1.
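In R, given the uniform filling process assumed above:

# Pr(15.95 < X < 16.05) when X ~ Uniform(15.9, 16.1)
punif(16.05, min = 15.9, max = 16.1) - punif(15.95, min = 15.9, max = 16.1)  # 0.5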
Take a binomial with p very small and let n \rightarrow \infty with np \rightarrow \lambda. We get the Poisson distribution for a count y given an arrival rate \lambda specified in events per period.
f(y|\lambda) = \frac{\lambda^{y}e^{-\lambda}}{y!}
FAA Decision: Expend or do not expend scarce resources investigating claimed staffing shortages at the Cleveland Air Route Traffic Control Center.
Essential facts: The Cleveland ARTCC is the busiest in the US for routing cross-country air traffic. In mid-August of 1998, it was reported that the first week of August saw 3 errors; an error occurs when flights come within five miles of one another horizontally, or within 2000 feet vertically. The controllers' union claims a staffing shortage, though other factors could be responsible. 21 errors per year (21/52 errors per week) has been the norm in Cleveland for over a decade.
library(ggplot2)
# Simulate 1000 weeks at the historical rate of 21/52 errors per week
DF <- data.frame(Close.Calls = rpois(1000, 21/52))
ggplot(DF) + aes(x = Close.Calls) + geom_histogram(binwidth = 1)
What would you do and why? Three errors in one week is rare but not impossible.
After analyzing the initial data, you discover that the first two weeks of August have experienced 6 errors. What would you now decide? The probability of three or more errors in a single week is 0.0081342; two such weeks at random is that squared. We have a problem.
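The arithmetic in R:

lambda <- 21/52                          # historical errors per week
ppois(2, lambda, lower.tail = FALSE)     # Pr(3 or more in one week), about 0.0081
ppois(2, lambda, lower.tail = FALSE)^2   # two such weeks at random, about 6.6e-05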
Suppose the variable of interest is discrete and takes only two values: yes and no. For example, is a customer satisfied with the outcomes of a given service visit?
For each individual, because the probability of yes (1), \pi, and the probability of no (0), 1-\pi, must sum to one, we can write:
f(x|\pi) = \pi^{x}(1-\pi)^{1-x}
For multiple identical trials, we have the Binomial:
f(x|n,\pi) = {n \choose x} \pi^{x}(1-\pi)^{n-x} where {n \choose x} = \frac{n!}{x!(n-x)!}
Informal surveys suggest that 15% of Essex shopkeepers will not accept Scottish pounds. There are approximately 200 shops in the general High Street square.
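A sketch of how one might put the binomial to work here; the particular questions are my illustrations, with n = 200 shops and \pi = 0.15:

n <- 200; p <- 0.15
n * p                            # expected number of refusing shops: 30
dbinom(30, size = n, prob = p)   # Pr(exactly 30 refuse)
pbinom(40, size = n, prob = p)   # Pr(40 or fewer refuse)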
Interestingly, any given observation has a 50-50 chance of being over or under the median. Suppose that I have five data points.
All five land above the median with probability (1/2)^{5} = 1/32, and likewise all five below; everything else, with at least one observation on each side, accounts for the remaining 30/32.
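In R, the count above the median is binomial with n = 5 and p = 0.5 (a sketch):

dbinom(0:5, size = 5, prob = 0.5)   # Pr(k of 5 fall above the median), k = 0..5
dbinom(5, size = 5, prob = 0.5)     # all five above: 1/32 = 0.03125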
How many failures before the first success? The geometric is defined exclusively by p. In each case, (1-p) happens k times; then, on the (k+1)^{th} try, p. Note that 0 failures can happen…
Pr(y=k) = (1-p)^{k}p
Suppose any startup has a p=0.1 chance of success. How many failures should we expect before the first success, for the average and for the median founder?
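The mean of the geometric (in its failures-before-first-success form) is (1-p)/p, and the median comes from the quantile function:

p <- 0.1
(1 - p) / p     # mean failures before the first success: 9
qgeom(0.5, p)   # median failures: 6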
library(dplyr); library(ggplot2)
# 1000 simulated geometric counts: shops visited before the first refusal (p = 0.15)
Geoms.My <- data.frame(Vendors = rgeom(1000, 0.15))
Geoms.My %>% ggplot() + aes(x = Vendors) + geom_histogram(binwidth = 1)
We could also do something like the following.
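For instance, the exact geometric mass and distribution functions rather than simulated draws (my sketch):

library(dplyr); library(ggplot2)
DF.g <- data.frame(Vendors = 0:30) %>%
  mutate(m.Vendors = dgeom(Vendors, prob = 0.15),   # Pr(exactly k shops before a refusal)
         p.Vendors = pgeom(Vendors, prob = 0.15))   # Pr(k or fewer)
ggplot(DF.g) + aes(x = Vendors, y = m.Vendors) + geom_col()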
How many failures before the r^{th} success? In each case, (1-p) happens k times, interspersed with the first r-1 successes; then, on the (k+r)^{th} trial, we get our r^{th} success. Note that 0 failures can happen…
Pr(y=k) = {k+r-1 \choose r-1}(1-p)^{k}p^{r}
I need to make 5 sales to close for the day. How many potential customers will I have to see to get those five sales, when each customer purchases with probability 0.2?
library(dplyr); library(ggplot2); library(patchwork)
# Negative binomial: failures (non-buyers) before the 5th sale, with p = 0.2
DF <- data.frame(Customers = c(0:70)) %>%
  mutate(m.Customers = dnbinom(Customers, size = 5, prob = 0.2),   # Pr(exactly k failures)
         p.Customers = pnbinom(Customers, size = 5, prob = 0.2))   # Pr(k or fewer failures)
pl1 <- DF %>% ggplot() + aes(x = Customers) + geom_line(aes(y = p.Customers))
pl2 <- DF %>% ggplot() + aes(x = Customers) + geom_point(aes(y = m.Customers))
pl2 + pl1
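A natural follow-up question (my addition): how many customers would I need to plan for to be 95 percent sure of closing the five sales?

# 95th percentile of failures (non-buyers), plus the 5 buyers themselves
qnbinom(0.95, size = 5, prob = 0.2) + 5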
In this last example, I was concerned with sales. I might also want to generate revenues because I know the rough mean and standard deviation of sales. Combining such things together forms the basis of a Monte Carlo simulation.
Some of the basics are covered in a swirl on simulation.
Customers arrive at a rate of 7 per hour. You convert customers to buyers at a rate of 85%. Buyers spend, on average, 600 dollars with a standard deviation of 150 dollars.
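A minimal Monte Carlo sketch of this business, simulating many hours; Poisson arrivals and binomial conversion follow the setup above, while normally distributed spending is my assumption:

set.seed(42)  # arbitrary seed, for reproducibility
sims <- 10000
arrivals <- rpois(sims, lambda = 7)                    # customers in each hour
buyers <- rbinom(sims, size = arrivals, prob = 0.85)   # converted customers
revenue <- sapply(buyers, function(b) sum(rnorm(b, mean = 600, sd = 150)))
mean(revenue); sd(revenue)   # the distribution of hourly revenue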
Distributions are how variables and probability relate. They are a graph that we can enter from two directions: from a probability, to solve for values (quantiles), or from values, to solve for a probability. Either way, we are evaluating a function of the graph.
Distributions generally have to be complete sentences: a distribution (the noun) with its parameters (the verbs).