On Probability, II

OIS Chapter 3 and Jaynes

Robert W. Walker

2026-02-25

Overview

  • Tables
  • Probability

Ethics: Dehumanization for Aggregation

Three Concepts from Set Theory

  • Intersection [and]
  • Union [or] avoid double counting the intersection
  • Complement [not]

Three Distinct Probabilities

  • Joint: Pr(X = x and Y = y)
  • Marginal: Pr(X = x) or Pr(Y = y)
  • Conditional: Pr(X = x | Y = y) or Pr(Y = y | X = x)

Joint Probability

The cells of the table sum to one.

For Berkeley:

UCBAdmit %>% tabyl(M.F, Admit) %>% adorn_percentages("all")
    M.F        No       Yes
 Female 0.2823685 0.1230667
   Male 0.3298719 0.2646929
prop.table(table(UCBAdmit$M.F,UCBAdmit$Admit))
        
                No       Yes
  Female 0.2823685 0.1230667
  Male   0.3298719 0.2646929

Marginal Probability

Each margin sums to one. We collapse the table to a single margin. Here, two can be identified: the probability of Admit and the probability of M.F.

UCBAdmit %>% tabyl(M.F)
    M.F    n   percent
 Female 1835 0.4054353
   Male 2691 0.5945647
UCBAdmit %>% tabyl(Admit)
 Admit    n   percent
    No 2771 0.6122404
   Yes 1755 0.3877596
prop.table(table(UCBAdmit$M.F))

   Female      Male 
0.4054353 0.5945647 
prop.table(table(UCBAdmit$Admit))

       No       Yes 
0.6122404 0.3877596 

Conditional Probability

How does one margin of the table break down given values of another? Each row or column sums to one.

Four can be identified: the probability of admission/rejection for males and for females, and the probability of male/female among admits and among rejects.

For Berkeley:

UCBAdmit %>% tabyl(M.F, Admit) %>% adorn_percentages("row")
    M.F        No       Yes
 Female 0.6964578 0.3035422
   Male 0.5548123 0.4451877
prop.table(table(UCBAdmit$M.F,UCBAdmit$Admit), 1)
        
                No       Yes
  Female 0.6964578 0.3035422
  Male   0.5548123 0.4451877
UCBAdmit %>% tabyl(M.F, Admit) %>% adorn_percentages("col")
    M.F        No       Yes
 Female 0.4612053 0.3173789
   Male 0.5387947 0.6826211
prop.table(table(UCBAdmit$M.F,UCBAdmit$Admit), 2)
        
                No       Yes
  Female 0.4612053 0.3173789
  Male   0.5387947 0.6826211

Law of Total Probability

The law of total probability combines the distributive property of multiplication with the fact that probabilities sum to one.

For example, the probability of Admitted and Male is the probability of admission for males times the probability of male.

Pr(X = x, Y = y) = Pr(Y = y | X = x) Pr(X = x)

Or it is the probability of being admitted times the probability of being male among admits.

Pr(X = x, Y = y) = Pr(X = x | Y = y) Pr(Y = y)
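As a numerical check, the identity can be verified with the Berkeley figures reported in the tables above (values copied from those tables; base R only):

```r
# Law of total probability check with the Berkeley figures shown above.
# Pr(Male, Yes) should equal Pr(Yes | Male) * Pr(Male) and also
# Pr(Male | Yes) * Pr(Yes).
p_yes_given_male <- 0.4451877   # from the row-conditioned table
p_male           <- 0.5945647   # marginal of M.F
p_male_given_yes <- 0.6826211   # from the column-conditioned table
p_yes            <- 0.3877596   # marginal of Admit

p_yes_given_male * p_male   # 0.2646929, the joint from the first table
p_male_given_yes * p_yes    # the same joint, built the other way
```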

Now the Substance

The fill aesthetic is great for displaying these things. For example, are males and females equally likely to be admitted to Berkeley?

Plaintiffs say no.

ggplot(UCBAdmit) + aes(x=M.F, fill=Admit) + geom_bar() + scale_fill_viridis_d()

Is that an Adequate Comparison?

The University says no. Why? The most important factor in the probability of admission is likely to be the department. This has a huge impact on what we see.

ggplot(UCBAdmit) + 
  aes(x=M.F, fill=Admit) + 
  geom_bar(position="fill") + 
  scale_fill_viridis_d() + 
  facet_wrap(vars(Dept))

The Magic of Bayes Rule

To find the joint probability [the intersection] of x and y, we can use either of the aforementioned methods. To turn this into a conditional probability, we simply take it as a proportion of the relevant margin.

Pr(x | y) = \frac{Pr(y | x) Pr(x)}{Pr(y)}

Applications of these Topics

Churn

Prior customers and current customers; engagement metrics, etc. The rows are prior state: customer/user and not. The columns are current state: customer/user and not.

The churn rate is the rate at which prior customers become current non-customers: a conditional probability.
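A minimal sketch of the churn table in base R. The counts below are hypothetical, purely for illustration:

```r
# Hypothetical counts by prior state (rows) and current state (columns);
# a real analysis would use a firm's own data.
counts <- matrix(c(800, 200,    # prior customers: 800 stayed, 200 left
                   150, 850),   # prior non-customers: 150 converted
                 nrow = 2, byrow = TRUE,
                 dimnames = list(prior   = c("customer", "non-customer"),
                                 current = c("customer", "non-customer")))

# Condition on the rows (prior state), as in prop.table(..., 1) above.
prop.table(counts, 1)

# The churn rate is Pr(current = non-customer | prior = customer).
churn_rate <- prop.table(counts, 1)["customer", "non-customer"]
churn_rate   # 0.2
```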

Churn by Stripe

General Markov Processes/Matrices

Markov Chains
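A Markov chain is just repeated multiplication by a transition matrix whose rows are conditional probabilities. A base-R sketch with a hypothetical two-state matrix:

```r
# Hypothetical transition matrix: rows are the current state, columns the
# next state; each row sums to one (a conditional probability).
P <- matrix(c(0.80, 0.20,
              0.15, 0.85),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("customer", "non-customer"),
                            c("customer", "non-customer")))

# Start with everyone a customer and iterate: the state distribution
# after k steps is the start vector times P^k.
state <- c(customer = 1, "non-customer" = 0)
for (k in 1:50) state <- state %*% P
state   # approaches the stationary distribution (3/7, 4/7)
```

The long-run distribution no longer depends on the starting state, which is the sense in which the chain "forgets" its history.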

A Bit on Juries

  • Start from Section 3.2.7
  • The juror’s decision tree

Tree

Three nodes: guilty or not at the first; convict or not at each of the two branches that follow.
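The tree's arithmetic can be sketched in base R. The probabilities below are illustrative placeholders, not the ones used in OIS Section 3.2.7:

```r
# Hypothetical juror tree: branch on guilt, then on the verdict.
p_guilty              <- 0.7    # prior probability the defendant is guilty
p_convict_if_guilty   <- 0.9
p_convict_if_innocent <- 0.1

# Total probability of conviction: sum over the guilt branches.
p_convict <- p_convict_if_guilty * p_guilty +
             p_convict_if_innocent * (1 - p_guilty)
p_convict   # 0.66

# Bayes rule: probability of guilt given a conviction.
p_guilty_if_convict <- p_convict_if_guilty * p_guilty / p_convict
p_guilty_if_convict
```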

The Magic of Bayes Rule

To find the joint probability [the intersection] of x and y, we can use either of the aforementioned methods. To turn this into a conditional probability, we simply take it as a proportion of the relevant margin.

Pr(x | y) = \frac{Pr(y | x) Pr(x)}{Pr(y)}

By itself, this is algebra. It is magic in an application.

Pr(User | +) = \frac{Pr(+ | User) Pr(User)}{Pr(+)}

This poses the question: what does a positive test mean?

Working an Example

Suppose a test is 99% accurate for Users and 95% accurate for non-Users. Moreover, suppose that Users make up 10% of the population. So given some population to which this applies, we have:

Pr(User, +) = Pr(+ | User)*Pr(User)

Pr(User, -) = Pr(- | User)*Pr(User)

Pr(\overline{User}, +) = Pr(+ | \overline{User})*Pr(\overline{User})

and

Pr(\overline{User}, -) = Pr(- | \overline{User})*Pr(\overline{User})

The Table

 Status     Positive   Negative   Total
 User          0.099      0.001     0.1
 non-User      0.045      0.855     0.9
 Total         0.144      0.856     1.0

Pr(User | +) = \frac{Pr(+ | User) Pr(User)}{Pr(+)}

yields:

Pr(User | +) = \frac{0.99 \times 0.1}{0.99 \times 0.1 + 0.05 \times 0.9} = \frac{0.099}{0.144} = 0.6875
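The whole example can be reproduced in base R from the three numbers given:

```r
# Drug-test example: 99% accuracy for Users, 95% for non-Users,
# and a 10% base rate of Users.
p_user           <- 0.10
p_pos_if_user    <- 0.99
p_neg_if_nonuser <- 0.95

# The four joint probabilities from the tree.
joint <- matrix(c(p_pos_if_user * p_user,
                  (1 - p_pos_if_user) * p_user,
                  (1 - p_neg_if_nonuser) * (1 - p_user),
                  p_neg_if_nonuser * (1 - p_user)),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("User", "non-User"),
                                c("Positive", "Negative")))
joint                               # matches the table above
p_pos <- sum(joint[, "Positive"])   # 0.144, the margin
joint["User", "Positive"] / p_pos   # 0.6875
```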

Some Language

Sensitivity refers to the ability of a test to designate an individual with a disease as positive. Specificity refers to the ability of a test to designate an individual without a disease as negative.

False positives are then the complement/opposite of specificity and false negatives are the complement/opposite of sensitivity.

 Truth      Positive Test    Negative Test
 Positive   Sensitivity      False Negative
 Negative   False Positive   Specificity
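With a table of counts, all four quantities are row-conditional probabilities. The confusion matrix below is hypothetical:

```r
# Hypothetical test results: 1000 truly positive and 1000 truly negative
# cases, cross-classified by test outcome.
confusion <- matrix(c(990,  10,    # truth positive: 990 test +, 10 test -
                       50, 950),   # truth negative: 50 test +, 950 test -
                    nrow = 2, byrow = TRUE,
                    dimnames = list(truth = c("Positive", "Negative"),
                                    test  = c("Positive", "Negative")))

sensitivity <- confusion["Positive", "Positive"] / sum(confusion["Positive", ])
specificity <- confusion["Negative", "Negative"] / sum(confusion["Negative", ])
c(sensitivity    = sensitivity,       # 0.99
  false_negative = 1 - sensitivity,   # 0.01
  specificity    = specificity,       # 0.95
  false_positive = 1 - specificity)   # 0.05
```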

Applied to Hypothesis Testing

When we get to hypothesis testing in inference, this comes up again with null and alternative hypotheses and the related decision.

 Truth         Reject Null     Accept Null
 Alternative   Correct         Type II error
 Null          Type I error    Correct

A Core Idea: Independence

What does it mean to say one thing is independent of another? The simplest way to think about it is: "Do I learn anything more about x by knowing y?" If the two are independent, knowing y tells me nothing about x; formally, Pr(x | y) = Pr(x), so I don't need to care about y if x is my objective.
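The Berkeley table gives a quick numerical check: under independence, each joint probability would equal the product of its marginals (values copied from the tables above):

```r
# Independence check with the Berkeley figures shown earlier.
p_female_yes <- 0.1230667   # joint Pr(Female, Yes) from the first table
p_female     <- 0.4054353   # marginal of M.F
p_yes        <- 0.3877596   # marginal of Admit

p_female * p_yes   # about 0.157 -- not 0.123, so sex and admission
                   # are not independent in these data
```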

Random Variables

I do not love the book definition of this. Technically, it is a variable whose values are generated according to some random process; your book implies that these are limited to quantities. It is really a measurable function defined on a probability space that maps from the sample space [the set of possible outcomes] to the real numbers.
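The function view can be made concrete: take the sample space of two coin flips and let X count the heads (a small base-R sketch):

```r
# A random variable is a function from the sample space to the reals.
omega <- c("HH", "HT", "TH", "TT")   # equally likely outcomes
X <- function(outcome) nchar(gsub("T", "", outcome))   # count the H's

sapply(omega, X)   # HH -> 2, HT -> 1, TH -> 1, TT -> 0

# Its distribution: Pr(X = 1) = 2/4.
prop.table(table(sapply(omega, X)))
```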

Bayesian Reasoning is Core to Decisions with Data

Pr(Decision | data) = \frac{Pr(data | Decision) Pr(Decision)}{Pr(data)}