This dataset represents Admissions decisions by two traits: M.F. captures Male/Female election and Dept captures the department, labeled A through F. Please provide exact binomial confidence intervals, 95%, for the probability of Admission for Males, for Females, and then for each of Males and Females by department. Next, reanalyse the data using the normal approximation and provide differences overall and for each department. Collect your responses and code into a quarto markdown document with python code chunks that I can download as text.
Overview
This document analyses the UC Berkeley admissions dataset. Each row represents one applicant with columns M.F (Male/Female), Admit (Yes/No), and Dept (A–F). We compute:
Exact (Clopper–Pearson) 95% binomial confidence intervals for the probability of admission for Males overall, Females overall, and each gender × department cell.
Normal-approximation (Wald) 95% CIs for the same quantities, and the Male − Female difference in admission probability, overall and by department.
Setup
How’s that done?
import pandas as pdimport numpy as npfrom scipy.stats import norm, beta as beta_dist# ── load data ────────────────────────────────────────────────────────────────df = pd.read_csv("./data/UCB-Admit.csv")df.columns = df.columns.str.strip() # remove BOM / whitespacedf["admitted"] = (df["Admit"] =="Yes").astype(int)print(df.shape)
(4526, 4)
How’s that done?
print(df.head())
M.F Admit Dept admitted
0 Male Yes A 1
1 Male Yes A 1
2 Male Yes A 1
3 Male Yes A 1
4 Male Yes A 1
How’s that done?
print(df["M.F"].value_counts())
M.F
Male 2691
Female 1835
Name: count, dtype: int64
How’s that done?
print(df["Dept"].value_counts().sort_index())
Dept
A 933
B 585
C 918
D 792
E 584
F 714
Name: count, dtype: int64
Part 1 — Exact (Clopper–Pearson) Binomial CIs
The exact 95% CI for a binomial proportion with \(k\) successes out of \(n\) trials is
where \(B(p; a, b)\) is the \(p\)-th quantile of a Beta distribution. scipy.stats.binom.proportion_confint (method 'exact') implements this directly.
How’s that done?
from scipy.stats import beta as beta_distdef exact_ci(k, n, alpha=0.05):"""Clopper-Pearson exact 95% CI via the Beta distribution.""" lo = beta_dist.ppf(alpha /2, k, n - k +1) if k >0else0.0 hi = beta_dist.ppf(1- alpha /2, k +1, n - k) if k < n else1.0return k / n, lo, hidef summarise(sub): n =len(sub) k = sub["admitted"].sum() p, lo, hi = exact_ci(k, n)return pd.Series({"n": n, "admits": k, "p_hat": p,"CI_low": lo, "CI_high": hi})
1a. Overall by gender
How’s that done?
overall_exact = ( df.groupby("M.F") .apply(summarise) .reset_index())overall_exact.columns = ["Gender", "n", "Admits", "p_hat", "CI_low", "CI_high"]# Format for displaydisp = overall_exact.copy()for c in ["p_hat", "CI_low", "CI_high"]: disp[c] = disp[c].map("{:.4f}".format)print(disp.to_string(index=False))
Gender n Admits p_hat CI_low CI_high
Female 1835.0 557.0 0.3035 0.2826 0.3252
Male 2691.0 1198.0 0.4452 0.4263 0.4642
Gender Dept n Admits p_hat CI_low CI_high
Female A 108.0 89.0 0.8241 0.7390 0.8906
Female B 25.0 17.0 0.6800 0.4650 0.8505
Female C 593.0 202.0 0.3406 0.3025 0.3803
Female D 375.0 131.0 0.3493 0.3011 0.4000
Female E 393.0 94.0 0.2392 0.1978 0.2845
Female F 341.0 24.0 0.0704 0.0456 0.1029
Male A 825.0 512.0 0.6206 0.5865 0.6538
Male B 560.0 353.0 0.6304 0.5889 0.6705
Male C 325.0 120.0 0.3692 0.3166 0.4242
Male D 417.0 138.0 0.3309 0.2859 0.3784
Male E 191.0 53.0 0.2775 0.2153 0.3467
Male F 373.0 22.0 0.0590 0.0373 0.0879
Part 2 — Normal (Wald) Approximation CIs and Differences
Gender Dept n Admits p_hat Wald_low Wald_high
Female A 108.0 89.0 0.8241 0.7523 0.8959
Female B 25.0 17.0 0.6800 0.4971 0.8629
Female C 593.0 202.0 0.3406 0.3025 0.3788
Female D 375.0 131.0 0.3493 0.3011 0.3976
Female E 393.0 94.0 0.2392 0.1970 0.2814
Female F 341.0 24.0 0.0704 0.0432 0.0975
Male A 825.0 512.0 0.6206 0.5875 0.6537
Male B 560.0 353.0 0.6304 0.5904 0.6703
Male C 325.0 120.0 0.3692 0.3168 0.4217
Male D 417.0 138.0 0.3309 0.2858 0.3761
Male E 191.0 53.0 0.2775 0.2140 0.3410
Male F 373.0 22.0 0.0590 0.0351 0.0829
2c. Male − Female differences (overall and by department)
# Department-level differencesdept_diff = make_diff_table(dept_wald)disp6 = dept_diff.copy()for c in ["p_Male", "p_Female", "Diff (M-F)", "Diff_CI_low", "Diff_CI_high"]: disp6[c] = disp6[c].map("{:.4f}".format)print("\n=== Department-level Male − Female differences ===")
=== Department-level Male − Female differences ===
How’s that done?
print(disp6.to_string(index=False))
Dept p_Male p_Female Diff (M-F) Diff_CI_low Diff_CI_high
A 0.6206 0.8241 -0.2035 -0.2825 -0.1244
B 0.6304 0.6800 -0.0496 -0.2368 0.1375
C 0.3692 0.3406 0.0286 -0.0363 0.0935
D 0.3309 0.3493 -0.0184 -0.0845 0.0477
E 0.2775 0.2392 0.0383 -0.0379 0.1145
F 0.0590 0.0704 -0.0114 -0.0476 0.0248
Part 3 — Side-by-side comparison: Exact vs. Wald
How’s that done?
# Merge exact and wald for overallcmp_overall = overall_exact.merge( overall_wald[["Gender", "Wald_low", "Wald_high"]], on="Gender")print("=== Overall: Exact vs Wald CI bounds ===")
=== Overall: Exact vs Wald CI bounds ===
How’s that done?
disp7 = cmp_overall.copy()for c in ["p_hat", "CI_low", "CI_high", "Wald_low", "Wald_high"]: disp7[c] = disp7[c].map("{:.4f}".format)print(disp7.to_string(index=False))
# By departmentcmp_dept = dept_exact.merge( dept_wald[["Gender", "Dept", "Wald_low", "Wald_high"]], on=["Gender", "Dept"])print("\n=== By Department: Exact vs Wald CI bounds ===")
=== By Department: Exact vs Wald CI bounds ===
How’s that done?
disp8 = cmp_dept.copy()for c in ["p_hat", "CI_low", "CI_high", "Wald_low", "Wald_high"]: disp8[c] = disp8[c].map("{:.4f}".format)print(disp8.to_string(index=False))
Gender Dept n Admits p_hat CI_low CI_high Wald_low Wald_high
Female A 108.0 89.0 0.8241 0.7390 0.8906 0.7523 0.8959
Female B 25.0 17.0 0.6800 0.4650 0.8505 0.4971 0.8629
Female C 593.0 202.0 0.3406 0.3025 0.3803 0.3025 0.3788
Female D 375.0 131.0 0.3493 0.3011 0.4000 0.3011 0.3976
Female E 393.0 94.0 0.2392 0.1978 0.2845 0.1970 0.2814
Female F 341.0 24.0 0.0704 0.0456 0.1029 0.0432 0.0975
Male A 825.0 512.0 0.6206 0.5865 0.6538 0.5875 0.6537
Male B 560.0 353.0 0.6304 0.5889 0.6705 0.5904 0.6703
Male C 325.0 120.0 0.3692 0.3166 0.4242 0.3168 0.4217
Male D 417.0 138.0 0.3309 0.2859 0.3784 0.2858 0.3761
Male E 191.0 53.0 0.2775 0.2153 0.3467 0.2140 0.3410
Male F 373.0 22.0 0.0590 0.0373 0.0879 0.0351 0.0829
Discussion
Simpson’s Paradox
The aggregate data (Part 1a / 2a) suggest Males are admitted at a higher rate than Females overall. However, when we examine each department separately (Parts 1b / 2b / 2c), the Male − Female difference is small and often negative (females admitted at equal or higher rates within most departments).
This reversal is a classic demonstration of Simpson’s Paradox: females disproportionately applied to departments with low overall admission rates (C–F), dragging down the female aggregate even though they fared similarly or better within each department.
Exact vs. Normal approximation
At the large sample sizes present here the Wald and Clopper–Pearson intervals agree closely overall. Some department-level cells (especially small female counts in Dept A) show modest discrepancies, which is expected: the Wald approximation degrades when \(n\) is small or \(\hat p\) is close to 0 or 1. The exact interval is preferred in those situations.