Gemini Binomial Analysis

Author

Robert W. Walker

Published

February 24, 2026

Prompt
  1. This dataset represents Admissions decisions by two traits: M.F. captures Male/Female election and Dept captures the department, labeled A through F. Please provide exact binomial confidence intervals, 95%, for the probability of Admission for Males, for Females, and then for each of Males and Females by department

  2. This is great, now please reanalyse the data using the normal approximation and provide differences overall and for each department.

  3. Could you collect your responses and code into a quarto markdown document with python code chunks?

  4. Can you create the text file, the chunks are hampered by your rendering?

Introduction

This document analyzes a dataset representing admissions decisions. The dataset includes gender (M.F), department (Dept, labeled A through F), and the admission decision (Admit).

Below, we compute 95% confidence intervals for the probability of admission for Males and Females, both overall and broken down by department, using two methods: exact binomial (Clopper-Pearson) intervals and the normal approximation (Wald) intervals.


Part 1: Exact Binomial Confidence Intervals

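First, we calculate exact binomial (Clopper-Pearson) confidence intervals for the probability of Admission. In statsmodels this is the method='beta' option of proportion_confint, which obtains the interval limits from quantiles of the Beta distribution.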

import pandas as pd
from statsmodels.stats.proportion import proportion_confint

# Load the dataset
df = pd.read_csv('./data/UCB-Admit.csv')

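def get_exact_ci(successes, trials):
    # With no trials there is nothing to estimate; return NaN (not None)
    # so the percentage formatting used below does not raise a TypeError.
    if trials == 0:
        return (float('nan'), float('nan'))
    # method='beta' is the Clopper-Pearson (exact) interval
    return proportion_confint(successes, trials, alpha=0.05, method='beta')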

results = []

# Overall Males
males = df[df['M.F'] == 'Male']
k_m = (males['Admit'] == 'Yes').sum()
n_m = len(males)
ci_m = get_exact_ci(k_m, n_m)
results.append({
    'Group': 'Males (Overall)', 'Successes': k_m, 'Trials': n_m, 
    'Prob': k_m/n_m if n_m else float('nan'), 'Lower 95% CI': ci_m[0], 'Upper 95% CI': ci_m[1]
})

# Overall Females
females = df[df['M.F'] == 'Female']
k_f = (females['Admit'] == 'Yes').sum()
n_f = len(females)
ci_f = get_exact_ci(k_f, n_f)
results.append({
    'Group': 'Females (Overall)', 'Successes': k_f, 'Trials': n_f, 
    'Prob': k_f/n_f if n_f else float('nan'), 'Lower 95% CI': ci_f[0], 'Upper 95% CI': ci_f[1]
})

# By Department
for dept in sorted(df['Dept'].unique()):
    for gender in ['Male', 'Female']:
        subset = df[(df['M.F'] == gender) & (df['Dept'] == dept)]
        k = (subset['Admit'] == 'Yes').sum()
        n = len(subset)
        ci = get_exact_ci(k, n)
        prob = k/n if n else float('nan')
        results.append({
            'Group': f'{gender}s in Dept {dept}', 'Successes': k, 'Trials': n, 
            'Prob': prob, 'Lower 95% CI': ci[0], 'Upper 95% CI': ci[1]
        })

# Display Results
res_exact_df = pd.DataFrame(results)
res_exact_df.style.format({
    'Prob': '{:.2%}', 
    'Lower 95% CI': '{:.2%}', 
    'Upper 95% CI': '{:.2%}'
})
Exact Binomial Confidence Intervals
|    | Group             | Successes | Trials | Prob   | Lower 95% CI | Upper 95% CI |
|----|-------------------|-----------|--------|--------|--------------|--------------|
| 0  | Males (Overall)   | 1198      | 2691   | 44.52% | 42.63%       | 46.42%       |
| 1  | Females (Overall) | 557       | 1835   | 30.35% | 28.26%       | 32.52%       |
| 2  | Males in Dept A   | 512       | 825    | 62.06% | 58.65%       | 65.38%       |
| 3  | Females in Dept A | 89        | 108    | 82.41% | 73.90%       | 89.06%       |
| 4  | Males in Dept B   | 353       | 560    | 63.04% | 58.89%       | 67.05%       |
| 5  | Females in Dept B | 17        | 25     | 68.00% | 46.50%       | 85.05%       |
| 6  | Males in Dept C   | 120       | 325    | 36.92% | 31.66%       | 42.42%       |
| 7  | Females in Dept C | 202       | 593    | 34.06% | 30.25%       | 38.03%       |
| 8  | Males in Dept D   | 138       | 417    | 33.09% | 28.59%       | 37.84%       |
| 9  | Females in Dept D | 131       | 375    | 34.93% | 30.11%       | 40.00%       |
| 10 | Males in Dept E   | 53        | 191    | 27.75% | 21.53%       | 34.67%       |
| 11 | Females in Dept E | 94        | 393    | 23.92% | 19.78%       | 28.45%       |
| 12 | Males in Dept F   | 22        | 373    | 5.90%  | 3.73%        | 8.79%        |
| 13 | Females in Dept F | 24        | 341    | 7.04%  | 4.56%        | 10.29%       |

Interpretation of Exact Intervals

Overall, Males had an admission rate of 44.52% (95% CI: 42.63% to 46.42%), while Females had an admission rate of 30.35% (95% CI: 28.26% to 32.52%). The two intervals do not overlap.
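
The exact limits can also be obtained without a Beta quantile function, by inverting the binomial tail probabilities directly: the lower limit is the p at which P(X >= k) equals 2.5%, and the upper limit is the p at which P(X <= k) equals 2.5%. The following is a minimal sketch of that idea, not the code used to produce the table. The helper names (binom_log_pmf, upper_tail, lower_tail, bisect_root, clopper_pearson) are illustrative, only the Python standard library is used, and the counts are copied from the Females in Dept B row.

```python
import math

def binom_log_pmf(i, n, p):
    """Log of the Binomial(n, p) probability at i (log space avoids underflow)."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.exp(binom_log_pmf(i, n, p)) for i in range(k, n + 1))

def lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.exp(binom_log_pmf(i, n, p)) for i in range(0, k + 1))

def bisect_root(f, lo=1e-12, hi=1 - 1e-12, iters=60):
    """Root of a function that is negative at lo and positive at hi."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Two-sided exact interval by inverting the two binomial tails."""
    # P(X >= k) increases with p, so the lower limit is where it reaches alpha/2.
    lower = 0.0 if k == 0 else bisect_root(lambda p: upper_tail(k, n, p) - alpha / 2)
    # P(X <= k) decreases with p, so the sign is flipped for the upper limit.
    upper = 1.0 if k == n else bisect_root(lambda p: alpha / 2 - lower_tail(k, n, p))
    return lower, upper

# Females in Dept B: 17 admitted out of 25 applicants (from the table above).
cp_lower, cp_upper = clopper_pearson(17, 25)
print(f"Exact 95% CI: ({cp_lower:.2%}, {cp_upper:.2%})")
```

Under these assumptions the printed interval should agree, up to rounding, with the Females in Dept B row of the table.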


Part 2: Normal Approximation (Wald Method) and Differences

Next, we re-analyze the data using the normal approximation (Wald method) for 95% binomial confidence intervals. Because a Wald interval is built from a point estimate and its standard error, it extends directly to the difference in the probability of admission between two independent groups (Male minus Female).
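
For reference, the two intervals can be written out as follows. The notation is introduced here for illustration and does not appear in the original analysis: k is the number admitted, n the number of applicants, and the subscripts M and F denote Males and Females.

$$
\hat{p} = \frac{k}{n}, \qquad \hat{p} \pm z_{0.975}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
$$

$$
(\hat{p}_M - \hat{p}_F) \pm z_{0.975}\sqrt{\frac{\hat{p}_M(1-\hat{p}_M)}{n_M} + \frac{\hat{p}_F(1-\hat{p}_F)}{n_F}}
$$

Here z_{0.975} is the 97.5th percentile of the standard normal distribution, approximately 1.96.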

import numpy as np
from scipy.stats import norm

z = norm.ppf(0.975) # 1.96 for 95% CI

def wald_ci(k, n):
    if n == 0: return np.nan, np.nan, np.nan
    p = k / n
    se = np.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

def wald_diff_ci(k1, n1, k2, n2):
    if n1 == 0 or n2 == 0: return np.nan, np.nan, np.nan
    p1 = k1 / n1
    p2 = k2 / n2
    diff = p1 - p2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Overall Stats Calculation
pm, pm_low, pm_high = wald_ci(k_m, n_m)
pf, pf_low, pf_high = wald_ci(k_f, n_f)
diff_o, diff_o_low, diff_o_high = wald_diff_ci(k_m, n_m, k_f, n_f)

print("--- Overall Admission Differences ---")
print(f"Overall Males: p={pm:.4f}, CI=({pm_low:.4f}, {pm_high:.4f})")
print(f"Overall Females: p={pf:.4f}, CI=({pf_low:.4f}, {pf_high:.4f})")
print(f"Overall Diff (M-F): diff={diff_o:.4f}, CI=({diff_o_low:.4f}, {diff_o_high:.4f})\n")
--- Overall Admission Differences ---
Overall Males: p=0.4452, CI=(0.4264, 0.4640)
Overall Females: p=0.3035, CI=(0.2825, 0.3246)
Overall Diff (M-F): diff=0.1416, CI=(0.1134, 0.1698)

# Department Level Calculation
dept_results = []
for dept in sorted(df['Dept'].unique()):
    m_sub = df[(df['M.F'] == 'Male') & (df['Dept'] == dept)]
    f_sub = df[(df['M.F'] == 'Female') & (df['Dept'] == dept)]
    
    k_m_dept = (m_sub['Admit'] == 'Yes').sum()
    n_m_dept = len(m_sub)
    k_f_dept = (f_sub['Admit'] == 'Yes').sum()
    n_f_dept = len(f_sub)
    
    p_m, pm_l, pm_h = wald_ci(k_m_dept, n_m_dept)
    p_f, pf_l, pf_h = wald_ci(k_f_dept, n_f_dept)
    diff, diff_l, diff_h = wald_diff_ci(k_m_dept, n_m_dept, k_f_dept, n_f_dept)
    
    dept_results.append({
        'Dept': dept,
        'Male_p': p_m, 'Male_CI_L': pm_l, 'Male_CI_U': pm_h,
        'Female_p': p_f, 'Female_CI_L': pf_l, 'Female_CI_U': pf_h,
        'Diff': diff, 'Diff_CI_L': diff_l, 'Diff_CI_U': diff_h
    })

res_diff_df = pd.DataFrame(dept_results)
res_diff_df.style.format({
    'Male_p': '{:.2%}', 'Male_CI_L': '{:.2%}', 'Male_CI_U': '{:.2%}',
    'Female_p': '{:.2%}', 'Female_CI_L': '{:.2%}', 'Female_CI_U': '{:.2%}',
    'Diff': '{:.2%}', 'Diff_CI_L': '{:.2%}', 'Diff_CI_U': '{:.2%}'
})
Normal Approximation Confidence Intervals and Differences
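|   | Dept | Male_p | Male_CI_L | Male_CI_U | Female_p | Female_CI_L | Female_CI_U | Diff    | Diff_CI_L | Diff_CI_U |
|---|------|--------|-----------|-----------|----------|-------------|-------------|---------|-----------|-----------|
| 0 | A    | 62.06% | 58.75%    | 65.37%    | 82.41%   | 75.23%      | 89.59%      | -20.35% | -28.25%   | -12.44%   |
| 1 | B    | 63.04% | 59.04%    | 67.03%    | 68.00%   | 49.71%      | 86.29%      | -4.96%  | -23.68%   | 13.75%    |
| 2 | C    | 36.92% | 31.68%    | 42.17%    | 34.06%   | 30.25%      | 37.88%      | 2.86%   | -3.63%    | 9.35%     |
| 3 | D    | 33.09% | 28.58%    | 37.61%    | 34.93%   | 30.11%      | 39.76%      | -1.84%  | -8.45%    | 4.77%     |
| 4 | E    | 27.75% | 21.40%    | 34.10%    | 23.92%   | 19.70%      | 28.14%      | 3.83%   | -3.79%    | 11.45%    |
| 5 | F    | 5.90%  | 3.51%     | 8.29%     | 7.04%    | 4.32%       | 9.75%       | -1.14%  | -4.76%    | 2.48%     |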
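
As a worked check, a single row of this table can be reproduced by hand. The sketch below recomputes the Dept A difference from the raw counts using only the Python standard library. The variable names are illustrative, and the counts are copied from the Dept A rows of the exact-interval table.

```python
from math import sqrt
from statistics import NormalDist

# Dept A counts (admitted, applicants), copied from the exact-interval table.
k_male, n_male = 512, 825
k_female, n_female = 89, 108

z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96

p_male = k_male / n_male
p_female = k_female / n_female

# Male-minus-Female difference and its Wald standard error
# (the variances of the two independent proportions add).
diff_a = p_male - p_female
se_a = sqrt(p_male * (1 - p_male) / n_male + p_female * (1 - p_female) / n_female)

lower_a = diff_a - z_crit * se_a
upper_a = diff_a + z_crit * se_a
print(f"Dept A diff (M-F): {diff_a:.2%}, 95% CI: ({lower_a:.2%}, {upper_a:.2%})")
```

Most of the standard error comes from the Female term, because only 108 women applied to Dept A compared with 825 men.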

Analysis of the Differences & Simpson’s Paradox

  1. Overall: Males had a higher probability of being admitted compared to Females. Based on the normal approximation, the overall difference in probability is statistically significant because the 95% CI for the difference (+11.34% to +16.98%) does not cross zero.
  2. By Department (Simpson’s Paradox): When we re-examine the data by looking at each department separately, the trend reverses or disappears:
    • Department A: Females were admitted at a significantly higher rate than Males (the 95% CI for the Male-Female difference is entirely negative: -28.25% to -12.44%).
    • Departments B, C, D, E, F: The 95% confidence intervals for the differences all include zero, so none of these departments shows a statistically significant difference between Male and Female admission probabilities at the 5% level.

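This is a classic example of Simpson’s Paradox: a trend that appears in the combined data disappears or reverses once the data are split into groups. Here the overall gap largely reflects where applicants applied. About half of Male applications (1,385 of 2,691) went to Departments A and B, which admitted more than 60% of applicants of either gender, while over 90% of Female applications (1,702 of 1,835) went to Departments C through F, where admission rates were below 40%.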
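
One way to see the mechanism is to write each overall admission rate as a weighted average of the department rates, with weights equal to the share of that gender's applications sent to each department. The sketch below does this with the counts from the exact-interval table. The names (counts, HIGH_RATE_DEPTS, summarize) are illustrative and are not part of the original analysis.

```python
# (admitted, applicants) per department and gender, copied from the table above.
counts = {
    "A": {"Male": (512, 825), "Female": (89, 108)},
    "B": {"Male": (353, 560), "Female": (17, 25)},
    "C": {"Male": (120, 325), "Female": (202, 593)},
    "D": {"Male": (138, 417), "Female": (131, 375)},
    "E": {"Male": (53, 191), "Female": (94, 393)},
    "F": {"Male": (22, 373), "Female": (24, 341)},
}
HIGH_RATE_DEPTS = ("A", "B")  # departments admitting more than 60% of applicants

def summarize(gender):
    """Return (pooled rate, weighted average of dept rates, share applying to A/B)."""
    admitted = sum(counts[d][gender][0] for d in counts)
    applied = sum(counts[d][gender][1] for d in counts)
    # Weight each department's rate by the share of applications it received.
    weighted = sum(
        (counts[d][gender][1] / applied) * (counts[d][gender][0] / counts[d][gender][1])
        for d in counts
    )
    share_high = sum(counts[d][gender][1] for d in HIGH_RATE_DEPTS) / applied
    return admitted / applied, weighted, share_high

summary = {g: summarize(g) for g in ("Male", "Female")}
for gender, (overall, weighted, share_high) in summary.items():
    print(f"{gender}: overall={overall:.2%}, weighted avg={weighted:.2%}, "
          f"share applying to A/B={share_high:.2%}")
```

The pooled rate and the weighted average coincide by construction, so the overall gap is driven by the weights as well as by the department-level rates.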