Gemini Binomial Analysis

Author

Robert W. Walker

Published

February 24, 2026

Prompt

This dataset represents Admissions decisions by two traits: M.F. captures Male/Female election and Dept captures the department, labeled A through F. Please provide exact binomial confidence intervals, 95%, for the probability of Admission for Males, for Females, and then for each of Males and Females by department
This is great, now please reanalyse the data using the normal approximation and provide differences overall and for each department.
Could you collect your responses and code into a quarto markdown document with python code chunks?
Can you create the text file, the chunks are hampered by your rendering?

Introduction

This document analyzes a dataset representing admissions decisions. The dataset includes gender (M.F), department (Dept, labeled A through F), and the admission decision (Admit).

Below, we compute 95% confidence intervals for the probability of admission for Males and Females, both overall and broken down by department, using two different methods.

Part 1: Exact Binomial Confidence Intervals

First, we calculate exact binomial (Clopper-Pearson) confidence intervals for the probability of admission. The code uses proportion_confint from statsmodels with method='beta', which obtains the bounds from quantiles of the Beta distribution.

import pandas as pd
from statsmodels.stats.proportion import proportion_confint

# Load the dataset
df = pd.read_csv('./data/UCB-Admit.csv')

def get_exact_ci(successes, trials):
    if trials == 0:
        return (None, None)
    return proportion_confint(successes, trials, alpha=0.05, method='beta')

results = []

# Overall Males
males = df[df['M.F'] == 'Male']
k_m = (males['Admit'] == 'Yes').sum()
n_m = len(males)
ci_m = get_exact_ci(k_m, n_m)
results.append({
    'Group': 'Males (Overall)', 'Successes': k_m, 'Trials': n_m, 
    'Prob': k_m/n_m if n_m else None, 'Lower 95% CI': ci_m[0], 'Upper 95% CI': ci_m[1]
})

# Overall Females
females = df[df['M.F'] == 'Female']
k_f = (females['Admit'] == 'Yes').sum()
n_f = len(females)
ci_f = get_exact_ci(k_f, n_f)
results.append({
    'Group': 'Females (Overall)', 'Successes': k_f, 'Trials': n_f, 
    'Prob': k_f/n_f if n_f else None, 'Lower 95% CI': ci_f[0], 'Upper 95% CI': ci_f[1]
})

# By Department
for dept in sorted(df['Dept'].unique()):
    for gender in ['Male', 'Female']:
        subset = df[(df['M.F'] == gender) & (df['Dept'] == dept)]
        k = (subset['Admit'] == 'Yes').sum()
        n = len(subset)
        ci = get_exact_ci(k, n)
        prob = k/n if n else None
        results.append({
            'Group': f'{gender}s in Dept {dept}', 'Successes': k, 'Trials': n, 
            'Prob': prob, 'Lower 95% CI': ci[0], 'Upper 95% CI': ci[1]
        })

# Display Results
res_exact_df = pd.DataFrame(results)
res_exact_df.style.format({
    'Prob': '{:.2%}', 
    'Lower 95% CI': '{:.2%}', 
    'Upper 95% CI': '{:.2%}'
})

Exact Binomial Confidence Intervals
	Group	Successes	Trials	Prob	Lower 95% CI	Upper 95% CI
0	Males (Overall)	1198	2691	44.52%	42.63%	46.42%
1	Females (Overall)	557	1835	30.35%	28.26%	32.52%
2	Males in Dept A	512	825	62.06%	58.65%	65.38%
3	Females in Dept A	89	108	82.41%	73.90%	89.06%
4	Males in Dept B	353	560	63.04%	58.89%	67.05%
5	Females in Dept B	17	25	68.00%	46.50%	85.05%
6	Males in Dept C	120	325	36.92%	31.66%	42.42%
7	Females in Dept C	202	593	34.06%	30.25%	38.03%
8	Males in Dept D	138	417	33.09%	28.59%	37.84%
9	Females in Dept D	131	375	34.93%	30.11%	40.00%
10	Males in Dept E	53	191	27.75%	21.53%	34.67%
11	Females in Dept E	94	393	23.92%	19.78%	28.45%
12	Males in Dept F	22	373	5.90%	3.73%	8.79%
13	Females in Dept F	24	341	7.04%	4.56%	10.29%

Interpretation of Exact Intervals

Overall, Males had an admission rate of 44.52% (95% CI: 42.63% - 46.42%), while Females had an admission rate of 30.35% (95% CI: 28.26% - 32.52%).
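
As a cross-check on the table above, the Clopper-Pearson bounds can also be obtained without the Beta shortcut, by inverting the binomial tail probabilities directly: the lower bound is the p at which P(X >= k) equals alpha/2, and the upper bound is the p at which P(X <= k) equals alpha/2. The following is a minimal standard-library sketch under that definition. The helper names (binom_log_pmf, binom_tail, clopper_pearson) are hypothetical and are not part of the original analysis, which relies on proportion_confint.

```python
import math

def binom_log_pmf(i, n, p):
    """Log of the Binomial(n, p) probability mass at i, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))

def binom_tail(k, n, p, upper):
    """P(X >= k) if upper is True, else P(X <= k), for X ~ Binomial(n, p)."""
    idx = range(k, n + 1) if upper else range(0, k + 1)
    return math.fsum(math.exp(binom_log_pmf(i, n, p)) for i in idx)

def bisect_increasing(f, tol=1e-10):
    """Root of an increasing function f on (0, 1), found by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Two-sided exact interval for k successes in n trials (illustrative)."""
    # P(X >= k) rises with p, so the lower bound is where it reaches alpha/2.
    lower = 0.0 if k == 0 else bisect_increasing(
        lambda p: binom_tail(k, n, p, upper=True) - alpha / 2)
    # P(X <= k) falls with p, so negate it to get an increasing function.
    upper = 1.0 if k == n else bisect_increasing(
        lambda p: alpha / 2 - binom_tail(k, n, p, upper=False))
    return lower, upper

lo, hi = clopper_pearson(17, 25)  # Females in Dept B, counts from the table
print(f"Females in Dept B: {lo:.2%} to {hi:.2%}")
```

The printed bounds should agree with the Females in Dept B row of the table up to rounding, since the Beta quantiles are a closed form for the same tail equations.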

Part 2: Normal Approximation (Wald Method) and Differences

Next, we re-analyze the data using the normal approximation (Wald method) for 95% binomial confidence intervals. Because the variances of two independent sample proportions add, the same approach also gives a confidence interval for the difference in the probability of admission (Male minus Female).

import numpy as np
from scipy.stats import norm

z = norm.ppf(0.975) # 1.96 for 95% CI

def wald_ci(k, n):
    if n == 0: return np.nan, np.nan, np.nan
    p = k / n
    se = np.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

def wald_diff_ci(k1, n1, k2, n2):
    if n1 == 0 or n2 == 0: return np.nan, np.nan, np.nan
    p1 = k1 / n1
    p2 = k2 / n2
    diff = p1 - p2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Overall Stats Calculation
pm, pm_low, pm_high = wald_ci(k_m, n_m)
pf, pf_low, pf_high = wald_ci(k_f, n_f)
diff_o, diff_o_low, diff_o_high = wald_diff_ci(k_m, n_m, k_f, n_f)

print("--- Overall Admission Differences ---")

--- Overall Admission Differences ---

print(f"Overall Males: p={pm:.4f}, CI=({pm_low:.4f}, {pm_high:.4f})")

Overall Males: p=0.4452, CI=(0.4264, 0.4640)

print(f"Overall Females: p={pf:.4f}, CI=({pf_low:.4f}, {pf_high:.4f})")

Overall Females: p=0.3035, CI=(0.2825, 0.3246)

print(f"Overall Diff (M-F): diff={diff_o:.4f}, CI=({diff_o_low:.4f}, {diff_o_high:.4f})\n")

Overall Diff (M-F): diff=0.1416, CI=(0.1134, 0.1698)

# Department Level Calculation
dept_results = []
for dept in sorted(df['Dept'].unique()):
    m_sub = df[(df['M.F'] == 'Male') & (df['Dept'] == dept)]
    f_sub = df[(df['M.F'] == 'Female') & (df['Dept'] == dept)]
    
    k_m_dept = (m_sub['Admit'] == 'Yes').sum()
    n_m_dept = len(m_sub)
    k_f_dept = (f_sub['Admit'] == 'Yes').sum()
    n_f_dept = len(f_sub)
    
    p_m, pm_l, pm_h = wald_ci(k_m_dept, n_m_dept)
    p_f, pf_l, pf_h = wald_ci(k_f_dept, n_f_dept)
    diff, diff_l, diff_h = wald_diff_ci(k_m_dept, n_m_dept, k_f_dept, n_f_dept)
    
    dept_results.append({
        'Dept': dept,
        'Male_p': p_m, 'Male_CI_L': pm_l, 'Male_CI_U': pm_h,
        'Female_p': p_f, 'Female_CI_L': pf_l, 'Female_CI_U': pf_h,
        'Diff': diff, 'Diff_CI_L': diff_l, 'Diff_CI_U': diff_h
    })

res_diff_df = pd.DataFrame(dept_results)
res_diff_df.style.format({
    'Male_p': '{:.2%}', 'Male_CI_L': '{:.2%}', 'Male_CI_U': '{:.2%}',
    'Female_p': '{:.2%}', 'Female_CI_L': '{:.2%}', 'Female_CI_U': '{:.2%}',
    'Diff': '{:.2%}', 'Diff_CI_L': '{:.2%}', 'Diff_CI_U': '{:.2%}'
})

Normal Approximation Confidence Intervals and Differences
	Dept	Male_p	Male_CI_L	Male_CI_U	Female_p	Female_CI_L	Female_CI_U	Diff	Diff_CI_L	Diff_CI_U
0	A	62.06%	58.75%	65.37%	82.41%	75.23%	89.59%	-20.35%	-28.25%	-12.44%
1	B	63.04%	59.04%	67.03%	68.00%	49.71%	86.29%	-4.96%	-23.68%	13.75%
2	C	36.92%	31.68%	42.17%	34.06%	30.25%	37.88%	2.86%	-3.63%	9.35%
3	D	33.09%	28.58%	37.61%	34.93%	30.11%	39.76%	-1.84%	-8.45%	4.77%
4	E	27.75%	21.40%	34.10%	23.92%	19.70%	28.14%	3.83%	-3.79%	11.45%
5	F	5.90%	3.51%	8.29%	7.04%	4.32%	9.75%	-1.14%	-4.76%	2.48%
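
The two methods differ most where the sample is small. As a hand check, the Wald interval for Females in Dept B (17 admitted of 25) can be recomputed with the standard library alone and set beside the exact bounds reported in Part 1 (46.50% to 85.05%). This is only an illustrative sketch; wald_interval is a hypothetical helper that mirrors wald_ci above.

```python
import math
from statistics import NormalDist

def wald_interval(k, n, conf=0.95):
    """Normal-approximation interval for k successes in n trials (n > 0)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = k / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width

p, lo, hi = wald_interval(17, 25)
exact_lo, exact_hi = 0.4650, 0.8505  # copied from the Part 1 table

print(f"Wald : {lo:.2%} to {hi:.2%}")
print(f"Exact: {exact_lo:.2%} to {exact_hi:.2%}")
# The Wald interval is symmetric about p by construction; the exact one is not.
print(f"Wald arms : {p - lo:.4f} below, {hi - p:.4f} above")
print(f"Exact arms: {p - exact_lo:.4f} below, {exact_hi - p:.4f} above")
```

With only 25 applicants, the exact interval reaches further below the point estimate than above it, while the Wald interval extends the same distance in both directions.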

Analysis of the Differences & Simpson’s Paradox

Overall: Males had a higher probability of admission than Females. Under the normal approximation, the overall difference is statistically significant at the 5% level, because the 95% CI for the difference (+11.34% to +16.98%) excludes zero.
By Department (Simpson’s Paradox): When we re-examine the data by looking at each department separately, the trend reverses or disappears:
- Department A: Females were admitted at a significantly higher rate than Males (the 95% CI for the Male-Female difference is entirely negative: -28.25% to -12.44%).
- Departments B, C, D, E, F: The 95% confidence intervals for the differences all include zero, so none of these departments shows a statistically significant difference between Male and Female admission probabilities at the 5% level.
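
The zero-crossing criterion used in these comparisons is equivalent to a two-sided Wald z-test at the 5% level: the interval excludes zero exactly when the absolute value of diff / SE exceeds the critical value of about 1.96. A minimal sketch using counts from the Part 1 table follows. The helper wald_diff is hypothetical, and its unpooled standard error is assumed to match the one in wald_diff_ci.

```python
import math
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

def wald_diff(k1, n1, k2, n2):
    """Difference in proportions and its unpooled standard error."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2, se

# (male admitted, male applicants, female admitted, female applicants)
comparisons = {
    "Overall": (1198, 2691, 557, 1835),
    "Dept A": (512, 825, 89, 108),
    "Dept B": (353, 560, 17, 25),
}

z_results = {}
for label, counts in comparisons.items():
    diff, se = wald_diff(*counts)
    z_stat = diff / se
    excludes_zero = (diff - Z_CRIT * se > 0) or (diff + Z_CRIT * se < 0)
    z_results[label] = (z_stat, excludes_zero)
    print(f"{label:8s} z = {z_stat:6.2f}  CI excludes zero: {excludes_zero}")
```

Reading the sign of z alongside the interval also shows the direction of the difference: positive favors Males, negative favors Females.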

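This is a classic example of Simpson’s Paradox: a trend that appears in the combined data disappears or reverses once the data are split into groups. Here, the overall male advantage of about 14 percentage points is absent, or runs the other way, within individual departments.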
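
One way to see where the reversal comes from is to write each overall rate as a weighted average of the department rates, with weights equal to the share of that gender's applications going to each department. The sketch below uses the counts from the Part 1 table. The variable and function names are illustrative, and the decomposition is an added illustration rather than part of the original analysis.

```python
# (admitted, applicants) per department, copied from the Part 1 table
male = {"A": (512, 825), "B": (353, 560), "C": (120, 325),
        "D": (138, 417), "E": (53, 191), "F": (22, 373)}
female = {"A": (89, 108), "B": (17, 25), "C": (202, 593),
          "D": (131, 375), "E": (94, 393), "F": (24, 341)}

def decompose(groups):
    """Application shares, admission rates, and their weighted average."""
    total = sum(n for _, n in groups.values())
    shares = {d: n / total for d, (_, n) in groups.items()}
    rates = {d: k / n for d, (k, n) in groups.items()}
    overall = sum(shares[d] * rates[d] for d in groups)
    return shares, rates, overall

m_shares, m_rates, m_overall = decompose(male)
f_shares, f_rates, f_overall = decompose(female)

print(f"{'Dept':<5}{'M share':>9}{'F share':>9}{'M rate':>9}{'F rate':>9}")
for d in male:
    print(f"{d:<5}{m_shares[d]:>9.1%}{f_shares[d]:>9.1%}"
          f"{m_rates[d]:>9.1%}{f_rates[d]:>9.1%}")
print(f"Overall: males {m_overall:.2%}, females {f_overall:.2%}")
```

In these counts, just over half of male applications went to Departments A and B, which admitted most of their applicants, compared with about 7% of female applications. The aggregate gap can therefore arise from where people applied rather than from the rates within each department.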