Today, we examine inferential statistical techniques.

One key bit of data

UCB-Admissions.
CADDS.

Gemini Does the Work

Example

This example is inline on the page.

Claude Does the Work

Example

Slides

Link is here

The class plan:

Single sample binary inference.
Comparisons of binary variables.
Single sample quantitative inference and t.
Comparing quantitative variables and sampling.

Your post-class exercise:

The assignment: an inference example for your quantitative data.

For Next Class:

Reading and review
Deliverable: An inference example from your data [Assignment 15]. A collation of previous exercises collected into an example for a midterm.

Single sample inference for binary

Single sample inference for quantities

How’s that done?

import pandas as pd
import numpy as np

# Read data
df = pd.read_csv('data/CADDS.csv')

# Filter
filtered_df = df[df['Ethnicity'].isin(['Hispanic', 'White not Hispanic'])]

# Verify length
num_rows = len(filtered_df)
print(f"Number of remaining rows: {num_rows}")

Number of remaining rows: 777

How’s that done?

# Resampling method (Bootstrapping) for 90% CI of mean Expenditures
np.random.seed(42) # for reproducibility
expenditures = filtered_df['Expenditures'].values

# Bootstrap
n_iterations = 10000
bootstrap_means = np.empty(n_iterations)

for i in range(n_iterations):
    sample = np.random.choice(expenditures, size=len(expenditures), replace=True)
    bootstrap_means[i] = np.mean(sample)

ci_lower = np.percentile(bootstrap_means, 5)
ci_upper = np.percentile(bootstrap_means, 95)

print(f"90% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")

90% Confidence Interval: (16956.48, 19245.72)

How’s that done?

print(f"Mean: {np.mean(expenditures):.2f}")

Mean: 18100.86

Subsamples and difference inference for quantities

How’s that done?

import pandas as pd
import numpy as np

# Read and filter data
df = pd.read_csv('data/CADDS.csv')
filtered_df = df[df['Ethnicity'].isin(['Hispanic', 'White not Hispanic'])]

# Separate expenditures by ethnicity
hisp_exp = filtered_df[filtered_df['Ethnicity'] == 'Hispanic']['Expenditures'].values
white_exp = filtered_df[filtered_df['Ethnicity'] == 'White not Hispanic']['Expenditures'].values

np.random.seed(42) # Set seed for reproducibility
n_iterations = 10000

hisp_means = np.empty(n_iterations)
white_means = np.empty(n_iterations)
diff_means = np.empty(n_iterations)

# Bootstrapping
for i in range(n_iterations):
    samp_hisp = np.random.choice(hisp_exp, size=len(hisp_exp), replace=True)
    samp_white = np.random.choice(white_exp, size=len(white_exp), replace=True)
    
    m_hisp = np.mean(samp_hisp)
    m_white = np.mean(samp_white)
    
    hisp_means[i] = m_hisp
    white_means[i] = m_white
    diff_means[i] = m_white - m_hisp # White not Hispanic - Hispanic

# Calculate 90% Confidence Intervals (5th and 95th percentiles)
ci_hisp_lower, ci_hisp_upper = np.percentile(hisp_means, 5), np.percentile(hisp_means, 95)
ci_white_lower, ci_white_upper = np.percentile(white_means, 5), np.percentile(white_means, 95)
ci_diff_lower, ci_diff_upper = np.percentile(diff_means, 5), np.percentile(diff_means, 95)

# Calculate observed means
mean_hisp = np.mean(hisp_exp)
mean_white = np.mean(white_exp)
mean_diff = mean_white - mean_hisp

print(f"Hispanic Expenditures:")

Hispanic Expenditures:

How’s that done?

print(f"  Observed Mean: ${mean_hisp:.2f}")

  Observed Mean: $11065.57

How’s that done?

print(f"  90% CI: (${ci_hisp_lower:.2f}, ${ci_hisp_upper:.2f})\n")

  90% CI: ($9745.81, $12389.69)

How’s that done?

print(f"White not Hispanic Expenditures:")

White not Hispanic Expenditures:

How’s that done?

print(f"  Observed Mean: ${mean_white:.2f}")

  Observed Mean: $24697.55

How’s that done?

print(f"  90% CI: (${ci_white_lower:.2f}, ${ci_white_upper:.2f})\n")

  90% CI: ($23030.54, $26416.89)

How’s that done?

print(f"Difference in Means (White not Hispanic - Hispanic):")

Difference in Means (White not Hispanic - Hispanic):

How’s that done?

print(f"  Observed Mean Difference: ${mean_diff:.2f}")

  Observed Mean Difference: $13631.98

How’s that done?

print(f"  90% CI: (${ci_diff_lower:.2f}, ${ci_diff_upper:.2f})")

  90% CI: ($11459.98, $15798.14)

Groups

How’s that done?

import pandas as pd
import numpy as np

# Read and filter data
df = pd.read_csv('data/CADDS.csv')
filtered_df = df[df['Ethnicity'].isin(['Hispanic', 'White not Hispanic'])]

np.random.seed(42) # Set seed for reproducibility
n_iterations = 10000

# Get unique age cohorts
cohorts = sorted(filtered_df['Age.Cohort'].unique())

results = []

for cohort in cohorts:
    cohort_data = filtered_df[filtered_df['Age.Cohort'] == cohort]
    
    hisp_exp = cohort_data[cohort_data['Ethnicity'] == 'Hispanic']['Expenditures'].values
    white_exp = cohort_data[cohort_data['Ethnicity'] == 'White not Hispanic']['Expenditures'].values
    
    # Check if we have enough data to bootstrap
    if len(hisp_exp) == 0 or len(white_exp) == 0:
        results.append(f"### Age Cohort: {cohort}\nNot enough data for both ethnicities to compare.\n")
        continue

    hisp_means = np.empty(n_iterations)
    white_means = np.empty(n_iterations)
    diff_means = np.empty(n_iterations)

    # Bootstrapping
    for i in range(n_iterations):
        samp_hisp = np.random.choice(hisp_exp, size=len(hisp_exp), replace=True)
        samp_white = np.random.choice(white_exp, size=len(white_exp), replace=True)
        
        m_hisp = np.mean(samp_hisp)
        m_white = np.mean(samp_white)
        
        hisp_means[i] = m_hisp
        white_means[i] = m_white
        diff_means[i] = m_white - m_hisp # White not Hispanic - Hispanic

    # Calculate 90% Confidence Intervals
    ci_hisp = (np.percentile(hisp_means, 5), np.percentile(hisp_means, 95))
    ci_white = (np.percentile(white_means, 5), np.percentile(white_means, 95))
    ci_diff = (np.percentile(diff_means, 5), np.percentile(diff_means, 95))

    mean_hisp = np.mean(hisp_exp)
    mean_white = np.mean(white_exp)
    mean_diff = mean_white - mean_hisp
    
    res_str = f"### Age Cohort: {cohort}\n"
    res_str += f"- **Hispanic** (n={len(hisp_exp)}): Mean = ${mean_hisp:.2f}, 90% CI = (${ci_hisp[0]:.2f}, ${ci_hisp[1]:.2f})\n"
    res_str += f"- **White not Hispanic** (n={len(white_exp)}): Mean = ${mean_white:.2f}, 90% CI = (${ci_white[0]:.2f}, ${ci_white[1]:.2f})\n"
    res_str += f"- **Difference (White - Hispanic)**: Mean = ${mean_diff:.2f}, 90% CI = (${ci_diff[0]:.2f}, ${ci_diff[1]:.2f})\n"
    
    results.append(res_str)

print('\n'.join(results))

### Age Cohort: A.0-5
- **Hispanic** (n=44): Mean = $1393.20, 90% CI = ($1263.61, $1524.00)
- **White not Hispanic** (n=20): Mean = $1366.90, 90% CI = ($1110.94, $1632.90)
- **Difference (White - Hispanic)**: Mean = $-26.30, 90% CI = ($-311.22, $272.17)

### Age Cohort: B.6-12
- **Hispanic** (n=91): Mean = $2312.19, 90% CI = ($2171.86, $2454.78)
- **White not Hispanic** (n=46): Mean = $2052.26, 90% CI = ($1846.15, $2252.76)
- **Difference (White - Hispanic)**: Mean = $-259.93, 90% CI = ($-508.29, $-15.80)

### Age Cohort: C.13-17
- **Hispanic** (n=103): Mean = $3955.28, 90% CI = ($3802.08, $4105.42)
- **White not Hispanic** (n=67): Mean = $3904.36, 90% CI = ($3690.24, $4113.66)
- **Difference (White - Hispanic)**: Mean = $-50.92, 90% CI = ($-313.79, $205.55)

### Age Cohort: D.18-21
- **Hispanic** (n=78): Mean = $9959.85, 90% CI = ($9366.27, $10571.31)
- **White not Hispanic** (n=69): Mean = $10133.06, 90% CI = ($9615.44, $10657.61)
- **Difference (White - Hispanic)**: Mean = $173.21, 90% CI = ($-621.97, $970.98)

### Age Cohort: E.22-50
- **Hispanic** (n=43): Mean = $40924.12, 90% CI = ($39330.41, $42527.58)
- **White not Hispanic** (n=133): Mean = $40187.62, 90% CI = ($39329.31, $41047.76)
- **Difference (White - Hispanic)**: Mean = $-736.49, 90% CI = ($-2569.61, $1064.21)

### Age Cohort: F.51-Over
- **Hispanic** (n=17): Mean = $55585.00, 90% CI = ($53484.51, $57692.66)
- **White not Hispanic** (n=66): Mean = $52670.42, 90% CI = ($51382.32, $53951.73)
- **Difference (White - Hispanic)**: Mean = $-2914.58, 90% CI = ($-5427.79, $-519.59)