Chapter 2

Statistics & Probability Foundations

The essential math behind AI — descriptive statistics, probability, distributions, and hypothesis testing with hands-on Python examples.

Before diving into machine learning algorithms, you need a working knowledge of statistics and probability. These are the mathematical foundations that every ML model is built on — from calculating averages to understanding why a model’s predictions are trustworthy.

Why statistics matters for AI

Machine learning is essentially applied statistics at scale. Understanding distributions, variance, and probability helps you choose the right model, evaluate its performance, and avoid common pitfalls like overfitting.

Descriptive statistics

Descriptive statistics summarize and describe the main features of a dataset. Before any model can learn, you need to understand your data — and descriptive statistics are the first tool you reach for.

Measures of central tendency

Central tendency tells you where the "center" of your data sits. The three classic measures are:

Mean — the arithmetic average. Add all values and divide by the count. Sensitive to outliers — a single billionaire skews the average income of a neighborhood.
Median — the middle value when data is sorted. If you have an even number of values, take the average of the two middle ones. Robust to outliers.
Mode — the most frequent value. Especially useful for categorical data like "red", "blue", "green".

python

import numpy as np
from scipy import stats as sp_stats

data = np.array([23, 45, 12, 67, 34, 89, 23, 56, 23, 78])

print(f"Mean:   {np.mean(data):.1f}")           # 45.0
print(f"Median: {np.median(data):.1f}")         # 39.5
print(f"Mode:   {sp_stats.mode(data).mode}")    # 23

Computing mean, median, and mode.

When to use which?

Use mean for symmetric data without outliers. Use median for skewed data or when outliers are present (e.g., income, house prices). Use mode for categorical data or to find the most common category.
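
To make the outlier sensitivity concrete, here is a minimal sketch using made-up neighborhood incomes (the values and the names incomes and with_outlier are illustrative assumptions). Adding one extreme earner moves the mean enormously while the median barely changes.

```python
import numpy as np

# Hypothetical annual incomes in thousands; values are illustrative only
incomes = np.array([42, 48, 51, 55, 60, 63, 70])
with_outlier = np.append(incomes, 1_000_000)  # one billionaire moves in

print(f"Mean before:   {np.mean(incomes):.1f}")        # 55.6
print(f"Median before: {np.median(incomes):.1f}")      # 55.0
print(f"Mean after:    {np.mean(with_outlier):.1f}")   # 125048.6
print(f"Median after:  {np.median(with_outlier):.1f}") # 57.5
```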

Measures of spread

Spread (or dispersion) tells you how scattered your data is. Two datasets can have the same mean but wildly different spreads.

Range — max minus min. Simple but misleading with outliers.
Variance — average of squared deviations from the mean. Larger = more spread.
Standard deviation (σ) — the square root of variance. Same units as the data, so easier to interpret.
Interquartile Range (IQR) — the range of the middle 50% (Q3 − Q1). Robust to outliers.

python

import numpy as np

data = np.array([23, 45, 12, 67, 34, 89, 23, 56, 23, 78])

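print(f"Range:    {np.ptp(data)}")               # 77
# np.var and np.std default to the population formulas (ddof=0)
print(f"Variance: {np.var(data):.1f}")           # 629.2
print(f"Std Dev:  {np.std(data):.1f}")           # 25.1

q1, q3 = np.percentile(data, [25, 75])           # 23.0 and 64.25
print(f"IQR:      {q3 - q1:.1f}")                # 41.2 (41.25 before rounding)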

Spread measures with NumPy.
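
One detail worth knowing: np.var and np.std divide by n by default (the population formula). When the data is a sample used to estimate a population, dividing by n − 1 is the usual choice, which NumPy exposes through the ddof parameter. A short sketch on the same data:

```python
import numpy as np

data = np.array([23, 45, 12, 67, 34, 89, 23, 56, 23, 78])

pop_var = np.var(data)             # divides by n (ddof=0, the default)
sample_var = np.var(data, ddof=1)  # divides by n - 1

print(f"Population variance: {pop_var:.1f}")                # 629.2
print(f"Sample variance:     {sample_var:.1f}")             # 699.1
print(f"Sample std dev:      {np.std(data, ddof=1):.1f}")   # 26.4
```

The gap between the two shrinks as the dataset grows, so the choice matters most for small samples.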

Percentiles and quartiles

A percentile tells you what percentage of the data falls below a given value. For example, if your test score is at the 90th percentile, you scored higher than 90% of test-takers. Quartiles split the data into four equal parts:

Q1 (25th percentile) — 25% of data falls below this value.
Q2 (50th percentile) — the median.
Q3 (75th percentile) — 75% of data falls below this value.

python

import numpy as np

scores = np.array([55, 62, 68, 72, 75, 78, 82, 88, 91, 95])

for p in [25, 50, 75, 90]:
    print(f"{p}th percentile: {np.percentile(scores, p):.1f}")
# 25th: 69.0 | 50th: 76.5 | 75th: 86.5 | 90th: 91.4

Percentiles in practice.

Skewness and kurtosis

The shape of a distribution matters. Skewness measures asymmetry: positive skew means a long right tail (e.g., income data), negative skew means a long left tail. Kurtosis measures how heavy the tails are compared to a normal distribution — high kurtosis means more extreme outliers.

python

from scipy import stats
import numpy as np

np.random.seed(42)
# Income-like data: positively skewed
income = np.random.exponential(scale=50000, size=1000)

print(f"Skewness: {stats.skew(income):.2f}")      # roughly 2 (right-skewed)
print(f"Kurtosis: {stats.kurtosis(income):.2f}")  # excess kurtosis, about 6 in theory (heavy tails)

Measuring distribution shape.

Probability basics

Probability quantifies uncertainty. It is a number between 0 (impossible) and 1 (certain). In AI, probabilities are everywhere: a classifier outputs the probability that an email is spam, a language model outputs the probability of each next word.

Events and sample spaces

The sample space is the set of all possible outcomes. An event is a subset of those outcomes. For a single die roll, the sample space is {1, 2, 3, 4, 5, 6}. The event "roll an even number" is {2, 4, 6}.

P(A) — probability of event A. Always between 0 and 1.
Complement P(Aᶜ) — probability A does NOT happen. P(Aᶜ) = 1 − P(A).
P(A ∩ B) — probability of both A and B (joint probability).
P(A ∪ B) — probability of A or B. Equals P(A) + P(B) − P(A ∩ B).
P(A | B) — probability of A given that B happened (conditional probability).

python

import numpy as np

# Simulate 100,000 dice rolls
np.random.seed(42)
rolls = np.random.randint(1, 7, size=100000)

print(f"P(roll = 6):     {np.mean(rolls == 6):.4f}")       # ~0.1667
print(f"P(roll is even): {np.mean(rolls % 2 == 0):.4f}")  # ~0.5000
print(f"P(roll >= 5):    {np.mean(rolls >= 5):.4f}")      # ~0.3333

Basic probability with dice.
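
The rules listed above can also be checked exactly by enumerating outcomes instead of simulating. This is a small sketch for a fair die; the events A and B and the helper prob are illustrative names, not part of any library.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # event: roll is even
B = {5, 6}      # event: roll is at least 5

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_joint = prob(A & B)              # both A and B
p_union = prob(A | B)              # A or B
p_a_given_b = p_joint / prob(B)    # conditional probability
p_not_a = 1 - prob(A)              # complement

print(f"P(A ∩ B) = {p_joint}")      # 1/6
print(f"P(A ∪ B) = {p_union}")      # 2/3
print(f"P(A | B) = {p_a_given_b}")  # 1/2
print(f"P(not A) = {p_not_a}")      # 1/2
```

Note that P(A ∪ B) matches P(A) + P(B) − P(A ∩ B) = 1/2 + 1/3 − 1/6.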

Independent vs. dependent events

Two events are independent if knowing one happened tells you nothing about the other. Coin flips are independent — getting heads once does not affect the next flip. Events are dependent when one changes the probability of the other, like drawing cards from a deck without replacement.

python

# Independent: Two coin flips
# P(heads AND heads) = P(heads) * P(heads)
p_both_heads = 0.5 * 0.5
print(f"P(HH) = {p_both_heads}")  # 0.25

# Dependent: Drawing 2 aces from a 52-card deck without replacement
p_first_ace  = 4 / 52
p_second_ace = 3 / 51  # one ace is gone, one card is gone
print(f"P(2 aces) = {p_first_ace * p_second_ace:.4f}")  # 0.0045

Independent vs. dependent probability.
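
The two-ace calculation can be sanity-checked by simulation. The sketch below assumes a deck labeled 0 to 51 with labels 0 to 3 standing in for the aces (an arbitrary labeling chosen for illustration) and draws two distinct cards per trial.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 1_000_000

# Label the 52 cards 0..51 and treat labels 0..3 as the four aces
first = rng.integers(0, 52, size=n_trials)
# Second card: choose among the 51 remaining labels by skipping the first card
second = rng.integers(0, 51, size=n_trials)
second = np.where(second >= first, second + 1, second)

both_aces = (first < 4) & (second < 4)
p_simulated = both_aces.mean()
p_exact = (4 / 52) * (3 / 51)

print(f"Simulated P(2 aces): {p_simulated:.4f}")
print(f"Exact P(2 aces):     {p_exact:.4f}")  # 0.0045
```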

The law of large numbers

As you repeat an experiment more times, the observed frequency converges to the true probability. Flip a coin 10 times and you might get 70% heads. Flip it 10,000 times and you will very likely land close to 50%. This is the Law of Large Numbers, and it is one reason more (representative) training data usually helps ML models: estimates computed from larger samples are more stable.

python

import numpy as np

np.random.seed(0)
flips = np.random.choice([0, 1], size=10000)  # 0=tails, 1=heads

for n in [10, 100, 1000, 10000]:
    est = np.mean(flips[:n])
    print(f"After {n:>5} flips: P(heads) ≈ {est:.4f}")
# Exact values depend on the seed and NumPy version; the estimate settles near 0.5 as n grows

Watching convergence with more samples.

Bayes' theorem

Bayes' theorem lets you update your beliefs when new evidence arrives. It answers: given that I observed some evidence, what is the updated probability of my hypothesis?

The formula explained

The formula is: P(H|E) = P(E|H) × P(H) / P(E). In plain English:

P(H) — the prior: your belief before seeing any evidence.
P(E|H) — the likelihood: how probable is the evidence if the hypothesis is true?
P(E) — the marginal likelihood: how probable is the evidence overall?
P(H|E) — the posterior: your updated belief after seeing the evidence.

Bayes in practice: medical testing

python

# A test has 99% sensitivity and a 1% false positive rate. Disease prevalence is 1 in 1000.
# If you test positive, what is the actual probability you have the disease?

p_disease = 0.001            # prior
p_positive_given_disease = 0.99  # sensitivity
p_positive_given_healthy = 0.01  # false positive rate

# Total probability of testing positive
p_positive = (p_positive_given_disease * p_disease +
              p_positive_given_healthy * (1 - p_disease))

# Bayes' theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
print(f"P(disease | positive) = {p_disease_given_positive:.3f}")  # ~0.090
# Only 9%! Most positives are false positives.

Bayes' theorem in action — disease testing.

The base rate fallacy

Even highly accurate tests can produce mostly false positives when the condition is rare. Always consider the base rate (prevalence) when interpreting probabilities.
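
To see how strongly the base rate drives the answer, here is a small sketch that reuses the same test characteristics (99% sensitivity, 1% false positive rate) and varies only the prevalence. The helper name posterior is an illustrative choice.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

for prevalence in [0.001, 0.01, 0.1, 0.5]:
    p = posterior(prevalence, sensitivity=0.99, false_positive_rate=0.01)
    print(f"Prevalence {prevalence:>5}: P(disease | positive) = {p:.3f}")
# 0.001 -> 0.090 | 0.01 -> 0.500 | 0.1 -> 0.917 | 0.5 -> 0.990
```

The same test goes from "mostly false alarms" to "almost always right" purely because the condition becomes more common.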

Bayes in AI: spam filtering

The Naive Bayes classifier is one of the simplest and fastest ML algorithms. It uses Bayes' theorem to calculate the probability that a message is spam given the words it contains. It is called "naive" because it assumes the words are conditionally independent of one another given the class (spam or ham), an oversimplification that works surprisingly well in practice.

python

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free prize now", "Meeting at 3pm tomorrow",
    "Click here for cash",  "Project update attached",
    "Free money guaranteed", "Lunch plans for Friday?",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(emails)
clf = MultinomialNB().fit(X, labels)

test = vec.transform(["Win free cash prize"])
print(clf.predict(test))        # ['spam']
print(clf.predict_proba(test))  # probabilities for [ham, spam]

Naive Bayes spam classifier.

Probability distributions

A distribution describes how the values of a random variable are spread. Understanding distributions is critical because every ML model makes assumptions about the distribution of the data.

The normal (Gaussian) distribution

The most important distribution in all of statistics. The classic bell curve is defined by two parameters: the mean (μ) and standard deviation (σ). The 68-95-99.7 rule says that ~68% of data falls within 1σ of the mean, ~95% within 2σ, and ~99.7% within 3σ.

python

import numpy as np

np.random.seed(42)
samples = np.random.normal(loc=100, scale=15, size=100000)

for k in [1, 2, 3]:
    within = np.mean(np.abs(samples - 100) < k * 15)
    print(f"Within {k}σ: {within:.1%}")
# Approximately: Within 1σ: 68.3% | Within 2σ: 95.4% | Within 3σ: 99.7%

The 68-95-99.7 rule in action.

The binomial distribution

Counts the number of successes in a fixed number of independent yes/no trials. For example: out of 100 website visitors, how many clicked the buy button if each has a 5% chance?

python

from scipy import stats

# 100 visitors, 5% conversion rate
n, p = 100, 0.05
dist = stats.binom(n, p)

print(f"Expected clicks:  {dist.mean():.1f}")    # 5.0
print(f"Std dev:          {dist.std():.2f}")     # 2.18
print(f"P(exactly 5):     {dist.pmf(5):.3f}")    # 0.180
print(f"P(10 or more):    {1 - dist.cdf(9):.4f}") # 0.0282

Binomial distribution example.

The Poisson distribution

Models the number of events in a fixed interval when events occur independently at a constant average rate. Examples: server requests per minute, typos per page, goals per soccer match.

python

from scipy import stats

# Average 8 requests per minute
lambda_rate = 8
dist = stats.poisson(lambda_rate)

print(f"P(exactly 8):    {dist.pmf(8):.3f}")     # 0.140
print(f"P(more than 12): {1 - dist.cdf(12):.4f}") # 0.0638
print(f"P(0 requests):   {dist.pmf(0):.6f}")     # 0.000335

Poisson: modeling server requests.

Choosing the right distribution

Normal for continuous measurements (height, weight). Binomial for counting successes in fixed trials. Poisson for counting events in time/space. Knowing which distribution fits your data is half the battle in statistics.

Sampling and estimation

In the real world, you rarely have access to an entire population. Instead, you take a sample and use it to estimate population parameters. The quality of your sample determines the reliability of your conclusions — and your ML model.

Types of sampling

Random sampling — every member of the population has an equal chance of being selected. The gold standard.
Stratified sampling — divide the population into subgroups (strata) and sample from each. Ensures representation.
Systematic sampling — select every k-th element from a list.
Convenience sampling — use whatever data is easiest to collect. Prone to bias.
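
The first three schemes can be sketched in a few lines of NumPy. The population below is hypothetical: 1,000 member IDs split 80/20 between two made-up groups "A" and "B", with a 10% sample drawn each way.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 1,000 IDs, 80% in group "A" and 20% in group "B"
ids = np.arange(1000)
groups = np.array(["A"] * 800 + ["B"] * 200)

# Random sampling: every member has the same chance of selection
random_sample = rng.choice(ids, size=100, replace=False)

# Stratified sampling: draw 10% from each group separately
stratified_sample = np.concatenate([
    rng.choice(ids[groups == g], size=np.sum(groups == g) // 10, replace=False)
    for g in ["A", "B"]
])

# Systematic sampling: every k-th member after a random starting point
k = 10
start = rng.integers(0, k)
systematic_sample = ids[start::k]

share_b_random = np.mean(groups[random_sample] == "B")
share_b_stratified = np.mean(groups[stratified_sample] == "B")
print(f"Group B share, random sample:     {share_b_random:.2f}")
print(f"Group B share, stratified sample: {share_b_stratified:.2f}")  # 0.20
print(f"Systematic sample size:           {len(systematic_sample)}")  # 100
```

The stratified sample reproduces the 20% share of group B exactly, while the random sample only does so on average.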

Confidence intervals

A confidence interval gives a range of plausible values for a population parameter. A 95% CI means: if we repeated this experiment many times, 95% of the intervals we compute would contain the true value.

python

import numpy as np
from scipy import stats

np.random.seed(42)
sample = np.random.normal(loc=170, scale=10, size=50)  # heights in cm

mean = np.mean(sample)
se = stats.sem(sample)  # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample)-1, loc=mean, scale=se)

print(f"Sample mean: {mean:.1f} cm")
print(f"95% CI:      ({ci[0]:.1f}, {ci[1]:.1f}) cm")
# e.g., 95% CI: (167.2, 173.4) cm

95% confidence interval for the mean.
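
The "repeated experiments" interpretation can be checked directly by simulation. This sketch assumes the same setup as above (true mean 170, standard deviation 10, samples of 50) and counts how often the 95% interval contains the true mean across 2,000 simulated experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, n, n_experiments = 170, 50, 2000

# One row per repeated experiment
samples = rng.normal(loc=true_mean, scale=10, size=(n_experiments, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)   # standard error of each mean
t_crit = stats.t.ppf(0.975, df=n - 1)            # two-sided 95% critical value

# An interval covers the truth when the mean is within t_crit standard errors
covered = np.abs(means - true_mean) <= t_crit * ses
coverage = covered.mean()
print(f"Intervals containing the true mean: {coverage:.1%}")  # close to 95%
```

Any single interval either contains the true mean or it does not; the 95% describes the procedure, not one particular interval.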

Correlation vs. causation

Two variables can be correlated (they move together) without one causing the other. This is one of the most important distinctions in data science. Ice cream sales and drowning rates both rise in summer — but ice cream does not cause drowning. The hidden common cause is hot weather.

Measuring correlation

Pearson's r measures linear correlation between two variables. It ranges from −1 (perfect negative) to +1 (perfect positive). Zero means no linear relationship (but there could still be a non-linear one!).

python

import numpy as np

study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores = np.array([40, 50, 55, 65, 70, 75, 85, 90])

r = np.corrcoef(study_hours, exam_scores)[0, 1]
print(f"Pearson r = {r:.3f}")  # 0.996 (strong positive)

# Interpret: |r| < 0.3 weak, 0.3-0.7 moderate, > 0.7 strong

Pearson correlation coefficient.
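
The caveat about non-linear relationships is easy to demonstrate. In this sketch y is completely determined by x, yet Pearson's r is essentially zero because the relationship is symmetric rather than linear (the data is synthetic).

```python
import numpy as np

x = np.linspace(-3, 3, 61)   # symmetric around zero
y = x ** 2                   # y depends entirely on x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")  # approximately zero
```

Plotting the data before trusting a correlation coefficient catches cases like this.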

Spurious correlations

With enough variables, you are almost guaranteed to find some strong correlations purely by chance. A well-known example: the number of Nicolas Cage films released in a year correlates with swimming pool drownings. This is why we need hypothesis testing and domain knowledge, not just data mining.
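
A quick simulation shows how easily chance correlations appear. The sketch below generates one random "target" series and 200 unrelated random series (all synthetic, with an assumed 20 observations each) and reports the strongest correlation found.

```python
import numpy as np

rng = np.random.default_rng(1)

n_obs, n_vars = 20, 200
target = rng.normal(size=n_obs)            # e.g., 20 yearly observations
noise = rng.normal(size=(n_vars, n_obs))   # 200 unrelated random variables

# Pearson r between the target and every unrelated variable
r_values = np.array([np.corrcoef(target, v)[0, 1] for v in noise])
strongest = np.max(np.abs(r_values))
print(f"Strongest |r| among {n_vars} unrelated variables: {strongest:.2f}")
```

With short series and many candidates, the strongest correlation typically looks "moderate" or better even though every variable is pure noise.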

Correlation is a clue, not proof

In ML, we use correlations to select features. But always ask: could a hidden variable explain the relationship? Causation requires controlled experiments or causal inference methods.
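
One simple way to probe for a hidden common cause is to remove its linear effect from both variables and correlate what is left. The sketch below assumes a made-up data-generating process in which temperature drives both ice cream sales and drownings; the coefficients and the helper residuals are illustrative, and this is only a basic check rather than a full causal inference method.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Hypothetical process: temperature influences both variables
temperature = rng.normal(loc=20, scale=8, size=n)
ice_cream = 2.0 * temperature + rng.normal(scale=5, size=n)
drownings = 0.5 * temperature + rng.normal(scale=3, size=n)

r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """What remains of y after subtracting a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlation after controlling for temperature
r_controlled = np.corrcoef(residuals(ice_cream, temperature),
                           residuals(drownings, temperature))[0, 1]

print(f"Raw correlation:                {r_raw:.2f}")
print(f"Controlling for temperature:    {r_controlled:.2f}")
```

The raw correlation is strong, but it largely disappears once temperature is accounted for.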

The Central Limit Theorem

The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It states that the mean of a large number of independent, identically distributed observations is approximately normally distributed, whatever the shape of the original distribution, provided that distribution has finite variance. This is why the normal distribution shows up everywhere and why sample means are so useful.

Why the CLT matters for AI

It justifies using confidence intervals and hypothesis tests even when data is not normally distributed.
It explains why ensemble methods (averaging multiple models) often work better than single models.
It helps explain the behavior of stochastic gradient descent: a mini-batch gradient is an average, so its noise around the true gradient is roughly normal and shrinks as the batch grows.

python

import numpy as np

np.random.seed(0)
# Roll a die 30 times, take the mean. Repeat 10,000 times.
sample_means = [np.mean(np.random.randint(1, 7, size=30))
                for _ in range(10000)]

print(f"Mean of means: {np.mean(sample_means):.2f}")  # ~3.50
print(f"Std of means:  {np.std(sample_means):.3f}")  # ~0.31
# The distribution of these means is approximately normal!

CLT in action: dice rolls become a bell curve.
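
The effect is more striking with a strongly skewed source distribution. This sketch draws from an exponential distribution (skewness 2) and tracks how the skewness of the sample means fades as the sample size n grows; the dictionary name skews is an illustrative choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

skews = {}
for n in [1, 5, 30, 200]:
    # 10,000 sample means, each computed from n exponential draws
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    skews[n] = stats.skew(means)
    print(f"n = {n:>3}: skewness of sample means = {skews[n]:.2f}")
```

In theory the skewness of the mean falls like 2 / √n, so the distribution of means becomes increasingly symmetric and bell-shaped.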

Hypothesis testing

Hypothesis testing is how we decide whether a result is statistically significant or just due to random chance. In ML, you use it to answer questions like: "Is model A really better than model B, or did it just get lucky on this test set?"

The step-by-step workflow

1. State a null hypothesis (H₀): "there is no effect or difference."
2. State an alternative hypothesis (H₁): "there IS an effect."
3. Choose a significance level (α), typically 0.05.
4. Collect data and compute a test statistic.
5. Calculate the p-value: the probability of seeing this result (or one more extreme) if H₀ were true.
6. If p-value < α, reject H₀ and conclude the result is statistically significant.

Common statistical tests

t-test — compare the means of two groups. Use when data is roughly normal.
Chi-squared test — test relationships between categorical variables.
ANOVA — compare means across three or more groups.
Mann-Whitney U — non-parametric alternative to the t-test when data is not normal.

python

from scipy import stats
import numpy as np

np.random.seed(42)
group_a = np.random.normal(loc=70, scale=10, size=50)  # control
group_b = np.random.normal(loc=75, scale=10, size=50)  # treatment

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value:     {p_value:.4f}")
print("Significant!" if p_value < 0.05 else "Not significant")

Comparing two groups with a t-test.
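
The other tests in the list follow the same pattern in SciPy. The sketch below runs each on made-up data: a hypothetical 2x2 click table for the chi-squared test, three synthetic groups for ANOVA, and skewed exponential samples for Mann-Whitney U.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Chi-squared test of independence on a hypothetical 2x2 table:
# rows = variant A / B, columns = clicked / did not click
table = np.array([[30, 70],
                  [45, 55]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(f"Chi-squared:    p = {p_chi2:.4f} (dof = {dof})")

# One-way ANOVA across three synthetic groups
g1 = rng.normal(70, 10, size=40)
g2 = rng.normal(72, 10, size=40)
g3 = rng.normal(78, 10, size=40)
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"ANOVA:          p = {p_anova:.4f}")

# Mann-Whitney U on skewed data, where a t-test is less appropriate
a = rng.exponential(scale=1.0, size=60)
b = rng.exponential(scale=1.5, size=60)
u_stat, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"Mann-Whitney U: p = {p_mw:.4f}")
```

In each case the decision rule is the same as before: compare the p-value with the chosen α.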

Type I and Type II errors

No test is perfect. Two kinds of mistakes can happen:

Type I error (false positive) — you reject H₀ when it is actually true. Like a spam filter marking a legitimate email as spam.
Type II error (false negative) — you fail to reject H₀ when it is actually false. Like a spam filter letting spam through.
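
Both error rates can be estimated by simulation. This sketch repeats the two-group t-test 2,000 times, once with no true difference (every rejection is a Type I error) and once with a true difference of 5 points (every non-rejection is a Type II error). The group sizes and means mirror the earlier example; the helper rejection_rate is an illustrative name.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 50, 2000

def rejection_rate(true_diff):
    """Fraction of simulated experiments in which the t-test rejects H0."""
    a = rng.normal(loc=70, scale=10, size=(n_sims, n))
    b = rng.normal(loc=70 + true_diff, scale=10, size=(n_sims, n))
    p_values = stats.ttest_ind(a, b, axis=1).pvalue
    return np.mean(p_values < alpha)

type_1_rate = rejection_rate(true_diff=0)  # H0 is true here
power = rejection_rate(true_diff=5)        # H0 is false here

print(f"Type I error rate:  {type_1_rate:.3f}")   # close to alpha
print(f"Power:              {power:.3f}")
print(f"Type II error rate: {1 - power:.3f}")
```

The Type I rate is set by α, while the Type II rate depends on the effect size, the noise, and the sample size.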

p-values are not probabilities of truth

A p-value of 0.03 does NOT mean there is a 3% chance the null hypothesis is true. It means: if H₀ were true, there would be a 3% chance of seeing data at least this extreme. This subtle distinction trips up even experienced researchers.

Statistics in the ML pipeline

Everything you learned in this chapter connects directly to machine learning. Here is where each concept shows up:

Descriptive stats — exploratory data analysis (EDA) before building any model.
Probability — classification outputs, loss functions, probabilistic models.
Bayes' theorem — Naive Bayes classifiers, Bayesian optimization, prior/posterior in Bayesian ML.
Distributions — assuming data normality, initializing neural network weights.
Sampling — train/test splits, cross-validation, bootstrapping.
Correlation — feature selection, multicollinearity detection.
CLT — justifies why SGD with mini-batches works, ensemble averaging.
Hypothesis testing — A/B testing models, significance of improvements.

You are ready for ML

With these statistical foundations, you have the vocabulary and intuition to understand what ML algorithms are actually doing under the hood. The next chapter puts it all together with machine learning basics.
