Chapter 3

Machine Learning Basics

The core vocabulary of ML: features, labels, training, and the difference between supervised and unsupervised learning — with runnable examples.


Machine learning is the practice of teaching computers to find patterns in data. Rather than programming the answer, you give the algorithm many examples and let it infer the rule. The learned rule is called a model.
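To make "inferring the rule from examples" concrete, here is a minimal sketch. It assumes a toy task (converting Celsius to Fahrenheit) and uses NumPy's `np.polyfit` to fit a straight line; the variable names are illustrative and not part of the chapter's later examples.

```python
import numpy as np

# Examples of the relationship we want the computer to discover.
celsius = np.array([-40.0, 0.0, 20.0, 37.0, 100.0])
fahrenheit = np.array([-40.0, 32.0, 68.0, 98.6, 212.0])

# Fit a straight line (degree-1 polynomial) to the examples.
# polyfit returns the coefficients highest degree first: [slope, intercept].
slope, intercept = np.polyfit(celsius, fahrenheit, 1)
print(slope, intercept)  # close to 1.8 and 32

# The fitted line is the "model": it can now be applied to a new input.
prediction = np.polyval([slope, intercept], 25.0)
print(prediction)        # close to 77
```

Nobody typed the formula F = 1.8 * C + 32 into the program; the coefficients were recovered from the five examples alone.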

Key vocabulary

Feature — an input variable (e.g. the size of a house).
Label (or target) — the answer we want to predict (e.g. the price).
Training — the process of adjusting the model to fit the data.
Inference — using the trained model to predict on new data.
Generalization — how well the model performs on data it has never seen.

Supervised learning

In supervised learning, every training example comes with the correct answer (a label). The model learns to map inputs to outputs. Two common tasks are regression (predict a number) and classification (predict a category).

python

from sklearn.linear_model import LinearRegression
import numpy as np

# Feature: house size (m²). Label: price (in thousands).
X = np.array([[50], [70], [90], [110], [130]])
y = np.array([150, 210, 270, 330, 390])

model = LinearRegression()
model.fit(X, y)                 # training

print(model.predict([[100]]))  # -> ~300  (inference)

Linear regression with scikit-learn — predicting house prices.
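For a linear model, the learned rule is just a slope and an intercept, and scikit-learn exposes both after training as `coef_` and `intercept_`. The sketch below refits the same toy data and reads them back; the names `slope`, `intercept`, and `manual` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as above: house size (m²) -> price (in thousands).
X = np.array([[50], [70], [90], [110], [130]])
y = np.array([150, 210, 270, 330, 390])

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # learned price change per extra m²
intercept = model.intercept_  # learned price at size 0

print(f"price = {slope:.2f} * size + {intercept:.2f}")

# Applying the rule by hand gives the same answer as model.predict.
manual = slope * 100 + intercept
print(manual, model.predict([[100]])[0])
```

Because the toy prices are exactly three times the size, the fitted slope comes out at about 3 and the intercept at about 0, up to floating-point rounding.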

Regression vs. classification

Predicting a continuous value (price, temperature) is regression. Predicting a discrete category (spam / not-spam, cat / dog) is classification.

Here is a classification example that predicts whether a message is spam based on two simple features:

python

from sklearn.tree import DecisionTreeClassifier

# Features: [num_links, has_word_free]
X = [[3, 1], [0, 0], [5, 1], [1, 0], [4, 1]]
y = ["spam", "ham", "spam", "ham", "spam"]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[4, 1]]))  # -> ['spam']

A tiny classifier.
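A decision tree's learned rule can be printed as plain text, which helps show what "training" actually produced. This is a small sketch using scikit-learn's `export_text`; the feature names passed in and the two example messages are assumptions made for illustration. Both features separate this toy data perfectly, so which one the tree splits on may vary.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [num_links, has_word_free]
X = [[3, 1], [0, 0], [5, 1], [1, 0], [4, 1]]
y = ["spam", "ham", "spam", "ham", "spam"]

# Fixing random_state makes the choice between equally good splits repeatable.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned decision rules as indented text.
rules = export_text(clf, feature_names=["num_links", "has_word_free"])
print(rules)

# Two example messages: many links plus the word "free", and neither.
predictions = clf.predict([[6, 1], [0, 0]])
print(predictions)
```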

Unsupervised learning

In unsupervised learning there are no labels. The algorithm discovers structure on its own, for example by grouping similar customers together (clustering) or by reducing the number of dimensions in the data (dimensionality reduction).

python

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [10, 2], [10, 4], [1, 3], [10, 3]])
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)

print(kmeans.labels_)  # e.g. [0 0 1 1 0 1]

K-means clustering groups data without labels.
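The other unsupervised task mentioned above, reducing the number of dimensions, can be sketched in the same way. The example below assumes principal component analysis (PCA) as the technique, which is one common choice among several, and reuses the six points from the clustering example.

```python
import numpy as np
from sklearn.decomposition import PCA

# The same six 2-D points used for clustering above.
X = np.array([[1, 2], [1, 4], [10, 2], [10, 4], [1, 3], [10, 3]])

# Keep only the single direction along which the data varies the most.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (6, 1)

# Fraction of the total variance that survives the reduction.
print(pca.explained_variance_ratio_)
```

Most of the spread in this data lies along the first coordinate (1 versus 10), so a single component retains nearly all of the variance. No labels were needed at any point.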

The train / test split

A model that memorizes its training data but fails on new data is overfitting. To detect this, we always hold back some data for testing:

python

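from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A labeled dataset with enough rows to split (150 iris flowers).
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                      # train on 80% of the rows
print("accuracy:", model.score(X_test, y_test))  # score on the held-out 20%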

Never evaluate on the data you trained on.

Overfitting is the #1 pitfall

If your model scores 99% on training data but 60% on test data, it has memorized the training examples instead of learning a pattern that generalizes. More data, simpler models, and regularization all help.
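The gap between training and test scores can be seen directly. The sketch below is an illustration under assumptions not taken from this chapter: it uses a synthetic noisy dataset from `make_moons` and compares a decision tree of unlimited depth with one capped at `max_depth=3` as the "simpler model".

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with deliberate noise, so the classes overlap.
X, y = make_moons(n_samples=200, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for name, depth in [("unlimited depth", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each tree.
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name}: train={results[name][0]:.2f} test={results[name][1]:.2f}")
```

The unlimited tree keeps splitting until it fits every training point, noise included, so its training accuracy reaches 1.0 while its test accuracy falls short of that. Capping the depth usually narrows the gap, though the exact numbers depend on the data and the split.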

The typical ML workflow

1. Collect and clean the data.
2. Split it into training and test sets.
3. Choose a model and train it on the training set.
4. Evaluate on the test set using a suitable metric.
5. Tune, repeat, and finally deploy the model.
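The steps above can be sketched end to end in a few lines. This is one possible arrangement, not a prescribed recipe: the bundled wine dataset stands in for collected data, a decision tree is an arbitrary model choice, and tuning is done with cross-validation on the training set (an assumption added here so that the test set is touched only once). Deployment is left out.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Collect the data (a bundled, already clean dataset stands in here).
X, y = load_wine(return_X_y=True)

# 2. Split it into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3 and 5. Choose a model and tune it, using only the training set.
candidate_depths = [1, 2, 3, 5]
cv_scores = {}
for depth in candidate_depths:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 cross-validation folds of the training data.
    cv_scores[depth] = cross_val_score(candidate, X_train, y_train, cv=5).mean()
best_depth = max(cv_scores, key=cv_scores.get)

final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)

# 4. Evaluate once on the test set, with accuracy as the metric.
test_accuracy = final_model.score(X_test, y_test)
print("best depth:", best_depth)
print("test accuracy:", round(test_accuracy, 3))
```

Choosing the depth by cross-validation rather than by test score keeps the final number an honest estimate of how the model generalizes.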
