Machine Learning Basics
The core vocabulary of ML: features, labels, training, and the difference between supervised and unsupervised learning — with runnable examples.
Machine learning is the practice of teaching computers to find patterns in data. Rather than programming the answer, you give the algorithm many examples and let it infer the rule. The learned rule is called a model.
Key vocabulary
- Feature — an input variable (e.g. the size of a house).
- Label (or target) — the answer we want to predict (e.g. the price).
- Training — the process of adjusting the model to fit the data.
- Inference — using the trained model to predict on new data.
- Generalization — how well the model performs on data it has never seen.
Supervised learning
In supervised learning, every training example comes with the correct answer (a label). The model learns to map inputs to outputs. Two common tasks are regression (predict a number) and classification (predict a category).
from sklearn.linear_model import LinearRegression
import numpy as np
# Feature: house size (m²). Label: price (in thousands).
X = np.array([[50], [70], [90], [110], [130]])
y = np.array([150, 210, 270, 330, 390])
model = LinearRegression()
model.fit(X, y) # training
print(model.predict([[100]])) # -> ~300 (inference)
Regression vs. classification
Predicting a continuous value (price, temperature) is regression. Predicting a discrete category (spam / not-spam, cat / dog) is classification.
Here is a classification example that predicts whether a message is spam based on two simple features:
from sklearn.tree import DecisionTreeClassifier
# Features: [num_links, has_word_free]
X = [[3, 1], [0, 0], [5, 1], [1, 0], [4, 1]]
y = ["spam", "ham", "spam", "ham", "spam"]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[4, 1]])) # -> ['spam']
Unsupervised learning
In unsupervised learning there are no labels. The algorithm discovers structure on its own — for example grouping similar customers together (clustering) or reducing the number of dimensions in the data.
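The dimensionality-reduction case can be sketched with scikit-learn's PCA. This is a minimal illustration on made-up toy data (the array values are an assumption, not taken from a real dataset): six points in three dimensions that lie almost on a straight line are compressed to a single dimension, and no labels are involved at any point.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 6 points in 3-D that lie almost on one line.
X = np.array([
    [1.0, 2.0, 3.1],
    [2.0, 4.1, 6.0],
    [3.0, 6.0, 9.2],
    [4.0, 8.1, 12.0],
    [5.0, 9.9, 15.1],
    [6.0, 12.0, 18.0],
])

pca = PCA(n_components=1)         # keep a single dimension
X_reduced = pca.fit_transform(X)  # only X is passed: there are no labels

print(X_reduced.shape)                # (6, 1)
print(pca.explained_variance_ratio_)  # share of the variance the kept component retains
```

Because the points are nearly collinear, one component should retain almost all of the variance. The clustering case is shown in the KMeans example that follows.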
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [10, 2], [10, 4], [1, 3], [10, 3]])
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_) # e.g. [0 0 1 1 0 1]
The train / test split
A model that memorizes its training data but fails on new data is overfitting. To detect this, we always hold back some data for testing:
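from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Features and labels from the spam example: [num_links, has_word_free]
X = [[3, 1], [0, 0], [5, 1], [1, 0], [4, 1]]
y = ["spam", "ham", "spam", "ham", "spam"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier().fit(X_train, y_train)
# score() reports accuracy for classifiers (for regressors it reports R^2)
print("accuracy:", clf.score(X_test, y_test))
Overfitting is the #1 pitfall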
If your model scores 99% on training data but 60% on test data, it has most likely memorized the training examples instead of learning a pattern that generalizes. More data, simpler models, and regularization all help.
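The gap between training and test scores can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the data is randomly generated (hypothetical), 20% of the labels are deliberately flipped to act as noise, and a depth limit on a decision tree stands in for "a simpler model".

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data: the label depends only on the first feature,
# and 20% of the labels are flipped at random to simulate noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unrestricted tree can keep splitting until it fits every training point.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth-1 tree is a much simpler model: it can make only one split.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

for name, m in [("deep", deep), ("shallow", shallow)]:
    print(name, "train:", m.score(X_train, y_train), "test:", m.score(X_test, y_test))
```

The unrestricted tree fits the noisy training labels perfectly, so its test score falls well below its training score. The depth-limited tree typically shows a much smaller gap, because it cannot memorize individual noisy points.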
The typical ML workflow
1. Collect and clean the data.
2. Split it into training and test sets.
3. Choose a model and train it on the training set.
4. Evaluate on the test set using a suitable metric.
5. Tune, repeat, and finally deploy the model.
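The steps above can be sketched end to end. This is one possible arrangement, not a prescribed recipe. The Iris dataset, logistic regression, accuracy, and the use of cross-validation for tuning are all assumptions chosen for illustration. Deployment is left out.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# 1. Collect the data (a small built-in dataset stands in for real collection and cleaning).
X, y = load_iris(return_X_y=True)

# 2. Split into training and test sets, keeping class proportions equal in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Choose a model and train it on the training set only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Evaluate on the held-out test set with a suitable metric (accuracy here).
acc = accuracy_score(y_test, model.predict(X_test))
print("test accuracy:", acc)

# 5. Tune and repeat: compare regularization strengths using cross-validation
#    on the training set, so the test set is not used for tuning.
for C in (0.1, 1.0, 10.0):
    scores = cross_val_score(
        LogisticRegression(C=C, max_iter=1000), X_train, y_train, cv=5
    )
    print("C =", C, "mean CV accuracy:", scores.mean())
```

Tuning against cross-validation scores rather than the test score keeps the test set as an honest estimate of generalization.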