Chapter 7

Computer Vision

Teaching machines to see — from pixel grids to image classification, object detection, and real-world applications.


Computer vision is the field of AI that enables machines to interpret and make decisions based on visual data — images and video. From unlocking your phone with your face to self-driving cars reading road signs, computer vision is everywhere.

How computers see images

To a computer, an image is just a grid of numbers. A grayscale image is a 2D array where each value represents pixel brightness (in the common 8-bit encoding, 0 = black and 255 = white). A color image adds a third dimension with three channels: Red, Green, Blue (RGB).
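As a minimal sketch of this idea, the arrays below build a tiny grayscale image and a tiny color image by hand. It assumes NumPy is installed (NumPy is not otherwise used in this chapter), and the pixel values are made up for illustration.

```python
import numpy as np

# A 2x3 grayscale image: one brightness value per pixel (0 = black, 255 = white).
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# A 2x3 color image: the same grid, but each pixel holds three values (R, G, B).
color = np.zeros((2, 3, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]   # top-left pixel is pure red
color[1, 2] = [0, 0, 255]   # bottom-right pixel is pure blue

print(gray.shape)             # (2, 3): height, width
print(color.shape)            # (2, 3, 3): height, width, channels
print(color[0, 0].tolist())   # [255, 0, 0]
```

Note that the array shape lists height before width, which is the usual NumPy layout for images.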

Image dimensions

A 1920×1080 color photo has 1920 × 1080 × 3 = 6,220,800 numbers. That is the raw data a vision model must process for a single frame.
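This arithmetic can be checked directly. The sketch below assumes NumPy and uses an all-zero placeholder frame rather than a real photo; with one byte per value, the count of numbers is also the uncompressed size in bytes.

```python
import numpy as np

# Placeholder for a 1920x1080 RGB frame; rows (height) come first in the shape.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)

print(frame.size)     # 6220800 values
print(frame.nbytes)   # 6220800 bytes, since each uint8 value takes one byte
```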

Convolutional Neural Networks (CNNs)

CNNs are the workhorse of computer vision. Instead of looking at every pixel independently, they slide small filters (kernels) across the image to detect local patterns — edges, textures, shapes — and build up increasingly abstract representations layer by layer.

Convolutional layers — detect local patterns using learnable filters.
Pooling layers — reduce spatial dimensions, keeping the most important information.
Fully connected layers — combine extracted features to make a final prediction.
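To make the first two layer types concrete, here is a rough sketch of a convolution and a max-pooling step written by hand. It assumes NumPy, uses a toy 6×6 image, and uses a hand-picked vertical-edge filter; in a real CNN the filter values are learned during training rather than chosen. The function names `conv2d` and `max_pool2d` are illustrative, not library calls.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like deep learning frameworks, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked filter that responds where brightness increases left to right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

features = conv2d(image, kernel)   # shape (4, 4); each row is [0, 3, 3, 0]
pooled = max_pool2d(features)      # shape (2, 2)
print(features)
print(pooled)
```

The filter output is zero in the flat regions and nonzero only around the boundary between the dark and bright halves, which is the sense in which a filter "detects" a local pattern.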

python

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),  # 3 RGB channels -> 16 filters
            nn.ReLU(),
            nn.MaxPool2d(2),                 # halve spatial dims
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 10),       # 10 output classes; 32 * 8 * 8 assumes 32x32 inputs
        )

    def forward(self, x):
        return self.classifier(self.features(x))

A minimal CNN in PyTorch for image classification.
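The `32 * 8 * 8` in the classifier depends on the input size. The sketch below traces the spatial size through the layers, assuming 32×32 input images (the code above does not state an input size, so that is an assumption). The helpers `conv_out` and `pool_out` are hypothetical names for the standard output-size formulas.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial size after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Spatial size after non-overlapping max pooling."""
    return size // window

size = 32                                    # assumed input: 32x32 RGB images
size = conv_out(size, kernel=3, padding=1)   # 32: padding=1 preserves the size
size = pool_out(size, 2)                     # 16
size = conv_out(size, kernel=3, padding=1)   # 16
size = pool_out(size, 2)                     # 8

flattened = 32 * size * size                 # 32 channels from the last conv layer
print(size, flattened)                       # 8 2048
```

With a different input size, the first argument of `nn.Linear` would need to change to match.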

Image classification

Classification assigns a single label to an entire image. Pretrained models such as ResNet, EfficientNet, and Vision Transformers (ViT) are commonly trained on ImageNet, where they distinguish 1,000 object categories with accuracy close to human performance on that benchmark.

python

from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("https://upload.wikimedia.org/wikipedia/commons/4/4d/Cat_November_2010-1a.jpg")

# Print the three highest-scoring labels.
for r in result[:3]:
    print(f"{r['label']:30s} {r['score']:.3f}")
# tabby, tabby cat               0.931
# tiger cat                      0.042
# Egyptian cat                   0.018

Classify any image with a pretrained model.
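The scores printed above are probabilities. Classifiers of this kind typically produce one raw score (logit) per class and convert those scores to probabilities with a softmax. The following is a minimal sketch of that last step, assuming NumPy; the logits and the `top_k` helper are made up for illustration and do not come from the model above.

```python
import numpy as np

def top_k(logits, labels, k=3):
    """Convert raw scores to probabilities and return the k most likely labels."""
    shifted = logits - np.max(logits)          # subtract the max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    order = np.argsort(probs)[::-1][:k]        # indices sorted by descending probability
    return [(labels[i], float(probs[i])) for i in order]

# Hypothetical logits for four classes; a real model would produce these.
labels = ["tabby cat", "tiger cat", "Egyptian cat", "golden retriever"]
logits = np.array([4.0, 1.0, 0.5, -2.0])

for label, score in top_k(logits, labels):
    print(f"{label:20s} {score:.3f}")
```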

Object detection

While classification answers what is in the image, object detection also answers where. It draws bounding boxes around each detected object and labels them. Popular architectures include YOLO (You Only Look Once) and Faster R-CNN.

python

from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
results = detector("https://images.unsplash.com/photo-1529778873920-4da4926a72c2?fm=jpg&q=60&w=3000&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxzZWFyY2h8M3x8a2l0dHklMjBjYXR8ZW58MHx8MHx8fDA%3D")

# Each result holds a label, a confidence score, and a bounding box.
for r in results:
    print(f"{r['label']:15s} score={r['score']:.2f} box={r['box']}")

Detect objects with a pretrained DETR model.
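Each detection is a dictionary with a `label`, a `score`, and a `box` holding `xmin`, `ymin`, `xmax`, and `ymax` pixel coordinates. A common way to work with such boxes, which this chapter does not otherwise cover, is to drop low-confidence detections and to measure how much two boxes overlap using intersection over union (IoU). The sketch below uses hypothetical detections and a hypothetical threshold of 0.5 rather than real model output.

```python
def box_area(box):
    """Area of a box; zero if the box is empty or inverted."""
    width = max(0, box["xmax"] - box["xmin"])
    height = max(0, box["ymax"] - box["ymin"])
    return width * height

def iou(a, b):
    """Intersection over union of two boxes, between 0.0 and 1.0."""
    overlap = {
        "xmin": max(a["xmin"], b["xmin"]),
        "ymin": max(a["ymin"], b["ymin"]),
        "xmax": min(a["xmax"], b["xmax"]),
        "ymax": min(a["ymax"], b["ymax"]),
    }
    inter = box_area(overlap)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical detections in the same shape as the pipeline output.
detections = [
    {"label": "cat", "score": 0.98, "box": {"xmin": 0, "ymin": 0, "xmax": 10, "ymax": 10}},
    {"label": "cat", "score": 0.61, "box": {"xmin": 5, "ymin": 5, "xmax": 15, "ymax": 15}},
    {"label": "dog", "score": 0.12, "box": {"xmin": 50, "ymin": 50, "xmax": 60, "ymax": 60}},
]

confident = [d for d in detections if d["score"] >= 0.5]   # assumed threshold
overlap = iou(confident[0]["box"], confident[1]["box"])
print(len(confident), round(overlap, 3))                   # 2 0.143
```

Here the two boxes share a 5×5 region out of a combined area of 175 pixels, giving an IoU of about 0.143.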

Transfer learning

Training a vision model from scratch typically requires a very large labeled dataset, often millions of images. Transfer learning lets you take a model pretrained on a huge dataset (like ImageNet) and fine-tune it on your small, specific dataset. In the simplest form, you keep the learned feature detectors frozen and retrain only the final classification layers.
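The freeze-and-retrain pattern can be sketched in PyTorch as follows. To stay self-contained, this example uses a small randomly initialized stack as a stand-in for a pretrained backbone; in practice you would load real pretrained weights instead. The 5-class head, the learning rate, and the random batch are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained backbone (randomly initialized here).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # collapse each feature map to a single value
    nn.Flatten(),              # -> 16 features per image
)

# Freeze the feature extractor so its weights are not updated.
for param in backbone.parameters():
    param.requires_grad = False

# New classification head for a hypothetical 5-class dataset.
head = nn.Linear(16, 5)
model = nn.Sequential(backbone, head)

# Only parameters that still require gradients go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# One training step on a random placeholder batch.
images = torch.randn(4, 3, 32, 32)
targets = torch.randint(0, 5, (4,))
loss = nn.functional.cross_entropy(model(images), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(len(trainable))   # 2: the head's weight and bias
```

Because the backbone is frozen, only the head's weight and bias receive gradient updates, which is why this approach needs far less data and compute than training every layer.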

Start with pretrained models

For most real-world tasks, fine-tuning a pretrained model beats training from scratch. You get better accuracy with less data and much less compute.

Real-world applications

1. Medical imaging: detecting tumors, fractures, and retinal diseases from X-rays, MRIs, and retinal scans.
2. Autonomous vehicles: recognizing pedestrians, traffic signs, and lane markings.
3. Manufacturing: quality control by spotting defects on assembly lines.
4. Agriculture: drone imagery analysis for crop health monitoring.
5. Retail: visual search ("find products that look like this photo").

Bias in vision models

Models trained on unrepresentative data can perform poorly on underrepresented groups. Always evaluate your model across diverse demographics before deployment.
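One simple way to act on this advice is to report accuracy per group instead of a single overall number. The sketch below uses made-up labels, predictions, and group tags; `accuracy_by_group` is a hypothetical helper, and a real evaluation would use held-out data with carefully collected group annotations.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group to the model's accuracy on it."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical toy data: group "B" is underrepresented.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)

print(f"overall: {overall:.2f}")            # overall: 0.75
for group, acc in per_group.items():
    print(f"group {group}: {acc:.2f}")      # group A: 1.00, then group B: 0.33
```

In this toy case the overall accuracy of 0.75 hides the fact that the model is right only one time in three on group B.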
