Computer Vision
Teaching machines to see — from pixel grids to image classification, object detection, and real-world applications.
Computer vision is the field of AI that enables machines to interpret and make decisions based on visual data — images and video. From unlocking your phone with your face to self-driving cars reading road signs, computer vision is everywhere.
How computers see images
To a computer, an image is just a grid of numbers. A grayscale image is a 2D array where each value represents pixel brightness; in the common 8-bit encoding, 0 is black and 255 is white. A color image adds a third dimension with three channels: Red, Green, Blue (RGB).
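As a minimal sketch of that representation, assuming NumPy is available, the arrays below build a tiny grayscale image and a tiny RGB image. The pixel values and variable names are made up for illustration.

```python
import numpy as np

# A 2x3 grayscale image: one brightness value (0-255) per pixel.
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# A 2x3 color image: height x width x 3 channels (R, G, B).
color = np.zeros((2, 3, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]   # top-left pixel is pure red
color[1, 2] = [0, 0, 255]   # bottom-right pixel is pure blue

print(gray.shape)    # (2, 3)
print(color.shape)   # (2, 3, 3)
print(gray[0, 2])    # 255, the white pixel in the top-right corner
print(color[0, 0])   # the three channel values of the red pixel
```

Indexing follows the row, column, channel order, so `color[0, 0]` is the top-left pixel and `color[0, 0, 0]` is its red value.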
Image dimensions
A 1920×1080 color photo has 1920 × 1080 × 3 = 6,220,800 numbers. That is the raw data a vision model must process for a single frame.
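The same arithmetic can be checked in a few lines. This is only a sketch: the bytes-per-value figures assume 8-bit integers for stored pixels and 32-bit floats for model input, which are common but not universal choices.

```python
width, height, channels = 1920, 1080, 3

values_per_frame = width * height * channels
print(values_per_frame)            # 6220800

# Assumed storage sizes: 1 byte per value as uint8, 4 bytes as float32.
bytes_uint8 = values_per_frame * 1
bytes_float32 = values_per_frame * 4
print(bytes_uint8 / 1e6)           # 6.2208 (megabytes)
print(bytes_float32 / 1e6)         # 24.8832 (megabytes)
```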
Convolutional Neural Networks (CNNs)
CNNs are the workhorse of computer vision. Instead of looking at every pixel independently, they slide small filters (kernels) across the image to detect local patterns — edges, textures, shapes — and build up increasingly abstract representations layer by layer.
- Convolutional layers — detect local patterns using learnable filters.
- Pooling layers — reduce spatial dimensions, keeping the most important information.
- Fully connected layers — combine extracted features to make a final prediction.
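To make the first two layer types concrete, here is a minimal NumPy sketch of a convolution followed by max pooling on a toy image. The kernel is hand-picked to respond to vertical edges; in a real CNN the kernel weights are learned during training. The helper names `conv2d` and `max_pool2d` are illustrative and not taken from any library.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Toy 6x6 image: dark left half (0), bright right half (255).
image = np.zeros((6, 6))
image[:, 3:] = 255

# Hand-picked vertical-edge kernel; a CNN would learn such weights.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

edges = conv2d(image, kernel)   # shape (4, 4)
pooled = max_pool2d(edges)      # shape (2, 2)
print(edges)
print(pooled)
```

The convolution output is nonzero only where the 3x3 window straddles the boundary between the dark and bright halves, which is what detecting an edge means here. Pooling then shrinks the 4x4 map to 2x2 while keeping the strongest response in each window.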
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor: two conv + ReLU + pool stages.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),   # 3 RGB channels -> 16 filters
            nn.ReLU(),
            nn.MaxPool2d(2),                  # halve spatial dims (32x32 -> 16x16)
            nn.Conv2d(16, 32, 3, padding=1),  # 16 -> 32 filters
            nn.ReLU(),
            nn.MaxPool2d(2),                  # halve again (16x16 -> 8x8)
        )
        # Classifier: flatten the feature maps and map them to class scores.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 10),        # 10 output classes; the 8x8 size implies 32x32 input images
        )

    def forward(self, x):
        return self.classifier(self.features(x))
Image classification
Classification assigns a single label to an entire image. Pretrained models like ResNet, EfficientNet, and Vision Transformers (ViT) can distinguish 1,000 or more object categories (the standard ImageNet benchmark has 1,000), with accuracy on such benchmarks approaching reported human performance.
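A classifier typically produces one raw score per class and converts those scores to probabilities; the predicted label is the class with the highest probability. The sketch below illustrates this with a softmax over made-up scores. The class names and numbers are hypothetical and do not come from any real model.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class names and raw scores (logits), for illustration only.
labels = ["cat", "dog", "car"]
logits = [4.0, 1.5, -2.0]

probs = softmax(logits)
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

for label, prob in ranked:
    print(f"{label:10s} {prob:.3f}")

predicted = ranked[0][0]  # the single label assigned to the image
```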
from transformers import pipeline

# Load a pretrained Vision Transformer image classifier.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("https://upload.wikimedia.org/wikipedia/commons/4/4d/Cat_November_2010-1a.jpg")

# Print the three highest-scoring labels.
for r in result[:3]:
    print(f"{r['label']:30s} {r['score']:.3f}")
# tabby, tabby cat 0.931
# tiger cat 0.042
# Egyptian cat 0.018
Object detection
While classification answers what is in the image, object detection also answers where. It predicts a bounding box and a label for each detected object. Popular architectures include YOLO (You Only Look Once), Faster R-CNN, and DETR (Detection Transformer).
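Each detection can be thought of as a label, a confidence score, and the corner coordinates of a box. The sketch below filters a list of detections by score and computes each box's size. The detections are made-up values, the `min_score` threshold is an arbitrary choice, and the dictionary layout (`xmin`, `ymin`, `xmax`, `ymax`) is assumed to match what the Hugging Face object-detection pipeline returns.

```python
# Hypothetical detections, for illustration only.
detections = [
    {"label": "cat", "score": 0.98,
     "box": {"xmin": 10, "ymin": 20, "xmax": 210, "ymax": 320}},
    {"label": "cat", "score": 0.35,
     "box": {"xmin": 5, "ymin": 5, "xmax": 40, "ymax": 30}},
    {"label": "remote", "score": 0.91,
     "box": {"xmin": 300, "ymin": 100, "xmax": 360, "ymax": 130}},
]

def box_size(box):
    """Return (width, height) of a box given its corner coordinates."""
    return box["xmax"] - box["xmin"], box["ymax"] - box["ymin"]

def keep_confident(dets, min_score=0.5):
    """Drop detections whose confidence is below min_score."""
    return [d for d in dets if d["score"] >= min_score]

confident = keep_confident(detections)
for d in confident:
    w, h = box_size(d["box"])
    print(f"{d['label']:10s} score={d['score']:.2f} size={w}x{h}")
```

Raising or lowering `min_score` trades missed objects against false alarms, so the right value depends on the application.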
from transformers import pipeline

# Load a pretrained DETR object detector.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
results = detector("https://images.unsplash.com/photo-1529778873920-4da4926a72c2?fm=jpg&q=60&w=3000&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxzZWFyY2h8M3x8a2l0dHklMjBjYXR8ZW58MHx8MHx8fDA%3D")

# Each result has a label, a confidence score, and a bounding box.
for r in results:
    print(f"{r['label']:15s} score={r['score']:.2f} box={r['box']}")
Transfer learning
Training a large vision model from scratch typically requires a very large labeled dataset, often on the order of a million images or more. Transfer learning lets you take a model pretrained on a huge dataset (like ImageNet) and fine-tune it on your small, specific dataset. In the simplest setup, you keep the learned feature detectors frozen and retrain only the final classification layers.
Start with pretrained models
For most real-world tasks, fine-tuning a pretrained model beats training from scratch. You get better accuracy with less data and much less compute.
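The freeze-and-retrain recipe can be sketched in PyTorch. To stay self-contained, the example uses a small stand-in network instead of a real pretrained model, so no weights are downloaded; in practice the `backbone` would be loaded with pretrained weights. The layer sizes, the 5-class head, and the learning rate are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained feature extractor (hypothetical; a real one
# would carry weights learned on a large dataset such as ImageNet).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # collapse each feature map to one value
    nn.Flatten(),              # -> shape (batch, 16)
)

# Freeze the backbone so its filters are not updated during training.
for param in backbone.parameters():
    param.requires_grad = False

# New classification head for the small target dataset (5 classes assumed).
head = nn.Linear(16, 5)
model = nn.Sequential(backbone, head)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# One training step on random stand-in data.
images = torch.randn(4, 3, 32, 32)
targets = torch.randint(0, 5, (4,))
loss = nn.functional.cross_entropy(model(images), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # only the head's weight and bias remain trainable
```

Freezing the backbone keeps the number of trained parameters small, which is why this approach can work with limited data and compute.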
Real-world applications
- Medical imaging: detecting tumors, fractures, and retinal diseases from X-rays and MRIs.
- Autonomous vehicles: recognizing pedestrians, traffic signs, and lane markings.
- Manufacturing: quality control by spotting defects on assembly lines.
- Agriculture: drone imagery analysis for crop health monitoring.
- Retail: visual search ("find products that look like this photo").
Bias in vision models
Models trained on unrepresentative data can perform poorly on underrepresented groups. Always evaluate your model across diverse demographics before deployment.
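One simple way to act on this is to report accuracy separately for each group instead of a single overall number. The sketch below does that with made-up predictions; the group names, labels, and records are hypothetical placeholders, and a real evaluation would use a held-out dataset with reliable group annotations.

```python
from collections import defaultdict

# Hypothetical evaluation records: (group, true_label, predicted_label).
records = [
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "match"),
    ("group_a", "match", "match"),
    ("group_b", "match", "no_match"),
    ("group_b", "no_match", "no_match"),
    ("group_b", "match", "no_match"),
    ("group_b", "match", "match"),
]

def accuracy_by_group(rows):
    """Return {group: fraction of correct predictions} for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in rows:
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

overall = sum(t == p for _, t, p in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall: {overall:.2f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
```

In this toy data the overall accuracy of 0.75 hides the gap between 1.00 for one group and 0.50 for the other, which is the kind of disparity a single aggregate metric can mask.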