AIbyExamples
Chapter 6

Natural Language Processing

How computers understand, generate, and translate human language — from tokenization to transformers.

Natural Language Processing (NLP) is the branch of AI that deals with the interaction between computers and human language. Every time you ask a voice assistant a question, use autocomplete in a search bar, or get a machine translation, NLP is doing the heavy lifting.

Why is language hard for machines?

Language is ambiguous, context-dependent, and constantly evolving. The sentence "I saw her duck" has at least two meanings: either I saw a duck that belongs to her, or I saw her lower her head. Handling this kind of ambiguity is what makes NLP both fascinating and challenging.

Tokenization — breaking text into pieces

The first step in most NLP pipelines is tokenization: splitting raw text into smaller units called tokens. Depending on the method, tokens can be words, sub-words, or individual characters.

python
import nltk
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import word_tokenize

text = "AI by Examples makes NLP approachable!"
tokens = word_tokenize(text)
print(tokens)
# ['AI', 'by', 'Examples', 'makes', 'NLP', 'approachable', '!']
Simple word tokenization with NLTK.
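
To make the three token granularities concrete, here is a minimal sketch using only the standard library. The regular expression, the tiny `VOCAB` set, and the `subword_split` helper are illustrative assumptions; real sub-word tokenizers (BPE, WordPiece) learn their vocabulary from data rather than using a hand-written one.

```python
import re

text = "AI by Examples makes NLP approachable!"

# Word-level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, spaces included.
char_tokens = list(text)

# Toy sub-word vocabulary (hypothetical; real vocabularies are learned).
VOCAB = {"approach", "able", "token", "ization"}

def subword_split(word, vocab=VOCAB):
    """Greedily take the longest known piece, else fall back to one character."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

print(word_tokens)
# ['AI', 'by', 'Examples', 'makes', 'NLP', 'approachable', '!']
print(char_tokens[:5])
# ['A', 'I', ' ', 'b', 'y']
print(subword_split("approachable"))
# ['approach', 'able']
```

Sub-word splitting is a middle ground: the vocabulary stays small, yet a word the tokenizer has never seen can still be represented as a sequence of known pieces.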

Bag of Words and TF-IDF

Once text is tokenized, we need to turn it into numbers — models only understand vectors. Two classic approaches:

  • Bag of Words (BoW) — count how often each word appears. Simple but ignores word order.
  • TF-IDF — weight each word by how important it is in a document relative to the whole collection. Rare, meaningful words score higher.
python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Machine learning is great",
    "Deep learning is a subset of machine learning",
    "NLP processes human language",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))
TF-IDF vectorization with scikit-learn.
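
To see what happens under the hood, here is a rough from-scratch sketch of both ideas on the same three documents. It assumes the plain textbook formula, (count / document length) × ln(N / document frequency), with simple whitespace tokenization; the `tf_idf` helper is illustrative. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "Machine learning is great",
    "Deep learning is a subset of machine learning",
    "NLP processes human language",
]

# Bag of Words: one Counter of lowercase word counts per document.
tokenized = [doc.lower().split() for doc in docs]
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each word appears.
df = Counter(word for counts in bow for word in counts)

def tf_idf(word, doc_index):
    """Textbook TF-IDF: (count / doc length) * ln(N / df)."""
    counts = bow[doc_index]
    if word not in counts:
        return 0.0
    tf = counts[word] / len(tokenized[doc_index])
    idf = math.log(len(docs) / df[word])
    return tf * idf

print(bow[1]["learning"])               # 2
print(round(tf_idf("learning", 1), 3))  # 0.101
print(round(tf_idf("deep", 1), 3))      # 0.137
```

In the second document, "learning" appears twice and "deep" only once, yet "deep" gets the higher score because it occurs in no other document.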

Word embeddings

The breakthrough that transformed NLP was learning dense vector representations of words — called embeddings. Words with similar meanings end up close together in vector space. The famous example: king − man + woman ≈ queen.

  • Word2Vec — learns embeddings by predicting context words (or vice-versa).
  • GloVe — uses global word co-occurrence statistics.
  • FastText — extends Word2Vec to handle sub-word units, great for morphologically rich languages.
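
The vector arithmetic behind the king/queen example can be sketched with a handful of hand-made vectors. Everything here is a toy assumption: the three dimensions, the values in `EMBEDDINGS`, and the `analogy` helper are invented for illustration, whereas real embeddings have hundreds of learned dimensions.

```python
import math

# Hand-made 3-d vectors (hypothetical): the axes loosely mean
# [royalty, maleness, femaleness]. Real embeddings are learned from text.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = (w for w in EMBEDDINGS if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

print(analogy("king", "man", "woman"))  # queen
```

Cosine similarity compares directions rather than lengths, which is why it is the usual way to measure how close two embeddings are.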

Transformers and attention

The Transformer architecture, introduced in 2017, replaced sequential models such as RNNs with a mechanism called self-attention. Instead of reading text one token at a time, a transformer processes all tokens in parallel and learns how relevant each token is to every other token. This is the foundation of BERT, GPT, and nearly every modern large language model.

Attention is all you need

The title of the original transformer paper became a running joke in AI. The key insight: let the model decide which parts of the input to focus on, rather than processing everything in a fixed order.
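
Here is a stripped-down sketch of scaled dot-product self-attention in plain Python. It makes several simplifying assumptions: queries, keys, and values are all the raw input vectors (a real transformer first applies learned projection matrices), there is a single head, and there is no positional encoding or masking. The toy `tokens` values are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = input."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score every position against this query, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        output = [sum(w * v[i] for w, v in zip(weights, vectors))
                  for i in range(dim)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (hypothetical values).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token, itself included. The first token puts more weight on the similar second token than on the dissimilar third one, and no token depends on having processed the ones before it.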

Named Entity Recognition (NER)

NER identifies and classifies key entities in text — people, organizations, locations, dates, and more. It is essential for information extraction pipelines.

python
from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into whole entities
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Elon Musk founded SpaceX in Hawthorne, California.")
for entity in result:
    print(f"{entity['word']:20s} {entity['entity_group']:10s} {entity['score']:.2f}")
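# Illustrative output; exact spans and scores depend on the model and version
# (for example, "Hawthorne" and "California" may come back as separate LOC entities):
# Elon Musk            PER        0.99
# SpaceX               ORG        0.99
# Hawthorne, California LOC       0.98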
NER with a pretrained transformer.

Text generation

Generative language models predict the next token given the preceding context. This simple principle, scaled to billions of parameters, produces remarkably coherent text.

python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
# max_new_tokens limits only the generated continuation, not the prompt
output = generator("The future of AI is", max_new_tokens=30, num_return_sequences=1)
print(output[0]['generated_text'])
Text generation with Hugging Face.
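
The next-token principle can be illustrated without a neural network at all. The sketch below is a toy bigram model: it counts which token follows which in a tiny made-up corpus and always picks the most frequent follower (greedy decoding). The corpus and the helper names `predict_next` and `generate` are illustrative assumptions; a model like GPT-2 conditions on a much longer context and usually samples rather than always taking the top choice.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, already split on spaces.
corpus = "the cat sat on the mat . the cat sat on the sofa ."
tokens = corpus.split()

# Count how often each token follows each other token.
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

def predict_next(token):
    """Return the most frequent follower of `token`, or None if unseen."""
    if token not in following:
        return None
    return following[token].most_common(1)[0][0]

def generate(start, max_tokens=5):
    """Greedy generation: repeatedly append the most likely next token."""
    output = [start]
    for _ in range(max_tokens):
        nxt = predict_next(output[-1])
        if nxt is None:
            break
        output.append(nxt)
    return " ".join(output)

print(predict_next("the"))            # cat
print(generate("cat", max_tokens=3))  # cat sat on the
```

The loop in `generate` has the same shape a large language model follows: predict one token, append it to the context, and repeat.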

Hallucinations

Language models can generate plausible-sounding but factually incorrect text. Always verify generated content, especially for high-stakes applications.