In 2017, a team at Google published a paper called "Attention Is All You Need".

Bold title for an academic paper. Sounds like a Beatles song about machine learning. But the idea inside it changed everything. That paper introduced a design called the transformer, and every major AI chatbot since has been built on it.

The trick: instead of reading text one word at a time, let every word look at every other word at once. Follow along as we trace what happens from the moment you type "What country is north of the US?" to the moment the model answers "Canada".

Breaking the text into pieces

The first thing the model does is chop your question into smaller pieces called "tokens". Every word in our example happens to be a single token, but longer words get split: "understand", for instance, would become "under" and "stand". Our input becomes a sequence of numbers, and from here on out, everything that follows is math.
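Here's a toy sketch of that lookup. Real tokenizers learn their vocabulary from data (usually with an algorithm called byte-pair encoding), and these token IDs are made up for illustration; only the shape of the idea matters.

```python
# Toy tokenizer: split on spaces, then map each piece to a number.
# The IDs below are invented for this example.
toy_vocab = {"What": 2061, "country": 1499, "is": 318, "north": 7866,
             "of": 286, "the": 262, "US?": 1294}

def tokenize(text):
    """Chop text into pieces and return their IDs (toy version)."""
    return [toy_vocab[piece] for piece in text.split()]

ids = tokenize("What country is north of the US?")
print(ids)  # seven numbers -- the model only ever sees these
```

From the model's point of view, your question is now just that list of seven integers.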

The model has a vocabulary of roughly 50,000 tokens.

Same as a well-read teenager, except this one built its entire word list by consuming the sum of all human knowledge… every book, every forum post, every Wikipedia edit.

Before the model can think about your question, it has to stop seeing words and start seeing numbers. That translation step—turning language into math—is called tokenization. Everything after this point is arithmetic.

Assign each piece a meaning

Each token is placed on a map of meaning. "Country," "US," and "north" cluster in one region because they share a geographical connection. "The" and "of" huddle together with the other glue words.

The model also stamps a position number onto each token so it knows that "the dog bit the man" and "the man bit the dog" are very different headlines.

Nobody programmed these positions. The model figures them out after reading roughly two trillion words, compressed into a single map that our seven tokens now sit on.

Imagine a city map where every word has an address, and similar words live in the same neighbourhood. Building that map—giving each fragment a location in meaning-space—is what engineers call embedding.

Words paying attention to each other

Now every word turns to every other word and asks: "How relevant are you to me?"

When "north" looks around, it locks onto "US" and "country" because of geography. North of what matters. It ignores "is" and "the" because they don't carry that signal.

The model runs this check across dozens of parallel passes: one catches grammar, another geography, another notices that "what" plus "country" signals a question expecting a place name.

All of this is controlled by roughly 175 billion knobs, tuned during training and all applied at once when you ask your question.

Instead of reading left to right like we do, the model lets every word look at every other word simultaneously. That parallel awareness is called self-attention, and the multiple angles it checks from are called attention heads.
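A single attention head fits in a few lines. This is a minimal NumPy sketch of scaled dot-product self-attention with random stand-in weights; a real model learns the weight matrices and runs many heads side by side.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One attention head: every token scores every other token,
    then takes a weighted mix of their values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # "how relevant are you to me?"
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 8))                   # our 7 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (7, 8): each token now carries context from the others
```

The `weights` matrix is the part worth staring at: row 4 (the token "north") holds the relevance scores "north" assigned to every other token.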

Digesting and refining

After gathering context from the other words, each word runs through a small internal network on its own. If the attention step was buying ingredients, this is the cooking.

The model expands, transforms, and compresses each word's meaning through roughly a billion math operations. A human with a calculator doing one operation every five seconds would need about 160 years to process a single word through this step.

The original meaning gets added back in at the end. The model never forgets what it started with; like writing notes in the margin instead of rewriting the whole page.

Each word's meaning gets expanded, mixed, and compressed back down through what's called a feed-forward network—a small, fast calculator that runs independently on every word.
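Sketched in NumPy, the expand-transform-compress shape looks like this. The weights are random stand-ins (a real model learns them, and expands roughly 4x, e.g. 12,288 dimensions up to 49,152); the `x +` at the end is the margin note from above.

```python
import numpy as np

def gelu(x):
    """Smooth activation used inside most transformer feed-forward blocks."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Expand, transform, compress -- applied to each token independently."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, hidden = 8, 32                           # toy sizes
W1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, d)), np.zeros(d)

x = rng.normal(size=(7, d))                 # 7 tokens from the attention step
out = x + feed_forward(x, W1, b1, W2, b2)   # "+ x": the original meaning is
print(out.shape)                            # added back in at the end
```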

Repeating until it clicks

Everything in the last two sections (attention, then feed-forward) is one layer.

A modern LLM stacks 96 of these layers. Same two jobs every time: pay attention, then digest. But each layer catches something the previous one missed.

The first few layers catch surface-level stuff: grammar, word proximity, basic syntax. Middle layers connect dots across the sentence. By the final layers, "north" goes from meaning "a direction" to meaning "the specific direction being asked about in relation to the United States, in the context of a question expecting a country name."

Your question runs through about a trillion math operations across all 96 layers, and the whole thing finishes in roughly four milliseconds.

One pass of attention + feed-forward is called a transformer layer. Stack 96 of them and you get a model that builds understanding the way a photograph develops—blurry shapes first, fine detail last.
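The stacking itself is just a loop. In this sketch, `attend` and `digest` are stand-ins for the two sub-steps above; the zero-returning lambdas exist only to show the control flow, not the math.

```python
def transformer(x, layers):
    """Run the token representations through every layer in order.
    `layers` is a list of (attend, digest) function pairs."""
    for attend, digest in layers:
        x = x + attend(x)   # step 1: mix in context from the other tokens
        x = x + digest(x)   # step 2: refine each token on its own
    return x

# 96 identical layers; identity stand-ins just to show the shape of the loop
layers = [(lambda x: 0.0 * x, lambda x: 0.0 * x)] * 96
print(transformer(1.0, layers))
```

Note that the residual additions (`x + ...`) mean each layer only ever edits the running representation; it never starts from scratch.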

Picking the most likely word

Finally, the model scores every word in its vocabulary as a possible next word. All 50,000 of them. A 50,000-contestant talent show, judged in under a millisecond. "Canada" gets 89%. "Mexico" gets 8%. "Banana" gets effectively zero.

The model then amplifies the differences: high scores get pushed higher, low scores get crushed. A dial called "temperature" controls how much. Turn it down and the model always picks the safe answer. Crank it up and underdogs get a fighting chance. (That's why chatbots sometimes say weird things. Someone turned up the dial.)

The model doesn't pick a word. It scores every word in its vocabulary and the highest score wins. That scoring function—turning raw numbers into a probability ranking—is called softmax.
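Softmax plus the temperature dial is only a few lines. The raw scores below are invented to roughly match the example above; watch how lowering the temperature crushes the underdogs and raising it revives them.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature controls how
    sharply the differences get amplified."""
    scaled = np.asarray(logits, dtype=float) / temperature
    e = np.exp(scaled - scaled.max())   # subtract max for numerical safety
    return e / e.sum()

logits = {"Canada": 9.0, "Mexico": 6.6, "Banana": -5.0}   # made-up scores
for t in (1.0, 0.5, 2.0):
    probs = softmax_with_temperature(list(logits.values()), t)
    print(t, dict(zip(logits, probs.round(3))))
```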

Doing it all over again

Out comes "Canada". That single word gets appended to the original input, and the entire pipeline loops back to step 1.

The model now sees "What country is north of the US? Canada" and predicts what should come next. Then again. Then again. One word at a time, every time.

A 500-word answer means the model runs this entire seven-step pipeline roughly 650 times. If each pass took a second instead of four milliseconds, you'd wait about ten minutes for a single reply.

Instead it happens fast enough that the words appear to stream onto your screen in real time. Every word you see was chosen the same way: break it up, map it, pay attention, digest, repeat 96 times, score 50,000 options, pick one.

The model only ever predicts one word at a time. Then it feeds that word back into itself and predicts the next one. This one-word-at-a-time loop is called autoregressive generation—the model literally writes by reading what it just wrote.
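The loop itself looks like this. Here `predict_next` is a canned stand-in for the whole pipeline above (tokenize, embed, 96 layers, softmax); it just walks through a scripted answer so the feedback loop is visible.

```python
def predict_next(tokens):
    """Stand-in for one full pass of the pipeline: returns the next word.
    Scripted for illustration; a real model computes this."""
    answer = ["Canada", ".", "<end>"]
    return answer[len(tokens) - 7]   # the 7-token question is already present

def generate(prompt_tokens):
    tokens = list(prompt_tokens)
    while True:
        word = predict_next(tokens)   # one complete pipeline pass
        if word == "<end>":
            return tokens
        tokens.append(word)           # feed the output back into the input

question = ["What", "country", "is", "north", "of", "the", "US?"]
print(generate(question))
```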

Nobody taught this system that Canada is north of the United States. Nobody programmed rules about borders, grammar, or how to answer questions.

This entire system comes from one procedure, repeated over and over on an enormous amount of text: predict a word, get it wrong, adjust the knobs, try again. Repeat a few billion times.
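That knob-adjusting game, shrunk to its smallest possible form: three knobs, a three-word vocabulary, and the correct next word always at index 0. The update rule is the gradient of cross-entropy loss; a real model does the same thing with billions of knobs.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=3)          # the knobs, started at random

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

target = np.array([1.0, 0.0, 0.0])   # the word that actually came next

for step in range(500):
    probs = softmax(logits)          # current guess
    logits -= 0.1 * (probs - target) # nudge the knobs toward the answer

print(softmax(logits).round(2))      # probability mass piles onto word 0
```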

So there it is. Every capability that surprises you about these models traces back to one dumb, simple game: guess what comes next.

Common Misconceptions

"It understands what it's saying."
It doesn't. The model is doing one thing: predicting the most likely next word based on patterns it absorbed from training data. When it gives you a perfect answer about gravity, it hasn't understood physics. It has seen millions of sentences about gravity and learned which words tend to follow which. The output sounds like comprehension because we're wired to assume that capability equates to thinking. Same instinct that makes people apologize to Roombas.

"If it sounds confident, it must be right."
Confidence and correctness are completely unrelated inside a language model. The model was trained to produce fluent, decisive-sounding text—not to flag uncertainty. Studies show it's wrong roughly one in four times, and it delivers those wrong answers with the same clean, step-by-step tone as the right ones. Worse, the training process actually rewards confident guessing over honest uncertainty. A model that hedges gets penalized; a model that bluffs gets promoted. Treat every answer like a smart friend's opinion: probably useful, worth double-checking.

"It has a copy of everything it was trained on."
It doesn't store documents the way a hard drive stores files. During training, the model distilled patterns from text—it doesn't keep retrievable copies. It can't look up "that New York Times article from March 5th" any more than you can replay a specific meal you ate in 2019. Occasionally it can regurgitate short passages it saw thousands of times (a quirk called memorization), but that's an edge case, not how it works. Its knowledge is dissolved across billions of numerical weights, not a text filing cabinet.
