I didn’t really like GenAI because it hallucinates, consumes lots of energy, has driven up memory and SSD prices, and so on. But as an IT engineer, I can’t ignore it. In this post, I’ll try to learn how GenAI (specifically, LLMs) works by asking an AI a lot of questions.

I mainly use Copilot because it has the most lenient hourly/daily usage limits. Below are mainly outputs from Copilot (modified and summarized by me). Sorry if I didn’t remove all the hallucinations. As it turned out, LLMs are a pretty interesting technology. Let’s see whether I can learn something as complex as an LLM from AI.

Phases of an LLM

  1. Pre-training
  2. Fine-tuning/Alignment
  3. Inference

Pre-training

The model learns general language patterns from huge text datasets. The model reads massive amounts of text (books, code, articles) and learns statistical patterns: grammar, facts, reasoning structures, and how words relate.

  • The model predicts the next token repeatedly.
  • Billions of predictions teach it how language works.
  • It learns general capabilities (reasoning, coding, summarization) without being told explicit rules.

Fine-tuning/Alignment

The model is adjusted to behave safely, follow instructions, or specialize in a domain.

  • Supervised fine‑tuning (SFT): Humans provide example prompts and ideal answers.
  • Reinforcement learning (RLHF or RLAIF): Humans or AI judge outputs, and the model learns preferred behavior.
  • Domain fine‑tuning: Optional specialization (e.g., medical, legal, coding).

Inference

This is when we use ChatGPT, Gemini, etc. The trained model takes user input and generates output token by token.

  • Your prompt is converted to tokens.
  • The model predicts the next token repeatedly.
  • It uses attention layers to decide which parts of the prompt matter.
  • It streams output until it reaches a stop condition.
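The loop above can be sketched in (very) simplified Python. Here `predict_next` and `STOP` are hypothetical stand-ins for a real model’s forward pass and end-of-sequence token:

```python
# A minimal sketch of the token-by-token inference loop. `predict_next`
# is a toy stand-in for the full forward pass through the model; STOP
# plays the role of the end-of-sequence token.
STOP = -1

def predict_next(tokens):
    # Toy rule: keep emitting token 42 until the sequence has 5 tokens.
    return 42 if len(tokens) < 5 else STOP

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)           # the prompt, already tokenized
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)  # predict the next token
        if next_token == STOP:             # stop condition reached
            break
        tokens.append(next_token)          # feed it back and repeat
    return tokens
```

A real model would also stream each appended token back to the user as it is produced.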

So far, so good. Let’s move on to a high-level overview of LLMs.

High‑level overview

Transformer

A Transformer is the foundation of modern LLMs (including ChatGPT, Gemini, and Claude). It’s a neural-network architecture built around self-attention, which lets the model examine all tokens in a sequence at once and determine which ones are most relevant.

Text is broken into “tokens” (words and sub-words). Each token is mapped to a vector the model can operate on, and that vector flows through many Transformer layers. As it moves through the stack, the model enriches the vector with meaning, nuance, and contextual relationships.

In this article, I use the terms GenAI ≈ LLM ≈ Transformer more or less interchangeably.

High-level data flow

After the text is tokenized (i.e., split into tokens), each token flows through the following components:

  1. Embedding

    • Convert the token to a token ID, then to a vector.
    • This vector encodes the token’s meaning in a form the model can manipulate.
  2. Transformer blocks (stacked many times)

    Each block refines the token representation through two core sublayers:

    a. Attention (self-attention)

    • Each token “looks at” all other tokens in the sequence.
    • It computes relationships and decides which tokens are important.
    • Attention adds positional and contextual information from the rest of the sequence to the token’s vector (i.e., context).

    b. MLP (Multi-Layer Perceptron)

    • A small neural network applied to each token independently.
    • It transforms and recombines features discovered by attention.
    • The MLP enriches the contextualized token with knowledge that the model acquired during training.

    Each sublayer is wrapped with residuals and layer norm to keep training stable and representations consistent.

    Across all these layers, the token vector keeps the same shape, but its contents become increasingly rich and context‑aware.

  3. LM head, Softmax, etc.

    After the final Transformer block, each token’s vector has been fully enriched with context and knowledge, and is ready to produce the next token.

    a. Final LayerNorm

    • The model applies one last normalization to stabilize the vector

    b. LM head (LM=Language Modeling)

    • The normalized vector is passed through a large linear layer whose output dimension equals the vocabulary size.
    • This produces a logits vector: one unnormalized score for every possible token.

    c. Softmax

    • Softmax converts the logits into a probability distribution.
    • Each value (across the entire vocabulary) represents the model’s estimated probability that this token is the next one.

Key concepts

Tokens

A token is a small chunk of text (a word, sub-word, or symbol) that the model processes one step at a time.

  • “international” → “inter”, “nation”, “al”
  • “cat” → “cat”
  • “Boston” → “Bos”, “ton”

The model predicts the next token based on previous tokens.

Embeddings

An embedding is a numerical representation of a token. It is a vector of real numbers that captures the token’s meaning and its similarity to other tokens.

Conceptual example:

"cat" → (1.3, 0.2, ..., -0.4)

Each dimension of the vector encodes some learned feature of how the token is used in language. Tokens that appear in similar contexts get embeddings that point in similar directions. Embedding tokens lets the model mathematically measure semantic similarity using vector distance or cosine similarity.

Vector distance:

\[ distance(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2} \]

  • Small distance → similar meaning

Cosine similarity:

\[ \text{cosine\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} \]

where \(\mathbf{u} \cdot \mathbf{v}\) is the dot product between vectors u and v

  • Value ranges from –1 to 1
  • 1 → vectors point in the same direction (very similar)
  • 0 → unrelated
  • –1 → opposite meaning (rare in embeddings)

Example:

  • “cat” and “dog” end up close together because they appear in similar contexts (pets, animals).
  • “run” and “running” are close because they share grammatical and semantic roles.
  • “king – man + woman ≈ queen” shows how relationships can be encoded as vector arithmetic.
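Both similarity measures are easy to sketch in pure Python. The 3-dimensional “embeddings” below are made up for illustration; real embeddings have thousands of dimensions:

```python
import math

def distance(u, v):
    # Euclidean distance: small distance -> similar meaning
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_sim(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (made up for illustration):
cat = (1.0, 0.9, 0.1)
dog = (0.9, 1.0, 0.2)
car = (0.1, 0.2, 1.0)

# cat/dog point in similar directions; cat/car do not.
assert cosine_sim(cat, dog) > cosine_sim(cat, car)
```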

Embeddings are the model’s numerical input. After tokenization, each token is mapped to its embedding vector. These vectors then flow into the model’s layers and attention mechanism, where they are transformed repeatedly to help the model predict the next token.

Throughout the Transformer stack, the model dimension (the vector size of each token representation) stays the same.

Transformer blocks/layers

A Transformer block/layer is the repeated building unit inside a Transformer model. Each block performs two core operations that refine token representations: self-attention and a feed-forward network, wrapped with residual connections and normalization.

A Transformer block/layer contains:

  • Multi-head self-attention — Each token looks at all other tokens and computes weighted interactions. This captures relationships like coreference, syntax, and long-range dependencies.
  • Feed-forward network (FFN) — A small MLP applied independently to each token, adding nonlinearity and capacity.
  • Residual connections — Preserve information and stabilize training by adding the input back to each sublayer’s output.
  • Layer normalization — Normalizes activations to improve stability and convergence.

A token starts as an embedding vector of size `d_model`, and each block transforms it while keeping that dimensionality (i.e., the vector length). You can imagine a token’s vector flowing through many blocks, being enriched with more contextual information at each step.

Attention (self-attention)

Attention assigns a weight (0–1) to each token in the sequence relative to others. Attention is the model’s short-term focus mechanism (computed fresh for every prompt). This lets the model understand relationships like:

  • subject ↔ verb
  • pronoun ↔ reference
  • cause ↔ effect

Attention is computed many times in parallel across multiple “heads,” each focusing on different patterns (syntax, meaning, long-range dependencies).

MLP (Multi-Layer Perceptron)

An MLP is a feed-forward (i.e., one-way) neural network made of fully connected layers with nonlinear activations.

Core structure:

  • Input layer — receives the input vector.
  • Hidden layer — expands the dimension (e.g., 4× wider), applies a linear transform Wx + b, then applies a nonlinear activation (e.g., GELU).
  • Output layer — projects the expanded vector back down to the original dimension.

Hidden layer computation: h(x) = φ(Wx + b) where

  • W is a weight matrix
  • x is the input vector
  • b is a bias vector
  • φ is a nonlinear activation function

Reason for combining linear + nonlinear steps:

  • Without φ, the whole network would be just one big linear function.
  • With φ, the network can approximate arbitrarily complex functions (Universal Approximation).
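The hidden-layer computation h(x) = φ(Wx + b) can be sketched in pure Python, with GELU (defined via the error function) as the activation φ. The weights and input below are made up:

```python
import math

def gelu(x):
    # Exact GELU via the error function: x * Phi(x)
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def hidden_layer(W, x, b, phi):
    # h(x) = phi(W x + b), with W given as a list of rows
    return [phi(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# Toy 2x3 weight matrix: 3-dim input -> 2-dim hidden vector
W = [[0.5, -1.0, 0.0],
     [1.0,  0.5, 0.5]]
b = [0.1, -0.2]
x = [1.0, 2.0, 3.0]
h = hidden_layer(W, x, b, gelu)   # a 2-dim nonlinear feature vector
```

Swapping `gelu` for the identity function would make `hidden_layer` purely linear, which is exactly what the φ step is there to avoid.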

Weights

Weights are the learned numerical parameters of a neural network. They are optimized during pre‑training and fine‑tuning, and remain frozen during inference.

A weight is a single floating‑point value that determines how strongly one piece of information influences another:

  • Large positive weight → amplifies a signal
  • Large negative weight → inverts or suppresses a signal
  • Weight near zero → ignores that signal

Weights are arranged into matrices. There are many weights used throughout a Transformer model.

Examples:

  • Embedding matrix E is a weight matrix which converts a token ID to a token vector.
  • Attention projection matrices Wq,Wk,Wv are weight matrices that transform token vectors into queries(Q), keys(K), and values(V), enabling the model to compute how relevant each token is to every other token.

You’ll see many WX + b calculation patterns in a model, where W is a weight matrix, X is a hidden state (the input vector), and b is a bias vector.

How LLM works

Tokenizing

The tokenizer uses a fixed lookup table to split a sequence into token IDs. For example,

  • “inter” → 5021
  • “nation” → 1834
  • “al” → 291
  • “ization” → 7420
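A toy sketch of the lookup, using the (made-up) IDs above. Real tokenizers (e.g., BPE) are more sophisticated than this greedy longest-match:

```python
# Toy lookup table with the (made-up) IDs from the example above.
VOCAB = {"inter": 5021, "nation": 1834, "al": 291, "ization": 7420}

def tokenize(text):
    # Greedy longest-match split (real tokenizers like BPE are subtler).
    ids = []
    while text:
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(VOCAB[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError("no matching token")
    return ids

# tokenize("internationalization") -> [5021, 1834, 291, 7420]
```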

Embedding

Converts each token to a token vector using the embedding matrix.

Embedding matrix E:

\(E \in \mathbb{R}^{V \times d_{\text{model}}}\) – a matrix of V rows × \(d_{\text{model}}\) columns of real numbers, where

  • V: vocabulary size (e.g., 50k)
  • \(d_{\text{model}}\): model width (e.g., 4096)

It’s a simple lookup.

X0 = E[token_ID]

X0 is the token vector of dimension d_model. It will be the input parameter for the first Transformer block.
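A toy sketch of the lookup (made-up numbers; real models use something like V = 50k and d_model = 4096):

```python
# Toy embedding matrix E: V=4 vocabulary rows, d_model=3 columns.
E = [
    [0.1, 0.2, 0.3],   # token ID 0
    [0.4, 0.5, 0.6],   # token ID 1
    [0.7, 0.8, 0.9],   # token ID 2
    [1.0, 1.1, 1.2],   # token ID 3
]

token_id = 2
x0 = E[token_id]       # simple row lookup: the token's d_model-dim vector
```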

TF-block > Attention

Attention adds contextual information to each token: where the token sits in the sequence, whether it acts as a verb or a noun, and so on.

First, compute Q, K, V:

  • Q: query; what this token is looking for
  • K: key; what this token is offering as a “feature” to match against
  • V: value; the information this token provides if it’s attended to

The computation is simple and similar for Q, K, and V: just multiply the input matrix X by a learned matrix (Wq, Wk, or Wv).

Q = XWq, K = XWk, V = XWv

In case of Q = XWq,

  • \(X \in \mathbb{R}^{L \times d_{\text{model}}}\); the input matrix X
    • Each row = one token’s embedding/hidden state/vector
    • Length L rows (L=sequence length)
  • \(Wq \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}\)
    • One such matrix per attention head
    • d_head is typically d_model divided by the number of heads
  • \(Q \in \mathbb{R}^{L \times d_{\text{head}}}\)
    • Same number of rows as X
    • Each row has \(d_{\text{head}}\) elements

Recall from high school math that multiplying matrices of dimensions \((L, d_{model}) \times (d_{model}, d_{head})\) yields \((L, d_{head})\)

This is the same for K and V. The only difference is the learned matrices Wk and Wv.

Then, compute attention:

\(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_{\text{head}}}}\right)V\)

  • \(K^{T}\) is the transposed K; shape (\(d_{\text{head}}\), L)

This is complex. Let’s break into parts.

\(QK^{\mathsf{T}} \;\Rightarrow\; Q \in \mathbb{R}^{L \times d_{\text{head}}}\; K^{\mathsf{T}} \in \mathbb{R}^{d_{\text{head}} \times L} \;\Rightarrow\; QK^{\mathsf{T}} \in \mathbb{R}^{L \times L}\)

This is an attention score matrix. It highlights, for every token i, how strongly its query vector Q[i] matches the key vectors K[j] of all tokens in the sequence.

\(S = \frac{QK^{\mathsf{T}}}{\sqrt{d_{\text{head}}}}\)

Then the raw scores are divided by \(\sqrt{d_{\text{head}}}\) to shrink the dot products so the values do not grow too large as \(d_{\text{head}}\) increases. The result S is called the scaled scores.

Given a row of scores \(S[i] = (S_{i,1}, S_{i,2}, \ldots, S_{i,L})\), the softmax for the j-th entry is:

\(\mathrm{softmax}(S[i])_j = \frac{e^{S_{i,j}}}{\sum_{k=1}^{L} e^{S_{i,k}}}\)

They use \(e^x\) (the exponential function) because it magnifies differences, especially around the largest values. A slightly bigger score becomes significantly larger after exponentiation.

Softmax normalizes each row so that all values:

  • are between 0 and 1
  • sum to 1

Effectively, softmax turns the scaled scores S into a probability distribution over which tokens to attend to.
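A minimal softmax implementation (the max-subtraction is a standard numerical-stability trick and doesn’t change the result):

```python
import math

def softmax(scores):
    # Subtract the max first for numerical stability (same result).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

row = [2.0, 1.0, 0.1]
probs = softmax(row)
# All values are in (0, 1) and sum to 1; the largest score dominates.
```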

Again, in

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_{\text{head}}}}\right)V = \mathrm{softmax}(S)\,V , \]

\(QK^{\mathsf{T}}\) is the attention score matrix with shape \((L, L)\). Each entry \(S_{i,j}\) tells how strongly token \(i\) matches token \(j\). Dividing by \(\sqrt{d_{\text{head}}}\) and applying softmax simply normalizes these scores.

The final attention output is obtained by using these normalized weights (\(A_{i,j}\)) to blend the value vectors (\(V_{j}\)):

\[ O_i = \sum_{j=1}^{L} A_{i,j} V_j \]

So the output matrix has shape \((L, d_{\text{head}})\), and each row is a weighted sum (with weights A) of all the V vectors in the sequence. That is, the output matrix is yet another enriched hidden-state matrix.
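The whole attention computation can be sketched in pure Python. The toy Q, K, V below (L = 2 tokens, d_head = 2) are made up; a real model computes them with learned projections and runs many heads in parallel:

```python
import math

def matmul(A, B):
    # (n x k) @ (k x m) -> (n x m)
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_head = len(Q[0])
    Kt = [list(col) for col in zip(*K)]        # transpose: (d_head, L)
    S = [[s / math.sqrt(d_head) for s in row]  # scaled scores (L, L)
         for row in matmul(Q, Kt)]
    A = [softmax(row) for row in S]            # attention weights (L, L)
    return matmul(A, V)                        # output: (L, d_head)

# Toy example: L=2 tokens, d_head=2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
O = attention(Q, K, V)   # each row blends the V rows by softmax weights
```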

TF-block > MLP (Multi-Layer Perceptron)

The MLP adds to the contextualized token the knowledge that the model acquired during its training.

After the attention sublayer, the per-head outputs are concatenated and projected back to shape \((L, d_{\text{model}})\); this matrix is the input to the MLP block. The MLP has two linear layers (the \(Wx+b\) parts) with a non-linearity (σ) in between:

\[ \mathrm{MLP}(x) = W_2\,\sigma(W_1 x + b_1) + b_2 \]

In the \(W_1 x + b_1\) part,

  • \( x \in \mathbb{R}^{d_{\text{model}}} \)
  • \( W_1 \in \mathbb{R}^{d_{\text{mlp}} \times d_{\text{model}}} \)
    • \(d_{\text{mlp}}\) is typically 2–4 times larger than \(d_{\text{model}}\)
  • \( b_1 \in \mathbb{R}^{d_{\text{mlp}}} \)

So, this basically expands \(X\)’s dimension (but in a learned way) for the σ function to work well.

σ is the activation function. In modern Transformers, GELU (Gaussian Error Linear Unit) is the standard choice. The GELU curve looks like a smoother version of ReLU, where ReLU(x) = max(0, x) sets all negative values to zero. GELU instead keeps small negative values but down-weights them smoothly.

That σ is non-linear is extremely important. Without σ, the entire MLP would be a purely linear transformation. And if every Transformer block were linear, stacking many blocks would still collapse into a single linear function. A single linear function cannot represent the complex, hierarchical features that language requires.

The \(W_2\,\sigma(\cdot) + b_2\) part shrinks the σ output back down, so that \( \mathrm{MLP}(x) \in \mathbb{R}^{d_{\text{model}}} \) can be the input to the next Transformer block.

Notice that \(W_1 x + b_1\) and \(W_2\,\sigma(\cdot) + b_2\) are not just simple expansion and shrinkage: they use learned weights and biases, allowing the MLP to build stable, meaningful internal representations that attention alone cannot provide.
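The whole MLP block can be sketched in pure Python with toy sizes and made-up weights, using the row-vector convention \(xW + b\) (the same convention as \(XW_q\) earlier):

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), via the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    # Row-vector convention: y = x W + b, W has shape (len(x), len(b))
    return [sum(x_i * W[i][j] for i, x_i in enumerate(x)) + b[j]
            for j in range(len(b))]

def mlp(x, W1, b1, W2, b2):
    h = [gelu(v) for v in linear(x, W1, b1)]  # expand to d_mlp, apply sigma
    return linear(h, W2, b2)                  # project back to d_model

# Toy sizes: d_model=2, d_mlp=4 (real models use e.g. 4096 -> 16384)
W1 = [[0.1, 0.2, 0.3, 0.4],
      [0.5, 0.6, 0.7, 0.8]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
b2 = [0.0, 0.0]
y = mlp([1.0, -1.0], W1, b1, W2, b2)  # same d_model=2 shape as the input
```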

Final LayerNorm

In every Transformer block, after the Attention and MLP sublayers (each with a residual connection), the output is passed through Layer Normalization to stabilize the vector x.

The final LayerNorm takes a vector \(x \in \mathbb{R}^{d_{model}}\) and normalizes it per token.

LayerNorm has two learned parameter vectors:

  • Scale(gain): \(\gamma \in \mathbb{R}^{d_{model}}\)
  • Shift(bias): \(\beta \in \mathbb{R}^{d_{model}}\)

And a constant:

  • epsilon: \(\epsilon \approx 10^{-5}\)

Given a token vector \( x = (x_1, x_2, \ldots, x_{d_{\text{model}}}) \):

Mean: \[ \mu = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} x_i \]

Variance: \[ \sigma^2 = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2 \]

Normalize: \[ \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \]

Scale+shift: \[ \mathrm{LayerNorm}(x)_i = \gamma_i \hat{x}_i + \beta_i \]

This normalized vector, now called h (not x anymore), will be the input to the LM head.
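The four steps above fit in a few lines of pure Python (toy d_model of 4; γ and β would be learned in a real model):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    d = len(x)
    mu = sum(x) / d                            # mean
    var = sum((xi - mu) ** 2 for xi in x) / d  # variance
    # normalize, then scale and shift per dimension
    return [g * ((xi - mu) / math.sqrt(var + eps)) + b
            for xi, g, b in zip(x, gamma, beta)]

# With gamma=1, beta=0 the output has (approximately) zero mean
x = [1.0, 2.0, 3.0, 4.0]
h = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```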

LM head (Language Modeling)

The LM head turns the last hidden vector into logits. This step is important: the input vector h now suggests/implies what token will come next, but it’s still an internal representation. The LM head converts it into logits, raw (unnormalized) scores for all available tokens in the vocabulary.

  • \( h \in \mathbb{R}^{d_{model}} \)
  • \( logits \in \mathbb{R}^V \)

where \(V\) is the vocabulary size. Each element of the logits vector is an unnormalized score for how likely that token is to be the next token. A later softmax converts these logits into actual probabilities.

Logits: \[ \text{logits} = h W_{\text{LM}} + b_{\text{LM}} \]

Softmax to turn logits into probabilities: \[ P(\text{token}=i \mid h) = \frac{\exp(\text{logits}_i)} {\sum_{j=1}^{V} \exp(\text{logits}_j)} \]

If using the (transposed) embedding matrix E as the LM head weight (many models do this): \[ W_{\text{LM}} = E^{\mathsf{T}} \]

\[ \text{logits} = h E^{\mathsf{T}} + b_{\text{LM}} \]
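A toy sketch of the LM head with tied embedding weights. The 3-token vocabulary and vectors below are made up (a real V is tens of thousands), and the final greedy pick is just one possible sampling strategy:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(h, E, b):
    # Tied weights: logits = h E^T + b, i.e. dot h with each embedding row
    return [sum(h_i * e_i for h_i, e_i in zip(h, row)) + b_j
            for row, b_j in zip(E, b)]

# Toy setup: V=3 tokens, d_model=2
E = [[1.0, 0.0],    # token 0's embedding
     [0.0, 1.0],    # token 1
     [0.5, 0.5]]    # token 2
b = [0.0, 0.0, 0.0]
h = [0.2, 0.9]      # final hidden vector: "closest" to token 1

logits = lm_head(h, E, b)
probs = softmax(logits)
next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
```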

How learning LLMs from Copilot went

Well, that’s it. It’s been a tough but exciting 2-3 weeks.

Learning something as complex as how LLMs work from Copilot was a much better experience than I had imagined. I barely remembered what the dot product means for matrices (“wasn’t it \( a b \cos \theta \)?”), but I was able to refresh my memory by asking lots of stupid questions, and Copilot diligently explained everything, every time.

I asked a lot of questions: What does this part of the formula mean? Why do you use \(e^x\) here? Why did researchers even think of this in the first place? And so on.

It also taught me how to write math formulas in LaTeX; it’s been 30 years since I last used LaTeX. It helped me set up LaTeX in org-mode for ox-hugo and Hugo. (As it turned out, my Hugo theme version was too old to handle LaTeX.)

P.S.

After using other LLMs such as ChatGPT, Gemini, and Claude, I’ve found that Copilot is the least capable of the four. If I were to start over, I’d use Copilot only after using up my quotas on the other LLMs.