Positional Encoding in Transformers

1. Overview
Positional Encoding is a technique used in Transformer architectures to encode the order of tokens in a sequence.
Transformers process tokens in parallel, unlike sequential models such as RNNs. Because of this, the model has no inherent understanding of token order. Positional encoding injects information about token positions into token embeddings before they enter the self-attention layers.
The final representation sent to the transformer is:
Input = TokenEmbedding + PositionalEncoding
This allows the model to learn both:
- semantic meaning (from the token embeddings)
- sequence order (from the positional encoding)
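As a minimal sketch of this addition in NumPy (random placeholder values stand in for both the learned embeddings and the positional encodings; the actual sine/cosine values are defined in Section 4):

```python
import numpy as np

d_model = 512   # embedding dimension
seq_len = 3     # number of tokens in the sequence

# Placeholder values: in a real model the embeddings are learned and the
# positional encodings come from the sine/cosine formula in Section 4.
token_embeddings = np.random.randn(seq_len, d_model)
positional_encodings = np.random.randn(seq_len, d_model)

# Element-wise sum: one vector per token, carrying meaning + position.
transformer_input = token_embeddings + positional_encodings
print(transformer_input.shape)   # (3, 512)
```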
2. Why This Exists
The Core Problem
Self-attention processes tokens simultaneously, not sequentially.
Example sentences:
"Nitish killed lion"
"Lion killed Nitish"
Both sentences contain the same tokens:
[Nitish, killed, lion]
Without positional information, self-attention treats the input as an unordered set of tokens, so the two sequences look identical to the model.
This is a fundamental limitation because word order determines meaning in natural language.
Why Previous Models Didn't Have This Problem
| Model | Order Awareness | Reason |
|---|---|---|
| RNN | Yes | Tokens processed sequentially |
| LSTM | Yes | Hidden state carries time information |
| Transformer | No | Tokens processed in parallel |
Transformers sacrifice sequential processing for parallel efficiency, so positional information must be explicitly added.
3. First Principles Explanation
To solve the ordering problem, we must encode position information alongside token embeddings.
Components
- Token Embedding
- Positional Encoding
- Self-Attention Layer
Interaction
- Token → Embedding Vector
- Position → Positional Encoding Vector
- Final Input = Embedding + Positional Encoding
Each token therefore carries:
semantic information + positional information
Design Requirements
A good positional encoding must satisfy:
- Bounded values
Neural networks train best with values in small ranges (e.g. -1 to 1).
- Continuous values
Neural networks prefer smooth functions, not discrete jumps.
- Ability to capture relative positions
The model should infer relationships like:
distance(token_i, token_j)
- Unique representation
Each position must have a distinct encoding.
4. How It Works
Step 1 — Tokenize Sentence
Example:
Sentence: "River Bank"
Tokens: [River, Bank]
Step 2 — Convert Tokens to Embeddings
Example:
River → embedding vector of size d_model
Bank → embedding vector of size d_model
Example dimension:
d_model = 512
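A toy sketch of this lookup, assuming a two-word vocabulary and a randomly initialised embedding matrix (in a real model these weights are learned during training):

```python
import numpy as np

d_model = 512
vocab = {"River": 0, "Bank": 1}   # toy vocabulary (assumption for this sketch)

# Random stand-in for a learned embedding matrix of shape (vocab_size, d_model).
embedding_matrix = np.random.randn(len(vocab), d_model)

tokens = ["River", "Bank"]
token_ids = [vocab[t] for t in tokens]
embeddings = embedding_matrix[token_ids]   # shape (2, 512): one row per token
print(embeddings.shape)
```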
Step 3 — Generate Positional Encoding
For token position pos and dimension pair index i (so 2i and 2i+1 index the even and odd dimensions of the d_model-sized vector):
PE(pos,2i) = sin(pos / 10000^(2i/d_model))
PE(pos,2i+1) = cos(pos / 10000^(2i/d_model))
Key ideas:
even dimensions → sine
odd dimensions → cosine
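One possible NumPy implementation of the formula above, where pe[pos, 2i] holds the sine term and pe[pos, 2i+1] the cosine term (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[np.newaxis, :]          # even indices 2i
    angles = positions / np.power(10000, two_i / d_model)    # pos / 10000^(2i/d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions -> sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions  -> cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=2, d_model=512)
print(pe.shape)              # (2, 512)
print(pe.min(), pe.max())    # values stay within [-1, 1]
```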
Step 4 — Add Positional Encoding to Embedding
InputVector = Embedding + PositionalEncoding
Step 5 — Send to Self-Attention
The resulting vector contains both:
semantic meaning + position
This vector becomes the input to the transformer encoder.
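Putting Steps 2 to 4 together for the "River Bank" example as a self-contained sketch (the embedding matrix is again a random stand-in for learned weights):

```python
import numpy as np

d_model = 512
tokens = ["River", "Bank"]
seq_len = len(tokens)

# Step 2: token embeddings (random stand-in for a learned embedding matrix).
embeddings = np.random.randn(seq_len, d_model)

# Step 3: sinusoidal positional encodings (same formula as above, even d_model).
positions = np.arange(seq_len)[:, np.newaxis]
angles = positions / np.power(10000, np.arange(0, d_model, 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

# Step 4: element-wise addition; each row now carries meaning + position.
encoder_input = embeddings + pe

# Step 5: this (2, 512) matrix is what the self-attention layer receives.
print(encoder_input.shape)
```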
5. Example
Situation
Sentence:
"The lion runs"
Token positions:
The → position 0
lion → position 1
runs → position 2
Implementation Idea
Compute positional encodings:
PE(0)
PE(1)
PE(2)
Then combine:
Embedding(The) + PE(0)
Embedding(lion) + PE(1)
Embedding(runs) + PE(2)
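A small numerical version of this idea, shrinking d_model to 4 so the encodings are easy to read; the printed values are approximate and the embeddings are random placeholders:

```python
import numpy as np

d_model = 4                        # tiny dimension just for illustration
tokens = ["The", "lion", "runs"]   # positions 0, 1, 2
seq_len = len(tokens)

positions = np.arange(seq_len)[:, np.newaxis]
angles = positions / np.power(10000, np.arange(0, d_model, 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

for token, row in zip(tokens, pe):
    print(token, np.round(row, 4))
# Approximate output:
# The  [0.      1.      0.      1.    ]
# lion [0.8415  0.5403  0.01    1.    ]
# runs [0.9093 -0.4161  0.02    0.9998]

# Combine with (placeholder) embeddings exactly as in Section 4.
embeddings = np.random.randn(seq_len, d_model)
encoder_input = embeddings + pe    # ready for the self-attention layer
```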
Expected Outcome
The transformer can now learn relationships such as:
- which word comes first
- relative distances between words
- syntactic dependencies
Summary
- Transformers process tokens in parallel, losing order information.
- Positional encoding injects token position information.
- Encodings are generated using sine and cosine functions at multiple frequencies.
- Positional vectors have the same dimension as the embeddings.
- The final input to the transformer is: embedding + positional_encoding




