Positional Encoding in Transformers

1. Overview

Positional Encoding is a technique used in Transformer architectures to encode the order of tokens in a sequence.

Transformers process tokens in parallel, unlike sequential models such as RNNs. Because of this, the model has no inherent understanding of token order. Positional encoding injects information about token positions into token embeddings before they enter the self-attention layers.

The final representation sent to the transformer is:

Input = TokenEmbedding + PositionalEncoding

This allows the model to learn both:

  • semantic meaning (from embeddings)

  • sequence order (from positional encoding)


2. Why This Exists

The Core Problem

Self-attention processes tokens simultaneously, not sequentially.

Example sentences:

"Nitish killed lion"
"Lion killed Nitish"

Both sentences contain the same tokens:

[Nitish, killed, lion]

If the tokens are sent to self-attention without any position information, the model has no way to distinguish their order, so both sequences appear identical.

This is a fundamental limitation because word order determines meaning in natural language.
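
One way to see this concretely: self-attention with no positional information gives each token the same representation no matter where it appears. The following is a minimal PyTorch sketch (illustrative sizes, an untrained layer), not code from the original post:

import torch

torch.manual_seed(0)
attn = torch.nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)

x = torch.randn(1, 3, 8)    # 3 token vectors standing in for [Nitish, killed, lion]
perm = [2, 1, 0]            # reversed order: [lion, killed, Nitish]

out_fwd, _ = attn(x, x, x)
out_rev, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Reordering the inputs only reorders the outputs; no token's representation changes
print(torch.allclose(out_fwd[:, perm], out_rev, atol=1e-5))   # True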

Why Previous Models Didn't Have This Problem

Model        Order Awareness   Reason
RNN          Yes               Tokens processed sequentially
LSTM         Yes               Hidden state carries time information
Transformer  No                Tokens processed in parallel

Transformers sacrifice sequential processing for parallel efficiency, so positional information must be explicitly added.


3. First Principles Explanation

To solve the ordering problem, we must encode position information alongside token embeddings.

Components

  1. Token Embedding

  2. Positional Encoding

  3. Self-Attention Layer

Interaction

Token → Embedding Vector
Position → Positional Encoding Vector

Final Input = Embedding + Positional Encoding

Each token therefore carries:

semantic information + positional information

Design Requirements

A good positional encoding must satisfy:

  1. Bounded values

Neural networks train best with values in small ranges (e.g. -1 to 1).

  2. Continuous values

Neural networks prefer smooth functions, not discrete jumps.

  3. Ability to capture relative positions

The model should infer relationships like:

distance(token_i, token_j)

(see the check after this list)

  4. Unique representation

Each position must have a distinct encoding.
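
Requirement 3 deserves a closer look. For the sinusoidal encoding defined in Section 4, the encoding of position pos + k is a fixed rotation of the encoding of position pos, which is what lets the model reason about relative offsets. A minimal NumPy check of this property (toy sizes, for illustration only):

import numpy as np

d_model, pos, k = 8, 5, 3
w = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)   # one frequency per (sin, cos) pair

def pe(p):
    out = np.empty(d_model)
    out[0::2] = np.sin(p * w)   # even dimensions
    out[1::2] = np.cos(p * w)   # odd dimensions
    return out

# Rotating each (sin, cos) pair of PE(pos) by the angle k*w reproduces PE(pos + k)
rotated = np.empty(d_model)
rotated[0::2] = np.cos(k * w) * pe(pos)[0::2] + np.sin(k * w) * pe(pos)[1::2]
rotated[1::2] = np.cos(k * w) * pe(pos)[1::2] - np.sin(k * w) * pe(pos)[0::2]
print(np.allclose(rotated, pe(pos + k)))   # True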


4. How It Works

Step 1 — Tokenize Sentence

Example:

Sentence: "River Bank"
Tokens: [River, Bank]

Step 2 — Convert Tokens to Embeddings

Example:

River → embedding vector (d_model)
Bank  → embedding vector (d_model)

Example dimension:

d_model = 512
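
A minimal sketch of this step (a toy vocabulary and a randomly initialised torch.nn.Embedding; in a real model the embedding table is learned):

import torch

vocab = {"River": 0, "Bank": 1}
d_model = 512

embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
token_ids = torch.tensor([vocab["River"], vocab["Bank"]])

token_embeddings = embedding(token_ids)
print(token_embeddings.shape)   # torch.Size([2, 512])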

Step 3 — Generate Positional Encoding

For position pos and dimension pair index i (covering dimensions 2i and 2i+1):

PE(pos,2i)   = sin(pos / 10000^(2i/d_model))
PE(pos,2i+1) = cos(pos / 10000^(2i/d_model))

Key ideas:

  • even dimensions → sine

  • odd dimensions → cosine
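
These formulas translate almost line-for-line into NumPy. A sketch (the sequence length and d_model below are arbitrary):

import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]                   # positions 0 .. max_len - 1
    two_i = np.arange(0, d_model, 2)[None, :]           # 0, 2, 4, ... (the "2i" term)
    angles = pos / np.power(10000.0, two_i / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions -> sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions  -> cosine
    return pe

pe = positional_encoding(max_len=100, d_model=512)
print(pe.shape)               # (100, 512)
print(pe.min(), pe.max())     # values stay within [-1, 1]

Because every entry is a sine or cosine, the encoding is automatically bounded and continuous, matching the design requirements from Section 3.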

Step 4 — Add Positional Encoding to Embedding

InputVector = Embedding + PositionalEncoding

Step 5 — Send to Self-Attention

The resulting vector contains both:

semantic meaning + position

This vector becomes the input to the transformer encoder.
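
Putting Steps 1 to 5 together, here is an end-to-end sketch (toy vocabulary, random embeddings, and an untrained torch.nn.MultiheadAttention used only to show the data flow):

import torch

d_model = 512
vocab = {"River": 0, "Bank": 1}
token_ids = torch.tensor([[vocab["River"], vocab["Bank"]]])   # Step 1: shape (1, 2)

# Step 2: token embeddings
embedding = torch.nn.Embedding(len(vocab), d_model)
emb = embedding(token_ids)                                    # (1, 2, 512)

# Step 3: sinusoidal positional encodings for positions 0 and 1
pos = torch.arange(token_ids.shape[1]).unsqueeze(1)           # (2, 1)
two_i = torch.arange(0, d_model, 2)
angles = pos / (10000 ** (two_i / d_model))
pe = torch.zeros(token_ids.shape[1], d_model)
pe[:, 0::2], pe[:, 1::2] = torch.sin(angles), torch.cos(angles)

# Step 4: add position information to the embeddings
x = emb + pe                                                  # (1, 2, 512)

# Step 5: feed the combined vectors to self-attention
attn = torch.nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
out, _ = attn(x, x, x)
print(out.shape)                                              # torch.Size([1, 2, 512])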


5. Example

Situation

Sentence:

"The lion runs"

Token positions:

The  → position 0
lion → position 1
runs → position 2

Implementation Idea

Compute positional encodings:

PE(0)
PE(1)
PE(2)

Then combine:

Embedding(The)  + PE(0)
Embedding(lion) + PE(1)
Embedding(runs) + PE(2)
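
As a concrete, purely illustrative calculation, here is the same recipe with a tiny d_model of 4 so the numbers are easy to inspect:

import math

d_model = 4   # tiny dimension chosen only for readability

for pos, tok in enumerate(["The", "lion", "runs"]):
    pe = [math.sin(pos / 10000 ** (0 / d_model)),   # dimension 0 (sine)
          math.cos(pos / 10000 ** (0 / d_model)),   # dimension 1 (cosine)
          math.sin(pos / 10000 ** (2 / d_model)),   # dimension 2 (sine)
          math.cos(pos / 10000 ** (2 / d_model))]   # dimension 3 (cosine)
    print(tok, [round(v, 3) for v in pe])

# The [0.0, 1.0, 0.0, 1.0]
# lion [0.841, 0.54, 0.01, 1.0]
# runs [0.909, -0.416, 0.02, 1.0]

Each of these vectors would then be added elementwise to the corresponding token embedding of the same dimension.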

Expected Outcome

The transformer can now learn relationships such as:

  • which word comes first

  • relative distances between words

  • syntactic dependencies


Summary

  • Transformers process tokens in parallel, losing order information.

  • Positional encoding injects token position information.

  • Encodings are generated using sine and cosine functions at multiple frequencies.

  • Positional vectors have the same dimension as embeddings.

  • The final input to the transformer is:

embedding + positional_encoding