# Building a GPT

Last Updated: January 2023

Will GPTs/LLMs approach human-level intelligence? In short, no, imho. GPTs/LLMs perform extraordinarily well at pattern recognition and associated predictive tasks, as we see in this post, but they do not possess reasoning, nor understanding, nor empathy; they possess very close to zero intelligence in many important human respects.

Will GPTs be useful? Extraordinarily so, imho. In the post below you will see that I am able to consume the written works of Shakespeare, train a GPT on it, and then generate new language that is reasonably close to Shakespearean in just a few hundred lines of code, with a training process taking minutes on a single GPU.

GPTs are a tool, not a goal. Our goal here is not to write Shakespeare, but to write a tool that can write Shakespeare, and thus can train on any text and generate new text that is reasonably close to the original. Legal contracts for example, or code for programming tasks, or whatever. In a new post coming soon, I convert millions of tick-level price observations to text and train this model on the tick data to see how my GPT performs on financial data (spoiler: it does quite well compared to SOTA).

GPTs are therefore a tool for creating tools.

Optimization and scaling infrastructure is how you make money with GPTs and how companies will establish moats most effectively.

This post builds a GPT of the kind behind **ChatGPT**, **Bard**, and **Bing**, but in this case a much simpler version. A GPT is a Large Language Model (LLM), in turn a deep learning neural network architecture used for natural language processing tasks, such as AI assistants, automated writing, content curation, and sentiment analysis. GPTs are pre-trained on large amounts of text data using unsupervised learning, which allows them to learn patterns and relationships in the data. This pre-training step is crucial, as it enables the models to perform well on a variety of downstream NLP tasks without requiring large amounts of task-specific training data.

I follow **Karpathy's approach** to gain a better understanding of how GPTs work under the hood. I examine, through construction, the transformer architecture, which is a key component of GPTs, to see how GPTs can be used in real-world applications. I return to the seminal AI papers **Attention Is All You Need**, which introduced the transformer architecture, and the GPT-3 paper **Language Models are Few-Shot Learners**. Overall, this post is a hands-on exploration of GPTs and the transformer architecture, with the goal of gaining insight into how these models can be used commercially. As always, source code for this post may be found on **my GitHub**.

My earlier **LSTM** post takes a thorough look at the LSTM neural network. As we move from neural networks generally to transformers specifically, attention is the key concept. So what is attention exactly?

**Attention**

Attention is a mechanism that lets each position in a sequence compute a weighted combination of other positions, with the weights (the affinities) learned from the data rather than fixed in advance.

**Self-attention**

In self-attention, the queries, keys, and values are all derived from the same input sequence, so each token attends to other tokens in its own sequence. Mathematically, self-attention may be written:

$$A = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

**Cross-attention**

In cross-attention, the queries come from one sequence while the keys and values come from another (e.g., an encoder's output in the original transformer), so one sequence attends to a different one.

**Scaled-attention**

Scaling divides the query-key dot products by $\sqrt{d_k}$, the square root of the key dimension, so the softmax inputs keep roughly unit variance and the softmax does not saturate:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$
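As a concrete illustration, the scaled dot-product formula above can be computed directly in PyTorch on small random tensors (a standalone sketch with made-up shapes, not the model code developed below):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 4, 8                        # sequence length, key/query dimension
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)
V = torch.randn(T, d_k)

scores = Q @ K.T / d_k ** 0.5        # (T, T) affinities, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=-1)  # each row becomes a probability distribution
out = weights @ V                    # (T, d_k) weighted average of the values

print(out.shape)                     # torch.Size([4, 8])
```

Each output row is a mixture of the value vectors, with mixing weights determined by how well that row's query matches each key.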

In [1]:

```
# check GPU (if working on local machine)
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"device: {device}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available.")
```

device: cuda
Device name: NVIDIA GeForce RTX 3080 Ti Laptop GPU

In [2]:

```
# for running in docker image
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```

In [3]:

```
# get data
with open('input.txt', 'r', encoding='utf-8') as f:
text = f.read()
print("length of dataset in characters: ", len(text))
```

length of dataset in characters: 1115394

In [4]:

```
# show unique characters appearing in the dataset (note the space character, which is first in the set): i.e., the vocabulary of possible characters the model can see or emit
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)
```

!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65

Build a simple encoder and decoder: i.e., take a string, output a list of integers, where each character is a token. The approach below is similar to, but much more simplified than:

Google's SentencePiece (which uses sub-word encodings) and OpenAI's tiktoken.

In [5]:

```
# convert the raw text as a string into some sequence of integers according to some vocabulary of possible elements
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
# build a simple encoder and decoder, effectively a tokenizer and detokenizer
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
print(encode("today is friday, looking forward to the weekend!"))
print(decode(encode("today is friday, looking forward to the weekend!")))
```

In [6]:

```
# encode training dataset and store it in a torch.tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])
```

In [7]:

```
# 90:10 train:val split
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
```

I set the time dimension (i.e., the context) of the tensors feeding into the transformer to a maximum of 8 characters (i.e., I set block_size = 8). Note: I train on block_size+1 characters because the transformer trains on the first 8 characters and predicts the +1th, or 9th, character. Put another way, the transformer sees contexts from one character through block_size characters.

And I set the batch dimension of the tensors feeding into the transformer to 4, so batch_size = 4 (i.e., 4 independent sequences will be processed in parallel).

In [8]:

```
# set block_size = 8 to train on [:block_size+1] = 8+1 characters at a time
block_size = 8
train_data[:block_size+1]
```

Out[8]:

tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])

In [9]:

```
# +1 because we want to predict the next character, thus block_size+1 allows us to do that, i.e., the transformer trains on the first 8 characters and predicts the +1th or 9th character
# to illustrate:
x = train_data[:block_size]
y = train_data[1:block_size+1]
print('Illustrating how the transformer trains on the first 8 characters and predicts the +1th or 9th character:')
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'when input is {context}, the target is {target}')
```

In [10]:

```
# I set the batch dimension of the tensors feeding into the transformer to 4, so batch_size = 4 (i.e., 4 independent sequences will be processed in parallel).
torch.manual_seed(3407)
batch_size = 4
block_size = 8
def get_batch(split):
    # generate a batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) # move data to GPU
    return x, y
xb, yb = get_batch('train')
print('Here is the tensor input to the transformer:',
'\n',
xb
)
```

In [11]:

```
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(3407)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers
        logits = self.token_embedding_table(idx) # (B, T, C): batch by time (context) by channel, where channel is vocab size
        if targets is None:
            loss = None
        else:
            # reorganize logits tensor from (B, T, C) to (B*T, C) in order to fit pytorch's cross_entropy loss function
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # cross_entropy here computes negative log likelihood loss
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=1) # (B, T+1)
        return idx
model = BigramLanguageModel(vocab_size)
m = model.to(device) # move model to GPU
logits, loss = m(xb, yb)
print('logits shape:', logits.shape)
print('loss:', loss)
# the context tensor torch.zeros((1,1), dtype=torch.long, device=device) is created on-the-fly inside the print() call, on the GPU
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long, device=device), max_new_tokens=300)[0].tolist()))
```

The model is untrained and provides predictions that are random, so the output is meaningless.
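A quick sanity check on that claim: with a vocabulary of 65 characters, a model that predicts uniformly at random should incur a cross-entropy of $-\ln(1/65) \approx 4.17$, which matches the loss we observe before training:

```python
import math

vocab_size = 65  # size of the character vocabulary above
expected_loss = -math.log(1.0 / vocab_size)  # cross-entropy of a uniform prediction
print(f"expected untrained loss: {expected_loss:.4f}")  # ~4.17
```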

I now train the bigram model to make it less random.

In [12]:

```
# create a pytorch optimizer
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)
```

In [13]:

```
batch_size = 32 # increase the batch size from 4 to 32 to speed up training
for steps in range(10000): # increase the number of steps to train for, to improve results
    # get a batch of data
    xb, yb = get_batch('train')
    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
print('loss:', loss.item()) # training for 10000 steps brings the loss down to ~2.5
```

loss: 2.5604467391967773

In [14]:

```
# as above, the context tensor torch.zeros((1,1), dtype=torch.long, device=device) is created on-the-fly inside the print() call, on the GPU
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long, device=device), max_new_tokens=300)[0].tolist()))
```

I now write the first self-attention block for processing the tokens, following several steps, each progressively more effective, that hopefully help to make the self-attention construct clearer.

Let's start with a very simple example, which essentially relates tokens to each other via their history.

In [15]:

```
# simple example
torch.manual_seed(3407)
B,T,C = 4,8,2 # batch size, time steps, channels
x = torch.randn(B,T,C)
x.shape
```

Out[15]:

torch.Size([4, 8, 2])

A simple way to enable tokens to communicate in the manner we desire (i.e., with the tokens that precede them in T) is to calculate an average of all the preceding elements. Consider, for example, the fifth token: take the channels that make up the information at that step, but also the channels from the fourth, third, second, and first steps, and average them. This creates, effectively, a feature vector that summarizes the 5th token in the context of its history. An average like this is an extremely weak and lossy form of interaction, i.e., a lot of information about the spatial arrangement of the tokens is lost.

So, for every batch element independently, for every $n^{th}$ token in that sequence, calculate the average of all the vectors in all the previous tokens and also at the $n^{th}$ token.

In [16]:

```
# I want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C)) # bow for bag of words
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t, C)
        xbow[b,t] = torch.mean(xprev, 0)
print(x[0])
print('xbow averages everything up to the current location of the nth token: ', '\n',
xbow[0])
```

Karpathy shows how to use matrix multiplication to increase the efficiency of the above operation.

In [17]:

```
wei = torch.tril(torch.ones((T,T))) # wei denotes weights, torch.tril provides lower triangular matrix
wei = wei / wei.sum(1, keepdim=True) # normalize weights so that they sum to 1
xbow2 = wei @ x # (B, T, T) @ (B, T, C) --> (B, T, C)
torch.allclose(xbow, xbow2) # check that the two methods give the same result
```

Out[17]:

True
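As a side note, the same running average can also be computed with a cumulative sum, avoiding the explicit (T, T) weight matrix entirely (a standalone sketch with its own small tensors, not used in the model below):

```python
import torch

torch.manual_seed(3407)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# running mean over time: cumulative sum divided by the count of elements so far
counts = torch.arange(1, T + 1, dtype=x.dtype).view(1, T, 1)  # (1, T, 1)
xbow_cumsum = x.cumsum(dim=1) / counts                        # (B, T, C)

# verify against the explicit double loop
xbow_loop = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow_loop[b, t] = x[b, :t + 1].mean(0)
print(torch.allclose(xbow_loop, xbow_cumsum, atol=1e-6))  # True
```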

The same result can be obtained by masking future positions with -inf and applying a softmax to each row to normalize.

In [18]:

```
tril = torch.tril(torch.ones((T,T))) # tril matrix of lower triangular ones
wei = torch.zeros((T,T)) # wei begins as a matrix of zeros
wei = wei.masked_fill(tril == 0, float('-inf')) # weights for the future tokens are set to -inf, so future tokens are ignored
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)
```

Out[18]:

True

Now observe a single head perform self-attention. A "head" is effectively a sub-network that processes the input sequence independently. In transformers, self-attention normally comprises multiple attention heads, which allow the model to attend to different parts of the input sequence at different levels of granularity, enabling it to capture more diverse and nuanced relationships between the elements of the input. Thus, self-attention enables the model to gather information from the past and apply it in a data-dependent way.

In [19]:

```
import torch.nn as nn
torch.manual_seed(3407)
B, T, C = 4, 8, 32 # batch, time, channels (recall, channels is dimensionality of the input, e.g., now 32 for a 32-dimensional embedding)
x = torch.randn(B,T,C)
# Observe a single head perform self-attention
head_size = 16 # the head hyperparameter, being the number of dimensions in the query, key, and value vectors
key = nn.Linear(C, head_size, bias=False) # key vector roughly speaking means what do I contain
query = nn.Linear(C, head_size, bias=False) # query vector roughly speaking means what am I looking for
value = nn.Linear(C, head_size, bias=False) # value vector roughly speaking means what do I return
k = key(x) # (B, T, head_size)
q = query(x) # (B, T, head_size)
# the affinities are obtained by taking the dot product of the query and key vectors
wei = q @ k.transpose(-2,-1) # (B, T, head_size) @ (B, head_size, T) --> (B, T, T) --> wei is roughly speaking the affinity matrix
tril = torch.tril(torch.ones((T,T)))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
v = value(x) # results in 16-dimensional vectors because that is the head_size
out = wei @ v
out.shape # (B, T, head_size)
```

Out[19]:

torch.Size([4, 8, 16])

Observe $wei$, the matrix of affinities, as a matrix of lower triangular values:

In [20]:

```
wei[0]
```

In [21]:

```
# single self-attention block
class Head(nn.Module):
    """ one head of self-attention """
    # note: n_embd, block_size, and dropout are hyperparameters defined globally in the consolidated code
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x) # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

out.shape
```

Out[21]:

torch.Size([4, 8, 16])
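In the full model, several such heads run in parallel and their outputs are concatenated and projected back to the embedding dimension. Here is a minimal multi-head sketch following that standard design (the hyperparameter values and the projection layer are illustrative assumptions, not taken from the consolidated code):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# illustrative hyperparameters (assumptions, not the post's exact values)
n_embd, block_size = 32, 8

class Head(nn.Module):
    """One head of masked self-attention (same structure as above, minus dropout)."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                        # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel; outputs concatenated and projected."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, num_heads*head_size)
        return self.proj(out)                                # back to (B, T, n_embd)

x = torch.randn(4, 8, n_embd)
mha = MultiHeadAttention(num_heads=4, head_size=n_embd // 4)
print(mha(x).shape)  # torch.Size([4, 8, 32])
```

Each head works in a lower-dimensional subspace (head_size = n_embd / num_heads), so the concatenation restores the original embedding width.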

In [22]:

```
# simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensors of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

print(logits, loss)
```
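The `Block` module referenced in the model above is part of the consolidated code on my GitHub. As a reference point, here is a sketch of the standard pre-norm transformer block it follows: masked multi-head self-attention ("communication") and a position-wise feed-forward network ("computation"), each wrapped in a residual connection. The compact fused-QKV attention module and the hyperparameter values are illustrative assumptions, not the post's exact code:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

n_embd, n_head = 32, 4  # illustrative hyperparameters (assumptions)

class FeedForward(nn.Module):
    """Position-wise MLP: expand 4x, nonlinearity, project back."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )
    def forward(self, x):
        return self.net(x)

class CausalSelfAttention(nn.Module):
    """Multi-head masked self-attention in one module (compact stand-in)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd)
    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        hs = C // self.n_head
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)  # (B, nh, T, hs)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        wei = q @ k.transpose(-2, -1) * hs ** -0.5          # (B, nh, T, T)
        mask = torch.tril(torch.ones(T, T, device=x.device))
        wei = wei.masked_fill(mask == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = (wei @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class Block(nn.Module):
    """Transformer block: communication (attention) then computation (MLP)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = CausalSelfAttention(n_embd, n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual around attention
        x = x + self.ffwd(self.ln2(x))  # residual around feed-forward
        return x

y = Block(n_embd, n_head)(torch.randn(4, 8, n_embd))
print(y.shape)  # torch.Size([4, 8, 32])
```

The residual connections and layer norms are what let several of these blocks be stacked without the optimization problems a deep plain network would have.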

In [23]:

```
# generate method, to generate new tokens
def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
        # crop idx to the last block_size tokens
        idx_cond = idx[:, -block_size:]
        # get the predictions
        logits, loss = self(idx_cond)
        # focus only on the last time step
        logits = logits[:, -1, :] # becomes (B, C)
        # apply softmax to get probabilities
        probs = F.softmax(logits, dim=-1) # (B, C)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
        # append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
    return idx
```

Consolidating the above, my model now generates text outputs that are recognizably Shakespearean. The train loss is now 1.6488 and the val loss is 1.8093, a marked improvement. In a future post I will work on certain aspects of the model to improve performance further and confront it with different datasets. The model's training and output are shown below. Please see

my GitHub for the consolidated code, which I omit here because including it slows the post's load time to an unacceptable level.

0.209729 M parameters
step 0: train loss 4.2614, val loss 4.2568

step 100: train loss 2.6580, val loss 2.6605

step 200: train loss 2.5000, val loss 2.5046

step 300: train loss 2.4105, val loss 2.4309

step 400: train loss 2.3485, val loss 2.3569

step 500: train loss 2.2949, val loss 2.3087

step 600: train loss 2.2367, val loss 2.2507

step 700: train loss 2.1907, val loss 2.2175

step 800: train loss 2.1549, val loss 2.1812

step 900: train loss 2.1101, val loss 2.1578

step 1000: train loss 2.0802, val loss 2.1154

step 1100: train loss 2.0301, val loss 2.0998

step 1200: train loss 2.0254, val loss 2.0861

step 1300: train loss 1.9997, val loss 2.0595

step 1400: train loss 1.9890, val loss 2.0445

step 1500: train loss 1.9456, val loss 2.0219

step 1600: train loss 1.9181, val loss 2.0008

step 1700: train loss 1.9106, val loss 2.0062

step 1800: train loss 1.8987, val loss 2.0026

step 1900: train loss 1.8739, val loss 1.9658

step 2000: train loss 1.8701, val loss 1.9788

step 2100: train loss 1.8438, val loss 1.9617

step 2200: train loss 1.8322, val loss 1.9344

step 2300: train loss 1.8115, val loss 1.9326

step 2400: train loss 1.8084, val loss 1.9267

step 2500: train loss 1.7888, val loss 1.9249

step 2600: train loss 1.7759, val loss 1.9167

step 2700: train loss 1.7881, val loss 1.8967

step 2800: train loss 1.7682, val loss 1.8924

step 2900: train loss 1.7538, val loss 1.9109

step 3000: train loss 1.7535, val loss 1.8929

step 3100: train loss 1.7371, val loss 1.8828

step 3200: train loss 1.7228, val loss 1.8752

step 3300: train loss 1.7182, val loss 1.8677

step 3400: train loss 1.7155, val loss 1.8733

step 3500: train loss 1.7122, val loss 1.8637

step 3600: train loss 1.7037, val loss 1.8632

step 3700: train loss 1.6886, val loss 1.8564

step 3800: train loss 1.6866, val loss 1.8321

step 3900: train loss 1.6841, val loss 1.8379

step 4000: train loss 1.6814, val loss 1.8447

step 4100: train loss 1.6798, val loss 1.8399

step 4200: train loss 1.6841, val loss 1.8392

step 4300: train loss 1.6779, val loss 1.8295

step 4400: train loss 1.6667, val loss 1.8330

step 4500: train loss 1.6572, val loss 1.8032

step 4600: train loss 1.6613, val loss 1.8300

step 4700: train loss 1.6624, val loss 1.8185

step 4800: train loss 1.6433, val loss 1.8098

step 4900: train loss 1.6480, val loss 1.8206

step 4999: train loss 1.6488, val loss 1.8093

This price-nend; it wroable all more I the to be in much sruch on the chargen tell dent, Apurseticeit,--my regried, and mystory far Merch, Red city not we decemblemanvy wantwith a, a mirch Recors is rublence. You and AUo's faces fathee, the pausurals, and know Her no swoot he mest not of me? If rite and and true to latiul did crumb Though yout with little are in them, Frant? shall youble morry, thou march To have noble, bender will bit but In bisoved eyed She dick you.

CORIOLANUS: So place to whence she were was me ence, So thou gamoust to a genaluest thee, Furst your not Lord sly, it, that Of but to brischarr with she peisul, What by you waves behis duke fife?

First The foels. And that sit dest not to be work enters to cannowis cutle thrive with fother So firthed: do and I RI women with be orish with loved And to mothanking of out wut for nubly hast uphis farels didstruqer shee will.

BUCHBOLANR: Duke you, to I she laking be got an keepost in thyse.

GLOUCESTER: But, derrous not my denamous and. But and he duked from happy this furnily dears igk.

Citid: The vorse disque tiruthlo. I with the life.

Pirst That of, not the leaver off it wort? It I wovet the boundrangt, the pooly, Like a so! I how that crum the timet clring, Englink marry him a somet, not do? We sleep take jockita, with saward sweet in learent? throw thou know in knebears Twrue, Promeslesed hurse the brace, good's dear that?'

Nurst known As belo.

DUKE VINCENLOND: Heldsher mer my trine juke wuthers: We work, work. QUE: Where I would lossed a maber gived?

Rethed Setchmness.

PULY: O, wacrimb'd noble and live honesters, The giver to ut to go arminy theims? Call tis by reforge emaget,; dishes, Which sit dakes what she sayit As formoul muty cannot his hamelvenator: That goness, kneel. My comy, And say how a wife works Most, myce. Come, is that work, and me astshumble Formelf'd couses, yet should never frife! To well devoll on life, it no true, is marcher and shall spy give, not hother

Brown, T.B., et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165

Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762