Notes: LLM From Scratch

Synopsis

Notes and pointers from my exploration of Build a Large Language Model (From Scratch) by Sebastian Raschka. While I use this book as the central reference, the notes also contain pointers to other resources.

Chapter 1: Understanding Large Language Models

  • A good quality corpus is crucial.
  • Choice of corpus will depend on the downstream tasks.
  • For example, to generate code, the corpus should contain a substantial amount of code snippets.

Dolma, an open-source corpus of 3 trillion tokens, is described in Soldaini et al. (2024).

Key challenges in pre-training

  • Data: Creating or curating a large corpus of text data is a significant challenge. The quality and diversity of the data are crucial for the performance of the model.
  • Model: The model architecture is key for pre-training; a good architecture makes it easier to learn the underlying structure of the data.
  • Budget: Pre-training a large language model requires substantial computational resources, so the compute budget is a hard constraint.
    • Li (2020) estimated that GPT-3 cost roughly $4.6 million to train.

Ouyang et al. (2022) presented ideas on how to fine-tune GPT-3 on a dataset of instructions.

Chapter 2: Working with text data

For a brief overview of tokens and embeddings in the context of deep learning text models, see LLMs: Understanding Tokens and Embeddings.

There can be

  • Standalone models trained purely for embeddings (e.g., Word2Vec (see Mikolov et al. 2013), GloVe (see Pennington, Socher, and Manning 2014)), or
  • Models that learn embeddings as part of a larger model (e.g., BERT (see Devlin et al. 2018), GPT-3 (see Brown et al. 2020)).

Note

Embeddings from one model are typically not directly compatible with another model, because the embeddings are learned in the context of the model.

Tokenization

A larger token vocabulary lets us capture more information per token. However, a larger vocabulary also demands more computational resources.

How does the token vocabulary affect computation?

  • The number of tokens affects the size of the embedding matrix.
  • The embedding matrix is a lookup table that maps each token to its corresponding embedding vector.
  • The size of the embedding matrix is determined by the vocabulary size and the embedding dimension. The larger the vocabulary, the larger the embedding matrix, and the more memory and computation are required to process it (see the sketch after this list).
  • Tokens can be created not just for words, but also for subwords and characters.
  • This allows us to capture more information, and handle “words” not seen before.
  • We do not encode grammar rules (which are hard to define and may not be exhaustive or easy to update). Instead, we let the model learn the rules from the data. This is a key difference between traditional NLP and deep learning NLP.
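
As a rough sense of scale, the short sketch below counts the parameters of an embedding layer using GPT-2-style sizes. The numbers are illustrative assumptions, not taken from the book.

import torch

vocab_size = 50257     # GPT-2's BPE vocabulary size
embedding_dim = 768    # GPT-2 (small) embedding dimension
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

n_params = embedding.weight.numel()                   # vocab_size * embedding_dim = 38,597,376
print(f"{n_params:,} parameters")
print(f"~{n_params * 4 / 2**20:.0f} MiB in float32")  # 4 bytes per parameter, roughly 147 MiB
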
import importlib
import stage1.tokenization
importlib.reload(stage1.tokenization)
<module 'stage1.tokenization' from '/Users/jaju/github/knowledge-garden/content/notes/llm-from-scratch/stage1/tokenization.py'>

V1

The first, naive implementation, which cannot handle unseen tokens: the vocabulary is built from words split at word boundaries in the training text, so any word not seen there fails to encode.

stage1.tokenization.v1()
2025-09-20 09:09:36.107 | INFO     | utils.downloaders:download:21 - data/the_verdict.txt already exists. Skipping download.
Error: Token 'draggees' not found in vocab.
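
For reference, a minimal sketch of what such a naive tokenizer might look like. This is my own illustration, not the code in stage1.tokenization; the regex-based splitting is an assumption.

import re

class SimpleTokenizerV1:
    """Word-boundary tokenizer over a fixed vocabulary; unseen tokens raise an error."""
    def __init__(self, training_text):
        tokens = re.findall(r"\w+|[^\w\s]", training_text)
        self.token_to_id = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text):
        ids = []
        for tok in re.findall(r"\w+|[^\w\s]", text):
            if tok not in self.token_to_id:
                raise KeyError(f"Token '{tok}' not found in vocab.")
            ids.append(self.token_to_id[tok])
        return ids

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

tokenizer = SimpleTokenizerV1("three of them she dragged along equally")
print(tokenizer.encode("three she dragged equally"))
try:
    tokenizer.encode("three she draggees equally")
except KeyError as e:
    print(e)  # 'draggees' is not in the vocabulary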

V2

Ability to handle unseen tokens, plus additional signals like begin/end of text, padding, etc. We simply preprocess the text and replace unseen tokens with a special token.

stage1.tokenization.v2()
2025-09-20 09:09:36.112 | INFO     | utils.downloaders:download:21 - data/the_verdict.txt already exists. Skipping download.
╒═════════╤════════════════════════════╕
│ Input   │ three she draggees equally │
├─────────┼────────────────────────────┤
│ Encoded │ [1004, 876, 1131, 394]     │
├─────────┼────────────────────────────┤
│ Decoded │ three she <|unk|> equally  │
╘═════════╧════════════════════════════╛
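
A minimal sketch of the V2 idea (again my own illustration, not the module's code): unknown words are mapped to an <|unk|> token that is itself part of the vocabulary.

import re

def build_vocab(training_text):
    tokens = sorted(set(re.findall(r"\w+|[^\w\s]", training_text)))
    tokens += ["<|unk|>", "<|endoftext|>"]   # special tokens appended to the vocabulary
    return {tok: i for i, tok in enumerate(tokens)}

def encode(text, vocab):
    unk = vocab["<|unk|>"]
    return [vocab.get(tok, unk) for tok in re.findall(r"\w+|[^\w\s]", text)]

vocab = build_vocab("three of them she dragged along equally")
ids = encode("three she draggees equally", vocab)
inverse = {i: tok for tok, i in vocab.items()}
print(ids)
print(" ".join(inverse[i] for i in ids))  # three she <|unk|> equally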

V3 - Byte Pair Encoding (BPE)

  • Used in the original ChatGPT as well as GPT-2 and GPT-3.
    • BPE is more granular in how it identifies tokens: it starts from atomic units (bytes/characters) and then merges frequently co-occurring units into larger tokens.
    • This is a data-driven approach to tokenization, where the tokens are learned from the data.
  • tiktoken is a Python library that implements BPE. Implementing BPE from scratch is not a key aim of this book, but it is useful to understand the concepts.

In this example, we don’t create a new vocabulary, but use the ‘GPT-2’ vocabulary. (See source)

stage1.tokenization.v3()
2025-09-20 09:09:36.116 | INFO     | utils.downloaders:download:21 - data/the_verdict.txt already exists. Skipping download.
╒═════════╤════════════════════════════════════╕
│ Input   │ three she draggees equally         │
├─────────┼────────────────────────────────────┤
│ Encoded │ [15542, 673, 6715, 469, 274, 8603] │
├─────────┼────────────────────────────────────┤
│ Decoded │ three she draggees equally         │
╘═════════╧════════════════════════════════════╛

Notice that the number of tokens is greater than the number of words in this example. The unseen word is split into multiple subword tokens that are in the vocabulary. This is a key feature of BPE: it handles unseen words by breaking them down into smaller units that are in the vocabulary.
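
To see the subword splitting explicitly, we can decode each token id individually. A quick check using the tiktoken library with the GPT-2 encoding (the exact ids depend on that vocabulary):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("three she draggees equally")
print(ids)
print([enc.decode([i]) for i in ids])  # the unseen word is split into several subword pieces
print(enc.decode(ids))                 # decoding reconstructs the original text, spaces included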

V4

This is a further improvement over the previous version. "Improvement" not because of the tokenization technique, but because of how we inject special symbols into the text: a special token such as <|endoftext|> marks the end of a text (and can also double as a padding token).

stage1.tokenization.v4()
2025-09-20 09:09:36.219 | INFO     | utils.downloaders:download:21 - data/the_verdict.txt already exists. Skipping download.
╒═════════╤═══════════════════════════════════════════════════════════════════════════════╕
│ Input   │ three she dragged equally. <|endoftext|> This is the end of the document.     │
├─────────┼───────────────────────────────────────────────────────────────────────────────┤
│ Encoded │ [15542, 673, 17901, 8603, 13, 220, 50256, 770, 318, 262, 886, 286, 262, 3188, │
│         │ 13]                                                                           │
├─────────┼───────────────────────────────────────────────────────────────────────────────┤
│ Decoded │ three she dragged equally. <|endoftext|> This is the end of the document.     │
╘═════════╧═══════════════════════════════════════════════════════════════════════════════╛
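
With tiktoken, special tokens such as <|endoftext|> must be explicitly allowed when encoding; otherwise the call raises an error. A minimal sketch:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "three she dragged equally. <|endoftext|> This is the end of the document."
ids = enc.encode(text, allowed_special={"<|endoftext|>"})  # 50256 is <|endoftext|> in the GPT-2 vocabulary
print(ids)
print(enc.decode(ids))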

BPE: A further demonstration

BPE can handle apparently garbage text. It also preserves the spaces between tokens, because it makes no special assumptions about the text, not even the notion of word boundaries.

stage1.tokenization.v4bpe()
╒═════════╤═════════════════════════════════════════════════════════════════════════════════╕
│ Input   │ asd asdjfkjsdf ksjfksa sdkfjsj   powiuosadoapofqfvv                             │
├─────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Encoded │ [292, 67, 355, 28241, 69, 74, 8457, 7568, 479, 82, 73, 69, 591, 64, 264, 34388, │
│         │ 69, 8457, 73, 220, 220, 7182, 16115, 418, 4533, 499, 1659, 80, 69, 25093]       │
├─────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Decoded │ asd asdjfkjsdf ksjfksa sdkfjsj   powiuosadoapofqfvv                             │
╘═════════╧═════════════════════════════════════════════════════════════════════════════════╛

Embeddings

  • Tokens as 1-hot vectors create sparse matrices that are inefficient and fail to capture semantic relationships
  • Embeddings represent tokens in a lower-dimensional space that preserves semantic relationships
  • Embeddings are learned, not universal representations, and are specific to the model they’re trained in
  • The embedding space dimension is a tunable hyperparameter (larger = more informative but more resource-intensive)
  • Embeddings start as random values and are refined during training through gradient updates (a tiny sketch of one such update follows this list)
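
A minimal sketch of one gradient update on an embedding row. The objective here is a toy one, purely to show that the looked-up rows receive gradients and change; this is not how an LLM is actually trained.

import torch

emb = torch.nn.Embedding(10, 4)
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

token_ids = torch.tensor([2, 5])
before = emb.weight[2].detach().clone()

# Toy objective: pull the selected embeddings towards zero
loss = emb(token_ids).pow(2).mean()
loss.backward()
opt.step()

print(torch.allclose(before, emb.weight[2]))  # False: row 2 changed; untouched rows did not (plain SGD, no momentum)
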
import importlib
import stage1.embeddings
importlib.reload(stage1.embeddings)
<module 'stage1.embeddings' from '/Users/jaju/github/knowledge-garden/content/notes/llm-from-scratch/stage1/embeddings.py'>

What do they look like?

Let’s create a randomly initialized embedding matrix for a vocabulary of size 10 and an embedding dimension of 4. The embeddings matrix is then of size (10, 4), where each row corresponds to a token in the vocabulary and each column corresponds to a dimension in the embedding space.

import torch
import stage1.embeddings
embedding_dim = 4
vocab_size = 10

embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
print(embeddings.weight)
print(f"Embedding of token-id 2 is {embeddings(torch.tensor([1]))}")
Parameter containing:
tensor([[-9.9448e-02, -1.1365e-01, -6.4670e-01,  1.3754e+00],
        [-1.7200e-03, -1.2643e+00, -1.5873e+00,  1.9216e+00],
        [-1.8102e-01,  7.2387e-01,  1.6446e+00,  1.5566e+00],
        [-2.0930e-01,  5.7910e-01, -8.7329e-01,  4.8594e-01],
        [-3.7975e-01,  9.9947e-01,  1.2793e+00, -1.0777e+00],
        [ 1.8252e+00, -5.2073e-01,  9.4681e-01,  4.4129e-01],
        [ 2.0121e+00,  3.1196e-01, -1.1323e-01, -8.9150e-01],
        [-6.9310e-01, -1.2825e+00, -5.1667e-01,  1.5186e+00],
        [ 8.8475e-02, -1.3853e+00,  3.2707e-01, -2.2567e+00],
        [-7.8447e-01, -1.1544e+00,  6.3983e-01, -1.6529e+00]],
       requires_grad=True)
Embedding of token-id 1 is tensor([[-1.7200e-03, -1.2643e+00, -1.5873e+00,  1.9216e+00]],
       grad_fn=<EmbeddingBackward0>)

The embedding of a token with id i is simply the i-th row of the embedding matrix. For example, the embedding of the token with id 0 is the first row of the embedding matrix. Mapping tokens to embeddings is plainly a lookup operation.
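
The lookup view can be sanity-checked: multiplying a one-hot vector by the weight matrix selects the same row that the Embedding layer returns. A small sketch reusing the embeddings layer and vocab_size defined above (real models never materialize one-hot vectors; this is only to illustrate the equivalence):

import torch
import torch.nn.functional as F

token_id = torch.tensor(2)
one_hot = F.one_hot(token_id, num_classes=vocab_size).float()  # shape (10,)
via_matmul = one_hot @ embeddings.weight                       # picks out row 2 of the weight matrix
via_lookup = embeddings(token_id)
print(torch.allclose(via_matmul, via_lookup))                  # True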

V0

Printing the embedding matrix of a made-up vocabulary. The embeddings are randomly initialized; the size of the matrix is determined by the vocabulary size and the embedding dimension, and the embedding dimension is a hyperparameter that can be tuned.

The embedding weight matrix is a tensor of shape (vocab_size, embedding_dim).

import torch.nn as nn
nn.Embedding(10, 4).weight
Parameter containing:
tensor([[ 1.1944,  1.4964, -0.2055,  1.3608],
        [ 0.0955, -0.1670,  0.7993, -1.2934],
        [-0.4193, -0.6237,  0.4504, -0.5121],
        [ 0.2605, -0.1577,  0.8277, -1.0529],
        [ 1.0344,  0.3868, -0.1455,  0.3573],
        [-0.2662,  0.7649, -0.6125, -0.4384],
        [-1.0989, -0.0393,  0.3866,  0.1433],
        [-0.0861, -0.4805, -0.1692,  2.3211],
        [-0.3632, -0.9352,  0.9874,  0.1108],
        [-1.2629,  0.1632, -1.8517, -0.7092]], requires_grad=True)

Once we have the embeddings matrix, we can map input tokens to their corresponding embeddings.

stage1.embeddings.v0()
Pseudo-randomly initialized embedding layer:
╒══════════════════════════╤══════════════════════════════════════════════════════════════════════════════╕
│ Vocab Size               │ 10                                                                           │
├──────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Output Dimension         │ 3                                                                            │
├──────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Embedding Layer          │ Embedding(10, 3)                                                             │
├──────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Layer Weights            │ Parameter containing: tensor([[-1.1258, -1.1524, -0.2506],         [-0.4339, │
│                          │ 0.8487,  0.6920],         [-0.3160, -2.1152,  0.3223],         [-1.2633,     │
│                          │ 0.3500,  0.3081],         [ 0.1198,  1.2377, -0.1435],         [-0.1116,     │
│                          │ -0.6136,  0.0316],         [-0.4927,  0.2484,  0.4397],         [ 0.1124,    │
│                          │ -0.8411, -2.3160],         [-0.1023,  0.7924, -0.2897],         [ 0.0525,    │
│                          │ 0.5229,  2.3022]], requires_grad=True)                                       │
├──────────────────────────┼──────────────────────────────────────────────────────────────────────────────┤
│ Embedding for token_id 5 │ tensor([-0.1116, -0.6136,  0.0316], grad_fn=<EmbeddingBackward0>)            │
╘══════════════════════════╧══════════════════════════════════════════════════════════════════════════════╛
╒═══════════════════════════════════╤═══════════════════════════════════════════════════════════════════════════╕
│ Embedding for token_ids [2, 4, 6] │ tensor([[-0.3160, -2.1152,  0.3223],         [ 0.1198,  1.2377, -0.1435], │
│                                   │ [-0.4927,  0.2484,  0.4397]], grad_fn=<EmbeddingBackward0>)               │
╘═══════════════════════════════════╧═══════════════════════════════════════════════════════════════════════════╛

V1

  • Data loaders create batches of token-ids for efficient processing
  • Batching is valuable for large text corpora
  • Batched processing leverages GPU acceleration when available
  • Even on CPU, batching enables more efficient multi-threaded processing
  • The example demonstrates creating embeddings for “The Verdict” text using the previously created data loader with batch processing (a simplified sketch of such a loader follows the output below)
stage1.embeddings.v1()
╒════════════════╤═══════╕
│ vocab_size     │ 50257 │
├────────────────┼───────┤
│ embedding_size │   256 │
├────────────────┼───────┤
│ input_length   │     4 │
├────────────────┼───────┤
│ batch_size     │     8 │
╘════════════════╧═══════╛
Embedding Layer: Parameter containing:
tensor([[ 0.9383,  0.4889, -0.6731,  ...,  1.2948,  1.4628, -0.6204],
        [ 0.6257, -1.2231, -0.6232,  ...,  0.3260,  0.5352,  1.9733],
        [-1.4115, -1.0295,  0.1267,  ...,  0.5027, -0.8871,  1.9974],
        ...,
        [ 0.6928, -0.5382, -0.8726,  ..., -0.5148,  0.9695,  0.7689],
        [-0.5866,  0.6971,  1.8386,  ...,  0.4298, -0.5139,  1.6624],
        [ 0.6073,  0.2991,  0.7669,  ..., -1.3811, -1.4284, -0.5630]],
       requires_grad=True)
Inputs shape: torch.Size([8, 4])
Targets shape: torch.Size([8, 4])
Input tensor shape:  torch.Size([1, 4])
Input Tensor: tensor([[39936, 24254,  7996, 42174]])
Output tensor shape:  torch.Size([1, 4, 256])
Output Tensor: tensor([[[-0.9905,  0.4149, -0.1217,  ...,  2.3362, -0.5502,  0.3072],
         [ 2.0188,  0.2669, -0.0151,  ..., -0.1302,  0.0308, -0.0452],
         [ 1.7518,  1.2162, -1.0058,  ..., -1.2053, -0.8477, -0.3506],
         [ 0.3572, -1.6816,  1.1135,  ...,  0.8234,  0.5311, -0.3427]]],
       grad_fn=<EmbeddingBackward0>)
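
A minimal sketch of such a sliding-window data loader: my own simplified version, assuming the tiktoken GPT-2 encoding and that data/the_verdict.txt has already been downloaded (as in the logs above); the module's actual implementation may differ.

import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    """Chunks token ids into (input, target) pairs, with targets shifted by one position."""
    def __init__(self, text, max_length=4, stride=4):
        enc = tiktoken.get_encoding("gpt2")
        ids = enc.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(ids) - max_length, stride):
            self.inputs.append(torch.tensor(ids[i:i + max_length]))
            self.targets.append(torch.tensor(ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

with open("data/the_verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

loader = DataLoader(SlidingWindowDataset(raw_text), batch_size=8, shuffle=False)
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)    # torch.Size([8, 4]) torch.Size([8, 4])

embedding_layer = torch.nn.Embedding(50257, 256)
print(embedding_layer(inputs).shape)  # torch.Size([8, 4, 256])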

Aside: What Do The Layers and Operations Do?

vocab_size = 512
embeddings_size = 8
dummy_token_ids = torch.arange(vocab_size)
embeddings_layer = torch.nn.Embedding(vocab_size, embeddings_size)

# Print the embeddings of the first 4 token ids (out of `vocab_size`)
print(embeddings_layer(dummy_token_ids[:4]))
# Print the embedding of token-id 3 (the 4th token). Note it is just the last row of the previous output
print(embeddings_layer(torch.tensor(3)))


# Position embeddings layer
context_length = 8
output_dim = embeddings_size
# Randomly initialized but not our current concern
pos_embeddings_layer = torch.nn.Embedding(context_length, output_dim)
print(pos_embeddings_layer)
tensor([[ 0.7480,  2.0246,  0.8172, -0.6008, -0.9860,  1.7485,  0.3155, -0.7663],
        [ 0.0486,  0.3841,  0.6013,  1.2091,  0.7474,  1.2037, -0.5525, -0.5882],
        [ 0.9393,  2.0120,  0.7987, -0.0642, -0.0212,  0.2408, -1.3335, -2.3596],
        [ 0.4245, -1.2994, -0.8715, -0.1588, -0.8418,  1.3300, -0.2200, -0.4653]],
       grad_fn=<EmbeddingBackward0>)
tensor([ 0.4245, -1.2994, -0.8715, -0.1588, -0.8418,  1.3300, -0.2200, -0.4653],
       grad_fn=<EmbeddingBackward0>)
Embedding(8, 8)

V2 - Using word positions

  • Taking the previous example further: while embeddings capture semantic relationships in a denser space, we also need to encode token positions within the input sequence.
    • A word’s meaning can change based on its position
    • A sentence’s meaning can change based on word order
    • We hypothesize this and hope our neural network architecture will learn it
  • We’ll abstract away the details of position encoding implementation:
    • No clean notion of position exists (no defined start/end for text we process)
    • Focus on the current input and calculations relative to it
  • Even within the current input batch, position is a hazy concept due to sliding windows:
    • Options include using absolute positions within current sequence/batch
    • Or encoding relative positions instead
stage1.embeddings.v2pos()
╒════════════════╤═══════╕
│ vocab_size     │ 50257 │
├────────────────┼───────┤
│ embedding_size │   256 │
├────────────────┼───────┤
│ input_length   │     4 │
├────────────────┼───────┤
│ batch_size     │     8 │
╘════════════════╧═══════╛
Embedding Layer: Parameter containing:
tensor([[-0.7690,  1.6553, -0.7893,  ..., -0.6867, -0.2874, -0.3671],
        [ 0.8989,  0.8925, -0.3567,  ..., -0.1085,  0.1085, -0.6313],
        [ 0.2561,  0.1745, -0.0582,  ..., -0.6859,  0.5682, -1.6920],
        ...,
        [-1.4095, -0.0738,  0.0833,  ...,  0.2731, -0.4048,  0.7096],
        [ 0.7286, -0.1222, -0.2912,  ..., -0.9472,  1.4767,  0.2563],
        [ 1.0063, -1.2769, -0.8027,  ..., -0.8209, -0.4761,  0.5803]],
       requires_grad=True)
Positional Embedding Layer: Parameter containing:
tensor([[-0.9512,  0.4695,  0.0280,  ..., -1.0012, -1.7668, -1.7920],
        [-0.1993,  0.4765,  0.0246,  ...,  1.4640, -0.5069,  1.0845],
        [ 1.4029, -1.0208, -0.0899,  ..., -1.2818, -0.1527, -0.1735],
        [ 0.9177,  1.3376,  1.4023,  ..., -0.9248,  0.3450, -1.5700]],
       requires_grad=True)
Input shape:  torch.Size([8, 4, 256])
Plain token embeddings:  tensor([[[-1.0169e-01, -2.8884e-03,  6.7786e-02,  ..., -1.4209e+00,
           3.5037e-01, -2.3330e-01],
         [ 1.3920e+00,  5.9468e-02,  2.6447e-01,  ..., -1.1079e+00,
          -4.8958e-01, -9.2617e-01],
         [ 7.3564e-01,  2.8131e-01, -7.7062e-01,  ...,  3.4536e-02,
          -7.3732e-01,  9.5018e-01],
         [ 4.0259e-01,  6.2462e-01, -1.2347e+00,  ..., -1.6737e-02,
          -1.2412e+00, -1.1772e+00]],

        [[-3.4561e-01,  8.8373e-02,  1.2933e+00,  ..., -2.6212e+00,
           3.0364e+00, -2.3153e-02],
         [ 1.1975e+00, -8.2988e-01,  3.2502e-01,  ...,  1.2352e+00,
           1.5225e+00, -8.3964e-01],
         [-1.4959e-01, -3.8170e-01,  7.7954e-01,  ...,  1.2170e+00,
           1.3088e-02,  8.1160e-01],
         [ 1.3594e+00, -6.7180e-02, -1.4525e+00,  ...,  8.3537e-01,
          -7.4209e-01, -1.0441e+00]],

        [[ 1.5813e+00, -1.1984e+00,  6.5373e-01,  ..., -2.3055e-02,
          -9.5937e-01,  2.7789e-01],
         [ 1.6993e+00,  1.0975e+00,  1.3544e-01,  ...,  7.4869e-01,
           4.8076e-01,  9.3548e-01],
         [-1.3745e+00, -9.8519e-02, -3.4389e-01,  ...,  6.4930e-01,
          -8.8472e-01,  1.0561e+00],
         [-9.4417e-01, -8.7664e-01, -1.2534e+00,  ...,  1.1642e+00,
          -3.8781e-01,  4.7189e-01]],

        ...,

        [[ 4.0259e-01,  6.2462e-01, -1.2347e+00,  ..., -1.6737e-02,
          -1.2412e+00, -1.1772e+00],
         [ 1.2422e+00,  2.1564e+00, -5.3766e-02,  ...,  1.9843e-02,
          -1.7387e+00,  3.6680e-01],
         [ 1.1095e+00, -1.8102e+00, -5.0129e-01,  ...,  6.6180e-01,
          -1.1181e+00, -2.5616e-01],
         [ 2.7070e-01,  8.6201e-01, -4.7629e-01,  ...,  4.3038e-02,
           8.6910e-02,  1.9684e+00]],

        [[-2.9615e-01,  2.9772e-01,  1.5972e-01,  ...,  1.3382e-01,
          -1.6013e+00, -5.4816e-01],
         [ 4.1694e-01,  5.9602e-02,  1.4777e-01,  ...,  9.9320e-03,
           1.4118e+00, -1.1616e+00],
         [-9.3665e-01, -1.5489e-01,  9.6246e-01,  ..., -9.2383e-02,
           2.2189e-01, -2.0200e-01],
         [-3.9775e-01,  8.9565e-01, -2.8307e-01,  ...,  5.8753e-01,
           3.4892e-01,  5.4884e-01]],

        [[-3.5788e+00,  2.9069e-01,  1.8510e-01,  ..., -7.8466e-01,
           3.6996e-01, -4.4498e-01],
         [ 2.9396e-01,  2.9446e-01,  7.0959e-01,  ...,  8.4308e-01,
           1.3235e+00,  9.8919e-01],
         [-1.2524e+00,  2.1871e+00,  6.8456e-01,  ..., -6.1270e-01,
          -6.3723e-01, -8.0530e-02],
         [-6.2119e-01,  1.5308e-01,  1.1285e+00,  ..., -1.3810e+00,
          -3.9482e-01, -1.2915e+00]]], grad_fn=<EmbeddingBackward0>)
Positional embeddings:  tensor([[[-1.0529,  0.4666,  0.0958,  ..., -2.4221, -1.4164, -2.0253],
         [ 1.1927,  0.5360,  0.2890,  ...,  0.3561, -0.9965,  0.1583],
         [ 2.1385, -0.7395, -0.8605,  ..., -1.2473, -0.8901,  0.7767],
         [ 1.3203,  1.9622,  0.1676,  ..., -0.9415, -0.8962, -2.7471]],

        [[-1.2968,  0.5579,  1.3213,  ..., -3.6224,  1.2696, -1.8152],
         [ 0.9982, -0.3534,  0.3496,  ...,  2.6992,  1.0155,  0.2448],
         [ 1.2533, -1.4025,  0.6897,  ..., -0.0648, -0.1397,  0.6381],
         [ 2.2771,  1.2704, -0.0502,  ..., -0.0894, -0.3971, -2.6140]],

        [[ 0.6300, -0.7289,  0.6817,  ..., -1.0242, -2.7261, -1.5141],
         [ 1.5000,  1.5740,  0.1600,  ...,  2.2127, -0.0262,  2.0200],
         [ 0.0283, -1.1193, -0.4338,  ..., -0.6325, -1.0375,  0.8826],
         [-0.0264,  0.4609,  0.1489,  ...,  0.2394, -0.0428, -1.0981]],

        ...,

        [[-0.5487,  1.0941, -1.2067,  ..., -1.0179, -3.0080, -2.9692],
         [ 1.0429,  2.6329, -0.0292,  ...,  1.4838, -2.2457,  1.4513],
         [ 2.5123, -2.8310, -0.5912,  ..., -0.6200, -1.2709, -0.4296],
         [ 1.1884,  2.1996,  0.9260,  ..., -0.8817,  0.4319,  0.3984]],

        [[-1.2474,  0.7672,  0.1877,  ..., -0.8674, -3.3680, -2.3402],
         [ 0.2176,  0.5361,  0.1723,  ...,  1.4739,  0.9049, -0.0772],
         [ 0.4662, -1.1757,  0.8726,  ..., -1.3742,  0.0691, -0.3755],
         [ 0.5200,  2.2332,  1.1192,  ..., -0.3372,  0.6939, -1.0211]],

        [[-4.5300,  0.7602,  0.2131,  ..., -1.7858, -1.3968, -2.2370],
         [ 0.0946,  0.7710,  0.7342,  ...,  2.3071,  0.8166,  2.0737],
         [ 0.1504,  1.1663,  0.5947,  ..., -1.8945, -0.7900, -0.2540],
         [ 0.2965,  1.4907,  2.5308,  ..., -2.3057, -0.0498, -2.8615]]],
       grad_fn=<AddBackward0>)
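
For clarity, a minimal sketch of the combination step itself, using the same sizes as above and random token ids as a stand-in for a real data-loader batch:

import torch

vocab_size, embedding_dim = 50257, 256
batch_size, context_length = 8, 4

token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

token_ids = torch.randint(0, vocab_size, (batch_size, context_length))  # stand-in for a real batch
token_embeddings = token_embedding_layer(token_ids)                     # (8, 4, 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))      # (4, 256), absolute positions

input_embeddings = token_embeddings + pos_embeddings                    # broadcasts over the batch dimension
print(input_embeddings.shape)                                           # torch.Size([8, 4, 256])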


Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” https://arxiv.org/abs/2005.14165.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” https://arxiv.org/abs/1810.04805.

Li, Chuan. 2020. “GPT-3: A Technical Overview.” https://lambdalabs.com/blog/demystifying-gpt-3.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781.

Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” https://arxiv.org/abs/2203.02155.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Empirical Methods in Natural Language Processing (EMNLP), 1532–43. http://www.aclweb.org/anthology/D14-1162.

Soldaini, Luca, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, et al. 2024. “Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.” https://arxiv.org/abs/2402.00159.