LLMs: Exploring To Understand

Delving into LLMs to probe how they do what they do.

A Token of Appreciation

Delve. Delve. Delve! For what are we, our selve, If not the mind, That likes to delve.

I wish you would deeply embed in your memory a token of appreciation for the previous article in this series.

Indeed, I have a lame sense of humor. But all I mean to say is, there will be references to the previous article linked above, and it might be helpful to keep that in context.

What is a Language Model?

Please don’t run away. This is a customary section. Every article needs to explain what an LM means, while ensuring that the reader ends up being further bored, and no wiser. Like, er, monads!

Play time. Let’s play autocomplete.

The ball hit the bat, and _

What did you think of? Cricket?

For most of the member nations of the Commonwealth (my country being one), cricket is a sport first. Had the British not arrived, but the English language somehow had, it would have been the insect first.

So, here are a few possible completions that might not surprise you much if you were reminded of the sport.

  • The ball hit the bat, and it raced to the boundary.
  • The ball hit the bat, and off it flew into the fielder’s hand!
  • The ball hit the bat, and the players took off for the winning run.

Here’s a completely different one.

The ball hit the bat, and it fell down to the ground, gravely injured.

Depending on your current frame of mind, it might take you just a wee bit more time to regroup your thinking and anchor the word bat into the animal kingdom context, and then it all makes sense.

I’m not going to belabor this point much - but the term language model, in the context of the term LLM, describes a set of characteristics of a language captured in a neural network’s architecture and in the values of the weights and biases stored therein.

So, if you were continuing your discussion with a biologist, the above sentence would veer towards the animal kingdom. Otherwise, towards the sport. Either way, it would depend on the context. We’ll keep to the sport context, and here’s one last completion before we move on…

The ball hit the bat, and the bat swung hard, pushing it away furiously. Why, of course. That bat was being handled by the one and only Sachin.

MoDelve: Model Delve

Let us get started by exploring a reasonably sized, high-quality LLM (as of the first half of 2024) - the microsoft/Phi-3-mini-128k-instruct.

The sizing…

This is the most important engineering aspect we should get out of the way before moving to our primary track.

As we deal with large, billion-parameter-scale models, and the code that goes along with them, we have to be mindful of the memory and processing they will demand.

Let’s choose a model that is small, but high quality. And, as I explain below, we need to make it smaller still for a regular developer machine with about 16 gigs of RAM. (If you are on a lower-configuration machine in 2024, I feel sorry. I already feel sorry for 16…)

The microsoft/Phi-3-mini-128k-instruct model can run on a phone. But there are a few things we must understand clearly before we move forward.

The model size is indicated in terms of the number of parameters it holds - which are all the weights and biases of the network. Each parameter is a numeric value held in a specific data-type with its own size.

Naïvely speaking, if each parameter is held in a 32-bit float data-type, then we get the total size in bytes by multiplying the number of parameters by 4. A so-called 4B model will take up roughly 16 GB of memory.

We can do away with the need for 32 bits for our numbers, because, it turns out, reducing the fidelity of these numbers to 16, 8 and 4 bits does not always bring down the quality of the inference in a perceptible manner. Oh, let me emphasize this - we are only talking about downsizing during inference time. Training computations are sensitive and we will keep that out of scope for this article.

This idea of down-sizing parameters is called quantization, and it allows us to run inference on less expensive hardware, and even on low-power devices. Be aware that quantization is not about downsizing every single number you come across in a model’s representation, and certainly not downsizing in the same manner across the board.
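To put some back-of-envelope numbers on this, here is a tiny sketch. I am taking Phi-3-mini to be roughly 3.8 billion parameters; treat that count, and the resulting figures, as indicative only.

# Rough parameter-memory estimate for a ~3.8B parameter model at various precisions.
num_params = 3.8e9

for bits in (32, 16, 8, 4):
    gigabytes = num_params * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{bits:>2}-bit parameters: ~{gigabytes:.1f} GB")

So the raw weights alone need roughly 15 GB at 32 bits, but only around 2 GB at 4 bits - before counting activations, the KV cache, and other runtime overheads.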

We’ll use quantization implicitly in the examples that follow, via the quanto backend (you can read an overview here), or whichever backend makes sense on your setup. That way, we avoid running into memory problems.

Inspecting our model

Let’s get going. It always gives us a kick to go on a test drive before we buy something - like this new model. First, the looks!

from transformers import AutoModelForCausalLM, QuantoConfig
model_name = "microsoft/Phi-3-mini-128k-instruct"
device = "cpu" # Can be "mps" on Apple Silicon, or "cuda" if you have it.

quantization_config = QuantoConfig(weights="int4")
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=quantization_config,
                                             device_map=device,
                                             trust_remote_code=True)

print(model)
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.

Loading checkpoint shards:   0% 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50% 1/2 [00:06<00:06,  6.97s/it]
Loading checkpoint shards: 100% 2/2 [00:11<00:00,  5.79s/it]
Loading checkpoint shards: 100% 2/2 [00:11<00:00,  5.97s/it]
Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): QLinear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): QLinear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3SuScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): QLinear(in_features=3072, out_features=16384, bias=False)
          (down_proj): QLinear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)

Note the use of AutoModelForCausalLM. It knows how to pick the right instance from a list of supported causal LMs. (Note: The link to the list is to a specific git commit and the latest version may be different.)
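If you want to see which class it resolved to, a quick check on the loaded object tells you (nothing fancy here):

# AutoModelForCausalLM resolved to the Phi-3 specific class for us.
print(type(model).__name__)
print(model.config.architectures)

Both should point at Phi3ForCausalLM, matching the printout above.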

Note the layers in the model.

  • A few layers (embedding, dropout) at the beginning. The embedding layer maps each token ID, drawn from a vocabulary of size 32064, to an embedding vector of dimension 3072.
  • 32 Phi3DecoderLayer blocks, each containing a Phi3Attention (self-attention) module and a Phi3MLP, wrapped with Phi3RMSNorm layers and dropouts.
  • A final Phi3RMSNorm.
  • Finally, a Linear layer (the lm_head) that transforms a 3072-dimensional vector into a 32064-dimensional output.

32064 is the size of the vocabulary - the number of distinct tokens the model uses to encode the input and generate the output. If you inspect the tokenizer data, the number of tokens actually defined is 32011 (32000 base tokens + 11 special tokens). 32064 is the vocabulary size declared in the model’s configuration (likely padded up for alignment), and I suspect the token IDs from 32011 onwards are simply unused (the IDs being 0-based).
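We can cross-check these numbers on the loaded model itself. A small verification sketch - the shapes below are what I expect based on the printout above:

# Cross-check the vocabulary size and the embedding / output dimensions.
print(model.config.vocab_size)                     # 32064
print(model.get_input_embeddings().weight.shape)   # torch.Size([32064, 3072])
print(model.get_output_embeddings().weight.shape)  # torch.Size([32064, 3072]) - the lm_head

The input and output ends of the network agree on the vocabulary size. Now, on to the tokenizer itself.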

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
LlamaTokenizerFast(name_or_path='microsoft/Phi-3-mini-128k-instruct', vocab_size=32000, model_max_length=131072, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=False),
	32000: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	32001: AddedToken("<|assistant|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32002: AddedToken("<|placeholder1|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32003: AddedToken("<|placeholder2|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32004: AddedToken("<|placeholder3|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32005: AddedToken("<|placeholder4|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32006: AddedToken("<|system|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32007: AddedToken("<|end|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32008: AddedToken("<|placeholder5|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32009: AddedToken("<|placeholder6|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32010: AddedToken("<|user|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
}

Let’s decode a few specific token IDs to get a feel for them. Nothing gets printed for 32011, which suggests it is indeed unused, as conjectured.

print(0, tokenizer.decode([0]))
print(1, tokenizer.decode([1]))
print(1001, tokenizer.decode([1001]))
print(15987, tokenizer.decode([15987]))
print(32009, tokenizer.decode([32009]))
print(32010, tokenizer.decode([32010]))
print(32011, tokenizer.decode([32011]))
0 <unk>
1 <s>
1001 ER
15987 effect
32009 <|placeholder6|>
32010 <|user|>
32011
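And going the other way - here is what the tokenizer does to a phrase from earlier in this article. Just a quick round trip; the exact IDs on your run depend on the tokenizer version.

# Encode a phrase into token IDs, then decode it back to text.
ids = tokenizer("The ball hit the bat")["input_ids"]
print(ids)
print(tokenizer.decode(ids))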

The code below is picked almost directly from the transformers documentation, with a few minor tweaks that allow us to delve into some finer parts.

Let’s see what the model can do for a Hello World. Depending on your hardware, this can take a non-trivial amount of time. Note that we have used a quantized model - 4-bit representations for the parameters. So, while memory should not be a problem, the number of computations is still very high, and the CPU is going to be busy. If you have a GPU supported by torch, you should see a good speedup.

example_text = "Today is"

# The .to(device) is optional here; with device = "cpu", tensors stay on the CPU anyway.
example_text_tokens = tokenizer([example_text], return_tensors="pt").to(device)

example_text_generated_ids = model.generate(**example_text_tokens)
example_text_generated = tokenizer.batch_decode(example_text_generated_ids, skip_special_tokens=True)
print(example_text_generated[0])
Today is a great day to be alive.

Let’s convert this into a function and try a few more examples.

def complete_the_input_text(prompt):
    prompt_tokens = tokenizer([prompt], return_tensors="pt").to(device)
    generated_ids = model.generate(**prompt_tokens)
    generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return generated_text[0]
print(complete_the_input_text("India is a"))
print(complete_the_input_text("Pakistan is a"))
print(complete_the_input_text("China is a"))
print(complete_the_input_text("Burma is a"))
print(complete_the_input_text("June 21 is a"))
print(complete_the_input_text("Oracle is a"))
'India is a country in South Asia. It is located in the Indian subcontinent and is'
'Pakistan is a country that has been facing a significant challenge in terms of its population growth. The'
'China is a country with a large population and a growing economy, but also a country with a'
'Burma is a country in Southeast Asia. It is located in the region known as'
'June 21 is a Wednesday'
'Oracle is a database management system that is part of the Oracle Corporation suite of database applications. It'
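You may have noticed that most of those completions stop mid-sentence. That is because generate(), unless told otherwise, caps the output length at a small default. A small variation on the helper, passing the standard max_new_tokens argument, lets us ask for longer completions (60 is an arbitrary choice):

def complete_the_input_text_longer(prompt, max_new_tokens=60):
    prompt_tokens = tokenizer([prompt], return_tensors="pt").to(device)
    # Ask generate() for up to max_new_tokens tokens beyond the prompt.
    generated_ids = model.generate(**prompt_tokens, max_new_tokens=max_new_tokens)
    generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return generated_text[0]

print(complete_the_input_text_longer("India is a"))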
