Interview Questions
Can you provide a high-level overview of Transformers' architecture?
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.

Looking closer into the black box, a Transformer contains:
- An encoding component: a stack of N encoders.
- A decoding component: a stack of N decoders.
- Connections between them.

Now, each encoder is broken down into two sub-layers: the self-attention layer and the feed-forward neural network layer.
The inputs first flow through a self-attention layer, and the outputs of the self-attention layer are fed to a feed-forward neural network. This sequence repeats until it reaches the last encoder.
Finally, the decoder receives the output of the encoder component and also has both the self-attention layer and feed-forward layer, and the flow is similar to before, but between them there is an attention layer that helps the decoder focus on relevant parts of the input sentence.

How is next sentence prediction (NSP) used in language modeling?
Next sentence prediction (NSP) is used in language modeling as one-half of the training process behind the BERT model (the other half is masked-language modeling (MLM)). The objective of next-sentence prediction training is to predict whether one sentence logically follows the other sentence presented to the model.
During training, the model is presented with pairs of sentences, some of which are consecutive in the original text, and some of which are not. The model is then trained to predict whether a given pair of sentences are adjacent or not. This allows the model to understand longer-term dependencies across sentences.
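A minimal sketch of how such NSP training pairs can be constructed (illustrative code, not BERT's actual data pipeline):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) training pairs.

    Roughly half the pairs are consecutive sentences (label 1); the rest
    pair a sentence with a random non-adjacent one (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))  # IsNext
        else:
            j = rng.choice([k for k in range(len(sentences)) if k not in (i, i + 1)])
            pairs.append((sentences[i], sentences[j], 0))      # NotNext
    return pairs

sents = ["The dog barked.", "It was loud.", "I made tea.", "Then I slept."]
pairs = make_nsp_pairs(sents)
```

The model is then trained as a binary classifier on the `is_next` label.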
Researchers have found that without NSP, BERT performs worse on every metric, so its use is relevant to language modeling.
How can you evaluate the performance of Language Models?
There are two ways to evaluate language models in NLP: Extrinsic evaluation and Intrinsic evaluation.
- Intrinsic evaluation measures how well the model captures what it is designed to capture (for example, assigning high probability to real text).
- Extrinsic evaluation (or task-based evaluation) captures how useful the model is in a particular task.
A common intrinsic evaluation of an LM is perplexity: the geometric mean of the inverse probability of the words predicted by the model. Intuitively, perplexity measures surprise: how surprised the model is when it sees new data. The lower the perplexity, the better the training. Another common measure is cross-entropy, which is the logarithm (base 2) of perplexity. As a rule of thumb, a reduction of 10-20% in perplexity is noteworthy.
The extrinsic evaluation will depend on the task. Example: For speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.
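The perplexity computation described above can be sketched in a few lines (a toy example, not a production metric):

```python
import math

def perplexity(probs):
    """Perplexity from the model's probability for each predicted word:
    the geometric mean of the inverse probabilities, computed as
    2 raised to the cross-entropy (bits per word)."""
    n = len(probs)
    cross_entropy = -sum(math.log2(p) for p in probs) / n
    return 2 ** cross_entropy

# A model that assigns probability 0.25 to every word is as "surprised"
# as a uniform choice among 4 words:
pp = perplexity([0.25, 0.25, 0.25])
assert pp == 4.0
```

Note the relation used above: cross-entropy is log2 of perplexity, so a uniform distribution over 4 words gives a cross-entropy of exactly 2 bits.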
How do generative language models work?
The very basic idea is the following: they take n tokens as input, and produce one token as output.

A token is a chunk of text. In the context of OpenAI GPT models, common and short words typically correspond to a single token and long and less commonly used words are generally broken up into several tokens.
This basic idea is applied in an expanding-window pattern. You give it n tokens in, it produces one token out, then it incorporates that output token as part of the input of the next iteration, produces a new token out, and so on. This pattern keeps repeating until a stopping condition is reached, indicating that it finished generating all the text you need.
Now, behind the output is a probability distribution over all the possible tokens. What the model does is return a vector in which each entry expresses the probability of a particular token being chosen.

This probability distribution comes from the training phase. During training, the model is exposed to a lot of text, and its weights are tuned to predict good probability distributions, given a sequence of input tokens.
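The expanding-window loop can be sketched as follows; `toy_model` here is a made-up stand-in for a trained network, not a real GPT model:

```python
import math

def toy_model(context, vocab_size=5):
    """Hypothetical stand-in for a trained LM: returns a probability
    distribution over the vocabulary given the input tokens."""
    logits = [2.0 if t == len(context) % vocab_size else 0.0
              for t in range(vocab_size)]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]  # softmax over the vocabulary

def generate(model, prompt_tokens, max_new_tokens=5, stop_token=None):
    """Expanding-window loop: n tokens in, one token out, then the output
    token becomes part of the next input, until a stop condition."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        if next_token == stop_token:
            break
        tokens.append(next_token)
    return tokens

out = generate(toy_model, [1, 2], max_new_tokens=3)
```

Real systems usually sample from the distribution (with temperature, top-k, etc.) rather than always taking the argmax as done here.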
GPT generative models are trained with a large portion of the internet, so their predictions reflect a mix of the information they’ve seen.
What is a token in the Large Language Models context?
ChatGPT and other LLMs rely on input text being broken into pieces. Each piece, roughly word-sized or smaller, is called a sub-word token. That process is called tokenization and is done using a tokenizer.
Tokens can be words or just chunks of characters. For example, the word "hamburger" gets broken up into the tokens "ham", "bur" and "ger", while a short and common word like "pear" is a single token. Many tokens start with whitespace, for example, " hello" and " bye".
The models understand the statistical relationships between these tokens and excel at producing the next token in a sequence of tokens.
The number of tokens processed in a given API request depends on the length of both your inputs and outputs. As a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words of English text.
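That rule of thumb can be sketched as code (an estimate only; real token counts come from the model's own tokenizer):

```python
def estimate_tokens(text):
    """Rough token-count estimate for English text, using the
    ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))

# "hamburger" has 9 characters, so the estimate is 2 tokens
# (the real tokenizer splits it into 3: "ham", "bur", "ger").
n = estimate_tokens("hamburger")
```

Such estimates are useful for quick budgeting, but billing and context limits are always computed from actual tokenizer output.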
What's the advantage of using transformer-based vs LSTM-based architectures in NLP?
To create sequence-to-sequence models before the Transformer, we used the famous LSTM with its Encoder-Decoder architecture, where:
- The "Encoder" creates a vector representation of a sequence of words.
- The "Decoder" returns a sequence of words from that vector representation.
The LSTM model takes into account the interdependence of words, so the output of the previous state is needed to operate on the current state. This model has a limitation: it is relatively slow to train, and the input sequence can't be processed in parallel.
Now, the idea of the Transformer is to maintain the interdependence of the words in a sequence without using a recurrent network but only the attention mechanism that is at the center of its architecture. The attention measures how closely two elements of two sequences are related.
In transformer-based architectures, the attention mechanism is applied to a single sequence (also known as a self-attention layer). The self-attention layer determines the interdependence of different words in the same sequence, to associate a relevant representation with it. Take for example the sentence: "The dog didn't cross the street because it was too tired". It is obvious to a human being that "it" refers to the "dog" and not to the "street". The objective of the self-attention process will therefore be to detect the link between "dog" and "it". This feature makes transformers much faster to train compared to their predecessors, and they have been proven to be more robust against noisy and missing data.
As a plus, in contextual embeddings, transformers can draw information from the context to correct missing or noisy data and that is something that other neural networks couldn’t offer.
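The self-attention computation described above can be sketched as follows, with the simplifying assumption that the query, key, and value projections are the identity (real Transformers learn separate projection matrices for each):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    """Minimal self-attention over a sequence of vectors X.
    Simplification: Q = K = V = X (identity projections). Each output is
    a weighted average of all positions, where the weights measure how
    closely two positions in the sequence are related."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product scores of this position against every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Three token vectors; every position attends to all positions at once,
# which is why the whole computation parallelizes (unlike an RNN).
Y = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In the "dog"/"it" example, the weight between the positions of "it" and "dog" would be large after training, which is exactly the link self-attention is meant to detect.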
Can you provide some examples of alignment problems in Large Language Models?
The alignment problem refers to the extent to which a model's goals and behavior align with human values and expectations.
Large Language Models, such as GPT-3, are trained on vast amounts of text data from the internet and are capable of generating human-like text, but they may not always produce output that is consistent with human expectations or desirable values.
The alignment problem in Large Language Models typically manifests as:
- Lack of helpfulness: when the model is not following the user's explicit instructions.
- Hallucinations: when the model makes up nonexistent or wrong facts.
- Lack of interpretability: when it is difficult for humans to understand how the model arrived at a particular decision or prediction.
- Generating biased or toxic output: when a language model that is trained on biased/toxic data may reproduce that in its output, even if it was not explicitly instructed to do so.
How is Adaptive Softmax useful in Large Language Models?
Adaptive softmax is useful in large language models because it allows for efficient training and inference when dealing with large vocabularies. Traditional softmax involves computing probabilities for each word in the vocabulary, which can become computationally expensive as the vocabulary size grows.
Adaptive softmax reduces the number of computations required by grouping words together into clusters based on how common the words are. This reduces the number of computations required to compute the probability distribution over the vocabulary.
Therefore, by using adaptive softmax, large language models can be trained and run more efficiently, allowing for faster experimentation and development.
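The core idea can be illustrated with a simplified two-level softmax, in which the rare words are reached through a single aggregate "tail" entry competing with the frequent "head" words (real adaptive softmax uses several tail clusters and reduced projection sizes):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def two_level_softmax(logits, head_size):
    """Sketch of the idea behind adaptive/hierarchical softmax: frequent
    words live in a small head; rare words sit behind one aggregate tail
    entry, so most predictions only pay for the head computation."""
    head_logits = logits[:head_size]
    tail_logits = logits[head_size:]
    # One aggregate entry stands in for the whole tail cluster.
    tail_mass_logit = math.log(sum(math.exp(l) for l in tail_logits))
    head_probs = softmax(head_logits + [tail_mass_logit])
    p_tail = head_probs[-1]
    # Probabilities inside the tail are rescaled by the cluster probability.
    tail_probs = [p_tail * p for p in softmax(tail_logits)]
    return head_probs[:-1] + tail_probs

probs = two_level_softmax([2.0, 1.0, 0.5, 0.1, -1.0], head_size=2)
```

This factorization is exact here, but the savings in the real technique come from only evaluating a tail cluster when it is actually needed, and from using smaller hidden projections for rare-word clusters.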
How does BERT training work?
BERT (Bidirectional Encoder Representations from Transformers) utilizes a transformer architecture that learns contextual relationships between words in a text and since BERT’s goal is to generate a language representation model, it only needs the encoder part.
The input to the encoder for BERT is a sequence of tokens, which are first converted into vectors and then processed in the neural network. Then, the BERT algorithm makes use of the following two training techniques:
- Masked LM (MLM): Before feeding word sequences into BERT, a percentage of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
- Next Sentence Prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, and sometimes not. The model then has to predict whether the two sentences followed each other.
Now, to help the model distinguish between the two sentences in training, the input is processed with some extra metadata such as:
- Token embeddings: A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
- Segment embeddings: these assign markers to identify each sentence and allow the encoder to distinguish between them.
- Positional embeddings: these indicate each token's position in the sentence.
Then, to predict if the second sentence is indeed connected to the first, the following steps are performed:
- The entire input sequence goes through the Transformer model.
- The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
- The probability of IsNextSequence is calculated with softmax.
When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.
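The NSP classification step on the [CLS] output can be sketched as follows; the weights and vectors here are made-up numbers for illustration, not trained values:

```python
import math

def nsp_head(cls_vector, W, b):
    """Sketch of BERT's next-sentence-prediction head: the [CLS] output
    vector is multiplied by a learned 2 x hidden weight matrix (plus
    bias), then passed through softmax, giving [P(IsNext), P(NotNext)]."""
    logits = [sum(w_i * x for w_i, x in zip(row, cls_vector)) + b_i
              for row, b_i in zip(W, b)]
    m = max(logits)
    es = [math.exp(l - m) for l in logits]
    s = sum(es)
    return [e / s for e in es]

cls = [0.2, -0.1, 0.4]                    # hypothetical [CLS] output (hidden size 3)
W = [[1.0, 0.0, 1.0], [-1.0, 0.0, -1.0]]  # hypothetical 2 x hidden classifier weights
b = [0.0, 0.0]
p_is_next, p_not_next = nsp_head(cls, W, b)
```

During pretraining, the cross-entropy loss on this two-way prediction is added to the MLM loss, and both are minimized together.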

How is the Transformer Network better than CNNs and RNNs?
- With an RNN, you have to go word by word to reach the cell of the last word. For long sequences this may take many steps, since each hidden state (the output vector at a word) depends on the previous hidden state. This sequentiality is an obstacle to parallelizing the process, which makes it a major problem for GPUs. In addition, when sequences are too long, the model tends to forget the contents of distant positions, one after the other, or to mix them with the contents of following positions. In general, whenever long-term dependencies are involved, RNNs suffer from the Vanishing Gradient Problem.
- Early efforts tried to solve the dependency problem with sequential convolutions as an alternative to the RNN: a long sequence is taken and convolutions are applied to it. The disadvantage is that CNN approaches require many layers to capture long-term dependencies in sequential data, often without ever succeeding, or only by making the network so large that it eventually becomes impractical.
- The Transformer presents a new approach: it encodes each word and applies the attention mechanism to connect two distant words, and the decoder then predicts the sentence based on all the words preceding the current word. This workflow can be parallelized, accelerating training and solving the long-term dependency problem.
Is there a way to train a Large Language Model (LLM) to store a specific context?
The only way at the moment to "memorize" past conversations is to include past conversations in the prompt.
Consider:
You are a friendly support person. The customer will ask you questions, and you will provide polite responses
Q: My phone won't start. What do I do? <-- This is a past question
A: Try plugging your phone into the charger for an hour and then turn it on. The most common cause for a phone not starting is that the battery is dead.
Q: I've tried that. What else can I try? <-- This is a past question
A: Hold the button for 15 seconds. It may need a reset.
Q: I did that. It worked, but the screen is blank. <-- This is a current question
A:
You will hit a token limit at some point (if you chat long enough). Each GPT-3 model has a maximum number of tokens you can pass to it. In the case of text-davinci-003, it is 4096 tokens. When you hit this limit, the OpenAI API will throw an error.
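One common workaround is to keep only the most recent turns that still fit in the context window. A sketch, using the ~4 characters per token rule of thumb (a real implementation would count tokens with the model's actual tokenizer):

```python
def build_prompt(system, history, current_q, max_tokens=4096, reserve=500):
    """Assemble a prompt from past Q/A turns, dropping the oldest turns
    once the estimated token budget is exhausted. `reserve` leaves room
    for the model's answer."""
    def est(text):
        return max(1, len(text) // 4)  # rough chars-to-tokens estimate

    budget = max_tokens - reserve - est(system) - est(current_q)
    kept = []
    for turn in reversed(history):     # walk newest turns first
        cost = est(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return "\n".join([system] + list(reversed(kept)) + [current_q, "A:"])

history = ["Q: My phone won't start. What do I do?\nA: Try plugging it into the charger.",
           "Q: I've tried that. What else?\nA: Hold the button for 15 seconds."]
prompt = build_prompt("You are a friendly support person.", history,
                      "Q: I did that. It worked, but the screen is blank.")
```

Dropping old turns loses context, which is exactly the limitation the question is about: the model itself has no memory beyond what fits in the prompt.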
What Transfer learning Techniques can you use in LLMs?
There are several Transfer Learning techniques that are commonly used in LLMs. Here are three of the most popular:
- Feature-based transfer learning: This technique involves using a pre-trained language model as a feature extractor, and then training a separate model on top of the extracted features for the target task.
- Fine-tuning: involves taking a pre-trained language model and training it on a specific task. Sometimes when fine-tuning, you can keep the model weights fixed and just add a new layer that you will train. Other times you can slowly unfreeze the layers one at a time. You can also use unlabelled data when pre-training, by masking words and trying to predict which word was masked.
- Multi-task learning: involves training a single model on multiple related tasks simultaneously. The idea is that the model will learn to share information across tasks and improve performance on each individual task as a result.
What is Transfer Learning and why is it important?
A pre-trained model, such as GPT-3, takes care of massive amounts of hard work for developers: it gives the model a basic understanding of the problem and the ability to provide solutions in a generic format.
With transfer learning, given that the pre-trained models can generate basic solutions, we can transfer the learning to another context. As a result, we will be able to customize the model to our requirements using fine-tuning without the need to retrain the entire model.
What's the difference between Encoder vs Decoder models?
Encoder models:
- They use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence.
- The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.
- They are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more general word classification), and extractive question answering.
Decoder models:
- They use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence.
- The pretraining of decoder models usually revolves around predicting the next word in the sentence.
- They are best suited for tasks involving text generation.

What's the difference between Wordpiece vs BPE?
WordPiece and BPE are both subword tokenization algorithms. They work by breaking down words into smaller units, called subwords. We then define a desired vocabulary size and keep adding subwords until the limit is reached.
- BPE starts with a vocabulary of all the characters in the training data. It then iteratively merges the most frequent pairs of characters until the desired vocabulary size is reached. The merging is done greedily, meaning that the most frequent pair of characters is always merged first.
- WordPiece also starts with a vocabulary of all the characters in the training data. It then uses a statistical model to choose the pair of characters that is most likely to improve the likelihood of the training data until the vocab size is reached.
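The greedy BPE merge loop described above can be sketched as follows (a toy corpus and plain frequency counting, without WordPiece's likelihood criterion):

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Greedy BPE sketch: start from characters, then repeatedly merge
    the most frequent adjacent symbol pair until the budget is spent."""
    # Each word is a tuple of current symbols, weighted by its frequency.
    words = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        merged = Counter()
        for word, freq in words.items():   # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = bpe_merges(["low", "low", "lower", "newest", "widest"], 3)
```

Each recorded merge becomes a vocabulary entry, which is how the subword vocabulary grows toward its target size.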
What's the difference between Global and Local Attention in LLMs?
Consider the example sentence "Where is Wally", which should be translated to its Italian counterpart "Dove è Wally". In this architecture, the encoder processes the input word by word, producing one hidden state per word (three in this case).
Then, the attention layer produces a single fixed-size context vector from the encoder hidden states (often as a weighted sum); it represents the "attention" that must be given to the input when processing a given word. This is where global and local attention come into play.
Global attention considers all the hidden states when creating the context vector. When it is applied, a lot of computation occurs, because all the hidden states must be taken into consideration, concatenated into a matrix, and processed by a neural network to compute their weights.
On the other hand, local attention considers only a subset of all the hidden states in creating the context vector. The subset can be obtained in many different ways, such as with Monotonic Alignment and Predictive Alignment.
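Both variants can be sketched as one function: with no window it scores every hidden state (global attention); with a window it scores only a centered subset (a simplified stand-in for monotonic alignment, where the window position is just given rather than predicted):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def context_vector(query, hidden_states, window=None, center=None):
    """Weighted sum of encoder hidden states. window=None means global
    attention; a window plus a center position means local attention
    over a subset of the states."""
    if window is None:
        states = hidden_states                      # global: score everything
    else:
        lo = max(0, center - window)
        hi = min(len(hidden_states), center + window + 1)
        states = hidden_states[lo:hi]               # local: a small subset
    scores = [sum(q * h for q, h in zip(query, s)) for s in states]
    weights = softmax(scores)
    d = len(query)
    return [sum(w * s[j] for w, s in zip(weights, states)) for j in range(d)]

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]   # made-up hidden states
global_ctx = context_vector([1.0, 0.0], H)
local_ctx = context_vector([1.0, 0.0], H, window=1, center=0)
```

The local variant scores only the states near `center`, which is where its computational savings come from.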

What's the difference between next-token-prediction vs masked-language-modeling in LLM?
Both are techniques for training large language models and involve predicting a word in a sequence of words.
- Next token prediction: the model is given a sequence of words with the goal of predicting the next word. For example, given the phrase "Hannah is a ____", the model would try to predict:
  - Hannah is a sister
  - Hannah is a friend
  - Hannah is a marketer
  - Hannah is a comedian
- Masked language modeling: the model is given a sequence of words with the goal of predicting a masked word in the middle. For example, given the phrase "Jacob [MASK] reading", the model would try to fill the gap as:
  - Jacob fears reading
  - Jacob loves reading
  - Jacob enjoys reading
  - Jacob hates reading
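The masking step behind masked-language-modeling can be sketched as follows (illustrative only; BERT's real scheme also sometimes keeps or randomly substitutes the selected words instead of always masking):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """MLM-style corruption sketch: replace a fraction of tokens with
    "[MASK]" and record the originals as prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok          # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

toks = "jacob loves reading books every single night".split()
masked, targets = mask_tokens(toks, mask_prob=0.3)
```

The training loss is then computed only at the masked positions, unlike next-token prediction, where every position contributes a prediction of its successor.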