
Attention in Transformers: Concepts and Code in PyTorch

The attention mechanism was a breakthrough that led to transformers, the architecture powering large language models like ChatGPT. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., transformers took off because of their highly scalable design.

In this course, you'll learn how the attention mechanism, a key element of transformer-based LLMs, works, and you'll implement it in PyTorch. You'll develop deep intuition for building reliable, functional, and scalable AI applications.

What you will do

  • Understand the evolution of the attention mechanism, a key breakthrough that led to transformers.
  • Learn the relationships between word embeddings, positional embeddings, and attention.
  • Learn about the Query, Key, and Value matrices, and how to produce and use them in attention.
  • Walk through the math required to calculate self-attention and masked self-attention, and learn why and how they work (the core formula is shown after this list).
  • Understand the difference between self-attention and masked self-attention: self-attention is used in the encoder to build context-aware embeddings, while masked self-attention is used in the decoder to generate output.
  • Learn the details of the encoder-decoder architecture, cross-attention, and multi-head attention, and how they all come together in a transformer.
  • Use PyTorch to code a class that implements self-attention, masked self-attention, and multi-head attention (minimal sketches follow this list).
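
For orientation, the scaled dot-product attention that the course derives is the formula from "Attention Is All You Need", where Q, K, and V are the Query, Key, and Value matrices and $d_k$ is the dimension of the key vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Dividing by $\sqrt{d_k}$ keeps the dot products in a range where the softmax produces useful gradients.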
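
To make the list above concrete, here is a minimal single-head sketch of self-attention and masked self-attention in PyTorch. It is an illustration under simplifying assumptions, not the course's own code; the class name SelfAttention and the causal flag are choices made for this example.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    # Minimal single-head self-attention with an optional causal mask.
    # Illustrative sketch only; the course's own class may differ.
    def __init__(self, d_model):
        super().__init__()
        # Learned projections that produce the Query, Key, and Value matrices.
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.d_model = d_model

    def forward(self, x, causal=False):
        # x: (batch, seq_len, d_model) token embeddings (plus positional encodings)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # Scaled dot-product scores between every query and every key.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        if causal:
            # Masked self-attention (decoder style): block attention to
            # future positions so generation cannot look ahead.
            seq_len = x.size(1)
            future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                           device=x.device), diagonal=1)
            scores = scores.masked_fill(future, float("-inf"))
        weights = F.softmax(scores, dim=-1)  # each row sums to 1
        return weights @ v                   # context-aware embeddings

# Usage: 2 sequences of 5 tokens, 16-dimensional embeddings.
attn = SelfAttention(d_model=16)
x = torch.randn(2, 5, 16)
enc = attn(x)               # self-attention, as in an encoder
dec = attn(x, causal=True)  # masked self-attention, as in a decoder
```

Passing causal=True gives the decoder-style behavior: each position can attend only to itself and to earlier positions.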
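
Multi-head attention runs several such attention computations in parallel, each on its own slice of the embedding, and recombines the results with an output projection. Again a minimal sketch with illustrative names (MultiHeadAttention, W_o), assuming num_heads divides d_model:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    # Minimal multi-head self-attention sketch; num_heads must divide d_model.
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # recombines heads

    def forward(self, x):
        batch, seq_len, d_model = x.shape

        def split_heads(t):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.W_q(x))
        k = split_heads(self.W_k(x))
        v = split_heads(self.W_v(x))
        # Each head attends independently over its slice of the embedding.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = F.softmax(scores, dim=-1)
        out = weights @ v                                  # (batch, heads, seq, d_head)
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.W_o(out)

mha = MultiHeadAttention(d_model=16, num_heads=4)
y = mha(torch.randn(2, 5, 16))  # same shape in and out: (2, 5, 16)
```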
