31 January 2025

Generative Models Focus On Decoder Architecture

The milestone that popularized the current wave of generative models was GPT-1, built on a decoder-only architecture and eventually followed by GPT-3. Since then, the trend has been for LLMs to be decoder-focused. Other models that have played significant roles include BERT, Transformer-XL, ELMo, ULMFiT, and LaMDA. T5 has been one of the few notable exceptions, using an encoder-decoder architecture. You might wonder why this is the case.

Why Decoders Dominate:
  • Autoregressive Generation: Decoders are good at predicting the next item in a sequence. This autoregressive approach, generating one token at a time conditioned on everything produced so far, is fundamental to coherent and novel content (see the sketch after this list).
  • Sequential Processing: Decoders process information step by step, building on what they have already generated, which lets them capture long-range dependencies and produce complex, structured outputs.
  • Task-Specific Optimization: Trained directly on the generation objective, they learn to handle the complexities of grammar, semantics, and context.
  • Simplified Training: Decoder-only models are simpler to train because they only need to learn the conditional probability of the next token given the previous ones; there is no separate encoding step.
  • Focus on Sequence-to-Sequence Tasks: Early successes on sequence-to-sequence tasks showed that, for generation, the decoder was both the most important component and the most computationally expensive one.
  • Efficiency: For tasks that mainly require generating new content, a decoder-only model can achieve comparable or better results at a lower computational cost.
  • Attention is All You Need: The Transformer decoder already relied on masked self-attention, so adapting it to a decoder-only setup was straightforward and highly effective.
  • Scalability: Decoder-only Transformers scale well, paving the way for very large models that generate highly coherent and creative text.
  • Computational Cost: Training encoder-decoder models is more expensive and often prohibitive, especially for organizations on limited budgets.
  • Performance Gains: For generative tasks, adding an encoder step did not bring gains substantial enough to justify the additional computational cost.
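
To make the autoregressive and masked self-attention points concrete, here is a minimal sketch in PyTorch of a decoder-only stack: a causal mask restricts each position to attending over earlier positions, and a greedy loop feeds each predicted token back in as input. The ToyDecoder and greedy_generate names, and all hyperparameters, are illustrative rather than taken from any real model.

  import torch
  import torch.nn as nn

  class ToyDecoder(nn.Module):
      """A tiny decoder-only Transformer: token + position embeddings,
      masked self-attention blocks, and a linear head over the vocabulary."""
      def __init__(self, vocab_size=100, d_model=64, n_heads=4, n_layers=2, max_len=128):
          super().__init__()
          self.tok = nn.Embedding(vocab_size, d_model)
          self.pos = nn.Embedding(max_len, d_model)
          block = nn.TransformerEncoderLayer(d_model, n_heads,
                                             dim_feedforward=128, batch_first=True)
          # Encoder layers plus a causal mask behave like a decoder-only stack.
          self.blocks = nn.TransformerEncoder(block, n_layers)
          self.head = nn.Linear(d_model, vocab_size)

      def forward(self, ids):
          t = ids.size(1)
          x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
          # Causal mask: -inf above the diagonal, so position i only sees positions <= i.
          causal = torch.triu(torch.full((t, t), float("-inf"), device=ids.device),
                              diagonal=1)
          return self.head(self.blocks(x, mask=causal))    # logits per position

  @torch.no_grad()
  def greedy_generate(model, ids, n_new=10):
      for _ in range(n_new):
          logits = model(ids)                              # (batch, seq, vocab)
          next_id = logits[:, -1].argmax(-1, keepdim=True) # most likely next token
          ids = torch.cat([ids, next_id], dim=1)           # feed it back in: autoregression
      return ids

  model = ToyDecoder()
  print(greedy_generate(model, torch.tensor([[1, 2, 3]])))

Training such a model only requires shifting the input by one position to obtain the targets, which is the "simplified training" advantage noted above.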

The Role of Encoders: While decoders handle generation, encoders handle understanding and representation of the input.
  • Contextual Understanding: Encoders build rich representations that capture meaning and context, which helps the decoder generate more relevant output.
  • Feature Extraction: Encoders extract key features which the decoder can then use to generate context-specific output (see the sketch below).
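
As an illustration of an encoder producing contextual representations, here is a small sketch using the Hugging Face transformers library with the public bert-base-uncased checkpoint; the example sentence and the mean pooling are illustrative choices, not a prescribed recipe.

  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  encoder = AutoModel.from_pretrained("bert-base-uncased")

  inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
  with torch.no_grad():
      outputs = encoder(**inputs)

  # One contextual vector per token: "bank" here is represented differently
  # than it would be in "the river bank". Downstream components (classifiers,
  # or a decoder via cross-attention) consume these vectors.
  token_vectors = outputs.last_hidden_state      # shape: (1, seq_len, 768)
  sentence_vector = token_vectors.mean(dim=1)    # simple mean pooling
  print(token_vectors.shape, sentence_vector.shape)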

Gemini and T5: These models use both encoder and decoder components, balancing context understanding with generation.
  • T5: This model casts both input and output as text in a unified architecture, where the encoder understands the input and the decoder generates the output (see the sketch after this list).
  • Gemini: This model supports multiple modalities, where encoding components provide understanding and representation across the various input modalities and the model generates context-specific output in a target modality.
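
A brief sketch of T5's text-to-text pattern, again using the Hugging Face transformers library with the public t5-small checkpoint; the prompt wording and generation settings are illustrative.

  from transformers import T5Tokenizer, T5ForConditionalGeneration

  tokenizer = T5Tokenizer.from_pretrained("t5-small")
  model = T5ForConditionalGeneration.from_pretrained("t5-small")

  # T5 frames every task as text in, text out; the task is named in the prompt.
  input_ids = tokenizer("translate English to German: The house is wonderful.",
                        return_tensors="pt").input_ids

  # Internally: the encoder reads the whole input bidirectionally, then the
  # decoder generates output tokens autoregressively while attending back to
  # the encoder's representations (cross-attention).
  output_ids = model.generate(input_ids, max_new_tokens=20)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))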

GPT Recap 
  • GPT-1 (2018): demonstrated the potential of the decoder-only Transformer architecture with 117 million parameters
  • GPT-2 (2019): scaled up to 1.5 billion parameters; the increase led to improvements in text-generation quality and coherence, and raised concerns about misuse
  • GPT-3 (2020): a massive step forward with 175 billion parameters; the scale-up achieved greater performance and wider coverage of NLP tasks, including text generation, translation, and few-shot learning (see the prompt sketch below)
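
As an illustration of the few-shot prompting pattern that GPT-3 popularized, here is a sketch that uses the openly available GPT-2 as a stand-in (its completions will be far weaker than GPT-3's, and the prompt text is illustrative): the task demonstrations live entirely in the context, and the decoder-only model simply continues the pattern.

  from transformers import pipeline

  generator = pipeline("text-generation", model="gpt2")

  # Few-shot prompt: a handful of worked examples, then an unfinished one.
  prompt = (
      "Translate English to French.\n"
      "sea otter => loutre de mer\n"
      "cheese => fromage\n"
      "peppermint => menthe poivree\n"
      "plush giraffe =>"
  )
  print(generator(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"])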

Throughout this progression, training techniques and model architectures kept improving. That effort shaped the advancement of LLMs and the trend toward more versatile models.