An influential milestone that popularized the current wave of generative models was GPT-1, a decoder-only Transformer, which was eventually followed by GPT-2 and GPT-3. Since then, the trend among LLMs has been increasingly decoder-focused. Other models that have played significant roles include BERT, Transformer-XL, ELMo, ULMFiT, and LaMDA. T5 has been one of the few notable exceptions, using an encoder-decoder architecture. You might wonder why this is the case.
Why Decoders Dominate:
- Autoregressive Generation: Decoders are good at predicting the next item in a sequence. This autoregressive approach is fundamental to generating coherent, novel content (see the sketch after this list).
- Sequential Processing: Decoders process information step by step, building on previously generated tokens, which lets them capture long-range dependencies and produce complex, structured outputs.
- Task-Specific Optimization: When optimized for generation, they can handle the complexities of grammar, semantics, and context.
- Simplified Training: Decoder-only models are simpler to train: they only need to learn the conditional probability distribution of the next token given the previous ones, with no separate encoding step.
- Focus on Sequence-to-Sequence Tasks: Early successes on sequence-to-sequence tasks showed that, for generation, the decoder was both the most important component and the most computationally expensive one.
- Efficiency: For tasks that mainly require generating new content, a decoder-only model can achieve comparable or better results at lower computational cost.
- Attention Is All You Need: The Transformer decoder already relied on self-attention, so adapting it to a decoder-only setup with masked (causal) self-attention was straightforward and highly effective.
- Scalability: Decoder-only Transformers scale well, paving the way for very large models that generate highly coherent and creative text.
- Computational Cost: Training encoder-decoder models is more expensive and often prohibitive, especially for organizations with limited budgets.
- Performance Gains: For generative tasks, adding an encoder step did not yield performance gains substantial enough to justify the additional computational cost.
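To make the autoregressive objective and causal self-attention concrete, here is a minimal sketch of a decoder-only Transformer in PyTorch. The model name (TinyDecoder), the layer sizes, and the greedy decoding loop are illustrative assumptions, not any particular GPT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """Illustrative decoder-only Transformer: causal self-attention, no encoder."""
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        # "Decoder-only" here means self-attention with a causal mask and no
        # cross-attention, so stacked self-attention blocks plus the mask suffice.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=causal_mask)   # each position attends only to earlier positions
        return self.lm_head(x)                 # logits over the vocabulary

model = TinyDecoder()
tokens = torch.randint(0, 1000, (2, 16))       # a toy batch of token ids

# Training objective: cross-entropy of the next token given all previous ones.
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

# Autoregressive (greedy) generation: append one predicted token at a time.
model.eval()
generated = tokens[:, :4]
with torch.no_grad():
    for _ in range(8):
        next_logits = model(generated)[:, -1, :]
        next_token = next_logits.argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
```

The same forward pass serves both training (next-token cross-entropy) and generation (appending one predicted token at a time), which is part of why decoder-only models are simple to train and deploy.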
The Role of Encoders: While decoders excel at generation, encoders are used to understand and represent inputs.
- Contextual Understanding: Encoders provide rich representations that capture meaning and context, which makes the decoder's generated output more relevant to the input.
- Feature Extraction: Encoders extract key features that the decoder can then use to generate context-specific output (see the sketch after this list).
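As a concrete illustration of feature extraction, the following sketch uses a pretrained BERT encoder to turn text into contextual embeddings. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Encoders build contextual representations.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual embedding per input token; a downstream component
# (a classifier, a retrieval index, or a decoder) can consume these as features.
token_features = outputs.last_hidden_state      # shape: (1, seq_len, 768)
sentence_feature = token_features.mean(dim=1)   # simple pooled representation
```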
Gemini and T5: These models combine encoder and decoder components to balance context understanding with generation.
- T5: This model casts every task as text-to-text in a unified architecture: the encoder reads the input text and the decoder generates the output text (see the sketch after this list).
- Gemini: This model supports multiple modalities: encoders provide understanding and representation of inputs across modalities, and the decoder generates context-specific output in a target modality.
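As an illustration of T5's text-to-text interface, here is a minimal sketch assuming the Hugging Face transformers library and the public t5-small checkpoint; the translation prompt follows T5's task-prefix convention.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Both the task and its input are expressed as plain text; the encoder reads
# the prompt, and the decoder generates the output text token by token.
prompt = "translate English to German: The house is wonderful."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```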
GPT Recap
- GPT-1 (2018): demonstrated the potential of the decoder-only Transformer architecture with 117 million parameters
- GPT-2 (2019): scaled up to 1.5 billion parameters; this increase brought notable improvements in text-generation quality and coherence, and also raised concerns about misuse
- GPT-3 (2020): a massive step forward with 175 billion parameters; the scale-up delivered stronger performance and wider coverage of NLP tasks, including text generation, translation, and few-shot learning
Throughout this progression, training techniques and model architectures were steadily refined. This effort shaped the advancement of LLMs and the trend toward more versatile models.