An influential milestone that popularized the current wave of generative models was GPT-1, a decoder-only Transformer, which was eventually followed by GPT-2 and GPT-3. Since then, the trend among LLMs has been increasingly decoder-focused. Other models that have played significant roles include BERT, Transformer-XL, ELMo, ULMFiT, and LaMDA. T5 has been one of the few notable exceptions, using an encoder-decoder architecture. You might wonder why this is the case.
Why Decoders Dominate:
- Autoregressive Generation: Decoders are good at predicting the next item in a sequence. This autoregressive approach is fundamental to generating coherent, novel content (see the sketch after this list).
- Sequential Processing: Decoders process information step by step, building on previously generated tokens, which lets them capture long-range dependencies and produce complex, structured outputs.
- Task-Specific Optimization: When optimized for generation, they can handle the complexities of grammar, semantics, and context.
- Simplified Training: Decoder-only models are simpler to train: they only need to learn the conditional probability distribution of the next token given the previous ones, with no separate encoding step.
- Focus on Sequence-to-Sequence Tasks: Early successes on sequence-to-sequence tasks showed that, for generation, the decoder was both the most important component and the most computationally expensive one.
- Efficiency: For tasks that mainly require generating new content, a decoder-only model can achieve comparable or better results at lower computational cost.
- Attention Is All You Need: The Transformer decoder already relied on self-attention, so adapting it to a decoder-only setup with masked (causal) self-attention was straightforward and highly effective.
- Scalability: Decoder-only Transformers scale well, paving the way for very large models that generate highly coherent and creative text.
- Computational Cost: Training encoder-decoder models is more expensive and often prohibitive, especially for organizations with limited budgets.
- Performance Gains: For generative tasks, adding an encoder step did not yield performance gains substantial enough to justify the additional computational cost.
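To make the autoregressive objective and causal self-attention concrete, here is a minimal sketch of a decoder-only Transformer in PyTorch. The model name (TinyDecoder), the layer sizes, and the greedy decoding loop are illustrative assumptions, not any particular GPT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """Illustrative decoder-only Transformer: causal self-attention, no encoder."""
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        # "Decoder-only" here means self-attention with a causal mask and no
        # cross-attention, so stacked self-attention blocks plus the mask suffice.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=causal_mask)   # each position attends only to earlier positions
        return self.lm_head(x)                 # logits over the vocabulary

model = TinyDecoder()
tokens = torch.randint(0, 1000, (2, 16))       # a toy batch of token ids

# Training objective: cross-entropy of the next token given all previous ones.
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

# Autoregressive (greedy) generation: append one predicted token at a time.
model.eval()
generated = tokens[:, :4]
with torch.no_grad():
    for _ in range(8):
        next_logits = model(generated)[:, -1, :]
        next_token = next_logits.argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
```

The same forward pass serves both training (next-token cross-entropy) and generation (appending one predicted token at a time), which is part of why decoder-only models are simple to train and deploy.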
The Role of Encoders: While decoders excel at generation, encoders are used to understand and represent inputs.
- Contextual Understanding: Encoders provide rich representations that capture meaning and context, which makes the decoder's generated output more relevant to the input.
- Feature Extraction: Encoders extract key features that the decoder can then use to generate context-specific output (see the sketch after this list).
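As a concrete illustration of feature extraction, the following sketch uses a pretrained BERT encoder to turn text into contextual embeddings. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Encoders build contextual representations.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual embedding per input token; a downstream component
# (a classifier, a retrieval index, or a decoder) can consume these as features.
token_features = outputs.last_hidden_state      # shape: (1, seq_len, 768)
sentence_feature = token_features.mean(dim=1)   # simple pooled representation
```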
Gemini and T5: These models combine encoder and decoder components to balance context understanding with generation.
- T5: This model casts every task as text-to-text in a unified architecture: the encoder reads the input text and the decoder generates the output text (see the sketch after this list).
- Gemini: This model supports multiple modalities: encoders provide understanding and representation of inputs across modalities, and the decoder generates context-specific output in a target modality.
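As an illustration of T5's text-to-text interface, here is a minimal sketch assuming the Hugging Face transformers library and the public t5-small checkpoint; the translation prompt follows T5's task-prefix convention.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Both the task and its input are expressed as plain text; the encoder reads
# the prompt, and the decoder generates the output text token by token.
prompt = "translate English to German: The house is wonderful."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```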
GPT Recap
- GPT-1 (2018): demonstrated the potential of the decoder-only Transformer architecture with 117 million parameters
- GPT-2 (2019): scaled up to 1.5 billion parameters; this increase brought notable improvements in text-generation quality and coherence, and also raised concerns about misuse
- GPT-3 (2020): a massive step forward with 175 billion parameters; the scale-up delivered stronger performance and wider coverage of NLP tasks, including text generation, translation, and few-shot learning
Throughout this progression, training techniques and model architectures were steadily refined. This effort shaped the advancement of LLMs and the trend toward more versatile models.