MLSys Class LLM Introduction


Introduction to

Language Models
Eve Fleisig & Kayo Yin
CS 294-162
August 28, 2023
Language Modeling

Image credit: jalammar.github.io/illustrated-word2vec/
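
A language model assigns probabilities to text by predicting each token from the tokens that precede it. A minimal sketch of the idea with a count-based bigram model, assuming a toy whitespace-tokenized corpus (the corpus and smoothing constant are illustrative, not from the slides):

from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

# Count unigrams and adjacent word pairs.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_next(word, prev, alpha=1.0):
    # P(word | prev) with add-alpha smoothing so unseen pairs get nonzero mass.
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))

# Probability of a sentence = product of the conditional probabilities.
sentence = "the cat sat".split()
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= p_next(word, prev)
print(prob)

Neural language models replace the count table with a learned network, but the chain-rule factorization is the same.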


Masked Language Modeling
BERT

Image credit: jalammar.github.io/illustrated-bert/
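
BERT's pretraining objective masks a fraction of input tokens (15% in the original paper) and predicts each masked token from context on both sides. A minimal PyTorch sketch of the masking and loss, assuming a model that maps token ids to per-position vocabulary logits (the model itself and the mask-token id are placeholders):

import torch
import torch.nn.functional as F

MASK_ID = 103      # [MASK] id in the standard BERT vocabulary
MASK_PROB = 0.15

def masked_lm_loss(model, input_ids):
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < MASK_PROB   # choose ~15% of positions
    labels[~mask] = -100               # unmasked positions are ignored by the loss
    corrupted = input_ids.clone()
    corrupted[mask] = MASK_ID          # (BERT also sometimes keeps or randomizes tokens)
    logits = model(corrupted)          # (batch, seq, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)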


Causal Language Modeling
GPT

Image credit: jalammar.github.io/illustrated-gpt2/
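
GPT's objective is next-token prediction: each position may attend only to tokens on its left, and the target sequence is the input shifted by one. A sketch of the loss, assuming a model that applies the causal mask internally:

import torch.nn.functional as F

def causal_lm_loss(model, input_ids):
    # Predict token t+1 from tokens <= t.
    logits = model(input_ids[:, :-1])    # (batch, seq-1, vocab)
    targets = input_ids[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))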


BERT vs. GPT

● Bidirectional encoder models (BERT) do better than generative models at non-generation tasks, for comparable training data and model complexity.

● Generative models (GPT) have training-efficiency and scalability advantages that may ultimately make them more accurate. They can also solve downstream tasks in a zero-shot setting.
Transformer

Image credit: jalammar.github.io/illustrated-transformer/




Attention
Self-Attention

Image credit: jalammar.github.io/illustrated-gpt2/
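
In self-attention, every token projects a query, key, and value vector; attention weights are softmax(QKᵀ/√d), and each output is the attention-weighted sum of the values. A single-head sketch in PyTorch, with an optional causal mask for GPT-style decoding (projection matrices are passed in for brevity):

import math
import torch

def self_attention(x, w_q, w_k, w_v, causal=False):
    # x: (seq, d_model); w_*: (d_model, d_head) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(q.size(-1))          # (seq, seq)
    if causal:
        # Block attention to future positions.
        future = torch.triu(torch.ones(scores.shape, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    weights = scores.softmax(dim=-1)                  # each row sums to 1
    return weights @ v                                # (seq, d_head)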




Multi-headed Attention
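
Multi-headed attention runs several attention operations in parallel over lower-dimensional projections and concatenates the results, letting different heads track different relationships. A sketch, assuming d_model divides evenly by n_heads:

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)    # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                             # x: (batch, seq, d_model)
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, n_heads, seq, d_head).
        q, k, v = (t.reshape(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = scores.softmax(dim=-1) @ v              # (batch, n_heads, seq, d_head)
        return self.out(out.transpose(1, 2).reshape(b, s, d))
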
Transformer

Image credit: jalammar.github.io/illustrated-transformer/


Transformer Input
Transformer Encoder

Image credit: jalammar.github.io/illustrated-transformer/
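
The encoder input is token embeddings plus positional information; each encoder layer then applies self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. A post-norm sketch (as in the original Transformer), reusing the MultiHeadAttention sketch above:

import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)   # from the sketch above
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))    # residual + layer norm around attention
        return self.norm2(x + self.ff(x))   # residual + layer norm around feed-forward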


Adding the Decoder

Image credit: jalammar.github.io/illustrated-transformer/
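
The decoder adds two pieces per layer: masked (causal) self-attention over the output generated so far, and cross-attention whose queries come from the decoder while keys and values come from the encoder output. A sketch using PyTorch's built-in nn.MultiheadAttention for brevity:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out):
        s = x.size(1)
        causal = torch.triu(torch.ones(s, s, dtype=torch.bool, device=x.device), 1)
        a, _ = self.self_attn(x, x, x, attn_mask=causal)   # see only earlier outputs
        x = self.norms[0](x + a)
        a, _ = self.cross_attn(x, enc_out, enc_out)        # attend to encoder states
        x = self.norms[1](x + a)
        return self.norms[2](x + self.ff(x))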


BERT

Image credit: jalammar.github.io/illustrated-bert/


GPT
T5: Text-to-Text Transfer Transformer
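
T5 casts every task as text in, text out, with a short prefix telling the model which task to perform. Example input → target pairs, paraphrased from the T5 paper:

"translate English to German: That is good."  →  "Das ist gut."
"cola sentence: The course is jumping well."  →  "not acceptable"
"summarize: state authorities dispatched emergency crews tuesday to survey the damage after …"  →  "six people hospitalized after a storm in attala county"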


Pretraining & Fine-tuning

● Pretraining: unsupervised objective on raw text
● Fine-tuning: supervised objective on labeled task data
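
In the pretrain/fine-tune recipe, the same weights are first trained with the unsupervised objective on raw text, then continue training on a small labeled dataset with a new task head. A minimal sketch of the fine-tuning stage, assuming a pretrained encoder whose first output position serves as the sentence representation (all names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self, encoder, d_model, n_classes):
        super().__init__()
        self.encoder = encoder                     # pretrained weights, not frozen
        self.head = nn.Linear(d_model, n_classes)  # new, randomly initialized

    def forward(self, input_ids):
        h = self.encoder(input_ids)    # (batch, seq, d_model)
        return self.head(h[:, 0])      # classify from the first ([CLS]) position

# Fine-tuning loop: small learning rate, supervised cross-entropy loss.
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# loss = F.cross_entropy(model(batch_ids), batch_labels)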
Prefixes & Prompting
Few- & Zero-Shot Learning
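
In-context learning puts a task description, and optionally a few demonstrations, directly in the prompt; no gradients are updated. Illustrative prompts in the style of the GPT-3 paper:

Zero-shot:
    Translate English to French:
    cheese =>

Few-shot:
    Translate English to French:
    sea otter => loutre de mer
    plush giraffe => girafe peluche
    cheese =>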

Generalization to new tasks without fine-tuning is enabled by scaling data and compute.

Scaling Data
C4 dataset (Colossal Clean Crawled Corpus, filtered from Common Crawl): introduced with T5; still in use
GPT-3 Training Data (Brown et al., 2020):

Dataset                  Tokens   Weight in training mix
Common Crawl (filtered)  410B     60%
WebText2                 19B      22%
Books1                   12B      8%
Books2                   55B      8%
Wikipedia                3B       3%
Scaling Data & Compute

Kaplan et al., 2020; Hoffmann et al., 2022
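
Kaplan et al. fit power laws relating loss to parameters, data, and compute; Hoffmann et al. (Chinchilla) re-fit them and found that a compute-optimal model should be trained on roughly 20 tokens per parameter, with parameters and data scaled together. A back-of-the-envelope sketch using the common C ≈ 6·N·D approximation for training FLOPs (the 20:1 ratio is the rounded Chinchilla rule of thumb):

def chinchilla_optimal(flops, tokens_per_param=20):
    # C ~ 6 * N * D with D ~ 20 * N at the optimum gives C ~ 120 * N^2,
    # so N = sqrt(C / 120) and D = 20 * N.
    n_params = (flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# e.g. a 1e23-FLOP budget suggests ~29B parameters and ~580B training tokens.
print(chinchilla_optimal(1e23))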
Reinforcement Learning from Human Feedback
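
RLHF proceeds in stages: collect human preference comparisons between model outputs, train a reward model on them, then optimize the language model against that reward (e.g., with PPO) while penalizing drift from the original model. A sketch of the reward-model step's pairwise loss, in the Bradley-Terry form used by InstructGPT (the reward model itself is a placeholder):

import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # scalar reward per response, (batch,)
    r_rejected = reward_model(rejected_ids)  # the human-preferred one should score higher
    # -log sigmoid(r_chosen - r_rejected): pairwise preference loss.
    return -F.logsigmoid(r_chosen - r_rejected).mean()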
Discussion
● What are the advantages and disadvantages of different training or tuning methods
that have been tried (task-specific training, pretrain/fine-tune, prompting, RLHF)?
● What is the role of systems research in scaling up LLMs? How could advances in
systems research change scaling “laws”?
● What security issues do we need to consider when deploying LLMs in the real world?
● How can we improve the energy efficiency and carbon footprint of LLMs?
