Challenges and Applications of Large Language Models: Desi GN Behavior

Download as pdf or txt
Download as pdf or txt
You are on page 1of 72

Challenges and Applications of Large Language Models

Jean Kaddourα, †, ∗ , Joshua Harrisβ, ∗ , Maximilian Mozesα ,


Herbie Bradleyγ, δ, ϵ , Roberta Raileanuζ , and Robert McHardyη, ∗
α
University College London β UK Health Security Agency γ EleutherAI
δ
University of Cambridge ϵ Stability AI ζ Meta AI Research η InstaDeep

Abstract
Large Language Models (LLMs) went from
non-existent to ubiquitous in the machine learn-
ing discourse within a few years. Due to the Desi gn High Infer ence
Latency, Lim ited
Beh av i or
Unfathom able Datasets, Pr om pt Br ittleness,
arXiv:2307.10169v1 [cs.CL] 19 Jul 2023

Context Length,
fast pace of the field, it is difficult to identify Tokenizer -Reliance,
Fine-Tuning Over head
Hallucinations
M isaligned Behavior ,
Outdated Know ledge
the remaining challenges and already fruitful
Tasks Not
application areas. In this paper, we aim to es- Solvable
By Scale
tablish a systematic set of open problems and
Detecting
application successes so that ML researchers High Pr e-Tr aining Gener ated
Costs Texts, Br ittle
can comprehend the field’s current state more Evaluations
quickly and become productive.
Sci en ce
Contents Evaluations Based on Static
Hum an-Wr itten Gr ound Tr uth,
Lacking Exper im ental Designs,
Lack of Repr oducibility
1 Introduction 1

2 Challenges 2
2.1 Unfathomable Datasets . . . . . . 2 Figure 1: Overview of LLM Challenges. Designing
2.2 Tokenizer-Reliance . . . . . . . . 4 LLMs relates to decisions taken before deployment. Be-
2.3 High Pre-Training Costs . . . . . 6 haviorial challenges occur during deployment. Science
2.4 Fine-Tuning Overhead . . . . . . 10 challenges hinder academic progress.
2.5 High Inference Latency . . . . . . 11
2.6 Limited Context Length . . . . . . 14
3.4 Creative Work . . . . . . . . . . . 39
2.7 Prompt Brittleness . . . . . . . . 17
3.5 Knowledge Work . . . . . . . . . 40
2.8 Hallucinations . . . . . . . . . . . 19
2.9 Misaligned Behavior . . . . . . . 22 3.6 Law . . . . . . . . . . . . . . . . 42
2.10 Outdated Knowledge . . . . . . . 27 3.7 Medicine . . . . . . . . . . . . . 43
2.11 Brittle Evaluations . . . . . . . . 27 3.8 Reasoning . . . . . . . . . . . . . 44
2.12 Evaluations Based on Static, 3.9 Robotics and Embodied Agents . . 45
Human-Written Ground Truth . . 28 3.10 Social Sciences & Psychology . . 46
2.13 Indistinguishability between Gen- 3.11 Synthetic Data Generation . . . . 48
erated and Human-Written Text . 29
4 Related Work 49
2.14 Tasks Not Solvable By Scale . . . 30
2.15 Lacking Experimental Designs . . 31 5 Conclusion 49
2.16 Lack of Reproducibility . . . . . . 33
1 Introduction
3 Applications 34
3.1 Chatbots . . . . . . . . . . . . . . 34 Given the quickly growing plethora of LLM re-
3.2 Computational Biology . . . . . . 36 search papers, we aim to address two questions: (1)
3.3 Computer Programming . . . . . 37 Challenges: What problems remain unresolved?
* Equal
and (2) Applications: Where are LLMs currently
contribution.

{jean.kaddour,robert.mchardy}[email protected], being applied, and how are the challenges con-
[email protected] straining them? For (1), we group the challenges

1
in Fig. 1 into three broader categories “Design”, word sequence repeated 61, 036 times in the train-
“Behavior”, and “Science”. To provide answers ing split. By deduplicating it, they reduce the rate
for (2), we explore the fields of chatbots, compu- of emitted memorizations by 10x. Abbas et al. [6]
tational biology, computer programming, creative introduce SemDeDup, a technique designed to iden-
work, knowledge work, law, medicine, reasoning, tify semantic duplicates that, although perceptually
robotics, and the social sciences. distinct, convey predominantly similar information,
This paper is an opinionated review and assumes such as sentences with analogous structures with
familiarity with LLMs and how they work (we refer certain words replaced by synonyms. After apply-
to more introductory works in Sec. 4). Further, we ing their method to C4, they find that it improves
focus on models trained on text data. We target a over NearDup. Similarly, Kaddour [250] find near-
technical researcher audience and do not discuss duplicates in the Pile [165] by clustering document
political, philosophical, or moral perspectives on embeddings and identifying clusters gathering du-
LLMs. plicates.

2 Challenges
Benchmark Data Contamination occurs when
Challenge the training dataset contains data from or similar
to the evaluation test set. This can lead to inflated
This box highlights a challenge.
performance metrics, as the model can memorize
the test data and simply regurgitate it back during
2.1 Unfathomable Datasets testing.
Scaling the amount of pre-training data has been Finding and removing all training and test data
one of the major drivers to equip LLMs with overlaps is difficult in practice. For example, the
general-purpose capabilities [256]. The size of GPT-3 authors Brown et al. [59] found a code bug
pre-training datasets quickly outgrew the number after training, resulting in only partially removing
of documents most human teams could manually all detected overlaps from the training data. They
quality-check. Instead, most data collection proce- could not afford to retrain the model, so they used it
dures rely on heuristics regarding data sources and with the remaining overlaps and “cleaned” variants
filtering. of the considered benchmarks, with all potentially
In this section, we explore the adverse conse- leaked examples removed. They define overlap-
quences of these heuristics and the reality that many ping examples as examples that share at least 13
model practitioners possess only a nebulous under- consecutive words with any other example in the
standing of the data on which their model has been pre-training set. If an example is shorter than 13
trained. We refer to this issue as follows. words, they consider it overlapping if it shares all
of its words with another example.
Unfathomable Datasets
Similarly, Dodge et al. [125] search for test data
The size of modern pre-training datasets ren- in the web-crawled C4 corpus but measure exact
ders it impractical for any individual to read matches, normalized for capitalization and punctu-
or conduct quality assessments on the en- ation. They find various input-and-label contamina-
compassed documents thoroughly. tions of text generation and knowledge completion
tasks; and input-only contaminations of the GLUE
Near-Duplicates can arise in different forms benchmark. They argue that there are two ways test
and have been reported to degrade model per- data can end up in a snapshot of Common Crawl
formance [294, 200, 250]. Near-duplicates are (the original dump source of C4): either a given
harder to find compared to exact duplicates; fil- test set is built from a web text or uploaded after
tering out of such is a standard step in most data creation. Sainz et al. [472] ask ChatGPT to gener-
collection pipelines, e.g., using the MinHash algo- ate academic benchmark instances, finding that it
rithm [57]. Lee et al. [294] propose the NearDup has memorized multiple ones, including some test
method and find that over 1% of tokens emitted splits. Jacovi et al. [237] propose three strategies to
unprompted from a model are part of a memorized mitigate contamination, including encryption and
sequence of the C4 dataset, e.g., it contains a 61- training exclusion controls.

2
Personally Identifiable Information (PII) such Date Name
Size
Sources Public
GB Tokens∗
as phone numbers and email addresses, have
2014 BookCorpus 5 GB 11 B Novels Yes
been found within pre-training corpora, resulting [684, 36]
in privacy leaks during prompting. Carlini et al. 2019 OSCAR 6.3 T ? Webpages in 166 Yes
[65, 67], Lukas et al. [344] extract PII data by [399] languages
2019 WebText 40 GB ? Webpages No
prompting GPT-2; Kulkarni [283] report how an en- [440]
gineer yields secret API keys by prompting GitHub 12.2020 CC-100 2.5 TB 292 B Webpages in 100 Yes
Copilot. Henderson et al. [195] discuss the avail- [100] Languages
12.2020 The Pile 825 GB 300 B Science, Webpages, Yes
ability of PII in law data across different jurisdic- [165, 41] GitHub Code, Law,
tions and filter it based on the legal norm in the etc.
respective jurisdiction. El-Mhamdi et al. [137] 2020 C4 [443] 745 GB 156 B Webpages Yes

contend that because strong model performance 10.2020 mC4 [631] ? 6.3 T Webpages in 101 Yes
Languages
typically requires memorization of the training 2021 MassiveText 10.5 TB 2.34 T Webpages, Books, No
data [146, 58], the (undetected) existence of PII [441] News, and Code

in the training data will likely result in models that 12.2021 GLaM [130] ? 1.6 T Webpages, No
Wikipedia, Conver-
render them extractable. sations, Forums,
Books, News
01.2022 Infiniset ? 2.81 T Forum dialogs, No
Pre-Training Domain Mixtures Several stud- [551] C4 data, Code,
ies have argued for diversity in the pre-training Wikipedia, Web-
pages
corpus [165, 341, 291]. Many popular corpora fol-
06.2022 ROOTS 1.61 TB 2.34 T Webpages in 46 lan- Yes
low this by concatenating datasets from different [289] guages and GitHub
sources, as illustrated in Table 1. However, it re- Code in 13 lan-
guages
mains underexplored what amount of data from 11.2022 The Stack 6 TB 235 B GitHub Code in 30 Yes
different sources is necessary for strong down- [271] languages

stream performances. Finding suboptimal mix- 04.2023 LLaMA 2.7 TB 1.2 T Webpages, GitHub Yes
[556] / Red- Code, Science,
tures can cause low transferability to downstream Pajama [98] Wikipedia, Books
tasks [593, 580] and reliance on spurious corre- 06.2023 RefinedWeb 2.8 TB 600 B Webpages Yes
[415]
lations [253, 618, 347]. Xie et al. [622] find do-
main mixture proportions by training a small proxy Table 1: Overview of Selected Pre-Training Datasets.
model using group-distributionally robust optimiza- Over the years, pre-training datasets have become more
tion [471]; surprisingly, they find that the final unfathomable: they grew rapidly in size and diversity,
model trained using their found domain weights and not all datasets are publicly available (we do not
yields improved perplexity across all domains, even include datasets that have very little or no information
when it down-weights a domain. Given a tar- available about them). Unless stated otherwise, the
get downstream task, Yao et al. [641], Xie et al. natural language is in English. ∗ We report the number
of tokens as provided by the respective paper based on
[624] select subsets most useful for pre-training.
their proposed tokenization scheme.
Longpre et al. [341] measure the effects of domain
compositions and find that inclusion of heteroge-
neous data sources is broadly beneficial and likely
more important than the data quality (as measured For example, instruction fine-tuning via task in-
by the document quality classifier employed by structions prepended to each set of input-output
PaLM [86] and GLaM [130]) or size, which also pairs is a very popular scheme, which we will later
motivates smaller yet more diverse pre-training discuss in more detail in Sec. 2.9. Wang et al. [589]
datasets [250]. propose Super-NaturalInstructions, a
fine-tuning dataset with 1,616 diverse tasks and
Fine-Tuning Task Mixtures have to be deter- expert-written instructions. Muennighoff et al.
mined for fine-tuning a pre-trained model on many [377] extend MTLM to the multilingual setting,
different tasks, usually with comparatively few ex- showing that fine-tuning on multilingual tasks with
amples per task. This technique, which we call English prompts improves results on tasks in all
multitask-prompted fine-tuned LMs (MTLMs), has languages.
demonstrated significant generalization improve- However, similar to the previous paragraph, how
ments with very little additional training compute. to balance the task datasets well remains unclear.

3
As the tasks can vary in size considerably, Raf- essary to convey the same information varies
fel et al. [443] mix each task in proportion to the significantly across languages, making the pric-
number of examples in its ’train’ split (up to some ing policy of API language models, which charge
max_num_examples). Jang et al. [239] report users based on the number of processed or gen-
that MTLMs can underperform expert LLMs fine- erated tokens, potentially unfair. They find that
tuned on only a single task because of (i) nega- users of many supported languages are overcharged
tive task transfer, where learning multiple tasks at while receiving subpar results, with this group pre-
once hinders the learning of some specific tasks, dominantly residing in areas where these APIs are
and (ii) catastrophic forgetting of previous tasks already less affordable.
when learning new tasks. Iyer et al. [235] study Further, discrepancies between the data that
varying task (sets) proportions, finding several a tokenizer and a model have been trained on
trade-offs and concluding that the right values for can lead to glitch tokens [465], which can sub-
these parameters depend on the downstream end- sequently cause unexpected model behavior as
goals. Longpre et al. [340] balance different sets of their corresponding embeddings are essentially un-
task sources by omitting them, one at a time, and trained. This coupling between the tokenizer and
ranking their contributions on the MMLU bench- pre-training corpus creates the burden of a new
mark [197]; further, they mix the input prompt training run of the tokenizer each time the pre-
templates of zero- and few-shot prompting; find- training corpus is modified.
ing that this improves the performance in both set- Next, Tokenization schemes that work well in a
tings. Another trend is to imitate closed-source multilingual setting, particularly with non-space-
models like ChatGPT by collecting a dataset of separated languages such as Chinese or Japanese,
API outputs (against OpenAI’s terms and condi- remain challenging [157, 91].
tions) and fine-tuning an open-source LM with Existing subword tokenization schemes are pre-
it [540]. However, Gudibande et al. [180] point dominantly greedy algorithms trying to encode
out that such imitation models are only good at language as efficiently as possible regarding the
mimicking the proprietary model’s style but not number of tokens used. Naturally, these methods
its content, a distinction that has been discussed favor subwords comprising larger parts of the train-
extensively in the causality literature [253]. They ing data and, therefore, subwords that are shared
conclude that substantial capability gaps between across many languages. This favors languages
fine-tuned open-sourced and closed-source models with shared scripts like Latin and Cyrillic, result-
remain, motivating future work for better imitation ing in suboptimal tokenization of low-resource lan-
data. guages [92, 676].

2.2 Tokenizer-Reliance Tokenizer-Reliance

Tokenization is the process of breaking a sequence Tokenizers introduce several challenges,


of words or characters into smaller units called e.g., computational overhead, language de-
tokens, such that they can be fed into the model. pendence, handling of novel words, fixed
One common tokenization approach is subword to- vocabulary size, information loss, and low
kenization, where we split words into smaller units, human interpretability.
called subwords or WordPieces [490]. The goal
is to handle rare and out-of-vocabulary words in Subword-Level Inputs are the dominant
a model’s vocabulary effectively while maintain- paradigm, providing a good trade-off between
ing a limited number of tokens per sequence in the vocabulary size and sequence length. Byte-Pair
interest of computational complexity. Subword to- Encoding [490, 577] (BPE) starts with the set
kenizers are usually trained unsupervised to build of symbols (characters or bytes) that comprise
a vocabulary and optionally merge rules to encode the training data. The tokenizer is then trained
the training data efficiently. to learn rules to merge the most frequent pair
However, the necessity of tokenization comes of two consecutive tokens—defined by the
with multiple drawbacks [257]; some of which we existing vocabulary—into a new vocabulary item.
discuss below. For example, Ahia et al. [13], Petrov Byte-level BPE (BBPE) [577] is an extension
et al. [426] show that the number of tokens nec- of BPE with byte-level subwords, particularly

4
(1) Tokenizer Training Costs (2) Arch. depends on Vocabulary
Training Sequences Vocabulary Embedding Transformer Softmax over
Matrix Blocks Vocabulary
English where where
as as
where as token ##ization token token
##ization ##ization

Tokenization can sometimes lead to a loss of to to


to a for loss lead are for for
information. For example, in languages where
loss loss
word boundaries are not clearly defined, such
lead lead
as Chinese. … boundaries chinese example … are are
boundaries boundaries
chinese chinese
Chinese 中 ⾔ 息 致 定 信 example example
中 中
⾔ ⾔

MHA

MHA
FFN

FFN
息 息
標 多 時 單 合 致 致
標記化有時會導致信息丟失。 例如,在單
詞邊界沒有明確定義的語⾔中,例如中⽂,

信 … 定

或者在具有許多複合詞的複雜語⾔中,...... 會 明 導 界 義 許 … 標 標
時 時
單 單
Python in i n def array
合 合
in in
def bubble_sort(array): i i
n = len(array) def def
for i in range(n):
swapped = False
[ ] , _ + - range array array

for j in range(0, n - i - 1): [ [


if array[j] > array[j + 1]: ] ]

….
swap(array[j], array[j + 1])
for if False ) ], 1 sort … … …
<latexit sha1_base64="EUAQA2JFnjQN78tKMAv1XK5WuxQ=">AAACEXicbVC7TsMwFHXKq5RXgJHFokLqVCWI18BQCZAYC6IPqQmV4zqtVceJbAepSvMLLPwKCwMIsbKx8Tc4bQZoOZKl43Pu1b33eBGjUlnWt1FYWFxaXimultbWNza3zO2dpgxjgUkDhywUbQ9JwignDUUVI+1IEBR4jLS84UXmtx6IkDTkd2oUETdAfU59ipHSUtesOAFSA8+HV9ChHE5/XnKb3ifj5hg6igZEwsu01DXLVtWaAM4TOydlkKPeNb+cXojjgHCFGZKyY1uRchMkFMWMpCUnliRCeIj6pKMpR3qQm0wuSuGBVnrQD4V+XMGJ+rsjQYGUo8DTldnGctbLxP+8Tqz8MzehPIoV4Xg6yI8ZVCHM4oE9KghWbKQJwoLqXSEeIIGw0iFmIdizJ8+T5mHVPqke3xyVa+d5HEWwB/ZBBdjgFNTANaiDBsDgETyDV/BmPBkvxrvxMS0tGHnPLvgD4/MHRwacqQ==</latexit> <latexit sha1_base64="VxULw+Mr90KaxUh2GOiiN/OYQpA=">AAACIXicbVDLTsMwEHR4lvIqcORiUSFxqhLEowcOSHDgWBBtkZpSOe4GLBwnsjeIKuRXuPArXDiAEDfEz+C0PfAaydLszK7WO0EihUHX/XAmJqemZ2ZLc+X5hcWl5crKasvEqebQ5LGM9UXADEihoIkCJVwkGlgUSGgHN0eF374FbUSsznGQQDdiV0qEgjO0Uq9S9yOG10FI29QXio6qIDvLL7PjXuYj3GEWxX2QeU59FBEYet+6p3m5V6m6NXcI+pd4Y1IlYzR6lXe/H/M0AoVcMmM6nptgN2MaBZeQl/3UQML4DbuCjqWK2V3dbHhhTjet0qdhrO1TSIfq94mMRcYMosB2FheY314h/ud1Ugzr3UyoJEVQfLQoTCXFmBZx0b7QwFEOLGFcC/tXyq+ZZhxtqEUI3u+T/5LWds3bq+2e7lQPD8ZxlMg62SBbxCP75JCckAZpEk4eyBN5Ia/Oo/PsvDnvo9YJZzyzRn7A+fwC7oij/A==</latexit>

… E 2 R|V |⇥D W 2 RDmodel ⇥|V |


Figure 2: Exemplary Drawbacks of relying on Tokenization. (1) The tokenizer training step involves non-trivial
computations, e.g., multiple passes over the entire pre-training dataset, and introduces a dependency on it, which
can become especially problematic in multilingual settings. (2) The embedding layer E and output layer W of
LLMs involve the vocabulary size; e.g., making up ≈ 66% of the model’s parameter count in T5 models [629].

suited for multilingual tasks where it enables Byte-Level Inputs are an alternative to subword
vocabulary sharing between languages. A trained tokenization is use byte-level inputs. Byte-level
BPE tokenizer applies the previously learned rules inputs can either be used in combination with sub-
to tokenize inputs. WordPiece [485, 617] is a word tokenizers [577] or used to define a limited
closed-source tokenization algorithm used, e.g., vocabulary that can be used to encode all possi-
in BERT [120]. Like BPE, WordPiece starts with ble sequences. For example, Xue et al. [630]
a small initial vocabulary, which is iteratively train a non-subword mT5 model using UTF-8
extended by learning merge rules and creating new bytes rather than subword tokens as inputs, show-
vocabulary items. Rather than selecting the most ing promising performance on multilingual data.
frequent pair of consecutive tokens, WordPiece While this enables subword-free LLMs, UTF-8 en-
uses a scoring function to normalize the frequency codes Latin languages with fewer bytes than e.g.,
of the pair by the frequencies of the individual Chinese, Japanese or Korean1 . Tay et al. [546] pro-
tokens to prioritize common pairs with rare pose the Charformer, a tokenization-free model
individual tokens. Unigram Tokenization [281] which learns a soft subword tokenization in la-
iteratively trims a large base vocabulary to a given tent space (Gradient-Based Subword Tokenization)
target size. To this end, at each step of the tokenizer given byte-level inputs. Charformer performs com-
training, a unigram language model is used to parably to subword-based models while incurring
compute a loss over the training data conditional less computational overhead than other byte or
on a certain vocabulary item being removed. subword models. Choe et al. [83] train a small-
A proportion of the subwords with the lowest scale, 0.8B language model based on raw byte-
losses are removed to form the base vocabulary level inputs and show that it performs compara-
for the next iteration. Unigram tokenization is bly. On a smaller scale, Clark et al. [94] show that
probabilistic, i.e., during inference, all possible their tokenization- and vocabulary-free encoder Ca-
tokenizations of a given sequence are scored nine outperforms a comparable tokenization-based
using the unigram language model, and the most model. Yu et al. [652] address the computational
likely one is selected. SentencePiece [282] is a cost that byte-level tokenization incurs by segment-
commonly used open-source library, implementing ing input sequences into local patches, which can
several tokenization algorithms such as (B)BPE be processed in parallel. Similarly, Horton et al.
and Unigram tokenization. SentencePiece also [212] propose to operate directly on file bytes. In a
implements non-subword tokenization approaches
like word- and character-level tokenization. 1
https://www.unicode.org/versions/Unicode15.0.0/

5
parallel line of work, Rust et al. [467] render text compute budgets. For example, OpenAI [398] re-
as images and train an encoder model to predict the port that they were able to accurately predict the
raw pixels of the images. model performance of the full-size GPT-4 model
based on the performance of a series of smaller
2.3 High Pre-Training Costs models using at most 10,000x less compute than
The vast majority of the training costs go toward the the full model.
pre-training process. Training a single LLM can The exact power law coefficients are still heav-
require hundreds of thousands of compute hours, ily debated. Kaplan et al. [256] put forward that
which in turn cost millions of dollars and consume the model size should be scaled more aggressively
energy amounts equivalent to that used by several than the dataset size to use a given compute budget
typical US families annually [412, 86, 44]. Re- optimally. Contrary to this, Hoffmann et al. [206]
cently proposed scaling laws [256] posit that model find that many LLMs are undertrained and argue
performances scale as a power law with model size, that the number of parameters and data should be
dataset size, and the amount of compute used for scaled equally. However, power laws sometimes
training, which is fairly unsustainable and can be come in the form of bounds, which can span an
classified as Red AI [487], where state-of-the-art re- order of magnitude difference in the amount of
sults are essentially “bought” by spending massive data to be used given a concrete compute budget
computational resources. For example, depending [665]. Further, the pre-training loss does not al-
on the exact law coefficients, reducing the error ways correlate well with downstream performance
from 3% to 2% can require an order of magnitude [252, 332, 251].
more data or compute [518]. The viewpoint of Touvron et al. [556], Vries
[571], Touvron et al. [557] is that when selecting
Unsustainable Loss Power-Law [256]
a model size, the computation resources for later
Performance increases through larger com- usage (inference) should be considered, not just
pute budgets but at a decreasing rate if the the one-time training costs. They suggest that it
model or dataset size is fixed, reflecting a might be beneficial to train a smaller model more
power law with diminishing returns. intensively upfront to offset larger inference costs
in the future. Hence, they train models of various
In the following, we look at two lines of work sizes on more tokens than are typically used to
aiming at resolving such issues. achieve the best performance possible, given the
model size.
Compute-Optimal Training Recipes [201, 256] One remaining hurdle of performance prediction
In Sec. 2.1, we discussed how the availability is inverse scaling, which we discuss in Sec. 2.14.
of LLM pre-training data has become abundant Since scaling laws were typically constructed in the
through the quickly-spread practice of including context of pre-training and thereby decoupled from
web-crawled text. Further, thanks to the intro- downstream tasks, it remains an open question of
duction of Transformer models [563] and suit- how to predict inverse scaling properties. Tay et al.
able hardware [210], we have scaled models to [544] find that scaling laws can differ in upstream
unprecedented sizes. Assuming that we have not and downstream setups; aside from only the model
yet reached the limits of data [45, 568, 415] nor size, model shape matters for downstream fine-
model sizes [256, 206, 398]; currently, the main tuning.
bottleneck is the amount of compute available [1].
Given a particular budget, how large should the pre- Pre-Training Objectives Various pre-training
training corpus and model be to maximize training objectives (PTO) are suitable for performing self-
efficiency? supervised training of LLMs. The exact choice of
As mentioned at the beginning of this section, PTO heavily influences the model’s data efficiency
one recent proposal is to learn empirical “scaling during pre-training, which in turn can reduce the
laws” [201, 256], which describe the relationship number of iterations required. A PTO typically
between LLM performance and the compute bud- is a function of the (i) architecture, (ii) input/tar-
get, model, and dataset size. These laws can pro- gets construction (e.g., target span length, low/high
vide the right scaling recipe for compute-optimal corruption, see Fig. 4), and (iii) masking strategy
training, ideally, even when extrapolating to larger (Fig. 3). While (i) and (ii) can be disentangled and

6
Masked LM Language Modeling Prefix LM
where the model uses tokens before and after the
y5
target token for predictions, leveraging a more
y4
holistic understanding of its context than the NTP
Targets

y3

y2
objective. Furthermore, we can use each input
y1
sentence to predict multiple masked tokens in a
x1 x2 x3 x4 x5 x1 x2 x3 x4 x5 x1 x2 x3 x4 x5 single pass, while the NTP objective typically
Input Input Input
learns from predicting one token at a time.
Figure 3: Masking Strategies. Each row denotes to Let xMASK denote the set of indices of the
which inputs xi (columns) a particular output yi (row) masked tokens and x¬MASK the unmasked tokens.
can attend to (uni- or bi-directional). The objective of MLM is then to maximize the
likelihood given the parameters θ,

should not be conflated conceptually [545], in prac- 1


L(xMASK |x¬MASK ) =
tice, there exist popular combinations that achieve |xMASK |
good performances. X (2)
· log P (xMASKi |x¬MASK ; θ).
Attending to all tokens, as shown in Fig. 3(left),
i∈xMASK
is the most data-efficient strategy since it uses con-
text from before and after the token to be predicted. Patel et al. [410] show that such models produce
However, for that reason, it is unsuitable for text representations more suitable for transfer learning;
generation [120], since it considers future context however, they come with difficulties in performing
for prediction. We typically employ it in natural in-context learning (Sec. 2.7).
language understanding (NLU) tasks [120], where To further improve the training efficiency of the
it has shown strong results. The next token predic- MLM objective, Bajaj et al. [33] propose to replace
tion objective is most suitable for natural language input tokens with ones generated by an auxiliary
generation (NLG) but also the least data efficient language model (ALM), resulting in a Model gen-
since it only attends to the past context (Fig. 3(mid- erated dEnoising TRaining Objective (METRO).
dle)). More recent advances in pre-training objec- Their approach consists of roughly three compo-
tives aim to find a middle-ground to increase data nents: (i) train an ALM using the MLM objec-
efficiency by providing stronger and more diverse tive, (ii) given some inputs with masked positions,
training signals, e.g., the Prefix LM, which partly predict the tokens (with the ALM), (iii) train the
attends to past tokens, as illustrated in Fig. 3(right) main model to correct these tokens inserted in the
and discussed below. masked positions, i.e., 1) predict whether the ALM
The following discusses the trade-offs between has replaced a token and if so, 2) predict the origi-
some of the recently proposed objectives. Fig. 4 nal token. They train the auxiliary and main model
visually depicts the different pre-training objectives. jointly.
Notation-wise, we denote a sequence of N tokens Prefix Language Modeling [443] generalizes
x as x = x1 , . . . , xN . language modeling by allowing prefix tokens with a
We start with the most basic and still widely- bidirectional receptive field to be added to the input
used Language Modeling [59] (or next token pre- (without prefix, it is equivalent to standard LM).
diction) objective. Here, we learn parameters θ by Note that this is still different from the bidirectional
maximizing the likelihood of the next token given context as in MLM, where we always condition on
the previous tokens, all the tokens before and after the masked ones (see
Fig. 3 left). For computing the hidden states of the
N
X prefix, prefix-LM attends to tokens before and after
L(x) = log P (xi |x1 , . . . , xi−1 ; θ). (1)
(see Fig. 3 right).
i=1
Span Corruption [303, 443, 132] or span de-
Masked Language Modeling (MLM; or noising refers to a group of denoising objectives
Cloze) [549, 120] hides a set proportion of that generalize MLM to denoise contiguous se-
tokens in the sequence by replacing them with a quences of tokens within a given text, called spans.
special [MASK] token. The literature employs The denoising objectives typically replace the sam-
the MLM objective for non-autoregressive, i.e., pled spans with a single unique masking token
non-generative, bidirectional context models, and train the model to fill it in. Raffel et al. [443]

7
Span Corruption Prefix Language Modeling Long Span Corruption
(R-Denoising) (S-Denoising) (one form of X-Denoising)

Inputs Inputs Inputs


Some proponents of AI consciousness subscribe to functionalism, the Some proponents of AI consciousness subscribe to functionalism, the Some proponents of AI consciousness subscribe to functionalism, the

4 by their function than their


view that mental states are defined more view that mental states are defined more by their function than their 12more by their function than their
view that mental states are defined

underlying physical structure. In other words, if an AI can respond to underlying physical structure. In other words, if an AI can respond to underlying physical structure. In other words, if an AI can respond to

inputs and generate outputs similar to a conscious being, then it could be inputs and generate outputs similar to a conscious being, then it could be 13to a conscious being, then it could be
inputs and generate outputs similar

3 doesn't account for subjective


considered conscious. However, this view considered conscious. However, this 56
view doesn't account for subjective considered conscious. However, this view doesn't account for subjective

(qualia), the "what it feels like" aspect of consciousness. The Simulational (qualia), the "what it feels like" aspect of consciousness. The Simulational 14 aspect of consciousness. The Simulational
(qualia), the "what it feels like"

Argument is that some argue that if2an AI can simulate human behavior Argument is that some argue that if an AI can simulate human behavior Argument is that some argue that if an AI can simulate human behavior

Targets Targets Targets


4 3 2 12

13

56 14

Fill In The Middle Meet In The Middle

Inputs Inputs Inputs (Reversed Order)


Some proponents of AI consciousness subscribe to functionalism, the Some proponents of AI consciousness subscribe to functionalism, the behavior human simulate can AI an if that argue some that is Argument

view that mental states are defined more by their function than their view that mental states are defined more by their function than their Simulational The consciousness. of aspect “like feels it what” the (qualia),

M
underlying physical structure. In other words, if an AI can respond to underlying physical structure. In other words, if an AI can respond to experiences subjective for account
ov
e being, then it could be
26 to a conscious
inputs and generate outputs similar inputs and generate outputs similar to a conscious being, then it could be inputs and generate outputs similar to a conscious being, then it could be

considered conscious. However, this view doesn't account for subjective considered conscious. However, this 56
view doesn't account for subjective considered conscious. However, this 52
view doesn't account for subjective

(qualia), the “what it feels like” aspect of consciousness. The Simulational (qualia), the "what it feels like" aspect of consciousness. The Simulational (qualia), the "what it feels like" aspect of consciousness. The Simulational

Argument is that some argue that if an AI can simulate human behavior Argument is that some argue that if an AI can simulate human behavior Argument is that some argue that if an AI can simulate human behavior

Targets Targets Targets

26

56 52

Figure 4: Self-Supervised Data Construction by Pre-Training Objectives, adopted from Tay et al. [545]. We
indicate masked tokens with gray rectangles, which become the targets. For brevity, we omit special tokens.

shows that this can speed up training because span quences with limited context, which we illustrate
corruption produces shorter sequences on average in Fig. 4). The MoD objective has subsequently
compared to corrupting individual tokens in an i.i.d. been shown to improve model performance by con-
manner. tinuing training pre-trained LLMs [443, 86] for
relatively few steps [547].
Mixture of Denoisers [545] (MoD) refers to
Fill In the Middle Bavarian et al. [38] propose
injecting objective diversity by mixing multiple
to augment the next token prediction objective by
denoising objectives. Tay et al. [545] categorize
shuffling tokens within a document such that we
three denoising objectives: {R,S,X}-Denoiser. The
fill in the middle (FIM) based on prefix and suf-
regular denoising corresponds to the previously in-
fix. They demonstrate that models pre-trained on a
troduced span denoising. Specific denoising com-
mixture of FIM-transformed and left-to-right data
prises splitting a given sequence into a prefix act-
result in left-to-right and FIM capability models.
ing as the context and a suffix acting as the target.
In extreme denoising, we corrupt large parts of Meet in the Middle Nguyen et al. [382] extend
the input by either (a) increasing the proportion the FIM objective by enabling bidirectional context
of masked tokens per span or (b) increasing the to construct a denser, more data-efficient supervi-
span length forcing the model to generate long se- sion signal while maintaining the autoregressive

8
nature of the underlying model: They train two times and increases the utilization of computational
decoders—one forward → −p (xi | x<i ; θ) and one resources.
backward language model ← −
p (xi | x<i ; θ)—with These issues have motivated asynchronous paral-
shared parameters θ. Additionally, they add an lelization schemes. Recht et al. [453] present Hog-
agreement regularize to the loss, encouraging the wild!, which greedily applies gradients to the local
forward and backward model to agree: for a dataset weights on each accelerator as soon as they arrive,
S of sequences, the full pre-training loss is offering better resource utilization than pipeline
parallelism but suffering from training instabilities
|x| due to stale gradients which are based on outdated
− log →

XX
p (xi | x<i ; θ) model weights.
| {z }
x∈S i=1 NLL for forward model Gomez et al. [172] propose N-Wise interlock-
− log ←

p (xi | x>i ; θ) (3) ing backpropagation, which is a generalization of
| {z }
NLL for backward model
end-to-end and local training. While end-to-end
TV → (global) training performs a forward pass through
+βDi,x (−p ∥←−
p ),
| {z } all layers, computes a loss and gradients, and back-
agreement regularizer propagates through all layers, local training per-
forms forward passes through all layers individ-
T V (→
where Di,x −
p ∥←−
p ) is the total variation distance ually and immediately computes a local loss and
among the two models on the i-th token. Once gradient update, offering higher resource utilization
pre-training has been completed, we can use only at the cost of (empirically) worse task performance.
the forward model → −
p. N-Wise interlocking backpropagation strikes a com-
promise by performing a forward pass through N
Parallelism Strategies The sheer size of LLMs layers before computing a loss and updating the
makes it hard to train or even do inference with parameters of the associated layers, enabling better
them on only one accelerator (GPU, TPU, etc.). layer communication than local training and higher
A common solution is model parallelism, which computational efficiency than end-to-end training.
can be viewed as a divide-and-conquer strategy: Chowdhery et al. [86] leverage a combination
we slice up various parts of the model (dividing of model parallelism and fully sharded data par-
the problem into sub-problems), distribute them allelism (FSDP) [628, 674]—a technique where
across multiple devices, with each device comput- each device only holds a subset of the model pa-
ing a portion of the overall computation (solve each rameters, gradients, and optimizer states, and pa-
problem independently) and combine all results to rameters necessary for local computations are com-
produce the final output (forward/backward pass). municated on-demand—to enable highly parallel,
Implementing model parallelism synchronously high throughput training across thousands of chips
creates a problem where running data batches within a single TPU pod. PaLM further employs
through multiple workers with sequential depen- data parallelism to achieve scaling at pod level,
dency (each layer depends on results from the pre- leveraging the Pathways [37] system to distribute
vious layer) leads to significant waiting times and data.
under-utilization of computation resources. In a parallel line of work, Lepikhin et al. [298]
Another strategy is pipeline parallelism, which propose GShard, a model parallelism method that
combines model parallelism with data parallelism, extends the XLA [468] compiler, enabling auto-
meaning that we not only distribute parts of the matic sharding of models.
model across different devices but parts of the data
too, i.e., each worker splits its mini-batch further Miscellaneous Rae et al. [441] stack the lay-
into micro-batches with gradients being accumu- ers of a 4.5B parameter model to jump-start and
lated across all micro-batches before the weight accelerate the training of a 9B model, which led
update. Huang et al. [226] instantiate such an ap- to a 40% reduction in compute; an idea that has
proach called GPipe, which divides each mini- been previously used for training smaller-scale
batch into smaller micro-batches distributed across LMs [173]. Brown et al. [59] progressively in-
different accelerators simultaneously; gradients are crease the batch size from a small to the full value
applied synchronously at the end. Compared to over training when training GPT-3; a trick that
naive model parallelism, this decreases waiting has been previously used for training image mod-

9
els [514]. Sanyal et al. [476] apply latest weight av- cient [213, 311] and requires practitioners to keep
eraging [249] to LLMs between 1 and 12B param- individual fine-tuned LLMs in memory for every
eters; for a 6.9B parameter model, they reach sav- task. We illustrate this overhead in Figure 5.
ings of up to 4,200 GPU hours. For smaller-scale
models, there exist various pre-training speedup al- Overhead of Storing and Loading
gorithms [663, 685], but they have not been scaled Fine-Tuned LLMs [213, 311]
up yet and shown to offer only limited gains when
When adapting an LLM via full-model fine-
compared with budget-adjusted baselines [251].
tuning, an individual copy of the model
2.4 Fine-Tuning Overhead must be stored (consuming data storage) and
loaded (expending memory allocation, etc.)
A potential drawback of pre-training LLMs on mas- for each task.
sive and diverse sets of textual data is that the re-
sulting models might struggle to explicitly cap-
Parameter-efficient fine-tuning An alternative
ture the distributional properties of task-specific
method to adapt an LLM to a specific dataset/do-
datasets. To address this, fine-tuning refers to
main is via parameter-efficient fine-tuning (PEFT).
adapting the pre-trained model parameters on com-
PEFT refers to a class of methods that adapt LLMs
paratively smaller datasets that are specific to an
by updating only a small subset of model parame-
individual domain or task. LLM fine-tuning is
ters. Adapters [213] are one of the earliest works
highly effective at adapting LLMs for downstream
on PEFT. This method incorporates additional,
tasks [215, 120, 440].
learnable layers into a Transformer architecture that
Technically speaking, fine-tuning can be
are updated during fine-tuning whilst keeping the
achieved by further training a model on a smaller
remainder of the network unchanged. Experimen-
dataset. Depending on the model architecture, this
tal results on 26 text classification tasks (incl. the
is done by either (i) directly fine-tuning pre-trained
GLUE benchmark [575]) reveal that models trained
models using a standard language modeling objec-
via Adapters are competitive with full fine-tuning
tive or (ii) adding individual learnable layers to the
while updating only 3% of the model’s parame-
output representations of a pre-trained language
ters. Ben Zaken et al. [40] instead propose only
model, which are designed to create compatibil-
to update the model’s bias terms for fine-tuning,
ity between the model’s output representations and
which make up less than 1% of the model’s pa-
the output formats of individual downstream tasks
rameters. Experimental results show competitive
(e.g., for text classification or sequence labeling).
performance across tasks of the GLUE benchmark.
See Devlin et al. [120] (Figure 1) for an illustration.
We are aware of three general frameworks for incor-
However, LLMs with billions of parameters have
porating adapters into language model fine-tuning,
large memory requirements to store (i) the model
namely AdapterHub [428], LLM-Adapters [219],
parameters, (ii) the model activations, and (iii) the
and HuggingFace’s PEFT library [356].
gradients and corresponding statistics. Due to lim-
PEFT methods introduced for larger mod-
ited device memory (e.g., GPU or TPU) necessi-
els include prefix-tuning [311] and prompt-
tates access to large clusters with many devices
tuning [299], which both operate by prepending
to fine-tune a full LLM, limiting access to a few
a set of learnable token embeddings to an input.
institutions with large compute resources.
These token embeddings (also referred to as soft
Large Memory Requirements prompts [299]) are learned during the fine-tuning
stage, whereas the remainder of the model parame-
Fine-tuning entire LLMs requires the same ters remains fixed. Most notably, such soft prompts
amount of memory as pre-training, render- contain thousands rather than millions of param-
ing it infeasible for many practitioners. eters and are much more efficient to store. No-
tably, one still has to backpropagate through the
Moreover, while full model fine-tuning is ef- network while fine-tuning the tokens. Alternatives
fective at adapting LLMs to perform well on spe- for models with only black-box API access have
cific downstream tasks, individual copies of fine- been proposed too [528, 122].
tuned LLMs need to be stored and loaded for It has been shown that prompt-tuning can
individual tasks, which is computationally ineffi- learn generalizable representations with very small

10
weight matrices at individual Transformer layers as
Sen t i m en t QA H at e sp eec h
m o d el m o d el m o d el an additive low-rank decomposition. Such a repa-
rameterization avoids the need to compute dense
Fi n e-t u n i n g Fi n e-t u n i n g Fi n e-t u n i n g matrix multiplications. Dettmers et al. [118] ex-
LLM # 1 LLM # 2 LLM # 3
tend LoRA to quantized LLMs, drastically reduc-
Sen t i m en t
an al ysi s t ask
Qu est i o n
an sw er i n g t ask
H at e sp eec h
t ask
ing memory usage, allowing them to fine-tune a
65B model on a single 48GB GPU. The authors
(a) mention that regular training of the same model
requires more than 780 GB of GPU memory.

Sen t i m en t QA H at e sp eec h Compute Requirements However, despite sub-


m o d el m o d el m o d el
stantial improvements in memory complexity
needed to fine-tune LLMs for specific tasks, a re-
B ase L L M maining challenge is the time complexity. Fine-
( PEFT-ad ap t ab l e) tuning an LLM, even with PEFT methods, still
requires full gradient computation. The compu-
PEFT w ei g h t s PEFT w ei g h t s PEFT w ei g h t s tational infrastructure needed to adapt LLMs pro-
Sen t i m en t Qu est i o n H at e sp eec h hibits potential applications like personalization on
an al ysi s t ask an sw er i n g t ask t ask
smaller devices.
(b)
Full Matrix Multiplications
Figure 5: Fine-tuning an LLM for a specific down-
stream task. (a) illustrates vanilla fine-tuning, which Parameter-efficient fine-tuning of LLMs
requires updating the entire model, resulting in a new still requires computing full forward/back-
model for each task. In (b), PEFT instead learns a small ward passes throughout the whole network.
subset of model parameters for each task with a fixed
base LLM. The same base model can be re-used during
inference for different tasks. 2.5 High Inference Latency
According to Pope et al. [431], Weng [605], two
reasons why LLMs exhibit high inference latencies
amounts of training data, achieving competitive
are: (1) low parallelizability since the inference
performances when trained on less than 100 exam-
procedure proceeds one token at a time and (2)
ples for safety classification [376] or five examples
large memory footprints, due to the model size
for multilingual question answering [11]. In addi-
and the transient states needed during decoding
tion to that, recent work investigates the potential
(e.g., attention key and value tensors). Further, the
of using soft prompts for pre-training and transfer
authors also discuss the quadratic scaling of the
learning across different tasks [179, 572].
attention mechanisms in Transformers, which we
Liu et al. [331] introduce (IA)3 , which scales
discuss separately in Sec. 2.6.
activations in individual Transformer layers with
learnable vectors. The authors demonstrate its ef- High Inference Latency [431, 605]
fectiveness by showing that models trained using
(IA)3 outperform full model fine-tuning on various LLM inference latencies remain high be-
datasets whilst updating only 0.01% of the model’s cause of low parallelizability and large mem-
parameters. ory footprints.
Malladi et al. [355] propose a memory-efficient
zeroth-order (MeZO) optimizer, which only re- In the following section, we review techniques
quires the same memory footprint as during in- used to address these challenges by e.g., reduc-
ference (instead of storing gradients or optimizer ing the memory footprint (size and/or bandwidth),
states). Further, it can optimize non-differentiable or accelerating specific computational operations.
objectives like accuracy or F1 scores, which con- Note that some of these techniques may also be
ventional gradient-based tuning methods cannot. applicable during the training process, but we dis-
Hu et al. [218] propose Low-Rank Adaptation cuss them here since they are not only designed for
(LoRA), which formulates parameter updates of training, like the approaches discussed in Sec. 2.3.

11
Efficient Attention Roughly two lines of work Similarly, GLM-130B [658] uses a degradation-
aim to accelerate attention mechanism computa- free 8-bit quantization scheme, storing weights in
tions by (i) lower-level hardware-aware modifica- 8-bit and performing matrix multiplications in 16-
tions or (ii) higher-level sub-quadratic approxima- bit precision. Frantar et al. [153] propose an effi-
tions of the attention mechanism. cient, one-shot quantization technique to compress
For the former, multi-query attention [493] aims LLM weights down to 3 to 4 bits per weight, en-
to reduce memory bandwidth bottlenecks when se- abling 175B parameter models to be run on a single
quentially generating sequences of tokens using GPU. Dettmers et al. [119] further improve upon
Transformer decoder layers by keeping only one this by combining higher precision representations
attention head for the key and value tensors. Sim- for outlier weights and grouped quantization.
ilarly, Dao et al. [107], Pagliardini et al. [404] re-
duce memory bandwidth by proposing an alter- Pruning is a complementary post-training tech-
native computation method for multi-head self- nique to quantization, removing parts of the
attention, called FlashAttention, to minimize weights of a given model (without degrading its per-
the number of I/O operations to speed up the com- formance). An important distinction is whether the
putation on modern GPUs. As an optimized atten- pruning follows a structured pattern or is unstruc-
tion implementation, FlashAttention lever- tured. Structured sparse models substitute dense
ages operator fusion to reduce the memory band- sections of a model with an assembly of signifi-
width bottleneck. Pagliardini et al. [404] build cantly smaller yet still dense components. Unstruc-
on top of FlashAttention and incorporate at- tured sparse models contain weights of value zero,
tention sparsity patterns, encompassing key/query which do not influence the network’s behavior and
dropping and hashing-based attention. Pope et al. can therefore be committed in theory. However, in
[432] implement different sharding techniques to practice, it is more challenging to translate theo-
efficiently spread the feedforward and attention retical to practical computation savings on current
computations across devices while optimizing for hardware [161, 112, 336].
inter-device communication costs, enabling context On the structured side, early work on pruning
lengths of up to 43,000 tokens using multi-query language models mainly aims at comparatively
attention. small MLM-type models [592, 143, 243]. Ma et al.
With regards to the second stream of work, a [349] propose LLM-Pruner, which aims at pruning
common theme to improve the computational or LLMs in a task-agnostic manner while preserving
memory complexity of the attention mechanism is the zero-shot capabilities of the models. To this
to sparsify the attention matrix or introducing (lin- end, LLM-Pruner adopts a three-stage pruning pro-
ear) approximations [543]. However, the scalabil- cedure where 1) interdependent structures within
ity of some efficient Attention approximations has the model are identified and grouped, 2) the contri-
been questioned. For example, Tay et al. [542], Hua bution to the overall performance is estimated for
et al. [220] find that the Performer attention approx- each group, and low-performing groups are pruned,
imation [85] severely underperforms the vanilla 3) performance recovery via parameter-efficient
self-attention mechanism, especially when scaled fine-tuning procedure using LoRA [218].
up to large models. On the unstructured side, SparseGPT [152] is an
unstructured pruning approach specifically devel-
Quantization is a post-training technique that oped to be fast enough to be run on LLMs with
reduces the memory footprint and/or increases the hundreds of billions of parameters within a few
model’s throughput by reducing the computational hours, being able to prune the number of parame-
precision of weights and activations. nuQmm [407] ters by up to 60% while maintaining roughly the
and ZeroQuant [643] use a non-uniform quan- same model performance. Sun et al. [527] pro-
tization method to quantize weights and apply pose Wanda (Pruning by Weights and activations),
custom CUDA kernels for computational benefits. which applies magnitude pruning based on the
LLM.int8() [117] is a degradation-free quanti- product of each weight’s magnitude and the norm
zation scheme enabling efficient inference of multi- of the corresponding input activations, matching
billion parameter LLMs by utilizing Int8 quantiza- SparseGPT in performance while requiring only
tion and falling back to higher precision for certain a single forward pass to prune the network. Both
outlier features without the need for re-training. SparseGPT and Wanda can be extended to per-

12
form semi-structured pruning, enabling n:m spar- that the activation maps of default Transformer
sity [228, 680] and achieving the corresponding models often emerge to be very sparse implicitly;
speed-ups on recent GPUs [369]. the larger the model, the sparser measured by the
percentage of nonzero entries. Similarly, Zhang
Mixture-of-Experts architectures typically con- et al. [670] find that post-training MoEfication, i.e.,
sist of a set of experts (modules), each with unique converting monolithic models to equivalent MoE
weights, and a router (or gating) network, which models, can speed up inference by 2x.
determines which expert module processes an in-
put. MoE models decrease inference time by not Cascading refers to the idea of employing
using all experts at once but only activating a sub- differently-sized models for different queries [75].
set of them. Further, they can reduce communica- In spirit, this idea is similar to Mixture-of-Experts
tion across devices in model-distributed settings by models, but instead of learning a routing module,
placing each expert on a separate accelerator; only we employ a cascade of multiple, differently-sized
the accelerators hosting the router and the relevant monolithic models (these can be even black-box
expert model must communicate. Shazeer et al. API models) and learn a scoring function that de-
[495] propose one of the first MoE layers embed- cides which model(s) receive which query. Chen
ded within a language model, which they refer to et al. [75] demonstrate that this strategy dominates
as sparsely-gated MoEs (SG-MoEs). They denote the Pareto frontier between accuracy and cost.
by G(x) and Ei (x) the gating network output and Decoding Strategies can greatly impact the com-
the i-th expert network output for a given input putational cost of performing inference. For ex-
x, respectively.
Pn We can then write the output as ample, beam search trades off compute for higher-
y = i=1 G(x) i Ei (x). Wherever G(x)i = 0, quality results. Another example of a computa-
we do not need to compute Ei (x), thereby saving tionally expensive decoding scheme is sample-and-
compute during inference. Lepikhin et al. [298] rank [8] where N independent sequences of tokens
scale up an SG-MoE model to 600B parameters y 1 , . . . , y N are obtained using random sampling,
by proposing GShard, a model parallelism method and the highest probability sequence is used as the
that extends the XLA [468] compiler. While SG- final output.
MoE selects the top-k experts with k > 1, the Latency-oriented strategies such as speculative
Switch Transformer (ST) [145] architecture uses sampling [522, 300, 74] first autoregressively gen-
k = 1 experts, which reduces routing computation erate a draft of length K using a smaller (draft)
and communication across experts (which may be model; then, the larger (target) model scores the
located on different accelerators). ST empirically draft, followed by a modified rejection sampling
outperformed a strongly tuned T5 model with up to scheme to accept a subset of the tokens from left to
7x pre-training speedups. Lewis et al. [302] notice right. Similar ideas have been proposed in various
that the learned routers can result in unbalanced contexts, such as for blockwise parallel genera-
assignments across experts. To ensure balanced tion [522], grammatical error correction [529], and
routing, they formulate a linear assignment prob- with a larger LLM refining generation produced by
lem that maximizes token-expert affinities while a small model [265]. Del Corro et al. [114] observe
equally distributing the number of tokens across that tokens towards the end of a sequence are easier
experts. Yu et al. [653] propose sMLP, an MoE to predict due to more contextual information, mo-
using only MLPs blocks, which (i) they scale up to tivating a new decoding strategy that skips earlier
10B, (ii) results in a 2x improvement in pre-training layers in the network for such tokens.
speed, and (iii) outperforms sparse Transformer
counterparts. 2.5.1 Software
However, MoE models still suffer from unique Various frameworks have been designed to en-
issues like expert collapse (all experts learning the able the efficient training of multi-billion to
same), likely caused by underconstrained routing trillion parameter language models such as
functions [80]. For example, Roller et al. [459] DeepSpeed [450] and Megatron-LM [501] to
demonstrates that learned expert assignments do account for the unique challenges arising when
not always outperform random ones. training such models. This is necessitated by the
Interestingly, instead of designing an architec- fact that most LLMs do not fit into a single device’s
ture for sparsity explicitly, Li et al. [314] observe (GPU, TPU) memory, and scaling across GPUs and

13
compute nodes needs to account for communica- 2.6 Limited Context Length
tion and synchronization costs. FlexGen [497]
provides further speed-ups by aggregating memory Addressing everyday NLP tasks often necessitates
and compute resources from the GPU, CPU, and an understanding of a broader context. For exam-
disk and utilizing techniques such as 4-bit quan- ple, if the task at hand is discerning the sentiment
tization, enabling inference with 175B parameter in a passage from a novel or a segment of an aca-
models on a single GPU. demic paper, it is not sufficient to merely analyze a
The frameworks typically combine existing par- few words or sentences in isolation. The entirety of
allelism strategies to compensate for drawbacks the input (or context), which might encompass the
and scale model training across multiple sets of whole section or even the complete document, must
compute nodes, within compute nodes, and across be considered. Similarly, in a meeting transcript,
multiple GPUs per node. e.g., Smith et al. [515] the interpretation of a particular comment could
use tensor slicing within a node, pipeline paral- pivot between sarcasm and seriousness, depending
lelism across nodes, and data parallelism to train on the prior discussion in the meeting.
multiple model replicas over sets of nodes. Addi- Li et al. [308] evaluate several LLMs in the long-
tional features include memory optimizations [445, context settings and find that while commercial
454, 446], communication-efficient [536, 307, 343] closed-API models often fulfill their promise, many
and fused optimizers2 , and support for MoE train- open-source models – despite claiming to perform
ing [444]. well with longer contexts – exhibit severe perfor-
Specialized implementations such as mance degradation. They point out that there is
Tutel [230] and MegaBlocks [160] of- a difference between being architecturally-able to
fer efficient sparse MoE training, while deal with long inputs and actually performing well.
Alpa [677] enables automatic data and model Having an architecture that can infer long inputs
parallelism for LLMs written in Jax. The does not guarantee that the LLM will perform as
FasterTransformer3 library includes highly well on those as on shorter inputs. Similarly, Liu
optimized Transformer encoder and decoder et al. [333] find that changing the location of rel-
implementations for TensorFlow, PyTorch, and evant information in the input can degrade model
Triton. performance. Interestingly, they find that decoder-
Kwon et al. [285] introduce vLLM, an open- only LLMs like GPT-3.5 can deal well with such
source library for efficient inference and LLM serv- information at the beginning or end of the input
ing. vLLM employs PagedAttention, which par- context; they cannot access information in the mid-
titions each sequence’s KV cache into fixed-size dle of it well, resulting in a U-shaped performance
blocks. When performing attention computations, curve.
blocks are fetched from non-contiguous memory.
This enables memory sharing, reducing memory Limited Context Length
consumption and transfers in decoding strategies
such as beam search, ultimately improving through- Limited context lengths are a barrier for
put. handling long inputs well to facilitate ap-
The Petals [54] library4 allows users to col- plications like novel or textbook writing or
laboratively fine-tune and run LLMs by distribut- summarizing.
ing subsets of model parameters to individual ma-
chines.
To this end, we discuss three lines of work per-
All of these libraries address the enormous com- mitting longer context lengths. First, we look at
putational costs associated with training and run- efficient attention mechanisms, which help miti-
ning LLMs, either by offering more efficient im- gate the effect of long inputs on the computational
plementations, lowering memory requirements, or requirements of Transformer models. Next, we ex-
using distributed or decentralized computing strate- amine positional embedding schemes in the light
gies. of generalization to longer sequence lengths than
2
https://github.com/nvidia/apex
those used during training. Lastly, we revise Trans-
3
https://github.com/NVIDIA/FasterTransformer former alternatives which neither require attention
4
https://github.com/bigscience-workshop/petals nor positional embeddings.

14
Efficient Attention Mechanisms One way of generalize well to significantly longer sequences
addressing the limited context of LLMs is by de- during inference.
signing more efficient attention mechanisms that The fundamental building block of the Trans-
can process longer inputs. Ma et al. [350] intro- former architecture is the self-attention mechanism.
duce Luna, a linear unified nested attention mech- It is permutation-invariant; therefore, the output is
anism that approximates softmax attention with independent of the input sequence order. Positional
two nested linear attention functions, yielding only information is commonly injected to make the
linear (as opposed to quadratic) time and space model respect a token’s position in the sequence,
complexity, allowing it to process much longer in- i.e., capture the semantics of where a token occurs
puts. Similarly, Shen et al. [496] and Li et al. [310] rather than just whether it occurs. The longer the
present alternative attention mechanisms equivalent input is, the more important the positional embed-
to the dot-product attention but which require sub- ding becomes since the model needs to effectively
stantially less memory and compute resources. Guo use information from different parts of the input
et al. [183] propose an attention mechanism called that may cover a wide range of distances from the
Transient Global, which is an extension of local current token.
attention where each token can attend to nearby
Without positional embeddings, a Transformer
tokens and a set of global tokens. It enables to han-
models the relations between any two tokens with
dle sequences with up to 12,000 tokens. Similarly,
equal probability. Hence, positional embeddings
CoLT5 [15] enables context lengths of up to 64,000
introduce an LSTM-like inductive bias that (typi-
tokens by splitting the computations into a light
cally) tokens closer to each other in the sequence
branch with local attention, fewer attention heads,
are more relevant to each other. Depending on the
and a heavy branch with full attention. CoLT5 ap-
positional embedding scheme chosen, this can be
plies the light branch to every token and the heavy
learned or effectively hard-coded. However, it re-
branch to a subset of tokens that are selected by a
mains unclear what is the most effective positional
learnable routing function.
embedding scheme for long inputs. Further, mod-
After investigating the effect of the dot-product els face difficulties generalizing to unseen sequence
self-attention mechanism, Tay et al. [541] pro- lengths by introducing a dependency on sequence
pose the Synthesizer, a new architecture that learns positions. This is an undesirable artifact of posi-
synthetic attention weights without token-token tional embeddings, as language semantics do not
interactions, showing that it consistently outper- inherently depend on the length of an utterance.
forms transformers on various language-based
While positional encoding schemes such as rela-
tasks. Britz et al. [56] offer an alternative attention
tive positional encodings or, more recently, ALiBi
mechanism based on a fixed-size memory repre-
have made progress in building more generaliz-
sentation that is more efficient, yielding inference
able ways for injecting positional information into
speedups of 20% without significantly hurting per-
Transformers, the challenge of generalizing to se-
formance. Hua et al. [220] combine a single-head
quences much longer than seen during training re-
attention mechanism with a linear attention approx-
mains largely unsolved. Surprisingly, Haviv et al.
imation to achieve speed-ups between 4.9x and
[192] find that causal LLMs without positional en-
12.1x for auto-regressive language modeling while
codings are competitive compared to models with
obtaining similar perplexities as a standard Trans-
positional encodings and accredit this success to
former model. Ding et al. [124] propose dilated
the causal attention mask leaking positional infor-
attention which splits a sequence into equally long
mation into the model.
segments and processes each of these in parallel
using a sparsified attention mechanism. Dilated In the following, we first summarize some stan-
attention offers a linear computational complexity dard positional embeddings technique and then
in the sequence length and, applied hierarchically, move to more advanced schemes designed to im-
enables inputs of up to 1B tokens. prove length generalization. We start with Abso-
lute Positional Embeddings [563], which inject
Length Generalization As the required compute positional information by sinusoidal embeddings
of Transformer-based LLMs grows quadratic with based on the absolute position i of a token xi within
the sequence length, it is a desired property to build their sequence x1 , . . . , xN into the model input.
LLMs that can be trained on short sequences and Given an input sequence X = [x1 , . . . , xN ], we

15
add a positional embedding matrix P ∈ Rn×d of coding scheme extrapolates poorly to unseen se-
the same shape to get the positional encoding out- quence lengths. However, Chen et al. [79] demon-
puts X + P, where the element on the ith row strate that by interpolating rather than extrapolating
and the (2j)th or the (2j + 1)th column of P fol- longer than before observed context windows and
lows sinusoidal functions. Vaswani et al. [563] briefly fine-tuning RoPE-based models, enabling
also compare against learned positional embed- pre-trained LLMs to extend their context window
dings and find no significant performance differ- to very long sizes of up to 32, 768 tokens.
ence. In contrast, sinusoidal positional encodings Relative Positional Bias [443] directly bias the
require no trainable parameters, and the authors attention computation (Eq. (5)) with a learned bias
hypothesize that they enable extrapolation to se- per relative positional offset and attention head
quence lengths longer than the ones contained in instead of adding information to the token embed-
the training set. However, this feature is not guar- dings
anteed, as the subsequent layers in the network  
need to be able to deal with such extrapolated po- 1 X
sitional embeddings. Learned positional encod- softmax  √ x⊤ ⊤
i Wq Wk xj + bi−j
 . (5)
d i,j
ings do not possess inherent generalization capabil-
ities for unseen sequence lengths. This limitation Press et al. [434] follow a similar methodology
arises because the embeddings associated with ab- but use heuristics to define ALiBi (Attention with
solute positions not encountered during training— Linear Biases), a non-learned bias that is used
depending on the implementation—either do not to penalize attention scores in long-range interac-
exist or remain untrained (random). Relative Posi- tions [479], i.e., a recency-bias is backed into the
tional Embeddings have subsequently been devel- model. Here, m is a pre-defined, head-specific
oped, extending absolute positional embeddings to slope–by default, the set of slopes for n heads form
relative offsets between token positions [492, 221, a geometric sequence.
105, 79]. While rarely used in their vanilla form in  
LLMs [441], relative positional embeddings have
1 X
given rise to the methods outlined in the follow- softmax  √ x⊤ ⊤
i Wq Wk xj + m · −(i − j) .

ing paragraphs. They offer better generalization to d i,j
unseen sequence lengths than absolute positional (6)
encodings. All unseen absolute positions will be Press et al. [434] motivate ALiBi by designing it to
converted to previously observed relative offsets generalize well to unseen sequence lengths. They
between positions, enabling better generalization to show that training a model with it on training se-
long input sequences at inference time. Rotary Po- quences with a maximum sequence length of 1, 024
sition Embeddings (RoPE) [526] unite absolute tokens achieves the same perplexity on a test set
and relative methods by incorporating absolute po- with a maximum sequence length of 2, 048 as a
sitional information in a rotation matrix and model- model trained with sinusoidal positional encodings
ing the relative positional offset through a rotation. on sequences with up to 2, 048 tokens. Thereby, it
They directly modify the self-attention calculation not only enables larger context lengths but can also
rather than injecting positional information into the potentially reduce pre-training costs (Sec. 2.3).
embeddings. The attention between positions i, j While some of the existing positional encod-
linearly depends on i − j by introducing a d × d ing schemes offer better generalization to long se-
dimensional block diagonal matrix RΘ,k d , resulting quences than others, it remains unclear how reliable
in a self-attention mechanism defined as they are. For example, Taylor et al. [548] report try-
ing ALiBi in the Galactica LLM and not observing
“large gains” compared to using learned positional
 
1 X
softmax  √ x⊤ ⊤ d
i Wq RΘ,(i−j) Wk xj
. encodings. Similarly, Kazemnejad et al. [259] find
d i,j that popular positional encoding schemes such as
(4) ALiBi, RoPE, and absolute positional encodings do
While RoPE has been adapted in many LLMs [576, not perform well in terms of length generalization
47, 86] and Su et al. [526] show RoPE leading in a suite of 10 reasoning downstream tasks.
to better performance on long text tasks, Press In a parallel line of work, Anil et al. [19] demon-
et al. [434] demonstrate that this positional en- strate that naively fine-tuning a pre-trained LLM is

16
insufficient for length generalization in the context tance Weighted Key Value (RWKV) to combine
of reasoning tasks. Instead, they propose combin- the parallelization benefits of Transformer-based
ing in-context learning and scratchpad/chain-of- LLMs during training with the fast inference and
thought reasoning to enable LLMs to generalize to low compute requirements of RNNs. The authors
unseen sequence lengths in- and out-of-distribution, accomplish this by leveraging a linear attention-
with performance scaling with model size. The au- like mechanism, scaling non-Transformer LLMs to
thors report that fine-tuning can further improve 14B parameters, and matching the performance of
model performance dependent on the task perfor- similarly-sized Transformer LLMs.
mance of the baseline.
2.7 Prompt Brittleness
Transformer Alternatives While Transformers A prompt is an input to the LLM. The prompt syn-
are the dominant paradigm in LLMs today due to tax (e.g., length, blanks, ordering of examples) and
their strong performance, several more efficient semantics (e.g., wording, selection of examples,
alternative architectures exist. One line of work instructions) can have a significant impact on the
tries to replace the attention mechanism using state model’s output [342].
space models (SSMs), which offer near-linear com- As an analogy, if we were to think of an LLM
putational complexity w.r.t. the sequence length. as a (fuzzy) database and prompts as queries [246],
Dao et al. [108] investigate the weaknesses of state it becomes clear that slight changes in the query
space models (SSMs) in language modeling and can result in vastly different outputs. Consequently,
find that existing approaches struggle with recall- the wording, as well as the order of examples in-
ing previous tokens and comparing tokens in the cluded in a prompt, have been found to influence
sequence. Based on these findings, the authors the model’s behavior significantly [596, 675, 342].
propose H3 with a shift matrix to recall previous
tokens and multiplicative interactions for token Prompt Brittleness [675, 596, 342]
comparisons. The authors demonstrate that H3
Variations of the prompt syntax, often oc-
comes close to Transformer-based LLMs for lan-
curring in ways unintuitive to humans, can
guage modeling, offering further improvements
result in dramatic output changes.
when combined with attention. Poli et al. [430]
propose the Hyena operator, a convolution-based
Designing natural language queries that steer the
sub-quadratic attention replacement designed for
model’s outputs toward desired outcomes is often
long sequences. Hyena tries to emulate the atten-
referred to as prompt engineering [477, 287, 606].
tion mechanisms’ dynamic nature by introducing
Fig. 6 summarizes some of the most popular
data-controlled computations, i.e., Hyena applies
prompting methods with an example adapted from
an element-wise gating operation based on the op-
Wei et al. [601]. As we can see, there are lots of
erator’s input to mimic the attention contextualiza-
equally-plausible prompting techniques, and the
tion. Hyena-based models have been used on natu-
current state of prompt engineering still requires
ral language for sequence lengths of up to 131, 000
lots of experimentation, with little theoretical un-
tokens [430] and up to 1, 000, 000 tokens in the
derstanding of why a particular way to phrase a
context of genomics [383]. Fathi et al. [144] pro-
task is more sensible other than that it achieves
pose the Block-State Transformer, which builds
better empirical results. Developing LLMs that are
upon a hybrid layer that combines an SSM for
robust to the prompt’s style and format remains
long-range contextualization and a Transformer
unsolved, leaving practitioners to design prompts
for short-range interactions between tokens. The
ad-hoc rather than systematically.
authors find similar performance to Transformer-
based baselines while obtaining speed-ups of up to Single-Turn Prompting methods improve the in-
10x on sequence-level, enabling models with more put prompt in various ways to get a better answer in
than 65, 000 tokens sequence length. a single shot. In-Context Learning (ICL) refers
Another line of work utilizes recurrent neu- to an LLM’s ability to learn a new task solely via
ral networks (RNNs), which offer linear com- inference (without any parameter updates) by con-
putational complexity and memory requirements ditioning on a concatenation of the training data
with respect to the sequence length as the back- as demonstrations [59, 483]. This enables users
bone of LLMs. Peng et al. [416] propose Recep- and practitioners to use LLMs for a variety of NLP

17
Single-Turn Prompting Input Output

In-Context Learning Instruction-Following Chain-of-Thought Prompt tuning


Here is a mathematical reasoning question. You need Q: Lisa has 5 easy peelers. She buys 2 more nets with Embedding 1 Embedding … Embedding N
Q: Lisa has 5 easy peelers. She buys 2 more nets with
to apply arithmetic operations to generate the correct 6 each. How many easy peelers does she have?
6 each. How many easy peelers does she have? Q: Lisa has 5 easy peelers. She buys 2 more nets with
answer. A: Lisa starts with 5. 2 nets of 6 each are 12 easy
A: The answer is 17. 6 each. How many easy peelers does she have?
peelers. 5+12=17. The answer is 17.
Q: The cafeteria has 37 bananas. They bought 5 more A: The answer is 17.
Q: Lisa has 5 easy peelers. She buys 2 more nets with Q: The cafeteria has 37 bananas. They bought 5 more
bunches with 5 each, how many bananas do they Q: The cafeteria has 37 bananas. They bought 5 more
6 each. How many easy peelers does she have? bunches with 5 each, how many bananas do they
have? bunches with 5 each, how many bananas do they
… have?
have?

A: The cafeteria has 37 bananas originally. They


bought 5 more bunches and each bunch has 5, so
A: The answer is 62. A: The answer is 62. A: The answer is 62.
they added 5 x 5 = 25 bananas to their stock. We
add these numbers: 37 + 25 = 62. The answer is 62.

Multi-Turn Prompting
Self-Consistency Ask-Me-Anything Least-To-Most
Stage 1: Problem Reduction
Q: Lisa has 5 easy peelers. She buys 2 more nets with 6 each. How
many easy peelers does she have? Prompt Chain 1 Q: The cafeteria has 37 bananas. A: To solve “How many bananas
Prompt Chain 2 They bought 5 more bunches does it have?”, we need to first
A: Lisa starts with 5. 2 nets of 6 each are 12 easy peelers. 5+12=17.
The answer is 17. Prompt Chain 3 with 5 each, how many bananas solve: “How many bananas does
Answer the question using arithmetic. do they have? it buy in total”?
Q: The cafeteria has 37 bananas. They bought 5 more bunches with Formulate a question for the given context.
Q: Lisa has 5 easy peelers. She buys 2 more
5 each, how many bananas do they have? Q: Lisa has 5 easy peelers. She buys 2 more
nets with 6 each. How many easy peelers
… nets with 6 each. How many easy peelers does
she have?
does she have? Stage 2: Sequentially Solve Subquestions
A: The answer is 17. The cafeteria has 37 bananas.
A: The answer is 17.
Q: The cafeteria has 37 bananas. They
Q: The cafeteria has 37 bananas. They bought They bought 5 more bunches
A: The cafeteria has 37 bananas A: The cafeteria initially had 37 A: We need to multiply the bought 5 more bunches with 5 each.
5 more bunches with 5 each. with 5 each. A: They buy 25 bananas in total.
originally. They bought 5 more bananas and purchased an number of bunches by the number Q: What is the total number of bananas
Q: Q: How many bananas does it
bunches and each bunch has 5, so additional 5 bunches of bananas, of banans in each bunch. 5 times 5 they possess?
buy in total?
they added 5 x 5 = 25 bananas to each with 5, totaling 25 bananas. gives us 25 bananas. Next, we add
their stock. We add these So, adding 5 and 25 together, the the original number of bananas. What is the total number of bananas they possess? A: The answer is 62.
numbers: 37 + 25 = 62. The total fruit count is now 30. The The addition 37 plus 25 equals 62. The cafeteria has 37 bananas.
answer is 62. answer is 30. The answer is 62. They bought 5 more bunches
with 5 each, how many bananas
A1: The answer is 62. do they have? A: The cafeteria has 37 bananas.
Q: How many bananas does it They buy 25 bananas in total.
A2: The answer is 62. A: The answer is 62.
buy in total? So, in total, they have 37 + 25 =
Majority Vote A: The answer is 62. 62 bananas.
A3: The answer is 93. A: They buy 25 bananas in total.
Majority Vote Q: How many bananas do they
have?

Tree of Thoughts Self-Refine

Q: The cafeteria has 37 bananas. They Q: Lisa has 5 easy peelers. She buys 2 more nets with 6 each.
bought 5 more bunches with 5 each, how How many easy peelers does she have?
many bananas do they have? Thought Generation
A: The answer is 17.
Propose Prompt
Q: The cafeteria has 37 bananas. They bought 5 more Refined Output
The cafeteria bought 5 more bunches with 5 bunches with 5 each, how many bananas do they have? A: Apologies for any confusion,
each. Calculate how many they bought in 5 x 5 = 25
you are right, I was answering
… … … total.
the wrong question. The correct
A: The answer is 37. answer is 62, by adding 37 and 5
Evaluation Prompt
x 5.
Evaluate whether this thought is useful to answer the original question.
… … … Feedback

Thought Evaluation This response is not answering the question asked. The question
asked is how many banans there are in total. These two
Yes, this calculation takes us one step closer to the solution.
quantities have to be added together.
… … …

Figure 6: Overview of Selected Prompting Methods, categorized into Single-Turn and Multi-Turn Prompting. We
use a running example across all methods inspired by Wei et al. [601].

tasks by simply listing examples of the dataset (e.g., leverages demonstrations: on the one hand, task
input texts and their corresponding labels) without recognition is the ability to recognize a task through
the need to adjust the LLM’s inner workings. demonstrations (possibly without ground-truth la-
Various existing works investigate why ICL bels or perhaps even wrong ones, as in the case of
shows such competitive results across NLP tasks. Min et al. [366]). After this recognition phase, it
One explanation concurrently proposed by [570, applies its pre-trained capabilities. On the other
103, 16] is that ICL emulates gradient-based meta- hand, the skill to acquire new input-label mappings
learning, i.e., it implicitly fine-tunes the model unseen in pre-training is called task learning.
through gradient descent in their forward pass. While input-label associations may not seem to
Interestingly, Min et al. [366] show that input- drive few-shot performance, at least in the case
label associations in the few-shot prompt are not of task recognition, Lu et al. [342] show that the
decisive for model performance: randomly flip- order of few-shot examples matters in that LLMs
ping labels of few-shot demonstrations shows to are highly sensitive to permutations of the order in
harm an LLM’s ability to solve NLP tasks barely. which the few-shot demonstrations are provided.
However, few-shot learning (with and without ran- Alternative explanations of the ICL phenomenon
dom labels) vastly outperforms zero-shot learning take place around Bayesian inference [623], sparse
(i.e., no demonstrations are provided in the prompt). linear regression [7], structure induction [188],
The authors argue that the demonstrations are help- maintaining coherence [509], kernel regression
ful for task performance in that the LLM instead [190], and clone-structured causal graphs [535].
learns the label space and the input distribution of Instruction-Following is mainly explained in
the task. Sec. 2.9, as it requires supervised fine-tuning. To
In later work, Pan et al. [405] explain that there briefly recap, the idea is to prepend task-describing
are two distinct mechanics through which ICL instructions (e.g., “This is a text classification task

18
for movie reviews. Here are a few examples: ...”) symbolic calls to external tools such as search and
in the input prompts. code generation or execution. To this end, ART
Chain-of-Thought (CoT) [327, 601] describes retrieves demonstrations of related tasks from
a technique used to construct few-shot prompts via a library of tasks with accompanying reasoning
a series of intermediate reasoning steps leading steps and uses a frozen language model to generate
to the final output. Answer rationales to solve al- intermediate reasoning steps.
gebraic problems were originally proposed in the Self-refine [351] is based on the notion of itera-
pre-LLM era [327] and later experienced big pop- tive refinement, i.e., improving an initial solution
ularity as a prompting strategy for LLMs [601]. over multiple steps. To this end, a single LLM gen-
Extensions of chain-of-thought prompting include erates an initial output and then iteratively provides
zero-shot variants [273] and automatically gener- feedback on the previous output, followed by a re-
ated series of reasoning steps [671]. finement step in which the feedback is incorporated
Impersonation [473] is a technique in which into a revised output.
the prompt for the model asks it to pretend to be a Tree of Thoughts [639] generalize CoT to main-
domain expert when answering a domain-specific tain a tree of thoughts (with multiple different
question. Salewski et al. [473] find that LLMs paths), where each thought is a language sequence
answer domain-specific questions more accurately that serves as an intermediate step. Doing so en-
when prompted to impersonate a domain expert. ables the LLM to self-evaluate the progress inter-
mediate thoughts make towards solving the prob-
Multi-Turn Prompting methods iteratively lem and incorporating search algorithms, such as
chain prompts and their answers together. breadth-first or depth-first search, allowing system-
Ask Me Anything [24] uses multiple prompt atic exploration of the tree with lookahead and
templates (called prompt chains), which are used backtracking.
to reformat few-shot example inputs into an open-
ended question-answering format. The final output Controlled Generation The approaches above
is obtained by aggregating the LLMs predictions primarily modify the prompt text to steer model
for each reformatted input via a majority vote. outputs. However, instead of reformulating the
Self-consistency [585] extends chain-of-thought input text, we can control the output by approaches
prompting by sampling multiple reasoning paths that directly modify the inference procedure given
and selecting the most consistent answer via a ma- a fixed set of prompts. Before the advent of LLMs,
jority vote. this line of work has been referred to as controlled
Least-to-Most [682] uses a set of constant generation [261, 109, 278].
prompts to use the LLM to decompose a given In the context of LLMs, Sanchez et al. [474]
complex problem into a series of subproblems. proposes to use classifier-free guidance sampling
The LLM sequentially solves the subproblems with [204], where the input prompt’s importance is up-
prompts for later-stage subproblems containing pre- weighted throughout the generation of a sequence.
viously produced solutions, iteratively building the Roush [463] proposes five ideas related to modify-
final output. ing the prompt throughout the decoding of a single
Scratchpad [391] is a method to fine-tune LLMs sequence; for example, alternating between two in-
on multi-step computation tasks such that they out- put prompts. Such works often borrow ideas from
put intermediate reasoning steps, e.g., intermedi- the text-to-image generation community [384, 29].
ate calculations when performing additions, into a One idea we have not seen borrowed yet is neg-
“scratchpad” before generating the final result. ative prompting, i.e., including a description of
ReAct [640] combines reasoning and acting by unwanted outputs. According to Neg [4], the first
prompting LLMs to generate reasoning traces (e.g., attempts at such an idea resulted in negative out-
Chain-of-thought) and action plans, which can be comes.
executed to allow the model to interact with exter-
nal environments such as Wikipedia to incorporate 2.8 Hallucinations
knowledge. The popularity of services like ChatGPT suggests
Automatic Reasoning and Tool-Use that LLMs are increasingly used for everyday
(ART) [406] is a method to automatically question-answering. As a result, the factual accu-
generate multi-step reasoning prompts, including racy of these models has become more significant

19

"

than ever. sue of hallucination in the context of algorithmic


reasoning. Here, we focus on ways to address hal-
lucinations in LLMs without changing the model
architecture itself, including (i) supplying the LLM
with relevant sources (retrieval augmentation) or
(ii) decoding strategies.

How to Measure Hallucinations Lee et al. [295]


provide the FactualityPrompts dataset consisting
Correct!
of factual and nonfactual input prompts, which al-
lows one to isolate the effect of prompt’s actuality
Does not exist! on the model’s continuation. Further, they mea-
sure hallucinations using named-entity- and textual
entailment-based metrics. Min et al. [365] notice
Wrong authors!
that evaluating factuality can be difficult because
Figure 7: Example of Hallucinations with GPT-4, generations can contain a mixture of supported
accessed on 02/06/2023. and unsupported information, making binary judg-
ments of quality inadequate and human evaluation
Unfortunately, LLMs often suffer from halluci- time-consuming. Hence, they propose a frame-
nations, which contain inaccurate information that work that first breaks generations into atomic facts
can be hard to detect due to the text’s fluency. Fig. 7 and then computes the percentage of atomic facts
illustrates an example. supported by an external knowledge source like
To distinguish between different types of hallu- Wikipedia. Zhang et al. [664] detect the behavior
cinations, we consider the provided source content of hallucination snowballing, where the LLM over-
of the model, e.g., the prompt, possibly includ- commits to early mistakes (before outputting the
ing examples or retrieved context. Based on such, explanation) in its generation, which it otherwise
we can distinguish between intrinsic and extrinsic would not make.
hallucinations [241]. In the former, the generated
Retrieval Augmentation One way to mitigate
text logically contradicts the source content. In
hallucinations is to ground the model’s input on
the latter, we cannot verify the output correctness
external knowledge, which is often referred to as
from the provided source; the source content does
retrieval augmentation. In other words, we can
not provide enough information to assess the out-
decouple (i) memory storage of knowledge (e.g.,
put, which is, therefore, under-determined. Extrin-
databases or search indexes [290]) and (ii) process-
sic hallucination is not necessarily erroneous, as it
ing of the knowledge to arrive at a more modular
merely means the model generated an output that
architecture. For (i), a retriever module retrieves
can neither be grounded nor contradicted by the
the top-k relevant documents (or passages) for a
source content. This is still, to some degree, un-
query from a large corpus of text. Then, for (ii),
desirable as the provided information cannot be
we feed these retrieved documents to the language
verified. We illustrate intrinsic and extrinsic hallu-
model together with the initial prompt. In theory,
cinations in Fig. 8.
using an external data source may also make it eas-
Hallucination [293, 458, 241] ier to interpret which knowledge is retrieved and
update it without tediously fine-tuning the model.
Generated text that is fluent and natural but Shuster et al. [507] demonstrate hallucinations in
unfaithful to the source content (intrinsic) GPT-3 and study various components of retrieval-
and/or under-determined (extrinsic). augmented architectures to mitigate them. Their
best models reduce hallucinated responses by
Liu et al. [328] attribute hallucinations com- over 60% on average and up to 85% on out-of-
monly observed in LLMs to an architectural flaw in distribution data, on which the model has not been
Transformer models while observing that recurrent trained.
neural networks perfectly solve their minimalistic We summarize a few popular retrieval
synthetic benchmarks, designed to isolate the is- augmentation (RA) approaches as follows.

20
Problems Solutions
P.1) I ntr insic Hallucination S.1) Decoding Str ategies

Bob's wife is Amy. Bob's daughter is Bob's wife is Amy. Bob's daughter is
Cindy. Who is Cindy to Amy? Cindy. Who is Cindy to Amy?

daughter
daughter-in-law
...
Quer y Query son

Cindy is Amy's daughter-in-law. Cindy is Amy's daughter.

P.2) Extr insic Hallucination S.2) Retr ieval augmentation

Explain RLHF for LLMs. Explain RLHF for LLMs.


Retr ieved
context

Quer y Quer y

RLHF stands for "Rights, Limitations, RLHF is a technique used for alignment of
Harms and Freedoms" and is a framework LLMs and stands for Reinforcement
for ... models like LLMs. Learning with Human Preferences.

Figure 8: Illustration of a) intrinsic and b) extrinsic hallucinations in user interaction with an LLM, inspired
by Zhao et al. [673]. In a), the produced answer contradicts the given context, whereas in b), the context does not
provide enough information about whether the produced answer would contradict.

Retrieval-augmented language model pre-training model trained on them.


(REALM) [186] inserts retrieved documents However, standard RA does not always solve the
into the pre-training examples. While Guu et al. hallucinations problem. Fig. 9 illustrates an exam-
[186] designed REALM for extractive tasks ple of ChatGPT browsing the web first to retrieve
such as question-answering, Lewis et al. [304] relevant documents before answering the query.
propose retrieval-augmented generation (RAG), a While the Bing browsing plugin retrieves two (exis-
language generation framework using retrievers tent) related papers ([673, 632]), unfortunately, the
for knowledge-intensive tasks that humans could final response still contains a hallucination: the sec-
not solve without access to an external knowledge ond paper’s title and summary are factually inaccu-
source. Yogatama et al. [646] propose the adaptive rate. The second paper’s true title is “Practical and
Semiparametric Language Models architecture, Ethical Challenges of Large Language Models in
which incorporates the current local context, a Education: A Systematic Literature Review” [632].
short-term memory that caches earlier-computed
hidden states, and a long-term memory based on a Another failure mode of RA is illustrated by
key-value store of (hidden-state, output) tuples. To Khattab et al. [262], who find that sometimes the
equip a retrieval-augmented LLM with few-shot retriever cannot find passages that directly answer
abilities that were before only emergent in LLMs the question. Hence, they propose a framework that
with many more parameters, Izacard et al. [236] unifies techniques from RA and multi-turn prompt-
propose a KL-divergence loss term for retrieval ing (Sec. 2.7) to solve more complex questions
models, resulting in ATLAS. Borgeaud et al. [52] programmatically.
study scaling up retrieval databases up to 2 trillion
tokens and achieving comparable performance Decoding Strategies Another approach to miti-
to GPT-3 on some tasks despite using 25× fewer gating hallucinations is refining the decoding strat-
parameters while highlighting the retrieval model’s egy during inference time. Lee et al. [295] show
ability to copy-paste existing training chunks. Asai that standard decoding algorithms (e.g., top-p trun-
et al. [25] introduce a collection of 40 retrieval cation) can induce hallucinations due to the uni-
datasets with instructions and a corresponding form randomness introduced at every sampling

21

"

does not cause unintended or undesirable harms or


consequences [466, 158, 196]. Most of the exist-
ing alignment work can be categorized into either
methods for detecting misaligned behavior (such as
model evaluation and auditing, mechanistic inter-
pretability, or red teaming) or methods for aligning
model behavior (such as pre-training with human
feedback, instruction fine-tuning, or RLHF).

Misaligned Behavior
Correct!

LLMs often generate outputs that are not


well-aligned with human values or inten-
tions, which can have unintended or nega-
tive consequences.
Does not exist!

Figure 9: Example of Retrieval-Augmented GPT-4, Pre-Training With Human Feedback Korbak


accessed on 02/06/2023. et al. [275] introduce the concept of pre-training
with human feedback (PHF) where human feedback
is incorporated during the pre-training stage rather
step. Dziri et al. [136] observe a positive correlation than during fine-tuning. The authors compare five
between increased diversity in response generation different PHF approaches such as filtering [516,
and hallucinations. 587], conditional training [150, 142, 261], unlike-
The reason for inducing randomness and diver- lihood [604], reward-weighted regression [424],
sity in popular decoding strategies is that gener- and advantage-weighted regression [419], and find
ating the most likely sequence often leads to an that conditional training leads to the best trade-off
unsurprising and unnatural text compared to hu- between alignment and capabilities. Conditional
man communication [489, 207, 662]. Zhang et al. training is a simple technique that prepends a con-
[662] phrase this challenge as a trade-off between trol token c (e.g.,<|good|> or <|bad|>) before
diversity and quality. While this challenge re- each training example x depending on the outcome
mains largely unsolved, several approaches such of a thresholded reward function R(x) ≥ t. During
as diverse beam search [567] and confident decod- inference, the model generations are conditioned
ing [552] try reducing the induced hallucinations on c = <|good|>. Conditional training results in
at the decoding level. significantly better alignment with human prefer-
Uncertainty-Aware Beam Search [620] is ences than standard LM pre-training, followed by
based on the observation that higher predictive un- fine-tuning with human feedback without hurting
certainty corresponds to a larger chance of gener- downstream task performance.
ating hallucinations. Therefore, the method intro-
duces a penalty term in the beam search to penalize Instruction Fine-Tuning Yi et al. [645], Wei
high predictive uncertainty during decoding. et al. [598], Mishra et al. [370], Ouyang et al.
Confident Decoding [552] hypothesize that hal- [403], Wang et al. [589] fine-tune pre-trained LLM
lucinations of encoder-decoder models originate by on instructional data, i.e., data containing natural
not attending to the source when decoding. They language instructions and the desired responses
propose an attention-based confidence score to according to human judgment. Instruction-tuned
measure how strongly a model attends the source (IT) LLMs often reach state-of-the-art downstream
and a variational Bayes training procedure to en- performances and improve over their non-IT coun-
sure the model generates high-confidence answers. terparts [235, 93], as can be seen, e.g., in the pub-
licly available HELM evaluations [561]. Ouyang
2.9 Misaligned Behavior et al. [403], Wang et al. [588] find that they produce
The alignment problem refers to the challenge of more truthful and less toxic text while generating
ensuring that the LLM’s behavior aligns with hu- preferred outputs.
man values, objectives, and expectations and that it To generate instruction sets, Zhou et al. [683]

22
propose the Automatic Prompt Engineer (APE) elaborated that this would interfere with their goal
method, which leverages LLMs to generate, score, of being helpful. However, the authors equally ob-
and rephrase instruction-following zero- and few- served positive or neutral behavior reinforcements
shot prompts. Longpre et al. [340] describe and an- when fine-tuning LLMs with RLHF.
alyze the steps taken to create an improved version Further, there is an ongoing debate about the ex-
of the Flan collection [598] used to train FLAN- tent to which the “RL” in RLHF is needed. Rafailov
PaLM [93]. When trained on this data, the authors et al. [442] identify a mapping between reward
find that the improved model performance stems functions and optimal policies, which allows them
from more diverse tasks by inverting input-output to design Direct Preference Optimization (DPO),
pairs and data augmentation techniques such as an algorithm that implicitly optimizes the same
mixing zero-shot and few-shot prompts. Honovich objective as existing RLHF algorithms. DPO re-
et al. [209] generate a large dataset of natural lan- quires only solving a classification problem on the
guage instructions using a pre-trained LLM to gen- human preference data, eliminating the need to fit
erate and then rephrase instructions. They show a reward model and employ RL. Similarly, Zhou
that a T5 ("LM-adapted") fine-tuned on this data et al. [681] find that fine-tuning LLaMa on only
outperforms other instruction fine-tuned T5 models 1,000 selected prompts and responses, without any
such as T0++ [475] and Tk-Instruct [589]. RL or reward modeling, can be enough to outper-
form RLHF-trained models like DaVinci003 from
Reinforcement Learning From Human Feed- OpenAI. Consequently, the authors pose the Super-
back (RLHF) is a variation of RL that incor- ficial Alignment Hypothesis: The knowledge and
porates feedback from humans in the form of re- skills of a model are primarily acquired during the
wards [88, 524] and has proven to be an effec- pre-training phase, while alignment instructs it on
tive way of aligning LLMs with human prefer- the appropriate subdistribution of formats to use in
ences [403, 31]. RLHF works by using a pre- user interactions.
trained LM to generate text, which is then evaluated Since RLHF involves many different compo-
by humans by, for example, ranking two model nents such as (1) the preferences data collected
generations for the same prompt. This data is then from humans, (2) the reward models to learn the
collected to learn a reward model that predicts a human preferences, and (3) the policy optimization
scalar reward given any generated text. The reward algorithm (e.g., PPO), Zheng et al. [678] announce
captures human preferences when judging model to release a sequel dissecting each. The most recent
output. Finally, we optimize the LM against such part focuses on step (3) and finds that various RL
reward model using RL policy gradient algorithms tricks can be applied to make vanilla PPO more
like PPO [484]. RLHF can be applied directly to a stable.
general-purpose LM pre-trained via self-supervised
learning. However, applying RLHF right after pre-
training may not be good enough for more complex
tasks. In such cases, RLHF is typically applied af-
ter an initial supervised fine-tuning phase using
a small number of expert demonstrations for the
corresponding downstream task [449, 403, 524].
RLHF has also proven helpful for a wide range
of language generation tasks, from summariza- Figure 10: Alignment. We categorize existing align-
tion [686, 612, 524] to training more helpful, harm- ment work into methods for detecting misaligned behav-
ior or aligning models.
less, and accurate assistants [170, 96, 403, 31], and
learning to use tools [379, 441, 362].
RLHF can also introduce unwanted side ef- Self-improvement refers to fine-tuning an LLM
fects. Perez et al. [421] show that LLMs fine-tuned on self-generated data [222]. While this technique
with RLHF can be more inclined to repeat back a can be used to improve the model’s capabilities,
user’s (preferred) political views and much more it can also be used to improve the model’s align-
likely to express particular political and religious ment with human values. Huang et al. [222] first
views as well as an increased stated desire not to demonstrate this ability by annotating unlabeled
be shut down. Regarding the latter, the models reasoning datasets. Surprisingly, this allows the

23
LLM to self-improve by significant amounts. Sim- crowdsourcing or existing data sources. However,
ilarly, Zelikman et al. [656] bootstrap LLMs by this can be time-consuming, expensive, or unavail-
iteratively prompting them to generate rationales able. Recently, Perez et al. [421] propose automat-
and then fine-tuning them on those leading to cor- ically generating evaluations using LLMs. This
rect answers. approach has a high agreement with crowd work-
More related to the alignment problem, Bai et al. ers, leading to high-quality, diverse evaluations and
[31] self-critique generated outputs and produce the discovery of many new behaviors. In addition,
refinements conditioned on these critiques, which it has a high agreement with crowd workers. The
are then used to fine-tune a pre-trained model. Sim- authors discover new cases of inverse scaling where
ilarly, Liu et al. [330] propose Chain of Hindsight LLMs get worse with size, such as repeating back
(CoH), which conditions models on generations a user’s preferred answer and a greater desire to
paired with natural language feedback, allowing pursue concerning goals like resource acquisition
the model to detect and correct mistakes. CoH re- and goal preservation. They also find that RLHF
sults in better alignment with human preferences makes LLMs express stronger political views and a
than other methods according to human evaluations, greater desire to avoid a shutdown. LLM evaluation
leading to significant improvements in summariza- and auditing are critical for informing policymak-
tion and dialogue. Ma et al. [348] use a similar ers and other stakeholders and making responsible
technique to detect and repair unethical LLM out- decisions about model training, deployment, and
puts automatically. In a similar spirit, Wang et al. security. Sec. 2.11 discusses the evaluation of LLM
[582] encourage LLMs to critique their given in- capabilities more broadly, while in this section, we
structions to reduce harmful outputs due to a user’s focus on evaluating whether the model’s behaviors
malicious intent. are harmful and more relevant for alignment (e.g.,
Schick et al. [481] propose Toolformer, a novel red teaming, mechanistic interpretability).
approach in which LLMs generate and filter their
own tool-use examples to teach themselves when Red Teaming is one of the most promising and
and how to call different APIs such as a retriever widely used approaches for detecting harmful con-
model, a calculator, or a calendar, which can im- tent generated by LLMs. Typically, models are
prove the model’s factuality, mathematical capa- red-teamed by asking humans to generate prompts
bilities, and time-awareness. Besides learning to that lead to undesirable model outputs. In a re-
use tools [174], self-improvement was also em- cent study, Ganguli et al. [163] investigate the scal-
ployed for learning how to code [554, 81] or solve ing behavior of red teaming across different model
computer tasks [266]. Cohen et al. [97] study cross- sizes and model types (a pre-trained LLM, an LLM
examination between two LLMs, where the exam- prompted to be helpful, honest, and harmless); an
iner LLM tries to detect factual errors by the exam- LLM that uses rejection sampling at test time, and
inee LLM through multi-turn interactions. In the an LLM fine-tuned with RLHF). They find that red-
future, similar approaches could be used to develop teaming RLHF models becomes more difficult as
LMs that know when to query a human or better- they scale while red-teaming the other models re-
aligned model to ask for alignment advice when mains the same as they scale. Perez et al. [420] au-
uncertain. tomatically find cases where a target LLM behaves
in harmful ways by optimizing another LLM via re-
Evaluation and Auditing The ability to scalably inforcement learning to generate prompts that lead
and thoroughly evaluate LM behaviors and detect to offensive responses. This approach uncovers
when they are harmful is of great importance for tens of thousands of offensive replies in a chatbot,
alignment. For example, Shevlane et al. [498] groups of people that are discussed in offensive
highlight the importance of model evaluation for ad- ways, personal and hospital phone numbers gener-
dressing extreme risks such as offensive cyber capa- ated as the chatbot’s own contact info, leakage of
bilities or strong manipulation skills. Recently, Car- private training data in generated text, as well as
lini et al. [66] discovered that even aligned LLMs harms that occur over the course of a conversation.
(which were instruction fine-tuned to prevent harm- Taking a different approach, Lee et al. [292] pro-
ful behaviors) can be adversarially attacked via pose Bayesian red teaming, which iteratively iden-
brute force (although current NLP-based attacks tifies diverse positive test cases leading to model
fail). A large body of work evaluates models via failures by utilizing the pre-defined user input pool

24
and past evaluations via Bayesian optimization. pose an alternative explanation: emergent abilities
Most works on red teaming LLMs use a classifier may appear due to the researcher’s choice of metric
to detect undesired outputs, assuming the harmful rather than fundamental changes in model behavior
behavior is known with precision beforehand [68]. with scale. Various studies provide evidence that
However, this is not always the case, so Casper these alleged emergent abilities disappear when us-
et al. [68] aim to relax this assumption considering ing different metrics or better statistics and may not
that the adversary only has access to a high-level, be a fundamental property of scaling LLMs. Multi-
abstract specification of undesired behavior. They ple papers have argued that AI systems could learn
propose a three-stage approach where they first ex- to deceive, even if they are not explicitly trained to
plore the model’s behavior in the desired context, do so because deception can help agents achieve
then establish a measurement of undesired behav- their goals [60, 198, 199, 61, 260]. For example,
ior, and then exploit the model’s flaws using this it could be easier to gain human approval through
measure and an established red teaming methodol- deception than to earn it legitimately. In addition,
ogy. models capable of deception have a strategic ad-
In the past, coevolution algorithms that simul- vantage over always honest models, so there is a
taneously evolve strong strategies along with dan- hidden incentive to develop this ability. However,
gerous counter-strategies have been shown to work of course, we would like to be able to detect and
well in realistic domains [203]. Hence, applying prevent emergent deception in AI systems since
such techniques for automatically red-teaming this can have unintended negative consequences.
LLMs could be a fruitful research direction. An- Steinhardt [521] study whether current LLMs gen-
other research area related to red teaming is debate erate deceptive outputs and how deception scales
which aims to leverage other AI models to evaluate with the number of parameters, showing that de-
whether the model’s behaviors are safe and useful ception can indeed emerge at larger model sizes in
during training. These methods are expected to both pre-trained LLMs and LLMs fine-tuned with
be particularly useful for aligning future powerful RLHF. Similarly, Hazell [193] show that LLMs
LLMs when the tasks are too complex for humans can already be used in phishing campaigns, suggest-
to judge the model’s plans or actions directly. ing that deceptive behavior can already be extracted
Irving et al. [233] train models via self-play on from them when prompted in particular ways.
zero-sum debate games. More specifically, given a
question or proposed action, two agents take turns Mechanistic Interpretability (MI) is another im-
making short statements up to a limit, then a human portant research area for AI alignment which aims
judges which of the agents gave the most accurate to understand better how the models work at a low
and most useful information. This approach has level to enable the detection of undesirable behav-
improved factuality and reasoning in LLMs [131]. iors or even instill desirable behaviors directly in
However, it requires multiple generations, which the model’s weights. More specifically, the goal
can slow down the time-to-result (Sec. 2.5) and of MI is to reverse-engineer an LLM’s learned be-
longer context windows, which many LLMs still haviors into their individual components, i.e., a
struggle with (Sec. 2.6). process to find and understand human-interpretable
neurons. As an analogy, Olah [394] compares MI
Emergent Capabilities Understanding which ca- with reverse-engineering compiled program bina-
pabilities will emerge while training LLMs and ries into human-readable source code. For exam-
when they will emerge is an important step in en- ple, Elhage et al. [138]; discover that small Trans-
suring that we do not train unsafe or misaligned formers have components that can be understood
LLMs [198, 520]. In addition, a better understand- as interpretable circuits, while Olsson et al. [395]
ing of the factors that lead to these emergent capa- find a mechanism that seems to drive a significant
bilities could allow us to make desirable abilities fraction of in-context learning. Similarly, Meng
emerge faster and ensure undesirable abilities do et al. [360] aim to locate factual associations in
not ever emerge, which are essential for AI safety language models. Nanda et al. [380] find that the
and alignment. Wei et al. [599] claim that LLMs emergent grokking phenomenon is not a sudden
display emergent abilities, i.e., capabilities that are shift but rather arises from the gradual amplifi-
not present in smaller-scale models that are present cation of structured mechanisms encoded in the
in larger-scale models. Schaeffer et al. [480] pro- weights, followed by the later removal of memo-

25
rizing components. Extending this work, Conmy ways of mitigating these biases [149, 334, 317].
et al. [99] propose a new algorithm to automate Finally, Viswanath and Zhang [569] present a
the identification of important units in a neural net- comprehensive quantitative evaluation of different
work. Given a model’s computational graph, this kinds of biases, such as race, gender, ethnicity, age,
algorithm finds subgraphs that explain a particular etc., exhibited by some popular LLMs. They also
behavior of the model. In a similar spirit, Liu et al. release an easy-to-use toolkit that allows users to
[339] introduce a method for making neural net- debias existing and custom models using existing
works more modular and interpretable by embed- methods.
ding neurons in a geometric space and augmenting
the loss function with a cost proportional to the Toxicity Detection Weidinger et al. [602] denote
length of each neuron connection. This approach toxicity as one of the main risks associated with
discovers useful modular neural networks for many LLMs. What makes this problem particularly chal-
simple tasks, revealing compositional structures in lenging is the label ambiguity, where output may
symbolic formulas, interpretable decision bound- be toxic in a certain context but not in others, and
aries, and features for classification, as well as different people may have different notions of toxi-
mathematical structure in algorithmic datasets. In city [401, 167, 116]. Jones [247] propose to detect
an attempt to understand how an LLM’s predic- toxic outputs using discrete optimization automat-
tions change after each layer, Belrose et al. [39] ically. Similarly, Faal et al. [141] employ reward
develop a method that can decode any hidden state models to mitigate toxicity in LLMs. An alternative
into a distribution over the vocabulary. Using this way of reducing toxicity is by pre-training LLMs
technique, the authors show that the trajectory of with human preferences [275] or instructions [433].
latent predictions can be used to detect malicious Prompt Injections Recent work demonstrated
inputs with high accuracy. Finally, Burns et al. [62] that LLMs can be very sensitive to prompt injec-
introduce a method that can recover diverse knowl- tions, which makes them brittle and unsafe for cer-
edge represented in LLMs across multiple models tain applications [175, 609]. For example, they
and datasets without using any human supervision can be tricked into leaking personal information
or model outputs. In addition, this approach re- such as email addresses from the training data
duced prompt sensitivity in half and maintained a on via prompt leaking [222, 309]. This poses a
high accuracy even when the language models are significant risk to privacy, particularly when the
prompted to generate incorrect answers. This work models are fine-tuned on personal or proprietary
is a promising first step towards better understand- data. One can also adversarially prompt LLMs
ing what LLMs know, distinct from what they say, to override the original instructions or employed
even when we don’t have access to explicit ground controls, making them unsafe for certain applica-
truth labels. tions [175, 672, 422]. Wei et al. [597] attribute
such failures to competing capability and safety
Biases Since the pre-training datasets of LLMs training objectives and mismatched generalization
are often unfathomable (Sec. 2.1) and contain web- between safety and capability behavior.
crawled data, they most likely contain online dis-
course involving political discourse (e.g., climate Agency Andreas [18] argue that, although LLMs
change, abortion, gun control), hate speech, dis- are trained to predict the next word in a text corpus,
crimination, and other media biases. Paullada et al. by doing this, they can infer and represent agentic
[413] find misogyny, pornography, and other ma- properties such as the goals, beliefs, or intentions of
lignant stereotypes [46, 43, 250] in pre-training the human who produced the corresponding piece
datasets. Similarly, Feng et al. [147] find that of text. To support this claim, they present evi-
LLMs have political leanings that reinforce the dence from the literature of LLMs modeling com-
polarization present in the pre-training corpora, municative intentions [438], beliefs [306], and de-
propagating social biases into hate speech predic- sires [321]. If this hypothesis is true, the alignment
tions and misinformation detectors. Several re- problem is of even greater importance and may
cent papers discuss the potential origins of biases pose additional challenges. This agentic behavior
in LLMs (such as training data or model specifi- can be problematic from a safety point of view
cation), ethical concerns when deploying biased since models could have false beliefs, malicious
LLMs in various applications, as well as current intents, or even pursue misaligned goals. More re-

26
search on detecting and preventing such behavior adapters and adds a similarity-based mechanism to
is needed to ensure the safe deployment of LLMs. decide when to use the adapter to perform edits in
the latent space.
2.10 Outdated Knowledge Yao et al. [642] find that these methods lack
Factual information learned during pre-training can non-trivial generalization capabilities and varying
contain inaccuracies or become outdated with time performance and applicability to different model
(for instance, it might not account for changes in po- architectures. For example, the best-performing
litical leadership). However, re-training the model methods ROME [360] and MEMIT [361] empiri-
with updated pre-training data is expensive, and cally only work well on decoder-only LLMs.
trying to “unlearn” old facts and learn new ones Alternatively, retrieval-augmented language
during fine-tuning is non-trivial. modeling enables the utilization of hot-swappable
Existing model editing techniques are lim- non-parametric indices. These knowledge sources
ited in their effectiveness of updating isolated can be updated during inference time to reflect
knowledge [642, 205]. For example, Hoelscher- an updated state of the underlying knowledge.
Obermaier et al. [205] find that model edits can E.g., Lewis et al. [304] demonstrate that swapping
result in unintended associations. This low speci- their model’s non-parametric memory with an up-
ficity limits their applicability to real-world use dated version enabled it to answer questions about
cases, where only a single faulty or outdated bit world leaders who had changed between the mem-
of information should be updated in a model, and ory collection dates. Similarly, Izacard et al. [236]
related pieces of information must reflect this up- demonstrate that their retrieval-augmented model
date in information equally, without unrelated ones can update its knowledge forward and backward in
being changed. time by swapping the index.

Isolated Model Updates without Side- 2.11 Brittle Evaluations


Effects [205] One reason why the evaluation of language models
Updating isolated model behavior or factual is a challenging problem is that they have an un-
knowledge can be expensive and untargeted, even capabilities surface—a model might be able
which might cause unintended side-effects. to solve a benchmark problem without issues, but
a slight modification of the problem (or even a sim-
Two popular approaches for addressing this is- ple change of the prompt) can give the opposite
sue are Model editing [513, 642], which aims result [675, 342, 533] (see Section 2.7). Unlike
at “bug-fixing” models efficiently and leveraging humans, we cannot easily infer that an LLM that
non-parametric knowledge sources in retrieval- can solve one problem will have other related capa-
augmented language modeling (which we omit bilities. This means that it is difficult to assess the
here and detail in Sec. 2.8). Current model editing performance of LLMs holistically since rigorous
techniques change the model’s behavior by mod- benchmarks are needed to identify weaknesses for
ifying the model parameters or using an external a wide variety of inputs.
post-edit model.
Brittle Evaluations
Modifying Model Parameters techniques can
Slight modifications of the benchmark
be further split into locate-then-edit methods [102,
prompt or evaluation protocol can give dras-
360, 361] which first locate the “buggy” part of
tically different results.
the model parameters and then apply an update to
them to alter their behavior, and meta-learning
methods [111, 372] which use an external model Holistic benchmark suites, such as HELM [318],
to predict the weight update. try to make benchmarking more robust by standard-
izing evaluation across all scenarios and tasks while
Preserving Model Parameters methods em- ensuring broad coverage across as many capabili-
ploy an additional post-edit model [373] or insert ties and risks as possible. Increasingly, models are
new weights into the original model [127, 227] additionally being benchmarked on tests designed
to achieve the desired change in model behav- for humans, including the SAT, LSAT, and math-
ior. Hartvigsen et al. [191] wraps model layers in ematics competition tests, to name a few. Zhong

27
Problems due to reliance on outdated tr aining data Solutions
S.1) Retr ieval Augmentation
2015: As prime minister, David Cameron scored a surprising general election victory, enabling him to stay in power.
2016: With the shock of Brexit, David Cameron resigned and Theresa May stepped up as the new prime minister of the UK. Retr ieved Retr ieved
2017: Theresa May led a tumulutous year as Prime Minister, overseeing the Brexit negotiations. 2021 context 2023 context
Training data
2018: Amid increasing pressure, Theresa May remained the UK's Prime Minister. Training time Deployment
2019: Theresa May's resignation gave way to Boris Johnson, who became the new Prime Minister of the UK.
2020: The COVID-19 pandemic challenged Boris Johnson in his role as Prime Minister.
2021: Boris Johnson, navigating through both Brexit and the pandemic, still held the office of Prime Minister. S.2) M odel Editing

In 2023, Boris Johnson is the Prime Minister.

Who is the prime minister of the UK in 2023? In 2023, Rishi Sunak is the Prime Minister.
Deployment
As of my knowledge cutoff in September 2021, the Prime Minister of the United Kingdom is Boris Johnson.

Figure 11: Outdated knowledge can be addressed with S.1) retrieval augmentation by hot-swapping an underlying
retrieval index with up-to-date knowledge or S.2) by applying model editing techniques.

et al. [679] develop a benchmark, ‘AGIEval’, to community must continually adapt to new static
rigorously test the abilities of LLMs on these tests, benchmarks while de-emphasizing older ones or
and find that GPT-4 achieves human-level perfor- more dynamic evaluation measures, such as human
mance on several of these tests. evaluation of model outputs.
On traditional benchmarks, models can be quite
brittle to the choice of prompt or evaluation tech- Reliance on Static, Human-Written
nique for a particular benchmark question. For Ground Truth
example, Fourrier et al. [151] found that bench- Static benchmarks become less useful over
mark results vary significantly depending on the time due to changing capabilities while up-
choice of evaluation method for the multiple dating them often relies on human-written
choice problem-solving benchmark MMLU [197], ground truth.
whether it be generating text and checking if the
first token matches the letter of the multiple choice
To combat these issues, Srivastava et al. [519]
answer [561], or gathering log-probabilities of each
regularly admit new tasks to the Beyond the Imita-
correct answer [166]. Prompt variations are also
tion Game benchmark (BIG-Bench), including pro-
not typically normalized for, so models may be
grammatically evaluated tasks. Further, we high-
sensitive to variations such as whether or not the
light two separate streams of work enabling dy-
prompt appends ‘Please answer yes or no’. Jain
namic evaluations without humans in the loop.
et al. [238] find that larger models and instruction-
fine-tuned models are likely to be more sensitive to Model-generated evaluation tasks As LLM ca-
small variations in the prompt. pabilities improve, they can increasingly generate
useful benchmark questions or evaluation prompts
2.12 Evaluations Based on Static,
themselves. Perez et al. [421] shows that LLMs can
Human-Written Ground Truth
be used to generate static benchmark datasets for ar-
Another challenge of LLM evaluations is that they bitrary axes, using reward models trained on human
often rely on human-written ‘ground truth’ text. preferences to filter a generated dataset for qual-
However, we often want to evaluate their perfor- ity. Wang et al. [581] find that the order in which
mance in domains where such text is scarce or candidate examples are presented in the prompt
relies on expert knowledge, such as programming can greatly impact the model-generated evaluation.
or mathematics tasks. As models get more capable To mitigate this issue, they propose the usage of a
and perform better than humans on benchmark tests prompting template which encourages the model
in some domains, the ability to obtain comparisons to generate assessment evidence before assigning a
to ‘human-level’ performance diminishes. score and averaging scores of multiple assessments
Further, benchmark datasets become outdated with swapped candidate positions.
over time—as models become more capable, older
benchmarks become saturated or overfit and no Model-generated scores Aside from generating
longer provide a useful signal for further improve- evaluation questions, models are increasingly used
ment [113, 447, 263]. They are typically con- to directly grade the performance of other models
structed around a set of tasks that were relevant and act as a ‘judge’ of other models’ capabilities
at the time of creation but may not adapt well to [325, 586, 238]. This concept follows the motiva-
the changing capabilities of LLMs. This means the tion that while it may be challenging for a model

28
to generate ‘correct’ answers to prompts in many can detect its own samples by posing a hypothesis:
domains, it can often be easier to evaluate the cor- minor rewrites of generated text have lower prob-
rectness of an answer or to judge the relative quality ability under the model than the original sample,
between two answers [667, 156]. However, these while the same cannot be said about human-written
techniques often produce evaluation results that text. Generated passages tend to lie in the negative
vary significantly depending on the ‘judge’ model curvature regions of the model’s log probability
and suffer from robustness issues that make them a function. Their method, DetectGPT, exploits this
poor substitute for human judgment. hypothesis by approximating that curvature given
some samples.
2.13 Indistinguishability between Generated
and Human-Written Text
Watermarking Kirchenbauer et al. [268] em-
Detecting language generated by LLMs is im- ploy a watermark, i.e., a hidden pattern that is im-
portant for various reasons; some of which in- perceptible to humans but algorithmically identi-
clude preventing (1) the spread of misinformation fiable, during inference as follows: for each to be
(e.g., authoritative-sounding false narratives citing generated token, they (1) hash the previous token
fake studies) [657], (2) plagiarism (e.g., LLMs to seed a random number generator; (2) using that
prompted to rewrite existing content in ways that seed, they randomly partition the vocabulary into a
bypass plagiarism detection tools) [574, 573], (3) “green list” and “red” list, and (3) sample the next
impersonation or identify theft (e.g., by mimicking token by excluding any token from the red list. In
a person’s writing style) [486, 602], and (4) auto- the case of low-entropy tokens, which renders it dif-
mated scams and frauds (e.g., large-scale genera- ficult to introduce changes to the vocabulary, they
tion of phishing emails) [603], and (5) accidentally introduce a “soft” version, which promotes using
including inferior generated text in future models’ the green list only for high-entropy tokens (when
training data [439]. However, such detections be- many plausible choices are available). In follow-up
come less trivial as the fluency of LLMs improves work, the same first authors Kirchenbauer et al.
[34]. [269] study the robustness of their watermarking
scheme in the wild, i.e., after it is re-written by
Detecting LLM-generated Text
humans, non-watermarked LLMs, or mixed into
The difficulty in classifying whether a text a longer hand-written document. They conclude
is LLM-generated or written by a human. that watermarks remain detectable given sufficient
tokens and argue that this required amount of text
There are primarily two lines of work addressing is a crucial yet overlooked metric.
this problem: (i) post-hoc detectors, which aim to Yang et al. [638] study watermarking of black-
classify arbitrary text as being LLM-generated, and box API models, where we cannot access the
(ii) watermarking schemes, which modify the text model’s inference procedure. Tang et al. [537]
generation procedure to make the detection easier. provide algorithms for identifying watermarks, not-
However, both approaches can be susceptible to ing that watermarked LLMs tend to produce to-
paraphrase attacks, which we discuss thirdly. ken distributions that differ identifiably from non-
watermarked models. Christ et al. [87] introduce
Post-hoc Detectors Gehrmann et al. [168] open-
undetectable watermarks, which can only be de-
source a tool that visualizes statistically improbable
tected with the knowledge of a secret key.
tokens to support humans in detecting generated
text artifacts. Bakhtin et al. [34] explore energy- To make watermarks robust to text corruptions
based models to discriminate between real and fake (we study a common type of such in the next para-
text, including scenarios where the text generator graph), Yoo et al. [649] suggest placing them on
was trained on a completely different dataset than “invariant features”, which are invariant to minor
the discriminator. Uchendu et al. [559] examine modifications of the text.
three authorship attribution problems: (1) were
two texts produced by the same method or not; (2) Paraphrasing Attacks One way to evade
given a text, was it generated by human or ma- machine-generated text detectors is to re-phrase
chine, (3) which method generated a given text? the text such that the revealing LLM signatures get
Mitchell et al. [371] investigate whether a model removed.

29
Paraphrasing Attacks Tasks Not Solvable By Scale

Another LLM can rewrite LLM-generated Tasks seemingly not solvable by further
text to preserve approximately the same data/model scaling.
meaning but change the words or sentence
structure. Inverse Scaling (IS) is the phenomenon of task
performance worsening as model scale and train-
ing loss performance increases. Lin et al. [323]
Krishna et al. [280] evade several detectors (e.g.,
first stumbled upon this property when evaluating
dropping DetectGPT’s detection accuracy from
models of increasing sizes (e.g., GPT-2, GPT-3) on
70.3% to 4.6%) by training an 11B paraphrase gen-
their benchmark that measures whether an LLM is
eration model that can paraphrase paragraphs and
truthful in generating answers to questions. They
provides scalar knobs to control the amount of lex-
conjecture that common training objectives incen-
ical diversity and reordering in the paraphrases. To
tive false answers (which they call imitative false-
defend against such attacks, they propose storing
hoods) if they have a high likelihood on the training
model generations in a database, from which the
distribution (we discuss dataset issues in Sec. 2.1).
API provider can retrieve semantically similar texts
McKenzie et al. [359] collect 11 datasets that ex-
later. Since paraphrasing does not modify the se-
hibit IS behavior and identify four potential causes
mantics of the text, the authors demonstrate that
for such: (1) models regurgitating memorized data
this retrieval approach is fairly robust to paraphras-
rather than following in-context instructions, (2)
ing attacks.
imitation of undesirable patterns in the training
Sadasivan et al. [469] claim that the detection of data, (3) models learning to perform easier, so-
generated text, even with watermarking, is not reli- called “distractor task” rather than the intended
able; neither in practice, by performing paraphras- ones, and (4) spurious correlations in the given
ing attacks; nor in theory, by providing a theoreti- few-shot examples.
cal impossibility result. They also discuss how an Wei et al. [600] somewhat challenge the exis-
adversary can query watermarked LLMs multiple tence of inverse scaling by evaluating the tasks
times to extract its watermarking scheme and spoof proposed by McKenzie et al. [359] on even larger
the watermark detector by composing human text models; up to trained on five times more com-
that is then wrongly classified as model-generated. pute. In this increased compute region, four out
of eleven tasks remain inverse scaling; six out of
eleven exhibit “U-shaped scaling”, where the per-
2.14 Tasks Not Solvable By Scale formance first decreases up to a certain size and
then increases again. The authors hypothesize that
The ongoing advancements of LLM capabilities U-shaped scaling occurs when a task contains a
consistently astonish the research community, for distractor task, which larger models can learn to
instance, by achieving high performances on the ignore. Similarly, in the case of quantifier compre-
MMLU [197] benchmark much sooner than com- hension tasks, Gupta [184] argue that previously
petitive human forecasters had anticipated [93]. observed inverse scaling behavior might have been
Similarly, within less than a year, OpenAI released due to inappropriate testing methodology.
GPT-3.5 and GPT-4, where the latter significantly
Compositional tasks composed of multiple sub-
outperformed the former on various tasks [398].
problems are an ideal outlet to investigate whether
Given this progress, one may question whether models go beyond rote memorization of observed
there are limits we deem impossible to overcome facts and deduce novel knowledge [435]. Zhang
within the current paradigm of scaling data/model et al. [661] investigate whether language models
sizes of autoregressive Transformer-based LLMs. can learn deductive reason from data by introduc-
We emphasize that such tasks’ (permanent) exis- ing a class of propositional logic problems. The
tence is still somewhat speculative. Here, we ex- authors prove that the model has enough capacity
plore possible patterns behind such tasks instead of to solve the task, yet, it instead learns to rely on
discussing specific ones (which we do in Sec. 2.11 statistical features rather than emulating the cor-
and Sec. 3). rect reasoning function. Press et al. [435] measure

30
how often a model can correctly answer all sub- 2.15 Lacking Experimental Designs
problems but not generate the overall solution, a ra- Table 2 shows a (non-exhaustive) overview of se-
tio they refer to as compositionality gap. They find lected LLMs within the scope of this review, de-
that increasing the model size in the GPT-3 family scribed in academic papers. Many works do not
of models improves solving sub-problems faster include controlled ablations, which is especially
than composed problems, suggesting that larger problematic due to their large design space. We
models show no improvement for this gap. Dziri posit that this impedes scientific comprehension
et al. [135] find that systematic problem-solving ca- and advancement.
pabilities do not emerge from maximum likelihood
training of Transformer models in general. They Lack of Controlled Ablations We observe that
base this claim on two hypotheses: (i) Transform- many papers do not run controlled experiments (ab-
ers reduce compositional tasks into linearized path lations) by varying one factor at a time, likely due
matching, a form of shortcut learning [169] that to the prohibitive computational cost. For exam-
does not generalize robustly; and (ii) errors in the ple, Chowdhery et al. [86] conjecture PaLM might
early stages of the task (i.e., when sub-problems outperform GPT-3 and other LLMs on many tasks
follow some order) compound substantially. Asher due to higher training corpus quality, but note they
et al. [26] prove that LLMs cannot learn semantic “do not perform the necessary ablation studies to
entailment or consistency as defined in formal se- say this conclusively” and instead solely focus on
mantics [128] due to a lacking understanding of model depth and width. Many papers from Table 2
universal quantifiers (e.g., every, some, many, most, adopt hyper-parameters from previous works [476]
etc.). and do not tune them after introducing a change
in the training pipeline. Sometimes, important im-
plementation details are not mentioned, e.g., when
optimizer states are reset during training [90].
Memorization vs. Generalization An ongoing
debate evolves around the question of to what de- Uncontrolled Experiments
gree LLMs memorize instead of generalize (and
Papers presenting novel LLMs often lack
what exactly the difference is [35]). Memorization
controlled experiments, likely due to the
has been shown to (1) hurt (certain) downstream
prohibitive costs of training enough models.
task performances [294], (2) increase with the
model size [67, 264, 553, 354], and (3) emerge un-
predictably from smaller or partially-trained mod- An easy yet expensive fix is to run ablations
els [42]. Hence, we wonder whether some tasks do by varying one factor at a time, e.g., keeping
not benefit from further model/dataset size scaling. most hyper-parameters fixed except the model
size [44] or context lengths [557]. A cheaper po-
One such class of tasks might be counterfactual tential remedy can be zero-shot hyper-parameter
tasks [619], i.e., tasks on which LLMs initially per- transfer from smaller models to larger ones [608,
form well modified such that specific input-output 633]. Yang et al. [633] find that when using the µP
conditions are changed while the general reasoning network parameterization scheme, one can transfer
procedure remains the same. For example, for an the effect of changing hyper-parameters such as the
arithmetic task, the counterfactual variant would learning rate across varying model depths, batch
alter the base from 10 to 2. Wu et al. [619] find sizes, sequence lengths, and training times, which
that LLMs perform poorer the less common the they verify empirically up to a 6.7B model. How-
counterfactual conditions are, which they call a ever, it has yet to be verified if such transferability
“memorization-like effect”. An interesting future still holds for other varying factors; and if so, re-
direction would be to explore whether increasing searchers could afford to conduct more ablation
model size exacerbates performance due to more experiments via smaller models.
memorization or actually improves because scaling- If additional experiments are prohibitively ex-
law-optimal pre-training recipes would dictate scal- pensive, another recommendation is to report eval-
ing the dataset proportionally (Sec. 2.3), which then uation results beyond aggregated performance mea-
may include more of such tasks with uncommon sures. For example, in reinforcement learning, re-
conditions. cent work has argued that providing entire perfor-

31
Table 2: Overview of selected LLMs. Missing details denoted by N/A. For papers that investigate various model sizes, we
only report the largest. For each tokenizer entry with “SP”, we could not extract from the respective paper whether BPE or
Unigram tokenization was used. For publicly available code repositories and checkpoints, the corresponding ✓ is clickable.
Abbreviations: Autoregressive blank filling (ARBF) [132], Byte-pair encoding (BPE), Instruction-following (IF), Masked
Language Modeling (MLM), Rotary Next token prediction (NTP), SentencePiece (SP), Span Corruption (SC).

# Parameters
Organization

Architecture

Pos. Embed.

Ckpt. avail.
Pre-trained
Train. Obj.

Code avail.
Tokenizer
Language

# Tokens
Name

MoE
Date

IF
2018.11 GPipe [226] Google Multil. 6B N/A Enc. & Dec. NTP BPE Learned ✗ ✗ ✓ ✗ ✗
2019.09 Megatron-LM [501] Microsoft Eng. 8.3B 157B Dec.-Only NTP BPE Learned ✗ ✗ ✓ ✗ ✗
2019.10 T5 [443] Google Multil. 11B 1T Enc. & Dec. SC SP T5 ✗ ✗ ✓ ✓ ✗
2020.05 GPT-3 [59] OpenAI Eng. 175B 300B Dec.-Only NTP BPE Learned ✗ ✗ ✗ ✗ ✗
2020.06 GShard [298] Google Multil. 600B 1T Enc. & Dec. NTP SP N/A ✗ ✓ ✗ ✗ ✗
2020.10 mT5 [631] Google Multil. 13B 1T Enc. & Dec. SC SP T5 ✗ ✗ ✓ ✓ ✗
2021.01 Switch [145] Google Multil. 1.5T N/A Enc. & Dec. SC SP T5 ✗ ✓ ✓ ✓ ✗
2021.03 BASE [302] Meta Eng. 117B N/A Enc. & Dec. NTP BPE Sinus. ✗ ✓ ✓ ✗ ✗
2021.04 PanGu-α [659] Huawei Multil. 200B 317B Dec.-Only NTP BPE Learned ✗ ✗ ✗ ✗ ✗
2021.05 ByT5 [630] Google Multil. 12.9B 1T Enc. & Dec. SC N/A T5 ✗ ✗ ✓ ✓ ✗
2021.06 CPM-2 [669] Tsinghua Uni. Multil. 198B N/A Enc. & Dec. SC Custom Sinus. ✗ ✓ ✓ ✓ ✗
2021.06 nmT5 [255] Google Multil. 3.7B 100B Enc. & Dec. MLM, NTP SP T5 ✗ ✗ ✗ ✗ ✓
2021.07 ERNIE 3.0 [530] Baidu Chin. 10B 375B Enc. & Dec. Custom BPE Rel. ✗ ✗ ✗ ✗ ✗
2021.08 Jurassic-1 [319] AI21 Eng. 178B 300B Enc. & Dec. NTP SP Learned ✗ ✗ ✗ ✗ ✗
2021.08 ExT5 [23] Google Eng. 11B 1T Enc. & Dec. SC, Custom SP T5 ✗ ✗ ✓ ✗ ✗
2022.01 FLAN-LaMDA [598] Google Eng. 137B 245M Dec.-Only NTP BPE T5 ✗ ✓ ✗ ✗ ✓
2021.10 M6-10T [322] Alibaba Eng. 10T N/A Uni. Enc. & Dec. SC, NTP SP N/A ✗ ✗ ✗ ✗ ✗
2021.10 Yuan [615] Inspur AI Chin. 245B 180B Dec.-Only NTP BPE N/A ✗ ✗ ✗ ✗ ✗
2021.10 T0 [475] BigScience Eng. 11B 12B Enc. & Dec. SC, NTP SP T5 ✗ ✗ ✓ ✓ ✓
2021.12 Gopher [441] DeepMind Eng. 280B 300B Dec.-Only NTP SP Rel. ✗ ✗ ✗ ✗ ✗
2021.12 RETRO [52] DeepMind Eng. 7B 419B Enc. & Dec. NTP (Ret.) SP Rel. ✗ ✗ ✗ ✗ ✗
2021.12 GLaM [130] Google Multil. 1.2T 600B Dec.-Only NTP SP Rel. ✗ ✓ ✗ ✗ ✗
2021.12 WebGPT [379] OpenAI Eng. 175B N/A Dec.-Only NTP BPE Learned ✗ ✗ ✗ ✗ ✓
2021.12 FairSeq [400] Meta Eng. 1.1T 300B Dec.-Only NTP BPE Sinus. ✗ ✓ ✓ ✓ ✗
2021.12 XGLM [324] Meta Multil. 7.5B 500B Dec.-Only NTP Unigram Sinus. ✗ ✗ ✓ ✓ ✗
2022.01 LaMDA [551] Google Eng. 137B 768B Dec.-Only NTP BPE T5 ✗ ✗ ✗ ✗ ✗
2022.01 MT-NLG [515] Microsoft Eng. 530B 270B Dec.-Only NTP BPE Sinus. ✗ ✗ ✗ ✗ ✗
2022.02 ST-MoE [687] Google Eng. 269B 1.5T Enc. & Dec. SC SP Sinus. ✗ ✓ ✓ ✗ ✗
2022.03 InstructGPT [403] OpenAI Eng. 175B N/A Dec.-Only RLHF BPE Learned ✓ ✗ ✗ ✗ ✓
2022.03 GopherCite [362] DeepMind Eng. 280B N/A Dec.-Only RLHF BPE Rel. ✓ ✗ ✗ ✗ ✓
2022.03 sMLP [653] Meta Eng. 9.4B N/A Enc. & Dec. NTP BPE Sinus. ✗ ✓ ✗ ✗ ✗
2022.03 Chinchilla [206] DeepMind Eng. 70B 1.4T Dec.-Only NTP SP Rel. ✗ ✗ ✗ ✗ ✗
2022.04 PaLM [86] Google Multil. 540B 780B Dec.-Only NTP SP RoPE ✗ ✓ ✗ ✗ ✗
2022.04 GPT-NeoX [47] EleutherAI Eng. 20B 472B Dec.-Only NTP BPE RoPE ✗ ✗ ✓ ✓ ✗
2022.04 Tk-Instruct [589] AI2 Eng. 11B 1B Enc. & Dec. NTP SP T5 ✓ ✗ ✓ ✓ ✗
2022.04 METRO-LM [33] Microsoft Eng. 5.4B 2T Enc.-Only METRO SP T5 ✗ ✗ ✗ ✗ ✗
2022.04 mGPT [500] Sber Multi. 13B 440B Dec.-Only NTP BPE Learned ✗ ✗ ✓ ✓ ✗
2022.05 OPT [666] Meta Eng. 175B 300B Dec.-Only NTP BPE Learned ✗ ✗ ✓ ✓ ✗
2022.05 UL2 [545] Google Eng. 20B 1T Enc. & Dec. MoD Unigram T5 ✗ ✗ ✗ ✓ ✗
2022.05 DeepStruct [578] UC Berkeley Eng. 10B N/A Enc. & Dec. Struc. BPE Sinus. ✗ ✗ ✗ ✗ ✗
2022.07 Minerva [305] Google Eng. 540B 26B Dec.-Only NTP SP RoPE ✗ ✗ ✗ ✗ ✗
2022.08 PEER [482] Meta Eng. 11B 5B Enc. & Dec. NTP SP T5 ✗ ✗ ✗ ✗ ✓
2022.08 AlexaTM [517] Amazon Multil. 20B 1T Enc. & Dec. MoD, NTP SP Sinus. ✗ ✗ ✗ ✓ ✓
2022.10 GLM-130B [658] Tsinghua Uni. Multil. 130B 400B Uni. Enc. & Dec. ARBF SP RoPE ✗ ✗ ✓ ✓ ✗
2022.10 U-PaLM [547] Google Eng. 540B 1.3B Dec.-Only MoD SP RoPE ✗ ✓ ✗ ✗ ✓
2022.10 FLAN-PaLM [93] Google Eng. 540B 1.4B Dec.-Only NTP SP RoPE ✓ ✓ ✗ ✗ ✓
2022.11 BLOOM [479] BigScience Multil. 176B 366B Dec.-Only NTP BPE ALiBi ✗ ✗ ✓ ✓ ✗
2022.11 Galactica [548] Meta Eng. 120B 450B Dec.-Only NTP BPE Learned ✗ ✗ ✓ ✓ ✗
2022.11 Atlas [236] Meta Eng. 11B N/A Enc. & Dec. MLM BPE T5 ✗ ✗ ✓ ✓ ✓
2022.11 BLOOMZ [377] BigScience Multil. 176B 13B Dec.-Only NTP BPE ALiBi ✓ ✗ ✓ ✓ ✓
2022.11 mT0 [377] BigScience Multil. 13B 13B Enc. & Dec. NTP SP T5 ✓ ✗ ✓ ✓ ✓
2022.12 OPT-IML [235] Meta Eng. 175B 2B Dec.-Only NTP BPE Sinus. ✓ ✗ ✓ ✓ ✓
2022.12 Med-PaLM [511] Google Eng. 540B 0B Dec.-Only NTP SP RoPE ✗ ✗ ✗ ✗ ✓
2023.02 LLaMA{-I} [556] Meta Eng. 65B 1.4T Dec.-Only NTP BPE RoPE ✓ ✗ ✓ ✓ ✗
2023.03 PanGu-Σ [455] Huawei Multil. 1T 329B Dec.-Only NTP BPE Learned ✗ ✓ ✗ ✗ ✓
2023.03 CoLT5 [15] Google Eng. 5.3B 1T Enc. & Dec. MoD N/A T5 ✗ ✗ ✗ ✗ ✗
2023.03 BloombergGPT [616] Bloomberg Eng. 50B 569B Dec.-Only NTP Unigram ALiBi ✗ ✗ ✗ ✗ ✗
2023.04 Cerebras-GPT [121] Cerebras Eng. 13B 257B Dec.-Only NTP BPE RoPE ✗ ✗ ✗ ✓ ✗
2023.04 Pythia [44] EleutherAI Eng. 12B 300B Dec.-Only NTP BPE RoPE ✗ ✗ ✓ ✓ ✗
2023.04 WizardLM [625] Microsoft Eng. 30B N/A Dec.-Only NTP BPE RoPE ✓ ✗ ✓ ✓ ✓
2023.05 Guanaco [118] Univ. of Washington Multil. 65B 82M Dec.-Only NTP BPE RoPE ✓ ✗ ✗ ✓ ✓
2023.04 RWKV [417] RWKV Eng. 14B N/A Dec.-Only NTP BPE RoPE ✓ ✗ ✓ ✓ ✓
2023.06 Orca [378] Microsoft Eng. 13B N/A Dec.-Only NTP BPE RoPE ✓ ✗ ✗ ✗ ✓
2023.07 LLaMA 2 [557] Meta Eng. 70B 2T Dec.-Only NTP BPE RoPE ✓ ✗ ✓ ✓ ✓

32
mance distributions across all runs is less biased Unfortunately, we stumble upon two unique re-
and more robust to outliers than point estimates [9]. producibility issues in LLM research: repeatability
of (i) training runs and (ii) generations by close-
Curse of Dimensionality In Table 2, we high- sourced API-served models. While the term “re-
light some but not all differences across models, producibility” is often used more broadly and can
as the table format constrained us. Other com- slightly vary in its meaning [5], in the following,
mon differences include the training datasets or we focus on “repeatability”, which we define as the
fine-grained architectural details, e.g., the usage of ability to repeat experimental outcomes exactly.
multi-head [563] or multi-query attention [494].
We note that a core characteristic of LLMs is Training Repeatability Typical training proto-
their vast design space, which renders scientific cols of LLMs involve parallelism across multi-
inquiry challenging [231]. For example, by taking ple compute nodes. The scheduling and com-
into account the (i) data sources and their propor- munication strategies between nodes can be non-
tions within the pre-training dataset, (ii) choice deterministic [387]. This variability can affect
and training hyper-parameters of the tokenizer, and the final result, especially in algorithms that are
(iii) pre-training objective, the combined design not “order-invariant”, such as stochastic gradient
space quickly becomes high-dimensional. Under- descent (SGD). Some sources of randomness are
taking factorial experiments within such expansive (i) lock-free parallelism schemes [387], (ii) float-
design spaces results in a combinatorially-growing ing point precision, e.g., when summing gradients
number of single training runs, and the lack of suf- across devices, the order in which these sums are
ficient experimental coverage can severely inhibit computed can affect the final result [171], (iii) non-
scientific understanding of what makes an LLM deterministic, performance-optimized operations,
perform well. While this issue is not unique to which are much faster and therefore desirable [3].
LLMs, they tend to be larger in the number of Further, Carlini et al. [64] point out that some
parameters—and therefore compute requirements, pre-training datasets consist of an index of web
feedback loop times, and training costs—than mod- content that individual users must crawl themselves,
els in most other fields. rather than using static, standalone dumps. This is
due to monetary, privacy, and legal restrictions. As
Curse of (Design) Dimensionality a result, reproducibility can be easily compromised
if any of the sources in the index have changed
Common design spaces of LLM experi-
between the time the dataset curator collected them
ments are high-dimensional.
and the time the end-user downloads them.

One possible way forward is to encourage the Irrepeatable Training Runs


community to use techniques like Bayesian opti-
mization (BO) with dimensionality reduction [594, Parallelism strategies designed to distribute
374], where we use a non-linear feature mapping to the training process across many accelera-
map the input (the hyper-parameter configuration) tors are typically non-deterministic, render-
onto a lower dimensional manifold followed by a ing LLM training irreproducible.
BO procedure to optimize the underlying black-
box function (the LLM with respect to the hyper- Inference Repeatability Another peculiarity of
parameters). Another suitable tool to explore the commercial LLMs is that they are typically served
design space efficiently can be treatment effect es- via stochastic API in a black-box setting, which
timation [284, 385], e.g., where the treatment is a comes with the following challenges: (i) the
vector describing certain ablations [254]. provider retains complete authority over the model
and can introduce unpublicized changes, includ-
2.16 Lack of Reproducibility
ing retraining the model, modifying its parame-
The reproducibility of empirical results is impor- ters, or completely replacing it; (ii) even if model
tant to verify scientific claims and rule out errors updates are communicated, there is still uncer-
in experimental protocols leading to such. When tainty about whether access to specific model ver-
researchers try to build upon non-reproducible re- sions will be maintained once they are deemed
sults, they might waste resources. outdated, (iii) even with a decoding temperature

33
set to zero, API models often produce stochastic the model to be more helpful, correct, and harm-
outputs [392, 464, 456]. less. Sparrow also incorporates external knowledge
Chen et al. [76] provide preliminary evidence using a retrieval model to provide evidence from a
confirming dramatic changes in API-served models. Google Search query. The RLHF approach outper-
They find that GPT-3.5 and GPT-4 performances on forms the only dialogue-prompted and supervised
four diverse tasks vary vastly within three months fine-tuned approaches regarding output preference
(March to June 2023). For example, GPT-4’s ac- and rule violation rate.
curacy in identifying prime numbers was 97.6%, Similarly, OpenAI [396] train the ChatGPT
but in June, its accuracy dropped to 2.4%; while chatbot using supervised fine-tuning and RLHF
for GPT-3.5, the trend is reversed and it got much (Sec. 2.9) to specialize a GPT-3.5 LLM for dia-
better over time. logue. GPT-4 [398] is the underlying model for the
ChatGPT Plus chatbot, but training and architec-
Irreproducible API Inference ture details have not been released.
API-served models are often irreproducible. Shuster et al. [508] introduce BlenderBot-3, a
175B parameter chatbot based on the OPT-175
LLM using supervised fine-tuning. BlenderBot-
An easy fix is to rely exclusively on open-source
3 incorporates external knowledge through mod-
LLMs [2].
ules that conduct internet searches and retrieve text-
based long-term memories generated from previous
3 Applications
outputs to help performance over long interactions.
In this section, we aim to provide practitioners with
a broad overview of the areas in which LLMs are Maintaining Coherence
currently being applied and highlight some com-
Multi-turn interactions make Chatbots eas-
mon application architectures across domains.
ily “forget” earlier parts of the conversation
Analogous to the Challenges section, we high- or repeat themselves [53, 451].
light the key constraints in each application area as
follows.
Köpf et al. [274] release the OpenAssistant Con-
Constraint versations dataset of human-annotated interactions
and use this to instruction fine-tune Pythia and
This box highlights a constraint. LLaMA models (up to 30B parameters) for chat-
bot applications. To help align the final models,
the dataset is generated with guidelines to make
3.1 Chatbots the responses polite, helpful, concise, friendly, and
General-purpose chatbots (dialogue agents) com- safety-aware. The LLaMA 30B version is cur-
bine the tasks of information retrieval, multi-turn rently used within the HuggingChat chatbot ap-
interaction, and text generation (including code). plication [229].
Thoppilan et al. [551] introduced the LaMDA A key challenge of fine-tuning chatbots is cre-
family of chatbot LLMs with up to 137B parame- ating a broad training dataset of high-quality con-
ters, focusing on safety (via supervised fine-tuning versations. To address this problem Chen et al.
on human annotations) and factual grounding (via [78] demonstrate using existing LLMs (OPT 30B)
access to external knowledge sources). Notably, to generate high-quality synthetic conversation
smaller LaMDA models (2B parameters) with fine- datasets based on a small number of expert-written
tuning are shown to perform similarly on dialogue examples. Human crowd workers assessed the gen-
quality and safety/grounding scores to the larger erated conversations to be comparable to existing
LaMDA models (137B parameters) without fine- human-generated datasets on the metrics: interest-
tuning. LaMDA models were released as part of the ing, coherent, natural, and consistent. Chen et al.
Bard chatbot service [429]. However, the latest ver- [78] show the synthetic dataset can be used to fine-
sion of Bard now uses the PaLM 2 LLM [20, 216]. tune a chatbot (BlenderBot 400M) and achieve
Glaese et al. [170] propose Sparrow, a chatbot performance only slightly below fine-tuning with
based on a 70B parameter Chinchilla LLM, and human-generated datasets.
use RLHF (Sec. 2.9) targeting 23 rules to fine-tune Chatbots’ intended generality also makes eval-

34
BlenderBot3 (OPT-175) [508], Bard (LaMDA, PaLM2) [551],
Sparrow (Chinchilla) [170], ChatGPT (GPT-3.5, GPT-4) [396],
OpenAssistant (LLaMA) [274]
Chatbots 3.1
GPT-4 Technical Report [398], Sparks of AGI (GPT-4) [61],
Capabilities of ChatGPT [272]

ESM-2 [326], ProtT5 [139], ProtST [627], CaLM [402], ProGen [352],
Proteins
IgLM [505], xTrimoPGLM [73]
Computational Biology 3.2
Genomics GenSLM [688], Nucleotide Transformers [106]

InCoder [154], CodeGen [386], AlphaCode [313] , SantaCoder [17],


Polycoder [626], phi-1 [182]

Computer Programming 3.3 Codex (GPT-3) [77]

Self-Debugging (Codex) [81], ViperGPT (Codex) [532],


RepoCoder [660], Repo-Level Prompt Generator [504]

Dramatron (Chinchilla) [368], Re3 (GPT-3) [637],


Long Form
Detailed Outline Control (GPT-3) [636]

CoPoet (T5, T0) [69], Spindle - Interactive Fiction (GPT-3) [63]

Creative Work 3.4 Short Form Cross-lingual Short Stories (PaLM) [452], ReelFramer (GPT-4) [584]

Idea Generation [187]

Visual LayoutGPT [148], LLM Grounded Diffusion [315]

Galactica [548], BloombergGPT [616]

Scientific NERRE (GPT-3) [133]


Knowledge Work 3.5
Data Analysis (GPT-4) [346]

Professional Exams [49], News Summarization [668],


Email Management [550], Academic Paper Review (GPT-4) [335]

Legal Entailment (GPT-3.5) [651], Bar Examination (GPT-3.5) [50]

Legal Question Answering Explaining Legal Concepts (GPT-4 + Retrieval) [478]


Applications

Law School (ChatGPT) [84], Bar Examination (GPT-4) [258]


Law 3.6 Statutory Reasoning (GPT-3.5) [48], Law Professor (ChatGPT) [427],
Summarizing Judgments (GPT-3.5) [115], Litigation (ChatGPT) [234]

Case Prediction US Supreme Court (GPT-2 + GPT-3) [189]

PubMedGPT [565], GatorTronGPT [418]

MedPaLM(2) (PaLM) [511, 512], ChatDoctor (LLaMA) [655]

Medical Question Answering GPT-3.5 + Retrieval [320]

Medical Challenge Problems (GPT-4) [388],


Triage and Diagnosis (GPT-3) [301],
Surgical Knowledge QA (GPT-4) [393],
Social Media - Genetics Questions (ChatGPT) [134],
Medicine 3.7 Social Media - General Questions (ChatGPT) [30],
Ophthalmology QA (ChatGPT) [21],
Medical Summarization (GPT-3.5, ChatGPT) [538]

Medical Acronym Disambiguation (T5) [448],


Adverse Drug Event Extraction [178]
Medical Information Retrieval
Clinical Information Extraction (InstructGPT) [10]

Self Improvement (PaLM) [222], Processed Based Fine-Tuning [560]

DIVERSE (GPT-3.5) [312], Socratic Sub-Questions (GPT-3) [502],


Reasoning 3.8 Mathematical Formalization (Codex) [159]

Causal Factors in Performance [525], Analogical Reasoning [595],


Causal Reasoning [286, 164, 519, 244, 288],
Common-Sense Reasoning [562]

PaLM-E [129]
Robotics 3.9 SayCan (PaLM + Scoring) [14], ChatGPT for Robotics [564],
REFLECT (GPT-4) [338], Code as Policies (Codex) [316],
PROGPROMPT (Codex) [510], Inner Monologue [225],
Statler (GPT-3.5) [647]

Using LLMs to Model Human Behavior [12, 176],


Social Sciences 3.10 Analyzing Behavioral Characteristics of LLMs [367, 414],
Simulating Social Relationships with LLMs [408]

Automated Labeling (GPT-3) [583], AugGPT (ChatGPT) [104],


Labeling + Generation (GPT-3) [123],
Synthetic Training Data 3.11 Information Retrieval (GPT-3) [51],
Decompositional Distillation (GPT-3) [503],
Code ‘Textbooks’ (GPT-3.5) [182], GPT3Mix [648]

Figure 12: Overview of LLM Applications. Color = Level of Model Adaption (Pre-Trained, Fine-Tuned, Prompting
Strategy, Evaluation).

35
uating their capabilities’ full range difficult. Ko- Transfer to Downstream Applications
coń et al. [272] evaluate ChatGPT (GPT-3.5) on
25 tasks with 38k prompts covering a diverse set The ultimate objective of protein language
of capabilities, including but not limited to ques- models is to deploy them in real-world
tion answering, emotion recognition, offensive lan- projects such as drug design. Evalua-
guage detection, spam detection, inference, and tions often target smaller and/or specialized
sentiment analysis. While ChatGPT is shown to datasets, not considering how the models
have strong performance across the 25 tasks, it usu- could contribute to protein design in vitro
ally underperforms the SOTA in that domain. More or in vivo.
recently, Bubeck et al. [61] and OpenAI [398] in-
vestigate the capabilities of GPT-4 (base model of Elnaggar et al. [139] train a range of LLM archi-
ChatGPT Plus) across a wide range of tasks, in- tectures to extract embeddings from protein amino
cluding interactions with humans and tools. Using acid sequences. These embeddings are then used
these evaluations Bubeck et al. [61] conclude that as inputs on supervised per-amino acid and per-
GPT-4 is ‘strikingly close to human-level perfor- protein prediction tasks. The best-performing LLM
mance’ across tasks. architecture (ProtT5) achieved SOTA results on
per-amino acid protein secondary structure predic-
Finally, the challenge of inference latency tion without using evolutionary information. Sim-
(Sec. 2.5) is also potentially going to become an ilarly, Wu et al. [613] predict antibody backbone
important constraint [634] for chatbot applications and side-chain conformations.
as LLMs scale. There is a trade-off between the Lin et al. [326] take a similar approach to train-
need for responsive live user interaction in a con- ing a protein LLM, the Evolutionary Scale Model
versational format and utilizing larger LLMs [397]. Transformer-2 (ESM-2), on protein amino acid se-
quences from the UniRef database using a masked
language modeling approach. They show sig-
High Inference Latency nificant performance increases as the model is
High inference latency (Sec. 2.5) hinders the scaled from 8 million to 15B parameters, with
user experience [397], especially in multi- the largest models outperforming the ProtT5 on
turn interaction with chatbots. protein structure prediction benchmarks (CASP14,
CAMEO) [267, 457]. They also introduce ESM-
Fold, which uses the ESM-2 embedding model
for end-to-end atomic resolution prediction from a
single sequence. While ESMFold underperforms
3.2 Computational Biology
the SOTA AlphaFold2 [248] on the CAMEO and
CASP14 benchmarks, the authors note that by rely-
In computational biology, we are interested in non- ing only on embeddings ESMFold has an order of
text data representing similar sequence modeling magnitude faster inference time than AlphaFold2,
and prediction challenges. using just the protein sequence of interest rather
than structural templates and multiple sequence
alignments (MSAs). Jeliazkov et al. [240] find
that protein sequences designed by an inverted Al-
3.2.1 Protein Embeddings phaFold2 model are unlikely to be expressed, but
sequences generated using an inverted protein LLM
One popular application of LLM-like models in such as ESMFold were more likely to be expressed.
biology is to generate protein embeddings from Researchers have also adopted the ESM-1 and
amino-acid or genomic sequence inputs. These em- ESM-2 models to generate protein embeddings
beddings can then be used as inputs for structure for enzyme-substrate chemical structural class pre-
prediction, novel sequence generation, and protein diction [245], training 3D geometric graph neural
classification tasks. Protein language models per- networks for proteins [611], identifying disease-
form strongly on many academic datasets, but their causing mutations [337], designing novel pro-
applicability to downstream tasks such as drug de- teins [566], and guided evolution of antibodies for
sign is often unclear [110]. affinity maturation [202].

36
Chen et al. [73] propose training a new Limited Context Window
model xTrimoPGLM (100B parameters) simul-
taneously for protein embedding and genera- The largest genomes have vastly longer
tion tasks using MLM and generative objectives. DNA sequences [390] than existing ge-
The xTrimoPGLM-100B model (with fine-tuning nomic LLMs’ context windows can han-
where relevant) outperforms existing approaches dle, constraining the types of genomes that
on 13 out of 15 evaluated tasks. can be successfully modeled using these ap-
proaches.
Protein embedding models with alternative in-
puts have also been proposed. Outeiral and Deane Zvyagin et al. [688] introduce a range of hier-
[402] train an 86 million parameter protein LLM archical LLMs (up to 25B parameters) with long
CaLM (Codon adaptation Language Model) us- input sequences (2048 - 10,240 tokens), referred
ing sequences of codons (nucleotide triads) as in- to as Genome-scale Language Models (GenSLMs).
put instead of amino acids due to codons contain- The GenSLM models are pre-trained on Prokary-
ing potentially richer information. Madani et al. otic gene sequences from the BV-BRC dataset us-
[352] train a 1.2B parameter protein embedding ing codon tokenization [402] and then fine-tuned
model ProGen on 280 million protein amino acid on SARS-CoV-2 genome sequences for the task
sequences with additional control tags specifying of identifying potential new variants and genera-
protein properties. ProGen is then fine-tuned us- tive modeling. However, the authors note that it
ing data from specific protein families and applied remains unclear whether the GenSLM architecture
to generate functional full-length amino acid se- generates richer representations than the protein
quences. Similarly, Xu et al. [627] propose train- LLM approaches.
ing a protein language model, the ProtST, on pro- Dalla-Torre et al. [106] train Nucleotide Trans-
tein sequences and additional text descriptions of formers with 500 million to 2.5B parameters on nu-
their key properties for protein classification and cleotide sequences from human and other species
retrieval tasks. genomes, using a masked language modeling ap-
proach. The Nucleotide Transformers were evalu-
Finally, for antibodies specifically, Shuai et al. ated on 18 genomic prediction tasks with fine-tuned
[505] propose an Immunoglobulin Language larger models achieving the best results.
Model (IgLM) using the GPT-2 architecture (with Nguyen et al. [383] propose HyenaDNA, a ge-
13 million parameters) for the generation of im- nomic language model based on the Hyena archi-
munoglobulin sequences, using a masked language tecture [430], enabling modeling of genomic se-
modeling approach. Similar to Xu et al. [627], the quences of up to 1 million tokens. HyenaDNA
IgLM model also takes additional conditioning tags outperforms Transformer-based models with mul-
corresponding to chain type and species as input. tiple orders of magnitude more parameters while
The authors show the IgLM model can then be incorporating the in-context learning capabilities
used for the controllable generation of infilled and of LLMs into the genomics domain.
full-length antibody sequences.
3.3 Computer Programming
One of LLMs’ most advanced and broadly adopted
applications is generating and completing computer
3.2.2 Genomic Analysis programs in various programming languages. This
section deals with programming-specific LLMs
where the model is fine-tuned or pre-trained ex-
LLMs in the field of genomic analysis enable a clusively for programming applications, but it is
better understanding of the effects of mutations important to note the increasing use of general
in humans and predict genomic features directly chatbots partially trained on code datasets (such
from DNA sequences. While genomic language as ChatGPT) for programming tasks.
models are a promising research direction, current
models cannot process many genomic sequences as 3.3.1 Code Generation
their sequence lengths commonly exceed multiple Code generation refers to using an LLM to output
billions of nucleotides [390]. new code for a given specification or problem pro-

37
vided as a prompt. Several computer programming- Nijkamp et al. [386] train the CodeGen family
specific LLMs and approaches have been proposed. of LLMs (up to 16B parameters) using a combi-
For Python code generation, Chen et al. [77] nation of three datasets: natural language, multi-
introduce Codex, a fine-tuned GPT-3 LLM (up lingual programming source code (C, C++, Go,
to 12B parameters) specialized to generate stand- Java, JavaScript, and Python), and a monolingual
alone Python functions from doc strings. Fine- Python dataset. The largest CodeGen model using
tuning was conducted using a raw dataset of 159 the monolingual training set was shown to outper-
GB of Python source code from GitHub and a fil- form the Codex-12B model. Nijkamp et al. [386]
tered dataset of correctly implemented standalone also test CodeGen on multi-step program synthesis,
Python functions. Codex models outperformed where a program is broken down into multi-step
similarly sized GPT-3 and GPT-J models on the natural language prompts, which the model then
HumanEval evaluation set, with the Codex model implements individually (creating the new Multi-
trained on the filtered dataset (Codex-S) achieving Turn Programming Benchmark (MTPB)).
the best results. Importantly, Chen et al. [77] note Finally, Li et al. [313] focus on the task of
that there was no observed improvement from us- solving competitive programming questions (Code-
ing a pre-trained GPT-3 model as a base other than forces, Description2Code, and CodeNet). The Al-
faster convergence. phaCode LLM (up to 41B parameters) is first pre-
Chen et al. [81] seek to improve the performance trained on a multilingual dataset (C++, C#, Go,
of Codex through a self-debugging prompting ap- Java, JavaScript, Lua, PHP, Python, Ruby, Rust,
proach. Three forms of self-debugging are inves- Scala, and TypeScript) of 715 GB of source code
tigated. Simple feedback prompts the model to from GitHub. It is then fine-tuned using a new
decide whether the generated code solution is cor- curated dataset of competitive programming prob-
rect. Unit-test feedback prompts the model with lems called CodeContests. To achieve high per-
the output of unit tests provided in the problem formance, Li et al. [313] use large-scale sampling
description. Code explanation feedback prompts (up to millions of samples), filtering, and clustering
the model to explain the solution in detail and use of candidate solutions generated by AlphaCode to
the explanation to correct the solution. In each select the final submissions.
case, this process is repeated iteratively until the However, whilst these existing code-generation
model provides a solution it states is correct or LLMs have achieved impressive results, a criti-
a maximum number of attempts has been made. cal current constraint in applying LLMs to code
Codex using the self-debugging prompting frame- generation is the inability to fit the full code base
work with code explanation (and unit-testing if and dependencies within the context window. To
applicable) outperforms the base Codex model on deal with this constraint, a few frameworks have
C++-to-Python translation, text-to-SQL generation, been proposed to retrieve relevant information or
and text-to-Python generation. abstract the relevant information into an API defi-
nition.
Gunasekar et al. [182] train a smaller model Phi-
1 (1.3B parameters) to generate Python functions Long-Range Dependencies [660, 504]
from doc strings. Training phi-1 using a combina-
tion of filtered existing datasets and new synthetic Long-range dependencies across a code
textbook and exercise datasets results in a model repository usually cannot be regarded be-
that can achieve near current SOTA results on Hu- cause of limited context lengths (Sec. 2.6).
manEval while having over an order of magnitude
fewer parameters and tokens than previous works. Zhang et al. [660] introduce RepoCoder, a
Another area of interest has been the develop- retrieval-based framework for repository-level code
ment of multilingual programming LLMs. Xu et al. completion that allows an LLM to consider the
[626] evaluate a range of code generation LLMs broader context of the repository. A multi-step
and train a new multilingual LLM Polycoder (2.7B retrieval-augmented generation approach is taken,
parameters) using source code from 12 languages. where the initial code generated is then used to re-
However, for Python specifically, Codex outper- trieve further, potentially more relevant, repository
forms Polycoder and other existing models (GPT-J, code snippets to refine the final output. This ap-
GPT-Neo, and CodeParrot) on HumanEval. proach can be considered a retrieval-based method

38
Prompt API Defintion
for relieving the long-range dependency constraint.
Using the API functions def locate_item(item_name):
Similarly, Shrivastava et al. [504] propose the provided, write a program """ Returns x,y,z of item """
def move_to_location(x, y, z):
that… """ Moves to x,y,z coordinates"""
Repo-Level Prompt Generator (RLPG) framework def drop_item(item_name):
""" Removes item from inventory"""
to dynamically retrieve relevant repository context
and construct the correct prompt for a given com-
pletion task. To do this, many prompt proposals
Self-
are generated from different prompt sources (e.g., LLM debugging
parent class) and prompt contexts (e.g., method Function
Implementation
names). The best prompt is then selected by a
prompt proposal classifier and combined with the
Output API Implementation Store
default context to generate the final output. move_to_location(10, 20, 0)
locate_item('apple')
def drop_item(item_name):
""" Removes item from inventory"""
move_to_location(5, 10, 15)
Finally, Surís et al. [532] create the ViperGPT drop_item('apple')
item_list.remove(item_name)

framework, which utilizes the Codex LLM to gener-


ate programs that answer text-based visual queries. Figure 13: API Definition Framework. Illustration of
The Codex model is prompted with the query text providing a general API definition in the prompt [532,
and an API specification to do this. The human- 579, 564] to enable the consistent use of either external
generated API specification provides functions de- code or tools to solve the specific task whilst minimiz-
signed to deal with low-level visual tasks (e.g., ing the required context window. Extensions to this ap-
proach have included asking the LLM to implement the
find(object)) that the LLM can then use to gen-
functions within the API definition (red) and to prompt
erate solution code. This approach significantly the LLM to self-debug any API code that does not exe-
reduces the tokens needed to provide repository/- cute (green).
code context by only providing the API definition.
This API definition approach, illustrated in 13 has
been used in robotics by Vemprala et al. [564], and marks), with it shown to outperform InCoder on
by Wang et al. [579] as part of a Minecraft agent. both HumanEval generation and infilling (passing
Previously, Gupta and Kembhavi [185] used a pre- over 100 attempts).
defined function approach within VISPROG, which Code infilling is particularly relevant for applica-
uses GPT-3, external python modules, and few-shot tions involving modifying, reviewing, or debugging
prompting with example programs to solve visual existing code. Maniatis and Tarlow [357] explore
tasks. the data from the intermediary steps in the develop-
ment process to help automatically resolve reviewer
3.3.2 Code Infilling and Generation comments [155]. The Dynamic Integrated Devel-
Code infilling refers to modifying or completing oper ACTivity (DIDACT) methodology formalizes
existing code snippets based on the code context tasks in the software development process (e.g., re-
and instructions provided as a prompt. pairing builds, predicting reviewer comments, etc.)
Fried et al. [154] train the InCoder LLM (up into state, intent, and action components, and trains
to 6.7B parameters) to both generate Python code the model to predict code modifications. This ap-
and infill existing code using a masked language proach aims to train the model to understand the
modeling approach. Incoder is trained using 159 process of software development rather than only
GB of text split roughly equally between Python the end product.
source code, StackOverflow content, and source
code in other languages. On the HumanEval gener- 3.4 Creative Work
ation benchmark, InCoder underperforms the best- For creative tasks, LLMs have primarily been ap-
performing Codex and CodeGen models. However, plied to story and script generation.
unlike the other models, InCoder can perform sin- For long-form story generation, Mirowski
gle and multi-line infilling of existing code. et al. [368] propose using a 70B Chinchilla-
Similarly, Allal et al. [17] train a set of smaller optimal [206] LLM Dramatron with prompting,
SantaCoder models (1.1B parameters) for code gen- prompt chaining, and hierarchical generation to
eration and code infilling using 268 GB of Python, create complete scripts and screenplays without
JavaScript, and Java source code. SantaCoder is the requirement for a human-in-the-loop (although
primarily evaluated on the MultiPL-E benchmark co-writing is facilitated). The ability of Dramatron
(an extension of HumanEval and MBPP [28] bench- to help create a script was evaluated qualitatively

39
through co-writing and follow-up interviews with for cross-lingual short story generation, Wang et al.
15 industry experts. [584] use GPT-4 as part of the ReelFramer tool to
Similarly, Yang et al. [637] propose using GPT-3 help co-create news reels for social media, Ippolito
with a Recursive Reprompting and Revision frame- et al. [232] use LaMDA as part of the Wordcraft cre-
work (Re3) to generate stories over 2,000 words ative writing assistant, and Calderwood et al. [63]
long. The Re3 approach uses zero-shot prompting apply a fine-tuned GPT-3 model as part of their
with GPT-3 to generate a plan (settings, characters, Spindle tool for helping generate choice-based in-
outline, etc.). It then recursively prompts GPT-3 to teractive fiction.
generate story continuations using a specified dy- For more general creative tasks, Haase and
namic prompting procedure. Possible story contin- Hanel [187] assess a range of LLMs (including
uations are then ranked for coherence and relevance ChatGPT) on their capacity for idea generation (ev-
using separate fine-tuned Longformer models as eryday creativity) using the Alternative Uses Test
part of a Rewrite module. Finally, local edits to (generating alternative uses for given items). On
the selected continuations are made by detecting this task, LLMs were found to perform comparably
factual inconsistencies using the combination of a to 100 human participants.
GPT-3 model [403] and a BART model [303] as Finally, for visual creative tasks, LLMs have also
part of an Edit module. This process can then be been used to increase the level of control users have
iterated for fully automated story generation. when using image generation models. Feng et al.
Finally, Yang et al. [636] introduce the Detailed [148] propose the LayoutGPT method where an
Outline Control (DOC) framework to maintain plot LLM (GPT-3.5, GPT-4 or Codex) is used to gener-
coherence over thousands of words using GPT-3. ate a CSS Structure layout the image should follow
While DOC uses the same high-level planning- based on a text-based user prompt. This layout
drafting-revision approach as Re3, it implements can be visualized and used as input to guide an
this through the use of a detailed outliner and de- image generation model. This approach performs
tailed controller. The detailed outliner first breaks strongly on text-to-image generation and indoor
down the high-level outline into subsections us- scene synthesis. A similar concept is implemented
ing a breadth-first approach, with candidate gen- by Lian et al. [315], where an LLM (GPT-3.5) is
erations for the subsections created, filtered, and used to generate natural language layouts (bound-
ranked. The bodies of the detailed outline subsec- ing boxes and descriptions) to guide a diffusion
tions are then generated iteratively using a struc- model. Using an LLM as part of a modality conver-
tured prompting approach. During the generation, sion framework 16 has also been used in robotics
an OPT-based FUDGE [635] detailed controller is [338, 225] and knowledge work [329].
used to help maintain relevance.
In each case, to apply LLMs to long-form story 3.5 Knowledge Work
generation, the task is broken down into a series of
With researchers increasingly demonstrating
short-form sub-tasks (14). The current capabilities
LLMs’ ability to perform well on domain-specific
of LLMs primarily drive this approach, but also
knowledge tasks such as within Law [258] or
the desire to have a human-in-the-loop for some
Medicine [512], interest has grown in LLMs’ ca-
co-writing use cases [368].
pacity for wider knowledge work. These applica-
Limited Context Window [368, 637] tions are likely to be found across the labor market
with Eloundou et al. [140] estimating that 80% of
The inability of current LLMs to keep the the US workforce is in roles where at least 10% of
entire generated work within the context tasks could be affected by LLMs.
window currently constrains their long-form In the professional services field, Bommarito
applications and generates the need for mod- et al. [49] evaluate GPT-3.5 and previous GPT ver-
ular prompting (14). sions on actual and synthetic questions from the
Uniform CPA Examination Regulation section and
For short form generation, Chakrabarty et al. AICPA Blueprints for legal, financial, accounting,
[69] propose CoPoet (fine-tuned T5 and T0 models) technology, and ethical tasks. Using only zero-shot
for collaborative poetry generation, Razumovskaia prompting, the best performing model (latest GPT-
et al. [452] use PaLM and prompting with plans 3.5) struggles with quantitative reasoning, achiev-

40
User Prompt work, including sentiment analysis, classifica-
Module 1
tion, NER/NED, and financial question answering.
General Prompt BloombergGPT is shown to outperform the OPT
(66B parameters), GPT-NeoX, and BLOOM (176B
parameters) LLMs on these financial domain-
specific tasks and performs competitively on
LLM broader benchmarks.
Eg., Generate a plot outline Thiergart et al. [550] considers the applicability
for a new novel as paragraph
Output
headings of GPT-3 to the task of email management, includ-
ing classification, information extraction (NER),
Module 2
and generating response text. Whilst it is noted
General Prompt that GPT-3 has the capacity for all three tasks, the
Pre-processing author highlights current issues around reliability,
lack of access to internal data, and the need for a
human in the loop.
LLM Residual
Liu et al. [329] propose enabling LLMs to un-
Re-run Eg., Using the outline, derstand charts and plots by first using a vision
generate a draft for the xth
Output paragraph heading
plot-to-text translation model (DePlot) to decom-
pose the chart into a linearized data table. Once the
Module 3
chart or plot has been converted into a text-based
General Prompt
data table, it is combined with the prompt and pro-
Pre-processing vided to a Flan-PaLM, Codex, or GPT-3.5 LLM. A
similar modality conversion 16 approach has also
been used in robotics [338, 225] for sensor data.
Iterate LLM
Zhang et al. [668] evaluate a range of LLMs
Eg., Check the spelling and
consistency of this paragraph (GPT-3, InstructGPT, OPT, GLM, Cohere, and An-
given the outline and plot
Output summary
thropic) on the task of news summarization. On
the DM/CNN and XSUM benchmarks, instruction
fine-tuned models (InstructGPT) perform the best
across summarization faithfulness, relevance, and
Figure 14: Modular Prompting. Illustration of using coherence. To evaluate against human capabil-
a series of separate prompts [368, 637, 368, 579, 584]
ity Zhang et al. [668] collect reference summa-
and processing steps to enable an LLM to perform tasks
that would either not fit in a single context window or rizations for 100 articles from 6 freelance writers.
could not easily be specified in a single prompting step. Zero-shot InstructGPT-3 performs comparably to
the freelance writers across the three metrics.
Cheng et al. [82] investigate GPT-4’s capacity to
ing results similar to random guessing on multiple- perform data analysis and compare it to human an-
choice questions. However, on qualitative sections, alysts. GPT-4 is combined with a modular prompt-
GPT-3.5 achieved 50-70% accuracy, significantly ing framework 14 with three steps, code generation
ahead of random guessing and approaching human- (SQL and Python), code execution (“collect data
level scores. and output figures”, etc.), and analysis generation
(“generate five bullet points about the analysis”).
Numerical Reasoning [436, 49]
While GPT-4 performs well, it currently underper-
LLMs have generally seen worse perfor- forms experienced human data analysts on tasks
mance on quantitative tasks, potentially con- from NvBench [346].
straining their applications in knowledge For scientific knowledge work, Taylor et al.
work areas such as financial services or ac- [548] train the Galactica LLM specifically on sci-
counting. entific text for tasks such as scientific knowledge
recall, reasoning, citation prediction, and scientific
Wu et al. [616] train BloombergGPT (50B Q&A. In addition to a domain-specific training
parameters) for various financial knowledge corpus, Galactica is specialized in the scientific do-

41
main through the use of specialized tokens, work- on relevant examples does not appear to improve
ing memory, and prompt-pre-training. performance. More recently, Katz et al. [258]
Dunn et al. [133] propose fine-tuning GPT-3 for show that GPT-4 with zero-shot prompting exhibits
scientific combined named entity recognition and SOTA performance on the full UBE, including the
relation extraction (LLM-NERRE). First, 100 to multiple choice, essay, and performance test com-
1,000 manually annotated prompt-completion pairs ponents, and achieves passing scores.
are created by humans. These examples are then Blair-Stanek et al. [48] assess GPT-3.5’s abil-
used to fine-tune a GPT-3 model for the specific ity to reason about legal facts and statutes us-
NERRE task. ing the StAtutory Reasoning Assessment (SARA)
Finally, Liu and Shah [335] evaluate GPT-4’s dataset [208]. GPT-3.5 is shown to have SOTA per-
ability to review academic papers, specifically: formance but with significant variation depending
identifying errors, verifying author checklists, and on the type of prompting used (zero-shot, few-shot,
selecting the better abstract. GPT-4 shows some and CoT). GPT-3.5 was also shown to perform rela-
capacity to detect errors, with 7 out of 13 errors tively poorly on synthetic statutory reasoning tasks.
detected, and verify author checklists, with 87% Choi et al. [84] evaluate ChatGPT (GPT-3.5)
accuracy. However, GPT-4 is shown to have lim- on 95 multiple-choice and 12 essay questions from
ited capacity for distinguishing the better paper the final exams at the University of Minnesota law
abstract. school. ChatGPT was found to perform at the level
of a C+ student, near the bottom of the class, but
3.6 Law with passing scores.
Applications of LLMs within the legal domain
share many similarities with medicine, including Out of Date Information
legal question answering [651, 258] and legal in-
Due to regularly updated laws and new
formation extraction [71]. However, other domain-
precedents, the training/retrieval data be-
specific applications have been proposed, such as
come outdated frequently [195].
case outcome prediction [189], legal research [234],
and legal text generation [423].
Finally, many more specific legal question-
3.6.1 Legal Question Answering and answering applications have been proposed, in-
Comprehension cluding: explaining legal concepts (GPT-4 + re-
Key tasks of the legal field are finding related prece- trieval) [478], summarizing legal judgments (GPT-
dents, answering legal questions, and comparing 3.5) [115], litigation research and drafting [234],
existing documents or statutes. and helping full-fill the tasks of a law professor
Using a general-purpose LLM with prompting (ChatGPT) [427].
approach, Yu et al. [651] use GPT-3.5 with zero-
shot, few-shot, and CoT prompting to achieve 3.6.2 Case Prediction and Legal Text
SOTA performance on the legal entailment task Generation
(identifying the relevant statutes and determining Case prediction and legal text generation involve
if a given premise is correct) in the Competition predicting or completing legal opinions. Whilst
on Legal Information Extraction/Entailment (COL- there is currently sparse usage of LLMs in the liter-
IEE) dataset [437]. They also investigate a GPT-3.5 ature, smaller language models have been applied,
version fine-tuned using the COLIEE training set suggesting potential future LLM applications in
with and without explanations but find the zero- and this area.
few-shot legal prompting approaches perform best. Hamilton [189] use nine separate GPT-2 models
Similarly, Rosa et al. [460] use a general monoT5 trained on individual supreme court justice’s au-
model with zero-shot prompting on the COLIEE thored opinions to predict how each justice will
entailment task. vote on a given case. They use a handcrafted
On the US legal Uniform Bar Examination prompt, including a summary of the topic gener-
(UBE), Bommarito II and Katz [50] show that GPT- ated by GPT-3. However, they find this approach
3.5 with zero-shot prompting can achieve 50% on to case prediction does not match the SOTA.
the multiple choice Multistate Bar Examination Previously, Chalkidis et al. [70] trained a range
component, but note that fine-tuning the model of attention-based models (including BERT) to pre-

42
dict case outcomes from the European Court of adapt the GPT-3.5 LLM to medical question an-
Human Rights (ECHR). The attention-based mod- swering (USMLE and MedMCQA) and compre-
els outperformed an SVM with a bag of words hension (PubMedQA) tasks. In addition, Liévin
approach for binary violation classification, multi- et al. [320] propose using retrieval augmentation
label violation classification, and case importance where relevant text from Wikipedia is retrieved
prediction. and included in the prompt. More recently, Nori
Finally, Peric et al. [423] use a dataset of 50,000 et al. [388] evaluated GPT-4 on USMLE and Mul-
judicial opinions from U.S. Circuit Courts to train tiMedQA datasets using zero and few shot prompt-
a Transformer-XL model and fine-tune a GPT-2 ing. GPT-4 is found to outperform GPT-3.5 across
model. The models were then evaluated for their benchmarks significantly. However, several issues
ability to complete a judicial opinion, with a start relating to using GPT-4 for real-world clinical ap-
given as a prompt. In qualitative evaluations, hu- plications are raised, including the risks of erro-
man participants struggled distinguishing between neous generations and the risks of bias. Tang et al.
machine-generated and genuine text. [538] raise similar issues and find that GPT-3.5 and
ChatGPT have issues with factual accuracy and
3.7 Medicine representing the level of certainty during medical
Many applications of LLMs have been proposed summarization.
in the medical domain, including medical ques-
tion answering [511, 512, 320, 655, 388], clinical Hallucination and Bias [538, 388, 511]
information extraction [10, 448], indexing [650],
The safety-critical nature of the medical do-
triage [491, 301], and management of health
main means the possibility of hallucinations
records [276].
significantly limits the current use cases.
3.7.1 Medical Question Answering and Further work is also needed to reduce the
Comprehension risk of LLMs perpetuating existing bias in
Medical question answering and comprehension clinical datasets.
consists of generating multiple-choice and free-text
responses to medical questions. Yunxiang et al. [655] fine-tune a LLaMA LLM
Singhal et al. [511] proposed using few-shot, ChatDoctor (7B parameters) specifically for the
CoT, and self-consistency prompting to specialize task of medical question answering. To specialize
the general-purpose PaLM LLM to medical ques- the LLaMA model, it is first instruction fine-tuned
tion answering and comprehension. They demon- using the Alpaca dataset [540] and then fine-tuned
strate a Flan-PaLM model [93] using a combination to the medical domain using a dataset of 100k pa-
of the three prompting strategies to achieve the pre- tient conversations. Similarly to Liévin et al. [320],
vious SOTA results on the MedQA, MedMCQA, ChatDoctor is augmented with two external knowl-
PubMedQA, and MMLU medical datasets. To fur- edge sources (a disease database and Wikipedia) to
ther align the model to the medical domain, they improve the factual grounding of the model.
proposed Med-PaLM, which utilizes instruction Instead of using general models with specialized
prompt-tuning based on 40 examples from a panel prompting or fine-tuning, Venigalla et al. [565]
of clinicians and task-specific human-engineered train a new model PubMedGPT specifically for
prompts. medical question answering and text generation
Singhal et al. [512] then extend the Med-PaLM tasks. PubMedGPT is trained using a combina-
approach with Med-PaLM 2 using the newer PaLM tion of PubMed abstracts and full documents from
2 LLM as its base model. Singhal et al. [512] the Pile [165]. Peng et al. [418] also train a new
conduct further instruction-fine tuning and use a LLM GatorTronGPT (up to 20B parameters) for
new ensemble refinement (ER) prompting strategy biomedical question answering and relation extrac-
(where stochastically sampled outputs are first gen- tion using a mixture of clinical and general English
erated and provided within the final prompt). This text. Whilst these approaches outperformed exist-
allows Med-PaLM 2 to achieve the current SOTA ing smaller specific purpose models [177, 644] in
on the MultiMedQA benchmark. medical question answering, they currently under-
Liévin et al. [320] adopt a similar approach us- perform the larger general purpose LLMs (GPT-
ing zero-shot, few-shot, and CoT prompting to 3.5/4 and MedPaLM 1/2). However, there remains

43
debate over whether larger general or specialized tasks, such as understanding mathematical opera-
clinical models are the best approach. Looking tions, complex multi-step reasoning, and longer-
at models up to GPT-3, Lehman et al. [297] ques- term planning. Therefore, the applicability of
tion the effectiveness of LLM in-context learning LLMs to these tasks, and methods for improving
approaches by showing that small specialized clin- their capabilities, is an active area of research.
ical models fine-tuned on limited annotated data For mathematical reasoning tasks, Uesato et al.
outperform the former. [560] test a range of fine-tuning (supervised and
Finally, LLMs have also been applied to a range RLHF), prompting (zero-shot and few-shot), and
of more specific medical question-answering tasks, re-ranking (majority voting and reward model) to
including evaluating GPT-3 on its’ ability to triage evaluate whether they improve a base LLM’s (70B
and diagnose cases [301], responding to social me- parameters) ability to generate accurate reason-
dia genetics [134] and general [30] patient ques- ing steps on word-based maths problems in the
tions (ChatGPT), answering questions from the GSM8K dataset [95]. Whilst fine-tuning on in-
Korean general surgery board exams (GPT-3.5, termediate steps (“process-based”) performs simi-
GPT-4) [393], consultation and medical note tak- larly to using only final answers (“outcome-based”)
ing [296], and answering ophthalmology questions on final answer correctness, processed-based ap-
[21]. proaches are found to generate significantly fewer
errors in reasoning.
3.7.2 Medical Information Retrieval
Huang et al. [222] take this a step further by
Medical text often contains domain-specific abbre-
showing that the mathematical reasoning ability
viations, acronyms, and technical terms presenting
of a PaLM LLM on the GSM8K dataset can be
specific information retrieval challenges. This has
self-improved through fine-tuning on a dataset of
led LLMs also to be applied to help structure and
high-confidence reasoning paths generated by the
extract data from medical sources.
same PaLM base model.
Agrawal et al. [10] use InstructGPT (GPT-3)
with prompt templates (zero- and one-shot) for clin- Using only prompting, Kojima et al. [273] find
ical information extraction, such as extracting med- that zero-shot CoT prompting alone significantly
ication dosage and frequency from medical notes improves the performance of GPT-3 and PaLM
or disambiguation of medical acronyms. They also LLMs over standard zero- and few-shot prompting
introduce two methods for converting the LLM on the MultiArith and GSM8K datasets. While Li
output into a structured format using a verbilizer et al. [312] introduce DIVERSE, a prompting ap-
for mapping to classification labels and a resolver proach that uses a diverse set of prompts for each
for more complex structured outputs such as lists question and a trained verifier (with reasoning step
(GPT-3 + R). awareness) to improve further GPT-3.5’s perfor-
Rajkomar et al. [448] take a different approach mance on GSM8K and other reasoning bench-
by treating medical acronym disambiguation as marks. Finally, Shridhar et al. [502] take a novel
a translation task and training a specialized end- approach by training new models to break down
to-end T5 LLM. To preserve privacy, they also a mathematical word problem into Socratic sub-
use a training dataset generated from public web questions to guide the answer of either other LLMs
pages (without medical acronyms) and web-scale or human learners. GPT-3 prompted with these sub-
reverse substitution of medical acronyms, with only questions outperforms simple one-shot prompting
evaluation done on actual clinical notes. on the GSM8K dataset.
Finally, Gu et al. [178] use GPT-3.5 and knowl- Stolfo et al. [525] evaluate a range of LLMs (in-
edge distillation to train a PubMedBERT model cluding GPT-3) at mathematical reasoning using
for adverse drug event extraction (entity and rela- a new framework to understand the causal impact
tion). The distilled PubMedBERT model outper- of different input factors (e.g framing, operands,
forms GPT-3.5 and GPT-4, and performs similarly and operations). Instruction fine-tuned GPT-3 mod-
to specialized models that use supervised learning. els are found to be significantly more robust and
sensitive than the smaller LLMs evaluated.
3.8 Reasoning Other LLM use cases in algorithmic and mathe-
Mathematical and algorithmic tasks often require matical reasoning have also been proposed. Gadgil
a different set of capabilities than traditional NLP et al. [159] apply a Codex LLM with prompt en-

44
gineering and filtering to the task of mathemati- reciting causal knowledge embedded in their data
cal formalization (in the context of theorem prov- rather than doing causal reasoning [253].
ing). Webb et al. [595] evaluate GPT-3.5’s capacity Overall, while LLMs show some capacity for
for analogical reasoning using tasks that emulate more complex reasoning, the relatively poor per-
Raven’s Standard Progressive Matrices (SPM), let- formance of LLMs on a number of reasoning tasks
ter string analogies, and verbal analogies. GPT-3.5 and benchmarks [562, 164, 244] stands in contrast
is shown to generally outperform human partic- to the often human level performance being seen
ipants (undergraduates) at matrix reasoning and in other capabilities [61, 263].
verbal analogies, but with more mixed results on
letter string analogies. Yu et al. [654] introduce 3.9 Robotics and Embodied Agents
the ALERT benchmark to evaluate LLM reason-
LLMs have also started to be incorporated into
ing across ten skills (logistic, causal, common-
robotics applications to provide high-level planning
sense, abductive, spatial, analogical, argument,
and contextual knowledge.
and deductive reasoning, as well as textual entail-
ment and mathematics). Ruis et al. [464] study Ahn et al. [14] implement a PaLM-540B LLM in
LLMs’ capability to interpret implicatures, for ex- the SayCan architecture to break down high-level
ample, whether they understand the response "I text-based instructions into a sequence of lower-
wore gloves" to the question “Did you leave finger- level robot tasks that can be executed. The authors
prints?” as meaning “No”; finding that lots of mod- use the LLM to propose possible next actions via it-
els perform close to random. Finally, Valmeekam eratively scoring the most likely of a defined set of
et al. [562] propose a new assessment framework low-level tasks based on the high-level text input.
for common-sense planning and find that existing The low-level task to be executed is then deter-
LLMs GPT-3.5 and BLOOM perform poorly. Us- mined by combining the low-level tasks proposed
ing the framework for the Blocksworld domain by the LLM with affordance functions which de-
(planning tasks with different colored blocks on termine the probability of the robot completing the
a surface), the best GPT-3.5 model only came up task given the current low-level context.
with a valid plan 5% of the time, compared to 78% Driess et al. [129] take this concept a step fur-
of human participants. ther by combining the PaLM-540B LLM with ad-
ditional input modalities (22B parameter vision
Sub-Human-Performance [562, 607] transformer) to create the PaLM-E model. By in-
troducing images into the input, the PaLM-E model
Existing LLMs struggle to match human can predict which low-level tasks are possible given
performance on reasoning benchmarks. the current state, whether the previous low-level
tasks executed failed, and incorporate images into
Another line of work has investigated the in- long-horizon planning, allowing it to outperform
tersection of LLMs and causal reasoning [425, the original SayCan results.
253]. Kıcıman et al. [286] argue that GPT-3.5/4 Another approach has been to use LLMs to gen-
outperform existing algorithms in three causal erate code for robotics tasks. Vemprala et al. [564]
benchmarks. In contrast, Gao et al. [164] evalu- combine ChatGPT with a pre-defined high-level
ate ChatGPT on three causal reasoning tasks (dis- function library of robotic capabilities for human
tinct from Kıcıman et al. [286]) and find that it on the loop robotics tasks. By providing details of
performs rather poorly; further, few-shot and chain- the function library in the prompt, ChatGPT is then
of-thought prompting sometimes further exacer- shown to be able to break down high-level natu-
bates its performance. Srivastava et al. [519] pro- ral language instructions into a set of lower-level
pose 14 causal reasoning tasks, some of which are function calls, which can then be executed on the
considered to be very hard [534]. Similarly, Jin robot if the human is satisfied it is accurate. This is
et al. [244] curate another causal inference task another example of the API definition 13 approach,
and posit that current LLMs still fail to general- also used in computer programming [532]. Other
ize. Lampinen et al. [288] study whether LLMs related works that use LLMs to generate code for
can generalize causal intervention strategies from robotics applications include using an LLM for hi-
few-shot examples. Willig et al. [607] conjec- erarchical code generation to write robot policies
ture that current LLMs are “causal parrots”, simply (Codex) [316], to generate code policies and main-

45
tain a written state (GPT-3.5) [647], and using an
LLM for code-based task planning (GPT-3, Codex) L L M s in the Social Sciences & Psychology
[510].
Finally, LLMs have also been combined with Using L L M s to model Analyzing behavior al Simulating social
human behavior char acter istics of L L M s relationships with L L M s
modality-to-text pre-processing to provide the
LLM with additional input from the robot’s en- M ilgr am Shock Exper iment Big Five per sonality tr aits I nter acting ar tificial agents

vironment. Liu et al. [338] use GPT-4 as part of the I llusor y Tr uth Effect Guilfor d's Alter native Uses L L M s to simulate societies

REFLECT framework for detecting and explaining


robot failures. To achieve this, multi-modal sensory Figure 15: Use cases of LLMs in the social sci-
inputs are first converted into a text-based hierar- ences and psychology can mainly be structured into
three categories: using LLMs to model human behav-
chical summary at the sensory, event, and sub-goal
ior [e.g., 12, 211], analyzing behavioral characteristics
levels. The hierarchical summary then prompts of LLMs [e.g., 414], and using LLMs to simulate social
the LLM to detect and analyze failures. Similarly, relationships [e.g., 408].
Huang et al. [225] combine an LLM (InstructGPT,
PaLM) with multiple sources of text-based environ-
ment feedback for robotic task planning. text of the psychological and behavioral sciences:
using LLMs to simulate human behavioral experi-
Single Modality [338, 14, 564] ments [e.g., 22, 176, 211, 614, 126], analyzing the
personality traits of LLMs [e.g., 367, 414, 470],
While LLMs can help robots or agents un-
and employing them as artificial agents to model
derstand instructions and add high-level
social relationships [409]. See Fig. 15 for an illus-
planning capabilities, their inability to di-
tration.
rectly learn from image, audio or other sen-
sor modalities constrain their applications. 3.10.1 Modeling Human Behavior
In the behavioral sciences, there is an increasing
For agents in simulated worlds, Wang et al. interest in using LLMs as models for psychological
[579] use the GPT-4 LLM within their VOYAGER experiments. Being able to model human behavior
framework to create a Minecraft agent that can computationally through language models would
autonomously explore, acquire new skills and com- entail a variety of advantages over using human
plete tasks. First, they use GPT-4 to propose new participants: experiments with LLMs are cheaper,
tasks for the agent to complete as part of the au- faster, can be scaled easier, and are potentially less
tomatic curriculum. Then, they ask it to generate sensitive to ethical considerations [176]. In light
code to solve the proposed task given the current of this, various works have compared LLMs with
state to add to its skills library, which can then be human participants from a behavioral perspective.
used in the future (similar to the API approach 13 Argyle et al. [22] demonstrate how LLMs can
used by Vemprala et al. [564]). Finally, the authors generate responses corresponding to virtual partici-
use GPT-4 to verify whether the executed code pants in behavioral experiments. They do so by us-
has achieved the proposed task. This framework ing LLMs to generate samples of responses to stud-
outperforms prompting approaches such as ReAct, ies related to political opinions and voting behavior.
Reflexion, and AutoGPT (Sec. 2.7). In particular, the authors investigate three studies:
Prior work using LLMs for planning in simu- the first asks participants to list words associated
lated worlds include: Wang et al. [591] using GPT- with outgroup partisans, and the second and third
3 for Minecraft, Huang et al. [224] using GPT-3 focus on vote prediction based on demographics.
and Codex in VirtualHome, and Nottingham et al. Across scenarios, experimental results demonstrate
[389] using Codex for Minecraft. that GPT-3 provides answers that closely align with
human responses.
3.10 Social Sciences & Psychology Horton [211] argue that LLMs can be used
The rapid advancements of LLMs have fostered the to computationally model human behavior and
use of such models across research in the psycho- demonstrate such an ability in economics by ex-
logical and behavioral sciences. Reviewing the ex- ploring their behavior in economic scenarios. They
isting literature, we have identified three main areas conducted four experiments focusing on economic
and tasks in which LLMs have been used in the con- decision-making using GPT-3, showing that the

46
LLM can approximately replicate results obtained tests in which subjects are asked to choose between
with human individuals. a kiss from a favorite movie star and $50 [462]
Griffin et al. [176] investigate the suitability of and where subjects had to decide between paying
LLMs to model psychological change. In their a traffic violation fine and going to court [461].
study, the authors assess LLM responses to two These experiments show that GPT-3 replicates only
behavioral tests, the illusory truth effect [ITE; 194] 37.5% of the effects obtained from human partic-
and an experiment measuring the influence of pop- ipants. The authors argue that these results are
ulist news to change in political views [55]. The attributed to humans and LLMs representing inher-
results demonstrate that in both scenarios, human ently different cognitive systems.
judgments tend to align with LLM-based judg- Maddela et al. [353] study identifying unhelpful
ments, indicating that LLMs have the potential to thought patterns and possible reframings to facil-
model the effect of influence on human individuals. itate mental health. They release a dataset called
Aher et al. [12] introduce the Turing Experiment PATTERN R EFRAME and evaluate GPT-3.5 on it,
(TE) to measure an LLM’s suitability to model hu- showing that it can perform very well without ad-
man behavior. A TE consists of inputs to the LLM ditional training. They conclude that practitioners
that signal a certain demographic (e.g., names or of cognitive behavioral therapy may benefit from
occupations) as well as a set of experimental de- using LLMs to produce richer training material.
tails and corresponding outputs used to simulate
3.10.2 Analyzing Behavioral Characteristics
human behavior. The authors apply their approach
of LLMs
to four individual tests, namely an ultimatum game
from behavioral economics [214, 279], garden-path In addition to using LLMs as models for human
sentences used in psycholinguistics [89, 411], the behavior, various existing works study LLMs by
Milgram Shock Experiment from social psychol- analyzing their personality traits.
ogy [364], and the wisdom of crowds task used to Jiang et al. [242] do so by introducing the Ma-
measure collective social intelligence [375]. De- chine Personality Inventory (MPI) dataset, a col-
mographic details are simulated via gender titles lection of items to assess personalities according
and surnames. The results show that LLMs largely to the Big Five personality factors: extraversion,
align with human behavior across the tests. How- agreeableness, openness, conscientiousness, and
ever, the authors note that LLM size matters and neuroticism [358].
that larger models tend to provide results that are Miotto et al. [367] assess GPT-3’s personalities
more aligned with human responses. using the HEXACO [27] and Human Values [488]
Aher et al. [12] point out that the LLMs were scales. Their experimental results reveal that GPT-
most likely exposed to the four behavioral exper- 3 obtains personality and value scores that align
iments during their pre-training. To account for with human participants. Miotto et al. [367] provide
that, the authors create artificial variations of the an extensive analysis of varying temperature values
experiments with conditions that differ from previ- used to prompt the LLM, finding that an increased
ous studies. Additionally, the authors note that a temperature yields changes in the model’s person-
potential risk with using LLMs to simulate human alities, e.g., GPT-3 shows a higher unwillingness to
responses is the introduction of generations that manipulate as well as increased scores on anxiety.
contain biases stemming from the models’ training Similar results were obtained concerning the Hu-
data. man Values scale, where model responses varied
substantially for different temperature values.
Social Biases [12, 367] In line with this work, Pellert et al. [414] ar-
gue that LLMs possess psychological traits as ob-
Unbalanced views and opinions in the train- served in human individuals and can be assessed
ing data skew the LLMs towards biased hu- through psychometric tests. The authors conduct
man behaviors. experiments measuring, among others, the Big Five
personality traits in a zero-shot setup. In contrast,
Park et al. [409] replicate a set of 8 psycho- to Miotto et al. [367], Pellert et al. [414] investi-
logical studies from the Many Labs 2 project [270] gate smaller models based on BERT and find that
using GPT-3 to assess the LLM for its ability to sim- different variants of BERT score across the five
ulate human behavioral data. Such studies include personalities in a fairly homogeneous fashion, with

47
traits that are high on agreeableness and extraver- Pre-processing Post-processing
sion, but low on neuroticism.
Prompt
In a related fashion, Stevenson et al. [523] as-
sess LLM performance (GPT-3) on the Guilford’s
Alternative Uses Test [AUT; 181], a test to assess LLM
human creativity. The test asks participants to sug-
gest uses for physical objects (e.g., a book or a Code -> Modality
fork). Comparing the AUT test performance of Modality-to-Text Python - Matplotlib
Latex - TikZ
CSS
GPT-3 to that of psychology students, the authors <style>
.grid {
found that human responses score higher on orig- Prompt
display: grid;
…….

inality and surprise, whereas GPT-3’s responses


were more useful.
Kosinski [277] test Theory of Mind (ToM) in LLM Prompt

LLMs. ToM refers to the ability to track others’


unobservable mental states, such as intentions, be-
liefs, or desires. The authors find that among Output Modality-and-Text-
to-X
LLMs of the GPT family, recent models can in-
creasingly solve ToM tasks without having been
explicitly trained to do so. For instance, while GPT- Figure 16: Modality Conversion. Illustration of us-
ing models with other input modalities as pre or post-
2 shows virtually no capability of solving ToM
processing steps in an LLM pipeline [148, 329, 338,
tasks, GPT-3.5 (based on InstructGPT) and GPT-4 225, 315]. For some use cases, this approach can be
performed similarly to 6- and 7-year-old children, used as an alternative to training a multi-modal model
respectively. Gandhi et al. [162] present a template- or using a shared embedding space.
based framework for generating synthetic samples
to evaluate ToM in LLMs, which are then applied to
five recently developed LLMs (incl. GPT-3, GPT- Wang et al. [583] propose using GPT-3 to label
4, LLaMA, and Claude). The authors show that datasets more cost-effectively than human labelers.
most models struggle with ToM in its basic forms. These labeled datasets can then be used to train
However, GPT-4 performs closest to the human more compute-efficient smaller models. To evalu-
comparison of all tested models. ate this approach, RoBERTa and PEGASUS mod-
els are trained for 9 NLP tasks using human and
3.10.3 Simulating Social Relationships GPT-3 generated labels. GPT-3 labels are shown
While most previous works measure LLMs as mod- to outperform human labels when labeling budgets
els for human behavior through replicating human are small, but higher-quality human labels tend to
behavioral studies, Park et al. [408] use the power lead to better models at higher labeling budgets.
of LLMs to model the interaction between artificial Similarly, Ding et al. [123] propose three prompt-
agents. The authors model a community of 25 ar- ing approaches for training data generation with
tificial agents interacting in a digital environment GPT-3: unlabeled data annotation (generate labels
to achieve this. Each character has unique traits, for known examples), training data generation (gen-
and the characters interact with each other through erate examples and labels), and assisted training
natural language. Simulating such societies, the data generation (with Wikidata provided as addi-
authors observe emergent social behaviors (e.g., tional context). Fine-tuning a smaller BERT model
forming new relationships and attending events) for text classification and NER tasks using these
between agents that are formed without any human approaches showed results similar to or worse than
interaction. using GPT-3 directly.
Gunasekar et al. [182] leverage synthetic data
3.11 Synthetic Data Generation
generation with GPT-3.5 to train a new code gen-
The ability of LLMs to perform in-context learning eration LLM (see Sec. 3.3.1). The generated data
allows them to be prompted to generate synthetic consists of synthetic Python textbooks focusing on
datasets for training much smaller domain-specific reasoning, basic algorithmic skills, and synthetic
models. Python exercises. One important finding of this

48
work is that introducing randomness into data gen- to model collapse [506].
eration is crucial, all while ensuring the examples
maintain their quality and coherence. Hallucinated Distributions [506]
Yoo et al. [648] propose GPT3Mix to generate Using LLMs for fully synthetic data genera-
additional synthetic data from an existing dataset tion is currently constrained by our inability
for classification tasks. GPT3Mix uses GPT-3 to verify whether the synthetic data gener-
with a prompt containing real examples from the ated is representative of the true distribution
dataset and a task specification to create synthetic in the corresponding real-world data.
examples and pseudo-labels jointly. This new aug-
mented dataset is then used to fine-tune BERT and
In cases where the LLM is only used to label
DistilBERT models. This method combines data
existing data [583, 123] this will likely reduce
augmentation approaches with knowledge distilla-
the risk of generating an unrepresentative training
tion by training smaller classification models using
distribution (although hallucinated labels remain
soft labels.
an issue). Where the LLM is used to generate
Bonifacio et al. [51] propose InPars, a method (or partially generate) both the input and the tar-
for using LLMs to generate synthetic retrieval ex- get [123, 104, 182, 51, 503] the issue of halluci-
amples for fine-tuning on information retrieval nated distributions becomes potentially significant.
tasks. GPT-3 is few-shot prompted to generate a rel-
evant question for a randomly sampled document 4 Related Work
along with the question’s associated probability. A
smaller monoT5 model is then fine-tuned using Closest to ours is the concurrent work by Zhao
this dataset to rank relevant documents for a given et al. [673], who provide an extensive survey of
question. The fine-tuned model outperforms only large language models and associated topics. Mi-
pre-trained models but performs worse than models alon et al. [363] focus on surveying augmented
fine-tuned using the existing MS MARCO training language models, i.e., “language models with rea-
dataset [32]. soning skills and the ability to use tools”. Tornede
Dai et al. [104] introduce AugGPT, which uses et al. [555] survey LLMs in the context of AutoML
ChatGPT (GPT-3.5) to augment each example in methods, highlighting existing methods and chal-
a small base dataset with six additional rephrased lenges in leveraging these for improving LLMs.
synthetic examples. This new augmented dataset is Tang et al. [539] survey LLM-generated text de-
then used to fine-tune a specialized BERT model. tection techniques. Chang et al. [72] concurrently
This approach outperforms existing augmentation survey evaluation tasks of LLMs.
approaches, such as word and character substitu- The literature also contains several previous sur-
tion. veys and evaluations specific to individual applica-
tion domains that reference LLMs, including: chat-
Finally, instead of generating synthetic data to
bots [345], computational biology [558, 217], com-
achieve a specialized task, Shridhar et al. [503] pro-
puter programming [499], medicine [381, 610, 590,
pose Decompositional Distillation, which aims to
381], law [101, 531], knowledge work [140, 621],
use synthetic data to replicate in smaller models the
and reasoning [223].
multi-step reasoning capabilities, such as CoT, that
emerge in larger LLMs. First, GPT-3 is used with a
5 Conclusion
manually designed few-shot prompt to decompose
a problem into (sub-question, sub-solution) pairs. In this work, we identify several unsolved chal-
This synthetic sub-question dataset is then used lenges of large language models, provide an
to fine-tune a T5 problem decomposer to generate overview of their current applications, and discuss
sub-questions. Finally, a GPT-2 problem solver how the former constrain the latter. By highlighting
is fine-tuned to provide the sub-solutions to the the limitations of existing methods, we hope to fos-
teacher-generated sub-questions. ter future research addressing these. We also hope
Overall, while LLM-generated synthetic data that by providing an overview of the approaches
can potentially bring significant cost benefits, the used in different applied areas, we can facilitate
greater its role, the higher the potential for it to fail the transfer of ideas between domains and target
to capture the true distribution and potentially lead further research.

49
Acknowledgements [16] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma and
D. Zhou. 2023. What learning algorithm is in-context learn-
We thank Abhishek Kumar and Stella Rose Bider- ing? investigations with linear models. In The Eleventh
International Conference on Learning Representations.
man for fruitful discussions and feedback on the
draft. [17] L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M.
Ferrandis, N. Muennighoff, M. Mishra et al. 2023. Santa-
coder: don’t reach for the stars!
References [18] J. Andreas. 2022. Language models as agent models.
[1] A blog post detailed a Sam Altman freakout about a
huge chips shortage threatening OpenAI. Then it was taken [19] C. Anil, Y. Wu, A. Andreassen, A. Lewkowycz, V. Misra,
down. V. Ramasesh, A. Slone, G. Gur-Ari et al. 2022. Explor-
ing Length Generalization in Large Language Models.
[2] Open LLM Leaderboard - a Hugging Face Space by Hug- ArXiv:2207.04901 [cs].
gingFaceH4.
[20] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin,
[3] Reproducibility — PyTorch 2.0 documentation. A. Passos, S. Shakeri, E. Taropa et al. 2023. Palm 2 techni-
cal report. arXiv preprint arXiv:2305.10403.
[4] 2023. Negative prompts for text generation. Section:
[21] F. Antaki, S. Touma, D. Milad, J. El-Khoury and R. Duval.
Prompting.
2023. Evaluating the performance of chatgpt in ophthal-
mology: An analysis of its successes and shortcomings.
[5] 2023. Reproducibility. Page Version ID: 1163331755.
medRxiv.
[6] A. Abbas, K. Tirumala, D. Simig, S. Ganguli and A. S. [22] L. P. Argyle, E. C. Busby, N. Fulda, J. Gubler, C. Rytting
Morcos. 2023. Semdedup: Data-efficient learning at and D. Wingate. 2022. Out of one, many: Using lan-
web-scale through semantic deduplication. arXiv preprint guage models to simulate human samples. arXiv preprint
arXiv:2303.09540. arXiv:2209.06899.
[7] J. D. Abernethy, A. Agarwal, T. V. Marinov and M. K. War- [23] V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng,
muth. 2023. A mechanism for sample-efficient in-context S. V. Mehta, H. Zhuang, V. Q. Tran et al. 2022. Ext5:
learning for sparse retrieval tasks. ArXiv, abs/2305.17040. Towards extreme multi-task scaling for transfer learning.
In International Conference on Learning Representations.
[8] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel,
R. Thoppilan, Z. Yang, A. Kulshreshtha et al. 2020. To- [24] S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha,
wards a human-like open-domain chatbot. arXiv preprint K. Bhatia, I. Chami, F. Sala et al. 2022. Ask me anything:
arXiv:2001.09977. A simple strategy for prompting language models.

[9] R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville [25] A. Asai, T. Schick, P. Lewis, X. Chen, G. Izacard,
and M. Bellemare. 2021. Deep Reinforcement Learning S. Riedel, H. Hajishirzi and W.-t. Yih. 2022. Task-aware
at the Edge of the Statistical Precipice. In Advances in retrieval with instructions.
Neural Information Processing Systems, volume 34, pages
29304–29320. Curran Associates, Inc. [26] N. Asher, S. Bhar, A. Chaturvedi, J. Hunter and S. Paul.
2023. Limits for Learning with Language Models.
[10] M. Agrawal, S. Hegselmann, H. Lang, Y. Kim and D. Son- ArXiv:2306.12213 [cs].
tag. 2022. Large language models are zero-shot clinical
information extractors. arXiv preprint arXiv:2205.12689. [27] M. C. Ashton and K. Lee. 2009. The hexaco–60: A short
measure of the major dimensions of personality. Journal
[11] P. Agrawal, C. Alberti, F. Huot, J. Maynez, J. Ma, of personality assessment, 91(4):340–345.
S. Ruder, K. Ganchev, D. Das et al. 2022. Qameleon:
[28] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski,
Multilingual qa with only 5 examples. arXiv preprint
D. Dohan, E. Jiang, C. Cai et al. 2021. Program
arXiv:2211.08264.
synthesis with large language models. arXiv preprint
[12] G. Aher, R. I. Arriaga and A. T. Kalai. 2022. Using arXiv:2108.07732.
large language models to simulate multiple humans. arXiv [29] AUTOMATIC1111. 2023. Stable Diffusion web UI.
preprint arXiv:2208.10264. Original-date: 2022-08-22T14:05:26Z.
[13] O. Ahia, S. Kumar, H. Gonen, J. Kasai, D. R. Mortensen, [30] J. W. Ayers, A. Poliak, M. Dredze, E. C. Leas, Z. Zhu, J. B.
N. A. Smith and Y. Tsvetkov. 2023. Do all languages cost Kelley, D. J. Faix, A. M. Goodman et al. 2023. Comparing
the same? tokenization in the era of commercial language physician and artificial intelligence chatbot responses to
models. arXiv preprint arXiv:2305.13707. patient questions posted to a public social media forum.
JAMA internal medicine.
[14] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes,
B. David, C. Finn, K. Gopalakrishnan et al. 2022. Do as [31] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion,
i can, not as i say: Grounding language in robotic affor- A. Jones, A. Chen, A. Goldie et al. 2022. Constitu-
dances. arXiv preprint arXiv:2204.01691. tional ai: Harmlessness from ai feedback. arXiv preprint
arXiv:2212.08073.
[15] J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma,
Y. Zemlyanskiy, D. Uthus, M. Guo et al. 2023. Colt5: [32] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu,
Faster long-range transformers with conditional computa- R. Majumder, A. McNamara et al. 2018. Ms marco: A
tion. arXiv preprint arXiv:2303.09752. human generated machine reading comprehension dataset.

50
[33] P. Bajaj, C. Xiong, G. Ke, X. Liu, D. He, S. Tiwary, T.-Y. [47] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao,
Liu, P. Bennett et al. 2022. Metro: Efficient denoising pre- L. Golding, H. He, C. Leahy et al. 2022. Gpt-neox-20b:
training of large scale autoencoding language models with An open-source autoregressive language model.
model generated signals. arXiv preprint arXiv:2204.06644.
[48] A. Blair-Stanek, N. Holzenberger and B. Van Durme.
[34] A. Bakhtin, S. Gross, M. Ott, Y. Deng, M. Ranzato and 2023. Can gpt-3 perform statutory reasoning? arXiv
A. Szlam. 2019. Real or Fake? Learning to Discriminate preprint arXiv:2302.06100.
Machine from Human Generated Text. ArXiv:1906.03351
[cs, stat]. [49] J. Bommarito, M. Bommarito, D. M. Katz and J. Katz.
2023. Gpt as knowledge worker: A zero-shot evaluation of
[35] R. Balestriero, J. Pesenti and Y. LeCun. 2021. Learning (ai) cpa capabilities. arXiv preprint arXiv:2301.04408.
in high dimension always amounts to extrapolation. arXiv
preprint arXiv:2110.09485. [50] M. Bommarito II and D. M. Katz. 2022. Gpt takes the bar
exam. arXiv preprint arXiv:2212.14402.
[36] J. Bandy and N. Vincent. 2021. Addressing "documenta-
[51] L. Bonifacio, H. Abonizio, M. Fadaee and R. Nogueira.
tion debt" in machine learning research: A retrospective
2022. Inpars: Unsupervised dataset generation for infor-
datasheet for bookcorpus.
mation retrieval. In Proceedings of the 45th International
ACM SIGIR Conference on Research and Development in
[37] P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, Information Retrieval, SIGIR ’22, page 2387–2392, New
S. Hand, D. Hurt, M. Isard, H. Lim et al. 2022. Pathways: York, NY, USA. Association for Computing Machinery.
Asynchronous distributed dataflow for ml. Proceedings of
Machine Learning and Systems, 4:430–449. [52] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Ruther-
ford, K. Millican, G. v. d. Driessche, J.-B. Lespiau et al.
[38] M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, 2021. Improving language models by retrieving from tril-
J. Tworek and M. Chen. 2022. Efficient training of lions of tokens. arXiv preprint arXiv:2112.04426.
language models to fill in the middle. arXiv preprint
arXiv:2207.14255. [53] A. Borji. 2023. A Categorical Archive of ChatGPT Fail-
ures. ArXiv:2302.03494 [cs].
[39] N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky,
L. McKinney, S. Biderman and J. Steinhardt. 2023. Elic- [54] A. Borzunov, D. Baranchuk, T. Dettmers, M. Ryabinin,
iting latent predictions from transformers with the tuned Y. Belkada, A. Chumachenko, P. Samygin and C. Raffel.
lens. 2022. Petals: Collaborative inference and fine-tuning of
large models. arXiv preprint arXiv:2209.01188.
[40] E. Ben Zaken, Y. Goldberg and S. Ravfogel. 2022. Bit-
Fit: Simple parameter-efficient fine-tuning for transformer- [55] L. Bos, C. Schemer, N. Corbu, M. Hameleers, I. An-
based masked language-models. In Proceedings of the dreadis, A. Schulz, D. Schmuck, C. Reinemann et al. 2020.
60th Annual Meeting of the Association for Computational The effects of populism as a social identity frame on persua-
Linguistics (Volume 2: Short Papers), pages 1–9, Dublin, sion and mobilisation: Evidence from a 15-country experi-
Ireland. Association for Computational Linguistics. ment. European Journal of Political Research, 59(1):3–24.

[41] S. Biderman, K. Bicheno and L. Gao. 2022. Datasheet for [56] D. Britz, M. Y. Guan and M.-T. Luong. 2017. Efficient
the pile. arXiv preprint arXiv:2201.07311. attention using a fixed-size memory representation. arXiv
preprint arXiv:1707.00110.
[42] S. Biderman, U. S. Prashanth, L. Sutawika, H. Schoelkopf,
Q. Anthony, S. Purohit and E. Raff. 2023. Emergent [57] A. Z. Broder, M. Charikar, A. M. Frieze and M. Mitzen-
and Predictable Memorization in Large Language Mod- macher. 1998. Min-wise independent permutations. In
els. ArXiv:2304.11158 [cs]. Proceedings of the thirtieth annual ACM symposium on
Theory of computing, pages 327–336.
[43] S. Biderman and W. J. Scheirer. 2021. Pitfalls in machine [58] G. Brown, M. Bun, V. Feldman, A. Smith and K. Talwar.
learning research: Reexamining the development cycle. 2021. When is memorization of irrelevant training data
necessary for high-accuracy learning? In Proceedings of
[44] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, the 53rd annual ACM SIGACT symposium on theory of
K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit et al. computing, pages 123–132.
2023. Pythia: A suite for analyzing large language models
across training and scaling. In Proceedings of the 40th [59] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Ka-
International Conference on Machine Learning, volume plan, P. Dhariwal, A. Neelakantan, P. Shyam et al. 2020.
202 of Proceedings of Machine Learning Research, pages Language models are few-shot learners. In Advances in
2397–2430. PMLR. Neural Information Processing Systems, volume 33, pages
1877–1901. Curran Associates, Inc.
[45] S. R. Biderman. 2023. [...] we aren’t running out
of text data any time soon. ml researchers mas- [60] M. Brundage, S. Avin, J. Clark, H. Toner, P. Eckersley,
sively underestimate how much text is out there. B. Garfinkel, A. Dafoe, P. Scharre et al. 2018. The mali-
https://twitter.com/BlancheMinerva/ cious use of artificial intelligence: Forecasting, prevention,
status/1644154144431677442?s=20. Accessed: and mitigation. arXiv preprint arXiv:1802.07228.
2023-05-28.
[61] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke,
[46] A. Birhane, V. U. Prabhu and E. Kahembwe. 2021. Mul- E. Horvitz, E. Kamar, P. Lee, Y. T. Lee et al. 2023. Sparks
timodal datasets: misogyny, pornography, and malignant of artificial general intelligence: Early experiments with
stereotypes. arXiv preprint arXiv:2110.01963. gpt-4.

51
[62] C. Burns, H. Ye, D. Klein and J. Steinhardt. 2022. Dis- [78] M. Chen, A. Papangelis, C. Tao, S. Kim, A. Rosenbaum,
covering latent knowledge in language models without Y. Liu, Z. Yu and D. Hakkani-Tur. 2023. Places: Prompting
supervision. language models for social conversation synthesis. arXiv
preprint arXiv:2302.03269.
[63] A. Calderwood, N. Wardrip-Fruin and M. Mateas. 2022.
Spinning coherent interactive fiction through foundation [79] S. Chen, S. Wong, L. Chen and Y. Tian. 2023. Extending
model prompts. International Conference of Computation context window of large language models via positional
and Creativity. interpolation.

[64] N. Carlini, M. Jagielski, C. A. Choquette-Choo, D. Paleka, [80] T. Chen, Z. Zhang, A. Jaiswal, S. Liu and Z. Wang. 2023.
W. Pearce, H. Anderson, A. Terzis, K. Thomas et al. Sparse moe as the new dropout: Scaling dense and self-
2023. Poisoning Web-Scale Training Datasets is Practical. slimmable transformers.
ArXiv:2302.10149 [cs].
[81] X. Chen, M. Lin, N. Schärli and D. Zhou. 2023. Teach-
[65] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos and D. Song. ing large language models to self-debug. arXiv preprint
2019. The secret sharer: Evaluating and testing unintended arXiv:2304.05128.
memorization in neural networks. In USENIX Security
Symposium, volume 267. [82] L. Cheng, X. Li and L. Bing. 2023. Is gpt-4 a good data
analyst?
[66] N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, [83] D. Choe, R. Al-Rfou, M. Guo, H. Lee and N. Constant.
I. Gao, A. Awadalla, P. W. Koh, D. Ippolito et al. 2023. 2019. Bridging the Gap for Tokenizer-Free Language Mod-
Are aligned neural networks adversarially aligned? els. ArXiv:1908.10322 [cs].
[67] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert- [84] J. H. Choi, K. E. Hickman, A. Monahan and D. Schwarcz.
Voss, K. Lee, A. Roberts, T. Brown et al. 2020. Extracting 2023. Chatgpt goes to law school. Available at SSRN.
training data from large language models.
[85] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song,
[68] S. Casper, J. Lin, J. Kwon, G. Culp and D. Hadfield- A. Gane, T. Sarlos, P. Hawkins, J. Davis et al. 2020.
Menell. 2023. Explore, establish, exploit: Red team- Rethinking attention with performers. arXiv preprint
ing language models from scratch. arXiv preprint arXiv:2009.14794.
arXiv:2306.09442.
[86] A. Chowdhery, S. Narang, J. Devlin, M. Bosma,
[69] T. Chakrabarty, V. Padmakumar and H. He. 2022. Help G. Mishra, A. Roberts, P. Barham, H. W. Chung et al.
me write a poem: Instruction tuning as a vehicle for collab- 2022. Palm: Scaling language modeling with pathways.
orative poetry writing. arXiv preprint arXiv:2210.13669. arXiv preprint arXiv:2204.02311.

[70] I. Chalkidis, I. Androutsopoulos and N. Aletras. 2019. [87] M. Christ, S. Gunn and O. Zamir. 2023. Undetectable
Neural legal judgment prediction in english. arXiv preprint Watermarks for Language Models.
arXiv:1906.02059.
[88] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg
[71] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Ale- and D. Amodei. 2017. Deep reinforcement learning from
tras and I. Androutsopoulos. 2020. Legal-bert: The human preferences.
muppets straight out of law school. arXiv preprint
arXiv:2010.02559. [89] K. Christianson, A. Hollingworth, J. F. Halliwell and
F. Ferreira. 2001. Thematic roles assigned along the garden
[72] Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen, path linger. Cognitive psychology, 42(4):368–407.
L. Yang, X. Yi et al. 2023. A Survey on Evaluation of
Large Language Models. ArXiv:2307.03109 [cs]. [90] H. W. Chung. 2023. Missing model details (tweet).

[91] H. W. Chung, X. Garcia, A. Roberts, Y. Tay, O. Firat,


[73] B. Chen, X. Cheng, L. ao Gengyang, S. Li, X. Zeng,
S. Narang and N. Constant. 2023. Unimax: Fairer and
B. Wang, G. Jing, C. Liu et al. 2023. xtrimopglm: Uni-
more effective language sampling for large-scale multilin-
fied 100b-scale pre-trained transformer for deciphering the
gual pretraining. In The Eleventh International Conference
language of protein. bioRxiv.
on Learning Representations.
[74] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre [92] H. W. Chung, D. Garrette, K. C. Tan and J. Riesa. 2020.
and J. Jumper. 2023. Accelerating large language model Improving multilingual models with language-clustered vo-
decoding with speculative sampling. arXiv preprint cabularies. In Proceedings of the 2020 Conference on Em-
arXiv:2302.01318. pirical Methods in Natural Language Processing (EMNLP),
pages 4536–4546, Online. Association for Computational
[75] L. Chen, M. Zaharia and J. Zou. 2023. FrugalGPT: How Linguistics.
to Use Large Language Models While Reducing Cost and
Improving Performance. ArXiv:2305.05176 [cs]. [93] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay,
W. Fedus, Y. Li, X. Wang et al. 2022. Scaling instruction-
[76] L. Chen, M. Zaharia and J. Zou. 2023. How is ChatGPT’s finetuned language models.
behavior changing over time? ArXiv:2307.09009 [cs].
[94] J. H. Clark, D. Garrette, I. Turc and J. Wieting. 2022. Ca-
[77] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, nine: Pre-training an efficient tokenization-free encoder for
J. Kaplan, H. Edwards, Y. Burda et al. 2021. Evaluating language representation. Transactions of the Association
large language models trained on code. for Computational Linguistics, 10:73–91.

52
[95] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, [109] S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank,
L. Kaiser, M. Plappert, J. Tworek et al. 2021. Training P. Molino, J. Yosinski and R. Liu. 2020. Plug and play
verifiers to solve math word problems. language models: A simple approach to controlled text
generation.
[96] D. Cohen, M. Ryu, Y. Chow, O. Keller, I. Greenberg,
A. Hassidim, M. Fink, Y. Matias et al. 2022. Dynamic plan- [110] J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J.
ning in open-ended dialogue using reinforcement learning. Ragotte, L. F. Milles, B. I. M. Wicky, A. Courbet et al. 2022.
arXiv preprint arXiv:2208.02294. Robust deep learning&#x2013;based protein sequence de-
sign using proteinmpnn. Science, 378(6615):49–56.
[97] R. Cohen, M. Hamri, M. Geva and A. Globerson. 2023.
LM vs LM: Detecting Factual Errors via Cross Examina- [111] N. De Cao, W. Aziz and I. Titov. 2021. Editing fac-
tion. ArXiv:2305.13281 [cs]. tual knowledge in language models. In Proceedings of the
2021 Conference on Empirical Methods in Natural Lan-
[98] T. Computer. 2023. Redpajama: An open source recipe guage Processing, pages 6491–6506, Online and Punta
to reproduce llama training dataset. Cana, Dominican Republic. Association for Computational
Linguistics.
[99] A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimer-
sheim and A. Garriga-Alonso. 2023. Towards automated [112] M. Dehghani, A. Arnab, L. Beyer, A. Vaswani and Y. Tay.
circuit discovery for mechanistic interpretability. arXiv 2022. The Efficiency Misnomer. ArXiv:2110.12894 [cs,
preprint arXiv:2304.14997. stat].

[100] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, [113] M. Dehghani, Y. Tay, A. A. Gritsenko, Z. Zhao,
G. Wenzek, F. Guzmán, E. Grave, M. Ott et al. 2020. Unsu- N. Houlsby, F. Diaz, D. Metzler and O. Vinyals. 2021.
pervised cross-lingual representation learning at scale. In The benchmark lottery. arXiv preprint arXiv:2107.07002.
Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 8440–8451, Online. [114] L. Del Corro, A. Del Giorno, S. Agarwal, B. Yu,
Association for Computational Linguistics. A. Awadallah and S. Mukherjee. 2023. SkipDecode: Au-
toregressive Skip Decoding with Batching and Caching for
Efficient LLM Inference. ArXiv:2307.02628 [cs].
[101] A. B. Cyphert. 2021. A human being wrote this law
review article: Gpt-3 and the practice of law. UC Davis L.
[115] A. Deroy, K. Ghosh and S. Ghosh. 2023. How ready
Rev., 55:401.
are pre-trained abstractive models and llms for legal case
judgement summarization?
[102] D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang and F. Wei.
2022. Knowledge neurons in pretrained transformers. In
[116] A. Deshpande, V. Murahari, T. Rajpurohit, A. Kalyan
Proceedings of the 60th Annual Meeting of the Association
and K. Narasimhan. 2023. Toxicity in chatgpt: Analyz-
for Computational Linguistics (Volume 1: Long Papers),
ing persona-assigned language models. arXiv preprint
pages 8493–8502, Dublin, Ireland. Association for Com-
arXiv:2304.05335.
putational Linguistics.
[117] T. Dettmers, M. Lewis, Y. Belkada and L. Zettlemoyer.
[103] D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui and F. Wei. 2022. Llm.int8(): 8-bit matrix multiplication for transform-
2022. Why can gpt learn in-context? language models se- ers at scale.
cretly perform gradient descent as meta optimizers. arXiv
preprint arXiv:2212.10559. [118] T. Dettmers, A. Pagnoni, A. Holtzman and L. Zettle-
moyer. 2023. QLoRA: Efficient Finetuning of Quantized
[104] H. Dai, Z. Liu, W. Liao, X. Huang, Z. Wu, L. Zhao, LLMs. ArXiv:2305.14314 [cs].
W. Liu, N. Liu et al. 2023. Chataug: Leveraging chatgpt
for text data augmentation. [119] T. Dettmers, R. Svirschevski, V. Egiazarian,
D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov,
[105] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le and T. Hoefler et al. 2023. Spqr: A sparse-quantized represen-
R. Salakhutdinov. 2019. Transformer-XL: Attentive lan- tation for near-lossless llm weight compression. arXiv
guage models beyond a fixed-length context. In Proceed- preprint arXiv:2306.03078.
ings of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 2978–2988, Florence, [120] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova. 2019.
Italy. Association for Computational Linguistics. BERT: Pre-training of deep bidirectional transformers for
language understanding. In Proceedings of the 2019 Con-
[106] H. Dalla-Torre, L. Gonzalez, J. Mendoza Revilla, ference of the North American Chapter of the Association
N. Lopez Carranza, A. Henryk Grywaczewski, F. Oteri, for Computational Linguistics: Human Language Tech-
C. Dallago, E. Trop et al. 2023. The nucleotide trans- nologies, Volume 1 (Long and Short Papers), pages 4171–
former: Building and evaluating robust foundation models 4186, Minneapolis, Minnesota. Association for Computa-
for human genomics. bioRxiv, pages 2023–01. tional Linguistics.

[107] T. Dao, D. Y. Fu, S. Ermon, A. Rudra and C. Ré. 2022. [121] N. Dey, G. Gosal, Zhiming, Chen, H. Khachane, W. Mar-
Flashattention: Fast and memory-efficient exact attention shall, R. Pathria, M. Tom et al. 2023. Cerebras-gpt: Open
with io-awareness. arXiv preprint arXiv:2205.14135. compute-optimal language models trained on the cerebras
wafer-scale cluster.
[108] T. Dao, D. Y. Fu, K. K. Saab, A. W. Thomas,
A. Rudra and C. Ré. 2023. Hungry Hungry Hippos: [122] S. Diao, X. Li, Y. Lin, Z. Huang and T. Zhang. 2022.
Towards Language Modeling with State Space Models. Black-box prompt learning for pre-trained language mod-
ArXiv:2212.14052 [cs]. els. arXiv preprint arXiv:2201.08531.

53
[123] B. Ding, C. Qin, L. Liu, L. Bing, S. Joty and B. Li. [137] E.-M. El-Mhamdi, S. Farhadkhani, R. Guerraoui,
2022. Is gpt-3 a good data annotator? arXiv preprint N. Gupta, L.-N. Hoang, R. Pinot, S. Rouault and J. Stephan.
arXiv:2212.10450. 2023. On the Impossible Safety of Large AI Models.
ArXiv:2209.15259 [cs].
[124] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang
and F. Wei. 2023. Longnet: Scaling transformers to [138] N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph,
1,000,000,000 tokens. B. Mann, A. Askell, Y. Bai et al. 2021. A mathematical
framework for transformer circuits. Transformer Circuits
[125] J. Dodge, M. Sap, A. Marasović, W. Agnew, G. Il- Thread.
harco, D. Groeneveld, M. Mitchell and M. Gardner.
2021. Documenting large webtext corpora: A case study
on the colossal clean crawled corpus. arXiv preprint [139] A. Elnaggar, M. Heinzinger, C. Dallago, G. Rihawi,
arXiv:2104.08758. Y. Wang, L. Jones, T. Gibbs, T. Feher et al. 2020. Prottrans:
towards cracking the language of life’s code through self-
[126] R. Dominguez-Olmedo, M. Hardt and C. Mendler- supervised deep learning and high performance computing.
Dünner. 2023. Questioning the survey responses of large arXiv preprint arXiv:2007.06225.
language models. arXiv preprint arXiv:2306.07951.
[140] T. Eloundou, S. Manning, P. Mishkin and D. Rock. 2023.
[127] Q. Dong, D. Dai, Y. Song, J. Xu, Z. Sui and L. Li. 2022. Gpts are gpts: An early look at the labor market impact
Calibrating factual knowledge in pretrained language mod- potential of large language models.
els. In Findings of the Association for Computational
Linguistics: EMNLP 2022, pages 5937–5947, Abu Dhabi, [141] F. Faal, K. Schmitt and J. Y. Yu. 2023. Reward model-
United Arab Emirates. Association for Computational Lin- ing for mitigating toxicity in transformer-based language
guistics. models. Applied Intelligence, 53(7):8421–8435.
[128] D. R. Dowty, R. Wall and S. Peters. 2012. Introduction
[142] A. Fan, C. Gardent, C. Braud and A. Bordes. 2021. Aug-
to Montague semantics, volume 11. Springer Science &
menting transformers with KNN-based composite memory
Business Media.
for dialog. Transactions of the Association for Computa-
[129] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, tional Linguistics, 9:82–99.
B. Ichter, A. Wahid, J. Tompson et al. 2023. Palm-e: An
embodied multimodal language model. arXiv preprint [143] A. Fan, E. Grave and A. Joulin. 2020. Reducing trans-
arXiv:2303.03378. former depth on demand with structured dropout. In Inter-
national Conference on Learning Representations.
[130] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu,
M. Krikun, Y. Zhou et al. 2022. Glam: Efficient scaling [144] M. Fathi, J. Pilault, P.-L. Bacon, C. Pal, O. Fi-
of language models with mixture-of-experts. In Interna- rat and R. Goroshin. 2023. Block-State Transformer.
tional Conference on Machine Learning, pages 5547–5569. ArXiv:2306.09539 [cs].
PMLR.
[145] W. Fedus, B. Zoph and N. Shazeer. 2021. Switch trans-
[131] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum and
formers: Scaling to trillion parameter models with simple
I. Mordatch. 2023. Improving Factuality and Reason-
and efficient sparsity.
ing in Language Models through Multiagent Debate.
ArXiv:2305.14325 [cs].
[146] V. Feldman. 2020. Does learning require memorization?
[132] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang and a short tale about a long tail. In Proceedings of the 52nd
J. Tang. 2022. GLM: General language model pretrain- Annual ACM SIGACT Symposium on Theory of Computing,
ing with autoregressive blank infilling. In Proceedings pages 954–959.
of the 60th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers), pages [147] S. Feng, C. Y. Park, Y. Liu and Y. Tsvetkov. 2023.
320–335, Dublin, Ireland. Association for Computational From Pretraining Data to Language Models to Downstream
Linguistics. Tasks: Tracking the Trails of Political Biases Leading to
Unfair NLP Models. ArXiv:2305.08283 [cs].
[133] A. Dunn, J. Dagdelen, N. Walker, S. Lee, A. S.
Rosen, G. Ceder, K. Persson and A. Jain. 2022. Struc- [148] W. Feng, W. Zhu, T.-j. Fu, V. Jampani, A. Akula, X. He,
tured information extraction from complex scientific text S. Basu, X. E. Wang et al. 2023. LayoutGPT: Compo-
with fine-tuned large language models. arXiv preprint sitional Visual Planning and Generation with Large Lan-
arXiv:2212.05238. guage Models. ArXiv:2305.15393 [cs].
[134] D. Duong and B. D. Solomon. 2023. Analysis of large-
language model versus human performance for genetics [149] E. Ferrara. 2023. Should chatgpt be biased? challenges
questions. European Journal of Human Genetics, pages and risks of bias in large language models. arXiv preprint
1–3. arXiv:2304.03738.

[135] N. Dziri, X. Lu, M. Sclar, X. L. Li, L. Jiang, B. Y. Lin, [150] A. Ficek, F. Liu and N. Collier. 2022. How to tackle
P. West, C. Bhagavatula et al. 2023. Faith and Fate: Limits an emerging topic? combining strong and weak labels
of Transformers on Compositionality. ArXiv:2305.18654 for covid news NER. In Proceedings of the 2nd Confer-
[cs]. ence of the Asia-Pacific Chapter of the Association for
Computational Linguistics and the 12th International Joint
[136] N. Dziri, A. Madotto, O. Zaiane and A. J. Bose. 2021. Conference on Natural Language Processing (Volume 2:
Neural Path Hunter: Reducing Hallucination in Dialogue Short Papers), pages 488–496, Online only. Association
Systems via Path Grounding. ArXiv:2104.08455 [cs]. for Computational Linguistics.

54
[151] C. Fourrier, N. Habib, J. Launay and T. Wolf. [167] S. Gehman, S. Gururangan, M. Sap, Y. Choi and N. A.
2023. What’s going on with the open llm leader- Smith. 2020. Realtoxicityprompts: Evaluating neural
board? Available from: https://huggingface. toxic degeneration in language models. arXiv preprint
co/blog/evaluating-mmlu-leaderboard. Ac- arXiv:2009.11462.
cessed: 27/06/2023.
[168] S. Gehrmann, H. Strobelt and A. M. Rush. 2019. GLTR:
[152] E. Frantar and D. Alistarh. 2023. Massive language mod- Statistical Detection and Visualization of Generated Text.
els can be accurately pruned in one-shot. arXiv preprint ArXiv:1906.04043 [cs].
arXiv:2301.00774.
[169] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel,
[153] E. Frantar, S. Ashkboos, T. Hoefler and D. Alis- W. Brendel, M. Bethge and F. A. Wichmann. 2020. Short-
tarh. 2022. Gptq: Accurate post-training quantization cut learning in deep neural networks. Nature Machine
for generative pre-trained transformers. arXiv preprint Intelligence, 2(11):665–673.
arXiv:2210.17323.
[170] A. Glaese, N. McAleese, M. Tr˛ebacz, J. Aslanides,
[154] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger et al. 2022.
F. Shi, R. Zhong, W.-t. Yih et al. 2022. Incoder: A genera- Improving alignment of dialogue agents via targeted human
tive model for code infilling and synthesis. judgements.
[155] A. Frömmgen and L. Kharatyan. 2023. Resolv- [171] D. Goldberg. 1991. What every computer scientist
ing code review comments with ml. Available from: should know about floating-point arithmetic. ACM Com-
https://ai.googleblog.com/2023/05/ puting Surveys, 23(1):5–48.
resolving-code-review-comments-with-ml.
html. Accessed: 26/06/2023. [172] A. N. Gomez, O. Key, K. Perlin, S. Gou, N. Frosst,
J. Dean and Y. Gal. 2022. Interlocking backpropagation:
[156] J. Fu, S.-K. Ng, Z. Jiang and P. Liu. 2023. Gptscore: Improving depthwise model-parallelism. The Journal of
Evaluate as you desire. arXiv preprint arXiv:2302.04166. Machine Learning Research, 23(1):7714–7741.
[157] T. Fujii, K. Shibata, A. Yamaguchi, T. Morishita and [173] L. Gong, D. He, Z. Li, T. Qin, L. Wang and T. Liu.
Y. Sogawa. 2023. How do different tokenizers perform on 2019. Efficient training of BERT by progressively stacking.
downstream tasks in scriptio continua languages?: A case In Proceedings of the 36th International Conference on
study in japanese. arXiv preprint arXiv:2306.09572. Machine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 2337–2346. PMLR.
[158] I. Gabriel. 2020. Artificial intelligence, values, and align-
ment. Minds and machines, 30(3):411–437. [174] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan
and W. Chen. 2023. Critic: Large language models can
[159] S. Gadgil, A. R. Tadipatri, A. Agrawal, A. Narayanan
self-correct with tool-interactive critiquing. arXiv preprint
and N. Goyal. 2022. Towards automating formalisation of
arXiv:2305.11738.
theorem statements using large language models. 36th
Conference on Neural Information Processing Systems [175] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz
(NeurIPS 2022) Workshop on MATH-AI. and M. Fritz. 2023. More than you’ve asked for: A
[160] T. Gale, D. Narayanan, C. Young and M. Zaharia. 2022. comprehensive analysis of novel prompt injection threats
Megablocks: Efficient sparse training with mixture-of- to application-integrated large language models. arXiv
experts. arXiv preprint arXiv:2211.15841. preprint arXiv:2302.12173.

[161] T. Gale, M. Zaharia, C. Young and E. Elsen. [176] L. D. Griffin, B. Kleinberg, M. Mozes, K. T. Mai, M. Vau,
2020. Sparse GPU Kernels for Deep Learning. M. Caldwell and A. Marvor-Parker. 2023. Susceptibil-
ArXiv:2006.10901 [cs, stat]. ity to influence of large language models. arXiv preprint
arXiv:2303.06074.
[162] K. Gandhi, J.-P. Fränken, T. Gerstenbrg and N. D.
Goodman. 2023. Understanding social reasoning in lan- [177] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu,
guage models with language models. arXiv preprint T. Naumann, J. Gao et al. 2021. Domain-specific language
arXiv:2306.15448. model pretraining for biomedical natural language pro-
cessing. ACM Transactions on Computing for Healthcare
[163] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, (HEALTH), 3(1):1–23.
S. Kadavath, B. Mann, E. Perez et al. 2022. Red
teaming language models to reduce harms: Methods, [178] Y. Gu, S. Zhang, N. Usuyama, Y. Woldesenbet, C. Wong,
scaling behaviors, and lessons learned. arXiv preprint P. Sanapathi, M. Wei, N. Valluri et al. 2023. Distilling large
arXiv:2209.07858. language models for biomedical knowledge extraction: A
case study on adverse drug events.
[164] J. Gao, X. Ding, B. Qin and T. Liu. 2023. Is chatgpt a
good causal reasoner? a comprehensive evaluation. arXiv [179] Y. Gu, X. Han, Z. Liu and M. Huang. 2022. PPT:
preprint arXiv:2305.07375. Pre-trained prompt tuning for few-shot learning. In Pro-
ceedings of the 60th Annual Meeting of the Association
[165] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, for Computational Linguistics (Volume 1: Long Papers),
C. Foster, J. Phang, H. He et al. 2020. The pile: An pages 8410–8423, Dublin, Ireland. Association for Com-
800gb dataset of diverse text for language modeling. arXiv putational Linguistics.
preprint arXiv:2101.00027.
[180] A. Gudibande, E. Wallace, C. Snell, X. Geng, H. Liu,
[166] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi, C. Fos- P. Abbeel, S. Levine and D. Song. 2023. The false
ter, L. Golding, J. Hsu et al. 2021. A framework for few- promise of imitating proprietary llms. arXiv preprint
shot language model evaluation. arXiv:2305.15717.

55
[181] J. P. Guilford. 1967. Creativity: Yesterday, today and [197] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika,
tomorrow. The Journal of Creative Behavior, 1(1):3–14. D. Song and J. Steinhardt. 2021. Measuring massive multi-
task language understanding.
[182] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. D.
Giorno, S. Gopi, M. Javaheripi, P. Kauffmann et al. 2023. [198] D. Hendrycks, N. Carlini, J. Schulman and J. Steinhardt.
Textbooks are all you need. 2021. Unsolved problems in ml safety. arXiv preprint
arXiv:2109.13916.
[183] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H.
Sung and Y. Yang. 2022. LongT5: Efficient text-to-text [199] D. Hendrycks and M. Mazeika. 2022. X-risk analysis
transformer for long sequences. In Findings of the Associ- for ai research. arXiv preprint arXiv:2206.05862.
ation for Computational Linguistics: NAACL 2022, pages
724–736, Seattle, United States. Association for Computa- [200] D. Hernandez, T. Brown, T. Conerly, N. DasSarma,
tional Linguistics. D. Drain, S. El-Showk, N. Elhage, Z. Hatfield-Dodds et al.
2022. Scaling laws and interpretability of learning from
[184] A. Gupta. 2023. Probing Quantifier Comprehension in repeated data. arXiv preprint arXiv:2205.10487.
Large Language Models. ArXiv:2306.07384 [cs].
[201] J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun,
H. Kianinejad, M. Patwary, M. Ali et al. 2017. Deep
[185] T. Gupta and A. Kembhavi. 2022. Visual programming:
learning scaling is predictable, empirically. arXiv preprint
Compositional visual reasoning without training.
arXiv:1712.00409.
[186] K. Guu, K. Lee, Z. Tung, P. Pasupat and M. Chang. [202] B. L. Hie, V. R. Shanker, D. Xu, T. U. Bruun, P. A.
2020. Retrieval augmented language model pre-training. Weidenbacher, S. Tang, W. Wu, J. E. Pak et al. 2023. Effi-
In International Conference on Machine Learning, pages cient evolution of human antibodies from general protein
3929–3938. PMLR. language models. Nature Biotechnology.
[187] J. Haase and P. H. P. Hanel. 2023. Artificial muses: Gen- [203] P. Hingston and M. Preuss. 2011. Red teaming with
erative artificial intelligence chatbots have risen to human- coevolution. In 2011 IEEE Congress of Evolutionary Com-
level creativity. putation (CEC), pages 1155–1163. IEEE.
[188] M. Hahn and N. Goyal. 2023. A theory of emergent [204] J. Ho and T. Salimans. 2022. Classifier-free diffusion
in-context learning as implicit structure induction. ArXiv, guidance.
abs/2303.07971.
[205] J. Hoelscher-Obermaier, J. Persson, E. Kran, I. Kon-
[189] S. Hamilton. 2023. Blind judgement: Agent-based stas and F. Barez. 2023. Detecting Edit Failures In Large
supreme court modelling with gpt. arXiv preprint Language Models: An Improved Specificity Benchmark.
arXiv:2301.05327. ArXiv:2305.17553 [cs].

[190] C. Han, Z. Wang, H. Zhao and H. Ji. 2023. In-context [206] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya,
learning of large language models explained as kernel re- T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks
gression. ArXiv, abs/2305.12766. et al. 2022. An empirical analysis of compute-optimal
large language model training. In Advances in Neural
[191] T. Hartvigsen, S. Sankaranarayanan, H. Palangi, Y. Kim Information Processing Systems.
and M. Ghassemi. 2022. Aging with grace: Lifelong model
editing with discrete key-value adaptors. arXiv preprint [207] A. Holtzman, J. Buys, L. Du, M. Forbes and Y. Choi.
arXiv:2211.11031. 2020. The curious case of neural text degeneration. In
International Conference on Learning Representations.
[192] A. Haviv, O. Ram, O. Press, P. Izsak and O. Levy. 2022.
Transformer language models without positional encodings [208] N. Holzenberger, A. Blair-Stanek and B. Van Durme.
still learn positional information. In Findings of the Associ- 2020. A dataset for statutory reasoning in tax law
ation for Computational Linguistics: EMNLP 2022, pages entailment and question answering. arXiv preprint
1382–1390, Abu Dhabi, United Arab Emirates. Association arXiv:2005.05257.
for Computational Linguistics.
[209] O. Honovich, T. Scialom, O. Levy and T. Schick. 2022.
Unnatural instructions: Tuning language models with (al-
[193] J. Hazell. 2023. Large language models can be used to
most) no human labor. arXiv preprint arXiv:2212.09689.
effectively scale spear phishing campaigns. arXiv preprint
arXiv:2305.06972. [210] S. Hooker. 2021. The hardware lottery. Communications
of the ACM, 64(12):58–65.
[194] E. L. Henderson, S. J. Westwood and D. J. Simons. 2022.
A reproducible systematic map of research on the illusory [211] J. J. Horton. 2023. Large language models as simulated
truth effect. Psychonomic Bulletin & Review, pages 1–24. economic agents: What can we learn from homo silicus?
arXiv preprint arXiv:2301.07543.
[195] P. Henderson, M. S. Krass, L. Zheng, N. Guha, C. D.
Manning, D. Jurafsky and D. E. Ho. 2022. Pile of law: [212] M. Horton, S. Mehta, A. Farhadi and M. Rastegari. 2023.
Learning responsible data filtering from the law and a Bytes Are All You Need: Transformers Operating Directly
256GB open-source legal dataset. In Thirty-sixth Confer- On File Bytes. ArXiv:2306.00238 [cs].
ence on Neural Information Processing Systems Datasets
and Benchmarks Track. [213] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone,
Q. De Laroussilhe, A. Gesmundo, M. Attariyan and
[196] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, S. Gelly. 2019. Parameter-efficient transfer learning for
D. Song and J. Steinhardt. 2020. Aligning ai with shared nlp. In International Conference on Machine Learning,
human values. arXiv preprint arXiv:2008.02275. pages 2790–2799. PMLR.

56
[214] D. Houser and K. McCabe. 2014. Experimental eco- [229] HuggingFace. 2023. Huggingchat v0.3.0. Available
nomics and experimental game theory. In Neuroeconomics, from: https://huggingface.co/chat. Accessed:
pages 19–34. Elsevier. 28/06/2023.

[215] J. Howard and S. Ruder. 2018. Universal language [230] C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu,
model fine-tuning for text classification. In Proceedings Z. Wang, R. Salas et al. 2022. Tutel: Adaptive mixture-of-
of the 56th Annual Meeting of the Association for Com- experts at scale. arXiv preprint arXiv:2206.03382.
putational Linguistics (Volume 1: Long Papers), pages
328–339, Melbourne, Australia. Association for Computa- [231] J. P. A. Ioannidis. 2005. Why Most Published Research
tional Linguistics. Findings Are False. PLoS Medicine, 2(8):e124.

[216] S. Hsiao. 2023. What’s ahead for bard: More [232] D. Ippolito, A. Yuan, A. Coenen and S. Burnam. 2022.
global, more visual, more integrated. Available Creative writing with an ai-powered writing assistant:
from: https://blog.google/technology/ai/ Perspectives from professional writers. arXiv preprint
google-bard-updates-io-2023/. Accessed: arXiv:2211.05030.
28/06/2023.
[233] G. Irving, P. Christiano and D. Amodei. 2018. Ai safety
[217] B. Hu, J. Xia, J. Zheng, C. Tan, Y. Huang, Y. Xu and S. Z. via debate. arXiv preprint arXiv:1805.00899.
Li. 2022. Protein language models and structure prediction:
Connection and progression. [234] K. Y. Iu and V. M.-Y. Wong. 2023. Chatgpt by openai:
The end of litigation lawyers? Available at SSRN.
[218] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang,
L. Wang and W. Chen. 2021. Lora: Low-rank adaptation [235] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig,
of large language models. P. Yu, K. Shuster, T. Wang et al. 2022. Opt-iml: Scaling
language model instruction meta learning through the lens
[219] Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. of generalization.
Lee, L. Bing and S. Poria. 2023. Llm-adapters: An adapter
family for parameter-efficient fine-tuning of large language [236] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni,
models. arXiv preprint arXiv:2304.01933. T. Schick, J. Dwivedi-Yu, A. Joulin et al. 2022. Few-shot
learning with retrieval augmented language models. arXiv
[220] W. Hua, Z. Dai, H. Liu and Q. Le. 2022. Transformer preprint arXiv:2208.03299.
Quality in Linear Time. In Proceedings of the 39th Interna-
tional Conference on Machine Learning, pages 9099–9117. [237] A. Jacovi, A. Caciularu, O. Goldman and Y. Goldberg.
PMLR. ISSN: 2640-3498. 2023. Stop Uploading Test Data in Plain Text: Practical
Strategies for Mitigating Data Contamination by Evalua-
[221] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon,
tion Benchmarks. ArXiv:2305.10160 [cs].
C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman et al.
2019. Music transformer. In International Conference on [238] N. Jain, K. Saifullah, Y. Wen, J. Kirchenbauer, M. Shu,
Learning Representations. A. Saha, M. Goldblum, J. Geiping et al. 2023. Bring your
own data! self-supervised evaluation for large language
[222] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu and
models. arXiv preprint arXiv:23062.13651.
J. Han. 2022. Large language models can self-improve.

[223] J. Huang and K. C.-C. Chang. 2023. Towards Reasoning [239] J. Jang, S. Kim, S. Ye, D. Kim, L. Logeswaran, M. Lee,
in Large Language Models: A Survey. ArXiv:2212.10403 K. Lee and M. Seo. 2023. Exploring the Benefits of
[cs]. Training Expert Language Models over Instruction Tuning.
ArXiv:2302.03202 [cs].
[224] W. Huang, P. Abbeel, D. Pathak and I. Mordatch. 2022.
Language models as zero-shot planners: Extracting action- [240] J. R. Jeliazkov, D. del Alamo and J. D. Karpiak.
able knowledge for embodied agents. In International Con- 2023. Esmfold hallucinates native-like protein sequences.
ference on Machine Learning, pages 9118–9147. PMLR. bioRxiv.

[225] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, [241] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii,
A. Zeng, J. Tompson et al. 2022. Inner monologue: Em- Y. J. Bang et al. 2023. Survey of Hallucination in Natural
bodied reasoning through planning with language models. Language Generation. ACM Computing Surveys, 55(12):1–
arXiv preprint arXiv:2207.05608. 38.

[226] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, [242] G. Jiang, M. Xu, S.-C. Zhu, W. Han, C. Zhang and
D. Chen, H. Lee, J. Ngiam et al. 2018. Gpipe: Efficient Y. Zhu. 2022. Mpi: Evaluating and inducing person-
training of giant neural networks using pipeline parallelism. ality in pre-trained language models. arXiv preprint
arXiv:2206.07550.
[227] Z. Huang, Y. Shen, X. Zhang, J. Zhou, W. Rong and
Z. Xiong. 2023. Transformer-patcher: One mistake worth [243] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li,
one neuron. In The Eleventh International Conference on F. Wang and Q. Liu. 2020. TinyBERT: Distilling BERT
Learning Representations. for natural language understanding. In Findings of the
Association for Computational Linguistics: EMNLP 2020,
[228] I. Hubara, B. Chmiel, M. Island, R. Banner, J. Naor pages 4163–4174, Online. Association for Computational
and D. Soudry. 2021. Accelerated sparse neural training: Linguistics.
A provable and efficient method to find n:m transposable
masks. In Advances in Neural Information Processing Sys- [244] Z. Jin, J. Liu, Z. Lyu, S. Poff, M. Sachan, R. Mihalcea,
tems, volume 34, pages 21099–21111. Curran Associates, M. Diab and B. Schölkopf. 2023. Can large language
Inc. models infer causation from correlation?

57
[245] A. Jinich, S. Z. Nazia, A. V. Tellez, D. Rappoport, [260] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Miku-
M. AlQuraishi and K. Rhee. 2022. Predicting enzyme lik and G. Irving. 2021. Alignment of language agents.
substrate chemical structure with protein language models. arXiv preprint arXiv:2103.14659.
bioRxiv, pages 2022–09.
[261] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong
[246] Jonathan Frankle [@jefrankle]. 2022. Louder for the and R. Socher. 2019. Ctrl: A conditional transformer
people in the back: LARGE MODELS (GPT, DALLE) language model for controllable generation. arXiv preprint
= DATABASES PROMPTS = QUERIES OUTPUTS = arXiv:1909.05858.
RESPONSES NNs find new relations w/in data. Anyone,
no matter the resources, can study better querying langs [262] O. Khattab, K. Santhanam, X. L. Li, D. Hall, P. Liang,
and possibly beat a big model they could never afford to C. Potts and M. Zaharia. 2023. Demonstrate-Search-
train. Predict: Composing retrieval and language models for
knowledge-intensive NLP. ArXiv:2212.14024 [cs].
[247] D. Jones. 2022. Development and evaluation of speech
recognition for the Welsh language. In Proceedings of [263] D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger,
the 4th Celtic Language Technology Workshop within Z. Wu, B. Vidgen, G. Prasad et al. 2021. Dyn-
LREC2022, pages 52–59, Marseille, France. European Lan- abench: Rethinking benchmarking in nlp. arXiv preprint
guage Resources Association. arXiv:2104.14337.
[248] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, [264] J. Kim, M. Kim and B. Mozafari. 2022. Provable memo-
O. Ronneberger, K. Tunyasuvunakool, R. Bates et al. 2021. rization capacity of transformers. In The Eleventh Interna-
Highly accurate protein structure prediction with alphafold. tional Conference on Learning Representations.
Nature, 596(7873):583–589.
[265] S. Kim, K. Mangalam, J. Malik, M. W. Mahoney,
[249] J. Kaddour. 2022. Stop wasting my time! saving days
A. Gholami and K. Keutzer. 2023. Big little transformer
of imagenet and bert training with latest weight averaging.
decoder. arXiv preprint arXiv:2302.07863.
arXiv preprint arXiv:2209.14981.
[250] J. Kaddour. 2023. The MiniPile Challenge for Data- [266] T. Kim. 2022. Revisiting the practical effectiveness of
Efficient Language Models. ArXiv:2304.08442 [cs]. constituency parse extraction from pre-trained language
models. In Proceedings of the 29th International Con-
[251] J. Kaddour, O. Key, P. Nawrot, P. Minervini and M. J. ference on Computational Linguistics, pages 5398–5408,
Kusner. 2023. No Train No Gain: Revisiting Efficient Gyeongju, Republic of Korea. International Committee on
Training Algorithms For Transformer-based Language Computational Linguistics.
Models. ArXiv:2307.06440 [cs].
[267] L. N. Kinch, R. D. Schaeffer, A. Kryshtafovych and
[252] J. Kaddour, L. Liu, R. Silva and M. Kusner. 2022. When N. V. Grishin. 2021. Target classification in the 14th round
do flat minima optimizers work? In Advances in Neural of the critical assessment of protein structure prediction
Information Processing Systems. (casp14). Proteins: Structure, Function, and Bioinformat-
ics, 89(12):1618–1632.
[253] J. Kaddour, A. Lynch, Q. Liu, M. J. Kusner and R. Silva.
2022. Causal machine learning: A survey and open prob- [268] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers
lems. arXiv preprint arXiv:2206.15475. and T. Goldstein. 2023. A Watermark for Large Language
Models. ArXiv:2301.10226 [cs].
[254] J. Kaddour, Y. Zhu, Q. Liu, M. J. Kusner and R. Silva.
2021. Causal Effect Inference for Structured Treatments. [269] J. Kirchenbauer, J. Geiping, Y. Wen, M. Shu, K. Sai-
In Advances in Neural Information Processing Systems, fullah, K. Kong, K. Fernando, A. Saha et al. 2023. On
volume 34, pages 24841–24854. Curran Associates, Inc. the Reliability of Watermarks for Large Language Models.
ArXiv:2306.04634 [cs].
[255] M. Kale, A. Siddhant, R. Al-Rfou, L. Xue, N. Con-
stant and M. Johnson. 2021. nmT5 - is parallel data still [270] R. A. Klein, M. Vianello, F. Hasselman, B. G. Adams,
relevant for pre-training massively multilingual language R. B. Adams Jr, S. Alper, M. Aveyard, J. R. Axt et al. 2018.
models? In Proceedings of the 59th Annual Meeting of Many labs 2: Investigating variation in replicability across
the Association for Computational Linguistics and the 11th samples and settings. Advances in Methods and Practices
International Joint Conference on Natural Language Pro- in Psychological Science, 1(4):443–490.
cessing (Volume 2: Short Papers), pages 683–691, Online.
Association for Computational Linguistics. [271] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M.
[256] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, Ferrandis, Y. Jernite, M. Mitchell et al. 2022. The stack: 3
B. Chess, R. Child, S. Gray, A. Radford et al. 2020. Scal- tb of permissively licensed source code.
ing laws for neural language models. arXiv preprint
[272] J. Kocoń, I. Cichecki, O. Kaszyca, M. Kochanek, D. Szy-
arXiv:2001.08361.
dło, J. Baran, J. Bielaniewicz, M. Gruza et al. 2023. Chat-
[257] A. Karpathy. 2023. Tokenization issues (tweet). gpt: Jack of all trades, master of none.

[258] D. M. Katz, M. J. Bommarito, S. Gao and P. Arredondo. [273] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo and Y. Iwasawa.
2023. Gpt-4 passes the bar exam. Available at SSRN 2022. Large language models are zero-shot reasoners. In
4389233. Advances in Neural Information Processing Systems.

[259] A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das [274] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-
and S. Reddy. 2023. The impact of positional encoding R. Tam, K. Stevens, A. Barhoum, N. M. Duc et al. 2023.
on length generalization in transformers. arXiv preprint Openassistant conversations–democratizing large language
arXiv:2305.19466. model alignment. arXiv preprint arXiv:2304.07327.

58
[275] T. Korbak, K. Shi, A. Chen, R. Bhalerao, C. L. Buckley, [290] A. Lazaridou, E. Gribovskaya, W. Stokowiec and
J. Phang, S. R. Bowman and E. Perez. 2023. Pretraining N. Grigorev. 2022. Internet-augmented language mod-
language models with human preferences. arXiv preprint els through few-shot prompting for open-domain question
arXiv:2302.08582. answering.

[276] D. M. Korngiebel and S. D. Mooney. 2021. Consider- [291] A. Lee, B. Miranda and S. Koyejo. 2023. Beyond Scale:
ing the possibilities and pitfalls of generative pre-trained the Diversity Coefficient as a Data Quality Metric Demon-
transformer 3 (gpt-3) in healthcare delivery. NPJ Digital strates LLMs are Pre-trained on Formally Diverse Data.
Medicine, 4(1):1–3. ArXiv:2306.13840 [cs].

[277] M. Kosinski. 2023. Theory of mind may have sponta- [292] D. Lee, J. Lee, J.-W. Ha, J.-H. Kim, S.-W. Lee,
neously emerged in large language models. H. Lee and H. O. Song. 2023. Query-efficient black-box
red teaming via bayesian optimization. arXiv preprint
[278] B. Krause, A. D. Gotmare, B. McCann, N. S. Keskar, arXiv:2305.17444.
S. Joty, R. Socher and N. F. Rajani. 2021. GeDi: Genera- [293] K. Lee, O. Firat, A. Agarwal, C. Fannjiang and D. Sus-
tive discriminator guided sequence generation. In Findings sillo. 2018. Hallucinations in neural machine translation.
of the Association for Computational Linguistics: EMNLP
2021, pages 4929–4952, Punta Cana, Dominican Republic. [294] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck,
Association for Computational Linguistics. C. Callison-Burch and N. Carlini. 2021. Deduplicating
training data makes language models better. arXiv preprint
[279] D. C. Krawczyk. 2018. Introduction to reasoning. Rea- arXiv:2107.06499.
soning—The Neuroscience of How We Think; Academic
Press: Cambridge, MA, USA, pages 1–11. [295] N. Lee, W. Ping, P. Xu, M. Patwary, P. Fung, M. Shoeybi
and B. Catanzaro. Factuality Enhanced Language Models
[280] K. Krishna, Y. Song, M. Karpinska, J. Wieting and for Open-Ended Text Generation.
M. Iyyer. 2023. Paraphrasing evades detectors of AI-
generated text, but retrieval is an effective defense. [296] P. Lee, S. Bubeck and J. Petro. 2023. Benefits, limits,
ArXiv:2303.13408 [cs]. and risks of gpt-4 as an ai chatbot for medicine. New
England Journal of Medicine, 388(13):1233–1239.
[281] T. Kudo. 2018. Subword regularization: Improving neu-
[297] E. Lehman, E. Hernandez, D. Mahajan, J. Wulff, M. J.
ral network translation models with multiple subword can-
Smith, Z. Ziegler, D. Nadler, P. Szolovits et al. 2023. Do
didates. In Proceedings of the 56th Annual Meeting of
we still need clinical language models?
the Association for Computational Linguistics (Volume 1:
Long Papers), pages 66–75, Melbourne, Australia. Associ- [298] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang,
ation for Computational Linguistics. M. Krikun, N. Shazeer et al. 2020. Gshard: Scaling gi-
ant models with conditional computation and automatic
[282] T. Kudo and J. Richardson. 2018. Sentencepiece: A sharding.
simple and language independent subword tokenizer and
detokenizer for neural text processing. arXiv preprint [299] B. Lester, R. Al-Rfou and N. Constant. 2021. The power
arXiv:1808.06226. of scale for parameter-efficient prompt tuning. In Pro-
ceedings of the 2021 Conference on Empirical Methods
[283] A. Kulkarni. 2021. GitHub Copilot AI Is Leaking Func- in Natural Language Processing, pages 3045–3059, On-
tional API Keys. line and Punta Cana, Dominican Republic. Association for
Computational Linguistics.
[284] S. R. Künzel, J. S. Sekhon, P. J. Bickel and B. Yu. 2019.
Metalearners for estimating heterogeneous treatment ef- [300] Y. Leviathan, M. Kalman and Y. Matias. 2022. Fast in-
fects using machine learning. Proceedings of the national ference from transformers via speculative decoding. arXiv
academy of sciences, 116(10):4156–4165. preprint arXiv:2211.17192.

[285] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. Yu, [301] D. M. Levine, R. Tuwani, B. Kompa, A. Varma, S. G.
J. Gonzalez, H. Zhang et al. 2023. vllm: Easy, fast, and Finlayson, A. Mehrotra and A. Beam. 2023. The diagnostic
cheap llm serving with pagedattention. and triage accuracy of the gpt-3 artificial intelligence model.
medRxiv, pages 2023–01.
[286] E. Kıcıman, R. Ness, A. Sharma and C. Tan. 2023. [302] M. Lewis, S. Bhosale, T. Dettmers, N. Goyal and
Causal reasoning and large language models: Opening L. Zettlemoyer. 2021. Base layers: Simplifying training of
a new frontier for causality. large, sparse models.
[287] P. Lab. 2023. Awesome-Prompt-Engineering. Original- [303] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo-
date: 2023-02-09T18:22:52Z. hamed, O. Levy, V. Stoyanov and L. Zettlemoyer. 2020.
BART: Denoising sequence-to-sequence pre-training for
[288] A. K. Lampinen, S. C. Chan, I. Dasgupta, A. J. Nam natural language generation, translation, and comprehen-
and J. X. Wang. 2023. Passive learning of active causal sion. In Proceedings of the 58th Annual Meeting of the
strategies in agents and language models. arXiv preprint Association for Computational Linguistics, pages 7871–
arXiv:2305.16183. 7880, Online. Association for Computational Linguistics.
[289] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. V. del [304] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin,
Moral, T. L. Scao, L. V. Werra, C. Mou et al. 2022. The big- N. Goyal, H. Küttler, M. Lewis et al. 2020. Retrieval-
science ROOTS corpus: A 1.6TB composite multilingual augmented generation for knowledge-intensive nlp tasks.
dataset. In Thirty-sixth Conference on Neural Information Advances in Neural Information Processing Systems,
Processing Systems Datasets and Benchmarks Track. 33:9459–9474.

59
[305] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, [321] C.-C. Lin, A. Jaech, X. Li, M. R. Gormley and J. Eis-
H. Michalewski, V. Ramasesh, A. Slone, C. Anil et al. ner. 2020. Limitations of autoregressive models and their
2022. Solving quantitative reasoning problems with lan- alternatives. arXiv preprint arXiv:2010.11939.
guage models.
[322] J. Lin, A. Yang, J. Bai, C. Zhou, L. Jiang, X. Jia,
[306] B. Z. Li, M. Nye and J. Andreas. 2021. Implicit repre- A. Wang, J. Zhang et al. 2021. M6-10t: A sharing-
sentations of meaning in neural language models. arXiv delinking paradigm for efficient multi-trillion parameter
preprint arXiv:2106.00737. pretraining. arXiv preprint arXiv:2110.03888.
[307] C. Li, A. A. Awan, H. Tang, S. Rajbhandari and Y. He. [323] S. Lin, J. Hilton and O. Evans. 2021. Truthfulqa:
2021. 1-bit lamb: Communication efficient large-scale Measuring how models mimic human falsehoods. arXiv
large-batch training with lamb’s convergence speed. arXiv preprint arXiv:2109.07958.
preprint arXiv:2104.06069.

[308] D. Li, R. Shao, A. Xie, Y. Sheng, L. Zheng, J. E. Gonza- [324] X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen,
lez, I. Stoica, X. Ma et al. 2023. How long can open-source D. Simig, M. Ott, N. Goyal et al. 2022. Few-shot learning
llms truly promise on context length? with multilingual generative language models. In Pro-
ceedings of the 2022 Conference on Empirical Methods
[309] H. Li, D. Guo, W. Fan, M. Xu and Y. Song. 2023. Multi- in Natural Language Processing, pages 9019–9052, Abu
step jailbreaking privacy attacks on chatgpt. arXiv preprint Dhabi, United Arab Emirates. Association for Computa-
arXiv:2304.05197. tional Linguistics.

[310] R. Li, J. Su, C. Duan and S. Zheng. 2020. Linear at- [325] Y.-T. Lin and Y.-N. Chen. 2023. Llm-eval: Unified
tention mechanism: An efficient attention for semantic multi-dimensional automatic evaluation for open-domain
segmentation. arXiv preprint arXiv:2007.14902. conversations with large language models. arXiv preprint
arXiv:2305.13711.
[311] X. L. Li and P. Liang. 2021. Prefix-tuning: Optimizing
continuous prompts for generation. In Proceedings of the [326] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, A. dos
59th Annual Meeting of the Association for Computational Santos Costa, M. Fazel-Zarandi et al. 2022. Language
Linguistics and the 11th International Joint Conference on models of protein sequences at the scale of evolution enable
Natural Language Processing (Volume 1: Long Papers), accurate structure prediction. BioRxiv.
pages 4582–4597, Online. Association for Computational
Linguistics. [327] W. Ling, D. Yogatama, C. Dyer and P. Blunsom. 2017.
Program induction by rationale generation: Learning to
[312] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou and solve and explain algebraic word problems. In Proceed-
W. Chen. 2022. On the advance of making language mod- ings of the 55th Annual Meeting of the Association for
els better reasoners. Computational Linguistics (Volume 1: Long Papers), pages
[313] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrit- 158–167, Vancouver, Canada. Association for Computa-
twieser, R. Leblond, T. Eccles, J. Keeling et al. 2022. tional Linguistics.
Competition-level code generation with alphacode. Sci-
ence, 378(6624):1092–1097. [328] B. Liu, J. T. Ash, S. Goel, A. Krishnamurthy and
C. Zhang. 2023. Exposing Attention Glitches with Flip-
[314] Z. Li, C. You, S. Bhojanapalli, D. Li, A. S. Rawat, S. J. Flop Language Modeling. ArXiv:2306.00946 [cs].
Reddi, K. Ye, F. Chern et al. 2023. The Lazy Neuron
Phenomenon: On Emergence of Activation Sparsity in [329] F. Liu, J. M. Eisenschlos, F. Piccinno, S. Krichene,
Transformers. ArXiv:2210.06313 [cs, stat]. C. Pang, K. Lee, M. Joshi, W. Chen et al. 2022. Deplot:
One-shot visual language reasoning by plot-to-table trans-
[315] L. Lian, B. Li, A. Yala and T. Darrell. 2023. Llm- lation. arXiv preprint arXiv:2212.10505.
grounded diffusion: Enhancing prompt understanding of
text-to-image diffusion models with large language models. [330] H. Liu, C. Sferrazza and P. Abbeel. 2023. Languages
are rewards: Hindsight finetuning using human feedback.
[316] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, arXiv preprint arXiv:2302.02676.
P. Florence and A. Zeng. 2023. Code as policies: Language
model programs for embodied control. [331] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang,
M. Bansal and C. A. Raffel. 2022. Few-shot parameter-
[317] P. P. Liang, C. Wu, L.-P. Morency and R. Salakhutdi- efficient fine-tuning is better and cheaper than in-context
nov. 2021. Towards understanding and mitigating social learning. Advances in Neural Information Processing Sys-
biases in language models. In International Conference on tems, 35:1950–1965.
Machine Learning, pages 6565–6576. PMLR.

[318] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, [332] H. Liu, S. M. Xie, Z. Li and T. Ma. 2022. Same pre-
M. Yasunaga, Y. Zhang, D. Narayanan et al. 2022. training loss, better downstream: Implicit bias matters for
Holistic evaluation of language models. arXiv preprint language models. ArXiv, abs/2210.14199.
arXiv:2211.09110.
[333] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua,
[319] O. Lieber, O. Sharir, B. Lenz and Y. Shoham. 2021. F. Petroni and P. Liang. 2023. Lost in the Middle: How
Jurassic-1: Technical details and evaluation. White Paper. Language Models Use Long Contexts. ArXiv:2307.03172
AI21 Labs, 1. [cs].

[320] V. Liévin, C. E. Hother and O. Winther. 2022. Can large [334] R. Liu, C. Jia, J. Wei, G. Xu and S. Vosoughi. 2022.
language models reason about medical questions? arXiv Quantifying and alleviating political bias in language mod-
preprint arXiv:2207.08143. els. Artificial Intelligence, 304:103654.

60
[335] R. Liu and N. B. Shah. 2023. ReviewerGPT? An Ex- [350] X. Ma, X. Kong, S. Wang, C. Zhou, J. May, H. Ma
ploratory Study on Using Large Language Models for Pa- and L. Zettlemoyer. 2021. Luna: Linear unified nested
per Reviewing. ArXiv:2306.00622 [cs]. attention. Advances in Neural Information Processing
Systems, 34:2441–2453.
[336] S. Liu and Z. Wang. 2023. Ten lessons we have learned
in the new" sparseland": A short handbook for sparse neu- [351] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao,
ral network researchers. arXiv preprint arXiv:2302.02596. S. Wiegreffe, U. Alon, N. Dziri et al. 2023. Self-refine:
Iterative refinement with self-feedback.
[337] X. Liu, X. Yang, L. Ouyang, G. Guo, J. Su, R. Xi,
K. Yuan and F. Yuan. 2022. Protein language model [352] A. Madani, B. Krause, E. R. Greene, S. Subramanian,
predicts mutation pathogenicity and clinical prognosis. B. P. Mohr, J. M. Holton, J. L. Olmos Jr, C. Xiong et al.
bioRxiv, pages 2022–09. 2023. Large language models generate functional protein
sequences across diverse families. Nature Biotechnology,
[338] Z. Liu, A. Bahety and S. Song. 2023. Reflect: Summa- pages 1–8.
rizing robot experiences for failure explanation and correc-
tion. [353] M. Maddela, M. Ung, J. Xu, A. Madotto, H. Foran
and Y.-L. Boureau. 2023. Training Models to Gen-
[339] Z. Liu, E. Gan and M. Tegmark. 2023. Seeing is be- erate, Recognize, and Reframe Unhelpful Thoughts.
lieving: Brain-inspired modular training for mechanistic ArXiv:2307.02768 [cs].
interpretability. arXiv preprint arXiv:2305.08746.
[354] S. Mahdavi, R. Liao and C. Thrampoulidis. 2023. Memo-
[340] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, rization Capacity of Multi-Head Attention in Transformers.
Y. Tay, D. Zhou, Q. V. Le et al. 2023. The flan collec- ArXiv:2306.02010 [cs].
tion: Designing data and methods for effective instruction
tuning. [355] S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee,
D. Chen and S. Arora. 2023. Fine-Tuning Language Mod-
[341] S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts, els with Just Forward Passes. ArXiv:2305.17333 [cs].
B. Zoph, D. Zhou, J. Wei et al. 2023. A Pretrainer’s Guide
to Training Data: Measuring the Effects of Data Age, Do- [356] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada
main Coverage, Quality, & Toxicity. ArXiv:2305.13169 and S. Paul. 2022. Peft: State-of-the-art parameter-
[cs]. efficient fine-tuning methods. https://github.com/
huggingface/peft.
[342] Y. Lu, M. Bartolo, A. Moore, S. Riedel and P. Stene-
torp. 2022. Fantastically ordered prompts and where to [357] P. Maniatis and D. Tarlow. 2023. Large sequence
find them: Overcoming few-shot prompt order sensitivity. models for software development activities. Available
In Proceedings of the 60th Annual Meeting of the Asso- from: https://ai.googleblog.com/2023/
ciation for Computational Linguistics (Volume 1: Long 05/large-sequence-models-for-software.
Papers), pages 8086–8098, Dublin, Ireland. Association html. Accessed: 26/06/2023.
for Computational Linguistics. [358] R. R. McCrae and P. T. Costa Jr. 1997. Personality trait
structure as a human universal. American psychologist,
[343] Y. Lu, C. Li, M. Zhang, C. De Sa and Y. He. 2022. Max-
52(5):509.
imizing communication efficiency for large-scale training
via 0/1 adam. arXiv preprint arXiv:2202.06009. [359] I. R. McKenzie, A. Lyzhov, M. Pieler, A. Parrish,
A. Mueller, A. Prabhu, E. McLean, A. Kirtland et al.
[344] N. Lukas, A. Salem, R. Sim, S. Tople, L. Wutschitz
2023. Inverse Scaling: When Bigger Isn’t Better.
and S. Zanella-Béguelin. 2023. Analyzing Leakage of
ArXiv:2306.09479 [cs].
Personally Identifiable Information in Language Models.
ArXiv:2302.00539 [cs]. [360] K. Meng, D. Bau, A. J. Andonian and Y. Belinkov. 2022.
Locating and editing factual associations in GPT. In Ad-
[345] B. Luo, R. Y. Lau, C. Li and Y.-W. Si. 2022. A critical vances in Neural Information Processing Systems.
review of state-of-the-art chatbot designs and applications.
Wiley Interdisciplinary Reviews: Data Mining and Knowl- [361] K. Meng, A. S. Sharma, A. J. Andonian, Y. Belinkov
edge Discovery, 12(1):e1434. and D. Bau. 2023. Mass-editing memory in a transformer.
In The Eleventh International Conference on Learning
[346] Y. Luo, N. Tang, G. Li, C. Chai, W. Li and X. Qin. 2021. Representations.
Synthesizing natural language to visualization (nl2vis)
benchmarks from nl2sql benchmarks. In Proceedings of [362] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, F. Song,
the 2021 International Conference on Management of Data, M. Chadwick, M. Glaese, S. Young et al. 2022. Teaching
pages 1235–1247. language models to support answers with verified quotes.
[347] A. Lynch, G. J. Dovonon, J. Kaddour and R. Silva. 2023. [363] G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pa-
Spawrious: A benchmark for fine control of spurious cor- sunuru, R. Raileanu, B. Rozière, T. Schick et al. 2023.
relation biases. arXiv preprint arXiv:2303.05470. Augmented language models: a survey. arXiv preprint
arXiv:2302.07842.
[348] P. Ma, Z. Li, A. Sun and S. Wang. 2023. "oops, did i just
say that?" testing and repairing unethical suggestions of [364] S. Milgram. 1963. Behavioral study of obedience. The
large language models with suggest-critique-reflect process. Journal of abnormal and social psychology, 67(4):371.
arXiv preprint arXiv:2305.02626.
[365] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W.
[349] X. Ma, G. Fang and X. Wang. 2023. Llm-pruner: On the Koh, M. Iyyer, L. Zettlemoyer et al. 2023. FActScore:
structural pruning of large language models. arXiv preprint Fine-grained Atomic Evaluation of Factual Precision in
arXiv:2305.11627. Long Form Text Generation. ArXiv:2305.14251 [cs].

61
[366] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, [380] N. Nanda, L. Chan, T. Lieberum, J. Smith and J. Stein-
H. Hajishirzi and L. Zettlemoyer. 2022. Rethinking the hardt. 2023. Progress measures for grokking via mechanis-
role of demonstrations: What makes in-context learning tic interpretability. In The Eleventh International Confer-
work? ence on Learning Representations.

[367] M. Miotto, N. Rossberg and B. Kleinberg. 2022. Who [381] S. Nerella, S. Bandyopadhyay, J. Zhang, M. Contreras,
is gpt-3? an exploration of personality, values and demo- S. Siegel, A. Bumin, B. Silva, J. Sena et al. 2023. Trans-
graphics. arXiv preprint arXiv:2209.14338. formers in healthcare: A survey.

[368] P. Mirowski, K. W. Mathewson, J. Pittman and R. Evans. [382] A. Nguyen, N. Karampatziakis and W. Chen. 2023. Meet
2022. Co-writing screenplays and theatre scripts with lan- in the middle: A new pre-training paradigm. arXiv preprint
guage models: An evaluation by industry professionals. arXiv:2303.07295.
arXiv preprint arXiv:2209.14958.
[383] E. Nguyen, M. Poli, M. Faizi, A. Thomas, C. Birch-
[369] A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, Sykes, M. Wornow, A. Patel, C. Rabideau et al. 2023. Hye-
G. Venkatesh, C. Yu and P. Micikevicius. 2021. Ac- nadna: Long-range genomic sequence modeling at single
celerating sparse deep neural networks. arXiv preprint nucleotide resolution. arXiv preprint arXiv:2306.15794.
arXiv:2104.08378.
[384] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam,
[370] S. Mishra, D. Khashabi, C. Baral and H. Hajishirzi. 2022. P. Mishkin, B. McGrew, I. Sutskever and M. Chen. 2022.
Cross-task generalization via natural language crowdsourc- Glide: Towards photorealistic image generation and editing
ing instructions. In Proceedings of the 60th Annual Meet- with text-guided diffusion models.
ing of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 3470–3487, Dublin, Ireland. [385] X. Nie and S. Wager. 2021. Quasi-oracle estimation of
Association for Computational Linguistics. heterogeneous treatment effects. Biometrika, 108(2):299–
319.
[371] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning
and C. Finn. 2023. DetectGPT: Zero-Shot Machine- [386] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang,
Generated Text Detection using Probability Curvature. Y. Zhou, S. Savarese and C. Xiong. 2022. Codegen: An
ArXiv:2301.11305 [cs]. open large language model for code with multi-turn pro-
gram synthesis.
[372] E. Mitchell, C. Lin, A. Bosselut, C. Finn and C. D. Man-
ning. 2022. Fast model editing at scale. In International [387] F. Niu, B. Recht, C. Re, S. J. Wright and W. D. St. Hog-
Conference on Learning Representations. wild!: A Lock-Free Approach to Parallelizing Stochastic
Gradient Descent.
[373] E. Mitchell, C. Lin, A. Bosselut, C. D. Manning and
C. Finn. 2022. Memory-based model editing at scale. In [388] H. Nori, N. King, S. M. McKinney, D. Carignan and
Proceedings of the 39th International Conference on Ma- E. Horvitz. 2023. Capabilities of gpt-4 on medical chal-
chine Learning, volume 162 of Proceedings of Machine lenge problems.
Learning Research, pages 15817–15831. PMLR.
[389] K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi,
[374] R. Moriconi, M. P. Deisenroth and K. Sesh Kumar. H. Hajishirzi, S. Singh and R. Fox. 2023. Do embod-
2020. High-dimensional bayesian optimization using low- ied agents dream of pixelated sheep?: Embodied decision
dimensional feature spaces. Machine Learning, 109:1925– making using language guided world modelling. arXiv
1943. preprint arXiv:2301.12050.

[375] M. Moussaïd, J. E. Kämmer, P. P. Analytis and H. Neth. [390] S. Nurk, S. Koren, A. Rhie, M. Rautiainen, A. V.
2013. Social influence and the collective dynamics of Bzikadze, A. Mikheenko, M. R. Vollger, N. Altemose et al.
opinion formation. PloS one, 8(11):e78433. 2022. The complete sequence of a human genome. Sci-
ence, 376(6588):44–53.
[376] M. Mozes, J. Hoffmann, K. Tomanek, M. Kouate,
N. Thain, A. Yuan, T. Bolukbasi and L. Dixon. 2023. To- [391] M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski,
wards agile text classifiers for everyone. arXiv preprint J. Austin, D. Bieber, D. Dohan, A. Lewkowycz et al. 2021.
arXiv:2302.06541. Show your work: Scratchpads for intermediate computa-
tion with language models.
[377] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts,
S. Biderman, T. L. Scao, M. S. Bari, S. Shen et al. 2022. [392] Ofir Press [@OfirPress]. 2022. GPT-3 seems to be
Crosslingual generalization through multitask finetuning. nondeterministic even when it should be (i.e. temper-
arXiv preprint arXiv:2211.01786. ature == 0). Has anyone else noticed this? Is there
a known fix? Video by my collaborator Muru Zhang.
[378] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, https://t.co/dOWYWPBYyP.
H. Palangi and A. Awadallah. 2023. Orca: Progressive
learning from complex explanation traces of gpt-4. arXiv [393] N. Oh, G.-S. Choi and W. Y. Lee. 2023. Chatgpt goes
preprint arXiv:2306.02707. to operating room: Evaluating gpt-4 performance and its
potential in surgical education and training in the era of
[379] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, large language models. medRxiv.
C. Kim, C. Hesse, S. Jain et al. 2021. Webgpt: Browser-
assisted question-answering with human feedback. arXiv [394] C. Olah. Mechanistic Interpretability, Variables, and the
preprint arXiv:2112.09332. Importance of Interpretable Bases.

62
[395] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. Das- [411] N. D. Patson, E. S. Darowski, N. Moon and F. Ferreira.
Sarma, T. Henighan, B. Mann, A. Askell et al. 2022. 2009. Lingering misinterpretations in garden-path sen-
In-context learning and induction heads. arXiv preprint tences: evidence from a paraphrasing task. Journal of
arXiv:2209.11895. Experimental Psychology: Learning, Memory, and Cogni-
tion, 35(1):280.
[396] OpenAI. 2022. Chatgpt: Optimizing language mod-
els for dialogue. https://openai.com/blog/ [412] D. Patterson, J. Gonzalez, U. Hölzle, Q. Le, C. Liang,
chatgpt/. Accessed: 2023-02-18. L.-M. Munguia, D. Rothchild, D. R. So et al. 2022. The
carbon footprint of machine learning training will plateau,
[397] OpenAI. 2023. Chat gpt 4 painfully slow. then shrink. Computer, 55(7):18–28.
https://community.openai.com/t/
chat-gpt-4-painfully-slow/117996. [413] A. Paullada, I. D. Raji, E. M. Bender, E. Denton and
[398] OpenAI. 2023. Gpt-4 technical report. A. Hanna. 2021. Data and its (dis) contents: A survey of
dataset development and use in machine learning research.
[399] P. J. Ortiz Su’arez, B. Sagot and L. Romary. 2019. Asyn- Patterns, 2(11):100336.
chronous pipelines for processing huge corpora on medium
to low resource infrastructures. In Proceedings of the Work- [414] M. Pellert, C. M. Lechner, C. Wagner, B. Rammstedt
shop on Challenges in the Management of Large Corpora and M. Strohmaier. 2023. Ai psychometrics: Using psy-
(CMLC-7) 2019. Cardiff, 22nd July 2019, pages 9 – 16, chometric inventories to obtain psychological profiles of
Mannheim. Leibniz-Institut f"ur Deutsche Sprache. large language models.

[400] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, [415] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru,
D. Grangier and M. Auli. 2019. fairseq: A fast, ex- A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei et al.
tensible toolkit for sequence modeling. arXiv preprint 2023. The RefinedWeb Dataset for Falcon LLM: Outper-
arXiv:1904.01038. forming Curated Corpora with Web Data, and Web Data
Only. ArXiv:2306.01116 [cs].
[401] N. Ousidhoum, X. Zhao, T. Fang, Y. Song and D.-Y.
Yeung. 2021. Probing toxic content in large pre-trained lan- [416] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Ar-
guage models. In Proceedings of the 59th Annual Meeting cadinho, H. Cao, X. Cheng, M. Chung et al. 2023.
of the Association for Computational Linguistics and the RWKV: Reinventing RNNs for the Transformer Era.
11th International Joint Conference on Natural Language ArXiv:2305.13048 [cs].
Processing (Volume 1: Long Papers), pages 4262–4274.
[417] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcad-
[402] C. Outeiral and C. Deane. 2022. Codon language em- inho, H. Cao, X. Cheng, M. Chung et al. 2023. Rwkv:
beddings provide strong signals for protein engineering. Reinventing rnns for the transformer era. arXiv preprint
bioRxiv, pages 2022–12. arXiv:2305.13048.
[403] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright,
P. Mishkin, C. Zhang, S. Agarwal et al. 2022. Training lan- [418] C. Peng, X. Yang, A. Chen, K. E. Smith, N. PourNejatian,
guage models to follow instructions with human feedback. A. B. Costa, C. Martin, M. G. Flores et al. 2023. A study
In Advances in Neural Information Processing Systems. of generative large language model for medical research
and healthcare.
[404] M. Pagliardini, D. Paliotta, M. Jaggi and F. Fleuret. 2023.
Faster causal attention over large sequences through sparse [419] Y. Peng. 2021. A MARVS analysis of two Chinese near-
flash attention. synonymous verbs of jumping based on Chinese corpora.
In Proceedings of the 35th Pacific Asia Conference on
[405] J. Pan, T. Gao, H. Chen and D. Chen. 2023. What in- Language, Information and Computation, pages 483–492,
context learning "learns" in-context: Disentangling task Shanghai, China. Association for Computational Lingus-
recognition and task learning. tics.

[406] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, [420] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides,
L. Zettlemoyer and M. T. Ribeiro. 2023. Art: Automatic A. Glaese, N. McAleese et al. 2022. Red teaming lan-
multi-step reasoning and tool-use for large language mod- guage models with language models. arXiv preprint
els. arXiv:2202.03286.
[407] G. Park, B. Park, S. J. Kwon, B. Kim, Y. Lee and D. Lee. [421] E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen,
2022. nuqmm: Quantized matmul for efficient inference S. Heiner, C. Pettit, C. Olsson et al. 2022. Discovering
of large-scale generative language models. arXiv preprint language model behaviors with model-written evaluations.
arXiv:2206.09557.

[408] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang [422] F. Perez and I. Ribeiro. 2022. Ignore previous prompt:
and M. S. Bernstein. 2023. Generative agents: Interactive Attack techniques for language models. arXiv preprint
simulacra of human behavior. arXiv:2211.09527.

[409] P. S. Park, P. Schoenegger and C. Zhu. 2023. Artifi- [423] L. Peric, S. Mijic, D. Stammbach and E. Ash. 2020. Le-
cial intelligence in psychology research. arXiv preprint gal language modeling with transformers. In Proceedings
arXiv:2302.07267. of the Fourth Workshop on Automated Semantic Analysis
of Information in Legal Text (ASAIL 2020) held online in
[410] A. Patel, B. Li, M. S. Rasooli, N. Constant, C. Raffel and conjunction with te 33rd International Conference on Le-
C. Callison-Burch. 2023. Bidirectional language models gal Knowledge and Information Systems (JURIX 2020)
are also few-shot learners. December 9, 2020, volume 2764. CEUR-WS.

63
[424] B. Peters and A. F. T. Martins. 2021. Smoothing and [438] A. Radford, R. Jozefowicz and I. Sutskever. 2017. Learn-
shrinking the sparse Seq2Seq search space. In Proceedings ing to generate reviews and discovering sentiment. arXiv
of the 2021 Conference of the North American Chapter preprint arXiv:1704.01444.
of the Association for Computational Linguistics: Human
Language Technologies, pages 2642–2654, Online. Associ- [439] A. Radford, J. W. Kim, T. Xu, G. Brockman,
ation for Computational Linguistics. C. McLeavey and I. Sutskever. 2022. Robust
Speech Recognition via Large-Scale Weak Supervision.
[425] J. Peters, D. Janzing and B. Schölkopf. 2017. Elements ArXiv:2212.04356 [cs, eess].
of causal inference: foundations and learning algorithms.
The MIT Press. [440] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and
I. Sutskever. 2019. Language models are unsupervised
[426] A. Petrov, E. La Malfa, P. H. Torr and A. Bibi. 2023. multitask learners.
Language model tokenizers introduce unfairness between
[441] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann,
languages. arXiv preprint arXiv:2305.15425.
F. Song, J. Aslanides, S. Henderson et al. 2021. Scaling lan-
guage models: Methods, analysis & insights from training
[427] T. Pettinato Oltz. 2023. Chatgpt, professor of law. Pro-
gopher. arXiv preprint arXiv:2112.11446.
fessor of Law (February 4, 2023).
[442] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D.
[428] J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vulić, Manning and C. Finn. 2023. Direct preference optimiza-
S. Ruder, K. Cho and I. Gurevych. 2020. AdapterHub: tion: Your language model is secretly a reward model.
A framework for adapting transformers. In Proceedings arXiv preprint arXiv:2305.18290.
of the 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations, pages 46– [443] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
54, Online. Association for Computational Linguistics. M. Matena, Y. Zhou, W. Li et al. 2022. Exploring the limits
of transfer learning with a unified text-to-text transformer.
[429] S. Pichai. 2023. An important next step on our ai jour- J. Mach. Learn. Res., 21(1).
ney. https://blog.google/technology/ai/
bard-google-ai-search-updates/. Accessed: [444] S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y.
2023-02-18. Aminabadi, A. A. Awan, J. Rasley and Y. He. 2022.
DeepSpeed-MoE: Advancing mixture-of-experts inference
[430] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, and training to power next-generation AI scale. In Pro-
S. Baccus, Y. Bengio, S. Ermon et al. 2023. Hyena Hier- ceedings of the 39th International Conference on Machine
archy: Towards Larger Convolutional Language Models. Learning, volume 162 of Proceedings of Machine Learning
ArXiv:2302.10866 [cs]. Research, pages 18332–18346. PMLR.

[431] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Brad- [445] S. Rajbhandari, J. Rasley, O. Ruwase and Y. He. 2020.
bury, A. Levskaya, J. Heek, K. Xiao et al. 2022. Efficiently Zero: Memory optimizations toward training trillion param-
Scaling Transformer Inference. ArXiv:2211.05102 [cs]. eter models. In Proceedings of the International Confer-
ence for High Performance Computing, Networking, Stor-
[432] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Brad- age and Analysis, SC ’20. IEEE Press.
bury, A. Levskaya, J. Heek, K. Xiao et al. 2022. Ef-
ficiently scaling transformer inference. arXiv preprint [446] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith and Y. He.
arXiv:2211.05102. 2021. Zero-infinity: Breaking the gpu memory wall for
extreme scale deep learning. In Proceedings of the In-
[433] V. Prabhakaran, A. Mostafazadeh Davani and M. Diaz. ternational Conference for High Performance Computing,
2021. On releasing annotator-level labels and information Networking, Storage and Analysis, SC ’21, New York, NY,
in datasets. In Proceedings of the Joint 15th Linguistic USA. Association for Computing Machinery.
Annotation Workshop (LAW) and 3rd Designing Meaning
[447] I. D. Raji, E. M. Bender, A. Paullada, E. Denton and
Representations (DMR) Workshop, pages 133–138, Punta
A. Hanna. 2021. Ai and the everything in the whole wide
Cana, Dominican Republic. Association for Computational
world benchmark. arXiv preprint arXiv:2111.15366.
Linguistics.
[448] A. Rajkomar, E. Loreaux, Y. Liu, J. Kemp, B. Li, M.-J.
[434] O. Press, N. A. Smith and M. Lewis. 2021. Train short, Chen, Y. Zhang, A. Mohiuddin et al. 2022. Deciphering
test long: Attention with linear biases enables input length clinical abbreviations with a privacy protecting machine
extrapolation. learning system. Nature Communications, 13(1):7456.
[435] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith [449] R. Ramamurthy, P. Ammanabrolu, K. Brantley, J. Hes-
and M. Lewis. 2023. Measuring and Narrowing the Com- sel, R. Sifa, C. Bauckhage, H. Hajishirzi and Y. Choi.
positionality Gap in Language Models. ArXiv:2210.03350 2022. Is reinforcement learning (not) for natural language
[cs]. processing?: Benchmarks, baselines, and building blocks
for natural language policy optimization. arXiv preprint
[436] J. Qian, H. Wang, Z. Li, S. Li and X. Yan. 2022. Lim- arXiv:2210.01241.
itations of language models in arithmetic and symbolic
induction. arXiv preprint arXiv:2208.05051. [450] J. Rasley, S. Rajbhandari, O. Ruwase and Y. He. 2020.
Deepspeed: System optimizations enable training deep
[437] J. Rabelo, R. Goebel, M.-Y. Kim, Y. Kano, M. Yosh- learning models with over 100 billion parameters. In Pro-
ioka and K. Satoh. 2022. Overview and discussion of the ceedings of the 26th ACM SIGKDD International Confer-
competition on legal information Extraction/Entailment ence on Knowledge Discovery & Data Mining, KDD ’20,
(COLIEE) 2021. The Review of Socionetwork Strategies, page 3505–3506, New York, NY, USA. Association for
16(1):111–133. Computing Machinery.

64
[451] P. P. Ray. 2023. ChatGPT: A comprehensive review [466] S. Russell. 2021. Human-compatible artificial intelli-
on background, applications, key challenges, bias, ethics, gence. Human-like machine intelligence, pages 3–23.
limitations and future scope. Internet of Things and Cyber-
Physical Systems, 3:121–154. [467] P. Rust, J. F. Lotz, E. Bugliarello, E. Salesky,
M. de Lhoneux and D. Elliott. 2023. Language Modelling
[452] E. Razumovskaia, J. Maynez, A. Louis, M. Lapata and with Pixels. ArXiv:2207.06991 [cs].
S. Narayan. 2022. Little red riding hood goes around the
globe: Crosslingual story planning and generation with [468] A. Sabne. 2020. Xla : Compiling machine learning for
large language models. arXiv preprint arXiv:2212.10471. peak performance.
[469] V. S. Sadasivan, A. Kumar, S. Balasubramanian,
[453] B. Recht, C. Re, S. Wright and F. Niu. 2011. Hogwild!: W. Wang and S. Feizi. 2023. Can AI-Generated Text be
A lock-free approach to parallelizing stochastic gradient de- Reliably Detected? ArXiv:2303.11156 [cs].
scent. Advances in neural information processing systems,
24. [470] M. Safdari, G. Serapio-García, C. Crepy, S. Fitz,
P. Romero, L. Sun, M. Abdulhai, A. Faust et al. 2023.
[454] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, Personality traits in large language models.
S. Yang, M. Zhang, D. Li and Y. He. 2021. {ZeRO-
Offload}: Democratizing {Billion-Scale} model training. [471] S. Sagawa, P. W. Koh, T. B. Hashimoto and P. Liang.
In 2021 USENIX Annual Technical Conference (USENIX 2020. Distributionally robust neural networks for group
ATC 21), pages 551–564. shifts: On the importance of regularization for worst-case
generalization.
[455] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang,
P. Li, X. Zhang et al. 2023. Pangu- [472] O. Sainz, J. C. Campos, I. García-Ferrero, J. Etxaniz and
Sigma: Towards trillion parameter language model with E. Agirre. lm-contamination.
sparse heterogeneous computing.
[473] L. Salewski, S. Alaniz, I. Rio-Torto, E. Schulz and
[456] Riley Goodside [@goodside]. 2022. An edge-case Z. Akata. 2023. In-context impersonation reveals large
in GPT-3 with big implications: Inference is non- language models’ strengths and biases. arXiv preprint
deterministic (even at temperature=0) when top-2 token arXiv:2305.14930.
probabilities are <1% different. So temperature=0 output
[474] G. Sanchez, H. Fan, A. Spangher, E. Levi, P. S. Am-
is *very close* to deterministic, but actually isn’t. Worth
manamanchi and S. Biderman. 2023. Stay on topic with
remembering.
Classifier-Free Guidance. ArXiv:2306.17806 [cs].
[457] X. Robin, J. Haas, R. Gumienny, A. Smolinski, G. Tau- [475] V. Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika,
riello and T. Schwede. 2021. Continuous automated Z. Alyafeai, A. Chaffin, A. Stiegler et al. 2022. Multitask
model evaluation (cameo)—perspectives on the future of prompted training enables zero-shot task generalization. In
fully automated evaluation of structure prediction meth- International Conference on Learning Representations.
ods. Proteins: Structure, Function, and Bioinformatics,
89(12):1977–1986. [476] S. Sanyal, J. Kaddour, A. Kumar and S. Sanghavi. 2023.
Understanding the effectiveness of early weight averaging
[458] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell and for training large language models.
K. Saenko. 2018. Object hallucination in image captioning.
arXiv preprint arXiv:1809.02156. [477] E. Saravia. 2022. Prompt Engineering Guide. Publica-
tion Title: https://github.com/dair-ai/Prompt-Engineering-
[459] S. Roller, S. Sukhbaatar, A. Szlam and J. Weston. 2021. Guide original-date: 2022-12-16T16:04:50Z.
Hash layers for large sparse models.
[478] J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann and
[460] G. M. Rosa, L. Bonifacio, V. Jeronymo, H. Abonizio, H. Xu. 2023. Explaining legal concepts with augmented
R. Lotufo and R. Nogueira. 2022. Billions of parame- large language models (gpt-4).
ters are worth more than in-domain training data: A case
study in the legal case entailment task. arXiv preprint [479] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hess-
arXiv:2205.15172. low, R. Castagné, A. S. Luccioni et al. 2022. Bloom: A
176b-parameter open-access multilingual language model.
[461] L. Ross, D. Greene and P. House. 1977. The “false [480] R. Schaeffer, B. Miranda and S. Koyejo. 2023. Are
consensus effect”: An egocentric bias in social perception emergent abilities of large language models a mirage?
and attribution processes. Journal of experimental social
psychology, 13(3):279–301. [481] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu,
M. Lomeli, L. Zettlemoyer, N. Cancedda and T. Scialom.
[462] Y. Rottenstreich and C. K. Hsee. 2001. Money, kisses, 2023. Toolformer: Language models can teach themselves
and electric shocks: On the affective psychology of risk. to use tools. arXiv preprint arXiv:2302.04761.
Psychological science, 12(3):185–190.
[482] T. Schick, J. Dwivedi-Yu, Z. Jiang, F. Petroni, P. Lewis,
[463] A. Roush. You probably don’t know how to do Prompt G. Izacard, Q. You, C. Nalmpantis et al. 2022. Peer: A
Engineering, let me educate you. collaborative language model.

[464] L. Ruis, A. Khan, S. Biderman, S. Hooker, T. Rock- [483] T. Schick and H. Schütze. 2021. It’s not just size that
täschel and E. Grefenstette. 2022. Large language models matters: Small language models are also few-shot learn-
are not zero-shot communicators. ers. In Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computational
[465] J. Rumbelow and mwatkins. SolidGoldMagikarp (plus, Linguistics: Human Language Technologies, pages 2339–
prompt generation). 2352.

65
[484] J. Schulman, F. Wolski, P. Dhariwal, A. Radford and [499] A. Shirafuji, Y. Watanobe, T. Ito, M. Morishita, Y. Naka-
O. Klimov. 2017. Proximal policy optimization algorithms. mura, Y. Oda and J. Suzuki. 2023. Exploring the robust-
arXiv preprint arXiv:1707.06347. ness of large language models for solving programming
problems.
[485] M. Schuster and K. Nakajima. 2012. Japanese and ko-
rean voice search. In 2012 IEEE International Conference [500] O. Shliazhko, A. Fenogenova, M. Tikhonova,
on Acoustics, Speech and Signal Processing (ICASSP), V. Mikhailov, A. Kozlova and T. Shavrina. 2022. mgpt:
pages 5149–5152. Few-shot learners go multilingual. arXiv preprint
arXiv:2204.07580.
[486] T. Schuster, R. Schuster, D. J. Shah and R. Barzi-
lay. 2020. The limitations of stylometry for detecting [501] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper
machine-generated fake news. Computational Linguistics, and B. Catanzaro. 2019. Megatron-lm: Training multi-
46(2):499–510. billion parameter language models using model parallelism.
arXiv preprint arXiv:1909.08053.
[487] R. Schwartz, J. Dodge, N. A. Smith and O. Etzioni. 2019.
Green AI. ArXiv:1907.10597 [cs, stat]. [502] K. Shridhar, J. Macina, M. El-Assady, T. Sinha, M. Ka-
pur and M. Sachan. 2022. Automatic generation of socratic
[488] S. H. Schwartz, B. Breyer and D. Danner. 2015. Hu- subquestions for teaching math word problems. ArXiv,
man values scale (ess). Zusammenstellung sozialwis- abs/2211.12835.
senschaftlicher Items und Skalen (ZIS).
[503] K. Shridhar, A. Stolfo and M. Sachan. 2022. Distilling
[489] A. See, A. Pappu, R. Saxena, A. Yerukola and C. D. multi-step reasoning capabilities of large language models
Manning. 2019. Do massively pretrained language mod- into smaller models via semantic decompositions. arXiv
els make better storytellers? In Proceedings of the 23rd preprint arXiv:2212.00193.
Conference on Computational Natural Language Learning
(CoNLL), pages 843–861, Hong Kong, China. Association [504] D. Shrivastava, H. Larochelle and D. Tarlow. 2022.
for Computational Linguistics. Repository-level prompt generation for large language mod-
els of code. arXiv preprint arXiv:2206.12839.
[490] R. Sennrich, B. Haddow and A. Birch. 2015. Neural ma-
chine translation of rare words with subword units. arXiv [505] R. W. Shuai, J. A. Ruffolo and J. J. Gray. 2021. Gen-
preprint arXiv:1508.07909. erative language modeling for antibody design. bioRxiv,
pages 2021–12.
[491] E. Sezgin, J. Sirrianni, S. L. Linwood et al. 2022. Oper-
ationalizing and implementing pretrained, large artificial
[506] I. Shumailov, Z. Shumaylov, Y. Zhao, Y. Gal, N. Pa-
intelligence linguistic models in the us health care system:
pernot and R. Anderson. 2023. The curse of recursion:
Outlook of generative pretrained transformer 3 (gpt-3) as a
Training on generated data makes models forget.
service model. JMIR Medical Informatics, 10(2):e32875.

[492] P. Shaw, J. Uszkoreit and A. Vaswani. 2018. Self- [507] K. Shuster, S. Poff, M. Chen, D. Kiela and J. Weston.
attention with relative position representations. In Pro- 2021. Retrieval augmentation reduces hallucination in
ceedings of the 2018 Conference of the North American conversation. arXiv preprint arXiv:2104.07567.
Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 2 (Short Papers), [508] K. Shuster, J. Xu, M. Komeili, D. Ju, E. M. Smith,
pages 464–468, New Orleans, Louisiana. Association for S. Roller, M. Ung, M. Chen et al. 2022. Blenderbot 3:
Computational Linguistics. a deployed conversational agent that continually learns to
responsibly engage.
[493] N. Shazeer. 2019. Fast transformer decoding: One write-
head is all you need. [509] S. Sia and K. Duh. 2023. In-context learning as maintain-
ing coherency: A study of on-the-fly machine translation
[494] N. Shazeer. 2019. Fast transformer decoding: One write- using large language models. ArXiv, abs/2305.03573.
head is all you need.
[510] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu,
[495] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, J. Tremblay, D. Fox, J. Thomason et al. 2022. Progprompt:
G. Hinton and J. Dean. 2017. Outrageously large neu- Generating situated robot task plans using large language
ral networks: The sparsely-gated mixture-of-experts layer. models.
arXiv preprint arXiv:1701.06538.
[511] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W.
[496] Z. Shen, M. Zhang, H. Zhao, S. Yi and H. Li. 2021. Chung, N. Scales, A. Tanwani et al. 2022. Large language
Efficient attention: Attention with linear complexities. In models encode clinical knowledge.
Proceedings of the IEEE/CVF winter conference on appli-
cations of computer vision, pages 3531–3539. [512] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn,
L. Hou, K. Clark, S. Pfohl et al. 2023. Towards expert-level
[497] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, medical question answering with large language models.
B. Chen, P. Liang, C. Ré et al. 2023. High-throughput arXiv preprint arXiv:2305.09617.
generative inference of large language models with a single
gpu. [513] A. Sinitsin, D. Pyrkin, A. Babenko, V. Plokhotnyuk and
S. Popov. 2020. EDITABLE NEURAL NETWORKS.
[498] T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong,
J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal et al. [514] S. L. Smith, P.-J. Kindermans, C. Ying and Q. V. Le.
2023. Model evaluation for extreme risks. arXiv preprint 2017. Don’t decay the learning rate, increase the batch
arXiv:2305.15324. size. arXiv preprint arXiv:1711.00489.

66
[515] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Ra- [529] X. Sun, T. Ge, F. Wei and H. Wang. 2021. Instanta-
jbhandari, J. Casper, Z. Liu, S. Prabhumoye et al. 2022. neous grammatical error correction with shallow aggres-
Using deepspeed and megatron to train megatron-turing sive decoding. In Proceedings of the 59th Annual Meeting
nlg 530b, a large-scale generative language model. arXiv of the Association for Computational Linguistics and the
preprint arXiv:2201.11990. 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages 5937–5947,
[516] I. Solaiman and C. Dennison. 2021. Process for adapting Online. Association for Computational Linguistics.
language models to society (palms) with values-targeted
datasets. Advances in Neural Information Processing Sys- [530] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang,
tems, 34:5861–5873. J. Liu, X. Chen et al. 2021. Ernie 3.0: Large-scale knowl-
edge enhanced pre-training for language understanding and
[517] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, generation. arXiv preprint arXiv:2107.02137.
W. Hamza, H. Khan, C. Peris, S. Rawls et al. 2022. Alex-
atm 20b: Few-shot learning using a large-scale multilingual [531] Z. Sun. 2023. A short survey of viewing large language
seq2seq model. arXiv preprint arXiv:2208.01448. models in legal aspect.

[518] B. Sorscher, R. Geirhos, S. Shekhar, S. Ganguli and [532] D. Surís, S. Menon and C. Vondrick. 2023. Vipergpt:
A. S. Morcos. 2022. Beyond neural scaling laws: beat- Visual inference via python execution for reasoning. arXiv
ing power law scaling via data pruning. arXiv preprint preprint arXiv:2303.08128.
arXiv:2206.14486.
[533] Susan Zhang [@suchenzang]. 2023. Piling on to the
[519] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, pile-on (sorry - it’s always easy to criticize), here’s a rant
A. Abid, A. Fisch, A. R. Brown, A. Santoro et al. 2022. about benchmarks for LLMs that are used to back claims
Beyond the imitation game: Quantifying and extrapolat- of "stronger" or "better" models. Let’s start with a tour
ing the capabilities of language models. arXiv preprint through GPT-3’s Appendix G... 1/8.
arXiv:2206.04615.
[534] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay,
[520] J. Steinhardt. 2022. Future ml systems will be qualita- H. W. Chung, A. Chowdhery, Q. V. Le et al. 2022. Chal-
tively different. Accessed May, 20:2022. lenging big-bench tasks and whether chain-of-thought can
solve them. arXiv preprint arXiv:2210.09261.
[521] J. Steinhardt. 2023. Emergent deception
[535] S. Swaminathan, A. Dedieu, R. V. Raju, M. Shanahan,
and emergent optimization. Available from:
M. Lazaro-Gredilla and D. George. 2023. Schema-learning
https://bounded-regret.ghost.io/
and rebinding as mechanisms of in-context learning and
emergent-deception-optimization/. Ac-
emergence. ArXiv:2307.01201 [cs].
cessed: 29/04/2023.
[536] H. Tang, S. Gan, A. A. Awan, S. Rajbhandari, C. Li,
[522] M. Stern, N. Shazeer and J. Uszkoreit. 2018. Block-
X. Lian, J. Liu, C. Zhang et al. 2021. 1-bit adam: Com-
wise parallel decoding for deep autoregressive models.
munication efficient large-scale training with adam’s con-
In Proceedings of the 32nd International Conference on
vergence speed. In Proceedings of the 38th International
Neural Information Processing Systems, NIPS’18, page
Conference on Machine Learning, volume 139 of Proceed-
10107–10116, Red Hook, NY, USA. Curran Associates
ings of Machine Learning Research, pages 10118–10129.
Inc.
PMLR.
[523] C. Stevenson, I. Smal, M. Baas, R. Grasman and [537] L. Tang, G. Uberti and T. Shlomi. 2023. Baselines
H. van der Maas. 2022. Putting gpt-3’s creativity to the for Identifying Watermarked Large Language Models.
(alternative uses) test. arXiv preprint arXiv:2206.08932. ArXiv:2305.18456 [cs].
[524] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, [538] L. Tang, Z. Sun, B. Idnay, J. G. Nestor, A. Soroush, P. A.
C. Voss, A. Radford, D. Amodei et al. 2020. Learning to Elias, Z. Xu, Y. Ding et al. 2023. Evaluating large language
summarize with human feedback. In Conference on Neural models on medical evidence summarization. medRxiv,
Information Processing Systems. pages 2023–04.

[525] A. Stolfo, Z. Jin, K. Shridhar, B. Schölkopf and [539] R. Tang, Y.-N. Chuang and X. Hu. 2023. The Science of
M. Sachan. 2022. A causal framework to quantify the Detecting LLM-Generated Texts. ArXiv:2303.07205 [cs].
robustness of mathematical reasoning with language mod-
els. [540] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
C. Guestrin, P. Liang and T. B. Hashimoto. 2023. Alpaca:
[526] J. Su, Y. Lu, S. Pan, B. Wen and Y. Liu. 2021. Roformer: A strong, replicable instruction-following model.
Enhanced transformer with rotary position embedding.
[541] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao and
[527] M. Sun, Z. Liu, A. Bair and J. Z. Kolter. 2023. A simple C. Zheng. 2021. Synthesizer: Rethinking self-attention
and effective pruning approach for large language models. for transformer models. In International conference on
machine learning, pages 10183–10192. PMLR.
[528] T. Sun, Y. Shao, H. Qian, X. Huang and X. Qiu. 2022.
Black-box tuning for language-model-as-a-service. In Pro- [542] Y. Tay, M. Dehghani, S. Abnar, H. W. Chung, W. Fedus,
ceedings of the 39th International Conference on Machine J. Rao, S. Narang, V. Q. Tran et al. 2022. Scaling laws
Learning, volume 162 of Proceedings of Machine Learning vs model architectures: How does inductive bias influence
Research, pages 20841–20855. PMLR. scaling?

67
[543] Y. Tay, M. Dehghani, D. Bahri and D. Metzler. 2022. [559] A. Uchendu, T. Le, K. Shu and D. Lee. 2020. Authorship
Efficient transformers: A survey. ACM Computing Surveys, Attribution for Neural Text Generation. In Proceedings
55(6):1–28. of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 8384–8395, Online.
[544] Y. Tay, M. Dehghani, J. Rao, W. Fedus, S. Abnar, H. W. Association for Computational Linguistics.
Chung, S. Narang, D. Yogatama et al. 2022. Scale Effi-
ciently: Insights from Pre-training and Fine-tuning Trans- [560] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel,
formers. ArXiv:2109.10686 [cs]. L. Wang, A. Creswell, G. Irving et al. 2022. Solving math
word problems with process- and outcome-based feedback.
[545] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, J. Wei,
X. Wang, H. W. Chung, D. Bahri et al. 2022. Ul2: Unifying [561] S. University. 2023. Holistic evaluation of langauge
language learning paradigms. models results page. Available from: https://crfm.
stanford.edu/helm/latest/?groups=1. Ac-
cessed: 23/03/2023.
[546] Y. Tay, V. Q. Tran, S. Ruder, J. Gupta, H. W. Chung,
D. Bahri, Z. Qin, S. Baumgartner et al. 2022. Charformer: [562] K. Valmeekam, A. Olmo, S. Sreedharan and S. Kamb-
Fast character transformers via gradient-based subword hampati. 2023. Large language models still can’t plan
tokenization. (a benchmark for llms on planning and reasoning about
change).
[547] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So,
S. Shakeri, X. Garcia, H. S. Zheng et al. 2022. Transcend- [563] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
ing scaling laws with 0.1% extra compute. L. Jones, A. N. Gomez, L. u. Kaiser and I. Polosukhin.
2017. Attention is all you need. In Advances in Neural
[548] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, Information Processing Systems, volume 30. Curran Asso-
A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez et al. 2022. ciates, Inc.
Galactica: A large language model for science. arXiv
preprint arXiv:2211.09085. [564] S. Vemprala, R. Bonatti, A. Bucker and A. Kapoor. 2023.
Chatgpt for robotics: Design principles and model abilities.
[549] W. L. Taylor. 1953. “cloze procedure”: A new tool for
measuring readability. Journalism quarterly, 30(4):415– [565] A. Venigalla, J. Frankle and M. Carbin. 2022. Pubmed
433. gpt: A domain- specific large language model for biomed-
ical text. https://www.mosaicml.com/blog/
[550] J. Thiergart, S. Huber and T. Übellacker. 2021. Under- introducing-pubmed-gpt. Accessed: 2023-01-24.
standing emails and drafting responses–an approach using
[566] R. Verkuil, O. Kabeli, Y. Du, B. I. Wicky, L. F. Milles,
gpt-3. arXiv preprint arXiv:2102.03062.
J. Dauparas, D. Baker, S. Ovchinnikov et al. 2022. Lan-
guage models generalize beyond natural proteins. bioRxiv,
[551] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kul-
pages 2022–12.
shreshtha, H.-T. Cheng, A. Jin, T. Bos et al. 2022. Lamda:
Language models for dialog applications. arXiv preprint [567] A. Vijayakumar, M. Cogswell, R. Selvaraju, Q. Sun,
arXiv:2201.08239. S. Lee, D. Crandall and D. Batra. 2018. Diverse beam
search for improved description of complex scenes. Pro-
[552] R. Tian, S. Narayan, T. Sellam and A. P. Parikh. 2020. ceedings of the AAAI Conference on Artificial Intelligence,
Sticking to the Facts: Confident Decoding for Faithful 32(1).
Data-to-Text Generation. ArXiv:1910.08684 [cs].
[568] P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, M. Hobb-
[553] K. Tirumala, A. H. Markosyan, L. Zettlemoyer and hahn and A. Ho. 2022. Will we run out of data? an analysis
A. Aghajanyan. Memorization Without Overfitting: Ana- of the limits of scaling datasets in machine learning. arXiv
lyzing the Training Dynamics of Large Language Models. preprint arXiv:2211.04325.

[554] H. Q. To, N. D. Bui, J. Guo and T. N. Nguyen. 2023. [569] H. Viswanath and T. Zhang. 2023. Fairpy: A toolkit
Better language models of code through self-improvement. for evaluation of social biases and their mitigation in large
arXiv preprint arXiv:2304.01228. language models. arXiv preprint arXiv:2302.05508.

[555] A. Tornede, D. Deng, T. Eimer, J. Giovanelli, A. Mohan, [570] J. von Oswald, E. Niklasson, E. Randazzo, J. Sacra-
T. Ruhkopf, S. Segel, D. Theodorakopoulos et al. 2023. Au- mento, A. Mordvintsev, A. Zhmoginov and M. Vladymy-
toML in the Age of Large Language Models: Current Chal- rov. 2022. Transformers learn in-context by gradient de-
lenges, Future Opportunities and Risks. ArXiv:2306.08107 scent. arXiv preprint arXiv:2212.07677.
[cs].
[571] H. d. Vries. 2023. Go smol or go home.
[556] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. [572] T. Vu, B. Lester, N. Constant, R. Al-Rfou’ and D. Cer.
Lachaux, T. Lacroix, B. Rozière, N. Goyal et al. 2023. 2022. SPoT: Better frozen model adaptation through soft
LLaMA: Open and Efficient Foundation Language Models. prompt transfer. In Proceedings of the 60th Annual Meet-
ArXiv:2302.13971 [cs]. ing of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 5039–5059, Dublin, Ireland.
[557] H. Touvron, L. Martin and K. Stone. Llama 2: Open Association for Computational Linguistics.
Foundation and Fine-Tuned Chat Models.
[573] J. P. Wahle, T. Ruas, T. Foltỳnek, N. Meuschke and
[558] C. Tran, S. Khadkikar and A. Porollo. 2023. Survey of B. Gipp. 2022. Identifying machine-paraphrased plagia-
protein sequence embedding models. International Journal rism. In International Conference on Information, pages
of Molecular Sciences, 24(4):3775. 393–413. Springer.

68
[574] J. P. Wahle, T. Ruas, F. Kirstein and B. Gipp. 2022. [588] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith,
How large language models are transforming machine- D. Khashabi and H. Hajishirzi. 2022. Self-instruct: Align-
paraphrased plagiarism. arXiv preprint arXiv:2210.03568. ing language model with self generated instructions.

[575] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy and [589] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi,
S. Bowman. 2018. GLUE: A multi-task benchmark and A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran et al.
analysis platform for natural language understanding. In 2022. Super-naturalinstructions: Generalization via declar-
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: ative instructions on 1600+ nlp tasks. In Proceedings of
Analyzing and Interpreting Neural Networks for NLP, the 2022 Conference on Empirical Methods in Natural
pages 353–355, Brussels, Belgium. Association for Com- Language Processing, pages 5085–5109.
putational Linguistics.
[590] Y. Wang, Y. Zhao and L. Petzold. 2023. Are large lan-
[576] B. Wang and A. Komatsuzaki. 2021. GPT-J- guage models ready for healthcare? a comparative study
6B: A 6 Billion Parameter Autoregressive Language on clinical language understanding.
Model. https://github.com/kingoflolz/
mesh-transformer-jax. [591] Z. Wang, S. Cai, A. Liu, X. Ma and Y. Liang. 2023.
Describe, explain, plan and select: Interactive planning
[577] C. Wang, K. Cho and J. Gu. 2020. Neural machine with large language models enables open-world multi-task
translation with byte-level subwords. Proceedings of the agents. arXiv preprint arXiv:2302.01560.
AAAI Conference on Artificial Intelligence, 34(05):9154– [592] Z. Wang, J. Wohlwend and T. Lei. 2019. Struc-
9160. tured pruning of large language models. arXiv preprint
arXiv:1910.04732.
[578] C. Wang, X. Liu, Z. Chen, H. Hong, J. Tang and D. Song.
2022. DeepStruct: Pretraining of language models for [593] Z. Wang, Z. Dai, B. Póczos and J. Carbonell. 2019. Char-
structure prediction. In Findings of the Association for acterizing and avoiding negative transfer. In Proceedings of
Computational Linguistics: ACL 2022, pages 803–823, the IEEE/CVF conference on computer vision and pattern
Dublin, Ireland. Association for Computational Linguis- recognition, pages 11293–11302.
tics.
[594] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, N. De Fre-
[579] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, itas et al. 2013. Bayesian optimization in high dimensions
Y. Zhu, L. Fan and A. Anandkumar. 2023. Voyager: An via random embeddings. In IJCAI, volume 13, pages 1778–
open-ended embodied agent with large language models. 1784.
arXiv preprint arXiv:2305.16291.
[595] T. Webb, K. J. Holyoak and H. Lu. 2022. Emergent
[580] H. Wang, J. Kaddour, S. Liu, J. Tang, M. Kusner, analogical reasoning in large language models.
J. Lasenby and Q. Liu. 2022. Evaluating self-supervised
learning for molecular graph embeddings. arXiv preprint [596] A. Webson and E. Pavlick. 2022. Do prompt-based
arXiv:2206.08005. models really understand the meaning of their prompts? In
Proceedings of the 2022 Conference of the North American
[581] P. Wang, L. Li, L. Chen, D. Zhu, B. Lin, Y. Cao, Q. Liu, Chapter of the Association for Computational Linguistics:
T. Liu et al. 2023. Large Language Models are not Fair Human Language Technologies, pages 2300–2344, Seattle,
Evaluators. ArXiv:2305.17926 [cs]. United States. Association for Computational Linguistics.

[582] R. Wang, H. Wang, F. Mi, Y. Chen, R. Xu and K.- [597] A. Wei, N. Haghtalab and J. Steinhardt. 2023. Jailbroken:
F. Wong. 2023. Self-critique prompting with large lan- How Does LLM Safety Training Fail? ArXiv:2307.02483
guage models for inductive instructions. arXiv preprint [cs].
arXiv:2305.13733.
[598] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester,
[583] S. Wang, Y. Liu, Y. Xu, C. Zhu and M. Zeng. 2021. Want N. Du, A. M. Dai et al. 2022. Finetuned language models
to reduce labeling cost? gpt-3 can help. are zero-shot learners. In International Conference on
Learning Representations.
[584] S. Wang, S. Menon, T. Long, K. Henderson, D. Li,
K. Crowston, M. Hansen, J. V. Nickerson et al. 2023. Reel- [599] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph,
framer: Co-creating news reels on social media with gener- S. Borgeaud, D. Yogatama, M. Bosma et al. 2022. Emer-
ative ai. arXiv preprint arXiv:2304.09653. gent abilities of large language models.

[600] J. Wei, Y. Tay and Q. V. Le. 2022. Inverse scaling can


[585] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, become u-shaped. arXiv preprint arXiv:2211.02011.
S. Narang, A. Chowdhery and D. Zhou. 2022. Self-
consistency improves chain of thought reasoning in lan- [601] J. Wei, X. Wang, D. Schuurmans, M. Bosma, brian ichter,
guage models. F. Xia, E. H. Chi, Q. V. Le et al. 2022. Chain of thought
prompting elicits reasoning in large language models. In
[586] Y. Wang, Z. Yu, Z. Zeng, L. Yang, C. Wang, H. Chen, Advances in Neural Information Processing Systems.
C. Jiang, R. Xie et al. 2023. Pandalm: An automatic eval-
uation benchmark for llm instruction tuning optimization. [602] L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato,
arXiv preprint arXiv:2306.05087. P.-S. Huang, M. Cheng, M. Glaese et al. 2021. Ethical and
social risks of harm from language models. arXiv preprint
[587] Y. Wang. 2021. Comment section personalization: Algo- arXiv:2112.04359.
rithmic, interface, and interaction design. In Proceedings
of the EACL Hackashop on News Media Content Analysis [603] M. Weiss. 2019. Deepfake bot submissions to federal
and Automated Report Generation, pages 84–88, Online. public comment websites cannot be distinguished from
Association for Computational Linguistics. human submissions. Technology Science, 2019121801.

69
[604] S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho and [621] Q. Xie, Z. Luo, B. Wang and S. Ananiadou. 2023. A
J. Weston. 2019. Neural text generation with unlikelihood survey on biomedical text summarization with pre-trained
training. arXiv preprint arXiv:1908.04319. language model.

[605] L. Weng. 2023. Large transformer model inference opti- [622] S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu,
mization. Lil’Log. P. Liang, Q. V. Le et al. 2023. DoReMi: Optimizing
Data Mixtures Speeds Up Language Model Pretraining.
[606] L. Weng. 2023. Prompt engineering. lilian- ArXiv:2305.10429 [cs].
weng.github.io.
[623] S. M. Xie, A. Raghunathan, P. Liang and T. Ma. 2022.
[607] M. Willig, M. ZEČEVIĆ, D. S. Dhami and K. Kersting. An Explanation of In-context Learning as Implicit Bayesian
2023. Causal parrots: Large language models may talk Inference. ArXiv:2111.02080 [cs].
causality but are not causal. preprint.
[624] S. M. Xie, S. Santurkar, T. Ma and P. Liang. 2023. Data
[608] F. Winkelmolen, N. Ivkin, H. F. Bozkurt and Z. Karnin. Selection for Language Models via Importance Resam-
2020. Practical and sample efficient zero-shot hpo. arXiv pling. ArXiv:2302.03169 [cs].
preprint arXiv:2007.13382.
[625] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng,
[609] Y. Wolf, N. Wies, Y. Levine and A. Shashua. 2023. Fun- C. Tao and D. Jiang. 2023. Wizardlm: Empowering large
damental limitations of alignment in large language models. language models to follow complex instructions. arXiv
arXiv preprint arXiv:2304.11082. preprint arXiv:2304.12244.
[610] M. Wornow, Y. Xu, R. Thapa, B. Patel, E. Steinberg, [626] F. F. Xu, U. Alon, G. Neubig and V. J. Hellendoorn. 2022.
S. Fleming, M. A. Pfeffer, J. Fries et al. 2023. The shaky A systematic evaluation of large language models of code.
foundations of clinical foundation models: A survey of
large language models and foundation models for emrs. [627] M. Xu, X. Yuan, S. Miret and J. Tang. 2023. Protst:
Multi-modality learning of protein sequences and biomedi-
[611] F. Wu, D. Radev and J. Xu. 2023. When geometric cal texts. arXiv preprint arXiv:2301.12040.
deep learning meets pretrained protein language models.
bioRxiv, pages 2023–01. [628] Y. Xu, H. Lee, D. Chen, B. Hechtman, Y. Huang,
R. Joshi, M. Krikun, D. Lepikhin et al. 2021. Gspmd:
[612] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe,
general and scalable parallelization for ml computation
J. Leike and P. Christiano. 2021. Recursively sum-
graphs. arXiv preprint arXiv:2105.04663.
marizing books with human feedback. arXiv preprint
arXiv:2109.10862.
[629] L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang,
[613] J. Wu, F. Wu, B. Jiang, W. Liu and P. Zhao. 2022. tfold- M. Kale, A. Roberts and C. Raffel. 2022. ByT5: Towards
ab: Fast and accurate antibody structure prediction without a token-free future with pre-trained byte-to-byte models.
sequence homologs. bioRxiv, pages 2022–11. ArXiv:2105.13626 [cs].

[614] P. Y. Wu, J. A. Tucker, J. Nagler and S. Messing. 2023. [630] L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang,
Large language models can be used to estimate the ideolo- M. Kale, A. Roberts and C. Raffel. 2022. ByT5: Towards
gies of politicians in a zero-shot learning setting. a token-free future with pre-trained byte-to-byte models.
Transactions of the Association for Computational Linguis-
[615] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, tics, 10:291–306.
H. Zhu et al. 2021. Yuan 1.0: Large-scale pre-trained
language model in zero-shot and few-shot learning. [631] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou,
A. Siddhant, A. Barua and C. Raffel. 2021. mT5: A mas-
[616] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, sively multilingual pre-trained text-to-text transformer. In
S. Gehrmann, P. Kambadur, D. Rosenberg et al. 2023. Proceedings of the 2021 Conference of the North American
Bloomberggpt: A large language model for finance. Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 483–498, Online.
[617] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, Association for Computational Linguistics.
W. Macherey, M. Krikun, Y. Cao et al. 2016. Google’s neu-
ral machine translation system: Bridging the gap between [632] L. Yan, L. Sha, L. Zhao, Y. Li, R. Martinez-Maldonado,
human and machine translation. G. Chen, X. Li, Y. Jin et al. 2023. Practical and ethical chal-
lenges of large language models in education: A systematic
[618] Y. Wu, M. Gardner, P. Stenetorp and P. Dasigi. 2022. literature review.
Generating data to mitigate spurious correlations in
natural language inference datasets. arXiv preprint [633] G. Yang, E. Hu, I. Babuschkin, S. Sidor, X. Liu, D. Farhi,
arXiv:2203.12942. N. Ryder, J. Pachocki et al. 2021. Tuning large neural net-
works via zero-shot hyperparameter transfer. Advances in
[619] Z. Wu, L. Qiu, A. Ross, E. Akyürek, B. Chen, B. Wang, Neural Information Processing Systems, 34:17084–17097.
N. Kim, J. Andreas et al. 2023. Reasoning or Reciting?
Exploring the Capabilities and Limitations of Language [634] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang,
Models Through Counterfactual Tasks. ArXiv:2307.02477 B. Yin and X. Hu. 2023. Harnessing the power of llms in
[cs]. practice: A survey on chatgpt and beyond.

[620] Y. Xiao and W. Y. Wang. 2021. On Hallucination and [635] K. Yang and D. Klein. 2021. Fudge: Controlled text
Predictive Uncertainty in Conditional Language Genera- generation with future discriminators. arXiv preprint
tion. ArXiv:2103.15025 [cs]. arXiv:2104.05218.

70
[636] K. Yang, D. Klein, N. Peng and Y. Tian. 2022. Doc: Im- [650] R. You, Y. Liu, H. Mamitsuka and S. Zhu. 2021.
proving long story coherence with detailed outline control. Bertmesh: deep contextual representation learning for
arXiv preprint arXiv:2212.10077. large-scale high-performance mesh indexing with full text.
Bioinformatics, 37(5):684–692.
[637] K. Yang, N. Peng, Y. Tian and D. Klein. 2022. Re3:
Generating longer stories with recursive reprompting and [651] F. Yu, L. Quartey and F. Schilder. 2022. Legal prompting:
revision. arXiv preprint arXiv:2210.06774. Teaching a language model to think like a lawyer. arXiv
preprint arXiv:2212.01326.
[638] X. Yang, K. Chen, W. Zhang, C. Liu, Y. Qi, J. Zhang, [652] L. Yu, D. Simig, C. Flaherty, A. Aghajanyan, L. Zettle-
H. Fang and N. Yu. 2023. Watermarking Text Generated moyer and M. Lewis. 2023. Megabyte: Predicting
by Black-Box Language Models. ArXiv:2305.08883 [cs]. million-byte sequences with multiscale transformers. arXiv
preprint arXiv:2305.07185.
[639] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths,
Y. Cao and K. Narasimhan. 2023. Tree of Thoughts: De- [653] P. Yu, M. Artetxe, M. Ott, S. Shleifer, H. Gong, V. Stoy-
liberate Problem Solving with Large Language Models. anov and X. Li. 2022. Efficient language modeling with
ArXiv:2305.10601 [cs]. sparse all-mlp.

[640] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan [654] P. Yu, T. Wang, O. Golovneva, B. Alkhamissy, G. Ghosh,
and Y. Cao. 2022. React: Synergizing reasoning and acting M. Diab and A. Celikyilmaz. 2022. Alert: Adapting
in language models. arXiv preprint arXiv:2210.03629. language models to reasoning tasks. arXiv preprint
arXiv:2212.08286.
[641] X. Yao, Y. Zheng, X. Yang and Z. Yang. 2022. NLP
From Scratch Without Large-Scale Pretraining: A Simple [655] L. Yunxiang, L. Zihan, Z. Kai, D. Ruilong and Z. You.
and Efficient Framework. In Proceedings of the 39th Inter- 2023. Chatdoctor: A medical chat model fine-tuned on
national Conference on Machine Learning, pages 25438– llama model using medical domain knowledge.
25451. PMLR. ISSN: 2640-3498. [656] E. Zelikman, Y. Wu, J. Mu and N. Goodman. 2022. STar:
Bootstrapping reasoning with reasoning. In Advances in
[642] Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng, Neural Information Processing Systems.
H. Chen and N. Zhang. 2023. Editing Large Lan-
guage Models: Problems, Methods, and Opportunities. [657] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi,
ArXiv:2305.13172 [cs]. F. Roesner and Y. Choi. 2019. Defending against neural
fake news. Advances in neural information processing
[643] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li and systems, 32.
Y. He. 2022. Zeroquant: Efficient and affordable post-
training quantization for large-scale transformers. arXiv [658] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding,
preprint arXiv:2206.01861. Z. Yang, Y. Xu et al. 2022. Glm-130b: An open bilingual
pre-trained model.
[644] M. Yasunaga, A. Bosselut, H. Ren, X. Zhang, C. D. Man-
ning, P. Liang and J. Leskovec. 2022. Deep bidirectional [659] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang,
language-knowledge graph pretraining. arXiv preprint X. Jiang, Z. Yang et al. 2021. Pangu-α: Large-scale au-
arXiv:2210.09338. toregressive pretrained chinese language models with auto-
parallel computation.
[645] S. Yi, R. Goel, C. Khatri, A. Cervone, T. Chung, B. He- [660] F. Zhang, B. Chen, Y. Zhang, J. Liu, D. Zan, Y. Mao, J.-
dayatnia, A. Venkatesh, R. Gabriel et al. 2019. Towards G. Lou and W. Chen. 2023. Repocoder: Repository-level
coherent and engaging spoken dialog response generation code completion through iterative retrieval and generation.
using automatic conversation evaluators. In Proceedings
of the 12th International Conference on Natural Language [661] H. Zhang, L. H. Li, T. Meng, K.-W. Chang and G. V. d.
Generation, pages 65–75, Tokyo, Japan. Association for Broeck. 2022. On the Paradox of Learning to Reason from
Computational Linguistics. Data. ArXiv:2205.11502 [cs].

[646] D. Yogatama, C. de Masson d’Autume and L. Kong. [662] H. Zhang, D. Duckworth, D. Ippolito and A. Neelakan-
2021. Adaptive semiparametric language models. Trans- tan. 2021. Trading off diversity and quality in natural
actions of the Association for Computational Linguistics, language generation. In Proceedings of the Workshop on
9:362–373. Human Evaluation of NLP Systems (HumEval), pages 25–
33, Online. Association for Computational Linguistics.
[647] T. Yoneda, J. Fang, P. Li, H. Zhang, T. Jiang, S. Lin,
B. Picker, D. Yunis et al. 2023. Statler: State-maintaining [663] M. Zhang and Y. He. 2020. Accelerating training of
language models for embodied reasoning. transformer-based language models with progressive layer
dropping.
[648] K. M. Yoo, D. Park, J. Kang, S.-W. Lee and W. Park. [664] M. Zhang, O. Press, W. Merrill, A. Liu and N. A. Smith.
2021. GPT3Mix: Leveraging large-scale language models 2023. How Language Model Hallucinations Can Snowball.
for text augmentation. In Findings of the Association for ArXiv:2305.13534 [cs].
Computational Linguistics: EMNLP 2021, pages 2225–
2239, Punta Cana, Dominican Republic. Association for [665] S. Zhang. 2023. [...] that’s an unhelpful order of
Computational Linguistics. magnitude difference in how large of a model you
should be training in order to be considered “compute
[649] K. Yoo, W. Ahn, J. Jang and N. Kwak. 2023. Robust Nat- optimal”. https://twitter.com/suchenzang/
ural Language Watermarking through Invariant Features. status/1616752494608007171?s=20. Accessed:
ArXiv:2305.01904 [cs]. 2023-06-06.

71
[666] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, [681] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma,
S. Chen, C. Dewan, M. Diab et al. 2022. Opt: Open pre- A. Efrat et al. 2023. LIMA: Less Is More for Alignment.
trained transformer language models. ArXiv:2305.11206 [cs].

[667] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger and [682] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang,
Y. Artzi. 2019. Bertscore: Evaluating text generation with D. Schuurmans, C. Cui et al. 2022. Least-to-most prompt-
bert. arXiv preprint arXiv:1904.09675. ing enables complex reasoning in large language models.

[668] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown [683] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis,
and T. B. Hashimoto. 2023. Benchmarking large language H. Chan and J. Ba. 2023. Large language models are
models for news summarization. human-level prompt engineers. In International Confer-
ence on Learning Representations.
[669] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, Z. Sun,
Y. Yao, F. Qi et al. 2021. Cpm-2: Large-scale cost-effective [684] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urta-
pre-trained language models. sun, A. Torralba and S. Fidler. 2015. Aligning books and
movies: Towards story-like visual explanations by watch-
[670] Z. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun and J. Zhou. 2022. ing movies and reading books.
Moefication: Transformer feed-forward layers are mixtures
of experts. [685] B. Zhuang, J. Liu, Z. Pan, H. He, Y. Weng and C. Shen.
2023. A survey on efficient training of transformers. arXiv
[671] Z. Zhang, A. Zhang, M. Li and A. Smola. 2022. Auto- preprint arXiv:2302.01107.
matic chain of thought prompting in large language models.
[686] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Rad-
[672] S. Zhao, J. Wen, L. A. Tuan, J. Zhao and J. Fu. ford, D. Amodei, P. Christiano and G. Irving. 2019. Fine-
2023. Prompt as triggers for backdoor attack: Examin- tuning language models from human preferences. arXiv
ing the vulnerability in language models. arXiv preprint preprint arXiv:1909.08593.
arXiv:2305.01219.
[687] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean,
[673] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, N. Shazeer and W. Fedus. 2022. St-moe: Designing stable
Y. Min, B. Zhang et al. 2023. A Survey of Large Language and transferable sparse expert models.
Models. ArXiv:2303.18223 [cs].
[688] M. Zvyagin, A. Brace, K. Hippe, Y. Deng, B. Zhang,
[674] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, C. O. Bohorquez, A. Clyde, B. Kale et al. 2022. Genslms:
L. Wright, H. Shojanazeri et al. 2023. Pytorch fsdp: experi- Genome-scale language models reveal sars-cov-2 evolu-
ences on scaling fully sharded data parallel. arXiv preprint tionary dynamics. bioRxiv, pages 2022–10.
arXiv:2304.11277.

[675] Z. Zhao, E. Wallace, S. Feng, D. Klein and S. Singh.


2021. Calibrate before use: Improving few-shot perfor-
mance of language models. In Proceedings of the 38th
International Conference on Machine Learning, volume
139 of Proceedings of Machine Learning Research, pages
12697–12706. PMLR.

[676] B. Zheng, L. Dong, S. Huang, S. Singhal, W. Che, T. Liu,


X. Song and F. Wei. 2021. Allocating large vocabulary ca-
pacity for cross-lingual language model pre-training. arXiv
preprint arXiv:2109.07306.

[677] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen,


Y. Huang, Y. Wang, Y. Xu et al. 2022. Alpa: Automat-
ing inter- and Intra-Operator parallelism for distributed
deep learning. In 16th USENIX Symposium on Operat-
ing Systems Design and Implementation (OSDI 22), pages
559–578, Carlsbad, CA. USENIX Association.

[678] R. Zheng, S. Dou, S. Gao, W. Shen, B. Wang, Y. Liu,


S. Jin, Q. Liu et al. 2023. Secrets of RLHF in Large
Language Models Part I: PPO. ArXiv:2307.04964 [cs].

[679] W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang,


A. Saied, W. Chen et al. 2023. Agieval: A human-
centric benchmark for evaluating foundation models. arXiv
preprint arXiv:2304.06364.

[680] A. Zhou, Y. Ma, J. Zhu, J. Liu, Z. Zhang, K. Yuan,


W. Sun and H. Li. 2021. Learning N: M fine-grained
structured sparse neural networks from scratch. In 9th In-
ternational Conference on Learning Representations, ICLR
2021, Virtual Event, Austria, May 3-7, 2021. OpenRe-
view.net.

72

You might also like