6
Training Optimus Prime, M.D.
A Case Study of Automated Item Generation Using Artificial Intelligence – From Fine-Tuned GPT2 to GPT3 and Beyond
Matthias von Davier
1. Introduction
The aim of this chapter is to provide evidence on the state of automated item generation (AIG)
using deep neural networks (DNNs). Building on an earlier paper that tackled this issue
using character-based recurrent neural networks (von Davier, 2018), the current contribution
describes an experiment exploring AIG using very large transformer-based language models
(Vaswani et al., 2017; Brown et al., 2020; BLOOM: https://huggingface.co/bigscience/bloom).
The chapter provides an overview of a case study that utilizes the latest generation of language models for text generation. In terms of significant stepping-stones, the description is
based on the following developments:
a. GPT-2, OpenAI’s model, was described in Radford et al. (2018). This chapter explains,
among other things, how GPT-2 was retrained using roughly 800,000 PubMed open access
articles for the purpose of generating clinical vignettes. GPT-2 was superseded (in size)
by MegatronLM (NVIDIA, 2019).
b. The next step, and one that not only made a huge splash in the media but also resulted in
a large number of startups using NLG, was the release of the GPT-3 API, which allowed
access to what is currently the most widely used transformer model, clocking in at 175 billion
parameters (Brown et al., 2020).
c. GPT-J-6B, a 6-billion-parameter model, and the more recent (February 2022) GPT-NeoX
with 20 billion parameters, were released. Some examples generated for this chapter are
based on GPT-J. Both models are provided by EleutherAI (www.eleuther.ai/), a self-described
grassroots initiative, a ‘decentralized collective of volunteer researchers, engineers, and developers
focused on AI alignment, scaling, and open-source AI research, founded in July of 2020’.
d. BLOOM (July 2022), available through the Hugging Face portal (https://huggingface.
co/bigscience/bloom) based on the BigScience collaborative open science initiative, is a
language model trained on 46 natural languages. It aims at free worldwide access, while
OpenAI’s GPT models were (despite the name of the organization) proprietary and
licensed to individuals and organizations.
These most recent developments, BLOOM as well as GPT-NeoX, give reason to hope that
access to and work with these language models will further improve, and that researchers who
were previously unable to participate due to language, economic, or political barriers will be able
to engage in applications of and research on large language models. Moreover, BLOOM
is released under what BigScience calls a Responsible AI Licensing agreement (RAIL: https://huggingface.co/spaces/bigscience/license), which includes a section that restricts the use of this
model for purposes that can lead to discrimination, physical or emotional harm, or the dissemination of misinformation. An increasingly important aspect of the applications of AI is the
responsible and accountable use of these increasingly potent technologies.
Some of the recent neural network-based language models include more than 175 billion
parameters, a size that was inconceivable for the type of neural networks in use
only a few years back. In the winter semester of 1999–2000, I taught classes about artificial
neural networks (NNs) – for example, Perceptrons (Rosenblatt, 1958) and Hopfield networks
(Hopfield, 1982). Back then, artificial intelligence (AI) had already entered what was referred to as
the ‘AI winter’, as most network sizes were limited to rather small architectures unless supercomputers were employed. On the smaller machines that were available to most researchers, only
rather limited versions of these NNs could be trained and used, so successful applications were
rare, even though one of the key contributions that enabled deep learning and a renaissance
of NN-based AI, the long short-term memory (LSTM) design (Hochreiter & Schmidhuber,
1997), was made in those years. In 2017, I started looking into neural networks again because
I wanted to learn how to program graphical processing units (GPUs) for high-performance
computing (HPC) as needed for estimating complex psychometric models (von Davier, 2016).
After experimenting with high-performance computing for analyzing PISA data (which
cut down the estimation of IRT models from several hours to 2–3 minutes using the parallel-E
parallel-M algorithm developed in 2016), this finally led me to write a paper on using deep
neural networks for automated item generation (AIG; von Davier, 2018). AIG is a field that has
seen many different attempts, but most were only partially successful, involved a lot of human
preparation, and ended up more or less being fill-in-the-blanks approaches such as we see in
simple form in MadLibs books for learners.
While I was able to generate something that resembled human-written personality items
using a public database that contains some 3,000 of them, several of the (cherry-picked) generated
items sounded and functioned a lot like those found in personality inventories (Goldberg,
1999; Goldberg et al., 2006). I was somewhat skeptical whether one would be able to properly
train neural networks for this task, given that it would require a very large number of items, and
I assumed that each network for that purpose would need to be trained solely on items of the
form it is supposed to generate. Part of my concern was that the generated items had
to be hand-picked, as many of the generated character or word sequences ended up not being
properly formed statements. However, those that were selected for an empirical comparison
with human-coded items were found to show the same dimensionality (von Davier, 2018) and
hence to be fully useful as replacements for human-authored items. Nevertheless, some doubt
remained due to the needed hand-picking and the limited supply of training material. After all,
AI and neural networks have a long history (e.g., Rosenblatt, 1958; Wiesner, 1961) and have
repeatedly been hyped as the next big thing that may soon replace humans and take our jobs.
As mentioned, items generated using RNNs (von Davier, 2018) and then cherry-picked
passed empirical evaluations and hence functioned much like the human-written items in an
online data collection. However, many of the generated items were either not properly formed
statements of the kind typical for this domain, or, if the network was trained too long on too little
data, they were almost exact copies of what was entered as training material. Therefore, I concluded one would need a lot more data, or an unforeseen qualitative jump in deep learning that
I expected to be years away. This was wrong; it turns out that time indeed flies, the field of
deep learning did not rest, and while in the paper published in 2018 I stated that operational
use could be years away, I am not so sure anymore that we have to wait that long.
It may well be that we will soon see automated item generation based on deep learning systems
in tools that support item writers developing test questions for high-stakes exams,
and that deep neural networks will be used to generate questions or distractors for multiple-choice
questions used in test preparation and practice exams much sooner. The reason
I believe this has to do with a graduate student who developed a software tool for programmers
based on a product released by OpenAI (Radford, 2018). The software that supposedly
makes programmers’ lives so much better is called TabNine (e.g., Vincent, 2019), and it provides
context-sensitive (intelligent?) auto-completion based on indexed source code files. The author
of the software estimates that TabNine will save programmers at least 1 second per minute by
suggesting how lines of program code are completed, or what the most likely next line of code
may be, based on the code that the programmer provides and that the software uses to improve a
predictive model.
The title of the current chapter references two relevant lines of inquiry. One is an
article titled ‘Doctor A.I.’ (Choi et al., 2015), which described a deep learning approach
using generative adversarial networks (GANs) to generate electronic health records (EHRs)
that can pass as plausible EHRs; the other is the recently ignited race around language
models that use a specific neural network structure called the transformer, whose name was an obvious
trigger for many references to the sci-fi toys and movies. The remainder of this chapter is structured as follows: The next section introduces language models that are based on approaches
that can be used to generate the probability of a next word or language token using information
about a previously observed sequence of words. The following section outlines potential areas
of application and shows select examples of how NN-based language models could be utilized
in medical licensure and other assessment domains for AIG.
2. Background and Significance
AIG has been an area of research in the field of employment and educational testing for quite
some time (Bejar, 2002). Employing human experts to develop items that can be used in medical licensing and certification is particularly cost-intensive, as expert knowledge is needed to
author case vignettes and to develop plausible response options when writing multiple-choice
test questions. Any technology that can reduce these development costs by applying machine
learning or AI would be a welcomed addition to the toolbox of test developers. AIG often either
focused on items that are language free, such as intelligence tests with matrices of graphical
symbols that need to be completed by test-takers (Embretson, 1999), or employed methods
that amount to something that bears strong similarities to fill-in-the-blanks texts such as the
ones found in MadLibs.
The current work builds on and extends a study presented by von Davier (2018), in which an
RNN was trained on an open access database of 3,000 items available through the IPIP database
(Goldberg, 1999). While this previous study concluded that, with existing recurrent network-based
models and with limited item banks, a practical use of AI for AIG would be years away,
the development of language models took a quantum leap when researchers did away with
recurrence and focused on network architectures built around self-attention (Vaswani et al.,
2017). This allowed designing a simple network structure that was easily trained, allowed parallelism in training, and could be pretrained on general corpora of texts and subsequently trained
for specific purposes.
Retraining (and re-implementation) of the transformer has led to a variety of applications,
including the generation of poems and patent texts and the completion of code in support of software
developers. These applications will be referenced in appropriate sections over the remainder of
this chapter. This chapter also describes a similar experiment with the goal of providing a tool for
developing medical education test items using deep learning–based language models.
3. Materials and Methods
The basis of these predictive approaches is sequential models that provide the probability
of the next word (or other language token, such as a full stop, newline, etc.) given a number of
previous words. These models are not new; I recall that my first encounter with this type of model
was an article in Scientific American before 1985, when I was still a high school student and
part-time programmer working for a small educational gaming company located in northern
Germany (yes, game-based learning existed back then). This 1980s version actually goes back to
the seminal paper by Shannon (1948) and constitutes a primitive language model. This simple
model of course did not have the many layers and the complex network architecture of the deep
learning applications that are nowadays used for machine translation, picture annotation, or
automated item generation (von Davier, 2018); rather, it was based on a single layer that connected an input word (previous encounter) to an output word (next encounter). Technically,
the basis of this model was a transition matrix, with input (previous) and output (next) words
coded as binary vectors, and the model basically implemented the Markov assumption for a
model of natural language.
3.1 Markovian Language Models
The model just mentioned is a simple language model that can be viewed as a direct translation
of the Markov assumption for modeling a sequence of words $w_t \in W_V$ with index $t = 1, \ldots, T$.
Here, $W_V$ is a finite set of words, the vocabulary of a language, and $S = |W_V| < \infty$ denotes the size
of the vocabulary. Let $w : \{1, \ldots, S\} \rightarrow W_V$ be an index, i.e., a bijective function that maps integers
to words. That is, we can obtain an integer that represents a word $w_t$ by applying $i_t = w^{-1}(w_t)$,
and the associated word can be retrieved from any integer $i_t \in \{1, \ldots, S\}$ through $w(i_t)$.
In this most simple case of a language model, we assume that

$$P(w_{t+1} \mid w_t, \ldots, w_1) \equiv P(i_{t+1} \mid i_t, \ldots, i_1) = P(i_{t+1} \mid i_t) \equiv P(w_{t+1} \mid w_t)$$

for any $t \in \{1, \ldots, T-1\}$, namely that the probability of observing a next word $w_{t+1}$ at position
$t+1$ of the sequence depends only on the last observed word, $w_t$, and nothing else. The whole
sequence preceding the next-to-last word is ignored in this model. Then, if we assume homogeneity
of the transitions, i.e., $P\!\left(w^{-1}(w_{t+1}) \mid w^{-1}(w_t)\right) = P\!\left(w^{-1}(w_{u+1}) \mid w^{-1}(w_u)\right)$ whenever
$w_t = w_u$ and $w_{t+1} = w_{u+1}$, we can define

$$M_{t|t-1} = \begin{pmatrix} P(1 \mid 1) & \cdots & P(S \mid 1) \\ \vdots & \ddots & \vdots \\ P(1 \mid S) & \cdots & P(S \mid S) \end{pmatrix},$$

which is a transition matrix that provides a conditional probability distribution for any
$i = w^{-1}(w)$. If there are no constraints, this transition matrix has $S[S-1] = S^2 - S$ parameters,
i.e., roughly the square of the vocabulary size. The parameters can be obtained by estimating
simple sample statistics, or by some more sophisticated methods (e.g., Shannon, 1948).
A more complex language model would consider more than one previous word. This
can be implemented as follows. In order to take the previous $L$ words into account, define
$n_t = \left(w_t, w_{t-1}, \ldots, w_{t-(L-1)}\right) \in \bigotimes_{i=1}^{L} W_V$, which is an n-gram of length $L$.
Then assume for $t > L$ that

$$P(w_{t+1} \mid w_t, \ldots, w_1) = P(w_{t+1} \mid n_t) = P\!\left(w_{t+1} \mid w_t, w_{t-1}, \ldots, w_{t-(L-1)}\right).$$
While this is a perfectly sound definition, it has practical implications that may make applications impossible as soon as the vocabulary contains more than a handful of words and the
length of the sequence, $L$, grows larger than, say, 3. The issue is that the mini-sequence $n_t$ is an element of $\bigotimes_{i=1}^{L} W_V$, a much larger set with $S^L$ elements. For a vocabulary of only 100 words and
three-word sequences, there are already $100^3 = 1{,}000{,}000$ different elements.
For a transition matrix that contains all conditional probabilities for the next
word, given the previous three, we would need to train, estimate, or otherwise obtain
$(100 - 1) \times 1{,}000{,}000 = (S - 1) \times S^L$ probabilities. Therefore, most traditional approaches to constructing such a large transition matrix have not been pursued, as this would require very large
amounts of data.
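To make the preceding description concrete, the following minimal Python sketch estimates a bigram (L = 1) transition model from a small corpus by simple sample statistics (relative frequencies) and samples a continuation from it. The toy corpus and function names are illustrative only and are not part of the original study.

```python
import random
from collections import defaultdict, Counter

def fit_bigram_model(corpus):
    """Estimate P(next word | current word) from relative frequencies."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            counts[current][nxt] += 1
    # Normalize each row of the (sparse) transition matrix.
    model = {}
    for current, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[current] = {w: c / total for w, c in nxt_counts.items()}
    return model

def sample_sequence(model, start, length=10):
    """Generate a word sequence under the first-order Markov assumption."""
    word, output = start, [start]
    for _ in range(length):
        if word not in model:
            break
        candidates = list(model[word].keys())
        probs = list(model[word].values())
        word = random.choices(candidates, weights=probs, k=1)[0]
        output.append(word)
    return " ".join(output)

# Toy corpus; any plain-text item bank or article collection could be used.
corpus = [
    "the patient reports chest pain and shortness of breath",
    "the patient reports severe headache and blurry vision",
]
model = fit_bigram_model(corpus)
print(sample_sequence(model, "the"))
```

With a vocabulary of realistic size, the sparsity of such count-based estimates is exactly the practical problem described above, which motivates the neural approximations discussed next.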
3.2 Char- and Word-RNNs
One way of circumventing the need to use classical statistical estimation methods, and to be
able to ignore some of the more rigorous requirements of these methods, is using NNs for the
purpose of language modeling. NNs have been shown to be universal function approximators
(e.g., Hornik, 1991; Hanin, 2017). This means that an NN with proper design can be used to
plug in an estimate of a function that is otherwise hard to calculate, or hard to specify based on
more traditional approximation or estimation methods. This advantage is paid for by having
only vague knowledge about the actual form of the function that is being approximated, as NNs
operate as black boxes and do not easily reveal how the approximation is achieved.
In order to further reduce demands, one could model the sequence of characters rather than
words, as natural languages often contain several thousand words, while alphabetic languages
can be expressed using a much smaller character set. Therefore, an alternative to word-based
language models using neural networks can be implemented as a character-based language
model. A few years ago, Google released TensorFlow (Abadi et al., 2015), a powerful software
toolbox to design, train, and sample from neural networks. This triggered implementation of
a variety of deep learning approaches using this new tool, among these a character-based deep
recurrent neural network (Char-RNN, e.g., Ozair, 2016), and, more recently, other architectures that will be described in this chapter. Obviously, there are many more tools for deep
learning, and the models released for further analyses and fine-tuning, as done in the current
study, are typically available in more than one framework.
Wikipedia provides a list of neural network tools, specifically, deep learning–oriented tools,
at https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software.
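As an illustration of the character-based approach, the following sketch defines a small character-level recurrent network in Keras (bundled with TensorFlow), of the kind popularized as Char-RNN. It is a schematic under assumed settings, not the network used in von Davier (2018); vocabulary size, sequence length, and layer sizes are placeholder values.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 100   # number of distinct characters (placeholder)
SEQ_LEN = 64       # length of the character context window (placeholder)

# Character-level language model: embed characters, pass them through an LSTM,
# and predict a distribution over the next character at every position.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Training expects integer-encoded character sequences x of shape
# (batch, SEQ_LEN) and targets y of the same shape, where y is x shifted
# by one character; the random arrays below only illustrate the interface.
x = np.random.randint(0, VOCAB_SIZE, size=(8, SEQ_LEN))
y = np.roll(x, shift=-1, axis=1)
model.fit(x, y, epochs=1, verbose=0)
```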
4. Attention Is All You Need
Recent language models introduced the concept of attention, as a structure that was part of
the neural network architecture aimed at keeping certain concepts more salient. This was initially implemented in addition to the recurrent structures of deep learning models designed for
sequence-to-sequence and language modeling. However, Vaswani et al. (2017) proposed an
alternative, much simpler structure in which the context and the attention mechanism would
replace the sequential structures of RNNs. The title of Vaswani et al.’s article is mirrored in the
subsection title, and this article led to multiple language models published in short succession,
one of which was recently released by OpenAI and forms the basis of the retrained/fine-tuned
model presented in this chapter.
Vaswani et al. (2017) describe the new network structure as consisting only of encoder and decoder layers with multi-headed attention, which provide a distribution of most likely language tokens given a context of a certain length (say, 1,024 words and information about their
position). Psychoanalysts would probably say that transformers simulate some form of free
association, noting that this is even called self-attention in the literature. Interestingly, the
attention architecture used in the transformer-based models is simpler than what was previously deemed necessary in language models based on recurrent neural networks such as the
one used in Ozair (2016) and Brown et al. (2020). This simpler structure allows much faster
training, as the transformer architecture allows parallel processing by simultaneously
using word and position encodings rather than encoding the text sequentially. The drawback is
that (currently) only limited lengths of text can be encoded, as the parallel processing makes
it necessary to have the sequence to be encoded (input) as well as the output present as a
whole (for example, sentence by sentence), rather than word by word.
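To give a sense of the mechanism that replaces recurrence, the following NumPy sketch implements scaled dot-product self-attention for a single head, following the formulation in Vaswani et al. (2017). It is an illustrative sketch of the operation under assumed dimensions, not an excerpt from any of the released models.

```python
import numpy as np

def scaled_dot_product_self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token embeddings X.

    X:   (seq_len, d_model) token embeddings (plus position encodings)
    W_q, W_k, W_v: projection matrices of shape (d_model, d_k)
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                              # context-weighted values

# Toy example: 5 tokens with 16-dimensional embeddings, 8-dimensional head.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
print(scaled_dot_product_self_attention(X, W_q, W_k, W_v).shape)  # (5, 8)
```

Because every position attends to every other position in one matrix product, the whole sequence can be processed in parallel, which is the speed advantage (and the context-length limitation) described above.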
5. Reincarnations of the Transformers: GPT-2, Transformer-XL,
Grover, MegatronLM
The GPT-2 model was trained by a team of researchers at OpenAI (Radford et al., 2018) using
four different levels of complexity of the transformer architecture. In an unprecedented move,
OpenAI released only the two smallest models, which comprise network weights amounting to
117 million and 345 million parameters, respectively. The larger models were not published due
to concerns about malicious use cases and contain up to 1.4 billion (!) parameters. However, this
number was recently topped by NVIDIA, which published the MegatronLM model, comprising
more than 8 billion parameters, and made the code available on GitHub (https://github.com/
NVIDIA/Megatron-LM). The 1.4-billion-parameter OpenAI model, however, remains unpublished; as it says on the OpenAI website:
Due to our concerns about malicious applications of the technology, we are not releasing
the trained model. As an experiment in responsible disclosure, we are instead releasing a
much smaller model for researchers to experiment with, as well as a technical paper.
All GPT-2 models were trained on what OpenAI called WebText, which is a 40 GB database of
text scraped from the World Wide Web, excluding Wikipedia, as OpenAI researchers assumed
that Wikipedia may be used by secondary analysts to retrain/fine-tune for specific topics. As
the full model is not available, this means that the actual performance of the GPT-2 Transformer model cannot be verified independently, and other researchers can only use and modify
(retrain) the smaller models. The examples presented in this chapter are based on experiments
with the model that contains 345 million parameters.
Several other transformer-based language models have been under active development and
are being made available to researchers for fine-tuning and adaptation to different applications. Among these are the Transformer-XL (Dai et al., 2019), Grover (Zellers et al., 2019), and,
most recently, MegatronLM (NVIDIA, 2019). While the NVIDIA model used a corpus called
WebText that contains 40 GB of data and was modeled after the corpus used by OpenAI, Grover
was trained on 46 GB of real news and can be used to either generate, or detect, fake news.
This ability to both detect and generate is based on the fact that all of these approaches can
be viewed as probabilistic models that predict a sequence of new words (fake news, a translation, next poem lines, next syntax line in a software program) based on the previous sentence(s)
or lines of code. More formally, we can calculate the loss function

$$H(T, \breve{P}) = -\frac{1}{T} \sum_{t=1}^{T} \log \breve{P}(w_t \mid n_{t-1}),$$

where $\breve{P}(w_t \mid n_{t-1})$ is the estimated distribution of word $w_t$ given context (history) $n_{t-1}$. This
is an estimate of the cross entropy, or logarithmic entropy (Shannon, 1948), of the observed
sequence $w_0, \ldots, w_T$ given some initial context $n_0$. This quantity can be used to evaluate generated
sequences relative to the distribution of the loss based on true (human-generated) sequences
to help distinguish them. The cross entropy is a measure of how well predicted (in terms of
expected log-likelihood, e.g., Gilula & Haberman, 1994) an observed sequence is if a certain
model $\breve{P}$ is assumed to hold. This loss function is also used during training or fine-tuning in
order to evaluate how well the network predicts new batches of data that are submitted to the
training algorithm.
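The same loss can be computed for any candidate text under a trained model. As a hedged illustration (assuming the Hugging Face transformers package, which distributes the public GPT-2 weights, rather than the original OpenAI/TensorFlow toolchain used later in this chapter), the following sketch evaluates the average negative log-likelihood and perplexity of a sequence:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "A 52-year-old man complaining about chest pain comes to the emergency room."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # When labels are supplied, the model returns the mean cross entropy
    # -1/T * sum_t log P(w_t | w_<t), i.e., the loss H(T, P) described above.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"cross entropy: {outputs.loss.item():.3f}")
print(f"perplexity:    {torch.exp(outputs.loss).item():.3f}")
```

Comparing such values for machine-generated and human-written sequences is one simple way to operationalize the generate-versus-detect use described for Grover.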
It is worth mentioning that while all of these are variations on a theme, the transformer
architecture for language modeling has shown great potential in improving over previous
designs in terms of performance on a number of tasks (Devlin et al., 2018). In terms of the use
for generating test questions, Grover (Zellers et al., 2019) may prove useful in future applications, as it was designed to produce and detect fake news by using 46 GB worth of data based
on actual news scraped from the internet. Retraining Grover with targeted assessment materials around a content domain is one of the future directions to take for applied research into
automated item generation using NN-based language models.
6. Method and Generating Samples
The applications of deep learning, recurrent neural networks, and convolutional networks range from computer vision and picture annotation to summarizing, text generation,
question answering, and generating new instances of trained material. In some sense, RNNs
can be viewed as the imputation model of deep learning. One example of medical applications
is medGAN (Choi et al., 2016), a generative adversarial network (GAN) that can be trained on
a public database of EHRs and then used to generate new, synthetic health records. However,
medGAN can also be considered an ‘old style’ approach, just as the approach I used for
generating personality items (von Davier, 2018) is, as medGAN was not based on a pretrained
network that already incorporates a large body of material in order to give it general capabilities
that could be fine-tuned later.
Language models as represented by GPT-2 are pretrained based on large amounts of material that is available online. GPT-2 was trained on 40 GB of text collected from the internet but
excluding Wikipedia, as it was considered that some researchers may want to use this resource
to retrain the base GPT-2 model. These types of language models are considered multi-task
learners by their creators, i.e., they claim these models are systems that can be trained to perform a number of different language-related tasks such as summarization, question answering,
and translation (e.g., Radford, 2018). This means that a trained model can be used as the basis
for further targeted improvement, and that the rudimentary capabilities already trained into
the model can be improved by presenting further task-specific material.
7. AI-Based AIG Trained on Workstations With Gaming GPUs
While this should not distract from the aim of the chapter, it is important to know that some
considerations have to be made with respect to how and where calculations will be conducted.
Software tools used for deep learning are free (Abadi et al., 2015), and preconfigured servers and cloud services exist that facilitate the use of these tools. At the same time, significant
costs are involved, and researchers who develop new models and approaches in particular may
need multiple times more time and resources than standard applications used
to analyze data. The dilemma is that while most tools for training deep learning systems are
made freely available, these tools are worthless without powerful computers. And pointing
to the cloud is not helpful, as the cloud is ‘just someone else’s computer’ (as memes and geek
merchandise prove): High-performance hardware and algorithms that employ parallelism are
needed to train these kinds of networks, either in the form of hardware on-site, in a data center,
or rented through the cloud. The training of RNNs as well as transformer-based language models takes many hours of GPU time, which comes at significant costs if the cloud is used. For
recent language models of the type of GPT-2 large (1.4 billion parameters), or Grover-Mega, or
XLNet, the estimated cost was around $30K–$245K (XLNet) and $25K (Grover-Mega). More
details can be found at Sarazen and Peng (2019) as well as in online forums discussing the training and retraining of these models.
Obviously, cloud computing services come at a cost, and while new preconfigured systems
pop up daily and prices will decrease due to reduced hardware costs and competition, any more
involved project that requires training specialized systems, or retraining existing large models, will incur significant costs as well. The model used in the current paper was pretrained
on several TPUs (specialized Google hardware for tensor computations) for over a week, and
retraining as well as fine-tuning will take weeks of GPU time in order to produce a system that
is useful for a specific purpose. Therefore, building or purchasing a deep learning computer is
one of the options that should be carefully considered, as well as the use of cloud computing
or on-demand GPU time such as Vast.AI. Nowadays, even modest hardware such as gaming
desktops can be utilized, as most of these contain powerful GPUs for graphical processing,
which can be turned into thousands of processing units through toolkits such as CUDA, provided by the makers of these graphics cards (e.g., Chevitarese et al., 2012).

Figure 6.1 Server parts from eBay used to provide the official PISA 2015 data analysis, now upgraded
and re-purposed for automated item generation. All you need is processor cores, RAM, GPUs,
and an eBay auction sniper. While cloud computing is an option, the experiments reported
here are time-consuming and cloud computing is currently available at an on-demand rate
of $0.80/hour ($0.45 pre-ordered) per GPU. Retraining took 6 days on two GTX 1080 Ti GPUs
obtained and installed in a 2013 Dell T7610 dual Xeon processor workstation.
The hardware needed for training large NNs can be found at specialized vendors such as
Lambda Labs, who often also provide turnkey solutions such as operating system images that
include all the common machine learning toolkits, such as Keras, TensorFlow, PyTorch,
and others. An alternative is to DIY, using the many web resources that describe which
workstations can be obtained cheaply and how many of the essential GPUs they can house,
with or without modifications. In addition, there are free web resources – for example, Google
Colab, which is essentially a Jupyter Notebook that anyone with a Google account can use for
deep learning and machine learning experiments (free for short-term use) – and time-share on-demand GPU services such as Vast.AI can be used for a fee.
Without further digressions, we now turn to how these systems, either purchased fully configured as turnkey solutions or put together from used parts, can be utilized to produce text
that, to a much greater extent than imaginable only two years ago, can facilitate automated
generation of assessment materials, including the generation of electronic health records, the
production of suggestions for distractor choices in multiple-choice items, and the drafting of
patient vignettes based on prompts provided by item writers.
8. Electronic Health Records and Deep Learning
The fact that medicine uses IT for storing and managing patient data meant that
computer scientists were needed and hired to work on systems for this purpose. At the same
time, data on patients, as stored in electronic health records (EHRs), is highly sensitive, so
developers working in this area looked for ways to use databases that would not directly reflect
anyone’s real data. One way was to use the same data, carefully anonymized so that individuals
cannot be identified. A second approach was to generate health data of nonexistent patients
using the regularities found in real health data.
This was the birth of synthetic EHRs, either in the form of expert-generated models
(Synthea: Walonoski, 2017) or in the form of deep learning–based models that either predict
the next clinical event or generate (plausible) synthetic EHRs based on a training dataset of real
EHRs (Dr. AI: Choi, 2015; MedGAN: Choi, 2016). These models can be used to generate data
that can be considered a limitless resource, and they are anonymous by design so that concerns
about privacy and data protection are alleviated when using these synthetic records in design
and analysis work.
A recent systematic review (Xiao, 2018) describes 98 studies that use deep learning in conjunction with EHRs. The studies range from generating synthetic EHRs to enable users to
experiment with these data without any privacy issues, to building predictive models that are
used to forecast the next clinical encounter based on existing patient history. EHRs are an
important source of information and can be used to look at systematic differences in trajectories between patient groups, as well as how different treatments play out by comparing the
prevalence of subsequent encounters.
9. Distractor Generation Through Question Answering
The utility of the GPT-2 language model is currently being explored by means of retraining for specific
purposes. One application that was mentioned in the introduction is the TabNine software that
allows the prediction of line completions and programming code continuations. For that purpose,
each user provides data from their own project-related code, which can further improve prediction.
Table 6.1 Example Output of the 345M Parameter Model When Used for Question Answering
Model prompt >>> Q: What are the most common side effects of statins? A:
SAMPLE 1: They can lead to a reduction in the effectiveness of your insulin. They can be associated with an increased
risk of heart attack and stroke
SAMPLE 2: Common side effects are: nausea, vomiting, diarrhea, weight gain, joint pain . . .
SAMPLE 3: Statins can cause stomach inflammation or stomach ulcers. . . .
SAMPLE 4: There are some common side effects to statins, but these are not always permanent. For example, people
with cardiovascular disease tend to have more fatigue.
SAMPLE 5: These can include: dizziness, feeling like your eyes are going to pop out (this is called ‘repetitive eye
movements’), muscle pain, blurred vision
Table 6.2 Responses to a Question About Symptoms of IBS Using a Network Trained for 3 Days
Model prompt >>> What are the most common symptoms of irritable bowel syndrome?
SAMPLE 1.1 In chronic condition, patients report numerous symptoms including pain, bloating, abdominal pain,
diarrhea, constipation, and vomiting. Those patients who have diarrhea, cough, or abdominal pain are often affected
by this disorder . . . .
SAMPLE 1.2 In 2014, an estimated 3.9 million U.S. adults reported more than 1 or 2 symptoms of irritable bowel
syndrome. About 30 percent reported at least one symptom. In this survey, 63.8 percent IAmericans . . .
SAMPLE 1.3 Symptoms of irritable bowel syndrome tend to be more severe in those with irritable bowel syndrome,
which may make them the least likely to seek diagnosis and treatment. Some symptoms that may be present include
abdominal In, bloating, . . .
SAMPLE 1.4 Symptoms of irritable bowel syndrome Complications of irritable bowel syndrome Symptoms of
irritable bowel syndrome. Control Medication Inflammation Abnormal bowel movements. . . .
SAMPLE 2.1 Symptoms of irritable bowel syndrome No. of Possible Causes 8.2 10.7 Major Common Symptoms –
upper and lower gastrointestinal issues. 1. Irritable bowel syndrome with diarrhea (ILS) has been described as . . .
SAMPLE 2.2 It is common for the majority of irritable bowel syndrome patients to experience a variety of symptoms
including abdominal pain, cramping, bloating, nausea, abdominal pain, bloating, diarrhea, cramps, bloating,
cramping . . .
SAMPLE 2.3 My best friend and I recently purchased a new home to be closer to family and friends, and we felt
more comfortable living at home. However, due to some suspicious and unusual events, we were contacted by your
physician to take our . . .
SAMPLE 2.4 Symptoms of irritable bowel syndrome are often severe and often require hospitalization. The symptoms
of diarrhea (1.2%) and constipation (1.4%) are common in the community. Severe diarrhea and const . . .
The ability to generate distractors for existing multiple-choice items is already present even in
the original, not-retrained 345M GPT-2 model. The training material contained a large number of cases in which a question was followed by the prompt ‘A:’ and an answer. By means of
this type of training material, the model was enabled to react to a prompt that ends
in ‘A:’ with a continuation of the text that in many cases represents a meaningful response to a
question contained in the prompt. Here, we show two examples that were generated
using a local copy of GPT-2 on the author’s workstation equipped with a GPU and TensorFlow
(Abadi et al., 2015), which facilitates sampling from the language model.
The GPT-2 model was trained with material that includes text that has the structure: ‘Q:
What is X? A: X is a Y.’ In other words, the model is prompted to associate a sequence of words
that is bracketed by ‘Q:’ and ‘A:’ as a question that requires an answer. Table 6.1 shows
an example of output generated using the 345M model. Note that these responses are far from perfect,
but they could serve as inspiration for human item writers. The first example (Table 6.1) was
generated without any retraining, using the downloadable version of the 345M GPT-2 model.
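A minimal way to reproduce this kind of question-answering prompt on one's own machine is sketched below; it uses the Hugging Face text-generation pipeline with the public 345M checkpoint ('gpt2-medium') as a stand-in for the TensorFlow sampling scripts used for the tables in this chapter, so sampling settings and exact outputs will differ from the original setup.

```python
from transformers import pipeline

# Load the publicly released 345M-parameter GPT-2 checkpoint.
generator = pipeline("text-generation", model="gpt2-medium")

prompt = "Q: What are the most common side effects of statins? A:"

# Sample several continuations; each answer is a candidate distractor
# that a human item writer would still need to review and edit.
samples = generator(
    prompt,
    max_new_tokens=40,
    num_return_sequences=5,
    do_sample=True,
    top_k=40,
    temperature=0.8,
)
for i, s in enumerate(samples, start=1):
    print(f"SAMPLE {i}: {s['generated_text'][len(prompt):].strip()}")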
It is clear that not all of the listed side effects are actual ones patients may experience. However, some overlap with side effects commonly listed in online resources, and some others may
be ‘plausible enough’ to potentially serve as wrong options in a multiple-choice test. The next
example (Table 6.2) asks about common symptoms of IBS; the selection of responses was not
cherry-picked, and from among the two sets of four answers, most are on topic.
It is important to note that the responses are based on a general language model that has
not been trained specifically to answer questions about medical content. This model is, on
top of that, the second-smallest of the GPT-2 models and contains (by today’s standards)
only 345 million parameters, while other, larger variants contain much more complex model
layers and approximately 1.4 billion parameters (Radford et al., 2018). Again, note that these
responses, which could potentially be used as distractor suggestions, were generated without any
retraining on specifically medical assessment materials.
10. Automatic Item Generation
The tests reported in this section are based on the GPT-2 (345M) pretrained language model
and roughly 800,000 articles from the PubMed open access subset (www.ncbi.nlm.
nih.gov/pmc/tools/openftlist/) used for retraining. The data was encoded using the GPT-2
toolbox (https://github.com/nshepperd/gpt-2) for accessing the vocabulary used for pretraining and for fine-tuning GPT-2 using TensorFlow. The 800,000 articles roughly equate to 8 GB
worth of text from a variety of scientific journals that allow open access to some or all of their
articles. Training took 6 days on a Dell T7610 equipped with 128 GB RAM, two 10-core Intel
Xeon processors, and two NVIDIA 1080 Ti GPUs, using CUDA 10.0, TensorFlow 1.14, and
Python 3.6.8, running Ubuntu 18.04 LTS. It was necessary to use the memory-efficient
gradient storing options (Gruslys et al., 2016; Chen, 2016), as the size of the data structures for the
345M model used in the retraining exceeded the 11 GB memory of the GPUs without them.
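The retraining itself follows the standard causal language-modeling recipe: encode the corpus with the GPT-2 byte-pair vocabulary and minimize the next-token cross entropy over the articles. The chapter's experiments used the nshepperd TensorFlow toolbox cited above; the sketch below shows an assumed, simplified equivalent in PyTorch with the Hugging Face libraries, with the corpus file path, learning rate, and sequence length as placeholders.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")   # 345M model
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)
model.gradient_checkpointing_enable()  # memory-efficient gradient storing
model.train()

# Placeholder corpus file: one plain-text article per line (e.g., PubMed OA).
with open("pubmed_oa.txt", encoding="utf-8") as f:
    articles = [line.strip() for line in f if line.strip()]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step, article in enumerate(articles):
    enc = tokenizer(article, truncation=True, max_length=512,
                    return_tensors="pt").to(device)
    # With labels equal to the inputs, the model returns the next-token
    # cross entropy, i.e., the same loss discussed in Section 5.
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```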
The amount of training data available through open access (OA) papers that can be downloaded from PubMed repositories is quite impressive: The number of OA articles exceeds
800,000, and the compressed, pre-processed database used for retraining in this study
exceeds 8 GB. Moreover, free medical texts are available in abundance, and a 2011 survey
(Singh et al., 2011) lists many resources. Language models for data of this size could not be
processed on customary hardware only a few years ago, while nowadays (with a few
tricks), even the medium-size (345-million-parameter) GPT-2 model can be retrained
on decent gaming GPUs.
Incidentally, during the 6 days of training there is some downtime, which allowed me to
find a recent arXiv submission about automated patent application generation using
GPT-2 pretrained with (public, granted) patent applications available online (Lee, 2019). Other
applications include the syntax completion software TabNine described in the introduction as
well as experiments aimed at automatic generation of poems (Gwern.net, 2019). The authors of
the GPT-2 patent retraining study used Google Colab, a free online deep learning platform that
allows users access to GPUs for NN training for up to 12 hours. This is insufficient for the 8 GB
of PubMed data to be fully retrained on GPT-2 medium, so the author of this chapter resorted
to upgrading and using a dual-GPU workstation.
Table 6.3 shows exemplary results after 2 days of retraining with the 800,000-article PubMed
database. Other publicly available medical text and article databases are listed in Singh et al.
(2011). While the results are encouraging, they certainly cannot be used as is, when produced
by the NN. However, with some editing and human expert input, this raw output could serve as inspiration for authoring clinical vignettes. Results should come closer to human-authored item stems
once a transformer is trained on a large number of real medical licensure and certification item stems and distractors, and as larger pretrained transformer models are published. It
should be noted that these early examples are certainly not texts that would pass as real items.
Table 6.3 Example 1 (After Only 2 Days of Training on Customary Easily Obtainable Hardware) of Generated Text That
Could Potentially Be Used as the Basis for Item Stem Authoring
Model Prompt >>> A 52-year-old man complaining about chest pain and headache is coming to the emergency
room.
PubMed (2 days) retrained, GPT-2 based model-generated texts:
SAMPLE 1: His blood pressure is 143/99 (normal). He also has hypertension and urticaria. It is a potentially life-threatening situation. The first piece of information from the patient is that he has on occasion been suffering from
chronic abdominal pain, abdominal pain during pregnancy, chronic abdominal pain during childbirth, urinary tract
symptoms. All these are related to chronic abdominal pain. He is conscious and willing to go to the emergency room
for assessment. However, he is already complaining about abdominal pain when we come to the hospital. My primary
concern is the possibility of heart failure due to chronic pain. I don’t feel that I am in a position to explain this to
the patient, but I want to reassure him that it is a possibility. So, the decision to take him to the emergency room for
evaluation is already made. Some would argue that he would be better off in the hospital, but that does not seem to
be the case. When it comes to dealing with chronic pain, we can do better with a diagnosis of dysarthria, rather than
acute cardiomyopathy. We can give him three days of antibiotics. We can tell him that he will need to have three small
intestinal procedures in the hospital, but that he can go home
SAMPLE 2: He also complained of tingling and aching in his hands, feet, legs, and arms. The patient is not in good
condition. He had a history of a recent allergic reaction to peanuts. He took 4 mg. of polyphenols: red chip, red,
black and white, in combination with 250 mg. of oat milk and 60 mg. of vitamin B12. He immediately experienced
tingling sensation in his hands and feet. In addition, he had a mild fever. He did not take medicine for a long time. He
then started to complain of a very severe pain in his chest and abdomen. His symptoms became worse. He felt cold,
tingling sensation, and headache. He was taken to the hospital, and there a further examination showed that he was
allergic to peanuts. A follow-up examination confirmed that he was allergic to peanuts. He was discharged home the
same day. A change in his allergy history showed that he had not taken medicine for a long time, and he had a mild
fever. He started to take food for a long time. He started to take vitamin B12. He started to take oat milk as well. He
started taking 500 mg. of vitamin B12 in combination with 150 mg. . . .
SAMPLE 3: The patient was admitted to the hospital on Friday night with a minor hip fracture. The patient was
experiencing abdominal pain, nausea, and fever. He did not report any cough, other than the cough of headache and
chest pain. The patient was first treated with antibiotics, then with polyethylene glycol solution (PEG), and finally with
chloramphenicol. At the end of treatment, the patient’s abdominal pain and abdominal pain continued. The patient
told his doctor that he did not feel well, but did not report any cough, cold, or other symptoms. On the following day,
the patient developed a fever of 40.2°C and some abdominal pain. On the following morning, he reported no other
symptoms, but fever of 40.2°C and fever of 39.2°C. The patient’s temperature was recorded as 49.3°C, and his level
of red blood cell count was 120/16. The patient was prescribed phenylalanine (PBA), but did not report any other
symptoms. The patient’s BP was recorded as 118/28, his level of red blood cell count was 126/17. The patient was then
referred to an oncology facility for further treatment.
SAMPLE 4: His symptoms were described as sudden onset of chest pain with a brief onset of bradycardia. The patient
had a history of transient rapid, transient, and rapid diastolic blood pressure increases. Patient in critical condition for
dehydration, 5-day interval between study. Recurrent hypertension and pre-existing coronary heart disease (CHD)
with ipsilateral ventricular tachycardia. Diabetes mellitus. 2 studies (RR=0.75, 95% CI=0.65 to 0.85). Over 30-year
patient without prior cardiac disease. Shaolin acupressure technique applied (AG) and exercise at 1.5m duration.
Patient is resting on a reclining couch, a patient in critical condition for dehydration, 5-day interval between study; he
has chest pain and pain in the left chest with bradycardia. The patient had no previous CHD, no prior hypertension,
no pre-existing CHD, and a 30-year history of no chronic obstructive pulmonary disease. Approximate number of
patients who participated in follow-up and follow . . .
Note: The GPT-2 transformer was retrained on medical open access (PubMed open access subset) articles, not on any secure or publicly available medical licensure or admission items.
However, the quality of the texts is such that it can be assumed that the larger transformer
model, trained on real medical licensure items, would be able to produce source material that
could be used by medical experts as prompts to facilitate authoring expert–generated items.
A second example contains two sentences as a prompt which provides a bit more context for
the ‘attention’ circuits of the transformer network. The network checkpoint (saved parameters
after partial retraining) that was used represents the retrained GPT-2 after 200,000 cycles using
the 800,000 PubMed open access database.
Table 6.4 Example 2: Obtained From a Network That Was Retrained for 3 Days
Model prompt >>> A 32-year-old woman comes to the doctor’s office complaining about severe headache and
blurry vision. She had been suffering from migraine for years.
SAMPLE 1: She was prescribed a drug for tics and headaches, a sulfate and an antihistamine, and the prescription
was confirmed by the doctor, who thought that the symptoms could be caused by an active nerve stimulation. It
was a decision to keep the dosage, which was higher than usual. Subsequent results showed a very severe headache,
which did not respond to the medications, and was now very severe. This was the first case of tetanus-induced brain
hemorrhage (HBIH) in Europe. HBIH is a rare but potentially fatal disease in which the brain hemorrhage is mainly
due to the tetanus toxin.
The researchers reviewed the literature to identify cases of HBIH. All of these cases were reported in the period 2006
to 2011, but a considerable number of cases in the years 2011 to 2012 were only reported in 2005 and 2006. The
authors of the paper reported that four cases of HBIH in Europe have been reported during the years 2006 to 2011, of
which three were reported in the period 2006 to 2012. The second case in this period, which was reported in 2006 to
2011, was a result of the same kind of drug (antihistamine, sulfated humic acid) as the first case . . .
SAMPLE 2: At her hospital, she was admitted for medical examination and he was informed that she had
experienced a migraine and fever in the preceding week. When he examined her eyesight, she revealed that they were
deteriorating. She had been having headaches for a few years and was very concerned about them. She had the fever
and headache. She was feverish, very fatigued and not able to sleep. She was in the critical stage and was not able to
tolerate even the medium intensity headaches. She had been very relieved with this problem. She felt relieved with
the headache, but she was in severe pain in the back of her neck. The doctor who examined her eyesight revealed that
the intensity of her headache was not satisfactory, so the doctors had recommended that she should be admitted for
further medical examination. She had not been allowed to use a pacemaker since she had been admitted for treatment
of a chronic migraine. However, during her admission, she complained that her eyesight was getting worse. She
was also in pain in her back. She was admitted for emergency treatment. She was admitted for treatment of chronic
migraine. The doctors who examined her eyesight gave her an intravenous injection of levodopa and there were no
complications. They also gave her chlordiazepoxide and waited for a . . .
SAMPLE 3: She was wondering if the headgear could help her to relax and feel a bit better. She was in need of the
help to stop the pain. But she had found that her headache was even worse than usual. She was planning to take a
bath during this period and it would make her worry. So, she asked the doctor for an eye examination, to see if there
was any abnormality. The doctor, who is a nurse, explained to the woman that, after a headache, there is an increased
production of nerve gas from the chest and lungs, so it would be best to take a nasal spray with some corticosteroid
treatment. She applied the nasal spray to her nose and then took a step back and let the cold air come into the headgear.
But, the patient was still not satisfied and the doctor was still giving her advice. She then asked for another exam. The
patient was now told that she had to take some corticosteroid treatment. She then was informed that she should take
some steroid nasal spray in addition to the eye examination to take the steroid treatment. She was then told that her
headache was still worse than usual. On examination, the doctor showed the patient an X-ray of the neck . . .
SAMPLE 4: According to her husband, she tried some anti-emetic medication but there was no difference in her
condition. She felt that migraine pain was like a pressure and pain would only come when she was moving around.
She asked the doctors to prescribe anti-emetic medication to treat her headache but, she was not satisfied with the
treatment. So, she tried to make a prescription from the family doctor for anti-emetic medication and the prescription
was filled and passed out the door. When she was with the family doctor, she tried to find out how to use the anti-emetic medication to relieve her headache but, the anti-emetic medication was not effective. So, she finally decided
to try the anti-emetic medication and it became effective. She started to use the anti-emetic medication for about a
month. She finally decided to stop using the anti-emetic medication. She did not know how to take it and she was
very afraid of taking it. But, she tried to use the anti-emetic medication. And, finally, the headaches stopped. She was
happy and was able to do her normal activities. Then, she started to take anti-emetic medication and it was elective.
She began to think that she should take . . .
Note: The network recall can be fine-tuned as well to produce most likely vs. more divergent responses.
The point to be made here is that the existing network architecture can be used for question
answering and, to a limited extent, also for ‘inspiration’ of human test developers, who could
enter ideas as prompts and have the neural network spit out ideas. Current applications that
are similar in kind used the GPT-2 model for retraining based on openly available patent texts,
poems, and source code files. It appears plausible that further fine-tuning with targeted
assessment material should improve the results dramatically – for example, by using all available items in a certain subject domain such as cardiology. It is not claimed that the current
system is fully useful as is, but the quality of text produced by the currently available transformer architecture makes it rather likely that correctly formed item stems can be produced by
deep learning–based language models in the very near future.
11. Discussion and Current State of the Art
After GPT-2 kicked off several endeavors to use transformer-based NLG-focused models for
research and commercial purposes, including startups that carried fine-tuning in their name, the
next generation of NLG models essentially made fine-tuning an expensive and largely unneeded
exercise. GPT-3 (Brown et al., 2020) provided researchers and entrepreneurs with a model that
was both much larger and much more capable, so that it could be used with minimal adjustments to hyperparameters or prompting in order to enable special-purpose applications. Even
comparably small models (by today’s standards) such as GPT-J, provided by the grassroots
initiative of AI researchers called EleutherAI, allow generating questions without additional fine-tuning.
The second sample output even included an example of a potential diagnosis (‘The most
likely diagnosis is a brain tumor’), which appears to be a plausible choice, even though it was not
requested.
GPT-3 is obviously capable of delivering vignettes that are quite close to what one would find
in USMLE items. However, for special applications, GPT-3 still allows fine-tuning, and OpenAI even
offers a fine-tuned model for code completion in languages such as Python and JavaScript –
for example, in the form of OpenAI’s Codex (https://doi.org/10.48550/arXiv.2107.03374).
Tools such as Codex are similarly impressive, as they support writing code, which,
after all, seems to be modeled by very large transformers such as GPT-3 just as well as natural
language.
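For readers who want to try prompting GPT-3 directly (as was done for the examples in Tables 6.5 and 6.6), a minimal sketch using the openai Python client of that period is shown below; the engine name, key handling, and sampling settings are assumptions and will vary with the API version in use.

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = (
    "Write a USMLE vignette and a question about next treatment steps "
    "based on the following sentence:\n"
    "'A 32-year-old woman comes to the doctor's office complaining about "
    "a severe headache and blurry vision.'"
)

# Legacy completions endpoint of the GPT-3 Playground era.
response = openai.Completion.create(
    model="text-davinci-002",   # assumed engine name; adjust as needed
    prompt=prompt,
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].text.strip())
```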
We are at the frontier of AI entering many domains of daily life. While phone makers
contribute to the hype and advertise the next generation of smartphones as running neural
networks, there are industrial domains in which these applications are essential. Among these
are computer vision and assisted driving. Others are recommenders for e-commerce, but also
applications that are trained to detect the use of AI for deep fakes – video material that was
made by a machine, programmed with malicious intent to fool humans. However, many applications also support human creativity in more benign ways, such as GauGAN (NVIDIA, 2019),
a tool that helps illustrators compose landscapes easily with only a few clicks, and more
recently DALL-E and CLIP (https://doi.org/10.48550/arXiv.2204.06125). There are also tools
based on AI that support wellness (Qualcomm, 2019), using the same technologies to analyze
data on health that are used to predict what music one may like based on past purchase and
download behavior.
Table 6.5 Example Output Generated by https://6b.eleuther.ai/ for the Prompt: ‘Statins are used to lower cholesterol.
What are the most common side effects of statins?’
Your doctor will likely tell you that statins can cause side effects. You can expect some side effects from statins, but most
are minor. The most common side effects of statins are:
Headaches
Nausea
Dizziness
Tiredness
Back pain
Urinary tract infections
Muscle aches
Table 6.6 Vignette Written by GPT-3 Based on a Prompt, Including a Question, Which Appears to Be Basically a
Paraphrase
Prompt:
Write a USLME Vignette and a question about next treatment steps based on the following sentence:
‘A 32-year-old woman comes to the doctor’s office complaining about a severe headache and blurry vision.’
GPT-3 Playground Output 1:
Vignette:
A 32-year-old woman comes to the doctor’s office complaining about a severe headache and blurry vision. The
doctor conducts a physical examination and orders a CT scan. The CT scan reveals a mass in the woman’s brain.
The doctor refers the woman to a neurologist for further evaluation and treatment.
Question:
What are the next treatment steps for the woman?
GPT-3 Playground Output 2:
Vignette:
A 32-year-old woman comes to the doctor’s office complaining about a severe headache and blurry vision. She
reports that the headache has been getting progressively worse over the past few weeks and that her vision has
become increasingly blurry. She is concerned that something may be wrong with her eyes. The doctor performs
a thorough examination and orders a CT scan of the woman’s head. The CT scan reveals a mass in the woman’s
brain. The doctor refers the woman to a neurologist for further evaluation and treatment.
Question:
What is the most likely diagnosis based on the information provided?
12. Conclusion
The prospects of this technology become really exciting when looking at how these pretrained
models could be deployed. There are efforts underway to develop toolkits that utilize language
models on iOS devices, currently GPT-2 and BERT, another transformer-based language model developed
by Google (Devlin, 2018). This would not train these networks on phones,
but would allow utilization of a trained network to generate new text based on a sentence
that describes a case or a context entered by a user. For automated item generation, apps
could be developed that use language generation on smartphones to support item
developers in writing new content on their mobile devices (https://github.com/huggingface/
swift-coreml-transformers).
Once pretrained models for medical specialties are available, it would be straightforward to develop a tool in which medical experts can enter a draft vignette or even a few keywords that the app wraps into a case description draft, which the human expert can then review and submit for further editing and finalization by item writers at the testing agency that assembles, administers, and scores the certification tests. At the testing agency, the just-developed case vignette could be finalized using yet another set of machine learning tools to generate correct and incorrect response options, which are either used in multiple-choice formats or for training an automated scoring system for short constructed responses.
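The keyword-wrapping step described above is essentially prompt construction followed by a call to whatever pretrained model is available. A minimal sketch of that step is given below, using the small publicly available gpt2 checkpoint from the Hugging Face hub as a stand-in for a specialty-specific model; the keyword list, prompt template, and function name are hypothetical.

```python
from transformers import pipeline

# A small, locally runnable checkpoint stands in for a specialty-specific model.
generator = pipeline("text-generation", model="gpt2")

def draft_vignette(keywords):
    """Wrap expert-entered keywords into a prompt and generate a draft case description."""
    prompt = (
        "Clinical case vignette. Key findings: "
        + ", ".join(keywords)
        + ". Case description:"
    )
    out = generator(prompt, max_length=150, do_sample=True, temperature=0.8)
    return out[0]["generated_text"]

# Hypothetical expert input; the resulting draft would go to item writers for editing.
print(draft_vignette(["32-year-old woman", "severe headache", "blurry vision"]))
```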
As it turns out, apps using transformers to generate text have not yet flooded the market, even three years after the first draft of this chapter. However, OpenAI reported as early as May 2021 that over 300 applications were using GPT-3.
Regarding automated generation of questions (or items) using transformers, the TIMSS &
PIRLS International Study Center is looking into utilizing writing assistants to generate parallel
versions of item stems, passages, and questions in order to support human experts in their item writing activities. Drori et al. (2022) reported that a pipeline built on transformer models was able to generate questions as well as answers at the level of MIT mathematics course material using targeted pretraining and fine-tuning. The output consists not only of text, but also of graphs, diagrams, tables, and other objects commonly found in math instruction and assessment. This leads me to conjecture that within a few years we will have machine-generated items not only for simple open-ended and multiple-choice questions; we will also be able to use AI workflows to generate complex, engaging items that mix graphical stimuli and responses (which can then be automatically scored; e.g., von Davier et al., 2022) and that can be produced from inputs specifying target grade, topic, the cognitive processes needed to solve the item, and the type of response and stimulus material.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jozefowicz, R., Jia, Y., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., . . . Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. CoRR, abs/1603.04467. www.tensorflow.org
Bejar, I. I. (2002). Generative testing: From conception to implementation. In Item generation for test development (pp. 199–217). Lawrence Erlbaum Associates.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., & Askell, A. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS). https://doi.org/10.48550/arXiv.2005.14165
Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016). Training deep nets with sublinear memory cost. CoRR. http://arxiv.
org/abs/1604.06174
Chevitarese, D. S., Szwarcman, D., & Vellasco, M. (2012). Speeding up the training of neural networks with CUDA technology. In L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, & J. M. Zurada (Eds.), Artificial intelligence and soft computing (pp. 30–38). Springer Berlin Heidelberg. ISBN 978-3-642-29347-4
Choi, E., Bahadori, M. T., & Sun, J. (2015). Doctor AI: Predicting clinical events via recurrent neural networks. CoRR.
http://arxiv.org/abs/1511.05942
Choi, E., Schuetz, A., Stewart, W. F., & Sun, J. (2016). Medical concept representation learning from electronic health
records and its application on heart failure prediction. CoRR. http://arxiv.org/abs/1602.03686
Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-xl: Attentive language
models beyond a fixed-length context. CoRR. http://arxiv.org/abs/1901.02860
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for
language understanding. CoRR. http://arxiv.org/abs/1810.04805
Drori, I., Zhang, S., Shuttleworth, R., Tang, L., Lu, A., Ke, E., Liu, K., Chen, L., Tran, S., Cheng, N., Wang, R., Singh, N.,
Patti, T. L., Lynch, J., Shporer, A., Verma, N., Wu, E., & Strang, G. (2022). A neural network solves, explains, and
generates university math problems by program synthesis and few-shot learning at human level. https://arxiv.org/
abs/2112.15594
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64(4),
407–433.
Gilula, Z., & Haberman, S. J. (1994). Conditional log-linear models for analyzing categorical panel data. Journal of the American Statistical Association, 89(426), 645–656. www.jstor.org/stable/2290867
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets
of several five-factor models. In F. Ostendorf (Ed.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg
University Press.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The
international personality item pool and the future of public-domain personality measures. Journal of Research in
Personality, 40, 84–96.
Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., & Graves, A. (2016). Memory-efficient backpropagation through
time. CoRR. http://arxiv.org/abs/1606.03401
Gwern.net. (2019). Gpt-2 neural network poetry. www.gwern.net/GPT-2
Hanin, B. (2017). Universal function approximation by deep neural nets with bounded width and relu activations.
CoRR. http://arxiv.org/abs/1708.02691
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8), 2554–2558. https://doi.
org/10.1073/pnas.79.8.2554
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257.
https://doi.org/10.1016/0893-6080(91)90009-T
Lee, H. (2019). Patent claim generation by fine-tuning OpenAI GPT-2. Department of Computer Science and Information Engineering, National Taiwan University.
NVIDIA. (2019, August 13). GauGAN tool for artists. www.mingyuliu.com/gaugan; MegatronLM press release. https://github.com/NVIDIA/Megatron-LM; https://nvidianews.nvidia.com/news/nvidia-achieves-breakthroughs-in-language-understanding-to-enable-real-time-conversational-ai
Ozair, S. (2016). Char-RNN for TensorFlow. https://github.com/sherjilozair/char-rnn-tensorflow
Qualcomm. (2019). AI supporting wellness. www.qualcomm.com/news/onq/2019/07/11/ai-your-supportive-wellness-companion
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2018). Language models are unsupervised multitask
learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain.
Psychological Review, 65(6), 386–408.
Sarazen, T., & Peng, M. (2019). The staggering cost of training SOTA AI models. https://syncedreview.com/2019/06/27/
the-staggering-cost-of-training-sota-ai-models
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. https://
doi.org/10.1002/j.1538-7305.1948.tb01338.x
Singh, A., Singh, M., Singh, A. K., Singh, D., Singh, P., & Sharma, A. (2011). Free full text articles: Where to search for
them? International Journal of Trichology, 3(2), 75–79. https://doi.org/10.4103/0974-7753.90803
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. CoRR. http://arxiv.org/abs/1706.03762
Vincent, J. (2019). This AI-powered autocompletion software is Gmail’s smart compose for coders – Upgrading coding
autocompleter tools with deep learning. www.theverge.com/2019/7/24/20708542/coding-autocompleter-deeptabnine-ai-deep-learning-smart-compose
von Davier, M. (2016). High-performance psychometrics: The parallel-E parallel-M algorithm for generalized latent
variable models. ETS Research Report Series, 1–11. https://doi.org/10.1002/ets2.12120
von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83(4), 847–857.
https://doi.org/10.1007/s11336-018-9608-y
von Davier, M., Tyack, L., & Khorramdel, L. (2022). Scoring graphical responses in TIMSS 2019 using artificial neural
networks. Educational and Psychological Measurement. https://doi.org/10.1177/00131644221098021
Walonoski, J., Kramer, M., Nichols, J., Quina, A., Moesel, C., Hall, D., Duffett, C., Dube, K., Gallagher, T., & McLachlan,
S. (2017). Synthea: An approach, method, and software mechanism for generating synthetic patients and the
synthetic electronic health care record. Journal of the American Medical Informatics Association, 25(3), 230–238.
ISSN 1527-974X. https://doi.org/10.1093/jamia/ocx079
Wiesner, J. B. (1961). MIT: The thinking machine. www.youtube.com/watch
Xiao, C., Choi, E., & Sun, J. (2018). Opportunities and challenges in developing deep learning models using electronic health records data: A systematic review. Journal of the American Medical Informatics Association, 25(10), 1419–1428. ISSN 1527-974X. https://doi.org/10.1093/jamia/ocy068
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y. (2019). Defending against neural fake
news. CoRR. http://arxiv.org/abs/1905.12616