
Training Optimus Prime, M.D

2023, Routledge eBooks

6 Training Optimus Prime, M.D. A Case Study of Automated Item Generation Using Artificial Intelligence – From Fine-Tuned GPT2 to GPT3 and Beyond Matthias von Davier 1. Introduction The aim of this chapter is to provide evidence on the state of automated item generation (AIG) using deep neural networks (DNNs). Building on earlier work that tackled this issue using character-based recurrent neural networks (von Davier, 2018), the current contribution describes an experiment exploring AIG using very large transformer-based language models (Vaswani et al., 2017; Brown et al., 2020; BLOOM: https://huggingface.co/bigscience/bloom). The chapter provides an overview of a case study that utilizes the latest generation of language models for text generation. In terms of significant stepping-stones, the description is based on the following developments:

a. GPT-2, OpenAI’s model, was described in Radford et al. (2018). This chapter explains, among other things, how GPT-2 was retrained using millions of PubMed open access articles for the purpose of generating clinical vignettes. GPT-2 was superseded (in size) by MegatronLM (NVIDIA, 2019).
b. The next step, and one that made not only a huge splash in the media but also resulted in a large number of startups using NLG, was the release of the GPT-3 API, which allowed access to the currently most used transformer model, which clocks in at 175 billion parameters (Brown et al., 2020).
c. GPT-J-6B, a 6-billion-parameter model, and the more recent (February 2022) GPT-neoX with 20 billion parameters, were released. Some examples generated for this chapter are based on GPT-J. Both models are provided by www.eleuther.ai/, a self-described grassroots campaign of a ‘decentralized collective of volunteer researchers, engineers, and developers focused on AI alignment, scaling, and open-source AI research, founded in July of 2020’.
d. BLOOM (July 2022), available through the Hugging Face portal (https://huggingface.co/bigscience/bloom) and based on the BigScience collaborative open science initiative, is a language model trained on 46 natural languages. It aims at free worldwide access, while OpenAI’s GPT models were (despite the name of the organization) proprietary and licensed to individuals and organizations.

DOI: 10.4324/9781003278658-8 This chapter has been made available under a CC-BY-NC-ND 4.0 license.

These most recent developments, BLOOM as well as GPT-neoX, give reason to hope that access to and work with these language models will further improve and that researchers who were unable to participate in this work due to language, economic, or political barriers will be able to engage in applications of and research on large language models again. Moreover, BLOOM is based on what BigScience calls a Responsible AI Licensing agreement (RAIL: https://huggingface.co/spaces/bigscience/license), which includes a section that restricts the use of this model for purposes that can lead to discrimination, physical or emotional harm, or the dissemination of misinformation. An increasingly important aspect of the applications of AI is the responsible and accountable use of these increasingly potent technologies. Some of the recent neural network-based language models include more than 175 billion parameters, which is incomprehensible compared to the type of neural networks that were used only a few years back. In the winter semester of 1999–2000, I taught classes about artificial neural networks (NNs) – for example, Perceptrons (Rosenblatt, 1958) or Hopfield networks (Hopfield, 1982). Back then, artificial intelligence (AI) had already entered what was referred to as the ‘AI winter’, as most network sizes were limited to rather small architectures unless supercomputers were employed.
On smaller machines that were available to most researchers, only rather limited versions of these NNs could be trained and used, so successful applications were rare, even though one of the key contributions that enabled deep learning and a renaissance of NN-based AI, the long-short-term-memory (LSTM) design (Hochreiter & Schmidhuber, 1997) was made in those years. In 2017, I started looking into neural networks again because I wanted to learn how to program graphical processing units (GPUs) for high-performance computing (HPC) as needed in estimating complex psychometric models (von Davier, 2016). After experimenting with high performance computing for analyzing PISA data (which cut down estimation of IRT models from several hours to 2–3 minutes using the parallel-E parallel-M algorithm developed in 2016), this finally led me to write a paper on using deep neural networks for automated item generation (AIG; von Davier, 2018). AIG is a field that has seen many different attempts, but most were only partially successful, involved a lot of human preparations, and ended up more or less being fill-in-the-blanks approaches such as we see in simple form as MadLibs books for learners. While I was able to generate something that resembled human written personality items, using a public database that contains some 3,000, several of the (cherry-picked) generated items sounded and functioned a lot like those found in personality inventories (Goldberg, 1999; Goldberg et al., 2006). I was somewhat skeptical whether one would be able to properly train neural networks for this task, given that it would require a very large number of items, and I assumed that each network for that purpose would need to be solely trained on items of the form it is supposed to generate. Part of my concern was that the items that were generated had to be hand-picked, as many of the generated character or word sequences ended up not being properly formed statements. 
However, those that were selected for an empirical comparison with human-coded items were found to show the same dimensionality (von Davier, 2018) and hence to be fully useful as replacements of human-authored items. Nevertheless, some doubt remained due to the needed handpicking and the limited supply of training material. After all, AI and neural networks have a long history (e.g., Rosenblatt, 1958; Wiesner, 1961) and have been hyped to be the next big thing that may soon replace humans and take our jobs. As mentioned, items generated using RNNs (von Davier, 2018), then cherry-picked, passed empirical evaluations and hence functioned a lot like the human-written items in an online data collection. However, many of the generated items were either not properly formed statements that are typical for this domain, or, if the network was trained too long on too little data, they were almost exact copies of what was entered as training material. Therefore, I concluded one would need a lot more data, or an unforeseen qualitative jump in deep learning that I expected to be years away. This was wrong. It turns out that time indeed flies, and the field of deep learning did not rest. While I stated in the paper published in 2018 that operational use could be years away, I am no longer so sure that we have to wait that long. It may well be that we will soon see automated item generation based on deep learning systems in tools that support item writers in developing test questions for high-stakes exams, and that deep neural networks will be used to generate questions or distractors for multiple-choice questions used in test preparation and practice exams much sooner. The reason why I believe this has to do with a graduate student who developed a software tool for programmers based on a product that was released by OpenAI (Radford, 2018).
The software that supposedly makes programmer lives so much better is called TabNine (e.g., Vincent, 2019) and it provides context-sensitive (intelligent?) auto-completion based on indexed source code files. The author of the software estimates that TabNine will save programmers at least 1 second per minute by suggesting how lines of program code are completed, or what the most likely next line of code may be, based on the code that the programmer provides and the software uses to improve a predictive model. The title of the current chapter is a reference to two relevant lines of inquiry. There was an article with the title ‘Doctor A.I.’ (Choi et al., 2015), which described a deep learning approach using generative adversarial networks (GANs) to generate electronic health records (EHRs) that can pass as plausible EHRs, and the other is the recently ignited race around language models that use a specific neural network structure called transformer, which was an obvious trigger for many references to the sci-fi toys and movies. The remainder of this chapter is structured as follows: The next section introduces language models that are based on approaches that can be used to generate the probability of a next word or language token using information about a previously observed sequence of words. The following section outlines potential areas of application and shows select examples of how NN-based language models could be utilized in medical licensure and other assessment domains for AIG. 2. Background and Significance AIG has been an area of research in the field of employment and educational testing for quite some time (Bejar, 2002). Employing human experts to develop items that can be used in medical licensing and certification is particularly cost-intensive, as expert knowledge is needed to author case vignettes and to develop plausible response options when writing multiple-choice test questions. 
Any technology that can reduce these development costs by applying machine learning or AI would be a welcome addition to the toolbox of test developers. AIG often either focused on items that are language free, such as intelligence tests with matrices of graphical symbols that need to be completed by test-takers (Embretson, 1999), or employed methods that bear strong similarities to fill-in-the-blanks texts such as the ones found in MadLibs. The current work builds on and extends a study presented by von Davier (2018), in which an RNN was trained on an open access database of 3,000 items available through the IPIP database (Goldberg, 1999). While this previous study concluded that with existing recurrent network-based models, and with limited item banks, a practical use of AI for AIG would be years away, the development of language models took a quantum leap when researchers did away with recurrence and focused on network architectures built around self-attention (Vaswani et al., 2017). This allowed designing a simple network structure that was easily trained, allowed parallelism in training, and could be pretrained on general corpora of texts and subsequently trained for specific purposes. Retraining (and re-implementation) of the transformer has led to a variety of applications, including the generation of poems, patent texts, and completion of code in support of software developers. These applications will be referenced in appropriate sections over the remainder of this chapter. This chapter also describes a similar experiment with the goal to provide a tool for developing medical education test items using deep learning–based language models. 3. Materials and Methods The basis of these predictive approaches are sequential models that provide the probability of the next word (or other language token such as full stop, newline, etc.) given a number of previous words.
These models are not new; I recall that my first encounter with this type of model was an article in Scientific American before 1985, when I was still a high school student and part-time programmer working for a small educational gaming company located in northern Germany (yes, game-based learning existed back then). This 1980 version actually goes back to the seminal paper by Shannon (1948) and constitutes a primitive language model. This simple model of course did not have the many layers and the complex network architecture of deep learning applications that are nowadays used for machine translations, picture annotations, or automated item generation (von Davier, 2018); rather, it was based on a single layer that connected an input word (previous encounter) to an output word (next encounter). Technically, the basis of this model was a transition matrix, with input (previous) and output (next) words coded as binary vectors, and the model basically implemented the Markov assumption for a model for natural language. 3.1 Markovian Language Models The model just mentioned is a simple language model that can be viewed as direct translation of the Markov assumption for modeling a sequence of words $w_t \in W_V$ with index $t = 1, \dots, T$. Here, $W_V$ is a finite set of words, the vocabulary of a language, and $S = |W_V| < \infty$ denotes the size of the vocabulary. Let $w : \{1, \dots, S\} \to W_V$ be an index, i.e., a bijective function that maps integers to words. That is, we can obtain an integer that represents a word $w_t$ by applying $i_t = w^{-1}(w_t)$, and the associated word can be retrieved from any integer $i_t \in \{1, \dots, S\}$ through $w(i_t)$. In this most simple case of a language model, we assume that

$$P(w_{t+1} \mid w_t, \dots, w_1) \equiv P(i_{t+1} \mid i_t, \dots, i_1) = P(i_{t+1} \mid i_t) \equiv P(w_{t+1} \mid w_t)$$

for any $t \in \{1, \dots, T-1\}$, namely that the probability of observing a next word $w_{t+1}$ at position $t+1$ of the sequence depends only on the last observed word, $w_t$, and nothing else. The whole sequence preceding the last observed word is ignored in this model. Then, if we assume homogeneity of the transitions, i.e.,

$$P\big(w^{-1}(w_{t+1}) \mid w^{-1}(w_t)\big) = P\big(w^{-1}(w_{u+1}) \mid w^{-1}(w_u)\big)$$

whenever $w_t = w_u$ and $w_{t+1} = w_{u+1}$, we can define

$$M_{t|t-1} = \big(P(j \mid i)\big)_{i,j = 1, \dots, S},$$

with diagonal entries $P(1 \mid 1), \dots, P(i \mid i), \dots, P(S \mid S)$, which is a transition matrix that provides a conditional probability distribution for any $i = w^{-1}(w)$. If there are no constraints, this transition matrix has $S[S-1] = S^2 - S$ parameters, i.e., roughly the square of the vocabulary size. The parameters can be obtained by estimating simple sample statistics, or by some more sophisticated methods (e.g., Shannon, 1948). A more complex language model would consider more than one previous word. This can be implemented as follows. In order to take the previous $L$ words into account, define

$$n_t = \big(w_t, w_{t-1}, \dots, w_{t-(L-1)}\big) \in \bigotimes_{i=1}^{L} W_V,$$

which is an n-gram of length $L$. Then assume for $t > L$ that

$$P(w_{t+1} \mid w_t, \dots, w_1) = P(w_{t+1} \mid n_t) = P\big(w_{t+1} \mid w_t, w_{t-1}, \dots, w_{t-(L-1)}\big).$$

While this is a perfectly sound definition, it has practical implications that may make applications impossible as soon as the vocabulary contains more than a handful of words and the length of the sequence, $L$, grows larger than, say, 3. The issue is that the mini-sequence $n_t$ is an element of $\bigotimes_{i=1}^{L} W_V$, a much larger set, with $S^L$ elements. For a vocabulary of only 100 words and three-word sequences, there are already $100^3 = 1{,}000{,}000$ different elements. For a transition matrix that contains all conditional probabilities for the next words, given the previous three, we would need to train, estimate, or otherwise obtain $(100 - 1) \times 1{,}000{,}000 = (S - 1) \times S^L$ probabilities. Therefore, most traditional approaches to construct such a large transition matrix have not been pursued, as this would require very large amounts of data.
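As an illustration of the first-order Markov model described above, the following minimal sketch estimates the transition probabilities from relative frequencies of adjacent word pairs and evaluates the parameter count $(S - 1) \times S^L$. The function names (`fit_bigram`, `sample_sequence`, `n_free_parameters`) and the toy corpus are hypothetical and chosen only for this example; they are not taken from the chapter's experiments.

```python
from collections import Counter, defaultdict
import random


def fit_bigram(tokens):
    """Estimate P(next | previous) by relative frequency of adjacent pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for prev, ctr in counts.items()
    }


def sample_sequence(model, start, length, seed=0):
    """Generate up to `length` words; each depends only on the word before it."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        dist = model.get(out[-1])
        if not dist:  # word was never observed as a predecessor: stop early
            break
        words = list(dist)
        out.append(rng.choices(words, weights=[dist[w] for w in words])[0])
    return out


def n_free_parameters(vocab_size, context_length=1):
    """(S - 1) * S**L free probabilities for a context of L previous words."""
    return (vocab_size - 1) * vocab_size ** context_length


# Toy corpus (hypothetical), only to make the sketch runnable.
corpus = "the patient reports pain and the patient reports nausea".split()
model = fit_bigram(corpus)
print(model["reports"])           # two continuations, each with probability 0.5
print(n_free_parameters(100, 3))  # 99000000, as in the 100-word example above
```

With a context of one word the count reduces to $S^2 - S$; the growth with $L$ is what makes the unrestricted n-gram table impractical.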
3.2 Char- and Word-RNNs One way of circumventing the need to use classical statistical estimation methods, and to be able to ignore some of the more rigorous requirements of these methods, is using NNs for the purpose of language modeling. NNs have been shown to be universal function approximators (e.g., Hornik, 1991; Hanin, 2017). This means that an NN with proper design can be used to plug in an estimate of a function that is otherwise hard to calculate, or hard to specify based on more traditional approximation or estimation methods. This advantage is paid for by having only vague knowledge about the actual form of the function that is being approximated, as NNs operate as black boxes and do not easily reveal how the approximation is achieved. In order to further reduce demands, one could model the sequence of characters rather than words, as natural languages often contain several thousand words, while alphabetic languages can be expressed using a much smaller character set. Therefore, an alternative to word-based language models using neural networks can be implemented as a character-based language model. A few years ago, Google released TensorFlow (Abadi et al., 2015), a powerful software toolbox to design, train, and sample from neural networks. This triggered implementation of a variety of deep learning approaches using this new tool, among these a character-based deep recurrent neural network (Char-RNN, e.g., Ozair, 2016), and, more recently, other architectures that will be described in this chapter. Obviously, there are many more tools for deep learning, and the models released for further analyses and fine-tuning, as done in the current study, are typically available in more than one framework. Wikipedia provides a list of neural network tools, specifically, deep learning–oriented tools, at https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software. 4. 
Attention Is All You Need Recent language models introduced the concept of attention, as a structure that was part of the neural network architecture aimed at keeping certain concepts more salient. This was initially implemented in addition to the recurrent structures of deep learning models designed for sequence-to-sequence and language modeling. However, Vaswani et al. (2017) proposed an alternative, much simpler structure in which the context and the attention mechanism would replace the sequential structures of RNNs. The title of Vaswani et al.’s article is mirrored in the subsection title, and this article led to multiple language models published in short succession, one of which was recently released by OpenAI and forms the basis of the retrained/fine-tuned model presented in this chapter. Vaswani et al. (2017) describe the new network structure as consisting only of decoder-encoder layers with multi-headed attention, which provides a distribution of most likely language tokens, given a context of a certain length (say, 1,024 words and information about their position). Psychoanalysts would probably say that transformers simulate some form of free association, noting that this is even called self-attention in the literature. Interestingly, the attention architecture used in the transformer-based models is simpler than what was previously deemed necessary in language models based on recurrent neural networks such as the one used in Ozair (2016) and Brown et al. (2020). This simpler structure allows much faster training, as the transformer architecture allows parallel processing by means of simultaneously using word and position encoding rather than encoding the text sequentially.
The drawback is that (currently) only limited lengths of text can be encoded, as the parallel processing makes it necessary for the sequence to be encoded (input) as well as the output to be present as a whole (for example, sentence-by-sentence), rather than word-by-word. 5. Reincarnations of the Transformers: GPT-2, Transformer-XL, Grover, MegatronLM The GPT-2 model was trained by a team of researchers at OpenAI (Radford et al., 2018) using four different levels of complexity of the transformer architecture. In an unprecedented move, OpenAI released only the two smallest models, which comprise network weights amounting to 117 million and 345 million parameters, respectively. The larger models are not published due to concerns of malicious use cases and contain up to 1.4 billion (!) parameters. However, this number was recently topped by NVIDIA, which published the MegatronLM model that includes more than 8 billion parameters and made the code available on GitHub (https://github.com/NVIDIA/Megatron-LM). The 1.4 billion parameter OpenAI model, however, remains unpublished, as it says on the OpenAI website: ‘Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.’ All GPT-2 models were trained on what OpenAI called WebText, which is a 40 GB database of text scraped from the World Wide Web, excluding Wikipedia, as OpenAI researchers assumed that Wikipedia may be used by secondary analysts to retrain/fine-tune for specific topics. As the full model is not available, this means that the actual performance of the GPT-2 Transformer model cannot be verified independently, and other researchers can only use and modify (retrain) the smaller models. The examples presented in this chapter are based on experiments with the model that contains 345 million parameters.
Several other transformer-based language models have been under active development and are being made available to researchers for fine-tuning and adaptation to different applications. Among these are the Transformer-XL (Dai et al., 2019), Grover (Zellers et al., 2019), and, most recently, MegatronLM (NVIDIA, 2019). While the NVIDIA model used a corpus called WebText that contains 40 GB of data and was modeled after the corpus used by OpenAI, Grover was trained on 46 GB of real news and can be used to either generate, or detect, fake news. This ability to both detect and generate is based on the fact that all of these approaches can be viewed as probabilistic models that predict a sequence of new words (fake news, a translation, next poem lines, next syntax line in a software program) based on the previous sentence(s) or lines of code. More formally, we can calculate the loss function

$$H(T, \breve{P}) = -\frac{1}{T} \sum_{t=1}^{T} \log \breve{P}(w_t \mid n_{t-1}),$$

where $\breve{P}(w_t \mid n_{t-1})$ is the estimated distribution of word $w_t$ given context (history) $n_{t-1}$. This is an estimate of the cross entropy, or logarithmic entropy (Shannon, 1948) of the observed sequence $w_0, \dots, w_t$ given some initial context $n_0$. This quantity can be used to evaluate generated sequences relative to the distribution of the loss based on true (human-generated) sequences to help distinguish them. The cross entropy is a measure of how well predicted (in terms of expected log-likelihood, e.g., Gilula & Haberman, 1994) an observed sequence is if a certain model $\breve{P}$ is assumed to hold. This loss function is also used during training or fine-tuning in order to evaluate how well the network predicts new batches of data that are submitted to the training algorithm.
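The cross-entropy loss above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the training code used for the chapter: `cond_prob` is a hypothetical callable standing in for any language model's estimate of $P(w_t \mid n_{t-1})$, and the context is simply the last `context_length` tokens.

```python
import math


def cross_entropy(tokens, cond_prob, context_length=1):
    """Average negative log-probability per predicted token.

    cond_prob(word, context) is a hypothetical callable that returns the
    model's estimate of P(word | context); context is a tuple holding up to
    `context_length` preceding tokens. The first token serves as the initial
    context and is not itself scored.
    """
    if len(tokens) < 2:
        raise ValueError("need an initial context and at least one token to score")
    total = 0.0
    for t in range(1, len(tokens)):
        context = tuple(tokens[max(0, t - context_length):t])
        p = cond_prob(tokens[t], context)
        if p <= 0.0:  # the model rules this token out entirely
            return math.inf
        total -= math.log(p)
    return total / (len(tokens) - 1)


# Toy check: a model that spreads probability evenly over four words
# yields a loss of log(4) regardless of the sequence.
uniform = lambda word, context: 0.25
print(cross_entropy(["a", "b", "c"], uniform), math.log(4))
```

Lower values indicate that the sequence is better predicted by the assumed model, which is the sense in which the loss can help to compare generated and human-written text.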
It is worth mentioning that while all of these are variations on a theme, the transformer architecture for language modeling has shown great potential in improving over previous designs in terms of performance on a number of tasks (Devlin et al., 2018). In terms of the use for generating test questions, Grover (Zellers et al., 2019) may prove useful in future applications, as it was designed to produce and detect fake news by using 46 GB worth of data based on actual news scraped from the internet. Retraining Grover with targeted assessment materials around a content domain is one of the future directions to take for applied research into automated item generation using NN-based language models. 6. Method and Generating Samples The applications of deep learning and recurrent neural networks as well as convolutional networks range from computer vision and picture annotation to summarizing, text generation, question answering, and generating new instances of trained material. In some sense, RNNs can be viewed as the imputation model of deep learning. One example of medical applications is medGAN (Choi et al., 2016), a generative adversarial network (GAN) that can be trained on a public database of EHRs and then used to generate new, synthetic health records. However, medGAN can also be considered an ‘old style’ approach, just as the approach I used is for generating personality items (von Davier, 2018), as medGAN was not based on a pretrained network that already includes a large body of materials in order to give it general capabilities that would be fine-tuned later. Language models as represented by GPT-2 are pretrained based on large amounts of material that is available online. GPT-2 was trained on 40 GB of text collected from the internet but excluding Wikipedia, as it was considered that some researchers may want to use this resource to retrain the base GPT-2 model. 
These types of language models are considered multi-task learners by their creators, i.e., they claim these models are systems that can be trained to perform a number of different language-related tasks such as summarization, question answering, and translation (e.g., Radford, 2018). This means that a trained model can be used as the basis for further targeted improvement, and that the rudimentary capabilities already trained into the model can be improved by presenting further task-specific material. 7. AI-Based AIG Trained on Workstations With Gaming GPUs While this should not distract from the aim of the chapter, it is important to know that some considerations have to be made with respect to how and where calculations will be conducted. Software tools used for deep learning are free (Abadi et al., 2015), and preconfigured servers and cloud services exist that facilitate the use of these tools. At the same time, significant costs are involved, and in particular researchers who develop new models and approaches may need several times more time and resources compared to standard applications that are used to analyze data. The dilemma is that while most tools for training deep learning systems are made freely available, these tools are worthless without powerful computers. And pointing to the cloud is not helpful, as the cloud is ‘just someone else’s computer’ (as memes and geek merchandise prove): High-performance hardware and algorithms that employ parallelism are needed to train these kinds of networks, either in the form of hardware on-site, in a data center, or rented through the cloud. The training of RNNs as well as transformer-based language models takes many hours of GPU time, which comes at significant costs if the cloud is used. For recent language models of the type of GPT-2 large (1.4 billion parameters), or Grover-Mega, or XLNet, the estimated cost was around $30K–$245K (XLNet) and $25K (Grover-Mega).
More details can be found at Sarazen and Peng (2019) as well as in online forums discussing the training and retraining of these models. Obviously, cloud computing services come at a cost, and while new preconfigured systems pop up daily and prices will decrease due to reduced hardware cost and competition, any more involved project that requires training specialized systems, or retraining existing large models, will incur significant costs as well. The model used in the current paper was pretrained on several TPUs (specialized Google hardware for tensor computations) for over a week and retraining as well as fine-tuning will take weeks of GPU time in order to produce a system that is useful for a specific purpose. Therefore, building or purchasing a deep learning computer is one of the options that should be carefully considered as well as the use of cloud computing or on-demand GPU time such as Vast.AI. Nowadays, even modest hardware such as gaming desktops can be utilized, as most of these contain powerful GPUs for graphical processing, which can be turned into thousands of processing units through toolkits such as CUDA provided by the makers of these graphics cards (e.g., Chevitarese et al., 2012).

Figure 6.1 Server parts from eBay used to provide the official PISA 2015 data analysis, and now upgraded and re-purposed for automated item generation. All you need is processor cores, RAM, GPUs, and an eBay auction sniper. While cloud computing is an option, the experiments reported here are time-consuming and cloud computing is currently available at an on-demand rate of $0.80/hour ($0.45 pre-ordered) per GPU. Retraining took 6 days on two GTX 1080Ti GPUs obtained and installed in a 2013 T7610 Dell dual Xeon processor workstation.
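The figures quoted in the caption of Figure 6.1 allow a rough back-of-the-envelope estimate of what the same retraining run would cost in the cloud. The sketch below assumes, purely for illustration, that two rented GPUs would need the same 6 days as the two GTX 1080Ti cards; actual cloud hardware may be faster or slower, and the function name `cloud_cost` is hypothetical.

```python
def cloud_cost(days, gpus, rate_per_gpu_hour):
    """Rental cost in dollars for `gpus` GPUs running around the clock."""
    return days * 24 * gpus * rate_per_gpu_hour


# 6 days on two GPUs, at the two hourly rates quoted in the caption.
on_demand = cloud_cost(days=6, gpus=2, rate_per_gpu_hour=0.80)
pre_ordered = cloud_cost(days=6, gpus=2, rate_per_gpu_hour=0.45)
print(f"on-demand: ${on_demand:.2f}, pre-ordered: ${pre_ordered:.2f}")
```

Under these assumptions a single retraining run amounts to 288 GPU-hours, so repeated experiments add up quickly relative to the price of used hardware.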
The hardware needed for training large NNs can be found at specialized vendors such as Lambda Labs, who often also provide turnkey solutions such as operating system images that include all the common machine learning toolkits such as Keras, TensorFlow, PyTorch, and others. An alternative is to DIY and to use the many web resources that describe which workstations can be obtained cheaply and how many of the essential GPUs can be housed, with or without modifications. In addition, there are free web resources – for example, Google Colab, which is essentially a Jupyter Notebook that anyone with a Google account can use for deep learning and machine learning experiments (free for short-term use) – and time-share on-demand GPU services such as Vast.AI can be used for a fee. Without further digressions, we now turn to how these systems, either purchased fully configured as turnkey solutions, or put together from used parts, can be utilized to produce text that, to a much greater extent than imaginable only two years ago, can facilitate automated generation of assessment materials, including the generation of electronic health records, the production of suggestions for distractor choices in multiple-choice items, and the drafting of patient vignettes based on prompts provided by item writers. 8. Electronic Health Records and Deep Learning The fact that medicine uses IT for storing and managing patient data brought with it that computer scientists were needed and hired to work on systems for this purpose. At the same time, data on patients, as it is stored in electronic health records (EHRs), is highly sensitive, so developers working in this area looked for ways to use databases that would not directly reflect anyone’s real data. One way was to use the same data, carefully anonymized so that individuals cannot be identified. A second approach was to generate health data of nonexistent patients using the regularities found in real health data.
This was the birth of synthetic EHRs, either in the form of expert-generated models (Synthea: Walonoski, 2017) or in the form of deep learning–based models that either predict the next clinical event or generate (plausible) synthetic EHRs based on a training dataset of real EHRs (Dr. AI: Choi, 2015; MedGAN: Choi, 2016). These models can be used to generate data that can be considered a limitless resource, and they are anonymous by design, so that concerns about privacy and data protection are alleviated when using these synthetic records in design and analysis work.

A recent systematic review (Xiao, 2018) describes 98 studies that use deep learning in conjunction with EHRs. The studies range from generating synthetic EHRs, which enable users to experiment with these data without any privacy issues, to building predictive models that are used to forecast the next clinical encounter based on existing patient history. EHRs are an important source of information and can be used to look at systematic differences in trajectories between patient groups, as well as at how different treatments play out, by comparing the prevalence of subsequent encounters.

9. Distractor Generation Through Question Answering

The utility of the GPT-2 language model is currently being explored by means of retraining for specific purposes. One application that was mentioned in the introduction is the TabNine software, which predicts line completions and programming code continuations. For that purpose, each user provides their own project-related code, which can further improve prediction.

Table 6.1 Example Output of the 345M Parameter Model When Used for Question Answering

Model prompt >>> Q: What are the most common side effects of statins? A:

SAMPLE 1: They can lead to a reduction in the effectiveness of your insulin. They can be associated with an increased risk of heart attack and stroke
SAMPLE 2: Common side effects are: nausea, vomiting, diarrhea, weight gain, joint pain . . .
SAMPLE 3: Statins can cause stomach inflammation or stomach ulcers. . . .
SAMPLE 4: There are some common side effects to statins, but these are not always permanent. For example, people with cardiovascular disease tend to have more fatigue.
SAMPLE 5: These can include: dizziness, feeling like your eyes are going to pop out (this is called ‘repetitive eye movements’), muscle pain, blurred vision

Table 6.2 Responses to a Question About Symptoms of IBS Using a Network Trained for 3 Days

Model prompt >>> What are the most common symptoms of irritable bowel syndrome?

SAMPLE 1.1: In chronic condition, patients report numerous symptoms including pain, bloating, abdominal pain, diarrhea, constipation, and vomiting. Those patients who have diarrhea, cough, or abdominal pain are often affected by this disorder . . . .
SAMPLE 1.2: In 2014, an estimated 3.9 million U.S. adults reported more than 1 or 2 symptoms of irritable bowel syndrome. About 30 percent reported at least one symptom. In this survey, 63.8 percent IAmericans . . .
SAMPLE 1.3: Symptoms of irritable bowel syndrome tend to be more severe in those with irritable bowel syndrome, which may make them the least likely to seek diagnosis and treatment. Some symptoms that may be present include abdominal In, bloating, . . .
SAMPLE 1.4: Symptoms of irritable bowel syndrome Complications of irritable bowel syndrome Symptoms of irritable bowel syndrome. Control Medication Inflammation Abnormal bowel movements. . . .
SAMPLE 2.1: Symptoms of irritable bowel syndrome No. of Possible Causes 8.2 10.7 Major Common Symptoms – upper and lower gastrointestinal issues. 1. Irritable bowel syndrome with diarrhea (ILS) has been described as . . .
SAMPLE 2.2: It is common for the majority of irritable bowel syndrome patients to experience a variety of symptoms including abdominal pain, cramping, bloating, nausea, abdominal pain, bloating, diarrhea, cramps, bloating, cramping . . .
SAMPLE 2.3: My best friend and I recently purchased a new home to be closer to family and friends, and we felt more comfortable living at home. However, due to some suspicious and unusual events, we were contacted by your physician to take our . . .
SAMPLE 2.4: Symptoms of irritable bowel syndrome are often severe and often require hospitalization. The symptoms of diarrhea (1.2%) and constipation (1.4%) are common in the community. Severe diarrhea and const . . .

The ability to generate distractors for existing multiple-choice items is already present even in the original, not-retrained 345M GPT-2 model. The training material contained a large number of cases in which a question was followed by the prompt ‘A:’ and an answer. Because of this type of training material, the model is able to react to a prompt that ends in ‘A:’ with a continuation of the text that in many cases represents a meaningful response to the question contained in the prompt.

Here, we show two examples that were generated using a local copy of GPT-2 on the author’s workstation equipped with a GPU and TensorFlow (Abadi et al., 2015), which facilitates sampling from the language model. The GPT-2 model was trained with material that includes text with the structure ‘Q: What is X? A: X is a Y.’ In other words, the model learns to treat a sequence of words bracketed by ‘Q:’ and ‘A:’ as a question that requires an answer. Table 6.1 shows an example of output generated using the 345M model. Note that these responses are far from perfect, but they could serve as inspiration for human item writers. The first example (Table 6.1) was generated without any retraining, using the downloadable version of the 345M GPT-2 model.
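The prompt pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the generation step itself (sampling from a local GPT-2 copy) is left out, and the filtering rule (keep the first sentence, drop repeats and anything containing the keyed answer) is one simple choice that is not taken from the chapter. Any surviving candidate would still need review by a human item writer.

```python
import re


def build_qa_prompt(question):
    """Wrap a question in the 'Q: ... A:' pattern so the model continues with an answer."""
    return f"Q: {question.strip()} A:"


def first_sentence(sample):
    """Keep only the text up to the first sentence-ending punctuation mark."""
    match = re.search(r"[.!?]", sample)
    return sample[: match.end()].strip() if match else sample.strip()


def distractor_candidates(samples, key=""):
    """Return de-duplicated first sentences that do not contain the keyed answer.

    `samples` is a list of raw model continuations (hypothetical input);
    `key` is the correct answer, which must not appear among the distractors.
    """
    seen, kept = set(), []
    for sample in samples:
        candidate = first_sentence(sample)
        normalized = candidate.lower()
        if not candidate or normalized in seen:
            continue
        if key and key.lower() in normalized:
            continue
        seen.add(normalized)
        kept.append(candidate)
    return kept


if __name__ == "__main__":
    print(build_qa_prompt("What are the most common side effects of statins?"))
    raw = [
        "Muscle pain is common. It usually resolves.",
        "Statins can cause stomach ulcers. More text follows.",
        "statins can cause stomach ulcers.",
    ]
    print(distractor_candidates(raw, key="muscle pain"))
```

Keeping the filter separate from the prompt builder means the same post-processing could be applied to output from any model, retrained or not.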
It is clear that not all of the listed side effects are ones that patients actually experience. However, some overlap with side effects commonly listed in online resources, and others may be ‘plausible enough’ to potentially serve as wrong options in a multiple-choice test. The next example (Table 6.2) asks about common symptoms of IBS; the responses were not cherry-picked, and from among two sets of 4 answers, most are on topic.

It is important to note that the responses are based on a general language model that has not been trained specifically to answer questions about medical content. This model is, on top of that, the second-smallest of the GPT-2 models and contains only 345 million parameters (small by today’s standards), while larger variants contain many more layers and up to approximately 1.5 billion parameters (Radford et al., 2018). Again, note that these responses, which could potentially be used as distractor suggestions, were generated without any retraining on specifically medical assessment materials.

10. Automatic Item Generation

The tests reported in this section are based on the GPT-2 (345M) pretrained language model and roughly 800,000 articles from the open access subset of the PubMed collection (www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) used for retraining. The data were encoded using the GPT-2 toolbox (https://github.com/nshepperd/gpt-2), which provides access to the vocabulary used for pretraining and supports fine-tuning GPT-2 using TensorFlow. The 800,000 articles roughly equate to 8 GB worth of text from a variety of scientific journals that allow open access to some or all of their articles. Training took 6 days on a Dell T7610 equipped with 128 GB RAM, two 10-core Intel Xeon processors, and two NVIDIA 1080 Ti GPUs, using CUDA 10.0, TensorFlow 1.14, and Python 3.6.8 and running Ubuntu 18.04 LTS.
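To put the GPU memory of this setup in perspective, a rough back-of-the-envelope estimate may help. The sketch below assumes 32-bit floats and an Adam-style optimizer that keeps two extra buffers per parameter; neither assumption is stated in the chapter, the function names are hypothetical, and activations (which typically dominate during training) are deliberately left out.

```python
def training_state_gb(n_params, bytes_per_value=4, tensors_per_param=4):
    """Memory in GB (10^9 bytes) for weights, gradients, and two optimizer buffers.

    Assumes fp32 values and an Adam-style optimizer (both are assumptions);
    activations stored for backpropagation are not included.
    """
    return n_params * bytes_per_value * tensors_per_param / 1e9


def activation_headroom_gb(gpu_memory_gb, n_params):
    """Memory left on one GPU for activations after the training state is stored."""
    return gpu_memory_gb - training_state_gb(n_params)


if __name__ == "__main__":
    n_params = 345_000_000  # GPT-2 medium, as used in this section
    print(f"training state: {training_state_gb(n_params):.2f} GB")
    print(f"headroom on an 11 GB card: {activation_headroom_gb(11, n_params):.2f} GB")
```

Under these assumptions, weights, gradients, and optimizer state alone occupy roughly 5.5 GB, about half of an 11 GB card such as the 1080 Ti, before any activations are stored.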
It was necessary to use the memory-efficient gradient storing options (Gruslys et al., 2016; Chen, 2016), as without them the size of the data structures for the 345M model used in the retraining exceeded the 11 GB memory of the GPUs. The amount of training data available through open access (OA) papers that can be downloaded from PubMed repositories is quite impressive: The number of OA articles exceeds 800,000, and the compressed pre-processed databases used for retraining in this study exceed 8 GB. Moreover, free medical texts are available in abundance, and a 2011 survey (Singh et al., 2011) lists many resources. Only a few years ago, language models could not be trained on data of this size using customary hardware, while nowadays (with a few tricks), even the medium-size (345 million parameter) GPT-2 model can be retrained on decent gaming GPUs.

Incidentally, the 6 days of training involved some downtime, which allowed me to find a recent arXiv submission on automated patent application generation using GPT-2 fine-tuned on (public, granted) patent applications available online (Lee, 2019). Other applications include the syntax completion software TabNine described in the introduction as well as experiments aimed at automatic generation of poems (Gwern.net, 2019). The authors of the GPT-2 patent retraining study used Google Colab, a free online deep learning platform that allows users access to GPUs for NN training for up to 12 hours. This is insufficient for retraining GPT-2 medium on the full 8 GB of PubMed data, so the author of this chapter resorted to upgrading and using a dual GPU workstation.

Table 6.3 shows exemplary results after 2 days of retraining with the 800,000-article PubMed database. Other publicly available medical text and article databases are listed in Singh et al. (2011). While the results are encouraging, they certainly cannot be used as is, as produced by the NN.
However, with some editing and human expert input, this raw output could serve as inspiration for authoring clinical vignettes. Results should come closer to human-authored item stems when using a transformer trained on a large number of real medical licensure and certification item stems and distractors, and as larger pretrained transformer models are published. It should be noted that these early examples are certainly not texts that would pass as real items. However, the quality of the texts is such that it can be assumed that a larger transformer model, trained on real medical licensure items, would be able to produce source material that could be used by medical experts as prompts to facilitate authoring expert-generated items.

Table 6.3 Example 1 (After Only 2 Days of Training on Customary Easily Obtainable Hardware) of Generated Text That Could Potentially Be Used as the Basis for Item Stem Authoring

Model prompt >>> A 52-year-old man complaining about chest pain and headache is coming to the emergency room.

PubMed (2 days) retrained, GPT-2 based model-generated texts:

SAMPLE 1: His blood pressure is 143/99 (normal). He also has hypertension and urticaria. It is a potentially life-threatening situation. The first piece of information from the patient is that he has on occasion been suffering from chronic abdominal pain, abdominal pain during pregnancy, chronic abdominal pain during childbirth, urinary tract symptoms. All these are related to chronic abdominal pain. He is conscious and willing to go to the emergency room for assessment. However, he is already complaining about abdominal pain when we come to the hospital. My primary concern is the possibility of heart failure due to chronic pain. I don’t feel that I am in a position to explain this to the patient, but I want to reassure him that it is a possibility. So, the decision to take him to the emergency room for evaluation is already made. Some would argue that he would be better off in the hospital, but that does not seem to be the case. When it comes to dealing with chronic pain, we can do better with a diagnosis of dysarthria, rather than acute cardiomyopathy. We can give him three days of antibiotics. We can tell him that he will need to have three small intestinal procedures in the hospital, but that he can go home

SAMPLE 2: He also complained of tingling and aching in his hands, feet, legs, and arms. The patient is not in good condition. He had a history of a recent allergic reaction to peanuts. He took 4 mg. of polyphenols: red chip, red, black and white, in combination with 250 mg. of oat milk and 60 mg. of vitamin B12. He immediately experienced tingling sensation in his hands and feet. In addition, he had a mild fever. He did not take medicine for a long time. He then started to complain of a very severe pain in his chest and abdomen. His symptoms became worse. He felt cold, tingling sensation, and headache. He was taken to the hospital, and there a further examination showed that he was allergic to peanuts. A follow-up examination confirmed that he was allergic to peanuts. He was discharged home the same day. A change in his allergy history showed that he had not taken medicine for a long time, and he had a mild fever. He started to take food for a long time. He started to take vitamin B12. He started to take oat milk as well. He started taking 500 mg. of vitamin B12 in combination with 150 mg. . . .

SAMPLE 3: The patient was admitted to the hospital on Friday night with a minor hip fracture. The patient was experiencing abdominal pain, nausea, and fever. He did not report any cough, other than the cough of headache and chest pain. The patient was first treated with antibiotics, then with polyethylene glycol solution (PEG), and finally with chloramphenicol. At the end of treatment, the patient’s abdominal pain and abdominal pain continued. The patient told his doctor that he did not feel well, but did not report any cough, cold, or other symptoms. On the following day, the patient developed a fever of 40.2°C and some abdominal pain. On the following morning, he reported no other symptoms, but fever of 40.2°C and fever of 39.2°C. The patient’s temperature was recorded as 49.3°C, and his level of red blood cell count was 120/16. The patient was prescribed phenylalanine (PBA), but did not report any other symptoms. The patient’s BP was recorded as 118/28, his level of red blood cell count was 126/17. The patient was then referred to an oncology facility for further treatment.

SAMPLE 4: His symptoms were described as sudden onset of chest pain with a brief onset of bradycardia. The patient had a history of transient rapid, transient, and rapid diastolic blood pressure increases. Patient in critical condition for dehydration, 5-day interval between study. Recurrent hypertension and pre-existing coronary heart disease (CHD) with ipsilateral ventricular tachycardia. Diabetes mellitus. 2 studies (RR=0.75, 95% CI=0.65 to 0.85). Over 30-year patient without prior cardiac disease. Shaolin acupressure technique applied (AG) and exercise at 1.5m duration. Patient is resting on a reclining couch, a patient in critical condition for dehydration, 5-day interval between study; he has chest pain and pain in the left chest with bradycardia. The patient had no previous CHD, no prior hypertension, no pre-existing CHD, and a 30-year history of no chronic obstructive pulmonary disease. Approximate number of patients who participated in follow-up and follow . . .

Note: The GPT-2 transformer was retrained on medical open access (PubMed open access subset) articles, not on any secure or publicly available medical licensure or admission items.

A second example contains two sentences as a prompt, which provides a bit more context for the ‘attention’ circuits of the transformer network.
The network checkpoint (saved parameters after partial retraining) that was used represents the retrained GPT-2 after 200,000 cycles using the 800,000-article PubMed open access database.

Table 6.4 Example 2: Obtained From a Network That Was Retrained for 3 Days

Model prompt >>> A 32-year-old woman comes to the doctor’s office complaining about severe headache and blurry vision. She had been suffering from migraine for years.

SAMPLE 1: She was prescribed a drug for tics and headaches, a sulfate and an antihistamine, and the prescription was confirmed by the doctor, who thought that the symptoms could be caused by an active nerve stimulation. It was a decision to keep the dosage, which was higher than usual. Subsequent results showed a very severe headache, which did not respond to the medications, and was now very severe. This was the first case of tetanus-induced brain hemorrhage (HBIH) in Europe. HBIH is a rare but potentially fatal disease in which the brain hemorrhage is mainly due to the tetanus toxin. The researchers reviewed the literature to identify cases of HBIH. All of these cases were reported in the period 2006 to 2011, but a considerable number of cases in the years 2011 to 2012 were only reported in 2005 and 2006. The authors of the paper reported that four cases of HBIH in Europe have been reported during the years 2006 to 2011, of which three were reported in the period 2006 to 2012. The second case in this period, which was reported in 2006 to 2011, was a result of the same kind of drug (antihistamine, sulfated humic acid) as the first case . . .

SAMPLE 2: At her hospital, she was admitted for medical examination and he was informed that she had experienced a migraine and fever in the preceding week. When he examined her eyesight, she revealed that they were deteriorating. She had been having headaches for a few years and was very concerned about them. She had the fever and headache. She was feverish, very fatigued and not able to sleep. She was in the critical stage and was not able to tolerate even the medium intensity headaches. She had been very relieved with this problem. She felt relieved with the headache, but she was in severe pain in the back of her neck. The doctor who examined her eyesight revealed that the intensity of her headache was not satisfactory, so the doctors had recommended that she should be admitted for further medical examination. She had not been allowed to use a pacemaker since she had been admitted for treatment of a chronic migraine. However, during her admission, she complained that her eyesight was getting worse. She was also in pain in her back. She was admitted for emergency treatment. She was admitted for treatment of chronic migraine. The doctors who examined her eyesight gave her an intravenous injection of levodopa and there were no complications. They also gave her chlordiazepoxide and waited for a . . .

SAMPLE 3: She was wondering if the headgear could help her to relax and feel a bit better. She was in need of the help to stop the pain. But she had found that her headache was even worse than usual. She was planning to take a bath during this period and it would make her worry. So, she asked the doctor for an eye examination, to see if there was any abnormality. The doctor, who is a nurse, explained to the woman that, after a headache, there is an increased production of nerve gas from the chest and lungs, so it would be best to take a nasal spray with some corticosteroid treatment. She applied the nasal spray to her nose and then took a step back and let the cold air come into the headgear. But, the patient was still not satisfied and the doctor was still giving her advice. She then asked for another exam. The patient was now told that she had to take some corticosteroid treatment. She then was informed that she should take some steroid nasal spray in addition to the eye examination to take the steroid treatment. She was then told that her headache was still worse than usual. On examination, the doctor showed the patient an X-ray of the neck . . .

SAMPLE 4: According to her husband, she tried some anti-emetic medication but there was no difference in her condition. She felt that migraine pain was like a pressure and pain would only come when she was moving around. She asked the doctors to prescribe anti-emetic medication to treat her headache but, she was not satisfied with the treatment. So, she tried to make a prescription from the family doctor for anti-emetic medication and the prescription was filled and passed out the door. When she was with the family doctor, she tried to find out how to use the antiemetic medication to relieve her headache but, the anti-emetic medication was not effective. So, she finally decided to try the anti-emetic medication and it became effective. She started to use the anti-emetic medication for about a month. She finally decided to stop using the anti-emetic medication. She did not know how to take it and she was very afraid of taking it. But, she tried to use the anti-emetic medication. And, finally, the headaches stopped. She was happy and was able to do her normal activities. Then, she started to take anti-emetic medication and it was elective. She began to think that she should take . . .

Note: The network recall can be fine-tuned as well to produce most likely vs. more divergent responses.

The point to be made here is that the existing network architecture can be used for question answering, and to a limited extent also for ‘inspiration’ of human test developers who could enter ideas as prompts and have the neural network spit out ideas. Current applications that are similar in kind used the GPT-2 model for retraining based on openly available patent texts, poems, as well as source code files.
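The note to Table 6.4 mentions that sampling can be tuned between most likely and more divergent responses. One common way to do this in GPT-2 style samplers is to rescale the model’s scores with a temperature and to restrict sampling to the top-k candidates. The sketch below illustrates that idea on a toy list of scores; it is not the chapter’s code, the function names are hypothetical, and the chapter does not state which settings were used for the tables.

```python
import math
import random


def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Convert raw scores into sampling probabilities.

    A temperature below 1 sharpens the distribution (closer to 'most likely'
    output); a temperature above 1 flattens it (more divergent output).
    With top_k set, only the k highest scores stay eligible (ties at the
    cutoff are all kept).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= cutoff else -math.inf for x in scaled]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token index according to the adjusted distribution."""
    probs = sampling_distribution(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    for t in (0.5, 1.0, 2.0):
        rounded = [round(p, 3) for p in sampling_distribution(toy_logits, temperature=t)]
        print(f"temperature={t}: {rounded}")
```

For item writing, a lower temperature would tend to yield more conventional continuations, while a higher one may produce more varied (and more often off-topic) drafts.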
It appears plausible that further fine-tuning with targeted assessment material should improve the results dramatically, for example, by using all available items in a certain subject domain such as cardiology. It is not claimed that the current system is fully useful as is, but the quality of text produced by the currently available transformer architecture makes it rather likely that correctly formed item stems can be produced by deep learning–based language models in the very near future.

11. Discussion and Current State of the Art

After GPT-2 kicked off several endeavors to use transformer-based NLG-focused models for research and commercial purposes, including startups that carried fine-tuning in their name, the next generation of NLG models essentially made fine-tuning an expensive and largely unneeded exercise. GPT-3 (Brown et al., 2020) provided researchers and entrepreneurs with a model that was both much larger and much more capable, so that it could be used with minimal adjustments to hyperparameters or prompting in order to enable special-purpose applications. Even comparably small models (by today’s standards), such as the GPT-J provided by eleuther.ai, the grassroots initiative of AI researchers, allow generating questions without any retraining. The second sample output even included an example of a potential diagnosis (‘The most likely diagnosis is a brain tumor’), which appears a plausible choice, even though it was not requested. GPT-3 is obviously capable of delivering vignettes that are quite close to what one would find in USMLE items (Table 6.6). However, for special applications, GPT-3 still allows fine-tuning, and even offers a fine-tuned model for code completion in languages such as Python and JavaScript, for example, in the form of OpenAI’s Codex (https://doi.org/10.48550/arXiv.2107.03374).
Tools such as Codex are similarly impressive, as they support writing code, which, after all, seems to be modeled just as well as natural language by very large transformers such as GPT-3.

We are at the frontier of AI entering many domains of daily life. While phone makers contribute to the hype and advertise the next generation of smartphones as running neural networks, there are industrial domains in which these applications are essential. Among these are computer vision and assisted driving. Others are recommenders for e-commerce, but also applications that are trained to detect the use of AI for deep fakes, that is, video material made by a machine that was programmed with malicious intent to fool humans. However, many applications also support human creativity in more benign ways, such as GauGAN (NVIDIA, 2019), a tool that helps illustrators to compose landscapes easily with only a few clicks, and more recently DALL-E and CLIP (https://doi.org/10.48550/arXiv.2204.06125). There are also tools based on AI that support wellness (Qualcomm, 2019), using the same technologies to analyze data on health that are used to predict what music one may like based on past purchase and download behavior.

Table 6.5 Example Output Generated by https://6b.eleuther.ai/ for the Prompt: ‘Statins are used to lower cholesterol. What are the most common side effects of statins?’

Your doctor will likely tell you that statins can cause side effects. You can expect some side effects from statins, but most are minor. The most common side effects of statins are:
Headaches
Nausea
Dizziness
Tiredness
Back pain
Urinary tract infections
Muscle aches

Table 6.6 Vignette Written by GPT-3 Based on a Prompt, Including a Question, Which Appears to Be Basically a Paraphrase

Prompt: Write a USLME Vignette and a question about next treatment steps based on the following sentence: ‘A 32-year-old woman comes to the doctor’s office complaining about a severe headache and blurry vision.’

GPT-3 Playground Output 1:
Vignette: A 32-year-old woman comes to the doctor’s office complaining about a severe headache and blurry vision. The doctor conducts a physical examination and orders a CT scan. The CT scan reveals a mass in the woman’s brain. The doctor refers the woman to a neurologist for further evaluation and treatment.
Question: What are the next treatment steps for the woman?

GPT-3 Playground Output 2:
Vignette: A 32-year-old woman comes to the doctor’s office complaining about a severe headache and blurry vision. She reports that the headache has been getting progressively worse over the past few weeks and that her vision has become increasingly blurry. She is concerned that something may be wrong with her eyes. The doctor performs a thorough examination and orders a CT scan of the woman’s head. The CT scan reveals a mass in the woman’s brain. The doctor refers the woman to a neurologist for further evaluation and treatment.
Question: What is the most likely diagnosis based on the information provided?

12. Conclusion

The prospects of this technology become really exciting when looking at how these pretrained models could be deployed. There are efforts underway to develop toolkits that run language models on iOS devices, currently GPT-2 and BERT, another transformer-based language model developed by Google (Devlin, 2018).
This would not train these networks on phones, but would allow utilization of a trained network to generate new text based on a sentence that describes a case or a context entered by a user. For automated item generation, apps could be developed that use language generation on smartphones to support item developers in writing new content on their mobile devices (https://github.com/huggingface/swift-coreml-transformers). Once pretrained models for medical specialties are available, it would be straightforward to develop a tool in which medical experts can enter a draft vignette or even a few keywords that are wrapped by the app into a case description draft, which can then be reviewed and submitted by the human expert for further editing and finalization by item writers at the testing agency that assembles, administers, and scores the certification tests. At the testing agency, the just-developed case vignette could be finalized using yet another set of machine learning tools to generate correct and incorrect response options, which are either used in multiple-choice formats or for training an automated scoring system for short constructed responses.

As it turns out, apps using transformers to generate texts have not yet flooded the market, even three years after the first draft of this chapter. However, OpenAI reported as early as May 2021 that over 300 applications were using GPT-3. Regarding automated generation of questions (or items) using transformers, the TIMSS & PIRLS International Study Center is looking into utilizing writing assistants to generate parallel versions of item stems, passages, and questions in order to support human experts in their item writing activities. Drori et al. (2022) reported that a pipeline built on transformer models was able to generate questions as well as answers at the level of MIT mathematics course material using targeted pretraining and fine-tuning.
The output consists not only of texts, but of graphs, diagrams, tables, and other objects commonly found in math instruction and assessment. This leads me to conjecture that within a few years we will achieve machine-generated items not only for simple open-ended and multiple-choice questions; we will also be able to use AI workflows to generate complex, engaging items that mix graphical stimuli and responses (which will then be scored automatically; e.g., von Davier et al., 2022) and that can be generated from inputs that specify target grade, topic, cognitive processes needed to solve the item, and type of response and stimulus material.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jozefowicz, R., Jia, Y., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D. . . . Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. www.tensorflow.org. CoRR abs/1603.04467
Bejar, I. I. (2002). Generative testing: From conception to implementation. In Item generation for test development (pp. 199–217). Lawrence Erlbaum Associates.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., & Askell, A. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS). https://doi.org/10.48550/arXiv.2005.14165
Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016). Training deep nets with sublinear memory cost. CoRR. http://arxiv.org/abs/1604.06174
Chevitarese, D. S., Szwarcman, D., & Vellasco, M. (2012). Speeding up the training of neural networks with CUDA technology. In L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, & J. M. Zurada (Eds.), Artificial intelligence and soft computing (pp. 30–38). Springer Berlin Heidelberg. ISBN 978-3-642-29347-4
Choi, E., Bahadori, M. T., & Sun, J. (2015). Doctor AI: Predicting clinical events via recurrent neural networks. CoRR. http://arxiv.org/abs/1511.05942
Choi, E., Schuetz, A., Stewart, W. F., & Sun, J. (2016). Medical concept representation learning from electronic health records and its application on heart failure prediction. CoRR. http://arxiv.org/abs/1602.03686
Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. CoRR. http://arxiv.org/abs/1901.02860
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR. http://arxiv.org/abs/1810.04805
Drori, I., Zhang, S., Shuttleworth, R., Tang, L., Lu, A., Ke, E., Liu, K., Chen, L., Tran, S., Cheng, N., Wang, R., Singh, N., Patti, T. L., Lynch, J., Shporer, A., Verma, N., Wu, E., & Strang, G. (2022). A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. https://arxiv.org/abs/2112.15594
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64(4), 407–433.
Gilula, Z., & Haberman, S. J. (1994). Conditional log-linear models for analyzing categorical panel data. Journal of the American Statistical Association, 89(426), 645–656. www.jstor.org/stable/2290867
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In F. Ostendorf (Ed.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg University Press.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. G. (2006). The international personality item pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84–96.
Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., & Graves, A. (2016). Memory-efficient backpropagation through time. CoRR. http://arxiv.org/abs/1606.03401
Gwern.net. (2019). GPT-2 neural network poetry. www.gwern.net/GPT-2
Hanin, B. (2017). Universal function approximation by deep neural nets with bounded width and ReLU activations. CoRR. http://arxiv.org/abs/1708.02691
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8), 2554–2558. https://doi.org/10.1073/pnas.79.8.2554
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257. https://doi.org/10.1016/0893-6080(91)90009-T
Lee, H. (2019). Patent claim generation by fine-tuning OpenAI GPT-2. 2052, Department of Computer Science and Information Engineering, National Taiwan University.
NVIDIA. (2019, August 13). GauGAN tool for artists. www.mingyuliu.com/gaugan; MegatronLM press release. https://github.com/NVIDIA/Megatron-LM. Also https://nvidianews.nvidia.com/news/nvidia-achieves-breakthroughs-in-language-understanding-to-enable-real-time-conversational-ai
Ozair, S. (2016). Char-RNN for TensorFlow. https://github.com/sherjilozair/char-rnn-tensorflow
Qualcomm. (2019). AI supporting wellness. www.qualcomm.com/news/onq/2019/07/11/ai-your-supportive-wellness-companion
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2018). Language models are unsupervised multitask learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
Sarazen, T., & Peng, M. (2019). The staggering cost of training SOTA AI models. https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Singh, A., Singh, M., Singh, A. K., Singh, D., Singh, P., & Sharma, A. (2011). Free full text articles: Where to search for them? International Journal of Trichology, 3(2), 75–79. https://doi.org/10.4103/0974-7753.90803
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. CoRR. http://arxiv.org/abs/1706.03762
Vincent, J. (2019). This AI-powered autocompletion software is Gmail’s smart compose for coders – Upgrading coding autocompleter tools with deep learning. www.theverge.com/2019/7/24/20708542/coding-autocompleter-deep-tabnine-ai-deep-learning-smart-compose
von Davier, M. (2016). High-performance psychometrics: The parallel-E parallel-M algorithm for generalized latent variable models. ETS Research Report Series, 1–11. https://doi.org/10.1002/ets2.12120
von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83(4), 847–857. https://doi.org/10.1007/s11336-018-9608-y
von Davier, M., Tyack, L., & Khorramdel, L. (2022). Scoring graphical responses in TIMSS 2019 using artificial neural networks. Educational and Psychological Measurement. https://doi.org/10.1177/00131644221098021
Walonoski, J., Kramer, M., Nichols, J., Quina, A., Moesel, C., Hall, D., Duffett, C., Dube, K., Gallagher, T., & McLachlan, S. (2017). Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association, 25(3), 230–238. ISSN 1527-974X. https://doi.org/10.1093/jamia/ocx079
Wiesner, J. B. (1961). MIT: The thinking machine. www.youtube.com/watch
Xiao, C., Choi, E., & Sun, J. (2018). Opportunities and challenges in developing deep learning models using electronic health records data: A systematic review. Journal of the American Medical Informatics Association, 25(10), 1419–1428. ISSN 1527-974X. https://doi.org/10.1093/jamia/ocy068
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y. (2019). Defending against neural fake news. CoRR. http://arxiv.org/abs/1905.12616