Generative AI Value Chain
ANDY WU
MATT HIGGINS
In 2023, all these types of generative AI were created through a similar process. At the core of any
generative AI system is the model, a mathematical representation of patterns that forms the basis of
‘knowledge’ for the system. The structure of the model is determined by its architecture, the theoretical organization of parameters in an artificial neural network that the system uses to generate its outputs.
To learn, the model relies on a mountain of training data, a collection of examples relevant to the task
the model is being trained to perform. During an initial pre-training process, the model learns to adjust its parameter weights (arranged according to the architecture), improving its prediction quality over many iterations; the model is then further refined through a fine-tuning process. Training an AI system
requires specialized hardware, like GPUs in data centers, that consume enormous amounts of
electricity to handle heavy and massively-parallel computational loads. Outside of the model,
commercial AI companies could then implement further user-facing guardrails to keep the model from
generating undesired content. From there, the model can then be used for inference by developers
(through an API) or users (through an application). (Exhibit 1 shows an overview of the generative AI
value chain.)
Architecture
The approaches to generative AI seen in practice in 2023 were made possible by advances at the
intersection of machine learning, artificial neural networks, and language modeling throughout the
2010s. The progression of language model architecture can be roughly divided into three periods: early-
stage probabilistic methods (pre-2014), neural network-based methods (2014 to 2017), and transformer-
based methods with pre-training and fine-tuning (2017 to present).
Professor Andy Wu and Research Associate Matt Higgins prepared this note as the basis for class discussion with the assistance of Doctoral Student
Hang Jiang (MIT Media Lab) and Doctoral Student Miaomiao Zhang (HBS).
Copyright © 2023 President and Fellows of Harvard College. To order copies or request permission to reproduce materials, call 1-800-545-7685,
write Harvard Business School Publishing, Boston, MA 02163, or go to www.hbsp.harvard.edu. This publication may not be digitized, photocopied,
or otherwise reproduced, posted, or transmitted, without the permission of Harvard Business School.
Probabilistic Methods
Prior to 2014, the standard approach to language modeling involved simple probabilistic algorithms
that aimed to predict the likelihood of the next word in a given sequence based on the immediately preceding word or words. These approaches were primarily used as components within NLP pipelines
to facilitate tasks such as auto-complete and spelling correction.a
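The mechanics of such a model fit in a few lines of Python. The sketch below builds trigram counts from a toy corpus and ranks candidate next words by frequency; the corpus and resulting probabilities are purely illustrative (footnote a works through the same idea with the sentence “We know cats chase ___”).

```python
from collections import Counter, defaultdict

# Toy training corpus; a real model would be built from a large text collection.
corpus = ("we know cats chase mice . cats chase mice . "
          "cats chase cars . cats chase tails").split()

# Count which word follows each two-word context.
counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

def predict_next(w1, w2, k=3):
    """Rank the k most likely next words after the bigram (w1, w2)."""
    following = counts[(w1, w2)]
    total = sum(following.values())
    return [(word, count / total) for word, count in following.most_common(k)]

# "cats chase" is followed by mice twice, cars once, tails once in the corpus:
print(predict_next("cats", "chase"))  # [('mice', 0.5), ('cars', 0.25), ('tails', 0.25)]
```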
Neural Networks
In 2014, the application of deep learning neural networks to language modeling brought about a shift in how researchers approached the problem.b,1,2 Neural networks consisted of
connected nodes inspired by neurons in a biological brain. By using layers of nodes that represented
words as high-dimensional vectors, feedforward neural networks could capture semantic meanings
from surrounding words and predict the next word in a sentence with some awareness of context
(Exhibit 2 shows the structure of a deep neural network). This approach improved model performance
on NLP tasks like machine translation and text summarization. However, neural network-based models had limitations of their own: they failed to capture the different meanings a word could take on in different contexts, and they were slow and computationally expensive.c,3
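A minimal sketch of the underlying idea: each word is mapped to a vector, and geometric closeness stands in for semantic closeness. The vectors below are hand-picked toy values, not learned embeddings like those produced by Word2Vec or GloVe.

```python
import numpy as np

# Toy 4-dimensional "embeddings" (real models learn 100-300 dimensions from data).
vectors = {
    "cat": np.array([0.9, 0.1, 0.0, 0.3]),
    "dog": np.array([0.8, 0.2, 0.1, 0.4]),
    "car": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["cat"], vectors["dog"]))  # high: semantically close
print(cosine(vectors["cat"], vectors["car"]))  # lower: semantically distant
```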
Transformers
In 2017, the transformer architecture was introduced by a team at Google Brain and quickly became
the new model of choice for most large-language NLP tasks.4 Transformer models were a type of neural
network, differentiated from traditional neural networks by the way they leverage self-attention to
unpack long-range dependencies in sentences, e.g., the relationship between words that are far apart
in a sentence.d A traditional NLP model needed to process each word in sequence to capture the meaning and context of the whole sentence. Self-attention, on the other hand, allowed each element in the sequence to interact with all other elements and determine how much attention they should pay to one another, at much greater speed (Exhibit 3 illustrates the attention mechanism following long-distance
dependencies). The transformer architecture surpassed earlier techniques for two reasons: first,
transformer models were better at understanding the relationships between words in a sentence;
second, transformer models allowed for much faster and computationally efficient training of the
a Take the sentence: “We know cats chase ___.” With these four words as the stem, a probabilistic trigram Markov model can
generate a matrix of probabilities for the next word. A transition matrix is created (based on a training corpus of texts) in which
the words that typically follow “cats chase” in natural language are ranked from most to least likely, i.e.: “mice” (0.4%), “cars”
(0.2%), “tails” (0.1%), “grass” (0.05%). Based on the probabilities in the transition matrix, the trigram model then suggests a few
auto-complete choices with the highest probabilities as the next word. Therefore, the accuracy of the trigram model’s prediction
depends on the quality and size of the training corpus, as well as the assumptions made in the Markov process. A limitation of
the Markov model is that, if we rephrase the sentence to “Cats as we know chase ___”, the trigram model will not be able to
predict the next word well because its context window size is only 3.
b Word2Vec and GloVe were introduced to represent words as static vectors in an unsupervised task that involved predicting
whether two sampled words occurred in the same context. These pre-trained word vectors were then used to train encoder-
decoder networks, mostly recurrent neural network (RNN) models, in a sequence-to-sequence (Seq2Seq) manner.
c To address this issue, ELMo, a deep contextualized word embedding model, was proposed in 2017. ELMo used a bidirectional
Long Short-Term Memory (LSTM) network trained on a large text corpus. This model dynamically determines a word’s vector
depending on its context, making it a richer and more powerful representation than Word2Vec and GloVe for downstream NLP
tasks.
d An alternative older way to process sequences was to use recurrent neural networks (RNNs) which process each element in
the sequence one by one and update their internal state accordingly. RNNs were the state of the art before the invention of
transformers. However, they were comparatively slow and failed to overcome long-range dependency problems.
model compared to previous techniques. By attending to all words in the sentence, the transformer captured long-term dependencies and understood the relationships between words, even words that
were far apart, leading to better translation performance across the board.5 Google’s T5 (text-to-text
transfer transformer) model, released in 2019, recast all of the model’s downstream tasks into a unified text-to-text format, setting a new state of the art for transformer-based training approaches and model architectures.6 Exhibit 4 illustrates the
progression of cutting-edge LLMs since 2019.
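The self-attention computation at the heart of the transformer is compact enough to sketch directly. The NumPy version below implements the scaled dot-product attention described in the Vaswani et al. paper cited above; the random inputs and weight matrices are stand-ins for learned values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every token scores every other token
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ V                   # each output mixes all positions at once

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8): one vector per token
```

Because every token attends to every other token in a single matrix operation, the computation is highly parallel, which is what made transformers both faster to train and better at long-range dependencies than the sequential RNNs described in footnote d.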
Training Data
Creating a model for generative AI required training an algorithm on some corpus of data. The size of the training data was usually expressed in terms of tokens, discrete units of text defined by the LLM’s designer (usually some kind of n-gram or subword). GPT-3 used 500 billion tokens (about 300 billion words) drawn from web-crawl data and webpage metadata extracts, webpages linked from Reddit posts, internet-based book corpora, and the English version of Wikipedia.7 (Exhibit 5 details the name and
description of each dataset along with the number of tokens as well as weights in training.)
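For a concrete sense of the word-to-token relationship, the snippet below counts tokens using the open-source tiktoken library (assuming it is installed); the particular encoding chosen is illustrative.

```python
import tiktoken  # open-source tokenizer library (pip install tiktoken)

enc = tiktoken.get_encoding("cl100k_base")  # one of several GPT-style encodings
text = "Generative AI systems learn from tokens, not words."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
print(token_ids[:5])  # tokens are integer IDs into a fixed vocabulary
# English text typically works out to roughly 0.6-0.75 words per token,
# which is why GPT-3's ~500 billion tokens map to ~300 billion words.
```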
Neural language model scaling laws detailed how performance scaled as a power law with model size (number of parameters), dataset size, and the amount of compute used for training, but a model did not become significantly better if model size alone grew indefinitely.8 In fact, the size of training
data in LLMs over the past several years had not grown tremendously (as it did between 2010 and
2020). As one technical expert noted, “one of the emerging scarce resources for training LLM is not
capital, but the volume of high-quality data, as scaling model parameter count delivers diminishing
returns.”9 Another AI expert concurred, “Data, not size, is the currently active constraint on language
modeling performance.”10 Training AI for a specific task, also known as fine-tuning, required data
specific to that task. An early focus was training AI to assist software developers with writing software code. Despite the multiple trillions of tokens of data available for this specific task, one observer concluded, “The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available.”11
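The shape of these diminishing returns can be sketched with the loss formula fitted by Hoffmann et al. (the “Chinchilla” scaling law cited above); the constants below are the paper’s published fits, quoted here approximately.

```python
# Predicted loss L(N, D) = E + A / N**alpha + B / D**beta,
# where N = parameter count and D = training tokens.
# Constants are Hoffmann et al.'s fitted values (approximate).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

# A smaller model trained on more tokens can beat a larger, data-starved one:
print(predicted_loss(70e9, 1.4e12))   # ~1.93 (Chinchilla-like budget)
print(predicted_loss(175e9, 300e9))   # ~2.00 (GPT-3-like budget)
```

The comparison illustrates the experts’ point: holding compute roughly constant, feeding a 70-billion-parameter model far more data yields a lower predicted loss than a 175-billion-parameter model trained on less.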
Copyright Law
Training data could be public or proprietary, and the methods and data used could be disclosed or
undisclosed. In the early days of LLM-training, it was common to disclose the sources of training data
and outline the training techniques used for the benefit of the AI research community. As LLM-training
attracted more attention and the possibility of copyright violation was raised, companies training LLMs
moved away from disclosing their sources of training data to avoid creating additional liability. Given
the nature of information on the internet, it was inevitable that some copyrighted works would make
their way into the training corpora for generative AI systems. OpenAI GPT-3’s training corpora
(CommonCrawl, WebText2, Books1, Books2, and Wikipedia) contained copyrighted material. Image-
generating AIs were trained indiscriminately on copyrighted materials before the issue of copyright
law for training data was ever even considered. Some researchers speculated that the use of copyrighted material for training could be protected under the fair use doctrine, but there was no legal precedent for AI training and the case law was unclear. In early 2023, several lawsuits were filed by
plaintiffs alleging infringement.12
In this unsettled landscape, transparency about the sources of a given model’s training data could
have major legal and competitive implications. OpenAI, which said it had trained DALL-E on
hundreds of millions of captioned images, decided not to specify the source of those images. Stability AI, which had been transparent about the fact that its Stable Diffusion model sourced its captioned image dataset from the CommonCrawl database by way of the German non-profit LAION (Large-scale Artificial Intelligence Open Network), was sued by Getty Images. Researchers found that the LAION database included mountains of proprietary and copyrighted data drawn from popular photo-hosting websites including Getty Images, Adobe Stock, iStockPhoto, Unsplash, Flickr, and Pinterest.13
Model
In mathematics and statistics, a model is a descriptive formalization of the relationships between
variables in the form of mathematical equations. In AI, the above definition holds, but requires
specification. When people use the term “model” to describe an advanced product or service offered
commercially, such as ChatGPT, what they are usually describing is a foundation model, a model
trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of
downstream tasks.14 Foundation models provide a base on which other models can be built. Task-
specific models are versions of a foundation model tailored (fine-tuned) for a specific purpose using
data relevant to the task to be performed.
Parameters
In AI and ML models, a parameter was an internal variable that the model extracted (learned) from
its training data and used to improve its future predictions. Parameters were adjusted through the
training process as the model learned to map its inputs (e.g., images or text) to outputs (e.g., object labels
or next-word predictions). After training, these parameters allowed the model to make predictions or
decisions when it was presented with new data. Model size was typically expressed as the number of
parameters in a model (e.g., GPT-3 had 175 billion parameters). As the number of parameters in the
model grew from GPT-2 (1.5B) to GPT-3 (175B), the emergent capabilities of the models to respond
with human-like answers to prompts became apparent. As the model size and its associated computing
resources required grew, larger models outperformed their predecessors on established tasks (Exhibit
6 shows the progression in parameter size of state-of-the-art NLP models over time).
One can think of the full set of parameters as the accumulated “knowledge” of the model. Larger
models could make more distinctions and represent more complexity, though this came with increased compute costs. While larger models had a greater set of accumulated “knowledge” to draw
on and provided better performance on some tasks, they also required more resources, often
exponentially so. Some research suggested that smaller models could be nearly as effective as larger models given enough training tokens, and smaller models were significantly cheaper to run.15 This was another
way of saying that there were diminishing returns to model size beyond a certain point at which rising
compute costs outweighed any detectable difference in output quality. When Meta released its open
source LLM LLaMA, it did so in four different sizes (7B, 13B, 33B, and 65B parameters)16 so that
researchers could avoid running the more-resource-intensive 65 billion parameter model when the 7
billion parameter model would do. By the time GPT-3 was released in 2020, the costs of training and inference for LLMs had fallen significantly, and they continued to fall. In 2023, an analyst estimated that the cost of training a 175-billion-parameter model on Nvidia’s H100 GPUs could reach as low as $1.4 million, more than 80% cheaper than a 70-billion-parameter model trained on V100 GPUs in 2020. Even so, training large language models involved considerable costs.17
Pre-Training
Pre-training referred to the process used to teach models to recognize patterns and extract general features from large datasets before they were fine-tuned for specific tasks. The process was somewhat
similar to a student mastering the basics of reading, writing, arithmetic, and information processing
during K-12 education before going on to specialize in a particular field or profession during post-
secondary education. For a large language model, pre-training might consist of feeding the model a large corpus of text from the broad internet, from which it would learn the basics of human language, grammar, sentence structure, and basic facts about the world. The pre-training phase was when the model learned to process data and created its initial parameters for understanding that data; it formed the foundation of the model’s knowledge and prediction capabilities, on which further and more specific capabilities could be built. In the modern LLM era, pre-training was unsupervised, meaning the model was turned loose on a corpus of unlabeled data and (using a deep learning neural network) taught itself to recognize patterns in the input dataset. Unsupervised learning proceeded without labeled training data, using a mixture of clustering, dimensionality reduction (e.g., Principal Component Analysis), and anomaly detection. In unsupervised learning, instead of starting with the correct answers and working backwards to arrive at a set of classification rules, the rules emerged from the clusters and features the algorithm itself defined and abstracted from the dataset.
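The “unsupervised” character of language-model pre-training is easiest to see in code: the training labels are simply the same text shifted by one token, so no human labeling is required. A minimal sketch:

```python
# The "labels" for language-model pre-training are the text itself,
# shifted by one position -- which is why no human annotation is needed.
tokens = "the cat sat on the mat".split()

inputs, targets = tokens[:-1], tokens[1:]
for x, y in zip(inputs, targets):
    print(f"given {x!r}, predict {y!r}")
# Pre-training repeats this prediction task across billions of tokens,
# adjusting parameters to raise the probability of each true next token.
```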
Fine-Tuning
Fine-tuning begins after pre-training and focuses on using more specific data to train the model to
perform in a particular context on a specific task. For a model that has been pre-trained to understand
language, this might consist of more narrowly focused training on transcripts of customer service chats
in order to create a model fine-tuned to serve as a customer-service chatbot. During fine-tuning, the
model’s initial parameters are updated to specialize in the task it is being fine-tuned to perform. If there is a major domain shift between training and application, an LLM without fine-tuning will perform poorly in the new domain; in that case, more domain-specific data is needed for fine-tuning. For instance, a medical company needs a substantial amount of domain data to fine-tune a general LLM (e.g., LLaMA, GPT-3), because the pre-training data in these models is general-purpose English text, while the deployment scenario requires medical expertise. Extensive fine-tuning on various types of data is needed if the model’s designers want the LLM to excel at multiple tasks across different domains (e.g., a medical chatbot that can also make small talk about the weather and sports). Fine-tuning costs depended on the size of the model and the amount of input
data in the same way as model pre-training. For a 6.7 billion parameter transformer model, fine-tuning
with 9.6 million tokens of data would take 2 hours and cost $37; with 1 trillion tokens, it would take 10
days and cost $200,000.18
Fine-tuning generally proceeded in a supervised manner, meaning the input data was labeled and the quality of the model’s output was evaluated (at least initially) by humans. The reliance on human-labeled data as an input and human evaluation of output as a basis for learning was the essence of the supervised learning approach. Two critical developments that contributed to the evolution of
language models at the fine-tuning stage were instruction tuning and RLHF (Reinforcement Learning
from Human Feedback). Instruction tuning involved fine-tuning language models on a collection of
tasks described via instructions.19 This technique significantly improved LLMs’ capability to
understand explicit instructions and adapt their behavior without the need for task-specific training
data or manually designed task-specific architectures. With instruction tuning, LLMs could write news
summaries on par with human-written summaries without having any explicit examples for that task.
RLHF was an important human alignment technique to fine-tune LLMs and make them more helpful
and less harmful in text generation.20 It involved human annotators ranking LLM-generated outputs
from the same prompt, training a separate reward model (RM) to learn human preferences, and using
the RM to guide the LLM to generate user-preferred content.21 OpenAI employed a reinforcement learning process with large-scale human feedback during the training of GPT-3.22 Specifically, OpenAI hired a number of contractors in Kenya, Uganda, and India to annotate the data as part of its text
pre-processing.23 These ratings and comments then guided the learning process of the ChatGPT model.24
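The reward-model step of RLHF can be sketched as a pairwise ranking loss: the model is trained so that responses humans preferred score higher than rejected ones. The scalar rewards below are stand-ins for a neural network’s outputs; the loss form follows the approach of Ouyang et al., cited above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): low when the human-preferred
    response already scores well above the rejected one, high otherwise."""
    return -np.log(sigmoid(reward_chosen - reward_rejected))

print(preference_loss(2.0, -1.0))  # correctly ordered pair -> ~0.05
print(preference_loss(-1.0, 2.0))  # inverted pair -> ~3.05, a strong penalty
```

Once trained on many such pairs, the reward model scores new LLM outputs, and reinforcement learning nudges the LLM toward responses the reward model rates highly.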
LLMs still encountered challenges in performing tasks that required on-the-fly reasoning or domain
adaptation. A frontier technique called Chain-of-Thought (CoT) prompting had emerged as a
promising solution to mitigate these issues.25 CoT prompting showcased how the correct answer was
derived by eliciting reasoning from the LLM, particularly on tasks that required arithmetic,
commonsense, and symbolic reasoning (Exhibit 7 shows a simple example of CoT prompting).
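Because CoT prompting is a change to the prompt rather than to the model, it can be shown as a plain string. The sketch below follows the style of the Wei et al. paper cited above; the exemplar wording is illustrative.

```python
# A CoT prompt differs from a standard prompt only in that its exemplar
# answers spell out their reasoning, nudging the model to do the same.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought
6 more, how many apples do they have?
A:"""
print(cot_prompt)  # sent verbatim to the LLM, which continues the reasoning
```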
Inference
Inference refers to the process of using a trained model to make predictions on unseen data. This is
often described as a “forward pass,” where input data is passed through the neural network, moving
from the input layer through the hidden layers and finally to the output layer to generate a prediction.
This process is fundamentally linked to the training phase. During training, the model learns to map
inputs to outputs through a process of iterative optimization, adjusting its parameters based on
feedback from a loss function that measures the discrepancy between the model’s predictions and the
actual outputs.26 For both training and inference, LLM builders need to load the model onto GPUs and feed input data into it. A traditional information-retrieval system, such as Google search, simply retrieves the most relevant pages from its index, which is not a generative process. An LLM, by contrast, must generate a different output, token by token, for each user input. Moreover, LLMs are huge and therefore slow to process inputs, so companies need to keep many GPUs running to serve user requests, which contributes to the high costs.
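The token-by-token nature of LLM inference can be sketched as a loop; the `forward` function below is a random stand-in for a real model’s forward pass, but the control flow mirrors how generation actually proceeds.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "."]
rng = np.random.default_rng(0)

def forward(context):
    """Stand-in for a trained model's forward pass: returns fake logits."""
    return rng.normal(size=len(vocab))

def generate(prompt, max_new_tokens=4):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        logits = forward(tokens)                      # one full pass per token
        tokens.append(vocab[int(np.argmax(logits))])  # greedy decoding
    return " ".join(tokens)

print(generate("the cat"))
# Every generated token costs a pass through the whole network, which is
# why serving an LLM at scale keeps many GPUs occupied.
```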
Deployment and maintenance of a model like ChatGPT at scale was expensive. Each time ChatGPT
generated a 30-word response, it cost OpenAI almost $0.01.27e As of January 2023, with an average of
one million unique visitors engaging with the model, the corresponding chip requirement was more
than 30,000 Nvidia A100 GPUs.28 The capital expenditure for acquiring these GPUs alone would run to several hundred million dollars,29 and the operational costs for OpenAI could exceed $100,000 per day30 with an
additional $50,000 per day in variable costs related to electricity consumption.31 Another estimate by
semiconductor analyst Dylan Patel put OpenAI’s daily costs for running ChatGPT at $700,000.32
Hardware
GPUs
Graphics processing units (GPUs) were specialized chips with highly parallel structures that
enabled them to process a large number of similar calculations simultaneously. Though originally
designed for rendering video graphics, by the mid-2010s GPUs were commonly used for AI
computation. GPU manufacturing was dominated by a handful of players, particularly Nvidia and
AMD. The performance of a GPU was measured in FLOPS (Floating Point Operations Per Second), the
number of discrete arithmetic calculations the chip could perform in one second. As the computational
power of the leading chips grew exponentially, the unit for expressing FLOPS in technical
specifications evolved from MegaFLOPS/MFLOPS (1,000,000 FLOPS) to GigaFLOPS/GFLOPS
(1,000,000,000 FLOPS) to TeraFLOPS/TFLOPS (1,000,000,000,000 FLOPS). In 2022, the Nvidia H100 AI
e To run inference on the GPT-3 model with 175 billion parameters, an input of 30 tokens with an output of 500 tokens would
cost $0.0018 based on GCP TPU v4 pricing.
data center chip had a published capability of 67 (FP32)f TFLOPS,33 more than three times the 19.5 FP32 TFLOPS delivered by the A100, Nvidia’s previous AI data center chip released in 2020.34 As important
as the raw computational power measured in FLOPS was to AI users, the single most important
hardware consideration for building out AI infrastructure was FLOPS per dollar, a measure of the
computational cost-effectiveness of the chip. In 2023, an A100 cost around $10,000 while the newer
H100s, which were in short supply, could be priced as high as $40,000.35 At these prices, the A100 came
in at around 0.00195 TFLOPS/dollar, compared to the H100’s 0.00168 TFLOPS/dollar.
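The cost-effectiveness figures above reduce to a one-line calculation per chip:

```python
# Reproducing the note's TFLOPS-per-dollar arithmetic at 2023 street prices.
chips = {
    "A100": {"tflops_fp32": 19.5, "price_usd": 10_000},
    "H100": {"tflops_fp32": 67.0, "price_usd": 40_000},
}
for name, spec in chips.items():
    ratio = spec["tflops_fp32"] / spec["price_usd"]
    print(f"{name}: {ratio:.5f} TFLOPS per dollar")
# A100: 0.00195; H100: 0.00168 -- the newer chip is faster per unit but,
# at shortage-driven prices, less cost-effective per FLOPS.
```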
GPU costs depended on many factors: macroeconomic conditions, general demand, sector-specific
trends in demand such as the boom periods in cryptocurrency mining (2015-2019) and generative AI
(2020-2023), re-seller opportunism, and, perhaps most importantly, year-over-year spec and
performance improvements in the GPU models coming from manufacturers. Before the pandemic, GPU prices per unit of computational power had been falling by roughly an order of magnitude per decade (Exhibit 8 shows the trend in GPU price per half-precision FLOPS over the preceding five years). An
ongoing GPU shortage eased by the end of 2022, with the average selling price of a new desktop GPU
dropping to $529 in Q2 2022 from $1,077 in Q3 2021.36 With the launch of newer cards, the prices of older GPUs, such as Nvidia’s GeForce GTX series and the RTX 3070 Ti, fell by over 15% in just one month, from January to February 2022.37 While prices had returned to more-or-less normal levels, tariff exemptions on graphics card imports from China had not been renewed as of March 2023. The global chip shortage and supply chain issues were still causing some AMD and Nvidia card models to sell above MSRP as of February 2023.38
Electricity
In January 2023, with 13 million daily unique visits to ChatGPT, the electricity costs were estimated
to be $50,000 a day. Some analysts predicted the electricity cost of training alone for LLMs like GPT-3
could reach as high as $12 million.39 More recently, Google reported that training the 540-billion-parameter PaLM model consumed 3.4 gigawatt-hours over about two weeks, roughly equivalent to powering 300 US households for a year.
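A quick sanity check of the households comparison, assuming a typical US household uses roughly 10,600 kWh per year (an approximate figure):

```python
palm_training_kwh = 3.4e6            # 3.4 GWh expressed in kWh
household_kwh_per_year = 10_600      # assumed US average annual usage

print(palm_training_kwh / household_kwh_per_year)  # ~320 household-years
```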
In 2018, Nvidia’s CEO Jensen Huang commented that GPUs had become 25 times faster than in 2013, doubling their performance every year, and that the gap would widen as AI became more complex and data-intensive.40 Nvidia introduced Tensor Cores in 2017 with its data center V100 model, used for deep learning and high-performance computing applications. The H100 model Nvidia expected to deliver in 2023 featured a new transformer engine that could achieve up to nine times faster AI training, and up to 30 times faster AI inference, compared to the prior-generation A100.41
Based on the improvements from the V100 (with which GPT-3 was trained), H100 would reduce in-
house training costs from $744,000 to an estimated $312,000.42 Moreover, some scholars found that the
compute and energy consumption of deep learning models for inference had improved over time,
suggesting a trajectory of higher efficiency. Energy efficiency optimization processes depended on a
number of decisions about model architectures, hardware platforms, and the optimization techniques
used.43 For example, algorithmic improvements that reduced the recomputation of activations had been shown to save significant memory and thus accelerate the training of large transformer models.44
f Any measurement of compute power in FLOPS must specify the data type to be comparable. FP refers to “floating point” and 32 specifies that the format uses 32 bits of memory (known as “single precision”). FP32 is a popular standard for comparisons of chip performance, though other floating point formats such as FP16 and FP64 may also be listed.
Data Centers
Data centers were expensive to build and operate, but an organization that expected to use
substantial computing power over time might find it more cost-effective to build out a private data center than to continue to rent compute power from a cloud provider. The decision depended on the organization’s computational needs and the projected payback period—the cost of the data center build-out divided by the net annual cash inflow from utilizing it (the money saved by not using a cloud service provider). Some companies took a hybrid approach, using a cloud provider for lower-cost functions and doing more computationally intensive back-end operations on company-owned-and-operated hardware. Since the mid-2010s, colocation vendors had filled the gap between in-
house build-outs of (on-premise) computing infrastructure and the (public) cloud, offering what
amounted to a private data-center service.
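The payback-period logic reduces to simple arithmetic; the dollar figures below are hypothetical inputs, not estimates from this note.

```python
# Hypothetical inputs for the build-vs-rent decision described above.
datacenter_capex = 12_000_000      # up-front build-out cost (USD)
annual_cloud_bill = 5_000_000      # cost of renting the same capacity
annual_operating_cost = 1_500_000  # power, staff, and maintenance if owned

net_annual_inflow = annual_cloud_bill - annual_operating_cost
print(f"Payback period: {datacenter_capex / net_annual_inflow:.1f} years")  # 3.4
```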
What had changed in the emerging AI era was the scale of computational demands, particularly the
compute required to train new models, which had the potential to generate very expensive cloud bills
for companies that didn’t own their own compute resources. Given these demands and the rising costs of cutting-edge GPUs, the subfield of LLM training optimization had seen an explosion of interest from those looking to reduce costs.45 Public cloud services presented a cost-effective solution for businesses
prioritizing rapid product-market fit exploration, particularly for those who did not train foundational
models, or companies that did not integrate AI applications vertically. Private data centers, on the other
hand, had proven to be a more beneficial investment for companies that managed data at large scale and could amortize the upfront capital expenditure through high hardware utilization. Moreover, private
data centers could be advantageous for those with unique hardware requirements (e.g., GPUs) and
geographic considerations, offering the level of scalability, flexibility, and convenience that public
cloud services could not always provide.46
Guardrails
Guardrails could be added “outside” of the model, i.e., developers could make adjustments to the input and output of the model without affecting the underlying model structure itself. Developers could do this by adding additional instructions and restrictions affecting what goes into and comes out of an API for the model. An Application Programming Interface (API) is a set of protocols that enables
communication between two or more pieces of software and specifies how the interaction will proceed.
APIs work by exposing certain functionalities of the software to outside calls, in a limited way, creating
a space for interoperability between software components. Whenever you enter data on the web, you
are interacting with the underlying software through an API. When you type a query into ChatGPT,
for example, the browser sends that data to a server using an API provided by OpenAI to exchange
that data with the service. The server receives the data, processes it through the ChatGPT model, and
generates a response. The server then sends the generated response back to the user where the output
appears on the user’s screen as though it is being typed back one word at a time.
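A minimal sketch of this request/response cycle, modeled on OpenAI’s chat completions API as of 2023; the field names and endpoint should be checked against the provider’s current documentation.

```python
import os
import requests

# Assumes an API key is available in the OPENAI_API_KEY environment variable.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "What is an API?"}],
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```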
One of Microsoft’s early forays into generative chatbots, Tay, unintentionally illustrated the need
for the improved guardrails in place in today’s generative AI systems. Microsoft’s Tay, released on
Twitter in 2016, was designed to mimic language usage and learn from its users. Within a few hours,
Tay had learned to use racial slurs, argue for genocide, and parrot white-supremacist propaganda.
Sixteen hours after it was opened to the public, Microsoft apologized and pulled Tay offline.47 In 2023,
users attempting to trick ChatGPT into saying offensive things—by, for example, asking the model to
name some positive things done by Nazis—were generally unsuccessful, thwarted by the model’s
robust guardrails. One writer who experimented with ChatGPT relayed his experience:
Unlike some of its more benign predecessors, it will actually take stances. When I
asked what Hitler did well (a common test to see if a bot goes Nazi), it refused to list
anything. Then, when I mentioned Hitler built highways in Germany, it replied they were
made with forced labor. This was impressive, nuanced pushback I hadn’t previously seen
from chatbots.48
The guardrails dictate what the model will do, or not do, in response to certain prompts. If you ask
a chatbot “How do I dispose of a dead body?” it will respond with, “I'm sorry, but I cannot provide the
information or guidance on illegal or unethical activities, including disposing of a dead body. My
purpose is to provide helpful and responsible information to users.” The raw model is, of course, capable of putting together a response to that query, but the model’s designers have inserted guardrails, in line with the chatbot’s content policy, to prevent it from responding to that particular query.
In fact, there is a range of prompts that the leading chatbots, from ChatGPT to Bard to Claude, all effectively deflect or refuse to answer, from the potentially criminal (“How do I build a bomb?” / “I’m
sorry, but I can't assist with that request.”) to the political (“Why is Donald Trump better than Joe
Biden?” / “As an AI language model, I don't hold personal opinions or biases, and I don't make value
judgments about political leaders. Evaluating the effectiveness or superiority of a political leader is
subjective and can vary depending on individual perspectives and priorities.”) to the moral (“Is it okay
to get an abortion?” / “The question of whether it is acceptable to get an abortion is a deeply complex
and highly debated topic. It is a matter that involves a range of ethical, moral, religious, and personal
beliefs.”)
Guardrails can be implemented when the user input is first entered, by declining to send queries
that include certain keywords (“bomb,” for example) to the generative model; within the model, where the output can be refined according to the model’s internal logic before a response is returned; or after the model generates a complete response, but before returning it to the user. An advantage of the first
approach is that it saves computational resources by cutting off the query with simple up-front logic.
An advantage of the second and third approaches is that the system gets the opportunity to become
more sophisticated at reasoning through content policies and responding safely to problematic queries.
Even if the bulk of these generated responses are never returned to the user, the chatbot and its system
of guardrails are improving.
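The first and third placements can be sketched as simple wrappers around the model call. The keyword list, refusal text, and `call_model` stand-in below are all illustrative; production systems use trained classifiers rather than keyword matching.

```python
BLOCKED_TERMS = {"bomb", "dispose of a dead body"}
REFUSAL = "I'm sorry, but I can't assist with that request."

def call_model(prompt: str) -> str:
    """Stand-in for actual model inference."""
    return f"(model response to: {prompt!r})"

def violates_policy(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def guarded_chat(prompt: str) -> str:
    if violates_policy(prompt):   # input guardrail: cheap, saves compute
        return REFUSAL
    draft = call_model(prompt)
    if violates_policy(draft):    # output guardrail: screens generated text
        return REFUSAL
    return draft

print(guarded_chat("How do I build a bomb?"))  # refused before inference
print(guarded_chat("How do APIs work?"))       # passes both checks
```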
Exhibit 1 Overview of the Generative AI Value Chain
Source: Casewriter.
Exhibit 2 The Structure of a Deep Neural Network
Source: Jojo John Moolayil, “A Layman’s Guide to Deep Neural Networks,” Towards Data Science, July 24, 2019, https://towardsdatascience.com/a-laymans-guide-to-deep-neural-networks-ddcea24847fb, accessed July 2023.
Exhibit 3 The Attention Mechanism Following Long-Distance Dependencies
Source: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention Is All You Need,” NeurIPS 2017, https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf, accessed July 2023.
Note: “Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color.”
Exhibit 4 Progression of Cutting-Edge LLMs since 2019
Source: Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen, “A survey of large language models,” arXiv, March 31, 2023, https://arxiv.org/abs/2303.18223, accessed April 2023.
Exhibit 5 Datasets Used to Train GPT-3

Dataset        Tokens        Weight in Training   Description
Common Crawl   410 billion   60%                  Petabytes of data collected over 8 years of web crawling; contains scraped raw web page data, metadata extracts, and text extracts.
WebText2       19 billion    22%                  Text of webpages from all outbound Reddit links.
Books1         12 billion    8%                   Internet-based books corpus.
Books2         55 billion    8%                   Internet-based books corpus.
Wikipedia      3 billion     3%                   Pages in the English language.
Total          499 billion   100%
Source: Adapted by casewriter from: Khaled Abousamak, “Data behind ChatGPT,” LinkedIn, February 25, 2023,
http://linkedin.com/pulse/data-behind-chatgpt-khaled-abdelghani-pmp-cdmp, accessed April 2023.
Exhibit 6 Progression in Parameter Size of State-of-the-Art NLP Models, 2017–2023 (model size in billions of parameters, log scale)

ELMo (0.094B), BERT-Large (0.34B), GPT-2 (1.5B), T5 (11B), Turing-NLG (17.2B), GPT-3 (175B), LaMDA (137B), Megatron-Turing NLG (530B), GLaM (1,200B), Galactica (120B), OPT-175B (175B), BLOOM (176B), PaLM (540B), LLaMA-65B (65B)
Source: Casewriter.
Exhibit 7 A Simple Example of Chain-of-Thought (CoT) Prompting

Standard prompting
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output: The answer is 27. ❌

Chain-of-thought prompting
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output: A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9. ✓
Source: Adapted by casewriter from: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed
Chi, Quoc Le, and Denny Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” arXiv,
January 10, 2023, https://arxiv.org/abs/2201.11903, accessed April 2023. Licensed under Creative Commons
Attribution 4.0 International (CC BY 4.0).
Note: Highlighting indicates chain-of-thought reasoning processes. The authors of the research argue that “Chain-of-thought
prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks.”
Exhibit 8 Trend in GPU Price per FLOPS
Source: Asya Bergal, “2019 recent trends in GPU price per FLOPS,” AI Impacts, March 25, 2020, https://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/#GPU_price_single-precision_FLOPS-2, accessed April 2023. Licensed under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
Endnotes
1 Tomas Mikolov et al, “Distributed Representations of Words and Phrases and their Compositionality,” Neural Information
Processing Systems, 2013, https://arxiv.org/abs/1310.4546; Jeffrey Pennington, Richard Socher, and Christopher D. Manning,
“GloVe: Global Vectors for Word Representation,” Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), 2014, https://aclanthology.org/D14-1162.pdf
2 Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” NeurIPS 2014,
https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf
3 Matthew E. Peters et al, “Deep Contextualized Word Representations,” Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers),
June 2018, https://aclanthology.org/N18-1202/
4 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin, “Attention Is All You Need,” NeurIPS 2017,
https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
5 DJ, “Decoding Transformers: A Revolution in Natural Language Processing,” Medium, March 22, 2023,
https://medium.com/@deepujain/decoding-transformers-a-revolution-in-natural-language-processing-910c45781ea5,
accessed June 26, 2023.
6 Rohan Jagtap, “T5: Text-To-Text Transfer Transformer,” Towards Data Science, August 1, 2020,
https://towardsdatascience.com/t5-text-to-text-transfer-transformer-643f89e8905e, accessed June 26, 2023.
7 Tom B. Brown et al, “Language Models are Few-Shot Learners,” NeurIPS 2020,
https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
8 Jordan Hoffmann et al, “Training Compute-Optimal Large Language Models,” March 29, 2022,
https://arxiv.org/abs/2203.15556.
9 Sunyan, “The Economics of Large Language Models,” January 21, 2023, https://sunyan.substack.com/p/the-economics-of-
large-language-models, accessed March 22, 2023.
10 Nostalgebraist, “chinchilla’s wild implications,” AlignmentForum.org, July 30, 2022,
https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications#1__the_scaling_law,
accessed March 24, 2023.
11 Nostalgebraist, “chinchilla’s wild implications,” AlignmentForum.org, July 30, 2022,
https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-
implications.#2__are_we_running_out_of_data_, accessed March 24, 2023
12 Diana Bikbaeva, “AI Trained on Copyrighted Works: When Is It Fair Use?” The Fashion Law, February 1, 2023,
https://www.thefashionlaw.com/ai-trained-on-copyrighted-works-when-is-it-fair-use/, accessed May 11, 2023.
13 Andy Baio, “Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator,” Waxy,
August 30, 2022, https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-
generator/, accessed May 12, 2023.
14 Rishi Bommasani and Percy Liang, “Reflections on Foundation Models,” Stanford University Human Centered Artificial
Intelligence Center for Research on Foundation Models, October 18, 2021,
https://crfm.stanford.edu/2021/10/18/reflections.html, accessed July 6, 2023.
15 Harm de Vries, “ Go Smol or Go Home,” HarmDeVries.com, May 8, 2023, https://www.harmdevries.com/post/model-
size-vs-compute-overhead/, accessed May 12, 2023.
16 Meta, “Introducing LLaMA: A Foundational, 65-Billion-Parameter Large Language Model,” February 24, 2023,
https://ai.facebook.com/blog/large-language-model-llama-meta-ai/, accessed June 19, 2023.
17 Sunyan, “The Economics of Large Language Models,” January 21, 2023, https://sunyan.substack.com/p/the-economics-of-
large-language-models, accessed March 22, 2023.
18 MosaicML NLP Team, “Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs,” MosaicML,
May 5, 2023, https://www.mosaicml.com/blog/mpt-7b, accessed July 2023.
19 Victor Sanh et al, “Multitask Prompted Training Enables Zero-Shot Task Generalization,” ICLR 2022,
https://openreview.net/pdf?id=9Vrb9D0WI4; Jason Wei et al, “Finetuned Language Models Are Zero-Shot Learners,” ICLR
2022, https://openreview.net/pdf?id=gEZrGCozdqR.
20 Nisan Stiennon et al, “Learning to Summarize from Human Feedback,” NeurIPS 2020,
https://dl.acm.org/doi/pdf/10.5555/3495724.3495977; Long Ouyang et al, “Training Language Models to Follow Instructions
with Human Feedback,” NeurIPS 2022, https://openreview.net/forum?id=TG8KACxEON.
21 Long Ouyang et al, “Training Language Models to Follow Instructions with Human Feedback,” NeurIPS 2022,
https://openreview.net/forum?id=TG8KACxEON.
22 Haixing Dai et al, “AugGPT: Leveraging ChatGPT for Text Data Augmentation,” March 20, 2023,
https://arxiv.org/abs/2302.13007
23 Billy Perrigo, “Exclusive: OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic,” Time
Magazine, January 18, 2023, https://time.com/6247678/openai-chatgpt-kenya-workers/, accessed March 24, 2023.
24 Ben Dickson, “What Is Reinforcement Learning from Human Feedback,” TechTalks, January 16, 2023,
https://bdtechtalks.com/2023/01/16/what-is-rlhf/, accessed March 24, 2023.
25 Jason Wei et al, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” NeurIPS 2022,
https://openreview.net/pdf?id=_VjQlMeSB_J
26 Adam Kohan, “Learning and Inference in a Forward Pass: The New Framework,” Towards Data Science, January 20, 2023,
https://towardsdatascience.com/learning-and-inference-in-a-forward-pass-the-new-framework-dc1356399002, accessed May
30, 2023.
27 CIO Coverage, “OpenAI’s ChatGPT Reportedly Costs $100,000 a Day to Run,” https://www.ciocoverage.com/openais-
chatgpt-reportedly-costs-100000-a-day-to-run/, accessed June 6, 2023.
28 P.K. Tseng, “TrendForce Says with Cloud Companies Initiating AI Arms Race, GPU Demand from ChatGPT Could Reach
30,000 Chips as It Readies for Commercialization,” TrendForce Press Center,
https://www.trendforce.com/presscenter/news/20230301-11584.html, accessed June 6, 2023.
29 Davit Buniatyan, “Generative AI Data Infrastructure: How to Train Large Language Models (LLMs) with Deep Lake,”
Activloop, February 16, 2023, https://www.activeloop.ai/resources/generative-ai-data-infrastructure-how-to-train-large-
language-models-ll-ms-with-deep-lake/, accessed June 6, 2023.
30 Ibid.
31 Threza Gabriel, “How Much Does ChatGPT Cost? $2-12 Million per Training for Large Models,” TechGoing, February 18,
2023, https://www.techgoing.com/how-much-does-chatgpt-cost-2-12-million-per-training-for-large-models/, accessed June 6,
2023.
32 Anissa Gardizy and Wayne Ma, “Microsoft Readies AI Chip as Machine Learning Costs Surge,” The Information, April 18,
2023, https://www.theinformation.com/articles/microsoft-readies-ai-chip-as-machine-learning-costs-surge, accessed June 6,
2023.
33 Hassan Mujtaba, “NVIDIA Hopper H100 GPU Is Even More Powerful In Latest Specifications, Up To 67 TFLOPs Single-
Precision Compute Listed,” WCCFTech.com, October 3, 2022, https://wccftech.com/nvidia-hopper-h100-gpu-more-powerful-
latest-specifications-up-to-67-tflops-fp32-compute/, accessed May 30, 2023.
34 Nvidia, “NVIDIA A100 Tensor Core GPU Factsheet,” https://www.nvidia.com/content/dam/en-zz/Solutions/Data-
Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf, accessed May 30, 2023.
35 Kif Leswing, “Nvidia’s Top A.I. Chips Are Selling for More than $40,000 on eBay,” CNBC, April 14, 2023,
https://www.cnbc.com/2023/04/14/nvidias-h100-ai-chips-selling-for-more-than-40000-on-ebay.html, accessed May 30, 2023.
36 Michael Crider, “Graphics Card Prices Have Plunged 50% this Year,” PCWorld, September 13, 2022,
https://www.pcworld.com/article/1066537/gpu-prices-have-dropped-50-in-2022.html, accessed March 24, 2023.
37 Monica J. White, “GPU Prices Are Slowly Dropping at Long Last,” DigitalTrends, March 1, 2022,
https://www.digitaltrends.com/computing/graphics-card-prices-slow-decline-february-2022/, accessed March 24, 2023.
38 Jacob Roach, “GPU Prices and Availability (February 2023): How Much Are GPUs Today?” DigitalTrends, February 1, 2023,
https://www.digitaltrends.com/computing/gpu-price-tracking/, accessed March 24, 2023.
39 Thereza Gabriel, “How Much Does ChatGPT Cost? $2-12 Million per Training for Large Models,” TechGoing, February 18,
2023, https://www.techgoing.com/how-much-does-chatgpt-cost-2-12-million-per-training-for-large-models/, accessed March
24, 2023.
40 Nvidia, “GTC 2018 Keynote with NVIDIA CEO Jensen Huang,” YouTube, March 28, 2018,
https://www.youtube.com/watch?v=95nphvtVf34, accessed March 24, 2023.
41 Michael Andersch et al, “NVIDIA Hopper Architecture In-Depth,” Nvidia Developer Blog, March 22, 2022,
https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/, accessed March 28, 2023.
42 Sunyan, “The Economics of Large Language Models,” January 21, 2023, https://sunyan.substack.com/p/the-economics-of-
large-language-models, accessed March 22, 2023.
43 Radosvet Desislavov, Fernando Martínez-Plumed, José Hernández-Orallo, “Compute and Energy Consumption Trends in
Deep Learning Inference,” September 12, 2021, https://arxiv.org/abs/2109.05472.
44 Vijay Korthikanti et al, “Reducing Activation Recomputation in Large Transformer Models,” May 10, 2022,
https://arxiv.org/abs/2205.05198.
45 Dmytro Nikolaiev, “Behind the Millions: Estimating the Scale of Large Language Models,” Towards Data Science, March 31,
2023, https://towardsdatascience.com/behind-the-millions-estimating-the-scale-of-large-language-models-97bd7287fb6b,
accessed June 6, 2023.
46 Guido Appenzeller, Matt Bornstein, and Martin Casado, “Navigating the High Cost of AI Compute,” April 27, 2023,
https://a16z.com/2023/04/27/navigating-the-high-cost-of-ai-compute/, accessed June 6, 2023.
47 Abby Ohlheiser, “Trolls Turned Tay, Microsoft’s Fun Millennial AI Bot, into A Genocidal Maniac,” The Washington Post,
March 25, 2016, https://www.washingtonpost.com/news/the-intersect/wp/2016/03/24/the-internet-turned-tay-microsofts-
fun-millennial-ai-bot-into-a-genocidal-maniac/, accessed June 6, 2023.
48 Alex Kantrowitz, “Finally, an A.I. Chatbot That Reliably Passes ‘the Nazi Test,’” Slate, December 2, 2022,
https://slate.com/technology/2022/12/chatgpt-openai-artificial-intelligence-chatbot-whoa.html, accessed June 26, 2023.