LLMs in Production: Essential Speaker Summaries


Contents

1. Introduction
2. Chapter 1: Ideal Large Language Model stack
3. Chapter 2: Challenges of deploying large language models
4. Chapter 3: Turning unstructured data to structured data
5. Chapter 4: Approaches to deploying LLMs & building Infra
6. Chapter 5: Putting an end to LLM hallucinations
7. Chapter 6: Building and curating data sets for RLHF and fine-tuning
8. Chapter 7: Large Language Models for recommendation systems
9. Chapter 8: Understanding economics of Large Language Models
10. Chapter 9: The confidence checklist for LLMs in Production
11. Chapter 10: Creating a contextual Chatbot with LLMs
12. About TrueFoundry

Key Takeaways:

1. Logical data models help you build AI-first systems and leverage the opportunity presented by unstructured data.
2. Fine-tuning a 175 Bn parameter DaVinci model to summarize Wikipedia would cost $180K, whereas fine-tuning a self-hosted 7 Bn parameter model costs only $1.4K. This shows that in some use cases smaller fine-tuned models are more economical.
3. RL can be used to shape LLM exploration using a method called ELLM (Exploring with LLMs), which rewards an agent for achieving goals suggested by a language model prompted with a description of the agent's current state.
4. RLHF is a great approach to training LLMs. Its three key advantages: it avoids offensive behaviour, provides accurate information, and gives informed feedback to the LLM.

Introduction
LLMs in Production: Essential Speaker Summaries has been
authored by TrueFoundry in partnership with the MLOps
community and Greg Coquillo.

This eBook is a summary of some of the most notable talks from the MLOps community's LLMs in Production Conference held in San Francisco.

In the eBook we cover the ideal LLM stack, challenges of deploying LLMs, different approaches to deploying LLMs, solving hallucinations in LLMs, and much more!

Explore the knowledge and insights shared by over 50 industry leaders from renowned companies such as Stripe, Meta, Canva, Databricks, Anthropic, Cohere, Redis, LangChain, Chroma, Humanloop, and many more.

Chapter 1: Ideal Large Language Model stack
Here we discuss the ideal LLM stack and detail how each component of that stack plays a crucial role in productionizing LLMs.

This chapter is a summary of the presentation titled "LLMs for the rest of us" by Joseph Gonzalez, Professor at UC Berkeley and Co-founder & VP Product at Aqueduct.

We explore the following five critical elements of an LLM stack:

• Python code
• Python libraries
• Vector databases
• Large Language Models (LLMs)
• LLMOps platform

[Diagram: the ideal LLM stack, wrapped by cost & resource management; validation, tracking & logging; and data access & governance.]

Components of an Ideal LLM stack
1. Python Code: Python facilitates the implementation, management, and deployment of LLMs in production environments. Some of its functions are as follows:

a. Python frameworks such as Flask, Django, or FastAPI can be used to build robust and
scalable RESTful APIs or web applications for serving LLM predictions and interacting with
end-users.

b. Python enables containerization and orchestration using frameworks like Docker and Kubernetes, making it easier to manage and deploy LLMs.
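For illustration, here is a minimal sketch of point (a): serving model output behind a FastAPI endpoint. The `generate()` function is a placeholder for whatever model client or inference pipeline you actually serve; it is not part of any specific library.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    text: str

def generate(prompt: str) -> str:
    # Placeholder for a real LLM call (local model, hosted API, etc.).
    return f"(model output for: {prompt})"

@app.post("/complete")
def complete(prompt: Prompt) -> dict:
    """Expose the LLM behind a small REST endpoint."""
    return {"completion": generate(prompt.text)}

# Run with: uvicorn app:app --port 8000
```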

2. Python Libraries

Some must-have frameworks and libraries for large language modelling include:

a. LlamaIndex: LlamaIndex provides a simple interface between LLMs and external data
sources. LlamaIndex is an "on-demand" data query tool that allows you to use any data
loader within the LlamaIndex core repository or in LlamaHub. It also enables you to insert
arbitrary amounts of conversation memory into your LangChain agent, allowing for more
natural conversations with your LLM.

b. LangChain: LangChain is a framework for building and managing complex conversations with LLMs. It allows you to leverage multiple instances of ChatGPT, provide them with memory, and even combine them with multiple instances of LlamaIndex. This makes it easier to create conversational agents that can understand context and respond naturally to user input.

These libraries, along with others like TensorFlow and Keras for deep learning, enable developers to build powerful LLM applications with ease.

3. Vector Databases (DBs): A vector DB is purpose-built to store, index, & query large
quantities of vector embeddings. In LLM applications, data is often represented as vectors
in a high-dimensional space. Hence, Vector DBs enable quick retrieval of relevant data
points based on similarity or distance metrics, which is essential for tasks like
recommendation systems, image recognition, and natural language processing.

• Some popular vector DBs for LLM applications are Milvus, Qdrant, and Pinecone
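To make the idea concrete, here is a toy sketch of the core operation a vector DB performs: nearest-neighbour search over embeddings. The random vectors stand in for real document embeddings; a real deployment would use one of the databases above, which add indexing, filtering, and persistence.

```python
import numpy as np

# Pretend these are 384-dimensional document embeddings.
doc_vectors = np.random.rand(1000, 384)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def top_k(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most similar to the query vector."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = doc_vectors @ query_vec      # cosine similarity (vectors are normalized)
    return np.argsort(-scores)[:k]

print(top_k(np.random.rand(384)))
```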

5
4. Large Language Models (LLMs): Pre-trained LLMs, such as GPT-3 and BERT, are the backbone of any LLM stack. These models, trained on vast amounts of text data, can generate human-like text and understand context, making them invaluable for tasks like sentiment analysis, summarization, and chatbot development.

• LLMs allow developers to leverage transfer learning, where a pre-trained model is fine-tuned
on a specific task or dataset, significantly reducing training time and computational
resources. This approach has led to state-of-the-art results in various NLP tasks.
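As a hedged illustration of transfer learning, the sketch below fine-tunes a small pre-trained model on a sentiment task with the Hugging Face transformers and datasets libraries. The model, dataset slice, and hyperparameters are arbitrary choices for the example, not recommendations from the talk.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"        # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Small IMDB slice just to keep the example quick.
dataset = load_dataset("imdb", split="train[:2000]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()   # only the downstream head and fine-tuned weights change, not a from-scratch run
```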

5. LLMOps Platform: An LLMOps platform is a unified platform for managing and deploying
LLMs.

• It simplifies the process of building, training, and deploying LLMs by providing a set of tools
and services such as model versioning, automated deployment, and monitoring. This allows
developers to focus on developing their applications instead of worrying about the
underlying infrastructure and cloud costs.

• Additionally, an LLMOps platform also helps in reducing costs associated with running large
language models by providing efficient resource utilization and scalability.

To learn about emerging LLM architectures visit here

To learn more about LLMs, download our eBook by signing up here

Chapter 2: Challenges of deploying Large Language Models

Though revolutionary, deploying LLMs has its own set of challenges.

For example, GPT-3, one of the most powerful LLMs today, is 175 billion parameters in size & requires an estimated 4.6 exaFLOPS of compute power to run.

Inspired by the presentation given by Oscar Rovira, Co-founder of Mystic.ai, on the considerations when deploying LLMs, here we have put together the most pressing challenges that practitioners can encounter when deploying LLMs.

Here are some key points highlighting the challenges faced when deploying LLMs:

1. Computational Resources: Storing and managing the large size of LLMs can be
challenging, especially in resource-constrained environments or edge devices. This requires
developers to find ways to compress the models or use techniques like model distillation to
create smaller, more efficient variants.
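One hedged example of such compression is weight quantization. The sketch below loads a causal LM in 8-bit using the transformers library with bitsandbytes; the model name is a placeholder, and distillation (mentioned above) would instead train a smaller student model on the larger model's outputs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigscience/bloom-7b1"   # placeholder; any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load weights in 8-bit to cut memory roughly 4x vs fp32 (requires a GPU and bitsandbytes).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```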

2. Model Fine-tuning: Pre-trained LLMs often need fine-tuning on specific tasks or datasets to
achieve optimal performance. This process can be computationally expensive. For example,
fine-tuning a 175 Bn parameter DaVinci model would cost $ 180K.

3. Ethical Concerns: LLMs can sometimes generate inappropriate or biased content due to
the nature of the data they are trained on. This raises concerns about the ethical
implications of deploying such models and the potential harm they might cause.

4. Hallucinations: Hallucination is a phenomenon in which, when users ask questions or provide prompts, the LLM produces responses that are imaginative or creative but not grounded in reality. These responses may appear plausible and coherent but are not based on actual knowledge.

5. Interpretability and Explainability: Understanding the internal workings of LLMs and how
they make decisions can be difficult due to their complexity. This lack of interpretability
poses challenges for developers who need to debug, optimize, and ensure the reliability of
these models in real-world applications.

6. Latency and Inference Time: As LLMs have a large number of parameters, they can be
slow to generate predictions, particularly on devices with limited computational resources.
This can be a challenge when deploying LLMs in real-time applications where low latency is
essential.

7. Environmental Impact: The significant computational resources required for training LLMs
also contribute to increased energy consumption and carbon emissions. For example, the
carbon footprint of ChatGPT is roughly equivalent to the annual emissions of about 978
cars. This raises concerns about the environmental impact of deploying LLMs.

For solutions to your LLM deployment challenges read this blog

Chapter 3: Turning unstructured data into structured data
This chapter is a summary of “Bringing Structure to
unstructured data with an AI-First system design” by Will
Gaviria Rojas, co-founder of Coactive AI.

AI has fundamentally changed the game because it allows us to turn data into meaningful insights.

But as we move from data to content, we are collecting more and more unstructured data. In fact, by 2025, 80% of worldwide data will be unstructured!

Hence, being able to give structure to unstructured data will become a serious competitive advantage for organizations and businesses.

Approaches to turning unstructured data into structured data

| S No. | Approach | Pros | Cons |
|-------|----------|------|------|
| 1 | Do Nothing | Easy | Massive unrealized opportunity |
| 2 | Human labelling | Easy to get started | Expensive; time-consuming; noisy |
| 3 | AI-Powered APIs | Cost-effective; easy to get started | No customization; privacy concerns |
| 4 | PhDs + GPUs | Customized solutions | Very expensive; hard to scale |
| 5 | AI-First Data Systems | Reliable; scalable; adaptable | We need to build them |

[Diagram: What organizations do today: all data generated by the organization splits into unstructured data (blob storage, specialized DBs), semi-structured data (document stores), and structured data (relational DBs), which serve production APIs, engineering, SQL analytics, and data science.]

Problem 1: Currently in organizations data teams build a system that doesn’t capture the needs of
the AI team

A good way to solve this problem is by building an AI-first system.

AI-first system: An AI-first system requires first understanding the problems that companies want to solve with AI and then identifying the data that will help them do that.

Our lessons learned from building an AI-first system

Lesson 1: Logical data models matter more than you think

Credits: This is a work of Will Gaviria Rojas Co-founder, Coactive AI

In our experience, this is what works best:

• A Data + AI hybrid team builds a system that focuses on impedance matching and logical data models at scale
• This resolves the AI bottleneck and unlocks pathways for optimization

Building logical data models

[Diagram: raw physical data (Comment.json, Sandwich.jpg, Background.mp3, Video_review.mp4, each a key-value record) is transformed via impedance matching into task-oriented logical data models; the physical side is owned by Data Engineering + AI/ML/DS/Research, the logical side by AI/ML/DS/Research.]

Credits: This is a work of Will Gaviria Rojas, Co-founder, Coactive AI

Problem 2: Using the foundation model on everything. This has the following problems-
• AI costs quickly grow out of control
• Bottlenecks in future foundation model applications

Lesson 2: Use embeddings to cache compute.

• The majority of the computing happens in the feature extractor


• Across different tasks, only the last layer changes (i.e. the output layer)
So an effective way to deal with this problem can be breaking the monolith into layers as shown
in the diagram.

[Diagram: Breaking up the monolith shows the root cause of the problem: most of the compute sits in the shared feature extractor, while only the task-specific last layer differs across tasks.]

Credits: This is a work of Will Gaviria Rojas, Co-founder, Coactive AI
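A minimal sketch of the "embeddings as cached compute" idea under assumptions not taken from the talk: run a frozen encoder once (here a sentence-transformers model stands in for the feature extractor), cache the vectors, and train only small task heads on top. The encoder choice, texts, and labels are placeholders.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Expensive part runs once: a frozen encoder turns raw content into embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed encoder choice
texts = ["great sandwich", "background music was too loud", "loved the video review"]
cached = encoder.encode(texts)                           # cache these vectors (e.g. in a vector DB)

# Cheap part runs per task: only a small head is trained on the cached embeddings.
labels_task_a = [1, 0, 1]                                # e.g. sentiment
head_a = LogisticRegression().fit(cached, labels_task_a)

labels_task_b = [0, 1, 0]                                # e.g. "is about audio"
head_b = LogisticRegression().fit(cached, labels_task_b)
```

Adding a new task then only costs another lightweight head over the cached embeddings, not another pass of the foundation model over all the data.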

Chapter 4: Approaches to deploying LLMs & building Infra
Here we discuss the various approaches to deploying LLMs
and building the infrastructure around them.

The goal of this chapter is to:

• Lay out the pros and cons of each approach
• Guide users on what would work best for them
• Provide effective solutions to mitigate risks associated with the deployment of LLMs

This chapter is inspired by the presentation "Considerations on Deploying Open-source LLMs (and other models)" by Oscar Rovira, co-founder of Mystic.ai.

Different ways to deploy LLMs

Option 1: Building an In-House ML Platform with Your Existing Team's Expertise (K8s + Docker)
When considering the development of an in-house machine learning (ML) platform, leveraging your existing team's expertise in Kubernetes (K8s) and Docker can be a viable option. This approach offers several advantages and disadvantages, as outlined below.

Pros
1. Full Control: Building your own ML platform for LLM applications allows you to ensure a perfect fit for your use cases and workflows.

2.Deploy in Your Infrastructure of Choice: This option allows you to deploy the LLM in your
preferred infrastructure, whether it's on-premises, in a private cloud, or in a public cloud
environment. This flexibility enables you to optimize costs, performance, and security based on
your organization's needs.

3.Leverage Existing Expertise: Utilizing your team's existing knowledge of K8 and Docker can
help streamline the development process and reduce the learning curve associated with adopting
new technologies and tools.

Cons
1.Requires a Team of Engineers to Build and Maintain: This option can be resource-intensive
and costly as it requires a dedicated team of engineers with specialized skills in ML, distributed
systems, and infrastructure management.

2. Time-Consuming: Building an in-house ML platform from scratch can take a considerable amount of time, potentially delaying the deployment of ML solutions.

3. Limited to the Existing Team's Knowledge: As the field of ML is rapidly evolving, relying on your current team's expertise in K8s and Docker may limit the capabilities of your ML platform.

4.Lack of External Support: This option can make troubleshooting and resolving issues more
challenging and time-consuming as you might not have access to the same level of support and
resources provided by established ML platform vendors.

Option 2: Select a Cloud Provider and Build Around Their LLM Tooling
Another approach to implementing machine learning (ML) solutions, particularly LLMs, is to
select a cloud provider and build around the LLM tooling they offer.

Pros
1.Faster than Building In-House: Allows you to leverage their pre-built ML infrastructure and
services, reducing the time it takes to develop, deploy, and maintain your ML solutions compared
to building an in-house platform from scratch.

2.Access to Advanced Technologies: Cloud providers often offer cutting-edge technologies and
tools for ML, including the latest LLMs and other AI services. By building around their cloud, you
can take advantage of these advanced capabilities without having to invest in research and
development.

3.Scalability and Flexibility: Cloud-based ML platforms provide the ability to scale resources up or
down based on demand, making it easier to manage costs and accommodate changing
workloads. This flexibility is especially valuable for organizations with fluctuating or seasonal ML
requirements.

4.Support and Maintenance: Cloud providers typically offer extensive documentation, support,
and managed services for their ML tooling, which can help alleviate the burden of maintaining the
platform and troubleshooting issues.

Cons
1.Cloud Lock-In: Relying on a specific cloud provider's LLM tooling may lead to vendor lock-in,
making it difficult to switch providers or move to an in-house solution in the future. This
dependence can also limit your ability to negotiate pricing and service terms.

2.Requires Expertise on the Cloud of Choice: To make the most of a cloud provider's LLM
tooling, your team will need to develop expertise in the provider's specific technologies, services,
and best practices. This may require additional training and resources, particularly if your team is
more familiar with another cloud provider or in-house solutions.

3.Cost Management: While cloud-based ML platforms can offer cost savings through on-demand
resource allocation, it's essential to monitor and manage usage carefully to avoid unexpected
expenses. This may require implementing cost management strategies and tools to track and
optimize spending.

4.Maintenance and Updates: Although cloud providers handle much of the underlying
infrastructure maintenance, you will still need to dedicate resources to managing and updating
your ML solutions built around their LLM tooling. This includes staying up-to-date with new
features and changes in the cloud provider's offerings.

Option 3: Purchasing a Pre-Built Solution for Hosting LLMs
A third approach to implementing Large Language Models (LLMs) is purchasing a pre-built
solution that provides all the necessary infrastructure for hosting and deploying LLMs.

Pros
1.Fastest Go-to-Market: This option can significantly reduce the time it takes to deploy your ML
models. These solutions are designed to streamline the implementation process, enabling you to
focus on developing and refining your models rather than building and maintaining infrastructure.

2.Most Affordable: Purchasing an off-the-shelf solution can be more cost-effective than building
and maintaining an in-house platform or relying on a cloud provider's services. The costs
associated with hiring a team of engineers, acquiring hardware, and managing resources are
minimized, making this option particularly attractive for smaller organizations or those with limited
budgets.

3.Proven and Reliable: Pre-built LLM hosting solutions are often used by multiple companies,
which means they have been tested and refined over time. By choosing a solution that has already
been proven to work for others, you can benefit from the improvements and optimizations made
by the provider based on real-world use cases and feedback.

Cons
1.Limited Customization: While these solutions often offer a range of configuration options, they
may not provide the same level of flexibility and control as building your infrastructure.

2. Dependency on External Provider: When relying on a third-party solution, especially open-source solutions, you become dependent on the provider for updates, support, and maintenance.

3. Lack of In-House Infrastructure Expertise: Choosing this option can hamper the in-house
expertise required to build and manage ML infrastructure. This can limit your ability to adapt to
changing requirements, troubleshoot issues, or transition to other solutions in the future.

Some of our top recommendations for LLM hosting platforms are:

1. SageMaker: A fully managed machine learning platform by AWS that recently added support for deploying LLMs. Click here to check it out

2. TrueFoundry: A powerful, user-friendly developer platform for ML teams that helps you deploy and fine-tune LLMs in just a few clicks. Click here to check it out

3. BentoML: Provides an open-source framework for building and deploying AI products. Click here to check it out

4. MosaicML: Recently acquired by Databricks, MosaicML enables you to easily train and deploy LLMs in your secure environment. Click here to check it out

Chapter 5: Putting an end to LLM hallucinations
This chapter is a summary of "Stopping hallucinations from hurting your LLMs" by Atindriyo Sanyal, Co-Founder & CTO of Galileo.

Note from Atindriyo Sanyal:
"Hallucination in language models always ties back to the quality of the data these models are trained on. And since LLMs have no idea of the undertones of the language they describe, detecting hallucinated regions in these outputs becomes a critical step towards achieving general intelligence."

In a recent example, hallucinations led to a law firm shelling out $5,000 in fines after it cited six fake cases invented by an AI tool.

Stopping Hallucinations
Some typical reasons why a model may hallucinate are as follows,

• Overfitting to training data: If a language model becomes excessively specialized to the specific patterns and examples in its training data, it may struggle to generalize accurately when faced with novel inputs, leading to the generation of hallucinatory responses that align with the training data but lack real-world validity.

• Insufficient or imbalanced training data: When a language model is trained on limited or skewed data, it may not have exposure to a wide range of perspectives, topics, or language patterns. This can result in the model producing hallucinations due to the lack of diverse information needed to generate accurate and well-grounded responses.

• Divergences in large training and eval datasets: Hallucinations can occur if there are
significant differences between the data used to train the language model and the data used
to evaluate its performance. If the model encounters patterns or examples during evaluation
that significantly differ from its training data, it may produce responses that deviate from
reality.

• Poor Quality Prompts: When users provide vague, ambiguous, or contradictory prompts,
language models may struggle to understand the intended meaning or context, potentially
leading to hallucinations as the model tries to make sense of the input and generates
responses that may not align with the user's expectations.

• Encoding/Decoding Mistakes: Language models can make errors during the encoding or
decoding process, which can introduce noise or distortions in the representation of
information. These mistakes can propagate and manifest as hallucinations in the generated
responses, as the model may generate text based on flawed or misinterpreted input.

There are two main types of Hallucinations:

• Fabrication of facts: Occurs when the LLM produces text that appears to be correct but is
false. For example, an LLM may generate a sentence that includes information about a
person or event that does not exist in reality.
• Faulty reasoning: Occurs when the LLM produces text that contains logical inconsistencies
or contradictions. This could include statements that contradict each other within the same
sentence, or statements that contradict facts or beliefs. For example, an LLM may generate
a sentence that states something like "The sun rises in the west."

• Both types of hallucinations can lead to confusion and misinformation if they are not
addressed properly by developers and users of large language models. To reduce these
types of hallucinations, developers should ensure their models are trained on accurate data
sets and use techniques such as fact-checking and logic-checking algorithms to detect any
potential errors before they are released into production systems.

Detecting Hallucinations: A systematic evaluation process (eval) for LLMs

An evaluation process, or eval, is a system employed to assess the performance and behaviour of the models. Evals help prevent LLMs from hallucinating by helping to identify and mitigate the generation of unreliable or fictional information.

Methodology:
1. Curate a dataset for open-ended text generation: To evaluate the proposed metric, create a
dataset of inputs and corresponding LLM outputs. This dataset should encompass diverse
topics, styles, and complexity levels to ensure a comprehensive evaluation.
2. Examine token probabilities for each completion: Analyze the probabilities assigned to each token in the generated output by the LLM. This information could help identify tokens with low probabilities that might be associated with hallucinations (a minimal sketch of this appears after this list).
3. Channel to other 3rd parties (neutral) models for extra signal: To obtain additional insights
and reduce biases, compare the LLM-generated outputs with those of other neutral third-
party models. This comparison can provide a broader perspective on the likelihood of
hallucinations in the generated text.
4. Determine if completions contain hallucinations: Based on the token probabilities and
comparison with other models, classify the generated text as containing hallucinations or
not. This step will involve designing a robust algorithm that considers various factors to
determine the presence of hallucinations or human feedback.
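As referenced in step 2, here is a minimal sketch of extracting per-token log-probabilities from a causal LM with the Hugging Face transformers library. The model choice is a placeholder, and the threshold for flagging a token as suspicious is left to the practitioner.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_logprobs(prompt: str, completion: str):
    """Return (token, log-prob) pairs for each completion token given the prompt."""
    enc = tokenizer(prompt + completion, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**enc).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    ids = enc["input_ids"][0]
    out = []
    # The probability of the token at position i is read from the logits at position i - 1.
    for i in range(prompt_len, ids.shape[0]):
        out.append((tokenizer.decode(ids[i]), log_probs[0, i - 1, ids[i]].item()))
    return out

# Tokens with unusually low log-probability are candidates for hallucinated spans.
for tok, lp in token_logprobs("The capital of France is", " Paris."):
    print(f"{tok!r}: {lp:.2f}")
```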

Goal
1. Accurately detect model hallucinations: Develop a metric that can reliably identify
hallucinations in LLM-generated text. This metric should be able to distinguish between
factually correct, incorrect, and inconsistent content.

2. Broad applicability across diverse generation tasks: The proposed metric should have a
wide range of applications, including but not limited to news articles, social media posts,
and creative writing. This broad applicability will ensure that the metric can be used to
improve the quality of LLM-generated content across various domains.

Evals should cover 5 main cases

1. “Hero” use cases - These are the most common use cases that are expected to be
encountered in production. It is important to test these scenarios thoroughly to ensure that
the model behaves as expected under normal conditions.

2. System capabilities (white-box testing) - In this case, tests are conducted on the internal
components of the model, such as its algorithms and data structures, to ensure that they
are functioning correctly and efficiently.

3. Edge cases (handling missing or conflicting data) - This type of test is designed to evaluate
how well a model can handle unexpected inputs or situations where there are conflicting
data present. This helps identify any potential issues with a model's ability to handle edge
cases before it is deployed in production.

4. User-driven use cases (based on user feedback) - This type of test evaluates how well a
model responds to user feedback and incorporates it into its decision-making process. This
helps ensure that a model can adapt quickly and accurately when faced with a new input or
changing conditions in production environments.

5. Non-workflow scenarios (out-of-scope functionality) - This type of test evaluates how well a
model can handle tasks outside its scope of responsibility, such as handling requests for
information not related to its primary purpose or responding appropriately when faced with
unexpected input from users.

Preventing Chatbots from Hallucinating:


AI chatbots are becoming increasingly popular as they can provide answers to questions quickly
and accurately. However, there is a risk of AI chatbots "hallucinating" - providing convincing but
completely made-up answers. To prevent this from happening, two main approaches can be
taken:

1. The first approach is to give the chatbot access to a knowledge base. This means
providing it with data about the world so that it can draw on this information when
responding to queries. By having access to reliable sources of information, the chatbot will
be able to provide accurate responses rather than make up its answers.

2. The second approach is to build guardrails so that the chatbot only responds with answers
that are grounded in reality. This involves setting up rules for how the chatbot should
respond in certain situations and ensuring that it sticks within these boundaries. For
example, if a user asks a question about something outside of the scope of the chatbot,
then it could be set up so that it provides an appropriate response such as referring them
elsewhere for more information.
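Below is a minimal sketch combining both ideas: retrieve context from a small knowledge base and wrap it in a guardrail prompt that forbids answering outside that context. It assumes the sentence-transformers package and a toy in-memory document list; a production system would use a vector DB and a proper guardrail layer.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Tiny in-memory "knowledge base"; in practice this would live in a vector DB.
DOCS = [
    "TrueFoundry lets teams deploy and fine-tune LLMs on their own cloud.",
    "Qdrant is an open-source vector database for similarity search.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(DOCS, normalize_embeddings=True)

def grounded_prompt(question: str, top_k: int = 1) -> str:
    """Retrieve the most relevant snippets and build a prompt that forbids guessing."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec                      # cosine similarity
    context = "\n".join(DOCS[i] for i in np.argsort(-scores)[:top_k])
    return (
        "Answer ONLY from the context below. If the answer is not in the context, "
        "say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(grounded_prompt("What is Qdrant?"))
```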

Chapter 6: Building & curating data sets for RLHF & fine-tuning

This chapter is a summary of the presentation on "Building and curating Datasets for RLHF and LLM fine-tuning" by Daniel Vila Suero, CEO of Argilla.

"When it comes to deploying safe and reliable software solutions, there are few shortcuts, and this holds true for LLMs. However, there is a notable distinction: for LLMs, the primary source of reliability, safety, and accuracy is data. Therefore, rigorous evaluation and human feedback are indispensable for transitioning from LLM experiments and proofs-of-concept to real-world applications." - Daniel Vila Suero

Reinforcement Learning from Human Feedback (RLHF) is an approach used to train and
improve machine learning models, including LLMs, by leveraging human-generated feedback.
Here's an overview of how RLHF works:

1. Initial Model Training: The LLM is initially pre-trained using large-scale datasets, such as
text from the internet. This pre-training enables the model to learn patterns and language
understanding from diverse sources.

2. Creating a Feedback Loop: To refine and improve the model's behaviour, a feedback loop is
established. Human feedback is collected in the form of demonstrations or comparisons.
Demonstrations involve providing the model with desired responses or actions for specific
inputs, while comparisons involve ranking or selecting the best response among multiple
choices.

3. Reward Signal: Human feedback serves as a reward signal that guides the reinforcement learning process. The model is trained to maximize rewards, which are determined based on the quality, accuracy, relevance, or other desired characteristics of the generated responses (a sketch of the pairwise loss typically used to learn this reward appears after this list).

4. Fine-Tuning: The model undergoes fine-tuning using techniques such as Proximal Policy
Optimization (PPO) or other reinforcement learning algorithms. During this process, the
model's parameters are adjusted to optimize its performance based on human-provided
feedback.

5. Iterative Refinement: The RLHF process is typically performed iteratively. The fine-tuned
model is deployed to interact with users or perform specific tasks, and new feedback is
collected from human evaluators or users. This new feedback is then used to further refine
the model's behaviour through additional rounds of training and fine-tuning.
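As noted in step 3, comparison data is usually turned into a scalar reward by training a reward model with a pairwise (Bradley-Terry style) loss. A minimal PyTorch sketch follows; `reward_model` is a hypothetical module mapping token ids to one scalar per sequence, not a specific library API.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Push r(chosen) above r(rejected) for each human-ranked pair."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```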

Why Human feedback?

Training LLMs on internet-scale data can lead to models that generate toxic, inaccurate, and
unhelpful content, and automatic evaluation metrics often fail to identify these behaviours.

As models become more capable, human feedback is an invaluable signal for evaluating and
improving models.

To Learn about RL applications in LLMs click here

Further, RLHF is important for the following three reasons:

1. Resolves queries, seeks necessary details, responds empathetically, and provides informed suggestions.
Metrics to measure this benefit-
a. Human preference ranking
b. Preference Model scores
c. Elo scores

2. Avoids offensive behaviour, refuses to assist in dangerous acts, and considers diverse
cultural and contextual differences of its users.
Metrics to measure this benefit-
a. Binary questions (inappropriate, sexual, violent content, etc.)
b. Bias and toxicity benchmarks
c. Preference Model scores

3. Provides accurate information and appropriately expresses uncertainty. The InstructGPT paper suggests measuring this as truthfulness.
Metrics to measure this benefit-
a. Measure made-up information on closed-domain tasks
b. Truthful QA dataset

Reference: Bai et al., 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. https://arxiv.org/pdf/2204.05862.pdf

Chapter 7: Large Language Models for recommendation systems

LLMs have become a popular tool for recommendation systems due to their ability to capture complex relationships between items in the data set and make accurate predictions about user preferences.

One of the key advantages of LLMs in recommendation systems is their ability to handle sparse & implicit user feedback.

This chapter is a summary of "LLMs for recommendations" by Sumit Kumar, Sr. ML Engineer at Meta, ex-TikTok.

Using LLMs in recommenders:

• LLMs for Feature Engineering and Feature Encoding: For feature engineering and feature
encoding, LLMs can be used to extract meaningful features from raw data that can then be
used to create more accurate recommendations.
• LLMs as Pipeline Controllers: LLMs can also be used as pipeline controllers, which means
they are responsible for orchestrating the different components of a recommendation
system.
• LLMs as Scoring/Ranking Function
a. Alongside a traditional recommender component: When used alongside a traditional
recommender component, an LLM will take the output of the recommender and
score it based on certain criteria such as relevance or popularity.
b. As a standalone ranker: As a standalone ranker, an LLM will take input from the user
and generate personalized recommendations based on its own scoring criteria.
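As a hedged illustration of the standalone-ranker pattern, the sketch below only builds the ranking prompt; the actual LLM call and the parsing of the returned order are left out, and all item names are made up.

```python
def build_ranking_prompt(user_history: list[str], candidates: list[str]) -> str:
    """Zero-shot prompt asking an LLM to rank candidate items for a user."""
    history = "\n".join(f"- {item}" for item in user_history)
    options = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(candidates))
    return (
        "You are a recommender. The user recently interacted with:\n"
        f"{history}\n\n"
        "Rank the following candidates from most to least relevant, "
        "returning only the numbers in order:\n"
        f"{options}"
    )

print(build_ranking_prompt(
    ["The Martian", "Interstellar"],
    ["Gravity", "Titanic", "Moon"],
))
```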

LLMs for recommendation systems: Pros

• Vast External Knowledge: Pre-trained language models such as BERT and GPT learn
general text representations and encode extensive world knowledge; thus, they can
efficiently capture semantic relationships between items in a dataset. This allows them to
better understand user preferences and make more accurate recommendations.

• Few-shot Tuning and simplified feature engineering: LLMs have the ability to be tuned with
few shots. This means that they can quickly adapt to new datasets without needing to be re-
trained from scratch and are ideal for dynamic environments where data changes rapidly or
when there is limited training data available.

• Alleviating Sparsity Scenarios: LLMs provide a richer representation of user preferences than traditional methods such as collaborative filtering or content-based filtering. By leveraging the vast external knowledge encoded in pre-trained language models, LLMs are able to capture subtle relationships between items that would otherwise be missed by traditional methods.

• Interactive and Explainability: LLMs allow users to get an understanding of why certain
recommendations were made. This helps build trust between users and the system and
provides valuable feedback for improving the system over time.

• Simplified Feature Engineering: LLMs can simplify feature engineering by learning to extract meaningful features from text data automatically, removing the need for manual feature engineering.

Cons of LLMs for Recommendation Systems:

• Stochastic and potentially invalid/unexpected outputs: LLMs have the tendency to produce
different outputs for the same input. This makes it difficult to reproduce results and can
lead to potentially invalid or unexpected outputs.

• Including IDs is challenging: Since IDs such as user IDs and item IDs are not part of the language model's vocabulary, incorporating them into prompts or model inputs is challenging.

• Universal knowledge vs private domain data: While LLMs encode universal knowledge, they may not be able to capture private domain data, which could be important for certain types of recommendations.

• Limited Context Length: LLMs have a limited context length, which means they may not capture long-term dependencies between items in a sequence.

• Popularity and Position Bias: There are some biases associated with using LLMs for recommendation systems, such as popularity bias and position bias, where popular items are recommended more often than unpopular ones and items at the beginning of a list are recommended more often than those at the end.

• Fairness Bias: There is also fairness bias where certain groups of users receive different
recommendations than others based on their demographic information or other factors.

[Figure: an example ChatGPT interface for recommender tasks]

To learn more about ChatGPT-based recommender systems, visit this link.

Chapter 8: Understanding economics of Large Language Models
The development and deployment of LLMs entail
substantial costs and considerations, making it
imperative to understand the economic factors at play.

This is a summary of our co-founder Nikunj Bajaj's talk at the LLMs in Production conference organized by the MLOps community in San Francisco.

Here we delve into:

• The technical & financial intricacies of fine-tuning LLMs
• Their economic implications, along with insights about the same

Understanding LLM Economics
To understand LLM Economics let’s take a sample task and calculate the cost of carrying it out.

Sample Task: Summarize the entire Wikipedia corpus to half its size

Here are some points to keep in mind regarding the economics of shipping LLMs,
1. Fine-tuned OpenAI models are roughly 7x more expensive to run than their pre-trained counterparts, compared to only about 1.6x for open-source models.
2. Cost increases with an increase in context window by ~2X

3. Cost of using the model increases with an increase in the number of parameters of the
model
4. Lower parameter models can also perform better than larger models when fine-tuned for a
particular use case.
5. Significant cost saving is possible without harming the performance much if the right
trade-off is established between cost and performance

Cost of summarization for various models

| Pretrained / Fine-Tuned | Model Name | Params* | Fine-tuning cost ($) | Input Cost ($) | Output Cost ($) | Total Cost ($) |
|---|---|---|---|---|---|---|
| Pre-Trained | GPT-4 32k | 1Tn+ | n/a | $360k | $360k | $720k |
| Pre-Trained | GPT-4 32k | 1Tn+ | n/a | $180k | $180k | $360k |
| Pre-Trained | DaVinci | 175Bn | n/a | $120k | $60k | $180k |
| Pre-Trained | Claude V1 | 52Bn | n/a | $66k | $96k | $162k |
| Pre-Trained | Curie | 13Bn | n/a | $12k | $6k | $18k |
| Pre-Trained | Self-hosted 7B | 7Bn | n/a | $350 | $1.75k | $2.1k |
| Fine-Tuned | DaVinci | 175Bn | $180k | $720k | $360k | $1,260k |
| Fine-Tuned | Curie | 13Bn | $18k | $72k | $36k | $126k |
| Fine-Tuned | Self-hosted 7B | 7Bn | $1.4k | $350 | $1.75k | $3.5k |
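To show how figures like these are computed, here is a back-of-the-envelope sketch. The corpus size and per-1k-token prices below are illustrative placeholders, not the exact numbers used in the talk.

```python
# Back-of-the-envelope cost estimate for "summarize Wikipedia to half its size".
# All numbers below are illustrative placeholders, not the talk's actual figures.

input_tokens = 6_000_000_000            # assumed corpus size in tokens
output_tokens = input_tokens // 2       # summarizing to half the size

price_in_per_1k = 0.02                  # $ per 1k input tokens (placeholder)
price_out_per_1k = 0.02                 # $ per 1k output tokens (placeholder)

cost = (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k
print(f"Estimated inference cost: ${cost:,.0f}")   # -> Estimated inference cost: $180,000
```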

Chapter 9: The confidence checklist for LLMs in production

This chapter is a summary of "The confidence checklist for LLMs in production" by Rohit Agarwal, CEO and co-founder of Portkey.ai.

Given the complex nature of LLMs and their influence on critical decision-making processes, it becomes imperative to establish a confidence checklist to ensure the responsible and trustworthy deployment of these models.

This checklist contains:
• Technical measures
• Evaluation metrics
• Robust testing methodologies

Confidence Checklist for LLMs in Production

• Output validation
• Prepare for DDoS attacks
• Build user limits
• Take care of latency
• Don't retro-fit logs & monitoring records of LLMs
• Implement data privacy

Here are some ways in which you can tick off the above checklist:

Output validation: When productionising LLMs, it is important to consider the output validation process. Output validation ensures that the outputs of the LLM are accurate and reliable. Some ways to validate the output from large language models are as follows:

Basic ways:
1. Checking for empty strings & character lengths

Advanced ways:
1. Checking for relevance and ranking Results

Expert Ways:
1. Model Based checks: This involves using metrics such as perplexity and BLEU scores to
evaluate how well a model performs.
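A minimal sketch of the basic checks above (empty strings and character lengths), with a hook where a relevance or model-based check could be plugged in; the length bounds are arbitrary placeholders.

```python
def validate_output(text: str, min_len: int = 1, max_len: int = 4000) -> list[str]:
    """Return a list of failed basic checks for an LLM response."""
    problems = []
    if not text or not text.strip():
        problems.append("empty response")
    elif not (min_len <= len(text) <= max_len):
        problems.append(f"length {len(text)} outside [{min_len}, {max_len}]")
    # Advanced/expert checks (relevance ranking, perplexity, BLEU) would be added here.
    return problems

print(validate_output(""))           # ['empty response']
print(validate_output("All good."))  # []
```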

Prepare for a DDoS: DDoS attacks have a significant impact on the performance and
availability of your LLM system. DDoS attacks flood the bandwidth or resources of a
targeted system. This can disrupt normal traffic and make it difficult for users to access
your services. Some ways in which you can prepare your Large Language Model stack for a
DDoS attack:

Basic:
1. Captcha: Adding a Captcha to your service can help prevent automated bots from sending
a large number of requests and overwhelming your system.

Advanced:
1. Rate-limiting: Rate-limits on API usage can help manage the flow of incoming requests.
This can effectively control the traffic and prevent potential DDoS attacks.
2. Geo-blocking: Using this you can block or limit access from specific countries or regions
that are known sources of DDoS attacks.

Expert:
1. IP-based Monitoring: Monitor incoming IP addresses for signs of malicious activity. Doing
this can proactively identify and mitigate potential DDoS attacks.
2. Fingerprinting: Fingerprinting involves analysing various characteristics of incoming
requests. By monitoring these fingerprints, you can identify and block malicious actors
attempting to launch a DDoS attack.
3. Intrusion Detection and Prevention Systems (IDPS): Deploying an IDPS can help you
monitor network traffic for signs of a DDoS attack.
4. DDoS Protection Service: Consider using a specialized DDoS protection service that
provides advanced techniques for detecting and mitigating DDoS attacks, as well as
additional tools for securing your language model stack against potential threats.

Building User limits: LLMs require a lot of data and computational resources, making it
difficult to control their use. Hence, user limits provide a way to ensure that LLMs are used
appropriately, by limiting the amount of data they can access or the types of tasks they can
perform. Additionally, user limits can help protect users from potential misuse or abuse of
LLMs by providing safeguards against malicious actors. By setting user limits for large
language applications, organizations can ensure that their AI capabilities are being used
safely and responsibly. Some ways in which you can build user limits are,

Basic
1. Client-side rate-limiting: This is the simplest form of rate-limiting, and it involves the client
(the user's device) limiting the number of requests it makes to the server.
2. Debounce: Debounce is a technique that can be used to prevent users from making too many
requests in a short time. It works by delaying the response to a request until a certain amount
of time has passed since the previous request.
Advanced

1. Users, organizations, and segment-based rate-limits: This type of rate-limiting allows you to specify different limits for different users, organizations, or segments of users (a minimal sketch appears after this list).

Expert

1. Dynamic rate-limiting: Dynamic rate-limiting allows you to adjust the limits for a user or
organization based on their behaviour

2. Exponential decrease in limits for identified abuse: This technique helps prevent abusing the
system even if users can circumvent the initial limits.

3. Start with lower limits and increase as trust increases

4. Fingerprinting: Apart from preventing DDoS attacks this can also prevent abuse by users
who are trying to circumvent the rate-limiting system by using different devices or IP
addresses.
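Here is a minimal sketch of the per-user rate-limiting referenced in the advanced item above, using one token bucket per user ID. The capacity and refill rate are arbitrary placeholders, and a production system would keep these counters in a shared store such as Redis rather than in process memory.

```python
import time
from collections import defaultdict

CAPACITY = 10         # max requests in a burst (placeholder)
REFILL_PER_SEC = 0.5  # tokens added back per second (placeholder)

_buckets = defaultdict(lambda: {"tokens": CAPACITY, "last": time.monotonic()})

def allow_request(user_id: str) -> bool:
    """Token bucket per user: refill based on elapsed time, spend one token per request."""
    bucket = _buckets[user_id]
    now = time.monotonic()
    bucket["tokens"] = min(CAPACITY, bucket["tokens"] + (now - bucket["last"]) * REFILL_PER_SEC)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False

print(allow_request("user-42"))   # True until the bucket is drained
```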

Care about Latency: Latency is the amount of time it takes for a request to reach the system and for the response to come back. Low latency is essential for applications that require fast response times, such as voice recognition or natural language processing. By tracking latency, developers can identify areas where improvements are needed and optimize their applications accordingly.
Some ways in which you can ensure optimal latency are as follows,

Basic
1. Implement streaming: Streaming is a technique that allows you to send data in a
continuous stream, rather than all at once. This can be helpful for reducing latency, as it
allows the client to start processing the data as soon as it is received, rather than waiting
for the entire response to be received before it can start processing.

Further, using the correct HTTP headers makes sure that the data is sent in a way that is
efficient for the client.

Advanced
1. Handle rate-limits with backoff: Rate-limits can be helpful for preventing abuse, but they can also lead to latency if the client is not able to make a request because it has reached its rate-limit. Backoff is a technique that can be used to mitigate the effects of rate-limits. It works by making the client wait a progressively longer, slightly randomized amount of time before retrying (see the sketch after this list).
2. Caching: Caching is a technique that can be used to store frequently accessed data in
memory. This can help to reduce latency, as the client can access the data from memory,
rather than having to make a request to the server.

Expert
1. Build fallbacks: Fallbacks are a way of handling situations where the primary method of
accessing data fails. For example, if you are using a CDN to cache data, you could have a
fallback that uses the origin server if the CDN is unavailable.
2. Queues: Queues are a way of storing requests that cannot be processed immediately. This helps smooth out traffic spikes: the server can accept requests right away and work through them in order, instead of blocking or rejecting new requests while it is busy.

3. Semantic caching: Semantic caching is a technique that allows you to cache data based on
its meaning, rather than just its exact value. This can help to reduce latency, as the server
can return a cached response even if the data has changed slightly since it was cached.
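A minimal sketch of the exponential-backoff-with-jitter idea from the advanced list above. `request_fn` and `RateLimitError` are placeholders for your actual LLM client call and whatever exception it raises on HTTP 429.

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for whatever exception your LLM client raises on HTTP 429."""

def call_with_backoff(request_fn, max_retries: int = 5):
    """Retry a rate-limited call, waiting exponentially longer (plus jitter) each time."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RateLimitError:
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("gave up after repeated rate limits")
```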

Don't retro-fit logs & monitoring records of LLMs: Retro-fitting monitoring after the fact can lead to inaccurate results, as it relies on time-stamped log records which may be incomplete or out of date. Here are some options for getting visibility into your LLMs from the start:

Basic

1. Live with poor visibility: This is the simplest option, but it is also the least secure. If you are
not willing to pay for a monitoring service, then you will have to live with the fact that you
will have limited visibility into your LLMs.
2. Pay for a monitoring service: There are a number of monitoring services available, such as
Datadog and Cloudwatch. These services can provide you with detailed visibility into your
LLMs, including logs, metrics, and traces.

Expert

1. Build your own monitoring layer: If you want the most control over your monitoring, then
you can build your own monitoring layer. This involves using a tool like Elasticsearch,
Kibana, and Grafana to collect and visualize your logs and metrics.

Implement Data Privacy: LLMs are powerful ML algorithms that can generate human-like
text and are trained on vast amounts of data. This data can include sensitive information,
such as personal details or confidential business records. Therefore, it is essential to ensure
that LLMs only reveal private information in the right contexts and to the right people. To
achieve this, organizations must implement measures such as prompt tuning and secure
data processing to protect customer privacy and comply with regulations like the GDPR.
Here are some ways in which you can implement data privacy for your LLM applications

Basic

1. Use encryption: Encrypt all data that is stored in your LLM applications. This will help to
protect the data from unauthorized access.
2. Limit data access: Only allow authorized users to access the data in your LLM applications.
This will help to protect the data from unauthorized access.

3. Monitor your systems: Monitor your systems for signs of unauthorized access. This will
help you to detect and respond to any security breaches.

Advanced

1. Amend GDPR, Cookie, and Privacy Policies: If you are using LLMs in the European Union,
then you will need to amend your GDPR, Cookie, and Privacy Policies to reflect the use of
LLMs. This includes providing information about how you are collecting and using personal
data, and how you are protecting the privacy of your users.

2. Implement PII masking with LangChain / LlamaIndex: PII masking is a technique that can be used to obscure personally identifiable information (PII) in text data. This can help to protect the privacy of your users, and it can also help to comply with data privacy regulations. LangChain and LlamaIndex are two tools that can be used to implement PII masking for LLM applications.
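For intuition, here is a tiny hand-rolled sketch of PII masking using regular expressions. It is not how LangChain or LlamaIndex implement it, and real deployments should prefer a dedicated library or service (such as those mentioned in this section) over ad-hoc patterns.

```python
import re

# Very small regex-based PII masker; patterns here only cover emails and phone numbers.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact Jane at jane.doe@example.com or +1 415-555-0199."))
# -> Contact Jane at [EMAIL] or [PHONE].
```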

Expert

1. Microsoft Presidio or Azure services: Microsoft Presidio and Azure offer services that can help you implement data privacy for your LLM applications. These services provide a variety of features, such as data encryption, data masking, and data access control.

Chapter 10: Creating a contextual Chatbot with LLMs

This chapter is a summary of "Creating a contextual Chatbot with LLM and VectorDB in 10 Minutes" by Raahul Dutta and the Elsevier team.

This is a practical guide on creating a contextual chatbot. Here we cover:

• The frameworks & Python libraries used for embeddings
• The vector database used to make the chatbot
• The fine-tuning technique used
• How the LLM was hosted

The Business Case:

💨 Elsevier provides a policy document on a topic to other organizations.


• Policymakers do not have the time to read individual articles and must rely upon
synopses.
• These synopses often arrive too late, and the findings quickly go out of date.
• There is a disconnect between science and policy on what 'good evidence' means.

💭 That's why they’ve built an automated solution to generate the policy papers on a
specific dataset.

Building a contextual Chatbot with LLMs: Implementation

[Architecture: Web UI → LangChain orchestrator → embedding model (served via a KServe API) → Qdrant embedding database → LLM model → result, with the source documents fed in for context.]

Embeddings
1. A TensorRT PyTorch pipeline served with KServe produces the embeddings.
2. The embedding model used is "intfloat/simlm-base-msmarco-finetuned".
3. The embeddings are stored as NumPy vectors.

LLMs tried
1. Vicuna 13b
2. facebook/opt-6.7b
3. Bloom 7b
4. RedPajama-INCITE-Instruct-3B-v1
5. Fine-tuned Falcon 7b Instruct
6. Best result: mosaicml/mpt-7b-instruct

Fine-tuning
PEFT and QLoRA

Hosting
A vLLM-based, Rust-based server.

Vector Database (Qdrant)
1. Used Qdrant (qdrant.tech) as the vector database.
2. Used Rust code to upload the embedding vectors.
3. Stored the embeddings as NumPy vectors.
4. Tuned the indexing optimizer (memmap_threshold).
5. Deployed the Docker image in the K8s cluster.
About TrueFoundry
TrueFoundry is a US-headquartered, cloud-native machine learning training and deployment platform. We enable enterprises to run ChatGPT-type models and manage LLMOps on their own cloud or infrastructure, which has the benefits of:

1) Data Security with no data flowing to vendors like OpenAI,

2) Reduced Cost with Optimized GPU Management

3) Stable and scalable Inference

After having talked to 50+ companies that are already starting to put LLMs in production, we have developed a Model Catalogue which is cloud-agnostic and lets you deploy and fine-tune LLM models in less than 5 minutes.

We aim to help companies self-host their LLMs on top of Kubernetes, making your inference costs 10x cheaper in one click.

If you want to use a single API to call any commercial or open-source LLM, check out our LLM playground. To learn more about our LLM solution, schedule a demo by clicking below.

Schedule a personalized Demo

Click to learn more about our LLM solution

www.truefoundry.com
