LeanContext: Cost-Efficient Domain-Specific Question Answering Using LLMs

Md Adnan Arefeen1,2 * , Biplob Debnath1 , and Srimat Chakradhar1


1 NEC Laboratories America, 2 University of Missouri-Kansas City
[email protected], {biplob, chak}@nec-labs.com
arXiv:2309.00841v1 [cs.CL] 2 Sep 2023

* Work done during internship at NEC Laboratories America.

Abstract

Question-answering (QA) is a significant application of Large Language Models (LLMs), shaping chatbot capabilities across healthcare, education, and customer service. However, widespread LLM integration presents a challenge for small businesses due to the high expenses of LLM API usage. Costs rise rapidly when domain-specific data (context) is used alongside queries for accurate domain-specific LLM responses. One option is to summarize the context by using LLMs and reduce the context. However, this can also filter out useful information that is necessary to answer some domain-specific queries. In this paper, we shift from human-oriented summarizers to AI model-friendly summaries. Our approach, LeanContext, efficiently extracts k key sentences from the context that are closely aligned with the query. The choice of k is neither static nor random; we introduce a reinforcement learning technique that dynamically determines k based on the query and context. The rest of the less important sentences are reduced using a free open-source text reduction method. We evaluate LeanContext against several recent query-aware and query-unaware context reduction approaches on prominent datasets (arXiv papers and BBC news articles). Despite cost reductions of 37.29% to 67.81%, LeanContext's ROUGE-1 score decreases only by 1.41% to 2.65% compared to a baseline that retains the entire context (no summarization). Additionally, if free pretrained LLM-based summarizers are used to reduce context (into human-consumable summaries), LeanContext can further modify the reduced context to enhance the accuracy (ROUGE-1 score) by 13.22% to 24.61%.

[Figure 1: Compared to the original context, LeanContext only drops ∼2% in ROUGE-1 score with ∼68% savings on the BBCNews dataset (Li 2023).]

Introduction

In recent times, large language models (LLMs) have seen extensive utilization, especially since the introduction of LLM APIs for customer-oriented applications on a large scale (Liu et al. 2023). These applications include chatbots (like GPT-4), language translation (Jiao et al. 2023), text summarization (Luo, Xie, and Ananiadou 2023; Yang et al. 2023; Zhang, Liu, and Zhang 2023a), question-answering (QA) tasks (Tan et al. 2023), and personalized robot assistance (Wu et al. 2023). While the zero-shot performance of LLMs is nearly on par with fine-tuned models for specific tasks, it has limitations. One significant limitation is their inability to answer queries about recent events on which they have not been trained. This lack of exposure to up-to-date information can lead to inaccurate responses, particularly for domain-specific information processing, where the LLMs may not grasp new terminology or jargon. To build an effective domain-specific question-answering system, it becomes essential to educate the LLMs about the specific domains, enabling them to adapt and understand new information accurately.

LLMs can learn domain-specific information in two ways: (a) via fine-tuning the model weights for the specific domain, or (b) via prompting, where users share the contents with the LLMs as input context. Fine-tuning these large models containing billions of parameters is expensive and considered impractical if there is a rapid change of context over time (Schlag et al. 2023), e.g., a domain-specific QA system where the documents shared by users are very recent and from different domains. A more practical way is the latter approach, i.e., the prompt-based solution, where relevant contents from user documents are added to the query so that the LLM answers based on the context. Motivated by this, we focus on prompt-based solutions for document-based QA systems.

One of the challenges of long document processing using a prompt-based solution is that the input prompt length is limited to a maximum length defined by the LLM API. The token limits of GPT-3.5 and GPT-4 vary from 4,096 to 32,768 max tokens, with the limit proportional to the usage cost. Therefore, LLMs will fail to answer the query if the prompt length is larger than the max token limit due to a larger context in the prompt. One suitable way to get rid of this problem is via document chunking (Harrison Chase 2022). In this case, initially, the user documents are segmented into chunks, and only the relevant chunks of fixed size are retrieved as context based on the query.

Context-based querying using LLM APIs via prompting is associated with a cost that is proportional to the number of input tokens (contributing prompt cost) and the number of output tokens (contributing generation cost). According to a recent study (GPT-3 Cost), with 15,000 visitors making 24 requests per month, the cost of using GPT-3 (Davinci model) is $14,400 per month (assuming prompt tokens = 1800, output tokens = 80), which is challenging for a small business to operate. For GPT-4, the cost is even higher than this amount. In this paper, our focus is to reduce this cost.

To mitigate the cost of using an LLM API, the number of tokens in the context should be reduced, as the cost is proportional to the length of the context. A low-cost option to reduce the context is to summarize it using free open-source summarizer models. However, for domain-specific QA applications, the pre-trained open-source summarizers do not yield good accuracy. On the contrary, using a pay-per-use model like ChatGPT further increases the query processing cost instead of reducing it, as the additional cost is added at the time of text reduction.

To this end, we propose a domain-specific question-answering system, LeanContext, where users ask queries based on a document. To answer a query, LeanContext first forms a context from the document based on the query by retrieving relevant chunks. Next, it identifies the top-k sentences related to the query from the context. LeanContext introduces a reinforcement learning technique that dynamically determines k based on the query and context. Then, LeanContext reduces the rest of the sentences in fragments by an open-source text reduction method. Next, it forms a new context by stitching the top-k sentences and the reduced fragments together in the order of their appearance in the original context. Finally, it invokes an LLM (like ChatGPT) to answer the query using that new context. It is to be noted that the goal of LeanContext is contrary to the summarization task, which generates a meaningful summary for human users. Rather, in LeanContext, the reduced context will be consumed by a question-answering model like ChatGPT. Figure 1 shows the effect of LeanContext, reducing the context size with accuracy close to the original context and outperforming other open-source models.

In summary, LeanContext makes the following contributions:
• It presents a low-cost domain-specific QA system, which reduces the LLM API usage cost by reducing the domain context through the selection of important sentences related to the query and keeping them intact, while reducing the rest of the sentences in between the important sentences through a free open-source summarizer. It proposes a reinforcement learning technique to select the percentage of important sentences.
• It reduces the LLM API usage cost of a domain-specific QA system by 37.29% ∼ 67.81% with a drop in performance of only 1.41% ∼ 2.65% (Table 1, Table 2).
• It boosts the QA performance by 13.22% ∼ 24.59% (Table 3) by combining query-aware top-k sentences with the reduced context generated through free open-source text summarizers.

Related Work

For domain-specific tasks, LLMs can be adapted to the domains without modifying their inner parameters via discrete prompting, where distinct instructions with contexts can be delivered as input to generate responses for downstream tasks (Ling et al. 2023; Brown et al. 2020). For domain-specific QA tasks, the domain context can be reduced by summarization to reduce the LLM cost. A lot of research has been conducted on summarizing text (Miller 2019; Yang et al. 2023). Existing research works can be categorized into two main parts: (a) extractive and (b) abstractive. Extractive summarizers (Miller 2019) first identify important sentences from the text and then summarize them, while abstractive summarizers (Laskar et al. 2023) reduce the context by generating new sentences. The main goal of both approaches is to generate a meaningful summary for human users. In contrast, the goal of LeanContext is to reduce context which will be consumed by a question-answering model like ChatGPT. For the prompt-based summarization task, iterative text summarization (Zhang, Liu, and Zhang 2023b) has recently been proposed to refine the summary in a feedback-based iterative manner. In aspect- or query-based summarization (Yang et al. 2023), summaries are generated based on a domain set of specific queries.

Query-unaware text compression via prompting is also observed in recent literature. Semantic compression (Gilbert et al. 2023) involves generating systematic prompts to reduce context using ChatGPT models (GPT-3.5-turbo, GPT-4) and achieves reasonable compression compared to the zlib compression method. Due to limited context window size, recent literature focuses on prompt context filtering. In selective context (Li 2023), token-, phrase-, or sentence-level query-unaware content filtering is proposed using the entropy of GPT-2 model logits for each entity. Usage of the ChatGPT model in the medical domain, especially in radiology, is explored via prompt engineering (Ma et al. 2023) to summarize difficult radiology reports. Extract-then-generate pipeline-based summarization improves abstractive summary faithfulness (Zhang, Liu, and Zhang 2023a) through chain-of-thought (CoT) (Wei et al. 2022) reasoning. To reduce the cost of LLM use, FrugalGPT (Chen, Zaharia, and Zou 2023) proposed several ideas regarding prompt adaptation by query concatenation, LLM approximation by fine-tuning or caching, and LLM cascading by the selective use of LLMs from lower to higher cost based on a query. Still, it lacks context compression ideas to reduce the prompt tokens.

It is to be noted that recent studies either focus on summarization as a downstream task or utilize the summary of the context for the question-answering task.
For most of the existing content filtering approaches, the main focus is query-agnostic filtering, with the deletion of less informative content, or solely summarization tasks. Using an LLM for query-aware context reduction adds the extra overhead of using a pay-per-use LLM to answer correctly. In addition, in recent articles, the chunk-based preprocessing of the article is ignored by treating each content item in the dataset as a chunk. In LeanContext, the main focus is to reduce the LLM cost by considering query-aware context reduction. Due to the possibility of rapid change of domain-specific user data, fine-tuning an LLM or a parameter-efficient LLM is not a feasible solution. Utilizing an open-source LLM (Touvron et al. 2023; Chung et al. 2022) either does not perform well on domain data or adds additional deployment cost. Considering a small business to run, we use pay-per-use LLMs such as OpenAI LLMs and make the system run at a reasonable cost by reducing the context.

[Figure 2: Workflow of a domain-specific QA system. (a) Domain data ingestion: user documents are split into chunks by a text splitter, an embedding generator computes chunk embeddings, and the pairs are stored in a vector database (key: embedding, value: chunk). (b) Question-answering (QA): the user query is embedded, semantic search over the vector database retrieves a subset of the domain data as context, and ChatGPT (LLM) produces the answer.]
Domain-specific QA System

In a domain-specific QA system, a context with a query is given to an LLM to get the answer. If the context size exceeds the max-token limit of the LLM API, the LLM will fail to answer the query. As a result, for long document processing, a vector database (ChromaDB; Pinecone) is used to store domain-specific documents as a number of small chunks so that a subset of the relevant domain context can be retrieved from the long document rather than using the whole document as context. A domain-specific QA system is shown in Figure 2. The QA system can be divided into two steps.

(a) Domain data ingestion: In this step, the documents D are split into a number of fixed-size chunks (c) by a text splitter. An embedding function computes the embedding of each chunk using an embedding generator. The chunks, along with the embedding vector (v_c) of each chunk, are stored in a vector database.

(b) QA: At the question-answering (QA) step, given a user query q_i, the N chunks whose embeddings are most similar to the query embedding are retrieved using semantic search. These chunks form the context (C). Finally, the context, which is a subset of the domain data, is fed into the LLM to get the answer.

As domain-specific data and user queries are dynamic in nature, retrieving minimal context based on a query is challenging. One possible way is to make the chunk size and the number of chunks dynamic so that the context contains the minimal sentences needed to answer the query. But this solution is infeasible, as the vector database would need to be reconfigured per query with the change of domain. Instead, it is more practical if, after getting the candidate chunks as context, the context is further reduced based on the query to get a near-optimal LLM cost. Following this notion, we propose LeanContext, an adaptive context reduction system to reduce the prompt cost of ChatGPT-like LLMs.
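The following is a minimal sketch of these two steps, assuming the components named later in the Implementation Details (an all-MiniLM-L6-v2 sentence-transformer as the embedding generator, a ChromaDB collection as the vector database, and 500-character chunks with no overlap); the simple character-based splitter, the collection name, and the function names are illustrative stand-ins, not the authors' exact code.

```python
# Sketch of Figure 2: (a) domain data ingestion and (b) query-time retrieval.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")          # embedding generator
client = chromadb.Client()
collection = client.create_collection("domain_docs")        # vector DB (embedding -> chunk)

def split_into_chunks(text: str, chunk_size: int = 500) -> list[str]:
    # Fixed-size chunks with no overlap, mirroring the paper's chunking settings.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def ingest(documents: list[str]) -> None:
    # (a) Domain data ingestion: chunk, embed, and store each document.
    for doc_id, doc in enumerate(documents):
        chunks = split_into_chunks(doc)
        collection.add(
            ids=[f"{doc_id}-{j}" for j in range(len(chunks))],
            documents=chunks,
            embeddings=embedder.encode(chunks).tolist(),
        )

def retrieve_context(query: str, n_chunks: int = 4) -> str:
    # (b) QA: semantic search returns the N chunks closest to the query embedding.
    result = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=n_chunks,
    )
    return "\n".join(result["documents"][0])
```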
LeanContext

The objective of LeanContext is to further reduce the context C to C′, where the token count of C′ is smaller than the token count of C. As the LLM prompt cost is proportional to the token count of the context, LeanContext helps to reduce the prompt cost of LLMs. In other words, if the total number of tokens in C is T and the number of tokens in C′ is t, then LeanContext reduces the token ratio τ, defined as τ = t/T, without compromising the accuracy (acc) of the QA system. So, if the optimal accuracy of the system is acc*, we formulate the optimization problem of LeanContext as follows:

min. (1 − α) × τ + α × |acc − acc*|        (1)
by combining information from the rest of the sentences. ment learning algorithm that perceives an optimal policy for
Figure 4 shows an illustration of this combination process. an agent operating in an environment. After training, we will
LeanContext keeps the top-𝑘 sentences intact, while other have an optimal 𝑄 table that takes the best action for a given
sentences between the top-𝑘 sentences are reduced by an state.
open-source text reduction method. LeanContext maintains 𝜋 ∗ (𝑠𝑡𝑎𝑡𝑒) = argmax 𝑄 ∗ (𝑠𝑡𝑎𝑡𝑒, 𝑎𝑐𝑡𝑖𝑜𝑛)
the order of the top-k sentences and other sentences accord- 𝑎
ing to their appearances in the original context in order to Details of RL: Due to the dynamic environment of do-
produce more accurate results. mains and queries, it is difficult to estimate the optimal con-
After forming the context with semantic search, LeanContext first ranks the sentences of the context based on the query. Assuming the context consists of a sequence of sentences (s_1, s_2, s_3, ..., s_n), it extracts the top-k sentences most similar to the query from the context using the cosine-similarity function. To accomplish this, it computes the embedding of the query (v_q), compares it with each of the sentence embeddings (v_si), and identifies the top-k sentences:

Top-k sentences = sort(V, similarity_score(v_q, v_si)), where 1 ≤ i ≤ n.

Using only the top-k sentences as a reduced context is a lightweight approach to reduce the cost of LLM API usage. However, LeanContext boosts accuracy by combining information from the rest of the sentences. Figure 4 shows an illustration of this combination process. LeanContext keeps the top-k sentences intact, while the other sentences between the top-k sentences are reduced by an open-source text reduction method. LeanContext maintains the order of the top-k sentences and the other sentences according to their appearance in the original context in order to produce more accurate results.

[Figure 4: Illustration of the reduction of less important sentences while keeping the top-k sentences intact in a context.]

The idea of keeping the top sentences intact is a simple but interesting approach; adding it to existing open-source pre-trained summarizer models helps increase their performance. However, an interesting question lies in identifying the k in top-k: what should the value of k be? We answer this question in more detail with a reinforcement learning (RL) based solution as follows.
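Before turning to the RL-based choice of k, the following is a minimal sketch of this selection-and-reduction step with a fixed ratio; the regex sentence splitter and the `reduce_text` placeholder (standing in for the open-source text reduction method, for which the paper uses Selective Context at an 80% reduction ratio) are illustrative assumptions.

```python
# Sketch: keep the top-k query-similar sentences intact, reduce the rest,
# and reassemble everything in its original order.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def reduce_text(sentence: str) -> str:
    # Placeholder for the open-source text reduction step (e.g., Selective Context).
    return sentence[: max(1, len(sentence) // 5)]

def lean_context(context: str, query: str, k_ratio: float = 0.1) -> str:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", context) if s]
    v_q = embedder.encode([query])[0]
    v_s = embedder.encode(sentences)
    # Cosine similarity between the query embedding and every sentence embedding.
    sims = v_s @ v_q / (np.linalg.norm(v_s, axis=1) * np.linalg.norm(v_q) + 1e-9)
    k = max(1, int(k_ratio * len(sentences)))
    top_idx = set(np.argsort(-sims)[:k].tolist())
    # Preserve original sentence order; reduce only the non-top sentences.
    pieces = [s if i in top_idx else reduce_text(s) for i, s in enumerate(sentences)]
    return " ".join(pieces)
```

With the adaptive variant described next, the fixed `k_ratio` is replaced by the ratio the RL agent selects for the query-context pair.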
Adaptive-k: Identifying the number k is crucial for context reduction as well as performance. A fixed k might hurt accuracy if k << n; on the contrary, if k ≃ n, the token ratio will be higher, leading to higher costs. Hence, to achieve minimal cost with maximum accuracy, k should be adaptive. To get a query-based adaptive context, we propose a lightweight Q-learning-based reinforcement learning algorithm that learns an optimal policy for an agent operating in an environment. After training, we have an optimal Q table that gives the best action for a given state:

π*(state) = argmax_action Q*(state, action)

Details of RL: Due to the dynamic environment of domains and queries, it is difficult to estimate the optimal context. In this situation, where the environment is complex and dynamic, RL fits best. After retrieving the context based on a query, LeanContext computes the state from the context and query embeddings. With the state, the RL agent finds a suitable action from the trained Q* table. Based on the action, the threshold for context reduction is computed and the top-k sentences are selected. Then the reduced context is produced according to one of the sentence positioning variants. After that, the reduced context and the query are sent to the LLM and the response is retrieved. We carefully define our state, action, and reward function as follows.

State: We combine the query and the context to define the state. At offline profiling, we compute the embedding of the query (v_q) and the embedding of the context (v_c). Then we subtract v_q from v_c, which represents the context-query pair. After building these vectors over a number of training samples, we run a K-means model to compute the centroids. These centroids are utilized as the state vectors S. At run time, a query-context pair is encoded using its embedding vectors and the closest centroid becomes its state:

S ← K-means( ⋃_{i,j} (v_ci − v_qj) )

Action: Our goal is to make the top-k extraction adaptive. Given a set of thresholds from 0 to 0.4 (the maximum top-k is 40% of the total context) to select the number of sentences, we define each possible choice as an action of the proposed RL system. The outcome of each action choice is computed by the reward function, and the Q table is updated for the corresponding (state, action) pair.

Reward: The reward for the RL model is higher if the context ratio is lower and the accuracy is almost equal to the optimal accuracy obtained using the full context. We compute the ROUGE score (r) to evaluate the answer obtained using the reduced context against the actual answer obtained using the full context. If the ROUGE score with the expensive LLM is r* for a query, then the current (state, action) pair is rewarded if r − r* ≥ 0, and penalized otherwise. For the token ratio, the lower the better, as reducing the context as much as possible without compromising accuracy is rewarded. Thus, the reward function R is defined as follows:

R = −(1 − α)τ + α(2r − r*)

Algorithm 1: LeanContext Training Algorithm
Require: D: input documents
Require: q: a set of queries
Require: Θ: a set of predefined thresholds, acting as the set of actions in RL
Ensure: Q*: trained Q-table
 1: X ← ∅                                   ⊲ set of states
 2: for each q, context ∈ q, D do
 3:    v_c, v_q ← Embedding(context), Embedding(q)    ⊲ get embedding vectors
 4:    X ← X ∪ (v_c − v_q)
 5: end for
 6: S ← K-Means(X)                           ⊲ centroids as state vectors
 7: for each q, context ∈ q, D do
 8:    state ← get_state(S, context, q)
 9:    action ← get_action(Θ)
10:    C ← retrieve(q, v_q, θ)                ⊲ context formation
11:    C′ ← perform_action(C, action)         ⊲ text reduction
12:    answer ← llm(q, C′)
13:    r ← compute_score(answer, original_answer)
14:    reward ← α(2r − r*) − (1 − α) × τ(C′, C)
15:    Q(state, action) ← Q(state, action) + (1/n)(reward − Q(state, action))   ⊲ update Q table
16: end for
17: return Q*                                 ⊲ return trained Q table
Training algorithm: The off-policy Q-table training algorithm is shown in Algorithm 1. In lines 1−6, each state is computed from the subtraction of the query embedding from the context embedding; a k-means model is trained to get the centroids, and the centroids are utilized as the different states. In lines 7−15, for each threshold selected as an action, the corresponding reward is computed and the Q table is updated. Finally, the updated Q table is deployed for reducing the context. The training of RL requires the LLM to compute the reward. To reduce the training cost, we perform training on fewer samples; hence, to observe the effect of each action in each state, we do a full exploration to update the Q table.

LeanContext Inference: The LeanContext inference algorithm is shown in Algorithm 2. For each query, the corresponding context is retrieved and the state is computed by the trained RL agent [line 2]. Based on the state, the threshold θ is computed as an action to select the top-k sentences, and the top-k sentences together with the reduced less important sentences produce the reduced context C′ [lines 3-4]. This reduced context is utilized to get the answer from the LLM [line 5]. Finally, the answer is returned to the user.

Algorithm 2: LeanContext Inference
Require: D_t: test documents
Require: q_t: a set of test queries
Require: Agent: trained RL agent
 1: for each q, context ∈ q_t, D_t do
 2:    state ← Agent.get_state(Agent.S, context, q)
 3:    action ← Agent.get_action(state)
 4:    C′ ← Agent.perform_action(context, action)
 5:    answer ← llm(q, C′)
 6:    return answer
 7: end for
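A compact sketch of Algorithms 1 and 2 in plain Python is given below under simplifying assumptions: `embed`, `retrieve_context`, `reduce_with_ratio`, `ask_llm`, `rouge1`, and `token_count` are stand-in helpers (the reduction helper corresponds to the fixed-ratio selection sketched earlier); the α value, the threshold set, the number of K-means states, the use of scikit-learn's KMeans, the visit-count average for the Q update, and the choice of reference answer for ROUGE are illustrative choices rather than the authors' exact settings.

```python
# Sketch of Q-table training (Algorithm 1) and inference (Algorithm 2).
import numpy as np
from sklearn.cluster import KMeans

THRESHOLDS = [0.0, 0.1, 0.2, 0.3, 0.4]   # actions: top-k ratio, capped at 40% of the sentences
ALPHA = 0.5                               # illustrative accuracy/cost trade-off weight

def train_q_table(queries, contexts, ref_answers, n_states=8):
    # States: K-means centroids of (context embedding - query embedding).
    diffs = np.array([embed(c) - embed(q) for q, c in zip(queries, contexts)])
    kmeans = KMeans(n_clusters=n_states, n_init=10).fit(diffs)
    q_table = np.zeros((n_states, len(THRESHOLDS)))
    visits = np.zeros_like(q_table)

    for q, c, ref in zip(queries, contexts, ref_answers):
        state = int(kmeans.predict([embed(c) - embed(q)])[0])
        r_star = rouge1(ask_llm(q, c), ref)               # score of the full-context answer
        for action, ratio in enumerate(THRESHOLDS):       # full exploration of every action
            reduced = reduce_with_ratio(c, q, ratio)      # top-k kept intact, rest reduced
            r = rouge1(ask_llm(q, reduced), ref)          # score of the reduced-context answer
            tau = token_count(reduced) / token_count(c)   # token ratio of the reduced context
            reward = ALPHA * (2 * r - r_star) - (1 - ALPHA) * tau
            visits[state, action] += 1
            # Incremental-average update, mirroring line 15 of Algorithm 1.
            q_table[state, action] += (reward - q_table[state, action]) / visits[state, action]
    return kmeans, q_table

def answer_query(query, kmeans, q_table):
    # Inference (Algorithm 2): pick the best ratio for the state, reduce, and ask the LLM.
    context = retrieve_context(query)
    state = int(kmeans.predict([embed(context) - embed(query)])[0])
    ratio = THRESHOLDS[int(np.argmax(q_table[state]))]
    return ask_llm(query, reduce_with_ratio(context, query, ratio))
```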
Experimental Settings

Dataset: In real-time scenarios, user documents are new to LLMs such as the gpt-3.5-turbo model, which therefore should not be able to answer a query without being given a context. To ensure this, we use recent arXiv papers and BBC news articles so that the LLMs have not been trained on them. Following this, we use the Arxiv Dataset and the BBC News Dataset, whose documents were published in March 2023 (Li 2023). We generate questions with answers for each document using QAGenerationChain from LangChain (Harrison Chase 2022) based on the gpt-3.5-turbo model.

Baseline Models: In our question-answering-based system, we mainly focus on reducing the context length while keeping the same QA performance. So, we use a pay-per-use LLM (gpt-3.5-turbo) for all cases to answer a query based on the context. We evaluate our proposed context length reduction approach against the following recent context reduction approaches.

1. Context (Original): We keep the context length intact and ask the LLM to answer the query.

2. CQSumDP (Laskar et al. 2023): We generate the following prompt, similar to CQSumDP, to produce the query-aware summary of a context.
"A document along with its query is given below. Write down the most reasonable summary relevant to its document-query pair.
Document: {CONTEXT}
Query: {QUERY}"

3. Semantic Compression (Gilbert et al. 2023): A query-unaware prompt is used to compress a context as follows.
"Please compress the following text into a latent representation that a different gpt-3.5-turbo model can decompress into the original text. The compression model should purely minimize the number of characters in the compressed representation while maintaining the semantics of the original text. The resulting compressed text does not need to be decompressed into the original text but should capture the semantics of the original text. The compressed text should be able to be decompressed into a text that is semantically similar to the original text but does not need to be identical.
Text to Compress: {CONTEXT}"

4. SBert (Miller 2019): A BERT-model-based extractive summarization approach. The context is reduced to several sentences; we keep the number of sentences at 3 in our experiments.

5. Selective Context (SC) (Li 2023): It utilizes entropy to filter out less informative content from the context. In our experiments, we employ phrase-level content filtering with different reduction ratios. We use the GPT-2 model to compute the self-information.

6. Flan-T5-Base (Chung et al. 2022): We use this 250M-parameter encoder-decoder model to summarize the context based on the same instruction as CQSumDP.

Implementation Details: To implement the document ingestion, we use the cheap all-MiniLM-L6-v2 model (Reimers and Gurevych 2019) as the embedding generator. The text chunks, along with their embeddings, are stored in the ChromaDB (ChromaDB) vector database. We use a chunk size of 500 with a chunk overlap of 0. We vary the number of chunks N = 2, 4, 8, 10 to compare how well the RL algorithm performs. We use this template for the QA: "Answer to the question based on the given context. Context: {CONTEXT}, Question: {QUERY}, if you do not find any answer in the context, simply return 'No answer'". We use LLMChain from LangChain (Harrison Chase 2022) to ask queries to the OpenAI model by prompting. LeanContext uses the selective context method (Li 2023) to reduce the less important sentences by 80%.
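For reference, this is roughly how the reduced context and the QA template above come together at query time; the paper routes this through LangChain's LLMChain, so the direct OpenAI chat-completions call shown here is an assumed, simplified equivalent.

```python
# Sketch: answer a query with the reduced context using the paper's QA template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QA_TEMPLATE = (
    "Answer to the question based on the given context. "
    "Context: {CONTEXT}, Question: {QUERY}, "
    "if you do not find any answer in the context, simply return 'No answer'"
)

def answer_with_llm(query: str, reduced_context: str) -> str:
    prompt = QA_TEMPLATE.format(CONTEXT=reduced_context, QUERY=query)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```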
Results

Arxiv Dataset: We consider 25 random articles from the Arxiv dataset for testing and another 5 documents, distinct from those 25, for RL training. After retrieving the context for a query, we use the RL agent to identify the top-k sentences in the context and reduce the context. For the same query and context setting, we evaluate our approach against the baseline approaches. The comparison of LeanContext with the baseline models is shown in Table 1. LeanContext achieves similar performance compared to the original context with no text reduction and outperforms the existing open-source models with 37% cost savings. CQSumDP (Laskar et al. 2023) achieves better accuracy at a greater cost (adding 15.36% more cost) by reducing the context to the minimum needed to get the right answer. Throughout the experiments, we observe the same effect and conclude that the zero-shot performance of GPT-4-like LLMs is better with concise and relevant context for the query (CQSumDP) than with a context containing a lot of irrelevant information. Although Semantic Compression (Gilbert et al. 2023) uses an LLM to minimize the context, due to the context-agnostic nature of the prompt, it performs even worse than LeanContext and adds an additional ∼72% cost.
Text Reduction Method | Avg. Total tokens | Avg. Prompt tokens | Avg. Summary tokens | Avg. Completion tokens | ROUGE-1 | ROUGE-2 | ROUGE-L | Cost Savings (%)
Context (Original) | 547 | 521 | 0 | 26 | 0.3985 | 0.2868 | 0.3714 | 0.00
CQSumDP | 631 | 89 | 517 | 25 | 0.4424 | 0.3061 | 0.4213 | -15.36
Semantic Compression | 939 | 319 | 597 | 23 | 0.3331 | 0.2221 | 0.3132 | -71.66
T5-base | 79 | 71 | 0 | 8 | 0.1614 | 0.1101 | 0.1486 | 85.56
SBert | 205 | 188 | 0 | 17 | 0.2563 | 0.1701 | 0.2469 | 62.52
SC (reduction = 0.50) | 334 | 316 | 0 | 18 | 0.2945 | 0.2014 | 0.2755 | 38.94
LeanContext (Fixed k = 0.1) | 210 | 196 | 0 | 14 | 0.2305 | 0.1623 | 0.2173 | 61.62
LeanContext (Adaptive k [RL]) | 343 | 321 | 0 | 22 | 0.3844 | 0.2684 | 0.3577 | 37.29

Table 1: Comparison on 100 random samples from the Arxiv Dataset. Number of chunks, N = 4.

BBCNews Dataset: We consider 100 random news articles from the BBCNews dataset. We keep 80 articles for testing and the remaining 20 articles to train the RL agent. For the query-based context retrieval settings, we evaluate our approach against the baseline approaches.

Due to the cost of the OpenAI model and the current usage limit, we follow recent literature (Yang et al. 2023): we generate queries using the QA generation method and consider 100 random query samples for evaluation. The evaluation result is shown in Table 2.

We observe that generating a query-aware summary from the document-query pair using the OpenAI LLM (Laskar et al. 2023) performs better (ROUGE-1 of 0.5801) than query-aware open-source baseline methods such as T5-base (0.3993), but at a high cost; it also incurs more cost than using the original context (10.64% more). We also observe that the query-unaware LLM using a different hard prompt (Gilbert et al. 2023) performs even worse (ROUGE-1 score of 0.4729) with a 41.66% cost overhead. Additionally, adding top-k (0.1) uplifts the ROUGE-1 score of T5-base by ∼12.35%. With adaptive top-k (Top-k (RL)), and with adaptive top-k plus sentence ordering, it performs even better while reducing the cost by 65% ∼ 74%. We observe the same scenario for the other models too. Further investigation with different numbers of chunks for the same test queries is discussed in the Ablation Study section.

Text Reduction Method | Avg. Total tokens | Avg. Prompt tokens | Avg. Summary tokens | Avg. Completion tokens | ROUGE-1 | ROUGE-2 | ROUGE-L | Cost Savings (%)
Context (Original) | 761 | 724 | 0 | 37 | 0.5498 | 0.4172 | 0.5337 | 0.00
CQSumDP | 842 | 75 | 738 | 29 | 0.5801 | 0.4405 | 0.5637 | -10.64
Semantic Compression | 1078 | 228 | 820 | 30 | 0.4729 | 0.3204 | 0.4517 | -41.66
T5-base | 74 | 51 | 0 | 23 | 0.3993 | 0.2631 | 0.3752 | 90.28
SBert | 174 | 147 | 0 | 27 | 0.4261 | 0.2917 | 0.4082 | 77.14
SC (reduction = 0.50) | 461 | 429 | 0 | 32 | 0.4740 | 0.3308 | 0.4521 | 39.42
LeanContext (Fixed k = 0.1) | 278 | 250 | 0 | 28 | 0.5017 | 0.3740 | 0.4872 | 63.47
LeanContext (Adaptive-k [RL]) | 245 | 218 | 0 | 27 | 0.5233 | 0.3943 | 0.5093 | 67.81

Table 2: Comparison on 100 random samples from the BBCNews Dataset. Number of chunks, N = 8.

Top-k is all you need: We observe an interesting phenomenon when adding our LeanContext to existing open-source models. As these open-source models are not trained on new domain data, we observe that combining the 10% top-k sentences with the open-source models boosts the QA system's performance by 5.41% ∼ 17.11% for the Arxiv dataset and 8.11% ∼ 12.35% for the BBCNews dataset, as shown in Table 3. In addition, using top-k on the fly with a generic summarizer adds an extra advantage to the overall system. Our LeanContext, with top-k sentences selected using RL and reduced versions of the other sentences in between them, performs even better than the fixed (10%) top-k: by 13.22% ∼ 24.59% for the Arxiv dataset and 10.94% ∼ 15.39% for the BBCNews dataset (Table 3). LeanContext with the T5-base model outperforms all the existing approaches, including the no-reduction method.

Dataset | Text Reduction Method | Avg. Total tokens | Avg. Prompt tokens | Avg. Summary tokens | Avg. Completion tokens | ROUGE-1 | ROUGE-2 | ROUGE-L | Cost Savings (%)
Arxiv | Context (Original) | 547 | 521 | 0 | 26 | 0.3985 | 0.2868 | 0.3714 | 0.00
Arxiv | T5 | 79 | 71 | 0 | 8 | 0.1614 | 0.1101 | 0.1486 | 85.56
Arxiv | T5 + LeanContext (Only Top-k=0.1) | 131 | 113 | 0 | 18 | 0.3325 | 0.2390 | 0.3146 | 76.05
Arxiv | T5 + LeanContext (Only Top-k=RL) | 284 | 263 | 0 | 21 | 0.3942 | 0.2847 | 0.3742 | 48.08
Arxiv | T5 + LeanContext | 357 | 335 | 0 | 22 | 0.4073 | 0.2863 | 0.3809 | 34.73
Arxiv | SBert | 205 | 188 | 0 | 17 | 0.2563 | 0.1701 | 0.2469 | 62.52
Arxiv | SBert + LeanContext (Only Top-k=0.1) | 250 | 230 | 0 | 20 | 0.3104 | 0.2181 | 0.2949 | 54.30
Arxiv | SBert + LeanContext (Only Top-k=RL) | 405 | 380 | 0 | 25 | 0.3676 | 0.2594 | 0.3464 | 25.96
Arxiv | SBert + LeanContext | 478 | 452 | 0 | 26 | 0.3885 | 0.2720 | 0.3596 | 12.61
BBCNews | Context (Original) | 761 | 724 | 0 | 37 | 0.5498 | 0.4172 | 0.5337 | 0.00
BBCNews | T5 | 74 | 51 | 0 | 23 | 0.3993 | 0.2631 | 0.3752 | 90.28
BBCNews | T5 + LeanContext (Only Top-k=0.1) | 142 | 116 | 0 | 26 | 0.5228 | 0.3914 | 0.5065 | 81.34
BBCNews | T5 + LeanContext (Only Top-k=RL) | 192 | 165 | 0 | 27 | 0.5368 | 0.4072 | 0.5187 | 74.77
BBCNews | T5 + LeanContext | 259 | 232 | 0 | 27 | 0.5532 | 0.4298 | 0.5374 | 65.97
BBCNews | SBert | 174 | 147 | 0 | 27 | 0.4261 | 0.2917 | 0.4082 | 77.14
BBCNews | SBert + LeanContext (Only Top-k=0.1) | 241 | 212 | 0 | 29 | 0.5072 | 0.3731 | 0.4914 | 68.33
BBCNews | SBert + LeanContext (Only Top-k=RL) | 289 | 260 | 0 | 29 | 0.5200 | 0.3827 | 0.5039 | 62.02
BBCNews | SBert + LeanContext | 355 | 327 | 0 | 28 | 0.5355 | 0.4007 | 0.5235 | 53.36

Table 3: Effect of cascading LeanContext with open-source summarizers.

Ablation Study

We investigate whether the RL model has a performance dependency on the number of chunks. So, we train the RL model using N = 8 and run the same model on N = 2, 4, and 10. The results are shown in Table 4. We observe that although N changes, the RL model still outperforms the other less expensive baseline methods. Another interesting observation is that, with LeanContext, N can be made as large as needed for successful retrieval of the context, whereas previously, without LeanContext, N had to be kept small to reduce the cost of LLM usage. In Figure 5, we show the action (top-k ratio) taken by the RL agent given the context and query for each of the 100 query samples; the adaptive top-k ratio chosen by our RL agent varies over the query samples to achieve the adaptive reduction of context.

Text Reduction Method | Avg. Total tokens | Avg. Prompt tokens | Avg. Summary tokens | Avg. Completion tokens | ROUGE-1 | ROUGE-2 | ROUGE-L | Cost Savings (%)
Number of chunks, N = 2
Context (Original) | 243 | 211 | 0 | 32 | 0.5370 | 0.3987 | 0.5190 | 0.00
CQSumDP | 322 | 70 | 225 | 27 | 0.5303 | 0.3897 | 0.5105 | -32.51
Semantic Compression | 448 | 113 | 307 | 28 | 0.4786 | 0.3169 | 0.4519 | -84.36
T5-base | 76 | 50 | 0 | 26 | 0.4010 | 0.2640 | 0.3781 | 68.72
SBert | 157 | 128 | 0 | 29 | 0.4825 | 0.3395 | 0.4604 | 35.39
SC (reduction = 0.50) | 164 | 137 | 0 | 27 | 0.4614 | 0.3113 | 0.4403 | 32.51
LeanContext | 117 | 92 | 0 | 25 | 0.4556 | 0.3265 | 0.4373 | 51.85
Number of chunks, N = 4
Context (Original) | 417 | 384 | 0 | 33 | 0.5489 | 0.4183 | 0.5327 | 0.00
CQSumDP | 496 | 71 | 398 | 27 | 0.5569 | 0.4156 | 0.5389 | -18.94
Semantic Compression | 665 | 159 | 479 | 27 | 0.4916 | 0.3328 | 0.4684 | -59.47
T5-base | 74 | 50 | 0 | 24 | 0.3921 | 0.2555 | 0.3678 | 82.25
SBert | 163 | 136 | 0 | 27 | 0.4626 | 0.3287 | 0.4452 | 60.91
SC (reduction = 0.50) | 263 | 235 | 0 | 28 | 0.4726 | 0.3250 | 0.4507 | 36.93
LeanContext | 153 | 128 | 0 | 25 | 0.4856 | 0.3605 | 0.4700 | 63.31
Number of chunks, N = 10
Context (Original) | 930 | 892 | 0 | 38 | 0.5423 | 0.4072 | 0.5246 | 0.00
CQSumDP | 1008 | 74 | 906 | 28 | 0.5849 | 0.4440 | 0.5644 | -8.39
Semantic Compression | 1274 | 258 | 987 | 29 | 0.4862 | 0.3343 | 0.4642 | -36.99
T5-base | 74 | 51 | 0 | 23 | 0.3903 | 0.2584 | 0.3688 | 92.04
SBert | 181 | 153 | 0 | 28 | 0.4164 | 0.2807 | 0.3967 | 80.54
SC (reduction = 0.50) | 559 | 527 | 0 | 32 | 0.4825 | 0.3312 | 0.4591 | 39.89
LeanContext | 279 | 252 | 0 | 27 | 0.5318 | 0.4069 | 0.5205 | 70.00

Table 4: Evaluation of the RL model on different numbers of chunks on 100 random samples from the BBCNews Dataset. Our RL agent, trained with N = 8, shows promising results when applied to different chunk numbers.

[Figure 5: Adaptive-k ratio selected by the BBCNews RL agent based on queries.]
We also empirically evaluate the state function with different associations between the context embedding (v_c) and the query embedding (v_q): cosine similarity (state = v_c · v_q / (||v_c|| ||v_q||)), concatenation (state = v_c ⊕ v_q), and subtraction (state = v_c − v_q). Among them, subtraction performs best (Table 5). In this experiment, we only select the top-k sentences using RL in order to evaluate the impact of the state definition on performance.

Compression Method | Avg. Total tokens | Avg. Prompt tokens | Avg. Completion tokens | ROUGE-1 | ROUGE-2 | ROUGE-L
None | 538 | 514 | 24 | 0.3945 | 0.2904 | 0.3764
LeanContext (state = v_c · v_q / (||v_c|| ||v_q||)) | 296 | 275 | 21 | 0.3443 | 0.2369 | 0.3259
LeanContext (state = v_c − v_q) | 331 | 308 | 23 | 0.3553 | 0.2535 | 0.3388
LeanContext (state = v_c ⊕ v_q) | 284 | 265 | 19 | 0.3400 | 0.2370 | 0.3223

Table 5: Comparison of RL state functions on the Arxiv dataset.

Conclusion and Future Work

In this paper, we propose LeanContext, a cost-efficient query-aware context reduction system to reduce the cost associated with LLM API usage. Despite the reduction of prompt tokens, LeanContext achieves similar or better performance compared to using the original context with no reduction. The advantage of the top-k approach is that it can be plugged into any existing summarization method of a domain-based QA system to boost the overall performance, according to our experimental results. Here, we only focus on text-based context as a domain; we will explore other domains in our future work.

References

Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.
Chen, L.; Zaharia, M.; and Zou, J. 2023. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv preprint arXiv:2305.05176.
ChromaDB. 2023. ChromaDB. https://www.trychroma.com/. Accessed: June 20, 2023.
Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Gilbert, H.; Sandborn, M.; Schmidt, D. C.; Spencer-Smith, J.; and White, J. 2023. Semantic Compression With Large Language Models. arXiv preprint arXiv:2304.12512.
GPT-3 Cost. 2023. GPT-3 cost estimation for real applications. https://neoteric.eu/blog/how-much-does-it-cost-to-use-gpt-models-gpt-3-pricing-explained/. Accessed: July 11, 2023.
Harrison Chase. 2022. LangChain. https://github.com/hwchase17/langchain. 2022-10-17.
Jiao, W.; Wang, W.; Huang, J.-t.; Wang, X.; and Tu, Z. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.
Laskar, M. T. R.; Rahman, M.; Jahan, I.; Hoque, E.; and Huang, J. 2023. CQSumDP: A ChatGPT-Annotated Resource for Query-Focused Abstractive Summarization Based on Debatepedia. arXiv preprint arXiv:2305.06147.
Li, Y. 2023. Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering. arXiv preprint arXiv:2304.12102.
Ling, C.; Zhao, X.; Lu, J.; Deng, C.; Zheng, C.; Wang, J.; Chowdhury, T.; Li, Y.; Cui, H.; Zhao, T.; et al. 2023. Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models. arXiv preprint arXiv:2305.18703.
Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; He, H.; Li, A.; He, M.; Liu, Z.; et al. 2023. Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.
Luo, Z.; Xie, Q.; and Ananiadou, S. 2023. Chatgpt as a factual inconsistency evaluator for abstractive text summarization. arXiv preprint arXiv:2303.15621.
Ma, C.; Wu, Z.; Wang, J.; Xu, S.; Wei, Y.; Liu, Z.; Guo, L.; Cai, X.; Zhang, S.; Zhang, T.; et al. 2023. ImpressionGPT: an iterative optimizing framework for radiology report summarization with chatGPT. arXiv preprint arXiv:2304.08448.
Miller, D. 2019. Leveraging BERT for extractive text summarization on lectures. arXiv preprint arXiv:1906.04165.
Pinecone. 2023. Vector database. https://www.pinecone.io/learn/vector-database/. Accessed: June 20, 2023.
Reimers, N.; and Gurevych, I. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
Schlag, I.; Sukhbaatar, S.; Celikyilmaz, A.; Yih, W.-t.; Weston, J.; Schmidhuber, J.; and Li, X. 2023. Large language model programs. arXiv preprint arXiv:2305.05364.
Tan, Y.; Min, D.; Li, Y.; Li, W.; Hu, N.; Chen, Y.; and Qi, G. 2023. Evaluation of ChatGPT as a question answering system for answering complex questions. arXiv preprint arXiv:2303.07992.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824–24837.
Wu, J.; Antonova, R.; Kan, A.; Lepert, M.; Zeng, A.; Song, S.; Bohg, J.; Rusinkiewicz, S.; and Funkhouser, T. 2023. Tidybot: Personalized robot assistance with large language models. arXiv preprint arXiv:2305.05658.
Yang, X.; Li, Y.; Zhang, X.; Chen, H.; and Cheng, W. 2023. Exploring the limits of chatgpt for query or aspect-based text summarization. arXiv preprint arXiv:2302.08081.
Zhang, H.; Liu, X.; and Zhang, J. 2023a. Extractive summarization via chatgpt for faithful summary generation. arXiv preprint arXiv:2304.04193.
Zhang, H.; Liu, X.; and Zhang, J. 2023b. SummIt: Iterative Text Summarization via ChatGPT. arXiv preprint arXiv:2305.14835.
