LeanContext: Cost-Efficient Domain-Specific Question Answering Using LLMs

* Work done during internship at NEC Laboratories America.
Abstract
Question-answering (QA) is a significant application of Large Language Models (LLMs), shaping chatbot capabilities across healthcare, education, and customer service. However, widespread LLM integration presents a challenge for small businesses due to the high expense of LLM API usage. Costs rise rapidly when domain-specific data (context) is used alongside queries to obtain accurate domain-specific LLM responses. One option is to summarize the context using LLMs and thereby reduce it. However, this can also filter out useful information that is necessary to answer some domain-specific queries. In this paper, we shift from human-oriented summarizers to AI-model-friendly summaries. Our approach, LeanContext, efficiently extracts k key sentences from the context that are closely aligned with the query. The choice of k is neither static nor random; we introduce a reinforcement learning technique that dynamically determines k based on the query and context. The rest of the less important sentences are reduced using a free open-source text reduction method. We evaluate LeanContext against several recent query-aware and query-unaware context reduction approaches on prominent datasets (arXiv papers and BBC news articles). Despite cost reductions of 37.29% to 67.81%, LeanContext's ROUGE-1 score decreases by only 1.41% to 2.65% compared to a baseline that retains the entire context (no summarization). Additionally, if free pretrained LLM-based summarizers are used to reduce the context (into human-consumable summaries), LeanContext can further modify the reduced context to enhance the accuracy (ROUGE-1 score) by 13.22% to 24.61%.

Figure 1: Compared to the original context, LeanContext drops only ∼2% in ROUGE-1 score with ∼68% savings on the BBCNews dataset (Li 2023).
Introduction

In recent times, large language models (LLMs) have seen extensive utilization, especially since the introduction of LLM APIs for customer-oriented applications at large scale (Liu et al. 2023). These applications include chatbots (like GPT-4), language translation (Jiao et al. 2023), text summarization (Luo, Xie, and Ananiadou 2023; Yang et al. 2023; Zhang, Liu, and Zhang 2023a), question-answering (QA) tasks (Tan et al. 2023), and personalized robot assistance (Wu et al. 2023). While the zero-shot performance of LLMs is nearly on par with fine-tuned models for specific tasks, it has limitations. One significant limitation is their inability to answer queries about recent events on which they have not been trained. This lack of exposure to up-to-date information can lead to inaccurate responses, particularly for domain-specific information processing, where the LLMs may not grasp new terminology or jargon. To build an effective domain-specific question-answering system, it becomes essential to educate the LLMs about the specific domains, enabling them to adapt to and understand new information accurately.

LLMs can learn domain-specific information in two ways: (a) via fine-tuning the model weights for the specific domain, or (b) via prompting, where users share content with the LLM as input context. Fine-tuning these large models containing billions of parameters is expensive and considered impractical when the context changes rapidly over time (Schlag et al. 2023), e.g., in a domain-specific QA system where the documents shared by users are very recent and come from different domains. A more practical way is to select the latter approach, i.e., the prompt-based solution, where relevant content from user documents is added to the query so that the answer is grounded in the context. Motivated by this, we focus on prompt-based solutions for document-based QA systems.

One of the challenges for long-document processing using a prompt-based solution is that the input prompt length is limited to a maximum length defined by the LLM API. The token limits of GPT-3.5 and GPT-4 vary from 4,096 to 32,768 tokens, with the maximum token limit proportional to the usage cost. Therefore, LLMs will fail to answer the query if the prompt length exceeds the maximum token limit due to a long context in the prompt. One suitable way to avoid this problem is document chunking (Harrison Chase 2022). In this case, the user documents are first segmented into chunks, and only the relevant chunks of fixed size are retrieved as context based on the query.
Context-based querying using LLM APIs via prompting is associated with a cost that is proportional to the number of input tokens (contributing the prompt cost) and the number of output tokens (contributing the generation cost). According to a recent study (GPT-3 Cost), with 15,000 visitors making 24 requests per month, the cost of using GPT-3 (Davinci model) is $14,400 per month (assuming 1,800 prompt tokens and 80 output tokens per request), which is challenging for a small business to operate. For GPT-4, the cost is even higher than this amount. In this paper, our focus is to reduce this cost.
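To make the order of magnitude concrete, the following back-of-the-envelope sketch reproduces this kind of estimate. The flat $0.02-per-1K-token Davinci price is our assumption for illustration, so the result lands close to, but not exactly at, the $14,400 figure reported by the cited study, which may use slightly different pricing assumptions.

# Back-of-the-envelope monthly cost estimate for a prompt-based QA service.
# Pricing is an assumption (legacy Davinci, ~$0.02 per 1K tokens).
visitors_per_month = 15_000
requests_per_visitor = 24
prompt_tokens = 1_800
output_tokens = 80
price_per_1k_tokens = 0.02  # USD, assumed flat rate for prompt + completion

requests = visitors_per_month * requests_per_visitor   # 360,000 requests/month
tokens_per_request = prompt_tokens + output_tokens      # 1,880 tokens/request
monthly_tokens = requests * tokens_per_request           # ~676.8M tokens/month
monthly_cost = monthly_tokens / 1_000 * price_per_1k_tokens
print(f"~${monthly_cost:,.0f} per month")                # ~ $13,536, same ballpark as $14,400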
To mitigate the cost of using an LLM API, the number of tokens in the context should be reduced, as the cost is proportional to the length of the context. A low-cost option is to summarize the context using free open-source summarizer models. However, for domain-specific QA applications, pre-trained open-source summarizers do not yield good accuracy. On the contrary, using a pay-per-use model like ChatGPT for summarization further increases the query processing cost instead of reducing it, as additional cost is incurred at text-reduction time.

To this end, we propose a domain-specific question-answering system, LeanContext, where users ask queries based on a document. To answer a query, LeanContext first forms a context from the document based on the query by retrieving relevant chunks. Next, it identifies the top-k sentences related to the query from the context. LeanContext introduces a reinforcement learning technique that dynamically determines k based on the query and context. Then, LeanContext reduces the rest of the sentences, in fragments, using an open-source text reduction method. Next, it forms a new context by stitching the top-k sentences and the reduced fragments together in the order of their appearance in the original context. Finally, it invokes an LLM (like ChatGPT) to answer the query using that new context. It is to be noted that the goal of LeanContext is contrary to the summarization task, which generates a meaningful summary for human users; rather, in LeanContext, the reduced context will be consumed by a question-answering model like ChatGPT. Figure 1 shows the scenario of LeanContext, which reduces the context size with accuracy close to that of the original context while outperforming other open-source models.

In summary, LeanContext makes the following contributions:
• It presents a low-cost domain-specific QA system, which reduces the LLM API usage cost by reducing the domain context through the selection of important sentences related to the query and keeping them intact, while reducing the rest of the sentences in between the important sentences through a free open-source summarizer. It proposes a reinforcement learning technique to select the percentage of important sentences.
• It reduces the LLM API usage cost of a domain-specific QA system by 37.29% ∼ 67.81% with a performance drop of only 1.41% ∼ 2.65% (Table 1, Table 2).
• It boosts the QA performance by 13.22% ∼ 24.59% (Table 3) by combining query-aware top-k sentences with the reduced context generated through free open-source text summarizers.

Related Work

For domain-specific tasks, LLMs can be adapted to new domains without modifying their inner parameters via discrete prompting, where distinct instructions with contexts are delivered as input to generate responses for downstream tasks (Ling et al. 2023; Brown et al. 2020). For domain-specific QA tasks, the context can be reduced by summarization to reduce the LLM cost. A lot of research has been conducted on text summarization (Miller 2019; Yang et al. 2023). Existing work can be categorized into two main parts: (a) extractive and (b) abstractive. Extractive summarizers (Miller 2019) first identify important sentences from the text and then summarize them, while abstractive summarizers (Laskar et al. 2023) reduce the context by generating new sentences. The main goal of both approaches is to generate a meaningful summary for human users. In contrast, the goal of LeanContext is to produce a reduced context that will be consumed by a question-answering model like ChatGPT. For the prompt-based summarization task, iterative text summarization (Zhang, Liu, and Zhang 2023b) has recently been proposed to refine the summary in a feedback-based, iterative manner. In aspect- or query-based summarization (Yang et al. 2023), summaries are generated based on a set of domain-specific queries.

Query-unaware text compression via prompting is also observed in recent literature. Semantic compression (Gilbert et al. 2023) generates systematic prompts to reduce the context using ChatGPT models (GPT-3.5-turbo, GPT-4) and achieves reasonable compression compared to the zlib compression method. Due to the limited context window size, recent literature also focuses on prompt-context filtering. In Selective Context (Li 2023), token-, phrase-, or sentence-level query-unaware content filtering is proposed using the entropy of GPT-2 model logits for each entity. The use of the ChatGPT model in the medical domain, especially radiology, is explored via prompt engineering (Ma et al. 2023) to summarize difficult radiology reports. An extract-then-generate pipeline improves abstractive summary faithfulness (Zhang, Liu, and Zhang 2023a) through chain-of-thought (CoT) (Wei et al. 2022) reasoning. To reduce the cost of using LLMs, FrugalGPT (Chen, Zaharia, and Zou 2023) proposed several ideas: prompt adaptation by query concatenation, LLM approximation by fine-tuning or caching, and an LLM cascade that selects LLMs from lower to higher cost based on the query. Still, it lacks context-compression ideas to reduce the number of prompt tokens.
It is to be noted that recent studies either focus on summarization as a downstream task or utilize a summary of the context for the question-answering task. For most of the existing content-filtering approaches, the main focus is query-agnostic filtering of content by deleting less informative parts. Relying only on open-source LLMs is another possible solution; however, utilizing open-source LLM models (Touvron et al. 2023; Chung et al. 2022) either does not perform well on domain data or adds additional deployment cost. Considering a small business setting, we use pay-per-use LLMs such as OpenAI LLMs and keep the system running at a reasonable cost by reducing the context.

Figure 2: Workflow of a domain-specific QA system. (a) Domain data ingestion: user documents are split into chunks, embedded, and stored with their embedding vectors; (b) Question-Answering (QA) over a retrieved subset of the domain data.
Domain-specific QA System

In a domain-specific QA system, a context together with a query is given to an LLM to get the answer. If the context size exceeds the max-token limit of the LLM API, the LLM will fail to answer the query. As a result, for long-document processing, a vector database (ChromaDB; Pinecone) is used to store domain-specific documents as a number of small chunks so that a subset of the relevant domain context can be retrieved from the long document rather than using the whole document as context. A domain-specific QA system is shown in Figure 2. The QA system can be divided into two steps.

(a) Domain data ingestion: In this step, the documents D are split into a number of fixed-size chunks (c) by a text splitter. An embedding function computes the embedding of each chunk using an embedding generator. The chunks, along with the embedding vector (v_c) of each chunk, are stored in a vector database.

(b) QA: At the question-answering (QA) step, given a user query q_i, the N chunks whose embeddings are most similar to the query embedding are retrieved using semantic search. These chunks form the context (C). Finally, the context, which is a subset of the domain data, is fed into the LLM to get the answer.
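As an illustration of these two steps, the following minimal sketch chunks documents, embeds them, and retrieves the N most similar chunks for a query. The embedding model, the character-based chunk size, and the in-memory store are our assumptions for illustration only; the system described above uses a text splitter, an embedding generator, and a vector database such as ChromaDB or Pinecone rather than this toy store.

import numpy as np
from sentence_transformers import SentenceTransformer  # embedding generator (illustrative choice)

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not the paper's exact choice

def split_into_chunks(document: str, chunk_size: int = 500) -> list[str]:
    # (a) Domain data ingestion: fixed-size character chunks (simple stand-in for a text splitter).
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def build_index(documents: list[str]):
    chunks = [c for doc in documents for c in split_into_chunks(doc)]
    vectors = embedder.encode(chunks, normalize_embeddings=True)  # v_c for each chunk
    return chunks, vectors  # in-memory stand-in for a vector database

def retrieve_context(query: str, chunks, vectors, n: int = 4) -> str:
    # (b) QA: semantic search via cosine similarity between query and chunk embeddings.
    v_q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ v_q
    top = np.argsort(-scores)[:n]
    return "\n".join(chunks[i] for i in sorted(top))  # context C fed to the LLM with the query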
As domain-specific data and user queries are dynamic in nature, retrieving minimal context based on a query is challenging. One possible way is to make the chunk size and the number of chunks dynamic so that the context contains only the minimal sentences needed to answer the query, but this solution is infeasible. Instead, LeanContext reduces the retrieved context C to a smaller context C′ whose token count is less than the token count of C. As the LLM prompt cost is proportional to the token count of the context, LeanContext helps to reduce the prompt cost of LLMs. In other words, if the total number of tokens in C is T and the number of tokens in C′ is t, then LeanContext reduces the token ratio τ, defined as τ = t/T, without compromising the accuracy (acc) of the QA system. So, if the optimal accuracy of the system is acc*, we formulate the optimization problem of LeanContext as follows:

min. (1 − α) × τ + α × |acc − acc*|    (1)
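As a quick numerical illustration of Equation 1 (with made-up numbers, not results from the paper): if the retrieved context has T = 1,000 tokens and the reduced context has t = 300 tokens, the token ratio is τ = 0.3, and a trade-off weight α balances cost against the accuracy gap. The value of α below is an assumption, not the paper's setting.

def lean_context_objective(T: int, t: int, acc: float, acc_star: float, alpha: float = 0.5) -> float:
    # Objective from Equation 1: (1 - alpha) * tau + alpha * |acc - acc*|
    tau = t / T
    return (1 - alpha) * tau + alpha * abs(acc - acc_star)

# tau = 300 / 1000 = 0.3; objective = 0.5 * 0.3 + 0.5 * |0.52 - 0.55| = 0.165
print(lean_context_objective(T=1_000, t=300, acc=0.52, acc_star=0.55, alpha=0.5))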
Finally, the reduced context C′ and the query (q_i) are given to the pay-per-use LLM API to answer the query. Then, the answer is shown to the respective user via an interactive interface.

For context-based QA, the answers generally reside within a couple of sentences. If the smallest amount of context for a certain question can be identified, the same response can be provided by the LLM at a lower cost, i.e., with fewer prompt tokens. So, identifying the top-k sentences can reduce the context without compromising accuracy. Motivated by this simple idea, we propose LeanContext (Figure 3).

[Figure 3 — workflow diagram: Query → Semantic Search over the Vector Database (key: embedding, value: chunk) → Context → Top-k sentences (RL) + Reduced Context → Combine → ChatGPT (LLM) → Answer]
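The following is a minimal sketch of the reduction step described above: keep the top-k query-relevant sentences intact, compress the remaining runs of sentences with a free open-source summarizer, and stitch everything back together in the original sentence order. The naive sentence splitter, the T5-base summarizer, and the fixed k ratio are illustrative stand-ins; in LeanContext itself, the top-k ratio is chosen adaptively by the RL agent (trained with a reward derived from Equation 1) rather than fixed.

import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model
reducer = pipeline("summarization", model="t5-base")    # free open-source reducer (stand-in)

def reduce_context(context: str, query: str, k_ratio: float = 0.1) -> str:
    sentences = [s.strip() for s in context.split(".") if s.strip()]  # naive sentence splitter
    vecs = embedder.encode(sentences, normalize_embeddings=True)
    v_q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vecs @ v_q

    k = max(1, int(round(k_ratio * len(sentences))))     # in LeanContext, k_ratio comes from the RL agent
    top_idx = set(np.argsort(-scores)[:k].tolist())

    pieces, buffer = [], []
    for i, sent in enumerate(sentences):
        if i in top_idx:
            if buffer:                                    # compress the run of less important sentences
                pieces.append(reducer(". ".join(buffer), max_length=40, min_length=5)[0]["summary_text"])
                buffer = []
            pieces.append(sent)                           # keep the query-relevant sentence intact
        else:
            buffer.append(sent)
    if buffer:
        pieces.append(reducer(". ".join(buffer), max_length=40, min_length=5)[0]["summary_text"])
    return ". ".join(pieces)                              # reduced context C', in original order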
Table 2: Comparison on random 100 samples from the BBCNews dataset. Number of chunks, N = 8.

For query-based context retrieval settings, we evaluate our approach against the baseline approaches. Due to the cost of the OpenAI model and the current usage limits, we generate queries using the QA generation method and consider random 100 query samples for evaluation. The evaluation result is shown in Table 2.

We observe that generating a query-aware summary from the document-query pair using the OpenAI LLM (Laskar et al. 2023) performs better (ROUGE-1 of 0.5801) than query-aware open-source baseline methods such as T5-base (0.3993), but at a high cost; it also contributes more cost than the original context (10.64% more). We also observe that the query-unaware LLM using a different hard prompt (Gilbert et al. 2023) performs even worse, with a ROUGE-1 score of 0.4729 and a 41.66% cost overhead. Additionally, adding top-k(0.1) uplifts the ROUGE-1 score of T5-base by ∼12.35%. With adaptive top-k (Top-k(RL)), and with adaptive top-k (Top-k(RL)) preserving sentence order, it performs even better while reducing the cost by 65% ∼ 74%. We observe the same scenario for the other models too. Further investigation with a different number of chunks for the same setting shows similar performance compared to the original context with no text reduction, outperforming existing open-source models with 37% cost savings. CQSumDP (Laskar et al. 2023) achieves better accuracy at a greater cost (it adds 15.36% more cost) by reducing the context to the minimal content needed to get the right answer, compared to the original context. Throughout the experiments, we observe the same effect and conclude that the zero-shot performance of GPT-4-like LLMs is better with a concise and query-relevant context (CQSumDP) than with a context containing a lot of irrelevant information. Although Semantic Compression (Gilbert et al. 2023) uses an LLM to minimize the context, due to the context-agnostic nature of its prompt, it performs even worse than LeanContext and adds an additional ∼72% cost.

Top-k is all you need: We observe an interesting phenomenon when combining LeanContext with existing open-source models. As these open-source models are not trained on new domain data, combining 10% top-k sentences with the open-source models boosts the QA system's performance by 5.41% ∼ 17.11% for the Arxiv dataset and by 8.11% ∼ 12.35% for the BBCNews dataset, as shown in Table 3. In addition, using top-k on the fly with a generic summarizer adds an extra advantage to the overall system. LeanContext, with its top-k sentences selected using RL and a reduced version of the other sentences in between them, performs even better than the fixed (10%) top-k, by 13.22% ∼ 24.59% for the Arxiv dataset and 10.94% ∼ 15.39% for the BBCNews dataset (Table 3). LeanContext with the T5-base model outperforms all the existing approaches, including the no-reduction method.

Text Reduction Method | Avg. Total tokens | Avg. Prompt tokens | Avg. Summary tokens | Avg. Completion tokens | ROUGE-1 | ROUGE-2 | ROUGE-L | Cost Savings (%)
Number of chunks, N = 4
Context (Original) | 417 | 384 | 0 | 33 | 0.5489 | 0.4183 | 0.5327 | 0.00
CQSumDP | 496 | 71 | 398 | 27 | 0.5569 | 0.4156 | 0.5389 | -18.94
Semantic Compression | 665 | 159 | 479 | 27 | 0.4916 | 0.3328 | 0.4684 | -59.47
T5-base | 74 | 50 | 0 | 24 | 0.3921 | 0.2555 | 0.3678 | 82.25
SBert | 163 | 136 | 0 | 27 | 0.4626 | 0.3287 | 0.4452 | 60.91
SC (reduction = 0.50) | 263 | 235 | 0 | 28 | 0.4726 | 0.3250 | 0.4507 | 36.93
Number of chunks, N = …
Context (Original) | 930 | 892 | 0 | 38 | 0.5423 | 0.4072 | 0.5246 | 0.00
CQSumDP | 1008 | 74 | 906 | 28 | 0.5849 | 0.4440 | 0.5644 | -8.39
Semantic Compression | 1274 | 258 | 987 | 29 | 0.4862 | 0.3343 | 0.4642 | -36.99
T5-base | 74 | 51 | 0 | 23 | 0.3903 | 0.2584 | 0.3688 | 92.04
SBert | 181 | 153 | 0 | 28 | 0.4164 | 0.2807 | 0.3967 | 80.54
SC (reduction = 0.50) | 559 | 527 | 0 | 32 | 0.4825 | 0.3312 | 0.4591 | 39.89
LeanContext | 279 | 252 | 0 | 27 | 0.5318 | 0.4069 | 0.5205 | 70.00

Table 4: Evaluation of the RL model on different numbers of chunks on random 100 samples from the BBCNews dataset. Our RL agent, trained with N = 8, shows promising results when applied to different chunk numbers.
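The cost-savings column in these tables is consistent with computing savings from the average total token counts relative to the original retrieved context; a small sketch, using two rows of Table 4 as examples:

def cost_savings_percent(method_total_tokens: float, original_total_tokens: float) -> float:
    # Savings relative to sending the original retrieved context, in percent.
    # Negative values mean the method costs more than the original context.
    return (1 - method_total_tokens / original_total_tokens) * 100

# From the second group of Table 4 (original context: 930 average total tokens):
print(round(cost_savings_percent(279, 930), 2))   # 70.0  (LeanContext)
print(round(cost_savings_percent(1008, 930), 2))  # -8.39 (CQSumDP adds cost)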
Ablation Study

We investigate whether the RL model's performance depends on the number of chunks. We train the RL model using N = 8 and run the same model on N = 2, 4, and 10. The results are shown in Table 4. We observe that although N changes, the RL model still outperforms the other, less expensive baseline methods.
Another interesting observation lies in making N as large as possible for successful retrieval of context when using LeanContext, whereas previously, without LeanContext, N had to be kept small to reduce the cost of LLM usage. In Figure 5, we show how the action (top-k ratio) is chosen by the RL agent given the context and query for each of the 100 query samples. The adaptive top-k ratio chosen by our RL agent varies over the query samples to achieve adaptive reduction of the context.

Figure 5: Adaptive-k ratio selected by the BBCNews RL agent based on queries.

We also empirically evaluate the state function with different associations between the context embedding (v_c) and the query embedding (v_q): cosine similarity (state = v_c · v_q / (‖v_c‖ ‖v_q‖)), concatenation (state = v_c ⊕ v_q), and subtraction (state = v_c − v_q). Among them, subtraction performs best (Table 5). In this experiment, we only select the top-k sentences using RL, to evaluate the impact of the state definition on performance.

Compression Method | Avg. Total tokens | Avg. Prompt tokens | Avg. Completion tokens | ROUGE-1 | ROUGE-2 | ROUGE-L
None | 538 | 514 | 24 | 0.3945 | 0.2904 | 0.3764
LeanContext (state = v_c · v_q / (‖v_c‖ ‖v_q‖)) | 296 | 275 | 21 | 0.3443 | 0.2369 | 0.3259
LeanContext (state = v_c − v_q) | 331 | 308 | 23 | 0.3553 | 0.2535 | 0.3388
LeanContext (state = v_c ⊕ v_q) | 284 | 265 | 19 | 0.3400 | 0.2370 | 0.3223

Table 5: Comparison of RL state functions on the Arxiv dataset.
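A minimal sketch of the three candidate state representations compared in Table 5, assuming v_c and v_q are the already computed context and query embeddings; the embedding dimensionality and the policy network built on top of this state are not specified here and are left as assumptions.

import numpy as np

def rl_state(v_c: np.ndarray, v_q: np.ndarray, kind: str = "subtraction") -> np.ndarray:
    # Candidate state functions evaluated in the ablation (Table 5).
    if kind == "cosine":          # scalar: v_c . v_q / (||v_c|| ||v_q||)
        sim = float(v_c @ v_q / (np.linalg.norm(v_c) * np.linalg.norm(v_q)))
        return np.array([sim])
    if kind == "concatenation":   # v_c concatenated with v_q
        return np.concatenate([v_c, v_q])
    if kind == "subtraction":     # v_c - v_q (best performing variant in Table 5)
        return v_c - v_q
    raise ValueError(f"unknown state kind: {kind}")

# The RL agent maps this state to a discrete top-k ratio action, e.g. a value such as
# 0.0, 0.1, ..., 0.5; the exact action set and learning algorithm are the paper's design
# details and are not reproduced in this sketch.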
Conclusion and Future Work

In this paper, we propose LeanContext, a cost-efficient, query-aware context reduction system to reduce the cost associated with LLM API usage. Despite the reduction in prompt tokens, LeanContext achieves similar or better performance compared to using the original context with no reduction. The advantage of top-k selection is that it can be plugged into any existing summarization method of a domain-based QA system to boost the overall performance, according to our experimental results. Here, we only focus on text-based context as a domain; we will explore other domains in our future work.

References

Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.
Chen, L.; Zaharia, M.; and Zou, J. 2023. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv preprint arXiv:2305.05176.
ChromaDB. 2023. ChromaDB. https://www.trychroma.com/. Accessed: June 20, 2023.
Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Gilbert, H.; Sandborn, M.; Schmidt, D. C.; Spencer-Smith, J.; and White, J. 2023. Semantic Compression With Large Language Models. arXiv preprint arXiv:2304.12512.
GPT-3 Cost. 2023. GPT-3 cost estimation for real applications. https://neoteric.eu/blog/how-much-does-it-cost-to-use-gpt-models-gpt-3-pricing-explained/. Accessed: July 11, 2023.
Harrison Chase. 2022. LangChain. https://github.com/hwchase17/langchain. 2022-10-17.
Jiao, W.; Wang, W.; Huang, J.-t.; Wang, X.; and Tu, Z. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.
Laskar, M. T. R.; Rahman, M.; Jahan, I.; Hoque, E.; and Huang, J. 2023. CQSumDP: A ChatGPT-Annotated Resource for Query-Focused Abstractive Summarization Based on Debatepedia. arXiv preprint arXiv:2305.06147.
Li, Y. 2023. Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering. arXiv preprint arXiv:2304.12102.
Ling, C.; Zhao, X.; Lu, J.; Deng, C.; Zheng, C.; Wang, J.; Chowdhury, T.; Li, Y.; Cui, H.; Zhao, T.; et al. 2023. Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models. arXiv preprint arXiv:2305.18703.
Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; He, H.; Li, A.; He, M.; Liu, Z.; et al. 2023. Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.
Luo, Z.; Xie, Q.; and Ananiadou, S. 2023. ChatGPT as a factual inconsistency evaluator for abstractive text summarization. arXiv preprint arXiv:2303.15621.
Ma, C.; Wu, Z.; Wang, J.; Xu, S.; Wei, Y.; Liu, Z.; Guo, L.; Cai, X.; Zhang, S.; Zhang, T.; et al. 2023. ImpressionGPT: an iterative optimizing framework for radiology report summarization with ChatGPT. arXiv preprint arXiv:2304.08448.
Miller, D. 2019. Leveraging BERT for extractive text summarization on lectures. arXiv preprint arXiv:1906.04165.
Pinecone. 2023. Vector database. https://www.pinecone.io/learn/vector-database/. Accessed: June 20, 2023.
Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
Schlag, I.; Sukhbaatar, S.; Celikyilmaz, A.; Yih, W.-t.; Weston, J.; Schmidhuber, J.; and Li, X. 2023. Large language model programs. arXiv preprint arXiv:2305.05364.
Tan, Y.; Min, D.; Li, Y.; Li, W.; Hu, N.; Chen, Y.; and Qi,
G. 2023. Evaluation of ChatGPT as a question answering
system for answering complex questions. arXiv preprint
arXiv:2303.07992.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.;
Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale,
S.; et al. 2023. Llama 2: Open foundation and fine-tuned
chat models. arXiv preprint arXiv:2307.09288.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.;
Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-
thought prompting elicits reasoning in large language mod-
els. Advances in Neural Information Processing Systems,
35: 24824–24837.
Wu, J.; Antonova, R.; Kan, A.; Lepert, M.; Zeng, A.; Song,
S.; Bohg, J.; Rusinkiewicz, S.; and Funkhouser, T. 2023.
Tidybot: Personalized robot assistance with large language
models. arXiv preprint arXiv:2305.05658.
Yang, X.; Li, Y.; Zhang, X.; Chen, H.; and Cheng, W. 2023.
Exploring the limits of chatgpt for query or aspect-based text
summarization. arXiv preprint arXiv:2302.08081.
Zhang, H.; Liu, X.; and Zhang, J. 2023a. Extractive summa-
rization via chatgpt for faithful summary generation. arXiv
preprint arXiv:2304.04193.
Zhang, H.; Liu, X.; and Zhang, J. 2023b. SummIt: It-
erative Text Summarization via ChatGPT. arXiv preprint
arXiv:2305.14835.