RAG with OpenAI for Financial Analysis


RAG_with_OpenAI

September 9, 2024

1 Retrieval from OpenAI: file_search


Retrieval in OpenAI is performed using the "file_search" tool when creating your assistant or a
thread.
(It's declared in the same place as the code_interpreter tool, if you are familiar with it.)
What does the file_search tool do?
• The file search tool is the RAG (Retrieval-Augmented Generation) pipeline developed by OpenAI in their Assistants API.
• It enhances the assistant's capabilities by incorporating external knowledge.
• When using this tool, OpenAI processes your documents: it parses them, chunks them, creates and stores embeddings in a vector store, and then retrieves relevant chunks using both keyword and semantic search.
With this tool, OpenAI applies several built-in retrieval best practices to extract data efficiently from
a document:
• Rewrites user queries to optimize them for search.
• Breaks down complex user queries into multiple searches it can run in parallel.
• Runs both keyword and semantic searches across both assistant and thread vector stores.
• Reranks search results to pick the most relevant ones before generating the final response.
Vector Store
OpenAI also provides vector store creation when uploading files, with a default chunking strategy:
• max_chunk_size_tokens: 800 (can be customized)
• chunk_overlap_tokens: 400 (can be customized)
• It uses the text-embedding-3-large model
• It returns 20 chunks (can be customized) with their relevancy scores to the user query
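As a sketch, the defaults listed above can be restated as the `chunking_strategy` payload shape that the vector store file endpoints accept (the same shape is used later in section 7.3; the values here are just the defaults written out):

```python
# Default static chunking strategy, restated as the payload shape
# accepted by the vector store file endpoints.
default_chunking = {
    "type": "static",
    "static": {
        "max_chunk_size_tokens": 800,  # default; can be customized
        "chunk_overlap_tokens": 400,   # default; must not exceed half the chunk size
    },
}
```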

[ ]: !pip install openai -q

[2]: from google.colab import userdata


openai_api_key = userdata.get('OPENAI_API_KEY')

from openai import OpenAI


client = OpenAI(api_key=openai_api_key)

2 Upload file to OpenAI
• Download the file from the internet to local storage
• Create a vector store first
[ ]: !wget "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf" -O amzn_2023_10k.pdf

2.1 Create a Vector Store


[12]: vector_store = client.beta.vector_stores.create(name="Financial Analyst")

[13]: vector_store

[13]: VectorStore(id='vs_lIw5lPrbgo8oX8RqwzgFJr0j', created_at=1725822560,


file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0,
total=0), last_active_at=1725822560, metadata={}, name='Financial Analyst',
object='vector_store', status='completed', usage_bytes=0, expires_after=None,
expires_at=None)

[6]: # This is how you can find the default chunking values in OpenAI:

vector_store_files.data[0].chunking_strategy

[6]: StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static')

[ ]: # # List a vector store's files

# vector_store_files = client.beta.vector_stores.files.list(
#     vector_store_id='vs_lIw5lPrbgo8oX8RqwzgFJr0j'
# )
# print(vector_store_files)

[67]: # # Delete a file from a vector store

# deleted_vector_store_file = client.beta.vector_stores.files.delete(
#     vector_store_id='vs_tX1Y8hrMjZZArJOIVrWjkcKk',
#     file_id='file-h3Kn24nP0VMt5YRKKPTac3mF'
# )
# print(deleted_vector_store_file)

VectorStoreFileDeleted(id='file-h3Kn24nP0VMt5YRKKPTac3mF', deleted=True,
object='vector_store.file.deleted')

2.2 Upload the file to OpenAI using the Vector Store
[14]: # Upload files to OpenAI
file_paths = ["./amzn_2023_10k.pdf"]
file_streams = [open(path, "rb") for path in file_paths]

# Use the upload_and_poll endpoint to upload the files, add them to the vector
# store, and poll the status of the file batch for completion.
# To ensure all contents have finished uploading, check the status: it must
# be "completed" (instead of "in_progress").

file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id, files=file_streams
)

print(file_batch.status)
print(file_batch.file_counts)

completed
FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1)

3 Create an assistant with the vector store information


[15]: from openai import OpenAI

assistant = client.beta.assistants.create(
    name="Financial Analyst",
    instructions="""
    You are an expert financial analyst.
    Leverage your extensive knowledge of accounting principles and financial reporting standards to provide accurate and insightful answers to questions regarding financial statements.
    Analyze details such as income statements, balance sheets, cash flow statements, and notes to the accounts to support your explanations.
    """,
    model="gpt-4o-mini",
    tools=[{"type": "file_search"}],  # before April 2024, this tool was called "retrieval"
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

# If you have already created your assistant without a vector store and you
# want to add one, update the assistant like this:

# assistant = client.beta.assistants.update(
#     assistant_id=assistant.id,
#     tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
# )

4 Create a thread
[16]: thread = client.beta.threads.create()

You can also create a thread with a vector store. To do that, you need to: * Upload the raw file
to OpenAI * Then create a thread and attach the uploaded file to a message * Under the
hood (in OpenAI), this creates a vector store for the thread.
# Upload the file to OpenAI
uploaded_file = client.files.create(
    file=open("./amzn_2023_10k.pdf", "rb"), purpose="assistants"
)

# Create a thread and attach the file to the message
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": "What was the net income of 2023?",
            "attachments": [
                {"file_id": uploaded_file.id, "tools": [{"type": "file_search"}]}
            ],
        }
    ]
)

# The thread now has a vector store with that file in its tool resources.
print(thread.tool_resources.file_search)
• When creating a run on this thread, the file search tool will query both the assistant's vector
store and the thread's.

5 Create a Run
[29]: run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
print(run.status)

queued

6 Start chatting
For the moment, no question has been asked:
[20]: messages = client.beta.threads.messages.list(thread_id=thread.id)

[21]: nbr_messages = len(messages.data)
interaction = nbr_messages
for message in messages.data:
    content = message.content[0]
    if content.type == 'text':
        print(f"Interaction #{interaction}")
        print(f"ROLE={message.role}")
        print(content.text.value)
        print("\n")
        print("--"*50)
    interaction -= 1

Interaction #1
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?

--------------------------------------------------------------------------------
--------------------

6.1 Methods
Here are a few helper methods for asking the assistant questions (creating a message within a
thread and initiating a run) and for parsing the assistant's responses.
[24]: def send_message(message_user, my_thread_id, my_assistant_id):
    """
    Create a message with the user request, and then create a run that will
    ask the assistant to answer the requested question.
    """
    params = {
        "thread_id": my_thread_id,
        "role": "user",
        "content": message_user,
    }
    thread_message = client.beta.threads.messages.create(**params)

    run = client.beta.threads.runs.create(
        thread_id=my_thread_id,
        assistant_id=my_assistant_id,
    )
    return run

def get_response(my_thread_id):
    return client.beta.threads.messages.list(thread_id=my_thread_id)

def print_messages(messages):
    nbr_messages = len(messages.data)
    interaction = nbr_messages
    for message in messages.data:
        content = message.content[0]
        print(f"Interaction #{interaction} and {content.type}")
        if content.type == 'text':
            print(f"ROLE={message.role}")
            print(content.text.value)
            print("\n")
            print("--"*50)
        else:
            print(content.type)
        interaction -= 1

6.2 Question: “What was the net income of 2023?”


[46]: message_user = "What was the net income of 2023?"
run = send_message(message_user, thread.id, assistant.id)
print(run.id, run.status)
run.tools

run_Lj3Wky79Kbv1kbTxb5BGveSE queued

[46]: [FileSearchTool(type='file_search', file_search=FileSearch(max_num_results=None,
ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21',
score_threshold=0.0)))]

Give the assistant a few seconds to think before fetching the response:
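Rather than waiting a fixed amount of time, you can poll the run's status until it reaches a terminal state. A minimal sketch (the `wait_for_run` helper name is mine, not part of the notebook; it assumes the same `client`, `thread`, and `run` objects used above):

```python
import time

def wait_for_run(client, thread_id, run_id, timeout=60, poll_interval=1.0):
    """Poll a run until it leaves the 'queued'/'in_progress' states."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status not in ("queued", "in_progress"):
            return run  # e.g. 'completed', 'failed', 'expired'
        time.sleep(poll_interval)
    raise TimeoutError(f"Run {run_id} did not finish within {timeout}s")

# Usage (hypothetical):
# run = wait_for_run(client, thread.id, run.id)
```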

[25]: messages = get_response(thread.id)


print_messages(messages)

Interaction #3 and text


ROLE=assistant
The net income for the year 2023 was $30,425 million (or $30.4
billion)【5:1†source】.

--------------------------------------------------------------------------------
--------------------
Interaction #2 and text
ROLE=user
What was the net income of 2023?

--------------------------------------------------------------------------------
--------------------
Interaction #1 and text
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?

--------------------------------------------------------------------------------
--------------------

6.3 Get the specific information about the retrieved chunks


[ ]: run_steps = client.beta.threads.runs.steps.list(
    thread_id=thread.id,
    run_id='run_mPnuN4vyVaoYkU9IcpkQaSi1'
    # run_id = 'run_Lj3Wky79Kbv1kbTxb5BGveSE'
)

print(run_steps)

[ ]: for step in run_steps.data:
    print(step.id, step.step_details)
    # .tool_calls[0].file_search.results[0].content)

# Results
# step_GvSdx4K5BJUlFZruVTpwo3OP MessageCreationStepDetails(message_creation=MessageCreati...
# step_oPkFeY8F1ny1i0zv6rWaMLyp ToolCallsStepDetails(tool_calls=[FileSearchToolCall(id...

7 Advanced Specifics
7.1 Get the relevance score for each chunk
Taking the previous question and answer as an example:
In the last step, "ToolCallsStepDetails", you can get the relevance score to the user's query for each
of the 20 retrieved chunks. This number can be changed via file_search.max_num_results (e.g. set it to 5)
when creating the assistant or the run.
[ ]: # for step = 'step_oPkFeY8F1ny1i0zv6rWaMLyp'
step.step_details.tool_calls[0].file_search.results
# Even if the content is None here, see the script below to learn how to get
# the raw chunk data.

#Results

# [FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.8368480370110275, content=None),
#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.8209594035063414, content=None),
#  ....
#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.3300425731103375, content=None),
#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.3284635772545966, content=None)]
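A minimal sketch of capping the number of retrieved chunks, assuming the file_search tool payload accepts max_num_results as described above (the `file_search_tool` name is mine):

```python
# Cap retrieval at 5 chunks instead of the default 20.
file_search_tool = {
    "type": "file_search",
    "file_search": {"max_num_results": 5},
}

# Pass it when creating the assistant (or the run), e.g.:
# assistant = client.beta.assistants.create(..., tools=[file_search_tool], ...)
```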

7.2 Get the detailed chunks retrieved by OpenAI to answer your question
Get the run and step ids. For the step id, use the one that retrieved the chunks, i.e. the one whose
object is ToolCallsStepDetails.
[ ]: run_step = client.beta.threads.runs.steps.retrieve(
    thread_id=thread.id,
    run_id='run_mPnuN4vyVaoYkU9IcpkQaSi1',
    step_id="step_oPkFeY8F1ny1i0zv6rWaMLyp",
    include=["step_details.tool_calls[*].file_search.results[*].content"]
)

print(run_step)

[ ]: run_step.step_details.tool_calls[0].file_search.results[0].content

# [FileSearchResultContent(text='Consolidated\nNet sales $ 469,822 $ 513,983 $ 574,785 \nOperating expenses 444,943 501,735 537,933 \nOperating income 24,879 12,248 36,852 \n....

7.3 Chunk Strategy


Two things you can modify: * max_chunk_size_tokens * chunk_overlap_tokens
When adding files to the vector store, you can override these two values:
[ ]: # vector_store_file = client.beta.vector_stores.files.create(
#     vector_store_id='vs_lIw5lPrbgo8oX8RqwzgFJr0j',
#     file_id='file-JmeROw0FJQT7vjOcvHHUHzT8',
#     chunking_strategy={'type': 'static', 'static': {'max_chunk_size_tokens': 1240, 'chunk_overlap_tokens': 500}}
# )
# print(vector_store_file)

8 Summarize
8.1 Question: “Can you summarize the document?”
[26]: message_user = "Can you summarize the document?"
send_message(message_user, thread.id, assistant.id)

[26]: Run(id='run_mTrttRcyqi2CE6P4su8RbIO5',
assistant_id='asst_z1ap2fxFg6yxCzKKvhs8DOtc', cancelled_at=None,
completed_at=None, created_at=1725822920, expires_at=1725823520, failed_at=None,
incomplete_details=None, instructions='\n You are a expert skilled financial
analyst. \n Leverage your extensive knowledge of accounting principles and
financial reporting standards to provide accurate and insightful answers to
questions regarding financial statements. \n Analyze details such as income
statements, balance sheets, cash flow statements, and notes to the accounts to
support your explanations.\n ', last_error=None, max_completion_tokens=None,
max_prompt_tokens=None, metadata={}, model='gpt-4o-mini', object='thread.run',
parallel_tool_calls=True, required_action=None, response_format='auto',
started_at=None, status='queued', thread_id='thread_Sp02A7D6nY5YEgsmxb1NMZDa',
tool_choice='auto', tools=[FileSearchTool(type='file_search',
file_search=FileSearch(max_num_results=None,
ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21',
score_threshold=0.0)))], truncation_strategy=TruncationStrategy(type='auto',
last_messages=None), usage=None, temperature=1.0, top_p=1.0, tool_resources={})

[28]: messages = get_response(thread.id)


print_messages(messages)

Interaction #5 and text


ROLE=assistant
Here’s a summary of the document, which appears to be a 10-K filing for
Amazon.com, Inc. for the year ended December 31, 2023:

1. **Financial Performance**:
- Amazon reported total net sales of $574.785 billion for 2023, representing
a 12% increase from the prior year.
- The company achieved a net income of $30.425 billion in 2023, a significant
recovery from a net loss of $2.722 billion in 2022【5:1†source】【9:15†source】.
- Operating income improved to $36.852 billion from $12.248 billion in the
previous year【9:9†source】.
- Earnings per share for 2023 were $2.95 (basic) and $2.90
(diluted)【9:15†source】.

2. **Sales and Segments**:
- Sales growth was driven mainly by increased unit sales, third-party seller
services, and advertising sales【9:8†source】.
- The breakdown of sales showed that North America contributed 61%,
international sales accounted for 23%, and AWS (Amazon Web Services) made up 16%
of total sales【9:8†source】.

3. **Operating Expenses**:
- Operating expenses totaled $537.933 billion, an increase from $501.735
billion in 2022. Notably, technology and infrastructure costs grew
significantly【9:8†source】【9:9†source】.

4. **Taxation**:
- The provision for income taxes was $7.120 billion, demonstrating the
company's substantial tax obligations arising from its income
generation【9:15†source】【9:7†source】.

5. **Capital Expenditures and Cash Flow**:
- Amazon reported strong cash flow from operations of $84.946 billion for
2023 and a positive free cash flow of $36.813 billion【9:1†source】.

6. **Market Exposure**:
- The report also discusses Amazon's exposure to market risks, including
fluctuations in foreign exchange rates and interest
rates【9:9†source】【9:6†source】.

This overview provides key insights into Amazon’s financial health, operational
efficiency, and overall performance for 2023 as reflected in their 10-K filing.
If you need more detailed information on specific sections or figures, feel free
to ask!

--------------------------------------------------------------------------------
--------------------
Interaction #4 and text
ROLE=user
Can you summarize the document?

--------------------------------------------------------------------------------
--------------------
Interaction #3 and text
ROLE=assistant
The net income for the year 2023 was $30,425 million (or $30.4
billion)【5:1†source】.

--------------------------------------------------------------------------------
--------------------
Interaction #2 and text
ROLE=user
What was the net income of 2023?

--------------------------------------------------------------------------------
--------------------
Interaction #1 and text
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?

--------------------------------------------------------------------------------
--------------------

