RAG with OpenAI for Financial Analysis
September 9, 2024
2 Upload file to OpenAI
• Download the file from the internet to local storage
• Create a vector store first
[ ]: !wget "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf" -O amzn_2023_10k.pdf
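The `vector_store` object used throughout this notebook has to exist before any file can be attached to it. A minimal sketch of creating it, assuming the same `client.beta.vector_stores` namespace the rest of the notebook uses (other SDK releases may expose it elsewhere). The helper name `create_vector_store` and the default store name are hypothetical:

```python
def create_vector_store(client, name="Financial Statements"):
    """Create an empty vector store and return it.

    `client` is expected to be an `openai.OpenAI()` instance. The default
    name is an illustrative assumption, not taken from the notebook.
    """
    return client.beta.vector_stores.create(name=name)


# Typical use (requires the OPENAI_API_KEY environment variable):
#   from openai import OpenAI
#   client = OpenAI()
#   vector_store = create_vector_store(client)
```

Passing the client in as an argument keeps the helper free of global state, so it can be exercised with a stand-in object.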
[13]: vector_store
[6]: # Inspect the chunking parameters that OpenAI applies by default:
vector_store_files.data[0].chunking_strategy
[6]: StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static')
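The defaults shown above (800-token chunks with a 400-token overlap) can be overridden when a file is added to the vector store. The sketch below builds such a `chunking_strategy` payload. The helper name is hypothetical, and the bounds it checks are the limits I understand the API documents for the static strategy, so treat them as an assumption:

```python
def static_chunking_strategy(max_chunk_size_tokens=800, chunk_overlap_tokens=400):
    """Build a static `chunking_strategy` payload for a vector store file.

    Assumed limits: chunk size between 100 and 4096 tokens, and an overlap
    of at most half the chunk size.
    """
    if not 100 <= max_chunk_size_tokens <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if not 0 <= chunk_overlap_tokens <= max_chunk_size_tokens // 2:
        raise ValueError("chunk_overlap_tokens must be between 0 and half the chunk size")
    return {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": max_chunk_size_tokens,
            "chunk_overlap_tokens": chunk_overlap_tokens,
        },
    }


# Hypothetical use at upload time, assuming the installed SDK version
# accepts the `chunking_strategy` argument:
#   client.beta.vector_stores.file_batches.upload_and_poll(
#       vector_store_id=vector_store.id,
#       files=file_streams,
#       chunking_strategy=static_chunking_strategy(1200, 200),
#   )
```

Larger chunks keep long tables from a 10-K together, while a smaller overlap reduces duplicated text across chunks. Which trade-off works best depends on the document.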
VectorStoreFileDeleted(id='file-h3Kn24nP0VMt5YRKKPTac3mF', deleted=True,
object='vector_store.file.deleted')
2.2 Upload the file to OpenAI using the Vector Store
[14]: # Upload files to OpenAI
file_paths = ["./amzn_2023_10k.pdf"]
file_streams = [open(path, "rb") for path in file_paths]
# upload_and_poll uploads the files, adds them to the vector store, and polls
# the file batch until processing ends.
# To make sure all content has finished uploading, check the status: it must
# be "completed" (not "in_progress").
file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id, files=file_streams
)
print(file_batch.status)
print(file_batch.file_counts)
completed
FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1)
assistant = client.beta.assistants.create(
    name="Financial Analyst",
    instructions="""
    You are a expert skilled financial analyst.
    Leverage your extensive knowledge of accounting principles and financial reporting standards to provide accurate and insightful answers to questions regarding financial statements.
    Analyze details such as income statements, balance sheets, cash flow statements, and notes to the accounts to support your explanations.
    """,
    model="gpt-4o-mini",
    tools=[{"type": "file_search"}],  # before April 2024 this tool was called "retrieval"
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
# If the assistant was created without a vector store, attach one afterwards
# by updating the assistant:
# assistant = client.beta.assistants.update(
# assistant_id=assistant.id,
# tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
# )
4 Create a thread
[16]: thread = client.beta.threads.create()
You can also create a thread that has its own vector store. To do that:
• Upload the raw file to OpenAI.
• Create a thread and attach the uploaded file to a message.
• OpenAI then creates a vector store for the thread under the hood.
# Upload the file to OpenAI
uploaded_file = client.files.create(
    file=open("./amzn_2023_10k.pdf", "rb"), purpose="assistants"
)
# Create a thread and attach the file to the message
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": "What was the net income of 2023?",
            "attachments": [
                {"file_id": uploaded_file.id, "tools": [{"type": "file_search"}]}
            ],
        }
    ]
)
# The thread now has a vector store with that file in its tool resources.
print(thread.tool_resources.file_search)
• When a run is created on this thread, the file search tool queries both the assistant's vector
store and the thread's vector store.
5 Create a Run
[29]: run = client.beta.threads.runs.create(
thread_id=thread.id,
assistant_id=assistant.id,
)
print(run.status)
queued
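A freshly created run starts in the `queued` state, so the answer is not available yet. The sketch below polls until the run stops waiting. It takes a zero-argument callable so that it does not depend on a particular client object. The names `wait_for_run` and `WAITING_STATUSES` are hypothetical, and the status list is my reading of the run lifecycle rather than something stated in this notebook:

```python
import time

# Statuses in which the run is still being processed (assumed list).
WAITING_STATUSES = {"queued", "in_progress", "cancelling"}


def wait_for_run(fetch_run, poll_seconds=1.0, timeout_seconds=120.0):
    """Call `fetch_run()` repeatedly until the run leaves the waiting states.

    `fetch_run` must return an object with a `status` attribute.
    Raises TimeoutError if the run is still waiting after `timeout_seconds`.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        run = fetch_run()
        if run.status not in WAITING_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {run.status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Typical use with the objects from this notebook:
#   run_id = run.id
#   run = wait_for_run(
#       lambda: client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run_id)
#   )
#   print(run.status)
```

The returned status should still be checked: `completed` means the answer can be read from the thread, while `failed`, `expired`, or `requires_action` need separate handling.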
6 Start chatting
At this point I have not asked any question yet:
[20]: messages = client.beta.threads.messages.list(thread_id=thread.id)
[21]: nbr_messages = len(messages.data)
interaction = nbr_messages
for message in messages.data:
    content = message.content[0]
    if content.type == 'text':
        print(f"Interaction #{interaction}")
        print(f"ROLE={message.role}")
        print(content.text.value)
        print("\n")
        print("--"*50)
    interaction-=1
Interaction #1
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?
--------------------------------------------------------------------------------
--------------------
6.1 Methods
The helper functions below send a question to the assistant (by creating a message in a thread and
then starting a run), fetch the messages of the thread, and print the assistant's responses in a
readable form.
[24]: def send_message(message_user, my_thread_id, my_assistant_id):
    """
    Add the user's request to the thread as a message, then create a run so
    that the assistant processes the question.
    """
    params = {
        "thread_id": my_thread_id,
        "role": "user",
        "content": message_user,
    }
    thread_message = client.beta.threads.messages.create(**params)
    run = client.beta.threads.runs.create(
        thread_id=my_thread_id,
        assistant_id=my_assistant_id,
    )
    return run


def get_response(my_thread_id):
    """Return all messages of the thread."""
    return client.beta.threads.messages.list(thread_id=my_thread_id)
def print_messages(messages):
    nbr_messages = len(messages.data)
    interaction = nbr_messages
    for message in messages.data:
        content = message.content[0]
        print(f"Interaction #{interaction} and {content.type}")
        if content.type == 'text':
            print(f"ROLE={message.role}")
            print(content.text.value)
            print("\n")
            print("--"*50)
        else:
            print(content.type)
        interaction-=1
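Printing the whole thread is useful for debugging, but often only the newest answer matters. A minimal sketch of a companion helper, assuming the message list is ordered newest first (the order the printed interactions in this notebook follow). The name `latest_assistant_text` is hypothetical:

```python
def latest_assistant_text(messages):
    """Return the text of the most recent assistant message that has text.

    Assumes `messages.data` is ordered newest first. Returns None when no
    assistant message with a text part is found.
    """
    for message in messages.data:
        if message.role != "assistant":
            continue
        for part in message.content:
            if part.type == "text":
                return part.text.value
    return None


# Typical use once the run has completed:
#   print(latest_assistant_text(get_response(thread.id)))
```

Unlike `print_messages`, this helper scans every content part of a message instead of only the first one.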
run_Lj3Wky79Kbv1kbTxb5BGveSE queued
--------------------------------------------------------------------------------
--------------------
Interaction #2 and text
ROLE=user
What was the net income of 2023?
--------------------------------------------------------------------------------
--------------------
Interaction #1 and text
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?
--------------------------------------------------------------------------------
--------------------
print(run_steps)
# Results
# step_GvSdx4K5BJUlFZruVTpwo3OP MessageCreationStepDetails(message_creation=MessageCreati...
# step_oPkFeY8F1ny1i0zv6rWaMLyp ToolCallsStepDetails(tool_calls=[FileSearchToolCall(id...
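The listing above shows one step per action of the run: a tool call (the file search) and the creation of the answer message. A small sketch for picking out the tool-call step programmatically, assuming `run_steps` is the object returned by `client.beta.threads.runs.steps.list(...)`. Both helper names are hypothetical:

```python
def summarize_steps(run_steps):
    """Return (step id, step type) pairs for the steps of one run."""
    return [(step.id, step.type) for step in run_steps.data]


def find_tool_call_step(run_steps):
    """Return the first step whose type is 'tool_calls', or None."""
    for step in run_steps.data:
        if step.type == "tool_calls":
            return step
    return None


# Typical use with the objects from this notebook:
#   run_steps = client.beta.threads.runs.steps.list(
#       thread_id=thread.id, run_id=run.id
#   )
#   step = find_tool_call_step(run_steps)
```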
7 Advanced Specifics
7.1 Get the relevance score for each chunk
Taking the previous question and answer as an example:
The tool-call step (the one whose details are a “ToolCallsStepDetails” object) gives the relevance
score, with respect to the user's query, for each of the 20 retrieved chunks. The number of retrieved
chunks can be changed, for example to 5, by setting file_search.max_num_results when creating the
assistant or the run.
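To make the `file_search.max_num_results` setting concrete, here is a sketch that builds the tool definition as a plain dictionary. The helper name `file_search_tool` is hypothetical, and the 1 to 50 range and the `ranking_options` shape reflect my understanding of the API, so they are assumptions to verify against the current reference:

```python
def file_search_tool(max_num_results=None, score_threshold=None):
    """Build a file_search tool definition with optional retrieval settings.

    max_num_results: number of chunks to retrieve (assumed range 1 to 50).
    score_threshold: minimum relevance score between 0 and 1.
    """
    tool = {"type": "file_search"}
    options = {}
    if max_num_results is not None:
        if not 1 <= max_num_results <= 50:
            raise ValueError("max_num_results must be between 1 and 50")
        options["max_num_results"] = max_num_results
    if score_threshold is not None:
        if not 0.0 <= score_threshold <= 1.0:
            raise ValueError("score_threshold must be between 0 and 1")
        options["ranking_options"] = {"score_threshold": score_threshold}
    if options:
        tool["file_search"] = options
    return tool


# Hypothetical use when creating a run:
#   run = client.beta.threads.runs.create(
#       thread_id=thread.id,
#       assistant_id=assistant.id,
#       tools=[file_search_tool(max_num_results=5)],
#   )
```

Setting the value on the run overrides it for that run only, while setting it on the assistant changes the default for every run of that assistant.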
[ ]: # For step 'step_oPkFeY8F1ny1i0zv6rWaMLyp':
step.step_details.tool_calls[0].file_search.results
# content is None here because the chunk text is not returned by default;
# section 7.2 shows how to request the raw chunk text.
# Results
# [FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.8368480370110275, content=None),
# FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.8209594035063414, content=None),
# ....
# FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.3300425731103375, content=None),
# FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.3284635772545966, content=None)]
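The scores in this listing range from about 0.84 down to about 0.33, so many of the retrieved chunks are only loosely related to the question. A minimal sketch for keeping only the stronger matches on the client side. The helper name `results_above` and the 0.5 cut-off in the usage comment are illustrative assumptions:

```python
def results_above(results, min_score):
    """Return the results scoring at least `min_score`, best first."""
    kept = [result for result in results if result.score >= min_score]
    return sorted(kept, key=lambda result: result.score, reverse=True)


# Typical use with the step from this notebook:
#   results = step.step_details.tool_calls[0].file_search.results
#   for result in results_above(results, 0.5):
#       print(result.file_name, round(result.score, 3))
```

This filtering happens after retrieval, so it helps with inspection but does not change what the model saw.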
7.2 Get the content of the chunks OpenAI retrieved to answer your question
You need the run id and the step id. The step id is that of the step that retrieved the chunks,
the one whose details are a ToolCallsStepDetails object.
[ ]: run_step = client.beta.threads.runs.steps.retrieve(
    thread_id=thread.id,
    run_id='run_mPnuN4vyVaoYkU9IcpkQaSi1',
    step_id="step_oPkFeY8F1ny1i0zv6rWaMLyp",
    include=["step_details.tool_calls[*].file_search.results[*].content"]
)
print(run_step)
[ ]: run_step.step_details.tool_calls[0].file_search.results[0].content
# )
# print(vector_store_file)
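Once the step has been retrieved with the `include` parameter, each result carries its chunk content as a list of parts. The sketch below joins those parts into one string. It assumes each part has a `type` and a `text` attribute, which is how I understand the SDK models them, and the helper name `chunk_text` is hypothetical:

```python
def chunk_text(result):
    """Return the text of one file search result as a single string.

    Returns an empty string when the content was not requested
    (`result.content` is None) or holds no text parts.
    """
    if not result.content:
        return ""
    return "\n".join(
        part.text for part in result.content if part.type == "text" and part.text
    )


# Typical use with the retrieved step from this notebook:
#   results = run_step.step_details.tool_calls[0].file_search.results
#   print(chunk_text(results[0])[:500])
```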
8 Summarize
8.1 Question: “Can you summarize the document?”
[26]: message_user = "Can you summarize the document?"
send_message(message_user, thread.id, assistant.id)
[26]: Run(id='run_mTrttRcyqi2CE6P4su8RbIO5',
assistant_id='asst_z1ap2fxFg6yxCzKKvhs8DOtc', cancelled_at=None,
completed_at=None, created_at=1725822920, expires_at=1725823520, failed_at=None,
incomplete_details=None, instructions='\n You are a expert skilled financial
analyst. \n Leverage your extensive knowledge of accounting principles and
financial reporting standards to provide accurate and insightful answers to
questions regarding financial statements. \n Analyze details such as income
statements, balance sheets, cash flow statements, and notes to the accounts to
support your explanations.\n ', last_error=None, max_completion_tokens=None,
max_prompt_tokens=None, metadata={}, model='gpt-4o-mini', object='thread.run',
parallel_tool_calls=True, required_action=None, response_format='auto',
started_at=None, status='queued', thread_id='thread_Sp02A7D6nY5YEgsmxb1NMZDa',
tool_choice='auto', tools=[FileSearchTool(type='file_search',
file_search=FileSearch(max_num_results=None,
ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21',
score_threshold=0.0)))], truncation_strategy=TruncationStrategy(type='auto',
last_messages=None), usage=None, temperature=1.0, top_p=1.0, tool_resources={})
1. **Financial Performance**:
- Amazon reported total net sales of $574.785 billion for 2023, representing
a 12% increase from the prior year.
- The company achieved a net income of $30.425 billion in 2023, a significant
recovery from a net loss of $2.722 billion in 2022【5:1†source】【9:15†source】.
- Operating income improved to $36.852 billion from $12.248 billion in the
previous year【9:9†source】.
- Earnings per share for 2023 were $2.95 (basic) and $2.90
(diluted)【9:15†source】.
of total sales【9:8†source】.
3. **Operating Expenses**:
- Operating expenses totaled $537.933 billion, an increase from $501.735
billion in 2022. Notably, technology and infrastructure costs grew
significantly【9:8†source】【9:9†source】.
4. **Taxation**:
- The provision for income taxes was $7.120 billion, demonstrating the
company’s substantial tax obligations arising from its income
generation【9:15†source】【9:7†source】.
6. **Market Exposure**:
- The report also discusses Amazon's exposure to market risks, including
fluctuations in foreign exchange rates and interest
rates【9:9†source】【9:6†source】.
This overview provides key insights into Amazon’s financial health, operational
efficiency, and overall performance for 2023 as reflected in their 10-K filing.
If you need more detailed information on specific sections or figures, feel free
to ask!
--------------------------------------------------------------------------------
--------------------
Interaction #4 and text
ROLE=user
Can you summarize the document?
--------------------------------------------------------------------------------
--------------------
Interaction #3 and text
ROLE=assistant
The net income for the year 2023 was $30,425 million (or $30.4
billion)【5:1†source】.
--------------------------------------------------------------------------------
--------------------
Interaction #2 and text
ROLE=user
What was the net income of 2023?
--------------------------------------------------------------------------------
--------------------
Interaction #1 and text
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?
--------------------------------------------------------------------------------
--------------------
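The assistant's answers embed file search citation markers of the form 【5:1†source】, which clutter the text when it is shown to an end user. A minimal sketch that strips them with a regular expression. The helper name `strip_citations` is hypothetical, and the pattern assumes markers are always wrapped in the 【 and 】 brackets:

```python
import re

# Matches one citation marker such as 【5:1†source】.
CITATION_MARKER = re.compile(r"【[^】]*】")


def strip_citations(text):
    """Remove file search citation markers from an assistant answer."""
    return CITATION_MARKER.sub("", text)


# Typical use on a message from this notebook:
#   print(strip_citations(messages.data[0].content[0].text.value))
```

To keep the citations as footnotes instead, the annotations attached to the message text would be the place to look. This sketch only removes the markers.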