RAG with OpenAI for Financial Analysis


RAG_with_OpenAI

September 9, 2024

1 Retrieval from OpenAI: file_search


Retrieval in OpenAI is performed using the "file_search" tool when creating your assistant or a
thread.
(It's declared in the same place as the code_interpreter tool, if you are familiar with it.)
What does the file_search tool do?
• The file search tool is the RAG (Retrieval-Augmented Generation) pipeline developed by OpenAI in their Assistants API.
• It enhances the assistant's capabilities by incorporating external knowledge.
• When using this tool, OpenAI processes your documents: it parses them, chunks them, creates and stores embeddings in a vector store, and then retrieves relevant chunks using both keyword and semantic search.
With this tool, OpenAI applies several built-in retrieval best practices to extract data efficiently from
a document:
• Rewrites user queries to optimize them for search.
• Breaks down complex user queries into multiple searches it can run in parallel.
• Runs both keyword and semantic searches across both assistant and thread vector stores.
• Reranks search results to pick the most relevant ones before generating the final response.
Vector Store
OpenAI also provides vector store creation when uploading files, with a default chunking strategy:
• max_chunk_size_tokens: 800 (can be customized)
• chunk_overlap_tokens: 400 (can be customized)
• It uses the text-embedding-3-large model
• It returns 20 chunks (can be customized) with their relevancy scores to the user query
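As a sketch, the defaults listed above can be restated as the `chunking_strategy` payload shape that the vector store file endpoints accept (the same shape is used later in section 7.3; the values here are just the defaults written out):

```python
# Default static chunking strategy, restated as the payload shape
# accepted by the vector store file endpoints.
default_chunking = {
    "type": "static",
    "static": {
        "max_chunk_size_tokens": 800,  # default; can be customized
        "chunk_overlap_tokens": 400,   # default; must not exceed half the chunk size
    },
}
```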

[ ]: !pip install openai -q

[2]: from google.colab import userdata


openai_api_key = userdata.get('OPENAI_API_KEY')

from openai import OpenAI


client = OpenAI(api_key=openai_api_key)

2 Upload file to OpenAI
• Download the file from the internet to local storage
• Create a vector store first
[ ]: !wget "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf" -O amzn_2023_10k.pdf

2.1 Create a Vector Store


[12]: vector_store = client.beta.vector_stores.create(name="Financial Analyst")

[13]: vector_store

[13]: VectorStore(id='vs_lIw5lPrbgo8oX8RqwzgFJr0j', created_at=1725822560,


file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0,
total=0), last_active_at=1725822560, metadata={}, name='Financial Analyst',
object='vector_store', status='completed', usage_bytes=0, expires_after=None,
expires_at=None)

[6]: # This is how you can find the default chunking values in OpenAI:

vector_store_files.data[0].chunking_strategy

[6]: StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static')

[ ]: # # List a vector store's files

# vector_store_files = client.beta.vector_stores.files.list(
#     vector_store_id='vs_lIw5lPrbgo8oX8RqwzgFJr0j'
# )
# print(vector_store_files)

[67]: # # Delete a file from a vector store

# deleted_vector_store_file = client.beta.vector_stores.files.delete(
#     vector_store_id='vs_tX1Y8hrMjZZArJOIVrWjkcKk',
#     file_id='file-h3Kn24nP0VMt5YRKKPTac3mF'
# )
# print(deleted_vector_store_file)

VectorStoreFileDeleted(id='file-h3Kn24nP0VMt5YRKKPTac3mF', deleted=True,
object='vector_store.file.deleted')

2.2 Upload the file to OpenAI using the Vector Store
[14]: # Upload files to OpenAI
file_paths = ["./amzn_2023_10k.pdf"]
file_streams = [open(path, "rb") for path in file_paths]

# Use the upload_and_poll endpoint to upload the files, add them to the vector
# store, and poll the status of the file batch for completion.
# To ensure all contents have finished uploading, check the status: it must
# be "completed" (instead of "in_progress").

file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id, files=file_streams
)

print(file_batch.status)
print(file_batch.file_counts)

completed
FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1)

3 Create an assistant with the vector store information


[15]: from openai import OpenAI

assistant = client.beta.assistants.create(
    name="Financial Analyst",
    instructions="""
    You are an expert financial analyst.
    Leverage your extensive knowledge of accounting principles and financial reporting standards to provide accurate and insightful answers to questions regarding financial statements.
    Analyze details such as income statements, balance sheets, cash flow statements, and notes to the accounts to support your explanations.
    """,
    model="gpt-4o-mini",
    tools=[{"type": "file_search"}],  # before April 2024, this tool was called "retrieval"
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

# If you have already created your assistant without a vector store and you
# want to add one, update the assistant like this:

# assistant = client.beta.assistants.update(
#     assistant_id=assistant.id,
#     tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
# )

4 Create a thread
[16]: thread = client.beta.threads.create()

You can also create a thread with a vector store. To do that, you need to: * Upload the raw file
to OpenAI * Then create a thread and attach the uploaded file to a message * Under the
hood (in OpenAI), this creates a vector store for the thread.
# Upload the file to OpenAI
uploaded_file = client.files.create(
    file=open("./amzn_2023_10k.pdf", "rb"), purpose="assistants"
)

# Create a thread and attach the file to the message
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": "What was the net income of 2023?",
            "attachments": [
                {"file_id": uploaded_file.id, "tools": [{"type": "file_search"}]}
            ],
        }
    ]
)

# The thread now has a vector store with that file in its tool resources.
print(thread.tool_resources.file_search)
• When creating a run on this thread, the file search tool will query both the assistant's vector
store and the thread's.

5 Create a Run
[29]: run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
print(run.status)

queued

6 Start chatting
For the moment, no question has been asked:
[20]: messages = client.beta.threads.messages.list(thread_id=thread.id)

[21]: nbr_messages = len(messages.data)
interaction = nbr_messages
for message in messages.data:
    content = message.content[0]
    if content.type == 'text':
        print(f"Interaction #{interaction}")
        print(f"ROLE={message.role}")
        print(content.text.value)
        print("\n")
        print("--"*50)
    interaction -= 1

Interaction #1
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?

--------------------------------------------------------------------------------
--------------------

6.1 Methods
Here are a few helper methods for asking the assistant questions (creating a message within a
thread and initiating a run) and for parsing the assistant's responses.
[24]: def send_message(message_user, my_thread_id, my_assistant_id):
    """
    Create a message with the user request, and then create a run that will
    ask the assistant to answer the requested question.
    """
    params = {
        "thread_id": my_thread_id,
        "role": "user",
        "content": message_user,
    }
    thread_message = client.beta.threads.messages.create(**params)

    run = client.beta.threads.runs.create(
        thread_id=my_thread_id,
        assistant_id=my_assistant_id,
    )
    return run

def get_response(my_thread_id):
    return client.beta.threads.messages.list(thread_id=my_thread_id)

def print_messages(messages):
    nbr_messages = len(messages.data)
    interaction = nbr_messages
    for message in messages.data:
        content = message.content[0]
        print(f"Interaction #{interaction} and {content.type}")
        if content.type == 'text':
            print(f"ROLE={message.role}")
            print(content.text.value)
            print("\n")
            print("--"*50)
        else:
            print(content.type)
        interaction -= 1

6.2 Question: “What was the net income of 2023?”


[46]: message_user = "What was the net income of 2023?"
run = send_message(message_user, thread.id, assistant.id)
print(run.id, run.status)
run.tools

run_Lj3Wky79Kbv1kbTxb5BGveSE queued

[46]: [FileSearchTool(type='file_search', file_search=FileSearch(max_num_results=None,
ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21',
score_threshold=0.0)))]

Give the assistant a few seconds to think before fetching the response:
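Rather than waiting a fixed amount of time, you can poll the run's status until it reaches a terminal state. A minimal sketch (the `wait_for_run` helper name is mine, not part of the notebook; it assumes the same `client`, `thread`, and `run` objects used above):

```python
import time

def wait_for_run(client, thread_id, run_id, timeout=60, poll_interval=1.0):
    """Poll a run until it leaves the 'queued'/'in_progress' states."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status not in ("queued", "in_progress"):
            return run  # e.g. 'completed', 'failed', 'expired'
        time.sleep(poll_interval)
    raise TimeoutError(f"Run {run_id} did not finish within {timeout}s")

# Usage (hypothetical):
# run = wait_for_run(client, thread.id, run.id)
```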

[25]: messages = get_response(thread.id)


print_messages(messages)

Interaction #3 and text


ROLE=assistant
The net income for the year 2023 was $30,425 million (or $30.4
billion)【5:1†source】.

--------------------------------------------------------------------------------
--------------------
Interaction #2 and text
ROLE=user
What was the net income of 2023?

--------------------------------------------------------------------------------
--------------------
Interaction #1 and text
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?

--------------------------------------------------------------------------------
--------------------

6.3 Get the specific information about the retrieved chunks


[ ]: run_steps = client.beta.threads.runs.steps.list(
    thread_id=thread.id,
    run_id='run_mPnuN4vyVaoYkU9IcpkQaSi1'
    # run_id = 'run_Lj3Wky79Kbv1kbTxb5BGveSE'
)

print(run_steps)

[ ]: for step in run_steps.data:
    print(step.id, step.step_details)
    # .tool_calls[0].file_search.results[0].content)

# Results
# step_GvSdx4K5BJUlFZruVTpwo3OP MessageCreationStepDetails(message_creation=MessageCreati...
# step_oPkFeY8F1ny1i0zv6rWaMLyp ToolCallsStepDetails(tool_calls=[FileSearchToolCall(id...

7 Advanced Specifics
7.1 Get the relevance score for each chunk
Taking the previous question and answer as an example:
In the last step, "ToolCallsStepDetails", you can get the relevance score to the user's query for each
of the 20 retrieved chunks. This number can be changed via file_search.max_num_results (e.g. set it to 5)
when creating the assistant or the run.
[ ]: # for step = 'step_oPkFeY8F1ny1i0zv6rWaMLyp'
step.step_details.tool_calls[0].file_search.results
# Even if the content is None here, see the script below to learn how to get
# the raw chunk data.

#Results

# [FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.8368480370110275, content=None),
#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.8209594035063414, content=None),
#  ....
#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.3300425731103375, content=None),
#  FileSearchResult(file_id='file-JmeROw0FJQT7vjOcvHHUHzT8', file_name='amzn_2023_10k.pdf', score=0.3284635772545966, content=None)]
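A minimal sketch of capping the number of retrieved chunks, assuming the file_search tool payload accepts max_num_results as described above (the `file_search_tool` name is mine):

```python
# Cap retrieval at 5 chunks instead of the default 20.
file_search_tool = {
    "type": "file_search",
    "file_search": {"max_num_results": 5},
}

# Pass it when creating the assistant (or the run), e.g.:
# assistant = client.beta.assistants.create(..., tools=[file_search_tool], ...)
```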

7.2 Get the detailed chunks retrieved by OpenAI to answer your question
Get the run and step ids. For the step id, use the one that retrieved the chunks, i.e. the one whose
object is ToolCallsStepDetails.
[ ]: run_step = client.beta.threads.runs.steps.retrieve(
    thread_id=thread.id,
    run_id='run_mPnuN4vyVaoYkU9IcpkQaSi1',
    step_id="step_oPkFeY8F1ny1i0zv6rWaMLyp",
    include=["step_details.tool_calls[*].file_search.results[*].content"]
)

print(run_step)

[ ]: run_step.step_details.tool_calls[0].file_search.results[0].content

# [FileSearchResultContent(text='Consolidated\nNet sales $ 469,822 $ 513,983 $ 574,785 \nOperating expenses 444,943 501,735 537,933 \nOperating income 24,879 12,248 36,852 \n....

7.3 Chunk Strategy


Two things you can modify: * max_chunk_size_tokens * chunk_overlap_tokens
When adding files to the vector store, you can override these two values:
[ ]: # vector_store_file = client.beta.vector_stores.files.create(
#     vector_store_id='vs_lIw5lPrbgo8oX8RqwzgFJr0j',
#     file_id='file-JmeROw0FJQT7vjOcvHHUHzT8',
#     chunking_strategy={'type': 'static', 'static': {'max_chunk_size_tokens': 1240, 'chunk_overlap_tokens': 500}}
# )
# print(vector_store_file)

8 Summarize
8.1 Question: “Can you summarize the document?”
[26]: message_user = "Can you summarize the document?"
send_message(message_user, thread.id, assistant.id)

[26]: Run(id='run_mTrttRcyqi2CE6P4su8RbIO5',
assistant_id='asst_z1ap2fxFg6yxCzKKvhs8DOtc', cancelled_at=None,
completed_at=None, created_at=1725822920, expires_at=1725823520, failed_at=None,
incomplete_details=None, instructions='\n You are a expert skilled financial
analyst. \n Leverage your extensive knowledge of accounting principles and
financial reporting standards to provide accurate and insightful answers to
questions regarding financial statements. \n Analyze details such as income
statements, balance sheets, cash flow statements, and notes to the accounts to
support your explanations.\n ', last_error=None, max_completion_tokens=None,
max_prompt_tokens=None, metadata={}, model='gpt-4o-mini', object='thread.run',
parallel_tool_calls=True, required_action=None, response_format='auto',
started_at=None, status='queued', thread_id='thread_Sp02A7D6nY5YEgsmxb1NMZDa',
tool_choice='auto', tools=[FileSearchTool(type='file_search',
file_search=FileSearch(max_num_results=None,
ranking_options=FileSearchRankingOptions(ranker='default_2024_08_21',
score_threshold=0.0)))], truncation_strategy=TruncationStrategy(type='auto',
last_messages=None), usage=None, temperature=1.0, top_p=1.0, tool_resources={})

[28]: messages = get_response(thread.id)


print_messages(messages)

Interaction #5 and text


ROLE=assistant
Here’s a summary of the document, which appears to be a 10-K filing for
Amazon.com, Inc. for the year ended December 31, 2023:

1. **Financial Performance**:
- Amazon reported total net sales of $574.785 billion for 2023, representing
a 12% increase from the prior year.
- The company achieved a net income of $30.425 billion in 2023, a significant
recovery from a net loss of $2.722 billion in 2022【5:1†source】【9:15†source】.
- Operating income improved to $36.852 billion from $12.248 billion in the
previous year【9:9†source】.
- Earnings per share for 2023 were $2.95 (basic) and $2.90
(diluted)【9:15†source】.

2. **Sales and Segments**:
- Sales growth was driven mainly by increased unit sales, third-party seller
services, and advertising sales【9:8†source】.
- The breakdown of sales showed that North America contributed 61%,
international sales accounted for 23%, and AWS (Amazon Web Services) made up 16%
of total sales【9:8†source】.

3. **Operating Expenses**:
- Operating expenses totaled $537.933 billion, an increase from $501.735
billion in 2022. Notably, technology and infrastructure costs grew
significantly【9:8†source】【9:9†source】.

4. **Taxation**:
- The provision for income taxes was $7.120 billion, demonstrating the
company's substantial tax obligations arising from its income
generation【9:15†source】【9:7†source】.

5. **Capital Expenditures and Cash Flow**:
- Amazon reported strong cash flow from operations of $84.946 billion for
2023 and a positive free cash flow of $36.813 billion【9:1†source】.

6. **Market Exposure**:
- The report also discusses Amazon's exposure to market risks, including
fluctuations in foreign exchange rates and interest
rates【9:9†source】【9:6†source】.

This overview provides key insights into Amazon’s financial health, operational
efficiency, and overall performance for 2023 as reflected in their 10-K filing.
If you need more detailed information on specific sections or figures, feel free
to ask!

--------------------------------------------------------------------------------
--------------------
Interaction #4 and text
ROLE=user
Can you summarize the document?

--------------------------------------------------------------------------------
--------------------
Interaction #3 and text
ROLE=assistant
The net income for the year 2023 was $30,425 million (or $30.4
billion)【5:1†source】.

--------------------------------------------------------------------------------
--------------------
Interaction #2 and text
ROLE=user
What was the net income of 2023?

--------------------------------------------------------------------------------
--------------------
Interaction #1 and text
ROLE=assistant
I see that you've uploaded files, but I don't have specific information about
their content yet. What specific information or analysis are you looking for
regarding the financial statements in these files?

--------------------------------------------------------------------------------
--------------------

