
Context:

I have an XML file (DEXPI) and I want to use it as the data source for a Retrieval-Augmented Generation (RAG) system built with llama-index, so that the correct context can be fetched for any natural language query.

Current Issue:

  • I cannot treat the XML file like a plain text document.
  • llama-index does not provide any splitter for XML data, so the XML cannot be correctly divided into chunks (nodes).
  • Even if we write a custom chunker/splitter, a lot of unwanted jargon would still remain in the chunks, such as XML tags and other XML-related metadata.
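For reference, a custom chunker does not have to keep the tags at all. The sketch below (a minimal assumption, not a llama-index API; the sample element names are made up, not real DEXPI) flattens each element to a line of tag name, attributes, and text, then packs those lines into size-bounded chunks:

```python
import xml.etree.ElementTree as ET

def xml_to_text_chunks(xml_string, max_chars=500):
    """Flatten an XML document into plain-text chunks, dropping the
    angle-bracket syntax and keeping only tag names, attributes, and text."""
    root = ET.fromstring(xml_string)
    chunks = []
    for elem in root.iter():
        parts = [elem.tag.split("}")[-1]]  # strip any namespace prefix
        parts += [f"{k}={v}" for k, v in elem.attrib.items()]
        if elem.text and elem.text.strip():
            parts.append(elem.text.strip())
        line = " ".join(parts)
        # merge small lines until a chunk reaches roughly max_chars
        if chunks and len(chunks[-1]) + len(line) + 1 <= max_chars:
            chunks[-1] += "\n" + line
        else:
            chunks.append(line)
    return chunks

sample = '<Plant><Equipment ID="P-101" ComponentClass="Pump">Main feed pump</Equipment></Plant>'
print(xml_to_text_chunks(sample))
```

The resulting strings can be wrapped in llama-index `TextNode` objects directly, which sidesteps the missing XML splitter.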

What did I try?

To solve this issue, I have two approaches:

Approach 1:

Convert the XML into SQL tables (or CSVs). Convert these tables into natural-language English text. Then pass this text to llama-index for further processing. While preparing the knowledge graph index, llama-index will automatically figure out the vertices (entities) and the edges (relationships) between them.
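The "tables to English text" step of this approach could look like the sketch below. The column names (`TagName`, `Type`, `ConnectedTo`) are hypothetical; substitute whatever your DEXPI extraction actually yields:

```python
def rows_to_sentences(rows):
    """Render tabular rows (extracted from the XML) as English sentences
    suitable for feeding into a knowledge-graph index."""
    sentences = []
    for row in rows:
        s = f"{row['TagName']} is a {row['Type']}."
        if row.get("ConnectedTo"):
            s += f" {row['TagName']} is connected to {row['ConnectedTo']}."
        sentences.append(s)
    return sentences

rows = [
    {"TagName": "P-101", "Type": "centrifugal pump", "ConnectedTo": "V-201"},
    {"TagName": "V-201", "Type": "storage vessel", "ConnectedTo": ""},
]
print(rows_to_sentences(rows))
# -> ['P-101 is a centrifugal pump. P-101 is connected to V-201.',
#     'V-201 is a storage vessel.']
```

Keeping the sentences short and uniform like this tends to help the index extract triples consistently, since each sentence encodes exactly one or two relations.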

Approach 2:

Convert the XML into SQL tables (or CSVs). Convert these SQL tables into graph DB entities and relationships manually. Then query the graph DB using a graph query generated by an LLM.
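The manual tables-to-graph step of this approach amounts to emitting (subject, relation, object) triples per row. The sketch below uses a plain in-memory list as a stand-in for a real graph DB (e.g. Neo4j queried with LLM-generated Cypher); the column and relation names are again hypothetical:

```python
def rows_to_triples(rows):
    """Turn tabular rows into (subject, relation, object) triples."""
    triples = []
    for row in rows:
        triples.append((row["TagName"], "IS_A", row["Type"]))
        if row.get("ConnectedTo"):
            triples.append((row["TagName"], "CONNECTED_TO", row["ConnectedTo"]))
    return triples

def query(triples, subject, relation):
    """Toy lookup standing in for an LLM-generated graph query."""
    return [o for s, r, o in triples if s == subject and r == relation]

rows = [{"TagName": "P-101", "Type": "centrifugal pump", "ConnectedTo": "V-201"}]
triples = rows_to_triples(rows)
print(query(triples, "P-101", "CONNECTED_TO"))  # -> ['V-201']
```

In a real deployment the triples would be loaded into the graph DB once, and only the query side would involve the LLM at runtime.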

My Questions:

  1. I need suggestions on which approach to choose and how effective each one is.
  2. Are there any better approaches for dealing with XML data when using llama-index?
  • I would create a PowerShell script to parse the XML into another format like CSV. With PowerShell you can also connect to a SQL Server and store the data in the database.
    – jdweng
    Commented Dec 6, 2023 at 13:52
