Enhance TableGPT Agent with RAG¶
While the File Reading Workflow is adequate for most scenarios, it may not always provide the information necessary for the LLM to generate accurate code. Consider the following examples:
- A categorical column in the dataset contains 'foo', 'bar', and 'baz', but 'baz' only appears after approximately 100 rows. In this case, the LLM may not encounter the 'baz' value through
df.head()
. - The user's query may not align with the dataset's content for several reasons:
- The dataset lacks proper governance. For instance, a cell value might be misspelled from 'foo' to 'fou'.
- There could be a typo in the user's query. For example, if the user queries, "Show me the data for 'fou'," but the dataset contains 'foo' instead.
In such situations, the Dataset Retriever plugin can be utilized to fetch additional information about the dataset from external sources, thereby providing the LLM with more context and improving its ability to generate accurate responses.
Quick Start¶
To help you quickly integrate and utilize RAG
with the TableGPT Agent
, follow the steps outlined in this section. These instructions will guide you through the process of loading datasets, enhancing retrieval with document compression, and integrating with a powerful LLM-based agent. By the end of this quick start, you'll be able to issue complex queries and receive enriched, context-aware responses.
Step 1: Install Required Dependencies¶
To get started with using RAG in the TableGPT Agent, you need to install the necessary dependencies. The primary package required is langchain, which facilitates building retrieval-augmented workflows.
Run the following command to install it:
pip install langchain
Step 2: Load and Prepare Data with CSVLoader¶
The TableGPT Agent
provides a convenient CSVLoader
for converting CSV
or Excel
files into a format that can be processed by the RAG pipeline. This method allows seamless integration of your data for further retrieval and embedding.
Example Code:
from langchain_core.vectorstores import InMemoryVectorStore
from tablegpt.retriever import CSVLoader
loader = CSVLoader("产品销量表.csv", autodetect_encoding=True)
documents = []
async for item in loader.alazy_load():
documents.append(item)
# Initialize with an embedding model
vector_store = InMemoryVectorStore(embedding=SomeEmbeddingModel())
await vector_store.aadd_documents(documents=documents)
dataset_base_retriever = vector_store.as_retriever()
Step 3: Build a Context-Aware Retriever with Document Compression¶
To enhance the retrieval process, langchain
provides powerful retriever utilities that can be combined with custom compressors. In this step, we utilize the ColumnDocCompressor
from tablegpt to focus on relevant columns and build an efficient dataset_retriever
.
Example Code:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from tablegpt.retriever import ColumnDocCompressor
dataset_compressor = DocumentCompressorPipeline(
transformers=[ColumnDocCompressor()]
)
dataset_retriever = ContextualCompressionRetriever(
base_compressor=dataset_compressor,
base_retriever=dataset_base_retriever,
)
Step 4: Integrate with TableGPT Agent¶
In this step, we integrate the dataset_retriever
with the TableGPT Agent
using an LLM
and a local execution environment. This setup ensures that the agent can handle user queries effectively by leveraging both the LLM and retrieved dataset context.
Example Code:
from langchain_openai import ChatOpenAI
from pybox import LocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR
llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="TableGPT2-7B")
pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)
agent = create_tablegpt_graph(
llm=llm,
pybox_manager=pybox_manager,
dataset_retriever=dataset_retriever,
)
With this setup, your TableGPT Agent
is ready to process user queries, retrieve relevant data, and generate contextually accurate responses. The integration of RAG techniques ensures that the agent leverages external data effectively, providing enhanced insights and performance.
Step 5: Analyze Data with the TableGPT Agent¶
Finally, you can use the TableGPT Agent
to perform analysis by sending a query. The response can help determine whether retrieval-augmented generation (RAG) has provided enhanced results. Observing the returned information allows you to assess the accuracy and completeness of the generated response.
Example Code:
from datetime import date
from langchain_core.messages import HumanMessage
message = HumanMessage(content="桃酥的销售量是多少?")
_input = {
"messages": [message],
"parent_id": "some-parent-id",
"date": date.today(),
}
response = await agent.ainvoke(_input)
response["messages"]
[HumanMessage(content='桃酥的销售量是多少?', additional_kwargs={}, response_metadata={}, id='b567e1c3-8943-453c-9ebe-fa8d34cfc388'), SystemMessage(content='\nHere are some extra column information that might help you understand the dataset:\n- 产品销量表.csv:\n - {"column": 名称, "dtype": "string", "values": ["花生桃酥", ...]}\n - {"column": 销售额 , "dtype": "string", "values": [" ¥931,000.00 ", " ¥225,060.00 ", " ¥58,500.00 ", ...]}\n', additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='07fdddf4-05e8-4022-9a78-98ee3744aab2'), AIMessage(content="为了回答这个问题,我们首先需要读取文件`产品销量表.csv`,然后找到列名包含“名称”和“销售额”的列,特别是需要找到“花生桃酥”的销售量。让我们先读取数据并查看前几行。\n```python\nimport pandas as pd\n\n# 读取数据\ndf = read_df(uri='产品销量表.csv')\n\n# 显示数据框的前几行\ndf.head()\n```", additional_kwargs={'thought': '为了回答这个问题,我们首先需要读取文件`产品销量表.csv`,然后找到列名包含“名称”和“销售额”的列,特别是需要找到“花生桃酥”的销售量。让我们先读取数据并查看前几行。', 'action': {'tool': 'python', 'tool_input': "import pandas as pd\n\n# 读取数据\ndf = read_df(uri='产品销量表.csv')\n\n# 显示数据框的前几行\ndf.head()"}, 'parent_id': 'some-parent-id'}, response_metadata={}, id='27da6f10-2201-4349-bc23-9f7b42f34742', tool_calls=[{'name': 'python', 'args': {'query': "import pandas as pd\n\n# 读取数据\ndf = read_df(uri='产品销量表.csv')\n\n# 显示数据框的前几行\ndf.head()"}, 'id': 'be9a29de-7f5d-4010-a85b-37286ab99e86', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': '```pycon\n 编号 名称 单位 单价(元) 销售量 销售额 \n0 mb2033 法式面包 包 ¥7.40 305080 ¥2,257,592.00 \n1 mb2034 奶昔蛋糕 包 ¥5.80 93200 ¥540,560.00 \n2 mb2035 奶油夹心饼干 包 ¥3.10 215300 ¥667,430.00 \n3 mb2036 葱油饼 包 ¥2.20 102300 ¥225,060.00 \n4 mb2037 花生桃酥 包 ¥3.80 130000 ¥494,000.00 \n```'}], name='python', id='a48d70fd-2e01-48ee-a9a5-25dc0eec04d6', tool_call_id='be9a29de-7f5d-4010-a85b-37286ab99e86', artifact=[]), AIMessage(content='从数据中我们可以看到,“花生桃酥”的销售量为130,000包。', additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='5c5b703d-2eea-444b-a627-0828dca06df2')]
Output:
Here are some extra column information that might help you understand the dataset:
- 产品销量表.csv:
- {"column": 名称, "dtype": "string", "values": ["花生桃酥", ...]}
- {"column": 销售额 , "dtype": "string", "values": [" ¥931,000.00 ", " ¥225,060.00 ", " ¥58,500.00 ", ...]}
The output confirms that the RAG approach effectively enriches the agent's responses by incorporating dataset context. This improvement allows the agent to provide detailed, actionable insights rather than generic answers, thereby enhancing its utility for complex queries.