Enhance TableGPT Agent with RAG¶

While the File Reading Workflow is adequate for most scenarios, it may not always provide the information necessary for the LLM to generate accurate code. Consider the following examples:

A categorical column in the dataset contains 'foo', 'bar', and 'baz', but 'baz' only appears after approximately 100 rows. In this case, the LLM may not encounter the 'baz' value through df.head().
The user's query may not align with the dataset's content for several reasons:
- The dataset lacks proper governance. For instance, a cell value might be misspelled from 'foo' to 'fou'.
- There could be a typo in the user's query. For example, if the user queries, "Show me the data for 'fou'," but the dataset contains 'foo' instead.

In such situations, the Dataset Retriever plugin can be utilized to fetch additional information about the dataset from external sources, thereby providing the LLM with more context and improving its ability to generate accurate responses.

Quick Start¶

To help you quickly integrate and utilize RAG with the TableGPT Agent, follow the steps outlined in this section. These instructions will guide you through the process of loading datasets, enhancing retrieval with document compression, and integrating with a powerful LLM-based agent. By the end of this quick start, you'll be able to issue complex queries and receive enriched, context-aware responses.

Step 1: Install Required Dependencies¶

To get started with using RAG in the TableGPT Agent, you need to install the necessary dependencies. The primary package required is langchain, which facilitates building retrieval-augmented workflows.

Run the following command to install it:

pip install langchain

Step 2: Load and Prepare Data with CSVLoader¶

The TableGPT Agent provides a convenient CSVLoader for converting CSV or Excel files into a format that can be processed by the RAG pipeline. This method allows seamless integration of your data for further retrieval and embedding.

Example Code:

In [3]:

Copied!





from langchain_core.vectorstores import InMemoryVectorStore
from tablegpt.retriever import CSVLoader

loader = CSVLoader("产品销量表.csv", autodetect_encoding=True)

documents = []
async for item in loader.alazy_load():
    documents.append(item)

# Initialize with an embedding model
vector_store = InMemoryVectorStore(embedding=SomeEmbeddingModel())

await vector_store.aadd_documents(documents=documents)
dataset_base_retriever = vector_store.as_retriever()
from langchain_core.vectorstores import InMemoryVectorStore
from tablegpt.retriever import CSVLoader

loader = CSVLoader("产品销量表.csv", autodetect_encoding=True)

documents = []
async for item in loader.alazy_load():
    documents.append(item)

# Initialize with an embedding model
vector_store = InMemoryVectorStore(embedding=SomeEmbeddingModel())

await vector_store.aadd_documents(documents=documents)
dataset_base_retriever = vector_store.as_retriever()

Step 3: Build a Context-Aware Retriever with Document Compression¶

To enhance the retrieval process, langchain provides powerful retriever utilities that can be combined with custom compressors. In this step, we utilize the ColumnDocCompressor from tablegpt to focus on relevant columns and build an efficient dataset_retriever.

Example Code:

In [5]:

Copied!





from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from tablegpt.retriever import ColumnDocCompressor

dataset_compressor = DocumentCompressorPipeline(
    transformers=[ColumnDocCompressor()]
)

dataset_retriever = ContextualCompressionRetriever(
    base_compressor=dataset_compressor,
    base_retriever=dataset_base_retriever,
)
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from tablegpt.retriever import ColumnDocCompressor

dataset_compressor = DocumentCompressorPipeline(
    transformers=[ColumnDocCompressor()]
)

dataset_retriever = ContextualCompressionRetriever(
    base_compressor=dataset_compressor,
    base_retriever=dataset_base_retriever,
)

Step 4: Integrate with TableGPT Agent¶

In this step, we integrate the dataset_retriever with the TableGPT Agent using an LLM and a local execution environment. This setup ensures that the agent can handle user queries effectively by leveraging both the LLM and retrieved dataset context.

Example Code:

In [8]:

Copied!





from langchain_openai import ChatOpenAI
from pybox import AsyncLocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR

llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="TableGPT2-7B")
pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

agent = create_tablegpt_graph(
    llm=llm,
    pybox_manager=pybox_manager,
    dataset_retriever=dataset_retriever,
)
from langchain_openai import ChatOpenAI
from pybox import AsyncLocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR

llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="TableGPT2-7B")
pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

agent = create_tablegpt_graph(
    llm=llm,
    pybox_manager=pybox_manager,
    dataset_retriever=dataset_retriever,
)

With this setup, your TableGPT Agent is ready to process user queries, retrieve relevant data, and generate contextually accurate responses. The integration of RAG techniques ensures that the agent leverages external data effectively, providing enhanced insights and performance.

Step 5: Analyze Data with the TableGPT Agent¶

Finally, you can use the TableGPT Agent to perform analysis by sending a query. The response can help determine whether retrieval-augmented generation (RAG) has provided enhanced results. Observing the returned information allows you to assess the accuracy and completeness of the generated response.

Example Code:

In [9]:

Copied!





from datetime import date
from langchain_core.messages import HumanMessage

message = HumanMessage(content="桃酥的销售量是多少?")

_input = {
    "messages": [message],
    "parent_id": "some-parent-id",
    "date": date.today(),
}

response = await agent.ainvoke(_input)

response["messages"]
from datetime import date
from langchain_core.messages import HumanMessage

message = HumanMessage(content="桃酥的销售量是多少?")

_input = {
    "messages": [message],
    "parent_id": "some-parent-id",
    "date": date.today(),
}

response = await agent.ainvoke(_input)

response["messages"]

Out[9]:

[HumanMessage(content='桃酥的销售量是多少?', additional_kwargs={}, response_metadata={}, id='b567e1c3-8943-453c-9ebe-fa8d34cfc388'),
 SystemMessage(content='\nHere are some extra column information that might help you understand the dataset:\n- 产品销量表.csv:\n  - {"column": 名称, "dtype": "string", "values": ["花生桃酥", ...]}\n  - {"column":  销售额 , "dtype": "string", "values": [" ￥931,000.00 ", " ￥225,060.00 ", " ￥58,500.00 ", ...]}\n', additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='07fdddf4-05e8-4022-9a78-98ee3744aab2'),
 AIMessage(content="为了回答这个问题，我们首先需要读取文件`产品销量表.csv`，然后找到列名包含“名称”和“销售额”的列，特别是需要找到“花生桃酥”的销售量。让我们先读取数据并查看前几行。\n```python\nimport pandas as pd\n\n# 读取数据\ndf = read_df(uri='产品销量表.csv')\n\n# 显示数据框的前几行\ndf.head()\n```", additional_kwargs={'thought': '为了回答这个问题，我们首先需要读取文件`产品销量表.csv`，然后找到列名包含“名称”和“销售额”的列，特别是需要找到“花生桃酥”的销售量。让我们先读取数据并查看前几行。', 'action': {'tool': 'python', 'tool_input': "import pandas as pd\n\n# 读取数据\ndf = read_df(uri='产品销量表.csv')\n\n# 显示数据框的前几行\ndf.head()"}, 'parent_id': 'some-parent-id'}, response_metadata={}, id='27da6f10-2201-4349-bc23-9f7b42f34742', tool_calls=[{'name': 'python', 'args': {'query': "import pandas as pd\n\n# 读取数据\ndf = read_df(uri='产品销量表.csv')\n\n# 显示数据框的前几行\ndf.head()"}, 'id': 'be9a29de-7f5d-4010-a85b-37286ab99e86', 'type': 'tool_call'}]),
 ToolMessage(content=[{'type': 'text', 'text': '```pycon\n       编号      名称 单位   单价（元）      销售量             销售额 \n0  mb2033    法式面包  包   ￥7.40   305080   ￥2,257,592.00 \n1  mb2034    奶昔蛋糕  包   ￥5.80    93200     ￥540,560.00 \n2  mb2035  奶油夹心饼干  包   ￥3.10   215300     ￥667,430.00 \n3  mb2036     葱油饼  包   ￥2.20   102300     ￥225,060.00 \n4  mb2037    花生桃酥  包   ￥3.80   130000     ￥494,000.00 \n```'}], name='python', id='a48d70fd-2e01-48ee-a9a5-25dc0eec04d6', tool_call_id='be9a29de-7f5d-4010-a85b-37286ab99e86', artifact=[]),
 AIMessage(content='从数据中我们可以看到，“花生桃酥”的销售量为130,000包。', additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='5c5b703d-2eea-444b-a627-0828dca06df2')]

Output:

Here are some extra column information that might help you understand the dataset:

产品销量表.csv:

{"column": 名称, "dtype": "string", "values": ["花生桃酥", ...]}

{"column": 销售额 , "dtype": "string", "values": [" ￥931,000.00 ", " ￥225,060.00 ", " ￥58,500.00 ", ...]}

The output confirms that the RAG approach effectively enriches the agent's responses by incorporating dataset context. This improvement allows the agent to provide detailed, actionable insights rather than generic answers, thereby enhancing its utility for complex queries.