Normalize Datasets¶

The Dataset Normalizer plugin is used to transform 'pandas-unfriendly' datasets (e.g., Excel files that do not follow a standard tabular structure) into a more suitable format for pandas. It is backed by an LLM that generates Python code to convert the original datasets into new ones.

In tablegpt-agent, this plugin is used to better format 'pandas-unfriendly' datasets, making them more understandable for the subsequent steps. This plugin is optional; if used, it serves as the very first step in the File Reading Workflow, easing the difficulty of data analysis in the subsequent workflow.

Introduction¶

The Dataset Normalizer is a specialized tool designed to tackle challenges that arise when working with irregular and poorly structured datasets. These challenges are especially prevalent in Excel files, which are often used as a flexible but inconsistent way of storing data.

Analyzing Excel data files can pose significant challenges, such as:

Irregular Formatting: Datasets may lack a consistent tabular structure, with varying cell sizes or non-standard layouts.
Merged Cells: Cells spanning multiple rows or columns can disrupt parsing tools.
Inconsistent Headers: Columns may have incomplete, redundant, or nested headers.
Hidden Data: Data may be stored in additional sheets or rely on calculated fields that are not directly accessible.
Mixed Data Types: Columns may contain inconsistent data types, such as numbers mixed with text.
Empty or Placeholder Rows: Extra rows with missing or irrelevant data can complicate data loading and analysis.

!!! Note: When the tablegpt-agent enables the Dataset Normalizer to format the dataset, the dataset reading process will be noticeably slower. This is because the Dataset Normalizer needs to analyze the dataset and generate transformation code, a process that takes considerable time.

It is worth noting that the data normalization process can effectively address most common data irregularities. However, for more complex datasets, further optimization may be needed, and the results depend on the specific normalization model used.

Quick Start¶

To enable the Dataset Normalizer, ensure you pass it as a parameter when creating the tablegpt-agent. You can follow the example below:

In [7]:

Copied!





from pathlib import Path
from langchain_openai import ChatOpenAI
from pybox import AsyncLocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR

llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="TableGPT2-7B")
normalize_llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="YOUR_VLLM_MODEL_NAME")
pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

agent = create_tablegpt_graph(
    llm=llm,
    pybox_manager=pybox_manager,
    normalize_llm=normalize_llm,
    session_id="some-session-id", # This is required when using file-reading
)
from pathlib import Path
from langchain_openai import ChatOpenAI
from pybox import AsyncLocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR

llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="TableGPT2-7B")
normalize_llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="YOUR_VLLM_MODEL_NAME")
pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)

agent = create_tablegpt_graph(
    llm=llm,
    pybox_manager=pybox_manager,
    normalize_llm=normalize_llm,
    session_id="some-session-id", # This is required when using file-reading
)

Given an Excel file 产品生产统计表.xlsx with merged cells and irregular headers:

产品生产统计表
生产日期	制造编号	产品名称	预定产量	本日产量		累计产量	耗费工时
生产日期	制造编号	产品名称	预定产量	预计	实际	累计产量	本日	累计
2007/8/10	FK-001	猕猴桃果肉饮料	100000	40000	45000	83000	10	20
2007/8/11	FK-002	西瓜果肉饮料	100000	40000	44000	82000	9	18
2007/8/12	FK-003	草莓果肉饮料	100000	40000	45000	83000	9	18
2007/8/13	FK-004	蓝莓果肉饮料	100000	40000	45000	83000	9	18

Add the file for processing in the additional_kwargs of HumanMessage:

In [8]:

Copied!





from typing import TypedDict
from langchain_core.messages import HumanMessage

class Attachment(TypedDict):
    """Contains at least one dictionary with the key filename."""
    filename: str

attachment_msg = HumanMessage(
    content="",
    # Please make sure your iPython kernel can access your filename.
    additional_kwargs={"attachments": [Attachment(filename="产品生产统计表.xlsx")]},
)
from typing import TypedDict
from langchain_core.messages import HumanMessage

class Attachment(TypedDict):
    """Contains at least one dictionary with the key filename."""
    filename: str

attachment_msg = HumanMessage(
    content="",
    # Please make sure your iPython kernel can access your filename.
    additional_kwargs={"attachments": [Attachment(filename="产品生产统计表.xlsx")]},
)

Invoke the tablegpt-agent to normalize the datasets:

In [9]:

Copied!





from datetime import date
from tablegpt.agent.file_reading import Stage

# Reading and processing files.
response = await agent.ainvoke(
    input={
        "entry_message": attachment_msg,
        "processing_stage": Stage.UPLOADED,
        "messages": [attachment_msg],
        "parent_id": "some-parent-id1",
        "date": date.today(),
    },
    config={
        # Using checkpointer requires binding thread_id at runtime.
        "configurable": {"thread_id": "some-thread-id"},
    },
)

response["messages"]
from datetime import date
from tablegpt.agent.file_reading import Stage

# Reading and processing files.
response = await agent.ainvoke(
    input={
        "entry_message": attachment_msg,
        "processing_stage": Stage.UPLOADED,
        "messages": [attachment_msg],
        "parent_id": "some-parent-id1",
        "date": date.today(),
    },
    config={
        # Using checkpointer requires binding thread_id at runtime.
        "configurable": {"thread_id": "some-thread-id"},
    },
)

response["messages"]

Out[9]:

[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': '产品生产统计表.xlsx'}]}, response_metadata={}, id='452ef08b-a45f-49ce-8489-4062cd35d57c'),
 AIMessage(content='我已经收到您的数据文件，我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中，并通过 `df.info` 查看 NaN 情况和数据类型。\n```python\n# Load the data into a DataFrame\ndf = read_df(\'产品生产统计表.xlsx\')\n# Normalize the data\ntry:\n    df = df.copy()\n\n    import pandas as pd\n\n    # Assuming the original data is loaded into a DataFrame named `df`\n\n    # Step 1: Isolate the Table Header\n    # Remove the unnecessary top rows and columns\n    df = df.drop([0, 1])  # Drop the first two rows which contain header information\n    df = df.reset_index(drop=True)  # Reset the index to start from 0\n\n    # Step 2: Store the Result as `final_df`\n    final_df = df\n\n    # Step 3: Rename Columns of final_df\n    # Adjust the column names of final_df to match the desired format\n    final_df.columns = [\'生产日期\', \'制造编号\', \'产品名称\', \'预定产量\', \'本日产量预计\', \'本日产量实际\', \'累计产量\', \'本日耗费工时\', \'累计耗费工时\']\n\n    # Step 4: Data Processing\n    # Ensure the data types are correct and remove any unnecessary rows or columns\n    final_df = final_df.dropna(subset=[\'生产日期\'])  # Remove rows where the production date is missing\n    final_df = final_df.drop_duplicates()  # Remove duplicate rows if any\n    final_df = final_df.reset_index(drop=True)  # Reset the index again to ensure it starts from 0\n\n    # Convert columns to appropriate data types\n    final_df[\'预定产量\'] = final_df[\'预定产量\'].astype(int)\n    final_df[\'本日产量预计\'] = final_df[\'本日产量预计\'].astype(int)\n    final_df[\'本日产量实际\'] = final_df[\'本日产量实际\'].astype(int)\n    final_df[\'累计产量\'] = final_df[\'累计产量\'].astype(int)\n    final_df[\'本日耗费工时\'] = final_df[\'本日耗费工时\'].astype(int)\n    final_df[\'累计耗费工时\'] = final_df[\'累计耗费工时\'].astype(int)\n\n    # Display the transformed DataFrame\n    print(final_df)\n    if final_df.columns.tolist() == final_df.iloc[0].tolist():\n        final_df = final_df.iloc[1:]\n\n    # reassign df with the formatted DataFrame\n    df = final_df\nexcept Exception as e:\n    # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\n    print(f"Reformat failed with error {e}, use the original DataFrame.")\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how=\'all\').dropna(axis=1, how=\'all\')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件，我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中，并通过 `df.info` 查看 NaN 情况和数据类型。', 'action': {'tool': 'python', 'tool_input': '# Load the data into a DataFrame\ndf = read_df(\'产品生产统计表.xlsx\')\n# Normalize the data\ntry:\n    df = df.copy()\n\n    import pandas as pd\n\n    # Assuming the original data is loaded into a DataFrame named `df`\n\n    # Step 1: Isolate the Table Header\n    # Remove the unnecessary top rows and columns\n    df = df.drop([0, 1])  # Drop the first two rows which contain header information\n    df = df.reset_index(drop=True)  # Reset the index to start from 0\n\n    # Step 2: Store the Result as `final_df`\n    final_df = df\n\n    # Step 3: Rename Columns of final_df\n    # Adjust the column names of final_df to match the desired format\n    final_df.columns = [\'生产日期\', \'制造编号\', \'产品名称\', \'预定产量\', \'本日产量预计\', \'本日产量实际\', \'累计产量\', \'本日耗费工时\', \'累计耗费工时\']\n\n    # Step 4: Data Processing\n    # Ensure the data types are correct and remove any unnecessary rows or columns\n    final_df = final_df.dropna(subset=[\'生产日期\'])  # Remove rows where the production date is missing\n    final_df = final_df.drop_duplicates()  # Remove duplicate rows if any\n    final_df = final_df.reset_index(drop=True)  # Reset the index again to ensure it starts from 0\n\n    # Convert columns to appropriate data types\n    final_df[\'预定产量\'] = final_df[\'预定产量\'].astype(int)\n    final_df[\'本日产量预计\'] = final_df[\'本日产量预计\'].astype(int)\n    final_df[\'本日产量实际\'] = final_df[\'本日产量实际\'].astype(int)\n    final_df[\'累计产量\'] = final_df[\'累计产量\'].astype(int)\n    final_df[\'本日耗费工时\'] = final_df[\'本日耗费工时\'].astype(int)\n    final_df[\'累计耗费工时\'] = final_df[\'累计耗费工时\'].astype(int)\n\n    # Display the transformed DataFrame\n    print(final_df)\n    if final_df.columns.tolist() == final_df.iloc[0].tolist():\n        final_df = final_df.iloc[1:]\n\n    # reassign df with the formatted DataFrame\n    df = final_df\nexcept Exception as e:\n    # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\n    print(f"Reformat failed with error {e}, use the original DataFrame.")\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how=\'all\').dropna(axis=1, how=\'all\')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)'}, 'model_type': None}, response_metadata={}, id='8e5d0026-215d-46e2-ab60-9174c5bf50bd', tool_calls=[{'name': 'python', 'args': {'query': '# Load the data into a DataFrame\ndf = read_df(\'产品生产统计表.xlsx\')\n# Normalize the data\ntry:\n    df = df.copy()\n\n    import pandas as pd\n\n    # Assuming the original data is loaded into a DataFrame named `df`\n\n    # Step 1: Isolate the Table Header\n    # Remove the unnecessary top rows and columns\n    df = df.drop([0, 1])  # Drop the first two rows which contain header information\n    df = df.reset_index(drop=True)  # Reset the index to start from 0\n\n    # Step 2: Store the Result as `final_df`\n    final_df = df\n\n    # Step 3: Rename Columns of final_df\n    # Adjust the column names of final_df to match the desired format\n    final_df.columns = [\'生产日期\', \'制造编号\', \'产品名称\', \'预定产量\', \'本日产量预计\', \'本日产量实际\', \'累计产量\', \'本日耗费工时\', \'累计耗费工时\']\n\n    # Step 4: Data Processing\n    # Ensure the data types are correct and remove any unnecessary rows or columns\n    final_df = final_df.dropna(subset=[\'生产日期\'])  # Remove rows where the production date is missing\n    final_df = final_df.drop_duplicates()  # Remove duplicate rows if any\n    final_df = final_df.reset_index(drop=True)  # Reset the index again to ensure it starts from 0\n\n    # Convert columns to appropriate data types\n    final_df[\'预定产量\'] = final_df[\'预定产量\'].astype(int)\n    final_df[\'本日产量预计\'] = final_df[\'本日产量预计\'].astype(int)\n    final_df[\'本日产量实际\'] = final_df[\'本日产量实际\'].astype(int)\n    final_df[\'累计产量\'] = final_df[\'累计产量\'].astype(int)\n    final_df[\'本日耗费工时\'] = final_df[\'本日耗费工时\'].astype(int)\n    final_df[\'累计耗费工时\'] = final_df[\'累计耗费工时\'].astype(int)\n\n    # Display the transformed DataFrame\n    print(final_df)\n    if final_df.columns.tolist() == final_df.iloc[0].tolist():\n        final_df = final_df.iloc[1:]\n\n    # reassign df with the formatted DataFrame\n    df = final_df\nexcept Exception as e:\n    # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\n    print(f"Reformat failed with error {e}, use the original DataFrame.")\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how=\'all\').dropna(axis=1, how=\'all\')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)'}, 'id': 'af99b549-09f1-4623-930a-3ffbfde40216', 'type': 'tool_call'}]),
 ToolMessage(content=[{'type': 'text', 'text': "```pycon\nReformat failed with error cannot convert float NaN to integer, use the original DataFrame.\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 26 entries, 0 to 25\nData columns (total 9 columns):\n #   Column  Non-Null Count  Dtype \n---  ------  --------------  ----- \n 0   生产日期    22 non-null     object\n 1   制造编号    25 non-null     object\n 2   产品名称    20 non-null     object\n 3   预定产量    20 non-null     object\n 4   本日产量预计  21 non-null     object\n 5   本日产量实际  21 non-null     object\n 6   累计产量    25 non-null     object\n 7   本日耗费工时  20 non-null     object\n 8   累计耗费工时  20 non-null     object\ndtypes: object(9)\n```"}], name='python', id='58429576-461b-4a50-8c3d-5bbacca22cdb', tool_call_id='af99b549-09f1-4623-930a-3ffbfde40216', artifact=[]),
 AIMessage(content='接下来我将用 `df.head(5)` 来查看数据集的前 5 行。\n```python\n# Show the first 5 rows to understand the structure\ndf.head(5)\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 `df.head(5)` 来查看数据集的前 5 行。', 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='0c67c2b2-36d2-49cc-8fb8-7f6e5dfc8625', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'id': 'a4ccd41a-a872-4ff0-aae5-678a96b9b54a', 'type': 'tool_call'}]),
 ToolMessage(content=[{'type': 'text', 'text': '```pycon\n                  生产日期    制造编号     产品名称    预定产量 本日产量预计 本日产量实际   累计产量 本日耗费工时 累计耗费工时\n0  2007-08-10 00:00:00  FK-001  猕猴桃果肉饮料  100000  40000  45000  83000     10     20\n1  2007-08-11 00:00:00  FK-002   西瓜果肉饮料  100000  40000  44000  82000      9     18\n2  2007-08-12 00:00:00  FK-003   草莓果肉饮料  100000  40000  45000  83000      9     18\n3  2007-08-13 00:00:00  FK-004   蓝莓果肉饮料  100000  40000  45000  83000      9     18\n4  2007-08-14 00:00:00  FK-005  水密桃果肉饮料  100000  40000  45000  83000     10     20\n```'}], name='python', id='d828aa34-7c9e-4fee-8ae1-7b553530292b', tool_call_id='a4ccd41a-a872-4ff0-aae5-678a96b9b54a', artifact=[]),
 AIMessage(content='我已经了解了数据集 产品生产统计表.xlsx 的基本信息。请问我可以帮您做些什么？', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='e836eba6-9597-4bf8-acfd-2a81871916a6')]

By formatting the content of the last ToolMessage, you can see the normalized data:

生产日期	制造编号	产品名称	预定产量	本日产量预计	本日产量实际	累计产量	本日耗费工时	累计耗费工时
2007/8/10	FK-001	猕猴桃果肉饮料	100000	40000	45000	83000	10	20
2007/8/11	FK-002	西瓜果肉饮料	100000	40000	44000	82000	9	18
2007/8/12	FK-003	草莓果肉饮料	100000	40000	45000	83000	9	18
2007/8/13	FK-004	蓝莓果肉饮料	100000	40000	45000	83000	9	18