Normalize Datasets¶
The Dataset Normalizer plugin is used to transform 'pandas-unfriendly' datasets (e.g., Excel files that do not follow a standard tabular structure) into a more suitable format for pandas. It is backed by an LLM that generates Python code to convert the original datasets into new ones.
In tablegpt-agent
, this plugin is used to better format 'pandas-unfriendly' datasets, making them more understandable for the subsequent steps. This plugin is optional; if used, it serves as the very first step in the File Reading Workflow, easing the difficulty of data analysis in the subsequent workflow.
Introduction¶
The Dataset Normalizer
is a specialized tool designed to tackle challenges that arise when working with irregular and poorly structured datasets. These challenges are especially prevalent in Excel files, which are often used as a flexible but inconsistent way of storing data.
Analyzing Excel data files can pose significant challenges, such as:
- Irregular Formatting: Datasets may lack a consistent tabular structure, with varying cell sizes or non-standard layouts.
- Merged Cells: Cells spanning multiple rows or columns can disrupt parsing tools.
- Inconsistent Headers: Columns may have incomplete, redundant, or nested headers.
- Hidden Data: Data may be stored in additional sheets or rely on calculated fields that are not directly accessible.
- Mixed Data Types: Columns may contain inconsistent data types, such as numbers mixed with text.
- Empty or Placeholder Rows: Extra rows with missing or irrelevant data can complicate data loading and analysis.
!!! Note: When the
tablegpt-agent
enables theDataset Normalizer
to format the dataset, the dataset reading process will be noticeably slower. This is because theDataset Normalizer
needs to analyze the dataset and generate transformation code, a process that takes considerable time.It is worth noting that the data normalization process can effectively address most common data irregularities. However, for more complex datasets, further optimization may be needed, and the results depend on the specific normalization model used.
Quick Start¶
To enable the Dataset Normalizer
, ensure you pass it as a parameter when creating the tablegpt-agent
. You can follow the example below:
from pathlib import Path
from langchain_openai import ChatOpenAI
from pybox import AsyncLocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR
llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="TableGPT2-7B")
normalize_llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="YOUR_VLLM_MODEL_NAME")
pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)
agent = create_tablegpt_graph(
llm=llm,
pybox_manager=pybox_manager,
normalize_llm=normalize_llm,
session_id="some-session-id", # This is required when using file-reading
)
Given an Excel file 产品生产统计表.xlsx with merged cells and irregular headers:
产品生产统计表 | ||||||||
---|---|---|---|---|---|---|---|---|
生产日期 | 制造编号 | 产品名称 | 预定产量 | 本日产量 | 累计产量 | 耗费工时 | ||
预计 | 实际 | 本日 | 累计 | |||||
2007/8/10 | FK-001 | 猕猴桃果肉饮料 | 100000 | 40000 | 45000 | 83000 | 10 | 20 |
2007/8/11 | FK-002 | 西瓜果肉饮料 | 100000 | 40000 | 44000 | 82000 | 9 | 18 |
2007/8/12 | FK-003 | 草莓果肉饮料 | 100000 | 40000 | 45000 | 83000 | 9 | 18 |
2007/8/13 | FK-004 | 蓝莓果肉饮料 | 100000 | 40000 | 45000 | 83000 | 9 | 18 |
Add the file for processing in the additional_kwargs
of HumanMessage:
from typing import TypedDict
from langchain_core.messages import HumanMessage
class Attachment(TypedDict):
"""Contains at least one dictionary with the key filename."""
filename: str
attachment_msg = HumanMessage(
content="",
# Please make sure your iPython kernel can access your filename.
additional_kwargs={"attachments": [Attachment(filename="产品生产统计表.xlsx")]},
)
Invoke the tablegpt-agent
to normalize the datasets:
from datetime import date
from tablegpt.agent.file_reading import Stage
# Reading and processing files.
response = await agent.ainvoke(
input={
"entry_message": attachment_msg,
"processing_stage": Stage.UPLOADED,
"messages": [attachment_msg],
"parent_id": "some-parent-id1",
"date": date.today(),
},
config={
# Using checkpointer requires binding thread_id at runtime.
"configurable": {"thread_id": "some-thread-id"},
},
)
response["messages"]
[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': '产品生产统计表.xlsx'}]}, response_metadata={}, id='452ef08b-a45f-49ce-8489-4062cd35d57c'), AIMessage(content='我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据类型。\n```python\n# Load the data into a DataFrame\ndf = read_df(\'产品生产统计表.xlsx\')\n# Normalize the data\ntry:\n df = df.copy()\n\n import pandas as pd\n\n # Assuming the original data is loaded into a DataFrame named `df`\n\n # Step 1: Isolate the Table Header\n # Remove the unnecessary top rows and columns\n df = df.drop([0, 1]) # Drop the first two rows which contain header information\n df = df.reset_index(drop=True) # Reset the index to start from 0\n\n # Step 2: Store the Result as `final_df`\n final_df = df\n\n # Step 3: Rename Columns of final_df\n # Adjust the column names of final_df to match the desired format\n final_df.columns = [\'生产日期\', \'制造编号\', \'产品名称\', \'预定产量\', \'本日产量预计\', \'本日产量实际\', \'累计产量\', \'本日耗费工时\', \'累计耗费工时\']\n\n # Step 4: Data Processing\n # Ensure the data types are correct and remove any unnecessary rows or columns\n final_df = final_df.dropna(subset=[\'生产日期\']) # Remove rows where the production date is missing\n final_df = final_df.drop_duplicates() # Remove duplicate rows if any\n final_df = final_df.reset_index(drop=True) # Reset the index again to ensure it starts from 0\n\n # Convert columns to appropriate data types\n final_df[\'预定产量\'] = final_df[\'预定产量\'].astype(int)\n final_df[\'本日产量预计\'] = final_df[\'本日产量预计\'].astype(int)\n final_df[\'本日产量实际\'] = final_df[\'本日产量实际\'].astype(int)\n final_df[\'累计产量\'] = final_df[\'累计产量\'].astype(int)\n final_df[\'本日耗费工时\'] = final_df[\'本日耗费工时\'].astype(int)\n final_df[\'累计耗费工时\'] = final_df[\'累计耗费工时\'].astype(int)\n\n # Display the transformed DataFrame\n print(final_df)\n if final_df.columns.tolist() == final_df.iloc[0].tolist():\n final_df = final_df.iloc[1:]\n\n # reassign df with the formatted DataFrame\n df = final_df\nexcept Exception as e:\n # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\n print(f"Reformat failed with error {e}, use the original DataFrame.")\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how=\'all\').dropna(axis=1, how=\'all\')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据类型。', 'action': {'tool': 'python', 'tool_input': '# Load the data into a DataFrame\ndf = read_df(\'产品生产统计表.xlsx\')\n# Normalize the data\ntry:\n df = df.copy()\n\n import pandas as pd\n\n # Assuming the original data is loaded into a DataFrame named `df`\n\n # Step 1: Isolate the Table Header\n # Remove the unnecessary top rows and columns\n df = df.drop([0, 1]) # Drop the first two rows which contain header information\n df = df.reset_index(drop=True) # Reset the index to start from 0\n\n # Step 2: Store the Result as `final_df`\n final_df = df\n\n # Step 3: Rename Columns of final_df\n # Adjust the column names of final_df to match the desired format\n final_df.columns = [\'生产日期\', \'制造编号\', \'产品名称\', \'预定产量\', \'本日产量预计\', \'本日产量实际\', \'累计产量\', \'本日耗费工时\', \'累计耗费工时\']\n\n # Step 4: Data Processing\n # Ensure the data types are correct and remove any unnecessary rows or columns\n final_df = final_df.dropna(subset=[\'生产日期\']) # Remove rows where the production date is missing\n final_df = final_df.drop_duplicates() # Remove duplicate rows if any\n final_df = final_df.reset_index(drop=True) # Reset the index again to ensure it starts from 0\n\n # Convert columns to appropriate data types\n final_df[\'预定产量\'] = final_df[\'预定产量\'].astype(int)\n final_df[\'本日产量预计\'] = final_df[\'本日产量预计\'].astype(int)\n final_df[\'本日产量实际\'] = final_df[\'本日产量实际\'].astype(int)\n final_df[\'累计产量\'] = final_df[\'累计产量\'].astype(int)\n final_df[\'本日耗费工时\'] = final_df[\'本日耗费工时\'].astype(int)\n final_df[\'累计耗费工时\'] = final_df[\'累计耗费工时\'].astype(int)\n\n # Display the transformed DataFrame\n print(final_df)\n if final_df.columns.tolist() == final_df.iloc[0].tolist():\n final_df = final_df.iloc[1:]\n\n # reassign df with the formatted DataFrame\n df = final_df\nexcept Exception as e:\n # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\n print(f"Reformat failed with error {e}, use the original DataFrame.")\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how=\'all\').dropna(axis=1, how=\'all\')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)'}, 'model_type': None}, response_metadata={}, id='8e5d0026-215d-46e2-ab60-9174c5bf50bd', tool_calls=[{'name': 'python', 'args': {'query': '# Load the data into a DataFrame\ndf = read_df(\'产品生产统计表.xlsx\')\n# Normalize the data\ntry:\n df = df.copy()\n\n import pandas as pd\n\n # Assuming the original data is loaded into a DataFrame named `df`\n\n # Step 1: Isolate the Table Header\n # Remove the unnecessary top rows and columns\n df = df.drop([0, 1]) # Drop the first two rows which contain header information\n df = df.reset_index(drop=True) # Reset the index to start from 0\n\n # Step 2: Store the Result as `final_df`\n final_df = df\n\n # Step 3: Rename Columns of final_df\n # Adjust the column names of final_df to match the desired format\n final_df.columns = [\'生产日期\', \'制造编号\', \'产品名称\', \'预定产量\', \'本日产量预计\', \'本日产量实际\', \'累计产量\', \'本日耗费工时\', \'累计耗费工时\']\n\n # Step 4: Data Processing\n # Ensure the data types are correct and remove any unnecessary rows or columns\n final_df = final_df.dropna(subset=[\'生产日期\']) # Remove rows where the production date is missing\n final_df = final_df.drop_duplicates() # Remove duplicate rows if any\n final_df = final_df.reset_index(drop=True) # Reset the index again to ensure it starts from 0\n\n # Convert columns to appropriate data types\n final_df[\'预定产量\'] = final_df[\'预定产量\'].astype(int)\n final_df[\'本日产量预计\'] = final_df[\'本日产量预计\'].astype(int)\n final_df[\'本日产量实际\'] = final_df[\'本日产量实际\'].astype(int)\n final_df[\'累计产量\'] = final_df[\'累计产量\'].astype(int)\n final_df[\'本日耗费工时\'] = final_df[\'本日耗费工时\'].astype(int)\n final_df[\'累计耗费工时\'] = final_df[\'累计耗费工时\'].astype(int)\n\n # Display the transformed DataFrame\n print(final_df)\n if final_df.columns.tolist() == final_df.iloc[0].tolist():\n final_df = final_df.iloc[1:]\n\n # reassign df with the formatted DataFrame\n df = final_df\nexcept Exception as e:\n # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\n print(f"Reformat failed with error {e}, use the original DataFrame.")\n# Remove leading and trailing whitespaces in column names\ndf.columns = df.columns.str.strip()\n\n# Remove rows and columns that contain only empty values\ndf = df.dropna(how=\'all\').dropna(axis=1, how=\'all\')\n\n# Get the basic information of the dataset\ndf.info(memory_usage=False)'}, 'id': 'af99b549-09f1-4623-930a-3ffbfde40216', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': "```pycon\nReformat failed with error cannot convert float NaN to integer, use the original DataFrame.\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 26 entries, 0 to 25\nData columns (total 9 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 生产日期 22 non-null object\n 1 制造编号 25 non-null object\n 2 产品名称 20 non-null object\n 3 预定产量 20 non-null object\n 4 本日产量预计 21 non-null object\n 5 本日产量实际 21 non-null object\n 6 累计产量 25 non-null object\n 7 本日耗费工时 20 non-null object\n 8 累计耗费工时 20 non-null object\ndtypes: object(9)\n```"}], name='python', id='58429576-461b-4a50-8c3d-5bbacca22cdb', tool_call_id='af99b549-09f1-4623-930a-3ffbfde40216', artifact=[]), AIMessage(content='接下来我将用 `df.head(5)` 来查看数据集的前 5 行。\n```python\n# Show the first 5 rows to understand the structure\ndf.head(5)\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 `df.head(5)` 来查看数据集的前 5 行。', 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='0c67c2b2-36d2-49cc-8fb8-7f6e5dfc8625', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\ndf.head(5)'}, 'id': 'a4ccd41a-a872-4ff0-aae5-678a96b9b54a', 'type': 'tool_call'}]), ToolMessage(content=[{'type': 'text', 'text': '```pycon\n 生产日期 制造编号 产品名称 预定产量 本日产量预计 本日产量实际 累计产量 本日耗费工时 累计耗费工时\n0 2007-08-10 00:00:00 FK-001 猕猴桃果肉饮料 100000 40000 45000 83000 10 20\n1 2007-08-11 00:00:00 FK-002 西瓜果肉饮料 100000 40000 44000 82000 9 18\n2 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000 40000 45000 83000 9 18\n3 2007-08-13 00:00:00 FK-004 蓝莓果肉饮料 100000 40000 45000 83000 9 18\n4 2007-08-14 00:00:00 FK-005 水密桃果肉饮料 100000 40000 45000 83000 10 20\n```'}], name='python', id='d828aa34-7c9e-4fee-8ae1-7b553530292b', tool_call_id='a4ccd41a-a872-4ff0-aae5-678a96b9b54a', artifact=[]), AIMessage(content='我已经了解了数据集 产品生产统计表.xlsx 的基本信息。请问我可以帮您做些什么?', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='e836eba6-9597-4bf8-acfd-2a81871916a6')]
By formatting the content of the last ToolMessage
, you can see the normalized data:
生产日期 | 制造编号 | 产品名称 | 预定产量 | 本日产量预计 | 本日产量实际 | 累计产量 | 本日耗费工时 | 累计耗费工时 |
---|---|---|---|---|---|---|---|---|
2007/8/10 | FK-001 | 猕猴桃果肉饮料 | 100000 | 40000 | 45000 | 83000 | 10 | 20 |
2007/8/11 | FK-002 | 西瓜果肉饮料 | 100000 | 40000 | 44000 | 82000 | 9 | 18 |
2007/8/12 | FK-003 | 草莓果肉饮料 | 100000 | 40000 | 45000 | 83000 | 9 | 18 |
2007/8/13 | FK-004 | 蓝莓果肉饮料 | 100000 | 40000 | 45000 | 83000 | 9 | 18 |