File Reading¶
When working with dataset files, maintaining a clear separation between file reading and data analysis workflows can significantly improve control and clarity. In TableGPT Agent, we've designed a robust, structured approach to file reading that lets the LLM (Large Language Model) inspect dataset files effectively without being overwhelmed by unnecessary details. This not only improves the LLM's understanding of the data but also makes the subsequent analysis smoother and more reliable.
Traditionally, allowing an LLM to directly inspect a dataset might involve simply calling the df.head()
function to preview its content. While this approach suffices for straightforward use cases, it often lacks depth when dealing with more complex or messy datasets. To address this, we've developed a multi-step file reading workflow designed to deliver richer insights into the dataset structure while preparing it for advanced analysis.
Here's how the workflow unfolds:
Normalization (Optional)¶
Not all files are immediately suitable for direct analysis. Excel files, in particular, can pose challenges—irregular formatting, merged cells, and inconsistent headers are just a few examples. To tackle these issues, we introduce an optional normalization step that preprocesses the data, transforming it into a format that is “pandas-friendly.”
This step addresses the most common quirks in Excel files, such as non-standard column headers, inconsistent row structures, or missing metadata. Resolving these issues upfront ensures smooth integration with downstream processes.
Example Scenario:
Imagine you have an Excel file that looks like this:
# Load the data into a DataFrame
df1 = read_df('产品生产统计表.xlsx', header=[0, 1, 2])
df1.head(5)
| | 产品生产统计表 | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | 生产日期 | 制造编号 | 产品名称 | 预定产量 | 本日产量 | | 累计产量 | 耗费工时 | |
| | Unnamed: 0_level_2 | Unnamed: 1_level_2 | Unnamed: 2_level_2 | Unnamed: 3_level_2 | 预计 | 实际 | Unnamed: 6_level_2 | 本日 | 累计 |
| 0 | 2007-08-10 00:00:00 | FK-001 | 猕猴桃果肉饮料 | 100000.0 | 40000 | 45000 | 83000 | 10.0 | 20.0 |
| 1 | 2007-08-11 00:00:00 | FK-002 | 西瓜果肉饮料 | 100000.0 | 40000 | 44000 | 82000 | 9.0 | 18.0 |
| 2 | 2007-08-12 00:00:00 | FK-003 | 草莓果肉饮料 | 100000.0 | 40000 | 45000 | 83000 | 9.0 | 18.0 |
| 3 | 2007-08-13 00:00:00 | FK-004 | 蓝莓果肉饮料 | 100000.0 | 40000 | 45000 | 83000 | 9.0 | 18.0 |
| 4 | 2007-08-14 00:00:00 | FK-005 | 水密桃果肉饮料 | 100000.0 | 40000 | 45000 | 83000 | 10.0 | 20.0 |
The file is riddled with merged cells, empty rows, and redundant formatting that make it incompatible with pandas. If you try to load this file directly, pandas might misinterpret the structure or fail to parse it entirely.
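For illustration only (this is not part of the agent's workflow), a direct load with pandas defaults might look like the hypothetical snippet below; the merged title row is likely to be taken as the header, pushing the real column names down into the data:

# Hypothetical direct load for comparison. With the default header=0,
# the merged title row is read as the header, so the real column names
# (生产日期, 制造编号, ...) typically end up in the first data rows.
import pandas as pd

raw = pd.read_excel('产品生产统计表.xlsx')
raw.head()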
With our normalization feature, irregular datasets can be transformed into clean, structured formats. When creating the agent with create_tablegpt_agent
, simply pass the normalize_llm
parameter. The system will then analyze the irregular data and generate the appropriate transformation code, preparing the dataset for further analysis.
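As a rough sketch (the surrounding setup is assumed and may differ from your environment), enabling normalization looks something like this:

from langchain_openai import ChatOpenAI
from pybox import LocalPyBoxManager  # assumed local code-execution backend
from tablegpt.agent import create_tablegpt_agent

# Assumed models and placeholder endpoints: one chat model for analysis,
# plus a second model dedicated to generating the normalization code.
llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="TableGPT2-7B")
normalize_llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="YOUR_NORMALIZE_MODEL")

agent = create_tablegpt_agent(
    llm=llm,
    pybox_manager=LocalPyBoxManager(),
    # Passing normalize_llm enables the optional normalization step described above.
    normalize_llm=normalize_llm,
)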
Below is an example of the code generated for the provided irregular dataset:
# Normalize the data
try:
    df = df1.copy()
    import pandas as pd
    # Assuming the original data is loaded into a DataFrame named df
    # Here is the transformation process:

    # Step 1: Isolate the Table Header
    # Remove the unnecessary top rows and columns
    final_df = df.iloc[2:, :9].copy()

    # Step 2: Rename Columns of final_df
    # Adjust the column names to match the desired format
    final_df.columns = ['生产日期', '制造编号', '产品名称', '预定产量', '本日产量预计', '本日产量实际', '累计产量', '本日耗费工时', '累计耗费工时']

    # Step 3: Data Processing
    # Ensure there are no NaN values and drop any duplicate rows if necessary
    final_df.dropna(inplace=True)
    final_df.drop_duplicates(inplace=True)

    # Convert the appropriate columns to numeric types
    final_df['预定产量'] = final_df['预定产量'].astype(int)
    final_df['本日产量预计'] = final_df['本日产量预计'].astype(int)
    final_df['本日产量实际'] = final_df['本日产量实际'].astype(int)
    final_df['累计产量'] = final_df['累计产量'].astype(int)
    final_df['本日耗费工时'] = final_df['本日耗费工时'].astype(int)
    final_df['累计耗费工时'] = final_df['累计耗费工时'].astype(int)

    # If the first row merely repeats the column names, drop it
    if final_df.columns.tolist() == final_df.iloc[0].tolist():
        final_df = final_df.iloc[1:]

    # Reassign df1 with the formatted DataFrame
    df1 = final_df
except Exception as e:
    # Unable to apply formatting to the original DataFrame; proceed with the unformatted DataFrame.
    print(f"Reformat failed with error {e}, use the original DataFrame.")
Using the generated transformation code, the irregular dataset is converted into a clean, structured format, ready for analysis:
df1.head(5)
|   | 生产日期 | 制造编号 | 产品名称 | 预定产量 | 本日产量预计 | 本日产量实际 | 累计产量 | 本日耗费工时 | 累计耗费工时 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 2007-08-12 00:00:00 | FK-003 | 草莓果肉饮料 | 100000 | 40000 | 45000 | 83000 | 9 | 18 |
| 3 | 2007-08-13 00:00:00 | FK-004 | 蓝莓果肉饮料 | 100000 | 40000 | 45000 | 83000 | 9 | 18 |
| 4 | 2007-08-14 00:00:00 | FK-005 | 水密桃果肉饮料 | 100000 | 40000 | 45000 | 83000 | 10 | 20 |
| 5 | 2007-08-15 00:00:00 | FK-006 | 荔枝果肉饮料 | 100000 | 40000 | 44000 | 82000 | 10 | 20 |
| 6 | 2007-08-16 00:00:00 | FK-007 | 樱桃果肉饮料 | 100000 | 40000 | 46000 | 84000 | 9 | 18 |
Dataset Structure Overview¶
After normalization, the next step dives into the structural aspects of the dataset using the df.info()
function. Unlike df.head()
, which only shows a snippet of the data, df.info()
provides a holistic view of the dataset’s structure. Key insights include:
- Column Data Types: Helps identify numerical, categorical, or textual data at a glance.
- Non-Null Counts: Reveals the completeness of each column, making it easy to spot potential gaps or inconsistencies.
By focusing on the foundational structure of the dataset, this step enables the LLM to better understand the quality and layout of the data, paving the way for more informed analyses.
# Remove leading and trailing whitespaces in column names
df1.columns = df1.columns.str.strip()
# Remove rows and columns that contain only empty values
df1 = df1.dropna(how='all').dropna(axis=1, how='all')
# Get the basic information of the dataset
df1.info(memory_usage=False)
<class 'pandas.core.frame.DataFrame'>
Index: 18 entries, 2 to 19
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   生产日期    18 non-null     object
 1   制造编号    18 non-null     object
 2   产品名称    18 non-null     object
 3   预定产量    18 non-null     int64
 4   本日产量预计  18 non-null     int64
 5   本日产量实际  18 non-null     int64
 6   累计产量    18 non-null     int64
 7   本日耗费工时  18 non-null     int64
 8   累计耗费工时  18 non-null     int64
dtypes: int64(6), object(3)
Dataset Content Preview¶
Finally, we utilize the df.head()
function to provide a visual preview of the dataset’s content. This step is crucial for understanding the actual values within the dataset—patterns, anomalies, or trends often become apparent here.
The number of rows displayed (n
) is configurable to balance between granularity and simplicity. For smaller datasets or detailed exploration, a larger n
might be beneficial. However, for larger datasets, displaying too many rows could overwhelm the LLM with excessive details, detracting from the primary analytical objectives.
# Show the first 5 rows to understand the structure
df1.head(5)
|   | 生产日期 | 制造编号 | 产品名称 | 预定产量 | 本日产量预计 | 本日产量实际 | 累计产量 | 本日耗费工时 | 累计耗费工时 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 2007-08-12 00:00:00 | FK-003 | 草莓果肉饮料 | 100000 | 40000 | 45000 | 83000 | 9 | 18 |
| 3 | 2007-08-13 00:00:00 | FK-004 | 蓝莓果肉饮料 | 100000 | 40000 | 45000 | 83000 | 9 | 18 |
| 4 | 2007-08-14 00:00:00 | FK-005 | 水密桃果肉饮料 | 100000 | 40000 | 45000 | 83000 | 10 | 20 |
| 5 | 2007-08-15 00:00:00 | FK-006 | 荔枝果肉饮料 | 100000 | 40000 | 44000 | 82000 | 10 | 20 |
| 6 | 2007-08-16 00:00:00 | FK-007 | 樱桃果肉饮料 | 100000 | 40000 | 46000 | 84000 | 9 | 18 |
Why This Matters¶
This structured, multi-step approach is not just about processing data; it's about making the LLM smarter in how it interacts with datasets. By systematically addressing issues like messy formatting, structural ambiguity, and information overload, we ensure the LLM operates with clarity and purpose.
The separation of file reading from analysis offers several advantages:
- Enhanced Accuracy: Preprocessing and structure-checking reduce the risk of errors in downstream analyses.
- Scalability: Handles datasets of varying complexity and size with equal efficiency.
- Transparency: Provides clear visibility into the dataset’s structure, enabling better decision-making.
By adopting this method, TableGPT Agent transforms the way dataset files are read and analyzed, offering a smarter, more controlled, and ultimately more user-friendly experience.