Data Describing
Applicable to all data preparation workflows as the starting point.
Use the Describer module to generate statistical description reports for datasets, helping you examine basic statistical information, identify data quality issues, understand data distribution characteristics, and assess whether data is suitable for synthesis.
Describer is the first step of data preparation. It is strongly recommended to execute this before any data integration or constraint definition.
Generate Statistical Reports Using Describer
Basic Usage
Loader:
  data:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema
Describer:
  profile_data:
    method: describe # or use default (will auto-determine as describe)
    source: Loader
Parameter Description
- method: Evaluation method
  - describe: Single dataset statistical description
  - compare: Dataset comparison analysis (requires two data sources)
  - default: Automatically determines based on source count (recommended)
  - Default: default
- source: Data source module
  - Single source (describe mode): Directly specify module name, such as Loader
  - Two sources (compare mode): Use dictionary format to specify base and target
  - Available values: Loader, Splitter, Preprocessor, Synthesizer, Postprocessor, Constrainer
Generated Statistical Reports
Describer generates statistical reports at three levels:
1. Global Level
Overall dataset statistics:
- Number of records (total records)
- Number of columns
- Memory usage
- Data type distribution (numeric, categorical, datetime)
2. Columnwise Level
Numeric Column Statistics:
- Mean
- Standard deviation (std)
- Median
- Minimum (min)
- Maximum (max)
- Quartiles (Q1, Q3)
- Missing value ratio
Categorical Column Statistics:
- Number of unique values (nunique)
- Most frequent value
- Frequency
- Missing value ratio
Datetime Column Statistics:
- Time range (earliest, latest date)
- Time interval distribution
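To make the columnwise statistics concrete, the following is a minimal sketch in plain Python of how the numeric and categorical figures listed above can be computed for one column. It is an illustration only, not Describer's implementation: the function names, the use of None for missing values, and the sample values are assumptions made for this example, and quartile conventions differ between tools, so Q1 and Q3 may not match the report exactly.

```python
import statistics
from collections import Counter


def describe_numeric(values):
    """Summarise one numeric column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    # statistics.quantiles uses the 'exclusive' method by default.
    q1, _, q3 = statistics.quantiles(present, n=4)
    return {
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "q1": q1,
        "q3": q3,
        "missing_ratio": (len(values) - len(present)) / len(values),
    }


def describe_categorical(values):
    """Summarise one categorical column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    top, freq = Counter(present).most_common(1)[0]
    return {
        "nunique": len(set(present)),
        "top": top,    # most frequent value
        "freq": freq,  # how often the most frequent value occurs
        "missing_ratio": (len(values) - len(present)) / len(values),
    }


# Made-up sample values, for illustration only.
ages = [25, 32, None, 47, 51, 38, 29, 60]
workclass = ["Private", "Private", "State-gov", None,
             "Private", "Self-emp", "Private", "State-gov"]

age_stats = describe_numeric(ages)
workclass_stats = describe_categorical(workclass)
print(age_stats)
print(workclass_stats)
```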
3. Pairwise Level
Correlation analysis between columns:
- Correlation coefficient matrix between numeric columns
- Identification of highly correlated column pairs
- Association strength between categorical columns
Data Quality Check
Use Describer reports for data quality assessment:
Missing Value Check
- Observe missing value ratio for each column
- High missing value ratio (>30%) may affect synthesis quality
- Consider using nan_groups constraints to define handling rules
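The missing value check can be sketched as follows. This is a hedged illustration in plain Python, not part of Describer: the row layout (a list of dictionaries with None for missing values), the function names, and the sample rows are assumptions. The 30% threshold is the one mentioned above.

```python
def missing_ratios(rows):
    """Return {column: share of rows in which the value is None}."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}


def flag_high_missing(ratios, threshold=0.30):
    """Return the columns whose missing ratio exceeds the threshold."""
    return sorted(c for c, ratio in ratios.items() if ratio > threshold)


# Made-up rows, for illustration only.
rows = [
    {"age": 39, "occupation": None},
    {"age": 50, "occupation": "Sales"},
    {"age": None, "occupation": None},
    {"age": 28, "occupation": "Tech-support"},
    {"age": 41, "occupation": None},
]

ratios = missing_ratios(rows)
flagged = flag_high_missing(ratios)
print(ratios)   # per-column missing ratio
print(flagged)  # columns above the 30% threshold
```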
Outlier Check
- Review minimum and maximum values of numeric columns
- Observe relationship between standard deviation and quartiles
- Extreme values may need handling in preprocessing stage
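One common way to turn the quartiles into an outlier check is the 1.5 × IQR rule. The sketch below uses that heuristic as an assumption: the source does not say that Describer applies it, and the function name and sample values are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (a common heuristic)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up weekly working hours; 99 is far from the rest of the values.
hours = [40, 38, 45, 40, 42, 36, 40, 99]
outliers = iqr_outliers(hours)
print(outliers)
```

A value flagged this way is only a candidate for review; whether it should be clipped, removed, or kept is a preprocessing decision.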
Category Distribution Check
- Check number of unique values in categorical columns
- Evaluate category balance (whether there are dominant categories)
- Minority categories (<1%) may disappear after synthesis
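The category distribution check can be illustrated with a short plain-Python sketch. The function names and the sample column are assumptions made for this example; the 1% threshold for minority categories is the one mentioned above.

```python
from collections import Counter


def category_shares(values):
    """Return {category: share of all records}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def rare_categories(shares, threshold=0.01):
    """Return categories whose share falls below the threshold."""
    return sorted(cat for cat, share in shares.items() if share < threshold)


# Made-up column with one dominant and one very rare category.
values = ["Private"] * 850 + ["Self-emp"] * 145 + ["Never-worked"] * 5

shares = category_shares(values)
rare = rare_categories(shares)
dominant = max(shares, key=shares.get)
print(dominant, shares[dominant])
print(rare)
```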
Data Correlation Check
- Identify highly correlated columns (correlation coefficient >0.9)
- Consider whether dimensionality reduction or feature selection is needed
- Understand dependencies between columns
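As an illustration of the correlation check, the sketch below computes the Pearson correlation for every pair of numeric columns and lists the pairs above the 0.9 threshold. It is a plain-Python example under stated assumptions: the source does not specify which correlation coefficient Describer reports, and the function names and sample columns are made up.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def highly_correlated(columns, threshold=0.9):
    """Return (name_a, name_b, r) for column pairs with |r| above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Made-up columns: the two height columns carry the same information.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 22, 45, 29, 51],
}

pairs = highly_correlated(columns)
print(pairs)
```

A pair flagged here, such as the two height columns, is a candidate for dropping one of the columns before synthesis.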
Compare Original Data with Synthetic Data
When you need to compare two datasets (e.g., original data vs. synthetic data):
Loader:
  original:
    filepath: 'original_data.csv'
    schema: 'data_schema.yaml'
Synthesizer:
  synthetic:
    method: custom_data
    filepath: 'synthetic_data.csv'
    schema: 'data_schema.yaml'
Describer:
  compare_data:
    method: compare # or use default (will auto-determine as compare)
    source:
      base: Loader # Base data (original data)
      target: Synthesizer # Comparison target (synthetic data)
Comparison Report Content
Global Level (with Score):
- Overall similarity score
- Difference in number of records
- Overall comparison of column statistics
Columnwise Level:
- Difference or percentage change in statistics for each column
- Distribution similarity (JS divergence)
- Change in missing value ratio
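For readers unfamiliar with JS divergence, the sketch below computes the Jensen-Shannon divergence between the category distributions of a base column and a target column. It is a generic textbook formulation, not Describer's code: the logarithm base (2 here, which bounds the result between 0 and 1), the function name, and the sample data are assumptions.

```python
import math
from collections import Counter


def js_divergence(base, target):
    """Jensen-Shannon divergence (log base 2) between two categorical samples."""
    categories = set(base) | set(target)
    base_counts, target_counts = Counter(base), Counter(target)
    p = {c: base_counts[c] / len(base) for c in categories}
    q = {c: target_counts[c] / len(target) for c in categories}
    m = {c: 0.5 * (p[c] + q[c]) for c in categories}  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a[c] == 0 contribute nothing.
        return sum(a[c] * math.log2(a[c] / b[c]) for c in categories if a[c] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up original and synthetic columns.
original = ["Private"] * 70 + ["State-gov"] * 30
synthetic = ["Private"] * 60 + ["State-gov"] * 40

score = js_divergence(original, synthetic)
print(score)  # close to 0 when the two distributions are similar
```

Identical distributions give 0, and distributions with no category in common give 1 under this base-2 convention.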
Custom Comparison Method
Describer:
  custom_comparison:
    method: compare
    source:
      base: Loader
      target: Synthesizer
    stats_method: # Custom statistical methods
      - mean
      - std
      - nunique
      - jsdivergence
    compare_method: diff # Use difference instead of percentage change
    aggregated_method: mean
    summary_method: mean
Available Statistical Methods:
- Numeric: mean, std, median, min, max
- Categorical: nunique, jsdivergence
Comparison Methods:
- pct_change: Percentage change, (target - base) / abs(base) (default)
- diff: Absolute difference, target - base
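The two comparison formulas can be checked with a short worked example. The sketch below implements them exactly as written above; the function names mirror the option names, and the sample values are made up for illustration.

```python
def diff(base, target):
    """Absolute difference: target - base."""
    return target - base


def pct_change(base, target):
    """Percentage change: (target - base) / abs(base)."""
    return (target - base) / abs(base)


# Made-up column means for the original (base) and synthetic (target) data.
base_mean, target_mean = 40.0, 42.0
print(diff(base_mean, target_mean))        # 2.0
print(pct_change(base_mean, target_mean))  # 0.05

# Dividing by abs(base) keeps the sign meaningful when the base is negative:
# moving from -4.0 to -2.0 is an increase, so the result is positive.
print(pct_change(-4.0, -2.0))              # 0.5
```

With this formula a base value of 0 leaves the percentage change undefined (a ZeroDivisionError in this sketch), which is one case where diff is the more practical choice.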
Practical Recommendations
Checking Order:
- Execute Describer to generate statistical report
- Review Global level to understand overall situation
- Check Columnwise level to identify problematic columns
- Observe Pairwise level to understand column associations
Decision Making:
- If the data spans multiple tables → Refer to Multi-table Relationships
- If constraints are needed → Refer to Business Logic Constraints
- If data quality is good → Proceed directly to Getting Started
Quality Standards:
- Missing value ratio: below 20% recommended
- Outlier ratio: below 5% recommended
- Category balance: dominant category below 80% recommended
- Highly correlated column pairs: further analysis recommended
Notes
- Describer does not modify original data, only generates statistical reports
- source parameter is required and must explicitly specify data source
- compare mode must use dictionary format to specify base and target
- Statistical methods automatically filter applicable calculations based on data type
- Inapplicable statistical methods will return NaN