Iterative Tuning: Data Property Adjustment
Overview
Data property adjustment is a critical step in ensuring synthetic data quality. It is not a one-time configuration task but an iterative tuning process.
Iterative Adjustment Workflow
In practical applications, data property adjustment follows this iterative cycle:
- Initial Synthesis: Use default or basic data processing settings to generate the first version of synthetic data
- Evaluation Analysis: Review synthesis quality through evaluation metrics (such as Column Shapes, Column Pair Trends, Synthesis)
- Problem Diagnosis: When evaluation results are unsatisfactory, analyze in depth which columns or data characteristics are causing the problems
- Targeted Adjustments: Based on data characteristics, adjust individual column processing methods or overall dataset settings
- Re-synthesis and Evaluation: Apply the new settings, regenerate the synthetic data, and evaluate again
- Continuous Optimization: Repeat the above steps until satisfactory evaluation results are achieved
This process emphasizes an evaluation-driven optimization strategy: rather than blindly applying every possible adjustment method, select and adjust processing strategies based on the specific problems revealed by the evaluation results (a diagnostic sketch follows the examples below). For example:
- When Column Shapes scores are low, logarithmic transformation may be needed to handle long-tailed distributions
- When categorical variable distributions differ significantly, uniform encoding may need to be applied
- When time logic shows contradictions, time anchoring scaling may be needed
- When synthesis quality of certain subgroups is poor, split-synthesize-merge strategy may need to be considered
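As a concrete illustration of the diagnosis step, the sketch below scores each column's real-versus-synthetic similarity with generic statistics (a KS complement for numeric columns, a total-variation complement for categorical ones), which mirror the intuition behind Column Shapes. The function name `diagnose_columns` and the scoring details are illustrative assumptions, not any particular library's API.

```python
import pandas as pd
from scipy.stats import ks_2samp

def diagnose_columns(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Score each column's real-vs-synthetic similarity (1.0 = identical shape)."""
    rows = []
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            # Kolmogorov-Smirnov statistic, complemented so that higher is better
            stat, _ = ks_2samp(real[col].dropna(), synth[col].dropna())
            score = 1.0 - stat
        else:
            # Total variation distance between category frequencies, complemented
            p = real[col].value_counts(normalize=True)
            q = synth[col].value_counts(normalize=True)
            score = 1.0 - 0.5 * p.subtract(q, fill_value=0).abs().sum()
        rows.append({"column": col, "score": score})
    # The weakest-scoring columns are the first candidates for targeted adjustment.
    return pd.DataFrame(rows).sort_values("score")
```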
Nature of Adjustment Methods
Different data characteristics require different processing strategies to ensure synthesizers can effectively learn the data's statistical properties and business logic. Each adjustment method is a tool designed for a specific type of data problem; combined and applied appropriately, these methods can significantly improve synthesis quality.
This section introduces four common data property adjustment methods, covering issues such as complex data structures, categorical variables, distribution skewness, and temporal relationships. What matters is understanding when and why each method is needed, rather than mechanically applying every technique.
Adjustment Methods Overview
1. High Heterogeneity Data: Split-Synthesize-Merge
For data containing multiple heterogeneous attributes, use the Split-Synthesize-Merge strategy: divide the data into more homogeneous subsets, synthesize them separately, then integrate the results (a sketch follows this subsection). This method originated as a workaround for hardware limits when processing large datasets, but in practice it has also proven effective at improving synthesis quality for heterogeneous and imbalanced data.
Applicable Scenarios:
- Data volume too large to load into memory at once (e.g., nationwide population data ~5GB)
- Obvious inherent heterogeneity exists (e.g., single-person households vs. multi-person households)
- Highly imbalanced data (e.g., fraud detection with 99% normal transactions vs. 1% fraud)
- Need to adopt different synthesis strategies or parameters for different subgroups
Expected Effects:
- Overcome hardware limitations to process large datasets
- Improve synthesis quality for each subgroup (more precisely capture subgroup characteristics)
- Improve synthesis effects for rare categories
- Flexibly adjust strategies for different subgroups
Detailed Explanation: High Heterogeneity Data: Split-Synthesize-Merge
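A minimal sketch of the strategy is shown below, assuming an SDV-style synthesizer object with `fit()` and `sample()` methods; the grouping column `household_type` and the `make_synthesizer` factory are hypothetical placeholders.

```python
import pandas as pd

def split_synthesize_merge(df: pd.DataFrame, by: str, make_synthesizer) -> pd.DataFrame:
    """Split on a grouping column, synthesize each subset separately, then merge."""
    pieces = []
    for value, subset in df.groupby(by):
        synthesizer = make_synthesizer()              # a fresh model per subgroup
        synthesizer.fit(subset.drop(columns=[by]))    # learn only this subgroup
        synth = synthesizer.sample(len(subset))       # keep the original subgroup size
        synth[by] = value                             # restore the grouping label
        pieces.append(synth)
    return pd.concat(pieces, ignore_index=True)

# e.g. synthesize single-person and multi-person households separately:
# synthetic = split_synthesize_merge(households, by="household_type",
#                                    make_synthesizer=lambda: MySynthesizer())
```

Because each subgroup gets its own model, rare subgroups are no longer drowned out by dominant ones, and the settings for each subgroup can differ if needed.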
2. Categorical Data: Uniform Encoding
For categorical variables, use the Uniform Encoding method to map discrete category values to the continuous [0, 1] interval, where each category's interval size is determined by its occurrence frequency (a sketch follows this subsection). This lets synthesizers learn category distributions and associations more effectively.
Applicable Scenarios:
- Data contains nominal or ordinal scale categorical variables
- Moderate number of categories (no more than 100 unique values per variable)
- Occurrence frequency differs between categories
- Using deep learning synthesizers (such as CTGAN, TVAE)
Expected Effects:
- Avoid introducing non-existent ordering relationships
- Preserve original category distribution information
- Improve synthetic data fidelity (average improvement of 15-40%)
- Reduce invalid or unreasonable synthetic samples
Detailed Explanation: Categorical Data: Uniform Encoding
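The sketch below shows the core idea rather than any particular library's transformer: each category owns a sub-interval of [0, 1] whose width equals its relative frequency, encoding draws uniformly from that sub-interval, and decoding maps a continuous value back to the interval that contains it. The function names are illustrative.

```python
import numpy as np
import pandas as pd

def fit_uniform_encoder(series: pd.Series) -> dict:
    """Assign each category an interval in [0, 1] sized by its relative frequency."""
    freqs = series.value_counts(normalize=True)
    uppers = freqs.cumsum()
    lowers = uppers - freqs
    return {cat: (lowers[cat], uppers[cat]) for cat in freqs.index}

def encode(series: pd.Series, intervals: dict, seed: int = 0) -> pd.Series:
    """Replace each category with a uniform random draw from its interval."""
    rng = np.random.default_rng(seed)
    lo = series.map({c: iv[0] for c, iv in intervals.items()})
    hi = series.map({c: iv[1] for c, iv in intervals.items()})
    return pd.Series(rng.uniform(lo, hi), index=series.index)

def decode(values: pd.Series, intervals: dict) -> pd.Series:
    """Map continuous values back to the category owning the containing interval."""
    def lookup(v):
        for cat, (lo, hi) in intervals.items():
            if lo <= v <= hi:
                return cat
        return None  # out-of-range values can be clipped or dropped upstream
    return values.map(lookup)
```

Frequent categories get wide intervals, so the encoded column's marginal distribution already carries the original category frequencies, which is what allows the synthesizer to reproduce them without inventing an artificial ordering.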
3. Long-Tailed Distribution: Logarithmic Transformation
For numerical variables exhibiting a long-tailed (heavy-tailed) distribution, use a logarithmic transformation (log or log1p) to convert the skewed distribution into a more symmetric one, making it easier for synthesizers to capture the data's characteristics (a sketch follows this subsection). Logarithmic transformation compresses the value range, reduces the impact of extreme values, and improves model learning.
Applicable Scenarios:
- Numerical variables show significant right or left skew (absolute skewness > 1)
- Extreme values or outliers exist (ratio of maximum to median > 10)
- The variable's dynamic range is large (e.g., income, transaction amounts, website traffic)
- Variables show multiplicative rather than additive relationships
Expected Effects:
- Transform skewed distributions into approximately normal distributions
- Reduce dominant influence of extreme values
- Improve synthesizer’s learning effectiveness and training stability
- Improve numerical distribution similarity (Column Shapes score average improvement of 10-30%)
Detailed Explanation: Long-Tailed Distribution: Logarithmic Transformation
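A minimal sketch, assuming a hypothetical long-tailed column named `amount`: `log1p` is used so that zero values remain valid, and the inverse transform is applied to the synthesizer's output to restore the original scale.

```python
import numpy as np
import pandas as pd

def log_transform(df: pd.DataFrame, column: str = "amount") -> pd.DataFrame:
    """Compress a long right tail before fitting the synthesizer."""
    # Apply only to columns that meet the rule of thumb above, e.g. abs(df[column].skew()) > 1.
    out = df.copy()
    out[column] = np.log1p(out[column])  # log1p keeps zeros valid (log1p(0) == 0)
    return out

def inverse_log_transform(df: pd.DataFrame, column: str = "amount") -> pd.DataFrame:
    """Restore the original scale on the synthesizer's output."""
    out = df.copy()
    out[column] = np.expm1(out[column])
    return out

# Typical usage: log_transform -> fit/sample the synthesizer -> inverse_log_transform.
```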
4. Multi-Timestamp Data: Time Anchoring Scaling
For data containing multiple time points, use the Time Anchoring method to transform each time point into a time difference relative to an anchor point (a sketch follows this subsection), ensuring synthetic data maintains the logical relationships and business constraints between time points.
Applicable Scenarios:
- Data contains 2 or more time or date columns
- Clear sequential relationships exist between time points
- Time points represent different lifecycle stages of the same entity
- Intervals between time points reflect important business patterns
Expected Effects:
- Greatly reduce synthetic records violating time logic
- Better preserve association patterns between time points
- Improve synthesizer’s learning efficiency and stability
- Ensure business logic consistency
Detailed Explanation: Multi-Timestamp Data: Time Anchoring Scaling
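A minimal sketch, assuming hypothetical datetime columns `applied_at`, `approved_at`, and `tracked_at` with `applied_at` as the anchor: every other timestamp is stored as a day offset from the anchor before synthesis and rebuilt from the anchor afterwards. Clipping the offsets at zero is one simple way to re-impose the constraint that later events never precede the anchor.

```python
import pandas as pd

DATE_COLS = ["applied_at", "approved_at", "tracked_at"]  # hypothetical timestamp columns
ANCHOR = "applied_at"                                    # earliest event used as the anchor

def to_offsets(df: pd.DataFrame) -> pd.DataFrame:
    """Replace each non-anchor timestamp with its offset in days from the anchor."""
    out = df.copy()
    for col in DATE_COLS:
        if col != ANCHOR:
            out[col] = (out[col] - out[ANCHOR]).dt.days
    return out

def to_timestamps(df: pd.DataFrame) -> pd.DataFrame:
    """Rebuild absolute timestamps from the anchor plus the synthesized offsets."""
    out = df.copy()
    for col in DATE_COLS:
        if col != ANCHOR:
            offsets = out[col].round().clip(lower=0).astype(int)  # no event before the anchor
            out[col] = out[ANCHOR] + pd.to_timedelta(offsets, unit="D")
    return out
```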
Selecting Appropriate Adjustment Methods
Different data characteristics require different adjustment strategies. Here is a selection guide:
| Data Characteristic | Recommended Method | Priority |
|---|---|---|
| Contains multiple categorical variables | Uniform Encoding | Required |
| Contains multiple time points | Time Anchoring Scaling | Strongly Recommended |
| Numerical distribution shows long tail | Logarithmic Transformation | Recommended |
| Data structure highly heterogeneous | Split-Synthesize-Merge | Case by Case |
Combined Usage
In practical applications, multiple adjustment methods often need to be used in combination (a combined sketch for Example 1 follows the examples below):
Example 1: Enterprise Financing Data
- Time Anchoring Scaling (handling application, approval, and tracking time points)
- Uniform Encoding (handling industry categories, financing types, etc.)
- Logarithmic Transformation (handling financing amounts and other long-tailed distributions)
Example 2: Student Enrollment Data
- Uniform Encoding (handling departments, admission methods, identity categories, etc.)
- Category as Time (handling birth year-month-day)
- High Cardinality Handling (handling department codes, etc.)
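For the financing example, a combined pipeline might look like the following sketch, which reuses the helper functions sketched in the earlier subsections and hypothetical column names (`financing_amount`, `industry`); the adjustments are reversed in the opposite order on the synthesizer's output.

```python
import pandas as pd

def prepare_for_synthesis(df: pd.DataFrame) -> tuple:
    """Apply the adjustments in order: time anchoring, log transform, uniform encoding."""
    out = to_offsets(df)                                  # 1. anchor the time points
    out = log_transform(out, column="financing_amount")   # 2. compress the long tail
    intervals = fit_uniform_encoder(out["industry"])      # 3. encode the categories
    out["industry"] = encode(out["industry"], intervals)
    return out, intervals

def restore_after_synthesis(synth: pd.DataFrame, intervals: dict) -> pd.DataFrame:
    """Reverse the adjustments in the opposite order on the synthesizer's output."""
    out = synth.copy()
    out["industry"] = decode(out["industry"], intervals)
    out = inverse_log_transform(out, column="financing_amount")
    return to_timestamps(out)
```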
Notes
- Assess Necessity: Not all data needs every adjustment method; choose based on the actual data characteristics
- Sequence Considerations: The execution order of certain adjustment methods affects the results and requires careful planning
- Business Logic: The adjustment process should always respect business logic to avoid generating unreasonable data
- Effect Verification: After adjusting, confirm through evaluation metrics whether the expected effects were achieved