Business Logic Constraints
Use this approach when synthesized data must comply with business rules.
Define logical relationships between fields, range restrictions, and proportion maintenance in a Constraints YAML file to ensure that synthesized data complies with business specifications. Four constraint types are supported: missing value groups, field constraints, field combinations, and field proportions.
If your data has no special constraint requirements, you can use the default synthesis methods directly; refer to Getting Started.
Constraints YAML Example
Save the following content as business_constraints.yaml:
# Missing value handling rules
nan_groups:
  # Delete entire record when capital is missing (incomplete enterprise basic information)
  capital: 'delete'
# Field constraint conditions
field_constraints:
  # Time logic constraints: established_date < first_apply_date <= latest_apply_date < latest_track_date
  - "established_date < first_apply_date"
  - "first_apply_date <= latest_apply_date"
  - "latest_apply_date < latest_track_date"
  # Numerical relationship constraints: approved amount cannot exceed requested amount
  - "latest_apply_amount_approved <= latest_apply_amount_requested"
  # Range constraints
  - "capital > 0"  # Capital must be positive
  - "latest_track_profit_ratio >= -1.0 & latest_track_profit_ratio <= 1.0"  # Profit ratio range
  - "latest_apply_amount_requested > 0"  # Application amount must be positive
# Field combination constraints (allowlist)
field_combinations:
  # Application status and approved amount pairing relationship
  -
    - latest_apply_status: latest_apply_amount_approved
    - approved:  # Must have amount when approved
        - 1000000
        - 5000000
        - 10000000
        - 20000000
      rejected:  # Amount is null when rejected
        - "pd.NA"
      withdrawn:  # Amount is null when withdrawn
        - "pd.NA"
# Field proportion maintenance
field_proportions:
  # Maintain industry distribution with 10% tolerance
  - fields: 'industry'
    mode: 'all'
    tolerance: 0.1
  # Maintain risk level distribution with 5% tolerance
  - fields: 'latest_track_risk_level'
    mode: 'all'
    tolerance: 0.05

Detailed Constraint Type Descriptions
nan_groups (Missing Value Group Constraints)
Define handling rules when specific fields have null values.
capital: 'delete'
- Description: When the capital field is null, delete the entire record
- Applicable scenario: When a field is null, the entire record is meaningless (e.g., the company has no capital information)
- Handling methods: delete (delete entire record), erase (clear related fields), copy (copy other field values)
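To make the delete action concrete, the following is a minimal Python sketch of its effect, not the library's implementation. It assumes records are plain dictionaries and that a null is represented by None; the helper name apply_nan_delete and the company_id field are hypothetical and exist only for illustration.

```python
from typing import Any


def apply_nan_delete(records: list[dict[str, Any]], field: str) -> list[dict[str, Any]]:
    """Drop every record whose `field` is missing (None), mirroring `capital: 'delete'`."""
    return [row for row in records if row.get(field) is not None]


companies = [
    {"company_id": "A", "capital": 5_000_000},
    {"company_id": "B", "capital": None},  # missing capital: the whole record is dropped
    {"company_id": "C", "capital": 1_200_000},
]

kept = apply_nan_delete(companies, "capital")
print([row["company_id"] for row in kept])
```

Only the records with a capital value survive; the erase and copy actions would instead modify other fields of the record rather than remove it.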
field_constraints (Field Constraints)
Define numerical ranges and logical relationships for single or multiple fields.
Time Logic Constraints
"established_date < first_apply_date"
- Description: Company establishment date must be earlier than first application date
- Applicable scenario: Time sequence business logic
"first_apply_date <= latest_apply_date"
- Description: First application date must be earlier than or equal to latest application date
- Applicable scenario: Equality allowed because there may be only one application
Numerical Relationship Constraints
"latest_apply_amount_approved <= latest_apply_amount_requested"
- Description: Approved amount cannot exceed requested amount
- Applicable scenario: Basic review principles of financial institutions
Range Constraints
"capital > 0"
- Description: Capital must be positive
- Applicable scenario: Fields that logically cannot be zero or negative
"latest_track_profit_ratio >= -1.0 & latest_track_profit_ratio <= 1.0"
- Description: Profit ratio ranges from -100% to 100%
- Applicable scenario: Reasonable range for ratio-type fields
- Syntax: Use & (AND) to combine multiple conditions
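The field constraints above can be pictured as predicates that every record must pass. The sketch below is an illustration in plain Python, not the library's expression parser: each constraint string is paired by hand with an equivalent lambda, all fields are assumed to be non-null, and the names RULES and violated_rules are hypothetical.

```python
from datetime import date

# Each constraint text from the YAML is paired with a hand-written Python predicate.
RULES = {
    "established_date < first_apply_date":
        lambda r: r["established_date"] < r["first_apply_date"],
    "first_apply_date <= latest_apply_date":
        lambda r: r["first_apply_date"] <= r["latest_apply_date"],
    "latest_apply_amount_approved <= latest_apply_amount_requested":
        lambda r: r["latest_apply_amount_approved"] <= r["latest_apply_amount_requested"],
    "latest_track_profit_ratio >= -1.0 & latest_track_profit_ratio <= 1.0":
        lambda r: -1.0 <= r["latest_track_profit_ratio"] <= 1.0,
}


def violated_rules(record: dict) -> list[str]:
    """Return the text of every rule the record fails (empty list means it passes)."""
    return [text for text, check in RULES.items() if not check(record)]


good = {
    "established_date": date(2010, 3, 1),
    "first_apply_date": date(2015, 6, 1),
    "latest_apply_date": date(2015, 6, 1),  # equal dates: only one application so far
    "latest_apply_amount_approved": 4_000_000,
    "latest_apply_amount_requested": 5_000_000,
    "latest_track_profit_ratio": 0.12,
}
# Same record, but approved exceeds requested and the profit ratio is out of range.
bad = dict(good, latest_apply_amount_approved=6_000_000, latest_track_profit_ratio=1.5)

print(violated_rules(good))
print(violated_rules(bad))
```

A record is kept only when the returned list is empty, which reflects the requirement that all conditions hold at once.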
field_combinations (Field Combination Constraints)
Define valid pairings between field values (allowlist).
Application Status and Approved Amount Pairing
-
  - latest_apply_status: latest_apply_amount_approved
  - approved: [1000000, 5000000, 10000000, 20000000]
    rejected: ["pd.NA"]
    withdrawn: ["pd.NA"]

- Description: Define allowed approved amounts for different application statuses
- approved: Approved status can have specific amounts (list common amounts)
- rejected, withdrawn: Rejected or withdrawn status amount must be null
- Note: This is an allowlist; unlisted combinations are considered invalid
Syntax Rules
- Null values must use "pd.NA" (case-sensitive)
- String values must be quoted like "approved"
- Numerical values do not need quotes
field_proportions (Field Proportion Constraints)
Maintain category distributions or null value proportions close to original data.
Maintain Industry Distribution
- fields: 'industry'
  mode: 'all'
  tolerance: 0.1

- fields: Field name to maintain proportion
- mode: 'all': Maintain all category distributions
- tolerance: 0.1: Allowable error range (±10%)
Maintain Risk Level Distribution
- fields: 'latest_track_risk_level'
  mode: 'all'
  tolerance: 0.05

- tolerance: 0.05: Stricter tolerance (±5%) ensures risk distribution is close to original data
Maintain Null Value Proportion
- fields: 'workclass'
  mode: 'missing'
  tolerance: 0.03

- mode: 'missing': Only maintain missing value (NA) proportion
- Applicable scenario: Preserve data quality characteristics
Note: This constraint reaches the target proportions by removing excess records, which may reduce the total number of synthesized records.
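One way to picture the tolerance check is to compare category shares between the original and the synthesized data. The sketch below is only an illustration: it assumes tolerance means the maximum absolute difference in each category's share, which the text above does not spell out, and the function names and sample industries are hypothetical.

```python
from collections import Counter


def proportions(values: list[str]) -> dict[str, float]:
    """Share of each category in the list."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def within_tolerance(original: list[str], synthetic: list[str], tolerance: float) -> bool:
    """Assumed reading: every category's share may differ by at most `tolerance`."""
    orig_p = proportions(original)
    synth_p = proportions(synthetic)
    categories = set(orig_p) | set(synth_p)
    return all(
        abs(orig_p.get(c, 0.0) - synth_p.get(c, 0.0)) <= tolerance
        for c in categories
    )


original = ["retail"] * 50 + ["tech"] * 30 + ["finance"] * 20
synthetic = ["retail"] * 58 + ["tech"] * 27 + ["finance"] * 15

print(within_tolerance(original, synthetic, 0.1))   # largest gap is 8 points
print(within_tolerance(original, synthetic, 0.05))  # the retail gap exceeds 5 points
```

Under this reading, a tighter tolerance rejects more candidate data sets, which is consistent with the note that stricter proportions can lower the final record count.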
Usage Methods
Iterative Sampling Mode (Synthesized Data Generation)
The synthesis process automatically filters out data that does not meet the constraints:
Loader:
  data:
    filepath: 'denormalized_data.csv'
    schema: 'denormalized_schema.yaml'
Preprocessor:
  default:
    method: 'default'
Synthesizer:
  default:
    method: 'default'
Constrainer:
  apply_constraints:
    method: resample  # or auto (automatically determines)
    constraints_yaml: 'business_constraints.yaml'
    target_rows: None  # If not specified, uses original data row count
    sampling_ratio: 10.0  # Each sampling is 10 times the target row count
    max_trials: 300  # Maximum 300 attempts
Postprocessor:
  default:
    method: 'default'
Reporter:
  output:
    method: 'save_data'
    source: 'Postprocessor'

Validation Check Mode (Check Existing Data)
To check if original or synthesized data meets constraints:
Loader:
  original_data:
    filepath: 'denormalized_data.csv'
    schema: 'denormalized_schema.yaml'
Constrainer:
  check_constraints:
    method: validate
    source: Loader  # Specify data source to check
    constraints_yaml: 'business_constraints.yaml'
Reporter:
  validation_report:
    method: save_validation
    output: 'data_validation'  # Output filename prefix
    include_details: true  # Include detailed violation records

This will generate three CSV files:
- data_validation_summary.csv: Overall statistics (pass rate, violation rate, etc.)
- data_validation_violations.csv: Violation statistics for each constraint
- data_validation_details.csv: Detailed violation records (up to 10 examples per rule)
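The resample settings in the iterative sampling configuration (target_rows, sampling_ratio, max_trials) suggest a generate, filter, and repeat loop. The following Python sketch shows one plausible reading of how those parameters interact; it is an assumption, not the library's actual algorithm, and the resample function, the toy generator, and the single capital rule are all hypothetical stand-ins.

```python
import random
from typing import Callable


def resample(
    generate: Callable[[int], list[dict]],
    is_valid: Callable[[dict], bool],
    target_rows: int,
    sampling_ratio: float = 10.0,
    max_trials: int = 300,
) -> tuple[list[dict], int]:
    """Draw batches of candidates, keep the valid ones, stop when enough are collected."""
    accepted: list[dict] = []
    batch_size = int(target_rows * sampling_ratio)  # each draw is a multiple of the target
    for trial in range(1, max_trials + 1):
        candidates = generate(batch_size)
        accepted.extend(row for row in candidates if is_valid(row))
        if len(accepted) >= target_rows:
            return accepted[:target_rows], trial
    # Trial budget exhausted: return whatever passed, possibly fewer than target_rows.
    return accepted, max_trials


rng = random.Random(0)


def toy_generator(n: int) -> list[dict]:
    """Stand-in for a synthesizer: capital is a random integer, sometimes invalid."""
    return [{"capital": rng.randint(-5, 5)} for _ in range(n)]


rows, trials = resample(toy_generator, lambda r: r["capital"] > 0, target_rows=20)
print(len(rows), trials)
```

In this reading, a larger sampling_ratio trades memory and time per trial for fewer trials, while max_trials bounds the total work when the constraints are hard to satisfy.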
Constraint Definition Principles
In practice, first work with domain experts to confirm the hard constraints and data quality requirements implied by the business rules. Then use exploratory data analysis to identify constraint patterns, and write the identified constraints into the YAML file. Once the definitions are complete, run Constrainer in validate mode to check whether the original data satisfies them. If the violation rate is too high (above 5%), reassess whether the constraint definitions are reasonable and refine them step by step based on the validation results.
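The 5% guideline can be checked by hand on a sample before running the full pipeline. The sketch below is a simplified illustration, not the Constrainer's validate mode: the violation_rate helper, the REVIEW_THRESHOLD constant, and the sample values are hypothetical.

```python
from typing import Callable

REVIEW_THRESHOLD = 0.05  # the 5% guideline from the text above


def violation_rate(records: list[dict], rule: Callable[[dict], bool]) -> float:
    """Fraction of records that fail the rule (0.0 for an empty list)."""
    if not records:
        return 0.0
    violations = sum(1 for record in records if not rule(record))
    return violations / len(records)


# Ten sample records; exactly one has a non-positive capital.
sample = [{"capital": c} for c in [100, 250, 0, 900, 40, 75, 310, 620, 15, 480]]

rate = violation_rate(sample, lambda r: r["capital"] > 0)
needs_review = rate > REVIEW_THRESHOLD
print(rate, needs_review)
```

When the original data itself violates a rule this often, the rule may be too strict or may encode an assumption the real data does not follow, which is the cue to revisit it with domain experts.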
Important principles when defining constraints:
- All constraint conditions must be satisfied simultaneously (AND logic)
- The execution order of the four constraint types is fixed and cannot be adjusted
- field_combinations uses an allowlist approach, allowing only explicitly listed combinations
- Null values must use "pd.NA" (case-sensitive)
- String values must be quoted like "approved"
- field_constraints supports pandas query syntax; use & (AND), | (OR), and other operators to combine conditions