Data Preparation: Data Governance Check

Choose preparation methods that fit your data structure and business requirements. Start with data profiling to understand data quality and characteristics, then decide whether multi-table integration or constraint definitions are needed.

flowchart
    Start[Data Preparation] --> Describer[Data<br/>Profiling]
    Describer --> MultiTable{Multi-table<br/>Data?}
    MultiTable -->|Yes| Denormalize[Multi-table<br/>Denormalization]
    MultiTable -->|No| ConstraintCheck{Need<br/>Constraints?}
    Denormalize --> ConstraintCheck
    
    ConstraintCheck -->|Yes| Constraints[Business Logic<br/>Constraints]
    ConstraintCheck -->|No| Complete[Preparation<br/>Complete]
    Constraints --> Complete

    %% Macaron color scheme
    style Start fill:#B0E0E6,stroke:#87CEEB,stroke-width:2px,color:#333
    style Describer fill:#B4E7CE,stroke:#98D8C8,stroke-width:2px,color:#333
    style MultiTable fill:#E6E6FA,stroke:#DDA0DD,stroke-width:2px,color:#333
    style Denormalize fill:#B4E7CE,stroke:#98D8C8,stroke-width:2px,color:#333
    style ConstraintCheck fill:#E6E6FA,stroke:#DDA0DD,stroke-width:2px,color:#333
    style Constraints fill:#B4E7CE,stroke:#98D8C8,stroke-width:2px,color:#333
    style Complete fill:#D3D3D3,stroke:#A9A9A9,stroke-width:2px,color:#333

Legend:

  • Light blue box: Starting point
  • Light purple diamond: Decision node
  • Light green box: Action node
  • Light gray box: End point
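
The branching in the flowchart can also be read as a small function. The sketch below is only an illustration of the decision order; the function name `plan_preparation` and its two boolean parameters are hypothetical and are not part of PETsARD.

```python
def plan_preparation(multi_table: bool, needs_constraints: bool) -> list[str]:
    """Return the preparation steps in the order the flowchart visits them."""
    steps = ["Data Profiling"]  # profiling is always the first step
    if multi_table:
        # Data spread across related tables is denormalized before anything else
        steps.append("Multi-table Denormalization")
    if needs_constraints:
        # Business rules are defined after the data is in a single table
        steps.append("Business Logic Constraints")
    steps.append("Preparation Complete")
    return steps


if __name__ == "__main__":
    print(plan_preparation(multi_table=True, needs_constraints=False))
```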

Data Preparation Workflow

Follow these preparation steps based on your data characteristics:

Step 1: Data Profiling

  • Data Profiling - Starting point for all data preparation (Required)
    • Generate statistical reports using the Describer module
    • Review basic statistical information
    • Identify data quality issues
    • Understand data distribution characteristics
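
To make the profiling goals above concrete, here is a minimal sketch in plain pandas. It is a stand-in for illustration and does not use or reproduce the Describer module; the `profile` helper and the sample columns are assumptions made up for this example.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, missing ratio, and distinct value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_ratio": df.isna().mean(),      # share of missing values per column
        "n_unique": df.nunique(dropna=True),    # distinct non-missing values
    })


# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "age": [34, 51, None, 28],
    "income": [52000.0, 61000.0, 48000.0, None],
    "education": ["BSc", "MSc", "BSc", None],
})

report = profile(df)
print(report)          # per-column quality overview
print(df.describe())   # basic statistics for numeric columns
```

Columns with a high missing ratio or an unexpected number of distinct values are the ones worth a closer look before synthesis.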

Step 2: Multi-table Data Processing

  • Multi-table Relationships - When data is scattered across related tables
    • Use database denormalization to integrate multiple tables
    • Choose appropriate granularity based on downstream tasks
    • Python pandas and SQL integration examples are provided
    • This approach avoids immature multi-table synthesis techniques
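
As a rough sketch of what denormalization at two granularities might look like in pandas, consider the example below. The `customers` and `orders` tables and their columns are hypothetical; your own keys and aggregations will differ depending on the downstream task.

```python
import pandas as pd

# Hypothetical tables: one row per customer, many orders per customer
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 2],
    "amount": [20.0, 35.0, 15.0, 5.0],
})

# Order-level granularity: one row per order, customer attributes repeated.
# validate="many_to_one" raises if customer_id is not unique in customers.
order_level = orders.merge(
    customers, on="customer_id", how="left", validate="many_to_one"
)

# Customer-level granularity: aggregate orders first, then attach to customers
order_summary = (
    orders.groupby("customer_id")
    .agg(order_count=("order_id", "count"), total_amount=("amount", "sum"))
    .reset_index()
)
customer_level = customers.merge(order_summary, on="customer_id", how="left")

# Customers without orders get NaN after the left join; treat that as zero activity
customer_level = customer_level.fillna({"order_count": 0, "total_amount": 0.0})

print(order_level)
print(customer_level)
```

Order-level output suits tasks that model individual transactions, while customer-level output suits tasks that model one record per entity.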

Step 3: Constraint Definition

  • Business Logic Constraints - When business rules need to be enforced
    • Define logical relationships between fields
    • Maintain category distributions and missing value ratios
    • Use Constrainer for validation and filtering
    • Complete YAML configuration examples are provided
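
The idea of validating and filtering rows against field relationships can be sketched in plain pandas as follows. This is not the Constrainer API or its YAML syntax; the rule names, columns, and thresholds are assumptions chosen only to illustrate the concept.

```python
import pandas as pd

# Hypothetical synthetic output to be checked against business rules
synthetic = pd.DataFrame({
    "age": [25, 17, 40, 63],
    "employment_years": [3, 5, 45, 30],
    "status": ["employed", "student", "employed", "retired"],
})

# Each rule is a boolean mask; a row is kept only if it satisfies every rule
rules = {
    "adult_only": synthetic["age"] >= 18,
    "tenure_fits_age": synthetic["employment_years"] <= synthetic["age"] - 15,
}
valid_mask = pd.concat(rules, axis=1).all(axis=1)
valid = synthetic[valid_mask]

# Compare category proportions before and after filtering to spot distortion
before = synthetic["status"].value_counts(normalize=True)
after = valid["status"].value_counts(normalize=True)

print(valid)
print(before)
print(after)
```

Comparing the proportions before and after filtering shows whether enforcing the rules has shifted the category distribution, which is one of the properties the step above asks you to maintain.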

Next Steps

After completing data preparation, you can:

  1. Refer to Getting Started to begin data synthesis
  2. Check Best Practices for handling special data types
  3. Learn more about PETsARD YAML configuration details