Evaluation Interpretation: Purpose-Driven Assessment

After data preparation is complete, evaluating the quality of the synthetic data is a critical step in ensuring it meets application requirements. The evaluation strategy should be determined by the intended use of the synthetic data, since different application scenarios call for different evaluation focuses and standards. This chapter helps you select appropriate evaluation methods and parameter settings based on how the data will be used.

Quality assessment of synthetic data encompasses three core aspects:

  • Privacy Protection: Ensuring the synthetic data does not leak personally identifiable information from the original data
  • Data Fidelity: Measuring how closely the synthetic data matches the statistical properties of the original data
  • Data Utility: Verifying how well the synthetic data performs in specific machine learning tasks

Among these three aspects, our team recommends always prioritizing privacy protection, then weighting the other two according to the application scenario:

  • Data Release Scenarios: When synthetic data will be publicly released or shared with third parties, pursue high fidelity to preserve the data's general-purpose value
  • Specific Task Modeling: When synthetic data is used for a specific machine learning task (such as data augmentation or model training), pursue high utility to meet that task's requirements

The decision flow below summarizes this evaluation process:
flowchart TD
    Start([Start evaluation])
    Diagnostic{Step 1:<br/>Data diagnostics passed?}
    DiagnosticFail[Data structure issues<br/>Review the synthesis process]
    Privacy{Step 2:<br/>Privacy protection passed?}
    PrivacyFail[Privacy risk too high<br/>Adjust synthesis parameters]
    Purpose{Step 3:<br/>Purpose of the synthetic data?}
    Release[Scenario A:<br/>Data release<br/>No specific downstream task]
    Task[Scenario B:<br/>Specific task application<br/>Data augmentation / model training]
    FidelityFocus[Evaluation focus:<br/>Pursue the highest fidelity]
    UtilityFocus[Evaluation focus:<br/>Pursue high utility<br/>Fidelity only needs to meet the threshold]

    Start --> Diagnostic
    Diagnostic -->|No| DiagnosticFail
    Diagnostic -->|Yes| Privacy
    Privacy -->|No| PrivacyFail
    Privacy -->|Yes| Purpose
    Purpose -->|A| Release
    Purpose -->|B| Task
    Release --> FidelityFocus
    Task --> UtilityFocus

    style Start fill:#e1f5fe
    style DiagnosticFail fill:#ffcdd2
    style PrivacyFail fill:#ffcdd2
    style FidelityFocus fill:#c8e6c9
    style UtilityFocus fill:#c8e6c9

Chapter Navigation

1. Privacy Risk Estimation: Protection Parameter Configuration

Privacy protection is the first priority in synthetic data quality assessment. This section explains how to use the Anonymeter tool to evaluate three privacy attack modes (singling out, linkability, inference), and provides recommended parameter configurations and standards for interpreting the resulting risk.
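As a taste of what this section covers, the sketch below shows a minimal singling-out evaluation with the Anonymeter library. The file names, column layout, and n_attacks value are illustrative assumptions, not recommended settings; the section itself discusses how to choose them.

```python
# Minimal sketch of an Anonymeter singling-out evaluation.
# Dataframe sources and n_attacks are illustrative assumptions.
import pandas as pd
from anonymeter.evaluators import SinglingOutEvaluator

ori = pd.read_csv("original.csv")       # original data used for synthesis
syn = pd.read_csv("synthetic.csv")      # synthetic data to be assessed
control = pd.read_csv("control.csv")    # holdout records never used for synthesis

evaluator = SinglingOutEvaluator(
    ori=ori,
    syn=syn,
    control=control,
    n_attacks=500,                      # more attacks -> tighter estimate, slower run
)
evaluator.evaluate(mode="univariate")   # "multivariate" probes column combinations
print(evaluator.risk())                 # risk point estimate with confidence interval
```

The same pattern applies to the LinkabilityEvaluator and InferenceEvaluator, which additionally require you to declare which columns an attacker is assumed to know.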

2. Release or Modeling: Fidelity or Utility

Select fidelity or utility as the primary evaluation aspect based on the intended use of the synthetic data. This section explains why data release scenarios should pursue high fidelity while specific task modeling should pursue high utility, and how to run and interpret each evaluation.
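Fidelity is about statistical similarity between the original and synthetic data. The sketch below is a library-agnostic sanity check, assuming hypothetical file names: it compares each numeric column with a Kolmogorov-Smirnov statistic. A full fidelity report would also cover categorical columns and pairwise relationships, as discussed in the section.

```python
# Per-column fidelity sanity check: Kolmogorov-Smirnov distance between
# original and synthetic numeric columns. File names are illustrative.
import pandas as pd
from scipy.stats import ks_2samp

ori = pd.read_csv("original.csv")
syn = pd.read_csv("synthetic.csv")

for col in ori.select_dtypes(include="number").columns:
    stat, _ = ks_2samp(ori[col].dropna(), syn[col].dropna())
    print(f"{col}: KS statistic = {stat:.3f} (0 means identical distributions)")
```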

3. Synthetic Data Modeling Use: Experiment Design Selection

When synthetic data is used for a specific machine learning task, the experiment design determines how models are trained and evaluated. This section explains the differences between the domain transfer and dual-model control group designs, the criteria for choosing between them, and their typical application scenarios.
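The sketch below contrasts the two designs with scikit-learn. The dataset, model, target column, and the exact mapping of "domain transfer" (train on synthetic, test on real) and "dual-model control group" (train one model on real and one on synthetic, compare on the same real test set) to these training schemes are illustrative assumptions; the section defines the designs precisely.

```python
# Sketch of the two experiment designs; data files, model, and target column
# are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

ori = pd.read_csv("original.csv")
syn = pd.read_csv("synthetic.csv")
target = "label"                          # hypothetical binary target column

# Hold out real data for testing; both designs evaluate on the same real test set.
ori_train, ori_test = train_test_split(ori, test_size=0.3, random_state=42)
X_test, y_test = ori_test.drop(columns=[target]), ori_test[target]

def fit_and_score(train_df):
    model = RandomForestClassifier(random_state=42)
    model.fit(train_df.drop(columns=[target]), train_df[target])
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Design 1 (domain transfer): train on synthetic data only, evaluate on real data.
auc_syn = fit_and_score(syn)

# Design 2 (dual-model control group): also train a real-data control model
# and compare the two scores on the same real test set.
auc_real = fit_and_score(ori_train)
print(f"synthetic-trained AUC = {auc_syn:.3f}, real-trained AUC = {auc_real:.3f}")
```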