Statistics

When enable_stats: true is set, the system automatically calculates and records per-field statistics, which support data quality analysis, synthetic data validation, and feature understanding. On large datasets (over 1 million rows) these calculations can be time-consuming, so enable them with caution.

Enable Statistics

Global Setting

id: my_schema
enable_stats: true  # Enable globally
attributes:
  age:
    type: int

Per-field Setting

attributes:
  age:
    type: int
    enable_stats: true   # Enable
  notes:
    type: str
    enable_stats: false  # Disable
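
The text above does not say how the global and per-field settings interact. The sketch below shows one plausible resolution rule, under the assumption that a per-field enable_stats overrides the schema-level value and that the default is false. The helper name stats_enabled is hypothetical and is not part of the system's API.

```python
def stats_enabled(schema: dict, field: str) -> bool:
    """Return whether statistics would be collected for `field`.

    Assumption (not confirmed by the documentation): a per-field
    `enable_stats` overrides the schema-level value, and the default
    is False when neither level sets it.
    """
    global_flag = schema.get("enable_stats", False)
    field_cfg = schema.get("attributes", {}).get(field, {})
    return bool(field_cfg.get("enable_stats", global_flag))


# The two YAML examples above, merged into one schema as a plain dict.
schema = {
    "id": "my_schema",
    "enable_stats": True,
    "attributes": {
        "age": {"type": "int"},
        "notes": {"type": "str", "enable_stats": False},
    },
}

print(stats_enabled(schema, "age"))    # inherits the global setting
print(stats_enabled(schema, "notes"))  # explicitly disabled per field
```

Under this assumed rule, a wide text column such as notes can be excluded from an otherwise globally enabled schema, which is one way to limit the cost on large datasets.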

Statistical Items

Common Statistics (All Fields)

| Item | Description |
|------|-------------|
| row_count | Total row count |
| na_count | Null value count |
| na_percentage | Null value percentage |
| detected_type | Detected data type |
| actual_dtype | pandas dtype |
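
As a rough illustration of the counting items, here is a minimal standard-library sketch. It assumes that both None and NaN count as null and that na_percentage is stored as a fraction (na_count / row_count), which matches the 50 / 1000 = 0.05 values in the structure example on this page. The function name common_stats is hypothetical; detected_type and actual_dtype are left out because the detection rules are not described here.

```python
import math


def common_stats(values: list) -> dict:
    """Compute row_count, na_count and na_percentage for one column."""

    def is_na(v) -> bool:
        # Assumption: None and float NaN are both treated as null.
        return v is None or (isinstance(v, float) and math.isnan(v))

    row_count = len(values)
    na_count = sum(1 for v in values if is_na(v))
    # Stored as a fraction of rows; an empty column yields 0.0.
    na_percentage = na_count / row_count if row_count else 0.0
    return {
        "row_count": row_count,
        "na_count": na_count,
        "na_percentage": na_percentage,
    }


column = [34, None, 27, 51, float("nan"), 43, 18, 85, 30, 29]
stats = common_stats(column)
print(stats)
```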

Numeric Statistics

Calculated only when type is int or float and category: false:

| Item | Description |
|------|-------------|
| mean | Mean value |
| std | Standard deviation |
| min | Minimum value |
| max | Maximum value |
| median | Median |
| q1 | First quartile |
| q3 | Third quartile |
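
The following sketch shows how these items could be computed with Python's statistics module. Two choices are assumptions rather than documented behaviour: std is the sample standard deviation (n - 1 denominator), and quartiles use linear interpolation over the observed range (the "inclusive" method). Both mirror pandas defaults, but the system may differ. The helper names wants_numeric_stats and numeric_stats are hypothetical.

```python
import statistics


def wants_numeric_stats(field_cfg: dict) -> bool:
    """Apply the gating rule: type is int or float and category is false.

    A missing `category` key is treated as false here.
    """
    return (
        field_cfg.get("type") in ("int", "float")
        and not field_cfg.get("category", False)
    )


def numeric_stats(values: list) -> dict:
    """Compute the numeric items over the non-null values.

    Needs at least two non-null values (stdev and quantiles require it).
    """
    data = [v for v in values if v is not None]
    # n=4 returns the three cut points q1, median, q3.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),  # sample standard deviation
        "min": min(data),
        "max": max(data),
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
    }


ages = [18, 27, 34, 43, 85, None]
if wants_numeric_stats({"type": "int"}):
    result = numeric_stats(ages)
    print(result)
```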

Categorical Statistics

Calculated only when category: true:

| Item | Description |
|------|-------------|
| unique_count | Unique value count |
| mode | Mode (most frequent value) |
| mode_frequency | Mode frequency |
| category_distribution | Category distribution (max 20) |
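
A minimal sketch of the categorical items using collections.Counter follows. It assumes that mode_frequency is an absolute count, that category_distribution maps each value to its count, and that the 20-entry cap keeps the most frequent values. None of these details is stated above, and the function name categorical_stats is hypothetical.

```python
from collections import Counter


def categorical_stats(values: list, max_categories: int = 20) -> dict:
    """Compute the categorical items over the non-null values."""
    data = [v for v in values if v is not None]
    counts = Counter(data)
    if not counts:
        # Nothing to summarise when every value is null.
        return {
            "unique_count": 0,
            "mode": None,
            "mode_frequency": 0,
            "category_distribution": {},
        }
    # most_common orders by count; ties keep first-seen order.
    mode, mode_frequency = counts.most_common(1)[0]
    return {
        "unique_count": len(counts),
        "mode": mode,
        "mode_frequency": mode_frequency,
        "category_distribution": dict(counts.most_common(max_categories)),
    }


colors = ["red", "blue", "red", None, "green", "red", "blue"]
cat_stats = categorical_stats(colors)
print(cat_stats)
```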

Statistics Structure

attributes:
  age:
    type: int
    enable_stats: true
    stats:
      row_count: 1000
      na_count: 50
      na_percentage: 0.05
      mean: 35.5
      std: 12.3
      min: 18
      max: 85
      median: 34.0
      q1: 27.0
      q3: 43.0
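
Once a stats block like the one above has been loaded into a dictionary, a few internal consistency checks are cheap to run. The sketch below is illustrative only: check_stats is a hypothetical helper, and it assumes na_percentage is the fraction na_count / row_count, as the example values suggest.

```python
def check_stats(stats: dict) -> list:
    """Return a list of human-readable problems found in a stats block."""
    problems = []

    # na_percentage should equal na_count / row_count (assumed fraction).
    if stats["row_count"]:
        expected = stats["na_count"] / stats["row_count"]
        if abs(expected - stats["na_percentage"]) > 1e-9:
            problems.append("na_percentage does not match na_count / row_count")

    # The five-number summary must be non-decreasing.
    ordered = [stats[k] for k in ("min", "q1", "median", "q3", "max")]
    if ordered != sorted(ordered):
        problems.append("min, q1, median, q3, max are not in order")

    return problems


# Values copied from the example stats block above.
age_stats = {
    "row_count": 1000,
    "na_count": 50,
    "na_percentage": 0.05,
    "mean": 35.5,
    "std": 12.3,
    "min": 18,
    "max": 85,
    "median": 34.0,
    "q1": 27.0,
    "q3": 43.0,
}

issues = check_stats(age_stats)
print(issues)
```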