benchmark://
Loader supports using the benchmark:// protocol to automatically download and load benchmark datasets.
Usage Examples
Click the below button to run this example in Colab:
Note: If using Colab, please see the runtime setup guide.
Loading Benchmark Dataset
Loader:
load_benchmark:
filepath: benchmark://adult-incomeLoading Benchmark Dataset with Benchmark Schema
Loader:
load_benchmark_with_schema:
filepath: benchmark://adult-income
schema: benchmark://adult-income_schemaEither local or benchmark-provided filepath and schema can be used interchangeably.
Available Benchmark Datasets
Demographic Datasets
| Dataset Name | Protocol Path | Description |
|---|---|---|
| Adult Income | benchmark://adult-income | UCI Adult Income census dataset (48,842 rows, 15 columns) |
| Adult Income Schema | benchmark://adult-income_schema | Schema definition for Adult Income dataset |
| Adult Income (Original) | benchmark://adult-income_ori | Original training data (for demo) |
| Adult Income (Control) | benchmark://adult-income_control | Control group data (for demo) |
| Adult Income (Synthetic) | benchmark://adult-income_syn | SDV Gaussian Copula synthetic data (for demo) |
| Taiwan Salary Statistics | benchmark://taiwan-salary-statistics-300k | Taiwan salary statistics dataset (300K records) |
| Taiwan Salary Statistics (No DI) | benchmark://taiwan-salary-statistics-300k-no-di | Taiwan salary statistics dataset - No Direct Identification (300K records, with name and ID removed, birth date and address split) |
Taiwan Salaries Statistics
This is a simulated dataset created by the InnoServe team in 2024 for challenge questions, simulating the Ministry of Labor’s occupational salary survey statistics.
Description
- This dataset references the compilation methodology of the “Survey of Employed Workers’ Earnings” conducted monthly by the Directorate-General of Budget, Accounting and Statistics (DGBAS) (link), simulating a wide-table structure that links comprehensive income tax files with labor insurance files, labor pension monthly contribution wage files, and National Health Insurance files
- The simulation process uses publicly available 2023 aggregate statistics and references multiple government open data sources for numerical simulation. The data content does not involve any real individuals or legal entities. Any similarity to names or company names is purely coincidental
- This dataset simulates only Taiwanese workers but includes all 20 municipalities and counties nationwide, including Kinmen and Lienchiang
Best Practices Sample Datasets
| Dataset Name | Protocol Path | Description |
|---|---|---|
| Multi-table Companies | benchmark://best-practices_multi-table_companies | Multi-table example - Company data |
| Multi-table Applications | benchmark://best-practices_multi-table_applications | Multi-table example - Application data |
| Multi-table Tracking | benchmark://best-practices_multi-table_tracking | Multi-table example - Tracking data |
| Multi-timestamp | benchmark://best-practices_multi-table | Multi-timestamp example data |
| Categorical & High-cardinality | benchmark://best-practices_categorical_high-cardinality | Categorical and high-cardinality example data |
How It Works
- Protocol Detection: Loader detects
benchmark://protocol - Automatic Download: Downloads dataset from AWS S3 bucket
- Integrity Check: Verifies data integrity using SHA256
- Local Cache: Data is stored in
benchmark/directory - Data Loading: Loads data using local path
When to Use
Benchmark datasets are suitable for:
- Testing New Algorithms: Test on data with known characteristics
- Parameter Tuning: Compare effects of different parameter settings
- Performance Benchmarking: Compare with academic research results
- Teaching Demonstrations: Provide standardized example data
Notes
- First use requires network connection to download data
- Datasets are cached locally in
benchmark/directory - Large dataset downloads may take considerable time
- Protocol names are case-insensitive (lowercase recommended)
- All datasets are verified with SHA256 to ensure integrity