SplitterAdapter
SplitterAdapter splits data into training and validation sets and can limit how much the training sets of different samples overlap.
Class Architecture
classDiagram
class SplitterAdapter {
+config: dict
+splitter: Splitter
+is_custom_data: bool
+ori_loader_adapter: LoaderAdapter
+ctrl_loader_adapter: LoaderAdapter
+__init__(config)
+run() tuple~dict, dict, list~
-_create_loader_config(config, key) dict
}
class Splitter {
+config: dict
+num_samples: int
+train_split_ratio: float
+split(data, metadata) tuple
-_bootstrap_with_overlap_control()
}
class LoaderAdapter {
+load() tuple~DataFrame, Schema~
}
SplitterAdapter ..> Splitter : uses for data splitting
SplitterAdapter ..> LoaderAdapter : uses for custom_data method
%% Style definitions
class SplitterAdapter {
<<Main Class>>
}
style SplitterAdapter fill:#E6E6FA
class Splitter {
<<Core Module>>
}
style Splitter fill:#4169E1,color:#fff
class LoaderAdapter {
<<Optional: Custom Data>>
}
style LoaderAdapter fill:#FFE4E1
note for SplitterAdapter "1. Normal mode: Uses Splitter for bootstrap sampling\n2. Custom data mode: Uses LoaderAdapter for ori/control data\n3. Provides overlap control for multiple samples"
Legend:
- Light purple box: SplitterAdapter main class
- Blue box: Core splitting module
- Light pink box: LoaderAdapter used for custom data mode
- ..>: Dependency relationship
Main Features
- Unified interface for data splitting
- Bootstrap sampling with overlap control
- Support for multiple sample generation
- Returns split data, metadata, and training indices
- Integration with pipeline system
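The overlap control listed above can be pictured as rejection sampling: draw a candidate training index set, compare it against the sets already drawn, and retry if the shared fraction is too high. The following is a minimal sketch of that idea only, not PETsARD's implementation. The helper names `overlap_ratio` and `sample_train_indices` are hypothetical, the sketch draws indices without replacement, and the value 0.9 for `max_overlap_ratio` is illustrative rather than a documented default.

```python
import random


def overlap_ratio(candidate: set[int], existing: set[int]) -> float:
    """Fraction of the candidate's indices that also appear in `existing`."""
    if not candidate:
        return 0.0
    return len(candidate & existing) / len(candidate)


def sample_train_indices(
    n_rows: int,
    train_split_ratio: float,
    exist_train_indices: list[set[int]],
    max_overlap_ratio: float,
    max_attempts: int,
    rng: random.Random,
) -> set[int]:
    """Draw one training index set, retrying while it overlaps too much."""
    n_train = int(n_rows * train_split_ratio)
    for _ in range(max_attempts):
        # Hypothetical choice: sample row positions without replacement.
        candidate = set(rng.sample(range(n_rows), n_train))
        if all(
            overlap_ratio(candidate, seen) <= max_overlap_ratio
            for seen in exist_train_indices
        ):
            return candidate
    raise RuntimeError(f"no acceptable sample after {max_attempts} attempts")


rng = random.Random(42)
train_sets: list[set[int]] = []
for _ in range(3):  # plays the role of num_samples
    train_sets.append(
        sample_train_indices(
            n_rows=100,
            train_split_ratio=0.8,
            exist_train_indices=train_sets,
            max_overlap_ratio=0.9,  # illustrative value, not a documented default
            max_attempts=30,
            rng=rng,
        )
    )
```

With 100 rows and an 80% training share, any two training sets necessarily share at least 60 of their 80 indices, so a very low `max_overlap_ratio` could never be satisfied in this sketch and would exhaust `max_attempts`.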
Method Reference
__init__(config: dict)
Initializes a SplitterAdapter instance with the splitting configuration.
Parameters:
- config: dict, required - Configuration parameter dictionary
  - Keys: num_samples, train_split_ratio, random_state, max_overlap_ratio, max_attempts
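As an illustration, a configuration dictionary using all five documented keys might look like the sketch below. The values for `max_overlap_ratio` and `max_attempts` are made up for the example, and the helper `unknown_keys` is hypothetical: it only flags keys outside the list above and is not part of PETsARD. Custom data mode may accept further keys that this page does not list.

```python
# The five keys named in the parameter list above.
DOCUMENTED_KEYS = {
    "num_samples",
    "train_split_ratio",
    "random_state",
    "max_overlap_ratio",
    "max_attempts",
}


def unknown_keys(config: dict) -> set[str]:
    """Return the keys in `config` that the parameter list does not mention."""
    return set(config) - DOCUMENTED_KEYS


config = {
    "num_samples": 3,
    "train_split_ratio": 0.8,
    "random_state": 42,
    "max_overlap_ratio": 0.9,  # illustrative value, not a documented default
    "max_attempts": 30,  # illustrative value, not a documented default
}

typo_config = {"num_samples": 3, "train_ratio": 0.8}  # misspelled key on purpose
```

Checking a dictionary this way before passing it to `SplitterAdapter` can catch a misspelled key early: `unknown_keys(typo_config)` reports `train_ratio`.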
run(input: dict)
Executes data splitting operation.
Parameters:
- input: dict, required - Must contain:
  - data: pd.DataFrame - Dataset to split
  - metadata: Schema - Data metadata
  - exist_train_indices: list[set] (optional) - Existing training indices to avoid overlap
Returns:
No direct return value. Use get_result() to get split results.
get_result()
Gets the splitting results.
Returns:
tuple[dict, dict, list[set]]: Split data, metadata, and training indices
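To make the documented return shape concrete, the sketch below builds a mock result by hand and walks through it. Everything about the inner layout is an assumption made for illustration: that both dictionaries are keyed by sample number starting at 1, that each entry holds "train" and "validation" parts, and that position 0 of the index list belongs to sample 1. Plain lists and strings stand in for the DataFrame and Schema objects.

```python
# Mock data: ten row positions and two hand-written training index sets.
rows = set(range(10))
train_indices = [{0, 1, 2, 3, 4, 5, 6, 7}, {2, 3, 4, 5, 6, 7, 8, 9}]

# Assumed layout: one entry per sample, numbered from 1.
split_data = {
    n: {"train": sorted(idx), "validation": sorted(rows - idx)}
    for n, idx in enumerate(train_indices, start=1)
}
metadata_dict = {
    n: {"train": "schema placeholder", "validation": "schema placeholder"}
    for n in split_data
}

# Summarize each sample as (sample number, train size, validation size).
summary = [
    (n, len(parts["train"]), len(parts["validation"]))
    for n, parts in split_data.items()
]
```

Under these assumptions, code that consumes the result would iterate from sample 1 rather than 0, in line with the numbering note at the end of this page.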
set_input(data, metadata, exist_train_indices=None)
Sets input data for the splitter.
Parameters:
- data: pd.DataFrame - Dataset to split
- metadata: Schema - Data metadata
- exist_train_indices: list[set] (optional) - Existing training indices
Usage Example
from petsard.adapter import SplitterAdapter

# df (pd.DataFrame) and schema (Schema) are assumed to be loaded already

# Configure splitter
adapter = SplitterAdapter({
    "num_samples": 3,
    "train_split_ratio": 0.8,
    "random_state": 42
})

# Set input
adapter.set_input(data=df, metadata=schema)

# Execute splitting
adapter.run({
    "data": df,
    "metadata": schema
})

# Get results
split_data, metadata_dict, train_indices = adapter.get_result()
Notes
- This is an internal API, not recommended for direct use
- Prefer using YAML configuration files and Executor
- Sample numbering starts from 1, not 0
- Results are cached until the next run() call