Splitter API
Data splitting module for creating training and validation sets with overlap control.
Class Architecture
classDiagram
class Splitter {
-config: dict
+__init__(num_samples, train_split_ratio, random_state, max_overlap_ratio, max_attempts)
+split(data, metadata, exist_train_indices) tuple[dict, dict, list[set]]
+get_train_indices() list[set]
-_bootstrapping(index, exist_train_indices) dict
-_check_overlap_acceptable(new_train_sample, existing_train_sets) bool
-_update_metadata_with_split_info(metadata, train_rows, validation_rows) Schema
}
class Schema {
+id: str
+attributes: dict
+description: str
+stats: TableStats
}
class DataFrame {
<<pandas>>
+index: Index
+columns: Index
+iloc: IndexingMixin
+reset_index()
}
Splitter ..> Schema : uses
Splitter ..> DataFrame : usesLegend:
- Blue box: Main class
- Orange box: Subclass implementations
- Light purple box: Configuration and data classes
<|--: Inheritance relationship*--: Composition relationship..>: Dependency relationship
Basic Usage
from petsard import Splitter
# Basic splitting
splitter = Splitter(num_samples=3, train_split_ratio=0.8)
split_data, metadata_dict, train_indices = splitter.split(data=df)
# Strict overlap control
splitter = Splitter(
num_samples=5,
train_split_ratio=0.7,
max_overlap_ratio=0.1 # Maximum 10% overlap
)Constructor (init)
Initialize a data splitter instance.
Syntax
def __init__(
num_samples: int = 1,
train_split_ratio: float = 0.8,
random_state: int | float | str = None,
max_overlap_ratio: float = 1.0,
max_attempts: int = 30
)Parameters
num_samples : int, optional
- Number of times to resample the data
- Default:
1 - Must be positive integer
train_split_ratio : float, optional
- Ratio of data for training set
- Default:
0.8 - Range:
0.0to1.0
random_state : int | float | str, optional
- Seed for reproducibility
- Default:
None - Can be integer, float, or string
max_overlap_ratio : float, optional
- Maximum allowed overlap ratio between samples
- Default:
1.0(100% - allows complete overlap) - Range:
0.0to1.0 - Set to
0.0for no overlap between samples
max_attempts : int, optional
- Maximum attempts for sampling with overlap control
- Default:
30 - Used when overlap control is active
Returns
- Splitter
- Initialized splitter instance
Examples
from petsard import Splitter
# Basic splitter with default settings
splitter = Splitter()
# Multiple samples with reproducibility
splitter = Splitter(
num_samples=5,
train_split_ratio=0.8,
random_state=42
)
# Strict overlap control
splitter = Splitter(
num_samples=3,
max_overlap_ratio=0.1, # Max 10% overlap
max_attempts=50
)
# No overlap between samples
splitter = Splitter(
num_samples=5,
max_overlap_ratio=0.0, # Completely non-overlapping
random_state="experiment_v1"
)Notes
- The functional API returns tuples directly from the
split()method - Uses functional programming patterns with immutable data structures
- For detailed split method usage, see the split() documentation
- Recommend using YAML configuration for complex experiments
- Bootstrap sampling is used internally for generating multiple samples