Generates ML input matrices:
a) per-drug and per-drug-class matrices for all data and feature types
b) year-bin holdouts for each drug/class
c) country holdouts for each drug/class
d) leave-one-out holdouts over the years and countries for each drug/class
e) MDR matrices based on drug classes
parquet_duckdb_path: character. Path to the DuckDB database that contains the view of the metadata and feature Parquet files.
out_path: character. Path to the directory where the result files (matrices) will be written.
n_fold: numeric. Number of cross-validation folds; default is 5.
split: numeric. Training/validation split specification. Two formats are accepted:
  Shorthand for CV: split = 0 (converted internally to c(1, 0))
  Vector form: c(train_prop, val_prop), where test_prop = 1 - train_prop - val_prop
    For CV: c(1, 0) means 80% of the data is used for training with k-fold CV and 20% is held out as a stratified test set
    For classical splits: all three partitions must be > 0
    Example: c(0.7, 0.15) = 70% train, 15% val, 15% test
min_n: numeric. Minimum number of samples required for each combination of drug classes in the MDR matrix; default is 25.
verbosity: character. Either "minimal" or "debug"; "debug" prints full diagnostics for each matrix.
Returns invisible(TRUE) on success. Side effect: writes the matrix files to disk.
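The two accepted `split` formats can be read as a small normalisation step. The sketch below is only an illustration of the documented rules, not the package's actual code: `resolve_split()` and the fields of its return value are hypothetical names introduced here.

```r
# Hypothetical helper (not part of the package): normalises a `split`
# argument according to the rules documented for generateMLInputs().
resolve_split <- function(split) {
  # Shorthand: split = 0 is treated as c(1, 0), i.e. CV mode
  if (length(split) == 1 && split == 0) split <- c(1, 0)
  stopifnot(is.numeric(split), length(split) == 2)

  if (identical(as.numeric(split), c(1, 0))) {
    # CV mode as documented: 80% training (k-fold CV), 20% stratified test
    return(list(mode = "cv", train = 0.8, val = 0, test = 0.2))
  }

  # Classical mode: the test proportion is whatever remains
  test_prop <- 1 - sum(split)
  if (any(split <= 0) || test_prop <= 0) {
    stop("train, val and test proportions must all be > 0")
  }
  list(mode = "classical", train = split[1], val = split[2], test = test_prop)
}

cv_spec        <- resolve_split(0)
classical_spec <- resolve_split(c(0.7, 0.15))
```

Under these assumptions, `resolve_split(0)` and `resolve_split(c(1, 0))` give the same CV specification, while a vector such as `c(1, 0.5)` is rejected because it leaves no room for a test partition.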
if (FALSE) { # \dontrun{
# Generate ML input matrices with 5-fold cross-validation (using shorthand)
generateMLInputs(
  parquet_duckdb_path = "results/Cje_parquet.duckdb",
  out_path = "results/",
  n_fold = 5,
  split = 0, # shorthand for CV mode
  min_n = 25,
  verbosity = "minimal"
)
# Same as above but using vector notation
generateMLInputs(
  parquet_duckdb_path = "results/Cje_parquet.duckdb",
  out_path = "results/",
  n_fold = 5,
  split = c(1, 0), # explicit CV mode
  verbosity = "minimal"
)
# Generate with classical train/val/test split (70/15/15)
generateMLInputs(
  parquet_duckdb_path = "results/Cje_parquet.duckdb",
  out_path = "results/",
  n_fold = NULL,
  split = c(0.7, 0.15), # 70% train, 15% val, 15% test
  min_n = 25,
  verbosity = "debug"
)
} # }
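For readers unfamiliar with stratified k-fold assignment, the following base R sketch shows one common way to spread each class evenly over `n_fold` folds. It is an assumption about the general technique only; the package's internal fold assignment may differ, and `assign_folds()` is a hypothetical name.

```r
# Hypothetical sketch: assign each sample to one of k folds, stratified by
# label, so every fold gets a near-equal share of each class.
# Note: set.seed() here resets the global RNG state for reproducibility.
assign_folds <- function(labels, k = 5, seed = 1) {
  set.seed(seed)
  folds <- integer(length(labels))
  # split() groups the sample indices by class label
  for (idx in split(seq_along(labels), labels)) {
    # Recycle fold ids 1..k to the class size, then shuffle them
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

# Toy phenotype labels: 20 resistant ("R") and 30 susceptible ("S") samples
labels <- rep(c("R", "S"), times = c(20, 30))
folds  <- assign_folds(labels, k = 5)
fold_table <- table(folds, labels)
```

With these toy labels each of the five folds receives 4 resistant and 6 susceptible samples, so the class balance of the full set is preserved in every fold.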