a) matrices per drug and per drug class, for all data and feature types
b) year-bin holdouts per drug/class
c) country holdouts per drug/class
d) leave-one-out holdouts across years and countries per drug/class
e) MDR matrices based on drug classes

generateMLInputs(
  parquet_duckdb_path = "results/Cje_parquet.duckdb",
  out_path = "results/",
  n_fold = 5,
  split = c(1, 0),
  min_n = 25,
  verbosity = c("minimal", "debug")
)

Arguments

parquet_duckdb_path

character path to the DuckDB database that contains the view of the metadata and feature Parquet files

out_path

character path to the directory where the results files (matrices) will be written

n_fold

numeric number of cross-validation folds; default is 5
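
As an illustration of what n_fold controls, the following minimal sketch assigns samples to folds. It is not the package's implementation: the label vector is toy data, and stratifying folds by a binary phenotype label is an assumption made for the example.

```r
set.seed(1)
n_fold <- 5

# Toy phenotype labels (hypothetical): 20 resistant, 30 susceptible
labels <- rep(c("R", "S"), times = c(20, 30))

# Assign fold ids within each class so every fold keeps the class ratio
fold <- integer(length(labels))
for (cls in unique(labels)) {
  idx <- which(labels == cls)
  shuffled <- idx[sample.int(length(idx))]  # shuffle within the class
  fold[shuffled] <- rep_len(seq_len(n_fold), length(idx))  # deal round-robin
}

# Cross-tabulate fold membership against class
fold_table <- table(fold, labels)
fold_table
```

Dealing fold ids round-robin within each class keeps the class proportions roughly equal across folds, which matters when one phenotype is rare.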

split

numeric training/validation split specification. Two formats are accepted:

  • Shorthand for CV: split = 0 (converted internally to c(1, 0))

  • Vector form: c(train_prop, val_prop), where test_prop = 1 - train_prop - val_prop

    • For CV: c(1, 0) selects cross-validation mode, in which 80% of the data is used for training with k-fold CV and 20% is held out as a stratified test set

    • For classical splits: all three partitions must be > 0. Example: c(0.7, 0.15) = 70% train, 15% val, 15% test
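
The two accepted formats can be illustrated with a small helper. This is a minimal sketch, not the package's internal code: resolve_split() is a hypothetical name, and the 80/20 figures for CV mode are taken from the description above.

```r
# Hypothetical helper (not part of the package): normalise a `split` value
# into a mode plus train/val/test proportions.
resolve_split <- function(split) {
  # Shorthand: a single 0 stands for CV mode, i.e. c(1, 0)
  if (length(split) == 1 && split == 0) split <- c(1, 0)
  stopifnot(is.numeric(split), length(split) == 2)

  # CV mode: 80% training (folded k ways), 20% stratified test
  if (isTRUE(all.equal(split, c(1, 0)))) {
    return(list(mode = "cv", train = 0.8, val = 0, test = 0.2))
  }

  # Classical mode: test is whatever remains; all three parts must be > 0
  test <- 1 - sum(split)
  if (any(split <= 0) || test <= 1e-8) {
    stop("classical splits need train, val and test proportions > 0")
  }
  list(mode = "classical", train = split[1], val = split[2], test = test)
}

resolve_split(0)$mode             # CV shorthand
resolve_split(c(0.7, 0.15))$test  # remaining proportion for the test set
```

A vector such as c(0.9, 0.1) is rejected by this sketch because it leaves nothing for the test partition.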

min_n

numeric minimum number of samples required for each combination of drug classes in the MDR matrix; default is 25
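
The effect of such a threshold can be sketched on toy data. The data frame, its column names and the inclusive comparison (>= min_n) are assumptions made for illustration; the function's actual filtering logic is not shown in this page.

```r
# Toy data (hypothetical): one row per isolate, labelled with the
# combination of drug classes it is resistant to.
isolates <- data.frame(
  id = 1:60,
  class_combo = rep(c("A+B", "A+C", "B+C"), times = c(30, 25, 5)),
  stringsAsFactors = FALSE
)
min_n <- 25

# Count samples per combination and keep those with at least min_n samples
combo_counts <- table(isolates$class_combo)
keep <- names(combo_counts)[combo_counts >= min_n]
filtered <- isolates[isolates$class_combo %in% keep, ]

# Combinations that survive the threshold
sort(unique(filtered$class_combo))
```

Here the "B+C" combination has only 5 samples, so it is dropped before any matrix would be built from it.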

verbosity

character "minimal" or "debug"; when "debug", prints full diagnostics per matrix

Value

invisible(TRUE) on success. As a side effect, writes the matrices/files to out_path.

Examples

if (FALSE) { # \dontrun{
# Generate ML input matrices with 5-fold cross-validation (using shorthand)
generateMLInputs(
  parquet_duckdb_path = "results/Cje_parquet.duckdb",
  out_path = "results/",
  n_fold = 5,
  split = 0, # shorthand for CV mode
  min_n = 25,
  verbosity = "minimal"
)

# Same as above but using vector notation
generateMLInputs(
  parquet_duckdb_path = "results/Cje_parquet.duckdb",
  out_path = "results/",
  n_fold = 5,
  split = c(1, 0), # explicit CV mode
  verbosity = "minimal"
)

# Generate with classical train/val/test split (70/15/15)
generateMLInputs(
  parquet_duckdb_path = "results/Cje_parquet.duckdb",
  out_path = "results/",
  n_fold = NULL,
  split = c(0.7, 0.15), # 70% train, 15% val, 15% test
  min_n = 25,
  verbosity = "debug"
)
} # }