This function provides a complete end-to-end AMR machine learning workflow. Given a DuckDB file produced by runDataProcessing(), it:

  1. Generates all ML feature matrices (drug, class, year, country, MDR, LOO)

  2. Creates all ML directory structures

  3. Prepares ML input lists for every mode

  4. Runs logistic regression ML models (standard + stratified + cross-test + MDR)

  5. Saves performance metrics, fitted models, predictions, and top feature rankings

runModelingPipeline(
  parquet_duckdb_path,
  threads = 16,
  n_fold = 5,
  split = c(1, 0),
  min_n = 25,
  prop_vi_top_feats = c(0, 1),
  pca_threshold = 0.99,
  verbose = TRUE,
  use_saved_split = TRUE
)
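
For example, a minimal run on a hypothetical DuckDB file (all other arguments left at their defaults; the path and thread count below are illustrative, not part of the package):

out_dir <- runModelingPipeline(
  parquet_duckdb_path = "results/ecoli_parquet.duckdb",  # hypothetical path
  threads = 8
)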

Arguments

parquet_duckdb_path

Path to the <bug>_parquet.duckdb file produced by runDataProcessing() in data_processing.R, where <bug> is the organism name

threads

Number of parallel workers (default: 16)

n_fold

Number of cross-validation folds (default: 5). Use 0 or NULL for a classical train/validation split.

split

Training/validation split proportions (default: c(1, 0), i.e. all samples go to training when cross-validation is used; see the sketch after this argument list)

min_n

Minimum samples per drug class for MDR matrices (default: 25)

prop_vi_top_feats

Proportion of variable importance used when ranking top features (default: c(0, 1))

pca_threshold

PCA variance threshold (default: 0.99); only used when use_pca = TRUE

verbose

Print progress updates (default: TRUE)

use_saved_split

Whether to inherit the split, seed, and n_fold settings from an existing ml_parameters.json (default: TRUE)
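
To illustrate how n_fold and split interact, the sketch below contrasts the default cross-validation mode with a classical hold-out split. The 80/20 proportions and the db path are illustrative assumptions, not package defaults:

db <- "results/ecoli_parquet.duckdb"  # hypothetical path

# 5-fold cross-validation on all samples (the package defaults)
runModelingPipeline(parquet_duckdb_path = db, n_fold = 5, split = c(1, 0))

# Classical 80/20 train/validation split, no cross-validation (assumed semantics)
runModelingPipeline(parquet_duckdb_path = db, n_fold = NULL, split = c(0.8, 0.2))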

Value

Invisibly returns the output directory used for ML results.
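
Because the output directory is returned invisibly, it can be captured and inspected after the run. The snippet below is a sketch: the path is hypothetical, and it assumes ml_parameters.json is written to that directory (consistent with use_saved_split), using jsonlite only for illustration:

out_dir <- runModelingPipeline(parquet_duckdb_path = "results/ecoli_parquet.duckdb")  # hypothetical path

# Browse the saved metrics, models, predictions, and feature rankings
list.files(out_dir, recursive = TRUE)

# Inspect the stored split/seed/n_fold settings (assumed file location)
params <- jsonlite::fromJSON(file.path(out_dir, "ml_parameters.json"))
str(params)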