This function provides a complete end-to-end AMR machine learning workflow. Given a DuckDB file produced by runDataProcessing(), it:

  1. Generates all ML feature matrices (drug, class, year, country, MDR, LOO)

  2. Creates all ML directory structures

  3. Prepares ML input lists for every mode

  4. Runs logistic regression ML models (standard + stratified + cross-test + MDR)

  5. Saves performance metrics, fitted models, predictions, and top feature rankings

runModelingPipeline(
  parquet_duckdb_path,
  threads = 16,
  n_fold = 5,
  split = c(1, 0),
  min_n = 25,
  prop_vi_top_feats = c(0, 1),
  pca_threshold = 0.99,
  verbose = TRUE,
  use_saved_split = TRUE
)
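
For example, a minimal run on a hypothetical DuckDB file (all other arguments left at their defaults; the path and thread count below are illustrative, not part of the package):

out_dir <- runModelingPipeline(
  parquet_duckdb_path = "results/ecoli_parquet.duckdb",  # hypothetical path
  threads = 8
)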

Arguments

parquet_duckdb_path

Path to the <bug>_parquet.duckdb file produced by runDataProcessing() in data_processing.R, where <bug> is the organism name

threads

Number of parallel workers (default: 16)

n_fold

Number of cross-validation folds (default: 5). Use 0 or NULL for a classical train/validation split.

split

Training/validation split proportions (default: c(1, 0), i.e. all samples go to training when cross-validation is used; see the sketch after this argument list)

min_n

Minimum samples per drug class for MDR matrices (default: 25)

prop_vi_top_feats

Proportion of variable importance used when ranking top features (default: c(0, 1))

pca_threshold

PCA variance threshold (default: 0.99); only used when use_pca = TRUE

verbose

Print progress updates (default: TRUE)

use_saved_split

Whether to inherit the split, seed, and n_fold settings from an existing ml_parameters.json (default: TRUE)
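
To illustrate how n_fold and split interact, the sketch below contrasts the default cross-validation mode with a classical hold-out split. The 80/20 proportions and the db path are illustrative assumptions, not package defaults:

db <- "results/ecoli_parquet.duckdb"  # hypothetical path

# 5-fold cross-validation on all samples (the package defaults)
runModelingPipeline(parquet_duckdb_path = db, n_fold = 5, split = c(1, 0))

# Classical 80/20 train/validation split, no cross-validation (assumed semantics)
runModelingPipeline(parquet_duckdb_path = db, n_fold = NULL, split = c(0.8, 0.2))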

Value

Invisibly returns the output directory used for ML results.
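
Because the output directory is returned invisibly, it can be captured and inspected after the run. The snippet below is a sketch: the path is hypothetical, and it assumes ml_parameters.json is written to that directory (consistent with use_saved_split), using jsonlite only for illustration:

out_dir <- runModelingPipeline(parquet_duckdb_path = "results/ecoli_parquet.duckdb")  # hypothetical path

# Browse the saved metrics, models, predictions, and feature rankings
list.files(out_dir, recursive = TRUE)

# Inspect the stored split/seed/n_fold settings (assumed file location)
params <- jsonlite::fromJSON(file.path(out_dir, "ml_parameters.json"))
str(params)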