R/run_ML.R
runModelingPipeline.Rd
This function provides a complete end-to-end AMR machine learning workflow.
Given a DuckDB file produced by runDataProcessing(), it:
Generates all ML feature matrices (drug, class, year, country, MDR, LOO)
Creates all ML directory structures
Prepares ML input lists for every mode
Runs logistic regression models (standard, stratified, cross-test, and MDR)
Saves performance metrics, fitted models, predictions, and top feature rankings
Path to a <bug>_parquet.duckdb produced by data_processing.R
Number of parallel workers. Default: 16
Cross-validation folds (default: 5). Use 0 or NULL for classical splits.
Training/validation split (default: c(1,0) for CV mode)
Minimum samples per drug class for MDR matrices (default: 25)
Proportion of variable importance for top features (default: c(0,1))
PCA variance threshold (not used unless use_pca = TRUE)
Print progress updates? Default: TRUE
Whether to inherit split/seed/n_fold from ml_parameters.json
Invisibly returns the output directory used for ML results.
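The arguments above distinguish a cross-validation mode (a positive number of folds) from a classical training/validation split (folds set to 0 or NULL). The base R sketch below illustrates that distinction on sample indices only. It is not the package's implementation: the helper `make_partitions()` and its parameter names `n_samples`, `split_ratio`, and `seed` are hypothetical, and the 80/20 ratio is an arbitrary illustration value.

```r
# Hypothetical helper (not part of the package): partitions sample indices
# either into k folds or into a single train/validation split.
make_partitions <- function(n_samples, n_fold = 5,
                            split_ratio = c(0.8, 0.2), seed = 1) {
  set.seed(seed)                 # fixed seed so the shuffle is reproducible
  idx <- sample.int(n_samples)   # random permutation of 1..n_samples

  if (is.null(n_fold) || n_fold == 0) {
    # Classical split: the first share of shuffled indices is used for
    # training, the remainder for validation.
    n_train <- floor(split_ratio[1] * n_samples)
    list(
      mode       = "split",
      train      = idx[seq_len(n_train)],
      validation = idx[seq_len(n_samples - n_train) + n_train]
    )
  } else {
    # Cross-validation: deal the shuffled indices round-robin into n_fold
    # groups; each group serves once as the held-out fold.
    fold_id <- rep_len(seq_len(n_fold), n_samples)
    list(mode = "cv", folds = base::split(idx, fold_id))
  }
}

cv_parts    <- make_partitions(100, n_fold = 5)
split_parts <- make_partitions(100, n_fold = 0, split_ratio = c(0.8, 0.2))
```

In this sketch every sample lands in exactly one fold in CV mode, and in exactly one of the two sets in split mode.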