Run machine learning models with multiple configurations

Runs the machine learning pipeline with optional stratification, leave-one-out (LOO), and cross-testing configurations. Models are logistic regressions trained in parallel, so the same call can cover several experimental designs. When use_saved_split = TRUE and ml_parameters.json is available, split, seed, and n_fold are resolved from that file via .resolveSplitParams().

runMLmodels(
  path,
  stratify_by = NULL,
  LOO = FALSE,
  cross_test = FALSE,
  threads = 16,
  split = c(0.8, 0),
  n_fold = 5,
  prop_vi_top_feats = c(0, 1),
  pca_threshold = 0.99,
  verbose = TRUE,
  return_tune_res = TRUE,
  return_fit = TRUE,
  return_pred = TRUE,
  use_saved_split = TRUE,
  shuffle_labels = FALSE,
  use_pca = FALSE
)

Arguments

path: Character scalar. Base directory containing matrix files.
stratify_by: Character scalar or NULL. Either "year" or "country", or NULL for no stratification. Default is NULL.
LOO: Logical. Perform Leave-One-Out analysis. Default is FALSE.
cross_test: Logical. Perform cross-testing between groups. Default is FALSE.
threads: Integer. Number of parallel workers for model training. Default is 16.
split: Numeric vector of length 2. Train/validation split proportions. Default is c(0.8, 0).
n_fold: Integer. Number of cross-validation folds. Default is 5.
prop_vi_top_feats: Numeric vector of length 2. Proportion range for variable-importance selection. Default is c(0, 1).
pca_threshold: Numeric. PCA variance threshold. Default is 0.99.
verbose: Logical. Print progress messages during model training. Default is TRUE.
return_tune_res: Logical. Return tuning results from cross-validation. Default is TRUE.
return_fit: Logical. Return fitted model objects. Default is TRUE.
return_pred: Logical. Return prediction results. Default is TRUE.
use_saved_split: Logical. Inherit split, seed, and n_fold from ml_parameters.json when the file is available. Default is TRUE.
shuffle_labels: Logical. Randomly shuffle labels for baseline runs. Default is FALSE.
use_pca: Logical. Use PCA on predictors. Default is FALSE.

Value

NULL (invisible). Called for side effects (model training and result saving).

Details

This function supports multiple analysis configurations:

Standard mode (stratify_by = NULL, LOO = FALSE, cross_test = FALSE):

Trains models using train/test split from the same dataset
Saves results to ML_* directories

Cross-test without stratification (stratify_by = NULL, cross_test = TRUE):

Trains on one drug/class, tests on another drug/class
Pairs different drugs within same feature type
Saves results to cross_test_ML_* directories

Cross-test with stratification (stratify_by != NULL, cross_test = TRUE):

Trains on one stratum (year/country), tests on another stratum
Same drug/class across different stratification groups
Saves results to cross_test_ML_year_* or cross_test_ML_country_* directories

LOO with cross-test (stratify_by != NULL, LOO = TRUE, cross_test = TRUE):

Trains on leave-out dataset (one stratum excluded)
Tests on the full dataset including the left-out stratum
Saves results to LOO_cross_test_ML_year_* or LOO_cross_test_ML_country_* directories
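
The stratified cross-test and LOO designs above can be pictured with a small base R sketch. It is illustrative only and not the package's implementation: the toy table, the column names id and year, and the choice to enumerate every ordered pair of strata are assumptions made for the example.

```r
# Toy sample table; the column names and values are hypothetical.
samples <- data.frame(
  id   = paste0("s", 1:6),
  year = c(2019, 2019, 2020, 2020, 2021, 2021)
)
strata <- sort(unique(samples$year))

# Cross-test with stratification: train on one stratum, test on another.
# Assumption: every ordered pair of distinct strata is evaluated.
pairs <- expand.grid(train = strata, test = strata)
pairs <- pairs[pairs$train != pairs$test, ]

# LOO with cross-test: train with one stratum excluded,
# then test on the full dataset (including the left-out stratum).
loo_sets <- lapply(strata, function(left_out) {
  list(
    left_out  = left_out,
    train_ids = samples$id[samples$year != left_out],
    test_ids  = samples$id
  )
})
```

With three strata this yields six train/test pairs for cross-testing and three leave-one-out sets, each trained on four samples and tested on all six.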

Model configuration:

Algorithm: Logistic Regression with elastic net regularization
Penalty values: 10^seq(-4, -1, length.out = 10)
Mixture (alpha): 0, 0.2, 0.4, 0.6, 0.8, 1.0 (ridge to lasso)
Selection metric: Matthews Correlation Coefficient (MCC)
Random seed: 5280 (for reproducibility), unless a seed is resolved from ml_parameters.json
PCA: Disabled by default (use_pca = FALSE); when enabled, pca_threshold sets the variance threshold
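
To make the size of the search space concrete, the following base R sketch rebuilds the penalty and mixture values listed above as a grid and defines the Matthews Correlation Coefficient from confusion-matrix counts. The names tune_grid and mcc are illustrative and are not taken from the package; returning 0 when the denominator is zero is one possible convention, assumed here.

```r
# Penalty and mixture values as listed in the model configuration.
penalty <- 10^seq(-4, -1, length.out = 10)
mixture <- c(0, 0.2, 0.4, 0.6, 0.8, 1.0)

# Every penalty/mixture combination: 10 x 6 = 60 candidates.
tune_grid <- expand.grid(penalty = penalty, mixture = mixture)

# MCC for logical vectors (TRUE = positive class).
mcc <- function(truth, estimate) {
  tp <- as.numeric(sum(truth & estimate))
  tn <- as.numeric(sum(!truth & !estimate))
  fp <- as.numeric(sum(!truth & estimate))
  fn <- as.numeric(sum(truth & !estimate))
  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) return(0)  # assumed convention for degenerate cases
  (tp * tn - fp * fn) / denom
}
```

MCC ranges from -1 (every prediction wrong) to 1 (every prediction right), and stays informative when the two classes are imbalanced.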

Output file naming: Files are saved with prefixes and suffixes indicating the configuration:

LOO: Prefixed with "LOO_"
Cross-test: Prefixed with "cross_test_"
Stratification: "_country" or "_year" appended directly after "ML"

For example: "LOO_cross_test_ML_year_performance.tsv"
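
The naming rules above can be sketched as a small helper. build_output_name() is a hypothetical function written for illustration; it is not part of the package, and the "ML_performance.tsv" form for the standard mode is inferred from the ML_* pattern rather than confirmed by the source.

```r
# Hypothetical helper that composes a name from the configuration flags.
build_output_name <- function(stem, stratify_by = NULL,
                              LOO = FALSE, cross_test = FALSE) {
  prefix <- paste0(
    if (LOO) "LOO_" else "",
    if (cross_test) "cross_test_" else ""
  )
  suffix <- if (is.null(stratify_by)) "" else paste0("_", stratify_by)
  paste0(prefix, "ML", suffix, "_", stem)
}

build_output_name("performance.tsv", stratify_by = "year",
                  LOO = TRUE, cross_test = TRUE)
# "LOO_cross_test_ML_year_performance.tsv"
```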

Note

This function requires the following packages:

future - for parallel processing backend
future.apply - for parallel lapply
readr - for reading/writing TSV files
dplyr, purrr, stringr, tibble - for data manipulation

Ensure that loadMLInputTibble(), runMLPipeline(), and createMLinputList() are available in your environment before calling this function.
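
A quick pre-flight check along these lines may help before a long run. It is a sketch using only base R: it reports which of the packages and helper functions named above cannot be found and does not stop or install anything. The variable names are illustrative.

```r
# Packages listed in the note above.
required_pkgs <- c("future", "future.apply", "readr",
                   "dplyr", "purrr", "stringr", "tibble")
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!installed]

# Helper functions that must be visible from the calling environment.
required_fns <- c("loadMLInputTibble", "runMLPipeline", "createMLinputList")
available <- vapply(required_fns, exists, logical(1), mode = "function")
missing_fns <- required_fns[!available]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
if (length(missing_fns) > 0) {
  message("Missing functions: ", paste(missing_fns, collapse = ", "))
}
```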

Examples

if (FALSE) { # \dontrun{
# Standard ML models (no stratification)
runMLmodels("/path/to/results")

# Cross-test between drugs (no stratification)
runMLmodels("/path/to/results", cross_test = TRUE)

# Stratified by year
runMLmodels("/path/to/results", stratify_by = "year")

# Cross-test with year stratification
runMLmodels("/path/to/results",
            stratify_by = "year",
            cross_test = TRUE,
            threads = 32)

# LOO analysis stratified by country with cross-testing
runMLmodels("/path/to/results",
            stratify_by = "country",
            LOO = TRUE,
            cross_test = TRUE,
            verbose = TRUE)

# Run without saving model fits (save disk space)
runMLmodels("/path/to/results",
            stratify_by = "year",
            return_fit = FALSE)
} # }