Runs the machine learning pipeline with support for stratification, leave-one-out (LOO), and cross-testing configurations. Models are logistic regressions trained in parallel. When ml_parameters.json is available, split, seed, and n_fold are resolved from it via .resolveSplitParams() (see use_saved_split).

runMLmodels(
  path,
  stratify_by = NULL,
  LOO = FALSE,
  cross_test = FALSE,
  threads = 16,
  split = c(0.8, 0),
  n_fold = 5,
  prop_vi_top_feats = c(0, 1),
  pca_threshold = 0.99,
  verbose = TRUE,
  return_tune_res = TRUE,
  return_fit = TRUE,
  return_pred = TRUE,
  use_saved_split = TRUE,
  shuffle_labels = FALSE,
  use_pca = FALSE
)

Arguments

path

Character scalar. Base directory containing matrix files.

stratify_by

Character scalar or NULL. One of "year", "country", or NULL (no stratification). Default is NULL.

LOO

Logical. Perform Leave-One-Out analysis. Default is FALSE.

cross_test

Logical. Perform cross-testing between groups. Default is FALSE.

threads

Integer. Number of parallel workers for model training. Default is 16.

split

Numeric vector of length 2. Train/validation split proportions. Default is c(0.8, 0).

n_fold

Integer. Number of cross-validation folds. Default is 5.

prop_vi_top_feats

Numeric vector of length 2. Proportion range for variable-importance selection. Default is c(0, 1).

pca_threshold

Numeric. PCA variance threshold. Default 0.99.

verbose

Logical. Print progress messages during model training. Default TRUE.

return_tune_res

Logical. Return tuning results from cross-validation. Default TRUE.

return_fit

Logical. Return fitted model objects. Default TRUE.

return_pred

Logical. Return prediction results. Default TRUE.

use_saved_split

Logical. Whether to inherit split/seed/n_fold from ml_parameters.json. Default TRUE.

shuffle_labels

Logical. Randomly shuffle labels for baseline runs. Default FALSE.

use_pca

Logical. Use PCA on predictors. Default FALSE.

Value

NULL (invisible). Called for side effects (model training and result saving).

Details

This function supports multiple analysis configurations:

Standard mode (stratify_by = NULL, LOO = FALSE, cross_test = FALSE):

  • Trains models using train/test split from the same dataset

  • Saves results to ML_* directories

Cross-test without stratification (stratify_by = NULL, cross_test = TRUE):

  • Trains on one drug/class, tests on another drug/class

  • Pairs different drugs within same feature type

  • Saves results to cross_test_ML_* directories

Cross-test with stratification (stratify_by != NULL, cross_test = TRUE):

  • Trains on one stratum (year/country), tests on another stratum

  • Same drug/class across different stratification groups

  • Saves results to cross_test_ML_year_* or cross_test_ML_country_* directories

LOO with cross-test (LOO = TRUE, cross_test = TRUE):

  • Trains on leave-out dataset (one stratum excluded)

  • Tests on the full dataset including the left-out stratum

  • Saves results to LOO_cross_test_ML_year_* or LOO_cross_test_ML_country_* directories
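The LOO partitioning described above can be illustrated with a minimal base R sketch. The toy data frame and the names toy and loo_splits are hypothetical. This is not the package's internal code, only one way to express "train without one stratum, test on the full dataset".

```r
# Hypothetical toy data: six samples spread over three years (strata)
toy <- data.frame(
  id   = 1:6,
  year = c(2019, 2019, 2020, 2020, 2021, 2021)
)

strata <- unique(toy$year)

# For each stratum, train on everything except that stratum and
# test on the full dataset (which includes the left-out stratum)
loo_splits <- lapply(strata, function(held_out) {
  list(
    held_out = held_out,
    train    = toy[toy$year != held_out, ],
    test     = toy
  )
})
names(loo_splits) <- strata

# Number of training rows per left-out stratum
vapply(loo_splits, function(s) nrow(s$train), integer(1))
```

Each split keeps the held-out stratum out of training only, so test performance on that stratum reflects data the model has not seen.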

Model configuration:

  • Algorithm: Logistic Regression with elastic net regularization

  • Penalty values: 10^seq(-4, -1, length.out = 10)

  • Mixture (alpha): 0, 0.2, 0.4, 0.6, 0.8, 1.0 (ridge to lasso)

  • Selection metric: Matthews Correlation Coefficient (MCC)

  • Random seed: 5280 (for reproducibility)

  • PCA: Disabled by default (use_pca = FALSE); when enabled, pca_threshold sets the variance threshold
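As a rough illustration of the tuning grid listed above, the following base R sketch enumerates the penalty and mixture candidates and defines MCC from confusion-matrix counts. The names tune_grid and mcc are hypothetical and not part of the package. Returning NA when the denominator is zero is a choice made for this sketch; the pipeline may handle that case differently.

```r
# Candidate hyperparameters, using the values listed above
penalty <- 10^seq(-4, -1, length.out = 10)
mixture <- seq(0, 1, by = 0.2)  # 0 = ridge, 1 = lasso

# Full grid: every penalty paired with every mixture value
tune_grid <- expand.grid(penalty = penalty, mixture = mixture)
nrow(tune_grid)  # 60 combinations (10 penalties x 6 mixtures)

# Hypothetical helper: Matthews Correlation Coefficient from counts
mcc <- function(tp, tn, fp, fn) {
  # Convert to double to avoid integer overflow in the products
  tp <- as.numeric(tp); tn <- as.numeric(tn)
  fp <- as.numeric(fp); fn <- as.numeric(fn)
  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) return(NA_real_)  # undefined; sketch-level choice
  (tp * tn - fp * fn) / denom
}

mcc(tp = 40, tn = 45, fp = 5, fn = 10)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than accuracy.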

Output file naming: Files are saved with prefixes and suffixes indicating the configuration:

  • LOO: Prefixed with "LOO_"

  • Cross-test: Prefixed with "cross_test_"

  • Stratification: Suffixed with "_country" or "_year"

For example: "LOO_cross_test_ML_year_performance.tsv"
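The naming rules above can be sketched as a small helper. build_result_name is a hypothetical function written for illustration, and the default stem "performance.tsv" is taken from the example file name. The package may assemble its file names differently.

```r
# Hypothetical helper: compose an output file name from the configuration
build_result_name <- function(stratify_by = NULL, LOO = FALSE,
                              cross_test = FALSE,
                              stem = "performance.tsv") {
  # Prefixes: "LOO_" first, then "cross_test_"
  prefix <- paste0(
    if (LOO) "LOO_" else "",
    if (cross_test) "cross_test_" else ""
  )
  # Suffix: "_year" or "_country" when stratified, empty otherwise
  suffix <- if (is.null(stratify_by)) "" else paste0("_", stratify_by)
  paste0(prefix, "ML", suffix, "_", stem)
}

build_result_name("year", LOO = TRUE, cross_test = TRUE)
# "LOO_cross_test_ML_year_performance.tsv"
```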

Note

This function requires the following packages:

  • future - for parallel processing backend

  • future.apply - for parallel lapply

  • readr - for reading/writing TSV files

  • dplyr, purrr, stringr, tibble - for data manipulation

Ensure that loadMLInputTibble(), runMLPipeline(), and createMLinputList() are available in your environment before calling this function.
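A quick pre-flight check along these lines may help before a long run. This is a minimal sketch, not part of the package: the variable names are hypothetical, and it only looks for the helper functions in the global environment and attached packages.

```r
# Packages listed in the note above
required_pkgs <- c("future", "future.apply", "readr",
                   "dplyr", "purrr", "stringr", "tibble")
is_installed <- vapply(required_pkgs, requireNamespace,
                       logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Helper functions that must be available before calling runMLmodels()
required_funs <- c("loadMLInputTibble", "runMLPipeline", "createMLinputList")
is_defined <- vapply(
  required_funs,
  function(f) exists(f, mode = "function", envir = globalenv()),
  logical(1)
)
missing_funs <- required_funs[!is_defined]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
if (length(missing_funs) > 0) {
  message("Missing functions: ", paste(missing_funs, collapse = ", "))
}
```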

See also

createMLinputList for generating input file lists; runMDRmodels for MDR-specific model execution; createMLResultDir for directory structure creation.

Examples

if (FALSE) { # \dontrun{
# Standard ML models (no stratification)
runMLmodels("/path/to/results")

# Cross-test between drugs (no stratification)
runMLmodels("/path/to/results", cross_test = TRUE)

# Stratified by year
runMLmodels("/path/to/results", stratify_by = "year")

# Cross-test with year stratification
runMLmodels("/path/to/results",
            stratify_by = "year",
            cross_test = TRUE,
            threads = 32)

# LOO analysis stratified by country with cross-testing
runMLmodels("/path/to/results",
            stratify_by = "country",
            LOO = TRUE,
            cross_test = TRUE,
            verbose = TRUE)

# Run without saving model fits (save disk space)
runMLmodels("/path/to/results",
            stratify_by = "year",
            return_fit = FALSE)
} # }