Stitches together core ML functions into one pipeline.

runMLPipeline(
  ml_input_tibble,
  model = "LR",
  split = c(0.6, 0.2),
  n_fold = 2,
  prop_vi_top_feats = c(0, 1),
  n_top_feats = NA,
  use_pca = FALSE,
  pca_threshold = 0.95,
  penalty_vec = 10^seq(-4, -1, length.out = 10),
  mix_vec = 0:5/5,
  min_n_vec = c(2, 6, 12),
  tree_vec = c(100, 500, 1000),
  select_best_metric = "mcc",
  seed = 123,
  shuffle_labels = FALSE,
  test_data = NA,
  return_tune_res = FALSE,
  return_fit = FALSE,
  return_pred = FALSE,
  verbose = TRUE
)

Arguments

ml_input_tibble

An ML-ready tibble generated by loadMLInputTibble(). It must contain exactly one target variable column, named either genome_drug.resistant_phenotype ("Resistant" or "Susceptible" classification for one bug/drug combination) or resistant_classes (multi-class classification of the drug classes to which each genome is resistant), but not both.

model

[chr] Model to train: logistic regression ("LR"), random forest ("RF"), or boosted tree ("BT").

split

[num] Vector of length 2 giving the proportions of the data to use for training and validation, respectively. Note: if test_data is provided, these numbers are rescaled to sum to 1 and still represent fractions of ml_input_tibble (which does not include the supplied test_data). Do not directly provide numbers that sum to 1, since the function is not equipped to handle this. If cross-validation is enabled (split = c(1, 0)), a 20% test holdout is still retained for final reporting; cross-validation is run on the remaining 80% training portion, not on the test set.
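
The rescaling described above can be illustrated with plain arithmetic. This is only a sketch of the numbers involved, not the package's internal code; it assumes the portion of ml_input_tibble not covered by split becomes the test set when no test_data is supplied.

```r
# Default training/validation proportions from the usage above.
split <- c(0.6, 0.2)

# Without test_data (assumption): the remainder of ml_input_tibble
# is held out for testing.
test_prop <- 1 - sum(split)

# With test_data: the two proportions are rescaled to sum to 1, so all of
# ml_input_tibble is divided between training and validation.
split_scaled <- split / sum(split)

print(test_prop)     # 0.2
print(split_scaled)  # 0.75 0.25
```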

n_fold

[num] Number of cross-validation folds.

prop_vi_top_feats

[num] A vector of length 2 giving the lower and upper bounds of the proportion of total variable importance that the returned top features should cover. For example, to get the features that contribute the top 10 to 20% of total variable importance, set prop_vi_top_feats = c(0.1, 0.2). The default, c(0, 1), returns all features.
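
One way to picture this selection is as a window on the cumulative share of variable importance. The sketch below is illustrative only: the importance scores and gene names are made up, and the rule for including boundary features (lower bound exclusive, upper bound inclusive) is an assumption rather than the function's documented behavior.

```r
# Hypothetical variable-importance scores (not real results).
vi <- c(geneA = 5, geneB = 3, geneC = 1, geneD = 1)

# Rank features and compute each one's cumulative share of total importance.
vi <- sort(vi, decreasing = TRUE)
cum_share <- cumsum(vi) / sum(vi)

# Assumed rule: keep features whose cumulative share lies in (lower, upper].
prop_vi_top_feats <- c(0, 0.8)
keep <- cum_share > prop_vi_top_feats[1] & cum_share <= prop_vi_top_feats[2]
top_feats <- names(vi)[keep]

print(top_feats)  # "geneA" "geneB"
```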

n_top_feats

[num] Number of top features to extract per drug.

use_pca

[bool] Set to TRUE to use PCA instead of all features.

pca_threshold

[num] The proportion of total variance that the retained principal components must account for.
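
To show what a variance threshold of this kind means in practice, the sketch below uses base R's prcomp() on the built-in USArrests data set. It is a generic illustration; the pipeline's own PCA step may be implemented differently.

```r
# PCA on a built-in data set, with variables scaled to unit variance.
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches the threshold.
pca_threshold <- 0.95
n_pc <- which(cum_var >= pca_threshold)[1]

print(round(cum_var, 3))
print(n_pc)
```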

penalty_vec

[num] A vector of penalty (regularization strength) values to try for logistic regression. Values in the range 10^-4 to 10^4 are recommended.

mix_vec

[num] A vector of mixture values to try for logistic regression. 0 corresponds to L2 regularization; 1 corresponds to L1; intermediate values correspond to elastic net.
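
For a sense of how many candidate models the default penalty_vec and mix_vec imply, the sketch below builds a full grid with base R. It assumes every penalty value is paired with every mixture value; the column names penalty and mixture are illustrative, and the function may construct its tuning grid differently.

```r
# Default candidate values from the usage above.
penalty_vec <- 10^seq(-4, -1, length.out = 10)
mix_vec <- 0:5 / 5

# Assumption: every penalty is combined with every mixture value.
lr_grid <- expand.grid(penalty = penalty_vec, mixture = mix_vec)

print(nrow(lr_grid))           # 60 candidate combinations
print(range(lr_grid$penalty))  # 1e-04 1e-01
print(unique(lr_grid$mixture)) # 0.0 0.2 0.4 0.6 0.8 1.0
```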

min_n_vec

[num] A vector containing min_n values (the number of data points in a node required for the node to be split) to try for random forest or boosted tree. It is recommended to choose values in the range 1 to 100.

tree_vec

[num] A vector containing values to try for the number of trees in random forest or boosted tree. It is recommended to choose values in the range 100 to 1000.

select_best_metric

[chr] Metric used to select the best model: "mcc" (the default), "f_meas", "pr_auc", or "bal_accuracy".

seed

[num] Random seed, for reproducible analysis.

shuffle_labels

[bool] Set to TRUE to randomly shuffle AMR phenotype labels for baseline comparisons.
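
The idea behind a shuffled-label baseline can be sketched in base R as follows. The labels are made up, and treating the shuffle as a simple permutation of the label column is an assumption about how the function behaves.

```r
# Hypothetical phenotype labels (not real data).
labels <- c("Resistant", "Resistant", "Susceptible", "Susceptible", "Resistant")

# Permute the labels reproducibly; features would stay in their original order,
# which breaks any real association between features and phenotype.
set.seed(123)
shuffled <- sample(labels)

# Class counts are unchanged by the permutation.
print(table(labels))
print(table(shuffled))
```

A model trained on shuffled labels should perform at roughly chance level, which gives a reference point for judging the real model.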

test_data

A tibble to use as testing data instead of a subset of ml_input_tibble. This can be useful for testing different geographical or temporal holdouts. The split argument still tells how ml_input_tibble should be divided for training and validation.

return_tune_res

[bool] Set to TRUE to return tuning results.

return_fit

[bool] Set to TRUE to return the model fit.

return_pred

[bool] Set to TRUE to return the predicted and actual AMR phenotypes.

verbose

[bool] Set to FALSE to suppress progress messages.

Value

A list with two elements: performance_tibble and top_feat_tibble. Tuning results, the fit object, and model predictions are also returned if return_tune_res, return_fit, and/or return_pred, respectively, are set to TRUE.