runMLPipeline() — runMLPipeline • amRml

Stitches together core ML functions into one pipeline.

runMLPipeline(
  ml_input_tibble,
  model = "LR",
  split = c(0.6, 0.2),
  n_fold = 2,
  prop_vi_top_feats = c(0, 1),
  n_top_feats = NA,
  use_pca = FALSE,
  pca_threshold = 0.95,
  penalty_vec = 10^seq(-4, -1, length.out = 10),
  mix_vec = 0:5/5,
  min_n_vec = c(2, 6, 12),
  tree_vec = c(100, 500, 1000),
  select_best_metric = "mcc",
  seed = 123,
  shuffle_labels = FALSE,
  test_data = NA,
  return_tune_res = FALSE,
  return_fit = FALSE,
  return_pred = FALSE,
  verbose = TRUE
)
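
To make the default tuning vectors in the signature above concrete, here is a minimal base-R sketch of the candidate values they produce. It assumes that every combination of penalty and mixture (for logistic regression) and of min_n and trees (for random forest or boosted tree) is tried as a full grid; the object names lr_grid and tree_grid are illustrative only and are not part of amRml.

```r
# Default candidate values, copied from the usage signature.
penalty_vec <- 10^seq(-4, -1, length.out = 10)  # 10 log-spaced penalties
mix_vec     <- 0:5 / 5                          # 0, 0.2, ..., 1
min_n_vec   <- c(2, 6, 12)
tree_vec    <- c(100, 500, 1000)

# Assumption: the tuning step evaluates every combination (a full grid).
lr_grid   <- expand.grid(penalty = penalty_vec, mixture = mix_vec)
tree_grid <- expand.grid(min_n = min_n_vec, trees = tree_vec)

nrow(lr_grid)    # 60 penalty/mixture combinations
nrow(tree_grid)  # 9 min_n/trees combinations
```

Under that assumption, lengthening any of these vectors multiplies the number of candidate models, and each candidate is fitted once per cross-validation fold.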

Arguments

ml_input_tibble: An ML-ready tibble generated by loadMLInputTibble(). It must have exactly one target variable column, named either genome_drug.resistant_phenotype ("Resistant" or "Susceptible" classification for one bug/drug combination) or resistant_classes (multi-class classification for determining the drug classes to which each genome is resistant), but not both.
model: rlang::chr Logistic regression ("LR"), random forest ("RF"), or boosted tree ("BT")
split: pillar::num Vector of length 2 giving the proportions of data to designate as training and validation, respectively. Note: if test_data is provided, these numbers are rescaled so that they sum to 1 and still represent fractions of ml_input_tibble (not including the supplied test_data). Do not directly provide numbers that sum to 1, since the function is not equipped to handle this. If cross-validation is enabled with split = c(1, 0), a 20% test holdout is still retained for final reporting; cross-validation is run on the 80% training portion, not on the test set.
n_fold: pillar::num Number of folds of cross-validation
prop_vi_top_feats: pillar::num A vector of length 2 whose elements give the lower and upper bounds of the proportion of total variable importance that the top features should comprise. For example, to get the features that contribute to the top 10 to 20% of total variable importance, set prop_vi_top_feats = c(0.1, 0.2). The default, c(0, 1), returns all features.
n_top_feats: pillar::num Number of top features to extract per drug
use_pca: arrow::bool Set to TRUE to use PCA instead of all features.
pca_threshold: pillar::num The proportion of total variance that the retained principal components should account for
penalty_vec: pillar::num A vector containing penalty (regularization strength) values to try (for logistic regression). It is recommended to choose values in the range 10^-4 to 10^4.
mix_vec: pillar::num A vector containing mixture values to try for logistic regression. 0 corresponds to L2 regularization; 1 corresponds to L1; intermediate values correspond to elastic net.
min_n_vec: pillar::num A vector containing min_n values (the number of data points in a node required for the node to be split) to try for random forest or boosted tree. It is recommended to choose values in the range 1 to 100.
tree_vec: pillar::num A vector containing values to try for the number of trees in random forest or boosted tree. It is recommended to choose values in the range 100 to 1000.
select_best_metric: rlang::chr Metric used to select the best model: "mcc" (the default), "f_meas", "pr_auc", or "bal_accuracy"
seed: pillar::num For reproducible analysis
shuffle_labels: arrow::bool Set to TRUE to randomly shuffle AMR phenotype labels for baseline comparisons.
test_data: A tibble to use as testing data instead of a subset of ml_input_tibble. This can be useful for evaluating on different geographical or temporal holdouts. The split argument still determines how ml_input_tibble is divided between training and validation.
return_tune_res: arrow::bool Set to TRUE to return tuning results.
return_fit: arrow::bool Set to TRUE to return the model fit.
return_pred: arrow::bool Set to TRUE to return the predicted and actual AMR phenotypes.
verbose: arrow::bool The function will stay quiet if set to FALSE.
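
The rescaling described for split can be illustrated with a short base-R sketch. The helper resolve_split() is hypothetical and not part of amRml; it also assumes that, when no test_data is supplied, the fraction left over after training and validation is held out as the test set.

```r
# Hypothetical helper (not part of amRml): turn `split` into
# train/validation/test fractions of ml_input_tibble.
resolve_split <- function(split, has_test_data = FALSE) {
  stopifnot(is.numeric(split), length(split) == 2,
            all(split >= 0), sum(split) > 0)
  if (has_test_data) {
    # External test set supplied: rescale so that training and
    # validation together use all rows of ml_input_tibble.
    scaled <- split / sum(split)
    c(train = scaled[1], validation = scaled[2], test = 0)
  } else {
    # Assumption: the remaining fraction becomes the test set.
    c(train = split[1], validation = split[2], test = 1 - sum(split))
  }
}

default_split <- resolve_split(c(0.6, 0.2))
external_test <- resolve_split(c(0.6, 0.2), has_test_data = TRUE)

print(default_split)  # train 0.6, validation 0.2, test 0.2
print(external_test)  # train 0.75, validation 0.25, test 0
```

With the default split = c(0.6, 0.2), supplying test_data therefore means 75% of ml_input_tibble is used for training and 25% for validation.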

Value

A list with two elements: a performance_tibble and a top_feat_tibble. Tuning results, the fit object, and model predictions are also returned if return_tune_res, return_fit, and/or return_pred, respectively, are set to TRUE.
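
When predictions are returned, a metric such as the default "mcc" (Matthews correlation coefficient) can be recomputed by hand as a sanity check. The sketch below uses the standard binary MCC formula on hypothetical predicted and actual phenotype vectors; the structure and column names of the predictions returned by runMLPipeline() are not documented here, and the pipeline may compute the metric differently.

```r
# Hypothetical predicted and actual phenotypes (illustrative data only).
actual    <- c("Resistant", "Resistant", "Susceptible",
               "Susceptible", "Resistant", "Susceptible")
predicted <- c("Resistant", "Susceptible", "Susceptible",
               "Susceptible", "Resistant", "Resistant")

# Binary Matthews correlation coefficient from the confusion-matrix counts.
mcc_binary <- function(actual, predicted, positive = "Resistant") {
  stopifnot(length(actual) == length(predicted))
  # Counts are converted to doubles to avoid integer overflow in the products.
  tp <- as.numeric(sum(actual == positive & predicted == positive))
  tn <- as.numeric(sum(actual != positive & predicted != positive))
  fp <- as.numeric(sum(actual != positive & predicted == positive))
  fn <- as.numeric(sum(actual == positive & predicted != positive))
  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) return(0)  # one common convention for a degenerate table
  (tp * tn - fp * fn) / denom
}

mcc_value <- mcc_binary(actual, predicted)
print(mcc_value)
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which makes it a natural companion to the shuffle_labels baseline.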