Stitches together core ML functions into one pipeline.

runMLPipeline(
  ml_input_tibble,
  model = "LR",
  split = c(0.6, 0.2),
  n_fold = 2,
  prop_vi_top_feats = c(0, 1),
  n_top_feats = NA,
  use_pca = FALSE,
  pca_threshold = 0.95,
  penalty_vec = 10^seq(-4, -1, length.out = 10),
  mix_vec = 0:5/5,
  min_n_vec = c(2, 6, 12),
  tree_vec = c(100, 500, 1000),
  select_best_metric = "mcc",
  seed = 123,
  shuffle_labels = FALSE,
  test_data = NA,
  return_tune_res = FALSE,
  return_fit = FALSE,
  return_pred = FALSE,
  verbose = TRUE
)

Arguments

ml_input_tibble

An ML-ready tibble generated by loadMLInputTibble(). It must contain exactly one target variable column, named either genome_drug.resistant_phenotype ("Resistant" or "Susceptible" classification for one bug/drug combination) or resistant_classes (multi-class classification for determining the drug classes to which each genome is resistant), but not both.

model

chr Logistic regression ("LR"), random forest ("RF"), or boosted tree ("BT")

split

num Vector of length 2 giving the proportions of the data to use for training and validation, respectively. Note: if test_data is provided, these numbers are rescaled so that they sum to 1, and they still represent fractions of ml_input_tibble (not including the input test_data). Other than the cross-validation setting split = c(1, 0) described next, do not directly provide numbers that sum to 1, since the function is not equipped to handle this. To enable cross-validation, set split = c(1, 0): a 20% test holdout is still retained for final reporting, and cross-validation is run on the remaining 80% training portion, never on the test set.
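
As a rough illustration of the rescaling described above, the following sketch shows only the arithmetic, assuming the proportions are normalised by dividing each one by their sum. It is not taken from the package's internal code.

```r
# Default split: 60% training, 20% validation.
split <- c(0.6, 0.2)

# Assumed rescaling when test_data is supplied: normalise the proportions
# so that training + validation together cover all of ml_input_tibble.
scaled_split <- split / sum(split)

# Roughly 0.75 for training and 0.25 for validation.
print(scaled_split)
```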

n_fold

num Number of folds of cross-validation

prop_vi_top_feats

num A vector of length 2 giving the lower and upper bounds on the proportion of total variable importance that the returned top features should comprise. For example, to get the features that contribute to the top 10 to 20% of total variable importance, set prop_vi_top_feats = c(0.1, 0.2). The default, c(0, 1), returns all features.
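
One plausible reading of this argument can be sketched in base R: rank features by importance, compute the cumulative share of total importance, and keep the features whose cumulative share falls between the two bounds. The feature names, importance values, and the exact boundary rule below are assumptions for illustration; the package may apply the bounds differently.

```r
# Toy importance scores (made-up values for illustration only).
vi <- data.frame(
  Variable   = c("feat_a", "feat_b", "feat_c", "feat_d", "feat_e"),
  Importance = c(5, 2, 1.5, 1, 0.5),
  stringsAsFactors = FALSE
)

# Rank features from most to least important, then compute the cumulative
# share of total importance reached after each feature.
vi <- vi[order(-vi$Importance), ]
vi$cum_prop <- cumsum(vi$Importance) / sum(vi$Importance)

# Assumed rule: keep features whose cumulative share lies in (lower, upper].
prop_vi_top_feats <- c(0.6, 0.9)
keep <- vi$cum_prop > prop_vi_top_feats[1] & vi$cum_prop <= prop_vi_top_feats[2]
selected <- vi$Variable[keep]
print(selected)
```

With these toy values the cumulative shares are 0.5, 0.7, 0.85, 0.95, and 1, so only feat_b and feat_c fall inside the requested band.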

n_top_feats

num Number of top features to extract per drug

use_pca

bool Set to TRUE to use PCA instead of all features.

pca_threshold

num The proportion of total variance that the retained principal components should account for

penalty_vec

num A vector containing penalty (regularization strength) values to try for logistic regression. It is recommended to choose values between 10^-4 and 10^4.

mix_vec

num A vector containing mixture values to try for logistic regression. 0 corresponds to L2 regularization; 1 corresponds to L1; intermediate values correspond to elastic net.
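
To get a feel for how many candidate models the defaults imply, the sketch below builds every combination of the default penalty_vec and mix_vec values with base R. It assumes the pipeline evaluates a full grid of penalty and mixture values, which this page does not state explicitly; the column names penalty and mixture are illustrative.

```r
# Default candidate values, copied from the usage section.
penalty_vec <- 10^seq(-4, -1, length.out = 10)
mix_vec <- 0:5 / 5

# Assumed full grid: one row per (penalty, mixture) combination.
lr_grid <- expand.grid(penalty = penalty_vec, mixture = mix_vec)

# 10 penalty values x 6 mixture values = 60 candidate settings.
print(nrow(lr_grid))
```

Because the number of candidates is the product of the two vector lengths, shortening either vector is a simple way to reduce tuning time.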

min_n_vec

num A vector containing min_n values (the number of data points in a node required for the node to be split) to try for random forest or boosted tree. It is recommended to choose values in the range 1 to 100.

tree_vec

num A vector containing values to try for the number of trees in random forest or boosted tree. It is recommended to choose values in the range 100 to 1000.

select_best_metric

chr Metric used to select the best model: "mcc" (the default), "f_meas", "pr_auc", or "bal_accuracy"

seed

num For reproducible analysis

shuffle_labels

bool Set to TRUE to randomly shuffle AMR phenotype labels for baseline comparisons.
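
The idea behind a shuffled-label baseline can be shown with a small base R sketch: permuting the labels breaks any link between features and phenotype while leaving the class balance unchanged. The label vector below is a made-up example, and this is not the package's own implementation.

```r
set.seed(123)

# Hypothetical phenotype labels for five genomes.
labels <- c("Resistant", "Resistant", "Susceptible", "Susceptible", "Susceptible")

# Randomly permute the labels; each genome now carries an arbitrary phenotype.
shuffled <- sample(labels)

# Class counts are identical before and after shuffling.
print(table(labels))
print(table(shuffled))
```

A model trained on shuffled labels should perform near chance, which gives a reference point for judging performance on the real labels.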

test_data

A tibble to use as testing data instead of a subset of ml_input_tibble. This can be useful for testing different geographical or temporal holdouts. The split argument still determines how ml_input_tibble is divided into training and validation sets.

return_tune_res

bool Set to TRUE to return tuning results.

return_fit

bool Set to TRUE to return the model fit.

return_pred

bool Set to TRUE to return the predicted and actual AMR phenotypes.

verbose

bool The function will stay quiet if set to FALSE.

Value

A list with two elements: a performance_tibble and a top_feat_tibble. Tuning results, the fit object, and model predictions are also returned if return_tune_res, return_fit, and/or return_pred, respectively, are set to TRUE.

Examples

data(demo_ml_tibble)
set.seed(1)
runMLPipeline(
  ml_input_tibble = demo_ml_tibble, model = "LR",
  split = c(1, 0), n_fold = 2,
  penalty_vec = 10^c(-3, -1), mix_vec = c(0, 0.5, 1),
  n_top_feats = 10, verbose = FALSE
)
#> Warning: Classes are roughly balanced. Calculation of log2(AUPRC/prior) may be inappropriate.
#> $performance_tibble
#> # A tibble: 1 × 18
#>   num_obs res_prop n_feat model train_prop val_prop lower_prop_vi_top_feats
#>     <int>    <dbl>  <int> <chr>      <dbl>    <dbl>                   <dbl>
#> 1      60      0.5     80 LR             1        0                       0
#> # ℹ 11 more variables: upper_prop_vi_top_feats <dbl>, n_feats_returned <int>,
#> #   n_fold <dbl>, fit_penalty <dbl>, fit_mixture <dbl>, nmcc <dbl>,
#> #   log2_apop <dbl>, f1 <dbl>, bal_acc <dbl>, run_time_sec <dbl>, date <chr>
#> 
#> $top_feat_tibble
#> # A tibble: 10 × 3
#>    Variable    Importance Sign 
#>    <chr>            <dbl> <chr>
#>  1 group_1006        3.12 NEG  
#>  2 group_10040       3.12 POS  
#>  3 group_10013       2.99 POS  
#>  4 group_10051       2.59 POS  
#>  5 group_10052       2.27 NEG  
#>  6 group_10056       2.17 POS  
#>  7 group_10033       2.15 NEG  
#>  8 group_10047       1.78 POS  
#>  9 group_10061       1.70 NEG  
#> 10 group_10046       1.68 NEG  
#>