Stitches together core ML functions into one pipeline.
runMLPipeline(
  ml_input_tibble,
  model = "LR",
  split = c(0.6, 0.2),
  n_fold = 2,
  prop_vi_top_feats = c(0, 1),
  n_top_feats = NA,
  use_pca = FALSE,
  pca_threshold = 0.95,
  penalty_vec = 10^seq(-4, -1, length.out = 10),
  mix_vec = 0:5/5,
  min_n_vec = c(2, 6, 12),
  tree_vec = c(100, 500, 1000),
  select_best_metric = "mcc",
  seed = 123,
  shuffle_labels = FALSE,
  test_data = NA,
  return_tune_res = FALSE,
  return_fit = FALSE,
  return_pred = FALSE,
  verbose = TRUE
)
ml_input_tibble: An ML-ready tibble generated by loadMLInputTibble().
This must have a target variable column named either
genome_drug.resistant_phenotype ("Resistant" or "Susceptible"
classification for one bug/drug combination) or resistant_classes
(multi-class classification for determining the drug classes to which each
genome is resistant), but not both.
model: chr Logistic regression ("LR"), random forest ("RF"), or boosted tree
("BT").
split: num Vector of length 2 indicating the proportions of data to be
designated as training and validation, respectively. Note: if test_data is
provided, these numbers will be scaled so that they sum to 1 and will still
represent fractions of ml_input_tibble (not including the input test_data).
If cross-validation is enabled here (split = c(1, 0)), a 20% test holdout is
still retained for final reporting: cross-validation is run on the 80%
training portion, not on the test set. Apart from that setting, please do not
directly provide numbers that sum to 1, since the function is not equipped to
handle this.
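The arithmetic behind the split argument can be sketched in base R. This is only an illustration of the description above, not the package's internal code: rescale_split() is a hypothetical helper, and dividing by the sum is an assumption about how the proportions are "scaled so that they sum to 1".

```r
# Hypothetical helper (not part of the package): rescale a (train, validation)
# split so that the two proportions sum to 1, as described for the case where
# test_data is supplied. Dividing by the sum is an assumption.
rescale_split <- function(split) {
  stopifnot(length(split) == 2, all(split >= 0), sum(split) > 0)
  split / sum(split)
}

split <- c(0.6, 0.2)

# Without test_data: whatever is left of ml_input_tibble is the test portion.
test_prop <- 1 - sum(split)

# With test_data: training and validation proportions are rescaled and still
# refer to ml_input_tibble only.
scaled <- rescale_split(split)

print(test_prop)
print(scaled)
```

With the default split, the sketch leaves a fifth of the rows for testing when no test_data is given, and a three-to-one training/validation ratio when it is.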
n_fold: num Number of folds of cross-validation.
prop_vi_top_feats: num A vector of length 2 whose elements together give the
range of total variable importance that the top features should comprise. For
example, to get the features that contribute to the top 10 to 20% of total
variable importance, set prop_vi_top_feats = c(0.1, 0.2). Returns all
features by default.
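One way to read prop_vi_top_feats is as a window on the cumulative share of total variable importance, with features ranked from most to least important. The sketch below illustrates that reading in base R. select_by_vi_share() and the importance values are hypothetical, and the boundary rule (keep a feature if its slice of cumulative importance overlaps the window) is an assumption, since this page does not state how runMLPipeline() treats the edges.

```r
# Hypothetical illustration: select features by cumulative share of total
# variable importance. The overlap rule at the window edges is an assumption.
select_by_vi_share <- function(importance, prop_range = c(0, 1)) {
  stopifnot(length(prop_range) == 2, prop_range[1] <= prop_range[2])
  ranked <- sort(importance, decreasing = TRUE)
  cum_share <- cumsum(ranked) / sum(ranked)   # share reached after each feature
  prev_share <- c(0, head(cum_share, -1))     # share reached before each feature
  keep <- cum_share > prop_range[1] & prev_share < prop_range[2]
  ranked[keep]
}

# Made-up importance scores for four features.
imp <- c(a = 5, b = 3, c = 1.5, d = 0.5)

all_feats <- select_by_vi_share(imp)               # default window: everything
mid_feats <- select_by_vi_share(imp, c(0.5, 0.9))  # features covering 50-90%

print(names(all_feats))
print(names(mid_feats))
```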
n_top_feats: num Number of top features to extract per drug.
use_pca: bool Set to TRUE to use PCA instead of all features.
pca_threshold: num The proportion of total variance for which the principal
components account.
penalty_vec: num A vector containing penalty (regularization strength) values
to try (for logistic regression). It is recommended to choose values between
10^-4 and 10^4.
mix_vec: num A vector containing mixture values to try for logistic
regression. 0 corresponds to L2 regularization; 1 corresponds to L1;
intermediate values correspond to elastic net.
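To get a feel for the size of the logistic regression search space, the default penalty_vec and mix_vec can be crossed in base R. Whether runMLPipeline() tunes over the full Cartesian product is an assumption here; the sketch only shows what such a grid would look like and how the mixture values map onto the regularization types described above.

```r
# Default candidate values from the usage section.
penalty_vec <- 10^seq(-4, -1, length.out = 10)
mix_vec <- 0:5 / 5

# Assumption: every penalty value is paired with every mixture value.
grid <- expand.grid(penalty = penalty_vec, mixture = mix_vec)

# Label each row with the regularization type implied by its mixture value.
grid$regularization <- ifelse(
  grid$mixture == 0, "L2 (ridge)",
  ifelse(grid$mixture == 1, "L1 (lasso)", "elastic net")
)

print(nrow(grid))
print(table(grid$regularization))
```

Under that assumption, the number of candidate models grows as the product of the two vector lengths, which is worth keeping in mind when widening either vector.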
min_n_vec: num A vector containing min_n values (the number of data points in
a node required for the node to be split) to try for random forest or boosted
tree. It is recommended to choose values in the range 1 to 100.
tree_vec: num A vector containing values to try for the number of trees in
random forest or boosted tree. It is recommended to choose values in the range
100 to 1000.
select_best_metric: chr Metric used to select the best model: "mcc" (the
default), "f_meas", "pr_auc", or "bal_accuracy".
seed: num Random seed, for reproducible analysis.
shuffle_labels: bool Set to TRUE to randomly shuffle AMR phenotype labels for
baseline comparisons.
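The idea behind a shuffled-label baseline can be sketched in base R: permuting the phenotype column breaks any link between features and labels while keeping the class balance unchanged. This is a minimal illustration, not the package's implementation; the toy data frame and its genome_id column are hypothetical, and only the genome_drug.resistant_phenotype column name comes from this page.

```r
set.seed(123)

# Toy phenotype table (hypothetical data; genome_id is an illustrative column).
pheno <- data.frame(
  genome_id = sprintf("genome_%02d", 1:60),
  genome_drug.resistant_phenotype = rep(c("Resistant", "Susceptible"), each = 30),
  stringsAsFactors = FALSE
)

# Permute the labels across genomes; every other column stays in place.
shuffled <- pheno
shuffled$genome_drug.resistant_phenotype <-
  sample(pheno$genome_drug.resistant_phenotype)

# Class counts are preserved by the permutation.
print(table(pheno$genome_drug.resistant_phenotype))
print(table(shuffled$genome_drug.resistant_phenotype))
```

A model trained on such labels should perform near chance, which gives a reference point for the metrics obtained with the real labels.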
test_data: A tibble to use as testing data instead of a subset of
ml_input_tibble. This can be useful for testing different geographical
or temporal holdouts. The split argument still determines how ml_input_tibble
is divided into training and validation sets.
return_tune_res: bool Set to TRUE to return tuning results.
return_fit: bool Set to TRUE to return the model fit.
return_pred: bool Set to TRUE to return the predicted and actual
AMR phenotypes.
verbose: bool The function will stay quiet if set to FALSE.
A list with two elements: A performance_tibble and a
top_feat_tibble. Tuning results, the fit object, and model predictions may
also be returned if return_tune_res, return_fit, and/or return_pred,
respectively, are set to TRUE.
data(demo_ml_tibble)
set.seed(1)
runMLPipeline(
  ml_input_tibble = demo_ml_tibble, model = "LR",
  split = c(1, 0), n_fold = 2,
  penalty_vec = 10^c(-3, -1), mix_vec = c(0, 0.5, 1),
  n_top_feats = 10, verbose = FALSE
)
#> Warning: Classes are roughly balanced. Calculation of log2(AUPRC/prior) may be inappropriate.
#> $performance_tibble
#> # A tibble: 1 × 18
#>   num_obs res_prop n_feat model train_prop val_prop lower_prop_vi_top_feats
#>     <int>    <dbl>  <int> <chr>      <dbl>    <dbl>                   <dbl>
#> 1      60      0.5     80 LR             1        0                       0
#> # ℹ 11 more variables: upper_prop_vi_top_feats <dbl>, n_feats_returned <int>,
#> # n_fold <dbl>, fit_penalty <dbl>, fit_mixture <dbl>, nmcc <dbl>,
#> # log2_apop <dbl>, f1 <dbl>, bal_acc <dbl>, run_time_sec <dbl>, date <chr>
#>
#> $top_feat_tibble
#> # A tibble: 10 × 3
#>    Variable    Importance Sign
#>    <chr>            <dbl> <chr>
#>  1 group_1006        3.12 NEG
#>  2 group_10040       3.12 POS
#>  3 group_10013       2.99 POS
#>  4 group_10051       2.59 POS
#>  5 group_10052       2.27 NEG
#>  6 group_10056       2.17 POS
#>  7 group_10033       2.15 NEG
#>  8 group_10047       1.78 POS
#>  9 group_10061       1.70 NEG
#> 10 group_10046       1.68 NEG
#>
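The returned top_feat_tibble can be post-processed like any data frame. The sketch below rebuilds the ten rows printed above as a plain data.frame so that it runs on its own, then counts features by Sign and pulls out the POS ones. What POS and NEG mean biologically is not stated on this page, so the sketch makes no claim about it.

```r
# Rows copied from the $top_feat_tibble output printed above.
top_feats <- data.frame(
  Variable = c("group_1006", "group_10040", "group_10013", "group_10051",
               "group_10052", "group_10056", "group_10033", "group_10047",
               "group_10061", "group_10046"),
  Importance = c(3.12, 3.12, 2.99, 2.59, 2.27, 2.17, 2.15, 1.78, 1.70, 1.68),
  Sign = c("NEG", "POS", "POS", "POS", "NEG", "POS", "NEG", "POS", "NEG", "NEG"),
  stringsAsFactors = FALSE
)

# Count features by sign and list the POS features in importance order.
sign_counts <- table(top_feats$Sign)
pos_feats <- top_feats$Variable[top_feats$Sign == "POS"]

print(sign_counts)
print(pos_feats)
```

With a real run, the same steps would apply to the top_feat_tibble element of the returned list, for example res$top_feat_tibble if the result were stored in a variable named res (a hypothetical name).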