Stitches together core ML functions into one pipeline.
runMLPipeline(
ml_input_tibble,
model = "LR",
split = c(0.6, 0.2),
n_fold = 2,
prop_vi_top_feats = c(0, 1),
n_top_feats = NA,
use_pca = FALSE,
pca_threshold = 0.95,
penalty_vec = 10^seq(-4, -1, length.out = 10),
mix_vec = 0:5/5,
min_n_vec = c(2, 6, 12),
tree_vec = c(100, 500, 1000),
select_best_metric = "mcc",
seed = 123,
shuffle_labels = FALSE,
test_data = NA,
return_tune_res = FALSE,
return_fit = FALSE,
return_pred = FALSE,
verbose = TRUE
)

ml_input_tibble: An ML-ready tibble generated by loadMLInputTibble().
This must have a target variable column named either
genome_drug.resistant_phenotype ("Resistant" or "Susceptible"
classification for one bug/drug combination) or resistant_classes
(multi-class classification for determining the drug classes to which
each genome is resistant), but not both.
model: Character. Logistic regression ("LR"), random forest ("RF"), or
boosted tree ("BT").
split: Numeric. Vector of length 2 indicating the proportions of the data
to be designated as training and validation, respectively. Note: if
test_data is provided, these numbers will be scaled so that they sum to 1
and will still represent fractions of ml_input_tibble (not including the
input test_data). Please do not directly provide numbers that sum to 1,
as the function is not equipped to handle this. If cross-validation is
enabled via split = c(1, 0), a 20% test holdout is still retained for
final reporting; cross-validation is run on the 80% training portion, not
on the test set.
n_fold: Numeric. Number of cross-validation folds.
prop_vi_top_feats: Numeric. A vector of length 2 whose elements together
indicate the proportion of total variable importance the top features
should comprise. For example, to get the features that contribute the top
10 to 20% of total variable importance, set
prop_vi_top_feats = c(0.1, 0.2). All features are returned by default.
n_top_feats: Numeric. Number of top features to extract per drug.
use_pca: Logical. Set to TRUE to use PCA instead of all features.
pca_threshold: Numeric. The proportion of total variance for which the
principal components account.
penalty_vec: Numeric. A vector of penalty (regularization strength)
values to try for logistic regression. It is recommended to choose values
between 10^-4 and 10^4.
mix_vec: Numeric. A vector of mixture values to try for logistic
regression. 0 corresponds to L2 regularization (ridge); 1 corresponds to
L1 (lasso); intermediate values correspond to the elastic net.
min_n_vec: Numeric. A vector of min_n values (the number of data points a
node must contain for it to be split) to try for random forest or boosted
tree. It is recommended to choose values in the range 1 to 100.
tree_vec: Numeric. A vector of values to try for the number of trees in
random forest or boosted tree. It is recommended to choose values in the
range 100 to 1000. (See the sketch after the argument descriptions for
how the tuning-grid arguments pair with each model type.)
select_best_metric: Character. Metric used to select the best model,
e.g., "mcc" (the default), "f_meas", "pr_auc", or "bal_accuracy".
seed: Numeric. Random seed, for reproducible analysis.
shuffle_labels: Logical. Set to TRUE to randomly shuffle AMR phenotype
labels for baseline comparisons.
test_data: A tibble to use as the testing data instead of a subset of
ml_input_tibble. This can be useful for evaluating on geographical or
temporal holdouts. The split argument still determines how
ml_input_tibble is divided into training and validation sets.
return_tune_res: Logical. Set to TRUE to return tuning results.
return_fit: Logical. Set to TRUE to return the model fit.
return_pred: Logical. Set to TRUE to return the predicted and actual AMR
phenotypes.
verbose: Logical. The function stays quiet if set to FALSE.
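
To make the pairing between the tuning-grid arguments and the model types
concrete, here is a minimal sketch; the grids and the amr_tbl object are
illustrative assumptions, not values taken from a real analysis.

# Hypothetical tuning grids, following the ranges recommended above.
lr_penalties <- 10^seq(-4, 4, length.out = 20)  # regularization strength for "LR"
lr_mixtures  <- seq(0, 1, by = 0.25)            # 0 = L2 (ridge), 1 = L1 (lasso)
tree_min_n   <- c(2, 10, 50, 100)               # minimum node size for "RF"/"BT"
tree_counts  <- c(100, 500, 1000)               # number of trees for "RF"/"BT"

# Logistic regression tunes penalty_vec and mix_vec:
# runMLPipeline(amr_tbl, model = "LR",
#               penalty_vec = lr_penalties, mix_vec = lr_mixtures)

# Random forest (or boosted tree) tunes min_n_vec and tree_vec:
# runMLPipeline(amr_tbl, model = "RF",
#               min_n_vec = tree_min_n, tree_vec = tree_counts)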

Returns a list with two elements: a performance_tibble and a
top_feat_tibble. Tuning results, the fit object, and model predictions
may also be returned if return_tune_res, return_fit, and/or return_pred,
respectively, are set to TRUE.
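
The call below is a minimal end-to-end sketch of running the pipeline and
inspecting its return value. It assumes amr_tbl is an ML-ready tibble
built by loadMLInputTibble() from your own data, that the element names
match the return-value description above, and that the hyperparameter
choices are purely illustrative.

# amr_tbl <- loadMLInputTibble(...)  # built from your own data; arguments not shown

res <- runMLPipeline(
  amr_tbl,
  model              = "RF",
  split              = c(0.6, 0.2),  # 60% training, 20% validation (the rest is held out for testing)
  n_fold             = 5,
  select_best_metric = "mcc",
  seed               = 123,
  return_pred        = TRUE          # also return predicted vs. actual AMR phenotypes
)

res$performance_tibble  # model performance metrics
res$top_feat_tibble     # top features by variable importance
names(res)              # inspect any extra elements added by the return_* flags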