amRdata is part of a 3-package suite for predicting
antimicrobial resistance (AMR) using machine learning models trained on
bacterial genomic features.
This vignette demonstrates how to download, process, and visualize metadata and genomic data.
For metadata curation with amRdata, use
retrieveMetadata() to create a DuckDB database containing the metadata
tables.
# Download all AMR data for a species from BV-BRC
retrieveMetadata(user_bacs = "Shigella flexneri",
                 filter_type = "AMR",
                 base_dir = "../data/",
                 abx = "All",
                 overwrite = FALSE,
                 image = "danylmb/bvbrc:5.3",
                 verbose = FALSE)
This writes the tables ‘amr_phenotype’, ‘genome_data’, and ‘metadata’ to a DuckDB database.
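The tables written by retrieveMetadata() can be inspected directly from R. The following is a minimal sketch, assuming the DBI and duckdb packages are installed and that the database sits at the path used later in this vignette for runDataProcessing(); the actual location under base_dir may differ on your system. The helper count_rows() is hypothetical and not part of amRdata.

```r
library(DBI)

# Assumed path: taken from the runDataProcessing() example in this vignette;
# adjust it to wherever retrieveMetadata() wrote the database on your system.
db_path <- "../data/Shigella_flexneri/Sfl.duckdb"

# Hypothetical helper (not part of amRdata): row counts for the expected tables.
count_rows <- function(con,
                       tables = c("amr_phenotype", "genome_data", "metadata")) {
  # Only query tables that actually exist in this database
  present <- intersect(tables, dbListTables(con))
  vapply(present, function(tbl) {
    query <- paste0("SELECT COUNT(*) AS n FROM ", dbQuoteIdentifier(con, tbl))
    as.numeric(dbGetQuery(con, query)$n)
  }, numeric(1))
}

if (file.exists(db_path)) {
  # Open read-only so the inspection cannot modify the curated tables
  con <- dbConnect(duckdb::duckdb(), dbdir = db_path, read_only = TRUE)
  print(count_rows(con))
  print(dbGetQuery(con, "SELECT * FROM amr_phenotype LIMIT 5"))
  dbDisconnect(con, shutdown = TRUE)
}
```

Opening the connection read-only is a deliberate choice here: it allows a quick sanity check of the download without any risk of altering the tables that later steps depend on.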
To download genomes with paired AMR phenotype data, use
retrieveGenomes():
retrieveGenomes(base_dir = "../data/",
                user_bacs = "Shigella flexneri",
                method = c("cli"),
                image = "danylmb/bvbrc:5.3",
                skip_existing = TRUE,
                ftp_workers = 8L,
                cli_fasta_workers = 4L,
                cli_gff_workers = 4L,
                chunk_size = 50L,
                verbose = TRUE)
This returns a character vector of genome IDs and writes the complete file sets to disk.
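Because retrieveGenomes() returns the genome IDs as a character vector, that vector can be used to confirm that each genome has all three file types on disk. The sketch below uses base R only. It assumes files are named "<genome_id>.fna", "<genome_id>.faa", and "<genome_id>.gff" somewhere under the genome directory; that naming scheme is an assumption, and check_genome_files() is a hypothetical helper, not an amRdata function. The demo IDs and files are made up.

```r
# Hypothetical helper (not part of amRdata): report which of the expected
# files exist for each genome ID, assuming files are named "<genome_id><ext>".
check_genome_files <- function(genome_ids, genome_dir,
                               exts = c(".fna", ".faa", ".gff")) {
  found <- basename(list.files(genome_dir, recursive = TRUE))
  # Logical matrix: one row per genome, one column per extension
  present <- outer(genome_ids, exts,
                   function(id, ext) paste0(id, ext) %in% found)
  dimnames(present) <- list(NULL, exts)
  data.frame(genome_id = genome_ids, present,
             complete = rowSums(present) == length(exts),
             check.names = FALSE)
}

# Demo with made-up IDs and empty placeholder files in a temporary directory
demo_dir <- file.path(tempdir(), "genomes_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("g1.fna", "g1.faa", "g1.gff", "g2.fna")))

status <- check_genome_files(c("g1", "g2"), demo_dir)
print(status)
```

In practice, the first argument would be the vector returned by retrieveGenomes(), and genome_dir would point at the species folder under base_dir.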
To write a .txt file listing the downloaded genome file paths (.fna, .faa, .gff), use genomeList():
genomeList(base_dir = "../data/",
           user_bacs = "Shigella flexneri",
           verbose = TRUE)
prepareGenomes() lets users input a species or taxon ID and automates all data downloading and curation steps. Internally, it runs:
1. retrieveGenomes()
2. genomeList()
prepareGenomes(user_bacs = "Shigella flexneri",
               base_dir = "../data/",
               method = c("cli"),
               overwrite = FALSE,
               verbose = TRUE)
runDataProcessing() internally runs:
1. runPanaroo2Duckdb() -> runs Panaroo (with optional panaroo-merge) to generate the pangenome and create a genomes-by-genes table
2. CDHIT2duckdb() -> runs CD-HIT to cluster proteins and create a genomes-by-proteins table
3. domainFromIPR() -> runs InterProScan to annotate proteins with domains and create a genomes-by-domains table
4. cleanData() -> cleans the metadata and exports the feature tables to Parquet and a Parquet-backed DuckDB
runDataProcessing(duckdb_path = "../data/Shigella_flexneri/Sfl.duckdb",
                  output_path = NULL,
                  # unified threads for all tools
                  threads = 16,
                  # Panaroo
                  panaroo_split_jobs = FALSE,
                  panaroo_core_threshold = 0.90,
                  panaroo_len_dif_percent = 0.95,
                  panaroo_cluster_threshold = 0.95,
                  panaroo_family_seq_identity = 0.5,
                  # CD-HIT
                  cdhit_identity = 0.9,
                  cdhit_word_length = 5,
                  cdhit_memory = 0,
                  cdhit_extra_args = c("-g", "1"),
                  cdhit_output_prefix = "cdhit_out",
                  # InterPro
                  ipr_appl = c("Pfam"),
                  ipr_threads_unused = NULL,
                  ipr_version = "5.76-107.0",
                  ipr_dest_dir = "inst/extdata/interpro",
                  ipr_platform = "linux/amd64",
                  auto_prepare_data = TRUE,
                  # Metadata cleaning
                  ref_file_path = "../data_raw/",
                  verbose = TRUE)
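To make the idea of a "genomes by genes" feature table concrete, here is a small base R sketch that turns long-format genome/gene pairs into a presence/absence matrix. The data are made up for illustration, and the layout is an assumption: the tables amRdata actually writes may use different column names, or store counts instead of binary values.

```r
# Toy long-format data (made up): one row per genome/gene-cluster hit.
hits <- data.frame(
  genome_id = c("g1", "g1", "g2", "g3", "g3", "g3"),
  gene      = c("geneA", "geneB", "geneA", "geneA", "geneB", "geneC")
)

# Cross-tabulate genomes against genes, then cap counts at 1 so each cell
# records presence (1) or absence (0) of the gene in that genome.
counts   <- table(hits$genome_id, hits$gene)
presence <- (unclass(counts) > 0) * 1L

print(presence)
```

The same reshaping applies conceptually to the genomes-by-proteins and genomes-by-domains tables, with CD-HIT clusters or InterPro domains taking the place of the gene columns.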