amRdata is the first package in the amR suite for antimicrobial resistance (AMR) prediction. It takes a user‑provided species or taxon ID, downloads the corresponding genomes and AST data from BV‑BRC, constructs pangenomes, extracts features at multiple molecular scales, and prepares a unified Parquet‑backed DuckDB file for downstream ML modeling in amRml.
The workflow comprises six primary processes:
amRdata includes functions to:
See the package vignette for detailed usage.
# Install from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("JRaviLab/amRdata")
library(amRdata)
# Step 1: Download and prepare genomes with paired AST data from BV-BRC
prepareGenomes(
  user_bacs = c("Shigella flexneri"),
  base_dir = "data/Shigella_flexneri",
  method = "ftp", # or "cli"
  verbose = TRUE
)
# Step 2: Run full feature extraction (Panaroo → CD-HIT → InterProScan → metadata cleaning)
runDataProcessing(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads = 16,
  ref_file_path = "data_raw/"
)
# A final Parquet-backed DuckDB is created:
# data/Shigella_flexneri/Sfl_parquet.duckdb
This contains feature presence/absence and count data across scales in genome-by-feature matrices, as well as all available sample metadata.
Functions involved:
.updateBVBRCdata()
.retrieveCustomQuery()
.retrieveQueryIDs()
retrieveGenomes()
.filterGenomes()
prepareGenomes()
After initial download, all BV-BRC metadata is cached automatically under: data/bvbrc/bvbrcData.duckdb
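To check what the metadata cache holds, the DuckDB file can be opened read-only and its tables listed. The following is a minimal sketch, not part of amRdata: `listDuckdbTables()` is a hypothetical helper, and it assumes the DBI and duckdb R packages are installed. The table names inside the cache are not documented here, so the sketch only lists them.

```r
# Hypothetical helper (not an amRdata function): list the tables in a DuckDB
# file, returning character(0) if the file has not been created yet.
listDuckdbTables <- function(db_path) {
  if (!file.exists(db_path)) {
    message("No database found at: ", db_path)
    return(character(0))
  }
  # Open read-only so the cache cannot be modified by accident
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = db_path, read_only = TRUE)
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  DBI::dbListTables(con)
}

# Cache path as stated above; run this after the initial download
cached_tables <- listDuckdbTables("data/bvbrc/bvbrcData.duckdb")
print(cached_tables)
```

Opening the file read-only and disconnecting on exit avoids leaving a lock on the cache while other amRdata steps are running.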
The package interfaces with BV-BRC (Bacterial and Viral Bioinformatics Resource Center) to access bacterial genome sequences and antimicrobial susceptibility testing (AST) data, using either FTP or the BV-BRC CLI wrapped in a Docker container for reproducible access:
Genome files (.fna, .faa, .gff)
Features are extracted at four complementary molecular scales:
Panaroo is executed inside a container using .runPanaroo().
Our pangenome creation approach:
.mergePanaroo()
Outputs are written into the per-taxon DuckDB for efficient storage and querying.
CD-HIT is executed inside a container using .runCDHIT().
Our protein clustering approach:
InterProScan is executed inside a container using domainFromIPR().
Our Pfam domain extraction approach:
Final data formatting and storage is executed using cleanData().
Our final data storage script:
arrow::read_parquet
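The final tables are described above as genome-by-feature matrices holding both counts and presence/absence. The sketch below shows how the two relate, using a small made-up count matrix in base R. The genome and cluster names are illustrative only and do not reflect the actual column layout of the amRdata Parquet files; in practice the counts would come from a table loaded with arrow::read_parquet.

```r
# Toy genome-by-feature count matrix (values are made up for illustration)
counts <- matrix(
  c(2, 0, 1,
    0, 0, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("genome_A", "genome_B"),
    c("cluster_1", "cluster_2", "cluster_3")
  )
)

# Presence/absence: 1 where a feature occurs at least once, otherwise 0
presence <- (counts > 0) * 1L

# Number of genomes carrying each feature
genomes_per_feature <- colSums(presence)

print(presence)
print(genomes_per_feature)
```

Keeping both representations is useful because presence/absence discards copy number, while the count matrix retains it.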
An example of the process for downloading and processing all data and metadata for Shigella flexneri genomes with paired AST metadata.
library(amRdata)
# 1. Download & filter genomes
prepareGenomes(
  user_bacs = c("Shigella flexneri"),
  base_dir = "data/Shigella_flexneri",
  method = "ftp"
)
# 2. Run multi-scale feature extraction
runDataProcessing(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads = 8, # Or whatever your system supports
  ref_file_path = "data_raw/"
)
# 3. Load final data
library(DBI)
library(arrow)
# To view all attached data tables in the database
con <- DBI::dbConnect(duckdb::duckdb(), "data/Shigella_flexneri/Sfl_parquet.duckdb")
DBI::dbListTables(con)
# To load human-readable data tables into R
# e.g., Looking at gene cluster counts per isolate
Sfl_gene_counts <- arrow::read_parquet("data/Shigella_flexneri/gene_count.parquet")
# To connect gene cluster IDs to their annotated names
Sfl_gene_names <- arrow::read_parquet("data/Shigella_flexneri/gene_names.parquet")
External dependencies (managed through Docker)
The user does not need to install these manually.
The package requires:
Processing times vary by species and isolate count:
Data download: 0-1 hours
Pangenome construction: 0-6 hours
Protein clustering: 0-3 hours
Domain annotation: 0-1 hours
Total: 1-12 hours for a complete species analysis
These numbers will all vary greatly based on isolate number, genome complexity, and available hardware.
Parallelization significantly reduces processing time when multiple cores are available.
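One way to pick a value for the `threads` argument is to query the number of available cores and leave one free for the rest of the system. This is a sketch under that assumption, not a recommendation from the package authors; `chooseThreads()` is a hypothetical helper, and the right setting also depends on memory and on what else the machine is running.

```r
# Hypothetical helper (not an amRdata function): pick a thread count that
# leaves `reserve` cores free, never returning fewer than 1.
chooseThreads <- function(reserve = 1L, fallback = 1L) {
  n <- parallel::detectCores()
  # detectCores() can return NA when the core count cannot be determined
  if (is.na(n)) {
    return(fallback)
  }
  max(1L, n - reserve)
}

threads <- chooseThreads()
print(threads)

# The value can then be passed on, for example:
# runDataProcessing(
#   duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
#   output_path = "data/Shigella_flexneri",
#   threads = threads,
#   ref_file_path = "data_raw/"
# )
```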
amRdata is designed to work seamlessly with other amR packages:
library(amRdata)
library(amRml)
library(amRshiny)
# 1. Curate data
prepareGenomes("Shigella flexneri")
runDataProcessing("amRdata/data/Shigella_flexneri/Sfl.duckdb")
# 2. Train models
runMLmodels("amRdata/data/Shigella_flexneri/Sfl_parquet.duckdb")
# 3. Visualize (to be added)
launch_dashboard()
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Report bugs and request features at: https://github.com/JRaviLab/amRml/issues
BSD 3-Clause License. See LICENSE for details.
Corresponding author: Janani Ravi (janani.ravi@cuanschutz.edu)
Lab website: https://jravilab.github.io
Please note that amRdata is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.