To download, process, and featurize AMR genome isolates and metadata

amRdata is the first package in the amR suite for antimicrobial resistance (AMR) prediction. It takes a user‑provided species or taxon ID, downloads the corresponding genomes and AST data from BV‑BRC, constructs pangenomes, extracts features at multiple molecular scales, and prepares a unified Parquet‑backed DuckDB file for downstream ML modeling in amRml.

The workflow comprises six primary processes:

BV‑BRC metadata (isolate metadata + AMR phenotypic labels) →
BV-BRC genomes (sequence data) →
Panaroo pangenome (genes, struct) →
CD‑HIT protein clusters (proteins) →
Pfam domain extraction (domains) →
Database formatting

Overview

amRdata includes functions to:

Query and download bacterial genome data from BV-BRC
Acquire paired antimicrobial susceptibility testing (AST) results
Extract molecular features across scales:
- Gene clusters (Panaroo pangenome analysis)
- Protein clusters (CD-HIT sequence similarity)
- Protein domains (Pfam annotations)
- Structural variants (Panaroo pangenome rearrangements)
Store all data in highly efficient Parquet and DuckDB formats

See the package vignette for detailed usage.

Installation

# Install from GitHub
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")

remotes::install_github("JRaviLab/amRdata")

Quick start


library(amRdata)

# Step 1: Download and prepare genomes with paired AST data from BV-BRC
prepareGenomes(
  user_bacs = c("Shigella flexneri"),
  base_dir  = "data/Shigella_flexneri",
  method    = "ftp",   # or "cli"
  verbose   = TRUE
)

# Step 2: Run full feature extraction (Panaroo → CD-HIT → InterProScan → metadata cleaning)
runDataProcessing(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads     = 16,
  ref_file_path = "data_raw/"
)

# A final Parquet-backed DuckDB is created:
#   data/Shigella_flexneri/Sfl_parquet.duckdb

This database contains feature presence/absence and count data across scales as genome-by-feature matrices, along with all available sample metadata.
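To make the relationship between count and presence/absence matrices concrete, here is a minimal base R sketch on toy data. The isolate and feature IDs are made up for illustration and do not reflect the actual table layout produced by amRdata.

```r
# Toy genome-by-feature count matrix (values and IDs are hypothetical)
counts <- matrix(
  c(2L, 0L, 1L,
    0L, 0L, 3L,
    1L, 1L, 0L),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("genome_A", "genome_B", "genome_C"),  # hypothetical isolate IDs
    c("feat_1", "feat_2", "feat_3")         # hypothetical feature IDs
  )
)

# Presence/absence is the count matrix thresholded at zero
presence <- (counts > 0L) * 1L

# Number of isolates carrying each feature
feature_prevalence <- colSums(presence)
print(feature_prevalence)
```

Both matrices share the same rows (isolates) and columns (features); only the cell values differ.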

Package features

Data curation

BV‑BRC data access

amRdata uses the BV‑BRC CLI (via Docker) or the BV‑BRC FTP server to access:

Genome metadata
AMR phenotype data
Genome assemblies (.fna, .faa, .gff)

Functions involved:

.updateBVBRCdata()
.retrieveCustomQuery()
.retrieveQueryIDs()
retrieveGenomes()
.filterGenomes()
prepareGenomes()

After initial download, all BV-BRC metadata is cached automatically under: data/bvbrc/bvbrcData.duckdb

The package interfaces with BV-BRC (Bacterial and Viral Bioinformatics Resource Center) to access bacterial genome sequences and antimicrobial susceptibility testing data, using either FTP or the BV-BRC CLI wrapped in a Docker container for reproducible access:

Query isolate metadata with flexible filtering
Download genome files (.fna, .faa, .gff)
Retrieve AST results linking genotypes to phenotypes
Apply quality control filters (assembly quality, metadata completeness)
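As a rough illustration of what a quality control filter on isolate metadata can look like, the following base R sketch keeps isolates that pass an assembly-quality check and have complete metadata. The column names and the contig threshold are assumptions made for this example; they are not the actual criteria applied by .filterGenomes().

```r
# Hypothetical isolate metadata; column names are assumptions for illustration
meta <- data.frame(
  genome_id       = c("g1", "g2", "g3", "g4"),
  contigs         = c(45L, 812L, 60L, 30L),
  genome_quality  = c("Good", "Good", "Poor", "Good"),
  collection_year = c(2015L, 2018L, 2020L, NA),
  stringsAsFactors = FALSE
)

max_contigs <- 500L  # assumed threshold, not a package default

# Keep isolates with good assemblies and a recorded collection year
keep <- meta$genome_quality == "Good" &
  meta$contigs <= max_contigs &
  !is.na(meta$collection_year)

filtered <- meta[keep, ]
print(filtered$genome_id)
```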

Feature extraction

Features are extracted at four complementary molecular scales:

1. Gene clusters

Panaroo is executed inside a container using .runPanaroo().

Our pangenome creation approach:

Allows end-to-end single pangenome runs
Offers parallelized multi-batch pangenomes for large isolate sets (>5,000 genomes)
- Supports automated pangenome merging through .mergePanaroo()
Generates gene presence/absence and count matrices per isolate
Identifies structural variants (gene triplets indicating genome rearrangements)

Outputs are written into the per-taxon DuckDB for efficient storage and querying.
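The multi-batch strategy for large isolate sets can be pictured as splitting the genome list into fixed-size chunks before running one pangenome per chunk. The sketch below shows one way to do that in base R; `split_into_batches()` and the batch size of 5,000 are hypothetical and may differ from how .runPanaroo() partitions genomes internally.

```r
# Hypothetical helper: split a vector of genome IDs into batches of at most
# `batch_size` genomes each (not part of amRdata)
split_into_batches <- function(genome_ids, batch_size = 5000L) {
  batch_index <- ceiling(seq_along(genome_ids) / batch_size)
  unname(split(genome_ids, batch_index))
}

# 12,000 made-up genome IDs
ids <- sprintf("genome_%05d", 1:12000)
batches <- split_into_batches(ids, batch_size = 5000L)

# Batch sizes: two full batches and one partial batch
print(lengths(batches))
```

Each batch could then be processed in parallel and the resulting pangenomes merged afterwards.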

2. Protein clusters

CD-HIT is executed inside a container using .runCDHIT().

Our protein clustering approach:

Clusters proteins across all isolates from BV-BRC .faa files
Creates protein presence/absence and count matrices per isolate
Saves cluster names and annotations
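For intuition, per-isolate count and presence/absence matrices can be derived from a table that records which cluster each protein belongs to. This is a minimal sketch on toy data; the `membership` table and its column names are assumptions and do not mirror the CD-HIT output format or the internals of .runCDHIT().

```r
# Hypothetical cluster membership: one row per protein
membership <- data.frame(
  genome_id  = c("g1", "g1", "g1", "g2", "g2", "g3"),
  cluster_id = c("c1", "c1", "c2", "c1", "c3", "c2"),
  stringsAsFactors = FALSE
)

# Cross-tabulate genomes against clusters to get per-isolate counts
count_tab <- table(membership$genome_id, membership$cluster_id)
count_mat <- matrix(
  as.integer(count_tab),
  nrow = nrow(count_tab),
  dimnames = list(rownames(count_tab), colnames(count_tab))
)

# Threshold counts at zero to get presence/absence
presence_mat <- (count_mat > 0L) * 1L
print(count_mat)
```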

3. Pfam domains

InterProScan is executed inside a container using domainFromIPR().

Our Pfam domain extraction approach:

Automatically configures InterPro’s databases for use
Runs parallelized and containerized domain annotation
Maps domain presence/absence and counts to genomes and proteins
Provides another functional annotation layer
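The mapping from domain hits to genomes can be sketched as follows, using the first 11 columns of InterProScan's tab-separated output. The toy rows, the column names assigned here, and the `genome|protein` ID convention are assumptions for illustration; domainFromIPR() may link proteins to genomes differently.

```r
# Toy InterProScan-style TSV rows (first 11 columns); values are illustrative
ipr_lines <- c(
  paste("g1|p1", "md5a", 300, "Pfam", "PF00005", "ABC transporter",
        10, 150, "1e-30", "T", "01-01-2025", sep = "\t"),
  paste("g1|p1", "md5a", 300, "Gene3D", "G3DSA:3.40.50.300", "P-loop NTPase",
        5, 200, "2e-20", "T", "01-01-2025", sep = "\t"),
  paste("g1|p2", "md5b", 250, "Pfam", "PF00005", "ABC transporter",
        20, 160, "3e-25", "T", "01-01-2025", sep = "\t"),
  paste("g2|p3", "md5c", 410, "Pfam", "PF00144", "Beta-lactamase",
        30, 380, "1e-50", "T", "01-01-2025", sep = "\t")
)

# Column names chosen for this sketch
ipr_cols <- c("protein_id", "md5", "length", "analysis", "signature_acc",
              "signature_desc", "start", "stop", "score", "status", "date")

ipr <- read.delim(text = ipr_lines, header = FALSE, quote = "",
                  col.names = ipr_cols, colClasses = "character")

# Keep Pfam hits only
pfam <- ipr[ipr$analysis == "Pfam", ]

# Assumed ID convention: everything before "|" is the genome ID
pfam$genome_id <- sub("\\|.*$", "", pfam$protein_id)

# Domain counts per genome
domain_counts <- table(pfam$genome_id, pfam$signature_acc)
print(domain_counts)
```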

4. Data cleaning and storage

Final data formatting and storage are executed using cleanData().

Our final data storage script:

Harmonizes drug names, classes, and countries in BV-BRC metadata
Generates temporal bins to stratify analysis across time
Summarizes AMR information across the dataset
Writes all data into highly compressed data structures
- Parquet: Binary, columnar storage for large matrices
  - These can be made human-readable by calling arrow::read_parquet
- DuckDB: SQL-queryable database for rapid filtering of linked Parquets

Workflow example

The following example downloads and processes all data and metadata for Shigella flexneri genomes with paired AST metadata.

library(amRdata)

# 1. Download & filter genomes
prepareGenomes(
  user_bacs  = c("Shigella flexneri"),
  base_dir   = "data/Shigella_flexneri",
  method     = "ftp"
)

# 2. Run multi-scale feature extraction

runDataProcessing(
  duckdb_path    = "data/Shigella_flexneri/Sfl.duckdb",
  output_path    = "data/Shigella_flexneri",
  threads        = 8, # Or whatever your system supports
  ref_file_path  = "data_raw/"
)

# 3. Load final data

library(DBI)
library(arrow)

# To view all attached data tables in the database

con <- DBI::dbConnect(duckdb::duckdb(), "data/Shigella_flexneri/Sfl_parquet.duckdb")
DBI::dbListTables(con)

# Close the connection when finished
DBI::dbDisconnect(con, shutdown = TRUE)


# To load human-readable data tables into R

# e.g., Looking at gene cluster counts per isolate
Sfl_gene_counts <- arrow::read_parquet("data/Shigella_flexneri/gene_count.parquet")

# To connect gene cluster IDs to their annotated names
Sfl_gene_names <- arrow::read_parquet("data/Shigella_flexneri/gene_names.parquet")

Data requirements

External dependencies (managed through Docker)

BV‑BRC CLI
Panaroo
CD‑HIT
InterProScan
DuckDB
Arrow (Parquet)

The user does not need to install these manually.

The package requires:

An internet connection to access BV-BRC data and metadata
A local Docker installation
- Containers for internal tools are pulled automatically and do not require configuration
- Make sure Docker is running before you start processing data!
Sufficient storage for databases, downloaded files, and processed output (we recommend 20GB+)
A multicore processor and at least 16GB of RAM are highly recommended
- Species with many isolates may run poorly or fail to complete on older hardware
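Since Docker must be running before processing starts, a quick pre-flight check from R can save a failed run. The sketch below is one possible approach; `docker_is_running()` is a hypothetical helper, not an amRdata function, and it assumes that `docker info` exits with a non-zero status when the daemon is unreachable.

```r
# Hypothetical helper: TRUE if the docker binary is on the PATH and
# `docker info` exits successfully, FALSE otherwise
docker_is_running <- function() {
  docker_bin <- unname(Sys.which("docker"))
  if (!nzchar(docker_bin)) {
    return(FALSE)  # docker is not installed or not on the PATH
  }
  status <- suppressWarnings(
    system2(docker_bin, "info", stdout = FALSE, stderr = FALSE)
  )
  identical(as.integer(status), 0L)
}

if (!docker_is_running()) {
  message("Docker does not appear to be running; start it before processing data.")
}
```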

Output

Feature matrix dimensions depend on the species:

Rows: Number of isolates (typically <10,000)
Columns: Number of features (ballpark estimates)
- Genes: 5,000-50,000
- Proteins: 5,000-50,000
- Domains: 500-10,000
- Structural variants: 1,000-10,000
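A back-of-the-envelope calculation shows why matrices at the upper end of these ranges benefit from compressed, columnar storage. The helper below is hypothetical and simply multiplies out the size of a dense in-memory matrix, using R's 4 bytes per integer and 8 bytes per double.

```r
# Hypothetical helper: size in GB of a dense matrix held in memory
dense_matrix_gb <- function(n_isolates, n_features, bytes_per_cell = 4) {
  n_isolates * n_features * bytes_per_cell / 1e9
}

# 10,000 isolates x 50,000 features
gb_integer <- dense_matrix_gb(10000, 50000)     # stored as integers
gb_double  <- dense_matrix_gb(10000, 50000, 8)  # stored as doubles

print(c(integer = gb_integer, double = gb_double))
```

At 5 × 10^8 cells, a dense integer matrix takes about 2 GB and a double matrix about 4 GB, before any copies made during analysis.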

External dependencies

The package uses established bioinformatics tools:

Panaroo (≥1.3.0): Pangenome analysis
CD-HIT (≥4.8.1): Protein clustering
InterProScan (≥5.0): Domain annotation
Docker: For BV-BRC CLI container

These are automatically managed through the Docker container.

Performance

Processing times vary by species and isolate count:

Data download: 0-1 hours
Pangenome construction: 0-6 hours
Protein clustering: 0-3 hours
Domain annotation: 0-1 hours
Total: 1-12 hours for a complete species analysis
These estimates vary greatly with isolate count, genome complexity, and available hardware.
Parallelization significantly reduces processing time when multiple cores are available.
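When choosing a value for the `threads` argument, one option is to derive it from the number of cores R can detect. This is a minimal sketch; `choose_threads()` is a hypothetical helper, and leaving one core free is a convention assumed here rather than a package recommendation.

```r
library(parallel)

# Hypothetical helper: use all detected cores minus `reserve`, at least 1
choose_threads <- function(reserve = 1L, fallback = 1L) {
  n_cores <- parallel::detectCores()
  if (is.na(n_cores)) {
    return(fallback)  # detectCores() can return NA on some platforms
  }
  max(1L, n_cores - reserve)
}

threads <- choose_threads()
print(threads)
```

The resulting value could then be passed as `threads = threads` to runDataProcessing().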

Integration with amR suite

amRdata is designed to work seamlessly with other amR packages:

library(amRdata)
library(amRml)
library(amRshiny)

# 1. Curate data
prepareGenomes("Shigella flexneri")
runDataProcessing("amRdata/data/Shigella_flexneri/Sfl.duckdb")

# 2. Train models
runMLmodels("amRdata/data/Shigella_flexneri/Sfl_parquet.duckdb")

# 3. Visualize (to be added)
launch_dashboard()

Related packages

amR: Suite metapackage
amRml: ML for AMR prediction
amRshiny: Interactive dashboard

Citation

If you use amRdata in your research, please cite:

Brenner E, Ghosh A, Wolfe E, Boyer E, Vang C, Lesiyon R, Mayer D, Ravi J. (2026).
amR: an R package suite to predict antimicrobial resistance in bacterial pathogens.
R package version 0.99.0.
https://github.com/JRaviLab/amR

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Reporting issues

Report bugs and request features at: https://github.com/JRaviLab/amRdata/issues

License

BSD 3-Clause License. See LICENSE for details.

Contact

Corresponding author: Janani Ravi (janani.ravi@cuanschutz.edu)

Lab website: https://jravilab.github.io

Code of conduct

Please note that amRdata is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
