Overview

amRdata is part of a three-package suite for predicting antimicrobial resistance (AMR) using machine learning models trained on bacterial genomic features.

This vignette demonstrates how to download, process, and visualize metadata and genomic data.

Data Curation

To curate metadata with amRdata, use retrieveMetadata() to create a DuckDB database containing the metadata tables.

# Download all AMR data for a species from BV-BRC
retrieveMetadata(user_bacs = "Shigella flexneri",
                 filter_type = "AMR",
                 base_dir = "../data/",
                 abx = "All",
                 overwrite = FALSE,
                 image = "danylmb/bvbrc:5.3",
                 verbose = FALSE)

This writes the tables ‘amr_phenotype’, ‘genome_data’, and ‘metadata’ to a DuckDB database.
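To confirm what was written, the database can be opened directly with the DBI and duckdb packages. The sketch below is a minimal example and not part of amRdata: the helper name list_duckdb_tables() is hypothetical, and the path in the usage note is assumed from the duckdb_path used later in this vignette.

```r
library(DBI)

# Hypothetical helper (not an amRdata function): list the tables in a DuckDB
# file together with the number of rows in each.
list_duckdb_tables <- function(db_path) {
  stopifnot(file.exists(db_path))
  # Open read-only so that inspection cannot modify the curated database.
  con <- dbConnect(duckdb::duckdb(), dbdir = db_path, read_only = TRUE)
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)

  tables <- dbListTables(con)
  n_rows <- vapply(tables, function(tbl) {
    query <- paste0("SELECT COUNT(*) AS n FROM ", dbQuoteIdentifier(con, tbl))
    as.numeric(dbGetQuery(con, query)$n)
  }, numeric(1))

  data.frame(table = tables, n_rows = unname(n_rows), row.names = NULL)
}
```

Assuming the database was written to the location used later in this vignette, list_duckdb_tables("../data/Shigella_flexneri/Sfl.duckdb") should list the three tables named above.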

To download genomes with paired AMR phenotype data, use retrieveGenomes().

retrieveGenomes(base_dir = "../data/",
                user_bacs = "Shigella flexneri",
                method = c("cli"),
                image = "danylmb/bvbrc:5.3",
                skip_existing = TRUE,
                ftp_workers = 8L,
                cli_fasta_workers = 4L,
                cli_gff_workers = 4L,
                chunk_size = 50L,
                verbose = TRUE)

This returns a character vector of genome IDs and writes the complete file sets to disk.

To write a .txt file listing the downloaded genome file paths (.fna, .faa, .gff), use genomeList().

genomeList(base_dir = "../data/",
           user_bacs = "Shigella flexneri",
           verbose = TRUE)
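Before moving on, it can be useful to check that every genome has all three file types. The following is a minimal base R sketch, not an amRdata function: check_genome_files() is a hypothetical helper, and it assumes files are named "<genome_id>.<ext>", which may not match the layout amRdata actually writes.

```r
# Hypothetical helper (not an amRdata function): report which genome IDs have
# all expected files. Assumes files are named "<genome_id>.<ext>".
check_genome_files <- function(genome_dir, exts = c("fna", "faa", "gff")) {
  pattern <- paste0("\\.(", paste(exts, collapse = "|"), ")$")
  files <- list.files(genome_dir, pattern = pattern, recursive = TRUE)
  if (length(files) == 0) {
    return(data.frame(genome_id = character(0), complete = logical(0)))
  }

  # Split each file name into genome ID and extension.
  ids <- tools::file_path_sans_ext(basename(files))
  found <- tools::file_ext(files)

  # For each genome ID, check that every expected extension is present.
  complete <- vapply(split(found, ids),
                     function(x) all(exts %in% x),
                     logical(1))

  data.frame(genome_id = names(complete),
             complete = unname(complete),
             row.names = NULL)
}
```

Rows with complete = FALSE point to genomes whose download may need to be repeated.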

prepareGenomes() is a wrapper for downloading genomes and listing their file paths.

Internally it runs:
1. retrieveGenomes()
2. genomeList()

It lets users supply a species name or taxon ID and automates all data downloading and curation steps.

prepareGenomes(user_bacs = "Shigella flexneri",
               base_dir = "../data/",
               method = c("cli"),
               overwrite = FALSE,
               verbose = TRUE)

Data Processing

runDataProcessing() is a wrapper that builds the pangenome, clusters proteins, and annotates domains.

Internally it runs:
1. runPanaroo2Duckdb(): runs Panaroo (optionally panaroo-merge) to generate the pangenome and create the genomes-by-genes table.
2. CDHIT2duckdb(): runs CD-HIT to cluster proteins and create the genomes-by-proteins table.
3. domainFromIPR(): runs InterProScan to annotate proteins with domains and create the genomes-by-domains table.
4. cleanData(): cleans the metadata and exports the feature tables to Parquet and a Parquet-backed DuckDB database.
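To make the idea of a genomes-by-features table concrete, the sketch below builds a small presence/absence matrix in base R. The genome IDs and feature names are made up, and treating the table as binary presence/absence is an assumption for illustration; the tables amRdata writes may store counts or use a different layout.

```r
# Toy long-format table: one row per (genome, feature) observation.
# Genome IDs and feature names are invented for illustration.
hits <- data.frame(
  genome_id = c("g1", "g1", "g2", "g3", "g3", "g3"),
  feature   = c("geneA", "geneB", "geneA", "geneA", "geneB", "geneC")
)

genomes  <- sort(unique(hits$genome_id))
features <- sort(unique(hits$feature))

# Start from an all-zero matrix with genomes as rows and features as columns.
presence <- matrix(0L,
                   nrow = length(genomes),
                   ncol = length(features),
                   dimnames = list(genomes, features))

# Mark each observed (genome, feature) pair as present.
presence[cbind(hits$genome_id, hits$feature)] <- 1L

print(presence)
```

The same shape applies whether the columns are genes, protein clusters, or domains: each row is a genome and each column is a feature that a model can train on.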

runDataProcessing(duckdb_path = "../data/Shigella_flexneri/Sfl.duckdb",
                  output_path = NULL,
                  # unified threads for all tools
                  threads = 16,
                  # Panaroo
                  panaroo_split_jobs = FALSE,
                  panaroo_core_threshold = 0.90,
                  panaroo_len_dif_percent = 0.95,
                  panaroo_cluster_threshold = 0.95,
                  panaroo_family_seq_identity = 0.5,
                  # CD-HIT
                  cdhit_identity = 0.9,
                  cdhit_word_length = 5,
                  cdhit_memory = 0,
                  cdhit_extra_args = c("-g", "1"),
                  cdhit_output_prefix = "cdhit_out",
                  # InterPro
                  ipr_appl = c("Pfam"),
                  ipr_threads_unused = NULL,
                  ipr_version = "5.76-107.0",
                  ipr_dest_dir = "inst/extdata/interpro",
                  ipr_platform = "linux/amd64",
                  auto_prepare_data = TRUE,
                  # Metadata cleaning
                  ref_file_path = "../data_raw/",
                  verbose = TRUE)

Plots

generateSummary() and generatePlots() produce simple statistics and plots for exploring the metadata.

generateSummary("data/metadata_parquet",
                out_path = "data/")
generatePlots("data/metadata_parquet",
              out_path = "data/")
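As an illustration of the kind of summary that is useful at this stage, the sketch below tabulates phenotypes per antibiotic on a toy data frame and draws a grouped bar chart with base R. It does not reproduce what generateSummary() or generatePlots() compute; the column names (antibiotic, phenotype) and all values are hypothetical.

```r
# Toy AMR phenotype records; column names and values are hypothetical.
amr <- data.frame(
  antibiotic = c("ampicillin", "ampicillin", "ampicillin",
                 "ciprofloxacin", "ciprofloxacin"),
  phenotype  = c("Resistant", "Resistant", "Susceptible",
                 "Susceptible", "Susceptible")
)

# Count isolates per antibiotic and phenotype.
counts <- table(amr$antibiotic, amr$phenotype)

# Fraction of tested isolates that are resistant, per antibiotic.
resistant_fraction <- counts[, "Resistant"] / rowSums(counts)
print(resistant_fraction)

# Grouped bar chart: one group per antibiotic, one bar per phenotype.
barplot(t(counts), beside = TRUE, legend.text = TRUE,
        ylab = "Number of isolates")
```

A table like this quickly shows which antibiotics have too few resistant or susceptible isolates to support a balanced training set.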