R/data_processing.R
runDataProcessing

runDataProcessing() orchestrates the complete feature-extraction pipeline for a
BV-BRC selection, starting from a per-selection DuckDB (created by
prepareGenomes() and populated by downstream steps). It:
Runs Panaroo to build the pangenome and writes gene/struct outputs into DuckDB.
Runs CD-HIT to cluster proteins and writes protein outputs into DuckDB.
Runs InterProScan (Pfam) to annotate protein domains and writes domain outputs into DuckDB.
Cleans BV-BRC metadata (drug names/classes, countries, years) and exports all feature/metadata tables as compressed Parquet files, then creates a Parquet-backed DuckDB with read-only views of those Parquets for downstream ML.
The function is a thin controller that delegates each stage to the corresponding internal helpers (Dockerized tools where applicable) and ensures consistent output locations and table schemas across stages.
Usage

runDataProcessing(
duckdb_path,
output_path = NULL,
threads = 16,
panaroo_split_jobs = FALSE,
panaroo_core_threshold = 0.9,
panaroo_len_dif_percent = 0.95,
panaroo_cluster_threshold = 0.95,
panaroo_family_seq_identity = 0.5,
cdhit_identity = 0.9,
cdhit_word_length = 5,
cdhit_memory = 0,
cdhit_extra_args = c("-g", "1"),
cdhit_output_prefix = "cdhit_out",
ipr_appl = c("Pfam"),
ipr_threads_unused = NULL,
ipr_version = "5.76-107.0",
ipr_dest_dir = "inst/extdata/interpro",
ipr_platform = "linux/amd64",
auto_prepare_data = TRUE,
ref_file_path = "data_raw/",
verbose = TRUE
)

Arguments

duckdb_path: Character. Path to the per-selection DuckDB produced by
prepareGenomes() (e.g., "data/<Bug>/<Abbrev>.duckdb"). This DB must
already contain at least the tables written by prepareGenomes() and subsequent
download steps (e.g., files, filtered, and metadata tables).

output_path: Character or NULL. Base directory for writing Panaroo/CD-HIT/InterProScan
outputs and final Parquet files. If NULL, defaults to dirname(duckdb_path).

threads: Integer. Shared concurrency budget used across tools (Panaroo, CD-HIT,
InterProScan). Passed through to each stage as appropriate. Defaults to 16.

panaroo_split_jobs: Logical. If TRUE, Panaroo runs in multiple batches that can be
merged by .mergePanaroo(). If FALSE, Panaroo runs once on all isolates. Default: FALSE.

panaroo_core_threshold: Numeric. Panaroo --core_threshold. Default: 0.90.

panaroo_len_dif_percent: Numeric. Panaroo --len_dif_percent. Default: 0.95.

panaroo_cluster_threshold: Numeric. Panaroo --threshold. Default: 0.95.

panaroo_family_seq_identity: Numeric. Panaroo -f (gene family identity). Default: 0.5.

cdhit_identity: Numeric. CD-HIT -c identity threshold. Default: 0.9.

cdhit_word_length: Integer. CD-HIT -n word length. Default: 5.

cdhit_memory: Integer. CD-HIT -M memory limit (MB). Use 0 for unlimited. Default: 0.

cdhit_extra_args: Character vector. Extra arguments forwarded to cd-hit
(e.g., c("-g", "1")). Default: c("-g", "1").

cdhit_output_prefix: Character. Prefix for CD-HIT output files. Default: "cdhit_out".

ipr_appl: Character vector. InterProScan applications to run; typically c("Pfam").
Default: c("Pfam").

ipr_threads_unused: Deprecated/unused. Kept for backward compatibility; ignored.

ipr_version: Character. InterProScan image tag (e.g., "5.76-107.0"). Default: "5.76-107.0".

ipr_dest_dir: Character. Local destination for the InterProScan data bundle
(used by .checkInterProData()). Default: "inst/extdata/interpro".

ipr_platform: Character. Docker platform string for InterProScan containers,
e.g., "linux/amd64". Default: "linux/amd64".

auto_prepare_data: Logical. If TRUE, ensures the InterProScan data are present
(downloaded/verified if missing). Default: TRUE.

ref_file_path: Character. Directory containing the reference TSVs used by cleanData()
for metadata harmonization (e.g., "data_raw/"). Required; defaults to "data_raw/".

verbose: Logical. Print progress messages. Default: TRUE.
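For instance, a call that tightens protein clustering and runs Panaroo in split
batches could look like the sketch below (paths and parameter choices are
illustrative, not recommendations):

# Illustrative call with non-default settings; paths are placeholders.
runDataProcessing(
  duckdb_path        = "data/Escherichia_coli/Eco.duckdb",
  output_path        = "data/Escherichia_coli",
  threads            = 8,
  panaroo_split_jobs = TRUE,   # run Panaroo in batches, merged by .mergePanaroo()
  cdhit_identity     = 0.95,   # stricter CD-HIT -c threshold
  ref_file_path      = "data_raw/"
)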
Value

Invisibly returns a list with:
duckdb_path – input DuckDB path
panaroo_output – path to the selected Panaroo output directory used for import
parquet_duckdb_path – absolute path to the created Parquet-backed DuckDB
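Because the list is returned invisibly, assign the result if you want to reuse
the reported paths later (illustrative):

res <- runDataProcessing(duckdb_path = "data/Shigella_flexneri/Sfl.duckdb")
res$parquet_duckdb_path   # absolute path to the Parquet-backed DuckDB
res$panaroo_output        # Panaroo output directory used for import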
Docker & Platform Notes
All heavy tools (Panaroo, CD-HIT, InterProScan) run inside Docker containers.
On Apple Silicon/ARM hosts, images are forced to --platform linux/amd64 to ensure compatibility.
Ensure Docker Desktop is running and has sufficient memory/CPUs configured.
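A simple pre-flight check (a sketch, not part of the package API) can confirm
that the Docker daemon is reachable before launching the pipeline:

# Exit status 0 means the Docker daemon answered; anything else means it is not reachable.
docker_ok <- system2("docker", "info", stdout = FALSE, stderr = FALSE) == 0
if (!docker_ok) {
  stop("Docker does not appear to be running; start Docker Desktop and retry.")
}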
Input Requirements
The duckdb_path must reference a per-selection DuckDB that contains:
files (paths to .gff, .fna, .PATRIC.faa),
filtered (genomes selected for download/filtering), and
BV-BRC metadata tables written by earlier steps.
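The presence of these tables can be verified up front; the snippet below is an
illustrative check using the DBI and duckdb packages (only the table names
listed above are assumed):

library(DBI)

con <- dbConnect(duckdb::duckdb(),
                 dbdir = "data/Shigella_flexneri/Sfl.duckdb",
                 read_only = TRUE)
required <- c("files", "filtered")            # plus the BV-BRC metadata tables
missing  <- setdiff(required, dbListTables(con))
if (length(missing) > 0) {
  warning("Missing required tables: ", paste(missing, collapse = ", "))
}
dbDisconnect(con, shutdown = TRUE)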
Outputs & Side Effects
Writes tool-specific intermediate outputs under output_path (e.g., panaroo_out_*, CD-HIT files).
Writes Parquet files to output_path:
gene_count.parquet, protein_count.parquet, domain_count.parquet, struct.parquet,
gene_names.parquet, protein_names.parquet, domain_names.parquet,
gene_seqs.parquet, protein_seqs.parquet, genome_gene_protein.parquet,
metadata.parquet, amr_phenotype.parquet, genome_data.parquet, original_metadata.parquet.
Creates a new Parquet-backed DuckDB (*_parquet.duckdb) with read-only views pointing to those Parquets.
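Once created, the Parquet-backed DuckDB can be queried like any other DuckDB
database; an illustrative read-only inspection (path is a placeholder):

library(DBI)

con <- dbConnect(duckdb::duckdb(),
                 dbdir = "data/Shigella_flexneri/Sfl_parquet.duckdb",
                 read_only = TRUE)
dbListTables(con)                             # gene_count, protein_count, domain_count, ...
dbGetQuery(con, "SELECT * FROM gene_count LIMIT 5")
dbDisconnect(con, shutdown = TRUE)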
Threading
threads is a shared budget; each stage uses a portion or all of it.
InterProScan can be memory-intensive; on laptops, single-container mode is used internally.
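One reasonable way to size the budget (an assumption, not a package requirement)
is to leave a couple of cores free for the host:

n_threads <- max(1, parallel::detectCores() - 2)
runDataProcessing(duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
                  threads     = n_threads)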
Pipeline Stages

Panaroo via runPanaroo2Duckdb() → writes:
gene_count (genome × gene counts)
gene_names
gene_struct (structural variants)
gene_ref_seq, genome_gene_protein
CD-HIT via CDHIT2duckdb() (calls internal .runCDHIT()) → writes:
protein_count (genome × protein-cluster counts)
protein_names
protein_cluster_seq (representative sequences)
InterProScan (Pfam) via domainFromIPR() → writes:
domain_names
domain_count (genome × domain-family matrix)
Metadata cleaning + Parquet export via cleanData() → writes Parquet
files to output_path, and builds a Parquet-backed DuckDB
(*_parquet.duckdb) with views:
gene_count, protein_count, domain_count, struct
metadata (cleaned), plus amr_phenotype, genome_data, original_metadata
gene_names, protein_names, domain_names
gene_seqs, protein_seqs
genome_gene_protein
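Conceptually, the controller invokes the stages in the order listed above; the
sketch below shows that order only, and the argument names are assumptions
(see each helper's own documentation for its real signature):

# Stage order only; argument names are illustrative, not the real signatures.
runPanaroo2Duckdb(duckdb_path = duckdb_path, output_path = output_path, threads = threads)
CDHIT2duckdb(duckdb_path = duckdb_path, output_path = output_path, identity = cdhit_identity)
domainFromIPR(duckdb_path = duckdb_path, output_path = output_path, appl = ipr_appl)
cleanData(duckdb_path = duckdb_path, output_path = output_path, ref_file_path = ref_file_path)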
See also

prepareGenomes(), runPanaroo2Duckdb(), CDHIT2duckdb(), domainFromIPR(), cleanData()
Examples

if (FALSE) { # \dontrun{
# Paths below are illustrative; adapt to your project layout.
runDataProcessing(
duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
output_path = "data/Shigella_flexneri",
threads = 16,
ref_file_path = "data_raw/"
)
# After completion:
# data/Shigella_flexneri/Sfl_parquet.duckdb
# will contain views over the Parquet files for downstream ML.
} # }
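As a follow-up sketch (not part of the package), the exported Parquet files can
be assembled into a single feature table for ML with the arrow package; the
shared genome identifier column name used below is an assumption:

library(arrow)

out_dir <- "data/Shigella_flexneri"
gene    <- read_parquet(file.path(out_dir, "gene_count.parquet"))
protein <- read_parquet(file.path(out_dir, "protein_count.parquet"))
domain  <- read_parquet(file.path(out_dir, "domain_count.parquet"))

# Assumes each count table carries a genome identifier column named "genome_id" (illustrative).
features <- Reduce(function(x, y) merge(x, y, by = "genome_id"),
                   list(gene, protein, domain))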