R/data_processing.R
runDataProcessing

runDataProcessing() orchestrates the complete feature-extraction pipeline for a
BV-BRC selection, starting from a per-selection DuckDB (created by
prepareGenomes() and populated by downstream steps). It:
Runs Panaroo to build the pangenome and writes gene/struct outputs into DuckDB.
Runs CD-HIT to cluster proteins and writes protein outputs into DuckDB.
Runs InterProScan (Pfam) to annotate protein domains and writes domain outputs into DuckDB.
Cleans BV-BRC metadata (drug names/classes, countries, years) and exports all feature/metadata tables as compressed Parquet files, then creates a Parquet-backed DuckDB with read-only views of those Parquets for downstream ML.
The function is a thin controller that delegates each stage to the corresponding internal helpers (Dockerized tools where applicable) and ensures consistent output locations and table schemas across stages.
Usage

runDataProcessing(
duckdb_path,
output_path = NULL,
threads = 16,
panaroo_split_jobs = FALSE,
panaroo_core_threshold = 0.9,
panaroo_len_dif_percent = 0.95,
panaroo_cluster_threshold = 0.95,
panaroo_family_seq_identity = 0.5,
cdhit_identity = 0.9,
cdhit_word_length = 5,
cdhit_memory = 0,
cdhit_extra_args = c("-g", "1"),
cdhit_output_prefix = "cdhit_out",
ipr_appl = c("Pfam"),
ipr_threads_unused = NULL,
ipr_version = "5.76-107.0",
ipr_dest_dir = "inst/extdata/interpro",
ipr_platform = "linux/amd64",
auto_prepare_data = TRUE,
ref_file_path = "data_raw/",
verbose = TRUE
)

Arguments

duckdb_path: Character. Path to the per-selection DuckDB produced by
prepareGenomes() (e.g., "data/<Bug>/<Abbrev>.duckdb"). This DB must
already contain at least the tables written by prepareGenomes() and subsequent
download steps (e.g., files, filtered, and metadata tables).

output_path: Character or NULL. Base directory for writing Panaroo/CD-HIT/InterProScan
outputs and final Parquet files. If NULL, defaults to dirname(duckdb_path).

threads: Integer. Shared concurrency budget used across tools (Panaroo, CD-HIT,
InterProScan). Passed through to each stage as appropriate. Defaults to 16.

panaroo_split_jobs: Logical. If TRUE, Panaroo runs in multiple batches that can be
merged by .mergePanaroo(). If FALSE, Panaroo runs once on all isolates. Default: FALSE.

panaroo_core_threshold: Numeric. Panaroo --core_threshold. Default: 0.90.

panaroo_len_dif_percent: Numeric. Panaroo --len_dif_percent. Default: 0.95.

panaroo_cluster_threshold: Numeric. Panaroo --threshold. Default: 0.95.

panaroo_family_seq_identity: Numeric. Panaroo -f (gene family identity). Default: 0.5.

cdhit_identity: Numeric. CD-HIT -c identity threshold. Default: 0.9.

cdhit_word_length: Integer. CD-HIT -n word length. Default: 5.

cdhit_memory: Integer. CD-HIT -M memory limit (MB). Use 0 for unlimited. Default: 0.

cdhit_extra_args: Character vector. Extra arguments forwarded to cd-hit
(e.g., c("-g", "1")). Default: c("-g", "1").

cdhit_output_prefix: Character. Prefix for CD-HIT output files. Default: "cdhit_out".

ipr_appl: Character vector. InterProScan applications to run; typically c("Pfam").
Default: c("Pfam").

ipr_threads_unused: Deprecated/unused. Kept for backward compatibility; ignored.

ipr_version: Character. InterProScan image tag (e.g., "5.76-107.0"). Default: "5.76-107.0".

ipr_dest_dir: Character. Local destination for the InterProScan data bundle
(used by .checkInterProData()). Default: "inst/extdata/interpro".

ipr_platform: Character. Docker platform string for InterProScan containers,
e.g., "linux/amd64". Default: "linux/amd64".

auto_prepare_data: Logical. If TRUE, ensures the InterProScan data are present
(downloaded/verified if missing). Default: TRUE.

ref_file_path: Character. Directory containing the reference TSVs used by cleanData()
for metadata harmonization (e.g., "data_raw/"). Required; defaults to "data_raw/".

verbose: Logical. Print progress messages. Default: TRUE.
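For instance, a call that tightens protein clustering and runs Panaroo in split
batches could look like the sketch below (paths and parameter choices are
illustrative, not recommendations):

# Illustrative call with non-default settings; paths are placeholders.
runDataProcessing(
  duckdb_path        = "data/Escherichia_coli/Eco.duckdb",
  output_path        = "data/Escherichia_coli",
  threads            = 8,
  panaroo_split_jobs = TRUE,   # run Panaroo in batches, merged by .mergePanaroo()
  cdhit_identity     = 0.95,   # stricter CD-HIT -c threshold
  ref_file_path      = "data_raw/"
)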
Value

Invisibly returns a list with:
duckdb_path – input DuckDB path
panaroo_output – path to the selected Panaroo output directory used for import
parquet_duckdb_path – absolute path to the created Parquet-backed DuckDB
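Because the list is returned invisibly, assign the result if you want to reuse
the reported paths later (illustrative):

res <- runDataProcessing(duckdb_path = "data/Shigella_flexneri/Sfl.duckdb")
res$parquet_duckdb_path   # absolute path to the Parquet-backed DuckDB
res$panaroo_output        # Panaroo output directory used for import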
Docker & Platform Notes
All heavy tools (Panaroo, CD-HIT, InterProScan) run inside Docker containers.
On Apple Silicon/ARM hosts, images are forced to --platform linux/amd64 to ensure compatibility.
Ensure Docker Desktop is running and has sufficient memory/CPUs configured.
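A simple pre-flight check (a sketch, not part of the package API) can confirm
that the Docker daemon is reachable before launching the pipeline:

# Exit status 0 means the Docker daemon answered; anything else means it is not reachable.
docker_ok <- system2("docker", "info", stdout = FALSE, stderr = FALSE) == 0
if (!docker_ok) {
  stop("Docker does not appear to be running; start Docker Desktop and retry.")
}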
Input Requirements
The duckdb_path must reference a per-selection DuckDB that contains:
files (paths to .gff, .fna, .PATRIC.faa),
filtered (genomes selected for download/filtering), and
BV-BRC metadata tables written by earlier steps.
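The presence of these tables can be verified up front; the snippet below is an
illustrative check using the DBI and duckdb packages (only the table names
listed above are assumed):

library(DBI)

con <- dbConnect(duckdb::duckdb(),
                 dbdir = "data/Shigella_flexneri/Sfl.duckdb",
                 read_only = TRUE)
required <- c("files", "filtered")            # plus the BV-BRC metadata tables
missing  <- setdiff(required, dbListTables(con))
if (length(missing) > 0) {
  warning("Missing required tables: ", paste(missing, collapse = ", "))
}
dbDisconnect(con, shutdown = TRUE)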
Outputs & Side Effects
Writes tool-specific intermediate outputs under output_path (e.g., panaroo_out_*, CD-HIT files).
Writes Parquet files to output_path:
gene_count.parquet, protein_count.parquet, domain_count.parquet, struct.parquet,
gene_names.parquet, protein_names.parquet, domain_names.parquet,
gene_seqs.parquet, protein_seqs.parquet, genome_gene_protein.parquet,
metadata.parquet, amr_phenotype.parquet, genome_data.parquet, original_metadata.parquet.
Creates a new Parquet-backed DuckDB (*_parquet.duckdb) with read-only views pointing to those Parquets.
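Once created, the Parquet-backed DuckDB can be queried like any other DuckDB
database; an illustrative read-only inspection (path is a placeholder):

library(DBI)

con <- dbConnect(duckdb::duckdb(),
                 dbdir = "data/Shigella_flexneri/Sfl_parquet.duckdb",
                 read_only = TRUE)
dbListTables(con)                             # gene_count, protein_count, domain_count, ...
dbGetQuery(con, "SELECT * FROM gene_count LIMIT 5")
dbDisconnect(con, shutdown = TRUE)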
Threading
threads is a shared budget; each stage uses a portion or all of it.
InterProScan can be memory-intensive; on laptops, single-container mode is used internally.
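One reasonable way to size the budget (an assumption, not a package requirement)
is to leave a couple of cores free for the host:

n_threads <- max(1, parallel::detectCores() - 2)
runDataProcessing(duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
                  threads     = n_threads)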
Pipeline Stages

Panaroo via runPanaroo2Duckdb() → writes:
gene_count (genome × gene counts)
gene_names
gene_struct (structural variants)
gene_ref_seq, genome_gene_protein
CD-HIT via CDHIT2duckdb() (calls internal .runCDHIT()) → writes:
protein_count (genome × protein-cluster counts)
protein_names
protein_cluster_seq (representative sequences)
InterProScan (Pfam) via domainFromIPR() → writes:
domain_names
domain_count (genome × domain-family matrix)
Metadata cleaning + Parquet export via cleanData() → writes Parquet
files to output_path, and builds a Parquet-backed DuckDB
(*_parquet.duckdb) with views:
gene_count, protein_count, domain_count, struct
metadata (cleaned), plus amr_phenotype, genome_data, original_metadata
gene_names, protein_names, domain_names
gene_seqs, protein_seqs
genome_gene_protein
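Conceptually, the controller invokes the stages in the order listed above; the
sketch below shows that order only, and the argument names are assumptions
(see each helper's own documentation for its real signature):

# Stage order only; argument names are illustrative, not the real signatures.
runPanaroo2Duckdb(duckdb_path = duckdb_path, output_path = output_path, threads = threads)
CDHIT2duckdb(duckdb_path = duckdb_path, output_path = output_path, identity = cdhit_identity)
domainFromIPR(duckdb_path = duckdb_path, output_path = output_path, appl = ipr_appl)
cleanData(duckdb_path = duckdb_path, output_path = output_path, ref_file_path = ref_file_path)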
See also

prepareGenomes(), runPanaroo2Duckdb(), CDHIT2duckdb(), domainFromIPR(), cleanData()
Examples

if (FALSE) { # \dontrun{
# Paths below are illustrative; adapt to your project layout.
runDataProcessing(
duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
output_path = "data/Shigella_flexneri",
threads = 16,
ref_file_path = "data_raw/"
)
# After completion:
# data/Shigella_flexneri/Sfl_parquet.duckdb
# will contain views over the Parquet files for downstream ML.
} # }
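As a follow-up sketch (not part of the package), the exported Parquet files can
be assembled into a single feature table for ML with the arrow package; the
shared genome identifier column name used below is an assumption:

library(arrow)

out_dir <- "data/Shigella_flexneri"
gene    <- read_parquet(file.path(out_dir, "gene_count.parquet"))
protein <- read_parquet(file.path(out_dir, "protein_count.parquet"))
domain  <- read_parquet(file.path(out_dir, "domain_count.parquet"))

# Assumes each count table carries a genome identifier column named "genome_id" (illustrative).
features <- Reduce(function(x, y) merge(x, y, by = "genome_id"),
                   list(gene, protein, domain))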