runDataProcessing() orchestrates the complete feature-extraction pipeline for a BV-BRC selection. It starts from the per-selection DuckDB created by prepareGenomes() and populated by the subsequent download steps. It:

  1. Runs Panaroo to build the pangenome and writes the gene and structural-variant tables into DuckDB.

  2. Runs CD-HIT to cluster proteins and writes protein outputs into DuckDB.

  3. Runs InterProScan (Pfam) to annotate protein domains and writes domain outputs into DuckDB.

  4. Cleans BV-BRC metadata (drug names/classes, countries, years) and exports all feature/metadata tables as compressed Parquet files, then creates a Parquet-backed DuckDB with read-only views of those Parquets for downstream ML.

The function is a thin controller that delegates each stage to the corresponding internal helpers (Dockerized tools where applicable) and ensures consistent output locations and table schemas across stages.

runDataProcessing(
  duckdb_path,
  output_path = NULL,
  threads = 16,
  panaroo_split_jobs = FALSE,
  panaroo_core_threshold = 0.9,
  panaroo_len_dif_percent = 0.95,
  panaroo_cluster_threshold = 0.95,
  panaroo_family_seq_identity = 0.5,
  cdhit_identity = 0.9,
  cdhit_word_length = 5,
  cdhit_memory = 0,
  cdhit_extra_args = c("-g", "1"),
  cdhit_output_prefix = "cdhit_out",
  ipr_appl = c("Pfam"),
  ipr_threads_unused = NULL,
  ipr_version = "5.76-107.0",
  ipr_dest_dir = "inst/extdata/interpro",
  ipr_platform = "linux/amd64",
  auto_prepare_data = TRUE,
  ref_file_path = "data_raw/",
  verbose = TRUE
)

Arguments

duckdb_path

Character. Path to the per-selection DuckDB produced by prepareGenomes() (e.g., "data/<Bug>/<Abbrev>.duckdb"). This DB must already contain at least the tables written by prepareGenomes() and subsequent download steps (e.g., files, filtered, and metadata tables).

output_path

Character or NULL. Base directory for writing Panaroo/CD-HIT/InterProScan outputs and final Parquet files. If NULL, defaults to dirname(duckdb_path).

threads

Integer. Shared concurrency budget used across tools (Panaroo, CD-HIT, InterProScan) and passed through to each stage as appropriate. Default: 16.

panaroo_split_jobs

Logical. If TRUE, Panaroo runs in multiple batches that can be merged by .mergePanaroo(). If FALSE, Panaroo runs once on all isolates. Default: FALSE.

panaroo_core_threshold

Numeric. Panaroo --core_threshold. Default: 0.9.

panaroo_len_dif_percent

Numeric. Panaroo --len_dif_percent. Default: 0.95.

panaroo_cluster_threshold

Numeric. Panaroo --threshold. Default: 0.95.

panaroo_family_seq_identity

Numeric. Panaroo -f (gene family identity). Default: 0.5.

cdhit_identity

Numeric. CD-HIT -c identity threshold. Default: 0.9.

cdhit_word_length

Integer. CD-HIT -n word length. Default: 5.

cdhit_memory

Integer. CD-HIT -M memory limit (MB). Use 0 for unlimited. Default: 0.

cdhit_extra_args

Character vector. Extra arguments forwarded to cd-hit (e.g., c("-g","1")). Default: c("-g","1").

cdhit_output_prefix

Character. Prefix for CD-HIT output files. Default: "cdhit_out".

ipr_appl

Character vector. InterProScan applications to run; typically c("Pfam"). Default: c("Pfam").

ipr_threads_unused

Deprecated/unused. Kept for backward compatibility; ignored.

ipr_version

Character. InterProScan image tag (e.g., "5.76-107.0"). Default: "5.76-107.0".

ipr_dest_dir

Character. Local destination for InterProScan data bundle (used by .checkInterProData()). Default: "inst/extdata/interpro".

ipr_platform

Character. Docker platform string for InterProScan containers, e.g., "linux/amd64". Default: "linux/amd64".

auto_prepare_data

Logical. If TRUE, ensure InterProScan data are present (download/verify if missing). Default: TRUE.

ref_file_path

Character. Directory containing the reference TSVs that cleanData() uses for metadata harmonization (e.g., "data_raw/"). Needed by the metadata-cleaning stage. Default: "data_raw/".

verbose

Logical. Print progress messages. Default: TRUE.
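
As a rough illustration of how the panaroo_* and cdhit_* arguments above correspond to command-line flags, the sketch below assembles flag vectors from the documented defaults. The helper build_tool_flags() is hypothetical and not part of the package. The thread flags (-t for Panaroo, -T for cd-hit) are the tools' own options and are not named in this page. How the internal helpers actually build their Docker commands may differ.

```r
# Hypothetical helper: translate the documented arguments into CLI flag vectors.
# Flag names follow the argument descriptions above; the thread flags are an
# addition taken from the tools' own command-line interfaces.
build_tool_flags <- function(threads = 16,
                             panaroo_core_threshold = 0.9,
                             panaroo_len_dif_percent = 0.95,
                             panaroo_cluster_threshold = 0.95,
                             panaroo_family_seq_identity = 0.5,
                             cdhit_identity = 0.9,
                             cdhit_word_length = 5,
                             cdhit_memory = 0,
                             cdhit_extra_args = c("-g", "1")) {
  list(
    # c() coerces the numeric values to character, as a CLI call needs
    panaroo = c(
      "--core_threshold",  panaroo_core_threshold,
      "--len_dif_percent", panaroo_len_dif_percent,
      "--threshold",       panaroo_cluster_threshold,
      "-f",                panaroo_family_seq_identity,
      "-t",                threads
    ),
    cdhit = c(
      "-c", cdhit_identity,
      "-n", cdhit_word_length,
      "-M", cdhit_memory,
      "-T", threads,
      cdhit_extra_args
    )
  )
}

flags <- build_tool_flags()
paste(flags$cdhit, collapse = " ")
```

Keeping the mapping in one place makes it easy to see which arguments are plain pass-throughs and which (such as threads) are shared between tools.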

Value

Invisibly returns a list with:

  • duckdb_path – input DuckDB path

  • panaroo_output – path to the selected Panaroo output directory used for import

  • parquet_duckdb_path – absolute path to the created Parquet-backed DuckDB
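
Because the list is returned invisibly, nothing is printed unless the result is assigned or wrapped in print(). The sketch below shows this with a stand-in function, fake_run(), which is hypothetical: it only mimics the documented return shape and runs no tools. The panaroo_output directory name used here is also an illustrative assumption.

```r
# Stand-in with the same return shape as documented above; it runs no tools.
fake_run <- function(duckdb_path) {
  out_dir <- dirname(duckdb_path)
  invisible(list(
    duckdb_path         = duckdb_path,
    panaroo_output      = file.path(out_dir, "panaroo_out_1"),  # hypothetical name
    parquet_duckdb_path = file.path(out_dir, "Sfl_parquet.duckdb")
  ))
}

# Called on its own, an invisible result is not auto-printed.
fake_run("data/Shigella_flexneri/Sfl.duckdb")

# Assign the result to keep the paths for later steps.
res <- fake_run("data/Shigella_flexneri/Sfl.duckdb")
names(res)
res$parquet_duckdb_path
```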

Details

Docker & Platform Notes

  • All heavy tools (Panaroo, CD-HIT, InterProScan) run inside Docker containers.

  • On Apple Silicon/ARM hosts, containers are forced to run with --platform linux/amd64 to ensure compatibility.

  • Ensure Docker Desktop is running and has sufficient memory/CPUs configured.
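
A pre-flight check along the lines of the notes above could look like the following sketch. Both helpers, docker_platform_args() and docker_ready(), are hypothetical and not part of the package. They assume the docker CLI is the entry point and that an ARM host reports "arm64" or "aarch64" as its machine type.

```r
# Hypothetical helper: choose Docker --platform arguments from the host CPU.
docker_platform_args <- function(machine = Sys.info()[["machine"]],
                                 platform = "linux/amd64") {
  if (grepl("arm64|aarch64", machine, ignore.case = TRUE)) {
    c("--platform", platform)   # emulate amd64 on ARM hosts
  } else {
    character(0)                # native amd64: no extra arguments
  }
}

# Hypothetical check: is the docker CLI on PATH and is the daemon reachable?
docker_ready <- function() {
  if (!nzchar(Sys.which("docker"))) return(FALSE)
  status <- suppressWarnings(
    system2("docker", "info", stdout = FALSE, stderr = FALSE)
  )
  identical(as.integer(status), 0L)
}

docker_platform_args("arm64")
docker_platform_args("x86_64")
```

Calling docker_ready() before a long run gives an early failure instead of an error partway through the first containerized stage.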

Input Requirements

  • The duckdb_path must reference a per-selection DuckDB that contains: files (paths to .gff, .fna, .PATRIC.faa), filtered (genomes selected for download/filtering), and BV-BRC metadata tables written by earlier steps.
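
The requirement above can be checked before starting a run. The sketch below is an illustration only: missing_tables() and check_selection_db() are hypothetical helpers, and only the two table names spelled out in this page (files, filtered) are checked, since the exact names of the metadata tables are not listed here. The second helper needs the DBI and duckdb packages when it is called.

```r
# Hypothetical helper: report which required tables are absent.
missing_tables <- function(present, required = c("files", "filtered")) {
  setdiff(required, present)
}

# Hypothetical wrapper: read the table list from the DuckDB file and fail early.
# Requires the DBI and duckdb packages at call time.
check_selection_db <- function(duckdb_path, required = c("files", "filtered")) {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = duckdb_path, read_only = TRUE)
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  missing <- missing_tables(DBI::dbListTables(con), required)
  if (length(missing) > 0) {
    stop("Missing tables in ", duckdb_path, ": ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# The pure helper can be tried without a database:
missing_tables(present = c("files", "metadata"))
```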

Outputs & Side Effects

  • Writes tool-specific intermediate outputs under output_path (e.g., panaroo_out_*, CD-HIT files).

  • Writes Parquet files to output_path: gene_count.parquet, protein_count.parquet, domain_count.parquet, struct.parquet, gene_names.parquet, protein_names.parquet, domain_names.parquet, gene_seqs.parquet, protein_seqs.parquet, genome_gene_protein.parquet, metadata.parquet, amr_phenotype.parquet, genome_data.parquet, original_metadata.parquet.

  • Creates a new Parquet-backed DuckDB (*_parquet.duckdb) with read-only views pointing to those Parquets.
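
To make the relationship between the Parquet files and the Parquet-backed DuckDB concrete, the sketch below generates one view definition per file using DuckDB's read_parquet() table function. The helpers view_sql() and parquet_db_path() are hypothetical, and the file-naming rule is an assumption inferred from the example in this page (Sfl.duckdb becomes Sfl_parquet.duckdb). The SQL that cleanData() actually issues may differ.

```r
# Table names taken from the Parquet file list above.
parquet_tables <- c(
  "gene_count", "protein_count", "domain_count", "struct",
  "gene_names", "protein_names", "domain_names",
  "gene_seqs", "protein_seqs", "genome_gene_protein",
  "metadata", "amr_phenotype", "genome_data", "original_metadata"
)

# Hypothetical helper: one CREATE VIEW statement per Parquet file.
view_sql <- function(output_path, tables = parquet_tables) {
  files <- file.path(output_path, paste0(tables, ".parquet"))
  sprintf("CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet('%s')",
          tables, files)
}

# Assumed naming rule: <stem>.duckdb -> <stem>_parquet.duckdb in output_path.
parquet_db_path <- function(duckdb_path, output_path = dirname(duckdb_path)) {
  stem <- tools::file_path_sans_ext(basename(duckdb_path))
  file.path(output_path, paste0(stem, "_parquet.duckdb"))
}

parquet_db_path("data/Shigella_flexneri/Sfl.duckdb")
view_sql("data/Shigella_flexneri", "struct")

# Applying the statements needs DBI + duckdb and Parquet files that exist:
# con <- DBI::dbConnect(duckdb::duckdb(), dbdir = parquet_db_path(db))
# for (s in view_sql(out_dir)) DBI::dbExecute(con, s)
# DBI::dbDisconnect(con, shutdown = TRUE)
```

Because the views only reference the Parquet files, moving or deleting those files breaks the views; keep the database and the Parquet files together.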

Threading

  • threads is a shared budget; each stage uses a portion or all of it.

  • InterProScan can be memory-intensive; on laptops, single-container mode is used internally.
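
When choosing a value for threads, one option is to cap the request at the number of cores the host reports. The sketch below uses a hypothetical helper, pick_threads(), and is only a suggestion for callers; it says nothing about how the stages divide the budget internally. Note that the CPU count configured for Docker may be lower than the host's core count.

```r
# Hypothetical helper: cap a requested thread budget at the host's core count.
pick_threads <- function(requested = 16, available = parallel::detectCores()) {
  if (is.na(available)) return(1L)          # detectCores() can return NA
  as.integer(max(1, min(requested, available)))
}

pick_threads(16, available = 8)
pick_threads(4, available = 8)
```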

Pipeline Steps

  1. Panaroo via runPanaroo2Duckdb() → writes:

    • gene_count (genome × gene counts)

    • gene_names

    • gene_struct (structural variants)

    • gene_ref_seq, genome_gene_protein

  2. CD-HIT via CDHIT2duckdb() (calls internal .runCDHIT()) → writes:

    • protein_count (genome × protein-cluster counts)

    • protein_names

    • protein_cluster_seq (representative sequences)

  3. InterProScan (Pfam) via domainFromIPR() → writes:

    • domain_names

    • domain_count (genome × domain-family matrix)

  4. Metadata cleaning + Parquet export via cleanData() → writes Parquet files to output_path, and builds a Parquet-backed DuckDB (*_parquet.duckdb) with views:

    • gene_count, protein_count, domain_count, struct

    • metadata (cleaned), plus amr_phenotype, genome_data, original_metadata

    • gene_names, protein_names, domain_names

    • gene_seqs, protein_seqs

    • genome_gene_protein
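
The four steps above follow a simple sequential controller pattern, which can be pictured with the sketch below. Everything here is illustrative: run_stages() is a hypothetical runner, and the stage functions are stubs that return the table names listed for each step instead of calling runPanaroo2Duckdb(), CDHIT2duckdb(), domainFromIPR(), or cleanData().

```r
# Hypothetical stage runner: calls each stage in order and reports progress.
# An error raised by any stage propagates and stops the remaining stages.
run_stages <- function(stages, verbose = TRUE) {
  results <- list()
  for (name in names(stages)) {
    if (verbose) message("[", name, "] starting")
    results[[name]] <- stages[[name]]()
    if (verbose) message("[", name, "] done")
  }
  invisible(results)
}

# Stub stages standing in for the real calls; no tools are run.
stages <- list(
  panaroo      = function() c("gene_count", "gene_names", "gene_struct",
                              "gene_ref_seq", "genome_gene_protein"),
  cdhit        = function() c("protein_count", "protein_names",
                              "protein_cluster_seq"),
  interproscan = function() c("domain_names", "domain_count"),
  clean_export = function() "Parquet files and *_parquet.duckdb"
)

res <- run_stages(stages, verbose = FALSE)
names(res)
```

Running the stages strictly in order matters here because each later stage reads tables written by an earlier one.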

See also

prepareGenomes(), runPanaroo2Duckdb(), CDHIT2duckdb(), domainFromIPR(), cleanData()

Examples

if (FALSE) { # \dontrun{
# Paths below are illustrative; adapt to your project layout.
runDataProcessing(
  duckdb_path   = "data/Shigella_flexneri/Sfl.duckdb",
  output_path   = "data/Shigella_flexneri",
  threads       = 16,
  ref_file_path = "data_raw/"
)

# After completion:
#   data/Shigella_flexneri/Sfl_parquet.duckdb
# will contain views over the Parquet files for downstream ML.
} # }