R/data_processing.R
runPanaroo2Duckdb.Rd
runPanaroo2Duckdb() executes Panaroo on the genomes registered in a
per-selection DuckDB (created earlier by prepareGenomes()), optionally in
multiple batches, and imports all resulting pangenome tables into the same
DuckDB database.
It acts as a high-level wrapper around:
.runPanaroo() — runs Panaroo (single or multi-batch)
.mergePanaroo() — optionally merges batch outputs
.panaroo2duckdb() — loads Panaroo results (gene counts, struct variants,
gene names, reference sequences, long tables) into the DuckDB
The function determines which Panaroo output directory to use (single-run or merged), verifies that a valid pangenome has been produced, and updates the DuckDB with standardized table names consistent with downstream processing steps.
runPanaroo2Duckdb(
duckdb_path,
output_path = NULL,
core_threshold = 0.9,
len_dif_percent = 0.95,
cluster_threshold = 0.95,
family_seq_identity = 0.5,
threads = 16,
split_jobs = FALSE,
verbose = TRUE
)
duckdb_path
Character. Path to the per-selection DuckDB database created by
prepareGenomes(). Must contain a files table with Panaroo input file paths.
output_path
Character or NULL. Directory where the Panaroo outputs
(the panaroo_out_* directories or the merged merge_output/ directory) will be
written. If NULL, defaults to dirname(duckdb_path).
core_threshold
Numeric. Passed to Panaroo as --core_threshold.
Default: 0.90.
len_dif_percent
Numeric. Passed to Panaroo as --len_dif_percent.
Default: 0.95.
cluster_threshold
Numeric. Passed to Panaroo as the global clustering --threshold.
Default: 0.95.
family_seq_identity
Numeric. Passed to Panaroo as the gene family sequence identity threshold (-f).
Default: 0.5.
threads
Integer. Total CPU budget to allocate to Panaroo.
If split_jobs = TRUE, the threads are divided across batches.
Default: 16.
split_jobs
Logical. If TRUE, Panaroo is run in multiple parallel
batches (up to 5, depending on dataset size), and the batch outputs are merged
using .mergePanaroo(). If FALSE, only one Panaroo invocation is run.
Default: FALSE.
verbose
Logical. If TRUE, print status messages during Panaroo execution,
merging, and DuckDB import. Default: TRUE.
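The interaction between threads and split_jobs can be illustrated with a small sketch. It assumes the total budget is divided evenly across batches with at least one thread per batch; the helper name threads_per_batch() is hypothetical, and the actual allocation inside .runPanaroo() may differ.

```r
# Hypothetical helper: split a total CPU budget evenly across Panaroo batches.
# Assumption: even division, rounded down, never fewer than one thread per batch.
threads_per_batch <- function(threads, n_batches) {
  stopifnot(threads >= 1, n_batches >= 1)
  max(1L, as.integer(threads %/% n_batches))
}

threads_per_batch(24, 5)  # 4 threads for each of 5 batches
threads_per_batch(16, 1)  # a single run keeps the full budget of 16
threads_per_batch(2, 5)   # never drops below 1 thread per batch
```

Under this assumption, a budget that does not divide evenly leaves a few threads unused rather than oversubscribing the machine.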
Invisibly returns the path to the selected Panaroo output directory
(either the single-run output or the merged merge_output/ directory).
After running .runPanaroo(), the function scans output_path for directories
matching panaroo_out_* and identifies those containing a final_graph.gml file —
the minimum requirement for a valid Panaroo run.
If split_jobs = TRUE and multiple valid outputs are present,
.mergePanaroo() is used to combine the outputs.
If split_jobs = FALSE, the single valid output directory is used directly.
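The directory scan and selection described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper names find_valid_panaroo_dirs() and select_panaroo_dir() are hypothetical, the location of merge_output/ directly under output_path is assumed, and the merge step itself is only indicated by a comment.

```r
# Hypothetical helper: list panaroo_out_* directories that contain final_graph.gml.
find_valid_panaroo_dirs <- function(output_path) {
  dirs <- list.dirs(output_path, full.names = TRUE, recursive = FALSE)
  dirs <- dirs[grepl("^panaroo_out_", basename(dirs))]
  dirs[file.exists(file.path(dirs, "final_graph.gml"))]
}

# Hypothetical helper: choose the directory whose results would be imported.
select_panaroo_dir <- function(output_path, split_jobs = FALSE) {
  valid <- find_valid_panaroo_dirs(output_path)
  if (length(valid) == 0) {
    stop("No valid Panaroo output found in ", output_path)
  }
  if (split_jobs && length(valid) > 1) {
    # The real function would call .mergePanaroo() here; this sketch only
    # returns the assumed location of the merged output.
    return(file.path(output_path, "merge_output"))
  }
  valid[1]
}

# Demo on a throwaway directory tree: one complete run, one incomplete run.
demo_dir <- tempfile("panaroo_demo_")
dir.create(file.path(demo_dir, "panaroo_out_1"), recursive = TRUE)
dir.create(file.path(demo_dir, "panaroo_out_2"), recursive = TRUE)
invisible(file.create(file.path(demo_dir, "panaroo_out_1", "final_graph.gml")))

single_choice <- select_panaroo_dir(demo_dir, split_jobs = FALSE)
basename(single_choice)  # "panaroo_out_1"

# Once the second batch is complete, a split run resolves to the merged directory.
invisible(file.create(file.path(demo_dir, "panaroo_out_2", "final_graph.gml")))
merged_choice <- select_panaroo_dir(demo_dir, split_jobs = TRUE)
basename(merged_choice)  # "merge_output"
```

Checking for final_graph.gml rather than for the directory alone means that an interrupted batch is ignored instead of being imported half-finished.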
.panaroo2duckdb() is then called to import:
gene presence/absence counts (gene_count)
gene names (gene_names)
structural presence/absence (gene_struct)
gene reference FASTA (gene_ref_seq)
long-form genome → gene → protein tables
These maintain the standardized schema used by downstream feature extraction
and modeling steps in amRdata and amRml.
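After an import, one way to confirm that the standardized tables exist is to list the tables in the database with DBI. The sketch below assumes the DBI and duckdb packages are installed; the helper missing_panaroo_tables() is hypothetical, and the demo uses an in-memory database with placeholder columns rather than a real per-selection DuckDB. The names of the long-form tables are not given above, so they are left out of the check.

```r
library(DBI)

# Table names taken from the import list above.
expected_tables <- c("gene_count", "gene_names", "gene_struct", "gene_ref_seq")

# Hypothetical helper: report which expected tables are absent from a connection.
missing_panaroo_tables <- function(con, expected = expected_tables) {
  setdiff(expected, DBI::dbListTables(con))
}

# Demo against an in-memory DuckDB; the columns are placeholders, not the real schema.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(
  con, "gene_count",
  data.frame(genome_id = "g1", gene = "geneA", count = 1L)
)
still_missing <- missing_panaroo_tables(con)
still_missing  # "gene_names" "gene_struct" "gene_ref_seq"
DBI::dbDisconnect(con, shutdown = TRUE)
```

For a real database, the connection would instead be opened with DBI::dbConnect(duckdb::duckdb(), dbdir = duckdb_path, read_only = TRUE), so that the check cannot modify the imported tables.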
.runPanaroo() — core Panaroo execution
.mergePanaroo() — merge multiple Panaroo batches
.panaroo2duckdb() — import Panaroo results into DuckDB
runDataProcessing() — full pipeline including CD-HIT & InterProScan
if (FALSE) { # \dontrun{
# Basic usage:
runPanaroo2Duckdb(
duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
output_path = "data/Shigella_flexneri",
threads = 12,
split_jobs = FALSE
)
# Merging multi-batch pangenomes:
runPanaroo2Duckdb(
duckdb_path = "data/Ecoli/Eco.duckdb",
output_path = "data/Ecoli",
split_jobs = TRUE,
threads = 24
)
} # }