Run Panaroo and import pangenome outputs into DuckDB

runPanaroo2Duckdb() executes Panaroo on the genomes registered in a per-selection DuckDB (created earlier by prepareGenomes()), optionally in multiple batches, and imports all resulting pangenome tables into the same DuckDB database.

It acts as a high-level wrapper around:

.runPanaroo(): runs Panaroo (single or multi-batch)
.mergePanaroo(): optionally merges batch outputs
.panaroo2duckdb(): loads Panaroo results (gene counts, structural variants, gene names, reference sequences, long tables) into the DuckDB database

The function determines which Panaroo output directory to use (single-run or merged), verifies that a valid pangenome has been produced, and writes the results to the DuckDB database under standardized table names that downstream processing steps expect.

runPanaroo2Duckdb(
  duckdb_path,
  output_path = NULL,
  core_threshold = 0.9,
  len_dif_percent = 0.95,
  cluster_threshold = 0.95,
  family_seq_identity = 0.5,
  threads = 16,
  split_jobs = FALSE,
  verbose = TRUE
)

Arguments

duckdb_path: Character. Path to the per-selection DuckDB database created by prepareGenomes(). Must contain a files table with Panaroo input file paths.
output_path: Character or NULL. Directory where Panaroo outputs are written (panaroo_out_* directories, or merge_output/ for a merged run). If NULL, defaults to dirname(duckdb_path).
core_threshold: Numeric. Panaroo --core_threshold parameter. Default: 0.90.
len_dif_percent: Numeric. Panaroo --len_dif_percent parameter. Default: 0.95.
cluster_threshold: Numeric. Panaroo global clustering --threshold. Default: 0.95.
family_seq_identity: Numeric. Panaroo gene family identity -f. Default: 0.5.
threads: Integer. Total number of CPU threads available to Panaroo. If split_jobs = TRUE, this budget is divided across the batches. Default: 16.
split_jobs: Logical. If TRUE, Panaroo is run in multiple parallel batches (up to 5, depending on dataset size), and batch outputs are merged using .mergePanaroo(). If FALSE, only one Panaroo invocation is run. Default: FALSE.
verbose: Logical. Print status messages during Panaroo execution, merging, and DuckDB import. Default: TRUE.
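
The two derived settings described above (the output_path fallback and the per-batch thread budget) can be sketched in base R. This is a minimal illustration, not the package's code: resolve_output_path() and threads_per_batch() are hypothetical helper names, and the even split with floor rounding is an assumption about how the thread budget is divided.

```r
# Hypothetical helpers for illustration only; not part of the package API.

# Documented behaviour: a NULL output_path falls back to dirname(duckdb_path).
resolve_output_path <- function(duckdb_path, output_path = NULL) {
  if (is.null(output_path)) dirname(duckdb_path) else output_path
}

# ASSUMPTION: the total budget is split evenly and rounded down,
# with at least one thread per batch.
threads_per_batch <- function(threads, n_batches) {
  stopifnot(threads >= 1, n_batches >= 1)
  max(1L, as.integer(threads %/% n_batches))
}

resolve_output_path("data/Ecoli/Eco.duckdb")  # "data/Ecoli"
threads_per_batch(24, 5)                      # 4
```

Under this assumption, a budget of 24 threads spread over the documented maximum of 5 batches leaves 4 threads per batch, so a small budget combined with split_jobs = TRUE may give each Panaroo invocation very few threads.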

Value

Invisibly returns the path to the selected Panaroo output directory (either the single-run output or the merged merge_output/ directory).
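
Because the value is returned invisibly, nothing is printed at the console unless the result is assigned or printed explicitly. The sketch below uses run_stub(), a hypothetical stand-in that only mimics this return behaviour; the location of merge_output/ directly under output_path is an assumption.

```r
# Hypothetical stand-in for runPanaroo2Duckdb(): it runs nothing and only
# mimics the documented invisible return of an output directory path.
# ASSUMPTION: merge_output/ sits directly under output_path.
run_stub <- function(output_path) {
  invisible(file.path(output_path, "merge_output"))
}

run_stub("data/Ecoli")             # prints nothing at top level
out_dir <- run_stub("data/Ecoli")  # assignment captures the path
print(out_dir)

# The captured path can then be used to locate Panaroo files.
graph_file <- file.path(out_dir, "final_graph.gml")
```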

Details

Panaroo Output Discovery

After running .runPanaroo(), the function scans output_path for directories matching panaroo_out_* and keeps those that contain a final_graph.gml file, which is the minimum requirement for a valid Panaroo run.

If split_jobs = TRUE and multiple valid outputs are present, .mergePanaroo() is used to combine the outputs.
If split_jobs = FALSE, the single valid output directory is used directly.
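
The discovery and selection rules above can be sketched in base R. This is an illustration under stated assumptions rather than the package's implementation: find_valid_panaroo_dirs() and select_panaroo_dir() are hypothetical names, the sketch only reports whether a merge would be needed instead of calling .mergePanaroo(), and it stops on combinations the text above does not describe.

```r
# Hypothetical helpers for illustration only; not part of the package API.

# Keep panaroo_out_* subdirectories that contain final_graph.gml.
find_valid_panaroo_dirs <- function(output_path) {
  dirs <- list.dirs(output_path, full.names = TRUE, recursive = FALSE)
  dirs <- dirs[grepl("^panaroo_out_", basename(dirs))]
  dirs[file.exists(file.path(dirs, "final_graph.gml"))]
}

# Decide what to do with the valid outputs.
select_panaroo_dir <- function(valid_dirs, split_jobs) {
  if (length(valid_dirs) == 0L) {
    stop("No panaroo_out_* directory contains final_graph.gml")
  }
  if (split_jobs && length(valid_dirs) > 1L) {
    # The real function would call .mergePanaroo() at this point.
    return(list(action = "merge", dirs = valid_dirs))
  }
  if (length(valid_dirs) == 1L) {
    return(list(action = "use", dirs = valid_dirs))
  }
  # ASSUMPTION: several outputs without split_jobs is treated as an error here,
  # because the documentation does not describe this case.
  stop("Multiple valid outputs found but split_jobs = FALSE")
}

# Small demonstration on a temporary directory.
demo_root <- tempfile("panaroo_demo_")
dir.create(file.path(demo_root, "panaroo_out_1"), recursive = TRUE)
dir.create(file.path(demo_root, "panaroo_out_2"))
file.create(file.path(demo_root, "panaroo_out_1", "final_graph.gml"))

valid <- find_valid_panaroo_dirs(demo_root)
choice <- select_panaroo_dir(valid, split_jobs = FALSE)
print(basename(choice$dirs))
```

In the demonstration only panaroo_out_1 qualifies, since panaroo_out_2 lacks final_graph.gml.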

DuckDB Integration

.panaroo2duckdb() is then called to import:

gene presence/absence counts (gene_count)
gene names (gene_names)
structural presence/absence (gene_struct)
gene reference FASTA (gene_ref_seq)
long-form genome → gene → protein tables

These tables follow the standardized schema used by downstream feature extraction and modeling steps in amRdata and amRml.
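
One way to confirm the import is to list the tables in the database and compare them with the names given above. The sketch below assumes the DBI and duckdb R packages are installed and that the database file exists; missing_pangenome_tables() is a hypothetical helper, and the long-form tables are left out of the check because their names are not stated here.

```r
# Table names taken from the list above; long-form table names are not
# documented here, so they are not checked.
expected_tables <- c("gene_count", "gene_names", "gene_struct", "gene_ref_seq")

# Hypothetical helper: which expected tables are absent from `present`?
missing_pangenome_tables <- function(present, expected = expected_tables) {
  setdiff(expected, present)
}

# Path reused from the Examples section; the check is skipped if the file
# or the required packages are unavailable.
duckdb_path <- "data/Shigella_flexneri/Sfl.duckdb"
if (file.exists(duckdb_path) &&
    requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("duckdb", quietly = TRUE)) {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = duckdb_path, read_only = TRUE)
  print(missing_pangenome_tables(DBI::dbListTables(con)))
  DBI::dbDisconnect(con, shutdown = TRUE)
}
```

An empty result means all four named tables are present. Opening the connection read-only avoids modifying the database during the check.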

Examples

if (FALSE) { # \dontrun{
# Basic usage:
runPanaroo2Duckdb(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads     = 12,
  split_jobs  = FALSE
)

# Merging multi-batch pangenomes:
runPanaroo2Duckdb(
  duckdb_path = "data/Ecoli/Eco.duckdb",
  output_path = "data/Ecoli",
  split_jobs  = TRUE,
  threads     = 24
)
} # }