runPanaroo2Duckdb() executes Panaroo on the genomes registered in a per-selection DuckDB (created earlier by prepareGenomes()), optionally in multiple batches, and imports all resulting pangenome tables into the same DuckDB database.

It acts as a high-level wrapper around:

  • .runPanaroo(): runs Panaroo (single or multi-batch)

  • .mergePanaroo(): optionally merges batch outputs

  • .panaroo2duckdb(): loads Panaroo results (gene counts, structural variants, gene names, reference sequences, long tables) into the DuckDB database

The function determines which Panaroo output directory to use (single-run or merged), verifies that a valid pangenome has been produced, and updates the DuckDB with standardized table names consistent with downstream processing steps.

runPanaroo2Duckdb(
  duckdb_path,
  output_path = NULL,
  core_threshold = 0.9,
  len_dif_percent = 0.95,
  cluster_threshold = 0.95,
  family_seq_identity = 0.5,
  threads = 16,
  split_jobs = FALSE,
  verbose = TRUE
)

Arguments

duckdb_path

Character. Path to the per-selection DuckDB database created by prepareGenomes(). Must contain a files table with Panaroo input file paths.

output_path

Character or NULL. Directory where Panaroo outputs are written: the panaroo_out_* run directories and, for merged runs, merge_output/. If NULL, defaults to dirname(duckdb_path).

core_threshold

Numeric. Panaroo --core_threshold parameter. Default: 0.90.

len_dif_percent

Numeric. Panaroo --len_dif_percent parameter. Default: 0.95.

cluster_threshold

Numeric. Panaroo --threshold parameter (sequence identity threshold for global clustering). Default: 0.95.

family_seq_identity

Numeric. Panaroo gene family sequence identity threshold (-f, --family_threshold). Default: 0.5.

threads

Integer. Total CPU budget to allocate for Panaroo. If split_jobs = TRUE, threads are divided across batches. Default: 16.

split_jobs

Logical. If TRUE, Panaroo is run in multiple parallel batches (up to 5, depending on dataset size), and batch outputs are merged using .mergePanaroo(). If FALSE, only one Panaroo invocation is run. Default: FALSE.

verbose

Logical. Print status messages during Panaroo execution, merging, and DuckDB import. Default: TRUE.
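
As a rough illustration of how threads and split_jobs interact, the sketch below splits a thread budget evenly across batches. The helper name threads_per_batch() and the even integer division are assumptions made for illustration; the documentation only states that threads are divided across batches, not how.

```r
# Hypothetical helper, not part of the package: one plausible way to
# split a total thread budget across Panaroo batches.
threads_per_batch <- function(threads, n_batches) {
  # The documentation mentions up to 5 batches; how the batch count is
  # chosen from dataset size is not specified, so it is an input here.
  stopifnot(threads >= 1, n_batches >= 1, n_batches <= 5)
  # Integer division, but never fewer than one thread per batch.
  max(1L, as.integer(threads) %/% as.integer(n_batches))
}

per_batch <- threads_per_batch(threads = 24, n_batches = 5)
per_batch  # 4 threads per batch under this assumed scheme
```

Under this assumed scheme, a budget that does not divide evenly leaves a few threads unused rather than oversubscribing the machine.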

Value

Invisibly returns the path to the selected Panaroo output directory (either the single-run output or the merged merge_output/ directory).

Details

Panaroo Output Discovery

After .runPanaroo() finishes, the function scans output_path for directories matching panaroo_out_* and keeps those that contain a final_graph.gml file, which it treats as the minimum requirement for a valid Panaroo run.

  • If split_jobs = TRUE and multiple valid outputs are present, .mergePanaroo() is used to combine the outputs.

  • If split_jobs = FALSE, the single valid output directory is used directly.
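
The discovery and selection steps above can be sketched in base R as follows. This is a minimal illustration rather than the package's internal code: the helper names find_valid_panaroo_outputs() and choose_action() are hypothetical, and the error raised when no valid output exists is an assumption about how the validity check might surface.

```r
# Hypothetical helper: list panaroo_out_* directories under output_path
# that contain a final_graph.gml file.
find_valid_panaroo_outputs <- function(output_path) {
  dirs <- list.dirs(output_path, full.names = TRUE, recursive = FALSE)
  dirs <- dirs[grepl("^panaroo_out_", basename(dirs))]
  dirs[file.exists(file.path(dirs, "final_graph.gml"))]
}

# Hypothetical helper: decide between merging and using a single run.
choose_action <- function(valid_dirs, split_jobs) {
  if (length(valid_dirs) == 0) {
    stop("No Panaroo output containing final_graph.gml was found")  # assumed behaviour
  }
  if (split_jobs && length(valid_dirs) > 1) "merge" else "single"
}

# Demo on a throwaway directory layout: two runs, only one of them valid.
root <- tempfile("panaroo_demo_")
dir.create(file.path(root, "panaroo_out_1"), recursive = TRUE)
dir.create(file.path(root, "panaroo_out_2"), recursive = TRUE)
invisible(file.create(file.path(root, "panaroo_out_1", "final_graph.gml")))

valid <- find_valid_panaroo_outputs(root)
basename(valid)                           # only panaroo_out_1 qualifies
action <- choose_action(valid, split_jobs = TRUE)
action                                    # a single valid run, so nothing to merge
```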

DuckDB Integration

.panaroo2duckdb() is then called to import:

  • gene presence/absence counts (gene_count)

  • gene names (gene_names)

  • structural presence/absence (gene_struct)

  • gene reference FASTA (gene_ref_seq)

  • long-form genome → gene → protein tables

These tables follow the standardized schema expected by the downstream feature extraction and modeling steps in amRdata and amRml.
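
One way to confirm that an import produced the tables named above is to compare them against the tables in the database. The sketch below assumes the DBI and duckdb packages are installed; the helper missing_panaroo_tables() is hypothetical, and the toy columns written to the in-memory database are placeholders rather than the real schema. The long-form tables are left out because their names are not given here.

```r
# Table names taken from the list above.
expected_tables <- c("gene_count", "gene_names", "gene_struct", "gene_ref_seq")

# Hypothetical helper: which expected Panaroo tables are absent?
missing_panaroo_tables <- function(con, expected = expected_tables) {
  setdiff(expected, DBI::dbListTables(con))
}

# Demo against an in-memory DuckDB with two toy tables.
# For a real run, connect with dbdir = duckdb_path instead.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "gene_count",
                  data.frame(genome = "g1", gene = "geneA", n = 1L))  # placeholder columns
DBI::dbWriteTable(con, "gene_names",
                  data.frame(gene = "geneA", name = "example"))       # placeholder columns

missing <- missing_panaroo_tables(con)
DBI::dbDisconnect(con, shutdown = TRUE)

missing  # the two tables not created in this toy database
```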

Examples

if (FALSE) { # \dontrun{
# Basic usage:
runPanaroo2Duckdb(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads     = 12,
  split_jobs  = FALSE
)

# Merging multi-batch pangenomes:
runPanaroo2Duckdb(
  duckdb_path = "data/Ecoli/Eco.duckdb",
  output_path = "data/Ecoli",
  split_jobs  = TRUE,
  threads     = 24
)
} # }