Run Panaroo for Pangenome Analysis in Parallel Batches

Runs Panaroo inside a Docker container on the genome annotation files prepared by genomeList(). The input genomes can optionally be split into batches. Each batch is run through Panaroo with strict cleaning and the clustering options given in the arguments, and the result of each batch run is returned.

.runPanaroo(
  duckdb_path = "data/{Bug}/{Bug}.duckdb",
  output_path = "data/{Bug}/",
  core_threshold = 0.9,
  len_dif_percent = 0.95,
  cluster_threshold = 0.95,
  family_seq_identity = 0.5,
  threads = 8,
  split_jobs = FALSE
)

Arguments

duckdb_path: Character scalar. Path to the DuckDB database containing the "files" table.
output_path: Character scalar. Base directory for Panaroo outputs and temporary files.
core_threshold: Numeric. Core genome threshold for Panaroo (--core_threshold). Default 0.90.
len_dif_percent: Numeric. Length difference cutoff, given as a proportion between 0 and 1 (--len_dif_percent). Default 0.95.
cluster_threshold: Numeric. Sequence identity threshold (--threshold). Default 0.95.
family_seq_identity: Numeric. Sequence identity threshold for gene family clustering (-f, --family_threshold). Default 0.5.
threads: Integer. Number of threads for Panaroo and parallel execution. Default 8.
split_jobs: Logical. If TRUE, the isolates are split into multiple smaller pangenome jobs whose outputs can be merged by .mergePanaroo(). If FALSE, all isolates are processed in a single run. Default FALSE.
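
As an illustration of how the arguments above might map onto a Panaroo command line, the following is a minimal sketch and not the package's actual implementation. The helper name build_panaroo_args and the file names are hypothetical. The -i, -o and -t flags are Panaroo's standard input, output and thread options, and the three fixed flags are those listed under Details.

```r
# Hypothetical helper (not part of the package): assembles a Panaroo
# argument vector from the documented parameters.
build_panaroo_args <- function(input_list, out_dir,
                               core_threshold = 0.9,
                               len_dif_percent = 0.95,
                               cluster_threshold = 0.95,
                               family_seq_identity = 0.5,
                               threads = 8L) {
  c(
    "-i", input_list,                      # text file listing one GFF per line
    "-o", out_dir,                         # Panaroo output directory
    "--clean-mode", "strict",              # fixed flags listed under Details
    "--merge_paralogs",
    "--remove-invalid-genes",
    "--core_threshold", format(core_threshold),
    "--len_dif_percent", format(len_dif_percent),
    "--threshold", format(cluster_threshold),
    "-f", format(family_seq_identity),
    "-t", as.character(threads)
  )
}

# Assumed file and directory names, for illustration only.
panaroo_args <- build_panaroo_args("genomes.txt", "panaroo_out")
cat("panaroo", panaroo_args, "\n")
```

A vector of this shape could be passed to system2() after whatever Docker invocation the function uses; that wrapper is not documented here and is left out of the sketch.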

Value

A list with one element per Panaroo batch, each holding the result of running that batch in its output directory.

Details

Panaroo is always run with the flags --clean-mode strict, --merge_paralogs, and --remove-invalid-genes.
Temporary genome file lists are created in output_path.
Output directories are named panaroo_out_<timestamp> under output_path.
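
The batching and naming steps described above could look roughly like the sketch below. It is an illustration under stated assumptions: the batch size, the timestamp format, the genome list file names and the example GFF names are all hypothetical, and tempdir() stands in for output_path.

```r
# Hypothetical inputs: in the real function the annotation files would
# come from the "files" table in the DuckDB database.
gff_files  <- sprintf("isolate_%02d.gff", 1:10)
batch_size <- 4L  # assumed; the actual batch size is not documented here

# Assign consecutive files to batches of at most batch_size genomes.
batch_id <- ceiling(seq_along(gff_files) / batch_size)
batches  <- split(gff_files, batch_id)

output_path <- tempdir()                              # stand-in for output_path
stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")          # assumed timestamp format

# Write one temporary genome list per batch into output_path.
list_files <- vapply(names(batches), function(b) {
  path <- file.path(output_path, sprintf("genome_list_%s_%s.txt", stamp, b))
  writeLines(batches[[b]], path)
  path
}, character(1))

# Output directory named panaroo_out_<timestamp> under output_path.
out_dir <- file.path(output_path, paste0("panaroo_out_", stamp))
```

With split_jobs = FALSE the same steps would reduce to a single batch containing every isolate.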