brainbeacon.pipeline.cell_embedding.run_bbcellformer_pipeline
- brainbeacon.pipeline.cell_embedding.run_bbcellformer_pipeline(adata_path, specie, assay, gene_dict_path, gene_mean_path, bb_ckpt_path, cellplm_ckpt_path, output_dir, output_prefix, path_dict=None, config_train=None, config_update=None, n_hvg=1000, cd_weight=0.02, use_hvg=True, use_batch=True, use_spatial=True, weight_mode='expression', force_tokenize=True, use_dev_abs=False, do_fit=True, fit_epochs=100, slice_sample=False, enc_mod='flowformer', mask_type='hidden', output_attentions=False, save_model=True, save_model_path=None, save_embedding_path=None, device=None, seed=42, deterministic=True)
Run the end-to-end BrainBeacon + CellFormer pipeline and return an updated AnnData.
This function performs:
1) Load AnnData from adata_path and set adata.obs["platform"] = assay.
2) Tokenization (BrainBeacon tokenizer) and save token files under output_dir.
3) BrainBeacon inference to produce cell embeddings (saved as *_bb_embeddings.npz).
4) CellFormer reconstruction / fitting and save final embeddings and model (optional).
- Parameters:
adata_path (str) – Path to the input AnnData (.h5ad).
specie (str) – Species name used in tokenization (e.g., "human", "mouse").
assay (str) – Platform / assay name. Will be stored to adata.obs["platform"].
gene_dict_path (str) – Path to the BrainBeacon gene dictionary (.h5ad).
gene_mean_path (str) – Path to gene mean statistics used by tokenizer.
bb_ckpt_path (str) – Path to BrainBeacon pretrained checkpoint.
cellplm_ckpt_path (str) – Path to CellPLM/CellFormer pretrained checkpoint.
output_dir (str) – Output directory for intermediate files and results.
output_prefix (str) – Prefix used to name output files.
path_dict (dict, optional) – Optional path configuration passed to downstream reconstruction.
config_train (dict) – Training/inference configuration. Must be provided, although the signature defaults it to None. This function will update it with internal defaults (e.g., weight_mode, cd_weight).
config_update (dict, optional) – Optional overrides merged into config_train after defaults are set.
n_hvg (int, default 1000) – Number of HVGs to use if use_hvg=True.
cd_weight (float, default 0.02) – Cell-density token weight used by expression-weighted pooling.
use_hvg (bool, default True) – Whether to perform HVG selection in tokenization.
use_batch (bool, default True) – Whether to enable batch-related options in CellFormer reconstruction.
use_spatial (bool, default True) – Whether to enable spatial options in CellFormer reconstruction.
weight_mode (str, default "expression") – Pooling mode used for embedding aggregation (e.g., "expression").
force_tokenize (bool, default True) – If True, redo tokenization and overwrite intermediate outputs. Note: this flag also controls whether to skip BB inference when cached files exist.
use_dev_abs (bool, default False) – Whether to use alternative dev/abs settings in tokenization (project-specific).
do_fit (bool, default True) – Whether to fit/fine-tune CellFormer reconstruction.
fit_epochs (int, default 100) – Number of epochs for fitting when do_fit=True.
slice_sample (bool, default False) – If True, select one slice for training (project-specific behavior).
enc_mod (str, default "flowformer") – Encoder module variant used by CellFormer.
mask_type (str, default "hidden") – Masking strategy, "hidden" or "input".
output_attentions (bool, default False) – Whether to return/record attention weights during reconstruction.
save_model (bool, default True) – Whether to save the trained/fitted CellFormer model.
save_model_path (str, optional) – Path to save the model checkpoint. If None, a default path is used.
save_embedding_path (str, optional) – Path to save final embeddings. If None, a default path is used.
device (torch.device or str, optional) – Device to run on. If None, uses CUDA if available, else CPU.
seed (int, default 42) – Random seed.
deterministic (bool, default True) – Whether to enforce deterministic behavior (when supported).
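The merge order described for config_train and config_update (argument-derived defaults first, overrides last) can be sketched as below. This is only an illustration under stated assumptions: merge_config is a hypothetical helper, not part of brainbeacon, the "lr" key is made up for the example, and the sketch copies the dict whereas the real function may update config_train in place.

```python
from typing import Optional


def merge_config(config_train: dict,
                 config_update: Optional[dict] = None,
                 weight_mode: str = "expression",
                 cd_weight: float = 0.02) -> dict:
    """Hypothetical helper illustrating the documented merge order."""
    if config_train is None:
        # The docs state that config_train must be provided.
        raise ValueError("config_train must be provided")
    # Copy so the caller's dict is untouched in this sketch
    # (the real pipeline may mutate config_train instead).
    merged = dict(config_train)
    # Step 1: values taken from the function arguments are written into the config.
    merged.update({"weight_mode": weight_mode, "cd_weight": cd_weight})
    # Step 2: explicit overrides are applied last, so they take precedence.
    if config_update:
        merged.update(config_update)
    return merged


# "lr" is a made-up key used only for illustration.
config = merge_config({"lr": 1e-4, "cd_weight": 0.5},
                      config_update={"cd_weight": 0.1})
```

Under this reading, a value that must survive the internal defaults belongs in config_update rather than in config_train.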
- Returns:
adata (AnnData) – Updated AnnData object returned by run_bbcellformer_recon. Intermediate files (tokenization outputs, BB embeddings, final embeddings/model) are saved under output_dir with the given output_prefix.
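A minimal call might look like the sketch below. It uses only the parameter names from the signature above; every file name, the assay value, and the empty config_train are placeholders that must be replaced with real project values. The import is deferred into run_pipeline so the argument-building part can be inspected without brainbeacon installed.

```python
import os


def build_pipeline_kwargs(data_root: str, output_dir: str, output_prefix: str) -> dict:
    """Collect arguments for run_bbcellformer_pipeline (all file names are placeholders)."""
    return {
        "adata_path": os.path.join(data_root, "sample.h5ad"),
        "specie": "mouse",
        "assay": "merfish",  # placeholder platform name, stored in adata.obs["platform"]
        "gene_dict_path": os.path.join(data_root, "gene_dict.h5ad"),
        "gene_mean_path": os.path.join(data_root, "gene_mean_file"),  # real file name/format not documented here
        "bb_ckpt_path": os.path.join(data_root, "bb_checkpoint_file"),
        "cellplm_ckpt_path": os.path.join(data_root, "cellplm_checkpoint_file"),
        "output_dir": output_dir,
        "output_prefix": output_prefix,
        "config_train": {},  # must be provided; the required keys are project-specific
        "do_fit": False,      # skip fitting in this sketch
        "save_model": False,
        "seed": 42,
    }


def run_pipeline(kwargs: dict):
    # Imported lazily so building the arguments does not require brainbeacon.
    from brainbeacon.pipeline.cell_embedding import run_bbcellformer_pipeline
    return run_bbcellformer_pipeline(**kwargs)


kwargs = build_pipeline_kwargs("data", "results", "demo")
# adata = run_pipeline(kwargs)  # uncomment once the placeholder paths point at real files
```

Keeping the arguments in a dict makes it easy to log the exact configuration next to the files written under output_dir.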