Should n_counts reflect pre- or post–gene-symbol-to-Ensembl conversion when unmapped genes are removed?
Hi,
I'm preparing an h5ad file for Geneformer tokenization. My data originally uses gene symbols. I mapped them to Ensembl IDs manually, but some symbols could not be mapped and were removed from the matrix. As a result, the per-cell total counts decreased.
Should I set adata.obs['n_counts'] to:
1. The original total counts before Ensembl conversion and gene removal, or
2. The new total counts after conversion (adata.X.sum(axis=1) on the filtered matrix)?
From tokenizer.py, normalization uses obs['n_counts'] directly and does not recompute it from the matrix. I'm unsure which value is correct after manual gene filtering.
Also, is manual pre-filtering of unmapped genes recommended, or should I keep all genes and let TranscriptomeTokenizer handle mapping via the built-in gene_mapping_dict?
Thank you!
Thank you for your question. n_counts should be the total counts per cell across all genes, i.e. the original totals from before the unmapped genes were removed. As long as n_counts includes all genes, it does not matter whether unmapped genes are removed before tokenization, though removing them beforehand is not necessary. The TranscriptomeTokenizer only tokenizes genes that are in the token dictionary and, by default, merges Ensembl IDs that map to the same gene.
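To illustrate the order of operations, here is a minimal sketch using plain NumPy arrays rather than a real AnnData object. The helper name total_counts_then_filter, the toy gene symbols, and the placeholder Ensembl IDs are all hypothetical; the only point is that the per-cell totals are computed before any unmapped gene is dropped.

```python
import numpy as np


def total_counts_then_filter(counts, gene_symbols, symbol_to_ensembl):
    """Compute per-cell totals over ALL genes, then drop unmapped genes.

    Hypothetical helper for illustration; not part of Geneformer.
    """
    counts = np.asarray(counts)
    # Totals are taken first, so they still include the unmapped genes.
    n_counts = counts.sum(axis=1)
    # Keep only the columns whose symbol has an Ensembl mapping.
    keep = np.array([s in symbol_to_ensembl for s in gene_symbols], dtype=bool)
    ensembl_ids = [symbol_to_ensembl[s] for s, k in zip(gene_symbols, keep) if k]
    return n_counts, counts[:, keep], ensembl_ids


# Toy data: 2 cells x 3 genes. "GENE_C" has no mapping (all names are made up).
counts = np.array([[5, 0, 3],
                   [1, 2, 4]])
symbols = ["GENE_A", "GENE_B", "GENE_C"]
mapping = {"GENE_A": "ENSG_A", "GENE_B": "ENSG_B"}  # placeholder IDs

n_counts, filtered, ensembl_ids = total_counts_then_filter(counts, symbols, mapping)

print("n_counts over all genes:", n_counts)
print("totals after filtering: ", filtered.sum(axis=1))  # smaller; not the value to store
```

With an AnnData object the same idea would be to assign adata.obs['n_counts'] from the unfiltered matrix first and only then subset the genes. If adata.X is a SciPy sparse matrix, adata.X.sum(axis=1) returns a 2-D matrix, so it needs flattening, e.g. np.asarray(adata.X.sum(axis=1)).ravel().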
