Should n_counts reflect pre- or post–gene-symbol-to-Ensembl conversion when unmapped genes are removed?

#586

by suanzaoren - opened 3 days ago

Hi,

I'm preparing an h5ad file for Geneformer tokenization. My data originally uses gene symbols. I mapped them to Ensembl IDs manually, but some symbols could not be mapped and were removed from the matrix. As a result, the per-cell total counts decreased.

Should I set adata.obs['n_counts'] to:

1. The original total counts before Ensembl conversion and gene removal, or
2. The new total counts after conversion (adata.X.sum(axis=1) on the filtered matrix)?

From tokenizer.py, normalization uses obs['n_counts'] directly and does not recompute it from the matrix. I'm unsure which value is correct after manual gene filtering.

Also, is manual pre-filtering of unmapped genes recommended, or should I keep all genes and let TranscriptomeTokenizer handle mapping via the built-in gene_mapping_dict?

Thank you!

ctheodoris

Owner 3 days ago

Thank you for your question. The n_counts should be the total counts including all genes, i.e. the per-cell totals from before any unmapped genes were removed. As long as n_counts includes all genes, it does not matter if unmapped genes are removed before tokenization, though it's not necessary to remove them beforehand. The TranscriptomeTokenizer will only tokenize genes in the token dictionary, and by default will merge Ensembl IDs that map to the same gene.
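
For readers who do filter manually, the order of operations described above could be sketched roughly as follows: compute the per-cell totals over all genes first, then drop the unmapped columns. This is a minimal NumPy-only illustration, not code from the Geneformer repository; the function name `filter_genes_keep_total_counts`, the toy matrix, and the `symbol_to_ensembl` dictionary are assumptions made up for the example.

```python
import numpy as np


def filter_genes_keep_total_counts(counts, gene_symbols, symbol_to_ensembl):
    """Drop unmapped genes but keep per-cell totals computed over all genes.

    counts: (n_cells, n_genes) raw count matrix (dense here for simplicity).
    gene_symbols: gene symbols, one per column of `counts`.
    symbol_to_ensembl: hypothetical dict mapping symbol -> Ensembl ID.
    """
    counts = np.asarray(counts)
    # Total counts per cell over ALL genes, taken before any filtering.
    n_counts = counts.sum(axis=1)
    # Boolean mask of columns whose symbol has an Ensembl ID.
    keep = np.array([s in symbol_to_ensembl for s in gene_symbols], dtype=bool)
    ensembl_ids = [symbol_to_ensembl[s] for s in gene_symbols if s in symbol_to_ensembl]
    return counts[:, keep], ensembl_ids, n_counts


# Toy data: 2 cells x 3 genes, the middle gene has no Ensembl mapping.
counts = np.array([[5, 4, 3],
                   [1, 2, 7]])
symbols = ["GAPDH", "UNMAPPED1", "ACTB"]
mapping = {"GAPDH": "ENSG00000111640", "ACTB": "ENSG00000075624"}

filtered, ensembl_ids, n_counts = filter_genes_keep_total_counts(counts, symbols, mapping)
print(n_counts)              # totals over all genes
print(filtered.sum(axis=1))  # smaller totals after filtering; not what n_counts should hold
```

With an AnnData object, the analogous step would presumably be to assign adata.obs['n_counts'] from adata.X.sum(axis=1) before subsetting the genes, so that the stored value still reflects all genes after the unmapped ones are gone.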

ctheodoris changed discussion status to closed 3 days ago

URL: https://huggingface.co/ctheodoris/Geneformer/discussions/586