
URL: https://huggingface.co/ChatterjeeLab/SF-Cluster


To run the demo notebook from Hugging Face: open https://colab.research.google.com/ → File → Open notebook → URL tab → paste https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb

Demo

A self-contained, CPU-only Colab notebook is provided at examples/SF_Cluster_Demo.ipynb. It installs the package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix), and takes on the order of 2 minutes to run.

SF-Cluster (workshop OSS release)

Frustration-guided MSA subset builders for AlphaFold2 multi-conformer prediction. This is the open-source workshop distribution of two subset methods from the SF-Cluster benchmark:

  • mosaic: each subset mixes high / mid / low contrast-FI sequences.
  • gradient: each subset is homogeneous within a contrast-FI quartile.

The contrast score is computed from a per-residue Frustration Index (FI) matrix produced by FrustrAI-Seq (HF model: leuschj/FrustrAI-Seq).

This package is dependency-light (numpy, scipy), provides a CLI, and is designed to be a drop-in replacement for random / uniform MSA subsampling in AF-Cluster-style pipelines.

Algorithm

Given a filtered MSA A of N sequences over L match-state columns, and a per-residue FI matrix F ∈ ℝ^{N×L}:

  1. Column variance: v_l = Var_i(F_{i,l}) over sequences.
  2. High-variance mask: HV = {l : v_l ≥ percentile(v, 80)}, LV = ¬HV.
  3. Contrast score per sequence:
    contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}

  4. Mosaic (N_SUBSETS = 12, TARGET_SIZE = 32): sort pool by contrast_hvlv, tri-stratify into low/mid/high terciles; for each subset s ∈ {0..11}, draw 11 high + 11 low + 10 mid with np.random.default_rng(seed=s).
  5. Gradient (N_SUBSETS = 12, TARGET_SIZE = 32): split sorted pool into 4 quartiles; for each bin b ∈ {0..3} and s ∈ {0..2} draw 32 sequences from that bin only with np.random.default_rng(seed=10*b + s).
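The five steps above can be sketched in NumPy as follows. This is a minimal illustration, not the package's implementation: the function names ending in _sketch, the _draw helper, the use of population variance, the order of draws within a mosaic subset, the equal-size binning via np.array_split, and the with-replacement fallback for small bins in mosaic are all assumptions made for the example.

```python
import numpy as np


def contrast_hvlv_sketch(fi, hv_percentile=80.0):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    fi = np.asarray(fi, dtype=float)
    col_var = fi.var(axis=0)  # step 1 (population variance is an assumption)
    hv = col_var >= np.percentile(col_var, hv_percentile)  # step 2: HV mask
    lv = ~hv
    return fi[:, hv].mean(axis=1) - fi[:, lv].mean(axis=1)  # step 3


def _draw(rng, candidates, k):
    # Assumption: sample with replacement only when the bin holds fewer than k items.
    return rng.choice(candidates, size=k, replace=len(candidates) < k)


def mosaic_sketch(score, n_subsets=12, n_high=11, n_low=11, n_mid=10):
    """Step 4: every subset mixes the high, low and mid terciles."""
    order = np.argsort(score, kind="stable")  # ascending contrast
    low, mid, high = np.array_split(order, 3)
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = [_draw(rng, high, n_high), _draw(rng, low, n_low), _draw(rng, mid, n_mid)]
        subsets.append(np.concatenate(picks).tolist())
    return subsets


def gradient_sketch(score, n_bins=4, per_bin=3, size=32):
    """Step 5: every subset comes from a single contrast quartile."""
    order = np.argsort(score, kind="stable")
    subsets = []
    for b, bin_idx in enumerate(np.array_split(order, n_bins)):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, bin_idx, size).tolist())
    return subsets


# Toy data standing in for a real FrustrAI-Seq FI matrix.
fi_demo = np.random.default_rng(0).normal(size=(200, 50))
score_demo = contrast_hvlv_sketch(fi_demo)
mosaic_demo = mosaic_sketch(score_demo)
gradient_demo = gradient_sketch(score_demo)
print(len(mosaic_demo), len(mosaic_demo[0]), len(gradient_demo))
```

Note that if every column has the same variance, the HV mask covers all columns and the LV mean is undefined; a real implementation would need to handle that case.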

Install

pip install -e .

Python ≥ 3.10. Dependencies: numpy, scipy.

Inputs

You need two files per case:

  1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters are preserved verbatim in output subsets; only match-state (uppercase) columns are scored.
  2. A per-residue FI matrix .npy of shape (N_seq, L), where N_seq is the number of sequences in the A3M and L is the number of match-state columns.

The FI matrix is produced by FrustrAI-Seq. We do not bundle weights; see https://github.com/leuschj/FrustrAI-Seq (model card: https://huggingface.co/leuschj/FrustrAI-Seq) for inference instructions. A reference usage pattern is documented in examples/run_demo.sh.
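Before building subsets, it can help to confirm that the two inputs agree on shape. The sketch below is a standalone check, not part of the package: read_a3m, match_length, and check_inputs are hypothetical helper names. It assumes the usual A3M conventions, where lowercase letters (and "." padding) mark insertions and everything else, including "-", occupies a match-state column.

```python
import pathlib
import tempfile

import numpy as np


def read_a3m(path):
    """Return (headers, sequences); '#' comment lines and blank lines are skipped."""
    headers, seqs = [], []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            if line.startswith(">"):
                headers.append(line[1:])
                seqs.append("")
            elif seqs:
                seqs[-1] += line  # tolerate wrapped sequence lines
    return headers, seqs


def match_length(seq):
    """Count match-state columns: everything except lowercase insertions and '.'."""
    return sum(1 for c in seq if not c.islower() and c != ".")


def check_inputs(a3m_path, fi_path):
    """Raise ValueError unless the FI matrix has shape (N_seq, L)."""
    _, seqs = read_a3m(a3m_path)
    fi = np.load(fi_path)
    lengths = {match_length(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent match-state lengths: {sorted(lengths)}")
    expected = (len(seqs), lengths.pop())
    if fi.shape != expected:
        raise ValueError(f"FI matrix shape {fi.shape} does not match A3M {expected}")
    return expected


# Tiny synthetic example: two sequences, five match-state columns.
with tempfile.TemporaryDirectory() as tmp:
    a3m_path = pathlib.Path(tmp) / "demo.a3m"
    fi_path = pathlib.Path(tmp) / "demo_fi.npy"
    a3m_path.write_text(">query\nMKV-L\n>hit1\nMkvKVAL\n")
    np.save(fi_path, np.zeros((2, 5)))
    demo_shape = check_inputs(a3m_path, fi_path)
print(demo_shape)
```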

CLI

sf-cluster build \
 --a3m path/to/filtered.a3m \
 --fi path/to/fi_matrix.npy \
 --method mosaic \
 --n-subsets 12 \
 --subset-size 32 \
 --seed 20260422 \
 --out subsets/kaib_mosaic/

Outputs:

subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv # subset_id, pool_index, header, score
└── mosaic_meta.json # provenance + score stats
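
The index file can be regrouped by subset for downstream bookkeeping. The sketch below assumes a tab-separated file with the four columns named in the listing above and a single header row; neither the header row nor the exact column order is confirmed here, so has_header and COLUMNS are adjustable assumptions, and load_subset_index is a hypothetical helper name.

```python
import csv
import pathlib
import tempfile
from collections import defaultdict

# Assumed column order, taken from the comment in the output listing.
COLUMNS = ("subset_id", "pool_index", "header", "score")


def load_subset_index(path, has_header=True):
    """Map subset_id -> list of (pool_index, score) pairs."""
    members = defaultdict(list)
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the assumed header row
        for row in reader:
            if not row:
                continue
            rec = dict(zip(COLUMNS, row))
            members[rec["subset_id"]].append((int(rec["pool_index"]), float(rec["score"])))
    return dict(members)


# Synthetic index file for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = pathlib.Path(tmp) / "mosaic_subset_index.tsv"
    tsv_path.write_text(
        "subset_id\tpool_index\theader\tscore\n"
        "0\t17\tseqA\t0.25\n"
        "0\t3\tseqB\t-0.5\n"
        "1\t17\tseqA\t0.25\n"
    )
    index_demo = load_subset_index(tsv_path)
print(index_demo)
```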

Library

from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient

pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix) # (N,) per-sequence
subsets = method_mosaic(score) # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)

Each subset is a list of indices into pool.headers / pool.sequences.
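
Since each subset is just a list of indices, writing subsets to disk from the library API could look like the sketch below. write_subsets and its prefix argument are hypothetical and not part of sf_cluster; the helper takes plain lists so it can be fed pool.headers and pool.sequences. Whether stored headers include the leading ">" is not stated here, so the code handles both cases.

```python
import pathlib
import tempfile


def write_subsets(headers, sequences, subsets, out_dir, prefix="subset"):
    """Write one A3M file per subset; sequences are emitted verbatim."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for k, indices in enumerate(subsets):
        path = out_dir / f"{prefix}_{k:03d}.a3m"
        with open(path, "w") as handle:
            for i in indices:
                header = headers[i]
                if not header.startswith(">"):
                    header = ">" + header  # add the marker only if it is missing
                handle.write(f"{header}\n{sequences[i]}\n")
        paths.append(path)
    return paths


# Synthetic pool for illustration only.
demo_headers = ["query", "hit1", "hit2"]
demo_sequences = ["MKVL", "MkKVL", "M-VL"]
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = write_subsets(demo_headers, demo_sequences, [[0, 2], [1]], tmp)
    first_text = demo_paths[0].read_text()
    n_files = len(demo_paths)
print(first_text)
```

With the real package this would be called as write_subsets(pool.headers, pool.sequences, subsets, "subsets/custom/"), assuming those attributes are list-like as described above.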

Reproducibility

All RNG draws use np.random.default_rng(seed=...) with method-specific deterministic seeds (see Algorithm, steps 4 and 5). Re-running on the same A3M + FI matrix yields byte-identical subset assignments. The CLI also records a provenance JSON ({method}_meta.json) capturing inputs, sizes, and the package version.
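
One way to check this property on your own runs is to fingerprint the subset assignments and compare hashes. The sketch below uses a stand-in draw function rather than the package's builders; draw_subset and fingerprint are hypothetical names. It also assumes both runs use the same NumPy version, since NumPy does not promise that Generator methods produce identical streams across releases.

```python
import hashlib
import json

import numpy as np


def draw_subset(pool_size, size, seed):
    """Stand-in for a subset builder: a seeded draw without replacement."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=size, replace=False).tolist()


def fingerprint(subsets):
    """SHA-256 over a canonical JSON encoding of the index lists."""
    payload = json.dumps(subsets, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


run_a = [draw_subset(500, 32, seed=s) for s in range(12)]
run_b = [draw_subset(500, 32, seed=s) for s in range(12)]
same = fingerprint(run_a) == fingerprint(run_b)
print(same)
```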

Limitations

  • No frustration model included. You must run FrustrAI-Seq separately to obtain the (N_seq, L) FI matrix. This package only handles the scoring + subset-construction stage.
  • No AF2 runner included. The package emits A3M files; downstream inference (AF2 / ColabFold) is the user's responsibility.
  • Only mosaic and gradient arms are open-sourced here. The other SF-Cluster arms (region_cluster, contrast_nc) require additional feature pipelines and are intentionally excluded from this workshop release.
  • No re-sampling guarantee across subsets. A sequence can appear in multiple subsets (gradient draws from a single quartile with replacement if the quartile is smaller than subset_size).
  • Empirical caveat (read this). A controlled comparison shows that uniform subsampling performs equivalently on most Main-21 cases; see the paper for the boundary conditions under which contrast-FI stratification yields a measurable lift over random subsampling. Treat this package as a research baseline, not a turnkey accuracy improvement.
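
Given the last two caveats, it is worth running a uniform baseline alongside the stratified subsets and checking how often sequences recur. The sketch below is one possible baseline, not the one used in the paper: uniform_subsets, reuse_counts, and the base_seed scheme are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def uniform_subsets(pool_size, n_subsets=12, size=32, base_seed=0):
    """Draw each subset uniformly from the whole pool, ignoring contrast scores."""
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=base_seed + s)
        picks = rng.choice(pool_size, size=size, replace=pool_size < size)
        subsets.append(picks.tolist())
    return subsets


def reuse_counts(subsets):
    """Map k -> number of sequences that appear in exactly k subsets."""
    per_sequence = Counter(i for subset in subsets for i in set(subset))
    return Counter(per_sequence.values())


baseline_demo = uniform_subsets(pool_size=400)
reuse_demo = reuse_counts(baseline_demo)
print(sorted(reuse_demo.items()))
```

The same reuse_counts call works on the index lists returned by the mosaic or gradient builders, which makes it easy to compare overlap between the stratified subsets and the uniform baseline.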

Citation

If you use this code, please cite the SF-Cluster paper (forthcoming) and FrustrAI-Seq.

License

MIT. See LICENSE.
