To run the demo notebook from Hugging Face: open https://colab.research.google.com/ → File → Open notebook → URL tab → paste https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb

Demo

A self-contained, CPU-only Colab notebook is provided at examples/SF_Cluster_Demo.ipynb. It installs the package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix, ... 2 minutes).

SF-Cluster (workshop OSS release)

Frustration-guided MSA subset builders for AlphaFold2 multi-conformer prediction. This is the open-source workshop distribution of two subset methods from the SF-Cluster benchmark:

mosaic — each subset mixes high / mid / low contrast-FI sequences.
gradient — each subset is homogeneous within a contrast-FI quartile.

The contrast score is computed from a per-residue Frustration Index (FI) matrix produced by FrustrAI-Seq (HF model: leuschj/FrustrAI-Seq).

This package is dependency-light (numpy, scipy), provides a CLI, and is designed to be a drop-in replacement for random / uniform MSA subsampling in AF-Cluster-style pipelines.

Algorithm

Given a filtered MSA A of N sequences over L match-state columns, and a per-residue FI matrix F ∈ ℝ^{N×L}:

1. Column variance: v_l = Var_i(F_{i,l}) over sequences.
2. High-variance mask: HV = {l : v_l ≥ percentile(v, 80)}, LV = ¬HV.
3. Contrast score per sequence:

   contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}

4. Mosaic (N_SUBSETS = 12, TARGET_SIZE = 32): sort pool by contrast_hvlv, tri-stratify into low/mid/high terciles; for each subset s ∈ {0..11}, draw 11 high + 11 low + 10 mid with np.random.default_rng(seed=s).
5. Gradient (N_SUBSETS = 12, TARGET_SIZE = 32): split sorted pool into 4 quartiles; for each bin b ∈ {0..3} and s ∈ {0..2} draw 32 sequences from that bin only with np.random.default_rng(seed=10*b + s).
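The steps above can be sketched in plain NumPy. This is an illustrative reading of the description, not the package's implementation: the function names (contrast_score, mosaic_subsets, gradient_subsets, _draw) are hypothetical, and the choice of population variance, the use of np.array_split for terciles and quartiles, and the order of the high/low/mid draws are all assumptions.

```python
import numpy as np


def contrast_score(fi, hv_percentile=80.0):
    """Steps 1-3: HV-minus-LV mean FI for every sequence (row) of fi."""
    fi = np.asarray(fi, dtype=float)
    col_var = fi.var(axis=0)  # v_l; population variance (ddof=0) is an assumption
    hv = col_var >= np.percentile(col_var, hv_percentile)
    lv = ~hv
    if not lv.any():
        raise ValueError("no low-variance columns; FI matrix has too little variation")
    return fi[:, hv].mean(axis=1) - fi[:, lv].mean(axis=1)


def _draw(rng, pool, k):
    # Sample without replacement; fall back to replacement only if the pool is too small.
    return rng.choice(pool, size=k, replace=len(pool) < k)


def mosaic_subsets(score, n_subsets=12, n_high=11, n_low=11, n_mid=10):
    """Step 4: each subset mixes high, low and mid contrast sequences."""
    order = np.argsort(score, kind="stable")
    low, mid, high = np.array_split(order, 3)  # tercile split (assumed)
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = [_draw(rng, high, n_high), _draw(rng, low, n_low), _draw(rng, mid, n_mid)]
        subsets.append(np.concatenate(picks).tolist())
    return subsets


def gradient_subsets(score, n_bins=4, per_bin=3, size=32):
    """Step 5: each subset is drawn from a single contrast quartile."""
    order = np.argsort(score, kind="stable")
    subsets = []
    for b, members in enumerate(np.array_split(order, n_bins)):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, members, size).tolist())
    return subsets


# Synthetic FI matrix standing in for real FrustrAI-Seq output.
fi = np.random.default_rng(0).normal(size=(200, 50))
score = contrast_score(fi)
mosaic = mosaic_subsets(score)
gradient = gradient_subsets(score)
```

With 200 synthetic sequences, both calls return 12 index lists of length 32; the first three gradient subsets come only from the lowest-contrast quartile.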

Install

pip install -e .

Python ≥ 3.10. Dependencies: numpy, scipy.

Inputs

You need two files per case:

A filtered A3M file (ColabFold-style). Lowercase insertion-state letters are preserved verbatim in output subsets; only match-state (uppercase) columns are scored.
A per-residue FI matrix .npy of shape (N_seq, L), where N_seq is the number of sequences in the A3M and L is the number of match-state columns.
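Before building subsets it can help to confirm that the two files agree. The sketch below is a minimal standalone check, not part of the package: read_a3m, match_columns and check_inputs are hypothetical helpers, and it assumes that match-state columns are the uppercase letters plus '-' gap characters and that lines starting with '#' (as ColabFold writes at the top of an A3M) can be skipped.

```python
import numpy as np


def read_a3m(lines):
    """Parse A3M text lines into parallel lists of headers and sequences."""
    headers, seqs = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and ColabFold-style comment lines (assumption)
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif headers:
            seqs[-1] += line
    return headers, seqs


def match_columns(seq):
    """Keep uppercase residues and '-' gaps; drop lowercase insertion letters."""
    return "".join(c for c in seq if c.isupper() or c == "-")


def check_inputs(seqs, fi):
    """Raise ValueError if the FI matrix shape does not match the alignment."""
    if fi.ndim != 2:
        raise ValueError(f"FI matrix must be 2-D, got shape {fi.shape}")
    n_seq, n_cols = fi.shape
    if len(seqs) != n_seq:
        raise ValueError(f"{len(seqs)} sequences in A3M but {n_seq} rows in FI matrix")
    lengths = {len(match_columns(s)) for s in seqs}
    if lengths != {n_cols}:
        raise ValueError(f"match-state lengths {sorted(lengths)} != FI columns {n_cols}")


# Tiny in-memory example; real use would pass open("filtered.a3m") and np.load(...).
demo_lines = [">query", "ACDE", ">s1", "A-DaaE", ">s2", "AC-E"]
demo_headers, demo_seqs = read_a3m(demo_lines)
check_inputs(demo_seqs, np.zeros((3, 4)))
```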

The FI matrix is produced by FrustrAI-Seq. We do not bundle weights — see https://github.com/leuschj/FrustrAI-Seq (model card: https://huggingface.co/leuschj/FrustrAI-Seq) for inference instructions. A reference usage pattern is documented in examples/run_demo.sh.

CLI

sf-cluster build \
 --a3m path/to/filtered.a3m \
 --fi path/to/fi_matrix.npy \
 --method mosaic \
 --n-subsets 12 \
 --subset-size 32 \
 --seed 20260422 \
 --out subsets/kaib_mosaic/

Outputs:

subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv # subset_id, pool_index, header, score
└── mosaic_meta.json # provenance + score stats
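The index file can be regrouped by subset for downstream analysis. The following is a rough sketch only: load_subset_index is a hypothetical helper, and it assumes the TSV is tab-separated with the four columns in the order listed above (subset_id, pool_index, header, score). Whether the file carries a header row is not stated, so the sketch skips one if present.

```python
import csv
import io
from collections import defaultdict


def load_subset_index(handle):
    """Group index rows by subset_id -> list of (pool_index, header, score)."""
    members = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == "subset_id":
            continue  # skip blank lines and an optional header row
        subset_id, pool_index, header, score = row[:4]
        members[subset_id].append((int(pool_index), header, float(score)))
    return dict(members)


# Made-up rows for illustration; real use would pass open(".../mosaic_subset_index.tsv").
demo_tsv = io.StringIO(
    "subset_id\tpool_index\theader\tscore\n"
    "0\t17\tseqA\t0.25\n"
    "0\t3\tseqB\t-0.5\n"
    "1\t17\tseqA\t0.25\n"
)
index = load_subset_index(demo_tsv)
```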

Library

from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient

pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix) # (N,) per-sequence
subsets = method_mosaic(score) # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)

Each subset is a list of indices into pool.headers / pool.sequences.
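Turning one index list back into A3M text is then a simple lookup. The sketch below works on plain Python lists standing in for pool.headers and pool.sequences; format_subset_a3m is a hypothetical helper, and whether the stored headers include the leading '>' is unknown, so both cases are handled.

```python
def format_subset_a3m(headers, sequences, indices):
    """Return A3M text for the selected indices, in the order given."""
    records = []
    for i in indices:
        header = headers[i]
        if not header.startswith(">"):
            header = ">" + header
        # Sequences are emitted verbatim, so lowercase insertions are preserved.
        records.append(f"{header}\n{sequences[i]}\n")
    return "".join(records)


# Stand-in data; with the package these would come from the pooled MSA.
demo_headers = ["query", "s1", "s2"]
demo_sequences = ["ACDE", "A-DaaE", "AC-E"]
a3m_text = format_subset_a3m(demo_headers, demo_sequences, [2, 0])
```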

Reproducibility

All RNG draws use np.random.default_rng(seed=...) with method-specific deterministic seeds (see the Mosaic and Gradient steps under Algorithm). Re-running on the same A3M and FI matrix yields byte-identical subset assignments. The CLI also records a provenance JSON ({method}_meta.json) capturing inputs, sizes, and the package version.
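A small illustration of why seeded generators make the draws repeatable, and one way to compare two runs. The helper names (draw, subsets_digest) are hypothetical and the digest is just one possible check, not something the package is known to emit.

```python
import hashlib
import json

import numpy as np


def draw(seed, pool_size=100, k=32):
    """Draw k distinct indices from range(pool_size) with a freshly seeded generator."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=k, replace=False).tolist()


def subsets_digest(subsets):
    """SHA-256 over the JSON encoding of the index lists, for comparing runs."""
    return hashlib.sha256(json.dumps(subsets).encode("utf-8")).hexdigest()


run_a = [draw(s) for s in range(12)]
run_b = [draw(s) for s in range(12)]
same = subsets_digest(run_a) == subsets_digest(run_b)
```

Here same is True because each draw starts from an identically seeded generator. NumPy does not promise identical Generator streams across releases, so pinning the NumPy version is a sensible precaution when comparing runs made in different environments.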

Limitations

No frustration model included. You must run FrustrAI-Seq separately to obtain the (N_seq, L) FI matrix. This package only handles the scoring + subset-construction stage.
No AF2 runner included. The package emits A3M files; downstream inference (AF2 / ColabFold) is the user's responsibility.
Only mosaic and gradient arms are open-sourced here. The other SF-Cluster arms (region_cluster, contrast_nc) require additional feature pipelines and are intentionally excluded from this workshop release.
No disjointness guarantee across subsets. A sequence can appear in multiple subsets, and gradient draws from a single quartile with replacement if the quartile is smaller than subset_size, so a sequence can then also repeat within one subset.
Empirical caveat (read this). A controlled comparison shows that uniform subsampling performs equivalently on most Main-21 cases; see the paper for the boundary conditions under which contrast-FI stratification yields a measurable lift over random subsampling. Treat this package as a research baseline, not a turnkey accuracy improvement.

Citation

If you use this code, please cite the SF-Cluster paper (forthcoming) and FrustrAI-Seq.

License

MIT. See LICENSE.


URL: https://huggingface.co/ChatterjeeLab/SF-Cluster