How Post-Training Shapes
Biological Reasoning Models

Lukas Fesser
Harvard University
lukas_fesser@g.harvard.edu
Hanlin Zhang^∗
Harvard University
hanlinzhang@g.harvard.edu
Michelle M. Li
Harvard University
michelleli@g.harvard.edu
Eric Wang
Google DeepMind
ericzwang@google.com
Bryan Perozzi
Google Research
bperozzi@google.com
Shekoofeh Azizi
Google DeepMind
shekazizi@google.com
Sham M. Kakade
Harvard University
sham@seas.harvard.edu
Marinka Zitnik
Harvard University
marinka@hms.harvard.edu
^∗ Equal contribution.

Abstract

Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages. Code is available at https://github.com/mims-harvard/bio-posttrain, and selected model checkpoints are available at https://huggingface.co/collections/mims-harvard/bio-posttrain.

1 Introduction

Biology is becoming a central testbed for scientific reasoning models. Recent systems combine language models with biological foundation models trained on DNA, RNA, proteins, and other molecular data [22, 42, 23]. Their predictions require mapping natural-language task descriptions to molecular representations, integrating modality-specific evidence, and carrying intermediate biological state across multiple inference steps. Post-training is widely used to build such models, but its effects across training stages remain poorly understood [65]: despite strong empirical gains, it is unclear how each stage shapes reasoning and generalization.

Recent work has explored new forms of supervision, scaling strategies, and training objectives, including reinforcement learning for reasoning [31, 87, 94], large-scale post-training datasets [24, 30], and domain-specific adaptation pipelines [71, 93]. Other studies examine how reward design [79, 96], self-improving and world-model-based approaches [12, 89, 77], and training dynamics [11, 8, 9] influence model behavior. While these approaches improve task performance, they provide limited insight into how individual post-training stages affect generalization.

Biology provides a particularly stringent test of generalization. In mathematics and code, many out-of-domain problems retain the same underlying structure as the training examples, even when surface details change. In biology, unseen pathways, diseases, species, and perturbations often involve different mechanisms and biological processes [90, 76]. As a result, high benchmark performance does not necessarily indicate robust biological reasoning [20, 67, 21]. Models that perform well on familiar benchmarks may fail when transferred to new biological systems [49, 2]. Additional post-training or larger models can therefore increase in-domain performance without improving biological generalization.

Figure 1: Training dynamics define distinct generalization regimes in biological reasoning models. We compare backbone choice, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL) across genomics, transcriptomics, and protein tasks, and evaluate each stage on biologically meaningful in-domain (ID) and out-of-domain (OOD) splits.

It remains unclear when additional post-training improves biological generalization rather than primarily increasing fit to the training distribution. Existing studies typically examine one modality, one benchmark family, or one post-training stage at a time. Post-training itself is a sequence of stages, including continued pre-training, supervised fine-tuning, and reinforcement learning [86, 50, 44, 25, 27, 81, 56]. These stages may interact in non-obvious ways, and gains from one stage may depend on the stages that precede or follow it. Yet existing models often differ simultaneously in backbone, data, scale, and supervision, making controlled comparisons difficult. Most studies also focus on final performance rather than training dynamics, and out-of-domain evaluation in biology remains limited and inconsistently defined.

Present work.

We present a controlled study of post-training in biological reasoning models. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models to examine when post-training improves biological generalization and when it primarily increases fit to the training distribution. Using matched model families, tasks, and data settings, we isolate the effects of backbone choice, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL) on both in-domain (ID) and out-of-domain (OOD) performance. We find that the same post-training budget can produce different generalization regimes depending on how it is allocated across stages: SFT increases ID performance but narrows OOD robustness, RL strengthens OOD performance when initialized from strong SFT checkpoints, and CPT improves downstream adaptation.

Our contributions are conceptual, empirical, and practical. Conceptually, we show that biological reasoning does not improve monotonically with additional post-training. Empirically, we present a controlled study across genomics, transcriptomics, and proteins that reveals consistent training dynamics across biological modalities. Practically, we derive design principles for post-training under limited compute. The strongest ID-OOD trade-off comes from combining brief SFT with larger RL allocations and allocating adaptation capacity asymmetrically across stages.

2 Background and Related Work

Biological Foundation Models and Multimodal Reasoning.

Recent work has extended foundation modeling to a wide range of biological data [75], including DNA [43, 98, 57, 7, 18, 5, 15, 37], RNA [26, 10], gene expression [63, 71, 16, 85, 1, 34, 41], and proteins [53, 35, 58, 3]. Much of this literature focuses on learning representations or solving predictive tasks within a single modality, for example by modeling sequences or cellular profiles directly. More recent systems combine language models with biological inputs and structured context [42, 22, 67, 72], enabling tasks that more closely resemble reasoning over pathways, cell states, or protein function. Despite this progress, the literature remains fragmented. Models differ substantially in architecture, modality, training data, supervision format, and evaluation protocol, making it difficult to compare results or isolate which components drive performance. While recent biological foundation models provide a basis for multimodal reasoning, the field still lacks a unified view of how training choices affect downstream behavior, particularly under distribution shift.

Training Dynamics in LLMs: Training Stages Are Not Uniformly Additive.

Modern language models are developed through a sequence of training stages rather than optimized in a single pass. A base model is first selected or pretrained, then adapted through domain-specific continued pre-training, task-specific supervised fine-tuning, and, in some cases, reinforcement learning [60]. These stages serve distinct roles. Continued pre-training expands coverage of domain vocabulary and structure while preserving general capabilities [66]. Supervised fine-tuning specializes the model to task formats and target behaviors, often yielding large gains on benchmark tasks. Reinforcement learning further aligns outputs with task-dependent reward signals, improving performance but increasing sensitivity to reward design and sampling [51].

Recent work suggests that these stages are not uniformly additive [33, 47, 36]. Performance gains can saturate or reverse as additional training is applied, and improvements on targeted evaluations need not translate to better generalization. Pythia [6] enables fine-grained analysis of training dynamics across checkpoints and scales, while EvoLM [65] characterizes non-monotonic behavior across training stages. Fine-tuning scaling laws further show that the interaction between model size, pretraining data, and supervision depends on the adaptation method [95]. A growing body of work has studied how to conduct continued pre-training effectively: Gupta et al. [32] investigate learning rate re-warming schedules, Ibrahim et al. [40] propose scalable data mixing and replay strategies, Parmar et al. [61] characterize when mid-training transfers general capabilities versus degrading them, and Ke et al. [48] introduce soft-masking to mitigate catastrophic forgetting. Cheng et al. [13] further show that reformatting domain corpora as reading comprehension during continued pre-training yields stronger downstream performance than raw text exposure. Together with recent large-scale systems such as Composer 2 [70], these studies indicate that continued pre-training often serves as a critical transition stage enabling effective downstream adaptation [14], while supervised fine-tuning and reinforcement learning can produce strong task gains at the cost of reduced robustness if scaled or applied incautiously [88, 92].

3 Experimental Setup: Tasks, Data, and Training Stages

We design the experimental setup to isolate post-training effects while keeping the biological tasks, model families, and evaluation splits comparable across modalities. We first define the reasoning tasks and ID/OOD splits, then describe the model architecture and post-training pipeline.

3.1 Biological Reasoning Tasks and Evaluation Splits

We evaluate post-training across three domains: DNA, RNA, and proteins. Each task combines natural-language context with modality-specific biological inputs and uses ID and OOD splits. Dataset details and example prompts are provided in Appendix A.

Pathway Prediction.

Pathway prediction asks the model to infer how a genetic variant propagates through a molecular pathway to produce a disease phenotype. We use the KEGG-derived benchmark introduced in BioReason [22], which evaluates mechanistic reasoning over pathway structure rather than variant classification alone. Each example combines a reference DNA sequence, a variant DNA sequence, and textual pathway and gene context; the model generates a natural-language answer grounded in both sequence and pathway information. We define ID and OOD splits by pathway network, so OOD examples come from previously unseen molecular networks.

Drug Target Identification.

Target identification asks the model to choose the most promising therapeutic target for a disease and cell type. We adapt the cell-type-specific target nomination benchmark from MEDEA [78], simplifying it from an agentic tool-use setting to a fixed-input reasoning task. The model receives a natural-language description of the disease, cell type, and candidate genes, together with TranscriptFormer embeddings [63] for five candidates in normal and disease states, and identifies the best-supported target. We use four diseases for training and ID evaluation, and reserve hepatoblastoma as the OOD disease.

Protein Function Prediction.

Protein function prediction asks the model to infer the function of an uncharacterized protein from multimodal evidence. We build on the curated UniProt-based dataset introduced in BioReason-Pro [23], which pairs experimentally supported GO annotations with protein-level context. Each example provides protein embeddings and text context, including organism, InterPro domain annotations [62], and protein–protein interactions. The model predicts protein function from this combined representation. We split data by species, with two held-out species forming the OOD test set, and remove the ontology graph inputs used in the original BioReason-Pro setup to align the task format with our study.
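The three split rules above share one pattern: examples are partitioned by a grouping key (pathway network, disease, or species) so that no held-out group appears in training. The following is a minimal Python sketch of that pattern. It assumes each example is a dictionary with a hypothetical `group` field; the field name and the helper are illustrative and are not taken from the released data format.

```python
from typing import Dict, Iterable, List, Set, Tuple

Example = Dict[str, str]


def split_by_group(
    examples: Iterable[Example],
    ood_groups: Set[str],
    key: str = "group",  # hypothetical field name, not from the released data
) -> Tuple[List[Example], List[Example]]:
    """Route every example whose group is held out to the OOD set."""
    in_domain: List[Example] = []
    out_of_domain: List[Example] = []
    for example in examples:
        target = out_of_domain if example[key] in ood_groups else in_domain
        target.append(example)
    return in_domain, out_of_domain
```

Splitting on the group rather than on individual examples is what keeps the OOD set free of networks, diseases, or species seen during training.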

3.2 Language Models, Biological Foundation Models, and Post-Training Pipeline

We study biological reasoning through a common post-training pipeline built on general-purpose LLM backbones. Our main experiments use Qwen3-1.7B and Qwen3-4B, two dense models from the Qwen3 family, which support both reasoning-oriented and standard inference modes [91]. We include Gemma 4 E2B as a backbone ablation to test whether the observed training dynamics persist under a different lightweight open model family [29].

To represent biological modalities, we couple the LLM backbone to frozen biological foundation models through trainable projection layers. For DNA tasks, we use Evo2-1B, a genome foundation model trained for genome-scale sequence modeling and design across domains of life [7]. For RNA and transcriptomic tasks, we use TranscriptFormer, a cross-species single-cell model trained over evolutionary-scale transcriptomic data [63]. For protein tasks, we use ESM-3, a protein language model trained to model sequence, structure, and function across evolutionary scales [35].

The post-training pipeline consists of three stages: (1) Continued pre-training: adaptation on general biological text using the standard next-token prediction loss. (2) Supervised fine-tuning: training on task-specific reasoning examples with the autoregressive language-model loss on target responses. (3) Reinforcement learning: optimization from a supervised checkpoint with a task-aligned reward objective, encouraging outputs that maximize task success rather than imitate reference traces. To enable controlled comparisons, we vary one factor at a time. In the main experiments, we scale SFT and RL compute, study the effect of CPT, and include ablations on backbone choice and LoRA rank. Full hyperparameters and implementation details are provided in the appendix.
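Stage (2) computes the loss on target responses only. A common way to do this is to exclude prompt positions from the average negative log-likelihood. The sketch below works on per-token log-probabilities in pure Python; the function name and the mask convention are assumptions made for illustration and do not describe the actual training code.

```python
from typing import Sequence


def response_only_nll(
    token_logprobs: Sequence[float],
    is_response: Sequence[bool],
) -> float:
    """Mean negative log-likelihood over response tokens only.

    token_logprobs[i] is log p(token_i | preceding tokens); is_response[i] is
    True for tokens of the target response and False for prompt tokens.
    """
    if len(token_logprobs) != len(is_response):
        raise ValueError("log-probabilities and mask must have the same length")
    kept = [lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not kept:
        raise ValueError("no response tokens to score")
    return -sum(kept) / len(kept)
```

Setting every mask entry to True recovers the plain next-token loss over all tokens, as used in stage (1).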

We use a model signature to denote each configuration across training stages and biological domains. For example, Q1-P-C-S_8,20-R_16,20 represents a model with the following setup:

• Q1-P: Qwen3-1.7B backbone evaluated in the protein setting. We use D, R, and P to denote DNA, RNA, and protein tasks, respectively. In our notation, blue denotes Qwen3-1.7B models, orange denotes Qwen3-4B models, and green denotes Gemma 4 E2B models.
• C: Continued pre-training on general biological text.
• S_8,20: Eight epochs of supervised fine-tuning on 20,000 task-specific reasoning traces.
• R_16,20: Reinforcement learning for 16 epochs on 20,000 data points.

When a stage is omitted, the model has not undergone that part of the pipeline. For example, Q4-R-S_4,1 denotes a model fine-tuned directly from the Qwen3-4B + TranscriptFormer backbone for 4 epochs on 1,000 reasoning traces in the RNA setting without CPT or RL, while G-R-C-S_8,1 denotes a Gemma + TranscriptFormer model adapted with CPT and then fine-tuned for 8 epochs.
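The signature notation is regular enough to parse mechanically. The sketch below is one possible reading of it: it assumes the stage order backbone, domain, optional C, optional S, optional R, and it treats the second subscript as a data size in thousands, as in the examples above. The `ModelConfig` class and `parse_signature` function are hypothetical helpers, not part of the released code.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

BACKBONES = {"Q1": "Qwen3-1.7B", "Q4": "Qwen3-4B", "G": "Gemma 4 E2B"}
DOMAINS = {"D": "DNA", "R": "RNA", "P": "protein"}

# backbone-domain[-C][-S_epochs,thousands][-R_epochs,thousands]
_SIGNATURE = re.compile(
    r"^(?P<backbone>Q1|Q4|G)-(?P<domain>[DRP])"
    r"(?P<cpt>-C)?"
    r"(?:-S_(?P<sft_epochs>\d+),(?P<sft_k>\d+(?:\.\d+)?))?"
    r"(?:-R_(?P<rl_epochs>\d+),(?P<rl_k>\d+(?:\.\d+)?))?$"
)


@dataclass(frozen=True)
class ModelConfig:
    backbone: str
    domain: str
    cpt: bool
    sft: Optional[Tuple[int, int]]  # (epochs, number of examples)
    rl: Optional[Tuple[int, int]]   # (epochs, number of examples)


def _stage(epochs: Optional[str], thousands: Optional[str]) -> Optional[Tuple[int, int]]:
    if epochs is None or thousands is None:
        return None  # stage omitted from the pipeline
    return int(epochs), round(float(thousands) * 1000)


def parse_signature(signature: str) -> ModelConfig:
    match = _SIGNATURE.match(signature)
    if match is None:
        raise ValueError(f"unrecognized model signature: {signature!r}")
    return ModelConfig(
        backbone=BACKBONES[match["backbone"]],
        domain=DOMAINS[match["domain"]],
        cpt=match["cpt"] is not None,
        sft=_stage(match["sft_epochs"], match["sft_k"]),
        rl=_stage(match["rl_epochs"], match["rl_k"]),
    )
```

Under these assumptions, `parse_signature("Q4-R-S_4,1")` yields a Qwen3-4B RNA configuration with no CPT, four SFT epochs on 1,000 examples, and no RL stage.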

4 Results: How Training Stages Shape Biological Reasoning in LLMs

We now present our main results on scaling post-training for biology in compute- or data-bound settings. Model settings for all experiments in this section, including context windows, input sequence lengths, and other hyperparameters, are provided in the appendix.

Figure 2: Supervised fine-tuning improves in-domain performance but reduces out-of-domain robustness. As SFT compute increases, ID performance continues to improve, while OOD performance peaks early and declines, indicating over-specialization to the training data. DNA/RNA: mean and std. over 3 random seeds.

4.1 Supervised Fine-Tuning Increases Accuracy but Narrows Generalization

We begin by studying how supervised fine-tuning scales in biological reasoning models using pretrained Qwen3-1.7B and Qwen3-4B backbones [91].

Fixed data, variable compute.

We first consider a data-constrained regime based on the DNA and RNA tasks. For each backbone, we train model families of the form Q1-D-S_{{1,2,4,8,16,32},1}, Q4-D-S_{{1,2,4,8,16,32},1}, Q1-R-S_{{1,2,4,8,16,32},1}, and Q4-R-S_{{1,2,4,8,16,32},1}, where the subscript indicates the number of SFT epochs and the use of the full available training set in each domain. We then evaluate both in-domain and out-of-domain performance. Figure 2 reveals a generalization trade-off induced by supervised fine-tuning. The amount of training that maximizes ID performance is consistently larger than the amount that maximizes OOD performance, indicating that continued fine-tuning improves fit to the training distribution after OOD generalization has already peaked. In DNA, for example, Q1-D-S improves its ID accuracy from roughly at 1 epoch to about by 16 epochs, while its OOD accuracy peaks much earlier, around at 2–4 epochs, and then declines to about by 32 epochs. The same pattern appears for the larger Q4-D-S model.
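The family notation with braces denotes a sweep: the inner set lists the values taken by one subscript while the rest of the signature is held fixed. A small sketch of how such a family could be expanded into individual signatures follows. The `expand_family` helper is hypothetical, and it assumes that each innermost brace group is a swept set and that any remaining outer braces only group the two subscripts.

```python
import itertools
import re
from typing import List

_INNER_SET = re.compile(r"\{([^{}]*)\}")  # innermost {a,b,c} group


def expand_family(family: str) -> List[str]:
    """Expand a family such as 'Q1-D-S_{{1,2,4},1}' into concrete signatures."""
    # re.split with one capturing group alternates literal text and set contents.
    parts = _INNER_SET.split(family)
    literals = parts[0::2]
    sweeps = [part.split(",") for part in parts[1::2]]
    signatures = []
    for combo in itertools.product(*sweeps):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.extend([value.strip(), literal])
        # Drop the outer grouping braces that remain after substitution.
        signatures.append("".join(pieces).replace("{", "").replace("}", ""))
    return signatures
```

With this reading, the family Q1-D-S_{{1,2,4,8,16,32},1} expands to six signatures, from Q1-D-S_1,1 to Q1-D-S_32,1.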

The RNA setting shows the same trend. Q1-R-S gains nearly ID accuracy points from 1 to 4 epochs, but while its OOD performance also improves, it comes within at most of ID accuracy. For Q4-R-S, ID accuracy rises from roughly at 1 epoch to about by 16 epochs, whereas OOD accuracy peaks much earlier, around 4 epochs, and then drifts downward by the end of training. This same pattern persists when more supervision is available: on the proteins task, Q1-P-S_{{1,2,4,8,16,32},20} and Q4-P-S_{{1,2,4,8,16,32},20} both continue to improve ID through about 8 epochs, but OOD peaks earlier and then declines. More SFT compute therefore improves in-domain accuracy more reliably than it improves biological generalization.

Fixed compute, variable data.

We next study a compute-constrained regime in proteins by fixing training to a single SFT epoch and varying the amount of supervision. Concretely, we train Q1-P-S_{1,{4,8,12,16,20}} and Q4-P-S_{1,{4,8,12,16,20}}. Figure 3 shows that increasing data is better behaved than increasing epochs on a fixed dataset. For Q1-P-S, scaling from 4K to 20K training examples increases ID from to and OOD from to . For Q4-P-S, the corresponding gains are from about to on ID and from about to just above on OOD. While these gains are significant, they flatten quickly: after roughly 40–60% of the data, both curves improve only marginally.

Figure 3: Increasing data improves generalization more reliably than increasing SFT epochs. Scaling dataset size yields gains in both ID and OOD performance, but with diminishing returns, in contrast to the overfitting behavior observed when scaling epochs.

These results suggest that SFT is a strong driver of in-domain biological reasoning, but that scaling it naively, either through more epochs or more data, does not reliably translate into better OOD performance. Instead, the dominant pattern is over-specialization: the model becomes better at the benchmark distribution while becoming less robust to biological shift. At the same time, the protein results indicate that these two ways of scaling compute are not equivalent. Increasing epochs on a fixed dataset produces the sharper trade-off, with OOD performance peaking early and then declining as the model repeatedly fits the same supervision. Increasing data while holding epochs fixed produces a more stable pattern. ID performance still shows diminishing returns, but OOD performance remains roughly monotonic and then plateaus rather than collapsing. Under a fixed compute budget, data scaling is therefore the more robust strategy when additional supervision is available.

4.2 Reinforcement Learning Recovers Generalization After Fine-Tuning

Figure 4: Reinforcement learning consistently improves out-of-domain robustness. Starting from strong SFT checkpoints, RL increases both ID and OOD performance, with the largest gains in OOD and diminishing returns after the first few epochs.

Scaling RL epochs.

We now ask whether reinforcement learning can recover some of the robustness lost under SFT. Starting from the strongest SFT checkpoints selected on validation performance, we train model families of the form Q1-D-S_8,1-R_{{1,2,4,8,16},1}, Q4-D-S_4,1-R_{{1,2,4,8,16},1}, Q1-R-S_4,1-R_{{1,2,4,8,16},1}, and Q4-R-S_8,1-R_{{1,2,4,8,16},1} in the DNA and RNA settings, together with Q1-P-S_4,20-R_{{1,2,4,8,16},20} and Q4-P-S_4,20-R_{{1,2,4,8,16},20} in the protein setting. Figure 4 shows that, unlike SFT, RL improves both ID and OOD performance quite consistently over the range we study. The gains are not only directional but substantial: in DNA, OOD accuracy rises by about over the RL sweep, and in proteins the OOD improvement is larger still, especially for Q4-P-S_4,20-R_{{1,2,4,8,16},20}, which gains roughly absolute from the first to the best RL checkpoint. Across tasks, the largest gains typically appear in the first few RL epochs, with later epochs yielding smaller additional improvements, especially OOD.
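Section 4.6 notes that the RL stage uses GRPO with multiple rollouts per prompt. The central quantity in GRPO is a group-relative advantage: each rollout's reward is standardized against the other rollouts sampled for the same prompt. The sketch below illustrates only that step. The scalar reward (for example 1 for a correct final answer and 0 otherwise) and the epsilon value are assumptions for illustration, and the clipped policy-gradient and KL terms are omitted.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float],
    eps: float = 1e-6,  # assumed stabilizer for groups with identical rewards
) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]
```

When all rollouts in a group receive the same reward, every advantage is zero and the prompt contributes no policy-gradient signal, which is one reason reward design and the quality of the SFT starting point matter for this stage.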

4.3 Continued Pre-Training Establishes the Foundation for Biological Reasoning

Figure 5: Continued pre-training improves the effectiveness of downstream post-training. CPT improves both SFT and RL performance, with the largest gains appearing after RL and in out-of-domain settings.

We next study whether continued pre-training changes how much downstream post-training can help. In the DNA and RNA settings, we first adapt the base backbones with continued pre-training on biological texts, yielding model families of the form Q1-D/R-C and Q4-D/R-C. We then evaluate these models under the strongest post-training configurations identified above, namely Q1-D-C-S_8,1, Q4-D-C-S_4,1, Q1-D-C-S_8,1-R_16,1, and Q4-D-C-S_4,1-R_16,1 for DNA and Q1-R-C-S_4,1, Q1-R-C-S_4,1-R_8,1, Q4-R-C-S_8,1, and Q4-R-C-S_8,1-R_8,1 for RNA. Figure 5 shows that CPT improves downstream performance at almost every stage we test, but that the size of the gain depends strongly on the stage. The gains are modest at the SFT stage in-domain and markedly larger after RL, especially out-of-domain. After RL, CPT lifts ID and OOD performance by visibly larger margins than after SFT alone.

This effect is especially pronounced for the smaller Q1-D model out-of-domain, where CPT improves the effectiveness of SFT and RL by 0.2 and 0.08, respectively. For Q4-D, the same pattern holds, but from a stronger starting point and with smaller absolute gains. These findings also hold qualitatively in the RNA setting and suggest that, especially for smaller models, CPT can act as a bridge between a general-purpose backbone and the reasoning demands of biology. Without CPT, downstream training must learn biological language, task structure, and reasoning behavior at the same time.

4.4 Backbone Strength Shifts Performance Ceiling but Not Training Dynamics

Figure 6: Stronger backbones improve the performance achievable with post-training but preserve training dynamics. Unlike Q1-R, G-R does not display an initial drop in performance when starting RL, and it generally performs better OOD. Mean and std. over 3 random seeds for SFT.

To test whether our main findings depend on the choice of base model, we repeat the RNA experiments with an off-the-shelf Gemma backbone [29]. In addition to Q1-R and Q4-R, we evaluate the more recent Gemma 4 E2B backbone, denoted G-R. Concretely, for SFT scaling we train model families of the form Q1-R-S_{{1,2,4,8,16,32},1}, Q4-R-S_{{1,2,4,8,16,32},1}, and G-R-S_{{1,2,4,8,16,32},1}. For RL scaling, we then start from the strongest SFT checkpoint (as measured by validation loss) for each backbone and train Q1-R-S_4,1-R_{{1,2,4,8,16},1}, Q4-R-S_8,1-R_{{1,2,4,8,16},1}, and G-R-S_4,1-R_{{1,2,4,8,16},1}. With SFT only, G-R generally outperforms the smaller Q1-R models OOD but is somewhat weaker in-domain, as Figure 6 shows.

After one epoch, G-R is still comparable to Q4-R and outperforms Q1-R by around both in- and out-of-domain. After that, more SFT helps the Qwen models more, at least in-domain, but G-R accuracy still increases by . The qualitative trend is therefore unchanged: supervised fine-tuning is most effective for in-domain performance, but out-of-domain performance peaks substantially earlier and then plateaus or even declines with additional epochs. Backbone quality therefore shifts the overall SFT frontier upward, but does not remove the core ID-OOD trade-off induced by SFT.

Reinforcement learning exhibits a similar pattern. Starting from the strongest SFT checkpoint for each backbone, Figure 6 shows that RL improves OOD performance more reliably than additional SFT, and that the larger backbones benefit more smoothly from this stage. In particular, G-R-S_4,1-R_{{1,2,4,8,16},1} qualitatively follows the larger Q4-R-S_8,1-R_{{1,2,4,8,16},1} trajectory more closely than the smaller Q1-R-S_4,1-R_{{1,2,4,8,16},1} model. Both G-R and Q4-R improve more steadily under RL and do not show the initial drop visible for Q1-R when RL begins. Instead, Gemma improves monotonically by 0.08 ID and 0.15 OOD between RL epochs 1 and 16. This seems to indicate that backbone choice does not qualitatively alter the role of the training stage itself. The backbone matters primarily for the level of performance achievable with post-training, whereas the structure of the training dynamics appears to be stable across model families.

4.5 Adaptation Capacity Should Be Allocated Asymmetrically Between SFT and RL

Figure 7: Optimal adaptation requires asymmetric capacity across training stages. Higher LoRA rank benefits SFT, while lower rank is sufficient for RL, indicating that different stages require different adaptation capacity (both for ID and OOD tasks). Shown are results for drug target identification (RNA) tasks.

We further study how adaptation capacity should be allocated across post-training stages by running a joint SFT–RL LoRA ablation in the RNA setting. Using Q1-R-S_4,1-R_8,1 and Q4-R-S_8,1-R_8,1 as our reference model families, we vary the SFT LoRA rank over and the RL LoRA rank over , with the corresponding scaling factors set proportionally to rank. For each backbone, every model is first fine-tuned with SFT using the same training data, optimizer, and epoch budget as in our main RNA experiments, and is then further optimized with RL using the same reward and training schedule. We evaluate the final checkpoint from each configuration on both ID-test and OOD-test splits, and summarize the results as heatmaps over . This setup isolates whether the best end-to-end pipeline prefers symmetric adapter budgets across stages or an asymmetric allocation in which SFT and RL use different amounts of trainable capacity.
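The joint ablation can be viewed as a grid over (SFT rank, RL rank) pairs, with each LoRA scaling factor tied to its rank. The sketch below shows how such a grid might be enumerated. The `lora_grid` helper and the default alpha-to-rank ratio are assumptions for illustration; the rank values passed in are up to the caller, since the specific values are not reproduced here.

```python
from itertools import product
from typing import Dict, List, Sequence


def lora_grid(
    sft_ranks: Sequence[int],
    rl_ranks: Sequence[int],
    alpha_per_rank: float = 2.0,  # assumed ratio; the text only says "proportional"
) -> List[Dict[str, float]]:
    """Enumerate every (SFT rank, RL rank) pair with rank-proportional alphas."""
    return [
        {
            "sft_rank": sft_rank,
            "sft_alpha": alpha_per_rank * sft_rank,
            "rl_rank": rl_rank,
            "rl_alpha": alpha_per_rank * rl_rank,
        }
        for sft_rank, rl_rank in product(sft_ranks, rl_ranks)
    ]
```

Each dictionary corresponds to one cell of a heatmap: the SFT adapter is trained first with its rank and scaling factor, and the RL adapter is then trained with its own.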

We find a clear asymmetry between the two stages. In both backbones, increasing the SFT rank from to or produces a clear upward shift in ID performance and usually also improves OOD performance, whereas increasing the RL rank beyond yields much smaller gains and can even reduce OOD performance. The highest ID regions in Figure 7 cluster at , showing that SFT benefits from having enough capacity to absorb task format, domain structure, and multimodal reasoning patterns. By contrast, for RL, the strongest ID and OOD regions are concentrated at . The best overall configurations are therefore not those with matched ranks, but those with high-capacity SFT and low-capacity RL. This pattern holds for both Q1-R and Q4-R, suggesting that post-training in biological reasoning should be stage-specific not only in compute and data allocation, but also in adaptation capacity.

4.6 Optimal Post-Training Requires Balancing SFT and RL

Figure 8: Under a fixed post-training budget, a small amount of SFT followed by more RL gives the best ID-OOD trade-off. Across DNA and RNA, 1–3 SFT epochs followed by larger RL budgets generally give the strongest OOD accuracy, while larger SFT allocations achieve better ID performance.

Finally, we study how to allocate post-training across supervised fine-tuning and reinforcement learning. In this experiment, we evaluate model families of the form Q1-D/R-S_s,1.5-R_8-s,1.5 and G-D/R-S_s,1.5-R_8-s,1.5, where the total post-training schedule is fixed at eight epoch-level passes, and only the split between SFT and RL is varied. This is not a strictly FLOP-matched comparison: an RL epoch is more expensive than an SFT epoch because GRPO uses multiple autoregressive rollouts, reward computation, and KL anchoring. We therefore interpret this setup as an epoch-budget allocation study that compares pure RL, pure SFT, and intermediate stage orderings under a common pass-count constraint, rather than as an exact compute-normalized optimum.
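The allocation sweep and the caveat about compute can be made concrete with a short sketch. `epoch_splits` enumerates every split of a fixed pass budget between SFT and RL in the signature notation, and `relative_cost` expresses a schedule's cost in units of one SFT epoch. Both helpers are hypothetical, and `rl_cost_ratio` is an assumed multiplier that would have to be estimated (for example from the number of rollouts per prompt); no value for it is given in the text.

```python
from typing import List


def epoch_splits(prefix: str, total: int = 8, data_k: str = "1.5") -> List[str]:
    """List signatures for every SFT/RL split of `total` epoch-level passes."""
    signatures = []
    for sft_epochs in range(total + 1):
        rl_epochs = total - sft_epochs
        stages = [prefix]
        if sft_epochs:
            stages.append(f"S_{sft_epochs},{data_k}")
        if rl_epochs:
            stages.append(f"R_{rl_epochs},{data_k}")
        signatures.append("-".join(stages))
    return signatures


def relative_cost(sft_epochs: int, rl_epochs: int, rl_cost_ratio: float) -> float:
    """Schedule cost in units of one SFT epoch, under an assumed RL multiplier."""
    return sft_epochs + rl_cost_ratio * rl_epochs
```

For any `rl_cost_ratio` above 1, schedules with the same pass count differ in cost, with RL-heavy splits being the more expensive ones. This is why the comparison is an epoch-budget study rather than a FLOP-matched one.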

Figure 8 shows that the best allocation is not at either extreme. In the DNA panel, the additional G-D results closely mirror the Q1-D trend: ID accuracy is strongest after a few SFT epochs, peaking near for G-D and for Q1-D, while OOD accuracy is maximized in the early mixed regime around . Pure RL underperforms these mixed schedules, especially OOD, indicating that reward optimization benefits from a supervised warm start. Pure SFT preserves relatively high ID accuracy, but its OOD performance is much weaker, falling to about for Q1-D and for G-D. The G-D curves also show stronger ID retention than Q1-D across several larger-SFT allocations, but this does not remove the OOD decline as SFT dominates the budget. The RNA panel shows the same qualitative trade-off more sharply: G-R maintains higher ID accuracy across most allocations, whereas both models obtain their best OOD performance with only a small amount of SFT before RL. Overall, the strongest OOD results concentrate in the Q1-R-S_1,1.5-R_7,1.5 to Q1-R-S_3,1.5-R_5,1.5 and G-R-S_1,1.5-R_7,1.5 to G-R-S_3,1.5-R_5,1.5 range. This suggests that later post-training passes are better spent on RL once SFT has established task competence, although a compute-normalized study using estimated FLOPs would be needed to identify the exact optimal SFT–RL trade-off.

5 Discussion

Figure 9: RL shifts the ID-OOD frontier across modalities. Each point is a trained checkpoint; color denotes training stage and marker shape denotes backbone. RL generally improves OOD performance at comparable ID performance across DNA, RNA, and protein tasks.

Our results show that post-training stages play distinct roles in biological reasoning. CPT adapts models to biological language, SFT establishes task competence, and RL improves transfer beyond the training distribution. These stages should therefore not be treated as interchangeable sources of compute. Figure 9 summarizes this pattern across modalities, showing that RL shifts checkpoints toward stronger OOD performance at comparable ID performance. In practice, this suggests a simple recipe: use CPT to align models with biological language, use enough SFT to establish task competence, and allocate later post-training to RL when OOD robustness matters.
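One way to make the notion of an ID-OOD frontier concrete is to treat it as a Pareto front: a checkpoint lies on the frontier if no other checkpoint is at least as good on both metrics and strictly better on one. The sketch below assumes each checkpoint is a (name, ID accuracy, OOD accuracy) tuple. This interface is hypothetical and is not necessarily how Figure 9 was produced.

```python
from typing import List, Tuple

Checkpoint = Tuple[str, float, float]  # (name, id_accuracy, ood_accuracy)


def pareto_frontier(checkpoints: List[Checkpoint]) -> List[Checkpoint]:
    """Return the checkpoints that no other checkpoint dominates."""

    def is_dominated(candidate: Checkpoint) -> bool:
        return any(
            other[1] >= candidate[1]
            and other[2] >= candidate[2]
            and (other[1] > candidate[1] or other[2] > candidate[2])
            for other in checkpoints
        )

    return [c for c in checkpoints if not is_dominated(c)]
```

Comparing the frontier computed over SFT-only checkpoints with the frontier computed over all checkpoints would show whether RL moves the front toward higher OOD accuracy at comparable ID accuracy.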

The SFT-induced trade-off also highlights why biology provides a demanding setting for studying post-training dynamics. In our RNA experiments, for example, OOD accuracy peaks after only a few SFT epochs and then plateaus or declines as training continues (Sections 4.1 and 4.4). Biology exposes generalization failures that are often less apparent in conventional reasoning benchmarks. In mathematics and code, many OOD problems preserve the same underlying structure as the training examples, even when surface details change. In biology, unseen pathways, diseases, species, and perturbations often involve different mechanisms and biological processes. As a result, optimization that improves performance on the training distribution can simultaneously reduce the ability to transfer across biological systems.

Limitations and future work.

Our study has several limitations. First, although we evaluate post-training across DNA, RNA, and protein reasoning tasks with biologically meaningful OOD splits, our conclusions rest on a limited set of tasks, benchmarks, and model families. It remains unclear how broadly these trends extend to other scientific reasoning settings and richer biological workflows. Second, our results suggest that RL depends on reward design, task structure, and the quality of the supervised starting point, but we have only begun to characterize these dependencies. Our fixed-schedule SFT-RL study is also not fully compute-normalized. Finally, our evaluation measures final-answer correctness rather than the validity of intermediate reasoning steps. We therefore cannot fully distinguish genuine biological reasoning from shortcut strategies that produce correct outputs. Future work should test these trade-offs on broader benchmarks and examine how reward design, compute-normalized stage allocation, and adaptation capacity shape ID-OOD robustness in scientific reasoning models.

More broadly, our results suggest that progress in scientific reasoning will depend not only on larger models or more post-training compute, but on understanding how different stages shape generalization. In biology, the strongest models are not those that optimize longest on a fixed distribution, but those that preserve the ability to transfer across biological systems. Understanding and controlling these training dynamics may therefore be as important as scaling model size itself.

Acknowledgements

L.F. is supported by the Kempner Graduate Fellowship at Harvard University. H.Z. and S.K. acknowledge the Chan Zuckerberg Initiative Foundation for establishing the Kempner Institute for the Study of Natural and Artificial Intelligence. M.M.L. and M.Z. gratefully acknowledge support, in part, from NSF CAREER Award 2339524, ARPA-H Biomedical Data Fabric (BDF) Toolbox Program, Amazon Faculty Research, Google Research Scholar Program, AstraZeneca Research, GlaxoSmithKline Award, Roche Alliance with Distinguished Scientists (ROADS) Program, Sanofi iDEA-iTECH Award, Boehringer Ingelheim Award, Merck Award, Optum AI Research Collaboration Award, Pfizer Research, Gates Foundation (INV-079038), Chan Zuckerberg Initiative, Collaborative Center for XDP at Massachusetts General Hospital, John and Virginia Kaneb Fellowship at Harvard Medical School, Biswas Computational Biology Initiative in partnership with the Milken Institute, Harvard Medical School Dean’s Innovation Fund for the Use of Artificial Intelligence, and the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders.

Authors affiliated with Google DeepMind and Google Research (Eric Wang, Shekoofeh Azizi, Bryan Perozzi) participated in this work in an advisory capacity only.

References

[1] A. K. Adduri, D. Gautam, B. Bevilacqua, A. Imran, R. Shah, M. Naghipourfar, N. Teyssier, R. Ilango, S. Nagaraj, M. Dong, et al. (2025) Predicting cellular responses to perturbation across diverse contexts with state. bioRxiv, pp. 2025–06. Cited by: §2.
[2] C. Ahlmann-Eltze, W. Huber, and S. Anders (2025) Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nature Methods 22 (8), pp. 1657–1661. Cited by: §1.
[3] E. C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, and G. M. Church (2019) Unified rational protein engineering with sequence-based deep representation learning. Nature Methods 16 (12), pp. 1315–1322. Cited by: §2.
[4] J. S. Amberger, C. A. Bocchini, F. Schiettecatte, A. F. Scott, and A. Hamosh (2015) OMIM.org: online mendelian inheritance in man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Research 43 (D1), pp. D789–D798. External Links: Document Cited by: §A.1.
[5] Ž. Avsec, N. Latysheva, J. Cheng, G. Novati, K. R. Taylor, T. Ward, C. Bycroft, L. Nicolaisen, E. Arvaniti, J. Pan, et al. (2025) AlphaGenome: advancing regulatory variant effect prediction with a unified dna sequence model. bioRxiv, pp. 2025–06. Cited by: §2.
[6] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. (2023) Pythia: a suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, Cited by: §2.
[7] G. Brixi, M. G. Durrant, J. Ku, M. Naghipourfar, M. Poli, G. Sun, G. Brockman, D. Chang, A. Fanton, G. A. Gonzalez, et al. (2026) Genome modelling and design across all domains of life with evo 2. Nature, pp. 1–13. Cited by: §B.1, §B.3, §B.4, §B.5, §2, §3.2.
[8] A. Catalan-Tatjer, N. Ajroldi, and J. Geiping (2026) Training dynamics impact post-training quantization robustness. ICLR. Cited by: §1.
[9] F. Chen, A. Huang, N. Golowich, S. Malladi, A. Block, J. T. Ash, A. Krishnamurthy, and D. J. Foster (2026) The coverage principle: how pre-training enables post-training. ICLR. Cited by: §1.
[10] J. Chen, Z. Hu, S. Sun, Q. Tan, Y. Wang, Q. Yu, L. Zong, L. Hong, J. Xiao, T. Shen, et al. (2022) Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions. arXiv preprint arXiv:2204.00300. Cited by: §2.
[11] X. Chen, T. Li, and D. Zou (2026) Reshaping reasoning in llms: a theoretical analysis of rl training dynamics through pattern selection. ICLR. Cited by: §1.
[12] Y. Chen, Z. Tan, R. Zhang, M. Qiu, and T. Chen (2026) CellDuality: unlocking biological reasoning in LLMs with self-supervised RLVR. In The Fourteenth International Conference on Learning Representations, External Links: Link Cited by: §1.
[13] D. Cheng, S. Huang, and F. Wei (2024) Adapting large language models via reading comprehension. arXiv preprint arXiv:2309.09530. Cited by: §2.
[14] T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma (2025) SFT memorizes, rl generalizes: a comparative study of foundation model post-training. In International Conference on Machine Learning, pp. 10818–10838. Cited by: §2.
[15] A. Cornman, J. West-Roberts, A. P. Camargo, S. Roux, M. Beracochea, M. Mirdita, S. Ovchinnikov, and Y. Hwang (2024) The omg dataset: an open metagenomic corpus for mixed-modality genomic language modeling. bioRxiv, pp. 2024–08. Cited by: §2.
[16] H. Cui, C. Wang, H. Maan, K. Pang, F. Luo, N. Duan, and B. Wang (2024) ScGPT: toward building a foundation model for single-cell multi-omics using generative ai. Nature Methods 21 (8), pp. 1470–1480. Cited by: §2.
[17] CZI Single-Cell Biology Program, S. Abdulla, B. D. Aevermann, P. Assis, S. Badajoz, et al. (2025) CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Research 53 (D1), pp. D886–D900. External Links: Document Cited by: §A.2.
[18] H. Dalla-Torre, L. Gonzalez, J. Mendoza-Revilla, N. Lopez Carranza, A. H. Grzywaczewski, F. Oteri, C. Dallago, E. Trop, B. P. De Almeida, H. Sirelkhatim, et al. (2025) Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nature Methods 22 (2), pp. 287–297. Cited by: §2.
[19] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré (2022) FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, Vol. 35, pp. 16344–16359. Cited by: §B.3.
[20] J. Durairaj, Y. Adeshina, Z. Cao, X. Zhang, V. Oleinikovas, T. Duignan, Z. McClure, X. Robin, G. Studer, D. Kovtun, et al. (2024) PLINDER: the protein-ligand interactions dataset and evaluation resource. bioRxiv, pp. 2024–07. Cited by: §1.
[21] Y. Ektefaie, A. Shen, D. Bykova, M. G. Marin, M. Zitnik, and M. Farhat (2024) Evaluating generalizability of artificial intelligence models for molecular datasets. Nature Machine Intelligence 6 (12), pp. 1512–1524. Cited by: §1.
[22] A. Fallahpour, A. Magnuson, P. Gupta, S. Ma, J. Naimer, A. Shah, H. Duan, O. Ibrahim, H. Goodarzi, C. J. Maddison, et al. (2025) Bioreason: incentivizing multimodal biological reasoning within a dna-llm model. arXiv preprint arXiv:2505.23579. Cited by: §A.1, §A.1, §B.4, §B.4, §D.1, §D.1, §1, §2, §3.1.
[23] A. Fallahpour, A. Seyed-Ahmadi, P. Idehpour, O. Ibrahim, P. Gupta, J. Naimer, K. Zhu, A. Shah, S. Ma, A. Adduri, et al. (2026) BioReason-pro: advancing protein function prediction with multimodal biological reasoning. bioRxiv, pp. 2026–03. Cited by: §A.3, §A.3, §B.1, §B.4, §B.4, §1, §3.1.
[24] R. Fan, Z. Wang, and P. Liu (2025) Megascience: pushing the frontiers of post-training datasets for science reasoning. arXiv:2507.16812. Cited by: §1.
[25] K. Feng, K. Ding, Z. Zhu, L. Liang, Q. Zhang, and H. Chen (2026) CoT-Evo: evolutionary distillation of chain-of-thought for scientific reasoning. ICLR. Cited by: §1.
[26] P. Fradkin, R. Shi, T. Dalal, K. Isaev, B. J. Frey, L. J. Lee, Q. Morris, and B. Wang (2026) Orthrus: toward evolutionary and functional rna foundation models. Nature Methods, pp. 1–11. Cited by: §2.
[27] Y. Gao, Z. Wang, J. Chen, M. Antkowiak, M. Hu, J. Kong, D. Pratt, J. Liu, E. Ma, Z. Hu, et al. (2025) scPilot: large language model reasoning toward automated single-cell analysis and discovery. NeurIPS. Cited by: §1.
[28] A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani, et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research 40 (D1), pp. D1100–D1107. Cited by: §A.2.
[29] Google DeepMind (2026) Gemma 4. Note: https://deepmind.google/models/gemma/gemma-4/ Accessed 2026-05-04 Cited by: §B.3, §B.4, §3.2, §4.4.
[30] E. Guha, R. Marten, S. Keh, N. Raoof, G. Smyrnis, H. Bansal, M. Nezhurina, J. Mercat, T. Vu, Z. Sprague, et al. (2026) OpenThoughts: data recipes for reasoning models. ICLR. Cited by: §1.
[31] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645 (8081), pp. 633–638. Cited by: §B.4, §B.4, §B.4, §B.4, §B.4, §B.5, §1.
[32] K. Gupta, B. Thérien, A. Ibrahim, M. L. Richter, Q. Anthony, E. Belilovsky, I. Rish, and T. Lesort (2023) Continual pre-training of large language models: how to (re)warm your model?. arXiv preprint arXiv:2308.04014. Cited by: §B.2, §2.
[33] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360. Cited by: §B.2, §2.
[34] M. Hao, J. Gong, X. Zeng, C. Liu, Y. Guo, X. Cheng, T. Wang, J. Ma, X. Zhang, and L. Song (2024) Large-scale foundation model on single-cell transcriptomics. Nature Methods 21 (8), pp. 1481–1491. Cited by: §2.
[35] T. Hayes, R. Rao, H. Akin, N. J. Sofroniew, D. Oktay, Z. Lin, R. Verkuil, V. Q. Tran, J. Deaton, M. Wiggert, et al. (2025) Simulating 500 million years of evolution with a language model. Science 387 (6736), pp. 850–858. Cited by: §A.3, §B.1, §B.3, §B.4, §B.5, §B.5, §2, §3.2.
[36] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. (2022) Training compute-optimal large language models. In Advances in Neural Information Processing Systems, Cited by: §2.
[37] Y. Hou, W. Long, H. Hu, H. Su, J. Feng, and Y. Zhang (2026) PhageBench: can llms understand raw bacteriophage genomes?. arXiv preprint arXiv:2604.05775. Cited by: §2.
[38] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations, Cited by: §B.1, §B.4, §B.4, §B.4, §B.5, §B.5, §B.5, §B.5, §B.5.
[39] R. P. Huntley, T. Sawford, P. Mutowo-Meullenet, A. Shypitsyna, C. Bonilla, M. J. Martin, and C. O’Donovan (2015) The GOA database: gene ontology annotation updates for 2015. Nucleic Acids Research 43 (D1), pp. D1057–D1063. External Links: Document Cited by: §A.3.
[40] A. Ibrahim, B. Thérien, K. Gupta, M. L. Richter, Q. Anthony, T. Lesort, E. Belilovsky, and I. Rish (2024) Simple and scalable strategies to continually pre-train large language models. arXiv preprint arXiv:2403.08763. Cited by: §2.
[41] A. Istrate, D. Li, and T. Karaletsos (2024) ScGenePT: is language all you need for modeling single-cell perturbations?. bioRxiv, pp. 2024–10. Cited by: §2.
[42] A. Istrate, F. Milletari, F. Castrotorres, J. M. Tomczak, M. Torkar, D. Li, and T. Karaletsos (2025) Rbio1-training scientific reasoning llms with biological world models as soft verifiers. bioRxiv, pp. 2025–08. Cited by: §1, §2.
[43] Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri (2021) DNABERT: pre-trained bidirectional encoder representations from transformers model for dna-language in genome. Bioinformatics 37 (15), pp. 2112–2120. Cited by: §2.
[44] C. Jiang, X. Zhang, F. Zhu, X. Chen, J. Zhu, and Z. Zhang (2026) Rethinking LLM reasoning: from explicit trajectories to latent representations. In The Fourteenth International Conference on Learning Representations, External Links: Link Cited by: §1.
[45] M. Kanehisa and S. Goto (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28 (1), pp. 27–30. External Links: Document Cited by: §A.1, §A.1, §A.1.
[46] M. Kanehisa, Y. Sato, M. Furumichi, K. Morishima, and M. Tanabe (2019) New approach for understanding genome variations in KEGG. Nucleic Acids Research 47 (D1), pp. D590–D595. External Links: Document Cited by: §A.1, §A.1, §A.1.
[47] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: §2.
[48] Z. Ke, Y. Shao, H. Lin, T. Konishi, G. Kim, and B. Liu (2023) Continual pre-training of language models. In International Conference on Learning Representations, Cited by: §2.
[49] K. Z. Kedzierska, L. Crawford, A. P. Amini, and A. X. Lu (2025) Zero-shot evaluation reveals limitations of single-cell foundation models. Genome Biology 26 (1), pp. 101. Cited by: §1.
[50] D. Khatri, L. Madaan, R. Tiwari, R. Bansal, S. S. Duvvuri, M. Zaheer, I. S. Dhillon, D. Brandfonbrener, and R. Agarwal (2026) The art of scaling reinforcement learning compute for llms. ICLR. Cited by: §1.
[51] R. Kirk, I. Mediratta, C. Nalmpantis, J. Luketina, E. Hambro, E. Grefenstette, and R. Raileanu (2024) Understanding the effects of RLHF on LLM generalisation and diversity. In International Conference on Learning Representations, Cited by: §2.
[52] M. J. Landrum, S. Chitipiralla, G. R. Brown, C. Chen, B. Gu, J. Hart, D. Hoffman, W. Jang, K. Kaur, C. Liu, et al. (2020) ClinVar: improvements to accessing data. Nucleic Acids Research 48 (D1), pp. D835–D844. External Links: Document Cited by: §A.1.
[53] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli, et al. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379 (6637), pp. 1123–1130. Cited by: §2.
[54] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations, Cited by: §B.2, §B.3, §B.3, §B.3, §B.4, §B.4.
[55] M-A-P, G. Zhang, X. Du, Z. Yu, Z. Wang, Z. Wang, S. Guo, T. Zheng, K. Zhu, J. Liu, S. Yue, B. Liu, Z. Peng, Y. Yao, J. Yang, Z. Li, B. Zhang, M. Liu, T. Liu, Y. Gao, W. Chen, X. Zhou, Q. Liu, T. Wang, and W. Huang (2024-12) FineFineWeb: a comprehensive study on fine-grained domain web corpus. Note: https://huggingface.co/datasets/m-a-p/FineFineWeb Version v0.1.0; Hugging Face dataset Cited by: §B.2, §B.2.
[56] S. M. Narayanan, J. D. Braza, R. Griffiths, A. Bou, G. Wellawatte, M. C. Ramos, L. Mitchener, S. G. Rodriques, and A. D. White (2025) Training a scientific reasoning model for chemistry. NeurIPS. Cited by: §1.
[57] E. Nguyen, M. Poli, M. G. Durrant, B. Kang, D. Katrekar, D. B. Li, L. J. Bartie, A. W. Thomas, S. H. King, G. Brixi, et al. (2024) Sequence modeling and design from molecular to genome scale with evo. Science 386 (6723), pp. eado9336. Cited by: §2.
[58] E. Nijkamp, J. A. Ruffolo, E. N. Weinstein, N. Naik, and A. Madani (2023) Progen2: exploring the boundaries of protein language models. Cell Systems 14 (11), pp. 968–978. Cited by: §2.
[59] D. Ochoa, A. Hercules, M. Carmona, D. Suveges, J. Baker, C. Malangone, I. Lopez, A. Miranda, C. Cruz-Castillo, L. Fumis, et al. (2023) The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Research 51 (D1), pp. D1353–D1359. External Links: Document Cited by: §A.2.
[60] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744. Cited by: §B.4, §B.4, §2.
[61] J. Parmar, S. Satheesh, M. Patwary, M. Shoeybi, and B. Catanzaro (2024) Reuse, don’t retrain: a recipe for continued pretraining of language models. arXiv preprint arXiv:2407.07263. Cited by: §2.
[62] T. Paysan-Lafosse, M. Blum, S. Chuguransky, T. Grego, B. L. Pinto, G. A. Salazar, M. L. Bileschi, P. Bork, A. Bridge, L. Colwell, et al. (2023) InterPro in 2022. Nucleic Acids Research 51 (D1), pp. D418–D427. Cited by: §A.3, §A.3, §B.4, §3.1.
[63] J. D. Pearce, S. E. Simmonds, G. Mahmoudabadi, L. Krishnan, G. Palla, A. Istrate, A. Tarashansky, B. Nelson, O. Valenzuela, D. Li, et al. (2025) A cross-species generative cell atlas across 1.5 billion years of evolution: the transcriptformer single-cell model. bioRxiv, pp. 2025–04. Cited by: §A.2, §B.1, §B.3, §B.4, §B.4, §B.5, §2, §3.1, §3.2.
[64] G. Penedo, H. Kydlíček, L. Ben Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. Von Werra, and T. Wolf (2024) The FineWeb datasets: decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems, Cited by: §B.2.
[65] Z. Qi, F. Nie, A. Alahi, J. Zou, H. Lakkaraju, Y. Du, E. P. Xing, S. M. Kakade, and H. Zhang (2025) EvoLM: in search of lost language model training dynamics. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Cited by: §1, §2.
[66] H. Que, J. Liu, G. Zhang, C. Zhang, X. Qu, Y. Ma, F. Duan, Z. Bai, J. Wang, Y. Zhang, et al. (2024) D-cpt law: domain-specific continual pre-training scaling law for large language models. Advances in Neural Information Processing Systems 37, pp. 90318–90354. Cited by: §B.2, §2.
[67] O. Queen, Y. Huang, R. Calef, V. Giunchiglia, T. Chen, G. Dasoulas, L. Tai, G. Abbadessa, O. Howell, M. M. Li, et al. (2025) ProCyon: a multimodal foundation model for protein phenotypes. bioRxiv, pp. 2024–12. Cited by: §1, §2.
[68] P. Radivojac, W. T. Clark, T. R. Oron, A. M. Schnoes, T. Wittkop, A. Sokolov, K. Graim, C. Funk, K. Verspoor, A. Ben-Hur, et al. (2013) A large-scale evaluation of computational protein function prediction. Nature Methods 10 (3), pp. 221–227. External Links: Document Cited by: §A.3.
[69] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2020) ZeRO: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. External Links: Document Cited by: §B.3, §B.3, §B.4, §B.4.
[70] C. Research et al. (2026) Composer 2 technical report. External Links: 2603.24477, Link Cited by: §2.
[71] S. A. Rizvi, D. Levine, A. Patel, S. Zhang, E. Wang, C. J. Perry, I. Vrkic, N. M. Constante, Z. Fu, S. He, et al. (2026) Scaling large language models for next-generation single-cell analysis. bioRxiv, pp. 2025–04. Cited by: §1, §2.
[72] M. Schaefer, P. Peneder, D. Malzl, S. D. Lombardo, M. Peycheva, J. Burton, A. Hakobyan, V. Sharma, T. Krausgruber, C. Sin, et al. (2025) Multimodal learning enables chat-based exploration of single-cell data. Nature Biotechnology, pp. 1–11. Cited by: §2.
[73] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. External Links: 1707.06347 Cited by: §B.4.
[74] S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Research 29 (1), pp. 308–311. External Links: Document Cited by: §A.1.
[75] E. Simon, K. Swanson, and J. Zou (2024) Language models for biological research: a primer. Nature Methods 21 (8), pp. 1422–1429. Cited by: §2.
[76] G. Studer, X. Robin, S. Bienert, J. Durairaj, P. Škrinjar, G. Tauriello, A. M. Waterhouse, and T. Schwede (2026) A fully automated benchmarking suite to compare macromolecular complexes. Nature Methods 23 (2), pp. 387–394. Cited by: §1.
[77] C. Su, Z. Hao, Z. Zhang, Z. Xia, Y. Wu, H. Su, and J. Zhu (2026) Helix: evolutionary reinforcement learning for open-ended scientific problem solving. ICLR. Cited by: §1.
[78] P. Sui, M. M. Li, S. Gao, W. Shen, V. Giunchiglia, A. Shen, Y. Huang, Z. Kong, and M. Zitnik (2026) Medea: an omics ai agent for therapeutic discovery. bioRxiv, pp. 2026–01. Cited by: §A.2, §A.2, §B.1, §B.3, §B.4, §B.4, §3.1.
[79] H. Sun, Y. Jiang, Z. Tang, Y. Pan, S. Gu, Z. Lin, L. Wang, W. Lou, L. Liu, L. Bai, et al. (2026) Unleashing scientific reasoning for bio-experimental protocol generation via structured component-based reward mechanism. ICLR. Cited by: §1.
[80] D. Szklarczyk, K. Nastou, M. Koutrouli, R. Kirsch, F. Mehryary, R. Hachilif, D. Hu, M. E. Peluso, Q. Huang, T. Fang, N. T. Doncheva, S. Pyysalo, P. Bork, L. J. Jensen, and C. von Mering (2025) The STRING database in 2025: protein networks with directionality of regulation. Nucleic Acids Research 53 (D1), pp. D730–D737. External Links: Document Cited by: §A.3, §A.3, §B.4.
[81] J. Tang, L. Xia, Z. Li, and C. Huang (2025) AI-researcher: autonomous scientific innovation. NeurIPS. Cited by: §1.
[82] J. G. Tate, S. Bamford, H. C. Jubb, Z. Sondka, D. M. Beare, N. Bindal, H. Boutselakis, C. G. Cole, C. Creatore, E. Dawson, et al. (2019) COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Research 47 (D1), pp. D941–D947. External Links: Document Cited by: §A.1.
[83] The Gene Ontology Consortium (2023) The gene ontology knowledgebase in 2023. Genetics 224 (1), pp. iyad031. External Links: Document Cited by: §A.3, §A.3, §A.3, §B.1, §B.4, §B.4, §B.5.
[84] The UniProt Consortium (2023) UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Research 51 (D1), pp. D523–D531. External Links: Document Cited by: §A.3, §A.3.
[85] C. V. Theodoris, L. Xiao, A. Chopra, M. D. Chaffin, Z. R. Al Sayed, M. C. Hill, H. Mantineo, E. M. Brydon, Z. Zeng, X. S. Liu, et al. (2023) Transfer learning enables predictions in network biology. Nature 618 (7965), pp. 616–624. Cited by: §2.
[86] S. Wang, G. Zhang, L. L. Zhang, N. Shang, F. Yang, D. Chen, and M. Yang (2026) LoongRL: reinforcement learning for advanced reasoning over long contexts. ICLR. Cited by: §1.
[87] Y. Wang, Q. Yang, Z. Zeng, L. Ren, L. Liu, B. Peng, H. Cheng, X. He, K. Wang, J. Gao, et al. (2025) Reinforcement learning for reasoning in large language models with one training example. arXiv:2504.20571. Cited by: §1.
[88] Z. Wang, Y. Yang, Q. Jin, and Z. Lu (2025) Gene-r1: reasoning with data-augmented lightweight llms for gene set analysis. In Biocomputing 2026: Proceedings of the Pacific Symposium, pp. 494–507. Cited by: §2.
[89] Z. Wei, R. Ma, Z. Wang, Z. Li, S. Song, and S. Zheng (2026) VCWorld: a biological world model for virtual cell simulation. ICLR. Cited by: §1.
[90] Z. Wei, Y. Wang, Y. Gao, S. Wang, P. Li, D. Si, Y. Gao, S. Wu, D. Li, K. Dong, et al. (2026) Benchmarking algorithms for generalizable single-cell perturbation response prediction. Nature Methods 23 (2), pp. 451–464. Cited by: §1.
[91] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §B.1, §B.1, §B.1, §B.3, §B.3, §B.3, §B.4, §B.4, §B.4, §B.5, §3.2, §4.1.
[92] M. Yin, Y. Qu, L. Yang, L. Cong, and M. Wang (2025) Toward scientific reasoning in llms: training from expert discussions via reinforcement learning. arXiv preprint arXiv:2505.19501. Cited by: §2.
[93] C. Yu, S. Li, Z. Liu, J. Zhou, X. Guo, K. Yu, Y. Zhou, K. Li, Z. Zang, Z. Lei, and S. Z. Li (2026) CDBridge: a cross-omics post-training bridge strategy for context-aware biological modeling. In The Fourteenth International Conference on Learning Representations, External Links: Link Cited by: §1.
[94] Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang (2025) Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?. NeurIPS. Cited by: §1.
[95] B. Zhang, Z. Liu, C. Cherry, and O. Firat (2024) When scaling meets LLM finetuning: the effect of data, model and finetuning method. In International Conference on Learning Representations, Cited by: §2.
[96] S. Zheng, C. Huang, F. Yu, J. Yao, J. Ye, T. Chen, Y. Luo, N. Ding, L. Bai, G. Cui, et al. (2026) Sci-verifier: scientific verifier with thinking. ICLR. Cited by: §1.
[97] N. Zhou, Y. Jiang, T. R. Bergquist, A. J. Lee, B. Z. Kacsoh, A. W. Crocker, K. A. Lewis, G. Georghiou, H. N. Nguyen, M. N. Hamid, et al. (2019) The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biology 20 (1), pp. 244. External Links: Document Cited by: §A.3.
[98] Z. Zhou, Y. Ji, W. Li, P. Dutta, R. Davuluri, and H. Liu (2023) Dnabert-2: efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006. Cited by: §2.

Appendices

Appendix A Additional Details on Biological Reasoning Tasks

A.1 Pathway Prediction Dataset and Construction

Our pathway prediction benchmark follows the KEGG-derived reasoning dataset introduced in BIOREASON [22]. The source data originate from KEGG Network Variants and associated disease-pathway annotations [45, 46], which BIOREASON then augments with linked variant metadata from ClinVar, dbSNP, OMIM, and COSMIC [52, 74, 4, 82]. The resulting benchmark contains 1,449 examples spanning 298 pathway networks and 37 unique diseases. In the original curation, the key design goal was not simply to label variants, but to preserve the mechanistic chain from mutation to pathway perturbation to phenotype, so that each example could support explicit multi-step biological reasoning rather than endpoint classification alone.

Technically, each KEGG pathway is represented as a structured molecular interaction network using a standardized symbolic notation that encodes activation, inhibition, complex formation, and transcriptional regulation [45, 46]. These pathway graphs are then linked to specific variants through a semi-automated mapping procedure designed to preserve the relationship between genomic loci and pathway entities. For each mapped example, the dataset stores paired reference and variant DNA sequences with precise alignment coordinates; in BIOREASON, these sequences average roughly 4,000 base pairs, with most mutations differing from the reference by only 1–3 nucleotides. The final supervised example consists of variant details, a network definition, and gene-level context on the input side, together with a concise mechanism-to-disease answer and a full reasoning trace on the output side.

A distinctive part of the construction is the generation of causal reasoning paths. In BIOREASON, these traces were produced using Claude 3.7 Sonnet and grounded with contextual disease information from the KEGG disease database [45, 46], then packaged into standardized question-answer pairs for training and evaluation [22]. The reasoning traces have mean length 303.8 words and are intended to make the latent biological mechanism explicit: they verbalize how the mutation perturbs the affected gene, how that perturbation propagates through intermediate pathway interactions, and why the resulting network state is associated with the target disease. In our work, we inherit this benchmark structure but evaluate generalization by splitting at the level of pathway networks, so that out-of-domain examples require transfer to previously unseen molecular systems rather than new variants within familiar ones.
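The network-level split described above can be sketched as a grouped holdout, in which every example from a held-out pathway network goes to the OOD set. This is a minimal illustration under stated assumptions: the field name `network_id`, the function `split_by_network`, the 20% holdout fraction, and the toy records are hypothetical and do not reflect the exact split used in our experiments.

```python
import random
from collections import defaultdict


def split_by_network(examples, ood_fraction=0.2, seed=0):
    """Hold out whole pathway networks so OOD examples come from unseen networks."""
    by_network = defaultdict(list)
    for ex in examples:
        by_network[ex["network_id"]].append(ex)

    # Shuffle network identifiers (not individual examples) before splitting.
    network_ids = sorted(by_network)
    rng = random.Random(seed)
    rng.shuffle(network_ids)

    n_ood = max(1, round(len(network_ids) * ood_fraction))
    ood_ids = set(network_ids[:n_ood])

    in_domain = [ex for nid in network_ids if nid not in ood_ids for ex in by_network[nid]]
    ood = [ex for nid in network_ids if nid in ood_ids for ex in by_network[nid]]
    return in_domain, ood


# Toy records: 50 variants spread over 10 hypothetical networks.
examples = [{"network_id": f"N{i % 10}", "variant": f"v{i}"} for i in range(50)]
in_domain, ood = split_by_network(examples)
```

Splitting on the group key rather than on individual examples is what prevents variants from the same network appearing on both sides of the split.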

A.2 Target Identification Dataset and Simplifications Relative to MEDEA

Our target identification benchmark is adapted from the cell-type-specific target nomination task introduced in MEDEA [78], but we convert it from an agentic workflow into a fixed-input reasoning problem. In the original MEDEA setup, the task is defined over five diseases (rheumatoid arthritis, type 1 diabetes mellitus, Sjögren’s syndrome, hepatoblastoma, and follicular lymphoma) and 29 cell types, with each analysis asking the model to select the best therapeutic target from a set of five candidate genes for a specified disease–cell-type context. The full benchmark contains 2,400 analyses in total, generated from disease atlases and target-disease resources to test whether models can identify therapeutically plausible targets at cell-type resolution rather than from bulk tissue averages.

The dataset construction in MEDEA proceeds in three stages. First, each disease atlas is processed from CELLxGENE [17] using a standard single-cell pipeline, followed by one-vs-all differential expression analysis to identify disease-specific marker genes for each combination of cell type and disease. Second, disease-associated genes are collected from Open Targets [59], keeping genes with nonzero genetic evidence or ChEMBL evidence [28]. Third, ground-truth cell-type-specific disease targets are defined as genes satisfying both criteria: they are differentially expressed in the relevant disease–cell-type context and supported by disease-target evidence. For each context, MEDEA then forms five-gene candidate sets by sampling one positive target and four negatives, and uses prompt paraphrasing plus multiple random seeds to generate the final benchmark.
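The third stage and the candidate-set sampling can be sketched as a set intersection followed by one-positive, four-negative sampling. This is a simplified illustration, not the MEDEA implementation: the function `build_candidate_set`, the assumption that negatives are drawn from all genes that are not ground-truth targets, and the toy gene lists are hypothetical.

```python
import random


def build_candidate_set(de_genes, evidence_genes, gene_universe, rng, n_negatives=4):
    """Sample one positive target and n_negatives negatives for one context.

    Positives are genes that are both differentially expressed and supported
    by disease-target evidence. Negatives are assumed here to be any other
    gene in the universe (an assumption made for illustration).
    """
    positives = sorted(set(de_genes) & set(evidence_genes))
    negatives = sorted(set(gene_universe) - set(positives))
    if not positives or len(negatives) < n_negatives:
        return None  # context cannot yield a valid five-gene set

    target = rng.choice(positives)
    candidates = [target] + rng.sample(negatives, n_negatives)
    rng.shuffle(candidates)  # avoid leaking the answer through position
    return {"candidates": candidates, "answer": target}


# Toy inputs with placeholder gene symbols.
universe = [f"GENE{i}" for i in range(20)]
de_genes = ["GENE1", "GENE2", "GENE3"]
evidence_genes = ["GENE2", "GENE3", "GENE7"]
example = build_candidate_set(de_genes, evidence_genes, universe, random.Random(0))
```

Repeating the call with different seeds gives multiple candidate sets per context, in the spirit of the multi-seed generation described above.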

Our version keeps the same core supervision signal but simplifies the task substantially relative to MEDEA [78]. Rather than asking an agent to construct a research plan, invoke tools, retrieve literature, and reconcile evidence across multiple modules, we provide the model with the disease, cell type, candidate genes, and transcriptomic evidence directly. Concretely, instead of tool-based retrieval over single-cell atlases and other external resources, we supply aligned TranscriptFormer embeddings [63] for the five candidate genes in both normal and disease states for the relevant context. This removes the planning, execution, and literature-reasoning burden while preserving the central inferential challenge: the model must still identify which candidate is most supported in the specified disease and cell type, now using a fixed multimodal input rather than an open-ended agentic pipeline. Consistent with the main text, we further use four of the five diseases as the in-domain pool for train/ID-test splitting and reserve hepatoblastoma as the OOD test disease.
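The disease-level partition can be sketched as follows: all hepatoblastoma analyses form the OOD test set, and the remaining four diseases are pooled and divided into training and ID test sets. The field name `disease`, the function `split_by_disease`, the 10% ID test fraction, and the toy records are assumptions for illustration; the actual split sizes are not specified here.

```python
import random

OOD_DISEASE = "hepatoblastoma"


def split_by_disease(analyses, id_test_fraction=0.1, seed=0):
    """Reserve one disease for OOD testing and split the rest into train / ID test."""
    ood_test = [a for a in analyses if a["disease"] == OOD_DISEASE]
    pool = [a for a in analyses if a["disease"] != OOD_DISEASE]

    rng = random.Random(seed)
    rng.shuffle(pool)
    n_test = int(len(pool) * id_test_fraction)
    return {"train": pool[n_test:], "id_test": pool[:n_test], "ood_test": ood_test}


# Toy records: 20 placeholder analyses for each of the five diseases.
diseases = [
    "rheumatoid arthritis",
    "type 1 diabetes mellitus",
    "Sjogren's syndrome",
    "hepatoblastoma",
    "follicular lymphoma",
]
analyses = [{"disease": d, "analysis_id": i} for d in diseases for i in range(20)]
splits = split_by_disease(analyses)
```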

A.3 Protein Function Prediction Dataset and Simplifications Relative to BioReason-Pro

Our protein function prediction benchmark is adapted from the curated UniProt-based dataset introduced in BioReason-Pro [23]. The original corpus is designed around experimentally supported protein function annotation rather than generic sequence-level pretraining, and integrates multiple biological modalities into a single example. Starting from UniProt and the GOA database [84, 39], BioReason-Pro retains only proteins with experimental or curated GO evidence codes, standardizes annotations to the January 2023 Gene Ontology [83], and propagates terms upward through the ontology hierarchy to preserve hierarchical completeness. The resulting dataset contains 133,492 proteins spanning 3,135 organisms, with each protein linked not only to its amino acid sequence, but also to organism metadata, subcellular localization, InterPro domain annotations [62], structural information, and protein–protein interaction context [80].
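Upward propagation of annotations can be sketched as a transitive closure over parent links in the ontology graph: a protein annotated with a term is also treated as annotated with every ancestor of that term. The function `propagate_terms`, the `parents` mapping, and the toy term names below are hypothetical; real GO propagation may also need to decide which relation types (for example `is_a` and `part_of`) to follow, which this sketch does not address.

```python
def propagate_terms(terms, parents):
    """Return the input terms plus all of their ancestors in the ontology DAG.

    `parents` maps each term to an iterable of its direct parent terms.
    """
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already expanded; a DAG can reach a term by several paths
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed


# Toy ontology with placeholder term names (not real GO identifiers).
toy_parents = {
    "leaf_a": ["mid_1", "mid_2"],
    "mid_1": ["root"],
    "mid_2": ["root"],
    "root": [],
}
propagated = propagate_terms({"leaf_a"}, toy_parents)
```

After propagation, a prediction of a more general term is never scored as wrong merely because the reference annotation was recorded at a more specific level.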

At the instance level, BioReason-Pro constructs a compact multimodal context for each protein by combining InterPro domains with residue ranges [62], the UniProt protein description [84], organism, subcellular localization, STRING interaction partners [80], and GO leaf terms across molecular function, biological process, and cellular component [83]. These contexts are then used to generate synthetic step-by-step reasoning traces, which end in a structured final answer containing a concise function summary, the relevant InterPro domains, GO terms, and an interaction hypothesis. Evaluation follows the CAFA temporal holdout protocol [68, 97]: proteins annotated before November 2022 are used for training and validation, while test proteins are selected from those that gained new experimental annotations between March 2023 and February 2024 and lacked annotations in the target aspect beforehand. The final temporal holdout test set contains 8,630 proteins and 230,824 propagated GO annotations.

Our version keeps the same overall prediction task and temporal evaluation logic, but simplifies the original BioReason-Pro setup to match the common multimodal format used throughout this paper [23]. In BioReason-Pro itself, the model consumes residue-level ESM3 embeddings [35], organism and textual biological context, GO-GPT predictions, and an additional GO graph encoder that injects explicit ontology structure into the language model [83]. In our benchmark, we remove this GO graph input and treat the task as reasoning from protein representations plus textual biological metadata alone. Concretely, the model receives protein embeddings together with text context such as organism, InterPro domains, and protein–protein interactions, and must infer function without direct access to ontology graph embeddings. This simplification preserves the core challenge of integrating sequence-derived and symbolic evidence for protein function prediction, while making the protein task architecturally comparable to the DNA and RNA settings studied in the main text.

A.4 Example Prompts and Inputs for Each Task

Pathway prediction.

A representative input consists of two versions of the same DNA sequence region, with and without the mutation, including DNA-specific start and padding tokens, followed by a pathway network definition and gene annotations. The model is then asked to infer the biological or disease effect associated with the allele. For example:

<|dna_pad|>...<|dna_pad|>

Question: Network Definition of the pathway: SOD1* -| BIP -| ERN1 -> XBP1; Genes in the pathway: SOD1; superoxide dismutase 1 | HSPA5; heat shock protein family A (Hsp70) member 5 | ERN1; endoplasmic reticulum to nucleus signaling 1 | XBP1; X-box binding protein 1. Given this context, what is the biological effect of this SOD1 allele, specifically what disease does this contribute to?

Answer: amyotrophic lateral sclerosis

Drug target identification.

A representative input consists of prepended RNA padding tokens followed by a disease- and cell-type-specific target selection question over candidate genes. For example:

<|rna_pad|>...<|rna_pad|>

Question: Among the genes PDRG1, PIK3CG, TRIM23, SUCO, and GCKR, which one exhibits the highest T cell-specific expression relevant for targeted intervention in follicular lymphoma?

Answer: PIK3CG

The five projected TranscriptFormer representations are inserted before the text prompt in the same order as the five candidate gene names listed in the question; this fixed ordering defines the alignment between continuous RNA embeddings and discrete gene symbols.
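The ordering convention described above can be illustrated with a minimal sketch. It assumes one projected vector per candidate gene and a toy hidden size; the function name `prepend_gene_embeddings` and the dictionary layout are hypothetical and are not taken from the released code.

```python
import numpy as np

def prepend_gene_embeddings(gene_embeddings, candidate_genes, text_embeddings):
    """Stack per-gene vectors in question order and place them before the text tokens.

    gene_embeddings: dict mapping gene symbol -> 1-D projected vector (assumed layout).
    candidate_genes: gene symbols in the order they appear in the question.
    text_embeddings: array of shape (n_text_tokens, hidden).
    """
    rna_block = np.stack([gene_embeddings[g] for g in candidate_genes], axis=0)
    return np.concatenate([rna_block, text_embeddings], axis=0)

genes = ["PDRG1", "PIK3CG", "TRIM23", "SUCO", "GCKR"]
hidden = 8  # illustrative hidden size, not the model's real width
emb = {g: np.full(hidden, float(i)) for i, g in enumerate(genes)}
text = np.zeros((12, hidden))
inputs = prepend_gene_embeddings(emb, genes, text)
```

Row `i` of the prepended block corresponds to the `i`-th gene named in the question, so shuffling the candidate list without reordering the embeddings would break the alignment.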

Protein function prediction.

A representative input consists of protein-specific padding tokens, followed by organism metadata, InterPro domain annotations, optional protein–protein interaction context, and initial GO term speculations. For example:

<|im_start|>user

Protein: <|protein_pad|><|protein_pad|>…<|protein_pad|>

You are a scientific assistant specialized in protein function prediction. Given a protein sequence, organism information, InterPro domain annotations, protein–protein interaction partners, and initial GO term speculations, reason about the function of the protein and summarize in UniProt format.

Organism: Homo sapiens

InterPro annotations:

• IPR000795: Translational tr-type GTP-binding domain (domain) [5–217]
• IPR027417: P-loop containing nucleoside triphosphate hydrolase (homologous superfamily) [5–239]
• IPR004161: Translation elongation factor EFTu-like, domain 2 (domain) [238–301]
• IPR049393: Selenocysteine-specific elongation factor, 3rd domain (domain) [317–448]
• IPR049394: Selenocysteine-specific elongation factor, C-terminal RIFT domain (domain) [465–590]
• IPR050055: Elongation factor Tu GTPase family (family) [5–445]

Protein–protein interaction partners: None provided.

Initial GO term speculations:

Molecular Function (MF): GO:0003924 GTPase activity, GO:0005525 GTP binding, GO:0003746 translation elongation factor activity

Biological Process (BP): GO:0006415 translational elongation, GO:0006412 translation

Cellular Component (CC): GO:0005829 cytosol

Reason about the function of the protein and focus on Molecular Function, Biological Process, and Cellular Component.

<|im_end|>

<|im_start|>assistant

Answer: This protein is a selenocysteine-specific translation elongation factor that uses GTP binding and hydrolysis to deliver selenocysteinyl-tRNA^Sec during translational elongation. A plausible functional interaction partner is SECIS-binding protein 2 (SBP2).

Appendix B Experimental Setup

B.1 Base Models, Tokenization, and Input Representations

DNA experiments.

The text backbones for DNA continued pre-training (CPT) and the downstream SFT/RL stages are Qwen3-1.7B and Qwen3-4B, loaded with their native tokenizers [91]. For CPT we train the language model alone on tokenized biological free-text (no DNA encoder is attached). For the post-CPT SFT and RL stages we couple the text backbone to a frozen Evo2-1B encoder via a trainable linear projection [7]; the DNA hidden state is prepended to the text embeddings. DNA sequences are clipped to a maximum length of nucleotides, with nucleotides retained on each flank around the variant locus.

RNA experiments.

The RNA experiments follow the same overall setup as the DNA experiments, replacing the sequence encoder and biological input modality while keeping the same text backbones and tokenizer choices. The text backbones for RNA CPT, SFT, and RL are Qwen3-1.7B and Qwen3-4B, loaded with their native tokenizers [91]. As in the DNA setting, CPT is performed on tokenized biological free-text using the language model alone, without attaching the RNA encoder.

For the downstream SFT and RL stages, we couple the text backbone to a frozen TranscriptFormer encoder through a trainable linear projection [63]. Each target-identification example contains a natural-language disease and cell-type context, a five-gene candidate set, and TranscriptFormer representations for the candidate genes in normal and disease states [63, 78]. The projected RNA hidden states are prepended to the text-token embeddings before the prompt tokens, so that the language model conditions jointly on transcriptomic representations and the textual task description.

Protein experiments.

The text backbones for the protein experiments are Qwen3-1.7B and Qwen3-4B-Thinking, loaded with their native BPE tokenizer [91]. The padding token is aliased to the end-of-sequence token. Both the SFT and RL prompts concatenate (i) a system instruction describing the task and available biological context, (ii) a header containing the protein name, organism, and amino-acid sequence, and (iii) a user instruction asking the model to emit GO identifiers across the molecular function, biological process, and cellular component aspects [83].

For SFT and GRPO we use the same BioReason-Pro-style protein-conditioned interface, except that we omit the GO-graph encoder [23]. The text backbone is paired with a frozen ESM-3 small protein encoder [35]. Per-residue embeddings are extracted from layer 37 of ESM-3, projected through a trainable linear layer into the text-embedding space, and inserted at the protein placeholder positions before the text tokens [35]. The protein encoder is kept frozen; the protein projection layer and the LoRA adapter on the text model receive gradients [38].
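The placeholder mechanism can be sketched as follows. This is an illustration under stated assumptions: the projection is a single affine map, the number of placeholder tokens equals the number of residues, and the names `insert_protein_embeddings` and `PAD` (with its id value) are hypothetical.

```python
import numpy as np

def insert_protein_embeddings(token_ids, text_embeds, residue_embeds, W, b, pad_id):
    """Project residue embeddings and write them over the placeholder slots."""
    projected = residue_embeds @ W + b            # (n_residues, d_text)
    slots = np.flatnonzero(token_ids == pad_id)   # positions of the protein placeholders
    if slots.size != projected.shape[0]:
        raise ValueError("placeholder count must equal the number of residues")
    out = text_embeds.copy()                      # leave the caller's array untouched
    out[slots] = projected
    return out

rng = np.random.default_rng(0)
PAD = 99  # hypothetical token id for <|protein_pad|>
token_ids = np.array([1, PAD, PAD, PAD, 7, 8])
text_embeds = rng.normal(size=(6, 4))
residue_embeds = rng.normal(size=(3, 5))
W, b = rng.normal(size=(5, 4)), np.zeros(4)
fused = insert_protein_embeddings(token_ids, text_embeds, residue_embeds, W, b, PAD)
```

Only `W` and `b` would receive gradients in this picture; the encoder that produced `residue_embeds` stays frozen.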

B.2 Continued Pre-training (Mid-training) Setup

We mid-train the two Qwen3 backbones on the biology subset of FineFineWeb [55]. We use the first 200,000 documents for training and hold out the next 5,000 as a fixed evaluation set, yielding a 200K/5K train/eval split.
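A head-of-stream split of this kind can be sketched as below. The helper name `split_stream` is hypothetical, and the sketch assumes the corpus is consumed as an ordered iterable; it does not reflect any particular dataset-loading library.

```python
from itertools import islice

def split_stream(docs, n_train=200_000, n_eval=5_000):
    """Take the first n_train documents for training and the next n_eval for evaluation."""
    it = iter(docs)
    train = list(islice(it, n_train))
    held_out = list(islice(it, n_eval))  # continues where the training slice stopped
    return train, held_out

# Toy usage with a small stand-in corpus.
train_docs, eval_docs = split_stream(range(12), n_train=8, n_eval=3)
```

Because both slices come from one iterator, the evaluation documents never overlap with the training documents.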

We do not perform additional benchmark-specific deduplication or decontamination of the FineFineWeb biology subset beyond the filtering already implicit in the source corpus [64, 55]. Our goal in CPT is to model a realistic domain-adaptation setting starting from publicly available pretrained LLMs, whose original pretraining corpora are not fully auditable and may already contain task-relevant biological text [33, 66, 32]. We therefore treat CPT as exposure to broad biological language rather than as a controlled from-scratch pretraining intervention. Importantly, the CPT corpus does not include our supervised reasoning traces, RL prompts, or any constructed train/test examples from the downstream benchmarks. The main leakage-sensitive comparisons in the paper are stage-wise and relative: all CPT and non-CPT models use the same downstream splits, and OOD evaluation is defined by held-out pathways, diseases, or species. We therefore interpret CPT results as measuring the effect of additional broad biological language adaptation under realistic pretrained-model conditions, not as evidence of strict benchmark decontamination.

Training uses the standard causal-LM next-token prediction loss. Inputs are tokenized with a maximum length of tokens. We optimize with AdamW under a cosine learning-rate schedule with linear warm-up, weight decay , gradient clipping , and bf16 mixed precision [54]. Each run trains for one epoch with a per-device batch size of . The CPT hyperparameter sweep varies the learning rate over and the gradient-accumulation steps over for both backbones. We select the best checkpoint by validation loss.
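The schedule shape (linear warm-up followed by cosine decay) can be sketched as a plain function of the step index. The function name and every numeric value in the usage line are illustrative assumptions and are not the hyperparameters used in the experiments.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps, floor_lr=0.0):
    """Learning rate at a given step: linear ramp to peak_lr, then cosine decay to floor_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return floor_lr + (peak_lr - floor_lr) * cosine

# Illustrative values only.
schedule = [warmup_cosine_lr(s, total_steps=100, peak_lr=1e-4, warmup_steps=10) for s in range(100)]
```

Setting `floor_lr` above zero gives the decay-floor variant, in which the rate settles at a small constant instead of reaching zero.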

B.3 Supervised Fine-Tuning Setup

DNA SFT.

DNA SFT couples a frozen Evo2-1B encoder with either Qwen3-1.7B or Qwen3-4B (including their post-CPT variants; see §B.2) [7, 91]. We use DeepSpeed Stage 2 on a single GPU with bf16 precision, batch size , gradient accumulation , AdamW with and weight decay , and the same warm-up-to-cosine schedule as the protein SFT ( warm-up, decay floor ) [69, 54]. The maximum DNA sequence length is nucleotides and the maximum text length is tokens. The epoch sweep ranges over for each of Qwen3-1.7B, Qwen3-4B, CPT-Qwen3-1.7B, CPT-Qwen3-4B, with the best CPT learning rate selected per backbone from §B.2. Random seed is .

RNA SFT.

RNA SFT follows the same training recipe as DNA SFT, replacing the Evo2 DNA encoder with the frozen TranscriptFormer encoder and using the target-identification examples described in §B.1 [63, 78]. Each example provides a disease, cell type, five candidate genes, and TranscriptFormer embeddings for the corresponding normal and disease states [63]. The projected TranscriptFormer hidden states are prepended to the text-token embeddings, and the model is trained to generate the correct target gene.

We use DeepSpeed Stage 2 on a single GPU with bf16 precision, batch size , gradient accumulation , AdamW with and weight decay , and linear warm-up followed by cosine decay to [69, 54]. The maximum text length is tokens. The main epoch sweep ranges over for each of Qwen3-1.7B and Qwen3-4B, using the full RNA training set [91]. The same sweep is repeated for the Gemma 4 E2B RNA backbone in the backbone ablation [29]. Random seed is .

Protein SFT.

SFT pairs the frozen ESM-3 small encoder with the Qwen3-1.7B or Qwen3-4B-Thinking text backbone (see §B.1) [35, 91].

We use single-GPU training with bf16 mixed precision, batch size , and gradient accumulation (effective batch size ). Optimization uses AdamW with , weight decay , linear warm-up followed by cosine decay to [54]. The maximum text sequence length is tokens and the maximum protein length is residues. Flash attention is enabled where supported, along with gradient checkpointing [19]. Validation is run at the end of each epoch on a held-out split, and we keep the single best checkpoint by validation loss. Two sweeps are run: a data fraction sweep at epoch over of the training data, and an epoch sweep at data over epochs. Random seed is fixed to .

B.4 Reinforcement-Learning (GRPO) Setup

DNA RL.

The DNA GRPO runs use a multimodal architecture wrapping Qwen3-1.7B/4B (or their CPT variants) with a frozen Evo2-1B encoder [31, 91, 7]. The text model receives a fresh LoRA adapter attached on top of the SFT-merged checkpoint (see §B.5); the DNA encoder is frozen and the DNA projection is trainable [38]. We use DeepSpeed Stage 2 with bf16 precision on a single GPU [69]. Generation uses rollouts per prompt with max completion length , , top-, and top-. Optimization runs at with a cosine schedule, warm-up, per-device batch size , gradient accumulation , gradient checkpointing, and for the KL anchor [60]. The reward combines format-adherence and correctness components following [22]. The number of GRPO epochs is swept over ; each RL run selects the best SFT checkpoint by validation accuracy, merges the SFT LoRA into the base weights, and then attaches a fresh RL adapter.

RNA RL.

RNA GRPO follows the same multimodal RL setup as DNA GRPO, replacing the Evo2-1B encoder with the frozen TranscriptFormer encoder and using the target-identification reward [31, 63, 78]. The policy wraps Qwen3-1.7B, Qwen3-4B, or the Gemma 4 E2B RNA backbone with the frozen TranscriptFormer encoder and a trainable projection layer [91, 29, 63]. As in the DNA setting, the SFT LoRA is first merged into the text backbone, after which a fresh RL LoRA adapter is attached for GRPO [38]. The TranscriptFormer encoder remains frozen throughout RL, while the projection layer and RL adapter are trainable.

For each prompt, the model receives the disease, cell type, five candidate genes, and projected TranscriptFormer representations for the corresponding normal and disease states [63, 78]. The reward contains a format-adherence term and a correctness term based on whether the final answer matches the held-out target gene. For the main Qwen RNA sweeps, GRPO starts from the strongest SFT checkpoints identified in the SFT sweep: Qwen3-1.7B-R-SFT4,1 and Qwen3-4B-R-SFT8,1. The number of GRPO epochs is swept over , yielding model families of the form Qwen3-1.7B-R-SFT4,1-RL,1 and Qwen3-4B-R-SFT8,1-RL,1. For the backbone ablation, the same procedure is applied to Gemma 4 E2B-R-SFT4,1-RL,1.

Unless otherwise stated, RNA GRPO uses the same optimization and generation hyperparameters as DNA GRPO: DeepSpeed Stage 2, bf16 precision, rollouts per prompt, max completion length , , top-, top-, AdamW with , cosine decay with warm-up, per-device batch size , gradient accumulation , gradient checkpointing, and KL coefficient [69, 54, 31]. The random seed is fixed to .

For evaluation, the generated final answer is normalized by case, whitespace, punctuation, and common gene-symbol formatting variants, and is counted correct only if it exactly matches the held-out target gene symbol. Following [22], the model gets additional rewards for following the expected output structure, i.e. a reasoning trace followed by a one-gene answer, and for staying below the token limit of 1024 tokens (conciseness reward).
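A minimal sketch of the scoring logic follows. The `Answer:` marker, the normalization rule (strip non-alphanumerics, upper-case), and the externally supplied token count are assumptions made for illustration. The three components are returned separately because the relative weights are not specified here.

```python
import re

ANSWER_RE = re.compile(r"Answer:\s*(.+?)\s*$", re.DOTALL)  # assumed output marker

def normalize_gene(symbol: str) -> str:
    """Drop whitespace and punctuation, then upper-case (assumed normalization)."""
    return re.sub(r"[^A-Za-z0-9]", "", symbol).upper()

def reward_components(completion: str, target: str, n_tokens: int, max_tokens: int = 1024):
    """Return format, correctness, and conciseness indicators for one completion."""
    match = ANSWER_RE.search(completion)
    answer = match.group(1) if match else ""
    has_reasoning = bool(match) and completion[: match.start()].strip() != ""
    one_gene = len(answer.split()) == 1
    return {
        "format": float(has_reasoning and one_gene),
        "correct": float(one_gene and normalize_gene(answer) == normalize_gene(target)),
        "concise": float(n_tokens < max_tokens),
    }

example = reward_components(
    "The T cell signal is strongest for this gene. Answer: pik3cg.", "PIK3CG", n_tokens=14
)
```

Keeping the components separate makes it easy to report how often a completion is well formed but wrong, which a single scalar reward would hide.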

Protein RL.

We apply Group Relative Policy Optimization (GRPO) to the same protein-conditioned Qwen3-1.7B/Qwen3-4B-Thinking policy used in SFT [31, 91]. For each prompt, frozen ESM-3 residue embeddings are projected into the Qwen embedding space and inserted at the protein placeholder positions; unlike BioReason-Pro, no GO-graph embeddings are included [35, 23]. All GRPO runs warm-start from the SFT-trained LoRA adapter and protein projection extracted from the matching SFT checkpoint; the no-warm-start ablation trains from scratch [38]. A frozen copy of this SFT-initialized protein-conditioned policy serves as the reference policy for the KL anchor [60]. Gradient checkpointing is enabled on the policy to control activation memory.

For each minibatch of prompts, we draw rollout completions per prompt with sampling temperature and top-, and compute the propagated GO-F1 reward

where denotes the set of ontology-propagated GO terms extracted from a completion or gold answer [83]. We form group-centered, batch-standard-deviation-normalized advantages

The objective is the per-token clipped surrogate of GRPO with an unbiased KL anchor against the frozen reference policy [73, 31]:

where

The old-policy log probabilities are computed under the pre-update rollout policy and cached or recomputed without gradient flow before the policy update; they are detached only for importance weighting and are not set equal to the current-policy numerator. The per-token estimator is

with clamped to for numerical safety.
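For orientation, the quantities described in this subsection commonly take the following forms in GRPO-style implementations. This is a sketch under assumed notation and not necessarily the exact expressions used in these experiments: $\mathcal{G}(\cdot)$ for the propagated GO-term set, $G$ for the group size, $\sigma_{\text{batch}}$ for the batch standard deviation of rewards, $\varepsilon$ for the clip range, $\beta$ for the KL coefficient, and $\delta_{i,t}$ for the clamped log-ratio are all labels introduced here.

```latex
% Sketch only: standard GRPO-style forms under assumed notation.
\begin{align}
r(o, y) &= \frac{2\,\lvert \mathcal{G}(o) \cap \mathcal{G}(y) \rvert}
                {\lvert \mathcal{G}(o) \rvert + \lvert \mathcal{G}(y) \rvert}, \\
\hat{A}_i &= \frac{r_i - \tfrac{1}{G}\sum_{j=1}^{G} r_j}{\sigma_{\text{batch}} + \epsilon_{\text{std}}}, \\
\mathcal{L}(\theta) &= -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{\lvert o_i \rvert}\sum_{t=1}^{\lvert o_i \rvert}
  \Big[\min\big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\big)
  - \beta\,\widehat{\mathrm{KL}}_{i,t}\Big], \\
\rho_{i,t} &= \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\text{old}}(o_{i,t}\mid q, o_{i,<t})}, \\
\widehat{\mathrm{KL}}_{i,t} &= e^{\delta_{i,t}} - \delta_{i,t} - 1, \qquad
\delta_{i,t} = \log \pi_{\text{ref}}(o_{i,t}\mid q, o_{i,<t}) - \log \pi_\theta(o_{i,t}\mid q, o_{i,<t}).
\end{align}
```

The last expression is non-negative for every token and, when tokens are sampled from $\pi_\theta$, its expectation equals $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$, which is the sense in which such an estimator is called unbiased.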

We sweep over . Other GRPO settings are: , , , batch size , group size , max new tokens , , top-. The optimizer is AdamW with , weight decay , gradient clipping , and bf16 precision [54]. The training data is capped to match the SFT data fractions of examples. We include InterPro features in the prompt but exclude protein–protein interactions [62, 80]. The reward is computed against ontology-propagated leaf GO terms; examples without resolvable gold terms are dropped [83]. Both the GRPO reward and the evaluation metric are propagated unweighted GO-F1 [23, 83].
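The reward and advantage computation can be illustrated with a small sketch. It assumes a child-to-parents map for the ontology (a toy one is used below, not real GO structure), an unweighted set-level F1, and a standard deviation taken over all rewards in the batch. The function names are hypothetical.

```python
import numpy as np

def propagate(terms, parents):
    """Close a set of terms under the ancestor relation given a child -> parents map."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed

def go_f1(pred_terms, gold_terms, parents):
    """Unweighted F1 between ontology-propagated predicted and gold term sets."""
    pred, gold = propagate(pred_terms, parents), propagate(gold_terms, parents)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def group_advantages(rewards, eps=1e-6):
    """Centre rewards within each prompt's group, then scale by the batch-level std."""
    rewards = np.asarray(rewards, dtype=float)  # shape (n_prompts, group_size)
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    return centered / (rewards.std() + eps)

toy_parents = {"leaf1": ["mid"], "leaf2": ["mid"], "mid": ["root"]}  # toy ontology
score = go_f1({"leaf1"}, {"leaf2"}, toy_parents)
adv = group_advantages([[1.0, 0.0], [0.5, 0.5]])
```

In the toy case the two sibling leaves share both ancestors, so the propagated F1 is 2/3 even though the leaf terms differ. This partial credit is what propagation adds over exact leaf matching.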

RL evaluation.

For each trained run we evaluate on both the ID and OOD test splits. We select the checkpoint with the highest centred 50-step rolling mean of the training reward, as several runs at reach peak reward mid-training before drifting. Generation uses sampling at with max new tokens ; greedy decoding is avoided because the deterministic path on Qwen3-4B-Thinking tends to remain in the reasoning block without emitting a final answer.
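The selection rule can be sketched with a moving average over the logged training reward. This sketch assumes one reward value per optimization step and returns the step at the centre of the best window. Mapping that step to the nearest saved checkpoint is left out, and the function name is hypothetical.

```python
import numpy as np

def best_step_by_rolling_mean(rewards, window=50):
    """Return the step index at the centre of the window with the highest mean reward."""
    rewards = np.asarray(rewards, dtype=float)
    if rewards.size < window:
        raise ValueError("need at least `window` logged steps")
    kernel = np.ones(window) / window
    smoothed = np.convolve(rewards, kernel, mode="valid")  # smoothed[i] averages rewards[i:i+window]
    return int(np.argmax(smoothed)) + window // 2

# Toy reward curve that rises to a peak at step 200 and then drifts down.
toy_rewards = 1.0 - np.abs(np.arange(400) - 200) / 200.0
selected = best_step_by_rolling_mean(toy_rewards)
```

Smoothing before taking the maximum avoids selecting a checkpoint on the basis of a single noisy high-reward step.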

B.5 LoRA Configurations and Trainable-Parameter Choices

All LoRA adapters target the seven attention and MLP projection matrices per transformer block: q, k, v, o, gate, up, down projections, with Gaussian initialization and no bias [38].

DNA LoRA.

Both the DNA SFT and RL stages use , , dropout on the same seven target modules. The Evo2 encoder is frozen; the DNA projection is trainable [7]. For RL, when the SFT and RL adapter ranks differ, we merge the SFT LoRA into the base weights and attach a fresh RL adapter at the requested rank [38].

RNA LoRA.

The RNA SFT stage uses , , dropout , while the RL stage uses , , dropout on the same seven target modules. The TranscriptFormer encoder is frozen; the RNA projection is trainable [63]. For RL, when the SFT and RL adapter ranks differ, we merge the SFT LoRA into the base weights and attach a fresh RL adapter at the requested rank [38].
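The merge-then-reattach step for DNA and RNA can be sketched for a single weight matrix using the standard LoRA parameterization, in which the adapted weight is W + (alpha / r) B A. The shapes, ranks, and the zero initialization of B below are illustrative assumptions following the usual LoRA convention, not the configuration values used here.

```python
import numpy as np

def merge_lora(W, A, B, alpha):
    """Fold a LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    rank = A.shape[0]
    return W + (alpha / rank) * (B @ A)

def fresh_adapter(d_out, d_in, rank, rng, std=0.02):
    """New adapter with Gaussian A and zero B, so it initially adds nothing."""
    A = rng.normal(0.0, std, size=(rank, d_in))
    B = np.zeros((d_out, rank))
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))                                      # base projection weight
A_sft, B_sft = rng.normal(size=(2, 6)), rng.normal(size=(4, 2))  # trained SFT adapter, rank 2
W_merged = merge_lora(W, A_sft, B_sft, alpha=4.0)
A_rl, B_rl = fresh_adapter(4, 6, rank=3, rng=rng)                # RL adapter at a different rank
```

Merging first means the RL adapter can use any rank, since it no longer has to match the shape of the SFT adapter.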

Protein SFT LoRA.

Rank , scaling , dropout . ESM-3 and the GO graph components are frozen; only the LoRA adapter on the text model and the protein-to-text projection layer receive gradients [35, 83, 38].

Protein RL warm-start.

The SFT LoRA adapter and protein projection are extracted from the SFT checkpoint and reattached to the protein-conditioned Qwen policy used for GRPO [38, 31]. ESM-3 remains frozen [35]. During GRPO, the LoRA adapter and protein projection are trainable, while the underlying Qwen3-4B-Thinking weights are frozen [91]. The no-warm-start ablation trains the full 4B-parameter model directly without a LoRA adapter.

B.6 Hyperparameters, Context Windows, Sequence Lengths, and Optimization

Table 1 summarizes the configuration used to produce every reported result. All training uses AdamW with , , . Random seed is for protein and RNA experiments and for DNA RL. All runs use bf16 mixed precision with flash attention where supported.

Table 1: Optimization and context-window settings per training stage. “warmcos” denotes linear warm-up followed by cosine decay to .

Stage	Backbone(s)		Schedule	Warm-up	Batch (eff.)
DNA/RNA CPT	Qwen3-1.7B/4B	cosine
Protein SFT	Qwen3-1.7B/4B-Thinking	warmcos
DNA SFT	Qwen3-1.7B/4B (+CPT)	warmcos
RNA SFT	Qwen3-1.7B/4B, Gemma 4 E2B	warmcos
Protein GRPO	Qwen3-1.7B/4B-Thinking	constant	—	,	gen.
DNA GRPO	Qwen3-1.7B/4B (+CPT)	cosine	,	gen.
RNA GRPO	Qwen3-1.7B/4B, Gemma 4 E2B	cosine	,	gen.

Text / protein-residue length caps. Text / DNA length caps (truncated symmetrically around the variant locus). Text length cap; TranscriptFormer embeddings are prepended separately.

Weight decay is on every stage; gradient clipping is . The protein GRPO step uses a constant learning rate; we rely on KL regularization and early-stopping checkpoint selection to control late-stage drift.

Sweep grids.

• DNA/RNA CPT: learning rate , gradient accumulation , epoch ( runs total).
• Protein SFT data sweep: data fraction at epoch.
• Protein RL data sweep: matched data fractions at epoch with warm-start; the ablation re-runs the same grid.
• Protein RL epoch sweep: epochs at data.
• DNA/RNA SFT/RL epoch sweep: epochs for SFT (capped at for RL), across Qwen3-1.7B, Qwen3-4B, and their post-CPT variants.

B.7 Compute Resources and Training Budget

All training and evaluation are run on a GPU cluster using NVIDIA H100 (80 GB) and H200 (141 GB) GPUs. We use one GPU per run; multi-node training was not required for any reported result.

• DNA/RNA CPT: H100, walltime up to days per run. Eight runs GPU-days.
• Protein SFT: H100, walltime up to days per run. Eight runs across the reported data and epoch sweeps.
• Protein RL: H200, walltime up to days per run with automatic checkpointing and requeueing. The reported warm-start sweep comprises runs, corresponding to the protein RL results in Figure 4. Each training run chains two GPU evaluation jobs, one for the ID split and one for the OOD split, with sampling at . We did not include the exploratory and no-warm-start protein RL runs in the reported results, so they are excluded from the compute accounting here.
• DNA/RNA SFT/RL: H100, walltime up to days per run. SFT and RL are submitted as chained pairs per backbone, with RL automatically selecting the best SFT checkpoint by validation accuracy.

The dominant reported compute cost is the protein RL sweep: training runs at up to days each ( H200-days), plus paired ID/OOD evaluation jobs ( GPU-days). The DNA/RNA CPT and protein SFT sweeps consume the next-largest shares at GPU-days each.

Appendix C Main-Results Tables

C.1 Results for Figure 2

Task	Model	Split	Base	SFT1	SFT2	SFT4	SFT8	SFT16	SFT32
DNA	Qwen3-1.7B	ID	0.562	0.724	0.838	0.866	0.893	0.907	0.879
DNA	Qwen3-1.7B	OOD	0.576	0.681	0.714	0.736	0.725	0.703	0.687
DNA	Qwen3-4B	ID	0.593	0.848	0.886	0.890	0.838	0.845	0.852
DNA	Qwen3-4B	OOD	0.582	0.702	0.736	0.755	0.752	0.734	0.710
RNA	Qwen3-1.7B	ID	0.219	0.614	0.861	0.875	0.861	0.875	0.861
RNA	Qwen3-1.7B	OOD	0.195	0.365	0.547	0.547	0.568	0.561	0.541
RNA	Qwen3-4B	ID	0.226	0.778	0.847	0.892	0.903	0.911	0.886
RNA	Qwen3-4B	OOD	0.202	0.543	0.584	0.606	0.627	0.594	0.569
Protein	Qwen3-1.7B	ID	0.103	0.282	0.330	0.357	0.368	0.333	0.315
Protein	Qwen3-1.7B	OOD	0.095	0.220	0.254	0.278	0.261	0.245	0.233
Protein	Qwen3-4B	ID	0.126	0.303	0.349	0.368	0.382	0.367	0.332
Protein	Qwen3-4B	OOD	0.108	0.322	0.364	0.340	0.342	0.305	0.291

Table 2: Numerical results corresponding to Figure 2. DNA and RNA are evaluated with accuracy; proteins are evaluated with propagated unweighted GO-F1. Values are reported for the base model at epoch 0 and after supervised fine-tuning for the indicated number of epochs.
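As a reading aid, the short script below locates the SFT epoch at which each DNA curve in Table 2 attains its maximum. The numbers are copied from the table; the variable and function names are illustrative, and ties are broken in favour of the earlier epoch.

```python
epochs = [0, 1, 2, 4, 8, 16, 32]  # Base is treated as epoch 0

# DNA rows of Table 2 (accuracy).
table2_dna = {
    ("Qwen3-1.7B", "ID"):  [0.562, 0.724, 0.838, 0.866, 0.893, 0.907, 0.879],
    ("Qwen3-1.7B", "OOD"): [0.576, 0.681, 0.714, 0.736, 0.725, 0.703, 0.687],
    ("Qwen3-4B", "ID"):    [0.593, 0.848, 0.886, 0.890, 0.838, 0.845, 0.852],
    ("Qwen3-4B", "OOD"):   [0.582, 0.702, 0.736, 0.755, 0.752, 0.734, 0.710],
}

def peak_epoch(scores):
    """Epoch with the highest score (first occurrence on ties)."""
    return epochs[scores.index(max(scores))]

peaks = {key: peak_epoch(scores) for key, scores in table2_dna.items()}
```

For Qwen3-1.7B the ID curve peaks at 16 epochs while the OOD curve peaks at 4; for Qwen3-4B both peak at 4 epochs.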

C.2 Results for Figure 3

Task	Model	Split	Base	4K	8K	12K	16K	20K
Protein	Qwen3-1.7B	ID	0.102	0.178	0.214	0.231	0.245	0.250
Protein	Qwen3-1.7B	OOD	0.095	0.122	0.188	0.218	0.216	0.222
Protein	Qwen3-4B	ID	0.118	0.222	0.266	0.248	0.278	0.269
Protein	Qwen3-4B	OOD	0.110	0.237	0.291	0.289	0.314	0.308

Table 3: Full numerical results corresponding to Figure 3 for protein function prediction under fixed-compute, variable-data supervised fine-tuning. All post-training runs use one SFT epoch while varying the number of training examples. Metrics are propagated unweighted GO-F1 on the ID and OOD test splits. Values are reported for the base model at 0K and after SFT on the indicated number of training examples.

C.3 Results for Figure 4

Task	Model	Split	RL1	RL2	RL4	RL8	RL16
DNA	Qwen3-1.7B	ID	0.860	0.939	0.932	0.930	0.924
DNA	Qwen3-1.7B	OOD	0.720	0.751	0.782	0.793	0.815
DNA	Qwen3-4B	ID	0.930	0.980	0.971	0.946	0.952
DNA	Qwen3-4B	OOD	0.738	0.783	0.805	0.818	0.838
RNA	Qwen3-1.7B	ID	0.625	0.722	0.736	0.777	0.785
RNA	Qwen3-1.7B	OOD	0.500	0.655	0.669	0.642	0.689
RNA	Qwen3-4B	ID	0.775	0.847	0.893	0.902	0.914
RNA	Qwen3-4B	OOD	0.582	0.694	0.708	0.732	0.745
Protein	Qwen3-1.7B	ID	0.648	0.776	0.797	0.805	–
Protein	Qwen3-1.7B	OOD	0.592	0.710	0.734	0.738	–
Protein	Qwen3-4B	ID	0.697	0.870	0.930	0.899	–
Protein	Qwen3-4B	OOD	0.682	0.893	0.956	0.909	–

Table 4: Full numerical results corresponding to Figure 4. DNA and RNA are evaluated with accuracy; proteins are evaluated with propagated unweighted GO-F1. Columns report performance after the indicated number of reinforcement-learning epochs. Protein RL was evaluated through 8 epochs in this figure.

C.4 Results for Figure 5

Task	Model	Split	Base	SFT	SFT+RL	CPT+SFT	CPT+SFT+RL
DNA	Qwen3-1.7B	ID	0.721	0.905	0.939	0.835	0.965
DNA	Qwen3-1.7B	OOD	0.651	0.724	0.893	0.935	0.959
DNA	Qwen3-4B	ID	0.798	0.894	0.980	0.917	0.986
DNA	Qwen3-4B	OOD	0.668	0.753	0.917	0.924	0.970
RNA	Qwen3-1.7B	ID	0.209	0.890	0.773	0.902	0.936
RNA	Qwen3-1.7B	OOD	0.181	0.569	0.692	0.718	0.754
RNA	Qwen3-4B	ID	0.218	0.912	0.918	0.910	0.939
RNA	Qwen3-4B	OOD	0.197	0.588	0.744	0.732	0.765

Table 5: Full numerical results corresponding to Figure 5 for the continued-pretraining ablation. DNA and RNA are evaluated with accuracy on the ID and OOD test splits. The Base column reports the non-CPT backbone before task-specific post-training. SFT and SFT+RL report the strongest non-CPT post-training configurations, while CPT+SFT and CPT+SFT+RL report the corresponding configurations initialized from the CPT-adapted backbone.

C.5 Results for Figure 6

Stage	Split	Base	Epoch 1	Epoch 2	Epoch 4	Epoch 8	Epoch 16	Epoch 32
SFT	ID	0.223	0.778	0.750	0.806	0.778	0.815	0.815
SFT	OOD	0.206	0.520	0.534	0.547	0.561	0.564	0.564
RL	ID	0.815	0.826	0.853	0.882	0.890	0.892	–
RL	OOD	0.564	0.599	0.684	0.702	0.724	0.729	–

Table 6: Full numerical results corresponding to Figure 6 for the Gemma4-E2B RNA backbone ablation. Metrics are accuracy on the RNA ID and OOD test splits. The SFT rows report the supervised fine-tuning epoch sweep, including the base model at epoch 0. The RL rows report the reinforcement-learning epoch sweep initialized from the strongest SFT checkpoint; RL was evaluated through 16 epochs.

C.6 Results for Figure 8

Task	Model	Split	SFT0/RL8	SFT1/RL7	SFT2/RL6	SFT3/RL5	SFT4/RL4	SFT5/RL3	SFT6/RL2	SFT7/RL1	SFT8/RL0
DNA	Qwen3-1.7B	ID	0.842	0.899	0.937	0.929	0.920	0.879	0.843	0.844	0.891
DNA	Qwen3-1.7B	OOD	0.618	0.746	0.752	0.781	0.752	0.731	0.701	0.715	0.713
DNA	Gemma4-E2B	ID	0.864	0.891	0.927	0.946	0.934	0.918	0.903	0.888	0.865
DNA	Gemma4-E2B	OOD	0.685	0.762	0.796	0.785	0.772	0.749	0.713	0.693	0.676
RNA	Qwen3-1.7B	ID	0.722	0.819	0.847	0.819	0.736	0.639	0.653	0.694	0.861
RNA	Qwen3-1.7B	OOD	0.662	0.777	0.757	0.770	0.669	0.566	0.549	0.554	0.568
RNA	Gemma4-E2B	ID	0.754	0.839	0.862	0.876	0.884	0.842	0.817	0.791	0.776
RNA	Gemma4-E2B	OOD	0.683	0.726	0.735	0.748	0.714	0.665	0.638	0.591	0.576

Table 7: Full numerical results corresponding to Figure 8 for the fixed-budget SFT–RL allocation experiment. DNA and RNA are evaluated with accuracy on the ID and OOD test splits. The total post-training budget is fixed to eight epoch-level passes, and columns indicate the allocation between supervised fine-tuning and reinforcement learning. Values are reported as proportions.
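The allocation with the best OOD accuracy in each row of Table 7 can be read off programmatically. The values below are copied from the OOD rows of the table; the names are illustrative.

```python
budget = 8
allocations = [(sft, budget - sft) for sft in range(budget + 1)]  # (SFT epochs, RL epochs)

# OOD rows of Table 7 (accuracy), ordered from SFT0/RL8 to SFT8/RL0.
table7_ood = {
    ("DNA", "Qwen3-1.7B"): [0.618, 0.746, 0.752, 0.781, 0.752, 0.731, 0.701, 0.715, 0.713],
    ("DNA", "Gemma4-E2B"): [0.685, 0.762, 0.796, 0.785, 0.772, 0.749, 0.713, 0.693, 0.676],
    ("RNA", "Qwen3-1.7B"): [0.662, 0.777, 0.757, 0.770, 0.669, 0.566, 0.549, 0.554, 0.568],
    ("RNA", "Gemma4-E2B"): [0.683, 0.726, 0.735, 0.748, 0.714, 0.665, 0.638, 0.591, 0.576],
}

best_allocation = {
    key: allocations[max(range(len(scores)), key=scores.__getitem__)]
    for key, scores in table7_ood.items()
}
```

In all four rows the best OOD allocation assigns between one and three epochs to SFT and the remainder to RL.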

Appendix D Additional Experiments

D.1 Scaling Post-Training for Biological Non-Reasoning Tasks

The main experiments in this paper focus on biological reasoning tasks, where the model must integrate biological inputs with natural-language context and produce mechanistic or structured outputs. To test whether the same post-training trends also appear in a more conventional biological prediction setting, we additionally evaluate supervised fine-tuning on variant effect prediction (VEP), using the coding non-SNV benchmark introduced in [22]. Unlike the KEGG-derived pathway prediction task, which requires multi-step mechanistic inference over molecular networks, VEP-Non-SNV is primarily a classification-style task: given paired reference and variant DNA sequences together with gene and chromosome context, the model predicts whether a coding non-SNV is benign or pathogenic, and, when pathogenic, the associated disease phenotype.

The VEP-Non-SNV dataset is constructed from ClinVar coding non-SNVs, filtered to include nuclear-genome variants affecting at most 64 base pairs, with sufficient clinical review status and transcript matching to GRCh38.p14. The original benchmark in [22] uses stratified train/test partitioning to balance disease representation and augments each example with paraphrased prompts. This makes the task biologically meaningful, but less dependent on explicit chain-of-thought-style mechanistic reasoning than the pathway prediction benchmark. In our setting, this experiment therefore serves as a non-reasoning control for testing whether increasing SFT compute continues to improve performance monotonically. [22] describes this benchmark as containing 36,088 core non-SNV entries and defines the task as predicting benign versus pathogenic status, with conditional disease prediction for pathogenic variants.

SFT Epochs	1	2	4	8	16	32
Accuracy	0.7123	0.7789	0.7965	0.8246	0.8316	0.8105

Table 8: SFT epoch scaling on the VEP-Non-SNV task. The model improves with additional supervised fine-tuning up to 16 epochs, after which performance declines, suggesting that even non-reasoning biological prediction tasks can exhibit non-monotonic SFT scaling.

Table 8 reports SFT epoch scaling accuracy for Qwen3-1.7B on VEP-Non-SNV. Performance improves steadily from one to sixteen epochs, increasing from 0.7123 at one epoch to 0.8316 at sixteen epochs, before declining at thirty-two epochs. Thus, even in this less explicitly reasoning-oriented setting, scaling SFT is not strictly monotonic: moderate additional supervision improves performance, but excessive training begins to degrade the final metric. This mirrors the broader pattern observed in the main biological reasoning experiments, where SFT is a strong driver of task performance but can over-specialize when scaled too far.

D.2 DNA LoRA Rank Allocation

Figure 10: Optimal adaptation requires asymmetric capacity across training stages. Higher LoRA rank benefits SFT, while lower rank is sufficient for RL, indicating that different stages require different adaptation capacity (both for ID and OOD tasks). Shown are results for pathway prediction (DNA) tasks.

D.3 Fixed Epoch, Variable Data during RL

In the main text, we demonstrated non-monotonic behavior when scaling RL epochs at fixed data size. A complementary question is whether scaling the number of RL examples at a fixed single epoch yields similar saturation. Table 9 reports protein function prediction performance (propagated GO-F1) for Qwen3-4B-Thinking trained with GRPO at (strong KL) for one epoch, varying the number of training examples from 4K to 20K.

ID F1 peaks at an intermediate data budget (12K), whereas OOD F1 peaks at the smallest budget evaluated (4K); neither metric improves monotonically as examples are added. The best OOD F1 (0.956 at 4K) exceeds the best ID F1 (0.952 at 12K), consistent with the observation in the main text that moderate RL improves generalization disproportionately. At the largest budget (20K), both metrics drop substantially below their respective peaks, with OOD F1 falling from 0.956 to 0.884. This suggests that, under a fixed single-epoch schedule, additional RL data does not substitute for the exploration benefits of multi-epoch training and can instead over-constrain the policy. The result complements the epoch-scaling findings and reinforces the broader conclusion that RL compute allocation, whether measured in epochs or data volume, requires careful calibration to maximize impact on biological reasoning capabilities.

Metric	4K	8K	12K	16K	20K
ID F1
OOD F1

Table 9: Protein function prediction F1 (Qwen3-4B-Thinking, GRPO, 1 epoch) as a function of the number of RL training examples. Increasing data does not monotonically improve performance; ID F1 peaks at 12K examples and OOD F1 at 4K, and both decline at larger budgets.

D.4 Fixed Budget, Variable Data Allocation between SFT and RL

We next ask whether the fixed-budget SFT–RL trade-off from Section 4.6 also appears when the total number of post-training examples is fixed and only the data allocation between SFT and RL is varied. Table 10 reports protein function prediction performance under a fixed 20K-example post-training budget. Each row allocates a different fraction of the data to SFT and RL while keeping the total number of examples constant.

SFT / RL data	SFT	RL	ID F1	OOD F1
0% / 100%	0	20000	0.7711	0.7953
20% / 80%	4000	16000	0.9470	0.9685
25% / 75%	5000	15000	0.9432	0.9620
40% / 60%	8000	12000	0.9167	0.9360
50% / 50%	10000	10000	0.9285	0.9569
60% / 40%	12000	8000	0.9055	0.9102
80% / 20%	16000	4000	0.8816	0.9065
100% / 0%	20000	0	0.2768	0.3177

Table 10: Fixed-budget data allocation between supervised fine-tuning and reinforcement learning. The total post-training data budget is fixed at 20K examples, and rows vary the fraction assigned to SFT versus RL. Metrics are reported as F1 on the ID and OOD protein function prediction splits.

The best results occur in the mixed SFT–RL regime rather than at either endpoint. Allocating a small fraction of examples to SFT and the majority to RL gives the strongest ID and OOD performance, with the 20%/80% split reaching 0.9470 ID F1 and 0.9685 OOD F1. Pure SFT performs poorly in this setting, while pure RL is substantially better but still far below the mixed allocations, indicating that RL benefits from a modest supervised warm start even when the total data budget is fixed.
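The comparison above can be reproduced directly from the rows of Table 10. The following is a minimal sketch rather than the released analysis code: the tuple layout and the helper name `best_allocation` are illustrative assumptions, while the numbers are copied from the table.

```python
# Rows copied from Table 10: (SFT examples, RL examples, ID F1, OOD F1).
TABLE_10 = [
    (0, 20000, 0.7711, 0.7953),
    (4000, 16000, 0.9470, 0.9685),
    (5000, 15000, 0.9432, 0.9620),
    (8000, 12000, 0.9167, 0.9360),
    (10000, 10000, 0.9285, 0.9569),
    (12000, 8000, 0.9055, 0.9102),
    (16000, 4000, 0.8816, 0.9065),
    (20000, 0, 0.2768, 0.3177),
]
BUDGET = 20000  # fixed total number of post-training examples


def best_allocation(rows, metric):
    """Return the row with the highest value of `metric` ('id' or 'ood')."""
    column = {"id": 2, "ood": 3}[metric]
    return max(rows, key=lambda row: row[column])


# Every row must spend exactly the fixed budget.
assert all(sft + rl == BUDGET for sft, rl, _, _ in TABLE_10)

best_id = best_allocation(TABLE_10, "id")
best_ood = best_allocation(TABLE_10, "ood")
for name, row in (("ID", best_id), ("OOD", best_ood)):
    sft_share = 100 * row[0] // BUDGET
    print(f"best {name}: {sft_share}% SFT / {100 - sft_share}% RL")
```

Both selections fall on the same row (4,000 SFT examples and 16,000 RL examples), so in this table there is no tension between the ID-optimal and OOD-optimal allocation.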

D.5 Reward Model Ablations

Domain	Model	SFT init.	RL epoch	OOD Acc.	Format-valid %	Parseable final %	Format-only %	Mean tokens
RNA	Qwen3-1.7B	S8	0	0.567	100.0	69.6	43.2	201.1
RNA	Qwen3-1.7B	S4	0	0.547	100.0	79.1	45.3	210.5
RNA	Qwen3-1.7B	S4	1	0.500	100.0	100.0	50.0	568.2
RNA	Qwen3-1.7B	S4	2	0.655	100.0	100.0	34.5	170.9
RNA	Qwen3-1.7B	S4	4	0.669	100.0	100.0	33.1	419.6
RNA	Qwen3-1.7B	S4	8	0.642	100.0	100.0	35.8	544.6

Table 11: OOD reward-hacking audit across RNA RL checkpoints. RL epoch 0 denotes the SFT initialization before RL. Format-only success is defined as a format-valid output with an incorrect final answer.
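The audit columns in Table 11 can be computed from per-example flags. The sketch below is one way to do so under stated assumptions: the `Generation` record and its field names are hypothetical and are not taken from the released code, and the four toy examples are illustrative rather than experimental data. Only the definition of format-only success follows the caption.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    """Hypothetical per-example record; field names are assumptions."""
    format_valid: bool  # output follows the expected response template
    parseable: bool     # a final answer could be extracted
    correct: bool       # extracted answer matches the reference
    n_tokens: int       # length of the generated output


def audit(generations):
    """Summarize accuracy and format statistics for one checkpoint."""
    n = len(generations)
    if n == 0:
        raise ValueError("audit requires at least one generation")

    def pct(count):
        return 100.0 * count / n

    return {
        "accuracy": sum(g.correct for g in generations) / n,
        "format_valid_pct": pct(sum(g.format_valid for g in generations)),
        "parseable_pct": pct(sum(g.parseable for g in generations)),
        # Format-only success: format-valid output with an incorrect answer.
        "format_only_pct": pct(
            sum(g.format_valid and not g.correct for g in generations)
        ),
        "mean_tokens": sum(g.n_tokens for g in generations) / n,
    }


# Toy inputs for illustration only.
toy = [
    Generation(True, True, True, 100),
    Generation(True, True, False, 200),
    Generation(True, False, False, 300),
    Generation(False, False, False, 400),
]
report = audit(toy)
print(report)
```

Under this definition, whenever every output is format-valid, the format-only rate equals 100% minus the accuracy. This agrees with the rows of Table 11 up to rounding, for example 0.655 accuracy and 34.5% format-only at RL epoch 2.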

D.6 ID/OOD Split Ablations

To test whether the qualitative training dynamics depend on how we define in-domain and out-of-domain settings, we provide an additional ablation in the RNA setting. Instead of splitting the data by held-out disease, we construct an alternative cell-type split for the target-identification task. We label an example as OOD if its canonical cell type is one of regulatory T cell, exhausted T cell, or myeloid cell, and use the remaining examples for training and validation. This yields 1,418 training examples, 75 validation examples, and 102 OOD test examples. The task format, model architecture, and evaluation protocol are otherwise unchanged from the main RNA experiments: the model receives the disease, cell type, five candidate genes, and aligned TranscriptFormer representations, and is evaluated by greedy generation with exact match against the target gene. This split therefore changes only the biological axis of distribution shift, from held-out disease to held-out cellular context, while preserving the same target-identification setup described earlier.
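The OOD labeling rule described above can be sketched as a simple partition. This is an illustration under assumptions: the dictionary layout and the `cell_type` key are hypothetical, and the subsequent division of the in-domain pool into training and validation sets is not specified here and is therefore omitted.

```python
# Cell types designated as OOD in the alternative split.
OOD_CELL_TYPES = frozenset(
    {"regulatory T cell", "exhausted T cell", "myeloid cell"}
)


def split_by_cell_type(examples, cell_type_key="cell_type"):
    """Partition examples into (in_domain, ood) by canonical cell type.

    `examples` is an iterable of dicts; `cell_type_key` names the field
    holding the canonical cell type (an assumed schema).
    """
    in_domain, ood = [], []
    for example in examples:
        if example[cell_type_key] in OOD_CELL_TYPES:
            ood.append(example)
        else:
            in_domain.append(example)
    return in_domain, ood


# Toy records for illustration only.
toy_examples = [
    {"id": 0, "cell_type": "regulatory T cell"},
    {"id": 1, "cell_type": "B cell"},
    {"id": 2, "cell_type": "myeloid cell"},
    {"id": 3, "cell_type": "hepatocyte"},
]
in_domain, ood = split_by_cell_type(toy_examples)
print(len(in_domain), "in-domain,", len(ood), "OOD")
```

Because the rule depends only on the canonical cell type, no held-out cell type can appear in the training or validation pool, which is what makes the shift one of cellular context rather than disease.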

Stage	Checkpoint	ID val.	OOD test
SFT	1 epoch	54.7	60.8
SFT	2 epochs	54.7	43.1
SFT	4 epochs	58.7	39.2
SFT	8 epochs	62.7	41.2
RL	1 epoch	65.3	88.2
RL	2 epochs	68.0	92.2
RL	4 epochs	66.7	93.1
RL	8 epochs	73.3	95.1

Table 12: RNA target-identification performance under the held-out cell-type split. OOD examples are those whose canonical cell type is regulatory T cell, exhausted T cell, or myeloid cell. All values are exact-match accuracies, reported as percentages, under greedy generation.

Under this cell-type split, supervised fine-tuning again improves in-domain performance while failing to improve OOD generalization. Qwen3-1.7B reaches 54.7% ID accuracy after one epoch and increases to 62.7% by eight epochs, but OOD accuracy drops from 60.8% at one epoch to 39.2–43.1% across later SFT checkpoints. Starting GRPO from the four-epoch SFT checkpoint reverses this behavior: OOD accuracy rises from 39.2% at initialization to 88.2% after one RL epoch and continues improving to 95.1% after eight RL epochs, while ID accuracy also increases from 58.7% to 73.3%. This ablation supports the main conclusion that the SFT–RL contrast is not specific to the held-out hepatoblastoma disease split used in the main RNA experiments. Instead, the same qualitative pattern appears under a distinct biologically meaningful OOD definition: SFT fits the training/validation distribution, whereas RL substantially improves transfer to held-out cellular contexts.
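The changes quoted above follow from Table 12 by subtracting the four-epoch SFT initialization from each RL checkpoint. The snippet below is a small worked sketch of that arithmetic; the dictionary layout and the helper name `gains_over_init` are illustrative assumptions, and the values are copied from the table.

```python
# Table 12 values: epochs -> (ID val. accuracy %, OOD test accuracy %).
SFT = {1: (54.7, 60.8), 2: (54.7, 43.1), 4: (58.7, 39.2), 8: (62.7, 41.2)}
RL = {1: (65.3, 88.2), 2: (68.0, 92.2), 4: (66.7, 93.1), 8: (73.3, 95.1)}
INIT_SFT_EPOCHS = 4  # RL starts from the four-epoch SFT checkpoint


def gains_over_init(rl, init):
    """Percentage-point gains (ID, OOD) of each RL checkpoint over `init`."""
    init_id, init_ood = init
    return {
        epochs: (round(id_acc - init_id, 1), round(ood_acc - init_ood, 1))
        for epochs, (id_acc, ood_acc) in rl.items()
    }


gains = gains_over_init(RL, SFT[INIT_SFT_EPOCHS])
# SFT epoch count with the highest OOD accuracy.
sft_ood_peak = max(SFT, key=lambda epochs: SFT[epochs][1])

for epochs, (d_id, d_ood) in gains.items():
    print(f"RL {epochs} epoch(s): ID +{d_id} pts, OOD +{d_ood} pts")
print("SFT OOD accuracy peaks at epoch", sft_ood_peak)
```

The OOD gain is several times larger than the ID gain at every RL checkpoint, while SFT alone attains its best OOD accuracy at the first epoch.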

Appendix E Asset Licenses and Redistribution Status

Table 13: Existing assets used in this work and their licenses or terms of use.

Asset	Used in this paper	License or terms of use
BIOREASON pathway prediction benchmark (wanglab/kegg)	Pathway prediction benchmark for DNA reasoning experiments	Apache-2.0, as listed on the Hugging Face dataset card.
MEDEA / MedeaDB target-identification task	RNA target-identification benchmark and disease/cell-type/candidate-gene setup.	The MEDEA code repository is Apache-2.0. The benchmark data are distributed separately through the mims-harvard/MedeaDB Hugging Face dataset (CC BY-NC-SA 4.0) referenced by the MEDEA README. We use the targetID task structure and TranscriptFormer setting described in MEDEA. We do not redistribute the raw MedeaDB data unless permitted by the MedeaDB dataset license/terms; any derived splits or preprocessing scripts will be released only under terms compatible with the upstream data license.
BioReason-Pro SFT reasoning data (wanglab/bioreason-pro-sft-reasoning-data)	Protein function reasoning benchmark/data	Apache-2.0, as listed on the Hugging Face dataset card.
FineFineWeb biology subset	Continued pre-training corpus	Open Data Commons Attribution License, ODC-By v1.0; use is also subject to CommonCrawl Terms of Use.
Qwen3-1.7B / Qwen3-4B	Text LLM backbones for DNA, RNA, and protein experiments	Apache-2.0, as listed for Qwen3 model files on Hugging Face.
Gemma 4 E2B	Backbone ablation for RNA experiments	Apache-2.0, as stated in the official Gemma 4 model card.
Evo2-1B	Frozen DNA encoder	Apache-2.0 for the Evo2 repository and released model artifacts.
TranscriptFormer	Frozen transcriptomic encoder	MIT License, as listed by the TranscriptFormer repository/model page.
ESM-3 / ESM3-1B protein encoder	Frozen protein encoder	Custom non-commercial model-weight terms for the ESM-3 Open Model checkpoint used, including EvolutionaryScale’s Cambrian Non-Commercial License Agreement where applicable.

URL: https://arxiv.org/html/2606.16517v1