
URL: https://huggingface.co/openenv-community/replicalab-scientist-grpo-lora



ReplicaLab Scientist — GRPO LoRA Adapter

A LoRA adapter fine-tuned on unsloth/Qwen3.5-0.8B using Group Relative Policy Optimization (GRPO) for multi-agent scientific negotiation.

What is ReplicaLab?

ReplicaLab is a multi-agent, constraint-aware planning environment that trains an AI Scientist agent to negotiate feasible scientific replication plans under realistic resource constraints. A Lab Manager agent enforces budgets, schedules, and equipment limits, while a deterministic Judge scores every plan on rigor, feasibility, and fidelity.

Live demo: ayushozha-replicalab.hf.space

Training Details

  • Method: GRPO (Group Relative Policy Optimization) via TRL
  • Base model: unsloth/Qwen3.5-0.8B
  • LoRA config: rank=16, alpha=32, dropout=0.0
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Hardware: NVIDIA H100 80GB HBM3 (Northflank)
  • Steps: 200 (checkpoints at 100, 150, 200)
  • Training framework: Unsloth + TRL 0.24.0 + PEFT 0.18.1
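As a rough illustration of the "group relative" part of GRPO named in the list above: several completions are sampled for the same prompt, each is scored, and every reward is normalized against the mean and spread of its own group. The sketch below is a minimal standalone version, not the TRL implementation; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group (hypothetical helper).

    rewards: scores for several completions sampled from the same prompt.
    eps: small constant (an assumed value) that avoids division by zero
         when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four illustrative rewards for one prompt (made-up numbers)
group = [2.0, 4.0, 6.0, 8.0]
advantages = group_relative_advantages(group)

# Completions above the group mean get a positive advantage,
# completions below it get a negative one.
for reward, advantage in zip(group, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Because the baseline is the group mean, no separate value model is needed; a group in which all completions score the same contributes zero advantage.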

Reward Formula

total_reward = 10 × rigor × feasibility × fidelity × parsimony
 + efficiency_bonus + communication_bonus − penalties

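The multiplicative core prevents fake wins: because the four scores are multiplied, a near-zero value on any one of them pulls the whole core term toward zero, so a theoretically strong but impossible plan still scores low.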
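The formula above can be sketched directly in Python. This is a minimal transcription, not the Judge's actual code: the function name, the default values of the bonus and penalty terms, and the assumption that each score lies in [0, 1] are illustrative, and the example numbers are made up.

```python
def total_reward(rigor, feasibility, fidelity, parsimony,
                 efficiency_bonus=0.0, communication_bonus=0.0, penalties=0.0):
    """Transcription of the reward formula (scores assumed to lie in [0, 1])."""
    core = 10 * rigor * feasibility * fidelity * parsimony
    return core + efficiency_bonus + communication_bonus - penalties

# A rigorous plan that is almost impossible to carry out (made-up scores)
strong_but_impossible = total_reward(rigor=0.95, feasibility=0.05,
                                     fidelity=0.90, parsimony=1.0)

# A more modest plan that is actually feasible (made-up scores)
balanced = total_reward(rigor=0.70, feasibility=0.80,
                        fidelity=0.70, parsimony=1.0)

print(f"strong but impossible: {strong_but_impossible:.4f}")
print(f"balanced:              {balanced:.4f}")
```

With these illustrative inputs the infeasible plan ends up far below the balanced one, even though its rigor and fidelity are higher, which is the behavior the multiplicative core is meant to produce.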

Training Curves

Overview

[Figure: Training Overview]

Reward Over Training

[Figure: Reward Curve]

Training Loss

[Figure: Loss Curve]

KL Divergence

[Figure: KL Divergence]

Completion Length

[Figure: Completion Length]

Evaluation Results

Improvement Over Baseline

[Figure: Improvements]

Side-by-Side Comparison

[Figure: Eval Comparison]

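Metric                | Baseline Scientist | Trained Scientist | Change (relative)
----------------------|--------------------|-------------------|------------------
Average reward        | 4.25               | 7.10              | +67%
Rounds to agreement   | 4.1                | 2.8               | −32%
Invalid action rate   | 15%                | 4%                | −73%
Agreement rate        | 50%                | 80%               | +60%
Avg rigor score       | 0.55               | 0.72              | +31%
Avg feasibility score | 0.52               | 0.78              | +50%
Avg fidelity score    | 0.58               | 0.71              | +22%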
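The Change column is consistent with a relative change against the baseline, (trained − baseline) / baseline, rather than a difference in percentage points; for example, the agreement rate moving from 50% to 80% is reported as +60%. A small sketch that recomputes the column from the two value columns (the helper name and dictionary layout are illustrative):

```python
def relative_change_percent(baseline, trained):
    """Relative change of `trained` against `baseline`, in percent."""
    return (trained - baseline) / baseline * 100

# (baseline, trained) pairs copied from the table; rates are in percent
results = {
    "Average reward": (4.25, 7.10),
    "Rounds to agreement": (4.1, 2.8),
    "Invalid action rate": (15, 4),
    "Agreement rate": (50, 80),
    "Avg rigor score": (0.55, 0.72),
    "Avg feasibility score": (0.52, 0.78),
    "Avg fidelity score": (0.58, 0.71),
}

for metric, (baseline, trained) in results.items():
    change = relative_change_percent(baseline, trained)
    print(f"{metric}: {change:+.0f}%")
```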

Scenario Families

Template        | Domain           | Example Task
----------------|------------------|----------------------------------------------------------
math_reasoning  | Mathematics      | Proof planning under deadline and review constraints
ml_benchmark    | Machine Learning | Model replication with compute and time budgets
finance_trading | Finance          | Backtest design under capital and risk limits

Quick Start

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter weights on top of it
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3.5-0.8B")
model = PeftModel.from_pretrained(base_model, "openenv-community/replicalab-scientist-grpo-lora")
# The tokenizer is loaded from the adapter repository
tokenizer = AutoTokenizer.from_pretrained("openenv-community/replicalab-scientist-grpo-lora")

# Use within the ReplicaLab environment for scientific negotiation

Framework Versions

  • PEFT: 0.18.1
  • TRL: 0.24.0
  • Transformers: 5.2.0
  • PyTorch: 2.8.0+cu128
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citation

@misc{replicalab2026,
 title = {ReplicaLab: Multi-Agent Constraint-Aware Planning for Scientific Replication},
 author = {Ayush Ojha and Kian and Peixi and Kush},
 year = 2026,
 url = {https://github.com/Ayush10/replicalab-ai}
}

License

MIT
