
URL: https://huggingface.co/jupyter-agent/jupyter-agent-qwen3-4b-thinking



Jupyter Agent Qwen3-4B Thinking


Jupyter Agent Qwen3-4B Thinking is a fine-tuned version of Qwen3-4B-Thinking-2507 specifically optimized for data science agentic tasks in Jupyter notebook environments. This model can execute Python code, analyze datasets, and provide step-by-step reasoning with intermediate computations to solve realistic data analysis problems.

  • Model type: Causal Language Model (Thinking)
  • Language(s): English, Python
  • License: Apache 2.0
  • Finetuned from: Qwen/Qwen3-4B-Thinking-2507

Key Features

  • Jupyter-native agent that lives inside notebook environments
  • Code execution with pandas, numpy, matplotlib, and other data science libraries
  • Step-by-step reasoning with intermediate computations and thinking traces
  • Dataset-grounded analysis trained on real Kaggle notebook workflows
  • Tool calling for structured code execution and final answer generation

Performance

On the DABStep benchmark for data science tasks:

Model                              Easy Tasks   Hard Tasks
Qwen3-4B-Thinking-2507 (Base)      44.0%        2.1%
Jupyter Agent Qwen3-4B Thinking    70.8%        3.4%

State-of-the-art performance for small models on realistic data analysis tasks.

Usage

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jupyter-agent/jupyter-agent-qwen3-4b-thinking"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare input
prompt = "Analyze this sales dataset and find the top 3 performing products by revenue."
messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# Drop the prompt tokens so only the newly generated tokens remain
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

Decoding Thinking and Content

For thinking models, you can extract both the reasoning and final response:

try:
    # Find the end of thinking section (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("Thinking:", thinking_content)
print("Response:", content)
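
The splitting step above can be tried without loading the model. The following is a minimal sketch, assuming the same `</think>` token ID (151668) as the snippet above; the helper name `split_at_last` and the toy token IDs are illustrative and not part of the model card.

```python
from typing import List, Tuple

# Token ID of "</think>" as used in the decoding snippet above.
END_THINK_ID = 151668


def split_at_last(ids: List[int], marker: int = END_THINK_ID) -> Tuple[List[int], List[int]]:
    """Split ids just after the last occurrence of marker.

    Returns (thinking_ids, content_ids). If marker is absent, thinking_ids is
    empty and every token is treated as content, like the except branch above.
    """
    try:
        # Searching the reversed list finds the last occurrence of the marker.
        index = len(ids) - ids[::-1].index(marker)
    except ValueError:
        index = 0
    return ids[:index], ids[index:]


# Toy IDs standing in for real tokens (hypothetical values).
thinking_ids, content_ids = split_at_last([11, 12, END_THINK_ID, 21, 22])
print(thinking_ids)  # [11, 12, 151668]
print(content_ids)   # [21, 22]
```

Searching from the end means that a `</think>` marker quoted earlier in the reasoning does not cut the trace short.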

Agentic Usage with Tool Calling

The model works best with proper scaffolding for tool calling:

tools = [
    {
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code in a Jupyter environment",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python code to execute"
                    }
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "final_answer",
            "description": "Provide the final answer to the question",
            "parameters": {
                "type": "object",
                "properties": {
                    "answer": {
                        "type": "string",
                        "description": "The final answer"
                    }
                },
                "required": ["answer"]
            }
        }
    }
]

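# Include tools in the conversation
messages = [
    {
        "role": "system",
        "content": "You are a data science assistant. Use the available tools to analyze data and provide insights."
    },
    {"role": "user", "content": prompt}
]

# Pass the tool schemas to the chat template so they are rendered into the prompt
text = tokenizer.apply_chat_template(
    messages, tools=tools, tokenize=False, add_generation_prompt=True
)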
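
Once the model responds, the scaffolding has to pull the tool calls out of the generated text. The sketch below assumes the model wraps each call as a JSON object inside `<tool_call>...</tool_call>` tags, which is the convention of the Qwen chat templates; that format, the helper name `parse_tool_calls`, and the sample string are assumptions to check against the actual template, not something this card specifies.

```python
import json
import re
from typing import Any, Dict, List

# Assumed wrapper tags; verify against the model's chat template before relying on them.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return every well-formed tool call found in text, in order of appearance."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing the agent loop
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls


# Hypothetical model output used only for illustration.
sample = (
    "I will inspect the data first.\n"
    '<tool_call>\n{"name": "execute_code", "arguments": {"code": "print(1 + 1)"}}\n</tool_call>'
)
for call in parse_tool_calls(sample):
    print(call["name"], call["arguments"])
```

A scaffold would then route `execute_code` calls to a sandboxed kernel, append the result to `messages` as a tool message, and stop once a `final_answer` call appears.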

Training Details

Training Data

The model was fine-tuned on the Jupyter Agent Dataset, which contains:

  • 51,389 synthetic notebooks (~0.2B tokens, total 1B tokens)
  • Dataset-grounded QA pairs from real Kaggle notebooks
  • Executable reasoning traces with intermediate computations
  • High-quality educational content filtered and scored by LLMs

Training Procedure

  • Base Model: Qwen3-4B-Thinking-2507
  • Training Method: Full-parameter fine-tuning (not PEFT)
  • Optimizer: AdamW with cosine learning rate scheduling
  • Learning Rate: 5e-6
  • Epochs: 5 (optimal based on ablation study)
  • Context Length: 32,768 tokens
  • Batch Size: Distributed across multiple GPUs
  • Loss: Assistant-only loss (assistant_loss_only=True)
  • Regularization: NEFTune noise (α=7) for full-parameter training
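
For readers who want to reproduce a similar setup, the hyperparameters above can be gathered into keyword arguments. This is a sketch only: the authors' training script is not part of this card, and mapping the values onto the argument names used by `transformers.TrainingArguments` is an assumption. Settings whose argument names differ between TRL versions are kept in a separate, descriptive dictionary rather than guessed.

```python
# Values from the list above, keyed by the argument names that
# transformers.TrainingArguments uses (the mapping is an assumption).
training_kwargs = {
    "learning_rate": 5e-6,
    "num_train_epochs": 5,
    "lr_scheduler_type": "cosine",
    "neftune_noise_alpha": 7,
}

# Settings from the list above, kept under descriptive (hypothetical) keys
# because the matching argument names depend on the TRL version in use.
extra_settings = {
    "context_length_tokens": 32_768,
    "assistant_only_loss": True,
    "full_parameter_finetuning": True,
}

for name, value in {**training_kwargs, **extra_settings}.items():
    print(f"{name:28} {value}")
```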

Training Infrastructure

  • Framework: TRL with Transformers
  • Distributed Training: DeepSpeed ZeRO-2 across multiple nodes
  • Hardware: Multi-GPU setup with SLURM orchestration

Evaluation

Benchmark: DABStep

The model was evaluated on DABStep, a benchmark for data science agents with realistic tasks involving:

  • Dataset analysis with pandas and numpy
  • Visualization with matplotlib/seaborn
  • Statistical analysis and business insights
  • Multi-step reasoning with intermediate computations

The model improves the easy-task score by 26.8 percentage points over the base model (44.0% to 70.8%) and by 11.1 points over scaffolding alone.
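
The 26.8 figure is an absolute difference between two scores, not a relative change. A quick check using the numbers from the Performance table makes the distinction concrete; the helper name `gains` is illustrative.

```python
# Scores (in percent) taken from the Performance table.
scores = {
    "easy": {"base": 44.0, "finetuned": 70.8},
    "hard": {"base": 2.1, "finetuned": 3.4},
}


def gains(base: float, finetuned: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(finetuned - base, 1)
    relative = round(100 * (finetuned - base) / base, 1)
    return absolute, relative


for split, s in scores.items():
    absolute, relative = gains(s["base"], s["finetuned"])
    print(f"{split}: +{absolute} points absolute, +{relative}% relative")
# easy: +26.8 points absolute, +60.9% relative
# hard: +1.3 points absolute, +61.9% relative
```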

[Figure: DABstep Easy Score]

The hard score also increases (2.1% to 3.4%), even though our dataset focuses on easier questions.

[Figure: DABstep Hard Score]

Limitations and Bias

Technical Limitations

  • Context window: Limited to 32K tokens, may struggle with very large notebooks
  • Tool calling format: Requires specific scaffolding for optimal performance
  • Dataset domains: Primarily trained on Kaggle-style data science tasks
  • Code execution: Requires proper sandboxing for safe execution

Potential Biases

  • Domain bias: Trained primarily on Kaggle notebooks, may not generalize to all data science workflows
  • Language bias: Optimized for English and Python, limited multilingual support
  • Task bias: Focused on structured data analysis, may underperform on unstructured data tasks

Recommendations

  • Use in sandboxed environments like E2B for safe code execution
  • Validate outputs before using in production systems
  • Review generated code for security and correctness
  • Consider domain adaptation for specialized use cases
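
As a local stand-in while prototyping, generated code can at least be kept out of the calling process. The sketch below runs a snippet in a fresh Python interpreter with a timeout, using only the standard library. It is not a security sandbox and does not replace a hosted sandbox such as E2B; the helper name `run_in_subprocess` and its return shape are illustrative.

```python
import subprocess
import sys
from typing import Dict


def run_in_subprocess(code: str, timeout_s: float = 10.0) -> Dict[str, object]:
    """Run code in a fresh Python interpreter and capture its output.

    This shields the caller from crashes and runaway loops. It is NOT a
    security sandbox: the child can still read files and use the network.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


result = run_in_subprocess("print(sum(range(5)))")
print(result["ok"], result["stdout"].strip())  # True 10
```

Returning stderr alongside stdout lets the scaffold feed tracebacks back to the model so it can correct its own code on the next turn.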

Ethical Considerations

  • Code Safety: Always execute generated code in secure, isolated environments
  • Data Privacy: Be cautious when analyzing sensitive datasets
  • Verification: Validate all analytical conclusions and insights
  • Attribution: Acknowledge model assistance in data analysis workflows

Citation

@misc{jupyteragentqwen3thinking,
 title={Jupyter Agent Qwen3-4B Thinking},
 author={Baptiste Colle and Hanna Yukhymenko and Leandro von Werra},
 year={2025},
 publisher={Hugging Face},
 url={https://huggingface.co/jupyter-agent/jupyter-agent-qwen3-4b-thinking}
}

Related Work

For more details, see our blog post and GitHub repository.
