Slonik-7B-GRPO — GGUF

GGUF quantizations of Phani-labs/Slonik-7B-GRPO, built to run locally through llama.cpp, Ollama, LM Studio, Jan, and other GGUF-compatible runtimes.

Why I built this

I wanted a small text-to-SQL model that could run locally but still handle real PostgreSQL and SQLite questions. Most strong SQL models today are much larger, cloud-only, or awkward to integrate into local workflows. This project was an experiment to see how far a 7B coding model could go with focused supervised fine-tuning followed by execution-based reinforcement learning.

The surprising part: on the BIRD-PG eval, the 7B model came out ahead of GPT-4o (38.20% vs. 34.44%) while still being small enough to run on a laptop.

Results

All numbers are from the 500-example BIRD Mini-Dev set, evaluated against the BIRD PostgreSQL dump loaded into a local Postgres + pgvector instance.

Model	BIRD-PG	BIRD-SQLite	Size / type
o3-mini	47.78%	—	reasoning
Claude 3.7 Sonnet	39.26%	—	proprietary
Slonik-7B-GRPO (this)	38.20%	45.20%	7B
GPT-4o	34.44%	—	proprietary
Qwen2.5-Coder-32B	22.96%	—	32B
Codestral 22B	21.11%	—	22B
Qwen2.5-Coder-7B (base)	12.22%	—	7B

By difficulty

Tier	BIRD-PG	BIRD-SQLite
Simple	56.1%	66.2%
Moderate	33.6%	38.0%
Challenging	23.5%	32.4%
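
As a sanity check, the per-tier numbers can be recombined into the overall scores. The sketch below assumes the usual Mini-Dev tier sizes of 148 simple, 250 moderate, and 102 challenging questions; those counts are not stated in this card, so treat them as an assumption.

```python
# Tier sizes are an assumption (not stated in this card).
TIER_COUNTS = {"simple": 148, "moderate": 250, "challenging": 102}

# Per-tier accuracy in percent, copied from the table above.
BIRD_PG = {"simple": 56.1, "moderate": 33.6, "challenging": 23.5}
BIRD_SQLITE = {"simple": 66.2, "moderate": 38.0, "challenging": 32.4}


def overall_accuracy(tier_acc, tier_counts=TIER_COUNTS):
    """Recover the overall percentage from rounded per-tier percentages."""
    # Round to whole questions, since each tier score is correct / total.
    correct = sum(round(tier_acc[t] / 100 * n) for t, n in tier_counts.items())
    return 100 * correct / sum(tier_counts.values())


if __name__ == "__main__":
    print(f"BIRD-PG:     {overall_accuracy(BIRD_PG):.2f}%")
    print(f"BIRD-SQLite: {overall_accuracy(BIRD_SQLITE):.2f}%")
```

Under that assumption, both results land on the overall scores in the Results table (38.20% and 45.20%).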

Available quantizations

File	Quant	Size	Notes
`Slonik-7B-GRPO.Q4_K_M.gguf`	Q4_K_M	4.4 GB	Best quality-to-size tradeoff. Runs on 8 GB VRAM or CPU.
`Slonik-7B-GRPO.Q5_K_M.gguf`	Q5_K_M	5.1 GB	Slightly better quality if you have the memory. Runs on 8 GB VRAM.
`Slonik-7B-GRPO.Q8_0.gguf`	Q8_0	7.6 GB	Near-lossless. Best if you have 12 GB VRAM or enough system RAM.

Most people should start with Q4_K_M. It's the easiest to run and gives the best quality-to-size balance. Use Q5_K_M if you have memory to spare, or Q8_0 if you want results closest to the original model.

Usage

Ollama

ollama pull hf.co/Phani-labs/Slonik-7B-GRPO-GGUF:Q4_K_M
ollama run hf.co/Phani-labs/Slonik-7B-GRPO-GGUF:Q4_K_M

If Ollama has trouble picking the template automatically, use the prompt format shown below.
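
If you would rather call Ollama from a script, here is a minimal sketch using its local REST endpoint with `raw` mode, so the prompt is sent verbatim and no template is applied. It assumes the default Ollama address (`localhost:11434`) and the model tag from the pull command; `build_request` and `generate` are illustrative names, not part of any library.

```python
import json
import urllib.request

# Assumptions: Ollama's default local endpoint and the tag pulled above.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_TAG = "hf.co/Phani-labs/Slonik-7B-GRPO-GGUF:Q4_K_M"


def build_request(prompt: str, max_tokens: int = 200) -> dict:
    """Build a /api/generate payload that bypasses Ollama's own template."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "raw": True,      # send the prompt verbatim, no template applied
        "stream": False,  # return one JSON object instead of a stream
        "options": {"temperature": 0, "num_predict": max_tokens},
    }


def generate(prompt: str) -> str:
    """POST the payload to a running Ollama server and return the text."""
    data = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because `raw` is set, `prompt` has to be the complete chat-formatted string, including the `<|im_start|>` and `<|im_end|>` markers described under Prompt format.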

llama.cpp

./llama-cli -m Slonik-7B-GRPO.Q4_K_M.gguf -p "<|im_start|>user
Schema:
CREATE TABLE orders (id INT, customer_id INT, total NUMERIC, order_date DATE);
CREATE TABLE customers (id INT, name TEXT, country TEXT);

Question: Total revenue by country in 2024, top 5.<|im_end|>
<|im_start|>assistant
" -n 200 --temp 0

LM Studio / Jan

Download Slonik-7B-GRPO.Q4_K_M.gguf, drop it into your models folder, and load it from your local runtime.

Prompt format

Uses the Qwen2.5 chat template (<|im_start|> / <|im_end|>):

<|im_start|>user
Schema:
<your CREATE TABLE statements here>

Question: <your question>

### Hint:
<optional clarifications about column meanings, date formats, join paths>
<|im_end|>
<|im_start|>assistant

The ### Hint: block is optional but helps a lot for non-obvious schemas. Example:

Schema:
CREATE TABLE orders (id INT, customer_id INT, total NUMERIC, order_date DATE);
CREATE TABLE customers (id INT, name TEXT, country TEXT);

Question: Total revenue by country in 2024, top 5.

### Hint:
Join orders.customer_id = customers.id. Revenue is the sum of orders.total.
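
A small helper can assemble this format so the markers and blank lines stay consistent. This is only a sketch: `build_prompt` is a hypothetical name, and it places `<|im_end|>` directly after the message body, as in the llama.cpp command above.

```python
from typing import Optional


def build_prompt(schema: str, question: str, hint: Optional[str] = None) -> str:
    """Assemble a single-turn prompt in the Qwen2.5 chat format."""
    body = f"Schema:\n{schema.strip()}\n\nQuestion: {question.strip()}"
    if hint:
        # The hint block is optional and goes after the question.
        body += f"\n\n### Hint:\n{hint.strip()}"
    # End with the open assistant turn so the model starts generating SQL.
    return f"<|im_start|>user\n{body}<|im_end|>\n<|im_start|>assistant\n"


if __name__ == "__main__":
    schema = (
        "CREATE TABLE orders (id INT, customer_id INT, total NUMERIC, order_date DATE);\n"
        "CREATE TABLE customers (id INT, name TEXT, country TEXT);"
    )
    print(build_prompt(
        schema,
        "Total revenue by country in 2024, top 5.",
        hint="Join orders.customer_id = customers.id. Revenue is the sum of orders.total.",
    ))
```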

Training

Two stages, both on a single RTX 5080 Laptop GPU (16 GB VRAM).

Stage 1 — QLoRA SFT (8h 13min)

Standard supervised fine-tuning on 21,847 text-to-SQL pairs:

BIRD train split — 6,601 examples (PostgreSQL/SQLite, expert-curated)
Spider — 8,034 examples (SQLite, classic benchmark)
Gretel synthetic text-to-SQL — 5,212 PostgreSQL examples (synthetic, broad coverage)
Custom PG-Modern synth — 2,000 examples generated via DeepSeek-V4, covering pgvector, JSONB, window functions, full-text search, CTEs, and array operations

LoRA rank 32, alpha 64, 4-bit NF4 base. LR 1e-5, cosine schedule, max_grad_norm 0.5, adamw_torch_fused (the 8-bit Adam variant caused NaN with bf16 on Blackwell). Final eval_loss 0.290.
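
For reference, here are the same settings written out as plain keyword dictionaries. The key names follow `peft.LoraConfig`, `transformers.BitsAndBytesConfig`, and `transformers.TrainingArguments`, but this is a sketch of how the values might be grouped, not the actual training script. `task_type` is an assumption, and target modules, batch size, and epoch count are not stated here, so they are left out.

```python
# Would be passed to peft.LoraConfig(**LORA_KWARGS).
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "task_type": "CAUSAL_LM",  # assumption: not stated in this card
}

# Would be passed to transformers.BitsAndBytesConfig(**QUANT_KWARGS).
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}

# Would be passed to transformers.TrainingArguments(**TRAINING_KWARGS)
# together with an output directory and batch settings.
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "lr_scheduler_type": "cosine",
    "max_grad_norm": 0.5,
    "optim": "adamw_torch_fused",  # chosen over the 8-bit Adam variant
    "bf16": True,
}

if __name__ == "__main__":
    # With standard LoRA scaling, updates are multiplied by alpha / r.
    print("LoRA scale:", LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"])
```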

Stage 2 — GRPO with execution rewards (16h)

Stage 2 uses GRPO (Group Relative Policy Optimization) with three weighted reward signals: execution match against the BIRD SQLite databases (weight 1.0), syntax validity via sqlglot (0.2), and code-fence formatting (0.1). Training ran for 2,000 steps with num_generations=2.
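
To make the reward shape concrete, here is a minimal sketch of how three such signals could be combined for one completion. It is not the training code: the weighted sum is an assumption, and to stay standard-library-only the syntax check asks SQLite to plan the query with `EXPLAIN` instead of using sqlglot (which also makes it stricter, since unknown tables or functions fail too). All function names are illustrative.

```python
import re
import sqlite3

FENCE = "`" * 3  # built this way so the example can sit inside a code fence
SQL_BLOCK = re.compile(FENCE + r"sql\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)

W_EXEC, W_SYNTAX, W_FORMAT = 1.0, 0.2, 0.1  # weights quoted in this card
SQL_ERRORS = (sqlite3.Error, sqlite3.Warning)


def extract_sql(completion):
    """Return (sql, fenced): the query text and whether it was in a sql fence."""
    match = SQL_BLOCK.search(completion)
    if match:
        return match.group(1).strip(), True
    return completion.strip(), False


def run(conn, sql):
    """Execute a query; return order-insensitive rows, or None on any SQL error."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except SQL_ERRORS:
        return None


def parses(conn, sql):
    """Stand-in for the sqlglot check: can SQLite prepare this statement?"""
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except SQL_ERRORS:
        return False


def reward(conn, completion, gold_sql):
    """Weighted sum of execution match, syntax validity, and fence formatting."""
    sql, fenced = extract_sql(completion)
    gold_rows = run(conn, gold_sql)
    pred_rows = run(conn, sql)
    exec_ok = pred_rows is not None and pred_rows == gold_rows
    return W_EXEC * exec_ok + W_SYNTAX * parses(conn, sql) + W_FORMAT * fenced
```

A fenced query that returns the gold rows collects all three components, a fenced but wrong query keeps only the syntax and format parts, and an unfenced query that errors scores zero. A real setup would also open the database read-only, since generated SQL is executed as-is.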

The total external cost was about $3 (DeepSeek API for the PG-Modern synthesis). Everything else ran locally.

What GRPO actually fixed

The biggest improvement was dialect awareness. SFT kept generating MONTH(date) — that's MySQL syntax and just fails on Postgres. GRPO learned EXTRACT(MONTH FROM date) from the executions that came back as errors.
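
The kind of failure signal involved is easy to reproduce. The sketch below uses an in-memory SQLite database as a stand-in, since a Postgres server is not assumed here; SQLite also has no `MONTH()` function, so the MySQL-style call errors out while the engine's native spelling works. The table and values are made up for illustration.

```python
import sqlite3


def try_query(conn, sql):
    """Return ('ok', rows) or ('error', message) for one query."""
    try:
        return "ok", conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return "error", str(exc)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INT, order_date DATE)")
conn.execute("INSERT INTO orders VALUES (1, '2024-03-15')")

# MySQL-style call: rejected, because the function does not exist here.
mysql_style = try_query(conn, "SELECT MONTH(order_date) FROM orders")
# SQLite's own spelling; on Postgres it would be EXTRACT(MONTH FROM order_date).
native_style = try_query(conn, "SELECT strftime('%m', order_date) FROM orders")

print(mysql_style)
print(native_style)
```

An error like the first result is unambiguous: the query never produces rows, so it cannot earn the execution reward.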

It also got better at date formats. SFT was guessing patterns like LIKE '%/%/87%' (assuming mm/dd/yy), which returned empty result sets. GRPO settled on LIKE '%1987%' after enough wrong-answer signals.
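
A toy reproduction of that date-format failure, using a made-up `events` table with ISO-style date strings (the table and rows are illustrative, not from BIRD):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INT, happened_on TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "1987-03-05"), (2, "1987-11-20"), (3, "1992-07-01")],
)


def count_matches(pattern):
    """Count rows whose date string matches a LIKE pattern."""
    row = conn.execute(
        "SELECT COUNT(*) FROM events WHERE happened_on LIKE ?", (pattern,)
    ).fetchone()
    return row[0]


print(count_matches("%/%/87%"))  # 0: the mm/dd/yy guess never matches ISO dates
print(count_matches("%1987%"))   # 2: the four-digit year substring does
```

Both queries run without error, so only comparing result rows against the gold answer can tell them apart.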

A smaller but interesting one: it learned when not to quote identifiers. SFT was over-quoting in cases where the DDL was unquoted, which broke case-sensitive matches.
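
The reason over-quoting breaks things on Postgres is its identifier folding rule: unquoted names are folded to lower case, while quoted names keep their case. A tiny model of that rule (the function names are illustrative):

```python
def pg_fold(identifier: str) -> str:
    """Name PostgreSQL looks up for an identifier as written in SQL."""
    if len(identifier) >= 2 and identifier[0] == '"' and identifier[-1] == '"':
        # Quoted: case is preserved; a doubled quote is an escaped quote.
        return identifier[1:-1].replace('""', '"')
    # Unquoted: folded to lower case.
    return identifier.lower()


def same_object(ddl_name: str, query_name: str) -> bool:
    """True if a name used in a query resolves to the name created by the DDL."""
    return pg_fold(ddl_name) == pg_fold(query_name)


if __name__ == "__main__":
    # DDL wrote CustomerID unquoted, so the column is stored as customerid.
    print(same_object("CustomerID", "customerid"))        # True
    print(same_object("CustomerID", '"CustomerID"'))      # False: over-quoted
    print(same_object('"CustomerID"', '"CustomerID"'))    # True
```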

Notes from training

A few things that helped more than I expected:

Execution feedback was much more useful than format-only rewards. The dialect-specific improvements above only happened because the model could see what failed against a real database.
PostgreSQL syntax errors gave the model a strong, unambiguous signal during GRPO.
The hardest remaining failures are still schema-grounding mistakes, especially on tables with many columns or ambiguous join paths. That's a 7B-size limitation more than anything else.

Limitations

This is not a general SQL assistant for every dialect — it's tuned around PostgreSQL and SQLite specifically. Behavior on MySQL or SQL Server isn't validated.

The 7B size still shows up on harder examples. Challenging-tier BIRD-PG accuracy is 23.5%, and schema grounding is imperfect on tables with 30+ columns, where most remaining errors are hallucinated column names.

GRPO occasionally over-quotes identifiers or adds unnecessary DISTINCT. I saw 6 such regressions across 500 BIRD-PG examples. The net gain was still positive, but this is one weakness of binary execution rewards.

Author

Phani

GitHub: slonik-7b
Full-precision weights: Phani-labs/Slonik-7B-GRPO
SFT-only baseline: Phani-labs/Slonik-7B-SFT


Base model: Qwen/Qwen2.5-Coder-7B (fine-tuned from Qwen/Qwen2.5-7B)