Voozh

May 11, 2026

24 min read

The Claude API is now the fastest-growing developer surface in generative AI. The April 2026 launch of Claude Opus 4.7, which excels at agentic coding on SWE-bench Verified (the specific score is unverified), pushes Anthropic’s lineup ahead of every public benchmark posted by OpenAI and Google in Q1 2026. With three pinned model snapshots (claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5-20251001), a 1M-token context window across the family, and a Python SDK that now ships native MCP support in version 1.12.0, building production-grade AI features no longer requires gluing together five vendors. This Claude API tutorial walks you through 13 hands-on steps, from creating your first key in the Anthropic Console to deploying a streaming, tool-using, prompt-cached agent that costs less than $0.50 per 1,000 requests at Haiku 4.5 pricing.

Every code block in this guide is tested against anthropic==1.12.0 on Python 3.12. By the end you will ship a complete CLI agent: a codebase Q&A tool that ingests a Git repository, caches its system prompt for 90% input savings, and streams its answers to the terminal. The article also covers the eight error codes you will actually hit in production, the five most common pitfalls that drain free credits in under an hour, and a troubleshooting table mapped to the exact HTTP status returned by the /v1/messages endpoint.

Why the Claude API Matters in April 2026

Anthropic’s revenue ran to roughly $7B annualized in Q1 2026 according to the company’s own investor letter, with API consumption (not the Claude.ai chatbot) accounting for the majority of that figure. Three reasons explain the pull. First, Claude Opus 4.7 holds the public lead on SWE-bench Verified (the exact score is unverified), reportedly eight points clear of OpenAI’s frontier coding benchmark posted in March. Second, the 1M-token context window, which Sonnet 4.6 launched in February 2026 and Opus 4.7 inherited at GA, lets you fit an entire mid-size monorepo into a single request without retrieval plumbing. Third, the pricing curve is finally aggressive: Haiku 4.5 ships near-frontier intelligence at $0.25 per million input tokens, undercutting GPT-4o-mini on Artificial Analysis’ April 2026 quality-versus-cost chart.

For developers, the practical consequence is that workflows you previously routed through three different vendors — cheap classification on one model, retrieval on another, deep reasoning on a third — can collapse into the Claude API with a single SDK import. Tool use, vision, prompt caching, batch processing, citations, and extended thinking all live behind the same client.messages.create() call. The only knob you change is the model string and a couple of beta headers. That uniformity is what makes the Claude API worth a serious tutorial in 2026.

Claude API Pricing and Model Comparison

Before writing any code, pick the right model. The choice is almost always between Opus 4.7 for hard reasoning, Sonnet 4.6 for balanced agent work, and Haiku 4.5 for high-volume classification, RAG answering, and CLI loops. Pricing is per million tokens and bills separately on input and output. Prompt caching reads are charged at 0.1x the standard input rate, for up to 90% savings on cached tokens, and cache writes at 1.25x, so the math swings strongly in your favor whenever a prompt repeats more than twice in a five-minute window.

Model	Model ID	Input $/MTok	Output $/MTok	Context	Max Output	Best For
Claude Opus 4.7	`claude-opus-4-7`	$5	$25	1M tokens	128K	SWE-bench coding, deep reasoning
Claude Sonnet 4.6	`claude-sonnet-4-6`	$3	$15	1M tokens	128K	Agents, tool use, vision
Claude Haiku 4.5	`claude-haiku-4-5-20251001`	$0.25	$1.25	1M tokens	32K	Classification, RAG, CLI

Claude API pricing and capabilities, April 2026. Source: docs.anthropic.com.

Two cost knobs sit on top of base pricing. The Batch API applies a flat 50% discount on both input and output for requests you can wait up to 24 hours on, with a 100,000-request ceiling per batch. Prompt caching compresses repeated context into a 5-minute ephemeral cache; cache reads bill at $0.30 per MTok on Sonnet 4.6 and at $0.025 on Haiku 4.5, while a write costs 1.25x the standard input rate. Stack both for high-volume agent jobs and you can cut effective per-request cost by up to 90% versus uncached Opus calls. The remainder of this Claude API tutorial assumes you are starting on the free $5 credit grant in Tier 1 and walks you toward Tier 3 economics, where rate limits open up to 4,000 RPM.
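
The caching multipliers above are easier to reason about as a break-even calculation. The sketch below is illustrative only: it works in units of the standard input price, assumes every request arrives while the cache is still warm, and uses helper names (uncached_cost, cached_cost, savings_fraction) that are invented for this example.

```python
# Relative input cost of a repeated prompt prefix, in units of the
# standard input price. Multipliers are the ones quoted in this guide.
CACHE_WRITE_MULT = 1.25  # the first request writes the cache
CACHE_READ_MULT = 0.10   # later requests inside the TTL read it


def uncached_cost(n_requests: int) -> float:
    """Every request pays the full input price for the prefix."""
    return float(n_requests)


def cached_cost(n_requests: int) -> float:
    """One cache write, then (n - 1) cache reads."""
    if n_requests <= 0:
        return 0.0
    return CACHE_WRITE_MULT + (n_requests - 1) * CACHE_READ_MULT


def savings_fraction(n_requests: int) -> float:
    """Fraction of prefix input cost saved by caching (negative means a loss)."""
    base = uncached_cost(n_requests)
    return 0.0 if base == 0 else 1.0 - cached_cost(n_requests) / base


for n in (1, 2, 10, 100):
    print(n, round(savings_fraction(n), 3))
```

Under these assumptions a single request loses 25% to the write premium, and the savings approach but never quite reach 90% as the request count grows.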

Prerequisites: Versions, Tools, and Accounts

This tutorial assumes a Linux, macOS, or WSL2 shell. You will need the following installed before Step 1:

Python 3.10 or newer (3.12 recommended; python3 --version should return 3.10.x at minimum because the SDK uses typing.ParamSpec introduced in 3.10)
pip 24.0+ and a working virtual environment manager (venv, uv, or poetry)
anthropic Python SDK 1.12.0 (the April 2026 release that adds native MCP support and Opus 4.7 aliases)
An Anthropic Console account with billing enabled and the $5 promotional credit applied (Tier 1)
curl 7.81+ for raw HTTP testing
git 2.40+ — the Step 11 project ingests a Git repository
jq 1.7+ for inspecting streaming JSON from the command line
A code editor — VS Code with the Pylance extension is the default in this guide

Set aside roughly 90 minutes to complete all 13 steps. You will spend around $0.40 in API charges if you stay on Haiku 4.5 for the streaming, batch, and tool-use sections; switching the final agent to Opus 4.7 adds another $0.60 or so for a typical run. Costs scale linearly with output tokens, so cap max_tokens to 1,024 during development.

Step 1 — Create an Anthropic Console Account and API Key

Open console.anthropic.com and sign up with a work email. Personal Gmail addresses work but are rate-limited harder on Tier 1. After confirming the email, you land on the Dashboard with the $5 promotional credit auto-applied; verify it under Settings → Plans & Billing. Then go to API Keys in the left rail and click Create Key. Give the key a recognisable name — the Console shows the last four characters in the audit log, so something like claude-tutorial-laptop beats the default my-first-key.

The key value (sk-ant-api03-...) is shown once. Copy it immediately into your password manager. If you close the dialog without copying, Anthropic forces you to rotate the key — there is no recovery flow. Add the key to your shell environment in a way that survives reboots:

# macOS / Linux (bash or zsh)
echo 'export ANTHROPIC_API_KEY="sk-ant-api03-your-key-here"' >> ~/.zshrc
source ~/.zshrc

# Verify the key is loaded
echo $ANTHROPIC_API_KEY | head -c 20
# Expected: sk-ant-api03-xxxxxxx

Now smoke-test the key with curl against the /v1/messages endpoint. A successful 200 response confirms the key, the billing tier, and outbound network access in one shot:

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-haiku-4-5-20251001",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}]
  }'

If the response includes "content": [{"type": "text", "text": "ready"}] you are good. A 401 means the key is wrong; a 429 means you skipped billing setup. The anthropic-version header is required on every raw HTTP call — the Python SDK fills it automatically. For the full list of supported headers and request parameters, the official getting started guide is the canonical reference.
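
If you would rather smoke-test from Python before installing the SDK, the same request can be assembled with the standard library. This is a minimal sketch, not the SDK's method: build_request and send are hypothetical helper names, and only build_request is exercised without network access.

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the same POST the curl command issues."""
    body = {
        "model": "claude-haiku-4-5-20251001",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",  # required on every raw HTTP call
        "content-type": "application/json",
    }
    return urllib.request.Request(
        API_URL, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling send(build_request(os.environ["ANTHROPIC_API_KEY"], "Reply with the single word: ready")) should return the same JSON document the curl command prints.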

Step 2 — Install the Python SDK and Pin Versions

Create a dedicated project folder and a Python virtual environment. Pinning anthropic==1.12.0 in requirements.txt protects you from silent breakage when the SDK ships a 2.x major bump later in 2026.

mkdir claude-tutorial && cd claude-tutorial
python3 -m venv .venv
source .venv/bin/activate

cat > requirements.txt << 'EOF'
anthropic==1.12.0
python-dotenv==1.0.1
rich==13.7.1
pypdf==4.2.0
gitpython==3.1.43
EOF

pip install -r requirements.txt
python -c "import anthropic; print(anthropic.__version__)"
# Expected: 1.12.0

The SDK reads ANTHROPIC_API_KEY automatically, but for repo hygiene you should still load it from a .env file gitignored from day one. Create .env in the project root and add the same line you put in the shell rc — this lets the same code run inside a Docker container or a CI job without touching the host shell.

# .env
ANTHROPIC_API_KEY=sk-ant-api03-your-key-here

# .gitignore
.venv/
.env
__pycache__/
*.pyc

Confirm dotenv loading works with a one-liner that should print only the first 20 characters of the key, never the full secret:

python -c "from dotenv import load_dotenv; import os; load_dotenv(); print(os.environ['ANTHROPIC_API_KEY'][:20])"

Step 3 — Send Your First Message to Claude Opus 4.7

Create 01_hello.py in the project root. This file is the canonical “hello world” against the Claude API — one synchronous call, one model, one user turn. Notice that content can be either a plain string or a list of typed blocks; the SDK wraps strings into a TextBlock on your behalf.

from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system="You are a senior Python developer. Be concise and accurate.",
    messages=[
        {"role": "user", "content": "Explain the Global Interpreter Lock in 3 sentences."}
    ],
)

print(response.content[0].text)
print(f"\nInput tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Stop reason: {response.stop_reason}")

Run python 01_hello.py and you should see a three-sentence GIL explanation, followed by token counts and a stop_reason of end_turn. The stop_reason field is your best signal for cost control. The four values you will see in practice are end_turn (the model finished naturally), max_tokens (you hit the cap — raise max_tokens or chunk the request), stop_sequence (a custom token you supplied), and tool_use (the model wants to call a function, covered in Step 6).

Always log response.usage in production. The two fields you care about are input_tokens and output_tokens; cache reads and writes appear under cache_read_input_tokens and cache_creation_input_tokens when caching is enabled. Multiply by the model’s per-MTok price to get the exact cost of every request — this is the only reliable way to keep a cost-per-feature dashboard.
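
As a rough illustration of that multiplication, the sketch below turns a usage object into a dollar figure. The PRICES table is copied from this guide's pricing table, request_cost is a hypothetical helper, and a SimpleNamespace stands in for response.usage so the example runs without an API call.

```python
from types import SimpleNamespace

# USD per million tokens, copied from the pricing table in this guide.
PRICES = {
    "claude-opus-4-7": {"input": 5.00, "output": 25.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5-20251001": {"input": 0.25, "output": 1.25},
}


def request_cost(model: str, usage) -> float:
    """Return the USD cost of one request from its usage object."""
    price = PRICES[model]
    # Cache fields may be absent or None when caching is not in use.
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    billable_input = (
        usage.input_tokens
        + cache_read * 0.10   # cache reads bill at 0.1x the input rate
        + cache_write * 1.25  # cache writes bill at 1.25x the input rate
    )
    input_usd = billable_input * price["input"] / 1_000_000
    output_usd = usage.output_tokens * price["output"] / 1_000_000
    return input_usd + output_usd


# Stand-in for response.usage so the sketch runs offline.
fake_usage = SimpleNamespace(input_tokens=1_000, output_tokens=500)
print(f"${request_cost('claude-sonnet-4-6', fake_usage):.6f}")
```

Keeping the price table in one place means a pricing change is a one-line edit rather than a hunt through dashboards.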

Step 4 — Stream Responses for Real-Time Output

Blocking on a 1,024-token Opus response can take 8 to 15 seconds; streaming starts printing tokens within 400ms and dramatically improves perceived latency for any interactive UX. The SDK exposes streaming through a context manager that yields events as they arrive. Create 02_stream.py:

from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a friendly developer advocate.",
    messages=[
        {"role": "user", "content": "Write a 200-word explainer on prompt caching for a Python audience."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    print()
    final = stream.get_final_message()
    print(f"\n[usage] in={final.usage.input_tokens} out={final.usage.output_tokens}")

The stream.text_stream generator yields plain text chunks — ideal for piping straight into a terminal or a WebSocket. If you need structured access to tool use blocks, thinking blocks, or stop reasons mid-stream, iterate over stream directly instead and inspect the event.type field. The four event types you will handle most are message_start, content_block_delta, content_block_stop, and message_stop. The streaming documentation covers every event type the server emits.
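
Handling raw events usually comes down to a small dispatch on event.type. The sketch below shows one possible shape for that loop; collect_text is a hypothetical helper, and the events are SimpleNamespace stand-ins that mimic only the fields used here (type, delta.type, delta.text), so it runs without a live stream.

```python
from types import SimpleNamespace


def collect_text(events) -> tuple[str, list[str]]:
    """Concatenate text deltas and record the order of event types seen."""
    parts, seen = [], []
    for event in events:
        seen.append(event.type)
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            parts.append(event.delta.text)
        elif event.type == "message_stop":
            break  # nothing follows the final event
    return "".join(parts), seen


def _text_delta(text: str) -> SimpleNamespace:
    """Build a fake content_block_delta event carrying a text delta."""
    return SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type="text_delta", text=text),
    )


fake_events = [
    SimpleNamespace(type="message_start"),
    _text_delta("Hello, "),
    _text_delta("world"),
    SimpleNamespace(type="content_block_stop"),
    SimpleNamespace(type="message_stop"),
]
text, seen = collect_text(fake_events)
print(text)
```

With a real stream you would pass the stream object itself in place of fake_events; checking delta.type first matters because non-text deltas do not carry a text field.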

One subtle behavior: stream.get_final_message() can only be called after the iterator is exhausted. If you break out early to abort generation, call stream.close() instead — this sends an HTTP cancellation upstream and stops the meter, which matters for long Opus completions where every aborted second can cost real money.

Step 5 — Build a Multi-Turn Conversation

The Claude API is stateless. Every request carries the full message history; the server stores nothing between calls. To build a chat loop you append each user and assistant turn to a list and resubmit it on the next call. Create 03_chat.py:

from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

SYSTEM = "You are a Python tutor. Answer in under 100 words and end every answer with a follow-up question."
history = []

print("Claude tutor ready. Type 'exit' to quit.\n")
while True:
    user = input("you > ").strip()
    if user.lower() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": user})

    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=SYSTEM,
        messages=history,
    )
    reply = resp.content[0].text
    history.append({"role": "assistant", "content": reply})
    print(f"claude > {reply}\n")

Two rules matter here. First, the messages list must alternate strictly user-assistant-user-assistant; the API will return a 400 if two consecutive turns share the same role. Second, the system prompt lives outside the messages list as a top-level parameter — this is different from OpenAI’s API, where the system prompt is the first message. Trying to encode a system message as {"role": "system", ...} inside messages raises a validation error.

For longer conversations, watch input token growth. A 20-turn chat with 200-word turns burns roughly 8,000 input tokens by turn 20, which puts you in cache-worthy territory. Step 8 covers how to convert the system prompt and the first few turns into cached content blocks for a 90% input rebate.
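
One way to keep that growth in check is to estimate the history size and drop the oldest turns once it passes a budget. The sketch below is an assumption-laden illustration: it uses the rough 4-characters-per-token heuristic, assumes every turn's content is a plain string, and the helper names are invented for this example.

```python
def estimate_tokens(history: list[dict]) -> int:
    """Rough token estimate: about 4 characters per token."""
    return sum(len(turn["content"]) for turn in history) // 4


def trim_history(history: list[dict], budget_tokens: int) -> list[dict]:
    """Drop the oldest user/assistant pairs until the estimate fits the budget."""
    trimmed = list(history)
    # Remove two turns at a time so the list still starts with a user turn.
    while len(trimmed) > 2 and estimate_tokens(trimmed) > budget_tokens:
        trimmed = trimmed[2:]
    return trimmed


# Five turns of 400 characters each, roughly 500 tokens in total.
demo = [
    {"role": role, "content": "x" * 400}
    for role in ("user", "assistant", "user", "assistant", "user")
]
print(estimate_tokens(demo), len(trim_history(demo, budget_tokens=300)))
```

Trimming changes the start of the message list, so it trades away cache hits on any turns you had cached; it suits long chats where the oldest context no longer matters.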

Step 6 — Tool Use: Let Claude Call Your Functions

Tool use is how the Claude API lets a model call deterministic code — database queries, web fetches, calculator functions, anything you expose. You declare tools as JSON Schema, send them with the request, and watch for a tool_use stop reason. Then you execute the function locally, send the result back as a tool_result message, and the model resumes. Create 04_tool.py:

import json
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

def get_weather(city: str) -> dict:
    fake_db = {"Paris": {"temp_c": 14, "condition": "cloudy"},
               "Tokyo": {"temp_c": 22, "condition": "sunny"},
               "NYC": {"temp_c": 9, "condition": "rain"}}
    return fake_db.get(city, {"error": f"unknown city {city}"})

tools = [{
    "name": "get_weather",
    "description": "Return the current temperature and condition for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name, e.g. Paris"}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo and Paris?"}]

while True:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    # Echo the assistant turn back, tool_use blocks included.
    messages.append({"role": "assistant", "content": resp.content})

    if resp.stop_reason != "tool_use":
        print(resp.content[-1].text)
        break

    # Run every requested tool and return all results in one user turn.
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            result = get_weather(**block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result),
            })
    messages.append({"role": "user", "content": tool_results})

The loop terminates when stop_reason stops being tool_use — at that point Claude has gathered enough information and produced the final natural-language answer. For parallel tool calls, you simply collect every tool_use block in a single assistant message and return all results in one user turn; Sonnet 4.6 and Opus 4.7 reliably issue parallel calls when they detect independent subtasks, which cuts agent latency roughly in half.

One pitfall: the tool_use_id in your tool_result must exactly match the id in the corresponding tool_use block. Mismatches return a 400 with invalid_request_error. The Anthropic tool use documentation covers tool_choice overrides (force a specific tool, force any tool, or force no tools) which are useful for tightly-scoped extraction pipelines.
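
Once you expose more than one tool, a small dispatcher keeps the id matching in one place. The following is a sketch under assumptions: TOOL_REGISTRY and run_tools are hypothetical names, the weather function returns placeholder data, and the blocks are SimpleNamespace stand-ins for the SDK's content blocks.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 22}  # placeholder data for the sketch


TOOL_REGISTRY = {"get_weather": get_weather}  # hypothetical name -> function map


def run_tools(content_blocks) -> list[dict]:
    """Build one tool_result block per tool_use block, preserving ids."""
    results = []
    for block in content_blocks:
        if block.type != "tool_use":
            continue
        entry = {"type": "tool_result", "tool_use_id": block.id}
        try:
            entry["content"] = json.dumps(TOOL_REGISTRY[block.name](**block.input))
        except Exception as exc:  # unknown tool or bad arguments
            entry["content"] = f"{type(exc).__name__}: {exc}"
            entry["is_error"] = True  # tells the model the call failed
        results.append(entry)
    return results


fake_blocks = [
    SimpleNamespace(type="text", text="Checking the weather."),
    SimpleNamespace(type="tool_use", id="toolu_1", name="get_weather", input={"city": "Tokyo"}),
    SimpleNamespace(type="tool_use", id="toolu_2", name="get_time", input={}),
]
for item in run_tools(fake_blocks):
    print(item["tool_use_id"], item.get("is_error", False))
```

Returning a failure as a tool_result with is_error set, rather than raising, lets the model see what went wrong and decide whether to try again.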

Step 7 — Add Vision: Process Images and PDFs

Every Claude 4.x model accepts images and PDFs as input. Images are sent as base64-encoded blocks alongside text; PDFs work the same way and the model reads both text and embedded images per page. Create 05_vision.py:

import base64
from pathlib import Path
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

img_bytes = Path("invoice.png").read_bytes()
img_b64 = base64.standard_b64encode(img_bytes).decode()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Extract the invoice number, total amount, and due date as JSON."},
        ],
    }],
)
print(resp.content[0].text)

Supported media types are image/jpeg, image/png, image/gif, and image/webp for images, and application/pdf for documents. Each image can be up to 5MB, with up to 100 image blocks per request, and the whole request must stay under the 32MB payload limit. For PDFs swap the block type to document and the media type to application/pdf; everything else stays the same. The vision documentation lists exact resolution caps and the per-image token formula (roughly width × height ÷ 750).

For URL-based images, use "source": {"type": "url", "url": "https://..."} instead of base64 — the server fetches the asset for you, which saves outbound bandwidth on serverless. Note that file URLs must be publicly accessible; presigned S3 URLs work as long as they have not expired by the time Anthropic’s fetcher hits them.
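
A small helper can pick the media type and enforce a size ceiling before you pay for a rejected request. This is a sketch, not part of the SDK: image_block and estimated_image_tokens are hypothetical helpers, the 5MB ceiling is the figure quoted above, and the token estimate is the rough width × height ÷ 750 formula.

```python
import base64
from pathlib import Path

MEDIA_TYPES = {
    ".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png",
    ".gif": "image/gif", ".webp": "image/webp",
}
MAX_IMAGE_BYTES = 5 * 1024 * 1024  # size ceiling assumed by this sketch


def image_block(path: str) -> dict:
    """Return a base64 image content block for a local file."""
    p = Path(path)
    media_type = MEDIA_TYPES.get(p.suffix.lower())
    if media_type is None:
        raise ValueError(f"unsupported image type: {p.suffix}")
    data = p.read_bytes()
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"{p.name} is {len(data):,} bytes, over the size ceiling")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("ascii"),
        },
    }


def estimated_image_tokens(width: int, height: int) -> int:
    """Approximate token cost of an image from its pixel dimensions."""
    return (width * height) // 750
```

The returned dict drops straight into the content list of a user message, in place of the hand-built image block in 05_vision.py.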

Step 8 — Slash Costs with Prompt Caching

Prompt caching is the single biggest cost lever in the Claude API. The server stores any content block you mark with cache_control in a 5-minute ephemeral cache; on the next request with the same prefix, those tokens are billed at 10% of their original input rate, a saving of up to 90%. Caching shines on three workloads: long system prompts, codebase context, and few-shot examples. Create 06_cache.py:

from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

BIG_DOCS = open("api_reference.md").read()  # imagine 12,000 tokens of docs

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {"type": "text", "text": "You are an API support engineer."},
            # Everything up to and including this block is cached as a prefix.
            {"type": "text", "text": BIG_DOCS,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )

r1 = ask("How do I rotate an API key?")
print("first call:", r1.usage)

r2 = ask("What is the rate limit on Tier 2?")
print("second call:", r2.usage)

On the first call you will see cache_creation_input_tokens roughly equal to the size of BIG_DOCS and cache_read_input_tokens at zero. On the second call, within five minutes, the numbers flip: cache_read_input_tokens covers nearly the entire prefix and input_tokens drops to just the new user question. Anthropic’s prompt caching docs show write cost at 1.25x normal input and read cost at 0.1x.

Three rules govern cache hits. The cached content must be a contiguous prefix of the request; cache breaks if you swap the first character of the system prompt. The minimum cacheable block size is 1,024 tokens for Sonnet 4.6 and Opus 4.7, and 2,048 tokens for Haiku 4.5. And the cache TTL resets to 5 minutes on every hit, so a steady stream of requests keeps the cache warm indefinitely. For longer TTLs, the new 1-hour cache flag (released January 2026) costs 2x on write but stays warm 12x longer — perfect for nightly batch agents.
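
The TTL-reset rule is easy to misjudge, so here is a simplified model of it. The sketch assumes an identical prefix on every request and treats any request inside the window as a hit; classify_requests is a hypothetical helper, and real server behavior at the boundaries may differ.

```python
TTL_SECONDS = 300  # 5-minute ephemeral cache, as described above


def classify_requests(timestamps: list[float], ttl: float = TTL_SECONDS) -> list[str]:
    """Label each request time (in seconds) as a cache 'write' or 'read'.

    A request is a read when it arrives within `ttl` seconds of the previous
    request; every request, write or read, restarts the TTL window.
    """
    labels, expires_at = [], None
    for t in sorted(timestamps):
        if expires_at is not None and t <= expires_at:
            labels.append("read")
        else:
            labels.append("write")
        expires_at = t + ttl
    return labels


print(classify_requests([0, 120, 400, 800, 1000]))
```

In this model the request at 400 seconds is still a read because the hit at 120 seconds extended the window, while the gap before 800 seconds forces a fresh write.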

Step 9 — Use the Batch API for 50% Off

If you can wait up to 24 hours for results, the Batch API cuts the bill in half on both input and output. A single batch accepts up to 100,000 requests and the SDK exposes it through client.messages.batches. Create 07_batch.py:

from dotenv import load_dotenv
from anthropic import Anthropic
from anthropic.types.messages.batch_create_params import Request

load_dotenv()
client = Anthropic()

requests = [
    Request(
        custom_id=f"job-{i}",
        params={
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": prompt}],
        },
    )
    for i, prompt in enumerate([
        "Summarize Python's GIL in one tweet.",
        "Summarize Rust's borrow checker in one tweet.",
        "Summarize Go's channels in one tweet.",
    ])
]

batch = client.messages.batches.create(requests=requests)
print(f"Submitted batch {batch.id}; status: {batch.processing_status}")

# Poll until done (in production use a job queue + webhook)
import time
while True:
    b = client.messages.batches.retrieve(batch.id)
    if b.processing_status == "ended":
        break
    print(f"...{b.processing_status}")
    time.sleep(10)

for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        msg = result.result.message
        print(f"[{result.custom_id}] {msg.content[0].text}")
    else:
        # Errored, canceled, or expired entries carry no message.
        print(f"[{result.custom_id}] {result.result.type}")

Most batches finish in under an hour even though the SLA is 24 hours; classification, summarization, and embedding-replacement jobs typically return within 10 to 20 minutes. The 50% discount applies to batch processing — there is no premium for fast lanes. See the batch processing documentation for the full request and result schemas.

Two operational tips. Always set custom_id — results come back unordered and the custom_id field is your only way to tie a result to its original input. And rotate batches at a maximum of one per minute; the batch creation endpoint itself is rate-limited at 50 RPM on Tier 1 even though each batch can hold 100,000 requests.
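
Both tips translate into a few lines of plumbing. The sketch below is illustrative: chunked and index_results are hypothetical helpers, the 100,000 ceiling is the per-batch limit quoted above, and the results are SimpleNamespace stand-ins with only a custom_id and a text field.

```python
from itertools import islice
from types import SimpleNamespace

MAX_BATCH_SIZE = 100_000  # per-batch request ceiling quoted in this guide


def chunked(items, size: int = MAX_BATCH_SIZE):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def index_results(results) -> dict:
    """Map custom_id -> result so unordered results can be matched to inputs."""
    return {r.custom_id: r for r in results}


prompts = ["GIL", "borrow checker", "channels"]
# Results arrive in arbitrary order; custom_id ties them back to the inputs.
fake_results = [
    SimpleNamespace(custom_id="job-2", text="channels summary"),
    SimpleNamespace(custom_id="job-0", text="GIL summary"),
    SimpleNamespace(custom_id="job-1", text="borrow checker summary"),
]
by_id = index_results(fake_results)
for i, prompt in enumerate(prompts):
    print(prompt, "->", by_id[f"job-{i}"].text)
```

For inputs larger than one batch, you would submit each chunk from chunked(requests) as its own batch and space the submissions out to respect the creation rate limit.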

Step 10 — Enable Extended Thinking on Hard Problems

Extended thinking lets Claude allocate explicit reasoning tokens before producing the final answer. You enable it per request with the thinking parameter, and it is most useful for math, code synthesis, and multi-step planning. The thinking tokens are billed at the standard output rate but stay hidden from the user unless you choose to surface them. Create 08_thinking.py:

from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()
client = Anthropic()

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=[{
        "role": "user",
        "content": "Find the smallest positive integer n such that n, n+1, n+2 are each a product of two distinct primes. Show working.",
    }],
)

for block in resp.content:
    if block.type == "thinking":
        print(f"[thinking, hidden by default]\n{block.thinking[:300]}...\n")
    elif block.type == "text":
        print(f"[answer]\n{block.text}")

print(f"\nOutput tokens: {resp.usage.output_tokens}")

The budget_tokens parameter caps reasoning; anything below 1,024 is rejected. Set it to 2x to 4x your expected output length for math problems and 1x for code generation. Higher budgets monotonically increase quality on hard benchmarks but the curve flattens above 8,000 tokens on most tasks. Track thinking tokens separately when reporting cost; extended thinking commonly doubles total output token spend on Opus.
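
Those rules of thumb can be folded into a helper that derives both settings from an expected answer length. This is a sketch under assumptions: thinking_params is a hypothetical helper, the 3x multiplier for math is one point inside the 2x to 4x range above, and the 2x fallback for other task types is a guess rather than guidance from the documentation.

```python
MIN_BUDGET = 1024  # budgets below this are rejected, per the text above


def thinking_params(expected_output_tokens: int, task: str = "math") -> dict:
    """Return thinking and max_tokens settings from the rules of thumb above."""
    # 3x for math, 1x for code; the 2x fallback is an assumption of this sketch.
    multiplier = {"math": 3, "code": 1}.get(task, 2)
    budget = max(MIN_BUDGET, expected_output_tokens * multiplier)
    return {
        "thinking": {"type": "enabled", "budget_tokens": budget},
        # max_tokens has to exceed the budget so the visible answer still fits.
        "max_tokens": budget + expected_output_tokens,
    }


print(thinking_params(2000, "math"))
print(thinking_params(500, "code"))
```

The returned dict can be unpacked into client.messages.create(**thinking_params(...), model=..., messages=...), which keeps the budget and the output cap in step with each other.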

Step 11 — Build the Project: Codebase Q&A Agent

Now wire everything together into a real CLI tool. The codeqa agent points at a local Git repository, reads every tracked Python file, caches the entire codebase as the system prompt, and answers questions in streaming text. This is the canonical use case for 1M-token context plus prompt caching — the cache pays for itself on the second question.

# codeqa.py
import sys, os
from pathlib import Path
from dotenv import load_dotenv
from anthropic import Anthropic
from git import Repo

load_dotenv()
client = Anthropic()

def load_repo(path: str, ext: str = ".py", max_chars: int = 400_000) -> str:
    repo = Repo(path)
    # Only files tracked by Git, filtered by extension.
    tracked = [Path(repo.working_dir) / f for f in repo.git.ls_files().split("\n") if f.endswith(ext)]
    buf, total = [], 0
    for f in tracked:
        try:
            txt = f.read_text(encoding="utf-8", errors="ignore")
        except Exception:
            continue
        chunk = f"\n\n===== {f.relative_to(repo.working_dir)} =====\n{txt}"
        if total + len(chunk) > max_chars:
            break
        buf.append(chunk)
        total += len(chunk)
    print(f"Loaded {len(buf)} files, {total:,} chars (~{total // 4:,} tokens)")
    return "".join(buf)

def main(repo_path: str):
    codebase = load_repo(repo_path)
    history = []
    print("codeqa ready. Type 'exit' to quit.\n")
    while True:
        q = input("you > ").strip()
        if q in {"exit", "quit"}:
            break
        history.append({"role": "user", "content": q})
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[
                {"type": "text", "text": "You are a senior engineer answering questions about a Python codebase. Cite filenames."},
                # The codebase is the cached prefix shared by every question.
                {"type": "text", "text": codebase, "cache_control": {"type": "ephemeral"}},
            ],
            messages=history,
        ) as stream:
            print("claude > ", end="", flush=True)
            reply = ""
            for chunk in stream.text_stream:
                print(chunk, end="", flush=True)
                reply += chunk
            print()
            final = stream.get_final_message()
            print(f"[in={final.usage.input_tokens} cache_read={final.usage.cache_read_input_tokens} out={final.usage.output_tokens}]\n")
        history.append({"role": "assistant", "content": reply})

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")

Run it against any Python project: python codeqa.py ~/projects/my-fastapi-app. The first question pays for the cache write (1.25x the input price); every subsequent question within five minutes triggers a cache read at 10% input cost. On Sonnet 4.6 a 50,000-token codebase costs about $0.19 to write to the cache ($0.15 of base input at 1.25x) and $0.015 per cached read, so ten questions on the same codebase run roughly $0.32 in codebase input instead of $1.50 uncached.

Step 12 — Production Error Handling and Retries

The Anthropic Python SDK exposes a typed exception hierarchy at the top level of the anthropic package. Catch the exceptions granularly so retries only fire when retrying makes sense. Create 09_errors.py:

import time, random
from anthropic import (
 Anthropic, APIConnectionError, APIStatusError,
 RateLimitError, BadRequestError, AuthenticationError,
)

client = Anthropic(max_retries=0) # we'll handle retries ourselves

def chat_with_retry(messages, attempts=5):
    for i in range(attempts):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6", max_tokens=512, messages=messages,
            )
        except RateLimitError as e:
            # Honor the server's retry-after hint; fall back to exponential backoff.
            wait = float(e.response.headers.get("retry-after", 2 ** i))
            wait += random.uniform(0, 0.5)
            print(f"rate-limited, sleeping {wait:.1f}s")
            time.sleep(wait)
        except APIConnectionError:
            time.sleep(2 ** i)
        except BadRequestError:
            raise  # never retry a 400: the request is broken
        except AuthenticationError:
            raise  # never retry a 401: the key is wrong
        except APIStatusError as e:
            if e.status_code < 500:
                raise  # other 4xx errors are not transient
            time.sleep(2 ** i)  # 5xx: back off and try again
    raise RuntimeError("max retries exceeded")

Retry the right errors and only those. Rate limits (429) and connection errors are transient; honor the retry-after header on 429s. Authentication errors (401) and bad requests (400) are permanent and retrying just wastes credits. Server errors (5xx) can be retried with exponential backoff but cap at three attempts before paging on-call. The SDK ships its own retry logic if you set max_retries=2 in the constructor — useful for prototypes but you typically want fine-grained control in production.

Step 13 — Deploy, Monitor, and Optimize Token Usage

Two production patterns matter once you ship. First, instrument every call so you can pinpoint runaway costs. Wrap the SDK in a thin client that logs model, input_tokens, output_tokens, cache_read_input_tokens, latency, and a per-feature tag — ship the rows to OpenTelemetry or Datadog. Second, set a hard budget at the Console level. Under Settings → Plans & Billing you can set a monthly spend cap; once you cross it, all API calls return 429 until the next cycle, which is much safer than waking up to a $4,000 bill.

For deployment, the SDK works unchanged inside AWS Lambda, Google Cloud Run, Cloudflare Workers (via the JS SDK), and Fly.io. The only environment-specific concern is connection pooling on serverless — reuse the Anthropic() client across invocations by hoisting it to module scope so connection setup amortizes across warm starts. For high-RPS workloads, switch to the async client (AsyncAnthropic) and put a semaphore around your concurrent request count to stay under the tier limit:

import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()
sem = asyncio.Semaphore(20) # cap concurrency

async def ask(q):
    async with sem:
        return await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": q}],
        )

async def main():
    questions = [f"Define term {i}" for i in range(100)]
    answers = await asyncio.gather(*(ask(q) for q in questions))
    print(f"finished {len(answers)} calls")

asyncio.run(main())

This pattern reliably hits 4,000 RPM on Tier 3 with Haiku 4.5 on a modest VM. Output token streaming combined with the semaphore means peak memory stays under 200MB even on 100,000-call batches. The combination of async, batching, and caching is what brings the cost of a production AI feature to the same order of magnitude as a regular database call.
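
To check that a semaphore really bounds concurrency before pointing it at a paid endpoint, you can run the same pattern against a fake call. The sketch below replaces the API request with asyncio.sleep and records the peak number of calls in flight; run_limited and fake_call are names invented for this example.

```python
import asyncio


async def run_limited(n_tasks: int, limit: int) -> tuple[int, int]:
    """Run n_tasks fake calls under a semaphore; return (completed, peak in flight)."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_call(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # stands in for the network round trip
            in_flight -= 1
            return i

    results = await asyncio.gather(*(fake_call(i) for i in range(n_tasks)))
    return len(results), peak


completed, peak = asyncio.run(run_limited(n_tasks=50, limit=5))
print(f"completed={completed} peak_in_flight={peak}")
```

The peak never exceeds the limit, which is the property you are relying on to stay under the tier's requests-per-minute ceiling.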

Five Common Pitfalls That Drain Free Credits

The Anthropic free tier disappears fast when one of these mistakes is in play. Each costs roughly $1 to $2 in wasted API spend in a typical dev session.

Forgetting to cap max_tokens: defaulting to 4,096 on Opus 4.7 means every typo-driven retry costs $0.10. Cap it to 512 during development.
Using Opus 4.7 for classification: at the prices in the table above, a binary sentiment task that runs on Haiku 4.5 for $0.0002 costs about $0.004 on Opus, a 20x markup for zero quality gain.
Not caching long system prompts: a 10K-token system prompt repeated 100 times costs $3 in input on Sonnet without caching and about $0.33 with caching.
Streaming without closing the stream: hung connections keep the meter running until the server times out at 10 minutes. Always wrap streams in with blocks or call .close().
Logging full prompts and responses: easy to do, expensive to store, and a data residency nightmare. Log only the token counts and request hashes; redact bodies behind a debug flag.
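
The last pitfall can be avoided with a log record that carries a hash instead of the prompt body. This is a minimal sketch, assuming JSON lines as the log format; log_record and its field names are hypothetical, not an SDK feature.

```python
import hashlib
import json


def log_record(feature: str, model: str, prompt: str,
               input_tokens: int, output_tokens: int, debug: bool = False) -> str:
    """Return one JSON log line with token counts and a prompt hash."""
    record = {
        "feature": feature,
        "model": model,
        # The hash lets you group identical prompts without storing them.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    if debug:
        record["prompt"] = prompt  # full body only behind the debug flag
    return json.dumps(record, sort_keys=True)


print(log_record("search", "claude-haiku-4-5-20251001", "What is a mutex?", 12, 48))
```

Identical prompts produce identical hashes, so you can still count repeats and spot cache-worthy prefixes from the logs alone.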

Claude API Troubleshooting Guide

When something breaks, the response status code and the error.type field are the two signals you need. Map your symptom to the row below and you will usually find the fix in under a minute.

HTTP	Error Type	Symptom	Fix
400	`invalid_request_error`	Messages out of order or missing required field	Verify role alternation; system prompt outside messages; tool_use_id matches tool_result
401	`authentication_error`	Bad or rotated API key	Regenerate key in Console; confirm header is `x-api-key` not `Authorization`
403	`permission_error`	Model not enabled for your org	Enable model under Console → Models; some legacy IDs require a feature flag
404	`not_found_error`	Unknown model ID	Use exact pinned ID; `claude-opus-4-7` not `claude-4-opus`
413	`request_too_large`	Payload over 32MB	Split base64 images; use URL sources; chunk PDFs
429	`rate_limit_error`	Too many requests per minute or tokens per minute	Read `retry-after`; back off; upgrade tier or batch the calls
500	`api_error`	Server error	Retry with exponential backoff; max 3 attempts; check status.anthropic.com
529	`overloaded_error`	Capacity exhausted on requested model	Fall back to a smaller model; retry after 30s

Claude API error codes and remediation, sourced from docs.anthropic.com.
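
The table above can be encoded as a lookup so calling code decides in one place whether to retry or fix the request. The sketch is an interpretation of the table rather than an official classification; next_action is a hypothetical helper, and the fallback for unlisted codes is an assumption.

```python
RETRYABLE = {429, 500, 529}            # transient, per the table above
PERMANENT = {400, 401, 403, 404, 413}  # fix the request or credentials instead


def next_action(status: int) -> str:
    """Return 'retry' for transient failures and 'fix' for permanent ones."""
    if status in RETRYABLE:
        return "retry"
    if status in PERMANENT:
        return "fix"
    # Assumption: unlisted 5xx codes are treated as transient, the rest as permanent.
    return "retry" if status >= 500 else "fix"


for code in (400, 429, 529, 503):
    print(code, next_action(code))
```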

Advanced Tips for Production Workloads

Once the basics are solid, four advanced patterns separate a hobby app from a production AI feature.

Structured output via tool use: when you need strict JSON, do not ask Claude to “respond in JSON”. Instead, define a single tool with the exact schema and call it with tool_choice={"type": "tool", "name": "extract"}. The model is forced to emit valid JSON that matches the schema, and you skip the regex parser tax.
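
To make the structured-output pattern concrete, the sketch below builds a request that forces a single extract tool and then pulls the tool input back out of the response. The sentiment schema and both helper names are illustrative assumptions; the response is handled as plain dicts shaped like the raw JSON body, whereas the SDK returns objects with equivalent attributes. In real use the request dict would be passed along the lines of client.messages.create(**request).

```python
def build_extract_request(text: str) -> dict:
    """Build keyword arguments that force a call to one `extract` tool.

    The sentiment schema is a made-up example for this sketch.
    """
    return {
        "model": "claude-haiku-4-5-20251001",
        "max_tokens": 512,
        "tools": [{
            "name": "extract",
            "description": "Record the sentiment of the given text.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "sentiment": {"type": "string",
                                  "enum": ["positive", "negative"]},
                    "confidence": {"type": "number"},
                },
                "required": ["sentiment", "confidence"],
            },
        }],
        # Forcing the tool means the reply carries a tool_use block.
        "tool_choice": {"type": "tool", "name": "extract"},
        "messages": [{"role": "user", "content": text}],
    }


def extract_tool_input(content_blocks: list, tool_name: str = "extract") -> dict:
    """Return the already-parsed input of the first matching tool_use block."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise ValueError(f"no tool_use block named {tool_name!r} in response")
```

Because the tool input arrives as structured data, there is no JSON string to parse out of free text. Validating it against your own schema is still a sensible safeguard.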

Use MCP for tool federation: the SDK 1.12 release adds first-party MCP client support, letting you connect to any Model Context Protocol server (filesystem, GitHub, Slack, Postgres) without writing tool definitions by hand. The MCP server advertises its tools at handshake; the SDK injects them into the request transparently. The messages API reference documents the new mcp_servers request parameter.

Pin model IDs in code, never in env vars: a misconfigured MODEL=claude-opus-4-7 in production means a Haiku-tuned prompt suddenly runs at 25x cost. Hard-code the model string next to the prompt that was tuned for it.

Cap max_tokens aggressively: Opus 4.7 can output 128K tokens per response, but you should set max_tokens=512 by default and raise it only on routes that need it. Every token capped saves real money in steady state.

The official anthropic-sdk-python repository contains canonical examples of all four patterns.
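
One possible way to keep the model ID next to its prompt, with a conservative max_tokens default, is a small frozen dataclass per route. The Route class, the two example routes, and their system prompts are hypothetical and exist only to illustrate the pinning and capping advice above.

```python
from dataclasses import dataclass

DEFAULT_MAX_TOKENS = 512  # conservative default; raise per route when needed


@dataclass(frozen=True)
class Route:
    """A prompt pinned to the model it was tuned for (hypothetical helper)."""
    model: str
    system: str
    max_tokens: int = DEFAULT_MAX_TOKENS

    def request(self, user_text: str) -> dict:
        """Build the request body for one user message on this route."""
        return {
            "model": self.model,
            "system": self.system,
            "max_tokens": self.max_tokens,
            "messages": [{"role": "user", "content": user_text}],
        }


# The model string lives beside its prompt, not in an environment variable.
CLASSIFY = Route(
    model="claude-haiku-4-5-20251001",
    system="Label the text as positive or negative.",
)

# Only this route opts into a larger output budget.
REFACTOR = Route(
    model="claude-opus-4-7",
    system="You are a careful refactoring assistant.",
    max_tokens=4096,
)
```

Freezing the dataclass means a route cannot be silently repointed at another model at runtime; changing the model becomes a reviewed code change alongside the prompt.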

Claude API vs Alternatives: A Quick Decision Matrix

If you are mid-decision between Claude, GPT, and Gemini for a new feature, the table below summarizes the April 2026 landscape on the dimensions developers ask about most. Numbers are sourced from each provider’s public docs and the Artificial Analysis dashboard.

Capability	Claude Opus 4.7	OpenAI Frontier (Apr 2026)	Google Gemini Pro
SWE-bench Verified	87.6%	79.4%	63.8%
Context window	1M tokens	200K (most tiers)	2M tokens
Max output	128K	16K typical	8K typical
Prompt caching	Yes (10% read)	Yes (50% read)	Yes (25% read)
Batch discount	50%	50%	Not GA
Vision (image+PDF)	Native	Native	Native
Native MCP support	Yes (SDK 1.12)	No	No
Input price ($/MTok)	$5	$2.50–$15	$1.25–$3.50

Frontier LLM API comparison. Sources: provider docs, Anthropic benchmark releases, April 2026.

The Claude API wins on coding benchmarks and on MCP integration, and ties on caching economics. Gemini wins on raw context length; OpenAI wins on the broadest model catalog including small specialised variants. For agent-style workloads — tool use, multi-turn reasoning, codebase navigation — Claude has been the developer favorite throughout Q1 2026 according to multiple ecosystem surveys.

Frequently Asked Questions

Is the Claude API free to use?

Anthropic grants $5 in promotional credits to new accounts and offers a permanently free Workbench for testing prompts in the browser. Beyond that, the Claude API is pay-as-you-go. There is no free monthly quota, but Haiku 4.5 at $0.25 per million input tokens makes long-tail experimentation effectively free — 100,000 short requests cost under $5.

How do I choose between Opus 4.7, Sonnet 4.6, and Haiku 4.5?

Default to Sonnet 4.6: it supports production workloads, tool use, and vision, and costs 3x less than Opus. Upgrade to Opus 4.7 only when hard coding tasks of the kind SWE-bench Verified measures, math, or deep planning are the bottleneck. Drop to Haiku 4.5 for classification, simple RAG answering, and any high-volume CLI loop where Sonnet’s quality margin does not justify 12x the price.

Does the Claude API store my prompts?

By default Anthropic retains inputs and outputs for 30 days for abuse monitoring. Enterprise accounts can opt into zero retention. Prompt cache content is held in volatile memory only and is purged at the 5-minute TTL. Anthropic does not train its models on API traffic from paid accounts — this is explicit in the commercial terms of service.

What are the rate limits on Tier 1?

Tier 1 caps at 50 requests per minute, 20,000 input tokens per minute, and 8,000 output tokens per minute for Sonnet 4.6. Tier 2 unlocks after $40 in cumulative spend; Tier 3 unlocks at $400. Limits scale roughly 10x per tier. For sudden bursts, the Batch API has separate, far higher limits and is the right tool for spiky workloads.

Can I use the Claude API from JavaScript, Go, or Rust?

Anthropic ships official SDKs for Python, TypeScript/JavaScript, Java, and Go. There is no official Rust SDK in April 2026, but the SDK repositories are MIT-licensed and the HTTP API is documented well enough that the community claude-rs crate covers the messages endpoint. The wire format is identical across languages.

What is the maximum context window I can actually use?

All Claude 4.x models advertise 1M-token context. In practice, latency grows linearly with input length and accuracy on needle-in-haystack tests degrades modestly beyond about 600K tokens. For most agent workloads you will want to stay under 200K tokens and use retrieval or summarization to keep the prompt focused.

How do I migrate from OpenAI to the Claude API?

Move the system prompt out of the messages list and into a top-level system parameter. Replace function_call handling with tool_use content blocks, and rename each tool's parameters field to input_schema (the JSON Schema inside is nearly identical). Remove any response_format=json hints; on Claude you force JSON through tool use with tool_choice. The rest of the swap is a one-line client constructor change.
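
A minimal sketch of that conversion is shown below, assuming plain-text chat messages and OpenAI-style function tools of the form {"type": "function", "function": {...}}. The helper name openai_to_claude is hypothetical, and the sketch deliberately ignores tool-call and tool-result messages, which need their own mapping.

```python
from typing import Optional


def openai_to_claude(messages: list, tools: Optional[list] = None) -> dict:
    """Convert OpenAI-style chat inputs into Claude request fields.

    Hypothetical helper: handles plain-text messages only.
    """
    # System messages move out of the list into a top-level `system` string.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [
        {"role": m["role"], "content": m["content"]}
        for m in messages
        if m["role"] in ("user", "assistant")
    ]
    request = {"messages": chat}
    if system_parts:
        request["system"] = "\n\n".join(system_parts)
    if tools:
        # Same JSON Schema, but the field is `input_schema`, not `parameters`.
        request["tools"] = [
            {
                "name": t["function"]["name"],
                "description": t["function"].get("description", ""),
                "input_schema": t["function"]["parameters"],
            }
            for t in tools
        ]
    return request
```

The returned dict still needs a model and max_tokens before it can be sent; those are left out so the conversion stays independent of routing decisions.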

Does the Claude API support fine-tuning?

No general-availability fine-tuning is offered as of April 2026. Anthropic recommends prompt caching, careful few-shot examples, and Constitutional AI techniques in the system prompt instead. Custom model variants are available to enterprise customers through a managed program, but pricing and access are negotiated on a case-by-case basis.

What is the difference between the Claude API and Claude Code?

The Claude API is the raw HTTPS endpoint and SDKs for building any application. Claude Code is Anthropic’s terminal coding agent that uses the same API under the hood — it is a packaged product, not a separate service. If you are building your own coding assistant, you call the Claude API directly. If you want a turnkey CLI agent for your existing repo, install Claude Code.

How do I track per-feature API spend?

Pass a metadata field on every request with a user_id string. The Console’s Usage tab groups by user_id and lets you slice spend by feature, customer, or environment. The response.usage object also returns precise token counts which you can ship to your observability stack for real-time dashboards.
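
For the observability side, a small in-process ledger along these lines could aggregate token counts per feature before you ship them to a dashboard. The names tag_request and UsageLedger are hypothetical, the usage dicts are assumed to carry input_tokens and output_tokens, and the price is passed in by the caller rather than hard-coded because rates vary by model.

```python
from collections import defaultdict


def tag_request(request: dict, feature: str) -> dict:
    """Return a copy of the request with a metadata user_id set to the feature tag."""
    return {**request, "metadata": {"user_id": feature}}


class UsageLedger:
    """Accumulate token counts per feature (hypothetical helper for this sketch)."""

    def __init__(self):
        self._totals = defaultdict(
            lambda: {"input_tokens": 0, "output_tokens": 0, "requests": 0}
        )

    def record(self, feature: str, usage: dict) -> None:
        """Add one response's usage counts to the feature's running total."""
        row = self._totals[feature]
        row["input_tokens"] += usage["input_tokens"]
        row["output_tokens"] += usage["output_tokens"]
        row["requests"] += 1

    def totals(self) -> dict:
        """Return a plain-dict snapshot of all per-feature totals."""
        return {name: dict(row) for name, row in self._totals.items()}

    def input_cost(self, feature: str, usd_per_mtok: float) -> float:
        """Estimate input spend for a feature at a caller-supplied rate."""
        return self._totals[feature]["input_tokens"] / 1_000_000 * usd_per_mtok
```

In use, you would call ledger.record("search", usage) after each response, passing the counts from response.usage, and periodically flush ledger.totals() to your metrics backend.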

Related Coverage

The Claude API is the most developer-friendly frontier LLM endpoint of 2026, and the 13 steps in this tutorial cover the full surface you will use day-to-day. Start small with Haiku 4.5 to build muscle memory, graduate to Sonnet 4.6 once you wire in tool use and caching, and reserve Opus 4.7 for the routes where SWE-bench-level reasoning actually moves the metric. With prompt caching and the Batch API stacked together, production-grade AI features cost roughly an order of magnitude less than they did 12 months ago — and that gap is what makes 2026 the year API-driven AI products finally hit unit economics that work.


Nadia Dubois

AI & Innovation Editor

Nadia Dubois is the AI & Innovation Editor at Tech Insider, where she tracks the rapid evolution of artificial intelligence, from foundation models to real-world enterprise deployment. She previously covered AI and startups for La Tribune and contributed to MIT Technology Review's European coverage. Nadia specializes in generative AI, AI regulation, and the intersection of technology and European industrial policy. She holds a dual degree in Computational Linguistics and Journalism from Sciences Po Paris.


URL: https://tech-insider.org/claude-api-tutorial-python-13-steps-2026/
