You're building an API and defaulting to JSON because everyone does. The endpoint works fine. Then it needs to handle large payloads at high throughput, and suddenly the bottleneck is payload parsing. Or you're writing a config file and three months later a new team member can't figure out what half of it means because there are no comments allowed. Or you're exporting data to finance and the CSV import broke because one field contained a comma.
Every data format is a tradeoff. JSON is not always the right answer. Neither is anything else. The question is which tradeoffs match your actual constraints.
This guide covers six formats you'll encounter in production work — what each one is, where it fits, and how to recognize when you're reaching for the wrong tool.
JSON — The Default
JSON (JavaScript Object Notation) is the lingua franca of the web. It's human-readable, supported natively in JavaScript, and has library support in every language worth using. Every REST API you interact with today almost certainly speaks JSON.
What you get: text-based, nested key-value pairs, arrays, strings, numbers, booleans, and null. No comments. No dates (you serialize them as strings). No binary data (you base64-encode it). No schema enforcement unless you add one separately.
What you lose: parsing speed at scale, compact wire size, type safety without tooling.
// A typical API response
const order = {
  id: "0195d3a2-f8c0-7b4e-8f32-1a2b3c4d5e6f",
  customerId: "cust_01234",
  status: "pending",
  amount: 4999,
  currency: "EUR",
  items: [
    { sku: "WIDGET-001", quantity: 2, unitPrice: 2499 },
    { sku: "WIDGET-002", quantity: 1, unitPrice: 1 },
  ],
  createdAt: "2026-04-23T09:12:00Z",
};
// Serialize / parse: built into Node.js, zero dependencies
const serialized = JSON.stringify(order);
const parsed = JSON.parse(serialized) as typeof order;
Pros: Universal support. Browser-native. Human-readable and editable. Trivial to debug — paste it into any JSON formatter. Schema validators exist (JSON Schema, Zod, Ajv).
Cons: No comments. Verbose — field names repeat on every object in an array. Numbers lose precision above 2^53 (use strings for large integers). Dates are strings by convention, not by type. Parsing is CPU-intensive at high volumes.
Use it when: building REST APIs, storing documents in PostgreSQL/MongoDB, passing data between services that don't share a schema definition, any context where a human might need to read the raw payload.
Avoid it when: you're parsing large payloads (100KB+) at high req/sec and it shows up in your profiler, binary data is common, or payload size is a hard constraint (mobile clients, IoT).
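The precision and date caveats are easy to reproduce. The following is a minimal sketch, assuming a hypothetical payload with a 64-bit `id` field; `reviveDates` and its "keys ending in `At`" convention are illustrative assumptions, not part of any library.

```typescript
// 2^53 + 1 cannot be represented as a double, so JSON.parse silently rounds it.
const lossy = JSON.parse('{"id": 9007199254740993}') as { id: number };
console.log(lossy.id); // 9007199254740992

// Sending the integer as a string keeps every digit; convert with BigInt on arrival.
const safe = JSON.parse('{"id": "9007199254740993"}') as { id: string };
const id = BigInt(safe.id);

// Dates arrive as strings. A reviver can turn known date fields back into Date objects.
// Hypothetical convention: any key ending in "At" holds an ISO 8601 timestamp.
function reviveDates(key: string, value: unknown): unknown {
  return key.endsWith("At") && typeof value === "string" ? new Date(value) : value;
}

const order = JSON.parse('{"createdAt": "2026-04-23T09:12:00Z"}', reviveDates) as {
  createdAt: Date;
};
console.log(order.createdAt.getUTCFullYear()); // 2026
```

Neither workaround is enforced by the format itself: both sides have to agree that `id` travels as a string and that `createdAt` is a timestamp, which is exactly the kind of agreement a schema validator would make explicit.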
YAML — Config Files and CI/CD
YAML (YAML Ain't Markup Language) is a superset of JSON (as of YAML 1.2) designed for human-writable configuration. It replaces braces and quotes with indentation and supports comments — two things JSON deliberately omits.
You have already written a lot of YAML: GitHub Actions workflows, Docker Compose files, Kubernetes manifests, Helm charts, swagger.yaml.
# Same order data as above. Note: no quotes required on strings,
# no commas, and comments are allowed
order:
  id: 0195d3a2-f8c0-7b4e-8f32-1a2b3c4d5e6f
  customer_id: cust_01234
  status: pending
  amount: 4999
  currency: EUR
  items:
    - sku: WIDGET-001
      quantity: 2
      unit_price: 2499
    - sku: WIDGET-002
      quantity: 1
      unit_price: 1
  created_at: "2026-04-23T09:12:00Z" # quoted to prevent datetime parsing
Pros: Comments. More readable than JSON for multi-level nesting. Multi-line strings with | (literal) and > (folded) blocks. Anchors and aliases (&anchor, *alias) for DRY config.
Cons: Indentation-sensitive (a misplaced space can change the document's structure without raising an error). The type coercion rules are surprising in YAML 1.1 parsers (still widely used): yes, no, on, off, true, false are all valid booleans, so the Norwegian country code NO parses as false. YAML 1.2 removed most of this, but not all parsers have caught up. Numbers with leading zeros parse as octal in YAML 1.1. Slow to parse relative to JSON. Not suitable for machine-generated data.
# YAML 1.1 footguns (Python's PyYAML, Ruby's Psych <4, many older tools)
country: NO # parses as false in YAML 1.1 parsers
version: 012 # parses as octal (decimal 10) in YAML 1.1 parsers
port: 8080 # integer, not string, which may surprise you
api_key: 1234e5678 # some parsers read this as an exponent-notation float, others keep it a string
Use it when: writing configuration files that humans edit frequently — CI/CD pipelines, Docker Compose, infrastructure definitions, application configs where comments add real value.
Avoid it when: the data is machine-generated, the people editing it aren't developers, or you need strict type guarantees. YAML's implicit type coercion has caused real production incidents.
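To make the coercion rules concrete, here is a simplified TypeScript model of YAML 1.1 implicit typing for unquoted scalars. It is a sketch under stated assumptions: `resolveYaml11` and `needsQuotes` are hypothetical helpers, and the rules cover only booleans, octal, and decimal integers in the style of common YAML 1.1 parsers, not the full specification.

```typescript
// Simplified model of YAML 1.1 implicit typing for plain (unquoted) scalars.
// Assumption: mirrors common YAML 1.1 parser behaviour, not the complete spec.
type Scalar = string | number | boolean;

const YAML11_BOOL =
  /^(yes|Yes|YES|no|No|NO|true|True|TRUE|false|False|FALSE|on|On|ON|off|Off|OFF)$/;
const YAML11_TRUE = /^(yes|true|on)$/i;

function resolveYaml11(plain: string): Scalar {
  if (YAML11_BOOL.test(plain)) return YAML11_TRUE.test(plain);
  if (/^0[0-7]+$/.test(plain)) return parseInt(plain, 8); // leading zero means octal
  if (/^[-+]?(0|[1-9][0-9]*)$/.test(plain)) return parseInt(plain, 10);
  return plain;
}

// Lint-style check: would this value change type if written without quotes?
function needsQuotes(value: string): boolean {
  return typeof resolveYaml11(value) !== "string";
}

console.log(resolveYaml11("NO"));  // false
console.log(resolveYaml11("012")); // 10
console.log(needsQuotes("8080"));  // true
console.log(needsQuotes("EUR"));   // false
```

A check like `needsQuotes` is the kind of rule a config linter can apply to values that must stay strings, such as country codes, version numbers, and API keys.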
TOML — Application Configuration
TOML (Tom's Obvious, Minimal Language) was designed specifically to fix YAML's footguns while remaining human-readable. It's unambiguously typed, not indentation-sensitive, and has no surprising coercion rules.
You know TOML from Cargo.toml (Rust), pyproject.toml (Python packaging), and increasingly from application config files in Go projects.
# Same data — notice explicit types and section headers
[order]
id = "0195d3a2-f8c0-7b4e-8f32-1a2b3c4d5e6f"
customer_id = "cust_01234"
status = "pending"
amount = 4999
currency = "EUR"
created_at = 2026-04-23T09:12:00Z # native datetime type — no quoting needed
[[order.items]]
sku = "WIDGET-001"
quantity = 2
unit_price = 2499
[[order.items]]
sku = "WIDGET-002"
quantity = 1
unit_price = 1
Pros: Native datetime type. Explicit strings require quotes — no NO becoming false. Integers, floats, booleans, and datetimes are distinct types. Comments. Not indentation-sensitive — sections are flat. Easier to write tooling for than YAML.
Cons: Less suitable for deeply nested structures — the [[array.of.tables]] syntax becomes awkward. Fewer language libraries than JSON or YAML. Not useful for data interchange (only configuration use cases make sense).
# pyproject.toml — real example of TOML for project config
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "eu-vat-rates-data"
version = "2026.4.23"
description = "EU VAT rates dataset"
requires-python = ">=3.9"
dependencies = []
[project.urls]
Homepage = "https://github.com/vatnode/eu-vat-rates-data"
Use it when: writing application-level configuration that developers own and edit — database configs, server settings, CLI tool configuration. Especially appropriate for Rust and Python projects where the ecosystem expects it.
Avoid it when: the config is deeply nested (YAML is more concise for that), or the consumers are non-developers who expect a simpler format.
CSV — Tabular Data and Exports
CSV (Comma-Separated Values) is the lowest common denominator of data exchange. Every spreadsheet application on earth reads it. Every database can export it. It's the format that finance, operations, and non-technical stakeholders actually use.
CSV has no official specification. RFC 4180 is the closest thing to a standard, and most implementations deviate from it in some way.
id,customer_id,status,amount,currency,created_at
0195d3a2-f8c0-7b4e-8f32-1a2b3c4d5e6f,cust_01234,pending,4999,EUR,2026-04-23T09:12:00Z
0195d3a2-f8c0-7b4e-8f33-2b3c4d5e6f7a,cust_05678,completed,12000,EUR,2026-04-23T09:15:00Z
CSV is flat. No nesting, no arrays, no objects. One row = one record. Every value is a string unless the consuming application interprets it otherwise.
// Parsing CSV in Node.js: use a library, never hand-roll the parser
import { parse } from "csv-parse/sync";
import { stringify } from "csv-stringify/sync";
import { readFileSync, writeFileSync } from "fs";
interface OrderRow {
  id: string;
  customer_id: string;
  status: string;
  amount: string; // CSV has no number type, so this is always a string
  currency: string;
  created_at: string;
}
// Parse
const raw = readFileSync("orders.csv", "utf-8");
const rows = parse(raw, {
  columns: true, // use first row as headers
  skip_empty_lines: true,
  trim: true,
}) as OrderRow[];
const orders = rows.map((row) => ({
  ...row,
  amount: parseInt(row.amount, 10), // explicit conversion required
}));
// Generate
const output = stringify(orders, {
  header: true,
  columns: ["id", "customer_id", "status", "amount", "currency", "created_at"],
});
writeFileSync("orders-export.csv", output);
Pros: Universal tool support (Excel, Google Sheets, LibreOffice). Trivial to generate. Small file size for flat tabular data. Non-developers can open and understand it immediately.
Cons: No types. No nesting. No schema. Delimiter conflicts (a field containing a comma breaks naive parsers). Encoding issues (UTF-8 vs. Windows-1252 for European characters). No standard for null vs. empty string. Excel auto-converts strings that look like dates or numbers.
The encoding issue is particularly sharp for European projects: Finnish and German names with ä, ö, ü, and similar characters are frequently mangled when the exporting system uses UTF-8 and the importing system expects Windows-1252. Always specify encoding explicitly and add a UTF-8 BOM if Excel is in the receiving chain.
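The delimiter and BOM points can be illustrated in a few lines. This is a sketch of the kind of quoting rule a CSV library applies for you, not a replacement for one; `escapeCsvField` and `toCsv` are hypothetical helper names, and the sample row is made up.

```typescript
// RFC 4180-style quoting: wrap a field in double quotes when it contains a
// delimiter, a quote, or a line break, and double any embedded quotes.
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Hypothetical helper: builds a CSV document that starts with a UTF-8 BOM (U+FEFF)
// so that Excel detects the encoding instead of guessing Windows-1252.
function toCsv(header: string[], rows: string[][]): string {
  const lines = [header, ...rows].map((row) => row.map(escapeCsvField).join(","));
  return "\uFEFF" + lines.join("\r\n") + "\r\n";
}

const csv = toCsv(["customer", "note"], [["Mäkelä, Oy", 'said "ok"']]);
console.log(csv);
```

When the string is written to disk as UTF-8, the leading U+FEFF becomes the three-byte BOM. Only add it when Excel is a consumer, since other tools may treat it as part of the first header name.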
Use it when: exporting data for non-developers (finance reports, operations data, client-facing exports), importing data from third-party systems that only speak CSV, or any context involving a spreadsheet.
Avoid it when: the data is hierarchical, types matter at parse time, or the pipeline is fully automated without human review.
Protocol Buffers — High-Performance Binary
Protocol Buffers (Protobuf) is a binary serialization format developed by Google and used internally across their infrastructure. It's the wire format for gRPC. You define a schema in a .proto file, compile it to code, and serialize/deserialize with generated functions.
The schema definition is not optional — it's the point. Protobuf enforces structure at compile time, not runtime.
// order.proto
syntax = "proto3";
package commerce;
message OrderItem {
  string sku = 1;
  int32 quantity = 2;
  int32 unit_price = 3;
}
message Order {
  string id = 1;
  string customer_id = 2;
  string status = 3;
  int32 amount = 4;
  string currency = 5;
  repeated OrderItem items = 6;
  // Unix timestamp in seconds. proto3 has no datetime scalar;
  // google.protobuf.Timestamp is the well-known-type alternative.
  int64 created_at_unix = 7;
}
// Using @bufbuild/protobuf v2 (buf's modern TypeScript runtime)
// npm install @bufbuild/protobuf
// npm install --save-dev @bufbuild/buf @bufbuild/protoc-gen-es (needed for code generation)
// Generate code: npx buf generate
import { create, toBinary, fromBinary } from "@bufbuild/protobuf";
import { OrderSchema } from "./generated/order_pb";
// Serialize
const order = create(OrderSchema, {
  id: "0195d3a2-f8c0-7b4e-8f32-1a2b3c4d5e6f",
  customerId: "cust_01234",
  status: "pending",
  amount: 4999,
  currency: "EUR",
  createdAtUnix: BigInt(Math.floor(Date.now() / 1000)), // seconds, not milliseconds
  items: [{ sku: "WIDGET-001", quantity: 2, unitPrice: 2499 }],
});
const bytes = toBinary(OrderSchema, order); // Uint8Array, compact binary
// Deserialize
const decoded = fromBinary(OrderSchema, bytes);
console.log(decoded.amount); // 4999 as a typed integer, not a string
Size comparison: the same order object serialized to JSON is roughly 280 bytes. In Protobuf, it's around 80 bytes — about 70% smaller.
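Where the savings come from is easiest to see on a single field. The sketch below hand-encodes `amount = 4999` (field number 4 in the schema above) using Protobuf's base-128 varint wire format and compares it with the JSON text for the same field. `encodeVarint` and `encodeVarintField` are illustrative helpers that handle only non-negative integers; real code should use the generated serializers.

```typescript
// Base-128 varint: 7 payload bits per byte, high bit set on every byte except the last.
// Sketch only: assumes a non-negative safe integer.
function encodeVarint(value: number): number[] {
  const bytes: number[] = [];
  let rest = value;
  while (rest > 0x7f) {
    bytes.push((rest & 0x7f) | 0x80);
    rest = Math.floor(rest / 128);
  }
  bytes.push(rest);
  return bytes;
}

// A varint field is a tag (field number << 3 | wire type 0) followed by the value.
function encodeVarintField(fieldNumber: number, value: number): number[] {
  return encodeVarint(fieldNumber << 3).concat(encodeVarint(value));
}

const protoBytes = encodeVarintField(4, 4999); // int32 amount = 4;
const jsonText = '"amount":4999';

console.log(protoBytes.map((b) => b.toString(16)).join(" ")); // 20 87 27
console.log(protoBytes.length, jsonText.length);              // 3 13
```

The field name never goes on the wire, only its number, which is why reusing a field number for a different meaning silently corrupts data for older readers.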
Parsing speed: Protobuf deserialization is typically 2–6x faster than JSON parsing in Node.js (higher in Go or C++). The gap widens with payload size and narrows with small messages. At sustained high throughput, the difference is measurable.
Pros: Compact binary. Fast serialization/deserialization. Strict schema enforced at compile time. Field numbers enable backward-compatible schema evolution. Language-agnostic generated code (Go, Java, Python, TypeScript, C++, and many more).
Cons: Not human-readable — you cannot curl an endpoint and read the response. Requires a .proto file and a code generation step. Schema changes must be managed carefully (never reuse field numbers). Higher operational complexity: the .proto files become an API contract that must be versioned and distributed.
Use it when: building internal service-to-service communication where both sides control the schema, performance is a hard requirement (high-frequency trading, telemetry pipelines, game servers), or you are already using gRPC.
Avoid it when: external clients need to inspect payloads, the schema changes frequently and the tooling overhead is a burden, or the team is small and the setup cost exceeds the performance benefit.
MessagePack — Binary JSON
MessagePack describes itself as "like JSON but fast and small." That's accurate. It's a binary format that maps directly to JSON's data model — objects, arrays, strings, numbers, booleans, null — but uses a compact binary encoding instead of text.
The practical advantage over JSON: no parsing of text. Numbers are binary-encoded in 1–9 bytes depending on magnitude, including the type prefix. Strings do not need quote characters or escape sequences. There is no code generation step and no schema file; you serialize a native data structure directly.
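To show what "binary-encoded" means here, the following sketch hand-encodes a small unsigned integer and a short string the way the MessagePack specification lays them out. `packUint` and `packShortString` are hypothetical helpers that cover only a slice of the format (unsigned integers up to 65535 and strings up to 31 bytes); a real application would use a MessagePack library.

```typescript
// Unsigned integers use the smallest representation that fits.
// Sketch only: assumes a non-negative integer no larger than 65535.
function packUint(value: number): number[] {
  if (value <= 0x7f) return [value];                            // positive fixint: 1 byte
  if (value <= 0xff) return [0xcc, value];                      // uint 8: 2 bytes
  if (value <= 0xffff) return [0xcd, value >> 8, value & 0xff]; // uint 16: 3 bytes
  throw new RangeError("sketch only covers values up to 65535");
}

// Short strings: one header byte (0xa0 | length) followed by the UTF-8 bytes.
// No quote characters and no escape sequences are needed.
function packShortString(text: string): number[] {
  const utf8 = Array.from(new TextEncoder().encode(text));
  if (utf8.length > 31) throw new RangeError("sketch only covers fixstr (up to 31 bytes)");
  return [0xa0 | utf8.length, ...utf8];
}

console.log(packUint(2));            // [ 2 ]
console.log(packUint(4999));         // [ 205, 19, 135 ]
console.log(packShortString("EUR")); // [ 163, 69, 85, 82 ]
```

Unlike Protobuf, the keys of a map are still encoded as strings in every message, which is why the savings over JSON depend heavily on key lengths.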
// npm install @msgpack/msgpack
import { encode, decode } from "@msgpack/msgpack";
const order = {
  id: "0195d3a2-f8c0-7b4e-8f32-1a2b3c4d5e6f",
  customerId: "cust_01234",
  status: "pending",
  amount: 4999,
  currency: "EUR",
  // @msgpack/msgpack serializes Date via an extension type (library-specific)
  createdAt: new Date("2026-04-23T09:12:00Z"),
  items: [{ sku: "WIDGET-001", quantity: 2, unitPrice: 2499 }],
};
// Serialize: returns a Uint8Array
const packed = encode(order);
// ~160 bytes vs ~280 bytes of JSON for this example; actual savings depend on key lengths
console.log(packed.byteLength);
// Deserialize
const unpacked = decode(packed) as typeof order;
MessagePack has extension types — a mechanism for encoding types that JSON cannot represent natively: arbitrary binary, bigint, and custom types. Date support via extension types is library-specific (available in @msgpack/msgpack, not guaranteed by the spec). This solves one of JSON's most annoying limitations without requiring a separate schema system.
// Custom extension type for Decimal values (useful for financial data)
import { encode, decode, ExtensionCodec } from "@msgpack/msgpack";
import Decimal from "decimal.js";
const extensionCodec = new ExtensionCodec();
extensionCodec.register({
  type: 1,
  encode: (input: unknown): Uint8Array | null => {
    if (input instanceof Decimal) {
      return new TextEncoder().encode(input.toString());
    }
    return null;
  },
  decode: (data: Uint8Array): Decimal => {
    return new Decimal(new TextDecoder().decode(data));
  },
});
const payload = { amount: new Decimal("49.99"), currency: "EUR" };
const packed = encode(payload, { extensionCodec });
const unpacked = decode(packed, { extensionCodec }) as typeof payload;
// unpacked.amount is a Decimal instance, not a float
Pros: No schema required — drop-in replacement for JSON in most internal APIs. Roughly 30–50% smaller payloads than JSON for typical objects. Faster parsing than JSON. Binary extension types for Date, Buffer, and custom types.
Cons: Not human-readable. Less universal than JSON — requires a MessagePack library on both sides. Smaller ecosystem than JSON Schema for validation. Still slower than Protobuf (no schema means no precomputed field layout).
Use it when: you need something faster and smaller than JSON without the operational overhead of Protobuf — internal caching layers, WebSocket message frames, session storage, or any internal API where you control both ends.
Avoid it when: external consumers need to inspect payloads, or you need strict schema validation and backward-compatibility guarantees (use Protobuf for that).
Format Comparison
| Property | JSON | YAML | TOML | CSV | Protobuf | MessagePack |
|---|---|---|---|---|---|---|
| Human-readable | Yes | Yes | Yes | Yes | No | No |
| Comments | No | Yes | Yes | No | Yes (.proto) | No |
| Native types | Limited | Limited | Rich | None | Rich | Rich |
| Schema required | No | No | No | No | Yes | No |
| Relative payload size | 100% | ~115% | ~110% | ~70%* | ~30% | ~55% |
| Parse speed (relative, rough; varies by runtime) | 1x | ~0.5x | ~0.8x | ~1.2x | ~5–10x | ~3–5x |
| Nesting support | Yes | Yes | Limited | No | Yes | Yes |
| Binary data | No† | No† | No | No | Yes | Yes |
| Ecosystem maturity | Max | High | Medium | Max | High | Medium |
* CSV size advantage applies only to flat tabular data — nested data cannot be represented at all.
† JSON and YAML can represent binary as base64-encoded strings, increasing size by ~33%.
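The ~33% figure in the base64 footnote follows directly from how the encoding works: every 3 input bytes become 4 output characters. A small sketch of the arithmetic, where `base64Length` is an illustrative helper and the 300,000-byte attachment is a made-up example:

```typescript
// Base64 turns every 3 input bytes into 4 output characters, padding the last group.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

const rawBytes = 300000; // hypothetical binary attachment
const encodedChars = base64Length(rawBytes);

console.log(encodedChars);                              // 400000
console.log(Math.round((encodedChars / rawBytes - 1) * 100)); // 33
```

For small inputs the padding makes the relative overhead worse: a single byte still costs four characters.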
Decision Framework
Before picking a format, answer these questions:
1. Who reads the output?
If a human reads it directly — developer, operations engineer, finance team — you need a text format. JSON for API payloads, YAML or TOML for configs, CSV for reports and exports.
If it's machine-to-machine only, binary formats are worth considering.
2. Do both sides share a schema definition?
If yes, and if performance matters, Protobuf is worth the setup cost. The .proto file becomes a contract that both sides compile against — changes are caught at build time, not in production.
If no, JSON or MessagePack are more pragmatic. MessagePack if wire size matters; JSON if debuggability matters more.
3. Is the data hierarchical or flat?
Flat tabular data with a fixed set of columns — CSV. Any nesting at all — everything else.
4. What does the team already know?
A team that has never used Protobuf will spend a week on tooling setup and schema management before writing any business logic. That cost is real. For a two-person startup, JSON everywhere plus MessagePack for the one hot path is a better tradeoff than Protobuf across the board.
5. What are the performance constraints?
If parsing JSON is not showing up in your profiler, you do not have a JSON performance problem. Profile first, optimize second. I've seen teams add Protobuf to internal services handling 100 requests a day because they read a benchmark post.
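Before reaching for a profiler, a crude measurement is often enough to tell whether parsing is even in the right order of magnitude. The sketch below is one way to get that number; `averageMs` is a hypothetical helper, the payload shape is invented, and `Date.now()` is used only because it needs no imports, so treat the result as a rough estimate.

```typescript
// Rough timing harness: average milliseconds per call over many iterations.
// Date.now() is coarse, so the iteration count has to be large enough to matter.
function averageMs(fn: () => void, iterations: number): number {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return (Date.now() - start) / iterations;
}

// Hypothetical payload: 1,000 small order rows.
const payload = JSON.stringify(
  Array.from({ length: 1000 }, (_, i) => ({
    id: `order-${i}`,
    amount: 4999,
    currency: "EUR",
  })),
);

const perParse = averageMs(() => {
  JSON.parse(payload);
}, 200);

console.log(
  `${(payload.length / 1024).toFixed(1)} KB payload, ${perParse.toFixed(3)} ms per parse`,
);
```

Compare the per-parse time against your per-request latency budget. If it is a small fraction of that budget at your real payload sizes, a format change is unlikely to be the best use of engineering time.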
6. Is the config user-editable?
If yes: TOML for application config (especially in Rust or Python projects), YAML for infrastructure config (GitHub Actions, Docker, Kubernetes). Both support comments. Neither is appropriate as a data interchange format.
Quick Reference
| Situation | Format |
|---|---|
| REST API, external consumers | JSON |
| Internal high-throughput API, shared schema | Protobuf |
| Internal API, no schema file wanted | MessagePack |
| CI/CD pipeline, Kubernetes, Docker Compose | YAML |
| Application config, developer-editable | TOML |
| Export to Excel / non-developer stakeholders | CSV |
| Database dump for data pipeline | CSV or JSON (newline-delimited) |
| WebSocket message frames at scale | MessagePack |
| Cache storage (Redis, Memcached) | MessagePack |
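The "JSON (newline-delimited)" entry in the table refers to a layout where each line is one complete JSON value, so a pipeline can process a dump record by record instead of loading one huge array. A minimal sketch, with `toNdjson` and `fromNdjson` as illustrative helper names:

```typescript
// Newline-delimited JSON: one complete JSON value per line, no enclosing array.
function toNdjson(records: unknown[]): string {
  return records.map((record) => JSON.stringify(record) + "\n").join("");
}

// Parse line by line; blank lines (such as a trailing newline) are skipped.
function fromNdjson(text: string): unknown[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

const dump = toNdjson([
  { id: "cust_01234", amount: 4999 },
  { id: "cust_05678", amount: 12000 },
]);
console.log(dump);
console.log(fromNdjson(dump).length); // 2
```

Because `JSON.stringify` escapes line breaks inside strings, a record never spans more than one line, which is what makes the line-by-line split safe.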
Wrapping Up
The mistake I see most often is defaulting to JSON for everything and then treating the performance or size problem as a framework problem. It's usually a format problem. The second most common mistake is introducing Protobuf prematurely, before the team has the tooling discipline to manage .proto files across multiple services.
Use JSON as your default. Add MessagePack when you measure a real parsing bottleneck or a real payload size constraint. Add Protobuf when you have multiple services that need a strict contract and the engineering time to maintain it. Use YAML for infrastructure config. Use TOML for application config. Use CSV for anything that ends up in a spreadsheet.
The format you choose is the one both sides of your system have to live with. Pick boring over clever until boring stops working.
Format decisions compound. In the eu-vat-rates-data open-source project, I publish the same dataset to five ecosystems — each one expects data in a different idiomatic form. Getting that right up front meant zero cross-registry bugs. If you need API integrations that bridge format boundaries cleanly, or a technical consultation on a data architecture decision, get in touch.
Frequently Asked Questions
When should I use Protobuf instead of JSON?
When you control both ends of the wire, have a schema that evolves predictably, and performance is a hard constraint: high-frequency internal services, telemetry pipelines, gRPC backends. At small scale or with external consumers who need to inspect payloads, the tooling overhead of .proto files and code generation outweighs the performance gain. Profile first: if JSON is not in your profiler, you do not have a JSON problem.
What is MessagePack and when is it better than JSON?
MessagePack is a binary format with the same data model as JSON but roughly 30–50% smaller payloads and faster parsing. No schema file, no code generation; you serialize native data structures directly. It is the pragmatic middle ground between JSON and Protobuf: faster and smaller than JSON, simpler to operate than Protobuf. Good fit for Redis cache storage, WebSocket frames, and internal APIs where you control both ends.
Why does YAML cause production bugs more often than JSON?
YAML 1.1 parsers, which are still widely used, have surprising type coercion rules: NO parses as false (breaking country codes), leading zeros parse as octal, and strings that look like floats become floats. JSON is explicit about types, so there is no ambiguity. For configuration files that humans edit, TOML is safer than YAML and has no coercion footguns.
Should I use CSV for data pipelines?
CSV is the right choice when the output ends up in a spreadsheet or is consumed by a non-developer: finance reports, operations exports, client-facing data. For automated machine-to-machine pipelines where you control both ends, prefer JSON or newline-delimited JSON. CSV's lack of types, lack of nesting, and inconsistent encoding behaviour (UTF-8 vs Windows-1252) cause silent data corruption in unattended pipelines.
What is the difference between YAML and TOML for config files?
Both support comments and human-readable structure, but TOML is safer: strings require explicit quotes so there is no accidental type coercion, datetimes are a native type, and the format is not indentation-sensitive. YAML is better for deeply nested structures like Kubernetes manifests or GitHub Actions workflows. For application-level config that developers own, TOML is the stricter and more predictable choice.
Further reading:
- RFC 8259 (obsoletes RFC 4627): The JavaScript Object Notation (JSON) Data Interchange Format
- RFC 4180 — Common Format for CSV Files
- TOML specification
- Protocol Buffers documentation
- MessagePack specification
- msgpack/msgpack npm package
- buf — modern Protobuf toolchain
- One Package, Five Registries: How I Maintain eu-vat-rates-data — how TOML, JSON, and Python packaging work together in a multi-ecosystem project
- UUID v7 in Production: Why Your Database Hates v4 — identifier format decisions for database-heavy systems
