URL: https://dev.to/aws-builders/high-cardinality-file-access-analysis-with-honeycomb-otel-1962

High-Cardinality File Access Analysis with Honeycomb + OTel


TL;DR

We built a serverless pipeline that ships FSx for ONTAP audit logs to Honeycomb, where its high-cardinality query engine turns file access data into actionable insights. Two delivery paths verified:

[Path A: Direct]
FSx for ONTAP → S3 Access Point → EventBridge Scheduler → Lambda → Honeycomb Events Batch API

[Path B: OTel Collector]
FSx for ONTAP → S3 Access Point → EventBridge Scheduler → Lambda → OTel Collector → OTLP → Honeycomb

Why Honeycomb for file access logs? Because file access data is inherently high-cardinality: thousands of users × millions of file paths × dozens of operations × multiple SVMs. Traditional log tools force you to pre-aggregate or sample. Honeycomb lets you query the raw events at full resolution.

┌──────────────────────────────────────────────────────┐
│ Honeycomb Query Engine │
│ │
│ "Show me which users accessed /vol/finance/* │
│ between 2am-4am last Tuesday" │
│ │
│ → BubbleUp: auto-detect anomalous dimensions │
│ → Heatmap: visualize access density over time │
│ → GROUP BY user, path, operation — no pre-indexing │
│ │
│ 20M events/month FREE │
└──────────────────────────────────────────────────────┘

This is Part 10 of the Serverless Observability for FSx for ONTAP series.


Why Honeycomb for File Access Logs?

Most observability tools index a fixed set of fields. When you have high-cardinality dimensions — like file paths (/vol/data/project-alpha/2026/Q1/report-final-v3.docx) or Active Directory usernames — you hit index bloat, slow queries, or forced sampling.

Honeycomb's columnar storage handles this natively:

| Capability | Traditional Logs | Honeycomb |
| --- | --- | --- |
| Query by arbitrary field | Pre-index or full scan | Instant (columnar) |
| GROUP BY high-cardinality field | Expensive / limited | Native |
| BubbleUp (anomaly detection) | Manual investigation | Semi-automatic (select time range, BubbleUp identifies differing dimensions) |
| Heatmap visualization | Requires pre-aggregation | Raw events |

For FSx for ONTAP audit logs, this means you can ask questions like:

  • "Which users accessed the most files in the last hour?" (GROUP BY user)
  • "What's different about the spike at 3am?" (BubbleUp)
  • "Show me the access pattern heatmap for /vol/finance/" (Heatmap)

Architecture

┌─────────────────────────────────────────────────────────┐
│ Event Sources │
├─────────────────────────────────────────────────────────┤
│ │
│ EventBridge Scheduler │
│ rate(5 minutes) ──→ Lambda │
│ │ lists new files via │
│ │ S3 Access Point │
│ │ (checkpoint in SSM) │
│ ▼ │
│ Honeycomb Events Batch API │
│ (x-honeycomb-team header) │
│ │ │
│ EMS Webhook │ │
│ ──→ API GW ──→ Lambda ─────────────┤ │
│ (ems_handler) │ │
│ ▼ │
│ FPolicy Honeycomb │
│ ──→ ECS Fargate ──→ SQS (BubbleUp, │
│ ──→ Bridge Lambda Heatmap, │
│ ──→ EventBridge Explore) │
│ ──→ Lambda (fpolicy_handler) ──────────────────────────┤
└─────────────────────────────────────────────────────────┘
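
The polling step in the diagram (list new files, keep a checkpoint) can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes that audit log object keys sort lexicographically in creation order and that the checkpoint stored in SSM is simply the last processed key. `select_new_keys` is a hypothetical helper name.

```python
from typing import Iterable, List, Optional, Tuple


def select_new_keys(
    keys: Iterable[str], checkpoint: Optional[str]
) -> Tuple[List[str], Optional[str]]:
    """Return the keys newer than the checkpoint, plus the next checkpoint.

    Assumption: object keys embed a sortable date/time, so plain string
    comparison orders them from oldest to newest.
    """
    ordered = sorted(keys)
    if checkpoint is not None:
        ordered = [key for key in ordered if key > checkpoint]
    # When nothing is new, the previous checkpoint is kept unchanged.
    next_checkpoint = ordered[-1] if ordered else checkpoint
    return ordered, next_checkpoint
```

The caller would persist `next_checkpoint` only after the returned keys have been delivered, so that a failed run is retried from the same position on the next 5-minute tick.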

Two Verified Delivery Paths

Path A: Direct Events Batch API

Simplest path. Lambda sends events directly to Honeycomb's Events Batch API.

# Batch format
[
  {
    "time": "2026-01-15T12:00:00Z",
    "data": {
      "source": "fsxn-ontap",
      "service": "ontap-audit",
      "event_type": "4663",
      "svm": "svm-prod-01",
      "user": "admin@corp.local",
      "operation": "ReadData",
      "path": "/vol/data/file.txt",
      "result": "Success",
      "client_ip": "10.0.x.x"
    }
  }
]
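
To make the batch format concrete, here is a minimal Python sketch of how a Lambda might wrap parsed audit records in that envelope and prepare the POST request. It is an illustration under stated assumptions rather than the project's implementation: the helper names are hypothetical, and each input record is assumed to carry an ISO 8601 `timestamp` key next to the audit fields. The request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

# Events Batch API endpoint; the dataset name is part of the URL path.
HONEYCOMB_BATCH_URL = "https://api.honeycomb.io/1/batch/{dataset}"


def build_batch(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Wrap parsed audit records in the {"time", "data"} batch envelope.

    Assumption: every record has a "timestamp" key; the real parser's
    field names may differ.
    """
    batch = []
    for record in records:
        data = {k: v for k, v in record.items() if k != "timestamp"}
        batch.append({"time": record["timestamp"], "data": data})
    return batch


def build_request(
    api_key: str, dataset: str, batch: List[Dict[str, Any]]
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for one batch."""
    url = HONEYCOMB_BATCH_URL.format(dataset=urllib.parse.quote(dataset, safe=""))
    return urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={
            "X-Honeycomb-Team": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending is then a single `urllib.request.urlopen(request, timeout=10)` call. Keeping the build step separate from the send step makes the payload easy to unit test without network access.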

Path B: OTel Collector (OTLP)

Use this path for multi-backend delivery, or when you want enrichment or redaction in the pipeline. It was verified in Part 5 with Honeycomb as one of the backends.

The OTel Collector uses the otlphttp exporter, with the x-honeycomb-team and x-honeycomb-dataset headers set:

exporters:
  otlphttp/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}
      x-honeycomb-dataset: fsxn-audit
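
On the Lambda side of Path B, events have to reach the Collector in an OTLP shape. The article does not show that step, so the following is only a sketch under assumptions: it treats each audit event as an OTLP log record, encodes every field as a string attribute, and produces the OTLP/HTTP JSON body that a Collector's OTLP receiver accepts on its `/v1/logs` path. `to_otlp_logs` is a hypothetical helper, and the input events are assumed to use the same `time`/`data` shape as the Path A batch.

```python
from datetime import datetime
from typing import Any, Dict, List


def _attr(key: str, value: Any) -> Dict[str, Any]:
    """Encode one attribute as an OTLP string key/value pair."""
    return {"key": key, "value": {"stringValue": str(value)}}


def to_otlp_logs(
    events: List[Dict[str, Any]], service_name: str = "ontap-audit"
) -> Dict[str, Any]:
    """Convert {"time", "data"} events into an OTLP/HTTP JSON logs payload."""
    log_records = []
    for event in events:
        # fromisoformat() on older Python versions does not accept "Z".
        ts = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
        log_records.append(
            {
                # Whole-second precision only; OTLP JSON carries this as a string.
                "timeUnixNano": str(int(ts.timestamp()) * 1_000_000_000),
                "attributes": [_attr(k, v) for k, v in event["data"].items()],
            }
        )
    return {
        "resourceLogs": [
            {
                "resource": {"attributes": [_attr("service.name", service_name)]},
                "scopeLogs": [{"logRecords": log_records}],
            }
        ]
    }
```

Because the fields travel as attributes, Collector processors can redact or enrich them by key before the exporter above forwards them to Honeycomb.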

Quick Start (30 Minutes)

1. Get a Honeycomb Ingest Key

  1. Sign up at honeycomb.io (free tier: 20M events/month)
  2. Go to Account → Team Settings → API Keys
  3. Create an Ingest Key (starts with hcaik_)

⚠️ Critical: You MUST use an Ingest Key (hcaik_*). Environment Keys (hcxik_*) will be rejected.

2. Store Credentials

aws secretsmanager create-secret \
 --name "honeycomb/fsxn-api-key" \
 --secret-string '{"api_key":"hcaik_01abc..."}' \
 --region ap-northeast-1
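
Reading the key back in the Lambda could look like the sketch below. It assumes the secret holds the JSON document created above and uses the `hcaik_` prefix check from step 1 as an early guard against the wrong key type. `parse_api_key` and `load_api_key` are hypothetical names; the client is expected to be a boto3 Secrets Manager client, or a test double with the same `get_secret_value` method.

```python
import json
from typing import Any

INGEST_KEY_PREFIX = "hcaik_"  # Ingest Key prefix named in this article


def parse_api_key(secret_string: str) -> str:
    """Extract the key from the secret's JSON payload and sanity-check it."""
    api_key = json.loads(secret_string)["api_key"]
    if not api_key.startswith(INGEST_KEY_PREFIX):
        # Fail loudly at startup instead of losing events later.
        raise ValueError("expected an Ingest Key starting with 'hcaik_'")
    return api_key


def load_api_key(secrets_client: Any, secret_id: str) -> str:
    """Fetch the secret through a Secrets Manager client and parse it."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return parse_api_key(response["SecretString"])
```

In a Lambda this would be called once at cold start, for example `load_api_key(boto3.client("secretsmanager"), "honeycomb/fsxn-api-key")`, with the result cached for later invocations.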

3. Deploy CloudFormation Stack

aws cloudformation deploy \
 --template-file integrations/honeycomb/template.yaml \
 --stack-name fsxn-honeycomb-integration \
 --parameter-overrides \
 S3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
 HoneycombApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:honeycomb/fsxn-api-key-XXXXXX \
 HoneycombDataset=fsxn-audit \
 S3BucketName=my-fsxn-audit-bucket \
 --capabilities CAPABILITY_NAMED_IAM \
 --region ap-northeast-1

4. Verify in Honeycomb

Navigate to your dataset → Explore Data:

WHERE service = "ontap-audit" | COUNT

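Events should appear within seconds of a Lambda run. The poller itself fires every 5 minutes, so allow for one polling interval after deployment.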

Honeycomb Query Examples

Basic Investigation

# All failed access attempts
WHERE result = "Failure" | GROUP BY user, path | COUNT

# Top 20 users by file access volume
GROUP BY user | COUNT | ORDER BY COUNT DESC | LIMIT 20

# Operations breakdown
GROUP BY operation | COUNT

High-Cardinality Analysis (Honeycomb's Strength)

# BubbleUp: What's different about the 3am spike?
# Select the spike in the time series → click BubbleUp
# Honeycomb auto-identifies which dimensions differ

# Heatmap: Access density by hour
WHERE operation = "ReadData" | HEATMAP(timestamp)

# Trace a specific user's activity
WHERE user = "admin@corp.local" | VISUALIZE COUNT | GROUP BY operation, path

# Find unusual path access patterns
GROUP BY path | COUNT | WHERE COUNT > 100

Security Investigation

# After-hours access to sensitive paths
WHERE path CONTAINS "confidential" AND hour(timestamp) NOT BETWEEN 9 AND 17
| GROUP BY user | COUNT

# Users accessing paths they haven't accessed before
# (Use Honeycomb's "compare to baseline" feature)

# Bulk file operations (potential exfiltration)
WHERE operation = "ReadData" | GROUP BY user | COUNT | WHERE COUNT > 1000
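
To sanity-check sample data before it is shipped, the bulk-read query can be reproduced locally. The sketch below is a rough Python equivalent of `WHERE operation = "ReadData" | GROUP BY user | COUNT | WHERE COUNT > 1000`; it assumes events are plain dictionaries with the `user` and `operation` fields from the schema, and `bulk_readers` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def bulk_readers(
    events: Iterable[Dict[str, str]], threshold: int = 1000
) -> List[Tuple[str, int]]:
    """Return (user, count) pairs with more than `threshold` ReadData events.

    Results are ordered from the busiest user downwards.
    """
    counts = Counter(
        event["user"] for event in events if event.get("operation") == "ReadData"
    )
    return [(user, n) for user, n in counts.most_common() if n > threshold]
```

This is useful for test fixtures only; in production the same question is answered by Honeycomb directly against the raw events.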

Event Schema (13 Fields)

All fields are queryable at full cardinality without pre-indexing. The table lists ten of them:

| Field | Example | Cardinality |
| --- | --- | --- |
| source | fsxn-ontap | Low |
| service | ontap-audit | Low |
| event_type | 4663 | Low (~10 types) |
| svm | svm-prod-01 | Low (~5-20) |
| user | admin@corp.local | High (thousands) |
| operation | ReadData | Low (~10 types) |
| path | /vol/data/report.pdf | Very High (millions) |
| result | Success / Failure | Low (2) |
| client_ip | 10.0.x.x | Medium (hundreds) |
| s3_key | audit/svm-prod-01/2026/... | Very High |

Cost Analysis

Honeycomb pricing is event-based, not volume-based:

| Monthly Log Volume | Estimated Events | Honeycomb Cost |
| --- | --- | --- |
| 1 GB | ~500K events | Free (20M/month included) |
| 10 GB | ~5M events | Free |
| 30 GB | ~15M events | Free |
| 50 GB | ~25M events | Paid tier (~$100/month) |

| Component | Monthly Cost (10 GB/month) |
| --- | --- |
| Lambda (5-min polling) | ~$3 |
| EventBridge Scheduler | ~$1 |
| Secrets Manager | ~$1 |
| Honeycomb | Free (5M events < 20M limit) |
| Total | ~$5 |

The 20M events/month free tier covers most FSx for ONTAP deployments. Estimate ~500 events per MB of audit log data.
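
The estimate is simple enough to script. The sketch below applies the 500 events per MB rule of thumb and the 20M events/month free tier from this section, treating 1 GB as 1,000 MB to match the table. The function names are illustrative, and the actual event density depends on your audit policy.

```python
EVENTS_PER_MB = 500            # rule of thumb from this article
FREE_TIER_EVENTS = 20_000_000  # free tier quoted in this article


def estimate_monthly_events(log_volume_gb: float) -> int:
    """Estimate events per month, taking 1 GB as 1,000 MB as the table does."""
    return int(log_volume_gb * 1000 * EVENTS_PER_MB)


def fits_free_tier(log_volume_gb: float) -> bool:
    """True when the estimated event count stays within the free tier."""
    return estimate_monthly_events(log_volume_gb) <= FREE_TIER_EVENTS
```

Under these assumptions the break-even point is about 40 GB of audit logs per month (20M events / 500 events per MB = 40,000 MB).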

Gotchas & Lessons Learned

| # | Discovery | Impact |
| --- | --- | --- |
| 1 | Must use Ingest Key (hcaik_*); Environment Key (hcxik_*) is silently rejected | Events disappear without error if wrong key type |
| 2 | Events with timestamps older than ~4 hours are rejected | Test data must use current timestamps |
| 3 | 5MB max request body size; our implementation batches in chunks of 100 events for reliability | Lambda splits large files into multiple requests |
| 4 | Honeycomb processes data in US regions only | Evaluate cross-border data transfer requirements |
| 5 | Dataset auto-created on first event if it doesn't exist | No pre-provisioning needed |
| 6 | OTel Collector path requires x-honeycomb-dataset header | Without it, events go to a default dataset |
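
Gotcha 3 translates into a small chunking helper. The sketch below is one possible shape for it, not the project's code: it splits events into groups of 100 and checks each serialized group against the body limit. The limit is taken as 5,000,000 bytes, which is a conservative assumption because the article only says "5MB", and `chunk_events` is a hypothetical name.

```python
import json
from typing import Any, Dict, Iterator, List

MAX_EVENTS_PER_REQUEST = 100  # chunk size used in this article
MAX_BODY_BYTES = 5_000_000    # assumption: conservative reading of "5MB"


def chunk_events(
    events: List[Dict[str, Any]], size: int = MAX_EVENTS_PER_REQUEST
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive chunks of at most `size` events.

    Raises ValueError if a serialized chunk would exceed the body limit,
    so oversized payloads fail before any request is attempted.
    """
    for start in range(0, len(events), size):
        chunk = events[start:start + size]
        body = json.dumps(chunk).encode("utf-8")
        if len(body) > MAX_BODY_BYTES:
            raise ValueError(f"chunk at offset {start} is {len(body)} bytes")
        yield chunk
```

With typical audit events of a few hundred bytes each, 100 events stay far below the limit, so the size check mainly guards against unusually long paths or unexpected fields.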

Direct vs OTel Collector: When to Use Which

| Criteria | Direct (Path A) | OTel Collector (Path B) |
| --- | --- | --- |
| Simplicity | ✅ Fewer components | More infrastructure |
| Multi-backend | ❌ Honeycomb only | ✅ Any OTLP backend |
| Enrichment/redaction | ❌ In Lambda only | ✅ Collector processors |
| Cost | Lower (no Collector) | Collector compute cost |
| Recommendation | Single-backend PoC | Production multi-backend |

Note from Honeycomb: Honeycomb recommends OTLP as the primary ingest path for new production deployments. The Events Batch API (Path A) remains fully supported and is simpler for single-backend PoCs. If you start with Path A, migrating to Path B (OTLP) requires no changes to your Honeycomb queries — only the delivery mechanism changes.

Production Readiness

This integration follows the project's Production Readiness Levels:

| Level | What You Get | Go/No-Go to Next |
| --- | --- | --- |
| Level 1 (this Quick Start) | Audit poller + DLQ | Logs arrive, checkpoint advances, DLQ empty 24h |
| Level 2 | + Honeycomb queries + alerts | SLOs met 7 days, security review done |
| Level 3 | + DynamoDB ledger + poison-pill | SLOs met 30 days, compliance pack |
| Level 4 | + OTel Collector + redaction | Multi-backend, PII redaction, DR tested |

Data classification note: Honeycomb receives user and path fields which are classified as PII/sensitive. Since Honeycomb processes data in US regions only, evaluate cross-border transfer requirements. For PII-sensitive deployments, use the OTel Collector path (Path B) with redaction processors. See Data Classification Guide.
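
If you stay on Path A, any redaction has to happen inside the Lambda. One option, shown here only as a sketch, is to replace the `user` and `path` values with keyed hashes before sending. This is an illustration and not the project's redaction logic: `pseudonymize` is a hypothetical helper, the secret would need to come from a secure store, and hashing `path` gives up prefix queries such as `path CONTAINS "confidential"`.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable

SENSITIVE_FIELDS = ("user", "path")  # fields this article classifies as sensitive


def pseudonymize(
    data: Dict[str, Any],
    secret: bytes,
    fields: Iterable[str] = SENSITIVE_FIELDS,
) -> Dict[str, Any]:
    """Return a copy of `data` with sensitive fields replaced by HMAC digests.

    The same input and secret always yield the same token, so GROUP BY
    still works on the pseudonymized values.
    """
    redacted = dict(data)  # shallow copy; the caller's dict is left untouched
    for field in fields:
        if field in redacted:
            digest = hmac.new(
                secret, str(redacted[field]).encode("utf-8"), hashlib.sha256
            )
            # Truncated to 16 hex characters to keep column values short.
            redacted[field] = digest.hexdigest()[:16]
    return redacted
```

A keyed hash (HMAC) is used instead of a plain hash so that tokens cannot be reversed by hashing a list of known usernames without the secret.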

Full criteria: Pipeline SLO Definitions | DLQ Replay Runbook

CloudFormation Templates

| Template | Purpose | Key Parameters |
| --- | --- | --- |
| template.yaml | FSx audit log poller | S3AccessPointArn, HoneycombApiKeySecretArn, HoneycombDataset |
| template-ems.yaml | EMS webhook handler | HoneycombApiKeySecretArn, HoneycombDataset |
| template-fpolicy.yaml | FPolicy EventBridge handler | HoneycombApiKeySecretArn, HoneycombDataset, EventBusName |

Questions about high-cardinality analysis or the Honeycomb integration? Drop a comment below.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations