EC2 to Serverless: Modernizing FSx for ONTAP Splunk Integration

TL;DR

The existing AWS Blog approach ships FSx for ONTAP audit logs to Splunk via two EC2 instances (syslog-ng + Universal Forwarder). We replaced it with a single Lambda function that keeps the same Splunk index and the same SPL queries while cutting AWS infrastructure cost by about 90%.

[Before] FSx for ONTAP → syslog-ng (EC2) → Splunk UF (EC2) → Splunk
 Monthly AWS infra cost: ~$66 (2× t3.medium + EBS)
 Ops burden: OS patching, agent updates, scaling

[After] FSx for ONTAP → S3 Access Point → Lambda → Splunk HEC
 Monthly AWS infra cost: ~$6 (Lambda + EventBridge + Secrets Manager)
 Ops burden: No servers to patch or resize (managed services only)

Important: The 90% cost reduction refers to AWS infrastructure costs only (EC2/Lambda/EventBridge). Splunk platform licensing costs remain unchanged regardless of the delivery method.

This is Part 8 of the Serverless Observability for FSx for ONTAP series.

The Problem with EC2-Based Splunk Integration

The AWS Blog's architecture works, but it comes with operational overhead:

Concern	EC2-Based	Serverless
Monthly cost	~$66 fixed	~$6 pay-per-use
OS patching	Monthly	None
Agent updates	Manual (UF + syslog-ng)	None
Scaling	Manual instance resize	Automatic (Lambda concurrency)
Availability	Single AZ (unless you add redundancy)	Multi-AZ by default
Time to deploy	Hours (provision + configure)	30 minutes (CloudFormation)

If you're already running this EC2 pattern and want to modernize, this article shows you how — with a parallel deployment strategy that ensures zero data loss during cutover.

Architecture

┌──────────────────────────────────────────────────────────┐
│ FSx for ONTAP │
│ │
│ Audit Volume ──→ S3 Access Point │
│ │ │
│ ▼ │
│ EventBridge Scheduler (rate: 5 min) │
│ │ │
│ ▼ │
│ Lambda (Python 3.12) │
│ • Reads audit logs via S3 AP │
│ • Parses JSON/EVTX │
│ • Formats as Splunk HEC events │
│ • Sends with Authorization: Splunk <token> │
│ • Checkpoints in SSM Parameter Store │
│ │ │
│ ▼ │
│ Splunk HEC │
│ https://<splunk>:8088/services/collector/event │
│ Response: {"text":"Success","code":0} │
│ │
│ SPL: index=fsxn_audit sourcetype=fsxn:ontap:audit │
└──────────────────────────────────────────────────────────┘
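The delivery step in the diagram can be sketched as follows. This is a minimal illustration rather than the project's actual Lambda code: the function names `build_hec_request` and `hec_accepted` are hypothetical, and the sketch only builds the request and interprets the response, leaving the network call to the caller.

```python
import json
import urllib.request


def build_hec_request(endpoint: str, token: str, events: list[dict]) -> urllib.request.Request:
    """Build one POST to the HEC event endpoint carrying a batch of events.

    HEC accepts several event objects concatenated in a single request body,
    so the batch is serialized as one JSON object per line.
    """
    body = "\n".join(json.dumps(e, separators=(",", ":")) for e in events).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/services/collector/event",
        data=body,
        method="POST",
        headers={
            # HEC token auth uses the "Splunk <token>" scheme shown in the diagram.
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
    )


def hec_accepted(response_body: bytes) -> bool:
    """Return True only for the success response, i.e. {"text":"Success","code":0}."""
    return json.loads(response_body).get("code") == 0
```

In a handler, the request would be passed to `urllib.request.urlopen` and the checkpoint advanced only when `hec_accepted` returns True for the response body.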

High-Volume Alternative: Firehose Path

For sustained >1000 events/sec, use Kinesis Data Firehose with its built-in Splunk destination:

FSx for ONTAP → S3 AP → Lambda (transform) → Kinesis Data Firehose → Splunk HEC

A separate template-firehose.yaml is provided for this path.
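As a rough sketch of the Lambda side of this path, the function below groups events into batches sized for Firehose's `PutRecordBatch` call, which accepts at most 500 records per request. The helper name `to_firehose_batches` is hypothetical, and the sketch ignores the separate per-call size limit, so large events would need an additional byte-size check.

```python
import json
from typing import Iterator

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch per-call record limit


def to_firehose_batches(
    events: list[dict], batch_size: int = MAX_RECORDS_PER_BATCH
) -> Iterator[list[dict]]:
    """Yield lists of Firehose records, each list small enough for one call."""
    # Each record carries one newline-terminated JSON document as bytes.
    records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

Each batch would then be passed as `Records` to the boto3 Firehose client's `put_record_batch`, and `FailedPutCount` in the response checked so that rejected records can be retried. Whether each record should be a full HEC event object or raw text depends on the HEC endpoint type configured on the delivery stream.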

Migration Strategy (Zero Data Loss)

Phase 1: Parallel Deployment (Day 1-3)

Deploy the serverless stack alongside the existing EC2 pipeline. Use a separate Splunk index for validation:

aws cloudformation deploy \
 --template-file integrations/splunk-serverless/template.yaml \
 --stack-name fsxn-splunk-integration \
 --parameter-overrides \
 S3AccessPointArn=<S3_AP_ARN> \
 SplunkHecTokenSecretArn=<SECRET_ARN> \
 SplunkHecEndpoint=https://splunk.example.com:8088 \
 S3BucketName=<BUCKET> \
 SplunkIndex=fsxn_audit_serverless \
 --capabilities CAPABILITY_IAM

Compare events between old and new pipelines for 48 hours:

index IN ("fsxn_audit", "fsxn_audit_serverless")
| stats count by index
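To turn the two per-index counts into a go/no-go signal, a small check like the one below may help. It is a sketch only: the function name `parity_report` and the 1% tolerance are assumptions, not values from the project. Because the serverless path polls every 5 minutes, the comparison window should end a little in the past so that the newest events have had time to arrive in both indexes.

```python
def parity_report(
    counts: dict[str, int],
    baseline: str = "fsxn_audit",
    candidate: str = "fsxn_audit_serverless",
    tolerance: float = 0.01,  # assumed threshold: 1% relative difference
) -> dict:
    """Compare event counts from the old (baseline) and new (candidate) index."""
    old = counts.get(baseline, 0)
    new = counts.get(candidate, 0)
    diff = new - old
    if old:
        ratio = abs(diff) / old
    else:
        # With no baseline events, only an empty candidate counts as parity.
        ratio = 0.0 if new == 0 else float("inf")
    return {
        "baseline": old,
        "candidate": new,
        "diff": diff,
        "within_tolerance": ratio <= tolerance,
    }
```

A negative `diff` means the serverless pipeline is missing events, while a small positive one is more likely to be duplicates from a retried batch.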

Phase 2: Cutover (Day 4-5)

Once event parity is confirmed:

Update the stack to use the production index (fsxn_audit)
Stop the syslog-ng and UF services on EC2 (don't terminate yet)
Monitor for 24 hours

Phase 3: Cleanup (Day 7+)

# Terminate EC2 instances
# Remove security groups, IAM roles, EBS volumes
# Delete old CloudFormation/Terraform resources

What Changes for Splunk Users

Unchanged ✅

Index name and sourcetype (configurable)
SPL queries — same field names
Dashboards and saved searches
Alert rules

Changed ⚠️

host field: EC2 hostname → SVM name
source field: syslog path → fsxn-observability
Delivery latency: near-real-time (syslog) → polling interval (default 5 min)

HEC Event Format

{"time":1716508800,"host":"svm-prod-01","source":"fsxn-observability","sourcetype":"fsxn:ontap:audit","index":"fsxn_audit","event":{"event_type":"4663","user":"admin@corp.local","operation":"ReadData","path":"/vol/data/report.pdf","result":"Success","client_ip":"10.0.1.50"}}
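A minimal sketch of how a parsed audit record could be wrapped into this envelope is shown below. The input shape is an assumption: the `timestamp` key holding an ISO 8601 string and the function name `to_hec_event` are hypothetical, and the real parser may expose different field names.

```python
from datetime import datetime, timezone


def to_hec_event(record: dict, svm_name: str, index: str = "fsxn_audit") -> dict:
    """Wrap one parsed audit record in the HEC envelope shown above."""
    # Assumption: the parser supplies an ISO 8601 "timestamp" field.
    ts = datetime.fromisoformat(record["timestamp"])
    if ts.tzinfo is None:
        # Treat naive timestamps as UTC rather than the Lambda's local time.
        ts = ts.replace(tzinfo=timezone.utc)
    payload = {k: v for k, v in record.items() if k != "timestamp"}
    return {
        "time": int(ts.timestamp()),        # epoch seconds
        "host": svm_name,                   # SVM name replaces the EC2 hostname
        "source": "fsxn-observability",
        "sourcetype": "fsxn:ontap:audit",
        "index": index,
        "event": payload,
    }
```

Setting `time` explicitly matters here because events are delivered on a polling interval. Without it, Splunk would stamp events with the time of ingestion rather than the time of the file operation.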

SPL Query Examples

# Failed access attempts
index=fsxn_audit sourcetype=fsxn:ontap:audit result=Failure
| stats count by user, path
| sort -count

# Operations timeline
index=fsxn_audit sourcetype=fsxn:ontap:audit
| timechart span=5m count by operation

# Top users
index=fsxn_audit sourcetype=fsxn:ontap:audit
| stats count by user
| sort -count
| head 20

# Specific user investigation
index=fsxn_audit sourcetype=fsxn:ontap:audit user="admin@corp.local"
| table _time, operation, path, result, client_ip

Cost Comparison

Component	EC2-Based (monthly)	Serverless (monthly)	Savings
EC2 instances (2× t3.medium)	$60	$0	100%
EBS volumes (2× 20GB)	$6	$0	100%
Lambda	$0	~$5	—
EventBridge Scheduler	$0	~$0.01	—
Secrets Manager	$0	~$0.40	—
Total	$66	~$6	~91%

Note: EC2 cost assumes 2× t3.medium (as per the AWS Blog reference architecture). Actual EC2 costs vary by instance type and region. Splunk Cloud licensing costs are contract-dependent and may differ significantly from list pricing.
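The savings figure follows directly from the table. The short calculation below uses only the monthly amounts listed above and shows why the result lands between roughly 91% and 92%, depending on whether the serverless total is rounded up to $6.

```python
# Monthly figures copied from the cost comparison table (USD).
ec2_monthly = {
    "EC2 instances (2x t3.medium)": 60.00,
    "EBS volumes (2x 20GB)": 6.00,
}
serverless_monthly = {
    "Lambda": 5.00,
    "EventBridge Scheduler": 0.01,
    "Secrets Manager": 0.40,
}


def savings_percent(before: float, after: float) -> float:
    """Relative cost reduction, expressed as a percentage of the old cost."""
    return (before - after) / before * 100


ec2_total = sum(ec2_monthly.values())
serverless_total = sum(serverless_monthly.values())

itemized = savings_percent(ec2_total, serverless_total)  # itemized total, about 91.8
rounded = savings_percent(66, 6)                         # rounded up to $6, about 90.9

print(f"EC2 ${ec2_total:.2f} vs serverless ${serverless_total:.2f}")
print(f"Savings: {itemized:.1f}% itemized, {rounded:.1f}% with the rounded total")
```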

Network Considerations

Splunk Deployment	Lambda Config	Notes
Splunk Cloud (public HEC)	Lambda outside VPC	Simplest
Splunk Enterprise (private VPC)	Lambda in VPC + NAT	Same VPC as Splunk
Splunk Cloud (PrivateLink)	Lambda in VPC + VPC Endpoint	Most secure

⚠️ VerifySSL: Set to true in production. Only use false for self-signed certs in dev environments.
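In Python, a VerifySSL-style switch typically maps onto the TLS context used for the HEC connection. The sketch below assumes the setting reaches the code as a boolean. The function name `make_ssl_context` is hypothetical and is not taken from the project's Lambda.

```python
import ssl


def make_ssl_context(verify_ssl: bool = True) -> ssl.SSLContext:
    """Return a TLS context for the HEC connection."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    if not verify_ssl:
        # Dev only: hostname checking must be disabled before verify_mode
        # can be set to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

For a private CA, a safer alternative to disabling verification is to keep it enabled and load the CA bundle, for example with `ssl.create_default_context(cafile=...)`.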

Rollback Plan

If issues are discovered after cutover:

Start the stopped EC2 instances (syslog-ng + UF)
Verify syslog-ng is receiving events
Delete the serverless CloudFormation stack
Investigate and resolve before re-attempting

The serverless Lambda uses checkpointing — no events are lost during the overlap period (brief duplicates are possible).
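The "no loss, possible duplicates" behavior follows from advancing the checkpoint only after a successful delivery. The sketch below illustrates that ordering under two assumptions: object keys sort in chronological order, and the function name `process_since_checkpoint` is hypothetical. In the real Lambda the checkpoint value would be read from and written to SSM Parameter Store rather than passed in and returned.

```python
from typing import Callable, Iterable


def process_since_checkpoint(
    keys: Iterable[str],
    checkpoint: str,
    deliver: Callable[[str], bool],
) -> str:
    """Deliver every key newer than the checkpoint and return the new checkpoint."""
    for key in sorted(k for k in keys if k > checkpoint):
        if not deliver(key):
            # Stop at the first failure so the next run retries from here.
            break
        # Advance only after delivery succeeded.
        checkpoint = key
    return checkpoint
```

If a run is interrupted after `deliver` succeeds but before the new checkpoint is persisted, the next run sends that object again. That is where the brief duplicates come from, and it is also why nothing is skipped.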

What's Next

Firehose path: For high-volume logs (>1000 events/sec), use template-firehose.yaml
HEC Acknowledgment (useACK): For Level 2+, enable HEC indexer acknowledgment so that delivery is at-least-once: the Lambda waits for the ack before advancing its checkpoint.
CIM compliance: Map fields to Splunk's Common Information Model (Authentication or Change data model) for compatibility with Splunk Enterprise Security correlation searches
Index pre-creation: The fsxn_audit index must be created before first ingestion (Splunk Cloud: Admin Console; Enterprise: indexes.conf)
EMS webhooks: Real-time ARP ransomware detection alerts
FPolicy: Sub-second file operation streaming
Production Readiness: Progress from Level 1 (this Quick Start) to Level 4 (Enterprise) — see the Pipeline SLO Definitions
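For the HEC acknowledgment item above, the decision to advance the checkpoint could look roughly like this. With indexer acknowledgment enabled, each accepted request returns an `ackId`, and the client later polls the ack endpoint with a body of the form `{"acks": [ids]}`. The response maps each ID, as a string key, to a boolean. The helper names below are hypothetical, and the sketch covers only building the query and interpreting the response, not the HTTP calls or the required channel header.

```python
import json


def build_ack_query(ack_ids: list[int]) -> bytes:
    """Body for the ack status request: the IDs whose status is being asked for."""
    return json.dumps({"acks": ack_ids}).encode("utf-8")


def all_acked(ack_response_body: bytes, ack_ids: list[int]) -> bool:
    """True only if every pending ackId is reported as indexed."""
    acks = json.loads(ack_response_body).get("acks", {})
    # An empty ID list means nothing was sent, so there is nothing to confirm.
    return bool(ack_ids) and all(acks.get(str(i)) is True for i in ack_ids)
```

The checkpoint would be advanced only once `all_acked` returns True for every request in the batch. Until then the Lambda keeps polling or lets the next scheduled run retry.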

Production Readiness

This integration follows the project's Production Readiness Levels:

Level	What You Get	Go/No-Go to Next
Level 1 (this Quick Start)	Audit poller + DLQ	Logs arrive, checkpoint advances, DLQ empty 24h
Level 2	+ Splunk dashboards + alerts	SLOs met 7 days, security review done
Level 3	+ DynamoDB ledger + poison-pill	SLOs met 30 days, compliance pack
Level 4	+ OTel Collector + redaction	Multi-backend, PII redaction, DR tested

Data classification: Splunk receives user and path fields (PII/sensitive). For Splunk Cloud, data is processed in the vendor's infrastructure. For self-hosted Splunk Enterprise, data stays in your VPC. See Data Classification Guide for field-by-field PII classification and handling patterns.

Full criteria: Pipeline SLO Definitions | DLQ Replay Runbook

Resources

Series Navigation

Part 1: Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2
Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
Part 5: Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP
Part 6: Direct-to-Grafana: Shipping FSx for ONTAP Logs to Grafana Cloud Loki via OTLP Gateway
Part 7: Ship FSx for ONTAP Audit Logs to New Relic via Serverless Lambda Pipeline
Part 8: EC2 to Serverless: Modernizing Splunk Integration (this post)
Part 9: Data Sovereignty: FSx for ONTAP Logs in Your VPC with Elastic
Part 10: High-Cardinality File Access Analysis with Honeycomb + OTel
Part 11: AI-Powered Root Cause: Correlating File Access with APM via Dynatrace
Part 12: FSx for ONTAP Audit Logs with Data Residency in your region with Sumo Logic
Part 13: 9 Vendors, One Architecture

Questions about the Splunk migration or serverless HEC delivery? Drop a comment below.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

URL: https://dev.to/aws-builders/ec2-to-serverless-modernizing-fsx-for-ontap-splunk-integration-e8l
