URL: https://dev.to/aws-builders/ec2-to-serverless-modernizing-fsx-for-ontap-splunk-integration-e8l

EC2 to Serverless: Modernizing FSx for ONTAP Splunk Integration


TL;DR

The existing AWS Blog approach ships FSx for ONTAP audit logs to Splunk via two EC2 instances (syslog-ng + Universal Forwarder). We replaced it with a single Lambda function: same Splunk index, same SPL queries, and roughly 90% lower AWS infrastructure cost.

[Before] FSx for ONTAP → syslog-ng (EC2) → Splunk UF (EC2) → Splunk
 Monthly AWS infra cost: ~$66 (2× t3.medium + EBS)
 Ops burden: OS patching, agent updates, scaling

[After] FSx for ONTAP → S3 Access Point → Lambda → Splunk HEC
 Monthly AWS infra cost: ~$6 (Lambda + EventBridge Scheduler + Secrets Manager)
 Ops burden: No OS patching or agent updates (managed services only)

Important: The 90% cost reduction refers to AWS infrastructure costs only (EC2/Lambda/EventBridge). Splunk platform licensing costs remain unchanged regardless of the delivery method.

This is Part 8 of the Serverless Observability for FSx for ONTAP series.


The Problem with EC2-Based Splunk Integration

The AWS Blog's architecture works, but it comes with operational overhead:

| Concern | EC2-Based | Serverless |
|---|---|---|
| Monthly cost | ~$66 fixed | ~$6 pay-per-use |
| OS patching | Monthly | None |
| Agent updates | Manual (UF + syslog-ng) | None |
| Scaling | Manual instance resize | Automatic (Lambda concurrency) |
| Availability | Single AZ (unless you add redundancy) | Multi-AZ by default |
| Time to deploy | Hours (provision + configure) | 30 minutes (CloudFormation) |

If you're already running this EC2 pattern and want to modernize, this article shows you how — with a parallel deployment strategy that ensures zero data loss during cutover.

Architecture

┌──────────────────────────────────────────────────────────┐
│  FSx for ONTAP                                           │
│                                                          │
│  Audit Volume ──→ S3 Access Point                        │
│                        │                                 │
│                        ▼                                 │
│  EventBridge Scheduler (rate: 5 min)                     │
│                        │                                 │
│                        ▼                                 │
│  Lambda (Python 3.12)                                    │
│    • Reads audit logs via S3 AP                          │
│    • Parses JSON/EVTX                                    │
│    • Formats as Splunk HEC events                        │
│    • Sends with Authorization: Splunk <token>            │
│    • Checkpoints in SSM Parameter Store                  │
│                        │                                 │
│                        ▼                                 │
│  Splunk HEC                                              │
│    https://<splunk>:8088/services/collector/event        │
│    Response: {"text":"Success","code":0}                 │
│                                                          │
│  SPL: index=fsxn_audit sourcetype=fsxn:ontap:audit       │
└──────────────────────────────────────────────────────────┘
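
The polling-and-checkpoint step in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's actual handler: `select_new_objects`, the tuple layout, the object keys, and the use of a last-modified timestamp as the checkpoint are all assumptions. In a real function the listing would come from S3 through the access point, and the checkpoint would be read from and written back to SSM Parameter Store.

```python
from datetime import datetime, timezone


def select_new_objects(objects, checkpoint):
    """Return objects modified after the checkpoint (oldest first) and the new checkpoint.

    objects: iterable of (key, last_modified) tuples with timezone-aware datetimes.
    checkpoint: timezone-aware datetime of the last object already shipped.
    """
    pending = sorted(
        (obj for obj in objects if obj[1] > checkpoint),
        key=lambda obj: obj[1],
    )
    # If nothing is new, the checkpoint stays where it was.
    new_checkpoint = pending[-1][1] if pending else checkpoint
    return pending, new_checkpoint


# Hypothetical listing; keys and timestamps are illustrative only.
checkpoint = datetime(2024, 5, 24, 0, 0, tzinfo=timezone.utc)
listing = [
    ("audit/a.json", datetime(2024, 5, 23, 23, 55, tzinfo=timezone.utc)),
    ("audit/c.json", datetime(2024, 5, 24, 0, 10, tzinfo=timezone.utc)),
    ("audit/b.json", datetime(2024, 5, 24, 0, 5, tzinfo=timezone.utc)),
]
pending, new_checkpoint = select_new_objects(listing, checkpoint)
print([key for key, _ in pending])
print(new_checkpoint.isoformat())
```

Persisting `new_checkpoint` only after Splunk has accepted the batch is what makes a failed run safe to retry: the next invocation simply picks up the same objects again.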

High-Volume Alternative: Firehose Path

For sustained >1000 events/sec, use Kinesis Data Firehose with its built-in Splunk destination:

FSx for ONTAP → S3 AP → Lambda (transform) → Kinesis Data Firehose → Splunk HEC

A separate template-firehose.yaml is provided for this path.
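
On the producer side of this path, the Lambda has to group events into batches before handing them to Firehose. The sketch below shows one way to do that; `to_firehose_batches` is a hypothetical helper, not taken from template-firehose.yaml, and the exact record payload that Firehose's Splunk destination expects is outside its scope. The 500-record cap reflects the PutRecordBatch per-call limit; the per-call size limit is not enforced here.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_batches(events, batch_size=MAX_RECORDS_PER_BATCH):
    """Yield lists of Firehose records ({"Data": bytes}), each at most batch_size long."""
    batch = []
    for event in events:
        batch.append({"Data": json.dumps(event).encode("utf-8")})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield batch


# Hypothetical events, only to show the batch sizes.
events = [{"event_type": "4663", "seq": i} for i in range(1200)]
batches = list(to_firehose_batches(events))
print([len(b) for b in batches])
```

Each batch could then be passed as the `Records` argument of the boto3 Firehose client's `put_record_batch` call, with failed records from the response retried separately.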

Migration Strategy (Zero Data Loss)

Phase 1: Parallel Deployment (Day 1-3)

Deploy the serverless stack alongside the existing EC2 pipeline. Use a separate Splunk index for validation:

aws cloudformation deploy \
 --template-file integrations/splunk-serverless/template.yaml \
 --stack-name fsxn-splunk-integration \
 --parameter-overrides \
 S3AccessPointArn=<S3_AP_ARN> \
 SplunkHecTokenSecretArn=<SECRET_ARN> \
 SplunkHecEndpoint=https://splunk.example.com:8088 \
 S3BucketName=<BUCKET> \
 SplunkIndex=fsxn_audit_serverless \
 --capabilities CAPABILITY_IAM

Compare events between old and new pipelines for 48 hours:

index IN ("fsxn_audit", "fsxn_audit_serverless")
| stats count by index
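
To turn the two counts into a go/no-go signal, a small check like the one below may help. It is a sketch under assumptions: `parity_report` is a hypothetical helper, the 1% tolerance is an arbitrary starting point, and the sample counts are made up.

```python
def parity_report(counts, old_index="fsxn_audit",
                  new_index="fsxn_audit_serverless", tolerance=0.01):
    """Compare per-index event counts and flag drift above the tolerance."""
    old = counts.get(old_index, 0)
    new = counts.get(new_index, 0)
    # With no baseline events there is nothing to compare against.
    drift = abs(new - old) / old if old else None
    return {
        "old": old,
        "new": new,
        "drift": drift,
        "ok": drift is not None and drift <= tolerance,
    }


# Hypothetical counts, as the stats search above might return them.
report = parity_report({"fsxn_audit": 120000, "fsxn_audit_serverless": 119400})
print(report)
```

Because the serverless path delivers on a polling interval, comparing a closed time window that ends at least one interval in the past avoids counting events that are merely still in flight.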

Phase 2: Cutover (Day 4-5)

Once event parity is confirmed:

  1. Update the stack to use the production index (fsxn_audit)
  2. Stop the syslog-ng and UF services on EC2 (don't terminate yet)
  3. Monitor for 24 hours

Phase 3: Cleanup (Day 7+)

# Terminate EC2 instances
# Remove security groups, IAM roles, EBS volumes
# Delete old CloudFormation/Terraform resources

What Changes for Splunk Users

Unchanged ✅

  • Index name and sourcetype (configurable)
  • SPL queries — same field names
  • Dashboards and saved searches
  • Alert rules

Changed ⚠️

  • host field: EC2 hostname → SVM name
  • source field: syslog path → fsxn-observability
  • Delivery latency: near-real-time (syslog) → polling interval (default 5 min)

HEC Event Format

{"time":1716508800,"host":"svm-prod-01","source":"fsxn-observability","sourcetype":"fsxn:ontap:audit","index":"fsxn_audit","event":{"event_type":"4663","user":"admin@corp.local","operation":"ReadData","path":"/vol/data/report.pdf","result":"Success","client_ip":"10.0.1.50"}}
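
A minimal sketch of how a parsed audit record could be wrapped into this envelope and prepared for HEC is shown below. The function names `to_hec_event` and `build_hec_request`, the endpoint, and the token are illustrative assumptions, not the project's code; only the envelope fields and the `Authorization: Splunk <token>` header come from the format above.

```python
import json
import urllib.request


def to_hec_event(record, epoch_time, svm_name, index="fsxn_audit"):
    """Wrap one parsed audit record in the HEC event envelope."""
    return {
        "time": epoch_time,
        "host": svm_name,
        "source": "fsxn-observability",
        "sourcetype": "fsxn:ontap:audit",
        "index": index,
        "event": record,
    }


def build_hec_request(endpoint, token, events):
    """Build (but do not send) a POST request carrying one or more HEC events."""
    # HEC accepts several event objects concatenated in a single request body.
    body = "\n".join(json.dumps(e, separators=(",", ":")) for e in events)
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/services/collector/event",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
    )


record = {
    "event_type": "4663",
    "user": "admin@corp.local",
    "operation": "ReadData",
    "path": "/vol/data/report.pdf",
    "result": "Success",
    "client_ip": "10.0.1.50",
}
event = to_hec_event(record, 1716508800, "svm-prod-01")
# Placeholder endpoint and token; real values would come from stack parameters and Secrets Manager.
request = build_hec_request("https://splunk.example.com:8088", "example-token", [event])
print(request.full_url)
print(request.data.decode("utf-8"))
```

Sending is then a matter of `urllib.request.urlopen(request, timeout=...)` and checking for `"code":0` in the response before advancing the checkpoint.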

SPL Query Examples

# Failed access attempts
index=fsxn_audit sourcetype=fsxn:ontap:audit result=Failure
| stats count by user, path
| sort -count

# Operations timeline
index=fsxn_audit sourcetype=fsxn:ontap:audit
| timechart span=5m count by operation

# Top users
index=fsxn_audit sourcetype=fsxn:ontap:audit
| stats count by user
| sort -count
| head 20

# Specific user investigation
index=fsxn_audit sourcetype=fsxn:ontap:audit user="admin@corp.local"
| table _time, operation, path, result, client_ip

Cost Comparison

| Component | EC2-Based (monthly) | Serverless (monthly) | Savings |
|---|---|---|---|
| EC2 instances (2× t3.medium) | $60 | $0 | 100% |
| EBS volumes (2× 20 GB) | $6 | $0 | 100% |
| Lambda | $0 | ~$5 | n/a |
| EventBridge Scheduler | $0 | ~$0.01 | n/a |
| Secrets Manager | $0 | ~$0.40 | n/a |
| Total | $66 | $6 | 91% |

Note: EC2 cost assumes 2× t3.medium (as per the AWS Blog reference architecture). Actual EC2 costs vary by instance type and region. Splunk Cloud licensing costs are contract-dependent and may differ significantly from list pricing.
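
The savings figure can be reproduced from the line items. The snippet below is just the arithmetic, using the approximate monthly amounts from the table; the table's $6 total is the serverless sum rounded up, which is why the rounded figure works out to (66 - 6) / 66 ≈ 91%.

```python
# Approximate monthly figures in USD, taken from the cost table.
ec2_monthly = {
    "EC2 instances (2x t3.medium)": 60.00,
    "EBS volumes (2x 20 GB)": 6.00,
}
serverless_monthly = {
    "Lambda": 5.00,
    "EventBridge Scheduler": 0.01,
    "Secrets Manager": 0.40,
}

before = sum(ec2_monthly.values())
after = sum(serverless_monthly.values())
savings_pct = (before - after) / before * 100
rounded_savings_pct = (before - 6.00) / before * 100  # using the table's rounded $6 total

print(f"before=${before:.2f} after=${after:.2f}")
print(f"savings={savings_pct:.1f}% (unrounded), {rounded_savings_pct:.1f}% (rounded total)")
```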

Network Considerations

| Splunk Deployment | Lambda Config | Notes |
|---|---|---|
| Splunk Cloud (public HEC) | Lambda outside VPC | Simplest |
| Splunk Enterprise (private VPC) | Lambda in VPC + NAT | Same VPC as Splunk |
| Splunk Cloud (PrivateLink) | Lambda in VPC + VPC Endpoint | Most secure |

⚠️ VerifySSL: Set to true in production. Only use false for self-signed certs in dev environments.
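
In Python, a VerifySSL-style flag typically ends up controlling the TLS context used for the HEC connection. The sketch below shows one plausible mapping with the standard library; `build_ssl_context` and the way the flag reaches it are assumptions, not the project's implementation.

```python
import ssl


def build_ssl_context(verify_ssl=True):
    """Return a TLS context; certificate checks are relaxed only when verify_ssl is False."""
    context = ssl.create_default_context()
    if not verify_ssl:
        # Dev only: accept self-signed certificates.
        # check_hostname must be turned off before verify_mode can be set to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


prod_context = build_ssl_context(verify_ssl=True)
dev_context = build_ssl_context(verify_ssl=False)
print(prod_context.verify_mode, prod_context.check_hostname)
print(dev_context.verify_mode, dev_context.check_hostname)
```

The resulting context can be passed to `urllib.request.urlopen(..., context=...)`. For a Splunk Enterprise instance with a private CA, loading that CA into the context keeps verification on and is usually preferable to disabling it.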

Rollback Plan

If issues are discovered after cutover:

  1. Start the stopped EC2 instances (syslog-ng + UF)
  2. Verify syslog-ng is receiving events
  3. Delete the serverless CloudFormation stack
  4. Investigate and resolve before re-attempting

The serverless Lambda uses checkpointing — no events are lost during the overlap period (brief duplicates are possible).

What's Next

  • Firehose path: For high-volume logs (>1000 events/sec), use template-firehose.yaml
  • HEC Acknowledgment (useACK): For Level 2+, enable HEC indexer acknowledgment to guarantee at-least-once delivery. The Lambda waits for the ack before advancing the checkpoint.
  • CIM compliance: Map fields to Splunk's Common Information Model (Authentication or Change data model) for compatibility with Splunk Enterprise Security correlation searches
  • Index pre-creation: The fsxn_audit index must be created before first ingestion (Splunk Cloud: Admin Console; Enterprise: indexes.conf)
  • EMS webhooks: Real-time Autonomous Ransomware Protection (ARP) detection alerts
  • FPolicy: Sub-second file operation streaming
  • Production Readiness: Progress from Level 1 (this Quick Start) to Level 4 (Enterprise) — see the Pipeline SLO Definitions
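
For the HEC Acknowledgment item, the gating logic might look like the sketch below. With useACK enabled, each HEC POST returns an `ackId`, and the client later asks the `/services/collector/ack` endpoint about a list of IDs; the response maps each ID to true or false. The helpers `all_acked` and `next_checkpoint` and the checkpoint strings are hypothetical, and the HTTP calls themselves are omitted.

```python
def all_acked(ack_response, ack_ids):
    """True only if every requested ackId is reported as indexed."""
    statuses = ack_response.get("acks", {})
    # JSON object keys are strings, so ackIds are looked up by their string form.
    return bool(ack_ids) and all(statuses.get(str(a)) is True for a in ack_ids)


def next_checkpoint(current, candidate, ack_response, ack_ids):
    """Advance to the candidate checkpoint only once every event is acknowledged."""
    return candidate if all_acked(ack_response, ack_ids) else current


# Hypothetical ack query results for three batches.
partial = {"acks": {"0": True, "1": True, "2": False}}
complete = {"acks": {"0": True, "1": True, "2": True}}
print(next_checkpoint("2024-05-24T00:00:00Z", "2024-05-24T00:05:00Z", partial, [0, 1, 2]))
print(next_checkpoint("2024-05-24T00:00:00Z", "2024-05-24T00:05:00Z", complete, [0, 1, 2]))
```

Holding the checkpoint back on a partial ack means the unacknowledged window is re-sent on the next run, which is where the at-least-once guarantee (and the occasional duplicate) comes from.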

Production Readiness

This integration follows the project's Production Readiness Levels:

| Level | What You Get | Go/No-Go to Next |
|---|---|---|
| Level 1 (this Quick Start) | Audit poller + DLQ | Logs arrive, checkpoint advances, DLQ empty 24h |
| Level 2 | + Splunk dashboards + alerts | SLOs met 7 days, security review done |
| Level 3 | + DynamoDB ledger + poison-pill | SLOs met 30 days, compliance pack |
| Level 4 | + OTel Collector + redaction | Multi-backend, PII redaction, DR tested |

Data classification: Splunk receives user and path fields (PII/sensitive). For Splunk Cloud, data is processed in the vendor's infrastructure. For self-hosted Splunk Enterprise, data stays in your VPC. See Data Classification Guide for field-by-field PII classification and handling patterns.

Full criteria: Pipeline SLO Definitions | DLQ Replay Runbook


Questions about the Splunk migration or serverless HEC delivery? Drop a comment below.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations