Ryft Blog

Why Apache Iceberg Finally Unlocks Security Data Lakes

Yuval Yogev
Yuval Yogev
August 6, 2025
8
Mins read
Engineering
Why Apache Iceberg Finally Unlocks Security Data Lakes

Security is actually a Big Data problem. If you could store all the data in the world, query it in sub-second speed, model entities correctly and correlate all data sources, it would be a huge step forward in detecting and preventing cyber attacks. It’s no surprise that literally everyone is trying to build a security data lake. However, security data lakes are notoriously hard to build and operate efficiently.

The nature of security workloads

Security data infrastructure is fundamentally different from classical analytics. It needs to support:

  • Always-on detections
  • Deep investigations
  • Massive data volumes
  • Complex joins across sources
  • Constantly changing schemas
  • Strict compliance & data sovereignty
  • AI-driven enrichment and modeling

Security Teams need to keep years of logs searchable at low cost, while running everything from threat hunting to compliance audits.

Security Vendors need to offer all of the above - at scale, for multiple customers - while maintaining healthy margins and avoiding lock-in.

The first generation of security data lakes usually relied on tools like Splunk, Elasticsearch, or other NoSQL databases. Then, solutions like Snowflake and Databricks made a big step forward in flexibility and architecture, and became popular among security teams. However, all of these proved to be either too expensive to store raw data long-term, too inflexible to support varied workloads, or too tightly coupled to a single vendor. For vendors providing offerings like SIEM, XDR and others, this proved to be a good way to start, but a challenging approach long term, as margins became almost non-existent.

Apache Iceberg changes the equation. It’s an open table format purpose-built for scalable, flexible, and cost-effective analytics - and it’s quickly becoming the new standard for modern cybersecurity data lakes.

Here’s why.

Future-proof infrastructure: open, scalable, efficient

Unlimited retention at lower costs

Storing years of raw security telemetry is often impossible with traditional architectures. Costs skyrocket. Detection and investigation capabilities are reduced with shrinking retention windows. Talking with security teams and vendors, you often hear sentences like:

  • “When we ingested raw CloudTrail logs into Elasticsearch without filtering, we ran out of storage in weeks”
  • “We only retain the EDR alerts, and not the EDR raw events. We know it could be useful for incident response, but we couldn’t find the budget”

Iceberg separates compute from storage (for real this time), and is designed to operate directly on cheap object storage like S3. For folks that get special AWS discounts this is amplified as they can apply those on their storage directly, without a margin from a data warehouse in the middle. Of course, even S3 can become very expensive at large scale, however with Iceberg the ability to control it is higher, from storage format tuning, to tiering and retention.

Better margins for security vendors

For security vendors, whether you're building a detection and response platform, or a SIEM, infrastructure efficiency directly impacts your margins. Iceberg helps by letting you scale compute independently from storage. You can run lightweight engines for real-time detections, and spin up heavier compute only when needed for deep investigations or model training.

This architecture avoids the overhead of always-on clusters and eliminates the need to pass on bloated infrastructure costs to customers. You’re no longer forced to rely on upstream vendors and their pricing models just to make your own offering viable.

No vendor lock-in

Iceberg is fully open - no proprietary formats, no vendor-specific runtimes. You control where it runs (your cloud, your region), and which engines you use to query it. This flexibility is critical for any vendor or security team that wants to stay in control of its data and tooling. Your ability to make experiments, use the newest tools in the market, and at the end of the day make the most of your data, unlocks instantly. More on that in the next section.

Data ownership

When customer data is sensitive, and in security, it always is - you can’t afford to ship it to a black-box SaaS. Iceberg allows for flexible deployments, so you can more easily support air-gapped, on-prem or BYOC deployments. Even more importantly, you no longer have to send your customers’ most sensitive data to a third-party vendor and trust they’ll treat it with the same care you would.

One table format, many engines

Choosing the best tool for the job

Security teams don’t just run one type of workload - they run many. From continuous detections, to point lookups, cross correlation of sources, IOC searches, AI model training and more. If you found a data solution that supports all those in one place, that would be amazing, but until then, Iceberg unlocks the ability to choose the best tool for each workload.

Imagine if you could use the following tools on your authentication logs:

  • ClickHouse for dashboards identifying trends like new user-agents and geo-locations
  • Spark for streaming detections, and enrichment with external data sources like GeoIP databases
  • Trino for IOC sweeps and other ad-hoc queries
  • Graph engines like PuppyGraph for complex detections like impossible travel and lateral movement
  • Python/DuckDB for quick pivots and prototyping

All of these engines can query the same Iceberg tables - directly.

With Iceberg, you don’t need to maintain multiple pipelines or duplicate data across formats. Data is written once and read by any engine, using standard table APIs. That means less infrastructure, less complexity, and far faster iteration across teams.

One Table Format, Many Engines

Security ❤️ AI

Every security team and vendor today is trying to figure out how to 10x their capabilities with AI and LLMs - whether it’s automated triage, summarizing alerts, or querying telemetry with natural language. But if your data is locked inside a vendor platform, you're stuck. You either wait for them to build exactly what you need, or start copying data into yet another system just to make it usable.

With Iceberg, your data is immediately accessible to modern AI tooling. You can point tools like LangChain or LlamaIndex at your tables, or train your own models on raw logs using Spark, Ray or Daft, all without exporting or transforming the data.

Iceberg features that fit cybersecurity workloads

Schema evolution

Security data changes constantly. New sources, new fields, new vendors - and standards like OCSF are evolving quickly.

Iceberg supports field-level schema evolution. You can add, rename, or reorder fields without breaking existing queries or reprocessing old data.

Partition evolution

You can also evolve how tables are partitioned over time. Start with timestamp-based partitions, then add dimensions like event_type or customer_id as access patterns evolve. No table rewrites required.

With advanced optimizations like adaptive partitioning, you could even change your partitioning strategy dynamically. For example: CloudTrail logs might require hourly partitions during peak working hours, but can be switched to daily partitions on weekends.

Time travel and snapshot isolation

Iceberg supports snapshot-based queries, so you can “go back in time” and rerun detection logic exactly as it would have worked during a breach. Or reconstruct the full system state for a post-incident report. Or compare the before-and-after of a dataset modified during an IR process.

Merge support for enrichment and tagging

Security data isn’t static, and you often need to tag historical records (e.g. "this IP was malicious") or apply enrichment after ingestion. Iceberg supports upserts and row-level deletes, so you can safely update your data without full rewrites.

Vendor adoption

Vendors have already identified the opportunity of building security data lakes on top of Apache Iceberg. Amazon Security Lake was announced with native support for Apache Iceberg, which set the standard for other parts of the ecosystem, enabling dozens of partners to ingest data directly in Iceberg format. Current integrations include DataBahn, Monad, SailPoint, Sysdig, Talon, Palo Alto Networks, Securonix and more.

Vendor Adoption

Ingestion tools like Matano, Monad and others, use Iceberg as their storage layer. Crowdstrike, Orca Security and others publicly describe using S3 + Iceberg for their workloads. Panther Labs supports querying Iceberg tables directly - whether sourced from Amazon Security Lake or external Snowflake-hosted Iceberg tables. Query.AI supports Iceberg tables and generates federated Athena queries over them. And Cribl recently launched its own Iceberg based lake product as part of their Cribl Lake offering. Additionally, dozens of security startups which are currently in stealth, are starting their data infrastructure with Iceberg as its foundation from day one.

Just this week, Microsoft has joined the party and announced their Sentinel Data Lake, which started making waves in the industry as well.

Microsoft has joined the party and announced their Sentinel Data Lake, which started making waves in the industry as well.

What’s coming next

Apache Iceberg is evolving fast, and it’s about to get even better for security workloads.

Coming in Iceberg V3

Native support for semi-structured data via the new VARIANT column type. That means queries like userIdentity.sessionContext.sessionIssuer.userName = 'John' can be queried as if userName was a top-level column, with no extra parsing or ingestion logic.

On the horizon

  • Fine-grained access controls
  • Smarter secondary indexing
  • Improved support for real-time + AI-native workloads

Conclusion

The cybersecurity landscape is moving fast, and does not seem to slow down. Security is becoming top priority for organizations, attackers are working harder, and new AI tooling introduces novel approaches for detection and prevention.

Whether you’re building a SIEM, an in-house security platform, or the next AI-native detection engine - Iceberg gives you control, flexibility, and performance that legacy solutions can’t match.

For security teams: it’s your way to finally own your data - with performance, retention, and AI-readiness built in.

For security vendors: it’s your infrastructure advantage - unlocking margins, features, and agility your competitors can’t replicate.

We’re already seeing rapid adoption among the best security teams and vendors, building things that were unimaginable just a few years ago.

Security is a big data problem. And Apache Iceberg feels like what we have been waiting for to build a truly efficient security data lake.

Table of Contents