High performant graph queries on Apache Iceberg powered by Ryft and PuppyGraph

Preface
Graph workloads traditionally rely on specialized graph storage systems, but these come with significant challenges in scalability, performance, and data duplication. By combining the storage optimization capabilities of Ryft on Apache Iceberg with PuppyGraph’s advanced graph query engine, teams can run high-performance, scalable graph queries directly on their Iceberg data lake - without moving or duplicating data.
Introduction: Graph queries performance and scale
Graph queries are great for uncovering the relationships that drive real-world complexity, whether you're identifying fraud rings, mapping attack paths in a cybersecurity breach, or understanding connections in a supply chain network. These workloads demand fast, multi-hop traversal across massive datasets, often with complex patterns and evolving schemas.
Traditionally, running these queries required purpose-built graph databases. But those systems come with tradeoffs: data duplication, limited scalability, and ongoing operational overhead. A better approach is emerging: running graph workloads directly on scalable, optimized lakehouse data based on Apache Iceberg.
Apache Iceberg as the Foundation for Large-Scale Graph Analytics
Apache Iceberg decouples compute from storage while providing advanced table semantics such as hidden partitioning, schema evolution, and rich metadata statistics. These features, originally designed for analytics, also enable powerful graph workloads without requiring data duplication.
By using Iceberg tables as the storage layer for graph workloads, graph engines can map tables to nodes and edges and leverage Iceberg metadata for efficient query planning. That means multi-hop traversals, pattern matching, and other complex graph queries can all run on the same Parquet files you already use, no migration required.
The key is to optimize the data layout so pruning is efficient at every level from partitions down to row groups, enabling each graph hop to become a highly targeted scan. Partition keys and sort columns should align with graph filters and match conditions to maximize query efficiency.
Graph Query Acceleration with PuppyGraph
PuppyGraph brings graph semantics to your existing Apache Iceberg tables, no separate graph database required. It lets teams model, traverse, and query relationships across their datasets without data duplication or complex ETL and support.
PuppyGraph supports open graph query languages like openCypher and Gremlin, so data engineers can express complex graph logic, such as multi-hop traversals, PageRank, and pattern matching, using familiar syntax. Because it operates directly on Iceberg tables, it integrates cleanly with the rest of your analytics stack.
Under the hood, PuppyGraph is built for performance by using columnar scanning, vectorized execution, and predicate pushdown to minimize I/O and accelerate query times. And with a distributed compute engine, it scales to handle petabyte-level, interconnected datasets, executing deep graph queries over billions of records in seconds.

Storage Optimization at Scale with Ryft
Ryft automates the heavy lifting of maintaining and optimizing Apache Iceberg tables, eliminating operational bottlenecks and keeping data consistently fast and efficient. By handling data compaction, snapshot expiration, and rewrite operations, Ryft ensures that tables stay compact, clean, and optimized for high-performance analytics.
A well-compacted and optimized table layout dramatically reduces data fragmentation and the number of small files, enabling query engines to scan less data and complete queries faster. For graph workloads, this translates directly into quicker multi-hop traversals and faster pattern matching, since PuppyGraph can leverage Iceberg’s optimized structure to push down filters and skip unnecessary data blocks with minimal I/O overhead.
In addition, Ryft uses adaptive optimization to automatically tune compaction, compression, file sizes and more, all based on actual query and ingestion patterns. For example, if your queries frequently filter specific columns, Ryft can adjust files layouts and properties to improve scan performance. This workload-aware approach ensures that the table layout evolves to match real access patterns over time, without requiring manual tuning

Better Together: Unlocking High-Performance Graph Queries

High-performance graph queries rely on two things: an optimized storage layer that minimizes scans, and an engine that can traverse complex relationships efficiently. Ryft and PuppyGraph combine these strengths by using optimized Apache Iceberg tables with a graph engine built to run directly on data lake storage.
Ryft keeps Iceberg tables compact and partitioned to match real query patterns, reducing file counts and scan volumes. PuppyGraph adds a graph abstraction that runs traversals and pattern matching directly on the lake, eliminating the need for separate graph databases or duplicated datasets.
This approach enables fast, scalable graph workloads on your existing data stack - no new pipelines, no extra data copies, and minimal operational overhead.
For more information, visit Ryft & PuppyGraph websites.