News

High performant graph queries on Apache Iceberg powered by Ryft and PuppyGraph

Yossi Reitblat

July 21, 2025

Preface

Graph workloads traditionally rely on specialized graph storage systems, but these come with significant challenges in scalability, performance, and data duplication. By combining the storage optimization capabilities of Ryft on Apache Iceberg with PuppyGraph’s advanced graph query engine, teams can run high-performance, scalable graph queries directly on their Iceberg data lake - without moving or duplicating data.

Introduction: Graph queries performance and scale

Graph queries are great for uncovering the relationships that drive real-world complexity, whether you're identifying fraud rings, mapping attack paths in a cybersecurity breach, or understanding connections in a supply chain network. These workloads demand fast, multi-hop traversal across massive datasets, often with complex patterns and evolving schemas.

Traditionally, running these queries required purpose-built graph databases. But those systems come with tradeoffs: data duplication, limited scalability, and ongoing operational overhead. A better approach is emerging: running graph workloads directly on scalable, optimized lakehouse data based on Apache Iceberg.

Apache Iceberg as the Foundation for Large-Scale Graph Analytics

Apache Iceberg decouples compute from storage while providing advanced table semantics such as hidden partitioning, schema evolution, and rich metadata statistics. These features, originally designed for analytics, also enable powerful graph workloads without requiring data duplication.

By using Iceberg tables as the storage layer for graph workloads, graph engines can map tables to nodes and edges and leverage Iceberg metadata for efficient query planning. That means multi-hop traversals, pattern matching, and other complex graph queries can all run on the same Parquet files you already use, no migration required.

The key is to optimize the data layout so pruning is efficient at every level from partitions down to row groups, enabling each graph hop to become a highly targeted scan. Partition keys and sort columns should align with graph filters and match conditions to maximize query efficiency.

Graph Query Acceleration with PuppyGraph

PuppyGraph brings graph semantics to your existing Apache Iceberg tables, no separate graph database required. It lets teams model, traverse, and query relationships across their datasets without data duplication or complex ETL and support.

PuppyGraph supports open graph query languages like openCypher and Gremlin, so data engineers can express complex graph logic, such as multi-hop traversals, PageRank, and pattern matching, using familiar syntax. Because it operates directly on Iceberg tables, it integrates cleanly with the rest of your analytics stack.

Under the hood, PuppyGraph is built for performance by using columnar scanning, vectorized execution, and predicate pushdown to minimize I/O and accelerate query times. And with a distributed compute engine, it scales to handle petabyte-level, interconnected datasets, executing deep graph queries over billions of records in seconds.

Figure: PuppyGraph Query & Visualization Engine UI

‍

Storage Optimization at Scale with Ryft

Ryft automates the heavy lifting of maintaining and optimizing Apache Iceberg tables, eliminating operational bottlenecks and keeping data consistently fast and efficient. By handling data compaction, snapshot expiration, and rewrite operations, Ryft ensures that tables stay compact, clean, and optimized for high-performance analytics.

A well-compacted and optimized table layout dramatically reduces data fragmentation and the number of small files, enabling query engines to scan less data and complete queries faster. For graph workloads, this translates directly into quicker multi-hop traversals and faster pattern matching, since PuppyGraph can leverage Iceberg’s optimized structure to push down filters and skip unnecessary data blocks with minimal I/O overhead.

In addition, Ryft uses adaptive optimization to automatically tune compaction, compression, file sizes and more, all based on actual query and ingestion patterns. For example, if your queries frequently filter specific columns, Ryft can adjust files layouts and properties to improve scan performance. This workload-aware approach ensures that the table layout evolves to match real access patterns over time, without requiring manual tuning

Better Together: Unlocking High-Performance Graph Queries

High-performance graph queries rely on two things: an optimized storage layer that minimizes scans, and an engine that can traverse complex relationships efficiently. Ryft and PuppyGraph combine these strengths by using optimized Apache Iceberg tables with a graph engine built to run directly on data lake storage.

Ryft keeps Iceberg tables compact and partitioned to match real query patterns, reducing file counts and scan volumes. PuppyGraph adds a graph abstraction that runs traversals and pattern matching directly on the lake, eliminating the need for separate graph databases or duplicated datasets.

This approach enables fast, scalable graph workloads on your existing data stack - no new pipelines, no extra data copies, and minimal operational overhead.

‍

For more information, visit Ryft & PuppyGraph websites.

Table of Contents

Example H2

Example H3

Get the latest posts straight to your inbox

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

Browse other blogs

News

Announcing Ryft Data Retention & Compliance Enforcement for Apache Iceberg

Today, we’re introducing two new capabilities in Ryft: Automated Data Retention and Data Compliance Enforcement for Apache Iceberg™. These features integrate directly into the Ryft platform to ensure efficient, policy-driven data deletion and compliance, working seamlessly alongside table maintenance and optimization.

Yuval Yogev

December 8, 2025

News

Announcing Ryft Adaptive Optimization

Today, we’re officially introducing Ryft Adaptive Optimization - always-on, dynamic optimization engine for Apache Iceberg™. Our engine continuously compacts, rewrites, indexes, and reorders data based on how your tables are actually used, delivering up to 5× faster queries, 10x storage reduction, and 7x better compaction efficiency compared to other engines.

Yossi Reitblat

October 23, 2025

September 17, 2025

News

Unlocking Iceberg management for everyone

Ryft, in many ways, is a story 15 years in the making. Yuval Yogev, Guy Gadon and I went to the same high school, worked together at 8200, building high-scale data infrastructure, and went our separate ways - all to realize that we really enjoy solving complicated data infrastructure problems together with the people we love.

Yossi Reitblat

October 27, 2025

July 8, 2025

See Ryft in Action