The Resurgence of SQL: Insights from Ryanne Dolan from LinkedIn
Audio Brief
This conversation explores data engineering trends of the past decade, focusing on the surprising resurgence of SQL and on how LinkedIn manages complexity through consumer-driven data architectures.
There are three key takeaways from this discussion. First, SQL's declarative power is re-emerging as a modern standard for scalable data systems. Second, large tech companies manage complexity by creating higher-level abstractions, often leveraging SQL. Third, a shift to consumer-driven data architecture empowers users to define their data needs, automating infrastructure provisioning.
SQL, once considered legacy, has made a significant comeback due to advancements in distributed systems and query optimizers. Its declarative nature now layers efficiently on top of modern, cloud-scale infrastructure, simplifying complex data operations. Modern tooling like Apache Flink demonstrates how complex stream processing can be expressed in just a few lines of SQL, replacing extensive custom code.
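The episode doesn't include the actual queries, but stream processing of the kind described might look like the following Flink SQL sketch. The table, column, and connector details are illustrative assumptions, not anything from the conversation:

```sql
-- Hypothetical source table; connector options (topic, brokers, format) elided.
CREATE TABLE page_views (
  member_id BIGINT,
  page      STRING,
  ts        TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka' /* topic, brokers, format, ... */);

-- A windowed aggregation that might otherwise be a multi-stage Java job:
SELECT page,
       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
       COUNT(*) AS views
FROM page_views
GROUP BY page, TUMBLE(ts, INTERVAL '1' MINUTE);
```

A handful of declarative lines like these stand in for the custom procedural code the discussion says they replaced.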
When dealing with highly complex and evolving systems, the focus shifts to building robust abstractions. This approach, often using SQL, hides the intricate details of legacy and multi-stage data pipelines, allowing for incremental improvements rather than full rewrites. It simplifies the user experience by providing a unified interface over diverse backend systems.
A key architectural shift is towards a consumer-driven model, moving away from systems where producers push data. In this model, consumers declare their data needs, typically via a SQL query, and the system automatically provisions the necessary infrastructure. LinkedIn's experimental Hoptimator project exemplifies this by generating entire data pipelines, including Flink jobs and Kafka topics, from a single SQL query. This prevents redundant pipelines and reduces engineering overhead.
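Per the episode, Hoptimator's input is nothing more than a plain ANSI SELECT. A hypothetical consumer declaration might look like the following; the catalog, table, and column names are invented for illustration and are not Hoptimator's actual schema:

```sql
-- Hypothetical consumer-driven subscription: names are illustrative only.
SELECT member_id, job_title, region
FROM kafka.profile_updates
WHERE region = 'EMEA';
-- From a query of this shape, the system described would provision the
-- underlying pipeline: Flink jobs, Kafka topics, materializers, and ACLs.
```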
Overall, the discussion highlights a future where declarative interfaces and automation simplify complex data engineering challenges.
Episode Overview
- Guest Ryanne Dolan, a "boomerang" employee at LinkedIn, discusses the evolution of data engineering trends over the past decade.
- The conversation centers on the surprising resurgence of SQL, tracing its journey from being considered legacy technology to becoming a modern standard for interacting with scalable data systems.
- The discussion explores how large tech companies manage complexity by creating higher-level abstractions, using SQL to simplify intricate, manually-built data pipelines.
- LinkedIn's experimental "Hoptimator" project is introduced as an example of a "consumer-driven" data architecture, where entire data pipelines are automatically generated from a single SQL query.
Key Concepts
- The Resurgence of SQL: SQL, once deemed old and unscalable, has made a major comeback. This is due to advancements in distributed systems and query optimizers that allow its declarative power to be layered on top of modern, cloud-scale infrastructure.
- Abstraction over Complexity: At large companies like LinkedIn, the tech stack is too evolved and complex to be a single, unified system. The focus is on "glue engineering" and building higher-level abstractions (often using SQL) to hide the complexity of legacy and multi-stage data pipelines.
- The Role of Modern Tooling: Technologies like Apache Flink have proven that extremely complex stream processing and data transformations can be expressed in just a few lines of SQL, replacing what previously required extensive, custom-written procedural code (e.g., multi-stage Java jobs).
- Producer-Driven vs. Consumer-Driven Models: A key architectural shift is moving from a "producer-driven" model, where teams build and push data products, to a "consumer-driven" model. In the latter, consumers declare their data needs (e.g., via a SQL query), and the system automatically orchestrates the infrastructure to deliver it.
- Hoptimator Project: An open-source, experimental framework from LinkedIn that embodies the consumer-driven philosophy. It takes a simple ANSI SQL SELECT query and automatically generates and deploys the entire underlying data pipeline, including Flink jobs, Kafka topics, materializers, and security ACLs.
Quotes
- At 0:54 - "Yeah, so actually first off, I started at LinkedIn almost exactly 10 years ago, but I left and came back. So I'm one of those, you know, big tech boomerangs that everyone wants to be." - Ryanne Dolan explains his history with LinkedIn, giving him a long-term perspective on industry trends.
- At 4:03 - "I think a big one, like I noticed this right away, almost day one when I came back and it absolutely shocked me that SQL was popular." - Ryanne Dolan shares his surprise at the resurgence of SQL after years of the industry focusing on NoSQL.
- At 9:09 - "It all happened for a good reason. It all happened because databases reached an endpoint. They sucked. They just couldn't scale to cloud scale." - Host Eldad Farkas provides context for why the industry initially moved away from traditional databases to NoSQL solutions.
- At 17:14 - "That's sort of like what I'm working on actually... is trying to sort of roll up all this complexity that's been human-made... and sort of rolling them up into abstractions that we can easily just express in SQL." - Describing his focus on simplifying data engineering through SQL abstractions.
- At 21:27 - "Hoptimator is an experiment... where we take literally just an ANSI SELECT query... and from that literal query, we can stand up Flink jobs, we can set up materializers." - Introducing the core concept of the Hoptimator project: generating entire data pipelines from a single SQL query.
- At 28:02 - "What you just alluded to is something I call producer-driven versus consumer-driven." - Introducing the architectural paradigm shift from producers pushing data to consumers pulling data by defining their needs.
Takeaways
- Prioritize declarative interfaces like SQL over procedural code for data operations. This simplifies development, reduces maintenance, and aligns with the industry trend towards AI-driven code generation, where high-level instructions are more valuable than low-level implementation details.
- When dealing with complex, evolving systems, focus on building robust abstractions that hide the underlying complexity rather than attempting a complete, monolithic rewrite. This allows for incremental improvement and simplifies the user experience.
- Adopt a "consumer-driven" mindset for data architecture. Empower users to declare the data they need, and build automation that provisions the necessary infrastructure on-demand. This prevents the proliferation of redundant, intermediate data pipelines and reduces engineering overhead.