Joseph Machado, Senior Data Engineer @ LinkedIn, talks best practices
Audio Brief
This episode explores the transformative journey of data engineering over the last decade, highlighting the enduring value of fundamental principles amid constant technological change.
There are three key takeaways from this discussion. First, prioritize fundamental data engineering principles over specific tools for a lasting career. Second, data teams must adopt rigorous software engineering practices, including robust testing and CI/CD. Third, tool selection, particularly for data formats, depends heavily on an organization's scale and integration requirements.
The data engineering landscape has evolved dramatically, from early Java MapReduce to modern cloud-based solutions like Spark and Snowflake. Despite this rapid change, successful data engineers focus on core principles such as data modeling, query optimization, and rigorous testing. These foundational skills provide long-term career resilience, far outweighing the mastery of any single, transient technology.
Many data teams still lag behind traditional software engineering in adopting best practices. Robust CI/CD pipelines, comprehensive testing, and a positive local developer experience are crucial. This adherence to software engineering discipline prevents long-term maintenance issues, enhances data quality, and improves overall team efficiency.
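As a rough illustration of the kind of pre-publish check this points to, the sketch below uses pytest to validate a hypothetical `orders` dataset before it is released to stakeholders. The table, columns, and rules are assumptions for illustration, not anything described in the episode.

```python
# Hypothetical data-quality checks that could run in CI before publishing a dataset.
# The `orders` table, its columns, and the rules below are illustrative assumptions.
import pandas as pd
import pytest


@pytest.fixture
def orders() -> pd.DataFrame:
    # In a real pipeline this would load the staged output,
    # e.g. from a Parquet file or a warehouse table.
    return pd.DataFrame(
        {
            "order_id": [1, 2, 3],
            "customer_id": [10, 11, 12],
            "amount": [25.0, 40.5, 13.2],
        }
    )


def test_primary_key_is_unique_and_not_null(orders: pd.DataFrame) -> None:
    assert orders["order_id"].notna().all(), "order_id must not contain nulls"
    assert orders["order_id"].is_unique, "order_id must be unique"


def test_amounts_are_non_negative(orders: pd.DataFrame) -> None:
    assert (orders["amount"] >= 0).all(), "amount must be non-negative"
```

Running a suite like this on every pull request is one concrete way a data team can close the gap with software engineering practice described here.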
When selecting data tools and formats, an organization's size and complexity are critical factors. Smaller companies may thrive with integrated, proprietary solutions offered by single vendors like Snowflake. In contrast, larger enterprises often require the flexibility of open table formats, such as Apache Iceberg, to manage data seamlessly across diverse systems and environments like Spark and Trino.
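As a minimal sketch of what using an open table format across engines can look like, the snippet below configures a local Spark session with an Iceberg catalog and writes a table that another engine with an Iceberg connector (such as Trino) could also read. The catalog name, warehouse path, and table are assumptions for illustration only.

```python
# Minimal sketch: writing an Apache Iceberg table from Spark so other engines
# (e.g. Trino) can read the same data. Catalog name, warehouse path, and table
# name are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sketch")
    # Assumes the Iceberg Spark runtime jar is on the classpath
    # (e.g. added via --packages when launching Spark).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create and populate an Iceberg table; the table metadata lives in the warehouse
# path, so any engine pointed at the same location can query the same data.
spark.sql(
    "CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, payload STRING) USING iceberg"
)
spark.sql("INSERT INTO demo.db.events VALUES (1, 'signup'), (2, 'login')")
spark.sql("SELECT * FROM demo.db.events").show()
```

The same table could then be exposed through Trino's Iceberg connector, which is the kind of cross-engine flexibility the episode attributes to open formats.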
Ultimately, building effective data platforms today demands a deep understanding of enduring principles and disciplined engineering practices, adapting technology choices to organizational scale.
Episode Overview
- The episode features Joseph Machado, a Senior Data Engineer at LinkedIn and creator of the "Start Data Engineering" blog, who shares his decade-long journey in the data field.
- The discussion covers the evolution of the data engineering stack, from early technologies like Java MapReduce and IBM DB2 to modern tools like Spark, Snowflake, and Airflow.
- A central theme is the enduring importance of fundamental principles (data modeling, testing, software engineering best practices) over specific, ever-changing tools.
- The speakers explore the challenges of developer experience in data engineering, the trade-offs of modern tools, and the role of open table formats like Apache Iceberg.
Key Concepts
- Evolution of Data Stacks: The conversation traces the technological progression in data engineering, from on-premise, code-heavy systems (Java MapReduce, HDFS) to more abstracted, SQL-centric cloud data warehouses (Snowflake, Databricks) and orchestration tools (Airflow).
- Data Roles Progression: Joseph Machado's career path is highlighted as an example of the industry's evolution, where he started with the title "Data Scientist" but was performing tasks now clearly defined as data engineering.
- Enduring Fundamentals: A key point is that despite the rapid changes in technology, the core principles of good software engineering, proper data modeling, and robust testing remain constant and crucial for building sustainable data platforms.
- Developer Experience Challenges: The discussion points out that while modern tools make some tasks easier, they can also introduce complexity and make local testing difficult, potentially degrading the developer experience compared to traditional software engineering.
- Open Table Formats: The role of formats like Apache Iceberg is discussed, particularly for large companies that need to manage data across different systems (e.g., Spark, Trino) and environments (on-premise and cloud).
Quotes
- At 01:21 - "I was in software engineering, moved into data engineering, but at that time my title was data scientist, but basically I was just doing data engineering." - Joseph Machado explaining the ambiguity of data roles early in his career and how he transitioned from software to data.
- At 05:03 - "A lot of places software engineering based concepts like testing, making sure you have proper CI/CD setup, it's not really followed in the data team." - Joseph highlighting a common gap where data teams lag behind software engineering teams in adopting fundamental development practices.
- At 12:12 - "I don't think tools matter as much. I think the principles matter, like test data before you publish it to your stakeholders." - Joseph emphasizing his core belief that foundational practices are more critical to a data engineer's success than mastering any single tool.
- At 23:31 - "It depends on the company size. I do not think they're going to eat the world anytime soon, just because Snowflake has its own internal format." - Joseph giving his perspective on whether open table formats like Iceberg will completely take over, suggesting that large vendors with proprietary formats will coexist for the foreseeable future.
Takeaways
- Focus on Principles, Not Just Tools: To build a lasting career in data engineering, prioritize understanding core concepts like data modeling, query optimization, and testing methodologies. Tools will evolve, but these fundamental skills are transferable and enduring.
- Advocate for Strong Software Engineering Practices: Data teams should adopt and enforce practices common in software engineering, such as robust CI/CD, comprehensive testing, and a good local developer experience. This prevents long-term maintenance issues and improves data quality.
- Understand Your Organization's Scale for Tool Selection: The choice between open formats (like Iceberg) and proprietary vendor formats (like Snowflake's) often depends on company size and complexity. Smaller companies may benefit from a single vendor, while larger enterprises need the flexibility of open formats to work across multiple systems.