Joe Reis and Matt Housley on the fundamentals of data engineering
Audio Brief
This episode examines the critical shift from data science to data engineering, highlighting the industry's struggle with foundational infrastructure, data quality, and the Dev/Data divide.
There are four key takeaways from this discussion. First, prioritize adopting established software engineering practices for data quality. Second, focus on mastering fundamental, tool-agnostic data techniques over specific tools. Third, actively engage stakeholders to define and deliver tangible business value. Finally, recognize that data success is often limited by organizational and communication challenges, not just technology.
The data industry often neglects foundational problems like data quality and governance, issues that software engineering has addressed with practices such as schema contracts and registries. Data teams should adopt these proven methods to build robust infrastructure.
Modern data education frequently errs by prioritizing specific tools over timeless, fundamental techniques. Success lies in mastering concepts like data modeling and governance first, as technologies evolve rapidly.
Generic advice to "deliver business value" is often unhelpful for junior professionals without a clear path. True value delivery requires moving beyond technical execution to deeply understand stakeholder problems and how data can directly solve them.
The success of data initiatives is heavily dependent on socio-technical skills like communication, collaboration, and navigating organizational dynamics. These "social club" aspects are often more limiting than purely technical challenges.
Ultimately, building a mature and effective data organization requires prioritizing foundational engineering, practical education, and strong cross-functional collaboration.
Episode Overview
- The discussion follows the guests' journey from data science to data engineering, a move driven by the observation that many companies lack the foundational infrastructure needed for analytics and machine learning to succeed.
- The episode offers a sharp critique of the data industry's tendency to chase technological fads and buzzwords while neglecting persistent, fundamental problems like data quality, governance, and the "Dev/Data divide."
- A central theme is the failure of data education, which often provides vague, unhelpful advice (e.g., "deliver business value") and focuses on specific tools rather than timeless, foundational techniques.
- The conversation advocates for a new approach centered on adopting proven software engineering practices, prioritizing fundamental skills over tools, and recognizing the critical role of communication and collaboration in data roles.
Key Concepts
- The "Recovering Data Scientist": This term describes the career shift from data science to data engineering, motivated by the practical need to build the robust infrastructure that is a prerequisite for successful analytics and ML.
- The Dev/Data Divide: Data teams are often treated as downstream consumers of application data, forcing them to deal with data quality issues that have already been solved in software engineering through practices like schema contracts and registries.
- The Paradox of Modern Tooling: Despite the availability of powerful and easy-to-use data tools, the industry still struggles with the same foundational problems it has faced for years, suggesting the issue lies in methodology and education, not technology.
- Critique of Vague Industry Advice: The conversation criticizes common but impractical advice like "deliver business value" without providing a clear path for junior professionals to do so, highlighting a gap in mentorship and practical education.
- Techniques Over Tools: A core educational philosophy that advocates for teaching timeless, fundamental techniques (e.g., data modeling, governance) first, as specific technologies are transient and can be learned on top of a strong conceptual foundation.
- Data as a "Social Club": The idea that success in data is not purely a technical challenge but is heavily dependent on socio-technical skills like communication, collaboration, and navigating organizational dynamics.
Quotes
- At 1:38 - "You see a lot of companies hiring data scientists and I think forgetting to build the foundation that would help data scientists succeed." - Joe Reis explains that their move into data engineering was motivated by the need to build the necessary infrastructure that was often missing for data scientists.
- At 6:35 - "My dream actually is that we just stop talking about data." - Joe Reis expresses his ultimate goal for the industry, where data is so seamlessly integrated and valuable that it becomes an invisible, foundational layer that doesn't require constant discussion.
- At 13:42 - "I think if you want to know where data is going... just look at where software engineering has been for the past 10, 20 years and just adopt those practices." - Joe Reis argues that the data field is behind software engineering and should adopt its established best practices.
- At 17:13 - "The world's moved beyond reports at this point... if we're still struggling with that, I don't know, I'll go do something else with my time. Go become a veterinarian or something, it's more fun." - Joe Reis expresses his frustration with the industry's continued focus on basic BI reporting.
- At 18:07 - "This kid just came out of college, it's his first year in a data role and you're going to tell him he needs to deliver value. How is he going to do that?" - The speaker points out that generic advice to "deliver value" is often impractical and unhelpful for junior professionals.
- At 22:38 - "The way we teach it is absolutely, it's backwards, right? Know the techniques and then learn the tools." - Joe Reis explains his philosophy that data education should prioritize timeless techniques over specific, ever-changing technologies.
- At 24:55 - "Data is not a technical game. It's a social club." - The quote underscores how much success in data roles depends on communication, collaboration, and understanding organizational dynamics.
Takeaways
- Prioritize learning and implementing established software engineering principles, such as schema management and data contracts, to proactively improve data quality and reliability.
- Focus your professional development on mastering fundamental, tool-agnostic techniques like data modeling and governance before specializing in specific, trendy technologies.
- To provide tangible business value, move beyond technical execution and actively engage with stakeholders to deeply understand their problems and how data can directly solve them.
- Recognize that data success is often limited by organizational and communication challenges, not just technology; invest time in building relationships and improving cross-team collaboration.
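To make the first takeaway more concrete, here is a minimal sketch of a data contract check in plain Python. It is an illustration only, not something presented in the episode: the `FieldSpec` class, the `ORDERS_CONTRACT` definition, and the `validate` function are hypothetical names invented for this example, and a real team would more likely rely on a schema registry or a validation library.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldSpec:
    """One field in a (hypothetical) data contract."""
    name: str
    type_: type
    required: bool = True


# Hypothetical contract agreed between an application team and a data team.
ORDERS_CONTRACT = (
    FieldSpec("order_id", str),
    FieldSpec("amount_cents", int),
    FieldSpec("coupon_code", str, required=False),
)


def validate(record: dict[str, Any], contract: tuple[FieldSpec, ...]) -> list[str]:
    """Return a list of contract violations; an empty list means the record conforms."""
    errors = []
    for spec in contract:
        value = record.get(spec.name)
        if value is None:
            # Absent or null: only a violation when the field is required.
            if spec.required:
                errors.append(f"missing required field: {spec.name}")
            continue
        if not isinstance(value, spec.type_):
            errors.append(
                f"{spec.name}: expected {spec.type_.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about are flagged rather than silently passed on.
    known = {spec.name for spec in contract}
    for name in sorted(set(record) - known):
        errors.append(f"unexpected field: {name}")
    return errors
```

Running a check like this where the data is produced, rather than downstream in the warehouse, is one way to surface schema changes before they reach the data team.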