Should you run notebooks on production environments?
Here’s an alternative using Apache Beam
Have you ever felt the pressure to adopt the latest technology, the newest tool, the shiniest framework, just because everyone else is doing it or because of the fear of missing out (a.k.a. FOMO)? It’s tempting, isn’t it? To jump on the bandwagon, to chase the hype, to feel like you’re on the cutting edge. But sometimes, the most innovative path lies in remembering the fundamentals.
There’s a seductive allure to the quick fix, the easy button, the shortcut to success. And in the data space, that temptation often takes the form of notebooks. Notebooks are powerful tools, offering flexibility and a low barrier to entry. But they can also be a dangerous siren song, luring engineers away from proven development practices and toward a chaotic and unsustainable path.
- Random dude on the internet: Download this notebook to try it!
- The Startup founder: Great! I’m going to become filthy rich with this!
My friends! That random dude on the internet literally said, “Download this notebook to try it out!” He didn’t say, “Download this notebook so you could deploy it into production and become filthy rich!”
The truth is, relying solely on notebooks for your AI product is like building a house without a foundation. Sure, you can put up walls and a roof quickly, but it’s only a matter of time before the whole structure collapses under its own weight.
Notebooks can be a breeding ground for bad habits — a lack of testing, poor version control, and a tangled mess of dependencies. They encourage a “get it done” mentality at the expense of long-term maintainability and scalability.
Now, I’m not saying notebooks are inherently evil. They have their place in the data world, particularly for exploration, analysis, and experimentation. But for production-level code, a more robust approach is needed.
Think of it like building a bridge. You wouldn’t rely on sketches and prototypes alone, would you? You’d need detailed blueprints, rigorous engineering standards, and a team of skilled builders working in unison.
Data and ML/AI engineering demand the same level of rigor. We need to embrace those software engineering best practices — version control, automated testing, continuous integration and deployment — that have proven their worth in building reliable and scalable systems.
Let’s resist the temptation of the easy path, the allure of the quick fix. Let’s remember the fundamentals, build a strong foundation, and create data pipelines that are not just fast but also robust, maintainable, and built to last.
Apache Beam as a Structural Antidote
Beam’s unified programming model encourages a more disciplined approach, moving away from the free-for-all of notebooks towards a well-defined structure for building and managing data pipelines. Instead of haphazardly copy-pasting code snippets in a notebook, Beam guides you to define your data transformations as a series of interconnected steps, forming a clear and understandable data flow.
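To make that concrete, here is a minimal sketch of what such a pipeline looks like in Beam's Python SDK. The file names, the CSV schema, and the parsing logic are illustrative placeholders, not a prescription.

```python
# Minimal sketch of a Beam pipeline: each step is a named, reusable transform.
# The input file, schema, and output path are hypothetical.
import apache_beam as beam

def parse_sale(line):
    # Hypothetical parser: "product,amount" -> (product, amount as float)
    product, amount = line.split(",")
    return product, float(amount)

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadSales" >> beam.io.ReadFromText("sales.csv")          # source
        | "ParseLines" >> beam.Map(parse_sale)                       # transform
        | "SumPerProduct" >> beam.CombinePerKey(sum)                 # aggregation
        | "FormatOutput" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "WriteTotals" >> beam.io.WriteToText("sales_totals")       # sink
    )
```

Each labeled step becomes a node in the pipeline's DAG, which is what makes the data flow readable, testable, and easy to reason about.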
By translating notebook code into Beam’s structured framework, data teams can:
- Improve Code Maintainability: Beam’s modular structure and clear syntax make it easier to understand, debug, and maintain complex pipelines.
- Enhance Testability: Beam’s testing utilities enable more rigorous validation of pipeline logic, reducing errors and increasing confidence in the data (a test sketch follows this list).
- Facilitate Collaboration: Beam’s well-defined structure provides a shared language for data engineers and data scientists to collaborate, breaking down silos and promoting a more unified approach to data processing.
- Collaborate on Solutions: The modular structure of Beam pipelines allows different team members to focus on specific aspects of the data processing workflow, promoting specialization and efficient teamwork.
- Communicate Effectively: A shared language and a visual representation of the pipeline facilitate clear communication between data engineers, data scientists, and other stakeholders.
- Track and Manage Changes: Unlike notebook files, Beam pipeline code is plain source code that diffs and merges cleanly under Git. This enables teams to track changes, collaborate on updates, and ensure that everyone is working with the latest version of the pipeline.
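As an example of the testability point above, here is a hedged sketch of a unit test built with the testing utilities in Beam's Python SDK; the transform under test and the expected values are invented for illustration.

```python
# Sketch of a Beam unit test: run a transform on in-memory data and assert
# on the resulting PCollection. The transform and data are illustrative.
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def to_upper(word):
    return word.upper()

def test_to_upper_transform():
    with TestPipeline() as p:
        output = (
            p
            | beam.Create(["beam", "notebook"])  # in-memory test input
            | beam.Map(to_upper)
        )
        # The assertion is checked when the pipeline finishes running.
        assert_that(output, equal_to(["BEAM", "NOTEBOOK"]))
```

Tests like this can run in a plain CI job, which is exactly the kind of safety net that notebook-only workflows tend to lack.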
Beam’s DAG: A Visual Aid for Collaboration
The directed acyclic graph (DAG) that Beam generates is a powerful visualization tool. It provides a clear and intuitive representation of the data flow, making it easier for data teams to:
- Understand Pipeline Logic: The DAG clearly shows the sequence of transformations, enabling quick comprehension of the pipeline’s functionality.
- Identify Bottlenecks: The DAG can highlight performance bottlenecks or areas where data flow might be obstructed.
- Communicate and Collaborate: The DAG serves as a shared visual reference for discussions and collaboration between data engineers, data scientists, and other stakeholders.
Depending on where you run your Beam pipeline, the DAG can expose more or less information. The runner-specific documents linked under Downloadable Documents below show example DAGs from different execution environments.
Should Data Scientists Become Fluent in Beam’s Syntax?
Not necessarily. Their primary focus should remain on analytical tasks and model development. However, a shared understanding of Beam’s terminology and concepts — like Pipelines, PCollections, PTransforms, and Windowing — facilitates better communication between data scientists and data engineers, bridging the gap between experimentation and implementation.
It’s like speaking a common language. Even if you’re not a fluent speaker, understanding the basic vocabulary allows for clearer communication and more effective teamwork.
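As a rough phrasebook, the sketch below maps those terms onto a few lines of Python. The click-event data and the sixty-second window are assumptions made purely for illustration.

```python
# Vocabulary sketch: Pipeline, PCollection, PTransform, and Windowing in code.
# The events and timestamps are invented for illustration.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as pipeline:            # Pipeline: the whole workflow
    clicks = (                               # PCollection: the dataset flowing through it
        pipeline
        | "CreateEvents" >> beam.Create([
            ("user_a", 1, 0),    # (user, clicks, event time in seconds)
            ("user_a", 1, 30),
            ("user_b", 1, 70),
        ])
        | "AddTimestamps" >> beam.Map(
            lambda e: TimestampedValue((e[0], e[1]), e[2]))
    )
    (
        clicks
        | "OneMinuteWindows" >> beam.WindowInto(FixedWindows(60))  # Windowing: slice event time
        | "CountPerUser" >> beam.CombinePerKey(sum)                # PTransform: a processing step
        | "Print" >> beam.Map(print)
    )
```

A data scientist does not need to write this, but recognizing which line is the PCollection and which is the windowing strategy makes design conversations much faster.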
Migrating notebooks? Here’s a step-by-step guide
Migrating existing notebook code to Beam requires effort, but it’s an investment that pays off in the long run. Here’s a suggested approach:
- Identify Critical Pipelines: Prioritize the migration of notebooks that support essential data products or workflows.
- Identify Data Sources and Sinks: Know where each input is read from and where each output must be written.
- Break Down the Logic: Analyze the notebook code and decompose it into a series of well-defined data transformations.
- Translate to Beam Code: Rewrite the logic using Beam’s PTransforms, DoFns, and other components, leveraging Beam’s structured approach (a before/after sketch follows this list).
- Test and Validate Rigorously: Implement unit tests to ensure the correctness of the Beam code and validate its behavior against the original notebook’s output.
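To make the translation and testing steps more tangible, here is a hedged before/after sketch: a pandas-style cell you might find in a notebook, and one way that same logic could be expressed as a Beam DoFn. The column names and the filtering rule are assumptions for illustration only.

```python
# Notebook version (pandas), as it might appear in a cell:
#   df["revenue"] = df["price"] * df["quantity"]
#   df = df[df["revenue"] > 0]
#
# One possible Beam translation of that logic:
import apache_beam as beam

class ComputeRevenue(beam.DoFn):
    """Enrich each order with a revenue field, dropping non-positive revenue."""
    def process(self, order):
        revenue = order["price"] * order["quantity"]
        if revenue > 0:
            yield {**order, "revenue": revenue}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateOrders" >> beam.Create([
            {"price": 10.0, "quantity": 3},
            {"price": 5.0, "quantity": 0},   # filtered out: revenue == 0
        ])
        | "ComputeRevenue" >> beam.ParDo(ComputeRevenue())
        | "Print" >> beam.Map(print)
    )
```

Because the example feeds the DoFn from an in-memory beam.Create, the same logic is trivial to exercise from a unit test and to compare against the original notebook's output.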
By embracing Beam’s structured approach and its collaborative features, data teams can move away from the “hope and pray” mentality often associated with notebook-based pipelines. Beam’s emphasis on testability, scalability, maintainability, and transparency empowers teams to build robust, reliable pipelines while fostering collaboration and a shared understanding of the data processing workflow. It’s a crucial step towards a data-driven culture that prioritizes data quality, trust, and confidence in the insights derived from data.
Final Thoughts
Notebooks are great for experimentation, but production needs structure. Apache Beam offers flexibility without chaos — let’s explore how it fits your needs. With my extensive experience in data engineering and deep knowledge of Apache Beam, I can provide the expertise and support necessary to implement and optimize Apache Beam for your specific use cases.
Contact me to discuss how we can leverage Apache Beam to elevate your data processing capabilities and drive your business forward.
Downloadable Documents
- Apache Beam — The Future of Data Processing
- Apache Beam — Direct Runner
- Apache Beam — Dataflow Runner
- Apache Beam — Flink Runner
Further Reading
- Apache Beam — Running a Python job on a Spark cluster on Kubernetes
- LinkedIn’s journey from Spark & Samza to Apache Beam
- Introducing the Beam College
- Apache Beam at Lyft
- Apache Beam at Intuit
- Apache Beam at Yelp
- Apache Beam at LinkedIn
- Say Goodbye to the Lambda Architecture
- Moving Beyond Lambda: The Unified Apache Beam Model for Simplified Data Processing
- Apache Beam for Real-time ETL-Integration
- Apache Beam & Dataflow for Real-time Marketing Intelligence
- Apache Beam & Dataflow for Real-time ML & Generative AI
- Apache Beam, Dataflow ML and Gemma for real-time sentiment analysis