Running Apache Beam outside of GCP
Where is Apache Beam being used for production workloads when running outside of Google Cloud?
As you embark on your Apache Beam journey, you may wonder: can I leverage this technology beyond Google Cloud? The answer is a resounding yes! Apache Beam’s design philosophy centers on providing an abstraction layer for developers, letting them concentrate on crafting business logic rather than getting bogged down in implementation details. This flexibility lets you integrate Apache Beam into your preferred ecosystem, streamlining development and unlocking greater productivity.
In this article, we cover examples of Apache Beam powering production data pipelines outside of Google Cloud Platform (GCP). The objective is to show that running Apache Beam outside GCP is not only feasible but that large enterprises are already doing it successfully in their own environments. The information here is backed by official blogs from Apache Beam, Yelp, LinkedIn, Intuit, and Lyft.
In upcoming articles, we’ll take a step-by-step approach to deploying Apache Beam on various runners, exploring the possibilities and best practices for each.
Contents
1. Using Apache Beam at Yelp
2. Using Apache Beam at LinkedIn
3. Using Apache Beam at Intuit
4. Using Apache Beam at Lyft
1. Using Apache Beam at Yelp
A little bit of context: Yelp’s “Business Properties” encompass various data points associated with a business. For instance, a restaurant’s properties might include accepted payment methods, amenities, and operating hours. There are two types: Business Attributes, which are part of the legacy system, and Business Features, which are in a dedicated microservice, aligning with Yelp’s Service Oriented Architecture.
Business Attributes Data Collection and Transformation
Yelp manages a substantial volume of business attributes data spread across numerous MySQL tables. Efficiently synchronizing and transforming this data for various consumers, such as offline data warehouses and real-time analytics systems, is crucial. Yelp employs Apache Beam, paired with Apache Flink as the distributed processing backend, to streamline the transformation and formatting of business attributes data.
- Data Transformation: Apache Beam transformation jobs process input streams generated by the MySQL replication handler. These streams replicate data from MySQL tables, which Apache Beam then standardizes and transforms into a consistent format.
- Unified Data Stream: The transformed data is published into a single unified stream, ensuring consistency across all business properties. This unified stream abstracts the internal complexities, simplifying data consumption for various services and systems.
By using Apache Beam, Yelp can efficiently handle the high throughput and complex transformations required for their business attributes data, ensuring real-time synchronization and high-quality data for both offline and streaming consumers.
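To make this concrete, here is a minimal sketch of what such a standardization job can look like in the Beam Python SDK. It assumes the replication events arrive via Kafka as JSON; the topic names, record schema, and normalization logic are illustrative placeholders, not Yelp’s actual implementation.

```python
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.options.pipeline_options import PipelineOptions


def normalize_record(raw_bytes):
    """Map one replication event to a hypothetical unified schema."""
    event = json.loads(raw_bytes.decode("utf-8"))
    return {
        "business_id": event["row"]["business_id"],
        "property_name": event["row"]["attribute_name"],
        "property_value": event["row"]["attribute_value"],
        "source": "business_attributes",
    }


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # ReadFromKafka yields (key, value) pairs of bytes.
        | "ReadReplicationStream" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka:9092"},
            topics=["mysql.business_attributes"])  # hypothetical topic
        | "Normalize" >> beam.Map(lambda kv: normalize_record(kv[1]))
        | "Encode" >> beam.Map(
            lambda record: (b"", json.dumps(record).encode("utf-8")))
        | "PublishUnified" >> WriteToKafka(
            producer_config={"bootstrap.servers": "kafka:9092"},
            topic="business_properties.unified")  # hypothetical topic
    )
```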
Streaming Business Features
Yelp stores business features and associated metadata in Cassandra. Synchronizing this data in real-time with other systems, while maintaining a consistent format, is essential for seamless data consumption. Apache Beam, alongside Cassandra source connectors, is used to stream and transform business features data into a unified format.
- Data Formatting: Apache Beam transformation jobs process the output stream from Cassandra, ensuring that business features data matches the unified format used for business attributes.
- Consistent Data Stream: The resulting data is published into the same unified output stream as business attributes, maintaining consistency and facilitating easier data integration and consumption.
Apache Beam’s capabilities allow Yelp to handle real-time data processing efficiently, ensuring that business features are consistently formatted and readily available for various downstream applications.
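In Beam terms, keeping the two sources aligned can be as simple as mapping each Cassandra change event onto the same record shape used for attributes. The field names below are assumptions for illustration.

```python
# Hypothetical formatter mapping a Cassandra change event onto the same
# unified schema used for business attributes.
def format_business_feature(event):
    return {
        "business_id": event["partition_key"]["business_id"],
        "property_name": event["column"],
        "property_value": event["value"],
        "source": "business_features",
    }

# Plugged into a streaming pipeline the same way as the attributes job:
# features | "FormatFeatures" >> beam.Map(format_business_feature)
```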
Enriching Data with Properties Metadata
To provide comprehensive insights, Yelp needs to enrich business properties data with associated metadata, such as modification timestamps and accuracy confidence levels. Yelp utilizes a Joinery Flink job to join business data streams with their corresponding metadata. Apache Beam plays a crucial role in this process by transforming and preparing the data for integration.
- Data Enrichment: Apache Beam transformation jobs process and standardize business properties data. The Joinery Flink job then merges this data with metadata from multiple Kafka topics, creating enriched data streams.
- Comprehensive Data Streams: The enriched data stream contains both business properties and relevant metadata, ensuring that consumers have access to complete and accurate information.
Apache Beam’s ability to handle complex transformations and integrations ensures that Yelp’s data streams are enriched and ready for diverse use cases, from analytics to real-time monitoring.
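The join itself runs in Yelp’s Joinery Flink job, but the same enrichment pattern can be expressed in Beam with CoGroupByKey. This sketch assumes two keyed, windowed PCollections of (business_id, dict) pairs; the field names are placeholders.

```python
import apache_beam as beam


def merge(business_id, grouped):
    """Pair each property record with each matching metadata record."""
    for record in grouped["properties"]:
        for meta in grouped["metadata"]:
            yield {"business_id": business_id, **record, **meta}


def enrich(keyed_properties, keyed_metadata):
    # Both inputs: windowed PCollections of (business_id, dict) pairs.
    return (
        {"properties": keyed_properties, "metadata": keyed_metadata}
        | "JoinOnBusinessId" >> beam.CoGroupByKey()
        | "Merge" >> beam.FlatMapTuple(merge)
    )
```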
Final Data Formatting
Ensuring data consistency, removing invalid entries, and adding supplementary fields are critical for providing high-quality data to various consumers. Apache Beam is used to perform final data formatting tasks, addressing inconsistencies and preparing the data for consumption.
- Data Cleansing: Apache Beam transformation jobs clean the data, removing duplicates and invalid entries, and adding necessary fields to ensure completeness.
- Consistent and Reliable Data: The final transformed data is published into a consolidated stream, which is exposed for consumption by offline systems like Redshift and Yelp’s Data Lake, as well as real-time consumers within the organization.
By leveraging Apache Beam, Yelp ensures that their data is consistent, reliable, and ready for analysis and reporting, providing valuable insights for various stakeholders.
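A sketch of what this final stage can look like; the validation rule, dedup key, and supplementary field are placeholders rather than Yelp’s actual logic.

```python
import apache_beam as beam


def is_valid(record):
    return bool(record.get("business_id")) and record.get("property_value") is not None


def dedup_key(record):
    return (record["business_id"], record["property_name"], record["property_value"])


def finalize(enriched):
    # `enriched`: the windowed stream produced by the enrichment stage.
    return (
        enriched
        | "DropInvalid" >> beam.Filter(is_valid)
        | "KeyForDedup" >> beam.Map(lambda r: (dedup_key(r), r))
        | "GroupDuplicates" >> beam.GroupByKey()
        | "TakeFirst" >> beam.Map(lambda kv: next(iter(kv[1])))
        | "AddFields" >> beam.Map(lambda r: {**r, "schema_version": 2})  # hypothetical field
    )
```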
Real-Time Consumption and Integration
Yelp’s marketing systems and other services require timely synchronization of business property data for campaigns and other real-time applications. The consolidated data stream, formatted and enriched using Apache Beam, is used to sync business property data with real-time systems.
- Timely Synchronization: Apache Beam ensures that the data stream is up-to-date and accurate, providing real-time data to marketing systems and other services.
- Seamless Integration: The same data stream can be consumed by multiple real-time applications, facilitating efficient data sharing and integration across the organization.
Apache Beam’s robust streaming capabilities enable Yelp to meet the demands of real-time data consumers, ensuring timely and accurate data synchronization for critical applications.
2. Using Apache Beam at LinkedIn
LinkedIn utilizes Apache Beam extensively for real-time stream processing, managing over 4 trillion events daily through more than 3,000 pipelines. This framework supports critical services such as machine learning, notifications, and anti-abuse AI modeling. With a massive user base of over 950 million members, LinkedIn’s ability to maintain smooth operations is vital for connecting professionals worldwide.
LinkedIn’s Open-Source Journey
LinkedIn has a strong history of contributing to the open-source community, managing over 75 projects. Key tools developed by LinkedIn include Apache Kafka for data ingestion and Apache Samza for event streaming, forming the foundation of their data processing ecosystem.
Despite these innovations, the need for a unified and more efficient processing system led them to Apache Beam.
Transition to Apache Beam
The release of Apache Beam in 2016 addressed LinkedIn’s need for a unified data processing model that supports both batch and stream processing. Apache Beam’s advanced API and multi-language support (Python, Go, and Java) allowed LinkedIn to create sophisticated multi-language pipelines and run them on any engine, significantly enhancing their processing capabilities and efficiency.
Let’s delve into the applications of Apache Beam at LinkedIn, uncovering the diverse range of use cases that demonstrate its capabilities in data processing and management.
Unified Streaming and Batch Pipelines
LinkedIn’s transition to Apache Beam involved migrating several use cases, including their standardization process, which processes user data such as job titles and skills in real time to improve job recommendations. Apache Beam’s unified model allowed LinkedIn to handle both real-time standardization and periodic backfilling efficiently, roughly halving memory and CPU usage (from ~5,000 GB-hours and ~4,000 CPU-hours to ~2,000 GB-hours and ~1,700 CPU-hours) and accelerating processing time by 94% (from 7.5 hours to 25 minutes).
Hundreds of streaming Apache Beam jobs now power real-time standardization, listening to events 24/7, enriching streams with additional data from remote tables, performing necessary processing, and writing results to output databases.
Depending on the target processing type (streaming or batch), the unified Apache Beam standardization pipeline can be deployed through the Samza cluster as a streaming job or through the Spark cluster as a batch backfilling job.
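Runner portability is what makes this dual deployment possible: the pipeline definition stays the same and only the submission options change. Here is a simplified illustration; the standardization logic is a placeholder, and LinkedIn’s streaming variant runs on Samza through Beam’s portability framework.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def standardize(profile):
    # Placeholder for LinkedIn's actual title/skill standardization logic.
    return {**profile, "title": profile["title"].strip().lower()}


def run(options, source, sink):
    """The same pipeline definition serves both deployment modes."""
    with beam.Pipeline(options=options) as p:
        p | "Read" >> source | "Standardize" >> beam.Map(standardize) | "Write" >> sink

# Streaming deployment, e.g. via a portable job server fronting the cluster:
#   run(PipelineOptions(streaming=True, runner="PortableRunner",
#                       job_endpoint="localhost:8099"), kafka_source, kafka_sink)
# Batch backfill of the identical pipeline on a Spark cluster:
#   run(PipelineOptions(runner="SparkRunner"), bounded_source, bounded_sink)
```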
Anti-Abuse and Near Real-Time AI Modeling
Apache Beam fortifies LinkedIn’s anti-abuse platform, Chronos, which detects and prevents abuse in near real-time. The flexibility of Apache Beam’s architecture enabled the integration of anti-abuse pipelines with Kafka, significantly reducing the time to label abusive actions from one day to just five minutes. This improvement enhances LinkedIn’s ability to detect and prevent various forms of abuse swiftly.
Chronos relies on two streaming Apache Beam pipelines: the Filter pipeline and the Model pipeline.
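The source doesn’t detail the internals of these pipelines, but Beam’s RunInference API is one way a Model-pipeline-style scoring step can be expressed today. In this sketch the model path and the precomputed "features" field are hypothetical.

```python
import numpy
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy


def score_events(filtered_events):
    # `filtered_events`: output of the Filter pipeline, assumed to be dicts
    # carrying a precomputed "features" vector. The model path is made up.
    model_handler = SklearnModelHandlerNumpy(
        model_uri="/models/abuse_classifier.pkl")
    return (
        filtered_events
        | "ToFeatures" >> beam.Map(lambda e: numpy.array(e["features"]))
        | "ScoreAbuse" >> RunInference(model_handler)
    )
```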
Notifications Platform
LinkedIn’s Notifications Platform relies on Apache Beam and Apache Samza to generate and distribute notifications to members. Apache Beam handles large volumes of data in real-time, enabling timely and relevant notifications. The advanced API and reusable components of Apache Beam expedite development and scaling of the platform, enhancing user engagement through timely updates.
Real-Time ML Feature Generation
LinkedIn’s machine learning models for job recommendations and search feed depend on real-time feature generation. Apache Beam replaced the older, offline pipeline, reducing latency from 24–48 hours to mere seconds. This real-time processing capability allows LinkedIn’s ML models to deliver more personalized and timely recommendations.
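As an illustration of the kind of logic involved, a streaming pipeline can maintain such features with sliding windows. The feature below (per-member activity count over the last five minutes) is made up; it stands in for LinkedIn’s actual feature logic.

```python
import apache_beam as beam
from apache_beam.transforms import window


def recent_activity_feature(member_events):
    # `member_events`: unbounded PCollection of dicts with a "member_id" field.
    return (
        member_events
        | "KeyByMember" >> beam.Map(lambda e: (e["member_id"], 1))
        # 5-minute windows, refreshed every minute.
        | "SlidingWindows" >> beam.WindowInto(
            window.SlidingWindows(size=300, period=60))
        | "CountPerMember" >> beam.CombinePerKey(sum)
    )
```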
Managed Stream Processing Platform
With over 3,000 Apache Beam pipelines, LinkedIn developed Managed Beam to streamline and automate the creation and management of streaming applications. Managed Beam simplifies the development and operational processes for AI engineers, significantly reducing the time to onboard new applications from months to days. Apache Beam’s abstraction and portability facilitate easy integration and scalability.
3. Using Apache Beam at Intuit
Intuit is a global technology platform known for its financial and marketing automation solutions, including TurboTax, QuickBooks, Mint, Credit Karma, and Mailchimp. To support its mission of powering prosperity, Intuit developed a self-service Stream Processing Platform using Apache Beam to accelerate real-time applications and streamline data processing.
Self-service Stream Processing
In 2019, Intuit’s Data Infrastructure team began designing a Stream Processing Platform to provide a seamless experience for developers, focusing on business logic rather than operational and infrastructure management. Apache Beam was chosen as the core data processing technology due to its flexibility and portability, allowing the team to use various programming languages and execution engines. Initially, Apache Beam pipelines were used with Apache Samza to handle Kafka streams.
Apache Beam’s runner agnosticism proved crucial when Intuit switched from Apache Samza to Apache Flink without causing disruptions to users. This flexibility highlighted the benefits of Apache Beam, ensuring smooth transitions and future-proofing the platform.
The platform’s extensibility allowed Intuit to create a custom SDK layer for better compatibility with their Kafka installation. The Stream Processing Platform provided a graphical user interface (GUI) for designing, deploying, monitoring, and debugging data processing pipelines, using Argo Workflows for deployment on Kubernetes.
Powering Real-time Data
One of the most impactful applications of Apache Beam at Intuit is the unified clickstream processing pipeline. This pipeline consumes, aggregates, and processes raw clickstream events from Kafka, enriching the data with geolocation and other features. Apache Beam capabilities such as windowing, timers, and stateful processing enable fine-grained control over data freshness, allowing for real-time data enrichment every minute, a significant improvement over the previous four-hour interval.
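A sketch of a per-minute clickstream aggregation in that spirit; the session fields and geolocation lookup are placeholders for Intuit’s actual enrichment logic.

```python
import apache_beam as beam
from apache_beam.transforms import window


def lookup_geo(clicks):
    # Placeholder for Intuit's actual geolocation enrichment.
    return "unknown"


def aggregate_clickstream(raw_clicks):
    # `raw_clicks`: unbounded PCollection of click dicts read from Kafka.
    return (
        raw_clicks
        | "MinuteWindows" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyBySession" >> beam.Map(lambda c: (c["session_id"], c))
        | "GroupSession" >> beam.GroupByKey()
        | "Enrich" >> beam.MapTuple(lambda session_id, clicks: {
            "session_id": session_id,
            "event_count": len(list(clicks)),
            "geo": lookup_geo(clicks),
        })
    )
```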
Another critical use case is the feature store ingestion platform, which supports new AI and ML-powered customer experiences. Apache Beam pipelines ingest real-time features generated by other pipelines, writing them to the Intuit feature store for ML model training and inference. The platform also offers a backfill capability, allowing pipelines to process historic data and bootstrap state before switching to streaming context.
Results
Since the launch of Intuit’s Stream Processing Platform, the number of Apache Beam-powered streaming pipelines has doubled annually, with over 160 active production pipelines running on 710 nodes across six Kubernetes clusters as of July 2022. These pipelines handle approximately 17.3 billion events and 82 TB of data, processing 800,000 transactions per second during peak seasons.
Louder for the people in the back:
More than 160 active production pipelines handle approximately 17.3 billion events and 82 TB of data, processing 800,000 transactions per second during peak seasons.
Apache Beam’s abstraction of execution engines allowed Intuit to switch primary runners without rewriting code, future-proofing the platform for evolving execution runtimes. This flexibility democratized stream processing across Intuit’s development teams, enabling engineers to onboard quickly and migrate from batch jobs to streaming applications.
With Apache Beam, Intuit accelerated the development and launch of production-grade streaming data pipelines, reducing the development time from three months to one month. The time to design preproduction pipelines shrank to just 10 days. The migration to Apache Beam streaming pipelines also resulted in a 5x optimization in memory and compute costs. Intuit continues to develop Apache Beam streaming pipelines for new use cases, with 150 more pipelines in preproduction.
4. Using Apache Beam at Lyft
Lyft, Inc. is an American mobility service provider offering ride-hailing, rentals, bike-sharing, food delivery, and business transportation solutions. Operating in the US and Canada, Lyft requires a powerful real-time streaming infrastructure to connect drivers and riders efficiently. Apache Beam has become a critical technology for Lyft, enabling large-scale streaming data processing and machine learning (ML) pipelines.
Democratizing Stream Processing
Initially, Lyft built streaming ETL pipelines using Amazon Kinesis and Apache Flink to process events for their data lake. However, increasing demands for real-time ML models and diverse programming language preferences prompted Lyft to explore Apache Beam in 2019. Its portability across runners, including the Beam Flink runner, and its support for multiple programming languages were key attractions.
By leveraging Apache Beam, Lyft enabled data infrastructure teams to use Java and product teams to use Python, streamlining the creation and execution of streaming pipelines. This flexibility allowed teams to write pipelines comfortably and run them on the Beam Flink runner. Lyft’s Data Platform team built a control plane of in-house services to manage Flink applications on Kubernetes, using a blue/green deployment strategy for critical pipelines and custom macros for better observability and CI/CD integration. The team also developed a lightweight, YAML-based DSL and reusable Apache Beam PTransforms for filtering and enriching events.
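A reusable PTransform of the kind that a YAML spec could configure might look like the sketch below. The class name and config keys are illustrative, not Lyft’s actual DSL.

```python
import apache_beam as beam


class FilterAndEnrich(beam.PTransform):
    """Composite transform for filtering and enriching events,
    meant to be instantiated from a declarative pipeline spec."""

    def __init__(self, event_type, extra_fields):
        super().__init__()
        self.event_type = event_type
        self.extra_fields = extra_fields

    def expand(self, events):
        return (
            events
            | "KeepType" >> beam.Filter(lambda e: e["type"] == self.event_type)
            | "Enrich" >> beam.Map(lambda e: {**e, **self.extra_fields})
        )

# usage: events | FilterAndEnrich("ride_requested", {"region": "us-west"})
```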
Powering Real-time Machine Learning Pipelines
Lyft’s Marketplace team uses Apache Beam to support real-time ML, generating streaming features and executing ML models. They separated Feature Generation and ML Model Execution into multiple pipelines. Apache Beam pipelines generate real-time features and write them to Kafka for model execution. Features are processed using stateful ParDo transforms, and ML models are invoked based on feature availability. The model outputs can feed back into other models, creating a DAG workflow.
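Here is a minimal stateful ParDo in the spirit of those feature pipelines: it keeps a running per-key event count. The key scheme and the feature itself are illustrative, not Lyft’s actual transforms.

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class RunningCount(beam.DoFn):
    """Maintains a running event count per key using Beam's state API."""
    COUNT = ReadModifyWriteStateSpec("count", VarIntCoder())

    def process(self, element, count_state=beam.DoFn.StateParam(COUNT)):
        key, _event = element
        current = (count_state.read() or 0) + 1
        count_state.write(current)
        yield key, current  # emit the updated feature value downstream

# usage (input must be keyed):
# keyed_events | "CountPerKey" >> beam.ParDo(RunningCount())
```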
Processing ~4 million events per minute with sub-second latency, these Apache Beam pipelines allowed Lyft to reduce latency by 60%, simplify code, and onboard new teams and use cases onto streaming.
Amplifying Use Cases
Lyft has used Apache Beam for over 60 use cases, improving real-time user experiences and fulfilling business commitments. For example:
- Map Data Delivery: Transitioned from batch to streaming to identify road closures in real-time, processing ~400k events per second to improve routing and ETA.
- Airport Reporting: Migrated to a streaming pipeline to report pick-ups and drop-offs at airports, improving compliance scores and reducing latency from 5 to 2 seconds.
Lyft’s use of open-source software includes significant contributions to Apache Beam, sharing their integrations and experiences at various industry events.
Results
Apache Beam’s portability enabled Lyft to run mission-critical data pipelines written in non-JVM languages on a JVM-based runner, avoiding code rewrites and reducing development time from days to hours. Full isolation of user code and native CPython execution facilitated easy onboarding and adoption.
Apache Beam’s unified programming model resolved Lyft’s programming language dilemma, allowing both Python and Java usage.
Lyft successfully built and scaled over 60 streaming pipelines, processing events in near-real-time with very low latencies. They plan to leverage Beam SQL and the Go SDK for full Apache Beam multi-language capabilities.
Final Thoughts
Apache Beam’s versatility and power extend far beyond the confines of Google Cloud, offering robust solutions for real-time and batch data processing across various industries. As demonstrated by its successful implementations at Yelp, LinkedIn, Intuit, and Lyft, Apache Beam is capable of handling complex data transformations, ensuring data consistency, and enabling seamless integrations.
If you are considering adopting Apache Beam for your data processing needs, or if you are looking to enhance your current data infrastructure with this powerful technology, I am here to help. With my extensive experience in data engineering and deep knowledge of Apache Beam, I can provide the expertise and support necessary to implement and optimize Apache Beam for your specific use cases.
Contact me to discuss how we can leverage Apache Beam to elevate your data processing capabilities and drive your business forward.
Thank you for reading.