How Difficult is the Google Cloud Data Engineering Exam?
My personal journey + 10 tips to prepare for the exam
Spanish version of this post here
It has been more than two years since I first took the exam, at the request of the company where I worked. Back then I had only a month to prepare. Was that a lot? Too little? What do you think the result was?
Google recommends 3+ years of industry experience, including 1+ year designing and managing solutions with GCP, to be successful with this certification. The truth is that I did not pass. I must confess that the experience marked me so much that over the following months I decided to study every specialization that was available on Coursera at the time. I studied and studied.
And I’m still studying…
Although sometimes life gets in the way: conferences where I am invited to speak, having to learn DevOps, having to pick up a new programming language, building the largest and coolest data community in Latin America, and everything else.
There must be a better way to optimize study time. Based on my research, here are some tips I have gathered from around the web.
Update: In December 2022 I took the exam again and this time I passed!!!
Tip 1: You can’t study everything. Focus on what you will actually be evaluated on
According to Google Cloud, the definition of what a Data Engineer does is as follows:
The Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data. The Data Engineer must be able to design, build, operationalize, secure, and monitor data processing systems, with particular emphasis on security, compliance, scalability, efficiency, reliability, fidelity, flexibility, and portability. Additionally, the Data Engineer must be able to leverage, deploy, and continuously train pre-existing machine learning models.
Having made that clear, the following topics are included in the exam (divided into 4 sections):
Section 1. Designing data processing systems
- Mapping storage systems to business requirements
- Data modeling
- Trade-offs involving latency, throughput, transactions
- Distributed systems
- Schema design
- Data publishing and visualization (e.g., BigQuery)
- Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka); see the Pub/Sub sketch after this list
- Online (interactive) vs. batch predictions
- Job automation and orchestration (e.g., Cloud Composer)
- Choice of infrastructure
- System availability and fault tolerance
- Use of distributed systems
- Capacity planning
- Hybrid cloud and edge computing
- Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
- At least once, in-order, and exactly once, etc., event processing
- Awareness of current state and how to migrate a design to a future state
- Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
- Validating a migration
A lot? This is just section 1!
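To make the batch/streaming and message-broker items above a little more concrete, here is a minimal sketch of publishing events to Pub/Sub with the Python client library. It is only a sketch: the project and topic names are hypothetical, and it assumes the google-cloud-pubsub package is installed and credentials are configured.

```python
from google.cloud import pubsub_v1

# Hypothetical project and topic names; replace with your own.
PROJECT_ID = "my-gcp-project"
TOPIC_ID = "sensor-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Publish a few messages. Pub/Sub delivers them at-least-once,
# so downstream consumers (e.g., a Dataflow pipeline) should be idempotent.
for i in range(3):
    data = f"reading-{i}".encode("utf-8")
    future = publisher.publish(topic_path, data=data, source="demo-sensor")
    print(f"Published message ID: {future.result()}")
```

On the consuming side, this is typically where a streaming Dataflow (Apache Beam) pipeline reads from a subscription, which is the kind of batch-versus-streaming trade-off the topic list above is getting at.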
Section 2. Building and operationalizing data processing systems
- Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
- Storage costs and performance
- Life cycle management of data (see the Cloud Storage sketch after this list)
- Data cleansing
- Batch and streaming
- Transformation
- Data acquisition and import
- Integrating with new data sources
- Provisioning resources
- Monitoring pipelines
- Adjusting pipelines
- Testing and quality control
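As a taste of the storage-cost and lifecycle items above, here is a sketch of configuring lifecycle rules on a Cloud Storage bucket with the Python client library. The bucket name is hypothetical; the rules move objects to Coldline after 90 days and delete them after a year, a common pattern for cutting the cost of keeping raw data around.

```python
from google.cloud import storage

client = storage.Client()
# Hypothetical bucket name; replace with your own.
bucket = client.get_bucket("my-analytics-raw-data")

# Move objects to the cheaper Coldline class after 90 days,
# then delete them once they are a year old.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()

for rule in bucket.lifecycle_rules:
    print(rule)
```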
Section 3. Operationalizing machine learning models
- ML APIs (e.g., Vision API, Speech API)
- Customizing ML APIs (e.g., AutoML Vision, AutoML Text)
- Conversational experiences (e.g., Dialogflow)
- Ingesting appropriate data
- Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML); see the BigQuery ML sketch after this list
- Continuous evaluation
- Distributed vs. single machine
- Use of edge compute
- Hardware accelerators (e.g., GPU, TPU)
- Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
- Impact of dependencies of machine learning models
- Common sources of error (e.g., assumptions about data)
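To illustrate the BigQuery ML item above, here is a hedged sketch of training and evaluating a simple logistic regression model with CREATE MODEL, run through the BigQuery Python client. The dataset, the tables, and the `churned` label column are all hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model inside BigQuery (no data leaves the warehouse).
train_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT * FROM `my_dataset.customer_training_data`
"""
client.query(train_sql).result()  # wait for the training job to finish

# Evaluate the model on a (hypothetical) held-out table.
eval_sql = """
SELECT * FROM ML.EVALUATE(
  MODEL `my_dataset.churn_model`,
  (SELECT * FROM `my_dataset.customer_eval_data`)
)
"""
for row in client.query(eval_sql).result():
    print(dict(row.items()))
```

ML.EVALUATE returns metrics such as precision, recall, and ROC AUC, which ties directly to the "evaluation metrics" terminology item above.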
Section 4. Ensuring solution quality
- Identity and access management (e.g., Cloud IAM); see the sketch after this section
- Data security (encryption, key management)
- Ensuring privacy (e.g., Data Loss Prevention API)
- Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
- Building and running test suites
- Pipeline monitoring (e.g., Cloud Monitoring)
- Assessing, troubleshooting, and improving data representations and data processing infrastructure
- Resizing and autoscaling resources
- Performing data preparation and quality control (e.g., Dataprep)
- Verification and monitoring
- Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
- Choosing between ACID, idempotent, eventually consistent requirements
- Mapping to current and future business requirements
- Designing for data and application portability (e.g., multicloud, data residency requirements)
- Data staging, cataloging, and discovery
More detail here.
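For the IAM item in section 4, here is a small sketch of granting read-only access on a Cloud Storage bucket with the Python client, following the least-privilege principle. The bucket name and the analyst group are hypothetical.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-analytics-raw-data")  # hypothetical bucket

# Grant a (hypothetical) analyst group read-only access to objects.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"group:data-analysts@example.com"},
    }
)
bucket.set_iam_policy(policy)

for binding in policy.bindings:
    print(binding["role"], sorted(binding["members"]))
```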
Tip 2: Take notes
Don’t limit yourself to just reading or watching videos. Apply more active learning by taking notes in a notebook or with tools like Remnote.
At first you will notice that you spend a lot of time taking notes, but the knowledge will stick far longer than if you stay in listening-only mode. Remember, it’s not about watching as many videos as possible in the least amount of time. There will be time to watch videos at 2x speed later; in fact, that works best when you already know the subject and just want a quick review.
Tip 3: Study for free here!
Google wants you to get certified, so it offers free training. I don’t know how long this will be available, so go ahead and take advantage of it:
Tip 4: Read books
This book is required reading:
Official Google Cloud Certified Professional Data Engineer Study Guide by Dan Sullivan.
Tip 5: Read about Best Practices
- Best practices for Cloud Storage
- Best practices for Dataproc
- Best practices for Spanner
- Best practices for BigQuery
- Best practices for importing from MySQL
- Best practices for importing from PostgreSQL
- Best practices for importing from SQL Server
- Best practices for Bigtable
- Best practices for IAM
- Windowing in Apache Beam
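Since windowing shows up in both the courses and the exam, here is a minimal Apache Beam sketch of applying fixed one-minute windows to timestamped events. The events are made up; in a real streaming pipeline they would come from Pub/Sub instead of beam.Create.

```python
import apache_beam as beam
from apache_beam.transforms import window

# Made-up (event, unix_timestamp) pairs standing in for a streaming source.
events = [("page_view", 1.0), ("click", 20.0), ("page_view", 65.0)]

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(events)
        # Attach event-time timestamps so windowing has something to work with.
        | "AddTimestamps" >> beam.Map(lambda e: window.TimestampedValue(e[0], e[1]))
        # Group events into fixed, non-overlapping 60-second windows.
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))
        | "CountPerElement" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )
```

Without the WindowInto step this would emit ("page_view", 2) once; with fixed windows it emits ("page_view", 1) for each window, because counts are computed per window rather than over the whole dataset.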
Tip 6: Get down to business with Qwiklabs badges (now called Google Cloud Skills Boost)
Qwiklabs is a learning platform that combines theory and practice through labs that give you temporary access to GCP without using your personal account. A series of labs makes up a quest, and this particular quest is what you need to prepare for the exam:
Sometimes there are events where credits are given away, so stay tuned.
Tip 7: Take the preparation program on Coursera
This program consists of the following 6 courses:
- Google Cloud Big Data and Machine Learning Fundamentals
- Modernizing Data Lakes and Data Warehouses with Google Cloud
- Building Batch Data Pipelines on GCP
- Building Resilient Streaming Analytics Systems on Google Cloud
- Smart Analytics, Machine Learning, and AI on GCP
- Preparing for the Google Cloud Professional Data Engineer Exam
Preparing for Google Cloud Certification: Cloud Data Engineer Professional Certificate on Coursera.
Tip 8: Get familiar with the format of the exam questions and the content that may come to you
For each case, you have to select one option. When you finish answering all the questions, you get feedback on why each option is right or wrong. Here comes the trick: write down the topics where you did not do well (for example, Kubernetes) and study them more intensively.
Professional Data Engineer Sample Questions
Tip 9: Check out the resources available in the Learning and Certifications Hub
Tip 10: Schedule that exam!
Finally, I once heard that a good technique to make yourself study seriously is to schedule the exam. Give yourself a deadline. That way you will feel the pressure of an exam date and will not let time slip by.
Do you have any other tips? Tell me in the comments.
For more information, I recommend the following:
- More interesting courses listed here
Thanks for reading! Do you want more?
Hit the like button 50 times and something wonderful will happen.
- 👉Follow me for more nerdy talks!
- 👉Follow Data Engineering Latam for more content related to Data Engineering, Data Science and Data Management.