Topic Cheatsheet For GCP's Professional Machine Learning Engineer Beta Exam
Authors/contributors: David Chen, PhD
Credits & disclaimers can be found in the README section of the source repository. Some references are included as %comments or hyperlinks in the source file.
2. M step
Repeat until convergence
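To make the E step / M step loop concrete, a minimal sketch of EM for a two-component 1-D Gaussian mixture using NumPy (the toy data, initial parameters, and iteration cap are illustrative assumptions, not from the cheatsheet):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])  # toy 1-D data

# Initial guesses for mixture weights, means, and variances
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):                                   # repeat until convergence (fixed cap here)
    # E step: posterior responsibility of each component for each point
    dens = pi * gaussian(x[:, None], mu, var)          # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate parameters from the responsibilities
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi, mu, var)                                     # recovered weights, means, variances
```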
Overfitting
Bias-variance trade-off
• Characteristics of loss vs. iteration curves, separately plotted for
  – Training set
  – Validation and/or test set
• Underfitting vs. overfitting patterns

Ways to address overfitting (code sketch below)
1. Get more high-quality, well-labeled training data
2. Regularization
  • L2 penalty
  • L1 (LASSO) penalty
  • Elastic net
3. Ensemble learning
  • Bagging
    – Random Forest: only a randomly chosen subset of 1 ≤ m < M features is used in each split
    – Bagged Trees: all M features are available in each split
  • Boosting (e.g. Gradient Boosted Trees/XGBoost)
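As an illustration of the remedies above, a minimal scikit-learn sketch contrasting the L2, L1, and elastic-net penalties with two ensemble learners (the dataset and hyperparameters are placeholders, not values from the cheatsheet):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

models = {
    "L2 (Ridge)": Ridge(alpha=1.0),
    "L1 (LASSO)": Lasso(alpha=0.1),
    "Elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "Bagging (Random Forest)": RandomForestRegressor(max_features="sqrt", random_state=0),  # m < M features per split
    "Boosting (Gradient Boosted Trees)": GradientBoostingRegressor(random_state=0),
}

for name, model in models.items():
    # Cross-validated R^2; comparing train vs. validation behaviour exposes overfitting
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```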
Recommendation Systems

                        | User info | Domain knowledge
Content-based           |           | ✓
Collaborative Filtering | ✓         |
Knowledge-based         |           | ✓

A hybrid recommendation system uses more than one of the above; although not always feasible, it is generally the preferred solution.
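A minimal collaborative-filtering sketch: factorizing a user-item matrix with NumPy's SVD to score unrated items (the ratings matrix and the number of latent factors are made-up illustrations):

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); rows = users, columns = items
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Fill unrated cells with each user's mean rating before factorizing
means = R.sum(axis=1) / (R > 0).sum(axis=1)
filled = np.where(R > 0, R, means[:, None])

# Low-rank approximation via truncated SVD (k latent factors)
k = 2
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(pred, 1))   # predicted scores, including previously unrated cells
```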
Deep Learning

Subtypes of Neural Networks
• Feed-forward neural network
• Convolutional Neural Network (CNN) & computer vision
• Recurrent Neural Network (RNN)
  – Sequence data (speech/text, time series)
  – Vanishing gradient problem
  – Gated Recurrent Units (GRU)
  – Long short-term memory (LSTM)
  – Application to Natural Language Processing (NLP)
    ∗ Language models
    ∗ Embeddings
    ∗ Architectures (e.g. transformers)
• Autoencoders (deep learning)
  – General architecture
    ∗ Encoding layers
    ∗ Lower-dimensional representation (returned, or used as input for the subsequent autoencoder in a stack)
    ∗ Decoding layers
  – Flavors to address trivial solutions:
    ∗ De-noising autoencoders
    ∗ Sparse autoencoders
  – Applications (sketch below):
    ∗ Data representation (feature engineering)
    ∗ Dimensionality reduction / data compression
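A minimal Keras sketch of the autoencoder architecture above (input size, layer widths, and the noise/sparsity settings are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(784,))
# Encoding layers
h = keras.layers.Dense(128, activation="relu")(inputs)
# Lower-dimensional representation (bottleneck); add
# activity_regularizer=keras.regularizers.l1(1e-5) here for a sparse autoencoder
code = keras.layers.Dense(32, activation="relu")(h)
# Decoding layers
h = keras.layers.Dense(128, activation="relu")(code)
outputs = keras.layers.Dense(784, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)        # reusable for feature engineering / compression
autoencoder.compile(optimizer="adam", loss="mse")

# Plain autoencoder: reconstruct the inputs themselves
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
# De-noising flavor: corrupt the inputs but keep clean targets
# x_noisy = x_train + 0.2 * tf.random.normal(tf.shape(x_train))
# autoencoder.fit(x_noisy, x_train, epochs=10, batch_size=256)
```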
III. Production-level ML with Cloud

MLOps: CI/CD in an ML System

              | DevOps     | Data Engineering | MLOps
Version ctrl. | Code       | Code             | Code, data, model
Pipeline      | -          | Data, ETL        | Training, serving
Validation    | Unit tests | Unit tests       | Model validation
CI/CD         | Production | Data pipeline    | Both

Tools for virtualization:
• Virtual Machines (VMs)
• Containers
  – Clusters
  – Pods
• Kubernetes (K8s)

GCP tools:
• BigQuery (query sketch below)
  – Google-managed data warehouse
  – Highly scalable, fast, optimized
  – Suitable for analysis & storage of structured data
  – Multi-processing enabled
• Cloud Dataprep
  – Managed cloud service for quick data exploration & transformation
  – Auto-scalable; eases the data-preparation process
• Cloud Dataflow: provides serverless, parallel, distributed infrastructure for both batch & stream data processing by making use of Apache Beam™
• Cloud ML APIs
  – Cloud Vision AI
  – Cloud Natural Language
  – Cloud Speech-to-Text
  – Cloud Video Intelligence
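For example, querying BigQuery from Python with the google-cloud-bigquery client (the project, dataset, table, and column names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")             # placeholder project ID

query = """
    SELECT label, COUNT(*) AS n
    FROM `my-project.my_dataset.training_data`             -- placeholder table
    GROUP BY label
"""
for row in client.query(query).result():                   # submits the job and waits for it
    print(row["label"], row["n"])
```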
ML Pipeline Design
The ML code is only a small part of a production-level ML system.
• Identify components, parameters, triggers, compute needs
• Orchestration framework (example DAG below)
  – Cloud Composer (based on an Apache Airflow deployment)
  – GCP App Engine
  – Cloud Storage
  – Cloud Kubernetes Engine
  – Cloud Logging & Monitoring
• Strategies beyond a single cloud:
  – Hybrid Cloud: blend of public & private cloud for mixed computing, storage, & services, allowing for agility (i.e. quick adaptation during business digital transformation)
  – Multi-cloud: use of multiple public cloud vendors for different tasks (*but unlike parallel computing, synchronization across different vendors is NOT essential)
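A minimal Cloud Composer (Apache Airflow) DAG sketch for orchestrating a daily retraining flow, assuming an Airflow 2.x environment; the DAG id, schedule, and bash commands are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retrain_model",                  # placeholder DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",              # pipeline trigger / schedule
    catchup=False,
) as dag:
    validate = BashOperator(task_id="validate_data", bash_command="echo validate")
    train = BashOperator(task_id="train_model", bash_command="echo train")
    deploy = BashOperator(task_id="deploy_model", bash_command="echo deploy")

    validate >> train >> deploy              # execution order of the pipeline steps
```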
Procedures for Implementing a Training Pipeline
• Perform data validation (e.g. via Cloud Dataprep)
• Decouple components with Cloud Build (fully serverless CI/CD platform supporting any language)
  – Adds a layer of technical abstraction
  – Separates content producers & end users
  – Ensures software components are not tightly dependent on one another
• Construct & test a parametrized pipeline definition in an SDK (e.g. gcloud ml-engine)
• Tune compute performance
• Store data & generated artifacts (e.g. binaries, tarballs) via Cloud Storage (sketch below)

                | Type      | Transactional? | Complex queries? | Capacity
Cloud Datastore | NoSQL     | ✓              | ✗                | Terabytes+
Bigtable        | NoSQL     | (limited)      | ✗                | Petabytes+
Cloud Storage   | Blobstore | ✗              | ✗                | Petabytes+
Cloud SQL       | SQL       | ✓              | ✓                | Terabytes
Cloud Spanner   | SQL       | ✓              | ✓                | Petabytes
BigQuery        | SQL       | ✗              | ✓                | Petabytes+
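For instance, storing a generated artifact in Cloud Storage with the google-cloud-storage client (project, bucket, and object names are placeholders):

```python
from google.cloud import storage

client = storage.Client(project="my-project")               # placeholder project ID
bucket = client.bucket("my-ml-artifacts")                    # placeholder bucket

# Upload a trained-model tarball produced by the training step
blob = bucket.blob("models/v1/model.tar.gz")
blob.upload_from_filename("model.tar.gz")

# Later pipeline stages can pull the same artifact back down
blob.download_to_filename("/tmp/model.tar.gz")
```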
Considerations for Implementing the Serving Pipeline
• Model binary options
• Google Cloud serving options (request sketch below)
• Testing for target performance
• Setup of trigger & pipeline schedule
• Deployment with CI/CD (the final step in MLOps), along with
  – A/B testing: Google Optimize
  – Canary testing, automated by GKE with Spinnaker
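A sketch of calling a model deployed on AI Platform for online prediction, following the documented REST/Python client pattern; the project, model, and instance payload are placeholders, and the real input schema depends on the deployed model:

```python
from googleapiclient import discovery

service = discovery.build("ml", "v1")                        # AI Platform (ml.googleapis.com)
name = "projects/my-project/models/my_model"                 # placeholder; append /versions/... to pin one

response = service.projects().predict(
    name=name,
    body={"instances": [{"feature_a": 1.0, "feature_b": 0.5}]},   # placeholder payload
).execute()

print(response.get("predictions"))
```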
ML Solution Monitoring
Considerations in monitoring ML solutions:
1. Monitor the performance/quality of ML model predictions on an ongoing basis (via Cloud Monitoring (Compute Engine) with a metric model), then debug with Cloud Debugger
2. Use robust logging strategies (e.g. Cloud Logging, especially Stackdriver (aka Cloud Operations) with dashboards)
3. Establish continuous evaluation metrics

Troubleshoot ML Solutions:
• Permission issues (IAM)
• Training errors
• Serving errors
• ML system failures/biases (in production)

Tune performance of ML solutions in production
• Simplify (optimize) the input pipeline (sketch below)
  – Reduce data redundancy in the NLP model
  – Utilize Cloud Storage (e.g. object storage)
  – Simplification can take place at various points in the pipeline
• Identify an appropriate retraining policy
  – Under what circumstance(s)? How often? (e.g. when significant deviation or drift is identified; periodically)
  – How? (e.g. batch vs. online learning)
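To make the input-pipeline optimization concrete, a tf.data sketch that parallelizes reads and parsing and overlaps preprocessing with training, reading training data directly from Cloud Storage (the file pattern and feature spec are placeholders):

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_example(record):
    # Placeholder feature spec; replace with the model's real schema
    features = {"x": tf.io.FixedLenFeature([10], tf.float32),
                "y": tf.io.FixedLenFeature([], tf.float32)}
    parsed = tf.io.parse_single_example(record, features)
    return parsed["x"], parsed["y"]

files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")        # object storage input
dataset = (
    files.interleave(tf.data.TFRecordDataset, num_parallel_calls=AUTOTUNE)   # parallel file reads
         .map(parse_example, num_parallel_calls=AUTOTUNE)                    # parallel parsing
         .batch(128)
         .prefetch(AUTOTUNE)                                                 # overlap with the training step
)
```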