Topic Cheatsheet For GCP's Professional Machine Learning Engineer Beta Exam
Authors/contributors: David Chen, PhD
Credits & disclaimers can be found in the README section of the source repository. Some references are included as %comments or hyperlinks in the source file.
2. M step
Repeat until convergence
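To make the E step / M step loop concrete, a minimal sketch of EM for a two-component 1-D Gaussian mixture using NumPy (the toy data, initial parameters, and iteration cap are illustrative assumptions, not from the cheatsheet):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])  # toy 1-D data

# Initial guesses for mixture weights, means, and variances
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):                                   # repeat until convergence (fixed cap here)
    # E step: posterior responsibility of each component for each point
    dens = pi * gaussian(x[:, None], mu, var)          # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate parameters from the responsibilities
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi, mu, var)                                     # recovered weights, means, variances
```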
Overfitting
Bias-variance trade-off
• Characteristics of loss vs. iteration curves, separately plotted for
  – Training set
  – Validation and/or test set
• Underfitting vs. overfitting patterns

Ways to address overfitting (code sketch below)
1. Get more high-quality, well-labeled training data
2. Regularization
  • L2 penalty
  • L1 (LASSO) penalty
  • Elastic net
3. Ensemble learning
  • Bagging
    – Random Forest: only a randomly chosen subset of 1 ≤ m < M features is used in each split
    – Bagged Trees: all M features are available in each split
  • Boosting (e.g. Gradient Boosted Trees/XGBoost)
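As an illustration of the remedies above, a minimal scikit-learn sketch contrasting the L2, L1, and elastic-net penalties with two ensemble learners (the dataset and hyperparameters are placeholders, not values from the cheatsheet):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

models = {
    "L2 (Ridge)": Ridge(alpha=1.0),
    "L1 (LASSO)": Lasso(alpha=0.1),
    "Elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "Bagging (Random Forest)": RandomForestRegressor(max_features="sqrt", random_state=0),  # m < M features per split
    "Boosting (Gradient Boosted Trees)": GradientBoostingRegressor(random_state=0),
}

for name, model in models.items():
    # Cross-validated R^2; comparing train vs. validation behaviour exposes overfitting
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```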
Recommendation Systems

                        | User info | Domain knowledge
Content-based           |           | ✓
Collaborative Filtering | ✓         |
Knowledge-based         |           | ✓

A hybrid recommendation system uses more than one of the above; although not always feasible, it is generally the preferred solution.
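A minimal collaborative-filtering sketch: factorizing a user-item matrix with NumPy's SVD to score unrated items (the ratings matrix and the number of latent factors are made-up illustrations):

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); rows = users, columns = items
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Fill unrated cells with each user's mean rating before factorizing
means = R.sum(axis=1) / (R > 0).sum(axis=1)
filled = np.where(R > 0, R, means[:, None])

# Low-rank approximation via truncated SVD (k latent factors)
k = 2
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(pred, 1))   # predicted scores, including previously unrated cells
```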
Deep Learning

Subtypes of Neural Networks
• Feed-forward neural network
• Convolutional Neural Network (CNN) & computer vision
• Recurrent Neural Network (RNN)
  – Sequence data (speech/text, time series)
  – Vanishing gradient problem
  – Gated Recurrent Units (GRU)
  – Long short-term memory (LSTM)
  – Application to Natural Language Processing (NLP)
    ∗ Language models
    ∗ Embeddings
    ∗ Architectures (e.g. transformers)
• Autoencoders (deep learning)
  – General architecture
    ∗ Encoding layers
    ∗ Lower-dimensional representation (returned, or used as input for the subsequent autoencoder in a stack)
    ∗ Decoding layers
  – Flavors to address trivial solutions:
    ∗ De-noising autoencoders
    ∗ Sparse autoencoders
  – Applications (sketch below):
    ∗ Data representation (feature engineering)
    ∗ Dimensionality reduction / data compression
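A minimal Keras sketch of the autoencoder architecture above (input size, layer widths, and the noise/sparsity settings are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(784,))
# Encoding layers
h = keras.layers.Dense(128, activation="relu")(inputs)
# Lower-dimensional representation (bottleneck); add
# activity_regularizer=keras.regularizers.l1(1e-5) here for a sparse autoencoder
code = keras.layers.Dense(32, activation="relu")(h)
# Decoding layers
h = keras.layers.Dense(128, activation="relu")(code)
outputs = keras.layers.Dense(784, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)        # reusable for feature engineering / compression
autoencoder.compile(optimizer="adam", loss="mse")

# Plain autoencoder: reconstruct the inputs themselves
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
# De-noising flavor: corrupt the inputs but keep clean targets
# x_noisy = x_train + 0.2 * tf.random.normal(tf.shape(x_train))
# autoencoder.fit(x_noisy, x_train, epochs=10, batch_size=256)
```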
III. Production-level ML with Cloud

MLOps: CI/CD in an ML System

              | DevOps     | Data Engineering | MLOps
Version ctrl. | Code       | Code             | Code, data, model
Pipeline      | -          | Data, ETL        | Training, serving
Validation    | Unit tests | Unit tests       | Model validation
CI/CD         | Production | Data pipeline    | Both

Tools for virtualization:
• Virtual Machines (VMs)
• Containers
  – Clusters
  – Pods
• Kubernetes (K8s)

GCP tools:
• BigQuery (query sketch below)
  – Google-managed data warehouse
  – Highly scalable, fast, optimized
  – Suitable for analysis & storage of structured data
  – Multi-processing enabled
• Cloud Dataprep
  – Managed cloud service for quick data exploration & transformation
  – Auto-scalable; eases the data-preparation process
• Cloud Dataflow: provides serverless, parallel, distributed infrastructure for both batch & stream data processing by making use of Apache Beam™
• Cloud ML APIs
  – Cloud Vision AI
  – Cloud Natural Language
  – Cloud Speech-to-Text
  – Cloud Video Intelligence
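For example, querying BigQuery from Python with the google-cloud-bigquery client (the project, dataset, table, and column names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")             # placeholder project ID

query = """
    SELECT label, COUNT(*) AS n
    FROM `my-project.my_dataset.training_data`             -- placeholder table
    GROUP BY label
"""
for row in client.query(query).result():                   # submits the job and waits for it
    print(row["label"], row["n"])
```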
ML Pipeline Design
The ML code is only a small part of a production-level ML system.
• Identify components, parameters, triggers, compute needs
• Orchestration framework (example DAG below)
  – Cloud Composer (based on an Apache Airflow deployment)
  – GCP App Engine
  – Cloud Storage
  – Cloud Kubernetes Engine
  – Cloud Logging & Monitoring
• Strategies beyond a single cloud:
  – Hybrid Cloud: blend of public & private cloud for mixed computing, storage, & services, allowing for agility (i.e. quick adaptation during business digital transformation)
  – Multi-cloud: use of multiple public cloud vendors for different tasks (*but unlike parallel computing, synchronization across different vendors is NOT essential)
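A minimal Cloud Composer (Apache Airflow) DAG sketch for orchestrating a daily retraining flow, assuming an Airflow 2.x environment; the DAG id, schedule, and bash commands are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retrain_model",                  # placeholder DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",              # pipeline trigger / schedule
    catchup=False,
) as dag:
    validate = BashOperator(task_id="validate_data", bash_command="echo validate")
    train = BashOperator(task_id="train_model", bash_command="echo train")
    deploy = BashOperator(task_id="deploy_model", bash_command="echo deploy")

    validate >> train >> deploy              # execution order of the pipeline steps
```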
Procedures for Implementing a Training Pipeline
• Perform data validation (e.g. via Cloud Dataprep)
• Decouple components with Cloud Build (fully serverless CI/CD platform supporting any language)
  – Adds a layer of technical abstraction
  – Separates content producers & end users
  – Ensures software components are not tightly dependent on one another
• Construct & test a parametrized pipeline definition in an SDK (e.g. gcloud ml-engine)
• Tune compute performance
• Store data & generated artifacts (e.g. binaries, tarballs) via Cloud Storage (sketch below)

                | Type      | Transactional? | Complex queries? | Capacity
Cloud Datastore | NoSQL     | ✓              | ✗                | Terabytes+
Bigtable        | NoSQL     | (limited)      | ✗                | Petabytes+
Cloud Storage   | Blobstore | ✗              | ✗                | Petabytes+
Cloud SQL       | SQL       | ✓              | ✓                | Terabytes
Cloud Spanner   | SQL       | ✓              | ✓                | Petabytes
BigQuery        | SQL       | ✗              | ✓                | Petabytes+
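For instance, storing a generated artifact in Cloud Storage with the google-cloud-storage client (project, bucket, and object names are placeholders):

```python
from google.cloud import storage

client = storage.Client(project="my-project")               # placeholder project ID
bucket = client.bucket("my-ml-artifacts")                    # placeholder bucket

# Upload a trained-model tarball produced by the training step
blob = bucket.blob("models/v1/model.tar.gz")
blob.upload_from_filename("model.tar.gz")

# Later pipeline stages can pull the same artifact back down
blob.download_to_filename("/tmp/model.tar.gz")
```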
Considerations for Implementing the Serving Pipeline
• Model binary options
• Google Cloud serving options (request sketch below)
• Testing for target performance
• Setup of trigger & pipeline schedule
• Deployment with CI/CD (the final step in MLOps), along with
  – A/B testing: Google Optimize
  – Canary testing, automated by GKE with Spinnaker
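A sketch of calling a model deployed on AI Platform for online prediction, following the documented REST/Python client pattern; the project, model, and instance payload are placeholders, and the real input schema depends on the deployed model:

```python
from googleapiclient import discovery

service = discovery.build("ml", "v1")                        # AI Platform (ml.googleapis.com)
name = "projects/my-project/models/my_model"                 # placeholder; append /versions/... to pin one

response = service.projects().predict(
    name=name,
    body={"instances": [{"feature_a": 1.0, "feature_b": 0.5}]},   # placeholder payload
).execute()

print(response.get("predictions"))
```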
ML Solution Monitoring
Considerations in monitoring ML solutions:
1. Monitor the performance/quality of ML model predictions on an ongoing basis (via Cloud Monitoring (Compute Engine) with a metric model), then debug with Cloud Debugger
2. Use robust logging strategies (e.g. Cloud Logging, especially Stackdriver (aka Cloud Operations) with dashboards)
3. Establish continuous evaluation metrics

Troubleshoot ML Solutions:
• Permission issues (IAM)
• Training errors
• Serving errors
• ML system failures/biases (in production)

Tune performance of ML solutions in production
• Simplify (optimize) the input pipeline (sketch below)
  – Reduce data redundancy in the NLP model
  – Utilize Cloud Storage (e.g. object storage)
  – Simplification can take place at various points in the pipeline
• Identify an appropriate retraining policy
  – Under what circumstance(s)? How often? (e.g. when significant deviation or drift is identified; periodically)
  – How? (e.g. batch vs. online learning)
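To make the input-pipeline optimization concrete, a tf.data sketch that parallelizes reads and parsing and overlaps preprocessing with training, reading training data directly from Cloud Storage (the file pattern and feature spec are placeholders):

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_example(record):
    # Placeholder feature spec; replace with the model's real schema
    features = {"x": tf.io.FixedLenFeature([10], tf.float32),
                "y": tf.io.FixedLenFeature([], tf.float32)}
    parsed = tf.io.parse_single_example(record, features)
    return parsed["x"], parsed["y"]

files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")        # object storage input
dataset = (
    files.interleave(tf.data.TFRecordDataset, num_parallel_calls=AUTOTUNE)   # parallel file reads
         .map(parse_example, num_parallel_calls=AUTOTUNE)                    # parallel parsing
         .batch(128)
         .prefetch(AUTOTUNE)                                                 # overlap with the training step
)
```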