Data Engineering Quick Reference

The document provides an overview of key concepts in data engineering, including databases, data warehousing, big data technologies, data processing, data streaming, data visualization, cloud technologies, data governance, data modeling, data integration, data architecture, machine learning, data science, programming languages, and cloud computing services. It lists and defines common tools and technologies used in each of these areas.


Databases
┣ Relational Database : A database that stores data in tables with a defined schema

┣ NoSQL Database : A database that does not use the traditional relational database model (for example, a document, key-value, wide-column, or graph store)

┣ SQL : A language used to interact with relational databases

┣ MongoDB : A popular NoSQL database that stores data in JSON-like documents

┣ Cassandra : A popular wide-column NoSQL database that is designed for high scalability and availability

┣ Redis : An in-memory key-value store used for caching and other high-performance use cases

┗ Amazon RDS : A managed relational database service provided by AWS
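
To make the difference between the relational and document models concrete, here is a minimal sketch using only the Python standard library. SQLite stands in for a relational database and plain Python dictionaries stand in for JSON-like documents; the `users` table, its columns, and the sample records are illustrative assumptions, not part of the original reference.

```python
import sqlite3

# Relational model: the schema is declared up front and enforced on write.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Linus", "Helsinki")],
)

# SQL is the language used to query the table.
london_users = conn.execute(
    "SELECT name FROM users WHERE city = ?", ("London",)
).fetchall()

# A row that violates the schema (NULL in a NOT NULL column) is rejected.
try:
    conn.execute("INSERT INTO users (id, name) VALUES (3, NULL)")
    schema_enforced = False
except sqlite3.IntegrityError:
    schema_enforced = True

# Document model (hypothetical in-memory stand-in for a document store):
# records in the same collection may carry different fields.
documents = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Linus", "tags": ["kernel", "git"]},
]
london_docs = [d["name"] for d in documents if d.get("city") == "London"]
```

The relational side rejects malformed rows at write time, while the document side leaves it to the reader to cope with missing fields (`d.get("city")`).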

Data Warehousing
┣ Data Warehouse : A large, centralized repository of data from various sources used for business intelligence and decision-making

┣ OLAP : Online Analytical Processing, used for analyzing data from a data warehouse

┣ Star Schema : A type of data model used in data warehousing that consists of a central fact table surrounded by dimension tables

┣ Snowflake Schema : A variation of the star schema in which dimension tables are normalized into multiple related tables

┣ Slowly Changing Dimensions (SCD) : A technique used for managing changes to dimensional data over time

┣ ETL : Extract, Transform, Load, the process of extracting data from source systems, transforming it, and loading it into a data warehouse

┗ Amazon Redshift : A cloud-based data warehousing service provided by AWS
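
The star schema described above can be sketched with SQLite from the Python standard library. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample figures are assumptions chosen for illustration; a real warehouse would hold far more columns and rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

-- The central fact table holds measures plus keys into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")

# A typical analytical query: join the fact table to its dimensions
# and aggregate the measure by dimension attributes.
sales_by_category_year = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
```

In a snowflake schema, `dim_product.category` would itself be moved to a separate category table referenced by key, at the cost of one more join.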

BY: Waleed Mousa

Big Data Technologies
┣ Hadoop : An open-source framework for distributed storage and processing of large data sets

┣ Spark : An open-source distributed computing system used for big data processing and analytics

┣ Hive : A data warehousing system built on top of Hadoop for querying and analysis of large data sets

┣ Pig : A high-level platform for creating MapReduce programs used for large-scale data processing

┣ MapReduce : A programming model for processing large data sets across clusters of computers

┣ Impala : A distributed SQL query engine for processing big data sets stored in Hadoop

┣ Kafka : A distributed streaming platform used for building real-time data pipelines and streaming applications

┗ Amazon EMR : A managed big data processing service provided by AWS
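
The MapReduce programming model listed above can be illustrated with a single-process word count. This is only a sketch of the map, shuffle, and reduce phases in plain Python; it does not use Hadoop, and the function names are hypothetical. On a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (key, value) pair for every word in the input record.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse the grouped values for one key into a single result.
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data big clusters", "data pipelines"])
```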

Data Processing
┣ Data Pipeline : A set of processes used to extract, transform, and load data from various sources into a destination system

┣ ETL Tools : Tools used to automate the extraction, transformation, and loading of data

┣ Apache Airflow : An open-source platform used for creating, scheduling, and monitoring data pipelines

┣ AWS Glue : A fully-managed ETL service provided by AWS

┣ Talend : A popular open-source ETL tool used for data integration and management

┗ Data Governance : The process of managing the availability, usability, integrity, and security of data
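
As a rough illustration of the extract, transform, and load steps that make up a data pipeline, the sketch below reads a small CSV string, cleans it, and loads it into an in-memory SQLite table. The CSV contents, the `orders` table, and the function names are assumptions made for the example; tools such as Airflow or Glue add scheduling, retries, and monitoring around steps like these.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the second row is deliberately malformed.
RAW_CSV = "order_id,amount\n1, 19.90\n2,not-a-number\n3,5\n"

def extract(text):
    # Extract: read raw records from the source.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: cast types and drop rows that cannot be parsed.
    clean = []
    for row in rows:
        try:
            clean.append((int(row["order_id"]), float(row["amount"])))
        except ValueError:
            continue
    return clean

def load(rows, conn):
    # Load: write the cleaned rows into the destination table.
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
```

Silently dropping bad rows keeps the sketch short; a production pipeline would more likely route them to a reject table or raise an alert.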

Data Streaming
┣ Data Stream : A continuous flow of data that is processed in real-time

┣ Apache Kafka : A distributed streaming platform used for building real-time data pipelines and streaming applications

┣ Kinesis : A fully-managed data streaming service provided by AWS

┣ Flume : A distributed system for collecting, aggregating, and moving large amounts of log data from different sources to a centralized data store

┣ Spark Streaming : An extension of the Spark API used for processing real-time data streams

┗ Flink : An open-source distributed stream processing framework used for real-time data processing
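
One common way to process a continuous data stream is to group events into fixed-size (tumbling) time windows and emit a result each time a window closes. The sketch below does this with a plain Python generator; it is not the Kafka, Spark, or Flink API, and it assumes events arrive ordered by timestamp. The event tuples and the function name are hypothetical.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) each time a window closes.

    Assumes `events` is an iterable of (timestamp, key) ordered by timestamp.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The event belongs to a later window: emit the finished one.
            yield current_start, dict(counts)
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Flush the last, still-open window when the stream ends.
        yield current_start, dict(counts)

events = [(0, "click"), (3, "view"), (9, "click"), (12, "click"), (25, "view")]
windows = list(tumbling_window_counts(iter(events), window_seconds=10))
```

Because the function consumes an iterator and yields results incrementally, it never needs the whole stream in memory. Real stream processors additionally handle late and out-of-order events, which this sketch ignores.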

Data Visualization
┣ Tableau : A popular data visualization tool used for creating interactive dashboards and reports

┣ Power BI : A business analytics service provided by Microsoft used for creating interactive visualizations and reports

┣ D3.js : A JavaScript library used for creating interactive data visualizations in the browser

┣ ggplot2 : A popular data visualization package for R

┗ matplotlib : A popular data visualization package for Python

Cloud Technologies
┣ AWS : Amazon Web Services, a cloud computing platform provided by Amazon

┣ Azure : A cloud computing platform provided by Microsoft

┣ GCP : Google Cloud Platform, a cloud computing platform provided by Google

┣ Docker : A containerization platform used for packaging and deploying applications

┗ Kubernetes : An open-source container orchestration platform used for automating the deployment, scaling, and management of containerized applications

Data Governance
┣ Data Security : The process of protecting data from unauthorized access to ensure privacy and confidentiality

┣ Data Quality : The process of ensuring data accuracy, consistency, and completeness

┣ Data Lineage : The process of tracking data from its source to its destination, including the transformations applied along the way

┣ Data Discovery : The process of identifying data assets and their relationships

┗ Data Stewardship : The process of managing data assets and their use
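
Data quality checks such as completeness and consistency are often automated. The following sketch flags rows with missing required fields and rows that repeat a key that should be unique; the field names, sample rows, and `quality_report` function are illustrative assumptions rather than any particular tool's API.

```python
def quality_report(rows, required_fields, key_field):
    # Completeness: indexes of rows where a required field is missing or empty.
    incomplete = [
        i for i, row in enumerate(rows)
        if any(row.get(field) in (None, "") for field in required_fields)
    ]
    # Consistency: indexes of rows whose key was already seen earlier.
    seen, duplicates = set(), []
    for i, row in enumerate(rows):
        key = row.get(key_field)
        if key in seen:
            duplicates.append(i)
        seen.add(key)
    return {"incomplete_rows": incomplete, "duplicate_key_rows": duplicates}

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "c@example.com"},
]
report = quality_report(rows, required_fields=["id", "email"], key_field="id")
```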

Data Modeling
┣ Entity-Relationship Model : A data modeling technique used to represent the relationships between entities in a system

┣ Dimensional Modeling : A data modeling technique used in data warehousing for creating optimized data structures

┣ Data Flow Diagrams : A diagrammatic representation of the flow of data through a system

┣ UML : Unified Modeling Language, a standardized language used for object-oriented modeling

┗ ERD Tools : Tools used for creating entity-relationship diagrams and other data modeling diagrams

Data Integration
┣ Data Federation : The process of combining data from multiple sources into a single virtual view

┣ Data Replication : The process of copying data from one database to another, often in near-real time

┣ Data Synchronization : The process of ensuring that data is consistent across multiple systems

┣ Extract, Load, Transform (ELT) : A data integration approach where data is extracted from source systems, loaded into the target system in raw form, and then transformed inside the target system

┗ Change Data Capture (CDC) : A data integration technique where changes in source systems are captured and propagated to target systems in near-real time
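
Change Data Capture can be sketched by comparing two snapshots of a source table and replaying the differences on a target. This snapshot-comparison approach is only one simple variant, chosen here because it needs no database; production CDC tools more commonly read the database's transaction log. The record layout and function names are hypothetical.

```python
def capture_changes(previous, current):
    # Compare two snapshots keyed by primary key and emit change events.
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes

def apply_changes(target, changes):
    # Propagate the captured events to the target system.
    for op, key, row in changes:
        if op == "delete":
            target.pop(key, None)
        else:
            target[key] = row
    return target

source_before = {1: {"name": "Ada"}, 2: {"name": "Linus"}}
source_after = {1: {"name": "Ada L."}, 3: {"name": "Grace"}}

target = dict(source_before)  # the target starts in sync with the old snapshot
changes = capture_changes(source_before, source_after)
apply_changes(target, changes)
```

Only the three change events travel to the target, rather than a full copy of the table, which is the main appeal of CDC over periodic bulk reloads.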

Data Architecture
┣ Data Lake : A storage repository that holds a vast amount of raw data (structured, semi-structured, and unstructured) in its native format

┣ Data Mart : A subset of a data warehouse that is designed for a specific business function or department

┣ Data Hub : A centralized repository of data that serves as a single source of truth for an organization

┣ Data Virtualization : A data integration technique that allows data to be accessed and manipulated in real-time without copying or moving it

┗ Master Data Management (MDM) : The process of creating and maintaining a single, trusted view of key business data

Machine Learning
┣ Supervised Learning : A type of machine learning where the algorithm is trained on labeled data

┣ Unsupervised Learning : A type of machine learning where the algorithm is trained on unlabeled data

┣ Reinforcement Learning : A type of machine learning where the algorithm learns from feedback (rewards or penalties) received while interacting with an environment

┣ Deep Learning : A type of machine learning that uses multi-layer neural networks to model complex relationships in data

┣ TensorFlow : An open-source machine learning framework developed by Google

┣ PyTorch : An open-source machine learning framework developed by Facebook (now Meta)

┗ Scikit-learn : A popular machine learning library for Python
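
To show what "trained on labeled data" means in practice, here is a minimal supervised learning sketch: a one-nearest-neighbor classifier written with the standard library only. It is a toy stand-in for what libraries such as scikit-learn provide; the data points, labels, and function name are made up for illustration.

```python
import math

def nearest_neighbor_predict(train_points, train_labels, query):
    # Predict the label of the training point closest to the query.
    distances = [math.dist(point, query) for point in train_points]
    return train_labels[distances.index(min(distances))]

# Labeled training data: each point comes with a known answer.
train_points = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
train_labels = ["small", "small", "large", "large"]

prediction = nearest_neighbor_predict(train_points, train_labels, (8.5, 8.0))
```

An unsupervised method would receive only `train_points`, without `train_labels`, and would have to discover the two groups on its own.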

Data Science
┣ Statistical Analysis : The process of analyzing data to uncover relationships and patterns

┣ Data Exploration : The process of identifying patterns and trends in data

┣ Predictive Modeling : The process of using data to make predictions about future events

┣ Time Series Analysis : The process of analyzing data that is collected over time

┣ Spatial Analysis : The process of analyzing data that is related to geographic locations

┣ Data Visualization : The process of representing data graphically

┗ Data Mining : The process of discovering patterns and relationships in large datasets

Programming Languages
┣ Python : A popular programming language used for data engineering and machine learning

┣ Java : A popular programming language used for building enterprise-level applications and big data technologies

┣ Scala : A programming language used for building big data technologies and data streaming applications

┣ SQL : A language used for interacting with relational databases

┗ R : A programming language used for statistical computing and data analysis


Cloud Computing Services

┣ EC2 : Elastic Compute Cloud, a virtual server service provided by AWS

┣ S3 : Simple Storage Service, a scalable object storage service provided by AWS

┣ Lambda : A serverless compute service provided by AWS

┣ CloudFormation : A service provided by AWS for modeling and setting up cloud resources

┣ Azure VM : A virtual machine service provided by Azure

┣ Azure Blob Storage : A scalable object storage service provided by Azure

┣ Azure Functions : A serverless compute service provided by Azure

┣ Azure Resource Manager : A service provided by Azure for modeling and setting up cloud resources

┣ GCE : Google Compute Engine, a virtual machine service provided by GCP

┣ Cloud Storage : A scalable object storage service provided by GCP

┣ Cloud Functions : A serverless compute service provided by GCP

┗ Cloud Deployment Manager : A service provided by GCP for modeling and setting up cloud resources


Resources
┣ Data Engineering with Python by Paul Crickard III

┣ Designing Data-Intensive Applications by Martin Kleppmann

┣ Data Engineering Cookbook by Andreas Kretz

┣ Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax

┗ AWS Certified Data Analytics Study Guide by Richard Wentk

Useful Technologies
┣ Apache Airflow : A platform used for creating, scheduling, and monitoring data pipelines

┣ Apache Kafka : A distributed streaming platform used for building real-time data pipelines and streaming applications

┣ Spark : An open-source distributed computing system used for big data processing and analytics

┣ Docker : A containerization platform used for packaging and deploying applications

┗ Kubernetes : An open-source container orchestration platform used for automating the deployment, scaling, and management of containerized applications
