Data Engineering Quick Reference
Databases
┣ Relational Database : A database that stores data in tables with a defined
schema
┣ NoSQL Database : A database that does not use the traditional relational
database model
┣ Redis : An in-memory key-value store used for caching and other high-
performance use cases
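To make the contrast between a defined schema and a key-value cache concrete, here is a minimal sketch using only the Python standard library. It assumes an in-memory SQLite table as the relational side and a plain dict as a stand-in for a store such as Redis. The table, the key format, and the function name `get_user_name` are hypothetical and chosen for illustration only.

```python
import sqlite3

# Relational side: the table has a defined schema that the database enforces.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Ada"))

# Key-value side: a plain dict stands in for a store such as Redis (assumption).
cache = {}


def get_user_name(user_id):
    """Cache-aside lookup: try the cache first, fall back to the database."""
    key = f"user:{user_id}:name"
    if key in cache:
        return cache[key]
    row = conn.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    cache[key] = row[0]  # remember the value for later lookups
    return row[0]


print(get_user_name(1))  # first call reads the table, later calls hit the cache
```

A real deployment would normally add an expiry time to cached entries so that stale values eventually drop out. That detail is left out here to keep the sketch short.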
Data Warehousing
┣ Data Warehouse : A large, centralized repository of data from various
sources used for business intelligence and decision-making
┣ OLAP : Online Analytical Processing, used for analyzing data from a data
warehouse
┣ Star Schema : A type of data model used in data warehousing that consists
of a central fact table surrounded by dimension tables
┣ ETL : Extract, Transform, Load, the process of moving data from source
systems into a data warehouse
┣ Hive : A data warehousing system built on top of Hadoop for querying and
analysis of large data sets
┗ Impala : A distributed SQL query engine for processing big data sets
stored in Hadoop
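The star schema and OLAP entries above can be illustrated with a small sketch. It assumes an in-memory SQLite database in place of a real warehouse. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

-- The central fact table holds measures plus keys into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")

# A typical analytical query: join the fact table to its dimensions,
# then aggregate the measure by dimension attributes.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()

for category, year, total in rows:
    print(category, year, total)
```

The same query shape (fact table joined to dimensions, then grouped) is what engines such as Hive or Impala would run at much larger scale, although their SQL dialects differ in detail.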
Data Processing
┣ Data Pipeline : A set of automated processes that move data from various
sources into a destination system, typically extracting, transforming, and
loading it along the way
┣ Talend : A popular open-source ETL tool used for data integration and
management
┣ Spark Streaming : An extension of the Spark API used for processing real-
time data streams
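The extract, transform, and load stages of a data pipeline can be sketched as three small functions. This is a toy example under stated assumptions: the inline CSV text, the `payments` table, and the cleaning rules are all hypothetical, and an in-memory SQLite database stands in for the destination system.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from files, APIs, or queues.
RAW_CSV = "name,amount\n alice ,10\nbob,not_a_number\nCarol,5.5\n"


def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip malformed records
        clean.append((row["name"].strip().title(), amount))
    return clean


def load(records, conn):
    """Load: write the cleaned records into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT name, amount FROM payments ORDER BY name"
).fetchall()
print(loaded)
```

Keeping each stage as a separate function makes it easier to test stages in isolation and to swap one source or destination for another.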
Data Visualization
┣ Tableau : A popular data visualization tool used for creating interactive
dashboards and reports
Data Governance
┣ Data Security : The practice of protecting data from unauthorized access,
alteration, or loss, preserving its privacy and confidentiality
┣ Data Lineage : The process of tracking data from its source to its
destination
┣ Data Stewardship : The process of managing data assets and their use
┗ ERD Tools : Tools used for creating entity-relationship diagrams and other
data modeling diagrams
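Data lineage, as defined above, amounts to recording which datasets each dataset was derived from and then walking those links back to the source. The sketch below is one minimal way to do that in plain Python. The dataset names and the helpers `record` and `upstream` are hypothetical and not taken from any lineage tool.

```python
# Maps each dataset to the datasets it was directly derived from.
lineage = {}


def record(output, inputs):
    """Note that `output` was produced from the given input datasets."""
    lineage[output] = list(inputs)


def upstream(dataset):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = []
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(lineage.get(current, []))
    return sorted(seen)


record("clean_orders", ["raw_orders"])
record("daily_revenue", ["clean_orders", "exchange_rates"])

print(upstream("daily_revenue"))
# prints ['clean_orders', 'exchange_rates', 'raw_orders']
```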
Data Integration
┣ Data Federation : The process of combining data from multiple sources into
a single virtual view
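A small sketch of the "single virtual view" idea: two separate databases are attached to one connection, and a temporary view joins them at query time without copying rows into a central store. This assumes SQLite as a stand-in for real federated sources. The schema names `crm` and `billing`, the tables, and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two independent in-memory databases play the role of separate source systems.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO crm.customers VALUES (1, 'Ada'), (2, 'Linus')")
conn.execute(
    "INSERT INTO billing.invoices VALUES (1, 40.0), (1, 60.0), (2, 15.0)"
)

# The view stores no data of its own; it is evaluated against both sources
# each time it is queried. A TEMP view is used because it may reference
# tables in more than one attached database.
conn.execute("""
    CREATE TEMP VIEW customer_totals AS
    SELECT c.name AS name, SUM(i.total) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
""")

totals = conn.execute(
    "SELECT name, total FROM customer_totals ORDER BY name"
).fetchall()
print(totals)
```

Because the view is virtual, a row inserted into `billing.invoices` afterwards shows up the next time `customer_totals` is queried.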
Machine Learning
┣ Supervised Learning : A type of machine learning where the algorithm is
trained on labeled data
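As a minimal illustration of training on labeled data, the sketch below uses a 1-nearest-neighbor classifier written in plain Python (it needs Python 3.8 or later for `math.dist`). The feature values and the labels "small" and "large" are made up, and this is only one of many supervised algorithms.

```python
import math

# Labeled training data: (features, label) pairs. Values are invented.
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]


def predict(point):
    """Return the label of the closest training example (1-nearest-neighbor)."""
    _, label = min(training, key=lambda example: math.dist(example[0], point))
    return label


print(predict((2.0, 1.0)))  # prints small
print(predict((7.5, 8.0)))  # prints large
```

The labels are what make this supervised: the algorithm never decides what "small" or "large" means, it only generalizes from the examples it was given.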
Programming Languages
┣ Python : A popular programming language used for data engineering and
machine learning
┣ Scala : A programming language used for building big data technologies and
data streaming applications
Resources
┣ Data Engineering with Python by Paul Crickard III
Useful Technologies
┣ Apache Airflow : A platform used for creating, scheduling, and monitoring
data pipelines
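Schedulers of this kind typically model a pipeline as a directed acyclic graph of tasks and run each task only after its upstream tasks have finished. The sketch below shows that ordering idea with the standard library's `graphlib` (Python 3.9 or later). It is not Airflow's API. The task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_pipeline(deps, tasks):
    """Run every task in an order that respects the dependency graph."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()  # call the task only after its upstream tasks ran
        order.append(name)
    return order


log = []
# Placeholder tasks that just record their own name when called.
tasks = {name: (lambda n=name: log.append(n)) for name in dependencies}

order = run_pipeline(dependencies, tasks)
print(order)  # prints ['extract', 'transform', 'load', 'report']
```

A production scheduler adds what this sketch omits, such as time-based triggers, retries, and monitoring of each run.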