02 Introduction

The document provides an overview of data engineering and Apache Spark. It discusses what data engineers do, reference architectures for data engineering platforms, and introduces Apache Spark and Databricks.

ScholarNest

Introduction to Data Engineering


Data Engineers – Reference Architecture
Image Source: Google Cloud Documentation

What do they do?
• Develop and Manage Operational Systems
• Banking Apps
• E-Commerce Apps
• OTT Applications
• IoT Applications

What do they do?
• Collect data
• Transform
• Quality Check
• Standardize
• Prepare/Model
• Facilitate Consumption

What do they do?
• Optimize
• Fraud Prevention
• Grow
• Recommendations
• Monitor/Report
• Sales/Revenue
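
The collect, transform, quality check, and standardize activities listed above can be sketched as a chain of small functions. This is a minimal illustration in plain Python, not a Spark job: the `Order` record, the sample rows, and all function names are hypothetical and chosen only to show how the stages hand data to one another.

```python
from dataclasses import dataclass


@dataclass
class Order:
    """Hypothetical standardized record produced by the pipeline."""
    order_id: str
    amount: float
    currency: str


def collect():
    """Collect: stand-in for reading raw records from an operational system."""
    return [
        {"order_id": "A1", "amount": "120.50", "currency": "usd"},
        {"order_id": "A2", "amount": "n/a", "currency": "USD"},
        {"order_id": "A3", "amount": "75", "currency": " eur "},
    ]


def transform(raw):
    """Transform: parse string amounts to floats; unparseable ones become None."""
    out = []
    for rec in raw:
        try:
            amount = float(rec["amount"])
        except ValueError:
            amount = None
        out.append({**rec, "amount": amount})
    return out


def quality_check(records):
    """Quality check: separate usable records from rejected ones."""
    valid = [r for r in records if r["amount"] is not None]
    rejected = [r for r in records if r["amount"] is None]
    return valid, rejected


def standardize(records):
    """Standardize: trim and upper-case currency codes, emit typed records."""
    return [
        Order(r["order_id"], r["amount"], r["currency"].strip().upper())
        for r in records
    ]


def run_pipeline():
    """Run the stages in order and return (standardized orders, rejects)."""
    valid, rejected = quality_check(transform(collect()))
    return standardize(valid), rejected


orders, rejected = run_pipeline()
```

Keeping rejected rows instead of silently dropping them is one possible design choice; it lets a later monitoring or reporting step see how much data failed the check.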
Data Engineering Platform – Reference Architecture

Data Engineering Functions

Approaches (Batch/Stream-RT/NRT)
Lakehouse Medallion Architecture
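
The slide names the medallion architecture without detailing it. As commonly described, data moves through three layers: bronze (raw, as ingested), silver (cleaned and deduplicated), and gold (aggregated for consumption). The sketch below mimics that flow with plain Python lists; the event fields and the `to_silver` and `to_gold` helpers are hypothetical, and a real lakehouse would store each layer as tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw events kept as ingested, including duplicates and bad rows.
bronze = [
    {"event_id": 1, "customer": "c1", "amount": "10.0"},
    {"event_id": 1, "customer": "c1", "amount": "10.0"},  # duplicate delivery
    {"event_id": 2, "customer": "c2", "amount": "bad"},   # unparseable amount
    {"event_id": 3, "customer": "c1", "amount": "5.5"},
]


def to_silver(rows):
    """Silver: drop duplicate event ids and rows whose amount cannot be parsed."""
    seen = set()
    silver = []
    for row in rows:
        if row["event_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        seen.add(row["event_id"])
        silver.append(
            {"event_id": row["event_id"], "customer": row["customer"], "amount": amount}
        )
    return silver


def to_gold(rows):
    """Gold: aggregate cleaned events into total amount per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```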

Introduction to Apache Spark


What is Apache Spark?

Apache Spark is an engine for executing data engineering, stream processing, and machine learning on distributed clusters.

Capabilities
• ANSI SQL
• Batch Processing API
• Stream Processing API
• Graph Processing API
• Machine Learning API

Who is using Apache Spark?

Thousands of companies, including 80% of the Fortune 500, use Apache Spark.
What is Apache Spark – A Unified Framework

Programming API/DSL

Spark Framework

Resource Manager (YARN | Standalone | Kubernetes)

Compute Cluster

Distributed Storage (HDFS | S3 | ADLS | GCS)
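
In practice, the resource manager and storage layers of this stack usually show up as strings: the master URL a Spark application is launched with, and the URI scheme of the paths it reads. The sketch below classifies both. The master URL forms (`yarn`, `spark://...`, `k8s://...`, `local[...]`) and the `hdfs`, `s3a`, `abfs`/`abfss`, and `gs` schemes are the conventional ones; the helper names and the host, bucket, and account names in the usage lines are hypothetical.

```python
from urllib.parse import urlparse


def resource_manager(master: str) -> str:
    """Map a Spark master URL to the resource manager it selects."""
    if master == "yarn":
        return "YARN"
    if master.startswith("spark://"):
        return "Standalone"
    if master.startswith("k8s://"):
        return "Kubernetes"
    if master == "local" or master.startswith("local["):
        return "Local (no cluster)"
    raise ValueError(f"unrecognized master URL: {master!r}")


# URI scheme -> storage system named on the slide.
STORAGE_BY_SCHEME = {
    "hdfs": "HDFS",
    "s3a": "S3",
    "abfs": "ADLS",
    "abfss": "ADLS",
    "gs": "GCS",
}


def storage_system(path: str) -> str:
    """Map a data path to its distributed storage system via the URI scheme."""
    scheme = urlparse(path).scheme
    try:
        return STORAGE_BY_SCHEME[scheme]
    except KeyError:
        raise ValueError(f"unrecognized storage scheme in: {path!r}") from None


# Usage with made-up endpoints and paths.
managers = [
    resource_manager(m)
    for m in ("yarn", "spark://master-host:7077", "k8s://https://k8s-api:6443", "local[*]")
]
stores = [
    storage_system(p)
    for p in (
        "hdfs://namenode:8020/data/orders",
        "s3a://example-bucket/data/orders",
        "abfss://container@account.dfs.core.windows.net/data/orders",
        "gs://example-bucket/data/orders",
    )
]
```

Because the application code only sees these strings, the same Spark program can in principle move between resource managers and storage systems by changing configuration rather than logic.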


Why Apache Spark?

• Abstraction
• Ease of use
• Unified Platform
• Open Source
• Wide Ecosystem

Spark Framework

Resource Manager (YARN | Standalone | Kubernetes)

Compute Cluster

Distributed Storage (HDFS | S3 | ADLS | GCS)


Missing features from Apache Spark

Data Storage Infrastructure

ACID Transaction capabilities

Metadata Catalog

Cluster Management

Automation APIs and Tools


Spark Platforms

Cloudera Hadoop Platform


Amazon EMR
Azure HDInsight
Google Cloud Dataproc
Databricks Platform

Introduction to Databricks
Databricks Features

Spark as Cloud-Native Technology


Secure Cloud Storage Integration
ACID Transaction via Delta Lake Integration
Unity Catalog for Metadata Management
Cluster Management
Photon Query Engine
Notebooks and Workspace
Administration Controls
Optimized Spark Runtime
Automation Tools
Databricks Cloud – Key Integrations

Service | Azure | AWS | GCP
CI/CD | Azure DevOps, GitHub Enterprise | AWS CodeBuild, AWS CodeDeploy, AWS CodePipeline | Google Cloud Build, Google Cloud Deploy
Data warehouse | Azure Synapse Analytics | Amazon Redshift | BigQuery
Data integration | Azure Data Factory | AWS Glue, AWS Data Pipeline | Google Cloud Data Fusion
Messaging | Azure Service Bus, Azure Event Hubs | Amazon Kinesis, Amazon SNS, Amazon SQS | Google Pub/Sub
Workflow orchestration | Azure Data Factory | AWS Data Pipeline, AWS Glue, Apache Airflow | Cloud Composer
Document data | Azure Cosmos DB | Amazon DocumentDB | Firestore
NoSQL - Key/Value | Azure Cosmos DB | Amazon DynamoDB | Cloud Bigtable
RDBMS | Azure SQL Database | Amazon Aurora, Amazon RDS | Cloud SQL
Storage transfer | Azure Data Factory, Azure Storage Mover | AWS Storage Gateway, AWS DataSync | Storage Transfer Service
Network connectivity | Azure Virtual Private Network | AWS Virtual Private Network | Cloud VPN
Audit logging | Azure Audit Logs | AWS CloudTrail | Cloud Audit Logs
Key management | Azure Key Vault | AWS KMS | Cloud KMS
Identity | Azure Identity Management | AWS IAM | Google Cloud IAM
Storage | Azure Blob Storage - ADLS Gen2 | Amazon S3 | Google Cloud Storage
