AWS Services
1. Data Storage
Amazon S3:
o Scalable object storage for data lakes, backups, and big data analytics.
Amazon Redshift:
o Fully managed data warehouse for analytics on large datasets.
Amazon DynamoDB:
o Fully managed NoSQL key-value database with millisecond latency.
Amazon Aurora:
o MySQL- and PostgreSQL-compatible relational database.
2. Data Processing
AWS Glue:
o Fully managed ETL service for discovering, cleaning, and moving data.
Amazon EMR:
o Big data processing with frameworks like Apache Spark, Hive, and Hadoop.
AWS Lambda:
o Run code in response to events without managing servers.
3. Orchestration
AWS Step Functions:
o Visual workflow service for orchestrating Lambda functions and other AWS
services.
4. Analytics and Visualization
Amazon Athena:
o Serverless SQL queries directly on data stored in S3.
AWS QuickSight:
o Business intelligence service for dashboards and reports.
5. Application Integration
Amazon EventBridge:
o Event bus that connects applications through events.
6. Machine Learning
Amazon SageMaker:
o Build, train, and deploy machine learning models.
7. Monitoring and Auditing
Amazon CloudWatch:
o Monitor metrics, logs, and alarms for your resources.
AWS CloudTrail:
o Record API calls and account activity for auditing.
Key Features of Amazon S3
o Cost-Effective:
1. Pay-as-you-go pricing.
2. Multiple storage classes to optimize costs based on access patterns.
o Security:
1. Data encryption (server-side and client-side).
2. Fine-grained access control using IAM policies and Bucket Policies.
o Performance:
1. Low-latency performance: the ability to access or retrieve data with
minimal delay. For Amazon S3, this means fast response times when
uploading, downloading, or listing objects.
o Versioning:
1. Maintain multiple versions of the same object in an S3 bucket for data
protection and recovery.
2. Protects against accidental overwrites and deletions by retaining
previous versions of objects.
o Lifecycle Policies:
1. Automate data management by transitioning objects to different
storage classes or deleting them after a specified time.
o Event Notifications:
1. Trigger actions (e.g., AWS Lambda functions) when specific events
occur (e.g., object creation).
o Replication:
1. Cross-Region Replication (CRR) and Same-Region Replication (SRR)
for redundancy.
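The lifecycle-policy idea above can be sketched as the configuration shape that boto3 (the AWS SDK for Python) accepts for `put_bucket_lifecycle_configuration`; the bucket name, prefix, and day thresholds below are illustrative assumptions, not values from this document.

```python
# Sketch of an S3 lifecycle configuration; the prefix and day counts
# are illustrative placeholders.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire",
            "Filter": {"Prefix": "logs/"},  # apply only to objects under logs/
            "Status": "Enabled",
            "Transitions": [
                # Move to the infrequent-access class after 30 days...
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # ...then to Glacier after 90 days.
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Delete the object entirely after one year.
            "Expiration": {"Days": 365},
        }
    ]
}

# With real AWS credentials this would be applied as:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket",
#     LifecycleConfiguration=lifecycle_configuration,
# )
```

Each rule pairs a filter (which objects) with transitions and an expiration (what happens to them over time), which is exactly the "automate data management" behavior described above.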
Advanced Features
1. Multipart Upload:
o Efficiently upload large objects by splitting them into smaller parts.
2. Replication:
o Automatically replicate objects to another bucket (Cross-Region or Same-
Region).
3. Object Lock:
o Protect objects from deletion for regulatory compliance (WORM – Write
Once, Read Many).
4. Storage Lens:
o Gain insights and recommendations on storage usage and activity.
5. Event Notifications:
o Integrate with services like AWS Lambda, SQS, or SNS to trigger actions based
on events.
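The splitting step behind multipart upload (feature 1 above) can be sketched in plain Python: break a payload into fixed-size parts, each tagged with a part number. A real upload would send each part with boto3's `upload_part`; the 5 MiB constant mirrors S3's minimum size for non-final parts.

```python
# Minimal sketch of the splitting step behind S3 multipart upload.
PART_SIZE = 5 * 1024 * 1024  # 5 MiB, S3's minimum for non-final parts

def split_into_parts(data: bytes, part_size: int = PART_SIZE):
    """Yield (part_number, chunk) pairs, numbered from 1 as S3 expects."""
    for number, start in enumerate(range(0, len(data), part_size), start=1):
        yield number, data[start:start + part_size]

# Example with a tiny part size so the split is easy to see:
parts = list(split_into_parts(b"abcdefghij", part_size=4))
# parts == [(1, b"abcd"), (2, b"efgh"), (3, b"ij")]
```

After all parts are uploaded, S3 reassembles them in part-number order, which is why efficient uploads of large objects are possible even over unreliable connections (a failed part can be retried alone).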
Use Cases for Amazon S3
1. Data Lake Storage:
o Central repository for structured and unstructured data for analytics.
2. Backup and Restore:
o Secure, scalable backups for applications and databases.
3. Static Website Hosting:
o Host static websites (HTML, CSS, JS) directly from S3.
4. Big Data Analytics:
o Store large datasets for analysis using tools like Amazon Redshift or Athena.
5. Media Storage and Distribution:
o Store and distribute images, videos, and documents globally.
6. Log Storage:
o Store logs for analysis, auditing, and compliance.
Key Features of Amazon DynamoDB
1. NoSQL Database:
o Fully managed key-value and document database.
2. Flexible Schema:
o Items in the same table can have different attributes.
3. Scalability:
o Handles virtually any level of request traffic and data volume.
4. Low Latency:
o Provides consistent, low-latency reads and writes (usually within
milliseconds).
5. High Availability:
o Data is automatically replicated across multiple availability zones for fault
tolerance.
6. Serverless:
o No need to manage servers or instances. Pay for read/write capacity or on-
demand usage.
7. Global Tables:
o Enables multi-region replication for globally distributed applications.
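As a concrete illustration of the key-value model described above, this is the low-level attribute-value format DynamoDB's `PutItem` API accepts; the table, key, and attribute names are illustrative assumptions.

```python
# Sketch of a DynamoDB item in the low-level attribute-value format.
# Table and attribute names are illustrative placeholders.
item = {
    "pk":     {"S": "user#123"},  # partition key (string)
    "sk":     {"S": "profile"},   # sort key (string)
    "score":  {"N": "42"},        # numbers are sent as strings
    "active": {"BOOL": True},
}

# With real AWS credentials this would be written as:
# import boto3
# boto3.client("dynamodb").put_item(TableName="users", Item=item)
```

Every attribute is wrapped in a type tag (`S` for string, `N` for number, `BOOL` for boolean), which is how DynamoDB stays schema-flexible: only the key attributes are fixed per table.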
Amazon Aurora vs. Standard MySQL/PostgreSQL (RDS)
Performance: Aurora is up to 5x faster than MySQL and 3x faster than
PostgreSQL; the standard engines offer standard MySQL/PostgreSQL
performance.
Availability: Aurora provides Multi-AZ replication by default; with the
standard engines, Multi-AZ is optional.
The AWS Glue Data Catalog is a centralized metadata repository within the AWS
ecosystem that helps you manage and organize information (metadata) about your
data assets. It's a core component of AWS Glue, which is AWS's fully managed ETL
(Extract, Transform, Load) service.
What is Metadata?
Metadata is information that describes your data, for example:
Table names
Partitioning information
Schema versions
The AWS Glue Data Catalog keeps track of this metadata, making it easier to discover,
manage, and analyze your data.
1. Centralized Repository:
o Stores metadata for data assets across different AWS services like S3,
Redshift, RDS, and DynamoDB.
2. Schema Management:
o Keeps track of the schema structure (columns, data types, etc.) for each
table.
3. Automatic Schema Discovery:
o AWS Glue Crawlers can scan your data sources and automatically infer the
schema and create metadata entries in the Data Catalog.
4. Version Control:
o Tracks changes to schemas over time, helping you manage schema evolution.
5. Partition Support:
o Stores partition information so query engines can scan only the data they
need.
6. Security:
o Integrated with AWS Identity and Access Management (IAM) for fine-grained
access control.
How the AWS Glue Data Catalog Works
1. Crawling:
o AWS Glue Crawlers scan data sources (like S3 buckets) and automatically
create or update metadata tables in the Data Catalog.
2. Metadata Storage:
o The catalog stores details such as table names and partition information.
3. ETL Workflows:
o AWS Glue ETL jobs use the Data Catalog to read source data, transform it, and
write it to target destinations.
4. Querying and Analysis:
o Use Amazon Athena to run SQL queries on S3 data using the Data Catalog for
schema definitions.
o Use with Amazon EMR for Apache Spark and Hive jobs.
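A catalog entry is, in the end, just structured metadata. The sketch below shows the `TableInput` shape that AWS Glue's `create_table` API accepts (a crawler would normally build this for you); the database, table, column names, and S3 location are illustrative assumptions.

```python
# Sketch of the metadata a Glue Data Catalog table holds.
# All names and the S3 location are illustrative placeholders.
table_input = {
    "Name": "sales",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "amount",   "Type": "double"},
        ],
        # Where the actual data files live:
        "Location": "s3://my-example-bucket/sales/",
    },
    # Partition columns let engines like Athena prune the data they scan.
    "PartitionKeys": [{"Name": "sale_date", "Type": "string"}],
}

# Registered manually it would look like:
# import boto3
# boto3.client("glue").create_table(
#     DatabaseName="analytics", TableInput=table_input)
```

Note that the catalog stores only this description (schema, location, partitions), never the data itself; the data stays in S3 or the source database.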
Benefits of the AWS Glue Data Catalog
1. Centralized Metadata:
o Single source of truth for all your data assets across multiple services.
2. Ease of Use:
o Crawlers and automatic schema discovery reduce manual cataloging work.
3. Scalability:
o Grows with your data; there is no capacity for you to manage.
4. Cost Efficiency:
o Pay only for the resources you use (AWS Glue charges are based on crawlers,
ETL jobs, and data processing).
AWS Glue is a fully managed ETL (Extract, Transform, Load) service provided by Amazon
Web Services (AWS). It simplifies the process of preparing and transforming data for
analytics, machine learning, and application development by automating the steps required
to discover, clean, and move data between data sources and data targets.
AWS Glue handles the heavy lifting of infrastructure management, enabling you to focus on
developing data transformation workflows rather than managing servers or ETL
infrastructure.
Key Components of AWS Glue
1. Data Catalog:
o Central repository of metadata about your data sources, tables, and
schemas.
2. Crawlers:
o Automatically scan data sources (e.g., Amazon S3, RDS) to infer schemas and
populate the Data Catalog with metadata.
3. ETL Jobs:
o Scripts (in Python or Scala) that extract data from sources, transform it,
and load it into targets.
4. Triggers:
o Used to schedule ETL jobs based on a time schedule or events (e.g., new data
arrival).
5. Workflows:
o Combine crawlers, jobs, and triggers into multi-step pipelines.
6. Development Endpoints:
o Allow you to interactively develop and debug ETL scripts using your
preferred IDE (e.g., Jupyter Notebook).
How AWS Glue Works
1. Extract:
o Read data from sources such as:
Amazon S3
Amazon Redshift
DynamoDB
On-premises databases
2. Transform:
o Clean and prepare the data (e.g., filtering, aggregating, joining datasets).
3. Load:
o Write the prepared data to a target such as Amazon S3, Redshift, or RDS.
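The Extract / Transform / Load steps above can be sketched in plain Python over in-memory records; a real Glue job would do the same thing with PySpark against S3 or a database, but the shape of the work is identical. The records below are made up for illustration.

```python
# Extract: read raw records (hard-coded here instead of S3/RDS).
raw_rows = [
    {"region": "east", "amount": "100"},
    {"region": "east", "amount": "50"},
    {"region": "west", "amount": None},  # dirty row with a missing value
]

# Transform: drop dirty rows, cast types, and aggregate per region.
clean = [r for r in raw_rows if r["amount"] is not None]
totals = {}
for row in clean:
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

# Load: write the result to the target (here, just a variable).
result = totals
# result == {"east": 150}
```

The dirty "west" row is filtered out in the transform step, which is the same cleaning work Glue performs before data lands in a warehouse or data lake.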
Key Features of AWS Glue
1. Fully Managed:
o No infrastructure to provision, configure, or maintain.
2. Automatic Schema Discovery:
o Crawlers automatically detect the schema of data and update the Data
Catalog.
3. Code Generation:
o AWS Glue can automatically generate ETL code based on the data source and
target.
4. Scalability:
o Compute resources scale automatically with the size of your data.
5. Broad Integration:
o Integrates with a wide range of AWS services and on-premises data sources.
6. Flexible Programming:
o Write or customize ETL logic in Python or Scala.
7. Serverless Execution:
o Pay only for the compute time used by your ETL jobs (billed per second).
Use Cases for AWS Glue
1. Database Migration:
o Move and reshape data between databases as part of a migration.
2. Event-Driven ETL:
o Trigger ETL jobs automatically when new data arrives (e.g., a file lands
in S3).
AWS Lambda is a service that lets you run code without managing servers. It automatically
runs your code when certain events happen, and you only pay for the time your code runs.
1. No Servers to Manage:
o You don't have to set up, maintain, or scale servers. AWS takes care of
everything.
2. Event-Driven:
o Your code runs automatically in response to events (e.g., a file uploaded
to S3, an API request).
3. Pay-per-Use:
o You are billed only for the exact time your code runs (measured in
milliseconds). No charges when it's idle.
4. Automatic Scaling:
o Lambda runs as many copies of your function in parallel as needed to handle
incoming events.
Real-Time Examples
1. Resize Images:
o Automatically resize images when they are uploaded to an S3 bucket.
2. Send Notifications:
o Send an email or SMS (e.g., via Amazon SNS) when an event occurs.
3. API Backend:
o Handle API requests through Amazon API Gateway without running servers.
Analogy
Think of AWS Lambda like a light switch:
The light (code) only turns on when you flip the switch (trigger an event).
You only pay for the electricity (compute time) while the light is on.
Summary
AWS Lambda runs your code automatically in response to events, with no
servers to manage.
You pay only for what you use, making it cost-efficient and scalable.
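A Lambda function is just a handler that receives an event. The sketch below is shaped like the S3 "object created" trigger described above: the event structure mirrors what S3 sends, and the handler simply extracts the bucket and key (the function name and return shape are illustrative).

```python
# A minimal Lambda-style handler for an S3 "object created" event.
def handler(event, context=None):
    """Return which object triggered the invocation."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {"status": "processed", "bucket": bucket, "key": key}

# Simulated invocation (in AWS, Lambda calls handler for you on each event):
sample_event = {
    "Records": [{"s3": {"bucket": {"name": "uploads"},
                        "object": {"key": "photo.jpg"}}}]
}
print(handler(sample_event))
# {'status': 'processed', 'bucket': 'uploads', 'key': 'photo.jpg'}
```

The point of the sketch is the inversion of control: you never call `handler` yourself in production; the event source does, once per event, and Lambda scales the number of concurrent invocations for you.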
AWS Data Pipeline is a service that helps you move, process, and transform data between
different storage and processing services. It acts like a manager that organizes and schedules
tasks to ensure your data flows smoothly from one place to another.
1. Orchestration Service:
o It organizes and schedules tasks to move and process data automatically.
2. Data Movement and Transformation:
o Transfers data from one place to another (e.g., from S3 to RDS) and performs
transformations (e.g., cleaning, filtering).
3. Fully Managed:
o AWS handles the infrastructure and resources for running the tasks.
How AWS Data Pipeline Works
1. Define the Pipeline:
o Specify where the data comes from (source), where it should go (destination),
and any transformation tasks in between.
2. Set a Schedule:
o Set up a schedule (e.g., daily, weekly) for when the pipeline should run.
3. Pipeline Execution:
o AWS Data Pipeline runs the tasks according to your schedule and monitors
the progress.
Real-Time Examples
1. Daily Data Load:
o Scenario: You have sales data in an S3 bucket and need to load it into an RDS
database every day for analysis.
o Solution: AWS Data Pipeline automates this daily transfer and loads the data
into RDS.
Analogy
Think of AWS Data Pipeline like a delivery schedule manager:
The service ensures everything happens on time and notifies you if something goes
wrong.
Summary
AWS Data Pipeline helps you automate the movement and transformation of data.
Ideal for repetitive tasks like transferring, processing, and backing up data.
You can schedule workflows to ensure data flows automatically between services like
S3, RDS, DynamoDB, and more.
AWS Database Migration Service (DMS) is a service that helps you move databases from
one place to another (usually to AWS) with minimal downtime (the period when a
system, service, or application is unavailable). It makes the migration process
simple and smooth, without interrupting your applications for long periods.
1. Migrate Databases:
o Moves your databases to AWS (e.g., to Amazon RDS or Aurora) or between
database engines.
2. Minimal Downtime:
o The source database keeps serving your application while the migration runs.
3. Broad Database Support:
o Works with many databases like MySQL, PostgreSQL, Oracle, SQL Server, and
MongoDB.
4. Fully Managed:
o AWS handles the infrastructure, so you don’t need to worry about servers or
resources.
5. Continuous Data Replication:
o Keeps data updated during the migration to ensure everything stays in sync.
Real-Time Examples
1. Moving to the Cloud:
o Scenario: You want to move an on-premises database to Amazon RDS without
taking your application offline.
o Solution: AWS DMS migrates the database to RDS while your application
keeps running. Downtime is minimized.
Analogy
You want to move your furniture (database) from your old home (on-premises) to
your new home (AWS).
The movers (AWS DMS) carefully transfer your belongings while you still use your old
home until everything is ready.
Summary
AWS DMS helps you move databases to AWS with very little downtime.
Ideal for scenarios like upgrading databases, moving to the cloud, or keeping
databases in sync.
Fully managed by AWS, making the migration process simple and efficient.
1. ETL Pipelines:
o Build pipelines that extract, transform, and load data.
2. Data Integration:
o Makes it easier to combine data from various sources into one place.
3. Fully Managed:
o AWS runs the underlying infrastructure; there are no servers to manage.
4. Broad Integration:
o Works with S3, RDS, Redshift, DynamoDB, and even on-premises databases.
5. Code Generation:
o Automatically creates Python or Scala code for your ETL jobs, reducing the
need to write complex code.
Real-Time Examples
1. Database Migration:
o Scenario: You want to move data from your on-premises Oracle database to
Amazon RDS.
o Solution: Use AWS Glue to extract data, transform it into the right format,
and load it into RDS.
2. Data Preparation for Machine Learning:
o Scenario: You have messy data with missing values in Amazon S3.
o Solution: AWS Glue can clean and transform the data so it’s ready for training
machine learning models in Amazon SageMaker.
Analogy
Think of AWS Glue like a blender:
You add raw ingredients (data from different sources).
The blender mixes and processes the ingredients (transforms the data).
You get a smooth final product (organized data ready for use).
Summary
AWS Glue helps you create ETL pipelines to move, clean, and organize data.
It integrates with various sources like S3, RDS, DynamoDB, and more.
Ideal for tasks like building data lakes, preparing data for analytics, and cleaning
data for machine learning.
AWS Step Functions is a service that helps you create and visualize workflows to coordinate
multiple tasks. It is particularly useful for orchestrating tasks that use AWS services like
Lambda, ECS, S3, and DynamoDB.
Step Functions lets you build workflows using a visual interface and ensures tasks run in the
correct order, with automatic retries and error handling.
1. Visual Workflow:
o You can design and see your workflow visually, making it easier to
understand.
2. State Machines:
o Workflows are defined as state machines, where each step (or state)
represents a task.
5. Serverless:
o No servers to manage; AWS runs and scales the workflow engine for you.
Analogy
Think of AWS Step Functions like a flowchart for your application:
Each box in the flowchart is a task (e.g., a Lambda function).
The flowchart ensures tasks are executed in the right order and handles what
happens if something goes wrong.
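Step Functions state machines are written in the Amazon States Language (ASL), a JSON format. The sketch below shows a two-step workflow with a retry, matching the "tasks in order, with automatic retries" idea above; the state names and Lambda ARNs are illustrative placeholders.

```python
import json

# Sketch of a state machine definition in Amazon States Language (ASL).
# State names and Lambda ARNs are illustrative placeholders.
state_machine = {
    "Comment": "Validate an order, then charge it",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate",
            # Retry transient failures automatically.
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
            "Next": "ChargeCard",
        },
        "ChargeCard": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge",
            "End": True,
        },
    },
}

# The JSON string is what you would pass when creating the state machine.
definition = json.dumps(state_machine)
```

`StartAt` and each state's `Next`/`End` fields encode the arrows of the flowchart, and the `Retry` block is the built-in error handling the section describes.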
Amazon OpenSearch Service is a fully managed service that makes it easy to deploy, run, and
scale OpenSearch and Elasticsearch clusters for searching, analyzing, and visualizing large
amounts of data in real time. It’s ideal for use cases like log analytics, full-text search, and
monitoring application performance.
1. Search Engine:
o Allows you to search large datasets quickly and efficiently.
2. Log Analytics:
o Helps you analyze logs from servers, applications, and systems to monitor
and troubleshoot issues.
3. Fully Managed:
o AWS handles setting up, managing, scaling, and securing the infrastructure,
so you can focus on your data.
4. Visualization:
o Includes dashboards for building charts and visuals from your data.
5. Scalability:
o Can handle large amounts of data and scale as your needs grow.
Real-Time Examples
1. Website Search:
o Scenario: Customers need fast, relevant search on your e-commerce site.
o Solution: Use OpenSearch to power fast and relevant product search results.
Analogy
If you have millions of books (data), OpenSearch helps you find exactly what you
need instantly and shows you charts and visuals to understand trends in the data.
Summary
Amazon OpenSearch Service helps you search, analyze, and visualize large datasets.
AWS QuickSight is a business intelligence (BI) service that lets you create interactive
dashboards, charts, and reports to visualize and analyze your data. It helps turn raw data
into easy-to-understand insights.
1. Data Visualization:
o Turn raw data into charts, graphs, and dashboards.
2. Fully Managed:
o No BI servers or infrastructure to set up and maintain.
3. Interactive Dashboards:
o Users can interact with dashboards, filter data, and drill down into details.
5. Data Sources:
o Connects to many sources, such as:
Amazon S3
Amazon RDS
Redshift
On-premises databases
6. Pay-per-Session Pricing:
o Only pay for what you use, making it cost-effective for occasional users.
Analogy
Think of AWS QuickSight like a dashboard for your car:
The dashboard shows you important information (speed, fuel level, etc.).
Instead of raw data (numbers), it presents the data in visual formats so you can
quickly understand and make decisions.
Summary
AWS QuickSight helps you visualize and analyze data through interactive charts and
dashboards.
Fully managed and easy to connect with various data sources like S3, RDS, and
Redshift.
Amazon EventBridge is a service that helps you connect different applications using events.
It acts as an event bus that listens for events and triggers workflows or actions when
something happens.
1. Event Bus:
o An event bus is like a message board where different applications can post
events and listen for events.
2. Event-Driven:
o Actions are triggered automatically when events happen, instead of on a
fixed schedule.
3. Fully Managed:
o AWS runs the event bus infrastructure for you.
4. Wide Integration:
o Works with services like Lambda, S3, EC2, and external apps like Salesforce or
Zendesk.
5. Rule-Based Actions:
o You define rules to decide what actions should be taken when specific events
occur.
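The rule-based matching described above can be sketched as a tiny matcher: an EventBridge-style pattern lists the allowed values for each field, and an event matches when every listed field carries one of those values. Real EventBridge patterns support more operators (prefixes, numeric ranges, and so on); this sketch covers only the basic exact-match form, and the rule and event below are made up for illustration.

```python
# Minimal sketch of EventBridge-style exact-match rule evaluation.
def matches(pattern: dict, event: dict) -> bool:
    for field, allowed in pattern.items():
        if isinstance(allowed, dict):            # nested field: recurse
            if not isinstance(event.get(field), dict):
                return False
            if not matches(allowed, event[field]):
                return False
        elif event.get(field) not in allowed:    # leaf: list of allowed values
            return False
    return True

# A rule that fires on S3 PutObject events:
rule = {"source": ["aws.s3"], "detail": {"eventName": ["PutObject"]}}
event = {"source": "aws.s3", "detail": {"eventName": "PutObject"}}
print(matches(rule, event))  # True
```

When a rule matches, EventBridge invokes the rule's targets (a Lambda function, an SQS queue, etc.); the matcher above is only the decision half of that flow.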
Real-Time Examples
1. Automatic Notifications:
o When a new order event is posted to the bus, a rule can trigger a Lambda
function that sends a notification.
Analogy
Think of Amazon EventBridge like a mail-sorting room:
When new mail (an event) arrives, the system reads the address (the event rule) and
sends it to the correct department (application or workflow) for action.
Summary
Amazon EventBridge connects applications through events and rules.
AWS manages the event bus, so you don’t have to worry about infrastructure.
Amazon CloudWatch is a service that helps you monitor your AWS resources and
applications. It collects data like metrics, logs, and events so you can keep an eye on the
health and performance of your systems.
Key Points in Simple Terms
1. Monitoring:
o Collects metrics (e.g., CPU usage, request counts) from your AWS resources
and applications.
2. Alarms:
o Set up alerts to get notified if something goes wrong (e.g., high CPU usage,
low disk space).
3. Dashboards:
o Create visual dashboards to see all your key metrics in one place.
4. Log Collection:
o Collect and store logs from applications and AWS services in CloudWatch
Logs.
5. Automatic Actions:
o Trigger actions (e.g., scaling, notifications) automatically when an alarm
goes off.
Real-Time Examples
1. Website Monitoring:
o Get alerted when your website's error rate or response time spikes.
2. Server Performance:
o Track CPU, memory, and disk usage of your EC2 instances.
3. Log Analysis:
o Search application logs to troubleshoot issues.
Summary
Amazon CloudWatch helps you monitor the health and performance of your AWS
resources.
Provides alerts, dashboards, and log collection to keep your systems running
smoothly.
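The alarm idea above can be sketched as the decision a CloudWatch alarm makes: compare recent datapoints against a threshold, and alarm only when enough consecutive evaluation periods breach it. The sample values and period counts below are illustrative.

```python
# Sketch of CloudWatch-style alarm evaluation.
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return 'ALARM' if the last `evaluation_periods` datapoints all
    exceed `threshold`, otherwise 'OK'."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) == evaluation_periods and all(v > threshold for v in recent):
        return "ALARM"
    return "OK"

cpu = [35.0, 52.0, 91.0, 95.0, 97.0]  # CPU utilization samples (%)
print(alarm_state(cpu, threshold=90.0, evaluation_periods=3))  # ALARM
print(alarm_state(cpu, threshold=90.0, evaluation_periods=5))  # OK
```

Requiring several consecutive breaching periods (rather than a single spike) is what keeps alarms from firing on momentary noise, which is the same trade-off you configure on a real CloudWatch alarm.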
AWS CloudTrail records and tracks all the API calls and actions made in your AWS account. It
helps you know who did what and when in your AWS environment.
1. Tracks Activity:
o Logs every action made by users, applications, or AWS services (e.g., creating
an S3 bucket, deleting a file).
3. Automatic Logging:
o Records account activity automatically and can deliver log files to an S3
bucket.
4. Compliance:
o Provides an audit trail that helps meet regulatory and security
requirements.
Real-Time Examples
1. Security Audit:
o Find out who deleted a resource or changed a permission, and when.
2. Troubleshooting:
o Trace the sequence of API calls that led to an unexpected change.
3. Compliance Reports:
o Generate reports to show compliance with security policies.
Analogy
Think of AWS CloudTrail like a security camera for your AWS account:
It records who comes in, what they do, and when they do it.
If something goes wrong, you can review the footage to find out what happened.
Summary
AWS CloudTrail logs all actions and API calls in your AWS account.
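Answering "who did what and when" means reading a CloudTrail log record. The record below is a trimmed, illustrative version of the real JSON format (real records carry many more fields); the sketch just extracts the three answers.

```python
import json

# A trimmed, illustrative CloudTrail log record.
record_json = """
{
  "eventTime": "2024-05-01T12:34:56Z",
  "eventName": "DeleteBucket",
  "userIdentity": {"type": "IAMUser", "userName": "alice"},
  "awsRegion": "us-east-1"
}
"""

record = json.loads(record_json)
who = record["userIdentity"]["userName"]   # who
what = record["eventName"]                 # did what
when = record["eventTime"]                 # and when
print(f"{who} called {what} at {when}")
# alice called DeleteBucket at 2024-05-01T12:34:56Z
```

In practice records like this are delivered in batches to S3, where the same parsing can run at scale with Athena or a log-analysis tool.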
AWS IAM is a service that helps you manage who can access your AWS services and what
they can do. It ensures that only the right people have the right access.
1. User Management:
o Create and manage users and groups for your AWS account.
2. Access Control:
o Decide which AWS services and resources each user can access.
3. Permissions:
o Set rules to control what each user can do (e.g., read-only, full access).
4. Security:
o Supports multi-factor authentication (MFA) and temporary credentials for
extra protection.
Real-Time Examples
1. Employee Access:
o Give your developers access to EC2 instances but restrict access to billing
information.
2. Role-Based Access:
o Let an application on an EC2 instance access S3 through a role instead of
stored keys.
3. Temporary Access:
o Give a contractor credentials that expire automatically after a set time.
Analogy
Think of AWS IAM like the security system of an office building:
Keys and Badges: Control who can enter different rooms (resources).
Permissions: Some people can enter all rooms (admin access), while others can enter
only certain rooms (limited access).
Summary
AWS IAM helps you securely manage who can access AWS services and what they
can do.
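Permissions in IAM are expressed as JSON policy documents. The sketch below grants read-only access to a single S3 bucket; the bucket name is an illustrative placeholder, while the `Version`, `Effect`, `Action`, and `Resource` fields are the standard policy structure.

```python
# Sketch of an IAM policy document: read-only access to one S3 bucket.
# The bucket name is an illustrative placeholder.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Read-only actions: list the bucket and fetch its objects.
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::my-example-bucket",     # the bucket itself
                "arn:aws:s3:::my-example-bucket/*",   # objects inside it
            ],
        }
    ],
}
```

Attached to a user, group, or role, this policy is the "limited access badge" from the analogy: it opens exactly the listed rooms and nothing else, because IAM denies anything not explicitly allowed.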
The AWS Glue Data Catalog is a central repository that keeps track of information (called
metadata) about your data. It helps you organize and find data across different locations,
like databases and Amazon S3 buckets.
1. Stores Metadata:
o Keeps details about your data, such as table names, schemas, and partition
information.
2. Unified View:
o Provides a single place to search and manage all your data, whether it's
structured (like databases) or unstructured (like text files).
3. Automatic Discovery:
o Uses crawlers to automatically scan your data sources and add metadata to
the catalog.
4. Integration:
o Works with services like Amazon Athena, Redshift Spectrum, and AWS Glue
ETL for querying and transforming data.
Real-Time Example
1. Data Discovery:
o Instead of searching through different databases and files, use the Glue Data
Catalog to quickly find the data you need.
Analogy
Think of the AWS Glue Data Catalog like the index of a library:
The index (catalog) tells you where each book (data) is located and what it contains.
You can quickly find and use the data without manually searching through shelves.
Summary
AWS Glue Data Catalog is a metadata repository that helps you organize and find
data easily.
It supports automatic discovery and works with AWS services for data analysis and
processing.
AWS Lake Formation is a service that helps you set up, secure, and manage a data
lake on Amazon S3.
1. Simplified Setup:
o Makes it easier to collect, organize, and load large amounts of data into a
central data lake in S3.
2. Access Control:
o Helps you manage who can access the data and what they can do (read,
write, delete).
3. Metadata Management:
o Works with AWS Glue Data Catalog to automatically discover and organize
metadata about the data in your lake.
4. Data Preparation:
o Helps you clean and transform data to make it ready for analysis.
5. Centralized Control:
o Manage access permissions and security policies for multiple users and AWS
services in one place.
How AWS Lake Formation Works
1. Data Ingestion:
o Collect data from different sources (e.g., databases, logs, files) and store it
securely in S3.
2. Access Management:
o Ensure only authorized users can access specific parts of the data lake.
3. Automatic Cataloging:
o As new data arrives in S3, Lake Formation automatically adds it to the Glue
Data Catalog.
4. Data Governance:
o Control access to sensitive data, such as giving analysts access to sales data
but restricting access to personal information.
Analogy
Think of AWS Lake Formation like a warehouse manager:
It helps you organize all your products (data) in the warehouse (data lake).
It makes sure only the right people can access certain products.
Summary
AWS Lake Formation helps you easily set up, manage, and secure a data lake in S3.
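The centralized-permission idea can be sketched as the request shape boto3's Lake Formation `grant_permissions` call accepts; the principal ARN, database, and table names below are illustrative placeholders.

```python
# Sketch of a Lake Formation permission grant: let an analysts role
# read one table. All names and the ARN are illustrative placeholders.
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"
    },
    "Resource": {
        "Table": {"DatabaseName": "sales_db", "Name": "orders"}
    },
    # Analysts may read the table but not change or delete it.
    "Permissions": ["SELECT"],
}

# With real AWS credentials:
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant_request)
```

Because the grant names a catalog table rather than raw S3 paths, the same rule is enforced consistently for every engine (Athena, Redshift Spectrum, Glue) that reads through the catalog, which is the "centralized control" the section describes.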