2/22/25, 7:28 PM AWS Data Engineering Cheatsheet
Nata in Data
AWS Data Engineering Cheatsheet
Nataindata
May 20, 2024 • 11 min read
Hello dears, here you can find cheat sheets for the most commonly used AWS
services in data engineering:
Amazon Redshift Cheat Sheet
Amazon S3 Cheat Sheet
Amazon Athena Cheat Sheet
https://www.nataindata.com/blog/aws-data-engineering-cheat-sheet/ 1/27
Amazon Kinesis Cheat Sheet
Amazon Redshift Cheat Sheet
Overview
Amazon Redshift is a fully managed, petabyte-scale data warehouse service
that extends data warehouse queries to your data lake. It allows you to run
analytic queries against petabytes of data stored locally in Redshift and
directly against exabytes of data stored in S3. Redshift is designed for OLAP
(Online Analytical Processing).
Redshift provisioned clusters have traditionally been Single-AZ;
Multi-AZ deployments are now available for RA3 node types.
Features
Columnar Storage: Redshift uses columnar storage, data
compression, and zone maps to minimize the amount of I/O needed
for queries.
Parallel Processing: It utilizes a massively parallel processing (MPP)
data warehouse architecture to distribute SQL operations across
multiple nodes.
Machine Learning: Redshift leverages machine learning to optimize
throughput based on workloads.
Result Caching: Provides sub-second response times for repeat
queries.
Automated Backups: Redshift continuously backs up your data to
S3 and can replicate snapshots to another region for disaster
recovery.
Components
Cluster: Comprises a leader node and one or more compute nodes.
A database is created upon provisioning a cluster for loading data
and running queries.
Scaling: Clusters can be scaled in/out by adding/removing nodes
and scaled up/down by changing node types.
Maintenance Window: Redshift assigns a 30-minute maintenance
window randomly within an 8-hour block per region each week.
During this time, clusters are unavailable.
Deployment Platforms: Clusters launch in a VPC (the EC2-VPC
platform); the legacy EC2-Classic platform has been retired.
Redshift Nodes
Leader Node: Manages client connections, parses queries, and
coordinates execution plans with compute nodes.
Compute Nodes: Execute query plans, exchange data, and send
intermediate results to the leader node for aggregation.
Node Types
Dense Storage (DS): For large data workloads using HDD storage.
Dense Compute (DC): Optimized for performance-intensive
workloads using SSD storage.
Parameter Groups
Parameter groups apply to all databases within a cluster. The default
parameter group has preset values and cannot be modified.
Database Querying Options
Query Editor: Use the AWS Management Console to connect to
your cluster and run queries.
SQL Client Tools: Connect via standard ODBC and JDBC
connections.
Enhanced VPC Routing: Manages data flow between your cluster
and other resources using VPC features.
Redshift Spectrum
Query Exabytes of Data: Run queries against data in S3 without
loading or transforming it.
Columnar Format: Scans only the needed columns for your query,
reducing data processing.
Compression Algorithms: Scans less data when data is compressed
with supported algorithms.
Redshift Streaming Ingestion
Streaming Data: Consume and process data directly from
streaming sources like Amazon Kinesis Data Streams and Amazon
Managed Streaming for Apache Kafka (MSK).
Low Latency: Provides high-speed ingestion without staging data in
S3.
Redshift ML
Machine Learning: Train and deploy machine learning models using
SQL commands within Redshift.
In-Database Inference: Perform in-database predictions without
moving data.
SageMaker Integration: Utilizes Amazon SageMaker Autopilot to
find the best model for your data.
Redshift Data Sharing
Live Data Sharing: Securely share live data across Redshift clusters
within an AWS account without copying data.
Up-to-Date Information: Users always access the most current data
in the warehouse.
No Additional Cost: Available on Redshift RA3 clusters without
extra charges.
Redshift Cross-Database Query
Query Across Databases: Allows querying across databases within a
Redshift cluster, regardless of the database you are connected to.
Available on RA3 node types at no extra cost.
Cluster Snapshots
Types: There are two types of snapshots, automated and manual,
stored in S3 using SSL.
Automated Snapshots: Taken every 8 hours or after every 5 GB per
node of data changes, and enabled by default. They are deleted at
the end of a one-day retention period, which can be modified.
Manual Snapshots: Retained indefinitely unless manually deleted.
Can be shared with other AWS accounts.
Cross-Region Snapshots: Snapshots can be copied to another AWS
Region for disaster recovery, with a default retention period of
seven days.
Monitoring
Audit Logging: Tracks authentication attempts, connections,
disconnections, user definition changes, and queries. Logs are
stored in S3.
Event Tracking: Redshift retains information about events for
several weeks.
Performance Metrics: Uses CloudWatch to monitor physical
aspects like CPU utilization, latency, and throughput.
Query/Load Performance Data: Helps monitor database activity
and performance.
CloudWatch Alarms: Optionally configured to monitor disk space
usage across cluster nodes.
Security
Access Control: By default, only the AWS account that creates the
cluster can access it.
IAM Integration: Create user accounts and manage permissions
using IAM.
Security Groups: Use VPC security groups to control access to
clusters (Redshift cluster security groups applied only to the legacy
EC2-Classic platform).
Encryption: Optionally encrypt clusters upon provisioning.
Encrypted clusters' snapshots are also encrypted.
Pricing
Billing: Pay per second based on the type and number of nodes in
your cluster.
Spectrum Scanning: Pay for the number of bytes scanned by
Redshift Spectrum.
Reserved Instances: Save costs by committing to 1 or 3-year terms.
Cluster Management
Creating a Cluster
aws redshift create-cluster \
--cluster-identifier my-redshift-cluster \
--node-type dc2.large \
--master-username masteruser \
--master-user-password masterpassword \
--cluster-type multi-node \
--number-of-nodes 2
Deleting a Cluster
aws redshift delete-cluster \
--cluster-identifier my-redshift-cluster \
--skip-final-cluster-snapshot
Describing a Cluster
aws redshift describe-clusters \
--cluster-identifier my-redshift-cluster
Database Management
Connecting to the Database
Use a PostgreSQL-compatible tool such as psql or a SQL client:
psql -h my-cluster.cduijjmc4xkx.us-west-2.redshift.amazonaws.com -U masteruser -d dev -p 5439
Creating a Database
CREATE DATABASE mydb;
Dropping a Database
DROP DATABASE mydb;
User Management
Creating a User
CREATE USER myuser WITH PASSWORD 'mypassword';
Dropping a User
DROP USER myuser;
Granting Permissions
GRANT ALL PRIVILEGES ON DATABASE mydb TO myuser;
Revoking Permissions
REVOKE ALL PRIVILEGES ON DATABASE mydb FROM myuser;
Table Management
Creating a Table
CREATE TABLE mytable (
id INT PRIMARY KEY,
name VARCHAR(50),
age INT
);
Dropping a Table
DROP TABLE mytable;
Inserting Data
INSERT INTO mytable (id, name, age) VALUES (1, 'John Doe', 30);
Updating Data
UPDATE mytable SET age = 31 WHERE id = 1;
Deleting Data
DELETE FROM mytable WHERE id = 1;
Querying Data
SELECT * FROM mytable;
Performance Tuning
Analyzing a Table
ANALYZE mytable;
Vacuuming a Table
VACUUM mytable;
Redshift Distribution Styles
KEY: Distributes rows based on the values in one column.
EVEN: Distributes rows evenly across all nodes.
ALL: Copies the entire table to each node.
Example: Creating a Table with Distribution Key
CREATE TABLE mytable (
id INT,
name VARCHAR(50),
age INT
)
DISTSTYLE KEY
DISTKEY(id);
Backup and Restore
Creating a Snapshot
aws redshift create-cluster-snapshot \
--snapshot-identifier my-snapshot \
--cluster-identifier my-redshift-cluster
Restoring from a Snapshot
aws redshift restore-from-cluster-snapshot \
--snapshot-identifier my-snapshot \
--cluster-identifier my-new-cluster
Security
Enabling SSL
In psql or your SQL client, use the sslmode parameter:
psql "host=my-cluster.cduijjmc4xkx.us-west-2.redshift.amazonaws.com port=5439 dbname=dev user=masteruser sslmode=require"
Managing VPC Security Groups
aws redshift create-cluster-security-group \
  --cluster-security-group-name my-security-group \
  --description "My Redshift security group"

aws redshift authorize-cluster-security-group-ingress \
  --cluster-security-group-name my-security-group \
  --cidrip 192.0.2.0/24
Maintenance
Resizing a Cluster
aws redshift modify-cluster \
--cluster-identifier my-redshift-cluster \
--node-type dc2.large \
--number-of-nodes 4
Monitoring Cluster Performance
Use Amazon CloudWatch to monitor:
CPU Utilization
Database Connections
Read/Write IOPS
Network Traffic
Viewing Cluster Events
aws redshift describe-events \
--source-identifier my-redshift-cluster \
--source-type cluster
Amazon S3 Cheat Sheet
Overview
Amazon S3 (Simple Storage Service) stores data as objects within buckets.
Each object includes a file and optional metadata that describes the file. A
key is a unique identifier for an object within a bucket, and storage capacity
is virtually unlimited.
Buckets
Access Control: For each bucket, you can control access, create,
delete, and list objects, view access logs, and choose the
geographical region for storage.
Naming: Bucket names must be unique DNS-compliant names
across all existing S3 buckets. Once created, the name cannot be
changed and is visible in the URL pointing to the objects in the
bucket.
Limits: By default, you can create up to 100 buckets per AWS
account. The region of a bucket cannot be changed after creation.
Static Website Hosting: Buckets can be configured to host static
websites.
Deletion Restrictions: Buckets with 100,000 or more objects
cannot be deleted via the S3 console. Buckets with versioning
enabled cannot be deleted via the AWS CLI.
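Bucket naming trips people up often; the main DNS-compliance rules can be sketched in a few lines of Python (a simplification: it ignores additional restrictions such as reserved prefixes and suffixes like `xn--` and `-s3alias`):

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Rough check of S3's DNS-compliant bucket naming rules (not the
    complete rule set)."""
    if not 3 <= len(name) <= 63:
        return False                      # must be 3-63 characters long
    # lowercase letters, digits, dots, hyphens; start/end alphanumeric
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name):
        return False
    # must not be formatted like an IP address
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    return True
```

For example, `my-data-bucket` passes, while `My_Bucket` (uppercase, underscore) and `192.168.1.1` (IP-formatted) do not.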
Data Consistency Model
Read-After-Write Consistency: For PUTs of new objects in all
regions.
Strong Consistency: For read-after-write HEAD or GET requests,
overwrite PUTs, and DELETEs in all regions.
Eventual Consistency: For listing all buckets after a bucket is
deleted and for bucket configuration changes, such as enabling
versioning for the first time.
Storage Classes
Frequently Accessed Objects
S3 Standard: General-purpose storage for frequently accessed
data.
S3 Express One Zone: High-performance, single-AZ storage class
for latency-sensitive applications, offering improved access speeds
and reduced request costs compared to S3 Standard.
Infrequently Accessed Objects
S3 Standard-IA: For long-lived but less frequently accessed data,
with redundant storage across multiple AZs.
S3 One Zone-IA: Less expensive, stores data in one AZ, and is not
resilient to AZ loss. Suitable for objects over 128 KB stored for at
least 30 days.
Amazon S3 Intelligent-Tiering
Automatic Cost Optimization: Moves data between frequent and
infrequent access tiers based on access patterns.
Monitoring: Moves objects to infrequent access after 30 days
without access, and to archive tiers after 90 and 180 days without
access.
No Retrieval Fees: Optimizes costs without performance impact.
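The 30/90/180-day schedule above can be expressed as a simple lookup; a minimal Python sketch (in the real service the two archive tiers are opt-in, not automatic):

```python
def intelligent_tier(days_since_last_access: int) -> str:
    """Map days without access to an Intelligent-Tiering access tier,
    following the 30/90/180-day schedule described above."""
    if days_since_last_access >= 180:
        return "Deep Archive Access"
    if days_since_last_access >= 90:
        return "Archive Access"
    if days_since_last_access >= 30:
        return "Infrequent Access"
    return "Frequent Access"
```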
S3 Glacier
Long-Term Archive: Provides storage classes like Glacier Instant
Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive for
long-term archiving.
Access: Archived objects must be restored before access and are
only visible through S3.
Retrieval Options
Expedited: Access data within 1-5 minutes for urgent requests.
Standard: Default option, typically completes within 3-5 hours.
Bulk: Lowest-cost option for retrieving large amounts of data,
typically completes within 5-12 hours.
Additional Information
Object Storage: For S3 Standard, Standard-IA, and Glacier classes,
objects are stored across multiple devices in at least three AZs.
Amazon Athena Cheat Sheet
Overview
Amazon Athena is an interactive query service that allows you to analyze
data directly in Amazon S3 and other data sources using SQL. It is serverless
and built on Presto (Trino in recent engine versions), an open-source,
distributed SQL query engine optimized for low-latency, ad hoc analysis.
Features
Serverless: No infrastructure to manage.
Built-in Query Editor: Allows you to write and execute queries
directly in the Athena console.
Wide Data Format Support: Supports formats such as CSV, JSON,
ORC, Avro, and Parquet.
Parallel Query Execution: Executes queries in parallel to provide
fast results, even for large datasets.
Amazon S3 Integration: Uses S3 as the underlying data store,
ensuring high availability and durability.
Data Visualization: Integrates with Amazon QuickSight.
AWS Glue Integration: Works seamlessly with AWS Glue for data
cataloging.
Managed Data Catalog: Stores metadata and schemas for your S3-
stored data.
Queries
Geospatial Data: You can query geospatial data.
Log Data: Supports querying various log types.
Query Results: Results are stored in S3.
Query History: Retains history for 45 days.
User-Defined Functions (UDFs): Supports scalar UDFs, executed
with AWS Lambda, to process records or groups of records.
Data Types: Supports both simple (e.g., INTEGER, DOUBLE,
VARCHAR) and complex (e.g., MAPS, ARRAY, STRUCT) data types.
Requester Pays Buckets: Supports querying data in S3 Requester
Pays buckets.
Athena Federated Queries
Data Connectors: Allows querying data sources beyond S3 using
data connectors implemented in Lambda functions via the Athena
Query Federation SDK.
Pre-built Connectors: Available for popular data sources like
MySQL, PostgreSQL, Oracle, SQL Server, DynamoDB, MSK,
RedShift, OpenSearch, CloudWatch Logs, CloudWatch metrics, and
DocumentDB.
Custom Connectors: You can write custom data connectors or
customize pre-built ones using the Athena Query Federation SDK.
Optimizing Query Performance
Data Partitioning: Partitioning data by column values (e.g., date,
country, region) reduces the amount of data scanned by a query.
Columnar Formats: Converting data to columnar formats like
Parquet and ORC improves performance.
File Compression: Compressing files reduces the amount of data
scanned.
Splittable Files: Using splittable files allows Athena to read them in
parallel, speeding up query completion. Formats like Avro, Parquet,
and ORC are splittable regardless of the compression codec; among
text files, only those compressed with BZIP2 or LZO are splittable.
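To make partitioning concrete: a Hive-style layout encodes partition values directly in the S3 key, so Athena can prune partitions instead of scanning the whole table. A minimal sketch (bucket and column names are made up):

```python
from datetime import date

def partition_prefix(table_root: str, dt: date, country: str) -> str:
    """Build a Hive-style partitioned S3 prefix (dt=.../country=...).
    A query filtering on dt and country then scans only matching prefixes."""
    return f"{table_root}/dt={dt.isoformat()}/country={country}/"

print(partition_prefix("s3://my-bucket/events", date(2024, 5, 20), "us"))
# s3://my-bucket/events/dt=2024-05-20/country=us/
```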
Cost Controls
Workgroups: Isolate queries by teams, applications, or workloads
and enforce cost controls.
Per-Query Limit: Sets a threshold for the total amount of data
scanned per query, canceling any query that exceeds this limit.
Per-Workgroup Limit: Limits the total amount of data scanned by
all queries within a specified timeframe, with multiple limits based
on hourly or daily data scan totals.
Amazon Athena Security
Access Control: Use IAM policies, access control lists, and S3 bucket
policies to control data access.
Encrypted Data: Queries can be performed directly on encrypted
data in S3.
Amazon Athena Pricing
Pay Per Query: Charged based on the amount of data scanned by
each query.
No Charge for Failed Queries: You are not charged for queries that
fail.
Cost Savings: Compressing, partitioning, or converting data to
columnar formats reduces the amount of data scanned, leading to
cost savings and performance gains.
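As a rough illustration of the pay-per-scan model, the sketch below estimates one query's cost, assuming the common $5-per-TB price and Athena's 10 MB per-query billing minimum (check your region's actual rates):

```python
def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate the cost of one Athena query from bytes scanned.
    Assumes $5/TB (varies by region) and the 10 MB per-query minimum."""
    MIN_BYTES = 10 * 1024**2              # Athena bills at least 10 MB
    tb_scanned = max(bytes_scanned, MIN_BYTES) / 1024**4
    return tb_scanned * price_per_tb
```

This also shows why compression and columnar formats pay off directly: halving the bytes scanned halves the bill.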
Amazon Kinesis Cheat Sheet
Overview
Amazon Kinesis makes it easy to collect, process, and analyze real-time
streaming data. It can ingest real-time data such as video, audio, application
logs, website clickstreams, and IoT telemetry data for machine learning,
analytics, and other applications.
Kinesis Video Streams
A fully managed service for streaming live video from devices to the AWS
Cloud or building applications for real-time video processing or batch-
oriented video analytics.
Benefits
Device Connectivity: Connect and stream from millions of devices.
Custom Retention Periods: Configure video streams to durably
store media data for custom retention periods, generating an index
based on timestamps.
Serverless: No infrastructure setup or management required.
Security: Enforces TLS-based encryption for data streaming and
encrypts all data at rest using AWS KMS.
Components
Producer: Source that puts data into a Kinesis video stream.
Kinesis Video Stream: Enables the transportation, optional
storage, and real-time or batch consumption of live video data.
Consumer: Retrieves data from a Kinesis video stream to view,
process, or analyze it.
Fragment: A self-contained sequence of frames with no
dependencies on other fragments.
Video Playbacks
HLS (HTTP Live Streaming): For live playback.
GetMedia API: For building custom applications to process video
streams in real time with low latency.
Metadata
Nonpersistent Metadata: Ad hoc metadata for specific fragments.
Persistent Metadata: Metadata for consecutive fragments.
Pricing
Pay for the volume of data ingested, stored, and consumed.
Kinesis Data Stream
A scalable, durable data ingestion and processing service optimized for
streaming data.
Components
Data Producer: Application emitting data records to a Kinesis data
stream, assigning partition keys to records.
Data Consumer: Application or AWS service retrieving data from all
shards in a stream for real-time analytics or processing.
Data Stream: A logical grouping of shards. Data is retained for 24
hours by default and for up to 365 days with extended retention.
Shard: The base throughput unit, ingesting up to 1000 records or 1
MB per second. Provides ordered records by arrival time.
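Those per-shard limits make capacity planning a back-of-envelope calculation. A sketch, assuming standard (shared-throughput) consumers, which share 2 MB/s of read capacity per shard:

```python
import math

def required_shards(write_mb_per_s: float, records_per_s: float,
                    read_mb_per_s: float) -> int:
    """Estimate the shard count for a stream using the per-shard limits:
    1 MB/s or 1,000 records/s ingest, and 2 MB/s egress for standard
    consumers. The binding constraint determines the count."""
    return max(
        math.ceil(write_mb_per_s / 1.0),   # ingest bandwidth limit
        math.ceil(records_per_s / 1000.0), # ingest record-count limit
        math.ceil(read_mb_per_s / 2.0),    # shared read bandwidth limit
        1,                                 # a stream needs at least 1 shard
    )
```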
Data Record
Record: Unit of data in a stream with a sequence number, partition
key, and data blob (max 1 MB).
Partition Key: Identifier (e.g., user ID, timestamp) used to route
records to shards.
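Under the hood, Kinesis MD5-hashes the partition key into a 128-bit integer and routes the record to the shard whose hash-key range contains it. A simplified sketch, assuming the shards split the hash space evenly:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Illustrate partition-key routing: MD5 the key to a 128-bit integer
    and map it into one of num_shards equal hash-key ranges."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // num_shards
    return min(h // range_size, num_shards - 1)
```

Because the mapping is deterministic, all records with the same partition key land on the same shard, which is what gives per-key ordering.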
Sequence Number
Unique identifier for each data record, assigned by Kinesis when
data is added.
Monitoring
Monitor shard-level metrics using CloudWatch, Kinesis Agent, and
Kinesis libraries. Log API calls with CloudTrail.
Security
Automatically encrypt sensitive data with AWS KMS.
Use IAM for access control and VPC endpoints to keep traffic within
the Amazon network.
Pricing
Charged per shard hour, PUT Payload Unit, and enhanced fan-out
usage. Extended data retention incurs additional charges.
Kinesis Data Firehose
The easiest way to load streaming data into data stores and analytics tools.
Features
Scalable: Automatically scales to match data throughput.
Data Transformation: Can batch, compress, and encrypt data
before loading it.
Destination Support: Captures, transforms, and loads data into S3,
Redshift, OpenSearch (formerly Elasticsearch), HTTP endpoints, and
service providers like Datadog, New Relic, MongoDB, and Splunk.
Batch Size and Interval: Control data upload frequency and size.
Data Delivery and Transformation
Lambda Integration: Transforms incoming data before delivery.
Format Conversion: Converts JSON to Parquet or ORC for storage
in S3.
Buffer Configuration: Controls data buffering before delivery to
destinations.
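The buffering behavior is a size-or-time trigger: records accumulate until either threshold is reached, then the batch is delivered. A toy Python model (thresholds here are illustrative; the real hints for S3 destinations are roughly 1-128 MB and 60-900 seconds):

```python
class FirehoseBuffer:
    """Toy model of Firehose buffering: flush when the buffered size
    reaches the size hint, or when the oldest record has waited longer
    than the interval hint, whichever comes first."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.started = None               # time of the first buffered record

    def put(self, record, now):
        if self.started is None:
            self.started = now            # interval clock starts here
        self.records.append(record)
        self.size += len(record)

    def should_flush(self, now):
        if self.started is None:          # an empty buffer never flushes
            return False
        return (self.size >= self.max_bytes
                or now - self.started >= self.max_seconds)
```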
Pricing
Pay for the volume of data transmitted. Additional charges for data
format conversion.
Kinesis Data Analytics
Analyze streaming data, gain insights, and respond to business needs in real
time. (The service has since been renamed Amazon Managed Service for
Apache Flink.)
General Features
Serverless: Automatically manages infrastructure.
Scalable: Elastically scales to handle data volume.
Low Latency: Provides sub-second processing latencies.
SQL Features
Standard ANSI SQL: Integrates with Kinesis Data Streams and
Firehose.
Input Types: Supports streaming and reference data sources.
Schema Editor: Recognizes standard formats like JSON and CSV.
Java Features
Apache Flink: Uses open-source libraries for building streaming
applications.
State Management: Stores state in encrypted, incrementally saved
running application storage.
Exactly Once Processing: Ensures processed records affect results
exactly once.
Components
Input: Streaming source for the application.
Application Code: SQL statements processing input data.
In-Application Streams: Stores data for processing.
Kinesis Processing Units (KPU): Provides memory, computing, and
networking resources.
Pricing
Charged based on the number of KPUs used. Additional charges for
Java application orchestration and storage.