Day1 - Data Lake - v2
Engineer
• Consistency
Azure SQL
Azure Data Lake
Azure Databricks Azure HDInsight
• Unstructured data
Examples: Media files, Office files, Text files, Log files
Azure Storage is a Microsoft-managed cloud service that provides storage that is highly
available, secure, durable, scalable, and redundant. Within Azure there are two types of
storage accounts, four types of storage, four levels of data redundancy, and three tiers for
storing files.
Azure Files is a shared network file storage service that gives administrators a way to
access native SMB file shares in the cloud.
Azure Queue Storage is a service that allows users to store high volumes of messages,
process them asynchronously, and consume them when needed.
• Hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access.
• Use slashes in Blob storage file names to simulate a tree-like directory structure.
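The slash convention above can be illustrated with a minimal sketch. It uses plain Python only and no Azure SDK; the function name `list_directory` and the sample blob names are hypothetical, chosen for illustration. It shows how a flat list of blob names containing slashes can be read as a directory tree by treating everything up to a slash as a virtual directory.

```python
def list_directory(blob_names, prefix=""):
    """Return (subdirectories, files) found directly under a virtual directory.

    `prefix` is the virtual directory path, ending in "/" (or "" for the root).
    """
    dirs, files = set(), []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if "/" in rest:
            # Anything before the next slash acts as a subdirectory name.
            dirs.add(rest.split("/", 1)[0] + "/")
        elif rest:
            files.append(rest)
    return sorted(dirs), sorted(files)


# Hypothetical flat blob names in a single container.
blobs = [
    "raw/2024/01/sales.csv",
    "raw/2024/02/sales.csv",
    "raw/readme.txt",
    "curated/sales.parquet",
]

print(list_directory(blobs))
print(list_directory(blobs, "raw/"))
```

With a flat namespace the directories exist only as name prefixes, so listing them means scanning names as above; a hierarchical namespace stores the directories as real objects.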
Monitoring services
Security
Data Lake
• Cool - Optimized for storing data that is infrequently accessed and stored for at least 30 days (early
deletion fee).
• Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with
flexible latency requirements, on the order of hours (early deletion fee).
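The minimum storage durations above (30 days for Cool, 180 days for Archive) can be turned into a small worked calculation. This is a sketch under stated assumptions: the helper name `early_deletion_days` is hypothetical, the Hot tier is assumed to have no minimum duration, and the early deletion fee is assumed to be prorated over the remaining days. Actual billing should be checked against the Azure pricing documentation.

```python
# Minimum storage duration per tier, in days.
# Cool and Archive come from the text above; Hot = 0 is an assumption.
MIN_RETENTION_DAYS = {"hot": 0, "cool": 30, "archive": 180}


def early_deletion_days(tier, days_stored):
    """Return how many days short of the tier's minimum duration a blob is.

    A result of 0 means no early deletion fee would apply.
    """
    minimum = MIN_RETENTION_DAYS[tier.lower()]
    return max(0, minimum - days_stored)


# A blob deleted from Archive after 45 days is 135 days short of the minimum.
print(early_deletion_days("archive", 45))
# A blob kept in Cool for the full 30 days incurs no early deletion period.
print(early_deletion_days("cool", 30))
```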
To read data in archive storage, you must first change the tier of the blob to hot or cool. This
process is known as rehydration and can take hours to complete.
• Standard priority: The rehydration request will be processed in the order it was received
and may take up to 15 hours.
• High priority: The rehydration request will be prioritized over Standard requests and may
finish in under 1 hour for objects under 10 GB in size.
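The two rehydration priorities can be summarized as a planning helper. This is only a sketch: the function name `rehydration_upper_bound_hours` is hypothetical, and the figures are the approximate ones quoted above, not guarantees. Where the text gives no figure (High priority with objects of 10 GB or more), the function returns `None` rather than guessing.

```python
def rehydration_upper_bound_hours(priority, size_gb):
    """Return a rough upper bound in hours for rehydrating an archived blob.

    Returns None when the figures quoted above do not cover the case.
    """
    p = priority.lower()
    if p == "standard":
        # Standard requests are processed in order and may take up to 15 hours.
        return 15
    if p == "high":
        # High priority may finish in under 1 hour for objects under 10 GB.
        return 1 if size_gb < 10 else None
    raise ValueError(f"unknown rehydration priority: {priority!r}")


print(rehydration_upper_bound_hours("Standard", 50))
print(rehydration_upper_bound_hours("High", 2))
```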
• Authentication
• Access Control
• Network access
• Data Encryption
Authentication
Azure Storage
IP Address My VN Internet
• Transition blobs to a cooler storage tier (hot to cool, hot to archive, or cool to archive) to
optimize for performance and cost.
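The tier transitions above can be sketched as an age-based rule. This is an illustration of the idea, not the lifecycle management service itself: the function name `target_tier` and the 30-day and 90-day thresholds are hypothetical values chosen for the example. The rule only ever moves a blob to a cooler tier, matching the transitions listed above.

```python
# Tiers ordered from warmest to coolest.
TIER_ORDER = ["hot", "cool", "archive"]


def target_tier(current_tier, days_since_modified, cool_after=30, archive_after=90):
    """Return the tier a blob should be in, given its age in days.

    The thresholds are illustrative defaults, not service defaults.
    """
    if days_since_modified >= archive_after:
        wanted = "archive"
    elif days_since_modified >= cool_after:
        wanted = "cool"
    else:
        wanted = "hot"
    # Only transition to a cooler tier; never warm a blob back up.
    if TIER_ORDER.index(wanted) > TIER_ORDER.index(current_tier):
        return wanted
    return current_tier


print(target_tier("hot", 45))     # old enough for cool
print(target_tier("hot", 120))    # old enough for archive
print(target_tier("archive", 5))  # stays in archive
```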
ZRS - Zone-redundant storage: Three copies of your data are replicated synchronously across three Azure availability zones in a primary region. Zones
are in different physical locations or different data centers.
GRS - Geo-redundant storage: This allows your data to be stored in different geographic areas of the country or world. Again, you get three
copies of the data within a primary region, but it goes one step further and places three additional asynchronous copies in another region. For
example, you can have a copy in Virginia and in California to protect your data from fires or hurricanes, depending on the coast.
RA-GRS - Read-access geo-redundant storage: This is GRS with a read-only element added, which gives you read access to the secondary copy for things like
reporting.
GZRS - Geo-zone-redundant storage: Copies your data synchronously across three Azure availability zones in the primary region using ZRS, then
asynchronously copies your data to a single physical location within the secondary region.
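The redundancy options above differ along three axes: zone redundancy in the primary region, a copy in a secondary region, and read access to that secondary copy. The sketch below encodes those properties so the options can be compared. The class and function names are hypothetical, and the sketch assumes that GRS and RA-GRS keep their primary copies in a single location rather than across zones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redundancy:
    name: str
    zone_redundant_primary: bool  # copies spread over availability zones
    secondary_region: bool        # asynchronous copy in another region
    secondary_readable: bool      # read access to the secondary copy


OPTIONS = [
    Redundancy("ZRS", True, False, False),
    Redundancy("GRS", False, True, False),
    Redundancy("RA-GRS", False, True, True),
    Redundancy("GZRS", True, True, False),
]


def candidates(need_zone=False, need_region=False, need_secondary_read=False):
    """Return the names of the options that satisfy every stated requirement."""
    return [
        o.name
        for o in OPTIONS
        if (o.zone_redundant_primary or not need_zone)
        and (o.secondary_region or not need_region)
        and (o.secondary_readable or not need_secondary_read)
    ]


print(candidates(need_region=True))
print(candidates(need_zone=True, need_region=True))
print(candidates(need_secondary_read=True))
```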
• Alerts
• Metrics
• Diagnostics
• Log Analytics