Data Fundamentals
1. Structured Data
Structured data is data that adheres to a fixed schema, so all of the data has the
same fields or properties. Most commonly, structured data is tabular: it is represented
in one or more tables made up of rows and columns, where each row represents an
instance of a data entity and each column represents an attribute of that entity.
2. Semi-Structured Data
Semi-structured data is information that has some structure but allows for some
variation between entity instances. For example, while most customers may have an
email address, some might have multiple email addresses, and some might have none
at all.
One common format for semi-structured data is JavaScript Object Notation (JSON). The
example shows a pair of JSON documents that represent customer information. JSON
is just one of many ways in which semi-structured data can be represented.
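The example itself isn't reproduced here; a minimal sketch of such a pair of documents,
using hypothetical field names, might look like this:

{
  "firstName": "Joe",
  "lastName": "Jones",
  "email": ["joe@litware.com", "joe.j@contoso.com"]
}

{
  "firstName": "Samir",
  "lastName": "Nadoy"
}

The first customer has two email addresses and the second has none, illustrating the
variation that semi-structured formats allow.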
Data Storage
There are two broad categories of data storage in common use:
A. File stores
Delimited text files – The most common format for delimited data is comma-separated
values (CSV), in which fields are separated by commas, and rows are terminated by a
return/newline.
Extensible Markup Language (XML) – XML is a human-readable data format that was
popular in the 1990s and 2000s. It's primarily been superseded by the less verbose
JSON format, but there are still some systems that use XML to represent data. XML
uses tags enclosed in angle brackets (<../>) to define elements and attributes.
Binary Large Object (BLOB) – Technically, all files are stored as binary data (1's and
0's), but in the human-readable formats discussed above, the bytes of binary data are
mapped to printable characters (typically through a character encoding scheme such
as ASCII or Unicode).
Some common optimized file formats you might see include Avro, ORC, and Parquet:
ORC (Optimized Row Columnar format) organizes data into columns rather than
rows. It was developed by HortonWorks for optimizing read and write operations in
Apache Hive (Hive is a data warehouse system that supports fast data
summarization and querying over large datasets). An ORC file contains stripes of
data. Each stripe holds the data for a column or set of columns. A stripe contains an
index into the rows in the stripe, the data for each row, and a footer that holds
statistical information (count, sum, max, min, and so on) for each column.
Parquet is another columnar data format. It was created by Cloudera and Twitter.
A Parquet file contains row groups. Data for each column is stored together in the
same row group. Each row group contains one or more chunks of data. A Parquet
file includes metadata that describes the set of rows found in each chunk. An
application can use this metadata to quickly locate the correct chunk for a given set
of rows, and retrieve the data in the specified columns for these rows. Parquet
specializes in storing and processing nested data types efficiently. It supports very
efficient compression and encoding schemes.
B. Databases
A database is used to define a central system in which data can be stored and queried.
Atomicity – each transaction is treated as a single unit, which succeeds completely or
fails completely. For example, a transaction that involves debiting funds from one
account and crediting the same amount to another account must complete in full; if
either operation fails, the whole transaction is rolled back.
Consistency – transactions can only take the data in the database from one valid
state to another. To continue the debit and credit example above, the completed
state of the transaction must reflect the transfer of funds from one account to the
other.
Isolation – concurrent transactions cannot interfere with one another and must
result in a consistent database state. For example, while the transaction to transfer
funds from one account to another is in process, another transaction that checks the
balance of these accounts must return consistent results - the balance-checking
transaction can't retrieve a value for one account that reflects the
balance before the transfer, and a value for the other account that reflects the
balance after the transfer.
Durability – when a transaction has been committed, it will remain committed. After
the account transfer transaction has been completed, the revised account balances
are persisted so that even if the database system were to be switched off, the
committed transaction would be reflected when it is switched on again.
2. An extract, transform, and load (ETL) process copies data from files and OLTP
databases into a data warehouse that is optimized for reading activity. Commonly, a
data warehouse schema is based on fact tables that contain numeric values you
want to analyze (for example, sales amounts), with related dimension tables that
represent the entities by which you want to measure them (for example, customer
or product).
3. Data in the data warehouse may be aggregated and loaded into an online analytical
processing (OLAP) model, or cube. Aggregated numeric values (measures) from
fact tables are calculated for intersections of dimensions from dimension tables. For
example, sales revenue might be totaled by date, customer, and product.
Data lakes are common in modern data analytical processing scenarios, where a large
volume of file-based data must be collected and analyzed.
Data warehouses are an established way to store data in a relational schema that is
optimized for reading operations – primarily queries to support reporting and data
visualization. The data warehouse schema may require some denormalization of data in
an OLTP data source (introducing some duplication to make queries perform faster).
An OLAP model is an aggregated type of data storage that is optimized for analytical
workloads. Data aggregations are calculated across dimensions at different levels,
enabling you to drill up/down to view aggregations at multiple hierarchical levels; for
example, to find total sales by region, by city, or for an individual address.
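As a rough SQL illustration of aggregating a measure at multiple levels of a dimension
hierarchy, the following query uses hypothetical FactSales and DimDate tables (the
names and columns are assumptions, not part of the original notes):

-- Totals revenue by year and month, with subtotals per year and a grand total
SELECT d.CalendarYear, d.MonthName, SUM(s.SalesAmount) AS TotalRevenue
FROM FactSales AS s
JOIN DimDate AS d ON s.DateKey = d.DateKey
GROUP BY ROLLUP (d.CalendarYear, d.MonthName);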
Job Roles
The three key job roles that deal with data in most organizations are:
Database administrators manage databases, assigning permissions to users, storing
backup copies of data, and restoring data in the event of a failure.
Data engineers manage infrastructure and processes for data integration across
the organization, applying data cleaning routines, identifying data governance rules,
and implementing pipelines to transfer and transform data between systems.
Data analysts explore and analyze data to create visualizations and charts that
enable organizations to make informed decisions.
There are additional data-related roles not mentioned here, such as data
scientist and data architect; and there are other technical professionals that work with
data, including application developers and software engineers.
Azure Storage blob containers – scalable, cost-effective storage for binary files.
Normalization
Normalization is the process of organizing the data in a relational database to minimize
duplication, typically by splitting entities and their attributes into separate tables that
are linked by key columns.
SQL
SQL stands for Structured Query Language and is used to communicate with a
relational database. It's the standard language for relational database management
systems. SQL statements are used to perform tasks such as updating data in a
database or retrieving data from a database. Some standard relational database
management systems that use SQL include Microsoft SQL Server, MySQL,
PostgreSQL, MariaDB, and Oracle.
Some popular dialects of SQL include:
Transact-SQL (T-SQL). This version of SQL is used by Microsoft SQL Server and
Azure SQL services.
PL/SQL. This is the dialect used by Oracle. PL/SQL stands for Procedural
Language/SQL.
View
A view is a virtual table based on the results of a SELECT query. You can think of a
view as a saved query whose results you can query and filter in much the same way
as a table.
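For example, the original notes define a view named Deliveries; a minimal complete
sketch, assuming hypothetical Order and Customer tables, might look like this:

CREATE VIEW Deliveries
AS
-- The table and column names below are assumed for illustration
SELECT o.OrderNo, o.OrderDate, c.Address, c.City
FROM [Order] AS o
JOIN Customer AS c ON o.CustomerID = c.ID;

-- Query the view like a table
SELECT OrderNo, OrderDate
FROM Deliveries
WHERE City = 'Seattle';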
Stored Procedure
A stored procedure defines SQL statements that can be run on command. Stored
procedures are used to encapsulate programmatic logic in a database for actions that
applications need to perform when working with data. You can define a stored
procedure with parameters, as in the following example, which renames a product and
is then executed:

CREATE PROCEDURE RenameProduct
@ProductID INT,
@NewName VARCHAR(20)
AS
UPDATE Product
SET Name = @NewName
WHERE ID = @ProductID;

EXEC RenameProduct 201, 'Spanner';
Index
An index helps you search for data in a table. When you create an index in a database,
you specify a column from the table, and the index contains a copy of this data in sorted
order, with pointers to the corresponding rows in the table. When the user runs a query
that specifies this column in the WHERE clause, the database management system can
use this index to fetch the data more quickly than if it had to scan through the entire
table row by row.
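A minimal T-SQL sketch of creating an index, reusing the Product table from the
stored procedure example above (the assumption being that queries frequently filter
on the Name column):

-- Create an index on the Name column of the Product table
CREATE INDEX IX_Product_Name ON Product (Name);

-- Queries that filter on Name can now use the index instead of scanning the whole table
SELECT ID, Name
FROM Product
WHERE Name = 'Spanner';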
Block blobs. A block blob is handled as a set of blocks. Each block can vary in
size, up to 100 MB. A block blob can contain up to 50,000 blocks, giving a
maximum size of over 4.7 TB. The block is the smallest amount of data that can
be read or written as an individual unit.
You can create lifecycle management policies for blobs in a storage account. A
lifecycle management policy can automatically move a blob from Hot to Cool, and then
to the Archive tier, as it ages and is used less frequently (policy is based on the
number of days since modification). A lifecycle management policy can also arrange
to delete outdated blobs.
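As an illustrative sketch only (the exact JSON property names should be verified
against the current Azure Storage documentation), a lifecycle management rule of this
kind might move block blobs to Cool after 30 days, to Archive after 90 days, and
delete them after 365 days:

{
  "rules": [
    {
      "name": "age-out-blobs",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": [ "blockBlob" ] },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
            "delete": { "daysAfterModificationGreaterThan": 365 }
          }
        }
      }
    }
  ]
}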
To create an Azure Data Lake Storage Gen2 file system, you must enable
the Hierarchical Namespace option of an Azure Storage account. You can do this
when initially creating the storage account, or you can upgrade an existing Azure
Storage account to support Data Lake Storage Gen2.
Azure Files
A file share enables you to store a file on
one computer, and grant access to that
file to users and applications running on
other computers. This strategy can work
well for computers in the same local area
network, but doesn't scale well as the
number of users increases, or if users are
located at different sites. Azure Files is
essentially a way to create cloud-based
network shares, such as you typically find
in on-premises organizations to make
documents and other files available to
multiple users.
Azure Files enables you to share up to 100 TB of data in a single storage account.
This data can be distributed across any number of file shares in the account. The
maximum size of a single file is 1 TB, but you can set quotas to limit the size of each
share below this figure. Currently, Azure File Storage supports up to 2000 concurrent
connections per shared file. Azure File Storage offers two performance tiers. The
Standard tier uses hard disk-based hardware in a data center, and the Premium tier
uses solid-state disks.
Azure Files supports two common network file sharing protocols:
Network File System (NFS) shares are used by some Linux and macOS versions.
To create an NFS share, you must use a premium tier storage account and create
and configure a virtual network through which access to the share can be
controlled.
To help ensure fast access, Azure Table Storage splits a table into partitions.
Partitioning is a mechanism for grouping related rows, based on a common property, or
partition key. Rows that share the same partition key will be stored together. Partitioning
not only helps to organize data, but it can also improve scalability and performance in
the following ways:
Partitions are independent of each other and can grow or shrink as rows are added
to, or removed from, a partition. A table can contain any number of partitions.
When you search for data, you can include the partition key in the search criteria.
This helps to narrow down the volume of data to be examined and improves
performance by reducing the amount of I/O needed to locate the data.
Cosmos DB
Azure Cosmos DB is a highly scalable cloud database service for NoSQL data.
Input (SQL)
SELECT *
FROM customers c
WHERE c.id = "joe@litware.com"
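The corresponding output isn't shown in the original; a hypothetical result would be an
array of matching JSON documents, for example:

Output (JSON)
[
  {
    "id": "joe@litware.com",
    "name": "Joe Jones"
  }
]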
MongoDB API
MongoDB is a popular open-source database in which data is stored in Binary JSON
(BSON) format. MongoDB Query Language (MQL) uses a compact, object-oriented
syntax in which developers use objects to call methods. For example, the following
query uses the find method to query the products collection in the db object:
Input (Javascript)
db.products.find({id: 123})
Output (BSON)
{
  "id": 123,
  "name": "Hammer",
  "price": 2.99
}
Table API
The Table API is used to work with data in key-value tables, similar to Azure Table
Storage. The Azure Cosmos DB Table API offers greater scalability and performance
than Azure Table Storage.
https://endpoint/Customers(PartitionKey='1',RowKey='124')
Cassandra API
The Cassandra API is compatible with Apache Cassandra, which is a popular open-
source database that uses a column-family storage structure. Column families are
tables, similar to those in a relational database, with the exception that it's not
necessary for every row to have the same columns.
Gremlin API
The Gremlin API is used with data in a graph structure, in which entities are defined
as vertices that form nodes in a connected graph. Nodes are connected by edges that
represent relationships.
The example in the image shows two kinds of vertex (employee and department) and
edges that connect them (employee "Ben" reports to employee "Sue", and both
employees work in the "Hardware" department).
Gremlin syntax includes functions to operate on vertices and edges, enabling you to
insert, update, delete, and query data in the graph. For example, you could use the
following code to add a new employee named Alice who reports to the employee with
ID 1 (Sue):
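The code itself isn't included in the original; a rough sketch using Gremlin syntax (the
vertex IDs and property names are assumptions) might look like this:

// Add a vertex for the new employee; the ID '3' is an assumed value
g.addV('employee').property('id', '3').property('firstName', 'Alice')

// Connect Alice to Sue (vertex ID '1') with a 'reports to' edge
g.V('3').addE('reports to').to(g.V('1'))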
1. Data ingestion and processing – data from one or more transactional data stores,
files, real-time streams, or other sources is loaded into a data lake or a relational
data warehouse. The load operation usually involves an extract, transform, and
load (ETL) or extract, load, and transform (ELT) process in which the data is
cleaned, filtered, and restructured for analysis. The data processing is often
performed by distributed systems that can process high volumes of data in parallel
using multi-node clusters. Data ingestion includes both batch processing of static
data and real-time processing of streaming data.
On Azure, large-scale data ingestion is best implemented by creating pipelines that
orchestrate ETL processes. You can create and run pipelines using Azure Data
Factory, or you can use the same pipeline engine in Azure Synapse Analytics if
you want to manage all of the components of your data warehousing solution in a
unified workspace.
3. Analytical data model – while data analysts and data scientists can work with the
data directly in the analytical data store, it’s common to create one or more data
models that pre-aggregate the data to make it easier to produce reports,
dashboards, and interactive visualizations. Often these data models are
described as cubes, in which numeric data values are aggregated across one or
more dimensions.
4. Data visualization – data analysts consume data from analytical models, and
directly from analytical stores to create reports, dashboards, and other
visualizations. Additionally, users in an organization who may not be technology
professionals might perform self-service data analysis and reporting. The visualizations
from the data show trends, comparisons, and key performance indicators (KPIs) for
a business or other organization, and can take the form of printed reports, graphs,
and charts in documents or PowerPoint presentations, web-based dashboards, and
interactive environments in which users can explore data visually.
Data Warehouse
A data warehouse is a relational database in which the data is stored in a schema that
is optimized for data analytics rather than transactional workloads. Commonly, the data
from a transactional store is denormalized into a schema in which numeric values are
stored in central fact tables, which are related to one or more dimension tables that
represent entities by which the data can be aggregated. This kind of fact and dimension
table schema is called a star schema; though it's often extended into a snowflake
schema by adding additional tables related to the dimension tables to represent
dimensional hierarchies.
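As a brief illustration of querying a star schema, the following T-SQL aggregates a
fact table measure by two dimensions; the FactSales, DimCustomer, and DimProduct
tables and their columns are assumptions for the sketch:

-- Total sales amount by customer and product
SELECT c.CustomerName, p.ProductName, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimCustomer AS c ON f.CustomerKey = c.CustomerKey
JOIN DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY c.CustomerName, p.ProductName;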
Data Lakes
A data lake is a file store, usually on a distributed file system for high-performance data
access. Technologies like Spark or Hadoop are often used to process queries on the
stored files and return data for reporting and analytics. These systems often apply
a schema-on-read approach to define tabular schemas on semi-structured data files at
the point where the data is read for analysis, without applying constraints when it's
stored.
Hybrid approaches
You can use a hybrid approach that combines features of data lakes and data
warehouses in a lake database or data lakehouse. The raw data is stored as files in a
data lake, and a relational storage layer abstracts the underlying files and exposes them
as tables, which can be queried using SQL. SQL pools in Azure Synapse Analytics
include PolyBase, which enables you to define external tables based on files in a data
lake (and other sources) and query them using SQL.
Real-time Analytics
Data processing is simply the conversion of raw data to meaningful information through
a process. There are two general ways to process data:
Batch processing, in which multiple data records are collected and stored before
being processed together in a single operation.
Stream processing, in which a source of data is constantly monitored and processed
in real time as new data events occur.
Apart from the way in which batch processing and streaming processing handle data,
there are other differences:
Data scope: Batch processing can process all the data in the dataset. Stream
processing typically only has access to the most recent data received, or within a
rolling time window (the last 30 seconds, for example).
Data size: Batch processing is suitable for handling large datasets efficiently.
Stream processing is intended for individual records or micro-batches consisting of
a few records.
Analysis: You typically use batch processing to perform complex analytics. Stream
processing is used for simple response functions, aggregates, or calculations such
as rolling averages.
The following diagram shows some ways in which batch and stream processing can be
combined in a large-scale data analytics architecture.
2. Data from other sources are ingested into a data store (often a data lake) for batch
processing.
3. If real-time analytics is not required, the captured streaming data is written to the
data store for subsequent batch processing.
6. The results of stream processing may also be persisted in the analytical data store
to support historical analysis.
Stream Processing
At its simplest, a high-level architecture for stream processing looks like this:
1. An event generates some data. This might be a signal being emitted by a sensor, a
social media message being posted, a log file entry being written, or any other
occurrence that results in some digital data.
3. The event data is processed, often by a perpetual query that operates on the event
data to select data for specific types of events, project data values, or aggregate
data values over temporal (time-based) periods (or windows) - for example, by
counting the number of sensor emissions per minute.
4. The results of the stream processing operation are written to an output (or sink),
which may be a file, a database table, a real-time visual dashboard, or another
queue for further processing by a subsequent downstream query.
Azure Event Hubs: A data ingestion service that you can use to manage queues of
event data, ensuring that each event is processed in order, exactly once.
Azure IoT Hub: A data ingestion service that is similar to Azure Event Hubs, but
which is optimized for managing event data from Internet-of-things (IoT) devices.
Azure Data Lake Store Gen 2: A highly scalable storage service that is often used
in batch processing scenarios, but which can also be used as a source of streaming
data.
Azure Event Hubs: Used to queue the processed data for further downstream
processing.
Azure Data Lake Store Gen 2 or Azure blob storage: Used to persist the
processed results as a file.
Ingest data from an input, such as an Azure event hub, Azure IoT Hub, or Azure
Storage blob container.
Process the data by using a query to select, project, and aggregate data values.
Write the results to an output, such as Azure Data Lake Gen 2, Azure SQL
Database, Azure Synapse Analytics, Azure Functions, Azure event hub, Microsoft
Power BI, or others.
Once started, a Stream Analytics query will run perpetually, processing new data as it
arrives in the input and storing results in the output.
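A minimal sketch of such a query, assuming input and output aliases named
SensorInput and SensorOutput configured in the job, which counts events per device
over 30-second tumbling windows:

-- SensorInput, SensorOutput, DeviceId, and EventEnqueuedUtcTime are assumed names
SELECT DeviceId,
       COUNT(*) AS EventCount,
       System.Timestamp() AS WindowEnd
INTO SensorOutput
FROM SensorInput TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY DeviceId, TumblingWindow(second, 30);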
Azure Stream Analytics is a great technology choice when you need to continually
capture data from a streaming source, filter or aggregate it, and send the results to a
data store or downstream process for analysis and reporting.
On Azure, Apache Spark is available through several services, including Azure
Databricks and Azure HDInsight.
Spark can be used to run code (usually written in Python, Scala, or Java) in parallel
across multiple cluster nodes, enabling it to process very large volumes of data
efficiently. Spark can be used for both batch processing and stream processing.
Delta Lake
Delta Lake is an open-source storage layer that adds support for transactional
consistency, schema enforcement, and other common data warehousing features to
data lake storage. It also unifies storage for streaming and batch data, and can be used
in Spark to define relational tables for both batch and stream processing. When used for
stream processing, a Delta Lake table can be used as a streaming source for queries
against real-time data, or as a sink to which a stream of data is written.
The Spark runtimes in Azure Synapse Analytics and Azure Databricks include support
for Delta Lake.
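As a short sketch, in a Spark environment with Delta Lake support a relational table
over files in the data lake can be defined with Spark SQL along these lines (the table
name and path are assumptions):

-- Define a Delta table over files at an assumed data lake path
CREATE TABLE SalesOrders
USING DELTA
LOCATION '/data/salesorders';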
Azure Data Explorer is a great choice of technology when you need to:
Capture and analyze real-time or batch data that includes a time-series element,
such as log telemetry or values emitted by Internet-of-Things (IoT) devices.
Explore, filter, and aggregate data quickly by using the intuitive and powerful Kusto
Query Language (KQL).
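A brief KQL sketch, using a hypothetical DeviceTelemetry table, that averages a
temperature reading in 5-minute bins over the last hour:

// DeviceTelemetry, Timestamp, and Temperature are assumed names for illustration
DeviceTelemetry
| where Timestamp > ago(1h)
| summarize AvgTemperature = avg(Temperature) by bin(Timestamp, 5m)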