Azure Data Factory

What is an Activity in Azure Data Factory?

An activity is a task performed on your data. Activities are used inside Azure Data Factory pipelines; an ADF pipeline is a group of one or more activities. For example, when you create an ADF pipeline to perform ETL, you can use multiple activities to extract data, transform it, and load it into your data warehouse. An activity uses input and output datasets. A dataset represents your data, whether it is a table, a file, a folder, etc. The diagram below shows the relationship between activity, dataset, and pipeline:

An input dataset describes the input data and its schema, and an output dataset describes the output data and its schema. An activity can take zero or more input datasets and produce one or more output datasets. Activities in Azure Data Factory can be broadly categorized as:

1- Data Movement Activities

2- Data Transformation Activities

3- Control Activities

DATA MOVEMENT ACTIVITIES:

1- Copy Activity: It copies data from a source location to a destination (sink) location. Azure Data Factory supports many data store locations, such as Azure Storage, Azure databases, NoSQL stores, files, etc.
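
As a rough sketch, a Copy activity inside a pipeline definition looks like the following JSON; the dataset names and the source/sink types are placeholders for whatever stores you actually use:

```json
{
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs": [
        { "referenceName": "InputBlobDataset", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "OutputSqlDataset", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink" }
    }
}
```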

To know more about Data Movement activities, use the link below:

Pipelines and activities in Azure Data Factory - Azure Data Factory | Microsoft Docs

DATA TRANSFORMATION ACTIVITIES:

1- Data Flow: With Data Flow, you first design a data transformation workflow to transform or move data, and then call the Data Flow activity inside an ADF pipeline. It runs on scaled-out Apache Spark clusters. There are two types of data flows, Mapping and Wrangling (a sketch of invoking one follows below).

MAPPING DATA FLOW: It provides a platform to graphically design data transformation logic; you don't need to write code. Once your data flow is complete, you can use it as an activity in ADF pipelines.

WRANGLING DATA FLOW: It provides a platform to use Power Query, familiar from Microsoft Excel, in Azure Data Factory. You can also use Power Query M functions in the cloud.
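
For illustration, a pipeline invokes a mapping data flow through the Execute Data Flow activity. In this sketch, TransformSalesData is a hypothetical data flow name, and the compute settings are just example values:

```json
{
    "name": "RunMappingDataFlow",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataFlow": {
            "referenceName": "TransformSalesData",
            "type": "DataFlowReference"
        },
        "compute": {
            "coreCount": 8,
            "computeType": "General"
        }
    }
}
```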

2- Hive Activity: This is an HDInsight activity that executes Hive queries on a Windows/Linux-based HDInsight cluster. It is used to process and analyze structured data.
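
A Hive activity definition can look roughly like this; the linked service names and script path are placeholders. The other HDInsight activities below follow the same shape with different type values (HDInsightPig, HDInsightMapReduce, HDInsightStreaming, HDInsightSpark):

```json
{
    "name": "RunHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": {
        "referenceName": "HDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scriptPath": "adfscripts/hivequery.hql",
        "scriptLinkedService": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        }
    }
}
```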

3- Pig Activity: This is an HDInsight activity that executes Pig queries on a Windows/Linux-based HDInsight cluster. It is used to analyze large datasets.

4- MapReduce: This is an HDInsight activity that executes MapReduce programs on a Windows/Linux-based HDInsight cluster. It is used for processing and generating large datasets with a parallel, distributed algorithm on a cluster.

5- Hadoop Streaming: This is an HDInsight activity that executes a Hadoop Streaming program on a Windows/Linux-based HDInsight cluster. It lets you write mappers and reducers as executable scripts in any language, such as Python or C++.

6- Spark: This is an HDInsight activity that executes a Spark program on a Windows/Linux-based HDInsight cluster. It is used for large-scale data processing.

7- Stored Procedure: In a Data Factory pipeline, you can use the Stored Procedure activity to invoke a stored procedure in one of the following data stores: Azure SQL Database, Azure Synapse Analytics, SQL Server database, etc.
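
As a sketch, a Stored Procedure activity names the procedure and its parameters; usp_clean_staging and its parameter here are hypothetical:

```json
{
    "name": "CleanStagingTables",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "AzureSqlLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "usp_clean_staging",
        "storedProcedureParameters": {
            "RunDate": { "value": "2021-01-01", "type": "String" }
        }
    }
}
```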

8- U-SQL: It executes U-SQL scripts on Azure Data Lake Analytics. U-SQL is a big data query language that provides the declarative benefits of SQL.

9- Custom Activity: With a Custom activity, you can implement data processing logic of your own that Azure does not provide out of the box. You can configure a .NET activity or an R activity that runs on an Azure Batch pool or an Azure HDInsight cluster.

10- Databricks Notebook: It runs your Databricks notebook in your Azure Databricks workspace. Azure Databricks runs on Apache Spark.
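
A minimal sketch of a Databricks Notebook activity; the workspace path and parameter are placeholders:

```json
{
    "name": "RunTransformNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/transform-sales",
        "baseParameters": {
            "inputPath": "/mnt/raw/sales"
        }
    }
}
```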

11- Databricks Python Activity: This activity runs your Python files on an Azure Databricks cluster.

12- Azure Functions: Azure Functions is an Azure compute service that lets you write event-driven code without provisioning any infrastructure. It stores your code in Azure Storage and keeps its logs in Application Insights. Key points of Azure Functions (a sketch of calling one from ADF follows this list):

1- It is a serverless service.

2- Multiple languages are available: C#, Java, JavaScript, Python, and PowerShell.

3- It uses a pay-as-you-go model.
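
A sketch of calling a function from a pipeline through the Azure Function activity; ProcessNewFiles and the request body are hypothetical:

```json
{
    "name": "NotifyFunction",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
        "referenceName": "AzureFunctionLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "functionName": "ProcessNewFiles",
        "method": "POST",
        "body": { "fileName": "sales.csv" }
    }
}
```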

To know more about Data Transformation activities, use the link below:

Pipelines and activities in Azure Data Factory - Azure Data Factory | Microsoft Docs

CONTROL FLOW ACTIVITIES:

1- Append Variable Activity: It appends a value to an existing array variable.

2- Execute Pipeline Activity: It allows a pipeline to invoke another Azure Data Factory pipeline.

3- Filter Activity: It allows you to apply a filter expression to an input array.

4- For Each Activity: It provides the functionality of a for-each loop: it iterates over a collection and executes the specified activities once per item.
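
For example, a ForEach activity can iterate over a pipeline parameter holding a list of table names; tableList and the inner activity here are stand-ins:

```json
{
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@pipeline().parameters.tableList",
            "type": "Expression"
        },
        "isSequential": false,
        "activities": [
            {
                "name": "WaitPerTable",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 1 }
            }
        ]
    }
}
```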

5- Get Metadata Activity: It is used to get metadata of files/folders. You specify the types of metadata you require: childItems, columnCount, contentMD5, exists, itemName, itemType, lastModified, size, structure, created, etc.
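
A sketch of a Get Metadata activity requesting a few metadata fields for a folder dataset (the dataset name is a placeholder):

```json
{
    "name": "GetFolderMetadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "InputFolderDataset",
            "type": "DatasetReference"
        },
        "fieldList": [ "childItems", "lastModified", "exists" ]
    }
}
```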

6- If Condition Activity: It provides the same functionality as an if statement: it executes one set of activities when the condition evaluates to true and another set when it evaluates to false.
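
As a sketch, an If Condition activity branching on the output of the Get Metadata activity above; the branch activities are stand-ins:

```json
{
    "name": "CheckFolderHasFiles",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@greater(length(activity('GetFolderMetadata').output.childItems), 0)",
            "type": "Expression"
        },
        "ifTrueActivities": [
            { "name": "OnFilesFound", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
        ],
        "ifFalseActivities": [
            { "name": "OnEmptyFolder", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
        ]
    }
}
```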

7- Lookup Activity: It reads and returns content from data sources such as files, tables, or databases, and can also return the result set of a query or stored procedure. The output can be referenced by subsequent activities.
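
A sketch of a Lookup activity reading a single watermark row from a SQL dataset; the dataset, query, and table names are hypothetical:

```json
{
    "name": "LookupWatermark",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT TOP 1 WatermarkValue FROM dbo.Watermark"
        },
        "dataset": {
            "referenceName": "WatermarkDataset",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}
```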

8- Set Variable Activity: It is used to set the value of an existing variable of type String, Boolean, or Array.

9- Switch Activity: It works like a switch statement: it executes the set of activities associated with the matching case.

10- Until Activity: It is the equivalent of a do-until loop: it executes a set of activities in a loop until the associated condition evaluates to true. You can also specify a timeout.
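
A sketch of an Until activity that polls until a Boolean pipeline variable becomes true; the variable name, timeout, and inner activity are placeholders:

```json
{
    "name": "UntilFileReady",
    "type": "Until",
    "typeProperties": {
        "expression": {
            "value": "@equals(variables('fileReady'), true)",
            "type": "Expression"
        },
        "timeout": "01:00:00",
        "activities": [
            { "name": "PollDelay", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 30 } }
        ]
    }
}
```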

11- Validation Activity: It is used to validate the input dataset: the pipeline continues execution only once the referenced dataset exists or meets the specified criteria, or when a timeout is reached.

12- Wait Activity: It simply waits for a given interval of time before the pipeline moves on to the next activity. You specify the number of seconds.

13- Web Activity: It is used to call REST endpoints. You can use it for different use cases, such as triggering an ADF pipeline run.
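
A sketch of a Web activity posting to a REST endpoint; the URL and body are placeholders:

```json
{
    "name": "NotifyOnCompletion",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://example.com/api/notify",
        "method": "POST",
        "body": { "status": "pipeline finished" }
    }
}
```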

14- Webhook Activity: It is used to call an endpoint and pass it a callback URL; the pipeline run waits for the callback to be invoked before moving on to the next activity. You can call external URLs as well.

To know more about Control Flow activities, use the link below:

Pipelines and activities in Azure Data Factory - Azure Data Factory | Microsoft Docs

Data transformation activities

Azure Data Factory and Azure Synapse Analytics support the following transformation activities that
can be added either individually or chained with another activity.

For more information, see the data transformation activities article.

| Data transformation activity | Compute environment |
| --- | --- |
| Data Flow | Apache Spark clusters managed by Azure Data Factory |
| Azure Function | Azure Functions |
| Hive | HDInsight [Hadoop] |
| Pig | HDInsight [Hadoop] |
| MapReduce | HDInsight [Hadoop] |
| Hadoop Streaming | HDInsight [Hadoop] |
| Spark | HDInsight [Hadoop] |
| ML Studio (classic) activities: Batch Execution and Update Resource | Azure VM |
| Stored Procedure | Azure SQL, Azure Synapse Analytics, or SQL Server |
| U-SQL | Azure Data Lake Analytics |
| Custom Activity | Azure Batch |
| Databricks Notebook | Azure Databricks |
| Databricks Jar Activity | Azure Databricks |
| Databricks Python Activity | Azure Databricks |

Control flow activities


The following control flow activities are supported:

| Control activity | Description |
| --- | --- |
| Append Variable | Add a value to an existing array variable. |
| Execute Pipeline | Allows a Data Factory or Synapse pipeline to invoke another pipeline. |
| Filter | Apply a filter expression to an input array. |
| For Each | Defines a repeating control flow in your pipeline. This activity iterates over a collection and executes the specified activities in a loop, similar to the foreach looping structure in programming languages. |
| Get Metadata | Retrieves metadata of any data in a Data Factory or Synapse pipeline. |
| If Condition Activity | Branches based on a condition that evaluates to true or false. It provides the same functionality that an if statement provides in programming languages: it evaluates one set of activities when the condition evaluates to true and another set when it evaluates to false. |
| Lookup Activity | Reads or looks up a record/table name/value from any external source. This output can further be referenced by succeeding activities. |
| Set Variable | Sets the value of an existing variable. |
| Until Activity | Implements a do-until loop similar to the do-until looping structure in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true. You can specify a timeout value for the Until activity. |
| Validation Activity | Ensures a pipeline only continues execution if a reference dataset exists, meets specified criteria, or a timeout has been reached. |
| Wait Activity | The pipeline waits for the specified time before continuing with the execution of subsequent activities. |
| Web Activity | Calls a custom REST endpoint from a pipeline. You can pass datasets and linked services to be consumed and accessed by the activity. |
| Webhook Activity | Calls an endpoint and passes a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity. |
