Migrating From Teradata To Azure SQL Data Warehouse v1.2
Reviewed by:
Jim Toland – Senior Program Manager
Mitch van Huuksloot – Solution Architect
Jim Rosengarth – Solution Architect
Copyright
This document is provided "as-is". Information and views expressed in this document, including URL and other
Internet Web site references, may change without notice. Some examples depicted herein are provided for
illustration only and are fictitious. No real association or connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any Microsoft product.
You may copy, use, and modify this document for your internal, reference purposes.
Microsoft, Azure, Hyper-V, Internet Explorer, Silverlight, SQL Server, Windows, Windows PowerShell, and Windows
Server are trademarks of the Microsoft group of companies. All other trademarks are property of their respective
owners.
Revision History
Table of Contents
Executive Summary
Migration Phases
    Phase One: Fact Finding
    Phase Two: Defining Success Criteria for a Proof of Concept (Optional)
    Phase Three: Data Layering Mapping Options
    Phase Four: Data Modelling
    Phase Five: Identify Migration Paths
    Bringing it all together in the Execution of Migration
Architectures
    Data Migration
    Modern Data Warehouse
    What is the best approach?
Preparation for Data Migration
    Schema Migration
        Introduction
        Create a Teradata Database Azure VM
        Create Jumpbox VM
        Create an Azure SQL Data Warehouse Database
        Creating the Teradata Schema in Azure
        Use Attunity Replicate to Create Schema
        Extract Schema
Migrate Business Logic
    Metadata Approaches
        Phase One: Lift and Shift
        Phase Two: Rewrite Business Logic
        What is the best approach?
    Example of Business Logic Migration
    Semantic Layer and Reporting Views
        What is the best approach?
Data Migration
    Execution
        Source-controlled pipeline
        Azure Controlled pipeline
        What is the best approach?
        Differentiated or Integrated Export Process
        What is the best approach?
    Suggested Azure Data Factory Pipeline Configuration
Executive Summary
This document gives insight into how to approach a Teradata to Azure SQL Data Warehouse migration.
This whitepaper is broken into sections which detail the migration phases, the preparation required for data
migration including schema migration, migration of the business logic, the actual data migration approach and
testing strategy.
The migration should pivot on six areas (one is optional but highly recommended). With the benefit of Azure, you can get ahead of the development curve by quickly provisioning Azure SQL Data Warehouses so that your development team can start business object migration before the data is migrated.
Using an agile methodology, with a clear focus on what must be delivered to provide a working environment versus the items that are “nice to have”, will establish a foundation to build upon within Azure.
The phases build on top of each other: Fact Finding, Defining Success Criteria for a Proof of Concept, Data Layering, Data Modelling, Identify Migration Paths and Execution of Migration.
Through a question and answer session you can help define what your inputs and outputs are for the migration project. This is the requirements-gathering stage in an Agile methodology.
Taking the answers from phase one, you decide whether to run a proof of concept (POC) to validate the required outputs and run the following phases as a POC. If you have decided upon a live migration, then you would skip the POC stage (not recommended). This should feed into the Agile Project Charter and establish the team norms.
This phase is about mapping the data you have in Teradata to the data layout you will create in Azure SQL Data
Warehouse.
Once you’ve defined the data mappings, phase four concentrates on how to tune Azure SQL Data Warehouse
to provide the best performance for the data you will be landing into it.
Depending on the state of the data warehouse you might need to deploy a team to adjust the schema into a
star or snowflake data model. This layout can bring performance benefits to both the data and reporting
layers.
What is the path of least resistance? What is the quickest path given your cloud maturity? Phase Five helps
describe the options open to you and then for you to decide on the path you wish to take. With Azure you
have two different paths – export to Azure Databox and ship to the Azure Datacentre or use Azure
ExpressRoute to transfer from your data centre to Azure’s datacentre.
The data export file types can make a difference to the export, transfer and import to Azure SQL Data
Warehouse – these have been listed in the detailed section.
Migrating your Teradata data to SQL Data Warehouse involves a series of steps. These steps are executed in three logical stages: Preparation, Metadata migration and Data migration, which have been defined in the previous phases and are combined into Phase Six: Execution of Migration.
Once migrated, you should have a robust testing plan to validate the solution in a test environment before going live in production.
▪ Work with the Teradata DBAs to understand the data compression. For example, a table in Teradata
could be 1TB, but Teradata has achieved 4x compression, so on export to disk the text file would be
4TB. This affects the amount of space you require on a virtual machine within the Teradata
datacentre.
Migration Phases
Phase One: Fact Finding
Phase 1 is all about fact finding and understanding whether the workload is new or a migration project. This is
the requirements-gathering phase in Agile.
Before migrating, you want to be certain SQL Data Warehouse is the right solution for your workload. SQL Data
Warehouse is a distributed system, designed to perform analytics on large volumes of data. Migrating to SQL
Data Warehouse requires some design changes that are not too hard to understand, but might take some time
to implement. If your business requires an enterprise-class data warehouse (DW), the benefits are worth the
effort. However, if you don't need the power of SQL Data Warehouse, it is more cost-effective to use SQL
Server or Azure SQL Database.
Consider using SQL Data Warehouse when you:
▪ Have one or more Terabytes of data
▪ Plan to run analytics on large amounts of data
▪ Need the ability to scale compute and storage
▪ Want to save costs by pausing compute resources when you don't need them.
Don't use SQL Data Warehouse for operational (OLTP) workloads that have:
▪ High frequency reads and writes
▪ Large numbers of singleton selects
▪ High volumes of single row inserts
▪ Row by row processing needs
▪ Incompatible formats (JSON, XML)
Strategic Questions
▪ Is this a Lift & Shift Migration of On-Premises Data Warehouse or a Green Field Project?
▪ What Raw Data Capacity (at the Storage Level) is required?
▪ What Eco-System Processes are in place to be migrated as part of Data Migration?
▪ What are Data Growth Expectations in 2 – 3 years’ time?
▪ Would it be a 2-Tier Data Warehouse or a Traditional 3-Tier Data Warehouse (Acquisition / Ingestion
Layer, Integrated Data / Core Layer and Access / View / Semantic Layer)?
▪ What are the expectations around the following?
− Concurrency (Number of Users at Peak Times and Tactical Workload Queries for ingress and
egress of Data alongside in-place Transformations, reference data management and BI Query
Workload)
− Scalability – Ties to the concurrency above + Total daily ingestion in bytes
− Availability – Tech. and Business SLA driven
− Resilience – Disaster Recovery and RTOs and RPOs (off-site, cross-region Active / Passive
replicated / synced database)
− Protection – Data Backup and Archival – how often, what strategy, and what SLAs apply
▪ Are there any thoughts around deploying Semantic Layer via Dimensional or Tabular Data Models
(SSAS?)
Once you have the questions above and some of those below answered, you can then look at sizing the
Data Warehouse.
▪ Total, average and maximum concurrent users
▪ Backups and Archive (Alongside frequencies, retention)
− Hot Archives within the DW
− Warm Archives outside DW
− Cold Archives outside (if any)
− Disaster Recovery (DR and their SLAs)
− RTO and RPOs?
▪ IT / System / Data Governance Compliance
− Security
− Data Protection
− Data Privacy
from the task carried out after data migration. The other two mentioned (High Availability and SLAs) can be
verified based on the number of clients already running solutions in Azure.
The Business Development Manager is key at the end of the process as they are usually the person who signs
off the investment to move to production. They must be involved from the start and should define the cost
target at which it makes sense to migrate from Teradata to an Azure solution.
The outputs are driven from the overall proof of concept to provide a Bill of Materials (BOM) and the costs
associated with running the solution.
Finally, through the course of the POC you will identify optimisations and fixes which will need to be applied to
the migration. These all drive the cost of migration, but it is important to categorise them and show the return on
investment of implementing these changes. Some changes will be mandatory, such as identity and networking
infrastructure. These mandatory requirements are usually referred to as the Cloud Operating Model.
Consider this as part of your Agile Project Charter and then take the information in this and the previous phase
to build out your agile project. Run two-week sprints with a well-defined backlog. The first sprint would
concentrate on getting the infrastructure ready. From sprint two the team would be focussing on schema and
data migration.
▪ Consider consolidating your existing data warehouse environment to use one SQL Data Warehouse
database if the workload formerly required cross-database joins.
▪ Leverage user-defined schemas to provide the boundary previously implemented using databases.
▪ If user-defined schemas have not been used previously then you have a clean slate. Simply use the old
database name as the basis for your user-defined schemas in the SQL Data Warehouse database.
If schemas have already been used, then you have a few options:
You might want to give this some additional thought if you are using views in your architecture
already.
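As a minimal sketch of this approach, assuming a former Teradata database named sales_mart (the schema, table and column names here are purely illustrative):
-- Recreate the boundary of a former Teradata database called sales_mart
-- as a user-defined schema inside the single SQL Data Warehouse database.
CREATE SCHEMA sales_mart;
GO

-- Objects that previously lived in the sales_mart database are created
-- under the new schema instead of in a separate database.
CREATE TABLE sales_mart.DimProduct
(
    ProductKey  int           NOT NULL,
    ProductName nvarchar(100) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
Cross-database joins then become ordinary cross-schema joins within the one database.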
To choose the best distribution option for each table, see Distributed tables.
A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the
table data. Data modelers like to create surrogate keys on their tables when they design data warehouse
models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load
performance. See Using IDENTITY to create surrogate keys in Azure SQL Data Warehouse. The table below
shows how a StoreKey surrogate key is generated for each Retailer_ID and Store_ID combination:
Retailer_ID    Store_ID    StoreKey
1              1           1
1              2           2
2              1           3
2              3           4
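A minimal sketch of this pattern, assuming a hypothetical DimStore dimension (table and column names are illustrative):
-- StoreKey is generated by the IDENTITY property rather than derived from the data.
CREATE TABLE dbo.DimStore
(
    StoreKey    int IDENTITY(1,1) NOT NULL,
    Retailer_ID int               NOT NULL,
    Store_ID    int               NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);

-- Rows inserted without specifying StoreKey receive the next identity value.
INSERT INTO dbo.DimStore (Retailer_ID, Store_ID) VALUES (1, 1);
INSERT INTO dbo.DimStore (Retailer_ID, Store_ID) VALUES (1, 2);
Note that in SQL Data Warehouse an IDENTITY column cannot be used as the distribution column, and the generated values are unique but not guaranteed to be sequential.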
▪ Source controlled pipeline - data export, transfer and import steps run mostly from the source data
system
▪ Azure controlled pipeline - data export, transfer and import steps run mostly from Azure
▪ Differentiated Export - data export, transfer and import are distinctly executed with each reading from
or writing to intermediate files
▪ Integrated Export - data export, transfer and import are combined, and entities are transferred directly
from the source data system to Azure SQL Data Warehouse with no intermediate files created
Where possible, automate as much of the export as possible. This enables repeatable exports and can be used
for both the proof of concept and the live migration.
Use PolyBase to import the data into Azure SQL Data Warehouse. PolyBase is a technology that accesses data
outside of the database via the T-SQL language. It is the best way to load data into SQL Data Warehouse. With
PolyBase, the data loads in parallel from the data source directly to the compute nodes.
PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files. In addition to the delimited text files,
it loads from the Hadoop file formats RC File, ORC, and Parquet. PolyBase can load data from Gzip and Snappy
compressed files. PolyBase currently does not support extended ASCII, fixed-width format, and nested formats
such as WinZip, JSON, and XML.
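The sketch below shows a minimal PolyBase load, assuming a pipe-delimited, gzip-compressed export has landed in a blob container named import. The storage account, credential, schema and table names are placeholders for illustration, and a database master key, database scoped credential and the ext and Staging schemas are assumed to exist already:
-- External data source pointing at the blob container holding the export files.
CREATE EXTERNAL DATA SOURCE AzureBlobImport
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://import@<storageaccount>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- File format matching the pipe-delimited, gzip-compressed export.
CREATE EXTERNAL FILE FORMAT PipeDelimitedGzip
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

-- External table over the exported files for one entity.
CREATE EXTERNAL TABLE ext.FactSales
(
    SaleDate date           NOT NULL,
    StoreKey int            NOT NULL,
    Amount   decimal(18, 2) NOT NULL
)
WITH (
    DATA_SOURCE = AzureBlobImport,
    LOCATION = '/FactSales/',
    FILE_FORMAT = PipeDelimitedGzip
);

-- Load in parallel into the staging schema using CREATE TABLE AS SELECT.
CREATE TABLE Staging.FactSales
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT SaleDate, StoreKey, Amount
FROM ext.FactSales;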
After collecting information, combining it and making the key decisions in phases one to five, you must move
on to the Execution of the Migration. Migrating your Teradata data to SQL Data Warehouse is executed in
three logical stages: Preparation, Metadata migration and Data migration.
The logical progression runs through Preparation, then Metadata Migration, then Data Migration.
In each stage, tasks to be executed involve the on-premises database system, the on-premises local storage,
the network connecting the local system to Azure (either internet or a dedicated circuit) and SQL Data
Warehouse.
This results in a physical data movement from the source database, through local storage and the network, to
Azure SQL Data Warehouse. (These steps are similar when moving data to SQL Data Warehouse from any other
source system, whether it is in the cloud or on-premises.)
Bringing these two views together gives the combined flow:
▪ Metadata: export from Teradata, apply compatibility changes, copy to Azure (over the internet or
ExpressRoute), and import into SQL Data Warehouse.
▪ Data: filter and extract, apply compatibility changes, conversions and transformations, compression,
copy to Azure, load, and finally optimizations.
If the volume of the data to migrate is large, some steps can be time consuming. These steps are rate-
determining because they influence the overall migration time.
Some migration steps may be optional depending on the size of the data, the nature of the network, and the
tools and services used for migration.
Architectures
Data Migration
To ingest historical data, you need a basic cloud data warehouse setup for moving data from your on-
premises solution to Azure SQL Data Warehouse and to enable the development team to build Azure Analysis
Services cubes once the majority of the data is loaded.
Azure Data Factory Pipeline is used to ingest and move data through the store, prep and train pipeline.
Extract and load files, via PolyBase, into the Staging schema on Azure SQL DW.
Transform data through the Staging, Source (ODS), EDW and Semantic schemas on Azure SQL DW.
Azure Analysis Services will be used as the semantic layer to serve thousands of end users and scale out
Azure SQL DW concurrency.
Build operational reports and analytical dashboards on top of Azure Analysis Services to serve thousands of
end users via Power BI.
Note: You should provision a testing virtual machine in Azure, with accelerated
networking enabled. This will be used to execute ad-hoc test queries. You can then look at
the internals of Azure SQL DW to see the latency/execution times.
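For example, a quick way to review latency and execution times from the test virtual machine is to query the request DMV (an illustrative query; the TOP value is arbitrary):
-- Most recent requests with their status and elapsed time in milliseconds.
SELECT TOP 20
       request_id,
       status,
       submit_time,
       total_elapsed_time,
       [command]
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;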
This gives you the ability to easily add more data feeds, which are then moved through the pipeline up to the
Azure SQL Data Warehouse layer and on to your analytics/reporting layer.
Combine all your structured, unstructured and semi-structured data (logs, files, and media) using Azure
Data Factory to Azure Blob Storage.
Cleansed and transformed data can be moved to Azure SQL Data Warehouse to combine with existing
structured data, creating one hub for all your data. Leverage native connectors between Azure Databricks
and Azure SQL Data Warehouse to access and move data at scale.
Leverage data in Azure Blob Storage to perform scalable analytics with Azure Databricks and achieve
cleansed and transformed data.
Build operational reports and analytical dashboards on top of Azure SQL Data Warehouse to derive insights
from the data, and use Azure Analysis Services to serve thousands of end users. Run ad hoc queries directly
on data within Azure Databricks.
Once the team starts to work on incremental loads, the Modern Data Warehouse architecture should have
been built, with a focus on Azure Databricks to cleanse and transform data into Azure SQL Data Warehouse.
See further details on Data Migration pipeline to understand how this architecture supports your migration.
Because an environment can be provisioned in Azure in under an hour, both activities can run in parallel. The key to
getting the schema from Teradata to SQL Data Warehouse is to build a Teradata Database VM on which we can
recreate the on-premises schema.
The process of extracting the Teradata tables to Azure SQL Data Warehouse is:
1. Create a Teradata Database in Azure using the Azure Marketplace.
2. Create a Jumpbox virtual machine and configure it.
3. Create an Azure SQL Data Warehouse database.
4. Extract the schema.
If you are able to provision a virtual machine quickly in your datacenter, close to the
Teradata database, then this is the preferred process, and it would allow you to start at
“Create Jumpbox VM”. However, in our experience, internal provisioning processes can slow
this down and leave the schema migration on hold for a long time.
Using the Azure Marketplace, it is possible to deploy a 2-node development edition of Teradata Database 15.x and
then create a copy of the schemas you wish to migrate. Only the Teradata Database is required; no other
Teradata components are needed.
Pre-requisites:
You should have extracted a copy of the Teradata schema scripts for the objects you wish to migrate. You will
need one vNet with three subnets provisioned and an Azure subscription with Contributor permission on at
least one empty Azure resource group.
After the Teradata cluster has been created, you will need to open port 1025 to allow the Jumpbox VM to
communicate with it.
Full details on creating a Teradata Azure VM can be found in the Appendix under “Create Teradata Database
Azure VM”.
Create Jumpbox VM
To facilitate copying the schema between the Azure Teradata Cluster and Azure SQL Data Warehouse you
should create a Windows 2012 Azure VM (Size D4_v2). Using the Data Science Virtual Machine - Windows
Server 2012 edition, from the Azure Marketplace, will save you time in installing SQL Server Management
Studio and Visual Studio – since these components are pre-installed on the Azure Marketplace image.
Additional software which should be installed after this has been provisioned (this also applies to an on-premises
virtual machine): -
Full details on creating the Jumpbox VM can be found in the Appendix under “Create Jumpbox VM”.
Prior to using Attunity Replicate to create the schema, you will need to create a master key for Azure SQL Data
Warehouse.
To run loads with appropriate compute resources, we recommend that you create a dedicated loading user
assigned to a resource class that fits the load requirements (see Loading Best Practices).
Full details on Creating an Azure SQL Data Warehouse Database can be found in the Appendix under ”Create a
SQL DataWarehouse Dev DB”
By approaching the schema migration using a separate instance away from Production, you can remove
several barriers around connecting to a production instance.
Full details on how you can create the Teradata schema can be found in the Appendix under ”Create the
Teradata Schema in Azure”.
Attunity Replicate for Microsoft Migrations is installed on the Jumpbox VM and is used to transfer the Teradata
schema to Azure SQL Data Warehouse.
The table schema will need to be performance tuned as each table is created as a ROUND ROBIN table.
Once migrated, you should check that the schema has been transferred. The full process is described in the
Appendix under “Using Attunity Replicate to Create Schema”.
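One way to confirm the transfer, and to see which tables still need their distribution changed, is to query the catalog views (an illustrative query using the documented Azure SQL Data Warehouse catalog views):
-- List migrated tables and their current distribution policy;
-- tables created by the replication tool will show ROUND_ROBIN.
SELECT s.name AS schema_name,
       t.name AS table_name,
       dp.distribution_policy_desc
FROM sys.tables t
JOIN sys.schemas s
    ON t.schema_id = s.schema_id
JOIN sys.pdw_table_distribution_properties dp
    ON t.object_id = dp.object_id
ORDER BY s.name, t.name;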
Extract Schema
At this point in the schema creation process, it is advisable to extract the Azure SQL Data Warehouse schema
and check it into your code store.
The easiest way to achieve this is with the mssql-scripter. mssql-scripter is the multiplatform command line
equivalent of the widely used Generate Scripts Wizard experience in SSMS.
You can use mssql-scripter on Linux, macOS, and Windows to generate data definition language (DDL) and data
manipulation language (DML) T-SQL scripts for database objects in SQL Server running anywhere, Azure SQL
Database, and Azure SQL Data Warehouse. You can save the generated T-SQL script to a .sql file or pipe it to
standard *nix utilities (for example, sed, awk, grep) for further transformations. You can edit the generated
script or check it into source control and subsequently execute the script in your existing SQL database
deployment processes and DevOps pipelines with standard multiplatform SQL command line tools such as
sqlcmd.
This will extract the Azure SQL Data Warehouse Schema to allow you to recreate the database later. For a full
command line reference please see https://github.com/Microsoft/mssql-
scripter/blob/dev/doc/usage_guide.md.
Some Azure SQL Data Warehouse features can significantly improve performance as they are designed to work
in a distributed fashion. However, to maintain performance and scale, some features are also not available.
See Migrate your SQL code to SQL Data Warehouse.
Metadata Approaches
There are two approaches that you can take to migrating the stored procedures and views which reside in
the Teradata Database.
However, if you are migrating to different application stacks and have the timelines to take on business logic
migration alongside the data loads, then investment upfront in this will leave you with less technical debt.
Looking at this code, there are multiple challenges with trying to execute it against SQL Data Warehouse. To
convert it into T-SQL we would apply the following changes:
These are small changes and show why our approach of Datometry followed by recoding eases the migration
path to SQL Data Warehouse.
The finished T-SQL script, with changes highlighted in red, would look like:
Several great benefits exist when leveraging Azure Analysis Services (AAS) with SQL DW:
▪ Use Cached mode for maximum query performance, where you have complex measures, full cache
model and to be able to pause/scale Azure SQL Data Warehouse up and down for loads. The
challenge with this mode is reduced data freshness, data must fit into memory and it is read only (no
writeback).
▪ Use Direct Query for sourcing data directly, optimal data freshness and querying at the base fact level. The
challenges are inconsistent query performance, consumption of concurrency slots, no caching, the
requirement for Azure SQL Data Warehouse to be online, and possibly a higher DWU level.
Data Migration
There are multiple ways to approach data migration, and your choice will depend on the connectivity you have
from your datacenter to Azure.
Execution
Source-controlled pipeline
▪ Data export, transfer and import steps run mostly from the source data system
▪ Reuse existing compute and storage resources at the source system
▪ May use custom scripts and programs or ETL tools like SSIS run from the source database server
Azure Controlled pipeline
Adopting the Azure controlled pipeline approach, all the way from Azure down to the on-premises resources,
removes human intervention and reduces the number of mistakes. It relies less on skills and more on the
infrastructure you have enabled.
Teradata ships with the FastExport utility, which is invoked by Teradata Parallel Transporter to export data. The
Azure Data Factory v2 Teradata connector calls Teradata Parallel Transporter.
An internet connection to the virtual machine is required to allow Azure Data Factory to connect to the self-hosted
integration runtimes which will orchestrate the export/import process.
If your data export is between 1TB and 2TB at a time, having at least two virtual machines with self-hosted
integration runtimes will increase throughput by around 1.5x.
The following items are required on the Virtual Machine located in the Teradata Datacenter: -
▪ Windows Server 64bit 2012 R2, 8 cores and minimum 256GB RAM.
▪ Enough disk space to export your data to. This depends on how much whitespace you have in your
database tables. It is possible to compress a 1.2TB extract to 100GB using pipe delimited text, with
GZIP, or ORC formats if most columns are NULL or have repeated values.
▪ Teradata Tools and Utilities 15.1 (https://downloads.teradata.com/download/tools/teradata-tools-
and-utilities-windows-installation-package).
▪ Configure Virtual Machine as per the instructions on Configuring Data Science VM earlier in this
document.
▪ PowerShell with the Azure PowerShell module 3.6 or later (see install documentation at
https://docs.microsoft.com/en-us/powershell/azure/install-azurerm-ps?view=azurermps-6.8.1)
▪ SQL Server Management Studio (https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-
management-studio-ssms?view=sql-server-2017)
▪ Microsoft Connectors for Oracle and Teradata by Attunity for Integration Services (SSIS)
(https://docs.microsoft.com/en-us/sql/integration-services/attunity-connectors?view=sql-server-
2017)
For the integrated history load process, we suggest you create a Windows 2012 R2+ virtual machine
located as near as possible to the existing Teradata cluster and follow the instructions on Configuring Jumpbox VM
earlier in this document.
You should then follow the Attunity Replicate for Microsoft Migrations instructions for migrating data which
are maintained outside this document.
The integrated approach relies on Azure ExpressRoute being in place between the source Teradata location
datacenter and Azure datacenter location, then using Attunity Replicate for Microsoft Migrations to ease the
historical load process.
Attunity Replicate for Microsoft Migrations can be scaled out onto multiple nodes and use one node to co-
ordinate the load process. It is recommended that you build VMs, based on their recommended specifications,
in the Teradata source location datacenter. Attunity will query data locally and then upload to blob storage
before creating external tables and importing data using a Create Table as Select to the base tables.
Differentiated Export
▪ Data export, transfer and import are distinctly executed with each reading from or writing to
intermediate files
▪ File compression is used to reduce the cost and time in transferring files from the source system to
Azure
The data is transferred to local storage, which can then be copied to Azure Data Box disks and shipped to
the Azure datacenter. Full details on Azure Data Box are available at https://docs.microsoft.com/en-us/azure/databox/data-
box-disk-overview.
The time to transfer data over the internet or Azure ExpressRoute will drive the decision to use Azure Data
Factory or AzCopy to upload the export files instead of the Azure Data Box service.
For example, a 100TB export over a 70 Mbps connection, assuming you have 100% of the bandwidth all of the time,
would take almost 140 days to upload. Anything that would take more than 14 days to upload should make use of
the Azure Data Box service.
To summarize, both methods begin by exporting from Teradata to local disk using Azure Data Factory v2; they
then differ in how the exported files are transferred to Azure and loaded.
▪ For large databases, follow the differentiated export approach with Azure Data Factory
and export tables to pipe-delimited text with gzip, or ORC, from Teradata – AzCopy to Azure Blob
Storage or Azure Data Lake Store, then use Azure Data Factory to PolyBase into Azure SQL DW.
▪ To force Polybase instead of BULKCOPY, in Azure Data Factory, make sure:
− The input dataset type is AzureBlob or AzureDataLakeStoreFile. The format type under type
properties is OrcFormat, or TextFormat (Pipe Delimited Text is fastest), with the following
configurations:
▪ fileName doesn't contain wildcard filter.
▪ rowDelimiter must be \n.
▪ nullValue is either set to empty string ("") or left as default, and treatEmptyAsNull is not set
to false.
▪ encodingName is set to utf-8, which is the default value.
▪ escapeChar, quoteChar and skipLineCount aren't specified. PolyBase supports skip header
row which can be configured as firstRowAsHeader in ADF.
▪ compression can be no compression, GZip, or Deflate – our results showed that 1.5TB can be
compressed down to 100GB using GZIP if the data contains a lot of similar values. If using
GZIP compression, set to Optimal.
▪ If the data is relatively small:
− Follow the integrated approach by using Attunity Replicate for Microsoft Migrations and consider
scaling this out with a couple of nodes.
− Use your favorite ETL tool and parallel copy several tables at once from source to target
If you have no Azure ExpressRoute or less than a 70 Mbps VPN to Azure, then:
▪ Using Azure Data Factory, export tables to pipe-delimited text or ORC from Teradata and use Data Box
to transfer to the Azure datacenter – PolyBase into Azure SQL DW.
If your POC doesn’t have Azure ExpressRoute or you need to load data in stages, then you will need to look at
alternative ways to load data from Teradata to SQL Data Warehouse, using the differentiated approach and
your favorite ETL toolkit.
All configuration activities are executed from the Virtual Machine located nearest to the Teradata Cluster.
End-to-End Workflow
▪ The first pipeline looks up the list of tables that need to be copied over to the sink data stores. Then,
the pipeline triggers another pipeline, which iterates over each table in the database and performs
the data copy operation.
▪ The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table
in the list, it copies the specific table in Teradata to the local storage.
Note: Depending on bandwidth, you would either copy the local storage to Azure Data Box disks and ship them to the
Azure datacenter, or use AzCopy with multiple threads (no more than the number of processor cores on the VM) to
copy to blob storage.
▪ The third pipeline looks up the list of tables that need to be copied over to the sink data stores. Then,
the pipeline triggers another pipeline, which iterates over each table in the database and performs
the data copy operation.
▪ The fourth pipeline performs the actual copy. It takes the list of tables as a parameter. For each table
in the list, it copies the specific table in blob storage to Azure SQL Data Warehouse.
You could keep these metadata tables simple, but keep in mind that if you have some very large
tables, you will want to split these into smaller export files, as opposed to creating one large export file, which
can cause out-of-memory issues during export and is much harder to reload should the import fail at some
point.
During our testing, we always ran a smaller 1.5TB test export a couple of times to make sure that the individual
file sizes for one table were less than 2GB.
The simple metadata table, called TablesToProcess, has schemaname and tablename columns. If you wanted
to add intelligence to this, you could add an imported smallint column to allow ADF to update this column to 1
upon successful import or 3 upon failure.
A more advanced metadata table structure would still need one table called TablesToProcess which has
schemaname, tablename, filtercol, FilterVal and imported columns.
In the advanced metadata table, you will see multiple entries for the same table. As Azure Data Factory
accepts parameters, we can build dynamic SQL where clauses. The filtercol contains the column we will filter
on and the filtervalue is the value we will compare to.
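A hypothetical shape for this advanced metadata table (it would typically live in the control database used by Azure Data Factory; the column sizes and sample rows are purely illustrative):
CREATE TABLE dbo.TablesToProcess
(
    schemaname nvarchar(128) NOT NULL,
    tablename  nvarchar(128) NOT NULL,
    filtercol  nvarchar(128) NULL,   -- column used to split a large table into smaller exports
    FilterVal  nvarchar(128) NULL,   -- value compared against filtercol in the dynamic WHERE clause
    imported   smallint      NOT NULL -- 0 = pending, 1 = imported successfully, 3 = failed
);

-- A large fact table split into two export files by year (illustrative values).
INSERT INTO dbo.TablesToProcess (schemaname, tablename, filtercol, FilterVal, imported)
VALUES ('EDW', 'FactSales', 'SaleYear', '2017', 0);
INSERT INTO dbo.TablesToProcess (schemaname, tablename, filtercol, FilterVal, imported)
VALUES ('EDW', 'FactSales', 'SaleYear', '2018', 0);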
When the pipeline runs, it will use the advanced metadata table to export data into smaller files. The import will
use the same table to move the data into Azure SQL Data Warehouse.
To best understand this pipeline we would suggest running through the detailed setup on the Appendix.
Next Steps
This section details the next areas you need to put into your migration plan once the schema and data have
been moved to Azure SQL Data Warehouse.
Data Distribution
Data Distribution is an important concept in SQL Data Warehouse. It allows breaking down data processing
requirements for efficient parallel processing across all hardware nodes. When creating any table in SQL DW,
the distribution key is used to determine the node on which each row of data is stored. That is, any table will
be distributed across multiple nodes based upon the distribution key. An important factor in SQL DW database
design is selecting good distribution keys for fact tables.
The most expensive queries in MPP systems such as SQL DW, are those that involve a lot of data movement.
When you execute a query, SQL DW will redistribute rows across the compute nodes as necessary to fulfill the
query such as on JOINs and GROUP BY statements. Users should create tables with distribution styles that
minimize data movement, reflective of queries that would occur in their normal workload.
▪ Hash Distribution – Based on the result of a hashing algorithm on the chosen key, SQL Data
Warehouse will place the row in one of sixty distributions.
▪ Round Robin – SQL Data Warehouse will distribute each incoming row in a round robin fashion
▪ Replicated – SQL Data Warehouse will replicate the table on each of the compute nodes, backed by a
round robin table for insert performance.
Once data has been loaded, you should redistribute tables to a more appropriate distribution type.
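A common pattern for changing the distribution type after the initial load is CREATE TABLE AS SELECT followed by a rename. The sketch below assumes a hypothetical FactSales table loaded as round robin and hash-distributes it on StoreKey:
-- Re-create the round-robin table as hash-distributed on the join column.
CREATE TABLE dbo.FactSales_hash
WITH (DISTRIBUTION = HASH(StoreKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales;

-- Swap the names and drop the original round-robin copy.
RENAME OBJECT dbo.FactSales TO FactSales_old;
RENAME OBJECT dbo.FactSales_hash TO FactSales;
DROP TABLE dbo.FactSales_old;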
Indexing
SQL Data Warehouse offers several indexing options including clustered columnstore indexes, clustered
indexes and nonclustered indexes, and a non-index option also known as heap.
Clustered columnstore tables offer both the highest level of data compression as well as the best overall query
performance. Clustered columnstore tables will generally outperform clustered index or heap tables and are
usually the best choice for large tables. Clustered columnstore is the best place to start when you are unsure
of how to index your table.
There are a few scenarios where clustered columnstore may not be a good option:
▪ Columnstore tables do not support varchar(max), nvarchar(max) and varbinary(max). Consider heap
or clustered index instead.
▪ Columnstore tables may be less efficient for transient data. Consider heap and perhaps even
temporary tables.
▪ Small tables with less than 100 million rows. Consider heap tables.
For further information on indexing tables in SQL Data Warehouse please read https://docs.microsoft.com/en-
us/azure/sql-data-warehouse/sql-data-warehouse-tables-index
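As a sketch of the three options (hypothetical table and column names):
-- Large fact table: clustered columnstore index, the default starting point.
CREATE TABLE dbo.FactSales
(
    SaleDate date           NOT NULL,
    StoreKey int            NOT NULL,
    Amount   decimal(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(StoreKey), CLUSTERED COLUMNSTORE INDEX);

-- Transient staging table: heap avoids the columnstore overhead for short-lived data.
CREATE TABLE Staging.FactSales_Load
(
    SaleDate date           NOT NULL,
    StoreKey int            NOT NULL,
    Amount   decimal(18, 2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension that benefits from range filtering on its key column.
CREATE TABLE dbo.DimDate
(
    DateKey  int  NOT NULL,
    FullDate date NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (DateKey));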
Partitioning
Fact tables are typically millions or billions of rows in a data warehouse and using table partitioning during
loading and querying can improve performance.
Benefits to loads
The primary benefit of partitioning in SQL Data Warehouse is to improve the efficiency and performance of
loading data by using partition deletion, switching and merging (see https://docs.microsoft.com/en-
us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition). In most cases data is partitioned on a date
column that is closely tied to the order in which the data is loaded into the database. One of the greatest
benefits of using partitions to maintain data is the avoidance of transaction logging. While simply inserting,
updating, or deleting data can be the most straightforward approach, with a little thought and effort, using
partitioning during your load process can substantially improve performance.
Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact
table might contain just data for the past 36 months. At the end of every month, the oldest month of sales
data is removed from the table. This data could be removed by using a delete statement to delete the data for
the oldest month. However, removing a large amount of data row-by-row with a delete statement can take
too much time, as well as create the risk of large transactions that take a long time to rollback if something
goes wrong. A more optimal approach is to drop the oldest partition of data. Where deleting the individual
rows could take hours, dropping an entire partition could take seconds.
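A minimal sketch of this pattern, assuming the hypothetical FactSales table is re-created with monthly partitions on SaleDate (boundary values are truncated to three months for brevity):
-- Fact table partitioned by month (only three boundaries shown).
CREATE TABLE dbo.FactSales_Partitioned
(
    SaleDate date           NOT NULL,
    StoreKey int            NOT NULL,
    Amount   decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(StoreKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01'))
);

-- An empty table with an identical structure and partition scheme receives the old data.
CREATE TABLE dbo.FactSales_Archive
(
    SaleDate date           NOT NULL,
    StoreKey int            NOT NULL,
    Amount   decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(StoreKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01'))
);

-- Switch the January 2018 partition (partition 2) out of the fact table; this is a
-- metadata-only operation, after which the archive table can be dropped or retained.
ALTER TABLE dbo.FactSales_Partitioned SWITCH PARTITION 2 TO dbo.FactSales_Archive PARTITION 2;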
Benefits to queries
Partitioning can also be used to improve query performance. A query that applies a filter to partitioned data
can limit the scan to only the qualifying partitions. Partition elimination can avoid a full table scan and only scan
a smaller subset of data. With the introduction of clustered columnstore indexes, the predicate elimination
performance benefits are less beneficial, but in some cases, there can be a benefit to queries. For example, if
the sales fact table is partitioned into 36 months using the sales date field, then queries that filter on the sale
date can skip searching in partitions that don’t match the filter.
query than how the data is at rest. As a rule of thumb, try to avoid any more than 30% skew across
distributions.
Avoid a NULLABLE column for your distribution key. Rows where the distribution key is NULL or an “unknown”
value such as -1 will be distributed in a single distribution which can lead to very skewed data and bottleneck
performance.
For type 1 and type 2 dimensions:
▪ Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB
compressed. If you have dimension tables that are frequently used in join conditions, with simple
query predicates, such as equality or inequality, and are rarely updated, consider replicating these
tables. This will create a copy of the table on each compute node, preventing any data movement
from occurring on join. If data changes frequently you should consider the performance impact and
consult https://docs.microsoft.com/en-us/azure/sql-data-warehouse/design-guidance-for-replicated-
tables#performance-considerations-for-modifying-replicated-tables.
▪ If the table size is greater than 2GB, or you have queries with complex query predicates such as LIKE
or NOT LIKE, use distributed tables.
Consider using ROUND_ROBIN tables for tables that cannot be replicated, have no consistent join column, or
have very low cardinality (uniqueness) across all columns.
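For example (hypothetical tables, assumed to fit the size guidance above):
-- Small, rarely updated dimension (< 2 GB compressed): replicate to every compute node.
CREATE TABLE dbo.DimStore
(
    StoreKey  int          NOT NULL,
    StoreName nvarchar(50) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Table with no consistent join column: leave it round robin.
CREATE TABLE Staging.RawEvents
(
    EventPayload nvarchar(4000) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);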
Indexing
Clustered columnstore should be your default option for most table types. Clustered columnstore is a highly
efficient compressed storage format that reduces IO greatly during query time.
Clustered indexes enforce a physical row sorting of your data on disk. Consider this type of index only on
relatively small tables (< 60 million rows) such as (replicated) dimension tables that benefit from range filtering
such as on date columns.
Sizing Partitions
While partitioning can be used to improve performance in some scenarios, creating a table with too many
partitions can hurt performance in others. These concerns are especially true for clustered columnstore tables.
For partitioning to be helpful, it is important to understand when to use partitioning and the number of
partitions to create. There is no hard-and-fast rule as to how many partitions are too many; it depends on your
data and how many partitions you are loading simultaneously. A successful partitioning scheme usually has
tens to hundreds of partitions, not thousands.
When creating partitions on clustered columnstore tables, it is important to consider how many rows belong
to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1
million rows per distribution and partition is needed. Before partitions are created, SQL Data Warehouse
already divides each table into 60 distributions. Any partitioning added to a table is in addition to the
distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly
partitions, and given that SQL Data Warehouse has 60 distributions, then the sales fact table should contain 60
million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the
recommended minimum number of rows per partition, consider using fewer partitions in order to increase the
number of rows per partition. The Indexing article on Microsoft docs includes queries that can assess the
quality of cluster columnstore indexes.
Data Validation
Once the performance tuning of the schema has been completed, it is important to check that all the data has
been transferred across.
You should implement a robust testing plan which includes comparing row counts and data types, and checking
that queries run against the source and against Azure SQL Data Warehouse return the same results.
A full testing plan is outside the scope of this whitepaper but consider defining a repeatable process.
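For example, a simple starting point is a generated row-count comparison run on both systems (the schema and table names below are placeholders):
-- Run the equivalent counts on Teradata and on Azure SQL Data Warehouse
-- and compare the results table by table.
SELECT 'EDW.FactSales' AS table_name, COUNT_BIG(*) AS row_count FROM EDW.FactSales
UNION ALL
SELECT 'EDW.DimStore', COUNT_BIG(*) FROM EDW.DimStore;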
Microsoft has published a whitepaper called Performance Tuning of Tabular Models in SQL Server 2012 Analysis
Services. This will help you optimise and performance tune your model.
Optimise the processing of Azure Analysis Services by using partitions in your model that align with the
partitions in Azure SQL Data Warehouse. Large datasets normally require table partitioning to accelerate and
optimize the data-load process. Azure Analysis Services partitioning enables incremental loads, increases
parallelization, and reduces memory consumption. See Automated Partition Management with Azure Analysis
Services and Azure Analysis Service Partitions.
Reporting
After creating your Azure Analysis Services model you should recreate your reports. For a Proof of Concept,
choose no more than five reports and decide if you will keep your existing reporting technology or adopt a
new solution.
If the reports are complex and are going to take time to transform, then Datometry is a great solution for
being able to lift and shift your reporting solution whilst you evaluate your path.
If you are moving to a Microsoft reporting solution such as Power BI it can still make sense to repoint your
current reporting solution, via Datometry, whilst you redevelop your reports. This is a much quicker way to get
users working with your modern data warehouse.
Using the Search function, in the top right corner, search the Marketplace for Teradata Database and click
the selection as it appears in the drop-down list.
Before continuing check the price plans. For the purpose of this document we will be creating a
Development version of Teradata Database which, as at 4th July 2018, is provided free (All you pay for is
the Azure compute and storage). Teradata is licensed for use on specific Azure VMs. For latest information
on supported software and VMs, visit www.teradata.com/azure/products/.
Click Get Now and fill in the details on the form, click continue, and this will then take you through to
provisioning Teradata for Azure.
On the Teradata database blade, fill out the DBC password (this is the password you will use to connect to
Teradata).
a. Choose your Purchase method. For our lab we selected Hourly.
b. Choose your Database version and Tier. As most migrated systems are on 15.10 we chose this for our
lab, and the Development tier, as we are only creating a schema.
c. Leave the VM size at Standard D15 v2
d. Change the number of nodes from 2 to 1.
Click OK
On the General Settings tab click onto Subnets and choose the three subnets you wish to use. Make sure you
allocate the subnet you wish to communicate on to the VM subnet, as you won’t be able to communicate
over the other two networks.
Click OK.
Click OK
On the Summary page you will see the information which will be passed to the ARM template.
Review the summary information
Click OK.
Important: once the Teradata Database VM has been provisioned, do not shut this VM down
until you have finished the schema conversion, or it will fail to come back online.
In the Azure Portal go to Virtual Machines and click on the Teradata Database VM you created.
You will only add a port rule to the first network card, which usually ends with “-NIC00”.
Click Add Inbound Port Rule.
Create Jumpbox VM
We will use this to connect to the Teradata Azure VM and load some additional software onto this machine to
assist with the schema migration.
Using the Search function, in the top right corner, search the Marketplace for Data Science Virtual Machine
- Windows 2012 and click the selection as it appears in the drop-down list.
Before continuing check the price plans. For the purposes of this document we will be creating a DS3_v2.
Under choose virtual machine size type DS3_v2 into the search box.
Click on the VM size
Click Select
Under Settings, check that the disk type is HDD and that managed disks are being used.
Click on Virtual Network and select the network which you previously used for the Teradata Database VM
to connect to.
Click on Subnet and select the subnet which you allocated to the Teradata Database VM connection.
To allow public access, leave the Public IP address setting as it is. Otherwise, select Public IP address and then
select None.
Accept the other defaults and Click OK
Configure Jumpbox VM
The jumpbox needs to be configured with software to connect to the Teradata database, to create your
schema, and to have the migration software installed which will take your Teradata schema across to SQL Data
Warehouse.
Remote Desktop to the Azure Data Science VM you created in the previous step.
Download the Teradata Tools and Utilities v15.xx from
http://downloads.teradata.com/download/tools/teradata-tools-and-utilities-windows-installation-
package.
Extract the ZIP file to C:\InstallFiles\TeradataTools
Go to C:\InstallFiles\TeradataTools\TTUExpress
Launch TTU_BASE to install core Teradata Components
Download the Attunity Replicate for Microsoft Migrations User and Setup Guide from
https://attunity.jiveon.com/servlet/JiveServlet/downloadBody/1013-102-5-
1048/AttunityReplicateMSM_5_5_0_User_Guide.pdf
Open PDF and follow installation instructions from section three to four.
Choose “Install local Replicate for Microsoft Migrations Server”.
Download Java SE Runtime Environment 8 and install from
http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html
Right-click on My Computer, and then select Properties.
Click on Advanced system settings.
Click on Environment Variables.
Download, Open and Install the Teradata Developer tools from Visual Studio from
https://marketplace.visualstudio.com/items?itemName=MikeDempsey.TeradataDeveloperTools
Click Server and create a new server (make a note of the login, password and server name).
Click Select.
Click Performance level and configure for Gen1 DWU100 – you can scale this up later when you need to
load data.
Click Apply.
Click Create.
Remote Desktop to the Azure Data Science VM you created in the previous step.
Open SQL Server Management Studio.
In Object Explorer, expand Databases. Then expand System databases and master to view the objects in
the master database. Expand the Azure SQL Data Warehouse Database you created to view the objects in
your new database.
To create the user for loading data, in SSMS right click master to show a drop-down menu and choose New
Query.
In the query window, enter these T-SQL commands to create a login and user named LoaderRC60,
substituting your own password for 'a123STRONGpassword!'.
a. CREATE LOGIN LoaderRC60 WITH PASSWORD = 'a123STRONGpassword!';
b. CREATE USER LoaderRC60 FOR LOGIN LoaderRC60;
Click Execute.
Right-click your Azure SQL Data Warehouse Database and choose New Query. A new query Window opens.
Enter the following T-SQL commands to create a database user named LoaderRC60 for the LoaderRC60
login. The second line grants the new user CONTROL permissions on the new data warehouse. These
permissions are similar to making the user the owner of the database. The third line adds the new user as a
member of the 'xlargerc' resource class.
a. CREATE USER LoaderRC60 FOR LOGIN LoaderRC60;
b. GRANT CONTROL ON DATABASE::[<YourSQLDWDB>] to LoaderRC60;
c. EXEC sp_addrolemember 'xlargerc', 'LoaderRC60';
Click Execute
To create the master key, in SSMS right-click your Azure SQL Data Warehouse database to show a drop-down
menu and choose New Query. In the query window, enter this T-SQL command:
CREATE MASTER KEY;
Click Execute
Close all applications and log off.
In the new query window, type the statements to recreate your tablespace(s) on Teradata
Before executing you need to connect to the Azure Teradata instance. From menu select Teradata and go
to Query Tool > Connect
Click OK
From menu select Teradata and go to Query Tool >Execute
Open a web browser and go to the Attunity Replicate for Microsoft Migrations Console at http://<computer name>/attunityreplicatemsm.
Add a Teradata Migration Task to the Attunity Replicate for Microsoft Migrations Console.
After opening the console, the next step is to add a migration task.
Make sure that you are in TASKS view. The word "TASKS" should be displayed in the upper left corner of
the Attunity Replicate for Microsoft Migrations Console.
In the Attunity Replicate for Microsoft Migrations console, click Manage Endpoint Connections to open the
Manage Endpoint Connections dialog box.
Click New Endpoint Connection.
In the Name field, type a name for your Teradata database. This can be any name that will help to identify
the database being used.
In the Description field, type a description that helps to identify the Teradata database. This is optional.
Select SOURCE as the database role.
In the Attunity Replicate for Microsoft Migrations Console, click Manage Endpoint Connections to open the
Manage Endpoints Connections dialog box.
In the Manage Endpoint Connections dialog box, click New Endpoint Connection.
In the Name field, type a name for your SQL Data Warehouse endpoint. This can be any
name that will help to identify your SQL Data Warehouse database.
In the Description field, type a description that helps to identify the SQL Data Warehouse target database.
This is optional.
Select TARGET as the role.
Select SQL Data Warehouse as the Type.
Enter the following SQL Data Warehouse information:
Server name: Specify the name of the SQL Data Warehouse server you created earlier
a. Port: Specify port 1433
b. User name: LoaderRC60
c. Password: Specify the password for the LoaderRC60 user you created earlier.
d. Database name: Specify the target database name.
e. Azure Blob Storage Access: During a migration task, SQL Data Warehouse authenticates itself to Azure
Blob Storage using an SQL Server Credential. You can either configure Replicate for Microsoft
Migrations to create the Credential automatically during runtime (the default) or use an existing
Credential.
f. Select Automatically create SQL Server Credential
g. Enter the following Microsoft Azure Blob Storage information. You may need to click the Microsoft
Azure Blob Storage header to see the information.
h. Account name: Specify the name of the Azure Blob Storage account that you created earlier.
i. Container name: Specify the name of the Azure Blob Storage container which you created earlier.
j. Account key: Specify the key for your Azure Blob Storage account.
k. Folder: Specify the name of the Azure Blob Storage folder to which you want the files copied.
In the Advanced tab, set the following properties: Max file size (MB): Change the file size to 2000MB.
The Endpoints pane is in the left of the console. Replicate for Microsoft Migrations supports several endpoint
types (database, Hadoop, data warehouse, file, NoSQL) with each endpoint type having its own icon. As both
of our endpoints are databases, the following icons will be displayed:
Add the Teradata source endpoint using any of the following methods:
− Drag the endpoint to the Drop source endpoint here area in the endpoints diagram to the right of the
Endpoints pane.
− Hover your mouse cursor over the endpoint name and click the arrow that appears to its right.
Add the Microsoft SQL Data Warehouse Database target endpoint using any of the following methods:
− Drag the endpoint to the Drop target endpoint here area in the endpoints diagram to the right of the
Endpoints pane.
− Hover your mouse cursor over the endpoint name and click the arrow that appears to its right.
The endpoint diagram should now appear like this:
In the right pane of the Attunity Replicate for Microsoft Migrations Console, click the
Table Selection button.
The Select Tables dialog box opens.
In the Select Tables dialog box, do the following:
From the Schema list, select the schema containing your source tables, and then click Search.
From the Table List, either:
− Hold down the [Ctrl] button and select the tables you want to migrate. Then, click the > button to
move the tables to the Selected Tables list.
− Click the >>> button to move all the tables to the Selected Tables list.
Click OK.
On the main toolbar, click Save. The task is now ready to run.
The Starting task dialog is displayed, and the console automatically switches to Monitor view:
− Full Load tab: Uses status bars to provide a visual indication of the loading status during the full
load operation.
− Change Processing tab: Shows any changes that occur after the full load completes. Since this is a
Full Load-only task, this tab is not relevant.
Click the Completed status bar. Replicate for Microsoft Migrations displays a table below the graphs with
statistics related to the migrated tables.
Click the Total Completion bar to see a summary of the task.
Remote Desktop to the Azure Data Science VM you created in the previous step.
Open SQL Server Management Studio
In the Connect to Server dialog box, enter the following:
a. Server type: Database Engine
b. Server name: The fully qualified server name
c. Authentication: SQL Server Authentication
d. Login: The server admin account (the account you specified when you created the server)
e. Password: The password for your server admin account
Click Connect. The Object Explorer window opens in SSMS.
In Object Explorer, expand Databases. Expand the Azure SQL Data Warehouse database you created, and then
expand Tables to view the objects in your new database.
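If you prefer to verify the migrated tables from PowerShell rather than SSMS, the following is a minimal sketch
using Invoke-Sqlcmd from the SqlServer module; the server, database, and login values are placeholders, not the
ones used elsewhere in this document:
# List the schemas and tables in the target data warehouse (substitute your own values)
Invoke-Sqlcmd -ServerInstance "yourserver.database.windows.net" `
    -Database "YourTargetDatabase" `
    -Username "serveradmin" -Password "<your password>" `
    -Query "SELECT s.name AS [schema], t.name AS [table] FROM sys.tables t JOIN sys.schemas s ON t.schema_id = s.schema_id ORDER BY 1, 2;"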
Using PowerShell with the Azure PowerShell module 3.6 or later, open ".\02-ADF\01-Pre-Requisites.ps1" and
step through the script.
Sign in to Azure
Sign in to your Azure subscription with the Connect-AzureRmAccount command and follow the on-screen
directions.
Connect-AzureRmAccount
If you don't know which location you want to use, you can list the available locations. After the list is displayed,
find the one you want to use; this example uses eastus. Store the location in a variable so you can change it in
one place (see the sketch below).
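The location commands themselves are not shown here; the following is a minimal sketch, assuming the AzureRM
cmdlets used elsewhere in this section and the example region named above:
# List the available Azure regions, then store the chosen region in a variable
Get-AzureRmLocation | Select-Object Location
$location = "eastus"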
$resourceGroup = "myResourceGroup"
New-AzureRmResourceGroup -Name $resourceGroup -Location $location
Create a storage account called mystorageaccount with locally redundant storage (LRS) and blob encryption
(enabled by default), and capture its context for the commands that follow, as sketched below.
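The storage account command is not shown here; the following is a minimal sketch, assuming the AzureRM.Storage
cmdlets and an illustrative account name and SKU that match the description above:
# Create the storage account and keep the returned object so its context can be used below
$storageAccount = New-AzureRmStorageAccount -ResourceGroupName $resourceGroup `
    -Name "mystorageaccount" `
    -Location $location `
    -SkuName Standard_LRS `
    -Kind StorageV2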
$ctx = $storageAccount.Context
Create a container
Blobs are always uploaded into a container. You can organize groups of blobs in containers, much as you
organize the files on your computer into folders.
Set the container name, then create the container using New-AzureStorageContainer, setting the permissions
to 'blob' to allow public access to the files. The container name in this example is import.
$containerName = "import"
New-AzureStorageContainer -Name $containerName -Context $ctx -Permission blob
In Object Explorer, expand Databases. Right-click the database you created earlier and select New Query.
In the query window, enter the following T-SQL commands.
Click Execute.
Close all applications and log off.
Start PowerShell on your machine, and keep it open through completion of this section. If you close and
reopen it, you'll need to run these commands again.
a. Run the following command, and enter the user name and password that you use to sign in to the
Azure portal:
Connect-AzureRmAccount
b. Run the following command to view all the subscriptions for this account:
Get-AzureRmSubscription
c. Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:
Select-AzureRmSubscription -SubscriptionId "<SubscriptionId>"
When you create the data factory, note the following:
a. The name of the Azure data factory must be globally unique. If you receive an error that the name is not
available, change the name and try again.
b. To create Data Factory instances, you must be a Contributor or Administrator of the Azure subscription.
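The creation commands are run from the accompanying scripts and are not reproduced here; the following is a
minimal sketch of what they presumably call, using the AzureRM.DataFactoryV2 cmdlets and the illustrative
names that appear in the output below:
# Create the data factory (the name must be globally unique)
$resourceGroupName = "ADFTutorialResourceGroup"
$dataFactoryName = "onpremdf0914"
$dataFactory = Set-AzureRmDataFactoryV2 -ResourceGroupName $resourceGroupName `
    -Name $dataFactoryName -Location "East US"
# Create the self-hosted integration runtime definition in the data factory
$integrationRuntimeName = "myonpremirsp0914"
Set-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name $integrationRuntimeName `
    -Type SelfHosted `
    -Description "selfhosted IR description"
Creating the self-hosted integration runtime returns output similar to the following: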
Id : /subscriptions/<subscription ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factories/onpremdf0914/integrationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description : selfhosted IR description
To retrieve the status of the created integration runtime, run the .\02-ADF\05bSelfHostedRuntime.ps1 script.
Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net,
*.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration
To retrieve the authentication keys for registering the self-hosted integration runtime with the Data
Factory service in the cloud, run the following command. Copy one of the keys (excluding the quotation
marks) for registering the self-hosted integration runtime that you install on your machine in the next step.
Get-AzureRmDataFactoryV2IntegrationRuntimeKey `
-Name $integrationRuntimeName `
-DataFactoryName $dataFactoryName.DataFactoryName `
-ResourceGroupName $resourceGroupName.ResourceGroupName | `
ConvertTo-Json
{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
Install the self-hosted integration runtime on your machine and register it with the authentication key you
copied. When the self-hosted integration runtime is registered successfully, the following message is displayed:
In the Register Integration Runtime (Self-hosted) window, select Launch Configuration Manager.
When the node is connected to the cloud service, the following message is displayed:
In this step, you link your on-premises Teradata instance to the data factory.
Open the JSON file named TeradataLinkedService.json in the C:\ADF\Connectors folder and edit it as follows
(a scripted alternative is sketched after these steps):
a. Replace <integration runtime name> with the name of your integration runtime.
b. Before you save the file, replace <servername>, <username>, and <password> with the values of your
Teradata instance.
c. If you need to use a backslash (\) in your user account or server name, precede it with the escape
character (\). For example, use mydomain\\myuser.
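If you prefer to script the placeholder replacement rather than edit the file by hand, the following is a minimal
sketch; the file path matches the step above, while the replacement values are purely illustrative:
# Replace the placeholders in TeradataLinkedService.json (values shown are examples only)
$path = "C:\ADF\Connectors\TeradataLinkedService.json"
(Get-Content $path -Raw) `
    -replace '<integration runtime name>', 'myonpremirsp0914' `
    -replace '<servername>', 'td-prod-01' `
    -replace '<username>', 'mydomain\\myuser' `
    -replace '<password>', '<your Teradata password>' |
    Set-Content $path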
In this step, you link the local file share that will hold the exported files to the data factory.
Open the JSON file named FileLinkedService.json in the C:\ADF\Connectors folder and edit it as follows:
a. Replace <server> and <ShareName> with the values of your file path, either a file share or a local path such as x:\ExportFiles.
b. Before you save the file, replace <Domain>, <UserName>, and <password> with the values of user
which has permission to write files to the local storage.
c. If you need to use a backslash (\) in your user account or server name, precede it with the escape
character (\). For example, use mydomain\\myuser.
Open the JSON file named AzureStorageLinkedService.json in the C:\ADF\Connectors folder and edit it as
follows:
a. Replace <accountName> and <accountKey> with the name and key of your Azure storage account before
saving the file.
Open the JSON file for the Azure SQL Data Warehouse linked service in the C:\ADF\Connectors folder and edit
it as follows:
a. Replace <servername>, <databasename>, <username> and <password> with the values of your Azure SQL
Data Warehouse database before saving the file. Use the user with the largerc resource class (LoaderRC60)
that you created earlier.
Run the .\02-ADF\06-CreateLinkedServers.ps1 script to create all of the linked services (see the sketch after this list):
a. Select the Azure Resource Group you created or used in the previous step.
b. Select the Azure Data Factory you created in the previous step.
c. Select the Self Hosted Integration Runtime created in the previous step.
The file share should be on the local virtual machine, as opposed to pushing out file copies from the VM to
another location.
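The 06-CreateLinkedServers.ps1 script is not reproduced here; presumably it deploys each of the edited JSON
definitions with Set-AzureRmDataFactoryV2LinkedService. The following is a minimal sketch, reusing the
illustrative resource group and data factory names from the earlier sketch:
# Deploy every edited linked service definition in the Connectors folder to the data factory
Get-ChildItem "C:\ADF\Connectors" -Filter *.json | ForEach-Object {
    Set-AzureRmDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName `
        -DataFactoryName $dataFactoryName `
        -Name $_.BaseName `
        -DefinitionFile $_.FullName
}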
Create Datasets
Run the .\02-ADF\07-CreateDatasets.ps1 script to create all of the datasets (see the sketch after this list):
a. Select the Azure Resource Group you created or used in the previous step.
b. Select the Azure Data Factory you created in the previous step.
This will create all the datasets required for the four pipelines to use.
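As with the linked services, the dataset script presumably deploys the JSON definitions with
Set-AzureRmDataFactoryV2Dataset. The following is a minimal sketch, reusing the illustrative variables from the
earlier sketches:
# Deploy the dataset definitions (SourceDW, TDDataset, FileORC, and so on) to the data factory
Get-ChildItem "C:\ADFv2\02-ADF\DataSets" -Filter *.json | ForEach-Object {
    Set-AzureRmDataFactoryV2Dataset -ResourceGroupName $resourceGroupName `
        -DataFactoryName $dataFactoryName `
        -Name $_.BaseName `
        -DefinitionFile $_.FullName
}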
Dataset Notes
In SourceDW.json, in the C:\ADFv2\02-ADF\DataSets folder, "tableName" is a placeholder; the pipeline uses the
SQL query in the copy activity to retrieve the data instead. This can be checked or changed in the SourceDW
dataset in the Azure Data Factory user interface.
In TDDataset.json, in the C:\ADFv2\02-ADF\DataSets folder, "tableName" is likewise a placeholder; the pipeline
uses the SQL query in the copy activity to retrieve the data instead. This can be checked or changed in the
TDDataset dataset in the Azure Data Factory user interface.
In FileORC.json, in the C:\ADFv2\02-ADF\DataSets folder, "FileDirectory" and "FileTableName" are defined as
parameters. The copy activity that references this dataset passes the actual directory and file name into the
dataset. This can be checked or changed in the FileORC dataset in the Azure Data Factory user interface.
These parameters are referenced dynamically in the file path, and the file format is set to ORC.
Create Pipelines
Run the .\02-ADF\08-CreatePipelines.ps1 script to create all of the pipelines (see the sketch after this list):
a. Select the Azure Resource Group you created or used in the previous step.
b. Select the Azure Data Factory you created in the previous step.
This will create all four pipelines used for the migration (GetTableListAndTriggerCopyTDToCSV,
IterateAndCopyTDTablesToCSV, GetTableListAndTriggerCopyBlobtoSQLDW and IterateAndCopyBlobToDW).
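The pipeline script presumably deploys the four pipeline JSON definitions with Set-AzureRmDataFactoryV2Pipeline.
The following is a minimal sketch; the Pipelines folder path is an assumption and may differ from the actual layout:
# Deploy the pipeline definitions to the data factory
Get-ChildItem "C:\ADFv2\02-ADF\Pipelines" -Filter *.json | ForEach-Object {
    Set-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
        -DataFactoryName $dataFactoryName `
        -Name $_.BaseName `
        -DefinitionFile $_.FullName
}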
Pipeline Notes
Teradata to Local Storage
The GetTableListAndTriggerCopyTDToCSV pipeline looks up the Azure SQL Data Warehouse system table to get
the list of tables to be copied from Teradata to the local storage. This can be checked or changed in the Azure
Data Factory User Interface GetTableListAndTriggerCopyTDToCSV pipeline by selecting the LookupTableList
activity.
Selecting the TDTableToCSVFilePipeline activity and its settings will show how we pass the output of
LookupTableList to the IterateAndCopyTDTablesToCSV pipeline.
The output of the lookup activity is passed into an array parameter, which is used by the ForEach loop.
By selecting the ForEach activity, you will see that this array is passed to another parameter associated with
the activity.
By editing the activities for IterateAndCopyTDTablesToCSV and selecting CopyTDtoCSV, you can investigate how
Teradata data is copied to local storage using a dynamic query.
The local storage is passed a directory and file name based on the schema and table name. This example is a
basic extract process; you know your data best, and we highly recommend that you modify this query to partition
the data export based on the advanced metadata approach described above.
Blob Storage to Azure SQL Data Warehouse
The GetTableListAndTriggerCopyBlobtoSQLDW pipeline looks up the Azure SQL Data Warehouse system table
to get the list of tables to be copied from Blob storage into Azure SQL Data Warehouse. This can be checked or
changed in the Azure Data Factory user interface in the GetTableListAndTriggerCopyBlobtoSQLDW pipeline by
selecting the LookupTableList activity.
Selecting the BlobtoSQLDW activity and its settings will show how we pass the output of LookupTableList to the
IterateAndCopyBlobToDW pipeline.
The output of the lookup activity is passed into an array parameter, which is used by the ForEach loop.
By selecting the ForEach activity, you will see that this array is passed to another parameter associated with
the activity.
By editing the activities for IterateAndCopyBlobToDW and selecting CopyCSVtoSQLDW, you can investigate how
the blob storage is passed a directory and file name based on the schema and table name.
Azure SQL Data Warehouse is set up as the sink; it makes use of PolyBase, parameterizes the destination, and
truncates the table before loading. In an advanced metadata approach, you would remove the pre-copy script
and make use of Azure SQL Data Warehouse snapshot backups, which are retained for seven days.
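This section does not show how the pipelines are started; the following is a minimal sketch of triggering the two
top-level pipelines and checking a run, using the AzureRM.DataFactoryV2 cmdlets and the illustrative variables
from the earlier sketches:
# Trigger the Teradata-to-CSV extract and check its status
$runId = Invoke-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -PipelineName "GetTableListAndTriggerCopyTDToCSV"
Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -PipelineRunId $runId
# When the extract completes, trigger the load from Blob storage into Azure SQL Data Warehouse
Invoke-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -PipelineName "GetTableListAndTriggerCopyBlobtoSQLDW"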
References
▪ Why Move to SQL Data Warehouse?
https://datometry.com/solutions/replatforming/migrate-teradata-to-azure
▪ Migrate your solution to Azure SQL Data Warehouse
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-migrate
▪ Migrating data to Azure SQL Data Warehouse in practice
https://blogs.msdn.microsoft.com/sqlcat/2016/08/18/migrating-data-to-azure-sql-data-warehouse-in-practice/
Note: For additional information about migrating various source databases to Azure, see the Azure
Database Migration Guide.