Best Practices
Migrating Data Warehouses to the Cloud
By Wayne W. Eckerson
May 2018
Research sponsored by Informatica
The cloud has reached an inflection point. For the past 10 years, enterprises have dabbled in the cloud.
Many have purchased multi-tenant cloud applications (SaaS) to support various functions such as
payroll, sales, human resources, and business intelligence. Some have offloaded data processing
functions to virtual servers in the cloud (IaaS) or shifted development and test environments to
cloud-based platforms (PaaS).
Data gravity. Today, the gravity of data has shifted from on-premises data centers to public cloud
platforms and services. Executives who were once fearful about the security, latency, and reliability of
the cloud now want to reap its benefits—fast provisioning, elasticity, infinite scalability, pay-as-you-go
pricing, reduced infrastructure and IT costs, seamless upgrades, and rapid technology innovation. Many
have now decided it’s time to shut down corporate data centers and shift all computing to the cloud.
“The days of maintaining physical servers and racks in your own data center are a thing of the past,” says
Joe Caserta, president of Caserta Consulting, a technology consulting and implementation firm
specializing in data analytics. In a recent Eckerson Group podcast, Caserta said that five years ago his
firm built on-premises solutions exclusively. Three years ago, half the implementations were in the
cloud, and this year, “100% of our projects are either migrating legacy systems to the cloud or building
something from scratch there.”
According to the IDC Worldwide Quarterly Cloud IT Infrastructure Tracker, IT infrastructure spending will split almost evenly between cloud and traditional data centers by 2019. In the data analytics world, the number of
companies using the cloud for data warehousing or business intelligence (BI) has increased nearly 50%
since 2013, according to a 2017 survey by Eckerson Group and the Business Application Research Center
(BARC).
Benefits. For data warehousing professionals, the cloud is a huge opportunity. The cloud promises to
simplify and accelerate data warehousing development, increase scalability and flexibility, and reduce
IT costs and total cost of ownership.¹ These benefits have many data warehousing professionals flocking
to implement new cloud-based data warehousing services as well as cloud-based data integration,
business intelligence, and analytics products.
¹ For the ROI of cloud data warehouses running on Microsoft Azure, see “Total Economic Impact of the Modern Data …”
But designing cloud solutions is different from designing on-premises ones. What works on-premises
doesn’t necessarily work in the cloud. Most organizations need to rethink their approach and be ready to
learn new ways of doing things. “What I learned is, don’t build in the cloud, build for the cloud,” said
Lenin Gali, vice president of engineering at Quotient Technology, during an Eckerson Group podcast on
cloud computing.
Knowledge needed. To succeed in the cloud, data warehousing professionals need to educate
themselves about cloud architectures, elasticity, data lakes, scale-out architectures, serverless
computing, multi-tenancy, and cloud-based pricing. They have to understand the impact of the cloud on
data movement, ETL, backups, disaster recovery, security, privacy, and operations.
Data warehousing professionals also need to evaluate and compare the myriad of services offered by
public cloud providers that are launching new and improved services at a rapid clip to secure market
share in the booming cloud economy. The top three public cloud providers—Microsoft, Amazon, and
Google—continually leapfrog each other with functionality.
Migration Models
Build or Migrate?
To implement a cloud data warehouse, organizations first need a strategy. That means they have to decide whether to migrate an existing data warehouse, build one in the cloud from scratch, or pursue a hybrid of the two. There are pros and cons to each approach. (See table 1.)
Table 1. Options for implementing a cloud data warehouse

Build > Build New
  What: New data warehouse in the cloud for a business unit
  Pros: Design and optimize for the cloud
  Cons: Maintain existing DW during construction with limited staff

Migrate > Lift and Shift
  What: Move existing DW to a cloud virtual machine without change
  Pros: Clean, quick cutover; no hybrid complexity; cut data center costs
  Cons: Sub-optimized for cloud; may not reap full benefits of cloud

Migrate > Use-Case-Driven
  What: Rebuild select analytic applications in the cloud
  Pros: Focus on business value; optimize for cloud; learn as you go
  Cons: Maintain and sync dual DWs; maintain two teams
Build
Building from scratch gives an organization time to learn about the cloud and how to optimize a data warehouse for the new environment. It also frees the organization from legacy constraints, allowing it to build new applications and data pipelines in the most efficient manner. But it adds complexity and cost, since the organization will need to maintain two data warehouses and two teams.
1. On demand. Here, organizations leave their data warehouse as is, but gain experience with the
cloud by implementing short-lived data sets and applications there. For instance, organizations
might spin up virtual servers in the cloud and replicate data there to support data scientists who need to create a predictive model (see the sketch after this list). Or they might replicate a production environment in the cloud
to support developers who want to develop and test new data-centric applications or auditors who
need to review a company’s books and processing history. They might also move their BI
environment to the cloud and tunnel through corporate firewalls to query an on-premises data
warehouse.
2. Build new. Another option is to build a new enterprise data warehouse in the cloud and do a
cutover from the legacy system. This gives an organization a clean slate to design for the cloud
without worrying about migrating or maintaining a legacy environment. They can build data
schema, ETL programs, data pipelines, and metadata to support current requirements and not
have to worry about integrating with legacy code. However, there are risks to this approach. It
requires a separate team to build the new data warehouse, and there is no guarantee that the
cutover will work smoothly, especially if there are dozens or hundreds of reports that need to be
rewritten or migrated. Although most companies eschew this approach at first, some have no
choice once they evaluate the condition of their existing data warehouse and discover that it is
easier and quicker in the long run to rebuild the entire data warehouse than to do so piecemeal.
Migrate
Migrating an existing data warehouse enables an organization to leverage the benefits of the cloud.
However, the data warehouse design may not be optimized for the cloud and hence not deliver the
value, agility, and cost efficiencies the organization seeks.
1. Lift and shift. This is the quickest way to migrate a legacy data warehouse to the cloud. The team
simply moves the data warehouse, including its database, schema, and metadata to the cloud and
re-points its ETL and BI tools to the new cloud environment. Of course, it’s never as simple as that.
Use of a data catalog can make it easier to identify endpoints and lineage, simplifying the
migration. The benefit of this approach is that it eliminates the need to support dual environments
and teams. But the downside is that the new data warehouse may not be optimized for the cloud,
and the team may not have the experience to operate it in an efficient, cost-effective manner. This
could cause a backlash among users and sponsors. Consequently, this approach is risky.
2. Incremental. Instead of migrating the entire data warehouse at once, the team migrates the portions that deliver the highest business value or that meet the needs of a business unit willing to try the cloud in exchange for a faster, lower-cost deployment. This use-case-driven approach gives the team the opportunity to redesign each application to meet business requirements and optimize it for the cloud. It also allows the team to learn about the cloud in a measured way with minimal risk. However, this option is complex. Data integration programs will need to be rewritten, pointed at two locations, and kept in sync (see the sketch after this list), or better yet, the tables can be replicated so no recoding is required. BI tools will need to virtualize queries across two data warehouses, and metadata will need to be collected from both environments and mapped together. The more distinct and separate the use cases, the better, to minimize complexity.
Selecting Approaches
Some companies may prefer a “lift-and-shift” approach to expedite migration to the cloud and avoid
supporting dual or hybrid environments. However, after they conduct a thorough assessment of their
current environment, they may realize they can save time and money by redesigning existing schema,
data flows, transforms, and reports rather than simply porting an existing legacy environment to the
cloud.
For example, one company abandoned its lift-and-shift strategy in favor of a wholesale redesign once it discovered that business people were using only 27 of its 1,000 reports and that most of its ETL jobs were modified copies of existing jobs, a pattern that had created a brittle, hard-to-change legacy data environment. With the
help of an experienced systems integrator, the company built all-new data pipelines with new
cloud-based services, ditching its legacy ETL tools and databases in the process. Although the company
was nervous about redesigning its entire environment and learning all new tools, it finished its migration
early and under budget in less than 18 months.
Hybrid Environments
Although many business executives like the idea of a quick, clean cutover, most organizations find they
need to support dual data warehousing environments for some period of time. This may be for security
reasons, since some data can’t run outside corporate firewalls, or to minimize risks involved in cutting
over a large, mission-critical data warehousing environment. Other companies may not find a suitable
way to cut through the Gordian knot of complex ETL rules and programs supporting mission-critical
applications.
Consequently, many companies need to support a hybrid environment in which data is spread across
two data warehousing environments. The challenge here is keeping data in sync, especially if ETL
programs and data warehousing schema are modified for the cloud. This requires a robust data
management and integration platform that supports complex mapping, replication, mastering, and
synchronization programs. These tools need to source data from both on-premises systems and cloud
systems and map, clean, transform, and load the data to the appropriate data warehouse, whether
on-premises or in the cloud.
Migration Strategy
Before embarking on any cloud data warehousing initiative, organizations should first develop a
strategy. Organizations need to define what they hope to achieve from the migration and then identify
the steps and services required to achieve that goal. Eckerson Group analyst Dave Wells has created a
diagram to map out a cloud data migration strategy. (See figure 1.)
Business case. The first step in the strategy is to create a business case. Why does the company want a
cloud data warehouse? What value does it aim to achieve for the investment of time and money? The
answer—whether greater agility, faster query performance, increased data capacity, or cost
savings—helps determine the best course of action and migration option.
Assessment. Organizations should then evaluate the condition of their existing data warehouse to
determine if it should be refined, redesigned, or left as is. In most cases, the data warehouse design is
insufficient to meet current business requirements, let alone future use cases such as data science and
the Internet of Things. Companies should invest in a data catalog tool to better understand the nature of
the data in their data warehouse to help perform this assessment.
Migration options. With the assessment results in hand, organizations then need to decide whether to
migrate the data warehouse as is (i.e., lift and shift), refine or update the design (i.e., incremental), or
redesign it entirely (i.e., build new). The actual rollout of the new or redesigned data warehouse can be done incrementally, by pain point, subject area, or user group. Many companies start an
incremental migration in a single department and tackle one subject area at a time; once finished, they
move on to a second department, and so on. (See “Best Practices” below.)
Technology. The next step is to select a cloud computing platform and a suite of data management tools
and services. The cloud platform could be a hosted solution from a traditional database vendor (e.g.,
Oracle, SAP, Teradata) or a public cloud platform that hosts its own and third-party solutions (e.g.,
Microsoft Azure). In some cases, the customer simply runs the data warehouse using cloud infrastructure
services (i.e., IaaS) in which the customer provisions and manages servers and processing power, while
in other cases, the cloud provider assumes responsibility for the infrastructure, allowing the customer to
focus entirely on the solution (i.e., platform as a service, PaaS).
The data management environment encompasses a host of tools and/or services, including data
integration, data discovery, data preparation, master data management, data lake and pipeline
management, data movement, data warehouse automation, data quality, and data security solutions.
Teams need to decide whether to use existing on-premises solutions or adopt new ones designed
exclusively for the cloud. If the team is building a new environment from scratch, then it might want to
invest in cloud-based data management tools. Many enterprises find they need data management
capabilities that span cloud and on-premises environments and enable them to move workloads and
data incrementally between the two with minimal disruption to the business.
Migrate and operationalize. The last step is the biggest. As with any technology project, it is wise to
define test and acceptance criteria at the beginning. Plan the testing, then execute the migration process
to move schema, ETL, data, metadata, users, and applications. Execute the test plan, then
operationalize the cloud data warehouse. The rollout plan, as mentioned above, should make new
capabilities available incrementally, usually one subject area and department at a time, until the entire
organization is migrated. An incremental rollout plan minimizes risk and maximizes rewards.
Best Practices
A majority of companies today have moved or are considering moving their data warehouse to the cloud
to reap the benefits of scalability, flexibility, and agility at potentially lower costs. Because cloud
computing is different from on-premises computing, organizations should learn from those who have
already made the switch. Before you turn over your company’s crown jewels (i.e., its data) to a cloud provider, consider these recommendations from organizations that have already implemented a cloud data warehouse.
1. Have a Strategy
As mentioned, it’s imperative to have a strategy to guide the initiative and keep it focused. Before leaping
into the cloud, teams should know their mission. Is it cost cutting or cost optimization? Or is it greater
flexibility and agility? It’s important to take the time to flesh out the strategy, which can avert problems
down the road.
A strategy is especially helpful when selecting cloud-based services. There are so many cloud services, it
can be overwhelming. “Before evaluating offerings in Azure, you need a strategy; otherwise, it’s
daunting,” says Gali. That’s not to say you can’t look around when you’re there. “If you’ve been using the
same architecture for the past ten years, there are probably things in Azure that leapfrog what you have,”
he says.
In a traditional environment, processing capacity is fixed, physical servers do the processing, and data is
isolated in a dedicated database. Data warehouses (or data marts) are designed dimensionally using
star schemas and loaded in batch with nightly or mini-batch increments. ETL tools are used to move,
transform, and then load data into a data warehouse.
This all changes in the cloud. The cloud offers infinite capacity and on-demand elasticity. A cloud data
warehouse supports multiple companies (multi-tenancy). Data modeling is focused more on business
semantics than database models, and stored procedures give way to serverless functions. And the list
goes on.
Operations. Data warehousing professionals also have to recalibrate how they manage a data
warehouse. The cloud (and each cloud provider) manages backups, disaster recovery, security, pricing,
and support in different ways. There is much to learn, and the learning must be continuous, since cloud
providers release new capabilities every week.
Public cloud providers, such as Microsoft, are recreating entire ecosystems (e.g., data warehousing) in
the cloud using modern technologies and approaches made possible by the unique characteristics of
cloud computing.
For instance, Microsoft has redesigned its cloud data warehousing service, Azure SQL Data Warehouse,
to exploit the virtualization capabilities of the cloud. The new product separates storage and processing,
providing greater flexibility and scalability to support multiple workloads and use cases. The service also
uses adaptive caching to accelerate queries and increase concurrency. This service was built
from the ground up as a cloud data warehouse and is Microsoft’s best practice guidance for building a
data warehouse on Azure.
Microsoft’s Azure environment also offers a raft of database, data lake, and data integration services
from Microsoft and third parties. Table 3 lists some of the more salient services for data warehousing
professionals. Each service has different performance characteristics and pricing, and some aren’t
available in all processing regions (i.e., cloud data centers). Crafting a cloud data warehousing
ecosystem takes considerable time and research to sort through the nuances. Most cloud platform
providers have in-house experts who help customers understand and evaluate all the options and select
a combination of services best suited to their needs.
Table 3. Selected Azure data services for data warehousing professionals

Cloud Databases > Azure SQL Database
  The cloud version of Microsoft’s venerable SQL Server database. Geared to transaction processing, but can run data warehouses up to 4 TB.

Bring Your Own Databases > Microsoft SQL Server
  On-premises relational database. Runs on an Azure VM. Good for lift-and-shift SQL Server DWs with more than 4 TB of data.

Bring Your Own Databases > SQL Server Stretch Database
  Stores warm and cold SQL Server 2016 data in Azure storage. Great for extending the life of SQL Server databases.

Data Migration Services > Azure Database Migration Service
  PaaS solution to migrate on-premises SQL Server to Azure SQL Database.
This is where a data catalog can help. A data catalog scans all tables and fields in a database, creating a
profile that tracks usage, lineage, and dependencies. Data stewards use the catalog to tag data and
apply security and privacy settings as needed. Business users use the catalog to search and discover
relevant data sets they are authorized to use; they can annotate the data, leaving comments and tags to
help others better understand the relevance, accuracy, quality, and nuances of data sets.
A data catalog is a critical tool in a cloud migration process, as it can help avoid potentially embarrassing
or costly exposures of sensitive information. Data warehousing teams must implement proper security
policies for a cloud data architecture, just as they would for an on-premises setup. A data catalog also
helps prioritize and schedule migrations, ensuring sources and dependencies are managed properly and
things don’t break once the new cloud system is rolled out.
During the assessment phase, data profiling tools can help companies better understand data by
identifying redundancies, missing values and fields, and interdependencies. During the migration
phase, companies can use data integration tools to move, clean, transform, and load data. These tools
are critical in a hybrid environment where data must be synchronized across environments. To simplify
matters, some companies use on-premises tools to manage on-premises data and cloud-based
integration tools to manage cloud-based data.
The team should also write scripts to auto-scale cloud services, since not all cloud platform providers support auto-scaling out of the box. “The biggest challenge going into the cloud was figuring out how to operate it,” says Gali.
8. Migrate Incrementally
Because processing data in the cloud is so different, it behooves data warehousing managers to take a
measured approach to migration. It’s important that they experiment with point or on-demand cloud solutions until they’re comfortable with the cloud’s services, pricing, and reliability.
When migrating a data warehouse, it’s best to move incrementally. Although a systematic migration
requires a hybrid environment that increases architectural complexity, it gives data warehousing
managers time to learn. It also means they can redesign legacy data warehouses and applications to
leverage the capabilities of the cloud and gain cost efficiencies. Many teams start with a single
department and identify key functions or applications they want to build or redesign. Migrating a data
warehouse subject-by-subject and department-by-department ensures a successful outcome that
minimizes downtime for users. This can take several months to a year.
IT adaptation. An incremental migration also gives the IT department time to adapt. Some IT managers
are threatened by the cloud and present numerous reasons why the cloud is a bad idea. (“It’s not secure
enough to meet our guidelines.” “It’s not reliable enough to run mission-critical processing.” “Data
latency and movement is too slow and costly.”) The reality is that the cloud opens new opportunities for
the IT department, replacing low-level technical work with more strategic, business-facing tasks that
can improve career prospects.
But to succeed in the cloud, data teams need to understand how to design and manage a data
warehouse in this new environment. First, they need to identify business and technical goals for shifting
computing to the cloud. Next, they need to evaluate their current environment, identify dependencies,
impacts, and requirements, and establish a migration plan that aligns with corporate needs and culture.
Finally, they need to select cloud services and products and establish operational processes and policies for governing security and costs.
Many organizations have already reaped the benefits of cloud data warehousing. With a well-designed
plan and ample services from cloud platform providers, organizations can modernize their data
environments and foster better, faster, more agile decision making.
Need help with your business analytics or data management and governance strategy?
Want to learn about the latest business analytics and big data tools and trends?
Check out Eckerson Group research and consulting services.
The move to cloud data warehousing requires an incremental, well-governed approach. To fully seize the
benefits of an Azure data architecture, you need a high-performance, scalable data integration and
management solution that provides native, automated connectivity to Azure data services. With flexible and
repeatable integration, you can simplify and accelerate the migration of on-premises enterprise data
workloads to Azure.
Successful cloud data warehousing projects share common data management characteristics.
• Catalog for data visibility: An intelligent data catalog helps you cut the time it takes to find the right
data to migrate. Its downstream dependency identification and lineage tracking also help you reduce
the risks of “lift and shift” initiatives. Informatica Enterprise Data Catalog accelerates migration through
its intelligent data discovery engine that catalogs and identifies all types of data and data relationships.
• Robust connectivity architecture: A data integration architecture should connect and automate all
the data sources and applications you need, regardless of their location, on-premises or in the cloud.
Informatica’s market-leading integration platform-as-a-service (iPaaS), Informatica Intelligent Cloud
Services, offers best-in-class self-service data integration, synchronization, management, and prebuilt
connectivity for Azure.
• Security and data governance discipline: It’s important to build the right level of security into the
architecture at every level, both within technology and at the process level. Implementing strong data
governance procedures becomes even more important when you move data off-premises into a cloud
environment.
Informatica, the leader in enterprise cloud data management, provides an AI-driven, microservices-based
Intelligent Data Platform with solutions that are purpose-built for Microsoft Azure. These solutions are
designed to accelerate the migration to Azure by automating your data integration development lifecycle,
including connectivity, development, deployment, and management.