A Cloud Data Platform For Data Science
WHITE PAPER
INTRODUCTION: IT’S ALL ABOUT THE DATA
Machine learning (ML) technologies have entered the mainstream. According to a 2019 TDWI survey on artificial intelligence (AI) and ML use, 92 percent of respondents reported using machine learning technology, and 85 percent said they are building predictive models using tools for ML.¹

Data scientists require massive amounts of data to build and train these machine learning models. In the age of AI, fast and accurate access to data has become an important competitive differentiator. Data management (discovering, securing access, cleaning, combining, and preparing the data for analysis) is commonly recognized as the most time-consuming aspect of the process.

An efficient data platform is paramount. According to Forbes, data scientists spend up to 80 percent of their time finding, retrieving, consolidating, cleaning, and preparing data for analysis and training.² The same study found that data scientists not only spend most of their time massaging rather than mining or modeling data, but that 76 percent of these highly skilled professionals view data preparation as the least enjoyable part of their work.³

This white paper will help you identify the data requirements driving today's data science and ML initiatives, and explain how you can satisfy those requirements with a platform that supports industry-leading tools from Snowflake and its partners.

THE MOST COMPLETE PLATFORM FOR DATA SCIENCE
Snowflake's platform combines the power of data warehousing, the flexibility of big data platforms, the elasticity of the cloud, and live data sharing at a fraction of the cost of traditional data platform solutions. Snowflake delivers the performance, concurrency, and simplicity needed to store and analyze all your data in one location, both for internal use and to create a data exchange. Thousands of customers are standardizing on this platform because it satisfies three essential needs:
• A single consolidated source for all data: Snowflake helps data scientists access structured and semi-structured data from one consistent source, making it easy to find, consolidate, clean, and use more of your organization’s data assets. Output from data science can seamlessly be incorporated back into Snowflake for access by business users.
• Efficient, high-speed data preparation: Snowflake provides efficient, dedicated virtual warehouses that can ingest, transform, and query data using SQL without impacting other users or departments. SQL in Snowflake is in many cases 10x more efficient at data preparation than other tools such as Spark, resulting in reduced latency between ML tasks.
• An extensive partner ecosystem: Snowflake has connectors to all the established and emerging data science technologies. This allows customers to choose the best data science tools for their needs, and all tools access a unified and consistent data platform. Snowflake seamlessly exports data to Amazon S3 and other blob stores for universal access by data science tools.
Figure 1: Snowflake Workloads. Snowflake enables many use cases and workloads in addition to data science, so your organization can leverage the power of a single platform for data, analytics, and predictive analytics.
KEY DATA SCIENCE CONCEPTS AND PERSONAS
Data scientists use machine learning technology to identify patterns, relationships, correlations, outcomes, and inferences in their data. These data-driven discoveries are incorporated into models that can detect fraud, predict maintenance cycles, mitigate customer churn, forecast sales, and automate many other forward-looking tasks. The key roles and personas in this process include the following:
• Data scientists build models and train them with data. They use notebooks such as Jupyter and Zeppelin and languages like R, Python, Java, and Scala.
• Data analysts/citizen data scientists use these models to conduct predictive and prescriptive analytics for business decisions, based on their working understanding of machine learning.
• Data engineers prepare data and establish automated data pipelines that feed ML models on a continuous basis.

THE ROLE OF MACHINE LEARNING IN DATA SCIENCE
Machine learning deals mainly with the data modeling aspect of the much broader data science discipline that encompasses data preparation, data discovery, analytics, and data modeling. Today’s ML and data science tools can handle many aspects of parsing data, generating predictive and prescriptive models, placing models into production, and maintaining those models over time. Predictive and prescriptive analytics apps can often make their own decisions without human intervention, such as monitoring web browsing patterns to recommend products and services to visitors.
Data scientists use analytics tools to formulate a hypothesis, and then use programming languages and ML libraries to create their predictions. Types of ML include linear regression, logistic regression, classification, decision trees, deep learning, and many others. Some popular ML libraries include XGBoost, TensorFlow, scikit-learn, and PyTorch.
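As a brief, hedged illustration of these building blocks, the sketch below trains and scores a simple churn classifier with scikit-learn. The input file and column names are hypothetical placeholders, not anything defined in this paper; in practice the training set would be pulled from a platform such as Snowflake.

```python
# Minimal sketch: train and evaluate a logistic regression churn model.
# The CSV file and column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("churn_training.csv")  # hypothetical training extract
X = df[["days_since_last_visit", "ticket_count", "monthly_spend"]]
y = df["churned"]

# Hold out 20 percent of the rows to evaluate the model honestly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, preds))  # evaluate before deploying
```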
THE MACHINE LEARNING PROCESS
Successful machine learning initiatives depend on getting the right data at the right time to the correct models. That’s not always easy, since most machine learning cycles consist of multiple stages, from discovery and development into production. Data is added and prepared multiple times during each stage of the ML cycle, often with different data requirements. Success in ML is predicated on getting the right data in the right condition into the right analytic platforms to generate business results.
As shown in Figure 2, data scientists begin by finding, collecting, understanding, and preparing the data (steps 1-3). They may use business intelligence tools to better understand the data and formulate a hypothesis. Data scientists experiment with many data sets throughout this iterative process. Whenever they broaden or extend the scope of the data set, they have to wait for data engineers to load and prepare the data. This causes delays and introduces significant latency between iterations. They also must “shape” the data into a normalized form, and many algorithms require nuanced formats.
Next, they run training data through the models prepared in earlier steps (step 4), evaluating the outcomes to determine each model's effectiveness, then further tune the models through cycles of feature engineering (step 3) and hyperparameter tuning (step 4). The resulting trained models are then deployed to production (step 5) to empower business users with predictive and prescriptive tools. Once deployed into production, the models receive ongoing evaluation (step 6) to identify model drift and determine whether the models are out of date. Models must be periodically retrained with fresh training data. These model updates represent yet another iteration of the ML cycle and require processing more data, which is time consuming and prone to errors. Depending on the use case, the models may need to be retrained as frequently as every few hours, days, or weeks.
Figure 2: The ML workflow: (1) collect data; (2) visualize and understand; (3) feature engineering and transformation; (4) train; (5) deploy; (6) evaluate. Data drives the ML process, from collection and preparation through training, predictions, and productization.
Machine learning is a data-intensive activity, and the success of each predictive model depends on large volumes of diverse data that must be collected, persisted, transformed, and presented in many different ways. This involves large volumes of data characterized by many dimensions and details, and arising from many contexts. For example, if you have built a machine learning model to predict customer churn, you will likely have data about customer behaviors relating to sales, service, purchasing, and app interactions, both historic and real-time.

Snowflake's Data Cloud allows you to consolidate your data from data warehouses, data marts, and data lakes into a single source of truth that powers multiple types of analytics and data science applications. It makes it easy for diverse teams to share governed data, internally and externally, by allowing team members to collaborate without having to copy data and move it from place to place. Raw, structured, and semi-structured data is easily discoverable and immediately accessible for data science workflows, with native support for JSON, Avro, XML, ORC, and Parquet.

Being able to use one set of tools to manage both structured and semi-structured data shortens the data discovery and preparation cycle. Furthermore, the data that is output from the ML algorithms is placed back into the repository for access by business users, alongside the source data. This means all data is always up to date and consistently maintained for business users, analysts, and data scientists.

TDWI recommends acquiring a modern data platform, built for the cloud, that can satisfy the entire data life cycle of machine learning, artificial intelligence, and predictive application development. What should you look for in such a platform? For data preparation, you want to be able to work with large data sets with interactive response times. For training, you want to plow through those data sets iteratively. For production, you want a reliable, repeatable, and scalable data pipeline.

Here are a few of the primary reasons to use Snowflake for your data science endeavors:
• Simplicity: No need to manage multiple compute platforms and constantly maintain integrations.
• Security: One copy of data is stored securely in the Snowflake environment, with user credentials carefully managed and all transmissions encrypted.
• Performance: Query results are cached and can be used repeatedly during the ML process, as well as for analytics.
• Workload isolation: Each user and workload can receive dedicated compute resources.
• Elasticity: It takes only seconds to scale up capacity to accommodate large data processing tasks, and it's just as easy to release that capacity once the work is completed, minimizing costs with per-second pricing.
• Support for structured and semi-structured data: Easily load, integrate, and analyze all types of data inside a unified repository, as illustrated in the sketch below.
• Concurrency: Run massively concurrent workloads at scale across shared data.
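To make the semi-structured support concrete, here is a hedged sketch of querying raw JSON with Snowflake SQL through the Python connector. The connection parameters, table, and JSON field names are hypothetical; the path syntax and LATERAL FLATTEN are standard Snowflake features.

```python
import snowflake.connector

# Connection parameters are placeholders for illustration only.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="ML_PREP_WH", database="ANALYTICS", schema="PUBLIC",
)

# Query a VARIANT column directly: path syntax extracts JSON fields,
# and LATERAL FLATTEN unnests a JSON array into relational rows.
sql = """
    SELECT src:device.type::string   AS device_type,
           r.value:ts::timestamp_ntz AS reading_ts,
           r.value:temp::float       AS temperature
    FROM raw_events,
         LATERAL FLATTEN(input => src:readings) r
"""
for device_type, reading_ts, temperature in conn.cursor().execute(sql):
    print(device_type, reading_ts, temperature)
```

No upfront schema definition or separate flattening job is needed; the same table can serve both this query and ordinary relational joins.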
WHITE PAPER 4
CONSOLIDATING DATA FOR MACHINE LEARNING AND ANALYTICS
There are many ways to provision data for machine learning applications, and flexibility is essential. For example, some organizations use a data lake in conjunction with a data warehouse. This allows them to store vast amounts of raw data in its native form, which can then be repurposed for a wide range of analytics when it’s needed. Most data science tools rely on data lakes as their source of data, but today’s analytic strategies increasingly use multiplatform data architectures that include a mix of big data platforms, clouds, data lakes, and data warehouses.

Many leading organizations choose to skip the data lake altogether and instead consolidate their data entirely into a cloud data platform. This approach eliminates the complexity of managing a separate data lake, and it also removes the need for a data transformation pipeline between the data lake and the data warehouse. Having a unified repository, based on a versatile cloud data platform, allows them to select the appropriate storage, processing, and economics for each data set and workload, optimizing the options for ML and analytics.

Once you have collected and prepared your data, you need to be able to discover patterns and insights via analytics and predictive analytics tools. Snowflake allows you to combine general analytics with predictive analytics, so your business intelligence tools and data science tools have one consistent view of the same governed data. All data science tools reference the same data definitions, so you can consistently reproduce the content of queries, forecasts, dashboards, and reports. Both the raw data and ML results reside in the data platform for easy access. This unified approach allows data scientists to output the results of machine learning activities back into the data platform for general-purpose analytics, as well as to embed these results in the decision-making process.

Having common semantics, data definitions, and data models keeps everybody on the same page. For example, a sales manager might look at a BI report that shows the historic performance of the sales team. An ML model could also forecast expected sales results for upcoming quarters based on target account propensity, and highlight both booked and forecasted revenue through the same report.

AUTOMATED DATA ENGINEERING, DATA INTEGRATION, AND DATA SHAPING
Achieving success with ML means creating efficient and reliable data pipelines that feed accurate and timely data to business users, as well as populate the apps and services they use.

Ingesting data
Snowflake includes a serverless ingestion service called Snowpipe that asynchronously loads data and makes it available immediately. Manual data flattening tasks are fully automated: The platform transforms data into the type and shape required for each target table.
Standard connectors and adapters allow you to easily ingest event streams from Kafka and other messaging systems, while Snowflake streams and tasks make it easy to schedule data loads for SQL jobs.
By “productizing” the ML model with an automated data ingestion service, the pipeline simplifies complex data integration tasks. Data scientists can find and prepare data on demand, without waiting hours or days between tests. Once an automated data pipeline service is put into production, raw data is immediately available without requiring ETL from a data lake. As data comes in, it is automatically run through the model to make predictions. And because it’s all based in the cloud, data scientists can use dedicated virtual warehouse compute resources without impacting other users.
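As a hedged sketch of the streams-and-tasks pattern (all object names, the schedule, and the feature logic are hypothetical), a change-capture stream on a raw table can feed a scheduled task that keeps a feature table current:

```python
import snowflake.connector

# Placeholder credentials; reuse your own connection settings.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="ML_PREP_WH", database="ANALYTICS", schema="PUBLIC",
)

statements = [
    # Capture new rows in RAW_EVENTS as they arrive.
    "CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events",

    # Every 5 minutes, if the stream has data, append engineered rows.
    """
    CREATE OR REPLACE TASK load_features_task
      WAREHOUSE = ML_PREP_WH
      SCHEDULE  = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
    AS
      INSERT INTO ml_features
      SELECT customer_id, COUNT(*) AS event_count, MAX(event_ts) AS last_seen
      FROM raw_events_stream
      GROUP BY customer_id
    """,

    # Tasks are created suspended; resume to start the schedule.
    "ALTER TASK load_features_task RESUME",
]
cur = conn.cursor()
for stmt in statements:
    cur.execute(stmt)  # the connector runs one statement per call
```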
Universal SQL capabilities
Snowflake customers can leverage a central source of truth with universal SQL capabilities that power robust and efficient ETL and ELT workloads. Data transformations using SQL are faster, easier, and less expensive than the same operations using Spark. Because data can be transformed as part of a SQL query, the transformation becomes part of the analysis. Due to Snowflake’s architecture and compression, you can rapidly ingest large amounts of streaming data and store it indefinitely at nominal cost. Data engineers can utilize many types of integration tools, including Alteryx, Alooma, Matillion, Fivetran, Alation, Informatica, and many others.
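For instance, an entire feature table can be built where the data lives with a single CREATE TABLE AS SELECT statement, so the transformation is just another query. This is a hedged sketch; the tables, columns, and features are hypothetical.

```python
import snowflake.connector

# Placeholder credentials for illustration.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="ML_PREP_WH", database="ANALYTICS", schema="PUBLIC",
)

# ELT-style feature engineering as one SQL statement.
conn.cursor().execute("""
    CREATE OR REPLACE TABLE churn_training AS
    SELECT c.customer_id,
           ANY_VALUE(c.plan_tier)                         AS plan_tier,
           DATEDIFF('day', MAX(e.event_ts), CURRENT_DATE) AS days_since_last_visit,
           COUNT_IF(e.event_type = 'support_ticket')      AS ticket_count
    FROM customers c
    JOIN web_events e ON e.customer_id = c.customer_id
    GROUP BY c.customer_id
""")
```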
Dedicated compute resources
With Snowflake, ML data ingestion, data management, and data preparation workloads receive dedicated resources that don’t contend with non-ML data engineering and analytics workloads. With resource contention removed, live data can be ingested and transformed in-stream and made immediately available for analytics. You can customize the size of your data warehouse for each workload, scale up as needed, and turn off the cloud services upon completion. Thanks to linear scaling, you can request the exact amount of resources you need to execute queries in a predictable timeframe. With instant elasticity and per-second billing, each user and workgroup pays only for the precise compute resources they use. Ultimately, this architecture allows you to maximize the performance and efficiency of each team, while providing consistent data.
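In practice, a dedicated virtual warehouse can be created, resized, and suspended in a few statements. A hedged sketch follows; the warehouse name, sizes, and timeout are hypothetical choices:

```python
import snowflake.connector

# Placeholder credentials for illustration.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>")
cur = conn.cursor()

# A dedicated, auto-suspending warehouse for ML data preparation.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS ML_PREP_WH
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND   = 60        -- release compute after 60 idle seconds
      AUTO_RESUME    = TRUE
""")

# Scale up for a heavy preparation job, then back down when it finishes,
# paying per second only for what was used.
cur.execute("ALTER WAREHOUSE ML_PREP_WH SET WAREHOUSE_SIZE = 'XLARGE'")
# ... run the large transformation here ...
cur.execute("ALTER WAREHOUSE ML_PREP_WH SET WAREHOUSE_SIZE = 'XSMALL'")
```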
Exchange. This offers access to unique data sets that
Robust data security
can increase the effectiveness of models and provide
Many legacy data science projects depend on Apache additional feature engineering possibilities. Secure data
Hadoop, an open-source framework for storing and sharing in Snowflake doesn’t require data transfer via
processing data in a distributed framework. However, FTP or the configuration of APIs to link applications.
the Hadoop architecture employs only rudimentary It simplifies ETL integration and automatically
access controls, and it was not designed to comply synchronizes “live” data among data providers and data
with important industry standards governing the consumers. Because the source data is shared rather
security and privacy of data, including HIPAA, PCI than copied, consumers don’t require any additional
DSS, and GDPR. Other data science projects leverage cloud storage. Snowflake Data Marketplace and Data
general-purpose object stores such as Amazon S3, Exchange enable data scientists to easily collaborate
which also lack robust data security. on models by sharing raw and processed data.
Figure 3: Snowflake Secure Data Sharing enables you to share data externally via Snowflake Data Marketplace, and create your own Data Exchange with customers, suppliers, and other business partners.
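As a brief illustration of the sharing flow described above, a provider can publish a governed table to a consumer account with a handful of statements. This is a hedged sketch; the share, database, table, and account names are hypothetical.

```python
import snowflake.connector

# Placeholder credentials for illustration.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>")

# Share a governed table with a partner account; no copies are made, and
# the consumer always sees live data. All names here are hypothetical.
for stmt in [
    "CREATE SHARE IF NOT EXISTS model_training_share",
    "GRANT USAGE ON DATABASE analytics TO SHARE model_training_share",
    "GRANT USAGE ON SCHEMA analytics.public TO SHARE model_training_share",
    "GRANT SELECT ON TABLE analytics.public.churn_training TO SHARE model_training_share",
    # The consumer account can then create a read-only database from the share.
    "ALTER SHARE model_training_share ADD ACCOUNTS = myorg.partner_account",
]:
    conn.cursor().execute(stmt)
```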
EXTENSIVE PARTNER ECOSYSTEM
The machine learning space is rapidly evolving, with new tools being added each year. Through Snowflake’s extensive partner ecosystem, customers can take advantage of direct connections to all existing and emerging data science tools, platforms, and languages such as Python, R, Java, and Scala; open-source libraries such as PyTorch, XGBoost, TensorFlow, and scikit-learn; notebooks like Jupyter and Zeppelin; and platforms such as DataRobot, Dataiku, H2O.ai, Amazon SageMaker, and others. By offering a single consistent repository for data, Snowflake removes the need to retool the underlying data every time you switch tools, languages, or libraries. Furthermore, the output from these activities is easily fed into Snowflake and made accessible by nontechnical users to generate business value.

Notebook-based ML tools
Traditional ML notebooks such as Jupyter and Zeppelin power today’s leading data science tools, including Amazon SageMaker, Dataiku, Zepl, and many others. This approach allows data scientists to have ultimate control over the frameworks and algorithms they choose, conduct in-depth feature engineering, tune hyperparameters, and iteratively create, assess, and productize ML models. They can turn intuitions into accurate predictions by iteratively experimenting with algorithms, scoring their performance, and choosing and refining new models.
Users of Amazon SageMaker can use the Snowflake Python Connector to directly populate Pandas DataFrames. This high-speed connection accelerates training and streamlines a data preparation and feature engineering cycle that leverages the full power of ANSI SQL.
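A hedged sketch of that pattern follows; the query and feature table are hypothetical, and the connector's pandas extra is assumed to be installed.

```python
import snowflake.connector

# Placeholder credentials; requires the connector's pandas extra:
#   pip install "snowflake-connector-python[pandas]"
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="ML_PREP_WH", database="ANALYTICS", schema="PUBLIC",
)

# Pull a result set straight into a Pandas DataFrame, with no CSV round trip.
cur = conn.cursor()
cur.execute("SELECT * FROM churn_training")  # hypothetical feature table
df = cur.fetch_pandas_all()

print(df.shape)
# From here the DataFrame feeds scikit-learn, XGBoost, or a SageMaker
# training job exactly like any other in-memory data set.
```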
AutoML tools
Alternatively, AutoML tools such as RapidMiner, Big Squid, H2O.ai, and DataRobot can automatically select algorithms, conduct model training, and choose the best model. These tools are a great way to democratize access to advanced analytics, enabling data analysts to perform ML functions without requiring advanced programming skills or deep mathematics/statistics knowledge. A few tools bridge these two approaches, allowing data scientists to customize AutoML processes. DataRobot, a leading player in the AutoML space, has a built-in Snowflake integration where its users can quickly connect their DataRobot account to Snowflake and use it as a data store.

Analytics and cloud partners
Regardless of which ML approach you choose, Snowflake allows you to consume the results via dashboards, reports, and business analytics tools by leveraging connections to other ecosystem partners such as Tableau, Looker, ThoughtSpot, and Sigma. Furthermore, Snowflake allows you to store and replicate your data across any region, on any cloud, including popular offerings from Amazon, Microsoft, and Google. Snowflake can seamlessly export data to external tables maintained in Amazon S3, Azure Blob Storage, and Google Cloud Storage for universal access by any tool. For example, you can use Snowflake to complement your data lake on AWS, then connect with Amazon SageMaker to develop, test, and deploy ML models at scale. The platform automates everything, from data storage and processing to transaction management, security, governance, and metadata management.

Get started in minutes
If you’re looking for the fastest way to get started with Snowflake and ML, consider the Snowflake Partner Connect program, which simplifies deployment through pre-configured integrations with select technology partners. You can automatically provision and configure partner applications in minutes and load data into Snowflake for immediate use.
CASE IN POINT
ConsumerTrack is a digital advertiser and publisher that aggregates and syndicates website performance data from hundreds of providers to portals such as CNN and MSN. Previously its data science team struggled with an ML environment that used MySQL and various orchestration tools, which led to data chokepoints and latency issues.

ConsumerTrack augmented its existing data lake with Snowflake and chose Amazon SageMaker as its fully managed service for automating the ML workflow. SageMaker labels and prepares data, chooses an algorithm, trains a model, tunes and optimizes the model for deployment, makes predictions, and then takes action.

Now data flows into the data lake via an automated pipeline that uses AWS Lambda and AWS Glue. Data is curated and then loaded into Snowflake, and the data streams are configured with custom alerts. Amazon SageMaker connects to Snowflake to simplify the development, testing, and building of ML models.

ConsumerTrack has eliminated chokepoints and reduced time-to-insight from hours to minutes. Snowflake substantially reduces the amount of time spent on data discovery and preparation. Snowflake’s broad ecosystem allows ConsumerTrack to connect with many types of data science platforms and tools, including a native connector for Python. When they need to, the data science team can export data to any blob store for universal access.

WHAT'S NEXT?
To learn more about Snowflake for machine learning, visit the Snowflake data science page and the Snowflake platform page.
ABOUT SNOWFLAKE
Snowflake delivers the Data Cloud—a global network where thousands of organizations mobilize
data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations
unite their siloed data, easily discover and securely share governed data, and execute diverse analytic
workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across
multiple public clouds. Snowflake’s platform is the engine that powers and provides access to the
Data Cloud, creating a solution for data warehousing, data lakes, data engineering, data science, data
application development, and data sharing. Join Snowflake customers, partners, and data providers
already taking their businesses to new frontiers in the Data Cloud. snowflake.com
© 2021 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service
names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries.
All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their
respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s).
CITATIONS
1. “Best Practices Report: Driving Digital Transformation Using AI and Machine Learning,” TDWI (tdwi.org/bpreports).
2. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says,” Forbes (bit.ly/38EbXmN).
3. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says,” Forbes (bit.ly/38EbXmN).