The Data Engineer's Guide To Python For Snowflake
Snowpark is the set of libraries and runtimes that enable all data users to bring their work to the Snowflake Data Cloud with native support for Python, SQL, Java, and Scala. With Snowpark, data engineers can execute pipelines that feed ML models and applications faster and more securely in a single platform using their language of choice.

This guide covers how Snowpark fits into the larger Snowflake data engineering ecosystem. We'll also share resources designed to help data engineers get started with Snowflake and Snowpark.
SNOWPARK FOR
DATA ENGINEERING
Snowpark is the set of libraries and runtimes that securely enable data engineers to deploy and process non-SQL code, including Python, Java, and Scala, as shown in Figure 1.

On the client side, Snowpark consists of libraries including the DataFrame API. Snowpark brings deeply integrated, DataFrame-style programming and OSS-compatible APIs to the languages data practitioners like to use. It provides familiar APIs for various data-centric tasks, including data preparation, cleansing, preprocessing, model training, and deployment.

Snowflake offers data engineers many benefits with Snowpark:

• A single platform that supports multiple languages, including SQL, Python, Java, and Scala
• Consistent security across all workloads with no governance trade-offs
• Faster, cheaper, and more resilient pipelines

A SINGLE PLATFORM
Architecture complexity increases significantly when different teams use different languages across multiple processing engines. Snowpark streamlines architectures by natively supporting programming languages of choice, without the need for separate processing engines. Instead, Snowpark brings all teams together to collaborate on the same data in a single platform—Snowflake.

LIBRARIES AND RUNTIMES IN SNOWPARK
Figure 1: Snowpark allows developers working in many languages to leverage the power of Snowflake.
CUSTOMER SPOTLIGHT
Snowflake customer HyperFinity, a no-code decision intelligence platform for retailers and CPGs, uses SQL and Python for its ML and AI initiatives. With Snowpark, HyperFinity has a single platform that supports both languages, eliminating cumbersome data movement and the code developed to move data across different services. As a result, HyperFinity works more seamlessly—developing, testing, and deploying Python and SQL in one environment for more agile overall operations.

NO GOVERNANCE TRADE-OFFS
Enterprise-grade governance controls and security are built into Snowflake. For example, Snowpark is secure with a design that isolates data to protect the network and host from malicious workloads, and gives administrators control over the libraries developers execute. Developers can build confidently, knowing data security and compliance measures are consistent and built in.

CUSTOMER SPOTLIGHT
EDF, a supplier of gas and zero-carbon electricity to homes and businesses in the United Kingdom, tapped Snowpark to help deploy data applications. By working within Snowflake, the project did not require additional sign-offs and meetings to approve data accessibility. Instead, the EDF team could scale seamlessly by working within the security rules Snowflake enables that already applied to the project. Since integrating Snowpark into its data engineering operations, EDF has sped up production of customer-facing, ML-driven programs from several months to just three to four weeks, increasing output by 4x.

FASTER, CHEAPER PIPELINES
Snowpark enables pipelines with better price performance, transparent costs, and less operational overhead thanks to Snowflake's unique multi-cluster shared data architecture. Snowflake is a single, integrated platform that delivers the performance, scale, elasticity, and concurrency today's organizations require.

CUSTOMER SPOTLIGHT
These benefits can be seen at IQVIA, a leading provider of analytics, technology solutions, and clinical research services in the life sciences industry, and a Snowflake customer. As IQVIA processed increasingly large volumes of structured, semistructured, and unstructured data, the company had to manage mounting complexity as its business scaled. Since implementing Snowpark in Snowflake, IQVIA has developed data engineering pipelines and intelligent apps more quickly and easily, with consistent enterprise-level governance features such as row-level access, data masking, and closer proximity of data to processing. By leveraging Snowpark to build pipelines that process large volumes of data, IQVIA has realized a 3x cost savings compared to previous pipeline processes.

SNOWFLAKE (AND SNOWPARK) FOR DATA ENGINEERING
Using Snowpark runtimes and libraries, data engineers can securely deploy and process Python code to build pipelines in Snowflake. Some of the critical use cases for data engineers working in Snowpark include:

• ETL/ELT: Data teams can use Snowpark to transform raw data into modeled formats regardless of type, including JSON, Parquet, and XML (a minimal sketch follows this list). All data transformations can then be packaged as Snowpark stored procedures to operate and schedule jobs with Snowflake Tasks or other orchestration tools.

• Custom logic: Users can leverage Snowpark's User Defined Functions (UDFs) to streamline architecture with complex data processing and custom business logic written in Python or Java in the same platform running SQL queries and transformations. There are no separate clusters to manage, scale, or operate.

• Data science and ML pipelines: Data teams can use the integrated Anaconda repository and package manager to collaborate in bringing ML data pipelines to production. Trained ML models can also be packaged as a UDF to run the model inference close to data, enabling faster paths from model development to production.
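To make the ETL/ELT use case concrete, below is a minimal sketch of a Snowpark transformation that reads raw JSON from a VARIANT column and writes a modeled table. The connection values, table names, and columns (RAW_ORDERS, ORDERS_MODELED, V, and the fields inside it) are hypothetical placeholders, not examples from this guide.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, to_date
from snowflake.snowpark.types import DoubleType, StringType

# Placeholder connection details -- replace with your own account and credentials.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Raw JSON landed in a VARIANT column named V; project it into a modeled shape.
raw = session.table("RAW_ORDERS")
modeled = (
    raw.select(
        col("V")["order_id"].cast(StringType()).alias("ORDER_ID"),
        col("V")["customer"]["id"].cast(StringType()).alias("CUSTOMER_ID"),
        to_date(col("V")["order_date"].cast(StringType())).alias("ORDER_DATE"),
        col("V")["amount"].cast(DoubleType()).alias("AMOUNT"),
    )
    .filter(col("AMOUNT") > 0)
)

# The transformation is pushed down and executed inside Snowflake.
modeled.write.mode("overwrite").save_as_table("ORDERS_MODELED")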
SNOWFLAKE FOR PYTHON
Using Snowpark for Python, data engineers can take advantage of familiar tools and programming languages while benefiting from the scale, security, and performance of the Snowflake engine. All processing is run in a secure Python runtime right next to your data, resulting in faster, more scalable pipelines with built-in governance regardless of the language used. Figure 2 gives an overview of both the Snowpark client side libraries and the Snowflake server side runtimes.

SNOWPARK FOR PYTHON ARCHITECTURE
For custom Python or Java code, there is no translation to SQL. Rather, the code is serialized and sent to Snowflake to be processed inside the Java or Python Secure Sandbox. In the case of Python, if the custom code includes any third-party open source libraries available in the integrated Anaconda package repository, the package manager can help ensure code runs without complex environment management.

SNOWPARK CLIENT SIDE LIBRARIES
The Snowpark client side libraries are open source and work with any Python environment. This includes the Snowpark DataFrame API, which allows data engineers to build queries using DataFrames right in their Python code, without having to create and pass along SQL strings.

SNOWFLAKE SERVER SIDE RUNTIME
Snowflake is cloud-built as a data platform that architecturally separates but logically integrates storage and compute, and is optimized to enable near-limitless amounts of these resources. Elastic scaling, multi-language processing, and unified governance also underpin Snowflake's architecture.

The intelligent infrastructure is what makes everything just work. Compute clusters can be started, stopped, or resized—automatically or on the fly—accommodating the need for more or less compute resources at any time. Along with flexibility, Snowflake prioritizes speed, granting near-instant access to dedicated compute clusters for each workload, so users can take advantage of near-limitless concurrency without degrading performance. The three architectural layers that integrate within Snowflake's single platform are shown in Figure 3.

THE SNOWFLAKE PLATFORM ARCHITECTURE

The Snowpark Python server-side runtime makes it possible to write Python UDFs and stored procedures that are deployed into Snowflake's secured Python runtime. UDFs and stored procedures are two other key components of Snowpark that allow data engineers to bring custom Python logic to Snowflake's compute engine, while taking advantage of open source packages pre-installed in Snowpark.
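To illustrate the client-side DataFrame API described above, the short sketch below composes a query in Python and lets Snowpark generate and push down the SQL; nothing executes until an action such as show() is called. The table and column names are hypothetical, and session is assumed to be an existing Snowpark Session.

from snowflake.snowpark.functions import col, sum as sum_

# Compose the query with DataFrame operations instead of hand-written SQL strings.
# 'session' is an existing snowflake.snowpark.Session.
daily_revenue = (
    session.table("ORDERS_MODELED")
    .filter(col("AMOUNT") > 0)
    .group_by("ORDER_DATE")
    .agg(sum_(col("AMOUNT")).alias("REVENUE"))
)

# Inspect the SQL that Snowpark generated for push-down, then run it in Snowflake.
print(daily_revenue.queries)
daily_revenue.show()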
SNOWPARK USER DEFINED FUNCTIONS (UDFS)
Custom logic written in Python runs directly in Snowflake using UDFs. Functions can stand alone or be called as part of a DataFrame operation to process the data. Snowpark takes care of serializing the custom code into Python byte code and pushes all of the logic to Snowflake, so it runs next to the data.

# Given geo-coordinates, UDF to calculate distance between
# distribution center and shipping locations
from snowflake.snowpark.functions import udf
import geopandas as gpd
from shapely.geometry import Point
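The excerpt above ends at the imports. The following is one hedged way to complete it as a registerable Snowpark UDF; the function name, argument order, and the projected-CRS distance approximation are illustrative assumptions rather than the guide's original code, and registration assumes an active Snowpark session.

from snowflake.snowpark.functions import udf

# Registration assumes an active Snowpark session and that geopandas and
# shapely are enabled from the integrated Anaconda channel.
@udf(name="shipping_distance_km", packages=["geopandas", "shapely"], replace=True)
def shipping_distance_km(dc_lat: float, dc_lon: float,
                         ship_lat: float, ship_lon: float) -> float:
    import geopandas as gpd
    from shapely.geometry import Point

    # Project both points to a metric CRS and take the planar distance in km
    # (a rough approximation, sufficient for illustration).
    points = gpd.GeoSeries(
        [Point(dc_lon, dc_lat), Point(ship_lon, ship_lat)], crs="EPSG:4326"
    ).to_crs(epsg=3857)
    return float(points.iloc[0].distance(points.iloc[1])) / 1000.0

Once registered, the function can be called inside a DataFrame operation, for example df.select(shipping_distance_km(col("dc_lat"), col("dc_lon"), col("lat"), col("lon"))).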
STORED PROCEDURES
Snowpark stored procedures help data engineers operationalize their Python code and run, orchestrate, and schedule their pipelines. A stored procedure is created once and can be executed many times with a simple CALL statement in your orchestration or automation tools. Snowflake supports stored procedures in SQL, Python, Java, and Scala, among other languages.

-- Create python stored procedure to host and run the snowpark pipeline
-- to calculate and apply bonuses
create or replace procedure apply_bonuses(sales_table string, bonus_table string)
returns string
language python
runtime_version = '3.8'
warehouse = 'xs'
as
$$
# Snowpark pipeline logic (elided in this excerpt)
$$;

call apply_bonuses('wholesale_sales', 'bonuses');
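As a complement to the SQL definition above, the same kind of pipeline can also be registered directly from Python with Snowpark's sproc decorator. This is a minimal sketch under assumptions: the table names and the 10% bonus rule are hypothetical, an active Snowpark session is assumed, and the procedure is registered as session-temporary.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sproc

# Registration assumes an active Snowpark session; the server-side handler
# needs the snowflake-snowpark-python package.
@sproc(name="apply_bonuses_py", packages=["snowflake-snowpark-python"], replace=True)
def apply_bonuses_py(session: Session, sales_table: str, bonus_table: str) -> str:
    sales = session.table(sales_table)
    # Hypothetical rule: a 10% bonus on each sale amount.
    bonuses = sales.select(
        col("EMPLOYEE_ID"),
        (col("AMOUNT") * 0.1).alias("BONUS"),
    )
    bonuses.write.mode("overwrite").save_as_table(bonus_table)
    return f"Bonuses written to {bonus_table}"

# Invoke it like any other stored procedure, from Python or via SQL CALL.
session.call("apply_bonuses_py", "WHOLESALE_SALES", "BONUSES")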
SNOWFLAKE, ANACONDA, AND THE OPEN SOURCE ECOSYSTEM
One of the benefits of Python is its rich ecosystem of open-source packages and libraries. In recent years, open-source packages have been one of the biggest enablers for faster and easier data engineering. To leverage open-source innovation, Snowpark has partnered with Anaconda for a product integration without any additional cost to the user beyond warehouse usage.

Data engineers in Snowflake are now able to speed up their Python-based pipelines by taking advantage of the seamless dependency management and comprehensive set of curated open-source packages provided by Anaconda—all without moving or copying the data. All Snowpark users can benefit from thousands of the most popular packages that are pre-installed from the Anaconda repository, including fuzzywuzzy for string matching, h3 for geospatial analysis, and scikit-learn for machine learning and predictive data analysis. Additionally, Snowpark is integrated with the Conda package manager, so users can avoid dealing with broken Python environments caused by missing dependencies.

Snowpark also fully supports dbt, one of the most popular solutions for data transformation today. dbt supports a SQL-first transformation workflow, and in 2022 it introduced Python as a second language, running Snowpark under the hood. With dbt's support for both SQL and Python, users can write transformations in the language they find most familiar and fit for purpose, drawing on state-of-the-art open-source packages for data engineering and data science, all within the dbt framework familiar to many SQL users.

Using open-source packages in Snowflake is as simple as the code below, which demonstrates how users can call packages such as NumPy, XGBoost, and Pandas directly from Snowpark.

-- Python UDF that calls packages from the integrated Anaconda repository;
-- the function name and the returned version list are illustrative
create or replace function py_udf()
returns array
language python
runtime_version = 3.8
packages = ('numpy','pandas==1.4.*','xgboost==1.5.0')
handler = 'udf'
as $$
import numpy as np
import pandas as pd
import xgboost as xgb
def udf():
    # report the package versions available in the server-side runtime
    return [np.__version__, pd.__version__, xgb.__version__]
$$;
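The same Anaconda packages can also be requested from the Snowpark client side. The sketch below is illustrative only: it assumes an active Snowpark session and simply reports which package versions are available in the server-side runtime.

from typing import List
from snowflake.snowpark.functions import udf

# Ask Snowflake to provision these Anaconda packages for UDFs registered
# in this session ('session' is an existing snowflake.snowpark.Session).
session.add_packages("numpy", "pandas==1.4.*", "xgboost==1.5.0")

@udf(name="package_versions", replace=True)
def package_versions() -> List[str]:
    import numpy as np
    import pandas as pd
    import xgboost as xgb
    return [np.__version__, pd.__version__, xgb.__version__]

Selecting package_versions() in any DataFrame query then returns the versions resolved inside Snowflake.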
BEST PRACTICES:
DATA ENGINEERING IN
SNOWPARK WITH PYTHON
As the Snowpark for Python developer community grows rapidly, data engineers are looking for "best practices" to guide their work. Understanding how Snowpark DataFrames, UDFs, and stored procedures work together can make data engineers' work in Snowflake more efficient and secure. We've compiled a short list of best practices for data engineers working with Python in Snowpark.

1. Maximize use of the Snowpark libraries for development and secure execution.
Snowpark can be used with your preferred IDE and development and debugging tools, and the execution can be transparently pushed down to Snowflake. Maximize this utility while being mindful of the use of to_pandas() from Snowpark, which brings the full data set into memory. Also, Cachetools is a Python library that provides a collection of caching algorithms to store a limited number of items for a specified duration; it can be used to speed up UDFs and stored procedures by ensuring the logic is cached in memory in cases of repeated reads.

2. Accelerate development to production flow with Anaconda integration.
We recommend using the Snowflake Anaconda channel for local development to ensure compatibility between client- and server-side operations. Building your code against the latest stable versions of third-party packages doesn't require you to specify dependencies, because the Conda package manager takes care of this, offering tremendous peace of mind. If a desired package is not available inside Snowflake, please submit feedback through the Snowflake Community so Snowflake teams can facilitate its integration. If the package is a pure Python package, you can unblock yourself and bring in the package via Stages.

3. Use vectorized UDFs for feature transformations and ML scoring.
Vectorized UDFs using the Batch API can execute scalar UDFs in batches (a minimal sketch follows this list). Use the Python UDF Batch API if leveraging third-party Python packages where transformations are done independently, row by row, and the process could be efficiently scaled out by processing rows in batches. This is a common scenario when using third-party Python packages to do machine learning-specific transformations on data as part of feature engineering or when executing ML batch inference.

4. Use Snowpark-optimized warehouses for memory-intensive workloads.
Snowpark-optimized warehouses are important for data engineers working on large data sets. Consider using a Snowpark-optimized warehouse when you run into a 100357 (P0000): UDF available memory exhausted error during development. Avoid mixing other workloads with workloads that require Snowpark-optimized warehouses. If you must mix them, consider calling the session.use_warehouse() method to switch back to standard warehouses.
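As an illustration of the Batch API referenced in item 3, the sketch below registers a vectorized UDF that receives rows in batches as pandas Series; the column name, batch size, and the simple row-wise calculation are hypothetical, and an active Snowpark session is assumed.

from snowflake.snowpark.functions import col, pandas_udf
from snowflake.snowpark.types import FloatType, PandasSeriesType

# Snowflake passes rows to the handler in batches as pandas Series rather
# than one row at a time; the logic itself stays row-independent.
# 'session' is an existing snowflake.snowpark.Session.
compute_tax = pandas_udf(
    lambda amounts: amounts * 0.2,
    return_type=PandasSeriesType(FloatType()),
    input_types=[PandasSeriesType(FloatType())],
    max_batch_size=10_000,
)

# Apply it in a DataFrame operation, just like a scalar UDF.
taxed = session.table("WHOLESALE_SALES").select(
    col("AMOUNT"),
    compute_tax(col("AMOUNT")).alias("TAX"),
)
taxed.show()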
BEYOND SNOWPARK:
OTHER CAPABILITIES IN THE SNOWFLAKE
DATA ENGINEERING ECOSYSTEM
In addition to Snowpark, Snowflake has many other data engineering capabilities that make it a fast and flexible platform that comprehensively supports simple, reliable data pipelines in any language of choice. Figure 4 offers an overview of Snowflake's advanced functionality for ingestion, transformation, and delivery that simplifies data engineering.

Snowflake allows data engineering teams to ingest all types of data using a single platform, including streaming or batch and structured, semi-structured, or unstructured. Supported data formats include JSON, XML, Avro, Parquet, ORC, and Iceberg. Streaming data, including streams from Apache Kafka topics, can also be ingested directly to a Snowflake table with Snowpipe Streaming. Thanks to the Data Cloud, all this data can be accessed and shared across providers and between internal teams, customers, partners, and other data consumers via the Snowflake Marketplace.

Data can be transformed in the data engineer's language of choice using Snowpark. Tasks can be combined with table streams for continuous ELT workflows to process recently changed table rows, and tasks are easily chained together for successive execution to support more complex periodic processing (a short sketch of this pattern appears at the end of this section). All of this can be done fast, and scaled to meet the evolving number of users, data, and jobs of complex projects.

Snowflake is constantly enhancing functionality. Dynamic tables, which provide a way to build declarative pipelines, are currently in private preview and offer a different approach to building pipelines from Snowpark for Python UDFs and stored procedures. These tools are designed to automatically process data incrementally as it changes, simplifying data engineering workloads. Snowflake automates all the database object and data manipulation language management, enabling data engineers to easily build scalable, performant, and cost-effective data pipelines.

The resulting data pipelines have intelligent infrastructure, pipeline automation, and data programmability. Snowflake's simplified pipelines can then power analytics, applications, and ML models with only one copy of data to manage and near-zero maintenance. Data can also be accessed and shared directly using secure data sharing capabilities with internal teams, customers, partners, and even more data providers and consumers through the Snowflake Marketplace. Data doesn't move with Snowflake's modern data sharing technology. Instead, a data provider grants a data consumer near-instant access to live, read-only copies of the data. This approach reduces latency, removes the need to copy and move stale data, and dramatically reduces the governance challenges of managing multiple copies of the same data.

Snowflake was designed to give data engineers access to all data at speed, with performance and reliability at scale, to build radically simple data pipelines. With innovative pipeline automation and data programmability, data engineers can simplify their workflows and eliminate what's unnecessary, so they can focus their effort on their most impactful work.
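As a sketch of the stream-plus-task pattern described above, the example below issues the DDL through Snowpark's session.sql(). The object names, warehouse, schedule, and transformation are placeholders, and an active Snowpark session is assumed.

# Capture changes on a raw table with a stream ('session' is an existing
# snowflake.snowpark.Session; object names are placeholders).
session.sql("create or replace stream raw_events_stream on table RAW_EVENTS").collect()

# A task that wakes up on a schedule, runs only when the stream has data,
# and appends newly arrived rows to a modeled table.
session.sql("""
    create or replace task load_events_task
      warehouse = transform_wh
      schedule = '5 minute'
      when system$stream_has_data('RAW_EVENTS_STREAM')
      as
        insert into EVENTS_MODELED
        select v:event_id::string, v:event_ts::timestamp, v:payload
        from raw_events_stream
""").collect()

# Tasks are created suspended; resume to start the continuous ELT workflow.
session.sql("alter task load_events_task resume").collect()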
DATA ENGINEERING WITH SNOWFLAKE
Figure 4: Snowflake supports ingestion of unstructured, semi-structured, and structured data while automated workflows facilitate transformation and delivery.
GETTING STARTED
WITH SNOWPARK
To develop and deploy code with Snowpark, developers have always had the flexibility to work from their favorite integrated development environment (IDE) or notebook.

Data engineers can easily get started in Snowpark, beginning development anywhere that can run a Python kernel. Minimizing learning curves by eliminating the need for a new tool, data engineers simply install the Snowpark DataFrame API and establish a connection to their Snowflake account. Snowpark aims to give developers flexibility. It supports many development interfaces, including:

• Code editors and IDEs: Many data engineers prefer to build using code editors and IDEs. These offer capabilities such as local debugging, autocomplete, and integration with source control. Snowpark works well in VS Code, IntelliJ, PyCharm, and other tools. VS Code works with a Jupyter extension that provides a notebook experience within the editor, bringing breakpoints and debugging to the notebook experience without requiring separate management of the Jupyter container or runtime. Code editors and IDEs are a great choice for a rich development and testing experience when building pipelines.

• Snowsight worksheets: Snowsight is Snowflake's web interface that provides SQL and Python (currently in public preview) support in a unified, easy-to-use experience. These worksheets provide autocomplete for the Snowpark session and can run directly from the browser as a stored procedure. Snowsight is a good option for teams looking for a zero-install editor for writing and running Snowpark and quickly turning that code into stored procedures that can be orchestrated as part of an automated pipeline.

• Open-source notebook solutions: One popular option for building pipelines in Snowpark is to leverage notebooks. Notebooks enable rapid experimentation using cells. With Snowpark, you can run a variety of notebook solutions such as Jupyter Notebooks, which can be run locally while connected securely to Snowflake to execute data operations. Any machine running containers or Python can build and execute Snowpark pipelines. A similar approach can be used for working with Snowpark in other notebook solutions, including Apache Zeppelin. Open-source notebook solutions are a great choice for data exploration.

• Partner integrated solutions: Many Snowpark Accelerated partners offer either hosted open source notebooks or their own integrated experiences. Their solutions come with the Snowpark APIs preinstalled out of the box and offer secure data connections. These deeply integrated experiences speed up the building and deploying of pipelines, models, and apps. More information on partner integrations can be found on the Snowpark Accelerated page.

RESOURCES
Start harnessing the power of Snowflake with Snowpark for data engineering, and get started with the resources below:

FREE TRIAL
QUICKSTART
DEVELOPER DOCUMENTATION
MEDIUM BLOG
SNOWFLAKE FORUMS
ABOUT SNOWFLAKE
Snowflake enables every organization to mobilize their data with Snowflake’s Data Cloud. Customers use the Data Cloud to unite siloed data,
discover and securely share data, power data applications, and execute diverse AI/ML and analytic workloads. Wherever data or users live,
Snowflake delivers a single data experience that spans multiple clouds and geographies. Thousands of customers across many industries,
including 590 of the 2022 Forbes Global 2000 (G2K) as of April 30, 2023, use the Snowflake Data Cloud to power their businesses.
© 2023 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned herein
are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used
herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be
sponsored or endorsed by, any such holder(s).