Snowflake - Interview Questions
# How can you enable query acceleration?
Query acceleration is a serverless compute service offered by Snowflake that can allocate additional compute to eligible queries to improve their performance. The service is enabled at the warehouse level with an ALTER WAREHOUSE statement (see the example under the next question).
# What is QUERY_ACCELERATION_MAX_SCALE_FACTOR?
The query acceleration scale factor is a cost control that sets an upper limit on the compute the query acceleration service can lease for a warehouse, expressed as a multiple of the warehouse size. If we set the value to 1, the service can consume at most the same number of credits as the warehouse itself (COMPUTE_WH in this example):
ALTER WAREHOUSE COMPUTE_WH SET
  ENABLE_QUERY_ACCELERATION = TRUE
  QUERY_ACCELERATION_MAX_SCALE_FACTOR = 1;
# What is memory spilling?
When a warehouse runs out of memory while executing a query, it moves intermediate data to the local disk of the warehouse. This is called memory spilling. If the local disk is not sufficient, the data spills further to remote cloud storage. Memory spilling is bad for query performance because writing to disk, and especially to remote storage, is much slower than processing in memory.
Spilling can be identified with a query against the account usage views, for example:
SELECT
  warehouse_name,
  bytes_spilled_to_local_storage,
  bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
ORDER BY bytes_spilled_to_remote_storage DESC, bytes_spilled_to_local_storage DESC;
Memory spillage happens because the memory available is not sufficient for the data being processed. To reduce it:
• Reduce the amount of data being processed, for example by eliminating columns that are not used or filtering out rows that do not need to be processed.
• Increase the size of the warehouse to increase the amount of memory available.
• Use a Snowpark-optimized warehouse, which has 16x the memory and 10x the local disk space of a standard warehouse.
For comparison:
• X-Small (standard): 16 GB RAM, 200 GB SSD (Snowpark-optimized not available at this size)
• Medium (standard): 64 GB RAM, 800 GB SSD, 4 credits per hour
• Medium (Snowpark-optimized): 1024 GB RAM, 8000 GB SSD, 6 credits per hour
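A minimal sketch of creating a Snowpark-optimized warehouse for memory-heavy jobs; the warehouse name and settings are illustrative:
-- Medium Snowpark-optimized warehouse with aggressive auto-suspend
CREATE WAREHOUSE IF NOT EXISTS SNOWPARK_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;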
The process of skipping micro-partitions when reading data is called partition pruning or query pruning. The metadata stored for each micro-partition allows Snowflake to skip partitions that cannot contain data matching the WHERE clause or join condition.
# What is the metadata cache?
Snowflake stores certain information about each table, such as the row count and the min/max values of each column, as metadata in the cloud services layer, so that certain queries about the table can be answered without reading the table data.
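For example, a query like the one below can typically be answered from the metadata alone; the table and column names are illustrative:
-- Row count and min/max values are served from metadata, no table scan needed
SELECT COUNT(*), MIN(order_date), MAX(order_date)
FROM sales.orders;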
# What is the difference between automatic clustering and manually clustering tables?
Automatic clustering, as the name suggests, is done automatically by Snowflake: data is organized into micro-partitions based on Snowflake's own analysis of the data.
With manual clustering, which is only recommended for tables over 1 TB in size, we pick the cluster keys based on our understanding of the data and the columns used in filter conditions and joins for that table. We use the CLUSTER BY keyword to cluster the table; it can be specified at creation time or later with an ALTER TABLE statement, as shown below.
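A sketch of both options, using an illustrative orders table clustered on the columns most often used in filters and joins:
-- Define cluster keys at creation time
CREATE TABLE sales.orders (order_id NUMBER, order_date DATE, region STRING)
  CLUSTER BY (order_date, region);
-- Or add/change cluster keys on an existing table
ALTER TABLE sales.orders CLUSTER BY (order_date, region);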
In general, a table with a large clustering depth will perform poorly in queries. The depth can be checked with the SYSTEM$CLUSTERING_DEPTH('<table_name>') function.
# You suspect that queries are waiting to get picked up by a warehouse to be processed, how will you confirm this?
If simple queries are taking more time than they should, the cause could be too many queries running concurrently. Queuing can be confirmed with a query against WAREHOUSE_LOAD_HISTORY, for example:
SELECT
  HOUR(start_time) AS start_time_hour,
  AVG(avg_running) AS avg_nbr_of_query_running_per_hour,
  AVG(avg_queued_load) AS avg_nbr_of_queued_query_due2overloaded_WH_per_hour,
  AVG(avg_queued_provisioning) AS avg_nbr_of_queued_query_due2provisioning_WH_per_hour,
  AVG(avg_blocked) AS avg_blocked
FROM snowflake.account_usage.warehouse_load_history
GROUP BY start_time_hour
ORDER BY start_time_hour;
If we observe that there is queuing, we can convert the warehouse to a multi-cluster warehouse or increase the maximum number of clusters in the multi-cluster warehouse (see the example below).
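A sketch of enabling multi-cluster scaling on an existing warehouse; the warehouse name and cluster counts are illustrative:
-- Allow the warehouse to scale out to up to 4 clusters when queries queue
ALTER WAREHOUSE REPORTING_WH SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';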
Unstructured Data
# Can you unload data from a relational table to a JSON stage with the COPY command?
If we create a JSON stage and try to unload relational data into it directly, the unload will fail because the stage expects JSON format. We need to use the OBJECT_CONSTRUCT function to convert the relational rows to JSON, as shown below.
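A sketch of unloading relational rows as JSON; the stage name @json_stage and the folder path are illustrative:
-- Convert each row to a JSON object before unloading to the JSON stage
COPY INTO @json_stage/customers/
FROM (SELECT OBJECT_CONSTRUCT(*) FROM customer)
FILE_FORMAT = (TYPE = 'JSON');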
# What data type do you recommend to store JSON and XML data ?
Snowflake natively supports storing of semi-structured data like JSON and XML with the VARIANT
datatype.
# Can you explain a scenario in your project where you needed to use the variant data type?
We receive IoT data files in JSON format and copy this data into Snowflake tables in the RAW layer, which have a VARIANT column. Once data lands in the raw tables, we parse the data from the variant columns and load it into the final tables in Snowflake.
The FLATTEN function is used with the ARRAY, OBJECT and VARIANT data types and expands the values they contain into multiple rows, one row per element, so that semi-structured data can be queried relationally.
SELECT student_id, first_name, last_name, cert.this, cert.value
FROM students stu,
LATERAL FLATTEN(input => stu.courses) cert;
DATA SECURITY
# How do you protect sensitive data like Date of Birth and SSN# from being seen by end users?
We can use Dynamic Data Masking to protect sensitive data from being seen by end users. Only roles that are authorized will see the data in its unmasked form.
We can also use views: create a view that contains only the columns that can be exposed to the end user and grant SELECT access on the view.
Static data masking means permanently updating sensitive data like Date of Birth and SSN# so that the real values are no longer present. For example, all birth dates can be updated to 31-Dec-2010: every row still has a birth date, but none of them is the real one.
We can also use Row Access Policies to limit the rows seen by end users who should not have access to them.
Dynamic Data Masking is a column-level security feature that uses masking policies to selectively mask plain-text data in table and view columns at query time. It is called dynamic because the masking changes depending on who is querying the data: the mask is not hard-coded into the stored data, it is applied on top of the data according to the rules we define.
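A minimal sketch of a masking policy on an SSN column; the policy, table, column and role names are illustrative:
-- Mask SSN for every role except those allowed to see it
CREATE MASKING POLICY ssn_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val ELSE '***-**-****' END;
-- Attach the policy to the column
ALTER TABLE customer MODIFY COLUMN ssn SET MASKING POLICY ssn_mask;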
A Row Access Policy is a row-level security feature which determines what rows a user will see when they run a SELECT statement. Two users who select from the same table may see different data if the row access policy evaluates differently for them.
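A minimal sketch of a row access policy that limits rows by region; the policy name, the mapping table region_role_map and the orders table are all hypothetical:
-- Return only rows whose region is mapped to the current role
CREATE ROW ACCESS POLICY sales.region_policy AS (region STRING) RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1
    FROM sales.region_role_map m            -- hypothetical role-to-region mapping table
    WHERE m.role_name = CURRENT_ROLE()
      AND m.region = region
  );
-- Attach the policy to the table on the region column
ALTER TABLE sales.orders ADD ROW ACCESS POLICY sales.region_policy ON (region);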
Snowflake Architecture
# What is the max data storage capacity of snowflake? The max capacity is limited only by the underlying cloud storage, which is effectively unlimited.
Cost Saving
# What are the best practices that we can follow to reduce credit consumption?
• Size the warehouse appropriately: start with a smaller warehouse, test the query, and progressively increase the size until you find a good fit. Over-provisioning wastes credits and should be avoided because cost doubles every time we size up.
• Use WAREHOUSE_LOAD_HISTORY to ensure that warehouses are being fully utilized while they are running.
• Enable auto-suspend: credits are consumed even if no queries are being executed, so enable auto-suspend to shut down warehouses when they are not needed.
• Control access to the CREATE and MODIFY privileges on warehouses to control the creation and re-sizing of warehouses, which can increase cost.
• Use STATEMENT_TIMEOUT_IN_SECONDS to kill hung or runaway queries, which can consume a lot of credits; it defines the maximum amount of time a SQL statement can run before it is canceled.
• Use resource monitors to enforce spending limits (see the example after this list).
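A sketch of a few of these settings applied together; the warehouse name, quota and monitor name are illustrative:
-- Auto-suspend after 60 seconds of inactivity and cancel statements running over 1 hour
ALTER WAREHOUSE ETL_WH SET
  AUTO_SUSPEND = 60
  STATEMENT_TIMEOUT_IN_SECONDS = 3600;
-- Resource monitor that suspends the warehouse when the credit quota is used up
CREATE RESOURCE MONITOR etl_monitor WITH CREDIT_QUOTA = 100
  TRIGGERS ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE ETL_WH SET RESOURCE_MONITOR = etl_monitor;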
Views should be defined as secure when they are specifically needed to limit access to sensitive information, for example:
CREATE SECURE VIEW view_customer AS SELECT * FROM customer;
• The query used to create a secure view is not visible to users (except the owner), providing greater security.
• Regular views are more performant than secure views: some internal optimizations used for regular views are skipped for secure views because they could expose underlying data through user code, so the secure view gives greater data security but reduced performance.
• Secure views can be included in a data share; regular views cannot be shared.
# Have you created a data share and shared data in your project?
We use data shares to share data between accounts within our own organization. We also share data with external partners who need access to our data, with the help of data shares (see the sketch below).
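A minimal sketch of creating and granting a share; the share, database, table and consumer account names are illustrative:
-- Create a share, grant access to the objects being shared, and add the consumer account
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;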
Reader accounts enable these users to get set up with Snowflake quickly without having to go through the process of signing a licensing agreement.
The account that creates the reader account is responsible for providing the compute needed for
the consumption of the data.
You can see all reader accounts by using the command SHOW MANAGED ACCOUNTS
Objects that can be included in a data share:
• Tables
• External Tables
• Secure Views
• Secure UDFs
Unloading Data
# What is data unloading ? Data unloading is the process of moving data out of snowflake to cloud
storage. We use the COPY command to unload data from snowflake.
# When unloading data into cloud storage what best practice do you use?
To ensure that the export completes quickly and with the least credit usage, I use the PARTITION BY clause in the COPY command so that Snowflake's parallelism is used and multiple files are generated per partition. I also name the output files explicitly instead of letting Snowflake name them generically, as in the sketch below.
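A sketch of an unload using PARTITION BY and an explicit file prefix; the stage, table and column names are illustrative:
-- Unload one folder per region with a meaningful file name prefix
COPY INTO @my_stage/exports/orders_
FROM sales.orders
PARTITION BY ('region=' || region)
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
HEADER = TRUE;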
# Was there a scenario in your project where you had to unload the data to an internal stage?
Requests for data from Snowflake are frequently received from business users, either for data analysis or for dealing with data quality issues. The way I provide the data is to unload it into an internal stage and use the GET command to download it; I use the SINGLE = TRUE option to ensure that the file does not get split into multiple files, as sketched below.
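A sketch of that flow, assuming an internal named stage called @adhoc_stage and an illustrative filter:
-- Unload the result set as a single uncompressed file to an internal stage
COPY INTO @adhoc_stage/customer_extract.csv
FROM (SELECT * FROM customer WHERE region = 'EAST')
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'NONE')
SINGLE = TRUE;
-- Download the file from SnowSQL on the user's machine
GET @adhoc_stage/customer_extract.csv file:///tmp/;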
# Do you prefer unloading data into multiple files or a single file from snowflake?
To take advantage of the parallelism provided by Snowflake and get the files faster, I prefer multiple files; this is the default behavior. There is also a 5 GB per-file limit when unloading to cloud storage, so if your output is bigger you need to generate multiple files.
The COPY statement supports the full SELECT syntax of Snowflake SQL, so you can join tables in the COPY command and unload data from more than one table.
The default compression when unloading data from Snowflake in Parquet format is Snappy; we can also use LZO if we wish, but we need to provide the compression type explicitly if we don't want the default. For CSV and JSON files the default compression is gzip, and the other supported compression types are bzip2, Brotli and Zstandard.
When using the COPY command we can cast columns to modify data types when unloading data in Parquet format, as sketched below.
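A sketch combining an explicit compression choice with a cast in the unload query; the stage, table and columns are illustrative:
-- Cast the amount column and unload as LZO-compressed Parquet instead of the default Snappy
COPY INTO @my_stage/orders/
FROM (SELECT order_id, amount::VARCHAR AS amount FROM sales.orders)
FILE_FORMAT = (TYPE = 'PARQUET' COMPRESSION = 'LZO');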
Load failures due to files already being present in the folder are a challenge we have encountered; we resolved that by creating a Lambda function (or Azure Function) that copies each file to a different folder and adds a timestamp to the file name.
When unloading data in CSV format we observed that Snowflake truncates decimal columns to a precision/scale of (15,9). We overcame this problem by casting the data to string before unloading.
# How can you limit the size of a file generated by unloading data from snowflake?
We use the MAX_FILE_SIZE parameter, specified in bytes, to limit the size of the files. Generated files are typically around 16 MB by default; you can make them smaller (or larger) by setting MAX_FILE_SIZE.
COPY INTO @MYDB.MYSTAGES.UNLOAD_OUTPUT
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
MAX_FILE_SIZE = 10000000;
Virtual Warehouse
# What is the maximum number of clusters you can have in a multi-cluster warehouse? 10 is the maximum number of clusters a multi-cluster warehouse can have.
It depends on the warehouse and what jobs the warehouse is designed to process. If the warehouse is designed for end-user consumption in Snowsight, we use the Economy scaling policy. For warehouses that run ETL jobs the scaling policy is Standard, and we also use the Standard policy for warehouses used for reporting via Tableau and Power BI.
Scaling up means increasing the compute and memory capacity of a warehouse; we scale up by resizing the warehouse from its current size to a larger one, and we do it when a job needs more memory and processing power.
Scaling out means increasing the number of clusters in a multi-cluster warehouse; it is recommended when a large number of queries are waiting to be picked up.
# What is the auto suspension time period you use on the warehouse used in your project?
For ETL jobs and tasks we auto-suspend the warehouse immediately. For end-user warehouses we have kept the default auto-suspend period of 10 minutes.
# Can you re-size a warehouse while it is running?
Yes, you can resize a warehouse while it is running, but if you resize it to make it bigger the currently running queries will not be able to take advantage of the larger size; only queued and new queries will.
# What is zero copy cloning, what are its advantages over CREATE TABLE AS SELECT?
Zero copy cloning is a metadata operation which creates an exact copy of the object being cloned. Because it is a metadata operation, the data is not replicated when a table, schema or database is cloned; the clone simply points to the same underlying data.
Advantages:
• It is faster to create
• It does not consume additional storage until the clone or the original is modified
# When cloning a table do the grants get copied over to roles that have access to the original table?
The grants do not get copied over when we clone a table; we need to use the COPY GRANTS clause along with the clone to ensure that the privileges get copied over.
When we clone a table, all the properties including comments, cluster keys and data get copied over; the only thing that does not get copied over is grants.
(SCENARIO QUESTION) Zero copy cloning can be used to back up tables that are needed beyond the time travel retention period. It can also be used to create new DEV/QA environments or to fully refresh data in those environments.
# Let us say that data in your production table was accidentally updated, how will you recover it if you have time travel available?
• Create a clone of the prod table with the help of time travel, at a point before the data was updated:
CREATE TABLE MYDB.PUBLIC.CLONE_CUSTOMER CLONE MYDB.PUBLIC.CUSTOMER
  AT (TIMESTAMP => TO_TIMESTAMP_TZ('04/05/2024 01:02:03', 'mm/dd/yyyy hh24:mi:ss'));
• Validate the data to verify that the cloned data is correct.
• Restore the validated data back into the production table, for example by swapping the clone with the original (ALTER TABLE ... SWAP WITH ...).
# You have a table in a snowflake prod account and you want to clone it and bring it to a snowflake database in a different account (QA), can you use zero copy cloning to do that?
Zero copy cloning works only across databases inside a single account; we cannot do zero copy cloning across accounts.
# If zero copy cloning does not work across accounts, how can you bring the data from one account to another?
• We can use shares to share the data between the accounts and then create a new table from the shared data.
• We can unload the data from Snowflake into cloud storage and then use the COPY command to load it into a table in the other account.
• We can use Python to copy data from one Snowflake account to another.
• Database replication.
# If I clone a table and update the clone, will the original table also get updated?
No. When the clone is created, both tables reference the same data files, but whenever there is an update or insert, new micro-partitions are created to store the changed data. Hence the two tables will diverge in what they return once either table is modified.
Loading Data
# Do you use transformations when loading data using the COPY command?
We generally don't use transformations when loading data into the RAW tables with the COPY command, because transformations tend to slow down the loads of large files. We prefer to bring the data into Snowflake and then do the necessary transformation in SQL.
Exception: in some cases we may receive very large files with over 1000 columns, and in this situation we may choose to load only a subset of fields into the target raw table. In that scenario we need to use a SELECT in the COPY command and may need to apply transformations.
# What is needed to establish connectivity between snowflake and cloud storage (S3 and Azure Blob Storage)?
A storage integration object helps us establish connectivity between cloud storage and Snowflake, and we can then use the COPY command to copy data from S3 into Snowflake tables, as sketched below.
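A sketch for S3, assuming an IAM role ARN and bucket created for this purpose; all names and the ARN are illustrative:
-- Storage integration that delegates authentication to an IAM role
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/landing/');
-- Stage that uses the integration; COPY commands then reference this stage
CREATE STAGE landing_stage
  URL = 's3://my-bucket/landing/'
  STORAGE_INTEGRATION = s3_int;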
# How can we ingest data without using snowpipe, COPY or ETL tools like DBT?
We can ingest data into Snowflake by creating external tables over cloud storage folders; this gives us SELECT access to the data files, and we can then run INSERT INTO target_table ... SELECT ... FROM the external table (see the sketch below). This method is very useful when you have large files, greater than around 250 MB, which are not well suited to loading with Snowpipe.
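A sketch of that pattern over a Parquet folder in an external stage; the stage, table and column names are illustrative:
-- External table exposing the staged files as rows (a single VARIANT column named VALUE)
CREATE EXTERNAL TABLE raw.ext_orders
  WITH LOCATION = @landing_stage/orders/
  FILE_FORMAT = (TYPE = 'PARQUET');
-- Pull the data into a regular Snowflake table
INSERT INTO raw.orders
SELECT value:order_id::NUMBER, value:amount::NUMBER
FROM raw.ext_orders;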
# You received files from a new vendor and you are aware of the file layout because they shared the details in the past. A target table has been created with that structure. Before loading the file into the target table, how will you validate that the file will load without any issues?
We can use VALIDATION_MODE = RETURN_ALL_ERRORS in the COPY command to see any errors the file would encounter when loaded into the target table. When the COPY command is run in validation mode, no data is loaded into the target table (see the sketch below).
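A sketch of the validation run; the table, stage and file format settings are illustrative:
-- Returns every error the load would hit, without loading any rows
COPY INTO raw.orders
FROM @landing_stage/orders/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
VALIDATION_MODE = RETURN_ALL_ERRORS;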
# How will you set up a process of automated validation and quality control of all files received from your source system before loading them to snowflake?
When running jobs on a regular basis, we can validate files by using Python to read the files from S3, check that the important fields and values are populated, and throw an error if the validation fails.
The second option is to ingest the data into Snowflake raw tables and run SQL to validate the file before loading it further.
We can also use the ON_ERROR = SKIP_FILE copy option so that files with load errors are skipped during the load.
# Once a file is loaded into a table with the COPY command, can you reload the file again?
If you try to reload the file, Snowflake will ignore it. Snowflake stores file-load metadata associated with the table, knows the file is already loaded, and stops the reload to prevent duplicates. This metadata is reset when the table is truncated, or dropped and re-created.
# How can you reload the file, even if it is already loaded?
I will use the FORCE option in the COPY command to force-load the files, as sketched below. Use case: when you need a large volume of data for performance testing.
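A sketch of the forced reload; the table, stage and file format are illustrative:
-- Reload files even though the load metadata says they were already loaded
COPY INTO perf_test.orders
FROM @landing_stage/orders/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
FORCE = TRUE;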
We use Tasks/Airflow and stored procedures to ingest data that needs to be loaded in batch mode; MERGE statements inside the stored procedure update the target table.
For tables that need to be updated in near real time we use tasks and streams: the task detects data in the stream, and the SQL statement associated with the task, generally a MERGE that uses the stream as input, updates the target table (see the sketch below).
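A minimal sketch of that stream-plus-task pattern; the stream, task, warehouse and table names are illustrative:
-- Stream that captures new rows arriving in the raw table
CREATE STREAM raw.orders_stream ON TABLE raw.orders;
-- Task that runs only when the stream has data and merges it into the target
CREATE TASK merge_orders
  WAREHOUSE = ETL_WH
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw.orders_stream')
AS
  MERGE INTO final.orders t
  USING raw.orders_stream s ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);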
We use Snowpipe for loading files that need to be ingested immediately and are less than about 250 MB; IoT data is ingested in this way. We also use external tables to ingest files that are over 1 GB in size, using a SELECT statement to get the data into Snowflake. Streams on external tables are used when files need to be ingested without any delay.
# When ingesting data from the data lake, are most of your files in parquet format or csv, and what is the reason for that?
Data in S3/Azure Blob Storage is stored in Parquet format; this is our enterprise standard for the data lake. Parquet helps conserve space and stores data in columnar format, which is more efficient when we select a subset of columns from Athena/Azure Data Lake Analytics.
While CSV files tend to load into Snowflake faster than Parquet files with the COPY command, we were not interested in re-creating the Parquet files just for ingestion into Snowflake, as this would consume a lot of time and compute resources and lead to duplication of data.
UDFs do not permit DDL statements, so we need to use a stored procedure instead; also, UDFs are called from a SELECT statement and cannot be invoked independently.
• We use stored procedures in tasks because a task can have only one SQL statement.
• We use stored procedures to run the COPY command to ingest files from cloud storage.
• We use stored procedures to truncate staging tables for loads that are batch based.
Stored procedures and UDFs can be written in SQL, Python, JavaScript and Java.
# What do you use in your project to create UDFs and stored procedures?
We use SQL, mainly because it is easier to maintain.
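A minimal sketch of a SQL (Snowflake Scripting) stored procedure of the kind used to truncate staging tables; the procedure and table names are illustrative:
CREATE OR REPLACE PROCEDURE truncate_staging()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
  -- Clear the staging table before the next batch load
  TRUNCATE TABLE raw.stg_orders;
  RETURN 'staging truncated';
END;
$$;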