Snowflake Notes

The document discusses various topics related to Snowflake including: 1. What Snowflake is and its key features such as caching, clustering, data sharing, etc. 2. Snowflake's architecture including shared-disk and shared-nothing processing, and key concepts like database storage, query processing, and cloud services. 3. Methods for creating and scaling virtual warehouses in Snowflake including auto-scale and maximized modes.


*********Snowflake Topics********

1. What is Snowflake & Snowflake features?
2. Architecture & Key Concepts?
3. Virtual warehouse creation & scaling methods (Maximized, Auto scale)?
4. Caching in snowflake & Types of caching?
5. Clustering in snowflake?
6. Data sharing in snowflake (Reader Account)?
7. Time travel & Fail safe in snowflake?
8. Stages & Types of Stages and Creation of Stages?
9. File formats & File types and Options (Force, Purge, etc...)?
10. Database creation & Tables & Types of tables?
11. Clone & Swap creation in snowflake?
12. Views & Types of views and MV maintenance cost and Limitations?
13. Copy command & Options and Data loading (Local file system, Cloud)?
14. Snowpipe in snowflake?
15. Streams in snowflake & Types of streams, SCDs, CHANGES clause?
16. Scheduling-TASKS in snowflake?
17. Loading semi-structured data (JSON, XML, and Parquet)?
18. Data sampling in snowflake (Row, Block, Clone)?
19. Performance tuning?
20. AWS s3 account creation (S3 bucket, Policy, Role) & Creating Integration.
21. Snow CLI (SnowSQL), AWS CLI?
1. What is Snowflake & Snowflake features?

 Snowflake: Snowflake is an analytic data warehouse delivered as Software-as-a-Service (SaaS).
SaaS:
 There is no hardware or software (virtual or physical) to install, configure, or manage.
 Ongoing maintenance, management, and tuning are handled by Snowflake (software installation and updates).
 Snowflake runs completely on cloud infrastructure and cannot be run on private cloud infrastructure (on-premises or hosted).
Snowflake Features:
1. Caching
2. Clustering
3. Data sharing
4. Time travel & Fail safe
5. Clone & Swap
6. Snowpipe
7. Streams & CHANGES clause
8. Scheduling (Tasks)

2. Architecture & Key Concepts?


  Snowflake Architecture
 Snowflake’s architecture is a hybrid of shared-disk and shared-nothing architectures.
 Shared-disk: Snowflake uses a central data repository for persisted data that is accessible from all compute nodes.
 Shared-nothing processing: Snowflake processes queries using MPP (massively parallel processing) compute clusters, where each node in the cluster stores a portion of the entire data set locally.

 Key Concepts
1. Database Storage
2. Query Processing
3. Cloud Services
 

 Database Storage: Snowflake stores loaded data in an optimized, compressed, columnar format in cloud storage.
The data objects stored by Snowflake are not directly visible or accessible by customers; they can only be accessed through SQL query operations run using Snowflake.
 Query Processing: Snowflake processes queries using “virtual warehouses”.
Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses.
 Cloud Services: the cloud services layer is a collection of services that coordinates user requests, from login to query dispatch, using the services below:
 Authentication
 Infrastructure management
 Metadata management
 Query parsing and optimization
 Access control
3. Virtual warehouse creation & scaling methods (Maximized, Auto scale)?
CREATE OR REPLACE WAREHOUSE KNOW_ARCHITECTURE_1 WITH

WAREHOUSE_SIZE='X-SMALL'

AUTO_SUSPEND = 180

AUTO_RESUME = TRUE
INITIALLY_SUSPENDED=TRUE;

AUTO-SCALE MODE: this mode is enabled by specifying different values for the maximum and minimum number of clusters. In this mode, Snowflake starts and stops clusters as needed to dynamically manage the load on the warehouse.
MAXIMIZED MODE: this mode is enabled by specifying the same value for both the maximum and minimum number of clusters. In this mode, when the warehouse is started, Snowflake starts all the clusters so that maximum resources are available while the warehouse is running.
Scale Up: increasing the warehouse size, e.g. XS to S.
Scale Down: decreasing the warehouse size, e.g. S to XS.
Scale Out: adding clusters/servers, e.g. from 2 to 3.
Scale In: removing clusters/servers, e.g. from 3 to 2.
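A minimal sketch of the two modes (warehouse names are placeholders; multi-cluster warehouses require Enterprise Edition or above):

-- Auto-scale mode: different MIN and MAX cluster counts.
CREATE OR REPLACE WAREHOUSE WH_AUTOSCALE WITH
WAREHOUSE_SIZE = 'X-SMALL'
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3
SCALING_POLICY = 'STANDARD'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;

-- Maximized mode: same value for MIN and MAX cluster counts.
CREATE OR REPLACE WAREHOUSE WH_MAXIMIZED WITH
WAREHOUSE_SIZE = 'X-SMALL'
MIN_CLUSTER_COUNT = 3
MAX_CLUSTER_COUNT = 3;

-- Scale up / scale down: change the warehouse size.
ALTER WAREHOUSE WH_AUTOSCALE SET WAREHOUSE_SIZE = 'SMALL';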
1. X-Small: 1 cluster, 1 server, 8 CPUs/threads
2. Small: 1 cluster, 2 servers, 16 CPUs/threads
3. Medium: 1 cluster, 4 servers, 32 CPUs/threads
4. Large: 1 cluster, 8 servers, 64 CPUs/threads
5. X-Large: 1 cluster, 16 servers, 128 CPUs/threads
6. 2X-Large: 1 cluster, 32 servers, 256 CPUs/threads
7. 3X-Large: 1 cluster, 64 servers, 512 CPUs/threads
8. 4X-Large: 1 cluster, 128 servers, 1024 CPUs/threads

Virtual Warehouse Sizes:


Snowflake Regions:

Snowflake Editions:
4. Caching in snowflake & Types of caching?
When a user submits a query, the cloud services layer takes the request and submits it to the virtual warehouse (VWH); the VWH fetches the data from database storage and returns it to the cloud services layer.
Finally, the cloud services layer returns the result to the user and stores it in the result cache.
Types of caching:
1. Result cache: query results are stored in the cloud services layer.
2. Local disk cache: data read by a query is cached on the virtual warehouse's local disk (SSD).
3. Remote disk cache: the data itself resides in the long-term cloud storage layer.

• Query Results are Re-Used Based on the below Criteria


• Caching the Query Results for result set re-use
• Retention Period – 24 Hours
• The user executing the query has the necessary access privileges for
all the tables used in the query.
• The new query syntactically matches the previously-executed query.
• The table data contributing to the query result has not changed.
• The persisted result for the previous query is still available.
• The query does not include functions that must be evaluated at
execution (e.g. CURRENT_TIMESTAMP).
• The table’s micro-partitions have not changed (e.g. been re-clustered or consolidated) due to changes to other data in the table.

• Session Parameter USE_CACHED_RESULT = true;


Alter session set USE_CACHED_RESULT = TRUE;
• Result_Scan Function for getting the Query Results from Cache 
RESULT_SCAN ( { '<query_id>' | LAST_QUERY_ID() } )
• Retrieve All the Values from first Query in the session
• select * from table(result_scan(last_query_id(1)));
• Retrieve All the Values from  second most recent query in the session
• select * from table(result_scan(last_query_id(-2)));
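Putting the criteria above together, a quick way to see result re-use in action (a sketch; the table name is a placeholder):
select c_mktsegment, count(*) from customer group by c_mktsegment;  -- first run scans the table
select c_mktsegment, count(*) from customer group by c_mktsegment;  -- identical re-run within 24 hours is answered from the result cache
alter session set use_cached_result = false;  -- turn off re-use when benchmarking warehouse performance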

5. Clustering in snowflake?
Clustering: clustering co-locates similar rows in the same micro-partitions.
In Snowflake, table data is stored as micro-partitions in database storage. If a clustering key is added after the data has already been stored, Snowflake has to re-cluster the existing micro-partitions, which incurs processing cost. It is often better to add an ORDER BY clause while loading data into the table.

Select Name, Country From employee Where date='11/2'

This query gets submitted to the cloud services layer.
The cloud services layer performs query optimization, creates an execution plan, and submits this plan to all virtual warehouse nodes.
The nodes first download the HEADER from all table files.
Based on the metadata information in the HEADER, the relevant table files (micro-partitions) are scanned.
/** CREATE TABLE WITH CLUSTERING **/
CREATE TABLE EMPLOYEE (TYPE STRING, NAME STRING, COUNTRY STRING, DATE DATE) CLUSTER BY (DATE);

/** IF YOU HAVE ALREADY LOADED DATA **/


ALTER TABLE EMPLOYEE CLUSTER BY (DATE);
/****** Cardinality of columns **********/
-- C_ADDRESS: very high cardinality (~238,609,294 distinct values out of 1,500,000,000 rows), a poor clustering key.
SELECT DISTINCT C_ADDRESS FROM SAMPLE_DATABASE.PUBLIC.CUSTOMER_NOCLUSTER;
SELECT 238609294/1500000000;
-- C_MKTSEGMENT: very low cardinality (5 distinct values), a good clustering key candidate.
SELECT DISTINCT C_MKTSEGMENT FROM SAMPLE_DATABASE.PUBLIC.CUSTOMER_NOCLUSTER;
SELECT 5/1500000000;
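To judge how well a table is clustered on a candidate key, SYSTEM$CLUSTERING_INFORMATION can be checked (a sketch against the sample table above):
select system$clustering_information('SAMPLE_DATABASE.PUBLIC.CUSTOMER_NOCLUSTER', '(C_MKTSEGMENT)');
-- Returns JSON with total_partition_count, average_overlaps and average_depth; lower depth generally means better pruning.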
/**** Improve performance without applying clustering **************/

CREATE OR REPLACE TRANSIENT TABLE


DEMO_DB.PUBLIC.CUSTOMER_ORDERBY
AS
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.CUSTOMER
ORDER BY C_MKTSEGMENT

6. Data sharing in snowflake (Reader Account)?

Data sharing enables sharing selected objects in a database in your account with other Snowflake accounts.
Here we share table metadata from one Snowflake account to another.
The consumer connects to the provider's cloud services layer and accesses the table data using their own virtual warehouse.
Note: the data is not copied from one account to another; only metadata is shared through the cloud services layer.

Providers: a data provider is any Snowflake account that creates shares and makes them available to other Snowflake accounts to consume.
Consumers: a data consumer is any account that chooses to create a database from a share made available by a data provider.
Reader Account: used to share data with a consumer who does not already have a Snowflake account and/or is not ready to become a licensed Snowflake customer.
• Sharing Data
• Web UI – Inbound , Outbound
• Outbound Shares (Providers)
• Shares that you have created and you have access to
• Create a Share
• Edit Shares
• Inbound Shares (Consumers)
• View Shares from providers that you are granted to
• Create a Database for the share
• Sharing Considerations
• Currently, consumer accounts must be in the same Snowflake
Region as your account; i.e. you can only share with other accounts
in your Snowflake Region.
• A share can include data from multiple databases.
• For data security and privacy reasons, only secure views are
supported in shares at this time. If a standard view is added to a
share, Snowflake returns an error.
• Adding accounts to a share immediately makes the share available
to consume by the accounts.
• New and modified rows in tables in a share (or in tables referenced
by a view in a share) are available immediately to all consumers who
have created a database from the share. Keep this in mind when
updating these tables.
• A new object created in a database in a share is not automatically
available to consumers.
• General Limitations for Shared Databases
• Shared databases are read-only. Users in a consumer account can
view/query data, but cannot insert or update data, or create any
objects in the database.
• The following actions are not supported:
• Creating a clone of a shared database or any schemas/tables
in the database.
• Time Travel for a shared database or any schemas/tables in
the database.
• Editing the comments for a shared database
Share DDLs
• Create Share
• Alter Share
• Drop Share
• Describe Share
• Show Shares
• Grant <Privileges> to Share
• Revoke <Privileges> from Share
• Show Grants to Share
• Show Grants of Share
create database reader_sales;
create table customer as
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.CUSTOMER;
create share sales_s;

grant usage on database reader_sales to share sales_s;


grant usage on schema reader_sales.public to share sales_s;
grant select on table reader_sales.public.customer to share sales_s;

grant usage on database reader_sales to role public;


grant usage on schema reader_sales.public to role public;
grant select on table reader_sales.public.customer to role public;
create or replace secure view sales.public.customer_data
as select C_NAME, C_MKTSEGMENT, C_ACCTBAL
from sales.public.customer;
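On the consumer side a database is created from the share; for a consumer without their own Snowflake account, the provider can create a reader (managed) account. A minimal sketch (account identifiers and the password are placeholders):

-- Consumer side: create a database from the share.
create database sales_from_share from share <provider_account>.sales_s;

-- Provider side: create a reader account for a consumer without a Snowflake account.
create managed account reader_acct
admin_name = reader_admin, admin_password = '<strong_password>', type = reader;

-- Add the consumer / reader account to the share.
alter share sales_s add accounts = <consumer_account>;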

7. Time travel & Fail safe in snowflake?

Time Travel: Snowflake Time Travel enables access to historical data (modified, dropped, or deleted).
• Objects that were deleted or dropped can be restored.
Data Retention Period: the standard retention period is 1 day (24 hours).
--- Set retention time
create or replace table emp_retention (
first_name varchar(10) ,
last_name varchar(10) ,
email varchar(10) ,
streetaddress varchar(100) ,
city varchar(100) ,
start_date string
);
copy into emp_retention
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
pattern = '.*employees0[1-5].csv'
ON_ERROR='CONTINUE';
-- Check the retention period.
SHOW TABLES LIKE 'emp_retention' in sfnowflake_tutorial.public;
-- Disable the retention period (set it to 0 days).
ALTER TABLE emp_retention SET DATA_RETENTION_TIME_IN_DAYS = 0;
Examples :
select * from my_table at(timestamp => 'Mon, 01 May 2015 16:20:00 -
0700'::timestamp);
select * from my_table at(offset => -60*5);
select * from my_table before(statement => '8e5d0ca9-005e-44e6-b858-
a8f5b37c5726’);
create table restored_table clone my_table at(timestamp => 'Mon, 09 May 2015
01:01:00 +0300'::timestamp);
create schema restored_schema clone my_schema at(offset => -3600);
create database restored_db clone my_db before(statement => '8e5d0ca9-005e-
44e6-b858-a8f5b37c5726')
When data is deleted or dropped, Snowflake only removes the metadata pointers. When we use Time Travel, Snowflake re-attaches those partition files through the metadata again.
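Dropped objects can likewise be restored within the retention period using UNDROP, for example:
drop table emp_retention;
undrop table emp_retention;   -- works as long as the retention period has not expired
-- undrop schema <schema_name>;  undrop database <database_name>;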

Fail-safe:
Fail-safe provides a 7-day period during which historical data is recoverable by Snowflake.
Fail-safe does not provide us access to historical data after the Time Travel retention period has ended.
It is for use only by Snowflake, to recover data that may have been lost or damaged due to extreme operational failures.

Why Fail-safe instead of backup?

• Data corruption or loss can happen with any database management system.
• Reloading lost data ourselves would take significant time.
• Fail-safe provides an efficient and cost-effective alternative to backup that eliminates that remaining risk.

8. Stages & Types of Stages and Creation of Stages?


Stage: a stage in Snowflake is a blob storage location where you place your raw files before loading them into Snowflake tables.
-- Create external stage (credentials below are placeholders; never put real keys in shared notes).
create or replace stage control_db.external_stages.my_ext_stage
url='s3://snowflake067/test/'
credentials=(aws_key_id='<your_aws_key_id>'
aws_secret_key='<your_aws_secret_key>');
DESC STAGE control_db.external_stages.my_ext_stage;
alter stage control_db.external_stages.my_ext_stage set
credentials=(aws_key_id='d4c3b2a1' aws_secret_key='z9y8x7w6');
-- Create internal stage
create or replace stage control_db.internal_stages.my_int_stage;
DESC STAGE control_db.internal_stages.my_int_stage;
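Files are put into an internal stage with the PUT command (run from SnowSQL, since PUT is not supported in the web UI worksheet), and the contents of any stage can be listed; the local path below is a placeholder:
put file:///tmp/employees01.csv @control_db.internal_stages.my_int_stage auto_compress=true;
list @control_db.internal_stages.my_int_stage;
list @control_db.external_stages.my_ext_stage;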
9. File formats & File types and Options (Force, Purge, etc...)?
Create file format:
create or replace file format control_db.file_formats.my_csv_format
type = csv field_delimiter = ',' skip_header = 1 null_if = ('NULL', 'null')
empty_field_as_null = true compression = gzip;
Types and Options: refer to the COPY command topic below.
1. Purge
2. Overwrite
3. Force

10. Database creation & Tables & Types of tables?


Database:
create or replace database reader_sales;
create or replace transient database control_db;

Permanent Table:
CREATE OR REPLACE TABLE READER_SALES.PUBLIC.CUSTOMER
AS
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.CUSTOMER;
TRANSIENT:
CREATE OR REPLACE TRANSIENT TABLE READER_SALES.PUBLIC.CUSTOMER
AS
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.CUSTOMER;
TEMPORARY:
CREATE OR REPLACE TEMPORARY TABLE
READER_SALES.PUBLIC.CUSTOMER_TEMP
AS
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.CUSTOMER;
11. Clone & Swap creation in snowflake?
Clone: cloning takes a metadata-only copy of the table; the new metadata points to the same storage files in the backend (zero-copy cloning). After cloning, the two tables are independent of each other.
If data is added to either table, the new files are stored separately (in S3 storage).

-- CLONE TABLE
CREATE OR REPLACE TRANSIENT TABLE EMP_CLONE CLONE EMP;
SELECT * FROM EMP_CLONE;
-- Changes in EMP_CLONE should not affect the EMP table.
DELETE FROM EMP_CLONE WHERE FIRST_NAME='Arlene';
SELECT * FROM EMP_CLONE WHERE FIRST_NAME='Violette';
--- Check the clone IDs: EMP and EMP_CLONE share the same CLONE_GROUP_ID, but each table has its own unique ID.
SELECT * FROM INFORMATION_SCHEMA.TABLE_STORAGE_METRICS WHERE
TABLE_NAME LIKE 'EMP'
AND TABLE_CATALOG='SFNOWFLAKE_TUTORIAL' AND TABLE_DROPPED IS NULL;
SELECT * FROM INFORMATION_SCHEMA.TABLE_STORAGE_METRICS WHERE
TABLE_NAME LIKE 'EMP_CLONE'
AND TABLE_CATALOG='SFNOWFLAKE_TUTORIAL' AND TABLE_DROPPED IS NULL;
Swap: swapping exchanges the metadata of two tables. After swapping emp_dev with emp_prod, the records that were loaded into emp_dev are served by emp_prod (and emp_prod's original data is now under emp_dev).

---- Table swap


create or replace table emp_dev (
first_name string ,
last_name string ,
email string ,
streetaddress string ,
city string ,
start_date date
);
copy into emp_dev
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
pattern = '.*employees0[1-5].csv'
ON_ERROR='CONTINUE';
create or replace table emp_prod (
first_name string ,
last_name string ,
email string ,
streetaddress string ,
city string ,
start_date date
);
copy into emp_prod
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
pattern = '.*employees0[1-2].csv'
ON_ERROR='CONTINUE';
ALTER TABLE emp_prod SWAP WITH emp_dev;
select * from emp_prod;
select * from emp_dev;
12. Views & Types of views and MV maintenance cost and Limitations?
• Views
• Regular View - Non-Materialized Views
• Secured Views
• Materialized Views
/********* How data might get exposed in normal view ***********/
create or replace view sales.public.customer_data_normal
as select C_NAME, C_MKTSEGMENT, C_ACCTBAL
from sales.public.customer where c_mktsegment='AUTOMOBILE';

/***************Secure view normal view difference **************/


create or replace secure view sales.public.customer_data_secure
as select C_NAME, C_MKTSEGMENT, C_ACCTBAL
from sales.public.customer;
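One way to see the difference: the definition of a normal view is visible to any role that can use it, while the definition of a secure view is hidden from non-owner roles. A sketch:
select get_ddl('view', 'sales.public.customer_data_normal');   -- view text is visible
show views like 'customer_data%' in schema sales.public;       -- for the secure view, the text column is empty for non-owners and is_secure = true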
/***************materialize view**************/
create or replace materialized view call_center_M_view
as
select * from
CALL_CENTER;
select * from table(information_schema.materialized_view_refresh_history());
SHOW MATERIALIZED VIEWS LIKE 'call_center_M_view'; -- observe the behind_by column
select 60+59; -- =119
select * from call_center_M_view; -- still 59 records until the refresh catches up
select * from table(information_schema.materialized_view_refresh_history()); --- the background refresh will start
13. Copy command & Options and Data loading (Local file system, Cloud)?
COPY: the COPY command is used to load staged files into an existing table in Snowflake.
-- Create table
create or replace table emp (
first_name string ,
last_name string ,
email string ,
streetaddress string ,
city string ,
start_date date
);

-- We will be copying the data where we have error values for date columns.
copy into emp
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
--pattern = '.*employees0[1-5].csv'
validation_mode = 'RETURN_ERRORS';

-- Recreate the table with start_date as string.


create or replace table emp (
first_name string ,
last_name string ,
email string ,
streetaddress string ,
city string ,
start_date string
);

-- We will change the data type of date column and try to load it again.
copy into emp
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
--pattern = '.*employees0[1-5].csv'
validation_mode = 'RETURN_ERRORS';

File Options:
-- Create emp table.
create or replace table emp (
first_name string ,
last_name string ,
email string ,
streetaddress string ,
city string ,
start_date date
);
-- Copy data using a filter condition on the file name.
copy into emp
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
pattern = '.*employees0[1-5].csv'
-- ON_ERROR='CONTINUE'
files=('employees01.csv','employees_error_file0.csv','employees_error_file1.csv');

desc stage my_s3_stage;

-- Load only the good files: leave the error files out of the FILES list (COPY has no option to exclude files).
copy into emp
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
pattern = '.*employees0[1-5].csv'
-- ON_ERROR='CONTINUE'
files=('employees01.csv');

ON ERROR:

-- Load data into table ignoring rejected records.


copy into emp
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
--pattern = '.*employees0[1-5].csv'
ON_ERROR='CONTINUE';
-- Check for rejected records.
select * from table(validate(emp, job_id=>'018ff817-0317-f8b7-0000-
e5fd0001551e'));
-- Save rejected records to reprocess in future.
create table copy_rejects
as
select * from table(validate(emp, job_id=>'018ff817-0317-f8b7-0000-
e5fd0001551e'));
-- Lessons learned:
1. How to ignore rejected records while loading data into a table.
2. How to retrieve the rejected records afterwards.

ENFORCE_LENGTH & TRUNCATECOLUMNS: if column values are longer than the target column length, these options let us still load the data (by truncating the values instead of rejecting the rows).
ENFORCE_LENGTH = TRUE | FALSE
TRUNCATECOLUMNS = TRUE | FALSE
DESC STAGE MY_S3_STAGE;
COPY INTO EMP
FROM @MY_S3_STAGE
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"')
--PATTERN = '.*EMPLOYEES0[1-5].CSV'
ON_ERROR='CONTINUE'
--TRUNCATECOLUMNS = TRUE
ENFORCE_LENGTH = FALSE;
FORCE: with this option we can load an already-loaded file into the table again. Without it, Snowflake will not load the same file again, because it tracks loaded files in the load metadata (using an MD5 hash of the file).
-- Reload files even if they were loaded earlier.
copy into emp
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
pattern = '.*employees0[1-5].csv'
ON_ERROR='CONTINUE'
FORCE = TRUE;
PURGE: once a file has been loaded into the table, it is deleted from the stage.
-- Remove files from staging area.
copy into emp
from @my_s3_stage
file_format = (type = csv field_optionally_enclosed_by='"')
pattern = '.*employees0[1-5].csv'
ON_ERROR='CONTINUE'
PURGE = TRUE;
Load History: gives the COPY command history (which files were loaded) for up to 14 days, but returns at most 10,000 records.
SELECT * FROM INFORMATION_SCHEMA.LOAD_HISTORY
ORDER BY LAST_LOAD_TIME DESC;
SELECT * FROM INFORMATION_SCHEMA.LOAD_HISTORY
WHERE SCHEMA_NAME='PUBLIC' AND
TABLE_NAME='EMP';
Copy History: to overcome the 10,000-record limit of LOAD_HISTORY, we can use the COPY_HISTORY table function.
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(TABLE_NAME=>'EMP'));
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(TABLE_NAME=>'EMP',
START_TIME=> DATEADD(HOURS, -5, CURRENT_TIMESTAMP())))
WHERE ERROR_COUNT >0;
Difference between LOAD_HISTORY & COPY_HISTORY: LOAD_HISTORY is an Information Schema view limited to 10,000 rows and does not include files loaded through Snowpipe, while COPY_HISTORY is a table function that covers both bulk COPY loads and Snowpipe loads.

14. Snowpipe in snowflake?
Snowpipe: continuous data loading (streaming data) into Snowflake.
Once a file lands in the S3 bucket, AWS S3 notifies the Snowpipe service; the pipe's COPY command then executes and loads the data from the S3 bucket into the target table.
Amazon SQS (ADD SQS QUEUE ARN): Amazon SQS is a message queue service used by distributed applications to exchange messages through a polling model; it can be used to decouple sending and receiving components.
Event: open the bucket properties in S3 and create an event notification pointing to the pipe's SQS queue ARN (shown in the notification_channel column of SHOW PIPES, e.g. show pipes like '<pipe_name>';).

-- Create a pipe to ingest csv data


create or replace pipe snowpipe.public.snowpipe auto_ingest=true as
copy into snowpipe.public.emp_snowpipe
from @snowpipe.public.snow_stage
file_format = (type = 'csv');

alter pipe snowpipe refresh;


show pipes;

Note1: If there is an error snow pipe will not give you any notification. It will
remain silent. It's your responsibility to check for errors.
-- Validate
select SYSTEM$PIPE_STATUS('snowpipe');

select * from table(validate_pipe_load(


pipe_name=>'DEMO_DB.PUBLIC.snowpipe',
start_time=>dateadd(hour, -4, current_timestamp())));

select * from
table(information_schema.copy_history(table_name=>'emp_snowpipe',
start_time=> dateadd(hours, -1, current_timestamp())));

select * from table(information_schema.query_history())


order by start_time desc;
Note2: you cannot alter the COPY statement inside a pipe; you can only recreate the pipe. Recreating the pipe does not change its notification channel, and it does not drop the pipe's load metadata (files already sent). Snowpipe de-duplicates loads based on the file name, whereas the COPY command de-duplicates based on the file content (MD5 hash).
create or replace pipe demo_db.public.snowpipe auto_ingest=true as
copy into demo_db.public.emp_snowpipe
from @demo_db.public.snow_stage
file_format = (type = 'csv', FIELD_OPTIONALLY_ENCLOSED_BY='"');

alter pipe snowpipe refresh;

--arn:aws:sqs:us-east-1:628993367716:sf-snowpipe-AIDAZE4XND2SCZEJXXYBR-
gmNjf-iFPpApjNiM7VwsSQ
Integration: instead of putting the AWS key and secret key directly in the stage/COPY command (where others can see them), we can create a storage integration; future credential changes are then made in one place instead of changing every COPY command.
create or replace storage integration s3_int
type = external_stage
storage_provider = s3
enabled = true
storage_aws_role_arn = 'arn:aws:iam::579834220952:role/snowflake_role'
storage_allowed_locations = ('s3://hartfordstar/');

alter storage integration s3_int
set STORAGE_ALLOWED_LOCATIONS =
('s3://hartfordstar/','s3://hartfordstar/snowpipe/','s3://hartfordstar/snowpipe2/');
create or replace file format my_csv_s3_format
type = csv field_delimiter = ',' skip_header = 0 null_if = ('NULL', 'null')
empty_field_as_null = true FIELD_OPTIONALLY_ENCLOSED_BY='"';

create or replace stage snow_stage


storage_integration = s3_int
url = 's3://hartfordstar/snowpipe'
file_format = my_csv_s3_format;
15. Streams in snowflake & Types of streams, SCDs, CHANGES clause?
Stream: a stream is an object created on top of a table; it maintains offsets that record any DML operations performed on the table.
A stream itself does not contain any table data. It only stores the offset for the source table and returns CDC (Change Data Capture) records by leveraging the versioning history of the source table.
• Stream Columns –
• METADATA$ACTION
• METADATA$ISUPDATE
• METADATA$ROW_ID
• Data Retention Period and Staleness
• If a stream is created on a table, the data retention period is 14 days
• Additional Storage Charges are applied.
Types of streams:
-- Create a standard stream on the source table.
create or replace stream delta_s on table t;

-- Create an append-only stream on the source table.


create or replace stream append_only_s on table t append_only=true;
-- Create consumer table
create or replace
table target_t(id int, name string, stream_type string default null, rec_version
number default 0,REC_DATE TIMESTAMP_LTZ);
-- Consume the delta_s stream data into the target table target_t.
merge into target_t t
using delta_s s
on t.id=s.id and (metadata$action='DELETE')
when matched and metadata$isupdate='FALSE' then update set
rec_version=9999, stream_type='DELETE'
when matched and metadata$isupdate='TRUE' then update set
rec_version=rec_version-1
when not matched then insert (id,name,stream_type,rec_version,REC_DATE)
values(s.id, s.name, metadata$action, 0, CURRENT_TIMESTAMP());
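Querying the stream directly, before the MERGE above consumes it, shows the metadata columns listed earlier; a sketch assuming the source table t has (id, name) columns:
insert into t values (10, 'New row');
update t set name = 'Renamed row' where id = 10;
select * from delta_s;  -- CDC rows with METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID
-- Note: a DML statement that reads the stream (such as the MERGE above) advances the offset and empties it.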
CHANGES Clause: using the CHANGES clause we can track changes on a table over a particular period, similar to a stream, without creating a stream object (change tracking must be enabled on the table).
-- Enable change tracking on the table.
alter table t1 set change_tracking = true;
-- Initialize a session variable for the current timestamp.
set ts1 = (select current_timestamp());
-- Query the change tracking metadata in the table during the interval from $ts1
to the current time.
-- Return the full delta of the changes.
select *
from t1
changes(information => default)
at(timestamp => $ts1);
SCD: the stream-consuming MERGE above can be wrapped in a task to keep a slowly changing dimension up to date.
-- Task that runs the MERGE whenever the stream has data.
CREATE TASK tgt_merge
WAREHOUSE = compute_wh
SCHEDULE = '1 minute'
WHEN
SYSTEM$STREAM_HAS_DATA('delta_s')
AS
merge into target_t t
using delta_s s
on t.id=s.id and (metadata$action='DELETE')
when matched and metadata$isupdate='FALSE' then update set
rec_version=9999, stream_type='DELETE'
when matched and metadata$isupdate='TRUE' then update set
rec_version=rec_version-1
when not matched then insert (id,name,stream_type,rec_version,REC_DATE)
values(s.id, s.name, metadata$action,0,CURRENT_TIMESTAMP() )
16. Scheduling-TASKS in snowflake?
Task: a task executes a single SQL/DML statement or calls a stored procedure (which can wrap multiple DML statements) on a schedule.

Task Dependency: a task can be triggered by only one parent task; this is currently a constraint in Snowflake (see the child-task sketch at the end of this section).
-- Creating task.
CREATE OR REPLACE TASK mytask_minute
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
AS
INSERT INTO mytable(ts) VALUES(CURRENT_TIMESTAMP);

CREATE TASK mytask_hour


WAREHOUSE = mywh
SCHEDULE = 'USING CRON 0 9-17 * * SUN America/Los_Angeles'
TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD HH24'
AS
INSERT INTO mytable(ts) VALUES(CURRENT_TIMESTAMP);
-- SCHEDULE AT A SPECIFIC TIME (cron fields):
-- # __________ minute (0-59)
-- # | ________ hour (0-23)
-- # | | ______ day of month (1-31, or L)
-- # | | | ____ month (1-12, JAN-DEC)
-- # | | | | __ day of week (0-6, SUN-SAT, or L)
-- # | | | | |
--   * * * * *
-- Check scheduled tasks.
SHOW TASKS;
-- Put the task into the schedule / take it out.
alter task mytask_minute resume;
alter task mytask_minute suspend;
-- Check task history.
select *
from table(information_schema.task_history())
order by scheduled_time;
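A child task chained to a parent (the one-parent dependency mentioned above) might look like this sketch; the task and table names are examples:

-- Child task: no SCHEDULE of its own, runs after the parent completes.
CREATE OR REPLACE TASK mytask_child
WAREHOUSE = COMPUTE_WH
AFTER mytask_minute
AS
INSERT INTO mytable(ts) VALUES(CURRENT_TIMESTAMP);

-- Resume the child before the parent (children must be resumed before the root task).
ALTER TASK mytask_child RESUME;
ALTER TASK mytask_minute RESUME;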

17. Loading semi-structured data (JSON, XML, and Parquet)?

Loading semi-structured data: we can load JSON, XML, and Parquet data using the VARIANT data type.
create or replace table json_demo (v variant);
JSON:
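A minimal sketch of loading and querying JSON with the VARIANT table above (the sample document is made up):
insert into json_demo
select parse_json('{ "name": "Jay", "skills": ["SQL", "Python"], "address": { "city": "Hyderabad" } }');

select v:name::string as name,
       v:address.city::string as city,
       s.value::string as skill
from json_demo,
     lateral flatten(input => v:skills) s;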

XML:
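Similarly for XML, PARSE_XML and XMLGET can be used (a sketch; the document is made up):
insert into json_demo
select parse_xml('<employee><name>Jay</name><city>Hyderabad</city></employee>');

select xmlget(v, 'name'):"$"::string as name,
       xmlget(v, 'city'):"$"::string as city
from json_demo
where v:"@" = 'employee';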

18. Data sampling in snowflake (Row, Block, Clone)?

Data Sampling: returns a subset of rows sampled randomly from the specified table, instead of pulling the entire data set from the production database for analysis.
--Sample Block wise
CREATE OR REPLACE TRANSIENT TABLE SUPPLIER_BLOCK_ALG
AS
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.SUPPLIER sample
system (3) seed (82);
--Sample Row wise.
CREATE OR REPLACE TRANSIENT TABLE SUPPLIER_ROW_ALG
AS
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.SUPPLIER sample row
(3) seed (82);
-- CLONE sample: querying a full clone still scans all the data, so for quick analysis the SAMPLE approach above is usually better.
CREATE OR REPLACE TRANSIENT TABLE SUPPLIER_CLONE CLONE SUPPLIER;
CREATE OR REPLACE TRANSIENT TABLE SUPPLIER as select * from
SNOWFLAKE_SAMPLE_DATA.TPCH_SF10000.SUPPLIER;
19. Performance tuning?
1) Use the full/absolute path in COPY INTO ... FROM the S3 stage, rather than scanning the whole bucket.
2) Break big files (e.g. 10 GB) into smaller ~100 MB files so that loading happens in parallel and performance improves (e.g. split -b <bytes>).
3) Since Snowflake uses columnar storage, select only the required columns instead of SELECT *.
4) Create multi-cluster warehouses for parallel workloads, with auto-suspend (e.g. after 60 seconds).
5) Decide the warehouse size based on the environment.
6) Use separate warehouses for data loading and query execution.
7) Use a dedicated warehouse for data loading.
8) Enable auto-scaling.
9) Optimize INSERT statements.
10) Refrain from executing simple queries unnecessarily.
11) Enable warehouse auto-suspension.
CLUSTERING (choose keys with the lowest practical cardinality):
1) select system$clustering_information('test2', '(col1, col3)');
2) check the clustering ratio / clustering depth

SELECT e.ename, e.empno, m.ename as manager, e.mgr


FROM
emp e
LEFT OUTER JOIN emp m
ON e.mgr = m.empno
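-- The self join above lists each employee alongside their manager (emp joined to itself on mgr = empno).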

20. AWS s3 account creation (S3 bucket, Policy, Role) & Creating Integration.
AWS S3: a simple storage service that we can use to store and retrieve any amount of data, at any time, from anywhere on the web.
How to connect an S3 bucket & integration from Snowflake:
• Step 1: Create the S3 bucket / access permissions for the S3 bucket
• Step 2: Create the Snowflake-related AWS IAM role
• Step 3: Create the Snowflake cloud storage integration
• Step 4: Retrieve the AWS IAM user for your Snowflake account and update the IAM role's trust relationship with it
• Step 5: Grant the IAM user permissions to access bucket objects
• Step 6: Create an external stage
• Step 7: Load data into the table
Integration:
• Integration With S3 can be done through Below Options
• Option 1: Configuring a Snowflake Storage Integration
• Option 2: Configuring an AWS IAM Role
• Option 3: Configuring AWS IAM User Credentials

Create IAM Policy:


Create IAM Role:
AWS S3 Configuration with integration.
create or replace storage integration s3_int
type = external_stage
storage_provider = s3
enabled = true
storage_aws_role_arn = 'arn:aws:iam::***************:role/snowflake'
storage_allowed_locations = ('s3://snowflake069/employee/');

DESC INTEGRATION s3_int;


Unloading data from Snowflake to the S3 bucket:
Aws commands:
-- Copy data from local to aws s3.
aws s3 cp /GITHUB/Snowflake/Snowflake/getting-started/
s3://snowflake069/upload_cli --recursive
-- Copy data from aws s3 to local.
aws s3 cp s3://snowflake069/upload_cli
/GITHUB/Snowflake/Snowflake/getting-started_1/ --recursive
-- Command to list all files in aws s3.
aws s3 ls s3://snowflake069/ --recursive

--- AWS S3 Configuration.

create or replace storage integration s3_int


type = external_stage
storage_provider = s3
enabled = true
storage_aws_role_arn = 'arn:aws:iam::*******:role/snowflake'
storage_allowed_locations =
('s3://snowflake069/employee/','s3://bucky2018/');

DESC INTEGRATION s3_int;

create or replace file format control_db.file_formats.my_csv_format


type = csv field_delimiter = ',' skip_header = 1 null_if = ('NULL', 'null')
empty_field_as_null = true;

desc file format my_csv_format;

create or replace stage control_db.external_stages.my_s3_stage


storage_integration = s3_int
url = 's3://bucky2018/'
file_format = control_db.file_formats.my_csv_format;
-- Query data directly
select t.$1 as first_name,t.$2 last_name,t.$3 email
from @control_db.external_stages.my_s3_stage/ t
-- filter data directly
select t.$1 as first_name,t.$2 last_name,t.$3 email
from @control_db.external_stages.my_s3_stage/ t
where t.$1 in ('Di','Carson','Dana')
-- you can write join condition.
select t.$1 as first_name,t.$2 last_name,t.$3 email
from @control_db.external_stages.my_s3_stage/ t,
@control_db.external_stages.my_s3_stage/ d
where t.$1 =d.$1
-- You can also create views & Tables.
create or replace view demo_db.public.query_from_s3
as
select t.$1 as first_name,t.$2 last_name,t.$3 email
from @control_db.external_stages.my_s3_stage/ t

create or replace table demo_db.public.query_from_s3_table


as
select t.$1 as first_name,t.$2 last_name,t.$3 email
from @control_db.external_stages.my_s3_stage/ t

select * from demo_db.public.query_from_s3_table;


show tables;

select * from demo_db.public.query_from_s3


where First_name='Ivett';
select count(*) from demo_db.public.query_from_s3;

/************** LOAD DATA FROM S3 TO SNOWFLAKE ******************/


create or replace table demo_db.public.emp_ext_stage (
first_name string ,
last_name string ,
email string ,
streetaddress string ,
city string ,
start_date date
);

create or replace stage control_db.external_stages.my_s3_stage


storage_integration = s3_int
url = 's3://snowflake069/employee/'
file_format = control_db.file_formats.my_csv_format;

copy into demo_db.public.emp_ext_stage


from (select t.$1 , t.$2 , t.$3 , t.$4 , t.$5 , t.$6 from
@control_db.external_stages.my_s3_stage/ t)
--pattern = '.*employees0[1-5].csv'
on_error = 'CONTINUE';

copy into demo_db.public.emp_ext_stage


from (select metadata$filename,t.$1 , t.$2 , t.$3 , t.$4 , t.$5 , t.$6 from
@control_db.external_stages.my_s3_stage/ t)
pattern = '.*employees0[1-5].csv'
on_error = 'CONTINUE';

TRUNCATE TABLE emp_ext_stage;

SELECT * FROM emp_ext_stage

select * from table(validate(emp_ext_stage, job_id=>'01931c98-02b2-9c2c-000f-


73030001a0e2'));

/************** UNLOAD DATA FROM SNOWFLAKE TO S3 ***************/


create or replace file format my_csv_unload_format
type = csv field_delimiter = ',' skip_header = 1 null_if = ('NULL', 'null')
empty_field_as_null = true compression = gzip;

alter storage integration s3_int set
storage_allowed_locations=('s3://snowflake069/employee/','s3://snowflake069/emp_unload/','s3://snowflake069/zip_folder/');

desc integration s3_int

create or replace stage my_s3_unload_stage


storage_integration = s3_int
url = 's3://snowflake069/emp_unload/'
file_format = my_csv_unload_format;
copy into @my_s3_unload_stage
from
emp_ext_stage

copy into @my_s3_unload_stage/select_


from
(
select
first_name,
email
from
emp_ext_stage
)

copy into @my_s3_unload_stage/parquet_


from
emp_ext_stage
FILE_FORMAT=(TYPE='PARQUET' SNAPPY_COMPRESSION=TRUE)
/************* Copy zipped (compressed) files **********/
create or replace stage my_s3_zip_stage
storage_integration = s3_int
url = 's3://snowflake069/zip_folder/'
file_format = my_csv_format;

copy into emp_ext_stage


from (select t.$1 , t.$2 , t.$3 , t.$4 , t.$5 , t.$6 from @my_s3_zip_stage/ t)
--pattern = '.*employees0[1-5].csv'
on_error = 'CONTINUE';
21. Snow CLI (SnowSQL), AWS CLI?
SnowSQL CLI: using the command line we can connect to Snowflake and run SnowSQL commands:
snowsql -a <snowflake_account_id> -u <username>
(enter the password when prompted)
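A slightly fuller sketch: connecting with the context pre-selected and loading a local file through an internal stage (account, user, and paths are placeholders):
-- Connect with database, schema and warehouse already set:
snowsql -a xy12345.us-east-1 -u jay -d demo_db -s public -w compute_wh

-- Inside the SnowSQL session, upload a local file to an internal stage and load it:
put file:///tmp/employees01.csv @control_db.internal_stages.my_int_stage auto_compress=true;
copy into emp
from @control_db.internal_stages.my_int_stage
file_format = (type = csv field_optionally_enclosed_by='"');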
AWS CLI: using the command line we can work with AWS and the S3 bucket files (copying, listing, splitting files, etc.) using AWS commands.
-- Aws commands
-- Copy data from local to aws s3.
aws s3 cp /GITHUB/Snowflake/Snowflake/getting-started/
s3://snowflake069/upload_cli --recursive

-- Copy data from aws s3 to local.


aws s3 cp s3://snowflake069/upload_cli
/GITHUB/Snowflake/Snowflake/getting-started_1/ --recursive

-- Command to list all files in aws s3.


aws s3 ls s3://snowflake069/ --recursive

==================END================

These notes were prepared by Chandu.
