Snowflake Data Warehousing Overview

Snowflake is a cloud-based data warehousing platform that utilizes a multi-cluster, shared-data architecture, allowing for scalability and flexibility in data storage and processing. It features a SQL interface, automatic query optimization, and robust security measures, while separating storage and compute resources for cost management. Snowflake supports various data formats, offers data sharing capabilities, and provides tools for performance optimization and resource management.

Cloud-Based Architecture: Snowflake is built on a cloud-based architecture, which means it leverages the power of cloud computing for data warehousing, allowing for scalability and flexibility.

Data Warehousing: Snowflake is primarily used for data warehousing, which involves the storage,
processing, and analysis of large volumes of data.

Multi-Cluster, Shared-Data Architecture: Snowflake uses a multi-cluster, shared-data architecture.


Multiple compute clusters can access the same data simultaneously, which enhances performance
and concurrency.

Virtual Warehouses: In Snowflake, you create virtual warehouses to separate and manage your
compute resources. This allows you to scale your resources up or down based on demand.
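As a sketch, a virtual warehouse can be created with standard Snowflake SQL; the warehouse name and settings below are illustrative:

```sql
-- Create a small warehouse that suspends itself after 5 minutes idle
CREATE WAREHOUSE my_wh
  WITH WAREHOUSE_SIZE = 'SMALL'
       AUTO_SUSPEND = 300        -- seconds of inactivity before suspending
       AUTO_RESUME = TRUE
       INITIALLY_SUSPENDED = TRUE;
```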

Data Storage: Snowflake separates data storage from compute resources. You pay for storage
separately from compute, and this decoupling is a key feature for cost management.

SQL Interface: Snowflake uses SQL as its query language, making it accessible to SQL-savvy users.

Automatic Query Optimization: Snowflake handles query optimization automatically, so you don't
need to manually tune queries for performance.

Data Sharing: Snowflake allows for easy and secure data sharing between organizations and within
your organization. You can share data without copying it.

Data Security: Snowflake provides robust security features, including encryption, role-based access
control, and auditing.

Elasticity: You can easily scale up or down based on your workload, which makes Snowflake cost-effective.

Data Loading: Snowflake provides multiple ways to load data, including bulk loading, streaming, and
third-party integrations.

Snowpipe: Snowpipe is a service for automatically ingesting streaming data into Snowflake.

Semi-Structured Data: Snowflake supports semi-structured data formats like JSON, Avro, and
Parquet, allowing you to work with a variety of data types.

Time Travel: Snowflake provides features like Time Travel and Fail-safe, which allow you to recover from accidental data changes and perform historical data analysis.

Global Availability: Snowflake is available across multiple cloud providers, giving you the flexibility to
choose your preferred cloud infrastructure.

Snowflake Data Marketplace: You can access third-party data sets and data services through
Snowflake's Data Marketplace.

Query Performance: Snowflake is known for its excellent query performance, thanks to its
architecture and optimizations.

Integration: Snowflake can be easily integrated with popular data visualization and ETL tools.

Cost Management: With separate billing for storage and compute, you have better control over costs. Snowflake also offers features like Auto-Suspend and Auto-Resume to save on compute costs.

Compliance: Snowflake complies with various data security and compliance standards, making it suitable for a wide range of industries.
Follow these steps to add the Sample Database to your account:
1) Set your role to ACCOUNTADMIN

2) Click on side menu DATA

3) Click on Private Sharing

4) Look for Direct Sharing

5) Click on SAMPLE_DATA

6) In the dialog, edit the name to say SNOWFLAKE_SAMPLE_DATA and choose SYSADMIN as the other role that should have access.
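If you prefer SQL to the UI, the same sample share can typically be attached with the statements below (SFC_SAMPLES.SAMPLE_DATA is the share identifier Snowflake documents for the sample data):

```sql
-- Attach the sample-data share as a database, then grant access to SYSADMIN
CREATE DATABASE SNOWFLAKE_SAMPLE_DATA FROM SHARE SFC_SAMPLES.SAMPLE_DATA;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE_SAMPLE_DATA TO ROLE SYSADMIN;
```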


Udemy LEARNINGS on Snowflake
Introduction:

Architecture

Data loading: data loading and unloading techniques

Performance tuning – performance optimization techniques. Time travel, cloning and data sharing.

Continuous data protection

Secure data sharing

Resource management

Security

Cloning

Snowflake architecture – decoupled storage and compute, and how they all work together to deliver a scalable and efficient data warehousing solution.

How to load data into Snowflake from cloud and on-premises sources: the bulk loading approach and also continuous data loading using Snowpipe.

How to optimize the data loading process.

Performance optimization techniques include query rewriting, query profiling, materialized views, clustering and caching.

Snowflake includes cloning, time travel and data sharing. Time travel allows users to query data as it appeared at a specific point in time, making it easy to track changes and analyse data discrepancies.

Cloning enables users to create a copy of entire databases, schemas or tables within a few minutes, which can help in creating sandboxes or backups.

Data sharing allows users to share data securely and efficiently with other Snowflake accounts, enabling collaboration and data monetization.

Snowflake security features include multi-factor authentication, RBAC, and data encryption in transit and at rest.

Resource management in Snowflake covers resource monitors, which are used to control the cost of virtual warehouses, and the use of the INFORMATION_SCHEMA and ACCOUNT_USAGE views to monitor compute and storage.
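A resource monitor can be sketched as below; the monitor name, quota and warehouse name are illustrative:

```sql
-- Notify at 90% of the quota; suspend the warehouse at 100%
CREATE RESOURCE MONITOR monthly_limit
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
       TRIGGERS ON 90 PERCENT DO NOTIFY
                ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to a warehouse
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = monthly_limit;
```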

What is Snowflake?

Snowflake is a purpose-built data warehousing platform built from scratch exclusively for public cloud platforms.

So it is available as a cloud-only software-as-a-service (SaaS) offering.

Since it has been designed for the cloud, the software is optimized for execution on the cloud and takes advantage of cloud concepts like decoupling storage and compute from each other and elastic scalability features.

We'll learn more about this decoupling and how it results in a better outcome for the customer.

So Snowflake is provided as a software-as-a-service offering, which means that there is no hardware to manage for the customer and no software to maintain or upgrade.

Any updates, optimizations and tunings are made by Snowflake itself and made available to all customers automatically.

A Snowflake data warehouse is pay-per-use only, and because of the decoupled nature of storage and compute, a customer only pays for the actual storage used and the actual compute used.

This means that you could store terabytes of data, and if you're not processing that data regularly, you get charged only for the storage costs. So finally, Snowflake's design takes advantage of the storage and compute scalability that is offered by the underlying cloud platforms.

Since it uses object storage, the storage available to a Snowflake customer is virtually unlimited.

It's highly fault tolerant and can be scaled up to any number. Similarly, on the processing front,
Snowflake uses scalable compute clusters called virtual warehouses.

Virtual warehouses allow the processing power available to Snowflake to be scaled in many different ways.

Snowflake was ranked second on Forbes magazine's Cloud 100 list and first on LinkedIn's 2019 US top startup list.

So what does the future hold for Snowflake?

With all the unique features that Snowflake has, customers worldwide are moving to Snowflake as
their platform of choice.

And slowly but surely, Snowflake is gaining traction and becoming a snowball.

And I believe this technology is seriously challenging both traditional data warehousing platforms
and the big data platforms.

If you are working with data warehousing, big data or databases in any capacity, Snowflake is undoubtedly one of the technologies to master.

Snowflake editions: Standard, Enterprise, Business Critical and Virtual Private Snowflake.

So all the fundamentals of a cloud-based data warehousing solution are included in Snowflake Standard Edition.

It includes complete SQL data warehouse capabilities, data sharing, data encryption in transit and at rest, and access to the Snowflake Data Marketplace.


Using the Standard edition, you can perform database replication and use federated authentication and multi-factor authentication; cloning, Fail-safe and Time Travel are also included.

However, in the Standard edition, the ability to travel back in time is limited to one day.

The Enterprise edition has all the capabilities of the Standard edition but adds further capabilities.

These include Time Travel up to 90 days, multi-cluster virtual warehouses, materialized views, dynamic data masking, search optimization, external data tokenization and annual rekeying of the data.

The Business-Critical Edition enhances the Enterprise edition with additional security features such as customer-managed keys, Payment Card Industry (PCI) compliance and private connectivity support, as well as failover.

And then finally, the virtual private snowflake.

It builds on all these editions. So it has all the capabilities of Business Critical Edition, but also
provides additional isolation by providing a customer specific metadata store and a customer
specific pool of computing resources that are not shared with any other customer.

So what that means is you will get your own isolated version of Snowflake.

Snowsight: Snowflake’s web interface.

Virtual private Snowflake edition provides dedicated compute resources and dedicated metadata
store.

What is the minimum Snowflake edition which supports multi-factor authentication (MFA) –
Standard edition.

What is the minimum Snowflake edition that supports private connectivity to Snowflake – Business
critical

Snowflake architecture:

Hybrid architecture – multi-cluster, shared-data architecture.

Multi-cluster, shared data – separation of storage and compute allows unlimited scalability, with each layer scaling independently of the other.

Snowflake supports the following cloud service providers' storage:

Currently, Snowflake supports AWS S3 storage, Azure Blob Storage or Google Cloud Storage to store its data.

Since Snowflake stores data on object storage on the cloud platform, the storage
can scale indefinitely and independently of compute.
The cloud platform is responsible for providing the durability for these stored files.
Therefore, Snowflake can take advantage of the disaster recovery and fault tolerance
provided by the underlying cloud platform.
Data that is loaded into Snowflake is stored as files on the object based cloud
storage.
It is worth mentioning here that cloud-based object storage is immutable, which means stored data cannot be updated once it is written; it can only be appended to. If updates are required to a file that was written to an object store, you must remove the complete file, perform the update and write the new file back to the object store.
So this immutability of files on object store presents an interesting challenge that
Snowflake solves through its unique micro partitioning approach.
These micro partitions also form the architectural basis of many exciting snowflake
features such as time travel, secure data sharing and cloning.
So as this course progresses, we'll be talking a lot more about micro partitions, how
they are managed, how data is inserted in those micro partitions, what is the
maximum size of a micro partition and what are the immutability implications on
micro partitions?
But for the time being, in the next lecture we'll be looking at the compute layer, which
is the virtual warehouses in Snowflake.
Whenever you hear the term virtual warehouse, it actually refers to a compute
cluster.
However, each virtual warehouse accesses the same shared data.
Next, we have the compute layer through which queries are executed on the stored
data.
The compute engines in Snowflake are referred to as virtual warehouses.
The virtual warehouses are entirely independent of the storage.

Each virtual warehouse accesses the same shared data.

Therefore, the architecture for Snowflake is often referred to as shared data.


Multi-cluster Architecture.
In the next lecture we will cover the cloud services, which is also a critical part of the
snowflake architecture.

Storage layer – Micro partitions.

Each micro partition – 50 MB – 500 MB

So data in Snowflake is automatically organized into partitions known as micro partitions.
Compared to traditional static partitioning, micro partitions in Snowflake are managed automatically and don't require intervention by the user.
As the name suggests, micro partitions are relatively small and each micro partition
will generally contain 50 MB to 500 MB of uncompressed data.
However, do note that the actual stored data is smaller as data in Snowflake is
always stored with compression.
Micro partitions are added to a table in the order of how the data arrived in the table.
So if additional data is added to a table, another micro partition or possibly multiple
micro partitions depending on the size of the data, will be created to accommodate
this data.
Snowflake micro partitions are immutable, which means they cannot be changed once created.
Any update to existing data or loading of new data into a table will result in new
micro partitions being created.
Because micro partitions are immutable, any update or new data must be added into a new micro partition. Therefore, it is not guaranteed that similar partition values will always be in the same physical partition. So as an example, you'll see the data from the table on the left stored in multiple micro partitions on the right.
You will also notice that the values in the two micro partitions overlap between the
partitions.
You will see multiple partitions for the same column value and overlaps between
partitions for different column values.
Because the data values are spread across multiple partitions.
Snowflake must keep track of what range of data is in which partitions so that it can
use that information for efficient query processing.
Now Snowflake maintains several different kinds of metadata for a given table for
this purpose.
It stores the range of column values in its metadata.
That is the maximum and minimum value for each column in each micro partition.
With this metadata information, Snowflake can intelligently decide which partitions
to read when processing a query.

Similarly, it also stores the count of distinct values for each column in the metadata
and certain other information to assist in query optimization.
Now, another important aspect is that within each micro partition, the data is stored
in a columnar format.

So each column is stored compressed, and Snowflake automatically determines the most appropriate compression algorithm.

Storing data in columnar format enables Snowflake to optimize queries even further
when a subset of columns are accessed.

So consider this straightforward SQL example on the screen where we are querying
the table on the left and only selecting the ID column.
We also added an ID equal to one where clause in the query.
So now Snowflake knows exactly which micro partitions contain the value one for
the ID column.
And additionally, since we selected only one column, it will limit the processing to
just that column
within its columnar format.
So the result is highly efficient access of a subset of stored data as highlighted on
the screen.
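The on-screen query described above amounts to something like the following; the table name is illustrative:

```sql
-- Snowflake's metadata (min/max values per micro partition) lets it skip
-- partitions that cannot contain ID = 1, and the columnar layout limits
-- reads to the ID column only.
SELECT ID
FROM my_table
WHERE ID = 1;
```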
So this was a bit of detail about the micro partitions in Snowflake.
We will come across micro partitions throughout this course, especially when we
look into the cloning and the data sharing section.

Cloud Services Layer

So the cloud services layer is a highly available, fault-tolerant, always-on service.


So for any user connecting to Snowflake, whether via the Snowflake web UI or SnowSQL, their requests will go through the cloud services layer.
Metadata and management, security & governance, sharing, query parsing, query optimization, and ACID transaction control.
Additionally, the Cloud Services layer provides transaction control or ACID
compliance.
ACID stands for atomicity, consistency, isolation and durability. At a high level, this term refers to the fact that a database system must allow multiple transactions to execute in isolation and commit or roll back a transaction as a single unit, ensuring a consistent state.

Compute layer – additional aspects of virtual warehouses.

A virtual warehouse in Snowflake is typically a multi node compute cluster.


A virtual warehouse provides resources such as CPU, memory and temporary storage, which are used to process queries.

Multiple virtual warehouses can be created for a given snowflake account.


But it is worth noting that each of them accesses the same shared data, hence the term multi-cluster, shared data.

Now, a virtual warehouse can be created in any of the available sizes: for example, extra-small, which is a one-node cluster; small, which is a two-node cluster; and then all the way up to 4X-Large, which is a staggering 128 nodes.
Now Importantly, a customer can suspend a virtual warehouse if it is not in use or if
it is not required for some time.
A suspended virtual warehouse does not consume any credits and therefore doesn't
cost anything to the customer.
It is important to note here that when a virtual warehouse is requested to be
suspended, it does not enter a suspended state until all active queries using that
virtual warehouse have been completed.
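Suspending and resuming can also be done explicitly; the warehouse name is illustrative:

```sql
-- A suspended warehouse consumes no credits; the suspension completes
-- only after active queries on the warehouse finish
ALTER WAREHOUSE my_wh SUSPEND;
ALTER WAREHOUSE my_wh RESUME;
```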

A customer can also resize a virtual warehouse to meet the requirements of a


changing workload.
For example, suppose the customer started out with a small size virtual warehouse.
In that case, it may be that the query workload complexity has increased significantly
after a while.
So a small sized virtual warehouse is no longer enough to meet the demand.
In such cases, the virtual warehouse can be resized to a larger size to meet the
increased workload complexity.
When a virtual warehouse is resized to a larger size, additional nodes are provisioned and added to the virtual warehouse compute cluster. It is important to note that the charge for the new size only takes effect after all the nodes in the virtual warehouse have been provisioned.

If the queries can execute efficiently and timely on a smaller virtual warehouse,
keeping a larger sized virtual warehouse is basically wasting resources and incurring
extra costs.
Therefore, scaling down the virtual warehouse size is a good option in such cases.
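Resizing is a single statement; the warehouse name and sizes are illustrative:

```sql
-- Scale up for a heavier workload...
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ...or scale back down to avoid wasting credits
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';
```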

Snowflake pricing:

Storage cost
Compute cost
Serverless cost
Data transfer

So Snowflake charges a monthly fee to all customers for the data that they store.
The storage costs are calculated based on the average storage for each month.
And it's important to note that the storage costs are calculated after compression is
applied on the Data.
Now with Snowflake, you can create a virtual warehouse of any size from extra-small to 4X-Large. Snowflake credits are consumed based on the size of the virtual warehouse and are charged per second.
However, do note that there is a minimum of one minute charge when a virtual
warehouse starts up.

How is Snowflake priced:

Storage cost – AWS S3 or Azure Blob storage, based on actual usage; columnar compression and other techniques save cost. Snowflake charges a monthly fee to all customers for the data that they store, and the storage costs are calculated after compression is applied.

Snowflake credits are consumed based on the size of the virtual warehouse and are charged every second.
Snowflake virtual warehouses can be created from extra-small to 4X-Large.
Snowflake also offers several serverless features for which compute costs are charged. The serverless features don't require any virtual warehouse, so they are charged separately.

These serverless features include: Snowpipe, automatic clustering, database replication, materialized views and search optimization.

Another cost is data transfer costs: there are no costs for data transfer into Snowflake, but it is chargeable when data goes out to other regions or to other cloud platforms.

Cloud services: cloud services usage up to 10% of compute usage is free. The cloud services layer holds database definitions, table definitions, users, etc.

Multi clustered shared data architecture -snowflake architecture.

Snowflake interfaces and tools: the Snowflake web interface, along with the majority of tools and interfaces, will be discussed here.

SnowSQL itself is available on Linux, Windows and macOS.


Connecting to Snowflake through the SnowSQL command-line interface:

Snowsight is a modern and lightweight web interface using new technologies and is a
primary method of interacting with your Snowflake instance.

SnowSQL connects to Snowflake through the command line and executes SQL queries on your Snowflake instance. SnowSQL is available for Linux, Windows and macOS.
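A typical SnowSQL invocation looks like the following; the account identifier and user name are placeholders:

```shell
# Interactive session (prompts for a password)
snowsql -a myorg-myaccount -u myuser

# Run a single query non-interactively
snowsql -a myorg-myaccount -u myuser -q "SELECT CURRENT_VERSION();"
```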

High level view of data loading in Snowflake

COPY and Snowpipe are the two methods to load data into Snowflake.
Alternatively, the staging can be inside the data warehouse.
Role of STAGE in data loading:
Types of stages in Snowflake: external and internal stages.

External stage, named internal stage, user stage.


Loading on premises data via a named internal stage:

Hands on: loading data using a named internal stage
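A sketch of the hands-on flow; the stage, local file path and table names are assumptions:

```sql
-- 1) Create a named internal stage
CREATE STAGE my_stage;

-- 2) Upload a local file into the stage (run from SnowSQL;
--    PUT compresses the file to .gz by default)
PUT file:///tmp/customers.csv @my_stage;

-- 3) Copy the staged file into the target table
COPY INTO customers
FROM @my_stage/customers.csv.gz
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```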

External stage: points to cloud storage locations.


Snowflake can securely connect to a cloud storage location.

Hands on: Loading data using an external stage

An http link doesn't work in an external stage definition.

While creating the external stage, we will have authentication and authorization as well.
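An external stage pointing at cloud storage can be sketched as below; the bucket URL and credentials are placeholders, not real values:

```sql
-- S3 example; the URL must be an s3:// path, not an http link
CREATE STAGE my_ext_stage
  URL = 's3://my-bucket/data/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');
```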
Tables Stage and User Stage
Basic Data transformations:

External tables: an alternative to data loading

Materialized views can improve query performance for external tables.


Materialized views on external tables don't refresh automatically.

Unloading or exporting data from snowflake:


Unloading data or exporting data is almost the same as loading. It uses the same copy
mechanism and the concept of stages. Unload to different file formats.
Compress and encrypt while unloading.
The GET command is used to download data from an internal stage to an on-premises
system. The PUT command uploads data from an on-premises system to an internal stage. To
download or upload data to an external stage, cloud provider utilities or other tools are used
to interact with data in the cloud storage pointed to by the external stage.
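Unloading mirrors loading; the stage, table and paths below are illustrative:

```sql
-- Unload a query result into a named internal stage as compressed CSV
COPY INTO @my_stage/export/
FROM (SELECT * FROM orders)
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP');

-- Then download from the internal stage to the local machine (via SnowSQL)
GET @my_stage/export/ file:///tmp/export/;
```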

The load metadata stores a variety of information, such as the name of every file that was
loaded into that table and the time stamp corresponding to the time that a file was loaded. By
utilizing this load metadata, Snowflake ensures that it will not reprocess a previously loaded
file. The load metadata expires after 64 days. Snowflake skips over any older files for which
the load status is undetermined.

The COPY command allows unloading or exporting data from a table or a view and also
allows using queries (SELECT) to unload data.

When loading data into a table using the COPY command, Snowflake allows you to do
simple transformations on the data as it is being loaded by using a SELECT statement.
During the load process, the COPY command allows for modifying the order of columns,
omitting one or more columns, and casting data into specified data types. It is also possible to
truncate data using the COPY command if it is larger than the desired column width.
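A COPY-time transformation can be sketched as below; the stage, column positions and table are assumptions:

```sql
-- Reorder columns, omit the rest, and cast during the load;
-- $1, $2 refer to the first and second columns of the staged file
COPY INTO customers (id, name)
FROM (SELECT $2::NUMBER, $1 FROM @my_stage/customers.csv.gz)
FILE_FORMAT = (TYPE = 'CSV');
```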

Snowflake offers an alternative approach for tables called external tables, which permits the
creation of tables with data stored in external cloud storage. External tables remove the need
for the data to be loaded into Snowflake. In the case of an External table, the definition of the
table is still stored in Snowflake metadata and consists of table structure, file locations,
filenames, and other attributes. However, the table's data is saved outside of Snowflake. The
external table functionality enables you to query external data like a standard table. External tables may be joined to other tables.
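A minimal external table over an external stage might look like this; the names and column positions are illustrative:

```sql
-- The data stays in cloud storage; only the table metadata lives in Snowflake.
-- For CSV files, VALUE:c1, VALUE:c2, ... reference the file's columns.
CREATE EXTERNAL TABLE ext_orders (
  order_id NUMBER AS (VALUE:c1::NUMBER),
  amount   NUMBER AS (VALUE:c2::NUMBER)
)
LOCATION = @my_ext_stage/orders/
FILE_FORMAT = (TYPE = 'CSV');
```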

COPY command uses virtual warehouse resources. Snowpipe is billed separately and does
not use virtual warehouse resources. Snowpipe is serverless and has its own computational
capability; therefore, it does not rely on virtual warehouses for processing. Snowflake
automatically manages the compute required by a Snowpipe. Snowflake also manages the
scaling up and down of a Snowpipe as per the data load requirement. Since a Snowpipe is
serverless, its costs are charged separately from virtual warehousing fees.

Snowflake allows continuous data loading using Snowpipe, a serverless service. Snowpipe
enables you to load data in a micro-batch manner, loading small volumes of data on each
execution. The micro-batch-based data loading is used when a continuous stream of data,
such as transactions or events, must be loaded and made available to enterprises quickly.
Snowpipe enables continuous data loading and can load data within a few minutes after it
arrives in a stage. Snowpipe is serverless and has its own computational capability; therefore,
it does not rely on virtual warehouses for processing.
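A Snowpipe definition can be sketched as below; the names are assumptions, and AUTO_INGEST requires cloud event notifications to be configured on the storage bucket:

```sql
-- Continuously load new files that arrive in the stage
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO events
  FROM @my_ext_stage/events/
  FILE_FORMAT = (TYPE = 'JSON');
```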


When loading data into a table using the COPY command, Snowflake allows you to do
simple transformations on the data as it is being loaded. During the load process, the COPY
command allows for modifying the order of columns, omitting one or more columns, casting
data into specified data types, and truncating values. While loading the data, complex transformations such as joins, filters, aggregations, and the use of FLATTEN are not supported, as the COPY command is limited to essential data transformations. Therefore, joining, filtering, and aggregating the data are supported ONLY after the data has been loaded.

Continuous Data protection

This offers features that protect data in the Snowflake environment without compromise. It offers data encryption at rest and while in transit; multi-factor authentication is part of the authentication mechanism.

In addition, Snowflake is also equipped with a number of cutting-edge features that assist in safeguarding data and its subsequent recovery in the event that a human makes a mistake.

So you can restore data that has been mistakenly changed or deleted by utilizing two features: Time Travel and UNDROP.

In this section, we'll cover some aspects of continuous data protection in Snowflake, including Time Travel, Fail-safe storage and UNDROP functionality, and also the concept of transient and temporary tables.

Back to the Future with Time travel

So Time Travel in Snowflake is a cutting-edge data protection feature that enables users to query, retrieve and recover historical data from tables in case of loss of data, when it has been mistakenly changed or deleted as a result of human error and you want to restore data as painlessly and as quickly as possible.

Before the Time Travel functionality was introduced by Snowflake, the most common method for recovering from the inadvertent loss of data was to restore the lost information from a previous backup.

How time travel works in Snowflake:

Snowflake stores the data in its own format using micro partitions. As new data is added to the table, new micro partitions are added to it. The micro partitions are immutable.

Micro partitions and metadata are key to time travel, and this immutability of the micro partitions has implications for the updates and deletes performed on a table.

The example shows two rows from a table being deleted. Because micro partitions are immutable, Snowflake cannot simply update the data in micro partition number two; rather, it will create a new micro partition, marked as partition number four in our example. It is essential to note that micro partitions that have been marked as deleted still exist physically on the disk and can be read if necessary. And this lays the groundwork for Snowflake's Time Travel functionality.

So when users request to read data from a table using a Time Travel extension, wanting to read data as it existed at a specific point, Snowflake reads data from deleted historical partitions to fulfil those Time Travel queries.

Snowflake retains these historical deleted partitions for a specific period of time before
purging them altogether. Until these partitions are purged, any ordinary user can access the
data contained in these historical partitions, and the period for which these historical
partitions are retained is known as the time travel duration.

The time travel duration generally varies from one day to 90 days depending on the
snowflake edition that you may have. So this was basically behind the scenes working of time
travel. In the next lecture, we will discuss the time travel extensions and how to use them to
modify existing queries to use time travel.
The new micro partition has all the data from the deleted micro partition except the two rows that were deleted; at the same time, micro partition number two is marked as deleted.

Time travel SQL extensions:

Snowflake also supports the UNDROP statement, which can be used to recover tables, schemas or even complete databases after they have been dropped. The AT clause is usually used with SELECT statements and supports three different ways of accessing historical data: TIMESTAMP, OFFSET and STATEMENT.

Back to the future with time travel:


Undrop with Timetravel:

UNDROP is the Snowflake equivalent of an undo (Ctrl+Z).
Failsafe Storage:

The data in Fail-safe storage can be accessed only by the Snowflake support team, so we can contact Snowflake support to recover data from Fail-safe storage.

Time travel and failsafe durations:

Snowflake charges for data stored in Fail-safe and Time Travel storage.
Types of Tables in Snowflake:

Depending on the Snowflake edition, the Time Travel duration might range from 1 to 90
days. The Standard edition allows for one day of Time Travel. Time Travel is possible for up
to 90 days in the Enterprise version and above.
In addition to protection provided by Time Travel, data that has been modified also goes
through a failsafe period. Failsafe storage is intended to provide an extra layer of protection
against data loss caused by human error. Once the Time Travel period ends, Snowflake keeps
the data for a further 7-day period as further protection. When data is in failsafe storage,
ordinary users cannot access it; only Snowflake support employees can access and recover it
if the customer requests it.

Time Travel SQL extensions allow you to see data as it existed before or at a particular time.
It can also be used to see data before an SQL statement is executed or at the point when an
SQL statement is run. Time Travel does not let you recover data for more than 90 days in the
past.
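The Time Travel retention period can be configured per object, assuming the edition allows it
(values above 1 day require Enterprise edition or above; the table name is hypothetical):

```sql
-- Set the Time Travel retention for a table to 90 days
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- Check the current retention setting
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE orders;
```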

To support Time Travel queries, Snowflake supports special SQL extensions. It supports the
AT and BEFORE statements which can be used with SELECT statements or while cloning
tables, schemas, and databases. Snowflake also supports the UNDROP statement, which can
be used to recover tables, schemas, or even complete databases after they have been dropped.

Cloning in Snowflake:

Intro to zero copy cloning:

How cloning works in snowflake:

Micro partitions and metadata are key to zero copy cloning.

So Snowflake divides data in a table into several micro partitions, and the metadata in the
cloud services layer tracks micro partitions corresponding to a table.

As new data is added to a table, new micro partitions are produced. So this metadata in
Snowflake's Cloud services layer maintains information on which micro partitions belong to
which table, and also other information such as if a micro partition is marked as deleted, or if
it is in time travel storage or in fail safe storage.
Any update to the table will result in the addition of new micro partitions.

Cloning a table is far faster than a copy operation.

Cloning a table in snowflake:

Cloning of a table, demonstrating how much faster the operation is compared to a
standard copy operation.
Sample queries for cloning a table:

The DISTINCT queries on the two tables will return different results, as we have updated
some data in the copied table.
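A sketch of the two approaches side by side (table names are hypothetical); the clone is a
metadata-only operation, while the copy reads and rewrites every row:

```sql
-- Zero-copy clone: only metadata is written, no data is physically copied
CREATE TABLE orders_clone CLONE orders;

-- Physical copy for comparison: scans and rewrites all the data
CREATE TABLE orders_copy AS SELECT * FROM orders;
```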

Clone a complete database and a schema:


Cloning with Time travel:
This is how time travel can be done with cloning:

Cloning is a metadata operation in which no actual copying of the data occurs. A
snapshot of the data in the object being cloned is captured and made available in the
cloned object. The cloned table's metadata references the existing micro-partitions at
the time of the snapshot.

Cloning is achieved through metadata operation performed in the cloud services layer. Data is
not physically copied, nor are new micro-partitions created—instead, the cloned table points to
the micro-partitions of the source table.

Combining Cloning and Time Travel can generate a clone of a table, database, or schema as it
existed at a specific point in time. Because both Time Travel & Cloning are metadata operations,
they can easily be combined.
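Combining the two is just a matter of adding an AT | BEFORE clause to the CLONE statement
(object names and the query ID are hypothetical):

```sql
-- Clone a table as it existed 24 hours ago
CREATE TABLE orders_yesterday CLONE orders
  AT (OFFSET => -60*60*24);

-- Clone a whole database as it existed before a particular statement ran
CREATE DATABASE analytics_restored CLONE analytics
  BEFORE (STATEMENT => '01a2b3c4-0000-1234-0000-000000000001');
```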

The zero-copy cloning capability of Snowflake enables users to create clones of tables, schemas,
and databases without physically copying the data. Cloning does not require additional storage
space, and because cloning does not physically replicate data, it is far faster than the physical
copying of data. Micro-partitions and metadata enable rapid and efficient zero-copy cloning
because the cloned table's metadata references the existing micro-partitions.
When tables, schemas, or databases are cloned, the cloned item does not contribute to total
storage until data manipulation language (DML) operations are performed on the source or
target, which modify or delete existing data or add additional data.


Data sharing in snowflake:


Secure data sharing in snowflake:

There is no storage fee for data sharing.

How data sharing works:


Data sharing – underpinned by micro partitions & metadata

A shared table references the underlying table and its micro partitions. Sharing is therefore
fast, and any changes to the data in the source table are instantly reflected in the shared table.
Snowflake offering for data sharing:

The 3 types of sharing that Snowflake provides are shown above:

Direct Share, Snowflake Marketplace, and Data Exchange.

Direct sharing:
Virtual Private Snowflake (VPS) accounts are an exception for data sharing: they cannot share
data.
The data sharing process requires several steps to share with another consumer.

Steps in blue are performed by the data provider.


So in secure data sharing, the data sharing process starts with the provider account,
creating a share object.
So one way to think about a share object is as a container that stores all of the
information that is necessary to enable sharing.
So each share object contains information on the objects that are being shared, such as
tables, and the addition of a table to a share is accomplished by granting the share
object SELECT access to the table.
The share object also contains information on the schema and database containing
the shared item.
The share object must be granted USAGE access to the schema and the database.
Finally, the shared object contains information about one or more snowflake
accounts with whom the data is shared.
These accounts are also referred to as consumers.
So after a consumer's account number is associated with a share, the share will begin
to appear in the consumer's account.
The consumer can then create a read-only database on the share object and is then
able to view all of the shared objects within that read-only database.
Now it is important to note that the consumer account does not pay for the storage
cost for the shared data.
However, any queries run by the consumer account on the shared data result in the
usage of consumers compute.
Therefore, queries on shared data are charged to the consumer account.
As mentioned previously, reader accounts are an exception to this rule.
We will discuss them in coming lectures.
Finally, any snowflake account has the potential to become a data provider, allowing
it to share a single item or several objects with other snowflake accounts.
Virtual Private Snowflake (VPS) accounts are an exception to this, so they
cannot share data.
So the process on the screen shows the steps required to share a table with another
consumer.
The steps in blue are performed on the data provider.
So we start by creating a new share object.
Then we add the tables or any other supported objects such as views that we want
to add to the share.
Adding a table to the share is achieved by granting SELECT access on the table to the share
object.
We must also add the schema that contains the table and the corresponding
databases to the share object.
This is achieved by granting usage access on the schema and the database to the
share object.
Finally, on the data provider, we add one or more consumers to the share object,
which basically results in the share being available to those consumers.
Now on the consumer account, we must create a database from the share.
There is a special syntax for that which we will see in the hands on lectures.
A database that is created from a share is automatically created as read only.
Then, optionally, you can grant access to other roles on the new database so that
other users are also able to access shared data.
Now users in consumer accounts can query this shared data.
However, note that they'll be using the consumer accounts compute.

Share a table with another snowflake account:

Two Snowflake accounts are required for this: one as the data provider and another as the
data consumer. Both should be on the same cloud provider and in the same region (sharing
across regions and clouds is possible, but requires data replication).

Please note that we will use the account admin role to create and manage the
sharing operation.
It is possible to grant the CREATE SHARE and IMPORT SHARE privileges to another role, which
can then create these shares.
For setting up the share and sharing it with the consumer account:
From the above, it's clear that the share can only be created by an account admin (unless the
privilege has been granted to another role).
First we need to grant USAGE on the database and the schema, and then SELECT on the table,
else it will throw an error like below:

To find out the account name, below is the way to find it from the web URL:

Share a table using snowflake WEB UI:


Provider and consumer accounts are required for this, as mentioned above.
We need ACCOUNTADMIN privileges, else we will not be able to share; on the consumer
side, the ACCOUNTADMIN role (or a role with IMPORT SHARE) is likewise needed.

The consumer should be in the same region and the same cloud provider.
Share data with a non-snowflake customer with snowflake web UI:
Here we are sharing the data with a customer who doesn't have a Snowflake account.
In this case we need to create a reader account for the user so they can view the
data on the other side.

First we need to create a reader account in the below way:
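A minimal sketch of creating a reader account (account name and admin credentials are
hypothetical):

```sql
CREATE MANAGED ACCOUNT reader_acct
  ADMIN_NAME = 'reader_admin',
  ADMIN_PASSWORD = 'StrongPassword123!',
  TYPE = READER;
```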

Billing of the compute used by this reader account is our (the provider's) responsibility:

The account has been created:

Snowflake Market place:


Exploring the Snowflake Marketplace: to explore this, we need a user with the ACCOUNTADMIN
privilege or the IMPORT SHARE privilege to consume the data.

Snowsight web UI:


We will usually have different datasets in the Marketplace in the Snowflake web UI. We
need at least ACCOUNTADMIN privileges or the IMPORT SHARE privilege.

Data Exchange:

Data Exchange in Snowflake is your own private data sharing hub. It is similar to the
Marketplace, but restricted to invited members.
Quiz:
The cloud services layer facilitates data sharing through metadata operations.
When a Snowflake data provider shares data with another Snowflake account, the
data consumer is charged for the compute charges for any queries they run.
Metadata operations in the cloud services layer allow data sharing without physically copying it.
Since the provider account stores and pays for the data storage, the data consumer doesn't have
to pay anything extra for storage. However, the data consumer pays for the compute used to run
queries on shared data. When queries are run on shared data, the compute of the data consumer
is used.

Sharing data with a non-Snowflake user or organization is possible by creating a reader account.
This reader account is created by the data provider solely for sharing purposes.

Since the data provider creates and administers the reader account, all the reader account's
compute expenses are invoiced to the provider account. Therefore, the reader account's use of
the virtual warehouse compute is added to the provider account compute charges.

The Snowflake Marketplace is an online marketplace where you can purchase and sell datasets.
You may import data from outside your company into your Snowflake instance and utilize it to
enrich your data via the Snowflake Marketplace.

Data Exchange is your own private hub for sharing data with a small group of people or
organizations who have been invited to join. The owner of the Data Exchange account is in
charge of inviting members and specifying whether they can share, consume, or do both.

The consumer creates a database from the Share object as a read-only database.


Except for Virtual private Snowflake accounts, the Snowflake Marketplace is available to all
Snowflake accounts hosted on Amazon Web Services, Google Cloud Platform, and Microsoft
Azure. Any Snowflake account (again, except for VPS accounts) can become a data provider and
publish datasets to the Marketplace for a cost or for free. However, you are required to sign up
as a partner first and become an approved data provider.

Virtual Private Snowflake (VPS) cannot use secure data sharing, Marketplace, etc., because VPS
accounts have isolated metadata, compute, and storage and therefore don't have sharing
capabilities.
Performance Optimization in Snowflake:

Some manual tuning may be required. scaling up and down of virtual warehouse is one of these
features. There are some automatic features which work by default only thing is we need to take
care that these features are working efficiently and as intended.

Partition pruning during query execution

*** In Snowflake, scaling up refers to increasing the size of a warehouse by adding
more compute resources to it. This can be done by resizing the warehouse.
Scaling down refers to reducing the size of a warehouse by removing compute
resources from it. This can be done either by suspending the warehouse or by manually
downsizing it.
Scaling out refers to adding more clusters to a multi-cluster warehouse to increase
the pool of compute resources available to it. This can be done either statically or
dynamically, depending on the load on the warehouse. ***

Some of the Snowflake-managed performance features include the metadata cache,
query result caching, caching at the virtual warehouse level, and partition pruning
during query execution.
While most of these work out of the box, features such as partition pruning may work
better if table partitions are aligned to the query patterns, so some manual tuning
may be required.
In addition to these snowflake managed features, Snowflake also provides optional
and configurable performance options that can be set manually to optimize
performance.
Scaling up and down a virtual warehouse is one of these features, as is auto scaling
a virtual warehouse to handle rising concurrency.
This section discusses the performance optimization features and approaches
available for enhancing query performance and in some instances, even reducing
costs.
Snowflake's unique architecture and the underlying micro-partition storage technology
mean that it is not necessary to perform much query tuning in most situations.
There are, however, several performance improvement approaches that are available
and can be used to increase Snowflake's overall performance. So these options
include internal caching mechanisms that operate transparently in the background
to increase performance.
And scaling up or increasing the capacity of a virtual warehouse to allow for more
processing power to be available for complex queries.
Then horizontal scaling, which is done by increasing the capacity, by using a multi-
cluster virtual warehouse to handle a larger number of concurrent users and
concurrent queries.
Automatic static and dynamic partition pruning can be used to reduce unneeded
partitions while a query is being processed.
It is possible to accomplish better partition pruning by redistributing data in micro
partitions using clustering keys.
Then pre-computing results of complex regularly executed queries by using
materialized views.
And finally, search optimization, which can be used to improve the performance of
specific types of lookup queries.

Query execution in Snowflake:

Query Profile:
Once we submit a query, a query ID appears in the output screen. If we click on
that query ID, it will take us to the query profile.

Caching in Snowflake:
Metadata cache & Result cache – these are part of cloud services layer. These are
available to all virtual warehouses.

Query result cache:


Query result cache in action:
Before returning the results, check the query plan below (viewed using the query ID):
when we re-run the same query, the query plan looks different compared to the
initial execution.

Below is the query plan returned when the same query is executed again:

We can also disable the query result cache. Below is the way to disable the query
result cache for that particular session:
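A sketch of toggling the result cache for the current session:

```sql
-- Disable the query result cache for this session
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Re-enable it
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
```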

In this case there won't be any difference in the query plan, as we disabled the query
result cache. From this point it will not use the query result cache unless a new
session is started or the parameter is set back to TRUE.

Metadata Cache: every time data is inserted, updated, or deleted in a Snowflake
table, new micro partitions are written. When new micro partitions are written,
Snowflake stores that info in the metadata.
Continuation of above:

Metadata cache in Action:

MIN, MAX, and COUNT queries go to the metadata cache and fetch the result quickly.
For character columns (the last of the 3 queries above), the metadata doesn't store
the MIN and MAX of those columns.
For any of the queries below, Snowflake doesn't read the micro partitions at all but
makes use of only the cache, as shown below:
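A sketch of queries that can typically be answered from the metadata cache alone (table and
column names are hypothetical):

```sql
-- Answered from metadata; no virtual warehouse scan needed
SELECT COUNT(*) FROM orders;
SELECT MIN(order_id), MAX(order_id) FROM orders;

-- MIN/MAX on a character column is not kept in the metadata for this
-- purpose, so this query has to scan micro partitions
SELECT MIN(customer_name) FROM orders;
```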

Virtual warehouse cache or local disk cache:

Virtual warehouse cache in action:

Once the virtual warehouse is suspended, all the data stored in its cache is
removed, and a query executed immediately after the warehouse is resumed will
not be using the cached data, as it no longer exists.

Partition pruning and clustering Keys:


The Snowflake query optimizer knows the ID value 1 is contained only in micro partitions 2
and 3, so it doesn't search for the data in micro partition 1, as shown in the screenshot below:

Additional micro partitions are produced when data is added to a table. Because the
column values are scattered across numerous micro partitions, Snowflake must keep
track of what range of data is kept in which micro partition for each column. This
metadata enables Snowflake to eliminate unnecessary micro partitions when running
queries, therefore boosting the overall query performance. This process of eliminating
micro partitions is also known as partition pruning.
So consider the query shown on the screen. Because the Snowflake Query Optimizer
knows that the data for ID equal to one is contained in micro partition two and three.
It does not search micro partition one at all.
Therefore, reducing the amount of reads it needs to perform for processing this
query. So for this simple example, it may not seem that big a deal, but for a really
large table partition, pruning can eliminate a large number of partitions, therefore
improving query performance significantly.


Since micro partitions are produced in the order of the arrival of the data over time,
the data in micro partitions may not be optimally stored and may not support optimal
partition pruning.
For example, the micro partitions shown on the screen below do not enable effective
partition pruning if most queries are based on the store column.
If queries are mostly predicated on the store column, queries may perform better if the
table is clustered on the store column. With no clustering in place, for the data shown
on the screen, Snowflake will need to scan all micro partitions to find all records for
store A.
Similarly, if a query was trying to retrieve data for store B only, it also needs to scan all
the micro partitions.

By clustering a table on a specific column, queries can be optimized by eliminating
unneeded partitions from the query processing.

The advantages of clustering a table on a specific column may not be visible for
tables with a small quantity of data, but as the table data and its micro partitions
grow, clustering the table on the proper set of columns can bring significant gains in
speed through partition pruning.
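A sketch of defining a clustering key and checking how well the table is clustered (table and
column names are hypothetical):

```sql
-- Cluster the table on the column most queries filter on
ALTER TABLE sales CLUSTER BY (store);

-- Inspect clustering quality (depth, overlap) for that column
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(store)');
```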

What happens behind the scenes when a table is re-clustered:


For tables with a clustering key defined, Automatic Clustering, a Snowflake service,
manages the re-clustering as needed, distributing data according to the clustering key to
achieve appropriate partition pruning.
So Snowflake internally maintains the cluster tables and any resource requirements
that are associated with automatic clustering.
Automatic clustering only adjusts those micro partitions which benefit from the
clustering process. Automatic clustering does not need a virtual warehouse but uses
snowflake managed CPU and RAM.
Therefore it has a cost attached which would appear under serverless costs.
Re-clustering a table uses credits like any other data modification action in Snowflake.
Clustering also adds extra storage, because data is physically redistributed and new
micro partitions are created.
The original micro partitions in this case are kept for time travel and failsafe purposes,
resulting in increased storage.
So Snowflake does not immediately update the table Micro partitions when we define
a table clustering key.
Instead, Snowflake redistributes data according to the new clustering key only if it
determines that the table will benefit from re clustering.

Scaling up and down a Virtual warehouse:


Scaling a virtual warehouse up or down is typically done by system administrators.

The syntax to scale up or down a Virtual warehouse:

The larger size is used for scaling up and the smaller size is used for scaling down.
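A minimal sketch (the warehouse name is hypothetical):

```sql
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';   -- scale up
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';   -- scale down
```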

Multi-cluster virtual warehouse - Scaling out


In the case of standard virtual warehouse:

However, in the case of a multi-cluster warehouse, Snowflake will spin up an
additional instance of a medium-sized virtual warehouse and execute the queued
queries on the newly spawned virtual warehouse. Below is the way:

In case both of the above are full and additional queries are queued up again, as
shown below:
However, the process of scaling out automatically can continue up to the maximum
configured value for the Multi-cluster virtual warehouse.
So if the Multi-cluster virtual warehouses were configured to allow for up to three or
higher number of virtual warehouses, it would spawn additional virtual warehouses as
more queries come in.
And if there are not enough compute resources to handle those queries.
Once the query workload decreases and the additional spawned virtual warehouses
don't have any workload, Snowflake shuts them down until it gets to the configured
minimum value.
In our case, the minimum value is one, so the multi-cluster virtual warehouses will
scale back until there is only one medium sized virtual warehouse.
The syntax for creating a Multi-cluster virtual warehouse is similar to creating a
standard virtual warehouse with a few additional options.
The syntax shown on the screen focuses on relevant parameters which are related
to Multi-cluster virtual warehouses.
For the complete syntax, please refer to snowflake's documentation.

So when creating a multi-cluster virtual warehouse, the primary differentiation from a
typical virtual warehouse is that you set a minimum and a maximum cluster count.
The maximum cluster count can be any value from 2 to 10, indicating the maximum
number of virtual warehouses that this multi cluster virtual warehouse can spin up.
The minimum cluster count can be any value from one to the maximum, indicating the
starting number of virtual warehouses.
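A sketch of the relevant parameters (the warehouse name is hypothetical; multi-cluster
warehouses require Enterprise edition or above; see Snowflake's documentation for the
complete syntax):

```sql
CREATE WAREHOUSE my_mc_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1          -- starting number of clusters
  MAX_CLUSTER_COUNT = 3          -- scale out up to 3 clusters
  SCALING_POLICY    = 'STANDARD' -- or 'ECONOMY'
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;
```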

Materialized Views:
Search optimization service in Snowflake:

Quiz Performance optimization in snowflake:


➢ Once a result cache is generated for a query, it stays valid for 24 hours. If another query
that reuses the query result cache is executed within that 24-hour window, the result
cache expiry is extended for another 24 hours from that point onwards. If the result
cache for a query keeps getting used, it will stay valid for up to 31 days. After 31 days, the
result cache for a query will be purged regardless of any other condition.

➢ Every time a virtual warehouse accesses data from a table, it caches that data locally.
This data cache can improve the performance of subsequent queries if those queries can
reuse the data in the cache instead of reading from the table in the cloud storage. The
warehouse cache is local to a virtual warehouse and can not be shared with other virtual
warehouses.

➢ Clustering a table on a specific column can optimize queries by eliminating unnecessary
partitions from the query processing. A table can be re-clustered by defining a clustering
key, which effectively redistributes the data into micro-partitions, ensuring optimal
access to the clustered column.
➢ For tables with a clustering key defined, Automatic Clustering, a Snowflake service,
manages the re-clustering as needed, distributing data according to the clustering key.
Snowflake internally maintains the clustered tables and any resource requirements with
Automatic Clustering. Automatic Clustering only adjusts those micro-partitions which
benefit from the re-clustering process.

➢ For a populated table, the clustering depth is the average depth of overlapping micro-
partitions for specific columns. The clustering depth starts at 1 (for a well-clustered
table) and can be a larger number. For an unpopulated table, the clustering depth is zero.

➢ When defining clustering keys, the initial candidate clustering columns are those
columns that are frequently used in the WHERE clause or other selective filters.
Additionally, columns that are used for joining can also be considered.

➢ A materialized view is a view that pre-computes data based on a SELECT query. The
query's results are pre-computed and physically stored to enhance performance for
similar queries that are executed in the future. When the underlying table is updated, the
materialized view refreshes automatically, requiring no additional maintenance.
Snowflake-managed services perform the update in the background transparent to the
user without interfering with the user's experience.
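As a sketch (table and view names are hypothetical; materialized views require Enterprise
edition or above, and a materialized view may only query a single table):

```sql
-- Pre-compute an aggregate that is queried frequently
CREATE MATERIALIZED VIEW daily_sales AS
SELECT store, order_date, SUM(amount) AS total_amount
FROM sales
GROUP BY store, order_date;
```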

➢ Auto-Scaling mode is enabled by selecting different values for the minimum and maximum
cluster counts. As a result, Snowflake starts and stops
warehouses dynamically based on the workload needs. When a multi-cluster virtual
warehouse using auto-scaling mode starts, the number of active virtual warehouses
equals the minimum warehouse count. Snowflake spins up more warehouses according
to the need, up to the maximum warehouse count. Snowflake shuts down virtual
warehouses as the demand lowers until the number equals the minimum warehouse
count.

➢ The Economy scaling policy attempts to conserve credits over performance and user
experience. It doesn't spin up more virtual warehouses as soon as queuing is observed
but instead applies additional criteria to ascertain whether to spin up new virtual
warehouses. With the scaling policy set to Standard, Snowflake prefers to spin up extra
virtual warehouses almost as soon as it detects that queries are starting to queue up.
The Standard scaling policy aims to prevent or minimize queuing.

➢ The metadata in Snowflake allows the Snowflake query engine to eliminate partitions to
optimize query execution. For example, if the query specifies a WHERE condition,
partitions NOT containing the value matching that condition will NOT be scanned.

Security and Access Control in Snowflake


Security in Snowflake: it is implemented in multiple layers.
A variety of security features are available in Snowflake.
Layers of security in snowflake:
Data encryption at rest:
Data storage security.

Tri-Secret Secure refers to the combination of a Snowflake-managed key and a
customer-managed key that results in the creation of a composite master key to
further protect your data.
Additional:

Multi-factor authentication:

The process of MFA:


Other aspects in Authentication:
Access Control in Snowflake:

RBAC and DAC:


Authorization other aspects:
Row level and column level security.
Row level security:
Out of the box roles in Snowflake:

Roles in Snowflake:
Network layer security: At the network level, Snowflake encrypts all communication
by default using TLS 1.2.
The security is further enhanced through network policies, through which specific IP
addresses may be allowed to connect and others may be blocked.
Additionally, Snowflake supports private connectivity, which means your connection
to Snowflake can be via a private link to the cloud.

At the network level snowflake encrypts all the communications by default.


However, administrators can use network policies to allow only certain IP addresses
to connect.
Or they can also deny access to specific IP addresses.
A network policy can only be created by a security administrator, a higher role, or by a
role with the CREATE NETWORK POLICY privilege.
A network policy consists of the policy name, a list of allowed IP addresses separated
by commas, and a list of blocked IP addresses, again separated by commas.

In the allowed or blocked IP address lists, you can specify an individual IP address or an
IP address range. However, do note that network policies at present only support
IP version 4 addresses.
If both the allowed and blocked IP lists are populated, Snowflake applies the blocked
list first, followed by the allowed list.
So network policies can be created and applied at the account level, in which case the
policy applies to the entire account.
Network policies can also be assigned to the individual user, in which case they apply
solely to that user.

When a user level network policy is applied, that policy takes precedence over account
level policy.
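A sketch of a network policy (policy name, IP addresses, and user name are hypothetical):

```sql
CREATE NETWORK POLICY corp_policy
  ALLOWED_IP_LIST = ('192.168.1.0/24', '10.0.0.5')
  BLOCKED_IP_LIST = ('192.168.1.99');

-- Apply at the account level (affects the whole account)
ALTER ACCOUNT SET NETWORK_POLICY = corp_policy;

-- Or apply to an individual user (takes precedence over the account-level policy)
ALTER USER analyst1 SET NETWORK_POLICY = corp_policy;
```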
Now let's talk about private connectivity.
So by default, your snowflake instance is available over the public Internet with access
protected by different security measures such as MFA, HTTPS and network rules.
Now, if your organization demands that your snowflake instance not be available
through the Internet. Snowflake supports private connectivity through which you can
ensure that access to your snowflake instance is via a private connection.
And then you can optionally block all Internet access. It's good to note that private
connectivity to Snowflake requires at least the business critical edition.
So depending on your cloud provider, Snowflake provides private connectivity
methods specific to that cloud. So it supports AWS PRIVATELINK. Azure Privatelink
and Google Cloud Private Services Connect. Finally, let's talk about encryption in
transit. So Snowflake encrypts all communication end to end. TLS 1.2 is used to
encrypt data while it is in transit. Everything in Snowflake is connected over HTTPS,
including connectivity to the Snowflake web UI connectivity via JDBC, via ODBC, as
well as via the Python connector and other connection mechanisms. And then all
access to snowflake services is accomplished through rest APIs, which are also
invoked over the HTTPS protocol. So this lecture covered the network level security
capabilities that Snowflake provides. We talked about TLS 1.2, which is that transport
level encryption.
MFA is available to all Snowflake accounts and in all Snowflake editions; it is enabled on a
per-user basis, and users must enroll themselves.

Snowflake supports masking policies that may be applied to columns and enforced at the column
level to provide column-level security. Column-level security is achieved by dynamic data masking
or external Tokenization.
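A sketch of dynamic data masking (policy, role, table, and column names are hypothetical):

```sql
-- Mask the column for everyone except a privileged role
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
    ELSE '*** MASKED ***'
  END;

ALTER TABLE customers MODIFY COLUMN email
  SET MASKING POLICY email_mask;
```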

Row-level security is implemented by creating row access policies, which include conditions and
functions that govern which rows are returned during query execution.
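A sketch of a row access policy (names and the role-to-store mapping are hypothetical and
purely illustrative):

```sql
-- Admins see all rows; everyone else only sees rows for store 'A'
CREATE ROW ACCESS POLICY store_policy AS (store_col STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ADMIN' OR store_col = 'A';

ALTER TABLE sales ADD ROW ACCESS POLICY store_policy ON (store);
```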

Secure views can be used to return only certain rows from a table. Additionally, secure
views hide the underlying data by removing some of the internal Snowflake
optimizations.

In Snowflake, all data at rest is encrypted using AES 256-bit encryption.

Snowflake encrypts all data in transit using Transport Layer Security (TLS) 1.2. This applies to all
Snowflake connections, including those made through the Snowflake Web interface, JDBC, ODBC,
and the Python connector

Administrators can configure the system to allow or deny access to specific IP addresses through
network policies. A network policy consists of the policy name, a comma-separated list of allowed
IP addresses, and a list of blocked IP addresses

Snowflake supports SCIM 2.0 and is compatible with Okta and Azure Active Directory. SCIM is an
open standard that provides automatic user provisioning and role synchronization based on
identity provider information. When a new user is created in the identity provider, the SCIM
automatically provisions the user in Snowflake. Additionally, SCIM can sync groups defined in an
identity provider with Snowflake roles.

Snowflake is pre-configured with the following roles. ACCOUNTADMIN is a full-privilege account
administrator role. USERADMIN provides the ability to create users and roles. SECURITYADMIN
inherits the privileges of USERADMIN and can manage object grants globally. SYSADMIN can create
and manage the majority of Snowflake objects. ORGADMIN manages the operations at an
organizational level. There is also the PUBLIC role, which is automatically assigned to everyone.

ACCOUNTADMIN is the most powerful role in a Snowflake account. Due to the role hierarchy and
privilege inheritance, ACCOUNTADMIN inherits all the privileges that SECURITYADMIN and
USERADMIN have.

Snowflake's access control system is built on the RBAC model: privileges are granted to roles, and
roles are granted to users. The privileges associated with a role are available to all users assigned
to it. Snowflake also supports discretionary access control (DAC), which means that the role that
created an object owns it and can grant access on that object to other roles.
Extending Snowflake functionality:

Secure UDFs:
Stored Procedures:

Snowpark:
Snowflake Scripting:

Extending Snowflake functionality Quiz:


An external function, unlike other UDFs, does not include its own code; instead, it invokes code
that is stored and run outside of Snowflake. For an external function, the only thing that is kept
inside Snowflake is information that Snowflake uses to invoke the remote service that contains
the code.
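The setup described above can be sketched as below. This is a minimal, hedged example: the integration name, role ARN, proxy URL, and function name are all placeholders that must match your own cloud configuration.

```sql
-- Hypothetical API integration; the ARN and URL are placeholders.
CREATE OR REPLACE API INTEGRATION my_api_integration
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_external_fn_role'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
  ENABLED = TRUE;

-- Snowflake stores only this metadata; the actual code runs at the remote service.
CREATE OR REPLACE EXTERNAL FUNCTION sentiment_score(input_text VARCHAR)
  RETURNS FLOAT
  API_INTEGRATION = my_api_integration
  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/sentiment';
```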

Snowflake Scripting is an extension to SQL that allows you to use procedural logic similar to that
found in programming languages. Snowflake Scripting allows you to use variables, if-else
expressions, looping, cursors, manage result sets, and allows you to handle errors. Snowflake
scripting is typically used to create stored procedures, but it may also be used to create procedural
code outside of a stored procedure.
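A minimal sketch of such procedural code outside a stored procedure (an anonymous Snowflake Scripting block, runnable in Snowsight); the table name is an assumption:

```sql
DECLARE
  total_rows INTEGER DEFAULT 0;
BEGIN
  -- Assign a query result into a variable:
  SELECT COUNT(*) INTO :total_rows FROM my_table;
  -- Branch on the result:
  IF (total_rows > 1000) THEN
    RETURN 'large table: ' || total_rows;
  ELSE
    RETURN 'small table: ' || total_rows;
  END IF;
END;
```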

Snowpark is a library created by Snowflake that provides APIs for accessing and processing data
in applications written in a programming language other than SQL. Snowpark allows programmers
to utilize common programming languages such as Java, Scala, and Python to construct apps that
handle data using standard programming structures. Snowpark automatically converts the data-
processing programming constructs to SQL and pushes it down to Snowflake for execution. As a
result, developers may utilize a familiar language while benefiting from Snowflake's scale and
execution engine.

Stored procedures are often used to perform recurring administrative activities. For example, in a
particular organization, setting up a new user on the system may require creating the user, granting
them several roles, creating a private database for them, etc. These steps can easily be placed in a
stored procedure, which can then be called whenever there is a requirement to create a new user.
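The onboarding steps above could be sketched as a procedure like the following (Snowflake Scripting); the role name and database naming convention are assumptions:

```sql
CREATE OR REPLACE PROCEDURE setup_new_user(user_name VARCHAR)
  RETURNS VARCHAR
  LANGUAGE SQL
  EXECUTE AS CALLER
AS
$$
BEGIN
  -- Create the user, grant a role, and create a private database:
  EXECUTE IMMEDIATE 'CREATE USER IF NOT EXISTS ' || user_name;
  EXECUTE IMMEDIATE 'GRANT ROLE ANALYST TO USER ' || user_name;
  EXECUTE IMMEDIATE 'CREATE DATABASE IF NOT EXISTS ' || user_name || '_DB';
  RETURN 'user ' || user_name || ' set up';
END;
$$;

-- CALL setup_new_user('JSMITH');
```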

You can create functions using typical programming languages such as Java, Python, or Scala,
and those functions can be exposed in Snowflake as UDFs. You can then use these UDFs in your
SQL just like any other UDFs. To execute these UDFs, Snowflake creates a sandboxed run-time
environment within the virtual warehouse, and the UDFs execute inside that sandbox. This
approach also ensures parallel execution of the UDFs by default, because they scale on
Snowflake's own infrastructure.

Resource Management in Snowflake:


Resource Monitors: These help to monitor the virtual warehouses.
Viewing Usage and Billing in Snowflake:
System usage and Billing –
Account Usage schema:

Information Schema:
The ACCOUNT_USAGE schema consists of several views that provide usage metrics and metadata
information at the account level. Data provided by the ACCOUNT_USAGE views is NOT real-time;
it typically refreshes with a lag of 45 minutes to 3 hours, depending on the view. The data in
these views is retained for up to 365 days.

The data provided via the INFORMATION_SCHEMA views is real-time, with no latency in
the information provided. So, if you are asked which schema should be used when there is a
requirement to view real-time data, the views in INFORMATION_SCHEMA should be used, as
they contain real-time information.
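The contrast can be sketched with two queries against query history; column names are illustrative:

```sql
-- Account-level history (lagged, retained up to 365 days):
SELECT query_id, total_elapsed_time
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD(day, -7, CURRENT_TIMESTAMP());

-- Real-time history via the INFORMATION_SCHEMA table function
-- (no latency, much shorter retention):
SELECT query_id, total_elapsed_time
FROM TABLE(information_schema.query_history())
ORDER BY start_time DESC;
```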

A resource monitor cannot control the costs of cloud services. A warehouse-level resource
monitor can monitor credit usage by Cloud Services, but the resource monitor cannot suspend the
cloud services.
Resource monitors can track & manage a single virtual warehouse against a defined quota.
Resource monitors can be created to track the credit usage of multiple virtual warehouses
together. Resource Monitors can also be created at the account level, which means that such
resource monitors track credit usage at the account level, considering the credit usage of all virtual
warehouses.

Resource monitors help manage virtual warehouse costs and avoid unexpected credit usage.
Credit usage can be controlled with resource monitors by monitoring credit usage against a
defined upper limit, notifying administrators when a certain percentage of the limit is reached, and
even suspending virtual warehouses if necessary.
Another course on Udemy:

Snowflake Complete Course

Learning Plan:
Snowflake Architecture:
Layers in architecture and the importance of each layer.
In Snowflake, the data is stored in micro-partitions in columnar format, which gives good query
performance.


Query processing layer→
Snowflake provides 2 options for increasing compute resources.

You can resize a virtual warehouse anytime using the web UI or the SQL interface.
Scale up or vertical scaling (warehouse resizing) – increase the size of the virtual
warehouse if your queries are taking too long or data loading is slow.
Scale out or horizontal scaling – increase the number of clusters to avoid queries
going into a queue.
Snowflake offers up to 10 clusters per warehouse as part of the scale-out option.
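Both options can be sketched with ALTER WAREHOUSE; the warehouse name is an assumption:

```sql
-- Scale up (resize) an existing warehouse:
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: configure a multi-cluster warehouse so extra clusters
-- start automatically when queries begin to queue:
ALTER WAREHOUSE my_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';
```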

The different roles we can see in a Snowflake account are below:

ACCOUNTADMIN, PUBLIC, SECURITYADMIN, SYSADMIN, USERADMIN, ORGADMIN

The option below is visible only when we log in as SECURITYADMIN or
ACCOUNTADMIN:

Within Warehouses we have an option called multi-cluster warehouse.

Two scaling policies are available – Standard and Economy.
Standard means clusters start as soon as queries begin to queue, so queries don't
have to wait; Economy conserves credits by letting queries queue until there is
enough load to justify an additional cluster.
Bulk load & continuous load.
Other ways to load data into Snowflake tables:

Copying the data from a data lake to a Snowflake table:

To keep all the stage objects in one schema, we can create a dedicated schema (for
example, an external schema) as shown below:

If we want to load the [Link] file specifically, then we have to mention it in the
below way:
To generate a sequence number in a Snowflake table, below is the way:

A stage is nothing but the location of files; it can be internal or external to
Snowflake.
Stage Types:

A stage object is a DB object created in a schema.

To check the properties of the stage object:

To load the data from data-lake S3 bucket storage into Snowflake tables:

If we don't mention any file format properties, then by default a CSV file is assumed.
To alter a file format object:
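A sketch of defining and altering a file format, and inspecting a stage; the object names are assumptions:

```sql
-- Reusable file format for pipe-delimited CSV with a header row:
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1;

-- Change the delimiter later without recreating the format:
ALTER FILE FORMAT my_csv_format SET FIELD_DELIMITER = ',';

-- Inspect a stage's properties:
DESC STAGE my_stage;
```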

SnowSQL:
Below is the connection string used to connect to SnowSQL:

Above, the account identifier and user id are passed.

Internal stages and SnowSQL:

A user stage is used to store files in the staging area allocated to that user.
This space is allocated automatically when the user is created.

A table stage is allocated to one particular table. One or more users can load data into
that table through it.

A named stage is a DB object.

Difference between COPY & PUT commands in Snowflake:
PUT → used to copy files from a local desktop or local server into the Snowflake
internal staging area.
COPY → used to load data from external or internal stages into Snowflake tables.

We can mention the file format while creating the stage itself.
Below is the way to copy files from the desktop to the user staging area:
Executed the PUT command to the user stage using SnowSQL:
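A sketch of the two commands together (run from SnowSQL; the path and table name are illustrative — note PUT gzip-compresses files by default, hence the .gz suffix):

```sql
-- Upload a local file into the user stage (@~):
PUT file:///tmp/orders.csv @~/staged;

-- Then load it from the stage into a table:
COPY INTO orders
FROM @~/staged/orders.csv.gz
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```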

A named staging area is a DB object; we have to create it in the Snowflake DB.

If we want to check all the internal stages created above:

Copying the data from different stages into a Snowflake table:

Copy command Options:

The different copy options available are:


In detail about each Copy options:
Return_error:

On_error:

Force property:

Size Limit:

Truncate columns or enforce length:


Purge:

Load Uncertain files:

If we want to list down the files in a stage → LIST @<stage_name>

ON_ERROR, TRUNCATECOLUMNS, and VALIDATION_MODE are the important
properties/commands in real-world scenarios.
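A hedged sketch combining these options; table, stage, and format names are assumptions:

```sql
-- Load with common copy options:
COPY INTO orders
FROM @my_stage/orders/
FILE_FORMAT = (FORMAT_NAME = my_csv_format)
ON_ERROR = 'CONTINUE'        -- skip bad rows instead of aborting the load
TRUNCATECOLUMNS = TRUE       -- silently truncate over-length strings
PURGE = TRUE;                -- remove files from the stage after loading

-- Dry run: report errors without loading anything:
COPY INTO orders
FROM @my_stage/orders/
VALIDATION_MODE = RETURN_ERRORS;
```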

Snowflake-Azure Integration

For the storage integration object, we need to go to IAM (access control) and
provide access to the necessary Snowflake-related components.
Processing semi-structured data
Processing JSON data:
If we have multiple elements in a list, we call it an array. A JSON file can contain
an array of elements.
How to extract the data from a JSON file and load it into Snowflake tables, step by step:

To copy the JSON data into a stage table:

How can we store an Avro, Parquet, or JSON file in Snowflake? Or, while processing
semi-structured data, how can we store the entire file in Snowflake? → We can create
a table with a VARIANT column and load the entire file into that VARIANT field.
Later, by using a parsing method, we can extract the data from the VARIANT type and
load it into rows and columns.
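The VARIANT approach can be sketched as below; table, stage, and attribute names are assumptions:

```sql
-- Land the raw JSON into a single VARIANT column:
CREATE OR REPLACE TABLE raw_json (v VARIANT);

COPY INTO raw_json
FROM @my_stage/data/
FILE_FORMAT = (TYPE = 'JSON');

-- Parse attributes out of the VARIANT, casting with :: :
SELECT
  v:name::STRING AS name,
  v:age::NUMBER  AS age
FROM raw_json;
```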

If we want to fetch array data from that JSON file, we need to give the
indexing position as below:

In this case we will fetch only one element from the array.

In case we want to fetch all the elements from the array:

To get the array size:

To find out how many elements are in each row:

Nested data in the JSON file:


When we want to fetch the address part from the above:

The output will be as shown below:

Below is the way to extract the entire data in a single SQL statement using UNION ALL:

To avoid the duplicates, we will have to add the where condition as below:

By using the FLATTEN approach, we can avoid writing multiple
UNION ALL statements.

Whenever we do flattening, we have to reference .value for the flattened field, as shown
above for the pets field ([Link]), since we flattened the pets field.

When we flatten, nested array elements are expanded into separate rows.
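A sketch of LATERAL FLATTEN over a VARIANT column holding a pets array; table and column names are assumptions:

```sql
SELECT
  v:name::STRING  AS owner_name,
  p.value::STRING AS pet            -- .value gives each array element
FROM raw_json,
LATERAL FLATTEN(input => v:pets) p; -- one output row per array element
```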
Processing XML files:
Below is a sample XML file which contains the book details:

If we don't give the file format below, by default it would take the XML format:

Snowflake Pricing: Cost calculation in Snowflake


Types of Cost:

Storage Cost:

How to choose storage type:

After we analyze the data size, we can switch the storage type to capacity storage.
Compute Cost:

Snowflake Credit:
It's a unit of measure. The cost is calculated using this measure – credits.

Serverless features:

Types of Cost:

The storage cost depends on:


Snowpipe

Continuous Loading:

Snowpipe is a named DB object that contains the COPY command used to load the data.

It's a serverless setup. As soon as the files are placed in an AWS S3 bucket or ADLS, they are
immediately loaded into Snowflake tables. For that, some configuration settings
need to be done.
How to troubleshoot issues in Snowpipe:

We have to check whether the pipe is up and running using the below:
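A sketch of both troubleshooting steps; the pipe and table names are assumptions:

```sql
-- Step 1: check the pipe status (returns a JSON string with
-- executionState, pendingFileCount, etc.):
SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.my_pipe');

-- Step 2: check the load history for the target table,
-- e.g. going back 10 hours:
SELECT *
FROM TABLE(information_schema.copy_history(
  TABLE_NAME => 'MY_TABLE',
  START_TIME => DATEADD(hour, -10, CURRENT_TIMESTAMP())));
```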

Step 2 : we can check for the copy history:


In case we want to check the copy history from 10 hours back, below is the way:

From the above we can understand that zero rows were loaded into the table from the
last file, since it shows row_count as zero.

The error shown for the last file is below:

Validating the source files: this is not in our control, as the files come entirely from the source team.

In some cases there will be an issue with the file format object. In that case we can manually
run the COPY command that is used in the pipe, as shown below, by specifying the particular
file name:

We have to load the historical files by running the COPY command manually.
In case the delimiter was changed from | to , then the below step needs to be executed manually:
How can we manage the pipes:

If we want to see the pipes listed in the DB, below are some of the ways:

Caching in Snowflake:
Cache is a temporary storage location that stores files/copies of data so that they can be
accessed faster in near future. It plays a vital role in saving costs and speeding up the results
and improves query performance.
There are 2 types of cache → query result cache & local disk cache.
If we want to reuse those cached results in the next 24 hours, we can
access them in a faster and easier way.
Architecture of Snowflake: the result cache is located in the cloud services layer; the local disk
cache is located in the virtual warehouse layer. When a query is executed, it first checks the
result cache for the output. If the desired data is not available in the query result
cache, then it looks in the local disk cache and brings the data up from there.

The data remains available in the result cache even while the virtual warehouse is suspended;
when the warehouse is suspended, it is the local disk cache that is cleared.
Query results are available in the cache for 24 hours.
The result cache is available across different virtual warehouses.

The result cache works as long as there is no change to the underlying data.
Query results returned to one user are available to another user on the system who executes the
same query.
The query must be exactly the same, with no change in the underlying data – not even
re-ordering of columns or selecting a subset of the data.
The local disk cache also works when we query for a subset of the data it holds.
The size of the local disk cache depends on the size of the virtual warehouse we are using.
For example, an X-Small warehouse can't hold millions of records, but it can fetch part of the
data from the local disk cache and the remainder from the remote disk.
As we know, metadata management is done by the cloud services layer in Snowflake.

The difference in the time taken when re-ordering the columns but querying the same data:

To disable the result cache → ALTER SESSION SET USE_CACHED_RESULT = FALSE;


We can set the auto-suspend option to a certain time for the virtual warehouse.

We can suspend the warehouse in the below way:

Time Travel & Fail Safe Mechanisms

There is no need to enable Time Travel; it is enabled automatically by default.

The higher the retention period, the higher the storage cost.


The retention period can be set at the account, database, schema, or table level.
To query historical data:
How to restore objects:

With Time Travel we can retain data for up to 90 days; after the Time Travel period expires,
the fail-safe period comes into the picture. Data can be retrieved from fail-safe only
through Snowflake support.

Continuous data protection Life cycle:

Hands on:

It shows the 10 days retention period:


We can change the data retention period in the below way:

Checking what the data in the above table looked like 25 hours back:

If the data is not available within the retention period, or if we query beyond the time the table
has existed, the below error is thrown:

If we want the data at a certain point in time, we use AT with a timestamp. If we want the data
from a certain period of time ago, we use OFFSET, as mentioned below:

For semi-structured data, whenever we cast a column from one datatype to another, we use the
double colon (::), as in the last query in the above screenshot.
Below is another kind of retrieval of data, apart from OFFSET and timestamp, i.e., BEFORE (statement):
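The three retrieval styles can be sketched as below; the table name and query ID are placeholders:

```sql
-- At a specific point in time:
SELECT * FROM my_table
AT (TIMESTAMP => '2024-01-15 08:00:00'::TIMESTAMP_LTZ);

-- A certain period of time ago (25 hours):
SELECT * FROM my_table AT (OFFSET => -60*60*25);

-- Just before a given statement ran (query ID is a placeholder):
SELECT * FROM my_table
BEFORE (STATEMENT => '01a2b3c4-0000-0000-0000-000000000000');
```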

If we want to check the storage metrics we can see the below:

The below indicates that the Time Travel period is completed and the data is currently in the
fail-safe zone:

The below query converts the number of bytes to GBs to see how much space the table is
occupying.
To create a schema with a certain data retention period:

The above retention period will be applicable to all the underlying tables.
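A sketch of setting retention at the schema and table level; the names are assumptions:

```sql
-- Schema-level retention becomes the default for tables created inside it:
CREATE OR REPLACE SCHEMA reporting
  DATA_RETENTION_TIME_IN_DAYS = 30;

-- Change an existing table's retention period:
ALTER TABLE reporting.sales SET DATA_RETENTION_TIME_IN_DAYS = 10;
```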

Zero Copy Cloning

*** If the source table has 1000 records and it is cloned to a new table, and the new cloned table
is then loaded with another 200 new records, the storage cost will be incurred only on the new
200 records, not on the 1000 records that still point to the original source table. This is
how zero-copy cloning works. ***

Why we use zero-copy cloning → when we want to do unit testing or integration testing and
bring some data from a prod table to a dev table, there will be no additional storage charge.
Cloning Syntax:
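A minimal sketch of the clone syntax, including the Time Travel combination covered later in this section; the table names are illustrative:

```sql
-- Basic zero-copy clone:
CREATE TABLE sales_dev CLONE sales_prod;

-- Clone the table as it existed 24 hours ago (Time Travel + clone):
CREATE TABLE sales_yesterday CLONE sales_prod AT (OFFSET => -60*60*24);
```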

Objects that can be cloned:

Hands on:
While cloning, we cannot apply any filters; we have to clone the entire object.
If we update something in the main table, it does not affect the cloned table, and vice
versa.
*** The below way of creating a backup table also works in Snowflake, but it is costly,
whereas cloning the table costs nothing extra because the data points to the main table. So the
below is the wrong way of taking a backup of a table. ***

Merging time travel concept with cloning:


While the Time Travel retention period is active, there will be no storage cost for the cloned
table. Once the Time Travel period is over, storage cost is associated with the cloned
table.

Table Types:
3 types of tables – Permanent, Temporary, Transient Tables.
Permanent Tables: the default table type in Snowflake. These are the regular, common
tables. They exist until we drop them explicitly. These are the tables that can have a
Time Travel period of up to 90 days and a fail-safe period of 7 days.

Transient Tables:
These are similar to permanent tables but with a maximum 1-day retention period and no
fail-safe period. They exist until we drop them. This type is useful when data protection is not
required.
Defining stage tables as transient is a best practice: the Time Travel period is only 1 day, and
there is no fail-safe period for this type.
SYNTAX:
CREATE TRANSIENT TABLE <TABLE_NAME> (...);

Temporary Tables: this type of table exists only within a session. Once the session ends,
the table is dropped completely and is not recoverable. The retention period is up to 1 day,
and the session must stay active for the table to be retained. Even though multiple
worksheets may be open, this type of table is accessible only in the SQL worksheet (session)
where it was created and cannot be accessed from a different worksheet.

These tables can be used for temporary processing; for example, they can be used in procedures
and dropped at the end of the procedure. They are useful for intermediary storage.

SYNTAX:
CREATE TEMPORARY TABLE <TABLE_NAME> (...);

Key points to Remember→


We can’t convert any type of table to other type.
We can create transient databases and transient schemas.
Tables created under transient databases/schemas are by default transient.
We can create a temporary table with same name as perm/tran table. If we query with that
table name, it fetches data from temporary table in that session.
To find the table type → Look at the ‘Kind’ field in SHOW TABLES properties.
Comparison of Tables:

Working with Transient Tables:


The tables which we create in a transient schema are of transient type by default. We cannot
alter the retention period from 1 to 2 days for transient tables, as shown in the example below:

A transient table that was dropped a while ago can be restored in the below way:

For temp tables:

Even though a table was renamed to a different name, the original table can still be retrieved
using the UNDROP command. This works not only for temporary tables but can also be done
for permanent tables.
Two different table types can be created with the same name, as shown below:

RBAC – Role Based Access Control – Access Control in Snowflake


In real time we may not be operating these; only the admins will operate them. It is important to
know the below concepts:

2 types of access controls – RBAC & DAC:

RBAC – access privileges are assigned to roles, and those roles are assigned to the
users.

Different types of Privileges:


Object Hierarchy:

Roles in Snowflake: roles are the entities to which privileges on securable objects can be
granted, and from which they can be revoked.
Roles are assigned to users to allow them to perform actions required for business functions.
A user can be assigned multiple roles. This allows users to switch roles.

2 types of Roles: → System defined roles & Custom roles.


Account admin, ORG Admin, Public, Security Admin, Sys Admin, User Admin are the
system defined roles.

System Defined Roles:


Org Admin – Organization administrator → the role that manages the operations at the
organization level. More specifically, this role can:
- Create accounts in the organization.
- View all accounts in the organization, as well as all regions enabled for the
organization.
- View usage information across the organization.
Account Admin – Account administrator → the top-level role in the system.
Only the account admin can see all account-related things such as usage, billing, users, roles,
sessions, and reader account details.
This admin role should be granted to only a limited number of users.
The account admin can enable multi-factor authentication.
It encapsulates the SYSADMIN and SECURITYADMIN system-defined roles.

Security admin and User admin:

The security admin can grant access to roles and users, and inherits the privileges of the user
admin role.
System Defined Roles:

Role Hierarchy in Snowflake:

Account admin is the boss: it has all access and can do anything. Whatever security
admin and sysadmin can do can also be done by account admin. Whatever user admin can do
can be done by security admin as well, since security admin inherits the privileges of user admin.
Public is the lowest role, and account admin is the top role.
The custom roles are user-defined roles.
To explain everything:

Databases, schemas, tables, and privileges will be managed by the sysadmin; roles and users
will be managed by the security admin.
Roles and Users Creation:

From the above, whatever the sales user can do, the sales admin can also do.

---- user 1 - Account Admin ----
CREATE USER RANJIT PASSWORD = 'ABC123'
DEFAULT_ROLE = ACCOUNTADMIN
MUST_CHANGE_PASSWORD = TRUE;
-- Password changed to Raghurama123
GRANT ROLE ACCOUNTADMIN TO USER RANJIT;

---- user 2 - Security Admin ----
CREATE USER CHARLES PASSWORD = 'ABC123'
DEFAULT_ROLE = SECURITYADMIN
MUST_CHANGE_PASSWORD = TRUE;
GRANT ROLE SECURITYADMIN TO USER CHARLES;

---- user 3 - Sysadmin ----
CREATE USER JANET PASSWORD = 'ABC123'
DEFAULT_ROLE = SYSADMIN
MUST_CHANGE_PASSWORD = TRUE;
GRANT ROLE SYSADMIN TO USER JANET;

Custom roles are mostly granted to SYSADMIN so that they roll up the role hierarchy.

The security admin will not be able to create warehouses.

If we want to grant roles to users, or privileges on objects to roles, below is the way:

Views & Materialized Views:

In Snowflake we can see 3 types of views:

Non-materialized views (normal views), secure views & materialized views.
Secure views don't allow the user to view the definition of the view.
How to determine whether a view is a secure view or not?

Materialized views:

Refresh of Materialized Views:

Costs of Materialized Views:

When to create a materialized view and when to use Normal View:

Advantages of Materialized Views:

Limitations of materialized Views:

Secure views & Materialized Views are created by Sysadmin.


Dynamic Data Masking Agenda:
• Column level security
• Masking policies
• Dynamic Data masking
• Creating Mask Policies
• Applying Masking Policies
• Altering or Dropping Policies

Column-level security is used to protect customer information.


PHI – Protected Health Information
PII – Personally Identifiable Information

Column-level security is composed of 2 features:

Dynamic Data Masking → the process of hiding data by masking it with other characters. We can
create masking policies to hide the data present in columns.
& External Tokenization → the process of hiding sensitive data by replacing it with cipher text.
External Tokenization makes use of masking policies with external functions created on the
external cloud provider side.

Masking Policies:

Dynamic Data Masking: the data is not changed in storage or in any table, but when a query is
executed, the output data is masked dynamically and displayed.

Creation of masking Policy:


Applying Masking Policies:

We can apply the masking policies at column level.
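The policy lifecycle described in this section can be sketched as below; the table, column, and role names are assumptions (the policy name matches the CUSTOMER_PHONE example used later):

```sql
-- Create a policy: full value for the PAYROLL role, masked otherwise.
CREATE OR REPLACE MASKING POLICY customer_phone AS
  (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() = 'PAYROLL' THEN val
    ELSE '***-***-' || RIGHT(val, 4)
  END;

-- Apply it at the column level:
ALTER TABLE customers MODIFY COLUMN phone
  SET MASKING POLICY customer_phone;

-- Unset it (required before the policy can be dropped):
ALTER TABLE customers MODIFY COLUMN phone UNSET MASKING POLICY;
```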


Removing a masking policy:

We do not need to mention the masking policy name when removing it, as only one policy can
be applied to a column.

For altering and dropping policies:

We cannot clone the tables which are imported from share.

To check what are all the masking policies:


SHOW MASKING POLICIES;
If we want to see a particular masking policy:
DESC MASKING POLICY CUSTOMER_PHONE;

Before a masking policy is dropped, we have to unset it and then drop the policy.

Masking policy applying on a view:


Secure Data Sharing

For non-Snowflake users, we must create a reader account and then share with it.

One consumer can get data from multiple providers, and one provider can share data with
multiple consumers.
A Snowflake account can act as both a provider and a consumer.
Objects that can be shared are:

Data sharing is supported only between snowflake accounts.

There are a total of 3 Snowflake accounts here:

Consumer account, reader account, provider account. The reader account belongs to the
provider account.
Sharing the data and dropping the share object using the Snowsight window:

We can share the complete database or schema with other users.


To see the shares on an object we can execute the below:

We can create a database from the share:

Creating a Reader Account:

How to see reader accounts:


SHOW MANAGED ACCOUNTS;
Creation of a reader account and sharing data with a non-Snowflake user via that reader account:
DATA UNLOADING:
Unloading process → the process of unloading data into files is the same as the loading process,
except in reverse.
Step 1: use the COPY INTO <location> command to copy the data from the Snowflake table
into one or more files in a Snowflake or external stage.
Syntax: COPY INTO @STAGE
FROM TABLE_NAME
<OPTIONS>
There are 5 different options for the unloading process:
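An illustrative unload with several of those options combined; the stage and table names are assumptions:

```sql
COPY INTO @my_stage/exports/orders_
FROM orders
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
HEADER = TRUE             -- include column headers in each file
SINGLE = FALSE            -- allow splitting into multiple files
MAX_FILE_SIZE = 50000000  -- ~50 MB per output file
OVERWRITE = TRUE;         -- replace files with matching names
```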

Data Sampling:
Data sampling is selecting a part of the data, or a subset of records, from a table.
This is used to build and test whether a query is syntactically correct.
It is mainly used for query building & testing, and also for data analysis and understanding.
It is useful in a dev environment, where we use small warehouses and occupy less storage.
We can sample a fraction or percentage of rows, and we can also sample a fixed number of rows.

Syntax:
Some of the samples:
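Some sampling sketches; the table name is an assumption:

```sql
-- ~10% of rows, chosen per-row (BERNOULLI/ROW method):
SELECT * FROM orders SAMPLE ROW (10);

-- ~10% of micro-partition blocks (SYSTEM/BLOCK method, faster):
SELECT * FROM orders SAMPLE SYSTEM (10);

-- A fixed number of rows:
SELECT * FROM orders SAMPLE (50 ROWS);

-- A seed makes the sample repeatable across runs:
SELECT * FROM orders SAMPLE (10) SEED (42);
```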

Sampling the data – lab part:

In the below case, the same data cannot be guaranteed as output, because we are not giving a
seed number.

In the below case, since we are sampling with the same seed, the data is the same in both cases.

External Tables in Snowflake:


Metadata of External Tables:

Steps in creating External Tables:

An external table always exposes VALUE as its first column, which stores the entire record in a
VARIANT datatype.
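The steps can be sketched as below; the stage, table, and column mappings are assumptions (for CSV files the VALUE keys are c1, c2, …):

```sql
CREATE OR REPLACE EXTERNAL TABLE ext_orders (
  order_id NUMBER AS (VALUE:c1::NUMBER),
  amount   NUMBER AS (VALUE:c2::NUMBER)
)
WITH LOCATION = @my_stage/orders/
FILE_FORMAT = (TYPE = 'CSV')
AUTO_REFRESH = FALSE;

-- VALUE holds the whole record; the defined columns are derived from it:
SELECT VALUE, order_id, amount FROM ext_orders;
```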

The data is fetched as shown below:


External stage stores 27 properties.

We build external stages to analyze the data while it is still in the source files (shown
below) and to access a file in the form of a table.

We can also build the views on external tables.

We can create a secure view (top) and a materialized view (below) on top of external tables.

We can easily identify the type of a view by looking at its icon symbol, as shown
below:

If it is secure, it will have a lock symbol.

USER DEFINED FUNCTIONS

UDFs allow you to perform operations that are not available through the built-in,
system-defined functions.
As shown above, we can create functions with the same name but a different number of
parameters (overloading).
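Overloading can be sketched with two SQL UDFs that share a name but differ in parameters; the names are illustrative:

```sql
-- Area of a circle:
CREATE OR REPLACE FUNCTION area(radius FLOAT)
  RETURNS FLOAT
  AS 'PI() * radius * radius';

-- Area of a rectangle (same name, different parameter list):
CREATE OR REPLACE FUNCTION area(length FLOAT, width FLOAT)
  RETURNS FLOAT
  AS 'length * width';

-- Snowflake resolves the call by argument count/types:
SELECT area(2.0), area(3.0, 4.0);
```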

Stored Procedures:
SPs allow you to write procedural code which includes SQL statements, conditional
statements, looping statements & cursors.
Snowflake supports 5 languages for writing procedures:
Snowflake Scripting
JavaScript
Java
Scala
Python
From a stored procedure, you can return a single value or tabular data.
SPs support branching and looping.
SQL statements can be created and executed dynamically.
Differences between SPs & UDFs:

Stored procedure creation:

The above stored procedure contains variable declarations and the execution of SQL
statements.
We need to be careful when writing stored procedures; even a single missing semicolon can be
difficult to track down.

Resource Monitors:
A virtual warehouse consumes Snowflake credits while it runs. The number of credits consumed
depends on the size of the warehouse and how long it runs.

A resource monitor can be used to monitor credit usage by virtual warehouses and the cloud
services needed to support those warehouses.

Resource monitors help in controlling costs and avoiding unexpected credit usage.
In resource monitors we can set credit limits for a specified interval or date range.

When these limits are reached or approaching, the resource monitor can trigger various
actions such as sending alerts and suspending warehouses.

Resource monitors can only be created by account administrators or by a role that has been
granted the relevant admin privileges.
Resource monitors reduce unexpected credit usage and help us track the
credit usage.

Credit Quota: this specifies the number of Snowflake credits allocated to the monitor for the
specified frequency interval. In addition, Snowflake tracks the used credits/quota within the
specified frequency interval for all warehouses assigned to the monitor. After the specified
interval, this number resets back to 0.

The credit quota accounts for credits consumed by both user-managed virtual warehouses and
virtual warehouses used by cloud services.
Monitor type: a resource monitor can be created to monitor credit usage at the account
level or at the warehouse level (a single warehouse or a set of warehouses).
If this property is not set, the resource monitor doesn't monitor any credit usage.

Schedule: the default schedule for a resource monitor specifies that it starts monitoring credit
usage immediately, and the used credits reset back to 0 at the beginning of each calendar
month (i.e., the start of the standard Snowflake billing cycle).
We can customize the schedule for a resource monitor using the following properties:
1. Frequency: daily, weekly, monthly, yearly, or never (used credits never reset;
assigned warehouses continue using credits until the credit quota is reached).
2. Start: date & time when the resource monitor starts monitoring the assigned
warehouses. It can be immediately or any future timestamp.
3. End: date and time when Snowflake suspends the warehouses associated with the
resource monitor, regardless of whether the used credits reached any of the defined
thresholds. It can be any future timestamp. We need to be especially careful with this
property; usually we leave it unset (never).

Actions:

Warehouse suspension and Resumption:

Creating Resource Monitors:
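A hedged sketch of creating a monitor and assigning it to a warehouse (requires ACCOUNTADMIN; the names, quota, and thresholds are assumptions):

```sql
CREATE OR REPLACE RESOURCE MONITOR monthly_limit
WITH CREDIT_QUOTA = 100
     FREQUENCY = MONTHLY
     START_TIMESTAMP = IMMEDIATELY
     TRIGGERS
       ON 75  PERCENT DO NOTIFY             -- alert the admins
       ON 100 PERCENT DO SUSPEND            -- suspend after running queries finish
       ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- suspend and cancel queries

-- Assign it to a warehouse:
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = monthly_limit;
```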


Modifying the Monitors:

Creating a monitor at account level:

Tasks & Streams:

Creation of Tasks:

Alter the Task:


Some of the sample Tasks:

Using CRON we will schedule the tasks:

DAG of tasks:
Directed acyclic graph
Creation of tasks setting the dependency between the tasks:
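A sketch of a parent task plus a dependent child (a small DAG); warehouse and table names are assumptions:

```sql
CREATE OR REPLACE TASK load_raw
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'   -- daily at 02:00 UTC
AS
  INSERT INTO raw_orders SELECT * FROM staging_orders;

CREATE OR REPLACE TASK transform_orders
  WAREHOUSE = my_wh
  AFTER load_raw                          -- runs after the parent completes
AS
  INSERT INTO orders_clean SELECT * FROM raw_orders WHERE amount > 0;

-- Resume child tasks first, then the parent:
ALTER TASK transform_orders RESUME;
ALTER TASK load_raw RESUME;
```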

Task History:

Some examples of scheduling the Tasks:

To create DAG of tasks:


First we have to resume the child tasks, then the parent tasks.
To check the task history, and to check specific tasks over a certain period of time, we can
check the history as below:

Streams for CDC


A stream object records DML changes made to tables – deletes, updates, and inserts.
It stores metadata about each change so that actions can be taken using this metadata.
We call this process change data capture (CDC).
Streams track all row-level changes to a source table using an offset, but don't store the changed
data itself.
Once these changes are consumed by the target table, the offset moves to the next point.
Streams can be combined with tasks to set up continuous data pipelines.
Snowpipe + Stream + Task → continuous data load.
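The flow above can be sketched as below; the table and stream names are assumptions:

```sql
-- Track changes on a base table:
CREATE OR REPLACE STREAM orders_stream ON TABLE orders;

-- DML on the base table is recorded in the stream:
INSERT INTO orders VALUES (1, 100);

-- Reading shows the change rows plus metadata columns
-- (METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID):
SELECT * FROM orders_stream;

-- Consuming the stream in a DML statement advances its offset:
INSERT INTO orders_history SELECT order_id, amount FROM orders_stream;
```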

Metadata of Streams:

How stream works or data flows in stream:

Types of streams:

Consuming data from streams.

Every update is tracked as a pair of records: one DELETE (the old row image) and one
INSERT (the new row image), both flagged with METADATA$ISUPDATE = TRUE.
To apply tracked inserts, updates, and deletes to a target table, we write MERGE queries.
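A hedged sketch of such a merge, consuming a stream (orders_stream, orders_tgt, and the columns are placeholders). Running a DML statement against the stream advances its offset, i.e. consumes the changes:

```sql
MERGE INTO orders_tgt t
USING orders_stream s
  ON t.order_id = s.order_id
-- pure deletes (not part of an update pair):
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' AND NOT s.METADATA$ISUPDATE
  THEN DELETE
-- updates (the INSERT half of the delete+insert pair):
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' AND s.METADATA$ISUPDATE
  THEN UPDATE SET t.amount = s.amount
-- brand-new rows:
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
  THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);
```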

If we combine Snowpipe, a stream, and a task, data is ingested and processed
continuously – a continuous data pipeline.
Snowflake Alerts and Email Notifications:
When to use Snowflake alerts:
Alerts are useful when a condition in your data should be checked periodically and an
action (such as a notification) taken automatically when the condition is met.

Privileges needed to create Alert:

Creating and Dropping Alert:
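A minimal sketch (alert, warehouse, and table names are placeholders):

```sql
CREATE OR REPLACE ALERT low_stock_alert
  WAREHOUSE = my_wh
  SCHEDULE  = '60 MINUTE'
  IF (EXISTS (SELECT 1 FROM inventory WHERE quantity < 10))
  THEN
    INSERT INTO alert_log VALUES ('low stock detected', CURRENT_TIMESTAMP());

DROP ALERT IF EXISTS low_stock_alert;
```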

Viewing and executing Alerts:
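Alerts can be listed, their past runs inspected, and an alert triggered manually:

```sql
SHOW ALERTS;   -- list alerts and their state

-- Past executions over the last 24 hours:
SELECT *
FROM TABLE(INFORMATION_SCHEMA.ALERT_HISTORY(
       SCHEDULED_TIME_RANGE_START => DATEADD('hour', -24, CURRENT_TIMESTAMP())));

EXECUTE ALERT low_stock_alert;   -- run an alert manually, outside its schedule
```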


Altering Alerts:
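Common alterations, as a sketch:

```sql
ALTER ALERT low_stock_alert RESUME;   -- alerts are created suspended; resume to activate
ALTER ALERT low_stock_alert SUSPEND;
ALTER ALERT low_stock_alert SET SCHEDULE = '30 MINUTE';
```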

Alerts Hands on:

Email Notifications:

Create Notification Integration:
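A minimal email notification integration (the recipient address is a placeholder; recipients must be verified users of the Snowflake account):

```sql
CREATE OR REPLACE NOTIFICATION INTEGRATION email_int
  TYPE = EMAIL
  ENABLED = TRUE
  ALLOWED_RECIPIENTS = ('user@example.com');
```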

Types of possible Alerts:


Example Alerts with Email Notification:
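A sketch of an alert that sends an email through the integration above (load_errors and the addresses are placeholders):

```sql
CREATE OR REPLACE ALERT failed_load_alert
  WAREHOUSE = my_wh
  SCHEDULE  = '60 MINUTE'
  IF (EXISTS (SELECT 1
              FROM load_errors
              WHERE error_time > DATEADD('hour', -1, CURRENT_TIMESTAMP())))
  THEN
    CALL SYSTEM$SEND_EMAIL(
      'email_int',                      -- notification integration name
      'user@example.com',               -- comma-separated recipients
      'Load failures detected',         -- subject
      'Check the load_errors table.');  -- body
```

The email path can be tested directly with `CALL SYSTEM$SEND_EMAIL('email_int', 'user@example.com', 'Test', 'Hello');`.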

Testing the email notification:


Snowflake Interview questions and Scenario based

1. What is Snowflake and what kind of database is it?


Snowflake is a cloud-based data warehouse available as SaaS; it provides data storage and
data analytics solutions. It does not run on its own infrastructure – it can currently be
set up on AWS, Azure, and GCP.
Snowflake is a pure SQL database. It organizes data into micro-partitions that are
internally optimized and compressed, and it stores data in a columnar format.

2. Explain the architecture of Snowflake?


Snowflake has a three-layer architecture:
- Cloud services layer: a collection of services that coordinate activities across
Snowflake (authentication, metadata, query optimization, access control).
- Query processing (compute) layer: the actual processing unit of Snowflake; it executes
queries using virtual warehouses (VWHs).
- Storage layer: data is stored in columnar format in micro-partitions.

3. What are the advantages of Snowflake over traditional databases, or what new
features are available in Snowflake?
Snowflake offers many new features and advantages:
pay-as-you-go pricing, no infrastructure maintenance, easy data loading, Time Travel
and Fail-safe, zero-copy cloning, easy data sharing, and Tasks and Streams.

4. What are the stages in Snowflake, and what is the query to create a stage?


Snowflake stages are locations where data files are stored. There are 2 types of stages in
Snowflake.
External stages – if the data to be loaded into Snowflake resides in external cloud
storage such as AWS S3, Azure Blob Storage, or Google Cloud Storage, we can use external
stages.

Internal stages – store the data files inside Snowflake. We can copy files to internal
stages using the PUT command from SnowSQL.
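A sketch of both kinds of stage (bucket, credentials, and file paths are hypothetical placeholders):

```sql
-- External stage pointing at S3:
CREATE OR REPLACE STAGE my_ext_stage
  URL = 's3://my-bucket/data/'
  CREDENTIALS = (AWS_KEY_ID = '<key_id>' AWS_SECRET_KEY = '<secret_key>');

-- Named internal stage:
CREATE OR REPLACE STAGE my_int_stage;

-- Upload a local file to the internal stage (run from SnowSQL, not the web UI):
PUT file:///tmp/data.csv @my_int_stage AUTO_COMPRESS = TRUE;
```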

5. Syntax to copy file data into snowflake table:
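A minimal COPY command, assuming a staged CSV file (names are placeholders):

```sql
COPY INTO my_table
FROM @my_int_stage/data.csv.gz
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';   -- skip bad rows instead of failing the whole load
```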


6. Can you use a WHERE clause in the COPY command?
No, we can't use a WHERE clause in the COPY command, but we can do some transformations
while loading the data with COPY:
- select only the required fields;
- use functions such as SUBSTRING, CAST, etc.;
- use CASE statements.
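The transformations above can be sketched as a SELECT over the staged file ($1, $2, ... refer to the file's columns; names are placeholders):

```sql
COPY INTO my_table (id, name, amount)
FROM (SELECT $1,
             UPPER($2),
             CASE WHEN $3 = '' THEN 0 ELSE $3::NUMBER END
      FROM @my_int_stage/data.csv)
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```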

7. How can you load a JSON file into Snowflake, or how can you process and load
semi-structured data?
We can load semi-structured data into a table using a data type called VARIANT. We can
then read the data from the VARIANT column, process it into rows and columns (e.g., with
FLATTEN), and load it into another table.
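A sketch of both steps (stage, table, and JSON field names are placeholders):

```sql
-- 1) Land the raw JSON in a VARIANT column:
CREATE OR REPLACE TABLE raw_json (v VARIANT);

COPY INTO raw_json
FROM @my_int_stage/orders.json
FILE_FORMAT = (TYPE = 'JSON');

-- 2) Flatten it into rows and columns:
SELECT v:order_id::NUMBER      AS order_id,
       item.value:name::STRING AS item_name
FROM raw_json,
     LATERAL FLATTEN(INPUT => v:items) item;
```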

8. Some performance tuning techniques in snowflake.


- Use cluster keys effectively: don't define them on small tables; define them on filter
columns, join keys, and function-based (expression) columns.
- Make use of the results cache for faster retrieval of data.
- Use materialized views wisely: on frequently accessed tables with infrequent data
changes.
- Apply common SQL tuning techniques: select only the required columns, replace OR with
UNION, prefer UNION ALL when you are sure there are no duplicates, avoid inequality
combined with OR conditions, avoid unnecessary joins, and avoid DISTINCT where possible.

9. How can you handle data from a file that exceeds the length of a field in the
table?
Ans: We can handle this with TRUNCATECOLUMNS = TRUE in the COPY command. If we don't
specify this property, the COPY command fails; by default it is set to FALSE.
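As a sketch (table and stage names are placeholders):

```sql
COPY INTO my_table
FROM @my_int_stage/data.csv
FILE_FORMAT = (TYPE = 'CSV')
TRUNCATECOLUMNS = TRUE;   -- silently truncate strings that exceed the column length
```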

10. How is the cost calculated in Snowflake?


There are 2 types of cost – storage cost and compute cost. Storage is billed per terabyte
per month, and compute is billed in credits consumed by virtual warehouses while they run.

11. What is clustering in Snowflake?


Clustering is basically grouping related values together so that it improves query
performance. We define cluster keys on big tables; below are the best practices for
defining cluster keys:
Don't define them on small tables.
Define them on filter columns.
Define them on join keys.
Define them on function-based columns.
We can define cluster keys at the time of creating the tables, and we can also add or
modify them with an ALTER statement.
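Both options can be sketched as follows (the sales table and its columns are placeholders):

```sql
-- At creation time:
CREATE OR REPLACE TABLE sales (
  sale_date DATE,
  region    STRING,
  amount    NUMBER
) CLUSTER BY (sale_date, region);

-- Or later, with ALTER:
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Check how well the table is clustered:
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');
```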

12. How many cluster keys are advised on a single table?


Snowflake recommends a max of 3 or 4 columns for clustering keys on tables. Adding more
than 3-4 columns tends to increase costs more than benefits.

13. Write a query to retrieve data that was deleted from a table?
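A hedged sketch using Time Travel (orders is a placeholder table, and the query id is a placeholder):

```sql
-- Rows as they existed one hour ago:
SELECT * FROM orders AT (OFFSET => -3600);

-- Rows as they existed just before a specific statement (e.g., the DELETE):
SELECT * FROM orders BEFORE (STATEMENT => '<query_id_of_the_delete>');

-- Re-insert the rows that were deleted in the last hour:
INSERT INTO orders
SELECT * FROM orders AT (OFFSET => -3600)
MINUS
SELECT * FROM orders;
```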

14. What are all the objects we can restore after delete or drop?
We can restore deleted data from any table based on the Time Travel retention period
defined on the table. Depending on the Snowflake edition, the retention period can be 1
to 90 days. We can UNDROP tables, schemas, and databases that were dropped by mistake or
intentionally.

Choose a Time Travel retention period shorter than 90 days when the data does not need to
be retained that long; longer retention increases storage costs.

There is no indexing concept in Snowflake. Instead, we define cluster keys on large
tables for better performance.
Snowflake Interview questions Part-2
Snowflake Interview questions scenario based

Scenarios that require aggregating row values into a single delimited string can be achieved using LISTAGG in Snowflake.
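A sketch of LISTAGG (table and columns are placeholders):

```sql
-- One comma-separated list of employee names per department:
SELECT dept_id,
       LISTAGG(emp_name, ', ') WITHIN GROUP (ORDER BY emp_name) AS employees
FROM emp
GROUP BY dept_id;
```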

We can implement SCDs (slowly changing dimensions) in Snowflake using Streams and Tasks.

Stored procedures for automating data loads
