Snowflake Data Warehousing Overview
Snowflake leverages the power of cloud computing for data warehousing, allowing for scalability and flexibility.
Data Warehousing: Snowflake is primarily used for data warehousing, which involves the storage,
processing, and analysis of large volumes of data.
Virtual Warehouses: In Snowflake, you create virtual warehouses to separate and manage your
compute resources. This allows you to scale your resources up or down based on demand.
Data Storage: Snowflake separates data storage from compute resources. You pay for storage
separately from compute, and this decoupling is a key feature for cost management.
SQL Interface: Snowflake uses SQL as its query language, making it accessible to SQL-savvy users.
Automatic Query Optimization: Snowflake handles query optimization automatically, so you don't
need to manually tune queries for performance.
Data Sharing: Snowflake allows for easy and secure data sharing between organizations and within
your organization. You can share data without copying it.
Data Security: Snowflake provides robust security features, including encryption, role-based access
control, and auditing.
Elasticity: You can easily scale up or down based on your workload, which makes Snowflake cost-
effective.
Data Loading: Snowflake provides multiple ways to load data, including bulk loading, streaming, and
third-party integrations.
Snowpipe: Snowpipe is a service for automatically ingesting streaming data into Snowflake.
Semi-Structured Data: Snowflake supports semi-structured data formats like JSON, Avro, and
Parquet, allowing you to work with a variety of data types.
Time Travel: Snowflake provides features like Time Travel and Fail-safe, which allow you to recover
from accidental data changes and perform historical data analysis.
Global Availability: Snowflake is available across multiple cloud providers, giving you the flexibility to
choose your preferred cloud infrastructure.
Snowflake Data Marketplace: You can access third-party data sets and data services through
Snowflake's Data Marketplace.
Query Performance: Snowflake is known for its excellent query performance, thanks to its
architecture and optimizations.
Integration: Snowflake can be easily integrated with popular data visualization and ETL tools.
Cost Management: With separate billing for storage and compute, you have better control over
costs. Snowflake also offers features like Auto-Suspend and Auto-Resume to save on compute costs.
Compliance: Snowflake complies with various data security and compliance standards, making it
suitable for a wide range of industries.
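Several of the points above (virtual warehouses, Auto-Suspend, Auto-Resume, and elastic scaling) can be sketched in Snowflake SQL; the warehouse name and sizes below are illustrative, not from the original notes:

```sql
-- Create a warehouse that suspends itself after 60 seconds of inactivity
-- and resumes automatically when the next query arrives.
CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60      -- seconds of idle time before suspending
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Scale up or down on demand:
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```

Because storage is billed separately, suspending a warehouse stops compute charges while the stored data remains available.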
Follow these steps to add the Sample Database to your account:
1) Set your role to ACCOUNTADMIN
5) Click on SAMPLE_DATA
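Once the sample database is added, it can be queried like any other database; the schema and table below are part of Snowflake's standard TPC-H sample data set:

```sql
-- TPC-H sample data shipped with most Snowflake accounts
SELECT c_name, c_acctbal
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
LIMIT 10;
```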
Architecture
Performance tuning – performance optimization techniques. Time Travel, cloning and data sharing.
Resource management
Security
Cloning
Snowflake architecture – decoupled storage and compute, and how they work together to deliver a
scalable and efficient data warehousing solution.
How to load data into Snowflake from cloud and on-premises sources, via the bulk loading
approach and also continuous data loading using Snowpipe.
Performance optimization techniques include query rewriting, query profiling, materialized views,
clustering, and caching.
Snowflake includes cloning, Time Travel and data sharing. Time Travel allows users to query data as it
appeared at a specific point in time, making it easy to track changes and analyse data discrepancies.
Cloning enables users to create a copy of entire databases, schemas or tables within a few minutes,
which can help in creating sandboxes or backups.
Data sharing allows users to share data securely and efficiently with other Snowflake accounts,
enabling collaboration and data monetization.
Snowflake security features include multi-factor authentication, RBAC, and data encryption in transit
and at rest.
Resource management in Snowflake covers resource monitors, which are used to monitor the cost of
virtual warehouses, and the use of the INFORMATION_SCHEMA and ACCOUNT_USAGE schemas to view and
monitor compute and storage.
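A resource monitor of the kind described above might be created as follows; the quota value and warehouse name are hypothetical:

```sql
CREATE RESOURCE MONITOR monthly_quota
  WITH CREDIT_QUOTA = 100             -- hypothetical monthly credit budget
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY    -- warn at 80% of the quota
           ON 100 PERCENT DO SUSPEND; -- suspend warehouses at 100%

-- Attach the monitor to a warehouse (illustrative name):
ALTER WAREHOUSE demo_wh SET RESOURCE_MONITOR = monthly_quota;
```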
What is Snowflake?
Snowflake is a purpose-built data warehousing platform built from scratch exclusively for public
cloud platforms.
Since it has been designed for the cloud, the software is optimized for execution on the cloud and
takes advantage of cloud concepts such as decoupling storage and compute from each other, along
with cloud scalability features.
We'll learn more about this decoupling and how it results in a better outcome for the customer.
So Snowflake is provided as a software-as-a-service offering, which means that there is no hardware to provision or manage.
Any updates, optimizations and tunings are made by Snowflake itself and it's made available to all
customers automatically.
A Snowflake data warehouse is pay-per-use only, and because of the decoupled nature of storage
and compute, a customer only pays for the actual storage used and the actual compute used.
This means that you could store terabytes of data, and if you're not processing that data regularly,
you get charged only for the storage costs. So finally, Snowflake's design takes advantage of the
storage and compute scalability that is offered by the underlying cloud platforms.
Since it uses object storage, the storage available to a snowflake customer is virtually unlimited.
It's highly fault tolerant and can be scaled up to any number. Similarly, on the processing front,
Snowflake uses scalable compute clusters called virtual warehouses.
Virtual warehouses allow the processing power available to Snowflake to be scaled in many different
ways.
Snowflake was ranked second on Forbes magazine's Cloud 100 list and first on LinkedIn's 2019 US
top startup list.
With all the unique features that Snowflake has, customers worldwide are moving to Snowflake as
their platform of choice.
And slowly but surely, Snowflake is gaining traction and becoming a snowball.
And I believe this technology is seriously challenging both traditional data warehousing platforms
and the big data platforms.
If you are working with data warehousing, big data or databases in any capacity, Snowflake is worth learning.
Snowflake editions: Standard, Enterprise, Business Critical, and Virtual Private Snowflake.
So all the fundamentals of a cloud based data warehousing solution are included in Snowflake
Standard Edition.
It includes complete SQL data warehouse capabilities, data sharing, and data encryption in transit and at rest.
However, in the Standard edition, the ability to travel back in time is limited to one day.
The Enterprise Edition has all the capabilities of the Standard Edition but adds additional capabilities.
These include Time Travel up to 90 days, multi-cluster virtual warehouses, materialized views,
dynamic data masking, search optimization, external data tokenization and annual rekeying of the data.
The Business Critical Edition enhances the Enterprise Edition with additional security features such
as customer-managed keys, Payment Card Industry (PCI) compliance and private connectivity
support, as well as failover.
Virtual Private Snowflake (VPS) builds on all these editions. It has all the capabilities of the
Business Critical Edition, but also provides additional isolation through a customer-specific
metadata store and a customer-specific pool of computing resources that are not shared with any
other customer.
What that means is you will get your own isolated version of Snowflake, with dedicated compute
resources and a dedicated metadata store.
What is the minimum Snowflake edition which supports multi-factor authentication (MFA)? –
Standard Edition.
What is the minimum Snowflake edition that supports private connectivity to Snowflake? – Business
Critical.
Snowflake architecture:
Multi-cluster, shared data – the separation of storage and compute allows unlimited scalability
of each, independent of the other.
Currently, Snowflake supports AWS S3 storage, Azure Blob Storage or Google Cloud
Storage to store its data.
Since Snowflake stores data on object storage on the cloud platform, the storage
can scale indefinitely and independently of compute.
The cloud platform is responsible for providing the durability for these stored files.
Therefore, Snowflake can take advantage of the disaster recovery and fault tolerance
provided by the underlying cloud platform.
Data that is loaded into Snowflake is stored as files on the object based cloud
storage.
It is worth mentioning here that cloud-based object storage is immutable, which
means stored data cannot be updated once it is written; it can only be appended to.
If updates are required to a file that was written to an object store, you must
remove the complete file, perform the update and write the new file back to
the object store.
So this immutability of files on object store presents an interesting challenge that
Snowflake solves through its unique micro partitioning approach.
These micro partitions also form the architectural basis of many exciting snowflake
features such as time travel, secure data sharing and cloning.
So as this course progresses, we'll be talking a lot more about micro partitions, how
they are managed, how data is inserted in those micro partitions, what is the
maximum size of a micro partition and what are the immutability implications on
micro partitions?
But for the time being, in the next lecture we'll be looking at the compute layer, which
is the virtual warehouses in Snowflake.
Whenever you hear the term virtual warehouse, it actually refers to a compute
cluster.
However, each of the virtual warehouses accesses the same shared data.
Next, we have the compute layer through which queries are executed on the stored
data.
The compute engines in Snowflake are referred to as virtual warehouses.
The virtual warehouses are entirely independent of the storage.
Similarly, Snowflake also stores the count of distinct values for each column in the metadata,
along with certain other information to assist in query optimization.
Now, another important aspect is that within each micro partition, the data is stored
in a columnar format.
Storing data in columnar format enables Snowflake to optimize queries even further
when only a subset of columns is accessed.
So consider this straightforward SQL example on the screen where we are querying
the table on the left and only selecting the ID column.
We also added an ID equal to one where clause in the query.
So now Snowflake knows exactly which micro partitions contain the value one for
the ID column.
And additionally, since we selected only one column, it will limit the processing to
just that column within its columnar format.
So the result is highly efficient access of a subset of stored data as highlighted on
the screen.
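The example described above can be written out as plain SQL; the table name here is hypothetical:

```sql
-- Only the ID column is read, and only the micro partitions whose
-- metadata value range includes 1 are scanned at all.
SELECT id
FROM orders          -- hypothetical table
WHERE id = 1;
```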
So this was a bit of detail about the micro partitions in Snowflake.
We will come across micro partitions throughout this course, especially when we
look into the cloning and the data sharing section.
If the queries can execute efficiently and timely on a smaller virtual warehouse,
keeping a larger sized virtual warehouse is basically wasting resources and incurring
extra costs.
Therefore, scaling down the virtual warehouse size is a good option in such cases.
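Scaling a warehouse down (or back up) is a single statement; the warehouse name is illustrative:

```sql
-- The new size applies to new queries; queries already running
-- finish on the old size.
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'SMALL';
```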
Snowflake pricing:
Storage cost
Compute cost
Serverless cost
Data transfer
So Snowflake charges a monthly fee to all customers for the data that they store.
The storage costs are calculated based on the average storage for each month.
And it's important to note that the storage costs are calculated after compression is
applied on the data.
Now with Snowflake, you can create a virtual warehouse of any size from X-Small
to 4X-Large. Snowflake credits are consumed based on the size of the virtual warehouse and are
charged per second.
However, do note that there is a minimum of one minute charge when a virtual
warehouse starts up.
Storage cost – AWS S3 or Azure Blob storage, based on actual usage; columnar
compression and other techniques save cost. Snowflake charges a monthly fee to all
customers for the data that they store. The storage costs are calculated after
compression is applied.
Snowflake credits are consumed based on the size of the virtual warehouse and are
charged every second.
A Snowflake virtual warehouse can be created in sizes from X-Small to 4X-Large.
Snowflake offers several serverless features for which compute costs are
charged. The serverless features don't require any virtual warehouse, so they are charged
separately.
Another cost is data transfer: there is no cost for data transfer into
Snowflake, but it is chargeable when data is transferred out to other regions or to other cloud
platforms.
Cloud services: cloud services usage up to 10% of daily compute usage is free. The cloud
services layer manages metadata such as database definitions, table definitions, users, etc.
Snowflake interfaces and tools: the Snowflake web interface will be discussed in this section,
along with the majority of tools and interfaces.
Snowsight is a modern and lightweight web interface using new technologies and is a
primary method of interacting with your Snowflake instance.
SnowSQL connects to Snowflake through the command line and executes SQL queries on
your Snowflake instance. SnowSQL is available for Linux, Windows, and macOS.
COPY and Snowpipe are the two methods to load data into Snowflake.
Alternatively, the staging area can be inside the data warehouse itself.
Role of STAGE in Data Loading:
Types of stages in Snowflake: external and internal stages.
The load metadata stores a variety of information, such as the name of every file that was
loaded into that table and the time stamp corresponding to the time that a file was loaded. By
utilizing this load metadata, Snowflake ensures that it will not reprocess a previously loaded
file. The load metadata expires after 64 days. Snowflake skips over any older files for which
the load status is undetermined.
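A minimal bulk-load flow through an internal stage might look like this; the object names are hypothetical, and the PUT step runs from SnowSQL rather than the web UI:

```sql
-- Internal named stage with a CSV file format
CREATE STAGE my_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- From SnowSQL, upload a local file into the stage:
-- PUT file:///tmp/sales.csv @my_stage;

-- Bulk load the staged files into a table; the load metadata described
-- above prevents the same file from being loaded twice within 64 days.
COPY INTO sales FROM @my_stage;
```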
The COPY command allows unloading or exporting data from a table or a view and also
allows using queries (SELECT) to unload data.
When loading data into a table using the COPY command, Snowflake allows you to do
simple transformations on the data as it is being loaded by using a SELECT statement.
During the load process, the COPY command allows for modifying the order of columns,
omitting one or more columns, and casting data into specified data types. It is also possible to
truncate data using the COPY command if it is larger than the desired column width.
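These simple transformations are expressed by wrapping a SELECT around the stage, where `$1`, `$2`, ... refer to the columns of the staged file; table and stage names are hypothetical:

```sql
COPY INTO customers (id, name, signup_date)
FROM (
  SELECT $2,                         -- reorder: file column 2 becomes id
         $1,                         -- file column 1 becomes name
         TO_DATE($4, 'YYYY-MM-DD')   -- cast a text column to DATE
  FROM @my_stage                     -- file column 3 is omitted entirely
);
```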
Snowflake offers an alternative approach for tables called external tables, which permits the
creation of tables with data stored in external cloud storage. External tables remove the need
for the data to be loaded into Snowflake. In the case of an External table, the definition of the
table is still stored in Snowflake metadata and consists of table structure, file locations,
filenames, and other attributes. However, the table's data is saved outside of Snowflake. The
external table functionality enables you to query external data like a standard table. External
tables may be joined to other tables.
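An external table of the kind described above might be defined as follows; the stage, path, and column expressions are hypothetical:

```sql
-- An external table over files already sitting in cloud storage;
-- only metadata lives in Snowflake, the data stays outside.
CREATE EXTERNAL TABLE ext_events (
  event_ts TIMESTAMP AS (VALUE:ts::TIMESTAMP),  -- virtual column over raw VALUE
  payload  VARIANT   AS (VALUE:payload)
)
LOCATION = @my_s3_stage/events/
FILE_FORMAT = (TYPE = 'PARQUET');

-- Queried like a normal table:
SELECT event_ts FROM ext_events LIMIT 10;
```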
COPY command uses virtual warehouse resources. Snowpipe is billed separately and does
not use virtual warehouse resources. Snowpipe is serverless and has its own computational
capability; therefore, it does not rely on virtual warehouses for processing. Snowflake
automatically manages the compute required by a Snowpipe. Snowflake also manages the
scaling up and down of a Snowpipe as per the data load requirement. Since a Snowpipe is
serverless, its costs are charged separately from virtual warehousing fees.
Snowflake allows continuous data loading using Snowpipe, a serverless service. Snowpipe
enables you to load data in a micro-batch manner, loading small volumes of data on each
execution. The micro-batch-based data loading is used when a continuous stream of data,
such as transactions or events, must be loaded and made available to enterprises quickly.
Snowpipe enables continuous data loading and can load data within a few minutes after it
arrives in a stage. Snowpipe is serverless and has its own computational capability; therefore,
it does not rely on virtual warehouses for processing.
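A Snowpipe set up for this continuous micro-batch loading is essentially a named COPY statement; the pipe, stage, and table names are illustrative:

```sql
-- The pipe runs its COPY whenever new files land in the stage;
-- AUTO_INGEST relies on cloud storage event notifications.
CREATE PIPE events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO events FROM @my_stage/events/;
```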
When loading data into a table using the COPY command, Snowflake allows you to do
simple transformations on the data as it is being loaded. During the load process, the COPY
command allows for modifying the order of columns, omitting one or more columns, casting
data into specified data types, and truncating values. While loading the data, complex
transformations such as joins, filters, aggregations, and the use of FLATTEN are not
supported as they are not essential data transformations. Therefore, joining, filtering, and
aggregating the data are supported ONLY after the data has been loaded.
This offers features that protect data in the Snowflake environment without compromise.
It offers data encryption at rest and in transit. Multi-factor authentication is part of
the authentication mechanism.
In addition, Snowflake is also equipped with a number of cutting edge features that assist in
safeguarding of data and its subsequent recovery in the event that a human makes a mistake.
So you can restore data that has been mistakenly changed or deleted by utilizing two features:
Time Travel and UNDROP.
In this section, we'll cover some aspects of continuous data protection in Snowflake,
including Time Travel, fail-safe storage and the UNDROP functionality, and also the concept of
transient and temporary tables.
So time travel in Snowflake is a cutting-edge data protection feature that enables users to
query, retrieve and recover historical data from tables when data has been mistakenly changed
or deleted as a result of human error. You want to restore data as painlessly and as quickly as
possible.
Before the time travel functionality was introduced by Snowflake, the most common method
for recovering from the inadvertent loss of data was to restore the lost information from a
previous backup.
Snowflake stores the data in its own format using micro partitions. As new data is added to a
table, new micro partitions are added to it. The micro partitions are immutable.
Micropartitions and metadata are key to time travel. And this immutability of the micro
partition has implications for the updates and deletes performed on a table.
The example shows two rows being deleted from a table. Because micro partitions are
immutable, Snowflake cannot simply update the data in micro partition number two; rather, it
creates a new micro partition, marked as partition number four in our example. It is
essential to note that micro partitions that have been marked as deleted still exist
physically on the disk and can be read if necessary. This lays the groundwork for
Snowflake's time travel functionality.
So when users request to read data from a table using a time travel extension, wanting to
read the data as it existed at a specific point, Snowflake reads data from deleted historical
partitions to fulfil those time travel queries.
Snowflake retains these historical deleted partitions for a specific period of time before
purging them altogether. Until these partitions are purged, any ordinary user can access the
data contained in these historical partitions, and the period for which these historical
partitions are retained is known as the time travel duration.
The time travel duration generally varies from one day to 90 days depending on the
Snowflake edition that you have. So this was basically the behind-the-scenes working of time
travel. In the next lecture, we will discuss the time travel extensions and how to use them to
modify existing queries to use time travel.
The new micro partition has all the data from the deleted micro partition except the two rows
that were deleted. At the same time, micro partition number two is marked as deleted.
Snowflake also supports the UNDROP statement, which can be used to recover tables, schemas or
even complete databases after they have been dropped. The AT clause is usually used
with SELECT statements and supports three different ways of accessing historical data
(timestamp, offset, and statement ID).
UNDROP is essentially the Ctrl+Z of Snowflake – the same undo idea.
Failsafe Storage:
The data in fail-safe storage can be accessed only by the Snowflake support team, so we can
contact the Snowflake support team to recover data from fail-safe storage.
Snowflake charges for data stored in fail-safe and Time Travel storage.
Types of Tables in Snowflake:
Depending on the Snowflake edition, the Time Travel duration might range from 1 to 90
days. The Standard edition allows for one day of Time Travel. Time Travel is possible for up
to 90 days in the Enterprise edition and above.
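The retention window is controlled per object with a parameter; the table name below is hypothetical:

```sql
-- Extend the Time Travel window for a table; values above 1 day
-- require Enterprise edition or above.
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 90;
```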
In addition to protection provided by Time Travel, data that has been modified also goes
through a failsafe period. Failsafe storage is intended to provide an extra layer of protection
against data loss caused by human error. Once the Time Travel period ends, Snowflake keeps
the data for a further 7-day period as further protection. When data is in failsafe storage,
ordinary users cannot access it; only Snowflake support employees can access and recover it
if the customer requests it.
Time Travel SQL extensions allow you to see data as it existed before or at a particular time.
It can also be used to see data before an SQL statement is executed or at the point when an
SQL statement is run. Time Travel does not let you recover data for more than 90 days in the
past.
To support Time Travel queries, Snowflake supports special SQL extensions. It supports the
AT and BEFORE statements which can be used with SELECT statements or while cloning
tables, schemas, and databases. Snowflake also supports the UNDROP statement, which can
be used to recover tables, schemas, or even complete databases after they have been dropped.
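The AT, BEFORE, and UNDROP extensions described above look like this in practice; the table name is hypothetical and the statement ID is a placeholder:

```sql
-- Data as of a fixed timestamp:
SELECT * FROM sales AT (TIMESTAMP => '2023-01-01 00:00:00'::TIMESTAMP);

-- Data as it was one hour (3600 seconds) ago:
SELECT * FROM sales AT (OFFSET => -3600);

-- Data just before a specific statement ran (placeholder query ID):
SELECT * FROM sales BEFORE (STATEMENT => '<query-id>');

-- Recover a dropped table:
UNDROP TABLE sales;
```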
Cloning in Snowflake:
So Snowflake divides data in a table into several micro partitions, and the metadata in the
cloud services layer tracks micro partitions corresponding to a table.
As new data is added to a table, new micro partitions are produced. So this metadata in
Snowflake's Cloud services layer maintains information on which micro partitions belong to
which table, and also other information such as if a micro partition is marked as deleted, or if
it is in time travel storage or in fail safe storage.
Any update to the table will result in the addition of a new micro partition.
The DISTINCT queries on the two tables will return different results, as we have updated some
data in the cloned table.
Cloning is achieved through metadata operation performed in the cloud services layer. Data is
not physically copied, nor are new micro-partitions created—instead, the cloned table points to
the micro-partitions of the source table.
Combining Cloning and Time Travel can generate a clone of a table, database, or schema as it
existed at a specific point in time. Because both Time Travel & Cloning are metadata operations,
they can easily be combined.
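Because both are metadata operations, combining them is a single statement; the table names are hypothetical:

```sql
-- Zero-copy clone; no data is physically copied:
CREATE TABLE sales_dev CLONE sales;

-- Combine cloning with Time Travel: clone the table as it looked
-- one day (86400 seconds) ago.
CREATE TABLE sales_yesterday CLONE sales AT (OFFSET => -86400);
```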
The zero-copy cloning capability of Snowflake enables users to create clones of tables, schemas,
and databases without physically copying the data. Cloning does not require additional storage
space, and because cloning does not physically replicate data, it is far faster than the physical
copying of data. Micro-partitions and metadata enable rapid and efficient zero-copy cloning
because the cloned table's metadata references the existing micro-partitions.
When tables, schemas, or databases are cloned, the cloned item does not contribute to total
storage until data manipulation language (DML) operations are performed on the source or
target, which modify or delete existing data or add additional data.
A cloned object does not contribute to overall storage until DML operations on the
source or target object are done.
A shared table refers to the underlying table and its micro partitions. Sharing is fast, and any
changes to data in the source table are instantly reflected in the shared table.
Snowflake offering for data sharing:
Direct sharing:
Virtual Private Snowflake (VPS) accounts are an exception for data sharing; they can't share
data.
The data sharing process requires several steps to share data with a consumer.
Two Snowflake accounts are required for this – one as the data provider and another as the
data consumer. They must be on the same cloud provider and in the same region.
Sharing across geographies is also covered.
Please note that we will use the ACCOUNTADMIN role to create and manage the
sharing operation.
It is possible to grant the CREATE SHARE and IMPORT SHARE privileges to another role, which
can then create these shares.
For setting up the share and sharing it with the consumer account:
From the above, it is clear that a share can only be created by ACCOUNTADMIN.
First we need to grant usage on the database and the schema, and then select on the table,
else it will throw an error like the one below:
To find out the account name, here is how to find it from the web URL:
The consumer should be in the same region and the same cloud provider.
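The grant sequence described above can be sketched as follows; all object and account names are placeholders:

```sql
-- Provider side (run as ACCOUNTADMIN):
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db              TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public       TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.sales TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_acct;  -- placeholder account

-- Consumer side: create a read-only database from the share.
CREATE DATABASE shared_sales FROM SHARE provider_acct.sales_share;
```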
Share data with a non-Snowflake customer using the Snowflake web UI:
In this case, we are sharing the data with a customer who doesn't have a
Snowflake account, so we need to create a reader account for the user
who wants to view the data on the other side.
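The reader account mentioned above is created (and paid for) by the provider; the name and credentials below are placeholders:

```sql
-- Provider creates a reader account for a non-Snowflake consumer:
CREATE MANAGED ACCOUNT reader_acct
  ADMIN_NAME = 'reader_admin',       -- placeholder admin login
  ADMIN_PASSWORD = 'ChangeMe123!',   -- placeholder password
  TYPE = READER;
```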
Data Exchange:
Data Exchange in Snowflake is your own private data sharing hub. It is similar to the
Marketplace.
Quiz:
The cloud services layer facilitates data sharing through metadata operations.
When a Snowflake data provider shares data with another Snowflake account, the
data consumer is charged for the compute for any queries they run.
Metadata operations in the cloud services layer allow data sharing without physically copying it.
Since the provider account stores and pays for the data storage, the data consumer doesn't have
to pay anything extra for storage. However, the data consumer pays for the compute used to run
queries on shared data. When queries are run on shared data, the compute of the data consumer
is used.
Sharing data with a non-Snowflake user or organization is possible by creating a reader account.
This reader account is created by the data provider solely for sharing purposes.
Since the data provider creates and administers the reader account, all the reader account's
compute expenses are invoiced to the provider account. Therefore, the reader account's use of
the virtual warehouse compute is added to the provider account compute charges.
The Snowflake Marketplace is an online marketplace where you can purchase and sell datasets.
You may import data from outside your company into your Snowflake instance and utilize it to
enrich your data via the Snowflake Marketplace.
Data Exchange is your own private hub for sharing data with a small group of people or
organizations who have been invited to join. The owner of the Data Exchange account is in
charge of inviting members and specifying whether they can share, consume, or do both.
The consumer creates a database from the Share object as a read-only database.
Except for Virtual private Snowflake accounts, the Snowflake Marketplace is available to all
Snowflake accounts hosted on Amazon Web Services, Google Cloud Platform, and Microsoft
Azure. Any Snowflake account (again, except for VPS accounts) can become a data provider and
publish datasets to the Marketplace for a cost or for free. In addition, you are required to sign up
as a partner first and become an approved data provider.
Virtual Private Snowflake (VPS) cannot use secure data sharing, Marketplace, etc., because VPS
accounts have isolated metadata, compute, and storage and therefore don't have sharing
capabilities.
Performance Optimization in Snowflake:
Some manual tuning may be required; scaling the virtual warehouse up and down is one of these
techniques. There are also automatic features which work by default; we only need to make sure
that these features are working efficiently and as intended.
Query Profile:
Once we submit a query, a query ID appears in the output screen. If we click on
that query ID, it will take you to the query profile.
Caching in Snowflake:
Metadata cache & result cache – these are part of the cloud services layer and are
available to all virtual warehouses.
Below is the query plan returned when the same query is executed again:
We can also disable the query result cache for a particular session.
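The result cache is toggled with a session parameter:

```sql
-- Disable the query result cache for the current session only:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Re-enable it later:
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
```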
In this case there won't be any difference in the query plan, as we have disabled the query
result cache. From this point on it will not use the query result cache unless a new
session is started or the parameter is set back to TRUE.
All min, max and count queries of this kind go to the metadata cache and
fetch results quickly. For character columns (as in the last of the three
queries above), the metadata doesn't store the min and max values.
For any of these queries, Snowflake doesn't read the micro partitions at all; it makes
use of only the cache.
Once the virtual warehouse is suspended, all the data stored in its cache is
removed, and a query executed immediately after the warehouse is resumed will
not use the cached data, as it no longer exists.
Additional micro partitions are produced when data is added to a table. Because the
column values are scattered across numerous micro partitions, Snowflake must keep
track of what range of data is kept in which micro partition for each column. This
metadata enables Snowflake to eliminate unnecessary micro partitions when running
queries, therefore boosting overall query performance. This process of eliminating
micro partitions is also known as partition pruning.
So consider the query shown on the screen. Because the Snowflake query optimizer
knows that the data for ID equal to one is contained in micro partitions two and three,
it does not search micro partition one at all, therefore reducing the amount of reads it
needs to perform for processing this query. For this simple example it may not seem
a big deal, but for a really large table, partition pruning can eliminate a large number
of partitions, therefore improving query performance significantly.
Since micro partitions are produced in the order of the arrival of the data over time,
the data in micro partitions may not be optimally stored and may not support optimal
partition pruning.
For example, the micro partitions shown on the screen do not enable effective
partition pruning if most queries are based on the store column.
If queries are mostly predicated on the store column, they may perform better if the
table is clustered on the store column. With no clustering in place, and for the data
shown on the screen,
Snowflake will need to scan all micro partitions to find all records for store A.
Similarly, if a query was trying to retrieve data for store B only, it also needs to scan all
the micro partitions.
By clustering a table on a specific column, queries can be optimized by eliminating
unneeded partitions from the query processing.
The advantages of clustering a table on a specific column may not be visible for
tables with a small quantity of data, but as the table data and its micro-partitions
grow, clustering the table on the proper set of columns can bring significant gains
in speed through partition pruning.
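The clustering discussed above can be applied with statements like the following (the table and column names are illustrative, not from the original notes):

```sql
-- Cluster the hypothetical SALES table on the STORE column so that
-- queries filtering on STORE can prune micro-partitions effectively.
ALTER TABLE sales CLUSTER BY (store);

-- Inspect how well the table is clustered on that column.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(store)');
```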
Resizing to a larger size is scaling up, and resizing to a smaller size is scaling
down. In case both of the above clusters are full, the additional queries are
queued up again, as shown below:
However, the process of scaling out automatically can continue up to the maximum
configured value for the Multi-cluster virtual warehouse.
So if the multi-cluster virtual warehouse were configured to allow up to three or
more clusters, it would spawn additional virtual warehouses as more queries come in
and there are not enough compute resources to handle them.
Once the query workload decreases and the additional spawned virtual warehouses
don't have any workload, Snowflake shuts them down until it gets to the configured
minimum value.
In our case, the minimum value is one, so the multi-cluster virtual warehouses will
scale back until there is only one medium sized virtual warehouse.
The syntax for creating a Multi-cluster virtual warehouse is similar to creating a
standard virtual warehouse with a few additional options.
The syntax shown on the screen focuses on relevant parameters which are related
to Multi-cluster virtual warehouses.
For the complete syntax, please refer to snowflake's documentation.
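A minimal sketch of the multi-cluster-specific parameters mentioned above (warehouse name and values are illustrative; see Snowflake's documentation for the complete syntax):

```sql
CREATE WAREHOUSE my_mc_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1          -- scale back in to a single cluster
  MAX_CLUSTER_COUNT = 3          -- scale out to at most three clusters
  SCALING_POLICY    = 'STANDARD' -- spin up clusters as soon as queuing is detected
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;
```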
Materialized Views:
Search optimization service in Snowflake:
➢ Every time a virtual warehouse accesses data from a table, it caches that data locally.
This data cache can improve the performance of subsequent queries if those queries can
reuse the data in the cache instead of reading from the table in the cloud storage. The
warehouse cache is local to a virtual warehouse and can not be shared with other virtual
warehouses.
➢ For a populated table, the clustering depth is the average depth of overlapping micro-
partitions for specific columns. The clustering depth starts at 1 (for a well-clustered
table) and can be a larger number. For an unpopulated table, the clustering depth is zero.
➢ When defining clustering keys, the initial candidate clustering columns are those
columns that are frequently used in the WHERE clause or other selective filters.
Additionally, columns that are used for joining can also be considered.
➢ A materialized view is a view that pre-computes data based on a SELECT query. The
query's results are pre-computed and physically stored to enhance performance for
similar queries that are executed in the future. When the underlying table is updated, the
materialized view refreshes automatically, requiring no additional maintenance.
Snowflake-managed services perform the update in the background transparent to the
user without interfering with the user's experience.
➢ The Economy scaling policy attempts to conserve credits over performance and user
experience. It doesn't spin up more virtual warehouses as soon as queuing is observed
but instead applies additional criteria to ascertain whether to spin up new virtual
warehouses. With the scaling policy set to Standard, Snowflake prefers to spin up extra
virtual warehouses almost as soon as it detects that queries are starting to queue up.
The Standard scaling policy aims to prevent or minimize queuing.
➢ The metadata in Snowflake allows the Snowflake query engine to eliminate partitions to
optimize query execution. For example, if the query specifies a WHERE condition,
partitions NOT containing the value matching that condition will NOT be scanned.
Tri-Secret Secure refers to the combination of a Snowflake-managed key and a
customer-managed key, which together create a composite master key to further
protect your data.
Additional:
Multi-factor authentication:
Roles in Snowflake:
Network layer security: At the network level, Snowflake encrypts all communication
by default using TLS 1.2.
The security is further enhanced through network policies, through which specific IP
addresses may be allowed to connect and others may be blocked.
Additionally, Snowflake supports private connectivity, which means your connection
to Snowflake can be via a private link to the cloud.
When a user level network policy is applied, that policy takes precedence over account
level policy.
Now let's talk about private connectivity.
So by default, your snowflake instance is available over the public Internet with access
protected by different security measures such as MFA, HTTPS and network rules.
Now, if your organization demands that your Snowflake instance not be available
over the Internet, Snowflake supports private connectivity, through which you can
ensure that access to your Snowflake instance is via a private connection.
You can then optionally block all Internet access. It's good to note that private
connectivity to Snowflake requires at least the Business Critical edition.
Depending on your cloud provider, Snowflake provides private connectivity methods
specific to that cloud: it supports AWS PrivateLink, Azure Private Link, and
Google Cloud Private Service Connect.
Finally, let's talk about encryption in transit. Snowflake encrypts all
communication end to end; TLS 1.2 is used to encrypt data while it is in transit.
Everything in Snowflake is connected over HTTPS, including connectivity to the
Snowflake web UI and connectivity via JDBC, ODBC, the Python connector, and other
connection mechanisms. All access to Snowflake services is accomplished through
REST APIs, which are also invoked over the HTTPS protocol. So this lecture covered
the network-level security capabilities that Snowflake provides, including TLS 1.2
transport-level encryption.
MFA is available in all Snowflake editions and is enabled on a per-user basis.
Snowflake supports masking policies that may be applied to columns and enforced at the column
level to provide column-level security. Column-level security is achieved through Dynamic Data
Masking or External Tokenization.
Row-level security is implemented by creating row access policies, which include conditions and
functions that govern which rows are returned during query execution.
Secure views can be used to return only certain rows from a table. Additionally, secure
views hide the underlying data by removing some of the internal Snowflake
optimizations.
Snowflake encrypts all data in transit using Transport Layer Security (TLS) 1.2. This applies to all
Snowflake connections, including those made through the Snowflake Web interface, JDBC, ODBC,
and the Python connector
Administrators can configure the system to allow or deny access to specific IP addresses through
network policies. A network policy consists of the policy name, a comma-separated list of allowed
IP addresses, and a list of blocked IP addresses
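The network policy described above can be sketched as follows (the policy name and IP ranges are illustrative):

```sql
-- Allow one corporate CIDR range and block a single address within it.
CREATE NETWORK POLICY corp_policy
  ALLOWED_IP_LIST = ('192.168.1.0/24')
  BLOCKED_IP_LIST = ('192.168.1.99');

-- Apply at the account level; a user-level policy would take precedence.
ALTER ACCOUNT SET NETWORK_POLICY = corp_policy;
```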
Snowflake supports SCIM 2.0 and is compatible with Okta and Azure Active Directory. SCIM is an
open standard that provides automatic user provisioning and role synchronization based on
identity provider information. When a new user is created in the identity provider, the SCIM
automatically provisions the user in Snowflake. Additionally, SCIM can sync groups defined in an
identity provider with Snowflake roles.
ACCOUNTADMIN is the most powerful role in a Snowflake account. Due to the role hierarchy and
privilege inheritance, the ACCOUNTADMIN inherits all the privileges that SECURITYADMIN and
USERADMIN have.
Snowflake's access control system is built on the RBAC idea, which means that privileges are
issued to roles and roles to users. The privileges associated with a role are given to all users
assigned to it. Snowflake also supports discretionary access control (DAC), which means that the
role that created an object owns it and can provide access to other roles to that item.
Extending snowflake functionality:
Secure UDF’s:
Stored Procedures:
Snowpark:
Snowflake scripting:
Snowflake Scripting is an extension to SQL that allows you to use procedural logic similar to that
found in programming languages. Snowflake Scripting allows you to use variables, if-else
expressions, looping, cursors, manage result sets, and allows you to handle errors. Snowflake
scripting is typically used to create stored procedures, but it may also be used to create procedural
code outside of a stored procedure.
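The Snowflake Scripting features listed above (variables, IF-ELSE, RETURN) can be sketched in an anonymous block like this (values are illustrative; run it in Snowsight or via EXECUTE IMMEDIATE with $$ delimiters):

```sql
DECLARE
  total NUMBER DEFAULT 0;
BEGIN
  total := 40 + 2;           -- assign a variable
  IF (total > 0) THEN        -- procedural branching
    RETURN 'total is ' || total;
  ELSE
    RETURN 'non-positive';
  END IF;
END;
```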
Snowpark is a library created by Snowflake that provides APIs for accessing and processing data
in applications written in a programming language other than SQL. Snowpark allows programmers
to utilize common programming languages such as Java, Scala, and Python to construct apps that
handle data using standard programming structures. Snowpark automatically converts the data-
processing programming constructs to SQL and pushes it down to Snowflake for execution. As a
result, developers may utilize a familiar language while benefiting from Snowflake's scale and
execution engine.
Stored procedures are often used to perform recurring administrative activities. For example,
in a particular organization, setting up a new user on the system may require creating the
user, granting them several roles, creating a private database for them, etc. These steps can
easily be placed in a stored procedure, and then the stored procedure can be called whenever
there is a requirement to create a new user.
You can create functions using typical programming languages such as Java, Python, or Scala,
and those functions can be exposed in Snowflake as UDFs, so you can use them in your SQL just
like any other UDFs. To execute these UDFs, Snowflake creates a run-time sandbox environment
within the virtual warehouse, and the UDFs execute inside the sandbox. This approach also
ensures parallel execution of the UDFs by default because they use Snowflake's infrastructure
to scale.
Information Schema:
The ACCOUNT USAGE schema consists of several views that provide usage metrics and metadata
information at the account level. Data provided by the ACCOUNT_USAGE views is NOT real-time
and refreshes typically with a lag of 45 minutes to 3 hours, depending on the view. The data in
these views are retained for up to 365 days.
The data provided via the INFORMATION_SCHEMA views is real-time, and there is no latency in
the information provided. So, if you are asked which schema should be used when there is a
requirement to view real-time data, then the views in INFORMATION_SCHEMA should be used, as
they contain real-time information.
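The difference between the two schemas can be seen with equivalent queries (filters and limits are illustrative):

```sql
-- Account-level history: latency of ~45 min to 3 hrs, retained up to 365 days.
SELECT query_text, total_elapsed_time
FROM snowflake.account_usage.query_history
ORDER BY start_time DESC
LIMIT 10;

-- Real-time equivalent via the INFORMATION_SCHEMA table function
-- (no latency, but a much shorter retention window).
SELECT query_text, total_elapsed_time
FROM TABLE(information_schema.query_history())
ORDER BY start_time DESC
LIMIT 10;
```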
A resource monitor cannot control the costs of cloud services. A warehouse-level resource
monitor can monitor credit usage by Cloud Services, but the resource monitor cannot suspend the
cloud services.
Resource monitors can track & manage a single virtual warehouse against a defined quota.
Resource monitors can be created to track the credit usage of multiple virtual warehouses
together. Resource Monitors can also be created at the account level, which means that such
resource monitors track credit usage at the account level, considering the credit usage of all virtual
warehouses.
Resource monitors help manage virtual warehouse costs and avoid unexpected credit usage.
Credit usage can be controlled with resource monitors by monitoring credit usage against a
defined upper limit, notifying administrators when a certain percentage of the limit is reached, and
even suspending virtual warehouses if necessary.
Another one in Udemy
Learning Plan:
Snowflake Architecture:
Layers in architecture and the importance of each layer.
Since Snowflake stores data in a columnar format, analytical query performance is good.
You can resize a virtual warehouse at any time using the web UI or the SQL interface.
Scale up or vertical scaling (warehouse resizing) – increase the size of the virtual
warehouse when your queries are taking too long or data loading is slow.
Scale out or horizontal scaling – increase the number of clusters to handle the queries
that are going into the queue.
Snowflake offers up to 10 clusters per multi-cluster warehouse as part of the scale-out option.
If we login with security admin or account admin then only the below option is
visible to us:
To keep all the stage objects in one schema, we need to create a dedicated schema
(e.g., an external_stages schema) as shown below:
If we want to load the [Link] file specifically, we have to mention it in the
below way:
To generate the sequence number in the snowflake table, below is the way:
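One way to generate sequence numbers is with a sequence object used as a column default (the sequence and table names are illustrative):

```sql
-- Hypothetical sequence used to generate surrogate keys.
CREATE SEQUENCE seq_customer_id START = 1 INCREMENT = 1;

CREATE TABLE customers (
  customer_id NUMBER DEFAULT seq_customer_id.NEXTVAL,
  name        STRING
);

-- customer_id is filled automatically from the sequence.
INSERT INTO customers (name) VALUES ('Alice'), ('Bob');
```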
A stage is nothing but the location of files; it can be internal or external to
Snowflake.
Stage Types:
To load the data from data lake s3 bucket storage to snowflake tables:
If we don’t mention any file format properties, the default file format is CSV.
To alter file format object:
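A sketch of creating a file format object and later altering it (the format name and properties are illustrative):

```sql
-- Hypothetical CSV file format object.
CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1;

-- Change a single property without recreating the object.
ALTER FILE FORMAT my_csv_format SET FIELD_DELIMITER = ',';
```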
Snow SQL:
Below is the connection string to give to connect to snowsql:
A user stage is used to store files in a staging area allocated to that user; this
space is allocated when the user is created.
A table stage is a staging area allocated to one particular table; one or more users
can load into this table through it.
We can mention the file format while creating the stage itself.
Below way to copy the files from desktop to User staging area:
Executed the put command to userstage using snowsql:
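A sketch of the PUT step above, run from SnowSQL (the local path is illustrative; PUT is not available in the web UI):

```sql
-- Upload a local file to the current user's stage (@~).
PUT file://C:/Users/demo/Desktop/employees.csv @~/staged/;

-- List the uploaded (compressed) files.
LIST @~/staged/;
```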
If we want to check all the internal stages which are created above:
On_error:
Force property:
Size Limit:
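The three COPY options above can be combined in one load statement (table, stage, and format names are illustrative):

```sql
COPY INTO employees
FROM @my_stage/employees/
FILE_FORMAT = (FORMAT_NAME = my_csv_format)
ON_ERROR = 'CONTINUE'    -- skip bad rows instead of failing the whole load
FORCE = TRUE             -- reload files even if they were loaded before
SIZE_LIMIT = 100000000;  -- stop loading new files after ~100 MB of data
```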
Snowflake-Azure Integration
If we want to fetch the array data from that JSON file, we may need to give the
index position as below:
In this case we will fetch only one element from the array.
Below is the way to extract entire data in single sql statement using union all:
To avoid the duplicates, we have to add a WHERE condition as below:
Whenever we flatten, we have to reference .value for the flattened field, as shown above
for the pets field as [Link], since we flattened the pets field.
When we flatten, we are expanding the elements of a nested array into separate rows.
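The flattening pattern above can be sketched as follows (the table name, VARIANT column, and JSON shape are illustrative, e.g. records like {"name": "Sam", "pets": ["dog", "cat"]}):

```sql
SELECT
  v:name::STRING  AS owner_name,
  p.value::STRING AS pet            -- .value holds each flattened array element
FROM raw_json,
     LATERAL FLATTEN(input => v:pets) p;
```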
Processing XML files:
Below is the sample xml file shown which contains the book details:
If we don’t give a file format in the query below, by default it takes the XML file format defined on the stage:
Storage Cost:
After we analyze the data size, we can switch the storage type to capacity pricing.
Compute Cost:
Snowflake Credit:
It’s a unit of measure; cost is calculated using this measure, credits.
Types of Cost:
Continuous Loading:
Snowpipe is a named database object that contains the COPY command used to load the data.
It’s a serverless setup. As soon as files are placed in the AWS S3 bucket or ADLS, they are
immediately loaded into the Snowflake tables. For that, some configuration settings need
to be done.
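A sketch of such a pipe (names are illustrative; the cloud-side event notification setup is assumed to be configured separately):

```sql
CREATE PIPE my_pipe
  AUTO_INGEST = TRUE   -- load automatically on cloud storage events
AS
  COPY INTO employees
  FROM @my_s3_stage
  FILE_FORMAT = (FORMAT_NAME = my_csv_format);

-- Check whether the pipe is up and running.
SELECT SYSTEM$PIPE_STATUS('my_pipe');
```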
How to troubleshoot issues in Snowpipe:
We have to check whether the pipe is up and running using the below:
From the above we can understand that zero rows were loaded into the table from the last
file, since it shows row_count as zero.
Validating the source files: this is not in our hands, as the files come entirely from the source team.
In some cases there will be an issue with the file format object. In that case we can manually
run the copy command that is used in the pipe as shown below by specifying the particular
file name:
We have to load the history files by running the copy command manually.
In case the delimiter was changed from | to , then the below step needs to be executed manually:
How can we manage the pipes:
If we want to see the pipes listed in the DB, below are some of the ways:
Caching in Snowflake:
Cache is a temporary storage location that stores files/copies of data so that they can be
accessed faster in near future. It plays a vital role in saving costs and speeding up the results
and improves query performance.
There are 2 types of cache → Query results cache & local disk cache
If we want to use the files stored in cache within the next 24 hours or the next couple of
days, we can access them in a faster and easier way.
Architecture of Snowflake: the results cache is located in the cloud services layer, and the
local disk cache is located in the virtual warehouse layer. When a query is executed, Snowflake
first checks the results cache for the output. If the desired data is not available in the
query results cache, it then looks in the local disk cache and brings up the data from there.
In this case the data remains available in the results cache whether or not the VWH is up and
running; when the VWH is suspended, it is the local disk cache that is cleared.
Query results are available in cache for 24 hrs.
Results cache can be available across different virtual warehouses.
Results cache works as long as there is no change done for the underlying data.
Query results returned to one user is available to another user on the system who executed the
same query.
The query text should be exactly the same, with no changes in the underlying data; even
re-ordering the columns or selecting a subset of the data prevents a results cache hit.
It works when we query for subset of data that is available in local disk cache.
The local disk cache depends on the size of the virtual warehouse we are using.
For example, an X-Small VW can't hold millions of records in its cache, but it can fetch part
of the data from the local disk cache and the remaining data from the remote disk.
As we know metadata management is done by cloud services layer in snowflake.
The difference in the time taken by re-ordering the columns but querying the same data:
There is no need to enable Time Travel; it is automatically enabled by default.
With Time Travel we can retain data for up to 90 days; after the Time Travel period expires,
the Fail-safe period comes into the picture. Data can be retrieved from Fail-safe only
through Snowflake support.
Hands on:
If the data is not available within the retention period, or the requested time is before the
table was created, the below error is thrown:
If it is at a certain point in time, we use AT TIMESTAMP; if it is before a certain period of
time, we use OFFSET, as mentioned below:
For semi-structured data, whenever we cast a column from one datatype to another, we use the
double colon (::), as in the last query in the above screenshot.
Below is another kind of data retrieval, apart from OFFSET and TIMESTAMP, i.e. BEFORE:
The below indicates that the time travel period is completed and currently in the fail safe
zone:
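The three Time Travel variants can be sketched together (the table name, timestamp, and query ID are illustrative):

```sql
-- AT a specific point in time:
SELECT * FROM employees
  AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP_LTZ);

-- OFFSET: 5 minutes (300 seconds) ago:
SELECT * FROM employees AT (OFFSET => -300);

-- BEFORE a specific statement executed earlier:
SELECT * FROM employees
  BEFORE (STATEMENT => '01a2b3c4-0000-1111-0000-000000000001');
```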
The below query converts the number of bytes to GBs to see how much space the table is
occupying.
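One way to sketch such a query, using the ACCOUNT_USAGE storage view (database and column choices are illustrative):

```sql
SELECT table_name,
       active_bytes / POWER(1024, 3) AS active_gb   -- bytes -> GB
FROM snowflake.account_usage.table_storage_metrics
WHERE table_catalog = 'MY_DB'
ORDER BY active_gb DESC;
```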
To create a schema with certain data retention period:
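A minimal sketch (schema name and value are illustrative):

```sql
-- All objects created in this schema inherit a 7-day Time Travel window.
CREATE SCHEMA my_db.staging
  DATA_RETENTION_TIME_IN_DAYS = 7;
```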
***If the source table has 1000 records and is cloned to a new table, and the new cloned table
is then loaded with another 200 new records, the cost is incurred only on the new 200 records,
not on the 1000 records that still point to the original source table. This is how zero-copy
cloning works.***
Why we use zero-copy cloning → when we want to do unit testing or integration testing and
bring some data from a prod table to a dev table, there is no additional storage charge.
Cloning Syntax:
Hands on:
While cloning we cannot apply any filters and we will have to clone the entire object.
If we are updating something in the main table it should not affect the cloned table and vice
versa.
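The cloning behaviour described above can be sketched as (table names are illustrative):

```sql
-- Zero-copy clone: no data is physically copied at creation time.
CREATE TABLE employees_dev CLONE employees;

-- A clone can also target a past state via Time Travel (one day ago here):
CREATE TABLE employees_yday CLONE employees AT (OFFSET => -86400);
```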
***The below way of creating a backup table also works in Snowflake, but it is costly, whereas
cloning the table costs nothing initially because the data points to the main table. The below
is the wrong way of taking a backup of a table.***
Table Types:
3 types of tables – Permanent, Temporary, Transient Tables.
Permanent Tables: the default table type in Snowflake. These are the regular, common tables.
They exist until we drop them explicitly. These tables have a Time Travel period of up to
90 days and a Fail-safe period of 7 days.
Transient Tables:
These are similar to permanent tables but with a 1-day retention period and no Fail-safe
period. The tables exist until we drop them. This type is useful when data protection is not
required.
Defining staging tables as transient is a best practice: the Time Travel period is only 1 day,
and there is no Fail-safe period for this type.
SYNTAX:
CREATE TRANSIENT TABLE <TABLE_NAME> (<column definitions>);
Temporary Tables: this type of table exists only within the session. Once the session ends,
the table is dropped completely and is not recoverable. The retention period is 1 day, and the
session must remain active for that period for the data to be retained. Even though multiple
worksheets may be open, this type of table is accessible only in the SQL worksheet (session)
that created it and cannot be accessed from a different worksheet.
These tables can be used for temporary processing; for example, they can be used in procedures
and dropped at the end of the procedure. They are useful for intermediary storage.
SYNTAX:
CREATE TEMPORARY TABLE <TABLE_NAME> (<column definitions>);
A transient table that was dropped a while ago can be restored in the below way:
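A minimal sketch (the table name is illustrative; restore must happen within the 1-day retention period for a transient table):

```sql
DROP TABLE my_transient_table;

-- Restore the dropped table from Time Travel:
UNDROP TABLE my_transient_table;
```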
RBAC – access privileges are assigned to roles, and those roles are assigned to the users.
Roles in Snowflake: roles are the entities to which privileges on securable objects can be
granted and from which they can be revoked.
Roles are assigned to users to allow them to perform actions required for business functions.
A user can be assigned multiple roles. This allows users to switch roles.
The account admin is the boss: he has all access and can do anything. Whatever the security
admin and sysadmin can do can be done by the account admin. Whatever the user admin can do
can be done by the security admin as well; the security admin inherits the privileges of the
user admin.
The custom roles are user-defined roles.
To summarize: databases, schemas, tables, and privileges are managed by the sysadmin, while
roles and users are managed by the security admin.
Roles and Users Creation:
From the above, whatever the sales user can do, the sales admin can also do.
DEFAULT_ROLE = ACCOUNTADMIN
MUST_CHANGE_PASSWORD=TRUE;
MUST_CHANGE_PASSWORD =TRUE;
DEFAULT_ROLE = SYSADMIN
MUST_CHANGE_PASSWORD = TRUE;
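The role-and-user setup above can be sketched end to end (role, user, database, and password values are illustrative):

```sql
-- Security admin creates the role and the user.
USE ROLE SECURITYADMIN;
CREATE ROLE sales_role;
CREATE USER sales_user
  PASSWORD = 'TempPass#123'
  DEFAULT_ROLE = sales_role
  MUST_CHANGE_PASSWORD = TRUE;
GRANT ROLE sales_role TO USER sales_user;

-- Sysadmin grants object privileges to the role.
USE ROLE SYSADMIN;
GRANT USAGE ON DATABASE sales_db TO ROLE sales_role;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE sales_role;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE sales_role;
```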
Materialized views:
Masking Policies:
Dynamic data masking: The data is not changed in the storage or in any table, but when
executed the output data will be masked dynamically and displayed.
We don't need to mention the masking policy name when removing it, as only one policy can be
applied to a column at a time.
Before a masking policy is dropped, we have to unset it from the columns and then drop the policy.
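The full masking lifecycle can be sketched as follows (the policy, table, column, and role names are illustrative):

```sql
-- Mask email for everyone except a privileged role.
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('SALES_ADMIN') THEN val
    ELSE '*****'
  END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_email;

-- Unset first (no policy name needed), then drop the policy.
ALTER TABLE customers MODIFY COLUMN email UNSET MASKING POLICY;
DROP MASKING POLICY mask_email;
```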
For non-snowflake users we must create the reader account and then share it.
One consumer can get data from multiple providers and one provider can share data to
multiple consumers.
A snowflake user can act as a both provider and consumer.
The objects that can be shared are:
There are three kinds of accounts involved: the consumer account, the reader account, and the
provider account. The reader account belongs to the provider account.
Sharing the data and dropping the share object using snowsight window:
Data Sampling:
Data sampling is selecting part of data or subset of records from the table.
This is done to build and test a query, checking whether the query is syntactically correct or not.
Sampling is mainly used for query building and testing, and also for data analysis and understanding.
It is useful in dev environments, where we use small warehouses and occupy less storage.
We can sample a fraction or percentage of rows, and we can also sample a fixed number of rows.
Syntax:
Some of the samples:
In the below case, the same output data cannot be guaranteed across runs, because we are not
giving a seed number.
In the below case, since we are sampling with a seed, the data is the same in both cases.
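The sampling variants above can be sketched as (the table name is illustrative):

```sql
-- ~10% of rows; the result can differ from run to run:
SELECT * FROM orders SAMPLE (10);

-- Repeatable ~10% sample thanks to a fixed seed:
SELECT * FROM orders SAMPLE (10) SEED (42);

-- A fixed number of rows:
SELECT * FROM orders SAMPLE (100 ROWS);
```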
In an external table, the first field is always the VALUE field, which stores the entire record
in the VARIANT datatype.
We build external stages to analyse the data while it is still in the source files (shown
below) and to access the files in the form of a table.
We can create secured view (top) and materialized view (below) on top of external tables.
We can easily identify the type of view by looking at the icon symbol of the view as shown
below:
Stored Procedures:
SPs allow you to write procedural code that includes SQL statements, conditional
statements, looping statements, and cursors.
Snowflake supports 5 languages for writing procedures:
Snowflake Scripting (SQL)
JavaScript
Java
Scala
Python
From a stored procedure, you can return a single value or tabular data.
SPs support branching and looping.
SQL statements can be created and executed dynamically.
Differences between SP’s & UDF’s:
The above stored procedure contains elements like the declaration of variables and the
execution of SQL statements.
We need to be careful when writing stored procedures; even a single missing semicolon can be
difficult to track down.
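A minimal Snowflake Scripting stored procedure in the same spirit as the one above (procedure, table, and column names are illustrative):

```sql
CREATE OR REPLACE PROCEDURE purge_old_rows(days NUMBER)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  deleted NUMBER DEFAULT 0;
BEGIN
  -- Delete rows older than the given number of days; :days binds the argument.
  DELETE FROM events
  WHERE event_ts < DATEADD(day, -:days, CURRENT_TIMESTAMP());
  deleted := SQLROWCOUNT;          -- rows affected by the last DML statement
  RETURN 'Deleted ' || deleted || ' rows';
END;
$$;

CALL purge_old_rows(30);
```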
Resource Monitors:
A Virtual warehouse consumes snowflake credits while it runs. The no of credits consumed
depends on the size of the warehouse and how long it runs.
A resource monitor can be used to monitor credit usage by virtual warehouses and the cloud
services needed to support those warehouses.
Resource monitors help in controlling costs and avoiding unexpected credit usage.
In resource monitors we can set credit limits for a specified interval or date range.
When these limits are reached or approaching, the resource monitor can trigger various
actions such as sending alerts and suspending warehouses.
Resource monitors can only be created by account administrators or with the role that has
admin privileges.
Resource monitors reduce unexpected credit usage and help us track credit usage.
Credit Quota: this specifies the number of Snowflake credits allocated to the monitor for the
specified frequency interval. In addition, Snowflake tracks the credits used against the quota
within the specified frequency interval by all warehouses assigned to the monitor. After the
specified interval, this number resets back to 0.
Credit quota accounts for credits consumed by both user managed virtual warehouses and
virtual warehouses used by cloud services.
Monitor type: A resource monitor can be created to monitor the credit usage both at account
level & warehouse level (single or set of warehouses).
If this property is not set then the resource monitor doesn’t monitor any credit usage.
Schedule: The default schedule for a resource monitor specifies that it starts monitoring credit
usage immediately and the used credits reset back to 0 at the beginning of each calendar
month (i.e. the start of the standard snowflake billing cycle)
We can customize the schedule for a resource monitor using the following properties:
1. Frequency: daily, weekly, monthly, yearly, or never (with never, used credits never reset;
assigned warehouses continue using credits until the credit quota is reached).
2. Start: Date & time when the resource monitor starts monitoring the assigned
warehouses. It can be immediately or any future timestamp.
3. End: the date and time when Snowflake suspends the warehouses associated with the
resource monitor, regardless of whether the used credits have reached any of the defined
thresholds. It can be any future timestamp. We need to be especially careful with this
property; we have to set it to "never".
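The quota, schedule, and threshold actions above can be combined in one statement (monitor and warehouse names and the thresholds are illustrative):

```sql
USE ROLE ACCOUNTADMIN;

CREATE RESOURCE MONITOR monthly_rm
  WITH CREDIT_QUOTA = 100          -- credits per frequency interval
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 75  PERCENT DO NOTIFY             -- alert the admins
    ON 100 PERCENT DO SUSPEND            -- suspend after running queries finish
    ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- cancel running queries and suspend

ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = monthly_rm;
```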
Actions:
Creation of Tasks:
DAG of tasks:
Directed acyclic graph
Creation of tasks setting the dependency between the tasks:
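A sketch of a two-task DAG with a dependency (task, warehouse, and table names are illustrative):

```sql
-- Root task runs on a schedule.
CREATE TASK load_task
  WAREHOUSE = my_wh
  SCHEDULE = '60 MINUTE'
AS
  INSERT INTO staging_tbl SELECT * FROM raw_tbl;

-- Child task runs AFTER the root completes (the DAG edge).
CREATE TASK transform_task
  WAREHOUSE = my_wh
  AFTER load_task
AS
  INSERT INTO final_tbl SELECT * FROM staging_tbl;

-- Tasks are created suspended; resume children first, then the root.
ALTER TASK transform_task RESUME;
ALTER TASK load_task RESUME;
```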
Task History:
Metadata of Streams:
Types of streams:
Every update is tracked as one delete plus one insert: the stream records an updated row as
one DELETE record and one INSERT record.
To apply the tracked updates and deletes to a target table, we have to write MERGE queries.
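A simplified sketch of a stream plus a MERGE consuming it (table names are illustrative; a production pipeline would typically also check METADATA$ISUPDATE to distinguish true updates from plain inserts/deletes):

```sql
CREATE STREAM src_stream ON TABLE src;

MERGE INTO tgt
USING src_stream s
  ON tgt.id = s.id
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' THEN
  DELETE
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
  UPDATE SET tgt.val = s.val
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
  INSERT (id, val) VALUES (s.id, s.val);
```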
If we combine a Snowpipe, a task, and a stream, there is continuous ingestion of data, i.e.,
a continuous data pipeline.
Snowflake Alerts and Email Notifications:
When to use snowflake alerts:
Email Notifications:
3. What are the advantages of snowflake over traditional databases or what are the new
features available in snowflake?
There are lots of new features and advantages:
pay-as-you-go pricing, no infrastructure maintenance, easy data loading, Time Travel and
Fail-safe, zero-copy cloning, easy data sharing, and Tasks and Streams.
Internal stages – Stores the data files internally. We can copy files to internal stages by using
PUT command from snowsql.
7. How can you load a json file to Snowflake or how can you process and load semi
structured data?
We can store this semi-structured data in a table using a datatype called VARIANT. We can then
read the data from the VARIANT column, process it into rows and columns, and load it into
another table.
9. How can you handle it if the data coming from a file exceeds the length of a field in the
table?
Ans: We can handle this by using TRUNCATECOLUMNS = TRUE in the COPY command. If we don't
specify this property, the COPY command will fail. By default, this property is set to FALSE.
13. Write a query to retrieve data that was deleted from a table?
14. What are all the objects we can restore after delete or drop?
We can restore deleted data from any table based on the Time Travel retention period defined
on the table. Based on the Snowflake edition, the retention period can be 1 to 90 days. We
can UNDROP tables, schemas, and databases that were dropped by mistake or intentionally.
Refreshed_on & behind_by columns in the output of the above SHOW command.
Choose a Time Travel retention period shorter than 90 days when the data is not required to be
kept for the full 90 days, as the longer retention would impact storage costs.
There is no indexing concept in snowflake. Instead we can define cluster keys on large tables
for better performance.
Snowflake Interview questions Part-2
Snowflake Interview questions scenario based
==========================================================
The above can be achieved using listagg in snowflake.
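A sketch of LISTAGG collapsing multiple rows into one delimited string per group (the table and column names are illustrative):

```sql
SELECT dept,
       LISTAGG(emp_name, ', ') WITHIN GROUP (ORDER BY emp_name) AS employees
FROM emp
GROUP BY dept;
```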