SNOWFLAKE
Introduction:
Snowflake is a cloud data platform provided as software-as-a-service (SaaS) that
enables data storage, processing, and analytic solutions that are faster, easier to use,
and far more flexible than traditional offerings.
Snowflake is a true SaaS offering. More specifically:
There is no hardware (virtual or physical) to select, install, configure, or manage.
There is virtually no software to install, configure, or manage.
Ongoing maintenance, management, upgrades, and tuning are handled by
Snowflake.
Snowflake is a massively parallel database processing engine; this means the
system uses multiple nodes to scale the execution of queries.
Snowflake runs completely on cloud infrastructure. All components of Snowflake’s
service (other than optional command line clients, drivers, and connectors), run in
public cloud infrastructures.
Snowflake uses virtual compute instances for its compute needs and a storage service
for persistent storage of data. Snowflake cannot be run on private cloud infrastructures
(on-premises or hosted).
Snowflake is not a packaged software offering that can be installed by a user. Snowflake
manages all aspects of software installation and updates.
Snowflake data platform is built from scratch. It is not built on any existing database
technology or Bigdata software platforms such as Hadoop.
Snowflake combines a completely new SQL query engine with an innovative architecture
natively designed for the cloud.
CLUSTER: -
Snowflake automatically organizes data into micro-partitions to allow faster retrieval of
frequently requested data.
Micro partitions: -
Snowflake has implemented a powerful and unique form of partitioning called Micro-
partitioning.
Tables are transparently partitioned using the ordering of the data as it is
inserted/loaded.
Snowflake storage is columnar and horizontally partitioned, meaning all the column
values for a given row are stored in the same micro-partition.
Micro-partitions are small in size (50 to 500 MB).
Data is compressed within micro-partitions; Snowflake automatically determines the most
efficient compression algorithm for the columns in each micro-partition.
When data is inserted, whether by batch insert or row by row, it is ordered into
micro-partitions based on the order in which the rows are inserted.
INSERT/COPY
INSERT and COPY into table operations only create new micro-partitions.
UPDATE
UPDATE operations keep the old MPs (before the change) and create new MPs with the change.
Each MP will have its own unique version IDs.
DELETE
DELETE operations keep the old MPs (before the delete) and create new MPs with the change
by removing the record(s). Each MP will have its own unique version IDs.
The difference between UPDATE and INSERT is that UPDATE also needs to scan existing
partitions. An INSERT is a single operation that does not touch any existing files, while an
UPDATE must scan existing partitions. Like UPDATE operations, DELETE operations must
scan existing partitions to determine which record(s) are to be removed in the new MPs.
The immutable nature of micro-partitions is what makes Time Travel possible and easy to
implement. This Time Travel feature is unique to Snowflake in this highly competitive
technology space: with a proper retention period it makes data recovery far easier, whereas
recovery is time-consuming and error-prone in traditional databases and data warehouses.
A very important side note on locking and contention: since INSERTs only add new files,
they need no locks on existing MPs. As a result, INSERTs can run with higher concurrency
than UPDATE/DELETE/MERGE operations.
VIRTUAL WAREHOUSE: -
A virtual warehouse, often referred to simply as a "warehouse", is a cluster of compute
resources that executes database queries and commands. DML, SELECT, and COPY commands
use a virtual warehouse. This process is automatic. A virtual warehouse can consist of one or
more clusters, and each cluster can have 1-128 nodes.
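As a minimal sketch (the warehouse name and settings here are hypothetical), creating and sizing a warehouse looks like this:

```sql
-- Hypothetical example: create a small warehouse that suspends itself
-- after 60 seconds of inactivity and resumes automatically on the next query.
create warehouse if not exists demo_wh
  warehouse_size = 'XSMALL'
  auto_suspend   = 60
  auto_resume    = true;

use warehouse demo_wh;
```

Auto-suspend and auto-resume keep credit consumption tied to actual query activity.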
Credit: -
Snowflake credits are used to pay for the consumption of resources on Snowflake. A Snowflake
credit is a unit of measure, and it is consumed only when a customer is using resources, such as
when a virtual warehouse is running, the cloud services layer is performing work, or serverless
features are used.
DATABASE: -
It is a logical grouping of schemas. Each database belongs to a single Snowflake account.
SCHEMA: -
It is a logical grouping of database objects (tables, views, etc.). Each schema belongs to a single
database.
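The database/schema hierarchy above can be sketched as follows (object names are hypothetical):

```sql
-- Account -> database -> schema -> table.
create database if not exists sales_db;
create schema if not exists sales_db.raw;
create table if not exists sales_db.raw.orders (
  order_id number,
  amount   number(10,2)
);
```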
Architecture: -
Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database
architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for
persisted data that is accessible from all compute nodes in the platform. But similar to shared-
nothing architectures, Snowflake processes queries using MPP (massively parallel processing)
compute clusters where each node in the cluster stores a portion of the entire data set locally.
This approach offers the data management simplicity of a shared-disk architecture, but with the
performance and scale-out benefits of a shared-nothing architecture.
Snowflake’s unique architecture consists of three key layers,
Storage
Query Processing
Cloud Services
Storage: -
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal
optimized, compressed, columnar format. Snowflake stores this optimized data in cloud
storage.
Snowflake manages all aspects of how this data is stored — the organization, file size, structure,
compression, metadata, statistics, and other aspects of data storage are handled by Snowflake.
The data objects stored by Snowflake are not directly visible nor accessible by customers; they
are only accessible through SQL query operations run using Snowflake.
Query Processing: -
Query execution is performed in the processing layer. Snowflake processes queries using
“virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple
compute nodes allocated by Snowflake from a cloud provider.
Each virtual warehouse is an independent compute cluster that does not share compute
resources with other virtual warehouses. As a result, each virtual warehouse has no impact on
the performance of other virtual warehouses.
Cloud Services: -
The cloud services layer is a collection of services that coordinate activities across Snowflake.
These services tie together all the different components of Snowflake in order to process user
requests, from login to query dispatch. The cloud services layer also runs on compute instances
provisioned by Snowflake from the cloud provider.
Services managed in this layer include:
Authentication
Query parsing and optimization
Access control
Infrastructure management
Metadata management
Data types: -
Numeric Data Types: -
NUMBER - default precision and scale are (38,0).
DECIMAL, NUMERIC - synonymous with NUMBER.
INT, INTEGER, BIGINT, SMALLINT, TINYINT, BYTEINT - synonymous with NUMBER except
precision and scale cannot be specified.
FLOAT, FLOAT4, FLOAT8
DOUBLE, DOUBLE PRECISION, REAL - synonymous with FLOAT.
String & Binary Data Types: -
VARCHAR - default (and maximum) length is 16,777,216 bytes.
CHAR, CHARACTER - synonymous with VARCHAR except default length is VARCHAR(1).
STRING - synonymous with VARCHAR.
TEXT - synonymous with VARCHAR.
BINARY
VARBINARY - synonymous with BINARY.
Logical Data Types: -
BOOLEAN - only supported for accounts provisioned after January 25, 2016.
Date & Time Data Types: -
DATE
DATETIME - alias for TIMESTAMP_NTZ.
TIME
TIMESTAMP - alias for one of the TIMESTAMP variations (TIMESTAMP_NTZ by default).
TIMESTAMP_LTZ - TIMESTAMP with local time zone; time zone, if provided, is not stored.
TIMESTAMP_NTZ - TIMESTAMP with no time zone; time zone, if provided, is not stored.
TIMESTAMP_TZ - TIMESTAMP with time zone.
Semi-structured Data Types: -
VARIANT
OBJECT
ARRAY
Geospatial Data Types: -
GEOGRAPHY
GEOMETRY
Analytic functions: -
Ranking Functions (RANK and DENSE_RANK, CUME_DIST, PERCENT_RANK,
ROW_NUMBER)
Windowing Aggregate Functions
SUM, AVG, MAX, MIN, COUNT, STDDEV, VARIANCE, FIRST_VALUE, LAST_VALUE
Reporting Aggregate Functions
LAG/LEAD Functions, FIRST/LAST Functions
Inverse Percentile Functions
PERCENTILE_CONT, PERCENTILE_DISC
Hypothetical Rank and Distribution Functions
RANK | DENSE_RANK | PERCENT_RANK | CUME_DIST
Linear Regression Functions
REGR_COUNT, REGR_AVGY and REGR_AVGX, REGR_SLOPE and
REGR_INTERCEPT, REGR_R2, REGR_SXX, REGR_SYY, and REGR_SXY
Other Statistical Functions
WIDTH_BUCKET Function
Tables: -
1. Permanent: -
The data stored in permanent tables consumes space and contributes to the
storage charges that Snowflake bills to your account.
Permanent tables have a Fail-safe period and provide additional security of data
recovery and protection.
2. Temporary: -
Temporary tables only exist within the session in which they were created and
persist only for the remainder of the session.
They are not visible to other users or sessions.
Once the session ends, data stored in the table is purged completely from the
system and, therefore, is not recoverable, either by the user who created the
table or Snowflake.
3. Transient: -
Transient tables are like permanent tables; the only key difference is that they do
not have a Fail-safe period.
Transient Tables are meant for temporary data that must be kept after each
session but do not require the same level of data protection and recovery as
Permanent Tables.
4. External: -
External tables allow you to query files stored in an external stage as if they were
a regular table, i.e., without moving the data from the files into Snowflake tables.
They access files stored in an external stage such as Amazon S3, Google Cloud
Storage, or Azure Blob storage.
Basically, an external table is a metadata-only table, where the actual files and
records live in cloud storage.
External tables are read-only, therefore no DML operations can be performed on
them, but we can use them in queries and joins.
Querying data from external tables is likely to be slower than querying native
database tables.
We can analyze the data without storing it in Snowflake.
We can also create views against external tables.
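A minimal external-table sketch (the stage name and file layout are assumptions; for CSV files the staged row is exposed through the VALUE variant as c1, c2, ...):

```sql
-- Hypothetical: an external table over CSV files in an existing external stage.
create or replace external table ext_orders (
  order_id number as (value:c1::number),
  amount   number as (value:c2::number)
)
location = @my_s3_stage/orders/
file_format = (type = csv);

-- Queried like a regular (read-only) table:
select order_id, amount from ext_orders;
```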
Stages: -
Snowflake does not allow loading data into a table directly; loading must happen
via a stage location.
A stage is a database object and specifies where data files are stored (staged) so that
the data in the files can be loaded into a table.
Snowflake file formats are used while loading/unloading data from Snowflake stages
into tables using COPY INTO command and while creating EXTERNAL TABLES on files
present in stages.
Types of stages: -
User stage: -
By default, each user has a snowflake stage allocated to them for storing the
files.
User stages are referenced using ‘@~’.
This stage is a convenient option if your files will only be accessed by a single
user but need to be copied into multiple tables.
This option is not appropriate if:
Multiple users require access to the files.
The current user does not have INSERT privileges on the tables the data
will be loaded into.
Unlike named stages, user stages cannot be altered or dropped.
Table stage: -
By default, each table has a snowflake stage allocated to it for storing the files.
Table stage can be referenced using ‘@%’.
This stage is a convenient option if your files need to be accessible to multiple
users and only need to be copied into a single table.
This option is not appropriate if you need to copy the data in the files into
multiple tables.
Unlike named stages, table stages cannot be altered or dropped.
Named stage: -
It is a Snowflake object that we create ourselves. We can list this stage using '@'.
We can use the PUT command to upload files into the stage.
We can use the COPY INTO command to load data from the stage into a table.
There are 2 types of named stages. They are:
Named internal stage.
Named external stage.
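The named-stage workflow above can be sketched like this (stage, file, and table names are hypothetical; PUT runs from a client such as SnowSQL, not the web UI):

```sql
-- Create a named internal stage.
create or replace stage my_stage;

-- Upload a local file into the stage (run from SnowSQL):
-- put file:///tmp/orders.csv @my_stage;

-- List staged files using '@':
list @my_stage;

-- Load data from the stage into a table:
copy into orders
from @my_stage
file_format = (type = csv skip_header = 1);
```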
File format: -
A Snowflake file format is a named database object that describes external data
files so their contents can be translated into tabular form.
Snowflake supports 6 types of file formats (CSV, JSON, AVRO, ORC, PARQUET, XML).
We can assign file formats to a stage, in a copy command, in an external table while loading data
into Snowflake.
CSV and TSV are structured data file formats.
JSON, AVRO, ORC, PARQUET, XML are semi structured data file formats.
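A named file format can be defined once and reused, for example (names and options here are illustrative):

```sql
-- Hypothetical reusable CSV file format.
create or replace file format my_csv_format
  type = csv
  field_delimiter = ','
  skip_header = 1
  null_if = ('NULL', 'null');

-- Referenced by name in a COPY command:
copy into orders
from @my_stage
file_format = (format_name = 'my_csv_format');
```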
Views: -
A view is a database object that contains SQL query built over one or multiple tables.
It is considered as a virtual table that can be used almost anywhere that a table can be
used (filters, joins, subqueries, etc.,).
Whenever you query a view, the underlying SQL query associated with the view gets
executed dynamically and will fetch data from underlying tables.
Views serve a variety of purposes like combining, segregating, protecting data.
Changes to a table are not automatically propagated to views created on that table.
For example, if you drop a column in a table, the views on that table might become
invalid.
Advantages of views: -
Encapsulate complex query logic.
Store common queries in the schema, so they can be reused.
Views Allow Granting Access to a Subset of a Table.
Materialized Views Can Improve Performance.
No need of additional maintenance, auto refresh of results.
Types of views: -
1. Regular Views/Non-materialized views: -
A non-materialized view’s results are created by executing the query at the time
that the view is referenced in a query.
The results are not stored for future use.
Performance is slower as compared to materialized views.
Non-materialized views are the most common type of view.
2. Materialized views: -
Materialized views are designed to improve query performance for workloads
composed of common, repeated query patterns.
A materialized view stores pre-computed result set.
Materialized views require Enterprise edition or higher.
No need to refresh the materialized view manually. It can be refreshed
automatically.
Querying a materialized view gives better performance than querying the base
tables.
It can be created on single table; we can’t build it on multiple tables by joining.
Use materialized view on a table which is queried frequently.
The results of the view are kept up to date automatically, stored and directly
pulled every time the view is referenced.
Storage cost: Materialized view stores query results, which adds to the monthly
storage usage for account.
Compute cost: To prevent materialized view from becoming out-of-date,
snowflake performs automatic background maintenance of materialized views.
When a base table changes, all materialized views defined on the table are
updated by a background service that uses compute resources provided by
snowflake. So, there will be a compute cost associated with it.
3. Secured views: -
A secure view does not allow users to see the definition of the view.
The definition of the view is exposed only to authorized users.
If we don’t want the users to see underlying tables present in a database create
secure view.
The view can be referenced but its underlying definition is not exposed.
Use secure views whenever the view logic must be hidden from the view users.
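The three view types above can be sketched as follows (table and view names are hypothetical):

```sql
-- Regular (non-materialized) view: query runs each time the view is referenced.
create or replace view v_orders as
  select order_id, amount from orders where amount > 0;

-- Materialized view: pre-computed results on a single table (Enterprise edition or higher).
create or replace materialized view mv_daily_totals as
  select order_date, sum(amount) as total
  from orders
  group by order_date;

-- Secure view: definition hidden from unauthorized users.
create or replace secure view sv_orders as
  select order_id, amount from orders;
```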
Important Points:
An ORDER BY clause can be part of a view definition, but Snowflake recommends
excluding it.
Views are not dynamic and do not change automatically unless the underlying sources
are modified.
We cannot use limit clause in a materialized view.
Self-join is also not possible within the materialized view.
Whenever a view is created and granted privileges on that view to a role, the role can
use the view even if the role doesn’t have privileges on the underlying table.
Use materialized views when:
The query results from the view don't change often.
The results of the view are used often.
The query consumes a lot of resources (i.e., the query takes a long time to
process and fetch the data).
Create a regular view when:
The results of the view change often.
The results are not used often.
The query is simple.
The query contains multiple tables.
For secure view snowflake doesn’t show how much data is scanned.
Snowflake accepts the force keyword but doesn’t support it.
Do not query stream objects in the select statement. Streams are not designed to serve
as a source for views or materialized views.
Creating a materialized view requires create materialized view privilege on the schema
and select privilege on the base table.
When you choose a name for a materialized view, note that a schema cannot contain a
table and view with the same name.
We can’t specify a Having and Order By clause in a materialized view.
A materialized view can’t query:
A materialized view
A non-materialized view
A UDTF (User Defined Table Function)
A materialized view can’t include:
UDFs, Limit, Window functions, etc.,
Snowpipe: -
Snowpipe enables loading data from files as soon as they are available in a stage.
This means you can load data from files in micro-batches, making it available to users
within minutes, rather than manually executing COPY statements on a schedule to load
larger batches.
Continuous loading means loading small volumes of data in a continuous manner, e.g.,
every 10 minutes or every hour.
It can be live or real time data.
For loading continuous data into tables, snowflake uses Snowpipe.
The data is loaded according to the COPY statement defined in a referenced pipe.
It uses resources provided by Snowflake; it is a serverless feature.
It is a one-time setup.
The suggested file size for micro-batches is 100-250 MB.
Snowpipe uses file loading metadata associated with each pipe object to prevent
reloading the same files (and duplicating data) in a table.
This metadata stores the path (i.e., prefix) and name of each loaded file, and prevents
loading files with the same name even if they were later modified.
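A minimal pipe definition might look like this (the stage, table, and pipe names are hypothetical; AUTO_INGEST relies on cloud event notifications, e.g. S3 to SQS, configured separately):

```sql
-- Hypothetical: run the embedded COPY whenever new files land in the stage.
create or replace pipe orders_pipe
  auto_ingest = true
as
copy into orders
from @my_s3_stage/orders/
file_format = (type = csv skip_header = 1);
```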
Zero copy cloning: -
Snowflake allows you to create clones, also known as zero copy clones.
We can perform clone operation on databases, schemas, tables, streams, file formats,
stages, tasks.
We can maintain multiple copies of data with no additional storage cost, hence "zero copy".
A snapshot of data present in the source object is taken when the clone is created and is
made available to cloned object.
The cloned object and its source are independent of each other.
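Cloning is a single statement at each object level (names are hypothetical); only metadata is copied at clone time:

```sql
create table orders_dev clone orders;
create schema raw_backup clone raw;
create database sales_db_clone clone sales_db;
```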
Streams: -
A Stream is an object that records DML changes made to table including insert, update
and delete.
A stream records an update operation as a pair: a delete (of the old record) and an
insert (of the new record).
It tracks all row level changes to a source table using offset but doesn’t store the
changed data.
We call this process as change data capture (CDC).
Streams store metadata about each change, so that actions can be taken using this
metadata.
Streams can be combined with tasks to set continuous data pipeline.
Snowpipe + stream + task = continuous data pipeline.
Along with changes made to table streams maintain 3 metadata fields i.e.,
METADATA$ACTION, METADATA$ISUPDATE and METADATA$ROW_ID.
METADATA$ACTION   METADATA$ISUPDATE   Action
INSERT            FALSE               Identifies insert records
INSERT            TRUE                Identifies update records
DELETE            FALSE               Identifies delete records
Types: -
1) Standard stream/Delta stream: - A standard stream records all DML changes made to
table including insert, update and delete.
Create or replace stream stream_name on table table_name;
2) Append-only streams - It tracks row inserts only. Update and delete (including table
truncate) operations are not recorded.
Create or replace stream stream_name on table table_name Append_only = true;
3) Insert only stream: - It tracks only row inserts for external tables only. They do not
record delete operations.
Create or replace stream stream_name on external table table_name insert_only = true;
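Consuming a stream can be sketched as below (table names are hypothetical). Note that reading a stream inside a DML statement advances its offset, so the same changes are not processed twice:

```sql
create or replace stream orders_stream on table orders;

-- Move newly inserted rows into a history table; the stream offset
-- advances once this DML statement commits.
insert into orders_history
select order_id, amount, metadata$action, metadata$isupdate
from orders_stream
where metadata$action = 'INSERT';
```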
Tasks: -
We use tasks for scheduling in snowflake.
We can schedule
SQL queries
Stored procedures
Tasks can be combined with table streams for implementing the continuous change data
captures.
We can maintain DAG of tasks to keep the dependencies between tasks.
Tasks require compute resources to execute SQL code, we can choose either of
Snowflake managed compute resources (serverless) --> introduced recently
(even though we don’t mention warehouse it will consume snowflake compute
resources)
User managed (Virtual warehouses)
DAG of tasks: -
DAG – Directed Acyclic Graph.
To maintain dependencies between tasks.
A root task followed by child tasks.
Just schedule root task, child tasks will be executed in order.
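A two-node DAG might be sketched like this (task, warehouse, and table names are hypothetical; only the root task carries a schedule, children use AFTER):

```sql
-- Root task: runs every 10 minutes.
create or replace task load_task
  warehouse = demo_wh
  schedule  = '10 MINUTE'
as
  insert into orders_history select * from orders_stream;

-- Child task: runs after the root completes.
create or replace task cleanup_task
  warehouse = demo_wh
  after load_task
as
  delete from staging_orders where loaded = true;

-- Tasks are created suspended; resume children first, then the root.
alter task cleanup_task resume;
alter task load_task resume;
```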
Time travel: -
It enables access to historical data, i.e., data that has been changed or deleted, at any
point within a defined period.
It serves as a powerful tool for performing the following tasks:
Restoring data-related objects (tables, schemas, and databases) that might have
been accidentally or intentionally deleted.
Duplicating and backing up data from key points in the past.
Analyzing data usage/manipulation over specified periods of time.
Retention period:
It specifies the number of days for which historical data is preserved. The higher the
retention period, the higher the storage cost.
Increasing the retention period causes data currently in Time Travel to be retained for
the longer period.
For example, if you have a table with a 10-day retention period and increase the period
to 20 days, data that would have been removed after 10 days is now retained for an
additional 10 days before moving into Fail-safe.
Note that this doesn’t apply to any data that is older than 10 days and has already
moved into Fail-safe.
Decreasing Retention reduces the amount of time data is retained in Time Travel:
For active data modified after the retention period is reduced, the new shorter period
applies.
For data that is currently in Time Travel:
If the data is still within the new shorter period, it remains in Time Travel.
If the data is outside the new period, it moves into Fail-safe.
For example, if you have a table with a 10-day retention period and you decrease the
period to 1-day, data from days 2 to 10 will be moved into Fail-safe, leaving only the
data from day 1 accessible through Time Travel.
Changing the retention period for your account or individual objects changes the value
for all lower-level objects that do not have a retention period explicitly set. For example:
If you change the retention period at the account level, all databases, schemas,
and tables that do not have an explicit retention period automatically inherit the
new retention period.
If you change the retention period at the schema level, all tables in the schema
that do not have an explicit retention period inherit the new retention period.
Keep this in mind when changing the retention period for your account or any objects in
your account because the change might have Time Travel consequences that you did
not anticipate or intend. In particular, we do not recommend changing the retention
period to 0 at the account level.
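Setting the retention period at different levels is a single ALTER statement (the table name is hypothetical; account-level changes require the ACCOUNTADMIN role):

```sql
-- Account level: inherited by all objects without an explicit setting.
alter account set data_retention_time_in_days = 30;

-- Object level: explicit setting overrides the inherited value.
alter table orders set data_retention_time_in_days = 10;
```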
Fail safe:
Fail-safe provides a (non-configurable) 7-day period during which historical data may be
recoverable by Snowflake.
This period starts immediately after the Time Travel retention period ends.
Fail-safe is a data recovery service that is provided on a best effort basis and is intended
only for use when all other recovery options have been attempted.
Fail-safe is not provided as a means for accessing historical data after the Time Travel
retention period has ended. It is for use only by Snowflake to recover data that may
have been lost or damaged due to extreme operational failures.
Data recovery through Fail-safe may take from several hours to several days to
complete.
Querying historical data: -
1. The following query selects historical data from a table as of the date and time
represented by the specified timestamp:
select * from table_name at (timestamp => 'wed, 28 sep 2022 [Link]'::timestamp);
2. The following query selects historical data from a table as of 5 minutes ago:
select * from table_name at (offset => -60*5);
3. The following query selects historical data from a table up to, but not including, any
changes made by the specified statement:
select * from table_name before (statement => '****query_id****');
Column level security: -
Column level security in snowflake allows the application of a masking policy to a
column within a table or view.
It protects sensitive data such as customers' PHI, bank balances, etc.
It includes two features
Dynamic data masking
External Tokenization
Dynamic data masking is the process of hiding data by masking with other characters.
We can create masking policies to hide the data present in columns.
External Tokenization is the process of hiding sensitive data by replacing it with cipher
text. External tokenization makes use of masking policies with external functions
created on the external cloud provider side.
Masking policies: -
Snowflake supports masking policies to protect sensitive data from unauthorized access
while allowing authorized users to access it at query runtime.
Masking policies are schema level objects.
Masking policies can include conditions and functions to transform the data when these
conditions are met.
Same masking policy can be applied on multiple columns.
Dynamic data masking: -
Sensitive data in Snowflake is not modified in the existing table. When users execute
a query, the masking is applied dynamically and the masked data is displayed; hence
the name Dynamic Data Masking.
The data can be masked, partially masked, obfuscated (made unclear), or tokenized.
Unauthorized users can operate on the data as usual, but they cannot view it.
Masking policies are mostly applied based on roles.
Limitations: -
Before dropping masking policies, we must unset them.
The data types of the input and output values must be the same.
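A masking policy and its application can be sketched as follows (the policy, role, table, and column names are hypothetical):

```sql
-- Reveal email only to an authorized role; mask it for everyone else.
create or replace masking policy email_mask as (val string) returns string ->
  case
    when current_role() in ('PII_ADMIN') then val
    else '*** MASKED ***'
  end;

alter table customers modify column email set masking policy email_mask;

-- Before dropping the policy, it must be unset from every column:
-- alter table customers modify column email unset masking policy;
```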
Caching: -
Caching is temporary storage that keeps copies of files or data so they can be
accessed faster in the near future.
Cache plays a vital role in saving costs and speeding up results.
It improves query performance.
Types of caches in snowflake
Metadata Caching
Query results cache (or) Result cache
Local disk cache (or) Warehouse cache
Metadata cache: -
Metadata about tables and micro-partitions is collected and managed by Snowflake
automatically.
Snowflake does not need compute to provide range values such as MIN, MAX, number of
distinct values, NULL count, row count, and clustering information.
Fetching metadata is fast.
Results cache: -
The results cache is located in the cloud services layer. Cached results are available for
the next 24 hours.
The results cache is available and can be accessed across different virtual warehouses.
A query result returned to one user is available to any other user on the system who
executes the same query.
It works as long as the underlying data has not changed.
The mandatory condition here is that the query must be identical.
It won't work for a subset of the data, and won't work if we re-order columns.
Local disk cache: -
The local disk cache is located in the virtual warehouse (cached data is stored on EC2
instances if the Snowflake account is hosted on AWS, and on virtual machines if hosted
on Azure).
It caches (stores) the data, not the results, fetched by SQL queries (we can re-order
columns, i.e., an identical query is not required).
Whenever data is needed for a given query, it is retrieved from remote disk storage
and cached in SSD and memory (the first time).
Cached data is only available while the VW is up and running.
Once the VW is suspended, the cache is deleted.
It even works when we query a subset of the data that is available in the local disk cache.
E.g., suppose we query 10k records for the first time; the local disk cache will hold these
10k records, and next time if we query only 2k or 3k records, a subset of the above 10k,
they will be fetched from the local disk cache.
This cache depends on the virtual warehouse size we are using.
E.g., a small VW can't hold millions of records, but it can fetch part of the data from the
local disk cache and the rest from remote disk.
Imp. Points: -
Caching helps not only to improve performance but also saves a lot of compute credits.
Snowflake's architecture has 3 major components; the cloud services layer caches the query
result, sometimes referred to as the Result Set Cache or Query Result Cache.
The result set cache holds the results of every query executed in the past 24 hours.
The result cache is available across all virtual warehouses.
The result set cache is invalidated by Snowflake when the underlying data changes.
The result set cache is not used when the newly submitted query does not match a
previously executed query: a result stays usable as long as the underlying data doesn't
change and you submit a word-for-word identical query within 24 hours of the original
query.
Query result is reused if the following criteria is met,
New query syntactically matches the previously executed query.
Query doesn’t include functions that are evaluated at execution time (excluding
current date).
Query doesn’t include UDFs or external functions.
The underlying data has not changed.
Each time the persisted result for a query is reused, Snowflake resets the 24-hour retention
period for the result, up to a maximum of 31 days from the date and time the query was
first executed.
Snowflake provides a table function, RESULT_SCAN, that returns the result of a query
executed within the last 24 hours (from when you executed the query) as if the result
were a table.
The role accessing the cached results should have required privileges to the underlying
tables.
The size of the warehouse cache is determined by the compute resources in the warehouse
(i.e., the larger the warehouse, the more compute resources, and the larger the cache).
Decreasing the size of a running warehouse removes compute resources from the
warehouse. When the compute resources are removed, the cache associated with those
resources is dropped.
Any kind of caching doesn’t incur any storage cost.
Queries that evaluate functions at execution time (current_timestamp, etc.) can't use the
result cache; the current_date() function is an exception.
Table record counts are stored in Snowflake's cloud services layer, and this information is
fetched from the metadata service or metadata cache.
SHOW TABLES; is a metadata operation and doesn't need a virtual warehouse or any
cached data usage; Snowflake uses the metadata cache to fetch the result.
A security token used to access large, persisted query results (i.e. greater than 100KB in
size) expires after 6 hours. A new token can be retrieved to access results while they are still
in cache. Smaller persisted query results do not use an access token.
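RESULT_SCAN from the points above can be used like this (the table name is hypothetical):

```sql
-- Run a query, then treat its cached result as a table.
select order_id, amount from orders;

select count(*)
from table(result_scan(last_query_id()));
```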
Access Control: -
Access control privileges determine who can access database objects and perform
operations on specific objects in Snowflake.
Snowflake’s approach to access control combines aspects from both of the following
models:
Discretionary Access Control (DAC): Each object has an owner, who can in turn
grant access to that object.
Role-based Access Control (RBAC): Access privileges are assigned to roles, which
are in turn assigned to users.
The key concepts to understanding access control in Snowflake are:
Securable object: An entity to which access can be granted. Unless allowed by a
grant, access is denied. Tables, Schemas, Views etc.
Role: An entity to which privileges can be granted. Roles are in turn assigned to
users. Note that roles can also be assigned to other roles, creating a role
hierarchy.
Privilege: A defined level of access that can be granted to an object. Multiple
distinct privileges may be used to control the granularity of access granted.
User: Specifies the person or system to whom access was granted.
In the Snowflake model, access to securable objects is allowed via privileges assigned to
roles, which are in turn assigned to other roles or users. In addition, each securable
object has an owner that can grant access to other roles.
This model is different from a user-based access control model, in which rights and
privileges are assigned to each user or group of users. The Snowflake model is designed
to provide a significant amount of both control and flexibility.
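The RBAC flow above (privileges to a role, role to a user) can be sketched as follows (role, object, and user names are hypothetical):

```sql
create role analyst;

-- Privileges are granted to the role...
grant usage on database sales_db to role analyst;
grant usage on schema sales_db.raw to role analyst;
grant select on all tables in schema sales_db.raw to role analyst;

-- ...and the role is granted to the user.
grant role analyst to user alice;
```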
Data Sharing: -
Secure Data Sharing enables sharing selected objects in a database in your account with
other Snowflake accounts.
The following Snowflake database objects can be shared:
Tables
External tables
Secure views
Secure materialized views
Secure UDFs
Provider is the one who is sharing the data.
Consumer is the one who is consuming the data from data provider.
We can share data in the following ways.
1. Account to account share (Direct share)
In this provider can share data to consumer, who is in same cloud and
same region.
Let’s say the provider account exists on AWS cloud with the US-east-1
region, to provide data to consumer, the consumer account must also
exist on AWS cloud with US-east-1.
In this type of sharing consumer will have to pay only for compute and
provider will pay for the storage.
2. Reader account (Direct Share)
In this case the provider needs to share the data to the consumer who
don't have snowflake account, then the provider can create reader
account and allows consumer to access the data.
In this type of sharing consumer will have to pay for both compute and
storage.
Reader account users cannot perform any DML operations; they only have
SELECT access.
3. Cross cloud and Cross region (DATA REPLICATION)
If the provider wants to share data with a consumer who is in the same
cloud but a different region, or in a different cloud and different region,
we need to replicate the data.
Let’s say the provider account exists on AWS cloud with the US-east-1
region and the consumer account in AWS cloud with US-west-1 or in
GCP/AZURE then we need to replicate the data.
In this way of sharing snowflake makes a copy of data to the consumer
account.
This way of sharing is costlier.
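A direct share on the provider side can be sketched as follows (share, object, and account names are hypothetical; the consumer then creates a database from the share):

```sql
create share sales_share;

-- Grant the objects to be shared.
grant usage on database sales_db to share sales_share;
grant usage on schema sales_db.raw to share sales_share;
grant select on table sales_db.raw.orders to share sales_share;

-- Make the share visible to a consumer account (account locator is hypothetical).
alter share sales_share add accounts = consumer_account;
```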
User Defined Functions: -
A UDF allows you to perform operations that are not available through the built-in,
system-defined functions.
Create a UDF whenever there is a need to reuse the same functionality.
Snowflake supports 4 languages for writing UDFs.
SQL
Java Script
Java
Python
Snowflake UDFs can return scalar (a single value or string) or tabular results.
Snowflake UDF overloading means functions with the same name but different
parameters are supported.
Proc_calculate_area() is different from Proc_calculate_area(radius float)
Proc_calculate_area(radius float) is different from Proc_calculate_area(length
number, width number)
Sample UDFs:
SCALAR: Returns an output for each input we pass.
create function area_of_circle(radius float)
returns float
as
$$
pi() * radius * radius
$$;

select area_of_circle(4.5);

TABULAR: Can return zero, one, or multiple rows.
create function t()
returns table(name varchar, age number)
as
$$
select 'RAVI', 34
union
select 'LATHA', 27
union
select 'MADHU', 25
$$;
Note: - In 99% of cases we do not use tabular UDFs; we can use stored procedures
instead.
How can you handle it if data coming from a file exceeds the length of a column in the table?
We can handle this by specifying TRUNCATECOLUMNS = TRUE in the COPY command. If we
don't specify this, the COPY command will fail. By default it is set to FALSE.
Does Snowflake support indexes?
No, we can't define indexes on Snowflake tables; instead we can use clustering keys on
larger tables for better performance.
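A clustering key is declared with a single ALTER statement (the table and columns here are hypothetical):

```sql
-- Cluster a large table on commonly filtered columns instead of an index.
alter table orders cluster by (order_date, region);

-- Check how well the table is clustered on those columns:
select system$clustering_information('orders', '(order_date, region)');
```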
What do you mean by Horizontal and Vertical Scaling?
Horizontal Scaling: Horizontal scaling increases concurrency by scaling horizontally. As
your customer base grows, you can use auto-scaling to increase the number of virtual
warehouses, enabling you to respond instantly to additional queries.
Vertical Scaling: Vertical scaling involves increasing the processing power (e.g., CPU,
RAM) of an existing machine, which can reduce processing time. Consider choosing a
larger virtual warehouse size if you want to optimize your workload and make it run
faster.