INTRODUCTION TO CLOUD COMPUTING
Let's discuss what came before the cloud.
If you needed a server, you had to:
● Buy it.
● Install the operating system and the software that you needed.
● Maintain it yourself.
● Replace it when it became old.
● And in general, you had to have your own IT team.
And that means you often ended up running a crowded server room of your own.
Now, in addition to servers, the same goes for networking, databases, user
management and more. All of this had to be taken care of by you or your team, but there
is more.
And here comes the cloud.
And what exactly is the cloud?
The unscientific definition of the cloud is compute, networking, storage and other services
managed by someone else.
That means that, all around the world, there are huge data centers filled with servers,
networking, storage and everything else, and someone else is responsible for keeping them
working properly.
And that brings us to cloud providers.
Cloud provider:
Well, these are companies that build those huge data centers. They fill them with servers,
networking, cooling, electricity, etc., and they also design and install various services.
And then, after they do that, they make it all publicly accessible.
So basically, we have a huge server farm that someone else is taking care of and we can just
come and use it.
What exactly are cloud services?
The cloud providers offer a lot of additional services:
● AI
● IoT (Internet of Things)
● Kubernetes, and lots more
So in the cloud era, if you need a server, you can:
● Create it in the cloud within minutes.
● Use it as you wish.
● Pay only for what you use.
● Shut it down when not needed.
● Have it automatically maintained, patched, secured and monitored.
Characteristics of cloud:
So as we said before, the unscientific definition of cloud is compute, networking, storage
and other services managed by someone else.
But actually there is a more formal definition of cloud.
And this definition comprises the five characteristics of cloud computing.
And these characteristics are
● On demand self-service
● Broad network access
● Resource pooling
● Rapid elasticity
● Measured service
On demand self-service:
● No human interaction is needed for resource provisioning. That means that if I need
a new virtual machine or VM, I don't need to send an email to someone or to pick up
the phone and talk to someone. I can do that with a click of a button.
I just go to some Web app, define exactly what I need, click the create or next or
add or something like that. And after a few minutes I will get my virtual machine.
● Provisioning is available 24/7.
Broad Network Access:
● Resources that I create in the cloud can be accessed from anywhere over the
network. For example, I can create a virtual machine in the US and access it from my
home in India.
No physical access is required at any time:
● At no point in time do I need to access the physical host machine or the data
center to perform maintenance, check on things, or connect cables.
Resource pooling:
● Resource pooling is when cloud providers offer provisioned, scalable services
to multiple clients or customers. In other words, space and resources are pooled
to serve multiple clients at one time. Depending on a client's resource
consumption, usage can be set to provide more or less at any given time.
● Now, some advanced cloud services allow for physical resource separation, mainly
for security reasons.
Some services, for example, allow us to define that this service is going to be
created on a separate physical host. They still do not allow us to select which host
to use, but we can still define that. We want to use a separate host. Of course, these
services are considerably more expensive than the usual ones, but still it's possible
and it's up to us to make this decision.
Rapid elasticity:
● Resources can be scaled up and down as needed, automatically. This is exactly
what the cloud enables us to do.
● There is no need to purchase resources for a one-time peak scenario.
This is one of the most important capabilities of the cloud, and one of the great
motivations for on-premises organizations to move to it.
Measured service:
● This means that payment is made only for resources actually used. So, again, if I
decide to use, let's say, two virtual machines, then I pay only for those two virtual
machines. If I later add another one, then I pay for that one as well, and if
after that I decide to shut it down, then at that exact moment I stop paying for it.
● And the very important bottom line here is that we do not need to invest money in
unused resources.
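The pay-for-what-you-use idea above can be sketched as a small calculation. The hourly rate here is an invented illustrative number, not a real Azure price:

```python
# Hypothetical pay-per-use billing: you pay only for the hours each VM ran.
# The $0.10/hour rate is made up for illustration; real prices vary by VM size.
HOURLY_RATE = 0.10

def monthly_cost(vm_hours):
    """Sum the cost of each VM's actual running hours."""
    return sum(hours * HOURLY_RATE for hours in vm_hours)

# Two VMs run the whole month (~730 h); a third runs 100 h, then is shut down.
# From the moment it is shut down, it stops costing money.
print(monthly_cost([730, 730, 100]))
```

The point of the sketch: the third machine contributes only its 100 running hours to the bill, nothing more.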
To better understand the financial aspect of using the cloud, there are two terms we need
to know.
These are CapEx and OpEx.
These terms stand for capital expense and operating expense.
Traditionally IT - CapEx oriented
● major investment for:
○ building data center
○ producing servers
○ purchasing air conditioning
○ purchasing the network devices
○ purchasing software licenses.
And only then, after doing all that and spending all that money, can we finally use it.
Even though traditional IT is CapEx oriented, there is also some OpEx involved:
● Electricity is usually paid monthly.
● Salaries are paid to the IT staff every month.
● Maintenance is also an ongoing payment, and more.
OpEx: With the cloud, on the other hand, we first pay for two servers and let them run
throughout the year until we reach November.
In November we add two servers to prepare for Black Friday, and after that we decrease
the number of servers back to two.
So we pay for the additional two servers only in November.
So to summarize, these are CapEx and OpEx.
CapEx is not optimal and not flexible.
And OpEx, on the other hand, is extremely flexible and the most optimal method we want
to use and this is what we get with the cloud.
The cloud enables us to make all our investment in OpEx, which makes it much more
financially efficient.
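The Black Friday example above can be put into numbers. The per-server monthly price is invented for illustration, and the on-premises figure is a simplification (sizing for the peak all year):

```python
# CapEx vs OpEx for the Black Friday scenario (illustrative prices only).
SERVER_MONTHLY = 200  # hypothetical cost of running one server for one month

def opex_cost():
    # Cloud: 2 servers for 11 months, 4 servers during November's peak.
    return 2 * SERVER_MONTHLY * 11 + 4 * SERVER_MONTHLY * 1

def peak_sized_cost():
    # On-premises you must own enough hardware for the peak: 4 servers, 12 months.
    return 4 * SERVER_MONTHLY * 12

print(opex_cost(), peak_sized_cost())
```

Under these assumed numbers the cloud approach pays for the extra capacity only in the month it is actually needed.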
Types of cloud services:
1. IaaS
2. PaaS
3. SaaS
IaaS: stands for Infrastructure as a Service.
And what it means is that the cloud provides the underlying platform, such as compute,
networking and storage, and then the client is responsible for all the rest.
In other words, the cloud provides the absolute minimum infrastructure and we clients of
the cloud are expected to take care of all the rest.
The most common example of IaaS is virtual machines. With virtual machines, the cloud
provides the host machine, the networking and the disks, and the client creates the
virtual machine (the guest machine) and then installs software on it, patches it,
maintains it, etc.
So it is the responsibility of the client to make sure the virtual machine is up and running;
the cloud has nothing to do with it.
PaaS: stands for Platform as a Service.
The cloud provides a platform for running apps.
That means that the cloud provides the compute, networking, storage, runtime
environment, scaling, redundancy, security updates, patching, maintenance, etc., and the
client just needs to bring the code to run.
So our responsibility as a client of the cloud is to develop our software and then upload it
to the cloud and the cloud takes care of all the rest.
The most common example of PaaS is, of course, web apps. The cloud provides the
runtime for running the web apps, and the client only uploads the code, which then just
runs.
SaaS: stands for Software as a Service.
This is software running completely in the cloud; the user (us, in this case) does not need
to install anything on premises or on a local machine.
The provider of the software takes care of updates, patches, redundancy, scalability, etc.
We can only use the software; we can't, and shouldn't, mess with its underlying
infrastructure.
So the most common examples of SaaS are software such as Office 365 and Salesforce,
which are installed in the cloud.
We have no idea what infrastructure Office 365 and Salesforce run on, what virtual
machines they use, what language they are developed in, what database they use, and so on. We just use them.
Types of clouds:
the three types are
1. public cloud,
2. private cloud and
3. hybrid cloud.
Public cloud:
● It is a cloud that is set up in the public network.
● It is managed by large companies such as Microsoft, Amazon, Google and so on.
● It is accessible through the Internet.
● So we do not need a specialized internal network to access the cloud.
● It is also available to all clients and users and not only users of a specific
organization, and the clients have no access to the underlying infrastructure.
● The most popular public clouds are AWS of Amazon, Azure of Microsoft, Google
Cloud and IBM cloud.
Private cloud:
● It is a cloud set up in an organization's premises.
● It is managed by the organization's IT team.
● Accessible only in the organization's network.
● Available only to users from the organization.
External users that are not members of the organization do not have access to this
cloud.
This cloud uses private cloud infrastructure and engines, and it usually contains only a
subset of the public cloud capabilities.
Organizations usually use private clouds for security reasons because they want the
resources and data to be stored inside the organization premises and they do not want
them to be exposed in the public cloud.
So some examples of private clouds are VMware Cloud, Red Hat OpenShift and Azure Stack.
hybrid cloud:
So a hybrid cloud is a cloud that is set up in an organization's premises, but it is also
connected to the public cloud.
That means that workloads can be separated between the two clouds; for example, sensitive
data on the organization's premises and public data in the public cloud.
So, for example, the organization can decide that the usernames, passwords and
perhaps even the credit card details of its users will be stored inside the organization's premises.
But for example, the professional profile of its users, such as the ones in LinkedIn, will be
stored in the public cloud.
Now, usually hybrid clouds are managed by the public cloud, which is connected to the
private cloud. But this is not always the case.
Sometimes you can see it the other way around, so that the private cloud controls and
manages the resources.
Some examples of hybrid clouds are Azure Arc and AWS Outposts. Both extend the public
cloud into the organization's premises, and manage and control the private cloud from
the public cloud.
Microsoft Azure:
It's a collection of online services that organizations can use to build, host, and deliver
applications.
The best part is that you don't need to have your own data center or even any servers
because Azure runs in Microsoft's data centers around the world, which your users can
access over the internet.
Not only does this approach save you the trouble of having to build and maintain your own
on-premises IT infrastructure, but it can also save you money because you only have to pay
for what you use, and you can scale your Azure resources up and down as needed.
For most applications, you need three core elements:
compute, storage, and networking.
1. Compute:
In Azure's early days, Microsoft offered only one type of compute service:
virtual machines, or VMs for short.
These are machines that run either Windows or Linux. If you currently have an application
running on a Windows or Linux server, then the most straightforward way to migrate it to
Azure is to do what's called a "lift and shift" migration.
That is, you simply lift the application from your on-premises server and shift it to a virtual
server in the cloud.
Azure VMs are known as Infrastructure-as-a-Service because they're traditional IT
infrastructure components that are offered as a service.
Azure also has a Platform-as-a-Service offering called Azure App Service. This platform lets you host web
and mobile applications without having to worry about the underlying infrastructure.
After doing a minor amount of configuration, you can just upload your code to an App
Service instance and let Azure take care of the details.
In most cases, this is a better solution than using virtual machines, but there are times
when it makes more sense to use VMs.
For example, if you have an application that's not a web or mobile app, then you can't use
App Service, so you'll have to use a VM.
These days, the hottest compute technology is containers. These are self-contained
software environments.
For example, a container might include a complete application plus all of the third-party
packages it needs.
Containers are somewhat like virtual machines except they don't include the operating
system. This makes it easy to deploy them because they're very lightweight compared to
virtual machines. In fact, containers run on virtual machines.
Microsoft provides a variety of ways to run containers. The simplest way is to use Azure
Container Instances. This service lets you run a container using a single command.
If you have a more complex application that involves multiple containers, then you'll
probably want to use Azure Kubernetes Service, which is what's known as a container
orchestrator. It makes it easy to deploy and manage multi-container applications.
Azure Functions: This is Microsoft's main serverless offering. Azure Functions is kind of like
Azure App Service,
except that it executes individual functions rather than entire applications, and you only
pay for it when it gets used. When you provision an App Service instance, it runs until you
shut it down, and you pay for it the whole time it's running. Although it's possible to
configure Azure Functions in the same way, it's usually better to use the Consumption plan,
which means that it only uses resources when a function is running,
so you only pay when a function is running.
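As a rough sketch, Consumption-plan billing charges per execution and per GB-second of compute time. The rates below are assumptions for illustration; check current Azure pricing for real numbers:

```python
# Rough Azure Functions Consumption-plan cost model.
# Both rates are assumed illustrative values, not authoritative prices.
PER_MILLION_EXECUTIONS = 0.20   # dollars per 1M invocations (assumption)
PER_GB_SECOND = 0.000016        # dollars per GB-second of compute (assumption)

def consumption_cost(executions, avg_seconds, memory_gb):
    """Estimate monthly cost: per-execution fee plus GB-seconds consumed."""
    exec_cost = executions / 1_000_000 * PER_MILLION_EXECUTIONS
    compute_cost = executions * avg_seconds * memory_gb * PER_GB_SECOND
    return exec_cost + compute_cost

# 1 million invocations, 0.5 s each, using 0.5 GB of memory:
print(round(consumption_cost(1_000_000, 0.5, 0.5), 2))
```

The key contrast with an always-on App Service plan: if the function never runs, the cost is zero.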
2. Storage:
Azure Blob storage:
The simplest form of storage is called Blob storage. It's referred to as object storage, but
really it's just a collection of files. It's not like a normal file system, though, because it
doesn't have a hierarchical folder structure. It has a flat structure. It's typically used for
unstructured data, such as images, videos, and log files. One of the great things about it is
that it has multiple access tiers: hot, cool, and archive.
The hot tier is for frequently accessed files.
The cool tier is for files you expect to access only about once a month or less.
The advantage is that it costs less than the hot tier as long as you don't access it frequently.
The archive tier is for files that are rarely accessed, such as backups. It has the lowest
storage costs
but the highest retrieval costs. It also takes several hours to retrieve files from the archive
tier.
File storage:
If you need hierarchical file storage, there are a couple of options. The one that will
probably seem more familiar is Azure File Storage, which is like a typical SMB file server. It
serves up file shares
that you can mount on Windows servers.
Azure Data Lake Storage Gen2:
This is Hadoop-compatible storage for use with data analytics applications.
Azure SQL Database: In an on-premises Microsoft environment, SQL Server is the most
commonly used database. The cloud equivalent is Azure SQL Database. It's very similar to
SQL Server, although it's not 100% compatible.
Azure open source databases: If you need to run an open source database, then Microsoft still has
you covered. It offers Azure Database for MySQL, MariaDB, and PostgreSQL. All of these
databases, including both SQL Database and the open source options, are suitable for
online transaction processing.
Azure Synapse Analytics: If you need to build a data warehouse, then Azure Synapse
Analytics is the best choice.
Azure Cosmos DB: If you release an application that attracts a very large number of users,
you may find that a traditional relational database can't scale to meet the demand. One
common solution is to use a so-called NoSQL database. These databases are designed to
handle far more data than relational databases. However, in order to achieve that massive
scalability, they have to sacrifice something, so they don't support all of the features of
relational databases. Nonetheless, they have become a cornerstone of many cloud-based
applications. Microsoft's main NoSQL offering is called Cosmos DB. It's an amazing
database service that can scale globally.
Azure Cache for Redis: This is another NoSQL service, typically used to speed up
applications by caching frequently requested data.
3. Network services:
VNet: When you create a virtual machine on Azure, you have to put it in a virtual network,
or VNet.
A virtual network is very similar to an on-premises network. Each virtual machine in a VNet
gets an IP address, and it can communicate with other VMs in the same VNet.
Subnets: You can also divide a VNet into subnets and define routes to specify how traffic
should flow between them. By default, all outbound traffic from a VM to the internet is
allowed. If you also want to allow inbound traffic, then you need to assign a public IP
address to the VM. If you want VMs in one VNet to be able to communicate with VMs in
another VNet, then you can connect the VNets together using VNet peering.
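The idea of carving a VNet's address space into subnets can be illustrated with Python's `ipaddress` module. The address ranges and subnet roles here are hypothetical:

```python
import ipaddress

# A hypothetical VNet address space, divided into /24 subnets.
vnet = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vnet.subnets(new_prefix=24))

print(len(subnets))   # how many /24 subnets fit in a /16
print(subnets[0])     # e.g. a "web" subnet (hypothetical role)
print(subnets[1])     # e.g. a "database" subnet (hypothetical role)
```

Azure VNets use the same CIDR arithmetic: you pick an address space for the VNet and assign non-overlapping sub-ranges of it to each subnet.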
VPN: If you want to create a secure connection between a VNet and an on-premises
network, then you can use either a VPN, which stands for Virtual Private Network, or Azure
ExpressRoute. A VPN sends encrypted traffic over the public internet, whereas
ExpressRoute: communicates over a private, dedicated connection between your site and
Microsoft's Azure network. ExpressRoute is much more expensive than a VPN, but it
provides higher speed and reliability since it's a dedicated connection.
Azure Storage
1. It's durable and highly available: it stores all data redundantly.
2. It's secure: all data is encrypted automatically and you can set fine-grained access control
to it.
3. It's scalable: you can always add more data without having to worry about
provisioning hardware to hold it. There is a 500 terabyte limit, but if you need more than
that, you can contact Azure Support.
4. It's a managed service, so you don't have to worry about maintenance.
5. It's accessible over the web.
Redundancy Options:
there are four different options depending on your needs.
Locally-redundant storage: is replicated across racks in the same data center. This means
that if there's a disaster in that data center, your data could be lost. Although this is highly
unlikely, you should only use locally-redundant storage if you can easily reconstruct your
data.
Zone-redundant storage: is replicated across three zones within one region. So if an
entire zone goes down,
your data will still be available.
Geo-redundant storage: is replicated across two regions. So even if an entire region goes
down,
your data will still be available. However, in the case of a regional disaster, you'd have to
wait for Microsoft
to perform a geo-failover before you could access your data in the secondary region.
That's why you may want to consider using read-access geo-redundant storage.
It's the same as geo-redundant storage, except that if there's a disaster in your primary
region,
then you can read your data from the secondary region immediately. You won't have write
access though,
so if you can't wait until Microsoft restores availability in the primary region, then you'll
have to copy your data to yet another region and point your applications to the new
location.
You can copy data to and from Azure Storage
using the AzCopy utility, Azure PowerShell, or the Azure Storage SDK, which is available for
a variety of programming languages.
Azure Storage supports four types of data:
blobs, files, queues, and tables.
1. Blob stands for binary large object, but really a blob is just a file.
So why is there a distinction between blob storage and file storage?
The difference is in how they're organized. Blobs aren't really organized at all. Sure, you can
use slashes in their names, which makes it look like they have a folder structure, but they're
not actually stored that way.
Blob storage allows storing three types of blobs:
a) Page, b) Block, c) Append.
Page blobs: These are collections of 512-byte pages optimized for read and write
operations. They can be up to 8 TB in size and are efficient for frequent read/write
operations; Azure virtual machines use page blobs as OS and data disks.
Block blobs:
A single block blob can contain up to 50,000 blocks.
Each block can be a different size, up to a maximum of 100 MB.
Block blobs are used for storing text or binary files, such as documents and media files,
and can efficiently store large blobs up to 4.75 TB (100 MB × 50,000).
With a block blob, you can upload multiple blocks in parallel to decrease upload time.
The blocks within a blob can be updated or deleted if necessary.
Append blobs:
A single append blob can contain up to 50,000 blocks.
Each block can be a different size, up to a maximum of 4 MB.
The total size is slightly more than 195 GB (4 MB × 50,000).
Append blobs are a form of blob storage optimized for append operations, so they are
useful for logging scenarios.
Appends only write data blocks to the end of the blob.
Existing blocks cannot be deleted or updated.
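The size limits quoted for block and append blobs follow directly from the block counts and per-block sizes, as this quick check shows:

```python
# Maximum blob sizes implied by the limits above.
MB, GB, TB = 1024**2, 1024**3, 1024**4

block_blob_max = 50_000 * 100 * MB    # 50,000 blocks x 100 MB each
append_blob_max = 50_000 * 4 * MB     # 50,000 blocks x 4 MB each

print(round(block_blob_max / TB, 2))   # roughly the "4.75 TB" figure
print(round(append_blob_max / GB, 2))  # roughly the "195 GB" figure
```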
2. File storage, on the other hand, has the sort of hierarchical structure you'd expect in a
file system. In fact, it's SMB-compliant, so you can use it as a file share. That makes it easy
to move an on-premises file server to Azure.
Even better, you can make this file share globally accessible over the web if you want. To
do that, users need a shared access signature token which allows access to particular data
for a specific amount of time.
You might be tempted to use file storage instead of blob storage even when you don't need
an SMB-compliant file share. But bear in mind that it's significantly more expensive than
blob storage. If you just need a place to put files, whether they're documents or videos or
logs or anything else, then you should use blob storage, which is by far the cheapest of all
the storage types.
There are options for making blob storage even cheaper too.
You can choose from three storage tiers:
a) hot, b) cool, and c) archive.
Hot storage: is the tier you'll probably use the most often. It's intended for data that gets
accessed frequently.
Cool storage: If you have data that doesn't get accessed frequently, then you should
consider the cool storage tier. It's optimized for data that still needs to be retrieved
immediately when requested even though it doesn't get accessed very often.
An example would be a video file that people rarely watch.
The cool tier has a much lower storage cost but a much higher cost for reads and writes.
The data also needs to be in the cool tier for at least 30 days.
Archive: If you have data that will almost never be accessed and you can live with it taking
up to 15 hours to access when you do need it, then the archive tier is the way to save lots of
money. It's five times cheaper than the cool tier for storage costs but it's dramatically more
expensive for read operations. The data also needs to reside in the archive tier for at least
180 days. You can move data between the tiers any time you want, but if you do it before
the minimum duration for the cool and archive tiers, then you'll be charged an early
deletion fee. For example, if you put your data in the archive tier and then move it back to
the cool tier 90 days later, you'll be charged half of the early deletion fee since you moved
the data when there was still half of the 180 day minimum left to go.
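The prorated early-deletion charge described above can be expressed as a small function. The fee amount passed in is hypothetical; only the proration rule comes from the text:

```python
# Prorated early-deletion fee for the cool (30-day) and archive (180-day) tiers.
MIN_DAYS = {"cool": 30, "archive": 180}

def early_deletion_fee(tier, days_stored, full_fee):
    """Charge the fraction of the minimum duration that was left unserved."""
    minimum = MIN_DAYS[tier]
    if days_stored >= minimum:
        return 0.0  # minimum duration met: no early-deletion fee
    return full_fee * (minimum - days_stored) / minimum

# Moving data out of archive after 90 of 180 days costs half the fee:
print(early_deletion_fee("archive", 90, 10.0))  # 5.0
print(early_deletion_fee("cool", 45, 10.0))     # 0.0 (past the 30-day minimum)
```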
3. Queue storage:
is a very different option. It's intended for passing messages between applications. One
application pushes messages onto the queue, and another application asynchronously
receives those messages from the queue one at a time and processes them.
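The push/receive pattern Queue storage supports can be sketched with an in-memory queue. This only simulates the behavior; the real service is reached over HTTP or through the Azure SDK:

```python
from collections import deque

# In-memory stand-in for a storage queue: one app pushes, another receives.
queue = deque()

def producer(messages):
    for msg in messages:
        queue.append(msg)  # push onto the back of the queue

def consumer():
    processed = []
    while queue:
        # Receive one message at a time, in the order they were sent.
        processed.append(queue.popleft())
    return processed

producer(["order-1", "order-2", "order-3"])
print(consumer())
```

The decoupling is the point: the producer and consumer never talk to each other directly, so either side can be scaled or restarted independently.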
4. Table storage:
It's a NoSQL data store with storage costs that are about the same as file storage and with
way cheaper transaction costs. I think Microsoft realized what a good deal this is too
because they now have a premium version of table storage that's part of their CosmosDB
service.
5. Disk storage: This is what's used for the disks that are attached to virtual machines.
AZURE DATA SERVICES:
Azure Purview:
Organizations typically have so much data in so many different places
that it’s hard to find what you’re looking for. The purpose of Azure Purview is to act as an
index to all of those data sources, so you can discover them. Of course, in order for this to
work, your employees need to register their data sources in the catalog. The data itself
stays where it is, but its location
and the metadata about it gets added to the catalog. The metadata includes things like
column names and data types. Users can also add additional information about a data
source,
such as a description or some tags. Once various data sources are registered, people can
search the catalog to find what they're looking for using Purview Studio.
Another way to deal with pockets of data is Azure Data Lake Storage. It is built on top of
Azure Blob Storage, and it provides the additional capabilities needed for a modern data
lake. Its most important feature is that it's compatible with Hadoop and Spark, which are
the most popular open-source software systems for doing data analytics.
Azure Synapse Analytics (formerly known as SQL Data Warehouse) offers an interesting
mix of data warehouse and data lake capabilities. If you need a data warehouse, you can
create a SQL pool, which lets you run SQL queries on structured, relational tables. If you
want a data lake, then you can create a Spark pool, which lets you use Spark to query both
structured and unstructured data.
Spark has become so popular that Microsoft has many services that let you use Spark for
data analytics. In addition to Data Lake Storage and Synapse Analytics, you can also use
Azure Databricks and Azure HDInsight.
Azure Databricks: is a managed Spark implementation that was developed by the people
who created Apache Spark.
HDInsight: supports a wide variety of open-source big data frameworks, including Hadoop,
Spark, Hive, Storm, and many others.
One difference between Databricks and HDInsight
is ease of use. For example, to run a processing job with either service, you need to spin up
a cluster, but Azure Databricks can be configured to automatically spin up a cluster when a
job runs and shut it down after the job is finished. In contrast, HDInsight doesn’t have a
built-in way to spin up a cluster automatically.
So if you need to run HDInsight jobs quite often, you can leave a cluster running all the
time, which would be expensive, or you could spin clusters up and down as you need them,
which would be kind of a pain.
One way to make HDInsight work in a more automated fashion is to use yet another
service, Azure Data Factory.
Azure Data Factory: It lets you create workflows to automate data movement and data
transformation.
One of its many capabilities is spinning up and down HDInsight clusters as needed, but it
can do far more than that. With Data Factory, you can create data processing pipelines. For
example, a pipeline could copy data from SQL Server to Data Lake Storage, run a Spark job
on the data using an HDInsight cluster, and store the results in Synapse Analytics, all
without any human intervention.
It can even automate machine learning jobs. It’s such a useful tool that Microsoft even
includes a stripped-down version of it in Synapse Analytics.
Azure Analysis Services: This is a data analytics tool. It lets you create data models that
make sense of existing data. One of the problems with the multitude of data in
organizations is that it can be hard to understand how all of that data relates to the real
world. Using a data model is easier than working with the
raw data. Analysis Services also makes browsing large amounts of data faster because it
uses in-memory caching. However, end users don’t browse directly through Analysis
Services. Instead, they use one of the supported client tools, such as Power BI, Tableau, or
Excel.
Full load:
If there is a source table, I extract the data completely from the source and load it into
the destination.
Incremental:
I extract only the data that has changed or been updated.
INTRODUCTION TO AZURE DATA FACTORY
Azure Data Factory v1
Azure Data Factory went into public preview on October 28th, 2014, and became generally available on
August 6th, 2015. Back then, it was a fairly limited tool for processing time-sliced data. It did that part
really well, but it couldn’t even begin to compete with the mature and feature-rich SQL Server Integration
Services (SSIS). In the early days of Azure Data Factory, you developed solutions in Visual Studio, and
even though they made improvements to the diagram view, there was a lot of JSON editing involved. It
was a very different world just a few years ago.
But then! Something happened at Microsoft Ignite 2017.
What are the core components of Azure Data Factory (ADF)
In this section, we will look at the core concepts and components of the Azure Data
Factory toolkit.
1. Pipelines are the logical grouping of activities that perform one unit of work. An ADF
instance can have one or more active pipelines, and activities can be scheduled in
sequence (chaining) or in parallel (independent) for execution, as desired.
2. Activities represent a single processing step in a pipeline. Three types of activities
are currently supported — data copy, data transformation, and activity
orchestration.
3. Datasets represent a data structure that provides a selected view into a data store,
ideally for use in defining (and binding) inputs and outputs to a given activity.
4. Linked Services represent connection strings that can be used by an activity to
establish a connection to an external service, typically pointing to a data source
(ingest) or compute resource (transformation) required for execution.
5. Mapping Data Flows create and manage data transformation graphs that can be
applied to data of any size and be used to build up a reusable library of data
transformation routines.
6. Integration Runtimes are compute infrastructure used by ADF to provide fully
managed data flows, data movement, activity dispatch, and SSIS package execution
tasks in data pipelines.
In this context, a few additions to our terminology:
● Pipeline Run is an instance of pipeline execution. Pipelines are activated by passing
arguments to the parameters defined by pipeline activities. Activation can be
triggered or done manually.
● Trigger is a unit of processing whose outcome determines when to activate a
pipeline run.
● Parameters are read-only key/value pairs that are populated from the runtime
execution context of the pipeline. Dataset and linked service are strongly typed,
reusable parameter entities that define the structure (of data) and connection
information (of source) for activities.
● Variables are used inside pipelines to store temporary values e.g., for use with
parameters to pass values or context between activities, data flows, and pipelines.
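To make these components concrete, here is a hypothetical pipeline definition in the JSON-like shape ADF uses. All names are illustrative, and the fields only mirror the style of ADF's JSON, not its full schema:

```python
# A hypothetical ADF pipeline sketch: one Copy activity wired to two
# datasets (which would each bind to a linked service). Field names
# imitate ADF's JSON style; this is an illustration, not the real schema.
pipeline = {
    "name": "CopySalesData",
    "parameters": {"runDate": {"type": "String"}},  # read-only at run time
    "activities": [
        {
            "name": "CopyFromBlobToSql",
            "type": "Copy",  # a data-movement activity
            "inputs": [{"referenceName": "BlobSalesDataset"}],
            "outputs": [{"referenceName": "SqlSalesDataset"}],
        }
    ],
}

# A trigger decides when a pipeline run is activated.
trigger = {"name": "Nightly", "recurrence": {"frequency": "Day", "interval": 1}}

print(pipeline["activities"][0]["type"])
```

Reading it against the definitions above: the pipeline groups activities, the activity references datasets, the datasets would point at linked services, and the trigger decides when a pipeline run happens.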
Azure Synapse Analytics
When databases first became popular, they were primarily used for transaction processing
(and that’s still the case today). But managers also needed to analyze data and create
reports,
which is difficult to do when the data resides in numerous databases across an
organization.
So data warehouses were created to collect data from a wide variety of sources, and they
were
designed specifically for reporting. The language used to query a data warehouse is
normally SQL.
More recently, there has been a growing need to analyze unstructured data, such as
documents
and images, as well. Data lakes were created to collect all kinds of data in one place,
and they were designed for big data analytics.
The most common way to process data in a data lake is to use Apache Spark, which is an
open-source analytics engine for big data.
Microsoft has offered a variety of separate solutions to meet these two different needs over
the years, but now it has come out with a new analytics service that works with both
structured and unstructured data. It’s called Azure Synapse Analytics.
If you need a data warehouse, you can create a dedicated SQL pool, which lets you run SQL
queries on structured, relational tables.
If you want a data lake, then you can create a Spark pool, which lets you use Spark to query
both structured and unstructured data.
Not only does Synapse Analytics provide data warehouses and data lakes, but it also
provides
a sophisticated tool for getting data from other Azure services and transforming it.
Azure Synapse
Pipelines is a stripped-down version of another Azure service called Data Factory that lets
you create data processing pipelines. For example, a pipeline could copy data from Azure
Cosmos DB to a Spark pool, run a Spark job that creates statistics about the data, and store
the results in a SQL pool, all without any human intervention.
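The three chained activities in that example can be sketched as plain Python functions. These stand-ins are purely illustrative; in Synapse each step would be a real activity (a Copy activity, a Spark job, and a write to a SQL pool), not local functions.

```python
def copy_from_cosmos():
    # Stand-in for a Copy activity reading documents from Cosmos DB.
    return [3, 1, 4, 1, 5, 9]

def compute_statistics(rows):
    # Stand-in for a Spark job that summarizes the copied data.
    return {"count": len(rows), "min": min(rows), "max": max(rows)}

def store_in_sql_pool(stats):
    # Stand-in for writing the results to a dedicated SQL pool table.
    print(f"stored: {stats}")

# Each activity feeds the next, with no human intervention in between.
store_in_sql_pool(compute_statistics(copy_from_cosmos()))
```

The point is the shape of the pipeline: each activity's output is the next activity's input, and the whole chain runs end to end on its own.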
There are a couple of different approaches you could take to building a pipeline:
one using code and the other without using code.
Using code
First, you’d create a Synapse workspace, which is a secure area where you can do your
analytics
work. Within that workspace, you’d create a linked service that would connect to your data
in Cosmos
DB. Then you’d create a Spark pool and a SQL pool.
Next, you’d create a notebook containing code that copies the data from Cosmos DB to the
Spark
cluster. Then you’d add analytics code to the notebook to create statistics about the data.
Finally, you’d add code to store the results in the SQL pool. If you only needed to run this
code once, then you’d just run it, but if you needed to run this code on a regular basis, then
you could create a pipeline and schedule it to run the notebook every night at 11 PM.
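The nightly schedule boils down to "find the next 11 PM and run then." Here is a minimal sketch of that rule in plain Python, assuming a simple daily recurrence rather than Synapse's real trigger engine.

```python
from datetime import datetime, timedelta

def next_run(now: datetime) -> datetime:
    """Return the next 11 PM occurrence strictly after `now`."""
    candidate = now.replace(hour=23, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # already past 11 PM today, so run tomorrow
    return candidate

print(next_run(datetime(2024, 5, 1, 9, 30)))   # 2024-05-01 23:00:00
print(next_run(datetime(2024, 5, 1, 23, 30)))  # 2024-05-02 23:00:00
```

In Synapse you would express the same thing declaratively as a schedule trigger on the pipeline instead of computing timestamps yourself.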
No-code method
You’d still create a linked service for Cosmos DB and a SQL pool in your workspace. Then,
instead of creating a notebook with code in it, you’d create a pipeline with a data flow in it.
A data flow is a graphical representation of the activities you want to perform.
A simple data flow might include activities for loading data from Cosmos DB,
transforming it, and storing the results in a SQL pool. You'd need to configure each of these activities,
but you wouldn’t need to write any code. You also wouldn’t need to create a Spark pool to
run these
activities because the Synapse Pipelines service would take care of that for you. After
getting everything ready, you’d schedule the pipeline to run the data flow at specific times.
Before we go on, I should tell you a bit more about Spark and SQL pools, because if
you're not careful with them, they can cost you a lot of money.
Both of these types of pools are clusters of virtual machines.
In the example shown here, the code in the notebook would run on the Spark pool cluster
and
would store the results on the SQL pool cluster. If you’ve ever accidentally left a cluster of
virtual machines running, you’ll know that it can get very expensive very quickly.
Microsoft provides some mechanisms to help you avoid that. First, when you create a Spark
pool,
if you enable Autoscale, you specify the minimum and maximum number of VMs in the
cluster.
Then the pool automatically scales up and down within that range based on the
requirements of
the workloads you run on it.
You can also enable Auto-pause, which will stop the cluster if it has been inactive for a
given period of time.
By default, it auto-pauses after 15 minutes.
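The Autoscale and Auto-pause behaviors described above can be sketched as two small functions. This is an illustrative simplification of the Synapse control plane, not how it is actually implemented; the 15-minute default does match the service's default auto-pause delay.

```python
def autoscale(desired_nodes: int, min_nodes: int, max_nodes: int) -> int:
    """Clamp the cluster size the workload asks for to the configured range."""
    return max(min_nodes, min(desired_nodes, max_nodes))

def should_pause(idle_minutes: float, auto_pause_delay: int = 15) -> bool:
    """Pause the pool once it has been inactive for the configured delay."""
    return idle_minutes >= auto_pause_delay

print(autoscale(50, 3, 10))   # 10: demand exceeds the maximum, so it's capped
print(should_pause(20))       # True: idle longer than the 15-minute default
```

The key cost-control idea is that the pool can never grow beyond the ceiling you set, and an idle pool stops billing for compute once it pauses.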
SQL pools have similar features, but the underlying architecture is quite different.
First, it’s important to know the difference between dedicated and serverless SQL pools.
Dedicated SQL pool is the new name for Azure SQL Data Warehouse, which is a service
that’s been around for quite a few years. It provides both a compute cluster and storage.
When you create a dedicated SQL pool, you specify how many DWUs (or Data
Warehousing
Units) to allocate. DWUs set the amount of CPU, memory, and I/O in the compute cluster.
You can
only increase or decrease the number of DWUs manually because there's no autoscaling
feature.
Storage space is provided by Azure Storage, so it scales independently from the compute
cluster.
If you don’t need to run a SQL pool all the time, you can manually pause it when it’s not in
use.
When it’s paused, you won’t pay for the compute cluster, but you’ll still pay for the storage
being used by the data warehouse.
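The billing consequence of pausing can be made concrete with a back-of-the-envelope sketch. The rates below are made up for illustration, not real Azure prices; the point is only that compute cost stops while paused but storage cost does not.

```python
def monthly_cost(dwu: int, hours_running: float, stored_tb: float,
                 dwu_hour_rate: float = 0.012,   # illustrative rate, not a real price
                 tb_month_rate: float = 23.0) -> float:
    """Compute is billed only while the pool runs; storage is billed regardless."""
    compute = dwu * dwu_hour_rate * hours_running
    storage = stored_tb * tb_month_rate
    return round(compute + storage, 2)

# Running 24x7 vs. paused all month: the storage charge remains either way.
print(monthly_cost(100, hours_running=730, stored_tb=2))  # 922.0 (compute + storage)
print(monthly_cost(100, hours_running=0, stored_tb=2))    # 46.0 (storage only)
```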
Serverless SQL pools are actually very different from dedicated SQL pools. The only thing
they have in common is that they both let you run SQL queries. Serverless SQL pools don’t
have their own storage, and they don’t have access to data in dedicated SQL pools either.
So what can you use them for?
Well, they can query data in other places.
For example, if you have files in Azure Storage that are in CSV, Parquet, or JSON format,
you can query them using a serverless SQL pool. You can also query Azure Open Datasets,
which are public datasets containing information on topics like weather and genetics.
Surprisingly, if you’ve used a Spark pool to
create an external table in Azure Storage, then you can use a serverless SQL pool to query
it, even if the Spark pool has been shut down.
Serverless SQL pools are different from the other types of pools in a couple of ways. First,
you don’t have to pay for the compute resources in them. You only have to pay for the
amount of data processed by your queries.
Second, you don’t even have to create them because when you create an Azure Synapse
Workspace, it’ll automatically create a serverless SQL pool as well.
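The serverless billing model, where you pay per amount of data processed rather than for compute time, can be sketched with a one-line cost function. The per-terabyte rate here is illustrative only.

```python
def query_cost(tb_processed: float, rate_per_tb: float = 5.0) -> float:
    """Serverless model sketch: cost scales with bytes scanned (illustrative rate)."""
    return round(tb_processed * rate_per_tb, 2)

# A query that scans 0.4 TB of CSV files in Azure Storage:
print(query_cost(0.4))  # 2.0
```

Because the charge scales with bytes scanned, columnar formats like Parquet, which let a query read only the columns it needs, typically cost less to query than CSV files holding the same data.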