Azure Data Engineering Course Overview

This course contains 12 modules covering topics in data engineering on the Microsoft Azure platform. The modules include exploring compute and storage options, running interactive queries with Synapse SQL pool, data exploration and transformation in Databricks, loading data into the data warehouse with Synapse and Databricks, data movement with Data Factory and Synapse pipelines, implementing end-to-end security, hybrid transactional/analytical processing, and real-time stream processing with Stream Analytics and Databricks. The modules contain theory, demos, and hands-on labs.

Uploaded by raghu.learn.007

COURSE CONTENT

There are 12 modules in this course:

1. Module 0: Introduction to the course
2. Module 1: Explore Compute and Storage Options for Data Engineering Workloads
   (Introduction to Data Engineering and the Microsoft Azure Platform)
3. Module 2: Run Interactive Queries with Azure Synapse Analytics Serverless SQL Pool
4. Module 3: Data Exploration and Transformation in Azure Databricks
   (Use Azure Databricks to perform batch processing of data)
5. Module 4: Explore, Transform and Load Data into the Data Warehouse using Azure Synapse Analytics Apache Spark
   (Azure Synapse Analytics: Apache Spark Pools)
6. Module 5: Ingest and Load Data into the Data Warehouse
   (Azure Synapse Analytics: Dedicated SQL Pool)
7. Module 6: Transform Data with Azure Data Factory or Azure Synapse Pipelines
   (Creating code-free pipelines to transform the data)
8. Module 7: Orchestrate Data Movement using Azure Data Factory or Azure Synapse Pipelines
   (Data movement using code-free pipelines)
9. Module 8: Implementing End-to-End Security of Data
   (How to secure the data while performing Data Engineering)
10. Module 9: Hybrid Transactional Analytical Processing (HTAP)
11. Module 10: Real-Time Stream Processing with Azure Stream Analytics
    (Processing streaming data using Azure Stream Analytics)
12. Module 11: Create a Stream Processing Solution with Event Hubs and Azure Databricks
    (Processing streaming data using Azure Databricks and Event Hubs)

The modules are covered in groups (Module 0, 1, 3; Modules 11, 10, 4; Modules 5, 2; Modules 6 & 7, 8, 9), and each module combines Theory + Demos + LAB.


Module 1: Explore Compute and Storage Options for Data Engineering
Workloads
1. What is Data Engineering?
2. Data Engineering is a process in which data is first extracted from different data sources, then processed in the required way, and finally provided to other users so that they can perform their specific tasks on it.

Flow: Data Sources (text, CSV, databases, etc.) → Ingestion / Extraction → Processing and Transforming the Data → Loading into a Data Store (a storage location) → Other users who want to use this data (Data Analysts, Data Scientists, DBAs, etc.)

3. Roles and Responsibilities of a Data Engineer:


a. Connect and Collect the data from different data sources
b. Process and Transform the data and make it ready for further usage
c. Load the data at proper storage locations
d. Security
e. Monitoring and Maintenance of the DE solution

4. What is Azure?
5. Azure is a cloud service platform provided by Microsoft.
6. What is Cloud?
7. The cloud is an environment that provides, on rent, all of the resources we need to set up a company's IT infrastructure: machines, processors/CPUs, memory, storage, networking, VPNs, security, servers, etc.
8. The cloud service providers set up very big data centres at multiple locations around the globe. The resources we need are created inside those data centres, and we are given access to them through the internet.

9. Since Data Engineering is essentially the ETL process performed at a very large scale, with different types of data coming in at high speed, on Azure we need tools which can handle this type of data to perform ETL on it.
10. Tools available on Azure to perform ETL are:
a. Extract Process: Connect and Collect the data from different data sources
Methods to perform extraction: Code, Query, Code-Free Pipeline
i. Azure Databricks: Code, Query
ii. Azure Data Factory: Code-Free Pipelines
iii. Azure Synapse Analytics: Code, Query, Code-Free Pipelines
b. Transformation Process: Process and Transform the data and make it ready for
further usage
Methods to perform Transformation: Code, Query, Code-Free Pipeline
i. Azure Databricks: Code, Query
ii. Azure Data Factory: Code-Free Pipelines
iii. Azure Synapse Analytics: Code, Query, Code-Free Pipelines
c. Loading Process: Store the data at proper storage locations
The type of resource to store the data depends upon the type of data itself
i. Structured Data: SQL Databases, SQL Data Warehouse, etc.
ii. Semi-Structured Data: Cosmos DB, Blob Storage Accounts, Azure Data Lake
Storage Gen2 (ADLS Gen2)
iii. Unstructured Data: Blob Storage Accounts, Azure Data Lake Storage Gen2
(ADLS Gen2)

11. Streaming Data: It is the real-time data which is processed as soon as it is generated by the
data source
12. Tools for processing streaming Data:
a. Azure Databricks
b. Azure Stream Analytics
c. Azure Synapse Analytics
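The extract → transform → load flow described above can be sketched as a tiny Python pipeline. This is an illustrative toy, not Azure code: the CSV source, the cleaning rule, and the destination list are all made up for the example.

```python
# Toy ETL pipeline: extract rows from a CSV-like source, transform them,
# and load them into a destination "data store" (here, just a list).

import csv
import io

RAW_CSV = """id,name,amount
1,alice,100
2,bob,
3,carol,250
"""

def extract(text):
    """Extract: read rows from the raw source (a CSV string here)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows and cast fields to proper types."""
    return [
        {"id": int(r["id"]), "name": r["name"].title(), "amount": int(r["amount"])}
        for r in rows
        if r["amount"]  # skip rows with a missing amount
    ]

def load(rows, store):
    """Load: append the cleaned rows into the destination store."""
    store.extend(rows)
    return store

warehouse = load(transform(extract(RAW_CSV)), [])
print(warehouse)
```

In a real Azure solution the same three steps would be carried out by Databricks, Data Factory, or Synapse, as listed above.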
Module 3: Data Exploration and Transformation using Azure Databricks
1. What is Databricks?
2. Azure Databricks is a compute resource
3. What is a compute resource?
4. It is a resource which provides us with the processing power on our data. In our PCs, the CPU
and the Memory are the compute resources.
5. Here also, in Databricks we are provided with CPUs and Memory to process the data.
6. Databricks is not a Microsoft product; rather, it is in direct competition with Microsoft.
7. A question arises here: if Databricks is in competition with Microsoft, then why has Microsoft made it available on Azure?
8. The reason for providing Databricks on the Azure environment is that it is a very popular data processing tool because of its high data processing speed.
9. There are 2 reasons behind the high data processing speeds of Databricks:
a. Distributed Processing
b. In-Memory Processing

Azure Databricks Workspace: it contains a Cluster (one Driver Node plus multiple Worker Nodes) and a Notebook. The cluster reads from an external storage resource (the source of data) and writes to another external storage resource (the sink of data). The Notebook is used to write code/queries in: 1. Scala, 2. Python, 3. SQL, 4. R

10. Cluster: A group of Nodes


11. Node: It is the basic processing unit in Databricks. CPU + Memory
12. The nodes are of two types:
a. Worker Node: These are the nodes which actually perform the processing of data in
the cluster. All the nodes other than the Driver Node are Worker nodes in the cluster
b. Driver Node: It is the node which controls the working of all the other nodes inside
the cluster. There is only one single driver node in the cluster
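The split-across-workers idea behind the cluster (a driver handing work to workers and combining their results) can be mimicked in plain Python with a thread pool. This is only a conceptual toy, not how Spark actually schedules tasks on a Databricks cluster.

```python
# Toy "driver and workers": the driver splits the data into partitions
# and hands each partition to a worker; results are combined at the end.

from concurrent.futures import ThreadPoolExecutor

def worker_sum(partition):
    """Each worker processes its own partition independently."""
    return sum(partition)

def driver(data, num_workers=4):
    """The driver splits the work, distributes it, and combines results."""
    partitions = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_sum, partitions))
    return sum(partial_sums)

total = driver(list(range(1, 101)))  # 1 + 2 + ... + 100
print(total)
```

The second speed factor, in-memory processing, is not shown here: it refers to keeping intermediate results in RAM between steps instead of writing them back to disk.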
13. Compute: each node contributes CPU + Memory to the cluster
14. Storage: the data itself is kept in an external storage resource
15. Now, we will talk about an external storage resource. This resource is called the Storage
Account
16. A storage account is a resource on the Azure portal, which provides us with different types
of tools to store the non-relational data in it
17. The Storage Accounts are of two types:
a. Blob Storage Account
b. Azure Data Lake Storage Gen2 (ADLS Gen2)

Blob Storage Account:
- CONTAINER (Blob Container): used for storing non-relational data files
- Hierarchical file system is not supported; all data is stored in the root directory
- MESSAGE QUEUES: storing messages
- FILE SHARES: used for mapping hard drives of our PC to the cloud
- AZURE TABLES: storing non-relational data in the key-value format

Azure Data Lake Storage Gen2 (ADLS Gen2):
- CONTAINER (Data Lake): used for storing non-relational data files
- Hierarchical file system is supported; data is stored in directories and sub-directories
- MESSAGE QUEUES: storing messages
- FILE SHARES: used for mapping hard drives of our PC to the cloud
- AZURE TABLES: storing non-relational data in the key-value format
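The practical difference between the flat Blob namespace and the hierarchical ADLS Gen2 namespace can be illustrated in plain Python: in a flat namespace, "folders" are just name prefixes, so renaming a "folder" means rewriting every blob name, whereas a hierarchical namespace can rename the directory as a single operation. This is an illustration only; the real services expose this behaviour through their own APIs.

```python
# Flat namespace: "directories" are only prefixes embedded in blob names.
flat_blobs = {
    "sales/2024/jan.csv": b"...",
    "sales/2024/feb.csv": b"...",
    "hr/staff.csv": b"...",
}

def rename_folder_flat(blobs, old, new):
    """In a flat namespace, renaming a folder touches every matching blob."""
    return {
        (new + name[len(old):] if name.startswith(old) else name): data
        for name, data in blobs.items()
    }

renamed = rename_folder_flat(flat_blobs, "sales/", "revenue/")
print(sorted(renamed))
```

With ADLS Gen2's real directories, this kind of rename (and per-directory access control) does not require touching each file individually.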
Module 11: Create a Stream Processing Solution with Event Hubs and
Azure Databricks
1. What is Streaming Data?
2. It is real-time data, which is processed as soon as it is created by the data source.

Without an Event Hub: Streaming Data Source → (send) → Databricks (receive / listen)

With an Event Hub: Streaming Data Source → (send) → Event Hub → (receive / listen) → Databricks

3. The Event Hubs in the Azure environment are used to control the flow of streaming data
4. On Azure the Event Hubs are not created directly, rather they are created inside an Event
Hub Namespace
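The buffering role that the Event Hub plays between a fast producer and a consumer can be mimicked with Python's standard queue. This is a stand-in only: real code would use the azure-eventhub SDK and an Event Hub created inside an Event Hub Namespace, neither of which is shown here.

```python
# Toy event buffer: the producer "sends" events into a queue (the Event Hub
# stand-in); the consumer "listens" and processes them at its own pace.

import queue

hub = queue.Queue()

def produce(events):
    """Source side: send events into the hub as they are generated."""
    for e in events:
        hub.put(e)

def consume():
    """Databricks-style consumer: receive/listen until the hub is empty."""
    received = []
    while not hub.empty():
        received.append(hub.get())
    return received

produce(["click:1", "click:2", "click:3"])
print(consume())
```

The point of the intermediary is that the source and the processor are decoupled: the source can keep sending even while the consumer is busy or temporarily offline.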
Module 10: Real-time stream processing with Azure Stream Analytics
1. In this module also we will be talking about processing Streaming Data, but this time we will be using another tool available on the Azure portal, called Azure Stream Analytics, instead of Databricks.
2. What is Azure Stream Analytics?

Azure Databricks:
- It is a tool used for processing both batch and stream data
- It is a coding-based environment where we have to write the code for every operation that is performed
- Being a coding-based environment, it is an open environment where we can do any activity by just writing the code for it

Azure Stream Analytics:
- This tool is used for processing streaming data specifically
- It is a GUI-based environment where we do not need to write the code for all the operations
- Being a GUI-based environment, the configuration options are limited

3. The process of using Azure Stream Analytics is the same as using Databricks for processing
Streaming Data

Streaming Data Source → (send) → Event Hub → (receive / listen) → Azure Stream Analytics → Storage Location
Module 4: Explore, Transform and Load Data into the Data Warehouse
using Azure Synapse Analytics Apache Spark
1. Azure Synapse Analytics is an environment on the Azure portal which provides all of the tools required to perform Data Engineering in one single environment
2. The Azure Synapse Analytics environment makes the integration of these resources with each other much easier for us

Azure Synapse Analytics (accessed through Synapse Studio) contains:
- Azure Data Lake Storage Gen2 (Container): storing non-relational data files
- Apache Spark Pool: it is the same as a Databricks cluster
- Serverless SQL Pool (SQL Server): it provides us with compute for the data
- Dedicated SQL Pool (SQL Server): it provides us with storage and compute both for structured data; same as a Data Warehouse
- Apache Spark Notebooks: write commands to process data (1. SQL, 2. Scala, 3. Python, 4. C#, 5. R)
- SQL Scripts: used for writing SQL queries
- Azure Synapse Pipelines: used to create code-free pipelines

3. Under Azure Synapse, we get:


a. A tool to store the non-relational data: ADLS Gen2
b. A tool to store the relational data: Dedicated SQL Pool
c. A tool to process the non-relational data: Apache Spark Pool
d. Tools to process the relational data: Dedicated SQL Pool & Apache Spark Pool
e. A tool to write code: Apache Spark Notebooks
f. A tool to write SQL Queries: SQL Scripts
g. A tool to create code-free pipelines: Azure Synapse Pipelines
Module 5: Ingest and Load Data into the Data Warehouse
1. The Dedicated SQL Pools are the same as Data Warehouses and they are used to store and
process the structured data.
2. First of all we ingest the data into the Dedicated SQL Pools and afterwards we can write
queries on it to process that data according to our requirements.
3. In this module we will be mainly talking about how to store the data into the Dedicated SQL
Pools
4. There are many different ways to store the data in the Dedicated SQL Pools:
a. We can set up a Databricks process to transform the data and then store that data
into the Dedicated SQL Pool
b. We can set up a similar process with the Apache Spark Pools as well
c. We can write queries directly on the Dedicated SQL Pools and extract the data from
the data sources
d. We can use the Polybase technique
e. We can set up the code-free pipelines to store the data into the Dedicated SQL Pools
5. We know that the Dedicated SQL Pool stores only structured data by default, but by using the Polybase technique we give the Dedicated SQL Pool the ability to directly connect to semi-structured data files, extract the data from them, and store it.
6. The Polybase technique stores the data in External Tables only

Module 2: Run Interactive Queries with Azure Synapse Analytics Serverless SQL Pool
1. The Serverless SQL Pool is also a data processing resource, but it is a shared resource.
2. Whereas the Dedicated SQL Pools and also the Apache Spark Pools are both dedicated
resources.

Apache Spark Pool:
- It is a dedicated resource
- It provides us with compute only for the data
- It can process all types of data
- It is a costly resource: 119 INR per hour
- It is a fast resource
- It supports code in 5 different languages, written in Spark Notebooks

Dedicated SQL Pool:
- It is a dedicated resource
- It provides us with storage and compute both for the data
- It works on structured data only by default
- It is also a costly resource: 109 INR per hour
- It is also a fast resource
- It supports SQL queries only, written in SQL Scripts

Serverless SQL Pool:
- It is a shared resource
- It provides us with compute only for the data
- It works on structured data only by default
- It is a cheaper resource: 360 INR per TB processed
- It is a slower resource
- It also supports SQL queries only, written in SQL Scripts

3. When do we use the Serverless SQL Pool?


4. The Serverless SQL Pool is used in 2 cases:
a. When we are creating a transformation process that will keep on running in the
background
b. When the data processing requirements are highly variable:
i. Day 1: 7 TB
ii. Day 2: 2 TB
iii. Day 3: No Data
iv. Day 4: 1.8 TB
v. Day 5: 4 TB
vi. Day 6: 650 GB
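The pricing figures above make the trade-off concrete. A quick back-of-the-envelope comparison for the variable workload listed, using the per-hour and per-TB rates quoted in the notes (which may not reflect current Azure pricing) and assuming the dedicated pool would have to stay on around the clock:

```python
# Rough cost comparison for the six-day variable workload above, using the
# rates quoted in the notes. Assumption: the dedicated pool runs 24 h/day,
# while the serverless pool bills only per TB actually processed.

DEDICATED_INR_PER_HOUR = 109
SERVERLESS_INR_PER_TB = 360

daily_tb = [7, 2, 0, 1.8, 4, 0.65]  # TB per day (650 GB taken as 0.65 TB)

dedicated_cost = DEDICATED_INR_PER_HOUR * 24 * len(daily_tb)
serverless_cost = SERVERLESS_INR_PER_TB * sum(daily_tb)

print(dedicated_cost, serverless_cost)
```

Under these assumptions the serverless pool comes out far cheaper for spiky, unpredictable loads, which is exactly the second use case above.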

Module 6: Transform Data with Azure Data Factory or Azure Synapse Pipelines
Module 7: Orchestrate Data Movement using Azure Data Factory or Azure Synapse Pipelines
1. Both of these tools, i.e., Azure Data Factory and Azure Synapse Pipelines are used to create
code-free pipelines on Azure
2. A code-free pipeline is basically a group of synchronised activities that are performed in an ordered way
3. Both of these tools are also exactly the same, there is no difference between them.
4. If both of the tools are the same, then why do we have two of them?
5. ADF is the cloud implementation of SSIS
6. Azure Synapse Pipelines are the implementation of ADF on the Synapse environment

Example pipeline: Data Source (HTTP Server) → Copy of Data into Temporary Storage (ADLS Gen2, used as a staging service) → Transform → Storage of Transformed Data (Dedicated SQL Pool)
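The copy → transform → load pattern in the example can be modelled as an ordered list of activities. This toy runner only illustrates the "group of synchronised activities performed in an ordered way" idea; the functions are made-up stand-ins, not the actual Data Factory runtime.

```python
# Toy pipeline runner: activities execute strictly in order, each one
# receiving the output of the previous one (copy -> transform -> load).

def copy_activity(_):
    """Stand-in for copying raw data from an HTTP source into staging."""
    return ["10", "20", "x", "30"]

def transform_activity(rows):
    """Stand-in for a data-flow transformation: keep valid numeric rows."""
    return [int(r) for r in rows if r.isdigit()]

def load_activity(rows):
    """Stand-in for loading the result into the Dedicated SQL Pool."""
    return {"loaded_rows": rows, "row_count": len(rows)}

pipeline = [copy_activity, transform_activity, load_activity]

result = None
for activity in pipeline:  # activities run one after another, in order
    result = activity(result)

print(result)
```

In ADF or Synapse Pipelines the same ordering is expressed visually by chaining activities on the canvas rather than in code.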
Module 8: Implementing End-to-end security of Data
1. In this module we will talk about how to secure the data while it is in our data engineering
solution.
2. Security of the data while it is in our organisational network is mainly the task of the
network engineer or the security expert. But being a data engineer we are also responsible
for the data that we are going to handle
3. That is why we only have to protect the data while it is residing in our data engineering
solution. This is the reason why as data engineers there are not a lot of things that we need
to do to secure the data.
4. The following tasks can be performed to secure the data by a DE:
a. Never over-expose the data to the internet
b. Never provide unrestricted access to storage resources to anyone
c. Instead of sharing keys and passwords, we should use Azure Key Vault
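The Key Vault point boils down to: code should look secrets up from a managed store at run time rather than embedding them. A minimal Python sketch of the pattern follows; the vault dict and secret name are hypothetical stand-ins, and real code would use the azure-keyvault-secrets client instead.

```python
# Anti-pattern (never do this): a password hard-coded in the script.
# BAD_CONN = "Server=myserver;Password=SuperSecret123"

# Pattern: the code asks a secret store for the value at run time.
# Here a plain dict stands in for Azure Key Vault.

FAKE_VAULT = {"sql-password": "SuperSecret123"}  # hypothetical secret name

def get_secret(vault, name):
    """Stand-in for fetching a named secret from a Key Vault."""
    if name not in vault:
        raise KeyError(f"secret {name!r} not found in vault")
    return vault[name]

conn_string = f"Server=myserver;Password={get_secret(FAKE_VAULT, 'sql-password')}"
print("connection string built without hard-coding the password")
```

The benefit is that the secret can be rotated or revoked in one place, and the code itself never needs to be changed or redeployed.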

Module 9: Hybrid Transactional Analytical Processing (HTAP)


1. Online Transactional Processing (OLTP): It is the process of storing the real-time
transactional data into the databases
2. Online Analytical Processing (OLAP): It is the process of generating reports and analytics
based on the aggregated data stored in the data warehouses

Flow: Transactional real-time data → Database (OLTP) → data is transformed and aggregated → Data Warehouse (clean, transformed, aggregated and historical data) → Reports and Analytics (OLAP)

3. HTAP or Hybrid Transactional Analytical Processing is a process using which we can perform
both OLTP and OLAP using one single resource only. The benefit of this is that we will get
real-time reports with the help of it.
4. The HTAP processing is quite new and not a lot of resources can perform it yet. On the Azure
Portal, two resources work together to provide us with HTAP capabilities. They are:
a. Cosmos DB
b. Azure Synapse Analytics
5. Here, we will be using Cosmos DB to store the real-time transactional data, i.e., for performing the OLTP process. Then we will connect it to Azure Synapse Analytics using a technology called Azure Synapse Link, in such a way that, in real time, the data appears to be stored in Synapse itself. The Data Analysts can then use the analytical capabilities of Synapse to generate reports on it
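The HTAP idea, one store serving both transactional writes and analytical reads with no export step in between, can be caricatured in a few lines of Python. A real HTAP setup involves Cosmos DB's analytical store synced via Azure Synapse Link; this toy deliberately models only the user-visible effect.

```python
# Toy HTAP store: the same data accepts OLTP-style single-row writes and
# answers OLAP-style aggregate queries immediately, with no separate ETL.

orders = []  # the single shared store

def record_order(customer, amount):
    """OLTP side: store each transaction as soon as it happens."""
    orders.append({"customer": customer, "amount": amount})

def revenue_report():
    """OLAP side: aggregate over the same, always-current data."""
    return sum(o["amount"] for o in orders)

record_order("alice", 100)
record_order("bob", 250)
print(revenue_report())  # the report reflects the transactions immediately
```

In the classic OLTP → ETL → OLAP pipeline above, the report would only reflect transactions after the next warehouse load; HTAP removes that lag.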
