
Key Features of Apache Spark

Apache Spark is a fast, general-purpose cluster computing system. It uses in-memory computing to improve performance over traditional disk-based approaches like Hadoop MapReduce. Spark core provides functionality for distributed task dispatching, scheduling, and memory management. Additional Spark components support SQL queries, streaming data, machine learning, and graph processing. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, allowing data to be partitioned across a cluster and cached in memory for faster iterative algorithms.

Uploaded by

Sailesh Chauhan

Apache Spark

Introduction
Apache Spark is a lightning-fast cluster computing technology designed for fast
computation. It builds on the Hadoop MapReduce model and extends it to efficiently
support more types of computation, including interactive queries and stream
processing. The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications,
iterative algorithms, interactive queries, and streaming. By supporting all of these
workloads in a single system, it reduces the management burden of maintaining
separate tools.

Features of Apache Spark:

1. Speed − Spark runs applications on a Hadoop cluster up to 100 times faster in
memory and up to 10 times faster on disk. It achieves this by reducing the
number of read/write operations to disk and storing intermediate processing
data in memory.
2. Supports multiple languages − Spark provides built-in APIs in Java, Scala, and
Python, so you can write applications in the language you prefer. Spark also
offers over 80 high-level operators for interactive querying.
3. Advanced analytics − Spark supports not only 'map' and 'reduce' but also SQL
queries, streaming data, machine learning (ML), and graph algorithms.

Spark deployment:

1. Standalone − In a standalone deployment, Spark sits on top of HDFS (Hadoop
Distributed File System), with space allocated for HDFS explicitly. Here, Spark
and MapReduce run side by side to cover all Spark jobs on the cluster.
2. Hadoop YARN − In a YARN deployment, Spark simply runs on YARN without any
pre-installation or root access required. This makes it easy to integrate Spark
into the Hadoop ecosystem or Hadoop stack, and allows other components to
run on top of the stack.
3. Spark in MapReduce (SIMR) − SIMR is used to launch Spark jobs in addition to a
standalone deployment. With SIMR, users can start Spark and use its shell
without any administrative access.

Components of Spark:

Apache Spark Core:

Spark Core is the underlying general execution engine of the Spark platform that all
other functionality is built upon. It provides in-memory computing and the ability to
reference datasets in external storage systems.

Spark SQL:

Spark SQL is a component on top of Spark Core that introduces a data abstraction
called SchemaRDD (now known as DataFrame), which provides support for structured
and semi-structured data.

Spark Streaming:

Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.

MLlib (Machine Learning Library):

MLlib is a distributed machine learning framework on top of Spark, made possible by
Spark's distributed memory-based architecture. According to benchmarks done by the
MLlib developers against the Alternating Least Squares (ALS) implementations, Spark
MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout
(before Mahout gained a Spark interface).

GraphX:

GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computations that can model user-defined graphs using the
Pregel abstraction API. It also provides an optimized runtime for this abstraction.

RDD
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is
an immutable distributed collection of objects. Each dataset in an RDD is divided into
logical partitions, which may be computed on different nodes of the cluster. RDDs can
contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. RDD
is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs − by parallelizing an existing collection in your
driver program, or by referencing a dataset in an external storage system, such as a
shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark uses the concept of RDDs to achieve faster and more efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they
are not so efficient.

Why RDD?

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most
Hadoop applications spend more than 90% of their time doing HDFS read-write
operations.

Recognizing this problem, researchers developed a specialized framework called
Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which
supports in-memory processing. This means Spark stores the state of memory as an
object across jobs, and that object is shareable between those jobs. Data sharing in
memory is 10 to 100 times faster than network and disk.

Iterative Operations of Spark RDD:

Iterative operations on Spark RDDs store intermediate results in distributed memory
instead of stable storage (disk), which makes the system faster.
Features of RDD:

1. In-memory computation: Improves performance by an order of magnitude.
2. Lazy evaluation: All transformations on RDDs are lazy, i.e., they do not compute
their results right away.
3. Fault tolerance: RDDs track data lineage information to rebuild lost data
automatically.
4. Immutability: Data can be created or retrieved at any time, but once defined, an
RDD's value cannot be changed.
5. Partitioning: The partition is the fundamental unit of parallelism in an RDD.
6. Persistence: Users can reuse RDDs and choose a storage strategy for them.
7. Coarse-grained operations: Operations such as map, filter, and group-by are
applied to all elements in a dataset.

Installation of PySpark:
Prerequisite:

1. Java version 1.8 should be installed on your system.
2. The Anaconda distribution should be installed on the system.

From [Link]

Download the required version of Spark.

Extract it to a folder and paste it directly into the C drive: C:\Spark

From [Link] download the [Link] file according to the version you have installed.

Move the downloaded [Link] file to the \bin folder of the Spark distribution, i.e.,

● C:\Spark\spark-3.1.1-bin-hadoop2.7\bin

Set the environment variables:

Go to Settings and select System.

Search for "env" in the search bar in the left corner.

Select "Edit the system environment variables"; in the System Properties pop-up
window, on the Advanced tab, click "Environment Variables".

Variable                     Value
SPARK_HOME                   C:\Spark\spark-3.1.1-bin-hadoop2.7
PYSPARK_DRIVER_PYTHON        jupyter
PYSPARK_DRIVER_PYTHON_OPTS   notebook
HADOOP_HOME                  C:\hadoop\bin
JAVA_HOME                    C:\Java\jdk1.8.0_291

Enter the above variables and their values under the user variables for your Windows
account (shown here for the user vssan). Make sure the respective paths specified in
the values are also present in the system variable Path.

To check whether Spark is installed, run the following command in the Windows
command prompt:

● spark-shell --version

To run a Jupyter notebook, open the Anaconda command prompt and run the
command:

● jupyter notebook

In the notebook, run the following code to check whether PySpark is running
successfully:

