This post recaps Spark's introduction and the word count program, then walks through SparkContext, stages, executors, and RDDs for effective Spark application development.

1. SparkContext

SparkContext is the entry point of any Spark application.

SparkContext provides various functions, such as getting the current status of the Spark application, setting the configuration, cancelling a job, cancelling a stage, and more.


SparkContext vs SparkSession

| Feature | SparkContext (Spark 1.x) | SparkSession (Spark 2.0+) |
| --- | --- | --- |
| Introduction | Introduced in Spark 1.x as the entry point to Spark applications. | Introduced in Spark 2.0 as the unified entry point to Spark features; holds a reference to the underlying SparkContext. |
| Usage | Used mainly for RDD and low-level API operations. | Main interface for working with Spark SQL, DataFrame, and Dataset. |
| Advantages | Direct access to Spark's core functionality. | Simplified API for DataFrame and Dataset operations; unified API access. |
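
As a rough sketch of how the two relate in practice (the app name is just a placeholder), a SparkSession exposes the underlying SparkContext through `spark.sparkContext`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ContextDemo").getOrCreate()
sc = spark.sparkContext            # the underlying SparkContext

print(sc.version)                  # current Spark version
print(sc.applicationId)            # ID of the running application
sc.setLogLevel("WARN")             # adjust the log level for this application
# sc.cancelAllJobs()               # jobs (and their stages) can also be cancelled from here
```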

2. Core Concepts

The core building blocks of a Spark application are summarized below.
| Core Aspect | Description |
| --- | --- |
| Core Functions | SparkContext, storage system, execution engine (DAGScheduler), deployment modes. |
| Application | A user program built on Spark, consisting of a driver program and executors on the cluster. |
| Job | The set of transformations and actions on data; split based on actions. |
| Stage | Jobs are divided into stages at shuffle boundaries; split based on wide dependencies. |
| Task | The smallest unit of work sent to an executor; types include ShuffleMapTask and ResultTask. |
| Driver Application | Acts as the client program, converting the user program into RDDs and a DAG and communicating with the Cluster Manager. |
Programming Model: load data, apply transformations and actions, then process the results.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimplifiedWordCount").getOrCreate()

# Read the text file, keep rows containing "New York", split lines into words, and count them
word_count = (spark.read.text("/path/to/word_count.text")
              .filter("value LIKE '%New York%'")
              .selectExpr("explode(split(value, ' ')) as word")
              .groupBy("word")
              .count()
              .orderBy("count", ascending=False))

# Display the result
word_count.show()
```
| No. | Action | Description |
| --- | --- | --- |
| 1 | Obtain the programming entry point | SparkContext |
| 2 | Load data through the entry point to get a data abstraction | RDD |
| 3 | Process the loaded data abstraction with different operators | Transformation + Action |
| 4 | Process the result data as RDD/Scala objects or collections | print or save |
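
For comparison, a minimal sketch of the same word count written with the RDD API, annotated with the four steps above (the input path is a placeholder):

```python
sc = spark.sparkContext                                    # 1. obtain the programming entry point
lines = sc.textFile("/path/to/word_count.text")            # 2. load data into an RDD
word_count = (lines.flatMap(lambda line: line.split(" "))  # 3. transformations ...
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
for word, count in word_count.collect():                   # 3. ... plus an action
    print(word, count)                                     # 4. process / print the result
```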

3. RDD (Resilient Distributed Dataset)

Apache Spark evaluates RDDs lazily: a transformation is not computed when it is declared but only when its result is needed, which saves time and improves efficiency. RDDs are first materialized when an action uses them, so Spark can pipeline the transformations. The programmer can also call persist to indicate which RDDs should be kept around for future operations.
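
A minimal sketch of lazy evaluation and persist, assuming the `spark` session from earlier (the path is a placeholder):

```python
from pyspark import StorageLevel

lines = spark.sparkContext.textFile("/path/to/word_count.text")  # nothing is read yet
words = lines.flatMap(lambda line: line.split(" "))              # still lazy: only the lineage is recorded

words.persist(StorageLevel.MEMORY_ONLY)   # ask Spark to keep this RDD once it has been computed

print(words.count())              # first action: the file is read and the pipeline runs
print(words.distinct().count())   # second action: reuses the persisted RDD instead of re-reading the file
```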

3.1 Concept

An RDD is an immutable, partitioned collection of records that allows distributed data processing across a cluster. For the word count example, the lineage looks like this:

data source -> LineRDD -> WordRDD -> WordAndOneRDD -> WordCountRDD -> destination
Each RDD is characterized by five main properties (inspected in the sketch after this list):
  • A list of partitions
  • A function for computing each partition
  • A list of dependencies on other RDDs (wide & narrow dependencies)
  • Optionally, a Partitioner for key-value RDDs (e.g., to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g., block locations for an HDFS file)
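
A small sketch for inspecting some of these properties on a toy pair RDD (the data and partition count are arbitrary):

```python
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)], 4)
counts = pairs.reduceByKey(lambda a, b: a + b)

print(pairs.getNumPartitions())   # the list of partitions (here: 4)
print(counts.partitioner)         # key-value RDDs get a Partitioner after the shuffle
print(counts.toDebugString())     # the lineage, i.e. dependencies on parent RDDs
```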

3.2 Attributes

Figure: rdd-of-string vs rdd-pair

3.3 Create RDD

  1. Parallelizing an existing collection:

```python
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
```

  2. Referencing an external dataset:

```python
rdd = spark.sparkContext.textFile("path/to/textfile")
```

3.4 Operations

  1. Transformation: RDD -> RDD, evaluated lazily.
  2. Action: returns the final result of the RDD computation.
  3. Controller: caching and persistence (cache/persist); see the sketch below.
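
A short sketch labelling each operation type (the numbers are arbitrary):

```python
nums = spark.sparkContext.parallelize(range(10))

squares = nums.map(lambda x: x * x)          # transformation: RDD -> RDD, evaluated lazily
squares.cache()                              # controller: mark the RDD for caching
total = squares.reduce(lambda a, b: a + b)   # action: triggers execution and returns a value

print(total)
```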

4. Stage & DAG

4.1 WordCount FlowChart

Figure: WordCount process breakdown

4.2 Stage & DAG

Figure: WordCount split into two stages in the DAG (directed acyclic graph)
| Concept | Description |
| --- | --- |
| Application | A complete program built on Spark, consisting of a set of jobs. |
| Job | A sequence of computations triggered by an action operation; jobs are split from front to back based on actions, and each action submits one job. A job is decomposed into multiple stages, divided at the RDD's wide (shuffle) dependencies, and completes by executing those stages in sequence; each stage encapsulates a group of tasks that can run in parallel. |
| Stage | Stages are split from back to front: whenever a shuffle-type (wide-dependency) operator is met, the lineage is cut and a new stage is formed; narrow dependencies keep the RDD inside the current stage. A wide dependency is introduced when an operation needs to redistribute data, such as grouping by key, so the boundary of a stage is where the data shuffle occurs. |
| Relationship | A job contains multiple stages. Tasks within a stage can run in parallel, but stages execute in order; the next stage starts only after all tasks of the current stage have completed. |
| Execution Process | During execution, each stage is decomposed by the TaskScheduler into multiple tasks, which are executed in parallel by executors. Stages run sequentially, while the tasks inside a stage run in parallel; a job completes when all of its stages have finished in order. |
| Task | The smallest unit of work in Spark, executed on the cluster. Each RDD can specify a different number of partitions; by default, each partition becomes one task. |
| Task Types | Split into ShuffleMapTask and ResultTask. ShuffleMapTask prepares the data for the shuffle before the next stage; ResultTask runs in the final stage and produces the result for each partition. In short, the last stage of the DAG generates one ResultTask per result partition, so the number of tasks in a stage is determined by the number of partitions of the last RDD in that stage. |
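
A minimal sketch of a two-stage job, assuming the `spark` session from earlier; the shuffle introduced by `reduceByKey` is the wide dependency that forms the stage boundary:

```python
lines = spark.sparkContext.parallelize(["a b", "b c", "a c"], 2)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))   # wide dependency -> stage boundary

print(counts.toDebugString())   # the lineage shows the ShuffledRDD where the split occurs
counts.collect()                # the action submits one job made of two stages
```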

4.3 DAGScheduler Workflow

A stage is a TaskSet.
| No. | DAGScheduler Workflow |
| --- | --- |
| 1 | spark-submit submits the application. |
| 2 | The DAGScheduler and TaskScheduler are initialized. |
| 3 | Upon receiving the application, the DAGScheduler first abstracts it into a DAG. |
| 4 | The DAGScheduler splits this DAG (one of its jobs) into stages. |
| 5 | Each stage is submitted to the TaskScheduler as a set of tasks. |
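
The resulting job-to-stage breakdown can be observed from the driver; a rough sketch using the status tracker (the IDs printed depend on what has already run in the session):

```python
tracker = spark.sparkContext.statusTracker()

# Run a small two-stage job so there is something to inspect.
spark.sparkContext.parallelize(range(100), 4) \
    .map(lambda x: (x % 3, x)) \
    .reduceByKey(lambda a, b: a + b) \
    .collect()

for job_id in tracker.getJobIdsForGroup():        # jobs in the default job group
    info = tracker.getJobInfo(job_id)
    print(job_id, info.stageIds, info.status)     # the stages DAGScheduler created for each job
```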

5. Executor

Executors are processes launched on worker nodes that run the application's tasks. Where the driver program runs depends on the submission mode:
| Mode | Where the driver runs |
| --- | --- |
| Client Mode | The driver program runs on the client node that submitted the application. |
| Cluster Mode | The driver program runs inside a worker node on the cluster. |
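
As a rough sketch (the resource values are placeholders), executor resources can be requested when the session is created; the deploy mode itself is usually chosen with `spark-submit --deploy-mode client|cluster` at submission time:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ExecutorDemo")
         .config("spark.executor.instances", "4")   # number of executors to request
         .config("spark.executor.cores", "2")       # cores per executor
         .config("spark.executor.memory", "2g")     # heap memory per executor
         .getOrCreate())
```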

Figure: a Spark application with a simple example.

Spark executors are responsible for executing tasks, and an application can run as many executors as its resources allow, so scaling the number of executors is a key lever for improving overall performance.

6. Application

Figure: Spark Exec
