scala - Spark : Why execution is carried by a master node but not worker nodes?

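I have a Spark cluster made up of one master node and two worker nodes.

When I run the following code to pull data from a database, the work is actually carried out by the master node rather than by one of the workers.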

    sparkSession.read
      .format("jdbc")
      .option("url", jdbcURL)
      .option("user", user)
      .option("query", query)
      .option("driver", driverClass)
      .option("fetchsize", fetchsize)
      .option("numPartitions", numPartitions)
      .option("queryTimeout", queryTimeout)
      .options(options)
      .load()

Is this the expected behavior?

Is there any way to prevent it?

Best answer

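A Spark application has two kinds of runners, the driver and the executors, and two kinds of operations, transformations and actions. According to this doc: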

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

...

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
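The transformation/action split quoted above can be sketched in a few lines of Scala. This is a minimal illustration rather than anything from the original question: the object name `LazyEvaluationSketch`, the helper `sumOfDoubled`, and the `local[2]` master are assumptions chosen so the snippet runs on a single machine.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {

  // Hypothetical helper: builds a small RDD, applies one transformation and one action.
  def sumOfDoubled(spark: SparkSession): Int = {
    val numbers = spark.sparkContext.parallelize(1 to 10, numSlices = 2)

    // Transformation: only recorded in the RDD lineage, no task runs yet.
    val doubled = numbers.map(_ * 2)

    // Action: tasks run on the executors, and only the final sum
    // is sent back to the driver.
    doubled.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for illustration; on a real cluster the
    // master is normally supplied by spark-submit.
    val spark = SparkSession.builder()
      .appName("lazy-evaluation-sketch")
      .master("local[2]")
      .getOrCreate()

    try println(s"total = ${sumOfDoubled(spark)}") // prints "total = 110"
    finally spark.stop()
  }
}
```

Everything outside the closures passed to `map` and `reduce`, including the `println`, runs in the driver process.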

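So in a Spark application, some operations run in the executors and some run in the driver. On Dataproc, the executors always run in YARN containers on the worker nodes. The driver, however, can run on either the master node or a worker node. The default is "client mode", which means the driver runs on the master node, outside YARN. You can instead enable "cluster mode" with gcloud dataproc jobs submit spark ... --properties spark.submit.deployMode=cluster, which runs the driver in a YARN container on a worker node. See this doc for more details.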
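One way to check where the driver and the executors actually run is to print host names from both sides. The following is a rough sketch under stated assumptions: the object name `WhereDoesItRun` and the helper `executorHosts` are hypothetical, the partition count of 8 is arbitrary, and the job is assumed to be submitted through spark-submit or gcloud so that the master is already configured.

```scala
import java.net.InetAddress
import org.apache.spark.sql.SparkSession

object WhereDoesItRun {

  // Hypothetical helper: runs a trivial job and returns the host names of
  // the machines whose executors processed at least one task.
  def executorHosts(spark: SparkSession): Set[String] =
    spark.sparkContext
      .parallelize(1 to 100, numSlices = 8)
      .map(_ => InetAddress.getLocalHost.getHostName) // evaluated inside executor tasks
      .distinct()
      .collect()
      .toSet

  def main(args: Array[String]): Unit = {
    // No master is set here: the sketch assumes the submit tool provides it.
    val spark = SparkSession.builder().appName("where-does-it-run").getOrCreate()

    try {
      // Falls back to "client" when the property is not set.
      val deployMode = spark.sparkContext.getConf.get("spark.submit.deployMode", "client")
      // Evaluated in the driver process.
      val driverHost = InetAddress.getLocalHost.getHostName

      println(s"deploy mode    : $deployMode")
      println(s"driver host    : $driverHost")
      println(s"executor hosts : ${executorHosts(spark).mkString(", ")}")
    } finally spark.stop()
  }
}
```

In client mode the driver host would be expected to be the node the job was submitted from, while in cluster mode it should be one of the worker nodes. With only a few tasks, not every worker is guaranteed to appear in the executor list.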

Source: a similar question on Stack Overflow, "scala - Spark : Why execution is carried by a master node but not worker nodes?": https://stackoverflow.com/questions/68345018/
