Q: Apache Spark, spark-submit: what is the behavior of the --total-executor-cores option?

I am running a Spark cluster over C++ code wrapped in Python. I am currently testing different configurations of multi-threading options (at the Python level or at the Spark level).

I am using Spark with the standalone binaries, over an HDFS 2.5.4 cluster. The cluster is currently made up of 10 slaves, with 4 cores each.

From what I can see, by default, Spark launches 4 slaves per node (I have 4 Python processes working on a slave node at a time).

How can I limit this number? I can see that there is a --total-executor-cores option for spark-submit, but there is little documentation on how it impacts the distribution of executors over the cluster!

I will run tests to get a clearer idea, but if someone knowledgeable has a clue about what this option does, it would help.

Update:

I went through the Spark documentation again; here is what I understand:

  • By default, I have one executor per worker node (here 10 worker nodes, hence 10 executors).
  • However, each worker can run several tasks in parallel. In standalone mode, the default behavior is to use all available cores, which explains why I can observe 4 Python processes.
  • To limit the number of cores used per worker, and hence the number of parallel tasks, I have at least 3 options (see the sketch after this list):
    • use --total-executor-cores with spark-submit (least satisfactory, since there is no clue on how the pool of cores is dealt with)
    • use SPARK_WORKER_CORES in the configuration file
    • use the -c option with the start scripts
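
A minimal sketch of how these options could be applied on a standalone cluster; the master URL, the application file and the value of 1 core are placeholders for illustration, and the exact worker-script arguments may vary between Spark versions:

# Option 1: cap the total number of cores granted to the application at submission time
$ ./bin/spark-submit --master spark://master-host:7077 \
--total-executor-cores 10 \
my_job.py

# Option 2: in conf/spark-env.sh on every slave, cap the cores each worker offers
export SPARK_WORKER_CORES=1

# Option 3: pass the core limit directly when starting a worker with the start scripts
$ ./sbin/start-slave.sh spark://master-host:7077 -c 1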

The following lines of the documentation at http://spark.apache.org/docs/latest/spark-standalone.html helped me figure out what is going on:

SPARK_WORKER_INSTANCES
Number of worker instances to run on each machine (default: 1). You can make this more than 1 if you have very large machines and would like multiple Spark worker processes. If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.
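
For instance, a sketch of what the quoted setting could look like in conf/spark-env.sh, splitting each 4-core machine into two 2-core workers (the values are illustrative only, not a recommendation):

# conf/spark-env.sh on each slave machine
export SPARK_WORKER_INSTANCES=2   # two worker processes per machine
export SPARK_WORKER_CORES=2       # each worker offers 2 cores instead of grabbing all 4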

What is still unclear to me is why, in my case, it is better to limit the number of parallel tasks per worker node to 1 and rely on my legacy C++ code's multithreading. I will update this post with experiment results when I finish my study.

Answer 1:

The documentation does not seem clear.

In my experience, the most common practice for allocating resources is to indicate the number of executors and the number of cores per executor, for example (taken from here):

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 10 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 4 \
--queue thequeue \
lib/spark-examples*.jar \
10

However, this approach is limited to YARN and is not applicable to standalone- and Mesos-based Spark, according to this.

Instead, the parameter --total-executor-cores can be used, which represents the total number of cores, across all executors, assigned to the Spark job. In your case, with a total of 40 cores, setting --total-executor-cores 40 would make use of all the available resources.
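
For a standalone master, a sketch of such a submission could look like the following (the master URL and memory setting are placeholders, not values from the question):

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://master-host:7077 \
--total-executor-cores 40 \
--executor-memory 2g \
lib/spark-examples*.jar \
10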

Unfortunately, I am not aware of how Spark distributes the workload when fewer cores than the total available are requested. When working with two or more simultaneous jobs, however, this should be transparent to the user, in that Spark (or whatever resource manager is in use) would allocate the resources according to the user settings.

Answer 2:

To check how many workers are started on each slave, open a web browser at http://master-ip:8080 and look at the Workers section to see exactly how many workers have been started, and which worker runs on which slave. (I mention this because I am not sure what you mean by '4 slaves per node'.)

By default, Spark starts exactly 1 worker on each slave unless you specify SPARK_WORKER_INSTANCES=n in conf/spark-env.sh, where n is the number of worker instances you would like to start on each slave.

When you submit a Spark job through spark-submit, Spark starts an application driver and several executors for your job.

  • If not specified explicitly, Spark starts one executor for each worker, i.e. the total number of executors equals the total number of workers, and all cores are available to this job.
  • The --total-executor-cores you specify limits the total number of cores available to this application (see the sketch after this list).
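
As far as I know, --total-executor-cores corresponds to the spark.cores.max property, so the same cap can also be set through the configuration; a sketch, where 10 is an illustrative value and the master URL and application file are placeholders:

# conf/spark-defaults.conf
spark.cores.max    10

# or directly on the command line
$ ./bin/spark-submit --master spark://master-host:7077 --conf spark.cores.max=10 your_app.py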

Tags: multithreading, hadoop, apache-spark, pyspark, cpu-cores