
Q: Apache Spark: Master removed our application: Failed when using saveAsTextFile on large RDD


I'm loading a 20 GB file in Spark Standalone mode on a machine with 4 GB of RAM and 2 cores, doing some processing, and then trying to save the result (for testing purposes) to a text file using saveAsTextFile.

If I manually extract only a few thousand lines from the original input file and run the code on that, it works like a charm, resulting in the expected part-xxxxx files.

However, if I provide the whole 20 GB file as input, it starts off fine, then hangs somewhere along the way; when left to run overnight, it has failed by morning with the following message:

Py4JJavaError: An error occurred while calling o219.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Does anyone have an idea why this might be?
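The exact code is not shown in the question, so the following is only a hypothetical sketch of the kind of pipeline described; the app name, the paths, and the processing step are placeholders, not the asker's actual code:

# Hypothetical reconstruction of the pipeline described above;
# all names, paths, and the map() step are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="save-large-rdd")       # placeholder app name

lines = sc.textFile("/data/input-20gb.txt")       # the ~20 GB input file (placeholder path)
processed = lines.map(lambda line: line.strip())  # stand-in for "some processing"
processed.saveAsTextFile("/data/output")          # produces the part-xxxxx files mentioned above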


Answer 1:

There might be a few issues here, but the most common ones are:

  • Too many open files killed the process.
  • The serializer you're using (by default, the Java serializer) might be creating GC overhead - try switching to the Kryo serializer (see the Tuning page in Spark's documentation), as sketched after this list.
  • Finally, you might be running out of disk space.
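A minimal sketch of the Kryo suggestion above, assuming the configuration is set when the SparkContext is created (the app name is a placeholder; note that in PySpark, Python data is still pickled, so this setting mainly affects JVM-side serialization such as shuffles):

# Enable the Kryo serializer suggested in the answer; "save-large-rdd" is a placeholder app name.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("save-large-rdd")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)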

Now, without knowing what your computation is or seeing the logs from the actual server, it's difficult to tell what happened.

Regards,

Olivier.


apache-spark  pyspark