Q: Action not making use of cached DataFrame

Hi, I have a cached DataFrame (I can see it in the Spark UI). If I run a count() action on it, it takes very little time, as expected. When I visualize the DAG, I can also see that it makes use of the cached DataFrame:

This code:

df_calls.count()

Gives this execution DAG:

(The little green circle marks a cached DataFrame. It means that the preceding steps do not have to be executed again, only the subsequent ones.)

Now the same for:

df_calls.map(row=>row.getString(10).toDouble).stats()

Gives this execution DAG:

There is no green circle in this graph, so it does not appear to use the cached DataFrame, even though the DataFrame is still in memory. It seems that the data is being loaded again.

What is happening here, and why?
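For anyone trying to reproduce this, one way to check whether the cached data is actually read, without relying on the DAG picture alone, is to print the storage level, the physical plan, and the RDD lineage. The following is only a minimal sketch under stated assumptions: it targets Spark 2.1 or later with a local SparkSession, uses a tiny made-up DataFrame in place of df_calls, and the object name CacheCheck and the helper columnAsDoubles are hypothetical. The plan node name mentioned in the comments is what Spark 2.x prints and may differ in other versions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheCheck {
  // Hypothetical helper: read a string column through the RDD API and parse it
  // as Double, mirroring df_calls.map(row => row.getString(10).toDouble).
  def columnAsDoubles(df: DataFrame, index: Int): RDD[Double] =
    df.rdd.map(row => row.getString(index).toDouble)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-check")
      .getOrCreate()
    import spark.implicits._

    // Stand-in data (an assumption); the real df_calls is not shown in the question.
    val df = Seq("1.5", "2.5", "4.0").toDF("value").cache()
    df.count() // an action is needed to materialize the cache

    // Reports the storage level the DataFrame is marked with.
    println(df.storageLevel)

    // Prints the physical plan; in Spark 2.x a cached DataFrame is expected
    // to show up here as an InMemoryTableScan node.
    df.explain()

    // Prints the lineage of the RDD that the second action actually runs on.
    val doubles = columnAsDoubles(df, 0)
    println(doubles.toDebugString)
    println(doubles.stats())

    spark.stop()
  }
}
```

Comparing the output of explain() with the lineage from toDebugString may show whether the second job scans the in-memory data or goes back to the original source, which the DAG visualization by itself does not make obvious.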

scala  apache-spark