Q: How to Convert a Column of a DataFrame to a List in Apache Spark?

I would like to convert a string column of a DataFrame to a list. All I could find in the DataFrame API is rdd, so I tried converting the column back to an RDD first and then applying the toArray function to it. In this case, the length and the SQL queries work just fine. However, every element in the result I got from the RDD is wrapped in square brackets, like [A00001]. I was wondering if there is an appropriate way to convert a column to a list, or a way to remove the square brackets.

Any suggestions would be appreciated. Thank you!
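
For context, the behaviour described can be reproduced with a minimal sketch like the following (the column name "id" and the sample data are hypothetical, chosen for illustration); collecting the RDD without a mapping yields Row objects, and it is Row's toString that adds the square brackets:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical single-column DataFrame for illustration
val df = Seq("A00001", "A00002").toDF("id")

// Collecting without a mapping yields Row objects; printing a Row
// wraps its values in square brackets, e.g. [A00001]
df.select("id").rdd.collect().foreach(println)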

Answer 1:

This should return the collection containing a single list:

dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()

Without the mapping, you just get a Row object, which contains every column from the database.

Keep in mind that this will probably get you a list of type Any. If you want to specify the result type, use .asInstanceOf[YOUR_TYPE] in the mapping, i.e. r => r(0).asInstanceOf[YOUR_TYPE].

P.S. Due to automatic conversion, you can skip the .rdd part.
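
Putting the pieces of this answer together, here is a minimal end-to-end sketch (the sample data and the column name "id" are made up for illustration). Note that collect() returns an Array, so call .toList on it if you specifically need a List:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data for illustration
val df = Seq(("A00001", 1), ("A00002", 2)).toDF("id", "value")

// Without a cast, the element type is Any
val untyped: Array[Any] = df.select("id").rdd.map(r => r(0)).collect()

// With asInstanceOf, the result is properly typed
val ids: List[String] =
  df.select("id").rdd.map(r => r(0).asInstanceOf[String]).collect().toList

println(ids) // List(A00001, A00002)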

Answer 2:

I know the question and the answer given assume Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.

i.e. given a DataFrame containing a column named "Raw".

To get each row value in "Raw" combined as a list, where each entry is a row value from "Raw", I simply use:

MyDataFrame.rdd.map(lambda x: x.Raw).collect()

Answer 3:

I don't have enough reputation to reply to the post above, but it throws an error the way it's written. I had to change it to the following to get it to work (basically, remove the call to Raw for each x):

MyDataFrame.rdd.map(lambda x: x).collect()

And it does seem to return a list of Row objects (using my example here):

>>> df.select('name').rdd.map(lambda r: r).collect()
[Row(name=u'Yin'), Row(name=u'Michael')]

Tags: scala, apache-spark, apache-spark-sql, spark-dataframe