Q: How to restore RDD of (key, value) pairs after it has been stored/read from a text file

I saved my RDD of (key, value) pairs to a text file using saveAsTextFile. After I read the text file back using the sc.textFile("filename.txt") command, I ended up with strings instead of (key, value) pairs. My keys used to be strings and the values were lists of floats. Here's an example:

(u'ALM_0', [98.0, 110.0, 104.0, 6.0, 208.0, -262.0, 136.0, -204.67395833333333, 45.362440283766297, -196487.0, 1.0, 4.0, 2.5, 1.1180339887498949, 10.0, -46.0, 261.0, -3.6343749999999999])  

How do I easily convert this string back to a (key, value) pair? Is there a Spark read command that will do this on read?

I am using the Python interface to Spark.

answer1:

ast.literal_eval should do the trick:

import ast

# Round-trip through a text file: each saved line is the repr() of a tuple,
# so ast.literal_eval can safely rebuild the original (key, value) pairs.
data1 = [(u'BAR_0', [1.0, 2.0, 3.0]), (u'FOO_1', [4.0, 5.0, 6.0])]
rdd = sc.parallelize(data1)
rdd.saveAsTextFile("foobar_text")

data2 = sc.textFile("foobar_text").map(ast.literal_eval).collect()
assert sorted(data1) == sorted(data2)

but generally speaking it is better to avoid situations like this in the first place and use, for example, a SequenceFile (saveAsPickleFile writes the RDD as a SequenceFile of pickled objects):

rdd.saveAsPickleFile("foobar_seq")
sc.pickleFile("foobar_seq")

answer2:

You're going to have to implement a parser for your input. The easiest thing to do is to map your output to a character-separated format with a tab or colon delimiter and use split(delimiter) in your map upon reading, basically like in the word count example.
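
A minimal sketch of that approach, assuming a tab between the key and the values and commas inside the value list (the delimiters and the "alm_tsv" path are illustrative choices, not part of the answer):

rdd = sc.parallelize([(u'ALM_0', [98.0, 110.0, 104.0])])

# Write each record as "key<TAB>v1,v2,..." instead of the default repr() string.
rdd.map(lambda kv: u"{0}\t{1}".format(kv[0], ",".join(str(x) for x in kv[1]))) \
   .saveAsTextFile("alm_tsv")

# On read, split on the same delimiters to rebuild the (key, list-of-floats) pairs.
def parse(line):
    key, values = line.split("\t", 1)
    return key, [float(x) for x in values.split(",")]

restored = sc.textFile("alm_tsv").map(parse)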

python  apache-spark  pyspark