
Q: Spark Dataframe parallel read

When using PySpark you can set the number of partitions in the sc.textFile method so that you can read a file from S3 more quickly, as explained here. This works well, but as of Spark 1.3 we can also start using DataFrames.

Is something like this also possible for Spark DataFrames? I am trying to load them from S3 into a Spark cluster (which was created via spark-ec2). Basically, I am trying to get this bit of code to run quickly for very large 'data.json' files:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
sqlContext = SQLContext(sc)
# Read the JSON file from S3 into a DataFrame and cache it in memory
df = sqlContext.jsonFile('s3n://bucket/data.json').cache()
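
For reference, here is a minimal sketch of the sc.textFile approach the question refers to; the path and the partition count of 30 are placeholders, not taken from the original post:

from pyspark import SparkContext

sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
# The second argument is minPartitions: it controls how many partitions (and
# therefore how many parallel tasks) Spark uses to read the file from S3
rdd = sc.textFile('s3n://bucket/data.txt', 30)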

Answer 1:

There's actually a TODO note related to this here, and I created the corresponding issue here, so you can up-vote it if that's something you'd need.

Regards,

Olivier.

Answer 2:

While waiting for the issue to get fixed, I've found a workaround that works for now. The .json file contains a dictionary on each row, so what I could do is first read it in as a text-file RDD and then convert it into a DataFrame by specifying the columns manually:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
sqlContext = SQLContext(sc)
# textFile lives on the SparkContext, not the SQLContext; the second argument
# requests 30 partitions so the S3 read happens in parallel
data = sc.textFile('s3n://bucket/data.json', 30).cache()
# Parse each line into a dict, then map it onto a Row with explicit columns
df_rdd = data\
    .map(lambda x: dict(eval(x)))\
    .map(lambda x: Row(x1=x['x1'], x2=x['x2'], x3=x['x3'], x4=x['x4']))
# Infer the schema from the Rows and cache the resulting DataFrame
df = sqlContext.inferSchema(df_rdd).cache()

This is as per the docs. It also means that you could use a .csv file instead of a JSON file (which usually saves a lot of disk space), as long as you manually specify the column names in Spark.
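
A minimal sketch of that .csv variant, assuming comma-separated values with no header row and the same hypothetical columns x1..x4:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
sqlContext = SQLContext(sc)
# Read the CSV as plain text in parallel, split each line on commas, and
# name the columns manually when building the Rows
lines = sc.textFile('s3n://bucket/data.csv', 30)
rows = lines\
    .map(lambda line: line.split(','))\
    .map(lambda v: Row(x1=v[0], x2=v[1], x3=v[2], x4=v[3]))
df = sqlContext.inferSchema(rows).cache()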

amazon-s3  apache-spark