Q: Spark (pyspark) having difficulty calling statistics methods on worker node

I am hitting a library error when running pyspark (from an IPython notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs.

On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:

keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key:Statistics.chiSqTest(value).pValue for key,value in keys_to_bucketed.iteritems()}

but if I do the same thing directly on the RDD, I hit issues:

keys_to_chi = vectors.mapValues(lambda vector: Statistics.chiSqTest(vector))
keys_to_chi.collectAsMap()

This results in the following exception:

Traceback (most recent call last):
  File "<ipython-input-80-c2f7ee546f93>", line 3, in chi_sq
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 238, in chiSqTest
    jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/common.py", line 127, in callMLlibFunc
    api = getattr(sc._jvm.PythonMLLibAPI(), name)
AttributeError: 'NoneType' object has no attribute '_jvm'

I had an issue early on in my Spark install with it not seeing numpy, because Mac OS X has two Python installs (one from brew and one from the OS), but I thought I had resolved that. What's odd here is that this is one of the Python libs that ships with the Spark install (my previous issue had been with numpy).

  1. Install Details
    • Mac OS X Yosemite
    • Spark spark-1.4.0-bin-hadoop2.6
    • Python is specified via spark-env.sh as:
    • PYSPARK_PYTHON=/usr/bin/python
    • PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH:$EA_HOME/omnicat/src/main/python:$SPARK_HOME/python/
    • alias ipython-spark-notebook="IPYTHON_OPTS=\"notebook\" pyspark"
    • PYSPARK_SUBMIT_ARGS='--num-executors 2 --executor-memory 4g --executor-cores 2'
    • declare -x PYSPARK_DRIVER_PYTHON="ipython"

answer1:

As you've noticed in your comment, sc on the worker nodes is None. The SparkContext is only defined on the driver node, which is why callMLlibFunc fails with 'NoneType' object has no attribute '_jvm' inside the mapValues closure: JVM-backed MLlib calls like Statistics.chiSqTest cannot be made from a worker.
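One way to work around this (a sketch, not part of the original answer) is to compute the p-value with SciPy instead. scipy.stats.chisquare runs entirely in Python/NumPy, so it is safe to call inside a worker closure, and like Statistics.chiSqTest with a single vector it performs a goodness-of-fit test against a uniform expected distribution:

```python
from scipy.stats import chisquare

def chi_p_value(vector):
    # Pure Python/NumPy chi-squared goodness-of-fit p-value (uniform
    # expected counts by default). Never touches the SparkContext, so
    # it can run inside a worker closure.
    return chisquare(vector).pvalue

# A perfectly uniform observation vector fits the uniform expectation
# exactly: statistic 0, p-value 1.0.
p = chi_p_value([10, 10, 10, 10])

# On the RDD this would replace the failing mapValues call:
# keys_to_chi = vectors.mapValues(chi_p_value).collectAsMap()
```

The original driver-side approach (collectAsMap then Statistics.chiSqTest in a loop) also works, but pulls all vectors to the driver; the SciPy variant keeps the computation distributed.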

python  osx  apache-spark  pyspark