Q: Naive Bayes in PySpark 1.3 gives no response

I am trying to run a Naive Bayes classifier on my data in PySpark 1.3.

Here is a sample of my data. I read it from a text file and convert it into LabeledPoint objects:

67,[0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,3.....60,66],[0.45,0.441666666667,0.475,0.0,0.717763157895,0.0,0.497300944669,0.476608187135,0.0,0.0,0.45183714002,0.616666666667,0.966666666667,0.0790064102564,-0.364093614847,0.0679487179487,0.256043956044,0.7,0.449583333333,0.231904697754,0.341666666667,0.06....,0.0]

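from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.util import MLUtils

# Load the LIBSVM-formatted text file as an RDD of LabeledPoint
data = MLUtils.loadLibSVMFile(sc, 'path to file')

# 70/30 train/test split
training, test = data.randomSplit([0.7, 0.3], seed=0)

# Train with smoothing parameter lambda = 1.0
model = NaiveBayes.train(training, 1.0)

predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))

accuracy = (
    1.0 * predictionAndLabel.filter(lambda (x, v): x == v).count() / test.count()
)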

PySpark seems to hang forever while computing the variable model. Has anyone else faced this issue before? Thanks.

answer1:

The Naive Bayes algorithm in Spark requires that no feature (i.e. x value) is negative. You can see in your LabeledPoints that -0.364093614847 is negative. This should be throwing an error. So, go back through your raw data and find a way to convert every negative value to a non-negative one. In the example below, my data is all between -1.0 and 1.0, so I just add 1.0 to every value: the shape of each distribution and its standard deviation stay the same, and only the mean shifts by 1.0.
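To see why negative features are a problem, here is a rough plain-Python sketch of the textbook multinomial Naive Bayes estimate with additive smoothing. It is not MLlib's actual code; the function name log_theta and the numbers are made up for illustration. The point is only that the estimate takes a logarithm of smoothed feature sums, and that logarithm is undefined once a sum drops below minus the smoothing value.

```python
import math

def log_theta(feature_sums, smoothing=1.0):
    """Smoothed log-probabilities for one class:
    log((s_j + smoothing) / (sum(s) + smoothing * n)).
    Hypothetical helper, not part of Spark."""
    n = len(feature_sums)
    total = sum(feature_sums) + smoothing * n
    return [math.log((s + smoothing) / total) for s in feature_sums]

# Non-negative per-class feature sums give valid log-probabilities.
ok = log_theta([3.0, 1.0, 0.0])

# A sum below -smoothing makes the ratio negative, so math.log fails.
try:
    log_theta([3.0, -2.5, 1.0])
    failed = False
except ValueError:
    failed = True
```

With non-negative inputs the exponentiated values sum to 1, as probabilities should; with the negative sum the computation cannot be completed at all.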

Your data looks like this:

[LabeledPoint(1.0,(4,[0,1,2,3],[-0.5,0.5,0.0,0.8]))],
[LabeledPoint(0.0,(4,[0,1,2,3],[0.1,0.5,0.5,-0.6]))],
[LabeledPoint(1.0,(4,[0,1,2,3],[0.9,0.1,-0.2,0.7]))]
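As a quick sanity check on what adding 1.0 does, here is a small plain-Python sketch using the feature values of the first row above. The helper names mean and std are just illustrative; nothing here touches Spark.

```python
import math

def mean(values):
    return sum(values) / float(len(values))

def std(values):
    # Population standard deviation
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

row = [-0.5, 0.5, 0.0, 0.8]        # feature values of the first row above
shifted = [v + 1.0 for v in row]   # add 1.0 to every feature

mean_shift = mean(shifted) - mean(row)   # the mean moves by the shift
std_change = std(shifted) - std(row)     # the spread does not change
```

After the shift every value is non-negative, the mean has moved by 1.0, and the standard deviation is unchanged up to floating-point rounding.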

The problem is that data structures in Spark are basically immutable. Therefore, you need to go back to the point where your data had not yet been converted into LabeledPoint objects (i.e. when it was still text). Here is some sample code that reads in a text file (with some missing values), adds one to each feature, and writes the result in LIBSVM format, which MLUtils.loadLibSVMFile can then load as LabeledPoint objects. Note that this is for a CSV file, but you can change the delimiter passed to split to handle TSV or another delimiter.

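# Step 1: split each line on commas and blank out missing ("nan") values.
# Step 2: keep the label (first field) and write every remaining field as
#         "index:value+1" with 1-based indices, skipping missing values.
sc.textFile("/your/directory/your-file/*") \
     .map(lambda line: ["" if v.strip() == "nan" else v.strip() for v in line.split(',')]) \
     .map(lambda x: x[0] + " " + " ".join([str(i+1) + ":" + str(float(v)+1) for i, v in enumerate(x[1:]) if v != ''])) \
     .saveAsTextFile("/your/directory/new-directory/no-neg")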

This assumes your original file has the form:

Label, X1, X2, X3, X4
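If it helps, the same per-line transformation can be written as a named function, which is easier to try out locally before handing it to map. This is only a sketch assuming the Label, X1, X2, X3, X4 layout above; the name to_libsvm and its parameters are hypothetical.

```python
def to_libsvm(line, shift=1.0, delimiter=","):
    """Turn 'label,x1,x2,...' into 'label 1:(x1+shift) 2:(x2+shift) ...'.

    Missing values (empty or 'nan') are skipped, which the sparse LIBSVM
    format allows. Feature indices are 1-based.
    """
    fields = [f.strip() for f in line.split(delimiter)]
    label, features = fields[0], fields[1:]
    pairs = [
        "%d:%s" % (i + 1, float(v) + shift)
        for i, v in enumerate(features)
        if v not in ("", "nan")
    ]
    return " ".join([label] + pairs)

# Local check on one line with a negative and a missing value
example = to_libsvm("1, -0.5, 0.5, nan, 0.8")

# With a SparkContext available, the function could be applied like this:
# sc.textFile("/your/directory/your-file/*").map(to_libsvm) \
#     .saveAsTextFile("/your/directory/new-directory/no-neg")
```

For the sample line, example is "1 1:0.5 2:1.5 4:1.8": the negative value has been shifted and the missing third feature is simply left out.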

apache-spark  pyspark  apache-spark-mllib  naivebayes