Q: How to get a term-document matrix from multiple documents with Spark?

I'm trying to generate a term-document matrix from multiple documents. I was able to run an LDA model on an already-built matrix; now I need the earlier step that builds the matrix. I've tried to implement a simple term-document matrix, but I'm stuck. Here is what I did:

//GETS ALL FILES FROM INPUT PATH
JavaPairRDD<String, String> doc_words = context.wholeTextFiles(input_path);

//SPLIT BY " "
JavaPairRDD<String, String> tokenized = doc_words.flatMapValues(Preprocessing_DocumentTermMatrix.WORDS_EXTRACTOR);

//SEE METHOD WORDS_MAPPER.
JavaRDD<Tuple2<Tuple2<String, String>, Integer>> rdd = tokenized.flatMap(WORDS_MAPPER);


//METHOD WORDS_MAPPER
public static final FlatMapFunction<Tuple2<String, String>, Tuple2<Tuple2<String, String>, Integer>> WORDS_MAPPER = new FlatMapFunction<Tuple2<String, String>, Tuple2<Tuple2<String, String>, Integer>>() {

    public Iterable<Tuple2<Tuple2<String, String>, Integer>> call(Tuple2<String, String> stringIntegerTuple2) throws Exception {
        return Arrays.asList(new Tuple2<Tuple2<String, String>, Integer>(new Tuple2<String,String>(stringIntegerTuple2._1(), stringIntegerTuple2._2()), 1)); 
    } 
};

So this function gives me a result like this:

((DOC_0, TERM0), 1)
((DOC_0, TERM0), 1)
((DOC_0, TERM1), 1)
((DOC_1, TERM0), 1)
((DOC_1, TERM2), 1)

I guess this is all right, but now I need to reduce it and produce output like this:

(DOC_0, (TERM0, 2), (TERM1, 1))
(DOC_1, (TERM0, 1), (TERM2, 1))

I've tried a lot of things and couldn't get it to work. Can someone help me?

Answer 1:

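Here is a solution. Sum the counts per (document, term) key with reduceByKey, re-key each pair by document with mapToPair, then collect the (term, count) pairs of each document with groupByKey: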

JavaPairRDD<String, Iterable<Tuple2<String, Integer>>> newrdd = JavaPairRDD.fromJavaRDD(rdd).reduceByKey((a, b) -> a + b)
                .mapToPair(t -> new Tuple2<>(t._1._1, new Tuple2<>(t._1._2, t._2))).groupByKey();
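
To sanity-check the expected result without a cluster, the same two steps can be mimicked on an in-memory list. This is only a rough plain-Java sketch, not Spark code: the class TermDocCounts and the method countTerms are hypothetical names made up for illustration, and the sample pairs are the ones from the question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Spark or the original post.
final class TermDocCounts {

    // Each input element is a {document, term} pair.
    static Map<String, Map<String, Integer>> countTerms(List<String[]> pairs) {
        // Step 1 (mirrors reduceByKey): sum a 1 for every (document, term) key.
        Map<List<String>, Integer> counts = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            counts.merge(Arrays.asList(pair[0], pair[1]), 1, Integer::sum);
        }

        // Step 2 (mirrors mapToPair + groupByKey): re-key by document and
        // gather that document's (term, count) entries. TreeMap keeps keys sorted.
        Map<String, Map<String, Integer>> byDocument = new TreeMap<>();
        for (Map.Entry<List<String>, Integer> entry : counts.entrySet()) {
            String document = entry.getKey().get(0);
            String term = entry.getKey().get(1);
            byDocument.computeIfAbsent(document, d -> new TreeMap<>())
                      .put(term, entry.getValue());
        }
        return byDocument;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[] {"DOC_0", "TERM0"},
            new String[] {"DOC_0", "TERM0"},
            new String[] {"DOC_0", "TERM1"},
            new String[] {"DOC_1", "TERM0"},
            new String[] {"DOC_1", "TERM2"});

        // Prints one line per document, e.g. DOC_0 -> {TERM0=2, TERM1=1}
        countTerms(pairs).forEach((doc, terms) -> System.out.println(doc + " -> " + terms));
    }
}
```

In the Spark version the order of terms inside each group is not guaranteed, so compare the counts per term rather than the order in which they appear.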

java  apache-spark  text-mining  apache-spark-mllib  term-document-matrix