
Q: RDD is returning only the first column's value: HBase, PySpark


We are reading an HBase table with PySpark using the following commands.

from pyspark.sql.types import *
host=<Host Name>
port=<Port Number>

keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

cmdata_conf = {"hbase.zookeeper.property.clientPort": port,
               "hbase.zookeeper.quorum": host,
               "hbase.mapreduce.inputtable": "CMData",
               "hbase.mapreduce.scan.columns": "info:Tenure info:Age"}

cmdata_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv, valueConverter=valueConv, conf=cmdata_conf)

output = cmdata_rdd.collect()

output

I am getting the result below (Key and Age only).

[(u'123', u'5'), (u'234', u'4'), (u'345', u'3'), (u'456', u'4'), (u'567', u'7'), (u'678', u'7'), (u'789', u'8')]

Instead, I am expecting Key, Tenure, and Age. If I request only the Tenure column, it returns Key and Tenure. But if I add more columns, the result always contains only the Key and Age columns. A workaround sketch we are considering, assuming the converter hands back only one cell per row: scan each column as its own RDD and join them on the row key. The scan_column helper below is hypothetical and reuses the host, port, keyConv, and valueConv defined above.
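# Hypothetical workaround: scan one column per RDD, then join on the row key.
# Reuses host, port, keyConv, and valueConv from the snippet above.
def scan_column(column):
    conf = {"hbase.zookeeper.property.clientPort": port,
            "hbase.zookeeper.quorum": host,
            "hbase.mapreduce.inputtable": "CMData",
            "hbase.mapreduce.scan.columns": column}
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv, valueConverter=valueConv, conf=conf)

tenure_rdd = scan_column("info:Tenure")  # (key, tenure) pairs
age_rdd = scan_column("info:Age")        # (key, age) pairs
joined = tenure_rdd.join(age_rdd)        # (key, (tenure, age))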

Can anyone help us solve this?

Note: We are new to these tools.

Thank you in advance.


answer1:

If you're prototyping and don't want to update your cluster, it can be useful to have a look at happybase (https://happybase.readthedocs.org/en/latest/).

The following code does the trick to get my small (9 GB) HBase table 'name_Hbase_Table' from my cluster in under a second.

import happybase

connection = happybase.Connection(host='your.ip.cluster')  # don't specify :port
table = connection.table('name_Hbase_Table')

def hbaseAccelerationParser(table):  # helper to format the scanned data
    finalTable = []
    for key, data in table.scan():  # don't need the key in my case
        line = []
        for values in data.itervalues():  # Python 2; use data.values() on Python 3
            line.append(values)
        finalTable.append(line)
    return finalTable

rows = hbaseAccelerationParser(table)  # capture data in the desired format
rdd = sc.parallelize(rows, 4)          # put it in an RDD
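As a quick usage check, here is a sketch of inspecting the parsed rows; the column names Tenure and Age and the SQLContext setup are assumptions based on the question, not part of the original answer.

# Hypothetical usage: peek at a few rows and build a DataFrame from the RDD.
from pyspark.sql import SQLContext

print(rdd.take(3))  # inspect the first parsed rows

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(rdd, ["Tenure", "Age"])  # assumed column names
df.show()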

Tags: python, hadoop, hbase, bigdata, pyspark