
Q: How to read Azure Table Storage data from Apache Spark running on HDInsight

Is there any way of doing that from a Spark application running on Azure HDInsight? We are using Scala.

Azure Blobs are supported (through WASB). I don't understand why Azure Tables aren't.

Thanks in advance

answer1:

Currently, Azure Tables are not supported. Only Azure Blobs implement the HDFS interface that Hadoop and Spark require.
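For comparison, reading blob data through WASB is straightforward from Spark on HDInsight. A minimal sketch (the container and account names below are hypothetical placeholders):

// Read a text file from Azure Blob Storage via the WASB scheme.
// "mycontainer" and "myaccount" are placeholder names for illustration.
val lines = sparkContext.textFile(
  "wasb://mycontainer@myaccount.blob.core.windows.net/path/to/file.txt")
lines.take(10).foreach(println)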

answer2:

You can actually read from Table Storage in Spark; here's a project by a Microsoft engineer that does just that:

https://github.com/mooso/azure-tables-hadoop

You probably won't need all the Hive stuff, just the classes at root level:

  • AzureTableConfiguration.java
  • AzureTableInputFormat.java
  • AzureTableInputSplit.java
  • AzureTablePartitioner.java
  • AzureTableRecordReader.java
  • BaseAzureTablePartitioner.java
  • DefaultTablePartitioner.java
  • PartitionInputSplit.java
  • WritableEntity.java

You can read with something like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text

// AzureTableInputFormat and WritableEntity come from the
// azure-tables-hadoop project linked above.
val tableRdd = sparkContext.newAPIHadoopRDD(
  getTableConfig(tableName, account, key),
  classOf[AzureTableInputFormat],
  classOf[Text],
  classOf[WritableEntity])

// Builds the Hadoop configuration that points the input format at the table.
def getTableConfig(tableName: String, account: String, key: String): Configuration = {
  val configuration = new Configuration()
  configuration.set("azure.table.name", tableName)
  configuration.set("azure.table.account.uri", account)
  configuration.set("azure.table.storage.key", key)
  configuration
}

You will have to write a decoding function to transform each WritableEntity into the class you want.
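For example, a minimal decoding sketch, assuming WritableEntity follows the Azure Storage SDK's ITableEntity shape (getPartitionKey, getRowKey, and getProperties returning EntityProperty values; check the project source for the exact API), mapped to a hypothetical Person case class:

// Hypothetical target type for illustration.
case class Person(partitionKey: String, rowKey: String, name: String)

// Assumes ITableEntity-style accessors on WritableEntity; verify against
// the azure-tables-hadoop source before relying on these names.
def toPerson(entity: WritableEntity): Person =
  Person(
    entity.getPartitionKey,
    entity.getRowKey,
    entity.getProperties.get("Name").getValueAsString)

val people = tableRdd.map { case (_, entity) => toPerson(entity) }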

It worked for me!

azure  apache-spark  windows-azure-storage  hdinsight