Categories
日常应用

Hadoop Arvo Schema 和 HIVE 笔记

昨天捣鼓了一天这个东西,随便写点笔记。

  • arvo:除了著名的hdfs文件,hadoop上常用的另一种序列化存储的文件格式就是arvo。简单的讲,这货就是由一个定义好的schema来读取的二进制文本文件。
  • arvo schema:很像json...比如这里这个:
{
 "type" : "record",
 "name" : "Tweet",
 "namespace" : "com.miguno.avro",
 "fields" : [ {
 "name" : "username",
 "type" : "string",
 "doc" : "Name of the user account on Twitter.com"
 }, {
 "name" : "tweet",
 "type" : "string",
 "doc" : "The content of the user's Twitter message"
 }, {
 "name" : "timestamp",
 "type" : "long",
 "doc" : "Unix epoch time in seconds"
 } ],
 "doc:" : "A basic schema for storing Twitter messages"
}
  • 定义好schema之后可以用java去build...
  • arvo to HIVE:可以直接建HIVE external table. (还是上面那个link)
CREATE EXTERNAL TABLE tweets
 COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
 STORED AS
 INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
 LOCATION '/user/YOURUSER/examples/input/'
 TBLPROPERTIES (
 'avro.schema.url'='hdfs:///user/YOURUSER/examples/schema/twitter.avsc'
 );

然后就是正常的玩法了。