Categories
Daily Applications

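Notes on Hadoop Avro Schemas and Hive

I spent all of yesterday fiddling with this, so here are some quick notes.

  • Avro: besides plain files sitting on HDFS, another serialized storage format commonly used on Hadoop is Avro. Simply put, it is a binary file that is read and written according to a predefined schema.
  • Avro schema: it looks a lot like JSON (in fact it is written in JSON). For example, this one: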
{
 "type" : "record",
 "name" : "Tweet",
 "namespace" : "com.miguno.avro",
 "fields" : [ {
 "name" : "username",
 "type" : "string",
 "doc" : "Name of the user account on Twitter.com"
 }, {
 "name" : "tweet",
 "type" : "string",
 "doc" : "The content of the user's Twitter message"
 }, {
 "name" : "timestamp",
 "type" : "long",
 "doc" : "Unix epoch time in seconds"
 } ],
 "doc" : "A basic schema for storing Twitter messages"
}
  • Once the schema is defined, you can build against it from Java...
  • Avro to Hive: you can create a Hive external table directly on top of the Avro files (same link as above).
CREATE EXTERNAL TABLE tweets
 COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
 STORED AS
 INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
 LOCATION '/user/YOURUSER/examples/input/'
 TBLPROPERTIES (
 'avro.schema.url'='hdfs:///user/YOURUSER/examples/schema/twitter.avsc'
 );

After that, you can query it like any other Hive table.


Install R on CentOS 6

Following this guide: http://blogs.helsinki.fi/bioinformatics-viikki/documentation/getting-started-with-r-programming/installingrlatest/#CentOS

Installing the latest R on CentOS:

Add the latest EPEL repository which you can find from here. Don’t forget to pick the 64-bit one if you are using a 64-bit OS. I have a CentOS release 5.8, 64 bit (check the Ubuntu installation section of this document if you don’t know your Linux distribution or whether it is 64 or 32 bit) and I used the following script to add the proper repository:

$ sudo rpm -Uvh http://www.nic.funet.fi/pub/mirrors/fedora.redhat.com/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

Then I got this error:

CentOS 6.3 Instance Giving "Cannot retrieve metalink for repository: epel" Error

The fix is described on this page: https://community.hpcloud.com/article/centos-63-instance-giving-cannot-retrieve-metalink-repository-epel-error

Walkthrough Steps

Running this command will update the repo to use HTTP rather than HTTPS:

sudo sed -i "s/mirrorlist=https/mirrorlist=http/" /etc/yum.repos.d/epel.repo

You should then be able to update with this command:

yum -y update

After that I was able to install R:

$ sudo yum install R

Installed:
  R.x86_64 0:3.1.2-1.el6                                                        

Dependency Installed:
  R-core.x86_64 0:3.1.2-1.el6                                                   
  R-core-devel.x86_64 0:3.1.2-1.el6                                             
  R-devel.x86_64 0:3.1.2-1.el6                                                  
  R-java.x86_64 0:3.1.2-1.el6                                                   
  R-java-devel.x86_64 0:3.1.2-1.el6                                             
  blas.x86_64 0:3.2.1-4.el6                                                     
  blas-devel.x86_64 0:3.2.1-4.el6                                               
  bzip2-devel.x86_64 0:1.0.5-7.el6_0                                            
  cups.x86_64 1:1.4.2-67.el6                                                    
  desktop-file-utils.x86_64 0:0.15-9.el6                                        
  fontconfig-devel.x86_64 0:2.8.0-5.el6                                         
  freetype-devel.x86_64 0:2.3.11-14.el6_3.1                                     
  gcc-gfortran.x86_64 0:4.4.7-11.el6                                            
  ghostscript.x86_64 0:8.70-19.el6                                              
  ghostscript-fonts.noarch 0:5.50-23.2.el6                                      
  java-1.6.0-openjdk.x86_64 1:1.6.0.0-11.1.13.4.el6                             
  java-1.6.0-openjdk-devel.x86_64 1:1.6.0.0-11.1.13.4.el6                       
  jline.noarch 0:0.9.94-0.8.el6                                                 
  kpathsea.x86_64 0:2007-57.el6_2                                               
  lapack.x86_64 0:3.2.1-4.el6                                                   
  lapack-devel.x86_64 0:3.2.1-4.el6                                             
  lcms-libs.x86_64 0:1.19-1.el6                                                 
  libRmath.x86_64 0:3.1.2-1.el6                                                 
  libRmath-devel.x86_64 0:3.1.2-1.el6                                           
  libX11-devel.x86_64 0:1.6.0-2.2.el6                                           
  libXau-devel.x86_64 0:1.0.6-4.el6                                             
  libXft-devel.x86_64 0:2.3.1-2.el6                                             
  libXmu.x86_64 0:1.1.1-2.el6                                                   
  libXrender-devel.x86_64 0:0.9.8-2.1.el6                                       
  libXt.x86_64 0:1.1.4-6.1.el6                                                  
  libgfortran.x86_64 0:4.4.7-11.el6                                             
  libicu.x86_64 0:4.2.1-9.1.el6_2                                               
  libicu-devel.x86_64 0:4.2.1-9.1.el6_2                                         
  libxcb-devel.x86_64 0:1.9.1-2.el6                                             
  netpbm.x86_64 0:10.47.05-11.el6                                               
  netpbm-progs.x86_64 0:10.47.05-11.el6                                         
  openjpeg-libs.x86_64 0:1.3-10.el6_5                                           
  pcre-devel.x86_64 0:7.8-6.el6                                                 
  poppler.x86_64 0:0.12.4-3.el6_0.1                                             
  poppler-data.noarch 0:0.4.0-1.el6                                             
  poppler-utils.x86_64 0:0.12.4-3.el6_0.1                                       
  portreserve.x86_64 0:0.0.4-9.el6                                              
  psutils.x86_64 0:1.17-34.el6                                                  
  rhino.noarch 0:1.7-0.7.r2.2.el6                                               
  tcl.x86_64 1:8.5.7-6.el6                                                      
  tcl-devel.x86_64 1:8.5.7-6.el6                                                
  tex-preview.noarch 0:11.85-10.el6                                             
  texinfo.x86_64 0:4.13a-8.el6                                                  
  texinfo-tex.x86_64 0:4.13a-8.el6                                              
  texlive.x86_64 0:2007-57.el6_2                                                
  texlive-dvips.x86_64 0:2007-57.el6_2                                          
  texlive-latex.x86_64 0:2007-57.el6_2                                          
  texlive-texmf.noarch 0:2007-38.el6                                            
  texlive-texmf-dvips.noarch 0:2007-38.el6                                      
  texlive-texmf-errata.noarch 0:2007-7.1.el6                                    
  texlive-texmf-errata-dvips.noarch 0:2007-7.1.el6                              
  texlive-texmf-errata-fonts.noarch 0:2007-7.1.el6                              
  texlive-texmf-errata-latex.noarch 0:2007-7.1.el6                              
  texlive-texmf-fonts.noarch 0:2007-38.el6                                      
  texlive-texmf-latex.noarch 0:2007-38.el6                                      
  texlive-utils.x86_64 0:2007-57.el6_2                                          
  tk.x86_64 1:8.5.7-5.el6                                                       
  tk-devel.x86_64 1:8.5.7-5.el6                                                 
  tmpwatch.x86_64 0:2.9.16-4.el6                                                
  unzip.x86_64 0:6.0-1.el6                                                      
  urw-fonts.noarch 0:2.4-10.el6                                                 
  xdg-utils.noarch 0:1.0.2-17.20091016cvs.el6                                   
  xorg-x11-proto-devel.noarch 0:7.7-9.el6                                       
  xz-devel.x86_64 0:4.999.9-0.5.beta.20091007git.el6                            

Dependency Updated:
  cpp.x86_64 0:4.4.7-11.el6                                                     
  cups-libs.x86_64 1:1.4.2-67.el6                                               
  gcc.x86_64 0:4.4.7-11.el6                                                     
  gcc-c++.x86_64 0:4.4.7-11.el6                                                 
  libgcc.x86_64 0:4.4.7-11.el6                                                  
  libgomp.x86_64 0:4.4.7-11.el6                                                 
  libstdc++.x86_64 0:4.4.7-11.el6                                               
  libstdc++-devel.x86_64 0:4.4.7-11.el6                                         
  xz-libs.x86_64 0:4.999.9-0.5.beta.20091007git.el6                             

Complete!

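Reportedly the most important R innovation of 2014...

Today I attended a training session given by Hadley, and for the first time took a proper look at the pipe. I had a vague memory of it, mainly because someone talked about it at an R conference, but back then all I retained was the name. Today I finally had the chance to look at it and think it through. (Aside: isn't R sometimes a bit too flexible?)

The pipe's tagline: "the pipe operator is one (if not THE) most important innovation introduced, this year, to the R ecosystem." Sounds rather magical. It seems to have been borrowed from F#... R really is malleable.

The short history: once Hadley had finished dplyr, the magrittr package began to surface and quickly became wildly popular...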
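To make the idea concrete, here is a minimal sketch of what the pipe does, assuming the magrittr package is installed. The data (the built-in mtcars) and the variable names are only illustrative and do not come from the training.

```r
library(magrittr)  # provides the %>% pipe operator

# Nested calls have to be read inside-out
nested <- round(mean(sqrt(mtcars$mpg)), 2)

# With the pipe the same steps read left to right:
# x %>% f(y) is evaluated as f(x, y)
piped <- mtcars$mpg %>% sqrt() %>% mean() %>% round(2)

identical(nested, piped)  # TRUE: both forms run the same calls
```

The pipe adds no new computation; it only changes the order in which the steps are written.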

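And sure enough, someone had already introduced it on COS: Ren Kun had gone a step further and built the pipeR package to play with: http://cos.name/2014/04/use-pipeline-operators-in-r/

Then I went back to the slides from this May's R conference in Beijing... they were great (and I was actually in Beijing at the time, so what was I doing? Always late to the party).

And of course the COS forum had a discussion about it long ago. These geeks...

That's all. I'm off to study. R is something you can never finish learning! Every so often my worldview gets overturned again. Sigh.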


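A Brief Look at Shiny's Architecture

Isn't it said that learning a language is not just about its syntax but, more importantly, about the ideas behind it? R itself is a hodgepodge: ggplot can be singled out and studied as a language in its own right, and so can Shiny.

Building a simple Shiny app is indeed not hard, as the example on the front page of the official site shows. If all you need is a fairly simple, controllable dashboard, the Shiny code just has to be written with some care; there is no architecture to speak of.

Until one day... you discover that you can go deeper, and you jump resolutely into the next big pit: Shiny reactivity. The official site's opening line is direct: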

It’s easy to build interactive applications with Shiny, but to get the most out of it, you’ll need to understand the reactive programming model used by Shiny.

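It then introduces the three musketeers: reactive sources, reactive conductors, and reactive endpoints.

The simplest case: no conductor, straight from source to endpoint.

Borrowing the diagram from the official site again:

Sounds easy enough... and in fact most Shiny apps deal with exactly this case: essentially a process from input to output. In the simplest form, the input object we handle in a Shiny app receives the values produced by user actions, and a table, plot, or piece of text is returned in response. Back to the official example (again borrowing the official site's diagram):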

shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    hist(rnorm(input$obs))
  })
})

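That is roughly the whole architecture. Within it you can of course write plenty of if/else in server.R to handle all kinds of input and output combinations, and then arrange the input widgets and the output charts and text in ui.R.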
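As a rough sketch of that pattern, the app below puts an if/else in the server and lays out the widgets in the UI. The widget ids and labels, the choice of distributions, and the name app are assumptions made for illustration; both parts are kept in one script rather than separate ui.R and server.R files.

```r
library(shiny)

# ui.R part: arrange the input widgets and the output placeholder
ui <- fluidPage(
  selectInput("dist", "Distribution:", choices = c("normal", "uniform")),
  sliderInput("obs", "Number of observations:", min = 1, max = 1000, value = 500),
  plotOutput("distPlot")
)

# server.R part: a plain if/else decides what the endpoint draws
server <- function(input, output, session) {
  output$distPlot <- renderPlot({
    x <- if (input$dist == "normal") rnorm(input$obs) else runif(input$obs)
    hist(x, main = paste("Histogram of", input$dist, "sample"))
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch the app interactively
```

Both input$dist and input$obs are reactive sources here, and output$distPlot is the single endpoint that depends on them.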

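Adding conductors

Once you are a bit more familiar with Shiny, you start wanting more control over it. A conductor is basically like an intermediate variable: it is never displayed itself, but exists as an intermediate step. For example, the Fibonacci example from the official site: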

fib <- function(n) ifelse(n<3, 1, fib(n-1)+fib(n-2))

shinyServer(function(input, output) {
  currentFib         <- reactive({ fib(as.numeric(input$n)) })

  output$nthValue    <- renderText({ currentFib() })
  output$nthValueInv <- renderText({ 1 / currentFib() })
})

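All that is new here is the object currentFib, which is neither part of input nor of output. It temporarily holds an intermediate result of the computation, and two displayed values are then derived from it. The new architecture diagram simply gains a conductor.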
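One practical benefit of a conductor is that its result is cached, so several endpoints can share one evaluation. The sketch below tries this outside a running app, assuming the shiny package is installed: reactiveVal() stands in for input$n, isolate() is used only so the values can be read at the console, and the calls counter is purely illustrative.

```r
library(shiny)

fib <- function(n) if (n < 3) 1 else fib(n - 1) + fib(n - 2)

n_input <- reactiveVal(10)   # stand-in for input$n (a reactive source)
calls   <- 0                 # counts how often the conductor body runs

currentFib <- reactive({     # the conductor
  calls <<- calls + 1
  fib(n_input())
})

# Two "endpoints" read the conductor; the second read reuses the cached value
value        <- isolate(currentFib())
inverse      <- isolate(1 / currentFib())
calls_before <- calls

n_input(12)                  # changing the source invalidates the conductor
value_after  <- isolate(currentFib())
calls_after  <- calls        # the body had to run again after the change
```

Comparing calls_before and calls_after shows when the conductor was actually recomputed.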

A bit more complex: reactive expressions

According to the official site, typical uses include:

  • accessing a database
  • reading data from a file
  • downloading data over the network
  • performing an expensive computation

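These are basically cases where data is read from some source other than the user's input. A common scenario: given some conditions from the user (passed in through input), return the subset of the data that matches them. If the full data set keeps being updated (say, load("xxx.rdata") should be re-run rather than executed once at startup), that statement has to be written inside reactive, so that it runs again whenever the reactive expression is re-evaluated. Likewise, if you build a SQL statement directly from input and then query a database with it, that also belongs inside reactive.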
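A minimal sketch of the file-reading case, under a few assumptions: the data lives in an .rds file (a temporary copy of mtcars stands in for the real data set), and the names data_path, min_mpg, and filtered are hypothetical.

```r
library(shiny)

# Hypothetical data file; a real app would point at its own data source
data_path <- tempfile(fileext = ".rds")
saveRDS(mtcars, data_path)

ui <- fluidPage(
  numericInput("min_mpg", "Minimum mpg:", value = 25),
  textOutput("count"),
  tableOutput("table")
)

server <- function(input, output, session) {
  # Reading inside reactive(): the file is read again each time this
  # expression is re-evaluated (here, whenever input$min_mpg changes)
  filtered <- reactive({
    full <- readRDS(data_path)
    full[full$mpg >= input$min_mpg, , drop = FALSE]
  })

  # Both endpoints share the same filtered result
  output$count <- renderText(paste(nrow(filtered()), "rows match"))
  output$table <- renderTable(filtered(), rownames = TRUE)
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app interactively
```

If readRDS() were called once at the top of the script instead, the app would keep serving the data as it was at startup.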

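A bit more complex: automatic or manual updates

By default, if there is no submit button, any single action triggers the whole reactive flow. On the front page of the official site, for example, the histogram changes as soon as we drag the slider. If you don't want it to change immediately, the simple option is to add a submit button, so that changes only take effect after submitting.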
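A small sketch of the submit-button variant, assuming the shiny package is installed; the labels and the name app are illustrative.

```r
library(shiny)

ui <- fluidPage(
  sliderInput("obs", "Number of observations:", min = 1, max = 1000, value = 500),
  # With a submitButton on the page, input changes are held back
  # until the button is clicked
  submitButton("Update"),
  plotOutput("distPlot")
)

server <- function(input, output, session) {
  output$distPlot <- renderPlot(hist(rnorm(input$obs)))
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app interactively
```

The server code is unchanged from the automatic version; only the UI gains the button.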

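What if you want finer control over the individual parts of an app's interface? Then you need the more advanced actionButton.

It is used together with isolate(): you tell Shiny which things should follow the button and which are outside its control. For example, from the official site: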

shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    # Take a dependency on input$goButton
    input$goButton

    # Use isolate() to avoid dependency on input$obs
    dist <- isolate(rnorm(input$obs))
    hist(dist)
  })
})

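Here, after the initial render when the app starts, the random draw and the plot are only re-run when goButton is clicked. The value of an actionButton is a counter that starts at 0 and increases with every click, so you can use it in server.R for flow control.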
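As a sketch of that kind of flow control, the app below uses the counter to skip the draw until the first click and also shows the click count. The ids, labels, and the name app are assumptions for illustration.

```r
library(shiny)

ui <- fluidPage(
  numericInput("obs", "Number of observations:", value = 100, min = 1),
  actionButton("goButton", "Go!"),
  textOutput("clicks"),
  plotOutput("distPlot")
)

server <- function(input, output, session) {
  # The button value is simply the number of clicks so far
  output$clicks <- renderText(paste("Button clicked", input$goButton, "times"))

  output$distPlot <- renderPlot({
    # Reading input$goButton takes the dependency; 0 means "not clicked yet"
    if (input$goButton == 0) return(NULL)

    # isolate() keeps changes to input$obs from triggering a redraw
    dist <- isolate(rnorm(input$obs))
    hist(dist)
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app interactively
```

Changing the number alone does nothing; the new value is picked up at the next click.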

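Closing ramble:

Does Shiny have a chance of growing into a giant that could replace something like PHP for most of the interaction between client and server? Seen from this angle, Tableau is still too restrictive.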


Automatically syncing to GitHub on Linux

Just storing some code here.