Categories
日常应用

ODBC(Teradata)和R连接常见问题

以前总结了一些R、SAS和TD彼此之间数据导入导出的办法,可是实际使用中还是会碰到各种各样的问题...问的人多了,就多少攒出一些FAQ放在了内部的wiki上。

还有一些问题估计大家都会遇到,所以也摘到blog上来好了。一开始用英文写的,简单翻译一下,不全翻译了。首先是通过RODBC连TD的一些常见问题,不知道是不是通用于其他ODBC driver...

  • Q: I cannot load RODBC on my own computer (not on R servers)... what can I do?
    A: Try to debug in these steps -1)  Check if you have installed the 64-bit version of Teradata ODBC driver on your 64-bit OS. If not, download and install these three packages here: http://downloads.teradata.com/download/connectivity/odbc-driver/windows(保证R、操作系统和ODBC driver位数一致,如果是64位的操作系统就装一套儿64位的吧...上面那个link是64位TD ODBC Driver的下载地址)。2)  Open your 64-bit ODBC administrator and configure the TD server DNS. (在新的ODBC中配置server地址) 

    3)  Close and re-open R again. (重启R,这步一定不能省)
    4)  If you still have problems, use R 2.*** instead of R 3.0+ (R 3.0以上版本有时候会莫名其妙报错,如是,改用2.15+版本吧。)

  • Q: How can I call R in the command line (to schedule a recurring job)? (怎么run batch job?)
    A: You can call RScript directly in your command line. If failure, you need to add R's path to your system environment variable - PATH. e.g. C:\Program Files\R\R-2.15.2\bin (设一下环境变量)
    Pay attention to its parameters. Also, you can save the log. A sample command is as follows - (命令行类似如下)

    Rscript --no-save --no-restore --verbose "/home/Liyun.Chen/R/recurring/recurring_dashboard_by_week.r" > outputFile.log 2>&1

    (记得存个log)

暂时就是这些...做个存档好了。

Categories
日常应用

从词频统计到词频矩阵(善用sparse matrix)

今天在下手写一个算text similarity的程序时,万般痛恨tm包不争气,而lijian哥的tmcn包又还木有完工,所以只能自己从头开始写了...

一切正常,基本的清洗啊,全角换半角啊,分词啊,去stop words啊几行代码顺利搞定,结果统计完了每行的词频却找不到一个有效的办法来算词频矩阵!数据也就是那么几万行嘛...reshape2怎么一下子就挂了呢?研究了一番,想到就算搞出来一个几万乘几万的词频矩阵,后面的distance之类也算不出来...悲哀的感觉充斥。没办法,只能祭出sparse matrix这面大旗了!

Google之,R里面可以调用Matrix或者SparseM。鉴于前者看起来比较简单的样子,我就没有去折腾后者。Matrix里面Sparse Matrix的定义比较简单,就是记录一下值不为0的行和列编号,以及他们对应的值就可以了。等等,这个东西怎么听起来这么熟悉?!好像在network analysis里面整天写这样的程序嘛!对嘛,不就是一个edgelist格式的network(记录相互连接的边)转换为一个matrix格式的网络(边边矩阵)嘛!瞬间有种打通任督二脉的感觉....

这样,只要我数出来每篇文章中的词频就足够了! 形如:
Post_id word count
1.           A.       1
1.           B.       2

这样,然后

sparseMatrix(as.factor(data$Post_id),
as.factor(data$word),
data$count)

就ok啦。超级高效!瞬间秒杀。根本不用去算真正的词频矩阵。

附这个函数的用法:

sparseMatrix(i = ep, j = ep, p, x, dims, dimnames,
             symmetric = FALSE, index1 = TRUE,
             giveCsparse = TRUE, check = TRUE)

同样的,dim()还是可以用来看维度,一般没错的。

再后面就是考验大家具真学得好不好的时候啦。比如方差这种东西,不就是 XX'嘛(当然要先减一下均值....),这里Matrix这个包有tcrossprod()可直接算XX',colsum这些也都还有...正常的矩阵乘法也都还可以用。关键是,QR分解什么的也有!

于是乎我可以深深满足的继续码下面的分析去了...

--------------------吐槽分割线------------------
作为一个经济学出身的孩子,我做梦也没想到有一日我居然会去碰稀疏矩阵这种东西...没办法,生活所迫呀!自己动手,丰衣足食.... T_T

Categories
日常应用

R的outreg:输出回归结果表格via stargazer

大概被这个问题折磨的最多的人就是搞econometrics的人吧...熟悉stata的人应该都很喜欢outreg这个功能,而R里面就麻烦得多。以前一直写一个outreg()的函数来搞定这件事儿,现在看来有更方便的方法了——stargazer这个包。什么AER、QJE之类的完全不在话下。

stargazer_regression

这个包支持这些对象:

lm, glm, svyglm, plm, betareg, gee, gam, polr, survreg , coxph, tobit (AER), ivreg (AER), zeroinfl (pscl), hurdle (pscl), multinom (nnet), as well as their implementations in Zelig

还支持这些期刊的表格格式:

American Economic Review, in the Quarterly Journal of Economics, or in Administrative Science Quarterly. Political scientists can avail themselves of templates based on the American Political Science Review, the American Journal of Political Science, and on International Organization. For sociologists and demographers, the American Sociological Review, the American Sociological Reviews and Demography are available.

该收收了吧,哈哈。

Categories
日常应用

从R里面底层操纵Excel/xlsx(自动化报告福音)

好吧,我在eBay折腾的最多的就是生成自动化报告时候各种软件之间的相互调用,什么R啊,SAS啊,Teradata啊,Excel啊,Python啊,反正基本都有机会相互调用一下。每到此时我就深深感慨选择一个library丰富的工具是多么的重要!You could hardly expect what you colleagues are handy with!(P.s. 不要跟我提VBA这种逆天存在的东西。有哪个时间研究它你学点啥别的不好...)

今天忍无可忍+心情大好的折腾了一下R和excel。这个不是简单的从R里面读写excel数据,而是真心用R去操纵excel里面的单元格(cell),除了读写数据之外还要定义样式什么的。excel作为一个奇葩的软件,you may never expect where people would paste data to! 然后他们再自定义一堆样式(我恨这种点点鼠标就能改的东西,你丫又不是Photoshop...)。

但是没办法,人家定义好的“高端洋气”的报表姿态你不能轻易动啊。只能乖乖的往里面paste数据。这件事虽说一次两次手动也就罢了,三五次真的是要疯掉的。anyway,万事总有解决的途径...

很久以前从Yixuan 的博客上得知有xlsx这么个包,当时只记得这东西可以读写xlsx...直到后面折腾了一下才知道这货底层居然调用的是java的xlsx API,也就是说不用写Java也可以操作xlsx了,yeah!

为了生成excel格式的自动化报告(不要问我为啥不用knitr,不用***,说起来都是泪呀!),我主要需要解决的就是:

  • 读取原有xlsx文件,保持格式、附加新格式。
  • 在相应的位置粘进去新的数据。(当然如果只有这么一个需求可以通过ODBC来做...)

第一个倒是满简单的,就是较之yixuan代码里面的createWorkbook(),改成loadWorkbook()就可以了。然后就是找到相应的sheet,这个也满简单的,一行getSheets搞定。

然后第二步建议不要去操作cell(太没效率了),直接操作cellblock。CellBlock()可以用来定义一个新的CellBlock,然后灵活运用CB.setBorder()和CB.setColData()就可以先增加边框、然后一列列填充数据。这里使用按列填充数据主要是因为R里面的Data Frame是一列一个数据格式的,一下子把一块儿都paste到excel的cellblock里面的话,会报错...BTW为了定义边框的样式,需要用到Border()。类似的还可以定义Fill和Font这些。

同上,最好不要直接用addDataFrame()来直接贴数据...格式不能覆盖。如果是要在一个新的sheet上贴数据,那么就write.xlsx(sheetName="newsheet",append=T)好了。不需要通过上述底层的API折腾了。

最后还有一个比较有用的函数,autoSizeColumn()可以用来自动调整列宽。全鼓捣完之后saveWorkbook()保存就可以啦。

最后的最后,一个珍贵的建议——都在R里面把数据整理好再去想输出到excel里面(什么reshape2啊,data.table啊,plyr啊,该上的一起上啊!),千万别手贱在excel里面改一点点小东西...每一次都手动改一下下你的时间就被白白浪费了好几分钟!珍爱生命,远离excel...

附上一段我最后搞定自动化报告的代码:

library("xlsx")
test_template <- loadWorkbook("template.xlsx") #读入template.xlsx文件。定义好各种乱七八糟的格式的。
design_tab <- getSheets(test_template)[["design"]] #转到design这个sheet。
data_block <- CellBlock(design_tab, 5,5,nrow(mydata),ncol(mydata)) #准备贴数据的方块,我这里从第5行第5列开始贴。
border <-  Border(color="black", position=c("LEFT", "RIGHT"),
pen=c("BORDER_THIN", "BORDER_THIN")) #定义边框样式——左右黑色细直线。
for (i in 1:ncol(mydata))
{
CB.setBorder(data_block, border,colIndex = i,rowIndex=1:nrow(onetime_design_tab)) #给每一列都贴上边框
CB.setColData(data_block, mydata[,i], i, rowOffset=0, showNA=F, colStyle=NULL)#给每一列贴数据
}
border_bottom <-  Border(color="black", position=c("LEFT", "RIGHT","BOTTOM"),pen="BORDER_THIN") #定义结尾行样式——底端黑细直线
data_block_bottom <- CellBlock(design_tab, 5+nrow(mydata),5,1,ncol(mydata)) #选择最后一行
CB.setBorder( data_block_bottom, border_bottom, 1, 1:ncol(onetime_design_tab)) #定义最后一行格式
autoSizeColumn(design_tab, 5:(5+ncol(onetime_design_tab)))#调整列宽
saveWorkbook(test_template, file=output_xlsx_name) #保存
##add row data
write.xlsx(rawdata, file=output_xlsx_name, sheetName="raw_data",append=T,row.names=F) #直接贴原始数据,无格式
###add queries
R_file <- readLines(R_file_name, n=-1) #直接贴R代码
SQL_file <- readLines(SQL_file_name, n=-1)
write.xlsx(SQL_file, file=output_xlsx_name, sheetName="query_SQL",append=T,row.names=F) #直接贴代码到新的sheet中
write.xlsx(R_file, file=output_xlsx_name, sheetName="query_R",append=T,row.names=F)

 


Categories
日常应用

不说人话的R报错信息

此文为译文,谨此来纪念那些被R不知所云的稀奇古怪的报错折磨过的凄凉岁月...

Translating Weird R Errors
January 20, 2013
By Slawa Rokicki

原文写的很风趣,时间所限我就简单的翻译一下了。

1. 其实我只是拼错了变量名...

运行这段代码:

prob1<-as.data.frame(cbind(c(1,2,3),c(5,4,3)))
colnames(prob1)<-c("Education","Ethnicity")
table(prob1$education, prob1$Ethnicity)

然后R会报错:

all arguments must have the same length

莫名其妙有木有?其实正确的应该是:

table(prob1$Education, prob1$Ethnicity)

我只是忘了大写了...囧。

2. 我只是调用了不存在的变量....

比如我运行:

prob1$gender_recode <-as.numeric(prob1$Gender==2)

然后就会报错:

replacement has 0 rows, data has 3

但是这样就没问题:

prob1$Educ_recode<-as.numeric(prob1$Education==2)

原因只是gender这个变量不存在....你就不能直接告诉我找不到变量么?

3. 找不到变量?

我这次确保Education是有的,但是居然还是报错?

nrow(prob1[prob1$Education!=1])

报错:

undefined columns selected

而人家只是少打了一个逗号而已嘛...

nrow(prob1[prob1$Education!=1,])

哎,你就不能直接报语法错误嘛!

原文附下:

I love R. I think it's intuitive and clever and overall a great language. But I do get really annoyed sometimes at the completely ridiculous, cryptic error messages it often gives me. This post will go over some of those seemingly nonsensical errors so you don't have to go crazy trying to find the bug in your code.

1. all arguments must have the same length

To start with, I just make up some quick data:

prob1<-as.data.frame(cbind(c(1,2,3),c(5,4,3)))
colnames(prob1)<-c("Education","Ethnicity")

And now I just want to do a simple table but I get this error:

all arguments must have the same length

What the heck. I look back at my dataset and make sure that both those variables are the same length, which they do. The problem here is that I misspelled "Education". There's a missing "a" in there and instead of telling me that I referenced a variable that doesn't exist, R bizarrely tells me to check the length of my variables. Remember: Anytime you get an error, check to make sure you've spelled everything right.

If I do this, everything works out great:

table(prob1$Education, prob1$Ethnicity)

2. replacement has 0 rows, data has 3

A very similar problem, with a very different error message. Let's say I forgot what columns were in my prob1 data and I thought I had a Sex indicator in there. So I try to recode it like this:

This error message is also pretty unhelpful. The syntax is totally correct; the problem is that I just don't have a variable named Sex in my dataset. If I do this instead to recode education, a variable that exists, everything is fine:

prob1$Educ_recode<-as.numeric(prob1$Education==2)

3. undefined columns selected

Ironically, the error we so badly wanted before comes up but for a completely different reason. See if you can find the problem here. I'll take that same little dataset and I just want to know how many rows there are in which Education is not equal to 1.

So, if I want to know the number of rows of the dataframe prob1, I do:

nrow(prob1)

and if I want to know how many have a value of Education not equal to 1, I do the following (incorrectly) and get an error:

Now I check my variable name and I've definitely spelled Education right this time. The problem, actually, is not that I have referenced a column that doesn't exist but I've messed up the syntax to the nrow() function, in that I haven't defined what columns I want to subset. When I do,

prob1[prob1$Education!=1]

this doesn't make any sense, because I'm saying to subset prob1 but to do this I have to specify which rows I want and which columns I want. This just lists one condition in the brackets and it's unclear whether it's for the rows or columns. See my post on subsetting for more details on this.

If I do it the following way, all is good since I'm saying to subset prob1 with only rows with education !=1 and all columns:

nrow(prob1[prob1$Education!=1,])