Categories
日常应用

七天搞定SAS(七):常用统计模型

本系列连载文章:

其实最后一天,反而是任务最繁重的。这一天,需要纵览SAS的各个常用的统计模块。BTW,在用惯了ggplot2之后,再也不认为有任何理由用其他软件画图了...所以SAS的图形模块自动被我无视(貌似很多SAS用户也一直在吐槽这东西着实不好使)。

SAS里面的概要统计:PROC MEANS

其实前几天也说过了PROC MEANS,不过这里稍稍补充一点置信区间的东西吧。其实它的参数真的挺多的:

    • CLM:双侧置信区间
    • CSS:调整平方和
    • CV:变异系数
    • KURTOSIS:峰度
    • LCLM :单侧置信区间——左侧
    • MAX:最大值
    • MEAN:均值
    • MIN:最小值
    • MODE:众数
    • N :非缺失值个数
    • NMISS:缺失值个数
    • MEDIAN(P50):中位数
    • RANGE:范围
    • SKEWNESS:偏度
    • STDDEV:标准差
    • STDERR:均值的标准误
    • SUM:求和
    • SUMWGT:加权求和
    • UCLM:单侧置信区间:右侧
    • USS:未修正的平方和
    • VAR:方差

ode variance

  • PROBT:t统计量对应的p值
  • T:t统计量
  • Q3 (P75):75%分位数,etc.
  • P10:10%分位数,etc.

在调用CLM的时候需要指定ALPHA:

DATA booklengths;
INFILE 'c:\MyRawData\Picbooks.dat';
INPUT NumberOfPages @@;
RUN;
*Produce summary statistics;
PROC MEANS DATA=booklengths N MEAN MEDIAN CLM ALPHA=.10;
TITLE 'Summary of Picture Book Lengths';
RUN;

结果如下:

2013-12-09 15_46_26-The Little SAS Book(Fourth).PDF - Adobe Reader

SAS里面的相关性分析:PROC CORR

虽然correlation一直被各种批判,但是往往在拿到数据的第一步、毫无idea的时候,correlation还是值得一看的参考指标。SAS里面的PROC CORR提供了相应的功能。

PROC CORR DATA = class;
VAR Television Exercise;
WITH Score;
TITLE ’Correlations for Test Scores’;
TITLE2 ’With Hours of Television and Exercise’;
RUN;

SAS的相关性分析结果输出如下:

2013-12-09 15_47_04-The Little SAS Book(Fourth).PDF - Adobe Reader

SAS里面的基本回归分析:PROC REG

类似于R中的lm(),这个实在是没什么好说的了,最基本的最小二乘法。

DATA hits;
INFILE 'c:\MyRawData\Baseball.dat';
INPUT Height Distance @@;
RUN;
* Perform regression analysis;
PROC REG DATA = hits;
MODEL Distance = Height;
TITLE 'Results of Regression Analysis';
RUN;

SAS的输出结果如下:2013-12-09 15_47_46-The Little SAS Book(Fourth).PDF - Adobe Reader

 

包含了回归模型的基本统计量。我们一般更关注的回归系数:

2013-12-09 15_49_16-The Little SAS Book(Fourth).PDF - Adobe Reader

到这里,我的感慨就是:真的很像Stata呀!值得注意的是,REG有很多可选的参数,对于这些参数是干嘛用的,最权威的自然还是SAS官方的文档:http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect007.htm。其实熟悉了SAS的语法和工作模式之后,具体到某个模型还是看官方文档比较舒服。不愧是商业软件啊,文档写的都很专业,有很多模型选择问题其实看看文档就能多少明白一些了。

比如PROC REG的参数就有:

Table 73.1 PROC REG Statement Options
Option Description
Data Set Options
DATA= names a data set to use for the regression
OUTEST= outputs a data set that contains parameter estimates and other
model fit summary statistics
OUTSSCP= outputs a data set that contains sums of squares and crossproducts
COVOUT outputs the covariance matrix for parameter estimates to the
OUTEST= data set
EDF outputs the number of regressors, the error degrees of freedom,
and the model to the OUTEST= data set
OUTSEB outputs standard errors of the parameter estimates to the
OUTEST= data set
OUTSTB outputs standardized parameter estimates to the OUTEST= data
set. Use only with the RIDGE= or PCOMIT= option.
OUTVIF outputs the variance inflation factors to the OUTEST= data set.
Use only with the RIDGE= or PCOMIT= option.
PCOMIT= performs incomplete principal component analysis and outputs
estimates to the OUTEST= data set
PRESS outputs the PRESS statistic to the OUTEST= data set
RIDGE= performs ridge regression analysis and outputs estimates to the
OUTEST= data set
RSQUARE same effect as the EDF option
TABLEOUT outputs standard errors, confidence limits, and associated test
statistics of the parameter estimates to the OUTEST= data set
ODS Graphics Options
PLOTS= produces ODS graphical displays
Traditional Graphics Options
ANNOTATE= specifies an annotation data set
GOUT= specifies the graphics catalog in which graphics output is saved
Display Options
CORR displays correlation matrix for variables listed in MODEL and
VAR statements
SIMPLE displays simple statistics for each variable listed in MODEL and
VAR statements
USCCP displays uncorrected sums of squares and crossproducts matrix
ALL displays all statistics (CORR, SIMPLE, and USSCP)
NOPRINT suppresses output
LINEPRINTER creates plots requested as line printer plot
Other Options
ALPHA= sets significance value for confidence and prediction intervals and tests
SINGULAR= sets criterion for checking for singularity

SAS里面的基本方差分析:PROC ANOVA

方差分析也就不赘述了,其实我感觉没有回归分析更用的普遍...这俩东西某种程度上也是一回事儿,看怎么理解了。

PROC ANOVA DATA = basket;
CLASS Team;
MODEL Height = Team;
MEANS Team / SCHEFFE;
TITLE ”Girls’ Heights on Basketball Teams”;
RUN;

SAS的输出如下:

2013-12-09 15_50_40-The Little SAS Book(Fourth).PDF - Adobe Reader

先是用作分类的变量的基本统计。然后是模型的基本统计:

2013-12-09 15_50_34-The Little SAS Book(Fourth).PDF - Adobe Reader

最后是各个组的分析结果(两两比较,由于指定了SCHEFFE参数):

2013-12-09 15_51_16-The Little SAS Book(Fourth).PDF - Adobe Reader

SAS中的离散被解释变量模型:PROC LOGISTIC和PROC GENMOD

最简单的离散被解释变量模型就是logit了,在SAS里面有直接的PROC LOGISTIC。官方文档在此:http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#logistic_toc.htm

语法自然是一如既往的简单:

proc logistic;
model y=x1 x2;
run;

结果返回:

The LOGISTIC Procedure

 

Model Information
Data Set WORK.INGOTS
Response Variable (Events) r
Response Variable (Trials) n
Model binary logit
Optimization Technique Fisher's scoring

Number of Observations Read 19
Number of Observations Used 19
Sum of Frequencies Read 387
Sum of Frequencies Used 387

首先自然是模型的统计信息。然后是数据的统计:

Response Profile
Ordered
Value
Binary Outcome Total
Frequency
1 Event 12
2 Nonevent 375

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied

然后是假设检验:

Model Fit Statistics
Criterion Intercept
Only
Intercept
and
Covariates
AIC 108.988 103.222
SC 112.947 119.056
-2 Log L 106.988 95.222

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 11.7663 3 0.0082
Score 16.5417 3 0.0009
Wald 13.4588 3 0.0037

最后是参数估计:

Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept 1 -5.9901 1.6666 12.9182 0.0003
Heat 1 0.0963 0.0471 4.1895 0.0407
Soak 1 0.2996 0.7551 0.1574 0.6916
Heat*Soak 1 -0.00884 0.0253 0.1219 0.7270

而对于泊松模型,则需要PROC GENMOD。我觉得我一一个列出这些模型已经超出了这篇笔记的范围了...所以干脆就改成简单翻译一下各个PROC的主要模型吧。说过了,学习模型不是主要的目的——模型终究不该通过软件来学...虽然SAS的user guide真的还算是比较好的统计学教材呢。

SAS里面的PROC一览

除了上面说到的PROC,SAS当然还有更多强大的模块。我就顺手一一点开看看这些东西都能做什么...