Basic Operation
- data3 <- read.csv("data3.csv"): reads data; note that the file name must be in double quotes.
- table(df$sex, df$age): gives the frequencies of sex by age, i.e. a contingency table; also handy for looking up a keyword.
- length(x): returns the length of a vector (for a data frame, the number of columns).
- x <- df[df$smoke == 0, ]: extracts the rows where smoke == 0; note the comma "," required for row subsetting.
- df['colname']: extracts a specific column.
- df$smoke <- as.factor(df$smoke): creates factors; factor/nominal/categorical variables need this to tell R that they have no natural order.
- df$smoke <- ordered(df$smoke): creates ordered factors.
- cbind(): combines vectors, matrices, or data frames by column.
- tapply(df1$W, df1$A, mean): computes the mean of W within each level of A.
- rep("M", 825): repeats the character string "M" 825 times; the result is a character vector of length 825 with each element containing "M".
Check the normality
- Draw the Q-Q plot
x = df$bwt
qqnorm(x)
abline(mean(x), sd(x), col = 2, lwd = 2)
- Draw the P-P plot
plot((rank(x) - 0.5) / length(x), pnorm(x, mean = mean(x), sd = sd(x)), main = "p-p plot")
- Shapiro-Wilk Test
H0: The variable xx is normally distributed
H1: The variable xx is not normally distributed
shapiro.test(df$bwt)
- Kolmogorov-Smirnov Test
ks.test(df$xx, "pnorm", mean = mean(df$xx), sd = sd(df$xx))
- Cramér-von Mises Test
library(goftest)
cvm.test(df$xx, "pnorm", mean = mean(df$xx), sd = sd(df$xx))
Plotting
To use ggplot, load the package first: library(ggplot2)
# boxplot
p <- ggplot(df, aes(x=smoke, y=bwt, fill=smoke)) # fill is optional
p + geom_boxplot(outlier.colour="red")+
ggtitle("boxplot of bwt versus smoke groups")+
xlab("smoke")+
ylab("birth weight in grams")
Association of Nominal Variable
Check whether A and B are independent, where A and B are factor/ nominal/ categorical variables
- Pearson Chi-Square(χ2) Test
chisq.test(table, correct=F)
- LRT (likelihood ratio test)
- Fisher's Exact Test
fisher.test(table, alternative = c("two.sided", "less", "greater"))
- Both Fisher's Exact test and the Pearson chi-square test can test the independence between the row and column variables of a 2×2 contingency table. However, the Pearson chi-square test can only perform a two-sided test, while Fisher's Exact test can also perform a one-sided test (not only whether there exists an association, but also whether it is positive or negative).
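As a minimal sketch of this point (using a made-up 2×2 table, not data from the course), the two tests can be compared like this:

```r
# Hypothetical 2x2 contingency table: exposure (rows) by outcome (columns)
tab <- matrix(c(12, 5, 8, 15), nrow = 2,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("yes", "no")))
chisq.test(tab, correct = FALSE)           # two-sided only
fisher.test(tab, alternative = "greater")  # one-sided: tests for a positive association
```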
Two-sample z-test
Assumption: the sample x & y are independent
Consider the contingency table; this test checks whether there is a significant difference in xxx between men and women.
Hypothesis: Px -> the proportion of xxx among men
Py -> the proportion of xxx among women
H0: Px = Py
H1: Px ≠ Py
prop.test(x=c(40,23), n=c(80,66), correct=F) # 40, 23 = counts with the behaviour of interest in each group; 80, 66 = group totals, e.g. 80 men, 66 women
Testing for equality of two variances
x <- df[df$smoke== 0,]$bwt
y <- df[df$smoke== 1,]$bwt
varx = var(x)
vary = var(y)
F <- varx/vary
m = length(x)
n = length(y)
p_value <- pf(F, m-1, n-1, lower.tail = FALSE)
p_value*2 # two-sided test: multiply by 2 to count both tails
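The same F test is available directly through var.test(); a sketch assuming x and y as defined above (the built-in p-value matches the hand calculation when varx > vary):

```r
# Built-in test for equality of two variances (two-sided by default)
var.test(x, y)
```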
Two-sample t-test
Test whether the means of xxx are equal between the two groups.
H0: μx = μy
H1: μx ≠ μy / μx > μy / μx < μy, chosen according to the question; for a two-sided test remember to count both tails, i.e. p_value*2
x <- df[df$smoke== 0,]$bwt
y <- df[df$smoke== 1,]$bwt
x_bar = mean(x)
y_bar = mean(y)
varx = var(x)
vary = var(y)
m = length(x)
n = length(y)
s2 = ((m-1)*varx + (n-1)*vary) / (m+n-2)
t_obs = (x_bar - y_bar) / sqrt(s2*(1/m + 1/n))
# p-value
p_value = 1-pt(t_obs,m+n-2) # note the 1-; this gives the upper tail (valid when t_obs > 0)
cat('p-value: ', p_value*2 , "\n")
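The pooled-variance t test above can also be run with the built-in t.test(); a sketch assuming x and y as defined above:

```r
# Pooled (equal-variance) two-sample t test, two-sided by default
t.test(x, y, var.equal = TRUE)
# Note: t.test(x, y) without var.equal = TRUE gives Welch's t test,
# which does not assume equal variances
```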
Linear regression model
lm(bwt ~ (race) + age + lwt + (smoke) + ptl + (ht) + ui + ftv, data = df)
The parentheses ( ) around certain variables indicate that they are categorical variables (factors); they will be converted into binary indicator variables by the lm() function. Remember to also convert them explicitly, e.g. df$smoke <- as.factor(df$smoke).
- Find the best regression model by using backward/forward/stepwise elimination (selection) method:
library(leaps)
fit = regsubsets(bwt~(race)+age+lwt+(smoke)+ptl+(ht)+ui+ftv, data=df, method="backward") # backward method: read the results from bottom to top
summary(fit) # R returns the selection path
# fit the submodels according to the selection path returned above
fit1=lm(bwt~(race)+lwt+(smoke)+ptl+(ht)+ui+ftv, data=df)
fit2=lm(bwt~(race)+lwt+(smoke)+ptl+(ht)+ui, data=df)
fit3=lm(bwt~(race)+lwt+(smoke)+(ht)+ui, data=df)
fit4=lm(bwt~(race)+(smoke)+(ht)+ui, data=df)
fit5=lm(bwt~(race)+(smoke)+ui, data=df)
fit6=lm(bwt~(race)+ui, data=df)
fit7=lm(bwt~ui, data=df)
print(c(summary(full_model)$adj.r.squared,summary(fit1)$adj.r.squared,summary(fit2)$adj.r.squared,summary(fit3)$adj.r.squared,summary(fit4)$adj.r.squared,summary(fit5)$adj.r.squared,summary(fit6)$adj.r.squared,summary(fit7)$adj.r.squared)) # pick the best model by the chosen criterion
# Here the adjusted R2 criterion is used; adjusted R2, R2, Min-Max accuracy => the larger the better
# interpretation: about R2*100% of the variability in the response is explained by the model
library(olsrr) #for the ols_mallows_cp() function, which calculates the Mallows' Cp statistic for model selection => smaller is better, closer to p+1 and small p, where p is the number of variables
# AIC、BIC、MAPE(Mean absolute percentage error)、MSE=> the lower the better
print(cbind(c(ols_mallows_cp(fit1, full_model),
ols_mallows_cp(fit2, full_model),
ols_mallows_cp(fit3, full_model),
ols_mallows_cp(fit4, full_model),
ols_mallows_cp(fit5, full_model),
ols_mallows_cp(fit6, full_model)),
c(summary(fit1)$r.squared,summary(fit2)$r.squared,
summary(fit3)$r.squared,summary(fit4)$r.squared,
summary(fit5)$r.squared,summary(fit6)$r.squared),
c(summary(fit1)$adj.r.squared,summary(fit2)$adj.r.squared,
summary(fit3)$adj.r.squared,summary(fit4)$adj.r.squared,
summary(fit5)$adj.r.squared,summary(fit6)$adj.r.squared)))
- Compare the model in (d) and the best model selected in (e). Which model is better?
anova(fit3, full_model) to obtain the F-statistic and p-value.
Through comparing the two models, we can determine if adding additional predictors to the model significantly improves the fit.
The F-statistic is used to test the null hypothesis that the simpler model (fit3) is just as good as the more complex model (full_model), against the alternative hypothesis that the more complex model provides a better fit.
ANOVA table
Difference between Regression and ANOVA :
Regression-> investigate the effect of one or more continous independent variables on Y
ANOVA-> investigate the effect of one or more categorical independent variables on Y
- One-way ANOVA
used to test the null hypothesis that the means of k independent normal populations with a common variance are identical
H0: μ1 = μ2 = ... = μk
v.s. H1: at least two means are not equal
-e.g. Conduct a hypothesis test to test whether the means are equal of all six treatment*exercise groups.
onewayanova = aov(score ~ factor(interaction(treatment,exercise)),data = data3)
summary(onewayanova)
Here the interaction(treatment, exercise) function creates a new factor by combining the levels of the treatment and exercise variables. The factor() function then converts this combined variable into a factor so that it can be used in the ANOVA model.
Three assumptions:
- Independence
- Constant variance
- Normality
When a one-way ANOVA is performed, it assumes that the group variances are statistically equal.
- So we need to test the Homogeneity of Variance/ equality of group variances first before conducting ANOVA
Null hypothesis: the variances are equal across all groups
Alternative hypothesis: at least one variance is different
- Bartlett's Test
bartlett.test(x~grps,dat)
bartlett.test(score ~ interaction(treatment, exercise), data = data3)
- Note that when we need to specify the interaction between two categorical variables, we use the interaction() function.
- Note that the bartlett.test function assumes that the data are normally distributed within each group. If this assumption is not met, the test may not be appropriate and alternative tests, such as Levene's test, may be more suitable.
The two tests below are less sensitive to departures from normality.
- Levene's Test
library(car)
leveneTest(score ~ treatment*exercise, data = data3)
- Brown-Forsythe Test
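In the car package, leveneTest() with center = median is in fact the Brown-Forsythe variant (center = mean gives the original Levene's test). A sketch assuming the same data3:

```r
library(car)
# Brown-Forsythe test: Levene's procedure centered at the group medians
leveneTest(score ~ treatment*exercise, data = data3, center = median)
# Original Levene's test uses the group means:
# leveneTest(score ~ treatment*exercise, data = data3, center = mean)
```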
- Write down the R output ANOVA table for a specific model
# build the model
model <- lm(score~age+treatment+exercise+treatment*exercise+age*treatment+age*exercise+age*exercise*treatment,data=data3)
library(car) # Use the Anova function under package "car"
Anova(model, type="II")
- Multiple Comparisons
After we perform an ANOVA and reject the null hypothesis, we know that at least two group means are different. We then consider methods for multiple comparison (pairwise comparisons) to identify significant differences in group means.
H0: μi = μj
- Fisher's least significant difference (LSD) method
res = aov(dat$x~ dat$grps)
summary(res) # From this, obtain the DFerror = Residuals-Df, MSerror= Residuals-Mean Sq
library(agricolae)
out = LSD.test(dat$x, dat$grps, DFerror=15, MSerror=0.386, p.adj="none") # obtain the LSD
out
# compare the absolute difference of each pair of means (taken from out$groups) with the LSD; if > LSD, reject H0
The function: LSD.test(y, trt, DFerror, MSerror, alpha = 0.05, p.adj = c("none", "holm", "hommel", "hochberg", "bonferroni", "BH", "BY", "fdr"), group = TRUE, main = NULL, console = FALSE)
Arguments: y: model (aov or lm) or responses of the experimental units
trt: constant (only if y = model) or vector of treatments applied to each experimental unit
DFerror: Degrees of freedom of the experimental error
MSerror: the mean squared error
- Bonferroni's method/ correction/ adjustment
k groups -> c=k(k-1)/2 comparisons, the significance level for each comparison is α/c
# conduct t test for each pair
# Bonferroni adjustment
pairwise.t.test(data3$score,interaction(data3$treatment,data3$exercise), p.adj = "bonferroni")
# Note: here the dataset can only be specified by prefixing the variables with data$; appending ", data = data3" does not work for pairwise.t.test
Compare the obtained p-values with α/c to conclude (with p.adj = "bonferroni" the reported p-values are already adjusted, so they can be compared with α directly)
- Tukey's method
# Tukey’s HSD
Para <- aov(x~factor(grps), data=dat)
summary(Para)
TukeyHSD(Para, "factor(grps)", ordered = TRUE)
- Non-parametric method - Kruskal-Wallis Test
When the assumptions of ANOVA are not met (e.g. the group variances differ and the data are not normally distributed), the result provided by the F-test is not reliable. We therefore consider the Kruskal-Wallis test on equality of distributions.
H0: there is no difference between the median scores of the three chefs, μA = μB = μC
H1: there is a difference between the median scores of the three chefs.
- Calculation by hand:
# There are 18 records in total; first rank all records, split them into groups, and compute each group's average rank
rankA = rank(x)[1:6]
rankB = rank(x)[7:12]
rankC = rank(x)[13:18]
c(mean(rankA), mean(rankB), mean(rankC))
The test statistic is H = 12/(N(N+1)) * Σ ni*(R̄i - (N+1)/2)^2, where
N: total number of observations across the groups
ni: no. of observations in the i-th group
R̄i: average rank of the i-th group (the means just obtained)
After obtaining H, compute the p-value with 1-pchisq(H, 2) # df = k-1 = 2 for three groups
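The hand calculation above can be sketched in R, assuming x holds all 18 scores with the first 6 from chef A, the next 6 from B, and the last 6 from C:

```r
r    <- rank(x)                # ranks of all N observations pooled together
N    <- length(x)              # N = 18
n    <- c(6, 6, 6)             # group sizes
grp  <- rep(c("A", "B", "C"), times = n)
Rbar <- tapply(r, grp, mean)   # average rank within each group
H    <- 12 / (N * (N + 1)) * sum(n * (Rbar - (N + 1) / 2)^2)
1 - pchisq(H, df = 2)          # p-value, df = k - 1 = 2
```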
- Use R function:
kruskal.test(x~grps, data=dat)
ANCOVA table
Covariance analysis combines ANOVA and linear regression analysis
To include 1+ continuous variables (covariates) into the ANOVA model.
-Assumptions of one-way ANCOVA:
- Linearity between the covariate and the outcome variable
Check by creating a grouped scatter plot of the covariate and the outcome variable
- The outcome variable should be approximately normally distributed
Check with the Shapiro-Wilk test
- Homoscedasticity/homogeneity of residual variance for all groups: the residuals are assumed to have a constant variance
- No significant outliers in the groups
- General model in covariance analysis:
Yij = μ + τi + βiXij + εij (each treatment group i has its own slope βi)
1st testing hypothesis:
H0: β1 = β2 = … = βk (equal slopes, i.e. no treatment-by-X interaction)
v.s. H1: at least one of the interaction coefficients is not zero
# test the significance of the interaction term
anova(lm(W ~ A + B, data = df1), model.2)
# model.2 is the model with interaction term
Obtain the p-value and compare it with 5%: if > 0.05, the null hypothesis is not rejected.
If H0 cannot be rejected -> equal slopes
Consider the equal-slope model Yij = μ + τi + βXij + εij
Test the hypotheses:
H0: β = 0 (no X effect) v.s. H1: β ≠ 0
and H0: τ1 = τ2 = … = τk = 0 (no treatment effect)
v.s. H1: at least one of the τi's is not zero
If both effects are significant, consider multiple comparisons for all pairs of treatments based on the least squares means
- The mean of Y for treatment i at X = x̄ (the overall mean of X) is E(Y|X = x̄) = μ + τi + βx̄; the corresponding least squares mean plugs in the fitted estimates of μ, τi and β
- Likewise for treatment k: E(Y|X = x̄) = μ + τk + βx̄, with the least squares mean obtained from the fitted estimates
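Least squares means and their pairwise comparisons can be computed with the emmeans package; a sketch assuming an equal-slope ANCOVA model, where y, trt, x and dat are placeholder names:

```r
library(emmeans)
fit <- lm(y ~ trt + x, data = dat)  # equal-slope ANCOVA model
lsm <- emmeans(fit, ~ trt)          # LS means, evaluated at the mean of x
lsm
pairs(lsm, adjust = "tukey")        # multiple comparisons of all treatment pairs
```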
Two-Way Analysis of Variance
Two-way ANOVA: two nominal independent variables
-Two-way interaction plot
interaction.plot(x.factor = df1$B, trace.factor = df1$A,
response = df1$Days, fun = mean,
type = "b", legend = TRUE,
xlab = "B", ylab="Days",
pch=c(1,20), col = c("#00AFBB", "#E7B800"))
If there is no interaction, the difference will be the same regardless of the level of the other factor.
Plot of means- >parallel
-Estimate σ^2 based on the final model
We use MSE (Mean Square of Error) to estimate σ^2
sum(residuals(model.3)^2) /56 # 56 = degrees of freedom
Generalized Linear Model (GLM)
GLM for Binary Data - logit Model
- Fit a logistic regression model
# 1. Convert the needed class variables into factors
data4$diabetes = as.factor(data4$diabetes)
data4$age_bucket = as.factor(data4$age_bucket)
data4$preg_bucket = as.factor(data4$preg_bucket)
# 2. Set the reference level for each class variable
data4$age_bucket = relevel(data4$age_bucket, ref = "20-30")
data4$preg_bucket = relevel(data4$preg_bucket, ref = "10+")
# fit logistic regression model
fit1 <- glm(diabetes~glucose+pressure+triceps+insulin+mass+pedigree+age_bucket+preg_bucket, family = binomial("logit"), data = data4)
# binomial("logit") specifies the logit link function explicitly.
# Adding `weights = count` in glm() builds the logistic regression model from grouped data with a count per group
Plug the estimated coefficients into the logistic regression to obtain
the probability p = exp(β0 + β1x1 + ...) / (1 + exp(β0 + β1x1 + ...))
- Estimated probability
glm.probs = predict(glm.fit, newdata=data.frame(FIBRIN=2.2), type = "response", se.fit=T,interval="prediction")
# newdata takes the values to predict at; note whether the required probability is for Y = 0 or Y = 1
glm.probs
If the fit is adequate -> use this model
- Report the exponentiated regression coefficients. Interpret each exponentiated regression coefficient in terms of odds ratios.
To find the estimated odds ratio between group A and group B, use exp(coef(model)); notice that the model's reference class needs to be set to group B.
- Confidence intervals
# confint(fit3)
exp(confint(fit3)) # choose the line about A
- If the confidence interval does not include 1, the odds ratio is significantly greater than 1, e.g. children of age 3 are significantly more likely to believe in Santa than children of age 6.
- If the confidence interval includes 1, the odds ratio is not significantly different from 1, e.g. there is no significant difference between the odds of believing in Santa for children of age 4 and age 5.
- Conduct a hypothesis test to check whether there is a difference between the coefficients for age_bucket 31-40 and age_bucket 50+ levels
# Wald test for difference between coefficients for age_bucket 31-40 and age_bucket 50+ levels
# Create the contrast matrix
L <- matrix(c(0, 0, 0, 0, 1, 0, -1), nrow = 1, ncol = 7)
# the L matrix encodes the relationship between the two coefficients to be tested
library(aod)
# Perform the Wald test
wald.test(b = coef(fit2), Sigma = vcov(fit2), L = L)
- Do you think that the sample size is sufficiently large for the asymptotic theory of maximum likelihood estimation to hold in this case? Explain.
fit0 <- glm(success~1, family = "binomial", data = df)
library(lmtest)
lrtest(fit1, fit0)
library(aod)
wald.test(b = coef(fit1), Sigma = vcov(fit1), Terms = 2:12) # test the significance of the 2-12 terms
Obtain the chi-square statistics and p-values from the LRT and the Wald test, then compare them.
If the sample size is large enough, the results of the two tests should be close,
since when the sample size is small to moderate, the Wald test is the least reliable.
- Goodness-of-fit
1. Test for the global significance of this model (goodness-of-fit test). The goodness-of-fit test in a logistic regression model is based on the likelihood ratio test (or deviance).
# likelihood ratio test
# -- Full model: logit(pi) = intercept + beta1*Age3 + beta2*Age4 + beta3*Age5
# -- Null model: logit(pi) = intercept
fit0 <- glm(cbind(bYes,bNo)~1, family = "binomial", data = santa3)
# `~1` specifies the null (intercept-only) model
library(lmtest)
lrtest(fit3, fit0)
Obtain the p-value from lrtest: if p-value < 0.05, the model is significant at the 5% level of significance. Our model as a whole fits significantly better than an empty model.
2.Deviance
fit0 <- glm(diabetes~1, family = "binomial",data = data4)
summary(fit0)
summary(fit2)
# obtain the residual deviance and df from the two summaries
# the p-value comes from the differences of the deviances and the dfs
pchisq(498.1-341.42, 391-385, lower.tail = FALSE)
- estimated probabilities
According to the input variables, pick the corresponding coefficients from summary(model), plug them into the linear predictor, and compute the probability for each case.
Then we can obtain the estimated odds by the formula: odds = p/(1-p) = exp(β0 + β1x1 + ...)
GLM for Count Data - Poisson Model
Let Y be a Poisson random variable with mean μ.
The probability mass function of Poisson(μ) is P(Y = y) = e^(-μ) * μ^y / y!, y = 0, 1, 2, ...
The canonical link is the log link.
The sum of independent Poisson random variables is also Poisson.
- Fit Poisson regression model
# Read Publication Data
pub <- read.csv(file = "publications.csv", header = TRUE, sep = ",")
pub$gen <- relevel(pub$gen, ref="Male") # also set a reference class for each categorical variable first
pub$mar <- relevel(pub$mar, ref="Single")
# Fit Poisson regression model(Model 1)
fit1 <- glm(art ~ gen + mar + kid5 + phd + ment + gen:kid5,
family = "poisson", data = pub)
summary(fit1)
drop1(fit1, test = "LR")
# Refine the model, fitting a reduced model without the interaction term, as it is the most insignificant; repeat until no insignificant term can be removed -> final fitted model
- Give an interpretation for each parameter estimate in the final fitted model.
The estimated constant term β̂0 = 0.3452 is just a baseline value that does not have a specific meaning.
The coefficient for the gender effect is estimated as −0.2253, indicating that female PhD biochemists produced fewer publications than male PhD biochemists, when the other variables are kept fixed. The average ratio is estimated as exp(−0.2253) = 0.7983, i.e. about 20.2% = (1 − 0.7983)*100% fewer articles in the expected number published by female PhD biochemists in the last three years.
The coefficient for the marital effect is estimated as 0.1522, indicating that married PhD biochemists produced more publications than single PhD biochemists, when the other variables are kept fixed. The average ratio is estimated as exp(0.1522) = 1.164, i.e. about 16.4% more in the expected number of articles published by married PhD biochemists in the last three years.
- Assume the model in (b) is valid. Find a 95% confidence interval for the mean number of articles published by a female scientist who is married, graduated from a program with prestige rank 3.5, has two kids, and a mentor having 10 published articles.
predict = predict(model2, newdata=data.frame(fem=factor(1),mar=factor(1),phd=3.5,kid5=2,ment=10), type = "response", se.fit=T,interval="prediction")
predict # use the predicted expected number and its SE to compute the CI
# CI = expected number +/- 1.96 * SE
lower=1.173127-1.96*0.09010848
higher=1.173127+1.96*0.09010848
cat("95% confidence interval:", lower, "-", higher, "\n")
# given two separate predictions, the CI for their difference is: CI_diff = (expected value_1 - expected value_2) +/- 1.96 * sqrt(SE_1^2 + SE_2^2)
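The difference formula in the comment above can be sketched in R, assuming pred1 and pred2 are two results of predict(..., se.fit = TRUE):

```r
# 95% CI for the difference of two independent predicted means
diff_est <- pred1$fit - pred2$fit
diff_se  <- sqrt(pred1$se.fit^2 + pred2$se.fit^2)
c(lower = diff_est - 1.96 * diff_se,
  upper = diff_est + 1.96 * diff_se)
```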
Negative binomial regression model
In the Poisson log-linear model, the assumption is Var(Y) = E(Y).
Over-dispersion: Var(Y) > E(Y)
The negative binomial distribution is appropriate for describing the over-dispersion phenomenon.
fitting a negative binomial regression model with log-link:
# Fit Negative Binomial regression model
library(MASS) # glm.nb() comes from the MASS package
fit4 <- glm.nb(art ~ gen + mar + kid5 + phd + ment + gen:kid5,
link = log, data = pub)
summary(fit4)
- Comparing the negative binomial and the Poisson model:
- From each model's summary, compare Residual Deviance divided by its df: the model whose ratio is closer to 1 gives a more adequate fit than the other (e.g. Deviance/df = 1.800 for the Poisson model).
- The lack of fit of the Poisson model is probably due to over-dispersion, as also indicated by the estimated dispersion = 1/Theta from the summary of the negative binomial fit.
Check whether 1/Theta is close to 0; if so, the negative binomial model reduces to the Poisson model.
Classification table
- Calculate the predicted responses for all observations in the original data, then compare the predicted responses with the true data. Report the sensitivity and specificity.
# Make predictions for all observations
predictions=predict(fit2,type = "response")
# Convert predictions to binary format (0 or 1) using a threshold of 0.5
predicted_diabetes <- ifelse(predictions > 0.5, 1, 0)
# Compare predicted diabetes status with true diabetes status
table(data4$diabetes, predicted_diabetes)
The proportion of correct classifications -> accuracy
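From the classification table, sensitivity, specificity, and accuracy can be read off as follows (a sketch, assuming the factor levels are "0" and "1" with 1 = diabetic):

```r
tab <- table(truth = data4$diabetes, predicted = predicted_diabetes)
sensitivity <- tab["1", "1"] / sum(tab["1", ])  # true positives / all actual positives
specificity <- tab["0", "0"] / sum(tab["0", ])  # true negatives / all actual negatives
accuracy    <- sum(diag(tab)) / sum(tab)        # proportion classified correctly
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```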