R Programming/Binomial Models
In this section, we look at the binomial model. We have one outcome which is binary and a set of explanatory variables.
This kind of model can be analyzed using a linear probability model. However a drawback of this model for the parameter of the Bernoulli distribution is that, unless restrictions are placed on , the estimated coefficients can imply probabilities outside the unit interval . For this reason, models such as the logit model or the probit model are more commonly used. If you want to estimate a linear probability model, have a look at the linear models page.
Logit model
[edit | edit source]The model takes the form : with the inverse link function : . It can be estimated using maximum likelihood or using bayesian methods.
Fake data simulations
[edit | edit source]> x <- 1 + rnorm(1000,1)
> xbeta <- -1 + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> df <- data.frame(y,x)
Maximum likelihood estimation
[edit | edit source]- The standard way to estimate a logit model is glm() function with family binomial and link logit.
- lrm() (Design) is another implementation of the logistic regression model.
- There is an implementation in the Zelig package[1].
In this example, we simulate a model with one continuous predictor and estimate this model using the glm() function.
> res <- glm(y ~ x , family = binomial(link=logit))
> summary(res) # results
> confint(res) # confindence intervals
> names(res)
> exp(res$coefficients) # odds ratio
> exp(confint(res)) # Confidence intervals for odds ratio (delta method)
> predict(res) # prediction on a linear scale
> predict(res, type = "response") # predicted probabilities
> plot(x, predict(res, type = "response")) # plot the predicted probabilities
Zelig
[edit | edit source]The Zelig' package makes it easy to compute all the quantities of interest.
We develop a new example. First we simulate a new dataset with two continuous explanatory variables and we estimate the model using zelig() with the model = "logit" option.
- We the look at the predicted values of y at the mean of x1 and x2
- Then we look at the predicted values when x1 = 0 and x2 = 0
- We also look at what happens when x1 changes from the 3rd to the 1st quartile.
> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1 + x1 + x2
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
>
> z.out <- zelig(y ~ x1 + x2, model = "logit", data = mydat) # estimating the model
> summary(z.out)
> x.out <- setx(z.out, x1 = mean(x1), x2 = mean(x2)) # setting values for the explanatory variables
> s.out <- sim(z.out, x = x.out) # simulating the quantities of interest
> summary(s.out)
> plot(s.out) # plot the quantities of interest
> # the same with other values
> x.out <- setx(z.out, x1 = 0, x2 = 0)
> s.out <- sim(z.out, x = x.out)
> summary(s.out)
> # What happens if x1 change from the 3rd quartile to the 1st quartile ?
> x.high <- setx(z.out, x1 = quantile(mydat$x1,.75), x2 = mean(mydat$x2))
> x.low <- setx(z.out, x1 = quantile(mydat$x1,.25), x2 = mean(x2))
> s.out2<-sim(z.out, x=x.high, x1=x.low)
> plot(s.out2)
- ROC Curve in the verification package.
- Zelig has a rocplot() function.
Bayesian estimation
[edit | edit source]- bayesglm() in the arm package
- MCMClogit() in the MCMCpack for a bayesian estimation of the logit model.
> # Data generating process
> x <- 1 + rnorm(1000,1)
> xbeta <- -1 + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
>
> library(MCMCpack)
> res <- MCMClogit(y ~ x)
> summary(res)
> library("arm")
> res <- bayesglm(y ~ x, family = binomial(link=logit))
> summary(res)
Probit model
[edit | edit source]The probit model is a binary model in which we assume that the link function is the cumulative density function of a normal distribution.
We simulate fake data. First, we draw two random variables x1 and x2 in any distributions (this does not matter). Then we create the vector xbeta as a linear combination of x1 and x2. We apply the link function to that vector and we draw the binary variable y as Bernouilli random variable.
> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1 + x1 + x2
> proba <- pnorm(xbeta)
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
Maximum likelihood
[edit | edit source]We can use the glm() function with family=binomial(link=probit) option or the probit() function in the sampleSelection package which is a wrapper of the former one.
> res <- glm(y ~ x1 + x2 , family = binomial(link=probit), data = mydat)
> summary(res)
>
> library("sampleSelection")
> probit(y ~ x1 + x2, data = mydat)
> summary(res)
Bayesian estimation
[edit | edit source]- MCMCprobit() (MCMCpack)
> library("MCMCpack")
> post <- MCMCprobit(y ~ x1 + x2 , data = mydat)
> summary(post)
> plot(post)
See Also
[edit | edit source]- There is an example of a probit model with R on the UCLA statistical computing website[2].
Semi-Parametric models
[edit | edit source]- Klein and Spady estimator[3] is implemented in the np package[4] (see npindex() with method = "kleinspady" option).
References
[edit | edit source]- ↑ Kosuke Imai, Gary King, and Oliva Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig
- ↑ UCLA statistical computing probit example http://www.ats.ucla.edu/stat/R/dae/probit.htm
- ↑ Klein, R. W. and R. H. Spady (1993), “An efficient semiparametric estimator for binary response models,” Econometrica, 61, 387-421.
- ↑ Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL http://www.jstatsoft.org/v27/i05/.