ggplot2: Logistic Regression - plot probabilities and regression line

vincentqu · Jun 9, 2013

I have a data.frame containing a continuous predictor and a dichotomous response variable.

> head(df)
  position response
1        0        1
2        3        1
3       -4        0
4       -1        0
5       -2        1
6        0        0

I can easily fit a logistic regression with the glm() function; no problems up to this point.
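For reference, a minimal sketch of such a fit. The full df is not shown in the question, so the data below are simulated; the seed, the 20 observations per position, and the true slope of 0.5 are assumptions made purely for illustration.

```r
# Simulated stand-in for the question's data (assumption, not the original df):
# 11 positions from -5 to 5, each with 20 binary responses.
set.seed(1)
df <- data.frame(position = rep(-5:5, each = 20))
df$response <- rbinom(nrow(df), size = 1, prob = plogis(0.5 * df$position))

# Logistic regression of the 0/1 response on the continuous predictor.
fit <- glm(response ~ position, data = df, family = binomial)
coef(fit)
```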

Next, I want to create a plot with ggplot that contains both the empirical probabilities for each of the 11 predictor values and the fitted regression line.

I went ahead and computed the probabilities with cast() and saved them in another data.frame:

> probs
   position   prob
1        -5 0.0500
2        -4 0.0000
3        -3 0.0000
4        -2 0.2000
5        -1 0.1500
6         0 0.3684
7         1 0.4500
8         2 0.6500
9         3 0.7500
10        4 0.8500
11        5 1.0000

I plotted the probabilities:

p <- ggplot(probs, aes(x=position, y=prob)) + geom_point()

But when I try to add the fitted regression line

p <- p + stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)

it returns a warning: "non-integer #successes in a binomial glm!". I know that in order to plot the stat_smooth "correctly", I'd have to call it on the original df data with the dichotomous variable. However, if I use the df data in ggplot(), I see no way to plot the probabilities.
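The warning can be reproduced outside ggplot2 with the probs values listed above. This is only a sketch of where the message comes from: a binomial glm() expects 0/1 outcomes (or proportions accompanied by weights giving the number of trials), so fractional values without weights trigger it. The tryCatch() wrapper is just one way to capture the message for inspection.

```r
# The empirical probabilities from the question, retyped by hand.
probs <- data.frame(
  position = -5:5,
  prob = c(0.05, 0, 0, 0.2, 0.15, 0.3684, 0.45, 0.65, 0.75, 0.85, 1)
)

# Fit a binomial glm directly to the proportions and capture the first warning.
msg <- tryCatch(
  {
    glm(prob ~ position, data = probs, family = binomial)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
msg
```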

How can I combine the probabilities and the regression line in one plot, in the way it's meant to be in ggplot2, i.e. without getting any warning or error messages?

Answer

Beasterfield · Jun 9, 2013

There are basically three solutions:

Merging the data.frames

The easiest option, once you have your data in two separate data.frames, is to merge them by position:

mydf <- merge( mydf, probs, by="position")

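Then you can call ggplot on this data.frame without warnings, as long as the smoother is fitted to the 0/1 response column rather than to the fractional prob column:

ggplot( mydf, aes(x=position, y=prob)) +
  geom_point() +
  # fit the logistic curve to the binary response, not to prob
  geom_smooth(aes(y = response), method = "glm",
    method.args = list(family = "binomial"),
    se = FALSE)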

Avoiding the creation of two data.frames

In the future, you can avoid creating two separate data.frames that have to be merged later. Personally, I like to use the plyr package for that:

library(plyr)
mydf <- ddply( mydf, "position", mutate, prob = mean(response)  )
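If plyr is not at hand, base R can add the same prob column. The following is a minimal sketch, assuming a toy mydf with made-up values (the six rows are hypothetical, not the original data); ave() returns the per-position mean of response, repeated for every row of its group.

```r
# Hypothetical toy data with the same two columns as in the question.
mydf <- data.frame(
  position = c(0, 0, 3, 3, -4, -4),
  response = c(1, 0, 1, 1, 0, 0)
)

# Group mean of response by position, aligned with the original rows.
mydf$prob <- ave(mydf$response, mydf$position, FUN = mean)
mydf
```

Unlike ddply(), this keeps the original row order, which may or may not matter for the plot.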

Edit: Use different data for each layer

I forgot to mention that each layer can use its own data.frame, which is a strong advantage of ggplot2:

ggplot( probs, aes(x=position, y=prob)) +
  geom_point() +
  geom_smooth(data = mydf, aes(x = position, y = response),
    method = "glm", method.args = list(family = "binomial"), 
    se = FALSE)

As an additional hint: avoid using df as a variable name, since assigning to it masks the built-in function stats::df (the density of the F distribution) and makes code and error messages harder to read.
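A small sketch of what that masking looks like in practice. The three-row data.frame is made up for illustration; the point is only that the bare name refers to the data while the function stays reachable.

```r
# Assigning a data.frame to the name df masks stats::df in the workspace.
df <- data.frame(position = c(0, 3, -4), response = c(1, 1, 0))

is.data.frame(df)        # the bare name now refers to the data
is.function(stats::df)   # the F density is still reachable via its namespace

# In call position R skips non-function objects, so this still finds stats::df.
d <- df(1, df1 = 2, df2 = 3)

# Removing the data makes the bare name resolve to the function again.
rm(df)
is.function(df)
```

So nothing breaks outright, but the same name meaning two things is easy to trip over, which is reason enough to prefer a name like mydf.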