Remove missing values from frequency distributions in ggplot

Question 1

Remove missing values from frequency distributions in ggplot

r ggplot2 rstudio frequency-distribution stackedbarseries

andrey · Jun 21, 2014 · Viewed 7.5k times · Source

Answer

Answer

The easiest place to drop them is when you set the data set for the plot

p <- ggplot(subset(ds, !is.na(attendF)), aes(x=yearF, fill=attendF))

Here i've created some sample data (which would have been helpful in the initial question) and re-ran your plotting commands after subsetting

ds<-data.frame(
    id=rep(1:100, each=4),
    yearF=factor(rep(2001:2004, 100)),
    attendF=sample(1:8, 400, T, c(.2,.2,.15,.10,.10, .20, .15, .02))
)
ds[sample(which(ds$year==2002), 5), "attendF"]<-NA
ds[sample(which(ds$year==2003), 15), "attendF"]<-NA
ds[sample(which(ds$year==2004), 40), "attendF"]<-NA

attcol8<-c("Never"="#4575b4",
           "Once or Twice"="#74add1",
           "Less than once/month"="#abd9e9",
           "About once/month"="#e0f3f8",
           "About twice/month"="#fee090",
           "About once/week"="#fdae61",
           "Several times/week"="#f46d43",
           "Everyday"="#d73027")
ds$attendF<-factor(ds$attendF, levels=1:8, labels=names(attcol8))

library(ggplot2)
p<-ggplot(subset(ds, !is.na(attendF)), aes(x=yearF, fill=attendF))
p<-p+ geom_bar(position="fill")
p<-p+ scale_fill_manual(values = attcol8,
                    name="Response category" )
p<-p+ scale_y_continuous("Prevalence: proportion of total",
                     limits=c(0, 1),
                     breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<-p+ scale_x_discrete("Waves of measurement",
                   limits=as.character(c(2001:2004)))
p<-p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

This gave the following plot

enter image description here

Question 2

My data

dsL<-readRDS("./Data/Derived/dsL.rds")
# color palette for the outcome
attcol8<-c("Never"="#4575b4",
           "Once or Twice"="#74add1",
           "Less than once/month"="#abd9e9",
           "About once/month"="#e0f3f8",
           "About twice/month"="#fee090",
           "About once/week"="#fdae61",
           "Several times/week"="#f46d43",
           "Everyday"="#d73027")

# view for one respondent
print (dsL[dsL$id==1,c("id","year","attend","attendF")])

   id year attend         attendF
1   1 1997     NA            <NA>
2   1 1998     NA            <NA>
3   1 1999     NA            <NA>
4   1 2000      1           Never
5   1 2001      6 About once/week
6   1 2002      2   Once or Twice
7   1 2003      1           Never
8   1 2004      1           Never
9   1 2005      1           Never
10  1 2006      1           Never
11  1 2007      1           Never
12  1 2008      1           Never
13  1 2009      1           Never
14  1 2010      1           Never
15  1 2011      1           Never

Creating frequency distributions for each of the measurement wave we have:

ds<- dsL
p<-ggplot(ds, aes(x=yearF, fill=attendF))
p<-p+ geom_bar(position="fill")
p<-p+ scale_fill_manual(values = attcol8,
                    name="Response category" )
p<-p+ scale_y_continuous("Prevalence: proportion of total",
                     limits=c(0, 1),
                     breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<-p+ scale_x_discrete("Waves of measurement",
                   limits=as.character(c(2000:2011)))
p<-p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

enter image description here

Missing values are used in the calculation of total responses to show the natural attrition in the study. Assumming that attrition is not significantly associated with the outcome measure, we can remove missing values from the calculation of the total of responses and look at percentages that each response was endorsed at each time point.

The question is

what can be done produce the graph i just described? and do it most efficiently? I tried na.rm=TRUE in various places, but it didn't go the trick. Any ideas?

ds<- dsL
### ???
p<-ggplot(ds, aes(x=yearF, fill=attendF))
p<-p+ geom_bar(position="fill")
p<-p+ scale_fill_manual(values = attcol8,
                    name="Response category" )
p<-p+ scale_y_continuous("Prevalence: proportion of total",
                     limits=c(0, 1),
                     breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<-p+ scale_x_discrete("Waves of measurement",
                   limits=as.character(c(2000:2011)))
p<-p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
#p

Update

After @MrFlick solution:

ds<- dsL
p<-ggplot(subset(ds, !is.na(attendF)), aes(x=yearF, fill=attendF))
p<-p+ geom_bar(position="fill")
p<-p+ scale_fill_manual(values = attcol8,
                    name="Response category" )
p<-p+ scale_y_continuous("Prevalence: proportion of total",
                     limits=c(0, 1),
                     breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<-p+ scale_x_discrete("Waves of measurement",
                   limits=as.character(c(2000:2011)))
p<-p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
#p

enter image description here

Remove missing values from frequency distributions in ggplot

The question is

Update

Answer

Related questions