Drawing Box Plots

The box plot is useful for comparing the quartiles and variation of quantitative variables. In a box plot, the lower and upper ends of the box (the hinges) are the first quartile (Q1) and third quartile (Q3), and the horizontal line inside the box is the median (Q2). The whiskers extend from the box to the most extreme observations that lie within 1.5 * IQR of the hinges, where the inter-quartile range is IQR = Q3 - Q1. Observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers and drawn as individual points beyond the whiskers.
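
As a quick numerical check of these definitions, the sketch below applies base R's fivenum() and boxplot.stats() to a small made-up vector (the values are illustrative, not taken from the data set). Note that boxplot() works with hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Illustrative values only (hypothetical, not the dengue data)
x <- c(50, 129, 252, 581, 1635, 2734, 9039)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
iqr  <- five[4] - five[2]

# Fences: observations beyond these limits are flagged as outliers
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr

# boxplot.stats() applies the same rule that boxplot() uses for drawing
s <- boxplot.stats(x)
s$stats  # whisker ends, hinges and median
s$out    # points drawn individually beyond the whiskers
```

Here the largest value lies above the upper fence, so it is reported in s$out and the upper whisker stops at the largest remaining value.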

Drawing box plots using R

To understand the different visualization cases, we will use the monthly dengue counts in Sri Lanka for 2017. The data set is stored in Google Drive as a Google Sheet.

library(gsheet)
df=gsheet2tbl('docs.google.com/spreadsheets/d/11JspFL9bz6jEiaPw7ih0cllfVla4VYFyeSyH-EG0J2g')
data=df[,2:13]
head(data)
## # A tibble: 6 x 12
##     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2734  1900  2467  2570  3333  5372  7471  3620  1251   823  1131  1602
## 2  1635  1087  1870  2072  3168  4901  9039  3553  1246   779  1078  1219
## 3   581   448   836   739   946  1248  2612  1477   663   337   528   546
## 4   252   207   369   443   862  2314  3855  2354  1090   821   884   957
## 5   129   103   145   120   165   436   766   534   167   177   214   215
## 6    50    32    42    37    57    94   294   159    39    27    22    39
boxplot(data, ylab ="Dengue Counts", xlab ="Month")

According to the figure, the counts increase from May to July and then decrease. Overall, the dengue counts are high from May to September compared with the other months. This period is the South-West monsoon, which runs from mid May to September in Sri Lanka.
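
This trend can also be checked numerically. The sketch below rebuilds only the six rows printed by head(data), so the numbers will differ from those for the full sheet, and computes the median count for each month. The names first_rows, monthly_median and peak_month are introduced here for illustration.

```r
# Only the six rows shown by head(data); the full sheet has more rows
first_rows <- data.frame(
  Jan = c(2734, 1635, 581, 252, 129, 50),
  Feb = c(1900, 1087, 448, 207, 103, 32),
  Mar = c(2467, 1870, 836, 369, 145, 42),
  Apr = c(2570, 2072, 739, 443, 120, 37),
  May = c(3333, 3168, 946, 862, 165, 57),
  Jun = c(5372, 4901, 1248, 2314, 436, 94),
  Jul = c(7471, 9039, 2612, 3855, 766, 294),
  Aug = c(3620, 3553, 1477, 2354, 534, 159),
  Sep = c(1251, 1246, 663, 1090, 167, 39),
  Oct = c(823, 779, 337, 821, 177, 27),
  Nov = c(1131, 1078, 528, 884, 214, 22),
  Dec = c(1602, 1219, 546, 957, 215, 39)
)

# Median count per month (the horizontal line inside each box)
monthly_median <- sapply(first_rows, median)
monthly_median

# Month with the highest median among these six rows
peak_month <- names(which.max(monthly_median))
peak_month
```
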
Now, if you need the variable names drawn vertically instead of horizontally, use the las option. We also add a legend with legend(), using pch = 20 for the point symbol, and reduce its font size with the cex option. Note that the default cex value is 1.

x.labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
boxplot(data, ylab ="Dengue Counts", xlab ="Month", las = 2)
legend("topright", legend = x.labels, pch = 20, cex = 0.8)

If you want to change the variable names use the option names:

boxplot(data, ylab ="Dengue Counts", xlab ="Month", las = 2, names = c("January","February","March","April","May","June","July","August","September","October","November","December"))

If the names are too long to fit in the plot window, increase the bottom margin by calling par(mar = ...) before drawing the plot:

# Set the margins (bottom, left, top, right) before plotting
par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, ylab ="Dengue Counts", xlab ="Month", las = 2, names = c("January","February","March","April","May","June","July","August","September","October","November","December"))
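
Because par() changes the graphics settings for every later plot, it can help to save the old settings and restore them afterwards. A minimal sketch, using simulated counts (hypothetical data, not the dengue sheet) and the built-in month.name vector for the labels:

```r
# Simulated monthly counts (hypothetical), 8 rows x 12 months
set.seed(1)
sim <- as.data.frame(matrix(rpois(8 * 12, lambda = 500), nrow = 8))
names(sim) <- month.name

# par() returns the previous settings, so they can be restored later
old_par <- par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(sim, ylab = "Count", las = 2)
par(old_par)  # back to the margins that were in effect before
```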

Dengue counts vary seasonally. Four seasonal periods are identified in Sri Lanka on the basis of the two monsoons and the two transitional periods: the North-East monsoon from December to February, the first inter-monsoon from March to mid May, the South-West monsoon from mid May to September, and the second inter-monsoon from October to November.
Now, I will place January and February in the first group, March to May in the second, June to September in the third, October and November in the fourth, and December separately. To do this, specify the position of each box along the x axis with the at option. The first two boxes go at x = 1 and x = 2; position 3 is left empty as a gap, the next group starts at x = 4, and so on.

par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, ylab ="Dengue Counts", xlab ="Month", las = 2, at =c(1,2, 4,5,6, 8,9,10,11, 13,14, 16), names = c("January","February","March","April","May","June","July","August","September","October","November","December"))
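
The at positions can also be computed from the group sizes rather than typed by hand. The sketch below is one way to do it; the names group_sizes, group and at_pos are hypothetical.

```r
# Number of months in each group: Jan-Feb, Mar-May, Jun-Sep, Oct-Nov, Dec
group_sizes <- c(2, 3, 4, 2, 1)

# Group index of each of the 12 months
group <- rep(seq_along(group_sizes), times = group_sizes)

# Shift every box by one extra unit per preceding group to leave the gaps
at_pos <- seq_along(group) + (group - 1)
at_pos
```

The result can then be passed to boxplot() as at = at_pos.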

Now, we add colours to box plots by using the option col, and specify a vector with the colour numbers or the colour names. You can find the colour numbers here, and the colour names here.

par(mar = c(12, 5, 4, 2) + 0.1)
boxplot(data, ylab ="Dengue Counts", xlab ="Month", las = 2,
col =c("red","sienna","palevioletred1","royalblue2","red","sienna","palevioletred1",
"royalblue2","red","sienna","palevioletred1","royalblue2"),
at =c(1,2, 4,5,6, 8,9,10,11, 13,14, 16), names = c("January","February","March","April","May","June","July","August","September","October","November","December"))
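
The colours above repeat in a fixed cycle of four. If you would rather colour the boxes by group, one option is to repeat each colour name once per month in its group. The sketch below assumes the five groups defined earlier; the fifth colour, "darkgreen", is an arbitrary choice, and colors() is used to check that every name is a built-in R colour.

```r
# One colour per group: Jan-Feb, Mar-May, Jun-Sep, Oct-Nov, Dec
group_cols  <- c("red", "sienna", "palevioletred1", "royalblue2", "darkgreen")
group_sizes <- c(2, 3, 4, 2, 1)

# Repeat each group colour once per month in that group
box_cols <- rep(group_cols, times = group_sizes)
box_cols

# Check that every name is among R's built-in colour names
all(box_cols %in% colors())
```

The vector can then be passed to boxplot() as col = box_cols.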

Drawing box plots using plotly package

Using the plotly package, we can draw publication-quality graphs. More details on drawing box plots with this package are given in this website.

Now we use monthly dengue counts in Colombo district from year 2010 to 2017 to draw box plots.

library(gsheet)
df=gsheet2tbl('docs.google.com/spreadsheets/d/1KsgRDiGPitkW3l8WAOzijBXfFlPIx8a2RCm538nCLik')
df=df[,1:4]
suppressMessages(library(dplyr))
head(df)
## # A tibble: 6 x 4
##    year month district count
##   <dbl> <chr> <chr>    <dbl>
## 1  2010 Jan   Colombo    584
## 2  2010 Feb   Colombo    606
## 3  2010 Mar   Colombo    294
## 4  2010 Apr   Colombo    224
## 5  2010 May   Colombo    296
## 6  2010 Jun   Colombo    700
# To apply the code below, your data should be in the same format.
library(plotly)
month <- c("Jan", "Feb", "Mar", "Apr","May", "Jun", "Jul", "Aug","Sep", "Oct", "Nov", "Dec")
# Order the x axis from January to December by passing xform to layout()
xform <- list(categoryorder = "array",
              categoryarray = month)
p1 <- plot_ly(df, y = ~count, color = ~month, type = "box") %>%
  layout(title='Monthly Dengue incidence cases from year 2010 to 2017 in Colombo District', xaxis = xform, yaxis = list(title = "Dengue Count"))
p1
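
The plot_ly() call above expects one row per year and month. If your data are in wide format instead (one column per month, as in the 2017 data used earlier), they need reshaping first. A minimal sketch with base R's reshape(), using a toy table: the 2010 values are the first three counts printed above, while the 2011 values are made up.

```r
# Toy wide table; the 2011 row is hypothetical
wide <- data.frame(year = c(2010, 2011),
                   Jan  = c(584, 400),
                   Feb  = c(606, 350),
                   Mar  = c(294, 300))
months <- c("Jan", "Feb", "Mar")

# One row per year-month combination
long <- reshape(wide, direction = "long",
                varying = months, v.names = "count",
                timevar = "month", times = months, idvar = "year")
long$district <- "Colombo"
rownames(long) <- NULL

# Same column order as the data used in this section
long <- long[, c("year", "month", "district", "count")]
long
```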

Now we draw box plots of the year-wise dengue counts in Colombo district from 2010 to 2017.

# Convert the year variable to a factor so that each year gets its own box
df$year=as.factor(df$year)
suppressMessages(library(dplyr))
library(plotly)
p2 <- plot_ly(df, y = ~count, color = ~year, type = "box") %>%
  layout(title='Dengue incidence cases from year 2010 to 2017 in Colombo District')
p2

Drawing box plots using ggplot2 package

The ggplot2 package also gives publication-quality graphics. Here, we have to provide three fundamental parts of the plot:
Plot = Data + Aesthetics + Geometry.
A very good tutorial for drawing box plots using this package is given in this website. Another one that I usually refer to is STHDA.
The top 50 ggplot2 visualizations are shown here.
Again, we use the monthly dengue counts in Colombo district from 2010 to 2017 to draw box plots.

library(ggplot2)
theme_set(theme_bw())
# Convert month to a factor ordered from January to December,
# using the built-in abbreviations in month.abb ("Jan", "Feb", ...)
df$month <- factor(df$month, levels = month.abb)
p3=ggplot(df, aes(x = month, y = count, color=month)) +
    geom_boxplot() + 
    labs(title="Monthly Dengue counts for year 2010-2017",
         subtitle="Colombo district, Sri Lanka", 
       y="Dengue Count", 
       x="Month", 
       caption="Source: Epidemiology Unit, Ministry of Health, Sri Lanka") + 
  theme(axis.text.x = element_text(angle=90, vjust=0.2))
p3

# You can rotate the box plot
p3 + coord_flip()
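
The reason for storing month as a factor with explicit levels is that character values are otherwise sorted alphabetically. A small base R illustration, using the built-in month.abb vector and a made-up vector m:

```r
# Hypothetical month labels in arbitrary order
m <- c("Mar", "Jan", "Feb", "Apr", "Jan", "Mar")

# As plain characters, the sorted order is alphabetical
sort(unique(m))

# As a factor with explicit levels, calendar order is kept
m_factor <- factor(m, levels = month.abb)
levels(droplevels(m_factor))
```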

Drawing box plot by showing all observations

To view all observations together with the box plot, add either a dot plot with geom_dotplot() or jittered points with geom_jitter(), as below:

library(ggplot2)
p3=ggplot(df, aes(x = month, y = count, color=month)) +
    geom_boxplot() + 
    labs(title="Monthly Dengue counts for year 2010-2017",
         subtitle="Colombo district, Sri Lanka", 
       y="Dengue Count", 
       x="Month", 
       caption="Source: Epidemiology Unit, Ministry of Health, Sri Lanka") +  
  theme(axis.text.x = element_text(angle=65, vjust=0.6))
# Box plot with dot plot
p3 + geom_dotplot(binaxis='y', stackdir='center', dotsize=0.5)

# Box plot with jittered points
# 0.2 : degree of jitter in x direction
p3 + geom_jitter(shape=16, position=position_jitter(0.2))

This plot helps us to judge whether the sample size is sufficient. Here, the total dengue count for each month from 2010 to 2017 is shown, so only 8 observations (n = 8) are available for each month. Since some outliers are present for March to June, August and December, a natural question is whether the variability shown is inherent to the data or a result of the small sample size (n = 8).
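
One rough way to explore this question is by simulation: draw many samples of size 8 from a distribution with no genuine outliers and count how often the 1.5 * IQR rule still flags a point. The sketch below assumes normally distributed data, which is a simplification and not a model of the dengue counts; has_outlier and prop_flagged are hypothetical names.

```r
set.seed(2017)

# TRUE if the boxplot rule flags at least one point in a sample of size n
has_outlier <- function(n) {
  length(boxplot.stats(rnorm(n))$out) > 0
}

# Proportion of simulated samples of size 8 with at least one flagged point
prop_flagged <- mean(replicate(2000, has_outlier(8)))
prop_flagged
```

If this proportion is far from zero, a flagged point in a sample of 8 should be read with some caution.
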
Note that we can combine a box plot with a beeswarm plot using the beeswarm package to optimize the locations of the points. The beeswarm plot is a one-dimensional scatter plot which displays individual measurements as points.

library(beeswarm)
boxplot(count ~ month, data = df, las = 2, main = "Boxplot with beeswarm for Dengue counts of Colombo district from 2010-2017",
       outline = FALSE)
beeswarm(count ~ month, data = df,
    main = "Beeswarm of Dengue counts versus Month", add = TRUE,
    pwcol = as.numeric(year), pch = 16)
# One legend colour per year, matching the colour indices used in pwcol
legend("topright", legend = levels(df$year),
       title = "Year", pch = 20, cex=0.5, col = seq_along(levels(df$year)))

The plot shows how the dengue counts in Colombo district vary from year to year within each month.
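
The point colours come from converting the year factor to integer codes, which index the current colour palette. A small sketch of that mapping, with a made-up year factor that mirrors df$year:

```r
# Hypothetical year factor mirroring df$year (two observations per year)
year <- factor(rep(2010:2017, each = 2))

# Integer code of each observation, used as its colour index
col_index <- as.numeric(year)

# One legend colour per level, in the same order as levels(year)
legend_cols <- seq_along(levels(year))
data.frame(year = levels(year), colour = palette()[legend_cols])
```

For the legend to match the points, it needs as many colours as there are years.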

Violin plot to combine a box plot with a density plot

For density estimation, a sufficient number of observations should be available to obtain reliable estimates.

suppressMessages(library(ggplot2))
require(gridExtra)
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
p.violin <- ggplot(df, aes(x = month, y = count, color=month, fill=month)) +
    # add horizontal lines for Quartiles at Q1, Q2, and Q3
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(title="Monthly Dengue counts for year 2010-2017",
         subtitle="Colombo district, Sri Lanka", 
       y="Dengue Count", 
       x="Month", 
       caption="Source: Epidemiology Unit, Ministry of Health, Sri Lanka") +  
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

p.violin

Of the three lines shown inside each violin, the middle line marks the median (Q2), so about 50% of the data lie above it and 50% below; the lower and upper lines mark Q1 and Q3. Note that the violin plot stretches up to the outliers for the months from March to August. These outliers are easier to identify from the box plot.
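
To see what a violin is built from, the sketch below computes a kernel density estimate and the sample quartiles for a single month, using eight made-up counts (hypothetical values). With so few observations the shape of the curve depends heavily on the bandwidth, which is why a sufficient number of observations matters.

```r
# Eight made-up counts for one month (hypothetical)
counts <- c(224, 296, 310, 405, 480, 584, 606, 1500)

# Kernel density estimate (Gaussian kernel, default bandwidth)
dens <- density(counts)

# Sample quartiles Q1, Q2 and Q3
q <- quantile(counts, probs = c(0.25, 0.5, 0.75))
q

# The estimated density extends beyond the observed range of the data
range(dens$x)
range(counts)
```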