Advanced Data Visualization in R

UC Berkeley Biostatistics seminar

Sara E. Moore
24 February 2015

Download

Intro

You have:

Why ggplot?

Why not ggplot?

Tidy data [1]

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

The (Layered) Grammar of Graphics [2; 3; 4]

"Good grammar is just the first step in creating a good sentence."

Wickham, 2010

Components of the Grammar

Specify a statistical graphic using components of statements:

The anatomy of a ggplot command

ggplot(data=, aes(x=,y=,...)) +
    geom_????(...) +
    ...
ggplot() +
    geom_????(data=, aes(x=,y=,...),...) +
    ...

The data

source(paste("assets","load_marchmania2015.R", 
    sep=.Platform[["file.sep"]]))
# change this path to the location of your CSVs downloaded from kaggle
# note that required packages will be installed automatically
ncaa.bball = load.clean.summ.ncaa("~/Dropbox/kaggle/marchmania2015")

A simple scatterplot

library(ggplot2)
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) + 
    geom_point() + 
    xlab("Points scored by losing team") +
    ylab("Points scored by winning team") +
    ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
    theme(text = element_text(size = 16))

Improving the scatterplot

library(munsell) # color system used by ggplot2
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) + 
    # default is bins=c(30,30) 
    # can use scalar when no. of bins for x and y are same.
    # here alternatively specifying binwidth:
    stat_binhex(binwidth=c(4, 4)) + 
    # mnsl converts [hue lightness/color purity] to hex color codes.
    # scale_fill_gradient == scale_fill_continuous.
    scale_fill_gradient("Number of games", trans = "sqrt",
        low=mnsl("7.5G 2/4"), high=mnsl("7.5G 9/6")) +
    xlab("Points scored by losing team") +
    ylab("Points scored by winning team") +
    ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
    theme_classic(base_size = 16)

Improving the scatterplot

ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) + 
    # default is bins=c(30,30) 
    # can use scalar when no. of bins for x and y are same.
    # here alternatively specifying binwidth:
    geom_hex(stat = "binhex", binwidth=c(4,4)) + 
    # scale_fill_gradient == scale_fill_continuous.
    # the mnsl fxn converts [hue lightness/color purity] to hex color codes.
    scale_fill_gradient("Number of games", trans = "sqrt",
        low=mnsl("7.5G 2/4"), high=mnsl("7.5G 9/6")) +
    xlab("Points scored by losing team") +
    ylab("Points scored by winning team") +
    ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
    theme_classic(base_size = 16)

When geoms transform

geom stat modifiable defaults
geom_boxplot() stat_boxplot() max length of whiskers (beyond hinges) = 1.5*IQR
geom_bar() stat_bin() 30 bins: binwidth = [range of x]/30
geom_histogram() stat_bin() 30 bins: binwidth = [range of x]/30
geom_freqpoly() stat_bin() 30 bins: binwidth = [range of x]/30
geom_dotplot() stat_bindot() 30 bins: binwidth = [range of x]/30; "dotdensity" method
geom_bin2d() stat_bin2d() 30 bins for each of x and y
geom_hex() stat_binhex() 30 bins for each of x and y (calls hexbin::hexBin())
geom_density2d() stat_density2d() Gaussian kernel; bandwidths (x and y) set by Silverman's "rule of thumb"; 100 grid points for x and y (calls MASS::kde2d())
geom_density() stat_density() Gaussian kernel; bandwidth set by Silverman's "rule of thumb" (calls stats::density())
geom_violin() stat_ydensity() Gaussian kernel; bandwidth set by Silverman's "rule of thumb" (calls stats::density()); all violins have same area before trimming tails, tails are trimmed to [range of y]
geom_smooth() stat_smooth() if n<1000, stats::loess() with polynomial degree 2, \(\alpha=0.75\), etc.; else, gam::gam() with penalized cubic regression splines, etc.; 80 evaluation points
geom_quantile() stat_quantile() 3 quartiles; "br" method (modified Barrodale & Roberts method; calls quantreg::rq())
geom_contour() stat_contour() 10 pretty breakpoints covering [range of z]

Other ggplot transformations

Dates, tidyr, and summaries with ggplot2

library(tidyr) # gather
duke.2014 = subset(ncaa.bball[["teams"]], 
        (season==2014)&(team.name=="Duke"))[,
        c("date","fg.pct","fg3.pct","ft.pct","result")] %>%
    tidyr::gather(shot.type, pct, -c(date, result))

library(scales) # date_format
ggplot(duke.2014, aes(x=date, y=pct, color=shot.type)) +
    geom_line() +
    geom_line(stat = "hline", yintercept = "mean",
        linetype="dashed", alpha=0.65) +
    geom_rug(data=subset(duke.2014,(result=="Loss")&(shot.type=="fg.pct")), 
        sides="b", color="grey20") +
    scale_x_datetime("Game date", labels = date_format("%b %Y")) +
    ylab("Proportion of shots made") +
    scale_color_discrete("Type of shot", 
        labels=c("two point field goal","three point field goal","free throw")) +
    ggtitle("Duke's per-game shot percentages, 2013-14 season:\nbottom ticks indicate losses; dashed lines are season averages") +
    theme_classic(base_size = 16) + theme(legend.position = "bottom")

A heatmap with ggplot2::geom_tile

library(RColorBrewer) # brewer.pal
library(grid)         # unit
tourney.teams.2014 = as.character(
    unique(subset(ncaa.bball[["team.season.summ"]],
        (season==2014)&(max.tourney.round>="Sweet Sixteen"))$team.name))
hist.perf.teams.2014 = subset(ncaa.bball[["team.season.summ"]], 
    team.name%in%tourney.teams.2014)
p1 = ggplot(hist.perf.teams.2014,
        aes(x=season, y=team.name, fill=win.pct)) +
    geom_tile() + 
    scale_fill_gradientn(
        "Proportion of regular season games won",
        colours = brewer.pal(9,"GnBu"),
        na.value="grey80", breaks=seq(0,1,0.25),
        guide = guide_colorbar(barwidth = 15, barheight = 1)) +
    scale_x_continuous("Season", expand = c(0, 0)) + 
    scale_y_discrete("Team", expand = c(0, 0)) +
    ggtitle("Historical regular season performance of\n2014 NCAA tournament 'Sweet 16' teams") +
    theme_classic(base_size=16) + 
    theme(legend.position = "bottom",
        axis.text.y = element_text(size = 11),
        plot.margin = unit(c(0,0.1,-0.4,0.1), "cm"))
p1

Creating a dendrogram with ggdendro

library(ggdendro)

team.season.df = subset(ncaa.bball[["team.season.summ"]],
    team.name%in%tourney.teams.2014)[,
        c("season", "team.name","win.pct")]
# use tidyr, but this time go long --> wide (spread)
team.season.mat = as.matrix(team.season.df %>% spread(team.name, win.pct))
rownames(team.season.mat) = team.season.mat[,"season"]
team.season.mat = team.season.mat[ ,-which(colnames(team.season.mat)=="season")]

teams.hc = hclust(dist(t(team.season.mat)), "ave")
ggdendrogram(teams.hc, rotate = TRUE)

Simplifying the dendrogram

teams.dendro = as.dendrogram(teams.hc)
teams.ddata = dendro_data(teams.dendro)

p2 = ggplot(segment(teams.ddata)) + 
    geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) + 
    coord_flip() + 
    theme_dendro() +
    # tweak these if the dendrogram doesn't line up:
    theme(plot.margin = unit(c(-7,0,-15,-20), "points"))
p2

Heatmap, reordered

# need to remove the extra factor levels
# and rorder according to the clustering
hist.perf.teams.2014$team.name = as.character(
    hist.perf.teams.2014$team.name)
# can do sort(unique(x)) here because they were originally 
# in alphabetical order. just be sure the order of the 
# dendrogram matches up with the new order of the heatmap.
hist.perf.teams.2014$team.name = factor(
    hist.perf.teams.2014$team.name,
    sort(unique(hist.perf.teams.2014$team.name))[
        order.dendrogram(teams.dendro)],
    ordered=TRUE)
p1 = ggplot(hist.perf.teams.2014,
        aes(x=season, y=team.name, fill=win.pct)) +
    geom_tile() + 
    scale_fill_gradientn(
        "Proportion of regular season games won",
        colours = brewer.pal(9,"GnBu")[3:9],
        na.value="grey80", breaks=seq(0,1,0.25),
        guide = guide_colorbar(barwidth = 15, barheight = 1)) +
    scale_x_continuous("Season", expand = c(0, 0)) + 
    scale_y_discrete("Team", expand = c(0, 0)) +
    ggtitle("Historical regular season performance of\n2014 NCAA tournament 'Sweet 16' teams") +
    theme_classic(base_size=16) + 
    theme(legend.position = "bottom",
        axis.text.y = element_text(size = 11),
        plot.margin = unit(c(0,0.1,-0.4,0.1), "cm"))
p1

Putting it all together

library(gtable)
g1 = gtable_add_cols(ggplotGrob(p1), unit(4,"cm"))
# may need to adjust "t" and "b" if you don't add a ggtitle:
g = gtable_add_grob(g1, ggplotGrob(p2),
    t=3, l=ncol(g1), b=4, r=ncol(g1))
grid.newpage()
grid.draw(g)

Packages that pair well with ggplot2

Where to go for help with ggplot2

Interactive data visualizations in R

Why interactive?

SVG ggplot with plotly

# start with a static ggplot
p = ggplot(subset(ncaa.bball[["team.season.summ"]], 
        max.tourney.round>="Sweet Sixteen"),
    aes(x=season, y=win.pct, color=max.tourney.round)) +
    # ideally we would jitter here, but plotly has trouble with this
    # geom_point(position = position_jitter(w = 0.4, h = 0.002)) +
    geom_point(size=3, alpha=0.7) + # instead use transparency
    xlab("Year") + 
    ylab("Proportion of regular season games won") +
    scale_color_discrete("Highest tournament round achieved") +
    ggtitle("Regular season performance of 'Sweet Sixteen' teams 1985-2014, by season") +
    theme_classic(base_size = 16) + theme(legend.position = "bottom")
p

SVG ggplot with plotly

# library(devtools)
# install_github("ropensci/plotly")
library(plotly)

# get a plot.ly account and get your api key here:
# https://plot.ly/settings/api
# plug it in with your username in the statement below.
# set_credentials_file("username", "xxxxxxxxxx")

py <- plotly()
# recall p is the object returned by ggplot
# plot.info = py$ggplotly(p) # in an R session, opens in browser
plot.info = py$ggplotly(p, session="knitr") # embed in knitr document
# if you're embedding in a knitr document, 
# be sure to also set the code chunk
# plotly=TRUE

SVG graphic with clickme

# library(devtools)
# install_github("nachocab/clickme")
library(clickme)
cmplot = with(subset(ncaa.bball[["team.season.summ"]], 
        max.tourney.round>="Sweet Sixteen"),
    clickme("points",
        x = season, y = win.pct,
        names = team.name,
        color_groups = as.character(max.tourney.round),
        x_title = "Year", x_format = "",
        y_title = "Proportion of regular season games won",
        color_title = "Highest tournament round achieved",
        color_group_order = levels(max.tourney.round)[4:8],
        title = "Regular season performance of 'Sweet 16' teams",
        subtitle = "1985-2014, by season",
        file_path = paste(getwd(),"clickme0.html", 
            sep=.Platform$file.sep),
        height = 600, width = 700))
# cmplot # in an R session, open in browser
# embed in knitr document:
cmplot$iframe()$hide()

MotionChart with googleVis (Flash)

library(googleVis)
# data.frame with >=4 cols: x, y, id, time. color and size optional,
# but if you don't provide them, 
# it will choose them for you (if there are columns left to use)
mc = gvisMotionChart(
    subset(ncaa.bball[["team.season.summ"]], 
        !is.na(tourney.seed))[,c("team.name", "season", "win.pct", 
            "points.avg", "mov.avg", "tourney.seed")],
    idvar="team.name", timevar="season", 
    xvar="win.pct", yvar="tourney.seed",
    sizevar="mov.avg", colorvar="points.avg",
    options=list(width=750, height=650))
# plot(mc) # in an R session, opens in browser
print(mc, 'chart') # embed in knitr document

Many interactive options in R

Questions?

References

[1] L. Wilkinson. The Grammar of Graphics. Springer-Verlag New York, 2005. DOI: 10.1007/0-387-28695-0. .

[2] H. Wickham. ggplot2. Springer Science + Business Media, 2009. DOI: 10.1007/978-0-387-98141-3. .

[3] H. Wickham. "A Layered Grammar of Graphics". In: Journal of Computational and Graphical Statistics 19.1 (2010), pp. 3-28. DOI: 10.1198/jcgs.2009.07098. .

[4] H. Wickham. "Tidy Data". In: Journal of Statistical Software 59.10 (2014). .

Credits

Built using slidify with

/

#