Advanced Data Visualization in R

UC Berkeley Biostatistics seminar

Sara E. Moore
24 February 2015

Download

Intro

You have:

a working knowledge of R,
some familiarity with the usage of ggplot2 (such as what was presented during the 2013 or 2014 UC Berkeley SCF/D-Lab R Bootcamp),
an interest in creating data visualizations in R, both static (mostly using ggplot2) and interactive (using a variety of packages).

Why `ggplot`?

It's pretty.
Its commands are intuitive and "human-readable."
Nearly any graphic can be created, so you can use it for everything and maintain a consistent style.
It has (sort of) built-in support for maps.

Why not `ggplot`?

It's slow.
It won't do some things.
There's a steep learning curve.
~~lattice is better at trellis graphs?~~ Faceting is just as powerful.

Tidy data ^[1]

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

The (Layered) Grammar of Graphics ^{[2; 3; 4]}

Move away from using "names" and "chart typologies."
Instead, use "statements" constructed via grammar
Why?
- An infinite number of unique graphics can be created.
- The implementation is DRY ("don't repeat yourself") not WET ("write everything twice" or "we enjoy typing").

"Good grammar is just the first step in creating a good sentence."

^{Wickham, 2010}

Components of the Grammar

Specify a statistical graphic using components of statements:

Data (data),
Statistical transformations (stat: identity, count, mean, etc.),
Geometric elements/objects (geom: points, lines, etc.),
Aesthetic mappings (aes: color, shape, size, transparency, etc.),
Coordinate systems (coord: cartesian, polar, map, etc.),
Guides/Scales and transformations thereof (scale, guide, and others: axes, log-transformed scales, legends, etc.),
Faceting/conditioning/latticing/trellising (facet),
Tweaking graphical positioning and visual elements (position, theme, etc.), and
Layering.

The anatomy of a `ggplot` command

All arguments to the main, initial function call, ggplot, set graph defaults.
These defaults can be changed for an individual element (even data).

ggplot(data=, aes(x=,y=,...)) +
    geom_????(...) +
    ...

ggplot() +
    geom_????(data=, aes(x=,y=,...),...) +
    ...

The data

Kaggle March Machine Learning "Mania" 2015

source(paste("assets","load_marchmania2015.R", 
    sep=.Platform[["file.sep"]]))
# change this path to the location of your CSVs downloaded from kaggle
# note that required packages will be installed automatically
ncaa.bball = load.clean.summ.ncaa("~/Dropbox/kaggle/marchmania2015")

Alternatively, use data that ships with ggplot2:
- diamonds,
- mpg,
- movies, etc.

A simple scatterplot

library(ggplot2)
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) + 
    geom_point() + 
    xlab("Points scored by losing team") +
    ylab("Points scored by winning team") +
    ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
    theme(text = element_text(size = 16))

Improving the scatterplot

library(munsell) # color system used by ggplot2
ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) + 
    # default is bins=c(30,30) 
    # can use scalar when no. of bins for x and y are same.
    # here alternatively specifying binwidth:
    stat_binhex(binwidth=c(4, 4)) + 
    # mnsl converts [hue lightness/color purity] to hex color codes.
    # scale_fill_gradient == scale_fill_continuous.
    scale_fill_gradient("Number of games", trans = "sqrt",
        low=mnsl("7.5G 2/4"), high=mnsl("7.5G 9/6")) +
    xlab("Points scored by losing team") +
    ylab("Points scored by winning team") +
    ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
    theme_classic(base_size = 16)

Improving the scatterplot

ggplot(data=ncaa.bball[["games"]], aes(x=lscore, y=wscore)) + 
    # default is bins=c(30,30) 
    # can use scalar when no. of bins for x and y are same.
    # here alternatively specifying binwidth:
    geom_hex(stat = "binhex", binwidth=c(4,4)) + 
    # scale_fill_gradient == scale_fill_continuous.
    # the mnsl fxn converts [hue lightness/color purity] to hex color codes.
    scale_fill_gradient("Number of games", trans = "sqrt",
        low=mnsl("7.5G 2/4"), high=mnsl("7.5G 9/6")) +
    xlab("Points scored by losing team") +
    ylab("Points scored by winning team") +
    ggtitle("Final scores of NCAA basketball games\nNovember 1984 - April 2014") +
    theme_classic(base_size = 16)

When geoms transform

geom	stat	modifiable defaults
`geom_boxplot()`	`stat_boxplot()`	max length of whiskers (beyond hinges) = 1.5*IQR
`geom_bar()`	`stat_bin()`	30 bins: binwidth = [range of x]/30
`geom_histogram()`	`stat_bin()`	30 bins: binwidth = [range of x]/30
`geom_freqpoly()`	`stat_bin()`	30 bins: binwidth = [range of x]/30
`geom_dotplot()`	`stat_bindot()`	30 bins: binwidth = [range of x]/30; "dotdensity" method
`geom_bin2d()`	`stat_bin2d()`	30 bins for each of x and y
`geom_hex()`	`stat_binhex()`	30 bins for each of x and y (calls `hexbin::hexBin()`)
`geom_density2d()`	`stat_density2d()`	Gaussian kernel; bandwidths (x and y) set by Silverman's "rule of thumb"; 100 grid points for x and y (calls `MASS::kde2d()`)
`geom_density()`	`stat_density()`	Gaussian kernel; bandwidth set by Silverman's "rule of thumb" (calls `stats::density()`)
`geom_violin()`	`stat_ydensity()`	Gaussian kernel; bandwidth set by Silverman's "rule of thumb" (calls `stats::density()`); all violins have same area before trimming tails, tails are trimmed to [range of y]
`geom_smooth()`	`stat_smooth()`	if n<1000, `stats::loess()` with polynomial degree 2, \(\alpha=0.75\), etc.; else, `gam::gam()` with penalized cubic regression splines, etc.; 80 evaluation points
`geom_quantile()`	`stat_quantile()`	3 quartiles; "br" method (modified Barrodale & Roberts method; calls `quantreg::rq()`)
`geom_contour()`	`stat_contour()`	10 `pretty` breakpoints covering [range of z]

Other `ggplot` transformations

stat_ecdf: Empirical Cumulative Density Function
stat_function: Superimpose a function.
stat_qq: Calculation for quantile-quantile plot.
stat_spoke: Convert angle and radius to xend and yend.
stat_sum: Sum unique values. Useful for overplotting on scatterplots.
stat_summary: Summarise y values at every unique x.
stat_summary_hex: Apply funciton for 2D hexagonal bins.
stat_summary2d: Apply function for 2D rectangular bins.
stat_unique: Remove duplicates.
stat_identity

Dates, `tidyr`, and summaries with `ggplot2`

library(tidyr) # gather
duke.2014 = subset(ncaa.bball[["teams"]], 
        (season==2014)&(team.name=="Duke"))[,
        c("date","fg.pct","fg3.pct","ft.pct","result")] %>%
    tidyr::gather(shot.type, pct, -c(date, result))

library(scales) # date_format
ggplot(duke.2014, aes(x=date, y=pct, color=shot.type)) +
    geom_line() +
    geom_line(stat = "hline", yintercept = "mean",
        linetype="dashed", alpha=0.65) +
    geom_rug(data=subset(duke.2014,(result=="Loss")&(shot.type=="fg.pct")), 
        sides="b", color="grey20") +
    scale_x_datetime("Game date", labels = date_format("%b %Y")) +
    ylab("Proportion of shots made") +
    scale_color_discrete("Type of shot", 
        labels=c("two point field goal","three point field goal","free throw")) +
    ggtitle("Duke's per-game shot percentages, 2013-14 season:\nbottom ticks indicate losses; dashed lines are season averages") +
    theme_classic(base_size = 16) + theme(legend.position = "bottom")

A heatmap with `ggplot2::geom_tile`

library(RColorBrewer) # brewer.pal
library(grid)         # unit
tourney.teams.2014 = as.character(
    unique(subset(ncaa.bball[["team.season.summ"]],
        (season==2014)&(max.tourney.round>="Sweet Sixteen"))$team.name))
hist.perf.teams.2014 = subset(ncaa.bball[["team.season.summ"]], 
    team.name%in%tourney.teams.2014)
p1 = ggplot(hist.perf.teams.2014,
        aes(x=season, y=team.name, fill=win.pct)) +
    geom_tile() + 
    scale_fill_gradientn(
        "Proportion of regular season games won",
        colours = brewer.pal(9,"GnBu"),
        na.value="grey80", breaks=seq(0,1,0.25),
        guide = guide_colorbar(barwidth = 15, barheight = 1)) +
    scale_x_continuous("Season", expand = c(0, 0)) + 
    scale_y_discrete("Team", expand = c(0, 0)) +
    ggtitle("Historical regular season performance of\n2014 NCAA tournament 'Sweet 16' teams") +
    theme_classic(base_size=16) + 
    theme(legend.position = "bottom",
        axis.text.y = element_text(size = 11),
        plot.margin = unit(c(0,0.1,-0.4,0.1), "cm"))
p1

Creating a dendrogram with `ggdendro`

library(ggdendro)

team.season.df = subset(ncaa.bball[["team.season.summ"]],
    team.name%in%tourney.teams.2014)[,
        c("season", "team.name","win.pct")]
# use tidyr, but this time go long --> wide (spread)
team.season.mat = as.matrix(team.season.df %>% spread(team.name, win.pct))
rownames(team.season.mat) = team.season.mat[,"season"]
team.season.mat = team.season.mat[ ,-which(colnames(team.season.mat)=="season")]

teams.hc = hclust(dist(t(team.season.mat)), "ave")
ggdendrogram(teams.hc, rotate = TRUE)

Simplifying the dendrogram

teams.dendro = as.dendrogram(teams.hc)
teams.ddata = dendro_data(teams.dendro)

p2 = ggplot(segment(teams.ddata)) + 
    geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) + 
    coord_flip() + 
    theme_dendro() +
    # tweak these if the dendrogram doesn't line up:
    theme(plot.margin = unit(c(-7,0,-15,-20), "points"))
p2

Heatmap, reordered

# need to remove the extra factor levels
# and rorder according to the clustering
hist.perf.teams.2014$team.name = as.character(
    hist.perf.teams.2014$team.name)
# can do sort(unique(x)) here because they were originally 
# in alphabetical order. just be sure the order of the 
# dendrogram matches up with the new order of the heatmap.
hist.perf.teams.2014$team.name = factor(
    hist.perf.teams.2014$team.name,
    sort(unique(hist.perf.teams.2014$team.name))[
        order.dendrogram(teams.dendro)],
    ordered=TRUE)
p1 = ggplot(hist.perf.teams.2014,
        aes(x=season, y=team.name, fill=win.pct)) +
    geom_tile() + 
    scale_fill_gradientn(
        "Proportion of regular season games won",
        colours = brewer.pal(9,"GnBu")[3:9],
        na.value="grey80", breaks=seq(0,1,0.25),
        guide = guide_colorbar(barwidth = 15, barheight = 1)) +
    scale_x_continuous("Season", expand = c(0, 0)) + 
    scale_y_discrete("Team", expand = c(0, 0)) +
    ggtitle("Historical regular season performance of\n2014 NCAA tournament 'Sweet 16' teams") +
    theme_classic(base_size=16) + 
    theme(legend.position = "bottom",
        axis.text.y = element_text(size = 11),
        plot.margin = unit(c(0,0.1,-0.4,0.1), "cm"))
p1

Putting it all together

library(gtable)
g1 = gtable_add_cols(ggplotGrob(p1), unit(4,"cm"))
# may need to adjust "t" and "b" if you don't add a ggtitle:
g = gtable_add_grob(g1, ggplotGrob(p2),
    t=3, l=ncol(g1), b=4, r=ncol(g1))
grid.newpage()
grid.draw(g)

Packages that pair well with `ggplot2`

grid
gridExtra: additional functions to tweak/manipulate grid graphics
scales: additional functions to deal with the scale portions of the grammar of graphics
gtable: use to dismantle/hack underlying table of Grid Graphical Objects (grobs) that make up a ggplot
ggsubplot: embed smaller subplots within larger plots, all using ggplot2 graphics
dplyr (or plyr): manipulate data
tidyr (or reshape2): restructure data (esp. wide \(\leftrightarrow\) long)
lubridate: "makes working with dates fun instead of frustrating"
devtools: R package development tools (esp. ability to install packages from github rather than CRAN)

ggthemes, xkcd: for when you don't want your ggplot to look so much like a ggplot
extrafont, RColorBrewer, munsell, https://github.com/karthik/wesanderson : custom fonts and color schemes
ggdendro: dendrograms and trees with ggplot2
GGally: plot matrices (as in graphics::pairs), parallel coordinate plots, and network plots with ggplot2
ggtern: ternary plots (as in vcd::ternaryplot) and other additional *geom*s for ggplot2.
ggmap, maps, maptools, sp, rgdal, RgoogleMaps: maps with ggplot2
ggHorizon horizon graphs with ggplot2 (example 1, example 2)
\(\ldots\)

Where to go for help with `ggplot2`

Interactive data visualizations in R

Why interactive?

They're pretty, fun, and people love them (see Hans Rosling's TED talks).
They allow you to engage with, explore, and discover more about your data -- visually.
Static graphics are "dead" (according to The Economist).

SVG `ggplot` with `plotly`

# start with a static ggplot
p = ggplot(subset(ncaa.bball[["team.season.summ"]], 
        max.tourney.round>="Sweet Sixteen"),
    aes(x=season, y=win.pct, color=max.tourney.round)) +
    # ideally we would jitter here, but plotly has trouble with this
    # geom_point(position = position_jitter(w = 0.4, h = 0.002)) +
    geom_point(size=3, alpha=0.7) + # instead use transparency
    xlab("Year") + 
    ylab("Proportion of regular season games won") +
    scale_color_discrete("Highest tournament round achieved") +
    ggtitle("Regular season performance of 'Sweet Sixteen' teams 1985-2014, by season") +
    theme_classic(base_size = 16) + theme(legend.position = "bottom")
p

SVG `ggplot` with `plotly`

# library(devtools)
# install_github("ropensci/plotly")
library(plotly)

# get a plot.ly account and get your api key here:
# https://plot.ly/settings/api
# plug it in with your username in the statement below.
# set_credentials_file("username", "xxxxxxxxxx")

py <- plotly()
# recall p is the object returned by ggplot
# plot.info = py$ggplotly(p) # in an R session, opens in browser
plot.info = py$ggplotly(p, session="knitr") # embed in knitr document
# if you're embedding in a knitr document, 
# be sure to also set the code chunk
# plotly=TRUE

SVG graphic with `clickme`

# library(devtools)
# install_github("nachocab/clickme")
library(clickme)
cmplot = with(subset(ncaa.bball[["team.season.summ"]], 
        max.tourney.round>="Sweet Sixteen"),
    clickme("points",
        x = season, y = win.pct,
        names = team.name,
        color_groups = as.character(max.tourney.round),
        x_title = "Year", x_format = "",
        y_title = "Proportion of regular season games won",
        color_title = "Highest tournament round achieved",
        color_group_order = levels(max.tourney.round)[4:8],
        title = "Regular season performance of 'Sweet 16' teams",
        subtitle = "1985-2014, by season",
        file_path = paste(getwd(),"clickme0.html", 
            sep=.Platform$file.sep),
        height = 600, width = 700))
# cmplot # in an R session, open in browser
# embed in knitr document:
cmplot$iframe()$hide()

MotionChart with `googleVis` (Flash)

library(googleVis)
# data.frame with >=4 cols: x, y, id, time. color and size optional,
# but if you don't provide them, 
# it will choose them for you (if there are columns left to use)
mc = gvisMotionChart(
    subset(ncaa.bball[["team.season.summ"]], 
        !is.na(tourney.seed))[,c("team.name", "season", "win.pct", 
            "points.avg", "mov.avg", "tourney.seed")],
    idvar="team.name", timevar="season", 
    xvar="win.pct", yvar="tourney.seed",
    sizevar="mov.avg", colorvar="points.avg",
    options=list(width=750, height=650))
# plot(mc) # in an R session, opens in browser
print(mc, 'chart') # embed in knitr document

Many interactive options in R

plotly: ggplot2 graphics \(\rightarrow\) SVG via plot.ly's R API
googleVis: R interface to Google Charts API; SVG and Flash
rCharts: SVG graphics with popular JS libs, directly from R
ggvis: SVG and HTML5 Canvas graphics, rendered using vega, declared in a "grammar of graphics" style similar to ggplot2
gridSVG: ggplot2 and lattice graphics \(\rightarrow\) SVG image

clickme: interactive SVG graphics from R
rMaps: interactive maps with popular JS libs, directly from R
networkD3: d3.js network graphs from R (SVG)
rgl: interactive 3D visualizations using OpenGL and other frameworks/formats (example)
rggobi: R interface to GGobi, a "data visualization system" separate from R
SVGAnnotation: used for "post-processing SVG plots created in R"

Questions?

References

[1] L. Wilkinson. The Grammar of Graphics. Springer-Verlag New York, 2005. DOI: 10.1007/0-387-28695-0. .

[2] H. Wickham. ggplot2. Springer Science + Business Media, 2009. DOI: 10.1007/978-0-387-98141-3. .

[3] H. Wickham. "A Layered Grammar of Graphics". In: Journal of Computational and Graphical Statistics 19.1 (2010), pp. 3-28. DOI: 10.1198/jcgs.2009.07098. .

[4] H. Wickham. "Tidy Data". In: Journal of Statistical Software 59.10 (2014). .

Credits

Built using slidify with

knitr,
the deck.js framework,
the RefManageR bibliography manager, and
the prettify syntax highlighter.

Advanced Data Visualization in R

UC Berkeley Biostatistics seminar

Intro

Why ggplot?

Why not ggplot?

Tidy data [1]

The (Layered) Grammar of Graphics [2; 3; 4]

Components of the Grammar

The anatomy of a ggplot command

The data

A simple scatterplot

Improving the scatterplot

Improving the scatterplot

When geoms transform

Other ggplot transformations

Dates, tidyr, and summaries with ggplot2

A heatmap with ggplot2::geom_tile

Creating a dendrogram with ggdendro

Simplifying the dendrogram

Heatmap, reordered

Putting it all together

Packages that pair well with ggplot2

Where to go for help with ggplot2

Interactive data visualizations in R

Why interactive?

SVG ggplot with plotly

SVG ggplot with plotly

SVG graphic with clickme

MotionChart with googleVis (Flash)

Many interactive options in R

Questions?

References

Credits

Built using slidify with

Why `ggplot`?

Why not `ggplot`?

Tidy data ^[1]

The (Layered) Grammar of Graphics ^{[2; 3; 4]}

The anatomy of a `ggplot` command

Other `ggplot` transformations

Dates, `tidyr`, and summaries with `ggplot2`

A heatmap with `ggplot2::geom_tile`

Creating a dendrogram with `ggdendro`

Packages that pair well with `ggplot2`

Where to go for help with `ggplot2`

SVG `ggplot` with `plotly`

SVG `ggplot` with `plotly`

SVG graphic with `clickme`

MotionChart with `googleVis` (Flash)