By the end of this you will have had a whirlwind tour of the very tip of the data-visualization best-practices iceberg. We will cover a broad range of topics generally applicable to data science use cases without diving too deep into any single one. One thing to keep in mind throughout: none of this is set in stone. In the real world you will often have to bend or break some of these rules to do what you want.
“The best camera is the one that’s with you.” –Chase Jarvis.
A very common question for people starting out with R and visualization is "which library should I use?" Like most things, there is no right answer; every situation is different. These are a few points to keep in mind when deciding which tool to use (hint: it really doesn't matter).
Jeff Leek has a fantastic article on his blog about this issue.
Ultimately it comes down to what you know. You can do an absolutely amazing amount in most tools (even excel) so do what you like best.
For most people in R the choice is ggplot2 vs. base graphics. I mostly use ggplot2 because it is what I am most familiar with, and it has nice defaults (more on this later).
Whatever you choose will, in the not too distant future, be old and replaced by the next best thing, so understanding the underlying concepts is a much better investment of your time. The rest of this section tries to reinforce good concepts.
A lot of data visualization is common sense, but some of it isn’t. These are a few of the examples of charts made that are not the best fit for the data that I frequently see.
Okay, let's get the elephant out of the room first. The pie chart elicits from a data-viz person much the same response that a computer scientist's prediction algorithm elicits from a statistician: initial cries of blasphemy, but sometimes, upon closer inspection, grudging respect.
# a simple pie chart
data <- data.frame(
  val  = c(8, 6, 9, 4, 2, 3.5),
  labs = c("a", "b", "c", "d", "e", "f")
)
pie(data$val, labels = data$labs)
So why all the ire?
Humans have a very hard time judging angles, and an angle is exactly how a pie chart encodes each value. From the code and chart above we know that d and f are 0.5 apart (f is only 87.5% of the value of d), but at first glance the average viewer would probably say they are the same.
We could instead use something called a treemap:
library(treemap)
treemap(data, index = "labs", vSize = "val")
This is similar in spirit to a pie chart, but encodes values in physical area rather than angle.
I would argue this particular treemap is actually worse than the pie chart, but treemaps are certainly a good option for some types of data. If you have a large number of values or hierarchically clustered data, treemaps can be an excellent tool for taking in a lot of data quickly.
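To illustrate the hierarchical case, here is a sketch with made-up grouped data (the group column and its values are invented for illustration; the treemap function nests rectangles by each level listed in index):

library(treemap)

# hypothetical hierarchical data: six leaf values nested under two groups
hier <- data.frame(
  group = c("x", "x", "x", "y", "y", "y"),
  labs  = c("a", "b", "c", "d", "e", "f"),
  val   = c(8, 6, 9, 4, 2, 3.5)
)

# index lists the hierarchy from outermost to innermost level
treemap(hier, index = c("group", "labs"), vSize = "val")

Each group becomes an outer rectangle subdivided by its leaf values, which is where treemaps start to beat pie charts.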
Even simpler, you could use a stacked bar chart.
library(ggplot2)
ggplot(data, aes(x = 1, y = val, fill = labs)) +
  geom_col(width = 0.2)
The same concept as the treemap: value is encoded in area rather than angle. This works well for a small number of comparisons with a logical ordering, or as a supplementary figure in a larger visualization.
Out of all of these options, a plain bar chart is probably the clearest.
ggplot(data, aes(y = val, x = labs)) +
geom_col() +
  labs(x = "")
Using a bar chart we can clearly see that f is smaller than d.
There was a paper a few years ago by two superstars of data visualization, Jeffrey Heer and Mike Bostock. They took a set of visual encodings of the same data (much like we are doing here), showed them to people, and asked questions about what the data said. They then recorded the results and plotted the differences in accuracy between encodings.
Pie charts score pretty far down the list, but then again so do treemaps. If you repeated the experiment with a different dataset, I am betting you would get different results depending on which chart type fits the data best. This raises an important question:
Is all the hate warranted for pie charts?
Penn postdoc Randal Olson has a good blog post on pie charts. It is highly recommended reading, but to paraphrase his rules on pie charts:
People intuitively get pie charts so don’t rule out their use entirely, but make sure you are using them properly.
Bar charts are fantastic tools. More often than not they are the best visualization for the job, out-competing more complicated, flashy visualizations in terms of ease of reading and comprehension. There are some instances where they are not appropriate, however.
As a general rule of thumb, if the measure is a quantity of something then a bar chart makes sense: number of infections, a person's weight, etc. A heuristic I like when deciding whether to use a bar chart is: "could I redraw the chart such that the bars are made up of individual instances of whatever the y-axis is encoding?"
Let's look at a group of students and their percentiles for vitamin D levels in their blood.
First we plot with a bar plot.
data <- data.frame(
  student    = c("Tina", "Trish", "Kevin", "Rebecca", "Sarah"),
  percentile = c(25, 95, 54, 70, 99)  # percentile of vitamin D levels
)
p <- ggplot(data, aes(x = student, y = percentile))
p + geom_col()
The ordering of the data is clearly visible, but the intuitive interpretation of the bars is slightly confusing. A percentile is not a sum of values, simply a position on a continuous scale. In addition, we tend to read long bars as good and short bars as bad, when in this case the middle would be best.
Let’s re-visualize the data as a dot-plot.
p +
geom_point(color = "steelblue", size = 4) +
theme_minimal() + # helps make the grid lines look more like guides
coord_flip()
This is more legible and intuitive: the measure is simply the point where each student falls, not an accumulation of percentiles.
There are exceptions to this rule. For instance, a single person's weight over time might be best shown on a line chart. Like almost everything in visualization, it is important to think carefully about what your data are before plotting them.
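As a sketch of that exception, here is a line chart of made-up weight measurements (the months and values are invented for illustration):

library(ggplot2)

# hypothetical monthly weight measurements for one person
weights <- data.frame(
  month = 1:6,
  kg    = c(82, 81.5, 80.8, 80.1, 79.6, 79.9)
)

# a line emphasizes the trend over time better than separate bars would
ggplot(weights, aes(x = month, y = kg)) +
  geom_line() +
  geom_point()

Here the quantity is a repeated measurement of one thing, so connecting the points tells the story.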
What if the thing we’re interested in is showing multiple observations of a given value? For instance, say we were looking at the expression of a protein of interest across different conditions. Like good scientists we took multiple measurements of each condition so we should represent that.
I will forgive you for wanting to use these plots purely because they have the coolest name. However: don't.
To illustrate why let’s look at the dynamite plot of our hypothetical expression experiment…
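The expressions and expression_summarized data frames are never defined in the text, so here is one hypothetical construction (the condition names, measurement values, and column names are all invented for illustration) that makes the following snippets runnable:

# hypothetical raw data: three replicate measurements per condition,
# with similar means but very different spreads
expressions <- data.frame(
  sample     = rep(c("cond_a", "cond_b", "cond_c", "cond_d"), each = 3),
  expression = c(10, 11, 12,   9, 14, 10,   2, 11, 20,   8, 11, 14)
)

# per-condition mean and standard deviation for the summary plots
avg <- tapply(expressions$expression, expressions$sample, mean)
dev <- tapply(expressions$expression, expressions$sample, sd)
expression_summarized <- data.frame(
  sample  = names(avg),
  average = as.numeric(avg),
  sd      = as.numeric(dev)
)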
p <- ggplot(expression_summarized, aes(x = sample, y = average)) +
  # a zero-height errorbar draws only the horizontal cap at average + sd
  geom_errorbar(aes(ymin = average + sd, ymax = average + sd)) +
  # the vertical whisker from the bar top up to the cap
  geom_linerange(aes(ymin = average, ymax = average + sd))
p + geom_col()
These all look pretty similar! Must be nothing really interesting going on. Or maybe the first two conditions have a higher peak? Do the bars represent variability? And does the top of the error bar represent the bottom of an interval, or the middle? I can never remember (because it changes from plot to plot).
Let's actually look at the data that went into these plots. Usually with plots like these the number of data points is tiny, so we may as well just show them.
p +
geom_col(alpha = 0.5) +
geom_jitter(data = expressions, aes(x = sample, y = expression), width = 0.1)
Oh, oh my, those aren’t the same at all. Let’s just clean this up a tiny bit.
ggplot(expression_summarized,aes(x = sample, y = average)) +
geom_pointrange(aes(ymin = average - sd, ymax = average + sd),
shape = 1, color = 'steelblue') +
geom_jitter(data = expressions, aes(x = sample, y = expression),
width = 0.1, alpha = 0.5)
If you want more evidence to push back against your PI with than the word of some random biostats grad student, the blog post "Dynamite plots must die" by Rafael Irizarry, chair of biostatistics at Harvard, has much more in-depth coverage of why dynamite plots are bad.
Box plots are, like the pie chart, one of the first visualization techniques we are taught. However, they are not always a good choice, and many better options have since arisen.
The problem with box plots, much like dynamite plots, is that they obscure trends at a resolution finer than the quartiles. Take for instance the following two box plots:
#Hiding the data input on purpose...
p <- ggplot(data, aes(dataset, val))
p +
geom_boxplot(fill = "steelblue", color = "grey") +
labs(title = "Box Plots")
Given the information that a standard box plot provides, we would say these groups are identical.
What happens if we try another way of visualizing the distribution of data?
First let’s try a(nother form of the) dot plot:
p +
geom_dotplot(binaxis = "y", stackdir = "center",
fill = "steelblue", color = "steelblue", dotsize = 0.7) +
labs(title = "Dot Plots")
Now we can see that these data are very differently distributed.
Another way to visualize the distribution of the groups is a violin plot, essentially a kernel-density version of the dot plot. It is useful when the data are so large that a dot plot becomes cluttered by the sheer number of dots drawn. However, if your data are small enough that you can actually visualize each point, do that.
p +
geom_violin(adjust = .5, fill = "steelblue", color = "steelblue") +
labs(title = "Violin Plots")
If you still want the familiarity of the box plot combined with the enhanced ability to see the underlying distribution you can combine the two plots as well.
p +
geom_dotplot(binaxis = "y", stackdir = "center",
fill = "steelblue", color = "steelblue") +
  geom_boxplot(alpha = 0, size = 1)