Data Visualization

In R, data visualizations are commonly constructed and refined with the ggplot2 package.

Data visualization: choose the best plot for your data

For visualizing proportions. The most common way to visualize proportions is a pie chart, but pie charts are not the most intuitive way to read proportions, because the values are encoded as angles of a circle, which are hard to compare by eye. Here we explore some alternatives.

Waffle charts are useful when comparing proportions for a single variable. They are easier to interpret, as one square of the waffle can be interpreted as a single part of a whole.
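As a minimal sketch of the idea, a waffle chart can be drawn with plain ggplot2 by laying out 100 tiles on a grid, so that one square stands for 1% of the whole. The counts and the names `counts`, `waffle_df` and `p_waffle` are made up for illustration. Dedicated packages for waffle charts also exist, but this sketch does not rely on them.

```r
library(ggplot2)

# Hypothetical counts, already scaled to sum to 100 (one tile = 1%).
counts <- c(A = 50, B = 30, C = 20)

# Lay out 100 tiles on a 10 x 10 grid and assign each tile to a category.
waffle_df <- expand.grid(x = 1:10, y = 1:10)
waffle_df$category <- rep(names(counts), times = counts)

p_waffle <- ggplot(waffle_df, aes(x = x, y = y, fill = category)) +
  geom_tile(colour = "white") +  # white borders separate the squares
  coord_equal() +                # keep the tiles square
  theme_void()                   # axes carry no meaning here

# print(p_waffle) renders the chart
```

If your raw counts do not sum to 100, they would need to be rescaled and rounded first, and the rounding can shift a category by a tile.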

Stacked bar charts are useful when comparing proportions for multiple variables. Note that a single bar for a single variable has the same interpretability issues as a pie chart.
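The following sketch shows one way to build a proportional stacked bar chart in ggplot2. The `survey` data frame and its values are invented for illustration. `position = "fill"` rescales every bar to the same height so the segments can be read as shares.

```r
library(ggplot2)

# Hypothetical survey counts: three answers in each of three groups.
survey <- data.frame(
  group  = rep(c("G1", "G2", "G3"), each = 3),
  answer = rep(c("Yes", "No", "Unsure"), times = 3),
  n      = c(40, 35, 25, 55, 30, 15, 20, 50, 30)
)

p_stacked <- ggplot(survey, aes(x = group, y = n, fill = answer)) +
  geom_col(position = "fill") +  # each bar is rescaled to sum to 1
  scale_y_continuous(labels = function(v) paste0(v * 100, "%")) +
  labs(x = NULL, y = "Share of responses")

# print(p_stacked) renders the chart
```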

For visualizing point data. The most common way to visualize point data is with bars or columns. These are only appropriate for data with some sort of cumulative property (such as counts or totals), and the bar shape obscures the location of the actual data points, which is often informative. Here we explore some alternatives.

Point charts are useful as they allow you to show the location of the data.
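A minimal sketch of a point chart, using simulated data (the `measurements` data frame is hypothetical). Each observation is drawn as a point, and the group mean is overlaid as a larger red point. The `fun` argument of `stat_summary()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)
# Simulated measurements for two hypothetical conditions.
measurements <- data.frame(
  condition = rep(c("control", "treated"), each = 15),
  value     = c(rnorm(15, mean = 10, sd = 2), rnorm(15, mean = 13, sd = 2))
)

p_points <- ggplot(measurements, aes(x = condition, y = value)) +
  # a little horizontal jitter reduces overplotting; values are not moved vertically
  geom_point(position = position_jitter(width = 0.1, height = 0), alpha = 0.6) +
  # overlay the group mean so the summary and the raw data are both visible
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3)

# print(p_points) renders the chart
```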

For visualizing distributions. The most common way to visualize a single distribution is a histogram. While histograms are generally useful, they have some disadvantages: the choice of bin size is up to you and can significantly change the apparent shape of the data, and they are only informative with large quantities of data.

Here is a diagram to help you choose whether or not to use a histogram, and what bin size to use.
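To see how much the bin size matters, the sketch below draws the same simulated sample with a wide and a narrow bin width. It also computes a bin width with the Freedman-Diaconis rule, which is one common data-driven starting point and not necessarily the rule the diagram recommends. The data and all object names are hypothetical.

```r
library(ggplot2)

set.seed(42)
# Simulated bimodal sample.
sample_df <- data.frame(value = c(rnorm(200, mean = 0), rnorm(200, mean = 4)))

# Wide bins can hide the two modes; narrow bins can make the shape look noisy.
p_wide   <- ggplot(sample_df, aes(x = value)) + geom_histogram(binwidth = 2)
p_narrow <- ggplot(sample_df, aes(x = value)) + geom_histogram(binwidth = 0.25)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3).
fd_width <- 2 * IQR(sample_df$value) / nrow(sample_df)^(1 / 3)
p_fd <- ggplot(sample_df, aes(x = value)) + geom_histogram(binwidth = fd_width)

# print(p_wide), print(p_narrow) and print(p_fd) render the three versions
```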

An alternative to using a histogram for a single distribution is a kernel density estimate (KDE) plot. The caveat is that the width of the kernel (its bandwidth, or SD for a Gaussian kernel) is up to you, so a poorly chosen width can misrepresent the density of your data. To guard against this, you can add a marginal rug plot, which marks each individual observation along the axis so the smoothed curve can be checked against the actual data.
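A minimal sketch of a kernel density plot with a marginal rug, on simulated data (`kde_df` is hypothetical). In `geom_density()`, `adjust` multiplies the default bandwidth, so values below 1 give a wigglier curve and values above 1 a smoother one.

```r
library(ggplot2)

set.seed(42)
# Simulated bimodal sample.
kde_df <- data.frame(value = c(rnorm(60, mean = 0), rnorm(60, mean = 4)))

p_kde <- ggplot(kde_df, aes(x = value)) +
  geom_density(adjust = 1, fill = "grey80") +  # try adjust = 0.5 or adjust = 2
  geom_rug(alpha = 0.4)                        # one tick per observation

# print(p_kde) renders the chart
```

Comparing the curve with the rug at a few values of `adjust` gives a quick check on whether the smoothing hides or invents structure.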

When the objective is to compare different distributions, the most commonly used plot is the box plot. The downside of box plots is that, like bar plots, they can obscure your real data points. One way around this is to make the box plots transparent and add a geom_jitter layer to show the individual data points.
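The transparent box plot with jittered points could look roughly like this. The sketch uses R's built-in `iris` data set purely as a stand-in for your own data.

```r
library(ggplot2)

p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  # semi-transparent boxes; outliers are hidden here because the jitter
  # layer below already draws every observation
  geom_boxplot(alpha = 0.3, outlier.shape = NA) +
  # horizontal jitter only, so the plotted values stay accurate
  geom_jitter(width = 0.15, height = 0, alpha = 0.6)

# print(p_box) renders the chart
```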

There are some alternatives to box plots when comparing data distributions. Violin plots are similar to box plots, but they show the density of the data instead of a box. This provides extra information about the shape of the distribution. Beeswarm plots are like violin plots, but instead of showing the shape of the data density, they show the data points themselves.
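A short sketch of both alternatives, again using the built-in `iris` data as a stand-in. Violin plots are part of ggplot2. The beeswarm layer assumes the separate ggbeeswarm package is installed, so that part is guarded and skipped otherwise.

```r
library(ggplot2)

# Violin plot: the outline shows the estimated density of each group.
p_violin <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin()

# Beeswarm plot: points are offset sideways so that none overlap.
# Assumes the ggbeeswarm package is available.
if (requireNamespace("ggbeeswarm", quietly = TRUE)) {
  p_bee <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
    ggbeeswarm::geom_beeswarm(cex = 1.5)  # cex controls point spacing
}

# print(p_violin) and print(p_bee) render the charts
```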

Finally, when the goal is to compare data distributions across time, for example in a longitudinal data set, a good option is a density ridge plot.
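As a rough sketch, a density ridge plot can be drawn with `geom_density_ridges()` from the ggridges package, which is assumed to be installed here (the plotting step is skipped otherwise). The longitudinal data are simulated, and `long_df`, `week` and `score` are hypothetical names.

```r
library(ggplot2)

set.seed(7)
weeks <- paste("Week", 1:4)

# Simulated scores measured at four time points, drifting upwards.
long_df <- data.frame(
  week  = factor(rep(weeks, each = 100), levels = weeks),
  score = rnorm(400, mean = rep(c(10, 11, 13, 14), each = 100), sd = 2)
)

# One density curve per time point, stacked along the y axis.
if (requireNamespace("ggridges", quietly = TRUE)) {
  p_ridge <- ggplot(long_df, aes(x = score, y = week)) +
    ggridges::geom_density_ridges(scale = 1.2, alpha = 0.6)
}

# print(p_ridge) renders the chart
```

The `scale` value controls how much neighbouring ridges overlap. Values near 1 keep them just touching.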

