Preamble

The purpose of today’s section is to discuss one of the more powerful tools in our data analysis toolkit: visualization. You will learn why we plot data in the first place, how to choose and interpret some common statistical graphics, and how to spot misleading ones.

# uncomment the next line to install the packages
# install.packages(c("cowplot", "tidyr", "ggthemes"))
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
theme_set(theme_classic())

Why visualize data?

Data can be complex! But when we show data using graphical properties, such as size, shape or color, we are engaging in a form of visual communication that can help us (and others) understand large amounts of information very efficiently. Moreover, a good visualization can reveal information about the data that would be difficult to perceive otherwise.

To illustrate why visualization should almost always be the first step in an effective data analysis, let’s take a look at a dataset created by Francis Anscombe in 1973 for an article that set out to attack the impression among statisticians that “numerical calculations are exact, but graphs are rough.”

A pathological example: Anscombe’s quartet

First, let’s take a look at the structure of the dataset using the str() function.

str(anscombe)
## 'data.frame':    11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...

The data consist of four sets of 11 numeric (x, y) observations, stored as eight variables named x1–x4 and y1–y4.

Next, we are going to do some minor data wrangling to make it easier to compute summary statistics using our dplyr skills.

# no need to worry about the details of what this code is doing:
# we are extracting the variable type (x vs. y) and the number (1-4)
# and separating them into two columns
ansc_df <- anscombe %>% 
  gather(key = name, value = value) %>% 
  mutate(number = str_extract(name, pattern = "[:digit:]"),
         variable = str_extract(name, pattern = "[:alpha:]")) 

Ok, now that we have cleaned up our data, let’s start with something reasonable and “summarize” the Anscombe dataset by calculating (a) the number of observations, (b) a measure of central tendency (the mean), and (c) a measure of variability (the standard deviation) for all 8 X and Y variables.

# use the summarise() function to compute the mean and standard deviation 
# for each set of observations
summaries <- ansc_df %>% 
  group_by(variable, number, name) %>%  
  summarise(n_obs = n(),     # n observations
            m = mean(value), # central tendency
            sd = sd(value))  # variability 
# print the summary table
summaries %>%
  ungroup() %>% 
  select(name:sd)
## # A tibble: 8 x 4
##    name n_obs        m       sd
##   <chr> <int>    <dbl>    <dbl>
## 1    x1    11 9.000000 3.316625
## 2    x2    11 9.000000 3.316625
## 3    x3    11 9.000000 3.316625
## 4    x4    11 9.000000 3.316625
## 5    y1    11 7.500909 2.031568
## 6    y2    11 7.500909 2.031657
## 7    y3    11 7.500000 2.030424
## 8    y4    11 7.500909 2.030579

Discussion: What do we see in this table? Do we think these sets of X/Y observations are different from each other?

But, before we go tell anybody about our finding, let’s visualize the data.

a <- anscombe %>% 
  ggplot(., aes(x = x1, y = y1)) +
  geom_smooth(se = F, method = "lm", fullrange = T, color = "darkgrey") +
  geom_point() +
  xlim(0, 20) +
  ylim(0,15)

b <- anscombe %>% 
  ggplot(., aes(x = x2, y = y2)) +
  geom_smooth(se = F, method = "lm", fullrange = T, color = "darkgrey") +
  geom_point() +
  xlim(0, 20) +
  ylim(0,15)

c <- anscombe %>% 
  ggplot(., aes(x = x3, y = y3)) +
  geom_smooth(se = F, method = "lm", fullrange = T, color = "darkgrey") +
  geom_point() +
  xlim(0, 20) +
  ylim(0,15) 

d <- anscombe %>% 
  ggplot(., aes(x = x4, y = y4)) +
  geom_smooth(se = F, method = "lm", fullrange = T, color = "darkgrey") +
  geom_point() +
  xlim(0, 20) +
  ylim(0,15) 

cowplot::plot_grid(a, b, c, d)

Whoa! What’s going on here? All four datasets are identical when examined using simple summary statistics, but vary considerably when plotted!

[Note: unlike the plots we’ve been looking at in class for SS, where points were spread out along the x-axis just to reduce overlap between points, here the position along the x-axis represents an actual value in the data.]

[Note 2: all four sets of (x,y) pairs have the exact same correlation coefficient and formula for the linear regression line (which we will learn about next week!)]
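We can check Note 2 directly in R. The Pearson correlation for each (x, y) pair comes out essentially identical (about 0.816 for all four):

# compute the correlation for each of the four (x, y) pairs
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                            anscombe[[paste0("y", i)]]))

All four correlations round to 0.816 — yet another summary statistic that hides the differences the plots reveal.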

The key takeaway from Anscombe’s quartet is that without visualization we would have arrived at the wrong conclusion about these data. So it is always a good idea to plot your data before diving into statistical analysis.

Ok, I hope you’re convinced that plotting is a good idea, but how do you know which plot to make? And how do we interpret other people’s plots? In Part 2, we will get some practice creating and interpreting some of the more common statistical graphics that you will encounter in the wild.

Choosing effective plots (with some tips on interpretation)

[High-level point: it can be easy to get caught up in the code, but the details of the ggplot2 package are less important than thinking critically about how your visualization decisions can change the interpretation of the plot.]

We are going to walk through several useful plots for you to have in your visualization toolkit. While we discuss plots, try to keep in mind the following general principles of effective data visualization:

Also, the two questions that I ask myself before I create any data visualization are:

Plotting distributions

# get the data from the ISIwithR package
singers_heights <- ISIwithR::SingerHeights

# look at the structure of the data
singers_heights %>% head() %>% knitr::kable()
part      height
--------  -------
soprano        64
soprano        62
soprano        66
soprano        65
soprano        60
soprano        61

Discussion: What information is in this dataset? Or what type of variables are we working with? What are the relevant comparisons that we might want to make?

Histogram: Often the first plot you want to make; one-dimensional; shows the shape of a distribution by binning a continuous variable and mapping the count of observations in each bin to the height of a colored bar.

Here’s how to plot the one-dimensional histogram of the “height” variable:

ggplot(singers_heights, aes(x = height)) +
  geom_histogram(bins = 30, color= "black", fill = "grey")

What do we like? What’s missing here?

# But, the above histogram doesn't show all of the information/structure in
# the data -- namely, that there are four different groups of singers:
# soprano, alto, tenor, bass. So let's replot the data as a "grouped
# histogram", using the `fill` aesthetic to encourage visual comparison of
# the different types of singers

ggplot(singers_heights, aes(x = height, fill = part)) +
  geom_histogram(bins = 15, position="dodge", color = "black")

One key decision when creating a histogram is the bin width, which geom_histogram() lets you control indirectly via the bins argument (or directly via binwidth).

Let’s play with the bins argument in geom_histogram to see what happens to the plot.

ggplot(singers_heights, aes(x = height, fill = part)) +
  geom_histogram(bins = 5, position = "dodge", color = "black") # fewer, wider bins

What’s going on here? How would this decision change your interpretation of the plot?

From the documentation of geom_histogram(), “binwidth: the default is to use bins bins that cover the range of the data. You should always override this value, exploring multiple widths to find the best to illustrate the stories in your data”

Another good way to visualize distributions (and one that we have seen more often in this course) is the dot plot. One advantage of the dot plot is that each dot maps onto a single observation in the dataset, making it easy to get a sense of how many observations are in each group while still showing the central tendency, variability, and shape of the distributions.

ggplot(singers_heights, aes(x = height, fill = part)) +
  geom_dotplot(position = "dodge") +
  scale_y_continuous(name = "", breaks = NULL)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Summarising and comparing multiple groups

Showing distributions is often the best first step, but we also want to help the reader compare values across multiple groups. This is where visualizing the summary statistics that we have been learning in this course can be really useful.

First, let’s compute the mean and a confidence interval for each group.

mean_and_ci <- singers_heights %>% 
  group_by(part) %>%    # "do the following summary functions for each level of the part variable"
  summarise(n = n(),
            mean = mean(height),
            ci_lower = mean - qt(0.975, df = n-1) * sd(height)/sqrt(n),
            ci_upper = mean + qt(0.975, df = n-1) * sd(height)/sqrt(n)) 
mean_and_ci
## # A tibble: 4 x 5
##      part     n     mean ci_lower ci_upper
##     <chr> <int>    <dbl>    <dbl>    <dbl>
## 1    alto    35 64.88571 63.92572 65.84571
## 2    bass    39 70.71795 69.95247 71.48343
## 3 soprano    36 64.25000 63.61636 64.88364
## 4   tenor    20 69.15000 67.64471 70.65529

Your turn. Sketch a plot! With your partner or in a small group, try to make a plot that shows these statistics while encouraging comparisons across the four groups of singers.

Here’s how I would approach this problem in R. I would probably start with a bar plot where the height of the bar maps to the mean of each group and the length of the error bars map to the width of the confidence interval.

ggplot(data = mean_and_ci, aes(x = part, y = mean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) 

Ok, let’s discuss! What do we like? What do we not like? And how would you change it?

Let’s try to improve our information to ink ratio by changing the bars to points and removing the horizontal bars from the confidence interval.

ggplot(data = mean_and_ci, aes(x = part, y = mean)) +
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper)) 

This looks a little cleaner, but how could we make it even more informative? What’s missing?

We can make this plot even more informative by:

  • expanding the range of the y-axis
  • adding horizontal dashed lines at 5 feet and 6 feet, so the reader has useful baselines (since we have removed zero)
  • adding clearer axis labels that include the units for the y-axis
  • mapping the number of observations in each group to the size of the point

ggplot(data = mean_and_ci, aes(x = part, y = mean)) +
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper, size = n)) +
  scale_size_continuous(range = c(0.3, 0.9)) +
  ylim(55, 75) +
  geom_hline(yintercept = 60, linetype = "dashed") +
  geom_hline(yintercept = 72, linetype = "dashed") +
  annotate(geom = "text", x = 0.65, y = 58, label = "5 feet \n tall") +
  annotate(geom = "text", x = 0.65, y = 70, label = "6 feet \n tall") +
  labs(x = "Type of singer", y = "Avg. height (inches)") 

Now we can easily compare the means and confidence intervals across the four different types of singers. But what information have we lost by only showing means and confidence intervals?

[Note: While I really like this visualization, it doesn’t communicate any information about the shapes of the distributions that we saw when we plotted the histograms.]

If we wanted to communicate information about the distribution, summary statistics, and confidence intervals, the boxplot is a pretty useful tool. Since this plot has such a good information to ink ratio, let’s spend a little time understanding how to interpret it.

[TA note: this is also a good opportunity to highlight some stats concepts that they have learned: means/medians, CIs, standard deviation, IQR.]

# add a height column so mean_and_ci shares the raw data's aesthetic mapping
mean_and_ci <- mean_and_ci %>% mutate(height = mean)

ggplot(data = singers_heights, aes(x = reorder(part, height), y = height)) +
  geom_boxplot(width = 0.3) +
  ylim(55, 80) +
  geom_hline(yintercept = 60, linetype = "dashed") +
  geom_hline(yintercept = 72, linetype = "dashed") +
  annotate(geom = "text", x = 0.65, y = 58, label = "5 feet \n tall") +
  annotate(geom = "text", x = 0.65, y = 70, label = "6 feet \n tall") +
  labs(x = "Type of singer", y = "Height (inches)") 

  • A boxplot splits the data set into four quartiles. The body of the boxplot consists of a “box” (hence, the name), which goes from the first quartile (Q1) to the third quartile (Q3).

  • Within the box, a horizontal line is drawn at Q2, the median of the data set, and two vertical lines, called whiskers, extend from the top and bottom of the box. The bottom whisker goes from Q1 to the smallest non-outlier in the data set, and the top whisker goes from Q3 to the largest non-outlier.

  • You can also see whether there is skew in the data. If most of the observations are concentrated on the low end of the scale, the distribution is skewed right; and vice versa. If a distribution is symmetric, the observations will be evenly split at the median.

Your turn. Let’s practice interpreting the box plot above by answering the following questions:

  • What’s the median for each group?
  • The range?
  • The interquartile range (IQR)?
  • Is there evidence of skew in any of the groups?
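If you want to check your answers, the quartiles that the boxplot encodes can be computed directly with the same dplyr pattern we used earlier:

# five-number-style summary for each group: the quartiles the boxplot draws
singers_heights %>% 
  group_by(part) %>% 
  summarise(q1     = quantile(height, 0.25),
            median = median(height),
            q3     = quantile(height, 0.75),
            iqr    = IQR(height))

[Note: by default, geom_boxplot() draws the whiskers out to the most extreme observations within 1.5 × IQR of the box; anything beyond that is plotted as an individual outlier point.]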

Notice that we lost some information from the previous plot that would help us with visual inference (drawing conclusions about whether the population means are different from one another) – the confidence intervals.

How could we put that information back on the plot?

ggplot(data = singers_heights, aes(x = reorder(part, height), y = height)) +
  geom_boxplot(width = 0.3) +
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper), color = "darkgreen", 
                  data = mean_and_ci) +
  ylim(55, 80) +
  geom_hline(yintercept = 60, linetype = "dashed") +
  geom_hline(yintercept = 72, linetype = "dashed") +
  annotate(geom = "text", x = 0.65, y = 58, label = "5 feet \n tall") +
  annotate(geom = "text", x = 0.65, y = 70, label = "6 feet \n tall") +
  labs(x = "Type of singer", y = "Height (inches)") 

I like it, but at this point, I am starting to feel like I’ve added too much ink to the plot, and maybe I would consider removing one of the dashed horizontal lines or making them lighter.

Line plot

[TA note: we haven’t discussed continuous predictors yet, but you could include a short discussion of using a line plot when a continuous predictor variable is plotted on the x-axis (time being one of the best examples).]

Misleading plots and graphical integrity

We have seen that effective and truthful data visualization can communicate lots of information efficiently. But visualization can also mislead. And since more and more information is being presented in graphical form, being a good consumer of information means learning to identify the different ways that data visualizations can mislead the reader.

If you’re curious, the following examples are taken from these articles:

Messing with the y-axis

Truncating the y-axis: The conventional way of organizing the y-axis is to start at 0 and then go up to the highest data point in your set. By not setting the origin of the y-axis at zero, small differences can be made to look enormous.

Real-world example #1:

This takes advantage of the bar chart’s use of length to represent data. This means that if the bars don’t begin at zero, comparisons between bar lengths will distort the data — and a reader’s comprehension of the numbers. In the example above, the bars would appear much more equal in length if the scale were not cut off and we included a meaningful baseline of 0% increase.

The distortion can also work in the other direction! If we wish to minimize the appearance of a difference between two sets of observations, we can increase the range of the y-axis (see the Singers’ Heights bar plot that we made earlier in section).
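Here is a minimal sketch of both distortions, using made-up toy data (hypothetical values, just for illustration):

# toy data (hypothetical) to illustrate axis truncation
toy <- data.frame(group = c("A", "B"), value = c(98, 100))

# truncated baseline: the 2-unit difference looks enormous
ggplot(toy, aes(x = group, y = value)) +
  geom_col() +
  coord_cartesian(ylim = c(97, 100))

# zero baseline: the same difference looks small
ggplot(toy, aes(x = group, y = value)) +
  geom_col()

[Note: coord_cartesian() zooms the view without dropping data, whereas ylim() removes out-of-range observations before any statistics are computed.]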

But, not all y-axes should start at zero. See this line plot of the average annual temperature as a function of year. Real-world example #2:

A change of even one degree in the average global temperature is significant, but starting this chart at zero makes it look minuscule! This is a good example of why not all y-axes need to start at zero; rather, the baseline has to be meaningful and should not distort the message of the data. Here’s the same plot with a more reasonable baseline:

But graphics don’t just minimize the effects of global warming, they can also exaggerate them. Consider this plot comparing the average global temperature in 2012 to the five warmest years on record.

Some thoughts on this plot:

  • Range on y-axis from 53 to 55.5 and small ranges can exaggerate small changes
  • Sorting of values on x-axis by y-value, not by year (x-value)
  • Use of orange/red emphasizes warming (good? misleading?)
  • Skewed axis makes the bars on the left look farther away (smaller)

Here’s a better version that adds in some helpful context of temperature from other years while still highlighting the five warmest years.

Dual y-axes

At first glance, it looks like both the number of uninsured Americans and Unemployment Rate are increasing at the same rate. But we can’t make this conclusion because there are two y-axes with very different units (the left is frequency and the right is a percentage). If you actually look at the numbers, you see that lack of insurance is increasing very slightly (from ~15 percent to ~16 percent) and Unemployment increases more rapidly (from ~4.5 percent to 7.5 percent).
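A side note on implementation: ggplot2 makes this particular trick hard, because it only supports a secondary axis that is a one-to-one transformation of the primary one, e.g., converting inches to centimeters:

# a secondary axis in ggplot2 must be a direct transformation of the primary
ggplot(singers_heights, aes(x = part, y = height)) +
  geom_boxplot() +
  scale_y_continuous(name = "Height (inches)",
                     sec.axis = sec_axis(~ . * 2.54, name = "Height (cm)"))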

Violating graphical conventions

One of the easiest ways to construct a misleading data visualization is to violate standard graphing conventions. For example, look at this plot showing changes in the number of gun deaths in Florida over time.

At first glance, it looks like gun deaths are on the decline in Florida. But a closer look shows that the y-axis is upside-down, with zero at the top and the maximum value at the bottom. As gun deaths increase, the line slopes downward, violating a well-established convention that y-values increase as we move up the page.
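You can produce (or recognize) this trick in one line of ggplot2. A sketch with made-up numbers:

# toy data (hypothetical): a quantity that increases over time
trend <- data.frame(year = 2000:2005,
                    deaths = c(400, 450, 500, 550, 600, 650))

# scale_y_reverse() flips the axis, making an increase look like a decline
ggplot(trend, aes(x = year, y = deaths)) +
  geom_line() +
  scale_y_reverse()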

The key takeaway here is that choosing an effective plot allows you to quickly understand and communicate lots of information. But this comes at the potential cost of fooling yourself or others if you present misleading visualizations.

An example from my own research

Here is an example of a plot from a study my advisor and I recently finished. Let’s chat about the decisions and discussion that went into making this visualization and why we made them.

More resources on effective data visualization