32 summarise

Written by Mariam Walaa and last updated on 7 October 2021.

32.1 Introduction

In this lesson, you will learn how to:

Summarize a variable using summarise()
Summarize groups of observations by combining group_by() with summarise()

Prerequisite skills include:

Using group_by()
Using summary functions like sum(), min(), max()

Highlights:

summarise() is often used with group_by()
There are many summary functions you can use within summarise()
You can also define your own functions to use within summarise()
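As a minimal sketch of the last highlight, the example below defines a small helper and calls it inside summarise(). The helper name value_range and the toy prices table are made up for illustration; they are not part of dplyr or the Broadway data set.

```r
library(dplyr)

# Hypothetical helper: the spread between the largest and smallest value
value_range <- function(x) {
  max(x) - min(x)
}

# Toy data invented for this example
prices <- tibble(
  theatre = c("A", "A", "B", "B"),
  avg_ticket_price = c(30, 34, 20, 27)
)

# Any function that returns a single value per group can be used in summarise()
range_by_theatre <- prices %>%
  group_by(theatre) %>%
  summarise(price_range = value_range(avg_ticket_price), .groups = "drop")

range_by_theatre
# price_range is 4 for theatre A and 7 for theatre B
```

The only requirement assumed here is that the function returns one value for each group, just like sum() or max().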

Source: https://github.com/allisonhorst/stats-illustrations Credits: Allison Horst

32.2 Arguments

The summarise() function takes the following as arguments:

Argument	Type	Details
.data	data frame	a data frame containing the variables we want to summarize
...	name-value pairs	one or more pairs where the name is the new summary column and the value is a summary expression, such as sum(x)

You can read more about the arguments in the summarise() function documentation here.
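To make the two arguments concrete, here is a small sketch on a made-up table (the shows data frame and its values are assumptions for illustration, not the Broadway data). The data frame arrives as .data through the pipe, and each name = value pair becomes one column of the result.

```r
library(dplyr)

# Toy data invented for this example
shows <- tibble(
  show = c("A", "B", "C"),
  performances = c(8, 8, 7)
)

# Two name-value pairs produce two summary columns in a single row
result <- shows %>%
  summarise(total_performances = sum(performances),
            max_performances = max(performances))

result
# one row: total_performances = 23, max_performances = 8
```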

32.3 Overview

This section will demonstrate how to use the summarise() function to summarize variables and groups within a variable in a data set. We will be looking at a data set of Broadway shows with variables about the performances, attendance, and revenue for theaters that are part of The Broadway League. You can learn more about the data set provided by Alex Cookson in the data repository provided on GitHub, as well as this corresponding blog post.

glimpse(broadway)
#> Rows: 47,524
#> Columns: 8
#> $ week_ending      <date> 1985-06-09, 1985-06-09, 1985-06-…
#> $ show             <chr> "42nd Street", "A Chorus Line", "…
#> $ theatre          <chr> "St. James Theatre", "Sam S. Shub…
#> $ weekly_gross     <dbl> 282368, 222584, 249272, 95688, 61…
#> $ avg_ticket_price <dbl> 30.42, 27.25, 33.75, 20.87, 20.78…
#> $ top_ticket_price <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ performances     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
#> $ previews         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

You will notice that there are 47,524 rows and 8 columns. Each row represents one show in one specific week, and each column holds a piece of information about that show, such as its theatre, gross, or number of performances.

32.3.1 Question 1

How many performances occurred in total?

broadway %>%
  summarise(total_performances = sum(performances))
#> # A tibble: 1 × 1
#>   total_performances
#>                <dbl>
#> 1             343967

32.3.2 Question 2

How many performances occurred per week?

broadway %>%
  group_by(week_ending) %>%
  summarise(total_num_performances = sum(performances), 
            .groups = 'drop')
#> # A tibble: 1,812 × 2
#>    week_ending total_num_performances
#>    <date>                       <dbl>
#>  1 1985-06-09                     137
#>  2 1985-06-16                     139
#>  3 1985-06-23                     134
#>  4 1985-06-30                     137
#>  5 1985-07-07                     136
#>  6 1985-07-14                     137
#>  7 1985-07-21                     137
#>  8 1985-07-28                     121
#>  9 1985-08-04                     121
#> 10 1985-08-11                     121
#> # … with 1,802 more rows

The code for answering this second question is similar to the first question, except that we need to group by the week_ending variable, which identifies each distinct week. After grouping by week, we use the summarise() function to sum up all the performances for each week.

32.3.3 Question 3

How many performances and previews occurred per week?

broadway %>%
  group_by(week_ending) %>%
  summarise(total_num_performances = sum(performances),
            total_num_previews = sum(previews),
            .groups = 'drop')
#> # A tibble: 1,812 × 3
#>    week_ending total_num_performances total_num_previews
#>    <date>                       <dbl>              <dbl>
#>  1 1985-06-09                     137                 16
#>  2 1985-06-16                     139                  9
#>  3 1985-06-23                     134                  6
#>  4 1985-06-30                     137                  8
#>  5 1985-07-07                     136                  1
#>  6 1985-07-14                     137                  0
#>  7 1985-07-21                     137                  0
#>  8 1985-07-28                     121                  0
#>  9 1985-08-04                     121                  0
#> 10 1985-08-11                     121                  0
#> # … with 1,802 more rows

Here, we are taking two sums, the sum of performances and the sum of previews, for each distinct week.

32.3.4 Question 4

How many performances occurred per theatre within each week?

broadway %>%
  group_by(week_ending, theatre) %>%
  summarise(total_num_performances = sum(performances),
            .groups = 'drop')
#> # A tibble: 45,776 × 3
#>    week_ending theatre                      total_num_perfo…
#>    <date>      <chr>                                   <dbl>
#>  1 1985-06-09  46th Street Theatre                         8
#>  2 1985-06-09  Ambassador Theatre                          8
#>  3 1985-06-09  Booth Theatre                               8
#>  4 1985-06-09  Broadhurst Theatre                          0
#>  5 1985-06-09  Broadway Theatre                            8
#>  6 1985-06-09  Brooks Atkinson Theatre                     8
#>  7 1985-06-09  Circle in the Square Theatre                8
#>  8 1985-06-09  Edison Theatre                              9
#>  9 1985-06-09  Eugene O'Neill Theatre                      8
#> 10 1985-06-09  Gershwin Theatre                            0
#> # … with 45,766 more rows

This is similar to the second question, except we are grouping by two variables this time. This means we first group the distinct weeks, and then for each week, we group by the theatres and sum up the performances for each theatre by week. For example, for Week 1, there were X performances for Theatre A, Y performances for Theatre B, and Z performances for Theatre C.

Notice that we include the .groups argument in each grouped summarise() call. Setting .groups = 'drop' removes the grouping from the result, and it also suppresses the message that summarise() may otherwise print about how the output is grouped. You can learn more about this argument by running ?summarise in your console.
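The effect of .groups is easier to see on a small example. The sketch below uses a made-up three-row table (an assumption for illustration) and compares .groups = "drop" with .groups = "drop_last", which keeps every grouping variable except the last one.

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble(
  week_ending = as.Date(c("1985-06-09", "1985-06-09", "1985-06-16")),
  theatre = c("A", "B", "A"),
  performances = c(8, 8, 7)
)

# "drop" returns an ungrouped tibble
dropped <- toy %>%
  group_by(week_ending, theatre) %>%
  summarise(total = sum(performances), .groups = "drop")

# "drop_last" removes only the last grouping variable (theatre)
kept <- toy %>%
  group_by(week_ending, theatre) %>%
  summarise(total = sum(performances), .groups = "drop_last")

is_grouped_df(dropped)  # FALSE
group_vars(kept)        # "week_ending"
```

If you plan to keep working with the summary as an ordinary table, "drop" is usually the safer choice, because a leftover grouping silently changes how later verbs such as mutate() behave.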

32.4 Exercises

This section will ask you to complete exercises based on what you’ve learned from the previous section.

32.4.1 Exercise 1

How many theatres do we have in this data set?

n_distinct()
#> [1] 0

# Try naming it something simple and clear, like n_theatres

32.4.2 Exercise 2

How many shows occurred per week?

n_distinct()
#> [1] 0

# Try naming it something brief, like n_shows

32.4.3 Exercise 3

What is the average number of performances across all theatres per week?

# Try naming it something descriptive, like avg_num_performances

32.4.4 Exercise 4

What is the minimum and maximum number of performances per week?

# Try across()

32.4.5 Exercise 5

What is the average top ticket price?

# Try na.rm = TRUE

32.4.6 Exercise 6

Which weeks did shows have no performances or previews?

# Try arrange()

32.4.7 Exercise 7

Select all the true statements about the summarise() function from dplyr.

32.5 Common Mistakes & Errors

Below are some common mistakes and errors you may come across:

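You try to summarize a column that has NA values. Functions such as sum() and mean() return NA if any value is missing, so remember to include na.rm = TRUE inside the summary function.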
You try to summarize a column that is not available in the data set (i.e., you misspelled the column name, or it’s simply not in the data set).
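Both mistakes can be reproduced on a tiny made-up table (the prices data frame and the misspelled name top_tiket_price are assumptions for illustration). This is only a sketch of what to expect; the exact wording of the error depends on your dplyr version.

```r
library(dplyr)

# Toy data invented for this example, with one missing value
prices <- tibble(top_ticket_price = c(NA, 50, 70))

# Mistake 1: the missing value makes the whole summary NA
with_na <- prices %>%
  summarise(avg_top = mean(top_ticket_price))

# Fix: drop missing values inside the summary function
without_na <- prices %>%
  summarise(avg_top = mean(top_ticket_price, na.rm = TRUE))

with_na$avg_top     # NA
without_na$avg_top  # 60

# Mistake 2: a misspelled column name stops with an error,
# which we capture here instead of letting it halt the script
misspelled <- tryCatch(
  prices %>% summarise(avg_top = mean(top_tiket_price, na.rm = TRUE)),
  error = function(e) e
)

inherits(misspelled, "error")  # TRUE
```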

32.6 Next Steps

If you would like to read more about the summarise() function, here are some additional resources you may find helpful:
