32 summarise
Written by Mariam Walaa and last updated on 7 October 2021.
32.1 Introduction
In this lesson, you will learn how to:
- Summarize a variable using summarise()
- Summarize groups of observations within a variable using group_by()
Prerequisite skills include:
- Using group_by()
- Using summary functions like sum(), min(), max()
Highlights:
- summarise() is often used with group_by()
- There are many summary functions you can use within summarise()
- You can also define your own functions to use within summarise()
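As a minimal sketch of the last highlight, the example below defines a small helper and uses it inside summarise(). The helper name value_range and the toy data frame are made up for illustration; they are not part of dplyr or of the Broadway data set.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): spread between the largest and smallest value
value_range <- function(x) {
  max(x) - min(x)
}

# Made-up data for illustration only
toy <- tibble(
  week  = c(1, 1, 2, 2),
  gross = c(100, 250, 300, 120)
)

# Apply the custom function once per week
toy_ranges <- toy %>%
  group_by(week) %>%
  summarise(gross_range = value_range(gross),
            .groups = 'drop')

toy_ranges
```

Any function that takes a vector and returns a single value can be used this way.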
32.2 Arguments
The summarise() function takes the following as arguments:
| Argument | Parameter | Details |
|---|---|---|
| .data | data frame | a data frame containing the variables we want to summarize |
| name-value pairs | name-value pairs | each pair gives the name of the summary column and the summary function that computes it |
You can read more about the arguments in the summarise() function documentation here.
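To make the name-value pairs concrete, here is a small sketch on a made-up data frame (the column names and values are assumptions for illustration). The name on the left of each = becomes a column in the result, and the expression on the right computes its value.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  show         = c("A", "B", "C"),
  performances = c(8, 8, 7)
)

# Two name-value pairs: each name becomes a column in the one-row result
toy_summary <- toy %>%
  summarise(total_performances = sum(performances),
            max_performances   = max(performances))

toy_summary
```

Without group_by(), summarise() collapses the whole data frame into a single row.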
32.3 Overview
This section will demonstrate how to use the summarise() function to summarize variables and groups within a variable in a data set. We will be looking at a data set of Broadway shows with variables about the performances, attendance, and revenue for theaters that are part of The Broadway League. You can learn more about the data set provided by Alex Cookson in the data repository provided on GitHub, as well as this corresponding blog post.
glimpse(broadway)
#> Rows: 47,524
#> Columns: 8
#> $ week_ending <date> 1985-06-09, 1985-06-09, 1985-06-…
#> $ show <chr> "42nd Street", "A Chorus Line", "…
#> $ theatre <chr> "St. James Theatre", "Sam S. Shub…
#> $ weekly_gross <dbl> 282368, 222584, 249272, 95688, 61…
#> $ avg_ticket_price <dbl> 30.42, 27.25, 33.75, 20.87, 20.78…
#> $ top_ticket_price <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ performances <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
#> $ previews <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
You will notice that there are 47,524 rows and 8 columns. Each row uniquely represents a show that ran in a specific week. Each column contains information about that show in that week at a specific theatre.
32.3.2 Question 2
How many performances occurred per week?
broadway %>%
  group_by(week_ending) %>%
  summarise(total_num_performances = sum(performances),
            .groups = 'drop')
#> # A tibble: 1,812 × 2
#> week_ending total_num_performances
#> <date> <dbl>
#> 1 1985-06-09 137
#> 2 1985-06-16 139
#> 3 1985-06-23 134
#> 4 1985-06-30 137
#> 5 1985-07-07 136
#> 6 1985-07-14 137
#> 7 1985-07-21 137
#> 8 1985-07-28 121
#> 9 1985-08-04 121
#> 10 1985-08-11 121
#> # … with 1,802 more rows
The code for answering this second question is similar to the first question, except that we need to group by the week_ending variable, which identifies each distinct week. After grouping by week, we use the summarise() function to sum up all the performances for each week.
32.3.3 Question 3
How many performances and previews occurred per week?
broadway %>%
  group_by(week_ending) %>%
  summarise(total_num_performances = sum(performances),
            total_num_previews = sum(previews),
            .groups = 'drop')
#> # A tibble: 1,812 × 3
#> week_ending total_num_performances total_num_previews
#> <date> <dbl> <dbl>
#> 1 1985-06-09 137 16
#> 2 1985-06-16 139 9
#> 3 1985-06-23 134 6
#> 4 1985-06-30 137 8
#> 5 1985-07-07 136 1
#> 6 1985-07-14 137 0
#> 7 1985-07-21 137 0
#> 8 1985-07-28 121 0
#> 9 1985-08-04 121 0
#> 10 1985-08-11 121 0
#> # … with 1,802 more rows
Here, we are taking two sums, the sum of performances and the sum of previews, for each distinct week.
32.3.4 Question 4
How many performances occurred per theatre within each week?
broadway %>%
  group_by(week_ending, theatre) %>%
  summarise(total_num_performances = sum(performances),
            .groups = 'drop')
#> # A tibble: 45,776 × 3
#> week_ending theatre total_num_perfo…
#>
#> 1 1985-06-09 46th Street Theatre 8
#> 2 1985-06-09 Ambassador Theatre 8
#> 3 1985-06-09 Booth Theatre 8
#> 4 1985-06-09 Broadhurst Theatre 0
#> 5 1985-06-09 Broadway Theatre 8
#> 6 1985-06-09 Brooks Atkinson Theatre 8
#> 7 1985-06-09 Circle in the Square Theatre 8
#> 8 1985-06-09 Edison Theatre 9
#> 9 1985-06-09 Eugene O'Neill Theatre 8
#> 10 1985-06-09 Gershwin Theatre 0
#> # … with 45,766 more rows
This is similar to the second question, except we are grouping by two variables this time. This means we first group the distinct weeks, and then for each week, we group by the theatres and sum up the performances for each theatre by week. For example, for Week 1, there were X performances for Theatre A, Y performances for Theatre B, and Z performances for Theatre C.
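The "Week 1, Theatre A" description above can be sketched on a tiny made-up data frame (dates, theatre names, and counts are assumptions for illustration, not values from the Broadway data set):

```r
library(dplyr)

# Made-up data: two weeks, two theatres, one row per show
toy <- tibble(
  week_ending  = as.Date(c("2020-01-05", "2020-01-05", "2020-01-05", "2020-01-12")),
  theatre      = c("Theatre A", "Theatre A", "Theatre B", "Theatre A"),
  performances = c(4, 4, 8, 7)
)

# One output row per distinct (week_ending, theatre) combination
per_theatre_week <- toy %>%
  group_by(week_ending, theatre) %>%
  summarise(total_num_performances = sum(performances),
            .groups = 'drop')

per_theatre_week
```

The two Theatre A rows for the first week are combined into one row, while Theatre A in the second week stays separate because it belongs to a different group.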
Notice that we include the .groups argument within each summarise() function call. We mostly do this to keep the output clean, but you can learn more about this argument by running ?summarise in your console.
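As a rough illustration of what the .groups argument controls, the sketch below compares two of its documented values on a made-up data frame. The data and variable names are assumptions for illustration.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  week         = c(1, 1, 2),
  theatre      = c("A", "B", "A"),
  performances = c(8, 8, 7)
)

# .groups = 'drop' returns an ungrouped result
dropped <- toy %>%
  group_by(week, theatre) %>%
  summarise(total = sum(performances), .groups = 'drop')

# .groups = 'drop_last' removes only the last grouping variable (theatre)
kept <- toy %>%
  group_by(week, theatre) %>%
  summarise(total = sum(performances), .groups = 'drop_last')

is_grouped_df(dropped)  # is the result still grouped?
group_vars(kept)        # which grouping variables remain?
```

Dropping the groups means later steps in a pipeline operate on the whole result rather than group by group.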
32.4 Exercises
This section will ask you to complete exercises based on what you’ve learned from the previous section.
32.4.1 Exercise 1
How many theaters do we have in this data set?
n_distinct()
#> [1] 0
# Try naming it something simple and clear, like n_theatres
32.4.2 Exercise 2
How many shows occurred per week?
n_distinct()
#> [1] 0
# Try naming it something brief, like n_shows
32.4.3 Exercise 3
What is the average number of performances across all theatres per week?
# Try naming it something descriptive, like avg_num_performances
32.4.7 Exercise 7
Select all the true statements about the summarise() function from dplyr.
32.5 Common Mistakes & Errors
Below are some common mistakes and errors you may come across:
- You try to summarize a column that has NA values. Remember to include na.rm = TRUE in the summary function call.
- You try to summarize a column that is not available in the data set (i.e., you misspelled the column name, or it's simply not in the data set).
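The first mistake can be reproduced with a small sketch. The data frame below is made up for illustration; it borrows the column name top_ticket_price, which contains NA values in the Broadway data set shown earlier.

```r
library(dplyr)

# Made-up data with a missing value
toy <- tibble(top_ticket_price = c(NA, 100, 150))

na_summary <- toy %>%
  summarise(
    # Without na.rm, a single NA makes the whole summary NA
    max_with_na    = max(top_ticket_price),
    # With na.rm = TRUE, missing values are removed before summarizing
    max_without_na = max(top_ticket_price, na.rm = TRUE)
  )

na_summary
```

Note that na.rm = TRUE goes inside the summary function (here max()), not inside summarise() itself.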
32.6 Next Steps
If you would like to read more about the summarise() function, here are some additional resources you may find helpful: