64 Conditional mutating and summarising
Written by Mariam Walaa and last updated on 7 October 2021.
64.1 Introduction
In this lesson, you will learn how to:
- Use
across()
withsummarise()
- Use
mutate_if()
- Use
if_else()
andna_if()
Prerequisite skills include:
- Familiarity with
summarize()
andmutate()
- Familiarity with conditional statements
if_else()
Highlights:
- Use
across()
to summarize across a defined selection of columns - Mutate column types based on conditions using
mutate_if()
- Mutate columns based on conditions using
if_else()
andna_if()
withinmutate()
64.2 Overview
This section will demonstrate how to use the summarise()
function with across()
to
summarize variables and groups within a variable in a data set. We will be looking at a
data set of Broadway shows with variables about the performances, attendance, and revenue
for theaters that are part of The Broadway League. You can learn more about the data set
provided by Alex Cookson in the dataset repository as
well as this corresponding blog
post.
64.3 Questions
Lets start with loading the tidyverse and looking at the data.
broadway
#> # A tibble: 47,524 × 8
#> week_ending show theatre weekly_gross avg_ticket_price
#> <date> <chr> <chr> <dbl> <dbl>
#> 1 1985-06-09 42nd S… St. Ja… 282368 30.4
#> 2 1985-06-09 A Chor… Sam S.… 222584 27.2
#> 3 1985-06-09 Aren't… Brooks… 249272 33.8
#> 4 1985-06-09 Arms a… Circle… 95688 20.9
#> 5 1985-06-09 As Is Lyceum… 61059 20.8
#> 6 1985-06-09 Big Ri… Eugene… 255386 32.0
#> 7 1985-06-09 Biloxi… Neil S… 306839 28.3
#> 8 1985-06-09 Bright… 46th S… 107392 18.9
#> 9 1985-06-09 Cats Winter… 461880 38.4
#> 10 1985-06-09 Doubles Ritz T… 47452 17.5
#> # … with 47,514 more rows, and 3 more variables:
#> # top_ticket_price <dbl>, performances <dbl>,
#> # previews <dbl>
64.3.1 Question 1
What is the minimum and maximum number of performances and previews per week? We can use
across()
to select specific columns to summarize them with multiple summary functions.
broadway %>%
group_by(week_ending) %>%
summarise(across(.cols = c("performances", "previews"),
.fns = list(min = min, max = max)),
.groups = 'drop')
#> # A tibble: 1,812 × 5
#> week_ending performances_min performances_max
#>
#> 1 1985-06-09 0 9
#> 2 1985-06-16 0 8
#> 3 1985-06-23 0 9
#> 4 1985-06-30 0 9
#> 5 1985-07-07 7 9
#> 6 1985-07-14 8 9
#> 7 1985-07-21 8 9
#> 8 1985-07-28 8 9
#> 9 1985-08-04 8 9
#> 10 1985-08-11 8 9
#> # … with 1,802 more rows, and 2 more variables:
#> # previews_min , previews_max
Here is what the above chunk of code does:
- Groups the data by week using
group_by()
- Selects columns to summarize by passing a vector to
.cols
inacross()
(Highlighted in pink) - Defines summary functions by passing a list to
.fns
inacross()
(Highlighted in orange)
You can also learn more about across()
by running ?across
in your console.
64.3.2 Question 2
How do we provide a numeric summary for every show happening in a particular week?
broadway %>%
group_by(week_ending, show) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE),
.groups = 'drop')
#> # A tibble: 47,524 × 7
#> week_ending show weekly_gross avg_ticket_price
#> <date> <chr> <dbl> <dbl>
#> 1 1985-06-09 42nd Street 282368 30.4
#> 2 1985-06-09 A Chorus Line 222584 27.2
#> 3 1985-06-09 Aren't We All? 249272 33.8
#> 4 1985-06-09 Arms and the M… 95688 20.9
#> 5 1985-06-09 As Is 61059 20.8
#> 6 1985-06-09 Big River 255386 32.0
#> 7 1985-06-09 Biloxi Blues 306839 28.3
#> 8 1985-06-09 Brighton Beach… 107392 18.9
#> 9 1985-06-09 Cats 461880 38.4
#> 10 1985-06-09 Doubles 47452 17.5
#> # … with 47,514 more rows, and 3 more variables:
#> # top_ticket_price <dbl>, performances <dbl>,
#> # previews <dbl>
This chunk of code summarizes every show happening in some particular week by every numeric variable that is available in the data set. This helps us compute things like the average ticket price and the number of performances each show had in a particular week.
64.3.3 Question 3
We can use mutate() and if_else() to change values within a column. For example, here is what we would write if we wanted to change all rows with theatre Studio 54 to be called Studio 54 Theatre.
broadway %>%
mutate(theatre = if_else(theatre == "Studio 54", "Studio 54 Theatre", theatre))
#> # A tibble: 47,524 × 8
#> week_ending show theatre weekly_gross avg_ticket_price
#> <date> <chr> <chr> <dbl> <dbl>
#> 1 1985-06-09 42nd S… St. Ja… 282368 30.4
#> 2 1985-06-09 A Chor… Sam S.… 222584 27.2
#> 3 1985-06-09 Aren't… Brooks… 249272 33.8
#> 4 1985-06-09 Arms a… Circle… 95688 20.9
#> 5 1985-06-09 As Is Lyceum… 61059 20.8
#> 6 1985-06-09 Big Ri… Eugene… 255386 32.0
#> 7 1985-06-09 Biloxi… Neil S… 306839 28.3
#> 8 1985-06-09 Bright… 46th S… 107392 18.9
#> 9 1985-06-09 Cats Winter… 461880 38.4
#> 10 1985-06-09 Doubles Ritz T… 47452 17.5
#> # … with 47,514 more rows, and 3 more variables:
#> # top_ticket_price <dbl>, performances <dbl>,
#> # previews <dbl>
We can confirm this by filtering for these theatre names.
broadway %>%
mutate(theatre = if_else(theatre == "Studio 54", "Studio 54 Theatre", theatre)) %>%
filter(theatre %in% c("Studio 54", "Studio 54 Theatre"))
#> # A tibble: 787 × 8
#> week_ending show theatre weekly_gross avg_ticket_price
#> <date> <chr> <chr> <dbl> <dbl>
#> 1 1998-02-15 Cabaret Studio… 48008 58.4
#> 2 1998-02-22 Cabaret Studio… 172373 51.5
#> 3 1998-03-01 Cabaret Studio… 174178 50.8
#> 4 1998-03-08 Cabaret Studio… 177623 49.3
#> 5 1998-03-15 Cabaret Studio… 147752 48.3
#> 6 1998-03-22 Cabaret Studio… 151753 49.1
#> 7 1998-03-29 Cabaret Studio… 183802 48.7
#> 8 1998-04-05 Cabaret Studio… 194515 50.8
#> 9 1998-04-12 Cabaret Studio… 219786 54.2
#> 10 1998-04-19 Cabaret Studio… 196320 48.1
#> # … with 777 more rows, and 3 more variables:
#> # top_ticket_price <dbl>, performances <dbl>,
#> # previews <dbl>
Notice that we did not save the initial modified data frame. We are also using a new operator %in% which checks for the value Studio 54 in a vector of values including Studio 54 and Studio 54 Theatre.
64.4 Arguments
64.4.1 across()
The across()
function takes the following as arguments:
Argument | Parameter | Details |
---|---|---|
.fns | can pass a list of functions or a single function | |
.cols | vector | vector with column names to apply functions to |
You can read more about the arguments in the across()
function documentation
here. Please note that the arguments
for across() will also depend on your use case.
64.4.2 mutate_if()
The mutate_if()
function takes the following as arguments:
Argument | Details |
---|---|
selection | selection of variables by type, i.e., is.character |
function | function to apply to the selection of variables |
You can read more about the arguments in the mutate_if()
function documentation
here.
64.4.3 na_if()
The na_if()
function takes the following as arguments:
Argument | Details |
---|---|
vector | vector to look for values we want to change |
value | value in the vector that we want to change to NA |
You can read more about the arguments in the na_if()
function documentation
here.
You can read more about if_else()
in the corresponding tutorial on conditional
statements.
64.5 Exercises
64.5.1 Exercise 1
Count the number of distinct shows and distinct theatres using summarise() and across(). As a tip, try to use the data type for shows and theatres columns.