31 group and ungroup
Written by Matthew Wankiewicz and last updated on 7 October 2021.
31.1 Introduction
In this lesson, you will learn how to:
- Use the group_by() function in R
- Use the group_by() function with other functions in R.
Prerequisite skills include:
- Having R installed on your computer/Having RStudio Cloud.
- Having the tidyverse installed in R.
Highlights:
- The group_by() function allows you to group datasets by variables you choose.
- group_by() works best when paired with other dplyr functions, either counting the number of items in a group or making new variables from groups.
31.2 The content
A major part of data analysis is seeing how your data looks within particular groups, and the group_by() function is very helpful for this. The group_by() function takes a data frame and records which rows belong to which group, so that other functions can then give you an idea of what these groups look like.
The group_by() function is useful for conducting operations on your dataset when you want to break up the observations by group. For example, if you have a data frame with the heights and weights of different animals, the group_by() function is useful for finding things like the mean weight of each type of animal in the data frame. The ungroup() function is used to remove the grouping done by the group_by() function.
Normally, the group_by() function is paired with other dplyr functions in order to conduct your analysis.
The ungroup() function takes one argument, a grouped data frame that you want to ungroup. This is useful for ungrouping your data after you have run your analysis and want to work with the whole data frame again.
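The animal example described above can be sketched as follows. This is only an illustration: the animals data frame and its values are made up for this sketch and are not part of the lesson's data.

```r
library(dplyr)

# Hypothetical data: heights (cm) and weights (kg) of a few animals
animals <- tibble(
  type   = c("cat", "cat", "dog", "dog", "dog"),
  height = c(25, 23, 60, 55, 58),
  weight = c(4, 5, 30, 25, 35)
)

# Mean weight for each type of animal
mean_weights <- animals %>%
  group_by(type) %>%
  summarise(mean_weight = mean(weight))

# ungroup() removes the grouping again
still_grouped <- animals %>%
  group_by(type) %>%
  ungroup() %>%
  is_grouped_df()
```

Here mean_weights has one row per animal type, and still_grouped is FALSE because ungroup() has removed the grouping.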
Brief Overview of the group_by() and ungroup() functions:
31.3 Arguments
- group_by(): The two main arguments for group_by() are the data you plan to analyze and the variables you want to group by. The data is always the first argument. You can either write the name of the dataset as the first argument or pipe it into the group_by() function. Once you have your dataset in the function, you then write the names of the variables you plan to group by. You can list as many variables as you want when using group_by().
- ungroup(): The ungroup() function takes one argument, the grouped data that you want to ungroup. This is useful for ungrouping your data after you have run your analysis and want to work with the whole data frame again.
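As a rough sketch of these arguments, the example below passes the data either as the first argument or through the pipe, and then groups by more than one variable. The pets data frame is hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical data frame, made up for this sketch
pets <- tibble(
  type   = c("cat", "cat", "dog", "dog"),
  sex    = c("female", "male", "female", "male"),
  weight = c(4, 5, 25, 30)
)

# The data can be the first argument or be piped in
by_name <- group_by(pets, type)
by_pipe <- pets %>% group_by(type)
same_groups <- identical(group_vars(by_name), group_vars(by_pipe))

# Several grouping variables can be listed at once
by_two <- pets %>% group_by(type, sex)
two_vars <- group_vars(by_two)   # names of the grouping variables
two_count <- n_groups(by_two)    # number of groups formed

# ungroup() takes the grouped data and returns it without groups
ungrouped <- ungroup(by_two)
```

group_vars() and n_groups() are dplyr helpers that report how a data frame is grouped, which is handy because grouping does not change how the rows look.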
31.4 Other Optional Arguments
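- group_by(): The two optional arguments are .add and .drop. .add controls whether the new grouping variables are added to any existing groups (.add = TRUE) or replace them (.add = FALSE, the default). .drop controls whether groups formed from factor levels that do not appear in the data are dropped; they are dropped by default.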
31.5 Questions
penguins_grouped <- penguins %>%
group_by(species)
head(penguins_grouped)
#> # A tibble: 6 × 8
#> # Groups: species [1]
#> species island bill_length_mm bill_depth_mm
#> <fct> <fct> <dbl> <dbl>
#> 1 Adelie Torgersen 39.1 18.7
#> 2 Adelie Torgersen 39.5 17.4
#> 3 Adelie Torgersen 40.3 18
#> 4 Adelie Torgersen NA NA
#> 5 Adelie Torgersen 36.7 19.3
#> 6 Adelie Torgersen 39.3 20.6
#> # … with 4 more variables: flipper_length_mm <int>,
#> # body_mass_g <int>, sex <fct>, year <int>
As you can see above, using the group_by() function does not appear to do anything when it is used on its own; the only visible change is the "Groups: species" line in the header. When you have a grouped data frame, you usually pair it with another function to get the data you want.
Calling count() on the ungrouped penguins data frame returns a single number that represents all of the penguins. In the chunk below, we use the group_by() function to see the number of penguins of each species present in the data frame.
penguins %>%
group_by(species) %>%
count()
#> # A tibble: 3 × 2
#> # Groups: species [3]
#> species n
#> <fct> <int>
#> 1 Adelie 152
#> 2 Chinstrap 68
#> 3 Gentoo 124
In the chunk above, I grouped the penguins data frame by species, as in the first chunk, and piped the result into the count() function. This allows us to see how many penguins of each species are present in the data frame.
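To make the contrast between counting with and without groups concrete, here is a small sketch. The birds data frame is a made-up stand-in for penguins so that the example is self-contained.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data
birds <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
)

# Without grouping, count() returns one number for the whole data frame
total <- birds %>% count()

# With grouping, count() returns one number per group
per_species <- birds %>%
  group_by(species) %>%
  count()

# count(species) gives the same counts in one step,
# but returns an ungrouped result
shortcut <- birds %>% count(species)
```

Passing a column to count() directly is a common shorthand when you only need the counts and do not need the data frame to stay grouped afterwards.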
penguins %>%
group_by(species) %>%
# use na.rm to remove missing values
summarise(average_weight = mean(body_mass_g, na.rm = T))
#> # A tibble: 3 × 2
#> species average_weight
#> <fct> <dbl>
#> 1 Adelie 3701.
#> 2 Chinstrap 3733.
#> 3 Gentoo 5076.
This chunk shows how we can use the group_by() function with the summarise() function to get summary statistics for each species. To do this, you can either take a previously saved grouped data frame and pipe it into summarise(), or you can use group_by() on the initial data frame and then pipe that into summarise().
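The two routes described above can be sketched side by side. The birds data frame below is hypothetical and only mimics the body_mass_g column of penguins, including one missing value.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data
birds <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, NA, 5000, 5200)
)

# Route 1: save the grouped data frame, then pipe it into summarise()
birds_grouped <- birds %>% group_by(species)
from_stored <- birds_grouped %>%
  summarise(average_weight = mean(body_mass_g, na.rm = TRUE))

# Route 2: group and summarise in a single chain
from_chain <- birds %>%
  group_by(species) %>%
  summarise(average_weight = mean(body_mass_g, na.rm = TRUE))
```

Both routes give the same table; saving the grouped data frame is mainly useful when you want to run several different summaries on the same groups.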
penguins %>%
group_by(species) %>%
summarise(average_weight = mean(body_mass_g, na.rm = T),
average_flipper_length = mean(flipper_length_mm, na.rm = T))
#> # A tibble: 3 × 3
#> species average_weight average_flipper_length
#> <fct> <dbl> <dbl>
#> 1 Adelie 3701. 190.
#> 2 Chinstrap 3733. 196.
#> 3 Gentoo 5076. 217.
This is an example of using the group_by() and summarise() functions after piping in your initial data frame. As you can see, there is another column present in the output compared to the table above. This was done by adding another argument to the summarise() function, which now also gives us the average flipper length of each species of penguin.
penguins %>%
group_by(species, sex) %>%
summarise(average_weight = mean(body_mass_g, na.rm = T))
#> `summarise()` has grouped output by 'species'. You can
#> override using the `.groups` argument.
#> # A tibble: 8 × 3
#> # Groups: species [3]
#> species sex average_weight
#> <fct> <fct> <dbl>
#> 1 Adelie female 3369.
#> 2 Adelie male 4043.
#> 3 Adelie <NA> 3540
#> 4 Chinstrap female 3527.
#> 5 Chinstrap male 3939.
#> 6 Gentoo female 4680.
#> 7 Gentoo male 5485.
#> 8 Gentoo <NA> 4588.
The group_by() function can also be used with multiple variables. The chunk above is an example of using two variables to group our data; this time we use species and sex. Once again, grouping on its own does not change how the data looks, but when we apply other functions to it, the output has one row for each combination of the grouping variables.
penguins %>%
group_by(species, sex, year) %>%
summarise(average_weight = mean(body_mass_g, na.rm = T))
#> `summarise()` has grouped output by 'species', 'sex'. You
#> can override using the `.groups` argument.
#> # A tibble: 22 × 4
#> # Groups: species, sex [8]
#> species sex year average_weight
#> <fct> <fct> <int> <dbl>
#> 1 Adelie female 2007 3390.
#> 2 Adelie female 2008 3386
#> 3 Adelie female 2009 3335.
#> 4 Adelie male 2007 4039.
#> 5 Adelie male 2008 4098
#> 6 Adelie male 2009 3995.
#> 7 Adelie <NA> 2007 3540
#> 8 Chinstrap female 2007 3569.
#> 9 Chinstrap female 2008 3472.
#> 10 Chinstrap female 2009 3523.
#> # … with 12 more rows
This output shows us the average weights of the penguins when grouped by species, sex and year. Within each year, there are rows for the three species (Adelie, Chinstrap and Gentoo) and the three sex levels present (female, male and NA), wherever that combination occurs in the data.
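The message printed with the outputs above mentions the .groups argument of summarise(). As a hedged sketch of what it controls, the example below uses a small made-up birds data frame rather than penguins.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data
birds <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex         = c("female", "male", "female", "male"),
  body_mass_g = c(3400, 4000, 4700, 5500)
)

# "drop_last" removes only the last grouping variable (sex), which matches
# the behaviour reported in the message; stating it explicitly silences it
keep_species <- birds %>%
  group_by(species, sex) %>%
  summarise(average_weight = mean(body_mass_g), .groups = "drop_last")

# "drop" removes all grouping from the result
no_groups <- birds %>%
  group_by(species, sex) %>%
  summarise(average_weight = mean(body_mass_g), .groups = "drop")
```

Using .groups = "drop" has the same effect as calling ungroup() after summarise(), so the choice between them is mostly a matter of style.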
penguins %>%
group_by(species, sex) %>%
filter(body_mass_g == max(body_mass_g))
#> # A tibble: 7 × 8
#> # Groups: species, sex [6]
#> species island bill_length_mm bill_depth_mm
#> <fct> <fct> <dbl> <dbl>
#> 1 Adelie Biscoe 43.2 19
#> 2 Adelie Biscoe 39.6 20.7
#> 3 Gentoo Biscoe 49.2 15.2
#> 4 Gentoo Biscoe 46.5 14.8
#> 5 Gentoo Biscoe 45.2 14.8
#> 6 Chinstrap Dream 46 18.9
#> 7 Chinstrap Dream 52 20.7
#> # … with 4 more variables: flipper_length_mm <int>,
#> # body_mass_g <int>, sex <fct>, year <int>
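The group_by() function also works with the filter() function. The chunk above gives us the penguins with the largest body mass in each of the groups we created. Ties are all kept, which is why seven rows come back for six groups. The two groups that contain a missing body mass are dropped entirely, because max() returns NA unless na.rm = TRUE is set.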
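The effect of missing values and ties on a grouped filter() can be sketched with a small hypothetical data frame (birds and its values are made up for this illustration):

```r
library(dplyr)

# Hypothetical data: one missing value in Adelie, a tie in Gentoo
birds <- tibble(
  species     = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, 4000, NA, 5000, 5000)
)

# Without na.rm, max() is NA for Adelie, so every Adelie row is dropped
strict <- birds %>%
  group_by(species) %>%
  filter(body_mass_g == max(body_mass_g))

# With na.rm = TRUE, the heaviest Adelie is kept as well
lenient <- birds %>%
  group_by(species) %>%
  filter(body_mass_g == max(body_mass_g, na.rm = TRUE))
```

In both versions the two tied Gentoo rows are returned, because filter() keeps every row that matches the group maximum.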
penguins %>%
group_by(species) %>%
count()
#> # A tibble: 3 × 2
#> # Groups: species [3]
#> species n
#> <fct> <int>
#> 1 Adelie 152
#> 2 Chinstrap 68
#> 3 Gentoo 124
penguins %>%
group_by(species) %>%
ungroup() %>%
count()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 344
This chunk demonstrates the ungroup() function. The output of the first piece of code is the same as one of the previous examples: it gives us the number of penguins present for each species.
The second piece of code shows us what ungroup() does. The ungroup() function was placed just before the count() function, so instead of giving us the number of penguins in each species, we get the number of penguins in the whole data frame.
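The same idea applies to functions other than count(). As a sketch, assuming a small made-up birds data frame, mutate() computes a mean per group on grouped data and a single overall mean once the data has been ungrouped:

```r
library(dplyr)

# Hypothetical stand-in for the penguins data
birds <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3600, 3800, 5000, 5200)
)

# Grouped: each row gets the mean of its own species
grouped_means <- birds %>%
  group_by(species) %>%
  mutate(mean_mass = mean(body_mass_g))

# Ungrouped: each row gets the mean of the whole data frame
overall_means <- birds %>%
  group_by(species) %>%
  ungroup() %>%
  mutate(mean_mass = mean(body_mass_g))
```

This is why it is worth calling ungroup() once the grouped part of an analysis is finished: later steps then operate on the whole data frame rather than silently working group by group.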
31.6 Exercises
1. Use the group_by() function to count how many penguins were studied each year, also grouping them by their sex. Remember that the data frame is called “penguins”, the year variable is called “year” and the sex variable is called “sex”.
## FINAL SOLUTION ##
penguins %>%
group_by(year, sex) %>%
count()
#> # A tibble: 9 × 3
#> # Groups: year, sex [9]
#> year sex n
#> <int> <fct> <int>
#> 1 2007 female 51
#> 2 2007 male 52
#> 3 2007 <NA> 7
#> 4 2008 female 56
#> 5 2008 male 57
#> 6 2008 <NA> 1
#> 7 2009 female 58
#> 8 2009 male 59
#> 9 2009 <NA> 3
## OR ##
penguins %>%
group_by(year, sex) %>%
summarise(n = n())
#> `summarise()` has grouped output by 'year'. You can override
#> using the `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups: year [3]
#> year sex n
#> <int> <fct> <int>
#> 1 2007 female 51
#> 2 2007 male 52
#> 3 2007 <NA> 7
#> 4 2008 female 56
#> 5 2008 male 57
#> 6 2008 <NA> 1
#> 7 2009 female 58
#> 8 2009 male 59
#> 9 2009 <NA> 3
2. Using the penguins data frame, group by island, sex and species and give the average bill length (bill_length_mm), the average bill depth (bill_depth_mm) and the difference between the average bill length and the average bill depth.
penguins %>%
group_by(island, sex, species) %>%
summarise(avg_length = mean(bill_length_mm, na.rm = T),
avg_depth = mean(bill_depth_mm, na.rm = T),
diff_depth = mean(bill_length_mm) - mean(bill_depth_mm))
#> `summarise()` has grouped output by 'island', 'sex'. You can
#> override using the `.groups` argument.
#> # A tibble: 13 × 6
#> # Groups: island, sex [9]
#> island sex species avg_length avg_depth diff_depth
#> <fct> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Biscoe female Adelie 37.4 17.7 19.7
#> 2 Biscoe female Gentoo 45.6 14.2 31.3
#> 3 Biscoe male Adelie 40.6 19.0 21.6
#> 4 Biscoe male Gentoo 49.5 15.7 33.8
#> 5 Biscoe <NA> Gentoo 45.6 14.6 NA
#> 6 Dream female Adelie 36.9 17.6 19.3
#> 7 Dream female Chinstr… 46.6 17.6 29.0
#> 8 Dream male Adelie 40.1 18.8 21.2
#> 9 Dream male Chinstr… 51.1 19.3 31.8
#> 10 Dream <NA> Adelie 37.5 18.9 18.6
#> 11 Torgersen female Adelie 37.6 17.6 20.0
#> 12 Torgersen male Adelie 40.6 19.4 21.2
#> 13 Torgersen <NA> Adelie 37.9 18.2 NA
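# na.rm = T is optional, but it is safer to use it if you're unsure whether
# your data contains NAs. diff_depth is NA in rows 5 and 13 because the
# means in that line were computed without na.rm = T.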
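One way to avoid the NA values in the difference column is to reuse the averages that summarise() has already computed. The sketch below is an alternative to the solution above, not a replacement for it, and uses a small made-up bills data frame instead of penguins.

```r
library(dplyr)

# Hypothetical data with one row of missing measurements
bills <- tibble(
  island         = c("Biscoe", "Biscoe", "Dream", "Dream"),
  bill_length_mm = c(40, NA, 46, 50),
  bill_depth_mm  = c(18, NA, 17, 19)
)

# Columns created earlier in summarise() can be used by later ones,
# so the difference inherits the NA handling of the two averages
bill_summary <- bills %>%
  group_by(island) %>%
  summarise(
    avg_length = mean(bill_length_mm, na.rm = TRUE),
    avg_depth  = mean(bill_depth_mm, na.rm = TRUE),
    diff_depth = avg_length - avg_depth
  )
```

Because diff_depth is built from avg_length and avg_depth, it cannot be NA unless one of those averages is.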
31.7 Common Mistakes & Errors
Sometimes, you will encounter errors when using the group_by() function. In this section, we’ll cover what you should do when some of the common errors occur.
- Error: Must group by variables found in .data
When this occurs, you are probably trying to group your data frame by a variable that isn’t in the data frame. Often, this happens because of a typo in the variable name.
- Error in eval(lhs, parent, parent) : object ‘totally_real_data_frame’ not found
When this error occurs, R is telling us that the data frame we are trying to make groups from does not exist. Once again, this is usually because of a typo.
- Error in group_by(data) : could not find function “group_by”
When this occurs, it means that R can’t find the group_by() function. To fix this, load the tidyverse (or dplyr) with library(tidyverse).
Some mistakes that you may run into when using group_by() and ungroup():
- Calling for variables that are not in the data frame you plan to analyze (usually typos).
- Not loading the tidyverse/dplyr library in R.
- Sometimes you can encounter difficulties with other functions; usually, typos will be the biggest issue.
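A quick way to guard against the typo-related errors above is to check the column names before grouping. This is only a sketch: the birds data frame and the misspelled name speices are made up for illustration.

```r
# Loading dplyr (or the tidyverse) first avoids the
# "could not find function" error
library(dplyr)

# Hypothetical stand-in for the penguins data
birds <- tibble(
  species     = c("Adelie", "Gentoo"),
  body_mass_g = c(3700, 5000)
)

# Check whether a column exists before grouping by it
typo_exists    <- "speices" %in% names(birds)
correct_exists <- "species" %in% names(birds)

# Grouping by the misspelled column raises an error;
# tryCatch() lets us capture it instead of stopping the script
result <- tryCatch(
  birds %>% group_by(speices),
  error = function(e) "grouping failed"
)
```

Running names(birds) on its own is often enough to spot the misspelling before the error ever occurs.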
31.8 Next Steps
Now that you have some experience with the group_by() and ungroup() functions, these links are useful resources to expand your understanding.
- R for Data Science covers the group_by() function alongside other dplyr functions: https://r4ds.had.co.nz/transform.html
- Section 6.11 of OHI Data Science Training looks at the group_by() function: https://ohi-science.org/data-science-training/dplyr.html#group_by-operates-on-groups
31.9 Questions
- True or False, after only running the group_by() function on a dataset, the data will look different?
- True
- False
- Which function will ungroup a grouped dataset?
- group_by()
- summarise()
- ungroup()
- reset()
- If we want to find an average value for groups in a dataset, which function should be used with group_by()?
- If we want to find the total number of observations in each group, which functions can be used? (Select all that apply)
- What is the issue that causes this error: “Error: Must group by variables found in .data”?
- The dataset you are referencing does not exist
- There are too many NA’s in the column you are referencing
- There is only one group in the column you want to group by
- The column you want to group by is not in the dataset
- True or False, group_by() can be used on multiple columns?
- True
- False
- If I run the group_by() function followed by the ungroup() function and then use summarise(), what will the output be?
- A table grouped by the column of interest
- An empty table
- A table that is not grouped by the column of interest but is still summarized
- An error will occur
- If you use group_by() and then mutate(), will the data appear different if you add the ungroup() function afterwards?
- Yes
- No
- What is the cause of the “Error in eval(lhs, parent, parent) : object ‘data_frame’ not found” error?
- The column you want to group_by() does not exist
- The dataset you want to investigate does not exist
- The group_by() function has not been loaded
- None of the above
- True or False, you can use the filter() function before using group_by()?
- True
- False