65 Tidying up datasets
Written by Mariam Walaa and last updated on 7 October 2021.
65.1 Introduction
In this lesson, you will learn how to:
- Use recode()
- Use coalesce()
- Use lag() and lead()
- Use replace_na() and drop_na()
- Use n_distinct() and distinct()
Prerequisite skills include:
- Familiarity with NA values
- Familiarity with data types
Highlights:
- Use replace_na() and drop_na() to work with NA values
- Use n_distinct() and distinct() to look at unique rows
- Use lag() and lead() to push a set of values forward or backward in a vector
- Use coalesce() to take, at each position, the first non-NA value across a set of vectors
- Use recode() to change certain values to something else that is of the same data type
65.3 Arguments
65.3.1 recode()
The recode() function takes the following as arguments:
| Argument | Parameter | Details |
|---|---|---|
| .x | vector | the vector you want to modify |
| … | old value = new value | one or more replacements written as old value = new value; each old value listed is changed to its new value |
You can read more about the arguments in the recode() function's documentation.
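As a minimal sketch of how these arguments fit together, the example below recodes a small character vector. It assumes the dplyr package is installed; the vector `fruit` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical vector of short fruit codes
fruit <- c("a", "b", "a", "c")

# Each pair reads old value = new value.
# Values that are not listed (here "c") are left unchanged.
fruit_recoded <- recode(fruit, a = "apple", b = "banana")

print(fruit_recoded)
```

Note that the new values are character strings, the same data type as the values they replace.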
65.3.2 replace_na()
The replace_na() function takes the following as arguments:
| Argument | Parameter | Details |
|---|---|---|
| data | input data frame | data frame with columns we want to replace NAs for |
| replace | list | list of values for each column to replace their NAs with |
You can read more about the arguments in the replace_na() function's documentation.
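The sketch below shows one way the `data` and `replace` arguments might be used together. It assumes the tidyr and tibble packages are installed; the data frame `df` and the replacement values are hypothetical.

```r
library(tidyr)
library(tibble)

# Hypothetical data frame with one NA in each column
df <- tibble(
  x = c(1, NA, 3),
  y = c("a", NA, "c")
)

# The list names the columns to fill and the value to use for each one.
# Each replacement value matches the data type of its column.
filled <- replace_na(df, list(x = 0, y = "unknown"))

print(filled)
```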
65.3.3 coalesce()
The coalesce() function takes the following as arguments:
| Argument | Parameter | Details |
|---|---|---|
| … | set of vectors | vectors from which to take, position by position, the first non-NA value |
You can read more about the arguments in the coalesce() function's documentation.
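To make the position-by-position behaviour concrete, here is a small sketch. It assumes dplyr is installed; the vectors `home` and `work` and the fallback value `"none"` are invented for this example.

```r
library(dplyr)

# Two hypothetical vectors of phone numbers with gaps
home <- c("555-0100", NA, NA)
work <- c("555-0199", "555-0142", NA)

# At each position, keep the first value that is not NA,
# checking home first, then work, then the fallback "none".
phone <- coalesce(home, work, "none")

print(phone)
```

The first position keeps the `home` value, the second falls through to `work`, and the third falls through to the fallback.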
65.3.4 n_distinct()
The n_distinct() function takes the following as arguments:
| Argument | Parameter | Details |
|---|---|---|
| … | set of vectors | set of vectors to count number of distinct elements for |
You can read more about the arguments in the n_distinct() function's documentation.
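A short sketch of counting distinct values, assuming dplyr is installed. The vector `colours` is hypothetical, and the `na.rm` argument used in the second call is not listed in the table above.

```r
library(dplyr)

# Hypothetical vector with repeated values and one NA
colours <- c("red", "blue", "red", NA, "blue")

# By default, NA is counted as its own distinct value
n_all <- n_distinct(colours)

# With na.rm = TRUE, NA values are ignored
n_no_na <- n_distinct(colours, na.rm = TRUE)

print(c(n_all, n_no_na))
```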
65.3.5 distinct()
The distinct() function takes the following as arguments:
| Argument | Parameter | Details |
|---|---|---|
| .data | tibble | tibble to return distinct rows for |
You can read more about the arguments in the distinct() function's documentation.
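The following sketch illustrates distinct() on a small tibble, assuming dplyr is installed. The `orders` tibble is made up, and the second call passes a column name after `.data`, an optional argument that the table above does not list.

```r
library(dplyr)

# Hypothetical tibble in which the first and third rows are identical
orders <- tibble(
  customer = c("ana", "ben", "ana", "ana"),
  item     = c("pen", "pen", "pen", "ink")
)

# Keep one copy of each unique row
unique_rows <- distinct(orders)

# Keep one row per unique value of the customer column only
unique_customers <- distinct(orders, customer)

print(unique_rows)
print(unique_customers)
```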
65.3.6 drop_na()
The drop_na() function takes the following as arguments:
| Argument | Parameter | Details |
|---|---|---|
| data | input data frame | data frame with columns we want to drop rows with NAs for |
| … | columns | columns to check for NAs; a row is dropped if it has an NA in any of them (all columns are checked if none are given) |
You can read more about the arguments in the drop_na() function's documentation.
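As an illustrative sketch, the example below drops rows with and without naming columns. It assumes the tidyr and tibble packages are installed; the `measurements` data frame is hypothetical.

```r
library(tidyr)
library(tibble)

# Hypothetical data frame with NAs in two different columns
measurements <- tibble(
  id     = 1:4,
  height = c(150, NA, 165, 170),
  weight = c(50, 60, NA, 70)
)

# No columns named: drop every row that has an NA in any column
complete_rows <- drop_na(measurements)

# One column named: drop only the rows where height is NA
has_height <- drop_na(measurements, height)

print(complete_rows)
print(has_height)
```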
65.3.7 lag(), lead()
The lag() and lead() functions take the following as arguments:
| Argument | Parameter | Details |
|---|---|---|
| x | vector | vector of values to work with |
| n | number | number of positions to lead or lag by |
| default | value | value used to fill the empty positions created by the shift (NA if not specified) |
You can read more about the arguments in the lag() and lead() function documentation.
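The sketch below shows how `x`, `n`, and `default` might interact, assuming dplyr is installed. The vector `x` is invented for illustration. The calls are written as `dplyr::lag()` and `dplyr::lead()` because base R also has a different function named `lag()`.

```r
library(dplyr)

# Hypothetical numeric vector
x <- c(10, 20, 30, 40)

# lag() pushes values forward by one position; the first spot becomes NA
lagged <- dplyr::lag(x)

# lead() pulls values backward by one position; the last spot becomes NA
led <- dplyr::lead(x)

# Shift by two positions and fill the empty spots with 0 instead of NA
lagged_two <- dplyr::lag(x, n = 2, default = 0)

print(lagged)
print(led)
print(lagged_two)
```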
65.4 Exercise
Match each of the function names to their descriptions.
| Function | Description |
|---|---|
| A | This function pulls a vector backward by n positions and fills with NAs. |
| B | This function provides all the distinct values in a vector. |
| C | This function replaces NA values with a specified value. |
| D | This function counts the number of distinct values in a vector. |
| E | This function pushes a vector forward by n positions and fills with NAs. |
| F | This function returns the first non-NA value at each position across a set of vectors. |
| G | This function takes out all the rows that include NA values. |
| H | This function allows you to change values of certain categories into new values of the same data type. |
65.5 Next Steps
If you are looking for more information on some of these functions, please check out the following resources:
- dplyr: Compute lagged or leading values
- dplyr: Recode values
- tidyr: Replace NAs with specified values
- dplyr: Find first non-missing element
- n_distinct: Efficiently count the number of unique values in a set of vector
- distinct: Select distinct/unique rows
- drop_na: Drop rows containing missing values