65 Tidying up datasets
Written by Mariam Walaa and last updated on 7 October 2021.
65.1 Introduction
In this lesson, you will learn how to:
- Use recode()
- Use coalesce()
- Use lag() and lead()
- Use replace_na(), drop_na()
- Use n_distinct(), distinct()
Prerequisite skills include:
- Familiarity with NA values
- Familiarity with data types
Highlights:
- Use replace_na() and drop_na() to work with NA values
- Use n_distinct() and distinct() to look at unique rows
- Use lag() and lead() to push a set of values forward or backward in a vector
- Use coalesce() to take, at each position, the first non-NA value across a set of vectors
- Use recode() to change certain values to something else that is of the same data type
65.3 Arguments
65.3.1 recode()
The recode() function takes the following as arguments:
Argument | Parameter | Details |
---|---|---|
.x | vector | the vector you want to modify |
… | named replacements | pairs of the form old value = new value; each old value is replaced with the new value assigned to it |
You can read more about the arguments in the recode() function's documentation.
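As a minimal sketch of the arguments above, the following recodes a small character vector. The vector `fruit` and its values are made up for illustration and are not part of the lesson's data.

```r
library(dplyr)

# Hypothetical vector of short codes (illustration only)
fruit <- c("a", "b", "a", "c")

# Old values go on the left of `=`, new values on the right
recoded <- recode(fruit, a = "apple", b = "banana")
recoded
#> [1] "apple"  "banana" "apple"  "c"
```

Values with no replacement, such as "c" here, are left as they are.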
65.3.2 replace_na()
The replace_na() function takes the following as arguments:
Argument | Parameter | Details |
---|---|---|
data | input data frame | data frame whose NAs we want to replace |
replace | list | named list giving, for each column, the value to replace its NAs with |
You can read more about the arguments in the replace_na() function's documentation.
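The sketch below shows one way these arguments might be used. The data frame `scores` and the replacement values are assumptions chosen for illustration.

```r
library(tidyr)

# Hypothetical data frame with an NA in each column
scores <- data.frame(
  points = c(1, NA, 3),
  grade = c("a", NA, "c"),
  stringsAsFactors = FALSE
)

# One replacement value per column, supplied as a named list
filled <- replace_na(scores, list(points = 0, grade = "unknown"))
filled$points
#> [1] 1 0 3
filled$grade
#> [1] "a"       "unknown" "c"
```

Each replacement should have the same type as its column, so a number for `points` and a string for `grade`.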
65.3.3 coalesce()
The coalesce() function takes the following as arguments:
Argument | Parameter | Details |
---|---|---|
… | set of vectors | vectors to search, position by position, for the first non-missing value |
You can read more about the arguments in the coalesce() function's documentation.
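To make the position-by-position behaviour concrete, here is a small sketch. The vectors `primary` and `backup` are hypothetical, and the final `0` is a fallback used when every vector has an NA at that position.

```r
library(dplyr)

# Two hypothetical vectors with gaps in different places
primary <- c(1, NA, NA, 4)
backup <- c(10, 20, NA, 40)

# For each position, keep the first value that is not NA
combined <- coalesce(primary, backup, 0)
combined
#> [1]  1 20  0  4
```

The order of the arguments matters: `backup` is only consulted where `primary` is NA.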
65.3.4 n_distinct()
The n_distinct() function takes the following as arguments:
Argument | Parameter | Details |
---|---|---|
… | set of vectors | set of vectors to count number of distinct elements for |
You can read more about the arguments in the n_distinct() function's documentation.
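A short sketch of counting distinct values follows. The vector `colours` is made up for illustration, and the `na.rm` argument shown in the second call is not listed in the table above.

```r
library(dplyr)

# Hypothetical vector with repeats and one NA
colours <- c("red", "blue", "red", NA, "blue")

# NA counts as its own distinct value by default
n_all <- n_distinct(colours)
n_all
#> [1] 3

# Set na.rm = TRUE to leave NA out of the count
n_known <- n_distinct(colours, na.rm = TRUE)
n_known
#> [1] 2
```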
65.3.5 distinct()
The distinct() function takes the following as arguments:
Argument | Parameter | Details |
---|---|---|
.data | tibble | tibble to return distinct rows for |
You can read more about the arguments in the distinct() function's documentation.
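The following sketch returns the distinct rows of a small tibble. The tibble `visits` is hypothetical, and the second call passes a column name after the data, which the table above does not list.

```r
library(dplyr)

# Hypothetical tibble in which the first two rows are identical
visits <- tibble(
  id = c(1, 1, 2, 2),
  group = c("a", "a", "b", "c")
)

# Keep one copy of each unique row
unique_rows <- distinct(visits)
nrow(unique_rows)
#> [1] 3

# Keep one row per unique value of id, returning only that column
unique_ids <- distinct(visits, id)
nrow(unique_ids)
#> [1] 2
```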
65.3.6 drop_na()
The drop_na() function takes the following as arguments:
Argument | Parameter | Details |
---|---|---|
data | input data frame | data frame to drop rows containing NAs from |
… | columns | columns to check for NAs; a row is dropped if it has an NA in any of them (all columns are checked if none are given) |
You can read more about the arguments in the drop_na() function's documentation.
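As a rough illustration of the difference between checking all columns and checking only some, consider the sketch below. The data frame `survey` is an assumption made for this example.

```r
library(tidyr)

# Hypothetical data frame with NAs in different rows of each column
survey <- data.frame(
  age = c(1, NA, 3),
  answer = c("a", "b", NA),
  stringsAsFactors = FALSE
)

# With no columns named, any row containing an NA is dropped
complete_rows <- drop_na(survey)
nrow(complete_rows)
#> [1] 1

# Naming a column drops only the rows where that column is NA
known_age <- drop_na(survey, age)
nrow(known_age)
#> [1] 2
```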
65.3.7 lag(), lead()
The lag() and lead() functions take the following as arguments:
Argument | Parameter | Details |
---|---|---|
x | vector | vector of values to work with |
n | number | number of positions to lead or lag by |
default | value | value used to fill the positions left empty by the shift (NA unless you specify otherwise) |
You can read more about the arguments in the dplyr documentation for lag() and lead().
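The sketch below shifts a small vector in both directions. The vector `x` is made up for illustration, and the functions are called as `dplyr::lag()` and `dplyr::lead()` so that base R's `stats::lag()` is not picked up by accident.

```r
library(dplyr)

# Hypothetical vector of ordered values
x <- c(10, 20, 30, 40)

# lag() pushes values forward, leaving NA at the start
lagged <- dplyr::lag(x)
lagged
#> [1] NA 10 20 30

# lead() pulls values backward, leaving NA at the end
led <- dplyr::lead(x)
led
#> [1] 20 30 40 NA

# Shift by two positions and fill the gaps with 0 instead of NA
lagged_two <- dplyr::lag(x, n = 2, default = 0)
lagged_two
#> [1]  0  0 10 20
```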
65.4 Exercise
Match each function covered in this lesson to its description.
Function | Description |
---|---|
A | This function pulls a vector backward by n positions and fills with NAs. |
B | This function returns only the distinct (unique) rows of a data frame. |
C | This function replaces NA values with a specified value. |
D | This function counts the number of distinct values in a vector. |
E | This function pushes a vector forward by n positions and fills with NAs. |
F | This function returns the first non-NA value at each position across a set of vectors. |
G | This function takes out all the rows that include NA values. |
H | This function allows you to change values of certain categories into new values of the same data type. |
65.5 Next Steps
If you are looking for more information on some of these functions, please check out the following resources:
- dplyr: Compute lagged or leading values
- dplyr: Recode values
- tidyr: Replace NAs with specified values
- dplyr: Find first non-missing element
- n_distinct: Efficiently count the number of unique values in a set of vector
- distinct: Select distinct/unique rows
- drop_na: Drop rows containing missing values