65 Tidying up datasets

Written by Mariam Walaa and last updated on 7 October 2021.

65.1 Introduction

In this lesson, you will learn how to:

Use recode()
Use coalesce()
Use lag() and lead()
Use replace_na() and drop_na()
Use n_distinct() and distinct()

Prerequisite skills include:

Familiarity with NA values
Familiarity with data types

Highlights:

Use replace_na() and drop_na() to work with NA values
Use n_distinct() to count unique values and distinct() to keep unique rows
Use lag() and lead() to push a set of values forward or backward in a vector
Use coalesce() to take the first non-NA value at each position across a set of vectors
Use recode() to change certain values to something else that is of the same data type

65.2 Video

65.3 Arguments

65.3.1 recode()

The recode() function takes the following as arguments:

Argument	Parameter	Details
.x	vector	the vector you want to modify
…	name-value pairs	old value = new value; each name is an existing value in the vector and the value is what replaces it

You can read more about the arguments in the recode() function's documentation.
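As a minimal sketch of the old value = new value pattern described above, the following recodes a character vector. The vector `answers` and its codes are made up for illustration; values that are not named in the call are left as they are.

```r
library(dplyr)

# Hypothetical data: survey answers stored as short codes
answers <- c("y", "n", "y", "m")

# old value = new value; "m" is not mentioned, so it stays unchanged
recoded <- recode(answers, y = "yes", n = "no")
recoded
```

Because the replacements here are character strings, the result keeps the same data type as the input.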

65.3.2 replace_na()

The replace_na() function takes the following as arguments:

Argument	Parameter	Details
data	input data frame	data frame whose NAs we want to replace
replace	list	named list giving, for each column, the value to replace its NAs with

You can read more about the arguments in the replace_na() function's documentation.
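A short sketch of replace_na() on a data frame, assuming a small made-up tibble called `scores`. The `replace` argument is a named list with one replacement value per column.

```r
library(tidyr)
library(tibble)

# Hypothetical data with a missing name and a missing score
scores <- tibble(
  name  = c("Ana", NA, "Cy"),
  score = c(90, NA, 75)
)

# One replacement value per column, given as a named list
filled <- replace_na(scores, list(name = "unknown", score = 0))
filled
```

Columns that are not named in the list are left untouched, so any NAs in them remain.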

65.3.3 coalesce()

The coalesce() function takes the following as arguments:

Argument	Parameter	Details
…	set of vectors	vectors of the same length (or length 1) from which the first non-NA value at each position is taken

You can read more about the arguments in the coalesce() function's documentation.
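The sketch below shows how coalesce() works position by position. The vectors `first_try` and `second_try` are hypothetical; the final `0` is a length-1 fallback used wherever both vectors are NA.

```r
library(dplyr)

# Hypothetical measurements: the second vector fills gaps in the first
first_try  <- c(1, NA, NA)
second_try <- c(9, 2, NA)

# At each position, keep the first value that is not NA
combined <- coalesce(first_try, second_try, 0)
combined
```

Position 1 takes 1 from `first_try`, position 2 takes 2 from `second_try`, and position 3 falls through to the fallback value.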

65.3.4 n_distinct()

The n_distinct() function takes the following as arguments:

Argument	Parameter	Details
…	set of vectors	set of vectors to count number of distinct elements for

You can read more about the arguments in the n_distinct() function's documentation.
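A brief sketch of n_distinct() on a made-up vector called `species`. Note that NA is counted as a distinct value unless the `na.rm` argument (not listed in the table above) is set to TRUE.

```r
library(dplyr)

# Hypothetical data with repeated values and one NA
species <- c("cat", "dog", "cat", NA, "dog")

# NA counts as its own value here: "cat", "dog", NA
count_with_na <- n_distinct(species)

# Ignore NA when counting: "cat", "dog"
count_without_na <- n_distinct(species, na.rm = TRUE)

c(count_with_na, count_without_na)
```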

65.3.5 distinct()

The distinct() function takes the following as arguments:

Argument	Parameter	Details
.data	tibble	tibble to return distinct rows for

You can read more about the arguments in the distinct() function's documentation.
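The following sketch applies distinct() to a small hypothetical tibble called `visits`. The second call also passes a column name, which the table above does not list; in dplyr this restricts the comparison to that column.

```r
library(dplyr)

# Hypothetical data: the first two rows are exact duplicates
visits <- tibble(
  id   = c(1, 1, 2, 2),
  site = c("A", "A", "A", "B")
)

# Keep one copy of each unique row
unique_rows <- distinct(visits)

# Keep one row per unique id, returning only the id column
unique_ids <- distinct(visits, id)

unique_rows
unique_ids
```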

65.3.6 drop_na()

The drop_na() function takes the following as arguments:

Argument	Parameter	Details
data	input data frame	data frame from which rows containing NAs are dropped
…	columns	columns to check for NAs; a row is dropped if any of them is NA (all columns are checked if none are given)

You can read more about the arguments in the drop_na() function's documentation.
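A small sketch of both ways to call drop_na(), using a made-up tibble called `people`: once with no columns named, and once restricted to a single column.

```r
library(tidyr)
library(tibble)

# Hypothetical data with NAs scattered across two columns
people <- tibble(
  id     = 1:4,
  height = c(150, NA, 170, 160),
  weight = c(50, 60, NA, NA)
)

# No columns named: drop every row that has an NA anywhere
complete_rows <- drop_na(people)

# Only check height: rows with a missing weight are kept
has_height <- drop_na(people, height)

complete_rows
has_height
```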

65.3.7 lag(), lead()

The lag() and lead() functions take the following as arguments:

Argument	Parameter	Details
x	vector	vector of values to work with
n	number	number of positions to lead or lag by
default	value	value used to fill the positions left empty by the shift (NA if not specified)

You can read more about the arguments in the documentation for lag() and lead().
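To make the direction of each shift concrete, here is a minimal sketch on a made-up vector `x`. The functions are called as `dplyr::lag()` and `dplyr::lead()` so that base R's `stats::lag()` is not picked up by accident.

```r
library(dplyr)

# Hypothetical values in time order
x <- c(10, 20, 30, 40)

# lag(): each position gets the previous value; the first slot is NA
previous <- dplyr::lag(x)

# lead(): each position gets the next value; the last slot is NA
following <- dplyr::lead(x)

# Shift by two positions and fill the empty slots with 0 instead of NA
previous_two <- dplyr::lag(x, n = 2, default = 0)

previous
following
previous_two
```

A common use is comparing each value with its neighbour, for example `x - dplyr::lag(x)` to get the change from one observation to the next.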

65.4 Exercise

Match each of the function names to their descriptions.

Function	Description
A	This function pulls a vector backward by n positions and fills with NAs.
B	This function provides all the distinct values in a vector.
C	This function replaces NA values with a specified value.
D	This function counts the number of distinct values in a vector.
E	This function pushes a vector forward by n positions and fills with NAs.
F	This function returns the first non-NA value at each position across a set of vectors.
G	This function takes out all the rows that include NA values.
H	This function allows you to change values of certain categories into new values of the same data type.

65.5 Next Steps

If you are looking for more information on some of these functions, please check out the following resources:

dplyr: Compute lagged or leading values
dplyr: Recode values
tidyr: Replace NAs with specified values
dplyr: Find first non-missing element
n_distinct: Efficiently count the number of unique values in a set of vector
distinct: Select distinct/unique rows
drop_na: Drop rows containing missing values
