71 janitor

Written by Mariam Walaa and last updated on 7 October 2021.

71.1 Introduction

In this lesson, you will learn how to:

Tidy up variable names that are in your dataset
Deal with duplicate or partially duplicate data

Prerequisite skills include:

Installing and loading packages
Understanding duplicate data
Working with column names using functions such as names() and rename()

Highlights:

You can make column names cleaner using janitor's clean_names() function
You can handle duplicate or partially duplicate data using janitor's get_dupes() function

71.2 Overview

“The janitor package is a R package that has simple functions for examining and cleaning dirty data. The main janitor functions: perfectly format data frame column names; isolate partially-duplicate records; and provide quick tabulations (i.e., frequency tables and crosstabs).”

As the description says, the janitor package will help you with cleaning your data. It is not part of the tidyverse, so you will have to install it and then load it separately, as follows.

# install.packages("janitor")  # run once if janitor is not yet installed
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

In data analysis, you will frequently come across the issue of duplication, as well as the issue of column names that are difficult to work with. One way to deal with difficult column names is to rename them using rename(), but you can also use the janitor package to perform multiple cleaning steps on the column names at once.

71.3 Arguments

71.4 `clean_names()`

The clean_names() function takes the following as arguments:

Argument	Parameter	Details
dat*	input data frame
case	‘title’ for title case	default is ‘snake’; see to_any_case() for details

You can read more about the arguments in the clean_names() function documentation here.
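As a minimal sketch of how the case argument changes the result, the example below cleans a small data frame twice. The scores data frame and its values are hypothetical and exist only for illustration; "upper_camel" is one of the case names accepted by snakecase::to_any_case().

```r
library(janitor)

# Hypothetical one-row data frame (not part of the lesson data).
# check.names = FALSE keeps the messy names exactly as typed.
scores <- data.frame(
  `Student Initials` = "AH",
  `Final Grade %` = 95,
  check.names = FALSE
)

# Default case ("snake"): lowercase words joined by underscores.
snake <- clean_names(scores)
names(snake)

# A different case can be requested through the `case` argument.
camel <- clean_names(scores, case = "upper_camel")
names(camel)
```

Only the column names change; the values in the data frame are left untouched.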

71.5 `get_dupes()`

The get_dupes() function takes the following as arguments:

Argument	Parameter	Details
dat*	input data frame
…	column names	unquoted names of the columns we want to check for duplicates; if omitted, all columns are used

You can read more about the arguments in the get_dupes() function documentation here.
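The two ways of calling get_dupes() can be sketched as follows. The records data frame is hypothetical and made up for illustration: rows 1 and 3 are identical, while rows 2 and 4 share only their id.

```r
library(janitor)

# Hypothetical records (not part of the lesson data).
records <- data.frame(
  id    = c(1, 2, 1, 2),
  score = c(90, 85, 90, 70)
)

# No column names supplied: every column is compared,
# so only fully identical rows are returned.
full_dupes <- get_dupes(records)

# Column names supplied through `...`: rows are compared on `id` alone,
# so partially duplicate rows are returned as well.
id_dupes <- get_dupes(records, id)
```

In both cases the result gains a dupe_count column that says how many times each combination occurs.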

71.6 Exercises

Let's start by loading the tidyverse, since we will be using the pipe operator %>% and more.

library(tidyverse)

Consider this small dataset of grades.

grades
#> # A tibble: 5 × 4
#>   `Student Initials` `Grade Midterm 1` `Grade Midterm 2`
#>   <chr>                          <dbl>             <dbl>
#> 1 AH                               100                90
#> 2 AE                                86                83
#> 3 HS                                90                79
#> 4 ES                                64                64
#> 5 BT                               100                90
#> # … with 1 more variable: `Final Grade %` <dbl>
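The lesson does not show how grades was created. One way to rebuild it for the exercises, as a sketch, is with tribble(). The values of the Final Grade % column are cut off in the printed output, so they are filled with NA here as a placeholder assumption rather than guessed. dplyr is loaded instead of the full tidyverse only to keep the example light; it provides tribble() and the pipe.

```r
library(dplyr)

# Rebuild the printed tibble row by row.
# ASSUMPTION: the `Final Grade %` values are not visible in the printed
# output, so NA_real_ is used as a placeholder.
grades <- tribble(
  ~`Student Initials`, ~`Grade Midterm 1`, ~`Grade Midterm 2`, ~`Final Grade %`,
  "AH",                100,                90,                 NA_real_,
  "AE",                 86,                83,                 NA_real_,
  "HS",                 90,                79,                 NA_real_,
  "ES",                 64,                64,                 NA_real_,
  "BT",                100,                90,                 NA_real_
)

dim(grades)
```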

Using the clean_names() function from janitor, we get:

grades %>% clean_names()
#> # A tibble: 5 × 4
#>   student_initials grade_midterm_1 grade_midterm_2
#>   <chr>                      <dbl>           <dbl>
#> 1 AH                           100              90
#> 2 AE                            86              83
#> 3 HS                            90              79
#> 4 ES                            64              64
#> 5 BT                           100              90
#> # … with 1 more variable: final_grade_percent <dbl>

Notice that everything is now lowercase with _ as a separator, and special characters like % are converted to words to retain their meaning. clean_names() would also handle column names that are duplicated, but that is not demonstrated here since the columns were already unique.
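To illustrate the duplicate-name case, here is a small sketch with a hypothetical data frame whose two column names differ only in capitalisation, so they collide once they are lowercased.

```r
library(janitor)

# Hypothetical data frame (not part of the lesson data): `Grade` and `grade`
# are distinct names in R but become identical after cleaning.
dup_cols <- data.frame(`Grade` = 90, `grade` = 85, check.names = FALSE)

# clean_names() keeps the first name and appends a numeric suffix to the repeat.
deduped <- clean_names(dup_cols)
names(deduped)
#> [1] "grade"   "grade_2"
```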

71.6.1 Exercise 1

If, for some reason, you wanted to prevent some existing columns from being cleaned, how would you use the clean_names() function on only the columns you want to clean? For example, suppose you wanted to keep the Final Grade % column as is. As a hint, you will need to use functions outside of the janitor package to help with this. Remember the dplyr functions.

grades %>%
  select(-`Final Grade %`) %>%
  clean_names()
#> # A tibble: 5 × 3
#>   student_initials grade_midterm_1 grade_midterm_2
#>   <chr>                      <dbl>           <dbl>
#> 1 AH                           100              90
#> 2 AE                            86              83
#> 3 HS                            90              79
#> 4 ES                            64              64
#> 5 BT                           100              90
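Note that the select() approach drops the Final Grade % column from the result rather than keeping it uncleaned. One possible alternative, sketched below, is to pass janitor's make_clean_names() to dplyr's rename_with() and exclude that column from the renaming. The grades tibble is rebuilt here so the example runs on its own; its Final Grade % values are NA placeholders because the printed output does not show them.

```r
library(janitor)
library(dplyr)

# ASSUMPTION: `Final Grade %` values are not visible in the lesson,
# so NA_real_ is used as a placeholder.
grades <- tribble(
  ~`Student Initials`, ~`Grade Midterm 1`, ~`Grade Midterm 2`, ~`Final Grade %`,
  "AH", 100, 90, NA_real_,
  "AE",  86, 83, NA_real_,
  "HS",  90, 79, NA_real_,
  "ES",  64, 64, NA_real_,
  "BT", 100, 90, NA_real_
)

# Clean every column name except `Final Grade %`, which stays in the data.
kept <- grades %>%
  rename_with(make_clean_names, .cols = -`Final Grade %`)

names(kept)
```

This keeps all four columns, with only the first three renamed.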

71.6.2 Exercise 2

If you wanted to restore the upper casing for some columns, how would you do that? As a tip, take a look at the Arguments section and see what you can use. Make sure you store the cleaned data from above in an object called clean, and then apply the new cleaning step to it.
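One possible approach, as a sketch rather than the only answer: store the cleaned data in clean, then apply a title-case cleaning step either to every column or, through rename_with(), to a chosen subset. The grades tibble is rebuilt with NA placeholders for the Final Grade % values, which the lesson does not show; the choice of starts_with("grade") as the subset is purely illustrative.

```r
library(janitor)
library(dplyr)

# ASSUMPTION: `Final Grade %` values are placeholders (not shown in the lesson).
grades <- tribble(
  ~`Student Initials`, ~`Grade Midterm 1`, ~`Grade Midterm 2`, ~`Final Grade %`,
  "AH", 100, 90, NA_real_,
  "AE",  86, 83, NA_real_,
  "HS",  90, 79, NA_real_,
  "ES",  64, 64, NA_real_,
  "BT", 100, 90, NA_real_
)

clean <- grades %>% clean_names()

# Restore upper casing for every column.
all_title <- clean %>% clean_names(case = "title")
names(all_title)

# Restore upper casing only for the columns whose names start with "grade".
some_title <- clean %>%
  rename_with(~ make_clean_names(.x, case = "title"), .cols = starts_with("grade"))
names(some_title)
```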

71.6.3 Exercise 3

Try using the get_dupes() function to get duplicate rows from the cleaned grades data clean.

What was the result? Did you get anything?

71.6.4 Exercise 4

Look at the data above and try to see why you did not get anything even though it looks like two rows are very similar. How can you modify the function call so that you get the partially duplicate data?
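The difference between Exercises 3 and 4 can be sketched as follows. The grades tibble is rebuilt with NA placeholders for the Final Grade % values, which the lesson does not show; this does not affect the result, because the rows already differ in student_initials.

```r
library(janitor)
library(dplyr)

# ASSUMPTION: `Final Grade %` values are placeholders (not shown in the lesson).
grades <- tribble(
  ~`Student Initials`, ~`Grade Midterm 1`, ~`Grade Midterm 2`, ~`Final Grade %`,
  "AH", 100, 90, NA_real_,
  "AE",  86, 83, NA_real_,
  "HS",  90, 79, NA_real_,
  "ES",  64, 64, NA_real_,
  "BT", 100, 90, NA_real_
)

clean <- grades %>% clean_names()

# Exercise 3: all columns are compared. Every row has a unique
# student_initials value, so no fully duplicate rows are found.
all_cols <- clean %>% get_dupes()

# Exercise 4: compare only the two midterm columns to find
# partially duplicate rows.
midterms <- clean %>% get_dupes(grade_midterm_1, grade_midterm_2)
midterms$student_initials
```

Restricting the comparison to the midterm columns returns the AH and BT rows, which share both midterm grades.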

71.7 Next Steps

If you would like to learn more, please read about the janitor package in its documentation here.
