71 janitor

Written by Mariam Walaa and last updated on 7 October 2021.

71.1 Introduction

In this lesson, you will learn how to:

  • Tidy up variable names that are in your dataset
  • Deal with duplicate or partially duplicate data

Prerequisite skills include:

  • Installing and loading packages
  • Understanding duplicate data
  • Working with column names, using functions such as names() and rename()

Highlights:

  • You can make column names cleaner using janitor's clean_names() function
  • You can handle duplicate or partially duplicate data using janitor's get_dupes() function

71.2 Overview

“The janitor package is a R package that has simple functions for examining and cleaning dirty data. The main janitor functions: perfectly format data frame column names; isolate partially-duplicate records; and provide quick tabulations (i.e., frequency tables and crosstabs).”

As the description says, the janitor package will help you with cleaning your data. It is not part of the tidyverse, so you will have to install it (for example, with install.packages("janitor")) and then load it separately as follows.

library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

In data analysis, you will frequently come across the issue of duplication, as well as the issue of column names that are difficult to work with. One way to deal with difficult column names is to rename them using rename(), but you can also use the janitor package to perform multiple cleaning steps on the column names at once.

71.3 Arguments

71.4 clean_names()

The clean_names() function takes the following as arguments:

Argument   Parameter               Details
dat*       input data frame
case       ‘title’ for big caps    default is ‘snake’; see to_any_case for detail

You can read more about the arguments in the clean_names() function documentation here.
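
To make the case argument more concrete, here is a minimal sketch. The messy data frame and its column names are hypothetical and exist only for illustration; "screaming_snake" is one of the other case options that clean_names() accepts alongside "snake" and "title".

```r
library(janitor)

# Hypothetical one-row data frame with messy column names (not from the lesson).
messy <- data.frame(
  `First Name` = "Ada",
  `Score %` = 91,
  check.names = FALSE  # keep the spaces and the % sign in the names
)

# Default: case = "snake"
snake_names <- names(clean_names(messy))

# Title-style names
title_names <- names(clean_names(messy, case = "title"))

# Upper-case names with _ as a separator
upper_names <- names(clean_names(messy, case = "screaming_snake"))

snake_names
title_names
upper_names
```

Comparing the three results side by side is a quick way to pick the style you want before applying it to real data.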

71.5 get_dupes()

The get_dupes() function takes the following as arguments:

Argument   Parameter                                          Details
dat*       input data frame
vector     vector containing column names we want to check

You can read more about the arguments in the get_dupes() function documentation here.
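
As a rough illustration of how the column names change what get_dupes() reports, consider the sketch below. The roster data frame is hypothetical. In the calls shown, the columns to check are passed as unquoted names after the data frame.

```r
library(janitor)

# Hypothetical roster: rows 1 and 3 share name and team but differ in score.
roster <- data.frame(
  name  = c("Ana", "Ben", "Ana", "Cy"),
  team  = c("red", "blue", "red", "red"),
  score = c(10, 12, 11, 9)
)

# No columns given: all columns are compared, so only fully identical rows count.
full_dupes <- get_dupes(roster)

# Columns given: rows only need to match on name and team.
partial_dupes <- get_dupes(roster, name, team)

full_dupes
partial_dupes
```

The second call returns the matching rows together with a dupe_count column that records how many rows share each combination.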

71.6 Exercises

Let's start by loading tidyverse, since we will be using the pipe operator %>% and more.
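
The lesson does not show how the grades data is created, so the following is only a sketch for following along. The column names and the first three columns match the printed output in this lesson; the `Final Grade %` values are assumptions, because the printout truncates that column.

```r
library(tidyverse)

# The `Final Grade %` values are hypothetical placeholders: the lesson's
# printout does not show them.
grades <- tibble(
  `Student Initials` = c("AH", "AE", "HS", "ES", "BT"),
  `Grade Midterm 1`  = c(100, 86, 90, 64, 100),
  `Grade Midterm 2`  = c(90, 83, 79, 64, 90),
  `Final Grade %`    = c(95, 85, 84, 64, 93)  # assumed values
)
```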

Consider this small dataset of grades.

grades
#> # A tibble: 5 × 4
#>   `Student Initials` `Grade Midterm 1` `Grade Midterm 2`
#>   <chr>                          <dbl>             <dbl>
#> 1 AH                               100                90
#> 2 AE                                86                83
#> 3 HS                                90                79
#> 4 ES                                64                64
#> 5 BT                               100                90
#> # … with 1 more variable: `Final Grade %` <dbl>

Using the clean_names() function from janitor, we get:

grades %>% clean_names()
#> # A tibble: 5 × 4
#>   student_initials grade_midterm_1 grade_midterm_2
#>   <chr>                      <dbl>           <dbl>
#> 1 AH                           100              90
#> 2 AE                            86              83
#> 3 HS                            90              79
#> 4 ES                            64              64
#> 5 BT                           100              90
#> # … with 1 more variable: final_grade_percent <dbl>

Notice how now everything is lowercased with _ as a separator, and any special characters like % are converted to words to retain their meaning. clean_names() would also handle column names that are duplicated, but that is not demonstrated here since we already had unique columns.
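
Since the grades data has no repeated column names, here is a small sketch of the duplicate-name case using a hypothetical data frame whose two columns are both called Grade.

```r
library(janitor)

# Hypothetical data frame; both columns are deliberately given the same name.
repeated <- data.frame(a = 1:2, b = 3:4)
names(repeated) <- c("Grade", "Grade")

deduped_names <- names(clean_names(repeated))
deduped_names
#> [1] "grade"   "grade_2"
```

The second occurrence receives a numeric suffix, so every column can be referred to unambiguously afterwards.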

71.6.1 Exercise 1

If, for some reason, you wanted to prevent some existing columns from being cleaned, how would you use the clean_names() function on only the columns you want to clean? For example, suppose you wanted to keep the Final Grade % column as is. As a hint, you will need to use functions outside of the janitor package to help with this. Remember the dplyr functions.

grades %>%
  select(-`Final Grade %`) %>%
  clean_names()
#> # A tibble: 5 × 3
#>   student_initials grade_midterm_1 grade_midterm_2
#>   <chr>                      <dbl>           <dbl>
#> 1 AH                           100              90
#> 2 AE                            86              83
#> 3 HS                            90              79
#> 4 ES                            64              64
#> 5 BT                           100              90
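
Note that select() removes the Final Grade % column from the result rather than leaving it in place. If the goal is to keep that column with its original name, one possible alternative is dplyr's rename_with() combined with janitor's make_clean_names(), which cleans a character vector of names. This is a sketch rather than the lesson's intended solution, and the `Final Grade %` values below are hypothetical.

```r
library(tidyverse)
library(janitor)

# Same shape as the lesson's data; the `Final Grade %` values are assumed.
grades <- tibble(
  `Student Initials` = c("AH", "AE", "HS", "ES", "BT"),
  `Grade Midterm 1`  = c(100, 86, 90, 64, 100),
  `Grade Midterm 2`  = c(90, 83, 79, 64, 90),
  `Final Grade %`    = c(95, 85, 84, 64, 93)  # assumed values
)

# Clean every column name except `Final Grade %`, which is left untouched.
partly_clean <- grades %>%
  rename_with(make_clean_names, .cols = -`Final Grade %`)

names(partly_clean)
```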

71.6.2 Exercise 2

If you wanted to restore the upper casing for some columns, how would you do that? As a tip, take a look at the Arguments section and see what you can use. Make sure you store the cleaned data from above in an object called clean, and then apply the new cleaning step to it.

71.6.3 Exercise 3

Try using the get_dupes() function to get duplicate rows from the cleaned grades data clean.

What was the result? Did you get anything?

71.6.4 Exercise 4

Look at the data above and try to see why you did not get anything even though it looks like two rows are very similar. How can you modify the function call so that you get the partially duplicate data?

71.7 Next Steps

If you would like to learn more, please read about the janitor package in its documentation here.

71.8 Exercises

71.8.1 Question 1

71.8.2 Question 2

71.8.3 Question 3

71.8.4 Question 4

71.8.5 Question 5

71.8.6 Question 6

71.8.7 Question 7

71.8.8 Question 8

71.8.9 Question 9

71.8.10 Question 10