61 Looking for missing data

Written by Mariam Walaa and last updated on 7 October 2021.

61.1 Introduction

In this lesson, you will learn how to:

  • Find implicit missing data

Prerequisite skills include:

  • Using the pipe operator %>%

Highlights:

  • Use complete() to find implicit missing data and fill() to fill in the resulting missing values

61.2 Overview

When we think of looking for missing data, we usually think of explicit missing values such as NA. There is another kind, implicit missing data, where an entire observation is simply absent from the data set. For example, are there observations missing from the data? We can answer this question by looking at the combinations of values of certain variables and checking whether every possible combination is present.

61.2.1 Example

Suppose we have a data set of student grades in the required first-year courses for the statistics major (STA130, CSC108, and MAT137), recorded at the end of the students' first year. Some students have not finished all three courses and may be taking some in the summer.

Let's start by loading the tidyverse.

Here is our hypothetical data of courses and grades.
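The code that loads the tidyverse and creates first_year is not shown here. A minimal sketch that would reproduce the tibble printed below, assuming the values were simply typed in by hand with tribble(), could look like this:

```r
library(tidyverse)

# Hypothetical grades, typed in row by row to match the printed tibble.
# How the original data was created is an assumption.
first_year <- tribble(
  ~student_id, ~course,  ~grade,
  1,           "STA130", 74,
  1,           "CSC108", 81,
  1,           "MAT137", 76,
  2,           "STA130", 74,
  2,           "CSC108", 74,
  3,           "STA130", 81,
  4,           "STA130", 85,
  4,           "CSC108", 79,
  4,           "MAT137", 78,
  5,           "STA130", 87,
  5,           "MAT137", 81,
  6,           "MAT137", 74
)
```

Typing first_year at the console then prints the data.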

first_year
#> # A tibble: 12 × 3
#>    student_id course grade
#>         <dbl> <chr>  <dbl>
#>  1          1 STA130    74
#>  2          1 CSC108    81
#>  3          1 MAT137    76
#>  4          2 STA130    74
#>  5          2 CSC108    74
#>  6          3 STA130    81
#>  7          4 STA130    85
#>  8          4 CSC108    79
#>  9          4 MAT137    78
#> 10          5 STA130    87
#> 11          5 MAT137    81
#> 12          6 MAT137    74

As you can see, the data is missing the rows that would correspond to courses students have yet to complete. Suppose you want to count how many courses each student still has to take to complete the requirements, or you want to predict the grades students will get in their remaining courses. Either way, you will need to manipulate the data set so that the courses each student has yet to complete appear as rows. The complete() function is the right tool for this, and we can use it as follows.

first_year %>%
  complete(student_id, course)
#> # A tibble: 18 × 3
#>    student_id course grade
#>         <dbl> <chr>  <dbl>
#>  1          1 CSC108    81
#>  2          1 MAT137    76
#>  3          1 STA130    74
#>  4          2 CSC108    74
#>  5          2 MAT137    NA
#>  6          2 STA130    74
#>  7          3 CSC108    NA
#>  8          3 MAT137    NA
#>  9          3 STA130    81
#> 10          4 CSC108    79
#> 11          4 MAT137    78
#> 12          4 STA130    85
#> 13          5 CSC108    NA
#> 14          5 MAT137    81
#> 15          5 STA130    87
#> 16          6 CSC108    NA
#> 17          6 MAT137    74
#> 18          6 STA130    NA

The complete() function adds a row for each course a student has not yet completed, with NA as the grade because we do not have one.
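Once the missing rows are explicit, counting the courses each student has left is a short step. The sketch below rebuilds the same hypothetical data so that it runs on its own, and it assumes that every NA in grade marks a course not yet taken (that is, no recorded grade is itself missing). The names remaining and courses_left are chosen here for illustration.

```r
library(tidyverse)

# Same hypothetical data as in the example, rebuilt so this block is self-contained
first_year <- tibble(
  student_id = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6),
  course = c("STA130", "CSC108", "MAT137", "STA130", "CSC108", "STA130",
             "STA130", "CSC108", "MAT137", "STA130", "MAT137", "MAT137"),
  grade = c(74, 81, 76, 74, 74, 81, 85, 79, 78, 87, 81, 74)
)

remaining <- first_year %>%
  complete(student_id, course) %>%          # make the missing rows explicit
  filter(is.na(grade)) %>%                  # keep only courses without a grade
  count(student_id, name = "courses_left")  # one row per student with courses left

remaining
```

Students who have finished all three courses do not appear in remaining, because they have no rows with a missing grade.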

61.3 Video

61.4 Arguments

61.5 complete()

The complete() function takes the following as arguments:

Argument   Type               Details
data       input data frame   data frame whose columns we'll use to find missing data
...        vector             columns to find and complete all combinations for
fill       named list         values to fill the cells of newly added rows

You can read more about the arguments in the tidyr function reference for complete() or by running ?complete.
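As a rough illustration of how these arguments fit together, here is a sketch on a small made-up data set. The attendance tibble and its columns are hypothetical and are not part of this lesson's data.

```r
library(tidyverse)

# Hypothetical attendance log: a row exists only for weeks a student attended
attendance <- tibble(
  student  = c("A", "A", "B"),
  week     = c(1, 2, 1),
  sessions = c(2, 1, 2)
)

attendance_full <- attendance %>%
  complete(
    student, week,             # ...: columns whose combinations should all exist
    fill = list(sessions = 0)  # fill: value to use in the newly added rows
  )

attendance_full
```

Here the only missing combination is student B in week 2, so one row is added and its sessions value is 0 rather than NA.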

61.6 fill()

The fill() function takes the following as arguments:

Argument     Type               Details
data         input data frame   data frame whose columns we use to fill missing data
...          vector             columns to fill
.direction   string             "down" (the default), "up", "downup", or "updown" for the direction in which to fill values

You can read more about the arguments in the tidyr function reference for fill() or by running ?fill.
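To see what .direction does, consider the following sketch on a small made-up data set. The readings tibble is hypothetical and only serves to show the behaviour.

```r
library(tidyverse)

# Hypothetical daily readings with some values missing
readings <- tibble(
  day   = 1:5,
  value = c(NA, 10, NA, NA, 20)
)

# "down" carries the last non-missing value forward;
# day 1 stays NA because nothing comes before it
filled_down <- readings %>%
  fill(value, .direction = "down")

# "up" carries the next non-missing value backward
filled_up <- readings %>%
  fill(value, .direction = "up")

filled_down
filled_up
```

Note that fill() works in row order, so the result depends on how the data is sorted.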

61.7 Exercises

There are many ways to fill in the missing grades we exposed above. If we wanted to fill each one with the previous or the next value in the column, we could use the fill() function. If instead we wanted to fill all of them with one specific number, we could use the fill argument of the complete() function.

61.7.1 Exercise 1

Referring to the Arguments section, fill the missing grades in the completed data with the previous value using the fill() function.

61.7.2 Exercise 2

Referring to the Arguments section, fill all the missing grades with the number 0 using the fill argument of the complete() function.

61.8 Next Steps

If you would like to learn more about the complete() and fill() functions, the tidyr reference pages for both functions are a good place to start.
