61 Looking for missing data
Written by Mariam Walaa and last updated on 7 October 2021.
61.1 Introduction
In this lesson, you will learn how to:
- Find implicit missing data
Prerequisite skills include:
- Using the pipe operator %>%
Highlights:
- Use complete() and fill() to find implicit missing data
61.2 Overview
When we think of looking for missing data, we usually think of looking for missing values. There is, however, another type of missing data that is implicit: entire variables or observations that are absent from the data. We can look for these by listing the possible combinations of values and checking whether every combination actually exists in the data.
61.2.1 Example
Suppose we have a data set of student grades in three required first-year courses for the statistics major (STA130, CSC108, and MAT137), recorded at the end of the students' first year. However, some students have not finished all three courses and may be taking some in the summer.
Let's start by loading the tidyverse.
Here is our hypothetical data of courses and grades.
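The chunks that load the tidyverse and build first_year are not shown in this text. A minimal sketch that reproduces the tibble printed below, assuming it was typed in by hand with tribble() (the original construction may differ):

```r
library(tidyverse)

# Assumed reconstruction of the lesson's data, row by row from the printed output
first_year <- tribble(
  ~student_id, ~course,  ~grade,
  1,           "STA130", 74,
  1,           "CSC108", 81,
  1,           "MAT137", 76,
  2,           "STA130", 74,
  2,           "CSC108", 74,
  3,           "STA130", 81,
  4,           "STA130", 85,
  4,           "CSC108", 79,
  4,           "MAT137", 78,
  5,           "STA130", 87,
  5,           "MAT137", 81,
  6,           "MAT137", 74
)
```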
first_year
#> # A tibble: 12 × 3
#> student_id course grade
#> <dbl> <chr> <dbl>
#> 1 1 STA130 74
#> 2 1 CSC108 81
#> 3 1 MAT137 76
#> 4 2 STA130 74
#> 5 2 CSC108 74
#> 6 3 STA130 81
#> 7 4 STA130 85
#> 8 4 CSC108 79
#> 9 4 MAT137 78
#> 10 5 STA130 87
#> 11 5 MAT137 81
#> 12 6 MAT137 74
As you can see, our data is missing the rows that would correspond to courses students have yet to complete. Suppose that you want to count the number of courses students still have to take to complete their requirements, or that you want to predict the grades a student will get in their remaining courses. Either way, you will need to manipulate this data set so that you can see which courses students have yet to complete. The complete() function is the right tool for this, and we can use it as follows.
first_year %>%
  complete(student_id, course)
#> # A tibble: 18 × 3
#> student_id course grade
#> <dbl> <chr> <dbl>
#> 1 1 CSC108 81
#> 2 1 MAT137 76
#> 3 1 STA130 74
#> 4 2 CSC108 74
#> 5 2 MAT137 NA
#> 6 2 STA130 74
#> 7 3 CSC108 NA
#> 8 3 MAT137 NA
#> 9 3 STA130 81
#> 10 4 CSC108 79
#> 11 4 MAT137 78
#> 12 4 STA130 85
#> 13 5 CSC108 NA
#> 14 5 MAT137 81
#> 15 5 STA130 87
#> 16 6 CSC108 NA
#> 17 6 MAT137 74
#> 18 6 STA130 NA
This gives us one row for every course a student has not yet completed. The grade in each of these new rows is NA because we do not have it yet.
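One of the tasks mentioned earlier, counting how many courses each student has left, can be sketched on top of the completed data. This is only one possible approach; the names remaining and courses_left are made up for illustration, and the data is rebuilt here so that the chunk runs on its own.

```r
library(tidyverse)

# Same values as the first_year tibble printed earlier
first_year <- tibble(
  student_id = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6),
  course = c("STA130", "CSC108", "MAT137", "STA130", "CSC108", "STA130",
             "STA130", "CSC108", "MAT137", "STA130", "MAT137", "MAT137"),
  grade = c(74, 81, 76, 74, 74, 81, 85, 79, 78, 87, 81, 74)
)

remaining <- first_year %>%
  complete(student_id, course) %>%              # add the implicit missing rows
  group_by(student_id) %>%
  summarise(courses_left = sum(is.na(grade)))   # NA grade = course not yet taken

remaining
```

Students 1 and 4 have no courses left, students 2 and 5 have one, and students 3 and 6 have two.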
61.5 complete()
The complete() function takes the following as arguments:
| Argument | Type | Details |
|---|---|---|
| data | data frame | input data whose columns we'll use to find missing data |
| ... | vector | columns to find and complete all combinations for |
| fill | named list | values to fill the cells of newly added rows, one entry per column |
You can read more about the arguments in the complete() function reference or with ?complete.
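As a small illustration of how the three arguments fit together, here is a sketch on a toy tibble. The tibble grades and the choice of 0 as the fill value are hypothetical and only meant to show the mechanics.

```r
library(tidyverse)

# Hypothetical toy data: student 2 has no MAT137 row
grades <- tibble(
  student_id = c(1, 1, 2),
  course = c("STA130", "MAT137", "STA130"),
  grade = c(74, 76, 81)
)

# data = grades, ... = student_id and course,
# fill = a named list giving the value to use for grade in newly added rows
filled <- complete(grades, student_id, course, fill = list(grade = 0))

filled
```

The result has one row for each of the 2 x 2 combinations of student_id and course, and the row that was added for student 2 and MAT137 has a grade of 0 instead of NA.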
61.6 fill()
The fill() function takes the following as arguments:
| Argument | Type | Details |
|---|---|---|
| data | data frame | input data whose columns we'll fill missing values in |
| ... | vector | columns to fill |
| .direction | string | direction in which to fill values: "down" (the default), "up", "downup", or "updown" |
You can read more about the arguments in the fill() function reference or with ?fill.
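To see what the .direction argument does, here is a sketch on a made-up column with gaps. The tibble readings and its values are hypothetical.

```r
library(tidyverse)

# Hypothetical data with missing values at the start and in the middle
readings <- tibble(
  step = 1:5,
  value = c(NA, 10, NA, NA, 20)
)

# "down" carries the last seen value forward; a leading NA stays NA
down <- fill(readings, value, .direction = "down")

# "up" carries the next seen value backward
up <- fill(readings, value, .direction = "up")

# "downup" fills down first, then fills whatever is still missing upward
downup <- fill(readings, value, .direction = "downup")

down$value
up$value
downup$value
```

With this input, "down" gives NA, 10, 10, 10, 20, "up" gives 10, 10, 20, 20, 20, and "downup" gives 10, 10, 10, 10, 20.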
61.7 Exercises
There are many ways to fill in the NA values we got above. If, for some reason, we wanted to fill each one based on the previous or the next value, we could use the fill() function. If, instead, we wanted to fill all the empty values with one specific value, we could use the fill argument of the complete() function.
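As a starting point for the first option, here is a sketch that fills each missing grade from the same student's neighbouring rows. It assumes that fills should not cross from one student to another, which is why the data is grouped first; it also uses only students 1 to 3 to keep the chunk short. Whether a neighbouring grade is a sensible stand-in is a separate question; the sketch only shows the mechanics.

```r
library(tidyverse)

# Subset of the lesson's data (students 1 to 3), rebuilt so the chunk runs alone
subset_grades <- tibble(
  student_id = c(1, 1, 1, 2, 2, 3),
  course = c("STA130", "CSC108", "MAT137", "STA130", "CSC108", "STA130"),
  grade = c(74, 81, 76, 74, 74, 81)
)

filled_within_student <- subset_grades %>%
  complete(student_id, course) %>%          # add the implicit missing rows
  group_by(student_id) %>%                  # keep fills inside each student
  fill(grade, .direction = "downup") %>%    # previous value first, then next value
  ungroup()

filled_within_student
```

Student 2's missing MAT137 grade is filled with 74 from the row above it, and both of student 3's missing grades are filled with 81 from their STA130 row.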