29 select
Written by Yena Joo and last updated on 7 October 2021.
29.1 Introduction
29.1.1 What is select()
?
In this lesson, we will learn how to use the select
function.
When you’re working with data, you find there are way too many variables in it, and some would wonder, how do I select only the variables I want to use for the analysis? Well, there is a super easy way to see the variables you choose by referring to variables based on the name of the column, with just one simple function.
select()
is a function that keeps only the variables you specify.
The output of the function is a subset of the input data (columns), potentially with a different order. However, the function select()
does not mutate the original dataset/columns. So if you want to use the new columns you selected, you will have to assign the value to a new variable.
29.3 Basics
Here is a simple dataset that has average temperatures for each season. As you can see in the outcome, there are 3 observations(rows) and 4 variables(columns) in the dataset.
#> # A tibble: 3 × 4
#> spring summer fall winter
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 23 19 2
#> 2 5 27 17 -1
#> 3 10 25 14 -8
Now, let’s say we want to only see temperatures in the Spring. To do so, select variables you would like to keep by putting the variable name in the parameter with the function select
.
Here, I would like to just see the spring
column by writing the code as below:
select(temperature_data, spring)
#> # A tibble: 3 × 1
#> spring
#> <dbl>
#> 1 3
#> 2 5
#> 3 10
However, it is important to know that the data in temperature_data
did not change as you can see. The function does not mutate the original dataset.
temperature_data
#> # A tibble: 3 × 4
#> spring summer fall winter
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 23 19 2
#> 2 5 27 17 -1
#> 3 10 25 14 -8
If you want to use the new dataset with the variable spring
, you would have to assign the selected column(s) to a new variable:
new_data = select(temperature_data, spring)
new_data
#> # A tibble: 3 × 1
#> spring
#> <dbl>
#> 1 3
#> 2 5
#> 3 10
29.4 Operators
Now, we know how to use the function select
bare minimum. There are various ways to use the function with some operators.
1. You can select multiple variables using commas. Order of the input matters!
select(temperature_data, winter, summer)
#> # A tibble: 3 × 2
#> winter summer
#> <dbl> <dbl>
#> 1 2 23
#> 2 -1 27
#> 3 -8 25
2. Another way of selecting multiple variables is to use the operator c()
.c()
is a function that returns a vector(a one-dimensional array). Order of the input also matters.
temperature_data %>%
select(c(winter, fall))
#> # A tibble: 3 × 2
#> winter fall
#> <dbl> <dbl>
#> 1 2 19
#> 2 -1 17
#> 3 -8 14
3. The !
operator negates a selection, &
operator means “and”(intersection), whereas |
takes the union of the selections (or).
temperature_data %>%
select(!winter)
#> # A tibble: 3 × 3
#> spring summer fall
#> <dbl> <dbl> <dbl>
#> 1 3 23 19
#> 2 5 27 17
#> 3 10 25 14
select(temperature_data, !winter & !summer)
#> # A tibble: 3 × 2
#> spring fall
#> <dbl> <dbl>
#> 1 3 19
#> 2 5 17
#> 3 10 14
select(temperature_data, winter & summer)
would incur an error because there cannot exist a variable that is both winter and summer.
4. The :
operator selects a range of consecutive variables, starting from the variable you put on the left of the colon to the variable you put on the right of the colon.
temperature_data %>%
select(summer:winter)
#> # A tibble: 3 × 3
#> summer fall winter
#> <dbl> <dbl> <dbl>
#> 1 23 19 2
#> 2 27 17 -1
#> 3 25 14 -8
5. The -
operator excludes a column.
If you would like to choose most of the columns in the dataset, and exclude a few columns, there is an easier way. You can just put -
in front of the name of the column you would like to exclude. For example, I would like to exclude columns summer
and winter
, then I would just put a -
in front of the columns:
29.5 Advanced uses
Some would wonder how they could select the columns based on the data types, since most of the statistical analyses use quantitative data rather than qualitative data. In that case, you can use the following:
select(which(sapply(., is.numeric)))
Instead of is.numeric
, you can put is.character
and is.double
depending on what type of variables you would like to select in the dataset.
Let’s look at some examples. Here is a dataset that contains information about Sakura blooming in Japan, and it has various data types such as <int>
(integer), <chr>
(character), <dbl>
(double).
# japanese_blooming <- read.csv("https://raw.githubusercontent.com/tacookson/data/master/sakura-flowering/temperatures-modern.csv")
# head(japanese_blooming)
Using the which(sapply(., is.character))
, we can select the variables that have a data type character <chr>
.
# japanese_blooming %>% select(which(sapply(.,is.character)))
If you want to select quantitative/numeric data, you can put is.numeric
instead.
# japanese_blooming %>% select(which(sapply(.,is.numeric)))
Another way to perform, is the following using select_if
with whichever data type you would like to select, such as is.double
, is.integer
, is.double
and is.character
:
# japanese_blooming %>% select_if(is.double)
Selecting variables depending on the data type will come in handy when the dataset has hundreds of variables and you would like to select only quantitative variables for your data analysis/building a statistical model.
29.6 Exercises
Based on the material we have learned, now let’s do some exercises.
29.6.1 Question 1
Modify this code so that we can only see from second to fourth column.
temperature_data <- tibble(spring = c(3, 5, 10),
summer = c(23, 27, 25),
fall = c(19, 17, 14),
winter = c(2, -1, -8))
select(temperature_data, 4:4)
#> # A tibble: 3 × 1
#> winter
#> <dbl>
#> 1 2
#> 2 -1
#> 3 -8
select(temperature_data, 2:4)
#> # A tibble: 3 × 3
#> summer fall winter
#> <dbl> <dbl> <dbl>
#> 1 23 19 2
#> 2 27 17 -1
#> 3 25 14 -8
The outcome should be:
#> # A tibble: 3 × 3
#> summer fall winter
#> <dbl> <dbl> <dbl>
#> 1 23 19 2
#> 2 27 17 -1
#> 3 25 14 -8
29.6.2 Question 2
Modify this code so that we can only see the columns “winter” and “summer” respectively, using |
operator.
temperature_data <- tibble(spring = c(3, 5, 10),
summer = c(23, 27, 25),
fall = c(19, 17, 14),
winter = c(2, -1, -8))
select(temperature_data, spring)
#> # A tibble: 3 × 1
#> spring
#> <dbl>
#> 1 3
#> 2 5
#> 3 10
select(temperature_data, winter|summer)
#> # A tibble: 3 × 2
#> winter summer
#> <dbl> <dbl>
#> 1 2 23
#> 2 -1 27
#> 3 -8 25
The correct answer should have a result like:
#> # A tibble: 3 × 2
#> winter spring
#> <dbl> <dbl>
#> 1 2 3
#> 2 -1 5
#> 3 -8 10
29.7 Common Mistakes & Errors
- If you don’t have package “dplyr” or “tidyverse” installed and called in the library, the function would not work. Download either package using
install.packages("dplyr")
, and set it up in the library at the start of your code usinglibrary(dplyr)
.
- Make sure you typed in the correct variable/column name. Always check if your code contains any typo.
- If you would like to use the new data frame using the variables you have selected, make sure to assign the selected variables to a new data frame.
- Make sure you understand the differences between the operator
&
and|
. They can be confusing.
29.8 Next Steps & See also
In the tidyverse essentials, there are a ton of other functions you could mix & match with. After learning other functions such as filter()
, group_by()
, arrange()
, mutate()
, etc. you could easily modify the dataset according to your taste.
29.9 Summary
- package
tidyverse
ordplyr
is needed.
-
select()
keeps only the variables you mention.
- There are some operators that can come in handy.
-
|
: OR operator
-
&
: AND operator
-
c()
: a function that returns a vector, it is for choosing multiple columns. -
!
: to negate a statement or a column
-
-
: to exclude a column
-
:
: to select a range of consecutive variables
-
- If you want to select columns with a specific data type, use
which(sapply(.,is.DATATYPE))
or useselect_if(is.DATATYPE)
29.10 Exercises
29.10.1 Question 5
When do you use select()
function?
a. To retain all rows in a dataset that satisfy your conditions.
b. To choose a subset of variables or columns from a dataset.
c. To reorder a dataset.
d. To drop null values in a dataset.
29.10.2 Question 6
When you want to select multiple columns, which operator is the most appropriate to use?
a. &
b. c()
c. !
d. -
29.10.3 Question 7
What can you do with select() function?
a. Select column with column name.
b. Select column name with missing values.
c. Select column which ends with certain character.
d. All of the above.