24 Read CSVs
Written by Marija Pejcinovska and last updated on Feb 5, 2022.
24.1 Introduction
In this lesson, you will learn how to:
- read in comma-delimited data files using the read_csv() function in the readr package.
Prerequisite skills include:
- Installing packages, calling libraries.
24.2 Delimited data files in R
Text and .csv files are common file formats for saving various types of data. R allows you to read in such delimited text files in a number of ways.
In this tutorial we will focus specifically on the read_csv()
function.
The function is part of the readr package, which is itself part of the tidyverse ecosystem of packages.
Note: you can get the readr package by installing the whole tidyverse (install.packages("tidyverse")) or by installing readr directly. Recall that to load the package you’ll need to use library(readr); alternatively, it is loaded automatically when you load the tidyverse.
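As a minimal sketch of the note above, the following installs readr only when it is missing and then loads it. The requireNamespace() check is an addition for convenience, not part of the original lesson.

```r
# Install readr only if it is not already available (a one-time step).
if (!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr")
}

# Load the package for the current session.
library(readr)

# Confirm that read_csv() can now be found.
exists("read_csv")
```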
As the name suggests, the read_csv() function is best suited for reading in .csv files.
“csv” stands for comma-separated values, which means that data entries are separated (or delimited) by commas. If values are separated by semicolons instead, use the function read_csv2(). Each row in a csv file starts on a new line (rows are separated by the newline character \n).
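To illustrate the semicolon case, here is a small sketch using read_csv2(). The data and object names are made up for this example, and it assumes readr 2.0 or later, where I() marks a string as literal data and show_col_types is available. Note that read_csv2() also treats the comma as the decimal mark.

```r
library(readr)

# Hypothetical semicolon-delimited data with commas as decimal marks.
semicolon_data <- I("city;temperature
Toronto;3,5
Ottawa;7,25")

# read_csv2() splits on ";" and parses "3,5" as the number 3.5.
city_temps <- read_csv2(semicolon_data, show_col_types = FALSE)
city_temps
```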
To begin, let’s create our own small (csv-looking) text data that we’ll read in with the read_csv() function. We’ll call this object my_data.
# Make sure that each row in the data starts on a new line
my_data <- c("studendID,test1,test2,grade
student1,90,85,A
student2,30,46,F
student3,70,80,B
student4, NA,68,C
student5,NA,NA,F")
We’ll read the data by passing my_data as an argument to the function, as shown below.
If my_data were instead a .csv file somewhere on your computer, you would need to provide the location (path) of the file, which would look something like read_csv("my_folder/my_subfolder/my_data_file.csv").
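Reading from a path can be sketched as follows. Since no real file accompanies this lesson, the example first writes a small temporary file to stand in for one. The file contents and object names are hypothetical, and show_col_types assumes readr 2.0 or later.

```r
library(readr)

# Write a small csv file to a temporary location (a stand-in for a real file).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("studentID,test1", "student1,90", "student2,30"), csv_path)

# Read the file by passing its path to read_csv().
from_file <- read_csv(csv_path, show_col_types = FALSE)
from_file
```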
We will start by getting a quick sense of what our data looks like once we’ve called the read_csv()
function.
# Let's read in my_data and save it as an object called my_first_csv_file
my_first_csv_file <- read_csv(my_data)
#> Rows: 5 Columns: 4
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (2): studendID, grade
#> dbl (2): test1, test2
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
my_first_csv_file
#> # A tibble: 5 × 4
#> studendID test1 test2 grade
#> <chr> <dbl> <dbl> <chr>
#> 1 student1 90 85 A
#> 2 student2 30 46 F
#> 3 student3 70 80 B
#> 4 student4 NA 68 C
#> 5 student5 NA NA F
24.3 A closer look at read_csv()
To see all the arguments of read_csv()
you can simply call ?read_csv()
from your console. Among the many things in the help file you’ll notice the following usage description.
read_csv(
  file,
  col_names = TRUE,
  col_types = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  progress = show_progress(),
  skip_empty_rows = TRUE
)
You will note that the function has a number of arguments available to the user. In this tutorial, we will focus on a handful of arguments you are most likely to use:
- file
- col_names
- skip
- n_max
- na
24.3.1 Arguments of read_csv()
The argument file
, as one would expect, indicates the file name you are reading in. In our previous example this was the object my_data
; we’ll see a different example in the exercises.
col_names
can either take a logical value, TRUE or FALSE, or a character vector.
Use TRUE/FALSE to indicate whether the file you are reading in contains column names or not.
Let’s see how this works.
By default read_csv() sets col_names = TRUE
, which works great for our particular example since in our data file my_data
we specified column names.
What happens if we set col_names = FALSE
for our file?
read_csv(my_data, col_names = FALSE)
#> Rows: 6 Columns: 4
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (4): X1, X2, X3, X4
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 4
#> X1 X2 X3 X4
#> <chr> <chr> <chr> <chr>
#> 1 studendID test1 test2 grade
#> 2 student1 90 85 A
#> 3 student2 30 46 F
#> 4 student3 70 80 B
#> 5 student4 <NA> 68 C
#> 6 student5 <NA> <NA> F
You’ll notice that R assigns generic variable names. In this case, the columns (or variables) are named X1
through X4
.
But what else do you notice?
R also assumed that our actual column names (i.e. the very first row of our data) were just part of the data, consequently making all variables of type character (since we now have a mix of character and numeric values - visit the tutorial on object types if you need a refresher on this).
So, you should use col_names
as a logical value only to indicate to R that the data you are reading in does not have a line of column names.
If your data file does not contain variable names, but you don’t wish to have R assign generic ones, you can supply a vector with the desired variable names to the col_names argument; for example, col_names = c("First Var Name", "Second Var Name"). This, however, becomes cumbersome for data with many variables, and there might be a better solution for manipulating the names.
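A brief sketch of supplying names through col_names, using a made-up headerless version of the student data. It assumes readr 2.0 or later for I() and show_col_types.

```r
library(readr)

# Hypothetical data with no header line.
headerless_data <- I("student1,90,85,A
student2,30,46,F")

# Supply the column names ourselves instead of accepting X1, X2, ...
named_data <- read_csv(
  headerless_data,
  col_names = c("studentID", "test1", "test2", "grade"),
  show_col_types = FALSE
)
named_data
```

Because the first line is now treated as data rather than as a header, both rows are kept and the test columns are still read as numbers.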
The argument skip
can be used to indicate how many lines should be skipped before reading in data entries. For instance setting skip = 4
will skip the first 4 lines of the file.
The argument n_max, on the other hand, allows you to control the maximum number of lines read. For instance, setting n_max = 1 tells R to read only a single line of the data file. Note, however, that if col_names = TRUE the header of the data file (i.e. the column names) is not counted towards the n_max total. For example, setting n_max = 0 will create a tibble (data frame) with no entries read, but with the names of the columns preserved.
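The n_max = 0 behaviour described above can be sketched like this. The data are a shortened, hypothetical version of the student example, and the code assumes readr 2.0 or later for I() and show_col_types.

```r
library(readr)

small_data <- I("studentID,test1,test2
student1,90,85
student2,30,46")

# Read the header only: no data rows, but the column names are kept.
header_only <- read_csv(small_data, n_max = 0, show_col_types = FALSE)

names(header_only)
nrow(header_only)
```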
Let’s see skip
and n_max
in action.
Suppose we modified our my_data
object and added a few irrelevant entries at the beginning and the end of the object.
my_data_modified <- c("Hello, this text is, irrelevant
As is this line
studendID,test1,test2,grade
student1,90,85,A
student2,30,46,F
student3,70,80,B
student4, NA,68,C
student5,NA,NA,F
student6,LWD,LWD,LWD")
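Suppose we wanted to read in the data by skipping the first 2 lines and the last (sixth) entry in my_data_modified. Setting skip = 2 drops the two irrelevant lines, so the line of column names becomes the first line read, and n_max = 5 stops reading after five data rows. We can accomplish this by running the following code: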
read_csv(my_data_modified, skip = 2, n_max = 5)
#> Rows: 5 Columns: 4
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (2): studendID, grade
#> dbl (2): test1, test2
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5 × 4
#> studendID test1 test2 grade
#> <chr> <dbl> <dbl> <chr>
#> 1 student1 90 85 A
#> 2 student2 30 46 F
#> 3 student3 70 80 B
#> 4 student4 NA 68 C
#> 5 student5 NA NA F
Finally, the argument na allows you to indicate which values have been used to encode NAs (i.e. missing values). By default, the function treats empty fields and the string NA as missing (see the help file, ?read_csv()). If a data file you are reading in instead encodes missing values as -99 or as empty fields, you can specify that by setting na = c("-99", "") in the arguments of the function.
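As a small illustration of the na argument, the sketch below reads made-up data in which missing scores are coded either as -99 or as an empty field. It assumes readr 2.0 or later for I() and show_col_types.

```r
library(readr)

# Hypothetical data: -99 and an empty field both mean "missing".
coded_data <- I("studentID,test1
student1,-99
student2,
student3,70")

decoded <- read_csv(coded_data, na = c("-99", ""), show_col_types = FALSE)

# The first two scores are read as NA; the third stays numeric.
decoded$test1
```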
24.4 Common Mistakes & Errors
- When importing your data using read_csv(), make sure that the path to the file is correctly specified and free of spelling mistakes.
- Make sure that the data file is really missing column names before you change the col_names argument.
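One way to catch a mistyped path before reading is to check that the file exists first. This is only a sketch: the path is the placeholder used earlier in this lesson, and the check relies on base R's file.exists() rather than on any readr feature.

```r
# Hypothetical path; replace it with the location of your own file.
csv_path <- file.path("my_folder", "my_subfolder", "my_data_file.csv")

if (file.exists(csv_path)) {
  my_file_data <- readr::read_csv(csv_path)
} else {
  # Show the full path R tried, which makes typos easier to spot.
  message("File not found: ", normalizePath(csv_path, mustWork = FALSE))
}
```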
24.5 Next Steps
So far you’ve learned how to import a basic delimited data file in R using read_csv()
. In the next tutorial you will see how to load other types of data and import more complicated data formats.
24.6 Exercises
24.6.1 Question 1
The power of read_csv()
is that it allows us to read any type of delimited data.
- True
- False
24.6.2 Question 2
read_csv()
will automatically recognize your missing values regardless of how they are encoded in your data.
- True
- False
24.6.3 Question 3
Consider the following data
my_new_data <- c("studendID,test1,test2,grade
(textID),(points out of 100),(points out of 100),(letter grade)
student1,90,85,A
student2,30,46,F
student3,70,80,B
student4, NA,68,C
student5,NA,NA,F")
Suppose you think the second line of data entries is not relevant for your analysis, and you would like to omit it when reading in the data. Which of the code lines below will achieve that?
- read_csv(my_new_data, skip = 2, col_names = FALSE)
- read_csv(my_new_data, skip = 2, col_names = TRUE)
- read_csv(my_new_data, skip = 1, col_names = TRUE)
- read_csv(my_new_data, skip = 2, col_names = c("studendID","test1","test2","grade"))
24.6.4 Question 4
Suppose you have the following data:
c("Sam,10,9,8
Steve,10,6,5
Jane,9,9,10
Marc,8,10,7")
To correctly read in these data using read_csv()
you would need to set:
- col_names = TRUE
- skip = 0
- skip = 1
- n_max = 5
- col_names = FALSE
24.6.5 Question 5
Refer to the data from Question 3. Which of the following lines of code would extract the column names?
- my_col_names <- names(read_csv(my_new_data, col_names = FALSE))
- my_col_names <- names(my_new_data)
- my_col_names <- names(read_csv(my_new_data, skip = 1))
- my_col_names <- names(read_csv(my_new_data, n_max = 0))
24.6.6 Question 6
Refer again to the data from Question 3 and your answer from Question 5. Which of the following lines of code will read the data, including the correct column names, for the first two students only?
- read_csv(my_new_data, skip = 2, col_names = FALSE, n_max = 4)
- read_csv(my_new_data, skip = 2, col_names = FALSE, n_max = 2)
- read_csv(my_new_data, skip = 0, col_names = my_col_names, n_max = 2)
- read_csv(my_new_data, skip = 2, col_names = my_col_names, n_max = 2)
24.6.7 Question 7
Referring to the data from Question 3, which of the following commands would correctly read in only the data for Student 3?
- read_csv(my_new_data, skip = 2, col_names = my_col_names, n_max = 3)
- read_csv(my_new_data, skip = 4, col_names = my_col_names, n_max = 3)
- read_csv(my_new_data, skip = 4, col_names = my_col_names, n_max = 1)
- read_csv(my_new_data, skip = 2, col_names = my_col_names, n_max = 1)
24.6.8 Question 8
Referring to the data from Question 3 once more, which line of code correctly reads in the data for Student 5 only?
- read_csv(my_new_data, skip = 6, col_names = my_col_names, n_max = 1)
- read_csv(my_new_data, skip = 5, col_names = my_col_names, n_max = 0)
- read_csv(my_new_data, skip = 6, col_names = FALSE, n_max = 1)
- read_csv(my_new_data, skip = 5, col_names = FALSE, n_max = 0)
24.6.9 Question 9
Select the correct answer for the following scenario: If, in a read_csv()
call, you set col_names = FALSE
and n_max = 0
the result will be that R will
- initiate an empty tibble with the correct column names associated with the data file you supplied.
- initiate an empty tibble with generic column names.
- throw an error since n_max would need to be set to 1 in this case.
24.6.10 Question 10
In a read_csv()
call, setting the skip
argument to be skip = 3
will result in a tibble whose first entry is always the 4th line in the data file, regardless of whether the col_names
argument is set to TRUE
or FALSE
.
- True
- False