24 Read CSVs

Written by Marija Pejcinovska and last updated on Feb 5, 2022.

24.1 Introduction

In this lesson, you will learn how to:

  • read in comma-delimited data files using the read_csv() function from the readr package.

Prerequisite skills include:

  • Installing packages, calling libraries.

24.2 Delimited data files in R

Text and .csv files are common file formats for saving various types of data. R allows you to read in such delimited text files in a number of ways.

In this tutorial we will focus specifically on the read_csv() function. The function is part of the readr package, which itself is part of the tidyverse ecosystem of packages.

Note: you can get the readr package by installing the whole tidyverse (install.packages("tidyverse")) or by installing readr directly (install.packages("readr")). Recall that to load the package you’ll need to call library(readr); alternatively, it is loaded automatically when you load the tidyverse with library(tidyverse).

As the name suggests, the read_csv() function is best suited for reading in .csv files. “csv” stands for comma-separated values, which means that data entries are separated (or delimited) by commas. If values are separated by semicolons instead, use the function read_csv2(). Each row in a csv file ends with a newline character (\n), so every row sits on its own line.
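As a minimal sketch of the semicolon case, the example below reads a small made-up data set with read_csv2(). The object names semicolon_data and scores are hypothetical, and the code assumes readr 2.0 or later (for the show_col_types argument). Note that read_csv2() also treats the comma as the decimal mark.

```r
library(readr)

# Hypothetical semicolon-delimited data; commas act as decimal marks here
semicolon_data <- "name;score
Ana;7,5
Ben;8,25"

# read_csv2() splits fields on ";" and parses "7,5" as the number 7.5
scores <- read_csv2(semicolon_data, show_col_types = FALSE)
scores
```

If your file uses some other delimiter entirely, read_delim() from the same package lets you set it explicitly.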

To begin, let’s create our own small (csv-looking) text data that we’ll read in with the read_csv() function. We’ll call this object my_data.


# Make sure that each row in the data starts on a new line

my_data <- c("studendID,test1,test2,grade 
         student1,90,85,A
         student2,30,46,F
         student3,70,80,B
         student4, NA,68,C
         student5,NA,NA,F") 

We’ll read the data by passing my_data as an argument to our function, as shown below. If instead my_data were a .csv file somewhere on your computer, you would need to provide the location (path) of the file, which would look something like read_csv("my_folder/my_subfolder/my_data_file.csv").

We will start by getting a quick sense of what our data looks like once we’ve called the read_csv() function.


# Let's read in my_data and save it as an object called my_first_csv_file
my_first_csv_file <- read_csv(my_data)
#> Rows: 5 Columns: 4
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (2): studendID, grade
#> dbl (2): test1, test2
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
my_first_csv_file
#> # A tibble: 5 × 4
#>   studendID test1 test2 grade
#>   <chr>     <dbl> <dbl> <chr>
#> 1 student1     90    85 A    
#> 2 student2     30    46 F    
#> 3 student3     70    80 B    
#> 4 student4     NA    68 C    
#> 5 student5     NA    NA F
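Reading from a path works the same way as reading from a string. The sketch below first writes a few rows to a temporary file, which stands in for a real file on your computer, and then reads it back. The names csv_path and from_disk are hypothetical, and readr 2.0 or later is assumed.

```r
library(readr)

# Create a temporary .csv file as a stand-in for a file on disk
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("studendID,test1,test2,grade",
    "student1,90,85,A",
    "student2,30,46,F"),
  csv_path
)

# Pass the path (a string without newlines) instead of literal data
from_disk <- read_csv(csv_path, show_col_types = FALSE)
from_disk
```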

24.3 A closer look at read_csv()

To see all the arguments of read_csv() you can run ?read_csv in your console. Among the many things in the help file you’ll notice a usage description like the following (the exact list of arguments varies between readr versions).

read_csv(
      file,
      col_names = TRUE,
      col_types = NULL,
      locale = default_locale(),
      na = c("", "NA"),
      quoted_na = TRUE,
      quote = "\"",
      comment = "",
      trim_ws = TRUE,
      skip = 0,
      n_max = Inf,
      guess_max = min(1000, n_max),
      progress = show_progress(),
      skip_empty_rows = TRUE
)

You will note that the function has a number of arguments available to the user. In this tutorial, we will focus on a handful of arguments you are most likely to use:

  • file
  • col_names
  • skip
  • n_max
  • na

24.3.1 Arguments of read_csv()

The argument file, as one would expect, indicates what you are reading in, usually the path to a file. In our previous example this was the object my_data, a string of literal data; we’ll see a different example in the exercises.

col_names can either take a logical value, TRUE or FALSE, or a character vector.
Use TRUE/FALSE to indicate whether the file you are reading in contains column names or not.

Let’s see how this works.

By default read_csv() sets col_names = TRUE, which works great for our particular example since in our data file my_data we specified column names.
What happens if we set col_names = FALSE for our file?

read_csv(my_data, col_names = FALSE)  
#> Rows: 6 Columns: 4
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (4): X1, X2, X3, X4
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 4
#>   X1        X2    X3    X4   
#>   <chr>     <chr> <chr> <chr>
#> 1 studendID test1 test2 grade
#> 2 student1  90    85    A    
#> 3 student2  30    46    F    
#> 4 student3  70    80    B    
#> 5 student4  <NA>  68    C    
#> 6 student5  <NA>  <NA>  F

You’ll notice that R assigns generic variable names. In this case, the columns (or variables) are named X1 through X4.

But what else do you notice?

R also assumed that our actual column names (i.e. the very first row of our data) were just part of the data, consequently making all variables of type character (since we now have a mix of character and numeric values - visit the tutorial on object types if you need a refresher on this).

So, you should use col_names as a logical value only to indicate to R that the data you are reading in does not have a line of column names.
If your data file does not contain variable names, but you don’t wish to have R assign generic ones, you can supply a character vector with the desired names to the col_names argument; for example, col_names = c("First Var Name", "Second Var Name"). This, however, becomes cumbersome for data with many variables, and there may be better ways to manipulate the names.
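A minimal sketch of supplying your own names, assuming a small header-less data set (the objects no_header_data and named are made up for illustration, and readr 2.0 or later is assumed):

```r
library(readr)

# Hypothetical data with no header line
no_header_data <- "student1,90,85,A
student2,30,46,F"

# Supply one name per column; the first line is then treated as data
named <- read_csv(
  no_header_data,
  col_names = c("studentID", "test1", "test2", "grade"),
  show_col_types = FALSE
)
named
```

Because no header line is mixed into the data, the two test columns are read as numbers rather than as text.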

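The argument skip can be used to indicate how many lines should be skipped before reading in data entries. For instance, setting skip = 4 will skip the first 4 lines of the file. Lines are skipped before anything else happens, so if col_names = TRUE the column names are taken from the first line that remains.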

The argument n_max, on the other hand, allows you to control the maximum number of lines read. For instance, setting n_max = 1 tells R to read only a single line of the data file. Note, however, that if col_names = TRUE the header of the data file (i.e. the line of column names) is not counted towards the n_max total. For example, setting n_max = 0 will create a tibble (data frame) with no entries read, but with the names of the columns preserved.

Let’s see skip and n_max in action. Suppose we modified our my_data object and added a few irrelevant entries at the beginning and the end of the object.


my_data_modified <- c("Hello, this text is, irrelevant
As is this line
studendID,test1,test2,grade 
student1,90,85,A
student2,30,46,F
student3,70,80,B
student4, NA,68,C
student5,NA,NA,F
student6,LWD,LWD,LWD")

Suppose we wanted to read in the data by skipping the first 2 lines and the last (sixth) entry in my_data_modified. We can accomplish this by running the following code:


read_csv(my_data_modified, skip = 2, n_max = 5)
#> Rows: 5 Columns: 4
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (2): studendID, grade
#> dbl (2): test1, test2
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5 × 4
#>   studendID test1 test2 grade
#>   <chr>     <dbl> <dbl> <chr>
#> 1 student1     90    85 A    
#> 2 student2     30    46 F    
#> 3 student3     70    80 B    
#> 4 student4     NA    68 C    
#> 5 student5     NA    NA F
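The n_max = 0 behaviour described earlier can be checked in the same way. This is a small sketch on made-up data (small_data and header_only are hypothetical names; readr 2.0 or later is assumed):

```r
library(readr)

# Hypothetical data with a header line and two rows
small_data <- "id,score
a,1
b,2"

# The header does not count towards n_max, so no rows are read
header_only <- read_csv(small_data, n_max = 0, show_col_types = FALSE)

names(header_only)  # column names are kept
nrow(header_only)   # number of data rows read
```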

Finally, the argument na allows you to indicate which values have been used to encode NAs (i.e. missing values). By default, the function treats empty fields and the string NA as missing (see the help file, ?read_csv). If the data file you are reading in encodes missing values as -99 or as empty fields, you can specify that by setting na = c("-99", "") in the arguments of the function.
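As an illustration, the sketch below reads a made-up data set in which missing scores are coded either as -99 or as an empty field. The names coded_data and cleaned are hypothetical, and readr 2.0 or later is assumed.

```r
library(readr)

# Hypothetical data: -99 and an empty field both mean "missing"
coded_data <- "id,score,grade
a,10,A
b,-99,B
c,,C"

# Tell read_csv() which strings should become NA
cleaned <- read_csv(coded_data, na = c("-99", ""), show_col_types = FALSE)
cleaned

# Flags which scores were read as missing
is.na(cleaned$score)
```

Keep in mind that supplying na replaces the default, so with this setting the string NA would no longer be treated as missing unless you add it to the vector.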

24.4 Common Mistakes & Errors

  • When importing your data using read_csv(), make sure that the path to the file is specified correctly, with no spelling mistakes in the folder or file names.

  • Make sure that the data file is really missing column names before you proceed to change the col_names argument.
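One way to guard against the first mistake is to check that the path exists before reading. The helper read_csv_if_exists() below is hypothetical (it is not part of readr) and is only a sketch of the idea; it relies on base R’s file.exists() and assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical helper: read a CSV only if the path exists,
# otherwise report the path and the current working directory
read_csv_if_exists <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path, " (working directory: ", getwd(), ")")
    return(NULL)
  }
  read_csv(path, show_col_types = FALSE)
}

# The example path from earlier in this lesson; it probably does not
# exist on your machine, in which case the result is NULL
result <- read_csv_if_exists("my_folder/my_subfolder/my_data_file.csv")
```

Printing the working directory is helpful because relative paths are resolved from there.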

24.5 Next Steps

So far you’ve learned how to import a basic delimited data file in R using read_csv(). In the next tutorial you will see how to load other types of data and import more complicated data formats.

24.6 Exercises

24.6.1 Question 1

The power of read_csv() is that it allows us to read any type of delimited data.

  1. True
  2. False

24.6.2 Question 2

read_csv() will automatically recognize your missing values regardless of how they are encoded in your data.

  1. True
  2. False

24.6.3 Question 3

Consider the following data

my_new_data <- c("studendID,test1,test2,grade
(textID),(points out of 100),(points out of 100),(letter grade)
student1,90,85,A
student2,30,46,F
student3,70,80,B
student4, NA,68,C
student5,NA,NA,F")

Suppose you think the second line of data entries is not relevant for your analysis, and you would like to omit it when reading in the data. Which of the code lines below will achieve that?

  1. read_csv(my_new_data, skip = 2, col_names = FALSE)
  2. read_csv(my_new_data, skip = 2, col_names = TRUE)
  3. read_csv(my_new_data, skip = 1, col_names = TRUE)
  4. read_csv(my_new_data, skip = 2, col_names = c("studendID","test1","test2","grade"))

24.6.4 Question 4

Suppose you have the following data:

c("Sam,10,9,8
Steve,10,6,5
Jane,9,9,10
Marc,8,10,7")

To correctly read in these data using read_csv() you would need to set:

  1. col_names = TRUE
  2. skip = 0
  3. skip = 1
  4. n_max = 5
  5. col_names = FALSE

24.6.5 Question 5

Refer to the data from Question 3. Which of the following lines of code would extract the column names?

  1. my_col_names <- names(read_csv(my_new_data, col_names = FALSE))
  2. my_col_names <- names(my_new_data)
  3. my_col_names <- names(read_csv(my_new_data, skip = 1))
  4. my_col_names <- names(read_csv(my_new_data, n_max = 0))

24.6.6 Question 6

Refer again to the data from Question 3 and your answer from Question 5. Which of the following lines of code will read the data, including the correct column names, for the first two students only?

  1. read_csv(my_new_data, skip = 2, col_names = FALSE, n_max = 4)
  2. read_csv(my_new_data, skip = 2, col_names = FALSE, n_max = 2)
  3. read_csv(my_new_data, skip = 0, col_names = my_col_names, n_max = 2)
  4. read_csv(my_new_data, skip = 2, col_names = my_col_names, n_max = 2)

24.6.7 Question 7

Referring to the data from Question 3, which of the following commands would correctly read in only the data for Student 3?

  1. read_csv(my_new_data, skip = 2, col_names = my_col_names, n_max = 3)
  2. read_csv(my_new_data, skip = 4, col_names = my_col_names, n_max = 3)
  3. read_csv(my_new_data, skip = 4, col_names = my_col_names, n_max = 1)
  4. read_csv(my_new_data, skip = 2, col_names = my_col_names, n_max = 1)

24.6.8 Question 8

Referring to the data from Question 3 once more, which line of code correctly reads in the data for Student 5 only?

  1. read_csv(my_new_data, skip = 6, col_names = my_col_names, n_max = 1)
  2. read_csv(my_new_data, skip = 5, col_names = my_col_names, n_max = 0)
  3. read_csv(my_new_data, skip = 6, col_names = FALSE, n_max = 1)
  4. read_csv(my_new_data, skip = 5, col_names = FALSE, n_max = 0)

24.6.9 Question 9

Select the correct answer for the following scenario: If, in a read_csv() call, you set col_names = FALSE and n_max = 0 the result will be that R will

  1. initiate an empty tibble with the correct column names associated with the data file you supplied.
  2. initiate an empty tibble with generic column names.
  3. throw an error since n_max would need to be set to 1 in this case.

24.6.10 Question 10

In a read_csv() call, setting the skip argument to be skip = 3 will result in a tibble whose first entry is always the 4th line in the data file, regardless of whether the col_names argument is set to TRUE or FALSE.

  1. True
  2. False