25 Reading tables, dta, and other data types
Written by Isaac Ehrlich and last updated on 7 October 2021.
25.1 Introduction
Unfortunately, not all data that you may need to read into R will be stored in nice, easily-readable ‘.csv’ files. To handle alternate data file types, the tidyverse provides a series of read-in functions beyond read_csv(). These include read_table(), read_table2(), read_fwf(), and read_delim() from readr and read_excel() from readxl for tabular data; read_file() and read_lines() for non-tabular data; and functions in the tidyverse haven package, such as read_sas(), read_sav(), and read_dta(), for reading data formats from other statistical packages.
In this lesson, you will learn how to:
- Read in a variety of alternate data file types, including tabular, non-tabular, and statistical package data
Prerequisite skills include:
- Installing packages and calling libraries
- Knowledge of read_csv() and its arguments
25.2 Tabular Data
The following functions all take similar arguments to read_csv(), which are broken down in the previous module. The following sections will instead focus on the different types of files and formats each function is capable of reading.
read_table():
read_table() reads tabular text data where each column is separated by one or more spaces. It is frequently used to read in space-delimited .txt files but can handle other text file types as well. In versions of readr before 2.0.0, read_table() requires that each line is the same length and that each column is in the same position on every line; from readr 2.0.0 onward, it drops this requirement.
read_table2():
read_table2() is similar to read_table(), but does not require each line to be the same length. As of readr 2.0.0 it is deprecated, because read_table() now behaves the same way.
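As a minimal sketch of the behaviour described above, the following writes a small space-delimited file to a temporary location and reads it back with read_table(). The file contents and the column names (name, age, height) are hypothetical and exist only for illustration.

```r
library(readr)

# Hypothetical space-delimited data, written to a temporary file
path <- tempfile(fileext = ".txt")
writeLines(
  c("name  age height",
    "Ana    34    162",
    "Ben    29    180"),
  path
)

# Columns are separated by one or more spaces
people <- read_table(path)
people
```

Every line in this example has the same length and column positions, so it should be read the same way by both older and newer versions of readr.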
read_fwf():
read_fwf() reads fixed-width files. These files are not delimited by a separator character; instead, each column occupies the same character positions on every line, hence “fixed width.” read_fwf() takes the additional argument col_positions, which specifies where each column begins and ends and is usually built with a helper such as fwf_widths() or fwf_positions().
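A short sketch of reading a fixed-width file, assuming a made-up layout of a two-character id, a three-character code, and a two-digit year. The data and column names are hypothetical; fwf_widths() is the readr helper used here to build col_positions from column widths.

```r
library(readr)

# Hypothetical fixed-width data: no delimiters, columns defined by position
path <- tempfile(fileext = ".txt")
writeLines(
  c("AB12019",
    "CD34520",
    "EF00721"),
  path
)

# id = 2 characters, code = 3 characters, year = 2 characters
records <- read_fwf(
  path,
  col_positions = fwf_widths(c(2, 3, 2), col_names = c("id", "code", "year")),
  col_types = "cci"  # character, character, integer
)
records
```

Setting col_types explicitly keeps the code column as text, so a value such as "007" retains its leading zeros.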
read_delim():
read_delim() is the general case of read_csv(): rather than defaulting to the comma, it lets the user specify the delimiter through the additional argument delim, the single character that separates columns in the raw file.
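For example, a pipe-delimited file could be read along these lines. The file contents and column names are hypothetical.

```r
library(readr)

# Hypothetical pipe-delimited data written to a temporary file
path <- tempfile(fileext = ".txt")
writeLines(
  c("id|fruit|count",
    "1|apple|3",
    "2|pear|5"),
  path
)

# delim names the single character that separates columns
fruit <- read_delim(path, delim = "|")
fruit
```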
read_excel():
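read_excel(), from the readxl package, reads .xls and .xlsx files (Microsoft Excel file types). Although readxl is installed with the tidyverse, it is not attached by library(tidyverse), so it must be loaded separately with library(readxl). read_excel() takes the additional argument sheet, which specifies which sheet of an Excel file to read, either by the name of the sheet as a string or by its index. If the argument is not specified, read_excel() defaults to the first sheet.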
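The sketch below illustrates the sheet argument using the example workbook that ships with readxl (located with readxl_example()), so no file of your own is needed. It assumes that workbook contains a sheet named "mtcars", as in the readxl documentation.

```r
library(readxl)

# Example workbook bundled with the readxl package
path <- readxl_example("datasets.xlsx")

# List the sheet names available in the workbook
excel_sheets(path)

# Select a sheet by name, by position, or fall back to the first sheet
by_name  <- read_excel(path, sheet = "mtcars")
by_index <- read_excel(path, sheet = 2)
first    <- read_excel(path)
```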
25.3 Non-Tabular Data
read_file():
read_file() reads an entire file into a single string, returned as a character vector of length one.
read_lines():
read_lines() reads each line of a file as a separate string and returns them as a character vector with one element per line.
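To make the difference between the two functions concrete, here is a minimal sketch that reads the same hypothetical three-line text file both ways.

```r
library(readr)

# Hypothetical plain-text file with three lines
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

# The whole file as one string
whole <- read_file(path)
length(whole)

# One string per line
lines <- read_lines(path)
length(lines)
lines[2]
```

Here whole has length one, while lines has one element for each line of the file.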
25.4 Statistical Package Data (Using the Tidyverse haven Package)
read_sas():
read_sas() reads .sas7bdat and .sas7bcat files generated in SAS.
read_sav():
read_sav() reads .sav files generated in SPSS.
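As a sketch, both functions can be tried on the small example files that are distributed with haven, assuming the installed version includes iris.sas7bdat and iris.sav in its examples folder (as used in the haven documentation).

```r
library(haven)

# Example files bundled with the haven package
sas_path <- system.file("examples", "iris.sas7bdat", package = "haven")
sav_path <- system.file("examples", "iris.sav", package = "haven")

# Each function returns a tibble
iris_sas <- read_sas(sas_path)
iris_sav <- read_sav(sav_path)

head(iris_sas)
head(iris_sav)
```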
read_dta():
read_dta() reads .dta files generated in Stata. One particularly common way to use it is in combination with labelled::to_factor(). read_dta() keeps Stata value labels as attributes on the labelled columns rather than applying them to the values; to_factor() converts those columns to factors so that the labels appear in the dataset itself. For instance, a typical usage is something like:
my_dta_dataset <-
  haven::read_dta("my_dta_dataset.dta")
# The Stata format separates labels so reunite those
my_dta_dataset <-
  labelled::to_factor(my_dta_dataset)
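To see what this does without needing a Stata file on hand, the following sketch builds a small hypothetical dataset with one labelled column, writes it to a temporary .dta file with haven::write_dta(), and reads it back. The dataset, its column names, and its labels are made up for illustration, and the labelled package is assumed to be installed.

```r
library(haven)

# Hypothetical survey data: sex is stored as codes 1/2 with value labels
survey <- tibble::tibble(
  id  = 1:3,
  sex = labelled(c(1, 2, 1), labels = c(Male = 1, Female = 2))
)

# Round-trip through a temporary Stata file
path <- tempfile(fileext = ".dta")
write_dta(survey, path)
raw <- read_dta(path)

# The column still holds the numeric codes, with labels attached as attributes
class(raw$sex)

# Convert labelled columns to factors so the labels become the values
converted <- labelled::to_factor(raw)
converted$sex
```

After the conversion, the sex column is a factor whose values are the label text rather than the underlying codes.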