57 head, tail, glimpse and summary
Written by Haoluan Chen and last updated on 7 October 2021.
57.1 Introduction
In this lesson, you will learn how to:
Prerequisite skills include:
- Set up RStudio
- Run R code in the console
- Install and load packages
Highlights:
After you load your dataset into R, you should start looking into the data to see what kinds of data you are working with.
Here are some useful functions that can help you to understand your dataset.
57.2 head()
The head() function is typically called with two arguments. The first is the data frame, and the second, n, is the number of rows you want to see from the top. (The “head” of your dataset.)
head(mtcars, n = 3)
#> mpg cyl disp hp drat wt qsec vs am gear
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4
#> carb
#> Mazda RX4 4
#> Mazda RX4 Wag 4
#> Datsun 710 1
Here I have set ‘n’ to 3, so we are looking at the first three rows of the mtcars dataset.
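As a small aside beyond the example above, here is a sketch of two other ways to call head(). It assumes only base R and the built-in mtcars dataset, and the variable names are made up for illustration. Leaving out n falls back to the function's default, and a negative n keeps everything except that many rows at the end.

```r
# mtcars ships with R, so no packages are needed here.
first_default <- head(mtcars)          # n is omitted, so head() uses its default
all_but_last  <- head(mtcars, n = -3)  # negative n: drop the last 3 rows instead

nrow(first_default)  # how many rows the default returns
nrow(all_but_last)   # the 32 rows of mtcars minus the 3 that were dropped
```

Running the two nrow() calls is a quick way to confirm how many rows each call returned.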
57.3 tail()
The tail() function is called the same way. The first argument is the data frame, and the second, n, is the number of rows you want to see from the bottom. (The “tail” of your dataset.)
tail(mtcars, n = 3)
#> mpg cyl disp hp drat wt qsec vs am gear
#> Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5
#> Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5
#> Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4
#> carb
#> Ferrari Dino 6
#> Maserati Bora 8
#> Volvo 142E 2
Here I have set ‘n’ to 3, so we are looking at the last three rows of the mtcars dataset.
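To see both ends of a dataset at once, one option is to call head() and tail() and stack the results. The sketch below assumes the built-in mtcars dataset; top_rows, bottom_rows, and both_ends are illustrative names, and rbind() is a base R function that is not otherwise covered in this lesson.

```r
top_rows    <- head(mtcars, n = 3)  # first three rows
bottom_rows <- tail(mtcars, n = 3)  # last three rows

# Row names survive both calls, so you can tell which cars were selected.
rownames(top_rows)
rownames(bottom_rows)

# Stack the two pieces into one six-row data frame.
both_ends <- rbind(top_rows, bottom_rows)
both_ends
```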
57.4 glimpse()
The glimpse() function, which is available once you load the dplyr package, takes one argument: the data frame. It tells you the number of rows and columns in your dataset, along with the name, data type, and first few observations of each variable.
glimpse(mtcars)
#> Rows: 32
#> Columns: 11
#> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.…
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, …
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360…
#> $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123…
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.6…
#> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.5…
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.…
#> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, …
#> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, …
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, …
Here, we see that the table mtcars contains 32 rows and 11 columns of data. All of the variables in this table are double-precision floating-point numbers, as shown by the <dbl> label next to each variable name.
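If you want to cross-check these facts without glimpse(), roughly the same information can be pulled out with base R. This is an alternative sketch, not a description of how glimpse() works internally: dim() returns the row and column counts, and sapply() with typeof() reports how each column is stored. The name col_types is only for illustration.

```r
dim(mtcars)  # number of rows, then number of columns

# Storage type of every column, returned as a named character vector.
col_types <- sapply(mtcars, typeof)
col_types

# TRUE when every column is stored as a double.
all(col_types == "double")
```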
57.5 summary()
Next, you may want to look at the summary statistics of your dataset. The summary() function produces the following summary statistics for each numeric variable.
- Min.: minimum value of the variable
- 1st Qu.: the first quartile of the variable
- Median: median of the variable
- Mean: mean of the variable
- 3rd Qu.: the third quartile of the variable
- Max.: maximum value of the variable
summary(mtcars)
#> mpg cyl disp
#> Min. :10.40 Min. :4.000 Min. : 71.1
#> 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8
#> Median :19.20 Median :6.000 Median :196.3
#> Mean :20.09 Mean :6.188 Mean :230.7
#> 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0
#> Max. :33.90 Max. :8.000 Max. :472.0
#> hp drat wt
#> Min. : 52.0 Min. :2.760 Min. :1.513
#> 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581
#> Median :123.0 Median :3.695 Median :3.325
#> Mean :146.7 Mean :3.597 Mean :3.217
#> 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610
#> Max. :335.0 Max. :4.930 Max. :5.424
#> qsec vs am
#> Min. :14.50 Min. :0.0000 Min. :0.0000
#> 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000
#> Median :17.71 Median :0.0000 Median :0.0000
#> Mean :17.85 Mean :0.4375 Mean :0.4062
#> 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000
#> Max. :22.90 Max. :1.0000 Max. :1.0000
#> gear carb
#> Min. :3.000 Min. :1.000
#> 1st Qu.:3.000 1st Qu.:2.000
#> Median :4.000 Median :2.000
#> Mean :3.688 Mean :2.812
#> 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :5.000 Max. :8.000
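To make the six statistics more concrete, the sketch below recomputes them for a single column (mpg) using base R functions and places them next to what summary() reports. The names mpg and by_hand are illustrative. Printed values may differ slightly in the last digit because summary() rounds what it displays.

```r
mpg <- mtcars$mpg

# Compute each statistic separately with base R functions.
by_hand <- c(
  Min    = min(mpg),
  Q1     = unname(quantile(mpg, 0.25)),  # first quartile
  Median = median(mpg),
  Mean   = mean(mpg),
  Q3     = unname(quantile(mpg, 0.75)),  # third quartile
  Max    = max(mpg)
)

by_hand
summary(mpg)  # the same six numbers, as reported by summary()
```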
What happens if you have other data types in your dataset? Here is a dataset called scores. It contains three variables: student_ID, gender, and test_score.
library(dplyr)  # provides tibble() and glimpse()
scores <- tibble(
  student_ID = c("1", "2", "3", "4", "5", "6"),
  gender = as.factor(c("male", "male", "male", "female", "female", "female")),
  test_score = c(87, 76, 61, 80, 72, 69)
)
scores
#> # A tibble: 6 × 3
#> student_ID gender test_score
#> <chr> <fct> <dbl>
#> 1 1 male 87
#> 2 2 male 76
#> 3 3 male 61
#> 4 4 female 80
#> 5 5 female 72
#> 6 6 female 69
glimpse(scores)
#> Rows: 6
#> Columns: 3
#> $ student_ID <chr> "1", "2", "3", "4", "5", "6"
#> $ gender <fct> male, male, male, female, female, female
#> $ test_score <dbl> 87, 76, 61, 80, 72, 69
Using the glimpse function, we know that student_ID is a character data type, gender is a factor data type, and test_score is a double data type.
summary(scores)
#> student_ID gender test_score
#> Length:6 female:3 Min. :61.00
#> Class :character male :3 1st Qu.:69.75
#> Mode :character Median :74.00
#> Mean :74.17
#> 3rd Qu.:79.00
#> Max. :87.00
For a character variable (student_ID), we see the Length, Class, and Mode. Length tells us the number of observations, while Class and Mode tell us the data type.
For a factor variable (gender), we get the count of each level. In this dataset, there are three female students and three male students.
For a double variable (test_score), we get the same summary statistics we saw before.
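The same behaviour can be seen by calling summary() on single vectors rather than on a whole table. The sketch below rebuilds two of the columns as standalone base R vectors (the names student_id, gender, and gender_counts are illustrative) so that it runs without any packages.

```r
student_id <- c("1", "2", "3", "4", "5", "6")
gender <- factor(c("male", "male", "male", "female", "female", "female"))

# Character vector: summary() reports Length, Class, and Mode.
summary(student_id)

# Factor: summary() returns the number of observations in each level.
gender_counts <- summary(gender)
gender_counts
```

Factor levels are sorted alphabetically by default, which is why female is listed before male.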
57.6 Exercises
57.6.2 Exercise 3
Here, we have a book dataset from Alex Cookson. This dataset contains 9,000 children’s books that have been rated from 1 to 5 stars. Run the following code in R and use the functions you learned in this tutorial to explore this dataset!
library(readr)  # provides read_tsv()
books <- read_tsv("https://raw.githubusercontent.com/tacookson/data/master/childrens-book-ratings/childrens-books.txt")
57.7 Next Steps
Once you understand the dataset you are working with, you can start using plots to get a graphical representation of it. You may like to read this chapter for more information: https://r4ds.had.co.nz/data-visualisation.html.
57.8 Exercises
57.8.1 Question 1
What is the first parameter for head()?
a. The number of rows in your dataset
b. A data frame
c. A number giving how many rows to show from the top
d. A string giving how many rows to show from the top
57.8.2 Question 2
What does the following code produce?
head(data, n = 3)
a. The last three rows of your data
b. The first three columns of your data
c. The first three rows of your data
d. The last three columns of your data
57.8.3 Question 3
What can we learn about our data from the glimpse() output? (Multiple answers)
a. Number of rows and columns
b. Name of each variable
c. Data type of each variable
d. Summary statistics of your dataset
57.8.4 Question 4
glimpse() gives us the first few observations of each variable.
a. True
b. False
57.8.5 Question 5
summary() does not produce the mean value.
a. True
b. False
57.8.6 Question 6
If you want to look at the first 3 rows of the mtcars dataset, which code should you use?
a. head(mtcars, 3)
b. tail(mtcars, 3)
c. glimpse(mtcars)
d. summary(mtcars)
57.8.7 Question 7
If you want to look at the last 3 rows of the mtcars dataset, which code should you use?
a. head(mtcars, 3)
b. tail(mtcars, 3)
c. glimpse(mtcars)
d. summary(mtcars)
57.8.8 Question 8
What is the output of the summary() function for a factor variable?
a. Data type
b. Summary statistics such as min and max
c. Number of factors in the variable
d. Count of each factor
57.8.9 Question 9
What is the output of the summary() function for a double variable?
a. Data type
b. Summary statistics such as min and max
c. Number of factors in the variable
d. Count of each factor
57.8.10 Question 10
What is the default n for head()?
a. 3
b. 5
c. 10
d. 6