57 head, tail, glimpse and summary

Written by Haoluan Chen and last updated on 7 October 2021.

57.1 Introduction

In this lesson, you will learn how to:

Prerequisite skills include:

  • Set up RStudio
  • Run R code in the console
  • Install and load packages

Highlights:

After you load your dataset into R, you should start by looking at the data to see what kinds of values and variables you are working with.

Here are some useful functions that can help you to understand your dataset.

57.3 tail()

The tail() function also takes two parameters. The first parameter is the data frame, and the second parameter, n, is the number of rows you want to look at, counted from the end (the “tail”) of your dataset.

tail(mtcars, n = 3)
#>                mpg cyl disp  hp drat   wt qsec vs am gear
#> Ferrari Dino  19.7   6  145 175 3.62 2.77 15.5  0  1    5
#> Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5
#> Volvo 142E    21.4   4  121 109 4.11 2.78 18.6  1  1    4
#>               carb
#> Ferrari Dino     6
#> Maserati Bora    8
#> Volvo 142E       2

Here I have set ‘n’ to 3, so we are looking at the last three rows of the mtcars dataset.
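
As a minimal sketch of how the n parameter behaves, the lines below call head() and tail() on the built-in mtcars dataset. The variable names first_three and last_three are only illustrative and do not come from the lesson.

```r
# mtcars ships with base R (datasets package), so no packages are needed.
first_three <- head(mtcars, n = 3)   # first three rows
last_three  <- tail(mtcars, n = 3)   # last three rows

rownames(first_three)
#> [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"
rownames(last_three)
#> [1] "Ferrari Dino"  "Maserati Bora" "Volvo 142E"

# When n is omitted, both functions default to six rows.
nrow(head(mtcars))
#> [1] 6
nrow(tail(mtcars))
#> [1] 6
```

Naming the argument (n = 3) is optional, but it makes the call easier to read than a bare number.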

57.4 glimpse()

The glimpse() function, which is available once you load the dplyr package, takes one parameter: the data frame. It tells you the number of rows and columns in your dataset. It also shows the name, data type, and first few observations of each variable.

glimpse(mtcars)
#> Rows: 32
#> Columns: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.…
#> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, …
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360…
#> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123…
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.6…
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.5…
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.…
#> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, …
#> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, …
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, …

Here, we see that the table mtcars contains 32 rows and 11 columns of data. All of the variables in this table are double-precision floating-point numbers, because <dbl> represents the double data type.
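
If you want to confirm these facts in code rather than by reading the printout, base R offers equivalents. The sketch below is one possible check, not part of the original lesson; dims and col_types are illustrative names.

```r
# Base R checks that mirror parts of the glimpse() output.
dims <- dim(mtcars)                   # c(number of rows, number of columns)
col_types <- sapply(mtcars, typeof)   # storage type of each column

dims
#> [1] 32 11
col_types[["mpg"]]
#> [1] "double"
all(col_types == "double")            # is every column a double?
#> [1] TRUE
```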

57.5 summary()

Next, you may want to look at the summary statistics of your dataset. The summary() function produces the following summary statistics for each numeric variable:

  • Min. : minimum value of the variable
  • 1st Qu. : the first quartile of the variable
  • Median : median of the variable
  • Mean : mean of the variable
  • 3rd Qu. : the third quartile of the variable
  • Max. : maximum value of the variable

summary(mtcars)
#>       mpg             cyl             disp      
#>  Min.   :10.40   Min.   :4.000   Min.   : 71.1  
#>  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
#>  Median :19.20   Median :6.000   Median :196.3  
#>  Mean   :20.09   Mean   :6.188   Mean   :230.7  
#>  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
#>  Max.   :33.90   Max.   :8.000   Max.   :472.0  
#>        hp             drat             wt       
#>  Min.   : 52.0   Min.   :2.760   Min.   :1.513  
#>  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581  
#>  Median :123.0   Median :3.695   Median :3.325  
#>  Mean   :146.7   Mean   :3.597   Mean   :3.217  
#>  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610  
#>  Max.   :335.0   Max.   :4.930   Max.   :5.424  
#>       qsec             vs               am        
#>  Min.   :14.50   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000  
#>  Median :17.71   Median :0.0000   Median :0.0000  
#>  Mean   :17.85   Mean   :0.4375   Mean   :0.4062  
#>  3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000  
#>  Max.   :22.90   Max.   :1.0000   Max.   :1.0000  
#>       gear            carb      
#>  Min.   :3.000   Min.   :1.000  
#>  1st Qu.:3.000   1st Qu.:2.000  
#>  Median :4.000   Median :2.000  
#>  Mean   :3.688   Mean   :2.812  
#>  3rd Qu.:4.000   3rd Qu.:4.000  
#>  Max.   :5.000   Max.   :8.000
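
To see where these numbers come from, the sketch below recomputes the six statistics for a single column (mpg) with base R functions. It assumes the default quantile method, which is what summary() uses; mpg_stats is an illustrative name. The results should agree with the mpg column above up to rounding.

```r
mpg <- mtcars$mpg

# Recompute each statistic that summary() reports for a numeric variable.
mpg_stats <- c(
  min    = min(mpg),
  q1     = quantile(mpg, 0.25, names = FALSE),  # first quartile
  median = median(mpg),
  mean   = mean(mpg),
  q3     = quantile(mpg, 0.75, names = FALSE),  # third quartile
  max    = max(mpg)
)
mpg_stats
```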

What happens if you have other data types in your dataset? Here is a dataset called scores. It contains three variables: student_ID, gender, and test_score.

# tibble() comes from the tibble package (also loaded by the tidyverse)
library(tibble)

scores <- tibble(
  student_ID = c("1", "2", "3", "4", "5", "6"),
  gender     = as.factor(c("male", "male", "male", "female", "female", "female")),
  test_score = c(87, 76, 61, 80, 72, 69)
)
scores
#> # A tibble: 6 × 3
#>   student_ID gender test_score
#>   <chr>      <fct>       <dbl>
#> 1 1          male           87
#> 2 2          male           76
#> 3 3          male           61
#> 4 4          female         80
#> 5 5          female         72
#> 6 6          female         69
glimpse(scores)
#> Rows: 6
#> Columns: 3
#> $ student_ID <chr> "1", "2", "3", "4", "5", "6"
#> $ gender     <fct> male, male, male, female, female, female
#> $ test_score <dbl> 87, 76, 61, 80, 72, 69

From the glimpse() output, we can see that student_ID is a character data type, gender is a factor data type, and test_score is a double data type.

summary(scores)
#>   student_ID           gender    test_score   
#>  Length:6           female:3   Min.   :61.00  
#>  Class :character   male  :3   1st Qu.:69.75  
#>  Mode  :character              Median :74.00  
#>                                Mean   :74.17  
#>                                3rd Qu.:79.00  
#>                                Max.   :87.00

For the character data type (student_ID), we see the Length, Class, and Mode of the variable. Length tells us the number of observations, while Class and Mode tell us the data type.

For the factor data type (gender), we have the count of each level of the factor. In this dataset, there are three female students and three male students.

For the double data type (test_score), we have the same summary statistics we saw before.
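
The same behaviour can be seen by calling summary() on one vector at a time. The sketch below rebuilds the three columns as plain base R vectors so that it needs no packages; gender_counts and score_stats are illustrative names.

```r
# Rebuild the three columns of scores as standalone vectors.
student_ID <- c("1", "2", "3", "4", "5", "6")
gender     <- factor(c("male", "male", "male", "female", "female", "female"))
test_score <- c(87, 76, 61, 80, 72, 69)

summary(student_ID)                  # character: Length, Class, Mode
gender_counts <- summary(gender)     # factor: number of observations per level
score_stats   <- summary(test_score) # double: the six summary statistics

gender_counts[["female"]]
#> [1] 3
score_stats[["Median"]]
#> [1] 74
```

summary() chooses its output based on the class of the object it receives, which is why one call on a whole data frame can describe each column differently.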

57.6 Exercises

57.6.1 Exercise 1

57.6.2 Exercise 3

Here, we have a book dataset from Alex Cookson. This dataset contains 9,000 children’s books that have been rated from 1 to 5 stars. Run the following code in R and use the functions you learned in this tutorial to explore the dataset!

library(readr)  # provides read_tsv(); also loaded by the tidyverse

books <- 
  read_tsv("https://raw.githubusercontent.com/tacookson/data/master/childrens-book-ratings/childrens-books.txt")

57.7 Next Steps

Once you understand the dataset you are working with, you can start using plots to get a graphical representation of it. You may like to read this chapter for more information: https://r4ds.had.co.nz/data-visualisation.html.

57.8 Exercises

57.8.1 Question 1

What is the first parameter of head()?
  a. The number of rows in your dataset
  b. A data frame
  c. A number giving how many of the first rows to show
  d. A string giving how many of the first rows to show

57.8.2 Question 2

What does the following code produce? head(data, n = 3)
  a. The last three rows of your data
  b. The first three columns of your data
  c. The first three rows of your data
  d. The last three columns of your data

57.8.3 Question 3

What can we learn about our data from the glimpse() output? (Multiple answers)
  a. The number of rows and columns
  b. The name of each variable
  c. The data type of each variable
  d. Summary statistics of your dataset

57.8.4 Question 4

glimpse() gives us the first few observations of each variable.
  a. True
  b. False

57.8.5 Question 5

summary() does not produce the mean value.
  a. True
  b. False

57.8.6 Question 6

If you want to look at the first 3 rows of the mtcars dataset, which code should you use?
  a. head(mtcars, 3)
  b. tail(mtcars, 3)
  c. glimpse(mtcars)
  d. summary(mtcars)

57.8.7 Question 7

If you want to look at the last 3 rows of the mtcars dataset, which code should you use?
  a. head(mtcars, 3)
  b. tail(mtcars, 3)
  c. glimpse(mtcars)
  d. summary(mtcars)

57.8.8 Question 8

What is the output of the summary() function for the factor data type?
  a. The data type
  b. Summary statistics such as min and max
  c. The number of factor levels in the variable
  d. The count of each factor level

57.8.9 Question 9

What is the output of the summary() function for the double data type?
  a. The data type
  b. Summary statistics such as min and max
  c. The number of factor levels in the variable
  d. The count of each factor level

57.8.10 Question 10

What is the default n for head()?
  a. 3
  b. 5
  c. 10
  d. 6