14 Making reproducible examples

Written by Marija Pejcinovska and last updated on 5 Feb 2022.

14.1 Introduction

In this lesson you will:

learn about using reproducible examples to solve R issues
learn how to create reproducible examples
get introduced to packages that ease the pain of making reproducible examples

Prerequisite skills include:

Installing packages, troubleshooting, and asking for help

14.2 Seeking help the helpful way

At this point in the toolkit you have probably picked up a troubleshooting trick or two. In fact, you may have already googled how to code up something in R or have come across useful discussions on Stack Exchange or Stack Overflow.

Most of the time, you are likely to face issues many R users have grappled with before. In a way, this is nice. It means it should be much easier to find a solution to your problem online, usually with only a few google searches. Other times, however, you might come across a more challenging problem and the solution might not be as readily available online. In such a situation — assuming you’ve exhausted all other debugging and troubleshooting options — it’s possible that you would need to ask for some help online. Fortunately, as you’ve been making your way through this toolkit, you have probably already learned where to turn for help when needed.

At this point, knowing how to create a reproducible example (or reprex for short) would come in handy. The idea of a reprex is exactly what you might expect: it is a minimal, illustrative example that will reproduce the issue you are facing, so that others could help you solve it more easily.

Not all reprexes are created equal, however. In order for you to engage productively with the R community and get the help you need, you should help the reader understand your problem with as little effort on their part as possible. This means that a reprex needs to be minimal and self-contained, so that those helping you can simply copy and paste your R code, reproduce the error and understand your issue… and hopefully help you solve it, of course.

14.3 Elements of a reprex

So, what does a reprex look like?

All reproducible examples have these few elements in common:

a minimal data set on which the code can be run and the error reproduced.
a minimal, runnable piece of code that is well formatted and produces the same error when run on the supplied minimal data set.
calls to any necessary R packages, plus any relevant information about your R environment at the time the issue was encountered.

14.4 Creating a data set for a reprex

You should always provide some data along with your reproducible example, so that you can easily illustrate your issue.

At times, it might not be possible, or advisable, to share the actual data you’ve been working on. In fact, it is probably more practical (and likely, much simpler) to create a “toy” data set which will produce the same error or issue you’ve been dealing with.

Here are a few ways of doing this.

You could create your own data by using functions such as those in the tibble package. As an example, consider the data frame below. Here we’ve used the tibble() function to create a very small data frame with 4 variables (x, y, z, and w), each with 4 observations. (Note: tibble() in the tidyverse is similar to the data.frame() function in base R.) An alternative way of creating a data frame is the tribble() function, where tribble is short for transposed tibble. Its row-by-row syntax is visually easier to read but a little less common (call ?tribble at the R prompt to launch the help file).



library(tibble)

first_toy_data <- tibble(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = x + x^2,
  w = c("yes", "no", "yes", "yes")
)
first_toy_data
#> # A tibble: 4 × 4
#>       x y         z w    
#>   <int> <chr> <dbl> <chr>
#> 1     1 a         2 yes  
#> 2     2 b         6 no   
#> 3     3 c        12 yes  
#> 4     4 d        20 yes
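
As a hedged sketch of the tribble() alternative mentioned above, the same toy data can be laid out row by row. The object name second_toy_data is ours, chosen only for illustration, and the z values are typed in by hand because each cell is written out individually.

```r
library(tibble)

# Column names are given as formulas (~name) on the first line;
# the values then follow one row at a time.
second_toy_data <- tribble(
  ~x, ~y,  ~z, ~w,
  1L, "a",  2, "yes",
  2L, "b",  6, "no",
  3L, "c", 12, "yes",
  4L, "d", 20, "yes"
)
second_toy_data
```

The row-wise layout makes it easy to see which values belong to the same observation, which is handy when a reprex depends on a particular combination of values.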

Alternatively, you could consider using some of the built-in R data sets. To see the available data sets, try typing data() at the prompt in your R console. As you keep learning R you’ll notice many examples that feature, say, the mtcars or iris data sets. These data sets do get a little boring over time, but you shouldn’t really worry about the dullness of your data when making a reprex. In fact, feel free to bore your helpers with your choices. If you do end up using built-in data, you might want to consider using only a subset of the data set. You can do this with the head(), sample(), or slice() functions.
Finally, whatever data you are sharing, you might find the dput() function helpful. The function writes a text representation of your data, which others can then copy and paste into their own R scripts to get your data. In other words, dput() generates the R code necessary to recreate your data. The output from dput() would look something like this:


dput(first_toy_data)
#> structure(list(x = 1:4, y = c("a", "b", "c", "d"), z = c(2, 6, 
#> 12, 20), w = c("yes", "no", "yes", "yes")), class = c("tbl_df", 
#> "tbl", "data.frame"), row.names = c(NA, -4L))

Running the output from dput() will create exactly the data object first_toy_data we created above. This is, indeed, the behavior we want. Run the code below to see for yourself.


structure(list(x = 1:4, y = c("a", "b", "c", "d"), z = c(2, 6, 
12, 20), w = c("yes", "no", "yes", "yes")), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))
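
The subsetting and dput() ideas above can be combined. The sketch below uses only base R and the built-in mtcars data; the object names, the choice of columns, and the seed value are our own assumptions for illustration.

```r
# Keep only the first five rows and two columns of the built-in mtcars data.
small_mtcars <- head(mtcars[, c("mpg", "cyl")], 5)

# Print R code that recreates the subset; this is the text you would share.
dput(small_mtcars)

# A random subset also works, as long as the seed is fixed first.
set.seed(123)  # arbitrary seed, chosen only for illustration
random_rows <- mtcars[sample(nrow(mtcars), 5), c("mpg", "cyl")]

# Check that the dput() text really rebuilds the same object.
dput_text <- paste(capture.output(dput(small_mtcars)), collapse = "\n")
rebuilt <- eval(parse(text = dput_text))
all.equal(rebuilt, small_mtcars)
```

Sharing the dput() text of a small subset keeps the post short while still letting helpers recreate the object without access to your files.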

14.5 Adding code

The code in your reproducible example should be easy to understand and stripped down to the barest version that still allows those helping you to replicate your errors.

Here are a few things you might want to keep in mind as you are readying code for a reprex:

Make sure you only include necessary code! This means including just enough code to reproduce your problem, not everything that is in your R script.
Try to format the code properly. This will make it easier for people to read and understand it. If you are unsure of the recommended formatting styles, check out the tutorial on coding style in this toolkit (or consult Hadley Wickham’s tidyverse style guide).
Comment your code if necessary (recall that we use # to begin a comment in an R code chunk).
Don’t copy and paste code from the console! Console output contains characters, such as the prompt symbol, that make it difficult for others to re-run your code without doing additional work. This might be an even bigger problem if you post your copied console output to an online forum. The special characters in the console output might be interpreted as formatting symbols, which can render your post unreadable!
If the code that created your data uses any random generation of values (e.g. sample(), rnorm(), runif(), etc.), you need to call set.seed() first. This fixes the starting point used in generating a sequence of random values, making your data easy to replicate exactly.
Always test your code in a new, empty R session! This means that before you upload any code and ask for help on Stack Overflow, Slack, or the RStudio Community forums, you should make sure that the code runs outside of the R environment where it was created.
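
The set.seed() point in the list above can be sketched with base R alone. The seed value 2022 and the object names are arbitrary choices made for this illustration.

```r
# Setting the same seed before each call makes the random draws match exactly.
set.seed(2022)  # arbitrary seed value, chosen only for illustration
first_draw <- rnorm(3, mean = 5, sd = 0.5)

set.seed(2022)
second_draw <- rnorm(3, mean = 5, sd = 0.5)

identical(first_draw, second_draw)
#> [1] TRUE
```

Anyone who runs your reprex with the same seed should then see the same toy data you did.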

A useful tool for making reproducible examples is the reprex package. We’ll see reprex in action shortly.

14.6 Packages and any other relevant information

Along with the data and code, when asking for help you should always remember to load the relevant packages at the top of your script. If you use functions or data sets that come from a specific package, you need to load that package (by adding library(your_package_name_here) at the top of your code snippet); otherwise your code will not be reproducible.

In certain situations, it might be useful to add a bit more information. If you are reporting an unusual error or believe you have come across a bug in some function or feature of a package, you might need to report the version of R you are using and possibly your operating system. In most cases, sharing the version of R or the operating system is sufficient, but sometimes you may also need to share the output of sessionInfo(). Note that while the calls to packages are absolutely essential, sharing your version of R or of a specific package is not always necessary.
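
A minimal sketch of how this information could be collected with base R is shown below. The object names are ours, and the stats package is used only because it ships with R; substitute whichever package is relevant to your issue.

```r
# Version of R as a single string.
r_version <- R.version.string
r_version

# Platform string, which includes the operating system.
r_platform <- R.version$platform
r_platform

# Version of one specific package (swap in the package you need).
pkg_version <- packageVersion("stats")
pkg_version

# Full summary: R version, platform, locale, and attached packages.
sessionInfo()
```

The first three pieces are usually short enough to mention in a sentence, while the sessionInfo() output is better pasted at the end of a post.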

14.7 A closer look at the reprex package

Making reproducible examples is not always easy. Fortunately, there is an R package which makes some of that work a little bit easier!

Below is a quick overview of how you can create a reproducible example with reprex.

Start by installing the reprex package

We can do this by installing it from CRAN:

install.packages("reprex")

or by fetching the development version from GitHub:

devtools::install_github("tidyverse/reprex")

In an R script, write your code, including the data and all necessary calls to packages. For instance, suppose the code in your script looks like the one below.

library(tidyverse)
mpg %>% 
  ggplot(aes(x=displ, y=hwy)) %>% 
  geom_point(aes(color=class)) %>% 
  geom_smooth()

Highlight the code (including the library statement) and copy it to your clipboard.
In your console, type reprex() and press Enter (you’ll need to wait a second or two for R to render your reproducible example; remember also that you need to have the reprex package loaded before you attempt this!).
Once the reprex has been rendered, it is automatically stored on your clipboard, and you can simply paste it online and share it with others.
To see what R actually generates once reprex() has been called in the console, we’ll paste the content of the clipboard below.

library(tidyverse)
mpg %>% 
  ggplot(aes(x=displ, y=hwy)) %>% 
  geom_point(aes(color=class)) %>% 
  geom_smooth()
#> Error: `mapping` must be created by `aes()`
#> Did you use %>% instead of +?

Created on 2021-01-18 by the reprex package (v0.3.0)
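
As an alternative to the clipboard workflow above, reprex() also accepts code directly as an expression wrapped in braces. The sketch below assumes the reprex package is installed; the two-line snippet inside the braces is a made-up illustration rather than code from this lesson, and the call is guarded so that it only runs in an interactive session.

```r
# Check whether reprex is installed before trying to use it.
reprex_available <- requireNamespace("reprex", quietly = TRUE)

if (interactive() && reprex_available) {
  reprex::reprex(
    {
      x <- c(1, 2, NA)
      mean(x)  # NA, because na.rm is FALSE by default
    },
    venue = "r",         # render as a commented R script
    session_info = TRUE  # append session details to the rendered reprex
  )
}
```

Passing the code as an expression avoids depending on whatever happens to be on the clipboard, and session_info = TRUE attaches the environment details discussed earlier in this lesson.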

14.9 Next steps

In this lesson you learned the basics of making a reproducible example. If you are interested in some additional resources, consider the following list of do’s and don’ts from the folks that made the reprex package: https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html

For even more information on reprex, check out Jenny Bryan’s webinar on creating reproducible examples with reprex: https://reprex.tidyverse.org/articles/articles/learn-reprex.html

14.10 Exercises

14.10.1 Question 1

The reprex() function helps you identify the errors in your code, so that you can avoid asking for help on Stack Overflow or Stack Exchange.

True
False

14.10.2 Question 2

Pick the most appropriate answer from the list below: A good reproducible example should be

minimal.
self-contained.
able to reproduce the same error you have.
All of the above.

14.10.3 Question 3

Suppose you are interested in seeking help from the online RStudio community for an error you get based on the code below.

library(tidyverse)
tibble(
  group = c("trtmnt", "control", "control", "control", "trtmnt"),
  msrmnt = rnorm(5, 5, 0.5), # rnorm(5, 5, 0.5) generates 5 random normal values
                             # centered at 5 with sd of 0.5
                             # use ?rnorm to check out the function's arguments
  improvement = c("yes", "no", "no", "no", "no")
) %>% 
  mutate(new_var = msrmnt + improvement)

This is a well designed reproducible example.

True
False

14.10.4 Question 4

Referring to Question 3 above, what is needed to make the example truly reproducible?

Changing the variable type of the variable improvement.
Adding a call to the package reprex.
Adding a set.seed() command.
Assigning the tibble to an R object.

14.10.5 Question 5

As a way of verifying the functionality of your reproducible example, it is sufficient to test it out within the R environment you’ve been working in.

True
False

14.10.6 Question 6

Consider the following code using the starwars built-in data set.


starwars %>% 
  slice_head(n=30) %>% 
  group_by(homeworld) %>% 
  summarise(eye_color_counts = count(eye_color))

Regardless of what the error in this code is, this is an example of a good reprex.

True
False

14.10.7 Question 7

Referring to Question 6 above, which of the following would make this a good reproducible example?

adding set.seed(1234).
adding library(tidyverse).
adding dat <- starwars %>% ....
adding library(reprex).

14.10.8 Question 8

Consider the following code chunk. Suppose the data file my_reprex_data is a short, stripped-down version of your actual data and you plan on using it in your reproducible example. Suppose also you’ve made sure that my_reprex_data is actually a good data set to use in a reproducible example.

setwd("Documents/my_projects/stats_projects")

my_reprex_data %>% 
  group_by(type) %>% 
  summarise(new_rounds = tally(rounds)) # error occurs after call to summarise

Is this a good example of a reproducible example?

14.10.9 Question 9

Referring to the code chunk in Question 8, why is this not a good example?

a. The setwd() call will only work on your computer.
b. The rounds variable does not exist, hence the error.
c. Calls to relevant packages are not listed.
d. my_reprex_data is not a reproducible data file.
e. Only a. and b. are correct.
f. Only a. and c. are correct.
g. All of a. through d. are correct.
h. Only a., b., and c. are correct.
i. Only a., c., and d. are correct.

14.10.10 Question 10

Refer once more to the example code in Question 8. Which of the following steps would make it a good reprex?

Remove the setwd(...) command, define my_reprex_data, and add library(tidyr).
Add library(tidyverse), remove setwd(...), and generate new random data.
Add library(tidyverse), use dput(my_reprex_data) to define your data, and remove setwd(...).
