14 Making reproducible examples
Written by Marija Pejcinovska and last updated on 5 Feb 2022.
14.1 Introduction
In this lesson you will:
- learn about using reproducible examples to solve R issues
- learn how to create reproducible examples
- get introduced to packages that ease the pain of making reproducible examples
Prerequisite skills include:
- Installing packages, troubleshooting, and asking for help
14.2 Seeking help the helpful way
At this point in the toolkit you have probably picked up a troubleshooting trick or two. In fact, you may have already googled how to code up something in R or have come across useful discussions on Stack Exchange or Stack Overflow.
Most of the time, you are likely to face issues many R users have grappled with before. In a way, this is nice. It means it should be much easier to find a solution to your problem online, usually with only a few google searches. Other times, however, you might come across a more challenging problem and the solution might not be as readily available online. In such a situation — assuming you’ve exhausted all other debugging and troubleshooting options — it’s possible that you would need to ask for some help online. Fortunately, as you’ve been making your way through this toolkit, you have probably already learned where to turn for help when needed.
At this point, knowing how to create a reproducible example (or reprex for short) would come in handy. The idea of a reprex is exactly what you might expect: it is a minimal, illustrative example that will reproduce the issue you are facing, so that others could help you solve it more easily.
Not all reprexes are created equal, however. In order for you to engage productively with the R community and get the help you need, you should help the reader understand your problem with as little effort on their part as possible. This means that a reprex needs to be minimal and self-contained, so that those helping you can simply copy and paste your R code, reproduce the error and understand your issue… and hopefully help you solve it, of course.
14.3 Elements of a reprex
So, what does a reprex look like?
All reproducible examples have these few elements in common:
- A minimal data set on which the code can be run and the error reproduced.
- A minimal, runnable piece of code that is well formatted and produces the same error when run on the supplied minimal data set.
- Calls to any necessary R packages, plus any relevant pieces of information about your R environment when the issue was encountered.
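As a rough illustration of how these elements fit together, here is a minimal sketch in base R. The data frame `toy` and its columns are made up for this example, and the `tryCatch()` wrapper is only there so the error message can be stored and displayed; in a real reprex you would let the line fail on its own.

```r
# 1. Packages: load whatever the snippet needs (nothing beyond base R here).

# 2. Minimal data: a tiny, made-up data frame that still triggers the problem.
toy <- data.frame(
  id    = 1:3,
  score = c("10", "12", "9"),  # stored as text by mistake
  stringsAsFactors = FALSE
)

# 3. Minimal code: the one line that misbehaves.
# In a real reprex you would run `toy$score + 1` directly and let it fail;
# tryCatch() is used here only so the error message can be captured.
problem <- tryCatch(
  toy$score + 1,
  error = function(e) conditionMessage(e)
)
problem
```

Everything a helper needs is in those few lines: nothing has to be downloaded, and nothing depends on files that exist only on your computer.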
14.4 Creating a data set for a reprex
You should always provide some data along with your reproducible example, so that you can easily illustrate your issue.
At times, it might not be possible, or advisable, to share the actual data you’ve been working on. In fact, it is probably more practical (and likely, much simpler) to create a “toy” data set which will produce the same error or issue you’ve been dealing with.
Here are a few ways of doing this.
- You could create your own data by using functions such as those in the tibble package. As an example, consider the data frame below. Here we’ve used the tibble() function to create a very small data frame with 4 variables (x, y, z, and w), each with 4 observations. (Note: tibble() in the tidyverse is similar to the data.frame() function in base R.) An alternative way of creating a data frame is through the tribble() function, where tribble is simply short for transposed tibble. Its syntax is visually easier to read but a little less common (call ?tribble at the R prompt to launch the help file).
library(tibble)
first_toy_data <- tibble(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = x + x^2,
  w = c("yes", "no", "yes", "yes")
)
first_toy_data
#> # A tibble: 4 × 4
#> x y z w
#> <int> <chr> <dbl> <chr>
#> 1 1 a 2 yes
#> 2 2 b 6 no
#> 3 3 c 12 yes
#> 4 4 d 20 yes
- Alternatively, you could consider using some of the built-in R data sets. To see the available data sets, try typing data() at the prompt in your R console. As you keep learning R you’ll notice many examples that feature, say, the mtcars or iris data sets. These data sets do get a little boring over time, but you shouldn’t really worry about the dullness of your data when making a reprex. In fact, feel free to bore your helpers with your choices. If you do end up using built-in data, you might want to consider using only a subset of the data set. You can do this by making use of the head() or sample() functions in base R, or slice() from dplyr.
- Finally, whatever data you are sharing, you might find the dput() function helpful. The function writes a text representation of your data which others can then copy and paste into their own R scripts to get your data. dput() essentially generates the R code necessary to recreate your data. The output from dput() would look something like this:
dput(first_toy_data)
#> structure(list(x = 1:4, y = c("a", "b", "c", "d"), z = c(2, 6,
#> 12, 20), w = c("yes", "no", "yes", "yes")), class = c("tbl_df",
#> "tbl", "data.frame"), row.names = c(NA, -4L))
Running the output from dput() will create exactly the data object first_toy_data we created above. This is, indeed, the behavior we want.
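The subsetting and dput() ideas above can be tried out with base R alone. The sketch below uses the built-in mtcars data rather than first_toy_data so that it runs without any packages; the object names (`small_cars`, `random_rows`, `rebuilt_cars`) and the seed value are arbitrary choices made for this illustration.

```r
# Keep the example small: a few rows and columns of a built-in data set.
small_cars <- head(mtcars[, c("mpg", "cyl", "gear")], 4)

# A random subset works too; set.seed() makes the same rows come back each time.
set.seed(14)
random_rows <- mtcars[sample(nrow(mtcars), 4), c("mpg", "cyl", "gear")]

# Capture what dput() prints, the way a helper would copy it from your post.
dput_text <- paste(capture.output(dput(small_cars)), collapse = "\n")

# Evaluating that text rebuilds the object; compare it with the original.
rebuilt_cars <- eval(parse(text = dput_text))
all.equal(small_cars, rebuilt_cars)
```

In practice you would simply paste the printed dput() output into your post; the capture-and-evaluate step here only imitates what the person helping you does on their side.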
14.5 Adding code
The code in your reproducible example should be easy to understand and stripped down to the barest version that still allows those helping you to replicate your errors.
Here are a few things you might want to keep in mind as you are readying code for a reprex:
- Make sure you only include necessary code!! This simply means including only enough code to reproduce your problem, not everything that is in your R script.
- Try to format the code properly. This will make it easier for people to read and understand it. If you are unsure of the recommended formatting styles, check out the tutorial on coding style in this toolkit (or consult Hadley Wickham’s tidyverse style guide).
- Comment your code if necessary (recall that we use # to begin a comment in an R code chunk).
- Don’t copy and paste code from the console!! Console output contains characters that would make it difficult for folks to re-run your code without doing additional work. This might be an even bigger problem if you post your copied console output to an online forum. The special characters in the console output might be interpreted as special formatting symbols which can render your post unreadable!
- If the code that created your data uses any random generation of values (e.g. sample(), rnorm(), runif(), etc.), you need to use the set.seed() function. This will fix the starting number used in generating a sequence of random values, making your data easy to replicate exactly.
- Always test your code in a new, empty R session! This means that before you upload any code and ask for help on Stack Overflow or Slack or RStudio community forums, you should make sure that the code runs outside of the R environment where it was created.
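To see what set.seed() buys you, here is a small sketch. The seed value 2022 is arbitrary; any fixed number works, as long as the same one appears in the code you share.

```r
# Without a seed, two calls to rnorm() give different numbers.
unseeded_a <- rnorm(3)
unseeded_b <- rnorm(3)

# With the same seed set before each call, the draws are identical.
set.seed(2022)
first_draw <- rnorm(3)

set.seed(2022)
second_draw <- rnorm(3)

identical(first_draw, second_draw)
```

Anyone who runs your snippet with the same seed gets the same "random" data you did, so the error they see is the error you saw.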
A useful tool for making reproducible examples is the reprex package. We’ll see reprex in action shortly.
14.6 Packages and any other relevant information
Along with the data and code, when asking for help you should always remember to add the relevant packages at the top of your script. If you use certain functions or package-specific data sets, you need to specify the required package (by adding library(your_package_name_here) at the top of your code snippet), otherwise your code will not be exactly reproducible.
In certain situations, it might be useful to add a bit more information.
If you are reporting an unusual error, or believe you have come across a bug in some function or feature of a package, you might need to report the version of R you are using and possibly the operating system. In most cases, sharing the version of R or the operating system would be sufficient, but sometimes you may also need to share the output of sessionInfo(). Note that while the calls to packages are absolutely essential, sharing your version of R or of a specific package might not always be necessary.
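If you do need to report details about your setup, base R can produce them for you. The sketch below shows a few options; `stats` is used as the example package only because it ships with R, so substitute whichever package your question is about.

```r
# Version of R, as one line of text.
R.version.string

# Operating system the session is running on.
Sys.info()[["sysname"]]

# Version of one specific package (here `stats`, which ships with R).
packageVersion("stats")

# Everything at once: R version, platform, and attached packages.
sessionInfo()
```

The first three lines are usually enough for a forum post; the full sessionInfo() output is longer and is best kept for bug reports.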
14.7 A closer look at the reprex package
Making reproducible examples is not always easy. Fortunately, there is an R package which makes some of that work a little bit easier!
Below is a quick overview of how you can create a reproducible example with reprex.
- Start by installing the reprex package
- We can do this by installing it from CRAN:
install.packages("reprex")
- or, by fetching the development version from GitHub
devtools::install_github("tidyverse/reprex")
- In an R script write your code, including the data and all necessary calls to packages. For instance, suppose the code in your script is just as the one below.
library(tidyverse)
mpg %>%
  ggplot(aes(x=displ, y=hwy)) %>%
  geom_point(aes(color=class)) %>%
  geom_smooth()
- Highlight the code (including the library statement) and copy it to your clipboard.
- In your console type reprex() and press Enter (you’ll need to wait a second or two for R to render your reproducible example; remember also that you need to have the reprex package loaded before you attempt this!).
- Once the reprex has been rendered, it is automatically stored on your clipboard and you can simply paste it online and share it with others.
To see what R actually generates once reprex() has been called in the console we’ll paste the content of the clipboard below.
library(tidyverse)
mpg %>%
  ggplot(aes(x=displ, y=hwy)) %>%
  geom_point(aes(color=class)) %>%
  geom_smooth()
#> Error: `mapping` must be created by `aes()`
#> Did you use %>% instead of +?
Created on 2021-01-18 by the reprex package (v0.3.0)
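The error message in the rendered reprex already hints at the fix: ggplot2 layers are combined with +, not with the pipe. A corrected version might look like the sketch below, which assumes the ggplot2 package is installed (the mpg data set ships with it); the object name `p` is just a label chosen for this illustration.

```r
library(ggplot2)

# Layers are combined with `+`; `%>%` would pass the plot object into
# geom_point() as its first argument, which is what caused the error above.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()

# Printing the object draws the plot.
p
```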
14.9 Next steps
In this lesson you learned the basics of making a reproducible example. If you are interested in some additional resources, consider the following list of do’s and don’ts from the folks that made the reprex package: https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html
For even more information on reprex, check out Jenny Bryan’s webinar on creating reproducible examples with reprex: https://reprex.tidyverse.org/articles/articles/learn-reprex.html
14.10 Exercises
14.10.1 Question 1
The reprex() function helps you identify the errors in your code, so that you can avoid asking for help on Stack Overflow or Stack Exchange.
- True
- False
14.10.2 Question 2
Pick the most appropriate answer from the list below: A good reproducible example should be
- minimal.
- self-contained.
- able to reproduce the same error you have.
- All of the above.
14.10.3 Question 3
Suppose you are interested in seeking help from the online RStudio community for an error you get based on the code below.
library(tidyverse)
tibble(
  group = c("trtmnt", "control", "control", "control", "trtmnt"),
  msrmnt = rnorm(5, 5, 0.5), # rnorm(5, 5, 0.5) generates 5 random normal values
                             # centered at 5 with sd of 0.5
                             # use ?rnorm to check out the function's arguments
  improvement = c("yes", "no", "no", "no", "no")
) %>%
  mutate(new_var = msrmnt + improvement)
This is a well designed reproducible example.
- True
- False
14.10.4 Question 4
Referring to Question 3 above, what is needed to make the example truly reproducible?
- Changing the variable type of the variable improvement.
- Adding a call to the package reprex.
- Adding a set.seed() command.
- Assigning the tibble to an R object.
14.10.5 Question 5
As a way of verifying the functionality of your reproducible example, it is sufficient to test it out within the R environment you’ve been working in.
- True
- False
14.10.6 Question 6
Consider the following code using the starwars built-in data set.
starwars %>%
  slice_head(n=30) %>%
  group_by(homeworld) %>%
  summarise(eye_color_counts = count(eye_color))
Regardless of what the error in this code is, this is an example of a good reprex.
- True
- False
14.10.7 Question 7
Referring to Question 6 above, which of the following would make this a good reproducible example?
- adding set.seed(1234).
- adding library(tidyverse).
- adding dat <- starwars %>% ... .
- adding library(reprex).
14.10.8 Question 8
Consider the following code chunk. Suppose the data file my_reprex_data is a short, stripped-down version of your actual data and you plan on using it in your reproducible example. Suppose also you’ve made sure that my_reprex_data is actually a good version of the data to use in a reproducible example.
setwd("Documents/my_projects/stats_projects")
my_reprex_data %>%
  group_by(type) %>%
  summarise(new_rounds = tally(rounds)) # error occurs after call to summarise
Is this a good example of a reproducible example?
- Yes
- No
14.10.9 Question 9
Referring to the code chunk in Question 8, why is this not a good example?
- The setwd() call will only work on your computer.
- The rounds variable does not exist, hence the error.
- Calls to relevant packages are not listed.
- my_reprex_data is not a reproducible data file.
- Only a. and b. are correct.
- Only a. and c. are correct.
- All of a. through d. are correct.
- Only a., b., and c. are correct.
- Only a., c., and d. are correct.
14.10.10 Question 10
Refer once more to the example code in Question 8. Which of the following steps would make it a good reprex?
- Remove the setwd(...) command, define my_reprex_data, and add library(tidyr).
- Add library(tidyverse), remove setwd(...), and generate new random data.
- Add library(tidyverse), use dput(my_reprex_data) to define your data, and remove setwd(...).