63 Simulating datasets for regression

Written by Mariam Walaa and last updated on 7 October 2021.

63.1 Introduction

In this lesson, you will learn how to:

  • Simulate data for regression

Prerequisite skills include:

Highlights

  • We can recover the linear regression model from simulated data

63.2 Overview

In the previous section, we learned about how to simulate data. We can build regression models from this simulated data. However, another thing we can do with these functions is build simulated data for regression models.

For example, suppose you were not given a data set but instead was told of the distributions of some variables and given their coefficients for a linear regression model. We can use this information to create the simulated data.

63.3 Idea

To simulate a data set, try writing it down on paper first, and then think about which parts are random, then translate it to code! For example, if you think some number \(y\) is related linearly to \(x\) with a slope of 0.3, with some random measurement error, you could write it down on paper like this:

\[ y = 0.3\cdot x + error \]

Translating it to code might look like:

x <- 2 # set some value of x
measurement_error <- rnorm(1) # normally distributed measurement error

y <- 0.3*x + rnorm(1) # calculate value of y

63.4 Example

Suppose we are given distributions of weights and heights for a population of 50 people, and a linear regression equation for this data, which is the slope and intercept. We can then use this information to simulate the data from the slope and intercept. Lets start with loading the tidyverse.

We can simulate the data as follows.

set.seed(2)

data                error = rnorm(n = 50, mean = 0, sd = 1),
               height = weight * 1.2 + error)

Lets see what this relationship looks like with the regression line and data points plotted.

data %>% 
  ggplot(aes(x = weight, y = height)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, formula = 'y ~ x')

We may use this data set to build a regression models using the lm() function as follows.

lm(height ~ weight, data = data)
#> 
#> Call:
#> lm(formula = height ~ weight, data = data)
#> 
#> Coefficients:
#> (Intercept)       weight  
#>      0.9387       1.1915

As we expect, building this linear regression model successfully recovers the original slope of 1.2.

63.5 Exercises

There are many functions involved when it comes to trying to simulate a data set. These exercises will help you better learn which functions to use for which parts of the simulation.

63.5.1 Exercise 1

When simulating a data set, you will likely need to work with various types of variables – continuous variables, discrete variables, numeric variables, and non-numeric variables.

When it comes to discrete variables (whether numeric, non-numeric, or a mix of both), you will likely want to see a repetition of a finite set of values. There is a certain function out of the handful of functions we learned about for simulating datasets that will be helpful for this task.

For this exercise, youll try to simulate data points for a discrete variable representing the number of times you might pick out a red ball from a jar full of blue, red, and yellow balls. Suppose youll pick out a ball 10 times and put it back every time, and that each ball has equal probability of being picked.

Fill in the blanks for the code below to create data that represents the above scenario.

sample(  = c("Red", "Blue", "Yellow"),
            = 10,
               = TRUE,
            = c(0.33, 0.33, 0.33))

You’ll notice that there are 4 blanks in total that you’ll need to fill.

63.5.2 Exercise 2

Suppose we are given a regression model with a slope of 1.2, and we want to simulate a dataset from the regression model similar to the previous scenario. We have the below code to generate the data. What is wrong with this data-generating code below?

63.6 Common Mistakes & Errors

Here are some common mistakes and errors you may come across:

  • You may be confusing some of the arguments for different distribution functions. Make sure you are using runif() to sample points from the uniform distribution and rnorm() to sample points from the normal distribution.
  • You may be misusing some of the arguments for a function like sample(). Make sure you read the argument descriptions as well as the given examples in the documentation.
  • You may forget to set the seed before running a chunk of code, or you may be using a different value for a seed to obtain results you previously got with your code.

63.7 Next Steps

If you would like to learn more about these functions, read the documentation associated with each of the functions. But if you would like to learn more about simulating datasets from a regression model, please take a look at the following:

63.8 Exercises

63.8.1 Question 1

63.8.2 Question 2

63.8.3 Question 3

63.8.4 Question 4

63.8.5 Question 5

63.8.6 Question 6

63.8.7 Question 7

63.8.8 Question 8

63.8.9 Question 9

63.8.10 Question 10