Online Appendix D — Papers

One way to build understanding of material is by using it. The purpose of these papers is to give you a chance to implement what you have learnt in a real-world setting. Completing the papers is also important from the perspective of building a portfolio for job applications.

D.1 Donaldson Paper

D.1.1 Task

  • Working individually and in an entirely reproducible way, please find a dataset of interest on Open Data Toronto and write a short paper telling a story about the data.
    • Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You are welcome to use this starter folder.
    • Find a dataset of interest on Open Data Toronto.
      • Put together an R script, “scripts/00-simulation.R”, that simulates the dataset of interest. Push to GitHub and include an informative commit message.
      • Write an R script, “scripts/00-download_data.R”, to download the actual data in a reproducible way using opendatatoronto (Gelfand 2022). Save the data: “inputs/data/unedited_data.csv” (or whatever file type the file is). Push to GitHub and include an informative commit message.
    • Prepare a PDF using Quarto “outputs/paper/paper.qmd” with these sections: title, author, date, abstract, introduction, data, and references.
      • The title should be descriptive, informative, and specific.
      • The date should be in an unambiguous format. Add a link to the GitHub repo in the acknowledgments.
      • The abstract should be three or four sentences. The abstract must tell the reader the top-level finding. What is the one thing that we learn about the world because of this paper?
      • The introduction should be two or three paragraphs of content. And there should be an additional final paragraph that sets out the remainder of the paper.
      • The data section should thoroughly and precisely discuss the source of the data and the bias this brings (ethical, statistical, and otherwise). Comprehensively describe and summarize the data using text, graphs, and tables. Graphs must be made with ggplot2 (Wickham 2016) and tables must be made with knitr (Xie 2023) or gt (Iannone et al. 2022). Graphs must show the actual data, or as close to it as possible, not summary statistics. Graphs and tables should be cross-referenced in the text (e.g. “Table 1 shows…”).
      • References should be added using BibTeX. Be sure to reference R and any R packages you use, as well as the dataset. Strong submissions will draw on related literature and reference those.
      • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at an educated, but non-specialist, audience.
      • Use appendices for supporting, but not critical, material.
      • Push to GitHub and include an informative commit message.
  • Submit a PDF of your paper.
  • There should be no evidence that this is a class assignment.
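
As one illustration of the simulation step in the task above, a minimal sketch in base R might look like the following. The dataset shape (hypothetical daily shelter occupancy counts), the variable names, the distribution parameters, and the seed are all assumptions made for illustration; a real script would mirror the variables of whichever dataset is actually chosen.

```r
#### Preamble ####
# Purpose: Simulate a hypothetical dataset before downloading the real one.
# Note: every variable name, value, and the seed below is an illustrative
# assumption, not something specified by the task.

# Set a seed so that the simulation is reproducible
set.seed(853)

number_of_days <- 365

simulated_data <-
  data.frame(
    # One row per day of a hypothetical year
    date = seq(
      from = as.Date("2022-01-01"),
      by = "day",
      length.out = number_of_days
    ),
    # A hypothetical categorical variable
    shelter = sample(
      x = c("Shelter A", "Shelter B", "Shelter C"),
      size = number_of_days,
      replace = TRUE
    ),
    # A hypothetical count variable
    occupancy = rpois(n = number_of_days, lambda = 30)
  )

head(simulated_data)
```

Simulating first forces a decision about what each variable should look like, which then gives something concrete to compare the downloaded data against.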

D.1.2 Checks

  • There should be no R code or raw R output in the final PDF.
  • Code should be entirely reproducible, well-documented, commented, and readable.
  • The paper should knit directly to PDF i.e. use “Knit to PDF”.
    • Do not use “Knit to html” and then save as a PDF.
    • Do not use “Knit to Word” and then save as a PDF.
  • Graphs, tables, and text should be clear, and of comparable quality to those of FiveThirtyEight.
  • The date should be up-to-date and unambiguous (e.g. 2/3/2022 is ambiguous, 2 March 2022 is not).
  • The entire workflow should be entirely reproducible.
  • There should not be any typos.
  • There should be no sign this is a school paper.
  • There must be a link to the paper’s GitHub repo using a footnote.
  • The GitHub repo should be well-organized, and contain an informative README.
  • The paper should be well-written and able to be understood by the average reader of, say, FiveThirtyEight. This means that you are allowed to use mathematical notation, but you must explain all of it in plain language. All statistical concepts and terminology must be explained. Your reader is someone with a university education, but not necessarily someone who understands what a p-value is.

D.1.3 FAQ

  • Can I use a dataset from Kaggle instead? No, because they have done the hard work for you.
  • I cannot use code to download my dataset, can I just manually download it? No, because your entire workflow needs to be reproducible. Please fix the download problem or pick a different dataset.
  • How much should I write? Most students submit something in the two-to-six-page range, but it is up to you. Be precise and thorough.
  • My data is about apartment blocks/NBA/League of Legends so there’s no ethical or bias aspect, what do I do? Please re-read the relevant chapter and readings to better understand bias and ethics. If you really cannot think of something, then it might be worth picking a different dataset.
  • Can I use Python? No. If you already know Python then it does not hurt to learn another language.
  • Why do I need to cite R, when I don’t need to cite Word? R is a free statistical programming language with academic origins, so it is appropriate to acknowledge the work of others. It is also important for reproducibility.
  • What reference style should I use? Any major reference style is fine (APA, Harvard, Chicago, etc.); just pick one that you are used to.
  • The paper in the starter folder has a model section, so do I need to put together a model? No. The starter folder is designed to be applicable to all of the papers; just delete the aspects that you do not need.
  • The paper in the starter folder has a data sheets appendix, so do I need to put together a data sheet? No. The starter folder is designed to be applicable to all of the papers; just delete the aspects that you do not need.
  • What does “graph the actual data” mean? If you have, say, 5,000 observations in the dataset and three variables, then for every variable there should be a graph that has 5,000 points in the case of dots, or adds up to 5,000 in the case of bar charts and histograms.
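
One way to check the point in the last answer is to confirm that the counts behind a histogram or bar chart add up to the number of observations. The sketch below uses simulated data and base R's hist() and table() purely to do that arithmetic. The variable names and values are hypothetical, and the graphs in the paper itself would still be made with ggplot2.

```r
set.seed(853)

number_of_observations <- 5000

# Hypothetical variables: one continuous and one categorical
example_data <-
  data.frame(
    height = rnorm(n = number_of_observations, mean = 170, sd = 10),
    group = sample(
      x = c("a", "b", "c"),
      size = number_of_observations,
      replace = TRUE
    )
  )

# Bin the continuous variable as a histogram would, without drawing it
histogram_counts <- hist(example_data$height, plot = FALSE)$counts

# Count the categorical variable as a bar chart would
bar_counts <- table(example_data$group)

# Both of these comparisons should hold if every observation is shown
sum(histogram_counts) == number_of_observations
sum(bar_counts) == number_of_observations
```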

D.1.4 Rubric

| Component | Range | Requirement |
| --- | --- | --- |
| R is appropriately cited | 0 - 'No'; 1 - 'Yes' | Must be referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall. |
| Title | 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' | An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper. |
| Author, date, and repo | 0 - 'Poor or not done'; 2 - 'Yes' | The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK'). |
| Abstract | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper. |
| Introduction | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total. |
| Data | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | A sense of the dataset should be communicated to the reader. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. |
| Measurement | 0 - 'Poor or not done'; 1 - 'Exceptional' | Some aspect of measurement, relating to the dataset, is mentioned in the data section. |
| Cross-references | 0 - 'Poor or not done'; 2 - 'Yes' | All figures, tables, and equations should be numbered, and referred to in the text using cross-references. |
| Prose | 0 - 'Poor or not done'; 2 - 'Yes' | All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. |
| Graphs/tables/etc | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables. |
| Reference list | 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' | All data, software, literature, and any other relevant material should be cited in-text and included in a reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. |
| Commits | 0 - 'Poor or not done'; 2 - 'Excellent' | There are at least two different commits, and they have meaningful commit messages. |
| Simulation | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The script is clearly commented and structured. All variables are appropriately simulated. |
| Tests | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Data and code tests are appropriately used. |
| Reproducibility | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard-coded file paths must not be used. |
| Code style | 0 - 'Poor or not done'; 1 - 'Exceptional' | Code is appropriately styled. |
| General excellence | 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' | There are always students who excel in a way that is not anticipated in the rubric. This item accounts for that. |
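
The 'Tests' component of the rubric can be made concrete with a few assertions about the data. The following is a minimal sketch using base R's stopifnot(). The data frame, its columns, and the specific checks are hypothetical, and a dedicated testing package could be used instead.

```r
# Hypothetical cleaned data; in practice this would be read from a file
# using a relative path rather than typed in
cleaned_data <-
  data.frame(
    date = as.Date(c("2022-01-01", "2022-01-02", "2022-01-03")),
    shelter = c("Shelter A", "Shelter B", "Shelter A"),
    occupancy = c(28L, 31L, 30L)
  )

# Each condition must be TRUE; stopifnot() raises an error at the first
# condition that fails and is silent otherwise
stopifnot(
  # The date column has the expected class
  inherits(cleaned_data$date, "Date"),
  # Counts are whole numbers and cannot be negative
  is.integer(cleaned_data$occupancy),
  all(cleaned_data$occupancy >= 0),
  # There are no missing values anywhere
  !any(is.na(cleaned_data)),
  # Only the expected categories appear
  all(cleaned_data$shelter %in% c("Shelter A", "Shelter B", "Shelter C"))
)
```

The same checks can usually be run against both the simulated and the actual data, which is one reason to write the simulation first.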

D.1.5 Previous examples

D.2 Mawson Paper

D.2.1 Task

  • Working as part of a team of one to three people, please pick a paper of interest to you, with code and data that are available, published anytime since 2019, in an American Economic Association journal. These journals are: “American Economic Review”, “AER: Insights”, “AEJ: Applied Economics”, “AEJ: Economic Policy”, “AEJ: Macroeconomics”, “AEJ: Microeconomics”, “Journal of Economic Literature”, “Journal of Economic Perspectives”, “AEA Papers & Proceedings”. Alternatively, you may choose any article from the Institute for Replication list available here that has a replicability status of “Looking for replicator”.
  • Following the Guide for Accelerating Computational Reproducibility in the Social Sciences, please complete a replication of at least three graphs, tables, or a combination, from that paper, using the Social Science Reproduction Platform. Note the DOI of your replication.
  • Then, working in an entirely reproducible way, conduct a reproduction based on two or three aspects of the paper, and write a short paper about it.
    • Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then prepare a PDF using Quarto with these sections (you are welcome to use this starter folder): title, author, date, abstract, introduction, data, results, discussion, and references.
    • The aspects that you focus on in your paper could be the same aspects that you replicated, but they do not need to be. Follow the direction of the paper, but make it your own. That means you should ask a slightly different question, or answer the same question in a slightly different way, but still use the same dataset.
    • Include the DOI of your replication in your paper and a link to the GitHub repo that underpins your paper.
    • The results section should convey findings.
    • The discussion should include three or four sub-sections that each focus on an interesting point, and there should be another sub-section on the weaknesses of your paper, and another on potential next steps for it.
    • In the discussion section, and any other relevant section, please be sure to discuss ethics and bias, with reference to relevant literature.
    • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at an educated, but non-specialist, audience.
    • Use appendices for supporting, but not critical, material.
    • Code should be entirely reproducible, well-documented, and readable.
  • Submit a PDF of your paper.
  • There should be no evidence that this is a class assignment.

D.2.2 Checks

  • The paper should not just copy/paste the code from the original paper, but have instead used that as a foundation to work from.
  • Your paper should have a link to the associated GitHub repository and the DOI of the Social Science Reproduction Platform replication that you conducted.
  • Make sure you have referenced everything, including R. Strong submissions will draw on related literature in the discussion (and other sections) and would be sure to also reference those. The style of references does not matter, provided it is consistent.

D.2.3 FAQ

  • How much should I write? Most students submit something in the 10-to-15-page range, but it is up to you. Be precise and thorough.
  • Do I have to focus on a model result? No, it is likely best to stay away from that at this point, and instead focus on tables or graphs of summary or explanatory statistics.
  • What if the paper I choose is in a language other than R? Both your replication and reproduction code should be in R. So you will need to translate the code into R for the replication. And the reproduction should be your own work, so that also should be in R. One common language is Stata, and Huntington-Klein (2022) might be useful as a “Rosetta Stone” of sorts for R, Python, and Stata; alternatively, an LLM could help with the translation.
  • Can I work by myself? Yes.
  • Do the graphs/tables have to look identical to the original? No, you are welcome to, and should, make them look better as part of the reproduction. And even as part of the replication, they do not have to be identical, just similar enough.
  • One of my graphs has four panels, do I have to do all of them for this to be counted as one element? No, for the purpose of this paper, every panel counts as a separate element, so all you would need to do is three panels and that would be enough.
  • How do I automatically download the data if they are behind a sign-in? If the data are behind a sign-in, just add commented documentation for how to download it into the download_data.R R file, rather than code.
  • Do we need to commit our original, unedited data to GitHub if it is really big? No, you do not necessarily need to commit the original, unedited data to GitHub if it is too large; just add a note in the README explaining the situation and how to obtain the data.
  • What should the abstract and introduction be about? The abstract and introduction should reflect your own work and findings, rather than those of the original paper (although those will necessarily play some role). You are (almost surely) not replicating their entire paper, and so your abstract should be different. See the examples for guidance.

D.2.4 Rubric

| Component | Range | Requirement |
| --- | --- | --- |
| R is appropriately cited | 0 - 'No'; 1 - 'Yes' | Must be referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall. |
| Class paper | 0 - 'No'; 1 - 'Yes' | Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall. |
| Replication | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | SSRP submission needs to be filled out completely for three elements. |
| Title | 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' | An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper. |
| Author, date, and repo | 0 - 'Poor or not done'; 2 - 'Yes' | The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK'). |
| Abstract | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper. |
| Introduction | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total. |
| Estimand | 0 - 'Poor or not done'; 1 - 'Exceptional' | The estimand is clearly stated in the introduction. |
| Data | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | A sense of the dataset should be communicated to the reader. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. |
| Measurement | 0 - 'Poor or not done'; 1 - 'Exceptional' | Some aspect of measurement, relating to the dataset, is mentioned in the data section. |
| Results | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. |
| Discussion | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future? |
| Cross-references | 0 - 'Poor or not done'; 2 - 'Yes' | All figures, tables, and equations should be numbered, and referred to in the text using cross-references. |
| Prose | 0 - 'Poor or not done'; 2 - 'Yes' | All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. |
| Graphs/tables/etc | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables. |
| Reference list | 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' | All data, software, literature, and any other relevant material should be cited in-text and included in a reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. |
| Commits | 0 - 'Poor or not done'; 2 - 'Excellent' | There are at least two different commits, and they have meaningful commit messages. |
| Simulation | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The script is clearly commented and structured. All variables are appropriately simulated. |
| Tests | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Data and code tests are appropriately used. |
| Reproducibility | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard-coded file paths must not be used. |
| Code style | 0 - 'Poor or not done'; 1 - 'Exceptional' | Code is appropriately styled. |
| General excellence | 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' | There are always students who excel in a way that is not anticipated in the rubric. This item accounts for that. |

D.2.5 Previous examples

D.3 Howrah Paper

D.3.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please obtain data from the US General Social Survey. (You are welcome to use a different government-run survey, but please obtain permission before starting.)
  • Obtain the data, focus on one aspect of the survey, and then use it to tell a story.
    • Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then use Quarto to prepare a PDF with these sections (you are welcome to use this starter folder): title, author, date, abstract, introduction, data, results, discussion, an appendix that will, at least, contain a survey, and references.
    • In addition to conveying a sense of the dataset of interest, the data section should include, but not be limited to:
      • A discussion of the survey’s methodology, and its key features, strengths, and weaknesses. For instance: what is the population, frame, and sample; how is the sample recruited; what sampling approach is taken, and what are some of the trade-offs of this; how is non-response handled.
      • A discussion of the questionnaire: what is good and bad about it?
      • If this becomes too detailed, then use appendices for supporting but not essential aspects.
    • In an appendix, please put together a supplementary survey that could be used to augment the general social survey the paper focuses on. The purpose of the supplementary survey is to gain additional information on the topic that is the focus of the paper, beyond that gathered by the general social survey. The survey would be distributed in the same manner as the general social survey but needs to stand independently. The supplementary survey should be put together using a survey platform. A link to this should be included in the appendix. Additionally, a copy of the survey should be included in the appendix.
    • Please be sure to discuss ethics and bias, with reference to relevant literature.
    • Code should be entirely reproducible, well-documented, and readable.
  • Submit a PDF of the paper.
  • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at a university-educated, but non-specialist, audience. Use survey, sampling, and statistical terminology, but be sure to explain it. The paper should flow, and be easy to follow and understand.
  • There should be no evidence that this is a class paper.

D.3.2 Checks

  • An appendix should contain both a link to the supplementary survey and the details of it, including questions (in case the link fails, and to make the paper self-contained).

D.3.3 FAQ

  • What should I focus on? You may focus on any year, aspect, or geography that is reasonable given the focus and constraints of the general social survey that you are interested in. Please consider the year and topics that you are interested in together, as some surveys focus on particular topics in some years.
  • Do I need to include the raw GSS data in the repo? For most of the general social surveys you will not have permission to share the GSS data. If that is the case, then you should add clear details in the README explaining how the data could be obtained.
  • How many graphs do I need? In general, you need at least as many graphs as you have variables, because you need to show all the observations for all variables. But you may be able to combine a few; or, vice versa, you may be interested in looking at different aspects or relationships.

D.3.4 Rubric

| Component | Range | Requirement |
| --- | --- | --- |
| R is appropriately cited | 0 - 'No'; 1 - 'Yes' | Must be referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall. |
| Class paper | 0 - 'No'; 1 - 'Yes' | Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall. |
| Title | 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' | An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper. |
| Author, date, and repo | 0 - 'Poor or not done'; 2 - 'Yes' | The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK'). |
| Abstract | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper. |
| Introduction | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total. |
| Estimand | 0 - 'Poor or not done'; 1 - 'Exceptional' | The estimand is clearly stated in the introduction. |
| Data | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | A sense of the dataset should be communicated to the reader. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. |
| Measurement | 0 - 'Poor or not done'; 1 - 'Exceptional' | Some aspect of measurement, relating to the dataset, is mentioned in the data section. |
| Results | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. |
| Discussion | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future? |
| Cross-references | 0 - 'Poor or not done'; 2 - 'Yes' | All figures, tables, and equations should be numbered, and referred to in the text using cross-references. |
| Prose | 0 - 'Poor or not done'; 2 - 'Yes' | All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. |
| Graphs/tables/etc | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables. |
| Survey | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | The survey should have an introductory section and include the details of a contact person. The survey questions should be well constructed and appropriate to the task. The questions should have an appropriate ordering. A final section should thank the respondent. |
| Reference list | 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' | All data, software, literature, and any other relevant material should be cited in-text and included in a reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. |
| Commits | 0 - 'Poor or not done'; 2 - 'Excellent' | There are at least two different commits, and they have meaningful commit messages. |
| Simulation | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The script is clearly commented and structured. All variables are appropriately simulated. |
| Tests | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Data and code tests are appropriately used. |
| Reproducibility | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard-coded file paths must not be used. |
| Code style | 0 - 'Poor or not done'; 1 - 'Exceptional' | Code is appropriately styled. |
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

D.3.5 Previous examples

D.4 Dysart Paper

D.4.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please convert at least one full-page table from one DHS Program “Final Report”, from the 1980s or 1990s, as available here, into a usable dataset, then write a short paper telling a story with the data.
  • Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You are welcome to use this starter folder.
  • Create and document a dataset:
    • Save the PDF to “inputs”.
    • Put together a simulation of your plan for the usable dataset and save the script to “scripts/00-simulation.R”.
    • Write R code, saved as “scripts/01-gather_data.R”, to either OCR or parse the PDF, as appropriate, and save the output to “outputs/data/first_parse.csv”.
    • Write R code, saved as “scripts/02-clean_and_prepare_data.R”, that draws on “first_parse.csv” to clean and prepare the dataset. Use pointblank to put together tests that the dataset passes (at a minimum, every variable should have a test for class and another for content). Save the dataset to “outputs/data/cleaned_data.parquet”.
    • Following Gebru et al. (2021), write a datasheet for the dataset that you constructed. You are welcome to start from the template “inputs/data/datasheet_template.qmd” in the starter folder, but the datasheet should go in the appendix of your paper rather than in a stand-alone document.
  • Use the dataset to tell a story by using Quarto to prepare a PDF with these sections: title, author, date, abstract, introduction, data, results, discussion, an appendix that will, at least, contain a datasheet for the dataset, and references.
    • In addition to conveying a sense of the dataset of interest, the data section should include details of the methodology of the DHS survey that you used, and its key features, strengths, and weaknesses.
  • Submit a PDF of the paper.
  • There should be no evidence that this is a class paper.
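
As a minimal sketch of the cleaning, testing, and saving step, the following uses pointblank validation functions directly on a data frame (they stop with an error if a test fails) and then writes a parquet file with arrow. The data frame, its column names, and its values are invented stand-ins for whatever “first_parse.csv” actually contains, and the set and range bounds are assumptions that would need to be replaced by ones that suit the chosen table.

```r
library(pointblank)
library(arrow)

# Invented stand-in for the contents of "outputs/data/first_parse.csv"
first_parse <- data.frame(
  region = c("North", "South", "East"),
  survey_year = c(1988L, 1989L, 1993L),
  share_of_women = c(12.5, 40.0, 47.5)
)

# One test for class and one for content, for every variable.
# Each function returns the data if the test passes and stops otherwise.
cleaned_data <-
  first_parse |>
  col_is_character(columns = vars(region)) |>
  col_vals_in_set(columns = vars(region), set = c("North", "South", "East")) |>
  col_is_integer(columns = vars(survey_year)) |>
  col_vals_between(columns = vars(survey_year), left = 1980, right = 1999) |>
  col_is_numeric(columns = vars(share_of_women)) |>
  col_vals_between(columns = vars(share_of_women), left = 0, right = 100)

# Save the tested dataset as parquet
dir.create("outputs/data", recursive = TRUE, showWarnings = FALSE)
write_parquet(cleaned_data, "outputs/data/cleaned_data.parquet")
```

Running the tests immediately before the save means the parquet file only exists if the dataset passed every test.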

D.4.2 Checks

  • Use GitHub in a well-developed way by making at least a few commits and using descriptive commit messages.

D.4.3 FAQ

D.4.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Exceptional' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used.
Measurement 0 - 'Poor or not done'; 1 - 'Exceptional' Some aspect of measurement, relating to the dataset, is mentioned in the data section.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Cross-references 0 - 'Poor or not done'; 2 - 'Yes' All figures, tables, and equations should be numbered, and referred to in the text using cross-references.
Prose 0 - 'Poor or not done'; 2 - 'Yes' All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Reference list 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material, should be cited in-text and included in a reference list made using BibTeX. A few lines of code from Stack Overflow or similar, would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least two different commits, and they have meaningful commit messages.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Poor or not done'; 1 - 'Exceptional' The analysis dataset is saved as a parquet file (optionally also as a CSV).
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled.
Datasheet 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A thorough datasheet for the dataset that was constructed is included.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

D.4.5 Previous examples

D.5 Murrumbidgee Paper

D.5.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please revisit the dataset that you used in Section D.1. Build a linear model for one of the variables, and consider the results. Then write a short paper telling a story with the data.
  • Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You are welcome to use this starter folder.
  • Use the model to tell a story by using Quarto to prepare a PDF with these sections: title, author, date, abstract, introduction, data, model, results, discussion, and references.
  • Submit a PDF of the paper.
  • There should be no evidence that this is a class paper.

D.5.2 Checks

  • Be careful to thoroughly explain the model. Also consider the assumptions of the model and the threats to its validity.
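
One way to check that a linear model is understood before applying it to the actual data is to fit it to simulated data where the true parameters are known. The sketch below is illustrative only: the variable names, sample size, and parameter values are all made up, and base R's lm() is used to keep it dependency-free.

```r
set.seed(853)

# Simulated stand-in data with a known intercept (2) and slope (0.5)
n <- 200
sim_data <- data.frame(x = runif(n, min = 0, max = 10))
sim_data$y <- 2 + 0.5 * sim_data$x + rnorm(n, mean = 0, sd = 1)

# Fit the linear model
model <- lm(y ~ x, data = sim_data)

# Estimates and 95 per cent intervals, to compare against the known values
estimates <- cbind(estimate = coef(model), confint(model))
print(estimates)

# Residual standard deviation, to compare against the known value of 1
print(sigma(model))
```

If the estimates are far from the values used in the simulation, that points to a problem with the model specification or the code, which is easier to find here than with the actual data.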

D.5.3 FAQ

  • Can we use aspects of the data and other sections that were submitted in Section D.1? Yes, it is fine to re-use aspects of Section D.1, but chances are you have developed since then and it would make sense to re-write much of that.

D.5.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Exceptional' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used.
Measurement 0 - 'Poor or not done'; 1 - 'Exceptional' Some aspect of measurement, relating to the dataset, is mentioned in the data section.
Model 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The model should be nicely written out, well-explained, justified, and appropriate.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Cross-references 0 - 'Poor or not done'; 2 - 'Yes' All figures, tables, and equations should be numbered, and referred to in the text using cross-references.
Prose 0 - 'Poor or not done'; 2 - 'Yes' All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Reference list 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material, should be cited in-text and included in a reference list made using BibTeX. A few lines of code from Stack Overflow or similar, would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least two different commits, and they have meaningful commit messages.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Poor or not done'; 1 - 'Exceptional' The analysis dataset is saved as a parquet file (optionally also as a CSV).
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled.
Datasheet 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A thorough datasheet for the dataset that was constructed is included.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

D.5.5 Previous examples

D.6 Spadina Paper

D.6.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please pick one of the examples in Chapter 13. Change the situation slightly, and then build a generalized linear model. Then write a short paper telling a story with the data.
  • Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You are welcome to use this starter folder.
  • Use the model to tell a story by using Quarto to prepare a PDF with these sections: title, author, date, abstract, introduction, data, model, results, discussion, and references.
  • Submit a PDF of the paper.
  • There should be no evidence that this is a class paper.

D.6.2 Checks

  • Be careful to thoroughly explain the model. Also consider the assumptions of the model and the threats to its validity.

D.6.3 FAQ

  • What does “change the situation slightly” mean? You are welcome to use the same, or similar, data, but consider a different aspect. For instance:
    • In the logistic regression example of US political support, you may use the CES from a different year, and/or with slightly different explanatory variables.
    • In the Poisson regression example of the letters used in Jane Eyre, you may consider a different novel.
    • In the negative binomial regression of mortality in Alberta, you may consider a different geographic area.
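
For count outcomes such as those in the Poisson example, a simulation with known parameters gives a quick check on the model before turning to the actual text. The sketch below is an assumption-laden illustration, not the Chapter 13 code: the variable names (word_count, count_e) and parameter values are invented, and base R's glm() is used.

```r
set.seed(853)

# Simulated stand-in data: a count outcome whose log-mean is linear in word_count
n <- 300
sim_counts <- data.frame(word_count = sample(5:30, size = n, replace = TRUE))
sim_counts$count_e <- rpois(n, lambda = exp(0.1 + 0.05 * sim_counts$word_count))

# Poisson regression with the default log link
poisson_model <- glm(count_e ~ word_count, data = sim_counts, family = poisson)
print(coef(poisson_model))

# Rough overdispersion check: a ratio well above 1 suggests that a
# negative binomial model may be more appropriate than Poisson
dispersion <- sum(residuals(poisson_model, type = "pearson")^2) /
  df.residual(poisson_model)
print(dispersion)
```

Because the simulated data really are Poisson, the dispersion ratio should be close to 1 here; with actual data it may not be.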

D.6.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Exceptional' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used.
Measurement 0 - 'Poor or not done'; 1 - 'Exceptional' Some aspect of measurement, relating to the dataset, is mentioned in the data section.
Model 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The model should be nicely written out, well-explained, justified, and appropriate.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Cross-references 0 - 'Poor or not done'; 2 - 'Yes' All figures, tables, and equations should be numbered, and referred to in the text using cross-references.
Prose 0 - 'Poor or not done'; 2 - 'Yes' All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Reference list 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material, should be cited in-text and included in a reference list made using BibTeX. A few lines of code from Stack Overflow or similar, would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least two different commits, and they have meaningful commit messages.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Poor or not done'; 1 - 'Exceptional' The analysis dataset is saved as a parquet file (optionally also as a CSV).
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled.
Datasheet 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A thorough datasheet for the dataset that was constructed is included.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

D.6.5 Previous examples

D.7 Spofforth Paper

D.7.1 Task

  • Working as part of a team of one to three people, please forecast the popular vote of the 2020 US election using multilevel regression with post-stratification and then write a short paper telling a story. This requires individual-level survey data, post-stratification data, and a model that brings them together. Given the expense of collecting these data, and the privilege of having access to them, please be sure to properly cite all datasets that you use.
  • Individual-level survey data:
    • Request access to the Democracy Fund + UCLA Nationscape “Full Data Set”. This could take a day or two. Please start early.
    • Simulate the survey dataset that you will use, and save the script to “scripts/00-simulation-survey.R”.
    • Once you have access then pick one survey of interest (they were conducted at different times).
    • This will be a large file and is not yours to share. Do not push it to GitHub. Use a .gitignore file to accomplish this. Instead document how to get the original, unedited data in the README.
    • Clean and prepare the dataset based on what you need.
  • Post-stratification data:
    • Create an account with IPUMS and then use this to access the American Community Surveys (ACS).
    • Simulate the post-stratification dataset that you will use, and save the script to “scripts/00-simulation-poststratification.R”.
    • Pick an appropriate 1-year ACS (there is one every year). Then select some variables. This will depend on what you want to model and the survey data, but some options include: REGION, STATEICP, AGE, SEX, MARST, RACE, HISPAN, BPL, CITIZEN, EDUC, LABFORCE, or INCTOT. Have a look around and see what you are interested in, remembering that you will need to establish a correspondence to the survey.
    • Download the relevant post-stratification data (it is probably easiest to change the data format to .dta).
    • Again, this will be a large file and is not yours to share. Do not push it to GitHub. Use a .gitignore file to accomplish this. Instead document how to get the original, unedited data in the README.
    • Clean and prepare the post-stratification dataset. Remember that you need cell counts for the sub-populations in the model.
  • Modelling:
    • You will want to explain vote intention based on a variety of explanatory variables. The decision is yours, but you should probably use logistic regression. In that case, construct the vote intention variable so that it is binary (either “supports Trump” or “supports Biden”). Then build a model.
    • Think about model fit, diagnostics, and other similar aspects that you need to convince someone that the model is appropriate.
    • You have flexibility in the model that you use (and hence the cells that you will need to create). In general, the more cells the better, but you may want fewer cells for simplicity in the writing process and to ensure a decent sample in each cell. It would be best to start with a simple model and then complicate it, rather than vice versa.
    • Apply the trained model to the post-stratification dataset to forecast the election result. The specifics will depend on your modelling approach but will likely involve predict(), add_predicted_draws(), or similar. The primary aspect of interest is the forecast distribution of the popular vote, and how the explanatory variables affect this. Strong submissions would go beyond that.
  • Write-up:
    • Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then prepare a PDF using Quarto with these sections (you are welcome to use this starter folder): title, author, date, abstract, introduction, data, model, results, discussion, and references. Use appendices for supporting, but not critical, material.
      • In the model section, you should carefully spell out the statistical model that you are using, being sure to define and explain each aspect and why it is important. The model should be appropriately complex; that is, not inappropriately simple, but not unnecessarily complicated. The model should have well-defined variables and these should correspond to what is discussed in the data section. You should explain how the aspects discussed in the data section assert themselves in the modelling decisions that you made. The model should be written out in appropriate mathematical notation but also in plain English. Every aspect of that notation should be defined. The model should make sense based on the substantive area, and the form of the model. If the model is Bayesian, then priors should be defined and sensible. There should be explanation of how features enter the model and why. For instance, why use age rather than age-groups, why does province have a levels effect, why is gender categorical, etc? In general, there should be a clear justification that this is the model for the situation. The assumptions underpinning the model should be clearly discussed. Alternative models, or variants, should be discussed, and strengths and weaknesses made clear. Why was this model chosen? You should mention the software that you used to run the model. There should be evidence of thought about the circumstances in which the model may not be appropriate. There should be evidence of model validation and checking, whether that is out-of-sample, RMSE, a test/training split, or appropriate sensitivity checks. You should be clear about model convergence, model checks, and diagnostic issues.
  • Submit a PDF of your paper.
  • There should be no evidence that this is a class assignment.
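
The modelling and post-stratification steps above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not a prescribed approach: the survey and the cell counts are simulated rather than real, the variable names (`age_group`, `gender`, `supports_biden`, `n`) are hypothetical, and a plain `glm()` logistic regression stands in for whatever model you choose.

```r
# Minimal post-stratification sketch using simulated (not real) data.
# All variable names and values here are hypothetical.
set.seed(853)

# Simulated survey: one row per respondent, with a binary vote intention
n_respondents <- 2000
survey <- data.frame(
  age_group = sample(
    c("18-29", "30-44", "45-64", "65+"), n_respondents, replace = TRUE
  ),
  gender = sample(c("female", "male"), n_respondents, replace = TRUE)
)

# Assumed support probabilities, used only to simulate the outcome
prob_biden <- plogis(
  0.4 - 0.3 * (survey$age_group == "65+") + 0.2 * (survey$gender == "female")
)
survey$supports_biden <- rbinom(n_respondents, size = 1, prob = prob_biden)

# Logistic regression of vote intention on the explanatory variables
vote_model <- glm(
  supports_biden ~ age_group + gender,
  data = survey,
  family = binomial(link = "logit")
)

# Post-stratification frame: one row per cell with a population count `n`
# (these counts are made up; in practice they would come from the ACS)
cells <- expand.grid(
  age_group = c("18-29", "30-44", "45-64", "65+"),
  gender = c("female", "male"),
  stringsAsFactors = FALSE
)
cells$n <- c(210, 250, 330, 220, 205, 245, 320, 180)

# Predicted support in each cell, weighted by the size of the cell
cells$estimate <- predict(vote_model, newdata = cells, type = "response")
popular_vote_estimate <- sum(cells$estimate * cells$n) / sum(cells$n)
popular_vote_estimate
```

This yields only a point estimate. A forecast distribution needs draws rather than a single prediction per cell, for instance from a Bayesian model, and the same cell-weighting step would then be applied to each draw.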

D.7.2 Checks

  • Use GitHub in a well-developed way by making at least a few commits and using descriptive commit messages.
  • Do not include p-values, stars, or similar, in tables. If you invoke statistical significance, then you should draw on and integrate Fisher (1926) and others.

D.7.3 FAQ

  • How much should I write? Most students submit something in the 10-to-15-page range, but it is up to you. Be precise and thorough.

D.7.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Exceptional' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used.
Measurement 0 - 'Poor or not done'; 1 - 'Exceptional' Some aspect of measurement, relating to the dataset, is mentioned in the data section.
Model 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The model should be nicely written out, well-explained, justified, and appropriate.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Cross-references 0 - 'Poor or not done'; 2 - 'Yes' All figures, tables, and equations, should be numbered, and referred to in the text using cross-references.
Prose 0 - 'Poor or not done'; 2 - 'Yes' All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Reference list 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material, should be cited in-text and included in a reference list made using BibTeX. A few lines of code from Stack Overflow or similar, would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least two different commits, and they have meaningful commit messages.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Poor or not done'; 1 - 'Exceptional' The analysis dataset is saved as a parquet file (optionally also as a CSV).
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
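
The "Simulation", "Tests", and "Reproducibility" components can be illustrated with a short script. The following is a sketch only: the variable names and allowed values are hypothetical, and base R's `stopifnot()` is used for the checks, though a dedicated testing package could equally be used.

```r
# Simulate a small dataset with a fixed seed so the result is reproducible.
# Variable names and categories are hypothetical.
set.seed(853)

simulated_data <- data.frame(
  age = sample(18:90, size = 500, replace = TRUE),
  state = sample(state.abb, size = 500, replace = TRUE),
  vote = sample(c("Biden", "Trump"), size = 500, replace = TRUE)
)

# Data tests: the script stops with an error if any expectation is not met
stopifnot(
  nrow(simulated_data) == 500,
  is.numeric(simulated_data$age),
  min(simulated_data$age) >= 18,
  max(simulated_data$age) <= 90,
  all(simulated_data$state %in% state.abb),
  all(simulated_data$vote %in% c("Biden", "Trump")),
  !any(is.na(simulated_data))
)
```

Writing the checks against the simulated data first means the same expectations can later be run, unchanged, against the cleaned real data.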

D.7.5 Previous examples

D.8 Final paper

D.8.1 Task

  • Working individually and in an entirely reproducible way, please write a paper that involves original work to tell a story with data.
  • Options include (pick one):
    • Develop a research question that is of interest to you based on your own interests, background, and expertise, then obtain or create a relevant dataset.
    • A reproduction, being sure to use the paper as a foundation rather than as an end in itself.
  • Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then prepare a PDF using Quarto with these sections (you are welcome to use this starter folder):
    • Title, date, author, abstract, introduction, data, model, results, discussion, appendix (optional, for supporting, but not critical, material), and a reference list.
    • It must also include an enhancement, which should either be contained in, or linked from, the appendix.

D.8.2 Peer review submission

  • This is an initial “submission” where you get comments and feedback on a draft.
  • Submit a PDF of your draft.
  • The paper does not have to be finished at this point, but the following sections must be filled out: title, author, date, abstract, and introduction.
  • All other sections must be present in the paper, but do not have to be filled out (e.g. you must have a “Data” heading, but you do not need to have content in that section).
  • To be clear, it is fine to later change any aspect of what you submit at this checkpoint.
  • You will be awarded one percentage point just for submitting a draft that meets this minimum.
  • There are no extensions possible for this submission because the following submission is dependent on this date.

D.8.3 Conduct peer-review

  • As an individual, you will be randomly assigned a handful of rough drafts to provide feedback on. You have three days to provide feedback to your peers.
  • You should use GitHub Issues, or make a pull request, to provide the feedback.
  • If you provide feedback to one peer you will receive one percentage point, if you provide feedback to two peers you will receive two percentage points, etc.
  • Your feedback must include at least five comments (meaningful and useful bullet points). These must be well-written and thoughtful.
  • There are no extensions granted for this submission since the following submission is dependent on this date.
  • Please remember that you are providing feedback here to help your colleagues. All comments should be professional and kind. It is challenging to receive criticism. Please remember that your goal here is to help your peers advance their writing/analysis.
  • Submit the links to the GitHub Issues or pull requests that you created.

D.8.4 FAQ

  • Can I work as part of a team? No. You must have some work that is entirely your own. You really need your own work to show off for job applications etc.
  • How much should I write? Most students submit something that has 10 to 20 pages of main content, with additional pages devoted to appendices, but it is up to you. Be precise and thorough.
  • Do I have to submit an initial paper in order to do the peer-review? Yes.
  • Can I use the same paper for the reproduction as in the Howrah Paper? No.
  • Can I use any model? You are welcome to use any model, but you need to thoroughly explain it and this can be difficult for more complicated models. Start small. Pick one or two explanatory variables. Once you get that working, then complicate it. Remember that every explanatory variable, and the dependent variable for that matter, needs to be graphed.

D.8.5 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Exceptional' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used.
Measurement 0 - 'Poor or not done'; 1 - 'Exceptional' Some aspect of measurement, relating to the dataset, is mentioned in the data section.
Model 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The model should be nicely written out, well-explained, justified, and appropriate.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Cross-references 0 - 'Poor or not done'; 2 - 'Yes' All figures, tables, and equations, should be numbered, and referred to in the text using cross-references.
Prose 0 - 'Poor or not done'; 2 - 'Yes' All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Reference list 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material, should be cited in-text and included in a reference list made using BibTeX. A few lines of code from Stack Overflow or similar, would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least two different commits, and they have meaningful commit messages.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Poor or not done'; 1 - 'Exceptional' The analysis dataset is saved as a parquet file (optionally also as a CSV).
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled.
Enhancements 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' You should pick at least one of the following and include it to enhance your submission: 1) A datasheet for the dataset; 2) A model card for the model; 3) A Shiny application; 4) An R package; or 5) API for the model.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
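
The "Parquet" component, together with the rule against `setwd()` and hard-coded file paths, might be met along the following lines. This is a sketch that assumes the arrow package is installed; the folder and file names are hypothetical, and a temporary directory stands in for the project root so that the example is self-contained.

```r
# Save an analysis dataset as parquet using a path built from the project
# root, rather than setwd() or a hard-coded absolute path.
# Assumes the arrow package is installed; names are hypothetical.
library(arrow)

analysis_data <- data.frame(
  id = 1:3,
  age_group = c("18-29", "30-44", "45-64")
)

# A temporary directory stands in for the root of the R project
project_root <- tempdir()
output_dir <- file.path(project_root, "data", "analysis_data")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
output_file <- file.path(output_dir, "analysis_data.parquet")

write_parquet(analysis_data, output_file)

# Read the file back to confirm the round trip preserved the data
round_trip <- read_parquet(output_file)
```

In an actual R project the path would be built relative to the project folder itself, so that the script runs unchanged on another computer.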

D.8.6 Previous examples


  1. This terminology is used following Barba (2018), but it is the opposite of that used by BITSS.

  2. The US GSS is recommended here because individual-level data are publicly available, and the dataset is well-documented. But university students in particular countries often have access to individual-level data that are not available to the public, and if this is the case then you are welcome to use that instead. Students at Australian universities will likely have access to individual-level data from the Australian General Social Survey, and could use that. Students at Canadian universities will likely have access to individual-level data from the Canadian General Social Survey and may like to use that.