Telling Stories with Data

With Applications in R

Author

Rohan Alexander

Published

July 27, 2023

You may purchase a print copy of this book from Chapman and Hall/CRC here.

Endorsements

This clean and fun book covers a wide range of topics on statistical communication, programming, and modeling in a way that should be a useful supplement to any statistics course or self-learning program. I absolutely love this book!

Andrew Gelman, Columbia University, and author of Regression and Other Stories

An excellent book. Communication and reproducibility are of increasing concern in statistics, and this book covers these topics and more in a practical, appealing, and truly unique way.

Daniela Witten, University of Washington, and author of An Introduction to Statistical Learning

Many data science texts tell you how to perform perfunctory calculations. Instead, Telling Stories with Data tells you how to engage in the mindset and process of analysis. By arming students with the computational, statistical and philosophical skills needed to use data in sense-making and story-telling, this book stands out from the pack as uniquely actionable and empowering.

Emily Riederer, Capital One, and author of R Markdown Cookbook

Telling Stories with Data is a thoughtful guide to using data to learn and affect positive change. The book includes each stage of the process and can serve as a long-lasting companion to many data scientists and future data story tellers.

Christopher Peters, Zapier

This is not another statistics book. It is much better than that. It is a book about doing quantitative research, about scientific justification, about quality control, about communication and epistemic humility. It’s a valuable supplement to any methods curriculum, and useful for self-learners as well.

Richard McElreath, Max Planck Institute for Evolutionary Anthropology, and author of Statistical Rethinking

A clever career choice is to pick a field where your skills are complementary with a growing resource. In the coming decades, those who are adept in analysing data will flourish. That means crunching statistics and telling compelling stories. Rohan Alexander’s book will help you do both.

Andrew Leigh, Member of the Australian Parliament, and author of Randomistas: How Radical Researchers Are Changing Our World

Every data analyst has to tell stories with data, and yet traditional textbooks focus on statistical methods alone. Telling Stories with Data teaches the entire data science workflow, including data acquisition, communication, and reproducibility. I highly recommend this unique book!

Kosuke Imai, Harvard University, and author of Quantitative Social Science: An Introduction

This is an extraordinary, wonderful, book, full of wise advice for anyone starting in data science. Intermixing concepts and code means the ideas are immediately made concrete, and the emphasis on reproducible workflows brings a welcome dose of rigor to a rapidly developing field.

Sir David Spiegelhalter, University of Cambridge, and author of The Art of Statistics

Telling (true) Stories with Data requires more than fancy statistical models and big data. With a series of fascinating case studies, Rohan Alexander teaches us how to ask good questions, acquire data, estimate models, and communicate our results. This holistic approach is explained with crisp and engaging prose. The pages are filled with detailed R examples, which emphasize the importance of transparency and reproducibility. I absolutely love this book and recommend it to all my students.

Vincent Arel-Bundock, Université de Montréal, and author of Analyse Causale et Méthodes Quantitatives

Preface

This book will help you tell stories with data. It establishes a foundation on which you can build and share knowledge about an aspect of the world that interests you based on data that you observe. Telling stories in small groups around a fire played a critical role in the development of humans and society (Wiessner 2014). Today our stories, based on data, can influence millions.

In this book we will explore, prod, push, manipulate, knead, and ultimately, try to understand the implications of, data. A variety of features drive the choices in this book.

The motto of the university from which I took my PhD is naturam primum cognoscere rerum or roughly “first to learn the nature of things”. But the original quote continues temporis aeterni quoniam, or roughly “for eternal time”. We will do both of these things. I focus on tools, approaches, and workflows that enable you to establish lasting and reproducible knowledge.

When I talk of data in this book, it will typically be related to humans. Humans will be at the center of most of our stories, and we will tell social, cultural, and economic stories. In particular, throughout this book I will draw attention to inequity both in social phenomena and in data. Most data analysis reflects the world as it is. Many of the least well-off face a double burden in this regard: not only are they disadvantaged, but the extent is more difficult to measure. Respecting those whose data are in our dataset is a primary concern, and so is thinking of those who are systematically not in our dataset.

While data are often specific to various contexts and disciplines, the approaches used to understand them tend to be similar. Data are also increasingly global, with resources and opportunities available from a variety of sources. Hence, I draw on examples from many disciplines and geographies.

To become knowledge, our findings must be communicated to, understood, and trusted by other people. Scientific and economic progress can only be made by building on the work of others. And this is only possible if we can understand what they did. Similarly, if we are to create knowledge about the world, then we must enable others to understand precisely what we did, what we found, and how we went about our tasks. As such, in this book I will be particularly prescriptive about communication and reproducibility.

Improving the quality of quantitative work is an enormous challenge, yet it is the challenge of our time. Data are all around us, but there is little enduring knowledge being created. This book hopes to contribute, in some small way, to changing that.

Audience and assumed background

The typical person reading this book has some familiarity with first-year undergraduate statistics, for instance they have run a regression. But it is not targeted at a particular level, instead providing aspects relevant to almost any quantitative course. I have taught from this book at undergraduate, graduate, and professional levels. Everyone has unique needs, but hopefully some aspect of this book speaks to you.

Enthusiasm and interest have taken people far. If you have those, then do not worry about too much else. Some of the most successful students have been those with no quantitative or coding background.

This book covers a lot of ground, but does not go into depth about any particular aspect. As such it especially complements more-detailed books such as: Data Science: A First Introduction (Timbers, Campbell, and Lee 2022), R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund [2016] 2023), An Introduction to Statistical Learning (James et al. [2013] 2021), and Statistical Rethinking (McElreath [2015] 2020). If you are interested in those books, then this might be a good one to start with.

Structure and content

This book is structured around six parts: I) Foundations, II) Communication, III) Acquisition, IV) Preparation, V) Modeling, and VI) Applications.

Part I—Foundations—begins with Chapter 1 which provides an overview of what I am trying to achieve with this book and why you should read it. Chapter 2 goes through three worked examples. The intention of these is that you can experience the full workflow recommended in this book without worrying too much about the specifics of what is happening. That workflow is: plan, simulate, acquire, model, and communicate. It is normal to not initially follow everything in this chapter, but you should go through it, typing out and executing the code yourself. If you only have time to read one chapter of this book, then I recommend that one. Chapter 3 introduces some key tools for reproducibility used in the workflow that I advocate. These are aspects like Quarto, R Projects, Git and GitHub, and using R in practice.

Part II—Communication—considers written and static communication. Chapter 4 details the features that quantitative writing should have and how to write a crisp, quantitative research paper. Static communication in Chapter 5 introduces features like graphs, tables, and maps.

Part III—Acquisition—focuses on turning our world into data. Chapter 6 begins with measurement, and then steps through essential concepts from sampling that govern our approach to data. It then considers datasets that are explicitly provided for us to use as data, for instance censuses and other government statistics. These are typically clean, well-documented, pre-packaged datasets. Chapter 7 covers aspects like using Application Programming Interfaces (APIs), scraping data, getting data from PDFs, and Optical Character Recognition (OCR). The idea is that data are available, but not necessarily designed to be datasets, and that we must go and get them. Finally, Chapter 8 covers aspects where more is expected of us. For instance, we may need to conduct an experiment, run an A/B test, or do some surveys.

Part IV—Preparation—covers how to respectfully transform the original, unedited data into something that can be explored and shared. Chapter 9 begins by detailing some principles to follow when approaching the task of cleaning and preparing data, and then goes through specific steps to take and checks to implement. Chapter 10 focuses on methods of storing and retrieving those datasets, including the use of R data packages and parquet. It then continues onto considerations and steps to take when wanting to disseminate datasets as broadly as possible, while at the same time respecting those whose data they are based on.

Part V—Modeling—begins with exploratory data analysis in Chapter 11. This is the critical process of coming to understand a dataset, but not something that typically finds itself into the final product. The process is an end in itself. In Chapter 12 the use of linear models to explore data is introduced. And Chapter 13 considers generalized linear models, including logistic, Poisson, and negative binomial regression. It also introduces multilevel modeling.

Part VI—Applications—provides three applications of modeling. Chapter 14 focuses on making causal claims from observational data and covers approaches such as difference-in-differences, regression discontinuity, and instrumental variables. Chapter 15 introduces multilevel regression with post-stratification, which is where we use a statistical model to adjust a sample for known biases. Chapter 16 is focused on text-as-data.

Chapter 17 offers some concluding remarks, details some outstanding issues, and suggests some next steps.

Online appendices offer critical aspects that are either a little too unwieldy for the size constraints of the page, or likely to need more frequent updating than is reasonable for a printed book. Online Appendix A goes through some essential tasks in R, which is the statistical programming language used in this book. It can be a reference chapter and some students find themselves returning to it as they go through the rest of the book. Online Appendix B provides a list of datasets that may be useful for assessment. The core of this book is centered around Quarto, however its predecessor, R Markdown, has not yet been sunsetted and there is a lot of material available for it. As such, Online Appendix C contains R Markdown equivalents of the Quarto-specific aspects in Chapter 3. A set of papers is included in Online Appendix D. If you write these, you will be conducting original research on a topic that is of interest to you. Although open-ended research may be new to you, the extent to which you are able to: develop your own questions, use quantitative methods to explore them, and communicate your findings, is the measure of the success of this book. Online Appendix E covers aspects such as websites, web applications, and maps that can be interacted with. Online Appendix F provides an example of a datasheet. Online Appendix G gives a brief overview of SQL essentials. Online Appendix H provides a discussion of prediction-focused modeling. Online Appendix I considers how to make model estimates and forecasts more widely available. Finally, Online Appendix J provides ideas for using this book in class.

Pedagogy and key features

You have to do the work. You should actively go through material and code yourself. King (2000) says “[a]mateurs sit and wait for inspiration, the rest of us just get up and go to work”. Do not passively read this book. My role is best described by Hamming ([1997] 2020, 2–3):

I am, as it were, only a coach. I cannot run the mile for you; at best I can discuss styles and criticize yours. You know you must run the mile if the athletics course is to be of benefit to you—hence you must think carefully about what you hear and read in this book if it is to be effective in changing you—which must obviously be the purpose\(\dots\)

This book is structured around a dense, introductory 12-week course. It provides enough material for advanced readers to be challenged, while establishing a core that all readers should master. Typical courses cover most of the material through to Chapter 13, and then pick another chapter that is of particular interest. But it depends on the backgrounds and interests of the students.

From as early as Chapter 2 you will have a workflow—plan, simulate, acquire, model, and communicate—allowing you to tell a convincing story with data. In each subsequent chapter you will add depth to this workflow. This will allow you to speak with increasing sophistication and credibility. This workflow encompasses skills that are typically sought in both academia and industry. These include: communication, ethics, reproducibility, research question development, data collection, data cleaning, data protection and dissemination, exploratory data analysis, statistical modeling, and scaling.

One of the defining aspects of this book is that ethics and inequity concerns are integrated throughout, rather than being clustered in one, easily ignorable, chapter. These are critical, but it can be difficult to immediately see their value, hence their tight integration.

This book is also designed to enable you to build a portfolio of work that you could show to a potential employer. If you want a job in industry, then this is arguably the most important thing that you should be doing. Robinson and Nolis (2020, 55) describe how a portfolio is a collection of projects that show what you can do and is something that can help be successful in a job search.

In the novel The Last Samurai (DeWitt 2000, 326), a character says:

[A] scholar should be able to look at any word in a passage and instantly think of another passage where it occurred; \(\dots\) [so a] text was like a pack of icebergs each word a snowy peak with a huge frozen mass of cross-references beneath the surface.

In an analogous way, this book not only provides text and instruction that is self-contained, but also helps develop the critical masses of knowledge on which expertise is built. No chapter positions itself as the last word, instead they are written in relation to other work.

Each chapter has the following features:

A list of required materials that you should go through before you read that chapter. To be clear, you should first read that material and then return to this book. Each chapter also contains extensive references. If you are particularly interested in the topic, then you should use these as a starting place for further exploration.
A summary of the key concepts and skills that are developed in that chapter. Technical chapters additionally contain a list of the software and packages that are used in the chapter. The combination of these features acts as a checklist for your learning, and you should return to them after completing the chapter.
“Scales” where I provide a small scenario and ask you to work through the workflow advocated in this book. This will probably take 15-30 minutes. Hilary Hahn, the American violinist, publicly documents herself practicing the violin, often scales or similar exercises, almost every day. I recommend you do something similar, and these are designed to enable that.
A series of short questions that you should complete after going through the required materials, but before going through the chapter, to test your knowledge. After completing the chapter, you should go back through the questions to make sure that you understand each aspect. An answer guide is available on request.
A tutorial question to further encourage you to actively engage with the material. You could consider forming small groups to discuss your answers to these questions.

Some chapters additionally feature:

A section called “Oh, you think we have good data on that!” which focuses on a particular setting where it is often assumed that there are unimpeachable and unambiguous data but the reality tends to be quite far from that.
A section called “Shoulders of giants”, which focuses on some of those who created the intellectual foundation on which we build.

Software information and conventions

The software that I primarily use in this book is R (R Core Team 2023). This language was chosen because it is open source, widely used, general enough to cover the entire workflow, yet specific enough to have plenty of well-developed features. I do not assume that you have used R before, and so another reason for selecting R for this book is the community of R users. The community is especially welcoming of newcomers and there is a lot of complementary beginner-friendly material available.

If you do not have a programming language, then R is a great one to start with. Please do go through Online Appendix A.

The ability to code is useful well beyond this book. If you have a preferred programming language already, then it would not hurt to also pick up R. That said, if you have a good reason to prefer another open-source programming language (for instance you use Python daily at work) then you may wish to stick with that. However, all examples in this book are in R.

Please download R and RStudio onto your own computer. You can download R for free here, and you can download RStudio Desktop for free here. Please also create an account on Posit Cloud here. This will allow you to run R in the cloud. And download Quarto here.

Packages are in typewriter text, for instance, tidyverse, while functions are also in typewriter text, but include brackets, for instance filter().

About the author

I am an assistant professor at the University of Toronto, jointly appointed in the Faculty of Information and the Department of Statistical Sciences. I am also the assistant director of the Canadian Statistical Sciences Institute (CANSSI) Ontario, a senior fellow at Massey College, a faculty affiliate at the Schwartz Reisman Institute for Technology and Society, and a co-lead of the Data Sciences Institute Thematic Program in Reproducibility. I hold a PhD in Economics from the Australian National University where I focused on economic history and was supervised by John Tang (chair), Martine Mariotti, Tim Hatton, and Zach Ward.

My research investigates how we can develop workflows that improve the trustworthiness of data science. I am particularly interested in the role of testing in data science.

I enjoy teaching and I aim to help students from a wide range of backgrounds learn how to use data to tell convincing stories. I try to develop students that are skilled not only in using statistical methods across various disciplines, but also appreciate their limitations, and think deeply about the broader contexts of their work. I teach in both the Faculty of Information and the Department of Statistical Sciences at both undergraduate and graduate levels. I am a RStudio Certified Tidyverse Trainer.

I am married to Monica Alexander and we have two children. I probably spend too much money on books, and certainly too much time at libraries. If you have any book recommendations of your own, then I would love to hear them.

Land acknowledgment

This book was primarily written on the traditional lands of the Mississaugas of the Credit, the Huron-Wendat, and the Seneca. Data have long been used to oppress and harm, and the acknowledgment of the history of this land serves as a reminder of the need to try to be better in our own use of data.

Acknowledgments

Many people generously gave code, data, examples, guidance, opportunities, thoughts, and time that helped develop this book.

Thank you to David Grubbs, Curtis Hill, Robin Lloyd-Starkes, and the team at Taylor & Francis for editing and publishing this book, and providing invaluable guidance and support. I am grateful to Erica Orloff who thoroughly edited this book. Thank you to Isabella Ghement, who thoroughly went through an early draft of this book and provided detailed feedback that improved it.

Thank you to Annie Collins, who went through every word in this book, improving many of them, and helped to sharpen my thinking on much of the content covered in it. One of the joys of teaching is the chance to work with talented people like Annie as they start their careers.

Thank you to Emily Riederer, who provided detailed comments on the initial plans for the book. She then returned to the manuscript after it was drafted and went through it in minute detail. Her thoughtful comments greatly improved the book. More broadly her work changed the way I thought about much of the content of this book.

I was fortunate to have many reviewers who read entire chapters, sometimes two, three, or even more. They very much went above and beyond and provided excellent suggestions for improving this book. For this I am indebted to Albert Rapp, Alex Hayes, Alex Luscombe (who also suggested the Police Violence “Oh, you think\(\dots\)” entry), Ariel Mundo, Benjamin Haibe-Kains, Dan Ryan, Erik Drysdale, Florence Vallée-Dubois, Jack Bailey, Jae Hattrick-Simpers, Jon Khan, Jonathan Keane (who also generously shared their parquet expertise), Lauren Kennedy (who also generously shared code, data, and expertise, to develop my thoughts about MRP), Liam Welsh, Liza Bolton (who also helped develop my ideas around how this book should be taught), Luis Correia, Matt Ratto, Matthias Berger, Michael Moon, Roberto Lentini, Ryan Briggs, and Taylor Wright.

Many people made specific suggestions that greatly improved things. All these people contribute to the spirit of generosity that characterizes the open source programming languages communities this book builds on. I am grateful to them all. A Mahfouz made me realize that that covering Poisson regression was critical. Aaron Miller suggested the FINER framework. Alison Presmanes Hill suggested Wordbank. Chris Warshaw suggested the Democracy Fund Voter Study Group survey data. Christina Wei pointed out many code errors. Claire Battershill directed me toward many books about writing. Ella Kaye suggested, and rightly insisted on, moving to Quarto. Faria Khandaker suggested what became the “R essentials” chapter. Hareem Naveed generously shared her industry experience. Heath Priston provided assistance with Toronto homelessness data. Jessica Gronsbell gave invaluable suggestions around statistical practice. Keli Chiu reinforced the importance of text-as-data. Leslie Root came up with the idea of “Oh, you think we have good data on that!”. Michael Chong shaped my approach to EDA. Michael Donnelly, Peter Hepburn, and Léo Raymond-Belzile provided pointers to classic papers that I was unaware of, in political science, sociology, and statistics, respectively. Nick Horton suggested the Hadley Wickham video in Chapter 11. Paul Hodgetts taught me how to make R packages and made the cover art for this book. Radu Craiu ensured sampling was afforded its appropriate place. Sharla Gelfand the approaches I advocate around how to use R. Thomas William Rosenthal made me realize the potential of Shiny. Tom Cardoso and Zane Schwartz were excellent sources of data put together by journalists. Yanbo Tang assisted with Nancy Reid’s “Shoulders of giants” entry. Finally, Chris Maddison and Maia Balint suggested the closing poem.

Thank you to my PhD supervisory panel John Tang, Martine Mariotti, Tim Hatton, and Zach Ward. They gave me the freedom to explore the intellectual space that was of interest to me, the support to follow through on those interests, and the guidance to ensure that it all resulted in something tangible. What I learned during those years served as the foundation for this book.

This book has greatly benefited from the notes and teaching materials of others that are freely available online, including: Chris Bail, Scott Cunningham, Andrew Heiss (who independently taught a course with the same name as this book, well before the book was available), Lisa Lendway, Grant McDermott, Nathan Matias, David Mimno, and Ed Rubin. Thank you to these people. The changed norm of academics making their materials freely available online is a great one and one that I hope the free online version of this book, available here, helps contribute to.

Thank you to Samantha-Jo Caetano who helped develop some of the assessment items. And also, to Lisa Romkey and Alan Chong, who allowed me to adapt some aspects of their rubric. The catalysts for aspects of the Chapter 4 tutorial were McPhee (2017, 186) and Chelsea Parlett-Pelleriti. The idea behind the “Interactive communication” tutorial was work by Mauricio Vargas Sepúlveda (“Pachá”) and Andrew Whitby.

I am grateful for the corrections of: Amy Farrow, Arsh Lakhanpal, Cesar Villarreal Guzman, Chloe Thierstein, Finn Korol-O’Dwyer, Flavia López, Gregory Power, Hong Shi, Jayden Jung, John Hayes, Joyce Xuan, Laura Cline, Lorena Almaraz De La Garza, Matthew Robertson, Michaela Drouillard, Mounica Thanam, Reem Alasadi, Rob Zimmerman, Tayedza Chikumbirike, Wijdan Tariq, Yang Wu, and Yewon Han.

Kelly Lyons provided support, guidance, mentorship, and friendship. Every day she demonstrates what an academic should be, and more broadly, what one should aspire to be as a person.

Greg Wilson provided a structure to think about teaching and suggested the “Scales” style exercises. He was the catalyst for this book, and provided helpful comments on drafts. Every day he demonstrates how to contribute to the intellectual community.

Thank you to Elle Côté for enabling this book to be written by looking after first one, and then two, children during a pandemic.

As at Christmas 2021 this book was a disparate collection of partially completed notes; thank you to Mum and Dad, who dropped everything and came over from the other side of the world for two months to give me the opportunity to rewrite it all and put together a cohesive draft.

Thank you to Marija Taflaga and the ANU Australian Politics Studies Centre at the School of Politics and International Relations for funding a two-week “writing retreat” in Canberra.

Finally, thank you to Monica Alexander. Without you I would not have written a book; I would not have even thought it possible. Many of the best ideas in this book are yours, and those that are not, you made better, by reading everything many times. Thank you for your inestimable help with writing this book, providing the base on which it builds (remember in the library showing me many times how to get certain rows in R!), giving me the time that I needed to write, encouragement when it turned out that writing a book just meant endlessly rewriting that which was perfect the day before, reading everything in this book many times, making coffee or cocktails as appropriate, looking after the children, and more.

You can contact me at: rohan.alexander@utoronto.ca.

Rohan Alexander
Toronto, Canada
May 2023