Primer on data science for health

These notes are a series of posts based on a seminar series I am currently running at the University of Canterbury’s School of Health Sciences. I have left them online for you to use, comment on, and learn about doing data science in the health sciences. We will cover topics ranging from simple descriptive statistics through to linear and logistic regression, Cox proportional hazards modelling, time series analysis, structural equation modelling, and meta-analysis. We will use RStudio as our main tool for these exercises, but you can use Jupyter notebooks as well. Indeed, many examples are developed in Jupyter notebooks maintained at a GitLab site.

The examples will be drawn mainly from health and health sciences, but also from other sources. You are welcome to try your own examples and to contribute to this series. You can ask questions or post comments using the comments section.

Part I: Setting up your work

Installation of R and RStudio

First, install RStudio. Before you can install RStudio, you need to install R, the statistical computing software it runs on. Visit https://cran.r-project.org/ to download and install R for your system. You should know your system configuration; if you do not have administrator rights on your computer, ask your administrator to install R for you (the same applies to RStudio, next). After installing R, install RStudio from https://www.rstudio.com/products/rstudio/download/

If you already have R or RStudio installed and RStudio reports at start-up that some components could not be made available, updating or reinstalling RStudio should help.

Set up RStudio

Open RStudio and create a new project.

Then select “New Directory” >> “Empty Project”. In the New Project dialog box, fill in the “Directory Name” field and choose the folder in which this directory will be created as a subdirectory.

This will take you to the RStudio window for your new project. Here, open a new R Notebook (File >> New File >> R Notebook). We will do our work in R Notebooks in the coming seminars.

When you open an R Notebook like this, you see three sections. From the top, the sections in the page are:

  1. A section that starts and ends with three dashes and contains the keywords “title”, “author”, and “output”. These are the metadata for the document. Fill in these entries to set up the document. Here, for instance, title sets the title of the document you will write, author identifies you, and output specifies that your output will be an HTML document. We will learn how to set the different options later in the seminar series; for now, leave the default values as they are and change only the title and author entries.
  2. Following the metadata, you see a grey box with a green triangle in its upper right-hand corner. The box starts with the expression ```{r} at the top left and ends with three backticks at the bottom left. This is a code chunk, and it is where you will write R code. Clicking the green triangle runs the chunk and evaluates the expressions in it. You can insert a code chunk either with the mouse, by selecting “insert R” from the dropdown at the upper right of the editor, or with the keyboard shortcut “Ctrl-Alt-I” (hold down the three keys at once) in Windows/Linux or “Cmd-Opt-I” on a Mac. To evaluate a single expression, place the cursor on it and press “Ctrl-Enter” in Windows/Linux or “Cmd-Enter” on a Mac. To evaluate a whole code chunk, click the green triangle in its corner or press “Ctrl/Cmd+Shift+Enter”.
  3. Apart from the metadata and the code chunks, you will see white space where you can write plain text with markup. This markup is referred to as “markdown”. In R Notebooks we use pandoc-flavoured markdown to write our text. With it you can write tables, lists, paragraphs, and different levels of headings, and include citations, so that you can write long or complex documents and then export them from RStudio to your desired format for sharing with others. You can also share an R Markdown document with your collaborators if they want to work directly on it. The extension for an R Markdown document is “.Rmd”.
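
To see how the three sections fit together in the underlying “.Rmd” file, here is a minimal sketch that writes a skeleton notebook to a temporary file and prints it. The title, author, file name, and chunk contents are placeholders, and the output entry html_notebook is an assumption about what RStudio fills in by default.

```r
# A minimal sketch: write a skeleton notebook to a temporary file.
# Title, author and chunk contents are placeholders for illustration.
fence <- strrep("`", 3)  # three backticks, the marker that opens and closes a chunk

skeleton <- c(
  "---",                                         # metadata starts with three dashes
  'title: "My first notebook"',
  'author: "Your Name"',
  "output: html_notebook",
  "---",                                         # ... and ends with three dashes
  "",
  "Plain text written in markdown goes here.",   # markdown text
  "",
  paste0(fence, "{r}"),                          # a code chunk opens here
  "1 + 1",
  fence                                          # ... and closes here
)

notebook_file <- file.path(tempdir(), "skeleton.Rmd")
writeLines(skeleton, notebook_file)

# Print the file so you can compare it with what RStudio shows
cat(readLines(notebook_file), sep = "\n")
```

Opening the printed text side by side with a new R Notebook in RStudio should make the metadata, the code chunk, and the markdown text easy to tell apart.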

Load the packages

We will work with a few packages in R. Packages in R consist of functions and data that enable you to run specific sets of analyses. We will learn more about packages in due course in this seminar series. Please insert a code chunk and type the following:
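
The chunk itself is not reproduced in these notes. Judging from the three packages named in the next section, it was presumably three library() calls along these lines (a reconstruction, not the original code; the comments are added for illustration):

```r
# Load the packages used in this seminar series
library(tidyverse)     # data manipulation and graphics
library(nycflights13)  # example data sets, including flights
library(knitr)         # tools for producing reports from notebooks
```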

Error messages

After typing these three lines, run the chunk. You may see an error message in which RStudio complains that it cannot find the package nycflights13 (or tidyverse or knitr) and therefore could not complete the action. This means that the package is not yet installed on your computer. Install it by typing the following in a code chunk and then running the chunk:
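
This chunk is not reproduced here either. Based on the line-by-line explanation that follows, it had a comment on the first line and an install.packages() call on the second; the wording of the comment below is an assumption:

```r
# Install the nycflights13 package so that library(nycflights13) can find it
install.packages("nycflights13")
```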

Let me explain this code chunk:

  1. The first line starts with a hash mark (“#”). A hash mark at the start of a line indicates a comment. A comment is not run; instead, it reminds you why you wrote the line of code, so that when you come back to it in a few weeks you will not have forgotten what the code is doing. You should add comments liberally in all the code you write: you can put a comment on nearly every line or every chunk of code. Each comment line is preceded by a hash mark. You can also use hash marks creatively to mark off several lines of code and to visually fence blocks of code.
  2. The second line contains “install.packages("nycflights13")”. The expression install.packages() is a function that tells the computer to install a package (you can specify more than one package to be installed; we will show later how that is done). Note that the function name tells the computer to do something. This is an ideal way to name a function: when you write a function, make sure its name describes the action it performs, that is, use a VERB. Also note that “nycflights13” is written between quote marks; it is the value passed to the function, referred to as an “argument” (the function’s “parameter” is the named slot that receives it). We will discuss how to write functions, and how to use parameters in them, in a future module. Functions and data are the heart and soul of packages.
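
On installing more than one package: install.packages() accepts a character vector of package names. The sketch below is one possible pattern, not necessarily the one used later in the series. It first works out which packages are missing, and it only installs when run interactively, so that knitting a notebook never triggers a download. The variable names are illustrative.

```r
# Packages used in this seminar series
needed <- c("tidyverse", "nycflights13", "knitr")

# Names of the packages already installed on this computer
installed <- rownames(installed.packages())

# Keep only the packages that are not installed yet
to_install <- setdiff(needed, installed)

# One call can install several packages at once
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}

to_install  # the packages that were missing when this chunk started
```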

A glimpse into data

We will cover ideas around tidy data and grammar of graphics in our next seminar, but as a prelude to this work, open a code chunk and type:
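
The chunk is not reproduced here; from the explanation that follows, it was presumably a call to head() on the flights data, along these lines:

```r
library(nycflights13)  # provides the flights data set

# Show the first six rows of flights
head(flights)
```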

What is this code doing? We use a function to see the first few (six by default) entries in a data set named “flights” from the nycflights13 package. The function head() shows you what is at the “head end” of the data, and the function tail() shows you what is at the tail end of the data set. If you run the code chunk, RStudio prints the first six rows of the data set as a table beneath the chunk.

Description of the output

The first line tells you that this is a “tibble”. What is a tibble? A tibble is a table, or more precisely a data frame, in the form used by the tidyverse. We will explain the details of tidy data and how to work with it in the next seminar. The first line also tells you that the tibble is 6 x 19: here, 6 indicates that there are six rows, and 19 that there are 19 columns (each column represents a variable). You will not see all 19 variables at once. The third line shows you eight variable names and beneath them, in the fourth line, you see abbreviations such as <int> enclosed in angle brackets. We will get to these descriptors in our next seminar when we describe the concepts of the tidyverse, tidy data, and tibbles; for now, note that they indicate the type of each variable.
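
If you want to confirm these numbers yourself rather than read them off the printed header, the following sketch uses base R functions on the same six rows (first_rows is just an illustrative name):

```r
library(nycflights13)

first_rows <- head(flights)

dim(first_rows)    # number of rows and columns: 6 19
names(first_rows)  # all 19 variable names, including those not printed

# The type of each variable, as a named character vector
vapply(first_rows, function(x) class(x)[1], character(1))
```

The last line reports full type names such as "integer", which the tibble printout abbreviates to <int>.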

Next …

We will continue to explore tidy data and the tidyverse in the next seminar and learn how to conduct data analysis, run models, and produce graphics. In Part II, I will show you how you can use tidy data and the tidyverse package to analyse data sets, produce tables of data, and explore data sets graphically. In Part III, we will start with linear modelling using RStudio. In Part IV, we will learn how to use RStudio for structural equation modelling with the lavaan package, exploring several scenarios; we will also briefly touch on how to work with matrices. In Part V, we will learn about conducting meta-analysis using meta packages, and we will see that you can conduct meta-analyses using the tidy approach. In Part VI, we will explore how you can use RStudio to write longer texts. Finally, in Part VII, I will show you how you can interface RStudio with Overleaf and Authorea to work with other databases.

Part II: The tidy approach to work with data, learn the grammar first

A great introduction to R for data science is the book R for Data Science by Hadley Wickham and Garrett Grolemund. This series draws its inspiration from that text, but we will also discuss issues and topics relevant to our own setting, health care data. Some of our data sets are large; others are smaller. Nevertheless, the starting point for all our data sets is the data frame. We will use “tibbles” in place of data frames here and throughout this series, so some introduction is in order.

What is a tibble? In the previous example, you have seen an example of a tibble. From R for data science,

Tibbles are data frames, but they tweak some older behaviours to make life a little easier

(http://r4ds.had.co.nz/tibbles.html)
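
As a small illustration of one such tweaked behaviour (the patient data here are made up for the example): selecting a single column with square brackets turns a base data frame into a plain vector, whereas a tibble stays a tibble.

```r
library(tibble)

# Hypothetical data, for illustration only
patients_df <- data.frame(age = c(34, 51, 47), sex = c("F", "M", "F"))
patients_tb <- as_tibble(patients_df)

is.data.frame(patients_tb)    # TRUE: a tibble is still a data frame

class(patients_df[, "age"])   # "numeric": the data frame dropped to a vector
class(patients_tb[, "age"])   # "tbl_df" "tbl" "data.frame": still a tibble
```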

But before that, let’s talk about tidy data sets. Hadley Wickham introduced the idea of “tidy” data sets¹.
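
In that paper, a data set is tidy when each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Here is a minimal sketch with made-up blood pressure readings (the data and column names are hypothetical), using pivot_longer() from the tidyr package to reshape a wide table into tidy form:

```r
library(tibble)
library(tidyr)

# Hypothetical data: one row per patient, one column per visit
bp_wide <- tibble(
  patient = c("A", "B", "C"),
  visit1  = c(120, 135, 128),
  visit2  = c(118, 130, 131)
)

# Tidy form: one row per patient per visit
bp_tidy <- pivot_longer(
  bp_wide,
  cols      = c(visit1, visit2),
  names_to  = "visit",
  values_to = "systolic_bp"
)

bp_tidy  # 6 rows and 3 columns: patient, visit, systolic_bp
```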

Part III: Generalised linear modelling & time series analysis with RStudio

Part IV: Structural equation modelling with lavaan and DiagrammeR

Part V: Meta analysis in RStudio

Part VI: RStudio to write longer documents (journal articles, theses, books)

Part VII: RStudio to interface with Authorea and Overleaf

¹Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10). https://doi.org/10.18637/jss.v059.i10

Associate Professor of Epidemiology and Environmental Health at the University of Canterbury, New Zealand. Also in: https://refind.com/arinbasu
