R is an object-oriented programming language for statistical computing and graphics.

There are many benefits to learning and using R:

  1. Free and open source
  2. Thousands of packages for your analysis needs
  3. Active community (lots of online help)
  4. Has excellent publication-quality graphics
  5. Standard for statisticians and data analysts for statistical analysis
  6. Can interact with other programming languages

Learning objectives:

  • Using RStudio and interacting with R
    • Basic key bindings/shortcuts
    • Using R Projects
  • Knowing how to get help (vignettes, ?)
  • Difference between R Markdown vs R scripts
    • What is R Markdown and how to use it
  • Using basic and common R code
    • Installing and loading packages
    • Importing and exporting datasets
    • Viewing your data
    • Data types
    • Assigning variables
    • Extracting values from the data
    • Using and making functions

Material and code

R and RStudio

Interacting with R (sending commands and getting output back) is done through the console. RStudio (cheatsheet here) has four panes, two of which we’ll cover first: the console pane and the open files pane. The console is where code is sent and where R returns the output from that code. The open files include R scripts (.R files) and R Markdown (.Rmd) files. You write your R code in the .R/.Rmd files and send the current line of code to the console using Ctrl-Enter.

It’s generally good practice to make an R project for each specific research objective (e.g. your thesis). Create an .Rproj file using File -> New Project. Any time you work on that project, open it through the .Rproj file or via File -> Recent Projects. You can have multiple R projects.

Getting help

Getting help is easy in R. There is help for individual functions (i.e. commands) as well as for R packages. To get help for a specific function, put ? in front of its name:

?head

Some packages also come with a more detailed, long-form description called a vignette. You can list the vignettes inside a package, then open the specific vignette you want to read:

vignette(package = 'dplyr')
vignette('introduction', package = 'dplyr')

Not all packages have vignettes, but it’s always useful to check if there is one.
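
Beyond ? and vignette(), a few other base R helpers can be handy when you only half remember a function. The following is a minimal sketch, using head as the example function as above:

```r
# show a function's arguments without opening the full help page
args(head)

# the argument names on their own
names(formals(head))

# find functions whose names contain a piece of text
matches <- apropos('head')
matches
```

The matches object is only created here so the result can be reused; typing apropos('head') directly in the console works just as well.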

Types of R files

There are two types of R files: .R and .Rmd. R scripts (.R) contain only R code and nothing else; these are useful when all you want to do is run a set code sequence and produce a specific result, figure, or output. R Markdown (.Rmd) files can contain R code, text, citations, tables, figures, lists, and so on. An R Markdown file is like Word… except much more powerful as it can include R code within the document and can convert to a variety of file types including HTML and Word docx. Using this file type makes your research, manuscripts, and thesis more reproducible and saves you time and stress. For most of the workshop we’ll be using R Markdown.

You can recognize an .Rmd file as it generally has this at the very top of the file:

---
title: "Introducing R Markdown"
author: "Luke Johnston"
date: "July 23, 2015"
output: html_document
---

This snippet of code is called YAML. We’ll talk about that more in a later workshop. For now, just know that rmarkdown (the package) uses the YAML to know what to convert the document to (in this case html).

Installing and loading packages

Installing packages can be done using either RStudio in the Packages pane or using the command:

install.packages('packagename')

We’ll use the package readr soon, so let’s load it up:

library(readr)

We can now use the functions inside the readr package. To see what functions there are, you can go to the console and start typing readr::, then hit TAB. A list of functions will pop up, showing all the functions inside readr. This TAB-completion is very useful.
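
The same ideas can be tried without installing anything. The sketch below uses the tools package purely because it ships with R; install_if_missing is a hypothetical helper name, not something provided by R or by readr:

```r
# 'tools' is used only because it ships with R and needs no installation
library(tools)

# list everything the attached package makes available
tools_functions <- ls('package:tools')
head(tools_functions)

# after library(), functions can be called directly ...
file_ext('states_data.csv')
# ... or with the package name in front, which also works without library()
tools::file_ext('states_data.csv')

# hypothetical helper: install a package only when it is not already there
install_if_missing <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)
    }
    invisible(requireNamespace(pkg, quietly = TRUE))
}
install_if_missing('tools')
```

Checking with requireNamespace() first avoids re-downloading a package every time a script is run.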

Importing and exporting

Let’s use the readr functions to load in a dataset from the codeasmanuscript site. It’s generally good to store your data as CSV (comma-separated values) files. These are plain text, meaning there is nothing but text in the file (unlike Excel .xlsx files, which are not plain text; they contain markup inside the file itself).

You can import files from your computer. To keep things easy, make sure the data file is in the same folder as your .Rmd/.R/.Rproj file(s). You can import the file into R via…

# comment: to import use:
ds <- read_csv('states_data.csv')
## Parsed with column specification:
## cols(
##   StateName = col_character(),
##   Population = col_integer(),
##   Income = col_integer(),
##   Illiteracy = col_double(),
##   LifeExp = col_double(),
##   Murder = col_double(),
##   HSGrad = col_double(),
##   Frost = col_integer(),
##   Area = col_integer(),
##   Region = col_character(),
##   Division = col_character(),
##   Longitude = col_double(),
##   Latitude = col_double()
## )
# or from a website
# ds <- read_csv('http://codeasmanuscript.org/states_data.csv') 
ds
## # A tibble: 50 x 13
##      StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
##          <chr>      <int>  <int>      <dbl>   <dbl>  <dbl>  <dbl> <int>
##  1     Alabama       3615   3624        2.1   69.05   15.1   41.3    20
##  2      Alaska        365   6315        1.5   69.31   11.3   66.7   152
##  3     Arizona       2212   4530        1.8   70.55    7.8   58.1    15
##  4    Arkansas       2110   3378        1.9   70.66   10.1   39.9    65
##  5  California      21198   5114        1.1   71.71   10.3   62.6    20
##  6    Colorado       2541   4884        0.7   72.06    6.8   63.9   166
##  7 Connecticut       3100   5348        1.1   72.48    3.1   56.0   139
##  8    Delaware        579   4809        0.9   70.06    6.2   54.6   103
##  9     Florida       8277   4815        1.3   70.66   10.7   52.6    11
## 10     Georgia       4931   4091        2.0   68.54   13.9   40.6    60
## # ... with 40 more rows, and 5 more variables: Area <int>, Region <chr>,
## #   Division <chr>, Longitude <dbl>, Latitude <dbl>

To export data inside R to a file, use…

# the first argument is the data to write, the second is the file name
write_csv(ds, 'states.csv')
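
A quick way to check that exporting and importing behave as expected is a round trip: write a small dataframe to a file, read it back, and compare. The sketch below swaps in the base R functions write.csv() and read.csv() so that it runs without readr, and writes to a temporary file; the two-row dataframe is typed in by hand from the first rows shown above.

```r
# a tiny dataframe, typed in by hand for illustration
small <- data.frame(
    StateName = c('Alabama', 'Alaska'),
    Income = c(3624L, 6315L),
    stringsAsFactors = FALSE
)

# write to a temporary file so nothing in your project folder is touched
tmp <- tempfile(fileext = '.csv')
write.csv(small, tmp, row.names = FALSE)

# read it back in and compare with the original
back <- read.csv(tmp, stringsAsFactors = FALSE)
all.equal(small, back)
```

With readr loaded, write_csv() and read_csv() can be used in the same way, with the data as the first argument when writing.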

Viewing your data

There are several functions to look over your data.

# comment: quick summary
summary(ds)
##   StateName           Population        Income       Illiteracy   
##  Length:50          Min.   :  365   Min.   :3098   Min.   :0.500  
##  Class :character   1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625  
##  Mode  :character   Median : 2838   Median :4519   Median :0.950  
##                     Mean   : 4246   Mean   :4436   Mean   :1.170  
##                     3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575  
##                     Max.   :21198   Max.   :6315   Max.   :2.800  
##     LifeExp          Murder           HSGrad          Frost       
##  Min.   :67.96   Min.   : 1.400   Min.   :37.80   Min.   :  0.00  
##  1st Qu.:70.12   1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25  
##  Median :70.67   Median : 6.850   Median :53.25   Median :114.50  
##  Mean   :70.88   Mean   : 7.378   Mean   :53.11   Mean   :104.46  
##  3rd Qu.:71.89   3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75  
##  Max.   :73.60   Max.   :15.100   Max.   :67.30   Max.   :188.00  
##       Area           Region            Division           Longitude      
##  Min.   :  1049   Length:50          Length:50          Min.   :-127.25  
##  1st Qu.: 36985   Class :character   Class :character   1st Qu.:-104.16  
##  Median : 54277   Mode  :character   Mode  :character   Median : -89.90  
##  Mean   : 70736                                         Mean   : -92.46  
##  3rd Qu.: 81162                                         3rd Qu.: -78.98  
##  Max.   :566432                                         Max.   : -68.98  
##     Latitude    
##  Min.   :27.87  
##  1st Qu.:35.55  
##  Median :39.62  
##  Mean   :39.41  
##  3rd Qu.:43.14  
##  Max.   :49.25
# first 6 rows
head(ds)
## # A tibble: 6 x 13
##    StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
##        <chr>      <int>  <int>      <dbl>   <dbl>  <dbl>  <dbl> <int>
## 1    Alabama       3615   3624        2.1   69.05   15.1   41.3    20
## 2     Alaska        365   6315        1.5   69.31   11.3   66.7   152
## 3    Arizona       2212   4530        1.8   70.55    7.8   58.1    15
## 4   Arkansas       2110   3378        1.9   70.66   10.1   39.9    65
## 5 California      21198   5114        1.1   71.71   10.3   62.6    20
## 6   Colorado       2541   4884        0.7   72.06    6.8   63.9   166
## # ... with 5 more variables: Area <int>, Region <chr>, Division <chr>,
## #   Longitude <dbl>, Latitude <dbl>
# last 6 rows
tail(ds)
## # A tibble: 6 x 13
##       StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
##           <chr>      <int>  <int>      <dbl>   <dbl>  <dbl>  <dbl> <int>
## 1       Vermont        472   3907        0.6   71.64    5.5   57.1   168
## 2      Virginia       4981   4701        1.4   70.08    9.5   47.8    85
## 3    Washington       3559   4864        0.6   71.72    4.3   63.5    32
## 4 West Virginia       1799   3617        1.4   69.48    6.7   41.6   100
## 5     Wisconsin       4589   4468        0.7   72.48    3.0   54.5   149
## 6       Wyoming        376   4566        0.6   70.29    6.9   62.9   173
## # ... with 5 more variables: Area <int>, Region <chr>, Division <chr>,
## #   Longitude <dbl>, Latitude <dbl>
# column names of your dataset
colnames(ds)
##  [1] "StateName"  "Population" "Income"     "Illiteracy" "LifeExp"   
##  [6] "Murder"     "HSGrad"     "Frost"      "Area"       "Region"    
## [11] "Division"   "Longitude"  "Latitude"
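
head() and tail() also accept the number of rows to show. As a small sketch that runs without any data file, this uses state.x77, a dataset built into R that seems to overlap with several columns of the states data used here (that overlap is an assumption, not something stated by the workshop):

```r
# state.x77 is a matrix, so convert it to a dataframe first
states <- as.data.frame(state.x77)

head(states, n = 3)    # first 3 rows instead of the default 6
tail(states, n = 2)    # last 2 rows
colnames(states)[1:3]  # only the first three column names
```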

For a very detailed view of the exact contents of the data object, the str function is very useful:

str(ds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    50 obs. of  13 variables:
##  $ StateName : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Population: int  3615 365 2212 2110 21198 2541 3100 579 8277 4931 ...
##  $ Income    : int  3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ LifeExp   : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HSGrad    : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : int  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : int  50708 566432 113417 51945 156361 103766 4862 1982 54090 58073 ...
##  $ Region    : chr  "South" "West" "West" "South" ...
##  $ Division  : chr  "East South Central" "Pacific" "Mountain" "West South Central" ...
##  $ Longitude : num  -86.8 -127.2 -111.6 -92.3 -119.8 ...
##  $ Latitude  : num  32.6 49.2 34.2 34.7 36.5 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 13
##   .. ..$ StateName : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Population: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Income    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Illiteracy: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ LifeExp   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Murder    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ HSGrad    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Frost     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Area      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Region    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Division  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Longitude : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Latitude  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

To see the dimensions (number of rows and columns) of a dataset, you can use

dim(ds)
## [1] 50 13
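
If only one of the two dimensions is needed, nrow() and ncol() return them separately. A minimal sketch, again assuming the built-in state.x77 dataset as a stand-in for ds:

```r
# built-in dataset used as a stand-in; it has fewer columns than ds
states <- as.data.frame(state.x77)

dim(states)   # rows, then columns
nrow(states)  # number of rows only
ncol(states)  # number of columns only
```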

Data types

There are several types of data in R. In general, these can be simplified into two classes: continuous and discrete. These can then be classified into other groups.

  1. Continuous:
    • numeric
    • integer
  2. Discrete:
    • character
    • logical
    • factor

Continuous data types look like…

0.5:10.5
##  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
class(0.5:10.5)
## [1] "numeric"
1L:10L # comment: the `L` forces R to see it as an integer
##  [1]  1  2  3  4  5  6  7  8  9 10
class(1L:10L)
## [1] "integer"

… while discrete data types look like

c(TRUE, FALSE) # comment: the `c` means 'combine' or 'concatenate'
## [1]  TRUE FALSE
class(c(TRUE, FALSE))
## [1] "logical"
c('hi', 'there') # this is also called a vector
## [1] "hi"    "there"
as.numeric(c('hi', 'there'))
## Warning: NAs introduced by coercion
## [1] NA NA
class(c('hi', 'there', 'hi'))
## [1] "character"
factor(c('hi', 'there', 'hi'))
## [1] hi    there hi   
## Levels: hi there
class(factor(c('hi', 'there', 'hi')))
## [1] "factor"
# comment: factors are basically numbers underneath.
as.numeric(factor(c('hi', 'there', 'hi')))
## [1] 1 2 1
# comment: characters are not.
as.numeric(c('hi', 'there'))
## Warning: NAs introduced by coercion
## [1] NA NA
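
Because factors are numbers underneath, converting a factor that looks like numbers can give a surprising result. The sketch below uses made-up values to show the difference between the underlying codes and the values as written:

```r
number_like <- factor(c('10', '20', '10'))

# gives the underlying level codes, not the values you see printed
codes <- as.numeric(number_like)
codes

# convert to character first to get the values as written
values <- as.numeric(as.character(number_like))
values

# typeof() shows how R stores a value
typeof(1:3)
typeof(0.5)
```

Going through as.character() first is the usual way around this when a column of numbers has been read in as a factor.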

There are also more complex object types in R. Two common ones are lists and dataframes. A dataframe is what you just loaded above in the ds object. It can contain any data type except for other dataframes. So…

class(ds)
## [1] "tbl_df"     "tbl"        "data.frame"

A list can have any number of data types inside it, including dataframes and more lists.

ds_list <- list(data = ds, number = 1:10, char = c('hi', 'there'))
class(ds_list)
## [1] "list"
ds_list
## $data
## # A tibble: 50 x 13
##      StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
##          <chr>      <int>  <int>      <dbl>   <dbl>  <dbl>  <dbl> <int>
##  1     Alabama       3615   3624        2.1   69.05   15.1   41.3    20
##  2      Alaska        365   6315        1.5   69.31   11.3   66.7   152
##  3     Arizona       2212   4530        1.8   70.55    7.8   58.1    15
##  4    Arkansas       2110   3378        1.9   70.66   10.1   39.9    65
##  5  California      21198   5114        1.1   71.71   10.3   62.6    20
##  6    Colorado       2541   4884        0.7   72.06    6.8   63.9   166
##  7 Connecticut       3100   5348        1.1   72.48    3.1   56.0   139
##  8    Delaware        579   4809        0.9   70.06    6.2   54.6   103
##  9     Florida       8277   4815        1.3   70.66   10.7   52.6    11
## 10     Georgia       4931   4091        2.0   68.54   13.9   40.6    60
## # ... with 40 more rows, and 5 more variables: Area <int>, Region <chr>,
## #   Division <chr>, Longitude <dbl>, Latitude <dbl>
## 
## $number
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $char
## [1] "hi"    "there"
names(ds_list) # find out the names of the objects inside the list
## [1] "data"   "number" "char"
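
To get a quick overview of what is inside a list without printing everything, length(), sapply(), and str() can help. A minimal sketch on a smaller list (example_list is a made-up name, built like ds_list but without the dataframe):

```r
example_list <- list(number = 1:10, char = c('hi', 'there'))

# how many items the list holds
length(example_list)

# the class of each item, by name
item_classes <- sapply(example_list, class)
item_classes

# compact overview of the whole list
str(example_list)
```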

Variable assignment

Often you need to run a function to get an output and then use that output to do other things to it, like plot it or make it into a table. In order to help with that, there is variable assignment using the assignment operator <-. We used it earlier to assign the dataframe from read_csv to the ds object.

weight_kg <- 75
weight_kg
## [1] 75
weight_lb <- weight_kg * 2.2
weight_lb
## [1] 165
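
One thing worth knowing is that an assigned object stores a value, not a link to the objects used to calculate it. As a short sketch continuing the weight example:

```r
weight_kg <- 75
weight_lb <- weight_kg * 2.2

# changing weight_kg afterwards does not update weight_lb
weight_kg <- 100
weight_lb

# weight_lb only changes when the calculation is run again
weight_lb <- weight_kg * 2.2
weight_lb
```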

Extracting values from an object

You can use several methods to take one or more items from a vector, dataframe, or list using $, [], or [[]].

# vectors can use [] (and [[]] for a single item), but not $
num <- 1:10
num[1] # first item
## [1] 1
num[9] # ninth item
## [1] 9
# dataframes and lists can use any of the methods
ds$Income # directly choose the Income column, converts to vector
##  [1] 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 4963 4119 5107 4458
## [15] 4628 4669 3712 3545 3694 5299 4755 4751 4675 3098 4254 4347 4508 5149
## [29] 4281 5237 3601 4903 3875 5087 4561 3983 4660 4449 4558 3635 4167 3821
## [43] 4188 4022 3907 4701 4864 3617 4468 4566
ds['Income'] # same as above, but keeps as column
## # A tibble: 50 x 1
##    Income
##     <int>
##  1   3624
##  2   6315
##  3   4530
##  4   3378
##  5   5114
##  6   4884
##  7   5348
##  8   4809
##  9   4815
## 10   4091
## # ... with 40 more rows
ds[c('Income', 'Population')] # choose several columns, keeps as columns
## # A tibble: 50 x 2
##    Income Population
##     <int>      <int>
##  1   3624       3615
##  2   6315        365
##  3   4530       2212
##  4   3378       2110
##  5   5114      21198
##  6   4884       2541
##  7   5348       3100
##  8   4809        579
##  9   4815       8277
## 10   4091       4931
## # ... with 40 more rows
ds[3] # using numbers, again keeping as column
## # A tibble: 50 x 1
##    Income
##     <int>
##  1   3624
##  2   6315
##  3   4530
##  4   3378
##  5   5114
##  6   4884
##  7   5348
##  8   4809
##  9   4815
## 10   4091
## # ... with 40 more rows
ds[[3]] # converts to vector
##  [1] 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 4963 4119 5107 4458
## [15] 4628 4669 3712 3545 3694 5299 4755 4751 4675 3098 4254 4347 4508 5149
## [29] 4281 5237 3601 4903 3875 5087 4561 3983 4660 4449 4558 3635 4167 3821
## [43] 4188 4022 3907 4701 4864 3617 4468 4566
# combining [[]] and []
ds[[3]][4] # choose fourth item of third column
## [1] 3378
# ds[3][4] # this doesn't work 
ds[3,4] # but this does, which means [rownumber, columnnumber]
## # A tibble: 1 x 1
##   Illiteracy
##        <dbl>
## 1        1.8
ds[[3,4]] # converts to number
## [1] 1.8
# a range:
ds[1:5, 1:4]
## # A tibble: 5 x 4
##    StateName Population Income Illiteracy
##        <chr>      <int>  <int>      <dbl>
## 1    Alabama       3615   3624        2.1
## 2     Alaska        365   6315        1.5
## 3    Arizona       2212   4530        1.8
## 4   Arkansas       2110   3378        1.9
## 5 California      21198   5114        1.1
# as a list
ds_list$number # converts to vector
##  [1]  1  2  3  4  5  6  7  8  9 10
ds_list[[2]] # as a vector
##  [1]  1  2  3  4  5  6  7  8  9 10
ds_list[2] # as a list containing one named item
## $number
##  [1]  1  2  3  4  5  6  7  8  9 10
# combining [[]] and []
ds_list[[2]][4] # chooses fourth item of the second list item
## [1] 4
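
A way to see the difference between [] and [[]] (or $) is to check the class of what comes back. The sketch below uses small made-up objects (example_df and example_list are illustrative names), with values copied from the first two rows of the data shown above:

```r
example_df <- data.frame(Income = c(3624L, 6315L), Population = c(3615L, 365L))
example_list <- list(number = 1:10, char = c('hi', 'there'))

# [] keeps the container: a dataframe stays a dataframe, a list stays a list
class(example_df['Income'])
class(example_list[2])

# [[]] and $ pull out the contents themselves
class(example_df[['Income']])
class(example_df$Income)
class(example_list[[2]])
```

In short, [] returns a smaller version of the same kind of object, while [[]] and $ return what is stored inside it.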

Using and making functions

In R, nearly every command is a function. You can look at the contents of any function by simply typing the function name without the ():

factor
## function (x = character(), levels, labels = levels, exclude = NA, 
##     ordered = is.ordered(x), nmax = NA) 
## {
##     if (is.null(x)) 
##         x <- character()
##     nx <- names(x)
##     if (missing(levels)) {
##         y <- unique(x, nmax = nmax)
##         ind <- sort.list(y)
##         levels <- unique(as.character(y)[ind])
##     }
##     force(ordered)
##     if (!is.character(x)) 
##         x <- as.character(x)
##     levels <- levels[is.na(match(levels, exclude))]
##     f <- match(x, levels)
##     if (!is.null(nx)) 
##         names(f) <- nx
##     nl <- length(labels)
##     nL <- length(levels)
##     if (!any(nl == c(1L, nL))) 
##         stop(gettextf("invalid 'labels'; length %d should be 1 or %d", 
##             nl, nL), domain = NA)
##     levels(f) <- if (nl == nL) 
##         as.character(labels)
##     else paste0(labels, seq_along(levels))
##     class(f) <- c(if (ordered) "ordered", "factor")
##     f
## }
## <bytecode: 0x3c89188>
## <environment: namespace:base>

For a beginner, this doesn’t always come in handy. However, it is something that is very useful to know as you get more familiar with R.

All functions have arguments. You can see the arguments by hitting TAB when your cursor is inside the function (between the ()). A list of argument options will be shown, along with quick help on what each argument is and what it needs. Passing a value to an argument (e.g. function_name(argument1 = value)) lets the function perform its action with that argument value. Let’s make a simple example by summing two values, plus an optional third that has a default value of 0:

adding <- function(value1, value2, value3 = 0) {
    value1 + value2 + value3
}

We have now created a new function called adding. The first two arguments require a value, while the third has a default so it doesn’t need one. Let’s add 2 and 2. There are two ways to do it: using positional arguments or using named arguments.

# positional
adding(2, 2)
## [1] 4
# named
adding(value1 = 2, value2 = 2)
## [1] 4

Generally, the first one or two arguments are passed by position, while further arguments are named. Like so:

adding(2, 2, value3 = 2)
## [1] 6
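
The same pattern works for wrapping any small calculation. As a sketch that builds on the earlier weight example, kg_to_lb and its lb_per_kg argument are made-up names; the default of 2.2 is the conversion factor used earlier:

```r
# convert kilograms to pounds; the factor has a default but can be overridden
kg_to_lb <- function(kg, lb_per_kg = 2.2) {
    kg * lb_per_kg
}

# uses the default factor
kg_to_lb(75)

# overrides the default with a named argument
kg_to_lb(75, lb_per_kg = 2)

# the argument names of your own function
names(formals(kg_to_lb))
```

Putting the conversion factor in an argument with a default keeps the common case short while still letting you change it when needed.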

That’s the simple rundown of making functions. We’ll make more complex ones as we learn more.

Assignment:

TBA