R is an object-oriented programming language for statistical computing and graphics.
There are many benefits to learning and using R:
Interacting with R (sending commands and getting results output) is done through the console. RStudio (cheatsheet here) has four panes, two of which we’ll cover first: the Console pane and the open files pane. The console is where code is sent and where R returns the output from that code. The open files include R scripts (.R files) and R Markdown (.Rmd) files. You write your R code in the R/Rmd files and send a line of code to the console using Ctrl-Enter.
It’s generally good practice to make an R project for a specific research objective (e.g. your thesis). Create an .Rproj file by using File -> New Project. Anytime you work on that project, open the .Rproj file directly or via File -> Recent Projects. You can have multiple R projects.
Getting help is easy in R. There is help for individual functions (i.e. commands) as well as for R packages. To get help for a specific function, put ? in front of the function name:
?head
Sometimes, specific packages have a more detailed description of the package, called a vignette. You can first list the vignettes inside a package, then open the specific vignette you want to read:
vignette(package = 'dplyr')
vignette('introduction', package = 'dplyr')
Not all packages have vignettes, but it’s always useful to check if there is one.
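Beyond ? and vignettes, a few other helpers that ship with R cover similar ground. The sketch below is only an illustration and uses nothing outside base R: help() is the long form of ?, apropos() finds function names matching a pattern, and args() shows a function's arguments without opening its help page. The variable name read_functions is made up for this example.

```r
# Long form of ?head
help('head')

# Find objects whose names start with "read." (e.g. read.csv, read.table)
read_functions <- apropos('^read\\.')
read_functions

# Show the arguments of a function without opening the help page
args(head)
```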
There are two types of R files: .R and .Rmd. R scripts (.R) contain only R code and nothing else; these are useful when all you want to do is run a set code sequence and produce a specific result, figure, or output. R Markdown (.Rmd) files can contain R code, text, citations, tables, figures, lists, and so on. An R Markdown file is like Word… except much more powerful, as it can include R code within the document and can convert to a variety of file types including HTML and Word docx. Using this file type makes your research, manuscripts, and thesis more reproducible and saves you time and stress. For most of the workshop we’ll be using R Markdown.
You can recognize an .Rmd file as it generally has this at the very top of the file:
---
title: "Introducing R Markdown"
author: "Luke Johnston"
date: "July 23, 2015"
output: html_document
---
This snippet of code is called YAML. We’ll talk about that more in a later workshop. For now, just know that rmarkdown (the package) uses the YAML to know what to convert the document to (in this case HTML).
Installing packages can be done using either RStudio in the Packages pane or using the command:
install.packages('packagename')
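It can be handy to check whether a package is already installed before installing it again. The following is a minimal sketch of that idea, not a required step; is_installed is a hypothetical helper written for this example, built on the base R function requireNamespace(). The install line is left commented out because it needs an internet connection.

```r
# Hypothetical helper (not part of R or readr): TRUE if a package is installed
is_installed <- function(pkg) {
    requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with R, so this is TRUE on any R installation
is_installed('stats')

# Report whether readr still needs installing;
# uncomment the install line to actually install it
if (!is_installed('readr')) {
    message('readr is not installed yet')
    # install.packages('readr')
}
```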
We’ll use the package readr soon, so let’s load it up:
library(readr)
We can now use the functions inside the readr package. To see what functions there are, you can go to the console and start typing readr::, then hit TAB. A list will pop up, showing all the functions inside readr. This TAB-completion is very useful.
Let’s use the readr functions to load in a dataset from the codeasmanuscript site. It’s generally good to store your data as csv (comma-separated values) files. These are plain text, meaning there is nothing but text in the file (unlike Excel .xlsx files, which are not plain text; they contain markup inside the file itself).
You can import files from your computer. To keep things easy, make sure the data file is in the same folder as your .Rmd/.R/.Rproj file(s). You can import the file into R via…
# comment: to import use:
ds <- read_csv('states_data.csv')
## Parsed with column specification:
## cols(
## StateName = col_character(),
## Population = col_integer(),
## Income = col_integer(),
## Illiteracy = col_double(),
## LifeExp = col_double(),
## Murder = col_double(),
## HSGrad = col_double(),
## Frost = col_integer(),
## Area = col_integer(),
## Region = col_character(),
## Division = col_character(),
## Longitude = col_double(),
## Latitude = col_double()
## )
# or from a website
# ds <- read_csv('http://codeasmanuscript.org/states_data.csv')
ds
## # A tibble: 50 x 13
## StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## 7 Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## 8 Delaware 579 4809 0.9 70.06 6.2 54.6 103
## 9 Florida 8277 4815 1.3 70.66 10.7 52.6 11
## 10 Georgia 4931 4091 2.0 68.54 13.9 40.6 60
## # ... with 40 more rows, and 5 more variables: Area <int>, Region <chr>,
## # Division <chr>, Longitude <dbl>, Latitude <dbl>
To export data inside R to a file, use…
write_csv(ds, 'states.csv')
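A quick way to convince yourself that exporting and importing work is a round trip: write a small dataframe to a file, read it back, and compare. The sketch below uses the base R equivalents write.csv() and read.csv() so that it runs without any extra packages; the toy data and the names toy, csv_path, and toy_again are made up for illustration.

```r
# A small made-up dataframe (illustrative values, not the states data)
toy <- data.frame(
    name = c('a', 'b', 'c'),
    value = c(1.5, 2.5, 3.5),
    stringsAsFactors = FALSE
)

# Write to a temporary file so nothing is left in the project folder
csv_path <- tempfile(fileext = '.csv')
write.csv(toy, csv_path, row.names = FALSE)

# Read it back and compare with the original
toy_again <- read.csv(csv_path, stringsAsFactors = FALSE)
all.equal(toy, toy_again)
```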
There are several functions to look over your data.
# comment: quick summary
summary(ds)
## StateName Population Income Illiteracy
## Length:50 Min. : 365 Min. :3098 Min. :0.500
## Class :character 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625
## Mode :character Median : 2838 Median :4519 Median :0.950
## Mean : 4246 Mean :4436 Mean :1.170
## 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575
## Max. :21198 Max. :6315 Max. :2.800
## LifeExp Murder HSGrad Frost
## Min. :67.96 Min. : 1.400 Min. :37.80 Min. : 0.00
## 1st Qu.:70.12 1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25
## Median :70.67 Median : 6.850 Median :53.25 Median :114.50
## Mean :70.88 Mean : 7.378 Mean :53.11 Mean :104.46
## 3rd Qu.:71.89 3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75
## Max. :73.60 Max. :15.100 Max. :67.30 Max. :188.00
## Area Region Division Longitude
## Min. : 1049 Length:50 Length:50 Min. :-127.25
## 1st Qu.: 36985 Class :character Class :character 1st Qu.:-104.16
## Median : 54277 Mode :character Mode :character Median : -89.90
## Mean : 70736 Mean : -92.46
## 3rd Qu.: 81162 3rd Qu.: -78.98
## Max. :566432 Max. : -68.98
## Latitude
## Min. :27.87
## 1st Qu.:35.55
## Median :39.62
## Mean :39.41
## 3rd Qu.:43.14
## Max. :49.25
# first 6 rows
head(ds)
## # A tibble: 6 x 13
## StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## # ... with 5 more variables: Area <int>, Region <chr>, Division <chr>,
## # Longitude <dbl>, Latitude <dbl>
# last 6 rows
tail(ds)
## # A tibble: 6 x 13
## StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Vermont 472 3907 0.6 71.64 5.5 57.1 168
## 2 Virginia 4981 4701 1.4 70.08 9.5 47.8 85
## 3 Washington 3559 4864 0.6 71.72 4.3 63.5 32
## 4 West Virginia 1799 3617 1.4 69.48 6.7 41.6 100
## 5 Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149
## 6 Wyoming 376 4566 0.6 70.29 6.9 62.9 173
## # ... with 5 more variables: Area <int>, Region <chr>, Division <chr>,
## # Longitude <dbl>, Latitude <dbl>
# column names of your dataset
colnames(ds)
## [1] "StateName" "Population" "Income" "Illiteracy" "LifeExp"
## [6] "Murder" "HSGrad" "Frost" "Area" "Region"
## [11] "Division" "Longitude" "Latitude"
For a very detailed view of the exact contents of the data object, the str function is very useful:
str(ds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 50 obs. of 13 variables:
## $ StateName : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Population: int 3615 365 2212 2110 21198 2541 3100 579 8277 4931 ...
## $ Income : int 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 ...
## $ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ LifeExp : num 69 69.3 70.5 70.7 71.7 ...
## $ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ HSGrad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ Frost : int 20 152 15 65 20 166 139 103 11 60 ...
## $ Area : int 50708 566432 113417 51945 156361 103766 4862 1982 54090 58073 ...
## $ Region : chr "South" "West" "West" "South" ...
## $ Division : chr "East South Central" "Pacific" "Mountain" "West South Central" ...
## $ Longitude : num -86.8 -127.2 -111.6 -92.3 -119.8 ...
## $ Latitude : num 32.6 49.2 34.2 34.7 36.5 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 13
## .. ..$ StateName : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Population: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Income : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Illiteracy: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ LifeExp : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Murder : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ HSGrad : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Frost : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Area : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Region : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Division : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Longitude : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Latitude : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
To see the dimensions (number of rows and columns) of a dataset, you can use:
dim(ds)
## [1] 50 13
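If only one of the two dimensions is needed, base R also has nrow() and ncol(). The short sketch below uses mtcars, a small dataset bundled with R, purely so that the example runs anywhere.

```r
# mtcars is a small dataset that ships with R
dim(mtcars)      # rows, then columns
nrow(mtcars)     # number of rows only
ncol(mtcars)     # number of columns only
dim(mtcars)[1]   # same as nrow(), by taking the first element of dim()
```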
There are several types of data in R. In general, these can be simplified into two classes: continuous and discrete. Each can be divided further: continuous data into numeric and integer, and discrete data into logical, character, and factor.
Continuous data types look like…
0.5:10.5
## [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5
class(0.5:10.5)
## [1] "numeric"
1L:10L # comment: the `L` forces R to see it as an integer
## [1] 1 2 3 4 5 6 7 8 9 10
class(1L:10L)
## [1] "integer"
… while discrete data types look like
c(TRUE, FALSE) # comment: the `c` means 'combine' or 'concatenate'
## [1] TRUE FALSE
class(c(TRUE, FALSE))
## [1] "logical"
c('hi', 'there') # this is also called a vector
## [1] "hi" "there"
as.numeric(c('hi', 'there'))
## Warning: NAs introduced by coercion
## [1] NA NA
class(c('hi', 'there', 'hi'))
## [1] "character"
factor(c('hi', 'there', 'hi'))
## [1] hi there hi
## Levels: hi there
class(factor(c('hi', 'there', 'hi')))
## [1] "factor"
# comment: factors are basically numbers underneath.
as.numeric(factor(c('hi', 'there', 'hi')))
## [1] 1 2 1
# comment: characters are not.
as.numeric(c('hi', 'there'))
## Warning: NAs introduced by coercion
## [1] NA NA
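Because factors are stored as numbers underneath, converting a factor whose labels look like numbers can give a surprising result. The sketch below, with a made-up factor f, shows the difference between converting the factor directly and converting its labels first.

```r
# A factor whose labels happen to look like numbers
f <- factor(c('10', '20', '10'))

as.numeric(f)                # the underlying level codes: 1 2 1
as.numeric(as.character(f))  # the labels as numbers: 10 20 10

# Checking a type before converting
is.numeric(1.5)      # TRUE
is.character('hi')   # TRUE
is.numeric('hi')     # FALSE
```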
There are also more complex object types in R. Two common ones are lists and dataframes. A dataframe is what you just loaded above in the ds object. Its columns can each hold a different data type. So…
class(ds)
## [1] "tbl_df" "tbl" "data.frame"
A list can have any number of data types inside it, including dataframes and more lists.
ds_list <- list(data = ds, number = 1:10, char = c('hi', 'there'))
class(ds_list)
## [1] "list"
ds_list
## $data
## # A tibble: 50 x 13
## StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## 7 Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## 8 Delaware 579 4809 0.9 70.06 6.2 54.6 103
## 9 Florida 8277 4815 1.3 70.66 10.7 52.6 11
## 10 Georgia 4931 4091 2.0 68.54 13.9 40.6 60
## # ... with 40 more rows, and 5 more variables: Area <int>, Region <chr>,
## # Division <chr>, Longitude <dbl>, Latitude <dbl>
##
## $number
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $char
## [1] "hi" "there"
names(ds_list) # find out the names of the objects inside the list
## [1] "data" "number" "char"
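Printing a whole list can get long, so it often helps to summarise it instead. The following sketch builds a small made-up list (example_list is a name chosen for illustration) that holds a dataframe, a vector, and another list, then inspects it with base R functions.

```r
# A made-up list mixing a dataframe, a vector, and another list
example_list <- list(
    data = data.frame(x = 1:3, y = c('a', 'b', 'c')),
    number = 1:10,
    nested = list(char = c('hi', 'there'))
)

length(example_list)              # number of top-level items
sapply(example_list, class)       # class of each top-level item
str(example_list, max.level = 1)  # compact overview of the top level
```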
Often you need to run a function to get an output and then use that output to do other things to it, like plot it or make it into a table. In order to help with that, there is variable assignment using the assignment operator <-. We used it earlier to assign the dataframe from read_csv to the ds object.
weight_kg <- 75
weight_kg
## [1] 75
weight_lb <- weight_kg * 2.2
weight_lb
## [1] 165
You can use several methods to take one or more items from a vector, dataframe, or list using $, [], or [[]].
# vectors are usually indexed with []
num <- 1:10
num[1] # first item
## [1] 1
num[9] # ninth item
## [1] 9
# dataframes and lists can use any of the methods
ds$Income # directly choose the Income column, converts to vector
## [1] 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 4963 4119 5107 4458
## [15] 4628 4669 3712 3545 3694 5299 4755 4751 4675 3098 4254 4347 4508 5149
## [29] 4281 5237 3601 4903 3875 5087 4561 3983 4660 4449 4558 3635 4167 3821
## [43] 4188 4022 3907 4701 4864 3617 4468 4566
ds['Income'] # same as above, but keeps as column
## # A tibble: 50 x 1
## Income
## <int>
## 1 3624
## 2 6315
## 3 4530
## 4 3378
## 5 5114
## 6 4884
## 7 5348
## 8 4809
## 9 4815
## 10 4091
## # ... with 40 more rows
ds[c('Income', 'Population')] # several columns at once, kept as columns
## # A tibble: 50 x 2
## Income Population
## <int> <int>
## 1 3624 3615
## 2 6315 365
## 3 4530 2212
## 4 3378 2110
## 5 5114 21198
## 6 4884 2541
## 7 5348 3100
## 8 4809 579
## 9 4815 8277
## 10 4091 4931
## # ... with 40 more rows
ds[3] # using numbers, again keeping as column
## # A tibble: 50 x 1
## Income
## <int>
## 1 3624
## 2 6315
## 3 4530
## 4 3378
## 5 5114
## 6 4884
## 7 5348
## 8 4809
## 9 4815
## 10 4091
## # ... with 40 more rows
ds[[3]] # converts to vector
## [1] 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 4963 4119 5107 4458
## [15] 4628 4669 3712 3545 3694 5299 4755 4751 4675 3098 4254 4347 4508 5149
## [29] 4281 5237 3601 4903 3875 5087 4561 3983 4660 4449 4558 3635 4167 3821
## [43] 4188 4022 3907 4701 4864 3617 4468 4566
# combining [[]] and []
ds[[3]][4] # choose fourth item of third column
## [1] 3378
# ds[3][4] # this doesn't work
ds[3,4] # but this does, which means [rownumber, columnnumber]
## # A tibble: 1 x 1
## Illiteracy
## <dbl>
## 1 1.8
ds[[3,4]] # converts to number
## [1] 1.8
# a range:
ds[1:5, 1:4]
## # A tibble: 5 x 4
## StateName Population Income Illiteracy
## <chr> <int> <int> <dbl>
## 1 Alabama 3615 3624 2.1
## 2 Alaska 365 6315 1.5
## 3 Arizona 2212 4530 1.8
## 4 Arkansas 2110 3378 1.9
## 5 California 21198 5114 1.1
# as a list
ds_list$number # converts to vector
## [1] 1 2 3 4 5 6 7 8 9 10
ds_list[[2]] # as a vector
## [1] 1 2 3 4 5 6 7 8 9 10
ds_list[2] # as a list with one named item
## $number
## [1] 1 2 3 4 5 6 7 8 9 10
# combining [[]] and []
ds_list[[2]][4] # chooses fourth item of the second list item
## [1] 4
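One way to keep the three methods straight is to check the class of what each one returns. This is a small sketch with made-up objects (small_list and df); note that df here is a plain base R data.frame rather than the tibble that read_csv returns, though the column extraction shown behaves the same way for both.

```r
small_list <- list(number = 1:10, char = c('hi', 'there'))

class(small_list['number'])    # "list": single brackets keep the container
class(small_list[['number']])  # "integer": double brackets extract the item
class(small_list$number)       # "integer": $ extracts the item by name

df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
class(df['b'])     # "data.frame": still a one-column dataframe
class(df[['b']])   # "numeric": the column as a plain vector
```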
In R, nearly every command is a function. You can look at the contents of any function by simply typing in the function without the ():
factor
## function (x = character(), levels, labels = levels, exclude = NA,
## ordered = is.ordered(x), nmax = NA)
## {
## if (is.null(x))
## x <- character()
## nx <- names(x)
## if (missing(levels)) {
## y <- unique(x, nmax = nmax)
## ind <- sort.list(y)
## levels <- unique(as.character(y)[ind])
## }
## force(ordered)
## if (!is.character(x))
## x <- as.character(x)
## levels <- levels[is.na(match(levels, exclude))]
## f <- match(x, levels)
## if (!is.null(nx))
## names(f) <- nx
## nl <- length(labels)
## nL <- length(levels)
## if (!any(nl == c(1L, nL)))
## stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
## nl, nL), domain = NA)
## levels(f) <- if (nl == nL)
## as.character(labels)
## else paste0(labels, seq_along(levels))
## class(f) <- c(if (ordered) "ordered", "factor")
## f
## }
## <bytecode: 0x3c89188>
## <environment: namespace:base>
For a beginner, this doesn’t always come in handy. However, it is something that is very useful to know as you get more familiar with R.
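When the full source is more than you need, base R can show just the argument list of a function. The sketch below applies args() and formals() to factor, the function printed above.

```r
args(factor)              # prints only the argument list
names(formals(factor))    # the argument names as a character vector
formals(factor)$exclude   # the default value of a single argument
```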
All functions have arguments. You can see the arguments by hitting TAB when your cursor is inside the function (between the ()). A list of argument options will be shown, as well as quick help on what each argument is and needs. Passing a value to an argument (e.g. function_name(argument1 = value)) lets the function perform its action with that argument value. Let’s make a simple example by doing a sum of two values, plus an extra third that has the default value of 0:
adding <- function(value1, value2, value3 = 0) {
    value1 + value2 + value3
}
We have now created a new function called adding. The first two arguments require a value, while the third has a default so doesn’t need a value. Let’s add 2 with 2. There are two ways to do it, using positional arguments and named arguments.
# positional
adding(2, 2)
## [1] 4
# named
adding(value1 = 2, value2 = 2)
## [1] 4
Generally, the first two arguments are passed by position, while further arguments are named. Like so:
adding(2, 2, value3 = 2)
## [1] 6
That’s the quick rundown of making functions. We’ll make more complex ones as we learn more.
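As one more illustration of default arguments, here is a sketch that wraps the kilogram-to-pound conversion from the assignment example into a function. Both kg_to_lb and its factor_lb argument are hypothetical names made up for this example; the default of 2.2 is the same rough conversion factor used earlier.

```r
# Hypothetical helper: convert kilograms to pounds.
# factor_lb defaults to the rough 2.2 conversion used earlier in the text.
kg_to_lb <- function(kg, factor_lb = 2.2) {
    kg * factor_lb
}

kg_to_lb(75)                       # uses the default factor
kg_to_lb(75, factor_lb = 2.20462)  # overrides the default by name
kg_to_lb(c(50, 75, 100))           # works on a whole vector at once
```

Giving the conversion factor a default keeps the common call short while still letting a more precise value be passed by name.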