Data wrangling: loosely, the process of manually converting data from one “raw” form into another format that allows for more ease of eventual analysis and visualization (adapted from Wikipedia)

R was developed by statisticians to do statistical work. As such, R has built-in capabilities for wrangling and managing data, getting data into a format that can be used for further analysis, and working with datasets called dataframes. There are also excellent packages that make data wrangling in R much easier: dplyr and tidyr. A minor assumption about using these packages: the data you are importing is already fairly clean (e.g. no large amounts of missing values, no data entry errors or fixes needed, etc.).

Learning objectives:

  • Appreciate that data wrangling and managing are important and time-consuming aspects of research
  • Learn the ‘grammar’ of shaping and moving data from one form to another
  • Understand and apply basic statistics to quickly summarize your dataset

Let’s get wrangling: the basics

Wrangling your data, dplyr style

Data wrangling can be a bit tedious in base R (R without packages), so we’ll be using two packages designed to make this easier. dplyr comes with the %>% pipe function (via the magrittr package), which works similarly to how the Bash shell | pipe works (for those familiar with Bash, i.e. those who use Mac or Linux). The command on the right-hand side takes the output from the command on the left-hand side, just like how a plumbing pipe works for water. tbl_df converts the object into the tbl class, which makes printing of the output nicer. The other nice thing about dplyr is that it can connect to SQL and other types of databases and is very fast at wrangling data, unlike base R. Check out the resources page for links to more about this.
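
For those new to pipes, here is a minimal illustration using made-up numbers: the two calls below are equivalent, since %>% simply passes the left-hand side as the first argument of the function on its right.

library(dplyr) # loads the %>% pipe (via magrittr)
mean(c(1, 5, 9))
## [1] 5
c(1, 5, 9) %>% mean()
## [1] 5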

library(dplyr)
library(tidyr)
library(readr)

# Import the dataset
ds <- read_csv("http://codeasmanuscript.org/states_data.csv")

## Compare
head(ds)
## # A tibble: 6 x 13
##    StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
##        <chr>      <dbl>  <dbl>      <dbl>   <dbl>  <dbl>  <dbl> <dbl>
## 1    Alabama       3615   3624        2.1   69.05   15.1   41.3    20
## 2     Alaska        365   6315        1.5   69.31   11.3   66.7   152
## 3    Arizona       2212   4530        1.8   70.55    7.8   58.1    15
## 4   Arkansas       2110   3378        1.9   70.66   10.1   39.9    65
## 5 California      21198   5114        1.1   71.71   10.3   62.6    20
## 6   Colorado       2541   4884        0.7   72.06    6.8   63.9   166
## # ... with 5 more variables: Area <dbl>, Region <chr>, Division <chr>,
## #   Longitude <dbl>, Latitude <dbl>
## With:
tbl_df(ds)
## # A tibble: 50 x 13
##      StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
##          <chr>      <dbl>  <dbl>      <dbl>   <dbl>  <dbl>  <dbl> <dbl>
##  1     Alabama       3615   3624        2.1   69.05   15.1   41.3    20
##  2      Alaska        365   6315        1.5   69.31   11.3   66.7   152
##  3     Arizona       2212   4530        1.8   70.55    7.8   58.1    15
##  4    Arkansas       2110   3378        1.9   70.66   10.1   39.9    65
##  5  California      21198   5114        1.1   71.71   10.3   62.6    20
##  6    Colorado       2541   4884        0.7   72.06    6.8   63.9   166
##  7 Connecticut       3100   5348        1.1   72.48    3.1   56.0   139
##  8    Delaware        579   4809        0.9   70.06    6.2   54.6   103
##  9     Florida       8277   4815        1.3   70.66   10.7   52.6    11
## 10     Georgia       4931   4091        2.0   68.54   13.9   40.6    60
## # ... with 40 more rows, and 5 more variables: Area <dbl>, Region <chr>,
## #   Division <chr>, Longitude <dbl>, Latitude <dbl>
## Now put the tbl dataset into a new object
ds2 <- tbl_df(ds)
ds2
## # A tibble: 50 x 13
##      StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
##          <chr>      <dbl>  <dbl>      <dbl>   <dbl>  <dbl>  <dbl> <dbl>
##  1     Alabama       3615   3624        2.1   69.05   15.1   41.3    20
##  2      Alaska        365   6315        1.5   69.31   11.3   66.7   152
##  3     Arizona       2212   4530        1.8   70.55    7.8   58.1    15
##  4    Arkansas       2110   3378        1.9   70.66   10.1   39.9    65
##  5  California      21198   5114        1.1   71.71   10.3   62.6    20
##  6    Colorado       2541   4884        0.7   72.06    6.8   63.9   166
##  7 Connecticut       3100   5348        1.1   72.48    3.1   56.0   139
##  8    Delaware        579   4809        0.9   70.06    6.2   54.6   103
##  9     Florida       8277   4815        1.3   70.66   10.7   52.6    11
## 10     Georgia       4931   4091        2.0   68.54   13.9   40.6    60
## # ... with 40 more rows, and 5 more variables: Area <dbl>, Region <chr>,
## #   Division <chr>, Longitude <dbl>, Latitude <dbl>

Select columns

Oftentimes, you want to select only some of the columns or variables from a dataset. For that we use the select command, which does what it says. Note the use of the %>% operator. This allows you to chain commands together, letting you do more with only a few commands.

ds2 %>% 
    select(Population, Income, Area)
## # A tibble: 50 x 3
##    Population Income   Area
##         <dbl>  <dbl>  <dbl>
##  1       3615   3624  50708
##  2        365   6315 566432
##  3       2212   4530 113417
##  4       2110   3378  51945
##  5      21198   5114 156361
##  6       2541   4884 103766
##  7       3100   5348   4862
##  8        579   4809   1982
##  9       8277   4815  54090
## 10       4931   4091  58073
## # ... with 40 more rows

The real power of the select() function comes when you combine it with regular expressions (regexp), or rather pattern searching. dplyr has several pattern-searching functions, including starts_with(), contains(), and the most powerful, matches(). The function matches() uses regexp, which are special commands that use a particular syntax for searching for patterns. For example, ^string means that string comes at the start of the text, string$ means that string comes at the end, string|strung matches either string or strung, etc. Regexp syntax is nearly a language unto itself, so use StackOverflow and Google as much as you can!
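
To see these regexp anchors on their own, outside of select(), here is a minimal sketch using base R's grepl(), which returns TRUE or FALSE for whether the pattern matches each made-up string:

grepl('^Pop', c('Population', 'PopDensity', 'Unpopular')) # starts with 'Pop'
## [1]  TRUE  TRUE FALSE
grepl('ion$', c('Population', 'Region', 'Income')) # ends with 'ion'
## [1]  TRUE  TRUE FALSE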

OK, so let’s say you want to search for variables that have certain patterns:

ds2 %>% 
    select(contains('Pop'), starts_with('Fr'))
## # A tibble: 50 x 2
##    Population Frost
##         <dbl> <dbl>
##  1       3615    20
##  2        365   152
##  3       2212    15
##  4       2110    65
##  5      21198    20
##  6       2541   166
##  7       3100   139
##  8        579   103
##  9       8277    11
## 10       4931    60
## # ... with 40 more rows
## Or, more simply:
ds2 %>% 
    select(matches('Pop|Fr'))
## # A tibble: 50 x 2
##    Population Frost
##         <dbl> <dbl>
##  1       3615    20
##  2        365   152
##  3       2212    15
##  4       2110    65
##  5      21198    20
##  6       2541   166
##  7       3100   139
##  8        579   103
##  9       8277    11
## 10       4931    60
## # ... with 40 more rows

You can see that if you have many variables that have a common structure to their name, you can quickly select all those variables by using functions such as matches().

Rename columns

You can rename columns using the rename command (the new name goes on the left-hand side, so newname = oldname).

ds2 %>% 
    rename(HighSchoolGrad = HSGrad)
## # A tibble: 50 x 13
##      StateName Population Income Illiteracy LifeExp Murder HighSchoolGrad
##          <chr>      <dbl>  <dbl>      <dbl>   <dbl>  <dbl>          <dbl>
##  1     Alabama       3615   3624        2.1   69.05   15.1           41.3
##  2      Alaska        365   6315        1.5   69.31   11.3           66.7
##  3     Arizona       2212   4530        1.8   70.55    7.8           58.1
##  4    Arkansas       2110   3378        1.9   70.66   10.1           39.9
##  5  California      21198   5114        1.1   71.71   10.3           62.6
##  6    Colorado       2541   4884        0.7   72.06    6.8           63.9
##  7 Connecticut       3100   5348        1.1   72.48    3.1           56.0
##  8    Delaware        579   4809        0.9   70.06    6.2           54.6
##  9     Florida       8277   4815        1.3   70.66   10.7           52.6
## 10     Georgia       4931   4091        2.0   68.54   13.9           40.6
## # ... with 40 more rows, and 6 more variables: Frost <dbl>, Area <dbl>,
## #   Region <chr>, Division <chr>, Longitude <dbl>, Latitude <dbl>

Filter rows

Another common task in data wrangling is subsetting your dataset. You can subset the dataset using filter. Note the double equal sign == for testing whether Population is equal to 365. A single = is used for something else (assigning things to objects or to arguments in functions/commands).
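
As a quick aside, == is a test that returns TRUE or FALSE for each value, which is exactly what filter() uses to decide which rows to keep; a minimal sketch with toy numbers:

c(365, 2212, 365) == 365
## [1]  TRUE FALSE  TRUE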

## For continuous/number data
ds2 %>% 
    select(Population, Illiteracy, Income, Region) %>% 
    filter(Illiteracy < 2, Population == 365)
## # A tibble: 1 x 4
##   Population Illiteracy Income Region
##        <dbl>      <dbl>  <dbl>  <chr>
## 1        365        1.5   6315   West
## Or for 'string' (words or letters) data
ds2 %>%
    select(Population, Illiteracy, Income, Region) %>% 
    filter(Region == 'Northeast')
## # A tibble: 9 x 4
##   Population Illiteracy Income    Region
##        <dbl>      <dbl>  <dbl>     <chr>
## 1       3100        1.1   5348 Northeast
## 2       1058        0.7   3694 Northeast
## 3       5814        1.1   4755 Northeast
## 4        812        0.7   4281 Northeast
## 5       7333        1.1   5237 Northeast
## 6      18076        1.4   4903 Northeast
## 7      11860        1.0   4449 Northeast
## 8        931        1.3   4558 Northeast
## 9        472        0.6   3907 Northeast

Create new columns or clean up existing ones

If you want to create a new column, you use the mutate command. The ifelse() command lets you assign different values depending on a condition.
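
To see ifelse() on its own with toy values: it checks the condition element-by-element, returning the second argument where the condition is TRUE and the third where it is FALSE.

ifelse(c(3624, 6315) > 5000, 'Yes', 'No')
## [1] "No"  "Yes"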

ds2 %>%
    mutate(Testing = 'yes',
           Rich = ifelse(Income > 5000, 'Yes', 'No')) %>% 
    select(StateName, Population, Income, Rich, Testing)
## # A tibble: 50 x 5
##      StateName Population Income  Rich Testing
##          <chr>      <dbl>  <dbl> <chr>   <chr>
##  1     Alabama       3615   3624    No     yes
##  2      Alaska        365   6315   Yes     yes
##  3     Arizona       2212   4530    No     yes
##  4    Arkansas       2110   3378    No     yes
##  5  California      21198   5114   Yes     yes
##  6    Colorado       2541   4884    No     yes
##  7 Connecticut       3100   5348   Yes     yes
##  8    Delaware        579   4809    No     yes
##  9     Florida       8277   4815    No     yes
## 10     Georgia       4931   4091    No     yes
## # ... with 40 more rows

However, it’s fairly common that you need to do some data janitorial work by cleaning up an existing column. For example, in a dataset with a ‘Sex’ variable, some values may have data entry errors in spelling, such as ‘fmale’ when it should be ‘female’. This needs to be fixed and can be done fairly easily in R. So let’s pretend that all words starting with ‘S’ in the Region column should actually start with ‘J’, and that all words with a ‘t’ at the end should have it removed. For this, we will use the gsub() command within mutate(), which will globally substitute a pattern with the replacement.

ds2 %>% 
    select(Region) %>% 
    mutate(
        Region = gsub(pattern = '^S', replacement = 'J', Region),
        Region = gsub(pattern = 't$', replacement = '', Region)
        )
## # A tibble: 50 x 1
##      Region
##       <chr>
##  1    Jouth
##  2      Wes
##  3      Wes
##  4    Jouth
##  5      Wes
##  6      Wes
##  7 Northeas
##  8    Jouth
##  9    Jouth
## 10    Jouth
## # ... with 40 more rows

Notice the ^ and $ characters. Those are special syntax symbols used in regexp commands. We introduced them above, but we’ll quickly go over them again. These special symbols perform certain functions. In this case ^S matches an “S” at the start of the string/character, while t$ matches a “t” at the end of a string. Or let’s say that every “Northeast” should actually be abbreviated to “NE”. We can do some cleaning fairly easily here.

ds2 %>% 
    mutate(Region = gsub('Northeast', 'NE', Region, ignore.case = TRUE)) %>% 
    select(Region)
## # A tibble: 50 x 1
##    Region
##     <chr>
##  1  South
##  2   West
##  3   West
##  4  South
##  5   West
##  6   West
##  7     NE
##  8  South
##  9  South
## 10  South
## # ... with 40 more rows

Regular expressions are incredibly powerful, but also can be confusing. Make sure to check out our resources page for links to sites that explain regexp in more detail.

Chaining pipes

We can start chaining these commands together using the %>% pipe command. There is no limit to how long a chain can be. Chaining commands together using the pipe command makes your code easier to read, lets you write your code faster, and makes it easier for you to go from thinking of an analysis to actually conducting it. arrange sorts/orders/re-arranges the rows by the column Population in ascending order.

ds2 %>%
  filter(Illiteracy > 2) %>%
  select(State = StateName, ## This renames the variable, just like the rename() command
         Population, Area, Frost) %>%
  arrange(Population) %>%
  mutate(Cold = ifelse(Frost > 100, 'Yes', 'No'))
## # A tibble: 6 x 5
##            State Population   Area Frost  Cold
##            <chr>      <dbl>  <dbl> <dbl> <chr>
## 1     New Mexico       1144 121412   120   Yes
## 2    Mississippi       2341  47296    50    No
## 3 South Carolina       2816  30225    65    No
## 4        Alabama       3615  50708    20    No
## 5      Louisiana       3806  44930    12    No
## 6          Texas      12237 262134    35    No

Re-organize your data (using tidyr)

To get the data into a nicer and more analyzable format, you can use the tidyr package. See what gather does in the code below (it converts the data from wide to long format, with one column of variable names and one column of values). Then see what spread does (it converts back from long to wide). Note that you can remove a column by putting a minus - sign in front of a variable when you use select.

## Compare this:
ds2 %>%
  select(-Division, -Region, -matches('itude$'))
## # A tibble: 50 x 9
##      StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
##          <chr>      <dbl>  <dbl>      <dbl>   <dbl>  <dbl>  <dbl> <dbl>
##  1     Alabama       3615   3624        2.1   69.05   15.1   41.3    20
##  2      Alaska        365   6315        1.5   69.31   11.3   66.7   152
##  3     Arizona       2212   4530        1.8   70.55    7.8   58.1    15
##  4    Arkansas       2110   3378        1.9   70.66   10.1   39.9    65
##  5  California      21198   5114        1.1   71.71   10.3   62.6    20
##  6    Colorado       2541   4884        0.7   72.06    6.8   63.9   166
##  7 Connecticut       3100   5348        1.1   72.48    3.1   56.0   139
##  8    Delaware        579   4809        0.9   70.06    6.2   54.6   103
##  9     Florida       8277   4815        1.3   70.66   10.7   52.6    11
## 10     Georgia       4931   4091        2.0   68.54   13.9   40.6    60
## # ... with 40 more rows, and 1 more variables: Area <dbl>
## With this:
ds2 %>%
  select(-Division, -Region, -matches('itude$')) %>% 
  gather(Measure, Value, -StateName)
## # A tibble: 400 x 3
##      StateName    Measure Value
##          <chr>      <chr> <dbl>
##  1     Alabama Population  3615
##  2      Alaska Population   365
##  3     Arizona Population  2212
##  4    Arkansas Population  2110
##  5  California Population 21198
##  6    Colorado Population  2541
##  7 Connecticut Population  3100
##  8    Delaware Population   579
##  9     Florida Population  8277
## 10     Georgia Population  4931
## # ... with 390 more rows
## And back again:
ds2 %>%
  select(-Division, -Region, -matches('itude$')) %>% 
  gather(Measure, Value, -StateName) %>%
  spread(Measure, Value)
## # A tibble: 50 x 9
##      StateName   Area Frost HSGrad Illiteracy Income LifeExp Murder
##  *       <chr>  <dbl> <dbl>  <dbl>      <dbl>  <dbl>   <dbl>  <dbl>
##  1     Alabama  50708    20   41.3        2.1   3624   69.05   15.1
##  2      Alaska 566432   152   66.7        1.5   6315   69.31   11.3
##  3     Arizona 113417    15   58.1        1.8   4530   70.55    7.8
##  4    Arkansas  51945    65   39.9        1.9   3378   70.66   10.1
##  5  California 156361    20   62.6        1.1   5114   71.71   10.3
##  6    Colorado 103766   166   63.9        0.7   4884   72.06    6.8
##  7 Connecticut   4862   139   56.0        1.1   5348   72.48    3.1
##  8    Delaware   1982   103   54.6        0.9   4809   70.06    6.2
##  9     Florida  54090    11   52.6        1.3   4815   70.66   10.7
## 10     Georgia  58073    60   40.6        2.0   4091   68.54   13.9
## # ... with 40 more rows, and 1 more variables: Population <dbl>

Summarise variables

Combined with dplyr’s group_by and summarise, you can quickly summarise data or do further, more complicated analyses. group_by makes it so further analyses or operations work on the groups. summarise transforms the data to contain only the new variable(s) created, in this case the mean, as well as the grouping variable.
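
Before the bigger example below, note that group_by() also works directly on an existing column; a minimal sketch (assuming ds2 from above; output not shown) that would give the mean income and the number of states within each Region:

ds2 %>%
  group_by(Region) %>%
  summarise(MeanIncome = mean(Income),
            States = n())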

ds2 %>%
  select(-Division, -Region, -StateName) %>% 
  gather(Measure, Value) %>%
  group_by(Measure) %>%
  summarise(Mean = mean(Value),
            SD = sd(Value),
            Median = median(Value),
            SampleSize = n())
## # A tibble: 10 x 5
##       Measure        Mean           SD      Median SampleSize
##         <chr>       <dbl>        <dbl>       <dbl>      <int>
##  1       Area 70735.88000 8.532730e+04 54277.00000         50
##  2      Frost   104.46000 5.198085e+01   114.50000         50
##  3     HSGrad    53.10800 8.076998e+00    53.25000         50
##  4 Illiteracy     1.17000 6.095331e-01     0.95000         50
##  5     Income  4435.80000 6.144699e+02  4519.00000         50
##  6   Latitude    39.41074 4.976349e+00    39.62075         50
##  7    LifeExp    70.87860 1.342394e+00    70.67500         50
##  8  Longitude   -92.46414 1.618079e+01   -89.90030         50
##  9     Murder     7.37800 3.691540e+00     6.85000         50
## 10 Population  4246.42000 4.464491e+03  2838.50000         50

Other useful and powerful examples

You can do some really powerful things with dplyr and tidyr functions. For example, you can run each combination of independent and dependent variables in a linear regression (lm()) using gather() and the dplyr do() command, rather than running each individually. To make the output from lm() easier to combine and understand, we use the tidy() command from the broom package (you may also see it written as broom::tidy(), where :: tells R to take the tidy() function from the broom package). If you want more details on how to use this setup, check out my blog post about it.

library(broom)
# Using the swiss practice dataset.
swiss %>%
    gather(Indep, Xvalue, Fertility, Agriculture) %>%
    gather(Dep, Yvalue, Education, Catholic) %>% 
    group_by(Dep, Indep) %>% 
    do(tidy(lm(Yvalue ~ Xvalue + Infant.Mortality + Examination, data = .)))
## # A tibble: 16 x 7
## # Groups:   Dep, Indep [4]
##          Dep       Indep             term    estimate   std.error
##        <chr>       <chr>            <chr>       <dbl>       <dbl>
##  1  Catholic Agriculture      (Intercept) 48.62020731 51.22779001
##  2  Catholic Agriculture           Xvalue  0.08447534  0.31852283
##  3  Catholic Agriculture Infant.Mortality  1.69138478  1.81767192
##  4  Catholic Agriculture      Examination -2.75852964  0.91102265
##  5  Catholic   Fertility      (Intercept) 35.13701855 51.27597790
##  6  Catholic   Fertility           Xvalue  0.39943317  0.60419739
##  7  Catholic   Fertility Infant.Mortality  1.00336670  1.99122191
##  8  Catholic   Fertility      Examination -2.54831847  0.86570652
##  9 Education Agriculture      (Intercept) 12.75076261  9.86830392
## 10 Education Agriculture           Xvalue -0.13540736  0.06135889
## 11 Education Agriculture Infant.Mortality -0.21468852  0.35014860
## 12 Education Agriculture      Examination  0.56818921  0.17549554
## 13 Education   Fertility      (Intercept) 17.62086823  9.48591843
## 14 Education   Fertility           Xvalue -0.34172055  0.11177490
## 15 Education   Fertility Infant.Mortality  0.44332139  0.36837072
## 16 Education   Fertility      Examination  0.51463767  0.16015339
## # ... with 2 more variables: statistic <dbl>, p.value <dbl>