Data wrangling: loosely, the process of manually converting data from one “raw” form into another format that allows for easier eventual analysis and visualization (adapted from Wikipedia).
R was developed by statisticians to do statistical work. As such, embedded within R are capabilities to easily wrangle and manage data, to get data into a format that can be used for further analysis, and to work with datasets called dataframes. There are also excellent packages available to make data wrangling much easier in R: dplyr and tidyr. A minor assumption about using these packages: the data you are importing is fairly clean to begin with (i.e. no large amount of missing values, no data entry errors or fixes needed, etc.).
dplyr

Data wrangling can be a bit tedious in base R (R without packages), so we’ll be using two packages designed to make this easier. dplyr comes with the %>% pipe operator (via the magrittr package), which works similarly to how the Bash shell | pipe works (for those familiar with Bash, i.e. those who use Mac or Linux). The command on the right-hand side takes the output from the command on the left-hand side, just like how a plumbing pipe works for water. tbl_df() converts the object into a tbl class, which makes printing of the output nicer. Another nice thing about dplyr is that it can connect to SQL and other types of databases and is very fast at wrangling data, unlike base R. Check out the resources page for links to more about this.
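To get a feel for how the pipe works before we load any data, here is a minimal sketch on a small made-up numeric vector (the values are purely for illustration):

```r
library(magrittr) # %>% is also loaded for you when you load dplyr

# Without the pipe: functions are nested and read inside-out
round(mean(c(1, 4, 9)), 1)

# With the pipe: the same steps read left to right,
# each command's output flowing into the next
c(1, 4, 9) %>%
    mean() %>%
    round(1)
```

Both versions return 4.7; the piped version simply matches the order in which you think about the steps.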
library(dplyr)
library(tidyr)
library(readr)
# Import the dataset
ds <- read_csv("http://codeasmanuscript.org/states_data.csv")
## Compare
head(ds)
## # A tibble: 6 x 13
## StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## # ... with 5 more variables: Area <dbl>, Region <chr>, Division <chr>,
## # Longitude <dbl>, Latitude <dbl>
## With:
tbl_df(ds)
## # A tibble: 50 x 13
## StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## 7 Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## 8 Delaware 579 4809 0.9 70.06 6.2 54.6 103
## 9 Florida 8277 4815 1.3 70.66 10.7 52.6 11
## 10 Georgia 4931 4091 2.0 68.54 13.9 40.6 60
## # ... with 40 more rows, and 5 more variables: Area <dbl>, Region <chr>,
## # Division <chr>, Longitude <dbl>, Latitude <dbl>
## Now put the tbl dataset into a new object
ds2 <- tbl_df(ds)
ds2
## # A tibble: 50 x 13
## StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## 7 Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## 8 Delaware 579 4809 0.9 70.06 6.2 54.6 103
## 9 Florida 8277 4815 1.3 70.66 10.7 52.6 11
## 10 Georgia 4931 4091 2.0 68.54 13.9 40.6 60
## # ... with 40 more rows, and 5 more variables: Area <dbl>, Region <chr>,
## # Division <chr>, Longitude <dbl>, Latitude <dbl>
Oftentimes, you want to select only some of the columns or variables from a dataset. For that we use the select() command, which does as it says. Note the use of the %>% operator, which lets you chain commands together so you can do more with only a few commands.
ds2 %>%
select(Population, Income, Area)
## # A tibble: 50 x 3
## Population Income Area
## <dbl> <dbl> <dbl>
## 1 3615 3624 50708
## 2 365 6315 566432
## 3 2212 4530 113417
## 4 2110 3378 51945
## 5 21198 5114 156361
## 6 2541 4884 103766
## 7 3100 5348 4862
## 8 579 4809 1982
## 9 8277 4815 54090
## 10 4931 4091 58073
## # ... with 40 more rows
The real power of the select() function comes when you combine it with regular expressions (regexp), or rather pattern searching. dplyr has several pattern searching functions, including starts_with(), contains(), and the most powerful, matches(). The matches() function uses regexp, which are special commands with their own unique syntax for searching for patterns. For example, ^string means that string is at the start of the text, string$ means that string is at the end, string|strung matches either string or strung, and so on. regexp syntax is nearly a language unto itself, so use StackOverflow and Google as much as you can!
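To see those three regexp patterns on their own before using them inside select(), here is a small sketch using base R’s grepl(), which returns TRUE wherever a pattern matches (the example strings are made up):

```r
words <- c("string", "restring", "strings", "strung")

# ^string: "string" at the start of the text
grepl("^string", words)        # TRUE FALSE TRUE FALSE

# string$: "string" at the end of the text
grepl("string$", words)        # TRUE TRUE FALSE FALSE

# string|strung: either pattern anywhere in the text
grepl("string|strung", words)  # TRUE TRUE TRUE TRUE
```

matches() applies the same kind of pattern to variable names instead of values.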
OK, so let’s say you want to search for variables whose names contain certain patterns:
ds2 %>%
select(contains('Pop'), starts_with('Fr'))
## # A tibble: 50 x 2
## Population Frost
## <dbl> <dbl>
## 1 3615 20
## 2 365 152
## 3 2212 15
## 4 2110 65
## 5 21198 20
## 6 2541 166
## 7 3100 139
## 8 579 103
## 9 8277 11
## 10 4931 60
## # ... with 40 more rows
## Or more simplified
ds2 %>%
select(matches('Pop|Fr'))
## # A tibble: 50 x 2
## Population Frost
## <dbl> <dbl>
## 1 3615 20
## 2 365 152
## 3 2212 15
## 4 2110 65
## 5 21198 20
## 6 2541 166
## 7 3100 139
## 8 579 103
## 9 8277 11
## 10 4931 60
## # ... with 40 more rows
You can see that if you have many variables with a common structure to their names, you can quickly select them all by using functions such as matches().
You can rename columns using the rename() command (the new name goes on the left-hand side, so newname = oldname).
ds2 %>%
rename(HighSchoolGrad = HSGrad)
## # A tibble: 50 x 13
## StateName Population Income Illiteracy LifeExp Murder HighSchoolGrad
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9
## 5 California 21198 5114 1.1 71.71 10.3 62.6
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9
## 7 Connecticut 3100 5348 1.1 72.48 3.1 56.0
## 8 Delaware 579 4809 0.9 70.06 6.2 54.6
## 9 Florida 8277 4815 1.3 70.66 10.7 52.6
## 10 Georgia 4931 4091 2.0 68.54 13.9 40.6
## # ... with 40 more rows, and 6 more variables: Frost <dbl>, Area <dbl>,
## # Region <chr>, Division <chr>, Longitude <dbl>, Latitude <dbl>
Another common task in data wrangling is subsetting your dataset. You can subset the dataset using filter(). Note the double equal sign == for testing whether Population is equal to 365. A single = is used for something else (assigning values to objects or setting arguments in functions/commands).
## For continuous/number data
ds2 %>%
select(Population, Illiteracy, Income, Region) %>%
filter(Illiteracy < 2, Population == 365)
## # A tibble: 1 x 4
## Population Illiteracy Income Region
## <dbl> <dbl> <dbl> <chr>
## 1 365 1.5 6315 West
## Or for 'string' (words or letters) data
ds2 %>%
select(Population, Illiteracy, Income, Region) %>%
filter(Region == 'Northeast')
## # A tibble: 9 x 4
## Population Illiteracy Income Region
## <dbl> <dbl> <dbl> <chr>
## 1 3100 1.1 5348 Northeast
## 2 1058 0.7 3694 Northeast
## 3 5814 1.1 4755 Northeast
## 4 812 0.7 4281 Northeast
## 5 7333 1.1 5237 Northeast
## 6 18076 1.4 4903 Northeast
## 7 11860 1.0 4449 Northeast
## 8 931 1.3 4558 Northeast
## 9 472 0.6 3907 Northeast
If you want to create a new column, you use the mutate() command. The ifelse() command lets you use a condition to assign different values depending on whether the condition is met.
ds2 %>%
mutate(Testing = 'yes',
Rich = ifelse(Income > 5000, 'Yes', 'No')) %>%
select(StateName, Population, Income, Rich, Testing)
## # A tibble: 50 x 5
## StateName Population Income Rich Testing
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Alabama 3615 3624 No yes
## 2 Alaska 365 6315 Yes yes
## 3 Arizona 2212 4530 No yes
## 4 Arkansas 2110 3378 No yes
## 5 California 21198 5114 Yes yes
## 6 Colorado 2541 4884 No yes
## 7 Connecticut 3100 5348 Yes yes
## 8 Delaware 579 4809 No yes
## 9 Florida 8277 4815 No yes
## 10 Georgia 4931 4091 No yes
## # ... with 40 more rows
However, it’s fairly common that you need to do some data janitorial work by cleaning up an existing column. For example, in a dataset with a ‘Sex’ variable, some values may have data entry errors in spelling, such as ‘fmale’ when it should be ‘female’. This needs to be fixed and can be done fairly easily in R. So let’s pretend that all words starting with ‘S’ in the Region column should actually start with ‘J’, and that all words ending in ‘t’ should have that ‘t’ removed. For this, we will use the gsub() command within mutate(), which will globally substitute a pattern with the replacement.
ds2 %>%
select(Region) %>%
mutate(
Region = gsub(pattern = '^S', replacement = 'J', Region),
Region = gsub(pattern = 't$', replacement = '', Region)
)
## # A tibble: 50 x 1
## Region
## <chr>
## 1 Jouth
## 2 Wes
## 3 Wes
## 4 Jouth
## 5 Wes
## 6 Wes
## 7 Northeas
## 8 Jouth
## 9 Jouth
## 10 Jouth
## # ... with 40 more rows
Notice the ^ and $ characters. Those are special syntax symbols used in regexp commands. We introduced them above, but we’ll quickly go over them again. These special symbols perform certain functions: in this case ^S means all ‘S’ characters at the start of the string/character, while t$ means all ‘t’ characters at the end of a string. Or let’s say that ‘Northeast’ should actually be ‘NE’. We can do some cleaning fairly easily here.
ds2 %>%
mutate(Region = gsub('Northeast', 'NE', Region, ignore.case = TRUE)) %>%
select(Region)
## # A tibble: 50 x 1
## Region
## <chr>
## 1 South
## 2 West
## 3 West
## 4 South
## 5 West
## 6 West
## 7 NE
## 8 South
## 9 South
## 10 South
## # ... with 40 more rows
Regular expressions are incredibly powerful, but also can be confusing. Make sure to check out our resources page for links to sites that explain regexp in more detail.
We can start chaining these commands together using the %>% pipe command. There is no limit to how long a chain can be. Chaining commands together using the pipe makes your code easier to read, lets you type out your code faster, and makes it easier to go from thinking of an analysis to actually conducting it. arrange() sorts/orders/re-arranges the column Population in ascending order.
ds2 %>%
filter(Illiteracy > 2) %>%
select(State = StateName, ## This renames the variable, just like the rename() command
Population, Area, Frost) %>%
arrange(Population) %>%
mutate(Cold = ifelse(Frost > 100, 'Yes', 'No'))
## # A tibble: 6 x 5
## State Population Area Frost Cold
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 New Mexico 1144 121412 120 Yes
## 2 Mississippi 2341 47296 50 No
## 3 South Carolina 2816 30225 65 No
## 4 Alabama 3615 50708 20 No
## 5 Louisiana 3806 44930 12 No
## 6 Texas 12237 262134 35 No
tidyr

To get the data into a nicer and more analyzable format, you can use the tidyr package. See what gather() does in the code below. Then see what spread() does. Note that you can remove a column by putting a minus - sign in front of a variable when you use select().
## Compare this:
ds2 %>%
select(-Division, -Region, -matches('itude$'))
## # A tibble: 50 x 9
## StateName Population Income Illiteracy LifeExp Murder HSGrad Frost
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## 7 Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## 8 Delaware 579 4809 0.9 70.06 6.2 54.6 103
## 9 Florida 8277 4815 1.3 70.66 10.7 52.6 11
## 10 Georgia 4931 4091 2.0 68.54 13.9 40.6 60
## # ... with 40 more rows, and 1 more variables: Area <dbl>
## With this:
ds2 %>%
select(-Division, -Region, -matches('itude$')) %>%
gather(Measure, Value, -StateName)
## # A tibble: 400 x 3
## StateName Measure Value
## <chr> <chr> <dbl>
## 1 Alabama Population 3615
## 2 Alaska Population 365
## 3 Arizona Population 2212
## 4 Arkansas Population 2110
## 5 California Population 21198
## 6 Colorado Population 2541
## 7 Connecticut Population 3100
## 8 Delaware Population 579
## 9 Florida Population 8277
## 10 Georgia Population 4931
## # ... with 390 more rows
## And back again:
ds2 %>%
select(-Division, -Region, -matches('itude$')) %>%
gather(Measure, Value, -StateName) %>%
spread(Measure, Value)
## # A tibble: 50 x 9
## StateName Area Frost HSGrad Illiteracy Income LifeExp Murder
## * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 50708 20 41.3 2.1 3624 69.05 15.1
## 2 Alaska 566432 152 66.7 1.5 6315 69.31 11.3
## 3 Arizona 113417 15 58.1 1.8 4530 70.55 7.8
## 4 Arkansas 51945 65 39.9 1.9 3378 70.66 10.1
## 5 California 156361 20 62.6 1.1 5114 71.71 10.3
## 6 Colorado 103766 166 63.9 0.7 4884 72.06 6.8
## 7 Connecticut 4862 139 56.0 1.1 5348 72.48 3.1
## 8 Delaware 1982 103 54.6 0.9 4809 70.06 6.2
## 9 Florida 54090 11 52.6 1.3 4815 70.66 10.7
## 10 Georgia 58073 60 40.6 2.0 4091 68.54 13.9
## # ... with 40 more rows, and 1 more variables: Population <dbl>
Combined with dplyr’s group_by() and summarise(), you can quickly summarise data or do further, more complicated analyses. group_by() makes subsequent operations or analyses work on each group separately. summarise() transforms the data to contain only the newly created variable(s), in this case the mean and other summary statistics, as well as the grouping variable.
ds2 %>%
select(-Division, -Region, -StateName) %>%
gather(Measure, Value) %>%
group_by(Measure) %>%
summarise(Mean = mean(Value),
SD = sd(Value),
Median = median(Value),
SampleSize = n())
## # A tibble: 10 x 5
## Measure Mean SD Median SampleSize
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Area 70735.88000 8.532730e+04 54277.00000 50
## 2 Frost 104.46000 5.198085e+01 114.50000 50
## 3 HSGrad 53.10800 8.076998e+00 53.25000 50
## 4 Illiteracy 1.17000 6.095331e-01 0.95000 50
## 5 Income 4435.80000 6.144699e+02 4519.00000 50
## 6 Latitude 39.41074 4.976349e+00 39.62075 50
## 7 LifeExp 70.87860 1.342394e+00 70.67500 50
## 8 Longitude -92.46414 1.618079e+01 -89.90030 50
## 9 Murder 7.37800 3.691540e+00 6.85000 50
## 10 Population 4246.42000 4.464491e+03 2838.50000 50
You can do some really powerful things with dplyr and tidyr functions. For example, you can run each combination of independent and dependent variables in a linear regression (lm()) using gather() and the dplyr do() command, rather than running each one individually. To make the output from lm() easier to combine and understand, we use the tidy() command from the broom package (you can also call it as broom::tidy(); the :: tells R we want the tidy() function from the broom package). If you want more details on how to use this setup, check out my blog post about it.
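As a quick aside on the :: notation, here is a minimal sketch of the two equivalent ways to call tidy(), assuming the broom package is installed (the model on the built-in mtcars dataset is just for illustration):

```r
# Attach the package, then call the function by name
library(broom)
tidy(lm(mpg ~ wt, data = mtcars))

# Or call it directly without attaching the package
broom::tidy(lm(mpg ~ wt, data = mtcars))
```

The :: form is handy in scripts where you only need one function from a package and want to make its origin explicit.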
library(broom)
# Using the swiss practice dataset.
swiss %>%
gather(Indep, Xvalue, Fertility, Agriculture) %>%
gather(Dep, Yvalue, Education, Catholic) %>%
group_by(Dep, Indep) %>%
do(tidy(lm(Yvalue ~ Xvalue + Infant.Mortality + Examination, data = .)))
## # A tibble: 16 x 7
## # Groups: Dep, Indep [4]
## Dep Indep term estimate std.error
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Catholic Agriculture (Intercept) 48.62020731 51.22779001
## 2 Catholic Agriculture Xvalue 0.08447534 0.31852283
## 3 Catholic Agriculture Infant.Mortality 1.69138478 1.81767192
## 4 Catholic Agriculture Examination -2.75852964 0.91102265
## 5 Catholic Fertility (Intercept) 35.13701855 51.27597790
## 6 Catholic Fertility Xvalue 0.39943317 0.60419739
## 7 Catholic Fertility Infant.Mortality 1.00336670 1.99122191
## 8 Catholic Fertility Examination -2.54831847 0.86570652
## 9 Education Agriculture (Intercept) 12.75076261 9.86830392
## 10 Education Agriculture Xvalue -0.13540736 0.06135889
## 11 Education Agriculture Infant.Mortality -0.21468852 0.35014860
## 12 Education Agriculture Examination 0.56818921 0.17549554
## 13 Education Fertility (Intercept) 17.62086823 9.48591843
## 14 Education Fertility Xvalue -0.34172055 0.11177490
## 15 Education Fertility Infant.Mortality 0.44332139 0.36837072
## 16 Education Fertility Examination 0.51463767 0.16015339
## # ... with 2 more variables: statistic <dbl>, p.value <dbl>