Data Wrangling
with dplyr and tidyr
Cheat Sheet
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com
Syntax - Helpful conventions for wrangling
dplyr::tbl_df(iris)
Converts data to tbl class. tbl’s are easier to examine than
data frames. R displays only the data that fits onscreen:
dplyr::glimpse(iris)
Information dense summary of tbl data.
utils::View(iris)
View data set in spreadsheet-like display (note capital V).
Source: local data frame [150 x 5]
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
.. ... ... ...
Variables not shown: Petal.Width (dbl),
Species (fctr)
dplyr::%>%
Passes object on le hand side as first argument (or .
argument) of function on righthand side.
"Piping" with %>% makes code more readable, e.g.
iris %>%
group_by(Species) %>%
summarise(avg = mean(Sepal.Width)) %>%
arrange(avg)
x %>% f(y) is the same as f(x, y)
y %>% f(x, ., z) is the same as f(x, y, z )
Reshaping Data - Change the layout of a data set
Subset Observations (Rows)
Subset Variables (Columns)
Each variable is saved
in its own column
Each observation is
saved in its own row
In a tidy
data set:
&
Tidy Data - A foundation for wrangling in R
Tidy data complements R’s vectorized
operations. R will automatically preserve
observations as you manipulate variables.
No other format works as intuitively with R.
M * A
*
tidyr::gather(cases, "year", "n", 2:4)
Gather columns into rows.
tidyr::unite(data, col, ..., sep)
Unite several columns into one.
dplyr::data_frame(a = 1:3, b = 4:6)
Combine vectors into data frame
(optimized).
dplyr::arrange(mtcars, mpg)
Order rows by values of a column
(low to high).
dplyr::arrange(mtcars, desc(mpg))
Order rows by values of a column
(high to low).
dplyr::rename(tb, y = year)
Rename the columns of a data
frame.
tidyr::spread(pollution, size, amount)
Spread rows into columns.
tidyr::separate(storms, date, c("y", "m", "d"))
Separate one column into several.
dplyr::filter(iris, Sepal.Length > 7)
Extract rows that meet logical criteria.
dplyr::distinct(iris)
Remove duplicate rows.
dplyr::sample_frac(iris, 0.5, replace = TRUE)
Randomly select fraction of rows.
dplyr::sample_n(iris, 10, replace = TRUE)
Randomly select n rows.
dplyr::slice(iris, 10:15)
Select rows by position.
dplyr::top_n(storms, 2, date)
Select and order top n entries (by group if grouped data).
Logic in R - ?Comparison, ?base::Logic
dplyr::select(iris, Sepal.Width, Petal.Length, Species)
Select columns by name or helper function.
Helper functions for select - ?select
select(iris, contains("." ))
Select columns whose name contains a character string.
select(iris, ends_with("Length"))
Select columns whose name ends with a character string.
select(iris, everything())
Select every column.
select(iris, matches(". t ." ))
Select columns whose name matches a regular expression.
select(iris, num_range("x", 1:5))
Select columns named x1, x2, x3, x4, x5.
select(iris, one_of(c("Species", "Genus")))
Select columns whose names are in a group of names.
select(iris, starts_with("Sepal"))
Select columns whose name starts with a character string.
select(iris, Sepal.Length:Petal.Width)
Select all columns between Sepal.Length and Petal.Width (inclusive).
select(iris, -Species)
Select all columns except Species.
Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/15
devtools::install_github("rstudio/EDAWR") for data sets