Package 'tidyr'
March 3, 2021
Title Tidy Messy Data
Version 1.1.3
Description Tools to help to create tidy data, where each
column is a variable, each row is an observation, and each cell
contains a single value. 'tidyr' contains tools for changing the
shape (pivoting) and hierarchy (nesting and 'unnesting') of a dataset,
turning deeply nested lists into rectangular data frames
('rectangling'), and extracting values out of string columns. It also
includes tools for working with missing values (both implicit and
explicit).
License MIT + file LICENSE
URL https://tidyr.tidyverse.org, https://github.com/tidyverse/tidyr
BugReports https://github.com/tidyverse/tidyr/issues
Depends R (>= 3.1)
Imports dplyr (>= 0.8.2), ellipsis (>= 0.1.0), glue, lifecycle,
magrittr, purrr, rlang, tibble (>= 2.1.1), tidyselect (>=
1.1.0), utils, vctrs (>= 0.3.6)
Suggests covr, data.table, jsonlite, knitr, readr, repurrrsive (>=
1.0.0), rmarkdown, testthat (>= 3.0.0)
LinkingTo cpp11 (>= 0.2.6)
VignetteBuilder knitr
Encoding UTF-8
LazyData true
RoxygenNote 7.1.1
SystemRequirements C++11
Config/testthat/edition 3
NeedsCompilation yes
Author Hadley Wickham [aut, cre],
RStudio [cph]
Maintainer Hadley Wickham <[email protected]>
Repository CRAN
Date/Publication 2021-03-03 09:20:06 UTC
R topics documented:
billboard
chop
complete
construction
drop_na
expand
expand_grid
extract
fill
fish_encounters
full_seq
gather
hoist
nest
nest_legacy
pack
pivot_longer
pivot_wider
relig_income
replace_na
separate
separate_rows
smiths
spread
table1
uncount
unite
us_rent_income
who
world_bank_pop
Index
billboard Song rankings for Billboard top 100 in the year 2000
Description
Song rankings for Billboard top 100 in the year 2000
Usage
billboard
Format
A dataset with variables:
artist Artist name
track Song name
date.enter Date the song entered the top 100
wk1 – wk76 Rank of the song in each week after it entered
Source
The "Whitburn" project, https://waxy.org/2008/05/the_whitburn_project/, (downloaded April
2008)
chop Chop and unchop
Description
[Maturing]
Chopping and unchopping preserve the width of a data frame, changing its length. chop() makes
df shorter by converting rows within each group into list-columns. unchop() makes df longer
by expanding list-columns so that each element of the list-column gets its own row in the
output. chop() and unchop() are building blocks for more complicated functions (like unnest(),
unnest_longer(), and unnest_wider()) and are generally more suitable for programming than
interactive data analysis.
Usage
chop(data, cols)
unchop(data, cols, keep_empty = FALSE, ptype = NULL)
Arguments
data A data frame.
cols <tidy-select> Columns to chop or unchop (automatically quoted).
For unchop(), each column should be a list-column containing generalised
vectors (e.g. any mix of NULLs, atomic vectors, S3 vectors, lists, or data frames).
keep_empty By default, you get one row of output for each element of the list you are
unchopping/unnesting. This means that if there’s a size-0 element (like NULL or an
empty data frame), that entire row will be dropped from the output. If you want
to preserve all rows, use keep_empty = TRUE to replace size-0 elements with a
single row of missing values.
ptype Optionally, supply a data frame prototype for the output cols, overriding the
default that will be guessed from the combination of individual values.
Details
Generally, unchopping is more useful than chopping because it simplifies a complex data structure,
and nest()ing is usually more appropriate than chop()ing since it better preserves the connections
between observations.
chop() creates list-columns of class vctrs::list_of() to ensure consistent behaviour when the
chopped data frame is emptied. For instance, this helps recover the original column types
after a chop() and unchop() round trip. Because <list_of> keeps track of the type of its elements,
unchop() is able to reconstitute the correct vector type even for empty list-columns.
Examples
# Chop ==============================================================
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
# Note that we get one row of output for each unique combination of
# non-chopped variables
df %>% chop(c(y, z))
# cf nest
df %>% nest(data = c(y, z))
# Unchop ============================================================
df <- tibble(x = 1:4, y = list(integer(), 1L, 1:2, 1:3))
df %>% unchop(y)
df %>% unchop(y, keep_empty = TRUE)
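The round trip described in Details can be sketched as follows. This is a minimal illustration rather than part of the package's own examples; the data and the object names (df, chopped, restored, with_empty, dropped, kept) are made up, and it assumes tidyr is attached.

```r
library(tidyr)

# Hypothetical data: three rows that fall into two groups of x
df <- tibble(x = c(1, 1, 2), y = 1:3)

# chop() collapses y within each value of x into a list-column
chopped <- df %>% chop(y)

# unchop() expands the list-column again, one row per element
restored <- chopped %>% unchop(y)

# Size-0 elements are dropped unless keep_empty = TRUE
with_empty <- tibble(x = 1:2, y = list(integer(), 1L))
dropped <- with_empty %>% unchop(y)
kept <- with_empty %>% unchop(y, keep_empty = TRUE)
```

Here chopped has one row per distinct x, and kept retains the row whose list element was empty, with a missing value in y.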
complete Complete a data frame with missing combinations of data
Description
Turns implicit missing values into explicit missing values. This is a wrapper around expand(),
dplyr::left_join() and replace_na() that’s useful for completing missing combinations of
data.
Usage
Arguments
Details
If you supply fill, these values will also replace existing explicit missing values in the data set.
Examples
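The note in Details about fill can be illustrated with a small sketch. The data frame and the object names (df, completed, filled) are hypothetical; fill is assumed to be supplied as a named list giving one replacement value per column.

```r
library(tidyr)

# Hypothetical data: the (1, "b") and (2, "a") combinations are implicitly
# missing, and the value for (2, "b") is an explicit NA
df <- tibble(
  group = c(1, 2),
  item = c("a", "b"),
  value = c(10, NA)
)

# Without fill, the new combinations get NA and the existing NA is kept
completed <- df %>% complete(group, item)

# With fill, the new rows and the existing explicit NA all become 0
filled <- df %>% complete(group, item, fill = list(value = 0))
```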
construction Completed construction in the US in 2018
Description
Usage
construction
Format
Source
drop_na Drop rows containing missing values
Description
Usage
drop_na(data, ...)
Arguments
Examples
library(dplyr)
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% drop_na()
df %>% drop_na(x)
expand Expand data frame to include all possible combinations of values
Description
expand() generates all combinations of variables found in a dataset. It is paired with nesting()
and crossing() helpers. crossing() is a wrapper around expand_grid() that de-duplicates and
sorts its inputs; nesting() is a helper that only finds combinations already present in the data.
expand() is often useful in conjunction with joins:
• use it with right_join() to convert implicit missing values to explicit missing values (e.g.,
fill in gaps in your data frame).
• use it with anti_join() to figure out which combinations are missing (e.g., identify gaps in
your data frame).
Usage
expand(data, ..., .name_repair = "check_unique")
Arguments
data A data frame.
... Specification of columns to expand. Columns can be atomic vectors or lists.
• To find all unique combinations of x, y and z, including those not present in
the data, supply each variable as a separate argument: expand(df, x, y, z).
• To find only the combinations that occur in the data, use nesting: expand(df, nesting(x, y, z)).
• You can combine the two forms. For example, expand(df, nesting(school_id, student_id), date)
would produce a row for each present school-student combination for all
possible dates.
When used with factors, expand() uses the full set of levels, not just those that
appear in the data. If you want to use only the values seen in the data, use
forcats::fct_drop().
When used with continuous variables, you may need to fill in values that do not
appear in the data: to do so use expressions like year = 2010:2020 or year =
full_seq(year, 1).
.name_repair Treatment of problematic column names:
• "minimal": No name repair or checks, beyond basic existence,
• "unique": Make sure names are unique and not empty,
• "check_unique": (default value), no name repair, but check they are unique,
• "universal": Make the names unique and syntactic
• a function: apply custom name repair (e.g., .name_repair = make.names
for names in the style of base R).
• A purrr-style anonymous function, see rlang::as_function()
This argument is passed on as repair to vctrs::vec_as_names(). See there
for more details on these terms and the strategies used to enforce them.
See Also
complete() to expand list objects. expand_grid() to input vectors rather than a data frame.
Examples
fruits <- tibble(
type = c("apple", "orange", "apple", "orange", "orange", "orange"),
year = c(2010, 2010, 2012, 2010, 2010, 2012),
size = factor(
c("XS", "S", "M", "S", "S", "M"),
levels = c("XS", "S", "M", "L")
),
weights = rnorm(6, as.numeric(size) + 2)
)
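As a rough sketch of how expand() combines with joins, the following uses a smaller, made-up table than the fruits example; the object names (fruit_weights, all_combos, present, gaps, explicit) are illustrative only, and dplyr is assumed to be installed for the joins.

```r
library(tidyr)
library(dplyr)

# Hypothetical data: there is no row for orange in 2012
fruit_weights <- tibble(
  type = c("apple", "orange", "apple"),
  year = c(2010, 2010, 2012),
  weight = c(1.2, 0.9, 1.4)
)

# Every combination of type and year, whether or not it occurs in the data
all_combos <- fruit_weights %>% expand(type, year)

# Only the combinations that occur in the data
present <- fruit_weights %>% expand(nesting(type, year))

# anti_join(): which combinations are missing from the data?
gaps <- all_combos %>% anti_join(fruit_weights, by = c("type", "year"))

# right_join(): turn the implicit missing combination into an explicit NA
explicit <- fruit_weights %>% right_join(all_combos, by = c("type", "year"))
```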
expand_grid Create a tibble from all combinations of inputs
Description
expand_grid() is heavily motivated by expand.grid(). Compared to expand.grid(), it:
• Produces sorted output (by varying the first column the slowest, rather than the fastest).
• Returns a tibble, not a data frame.
• Never converts strings to factors.
• Does not add any additional attributes.
• Can expand any generalised vector, including data frames.
Usage
expand_grid(..., .name_repair = "check_unique")
Arguments
... Name-value pairs. The name will become the column name in the output.
.name_repair Treatment of problematic column names:
• "minimal": No name repair or checks, beyond basic existence,
• "unique": Make sure names are unique and not empty,
• "check_unique": (default value), no name repair, but check they are unique,
• "universal": Make the names unique and syntactic
• a function: apply custom name repair (e.g., .name_repair = make.names
for names in the style of base R).
• A purrr-style anonymous function, see rlang::as_function()
This argument is passed on as repair to vctrs::vec_as_names(). See there
for more details on these terms and the strategies used to enforce them.
Value
A tibble with one column for each input in .... The output will have one row for each combination
of the inputs, i.e. the size will be equal to the product of the sizes of the inputs. This implies that if any
input has length 0, the output will have zero rows.
Examples
expand_grid(x = 1:3, y = 1:2)
expand_grid(l1 = letters, l2 = LETTERS)
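The size and ordering rules stated above can be checked with a short sketch. The object names (grid, empty) are made up for this illustration.

```r
library(tidyr)

# 3 x 2 inputs give 6 rows; the first column varies slowest
grid <- expand_grid(x = 1:3, y = c("a", "b"))

# A zero-length input makes the product of sizes zero, so no rows result
empty <- expand_grid(x = 1:3, y = character())
```

In grid, x runs 1, 1, 2, 2, 3, 3 while y alternates between "a" and "b", and y stays a character column rather than becoming a factor.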
extract Extract a character column into multiple columns using regular
expression groups
Description
Given a regular expression with capturing groups, extract() turns each group into a new column.
If the groups don’t match, or the input is NA, the output will be NA.
Usage
extract(
data,
col,
into,
regex = "([[:alnum:]]+)",
remove = TRUE,
convert = FALSE,
...
)
Arguments
data A data frame.
col Column name or position. This is passed to tidyselect::vars_pull().
This argument is passed by expression and supports quasiquotation (you can
unquote column names or column positions).
into Names of new variables to create as character vector. Use NA to omit the variable
in the output.
regex A regular expression used to extract the desired values. There should be one
group (defined by ()) for each element of into.
remove If TRUE, remove input column from output data frame.
convert If TRUE, will run type.convert() with as.is = TRUE on new columns. This is
useful if the component columns are integer, numeric or logical.
NB: this will cause string "NA"s to be converted to NAs.
... Additional arguments passed on to methods.
See Also
separate() to split up by a separator.
Examples
df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
df %>% extract(x, "A")
df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)")
# If no match, NA:
df %>% extract(x, c("A", "B"), "([a-d]+)-([a-d]+)")
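The effect of convert = TRUE can be sketched with a made-up column; the data and the names (codes, out, key, num) are hypothetical. The call is written as tidyr::extract() here only to avoid any clash with magrittr::extract() when both packages are attached.

```r
library(tidyr)

# Hypothetical data: one value does not match the pattern, one is NA
codes <- tibble(x = c("a-1", "b-22", NA, "c"))

out <- codes %>%
  tidyr::extract(
    x,
    into = c("key", "num"),
    regex = "([a-z]+)-([0-9]+)",
    convert = TRUE  # runs type.convert() on the new columns
  )
```

With convert = TRUE the num column comes back as an integer vector; rows where the input is NA or the pattern does not match are NA in both new columns.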
fill Fill in missing values with previous or next value
Description
Fills missing values in selected columns using the next or previous entry. This is useful in the
common output format where values are not repeated, and are only recorded when they change.
Usage
fill(data, ..., .direction = c("down", "up", "downup", "updown"))
Arguments
data A data frame.
... <tidy-select> Columns to fill.
.direction Direction in which to fill missing values. Currently either "down" (the default),
"up", "downup" (i.e. first down and then up) or "updown" (first up and then
down).
Details
Missing values are replaced in atomic vectors; NULLs are replaced in lists.
Examples
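# Value (year) is recorded only when it changes
sales <- tibble::tribble(
  ~quarter, ~year, ~sales,
  "Q1", 2000, 66013,
  "Q2", NA, 69182,
  "Q3", NA, 53175,
  "Q4", NA, 21001,
  "Q1", 2001, 46036,
  "Q2", NA, 58842,
  "Q3", NA, 44568
)
# `fill()` defaults to replacing missing data from top to bottom
sales %>% fill(year)
# For values that are missing above you can use `.direction = "up"`
# (tidy_pets is a small illustrative table, assumed here so the call runs)
tidy_pets <- tibble::tribble(
  ~rank, ~pet_type, ~breed,
  1L, NA, "Boston Terrier",
  2L, "Dog", "Poodles",
  1L, NA, "Persian",
  2L, "Cat", "Ragdoll"
)
tidy_pets %>%
  fill(pet_type, .direction = "up")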
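To compare the .direction options side by side, here is a minimal sketch on a made-up column; the names (df, down, up, downup) are illustrative.

```r
library(tidyr)

# Hypothetical data with gaps before, between, and after recorded values
df <- tibble(
  row = 1:6,
  group = c(NA, "a", NA, NA, "b", NA)
)

down <- df %>% fill(group)                            # leading NA remains
up <- df %>% fill(group, .direction = "up")           # trailing NA remains
downup <- df %>% fill(group, .direction = "downup")   # no NA remains
```

Filling down leaves the first row missing because nothing precedes it; "downup" then fills that remaining gap from below.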
fish_encounters Fish encounters
Description
Information about fish swimming down a river: each station represents an autonomous monitor that
records if a tagged fish was seen at that location. Fish travel in one direction (migrating
downstream). Information about misses is just as important as hits, but is not directly recorded in this
form of the data.
Usage
fish_encounters
Format
A dataset with variables:
Source
Dataset provided by Myfanwy Johnston; more details at https://fishsciences.github.io/
post/visualizing-fish-encounter-histories/
full_seq Create the full sequence of values in a vector
Description
This is useful if you want to fill in missing values that should have been observed but weren’t. For
example, full_seq(c(1, 2, 4, 6), 1) will return 1:6.
Usage
full_seq(x, period, tol = 1e-06)
Arguments
x A numeric vector.
period Gap between each observation. The existing data will be checked to ensure that
it is actually of this periodicity.
tol Numerical tolerance for checking periodicity.
Examples
full_seq(c(1, 2, 4, 5, 10), 1)
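A short sketch of the periodicity check and of a typical use inside complete() follows. The data and names (observed, full, bad, yearly, completed) are made up, and the string returned by the error handler is a placeholder, not the package's own message.

```r
library(tidyr)

observed <- c(1, 2, 4, 5, 10)
full <- full_seq(observed, 1)

# Data that does not fit the stated period raises an error
bad <- tryCatch(
  full_seq(c(1, 2.5), 1),
  error = function(e) "error raised"
)

# Typical use: fill in years that should have been observed
yearly <- tibble(year = c(2010, 2012, 2013), n = c(5, 7, 9))
completed <- yearly %>% complete(year = full_seq(year, 1))
```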
gather Gather columns into key-value pairs
Description
[Superseded]
Development on gather() is complete, and for new code we recommend switching to pivot_longer(),
which is easier to use, more featureful, and still under active development.
df %>% gather("key", "value", x, y, z) is equivalent to
df %>% pivot_longer(c(x, y, z), names_to = "key", values_to = "value").
See more details in vignette("pivot").
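The correspondence between the two calls can be sketched on a made-up data frame. The names (df, gathered, pivoted and the sorted copies) are illustrative, and dplyr is assumed to be installed for arrange(). The two results contain the same rows but in a different order, so both are sorted before comparing.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data
df <- tibble(id = 1:2, x = c(1, 2), y = c(3, 4), z = c(5, 6))

gathered <- df %>% gather("key", "value", x, y, z)
pivoted <- df %>%
  pivot_longer(c(x, y, z), names_to = "key", values_to = "value")

# gather() stacks column by column; pivot_longer() goes row by row
gathered_sorted <- gathered %>% arrange(id, key)
pivoted_sorted <- pivoted %>% arrange(id, key)
```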
Usage
gather(
data,
key = "key",
value = "value",
...,
na.rm = FALSE,
convert = FALSE,
factor_key = FALSE
)
Arguments
data A data frame.
key, value Names of new key and value columns, as strings or symbols.
This argument is passed by expression and supports quasiquotation (you can
unquote strings and symbols). The name is captured from the expression with
rlang::ensym() (note that this kind of interface where symbols do not represent
actual objects is now discouraged in the tidyverse; we support it here for
backward compatibility).
... A selection of columns. If empty, all variables are selected. You can supply bare
variable names, select all variables between x and z with x:z, exclude y with
-y. For more options, see the dplyr::select() documentation. See also the
section on selection rules below.
na.rm If TRUE, will remove rows from output where the value column is NA.
convert If TRUE will automatically run type.convert() on the key column. This is
useful if the column types are actually numeric, integer, or logical.
factor_key If FALSE, the default, the key values will be stored as a character vector. If TRUE,
will be stored as a factor, which preserves the original ordering of the columns.
Rules for selection
Arguments for selecting columns are passed to tidyselect::vars_select() and are treated specially.
Unlike other verbs, selecting functions make a strict distinction between data expressions and
context expressions.
• A data expression is either a bare name like x or an expression like x:y or c(x, y). In a data
expression, you can only refer to columns from the data frame.
• Everything else is a context expression in which you can only refer to objects that you have
defined with <-.
For instance, col1:col3 is a data expression that refers to data columns, while seq(start, end) is
a context expression that refers to objects from the context.
If you need to refer to contextual objects from a data expression, you can use all_of() or any_of().
These functions are used to select data-variables whose names are stored in an env-variable. For
instance, all_of(a) selects the variables listed in the character vector a. For more details, see the
tidyselect::select_helpers() documentation.
Examples
library(dplyr)
# From https://stackoverflow.com/questions/1181060
stocks <- tibble(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
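stocks %>% gather("stock", "price", -time)
# First observation for each Species in the iris data
mini_iris <- iris %>%
  group_by(Species) %>%
  slice(1)
mini_iris %>% gather(key = "flower_att", value = "measurement", -Species)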
hoist Rectangle a nested list into a tidy tibble
Description
hoist(), unnest_longer(), and unnest_wider() provide tools for rectangling, collapsing deeply
nested lists into regular columns. hoist() allows you to selectively pull components of a
list-column out into their own top-level columns, using the same syntax as purrr::pluck(). unnest_wider()
turns each element of a list-column into a column, and unnest_longer() turns each element of
a list-column into a row. unnest_auto() picks between unnest_wider() or unnest_longer()
based on heuristics described below.
Learn more in vignette("rectangle").
Usage
hoist(
.data,
.col,
...,
.remove = TRUE,
.simplify = TRUE,
.ptype = list(),
.transform = list()
)
unnest_longer(
data,
col,
values_to = NULL,
indices_to = NULL,
indices_include = NULL,
names_repair = "check_unique",
simplify = TRUE,
ptype = list(),
transform = list()
)
unnest_wider(
data,
col,
names_sep = NULL,
simplify = TRUE,
names_repair = "check_unique",
ptype = list(),
transform = list()
)
unnest_auto(data, col)
Arguments
.data, data A data frame.
.col, col List-column to extract components from.
... Components of .col to turn into columns in the form col_name = "pluck_specification".
You can pluck by name with a character vector, by position with an integer
vector, or with a combination of the two with a list. See purrr::pluck() for
details.
The column names must be unique in a call to hoist(), although existing
columns with the same name will be overwritten. When plucking with a single
string you can choose to omit the name, i.e. hoist(df, col, "x") is short-hand
for hoist(df, col, x = "x").
.remove If TRUE, the default, will remove extracted components from .col. This ensures
that each value lives only in one place.
.simplify, simplify
If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector.
.ptype, ptype Optionally, a named list of prototypes declaring the desired output type of each
component. Use this argument if you want to check each element has the types
you expect when simplifying.
.transform, transform
Optionally, a named list of transformation functions applied to each component.
Use this argument if you want to transform or parse individual elements as they are
hoisted.
values_to Name of column to store vector values. Defaults to col.
indices_to A string giving the name of column which will contain the inner names or
position (if not named) of the values. Defaults to col with _id suffix.
indices_include
Add an index column? Defaults to TRUE when col has inner names.
names_repair Used to check that output data frame has valid names. Must be one of the
following options:
• "minimal": no name repair or checks, beyond basic existence,
• "unique": make sure names are unique and not empty,
• "check_unique": (the default), no name repair, but check they are unique,
• "universal": make the names unique and syntactic
• a function: apply custom name repair.
• tidyr_legacy: use the name repair from tidyr 0.8.
• a formula: a purrr-style anonymous function (see rlang::as_function())
See vctrs::vec_as_names() for more details on these terms and the strategies
used to enforce them.
names_sep If NULL, the default, the names will be left as is. If a string, the inner and outer
names will be pasted together using names_sep as a separator.
Unnest variants
The three unnest() functions differ in how they change the shape of the output data frame:
• unnest_wider() preserves the rows, but changes the columns.
• unnest_longer() preserves the columns, but changes the rows.
• unnest() can change both rows and columns.
These principles guide their behaviour when they are called with a non-primary data type. For
example, if you unnest_wider() a list of data frames, the number of rows must be preserved, so
each column is turned into a list column of length one. Or if you unnest_longer() a list of data
frames, the number of columns must be preserved so it creates a packed column. I’m not sure
if these behaviours are useful in practice, but they are theoretically pleasing.
unnest_auto() heuristics
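unnest_auto() inspects the inner names of the list-col:
• If all elements are unnamed, it uses unnest_longer().
• If all elements are named, and there’s at least one name in common across all components, it
uses unnest_wider().
• Otherwise, it falls back to unnest_longer(indices_include = TRUE).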
Examples
df <- tibble(
character = c("Toothless", "Dory"),
metadata = list(
list(
species = "dragon",
color = "black",
films = c(
"How to Train Your Dragon",
"How to Train Your Dragon 2",
"How to Train Your Dragon: The Hidden World"
)
),
list(
species = "blue tang",
color = "blue",
films = c("Finding Nemo", "Finding Dory")
)
)
)
df
df %>%
unnest_wider(metadata) %>%
unnest_longer(films)
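For comparison with the unnest_wider() pipeline, a hoist() call on similar data might look like the sketch below. The tibble is a shortened, made-up variant of the example data, and the names films_df, hoisted, and first_film are chosen for illustration.

```r
library(tidyr)

# Shortened variant of the example data (illustrative)
films_df <- tibble(
  character = c("Toothless", "Dory"),
  metadata = list(
    list(
      species = "dragon",
      color = "black",
      films = c("How to Train Your Dragon", "How to Train Your Dragon 2")
    ),
    list(
      species = "blue tang",
      color = "blue",
      films = c("Finding Nemo", "Finding Dory")
    )
  )
)

# Pluck one component by name, and one by name followed by position
hoisted <- films_df %>%
  hoist(metadata,
    species = "species",
    first_film = list("films", 1L)
  )
```

Only the requested components become top-level columns; anything not hoisted (here, color) stays in the metadata list-column.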
nest Nest and unnest
Description
Nesting creates a list-column of data frames; unnesting flattens it back out into regular columns.
Nesting is implicitly a summarising operation: you get one row for each group defined by the non-
nested columns. This is useful in conjunction with other summaries that work with whole datasets,
most notably models.
Learn more in vignette("nest").
Usage
nest(.data, ..., .names_sep = NULL, .key = deprecated())
unnest(
data,
cols,
...,
keep_empty = FALSE,
ptype = NULL,
names_sep = NULL,
names_repair = "check_unique",
.drop = deprecated(),
.id = deprecated(),
.sep = deprecated(),
.preserve = deprecated()
)
Arguments
.data A data frame.
... <tidy-select> Columns to nest, specified using name-variable pairs of the
form new_col = c(col1, col2, col3). The right hand side can be any valid tidy
select expression.
[Deprecated]: previously you could write df %>% nest(x, y, z) and df %>%
unnest(x, y, z). Convert to df %>% nest(data = c(x, y, z)) and df %>% unnest(c(x, y, z)).
If you previously created a new variable in unnest() you’ll now need to do it
explicitly with mutate(). Convert df %>% unnest(y = fun(x, y, z)) to df %>%
mutate(y = fun(x, y, z)) %>% unnest(y).
.key [Deprecated]: No longer needed because of the new new_col = c(col1, col2, col3)
syntax.
data A data frame.
cols <tidy-select> Columns to unnest.
If you unnest() multiple columns, parallel entries must be of compatible sizes,
i.e. they’re either equal or length 1 (following the standard tidyverse recycling
rules).
keep_empty By default, you get one row of output for each element of the list you are
unchopping/unnesting. This means that if there’s a size-0 element (like NULL or an
empty data frame), that entire row will be dropped from the output. If you want
to preserve all rows, use keep_empty = TRUE to replace size-0 elements with a
single row of missing values.
ptype Optionally, supply a data frame prototype for the output cols, overriding the
default that will be guessed from the combination of individual values.
names_sep, .names_sep
If NULL, the default, the names will be left as is. In nest(), inner names will
come from the former outer names; in unnest(), the new outer names will come
from the inner names.
If a string, the inner and outer names will be used together. In nest(), the
names of the new outer columns will be formed by pasting together the outer
and the inner column names, separated by names_sep. In unnest(), the new
inner names will have the outer names (+ names_sep) automatically stripped.
This makes names_sep roughly symmetric between nesting and unnesting.
names_repair Used to check that output data frame has valid names. Must be one of the
following options:
• "minimal": no name repair or checks, beyond basic existence,
• "unique": make sure names are unique and not empty,
• "check_unique": (the default), no name repair, but check they are unique,
• "universal": make the names unique and syntactic
• a function: apply custom name repair.
• tidyr_legacy: use the name repair from tidyr 0.8.
• a formula: a purrr-style anonymous function (see rlang::as_function())
See vctrs::vec_as_names() for more details on these terms and the strategies
used to enforce them.
.drop, .preserve
[Deprecated]: all list-columns are now preserved. If there are any that you don’t
want in the output, use select() to remove them prior to unnesting.
.id [Deprecated]: convert df %>% unnest(x, .id = "id") to df %>% mutate(id = names(x)) %>% unnest(x).
.sep [Deprecated]: use names_sep instead.
New syntax
tidyr 1.0.0 introduced a new syntax for nest() and unnest() that’s designed to be more similar
to other functions. Converting to the new syntax should be straightforward (guided by the message
you’ll receive) but if you just need to run an old analysis, you can easily revert to the previous
behaviour using nest_legacy() and unnest_legacy() as follows:
library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy
Examples
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
# Note that we get one row of output for each unique combination of
# non-nested variables
df %>% nest(data = c(y, z))
# chop does something similar, but retains individual columns
df %>% chop(c(y, z))
# Nesting a grouped data frame nests all variables apart from the group vars
library(dplyr)
fish_encounters %>%
group_by(fish) %>%
nest()
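A minimal sketch of the nest() and unnest() round trip, together with the keep_empty argument, is given below. The data and the names (df, nested, restored, with_empty, dropped, kept) are made up for illustration.

```r
library(tidyr)

# Hypothetical data: three rows in two groups of x
df <- tibble(x = c(1, 1, 2), y = 1:3, z = 3:1)

# One row per value of x, with y and z stored in a list-column of tibbles
nested <- df %>% nest(data = c(y, z))

# unnest() flattens the list-column back into regular columns
restored <- nested %>% unnest(data)

# A size-0 element (here NULL) drops its row unless keep_empty = TRUE
with_empty <- tibble(x = 1:2, data = list(tibble(y = 1:2), NULL))
dropped <- with_empty %>% unnest(data)
kept <- with_empty %>% unnest(data, keep_empty = TRUE)
```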
nest_legacy Legacy versions of nest() and unnest()
Description
[Superseded]
tidyr 1.0.0 introduced a new syntax for nest() and unnest(). The majority of existing usage
should be automatically translated to the new syntax with a warning. However, if you need to
quickly roll back to the previous behaviour, these functions provide the previous interface. To make
old code work as is, add the following code to the top of your script:
library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy
Usage
nest_legacy(data, ..., .key = "data")
unnest_legacy(data, ..., .drop = NA, .id = NULL, .sep = NULL, .preserve = NULL)
Arguments
data A data frame.
... Specification of columns to unnest. Use bare variable names or functions of
variables. If omitted, defaults to all list-cols.
.key The name of the new column, as a string or symbol. This argument is passed
by expression and supports quasiquotation (you can unquote strings and
symbols). The name is captured from the expression with rlang::ensym() (note
that this kind of interface where symbols do not represent actual objects is now
discouraged in the tidyverse; we support it here for backward compatibility).
.drop Should additional list columns be dropped? By default, unnest() will drop
them if unnesting the specified columns requires the rows to be duplicated.
.id Data frame identifier - if supplied, will create a new column with name .id,
giving a unique identifier. This is most useful if the list column is named.
.sep If non-NULL, the names of unnested data frame columns will combine the name
of the original list-col with the names from the nested data frame, separated by
.sep.
.preserve Optionally, list-columns to preserve in the output. These will be duplicated
in the same way as atomic vectors. This has dplyr::select() semantics so
you can preserve multiple variables with .preserve = c(x, y) or .preserve =
starts_with("list").
Examples
# Nest and unnest are inverses
df <- data.frame(x = c(1, 1, 2), y = 3:1)
df %>% nest_legacy(y)
df %>% nest_legacy(y) %>% unnest_legacy()
# nesting -------------------------------------------------------------------
# unnesting -----------------------------------------------------------------
df <- tibble(
x = 1:2,
y = list(
tibble(z = 1),
tibble(z = 3:4)
)
)
df %>% unnest_legacy(y)
pack Pack and unpack
Description
[Maturing]
Packing and unpacking preserve the length of a data frame, changing its width. pack() makes df
narrow by collapsing a set of columns into a single df-column. unpack() makes data wider by
expanding df-columns back out into individual columns.
Usage
pack(.data, ..., .names_sep = NULL)
unpack(data, cols, names_sep = NULL, names_repair = "check_unique")
Arguments
... <tidy-select> Columns to pack, specified using name-variable pairs of the
form new_col = c(col1, col2, col3). The right hand side can be any valid tidy
select expression.
data, .data A data frame.
cols <tidy-select> Column to unpack.
names_sep, .names_sep
If NULL, the default, the names will be left as is. In pack(), inner names will
come from the former outer names; in unpack(), the new outer names will come
from the inner names.
If a string, the inner and outer names will be used together. In pack(), the
names of the new outer columns will be formed by pasting together the outer
and the inner column names, separated by names_sep. In unpack(), the new
inner names will have the outer names (+ names_sep) automatically stripped.
This makes names_sep roughly symmetric between packing and unpacking.
names_repair Used to check that output data frame has valid names. Must be one of the
following options:
• "minimal": no name repair or checks, beyond basic existence,
• "unique": make sure names are unique and not empty,
• "check_unique": (the default), no name repair, but check they are unique,
• "universal": make the names unique and syntactic
• a function: apply custom name repair.
• tidyr_legacy: use the name repair from tidyr 0.8.
• a formula: a purrr-style anonymous function (see rlang::as_function())
See vctrs::vec_as_names() for more details on these terms and the strategies
used to enforce them.
Details
Generally, unpacking is more useful than packing because it simplifies a complex data structure.
Currently, few functions work with df-cols, and they are mostly a curiosity, but seem worth
exploring further because they mimic the nested column headers that are so popular in Excel.
Examples
# Packing =============================================================
# It's not currently clear why you would ever want to pack columns
# since few functions work with this sort of data.
df <- tibble(x1 = 1:3, x2 = 4:6, x3 = 7:9, y = 1:3)
df
df %>% pack(x = starts_with("x"))
df %>% pack(x = c(x1, x2, x3), y = y)
# Unpacking ===========================================================
df <- tibble(
x = 1:3,
y = tibble(a = 1:3, b = 3:1),
z = tibble(X = c("a", "b", "c"), Y = runif(3), Z = c(TRUE, FALSE, NA))
)
df
df %>% unpack(y)
df %>% unpack(c(y, z))
df %>% unpack(c(y, z), names_sep = "_")
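Packing followed by unpacking can be sketched as a round trip. The data and the names (wide, packed, unpacked, prefixed) are made up for this illustration.

```r
library(tidyr)

# Hypothetical data with two related columns
wide <- tibble(x1 = 1:3, x2 = 4:6, y = 1:3)

# pack() collapses x1 and x2 into a single df-column called x
packed <- wide %>% pack(x = c(x1, x2))

# unpack() expands the df-column back into individual columns
unpacked <- packed %>% unpack(x)

# With names_sep, the outer name is prepended to each inner name
prefixed <- packed %>% unpack(x, names_sep = "_")
```

The number of rows is unchanged at every step; only the number of top-level columns changes.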
pivot_longer Pivot data from wide to long
Description
pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of
columns. The inverse transformation is pivot_wider().
Learn more in vignette("pivot").
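As a small sketch of "lengthening", the call below pivots two week columns of a made-up table into rows. The data and the names (wide, long, week, rank) are hypothetical, loosely modelled on the billboard layout; the arguments used are the ones listed under Usage.

```r
library(tidyr)

# Hypothetical wide data: one column per week
wide <- tibble(
  artist = c("A", "B"),
  wk1 = c(10, 20),
  wk2 = c(8, NA)
)

long <- wide %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",                         # strip "wk" from the names
    names_transform = list(week = as.integer),   # "1" becomes 1L
    values_to = "rank",
    values_drop_na = TRUE                        # drop the missing wk2 for B
  )
```

The result has one row per artist and week with a recorded rank, and week is an integer column rather than character.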
Usage
pivot_longer(
data,
cols,
names_to = "name",
names_prefix = NULL,
names_sep = NULL,
names_pattern = NULL,
names_ptypes = list(),
names_transform = list(),
names_repair = "check_unique",
values_to = "value",
values_drop_na = FALSE,
values_ptypes = list(),
values_transform = list(),
...
)
Arguments
data A data frame to pivot.
cols <tidy-select> Columns to pivot into longer format.
names_to A string specifying the name of the column to create from the data stored in the
column names of data.
Can be a character vector, creating multiple columns, if names_sep or names_pattern
is provided. In this case, there are two special values you can take advantage of:
• NA will discard that component of the name.
• .value indicates that component of the name defines the name of the column
containing the cell values, overriding values_to.
names_prefix A regular expression used to remove matching text from the start of each
variable name.
names_sep, names_pattern
If names_to contains multiple values, these arguments control how the column
name is broken up.
names_sep takes the same specification as separate(), and can either be a
numeric vector (specifying positions to break on), or a single string (specifying
a regular expression to split on).
names_pattern takes the same specification as extract(), a regular expression
containing matching groups (()).
If these arguments do not give you enough control, use pivot_longer_spec()
to create a spec object and process manually as needed.
names_ptypes, values_ptypes
A list of column name-prototype pairs. A prototype (or ptype for short) is a zero-length
vector (like integer() or numeric()) that defines the type, class, and
attributes of a vector. Use these arguments if you want to confirm that the created
columns are the types that you expect. Note that if you want to change (instead
of confirm) the types of specific columns, you should use names_transform or
values_transform instead.
names_transform, values_transform
A list of column name-function pairs. Use these arguments if you need to change
the types of specific columns. For example, names_transform = list(week =
as.integer) would convert a character variable called week to an integer.
If not specified, the type of the columns generated from names_to will be character,
and the type of the variables generated from values_to will be the common
type of the input columns used to generate them.
names_repair What happens if the output has invalid column names? The default, "check_unique",
is to error if the columns are duplicated. Use "minimal" to allow duplicates
in the output, or "unique" to de-duplicate by adding numeric suffixes. See
vctrs::vec_as_names() for more options.
values_to A string specifying the name of the column to create from the data stored in cell
values. If names_to is a character containing the special .value sentinel, this
value will be ignored, and the name of the value column will be derived from
part of the existing column names.
values_drop_na If TRUE, will drop rows that contain only NAs in the values_to column. This effectively
converts explicit missing values to implicit missing values, and should
generally be used only when missing values in data were created by its structure.
... Additional arguments passed on to methods.
Details
pivot_longer() is an updated approach to gather(), designed to be both simpler to use and to
handle more use cases. We recommend you use pivot_longer() for new code; gather() isn’t
going away but is no longer under active development.
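As a rough illustration of the relationship described above, the following sketch reshapes the same small tibble with both functions. The tibble wide and its column names are made up for this example and are not part of the package.

```r
library(tidyr)

# Hypothetical wide data: one column per measurement
wide <- tibble::tibble(id = 1:2, a = c(1, 2), b = c(3, 4))

# Older interface
old <- gather(wide, key = "name", value = "value", a, b)

# Newer interface; names_to and values_to default to "name" and "value"
new <- pivot_longer(wide, cols = c(a, b))

# Both results hold the same rows, but in a different order:
# gather() stacks column by column, pivot_longer() goes row by row.
old$value
new$value
```

If row order matters when migrating old code, sorting the result explicitly (for example with dplyr::arrange()) is one way to make the two outputs comparable.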
Examples
# See vignette("pivot") for examples and explanation
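The arguments described above can be sketched on a small made-up tibble. The data and column names (wide, height_2019, and so on) are assumptions chosen for illustration; the first call splits each column name into two new columns with names_sep, and the second uses the special ".value" component so that part of each name becomes an output column.

```r
library(tidyr)

# Hypothetical wide data: one column per measure and year
wide <- tibble::tibble(
  id = 1:2,
  height_2019 = c(150, 160),
  height_2020 = c(152, NA),
  weight_2019 = c(50, 60),
  weight_2020 = c(51, 61)
)

# Split each column name at "_" into a measure and a year,
# convert year to integer, and drop rows whose value is NA
long <- wide %>%
  pivot_longer(
    cols = -id,
    names_to = c("measure", "year"),
    names_sep = "_",
    names_transform = list(year = as.integer),
    values_to = "value",
    values_drop_na = TRUE
  )

# ".value" keeps one output column per measure (height, weight),
# so values_to is not needed here
by_year <- wide %>%
  pivot_longer(
    cols = -id,
    names_to = c(".value", "year"),
    names_sep = "_",
    names_transform = list(year = as.integer)
  )

long
by_year
```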
pivot_wider
Description
pivot_wider() "widens" data, increasing the number of columns and decreasing the number of
rows. The inverse transformation is pivot_longer().
Learn more in vignette("pivot").
Usage
pivot_wider(
  data,
  id_cols = NULL,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  ...
)
Arguments
data A data frame to pivot.
id_cols <tidy-select> A set of columns that uniquely identifies each observation. Defaults
to all columns in data except for the columns specified in names_from and
values_from. Typically used when you have redundant variables, i.e. variables
whose values are perfectly correlated with existing variables.
names_from, values_from
<tidy-select> A pair of arguments describing which column (or columns)
to get the name of the output column (names_from), and which column (or
columns) to get the cell values from (values_from).
If values_from contains multiple columns, the name of each value column will be
added to the front of the output column name.
names_prefix String added to the start of every variable name. This is particularly useful
if names_from is a numeric vector and you want to create syntactic variable
names.
names_sep If names_from or values_from contains multiple variables, this will be used to
join their values together into a single string to use as a column name.
names_glue Instead of names_sep and names_prefix, you can supply a glue specification
that uses the names_from columns (and special .value) to create custom column
names.
names_sort Should the column names be sorted? If FALSE, the default, column names are
ordered by first appearance.
names_repair What happens if the output has invalid column names? The default, "check_unique",
is to error if the columns are duplicated. Use "minimal" to allow duplicates
in the output, or "unique" to de-duplicate by adding numeric suffixes. See
vctrs::vec_as_names() for more options.
values_fill Optionally, a (scalar) value that specifies what each value should be filled in
with when missing.
This can be a named list if you want to apply different fill values to different
value columns.
values_fn Optionally, a function applied to the value in each cell in the output. You will
typically use this when the combination of id_cols and value column does not
uniquely identify an observation.
This can be a named list if you want to apply different aggregations to different
value columns.
... Additional arguments passed on to methods.
Details
pivot_wider() is an updated approach to spread(), designed to be both simpler to use and to
handle more use cases. We recommend you use pivot_wider() for new code; spread() isn’t
going away but is no longer under active development.
See Also
pivot_wider_spec() to pivot "by hand" with a data frame that defines a pivoting specification.
Examples
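# See vignette("pivot") for examples and explanation
fish_encounters
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
# Fill in missing values
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)
# Aggregate cells that are not uniquely identified with values_fn
# (warpbreaks is a dataset shipped with base R)
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = mean
  )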
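To make the naming arguments more concrete, here is a minimal sketch with two value columns. The tibble long and its columns (id, variable, estimate, moe) are invented for illustration; the calls show the default names built with names_sep, a custom names_glue specification, and a named list passed to values_fill.

```r
library(tidyr)

# Hypothetical long data with two value columns; id 2 has no "rent" row
long <- tibble::tibble(
  id = c(1, 1, 2),
  variable = c("income", "rent", "income"),
  estimate = c(100, 10, 200),
  moe = c(5, 1, 8)
)

# Default naming: <value column><names_sep><names_from value>
wide <- long %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )

# names_glue puts the variable first; values_fill supplies a separate
# fill value for each value column
glued <- long %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_glue = "{variable}_{.value}",
    values_fill = list(estimate = 0, moe = NA_real_)
  )

names(wide)
names(glued)
```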
relig_income
Description
Pew religion and income survey
Usage
relig_income
Format
A dataset with variables:
Source
Downloaded from https://www.pewforum.org/religious-landscape-study/ (downloaded November 2009)
replace_na
Description
Replace NAs with specified values
Usage
replace_na(data, replace, ...)
Arguments
data A data frame or vector.
replace If data is a data frame, replace takes a list of values, with one value for each
column that has NA values to be replaced.
If data is a vector, replace takes a single value. This single value replaces all
of the NA values in the vector.
... Additional arguments for methods. Currently unused.
Value
• If data is a data frame, replace_na() returns a data frame.
• If data is a vector, replace_na() returns a vector, with class given by the union of data and
replace.
See Also
dplyr::na_if() to replace specified values with NAs; dplyr::coalesce() to replace NAs with
values from other vectors.
Examples
# Replace NAs in a data frame
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))
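The vector form described under Arguments can be sketched in the same way. The objects x and df below are made up for this example; inside dplyr::mutate() each column is passed as a vector, so replace takes a single value.

```r
library(tidyr)

# Replace NAs in a bare vector with a single value
x <- c(1, 2, NA)
filled <- replace_na(x, 0)
filled

# The vector form also works column by column inside dplyr::mutate()
df <- tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df_filled <- df %>% dplyr::mutate(x = replace_na(x, 0))
df_filled
```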
separate Separate a character column into multiple columns with a regular expression
or numeric locations
Description
Given either a regular expression or a vector of character positions, separate() turns a single
character column into multiple columns.
Usage
separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)
Arguments
data A data frame.
col Column name or position. This is passed to tidyselect::vars_pull().
This argument is passed by expression and supports quasiquotation (you can
unquote column names or column positions).
into Names of new variables to create as a character vector. Use NA to omit the variable
in the output.
sep Separator between columns.
If character, sep is interpreted as a regular expression. The default value is a
regular expression that matches any sequence of non-alphanumeric values.
If numeric, sep is interpreted as character positions to split at. Positive values
start at 1 at the far-left of the string; negative values start at -1 at the far-right of
the string. The length of sep should be one less than the length of into.
remove If TRUE, remove input column from output data frame.
convert If TRUE, will run type.convert() with as.is = TRUE on new columns. This is
useful if the component columns are integer, numeric or logical.
NB: this will cause string "NA"s to be converted to NAs.
extra If sep is a character vector, this controls what happens when there are too many
pieces. There are three valid options:
• "warn" (the default): emit a warning and drop extra values.
• "drop": drop any extra values without a warning.
• "merge": split at most length(into) times, keeping any remaining text in the last piece.
fill If sep is a character vector, this controls what happens when there are not
enough pieces. There are three valid options:
• "warn" (the default): emit a warning and fill from the right
• "right": fill with missing values on the right
• "left": fill with missing values on the left
... Additional arguments passed on to methods.
See Also
unite(), the complement; extract(), which uses regular expression capturing groups.
Examples
library(dplyr)
# If you want to split by any non-alphanumeric value (the default):
df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
df %>% separate(x, c("A", "B"))
# If every row doesn't split into the same number of pieces, use
# the extra and fill arguments to control what happens:
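A minimal sketch of the extra and fill arguments, and of a numeric sep, is given below. The data frames df and dates and the output column names are assumptions made for this example.

```r
library(tidyr)

# Rows that split into one, two and three pieces, plus a missing value
df <- data.frame(x = c("x", "x y", "x y z", NA))

# Drop the extra piece and pad short rows with NA on the right, silently
dropped <- df %>% separate(x, c("a", "b"), extra = "drop", fill = "right")

# Keep the extra piece merged into the last column and pad on the left
merged <- df %>% separate(x, c("a", "b"), extra = "merge", fill = "left")

# A numeric sep splits by character position instead of by pattern
dates <- data.frame(ym = c("201901", "202012"))
by_pos <- dates %>% separate(ym, c("year", "month"), sep = 4)

dropped
merged
by_pos
```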
separate_rows
Description
If a variable contains observations with multiple delimited values, this separates the values and
places each one in its own row.
Usage
separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
Arguments
data A data frame.
... <tidy-select> Columns to separate across multiple rows
sep Separator delimiting collapsed values.
convert If TRUE will automatically run type.convert() on the key column. This is
useful if the column types are actually numeric, integer, or logical.
Examples
df <- tibble(
  x = 1:3,
  y = c("a", "d,e,f", "g,h"),
  z = c("1", "2,3,4", "5,6")
)
separate_rows(df, y, z, convert = TRUE)
smiths
Description
A small demo dataset describing John and Mary Smith.
Usage
smiths
Format
A data frame with 2 rows and 5 columns.
spread
Description
[Superseded]
Development on spread() is complete, and for new code we recommend switching to pivot_wider(),
which is easier to use, more featureful, and still under active development. df %>% spread(key, value)
is equivalent to df %>% pivot_wider(names_from = key, values_from = value).
See more details in vignette("pivot").
Usage
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
Arguments
data A data frame.
key, value Column names or positions. This is passed to tidyselect::vars_pull().
These arguments are passed by expression and support quasiquotation (you can
unquote column names or column positions).
fill If set, missing values will be replaced with this value. Note that there are two
types of missingness in the input: explicit missing values (i.e. NA), and implicit
missings, rows that simply aren’t present. Both types of missing value will be
replaced by fill.
convert If TRUE, type.convert() with as.is = TRUE will be run on each of the new
columns. This is useful if the value column was a mix of variables that was
coerced to a string. If the class of the value column was factor or date, note
that will not be true of the new columns that are produced, which are coerced to
character before type conversion.
drop If FALSE, will keep factor levels that don’t appear in the data, filling in missing
combinations with fill.
sep If NULL, the column names will be taken from the values of key variable. If non-NULL,
the column names will be given by "<key_name><sep><key_value>".
Examples
library(dplyr)
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
stocksm <- stocks %>% gather(stock, price, -time)
stocksm %>% spread(stock, price)
stocksm %>% spread(time, price)
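The equivalence stated in the Description can be checked with a short sketch. The tibble long and its columns are invented for illustration.

```r
library(tidyr)

# Hypothetical key-value data
long <- tibble::tibble(
  id = c(1, 1, 2, 2),
  key = c("a", "b", "a", "b"),
  value = c(10, 20, 30, 40)
)

# Superseded interface
old <- long %>% spread(key, value)

# Recommended replacement
new <- long %>% pivot_wider(names_from = key, values_from = value)

old
new
```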
table1
Description
Data sets that demonstrate multiple ways to lay out the same tabular data.
Usage
table1
table2
table3
table4a
table4b
table5
Details
table1, table2, table3, table4a, table4b, and table5 all display the number of TB cases documented
by the World Health Organization in Afghanistan, Brazil, and China between 1999 and
2000. The data contains values associated with four variables (country, year, cases, and population),
but each table organizes the values in a different layout.
The data is a subset of the data contained in the World Health Organization Global Tuberculosis
Report.
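As a sketch of how two of these layouts relate to the four variables named above, table4a can be lengthened and table2 widened with the pivoting functions. The output column names year and cases are chosen here to match those variables; they are not fixed by the datasets.

```r
library(tidyr)

# table4a stores one column of cases per year; gather the year columns
tidy4a <- table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )

# table2 stores cases and population in alternating rows; spread them out
tidy2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)

tidy4a
tidy2
```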
Source
https://www.who.int/teams/global-tuberculosis-programme/data
uncount
Description
Performs the opposite operation to dplyr::count(), duplicating rows according to a weighting
variable (or expression).
Usage
uncount(data, weights, .remove = TRUE, .id = NULL)
Arguments
data A data frame, tibble, or grouped tibble.
weights A vector of weights. Evaluated in the context of data; supports quasiquotation.
.remove If TRUE, and weights is the name of a column in data, then this column is
removed.
.id Supply a string to create a new variable which gives a unique identifier for each
created row.
Examples
df <- tibble(x = c("a", "b"), n = c(1, 2))
uncount(df, n)
uncount(df, n, .id = "id")
# Or expressions
uncount(df, 2 / n)
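The relationship to dplyr::count() mentioned in the Description can be illustrated with a round trip. The tibble original is made up for this sketch.

```r
library(tidyr)

# Hypothetical data with a repeated value
original <- tibble::tibble(x = c("a", "b", "b"))

# Collapse to one row per value, with the count in column n
counted <- dplyr::count(original, x)

# Expand back to one row per original observation; n is removed by default
restored <- uncount(counted, n)

counted
restored
```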
unite
Description
Convenience function to paste together multiple columns into one.
Usage
unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)
Arguments
data A data frame.
col The name of the new column, as a string or symbol.
This argument is passed by expression and supports quasiquotation (you can
unquote strings and symbols). The name is captured from the expression with
rlang::ensym() (note that this kind of interface where symbols do not represent
actual objects is now discouraged in the tidyverse; we support it here for
backward compatibility).
... <tidy-select> Columns to unite
sep Separator to use between values.
remove If TRUE, remove input columns from output data frame.
na.rm If TRUE, missing values will be removed prior to uniting each value.
See Also
separate(), the complement.
Examples
df <- expand_grid(x = c("a", NA), y = c("b", NA))
df
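A short sketch of how na.rm and remove affect the result, using the same grid of values. The new column name z is an arbitrary choice for this example.

```r
library(tidyr)

df <- expand_grid(x = c("a", NA), y = c("b", NA))

# By default missing values are pasted as the string "NA";
# remove = FALSE keeps the input columns next to the new one
with_na <- df %>% unite("z", x:y, remove = FALSE)

# na.rm = TRUE drops missing values before pasting
without_na <- df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)

with_na
without_na
```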
us_rent_income
Description
Captured from the 2017 American Community Survey using the tidycensus package.
Usage
us_rent_income
Format
A dataset with variables:
who
Description
A subset of data from the World Health Organization Global Tuberculosis Report, and accompanying
global populations.
Usage
who
population
Format
who: a data frame with 7,240 rows and the columns:
Details
The data uses the original codes given by the World Health Organization. The column names for
columns five through 60 are made by combining new_ with a code for method of diagnosis (rel =
relapse, sn = negative pulmonary smear, sp = positive pulmonary smear, ep = extrapulmonary), a
code for gender (f = female, m = male), and a code for age group (014 = 0-14 years of age, 1524 = 15-24
years of age, 2534 = 25-34 years of age, 3544 = 35-44 years of age, 4554 = 45-54 years of
age, 5564 = 55-64 years of age, 65 = 65 years of age or older).
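The coding scheme above can be unpacked with pivot_longer() and a regular expression. This is a sketch only: the output column names diagnosis, gender, age and count are chosen here for illustration, and the pattern assumes the naming scheme described in this section, including the columns that start with newrel rather than new_.

```r
library(tidyr)

who_long <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"),
    # "new", an optional underscore, the diagnosis code, an underscore,
    # one character for gender, and the rest for the age group
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count"
  )

who_long
```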
Source
https://www.who.int/teams/global-tuberculosis-programme/data
world_bank_pop
Description
Data about population from the World Bank.
Usage
world_bank_pop
Format
A dataset with variables:
country Three letter country code
indicator Indicator name: SP.POP.GROW = population growth, SP.POP.TOTL = total population,
SP.URB.GROW = urban population growth, SP.URB.TOTL = total urban population
2000-2018 Value for each year
Source
Dataset from the World Bank data bank: https://data.worldbank.org