How to summarise a dataframe retaining all the columns

Question

Consider the following dataframe:

dummy_df <- tibble(
  A=c("ABC", "ABC", "BCD", "CDF", "CDF", "CDF"),
  B=c(0.25, 0.25, 1.23, 0.58, 0.58, 0.58),
  C=c("lorem", "ipsum", "dolor", "amet", "something", "else"),
  D=c("up", "up", "down", "down", "down", "down"),
  E=c(132, 132, 243, 512, 512, 512),
  F=c("m1", "m1", "m5", "m3", "m3", "m3"),
  G=c("val", "val", "mur", "mad", "mad", "mad"),
  H=c("grx", "grx", "bcn", "sal", "sal", "sal"),
  I=c(1.68, 1.68, 2.31, 3.12, 3.12, 3.12),
  J=c("p", "p", "f", "p", "p", "p"),
  K=c(100, 100, 200, 143, 143, 143)
)

# A tibble: 6 × 11
  A         B C         D         E F     G     H         I J         K
  <chr> <dbl> <chr>     <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC    0.25 lorem     up      132 m1    val   grx    1.68 p       100
2 ABC    0.25 ipsum     up      132 m1    val   grx    1.68 p       100
3 BCD    1.23 dolor     down    243 m5    mur   bcn    2.31 f       200
4 CDF    0.58 amet      down    512 m3    mad   sal    3.12 p       143
5 CDF    0.58 something down    512 m3    mad   sal    3.12 p       143
6 CDF    0.58 else      down    512 m3    mad   sal    3.12 p       143

After reading this, I have managed to collapse column C so that its value is concatenated into a string for every unique row value of column A.

dummy_df %>% group_by(A) %>% summarise(hits = toString(C), nhits=n())

# A tibble: 3 × 3
  A     hits                  nhits
  <chr> <chr>                 <int>
1 ABC   lorem, ipsum              2
2 BCD   dolor                     1
3 CDF   amet, something, else     3

However, I'm loosing every other column information, which are essential to me. How can I retain information about all the columns while collapsing column C? Ideally it should be done without having to hard-code the columns names, since the number of columns can vary depending on the dataset.

I have read this, but the example shown doesn't create new variables so I haven't been able to make it work.

This is what I'm looking for:

# A tibble: 3 × 12
  A     hits                  nhits     B D         E F     G     H         I J         K
  <chr> <chr>                 <int> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC   lorem, ipsum              2  0.25 up      132 m1    val   grx    1.68 p       100
2 BCD   dolor                     1  1.23 down    243 m5    mur   bcn    2.31 f       200
3 CDF   amet, something, else     3  0.58 down    512 m3    mad   sal    3.12 p       143

r2evans · Accepted Answer · 2023-05-05 12:49:31Z

If we assume that all fields other than C will contain repeating values for each A-group, then we can do this:

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ unique(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits C                     D     F     G     H     J         B     E     I     K
#   <chr> <int> <chr>                 <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ABC       2 lorem, ipsum          up    m1    val   grx   p      0.25   132  1.68   100
# 2 BCD       1 dolor                 down  m5    mur   bcn   f      1.23   243  2.31   200
# 3 CDF       3 amet, something, else down  m3    mad   sal   p      0.58   512  3.12   143

In this case, since all other variables are repeating, everything stays as-is. If, however, we have variability, then I've set up a catch, where strings are combined (with toString) and differing numbers cause repeat rows.

dummy_df$H[1] <- "GRX"
dummy_df$K[1] <- 99
 %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ unique(.))
  ) %>%
  ungroup()
# A tibble: 4 × 12
  A     nhits C                     D     F     G     H        J         B     E     I     K
  <chr> <int> <chr>                 <chr> <chr> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68    99
2 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68   100
3 BCD       1 dolor                 down  m5    mur   bcn      f      1.23   243  2.31   200
4 CDF       3 amet, something, else down  m3    mad   sal      p      0.58   512  3.12   143

The across function iterates over multiple columns. The where function allows us to subset columns by something about their values, where for character columns we will apply toString, and for others we'll go with unique. This means that string columns should never add rows, but non-strings may.

An alternative for the number-like columns would be to use first(.) instead of unique(.), which will silently discard all but the first value in each column by group. Using the same modified data, we would see this, where the key difference is that K has discarded the 100 values since the first value is 99.

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(), 
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ first(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits C                     D     F     G     H        J         B     E     I     K
#   <chr> <int> <chr>                 <chr> <chr> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68    99
# 2 BCD       1 dolor                 down  m5    mur   bcn      f      1.23   243  2.31   200
# 3 CDF       3 amet, something, else down  m3    mad   sal      p      0.58   512  3.12   143

You can choose to use first with string columns as well, if you choose, which could simplify the logic to a single across:

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(everything(), ~ first(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits     B C     D         E F     G     H         I J         K
#   <chr> <int> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
# 1 ABC       2  0.25 lorem up      132 m1    val   GRX    1.68 p        99
# 2 BCD       1  1.23 dolor down    243 m5    mur   bcn    2.31 f       200
# 3 CDF       3  0.58 amet  down    512 m3    mad   sal    3.12 p       143

Note that we've lost H's "grx" (since "GRX" was first) as well as K's value of 100.

The data.table variant of this last code-block can be simply:

library(data.table)
as.data.table(dummy_df)[, c(.(nhits = .N), lapply(.SD, first)), by = A]

and a base R rendition? Perhaps not as "pretty" :-)

Reduce(
  function(a, b) merge(a, b, by = "A", all = TRUE), 
  list(
    setNames(aggregate(C ~ A, dummy_df, FUN = length), c("A", "nhits")), 
    aggregate(C ~ A, dummy_df, FUN = toString),
    aggregate(. ~ A, subset(dummy_df, select = -C), FUN = function(z) z[1])
  )
)
#     A nhits                     C    B    D   E  F   G   H    I J   K
# 1 ABC     2          lorem, ipsum 0.25   up 132 m1 val GRX 1.68 p  99
# 2 BCD     1                 dolor 1.23 down 243 m5 mur bcn 2.31 f 200
# 3 CDF     3 amet, something, else 0.58 down 512 m3 mad sal 3.12 p 143

The question has been updated with a better reprex, I didn't make myself clear, sorry for that. — jpm92, Commented May 5, 2023 at 8:26

TarJae · Accepted Answer · 2023-05-04 18:24:32Z

4

We could do it this way:

library(dplyr) #>= 1.1.0
dummy_df %>%
  summarise(hits = toString(C), nhits = n(), B = unique(B), .by = A)

OR

library(dplyr)

dummy_df %>%
  group_by(A) %>%
  summarise(hits = toString(C), nhits = n(), B = unique(B))

 A     hits                  nhits     B
  <chr> <chr>                 <int> <dbl>
1 ABC   lorem, ipsum              2  0.25
2 BCD   dolor                     1  1.23
3 CDF   amet, something, else     3  0.58
>

answered May 4, 2023 at 18:24

TarJae

78.6k6 gold badges28 silver badges87 bronze badges

Recognized by R Language Collective

Sorry, I mentioned it but maybe I wasn't clear. I need to retain every column in the original df, not only B. Is it still possible to accomplish this with this approach? (without having to write down manually 20 columns, of course :D)
– jpm92
Commented May 4, 2023 at 18:37
@jpm92 Just group_by(A, B) if B is always the same
– Azor Ahai -him-
Commented May 4, 2023 at 18:39
I am sorry but my output is exact the same as your desired output?
– TarJae
Commented May 4, 2023 at 18:42
I know it is, but as I mentioned, I have 20 more columns apart from B that I need to retain that I didn't include in the reprex to simplify it, how could I retain all of them without having to manually specify them in the code?
– jpm92
Commented May 4, 2023 at 19:31
Could you add an extended example?
– TarJae
Commented May 5, 2023 at 6:35

| Show 1 more comment

akrun · Accepted Answer · 2023-05-04 18:55:05Z

3

Using data.table

library(data.table)
setDT(dummy_df)[, .(B = first(B), hits = toString(C), nhits = .N), by = A]

answered May 4, 2023 at 18:55

akrun

886k38 gold badges579 silver badges691 bronze badges

Add a comment |

Collectives™ on Stack Overflow

How to summarise a dataframe retaining all the columns

3 Answers 3

Your Answer

Not the answer you're looking for? Browse other questions tagged
r
dplyr
group-by
aggregate
data-wrangling
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged rdplyrgroup-byaggregatedata-wrangling or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
r
dplyr
group-by
aggregate
data-wrangling
or ask your own question.