7

Consider the following dataframe:

dummy_df <- tibble(
  A=c("ABC", "ABC", "BCD", "CDF", "CDF", "CDF"),
  B=c(0.25, 0.25, 1.23, 0.58, 0.58, 0.58),
  C=c("lorem", "ipsum", "dolor", "amet", "something", "else"),
  D=c("up", "up", "down", "down", "down", "down"),
  E=c(132, 132, 243, 512, 512, 512),
  F=c("m1", "m1", "m5", "m3", "m3", "m3"),
  G=c("val", "val", "mur", "mad", "mad", "mad"),
  H=c("grx", "grx", "bcn", "sal", "sal", "sal"),
  I=c(1.68, 1.68, 2.31, 3.12, 3.12, 3.12),
  J=c("p", "p", "f", "p", "p", "p"),
  K=c(100, 100, 200, 143, 143, 143)
)

# A tibble: 6 × 11
  A         B C         D         E F     G     H         I J         K
  <chr> <dbl> <chr>     <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC    0.25 lorem     up      132 m1    val   grx    1.68 p       100
2 ABC    0.25 ipsum     up      132 m1    val   grx    1.68 p       100
3 BCD    1.23 dolor     down    243 m5    mur   bcn    2.31 f       200
4 CDF    0.58 amet      down    512 m3    mad   sal    3.12 p       143
5 CDF    0.58 something down    512 m3    mad   sal    3.12 p       143
6 CDF    0.58 else      down    512 m3    mad   sal    3.12 p       143

After reading this, I have managed to collapse column C so that its value is concatenated into a string for every unique row value of column A.

dummy_df %>% group_by(A) %>% summarise(hits = toString(C), nhits=n())

# A tibble: 3 × 3
  A     hits                  nhits
  <chr> <chr>                 <int>
1 ABC   lorem, ipsum              2
2 BCD   dolor                     1
3 CDF   amet, something, else     3

However, I'm loosing every other column information, which are essential to me. How can I retain information about all the columns while collapsing column C? Ideally it should be done without having to hard-code the columns names, since the number of columns can vary depending on the dataset.

I have read this, but the example shown doesn't create new variables so I haven't been able to make it work.

This is what I'm looking for:

# A tibble: 3 × 12
  A     hits                  nhits     B D         E F     G     H         I J         K
  <chr> <chr>                 <int> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC   lorem, ipsum              2  0.25 up      132 m1    val   grx    1.68 p       100
2 BCD   dolor                     1  1.23 down    243 m5    mur   bcn    2.31 f       200
3 CDF   amet, something, else     3  0.58 down    512 m3    mad   sal    3.12 p       143
0

3 Answers 3

5

If we assume that all fields other than C will contain repeating values for each A-group, then we can do this:

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ unique(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits C                     D     F     G     H     J         B     E     I     K
#   <chr> <int> <chr>                 <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ABC       2 lorem, ipsum          up    m1    val   grx   p      0.25   132  1.68   100
# 2 BCD       1 dolor                 down  m5    mur   bcn   f      1.23   243  2.31   200
# 3 CDF       3 amet, something, else down  m3    mad   sal   p      0.58   512  3.12   143

In this case, since all other variables are repeating, everything stays as-is. If, however, we have variability, then I've set up a catch, where strings are combined (with toString) and differing numbers cause repeat rows.

dummy_df$H[1] <- "GRX"
dummy_df$K[1] <- 99
 %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ unique(.))
  ) %>%
  ungroup()
# A tibble: 4 × 12
  A     nhits C                     D     F     G     H        J         B     E     I     K
  <chr> <int> <chr>                 <chr> <chr> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68    99
2 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68   100
3 BCD       1 dolor                 down  m5    mur   bcn      f      1.23   243  2.31   200
4 CDF       3 amet, something, else down  m3    mad   sal      p      0.58   512  3.12   143

The across function iterates over multiple columns. The where function allows us to subset columns by something about their values, where for character columns we will apply toString, and for others we'll go with unique. This means that string columns should never add rows, but non-strings may.

An alternative for the number-like columns would be to use first(.) instead of unique(.), which will silently discard all but the first value in each column by group. Using the same modified data, we would see this, where the key difference is that K has discarded the 100 values since the first value is 99.

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(), 
    across(where(is.character), ~ toString(unique(.))),
    across(where(~ !is.character(.)), ~ first(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits C                     D     F     G     H        J         B     E     I     K
#   <chr> <int> <chr>                 <chr> <chr> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68    99
# 2 BCD       1 dolor                 down  m5    mur   bcn      f      1.23   243  2.31   200
# 3 CDF       3 amet, something, else down  m3    mad   sal      p      0.58   512  3.12   143

You can choose to use first with string columns as well, if you choose, which could simplify the logic to a single across:

dummy_df %>%
  group_by(A) %>%
  summarize(
    nhits = n(),
    across(everything(), ~ first(.))
  ) %>%
  ungroup()
# # A tibble: 3 × 12
#   A     nhits     B C     D         E F     G     H         I J         K
#   <chr> <int> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
# 1 ABC       2  0.25 lorem up      132 m1    val   GRX    1.68 p        99
# 2 BCD       1  1.23 dolor down    243 m5    mur   bcn    2.31 f       200
# 3 CDF       3  0.58 amet  down    512 m3    mad   sal    3.12 p       143

Note that we've lost H's "grx" (since "GRX" was first) as well as K's value of 100.

The data.table variant of this last code-block can be simply:

library(data.table)
as.data.table(dummy_df)[, c(.(nhits = .N), lapply(.SD, first)), by = A]

and a base R rendition? Perhaps not as "pretty" :-)

Reduce(
  function(a, b) merge(a, b, by = "A", all = TRUE), 
  list(
    setNames(aggregate(C ~ A, dummy_df, FUN = length), c("A", "nhits")), 
    aggregate(C ~ A, dummy_df, FUN = toString),
    aggregate(. ~ A, subset(dummy_df, select = -C), FUN = function(z) z[1])
  )
)
#     A nhits                     C    B    D   E  F   G   H    I J   K
# 1 ABC     2          lorem, ipsum 0.25   up 132 m1 val GRX 1.68 p  99
# 2 BCD     1                 dolor 1.23 down 243 m5 mur bcn 2.31 f 200
# 3 CDF     3 amet, something, else 0.58 down 512 m3 mad sal 3.12 p 143
1
  • The question has been updated with a better reprex, I didn't make myself clear, sorry for that.
    – jpm92
    Commented May 5, 2023 at 8:26
4

We could do it this way:

library(dplyr) #>= 1.1.0
dummy_df %>%
  summarise(hits = toString(C), nhits = n(), B = unique(B), .by = A)

OR

library(dplyr)

dummy_df %>%
  group_by(A) %>%
  summarise(hits = toString(C), nhits = n(), B = unique(B))
 A     hits                  nhits     B
  <chr> <chr>                 <int> <dbl>
1 ABC   lorem, ipsum              2  0.25
2 BCD   dolor                     1  1.23
3 CDF   amet, something, else     3  0.58
> 
6
  • Sorry, I mentioned it but maybe I wasn't clear. I need to retain every column in the original df, not only B. Is it still possible to accomplish this with this approach? (without having to write down manually 20 columns, of course :D)
    – jpm92
    Commented May 4, 2023 at 18:37
  • @jpm92 Just group_by(A, B) if B is always the same Commented May 4, 2023 at 18:39
  • I am sorry but my output is exact the same as your desired output?
    – TarJae
    Commented May 4, 2023 at 18:42
  • I know it is, but as I mentioned, I have 20 more columns apart from B that I need to retain that I didn't include in the reprex to simplify it, how could I retain all of them without having to manually specify them in the code?
    – jpm92
    Commented May 4, 2023 at 19:31
  • Could you add an extended example?
    – TarJae
    Commented May 5, 2023 at 6:35
3

Using data.table

library(data.table)
setDT(dummy_df)[, .(B = first(B), hits = toString(C), nhits = .N), by = A]

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.