
I have two dataframes like this:

set.seed(1)
df1 <- data.frame(id= 1:4, sex= c("m", "m", NA, NA), somevar= letters[1:4], whocares_var= rnorm(4))
df2 <- data.frame(id= 1:6, sex= c("m", NA, "m", NA, "m", NA), somevar= NA, morevars= LETTERS[1:6])

And I want to merge them. What I do is:

df_both <- merge(df1, df2, by= "id", all= TRUE)
df_both

  id sex.x somevar.x whocares_var sex.y somevar.y morevars
1  1     m         a   -0.6264538     m        NA        A
2  2     m         b    0.1836433  <NA>        NA        B
3  3  <NA>         c   -0.8356286     m        NA        C
4  4  <NA>         d    1.5952808  <NA>        NA        D
5  5  <NA>      <NA>           NA     m        NA        E
6  6  <NA>      <NA>           NA  <NA>        NA        F

I don't want the merged data frame to have two columns, sex.x and sex.y. Instead, I want a single sex column that contains the non-missing entry. What I expect to get is:

set.seed(1)
df_wanted <- data.frame(id= 1:6, sex= c("m", "m", "m", NA, "m", NA),
                        somevar= c(letters[1:4], NA, NA),
                        whocares_var= c(rnorm(4), NA, NA),
                        morevars= LETTERS[1:6])
df_wanted
  id  sex somevar whocares_var morevars
1  1    m       a   -0.6264538        A
2  2    m       b    0.1836433        B
3  3    m       c   -0.8356286        C
4  4 <NA>       d    1.5952808        D
5  5    m    <NA>           NA        E
6  6 <NA>    <NA>           NA        F

So the function I am looking for keeps only the non-missing entries whenever both data frames have a column with the same name. If a column is present in only one of the data frames, it should also appear in the final data. How can I achieve that?

Remark: I don't have any conflicting entries (i.e. different non-missing entries for the same id).

  • Can you please adapt your toy data? Two similar columns per data frame would be helpful.
    – Friede
    Commented Oct 29 at 13:56

2 Answers


Probably you can try

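library(dplyr)  # provides coalesce()

d <- merge(df1, df2, by = "id", all = TRUE)
# strip the ".x"/".y" suffixes that merge() appends to duplicated column names
nms <- sub("\\.[xy]$", "", names(d))
list2DF(
    lapply(
        # group columns sharing a root name, keeping the original column order
        split.default(d, nms)[unique(nms)],
        # within each group, take the first non-missing value per row
        \(x) do.call(coalesce, x)
    )
)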

which gives

  id  sex somevar whocares_var morevars
1  1    m       a   -0.6264538        A
2  2    m       b    0.1836433        B
3  3    m       c   -0.8356286        C
4  4 <NA>       d    1.5952808        D
5  5    m    <NA>           NA        E
6  6 <NA>    <NA>           NA        F

Note: coalesce() comes from the dplyr package.
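
If dplyr is not available, the same element-wise "first non-missing value" step can be sketched in base R. The helper name coalesce_base below is hypothetical (it is not part of base R or dplyr), and this is only a minimal sketch: ifelse() drops attributes, so it is not a safe drop-in for factor or Date columns.

```r
# Hypothetical helper: element-wise first non-NA value across vectors.
# Assumes plain atomic vectors of equal length (no factors or dates).
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

x <- c("m", "m", NA, NA)
y <- c("m", NA, "m", NA)
coalesce_base(x, y)
# [1] "m" "m" "m" NA
```

Under those assumptions, it could be swapped in for coalesce in the lapply() call above.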


The answer has been revised to match the expanded example in the question. Merge the data frames, convert the result to a list, split the columns into groups that share the same root name, combine the columns in each group using pmax (which drops NAs unless all values are NA), and finally convert back to a data frame. The only package used is tools, which ships with R and so does not need to be installed.

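library(tools)  # for file_path_sans_ext(); the `_` placeholder needs R >= 4.2

df1 |>
  merge(df2, by = "id", all = TRUE) |>
  as.list() |>
  list(xx = _) |>
  # group columns by root name: "sex.x" and "sex.y" both map to "sex"
  with(split(xx, file_path_sans_ext(names(xx)))) |>
  # per group, take the row-wise non-NA value (NA only if all are NA)
  lapply(\(cols) do.call("pmax", c(cols, na.rm = TRUE))) |>
  as.data.frame()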

giving

  id morevars  sex somevar whocares_var
1  1        A    m       a   -0.6264538
2  2        B    m       b    0.1836433
3  3        C    m       c   -0.8356286
4  4        D <NA>       d    1.5952808
5  5        E    m    <NA>           NA
6  6        F <NA>    <NA>           NA
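
To see why pmax works as the combining step, here is a small illustration on the two sex columns alone. It relies on the question's remark that there are no conflicting entries; the last line is an added caveat showing what would happen if there were.

```r
x <- c("m", "m", NA, NA)
y <- c("m", NA, "m", NA)

# With na.rm = TRUE, an NA is ignored when the other value is present;
# the result is NA only where both inputs are NA.
pmax(x, y, na.rm = TRUE)
# [1] "m" "m" "m" NA

# Caveat: if two non-missing values differed, pmax would silently
# keep the one that sorts last rather than flag the conflict.
pmax("f", "m", na.rm = TRUE)
```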
  • Have revised answer in light of the expanded example. Commented Oct 29 at 15:43
