I have a dataframe where missingness in indicated by "Z" (there may also be some "z" and NA
entries present in the data), and values are entered as characters ("0", "1", etc). I need to create scores ("updrs1", "updrs2", updrs3") that add up the non-missing values across columns selected by colname prefix ("NP1", "NP2", "NP3").
dummy data:
dummy_df <- data.frame(
subject_id = seq(1,6,1),
OTHERV1 = c(1,1,0,0,1,1),
NP1VAR1 = c("Z","0","Z","Z","Z","Z"),
NP1VAR2 = c("Z","0","Z","Z","Z","Z"),
NP1VAR3 = c("Z","3","Z","Z","Z","Z"),
NP2VAR1 = c("Z","2","Z","Z","Z","Z"),
NP2VAR2 = c("Z","0","Z","Z","Z","Z"),
NP2VAR3 = c("Z","0","Z","Z","Z","Z"),
NP3VAR1 = c("Z","4","Z","Z","Z","Z"),
NP3VAR2 = c("Z","0","Z","Z","z","Z"),
NP3VAR3 = c("Z","0","Z","Z","Z",NA),
OTHERV2 = c(NA,NA,NA,NA,NA,NA)
)
desired output:
subject_id | updrs1 | updrs2 | updrs3 | |
---|---|---|---|---|
1 | 1 | Z | Z | Z |
2 | 2 | 3 | 2 | 4 |
3 | 3 | Z | Z | Z |
4 | 4 | Z | Z | Z |
5 | 5 | Z | Z | Z |
6 | 6 | Z | Z | Z |
NOTE: all values in the output are characters
NOTE: treating NA/Z/z as 0 (i.e., transforming all values with as.numeric()
) is problematic.
I've tried variations on this answer, with no luck.
I have used a combination of tidyr::pivot_longer()
, dplyr::group_by()
, and summarize()
:
desired_output <- select(dummy_df, c(subject_id, starts_with("NP"))) %>%
mutate(across(all_of(everything()), ~ifelse(. %in% c("Z", "z"), NA, .))) %>%
pivot_longer(cols = starts_with("NP"), names_to = c(".value", "np_var"), names_sep = "VAR") %>%
group_by(subject_id) %>%
summarize(updrs1 = sum(as.numeric(NP1), na.rm = FALSE),
updrs2 = sum(as.numeric(NP2), na.rm = FALSE),
updrs3 = sum(as.numeric(NP3), na.rm = FALSE), .groups = "drop") %>%
mutate(across(all_of(everything()), as.character)) %>%
replace(is.na(.), "Z")
This works. L Tyrone's answer also works, so I've accepted it. Thanks all.
NA
?