dat <- as.data.frame(replicate(100,sample(c(0,1),100,replace=TRUE)))

I want to create a 100 by 100 matrix with the correlation coefficients between these binary variables as entries.

If the variables were continuous, then I would have used cor() to create the matrix. I am not sure if cor() with Pearson as the method is reasonable. If not, say I could find a function fn() to calculate the correlation between a pair of binary vectors. What is an efficient way to construct the 100 by 100 matrix?

  • What are the binary variables? I.e., could they represent some underlying normally distributed latent variable?
    – user20650
    Commented Jul 22, 2016 at 13:10

1 Answer

Not sure this belongs on Stack Overflow. What you are asking for is the correlation between binary vectors. This is called the phi coefficient, which was introduced by Pearson.

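For two binary variables coded 0/1 it is not merely an approximation: phi is exactly the Pearson correlation computed on those codes. You might try

sqrt(chisq.test(table(dat[,1],dat[,2]), correct=FALSE)$statistic/length(dat[,1]))

and notice that it gives the same value (0.08006408 for my sample) as

cor(dat[1], dat[2])

This is because the uncorrected chi-square statistic of a 2x2 table equals n times the squared Pearson correlation, so the two agree for any sample size. The square root drops the sign, though, so the first expression matches the absolute value of what cor() returns.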
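A minimal sketch of that comparison for one pair of columns, in case you want to check it on your own data. The seed is an arbitrary assumption added only for reproducibility, so the value will differ from the 0.08006408 quoted above; the names `x`, `y`, `phi_abs` and `r` are illustrative.

```r
set.seed(1)  # arbitrary seed (assumption), only so the sketch is reproducible
dat <- as.data.frame(replicate(100, sample(c(0, 1), 100, replace = TRUE)))

x <- dat[, 1]
y <- dat[, 2]

# Magnitude of phi from the uncorrected chi-square statistic of the 2x2 table
chi2 <- unname(chisq.test(table(x, y), correct = FALSE)$statistic)
phi_abs <- sqrt(chi2 / length(x))

# Pearson correlation on the 0/1 coding; unlike phi_abs, this keeps the sign
r <- cor(x, y)

# Compare the two up to floating-point tolerance
all.equal(phi_abs, abs(r))
```

Note that `cor(x, y)` can be negative while the chi-square route never is, which is one more reason to prefer `cor()` if the direction of the association matters.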

So, I would advocate saving yourself some time and just using cor(dat) as the solution.
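As a rough sketch of both routes: `cor(dat)` for the phi matrix directly, and a plain double loop for the case where you do have your own pairwise function. Here `fn` and `pairwise_matrix` are hypothetical names, `fn` is just a stand-in that calls `cor()`, and the loop assumes the measure is symmetric so only the upper triangle is computed.

```r
set.seed(1)  # arbitrary seed (assumption), only so the sketch is reproducible
dat <- as.data.frame(replicate(100, sample(c(0, 1), 100, replace = TRUE)))

# Direct route: Pearson on 0/1 columns gives the phi coefficients
phi_mat <- cor(dat)
dim(phi_mat)

# Generic route for an arbitrary pairwise measure.
# fn is a hypothetical stand-in; replace its body with your own function.
fn <- function(x, y) cor(x, y)

pairwise_matrix <- function(df, f) {
  p <- ncol(df)
  m <- matrix(NA_real_, p, p, dimnames = list(names(df), names(df)))
  for (i in seq_len(p)) {
    for (j in i:p) {
      # assumes f(x, y) == f(y, x), so each pair is evaluated once
      m[i, j] <- m[j, i] <- f(df[[i]], df[[j]])
    }
  }
  m
}

fn_mat <- pairwise_matrix(dat, fn)
all.equal(fn_mat, phi_mat)
```

For 100 columns the loop evaluates 5,050 pairs, which is fine at this size; `cor(dat)` is still the faster choice when Pearson/phi is what you want.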

  • is phi just a case of pearson?
    – Maths12
    Commented Feb 4, 2021 at 18:40
  • Yes, just in the case of a 2x2 contingency table. Otherwise, they are not in the same range.
    – shayaa
    Commented Feb 4, 2021 at 19:31
