estimating correlations from unmatched data

Jan 24, 2010 20:11

Can one estimate the correlation between two variables when no joint data is observed?

Imagine you have samples from a 2D joint distribution... except that you don't have joint observations, because the x and y readings are unmatched. Your data can be visualized as vertical and horizontal lines (the red lines in the figure below).




My surprising realization was that the marginals give some evidence about how the x- and y-readings are matched (especially strong evidence when the true correlation is close to 1 or -1, i.e. when the mutual information is high).


A highly-regarded collaborator of mine didn't buy this at all, and I couldn't convince him even after several emails. So I created some R code for the n=3 case, and named it "magic-trick.R". The program generates a large number of coin tosses (+1 or -1), and "magically" guesses the sequence of tosses, using just the empirical marginals.

Y <- toss*X + c*N(0,1)        (signal + noise, if you will)
Y <- randomPermutation(Y)

When c is 0, the guess is correct 100% of the time.
When c is 1, it is correct only ~53% of the time (still above the 50% chance level).
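A quick way to see why the noiseless case works every time: the y-gaps equal the x-gaps exactly, in the same order when the toss is +1 and in reversed order when it is -1. A minimal sketch of mine (not part of the original script):

```r
## Noiseless case: the sorted y-gaps are exactly the sorted x-gaps,
## reversed when toss = -1, so the gap comparison can never fail.
set.seed(42)
x <- rnorm(3)
toss <- -1                      # try +1 too
y <- sample(toss * x)           # scrambled, i.e. unmatched, noiseless y
xgaps <- diff(sort(x))          # the 2 gaps between the 3 sorted x's
ygaps <- diff(sort(y))
all.equal(ygaps, rev(xgaps))    # TRUE: reversed gaps, so guess -1
```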

If you are skeptical, perhaps my first step should be to convince you that on the figure above, a true correlation of 0.9 is much more likely than -0.9, e.g. by pointing out that the rectangles on the diagonal are much less eccentric under the correct matching. Looking at the figure below, we see that the true correlation could be +1, but definitely not -1:




Therefore, the unmatched data gives at least some information about the correlation.
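To put a number on the rectangle argument, here is a small sketch (my own code; the eccentricity score, the mean |log(width/height)| over the consecutive-point rectangles, is an ad hoc choice). Under the +1 hypothesis the rectangle heights are the y-gaps in forward order; under the -1 hypothesis they are the same gaps reversed:

```r
## Score the two matchings of sorted points by how square their rectangles are.
set.seed(7)
x <- sort(rnorm(9))
y <- sort(x + 0.05 * rnorm(9))  # true correlation very close to +1
eccentricity <- function(widths, heights) mean(abs(log(widths / heights)))
eccPlus  <- eccentricity(diff(x), diff(y))       # +1 matching
eccMinus <- eccentricity(diff(x), rev(diff(y)))  # -1 matching
eccPlus < eccMinus   # near-square rectangles under the correct matching
```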

As you get more data, the true matching becomes harder to get exactly right, but (hopefully) estimating the correlation keeps getting easier (though perhaps at a very slow rate), computational issues aside.

Query:
* is there an identifiability / consistency result here? i.e. is there a consistent estimator for correlation, under reasonable assumptions about the joint pdf?
* more ambitiously, what families of joint densities can be estimated from unmatched data?

I suspect that the correlation is identifiable for 2D Gaussians... which would imply that multivariate normals can be estimated from unmatched data (the marginals pin down the means and variances, so the correlations are the only remaining parameters).
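On the Gaussian case, here is a speculative sketch of a simulation-calibrated estimator; everything in it (batch size, grid, the statistic) is my own choice, not from the post. The per-batch statistic mean(sort(x) * sort(y)) ignores the matching entirely, and its expectation seems to increase monotonically with rho, though I have not proven this; if so, matching the observed value against simulated ones recovers rho:

```r
## Sketch: estimate rho from many small unmatched batches of a standard
## bivariate normal by matching a matching-invariant moment to simulations.
set.seed(1)
batchStat <- function(rho, n, nBatches) {
  mean(replicate(nBatches, {
    x <- rnorm(n)
    y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
    mean(sort(x) * sort(y))   # unchanged if y is scrambled
  }))
}
n <- 5; nBatches <- 4000
observed <- batchStat(0.7, n, nBatches)   # stand-in for real unmatched data
grid <- seq(-0.9, 0.9, by = 0.1)
curve <- sapply(grid, batchStat, n = n, nBatches = nBatches)
rhoHat <- grid[which.min(abs(curve - observed))]
rhoHat   # should land near the true 0.7
```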

I suspect that non-identifiabilities arise when there exist x1 ≠ x2 with identical conditional distributions p(Y|x1) = p(Y|x2)... i.e. when the map x ↦ p(Y|x) is not injective.

However, as long as the set of x's sharing any given p(Y|x) has measure zero, some continuity assumptions might still give us full identifiability.

The following R code uses a forget-the-truth protocol as a guarantee that it isn't cheating: the inference step (Part 2) never sees the true coin tosses.

##################################################################
### magic-trick.R - inferring correlation from unmatched data ###
### by Gustavo Lacerda www.optimizelife.com ###
##################################################################

## PART 1 - generating data

nExpts <- 10000 ##number of experiments / number of coin tosses

randomPermutationMatrix <- function(n){
  availableIndices <- 1:n
  mat <- matrix(0, nrow=n, ncol=n) ## n by n matrix, initialized with all zeros
  for(i in 1:n){
    r <- ceiling((n-i+1)*runif(1)) ## random index between 1 and n-i+1
    j <- availableIndices[r]
    availableIndices <- availableIndices[-r] ## now that we used this column, it's not available
    mat[i,j] <- 1
  }
  mat
}

scramble <- function(v){
  m <- randomPermutationMatrix(length(v))
  as.vector(m %*% v)
}

n <- 3 ## with 3 points, we can get some information about the correlation; 2 isn't enough.
coeff <- c()
x <- list()
y <- list()

noiseCoeff <- 1 ## the 'c' from the formula

for (i in 1:nExpts){
  coeff[i] <- sign(rnorm(1)) ## +1 or -1
  x[[i]] <- rnorm(n)
  y[[i]] <- coeff[i] * x[[i]] + noiseCoeff * rnorm(n)
  ##plot(x[[i]],y[[i]],main="'o' indicates the true matching")
  y[[i]] <- scramble(y[[i]])
}

save(x,y, file="empirical-marginals.data")
save(coeff, file="coeff-truth.data")

## PART 2 - MAGIC TRICK
## reads the file 'empirical-marginals.data'; DOES NOT read 'coeff-truth.data'.
## This part runs a simple inference procedure, and gives you back 'coeff-predicted.data', containing a vector of nExpts entries (each +1 or -1)

rm(list=ls()) ##forget everything
load("empirical-marginals.data") ##remember one thing
nExpts <- length(x)

prediction <- c()
for (i in 1:nExpts){
  xsorted <- sort(x[[i]],decreasing=FALSE)
  ysorted <- sort(y[[i]],decreasing=FALSE)
  xgap1 <- xsorted[2] - xsorted[1]
  ygap1 <- ysorted[2] - ysorted[1]
  ##xgap2 <- xsorted[3] - xsorted[2]
  ygap2 <- ysorted[3] - ysorted[2]

  ## if xgap1 is closer to ygap1, guess +1; if it's closer to ygap2, guess -1
  prediction[i] <- ifelse(abs(xgap1-ygap1) < abs(xgap1-ygap2),+1,-1)
}
save(prediction, file="coeff-predicted.data")

## PART 3 - evaluating: did the magic work?

rm(list=ls()) ##forget everything
load("coeff-truth.data")
load("coeff-predicted.data")
nExpts <- length(coeff)
correct <- 0
for (i in 1:nExpts){
  if (prediction[i]==coeff[i])
    correct <- correct+1
}

cat("The prediction was correct ", correct, " times out of ", nExpts, ".\n")

############# MAGIC TRICK IS OVER. THANKS! #############
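For what it's worth, the whole trick condenses to a few lines once you drop the file-based forget-the-truth protocol and let sample() do the scrambling. This is my own restatement of the same procedure, not the original experimental setup:

```r
## Condensed magic trick: guess the coin toss from sorted gaps alone.
set.seed(123)
nExpts <- 5000; n <- 3; noiseCoeff <- 1
guessOne <- function() {
  toss <- sample(c(-1, 1), 1)
  x <- rnorm(n)
  y <- sample(toss * x + noiseCoeff * rnorm(n))  # scrambled, i.e. unmatched
  xgap <- diff(sort(x)); ygap <- diff(sort(y))
  guess <- if (abs(xgap[1] - ygap[1]) < abs(xgap[1] - ygap[2])) 1 else -1
  guess == toss
}
rate <- mean(replicate(nExpts, guessOne()))
rate   # around 0.53 for noiseCoeff = 1, consistent with the ~53% quoted above
```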

