## Wednesday, June 20, 2012

### Avi at AofA: population recovery

Fans of "The Land Before Time" know that there are many kinds of dinosaurs: Apatosaurus, Tyrannosaurus, Triceratops, Saurolophus, Pteranodon, Stegosaurus, to mention a few. But is the movie an accurate representation of reality? Were they really in equal numbers, or were there, perhaps, more Triceratops dinosaurs? Enquiring minds want to know: what was the distribution of the various kinds of dinosaurs?

Here is a way to bring the problem into the more familiar realm of mathematics. Think of each dinosaur as a vector, and of each bone as a coordinate. Paleontologists unearth random dinosaur bones, identifying them as they go (somewhat unreliably: errors are not unheard of), and digs can be viewed as "queries". How can we mine the resulting data to recover the distribution of the dinosaur population? That's what Avi Wigderson talked about on Monday at the Analysis of Algorithms conference (AofA) here in Montreal, presenting joint ork with Amir Yehudayoff. (If you don't care for dinosaurs, think "Netflix" instead.)

The population recovery problem: Given k n-dimensional binary vectors, recover a hidden distribution (pv) that draws vector v with probability pv. The hidden distribution is accessed via lossy, noisy queries. Each query draws a vector v with probability pv, and reveals a lossy and noisy version of v: lossy, meaning that the algorithm only sees a random 1 percent of the coordinates; noisy, meaning that each coordinate seen by the algorithm is flipped with probability 49 percent.

The result: Avi gave an algorithm that estimates (p1,p2,…,pk) so that, (with high probability) for every v the estimate for pv is at most an additive error epsilon away from the correct value. The number of queries is polynomial in n and in 1/epsilon (but slightly super polynomial in k).

Here is the algorithm:
(0) Take a few queries.
(1) Find, for each vector v, a small set Sv of coordinates, such that the following condition holds:
(*) Sv uniquely distinguishes v from every vertex u such that |Su|≤|Sv|.
Let Ambiguous(v) be the set of vectors u that are indistinguishable from v by looking at just the coordinates in Sv.
(2) Label the vectors 1,2,..,k by order of non-increasing |Sv|
(3) For v = 1 to k
Let zv = the restriction of v to the coordinates in Sv.
Compute qv, the fraction of the queries whose Sv coordinates are (roughly) zv.
Let (**) estimate(pv)= qv- ∑ estimate(pu),
where the sum is over every u in Ambiguous(v), other than v itself, of course
Output (estimate(p1),estimate(p2),…,estimate(pk)).

The first question that comes to the mind of the eager coder is the risk that the code may generate some "undefined" error: isn't there a risk of circularity? Step (3) computes estimates using estimates, but what if the estimates used have not yet been computed?
It turns out that this is not an issue. Why? Because, in step (3), the sum is over u's that are in Ambiguous(v). But by condition (*) we must have |Su|>|Sv|, and by step (2) they are been placed before v in the ordering, so estimate(pu) has already been computed by the time we need it in the right hand side. Hence, no circularity.

The second question is that estimates have errors, inevitably, and when using the algebraic expression (**), those errors will accumulate. To make up for that, more queries will be needed. Now we are worried: in order to have reliable estimates, isn't this algorithm going to require lots and lots of queries? Well, no. The main idea is that the sets Sv constructed in step (1) are small: they all have cardinality at most log(k). Then the graph that has an arc (u,v) whenever estimate(pu) occurs in the right hand side of (**) is a partial order with depth at most log(k), and so errors don't accumulate.

The main remaining difficulty is how to achieve (*) while maintaining sets Sv of cardinality at most log(k). Here is how to do it.

Algorithm for Step (1):
Start with the empty set for every Sv
Repeat
While there exists an Sv and a coordinate i such that adding i to Sv decreases |Ambiguous(v)| by a factor of 2
Add i to Sv
If condition (*) is violated by some u and v,
then replace Su by Sv
Until condition (*) holds.

Why does that work? By some elementary magic.