Wednesday, October 19, 2011

Testing Sortedness

Sofya Raskhodnikova, in a talk, presented a problem that could be taught at the end of the semester of my undergraduate Algorithms course. It can be solved by elementary means using randomization. (It can also be tied to the problem of finding the longest increasing subsequence of an input sequence, a classic illustration of algorithmic techniques.) This builds on two papers, "Improved testing algorithms for monotonicity" by Dodis-Goldreich-Lehman-Raskhodnikov-Ron-Samorodnitsky, and "Transitive-Closure Spanners" by Bhattacharyya-Grigorescu-Jung-Raskhodnikova-Woodruff.

Given an array A of n items (numbers), you wish to check, quickly, whether the items are sorted or nearly sorted. By "check", I mean, with reasonable confidence (say, you're correct with probability 99%). By "quickly", I mean in logarithmic time. By "nearly sorted", I mean that only a few of the numbers are out of place: if we ignore 4% of the numbers (outliers), the rest is sorted.

Why would you wish to check that? Maybe, if you have some data that used to be sorted but that, after some lazy updates, is now slightly out of order, to test whether you should bother sorting your data again; or, if you're about to resort, to decide which sorting algorithm to use.

Here is the algorithm.

- Define a set S of n log n "edges" (pairs of items). Assuming A is in sorted order, those are the pairs that would have been compared if you had sorted using Quicksort and if your pivot was always the median of the subset of numbers being sorted in the current recursive call. Thus there are edges between A[n/2] and everybody else, edges between A[n/4] and everyone in the range [1...n/2], edges between A[3n/4] and everyone in the range (n/2..n], etc.
- Repeat log n times: pick an edge {i,j} from S uniformly at random, and check whether A[i] and A[j] are in the correct order relative to each other.
- Output "The array is nearly sorted" if and only if all the checks are successful.

Here is the analysis.

- If A is sorted then obviously all checks are successful and the output is correct
- If A is not sorted but is nearly sorted, with fewer than 4% of the items out of order, then the output could go either way.
- If A is far from being sorted, the central claim is that at least 0.01 n edges of S would fail the check. Then one of those edges would (probably) be found after a logarithmic number of tries.

Why does the claim hold?

Create a graph with vertex set A and with an edge between every pair {i,j} such that {A[i],A[j]} are in the wrong order. Note that the outliers are a minimum cardinality vertex cover of that graph. It is known (basic approximation algorithms fact) that the maximum matching M has size at least 1/2 times the minimum vertex cover, |M| >= |S|/2.

Mark the two endpoints of every edge of S that would fail the check. The number of edges of S that would fail the check is at least half of the number of marked vertices. For each {i,j} in the matching M defined above, by definition of Quicksort there is a k between i and j such that {i,k} and {k,j} are both in S. The central observation is that, since {A[i],A[j]} are in the wrong order, it must be that {A[i],A[k]} or {A[k],A[j]} are in the wrong order, so either i or j or both are marked. So the number of marked vertices is greater than or equal to |M|.

Putting those observations together, the number of edges of S that would fail the check is at least half of the number of marked vertices, so it is at least half the size of M, so it is at least a quarter of the number of outliers. So, if there are at least .04 n outliers, then there are at least .01 n edges of S that would fail their check. This proves the claim.

10 comments:

  1. Sorry, maybe I misunderstand what you wrote. I dont see why S is a minimum vertex cover of the graph you constructed. First S is a set of edges. If you interpret the end points of edges in S as a vertex cover, the size of this vertex set can be smaller than |S|. Also it should not be minimum in general. If A is sorted then the graph has no edges. But S has a fixed size of nlogn.

    ReplyDelete
  2. I had almost exactly the same confusion myself !

    ReplyDelete
  3. Oops. Notation error, sorry! I meant, the outliers.

    ReplyDelete
  4. See Problem 2-1 at:

    http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/assignments/ps2.pdf

    for a variant.

    ReplyDelete
  5. "Then one of those edges would (probably) be found after a logarithmic number of tries."

    The "probably" here is the first time that you mention the test is probabilistic. Before you write:

    "By "nearly sorted", I mean that only a few of the numbers are out of place: if we ignore 4% of the numbers (outliers), the rest is sorted."

    And "Output "The array is nearly sorted" if and only if all the checks are successful."

    While reading I was constantly wondering how you can assert this by looking only at a fraction of the input.

    ReplyDelete
  6. Thanks Anonymous. I inserted a sentence to remove the confusion: "By check, I mean, with reasonable confidence (say, you're correct with probability 99%)."

    Thanks to your help, by the time the semester comes I'll have lecture notes for this class ready already!

    ReplyDelete
  7. Sorry, but how would you prove that:

    "It is known (basic approximation algorithms fact) that the maximum matching M has size at least 1/2 times the minimum vertex cover, |M| >= |S|/2"

    It looks deceptively simple and I'm sure I'm missing something trivial but I just can't seem to be able to do it. Thanks.

    ReplyDelete
    Replies
    1. The answer is in Wikipedia's article about Vertex Cover, in the paragraph "Approximate Evaluation".

      Delete
    2. Thank you very much. =)

      Delete