## Friday, November 18, 2011

### How to solve a system of linear equations

What could be more boring than solving a system of linear equations, Ax=b? We've all learnt in school how to do it: Gaussian elimination; and we've practiced it until we got sick of it. End of story!

But there are people who are never happy with what they have. Gaussian elimination is too slow, they whine. Polynomial time is not good enough for them. They are afraid that it will fail on massive data, and want something that is fast "for real". Near-linear time is the goal these days.

They currently focus on a special case: "SDD solvers", meaning that the matrix A is symmetric and diagonally dominant, meaning that the value on the diagonal is larger than the sum of absolute values of all off-diagonal values on its row. In fact, let us simplify things further and assume that A is the "Laplacian matrix" of a graph - the matrix where rows and columns are indexed by V, where the diagonal entry is the vertex degree, and where the off-diagonal entry is -1 if and only if there is an edge between the two vertices.

So, for that special case, in their anxious search for faster algorithms, people have started playing a game that is the opposite of what we traditionally do in approximation algorithms: instead of giving highest priority to the accuracy of the output, and treating runtime as a second thought, they sacrifice some accuracy in the hope if getting a faster algorithm. Indeed, if we don't insist on finding the exact solution x* of Ax=b, then we can try some iterative method that constructs a sequence x1,x2,x3,., where each xi is an improvement over the previous one, and pray that it gets quickly to a value xT close to x*.

But scientists don't trust the power of prayer. Instead, we analyze the runtime, which is basically the cost of each iteration times the number of iterations. Unfortunately, if the matrix A is dense (many non-zero entries), each iteration is costly; and if it's not well-rounded, one needs a large number of iterations to get close to x*.

To speedup the algorithm, instead of solving the system Ax=b, let us solve the equivalent system
B^{-1}Ax=B^{-1}b

for a well-chosen matrix B. Namely, B should be sparse, so that each iteration is cheap, and B should be similar to A, so that few iterations suffice to converge. (What's the measure of similarity? It's k such that for any vector x, x^TBx is between x^TAx and k times that quantity.)

Fortunately, there is an algorithm to find a matrix B that is very sparse- only a linear number of non-zero entries - and very well conditioned - k is constant. Unfortunately, that algorithm is slow, which defeats the whole purpose. So, how can one go about quickly finding a good matrix B?

Here's an algorithm for that subproblem.

1. First, find a low-stretch spanning tree T of G, the graph corresponding to A: it's a spanning tree such that the sum of all interpoint distances in T is not much more than the sum of all interpoint distances in G. That's been studied before in a sequence of papers, so it's just a matter of recycling old work.

2. Second, for every edge uv of G, define P(uv) as the distance from u to v in T. Let t be the sum of P(e) over all edges e of G.

3. Third, build a multigraph H by repeating t log(n) times:
pick an edge e with probability P(e)/t, and put t/P(e) copies of e in H.

4. The desired matrix B is (1/t log(n)) times the Laplacian matrix of H.

And that is what I learned because of a seminar given at Brown by Richard Peng, who presented some more advanced stuff in the same area (some work he's done with Ioannis Koutis and Gary Miller).

Why should we care about this line of research? Getting an improved polynomial-time algorithm for a special case of linear system solvers, is that really such a big deal? Why is there so much interest for those papers in the Theory community? Pick your answer:

( a ) . we should care, because Dan Spielman (or insert the name of your favorite researcher in that area) cares, and he has good taste. [Trust of authorities]
( b ) . we should care, because the algorithmic perspective on those problems is pretty new. It's the development of a new technique. [Algorithmic design]
( c ) . we should care, because this has been implemented and tested and it is claimed to be competitive with existing solvers, so it leads, maybe, to actual real life programs that run faster. This could be a big success story of Theory! [Applications to Applications]
( d ) . we should care, because it has implications for Theory: it leads to faster algorithms for graph problems such as generating a random spanning tree or maximum flow. [Applications to Theory]
( e ) . we should care, because it's sunny out, I am in a good mood, and today I am inclined to see things as exciting in general. [Random]

( a' ) . we should ignore this, because it has too much heavy math, it's not fun, and it's not my taste. [Trust of own taste]
( b' ) . we should ignore this, because the core of it is just low-stretch algorithms, sampling, etc: it's just putting together basic algorithmic tools that have nothing new about them, and the claim of novelty are pure marketing [Disgruntled]
( c' ) . we should ignore this, because the claims of applications are merely what people always say in their introductions, and those claims never mean anything. [Skepticism]
( d' ) . we should ignore this, because it's not going to help understand the P versus NP problem, even in the long term. [Lack of applications to Theory]
( e' ) . we should ignore this, because I had too much to eat and now I have a stomachache. [Random]

1. To solve B^{-1}Ax=B^{-1}b, how does it help if B is sparse, and how does it help if B is similar to A? Is this what you meant to say?

2. That deserves another blog post, I think.

3. "To solve B^{-1}Ax=B^{-1}b, how does it help if B is sparse"

You can use iterative methods (like the conjugate gradient method) to approximately solve that linear system. Each iteration requires solving a linear system in B.

To solve linear systems in B, we can again use the conjugate gradient method. This is a good choice because CG works well for sparse linear systems. One provable bound is: if B has k non-zero entries, this takes O(n*k) time.

For example, we could choose B to be a spectral sparsifier of A, with O(n log n) non-zero entries. Then solving linear systems in B takes only O(n^2 log n) time.

To solve linear systems in B, you should recursively apply the same idea: compute an even sparser sparsifier C, then use conjugate gradient to solve the linear system in B using C as a preconditioner.

"how does it help if B is similar to A"

Because the solution to "Bx = b" will be similar to the solution to "Ax = b". The iterative methods can take an approximate solution to "Bx = b" and improve it to an approximate solution for "Ax = b".

At least, this seems true intuitively. But I believe we still don't have a solid proof of this when the iterative method is conjugate gradient. Following Vaidya's student Joshi, Spielman and Teng show that this intuition is provable for the Chebyshev iterative method.

4. "Why should we care about this line of research? Getting an improved polynomial-time algorithm for a special case of linear system solvers, is that really such a big deal? Why is there so much interest for those papers in the Theory community?"

Being a systems guy, I'd expect you and others in Theory to answer :-) Here is my understanding assuming that the main problem they were trying to solve was that of speeding up the computations:

I think we should care about this line of research for two reasons: (1) it is indeed true (as you, no doubt, already know) that "polynomial-time" is not the goal of designing algorithms - the goal is to design a general enough solution that can solve some problem - today or in future. Now, with the kind of data sets I deal, n^3 or even n^2 may be too slow - ever tried running Dijsktra's algorithm on 200 million node, 20 billion edge graph? It just doesn't want to finish. (2) Many applications may be able to handle a little bit of inaccuracy, lets say outliers, if processing speed were not a concern.

I find such results interesting, but may be I am just a systems guy - but hey, I read your blog! And just for the record, I'd choose (c) if I were looking at this result from a systems perspective and (b') if I were looking from theory perspective.

1. Very knowledgeable and descriptive blog.The linear equation can be solve by three methods.My favorite is graphical method.But there is some complex situations come when the solutions are imaginary.Can you please help me about it.