C# Parallel Processing Case Study - Linear Systems Library
In this brief article we add parallel processing to a C# linear systems library using the parallel loops available in .NET Framework 4.5.2. The linear systems library has three public classes and eleven public methods. The function of the library is to solve potentially large systems of linear equations. The structure of the library is shown in the following list, where the top-level items are the public classes and the nested items are their publicly accessible methods:
- ClassicalIterativeMethods
  - Jacobi
  - GaussSeidel
  - SOR
  - GradientMethod
- DirectMethods
  - GaussianElimination
  - LUDecomposition
  - CholeskyDecomposition
  - SimpleGaussianElimination
- ModernIterativeMethods
  - ConjugateGradient
  - ModifiedRichardson
So we have ten different ways of potentially solving a system of linear equations. The library testing application uses one of five different types of highly scalable matrices:
- Pascal
- Other
- Cauchy
- Lehmer
- Tri-Diag 1
The matrix creation C# snippet is displayed next:
```csharp
if (matrix == 1)
{
    // Pascal matrix
    for (int i = 0; i < n; i++)
        b[i] = Math.Pow(2, i + 1);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i, j] = Binomial(i + j, j);
}
else if (matrix == 2)
{
    // other
    b[0] = -0.5;
    for (int i = 1; i < n - 1; i++)
        b[i] = -1.5;
    b[n - 1] = +0.5;
    for (int i = 0; i < n; i++)
    {
        a[i, i] = -2.0;
        if (i < n - 1)
            a[i, i + 1] = 1.0;
        if (i > 0)
            a[i, i - 1] = 1.0;
    }
}
else if (matrix == 3)
{
    // Cauchy matrix
    for (int i = 0; i < n; i++)
    {
        double xi = 2 * i;
        for (int j = 0; j < n; j++)
        {
            double yj = 2 * j + 1;
            a[i, j] = 1.0 / (xi + yj);
        }
        b[i] = i + 1;
    }
}
else if (matrix == 4)
{
    // Lehmer matrix
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
            a[i, j] = Math.Min(i + 1, j + 1) + Math.Max(i + 1, j + 1);
        a[i, i] = i + 1;
        b[i] = i;
    }
}
else if (matrix == 5)
{
    // tri-diagonal 1
    a[0, 0] = +2.0;
    a[0, 1] = -1.0;
    b[0] = 1;
    for (int i = 1; i < n - 1; i++)
    {
        a[i, i - 1] = -1.0;
        a[i, i] = 2.0;
        a[i, i + 1] = -1.0;
        b[i] = 0;
    }
    a[n - 1, n - 2] = -1.0;
    a[n - 1, n - 1] = +2.0;
    b[n - 1] = 0;
}
```
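The Pascal branch above calls a `Binomial` helper that is not shown in the snippet; only its name and call signature come from the call site. A minimal sketch of what such a helper might look like, using the multiplicative formula for binomial coefficients:

```csharp
// Sketch of the Binomial(n, k) helper assumed by the Pascal branch above;
// the name and signature are taken from the call site, the body is ours.
private double Binomial(int n, int k)
{
    // C(n, k) via the multiplicative formula, dividing as we go
    // so intermediate values stay small.
    double result = 1.0;
    for (int i = 1; i <= k; i++)
        result = result * (n - k + i) / i;
    return result;
}
```

For example, `Binomial(4, 2)` evaluates to 6, and the Pascal matrix entry `a[i, j] = Binomial(i + j, j)` reproduces Pascal's triangle along the anti-diagonals.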
We require that the maximum dimension of the system of equations be such that the system is solved, if possible, in about a minute or less by the sequential methods, and that the root-mean-square error be of order 1.0e-10 or smaller. The following table summarizes the results of these experiments for the direct methods, where each entry is the largest dimension n meeting these criteria:
| Method/Matrix Type | Pascal | Other | Cauchy | Lehmer | Tri-Diag 1 |
|---|---|---|---|---|---|
| Gaussian Elimination | 10 | 500 | 8 | 2000 | 2500 |
| LU Decomposition | 10 | 1000 | 8 | 900 | 2500 |
| Cholesky Decomposition | 10 | Failed | 8 | Failed | 2500 |
| Simple Gaussian Elimination | 10 | Failed | 2 | 2 | Failed |
From the above table we see that Simple Gaussian Elimination can be eliminated from further consideration, so to speak. We can also drop the Pascal and Cauchy matrix types for the direct methods, since the usable dimensions are tiny. Next we give an analogous table for the classical iterative methods.
| Method/Matrix Type | Pascal | Other | Cauchy | Lehmer | Tri-Diag 1 |
|---|---|---|---|---|---|
| Jacobi | Failed | Failed | Failed | Failed | Failed |
| Gauss-Seidel | 6 | 4 | 4 | Failed | 5 |
| SOR | 5 | 5 | 5 | Failed | 5 |
| Gradient Method | 5 | 5 | 3 | Failed | 5 |
We now discard the classical iterative methods because of their overwhelming dependence on good initial guesses for the solution. Next we move forward in time to the modern iterative methods:
| Method/Matrix Type | Pascal | Other | Cauchy | Lehmer | Tri-Diag 1 |
|---|---|---|---|---|---|
| Modified Richardson | 4 | 5 | 3 | Failed | 5 |
| Conjugate Gradient | 10 | Failed | 6 | Failed | 2500 |
At this point we are left with two candidates for the "Other" matrix type and four candidates for the "Tri-Diag 1" matrix type. So we try to introduce parallel processing into the following sequential methods:
- Gaussian Elimination
- LU Decomposition
- Cholesky Decomposition
- Conjugate Gradient
Our Gaussian elimination algorithm uses scaled partial pivoting and was translated from the FORTRAN 77 code given in Elementary Numerical Analysis: An Algorithmic Approach by S.D. Conte and Carl de Boor. Only one for loop in the whole method could benefit from a parallel loop construct: the working-storage element-swap loop, shown in its parallel glory below:
```csharp
// make k the pivot row by
// interchanging with i_star
flag *= -1;
i = pivot[i_star];
pivot[i_star] = pivot[k];
pivot[k] = i;
temp = d[i_star];
d[i_star] = d[k];
d[k] = temp;
Parallel.For(0, n, u =>
{
    // use a lambda-local temporary: sharing the outer temp
    // across parallel iterations would be a data race
    double tmp = w[i_star, u];
    w[i_star, u] = w[k, u];
    w[k, u] = tmp;
});
```
A study of the LU Decomposition code shows that it is not a good candidate for parallel processing. The Cholesky Decomposition code was translated from the C code found in A Numerical Library in C for Scientists and Engineers by H.T. Lau, PhD. Three little private methods lend themselves to parallel loops; they are illustrated next (note that, as written, each parallel body updates the shared accumulator `s`, which is a data race):
```csharp
private double matvec(int l, int u, int i, double[,] a, double[] b)
{
    double s = 0.0;
    // warning: += on the shared s from parallel iterations is a data race
    Parallel.For(l, u + 1, k => { s += a[i, k] * b[k]; });
    return (s);
}

private double tammat(int l, int u, int i, int j, double[,] a, double[,] b)
{
    double s = 0.0;
    // warning: same data race on s as in matvec
    Parallel.For(l, u + 1, k => { s += a[k, i] * b[k, j]; });
    return (s);
}

private double tamvec(int l, int u, int i, double[,] a, double[] b)
{
    double s = 0.0;
    // warning: same data race on s as in matvec
    Parallel.For(l, u + 1, k => { s += a[k, i] * b[k]; });
    return (s);
}
```
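A race-free way to parallelize these inner products, sketched here as an alternative (it is not the code that was benchmarked), is the `Parallel.For` overload that gives each worker thread its own partial sum and folds the partial sums together at the end:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Race-free counterpart of matvec, using the Parallel.For overload
// with thread-local accumulators (localInit/localFinally). A sketch only.
private double MatvecParallel(int l, int u, int i, double[,] a, double[] b)
{
    double s = 0.0;
    Parallel.For(l, u + 1,
        () => 0.0,                                    // per-thread initial partial sum
        (k, state, local) => local + a[i, k] * b[k],  // accumulate into the thread-local sum
        local =>                                      // fold each thread's sum into s atomically
        {
            double initial, computed;
            do
            {
                initial = s;
                computed = initial + local;
            } while (initial != Interlocked.CompareExchange(ref s, computed, initial));
        });
    return s;
}
```

Even race-free, each call still does only O(n) arithmetic, so the scheduling overhead of a parallel loop per call tends to swamp any gain on loops this short.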
The Conjugate Gradient method is implemented from equations given in Numerical Approximation of Partial Differential Equations by Alfio Quarteroni and Alberto Valli. The parallel-loop version is given in the following code block:
```csharp
private double InnerProduct(int n, double[] a, double[] b)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

public double ConjugateGradient(
    int n,
    int maxIterations,
    double tolerance,
    double[,] a,
    double[] b,
    double[] x)
{
    double t = 0.0;
    double alpha = 0;
    double rzold = 0.0;
    double rznew = 0.0;
    double[] p = new double[n];
    double[] r = new double[n];
    double[] z = new double[n];
    double[] ap = new double[n];
    double[,] pinv = new double[n, n];
    int its = 0;
    DirectMethods dm = new DirectMethods();
    //for (int i = 0; i < n; i++)
    //    for (int j = 0; j < n; j++)
    //        pinv[i, j] = a[i, j];
    Parallel.For(0, n, u =>
    {
        Parallel.For(0, n, v =>
        {
            pinv[u, v] = a[u, v];
        });
    });
    dm.CholeskyInverse(pinv, n);
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            pinv[j, i] = pinv[i, j];
    for (int i = 0; i < n; i++)
    {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += pinv[i, j] * b[j];
        x[i] = sum;
    }
    for (int i = 0; i < n; i++)
    {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += a[i, j] * x[j];
        r[i] = b[i] - sum;
        sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += pinv[i, j] * r[j];
        z[i] = sum;
        p[i] = z[i];
    }
    rzold = InnerProduct(n, r, z);
    while (its < maxIterations)
    {
        its++;
        // ap = A * p: accumulate the full row sum before storing ap[i]
        for (int i = 0; i < n; i++)
        {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += a[i, j] * p[j];
            ap[i] = sum;
        }
        double denom = InnerProduct(n, p, ap);
        if (denom < tolerance)
            break;
        alpha = rzold / denom;
        for (int i = 0; i < n; i++)
            x[i] += alpha * p[i];
        for (int i = 0; i < n; i++)
            r[i] -= alpha * ap[i];
        t = Math.Sqrt(InnerProduct(n, r, r));
        if (t < tolerance)
            break;
        for (int i = 0; i < n; i++)
        {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += pinv[i, j] * r[j];
            z[i] = sum;
        }
        rznew = InnerProduct(n, r, z);
        //for (int i = 0; i < n; i++)
        //    p[i] = z[i] + p[i] * rznew / rzold;
        Parallel.For(0, n, u =>
        {
            p[u] = z[u] + p[u] * rznew / rzold;
        });
        rzold = rznew;
    }
    return t;
}
```
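The dominant cost in each CG iteration is the O(n^2) matrix-vector product ap = A*p, and it parallelizes cleanly over rows because each iteration writes a distinct `ap[i]`. A sketch of that variant (our own, not the code that was benchmarked):

```csharp
using System.Threading.Tasks;

// Sketch: parallelize the matrix-vector product over rows.
// Each parallel iteration owns one row i and writes only ap[i],
// so no two iterations touch the same memory location.
private void MatVecParallel(int n, double[,] a, double[] p, double[] ap)
{
    Parallel.For(0, n, i =>
    {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += a[i, j] * p[j];
        ap[i] = sum;
    });
}
```

Each parallel task here does O(n) work, rather than the O(1) per-iteration work of the element-wise loops parallelized earlier, so the loop-scheduling overhead is much easier to amortize.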
Now we present the results of our experiments. First we give the runtimes in seconds for the direct methods on the "Other" matrix type:
| Method/n | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|
| Gaussian Elimination | 0.003 | 0.019 | 0.061 | 0.143 | 0.283 |
| Parallel Gaussian Elimination | 0.003 | 0.020 | 0.071 | 0.155 | 0.354 |
| LU Decomposition | 0.003 | 0.015 | 0.056 | 0.132 | 0.257 |
Finally, the results for the "Tri-Diag 1" matrix type:
| Method/n | 500 | 1000 | 1500 | 2000 | 2500 |
|---|---|---|---|---|---|
| Gaussian Elimination | 0.280 | 2.216 | 7.461 | 17.629 | 34.445 |
| Parallel Gaussian Elimination | 0.300 | 2.417 | 8.077 | 19.101 | 37.380 |
| LU Decomposition | 0.257 | 2.128 | 8.017 | 20.725 | 49.082 |
| Cholesky Decomposition | 0.141 | 1.203 | 4.527 | 13.634 | 31.199 |
| Parallel Cholesky Decomposition | 4.429 | 21.718 | -------- | -------- | -------- |
| Conjugate Gradient | 0.356 | 2.944 | 11.313 | 32.099 | 69.329 |
| Parallel Conjugate Gradient | 0.476 | 2.947 | 11.586 | 32.300 | 69.586 |
In conclusion, for the methods studied, the parallel implementations created, and the test matrices used, the parallel versions are outperformed by the sequential variants in almost every case. Whether the difference is statistically significant is left as an exercise for an aspiring statistician (hint: F-tests first, then Student's t-tests). The parallel Cholesky Decomposition code was ill advised: parallelizing O(n) inner products call by call buys nothing, and the shared accumulator in those loops is a data race besides. We mention in passing that block-diagonal systems of equations, whose blocks can be solved independently, do lend themselves to the multitasking paradigm.
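As an illustration of that last point, the independent blocks of a block-diagonal system can each be handed to a separate task. The sketch below assumes a hypothetical `SolveBlock` helper standing in for any of the sequential solvers in the library:

```csharp
using System.Threading.Tasks;

// Sketch: a block-diagonal system splits into independent subsystems,
// one per diagonal block, each solvable in parallel.
// blocks[k] is the k-th diagonal block, rhs[k] its slice of the
// right-hand side, and sols[k] receives that block's solution.
// SolveBlock is a hypothetical helper, not part of the library's API.
private void SolveBlockDiagonal(double[][,] blocks, double[][] rhs, double[][] sols)
{
    Parallel.For(0, blocks.Length, k =>
    {
        // blocks are independent: each iteration writes only sols[k]
        sols[k] = SolveBlock(blocks[k], rhs[k]);
    });
}
```

Here each task carries an entire O(m^3) block solve, coarse-grained enough for the parallelism to pay off.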