Version 12 (modified by dtodd, 10 years ago)

Gamma Speed Improvements

This section describes initial work done with ATLAS and LAPACK, and some suggestions on how to proceed.

Note: Since this section was first written, LAPACK has introduced a new library (11/15/10) with native C calls, developed in collaboration with Intel, that should simplify the work described below and possibly speed it up. For more information see these links:

Gamma with LAPACK and Atlas Enhancements

Initial work was done to study the benefits of adding LAPACK and ATLAS to the Gamma code, that is, how much speed enhancement would be gained by speeding up the linear algebra package.

Initial Steps

A major change to the linear algebra code would have been outside the scope of the current grant, so a more focused approach was used: specific routines that account for a substantial portion of the computation in various simulations were targeted. I first changed the routine

n_matrix.multiply(...)

by adding code that uses BLAS (cblas / ATLAS) to compute the product of two matrices. The input matrix was copied into a matrix with a layout suitable for passing to this routine:

cblas_zgemm(...)

Testing on two independent computers, both with dual-core "Pentium 6" chips, 2-4 GB of RAM, and clock speeds in the 2.6 to 2.8 GHz range, indicated that this approach gave no benefit until the matrix size reached 256 on a side.
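The copy-then-call strategy described above can be sketched as follows. This is a minimal illustration, not the Gamma source: the `cplx` typedef, the function names, the flat row-major `std::vector` buffers, and the `BLAS_MIN_DIM` cutoff are all assumptions made for the example. Only `cblas_zgemm` and the `_USING_BLAS_` switch come from this page. Without `_USING_BLAS_` defined, the sketch falls back to a plain triple loop so that it compiles on its own.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

#ifdef _USING_BLAS_
extern "C" {
#include <cblas.h>
}
#endif

typedef std::complex<double> cplx;

// Hypothetical cutoff: the timings on this page show no gain below 256 on a side.
const std::size_t BLAS_MIN_DIM = 256;

// True when the matrix is large enough for the copy overhead to pay off.
bool worth_using_blas(std::size_t n) {
    return n >= BLAS_MIN_DIM;
}

// Plain triple loop on row-major n x n buffers; stands in for the native path.
std::vector<cplx> multiply_naive(const std::vector<cplx>& a,
                                 const std::vector<cplx>& b,
                                 std::size_t n) {
    std::vector<cplx> c(n * n, cplx(0.0, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Product of two row-major n x n complex matrices.
// Uses cblas_zgemm only when compiled with _USING_BLAS_ and n is large enough.
std::vector<cplx> multiply(const std::vector<cplx>& a,
                           const std::vector<cplx>& b,
                           std::size_t n) {
#ifdef _USING_BLAS_
    if (worth_using_blas(n)) {
        std::vector<cplx> c(n * n);
        const cplx alpha(1.0, 0.0);  // C = alpha * A * B + beta * C
        const cplx beta(0.0, 0.0);
        const int dim = static_cast<int>(n);
        cblas_zgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    dim, dim, dim,
                    &alpha, a.data(), dim, b.data(), dim,
                    &beta, c.data(), dim);
        return c;
    }
#endif
    return multiply_naive(a, b, n);
}
```

Dispatching on size keeps small matrices on the native path, where the copy into and out of the BLAS buffers would cost more than the faster multiply saves.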

Additional work was done on this routine:

h_matrix.diag(...)

which diagonalizes a Hermitian matrix. This utilized CLAPACK (LAPACK with a C wrapper). It also used the strategy of copying the data to a suitable matrix (in this case row major ordering) and shipping it off to the routine:

zheev_(...);

Testing of this routine also indicated that the minimum matrix size needed to benefit from it was 256x256. This was the main reason we abandoned this approach, as our molecules had 7 or 8 spins (matrix sizes of 128x128 and 256x256) or fewer.
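The matrix sizes quoted above follow from the usual Hilbert-space dimension of a spin system. As a worked check, assuming every nucleus is spin-1/2 (an assumption; this page does not state the spin quantum numbers):

```latex
\dim \mathcal{H} \;=\; \prod_{i=1}^{N} (2 I_i + 1) \;=\; 2^{N}
\quad \text{for } I_i = \tfrac{1}{2},
\qquad 2^{7} = 128, \quad 2^{8} = 256, \quad 2^{12} = 4096 .
```

Under that assumption a 12-spin system gives 4096x4096 matrices, well above the 256x256 break-even size measured here.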

Speed Comparisons

The following speed comparisons were made on a single-function basis, not on a real simulation, but they include all the copying of a Gamma matrix to a BLAS matrix and back again (or from a Gamma matrix to a LAPACK matrix and back again). The tables show ratios of times, that is, multiplicative speed enhancements. "Optimized gamma" means compiling with the -O2 flag on Linux with gcc.

n_matrix->multiply():

n            optimized-gamma / blas-atlas

256                     1.12
512                     3.0
1024                    7.2

Similar behavior was observed with LAPACK and the h_matrix diagonalization routine, although the speed enhancement curve is not as steep.

h_matrix->diag():

n            optimized-gamma / blas-atlas

256                      1.1
516                      1.5
1024                     2.6
2024                     4.8
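Ratios like those in the tables above could be measured with a small timing harness along the following lines. This is a sketch only: `mean_seconds` and `speedup` are hypothetical helper names, and the original measurements may have been taken differently. The callable passed in would wrap the whole operation, including the copy to the BLAS or LAPACK matrix and back.

```cpp
#include <chrono>

// Run `work` `repeats` times and return the mean wall-clock seconds per call.
template <typename F>
double mean_seconds(F work, int repeats) {
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}

// Ratio as tabulated on this page: optimized-gamma time over library time.
// Values above 1 mean the BLAS/LAPACK path is faster.
double speedup(double native_seconds, double library_seconds) {
    return native_seconds / library_seconds;
}
```

Averaging over several repeats helps at the smaller sizes, where a single call may be too short to time reliably.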

Swapping out the Linear Algebra code

It certainly seems plausible to swap out the linear algebra code completely while keeping the matrix interface the same. All the source files that would need to be modified are in gamma/trunk/src/Matrix. Alternatively, even the interface could be changed, but that would cause a bigger code disruption.

Trying This Out

So you're thinking of simulating a 12 spin system with lots of couplings and want it to run faster...

  • Download and Install Atlas
  • Download and Install CLAPACK (note: the instructions were not that helpful! I had to ask for a lot of support to get through this.)
  • Uncomment this line from n_matrix:
    //#define _USING_BLAS_
    
  • Uncomment this line from h_matrix:
    //#define _USING_LAPACK_
    
  • Search the Makefile for lapack and make the suggested changes:
    # Un-comment out these next two lines if using blas, atlas, or lapack,
    # and uncomment one or both of the following two lines
    #BLASINCL_FLAG   = -I/usr/include
    #LAPACKINCL_FLAG = -I/home/user/dev/CLAPACK-3.2.1/INCLUDE
    BLASINCL_FLAG   = 
    LAPACKINCL_FLAG = 
    
  • and
    ### un-comment the next four lines if using blas/atls/lapack, 
    # and comment out next line.
    #LAPACK_LIBS = /home/user/dev/CLAPACK-3.2.1/lapack_LINUX.a 
    #BLAS_LIBS  = /home/user/dev/CLAPACK-3.2.1/libcblaswr.a   /home/user/dev/CLAPACK-3.2.1/F2CLIBS/libf2c.a \
    #              /usr/lib/atlas/libblas.a   /usr/lib/libatlas.a
    #LAB_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS)
    LAB_LIBS = 
    
  • Make any required changes to the paths to CLAPACK, libblas.a, libatlas.a, etc.

What Next

A few thoughts on how to proceed.

  • Start with a detailed profiling of the code to see where the slowdown is for real problems.
  • Convert more Gamma code to use LAPACK and ATLAS, using the above results.
  • Intel is working on a new version of CLAPACK that uses native C-style notation and matrix organization so that may speed things up a bit.
  • Consider using the MAGMA project to speed up linear algebra using GPUs and multicore PCs: http://icl.cs.utk.edu/magma/index.html
  • There is also the parallel processing linear algebra project: http://icl.cs.utk.edu/plasma/index.html
  • Or rework the whole Matrix library so that using LAPACK and BLAS helps at smaller matrix sizes.