wiki:SwiggingHlsvd

Version 21 (modified by flip, 10 years ago) (diff)

--

Making the hlsvd Library Accessible to Python

Introduction

hlsvd is a library written in FORTRAN. We've run it through f2c, a FORTRAN to C convertor. Our version of the library is therefore written in autogenerated C. It exposes just one function called hlsvd().

This is an explanation of what I learned using SWIG and ctypes to wrap the hlsvd library in order to make its function accessible to Python.

Comments on the hlsvd Code

I'm impressed by f2c; it's a very cool tool. Nevertheless, there's some imperfections I'd like to address. They're not specific to SWIG and ctypes; they're evident even when calling hlsvd from straight C.

First, the f2c-generated code gives warnings when it compiles. They're mostly suggestions to parenthesize expressions so as not to rely on C operator precedence for statement correctness. I think they're only warning that the code is hard for humans to parse; I don't think the code is broken. However, I'm loath to ignore any compiler warnings; I want clean compiles. Hopefully we are on the same wavelength on this subject and I don't have to explain why I think clean compiles are extremely important, particularly in code that we'll share with others.

Cleaning up the code to eliminate the warnings is easy enough, but modifying auto-generated code comes with the caveat that if it is ever auto-generated again, bye-bye modifications. I think that's a low risk in this case since (a) the FORTRAN hlsvd code is by now probably quite stable and so probably won't need to be f2c-ed again, and (b) any modifications we'd apply are pretty lightweight and would be pretty easy to reapply if necessary.

A second problem with the code generated by f2c is the function interface. All of the parameters are pass-by-reference. This is technically correct because all FORTRAN params are pass-by-reference. C, however, is pass-by-value so f2c has to make every parameter to hlsvd() a pointer in order to mimic pass-by-reference semantics. It's unnecessary since many of the params to hlsvd are input-only, and it makes calling the function (from C, ctypes, or SWIG) more complicated than it needs to be.

I'd like to modify the f2c-generated hlsvd.c to simplify the call signature by making all of the input-only parameters regular C pass-by-value parameters. It might be easier to add a wrapper function that achieves the same result. A wrapper would live in its own file which would have the advantage of not getting overwritten if this code is ever regenerated from the FORTRAN source.

A third problem is that the generated hlsvd library exports about a zillion functions when it really only needs to export one. That's a little sloppy. It can be addressed by adding a "static" modifier to each function in the source code or by adding some linker commands. We can also just ignore it.

Last but not least, f2c creates several C typedefs to mimic the FORTRAN variable types integer, real and double precision. In the code we have, these are mapped to the C types long, float and double. (By supplying options to f2c when it is run, one can modify these typedefs.) There's nothing wrong with f2c's choices and it would be unfair to present these typedefs as a flaw in f2c. Nevertheless, they do present us with a problem.

The problem is that regardless of whether we use SWIG or ctypes or our own handwritten interface, sooner or later something will have to convert those C variables to Python objects by making calls to the Python API. Python's API provides functions for converting standard C types (int, long, float, char *, etc.) to Python types. (For instance, the Python C library function PyInt_FromLong() accepts a C long as a parameter and returns a new Python integer object.) However, it does not (cannot) know anything about custom typedefs. At some level, a piece of code is going to have to make an assumption about the true type (i.e. the standard C type) to which the typedef maps.

In our case, that's simple since there's only three typedefs we have to worry about and they're very simple. Nevertheless, if this code is f2c-ed again and for whatever reason the typedefs change (for instance, if f2c decides to start typedef-fing FORTRAN reals as C doubles instead of C floats), then our assumption will break and so will our code. Therefore, we have a dependency on how f2c generates code.

If we never run f2c to regenerate this code, then this is a non-issue. If we do run f2c again, we must ensure that the typedefs of integer, real and doublereal still map to C's long, float and double.

Calling hlsvd from Python

Before I get into specifics about SWIG and ctypes, I want to describe what an ideal Python interface to hlsvd() would look like. Remember that this function takes inputs of various types and returns a number of values also of various types. Returning multiple values from a function is one the of things Python makes easy, so we should take advantage of that.

The C interface looks something like this:

int hlsvd(float *,        //  1 - yr (input)
           float *,       //  2 - yi (input)
           long *,        //  3 - ndata (input)
           long *,        //  4 - ndp (input)
           float *,       //  5 - tstep (input)
           long *,        //  6 - m (input)
           long *,        //  7 - nssv (input)
           double *,      //  8 - rootr (output)
           double *,      //  9 - rooti (output) 
           float *,       // 10 - fre (output)
           float *,       // 11 - dam (output)
           float *,       // 12 - amp (output)
           float *,       // 13 - phase (output)
           double *,      // 14 - sv (output)
           double *,      // 15 - snoise (output)
           double *,      // 16 - rms (output)
           long *,        // 17 - ndiv (output)
           long *,        // 18 - nit (input)
           long *);       // 19 - nitf (output)

This comment about the parameters is stolen from the f2c-ed code:

/*   on entry: */
/*     yr    - data array (real part of time-domain signal) */
/*     yi    - data array (imaginary part of time-domain signal) */
/*     ndata - total number of data-points of experimental signal */
/*     ndp   - number of data-points being used in the hlsvd analysis */
/*     tstep - stepsize (milli-seconds) */
/*     m     - size parameter of hankel data matrix */
/*     nssv  - start value for number of signal-related singular values */
/*     nit   - number of iterations entered */

/*   on exit: */
/*     rootr - signal-root array (real part) */
/*     rooti - signal-root array (imaginary part) */
/*     fre   - frequency array (khz) */
/*     dam   - damping-factor array (milli-seconds) */
/*     amp   - amplitude array (arbitrary units) */
/*     phase - phase array (degrees) */
/*     sv    - singular value array */
/*     snois - estimation of signal noise level (standard deviation) */
/*     rms   - root mean square of hlsvd fit */
/*     nitf  - final number of iterations */
/*     ndiv  - number of singular values found */

The code to call an ideal Python version of hlsvd() might look something like the example below. (For clarity's sake I'm sticking with the same variable names as in the comments above; I generally prefer to avoid abbreviations.)

# - y is an iterable type (e.g. a list) of complex numbers. Complex
# numbers are a native type in Python.
# - ndata is no longer necessary as it can be inferred from len(y)
return_tuple = hlsvd(y, ndp, tstep, m, nssv, nit)

# Extract values from the return_tuple
# - root is a list of complex numbers
# - fre, dam, amp, phase and sv are lists of floats
# - snoise, rms, nitf and ndiv are floats or ints as appropriate
root, fre, dam, amp, phase, sv, snoise, rms, nitf, ndiv = return_tuple

This code has several advantages over the C version. For starters, the input and output parameters are clearly separated which makes the code easier to read. Second, since complex numbers are a native type in Python, there's no need to represent them as two correlated lists of floats. That makes the input parameter y and the output parameter root simpler to read and manipulate. It also eliminates the possibility of a mismatched list-size bug (i.e. if a different number of real and imaginary values are supplied). Yet another advantage is that we can eliminate one parameter (ndata) that can be inferred from len(y).

Another bug that can't occur here is a memory overwrite due to undersized output parameters. In C one must preallocate the output parameters (like fre, amp, etc.) using malloc() or by declaring them on the stack. If they're too small, hlsvd() might crash (if you're lucky) or silently alter the values of neighboring arrays (if you're not). That can't happen in a Python interface like the one above.

Last but not least, in the C interface some of the output parameters are floats and some are doubles. Python's float type covers both C float and C double, so we don't have to remember which is which.

Our Python call is just two lines of code, and could be one:

root, fre, dam, amp, phase, sv, snoise, rms, nitf, ndiv = \
    hlsvd(y, ndp, tstep, m, nssv, nit)

With this Pythonic interface in mind, let's look at how the interfaces generated by SWIG and ctypes stack up.

Wrapping hlsvd with SWIG

Wrapping anything with SWIG requires describing it to SWIG via an interface (.i) file. In an ideal world, SWIG can infer what it needs to know from a .h file and the interface file just refers SWIG to the .h file. In the case of hlsvd, there's no .h file so I have to describe the interface to SWIG explicitly.

Complications arise with hlsvd() because it takes arrays (rather than simple types) as input parameters. Further complications result from the fact that it expects to be passed pre-allocated arrays to which it will write output. This is standard stuff with C but takes some work to express in a pointer-less language like Python.

The resulting code is something only a mother could love. I should point out that I think SWIG is near miraculous in what it can do, and I don't intend any of my snide comments as criticism of SWIG or its developers. It's just that autotranslated code has all the charm of autotranslated writing. If you've ever used Babelfish, you know what I'm talking about. Meaning is (usually) preserved, albeit crudely, but all the idioms are lost and it's eminently clear that the text (or in this case, the interface) was not created by a native speaker.

So we get ugly code like this (where inputs is a Python list I populated from a file):

yr = hlsvd_wrapper.new_float_array(2048)
for i in range(0, 2048):
    hlsvd_wrapper.float_array_setitem(yr, i, inputs[i])

And this (which I think is only necessary because of hlsvd()'s pass-by-reference awkwardness I mentioned above):

tstep = hlsvd_wrapper.new_tstep()
hlsvd_wrapper.tstep_assign(tstep, .256)

And a series of calls like this one to create the output arrays:

rootr = hlsvd_wrapper.new_double_array(50)

In all, here's what one has to do just to create (not populate) the variables to be used in the call to hlsvd():

yr = hlsvd_wrapper.new_float_array(2048)
yi = hlsvd_wrapper.new_float_array(2048)
ndata = hlsvd_wrapper.new_ndata()
ndp = hlsvd_wrapper.new_ndp()
tstep = hlsvd_wrapper.new_tstep()
m = hlsvd_wrapper.new_m()
nssv = hlsvd_wrapper.new_nssv()
nit = hlsvd_wrapper.new_nit()
rootr = hlsvd_wrapper.new_double_array(50)
rooti = hlsvd_wrapper.new_double_array(50)
fre = hlsvd_wrapper.new_float_array(50)
dam = hlsvd_wrapper.new_float_array(50)
amp = hlsvd_wrapper.new_float_array(50)
phase = hlsvd_wrapper.new_float_array(50)
sv = hlsvd_wrapper.new_double_array(1024)
snoise = hlsvd_wrapper.new_snoise()
rms = hlsvd_wrapper.new_rms()
ndiv = hlsvd_wrapper.new_ndiv()
nitf = hlsvd_wrapper.new_nitf()

The call itself looks like this:

hlsvd_wrapper.hlsvd(yr, yi, ndata, ndp, tstep, m, nssv,
                     rootr, rooti, fre, dam, amp, phase, sv, snoise,
                     rms, ndiv, nit, nitf)

Note that every disadvantage of the C interface remains. There's no distinction between input and output parameters. The complex numbers are still represented as correlated lists of floats. The list/array sizes must be specified explicitly, and an opportunity for memory corruption or a crash exists if one gets it wrong. One must keep track of which variables are floats and which are doubles. And it's many more lines of code than the Python version. (Some of this is exacerbated by hlsvd()'s pass-by-reference interface.)

I should also note that generating the code to even make this interface possible requires the creation of a non-trivial interface file, a SWIG-ing step to auto-generate two cooperative wrappers (one in C and one in Python), and compilation and linkage of that C wrapper. Once the code is constructed, the call chain looks like this:

My code --> SWIG's .py wrapper --> SWIG's .c wrapper --> libhlsvd

It works, for which I bow deeply to the SWIG developers. However, it was a fair amount of work to generate the interface file. (Granted, I was climbing the SWIG learning curve.) In addition, the result is unPythonic and leaves us exposed to a lot of C's unpleasantness.

Wrapping hlsvd With Ctypes

Ctypes is part of the Python standard library. The Python documentation describes it like this:
ctypes is a foreign function library for Python. It provides C compatible data types, and allows to call functions in dlls/shared libraries. It can be used to wrap these libraries in pure Python.

Note the key phrase "pure Python".

One starts with the very unPythonic explicit load of the C library to be called:

lib = ctypes.CDLL("hlsvd.dylib")

This is one place where SWIG is more Pythonic, since I access SWIG's library with a normal Python import statement.

Ctypes already knows about, well, C types, so it's easy to create something that can be passed to a function expecting a C float, for instance:

tstep = ctypes.c_float(.256)

Arrays are pretty simple too:

# Create an array type; this is like a class
InputFloatArrayType = ctypes.c_float * 2048
# Create an instance of that "class"
yr = InputFloatArrayType()

The variable setup for the call to hlsvd() looks like the code below:

InputFloatArrayType = ctypes.c_float * 2048
OutputDoubleArrayType = ctypes.c_double * 50
OutputFloatArrayType = ctypes.c_float * 50

yr = InputFloatArrayType()
yi = InputFloatArrayType()

ndpmax = ctypes.c_long(2048)
ndp = ctypes.c_long(1024)
tstep = ctypes.c_float(.256)
m = ctypes.c_long(380)
nssv = ctypes.c_long(25)
nit = ctypes.c_long(500)

rootr = OutputDoubleArrayType()
rooti = OutputDoubleArrayType()
fre = OutputFloatArrayType()
dam = OutputFloatArrayType()
amp = OutputFloatArrayType()
phase = OutputFloatArrayType()
sv = OutputDoubleArrayType()
snoise = ctypes.c_double()
rms = ctypes.c_double()
ndiv = ctypes.c_long()
nitf = ctypes.c_long()


lib.hlsvd(yr, yi, ctypes.byref(ndpmax), ctypes.byref(ndp),
           ctypes.byref(tstep), ctypes.byref(m), ctypes.byref(nssv),
           ctypes.pointer(rootr), ctypes.pointer(rooti),
           ctypes.pointer(fre), ctypes.pointer(dam),
           ctypes.pointer(amp), ctypes.pointer(phase),
           ctypes.pointer(sv), ctypes.pointer(snoise),
           ctypes.pointer(rms), ctypes.pointer(ndiv),
           ctypes.byref(nit), ctypes.pointer(nitf)
          )

At first glance this might look no better than the SWIG code, but it's actually a bit nicer. For one thing, the array sizes are defined in one place (the first three lines) and not repeated over and over. In addition, it's pretty clear what's going on in a line like this:

ndiv = ctypes.c_long()

whereas SWIG's version is opaque:

ndiv = hlsvd_wrapper.new_ndiv()

The ctypes.byref() call is informative. It says that the C interface demands that I pass by reference, but I don't need access to the pointer here in Python. That tells anyone reading this code that I probably don't care what the called function does with the value, hinting that it is input-only. Again, I wouldn't need this at all if the f2c-generated code was tweaked to eliminate the unnecessary pass-by-reference for the input parameters.

Also note that the code above includes some value assignments, like this line which assigns the value .256 to tstep:

tstep = ctypes.c_float(.256)

In my ideal Python example above I ignored assigning values to the variables for brevity's sake. And in order to make a fair comparison of SWIG's code to the ideal, I didn't include assignment in that code either. But to fairly compare SWIG and ctypes I need to mention assignment here, because it's simpler with ctypes. The code to assign a value to tstep via the SWIG interface looks like this:

tstep = hlsvd_wrapper.new_tstep()
hlsvd_wrapper.tstep_assign(tstep, .256)

Arrays are also considerably uglier in SWIG. First consider the ctypes version (inputs is a list that I read from a disk file):

yi = InputFloatArrayType()
for i, f in enumerate(inputs):
    yi[i] = inputs[i]

And now look at the SWIG version:

yi = hlsvd_wrapper.new_float_array(2048)
for i in range(0, 2048):
    hlsvd_wrapper.float_array_setitem(yi, i, inputs[i])

Last but not least, note that the code above is 95% of what I needed to use ctypes to call hlsvd(). There's no interface file, no SWIG-ing step, and no compile & link steps. Just the Python standard library, that's it.

I have to say, though, that it's only because I saw SWIG first that I'm thrilled with ctypes. It's a nicer tool with which to build an equally ugly interface. Neither SWIG nor ctypes insulates us from the warts and wrinkles of the C interface.

Conclusions

For the purpose of wrapping hlsvd(), both SWIG and ctypes get the job done, although neither result will win a beauty contest. Setting aside criticisms of their ugly and unPythonic interfaces, ctypes is the hands down winner in the ease-of-use category.

Some adjustments to the f2c-generated code and another Python wrapper layer around the code demonstrated here could go a long way to making the interface Pythonic.

I'm less clear on what this experience portends for wrapping a larger C++ library such as GAMMA. The example here was perhaps unfair to SWIG, since SWIG can generate code based on header files but there were no header files with which to work in this case. Ctypes code, on the other hand, must always be handwritten, so in order to wrap 100 functions one must write 100 chunks of ctypes code. SWIG, however, might be able to generate at least some of that automatically.

I also don't know what this implies for use of C++ objects within Python. Ctypes is strictly for wrapping C code (or code that exports a C-style calling interface) and knows nothing about C++ objects. I suspect that even if SWIG can make it possible to cajole C++ objects into Python, the Babelfish analogy would apply again -- the result might be Python in syntax, but C++ in spirit.

References

FORTRAN defines only INTEGER, REAL and "DOUBLE PRECISION" numbers.
The options to f2c control how it deals with FORTRAN data types.