Opened 8 years ago

Closed 8 years ago

#36 closed defect (wontfix)

Double precison overflow causes hang (infinte loop) in hlsvd

Reported by: flip Owned by:
Priority: major Milestone:
Component: svd Version:
Keywords: Cc: bsoher

Description

Given the right combination of inputs, a call to HLSVD (through our Python interface) hangs. The attached sample code & input demonstrate a specific instance of the problem and a way to experiment with other inputs.

# of data points = 1500, # singular values = 20 produces a hang.
# of data points = 1500, # singular values = 10 does not hang.

I tested the f2c-ed version of HLSVD and it behaves the same way.

In the 1500/20 case, the hang occurs in zlascl.f which contains just the subroutine ZLASCL(). One of the params to ZLASCL() is the complex matrix A, and at least one of the values in A is infinity which sends ZLASCL() into an endless loop.

It's not fair to blame this on ZLASCL() since it's probably not meant to handle values like INF and NAN.

The INF originates (is created by) the call to vanmon() in hlsvd.f. vanmon() "calculates the Vandermonde matrix zeta (ndp x k)". It's pretty simple code that's just a big loop, and some print statements show the values march steadily towards the max value expressable by double precision (~1.7e+308), with INF appearing immediately after.

Here's the print statement I used right after the assignment of zeta(i,j) = temp --

PRINT *, REAL(zeta(i,j)), ', ', AIMAG(zeta(i,j))

Many of the subroutines in hlsvd are BLAS/LAPACK functions with updated implementations readily available on the Net. Unfortunately vanmon() is not one of them, so there's no alternate or updated implementation that we can quickly drop in.

Practical Considerations and Workarounds

This presents a nasty problem for us in Analysis.

At present, I can't predict which inputs will result in a hang. We only know a hang has occurred after the fact (i.e when it's too late). It's worth noting that on my Mac, a Python process that's hung up in the infinte loop does not respond to Ctrl+C. I have to kill it with the kill command at the console (or Force Quit in the GUI). For an Analysis user, this could mean lost work.

There are a couple of routes we could follow to avoid this problem.

One attractive option is to prevent the user from entering values that will hang. As I said above, as present I can't predict this. A study of the code by someone who understands the math might produce a formula to predict which values are OK and which are not. But I think the results of vanmon depend not only on # of data points and # of singular values, but also on the input signal (waveform) and there's an infinite number of them. So predicting which inputs will hang is perhaps not so easy.

A less attractive option is to execute HLSVD in a separate process and if the process takes too long (e.g. 10 seconds) then we assume that it is hung, kill it and let the user know that those values probably don't work. This isn't very satisfying, but it's probably better than hanging completely.

Attachments (2)

main.py (1.6 KB) - added by flip 8 years ago.
hlsvd_input.xml (32.4 KB) - added by flip 8 years ago.

Download all attachments as: .zip

Change History (5)

Changed 8 years ago by flip

Changed 8 years ago by flip

comment:1 Changed 8 years ago by flip

The algorithm hints and experimentation makes clear that this problem is more prone to happen for larger number of data points. For instance, I ran HLSVD for the following combinations, and all succeeded.

n_data_points = 512, n_singular_values = 20
n_data_points = 576, n_singular_values = 20
n_data_points = 640, n_singular_values = 20
n_data_points = 704, n_singular_values = 20
n_data_points = 768, n_singular_values = 20
n_data_points = 832, n_singular_values = 20
n_data_points = 896, n_singular_values = 20
n_data_points = 960, n_singular_values = 20
n_data_points = 1024, n_singular_values = 20
n_data_points = 512, n_singular_values = 30
n_data_points = 576, n_singular_values = 30
n_data_points = 640, n_singular_values = 30
n_data_points = 704, n_singular_values = 30
n_data_points = 768, n_singular_values = 30
n_data_points = 832, n_singular_values = 30
n_data_points = 896, n_singular_values = 30
n_data_points = 960, n_singular_values = 30
n_data_points = 1024, n_singular_values = 30
n_data_points = 512, n_singular_values = 40
n_data_points = 576, n_singular_values = 40
n_data_points = 640, n_singular_values = 40
n_data_points = 704, n_singular_values = 40
n_data_points = 768, n_singular_values = 40
n_data_points = 832, n_singular_values = 40
n_data_points = 896, n_singular_values = 40
n_data_points = 960, n_singular_values = 40
n_data_points = 1024, n_singular_values = 40
n_data_points = 512, n_singular_values = 50
n_data_points = 576, n_singular_values = 50
n_data_points = 640, n_singular_values = 50
n_data_points = 704, n_singular_values = 50
n_data_points = 768, n_singular_values = 50
n_data_points = 832, n_singular_values = 50
n_data_points = 896, n_singular_values = 50
n_data_points = 960, n_singular_values = 50
n_data_points = 1024, n_singular_values = 50

But when testing the range of data points > 1024, with # of singular values = 10 I can only run the algorithm up to n_data_points = 1472 where it hangs. With n_singular_values = 5 I get all the way up to 1600. I have to set n_singular_values = 1 to exercise n_data_points = 2048.

comment:2 Changed 8 years ago by flip

Brian suggested that limiting the # of data points to 1024 might be safe
enough. It looks like he's right, although there are significant limitations
in our test data.

Our test data consists of VASF files (press_cpX and press_cpX_wref), some
SI data (which we can't use), some Siemens DICOM files (all of which contain
only 1024 data points, so they can't exercise HLSVD above 1024), two Varian
files (one of which is SI), one VIFF raw file, and two Siemens RDA files,
one of which is SI and the other of which is just 1024 data points.

This boils down to the VASF presscpX files which are essentially the same
for testing the limits of HLSVD, VASF presscpX_wref files (ditto),
one Varian file, one Siemens RDA file (1024 points), one VIFF raw file
and the DICOM files (also 1024 points). That gives me six different sets of
data with which to exercise HLSVD, and four I can run at > 1024 points.

I am reluctant to draw conclusions about HLSCD's robustness from just
4 or 6 exercises. But it's what we've got, so with a big "YMMV" caveat
applied, here's what I observed.

I collected the data to pass to HLSVD by adding a little code to
funct_hlsvd.py right before it calls HSLVD. The code I added writes the
HLSVD params into an XML file. I used those XML files as input to a slightly
modified version of the main.py attached to this ticket.

For each set of test data, I ran HLSVD for 512 <= n_data_points <= 1024
with a step size of 64, and 20 <= n_singular_values <= 50 with a step size
of 10. In pseudocode --

for each set of test data in my set of six:
    for n_singular_values in (20, 30, 40, 50,):
        for n_data_points in range(512, 1025, 64):
            hlsvd()

None of these hung. Some (e.g. n_data_points == 512) didn't return results,
but this was because of the known limitation of the mixed-radix fft that
HLSVD uses.

I ran the test again for n_data_points in range(1024, 2049, 64). The DICOM
and Siemens RDA data was excluded from this test because it doesn't contain
enough data points.

HLSVD hung for each data set. With n_singular_values == 20, here's where
each input hung:

press_cp2                  - overflow at n_data_points = 1472
press_cp7_wref             - overflow at n_data_points = 1536
press20_single_voxel_noise - overflow at n_data_points = 1600
laser_2010102901           - overflow at n_data_points = 1728

So, for our data, all values < 1472 are safe. Even allowing for the greater
challenges that other data sets might present, it seems like restricting
n_data_points to <= 1024 would be "safe enough" or maybe even completely
safe.

comment:3 Changed 8 years ago by flip

  • Resolution set to wontfix
  • Status changed from new to closed

As of June 2012 we have switched to another HLSVD method that doesn't suffer from this problem so I'm closing this as WONTFIX.

Note: See TracTickets for help on using tickets.