how to minimize a function with discrete variable values in scipy

nagylzs picture nagylzs · Jul 25, 2013 · Viewed 8k times · Source

I'm trying to optimize a target function that has multiple input variables (between 24 and 30). These variables are samples of three different statistical variables, and target function values are t-test probability values. An error function represents the error (sum of squares of differences) between the desired and the actual t-test probabilities. I can only accept solutions where the error is less than 1e-8, for all of the three t-tests.

I was using scipy.optimize.fmin and it worked great. There are many solutions where the target function became zero.

The problem is that I need to find a solution where the variables are between 0 and 10.0, and are whole numbers or don't have more than one digit fractional part. Examples of valid values are 0 10 3 5.5 6.8. Examples of invalid values: -3 2.23 30or 0.16666667.

I happen to know that there is at least one solution, because the target values are coming from actual measured data. The original data was lost, and my task is to find them. But I don't know how. Using trial/error is not an option, because there are about 100 possible values for each variable, and given the number of variables, the number of possible cases would be 100**30 which is too much. Using fmin is great, however it does not work with discreet values.

Is there a way to solve this? It is not a problem if I need to run a program for many hours to find a solution. But I need to find solutions for about 10 target values within a few days, and I'm out of new ideas.

Here is an example MWE:

import math
import numpy
import scipy.optimize
import scipy.stats
import sys

def log(s):
    sys.stdout.write(str(s))
    sys.stdout.flush()

# List of target T values: TAB, TCA, TCB
TARGETS = numpy.array([
    [0.05456834,   0.01510358,    0.15223353   ],  # task 1 to solve
    [0.15891875,   0.0083665,     0.00040262   ],  # task 2 to solve
])
MAX_ERR = 1e-10 # Maximum error in T values
NMIN,NMAX = 8,10 # Number of samples for T probes. Inclusive.

def fsq(x, t, n):
    """Returns the differences between the target and the actual values."""
    a,b,c = x[0:n],x[n:2*n],x[2*n:3*n]
    results = numpy.array([
        scipy.stats.ttest_rel(a,b)[1], # ab
        scipy.stats.ttest_rel(c,a)[1], # ca
        scipy.stats.ttest_rel(c,b)[1]  # cb
    ])
    # Sum of squares of diffs
    return (results - t)

def f(x, t, n):
    """This is the target function that needs to be minimized."""
    return (fsq(x,t,n)**2).sum()

def main():
    for tidx,t in enumerate(TARGETS):
        print "============================================="
        print "Target %d/%d"%(tidx+1,len(TARGETS))
        for n in range(NMIN,NMAX+1):
            log(" => n=%s "%n)
            successful = False
            tries = 0
            factor = 0.1
            while not successful:
                x0 = numpy.random.random(3*n) * factor
                x = scipy.optimize.fmin(f,x0, [t,n], xtol=MAX_ERR, ftol=MAX_ERR )
                diffs = fsq(x,t,n)
                successful = (numpy.abs(diffs)<MAX_ERR).all()
                if successful:
                    log(" OK, error=[%s,%s,%s]\n"%(diffs[0],diffs[1],diffs[2]))
                    print " SOLUTION FOUND "
                    print x
                else:
                    tries += 1
                    log(" FAILED, tries=%d\n"%tries)
                    print diffs
                    factor += 0.1
                    if tries>5:
                        print "!!!!!!!!!!!! GIVING UP !!!!!!!!!!!"
                        break
if __name__ == "__main__":
    main()

Answer

Jeremy West picture Jeremy West · Sep 13, 2013

What you are attempting to do (if I understood your setup) is called integer programming and it is NP-hard; http://en.wikipedia.org/wiki/Integer_programming. I realize that you aren't looking for integer solutions, but if you multiply all your inputs by 10 and divide your target function by 100, you obtain an equivalent problem where the inputs are all integers. The point is, your inputs are discrete.

The target function you are working with is a convex, quadratic function, and there are good constrained optimization algorithms that will solve it quickly for real-valued inputs in the interval [0, 10]. From this you can try rounding or checking all acceptable points nearby, but there are 2^n of them, where n is the number of inputs. Even if you do this, the optimal solution is not guaranteed to be one of these points.

There are approximation algorithms for integer programming problems and you may find that sometimes one of them works well enough to get you to an optimal point. There are a list of things you might try in the Wikipedia article I cited, but I don't know that you will be happy trying to solve this problem.