ambitious Collatz genetic algorithm code/ graph results!

hi all, this is a writeup of a more ambitious attack on the collatz problem, synthesizing several recent ideas with a GA approach.

have long wanted to do genetic algorithm (GA) type attacks on number theoretic problems. have tried out a few rough ideas over the years incl some on Collatz and have achieved some results but nothing spectacular.

this idea came up just yesterday and it seemed quite interesting/ somewhat promising and not requiring a huge amount of programming overhead. one of the problems with very sophisticated GAs is they can require a lot of programming and tuning. some GA approaches even create entire complex structures such as circuits or programs, eg expressed as lambda calculus expressions, but those also tend to require a lot of tuning. (brilliant case studies on both areas by Koza, in the pacman and patented! circuit generation problems.) this latter approach is sometimes distinguished from GA by calling it Genetic Programming (GP).

here is the idea background. now it is known that some basic metrics of collatz iterates help determine (predict) their future dynamics. the correlations may be low but there is some measurable effect.

the recent breakthrough with density analysis seems to open lots of new doors. one simple idea occurred immediately and is hinted at in the prior writeup. the iterate value plus the density form a 2d coordinate pair which traces out its own 2d random walk. it is natural to wonder about the randomness of this 2d walk versus the 1d walk. it is also natural to create a 1d random walk from a 2d random walk by just converting the 2d coordinate to a 1d distance using the euclidean formula (distance from origin, square root of summed squared coordinates: d = \sqrt{x^2 + y^2}), and there are other “norm” formulas that can be used. then the question: does this new 1d walk derived from the 2d walk have better convergence properties?
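here is a minimal ruby sketch of that combination (helper names are illustrative, not from the GA code below; it uses the shortcut collatz map x → (3x+1)/2):

```ruby
# combine the 2d (log value, density) collatz walk into a 1d walk via the
# euclidean norm d = sqrt(x^2 + y^2); helper names here are illustrative
def density(n)
  s = n.to_s(2)
  s.count('1').to_f / s.length
end

def walk_1d(n)
  l = []
  while n != 1
    x = Math.log(n)    # 1st coordinate: log of the iterate
    y = density(n)     # 2nd coordinate: binary 1-bit density
    l << Math.sqrt(x ** 2 + y ** 2)
    n = n.even? ? n / 2 : (3 * n + 1) / 2
  end
  l
end
```

then the question is whether walk_1d(n) is “more monotonic” than the plain 1d walk of the iterate values.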

“convergence properties” here can also be regarded as “nearly monotonic (decreasing)”. and looking for a monotonic decreasing function of a random walk reminds me very much of nonhalting analysis literature, aka “termination analysis”. a big theme from that literature is to find an “invariant function”. my understanding is that these “invariant functions” are also often monotonically decreasing.

a quick experiment (not written up) shows that the 1d Euclidean distance combination of the 2d iterate value and density does seem to have some better monotonic behavior. but then how does one quantify that? there are two immediate simple ideas that have been used in some prior experiments:

  • count the number of “reversals”, ie the times successive iterates do not decrease when the function is supposed to be monotonically decreasing. equivalently, track the fraction of steps that do decrease, as a percent of total sequence length; for “more monotonic” functions this is closer to 100%. call this x_a.
  • there was an idea studied of “max length of nonmonotonic run”. track the running minimum of the (supposedly decreasing) function, and look for long runs where the function stays above the current running minimum. take the max such run length. for “more monotonic” functions the max nonmonotonic run is shorter. call this x_b.
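standalone sketches of the two metrics on a plain numeric sequence (names x_a/x_b as above; a perfectly monotone decreasing sequence gives x_a = 1.0 and x_b = 0):

```ruby
# x_a: fraction of successive steps that decrease (→ 1.0 for monotone decreasing)
def x_a(l)
  c = (1...l.size).count { |i| l[i] < l[i - 1] }
  c.to_f / (l.size - 1)
end

# x_b: max length of a run where the sequence stays at or above the running
# minimum (→ 0 for monotone decreasing)
def x_b(l)
  m = l[0]
  j = x = 0
  (1...l.size).each do |i|
    m, j = l[i], i if l[i] < m
    x = [x, i - j].max
  end
  x
end
```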

but all these ideas lead to some near-obvious generalizations. clearly there is a general way to generate 1d random walks from n-d random walks using the euclidean norm or other norms. but one can also use any metrics that might be relevant in attempting to come up with this new “better-behaved/ converging”, “more monotonic” random walk. and it is documented via the density experiments etc. that a lot of basic metrics of iterate values have subtle influence on the (“future”) collatz random walk.

but then one immediately wonders if some particular “mix” of these different metrics would be optimal. but one has no idea ahead of time whether any are even applicable. aha! sounds like a job for a genetic algorithm!

a Collatz GA

a more complex genetic algorithm would be a GP that attempts to find an invariant function using complex programming combinations. that has been in the back of my mind for a while but the implementation and tuning challenges are high there with not much apparent promise (a solution there would be akin to, in the english vernacular, “getting lucky”).

but just yesterday another idea came to me. there are fairly simple functions that can have very complex behavior. this is the same reality staring one in the face with Collatz, only from a different angle. one can come up with invariant functions, based on some simple metrics and without the full-blown GP complexity, which themselves still have very complex behavior.

example: consider a bunch of not-hard-to-compute but highly nonlinear metrics. the density measurement is a good example, as are other recent metrics eg average run lengths, standard deviations, etc.

what's a simple combination of them? how about merely linear? that is also not hard to compute. and a GA would be a very natural algorithm to optimize the linear weights!

so here's the basic idea. set up a GA with some of these basic metrics and use it to find a linear combination of the metrics (ie filling in for complex full-blown GP complexity with some more limited GP-like “programming” aspects) that optimizes the “more monotonic” nature of the (limited) invariant function. this is hopefully a reasonable compromise among all the tradeoffs of search space, programming overhead, solution complexity, etc.

the result is the following. there are 14 basic “potential/ plausible-looking” metrics, chosen partly based on prior experiments (the run-length averages and standard deviations each count separately):

  • \log_e(x_i), x_i the iterate value
  • \log_e(x_i)^2
  • \log_e(x_i)^{1/2}
  • density (fraction of binary 1 bits)
  • bit length
  • # of binary 1 bits
  • # of binary 0 bits
  • # 1 bits mod 2
  • # 1 bits mod 3
  • density distance (“from center/core”) |d - \frac{1}{2}|
  • avg 1-bit run length
  • std dev of 1-bit run lengths
  • avg 0-bit run length
  • std dev of 0-bit run lengths

then the genetic algorithm computes a linear combination of them; that is the combined metric for an individual iterate value. then the GA looks for “more monotonic” random walks of the combined metric by adjusting the linear weights.

the GA is fairly simple and has 4 simple genetic operations labeled 0-3 or ‘a’-‘d’. the 1st 3 are mutation operators and the 4th is a crossover operator.

  • set a random coordinate to 0.
  • set a random coordinate to a random number in [0, 1.5) with random sign (±1).
  • multiply a random coordinate by a random number in [0, 1.5) with random sign.
  • crossover: choose each coordinate randomly from one of two parents.

one could try all kinds of ways of improving convergence based on a mix of the above, or have an adaptive algorithm that tries to figure out a good mix (have written code like that in the past), but again that might be a greedy algorithm that optimizes toward local minima anyway. the current “quick-and-teeny-dirty” algorithm simply chooses among the 4 operations uniformly at random. one might also be concerned about the apparently small magnitude of the constants, but note the algorithm can “quickly” repeatedly multiply a coordinate by nearly 1.5, which results in an exponential increase in that coordinate.

the algorithm starts by initializing 50 seeds with random coordinates. half the seeds have entirely positive coordinates, and the other half have mixed signs.

so what's the result of all that? as one might have guessed from this already-longer-than-usual writeup, there is some positive result here.

this code is called with a parameter ‘a’ or ‘b’ which chooses the x_a or x_b evaluation metric. the x_b metric evaluates the function 3 separate times and uses the average because the “longest nonmonotonic run” metric is inherently more jagged/ noisy than the “percent sequentially decreasing” metric x_a.

the code has some other cool features. it can be called without a parameter to run a trend analysis on the last best weights found, even while another GA instance is running concurrently. it writes two files. one is the status log, which gives the new fitness value along with the current best and worst fitness in the population, and is updated whenever the new fitness is a global best or worst so far.

the weight log outputs the last best weights, the iteration # of the solution, and a string which records the genetic history/ lineage/ “genealogy” of the best solution. there are only 200 genetic strings in the population at a time. that seems sufficient; did not experiment to see whether more strings lead to better solutions, but suspect it would not alter results or convergence much.

the code has to run as a maximizer for x_a and a minimizer for x_b, and just uses a near-hack of applying a negative sign to the fitness function in the 2nd case.


now, looking at analysis. the algorithm “relatively rapidly” finds dramatically improved solutions (from admittedly random, presumably “poor quality” starting conditions) in 10-15 minutes, but then finds small improvements only very intermittently.

the algorithm succeeds in improving each metric significantly, shown in these plots of the metric change. the plots show worst, last best, and new best fitness on the y axis and the iteration # on the x axis. the fitness function is evaluated for random iterate seeds near 2^n, with n \in \{40...59\} randomly selected; the analysis/ trend plots look over the longer range n \in \{2...500\}. the x_b run finds much less of a gradient to exploit, as would be somewhat expected.

def stat(l)
        # mean and standard deviation of a list of numbers
        return 0.0, 0.0 if (l.empty?)
        t = t2 = 0
        l.each \
        {
                |x|
                t += x
                t2 += x ** 2
        }
        c = l.size
        a = t.to_f / c
        z = t2.to_f / c - a ** 2
        return a, Math.sqrt(z < 0 ? 0 : z)
end

def d(s)
        # binary string metrics: density, bit length, # 1s, # 0s, # 1s mod 2, # 1s mod 3
        c = s.split('').select { |x| x == '1' }.size
        return c.to_f / s.length, s.length, c, s.size - c, c % 2, c % 3
end

def rank(n, l)
        # the combined metric: linear combination (weights l) of the 14 basic metrics of iterate n
        s = n.to_s(2)
        m1 = Math.log(n)
        m2 = m1 ** 2
        m3 = m1 ** 0.5
        d1, w, c1, c0, c2, c3 = d(s)
        d2 = (0.5 - d1).abs
        a1, sd1 = stat(s.split('0').map { |x| x.length })
        a0, sd0 = stat(s.split('1').map { |x| x.length })
        l2 = [m1, m2, m3, d1, w, c1, c0, c2, c3, d2, a1, sd1, a0, sd0]
        t = 0.0
        l.each_with_index { |w2, i| t += w2 * l2[i] }
        return t
end

def f(n, w)
        # trajectory of the combined metric under the (shortcut) collatz map; returns
        # x_a (fraction of decreasing steps), x_b (max nonmonotonic run), trajectory length
        l = []
        t = 0
        while (n != 1)
                l << rank(n, w)
                n = (n % 2 == 0) ? n / 2 : (n * 3 + 1) / 2
                t += 1
        end
        c = 0
        m = l[0]
        x = j = 0
        (1...l.size).each \
        {
                |i|
                c += 1 if (l[i] < l[i - 1])
                m, j = [l[i], i] if (l[i] < m)
                x = [x, i - j].max
        }
        return c.to_f / (l.size - 1), x, t
end

def f3(n)
        return sprintf("%.3f", n).to_f
end

def eval1(p, z)
        # fitness of weight vector z on one random seed of ~p bits;
        # note the sign flip so mode 'b' is also a maximization
        n = 2 ** p + rand(2 ** (p - 1))
        x, t = f(n, z)
        return x if ($mode == 'a')
        return -t if ($mode == 'b')
end

def p()
        # random seed bit size in 40..59 (shadows Kernel#p)
        return 40 + rand(20)
end

def eval(z)
        return eval1(p(), z) if ($mode == 'a')
        # x_b is noisier, so average 3 evaluations
        return (eval1(p(), z) + eval1(p(), z) + eval1(p(), z)) / 3.0 if ($mode == 'b')
end

def trend()
        # trend analysis using the nth-from-last best weight vector in the weight log
        n = ARGV[0].nil? ? 1 : ARGV[0].to_i
        w =$fn).readlines[-n].split[0...$v].map { |x| x.to_f }
        (2..500).each \
        {
                |p|
                n = 2 ** p + rand(2 ** (p - 1))
                l = f(n, w)
                puts(l.join("\t"))
        }
end

def rsign()
        # random sign; note 0 is truthy in ruby, so "rand(2) ? -1 : 1" would always give -1
        return rand(2) == 0 ? -1 : 1
end

$mode = ARGV[0]
$v = 14
$fn = 'var.txt'
if (!['a', 'b'].member?($mode))
        # no recognized mode: run trend analysis on the current best weights instead
        trend()
        exit
end
l = []
50.times \
{
        |i|
        # even-indexed seeds all-positive, odd-indexed seeds mixed-sign
        w = (1..$v).map { rand() * 1.5 * ((i % 2 == 0) ? 1 : rsign()) }
        l << [eval(w), w, "[#{i}]"]
}
f ='out.txt', 'w')
f2 =$fn, 'w')
f.sync = f2.sync = true        # unbuffered, so a concurrent trend() run sees updates
mhi = nil
mlo = nil
j = 0
loop \
{
        l = l.sort_by { |x| x[0] }.reverse
        a = l[rand(l.size)]
        b = l[rand(l.size)]
        x = a[1]
        y = b[1]
        r = rand($v)
        c = rand(4)
        h = nil
        case c
                when 0          # zero a random coordinate
                        z = x.dup
                        z[r] = 0.0
                when 1          # set a random coordinate: magnitude < 1.5, random sign
                        z = x.dup
                        z[r] = rand() * 1.5 * rsign()
                when 2          # scale a random coordinate: magnitude < 1.5, random sign
                        z = x.dup
                        z[r] *= rand() * 1.5 * rsign()
                when 3          # crossover: each coordinate from one of two parents
                        z = []
                        x.size.times { |i| z[i] = (rand(2) == 0) ? x[i] : y[i] }
                        h = "(#{a[2]}+#{b[2]})"
        end
        op = 'a'
        c.times { op.succ! }
        h = "#{a[2]}#{op}#{r}" if (h.nil?)
        l.pop if (l.size > 200)
        t = eval(z)
        hi = l[0][0]
        lo = l[-1][0]
        fhi = flo = false
        mhi, fhi = [t, true] if (mhi.nil? || t > mhi)
        mlo, flo = [t, true] if (mlo.nil? || t < mlo)
        f.puts([f3(t).abs, f3(hi).abs, f3(lo).abs, h.length, j,].join("\t")) if (fhi || flo)
        f2.puts((z + [j, f3(t), h]).join("\t")) if (fhi)
        j += 1
        l << [t, z, h]
}
status log for the x_a run (columns: new fitness, population best, population worst, genealogy string length, iteration #, timestamp):

0.457   0.662   0.314   6       0       Thu Jan 01 10:45:25 MST 2015
0.514   0.662   0.314   11      1       Thu Jan 01 10:45:25 MST 2015
0.516   0.662   0.314   11      4       Thu Jan 01 10:45:26 MST 2015
0.525   0.662   0.314   11      5       Thu Jan 01 10:45:26 MST 2015
0.569   0.662   0.314   5       6       Thu Jan 01 10:45:26 MST 2015
0.681   0.662   0.314   7       27      Thu Jan 01 10:45:27 MST 2015
0.689   0.681   0.314   11      47      Thu Jan 01 10:45:28 MST 2015
0.713   0.689   0.488   30      298     Thu Jan 01 10:46:00 MST 2015
0.717   0.713   0.495   32      322     Thu Jan 01 10:46:04 MST 2015
0.735   0.717   0.526   13      530     Thu Jan 01 10:46:29 MST 2015
0.737   0.735   0.527   11      537     Thu Jan 01 10:46:30 MST 2015
0.753   0.737   0.536   21      575     Thu Jan 01 10:46:34 MST 2015
0.769   0.753   0.542   23      613     Thu Jan 01 10:46:38 MST 2015
0.773   0.769   0.633   38      867     Thu Jan 01 10:47:09 MST 2015
0.785   0.773   0.646   77      914     Thu Jan 01 10:47:14 MST 2015
0.788   0.785   0.725   266     1934    Thu Jan 01 10:49:13 MST 2015
0.798   0.788   0.725   229     1948    Thu Jan 01 10:49:15 MST 2015
0.806   0.798   0.729   219     2074    Thu Jan 01 10:49:29 MST 2015
0.808   0.806   0.741   74      2532    Thu Jan 01 10:50:25 MST 2015
0.818   0.808   0.746   167     2686    Thu Jan 01 10:50:43 MST 2015
0.848   0.818   0.748   482     2786    Thu Jan 01 10:50:54 MST 2015
0.896   0.848   0.771   1269    3915    Thu Jan 01 10:53:18 MST 2015
0.903   0.896   0.821   245     13775   Thu Jan 01 11:23:42 MST 2015
0.912   0.903   0.824   1282    14968   Thu Jan 01 11:29:12 MST 2015
0.932   0.912   0.844   6306    37835   Thu Jan 01 12:43:07 MST 2015
status log for the x_b run (same columns; fitness shown as |x_b| 3-run averages):

248.667 16.333  302.667 6       0       Thu Jan 01 10:45:48 MST 2015
160.0   16.333  302.667 6       1       Thu Jan 01 10:45:48 MST 2015
60.667  16.333  302.667 11      2       Thu Jan 01 10:45:49 MST 2015
52.667  16.333  302.667 6       6       Thu Jan 01 10:45:50 MST 2015
33.333  16.333  302.667 6       11      Thu Jan 01 10:45:52 MST 2015
31.0    16.333  302.667 6       15      Thu Jan 01 10:45:53 MST 2015
22.0    16.333  302.667 10      21      Thu Jan 01 10:45:55 MST 2015
21.0    16.333  302.667 10      32      Thu Jan 01 10:45:59 MST 2015
16.0    16.333  302.667 7       36      Thu Jan 01 10:46:00 MST 2015
15.333  16.0    237.0   21      200     Thu Jan 01 10:46:59 MST 2015
13.667  15.333  46.0    13      398     Thu Jan 01 10:48:06 MST 2015
12.333  13.667  35.333  55      673     Thu Jan 01 10:49:42 MST 2015
11.0    12.333  26.333  26      1940    Thu Jan 01 10:58:25 MST 2015
9.667   11.0    18.333  73      11190   Thu Jan 01 12:34:27 MST 2015
9.0     9.667   17.667  24      15623   Thu Jan 01 12:59:16 MST 2015



one would hope that optimizing for x_a would also lead to good solutions for x_b, but GAs often find decoupled solutions, and which metric is optimized for is critical. these results are in line with that, ie the two are not necessarily closely correlated.

the next graphs show random starting collatz trajectories with the resulting trajectory lengths (green) and the max nonmonotonic run length (red). 1st, the trajectory lengths scale roughly as 5n where n is the # of bits in the starting seed. other than a few big spikes, the max nonmonotonic run length seems to increase only very gradually in comparison in the 1st plot… and in the 2nd plot one can almost convince oneself that, after the earlier points on the left, it is eventually bounded by a constant… ❗ just to emphasize, the two lines are compared directly in the 3rd. in any case it shows the algorithm is working roughly as expected: the x_b optimization succeeds in directly “pushing down” that metric, whereas optimizing x_a affects it only indirectly and less effectively.



an immediate problem with the above analysis however (other than, or maybe also in line with, the intermittent big spike problem) is the “needle in a haystack” problem. rerunning the analysis code leads to spikes at different locations each time the code is run, and from other experiments it is natural to guess that some very large black-swan-like spikes in the two metrics x_a and x_b may be lurking. it looks like a better approach may be to evaluate the GA fitness function over a large set of “hard seeds”. but then finding those involves a search using the current best GA fitness function!

this leads to a rather natural, remarkable idea. years ago a CS (undergraduate) student at stanford told me a story of how his CS class was working on tetris, and their job was to create an “annoyer” algorithm that would send down ill-fitting blocks based on some system. this also reminds me of the 2-team system/ dichotomy described by feynman with the space shuttle teams: one is the coding team and the other is the QA “challenger” team that attempts to attack/ break the code. at NASA these two teams were both sizeable, in contrast to the typical industry pattern of a much smaller QA team with fewer resources and less technical capability.

here it looks like the code should be split into two parts: the hard-instance finder for the top current solution, and a 2nd conceptual half that attempts to optimize the weights based on the hard instances generated.

wild & crazy idea

now, after wading thru all that, here's a more interesting/ meaningful punchline, ie a hazy dreamlike long-shot but not-completely-inconceivable-or-implausible sketch of a computer-assisted proof. as hinted in some earlier posts, a neat outcome of earlier FSM transducer work/ experiments is being able to compute Turing-computable functions on any input distributions that can be modelled with an FSM (finite state machine), using a constant number of compositions, f(...f(x)).

then it is natural to wonder if all 14 metrics can be computed analytically over an FSM; that theory seems rather tricky, but possibly feasible!

the end idea here is that if the combined metric has a constant-bounded max nonmonotonic run length over all iterates, one could evaluate the combined metric using a TM, and end up with a computable finite calculation over all iterates modelled as FSMs. and then there could be some shot at proving x_{i+1} < x_i…?

there is some very interesting yet obscure FSM theory of computing statistics on accepted word lengths which uses generator matrices which looks superficially/ vaguely applicable, and even some more advanced theory on asymptotic analysis of their bounds….

