hi all, this is a writeup of a more ambitious attack on the collatz problem, building on a synthesis of many recent ideas combined with a GA approach.

have long wanted to do genetic algorithm (GA) type attacks on number theoretic problems. have tried out a few rough ideas over the years incl some on Collatz and have achieved some results but nothing spectacular.

this idea came up just yesterday and it seemed quite interesting/ somewhat promising and not requiring a huge amount of programming overhead. one of the problems with very sophisticated GAs is they can require a lot of programming and tuning. some GA approaches even create entire complex structures such as circuits or programs, eg expressed as lambda calculus expressions but those also tend to require a lot of tuning. (brilliant case study on both those two areas by Koza, in the pacman and *patented!* circuit generation problems.) this latter approach is sometimes distinguished from GA by calling it Genetic Programming (GP).

here is the background for the idea. now it is known that some basic metrics of collatz iterates help determine (predict) their future dynamics. the correlations may be low but there is some measurable effect.

the recent breakthrough with density analysis seems to open lots of new doors. one simple idea occurred immediately and is hinted in the prior writeup. the iterate value plus the density form a coordinate pair which gives its own random 2d walk. it is natural to wonder about the randomness of this 2d random walk versus the 1d walk. it is also natural to create 1d random walks from a 2d random walk by just converting the 2d coordinate value to a 1d distance using the euclidean formula (distance from origin, square root of the sum of squared coordinate values: d = √(x² + y²)). and there are other “norm” formulas that can be used. and then look: does this new 1d walk combined from the 2d walk have better convergence properties?
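
to make the 2d→1d conversion concrete, here is a minimal sketch (the `collatz_step`/`density`/`walk_1d` helper names are mine, using the same compressed (3x+1)/2 step as the code later in this writeup):

```ruby
# sketch: collapse the 2d (log value, bit density) walk into a 1d walk
# via the euclidean norm. helper names are illustrative only.
def collatz_step(n)
  n % 2 == 0 ? n / 2 : (3 * n + 1) / 2
end

def density(n)
  s = n.to_s(2)
  s.count('1').to_f / s.length   # fraction of 1 bits in the binary expansion
end

def walk_1d(n)
  l = []
  while n != 1
    x = Math.log(n)                  # 2d coordinate #1: log of iterate value
    y = density(n)                   # 2d coordinate #2: bit density
    l << Math.sqrt(x ** 2 + y ** 2)  # euclidean distance from origin
    n = collatz_step(n)
  end
  l
end
```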

“convergence properties” here can also be regarded as “nearly monotonic (decreasing)”. and looking for a monotonic decreasing function of a random walk reminds me very much of nonhalting analysis literature, aka “termination analysis”. a big theme from that literature is to find an “invariant function”. my understanding is that these “invariant functions” are also often monotonically decreasing.

a quick experiment (not written up) shows that the 1d Euclidean distance combination of the 2d iterate value and density does seem to have some better monotonic behavior. but then how does one quantify that? there are two immediate simple ideas that have been used in some prior experiments:

- count the number of “reversals”: ie for a function that should be monotonically decreasing, count the successive steps that do decrease, expressed as a percent of the total sequence length. for “more monotonic” functions this is closer to 100%. call this the ‘a’ metric.
- there was an idea studied of “max length of nonmonotonic run”. track the running local minimum of the decreasing function, and look for long runs where the function stays above the current local minimum; take the longest such run. for “more monotonic” functions the max nonmonotonic run is shorter. call this the ‘b’ metric.
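
the two measures can be sketched directly (this mirrors the evaluation logic inside the `f` function of the GA code further below; the `monotonicity` name is illustrative):

```ruby
# sketch of the two monotonicity measures over a sequence l:
# 'a' = fraction of successive steps that decrease (1.0 = fully monotonic)
# 'b' = longest run spent above the current running minimum
def monotonicity(l)
  c = 0       # count of decreasing steps
  m = l[0]    # running minimum
  j = 0       # index of last new minimum
  x = 0       # max nonmonotonic run seen
  (1...l.size).each do |i|
    c += 1 if l[i] < l[i - 1]
    m, j = [l[i], i] if l[i] < m
    x = [x, i - j].max
  end
  [c.to_f / (l.size - 1), x]
end
```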

but all these ideas lead to some near-obvious generalizations. clearly we have a general way to generate 1d random walks from *n*-d random walks using the Euclidean distance norm or other norms. but we can also use any metrics that might be relevant in attempting to come up with this new “better-behaved/converging”, “more monotonic” random walk. now it is documented via density experiments etc that there are a lot of basic metrics of iterate values that have subtle influence on the (“future”) collatz random walk.

but then one immediately wonders if some particular “mix” of these different metrics would be optimal. but one has no idea ahead of time whether any are even applicable. aha! sounds like a job for a genetic algorithm!

### a Collatz GA

a more complex genetic algorithm would be a GP that attempts to find an invariant function using complex programming combinations. that has been in the back of my mind for awhile but implementation and tuning challenges are high on that with not much apparent promise (a solution there would be akin to, in the english vernacular, “getting lucky”).

but just yesterday another idea came to me. there are fairly simple functions that can have a very complex behavior. this is the same reality staring one in the face with Collatz only from a different angle. one can come up with invariant functions based on some simple metrics without the full-blown GP complexity which themselves still have very complex behavior.

example: consider a bunch of not-hard-to-compute metrics, but which are highly nonlinear. the density measurement is a good example, and other recent metrics eg related to average run lengths, standard deviations, etc.

whats a simple combination of them? how about merely *linear*? that is also not hard to compute. and a GA would be a very natural algorithm to optimize the linear weights!
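
as a trivial sketch, the combined metric is just a dot product of the GA’s weight vector with a per-iterate metric vector (the `combined` helper name is illustrative):

```ruby
# sketch: the "limited invariant function" is a dot product of the GA
# weight vector with the per-iterate metric vector
def combined(weights, metric_vector)
  weights.zip(metric_vector).map { |w, m| w * m }.inject(:+)
end
```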

so heres the basic idea. set up a GA with some of these basic metrics and use it to find a linear combination of the metrics (ie standing in for full-blown GP complexity with a more limited GP-like “programming” aspect) that optimizes the “more monotonic” nature of the (limited) invariant function. this is hopefully a reasonable compromise on all the tradeoffs of search space, programming overhead, solution complexity, etc.

the result is the following. there are 14 basic “potential/plausible-looking” metrics, some chosen based on prior experiments.

- log_e(x_i), x_i the iterate value
- log_e(x_i)^2
- log_e(x_i)^½
- density
- bit length
- # of binary 1 bits
- # of binary 0 bits
- # 1 bits mod 2
- # 1 bits mod 3
- density distance (“from center/core”, ie |½ − density|)
- avg 1-bit run length / std dev (2 metrics)
- avg 0-bit run length / std dev (2 metrics)
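
a sketch of computing this 14-entry metric vector for a single iterate n (mirroring the `rank`/`d`/`stat` logic in the code further below; the `metrics` name and the empty-list guard in `stat` are my additions):

```ruby
# sketch: the 14-entry metric vector for one iterate n
def stat(l)
  return [0.0, 0.0] if l.empty?   # guard: my addition, not in the GA code
  a = l.inject(0) { |s, x| s + x }.to_f / l.size
  v = l.inject(0.0) { |s, x| s + (x - a) ** 2 } / l.size
  [a, Math.sqrt(v)]               # [mean, std dev]
end

def metrics(n)
  s = n.to_s(2)
  m1 = Math.log(n)
  c1 = s.count('1')
  c0 = s.length - c1
  d1 = c1.to_f / s.length                            # density
  a1, sd1 = stat(s.split('0').map { |r| r.length })  # 1-bit run lengths
  a0, sd0 = stat(s.split('1').map { |r| r.length })  # 0-bit run lengths
  [m1, m1 ** 2, m1 ** 0.5,   # log, log^2, log^(1/2)
   d1,                       # density
   s.length, c1, c0,         # bit length, # 1 bits, # 0 bits
   c1 % 2, c1 % 3,           # 1-bit count mod 2, mod 3
   (0.5 - d1).abs,           # density distance from center
   a1, sd1, a0, sd0]         # run-length avgs / std devs
end
```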

then the genetic algorithm computes a linear combination of them, and that is the individual iterate value combined metric. then have the GA look for “more monotonic” random walks of the combined metric by adjusting the linear weights.

the GA is fairly simple and has 4 simple genetic operations labeled 0-3 or ‘a’-‘d’. the 1st 3 are mutation operators and the 4th is a crossover operator.

- set a random coordinate to 0
- set a random coordinate to a random number between 0 and 1½ with random sign (+1/−1)
- multiply a random current coordinate by a random number between 0 and 1½ with random sign
- crossover: choose each coordinate randomly from one of two parents
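
a sketch of the four operations acting on weight vectors (the `mutate` helper and its signature are illustrative; note `rand(2)` must be compared to 0 explicitly since 0 is truthy in ruby):

```ruby
# sketch of the four genetic operations (ops 0-3 / 'a'-'d') on weight vectors
def rsign
  rand(2) == 0 ? -1 : 1
end

def mutate(x, y, op, r)
  z = x.dup
  case op
  when 0 then z[r] = 0.0                   # zero out one coordinate
  when 1 then z[r] = rand * 1.5 * rsign    # replace with random in (-1.5, 1.5)
  when 2 then z[r] *= rand * 1.5 * rsign   # rescale by a random factor
  when 3                                   # crossover: coords from 2 parents
    z = x.zip(y).map { |a, b| rand(2) == 0 ? a : b }
  end
  z
end
```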

one could try all kinds of ways of improving convergence based on a mix of the above, or have an adaptive algorithm that tries to figure out a good mix (and have written code like that in the past), but again that might be a greedy algorithm that optimizes for local minima anyway. the current “quick-and-teeny-dirty” algorithm simply chooses among the 4 operations randomly. one might also be concerned about the apparently small magnitude of the constants. but note the algorithm can “quickly” repeatedly multiply a coordinate by up to 1½, which results in an exponential increase in that coordinate.

the algorithm starts by initializing 50 seeds with random coordinates. half the seeds have entirely positive coordinates, and the other half have mixed signs.

so whats the result of all that? as one might have guessed from this already-longer-than-usual writeup, there is some positive result here.

this code is called with a parameter ‘a’ or ‘b’ which chooses the evaluation metric: ‘a’ for percent sequentially decreasing, ‘b’ for longest nonmonotonic run. the ‘b’ metric evaluates the function 3 separate times and uses the average because the “longest nonmonotonic run” metric is inherently more jagged/ noisy than the “percent sequentially decreasing” metric.

the code has some other cool features. it can be called without a parameter to run a trend analysis on the last best weights found, even while another GA instance is running concurrently. it uses two files. one is the status log, which gives the best and worst population fitness values and the new fitness value, and is updated whenever a new fitness value sets a record (best or worst seen so far).

the weight log outputs the last best weights, the iteration # of the solution, and a string which records the genetic history/ lineage/ “genealogy” of the best solution. the population holds only 200 genetic strings at a time. that seems sufficient; did not experiment to see whether more strings lead to better solutions, but suspect it would not alter the results or convergence much.

the code has to run as a maximizer for the ‘a’ metric and a minimizer for the ‘b’ metric, and just uses a near-hack of applying a negative sign to the fitness function in the 2^{nd} case.
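
the negation hack in a nutshell (an illustrative helper, not the actual code):

```ruby
# sketch: a single maximizer handles both modes by negating the 'b' metric,
# so that smaller max runs score higher
def fitness(mode, pct_decreasing, max_run)
  mode == 'a' ? pct_decreasing : -max_run
end
```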

### analysis

now, looking at analysis. the algorithms “relatively rapidly” find dramatically improved solutions (from admittedly random, presumably “poor quality” starting conditions) in 10-15 minutes, but then find small improvements only very intermittently.

the algorithms succeed in improving each metric significantly, as shown in these plots of the metric change. each plots the worst, last best, and new best fitness on the y axis and the iteration # on the x axis. the fitness function is evaluated for iterate seeds starting with p bits, where p is randomly selected in the range 40-60 (per the p() function in the code). the analysis plots look over the longer range p = 2-500. the ‘b’ run finds much less of a gradient to exploit, as would be somewhat expected.

```ruby
#!/usr/bin/ruby1.8
# GA over linear weight vectors on 14 collatz iterate metrics.
# mode 'a' maximizes the percent of decreasing steps, mode 'b' minimizes
# the max nonmonotonic run (via negation); any other/no argument runs a
# trend analysis on the last best weights in the weight log.

def stat(l)
  t = t2 = 0
  l.each { |x| t += x; t2 += x ** 2 }
  c = l.size
  a = t.to_f / c
  z = t2.to_f / c - a ** 2
  return a, Math.sqrt(z < 0 ? 0 : z)
end

def d(s)
  c = s.split('').select { |x| x == '1' }.size
  return c.to_f / s.length, s.length, c, s.size - c, c % 2, c % 3
end

# dot product of weight vector l with the 14-entry metric vector of n
def rank(n, l)
  s = n.to_s(2)
  m1 = Math.log(n)
  m2 = m1 ** 2
  m3 = m1 ** 0.5
  d1, w, c1, c0, c2, c3 = d(s)
  d2 = (0.5 - d1).abs
  a1, sd1 = stat(s.split('0').map { |x| x.length })
  a0, sd0 = stat(s.split('1').map { |x| x.length })
  l2 = [m1, m2, m3, d1, w, c1, c0, c2, c3, d2, a1, sd1, a0, sd0]
  t = 0.0
  l.each_with_index { |w2, i| t += w2 * l2[i] }
  return t
end

# walk the (compressed) collatz trajectory of n under weights w; returns
# percent of decreasing steps, max nonmonotonic run, trajectory length
def f(n, w)
  l = []
  t = 0
  while (n != 1)
    l << rank(n, w)
    n = (n % 2 == 0) ? n / 2 : (n * 3 + 1) / 2
    t += 1
  end
  c = 0
  m = l[0]
  x = j = 0
  (1...l.size).each \
  {
    |i|
    c += 1 if (l[i] < l[i - 1])
    m, j = [l[i], i] if (l[i] < m)
    x = [x, i - j].max
  }
  return c.to_f / (l.size - 1), x, t
end

def f3(n)
  return sprintf("%.3f", n).to_f
end

def eval1(p, z)
  n = 2 ** p + rand(2 ** (p - 1))
  pct, run = f(n, z)
  return pct if ($mode == 'a')
  return -run if ($mode == 'b')   # negate so the maximizer minimizes the run
end

def p()
  return 40 + rand(20)
end

def eval(z)
  return eval1(p(), z) if ($mode == 'a')
  # the 'b' metric is noisier, so average 3 evaluations
  return (eval1(p(), z) + eval1(p(), z) + eval1(p(), z)) / 3.0 if ($mode == 'b')
end

def trend()
  n = ARGV[0].nil? ? 1 : ARGV[0].to_i
  w = File.open($fn).readlines[-n].split[0...$v].map { |x| x.to_f }
  (2..500).each \
  {
    |p|
    n2 = 2 ** p + rand(2 ** (p - 1))
    l = f(n2, w)
    puts(l.join("\t"))
    $stdout.flush
  }
end

def rsign()
  return (rand(2) == 0) ? -1 : 1   # fix: 0 is truthy in ruby
end

$mode = ARGV[0]
$v = 14
$fn = 'var.txt'
if (!['a', 'b'].member?($mode))
  trend()
  exit
end
l = []
50.times \
{
  |i|
  # half the seeds all-positive, half with mixed signs
  w = (1..$v).map { rand() * 1.5 * ((i % 2 == 0) ? 1 : rsign()) }
  l <<= [eval(w), w, "[#{i}]"]
}
f = File.open('out.txt', 'w')
f2 = File.open($fn, 'w')
mhi = nil
mlo = nil
j = 0
loop \
{
  l = l.sort_by { |x| x[0] }.reverse
  a = l[rand(l.size)]
  b = l[rand(l.size)]
  x = a[1]
  y = b[1]
  r = rand($v)
  c = rand(4)
  h = nil
  case c
  when 0 then z = x.dup; z[r] = 0.0                      # zero a coordinate
  when 1 then z = x.dup; z[r] = rand() * 1.5 * rsign()   # replace a coordinate
  when 2 then z = x.dup; z[r] *= rand() * 1.5 * rsign()  # rescale a coordinate
  when 3                                                 # crossover
    z = []
    x.size.times { |i| z[i] = (rand(2) % 2 == 0) ? x[i] : y[i] }
    h = "(#{a[2]}+#{b[2]})"
  end
  op = 'a'
  c.times { op.succ! }
  h = "#{a[2]}#{op}#{r}" if (h.nil?)
  l.pop if (l.size > 200)   # cap the population at 200 strings
  t = eval(z)
  hi = l[0][0]
  lo = l[-1][0]
  fhi = flo = false
  mhi, fhi = [t, true] if (mhi.nil? || t > mhi)
  mlo, flo = [t, true] if (mlo.nil? || t < mlo)   # fix: was 't > mlo'
  f.puts([f3(t).abs, f3(hi).abs, f3(lo).abs, h.length, j,
          Time.now.to_s].join("\t")) if (fhi || flo)
  f.flush
  f2.puts((z + [j, f3(t), h]).join("\t")) if (fhi)
  f2.flush
  j += 1
  l << [t, z, h]
}
```

```
0.457  0.662  0.314  6     0      Thu Jan 01 10:45:25 MST 2015
0.514  0.662  0.314  11    1      Thu Jan 01 10:45:25 MST 2015
0.516  0.662  0.314  11    4      Thu Jan 01 10:45:26 MST 2015
0.525  0.662  0.314  11    5      Thu Jan 01 10:45:26 MST 2015
0.569  0.662  0.314  5     6      Thu Jan 01 10:45:26 MST 2015
0.681  0.662  0.314  7     27     Thu Jan 01 10:45:27 MST 2015
0.689  0.681  0.314  11    47     Thu Jan 01 10:45:28 MST 2015
0.713  0.689  0.488  30    298    Thu Jan 01 10:46:00 MST 2015
0.717  0.713  0.495  32    322    Thu Jan 01 10:46:04 MST 2015
0.735  0.717  0.526  13    530    Thu Jan 01 10:46:29 MST 2015
0.737  0.735  0.527  11    537    Thu Jan 01 10:46:30 MST 2015
0.753  0.737  0.536  21    575    Thu Jan 01 10:46:34 MST 2015
0.769  0.753  0.542  23    613    Thu Jan 01 10:46:38 MST 2015
0.773  0.769  0.633  38    867    Thu Jan 01 10:47:09 MST 2015
0.785  0.773  0.646  77    914    Thu Jan 01 10:47:14 MST 2015
0.788  0.785  0.725  266   1934   Thu Jan 01 10:49:13 MST 2015
0.798  0.788  0.725  229   1948   Thu Jan 01 10:49:15 MST 2015
0.806  0.798  0.729  219   2074   Thu Jan 01 10:49:29 MST 2015
0.808  0.806  0.741  74    2532   Thu Jan 01 10:50:25 MST 2015
0.818  0.808  0.746  167   2686   Thu Jan 01 10:50:43 MST 2015
0.848  0.818  0.748  482   2786   Thu Jan 01 10:50:54 MST 2015
0.896  0.848  0.771  1269  3915   Thu Jan 01 10:53:18 MST 2015
0.903  0.896  0.821  245   13775  Thu Jan 01 11:23:42 MST 2015
0.912  0.903  0.824  1282  14968  Thu Jan 01 11:29:12 MST 2015
0.932  0.912  0.844  6306  37835  Thu Jan 01 12:43:07 MST 2015
```

```
248.667  16.333  302.667  6   0      Thu Jan 01 10:45:48 MST 2015
160.0    16.333  302.667  6   1      Thu Jan 01 10:45:48 MST 2015
60.667   16.333  302.667  11  2      Thu Jan 01 10:45:49 MST 2015
52.667   16.333  302.667  6   6      Thu Jan 01 10:45:50 MST 2015
33.333   16.333  302.667  6   11     Thu Jan 01 10:45:52 MST 2015
31.0     16.333  302.667  6   15     Thu Jan 01 10:45:53 MST 2015
22.0     16.333  302.667  10  21     Thu Jan 01 10:45:55 MST 2015
21.0     16.333  302.667  10  32     Thu Jan 01 10:45:59 MST 2015
16.0     16.333  302.667  7   36     Thu Jan 01 10:46:00 MST 2015
15.333   16.0    237.0    21  200    Thu Jan 01 10:46:59 MST 2015
13.667   15.333  46.0     13  398    Thu Jan 01 10:48:06 MST 2015
12.333   13.667  35.333   55  673    Thu Jan 01 10:49:42 MST 2015
11.0     12.333  26.333   26  1940   Thu Jan 01 10:58:25 MST 2015
9.667    11.0    18.333   73  11190  Thu Jan 01 12:34:27 MST 2015
9.0      9.667   17.667   24  15623  Thu Jan 01 12:59:16 MST 2015
```

one would hope that if one optimized for the ‘a’ metric then it would also lead to good solutions for the ‘b’ metric, but GAs often find decoupled solutions, and it is critical which metric is optimized for. these results are in line with that, ie the two are not necessarily closely correlated.

the next graphs show random starting collatz trajectories and the resulting trajectory lengths (green) and the max nonmonotonic run length (red). 1^{st}, while the trajectory lengths scale roughly linearly in the # of bits of the starting seed, other than a few big spikes the max nonmonotonic run length seems to be *very gradually* increasing in comparison in the 1^{st} plot… but one can *almost convince oneself* in the 2^{nd} plot that, after earlier points on the left, it is eventually bounded by a constant… ❗ and just to emphasize that, the two lines are compared directly in the 3^{rd}. in any case it shows the algorithm is working roughly as expected: the ‘b’ optimization succeeds in optimizing and “pushing down” that same metric, whereas the ‘a’ optimization applied indirectly does not do as well.

an immediate problem with the above analysis, however (other than, or maybe in line with, the intermittent big-spike problem), is the “needle in a haystack” problem. rerunning the analysis code leads to spikes at different locations each time the code is run, and from other experiments it is natural to guess that some very large black-swan-like spikes in the two metrics ‘a’ and ‘b’ may be lurking. it looks like a new approach may be to evaluate the GA fitness function over a large set of “hard seeds”. but then that involves a search using the current best GA fitness function!

this leads to a rather natural, remarkable idea. years ago a CS (undergraduate) student at stanford told me a story of how his CS class was working on tetris, and their job was to create an “annoyer” algorithm that would send down ill-fitting blocks based on some system. this also reminds me of the 2-team system/ dichotomy described by feynman with the space shuttle teams: one is the coding team and the other is the QA “challenger” team that attempts to attack/ break the code. at NASA these two teams were both sizeable, in contrast to the typical industry pattern of a much smaller QA team with fewer resources and less technical capability.

here it looks like the code should be split into two parts: the 1^{st} half is a hard-instance-finder for the current top solution, and the 2^{nd} conceptual half attempts to optimize the parameters based on the hard instances generated.
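
a hazy sketch of the challenger half, under stated assumptions: a toy 1-metric “invariant function” (just a weighted log of the iterate) stands in for the full 14-metric version, and the `challenger` helper name is hypothetical:

```ruby
# sketch of the "challenger" half: rank candidate seeds by how badly the
# current weights do on them (longest nonmonotonic run of the combined
# metric), and return the hardest seed found.
def max_nonmono_run(l)
  m, j, x = l[0], 0, 0
  (1...l.size).each do |i|
    m, j = [l[i], i] if l[i] < m   # new running minimum
    x = [x, i - j].max             # run length since last new minimum
  end
  x
end

def challenger(weights, seeds)
  scored = seeds.map do |n|
    walk = []
    m = n
    while m != 1
      walk << weights[0] * Math.log(m)   # toy 1-metric invariant function
      m = m % 2 == 0 ? m / 2 : (3 * m + 1) / 2
    end
    [max_nonmono_run(walk), n]
  end
  scored.max_by { |run, _| run }   # [worst run length, hardest seed]
end
```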

### wild & crazy idea

now, after wading thru all that, heres a more interesting/ meaningful punchline, ie a hazy dreamlike long-shot but not-completely-inconceivable-or-implausible sketch of a computer-assisted proof. as hinted in some earlier posts, a neat outcome of earlier FSM transducer work/ experiments is to be able to compute *arbitrary* (Turing-computable) functions on any input distributions that can be modelled with a FSM (finite state machine) and a constant number of compositions.

then it is natural to wonder if all the 14 metrics can analytically be computed over an FSM, and that theory seems rather tricky, but possibly feasible!

the end idea here is that if the combined metric has a constant-bounded max nonmonotonic run length over all iterates, one could evaluate the combined metric using a TM, and end up with a computable finite calculation over all iterates modelled as FSMs. and there could be some shot of proving …?

there is some very interesting yet obscure FSM theory of computing statistics on accepted word lengths which uses generator matrices which looks superficially/ vaguely applicable, and even some more advanced theory on asymptotic analysis of their bounds….