hi all, have been holding out on you a bit while writing up other stuff on the blog. finally wrote up (coded) an idea that was nagging at my neurons for quite awhile. enjoyed building the genetic algorithm code to attack collatz from over a year ago, and was thinking of new approaches. since then, have a new bag of tricks such as 8 different hardness algorithms and it was always in the back of my mind to maybe return to a GA approach (now over a year old… time flies!).

wrote up a very complex/ ambitious GA approach within the last few weeks, and was running it for days at a time. it has many features and want to tout/ brag about them at some point, but that would take some more writeup.

was pondering however on why the GA was not really finding much improvement in solutions, and then zoomed into this particular issue that is very much worth sharing/ writing up/ detailing.

1st, wrote a new database generator, it has a slightly different output format and some other tweaks/ streamlining/ slight refactoring. it outputs 100 x 8 seeds, for the 8 different hardness algorithms.

then, next, here is the genetic code. it packs a lot of logic into under ~ยฝK lines of code and took quite awhile to isolate some very intermittent glitches that only showed up after very long run periods. ouch! may write it up in more detail later. this is my experience with GAs in general though and justifies some of the reservations (maybe even *“procrastination”*) in getting to it.

that code seems to be working as intended as far as the search logic, but not working as intended as far as succeeding in the search.

now, here is some key/ relevant code extracted from the genetic code. its the linear weighting evaluation function, and the smoothness (fitness) evaluation function for a bunch of different metrics. the genetic algorithm attempts to optimize all the metrics simultaneously using what seems a very ingenious approach that deserves a lot more description, but am passing on that for now.

anyway, noticed that the hardest metric to optimize seems to be “mx” which is basically just the worst case nonmonotone run length. and also for the “w” seed hardness method. so then was wondering, how much can this metric alone be optimized just looking at random assignments?

so then wrote up this following code that just evaluates a bunch (10) of random weights over the “w” method. the database approach really helped to show the performance. one does not have to recalculate hard seeds, the more lengthy calculation, and one can more quickly get a look at the overall trends for the weights and the fitness evaluation. this is a real payoff of the hard seed database idea.

what this graph is showing is basically confirming the suspicions, that even over a lot of random weight samples, the upward trend of the “mx” metric cannot be flattened much, ie decreasing the slope of the lower bounding line, although its somewhat possible. it appears that other metrics may be easier to find flattening weight combinations, but “mx” is apparently/ probably the hardest of the metrics to “flatten”. the genetic algorithm was coming up with supposedly improved solutions but each one had virtually no impact on the “mx” flattening. am hopeful at the moment that isolating this may help a lot in the future in figuring out smoothing algorithms that are successful or eliminating/ throwing out those that arent.

footnote, just noticed: the trendlines are sticking to certain “channels” even on increasing the # of samples, and the top and bottom trendlines are very distinctly traced out and not diverged from. what does this mean? not sure yet but maybe it has to do with the lower parts of the trajectories coinciding, ie some kind of pigeonhole-like effect. (does this also have something to do with random weights only in the range [-1…1]? need to make/ try out that simple change immediately.)

also, not obvious from the picture, but maybe all the trendlines are actually nearly the same through the center average region with just some deviation around it. need to do linear regression on each trend to attempt to isolate that. the lack of any apparent consistency/ bunching/ coalescing of line colors in any particular regions seems to indicate/ attest to that.

**(5/5)**ย ๐ณ oops! turns out on closer examination there were some glitches in that code but it doesnt affect the analysis too much. however, it does lead to some key debugging of the genetic algorithm.

1st, the `rand.rb`

code had the random weight assignment inside the loop over the trajectories instead of outside. its interesting to move it outside to see the same weight assignment over multiple trajectories instead of different for each trajectory. that leads to this graph of the “mx” metric which again somewhat unexpectedly/ surprisingly shows that it tightly sticks to an upper and lower bounding lines, both jagged. whats going on here? why is this?

2nd, a lame/ bozo defect was found in the metric calculation. the code to split the trajectory value binary string was failing! the delimiter was meant to be a regexp but was accidentally specified as a regexp contained in a string. oops! fixing that leads to the correct code, but essentially the same results. whats going on here? theres some big mystery!

it appears the answer is as follows as discovered in this following further analysis code. the metrics are all either of basically two types: a metric that closely follows the overall trend of the trajectory, ie is highly correlated, but maybe with widely varying scales. another metric eg those related to densities are quite noisy and are nearly not correlated whatsoever with the trend.

it appears that virtually all combinations of these two types, no matter what the weights, lead to either one of two basic trends: the overall trajectory trend, or an inverted (vertically flipped) trend ie a sort of “bipolar” effect. for each of these, a single “mx” metric dominates. these correspond to the upper and lower trend lines of the prior graphs.

the 1st graph is a series of 20 “w” hardness trajectories weighted differently each time starting in the middle hardness ie [35..64] plotted in different colors. it clearly shows this bipolar effect and the basic self-similarity of the trends. the trends increase in vertical/ horizontal scale but this is due to the unweighted trend, mostly unaffected by the weights.

the 2nd graph adds 5 new weights in attempt to have a greater effect on the trend. it was a simple idea, namely add a polynomial or function in terms of trajectory position. in this case, a linear, quadratic, square root, constant, and logarithmic (base 2) terms respectively. this is still a deterministic calculation, but the results are dramatic. in the 2nd graph shows that we now have randomly perturbed trends with different concavities and extrema ie minima, maxima, and inflection points but that are not overly noisy or jiggly. this is perfect for now, *ieย **exactly what was intended.* ๐

the 3rd graph now shows the “mx” calculation of the 100 “w” trajectories for random weights including the trajectory position function terms. *wow! thats more like it!* it succeeds in a much greater spread of flattening effect for some random weights! *so cool, even worthy of a few emoticons in celebration!* ๐ โ ๐ก โญ ๐ ๐

learned a lot from this exercise! when running a GA, at first try a lot of random weights (candidate solutions) to see what effect they have on the fitness function. it sounds obvious in retrospect/ 2020 hindsight, right? but it did not occur to me at 1st. the improved GA was already altered/ running last night with some of these features installed and some other cool ideas. already have some interesting/ substantial results that deserve to be shared! intend to write them up soon.

**(5/6)** heres the new genetic code. the code is quite sophisticated. a few features:

- calculation of curve fitting of multiple metrics
- calculation also of the deviation/ “tightness” of the curve fit(s) as one of the evaluation/ fitness metrics. this increases the rigor of the fit and is setup in a single line and can easily be turned on or off like a switch.
- minimization over many metrics simultaneously. instead of attempting to combine the metrics into a single metric and all its associated complications, it keeps the metrics separately (as “metric vectors”) and
*counts*how many different metrics are exceeded by new solutions compared to old ones. this is actually a partial ordering because some of the metric vectors will have the same number of individual elements different in the comparison, in which case they are not comparable. - this new version of the GA has 3 different crossover functions. the prior one only selected 1 element of the two weight vectors. the new one adds or multiplies the weight vectors.
- theres a geneology history tracking mechanism to understand how prior solution candidates influenced the final solution candidates. it tracks how many mutations or the 3 crossover functions that are found in the final solution, and all the prior candidate solutions in the family tree. it does not keep track of the exact tree because that was found to run out of memory due to what might be called “heavy inbreeding” (which should be noted despite its negative connotations in biology, in this case is still improving the solutions!). the tree histories would tend to double in an exponentially increasing size.
- some of the memory usage is due to that there is no “candidate garbage collector” to get rid of old prior evaluated candidates, but even with this the geneology history tracking still was increasing exponentially in memory usage with the exact tracking.
- it continues to add evaluation points to the new solutions, and the number of points in the solution is also one of the fitness comparison metrics. after some long runs the algorithm was found to “cheat” some by using multiple equal/ identical evaluation points which is a way to increase linear fit/ decrease deviation, so the code also had to make sure the points are
*unique.*this is already seen to need some adjustment as will be described shortly. - there is some logic to decide which of the hard seed distributions to sample from: either all of them of equal hardness, or only a few hardness methods with less than half hardness, and other variants are easily specified.

at this point have 2 runs with this code with two variations, first in a more relaxed mode with curve fit deviation optimization off and points sampled only from the hardness ranges [0..50, 400..450, 500..550, 600..650] which are the seeds with long glides, and the idea was to see if it would generalize to the other hardness ranges and methods. that led to the following results. the candidate data output is rather complex with all the intermediate data and will not currently go into describing this format, but note the code parses it in the post-run evaluation logic. weight indexes start at 0. (run times are on the order of ~1d and the 1st seemed to generate about double the intermediate incrementally improved candidate solutions.)

analysis: the images are graphing the “mx” metrix for the unadjusted hard seed trajectories in red and the adjusted/ weighted solutions in green. the runs are very notable because of the differential effect over different seeds. ofc would like a solution that works across different seeds, and these solutions succeed somewhat in that regard, but it also shows how the different hard seed generators really are creating different types that are hard for the weighting function to adapt to/ generalize across hardness methods.

the 1st solution succeeds in flattening a few of the long glide algorithms successfully, the 5th and the 6th, and fails quite noticeably on the 2nd, 4th, and 8th. the 2nd solution which attacked all seeds, not just the first 50 in each method, and also attempts to minimize the curve fitting deviation comes up with a *very impressive solution.* it still trends gradually upward esp on the problematic 8th method, but the flattening on the long glide methods #1, #4, #5, #6 is *extremely strong, ie exactly as desired on the apparently harder seed methods!*ย ๐ฎ ๐ ๐ ๐ก โ โญ

actually those results suggest that maybe the algorithm is focusing on “breaking up” long glides exactly as intended, but is failing to generalize or “breaking down” on regions outside the long glide. a plausible hypothesis but that will require some more analysis to isolate further.

note that both solutions are not evaluated on very many points (11 points for the 1st and only 3 for the 2nd). so the GA is finding impressive solutions without evaluating a lot of points, and this is great, but also it needs to be adjusted to make sure it is attacking all the different methods, otherwise it probably has little chance of additional success.

so the next step *(always easier said than done)* will be to adjust this point sampling logic. a basic idea: sample at least one point from each hardness method, and to increase points, increase by 1 in each hardness method. also need some idea so that points are not biased toward the easier/ shorter trajectories.

**(5/7)** just reran the code with the last options that worked well (ie sample from all 800 seeds, minimize curve fit deviation on) to look at the natural variability in this (inherently nondeterministic) algorithm. it came up with this solution which is similar which does not have the very strong flattening but is notable for much greater slope on the 1st hard seed method and the 7th has a clear concave downward/ asymptotic-looking trend (again, exactly as is sought after). ๐ ๐ ๐ก โ โญ

**(5/13)**ย ๐ก โ ๐ฎ ๐ณ after some more customizations that code ran for several days and found an “apparent” solution that was almost entirely flat with all “mx” measurements not more than 2! but alas, false alarm. the code is correctly optimizing a curve to be monotonic downward but is finding curves that arbitrarily drop below zero. they are very smoothly descending but these are not useful for lower bounding the function because (on further examination) they do not have any fixed lower bound, ie drop to lower negative values for larger seeds. it would be surprising if the algorithm found that without specifying that anyway.

then did a major push on the code that took a few days total.

- rename some functions
- nonmono function can return null if bad ie no points lower than starting point
- add subtract weights on 2-way combination
- take out multiplied weights on 2-way combination (underperforming)
- new
*1/j*term in weighting function - weight function computes absolute value to enforce lower bound
- adjust output info format
- dual add for negative weight vector
- garbage collector old solutions (solution/ comparison array)
- comparison array calc speedup (n
^{2}to linear) due to garbage collection - improved sampling logic. all 8 seeds, less half hardness
- avoid/ lower bound less than 8 seed samples per solution
- save baseline metrics as database
- optimize differential of metric minus baseline vs based instead of straight metrics
- redo analyze routines
- another alternate mode to just throw out trends with any negative values
- add some mode flags to control some of the prior listed variations

**(5/16)** have been running the code and it runs without exceptions but seemed to be failing to find good solutions even with different variations/ relaxations; “mx” slope was never making negative values or only barely so, and other metrics seemed not to be improving either. it wasnt clear if this was due to the tight constraints “or what”. then tried to analyze some solutions and got perplexing/ contradictory/ inconsistent results, the (re)analyzed solution was in a rejected state (due to being all negative, in a negative trends rejected mode) whereas the found solution was not, yet they should be identical. this turned out to be a useful basic “sanity check”.

quite a bit of debugging led to the code that was “trying” to create solution pairs by negating the weights. it was very tricky to realize staring at the code, but the 2nd negated solution metrics were not being recomputed and instead the original metrics for the nonnegated solution weights were being reused. *oops!* ๐ณ

its not easy to figure out the effect on the runs but basically it seemed to be equivalent to exactly half of the initial “solutions” added being corrupted (namely the 2nd of every pair) and their metrics not correctly reflecting their weights, but potentially still reused in new calculations leading to possible further spiralling corruption… yeeks, (as remembered from prior experience) this GA stuff can be very subtle/ tricky to debug and require full/ intense analytical prowess… and around here/ elsewhere, admittedly maybe some failed GA tests may be put down to incorrect implementation that was never discovered!

**(5/26)**

- another tricky-to-isolate bug in the dual-solution add logic! function calls got the two different weight vectors mixed up!
- also found that the runs were converging toward zero weights in some cases, and then realized that mutations cant get out of the zero weights, so put in an “add” mutation also that adds a random number.
- more adjustment on/ tweaking of analysis routines for better visibility
- despite initial seeds having hard samples from all 8 methods, it found a “solution” that ignored a lot of higher “z” seeds and included only lower ones. this was due to newer weights causing the initially generated hard seeds present in all solutions to be rejected due to negative trend of newer weights. so adjusted the test logic to reject solutions that do not carry/ include all 8 harder-range seeds. part of the challenge of this type of GA is forcing it to sample enough points and trying to steer it away from coming up with solutions that seem to “merely selectively sample”
- the code found a solution that negatively weighted the “
*j*” index indicating negative slope, and 0-weighted the*j*and^{2}*j*terms. this seems to be meaningful and then one^{3}*(“just”)*needs to look at how the other terms offset the monotonically decreasing linear trend. - using the mode of calculating differential with the baseline/ unweighted metrics, it seems a solution is converging/ matching the collatz trend dynamics with some other “non-collatz” function. in other words, a function with some “deterministic” terms (such as j index terms etc) mixed with the “nondeterministic” terms of the collatz function eg densities etc.; somehow this seems like an advance but am not sure what to make of it yet. can it be exploited somehow?

**(6/3)**ย am continuing to tinker/ run this code; its timeconsuming to write up the intermediate results on these algorithms but also dont want to miss any details on plausible yet discarded ideas so am struggling a little with what to share here, and another factor is (alas) that the audience is very low/ quiet. the results so far are not stellar because in particular the “mx” metric is not seeing a lot of difference as seen previously, it generally is nearly tracking the original “mx” value for the unsmoothed data/ curves.

one observation: was experimenting with GAs that “throw out” some candidate solutions eg those that dip below zero. however (and maybe have seen this before) think this is not so great for GA type optimizations for a few reasons. these still succeeded somewhat but maybe were slower. the GA gets most of its optimization info from the relative value of the fitness metric and a thrown out candidate is like an inconclusive result that the GA may not be able to “figure out” exactly why it was thrown out. so maybe this is ok if discarded candidates are not common but otherwise the GA may spend a lot of time just trying to find viable candidates. so as a rough guide, think its better to try to adjust the framework such that these solutions have bad fitness scores rather than having a scheme of throwing them out.

here are some results for a GA variation that takes out the mode flags and instead of working with absolute values of the trends or throwing out negative trends just tried to minimize the absolute value of the least point in the curve, ie trying to keep it close to zero as possible. the code is finding a weak “decreased”ย slope in the “mx” metric in the smoothed lines (first graph, red line is smoothed, green line is unsmoothed). 2nd graph shows the plot of “mx” for smoothed lines versus the seed starting size. unfortunately this is far from linear and theres clearly a kind of spread at the heart of the collatz mystery that the GA so far cant overcome/ “linearize”: some low seeds have high “mx” values and other high seeds chosen at random have moderate “mx” values. also not sure how this is happening but the “solution” seems to be throwing out some of the lower seeds, a partial “cheating” by the GA.

โ โญ ๐ก ๐ฎ ๐ so this is a failure to generalize in some sense. after a lot of really concerted effort, was getting a bit disappointed at the results so far and then came up with this simple variation. the hardest “mx” statistics (highest upslope) are for the first “w” hardness method (0-99 index seeds). so what about a GA that tries to optimize only over “w” seeds but not worry about the others? surprisingly the GA does *substantially better* and comes up with these very impressive/ nearly eyepopping results for “mx”. *the “mx” statistic is actually declining for “w” seeds, the first time a declining “mx” trend has ever been found.*ย and elsewhere for some of the other hardness methods, outside the training set ie the test set, the “mx” metric is *level* exactly as desired!

it gives me the immediate idea (actually one nagging at me for weeks, now newly prioritized) that maybe the GA needs to look at/ optimize individual slopes for each (8) separate seed hardness methods. conceptually its very similar and not much different code but it will take a little while to try/ debug this next idea out.

๐ก postscript, idea while driving! imagine that one could find separate formulas/ solutions for each of the hard problems one at a time. can they then be combined somehow for desired results? *absolutely!* a new *formula* that just simply takes the (“pointwise”) minimum of the 8 separate formulas, if they exist, would do the trick quite nicely! ๐