linear regression vs Collatz

this was quite a long time in the making but finally got everything all lined up. this is an experiment to see if any of the basic metrics explored so far may have a linear prediction capability on trajectory length. prior experiments “from another direction” suggest there is some weak link. can it be quantified? also, if there is a nonrandom prediction, how can this be turned into some kind of proof? thats a big gap/ leap of course.

in a blog from long ago it was conjectured that curve fitting approaches including general machine learning approaches would fail to generalize in the sense of having unboundedly increasing error for larger starting seeds. it would seem that any exception to this might hint at some kind of loop invariant and hence some kind of proof structure.

this code generates long trajectories via maximizing the glide length and then outputs 6 density measurements on each point in the glide, along with a “glide complete” metric that ranges linearly between 0 and 1. the goal of the regression is to predict the glide complete ratio from the density measurements. is there any predictive capability or is it random? prior experiments suggest there might be but its worth analyzing precisely.



the graph shows a “naive (running) error” vs the actual running error ie the difference. the naive error is just predicting 0.5 for every glide position. in this run there are 1822 points and 1200 are used for the training set and the remainder are the test set (this varies slightly based on the random seed generations). it shows some predictive capability versus the naive prediction (the error difference is less than zero) but a gradually increasing error, although an asymptotic bound less than zero is imaginable. rhetorical question: has anyone (“else”) used any machine learning approaches on Collatz, or for that matter any number theory problems in general? it appears to be very rare but think that could change in the not-so-far future.


this adds more metrics that measure “relative” 0 and 1 runs, average values, standard deviations, and maxes in the iterates, scaled by iterate bit width. the statistics for 0 and 1 were found to mess up the regression; it looks like theyre nearly linearly dependent but this is not detected by the regression code, leading to some kind of overflow and weird results, but eliminating either of the 3 metrics seems to give reasonable/ coherent results. the datapoint count is increased to 4690 total in these runs (again somewhat random between runs); still 1200 in the test set. a plateau in the error graph seems plausible.




(8/3) this is some complicated code but am gonna try to describe it briefly. the idea is to look at early density measurements of a trajectory and see if they can predict trajectory length. but trajectory length is unbounded, so what to predict? (any linear formula would fail on an unbounded data set.) as shown recently the trajectory horizontal scale ie iterations divided by number of bits in starting iterate is apparently stable/ bounded, so try to predict that.

this code works from the standard database and throws out trajectories less than 20 length and hardness less than 20 out of the 100 scale. remaining trajectories are put in test or training set depending on hardness, [20..50] in the training set and [51..99] in the training set. that ends up with 6 different hardness methods , 61 points in the training set and 110 in the test set. the first 3 density measurements are made on each trajectory (2 per iterate, 2×3 = 6 total) and another 3 for the stats on bit run lengths normalized, avg, stdev, max for 9 total predictor variables.

the code indeed finds some decent linear signal/ trend. to analyze, it helps to order/ graph by increasing starting iterate (bits) sizes as follows. there are two key graphs. the first is the plot of increasing bit widths of the filtered test trajectories. the 2nd graph is two superimposed plots, one for the error/ difference for the horizontal scale (predicted vs actual), red/ left scale [-100..100], and the other is scaled by the starting iterate bit width to predict the actual glide length, in green/ right scale [-1..1]. the key question is if the glide length prediction error increases noticeably going from left to right.

❗ the bottom line is that the glide error does not seem to increase substantially overall even for much larger starting seeds out of the training set! its significant that it works over 6 different hard seed types also. ie, at least over this “not so huge” dataset it looks not inconceivably something like a “local-global correlation”, aka loop invariant, in other words a predictor of trajectory length from “early”, finitely-computable information in the sequence (basically easily-computed bit statistics). but ofc std disclaimer applies, “appearances can be deceptive”.


stat_test5 stdout



(8/4) this all gave me the urge to look more into comparative density statistics for the long/ hard runs. this code looks at 4 statistics, the density, the distance from core, and running averages of both, over the main glide algorithms ‘w’, ‘cg’, ‘r’ for a total of 12 separate plots. a lot could be said but it looks like, along with the prior experiment, early density changes/ slopes are indicative of/ correlated with overall glide lengths. the trajectory lengths are colorcoded with cooler blue colors low lengths and hotter colors the longer lengths.














(8/6) some further thoughts/ musings/ retrospective analysis banging around in my head to record. above there is ref to the horizontal scale being “stable” in the sense of not increasing. (the technical statistics term is stationary.) actually just proving that alone would be equivalent to a proof. so why is there any need to estimate trajectory lengths (with greater precision)? yet somehow the latter seems surely relevant but not sure how these two aspects interrelate yet. this is also reminiscent of estimates of prime count growth function made by gauss and the error function estimate and its connections with the Riemann hypothesis. some mystery/ cognitive dissonance on all this right now.

another key issue always lurking: the density measurements are on specific hard seed generation approaches, but its not so clear whether these are crosscutting/ generalizable across all possible hard seed algorithms. they might be highly specific to the particular hard seed algorithms. (again the concept of a loop invariant.)

a currently fuzzy idea thats been occuring to me: suppose one created a GA/GP-like construction that attempts to simultaneously formulate some kind of loop invariant, and at the same time look for hard seeds that violate the loop invariant, and somehow succeed in the interplay between these two opposing angles/ constraints in constructing the loop invariant. that is almost an emergent design pattern of what is being hinted in this overall research.

another idea that came up yesterday. a long time ago there was some code to (recursively) generate all the base-2 n-length prefixes that are associated with all possible m-length or larger glides, where earlier-terminating glides are excluded. maybe had some musings at the time like the following that werent recorded and am now returning to them.

what if one looks at the density variations in those finite prefix sets? is there some pattern in the density for increasing prefix sets? in particular what about the minimum and maximum densities found in the prefix sets? could they be possibly be converging toward the core (½) density? strikingly, that is quite conceivably close to a proof construction. this code is not too complicated but didnt have a chance to pound it out yet. (where are my research assistant(s)?) :\

(8/12) 💡 ❗ ❓ 😮 this could be a big deal! was trying to think of something like an “unbiased” way to come up with larger seeds. many statistical properties are associated with the hard seeds but its hard to figure out if these are “biased” wrt the generation process. so its hard to figure out if the algorithm is (trying to come up with evocative/ colorful metaphors for this, and settled on this one) “mining veins” and yet missing out on gold nuggets so to speak, or if the veins it is following are unbiased etc. this code computes a “2-way” histogram on the density and glide length and selects advance candidates uniformly from it. this led to an apparently very significant discovery.

the plot shows glide length on x axis and average density and standard deviation for the found/ generated glides of that length in green line/ red bars. the blue line is the max of the density of the glides which is clearly decreasing (but is hidden in the average statistics which seem not to have much discernable trend). the magenta line (right axis scale) is the number of samples of each glide length. this is nearly exactly as was being speculated earlier (in the sense of almost too good to be true): do longer glides have a density gradient that can be exploited somehow for the proof, either min or max? (in this case the latter) …



an obvious next step is to try to collect the same # of samples for higher glide lengths, buts its kinda tricky and was musing on different approaches for doing that (presumably all the glide algorithms devised so far do not have that property and look something like above in the sample count trend). one idea is to have a frontier of next points and (maybe greedily) decide which in the frontier best (incrementally) level out the sample count curve. or possibly a greedy algorithm that tries to minimize the standard deviation of the average of the sample count curve? also had some ideas about trying to fill in the low samples first above some threshhold before filling in larger samples. not exactly sure what to go with to start.

(8/13) ⭐ ❗ 😮 this code took quite a few hours to write, it was very tricky conceptually but is not very many lines. had to think it thru very carefully and do a lot of interactive debugging to get all the subtleties figured out. it basically implements last idea. it loops 2000 times and selects frontier points that incrementally/ greedily level out the sample bins (sample count in blue/ right y axis labels). the results are as conjectured and very promising!

so, bottom line/ preliminary conclusion (subject to much more analysis!) it appears that an algorithm that tries to find larger seeds without bias is finding seeds with lower densities. its very interesting that the standard deviation decreases/ flattens also. (not exactly sure how to explain the 2 high outlier sample count points at the moment.) also note there is some wasted repeated calculation of trajectory lengths in here, it was just how the code was banged out on 1st try, may try to streamline it at some point, and anyway it can also be fixed with a simple memoization scheme. theres also a lot of repeated similar effort to calculate the histogram-like array ‘b’ from scratch when it could be calculated much more efficiently incrementally.

if this property can be proven to be monotonically decreasing somehow/ in general, its equivalent to a proof! ❗ 😮



(8/14) might as well exploit all the immediate effort and just run the code for much higher accuracy and time. didnt notice exactly but this took maybe ~10m to run, its 20K points with 50 bin size. it shows the density curve going through long but still falling plateaus, and the standard deviation flattening but also apparently converging to a lower bound/ constant. the high bin count outliers seem to have something to do with the algorithm hitting some kind of resistance and getting temporarily “stuck” in finding higher glide lengths…? also had a great idea to modify code slightly/ simply to greedily pick higher density glides with the targeted glide value instead of available glides with that length at random… wonder if it will change results at all?



(8/17) it was a one-line code change, ran for about ~14m and leads to remarkable results in at least two ways. the error bars (standard deviations of the glide densities) decrease to very low and there is a very far out nonmonotonic increase after a long slow decrease. (@#%&) argh! surprising/ disappointing! 😮 😳 😦 😥



again a minor change, this code adds a slight probabilistic component of choosing from the top 3 glide densities instead of the top one and gets essentially the same results although the late upslope occurs a little earlier and theres larger error bars early on. there are 2 more runs showing the variations. a natural question arises, what if the density falls below some constant bound? is that sufficient for a proof?





(8/19) this is that logic refactored, some nearly identical/ equivalent code logic results but a major restructuring/ streamlining that massively increases efficiency. the repeated trajectory calculations are thrown out and the “l2” array was found to be unnecessary. the “b” array is not (mostly redundantly) recalculated in the loop and is maintained incrementally instead. this code is ~10x faster! am including another graph as a marker. the efficiency improvements should allow substantially greater resolution runs. 😮



this code has 80K iterations and bin sizes 100 and dips to 40 top prior density samples in the frontier. the mostly monotonically declining slope returns (although some brief nonmonotonic region in the middle is barely discernable). so a big question is how repeatable this is and how representative it is of all possible seeds. are the prior runs with lower prior density samples and the much more nonmonotonic trends less representative? the 2nd run is 100K samples and some simple added timing logic/ tracking. (the code refactor is really paying off bigtime; 80K iterations run in ~7m, 100K in ~11m!) 🙂





(8/24) this is some code that attempts another angle on an unbiased analysis of density trend of larger seeds with longer glides. one can just randomly “advance” glides that are longer than their prior glides out of a large set. this was done for 100K iterations. then at the end there is a distribution of densities over seed sizes and glide lengths. the results are not so easy to interpret. 1st two graphs are two different runs, x axis is glide length (bin), the green line is average density and blue line is the bin/ sample count. in the 1st graph the trend looks mildly upward for part of the distribution but falling at the upper end of the distribution. in the 2nd it looks mildly downward. the 3rd graph is the max density in each glide bin and shows a seemingly more discernable downward trend.





(8/25) this is another quick riff off the last one that sorts the array by glide size and picks randomly out of the top 50. it finds large glide lengths up to ~2000 but slows down as it climbs. this is partly due to the resorting of the increasingly larger array but also it reflects that longer glides are rarer. the green line is average density and red line is error, and blue line is max density, all over the glide bins on x axis. the glide bins have very few samples, typically ~1-2. partly since samples are low, the max typically overplots the average. at around 2000 glide length the density is ~0.05 which is about ~½ what is found with the much more thorough prior algorithms with large bin sizes/ sample counts over each glide although this algorithm runs in roughly ~½ the time (~6m).



this is another pov on that code that looks at the density of a glide compared to its successor with longer glide. the graph is the ratio of densities subsequent/ prior minus 1. it shows the higher density advances become narrower in their increase, while also the decreases become narrower, but the increases are generally smaller than the decreases.



(8/26) this is another angle that attempts to push the frontier forward in a systematic way and exhaustively find all higher-glide prefixes by bit length. the green line is max density which is not declining for low bit lengths under 80. red points are average and standard deviation bars. it leads to an exponential increase in trajectories through the bit levels (blue, log scale).



the idea behind this next code is the observation that finding high densities at long glides is a sort of computational complexity problem in the sense that they are hard to find, and a longer search is more exhaustive and finds higher densities in an asymptotic way wrt time elapsed. “longer search” in this sense means “more samples”. so then this code keeps track of how much a higher density is found at a given glide length compared to number of samples taken at that glide length. glide length bins with higher success (ratio) in finding new maximums are at the earlier part of the asymptotic increase and those with lower success ratio are nearer to the end, aka a easy-vs-hard measure.

the logic simply tries to push the frontier whereever its easiest, basically a clever greedy strategy variation, and the resulting graph has a nearly uniform final hardness estimate at each bin based on the emergent result of the balancing (ie a sort of “load balancing” algorithm!). the red line is density average/ error, blue line is minimum, green line is maximum. 100K iterations, run time ~6m. overall at moment this seems like a really great balance of efficiency and accuracy/ scope, and is maybe fairly unbiased wrt finding max densities vs glide lengths. multiple runs (3 total shown here) do not seem to run into large variation esp at the far end. one run took ~9m.





(8/31) it turned out that code was doing something funky as far as putting seeds into the analysis array “l1” that are not unique because they are the same values with shifted bit indicator “p”. this code, not much different, alters the analysis calculation and output, with a 2nd piece of code to look at the largest trajectories. it is including some trajectories/ seeds that were generated (as part of the frontier) but not “selected” or “advanced” but with presumably negligible effect on results. also the generation algorithm was changed to “cm” the max glide value (index). the 2nd graph shows the last few trajectories scaled by starting value. the “end branching” structure noted a long time ago is present.

😳 another simple variant looks at sequential density of these trajectories and reveals something disappointing but maybe obvious in 2020 hindsight. the initial densities are biased low but after only a few iterations the density reaches the core region, yet the iterates are still large-sized and in the early part of long glides. in other words, there are many large iterates in the middle of large glides that dont have low density and have core region density instead. so these algorithms are indeed biased in that sense. maybe/ apparently they are optimal in the sense that they find or “squeeze out” the longest glides for seeds/ iterates of particular sizes but “nearby” glides are almost as long.






(9/1) more thoughts on that last diagram. didnt realize it was somewhat anomalous until sleeping on it & thinking further. the trajectory branch descents from the max are all in relatively close (linear) alignment. thats a remarkable/ surprising property of the generation algorithm and not at all “typical” of trajectories in general which are much more jagged. looking at base2 “suffixes” (msbs) shows that it is finding various suffixes of various lengths that somehow lead to nearly aligned descents. following is some code to isolate the suffixes and sample output. it initially outputs the # of matched bits in the prefix, in this case 2282.


review7c stdout



One thought on “linear regression vs Collatz

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s