collatz linear multiple regression attack contd

this took quite a bit of effort and is an idea that builds on `extend23.rb`, some of the effort is related to more need for analyzing results. it works with the long trajectory database. it uses 5 of the glide algorithms that typically lead to longer glides of at least 10 steps. then it does a linear regression on the trajectory iterates metrics to fit the estimated iterations left in the glide, but scaled by the initial seed bit width. this was called the “horizontal scale” in some earlier experiments and seems to be stationary in the statistical sense. then the optimization algorithm attempts to find trajectories that maximally “thwart” the estimate ie maximize error.

the 1st graph shows points associated with the 5 algorithms grouped, ie full trajectories stacked horizontally with some gap between the 5, in the order of increasing starting seed size/ trajectory “hardness” (glide length). red is the actual iterations remaining and the green is the regression estimated fit. the technique generally fits the middle trajectory metrics better for two of the algorithms, and for the other 3 it seems to fit better at the beginning of the glides. the 2nd graph shows increase in seed bit width (red) and glide length (green) during the optimization. the 3rd graph shows error which seemingly climbs almost exponentially. overall there seems to be some weak predictability. it can be seen in eg a scatterplot of predicted vs actual points. next need to do some analysis to try to explain why the error increase is so “bad,” admittedly was a bit surprised but in retrospect shouldnt have been!

`extend31b.rb`

(12/15) was banging around quite a bit with some different intermediate ideas & exploring results. here is a snapshot of the current direction. found that (not surprisingly, this was observed before) the prior trajectory generation code would generate a lot of trajectories with short glides, and only a few with long ones. needed something more balanced, otherwise the regression just tends to fit the majority of the glides of small length. so wrote some code to generate glides in a way that tries to maximize adding the most new points that are not part of existing glides. after that, there is a distribution of glides, in this case more instances of longer glides than prior algorithms. then pick a small constant glide count, here 4, and sample all the glides of at least that length. it tries to thwart the linear regression by finding the smallest and largest (bit size) seeds of each glide length. after all that, the prior linear regression/ metrics is fit on the samples.

there is lots to analyze here but this graph is something like the bottom line. it graphs the glide width versus the predicted glide width for increasing glides (4 at each value). it shows the regression is successfully finding some weak linear trend, and the spiky alternations are due to the small/large bit size alternation in the samples. however, the pattern is that the regression line is biased towards the average in the sense of overestimating the shorter glides and underestimating the longer glides. this shows that while there is some trend, its “overfitted” in the sense that it probably only works for a small finite window of data and will fail to generalize for arbitrarily larger starting seeds. so a real trick would be finding some approach where the error is not biased in this way and is more randomly distributed wrt seed size. something to shoot for. so back to the drawing board. 😐

😳 also, was perplexed for awhile and finally isolated a subtle glitch where the “readout” of the regression weights can be mixed up by more than 9 variables. the glitch is related to the string sorting of the variable names such that var “d10” sorts next to/ right after “d1”, and the hidden/ behind-the-scenes assignment order of ruby hashes and the regression library code (`statsample`). its fixed in the `fit` subroutine.

`fit.rb`

(12/16) a fairly natural idea thats been nagging at my mind for months at least is the idea of improving the accuracy of the linear regression estimate using some kind of trick as follows. the regression is fit on single iterates in the trajectory. every iterate is therefore associated with an estimate. one can experiment with averaging over multiple estimates. a 2nd estimate for the current glide distance is the next iterate estimate + 1, the 2nd next iterate estimate +2, etc. and one can increase the estimate over any # of points.

here its chosen to be 20 (as it starts in the graph with the lowest glide length) and it reveals a lot more about what the regression model is “fitting” but its surprising. the regression seems to be fitting a constant line for the upper sized seeds of each glide length and fitting a linear trend for the lower ones, in the alternating 2-point pattern. this is tricky and unexpected and am trying to figure out if theres some way to correct for this and am wondering if it could be related to some kind of bias in the generation algorithm.

an important property of linear regressions that are “off”, relevant here, is that if they are at least monotonically increasing (as roughly with the lower seeds), then one can easily correct it to the exact curve using a simple nonlinear “wrapper” function. this has surely been shown/ proved somewhere by somebody…

`fit2.rb`

😮 😳 oops, further analysis, something quite substantial! some sense of deja vu thinking about this. (ofc maybe one should look at the data before fitting it, admittedly not done in the prior flow/ haste!) it appears (1) the generation algorithm is quite biased in some senses of not generating a range of data, and this is shown with scatter plots, in that higher seeds associated with larger glides tend to aggregate around a ceiling (2) related, the regression is basically closely matching the starting iterate (seed) bit width, which is then a rough estimate of/ correlated with trajectory length! (but quite noisy because of the long-observed/ analyzed fractal nature of “adjacent/ nearby” trajectories.)

this is quite significant that iterate bit widths can be predicted from seemingly “blind” statistics about iterates and deserves further exploration in itself, although its possibly related to this generation algorithm (and bias). it might be a startling indication of more fractal self-similarity. otherwise the next step is coming up with a “less biased” generation algorithm, not at all a simple proposition at this point (unfortunately the oft-used greedy algorithm approach “hammer” can tend to bias samples in many ways).

modifying the code slightly to output the generation data leads to these plots. the selection code changed slightly to choose min, median, max bit width starting seeds at each glide length. the 1st plot is glide length versus bit width and unmistakably shows the prior wedge pattern. the 2nd plot is bit width versus glide length. the 2 are nearly a perfect 45 degree reflection, somewhat subtle exercise to reader to articulate why they are not exactly so.

`fitb.rb`

(12/20) the plot thickens. had strong sense of deja vu. the attempt to balance out the distribution wrt ‘ls’ parameter appeared first earlier in the `grow3.rb` code, and with many subsequent sequels/ variants. did not realize/ have that code immediately handy and rewrote some similar logic to generate the distribution. without customization the algorithms tend to generate a large number of smallest length trajectories and only a few longest length ones. also here look at fitting the 3 “middle sized” seeds for each ‘ls’ value instead of the extremes. modified the code to also have separate “training/ test” sets. the test set has the 3 middle seeds and the training set has 10 over the whole distribution.

the 1st graph below is the new distribution of seed bit sizes over ‘ls’ glide lengths and is a bit better for fitting purposes. the algorithm generates longer ‘ls’ glides however the seed bit widths instead of going up linearly still tend to increase asymptotically toward a maximum like the prior distributions. this asymptote is more an artifact of the greedy approach and varies in position according to how long the algorithm is run. another way to picture it, it has to do with that trajectories that “push” the asymptote exist but are harder to find. the 2nd graph is the fit over the test and the training set, sandwiched left to right.

the training set seems to fit fairly well and is nearly monotonic with not-so-much noise although the error is not evenly distributed, again its the regions of under/ overestimation pattern (ie clearly more a concave downward curve than a line). the test set includes 10 points distributed over the full seed sizes for each glide length and unfortunately gets similar results as previous. the predictions for/ associated with lower seed sizes for each glide length are fitting well/ monotonically but the upper ones hover around a ceiling. its somewhat surprising its a nearly constant ceiling! not exactly sure what to make of this but it seems to be a case of overfitting and the linear formula working only for a finite range of points after which it breaks down.

another way of looking at this is that seed sizes are a rough estimate of trajectory length and need to analyze how much the predictions are different than simply a seed size times a constant. it seems that maybe the regression formula is reducing to close to/ approximating something like that.

`fit3b.rb`

(12/21) 😳 😮 ❗ wow on much closer look/ 2nd thought, there was some major conceptual glitches in that logic! the general flow is correct but there are two major-but-subtle (dont you luv that concept, how its not really an oxymoron, esp with code?) logical problems associated with the same basic issue. 2020 hindsight! the code is trying to predict glides, which are not the same as trajectories. a (full) trajectory has the property that predicting the current length is recursively like predicting the length after the next iterate, plus one, but glides do not have that property! this shows up in two places in the code.

• at the beginning of the `samplefit` routine, the code is trying to get more points via “extra” glides other than those directly generated by working with intermediate values in the glides. but the intermediate values do not have the property that “interior” glide lengths are all sequentially related by 1 differences.
• in the `improve` subroutine, similar issue. glide length is denoted ‘ls’ and sequential iterates in a glide do not have the same ‘ls’ value as is assumed by the single parameter to the `data()` calculation over multiple iterates. this does not affect current estimation (the parameter is passed in but is not involved in the later estimate calculation) but am still wondering if maybe the estimation needs to adjust somehow.

was somewhat misled by results that do not seem overly implausible because there is some rough fitting; often major conceptual glitches tend to lead to very “wrong-looking” data.

another natural idea is to generate distributions based on the “horizontal scale” ratio instead of the ‘ls’ parameter.

(12/23) 💡 ⭐ ❗ 😎 😀 did a lot of work and somewhat major rewrite/ refactor/ streamline of the code making it more coherent/ organized and am pleased to finally get to the finish line. this throws off a lot of different analysis graphs (similar to the earlier GA work). would like to write up a lot more in detail but for now am just going to comment briefly on results/ bottom line. the code generates a distribution of ‘h2’ which is basically horizontal scaling and then fits it. this turned out to apparently make a massive difference.

basically the periodic-line (mis)estimation problem is fixed! and while this is apparently nearly/ roughly the same curve-fitting problem, am thinking its mainly due to the new input data distribution. graph 1 is the results and fit ordered by ‘ls’ values (an old trick of mine, this turns out to be a very nice/ effective way to visually analyze/ evaluate results, similar but in some cases, eg definitely this one, even more informative than plotting estimate vs actual). the algorithm is doing well in finding a linear trend.

the flat beginning line of ‘ls’ values is an artifact of the generation code. the 2nd graph is using the estimate over 5 iterates in a trajectory via the adjustment-via-increments and average method. there is a very clear/ evident self-similar effect! (am thinking as writing without trying it that would not have been evident in a predicted vs actual scatter plot.)

unfortunately the adjustment method seems to have the property that increasing the # of iterates increases the noise rather than decreases it! this is a surprising result! its also a bit disappointing. (and the endless variations of scale invariance of this “very fractal” problem again show up and throw a wrench in the works.) so it looks like maybe the next step is to experiment with the adjustment logic. did have one idea of maybe averaging parameters over the ‘h2’ “space” rather than the ‘ls’ “space” as it is currently. 😦 😐

but, nevertheless, think this is maybe something of a breakthrough because the basic linear estimation technique based on density-related metrics is proven/ vindicated, answering a long-open conjecture related to the general machine-learning attack direction: can decidable properties at each iterate give a halfway-decent statistical estimate of an “open-decidable” property such as trajectory or glide length?

`fit5.rb`

this code was not too tricky and does that idea of averaging over the ‘h2’ domain/ space instead of the ‘ls’ space. 25 iterates instead of 5. results seem about the same. one apparent observation: the error seems to decrease at both ends and larger in the middle.

so this is challenging, what combination of estimate information can increase the accuracy of the estimate? or maybe this “denoise-resistant” scaling phenomenon is inherent to the fractal nature of the problem?

`fit5b.rb`

(12/27) this code is a nearly easily adjusted (aka low effort) variant that calculates the estimate using every iterate available in each trajectory. it estimates the midpoint of the trajectory via averaging over ‘h2’ space. the idea is that, assuming iterate parameters are somehow linearly correlated with glide length, half of all iterates parameters are to the left of the midpoint and the other half are to the right and so the average presumably estimates the midpoint. it seems quite notable that the tight lower boundary line closely matches the actual value. this would tend to further support the idea of working with adjacent ‘h2’ coordinates to improve accuracy but also maybe taking min values over the neighborhood instead of the average might be more accurate.

`fit5c.rb`

this code averages over adjacent ‘h2’ space parameters, 20 pts. results are not exactly as expected, even somewhat unusual/ surprising. there does seem to be improved accuracy but theres a bifurcation effect. the lower ‘h2’ values overestimate with low noise and the higher ‘h2’ values underestimate with high noise. a 2nd plot of the sorted-by-‘ls’ bit widths reveals the culprit; the error is highly correlated with them. similarly with sorted-by-‘ls’ ‘h2’ values, 3rd plot. this seems to suggest the averaging should work over nearby bit width/ h2 values. now have to think some how to implement a 2-way or multi-way nearness averaging. a possibly natural direction to go in at this point is radial basis functions.

`fit6.rb`

😳 that code was sorting output order by averaged ‘ls’ value instead of actual ‘ls’ value. need to reanalyze that some, not sure if thats significant yet.

(12/31) further thought/ reflection: a basic problem, if it isnt already apparent, is that points that are nearby in ‘h2’ space are not necessarily nearby in ‘ls’ space, so that averaging nearby points over one space is not necessarily going to smooth out the results in the final transformed ‘ls’ space. so the current puzzle is, how to work/ smooth this out?