this tightens the screws some )( on the prior findings and shows they generalize nicely. the prior analysis seems very solid but there is always the shadow of a question of bias in the generation algorithms, which start from small seed sizes and build larger ones out of smaller ones. an entirely different strategy for generating sample seeds is used here, based on a genetic algorithm. the idea is to start with a fixed bit width and alter the bits just like “genes”. fitness is based on glide length (or equivalently ‘h2’ for a constant seed width). it starts with 20 random parents of a given length. there is 1 mutation operator and 2 crossover operators. 1 crossover operator selects adjacent/ contiguous bits from the parents split at a random cutoff/ crossover point (left to right) and the other just selects bits randomly from the parents without regard to adjacent/ contiguous position.
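as a concrete illustration, here is a minimal ruby sketch of such a GA (the operator details, replacement policy, and helper names are my assumptions, not the actual code):

```ruby
def collatz_step(n)
  n.even? ? n / 2 : 3 * n + 1
end

# glide length: iterations until the trajectory first drops below the seed
def glide(n)
  m, c = n, 0
  loop { m = collatz_step(m); c += 1; break if m < n }
  c
end

def rand_bits(w)
  Array.new(w) { rand(2) }
end

def mutate(bits)
  b = bits.dup
  b[rand(b.size)] ^= 1          # flip one random "gene"
  b
end

# crossover 1: contiguous bits from each parent, split at a random cut point
def cross_point(a, b)
  cut = rand(a.size)
  a[0...cut] + b[cut..]
end

# crossover 2: each bit picked randomly from either parent, position-free
def cross_uniform(a, b)
  a.zip(b).map { |x, y| [x, y].sample }
end

def to_int(bits)
  [bits.join.to_i(2), 2].max    # guard against degenerate seeds 0 and 1
end

WIDTH, POP = 50, 20
pop = Array.new(POP) { rand_bits(WIDTH) }
500.times do
  a, b = pop.sample, pop.sample
  child = [mutate(a), cross_point(a, b), cross_uniform(a, b)].sample
  # replace the least-fit parent if the child has a longer glide
  worst = pop.min_by { |s| glide(to_int(s)) }
  pop[pop.index(worst)] = child if glide(to_int(child)) > glide(to_int(worst))
end
```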
fit24 is again used for the linear regression fit. these runs are for 50, 80, 100 bit sizes with ~200 points generated for each, over 50K iterations.
because of the declining # of points at higher widths, this is circumstantial evidence that as widths increase, long glides (relative to width, ie ‘h2’) are (1) increasingly hard to find and/ or (2) rare. these two aspects interrelate but are not nec exactly equivalent. hardness of finding seeds with certain qualities, ie computational expense, does not nec mean theyre rare. an example might be the RSA algorithm: finding large primes is somewhat computationally expensive (although primality testing is in P) but primes are not so rare. technically/ theoretically, rareness and computational expense are related through probabilistic search aka “randomized algorithms”.
⭐ 💡 ❗ 😮 😀 was experimenting with some different ideas that gave “null results” and then tried this out, predicting the bit width of the max iterate in the full trajectory, and was a bit shocked at the results! it gets a very good linear fit! the regression code is modified slightly to predict the parameter ‘y’ and handle the parameters in a more sensible/ streamlined way.
looks like a veritable breakthru!
(also was just noticing that this GA code doesn’t control the size of the seeds exactly, because its possible to get ‘0’s in the most significant bits of the candidates, hence max iterates with widths smaller than the population starting size 50 for seeds that evolve to smaller (starting) widths; need to adjust that. also notable is an apparent levelling effect at top right.)
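a minimal sketch of one possible adjustment (an assumption, not the original code) is to pin the most significant bit so a width-w candidate really has bit width w:

```ruby
# force the msb on so the candidate's integer value has exactly bits.size bits
def fix_width(bits)
  b = bits.dup
  b[0] = 1
  b
end
```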
(3/3) this distribution code has various new features. (1) it tracks/ skips duplicate generated points, which is not insignificant at about 6%. (2) instead of choosing only the 1st point in a histogram bin, it randomly samples over all points encountered in that bin. (3) the fixed width limit is turned on or off with the ‘f’ flag. (4) the min statistic is thrown out because it was constant, but the ‘a’, ‘sd’, ‘mx’ statistics are added for both ‘0’ and ‘1’ runs instead of only 1 of them. (5) dividing the max iterate bit width by the starting seed bit size was also investigated (the commented line in this code).
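features (1) and (2) might look roughly like this sketch (the point structure and bin logic are my assumptions):

```ruby
require 'set'

# skip duplicate points, bucket by 'h2' into histogram bins, and sample a
# random point per occupied bin rather than keeping only the 1st point seen
def balance(points, bins: 10)
  seen = Set.new
  uniq = points.select { |p| seen.add?(p[:seed]) }   # duplicates (~6%) skipped
  lo, hi = uniq.map { |p| p[:h2] }.minmax
  width = (hi - lo) / bins.to_f
  width = 1.0 if width.zero?                          # all-equal edge case
  groups = uniq.group_by { |p| [((p[:h2] - lo) / width).floor, bins - 1].min }
  groups.values.map(&:sample)   # random representative from each occupied bin
end
```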
the bottom line after playing with the 4 modes (normalize by starting width on/ off, fixed starting seed bit width on/ off): it seems dividing by starting bit width substantially weakened the correlation or added noise (variance), although it did not entirely remove it. it looks like ‘f’=false, ie variable starting seed sizes, produced some of the best correlations, as in the following graph. at this point it would make sense to iterate over these variations many times and look at averages. also it seems the top bin of the histogram (the rarer seeds) tends to vary quite a bit across distributions and has some effect on correlation. the variance is noticeably larger than the last graph and not sure what change leads to that right now; its fairly repeatable. (am thinking it might be the random histogram bin selection.)
(@$%& misc/ random cyber travails, my chrome bookmark sync is failing right now, uninstalled chrome & cant reinstall it due to glitch, now using MS Edge, doesn’t seem to work quite right, a “heart” emoticon in wordpress seems to cause it to lose text/ remaining paragraph because of buggy escape logic for ‘less-than’ character, not sure if its wordpress or Edge specific right now!) 😦 😡 👿
(3/4) 💡 ⭐ ❗ 😮 😎 😀 good to be alive right about now! there is an obvious question of whether increasing the # of points in the distribution increases accuracy at all. that is easily answered by just looping over the distribution code a few times and adding points to the file, in this case 10x, leading to 2020 points total. it seems like the best results were with predicting the max iterate bit width divided by starting seed bit width. also, here is something of another breakthrough!
very observant readers (lol! full disclosure, thats just an expression here!) may recall a prior idea of recursively improving the predictor by looking at averages of predictions over a fixed # of initial iterates in the trajectory/ glide, and that the prior attempt was not successful! that logic can be a bit subtle because subsequent iterates, while close, do not necessarily lead to the same predictions. here is some new sophisticated logic to accumulate over 10 sequential iterates and the results are fantastic, the best ever seen so far! as noted earlier, a glide does not (“seem to”) have the property that it can be recursively improved, because each new iterate is related to a new subglide, not the full glide. (there may be some ingenious way of averaging over predictors in that case but haven’t figured it out yet myself.) however in this case, the max iterate/ start seed width ratio is seen to be “recursively improvable” in the sense of improving fit/ decreasing error/ variance!
am thinking there may be some further improvement to this collection prediction logic also; have to think about it more (not sure about the logic surrounding the case where the current iterate is predicted to be the max), but for now am in a very good mood. there is a general idea lurking here that if one can run code more times and combine/ average somehow to get better predictions, it may be a hint of a decidable algorithm with some particular time complexity related to the iteration complexity. in the graph, the predictions are extraordinarily low variance, and overestimate but by nearly an exact constant (so just subtracting it improves error everywhere). for now the predictions tend to degrade after about 10 iterates, so this isn’t a case where more computation (after that point) leads to arbitrarily greater accuracy.
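the accumulation idea from above can be sketched as follows (the function name and exact averaging are my assumptions): apply the fitted predictor at each of the 1st k iterates of the trajectory and average the predictions; the first k iterates usually lead to the same max iterate (unless the max falls among them, the subtle case noted above), so averaging can cancel per-iterate noise in the predictor.

```ruby
# seed: starting integer; predictor: a callable mapping an iterate to a
# predicted value; k: how many sequential iterates to accumulate over
def smoothed_prediction(seed, predictor, k = 10)
  preds, n = [], seed
  k.times do
    preds << predictor.call(n)
    n = n.even? ? n / 2 : 3 * n + 1   # advance one collatz step
  end
  preds.sum / preds.size.to_f
end
```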
(3/8) this is some fancy footwork (or typingwork) that combines/ significantly refactors the last two programs
distf, fit24c in a sophisticated way. the genetic algorithm is parameterized in such a way that its evaluation function is a parameter. it is called in 2 scenarios. the 1st is to generate the learning/ training distribution over 50K iterations, which is postprocessed with the histogram normalizing method, but only 1 time. that generates the data for the linear regression fit, and the set of fit coefficients is determined. then those coefficients are used in an adversarial generation attack on them (ie the regression model), reusing the genetic algorithm optimization code. the model includes the 10 iterate smoothing method. the optimizing method is then repeatedly called for 2K iterations per width, starting at width 55 and increasing in multiples of 5. the genetic optimizer attempts to maximize the difference between prediction and actual for the max trajectory bit width. the goal is to see whether error increases as starting seed widths increase.
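the adversarial evaluation function might be shaped like this (helper names are my assumptions; in the actual code the model would be the fitted, smoothed regression predictor):

```ruby
# largest bit width reached anywhere in the trajectory of n down to 1
def max_iterate_bitwidth(n)
  mx = n
  until n == 1
    n = n.even? ? n / 2 : 3 * n + 1
    mx = n if n > mx
  end
  mx.bit_length
end

# GA fitness for the attack: reward seeds where the fitted model errs,
# ie |actual max-iterate bit width - predicted|
def adversarial_fitness(seed, model)
  (max_iterate_bitwidth(seed) - model.call(seed)).abs
end
```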
the two trends are plotted below with error in red (right tics) and starting seed width in green (left tics). the model seems to be extremely robust. even though the coefficients are generated for a seed width of 50 only, the error does not increase much for seed widths over 10x that! in fact error seems to be slowly declining for larger seeds! however this could just be a situation where higher error exists but is harder to find at higher seed bit widths.
next step(s): the real challenge is to try to find an invariant that predicts some “horizontal dimension” of the trajectory, which is the “semi-undecidable” aspect; ie need to translate these current errors into the horizontal dimension. somehow the max iterate bit width must relate to the horizontal dimension, but not sure how yet, need to think on this further. another note: rerunning this code seems to lead to quite a bit of variance across runs. one other run led to a nearly entirely negative error graph, down to ~-25, which did seem to asymptotically approach a limit/ constant, possibly again with narrowing, albeit gradual. it seems some or much of this instability may be due to variation in the fit coefficients across runs. the graph error is actually “y − y_p”, ie actual minus predicted.
(3/9) this code is a 1 line change and simply aggregates over 3 distribution runs to find the coefficients; it seems more stable across runs, and the run below is more representative. the predictor generally overestimates but by an apparently bounded/ constant limit. (this time error tics are on the left, bit width on the right.)
(3/14) this code is roughly similar to
data5 but has some neat new advances. the general idea here is that its now fairly clear the max ‘h2’ parameter tends to decline/ compress for larger seeds, but rare large ‘h2’ values still exist at larger seed bit widths, so its hard to figure out what the limit is. some of the distribution algorithms seem to do better than others in finding the outliers, apparently naturally, because in the non-genetic algorithms the outliers are built recursively in the sense that their lsbs are also outliers. but its not clear if all the outliers can be constructed this way.
the genetic algorithms dont build the seeds recursively and (so?) seem to have trouble finding the outliers. so then its a question of how well the recursive distribution algorithms do in finding the outliers, and they all have slightly different performance. am not sure exactly how to proceed at this point, but it occurred to me to do a study/ comparison of how the recursive algorithms find different distributions. however, again its hard to compare. how to try to normalize the comparison somehow?
the idea here is to wait until the distribution average seed width (measured in the frontier points) hits some limit (in this case 40 bits), and then look at histograms of the seed parameters (of collected/ found seeds, not the frontier). here the glide length is labelled ‘g’, seed bit width ‘w’, and horizontal scale x50 ‘h’. then the optimization is over some combination of these parameters.
⭐ another neat innovation here is calling gnuplot programmatically. have done this before on linux but never in windows and it took quite a bit of time to figure out the syntax. this is a really great feature/ capability however & probably should have tried to figure it out a long time ago, could defn have saved time typing gnuplot cmds! this code tests 7 generation combinations and outputs 5 separate graphs for each, throwing off 35 total graphs. am including 3 highlights here. its a nice way to visualize how all the different generation algorithms have different “profiles,” but also kind of overwhelming. Big Data!™ 😮 🙄
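for reference, one way to drive gnuplot programmatically from ruby (the original invocation isn't shown, so this is a generic sketch; writing to any IO lets the same method feed a real gnuplot pipe or a StringIO for inspection):

```ruby
require 'stringio'

# write a basic gnuplot script to the given IO; filenames are illustrative
def emit_plot(io, datafile, outfile)
  io.puts "set terminal png size 640,480"
  io.puts "set output '#{outfile}'"
  io.puts "plot '#{datafile}' using 1:2 with lines title 'h2'"
end

# real use (assumes gnuplot is on the PATH, eg gnuplot.exe on windows):
#   IO.popen('gnuplot', 'w') { |gp| emit_plot(gp, 'data.txt', 'h2.png') }
```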
so the idea of comparing algorithms is going to be a bit “)(” challenging. each algorithm represents a sort of “moving blob/ target”. its like trying to compare the performance of randomized algorithms, or as one computer scientist once semifamously said (probably about something CS related), “nailing jelly to a tree”. theres also further separation of distributions in the found vs frontier sets, etc. (in 20/20 hindsight, should have narrowed the horizontal range on these graphs.)
(3/15) this code is not easy to describe in words, but heres a stab. it takes out two underperforming methods ‘w’, ‘h’ (in the sense that the frontier avg seed bit width didnt reach the limit/ cutoff after 10K iterations) and plots with the x axis as the frontier seed bit width average. there are two different methods for ordering the graph, by either the frontier or the found points. the 1st is standard order as previously, and frontier points are output as they are encountered (1st 5 graphs, showing ‘h’ avg, min, max, sdev, mode). the 2nd (2nd 5 graphs) resorts the found points according to the frontier seed bit width average when they were encountered and plots x as the average seed bit width over those groups; it shows how the algorithms can “bounce around” some during execution, ie a statistical pov more from the “found” sets instead of the frontier, but ordered by the frontier.
the max ‘h’ values decline quite consistently but the average ‘h’ values seem to stay very level. max seed bit width/ cutoff is 100. (the graph sets are for 2 separate runs.) speaking of “moving blobs,” another pov is that this shows the algorithms encounter various degrees of “resistance” or “acceleration” in their advance/ spread.
theres a lot to “process” here and nothing really jumps out except maybe that the frontier statistics are quite orderly whereas the “found”/ “harvested” set is a bit disorderly in comparison. am trying not to get lost in all the data/ nuance (lol, maybe collatz is not the best problem for that… aka “barking up the wrong tree”!). the general goal is to come up with some kind of new distribution algorithm that combines the best of prior ideas. the genetic approaches are very nice in that they target a distribution around a particular seed bit width and can be used to consider the inductive scaling of the general regression invariant (in the sense of evaluating its error for increasing seed bit widths). the recursive approach is very effective in finding/ building long glides and outlier (large) horizontal scales but it has a moving/ sliding window/ distribution/ spread of seed widths. so maybe there is some compromise/ intermediate blend between these two approaches that combines/ synthesizes the best of both and avoids the worst? am getting a few thoughts… ❓ 🙄 💡
(3/16) thought about it a bit and have a nice elegant approach! here is the basic idea. the recursive seed generation algorithms are basically using a sliding window across seed bit widths (ie the shifting frontier). however, the lower bit widths in the window are determined by the worst-performing seeds per the current sort metric. the sliding window effect can be tightened/ amplified by “reaping” the shortest bit width seeds in the frontier instead of reaping by the other optimization metric. then the window is forced more directly to “move rightward”. after the minimum point of the window has passed some width, it will not return, ie its monotonically increasing. so this is a natural way of determining when the algorithm is “finished” generating seeds of a given width.
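the “reaping” policy can be sketched in a few lines (the frontier representation, as plain integer seeds, and the size limit are my assumptions):

```ruby
FRONTIER_MAX = 200   # hypothetical frontier capacity

# when the frontier overflows, keep the largest-bit-width seeds and drop the
# shortest-width ones, so the width window can only move rightward
def reap(frontier)
  frontier.max_by(FRONTIER_MAX) { |s| s.bit_length }
end
```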
this code tries 6 generation algorithms, each also mixed with a random frontier selection, ie 2×6 combinations, and generates graphs of both the ‘g’ and ‘h’ increases by seed bit width (ie statistics over seeds sorted by bit width, similar to a histogram). a really neat way to look at results is in the windows file viewer, setting the display to medium size icons and sorting files by date ascending. then one gets a thumbnail layout like the following (captured with a screenshot). there is not a whole lot of variation except the ‘gw’, ‘wh’ methods tend to paint into corners aka local minima. while there are discernible nuances, other than the cornering it doesnt seem any method has a clearcut advantage/ is obviously superior to another. some cause ‘g’/’h’ to climb faster but the tradeoff is they are then presumably less thorough in the local search neighborhood (with maybe noisier statistics). this is a somewhat cherry picked batch with ‘gw’, ‘wh’ not cornering.
(3/17) this is not much code change but is a definite advance. the 1st idea is to just alternate thru all the methods, since they dont seem to have much different effect. then saw the cornering effect some more and tried to isolate it, and think have found the main or even nearly sole situation causing it. narrowing the “top n” selection down to a small number (eg 5) makes it more reproducible (for ‘gw’, ‘wh’). this is quite obvious in retrospect but it helps to state it; it seemed not obvious before stating it. what is happening is that the algorithm is sometimes choosing only a few distinct initial trajectory prefixes over all the trajectories, ie initial prefixes are shared, and these prefixes lead to terminated glides. then no new improved glides can be built out of the terminated glides, even though there is variance in them after the initial “terminating” prefixes.
so bottom line, this shows its crucial to get the early population/ growth of the frontier smooth. a simple fix is to rank exclusively by glide length in the 1st 100 advances and then use the other methods after this initial population is secure with good diversity (ie long glides). this final code is very reliable in not cornering, ran it dozens of times without any. it also seems to do well as far as the balance/ tradeoff of generating larger bit size seeds without too many iterations. so this is now alternating optimization sequentially over all 8 ranking methods hopefully generating a diverse distribution (TBD).
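the two-phase schedule might look like this (the method labels are hypothetical; the text only fixes the count at 8 and the 100-advance threshold):

```ruby
# hypothetical labels for the 8 ranking methods
METHODS = %w[g w h gw gh wh gwh r]

# rank purely by glide length 'g' for the 1st 100 advances to lock in a
# diverse base population, then cycle through all 8 methods round-robin
def ranking_method(advance)
  advance < 100 ? 'g' : METHODS[advance % METHODS.size]
end
```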
finally, heres a “nearly” side by side comparison with the genetic code as was earlier motivated. this GA code was based mainly on
distd and tuned as much as possible. as anticipated the GA code is not quite as performant in the average statistics: the average does not clear 20 at the end for 100 bit width seeds, whereas the prior algorithm consistently exceeds it by a hair. although in the GA the ‘h2’ maximum is more steady and seems nearly the same as for the prior recursive method
data10. also the prior algorithm makes ‘h2’ look like it is plateauing in the average statistics, but in contrast the GA has a clear declining trend. this is tricky because it would be a key characteristic relating to an invariant. also the GA runs far longer, maybe by a factor of ~5x or so. however, an interesting trick: in contrast to the last and many other algorithms, the graph can be generated right-to-left to figure out/ optimize its high-seed metrics 1st. note the GA used the fixed-width method for simplicity; not sure it makes any/ much difference. this was set to 10K iterations per bit width & it doesnt seem like increasing it affects the statistics all that much, but that relation does seem worthy of further study.
… some 2nd thoughts/ thinking aloud on the graph. the distributions are different for the GA vs the recursive approach in that the GA has a lower minimum. so just “shaving off” all the low points the GA returns would improve the average! but by how much? so again this is the challenge in comparing two blobs so to speak. unfortunately there is some sense in which both graphs being compared have arbitrary baselines! so what then is a “baseline” for these metrics, anyway? zen question of the moment! another idea that has been occurring to me lately is to take long glides and “perturb” them at various intermediate bit positions in the seed to obtain a variety of shorter glides. Under Construction/ Needs More Work!™
(3/23) 😮 ❓ 😳 some setback! decided to try to fit those results via regression and the standard metrics devised so far. the fit was bad/ null/ melted away! esp compared to earlier results! but then looked back and saw that there was only one semi-solid result along these lines at the beginning of the blog,
dist.rb, and subsequent code was significantly altered and jumped into studying the glide max iterate bit width with good results, but there wasnt anything else on ‘h2’. there were various adjustments to the generation algorithms since then, so then had to reverse engineer those earlier results & figure out what got mixed up; its somewhat like a (code) regression! was a bit mystified/ stunned!
but this is part of the idea of saving old code/ results (via archival/ blogging) very carefully. after quite a bit of effort to isolate/ replicate this (“successfully”), it appears some of the difficulty/ discrepancy is in the different “balancing” approaches that select points via the histogram calculation. surprisingly, the 1st algorithm, which has a 1st-only queue for the histogram balancing, seems to outperform other approaches significantly. who would have thought? this code has 3 different balancing algorithms and resets some of the earlier generation logic.
the 1st two graphs are two “post” balancing algorithms and the 3rd is the 1st-only queue original approach. from visual inspection the 3rd, while not a strong correlation (as earlier), tends to do better whereas the others are essentially null results! so then thought about how to quantify this & decided to look at correlation coefficients to compare the fits more quantitatively. also my early guess at the explanation is that the seeds in the 1st-only queue balancing have more correlation in their bits.
💡 ❗ 😮 however, aha! didnt realize this until looking at the graphs some more: the 1st two methods exclude lower ‘h2’ values less than 10 and the 3rd goes all the way down to ~2. could that be a key part of, or entirely explain, the discrepancy? and maybe the longer the algorithms run, the more they lose lower ‘h2’ values, driving down the regression fit quality? have to nail that down! the sharp (honestly, presumably apocryphal/ imaginary) reader may recall some earlier results showed how critical the lower points were in improving the linear fit, where removing them seemed to completely remove it. my confidence in the basic intuitive plausibility of this picture increases after further thinking. ❓
this is some new sophisticated code that implements the correlation coefficient calculation, with some rescue logic to handle the rare linearly dependent edge cases. ‘r’ ranges between -1 and 1, representing total negative to total positive correlation; 0 is no correlation detected. its a running average for starting seed width 50 after 80 iterations. there are 3 balancing algorithms ‘x’, ‘y’, ‘z’, where ‘z’ is the 1st-only queue method. this code confirms the suspicion that the 1st-only queue balancing ‘z’ gives a superior fit with r about ~0.5. somewhat surprisingly, despite the prior poor visual indication, the other methods ‘x’, ‘y’ measure at “not null/ not bad”, about 0.35, neither much different from the other.
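the calculation with the rescue logic is the standard pearson r; a sketch:

```ruby
# pearson correlation coefficient with a guard for the linearly dependent /
# zero-variance edge case (returning 0.0 there is an assumption on my part)
def pearson_r(xs, ys)
  n = xs.size.to_f
  mx, my = xs.sum / n, ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy = Math.sqrt(ys.sum { |y| (y - my)**2 })
  return 0.0 if sx.zero? || sy.zero?   # rescue: a constant series
  cov / (sx * sy)
end
```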
there were 3 binary modes on the distribution logic for 2^3 = 8 combinations, and differences in them were not entirely negligible; its tight for ‘x’, ‘y’ but interestingly the spread broadens for ‘z’. the best performing method is ‘100z’. that corresponds to fixed width, top 10% selection, and no duplicate detection. am thinking maybe some of the reason the prior graphs were a little deceptive is that the two curves are not both scaled by their standard deviation, which is maybe more in line with what the correlation coefficient is measuring. note also this has only half the 0/1 bit run variables; the other 3 are commented. adding them in improves ‘r’ a few percent to about ~0.6 for the ‘100z’ method.
@#%& more learning the hard way! ran into a glitch with the automated gnuplot plotting. it looks like when gnuplot is called from the command line via the ruby hook, there is some limited/ fairly small buffer size on command lines. it was identified with a long plot command and led to blank images being created. there were 8 elements plotted in a long plot line and taking off the 8th fixed the problem! this took quite a while to isolate/ debug. an unexpected limitation in the previously bulletproof/ workhorse plotting code. separating lines with backslash characters didnt seem to fix it. instead this code outputs the plot command file as
plotn.cmd which can be executed with a gnuplot
load 'plotn.cmd' command. loading from the gnuplot console bypasses the limited-buffer glitch. calling the same load command via the command line interface still yields the bug! therefore it somehow seems to be restricted to the ruby command line call.
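the workaround amounts to something like this sketch (the 8 data columns and filenames are illustrative, not the original code):

```ruby
# build the long plot command with one line continuation per series and dump
# it to plotn.cmd, to be run from the gnuplot console with: load 'plotn.cmd'
series = (1..8).map { |i| "'data.txt' using 1:#{i + 1} with lines title 'c#{i}'" }
File.write('plotn.cmd', "plot " + series.join(", \\\n  ") + "\n")
```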