last month wrote out a remarkable outline, did anyone catch that? it seems potentially gamechanging and have been rolling it around in my brain, impatient to work on it, with background excitement + methodical intent both mounting, a tricky combination to balance, like walking a tightrope, with a risky urge to rush building. was very busy last week with a family visitor on a rare trip, plus another key family member mixed in, and couldnt get to it quickly. after thinking it over carefully, the basic idea is a mix of 2 patterns already explored: nearest neighbor classification plus the hybrid search algorithm. the NN algorithm handles creating/ matching a finite set of classes, and the hybrid search algorithm tries to find samples that “break” the classification, or improve it; those two sometimes go hand in hand.
the sample relationships are mapped out in a DAG-like structure over the classes (strictly a directed graph, since loops can appear). the concept of breaking the classification (via optimization parameters) is still fuzzy/ indistinct but some ideas have been laid out. my immediate idea was that classes with a high out-degree are more “ambiguous” and therefore less desirable. on further thought a key criterion is looking for loops in the graph, which relate to indistinct feature classes creating an illusion of possible nonterminating sequences, although am not sure exactly/ fully yet how loops relate to class ambiguity (with outdegree a rough measure, or maybe others); it seems likely there is some relationship but its not immediately obvious.
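since loops in the class graph are the key structure to detect, heres a minimal sketch (in python, names hypothetical) of finding one; the graph is just a successor map from class id to list of successor class ids, and an iterative depth-first search with 3-coloring returns one loop (with the entry node repeated at the end) or None:

```python
def find_cycle(succ):
    """find a loop in the class graph. succ: node -> list of successor nodes.
    returns a list of nodes forming a cycle (entry node repeated at end),
    or None if the graph really is a DAG. iterative DFS with 3-coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color = {u: WHITE for u in succ}
    for start in succ:
        if color[start] != WHITE:
            continue
        stack = [(start, iter(succ.get(start, ())))]
        color[start] = GRAY
        path = [start]                    # current gray chain
        while stack:
            node, it = stack[-1]
            advanced = False
            for v in it:
                if color.get(v, WHITE) == GRAY:
                    # v is on the current path: loop found
                    return path[path.index(v):] + [v]
                if color.get(v, WHITE) == WHITE:
                    color[v] = GRAY
                    path.append(v)
                    stack.append((v, iter(succ.get(v, ()))))
                    advanced = True
                    break
            if not advanced:              # all successors explored
                color[node] = BLACK
                path.pop()
                stack.pop()
    return None
```

note a self-loop like `{'a': ['a']}` is found immediately, which ties in with the self-mapping issue that shows up later.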
❗ 😮 this code has not been very thoroughly tested but it seems to already create a remarkable result, “beginners luck”?! it was nontrivial to write even this rough version, not because its a lot of code, but because its tricky. it is set to generate 4K samples of 50 width iterates and the prior/ familiar “bin” structure is utilized/ reused to build/ save the 500 classes. it seems to quickly find/ settle on a somewhat “stable traversal” of arbitrary iterates while preserving a 1 outdegree over the DAG. the features added over time are drawn in the diagrams in 3 detail ranges: early, middle, late. the early and late ranges are somewhat random and the middle range is very/ extraordinarily orderly.
- the features (for now) are entropy and density, with additional features for the low/ high lsbs/ msbs, and midpoint aka center of weight.
- the code measures outdegree as the variable ‘nnc’ aka “nearest neighbor count” and then attempts to minimize it. it is not fully clear yet if that is really the key criterion or if this optimization is handling it as intended. ‘nnc’ is both a local and a global property of the iterate; it cannot be derived until all points are classified. since this is expensive with the NN algorithm its reapplied only every 100 iterations along with the feature statistics (prior average + standard deviation calculation to optimize based on the “gaussian z-norm”).
- there is some other new code to streamline the hybrid solver such as simplifying/ shortening the parameters on the init methods, and a new remix mutator that shuffles bits in an iterate either contiguously or noncontiguously. the prior mutb code was discovered as not working as intended, it was only mutating a prefix, so that was fixed.
- the iterate block size is specified as the parameter ‘c’ on line 81. setting it to 10 led to runaway (ie apparently increasing without limit) outdegrees ‘nnc’ as samples increased. earlier it was thought maybe larger block sizes would be preferable by some reasoning, but here they immediately are not. the outcome found here is that larger block sizes tend to “scramble” the features in a nonlinear way and therefore disrupt the classifications by increasing outdegrees of the DAG.
- an immediate unexpected subtlety comes up in the class discrimination. early on the hybrid algorithm builds classes and finds some samples that are equidistant from 2 classes, so the code had to be adjusted to return “all matching classes,” a set of zero or more. this logic is in the
- another consideration of current code is that in the adj subroutine the logic was changed some to attribute a 0 ‘z’ score to newly generated candidate iterates, because the actual score is only available every 100 new iterates with the heavy nearest neighbor calculation and is not recalculated for new iterates outside the interval. my expectation is that this means new unevaluated iterates are given an average score, which seems as desired, but need to check that and understand how it affects the overall dynamics.
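for reference, a minimal sketch of the “gaussian z-norm” scoring as described (names hypothetical), with new/ unevaluated candidates defaulting to z = 0 ie exactly average:

```python
import statistics

def z_norms(values):
    """gaussian z-norm: (x - mean) / stdev over the current population.
    in the real code this is recomputed only every 100 iterates, alongside
    the expensive nearest-neighbor pass."""
    m = statistics.mean(values)
    s = statistics.pstdev(values) or 1.0  # guard against zero-variance features
    return [(x - m) / s for x in values]

def score(z_by_id, iterate_id):
    """new candidates have no score until the next full pass; defaulting
    to z = 0 treats them as exactly average."""
    return z_by_id.get(iterate_id, 0.0)
```

so an unevaluated iterate neither helps nor hurts its bin ranking until the next recomputation.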
it would seem the outline is already bearing substantial fruit. the (probably) very hard upcoming test is to uncomment the iterate size operators and see if stability can be achieved even with different size iterates.
😳 (later) its basic. the code is outputting the final 500 bin feature classification in the time order they are found. the graphs show a lot of intermediate classes scattered and then connected in the graph with lines, but the lines dont represent gradual changes in the features over time. however, the beginning/ ending clustering still needs to be analyzed/ explained further. the graph was dashed out and its clear some new analysis ideas/ techniques/ logic are needed to visualize the dynamics and outcome, esp looking for stability properties, still not knowing exactly what they are/ look like. one immediate challenge is that every time a single new class is found, the classifications potentially change en masse, but recomputing them every iteration is too expensive. it is not easy to write an incremental NN algorithm but its looking like a near necessity sooner rather than later. an immediate idea is to retain the distances of nearest matches for the recalculation/ recomparison.
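the retained-distances idea might look roughly like this (hypothetical names/ structures): caching each point’s current nearest class + distance means adding one new class only costs one distance comparison per point, instead of a full reclassification:

```python
import math

def add_class_incremental(points, classes, nearest, new_class):
    """incremental nearest-neighbor update sketch. 'nearest' caches, per
    point index, (class index, distance) of its current best match. a newly
    added class vector can only change a point's assignment if it is closer
    than the cached distance, so one pass over the points suffices."""
    classes.append(new_class)
    ci = len(classes) - 1
    for pi, p in enumerate(points):
        d = math.dist(p, new_class)
        if pi not in nearest or d < nearest[pi][1]:
            nearest[pi] = (ci, d)
    return nearest
```

(removing a class is the harder direction, since points assigned to it then need a real re-search; this sketch only covers additions.)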
⭐ ⭐ ⭐
elsewhere, other striking news. met a young lad AA on the SE physics chat line who is interested in working on the AP Research Paper class… and was seriously considering… at my suggestion… fluid dynamics + particle physics. a remarkable new course offered by AP that has the student write a scientific paper with focus on novelty and contributing to a known gap. it seems safe to say there are very few college undergraduates who could even pull off such a feat, let alone gifted HS students. am getting a very strong urge to write up this effort/ project/ outline in more detail along with all my accumulated physics research since last time (now over a staggering… 2 years!), but its a bit of a miniherculean task at the moment, bordering on thankless also (ouch). it would seem that after decades of trying, nobody appreciates visionaries much… or at least this one… one can take it either personally or impersonally but the prognosis is not pleasant either way…
nevertheless wrt the advising scenario playing out briefly, have been looking for an opening like this for years, or nearly ~½ decade, and had no idea early on it would be so rare + utterly elusive over time. his effort/ engagement/ energy helped fuel my own but hes disappeared for the moment, with no signs of returning. sigh. quickly getting my hopes up then quickly dashed. as those with bipolar disorder might say with particular poignancy/ desperation, you win some, you lose some, or, its a rollercoaster.
⭐ ⭐ ⭐
(10/7) 🙄 closer look. the basic structure is in place but as usual the devil is in the details and a single line of logic off can change the picture/ results. the prior hybrid code had a wrinkle in that it didnt resort the bin arrays after recalculating the parameter z-scores at line 449 in current code. but at line 450 theres a glitch where current code sorts in reverse of the intended order via the rank subroutine. this affects the code that throws out underperforming bin widths in the add subroutine at line 191, but the sort is enforced prior to that at line 190. so maybe all that is affected is the code to choose the top ¼ bins at line 460, which is not affected for runs with a single bin as in this setup.
(later) 😳 did anyone notice the multiple weasel/ hedge words? bugs galore! on closer look/ further painstaking analysis of prior code, there are not 1 but 3 subtle typos/ logic/ subscript mixups in the states, near subroutines, basically mixing up the in/ out features, ending up with code that maps each DAG node to itself, unnoticed until carefully analyzing the DAG structure with new code. beginners luck haste/ waste, aka “too good to be true”…
(later) 😳 another so-called “fencepost” glitch found in the batch logic in the seq subroutine. a batch count of 1 means only 1 iterate generated with no subsequent iterate calculated, ie 0 steps; in other words the “from” and “to” features are identical for the same iterate. at least these bugs are utterly subtle to the naked/ untrained eye… caveat, leave this work only to the professionals, wink
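the fencepost issue is the classic off-by-one between counting states vs transitions; a minimal illustration (standard collatz step assumed, function name hypothetical):

```python
def batch(x, b):
    """apply b collatz steps to x, returning the list of b + 1 states.
    the fencepost bug is conflating the two counts: looping b - 1 times
    (or returning only the seed when b == 1) makes the 'from' and 'to'
    features come from the same iterate."""
    states = [x]
    for _ in range(b):  # b transitions => len(states) == b + 1
        x = 3 * x + 1 if x % 2 else x // 2
        states.append(x)
    return states
```

so with a batch count of 1 there is still exactly one transition, and the “from”/ “to” features differ as intended.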
(later) ❗ ❓ even with all that fixed, the classes are frequently “self-mapping,” ie an iterates features change too slightly to “point it” outside its own class in the DAG. so it seems some other optimization criterion is needed, and “ambiguity” is related to self-mapping.
(10/9) 😳 😡 a whole day burned on an (apparently neighborhood-wide) internet outage and hours spent trying to isolate a mind-melting subtle defect. at least the illustrious cable provider who shall not remain nameless (comcast) “helpfully” provides info on the outage and expected fix time. “we are working hard to get your service up and running.” lol, sounds like my own IT job. except that the fix time has been updated about 5, make that 7, times now, pushed 3 hours into the future each time. @#%& … at line 380 the index should be ‘x’ instead of ‘i’ in the map code to find the (equidistant) nearest neighbors. this had the effect of adding multiple repeated iterates. its been like a difficult/ painstaking scavenger hunt for days. also added some more sophisticated analysis code.
the new code seems to be behaving as expected/ working as intended. however, the equidistant/ multiple matching classes issue seems to be widespread. need to figure out some kind of strategy/ fix on that.
added some other analysis: the code looks at in-degrees and out-degrees of the class DAG by keeping track of the iterates mapped onto each DAG class. these set sizes were found to correlate with ‘nnc’. also tried adding some new features to improve discrimination, ‘ea’, ‘da’ entropy and density distance. looked at a variety of optimization approaches.
💡 ❓ the self-mapping involves gradual feature changes and this leads me to think of an adaptive iteration count algorithm where maybe some classes utilize extra iterations to shift the features…?
(10/10) some graph reformatting and some results to display. these are not esp definitive/ edifying/ resounding but have done a lot of work without posting anything and there are some hints about future directions therein, and while timeconsuming this blog is largely written in the spirit of “train of thought/ stream of consciousness/ play-by-play commentary” (finding it has the natural benefit of organizing/ sharpening thoughts/ ideas/ approaches to write them out). aka prototype(ing)/ “proof of concept.”
my immediate idea is to go after low-hanging fruit and try to decrease self-loops (“self-mapping”), which are simple to detect/ analyze and, as mentioned/ alluded to, immediately turn out to be problematic and apparently not at all easy to remove. it is not obvious how to decrease them. my 1st idea is to just have a binary indicator 0/1 called ‘a’ for whether a node has a self-loop and then attempt to minimize it. gradient descent is not so great on discrete few-valued/ even binary parameters, but this is no ordinary optimizer and it does succeed to some degree.
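the binary indicator ‘a’ is trivial to compute from the class successor map (sketch, hypothetical names; the optimizer then minimizes the sum):

```python
def selfloop_flags(succ):
    """binary indicator 'a' per class node: 1 if the node maps to itself
    (a self-loop in the class graph), else 0. succ: node -> successor list."""
    return {u: int(u in vs) for u, vs in succ.items()}
```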
some of the parameter combinations did decrease self-loops to around ~100/500 class points ie about ⅕ total, so its reassuring all this complex framework can have some effect at least. also several of the parameters tend to be correlated with the self-loops in the bin plots, but then attempting to push on them in the opposite direction did not manage to bring down the self-loops much, leaving them at ~½, which seems to be near random expectation.
‘ni’ is class node indegree, ‘no’ class outdegree, ‘sio’ sum of both, ‘dio’ difference of both. the features are graphed in narrower dashed lines in the background on the left 0-1 scale. the graphs are the 500 bins ordered by aggregated (sum over all optimization variable z-norms) ‘z’ optimization value. the 1st graph/ run/ setup had 318 selfloops at end. the 2nd graph does better on optimizing and ended with 149 selfloops, and the 3rd graph shows the associated grid plot of points along with noticeable density and entropy clustering low/ high; the self-loops tend to be rightward with low density/ higher entropy. gnuplot does a remarkably good job on the very high count of 25 separate variables! in terms of total variables/ datapoints, one of the most complex graphs ever posted here. nobody ever said it will be simple/ easy.™
so, its what might be called an opening volley with a lot of new vistas opening up, but along with substantial newfound intricacy/ complexity/ dynamics. it seems inevitable that some analysis code will have to be built to find all loops, not just the simple ones. fortunately the field of graph algorithms is well explored terrain and there will be many standard approaches available for some of the basic tools needed.
( 😡 @#%& the wordpress editor code seems to have been modified so that the code format loses tabs, and also gets mangled on ipad. workaround to use syntax highlighter block instead, hope they dont mess that one up too, sigh, argh…)
(10/13) hmmm, ruminating; all that seems worthwhile/ even promising and not entirely inconceivably a viable approach, however, a few thoughts on all that, aka “cold light of day”…
- feel am jumping the gun too much on this right now with too many moving parts without strong coordination/ cohesion yet so to speak. need to understand/ build out more about the overall theory and eg what are the ambiguous classes and build code to target/ work with them. as already understood/ explored somewhat, near center density/ entropy is likely to be a problematic range.
- class ambiguity is just another manifestation of the long recognized “curse of undifferentiability.” it is not immediately obvious the class-based approach is inherently more powerful vs existing/ prior ideas and will overcome it.
- there has been a huge amount of analysis of/ focus on various metrics that are key yet not scale invariant as initially imagined. for example, many metrics long studied are tracking iteration counts eg ‘cm’, ‘cg’, ‘cmnw’, etc.; and its a basic realization now dawning that raw iteration counts are scale variant. in defense there has been some analysis of nearby/ closely related scale invariant iteration measures such as vertical scale, horizontal scale, etc, although those in particular havent been studied in a long time. and maybe all the scale variant measures have “natural”/ corresponding scale invariant versions.
- part of the idea of a scale invariant iteration measure seems to suggest that iteration batch sizes should scale with initial bit width.
so then how does one look more at scale invariant iteration counts? heres a quick approach on that. the 1-lsb-triangle glides have been helpful to study lately, and some of the basics were covered earlier this year eg 1/2020 (this is noticeably similar to construct100e). for them here is a simple/ quickie scale invariant iteration measure that hasnt been measured/ noticed until now; remarkably it turns out cg / nw ≈ 4! (‘cgnw’ magenta right side scale.)
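a quick sketch of the measure (assuming ‘cg’ = glide length ie steps until the iterate 1st drops below its start, ‘nw’ = starting bit width, and the standard collatz map; n > 1 assumed since 1 cycles and never drops below itself):

```python
def glide_ratio(n):
    """scale invariant iteration measure sketch: cg / nw, where cg is the
    glide length (number of collatz steps until the iterate first drops
    below the starting value) and nw is the starting bit width. the
    definitions of 'cg'/'nw' are assumed to match the blog's usage."""
    nw = n.bit_length()
    x, cg = n, 0
    while x >= n:
        x = 3 * x + 1 if x % 2 else x // 2
        cg += 1
    return cg / nw
```

eg glide_ratio(3) gives 3.0 (glide 3 → 10 → 5 → 16 → 8 → 4 → 2 is 6 steps over 2 bits), while even starts drop immediately and score low.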
- the idea of mapping the problem onto the feature space generally seems to be a way of converting the discrete difference equation into a continuous one. however, while not yet quantified in more detail, its somewhat clear a priori where the problem will be, as already alluded to. the features are very discontinuous (aka noisy) between iterations. a possible general antidote/ solution pattern, already familiar/ longstanding/ applied elsewhere, is averaging techniques, eg over “adjacent” sequence batches/ blocks.
- need to find other/ addl/ better ways to quantify/ visualize class ambiguity, am getting some ideas on that. much of it will be visualizing a/ the (node) graph of the interconnections.
- can already see where this is leading/ headed, as already outlined, most of the class ambiguity will probably be with the already long noted/ explored undifferentiated ½ density iterates, and ufos are already long understood as a kind of antipattern to “break” the general classification of them as in the drain. a key question will be how/ whether/ if adding additional (scale invariant) features eg entropy/ midpoint can help better differentiate/ “untangle” the dynamics/ outcomes.
- eg recalling the old MDE model, iterate size change is at the center of this. another key related question still open is how features + ufos interrelate to increase/ decline of the sequence; the now-longstanding conjecture that they dont exist in the postdetermined glide is still unresolved/ unfalsified.
- am now thinking of generating samples/ mapping classes starting without the genetic algorithm and using simpler approach/ prior techniques of density or Terras density ranges.
(10/18) finished some classification code (more on that shortly) + ruminated on some big ideas. went back thru old blogs to try to see/ pinpoint where they originate. am realizing that while Perron 1929 has been cited a few times on this blog, the big reveal hasnt been explained yet. pieces are seen in some earlier blogs, but dont think its been laid out.
Perron 1929 is a huge hint about a possible general proof structure, a kind of rosetta stone. its a fairly simple linear model, but the key concept is that it looks at stability of the model. the idea is that the model can be perturbed somewhat by noise, but retains certain convergence dynamics, aka stability. the key concept to keep in mind is that when the (collatz) problem is translated into the model, the model is an approximation and will have error associated with it compared to the “real thing” (in this case collatz sequences). this error can be measured/ estimated. then the stability of the model comes into play. the model will have some tolerance for error/ noise and still converge, ie still simulate a terminating sequence with some error introduced.
then, one has to try to show that the error is bounded such that any “error” associated with the model vs real problem is less than what interferes with/ “breaks” the model stability (ie convergence of the model to terminating sequences). in other words even as the model is periodically perturbed, it still “converges” aka here “terminates.” afaik, this crucial concept has not been explicitly stated/ highlighted so far, although it seems to be hinted/ alluded to in some of the earlier MDE work/ experiments. there has been a lot of imprecise musing (ie near speculation) about trying to deal with error in the proof structure for years, but this basic realization just now written out is utterly fundamental/ central. it has to do with how the error relates to/ (doesnt) disrupt(s) the “correct” overall/ bottom line model prediction which is (the binary decision) of terminating vs nonterminating.
the earlier MDE model was useful and captured some signal. it was probably always in the back of my mind to return to the problem with some “enhanced” model but never really imagined what it could be until now. also the MDE had some parameters for max run lengths, and iirc, they were not scaled. so there are some new realizations about how a model maybe can capture most of the problem dynamic with scaled features alone. so overall some not-exactly-earthshaking new povs, but they are now understandable as absolutely crucial details/ embellishments in attacking the problem.
so, taking a look at/ surveying some of the earlier recognition: Perron 1929 was 1st identified here on 10/2017 almost exactly 3 yr ago (“time flies when youre having fun™”).
the perturbb experiments from the same month have basic error calculations of the MDE vs real sequences. but the idea of error limiting being key to the proof is not stated. looking back, there are probably some dependencies on trajectory lengths in those graphs that were not further analyzed/ examined.
looking further ran across these musings on 2/2018 which come close/ near miss with the “leaf in the wind” analogy. the idea there was something like that theres a general trend in the model/ trajectory, the “wind”, and then local sharp fluctuations (depending on “leaf orientation”), somewhat corresponding to the noise term. however, looking back, while this is quite similar/ relevant/ evocative, the analogy doesnt quite carefully separate the concepts of model vs “raw trajectories.” the leaf/ wind metaphor is also quite related to the “(jostled) marble in a bowl” stability concept written out 9/2017.
something almost identical to the Perron stability ideas was sketched out on 4/2017 where it was explicitly named stability along with a 1st cut on computing the model vs raw error in data22. the idea of convergence in the face of noise is sketched out there, but it still appears to be missing thinking/ talking about it in terms of terminating vs nonterminating sequence predictions by the model.
data33 on 5/2017 has an actual perturbation experiment on a simpler low-dimensional density model/ fit, and looks at the transition point where the error shifts the model from converging to nonconverging ie a terminating sequence to a nonterminating one.
data35 looks at perturbation transition point wrt bit size scaling. actually these nearby lines written ~3½yrs ago nearly nail it:
“this range of ‘k’ [transition point cutoff] is crucial and can be called something like the particular tolerance of the stability property. next steps are to determine whether the delta between real collatz trajectories comes anywhere near the tolerance range. … to summarize, after years of wandering in the wilderness so to speak, the question of a proof can be reduced somewhat/ nearly/ partly to a numerical problem involving a single numerical gap measurement. presumably this gap would be lower with higher accuracy models…”
in other words the error between the model and real problem can be measured/ computed, and different models with higher sophistication/ fidelity (eg more features in a model, or a more sophisticated model on the same features, etc) will tend to decrease the inaccuracy of the model. this actual comparison was then made with data34: “the “other” news is that the errors are not low enough to fit within the tolerance (k≈0.007 from previous experiment) by 1-2 orders of magnitude.” next, most of the experiments on 6/2017 can now be seen as looking at convergence vs divergence dynamics for the MDE model more carefully over individual feature perturbations etc.
looking that all over, after refamiliarizing some, in defense of earlier writing, the concepts were nearly written out fully and even largely quantified/ directly experimented on, even extensively in data35, data36, data37, data38, data40. some of the new thinking/ paradigm seems to hinge on the difference between convergence and termination. they are not necessarily the same; whereas earlier the two were (probably, apparently) used interchangeably, it seems some subtlety here is possible. it seems possible a difference equation could “converge” to some values/ attractors but not also represent a terminating sequence.
(later) there was some work looking at the class interrelationship graph associated with binary density ranges, but it came up as uninformative/ “mush”/ inconclusive/ noisy/ null. after multiple intermediate experiments on that, for now undocumented, this is nearest neighbor classification code that seems to function ok/ correctly/ as intended. it needs a lot more analysis but just this much took quite a bit of time to arrive at and want to post this intermediate result. it uses the latest density/ entropy/ midpoint features. it creates 250 classes via density samples and then works to optimize the total classes, culling/ decreasing “worst performing” classes while still improving fit. it is not a small feat to remove data and improve accuracy, but it succeeds at least on test error; need to add a bit more code to look at train error.
“worst performing” is a vague term that is here quantified as having the highest average error associated with predictions involving that class, graphed as ‘er2max’ green. am getting some weird sense of deja vu with this code, suspect have written something like it years ago, but maybe not associated with this blog; cant seem to recall anything close to it. have not heard of ML algorithms that improve classification by throwing out classes like this, but its a not-too-hard idea to implement. there is an open question about whether creating “artificial” class vectors (not drawn from generated samples), a more typical ML classification approach afaik, could improve the classification.
it was found with other nearest neighbor code that taking multiple nearest neighbors and doing (weighted) averages can improve optimization and maybe need to try that out some here. some other nearest neighbor algorithms tried out in the past eg 1/2017 were very noisy and ambiguous in improvement quantification and its heartening to see this very sturdy/ consistently reproducible result. the nearest neighbor prediction is expensive computationally but here model fidelity is the overarching criteria and computational expense is almost not a consideration/ constraint at all.
after kind of rambling/ chatting about classifications/ classes for awhile now, and some discarded code ideas later, it finally dawned on me (again, deja vu; this was utilized at length previously with MDE models) that the basic prediction/ classification variable probably ought to be sequence rise or fall. so this code uses that as the prediction parameter and attempts to minimize (class/ classification) error. it uses a running 10 sample average on pre/ post values, which deals with/ mitigates the feature gradation mentioned previously and high feature noise. the code doesnt actually delete classes, it marks them as removed. consistently after around ½ of the 250 classes are removed the test error starts to go up again (red line ‘ea’), ie overfitting. the blue line is the improvement in the test error (percent of unoptimized/ initial/ starting error), right side scale, decreasing to 70%, ie consistently about 30% improvement is found, a not-unimpressive result.
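the culling loop can be sketched as greedy removal of the highest-error class (hypothetical names; the real code marks classes removed rather than deleting, and would re-evaluate test error after each removal to catch the overfitting turnaround):

```python
def cull_classes(classes, class_err, n_drop):
    """greedy culling sketch: repeatedly mark as removed the class with the
    highest average prediction error ('worst performing', cf 'er2max'),
    stopping after n_drop removals. returns the set of removed class ids."""
    removed = set()
    for _ in range(n_drop):
        worst = max((c for c in classes if c not in removed),
                    key=lambda c: class_err[c])
        removed.add(worst)
    return removed
```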
💡 also, if classes are determining/ predicting sequence rise/ fall, then the concept of self-loops is clarified/ simplified. a self-loop on a “decreasing” (prediction) class implies convergence/ termination, whereas a self-loop on an “increasing” class implies divergence/ nontermination.
much more needs to be done as now extensively detailed/ outlined, but the basic finding/ sanity check of predictions and improved predictions shows the (scale invariant!) feature set is apparently at least roughly modelling the problem. note: there is some commented Terras code in here to look at that generation scheme, it seemed to perform somewhat less effectively for prediction.
(10/20) this is some nice code on an idea that occurred to me hasnt been studied very carefully: what are the longest glides possible with exactly ½ density? there are similar experiments eg wrt random samples but not using a high-performance optimizer. part of the challenge is that the optimizer would seem to have to throw out many generated candidates that dont match ½ density. the new remix operators on the hybrid algorithm help with this, since they dont change density. this needs more analysis but the intermediate results are already notable.
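a minimal sketch of what a density-preserving remix operator might look like (hypothetical; shuffling bits, contiguously or not, permutes positions without changing the 0/1 counts, hence the density is preserved):

```python
import random

def remix(bits, contiguous=True, rng=random):
    """shuffle bits of an iterate (list of 0/1): either a random contiguous
    block, or a random noncontiguous subset of positions. since only
    positions move, the 0/1 counts (and so the density) are unchanged."""
    out = list(bits)
    n = len(out)
    if contiguous:
        i = rng.randrange(n)
        j = rng.randrange(i, n)       # shuffle the block out[i..j]
        seg = out[i:j + 1]
        rng.shuffle(seg)
        out[i:j + 1] = seg
    else:
        k = rng.randrange(1, n + 1)   # pick k scattered positions
        idx = rng.sample(range(n), k)
        vals = [out[p] for p in idx]
        rng.shuffle(vals)
        for p, v in zip(idx, vals):
            out[p] = v
    return out
```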
😎 theres another neat trick in this code, dont recall if it has been used before in this blog, but have known about it/ tried it for years, and it works really well here. the algorithm simply tracks whether candidates are thrown out wrt the different data-altering operators. then it chooses operators randomly based on the weighted balance, in other words the more effective operators are given higher precedence, or in a sense theres a selection bias toward the most effective ones; the less effective ones are still chosen but with less frequency, in short “frequency proportional to effectiveness.” its a sort of ML trick inside the ML algorithm, where the algorithm learns about the efficacy of operators using a simple/ straightfwd arithmetic/ statistical sampling method. this technique greatly improves the effectiveness ratio here, because in this case theres a very wide variance in effectiveness over the operators; actually its essentially whether the operators preserve density or not, although different density-preserving operations still have measurably different long-range effectiveness.
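a minimal sketch of the frequency-proportional-to-effectiveness trick (hypothetical names; laplace smoothing keeps the less effective operators from being starved entirely):

```python
import random

class OperatorPicker:
    """track per-operator accept/reject counts and pick operators with
    probability proportional to their observed acceptance rate."""
    def __init__(self, ops):
        self.ops = list(ops)
        self.accepted = {op: 1 for op in self.ops}  # laplace smoothing:
        self.tried = {op: 2 for op in self.ops}     # start at rate 1/2

    def pick(self, rng=random):
        weights = [self.accepted[op] / self.tried[op] for op in self.ops]
        return rng.choices(self.ops, weights=weights, k=1)[0]

    def feedback(self, op, accepted):
        """record whether the candidate produced by op was kept or thrown out."""
        self.tried[op] += 1
        self.accepted[op] += int(accepted)
```

usage: call pick() to choose the next mutation operator, apply it, then feedback() with whether the candidate survived the density (or other) filter.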
❗ 😮 the ratio cg / nw is renamed ‘cgnw2’ (think there was some prior code where ‘cgnw’ is a difference; trying to minimize confusion). the algorithm optimizes over the various set of trajectory size measurements. there are 2 runs with 50k samples each, a width-constrained and a width-unconstrained, and 2 graphs each, one of the run dynamics and the other the grid plot of iterates, both sorted by decreasing to increasing ‘cgnw2’ left to right, and with significant shape/ structure/ pattern(s) showing up in the grids. it is finding cgnw2 ≈ 6 in some cases. this is surprising! was expecting nothing larger than the ≈4 found in prior experiments. this may even be enough to break some hypotheses. the idea up until now is that the undifferentiated region still has a sort of typical behavior to it, of being drainlike. this finds completely non-drainlike behavior ie long glides in the undifferentiated region. ofc immediately they need more examination/ analysis for various aspects/ features. …
… but then also an immediate fineprint in the 4th graph is that the algorithm seems to be biased toward lower iterates to find the higher/ highest ratios. it would seem to suggest (or explicitly reveal?) that as bit size increases, high ‘cgnw2’ ratio iterates become scarcer to the point of nonexistent. or, as always, this could be an artifact, and they still exist but are just increasingly hard to find/ missed by the algorithm, and/or the algorithm is going after (easier) “low hanging fruit.” but it does seem notable the algorithm focuses on bin widths less than the starting width and the highest cgnw2≈5 is for
(later) adding onto that, plotting the 5th graph, its ‘cgnw2’ vs ‘nw’ for the 2nd variable size run data and it finds a distinctive wedge shape. however, some of this is likely to be attributed to the optimization algorithm dynamics/ search that is centered around the starting bit width 50. which leads one to wonder if there is some kind of analysis that can tend to highlight possible optimization bias/ limits/ artifacts; its long sought indirectly and would be very valuable…
(later) ❗ 😮 🙄 something has been long nagging at me (maybe a slogan/ motto/ theme/ tagline for this overall blog/ project!). last looked at on 8/2020, max scaled 0/1 runs for ½ density iterates have a characteristic curve that is similar (“nearly identical”) to the postdetermined glide section. the following remarkable factoid may have been previously noticed but its unreported until now. it was also imagined somewhere that maybe the scaled 0/1 runs formula (expected value) was not simple. but it is! they scale simply as log(log(n))! this is a graph of max 0/1 runs divided by log2(bit width) or log2(log2(n)), and the difference between the two, which is negligible (they nearly perfectly overplot). so it makes sense to (“further”) scale the feature (scaled max 0/1 runs) wrt this near constant to get the scale invariant version. max 0/1 runs have been studied for years, and it seems the scale invariant version wasnt arrived at until this moment. as the saying goes, today is the 1st day of the rest of your life™… suspecting this is probably derivable by some better mathematician from the binomial distribution…
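a quick helper to measure the feature (max 0/1 run length over the binary expansion); per the factoid, its expected value for random ½ density iterates grows like log2(bit width) = log2(log2(n)):

```python
def max_run(n):
    """longest run of identical bits (0s or 1s) in the binary expansion of n.
    dividing this by log2(n.bit_length()) ie log2(log2(n)) gives the
    (near constant) scale invariant version discussed above."""
    bits = bin(n)[2:]
    best = cur = 1
    for a, b in zip(bits, bits[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best
```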
(10/21) at the moment have a lot of prior work to ponder, new/ emerging ideas, and different directions to go in. some effort is spent on expanding optimization frameworks and other effort is spent on utilizing them. was starting to wonder how much different optimization parameters affect the outcome of the hybrid approach after not noticing significant differences on changing them. this compares 3 different arrangements. it is not easy to compare methods, but this just looks at evolution of average iterate features (values) during the optimization (over the saved bins/ candidates). these averages have been computed and output but not analyzed much so far. after watching a bunch of run sequences, the different parameters do seem to affect the ‘cgnw2’ red line optimization somewhat/ mildly but not especially strongly or consistently.
but, not really noticed until now, and yet obvious in hindsight, checking the parameter averages seems to be a key/ crucial way to determine the progress/ position/ completion/ “exhaustion” of the optimization runs and probably need to pay more attn to this in the future.
notably, adding all the trajectory length parameters unexpectedly led to a slightly inferior ‘cgnw2’ result. in some cases/ optimization setups the trajectory length parameters added might improve another parameter optimization; in this case it seems to diminish it. partly due to this am suspecting that ‘cgnw2’ is relatively random/ independent wrt the other parameters, and maybe other optimization cases with less randomness/ more dependence would respond better.
or in other words, maybe any scheme that generates long trajectories or at least glides (optimizing by a single or multiple trajectory/ glide length parameters) will tend to turn up higher ‘cgnw2’ randomly among them. (ofc longer trajectories seem to imply long glides but dont recall testing this directly so far.) this apparent disconnection between ‘cgnw2’ and other trajectory parameters seems rather atypical so far and again makes me think of looking at eg the PCA of the parameter evolution as probably shedding some insight. again the possible interplay/ interrelationship of parameters is (sometimes or often) key.
the rather hacked-together “streaming” output code has been refactored to allow 2 independent/ concurrent output files. the other improvement in this code is adding (yet) another initialization method
initx that inserts 01 repeated (ie 0/1 alternating) sequences against a ½ density “background.” in the ufo studies these were found in the past to be crucial (actually, essentially a “feature”) in generating long 0/1 runs in subsequent iterates. but even with this algorithmic “hint/ suggestion/ bias” they dont seem to be involved much in the final high ‘cgnw2’ iterates found by/ arrived at by the algorithm. (should this be looked at more closely?) ❓
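for illustration, the initx idea as described could look something like this python sketch (the name is reused from above but the parameters/ layout are my guesses, not the actual code): a 0/1-alternating segment planted at a random offset in a ½ density random background.

```python
import random

def initx(width, alt_len, seed=0):
    """hypothetical sketch: plant a 0/1-alternating run of alt_len bits
    at a random offset inside a random ~1/2 density background"""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(width)]
    start = rng.randrange(width - alt_len)
    for i in range(alt_len):
        bits[start + i] = i % 2
    return bits
```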
(10/22) ❗ 😮 😳 😡 holy @#%& just found an enormous gaping bug! the
hybrid code has 2 “best candidate” selection methods, or was intended to. one selects the top ¼ performing bins to use in the crossover operators (measured by top candidate in each bin), and the other selects the top ¼ candidates in each bin. both are broken. they were both unintentionally unbiased, ie selecting over all bins and all candidates in each bin. this was a big mixup, and embarrassingly/ disappointingly its present in even the earliest/ 1st
hybrid version from ~2yr ago…
…and initially was thinking this would have a huge impact on optimization performance after fixing it, however results are more ambiguous. at least in current setup it does not seem to have much effect! hence prior variability/ repeatability seen in the last graph/ analysis was maybe due more to the seed initialization dynamics + operator balancing + this initial bin quality phenomenon (described next). the bin size limit seems to have analogous effects (on larger scale) as “top selection bias mechanism.” some of this was discovered by studying the optimization dynamics ie parameter averages which has been mostly skipped in the past and now thats looking like something of an oversight/ mistake. (too much utilizing vs building/ polishing.)
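for reference, the intended (post-fix) selection bias might be sketched like this in python; the bin/ candidate representation here is assumed/ hypothetical, not the actual hybrid code:

```python
def top_quarter_bins(bins):
    """intended bias: keep top 1/4 of bins ranked by each bin's best
    candidate score (higher = better); the bug had no such ranking"""
    ranked = sorted(bins, key=lambda b: max(c["score"] for c in b), reverse=True)
    return ranked[:max(1, len(ranked) // 4)]

def top_quarter_candidates(bin_):
    """intended bias: keep top 1/4 candidates within a single bin"""
    ranked = sorted(bin_, key=lambda c: c["score"], reverse=True)
    return ranked[:max(1, len(ranked) // 4)]
```

the buggy versions effectively sampled uniformly over all bins/ candidates, ie skipped the ranking step entirely.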
also, as seen distinctly in the following graph, another remarkable dynamic was discovered. the algorithm optimization trend seems to have a lot to do with the initial quality of candidates filling the bin(s), which happens early on but not after a fixed # of iterations due to the randomness of ½ density candidate generation and associated scores; here the 1k count bin fills up around 2k-3k additions. after the bins fill up, the optimization trend shifts toward much more smooth, but also stays highly related to initial bin candidate quality. to help somewhat the code was adjusted to recompute the statistics more frequently while the bin size is subfull, but theres still a rather wide variance in candidate quality right at bin full/ size limit. this suggests maybe an adaptive bin size algorithm is needed. another long-overdue improvement was using binary search instead of resorting the bin array, cutting n log(n) to log(n) time.
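the sorted-bin insertion via binary search can be illustrated with the python stdlib bisect module (a sketch, not the actual code; caveat, the binary search/ placement step is log(n) but the underlying array shift on insert is still linear):

```python
import bisect

def bin_insert(bin_scores, score, limit):
    """insert score into the ascending-sorted bin via binary search
    (no resort needed); cull the worst (lowest) entry past the size limit"""
    bisect.insort(bin_scores, score)
    if len(bin_scores) > limit:
        bin_scores.pop(0)  # drop worst
    return bin_scores

bin_ = []
for s in [3, 1, 4, 1, 5, 9, 2, 6]:
    bin_insert(bin_, s, 5)
print(bin_)  # top 5 scores, kept sorted
```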
from prior experiments was expecting to see maybe some trend in ‘cgnw2’ over different iterate sizes, but even after adjusting/ fixing the prior optimization wrinkles/ glitches, there is not a clear (correlational) trend. comparisons of certain metrics over bit size wrt the hybrid algorithm have generally been challenging and remain so. the following graph is color coded by initial bit size and the optimizations are over fixed bit size. bit widths run in increments of 20 up to 100.
(10/23) more analysis/ thinking, looking at trends over different bit widths and more than 1 run, some data yet to be written up. there seem to be some not-simple tendencies in the data/ optimization here. 1st, was having a nagging sense of deja vu and forgot somewhat about a relatively recent, very similar experiment from 12/2019
hybrid24 that used the hybrid optimizer to look for long ‘cg’ glides on large 500-fixed bit width iterates but with unconstrained density instead, and found they all started with large/ dominating 1-lsb triangles. it does appear in both cases (past/ present) there is more tendency of the GA to settle on longer 1-lsb runs for higher bit widths, but its apparently not especially consistent. it seems one explanation in current case, where typical iterates end up looking like
hybrid42 graph #2 distribution, where there is a lsb region of high density and corresponding multiple long 1-runs, is that these are “easier” for the optimizer to find, earlier in optimizations, but longer runs exist by extended work of breaking up the 1-lsb runs.
however, heres another way of looking at all this that relates to population dynamics/ evolution, and it needs more analysis but seems to be the explanation. the initial bin size creates a limited population of DNA (genetic) samples so to speak. these are both supportive and conflicting with each other as far as genetic health, or one might call the latter case “uncooperative/ uncollaborative.” in other words there is some commonality and difference of genes, where here genes are similarities in candidate strings (vectors), not necessarily adjacent, and individual bit sequences/ strings/ vectors can be understood to possibly have multiple genes. the population pressure of limited population size/ culling causes the conflicting or “incompatible”/ “weaker” genetic elements to be filtered/ “weeded out.”
the evolving population has some diversity but its also limited by evolutionary pressure. there is a tendency for there to be neither too little nor too much genetic variation. apparently long 1-lsb or max runs tend to be a sign of population health, but the actual width tends to be variable and not highly correlated with bit width. another pov is that there are pockets/ “strange attractors” in the genetic code that populations tend to converge to. they seem to consistently converge to some kind of attractors but the attractors are not exactly consistently shaped/ identifiable although 1-lsb/ max runs are some rough element/ aspect/ sign. once near a population-wide attractor the population will tend to stay close to it (“orbit”) and not diverge to another attractor.
for optimization, possibly the very best solutions, superior by very slight edges, lie in different attractors than those found/ converged on, and the genetic algorithm is not able to jump between them in a single run, however different runs arrive at different ones. in a single run there is too much loss of fitness, aka a kind of uncrossable desert/ chasm, between the attractor islands/ oases so to speak. so “shadows” of the “optimum” can be found but right now the “optimum” seems not to be arrivable via this method even over multiple runs or a medium length single run. the only possible loophole considerable/ visible right now is that the optimizations are terminating too early and something else could emerge with very long runs. the other old optimization terminology for this is simply getting stuck in local minima…
all this conceptualization helps to understand/ explain everything better (as a working hypothesis) and then in response be less confused by the nonunique/ emergent phenomena and not expect consistent optimization results across runs, but alas for proof purposes, consistency is exactly what is needed. the seemingly endless shifting sands of the problem/ chasing mirage aspects/ (“curses”) strike again. a truly formidable, monumental adversary… but this utter sheerness is far from unfamiliar at this point…
(10/26) am again anxious to work on/ push forward the classification outline but theres seemingly always something incremental to do on improving an optimizer, and was a bit puzzled/ bothered over these explainable yet inconsistent results. in short was wondering if anything could be done to improve finding a global optimum instead of local ones? it took timeconsuming hours to make/ test some basic conceptual changes but results are somewhat satisfying.
- this code adds a stopping metric based on a quality limit/ threshold: here, time (iteration count) since last best candidate found by ‘cgnw2’ measure, 50k iterations.
- the streaming output analysis has been improved. it generates multiruns/ intermediate (multi)graphs.
- and maybe most notably a new
inity method was added that simply mimics final solutions/ features found previously, a notable strategy/ technique here turning out to be an effective idea. it generates iterates with a lsb 1-run up to ½ the iterate width, then fills up the remaining “empty” (0-run) region with 1 bits to give an overall ½ density of the iterate.
- after some looking, glide parity density was calculated as having some kind of signal, maybe hinting at some kind of feature(s)
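the inity idea in the list above could be reconstructed roughly as follows in python (name reused from the description, details guessed, not the actual code; assumes even width so ½ density is exact):

```python
import random

def inity(width, seed=0):
    """hypothetical reconstruction: lsb 1-run of random length up to
    width/2, remaining region filled with enough scattered 1-bits to
    reach 1/2 overall density (assumes even width). index 0 = lsb"""
    rng = random.Random(seed)
    run = rng.randint(1, width // 2)
    bits = [1] * run + [0] * (width - run)
    # sprinkle the remaining 1s needed for exactly 1/2 density
    for pos in rng.sample(range(run, width), width // 2 - run):
        bits[pos] = 1
    return bits
```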
was expecting the algorithm to find solutions that break up the solid region more from the remixing operators after noticing some in prior algorithm runs, but they dont end up in final results, suggesting they are either suboptimal and/ or the algorithm is not able to break them up in a few short steps. so then on further thinking, have ideas for another similar operator to try, but these results are already notable/ noteworthy.
the results are that the new initialization method has a dramatic effect and the algorithm more consistently converges in the final iterate shapes/ structure to that seen below. graphs are sorted by increasing ‘cgnw2’ left to right and by color in #3. for the runs optimal
also a striking emergent phenomenon was found in the glide parity density in graph 2, it converges to a near constant and has a staggered pattern when sorted. its strangely close to ~0.64 magic parity density constant associated with “near horizontal glides” derived previously, but in that case it was for uniform distribution, and here the distribution is quite lopsided due to the 1-lsb run, but it does hint at some kind of (remarkable?) “tradeoff phenomenon” between the 1-lsb run vs remaining glide density, something to look into further. some kind of hidden order. caveat, once looking into something like this, have seen sometimes sorting identical values and some internal quirk can lead to this. ❓
in graph 3 it appears the algorithm is maximizing the spread/ dispersion of the (postdetermined) glide following the linear up arc corresponding to the 1-lsb run but otherwise apparently without much correlation with
cgnw2. the more gradual postdetermined glide slope is also noticeably different than the remaining glide. something to look into: while different, is this anomalous wrt “typical” drain behavior, which has been examined from many angles and found to converge to ½ density iterates?
but it occurs to me that, always trying to keep the bigger picture in mind, this is more tactics, less strategy. the overarching strategy needs more to be “let the ML algorithm do the heavy lifting.” the “heavy lifting” is converting feature dynamics to a map of the overall problem, and the “less heavy lifting” (but still hard, taking years now) is a rather newly defined concept/ science in ML called “feature engineering.” these explorations via the optimizers are creating some mapping, but theres a hall-of-mirrors/ wandering-in-the-wilderness aspect to it due to the complexity of fractals. in defense it can be put in the feature engineering column.
(10/27) this is some new nontrivial initialization code, not simple to describe, with 2 new subroutines
initmf that have the intention of creating 1-lsb runs of arbitrary length broken up randomly with intermediate noise/ 0-bits, exactly as seen in various optimization results across multiple experiments so far. the results are that the new initialization routines are highly effective/ thereby preferred by the algorithm and seem almost the same except maybe for slightly more/ longer broken up 1-lsb runs as intended.
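a guess at the initmf flavor in python (noise probability and dense-region extent are made-up knobs for illustration, not the actual code):

```python
import random

def initmf(width, seed=0):
    """hypothetical sketch: dense lsb region of 1-runs broken up by
    sparse random 0 'noise' bits, then a ~1/2 density background"""
    rng = random.Random(seed)
    cut = rng.randint(width // 4, width // 2)  # dense region extent (a guess)
    bits = [0 if rng.random() < 0.15 else 1 for _ in range(cut)]  # broken 1-runs
    bits += [rng.randint(0, 1) for _ in range(width - cut)]  # background
    return bits
```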
reminds me of paint dripping, the matrix movie signology, or the yin yang symbol, yin in yang, and yang in yin… ☯ 😎
(later) ⭐ as stated have the strong urge for over a month now to work on the outline strategy/ attack/ code. this is some nontrivial, careful refactoring of the existing code to add the train set error calculation with 250 points, same as the test set size. wrt the additional train calculation the code was tightly coupled and needed to be decoupled some. over several runs the test error improvement (magenta dashed line, right side scale) is about 25-30% and the train error (cyan dashed line) is around 5-15%. but note train error sometimes starts out a little higher than test error. note for the 1st ~½ of the run the test error goes down dramatically while the train error barely moves, alas this is obviously not great. other close examination shows some correlated bumps in test vs train error.
this all is some evidence/ basic sanity check that the algorithm is working correctly/ nearly exactly as expected, ie learning classes. although the overall low decrease in train error seems to indicate a “lot of signal” held in the data/ features/ classes, ie a lack of redundancy/ compressibility, somewhat to be expected with inherently fractal/ noisy phenomena/ associated features. note that the model crucially needs to predict up vs down in iterate evolution, which is different than the error calculated here, ie the absolute difference of expected vs actual post/ pre iterate ratios, but generally strong correlation between the two types of errors can be expected.
(10/28) probably the more ideas the more optimistic, and the more optimistic the more verbose. heres a nice quick demonstration of the overall concepts but have already moved on in my thinking. this generates features over varying density and generates/ graphs the nearest neighbor map/ state transition graph using
dotty from the graphviz suite, possibly not used around here in years, looking at timestamps maybe even ~½ decade! have had trouble with
dotty in the past not drawing too well, eg with basic problems like overplotted nodes, but maybe it was because was using undirected graphs which maybe it isnt designed for! it seems to do relatively well on this 50-node graph. the nodes are labelled by density and 10-consecutive-sample collatz mapping running average increase/decrease factor (negative indicates decrease).
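generating such a graph for dotty/ graphviz is straightforward; heres a minimal python sketch (node ids/ labels hypothetical) that emits the dot format:

```python
def write_dot(edges, labels, path="nnmap.dot"):
    """emit a directed graph in graphviz dot format; render with eg
    `dot -Tpng nnmap.dot -o nnmap.png` or open in dotty.
    edges: (src, dst) node id pairs; labels: {id: label string}"""
    lines = ["digraph nnmap {"]
    for n, lab in labels.items():
        lines.append('  n%d [label="%s"];' % (n, lab))
    for a, b in edges:
        lines.append("  n%d -> n%d;" % (a, b))
    lines.append("}")
    text = "\n".join(lines)
    with open(path, "w") as f:
        f.write(text)
    return text
```

node labels here could be the density + running average increase/decrease factor as described.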
no model compression is applied, in retrospect its not clear that its always necessary. theres an optimization bias around here but maybe its sometimes overdone! nearest neighbors is often combined with optimization but in a way at heart its a classification algorithm.
did not anticipate this but its obvious in hindsight. the other features besides density are probably ending up as relatively random/ near undifferentiated, therefore the nearest neighbor algorithm is basically finding nearest densities. this means even with all the different features (added “moving parts”) the (simple) model is not a lot different than the recently referenced MDE or density-only models constructed a few years earlier. there is a large connected graph in the center and some disconnected graphs, “islands”.
the high densities decrease to ½ density eg 43 → 26 → 14 → 32 → 44 → 6 → 33 → 25 → 24 node sequence as the iterates generally increase, and low densities increase to ½ density as the iterates generally decrease eg 11 → 3 → 30 → 2 → 42 → 13 → 34 → 36 → 48 → 24 sequence. there is a loop 24 → 36 → 48 with decreases only. the loop 5 → 23 with equal magnitude increase/ decrease +/-0.001 might seem to “cancel,” aka be inconclusive, but actually theres a technicality here: the increase/ decrease is a multiplicative factor, so one full traversal multiplies by 1.001 · 0.999 = 0.999999 < 1, ie a complete loop very slightly decreases, while intermediate values sit above or below the start depending on which node/ factor is applied 1st! another “ending/ sink” loop 15 → 39 decreases. this case did not have them, but rerunning, other graphs show increasing loops, eg 2nd graph has an island/ loop 1 → 26 → 1 thats increasing.
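the loop-factor technicality is worth a trivial arithmetic check: treating the +/-0.001 as multiplicative factors, one complete traversal of the 2-node loop multiplies by (1.001)(0.999) = 0.999999, just under 1.

```python
up, down = 1.001, 0.999   # the +/-0.001 node factors, multiplicative
cycle = up * down         # one full traversal of the 2-node loop
print(cycle)              # 0.999999... just under 1
x = 1.0
for _ in range(1000):
    x *= cycle
print(x)  # after 1000 full loops the value has shrunk ~0.1%
```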
as can be seen even with only 3 instances, the graph structures vary quite a bit between runs. it appears graph heights are generally proportional to longest sequence found, possibly a key insight into the graph algorithm logic, and this 3-series is coincidentally decreasing in that. its rough, even primitive, but aspects of the critical scale invariance property/ internal dynamics map of the model (here based on density, and increase/ decline trends, sinks/ loops at the bottom, etc) are already discernable. already its also clear that islands are one of the wrinkles that has to be addressed somehow. on immediate glance it will be great/ revealing/ aesthetic to use color coding on nodes (eg grayscale representing density!), may figure that out.
💡 this is the immediate new idea/ twist jumping off this analysis. this was already sketched out some (eg last months
hybrid40 idea, maybe already partly/ mostly anticipated), but its clear that the problematic case is a long iterate sequence where all the features end up nearly undifferentiated. so it seems the goal is to find features such that for all long sequences, at least one of the features has significant signal/ is “differentiated.” this is the kind of quantitative criteria that an optimizer can search on as already done in one case. however it seems likely that for larger iterates, its more possible for features to be undifferentiated over long ranges. this also relates to the idea mentioned that it seems that some features require n iterations to measure where n is proportional to iterate size.
💡 that last thought reminds me, just realized the recent ‘cgnw2’ measurements highly relate to some earlier ideas about the “lookahead function” being “nearly” monotonically decreasing (eg within constant offset say 200), the idea was to compute n iterations of the collatz mapping where n is bit width and compare it with the current sequence value, studied on 7/2019 and 6/2019. even at the time the concept of Terras (density) glides was understood, and the lookahead function being (monotonically?) decreasing relates to how long glides are relative to initial iterates, and it relates to the slope of the drain associated with the postdetermined glide.
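the lookahead function as described (n collatz steps where n is the bit width, compared against the starting value) can be sketched in a few lines of python, assuming the standard 3n+1/ n÷2 map:

```python
def collatz_step(x):
    """standard collatz map: 3x+1 on odd, halve on even"""
    return 3 * x + 1 if x % 2 else x // 2

def lookahead(x):
    """run n = bit width steps of the collatz map and return the ratio
    of the result to the start; ratios consistently < 1 would support
    the 'nearly monotonically decreasing' lookahead hypothesis"""
    y = x
    for _ in range(x.bit_length()):
        y = collatz_step(y)
    return y / x
```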
running the (revised/ tuned-up) hybrid optimizer recently with unconstrained density finds up to
cgnw2≈6 for bit width 200, and the recent similar (½ density constrained) experiments seem to show the postdetermined glide slope can be significantly altered, ie steepness decreased, from “typical.” this means the earlier experiments seemingly showing the lookahead function is “nearly monotonically decreasing” are possibly wrong and the optimizers were having trouble finding the long exceptions/ outliers. ie the earlier hypotheses seem to have been falsified. but need to actually do the calculation (eg over the “6x” glides) to understand/ visualize this better.
“typical” postdetermined glide slope may be compared with the old, pivotal “rosetta diagram,” and that diagram is now regardable as possibly misleading compared to new outlier trend found in
hybrid45 graph #3.
(10/30) 💡 ❓
- the features seem to be something like the optical resolving power of a microscope, which amplifies while at the same time missing some detail. and microscopes invoke the heisenberg uncertainty principle (sometimes invoked around here in other/ different contexts), which relates to the inherent noise around features.
- it makes sense to run the GA optimization to find samples with different/ diverse/ even “extreme” features so to speak (the z-norm logic in the optimizer already supports it!), and then run the classifier on those.
- however, the GA or an adversarial algorithm also has to try to find samples with features that “slip through the cracks” of the existing classification ie no differentiation in many or all of the features, or that lead to different classifications (class/ state transitions) than those already found.
- have been trying to analyze the just noted “elevated” postdetermined glides looking for any statistical anomalies eg density/ entropy etc but am turning up essentially nothing after carefully controlling the analysis. it seems all that is different is the lsb/ “postdetermined glide drain” slope. various density/ entropy features (mostly/ typically greater spread) in “smaller” iterates tend to complicate the analysis. on the other hand its a mild difference in slope, ie parity density ~0.5 vs ~0.6 in postdetermined glide.