💡 this following code/ signal came up somewhat indirectly, incidentally, almost by accident. had an experiment to generate long ‘cgnw2’ glides with nearly ½ entropy to force them into the undifferentiated region; as explored last month the ½ density constraint is not sufficient for that, leading to long 1-lsb runs nearly ½ the iterate width. the ~½ entropy constraint works fairly well but then found small initial 1-lsb runs even in those. so wrote some code to cut off the leading 1-lsb runs of the glide, and analyze remaining glide carefully.
was looking for any kind of signals at all, lots out of the “bag of tricks,” also looking at 3-power sequence, and again it seemed to come up undifferentiated mush with lots of sophisticated analysis. there is a lot of signal found in prior 3-power sequence analysis but in general, a lot of it was related to Terras density glides. the hybrid approach tends to produce more “in the wild” glides. there is still some hidden order, maybe significant, in Terras glides that various experiments have isolated. but some, much, most, or nearly all this seems to melt away on “in the wild” glides.
notably, this hasnt been pointed out, but the (local) density metrics associated with the 3-power sequence naturally tighten even for random iterates. so its important to try to separate this “typical” tightening from “atypical” tightening, and thats not a trivial exercise.
but then started working on some “baseline” comparison ie control data, and then chasing down some stuff, lost my focus on the glides entirely. the simple way to generate this is via random ½ density iterates and then look at the drains. then started looking at features over these drains. some of this work has already been done, but it seems that some key signals have been missed. this straightfwd/ finetuned/ yet conceptually near basic code is on ½ density drains only.
averages are ones friend to deal with noise! this is 2 average types done in 3 different ways, dotted red line is iterate bit width. the 1st left panel is a 1000 length moving average. the 2nd middle is a running average from drain, and the final 3rd right panel is a running average in reverse direction of the collatz sequence, truncated from 100-bit width iterates and smaller. it is long known that density/ entropy spread goes up at tail end of drains/ trajectories, but this analysis shows the situation is much more nuanced. the 1st panel has a sort of “bubble” ie spread expand/ tighten alternation in the middle and even 1000 sample averages do not smooth out.
these “bubbles” (sometimes multiple ones in single trajectories) tend to appear at different places over different drains but they are unmistakable. the “bubble” phenomenon was seen a long time ago wrt MDE analysis but in some recently cited pages, need to go dig that up. in the 2nd panel the “natural” emergent ordering of the 5 different features is apparent right side; even though slight, even razor thin at 0.005-0.010 overall density range, this is highly repeatable over different drains! (so some very legitimate/ understandable excuse for not isolating this until now!) yellow, cyan, blue, green, magenta from bottom up. notice the ordering is retained in the bubble region in 1st panel! in 3rd panel the initial trend (final trajectory) spread is apparent and its sizeable over the 100-bit-width iterates and larger, note the truncation is applied before applying/ calculating the cumulative average.
the bubble phenomenon seems to be hinting at fat tails in the distributions and as the theory outlines, in (some) fat tailed distributions, averages dont really exist. and so something more/ some alternative is needed/ called for here. the bubble(s) seem to be indicative of some kind of cyclic/ oscillating/ resonating feature(s) in the trajectory/ drain! aka some kind of “ringing!” more analysis needs to be done to draw it out. the natural ordering of the features is also helpful and then one can better look for deviations in non-drain parts of glides. (is it a few nearby outliers skewing the distribution as the average runs over them?) ❓
in a sense, even razor-thin edges such as this are overcome with semi sophisticated averaging, and maybe the whole problem is going to hinge on these hair-thin, but demonstrably/ measurably real features. all this fits under careful control(s)/ controlling the data, with the undifferentiable region under highest scrutiny, and finding any possible features in it; the theory is not esp advanced here and so it could have been done long ago, but todays the 1st day of the rest of this blog 😀
ok, all thats the great part inspiring some enthusiasm but the not-so-ideal part is that these are very large 1000 bit iterates and very long running average, 1000 iterates, and the statistical edge is very thin. in other words to find/ squeeze out differentiation from the undifferentiated region maybe theres a lot of “big data” involved and trends/ signals in smaller iterates will be insufficiently strong ie weak/ faint/ unmeasurable.
note: the few-line
sample routine/ logic is all the way back from the
thwart code 2017. the other simple sample routine has been used across many experiments but it throws out tail end elements of the sample. iirc (re)wrote some similar code for an optimizer/ graph output within maybe last ½ year but having trouble finding it right now! (“exercise for reader…,” lol!) it is hard to find methods out of the now-massive pile of code sometimes, one has to remember their names. this was found via a text search and the different named method parameters, and it was sandwiched among dozens of cases of the other sample routine. probably spent over 1hr looking/ digging for it, could have coded it in ~¼ of that, but sometimes it seems a “waste” not to reuse code, lol!
(later) 😳 🙄 tried to do too much in my head again. last month was musing about ‘cgnw2’ vs the bit-width lookahead function monotonicity studied earlier thinking the recent ‘cgnw2’ hybrid optimizing experiments might disprove conjectures related to it, but not actually doing the calculations. the bit-width lookahead function monotonicity is actually all about ‘cm’ vs the bit width and whether the drain is “well behaved” afterwards ie mostly monotonic after ‘cm’. in this sense a long gradual slope, as found last month, is just as monotonic as a steeper one, so on closer analysis the earlier conjectures still seem to stand.
(11/5) found it! there was some extended reanalysis/ reappraisal/ overview/ survey of the MDE model wrt new outline ideas last month 10/2020 and it also referenced 10/2017. at the end of that blog there are some experiments looking at the sequential error in the MDE vs real trajectories,
perturbb which have “bubble-like” features in the plots, more in the initial/ final regions but also in the middle.
following is an analysis that maybe something like it has been done before, maybe both published/ unpublished, need to do a big survey to find it. started looking at ‘dl’ the density of lower (lsb) half of iterates and found a signal, then zoomed in on it, and then maybe even got a little carried away extracting it where it it led.
- 1st, needed some iterates/ trajectories/ glides. altered the hybrid analyzer to look for ½ density iterates with no more than “natural” max 1-runs in the starting iterate, this was found last month to be
log2(log2(n))a constraint combined with the density restriction causes very many trajectories to be thrown out, about 95%, but the algorithm is still usable. also, maximize the glide size vs the glide “height” (measured in bits) so that glide is long as possible for height as short as possible. this creates long mostly-horizontal glides which presumably are the hardest to find signals in, ie weakest, it would be expected that steeper climbs/ descents give stronger signals.
- these glides are then presumably somewhat “wild” compared to Terras density glides and are different in another key way; as in the rosetta diagram the Terras density glides have nearly 45-angle descents from their max, although maybe there is some resemblance to “sideways” 0.64 Terras glides. looking at the grid plots of this new sample confirms theyre mostly undifferentiated (characteristic ½ density 0/1 noise) although they tend to have a lot of small 0/1 triangles, ie maybe substantially but not very highly undifferentiated.
- then this code analyzes aggregated results over 6 different widths 100..200 in 20 count increments. it uses the old idea of “glide left/ glide right” ie left/ right of peak. it works with the top 20 long glides and throws out any where left/ right range are less than 50 iterates.
- then it looks at the lsb ranges in the glide left/ right labeled 1, 2. it does a cumulative (also running) average on 1-bit count divided by total span size starting on the lsb moving higher up to 20 bits. these are graphed as red, green lines.
- then it sorts results either by 1st point in the left or right lsb average, in 1st/ 2nd graphs. for the 1st point, this is exactly equivalent to the parity density over the left or right side (the single lsb).
- there is also a measurement of the left/ right slope via bitcounts,
m1, m2, blue/ magenta lines, and not surprisingly this correlates highly with the left/ right parity densities (1st/ 2nd graphs orders), because the climb/ descent while gradual are typically nearly/ roughly linear.
- then overplotted on all this is the same logic running on 2nd lsb to 20th instead, ie excluding the parity density “line”/ sequence, and finding essentially the same signal!
it took a little fancy footwork in the code, a bunch of moving parts, but really was just chasing the signal wherever it led and comparing with control version and ended up with this. it shows that lsbs are consistently less dense for the right range than the left range, even excluding the 1st lsb which controls glide exactly. considering all the accumulated code/ thinking behind it, its maybe more than a mere exercise/ experiment, verging on a tour de force. it is essentially finding a strong signal in an undifferentiated region. however this “strong” glide-control signal/ aspect of the iterate seems to decay/ fade fairly rapidly over the lsbs corresponding to the steep downward slopes. graphed this way its a strong signal, but looking at eg grid plots, its not visually apparent at all. unless maybe youre a robot that is. 😛
😳 🙄 oops whipped that out a little too fast, graph titles are not entirely as desired from before the refactor. both graphs are showing both the “0/1 lsbs” averages overplotted.
(later) looking around, this is indeed related to
review139b from 12/2019. that approach looked at many features over left vs right glide and found many signals. however it was working with bitwise-generated iterates, maybe not quite as “wild” + undifferentiated. needs closer/ further analysis/ comparison there.
(later) 😳 wow really did whip that out too fast. did anyone notice? the
m1, m2 blue, magenta points do not overlap in each graph for the two different lsb ranges despite having the same calculation. am not sure how that is happening right now, it almost seems like a gnuplot glitch maybe related to the autoscaling logic and overplotted point data… oh wait guess there is an explanation related to left/ right reordering logic… hint: both left/ right orderings have associated
m1, m2 and orderings can shift after each running average density (re)calculation, but not dramatically because of the stability/ strong correlation of the lsb vs non-lsb averages. but what about some apparently missing magenta ‘m2’ lines? actually further thought, suspect maybe accidentally scrolled the graphs with mouse wheel before saving ❓
😳 oops! thinking more about “descending vs ascending cycles” wrt collatz iterate bit size. last month talked about the order mattering of 0.99 · 1.01 · 0.99 · … and basically forgot about the commutative law of multiplication!
(later) these results relate to an old idea, not sure if it has been outlined in this blog, but essentially the trajectory behavior peak looks a lot like a transition point or tipping-point type phenomenon. this is seen somewhat in density statistics (a transition from non center to center density) but was realized years ago that density effect on glide is highly nuanced, and the (long) undifferentiated, near center-density glides have long seemed to throw a wrench in the works of focusing on density. if there is some kind of transition point feature controlling glide max/ height, its apparently faint/ very hard to isolate, although the prior idea comes close. and the Terras glide construction seems to suggest in some ways that focusing on glide max, if it is in the predetermined range, is something of an irrelevant anthropocentric bias.
(11/9) there are a lot of ideas about density (distance) narrowing, and there are some/ a few interrelated ones about entropy (distance) narrowing. it would take awhile to dig all these up/ survey them, but here is a quickly located example,
bitwise24 from 2/2019. in the same month entropy narrowing was measured in other ways, and other older pre-Terras analysis blogs about “traps” included some analysis of entropy.
so my next idea was to try to come up with a metric that combines density and entropy narrowing, and then show that it controls (or rather, more precisely, limits/ bounds) glide dimensions somehow. but how is this formulated in optimization terms? again devil(s) in the details. this is a 1st take on that idea and its nearly immediately throwaway code because of the lackluster result(s), but it did lead to many combined/ substantial insights. was tempted to just skip over all this because writeups while strongly exercising ones brain can be rather timeconsuming, but a lot of thinking resulted from it, so think thatd be a loss. the graphs are kind of an almost embarrassing misfire/ mess, but on the other hand, sometimes theres such a thing as an informative, even striking antipattern…! aka case study in unintended emergent pattern(s) and careful optimization objective formulation/ tuning!
the basic idea was to have a measure ‘dea’ that combines entropy and density distance, larger for non-center values. here distances from center (ie ½) can be either added or multiplied. the multiplication leads to more distinct values. then the prior “gap” code can be used to try to find a “natural or typical” distribution over ‘dea’ and then look at the trajectory metrics.
overall it might be said the code worked ~½ or so.
in graph #1 (all graphs sorted by ‘dea’ left to right) there is some rough correlation of glide length ‘cg’ blue. however, the funky gaps are immediately glaring. in graph #2 the algorithm is seen to be “thrashing” between two different region shapes. in graph #3 there is some revelation of what is going on where the gap distance ‘gdea’ is plotted along with ‘dea’. as GAs sometimes do, the raw optimization objective superficially was achieved; the algorithm did manage to generate a range over ‘dea’… but just not anything like really desired/ intended/ expected…!
- the algorithm will focus on high ‘gdea’ values but as can be seen, there is a long range where they flatline. this is due to many different points having the same ‘dea’ value but having different “internal structure,” ie in a sense the feature is very coarse/ rough. the algorithm is more “satisfied” with the low ranges and focuses on spikes (focus in the sense of “generating candidates similar to”), but to break up the distribution, it has to target that long low range somehow.
- also in the flatline region of ‘dea’ the ‘gdea’ values spike some. this is due to that the nonunique gap values are overwritten by the algorithm on a “latest found” logic basis.
- in the nonflatline region with a steep upslope, there is a lot of ‘gdea’ spikes signalling more attention focused there, the code detects gaps there. but in a sense the logic is exactly inverted, because the algorithm puts more focus where there is a better range.
- related to this is that the gap values are computed over all candidates found/ generated, but only top candidates are saved, and those do not match over all the gaps calculated.
- another weakness of the algorithm is that one wants to extend both sides of the distribution, not just the insides of the distribution, and the current gap code does not calculate/ weight gaps at the endpoints differently. in other word endpoints may have small gaps and the algorithm may not try to visit them (expand them) at all. in a sense the gap is represented as “known” but endpoints may have unknown or large gaps if they can be extended.
- the gap code only works on a single variable, and it was quickly becoming clear it seemed like one wants to get a distribution over multiple variables eg
d, e, cg, and graph #2 chunkiness/ thrashing also points directly at the limitations of non multivariate approach. behind the scenes ‘dea’ changes almost not at all but the features that compose it vary widely “behind the scenes” or “under the cover.” or vice versa/ conversely, maybe ‘dea’ might change a lot and other variables behind it might not change much.
- so that gap code, while quite sophisticated and yielding some insights, is quite limited. yet another difficulty with this code is how opaque it is. it is reminiscent of some math theory in the way it does what its supposed to but the logic is nearly indecipherable without a lot of background explanation/ framework.
so as in english editing, relating to improving the quality and esp concision of writing, an old expression, they sometimes say kill your babies. the more ruthless editors talk this way, and its not entirely out of place here. (what is coding if not a lot of (text) editing?) this gap code needed to be rewritten, which sometimes has a kind of brutal aspect to it, esp when so much effort/ polishing has already been invested…!
(later) so the gap code arose out of the idea of trying to find a uniform distribution over a feature during optimization, and this turns out to be not very simple at all esp for multivariate cases. my next idea was to look at the difference of the feature distribution versus a linear distribution, and use that to guide optimization. the linear distribution is often considered across many experiments and there is various, sometimes sophisticated postprocessing code to try to extract nearly linear distributions from nonlinear ones. so what would it look like to try to do this “on the fly”?
the code already has some linear-time calculations to get z-scores, and then thought this new logic could be a variant on that. then hit on the idea that various points have a “distance from linear distribution” and that this could guide the optimization much like the gap code already did. ie points that are far from the linear distribution can be focused on more for further generation, smoothing out “distant” features by generating/ selecting nearby, closer-to-the-line candidates.
so then my next idea was to try to get a linear distribution across different variables say
d, e, cg. the code looked at them separately using the linear difference idea, and then immediately ran into an old problem with multivariate optimization, that basically various multivariate optimizations cant be decoupled by individual variables. after running, it might look like each variable had a varying distribution, but looking at them together, there was not much continuity or what might be called simply (combined) spectrum. eg density might vary but entropy didnt vary much over the different density values, or vice versa.
(later) 💡 ❗ 😮 ⭐ 🙄 it took a lot to figure all that out but the new code is not all that different.
- my new idea was to simply use the ‘z’ score of combined parameters in the uniform distribution measurement/ target. this code tries to achieve nearly linear ‘z’ scores by using the line difference measurement as the target value to optimize, called ‘z_d’, ie generate candidates similar to the outliers to find new nearer-linear candidates.
- it also finds the max ‘z_d’ value and assigns it to the endpoints of the distribution (which under the linear difference logic would be assigned 0 difference), causing the algorithm to “include focus” (prioritize) on extending endpoints.
- since the algorithm is not incremental it does the calculation intermittently. but what about new points that dont have ‘z’ or ‘z_d’ scores? this is not a trivial point to handle; and there seems only one option/ answer; the algorithm is designed to rank them as middle/ average, neither high or low fitness.
- my other change was thinking that trying to look at entropy and density at the same time is a bit much for now, so this code throws out non ½ density iterates and the linear distribution is sought over the features
ea, cgwhere ‘ea’ is entropy distance.
the results are awesome, even eyepopping, seemingly almost too good to be true! the algorithm works seemingly flawlessly… ‘ea’ entropy distance on only the 1st iterate of trajectories seems to be strongly controlling the entire glide length! but, full disclosure, in a way that is opposite of what was imagined, ie the longest glides are the farthest from ½ entropy. there is a substantial spectrum in both ‘ea’ and ‘cg’ and also in the grid results. these are highly repeatable. graphs are by ‘z’ value from left to right. strangely ‘ea’ green 2nd graph climbs but never makes it to center, maxing out at about ~0.45; there seems to be some kind of limit or transition point involved… ❓
ahhh! its so @#%& amazing, but is it too good to be true? with fractals one can wonder about certain possible biases. is this something like a selection or survivor bias? like an algorithmic equivalent of a rorschach phenomenon? in a sense it seems the algorithm doesnt have any inbuilt bias, but in another sense all optimization is about a kind of bias. ok, so how does one make sure then? ❓
looking closely one can see the max ‘z_d’ values/ spikes assigned at the endpoints, 1st graph orange line. notably there is also a long dip/ bump in the
z, z_d (delta) linearity in the right ⅓ section of the graph.
re entropy expectation, this has probably been lost somewhat in a lot of prior blogs that only looked at numerical values. close to ½ entropy is the more typical concept of “high entropy” in the thermodynamic/ information theory disorder sense and so these results are exactly as expected, the ½ entropy range is high disorder and tends to limit glides. long glides have low entropy, ie far from ½, ie low disorder.
(later) 😳 oh, oops, but wait, not so fast, trouble in paradise, the honeymoon is over already! all that is nice/ great, but… ofc looking at the 3rd graph, despite the ½ density sample rejection, all that optimization/ sampling is mostly the differentiated region. the undifferentiated region is maybe a little smaller than the left ~⅕ of the graph, and the entropy curve in the 3rd graph is different in that range, and its a lesson on the (naive!) assumption that even inside the ½ density region is mostly undifferentiated, already understood/ demonstrated last month. there are probably a bunch of other experiments that already showed that (spec
hybrid45/ hybrid46), but hey, at the end of the day they are all ultimately carried out by a (particular) human with limited perception/ memory…
❓ but on the other hand, 2nd thought: where exactly is the undifferentiated region wrt both density and entropy? that has not really been carefully elucidated previously it would seem…
(11/10) 💡 this new framework is very easy to modify and led to all kinds of additional experiments, trying out new parameters/ features/ optimization combinations, trying to get a grasp on the overall dynamics, getting carried away some. some fascination/ pleasure involved! it is easy to lose sight of what one is looking for.
💡 some new abstract theory seems to emerge. it appears that the optimization has something to do with outlining a multidimensional surface/ object, for now call it a “blob”. there has long been ideas about using PCA and it is well known that PCA works around a gaussian ellipsoid distribution. this new code/ algorithm has some similarity to finding the principal component and looking at dispersion around it. it seems now like an ideal analysis algorithm needs to find the shape of the (multidimensional) blob which involves tracing out its boundaries and filling it with evenly distributed points. it also seems to be similar to the concept of a “control surface” from control theory. however apparently this fairly basic code is doing a fairly decent job anyway.
this is a long 370K iteration run with 500 width iterates, one of the most thorough experiments done on the core density/ entropy region. it focuses on 0.05 away from center ½ for each. many features were involved, all of them found to be roughly correlated with increasing glide lengths. the graphs are in ‘z’ order left to right.
- in the 1st graph the long bit width seems to help out with the linear smoothing without larger deviations; ‘z’ blue is relatively smooth/ unbumpy, as compared to ‘z’ black in prior graph #1 with a long concave warping/ dip on right ~⅓ section.
- in 2nd graph new metrics ‘de’ and ‘de2’ which are density/ entropy product and density minus entropy respectively show significant signal, esp the latter. ‘de2’ was added after it was noticed to be significantly predictive in a prior/ intermediate optimization.
- 3rd graph shows ‘cg’ and ‘cm’ increasing.
- 4th graph shows ‘cg’ vs ‘de2’ and finds a strong wedge effect.
- 5th graph is another wedge type effect ‘cg’ vs ‘de’.
- 6th graph shows density, entropy trends (‘e2’ is
1 - e) and that the algorithm does not find lower values for longer glides, somewhat deemphasizing the long-observed center symmetry. not sure right now if this is the algorithm bias or something deeper. it seems to be showing
d, eare less constrained for shorter glides.
there is quite a bit of interrelations/ complexity here, even more not graphed/ included. it reminds me of the expression from psychology/ therapy, “its a lot to process.”
(11/11) 😳 🙄 @#*% its so easy to mess this stuff up. looking closer at
hybrid47c the 1-run bit width restriction is messed up. at line 85 its
log2(log2(nw)) where ‘nw’ is bit width but that works out to
log2(log2(log2(n))) not what was intended! again the correct formula is either
log2(log2(n)). the wrong formula works out to ~2.73 for
nw=100 ie excluding all 1-runs 3 or longer. am not sure it will affect the results much but they need to be recomputed…
(11/12) spent all day on veterans day yesterday hacking hard on looking for various signals. its a long story. could write a lot but will think over a little more what to write up. oops, forgot to include the grid diagram for last run. the patterns emerging are quite remarkable at this point. the new uniform distribution system is proving to be quite a handful and allowing a new handle on the problem, with many various prior aspects/ phenomena emerging, combining in new ways. sometimes feel a little burned out but my neurons are firing at the same time, getting new ideas, its a strange combination. these writeups are sometimes getting more involved and despite wigners semifamous complaint about “too much schmooze” think it might be a good sign. writing up stuff does not take a huge amt of overhead thanks to all the excellent capabilities of this software eg ruby, gnuplot, wordpress, but its a lot of typing + clicks nonetheless, there is some expense/ cost at times. so sometimes a summary is more in order…
this code is somewhat “back to basics”. it has a little bit of new ideas.
the prior code was finding a lot of signal but was also involving a lot of features in the optimization/ distribution analysis. what happens when most are taken out?
- this focuses only on ‘cg’ glide length only which is in a sense the basic induction feature/ parameter of the problem, a kind of “spine” long sought. what trajectories would the algorithm find? there is old meaningful hybrid code to maximize ‘cg’ but not to solely evenly distribute it; that is newly available logic/ capability.
- the code is adjusted to reformat ‘z2’ and ‘zd_2’ which are nearly the same as ‘z’, ‘z_d’ except normalized/ reformatted for graph purposes.
- the filter is on 0.05 away from center ½ density/ entropy. it also restricts the max 1-run length to “natural size” ~
- these combined constraints seem to effectively control/ focus on the undifferentiated range in a very tight way, possibly more tightly/ accurately/ precisely/ effectively than any other experiments. its a testament to the impressive power of the optimization framework that theyre so easy to make (2 lines of code!).
- ❓ a new twist on the midpoint (“center of weight”) metric is introduced after some studying/ thinking. the initial inspiration for midpoint was around working with ½ density iterates. but what about iterates that are not close to ½ density, or even just a little off? is this biasing the metric in a way that is undesirable, ie is another pov more desirable? an idea came that an actual “center of weight” metric is different, this prior one assumes nearly ½ density. what would one not assuming ½ density or far from ½ density look like?
- 💡 ❗ immediately its a bit, almost obvious: determine the weight (count!) of the bits and then determine where the ½ point is from that. this is called
midpt2in this code, and it is already discovered to have some remarkable, maybe even gamechanging properties!
in 1st graph ‘mp2’ black has a clear trendline from left to right, but alas it has more signal on left side and signal tends to plateau on right side. not surprisingly for a lot of data it seems similar to a feature used sometimes
dlo - dhi where lo/ hi are the ½ lsb, msb ranges. its not obvious from the graph but in contrast to the last experiments (which also involved them in the distribution/ optimization) the entropy/ density seem to be mostly random over the whole range.
looking at the grid plot graph #2 its clear. the iterates are not so different on the 2nd ½ right side. more than ¾ to even ⅘ of the lsb of the iterate is not different, with only some isolated “interruptions”. alas learned the hard way there is substantial aliasing/ overplotting effect in this grid plot and the block sections may be somewhat visually misleading, ie have gaps etc.
also without going into the actual data here but doing so behind the scenes, it appears this top ⅕ msb is mostly undifferentiated left to right for typical signals studied. from graph #1 though, this is controlling the length of the glide, esp for glides where the peak exceeds the predetermined range (here width 200). see ‘cm’ magenta vs ‘nw’ red in the graph. this is exactly the same range that the algorithm has trouble linearly normalizing. it seems possible this is an indication of some kind of complexity issue. the trajectories “seem” to exist but are hard to locate. so this distribution nonlinearity might be a (long conjectured/ sought) way to estimate “missing data” apparently interrelated with local minima.
the whole right range is also related to the earlier idea about the GA focusing on a local minima across candidates found. here it takes the form/ location of the trajectory prefix. all the trajectories have nearly identical initial glides. different runs have the GA settling on a different lsb pattern, but the patterns still tend to be stable on the right ½ range.
❓ 🙄 the idea of the algorithm finding or not finding features and what it “really means” is quite subtle and is a long theme of this blog. one is reminded of the concept of “realism” in physics which has very strong philosophical elements after an entire century of perplexity over QM theory. am trying to think of some other metaphors to conceptualize it. the immediate one comes to mind of human bias of “when you have a hammer, everything looks like nails.” here (prior algorithm vs this one) it seems almost a case of the algorithm actually finding nails with its hammer.
another metaphor: the algorithm may be working a constraint something like trying to cover a floor with a rug that is too small. it can move the rug around trying locally to cover a missed spot, then again discovering another missed spot, “lather, rinse, repeat.” the nonlinear bulge represents the missed spot and here the “rug” (size) is the fairly limited size of the breeding pool (1k candidates).
something else to keep in mind, just because a collection of trajectories is found that thwarts some kind of signal(s) does not nec mean the signals are ineffective in a proof. in short, the proof might have other mechanisms to sample trajectories…
(later) 😮 ❗ ⭐ 😎 ❤ as alluded spent a lot of time analyzing those glides, and then spent some more time. was experimenting with different formulas for features, almost hacking one might say. then came out with this extraordinary, astonishing breakthru! these are two similar features based on the new midpoint measurements combined with
dlo/ dhi, exact formulas viewable in the code. this is also inspired by the earlier experiment finding signal esp in the lsbs of the glides, but then started looking over the whole glide range, and wondering if some major overthinking has been going on. am writing this up within less than ~½hr of discovery (essentially liveblogging!) due to my sheer amazement/ excitement. a picture worth (far more than?) a thousand words… note these are (constructed as) scale invariant signals and the glides are “extremely” undifferentiated!
(omnipresent caveat, it is possible this signal is related to (“is an artifact of”) the generation process, but right now think its highly unlikely.)
💡 ❗ the further thinking on all this is as follows. a signal has long been found in density/ entropy spread in the tail end of glides, as well as many other signals such as max runs, scaled max runs, etc, and it would take a very long survey to collect/ list all this prior work, although
construct137 from relatively recently 6/2020 comes to mind. this all was maybe skipped over in the past thinking (due to some bias in retrospect?) that it only applies to small iterates less than some bound. instead, thinking more in terms of scale invariance, as repeatedly is the overarching mandate with this kind of work, what if most or all of the signals simply attenuate for larger iterates but never really go away and can still somehow be amplified/ extracted? then as in this code (written/ viewed but not graphed) it would seem that theres a whole bunch of signals that correlate with glide length.
(11/13) 💡 back from the car shop to deal with real world stuff, aka order vs entropy in the world of humans/ machines… they recommended precision auto body for a dent… thought about collatz for hours while in the waiting room and wandering around a small part of the city/ park, had more than ½ day to think about it, have lots of cool new ideas. was thinking about the earlier mentioned crucial decrease criterion (“recently” named that on 6/2020 but formulated much earlier! years ago!) and have some new thinking on it. it is indeed key but not sufficient to solve the problem and have some sharpened thinking on what is more sufficient, and need to write it up formally. wonder if anything like it has been found in the literature before. the literature is vast, suspect this basic pattern may have popped up in some other context, but who knows where…
as mentioned am starting to feel that only an ML/ computational-based analysis of the problem can succeed and feel need to move more in that direction. sometimes it seems building a single unifying/ breakthru feature could break open the problem, at other times that seems to be a quixotic mirage. maybe the quest for the unifying/ breakthru feature is anthropocentric/anthropomorphic thinking/ bias. with all the computational focus around here, thats a tough idea to come to grips with/ swallow, but sometimes success is all about discarding every/ all illusions and “biting the bullet.”
part of the challenge seems to be that the inside-vs-outside core (undifferentiated region) statistics seem much different, almost even in a sense “opposite,” (yin-yang dynamics again!) and its all highly nonlinear, and a full solution has to weave it all together, and big data/ computationalism is all about accepting, even embracing, some things really are just too big to wrap a (human) brain around.
ah, but then solving the problem seems equivalent to wrapping ones brain around it! one of those endless paradoxes around here…
recently have been thinking about/ looking into/ reading some about decision trees.
- its a great ML method that has advanced significantly in recent years. none of my earlier ML books mentioned them much, but think the theory is much improved now, and think they outperform some other methods on certain types of problems.
- my understanding is that final/ best solutions in the LHC particle analysis kaggle datamining contest ended up with them heavily utilized. oh, theres that whole physics thing again… iirc was reading some leading kaggle competitors talking about them some, link buried deep in my pile somewhere. they seem to have some advantages that might work well for this problem. two nice intros, KDNuggets and wikipedia.
- one of the problems with NNs and GAs is trying to eliminate redundant variables. a NN will tend to put weights on all variables no matter how insubstantial, and its n-factorial different combinations of variables some which will do better than others. it seems slightly counterintuitive that an ML algorithm will run better by eliminating (“culling”) certain variables, but NNs/ GAs generally have this property, and in some ways its a serious design deficiency. has this ever been addressed in the literature other than by ad hoc methods? eliminating the lowest-weighted variables 1st and looking for improvement is a reasonable/ not-bad approach but it seems not especially elegant.
- actually wrote some code somewhat like that for picking/ evaluating random variable selections in some past blogs wrt nearest neighbor type algorithms, it was very slow going, high computational expense and results were not especially standout for all the trouble so to speak. a nice ML algorithm would let one throw lots of variables at it and then it would just throw away those that are useless for the best classification, and it looks like tree algorithms do something quite like that.
- my understanding, and havent used them at all so far, is that they work from the best signal variable 1st, working with then the lower-signal ones, possibly with different orders in different branches, and at some point no more signal can be extracted, and maybe it will stop before it exhausts all the variables, and/ or some or even many/ most branches dont include all variables. at this point, as captured some in the last code, now have a big plethora of features to try out in a model, far more than last time around.
- another feature of tree algorithms is that their performance on classification may be very fast in some cases. to classify one only has to explore
log(n)size of the decision tree on average whereas for nearest neighbor it requires (linear) time proportional to n points. this will be helpful in this context where often hybrid GA code is used and classification is needed for each candidate, in some cases many thrown away as “unsuitable” wrt filters (density/ entropy/ max 1 run length etc) and many others thrown out as not sufficient (below bottom cutoff) population fitness.
- the kdnuggets intro says that the tree algorithms do well on nonlinear data. it seems that fractal data is about as nonlinear as one gets…!
(later) 💡 was thinking all about function bounding via Big-O notation O(f(n)) and how it relates to the decrease criterion. it was very tricky in my brain, a lot of moving parts were floating around, it took hours to sort this out from the fuzzy haze (to haphazardly mix at least 3 or more metaphors). but then, some flashes of insight. it got boiled down. the decrease criterion again is finding a g(x) such that g(x) > g(f(x)) where f(x) is the collatz mapping. g(x) is very important but hasnt ever been named, although stuff like “loop invariant” has been thrown around. so call g(x) the induction function. here are the additional key constraints on g(x) after lots of thinking, hinting at some Big-O like ideas but formulated simply without them.
- g(x) is discrete (integer based). this ensures it “counts down” to 0 and does not (despite being defined in a sense as “strictly monotonically decreasing”) asymptotically approach a higher limit.
- for all c, there exists a c2 such that g(x) > c for all x > c2. in a sense, again invoking “opposites,” this might be called an unbounded lower bound, or at least thats the phrase that came into my head on thinking about it.
very astute readers might be able to piece together how these two properties along with the decrease criterion lead to a proof, that all f(x) sequences are converging! hint: the key idea is how the induction function relates to glides. will sketch out the proof sometime and then try to formalize it precisely. ok, ok, (maybe ghost of fermat is nudging me here!) am not a pro at this stuff, dont have a full proof yet, have to think about this more, but it relates along these lines… ❓
(11/14) 💡 recently there was the outline idea of focusing entirely on scale invariant properties to build a “map” of the problem also via ML, also related to the tree model idea, ie maybe tree models might be well suited for the “dynamics mapping.” the induction function is not quite the same as the map idea although theyre interrelated, although that needs/ remains to be articulated in more detail.
in the last code all the features are scale invariant. what about using them in an induction function? it seems to be clear that scale invariant features can only indicate a range of trajectory dimensions, and suspect therefore a function based on them cannot overcome the 2 additional constraints, and maybe to be utilized with g(x) they need to have at least 1 scale variant measure eg bit width is the likely/ obvious candidate.
although how would one actually prove this? was starting to imagine scale-invariant ideas working across multiple iterates/ collatz mappings that somehow might “sense” or “reverse engineer” scale-variant properties. these are a bit sketchy/ hazy/ even weird ideas, but the recent ideas/ even logic (written/ thought) nudge the thinking in this direction/ vice versa; heres a rough initial stab at analogy.
consider floating point numbers/ representations. they represent a range. but also imagine they are distributed somehow. now imagine while looking from an overall pov they look/ appear “ostensibly” evenly distributed over the range, there are subtle patterns in the numbers in the way their 0/1 bits are distributed esp in the lsbs but not the msbs (a phenomenon/ property already identified in collatz iterates not represented as floating point). the overall range is nearly the same, but maybe the lsbs vs msbs could encode information about scale somehow.
some of this might be related to the idea of floating point representation/ roundoff error (aka noise) and how it may be different at different scales, a well/ deeply-examined concept in numerical methods literature, and surely with some fractal-related properties/ aspects.
heres a more tangible yet simple example, maybe more meaningful than merely “toylike.” consider only odd-bit-width iterates.
- they cannot have exactly ½ density. they can have a density that is close to ½, and their “closest nearness” to ½ ie the formula
min(abs(d - ½))over all fixed width iterates can indicate the scale (bit width) of the iterate!
- in short the decreasing minimal gap between their possible/ expressible density and ½ is measurable, and directly indicative of scale!
- and then note that a combined function of multiple collatz mappings increases probability that it will have at least 1 odd-bit-size iterate…
- and then immediately wondering, is there some way to ensure/ guarantee it with some (variable?) # of collatz mappings…?
- maybe even over a fixed # of sequential iterations? this is such a basic question and yet it seems very little studied relates to it, except maybe indirectly eg connections between possible parity sequences and bit sizes…
- and note maybe then some scale-invariant measures can be constructed over multiple iterates…
- all this is somehow indicative of/ seeming to point to the paradoxical, but maybe not contradictory idea that the (sequential collatz) bit patterns can somehow “encode scale” in a “scale-invariant” way…
- the Terras ideas give strong/ arbitrary control over (“initial” sequential) odd/ even iterates but what is the “corresponding” control if any over odd/ even iterate sizes/ widths? it is not immediately clear, the question is apparently wrapped up in the deep complexity of trajectory dynamics, its something to be analyzed…
- quick thinking, the forward/ corresponding reverse/ inverse/ backtracking
n div 2 → 2·noperation ie “shift left” always subtracts/ adds exactly 1 bit, but
3·n + 1 → (n - 1) / 3for
n mod 3 = 1iterates has some kind of varying/ dynamic/ complex effect on bit size.
💡 also, there are other toy ideas/ analogies thrown in this blog over the years, and have a special fondness for them, although havent thrown out anything like this in ages. recently came up with another toy idea as another metaphor, need to write it up. in short consider arbitrary base n expansions of numbers, and then let digits be represented by (unique) strings of different lengths, then look at the total size representation of strings over the counting (or decrementing) operation…! it makes me think of the phrase/ phenomenon “encrypted counting,” there are some old blogs musing in that direction.
(11/16) 💡 so many directions to go in, so many leads/ ideas to follow up, but just like life the most worthwhile/ substantial/ promising stuff is also the hardest/ most timeconsuming.
hybrid51f graph again there is more to the story, more to interpret. it seems to be explained in 3 sections left to right where the existing understanding/ framework plays a large role.
- 1st ⅓ section left is the predetermined range,
cm/ cgmagenta/ blue rise linearly with ‘cg’ slightly “faster” ie a slightly higher slope.
- 2nd middle ⅓ section, ‘cm’ tends to push against/ be constrained by/ “resisted by” ‘nw’ bit width orange while ‘cg’ rises at same slope ie ‘cm’ still in the predetermined range, but at the very end.
- 3rd ⅓ section right, ‘cm’ exceeds ‘nw’ and ‘cg’ continues rising ie ‘cm’ enters postdetermined range.
- the midpoint signal ‘mp2’ tends to work (ie capture signal) on the left ⅔ section while breaking down/ levelling on right ⅓ ie becoming noisy and undifferentiated aka in a sense “featureless.”
- also the noticed distribution warping phenomenon coincides with the right ⅓ section.
this is quite interesting and unanticipated emergent behavior of the algorithm that nearly perfectly aligns/ matches the conceptual framework built up, giving strong support to its validity (actually, both the conceptual and optimization frameworks) as even more than a mere conceptual understanding but as an underlying dynamic of the “system”.
⭐ ⭐ ⭐
some general thinking on signals. have written in this blog, it seems that it doesnt make sense to throw sophisticated nonlinear ML algorithms on signals that dont show at least some linear trend. my 1st experience of this came with, believe it or not, market data analysis/ forecasting. in other words it seems nonlinear ML algorithms are all about squeezing additional signal out of something with at least some linear signal/ correlation. this is in the sense “there is no free lunch.” xkcd comics notwithstanding, in real statistics with careful analysis, one cannot actually mix a bunch of noisy data and find signal, at least in the hands of a truly experienced unself-deluded professional… or rather, (“caveat”) maybe that is exactly what a lot of data science is all about, its raison d’etre, to try to remove the possibilities of delusions statistically/ mathematically, and hopefully succeeding most of the time…
this is abstract but heres my basic tangible, bottom line, “brass tacks” point is that if the data cant be presented in some way where some signal seems visible to the human eye, my suspicion is that few ML algorithms would succeed in extracting much signal at all, or maybe any. some might laugh at how much gnuplot is used/ utilized around here (there are troll comments exactly like that) but its more than just a tool, its a kind of extensive philosophy. and if one argues otherwise, ie signal coming from apparent noise, maybe the data is just not being presented in a effective graph that really does exist. and lets never forget the human visual system is one of the most sophisticated signal detection devices in existence on planet earth…! (and professionally trained experts exceed even average acuity/ ability!)
my only other idea questioning that, ie a possible exception scenario, based on very solid theory of what is sometimes called in ML circles blending, is that maybe if a bunch of noisy signals collectively have some same subtle bias, and then combining them all reveals/ uncovers the bias, ie in some sense are “noisily correlated,” then the ML algorithm might succeed. but then can the ML extract anything much different/ better than the average of the signals? (wow! that gives me an immediate idea!) it seems some deep rules of thumb can still be developed/ elucidated around these areas…
🙄 which reminds me of this article found today… a kind of halloween horror story about ½ mo later… was diagnosing an issue with a particular app made at BigCorp that has caused me mountains of trouble/ grief in 2019. in short, its a death march, a ½ alive, ½ dead zombie app, a graveyard of/ for developers complete with stacked layers of dust, spider webs, tombstones, coffins, and open/ unfilled pits aka deathtraps everywhere, including one with my own name engraved in stone. ofc it is utterly broken but in a way that the company does not understand or really care about, which ofc is how it got that way in the 1st place. the entire system is not really coherent; at the moment they are just trying to figure out why it is using too much CPU, because who @#%& cares whether the data is correct? … but CPU cycles actually cost @#$% money! lol!
😮 which got me wondering about the early origins of “GIGO.” it dates to at least 1957, enjoyed this delightful/ in depth/ excellent etymological analysis, hope someone else gets a kick out of it. alas sometimes its not even GIGO— which is as the article tone, sometimes ½ funny/ facetious— but just… garbage. the phrases gallows humor or “whistling past the graveyard” come to mind… sometimes stuff comes back from the past to haunt you… sometimes stuff is haunted even to begin with… actually there is something even worse than garbage: radioactive waste— literally dangerous/ toxic to touch or even to be near! 😮 👿
anyway it seems that as one squeezes more and more signal out of the glides, focusing more and more on the undifferentiated region, they continue to reveal some part that “reduces” in a sense to pure noise. so in this case, from the 3-section analysis, as already somewhat well understood, it appears that “hard” trajectories are those that exceed the predetermined range ie peak extends into postdetermined, and as mentioned it seems about the top ~⅕ msbs of them are just noise. and no human-detectable/ constructible signals are so far found in that section, and so it seems at moment unlikely any sophisticated ML system, even trees, could extract signal there from the existing format.
so maybe the idea to use here, as suggested earlier… and evoked by this data… what if trajectory dynamics are encrypted into iterates in a way that the “future” is almost “perfectly” obscured in cryptographic noise? what if, if hammers make everything look like nails, here the “nails” are binary encodings? theres always been that possibility lurking, might as well face it directly… think have already admitted/ conceded it long ago, in moments of frustration/ despair…
(11/18) 💡 ❗ this follows up on some key prior ideas in somewhat surprising ways and is some rather straightfwd theory in various ways but somewhat involved with multiple steps and lots of moving parts aka some “fancy footwork.” recently mused (end of 11/12) that (in other words, basically) maybe there are many trajectory length estimates already known or seen. as has long been understood, the trajectory length estimate is a basic induction function.
- the most basic feature known for years is density, and in graph #1 for an undifferentiated glide found by the
hybrid51fcode it is examined/ graphed in red. the tail end density spread was noticed essentially immediately on discovering the feature. but as mentioned, maybe there was some bias that the signal is only present in tail ends of glides up to some length. it certainly looks that way in the graph.
- but as usual looks can be deceiving and on further thought its a conjecture that really needed to be tested rigorously… “that day has finally come”… how to convert it to a trajectory-length-(remaining) signal, if it had enough signal, ie is not tail-end-limited as it appears?
- the trajectory length remaining vs density is graphed in #2 with density red left side scale and remainder length green right side scale, its a basic wedged-shaped trend. the signal is evident esp looking at bottom/ top of remainder scatterplot/ points, but how to extract it?
- an immediate idea is to look at averages in “bins”. converting the density range into 50 bins and then averaging the remainder points in each bin gives rise to graph #3 which very gratifyingly captures lots of signal! even strikingly so, showing how averaging can sometimes extract sharp, strong signal out of seeming near-noise. and in my memory the 1st new use of this nice gnuplot feature called linespoints 🙂
- then in graph #4 one can reconvert the original density signal into expected indexes via the nearest-bin estimate in red. alas, this has signal but is very noisy, so how about a 200 sample running average? note running averages are in a sense local, or a mix between local and global operations. this is green line ‘xa’.
- ok, thats not bad at all! but just looking at that trend, if running averages are allowed, what about just averaging the initial signal? in graph #5 ‘da’ is computed which is
-(abs(d - 0.5))ie negative of the density distance graphed in red. this has nearly the same trend and then a 200 point running average is computed as ‘daa’. the same ‘daa’ signal is graphed in #4 for comparison, and it almost seems to be better than the ‘xa’ bin method, using less computation.
- these are really great trends/ signals, and prove something that is kind of extraordinary/ novel/ even a paradigm shift. the density distance (spread)— while manifestly thin here— encodes remainder signal even for large iterates!
- but alas not very linear at all! ah, but all this gives rise to a whole new idea. in the next method some further theory is exploited. in graph #7 the 750 sample running average of density distance is again computed, leading to a very monotonic function green. on very close inspection, it is not perfectly monotonic, it tends to lose (strict) monotonicity more on the left with larger iterates and a longer, nearer-flat section, where even the same “wiggle spread” could cause nonmonotonicity. which gives rise to the following.
- then one can assign “nearest” remainder distances to densities and then look at the final result. in graph #6 the density is rearranged to monotonic order. since the actual function is slightly nonmonotonic, the estimate of trajectory remainder is noisy in the green trend, but very closely linear for 2nd half! (further “post” averaging would increase its linearity.) not pictured, this nonmonotonicity leading to inaccuracy in the “more distant” points can be straightforwardly decreased with larger running averages (both “pre/post”).
- in graph #8 its a scatterplot of the exact same ‘y’ values from graph #6. in graph #6 it looks like wild spikes, but in graph #8 the extraordinary human visual processing system comes to the rescue and recovers the original line trend! its also evident its a kind of (nonlinear) warping/ projection of the data. so nonlinearity is used to maximize linearity so to speak, ala “fighting fire with fire”…
there are some basic ideas/ concepts just revealed/ implemented/ coded/ demonstrated here.
- any “nearly monotonic” signal, no matter how nonlinear, can be converted to a “nearly linear” signal in this way!
- this is an immediate/ basic/ striking demonstration/ confirmation of the property just conjectured namely that trajectory scale is “scale encoded” in iterate data!
- the running averages are closely related to the earlier ideas about an induction function maybe requiring computation over an iterate span of size proportional to the iterate ie bit width, in particularly simple form.
(later) musing, the “unbounded lower bound” property from 11/13 concept (which honestly maybe isnt best terminology ie strictly accurate) reminds me of the epsilon-delta definition from calculus and suspect there is some kind of similarity, can the former be phrased in terms of the latter? or does it appear in some other math context from the literature? actually looking some more it seems to be alluded to in this definition of limits as an infinite limit, but its not written out exactly. “Similar adjustments can be made to define limits of functions f(z) when z → ∞.” (oh thx for helpfully pointing that out, lol)
we say a function has an infinite limit at infinity and write if for all , there exists an such that for all .
(11/21) as mentioned last/ this month was thinking more about the lookahead function vs new sophisticated findings from the hybrid algorithm. essentially, a lot of tuning on the algorithm has pushed it into long undifferentiated glides that seem “longer” than any before seen eg terras density glides. in short they have long(er?) postdetermined glide sections. some of this is challenging in that there are copiously many generation methods/ algorithms over the years but typically only a few metrics are tracked on each one, so maybe these “long/ extended glides” have been seen/ generated but not noticed much. due to the incremental/ experimental nature/ flavor of the investigation, the glides and metrics are somewhat independent of each other in the development but not the theory; in theory all the metrics can be computed over any glides.
but then thinking about the last exercise, oh! the “fly in the ointment” “rears its head” again. again the ufos are a possible/ long counterexample to trying to use the just-examined “seeming or imposed continuity” of the density measurement for sophisticated proof purposes, eg the induction function, and something apparent to test against. so went and looked again at the ufo work mainly 3/2019 wondering about the trajectories and their differentiability. and then noticed this (precursor) exercise think have been trying to locate again! it relates to the lookahead function also which as surveyed last month showed up (later) on 6/2019 and 7/2019.
bitwise28 looks at “2x width distance iterate width ‘wz’ (bit width of iterate at 2x the starting bit width distance) divided by max iterate width ‘mx’, called ‘wzmx’.”
❓ it seems the only possibility of escape from the ufos breaking the density scale encoding approach is if somehow they lead to somewhat anomalous glides in some way wrt differentiability, ie are actually differentiable somehow over its ufo-presequence trajectory. there are hints of lots of triangles in them and other aspects, but it also seems likely that the more/ longer the pre-trajectory is generated, the more undifferentiable it can be. another idea is that the density scaling works over the postdetermined glide but not farther into the “long” postdetermined drain, and again this is not refuted by known understanding/ searches that have never found ufos in the postdetermined glide even while specifically targeting it with the best/ most thorough optimization code known, the hybrid algorithm. ie put another way can large ufos be constructed in the tight undifferentiated region lately being studied ie close to center density/ entropy and short 0/1 runs?
(later) it does not take a lot of work/ time to test this idea. this is some adapted code loosely based on
backtrack3 from 3/2019. the last exercise had a trajectory length of about ~1350 so this code generates a 1400 length pre-trajectory with a large ufo discontinuity starting at 100 bit width that has 85 length 1-lsbs, 15 msbs random ½ density. it uses a new technique of finding a parity (“pre-“)sequence very close to ½ density. this stands in for criteria trying to limit density and entropy to the core and restrict to short 0/1 runs. after 67K iterations it ends up with a ~375 width iterate as in the 1st grid plot. note the remarkable/ unusual very high linearity of the pre-sequence slope, noticeably/ substantially more than post-spike. it also looks very undifferentiated, maybe even more than post-spike, although ofc that kind of measurement needs more quantification.
then in graphs #1-#8 the prior
smooth3 code/ algorithms are (re)applied. the nonlinearity/ distortion from the ufo is evident in various ways, almost in every graph except #2. the bottom line graphs are #6 and #8 and it causes a predicted significant distortion/ noise region near the lower iterates, lower right side, although maybe not as much as expected. it is evident some of the averaging/ mapping techniques are not highly susceptible/ vulnerable to it, ie exhibit some robustness. the endless background question, how much is “good enough”?
(11/24) alas the BigCorp RadioActive App (RAP?) at work is still on fire and melting down, causing some major stress, ie restless sweaty night, but at least now its sheer wretched seething toxicity is not (entirely!) focused on me and spreading across the team some! so, (full irony mode on) something at least to be thankful for… wrote up some notes that included the phrases biting the bullet and no quick fixes.
which reminds me of this problem at hand. (at this point of years of focus nearly everything does…) musing on “the current conditions,” in which reality seems to be entangled so to speak (evocative/ alluding of jungs concept of synchronicity), is there some bias here about trying to find quick fix to the problem in the seemingly endless analysis of features etc? have this new idea that is years old but now have a better grasp on how to carry it out. doing a full survey of its precursor work would be very timeconsuming, but it clearly fits in with prior ideas.
the “no quick fix” scenario here is to carry out a sophisticated induction function construction attack, so to speak. this has already been attempted in the past with a genetic algorithm more directly and a little more indirectly with linear algebra/ matrix models. so whats the next step up? have never done much work on tree ML and despite it being very attractive as outlined, am going to hold off for now. have much more experience with nearest neighbors and feel it can capture substantial nonlinearity/ fractal aspects of the problem, maybe enough to lead to some new breakthrus. have most of it mapped out in my head, now the “devil of details”… 😈
💡 ❗ ⭐ ❤ that will surely be a weeks-long effort to finetune. meantime, something to carry on the flow of ideas. came up with this basic conjecture that unifies the major areas of focus of these blogs, and the unification of 2 of them has already been brushed on/ hinted/ alluded to. the conjecture is that these 3 diverse attack ideas/ directions/ formulations/ frameworks/ paradigms etc are all not merely interrelated but actually interchangeable/ equivalent…!
- the induction function construction
- the “ordered”/ “orderly” traversal ideas
- the “feature map/ transition graph” outline idea
now, just need to sketch out some proof of that…!
(11/25) this uses some of the
outline8 code from last month. it generates 100 samples of random density and uses ½ of them as classes and the other as train/ test classification points. they are all kept in a single array with a marker to indicate the classes. some of this is different, not easy to describe, and am gonna have to come up with some new terminology/ concepts/ explanatory elements.
it finds nearest classes for the initial and “transition mapping”. the initial and transition mapping are built out of 2 4-length running averages over 5 samples. ie its like a batch size of 5 with 2 overlapping batches of size 4 “separated” or offset by 1 iteration; this was constructed previously but without much comment. specifically it was introduced in
outline5 from last month the 1st cut of nearest neighbor classification without the hybrid point generation. immediately it is found that longer batch sizes tend to lead to more self loops (not further quantified/ graphed here) hence the small size here. also immediately discovered, and this could/ should be expected, that self-loops are numerous and need immediate attention before the other/ advanced aspects of the induction function construction can be tackled.
this is a simple experiment to look at (only) the classes involved in self-loops and delete the most common class over 500 iterations and replace it with a random sample point. ‘mx’ green is the occurrences of the most common class. ‘a’ red line is total class count. ‘v’ blue is sum of all class occurrences.
unexpectedly, relatively quickly in ~50 iterations and little change afterwards, the “optimization” drives up the # of occurrences of the most common (self-loop) class ‘mx’, increasing total # of occurrences ‘v’, and decreasing total # of classes ‘a’. its nice to see at least there is some effect but its basically the opposite of what is desired at least in the sense of increasing occurrences of the most common self-loop class. but then also on further thought its not exactly counting the key parameters.
(11/26) so then not a lot of code change leads to this, but in a way very different, now desirable results. its clear a key indicator is simply total # of selfloops, thats ‘l2’ green in these following graphs. batch size was increased to 10 based on algorithm working (more) as expected and the class fraction of all points is changed to ⅕ (20/100) from ½ (50/100). the next code has 2 optimization goals,
(a) the 1st is to drive down ‘l2’, again focusing on self-loop class mappings, and
(b) also to look at the least-occurring/ least-common class and maybe increase it, ie try to balance out total occurrence distributions over classes.
- the good news is that the 1st key optimization (a) can be done quite effectively as in graph #1 in only about 250 iterations with a main strategy of throwing away the (self-loop) class with the “closest combined distance” (sum of distance of class for initial/ transition mapping). ‘z2’ dashed red line is lowest combined distance to matching self-loop class and spikes over the run; not exactly sure how to explain/ picture this right now, it seems to be some kind of distance tradeoff with classification adjustments. total classes ‘a’ yellow is not changing much here & across all 3 of the optimization runs/ graphs.
- however, turning on the 2nd (b) optimization is hard(er) and turns out to tend to work against (a). so it is applied only on “spaced intervals” specified as a parameter ‘i’ here 3 count. in graph #2 (b) turns out to be apparently hard to change and there is some but only nominal-bordering-on-weak improvement in the least common occurrences ‘mn’ cyan line which stays at a fraction of the max occurrence line ‘mx’ magenta. the (b) strategy is to find the “nearest” class in the combined distance sense that was thrown out and use it as a class instead!
- in graph #3 the 2nd (b) strategy of replacement with a random new point is carried out and the ‘mn’ cyan line is somewhat random, but also roughly the same order of magnitude as in graph #2 strategy.
was surprised about the hardness of trying to evenly distribute class mappings and it seems to point to some kind of intrinsic dynamics/ property of the system, but its not clear yet.
my other immediate idea running thru my head is to look at grid patterns for all this and sort points by basic properties such as entropy/ density pre/ post optimization for some insight on what the algorithm is doing, what regions it is exploring/ favors, etc. these are just the beginnings but “uncomplicatedly” achieving the basic necessity of reducing/ completely removing self-loops is a big deal. ok, at heart its really a simple algorithm/ technique but at least one complication can be ruled out for now. also it does reveal that “balancing” class occurrences may be challenging or difficult…
(11/27) this is a checkpoint, it took a few hrs. am moving onto the “function evaluation assignment” phase. the idea is to find values for the g(x) induction function once the points it is evaluated at are known. had an idea which seemed reasonable but didnt work out very well. it didnt converge to anything much, it was somewhat random. the decrease criterion for g(x) was basically about ½ probability for each point, what is expected by chance. then started thinking, need to understand the basic relationships here.
and ran into this very basic idea. the class assignment code will set up an ordering between classes. this can be represented quite handily with a directed graph. node a pointing to node b indicates node a should be assigned a higher value than b. then looking at the digraph, what emerges? this code was modified to have ½ total of 100 classes, ie 50. graphviz (
dotty again) did a beautiful job on the layout, am a bit delighted bordering on amazed. the staggered/ overlapping layout it reminds me of font kerning algorithms.
- the analysis shows that there is (in this run/ case) “low conflict.” conflict is a case of a loop or two nodes pointing to each other, because then not every node can be assigned a value to satisfy the ordering. for a 3-loop, its like the “rock paper scissors” game where there is no winner (ie a draw) if 3 people select even all different 3 cases. but there is only a single loop in the entire graph, namely 61↔44 midright, indicating “overall low conflict”.
- it is not immediately obvious how to assign values but apparently starting at higher sections and working downward while decrementing an assignment will be very effective. here graphviz layout is actually/ essentially/ nearly fully solving the problem; assigning the ‘y’ value as the vertical height of computed node layout clearly gives a nearly perfect values assignment!
- this requires deeper analysis but on 1st glance, it looks possibly like this unexpected low conflict (uncommon loops) is an emergent property pointing toward some kind of “inductive spine” on the problem long sought. it was possibly present in last months similar nearest neighbor work but its unexpectedness/ significance wasnt noticed at the time. one of the current best ideas on the inductive spine long studied is ‘cg’ in the undifferentiated region as in recent hybrid experiments.
- however, heres a/ the “catch/ fly in the ointment.” there are 5 separate disconnected islands or components. a full solution requires full connection, ie no separate components, a single connected graph.
- another aspect (almost the same as disconnected components?) are “dead end states” that dont lead anywhere further, those have to be addressed also somehow, such that all states lead to other states.
- but, a sharp reader might notice, if all states point to other states, ie there are no unconnected components, then loops are guaranteed, right? a paradox that needs to be examined closely, pointing toward something…
- other runs generate somewhat varying results, a simple 2nd try graph #2 has more loops but a little better interconnection/ aka fewer components (4) with one of them larger on left.
- there are many other issues to address also just glossed over so far such as trying to get a full representation over all possible iterates/ features etc.
- another possible major wrinkle, reviewing last months ideas: it was found that possibly most of the features were not being “considered” by the classifier except for density due to low signal/ high noise in the others. this was not quantified then but indirectly indicated. however, there is a lot of indication in many other experiments that various other features have some continuity/ signal over iterates eg entropy, midpoint, 0/1 runs, etc.
- another possibility: what if the already known undifferentiable region is “emergently” just assigned a single “mush-like” class by the classifier?
- and honestly this is all rather limited (a blunt assessment, “crude”) given there is no tracking of relative iterate size here over the classes/ states.
- nevertheless, the general concepts/ logic are brought out/ illustrated.
(later) 💡 ❗ again pondering recent “highly undifferentiated” hybrid generated trajectories, maybe the key question is if “differentiating classes” can be built or derived for undifferentiable glides that discriminate glide/ climb from nonglide/ descent/ drain! without fully realizing it at the time, that is in a sense a direction/ theme of much of the work that has been done “manually”! ie in a sense classes derived from relatively uncomplex human-constructed functions out of existing known features. and it does not seem this approach has been specifically targeted… what would such a focus or “attack” look like? it seems much of the components have already been utilized in different/ other ways… along these lines already have a big idea to try, with all the known/ constructed features at this point there is a lot to try here!
(11/30) so that led immediately to this code. or rather the basic idea is immediate but the “execution” (that is in the sense of human agency) took quite a few hours. this builds mostly on code from last month,
outline6 which had a basic nearest neighbor algorithm also with separate test set in addition to training set. the code was close but alas took quite a bit of adjustment. one aspect was its use of the ‘j’ key/ index in hashes as a shortcut to point to different indexes, that had to be taken out. another glitch was with setting the nearest neighbor ‘nn’ temporary variable that made an assumption the hash elements were common to (shared by) the test and classification element lists. all this took quite awhile to isolate/ adjust.
this code works with the
hybrid51f “highly undifferentiated” data again and has a lot of moving parts. recall it is density/ entropy range limited to center +/-0.05 ie 0.45 – 0.55 which is quite tight. its somewhat hacked together and has a lot of list variables in the main code, somewhere around 8, nearly a record number, and way past my typical threshhold for refactoring (eg into subroutines) but it gets the job done and seems to be running correctly, and dont immediately plan to build on it for now, so am letting it all stand for now. it tries to build a classifier for ‘cg’ glide length out of the existing 1000 points. it tries to choose 250 representative train points that are nearly linear over the ‘cg’ range. due to nonlinearity and (avoiding) distortion it ends up with 208 total. the algorithm uses the other points remaining as a pool of 250 possible classes, throwing out/ replacing the worst performing class on each iteration with a random new class from the pool.
some extra feature variables are added just to try to improve classification. it is not clear right now how much of a contribution they make, they seem to improve it. the new variables are max 0/1 run related, average 0/1 lengths, and a ratio of average 1 run length divided by expected max 1-run length. graphing these over the
hybrid51f data (initial iterates) gives a rough (visual) estimate how how much signal can be found, and the features predictably still have a lot of noise/ lack of signal on the 2nd ½ iterates with the long glides extending into postdetermined range.
❗ however, the algorithm finds significant signal over the “ostensibly undifferentiated” predetermined range and that is no small accomplishment! remaining “todo” items: try to figure out/ deconstruct which features are contributing to the signal, not a simple question.
- 1st graph: ‘et’ red is error total (average) over train data. ‘er2max’ green is the worst performing class error (removed). ‘j’ blue is the last removed class, to look for any trends. ‘et2’ magenta is test error. ‘er’ cyan right side scale is test error improvement (rate/ percent). ‘er2’ black right side scale is train error improvement.
- 2nd graph: ‘y’ red is classifier (train) prediction. ‘e’ green is error. ‘cg’ blue is prediction variable over test data. ‘i’ magenta is class distribution of indexes (to look for trends). the array is size 250 but is truncated at upper end in the graph due to shorter # of test points, 208, and as seen it stays fairly evenly distributed even though the ML algorithm is free to bias it if it helps prediction accuracy. ‘y2’ cyan is test prediction. ‘y’, ‘y2’ are noticeably flat/ noisy at the right ½ range of the graph where ‘cg’ exceeds the predetermined range/ count 200.
so signal is extracted, but again the burning, difficult, unanswered, open question: what amount of inaccuracy/ imprecision in the signal is acceptable? in the theory, the induction function g(x) is perfectly monotone. in this sense the problem seems to be about decreasing the noise of a noisy prediction function. many techniques have been developed to do that already, but what remains is some (“sufficient”) grand synthesis of all the “scattered bits and pieces.” 😮
had some rough deja vu on looking at graph #2. last month did a big review of MDE results and how they might be expanded in a new framework, and mentioned 1/2017 page. at the end of that page,
fit5d, fit5e, the predictions seem more than vaguely similar to graph #2. the prediction essentially splits into two regions, one low and one high, ie roughly bimodal. a big difference is that in the past case the iterates are variable width. recall in that case ‘h2’ is basically
c / nw (total trajectory length divided by initial bit width) but note also it used compressed mapping. in this new case the 1st lower left region is much more noisy.