collatz story arcs further

the outside background to this post is that google/ deepmind just announced world-shattering results on the protein folding problem with machine learning. this is historic, deserves to be highly celebrated (not mere typical marketing hype!), and crosscuts more than 3 of my favorite fields all tied up into one problem (bioinformatics + physics + ML etc), and am very inspired/ awed/ psyched about this. those feelings are not easy to obtain these days. would like to put large effort into commentary on all this, but alas my audience is not into reciprocity. drop me a line (comment) if you want to (rather easily?) prove me wrong…

am immediately working on the last code some, and my full data scientist expertise/ repertoire is being put to the test. its been some back-and-forth, almost a dialog or even conversation with the data, which is nearly the best case scenario. its like a kind of debugging, but on the level of data manipulation more than coding errors, and has a lot to do with trying to understand the presence/ lack of generalization in the model, which, as maybe indicated a long time ago, is the machine learning equivalent of induction. in other words, the code might have worked on less complex data, but it doesnt work on this data, so it has to be further tweaked. this is an attempt to make a relatively long story short.

  • my 1st instinct was to look at the performance of the model over a sample trajectory. tried it out, and it seemed to fail rather soundly. was expecting to plug in the sequential iterates of the glide, and see a roughly declining function. it looked like mostly noise. but then realized that the model is predicting ‘cg’, which is quite nonlinear over “subglides” because initial iterates of the subglides sometimes go down quickly, and the subglides tend to be either short or long.
  • but even after fixing that, and sorting points by actual subglide lengths, it still led to noise predictions.
  • so then started wondering why the model was “falling down” so badly. one would at least expect roughly the performance of the test set. and started wondering if the test set was not being properly “blinded” from the algorithm. recall test points are selected out of the total set of points, some of which may coincide with train points, or points that were once used for training.
  • the 2nd scenario is more exotic and would involve a kind of “residue” effect where the model is affected by not just points used to train it but past points used in the training… not likely, but also not completely impossible either, esp recently seeing various biases in GA populations as discussed.
  • this led to some code to make sure the algorithm is guaranteed to never “see” test points. after retraining, then found the same negligible/ noise performance on the unseen test data.
  • then, my idea was to train over the subglide data which was apparently fundamentally “harder” than the initial iterates. so wrote some code to generate all the subglides (a nice/ powerful “resampling” technique used a few times previously; see the sketch after this list) and select an even distribution over them. the subglide set is very large, coming out to about 40K total. but sampling them down and training over the subglides led to no substantial signal being extracted. this seems to point to a deeper issue that, even though rather obvious in hindsight, eluded me.
  • so, and this is not easy to admit, this led me to wonder about possible biases in the distribution, still thinking that the subglide characteristics could not be a lot different than the initial iterate data. is it possible to make random selections that nevertheless come out differently biased?
  • this led me to the idea of looking at the average distances of points in the samples, ie whether they are close or far, and actually wrote some code to try to throw out “far” points, or “near” points, using the existing nearest neighbor code (called dist). but again, this did not seem to affect or improve the “null” test performance that much. in hindsight, this was a red herring, probably effort “off in the weeds” so to speak.
  • ah, but thinking this over, trying to understand better, and wondering what could give additional info/ insight led to more obvious ideas that should have been tried to begin with. the algorithm has a rough confidence estimate in the ‘z’ distance measurement of nearest matching classes. just looking at that gives a lot of additional analysis leverage/ insight.
  • it turns out to be characteristically varying over the low and high ‘cg’ points. moreover, the (matching) distance of all the (subglide) test points is significantly “farther” than over the (hybrid) train set points. aha! that points to the features being basically somehow different over the test points. not a surprising scenario, and one that could have been noticed much sooner.
  • immediately, ‘d’, ‘e’ are much different over the hybrid vs subglide data. other features show similar strong differences. some of this can be understood in that the hybrid data is in another sense not very evenly distributed and “biased.” ofc it is already understood/ intended to be selective, which is sometimes subtly entangled with bias.
  • then looking closer at features, there is a glaring issue. many of them are not scale invariant, and also the ‘ea’ and ‘da’ were incorrectly computed after ‘d’ and ‘e’ were adjusted to be offset by ½. the ‘a’ and ‘mx’ variables are scale variant as currently defined! it is easy to adjust them to be scale invariant and retry the analysis.
  • another close look at the features indicates that the algorithm might have been seizing on the lsb metrics which behave differently on the (‘cg’) low vs high region: for the hybrid data they are noisily distributed on the bottom region and then flatline to constant values on the top region; as mentioned this was related to the GA settling into a local minimum. the subglide data has no such bias. in hindsight its obvious any training algorithm would seize on this (“non”) “feature”!
  • 😳 after going thru/ fixing all this (eg adjusting features to be scale invariant, train/ test over apparently harder subglide points), the code ends up emptyhanded, ie a null noise result. the code finds train improvements, but apparently its not merely overfitting, but in a sense “overfitting noise.” however, at least the test performance during training does “accurately” indicate no improvement. ie it is able to fit the training data, but only in the sense of basically entirely “memorizing” it, and not generalizing to unseen data in any meaningful way at all, ie ending up with only noise. its not quite as bad as GIGO (mentioned last month), thats too harsh, but it is apparently at this point NINO aka noise in, noise out.
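
as referenced in the subglide item above, heres a minimal sketch of the subglide “resampling” idea, assuming the usual glide definition (iterate until falling below the start) and a compressed collatz step; the feature/ selection logic of the actual code is not reproduced, only the enumeration plus an even sampling by length.

```ruby
# minimal sketch: every iterate of a glide seeds its own (sub)glide, and the
# resulting pool can then be sampled down evenly, eg one per length bucket.
def f(n)
  n.even? ? n / 2 : (3 * n + 1) / 2   # compressed collatz step (assumption)
end

def glide(n)
  # iterate while the sequence stays at or above its starting value
  seq = [n]
  seq << f(seq[-1]) while seq[-1] >= n && seq.size < 10_000
  seq
end

def subglides(n)
  glide(n).map { |m| glide(m) }
end

pool = subglides(27)
sample = pool.group_by(&:size).values.map(&:sample)   # even sample over lengths
```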

some of this is quantitatively confirmatory of an earlier idea, that statistics inside the core are different than outside, and that something that works inside doesnt work outside. what is a little tricky is that here the core is found via the hybrid algorithm and the initial iterates have the core selection property (density, entropy, low 1-lsb runs), but later iterates turn out to not have closely the same feature profile. for analysis purposes, it looks like the subsequent (post-initial) iterates are harder/ more noisy/ more representative of the “core” or at least the undifferentiated region.

in other words, these careful experiments are seemingly starting to show a difference between the core and the undifferentiated region. maybe this is not surprising because the core is somewhat thought of as selectable via basic criteria like density/ entropy/ 1-lsb ranges whereas the undifferentiated region is more random than that.

my last words here after this somewhat grueling, combat-like, yet nevertheless worthwhile exercise are that maybe a few tricks up the sleeve remain…

outline10.rb

(later) oh! it was kind of obvious! the code was refactored relatively quickly to use scaled features, but it was still predicting an unscaled value, namely ‘cg’. what about a scaled glide length parameter instead? that is called ‘hg’ in this code, the horizontal glide ratio ie cg / nw. the code finds a faint signal; weak but nevertheless detectable visually. in the following (postprocessing) graph built from gnuplot2.cmd as a scatterplot instead of a line plot, the left half is the hybrid generated initial iterates and the right half is the derived/ generated subglide iterates. ‘y’ red is the prediction and ‘hg’ green is the actual horizontal glide ratio.

  • in the right graph there is a weak linear signal detectable, esp noticing/ looking at/ focusing on the top range, but not much apparent in the bottom range. it appears a local average would improve, even quite strengthen the scattered signal, much like last months smooth3 graph #2 → #3.
  • in the left diagram the clustering of features associated with the hybrid algorithm is evident, as a kind of “bias”. the algorithm did not train at all on these points but then the feature clustering emerges, a significant finding, and have to think about it more. trying not to state the obvious, but it has something to do with how the nearest neighbor classification “solves” the problem. there is more spread in the lower predictions and more clustering in the higher ones.

there are some other graphs/ analysis included here in the code am pondering/ puzzling over but dont have immediate comment; possibly/ probably something deep (gnuplot4.cmd)…

there are 3 other (interrelated/ interconnected/ “intersectional”) tricks/ ideas/ possibilities up my sleeve immediately occurring to me thinking this over, all aimed at increasing nearest neighbor ML performance:

  • possibly improve the nearest neighbor classifier performance by focusing on selecting nearest matches instead of throwing out low performing classes.
  • some nearest neighbor algorithms adjust variable weights; never saw/ read how that was done and was scratching my head over it, then came up with a basic/ straightfwd gradient descent idea, eager to try it out (see the sketch after this list). the weight adjustment automatically helps identify major vs minor contributing variables.
  • somewhat aligned, the algorithm could try to focus on data that fits the model well and not worry about the rest, ie in a sense be allowed to customize the training set and yet, paradoxically, (hopefully!) somehow improving generalization!
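
as referenced in the 2nd item above, heres a minimal sketch of how per-variable weights could enter the nearest neighbor distance; the feature layout and names are illustrative assumptions, not the actual code.

```ruby
# weighted euclidean distance over a feature vector, plus the nearest class
# lookup that uses it; classes are assumed to be hashes with a :features key.
def wdist(a, b, w)
  Math.sqrt(a.zip(b, w).sum { |x, y, wi| wi * (x - y)**2 })
end

def nearest_class(point, classes, w)
  classes.min_by { |c| wdist(point, c[:features], w) }
end
```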

this is maybe a somewhat unique aspect or “luxury” associated with this particular problem in that maybe there is unusually high flexibility in defining and throwing away “outliers.” this may be acceptable if the distribution selected/ preferred by the ML algorithm still comes out “random” or “uniform” over iterates which based on this and past analysis seems likely. in a way it may be a mechanism by which in a sense the ML algorithm itself can discover coherent features (as combinations of existing ones).

❗ ⭐ 💡 the sophistication of the model and small slice/ wedge of result is energizing, even a bit exciting! some long-watched/ sought areas finally addressed and maybe some new vistas opening up! some mystery to explore!

outline10b.rb

(later) am not big on holding onto secrets around here. the immediate mystery alluded to/ exposed by the graph is that there is very high disorder in the features moving from low to high prediction values/ classes. there are no trends/ order discernible/ evident at all on 1st cursory look. on one hand this seems to attest to the sometimes extraordinary ability of ML to extract signal from seeming noise, and maybe directly refutes my idea that ML is not effective unless at least some signal is (humanly) discernible (or maybe its a confirmation, because this signal is very thin), but on the other hand this seems somewhat unusual/ unexpected and makes me somewhat nervous.

not sure if this is a sign, but the omnipresent caveat is the possibility the ML is finding very subtle “footprints,” “echoes,” or traces of the generation algorithm in trajectory iterates and not, as always fundamentally sought, more general/ generalizable properties of the problem. or, some kind of variant of overfitting/ memorization? the subglide generation strategy seems to be very good at mitigating/ insulating from this risk but its always lurking and needs further/ deeper attn. the possibility also remains there is some kind of further hidden order to uncover. (started this post with the title the plot thickens and then, thinking it sounded familiar, realized from a google search that an identical title was used here a few years ago.)

(12/4) started to think after all this that maybe the hybrid51f glides, while quite “curated”/ “effectively” noisy, and (paradoxically, almost tongue-in-cheek) “high(est) quality” in that sense, “simply” are just not noisy enough, mainly due to the similarity of initial iterates and that subtly “flowing” into aka “coloring” the later sequences and subsequent analysis. that word “simply” is questionable because it took so much accumulated finesse to arrive at them. however, more specifically, theres a seeming “huge” or glaring statistical difference in the feature trends of initial vs subsequent iterates.

but can something further be done? the initial brute force idea is to rerun the algorithm a bunch of times, thinking that maybe the local minima it arrives at are still random. but then to get n “independent” samples one has to rerun the algorithm n times, a severely infeasible idea at least without a cluster, cloud or supercomputer at ones disposal.

but then what else? my idea was to change the algorithm a little. it is easy to compute the “binary difference” between initial iterates and then attempt to maximize that along with the other criteria. it is not cheap to compute because its an n² comparison, but cutting the candidate pool size down to ¼ (250) makes it feasible/ manageable. also, it doesnt have to be recomputed every iteration because most of the 250 are unchanged, but that simple optimization is not added here out of laziness/ convenience/ implementation speed; am anticipating not needing to rerun this much after generating a saved set.
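
a minimal sketch of the “binary difference” criterion, under the assumption that it means normalized hamming distance between candidate iterates at a fixed bit width; the n² pairwise average over the pool is what makes it expensive.

```ruby
# normalized bit disagreement between two candidates at width w
def bd(a, b, w)
  (a ^ b).to_s(2).count('1') / w.to_f
end

# average pairwise binary distance over the candidate pool (the n^2 loop)
def avg_bd(pool, w)
  s = 0.0
  pool.combination(2) { |a, b| s += bd(a, b, w) }
  s / (pool.size * (pool.size - 1) / 2)
end
```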

also, noticed that the prior code calculating the linear difference ‘cg_d’ uses absolute values, and that it can be improved by looking at relative differences instead. in a sense this difference is measuring “oversampling or undersampling.” the code was modified/ streamlined to move some logic into a new subroutine auxvars that calculates “auxiliary variables”. another change is that a new auxiliary variable ‘cg_x’ “absnorm” calculates a ‘cg’ z-norm distance from average, and this replaces the prior code that set ‘cg_d’ at its endpoints to a maximum. then the final optimization is to maximize ‘cg_d’ linearity, ‘cg_x’ z-norm distance from average, and ‘bd’ (average) binary distance over all iterates.
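
a hypothetical sketch of the auxvars idea follows: per-candidate auxiliary variables over the pool, a relative (not absolute) ‘cg_d’ linearity difference and a ‘cg_x’ z-norm (“absnorm”) distance from the pool average. the field names echo the text but the exact formulas in hybrid52.rb may well differ.

```ruby
def auxvars(pool)
  cgs = pool.map { |x| x['cg'].to_f }
  avg = cgs.sum / cgs.size
  sd  = Math.sqrt(cgs.map { |c| (c - avg)**2 }.sum / cgs.size)
  lo, hi = cgs.min, cgs.max
  pool.sort_by { |x| x['cg'] }.each_with_index do |x, i|
    ideal = lo + (hi - lo) * i / (pool.size - 1.0)            # perfectly linear spread target
    x['cg_d'] = (x['cg'] - ideal).abs / [ideal.abs, 1.0].max  # relative over/ undersampling
    x['cg_x'] = sd > 0 ? (x['cg'] - avg).abs / sd : 0.0       # z-norm distance from average
  end
end
```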

this code is not fast, due mainly to the unstreamlined ‘bd’ calculation, but its acceptable, and rerunning it 5x gives 5 x 250 = 1250 “very random” 200 bit width samples with (almost) no signal evident in the initial bit patterns. at this point its kind of like a “multi-/ high capability iterate/ trajectory construction set.” this is one of the most sophisticated/ “multilayered” optimizations ever done around here but its all born/ invented out of apparent necessity so to speak… the 2 graphs below (ordered by ‘cg’ left-to-right) are for the 1st of the 5 runs, with the others looking about the same.

  • a quick glance at feature statistics shows that the initial and subsequent iterate distributions/ spreads are much more in line than with hybrid51f; this probably requires further careful analysis but essentially a major design intention is fulfilled.
  • 💡 ❗ its notable/ remarkable to realize that some major effort is spent on creating iterates with the high(est) amounts of noise while the other major effort is spent on trying to extract signal from that apparent noise.
  • this follows the “adversarial” pattern long noticed/ pointed to around here and seems to get to the heart of the problem, ie trying to find underlying/ fundamental dynamics that are independent of glide generation algorithms, a not-easy order, or more accurately running a gauntlet.
  • ❗ and another nagging realization seems to be dawning: maybe a lot or even majority of the iterate features detected are actually related to generation algorithms/ “bias”!
  • this is not unthinkable esp since the most basic feature(s) ie density, (lsb) 1-runs, entropy etc were initially detected (years ago now) from/ inspired by the bitwise algorithms.
  • ❓ ❗ 😮 👿 a key question of the proof direction/ “outline” hinges on whether all binary features are “artifacts” of generation algorithms; if that is the case, its a killer/ dealbreaker/ showstopper/ “downfall”

hybrid52.rb

(12/5) ❗ 😳 a surprising finding! the nearest neighbor code is adapted from 2 mos ago, 10/2020, and didnt notice/ uncover this rather glaring yet subtle aspect until just now. aka subtle until detected, then glaring. was looking at some of the optimization dynamics more closely and found that nearly ½ of the 250 classes are unused! ie they are not the “result” or match of any nearest neighbor. on 2nd thought this ought to be expected or even obvious, because matching ~250 points to their nearest neighbors is unlikely to be 1-1 and something akin to the old pigeonhole principle applies. watching it, the total # of used classes stays remarkably stable even throughout a long optimization of removing worst classes.
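
a minimal sketch of the “used classes” check, ie counting how many classes are actually the nearest match of at least one point; it reuses the illustrative wdist/ nearest_class helpers sketched earlier, not the actual code.

```ruby
# count distinct classes that end up as somebody's nearest neighbor
def used_classes(points, classes, w)
  points.map { |p| nearest_class(p[:features], classes, w) }.uniq.size
end
```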

but, how does this affect optimization quality? remarkably, it does not really seem to interfere with optimization. my initial thought is that maybe the algorithm will perform better if it better utilizes/ uses all the class mappings possible, ie the (total) classes are sort of “performance capacity.” but this thought is not entirely compatible with the optimization technique of removing worst-performing classes and improving performance…

💡 so, still thinking this over & have to work out the details. have worked with nearest neighbors algorithm for decades but not particularly extensively and have to build up some expertise/ intuition on it esp wrt this very challenging dataset… but, have many ideas to try out, which increases my enthusiasm/ energy levels…

(12/6) tried some code to replace unused classes looking for other matching ones and it drives down the unused class count to about ¼ total but cannot get it below that, and does not affect classification performance. also at this point am somewhat just trying to reverse engineer the nearest neighbor algorithm over this (now very noisy) data, without optimization, trying to understand the nature of the data/ any embedded/ exploitable signal.

the prior experiments showing very strong decrease in training error are probably a bit misleading, almost bordering on a mirage, with test error so minimally impacted. clearly need to find a way to somehow align train and test error performance much better, a very tall order. some of this involves just trying to understand initial test error without optimization and what signal is present/ absent. ie once again back to basics…

(12/7) 😳 this is a bit )( embarrassing to admit, but at least heres some new insight. its a n00b kind of error/ mistake/ oversight.

  • the average prediction is better than the nearest neighbor prediction!
  • my suspicion is that the nearest neighbor train improvement is due to throwing out “outlier classes” ie classes with far-from-average predictions which has the “simple” effect of focusing remaining class predictions closer to average.

this is the kind of basic sanity test that should be applied on day 1, hour 1, but it was an oversight/ omission here in the race/ rush/ hurry to add model complexity. as a consequence of this, heres an apparent picture of whats happening. the feature space is a kind of multidimensional “ball” or more accurately an ellipsoid. the points closest to center have the nearest neighbors, and those on the outside have farther neighbors. this trend can be found by sorting the nearest neighbors (matching classes) of all train points by ‘z’ distance and looking at the increasing spread of the features in that order. the optimization is (probably) slowly throwing out the “outlier classes” at the edges of the ball, with the effect of focusing remaining classes/ predictions closer to the average and thereby “improving” performance.

(later) started averaging the features again over multiple iterations to substantially improve (initial) model accuracy, but alas only to bring it nearly in line with the baseline of predicting the average. then looking into repeated iterations of the nearest neighbor algorithm without optimization leads to a basic realization, obvious in hindsight. the input data ‘hg’ spread affects the prediction accuracy linearly. this spread could vary quite a bit with the prior random sampling of about 250 (“test”) points out of 1038, with low spreads leading to higher calculated accuracy (ie lower average error magnitude).

  • some of this is related to the very low feature signal found in individual iterates and the prior outline code recognized/ addressed that by using sequence averages; now looks like its nearly a necessity.
  • also some of the moral is that merely improving model performance/ fit is (often, but) not (always) an indication that its functioning as expected.
  • in a sense the adversarial attack is working; whereas before signal was extracted, the last noisy generation code seems to outpace/ aka thwart the latest feature-extraction code and so now focus has to be put on the latter.

the spread dependence suggests calculating model error in terms of the input distribution spread. another option: a special sampling routine is constructed that minimizes the difference between, ie fixes the variance of, different sample sets. then the input point “spread” (ie over the prediction variable ‘hg’) is identical across different sample sets. even then, there is still substantial variance in the (initial) nearest neighbor fit. this seems to relate to (finally!) the inherent noise/ inaccuracy of the classification algorithm and whether the random samples are actually “representative” (wrt the feature similarities)… but, feel there is still yet some question here why the model generates seemingly high variance in predictions (or more specifically, in error) even for nearly identical input data… ofc probably the simple explanation is that there is very high noise in the feature analysis wrt the prediction variable, but still wondering… ❓

(later) ❗ 💡 😀 😎 ⭐ ❤ whew! finally some very nice/ solid results worth writing up. these are not earth shattering but theyre nice and solid and show that, after some major worrying and sweating, things are not so bad that the entire framework has to be thrown out. it runs on the new highly undifferentiated hybrid52 data. it simply shows that averaging over 10-50 samples in increments of 10 highly improves the performance of the nearest neighbor algorithm/ classifications. it also graphically shows the basic variability of the model performance, which appears to be relatively evenly distributed over the ranges; noneven distributions would not come out as linear/ would curve more. this code does 25 “nearly identical” runs per block/ batch and then for each block sorts the results by error magnitude.

for 10 samples almost all the predictions are inferior to merely “guessing” the average. for 20 samples the model has about a ~⅓ chance of outperforming the average guess. then model performance improves for higher sample count averaging, to the point that for 50 count, all model predictions are better than the average. this is a confirmation that individual iterate features are quite noisy but that the key goals are achieved:

  1. averaging decreases/ smoothes the noise out and
  2. this in turn improves model prediction accuracy.

however 50 samples seems high (¼ of the bit width 200) and the suspicion that substantial averaging may be necessary/ even critical is now confirmed, and again there is the noted suspicion that averaging will have to increase as iterate sizes increase. the graph also does seem to show diminishing returns on the averaging strategy. elsewhere, reflecting a lot of the detailed background work, this code calculates a lot of other apparently meaningful indicators of the internal model dynamics esp overall error and may comment on them later. there is some new code to save/ retrieve the subglide calculations (io routine) and the nice new sample3 routine that calculates nearly identical (linear) input sampling.
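
a minimal sketch of the sample-averaging idea: replace each iterates (noisy) feature vector by the average over the next k iterates of its trajectory before fitting. features_for is a hypothetical per-iterate feature extractor standing in for the actual feature code.

```ruby
# slide a window of k consecutive iterates down the trajectory and emit the
# per-feature mean of each window
def averaged_features(trajectory, k)
  trajectory.each_cons(k).map do |window|
    vecs = window.map { |n| features_for(n) }
    dim  = vecs.first.size
    (0...dim).map { |j| vecs.sum { |v| v[j] } / k.to_f }
  end
end
```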

outline13b.rb

(12/8) there is some verifiable/ provable signal here but its verging on razor thin, or at least “not much to work with.” this code looks at variable reweighting optimization and finds the effect exists but is nearly negligible. it rotates through the feature variables and evaluates whether a 15% increase or decrease improves fit, reweights if the fit is improved, and terminates if no changes result in improvement. the weights start out slightly randomized by about +/-5% so they dont overplot. the train error is in gray solid, right side scale, and test error in orange solid.
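
a minimal sketch of the reweighting loop just described: rotate through the feature variables, test a 15% increase or decrease of each weight, keep any change that lowers train error, and stop when a full pass makes no improvement. train_error is a hypothetical evaluation of the weighted nearest neighbor fit, not the actual code.

```ruby
def reweight(w, factor = 0.15)
  w = w.map { |x| x * (1.0 + (rand * 0.1 - 0.05)) }   # +/-5% jitter so curves dont overplot
  best = train_error(w)
  loop do
    improved = false
    w.each_index do |i|
      [1.0 + factor, 1.0 - factor].each do |m|
        trial = w.dup
        trial[i] *= m
        e = train_error(trial)
        if e < best
          w, best, improved = trial, e, true
        end
      end
    end
    break unless improved   # terminate when a full pass yields no improvement
  end
  w
end
```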

there is some very weak effect on test error suggesting its not random, but alas bottom line, very little is to be squeezed from variable reweighting, although theoretically at least it was worth trying/ “examining.” on the other hand, unless its merely a local optimum, it seems to indicate all the feature variables contribute significantly to the fit. the next step is to see how this interrelates to the class optimization. there is some hope/ possibility that maybe class optimization could combine favorably with variable reweighting.

note: some wrinkle here, the prior runs output the ‘outline2.txt’ file with averaged calculations, overwriting each time and stopping at 50 count. this next code works with the saved file if it exists and regenerates it if missing for default 20 count. the code was run with the undeleted 50 count file as can be guessed from the corresponding error trends.

⭐ have been thinking about this awhile and finally carried it out. its a semiawkward pattern that has shown up in a lot of prior code but never been directly addressed, and recently its been getting a little out of hand: eg in outline10b above, hacked together a bit quickly in the heat of the moment, the main code has (repeated) graph output logic. instead this code has some new stream subroutine logic where the prior out logic was refactored to support both batch and streaming graph output. the graph output logic has changed a lot over the years and its not easy to build something general/ reusable because there are copiously different types/ a multitude of them, but the current code has evolved to a reasonable/ usable state/ compromise.

outline14.rb

(12/9) this took quite awhile, to figure out the somewhat tricky logic. this code interleaves/ alternates class optimization with variable weight optimization. it has some advanced code for selecting initial classes, train, and test points. the class optimization also throws out/ replaces all unused classes along with the one with worst error. the initial classes are those, out of 500 linearly sampled, with the lowest average error on mapped (“member”) points. the train and test points are the 500 nearest points to the classes, split/ interleaved so that the two sets have nearly the same overall distances from classes. again working with the saved 50 count average file.

this strategy after long last and a lot of pondering about an approach finally gives a nice conceptual model about the internal optimization dynamics and good corresponding initial “nearness” between train and test error, gray and orange respectively. however despite a lot of alterations in classes and (a few in) variable weights by the algorithm there is not much improvement in either train or test error. am suspecting the simple explanation that prior algorithms were simply throwing out classes that are not (as) “near” to the train points in favor of those that are nearer whereas here the train points are selected/ designed to already start relatively near(est?) to classes.

while starting relatively low (the good news), alas, train error nearly flatlines and the test error looks a bit random. this combined info/ dynamic suggests the initial choices/ selections are not far from optimum and the algorithm cant improve error much. so not everything is about iterative optimization/ gradient descent. or another way of looking at it, here initial distance calculations/ sorts/ selections/ cutoffs seem to be a near standin for (iterated) gradient descent.

outline16.rb

(later) (liveblogging) 😳 argh, after work, one of my favorite/ most productive/ primetime windows/ intervals of the day, all ready to code up a(nother) storm/ have a great idea to try out, then accidentally deleted the hybrid52 “output” files and am now rerunning/ waiting for it to finish running, it takes over 90m or so and so much for not having to rerun it, lol! “like watching paint dry” lol!

(12/10) ❓ ❗ 💡 some extended reflection(s) partly reinforced with/ by behind-the-scenes adjustments/ observations/ tests… am really now focusing on trying to get train and test error to line up and its quite challenging, really a struggle. have added multiruns and am looking at trends. a remarkable finding did emerge. in the last few graphs, the test error tends to be higher than train error even at the very 1st iteration. is it just my imagination? multiruns confirmed the trend then started wondering. did it have something to do with the weight initialization that happens after the point sets selections? was even wondering about the “even” interleaving points being slightly closer than the “odd” interleaving points to the point of having an effect. all plausible, but it turned out not to be the case.

remarkably it turns out to be due to the optimization strategy of throwing out unused classes. this immediately has a measurable negative impact on the test error. the prior graphs are not capturing this initial operation because the 1st point of the graph is after the 1st optimization step, which throws out the unused classes. this was rather narrow yet striking in graphs and want to include those.

this took some thinking to conceptualize somehow; another way to look at this: the replacement of unused classes tends to increase “false positives” where the algorithm matches the new classes but on balance they have inferior performance to retaining yet ignoring those initially thrown out! even on describing it, somewhat counterintuitive/ surprising/ not easy to explain/ understand!

now, the unused classes seem to need some kind of attention, and some kind of strategy might improve performance based on adjusting them somehow, but clearly the current strategy of repeatedly throwing them all out (which was chosen somewhat ad hoc without much testing) is now revealed/ uncovered as definitely “not the way to go.”

❓ but then after looking at only throwing out the worst class instead of unused classes, the issue still remains of test error apparently typically diverging from train error even on the earliest iterations. this is quite counterintuitive. typically (in my experience) if an algorithm improves train error it will tend to improve test error also, at least for some # of iterations. that seems to be atypical/ even rarely the case here.

looking even closer there often even seems to be a countertrend from the beginning, even with this new finetuned set selection logic. whats going on? this is a particularly noticeable problem with the esp, even exceptionally hard undifferentiated/ noisy hybrid52 data now being focused on. ie its probably exposing some kind of flaw in the optimization strategy for extremely hard data… in short it is getting “fooled”

this made me think of/ review prior experiments. it seems that test error across all nearest neighbor experiments has tended to be mostly lackluster, even the earliest ones, except for some that had incorrect calculations; ie the early experiments from years ago 1/2017 ended with a note that the train and test error were not correctly separated and never followed up carefully. the revisit of this type of code from about 2 mos ago did not start out with looking at test error, ie it was added later, and the substantial decrease in train error was a real victory but seems to melt away on carefully looking at test error.

outline14, outline16 just run show virtually no test error improvement; its mostly a noisy/ bumpy flatline. the unused class “optimization” did turn out to be increasing the bumpiness/ spread of the test error, and taking it out makes it less noisy but doesnt improve the trend much. could it be the interplay of the worst class removal + variable optimization? am right now looking at worst class removal without variable optimization and there is again strong early divergence in test vs train trends.

last month the nearest neighbor optimization for induct was obviously, remarkably bad for test error. in the next inductc experiment, test error is very visibly unaffected by the optimization strategies. then outline9b, more similar to the current experiments, there is basically no test error improvement. this was mostly unremarked on at the time. reminds me of that great/ evocative american expression whistling past the graveyard.

was kind of just overjoyed to get good/ solid train optimization in the early experiment outline5 from 10/2020 and thought the hard part/ “heavy lifting” was mostly done. and then didnt add test measurement until a bit later with outline6. there is a very gradual downward trend in test there, so somewhat a validation, but again on the weak side. overall, test error has been highly unconvincing across many nearest neighbor experiments. and in a sense, is signal really being extracted if test error does not improve? in retrospect more accurately initial signal is being extracted but the optimization is often not succeeding.

some of this is excusable in the rush to utilize frameworks rather than (“endlessly”) tweak them, and the prior, generally less extremely challenging/ tricky data allowed it, aka “looking the other way,” but at this point the issue is rather glaring and unavoidable.

(12/11) some heavy detailed/ comprehensive study turns up some nice findings/ realizations. as they say in real estate, the 3 most important elements are location, location, and location. here it seems the bottom line is proximity, proximity, proximity. or in other words, proximity is affinity. it turns out at least wrt this data, the nearest neighbors algorithm here really is all about distance. in other words, feature distance really does highly correlate to feature (class) similarities wrt using nearest neighbor regression based on classes. some other insightful/ deep generalizations can be made. 1st some writeup of the results and then the observations.

there are 6 new distance arrangement strategies/ algorithms constructed. they are all similar in theory or logic but have substantially different effects. again the algorithms start out with 500 points linearly sampled by ‘hg’ from the total (~1K), finding the nearest points (pairs/ pairwise), and selecting the half, 250, with the lowest average error as classes. the optimization is by exchanging the single worst class + variable reweightings. (a minimal sketch of the 1st strategy follows the list below.)

nearclass
this finds 500 nearest points to classes and interleaves them into test, train odd/ even as in prior code.
nearclass2
this finds the 250 test points nearest to classes, then finds the remaining nearest 250 points as train.
nearclass3
like nearclass2 except in opposite order ie nearest train points to classes found 1st then test points.
nearclass123
find test points nearest to classes, then train points nearest to test points.
nearall
finds the 500 nearest points out of the total ~1K points and interleaves them into test, train. slowest of all algorithms because its looping over ~1K x 1K points ie about 1M point comparisons total!
nearall2
less thorough than nearall, linearly samples 500 of the ~1K points, and then orders them by nearness, and interleaves them into test, train.
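
as referenced above, heres a minimal sketch of the 1st (‘nearclass’) arrangement: take the points nearest to the class points and interleave them odd/ even into test and train so the two sets sit at nearly the same distances from classes. wdist is the illustrative weighted distance sketched earlier; the real outline16c.rb bookkeeping is more involved.

```ruby
def nearclass_split(pool, classes, w, count = 500)
  # rank pool points by distance to their nearest class, keep the closest 'count'
  ranked = pool.sort_by { |p| classes.map { |c| wdist(p[:features], c[:features], w) }.min }
  near = ranked.first(count)
  # interleave by rank so train and test have nearly equal distance profiles
  train, test = near.partition.with_index { |_, i| i.even? }
  [train, test]
end
```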

for comparison purposes the 6 graphs are plotted with the same y/ y2 ranges instead of autoscaled. there are some complex/ complicated interrelationships here, not easy to keep all in mind/ head at the same time, but they are very revealing of the underlying dynamics/ story. ‘et1’ train error gray, ‘et0’ test error red, ‘er1’ train error ratio green, ‘er0’ test error ratio blue, all right side scale. the big result is that test error (red, blue) acts much more coherently in the sense of being far less noisy. all the test and train error ratios are less than 1, indicating better than “average error” (the error corresponding to predicting the average, as compared above in outline13b).

in graphs #1, #3, #5, #6 there is improvement in train performance (gray, green) due to the variable reweighting and worst class removal. however, the flip side of this is that test error is mostly flatlined across all graphs except #3, where it conspicuously “decays” ie increases/ gets worse but also starts out as the best at 50% of average error. as one would expect, #3 also flips the (initial) train error (gray) to be better/ lower than the test error (red), compared to the closely similar strategy #2. the only graph where test error seems to improve along with train error is #6. the interleaving strategy used in graphs #1, #5, #6 is highly effective in lining up initial train, test error (gray, red) and also adds to the strong coherence of test error.

outline16c.rb

there are some remarkable generalizations from all this. probably, both somewhat surprisingly and not entirely surprisingly, and a little paradoxically, the best algorithm wrt test evaluation is #3, distance-based initialization without running any of the iterative optimizations! in other words, as guessed earlier, in this case the nearest neighbor calculations for positioning train/ test points near to class points serve as a near full/ complete standin for any (“worst” class removal) optimization! the optimization is “effective” but in the sense guessed; it can (apparently) gradually move all points closer to class points while discarding the farthest ones, and thereby increase model accuracy, but as seen this can all be short-circuited/ bypassed by train/ test point selection, and with some setups eg #3 there is nothing further left to “optimize,” and optimization attempts can only overfit/ decrease (test) fit. this is pivotal, and wondering: is there anything in the nearest neighbor literature that talks about anything like this? ❓

the other major realization is that this nearest neighbor/ optimization performance seems to be constrained/ limited/ driven by the abundance vs “scarcity” of points that are “near” in all senses. the class points are near to each other and the train, test points perform better the nearer they are to classes, and improvement only seems to come from choosing/ exchanging points closer to classes, although need to look at that more specifically to see whats happening. which makes me wonder, is there something to be gained by picking class points that are not near to each other but are somehow near to other points? how would that look/ be different than current strategy? ❓

but anyway, as guessed earlier this all suggests another/ new type of optimization that involves selecting/ attempting to generate as many “customized” points as possible in the sense of minimal distances. and the 1K starting pool size here is now seen as relatively limited. a bottom line is that distance between points and classes is a main indicator of model fit.

another observation, nearest neighbors is technically typically called K nearest neighbors and this code is focused on the K=1 case. maybe some of this analysis would significantly shift for K>1 eg the combinations of classes could play a bigger role and maybe iterative optimization will work better.

elsewhere the impact of variable reweighting needs some further analysis. if the variable weights are meaningful it would seem that similar weights would be found across different optimizations, and also variable weights would not tend to “jitter” in the sense of going both up and down ie being adjusted in both directions. another excellent idea is to start variable weights at the feature z-norm values, ie the corresponding standard deviations for each.

another important check, not fully included here, is to look at statistics of the prediction variable. the key error average is evaluated here but, and this is basic in hindsight, what about the range/ standard deviation? is it affected by the set distance selections etc? ofc as noticed earlier, anything that tightens the range of the prediction variable will (somewhat deceptively) “improve” model accuracy. ❓

next, since there was sizeable work/ tricky logic on this, while its not esp central any more with all the other directions/ new understanding, heres a followup careful quantitative study of the effect on test error of removing the unused classes in addition to the worst class, as mentioned/ reported earlier (yesterday 12/10). using only the (1st) ‘nearclass’ strategy and reinitializing the class, train, test sets each time, the blue line is the difference in test error after removing the worst class, green the worst class plus half the “farthest” unused classes, and red the worst class plus all unused classes. the half of the unused classes are chosen as those farthest from the test classes, and this impacts error consistently less than including them all, showing again the key relevance/ significance of distance(s).

😳 … oops…

  • alas, sometimes after sweating over details for hours, feel missed something basic immediately on posting a graph. kind of like deja vu but different, mea culpa vu?… here had that feeling: itd be nice to have a comparison with throwing out the nearest unused classes as another baseline to compare with.
  • another key indicator, not pictured, the magnitude of the train/ test errors bounce around quite substantially between runs, but the differences after the optimization adjustments/ modifications as plotted here are much more stable, a nice/ key illustration of the algorithm dynamics.
  • another comparison worth looking at, nearness of the unused classes to each other/ used classes.

outline16d.rb

(later) 😳 oops!

  • looking closer, that last strategy nearall2 does not look right in the code. it does not remove the test/ train points from the “remaining pool” like the others, and think that the worst class removal/ exchange in the optimization could add them back, leading to duplicates and the overall analysis behaving differently/ not as intended/ inaccurately/ strangely. and this was the only strategy with test improvement; is it basically due to a defect?
  • also, somewhat unexpectedly but maybe not surprisingly, the averages/ standard deviations of all the sets are not uniform. this will take awhile to understand/ identify/ isolate but basically the remaining pool after class assignment/ extraction typically has lower ‘hg’ average/ standard deviation. the 3 strategies with interleaving have fairly close statistics in train, test sets. without interleaving the 1st points extracted after the class points have higher averages in nearclass2 and nearclass3 strategies. somehow averages and standard deviations in the prediction variable ‘hg’ are related to distances/ proximity. again this presumably relates to the relative scarcity of proximal points and selecting/ removing them apparently affects/ shifts the remainder points statistics.

(12/12) ❗ ⭐ 💡 😀 😎 ❤ there is so much to work/ followup on, some years-old threads/ themes are coming to fruition, and its like “kid in a candy store” lately. thinking about the deepmind protein folding breakthru, wondering wheres my world class datascience team? AWOL!

endless time can be spent tweaking/ optimizing models, but what about the bottom line? all this is in line with building the induction function construction. so how does it look? answer: it looks great, fabulous! this code uses the model for prediction of (mostly) unseen data. all this needs a more careful look but the consecutive iterates after the initial ones will also tend to have different bit widths than the model classes.

  • in graph #1, 500 points are “linearly jitter sampled” as built earlier, via the sample3 routine. then the prediction ‘y’ green is plotted vs nearest class distance ‘z’ blue, and actual ‘hg’ red. this is a striking, almost breathtaking graph. the predictions have strong signal. not noticed until this graph, the model is finding higher ‘hg’ points to be closer to the classes, ie apparently more signal there. this is the opposite of what was expected; it seems to me that this is directly indicating the model may be more accurate on long glides than short ones, maybe highly or even fundamentally relating to the prior distance/ proximity dynamics observations, and it needs some major further analysis asap!
  • in graph #2 theres some sophisticated logic to do iterative prediction using the model. this calculation is tricky and was done years ago for MDE and maybe some other nearest neighbor models; it would be useful/ helpful/ informative to do a survey. but basically subsequent points of trajectories can be evaluated with the model and averaged to get better predictions. the model was not applied for low ‘hg’ and those points have to be thrown out. here the model is iterated over 10 subsequent points and, after the low ‘hg’ “outliers” are removed, only 1/10 of points (~55) remain, unexpectedly low. the graph shows the increase in accuracy with increase in samples/ average smoothing. the hotter colors are initial and cooler colors are later (higher sample count).

there is some extraordinary effect here. the predictions do get more accurate as seen in the move of some spikes toward more linear/ accurate predictions. but other spikes remain almost completely unchanged. this is about the even vs uneven distribution of the model and shows the averaging technique has significant limitation. but also there is some immediate mystery, how can averaging be so stable for some points and not for others? it must be coming down to variation in how classes are mapped. in short, as usual, “more analysis/ tuning nec/ assembly required.”

nevertheless! … the work on bringing out “any signal at all” is substantially different than “sharpening an existing signal;” the latter being much more feasible/ viable/ systematic. overall this completes “full circle” a somewhat broad process that is crucial to the proof structure as outlined, and there/ here is evidence that it is conceivably viable at least in theory.

this conceptual framework is basically successful in some measure, and deserves some new naming in celebration. maybe this has been alluded to and has been somewhat imagined previously, but at long last it is now more manifest. it seems to be a process of defractalizing the inherently fractal data via data science/ ML techniques.

it attempts to build/ maximize a nearly linear signal out of nonlinear/ extremely “noisy” data— the most highly undifferentiated that can be constructed so far— and doesnt fall down. in short, it extracts usable signal! ie it is apparently finding/ extracting (binary) features intrinsic to the collatz function— exploitable for proof purposes— and not merely artifacts of trajectory generation algorithms!

outline17.rb

(12/14) again a lot of directions to go in at this point, but a basic sanity check is to go back to the seed database (db5.txt) that hasnt been revisited in quite awhile and look at associated predictions. this took some time but its helpful. in years past this has been enough to kill various models… there is some new code to do a more generic buffer logic for arbitrary data here used to generate/ save and/ or load the relatively timeconsuming class selection nearest neighbor calculation (outline3.txt file).

the following output indicates only 219 of the 800 trajectories have ‘cg’ glides longer than the minimum 55 count for the algorithm to run; 50 are needed for the average feature statistics, and in this case smoothing was done on 5 points instead of 10, otherwise the sample was too small. this leaves 6/8 glide methods. then 81 of those points and 5/8 glide methods have non-lower-limiting ‘hg’ values. the seed database tends to have a lot of “smaller” iterates so finally a 50-bit width filter is applied, leaving only 41 samples and 3/8 glide methods. then these are graphed.

the results are “not terrible” (again it “doesnt fall down”) but (full disclosure) are not very good either; they are clearly very coarse/ even primitive. putting the best spin on it, the “further work is clearly cut out” here, although on other hand some significant limitation of this database is now apparent in hindsight; in more than 1 sense its been “outgrown.” the large spikes are not averaged out at all. its a very rough nearly “trimodal” model of low, medium, high.

my initial suspicion is that this lackluster model performance is largely due to relatively low iterate sizes, and in a way the features not having very many bits to get good “resolution” (again recalling the microscope analogy). the green line is actual ‘hg’ and a lot of the samples, about the entire right half, are outside the range of the model which tended to max out more at ~3 as in prior experiment graphs. the high sensitivity of the model to iterate sizes makes me wonder if even the previously considered “moderate-to-larger size” 200 width trajectories are (significantly?) limiting the accuracy of the model. ofc concepts about “size” are all relative on this problem and partly anthropomorphic thinking.

❓ however, in short, even after zillions of “samples” now generated using a vast arsenal of highly tuned/ polished tactics/ strategies/ techniques etc, there still remains the not-fully-solved problem of “determining/ generating representative samples.” … but on the other hand, dont see fundamental/ inherent obstacles/ showstoppers right now; it seems not intractable, ie within reach

outline18.rb

["read", "outline3.txt", 250]
[#<Proc:0x33f71b8@C:/Users/xxx/Desktop/xxx/xxx/outline18.rb:551 (lambda)>, "outline3.txt", 250]
["read", "db5.txt", 800]
{"l2"=>219, "k"=>["w", "cg", "cm", "m", "r", "c"], "c3"=>55}
{"l3"=>81, "k"=>["w", "cg", "r", "c", "cm"]}
{"l3"=>41, "k"=>["w", "cg", "r"], "c4"=>50}

(later) 💡 ❗ 😮 😎 ⭐ ❤ holy @#%*! looked at feature trends and found some linear signal among several of them, making me wonder if a linear model might work somehow. it looks like the last linear code is ~2½ years old. took it off the shelf, dusted it off, and much to my surprise, shock and utter/ intense glee (a rare psychological combination here signalling a paradigm shift? …), it works fabulously, after some fancy footwork + furious copy-pasting coming in at an extraordinary only ~½ hr, not just running, but finding signal! reusability is so delightful…

the 1st graph is over the ~1K linearly sampled points out of the subglides and the 2nd graph is the 250 best classes selected (calculated/ generated/ saved to disk from the last code). this code centers the feature variables ie subtracts averages. it finds a high 0.67 correlation coefficient in the 1st case and 0.76 in the 2nd. so one of the basic rules of data science was broken here: try to fit a linear model 1st! actually though, not exactly; initially the linear model failed due to linear dependency of the features, and 5 of them which were “random/ too collinear” had to be manually removed, and identifying them was mostly just running on a hunch (ok, yes, it wasnt oracle magic; also prior familiarity/ graphical trends).
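
a minimal sketch of the centered linear least squares idea: center the feature columns and the target, solve the normal equations, and report the pearson correlation between prediction and actual. the real outline19.rb feature set and its handling of collinear columns are not reproduced here.

```ruby
require 'matrix'

# 'data' is an array of feature rows, 'y' the matching 'hg' values
def linfit(data, y)
  n, k = data.size, data.first.size
  means = (0...k).map { |j| data.sum { |r| r[j] } / n.to_f }
  ym = y.sum / n.to_f
  x  = Matrix[*data.map { |r| r.each_with_index.map { |v, j| v - means[j] } }]
  yc = Vector[*y.map { |v| v - ym }]
  b  = (x.transpose * x).inverse * x.transpose * yc   # fails if features are linearly dependent
  pc = x * b                                          # centered predictions
  corr = pc.inner_product(yc) / Math.sqrt(pc.inner_product(pc) * yc.inner_product(yc))
  [b, pc.map { |v| v + ym }.to_a, corr]               # coefficients, predictions, correlation
end
```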

also, this seems to again resuscitate the idea mentioned that nonlinear models will often not succeed unless some basic linear signal is present and the models typically improve on/ “squeeze out more of” the linear signal.

and, esp looking at the 1st graph, clearly some “straightfwd” nonlinear sigmoid-like adjustment/ “flattening” of the prediction variable ‘hg’ on the ends toward the center/ mean will significantly further improve the fit. this fit is clearly already significantly better than the nearest neighbor model and it doesnt even use model averaging yet…! dont have a full explanation for this yet, but a rough idea is that there is redundant signal spread through or “embedded in” the multiple feature variables that the linear regression can extract but the nearest neighbor algorithm sees as noise.

outline19.rb

(12/15) at this point my thinking/ attn is returning to an earlier idea that was half pursued. on 10/2020 2 rough initial stabs at the outline idea were laid out, outline, outline2b. these are combinations of the nearest neighbor and the hybrid algorithm. my thinking is as follows. there seem to be various models to be found in various regions of the data, but now there seems not to be “1 model.” here maybe more than 1 is better than none. but a proof would be better with “1 model.” what is going on?

then last month 11/2020 mentioned the “multidimensional blob.” have some new ideas on that related to recent code. it appears again the objective is to traverse the feature space as a multidimensional blob. one needs to take representative samples on its interior and surface. what are “representative”? an immediate idea is that the blob has a kind of density associated with it, and a prediction variable, say glide length. for some feature regions, close features lead to significantly different/ varying predictions. these require “more” (local/ nearby) samples to map out, ie a denser sampling per space. other regions are “more predictable” with less variance in the features and associated predictions. the hybrid algorithm can be and has been used to do this kind of variable sampling.

this all reminds me of a complex emerging topic in cutting edge ML called novelty detection, have covered that in other AGI posts… think novelty will play a big role in future ML… its a bit extraordinary for it to come up here, but also not surprising when one is trying to build cutting edge 21st century mathematical “machinery or technology”

this scheme would appear to allow “1 model” that has various regions of different densities. however, its very challenging to implement. the alternative am thinking about is a 2-model idea, basically along long lines considered: an exterior vs interior core boundary. the last hybrid52 generator seems to be doing well in focusing on the core and yet, much to my amazement, as just demonstrated a solid linear model can be found there (the “interior”) also.

it seems likely to me there is some kind of single model underlying the overall dynamics but right now its like the ancient (indian) proverb of the blind men and the elephant. the different experiments are each like a single blind man with a different feel at a different spot. and it does not seem at moment there is a “typical” or comprehensive/ overarching ML framework to capture it.

(later) however, a linear model even while not sophisticated has quite a lot going for it. the linear model is very fast/ easy to compute. it can be adjusted to support nonlinear aspects. there are ways to iteratively improve it in nonlinear ways; this novel technique was employed a few yrs ago now, and am thinking now of reusing it.

on further analysis the above graph #1 has an optical or perceptual illusion associated with it related to the sigmoid adjustment/ improvement idea mentioned. the green line is the model prediction and it looks like it has a particular mild slope that is more gradual than the actual test points in red. so it might seem that simply “increasing its slope” would increase the fit. but here it would seem to show a difference between “slope” and “shear.”

some of this illusion is based on the graph being sorted by the actual points rather than by some kind of linear mapping of the feature space; in other words, the graph represents a highly nonlinear mapping/ rearrangement of the feature space. increasing the slope of the model can only mean multiplying by some constant. but that will “spread” the model points around a center (horizontal) axis, thereby increasing the error; on closer look the points are already spread over the center axis. on the other hand, “sigmoidally” shearing all the points differently depending on their distance/ direction from the center of the model (higher points up, lower points down) would improve the fit. but what does that look like algorithmically? stuff like this has been done before but its also kind of novel.

💡 re shearing/ “remapping,” something like that was done/ demonstrated/ carried out last month in the smooth3 and backtrack32 ideas/ operations. and there is a basic concept behind those experiments, sketched out/ alluded to at the time, that deserves to be called out further, and have an idea how to reutilize it again here. it is not hard to prove the following; proof omitted for now, although that code is essentially the proof idea already encoded in an algorithm. call this monotone function remapping:

any monotone function can be mapped onto any other monotone function using a typically nonlinear shearing operation.

here “typically nonlinear” means that of all cases, only a few would be handled by a linear shearing, though some could be. ie in a/ some sense “most” arbitrary monotone functions (including the shears themselves) are nonlinear.
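heres a minimal sketch of the remapping claim (this isnt the smooth3/ backtrack32 code, just an illustration): for two monotone sequences f, g sampled on the same index range, the shear is just g composed with the inverse of f, computed here by lookup/ interpolation, and it reproduces g exactly at the sample points.

```ruby
# sketch of monotone function remapping (not the smooth3/ backtrack32 code):
# for monotone samples f, g on the same index range, the shear s = g . f^-1
# (here a piecewise-linear lookup) maps f onto g, and is itself monotone
f = (0..20).map { |x| (x**2).to_f }            # one monotone sample
g = (0..20).map { |x| Math.sqrt(x) * 5 }       # another monotone sample

def shear(v, f, g)
  i = f.index { |fv| fv >= v } || f.size - 1
  return g[i] if i.zero? || f[i] == v
  t = (v - f[i - 1]) / (f[i] - f[i - 1])       # interpolate between samples
  g[i - 1] + t * (g[i] - g[i - 1])
end

# applying the shear to f reproduces g exactly at the sample points
p f.map { |v| shear(v, f, g) }.zip(g).all? { |a, b| (a - b).abs < 1e-9 }  # => true
```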

there is a very powerful transformation inherent in this concept already outlined and likely to be further utilized into the future. maybe this has been noticed/ called out somewhere in the literature, it would be fascinating to try to locate it elsewhere, maybe there are even multiple occurrences.

and increasingly with references to shearing (which is a sort of nonuniform stretching) and blobs, there are more topological concepts showing up these days. think it is a good sign. the topology is related to dynamical systems/ differential equations and this attack is now increasingly aligned/ moving toward converting the collatz DDE (discrete differential equation) into a continuous one, aka a dynamical system.

💡 other new thinking/ direction: the ‘hg’ glide (horizontal) ratio prediction variable has served very well for weeks now but maybe again something else is now called for and the time is ripe for yet another pivot. the issue is that recent experiments that attempt to do model averaging come up with very few points; very many of the sequential/ consecutive points have to be omitted because they are not part of longer glides, and this also may be related to the model averaging not smoothing out much in places.

how to picture this? it appears to be related to the following. the anthropomorphic picture of glides is an upside and downside, a climb and fall. but, endlessly fractal like, almost like an ancient zen proverb, the climb is full of falls and the fall is full of climbs. in other words the climb typically has many short/ local falls. each of these has to be omitted from the model predictions because the short falls are like distracting/ distorting noise/ outliers in the overall climb estimate. suspect this has led to good/ high signal predictions so far but maybe it is reaching its limit on usability.

whats the answer to this? again maybe need to come back to estimating trajectory lengths (horizontal ratios). from some quick experiments, (thankfully although maybe not surprisingly) those work with the linear model also, but seem to have somewhat less signal. it may be an acceptable tradeoff. esp if it leads to more accurate model averaging by having more points.

(12/16) ⭐ ❗ ❗ ❗ 😮 😀 😎 ❤ 🙄 holy @#%&!!! these are some remarkable, extraordinary, even breakthru results. earlier quite a bit of time was spent on worst-removal optimization for nearest neighbors and it was found to be a specious operation at best if other optimization was applied. but some of the idea stems from a few years-old experiments doing a kind of nonlinear optimization on linear fitting. the idea is to (iteratively) toss out outlier points and improve the fit possibly at some expense of model generalization, and these were effective in the past. this was on my mind recently along with the suggestion at the beginning of the month; this quote now deserves highlighting based on its nearly radical confirmation:

… the algorithm could try to focus on data that fits the model well and not worry about the rest, ie in a sense be allowed to customize the training set and yet, paradoxically, (hopefully!) somehow improving generalization!

in retrospect, prescient. have just applied this to the linear model instead of the nearest neighbor algorithm and the performance is outstanding, stellar! breathtaking!

the code is very simple. the same ~1K sample points are reused. these came from the ‘hg’ linear sample, but thats ok, the new ‘hc’ horizontal glide ratio (total trajectory iterations divided by initial bit width) is computed. the optimization algorithm targets finding the best 250 linear fitting points. almost astonishingly straightfwd!

it simply finds the worst fitting point in the last linear fit (largest error), and discards it/ replaces it with another one from the pool. then, if the new one improves the linear fit measured by the correlation coefficient, the point is retained and another worst-fitting point is selected next (after a refit with new point/ recalculation of errors). in other words, the model fit/ calculated variable weights “drift” gradually as different training points are added/ subtracted and the overall training set changes composition/ distribution, ie selection, even technically “bias.” although, “gradual drift” may be a projection at this point, need to look at how weights are actually changing, are there any discontinuities?
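heres a minimal sketch of the loop just described (not the actual outline21 code; the data layout, z-scoring, and other details are hypothetical/ omitted), using ruby stdlib matrix for the least squares fit:

```ruby
require 'matrix'

# least-squares fit with intercept: returns weights w so that yhat = X * w
def lsq(xs, ys)
  xm = Matrix.rows(xs.map { |r| r + [1.0] })   # append intercept column
  yv = Vector.elements(ys)
  ((xm.transpose * xm).inverse * xm.transpose * yv).to_a
end

def predict(x, w)
  (x + [1.0]).zip(w).sum { |a, b| a * b }
end

# pearson correlation between actual and predicted values
def corr(ys, yh)
  my, mh = ys.sum / ys.size.to_f, yh.sum / yh.size.to_f
  num = ys.zip(yh).sum { |a, b| (a - my) * (b - mh) }
  num / Math.sqrt(ys.sum { |a| (a - my)**2 } * yh.sum { |b| (b - mh)**2 })
end

# keep `keep` points: repeatedly drop the worst-fitting selected point, pull in
# a random candidate from the pool, and keep the swap only if correlation improves
def optimize(pool, keep = 250, iters = 1500)
  sel  = pool.first(keep)
  rest = pool.drop(keep)                 # assumes pool.size > keep
  w = lsq(sel.map { |p| p[:f] }, sel.map { |p| p[:y] })
  best = corr(sel.map { |p| p[:y] }, sel.map { |p| predict(p[:f], w) })
  iters.times do
    errs  = sel.map { |p| (predict(p[:f], w) - p[:y]).abs }
    worst = sel.delete_at(errs.each_with_index.max[1])
    cand  = rest.delete_at(rand(rest.size))
    sel << cand
    w2 = lsq(sel.map { |p| p[:f] }, sel.map { |p| p[:y] })
    c  = corr(sel.map { |p| p[:y] }, sel.map { |p| predict(p[:f], w2) })
    if c > best
      best, w = c, w2
      rest << worst                      # swap accepted
    else
      sel.pop                            # swap rejected: revert
      sel << worst
      rest << cand
    end
  end
  [sel, w, best]
end

# synthetic demo: 1K points, 3 noisy features, y roughly linear in them
pool = Array.new(1000) do
  f = Array.new(3) { rand }
  { f: f, y: f.sum + (rand - 0.5) * 0.5 }
end
sel, w, r = optimize(pool, 250, 500)
puts "correlation over selected points: #{r.round(3)}"
```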

amazingly, despite being very selective aka “putting all its eggs in one basket” (a single worst point) this algorithm has not been caught in a local minimum yet after several reruns and consistently hits the same very high optimum. it optimizes the correlation coefficient up to a breathtaking 0.983 in this run.

1st graph shows the optimization metrics over ~1.5K iterations, blue is worst error, green is best correlation coefficient, red is current coefficient. the fit is shown in 2nd graph, as expected the selected points are nearly linear fitting, but a sizeable nonlinearity at the bottom range is handled by the model. so the algorithm in a sense does exactly the (non) linear mapping previously outlined/ desired, applying a kind of sigmoidal-like adjustment of the data/ curves on prior/ initial fits.

3rd graph is from a similar 2nd run after the 1st/ 2nd graphs from same run. the model is recalculated on the “other points” from the pool outside the model and a difference (blue bars, right side scale, roughly same as left side scale) is computed with the initial (linear) fit but unoptimized model. the idea is to look at the “expense” of the model tuning/ selection bias on initial generalization. it turns out to be mostly randomly distributed and minimal at the same time! this is almost shocking! ie it still makes “roughly the same predictions” on original data even after the optimization/ selection bias. so in short the model finds extraordinary improvement/ near ideal fit based on “sampling selection or bias” at very low expense of generalization! it seems to defy the laws of data science…!

how to summarize all these near-magical outcomes? its almost like the opposite of overfitting. in short its superb, maybe world class, paradigm-shifting model fitting + generalization. its the ML doing a large part of the hard work/ heavy lifting in finding/ even “discovering” actual properties of the problem that “determine” or even “control” trajectory length.

in a sense the algorithm learns how to train itself based on data selection/ “focus shifting” its like that old expression of a self-licking ice cream cone… it seems to be a striking, definitive confirmation + even vindication of the recent outline plan…

the collatz phenomenon of long glides embedded among the mass of short ones has been long remarked on here; its one of the earliest “phenomena” noticed/ isolated; its the recurring needle in a haystack aspect of this problem. but here the ML seems to be functioning in a sort of parallel/ inverse/ reverse way: it is finding needles in the haystack. the haystack is all the trajectories (~1K sampled here), and it finds 250 that can be very tightly modelled with a relatively simple linear model on various “not-exotic” bit-oriented features. the algorithm is basically (1) selecting the data that has signal in the (almost arbitrary, in the [non?]anthropomorphic sense!) human-chosen features and (2) also very closely fits the linear model!

lots of years have gone into finding representative features. this slick optimization finds representative trajectories, or trajectories that best represent the features. in short, finding signal in the noise, order in the fractals…

there have been so many years of “head against brick wall,” in contrast this combination is eyepopping, and is making my mind reel…it feels like light at the end of the tunnel… or the “trajectory”… 😀

❓ oh, but the motto around here is “always room for improvement”— or optimization, and “some question usually remains.” this algorithm is not particularly computationally expensive, but one wonders if the iteration is totally necessary and if this can be done with even fewer moving parts (not unlike the experience just playing out wrt nearest neighbors). maybe just selecting closest-fitting points after an initial fit would be sufficient? it needs to be investigated… how much is this different than just “discarding outliers”? that question could be addressed by looking at eg which points are discarded wrt the initial fit/ ordering.

❓ feeling some deja vu on all this, know its been extensively used years ago, but trying to remember if this “linear-toss-refit” technique got similar very tight fit on points and if it was graphed, need to look this up again/ survey.

outline21.rb

(12/17) so, is the algorithm doing something really novel here, or is it more straightfwd? as just experienced with the nearest neighbor code, sometimes these kinds of questions really take a lot of careful analysis/ focus to answer. it is not hard to add some additional graphs that reveal more of the picture; this could easily be part of the last version if there wasnt such a mad dash to publish results, lol. this is a 2nd run.

  • 1st graph is internal weights, post z-norm applied. an interesting question is if the weight vector is just changing magnitude and/ or direction. from this it looks like some of both. also it is mostly gradual changes but some jumpiness/ jumpy spots. key observation previously missed: only 3 variables are contributing most of the weight, the average bit run lengths a0, a1, a01. raising an immediate big question, how much fit can be retained throwing out the other variables?
  • 2nd graph is the ‘hc’ values selected in the final model in red, sorted by all ‘hc’ values. the model does indeed pick near-center values, but the density is somewhat irregular/ scattered. in other words its not as simple/ even trivial as throwing out outliers/ picking the nearest-to-center values.
  • 3rd graph is the initial fit (‘hc0’) with the selected points in final model in red. again, near-center ‘hc’ values are chosen but again in a scattered (but internally very ordered) way, ie not “merely” throwing out farthest-from center points, although thats clearly part of it.
  • 4th graph is difference (error) in initial fit vs actual ‘hc’ again ordered by actual ‘hc’, again scattered, ie again its not “merely” throwing away outliers/ highest error points although thats clearly part of it. this shows the difference between a line and a sigmoid (graph #3 red, green from last time) is roughly another sigmoid. its also helpful to plot errors as impulses, 4th graph.

❓ however there is one key “gotcha” possibly revealed in these graphs. a lot of it will come down to the performance of the model-averaged case. in a sense, these graphs show the model is not surprisingly “throwing out hardest” points ie those with the longest trajectories, although it throws out short ones also. this doesnt necessarily mean it will never predict those values however for other data or with model averaging. it has to do with whether model averaging actually increases accuracy, which it typically does, but the key question is over the extreme points. as the expression goes, at this point, “it could go either way”…

again it relates to model generalization. does the model completely fall down on points that it excludes (ie “behave” nearly random), or does it retain some kind of generalization over them (ie better than random guessing), in which case this technique is potentially quite substantial?

💡 this code uses some notable/ neat concise few-line techniques/ tricks where the hashmap and list structures are very useful for representing/ adjusting/ modifying/ revising/ “juggling” points, and on-the-fly list concatenation etc, with lists often behaving very much like sets and hashmaps like simple (ordered + mutable!) objects with associated properties, showing some of the power of ruby harnessed on data science. much of this is replicable in python but my feeling is that ruby has some inherent, unique elegance even verging on beauty at times 😀

outline21b.rb

(later) this straightfwd code just throws out all the worst outlier points wrt error in fit, predicted vs actual. it turns out to be very close to the prior model ie correlation coefficient 0.961 using a tiny fraction of the computation: two linear fits (linear inverse calculations) ie initial/ final, versus over a thousand. here are the graph #3 and #5 comparisons with the prior run, except using the final fit instead of the initial fit, and the large nondifference (ie near equivalence) in graphs #3 is quite apparent. #5 does show lower absolute error than last graph #5. the omitted graphs are not substantially different.
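for comparison, a sketch of this non-iterative variant (it reuses the lsq/ predict/ corr helpers and the hypothetical point layout from the earlier sketch): one initial fit, keep only the best-fitting 250 points, refit once.

```ruby
# non-iterative variant: one initial fit, keep the `keep` best-fitting points,
# then a single refit (depends on the lsq/ predict/ corr helpers sketched above)
def best_fit_subset(pool, keep = 250)
  w0  = lsq(pool.map { |p| p[:f] }, pool.map { |p| p[:y] })           # initial fit
  sel = pool.sort_by { |p| (predict(p[:f], w0) - p[:y]).abs }.first(keep)
  w1  = lsq(sel.map { |p| p[:f] }, sel.map { |p| p[:y] })             # final fit
  [sel, w1, corr(sel.map { |p| p[:y] }, sel.map { |p| predict(p[:f], w1) })]
end
```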

so much of the mystery is removed; overall the algorithm appears not much different than removing all worst outliers or equivalently retaining all best fitting points; on the other hand the two algorithms have measurably different effect/ performance on error. a key question is how/ if it will affect model averaging. this version of graphs on final fit instead of initial fit shows that the optimized model doesnt fall down at all on the excluded points, ie it generalizes. this will favorably play into model averaging. on other hand maybe all the computation/ iteration of prior algorithm leads to some (more) desirable property.

❓ but, maybe still something unexplained or mysterious; that the point selection seems to have almost zero effect on this generalizing is a bit remarkable, even surprising or almost astonishing… wondering, is it maybe something to do with the fractal nature of the data, such that almost any subsample is self-similar to the larger sample…?

😳 2nd look, in 2nd graph there seems to be some overplotting effect distorting (visual) results, found because the line plot shows a significantly different coloring range than the impulse plot. need to figure out a fix.

outline22.rb

(12/18)(lol) 2nd thought, after a good nights rest, the solution is simple, just plot the error after plotting the other lines, so the overplotting is not random but as desired. but it didnt occur to me late at night writing that last line! another simple idea, scatterplots of the same data work better.

💡 now on to some advanced nonlinear data science. various tricks/ transformations can be used to use linear regression on nonlinear scenarios, or with nonlinear end functions fit. these are relatively straightfwd in theory but wonder if this kind of stuff is much in the literature, its probably somewhat exotic, would certainly love to find/ hear of anything similar/ related being used by others.

this next code occurred to me staring at 1st graph from last time. all this is running great except some glaring fit issues exposed in the new more highly informative/ revealing graphs/ analysis. basically the model is focusing on intermediate values and “giving up” on predicting the extremes. it does correctly trend on the extremes, but for high values it highly consistently undershoots and for low values it highly consistently overshoots. thats very undesirable. in other words even though the model isnt getting worse in its general prediction error, its not getting any better either. can this be improved without affecting error much? could there be some remaining tradeoff to work with? the very high fit accuracy (deceptive in a way because its only over the subsample) is obviously something to work with.

  • just staring at the graph gave me an immediate idea. there are predicted points close to the actual in the low and high regions. how to select those? also my other idea was to apply a basic linear norm to ‘hc’ at the beginning, graph #1 below, also called the rank norm in prior work esp on the misc optimization algorithms where it reappeared in multiple/ crosscutting contexts. here is where doing an inverse mapping on the rank norm turns this into a linear model “embedded inside” a nonlinear mapping (see the sketch after this list).
  • after that this code refits, graph #2, and picks “nearest”/ lowest error points evenly distributed across the ‘hc’ line. going from ~1K samples down to 250 which works out to finding the lowest error/ nearest actual points in intervals of every 4 points.
  • then, what about refitting using that sample? for now the model is limited to the top 5 weighted variables which turn out to be elo, a0, a1, a01, a1m in that order and gives a strong fit. it works out to 0.893 correlation over the sample and then has significantly better extreme or “extremity” predictions on the larger/ full dataset as seen in the next 2 graphs. however, some other tradeoff is immediately apparent, the error blue is no longer evenly distributed and has a sort of sigmoidal-slope bias to it, and am wondering, even marvelling a bit about this emergent property, dont know exactly how to explain it.
  • in 4th graph the error red also has a remarkable aspect of being somewhat “more in focus” at the center than the extremes. linear fitting aside, aka “hammering (‘apparent’?) nails,” what is this saying or indicating about the underlying/ fundamental/ intrinsic/ inherent model or embedded function here? basically, when one fits a higher dimensional curve in a lower dimension, one expects, instead of uniformly distributed, systematically varying error…
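re the rank norm/ inverse mapping from the 1st bullet, heres a minimal sketch of what is meant (hypothetical helper names, not the actual code): replace each target value by its rank fraction, fit the linear model on that linearized target, then map predicted ranks back to the original scale by interpolated lookup.

```ruby
# sketch of the rank norm + inverse mapping (hypothetical helpers, illustrative):
# each target value is replaced by its rank fraction before the linear fit,
# and predicted ranks are mapped back to the original scale by interpolated lookup
def rank_norm(ys)
  order = ys.each_index.sort_by { |i| ys[i] }
  ranks = Array.new(ys.size)
  order.each_with_index { |orig_i, r| ranks[orig_i] = r / (ys.size - 1.0) }
  ranks
end

def rank_inverse(r, ys)                  # rank fraction -> original-scale value
  sorted = ys.sort
  pos = r.clamp(0.0, 1.0) * (sorted.size - 1)
  lo, hi = sorted[pos.floor], sorted[pos.ceil]
  lo + (pos - pos.floor) * (hi - lo)
end

hc = [0.3, 2.1, 0.9, 5.5, 1.4]
p rank_norm(hc)                          # => [0.0, 0.75, 0.25, 1.0, 0.5]
p rank_inverse(0.6, hc)                  # interpolates between 1.4 and 2.1
```

ie fit the linear model against rank_norm of ‘hc’ rather than raw ‘hc’, then push its predictions back through rank_inverse, a linear model embedded inside a nonlinear remapping.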

😳 missing on 2nd look/ thought: it would be nice to have another graph after #2 showing which points were selected for the refit, relative to 1st fit.

since these concepts are being repeated, heres some terminology. the larger dataset can be called the “outer” one and the sampled one the “inner” one. (or in basic pre-20th century math terms, the set and subset.) am briefly tempted to call them the unfit and the fit respectively, but maybe thats a little too literally discouraging, lol!

❓ this is very encouraging, the inner sample is much more as desired, in short over/ underestimating some on opposite ends where it didnt before, and maybe nearly ideal, but some slice or wedge of dissatisfaction remains. the model is still under/ overestimating on the outer sample. it seems one wants the algorithm to reorient its entire bias particularly at the extremities, somehow. it seems likely it must be possible, the only question is how much tradeoff or error it “costs.” now, how to do it?

💡 ❗ oh! lol! the solution seems be simple, almost obvious, and occurs to me almost even as writing out those words, did anyone else catch that? the prior iterative worst-point removal code was optimizing the inner sample correlation coefficient. what about simply optimizing the outer sample correlation coefficient? ie “reorienting” the outer prediction distribution? … hmmm, does that make sense, what would it look like?

hmmm, error is measured on the outer set, and discarding is related/ linked to the (its) worst error, but the worst fitting point in it may not be in the inner set…

💡 oh! its simple, a few line code change! throw out… (from the inner set…) the worst fitting point on the outer set… thats also in the inner set! its so cosmic, its like a snake eating its tail, or yin+yang 😀 ☯

outline23.rb

(later) 😳 oops, lol, its late again and that last crazy idea didnt really make any sense! lots of great ideas melt on attempting to actually code them, lol! despite some of the thinking blending/ intermixing them wrt fitting purposes and leading to the terminology, in another sense, the two sets inner, outer have mutually exclusive points.

(12/19) 🙄 ❗ stayed up late last nite due to some getting wired up over this somewhat unusual/ novel/ interesting algorithm, a rarity for me, and can feel a bit how my brain starts to shift into impatience and grouchiness, proving our thinking is rooted in physiology (aka here a near headache etc) but its often or sometimes well disguised from our consciousness. and still woke up early this morning, it seems my bodys ability to sleep in on random choice may be declining with age… (lol, TMI?)

did some major rewriting on last code. after getting past the senseless “not fully coherent” idea, it appeared the way to go was as follows: find the best fitting point in the inner set and exchange it with the worst fitting point on the outer set. (still maybe some hint of yin-yang balance there…?)

this new code goes back to the full set of feature variables minus a minimal 3-variable subset, dlo, d, e, removed to evade linear dependency, and leads to some remarkable/ unexpected/ emergent dynamics. have to think a little further on how to revise it, its probably not what is fully desired, but very intriguing, and maybe much closer— a funky/ almost elliptical way of putting it, but obviously the thoughts are still forming in my head as writing this. its remarkable how the concept of error tradeoff is quite direct here and its almost as if error gets shifted like some kind of fluid entity from one set to another, almost like model fit is a scarce resource being (re)allocated or (re)distributed from one pool or “reservoir” to another. also its not exactly clear which of these dynamics are due to the algorithm vs the dataset; ie do the same effects emerge with other data, or not?

  • in graph #1 the correlation coefficient of the outer range only is plotted red right side scale; other useful statistics could be included but didnt for now. the max error of the outer set is plotted green/ blue. this code has no random logic and turned out to be perfectly monotonically decreasing not requiring logic to find an alternative to the last greedy choice, so it just quits when it no longer improves, in itself a not really anticipated aspect. the surprising finding is that the correlation coefficient climbs after a decline. am not sure how to explain this right now, its probably relatively straightfwd somehow on closer look but dont have the immediate answer.
  • in graph #2 the inner set is shown and the algorithm remarkably focuses on including points that are not in the center of the ‘hc’ distribution, causing a kind of striking low/ high bifurcation, and the predictions increase/ overlap over the two ranges.
  • immediately to my eye graph #3 could use some adjustment wrt overplotting/ and/ or repositioning but am holding off for now. it shows over the outer range, the algorithm is close to what was intended. it completely reorients its predictions over the entire range, in a sense balancing out, and error extent is basically very evenly distributed over the whole range, although error direction is again highly biased at the ends.
  • graph #4 shows combined sets where the inner (red) vs outer (green) set distributions end up and its a bit dramatic also. the inner points are pushed outward in 2 separate senses/ directions, away from center ‘hc’ values both parallel and perpendicular to the ‘hc’ line, now even leading to some semiconfusing mixup/ reversal/ inversion in the terms “inner vs outer.” in some (new) sense this seems to distribute/ balance the outer set error, however, the basic issue of underestimation at high end and overestimation at low end over both distributions remains unaddressed/ “unfixed.”

this experiment and the mixup about the “semicoherent yin-yang” idea lead me to realize maybe it is a bit hard to put into words/ ideas/ logic the distribution that is preferred, and am gonna have to think more carefully, or carefully more about this. what actually is “better”? sometimes with optimization it feels like a hall of mirrors…

these results also seem to point to some kind of maximal limitation of the data in being modelled with the linear model in the way that the inner set predictions for low and high points nearly perfectly coincide, as almost strikingly seen in graphs #2, #4… it is also striking to compare prior graph #2 with complementary graph #4 below, the graphs are almost color complements of each other in the point clouds! quite the linear regression hacking going on around here!

❓ seem to be getting into somewhat unfamiliar/ unexplored territory. at least theres lots of signal to play with. its almost that with the solid signal many new possibilities are opened up. so whats the next direction/ avenue? again, think the point cloud for the combined sets needs to evenly distribute around the ‘hc’ line somewhat like last experiment, but achieved via some iterative approach. it would seem this implies evening out the max or average error over the whole distribution ie both outer/ inner sets via iterations. it seems most of the basic ideas/ elements/ “moving parts” have been identified, they just have to be combined in some definite/ key way. ie something like a cast of characters:

  • linear regression
  • inner/ outer sets. fit on both together/ separately
  • misc quality metrics: point error, max or average error over set(s), correlation coefficient of fit
  • moving/ exchanging points from one set to the other esp based on the metrics
  • possible randomness on selecting points from one set
  • iteration, refitting
  • greedy algorithm/ gradient descent
  • backtracking if no improvement
  • looking at distributions/ membership of points in the inner vs outer sets
  • nonlinear aspects/ transforms/ mappings eg apply rank norm
  • etc

💡 ❗ re reservoir exchange comparison/ metaphor, maybe the idea/ target is to trade/ exchange worst fitting points in each set until errors are roughly equal? looks like the next promising or even excellent idea to try…!

outline24.rb

(12/20) ❓ something is just not fitting in my brain, there is some unintuitive/ difficult to grasp stuff going on here, even after all the effort, still having trouble wrapping brain around the situation(s)/ circumstances/ overall dynamics; it seems some new conceptualization/ even paradigm shift is required, but what exactly is it? getting back to basics/ initial findings, outline19 (12/14) is hard to understand. if the graph were interpreted as a single regression line/ variable fit of points, it wouldnt be accurate because it clearly isnt the error-minimizing line due to the apparently difficult to remove extremities over/ underestimation.

but thats not whats happening here, its a multidimensional fit with actual-variable reordering, although my brain keeps trying to interpret the visual results in terms of single variable regression. need to build up my intuition on this stuff some more, bet somebody has run into similar situation(s) and documented and/ or analyzed it, but have never seen it done… in a sense its about the prediction algorithm “playing it safe” and trying not to “color outside the lines” where more extreme predictions are too costly wrt error to guess at. and also in a way even though the multivariate regression seems to be correctly minimizing error my instinct (at this point still presumably not completely invalid/ something to rule out) seems to return to trying to make it “look” more like the single variate regression outcome.

💡 think iteration will give something close to what is desirable, however another idea just occurred to me that maybe again achieves nearly equivalent results while avoiding iteration: just tried doing the rank norm on the actual variable which tends to linearize it from a more nonlinear curve; how about increasingly nonlinearizing the actual curve, such that after the mapping/ stretch/ fit, the fit “scatterpoints” come out evenly distributed around the linear (rank norm) regression line? how would that look/ work? it appears something that amplifies/ stretches the actual curve based on the (linear) “misestimation” is called for…

whether such a strategy will work will depend on if the fit can be improved at the extremities without exactly commensurate decrease in fit/ increase in error in the center… but then wondering, is the current fit already implying that would happen?

❓ other key idea: all this has been engaging/ even fun, but as alluded more than once, really need to start looking at the model performance with trajectory averaging to understand better how the local model affects the more global predictions. “premature optimization” aka “scope creep” can affect more than coding, it can creep into data science also! however, thinking this over, its also the case that it seems predictions will be nearly the same, and if the model is not accurate over the high range, the model averaging will suffer correspondingly also. in defense there was a rough start earlier on outline18 (12/14) but its increasingly now looking almost crude.

(later) 🙄 ❗ 😮 ⭐ 😀 this code took strangely long to put together, its conceptually simple wrt prior code but little glitches/ hangups kept tripping up the flow, and got interrupted right dead in the middle with big fat gob of female gf trouble (more TMI? lol!), but finally got it hammered out. it uses the initial “straightfwd” linear model without any finetuning/ optimization largely similar to outline19 with (“only”) ~0.53 correlation coefficient fit over the ~1K points. it finds the longest glide and applies the unaveraged model over its length, and gets striking, almost ideal results as far as monotonicity. in all the recent heavy focus on optimization and model biases/ misestimation, lost sight of/ forgot that monotonicity is almost all that matters!

the green line left side scale is the “raw” linear model prediction for remaining trajectory length, ‘hc’ red. despite the irregularities in the bit size due to the glide, and other irregularities related to intermediate glides impacting the monotonicity of the bit width, its almost perfectly monotonic! this is kind of extraordinary! on a roll lately! another extremely solid demonstration/ finding/ milestone! now the “premature optimization” intuition aka “getting carried away” aka “overthinking” looks dead on…!

actually, 2nd/ further/ deeper look, need to not gloss over this/ call it out explicitly, something even more remarkable is going on! from this diagram apparently the model is actually “misestimating” the local ‘hc’ values red, ie not strictly tracking local bumps, in exactly a way that causes higher, more accurate monotonicity of the induction function green. something big, even extraordinary seems to be going on and it seems to be related to/ centered on finding/ isolating actual/ real intrinsic/ fundamental collatz glide-/ trajectory-controlling properties…

candidly/ full(er) disclosure, was a bit further excited/ awestruck and whipped this out without looking at/ examining details very closely; need to look closer & think about/ conceptualize it all some more! but in short, this is an apparent proof of concept and/ or vindication of the newly emerged concept of induction function construction leading to defractalizing/ defractalization.

⭐ ❗ 😮 there is also some concept of isolating/ extracting latent or hidden variables here exactly as intended, where ML can (sometimes!) be utilized to identify/ “extricate” emergent properties. the model is working exactly as designed— to not understate it, even impressively— finding a combination of features, and effectively “discarding” the rest (via low/ negligible weights), that gives rise to a very smooth, monotonic function built out of a subset of features with a trend not really at all evident in prior data analysis.

outline25.rb

(12/21) something hard to picture/ explain is going on here, but am working to get to the bottom of this. eg, staring at the last graph, it seems something is off, and am trying to articulate it. in particular the ‘hc’ line seems to have about 3 distinct slopes: one on the left, a less steep one from the middle to the far right, and a steep fall at the far right. but ‘hc’ is only calculated using remaining distance divided by ‘nw’. remaining distance is perfectly linearly decreasing. so it seems the change in slope is explained by ‘nw’. there is correspondence for the 1st two slope ranges in ‘nw’, but ‘nw’ does not seem to change slope at all in the last steep ‘hc’ fall. however, the graph is correct, and its explained in that ‘nw’ does have a very slight slope change in the same 3rd rightmost range and its the (nonlinear!) inverse that factors into ‘hc’. here one runs into the limitations of graphs/ visual approaches/ “rules of thumb.”
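for reference, a minimal sketch of the ‘hc’ calculation as just described, under the assumption that ‘nw’ is the iterate bit width (the actual plots apparently also apply the 50-sample averaging to ‘nw’, omitted here):

```ruby
# sketch of 'hc' along a trajectory as described above (assumes 'nw' is the
# iterate bit width; the 50-sample averaging of 'nw' is omitted)
def collatz_seq(n)
  seq = [n]
  seq << (n = n.even? ? n / 2 : 3 * n + 1) while n > 1
  seq
end

def hc_along(seq)
  seq.each_with_index.map do |x, i|
    nw = x.to_s(2).size              # iterate bit width
    (seq.size - 1 - i) / nw.to_f     # remaining distance divided by 'nw'
  end
end

p hc_along(collatz_seq(27)).first(5)
```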

❓ further look, the ‘hc_y’ trend strongly resembles smooth3 graph #4 trends from last month. and looking again at the weights, maybe its not a coincidence at all! it appears a nearly identical prediction can be arrived at with much lower correlation 0.38 and using only a0, a1, a01 variables. but maybe in some sense smooth3 graph #4 was computed in a similar way? and this strong similarity also makes one suspect that maybe the ufo construction also invoked last month will likely similarly disrupt these ‘hc’ predictions…? actually it would be a bit magical/ miraculous to find anything that isnt disrupted by ufo constructions… but on other hand, that is maybe a crucial end goal of the induction function construction, if they are undifferentiated/ aka “embedded” inside the core density/ entropy region, as they apparently are… note here the terminology gets stretched some because ufos themselves are eminently differentiated wrt known/ developed features.

here is a slight logic change worth noting which reveals a little more behind the scenes dynamics but doesnt change the smooth prediction. the above ‘hc’ and ‘nw’ lines are computed using the ‘nw’ 50-sample averaging/ same variables included as last graph. what would it look like without the averaging? the result is a visualization of much more of the local noise, but this only seems in a way to make the unaltered very smooth ‘hc_y’ prediction even a little more contrasting and thereby mysterious.

outline25b.rb

(later) 😳 😦 👿 o_O the answer is quick and brutal, for a grisly analogy, like slicing off a body part: ufos kill models. this is backtrack32 from last month slightly adapted, creating a 100-width starting iterate with an embedded ufo and 500 pre-iterates, along with the last code slightly adapted, to process the ufo using the current linear model. it results in a major disruption in the monotonicity of the model prediction ‘hc_y’ green, a downward bump at about ⅓ from left centered on the ufo, ie model-breaking nonmonotonicity. as demonstrated the model is very powerful in the undifferentiated region but breaks down/ fails to generalize here in the postdetermined region.

however, there is maybe some glimmer/ catch, possibly a loophole: compared with the prior graph, the pre-descent to the ufo is obviously/ glaringly anomalous in nature wrt ‘hc’ ie easily identified as “apparently artificial.” so now the obvious question is, can a pre-trajectory to the ufo be created that is not “anomalous,” ie in at least this sense, and/ or more broadly in other senses? admittedly, a more pressing question wrt the overall plan is whether any features can be constructed/ put together that yield a monotonic induction function “in spite of” embedded ufos.

another idea: a recent pivot switched to ‘hc’ instead of ‘hg’ mainly due to sparsity of ‘hg’ predictions wrt model averaging. but maybe ‘hg’ focused on glides would not be subject to “ufo disruption.” the idea that ufos cant appear in (postdetermined) glides keeps reappearing and has never been refuted. to some degree equivalently, glides have never been constructed in front of ufos.

backtrack32b.rb

outline25c.rb

(12/23) 😳 this took a bit )( frustratingly long to put together, but think its correct now. it applies the linear model using ‘hg’ instead of ‘hc’. there were 2 very subtle glitches that didnt cause the code to fail but gave incorrect results (mostly messed up scale in the predictions) and literally took a few hours to carefully/ painstakingly isolate:

  • fencepost error on calculating feature averages meant they were slightly off and the model wasnt being applied exactly
  • there is a step in the model to subtract averages and rescale by standard deviation, ie z-scores, before calculating model parameters; ie the model parameters are calculated wrt the z-scores. its omission messed up results also: the z-score transform was not reapplied to the “new” (generated) data, the subglides (see the sketch after this list).
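a minimal sketch of the consistency point in the 2nd bullet (toy data/ names, not the actual outline26 code): the means/ stddevs computed on the fit data must be reused verbatim when the model is applied to new data.

```ruby
# the z-score transform must reuse the *fit-time* means/ stddevs on new data;
# recomputing them (or skipping the step) on the new data silently skews results
def zstats(rows)                          # rows: array of feature arrays
  rows.transpose.map do |col|
    m  = col.sum / col.size.to_f
    sd = Math.sqrt(col.sum { |v| (v - m)**2 } / col.size)
    [m, sd.zero? ? 1.0 : sd]
  end
end

def znorm(row, stats)
  row.zip(stats).map { |v, (m, sd)| (v - m) / sd }
end

train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]    # toy fit-time features
new_rows   = [[2.5, 15.0]]                              # toy "new" (generated) data

stats  = zstats(train_rows)                       # computed once, on the fit data
ztrain = train_rows.map { |r| znorm(r, stats) }   # fit the model on these
znew   = new_rows.map   { |r| znorm(r, stats) }   # reapply the SAME stats here
p znew
```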

this all was detected by looking at the overall fit that looks exactly like outline19, and then trying to find the outlier/ edge points in the new prediction, ie those points starting a trajectory, which should have been included, but which werent correct at 1st, ie rightmost point(s) in outline19 graph but none corresponding in the new graphs.

it seems a bit alarming at times how easy it is to generate predictions with no errors that are incorrect due to logic defects! part of the “joy” (expertise!) of data science! also as part of this, refactored the code to extract/ sample the ~1K subglides from the hybrid output. it was never saved separately but is embedded in programs up to outline18 where there was a big pivot from nearest neighbor model to chasing the linear model.

after all that, the results are not bad. they are not eyepoppingly smooth either like the prior results. 5 longest glides were used and they come from 2 similar subtypes generated by the hybrid algorithm; it generates many more “subtypes,” but among the longest trajectories/ glides there are only a few. the 1st subtype left (also 3rd, 4th) is better in prediction, the 2nd (next, also 5th) has a large unruly spike at the end. another key observation is that this model turns out maybe not really all that different than the ‘hc’ model because the code is similarly weighting a0, a1, a01 and “deweighting” all the other variables. each glide started with about ~600 iterates but gets cut to about 1/12 of that, ~50 after throwing out subglides that are under 50 iterates (the averaging range/ window/ count) and/ or hg<0.05.

outline18b.rb

outline26.rb

💡 🙄 👿 other new thoughts today. have to be brutally honest to be in this occupation, esp with oneself! there has always been a lingering possibility that binary-/ bit-based features, painstakingly constructed now over several years, are insufficient to solve the problem. there has been admission of this over the years, but every time they yield (remarkable!) new signal, its been like “a bright shiny object lying by the side of the road” (old american expression) … & ofc the implied part, “stopping”/ jumping out after it. this cringeworthy possibility might also be called the red herring scenario.

however, close examination of these techniques suggests they have both strengths and weaknesses. there was an old idea of trying to link the “local and global” in the problem, and by successfully linking them, bringing about a solution. these approaches really clearly embody that tactic, and have played out exceptionally well recently.

it is still so striking how well the linear regression of only 3 binary features brings out a nearly perfect induction function over the hardest undifferentiated region over the long pre/ postdetermined glide, and it feels like some much-needed psychological fuel, an invigorating, exciting, even near exhilarating vindication… at least of something. its a hard fought/ won, well deserved battle victory! however, with sun tzus admonition about strategy vs tactics always in mind (and this is, at this point, an epic battle) is it a viable overall/ “overarching” strategy? (lol, sounds like a great title for the next blog!)

alas with the ufo failure (and it wasnt surprising at all, they were specifically constructed for that purpose, to break binary feature models) it is now looking like the linear regression over the bit features is something like a toy defractalizer. an impressive toy, but maybe still a toy compared to what will be necessary. am also suspecting now that the nearest neighbor algorithm even though fundamentally nonlinear will not fare any better.

but then it comes back to, if these features are not sufficient, what are any/ those that are closer? it seems that virtually ALL ML techniques run on features in some sense. must always keep in back of mind, it is conceivable that the features being sought dont exist. which makes me wonder, is there a proof-like demonstration of that? it certainly relates to the idea the problem is undecidable/ unsolvable. but my intermediate levels of success still make me suspect there are more (advancing) surprises yet in store.

my other idea is that the problem is extremely region dependent, where regions are somewhat like different feature ranges. some features work in a subregion and then are thwarted in other regions. the “density/ entropy core” concept seems to be “reiterating” this. this ties in again with the idea of trying to construct a feature-driven overall map of these regions/ subregions. on the other hand, the above exceptional case aside, it seems ufo pretrajectories may not be distinguishable (via bit features) from non-ufo pretrajectories.

❗ 💡 however, it is quite notable that the following map seems to be suggested by all the recent analysis, ie possibly viable and not refuted (including even the ufo counterexample, which again, seemingly easy to overlook/ forget somehow, isnt a glide):

  • all trajectories converge to core density/ entropy range
  • the trajectory will either terminate its glide by then, or the above trend will govern the glide termination (“inside” the core region).

(12/24) ❗ 😳 👿 😡 @#%& that line about how easy it is to make wrong calculations turns out to be unpleasantly, cringingly prescient. closer look, the same big-but-easy mistake has messed up 3 separate “results” which now melt away as “too good to be true.” outline25, outline25b, outline25c. again, exactly same mistake, the norm logic/ calculation was not applied in all 3 cases and the failure led to the striking results that are “almost” totally wrong. wishful thinking, now seen as naive enough to be nearly magical thinking!

this is the corrected code. honestly, am close to embarrassed in posting it. “squinting very hard” the model seemingly/ maybe has some validity predicting ‘hc’ over the glide as in 1st graph, green, but completely breaks down after the glide, 2nd graph. it fares no better over the ufo, 3rd graph. basically, the prediction gyrates wildly outside the “training range” and is apparently essentially random in that range. this lack of prediction is frankly not surprising, but all the excitement about ML extracting emergent properties in the form of strong/ striking signal looks like its pretty much trashed.

so, largely defeated by the adversary yet again. there is some provable signal looking at correlation coefficient 0.53, but otherwise it seems very lackluster/ weak and too wobbly to build on. yep, this stuff really is hard/ discouraging/ disappointing/ nearly excruciating sometimes. it might be harder than rocket science, lol… happy holidays!

outline27.rb

(12/25) ❗ 😳 😮 @#%&! more cringing! 2020 hindsight… there was an awful lot of hacking this month on something that is basic. its not that its not real, but its basic. just graphing a0, a1, a12 over a trajectory reveals a lot of the story. all of them nearly coincide and their trend is simply the smooth curve from the outline25 series, inverted. honestly, this is the kind of stuff that looks worse than amateurish. what a joke, lol! at least for now, just an “inside joke.” because, in a sense, there are no outsiders…

yes, have been staring at the weights for these variables and it hinted at the story. the regression is essentially just weighting a0, a1 equally and a01 as ~½ of those weights. have noticed this almost from the beginning but its significance only took a few weeks to dawn. also scaled average bit runs are probably graphed somewhere in the semidistant past, maybe this was already known, but my semiexcuse is that while bit runs are highly studied, the scaled version typically hasnt been studied very closely/ in depth around here so far.

oh well, can make other semiexcuse this is what happens when theres nobody else watching. “peer programming” would be great for statistics/ ML/ math research too. a lot of new stuff for the cutting room floor, and nevertheless think its not all lost, there are some salvageable scenes/ film clips in all that…

💡 another observation. staring at the (longest) glide selected, it is long/ narrow/ sideways, not arching much, and it looks a lot like the 0.64 Terras density glides which have been generated/ studied some. could it be nearly the same? its a lot easier to generate the long 0.64 Terras density glides than the hybrid code, thats for sure. and those were exactly the glides that may be the most undifferentiated because they lie on a sort of transition point between higher density glides which are steeper and have higher signal/ stronger features and lower density ones that dont glide.

(12/28) 💡 its time for some major retrospective + overview + some reevaluation/ reorienting/ shift/ pivot/ realignment.

1st, here is a quick/ basic exercise to nail down recent observations/ hints. this is the average 1-run, 0-run, and 0/1 run lengths over the longest glide for the latest data in the middle of the graph, and the scaled values along the bottom, right side scale. last month talked about features and scale encoding. this is similar, although its almost the opposite, a scale unencoding. as long known the average bit run length is scale invariant over the undifferentiated region, ie nearly constant around ~2, so then dividing it by iterate width “automatically” causes a kind of scale encoding.

except its like a pseudo-scale-encoding; my original idea of scale encoding was not to use iterate width directly in the feature. but its hard to explain and this will take more subtlety to define because eg density seems scale invariant, but its defined by dividing # of 1-bits by iterate width. here the scaled version is nonlinear, and that seems to explain some of the nonlinearity or underfitting of the fit of prior regression models (eg noticed/ seen maybe earliest in outline19), although further isolation of that is needed.
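for concreteness, a sketch of these run-length features (guessing from the names that a1 is the average 1-run length, a0 the average 0-run length, and a01 the average over all runs; the actual definitions in the code may differ), with the scaled versions divided by the iterate bit width:

```ruby
# sketch of the run-length features (naming is a guess: a1 = average 1-run
# length, a0 = average 0-run length, a01 = average over all runs; scaled
# versions divide by the iterate bit width -- the pseudo-scale-encoding)
def run_features(n)
  bits = n.to_s(2)
  runs = bits.scan(/1+|0+/)                        # maximal runs of equal bits
  ones  = runs.select { |r| r[0] == '1' }.map(&:size)
  zeros = runs.select { |r| r[0] == '0' }.map(&:size)
  a1  = ones.sum / ones.size.to_f
  a0  = zeros.empty? ? 0.0 : zeros.sum / zeros.size.to_f
  a01 = runs.sum(&:size) / runs.size.to_f
  nw  = bits.size
  { a0: a0, a1: a1, a01: a01,
    a0s: a0 / nw, a1s: a1 / nw, a01s: a01 / nw }
end

p run_features(2**100 + rand(2**100))              # a ~half-density random iterate
```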

another way of thinking about this is that smaller sections of the curve are roughly/ approximately linear and a linear model will fit to them “piecewise” with different weights depending on the location of the section, where in prior cases the fit section was the initial glide at the start of the trajectory. sections that are not fit will be expected to/ tend to diverge, possibly wildly, as just seen in prior outline27. as for the scaled run lengths trend there are probably older exercises that show this curve for scaled average run lengths but it would be timeconsuming to find them.

review173.rb

💡 ❗ ⭐ (wdittie) this is another new idea carried out relatively quickly, partly inspired by the pivot direction which is to be described after this. there is a technique of doing a long running average between two nearly identical calculations looking for an “eventual” stable discrepancy. that is used here with even more care. (re)thinking about this more recently, its also crucial to look at the standard deviation along with the averages and see how they compare, and that key aspect was missing in some earlier similar approaches.

but calculating the standard deviation “naively” can lead to a problem of numerical cancellation, described here/ wikipedia as the “naive algorithm.” this code implements the method over shifted data which doesnt have the cancellation problem. the idea is to look at entropy of 0.64 Terras density glides (red) vs ½-density iterates (green). note the 0.64 Terras density glides are “sideways” and the ½-density ones are “drain” so any such feature discriminating them is automatically a (remaining) glide-sensing mechanism.
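for reference, the textbook “shifted data” trick (this isnt the construct180c.rb code, just the standard method): shift every sample by the first value seen so the running sums stay small, which avoids the catastrophic cancellation of the naive sum-of-squares formula.

```ruby
# the textbook "shifted data" variance method (not the construct180c.rb code):
# shift every sample by the first value seen so the running sums stay small,
# avoiding catastrophic cancellation of the naive sum-of-squares formula
class ShiftedStats
  def initialize
    @n, @sum, @sumsq, @k = 0, 0.0, 0.0, nil
  end

  def add(x)
    @k ||= x                      # shift constant: the first sample seen
    d = x - @k
    @n += 1
    @sum += d
    @sumsq += d * d
  end

  def mean
    @k + @sum / @n
  end

  def stddev
    Math.sqrt((@sumsq - @sum**2 / @n) / @n)   # population standard deviation
  end
end

s = ShiftedStats.new
[1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16].each { |x| s.add(x) }
p [s.mean, s.stddev]              # => [1000000010.0, 4.743...]
```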

theoretically it has long been known there is probably a very slight discrepancy, but how much? this code does manage to draw it out in a statistically meaningful way ie showing it holds within variance bounds (drawn with error bars), uncovering/ revealing/ “exposing” that the ½ density iterates have a slightly higher entropy. there are 2 graphs, a larger one and a detailed zoom. closeups of the 1st graph in the midrange reveal/ remove the heavy overplotting of green over red.

many similar experiments tend to turn up no discernable difference but they dont use this now clearly powerful technique of focusing on a single discrepancy “indefinitely” (ie via “unlimited” iterations). notice remarkably it takes over ~1.5K iterations to solidly separate the values attesting to the razor-thinness of the signal/ discrepancy, so its both razor thin and quite definite, a quite valuable combination wrt isolating signal rarely seen around here! its also remarkable how stable the standard deviation range is, it does not decrease after early on almost at all with additional samples.

a way to understand this diagram is that the left side has fewer samples and is therefore maximally noisy and expected to vary almost on every run, but becomes more stable moving rightward, until after some n≈? iterations where it is expected to be almost the same on every run. an interesting way of combining graphs with dynamic/ emergent behavior and statistical/ measurement uncertainty. it is highly reminiscent of fluid dynamics in a wind tunnel with highest instability on left side, and likely this analogy is not fanciful or mere coincidence. as understood almost since its inception and increasingly emphasized/ reinforced here, this problem has many aspects that are like physical systems; this line of research uncovers its a dynamical system at heart.

construct180c.rb

⭐ 🙄 ❗ ❓ now then, whats the bigger story here? the big outline was proposed at the end of 9/2020 without any corresponding code except a quick “sanity check.” early on there was the suspicion/ known risk that features or corresponding signals of them might be very thin and/ or noisy in the undifferentiated region. that code did not rule them out entirely/ conclusively/ definitively, but it was not too definitive in the other direction either; it did proactively show/ foreshadow the differences could be thin.

the next month 10/2020 started to experiment more with nonlinear feature maps via nearest neighbor algorithms. it was able to show improvements in the nearest neighbor optimizations, indicating there is some signal. but how much? alas, no key/ “baseline” comparisons with (average) predictions were ever made; its not exactly clear how to do that anyway, and need to build up some systems/ understanding there. my suspicion is the signals were better than average (predictions), because the iterates still had a lot of differentiation in them eg density/ entropy/ 1-runs etc, but in retrospect from this month, need some better basic signal testing of the model.

showing improvements in an optimization algorithm is good “locally” but doesnt give a “global picture” of how good the fit/ signal is. am now realizing at this point this is a major oversight/ omission. but the shape of the “fix” is not clear to me either yet; however it generally has a lot to do with testing via various scenarios.

then in the middle of last month 11/2020 the hybrid51f algorithm got fairly sophisticated in focusing/ zooming in on the undifferentiated region and probably outpaced the signal detection. was still looking at feature maps/ classes which were demonstrating some signal, but again, not really compared much against a baseline.

at this point though, this month, the new hybrid52 algorithm and corresponding analysis seems to conclusively demonstrate that most or nearly all current feature signals are razor-thin in the undifferentiated region. my idea was that the mapping optimizations might eventually (somehow) “directly” show the limitations of feature “resolution.” havent exactly come to that directly, and still need to build better systems for that, but am feeling it indirectly at this point. some of this violates one of my rules of thumb that, again, strongly wishful/ near-magical thinking aside, nonlinear algorithms are not likely to find (additional) signal unless a basic “linear” baseline is present. its heavy to call the following hard-sought notes/ observations/ lessons learned a post mortem but also not entirely inaccurate…

  • alas bottom line its all very close to the concept/ phenomenon named/ identified last month, NINO, noise in noise out…
  • in the end a lot or most of the signal isolated seems to be that the glides tend to be early in the long trajectory, and the scaled 0/1 runs features can roughly correlate to their position in the trajectory, but (now partly experimentally justified, partly emanating from intuition here) dont really seem to be glide-sensitive in general.
  • the prior experiment shows there may be some razor-thin signal to exploit in the undifferentiated region but it tends to take very many samples to isolate/ stabilize. cursory look at some other metrics with the same technique suggest they have similar measurable differences.
  • all the prior efforts to optimize various optimization techniques such as nearest neighbors and regression were semi worthwhile and not wasted, but there was some definite jumping of the gun because what might be called “relative signal” is not well analyzed/ understood yet
  • there was some (over)reliance/ failed expectation on the nearest neighbor code variable reweighting to take out spurious variables, and it seemed/ seems unable to do that so far, so apparently need some better way there, and it all became way more obvious after regressions were applied to the same data, unweighting almost all the variables except for the scaled run lengths, which in retrospect are questionable as “features.”
  • it is my suspicion that theres not much signal in the data, and there are somewhat indirect ways of showing that, eg poor test error performance (improvement), but need to develop more direct/ definitive methods. on other hand it is not really/ exactly/ totally fair/ realistic to expect different ML algorithms eg regression vs nearest neighbors to perform similarly or weight variables similarly, the signal in features is still somewhat relative to the ML algorithm(s) applied.
  • a lot of effort also went into the generation system, which is impressive and works largely as intended, but its starting to look like the iterates are not so different than Terras 0.64 density glides, and those are much easier to generate, so maybe something to focus on intermediately instead.
  • it seems likely there are regions where almost any feature will have some meaningful effect on trajectories, such that a relatively general feature map has a lot of validity, but again, as has been remarked before, the undifferentiated region is like a feature graveyard and recently in retrospect have put in an awful lot of effort to “dig a few more graves + bury a few more (‘cold+dead’!) bodies.”
  • ok, thats dramatic. some of the biggest reeling disappointment/ frustration is outline27 nonresults, but looking at it/ thinking more carefully/ 2nd thought(s), it could be more of a failure of generalization where the features worked somewhat/ roughly over the “training” (fit) range but break down/ fail outside of it. there is maybe some further hint of this in the outline17 nearest neighbors model where the higher ‘hg’ points had nearer class distances.
  • in some senses both key prediction parameters hc, hg seem to be showing limitations as far as leading to (mis)generalization where the model seems to be “learning” something very imprecise/ inaccurate/ unintended like “glides happen at the beginning of trajectories”; an alternative seems to be called for but is not clear/ distinct yet.
  • ❗ overall, lots of tools/ understanding have/ has been sharpened, but it looks like once again the challenge is largely/ essentially finding any feature(s) whatsoever in the undifferentiated region. the last experiment and a few others are encouraging that it can be done…

(12/29) 💡 (wdittie) staring at graphs of density and other metrics led me to notice/ wonder. believe this has never been noticed/ posted before, a basic finding: there is significant correlation between subsequent values in the collatz sequence both for ½ density iterates and 0.64 Terras density. the correlation coefficient comes out typically 0.55-0.65. this is a scatterplot of prior vs subsequent density over an entire trajectory for the higher correlation ~0.64 case, with hotter colors later in the trajectory, ie colored by sequence index, showing apparently lower iterates have higher correlation, and other metrics like entropy behave similarly to this plot. this suggests a standard ARMA process seems to be in play.
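a minimal sketch of the measurement (illustrative only; the actual construct181.rb presumably samples the 0.64 Terras density glides rather than a plain random iterate): compute the density of each iterate along a trajectory and correlate the sequence against itself shifted by one.

```ruby
# sketch of the lag-1 density correlation (illustrative: a plain random iterate
# here, whereas construct181.rb presumably uses the Terras density samples)
def density(n)
  b = n.to_s(2)
  b.count('1') / b.size.to_f
end

def collatz_seq(n)
  seq = [n]
  seq << (n = n.even? ? n / 2 : 3 * n + 1) while n > 1
  seq
end

def pearson(xs, ys)
  mx, my = xs.sum / xs.size.to_f, ys.sum / ys.size.to_f
  num = xs.zip(ys).sum { |a, b| (a - mx) * (b - my) }
  num / Math.sqrt(xs.sum { |a| (a - mx)**2 } * ys.sum { |b| (b - my)**2 })
end

d = collatz_seq(2**200 + rand(2**200)).map { |x| density(x) }
p pearson(d[0..-2], d[1..-1])      # prior vs subsequent density correlation
```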

construct181.rb

(12/31) 💡 ❓ lots of new ideas/ code to write up. this month has been a doozie as far as word count, possibly nearly the longest ever, and am holding off briefly for a few days to start a new month entry. however, a remaining quickie: the observation/ question about the a0, a1, a12 features is maybe not so simple as thought. as stated the regression formula is close to a0 + a1 – ½a01 where ‘a01’ is the average over both groups. this formula sounds a lot like the following: average of some trait over men, vs over women, vs over men + women. this formula is not a constant if the group sizes of men and women are changing, or if the average of one group is changing relative to the other…? here group size is # of separate 0/1 runs which is proportional to entropy…
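a quick toy check of that point (made-up run totals/ counts): hold the combined totals fixed and just shift the run counts between the two groups, and the formula value moves.

```ruby
# toy check of the group-average point (made-up numbers): same total bits and
# same total number of runs, but a0 + a1 - a01/2 shifts with the group split
def combo(s0, n0, s1, n1)          # (total length, count) for each run group
  a0, a1 = s0 / n0.to_f, s1 / n1.to_f
  a01 = (s0 + s1) / (n0 + n1).to_f # average over both groups together
  a0 + a1 - a01 / 2
end

p combo(8, 4, 8, 4)   # balanced groups:     2 + 2 - 1    = 3.0
p combo(8, 2, 8, 6)   # same totals, skewed: 4 + 1.33 - 1 = 4.33...
```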
