in contrast to sometimes having no idea about future directions with these posts, the last blog gave some excellent themes/ ideas for research directions and this new post. to put it in the vernacular, am finally really on a roll! aka/ reminds me of “hot hand!” and yet also lately feeling really strong flow in more ways than 1! 😮 😀
💡 ⭐ 🙂 an immediate great idea was to study how actual encountered errors during training data affect the convergence/ stability property. this was easy to code up based on prior structure. this code analyzes each of the 250 error vectors for 20 different densities. it repeatedly adds the error vector at every iteration of the (“meta”) trajectory.
another prior aspect/ wrinkle of this was affecting/ eluding me until now. another way of thinking about/ visualizing results is that as noted, esp with both positive and negative error, the added error either “increases, decreases, or disrupts/ thwarts” convergence, ie trajectories respectively terminating sooner, later, or not at all. this is a 3 way “trichotomy” ie each error result can be placed in the 3 classes.
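the trichotomy can be sketched roughly as follows. this is only a minimal sketch, not the code behind the post: `classify` and `tally` are hypothetical names, trajectory lengths are assumed to be iteration counts with nil standing for "never terminated", and counting a tie as "increased" is an arbitrary assumption.

```ruby
# hypothetical helper: classify the effect of one added error vector.
# base_len / pert_len are iteration counts of the unperturbed and perturbed
# meta trajectories; nil is assumed to mean the trajectory never terminated.
def classify(base_len, pert_len)
  return :thwarted if pert_len.nil?
  # assumption: a tie counts as "increased" (terminates no later than baseline)
  pert_len <= base_len ? :increased : :decreased
end

# count how many (baseline, perturbed) pairs fall in each of the 3 classes
def tally(pairs)
  counts = { increased: 0, decreased: 0, thwarted: 0 }
  pairs.each { |base, pert| counts[classify(base, pert)] += 1 }
  counts
end

p tally([[40, 35], [40, 52], [40, nil], [40, 38]])
```

the three counts per density are what a stacked green/ blue/ red chart would be drawn from.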
then how to illustrate that? came up with this nice variation on a pie graph. figure theres a name for it but havent heard of one. each class is a different color. increase/ decrease/ thwarts classes are green, blue, red respectively. (this color scheme was chosen intentionally. although maybe yellow/ amber for the blue makes more sense, and then its a lot like a stoplight with similar meanings!) then drew the results over the 250 train density variations from low to high density.
am pleased with the results. in line with some of the previous observations (errors correlated with variables by density) they show the nonrandomness in stability properties as closely related to density. for low densities there is mostly improved convergence. for higher densities there is more decreased/ thwarted convergence. there is typically not much intermediate “blending” between the classes (ie “one extreme to the other”) except for (mid-)higher densities. although actually the most thwarting happened in middle densities.
this is very informative/ revealing/ intriguing/ suggestive and shows that maybe the major challenge/ focus is going to be trying to improve the meta function accuracy versus different densities. the other very encouraging sign is that nearly half the errors improve convergence and maybe about ~⅔ dont thwart it, ie only ~⅓ are problematic. note that decreased convergence is still acceptable for proof purposes.
(6/2) a natural next step is to look at the contribution to that by components of the error vector, in other words by each of the 8 variable errors alone. this code loops through all the variables and outputs the final results for each, with the % non converging in the final column (over 100 density increments). ‘dh’, ‘dl’, and ‘wr’ “alone” are enough to account for most or more of the nonconverging aspects (up to ~30-35% nonconverging adding each only) although ofc this is a simplification because adding multiple components is not necessarily reducible to effects of adding single components.
actually somewhat surprisingly adding individual components seems to increase nonconvergence over adding multiple components. this may have to do with the components balancing or canceling each other out somewhat esp with some positive/ negative mix. remarkably the ‘a1’, ‘a0’, ‘sd0’, ‘sd1’, ‘mx1’ components alone did not disrupt/ thwart convergence much ie have low nonconvergence contributions of ~11% or less and typically much lower. for comparison the combined variable nonconvergence is around ~23%. overall this suggests trying to reduce error in the dominant components, and ofc esp the critical width variable ‘wr’.
results are in the form of triplet counts [x, y, z] where x is increased convergence, y is decreased convergence, and z is nonconvergence. 2nd triplet is percentages of totals.
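for reference, the 2nd triplet is just the 1st divided by its total. a minimal sketch (the `percentages` name is made up, and rounding to 3 places is an assumption based on the printed output):

```ruby
# convert a count triplet [increased, decreased, nonconverging] into fractions
def percentages(counts)
  total = counts.sum.to_f
  return counts.map { 0.0 } if total.zero?   # guard against an empty run
  counts.map { |c| (c / total).round(3) }
end

p percentages([920, 1059, 77])   # counts from the 'a1' row
# → [0.447, 0.515, 0.037]
```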
ofc at this pt it really makes sense to solve the multidimensional linear equation and it presumably would give total insight into all this, but have to brush up on my theory for all that, instead of “just dinking around”…
["a1", 100, [920, 1059, 77], [0.447, 0.515, 0.037]]
["a0", 100, [822, 1280, 0], [0.391, 0.609, 0.0]]
["dh", 100, [1092, 406, 623], [0.515, 0.191, 0.294]]
["dl", 100, [1008, 357, 756], [0.475, 0.168, 0.356]]
["sd0", 100, [843, 1027, 237], [0.4, 0.487, 0.112]]
["sd1", 100, [1121, 873, 18], [0.557, 0.434, 0.009]]
["mx1", 100, [1191, 726, 175], [0.569, 0.347, 0.084]]
["wr", 100, [1114, 258, 748], [0.525, 0.122, 0.353]]
am not sure what this means yet but its worth noting/ reporting/ saving. repeated iterations of the ‘wr’ calculation f^n(x), where ‘n’ is the # of repeated iterations, specified here via ARGV, improve the single-component error measurements, dramatically for ‘sd0’, ‘dl’, and ‘wr’ in particular. maybe/ apparently looking at residuals, some kind of smoothing effect? here are runs for n=[5,10,15].
💡 ⭐ 😮 ❗ was musing on all this and realized while not yet across the finish line, an old concept formulated/ espoused is finally filled out/ detailed/ realized here, the old “global vs local” dichotomy/ concept/ theme/ paradigm. the fit curve/ meta fn emulates the global properties of the (“target/ real/ actual”) curve, in particular the convergence dynamics, and errors are calculated at each point, locally. the meta function is also similar to an invariant. in a sense the index in the meta sequence (ie sequence computed by meta MDE/RR function) is the invariant and “hidden variable”…!
5
["a1", 100, [481, 918, 0], [0.344, 0.656, 0.0]]
["a0", 100, [749, 1204, 0], [0.384, 0.616, 0.0]]
["dh", 100, [1088, 590, 399], [0.524, 0.284, 0.192]]
["dl", 100, [1001, 942, 147], [0.479, 0.451, 0.07]]
["sd0", 100, [804, 1195, 0], [0.402, 0.598, 0.0]]
["sd1", 100, [670, 493, 0], [0.576, 0.424, 0.0]]
["mx1", 100, [418, 686, 0], [0.379, 0.621, 0.0]]
["wr", 100, [983, 844, 268], [0.469, 0.403, 0.128]]
10
["a1", 100, [528, 708, 16], [0.422, 0.565, 0.013]]
["a0", 100, [683, 1072, 0], [0.389, 0.611, 0.0]]
["dh", 100, [1019, 600, 262], [0.542, 0.319, 0.139]]
["dl", 100, [907, 934, 66], [0.476, 0.49, 0.035]]
["sd0", 100, [720, 1058, 0], [0.405, 0.595, 0.0]]
["sd1", 100, [715, 388, 0], [0.648, 0.352, 0.0]]
["mx1", 100, [662, 572, 15], [0.53, 0.458, 0.012]]
["wr", 100, [1041, 800, 222], [0.505, 0.388, 0.108]]
15
["a1", 100, [427, 537, 1], [0.442, 0.556, 0.001]]
["a0", 100, [578, 1093, 0], [0.346, 0.654, 0.0]]
["dh", 100, [862, 636, 224], [0.501, 0.369, 0.13]]
["dl", 100, [845, 874, 50], [0.478, 0.494, 0.028]]
["sd0", 100, [682, 988, 0], [0.408, 0.592, 0.0]]
["sd1", 100, [463, 308, 0], [0.601, 0.399, 0.0]]
["mx1", 100, [396, 480, 0], [0.452, 0.548, 0.0]]
["wr", 100, [992, 864, 188], [0.485, 0.423, 0.092]]
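the f^n(x) notation used for these runs just means applying a one-step map n times. a minimal sketch of that idea, where `iterate` is a hypothetical helper and the plain collatz step is only a stand-in (the actual runs iterate the ‘wr’ calculation, which is not reproduced here):

```ruby
# apply a one-step map f to x, n times: f^n(x)
def iterate(f, x, n)
  n.times { x = f.call(x) }
  x
end

# stand-in one-step map (the plain collatz step), NOT the fitted 'wr' eqn
step = ->(x) { x.even? ? x / 2 : 3 * x + 1 }

n = 5   # the post reads this from ARGV; fixed here to keep the sketch simple
p iterate(step, 27, n)   # 27 -> 82 -> 41 -> 124 -> 62 -> 31
```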
(6/6) this is some rather involved code that crunches a lot of statistical analysis, a bit abstractly at times. there are 3 categories/ classes of “decreased, increased, non-” convergence labeled ‘-‘, ‘+’, ‘x’. was wondering how possible it was to predict the class based on simple predictors. was examining ‘wr’ error and its consistently/ entirely negative for low density seeds all the way up to half density (ie meta fn overestimates actual value). one predictor is to add ‘wr’ component only as an estimate of convergence associated with adding all components which turns out to have substantial predictive utility, even including finding some nonconverging cases. this is not really hinted in the prior numerical results listed above, where ‘wr’ doesnt seem to stand out among other variables. another simple predictor is to use the sign of ‘wr’ error only (again vs predicting convergence of adding all components, like a baseline), but which with only 2 states can only predict decreased or increased convergence.
in addition to case (1) the “straight prediction” (adding ‘wr’ component only or ‘wr’ sign estimate), these 2 predictors can be used with other data transforms such as (2) skipping over (or “throwing out”) the nonconverging cases ‘x’, (3) skipping the decreased convergence cases ‘-’, or (4) mapping predictions of nonconverging cases onto the decreased convergence class instead. the four cases are labeled [w, x, y, z] respectively in the code. these statistics are calculated over ranges of 20 consecutive densities, 5 ranges total over the 100 densities. output is the prediction success rate over the four transforms, with the success rates of the 2 predictors in each transform, reported as a triple with the 3rd number as # of points.
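a rough sketch of how one success rate under the four transforms could be computed. this is not the post's code: `success` is a made-up name, and two readings are assumptions, namely that “skipping” filters on the actual class (not the predicted one) and that transform z remaps ‘x’ to ‘-’ on both sides.

```ruby
# actual, predicted: arrays of class labels '+', '-', 'x'
# returns [success_rate, number_of_points] for one transform
def success(actual, predicted, transform)
  pairs = actual.zip(predicted)
  case transform
  when :x   # assumption: skip cases whose actual class is nonconverging
    pairs = pairs.reject { |a, _| a == 'x' }
  when :y   # assumption: skip cases whose actual class is decreased
    pairs = pairs.reject { |a, _| a == '-' }
  when :z   # assumption: remap 'x' to '-' in both actual and predicted
    pairs = pairs.map { |pr| pr.map { |c| c == 'x' ? '-' : c } }
  end       # :w is the straight comparison, nothing to transform
  return [0.0, 0] if pairs.empty?
  hits = pairs.count { |a, q| a == q }
  [(hits.to_f / pairs.size).round(3), pairs.size]
end

actual    = %w[+ - x + x]
predicted = %w[+ - - + x]
%i[w x y z].each { |t| p [t, success(actual, predicted, t)] }
```

note that the point count shrinks only for the two skipping transforms, which matches the smaller 3rd numbers in some of the triples.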
overall this data is rather complex but the basic story is that the simple predictors (essentially attributing most of the convergence properties to ‘wr’ component and its sign) work very well or even perfectly for the 1st (low) half of densities, and then degrade over the 2nd (high) half, but are still relatively effective. clearly the density plays a huge role in class prediction, and also the prediction/ error (that is, for class) is only degrading/ breaking down at high densities. lurking behind all this is the general drive/ initiative to create a new model with new variables, and this helps prioritize how to look at its accuracy. also immediately it suggests trying a new variable something like (1 – d) (‘d’ density) that behaves in a countering/ opposing/ “inverse” way to current density variable. also, taking the best of every set of predictions below comes close to a worst prediction ~60%.
0...20 [[0.850, 0.750, 420], [0.833, 0.833, 378], [0.850, 0.750, 420], [0.850, 0.750, 420]]
20...40 [[1.000, 0.750, 420], [1.000, 1.000, 315], [1.000, 0.750, 420], [1.000, 0.750, 420]]
40...60 [[0.900, 0.350, 420], [0.875, 0.875, 168], [0.947, 0.368, 399], [0.950, 0.400, 420]]
60...80 [[0.402, 0.498, 420], [0.295, 0.587, 356], [0.535, 0.332, 316], [0.752, 0.498, 420]]
80...100 [[0.200, 0.150, 420], [0.300, 0.300, 210], [0.158, 0.105, 399], [0.583, 0.450, 420]]
(6/8) another (maybe more basic/ natural) way to look at it with the dichotomy instead of 3 classes, the meta function can converge or nonconverge over the different starting seeds, and the perturbed function can converge or nonconverge. this can mean a false positive or false negative in the prediction of convergence (‘true’, ‘false’ columns respectively, note that there can be no false positive if every actual case converges and they only occur for n=2). this analyzes over 20 densities, and the ‘x’ column is # of nonconverging cases for the meta function. ‘n’ column is repeated iterations of the collatz mapping. ‘r’, ‘e_a’, ‘e_m’ are as previous for the ‘wr’ fit. this is a simple way of generating “slightly different” meta function fits.
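the three rate columns can be sketched like this. the column semantics here are my reading of the output, not confirmed code: ‘false’ = meta converges but the perturbed run doesnt, ‘true’ = meta doesnt converge but the perturbed run does, ‘eq’ = both agree, so the three always sum to 1. `rates` is a hypothetical name.

```ruby
# meta, pert: arrays of booleans (true = converged) over the same cases
def rates(meta, pert)
  return { 'false' => 0.0, 'true' => 0.0, 'eq' => 0.0 } if meta.empty?
  n = meta.size.to_f
  pairs = meta.zip(pert)
  {
    'false' => pairs.count { |m, q| m && !q } / n,   # meta converges, perturbed does not
    'true'  => pairs.count { |m, q| !m && q } / n,   # meta does not, perturbed does
    'eq'    => pairs.count { |m, q| m == q } / n     # both agree
  }
end

meta = [true, true, true, false]
pert = [true, false, true, true]
p rates(meta, pert)
```

under this reading the ‘true’ column can only be nonzero when the unperturbed meta function itself fails to converge somewhere.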
❗ ⭐ curiously for n=2 there was total nonconvergence of the meta function (x=20) but otherwise, it was total convergence (x=0). ‘eq’ column tracks the fraction of cases where the perturbed function matched the meta function convergence. it reaches a remarkable 93.7% for n=4 and 90.1% for n=16, and n=27 and n=28 are 92.5% and 94.3%, the latter being the max! note ‘e_a’, ‘e_m’ basically gradually increase and ‘wr’ correlation coefficient ‘r’ increases and has a maximum of ~.91 at n=16. but this is counterintuitive, because contrary to expectation, decreasing accuracy of the meta function fit measured in the steadily increasing avg and max error does not mess up the stability (roughly/ basically, ‘eq’ value), and sometimes improves it. although maybe the improving correlation coefficient ‘r’ explains most of this? could stability be correlated with the correlation coefficient? and anyway what is going on with the good cases n=4,16? ❓
a glitch discovered early on in developing this code and reflected in its approach/ angle/ direction: note that some of the prior code doesnt handle correctly the case where sometimes the unperturbed meta function is nonconverging & eg prior code can get nil exception in that case. had spent a lot of time finetuning the meta functions to guarantee convergence & it became very reliable, almost an assumption, and sort of forgot momentarily that it can break. but this is all helping with figuring out the generality of the techniques.
r      e_a      e_m     x     c     false   true   eq     n
0.634  0.00572  0.0162  0.0   20.0  0.421   0.0    0.579  1.0
0.836  0.00616  0.0278  20.0  20.0  0.0     0.318  0.682  2.0
0.786  0.00945  0.0309  0.0   20.0  0.282   0.0    0.718  3.0
0.826  0.0106   0.0363  0.0   20.0  0.0627  0.0    0.937  4.0
0.769  0.0144   0.0417  0.0   20.0  0.157   0.0    0.843  5.0
0.86   0.014    0.0321  0.0   20.0  0.147   0.0    0.853  6.0
0.865  0.0142   0.0417  0.0   20.0  0.255   0.0    0.745  7.0
0.864  0.0156   0.0421  0.0   20.0  0.238   0.0    0.762  8.0
0.887  0.0168   0.0592  0.0   20.0  0.149   0.0    0.851  9.0
0.859  0.017    0.0531  0.0   20.0  0.137   0.0    0.863  10.0
0.896  0.0169   0.0547  0.0   20.0  0.213   0.0    0.787  11.0
0.839  0.0241   0.0693  0.0   20.0  0.214   0.0    0.786  12.0
0.891  0.0219   0.0522  0.0   20.0  0.128   0.0    0.872  13.0
0.851  0.0245   0.0597  0.0   20.0  0.148   0.0    0.852  14.0
0.84   0.0246   0.0665  0.0   20.0  0.131   0.0    0.869  15.0
0.909  0.0205   0.0678  0.0   20.0  0.099   0.0    0.901  16.0
0.861  0.0266   0.11    0.0   20.0  0.113   0.0    0.887  17.0
0.885  0.0264   0.0784  0.0   20.0  0.165   0.0    0.835  18.0
0.837  0.0311   0.1     0.0   20.0  0.131   0.0    0.869  19.0
0.869  0.0285   0.0889  0.0   20.0  0.196   0.0    0.804  20.0
0.861  0.0323   0.0892  0.0   20.0  0.134   0.0    0.866  21.0
0.851  0.0305   0.0866  0.0   20.0  0.151   0.0    0.849  22.0
0.815  0.0341   0.0947  0.0   20.0  0.173   0.0    0.827  23.0
0.861  0.0302   0.111   0.0   20.0  0.191   0.0    0.809  24.0
0.875  0.0336   0.092   0.0   20.0  0.178   0.0    0.822  25.0
0.88   0.0337   0.0899  0.0   20.0  0.0853  0.0    0.915  26.0
0.883  0.0306   0.112   0.0   20.0  0.0745  0.0    0.925  27.0
0.893  0.0348   0.104   0.0   20.0  0.0569  0.0    0.943  28.0
0.86   0.038    0.12    0.0   20.0  0.217   0.0    0.783  29.0
0.782  0.0476   0.143   0.0   20.0  0.0833  0.0    0.917  30.0
(6/9) 😳 tried that code again in a slightly different form which reruns identical code for n=1 and get substantial variance of about 10% or more in the ‘eq’ measurement. whoa! therefore a lot of the prior variance is not really due to much precision in the analysis (at least in that measurement, maybe less so in the others which have more steady trends, eg ‘e_a’, ‘e_m’, except for ‘r’)… have to do some major rethinking!
(6/12) consolidating/ synthesizing many of the prior ideas leads to this graph/ scatterplot and maybe a lot of new insight… & more 20/20 hindsight, it all seems obvious in retrospect. this code reruns the distribution and fitting code which as just noted leads to some significant/ inherent variability. the idea here is to try to guess which error vector additions cause nonconvergence. ‘wr_e’ was found to sometimes be a very good predictor from earlier inquiries and this is a kind of new basic realization. the code sorts all the error vectors by ‘wr_e’ and determines convergence. convergence is seen to “correlate” closely with the sign and magnitude of ‘wr_e’. if ‘wr_e’ is negative the trajectories tend to converge. if its positive, they tend to nonconverge, and all roughly with a probability proportional to ‘wr_e’ magnitude (ie absolute value).
the graph format is a scatterplot requiring some explanation. there are 50 error vectors and plot points per run & 40 runs. each run corresponds to an adjacent set of 3 vertical points/ bars red, blue, green signifying nonconvergence, mixed, and convergence respectively. they are offset slightly in the graph to try to somewhat avoid overplotting. mixed convergence means of the 20 tested trajectories (over the 20 density gradations/ intervals), there was a mix of converged and nonconverged. also no bars are plotted for a run where not every trajectory for the (fit) meta function converged ie the sample is “thrown out” or “skipped”.
the vertical value/ position of points is just the ‘wr_e’ component of the error vector. the important takeaway from this is that while sign/ magnitude of ‘wr_e’ is a fairly good estimator of the overall (non)convergence, its not “perfect” (as someone near to me is always reminding me is a property of humans…) and the converged and nonconverged trajectories interpenetrate in a sort of yin-yang way. convergence fades as moving into the nonconvergent region and vice versa, but each are “embedded” in the other. the mixed cases (blue) tended to happen for ‘wr_e’ positive.
💡 ❗ ❓ a wild idea pondering some of this. on one hand one might presume that a meta function fit would always have positive and negative errors, and that would be the case with most typical methods that try to minimize sum of squared error or something similar. but maybe thats thinking “inside the box”. what is all this suggesting? it seems to be implying that if the meta function consistently overestimates the actual function yet still has consistent convergence, that is the “stability” property that is desired. this seems to suggest fitting some function that is the actual function adjusted with increase bias eg something like f(x) + c, c > 0 or c2 * f(x), c2 > 1 where f(x) is the actual function.
(6/15) ⭐ 💡 ❗ 😮 this is a fairly quick riff on prior code to accomplish that analysis for adding a small constant to the fit curve. there turned out to be a very subtle bug in not doing a deep copy of an array, fixed at the bottom of the model subroutine. the idea in this code is that it computes both models, the unmodified and the slightly modified fit, and does a comparison to see if there is any improvement (which would specifically be a decline in nonconverging seeds and an increase in converging seeds). the code “marks up” the array with the fit prediction and there was “cross talk” between the two fit routines.
there are 40 runs, 20 seeds over graded densities per run, 50 initial seeds/ error vectors, and this timeconsuming code took maybe over ~1hr to run. the graph output is the model results in pairs, graphed as a line from one of the pair to the 2nd. the 1st of each pair is the # of trajectories in the unmodified model and the 2nd is in the modified model. red, green, blue are nonconverging, converging, and mixed results respectively. if the fit adjustment works as desired then green lines will incline and red lines will decline, exactly the results of the experiment, consistent and even sometimes dramatic improvements. this is something of a breakthrough in showing improved stability of the meta fn in the direction of a proof.*
the graph seems a bit messy though and am thinking of other/ new ways to present/ visualize the results. but, also, full disclosure, the improvement comes with exactly the opposite adjustment as was predicted: fitting to a decreased function ie adding a negative delta d=-0.01, and still dont understand this right now (positive delta led to similar magnitude deterioration in stability!). next step ofc is to try to tune this delta to get perfect stability required for a proof.
😳 oops! on further thought maybe dashed that off a bit too quick! did you spot the wrinkle, lol? was thinking about this code during commute and realized its a bit mixed up. it basically computes and throws away a baseline computation at the beginning of the test routine over each of the 2 model evaluation/ subroutine calls.
(6/17) * 😳 😳 double oops! too good to be true! that code adds ‘wr_e’ errors from the 1st model to evaluate the 2nd model, definitely not what was intended; the 2nd model errors must be evaluated instead & basically have to disregard all those prior results as probably defective! (but am leaving the code just as a signpost of current directions, albeit wobbly.) am also delving into the background data of this algorithm and finding strange stuff. fixing the code to evaluate ‘wr_e’ errors for the 2nd model seems to have huge effect on the speed of convergence of the “baseline” meta function but almost no effect on the perturbed, error-added convergence. cant figure this out right now, mysterious! ❓ ❗
(6/19) ❗ 💡 after a lot of marking up the code with logging intermediate results, graphing them, and cogitating aka wracking my brain, think the answer is revealed. basically an additive or multiplicative* adjustment to the meta function leads to a corresponding “counterreaction” in the errors and then has no effect on the overall stability calculations, ie the two cancel each other out. theres a simple math proof of this but am not going to try to write it out right now. (admittedly had some vague suspicion from beginning that the approach was just “too easy”… reminding me of that devastating adjective, facile…)
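for the additive case at least, the cancellation can be checked numerically with a toy setup. everything here is a stand-in under stated assumptions: error is taken as (actual − fit) at each sample point, the perturbed evaluation is taken as (fit + error), and the curves `actual` and `fit` are invented for illustration, not the real ‘wr’ eqn.

```ruby
actual = ->(x) { 0.5 * x + 0.2 * Math.sin(x) }   # made-up "real" curve
fit    = ->(x) { 0.5 * x }                         # made-up meta fn fit
c      = -0.01                                     # additive adjustment
fit2   = ->(x) { fit.call(x) + c }                 # adjusted fit

# errors measured at some sample points, separately for each model
xs   = [0.3, 1.1, 2.7]
err  = xs.map { |x| actual.call(x) - fit.call(x) }
err2 = xs.map { |x| actual.call(x) - fit2.call(x) }   # each shifts by -c

# perturbed evaluation at some other point: fit + error vs fit2 + error2
x0 = 1.9
diffs = err.zip(err2).map { |e, e2| (fit.call(x0) + e) - (fit2.call(x0) + e2) }
p diffs.map(&:abs).max                 # ~0: the shift c cancels out
p fit2.call(x0) - fit.call(x0)         # ~c: only the baseline moves
```

in this toy version the unperturbed baseline shifts by c while every perturbed value is unchanged, which would be in line with the baseline changing a lot while the error-added convergence barely moves.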
anyway though, therefore some other route must be devised to improve stability. could it be that better fit of the ‘wr’ function (eg measured in avg error of that component) leads to better stability? seem to need a new metric to measure degree of stability associated with the ‘wr’ component. am thinking maybe just counting the percent of converging errors vs total count? and could this be shown to correlate with eg correlation coefficient of the fit? ❓
(6/20) ❗ 😮 this is a simple exercise, a quick riff on prior code data44d.rb that simply tests stability of higher iterations of the ‘wr’ MDE/RR eqn (also already explored some in data44.rb). this was just a shot in the dark but it turned out to be a bit eyepopping. its exactly the desired result; higher iterations of ‘wr’ eqn lead to higher stability as seen in the decreasing red points (along with some increase in mixed case blue points, but thats improved over/ compared to nonconvergent red points) and for n=~36 there was actually perfect stability (all green)!
its not entirely clear what this means at the moment. have to figure out what it means to increase iterations of ‘wr’ eqn relative to the other variables. an immediate question is what happens when all the variable eqns are advanced by the same iteration count. note also the error magnitude steadily increases as in the prior data44.rb ‘e_a’ and ‘e_m’ variables. this experiment can be thought of as exploratory/ conditional, something along the lines of, “if the collatz fn was slightly different in the following way, then stability would be more attainable”.
💡 simple conjecture, suspect this behavior may be due to a greater downslope in the ‘wr’ eqn fit, & also fewer upslope cases.
(6/21) this is a simple riff, 1-line chg that has been going through my head lately, have tried it a bit earlier but didnt post it. it uses the geometric mean of the multi ‘wr’ iterations instead. it doesnt have as big of an effect, and doesnt improve stability much in general, but its another way to adjust the meta function, is probably closer to the real function, and does seem to have a nearly perfect stability run (again?) for about n=~36 iterations. on the surface its a strange/ unexpected coincidence wrt prior run.
(6/22) * this code rescues the illfated/ halfbaked data45 approach with a slightly different angle. it turns out that a multiplicative adjustment on the meta function coefficients for ‘wr’ does introduce changed behavior wrt stability albeit somewhat unpredictable. this introduces a factor ‘f’ that adjusts all the ‘wr’ coefficients excluding the constant. ‘f’ starts at 1.0 (no adjustment) and increments by 0.01 each iteration. it also fixes the model evaluation to not mix up the errors from the two models. the idea here is to be more efficient by not recomputing the entire model because all the fit variables other than ‘wr’ have the same logic/ evaluation, although that reevaluation is not really very expensive compared to the trajectory traversal computations.
the runs are offset by 0.1 x axis increment/ delta in the plot. over 50 iterations ‘f’ therefore ranges from [1.00..1.50]. it leads to improvements (green line converging count upslopes, red line nonconverging count downslopes) but only relatively modest of about ~1-5 nonconverging trajectories changed to converging. also the ‘y’ position shows the significant variability between runs. the experiment is basically successful in demonstrating possible improvement via simple altering of the meta function.
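the coefficient adjustment itself is simple and might look something like this. a hedged sketch only: the hash layout, the 'c' key for the constant term, the coefficient values, and the `scale` name are all hypothetical, not taken from the actual model code.

```ruby
# coefs: hypothetical hash of 'wr' fit coefficients, with the constant
# term stored under the key 'c' (an assumed layout)
def scale(coefs, f)
  coefs.map { |k, v| [k, k == 'c' ? v : v * f] }.to_h
end

coefs = { 'a1' => 0.12, 'dh' => -0.3, 'c' => 0.05 }   # invented values

# factors 1.00, 1.01, ... 1.50; computed from the index (not by repeated
# adding of 0.01) to avoid accumulating float error
fs = (0..50).map { |i| (1.0 + 0.01 * i).round(2) }

fs.first(3).each do |f|
  adjusted = scale(coefs, f)
  p [f, adjusted]
  # ... the stability evaluation of the adjusted model would go here ...
end
```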
this graph has a counterintuitive aspect that took awhile to comprehend even while staring at it. it may look somewhat like there are upward and downward trends in the green/ converging and red/ nonconverging segments. in a sense there are, but not exactly as expected. what is happening is that moving rightward, more lower-starting (converging) segments are discarded (correspondingly, more higher-starting nonconverging segments discarded) due to the 2nd model evaluation not resulting in all converging un-adjusted trajectories. ie there are more/ wider gaps and what remains fits into a visual trend. ie in a sense its a trend due to selection/ bias over random data variability.
also though it appears that exactly as applied effect increases, the discarded runs increase, but also improvements do not seem to increase, suggesting this approach cannot reach the desired perfect stability (strange, “antisynchronicity” how this tradeoff apparently exactly stymies the goal). however, this selection bias leading to apparent “improvement” gave me an idea for the next algorithm. 💡
⭐ ⭐ ⭐
(6/23) ❗ 😮 this code looks like a veritable breakthrough! this finds a “fully stable meta function fit” ie convergence over all the sample points and errors for each sample point! had a shift in thinking/ pov that have arrived at (what can be realigned/ reframed as) yet another optimization problem.
the code is not very complicated wrt prior logic, just a few basic twists. its a heuristic to try to improve the stability using 2 basic ideas. the code has a measurement of “outlier” points wrt the curve fit and stability in the ‘wr_e’ value. but notably, there is significant variability in the meta function fit for different data samples. can this “natural” variability be used to some advantage? this uses a rather ingenious heuristic approach. the basic idea is that samples determine the meta function and maybe careful sample selection can actually “steer” the meta function toward the desired result (full stability).
there are 2 scenarios: (a) the unperturbed meta function sometimes doesnt converge, or (b) it does but does not converge for some of the (“problematic”) error vectors. so this code handles/ responds to those two cases. in case (a) it resamples 5 of the “most outlier” points ie the largest ‘wr_e’ values. for (b) it resamples the problematic error vectors (each corresponding to sample points) that didnt converge. surprisingly, this code works! the bit width of samples is 100 and sample count is the 1st command line parameter. nearly identical code as follows all converged for widths [20,30,40,50,60,70,80,90,100]! (fineprint, detailed next: it seems in some cases maybe “barely”…)
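the selection step for the two scenarios can be sketched as below. this is a simplified guess at the logic, not the real code: `to_resample` is a made-up name, “largest ‘wr_e’” is assumed to mean largest signed value (not magnitude), and the actual resampling and refitting are left out.

```ruby
# wr_e:        array of 'wr' errors, one per sample point
# baseline_ok: did the unperturbed meta function converge on every seed?
# failing:     indices of error vectors whose perturbed runs did not converge
# returns the indices of sample points to resample next
def to_resample(wr_e, baseline_ok, failing, k = 5)
  if !baseline_ok
    # case (a): the k "most outlier" points, assumed = largest signed wr_e
    wr_e.each_index.max_by(k) { |i| wr_e[i] }
  else
    # case (b): exactly the problematic error vectors
    failing
  end
end

wr_e = [0.01, 0.05, -0.02, 0.03]
p to_resample(wr_e, false, [], 2)    # case (a)
p to_resample(wr_e, true, [2])       # case (b)
```

an empty result in case (b) would correspond to the fully stable fit, ie the stopping condition of the loop.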
following is some analysis of the width 100 case, total stability over all 100 samples obtained at ~120 iterations. there are 2 graphs. the two different nonconvergence/ adjustment cases (a), (b) are plotted in red, green respectively. 1st graph is the density of points that are resampled. the algorithm tends to alternate between the (a), (b) regimes and focus on resampling increasing density points although there is a “late” resampling of points over a wide range of densities right before convergence. the 2nd graph is the ‘wr_e’ values that are adjusted. it goes through alternating expansions and contractions. these graphs are somewhat reminiscent of some other older experiments that would traverse the collatz reverse tree and lead to alternating regimes.
it is an unequivocal “godsend” to find even a heuristic function that works, esp considering how nontrivial it is to increase stability, and hence am reporting all this immediately, “as is”, somewhat rough. however suspect immediately that there could be some improvements on this convergence technique. it appears that sometimes the algorithm gets focused on changing only a few of the same-(high-)density points for the (b) resampling, sometimes only 1, a “precarious” position with high potential for the dreaded “paint into corner/ trap”. it looks like the algorithm could possibly get stuck doing so. it did once get stuck on only changing the “max-density” seed which had “pure” density=1 and therefore was not able to alter that seed, and so changed the code slightly (in dist method) so that the density=1 case is excluded (this amounts to randomly locating a 0 bit in an otherwise full run of 1s).
also maybe changing only 5 of the worst ‘wr_e’ outliers for (a) case is too few, am thinking it should be more something like 25% of the total sample count. so lets see whether these ideas improve convergence time. another obvious test now is whether other random seeds other than those in the solution (“out-of-sample/ test data” outside of “training data”) still lead to convergence (the entire point of the algorithm/ exercise).
this code has some other new ideas. for the nonconvergent nonperturbed case (red), it chooses only the top quarter of positive ‘wr_e’ vectors to change. for convergent nonperturbed case, if it tries to apply the same subset of vector adjustments, the code finds the largest ‘wr_e’ component vector that is not in the previous adjustment and adds it to the list, to try to avoid painting-into-corner. this code found solutions for sample counts of [50,60,70,80,100] and didnt converge after ~350 iterations for count 90. it converged rather quickly for count 80 and count 100.
so overall/ bottom line the convergence comes out as a bit random, sensitive, and fragile. however, on the other hand, one observation is that the heuristics could fairly quickly find sets of samples where only a few were nonconverging. graphs below are for count 60 run which converged in ~150 iterations. there is less order in ‘wr_e’. theres an outlier iteration at about ~115 where the unperturbed model passed but the perturbed model had many nonconverging points (maybe the majority) which were all adjusted, over the broad density range instead of more confined to higher and including many negative ones, somewhat like prior experiment around iteration ~120.
but this new code/ experiment gave me some new ideas/ insight on what is actually desired vs what the code is optimizing for. the real goal is to find a meta function for which any seeds chosen at random converge. that is not exactly the same as trying to find some set of seeds for which some meta function converges (what the current heuristics are aiming for, and finding). not exactly sure how to proceed at this moment. but as thought earlier, each of the “found” models could be tested for its general performance over out-of-sample seeds aka test instead of train points. am suspecting that the models found may not be all that robust wrt test points. also wondering if maybe they tend to fail more (ie nonconvergence) for the high density seeds because from training runs at least those seem to be “harder” to “stabilize”.
actually do have an immediate idea of trying a genetic algorithm! another possibility is that as long as many different candidate meta functions can be generated that are built by these heuristics and pass these “in-sample” tests, maybe 1 can be found scattered among them with high stability (wrt testing out-of-sample points)… lots more stuff to try out/ directions to go in… ❓
(6/24) 😳 😦 😡 👿 the answer comes quickly (at least minimal suspense), and it dashes hopes. this code is modified to generate a model and then save it/ the coefficients to a file. the 2nd time the code is run it reads/ analyzes the model. it picks random samples and reuses the scan analysis logic in the code without any modification of the routine. it outputs converged or mixed/ nonconverged results as either green, red respectively (it would be better to again plot the mixed results as blue but dashed this out and it wont alter the results much), here over 50 iterations.
here a model was generated for the default 50 samples and then analyzed. as expected most of the nonconvergence is with high density seeds which are proving to be the “hard” cases. havent written up this aspect yet, but have observed from looking at intermediate data logging that the high density seeds are apparently exactly those that lead to nonmonotone trajectories ie those with long(er) “glides” and lower density seeds may have no glides at all.
there is some good/ slightly consoling news in that the model is consistently showing convergence for nonnegative/ positive values up to about ‘wr_e’=0.01, but otherwise the bottom line is that the prior code to find perfect stability models is apparently entirely subject to overfitting or biased sampling, and the final optimized model does not seem to improve the model much (at all?) over the baseline model generated without the expensive/ timeconsuming optimization. ouch!
after a determined burst/ drive/ advance, this feels like something of a major setback, not clear what the next direction is yet, maybe need to try to look for some kind of patterns in this background data and/or am thinking about some kind of genetic algorithm for optimization but details are still hazy right now.
(6/25) ah, this is an obvious idea in retrospect but had to write a little logging code to reveal it and then “make it obvious”. the in-sample ‘wr_e’ errors are much smaller than the out-of-sample values. so the optimization is essentially just finding biased/ unrepresentative sample points that closely match the model so to speak. somewhat like that japanese saying, the nail that sticks out gets hammered down. mulling over how that tendency therefore needs to alter the strategy.
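the comparison behind that observation is basically a mean absolute error over two sets of points. a minimal sketch with a hypothetical `mean_abs` helper; the numbers are invented placeholders just to make it runnable, not logged results.

```ruby
# mean absolute value of a list of 'wr_e' errors
def mean_abs(errs)
  return 0.0 if errs.empty?
  errs.sum { |e| e.abs } / errs.size.to_f
end

# placeholder values only, NOT measured data
in_sample  = [0.002, -0.001, 0.003]
out_sample = [0.02, -0.015, 0.03]

p mean_abs(in_sample)
p mean_abs(out_sample)
p mean_abs(out_sample) / mean_abs(in_sample)   # ratio > 1 suggests biased in-sample points
```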