this is a hazy idea thats been pulling at me for several years, finally came up with a way to solidify it and then decided to try it out. results are unfortunately lackluster so far but it was a relief to finally see how it works (otherwise would just keep wondering if its the “one that got away”!). and anyway think the overall conceptual idea is at least somewhat sound and still has promise maybe with some other revision/ angle.
the prior runs showed that theres a roughly predictable linear relationship between decidable metrics for each iterate and the glide length (“horizontal ratio”). so, one of my rules of thumb discovered the hard way over the years, once its proven at least a simple machine learning algorithm works, then one can look at more sophisticated ones that should at least outperform linear regression. (and conversely, if there is not even a weak linear relationship present, it seems unlikely that machine learning will “pull a rabbit out of a hat”. this is surely not an absolute phenomenon, but it seems to hold in practice, and think it would be very interesting to isolate exceptions, suspect they are rare.)
the idea explored here on specific collatz data is a general one for machine learning. suppose that one does not have very many coordinates in ones data. each coordinate is something like a “feature”. and one would like to increase the number of features using only the supplied ones. this is similar to what a neural network does but typically each neuron has many connections. one wonders, can it be done with few connections? the smallest case is 2 connections. is it possible that only 2-way analysis of input data, built on recursively, could have predictive capability? the answer here for this data is mostly no but maybe theres more to the story with further tweaking. also it might work for some other type of data.
a general question here is, can multi-way neurons be constructed out of nonlinear 2-way ones somehow? what kind of data supports that?
clearly just a multiplicative function of the 2 input variables does not scale recursively; new functions are linearly dependent. but what about a nonlinear function? this code constructs a (“nonmultiplicative/ nonlinear”) 3-nearest neighbor radial basis function on the 2-way “prediction unit”. there are n^2 possible 2-way functions and this code just greedily/ recursively adds the one with the lowest error. finding nearest neighbors even for the 2-way case is computationally expensive and it was hoped that the complexity might have some redeeming value/ capability over the simplicity of the inputs.
the plot is train error in red and test error in green, vs # of 2-way function iterations. the train error does go down, but the test error is not apparently very much connected at all to that trend, but maybe not totally disconnected either. it seems to be noisy as the train error declines gradually and then both level out. this is based on data similar to prior runs. there is a variable ‘$m’ which determines the fraction of data in the train vs test set, here 1/40. the full data point count is 3000. the absence of any decline in the test trend is disappointing! 😦
then was thinking of some way to salvage this. another idea that immediately occurred to me. looking thru a machine learning book, it does a K-means clustering prior to doing the nearest neighbor radial basis calculation. so maybe thats a way to go.
(1/9) 2nd thought, maybe there is some improvement effect there on test data within 1st 20 iterations with error results ranging from ~800 to ~400, half as much. its just that the results looked so random. expected to see a smoother/ concave upward curve. need to analyze that further (eg more runs, validation + test data etc) to figure out if its “real”.
had a really cool idea about looking at all possible n-way prediction units instead of just the 2-way ones (still using the same radial basis function construct) and coded it up yesterday, need to run it some more & post results shortly.
(1/11) this is that code and a few runs. unfortunately the runs diverge substantially. its not clear if this is due to the randomness in the input distribution (this time 2000 points) or the fitting algorithm and it may take substantial further effort to figure out. there are some other ideas in here. the code generates 50 new random n-way prediction units but adds them to a cumulative list that is not thrown away each time, to avoid randomly re-encountering/ recalculating prediction units. it resorts the cumulative array every 50 new entries and then chooses the best in the list.
looking at the output, it continues to choose many 2-way prediction units and also larger ones of around a half dozen coordinates, some more. also changed the code to weight different based on distance, 1/d instead of the prior (1-w) idea which seemed to have a major effect on decreasing the error. suspect this might be due to slightly different logic for exactly matching neighbors in the radial basis function & need to analyze that further. it could be there are many exact matches in this data.
the 1st run was gorgeous in its shape and showed a nice declining improvement in both train and test error, although have no idea about the absolute error size yet. the 2nd run was much different and the train error did decline within 3 iterations but then started increasing. its definitely disappointing in comparison. if it had at least declined for more iterations it would have been more aesthetic. but at least it does look more like classic train vs test curves. a 3rd run has only increasing train error at/ after the 1st iteration. the 1st trend now looks like an unusual case. an immediately apparent enhancement/ next step for the code would be to compare these results versus the linear regression as a baseline.
😳 working on some new code, noticed that this code is basically/ probably/ mostly “using an identical point to predict itself” so to speak as a result of the loops and the inner loop not excluding the same prediction point from the nearby point list. thats basically a bug & it doesnt make too much sense; the way to go seems to use anything/ everything but a point itself to predict itself. its also a way of thinking about radial basis functions in the sense that a “perfect” function has no error and is actually to be avoided because it doesnt generalize, its a sort of extreme/ limit case of overfitting.
(1/12) this is a major refactor of the code and some new features.
- the “equal point” logic is changed to average over all equal points found, otherwise do the 1/d distance weighting.
- the code is refactored to use hashes with variable names instead of lists with variable positions, which improves readability some, although am never gonna win any awards for the overall “quick-and-dirty” style used.
- the code splits the new prediction unit sampling to half those with the last-added prediction unit and half those without, and computes statistics (avg, stdev, min, max) on these two sets. its useful to compare averages etc.
here are 3 runs. the results seem a little more coherent and it seems like test error is starting to better match up with train error at least in 1st two graphs, and a minimum in test error occurs after the start, but generally fairly early like ~5 iterations. last graph the test error seems rather noisy and uncorrelated with train error. these runs have 1000 data samples/ points. one other observation is that the larger n-way prediction units are definitely preferred by this logic. another observation is that the n-way sample for the new unit consistently outperforms the sample not using the unit, and this to some degree validates the general strategy.
again its hard to tell whether the randomness is due to data points or the learning algorithm. at least train error shows a relatively consistent downward trend. also the total error is higher for both train/ test for more samples and this seems to indicate the generation algorithm is finding larger samples with more noise or at least harder to predict.
this code moves up the samples to 2000, tries out train/ test data as 3:1 split (ie every 4th point is test). it has some code to pick nonoptimal solutions better than average with the idea that maybe being too greedy can actually be a drawback. it picks a range of solutions better than average, incrementally close to optimum over 20 iterations, and then picks only the optimum after 20 iterations. the test error seems still nearly as noisy and not sure why this is or how to decrease it right now. the 2nd graph is a very long overnight run and again test error seems to reach a minimum early on around ~10 iterations, peak around ~50 but then go through a long slow slide. train error seems to improve only up until about ~120 and its surprising to see test error go down even as train error apparently plateaus/ flatlines!
(1/13) this code is refactored some to allow 3 samples, the test, validation, and training points. it has some logic to reshuffle the test/ validation points for every potential prediction unit evaluation (not sure where that trick originates), and the test points are never evaluated by the algorithm wrt training. also the data points are shuffled prior to running the algorithm. it seems quite unlikely that this would have an effect, but it seems like it might, ie maybe it alone decreased test evaluation noise. in other words sampling every 1/n points for test may introduce some subtle bias. to isolate these subtle effects might take very many runs to see statistical differences/ trends. the code to smooth the descent over 20 iterations is disabled and its back to greedy.
results seem to successfully/ substantially tighten the correlation/ decrease noise between test, validation, train (red, green, blue)! however there is some expense in that train error does not decrease much and/or decline is more gradual. but then wonder maybe the large prior decreases were not “real” anyway.
(1/14) this code has more new features/ ideas at least worth trying out.
- ( 😳 ) the the distance weighting algorithms (1 – w) and 1/d are chosen at random for evaluation and the (1 – w) logic is revised some. closer examination, the prior (1 – w) algorithm probably had some major incorrect behavior for different cases and in retrospect its surprising it worked much at all. it might have been what is intended only for weighting 2 points. for example if theres one exact match and one inexact, the inexact match is weighted 0. but if theres one exact match with two inexact matches, the two inexact matches will be weighted with nonzero weights. makes perfect sense right? 😛
- the # of nearest neighbors is also chosen at random from 1..10.
- since evaluation of a single prediction unit seemed to be noisy, there are two different methods “near3b” and “near3” to aggregate over 3 evaluations hopefully for better accuracy/ precision. the 2nd can easily choose the top, median-of-3, or worst evaluation (current setting).
variations on this code did decrease train and validation error but didnt seem to decrease test error. so am now thinking maybe have to roll back some of the changes to get back to decreasing test error. however, it was a very marginal decline in the last code and only in 1 run (2nd), where another run (1st) only increased test error. am starting to suspect the reshuffling of train/ test points too frequently may be to blame. the runs are very long at this point and its not easy to get quick feedback on design choices any more. at least it is much easier to disable features than to implement them. 😐
am looking over all the prior results, pausing for a moment, and noticing that the test error decline is really just not very convincing almost anywhere… a less delicate adjective might be “flimsy“. maybe got a little carried away there/ jumped the gun! am thinking at this point need to get “back to basics”. need to just decrease test error in any way possible with the simplest possible algorithm to begin with, and then try to improve that. also, it needs to consistently find a test improvement. in retrospect, was trying to improve on slush so to speak, the opposite of a firm foundation… as the saying goes “too clever by half”… maybe can reuse the code later after building something more solid/ basic to begin with. so, then, in short, how to “separate the baby from the bathwater”?
(1/15) this is some simplified/ pared-down code that just looks at the “sort-correlation” between ‘h2’ (red) and radial basis function predicted values (green), similar to prior depictions of ‘h2’ vs regression curve fit. some rough correlation can be found but theres also a lot of variance between runs even to the point the prediction sometimes looks nearly random. this uses 3 nearest points and ‘d0’ distance weighting, and all 7 metrics/ coordinates. honestly, bottom line, seems this expensive prediction logic does not seem to outperform linear regression at all and even seems to be noticeably inferior. some defense, this nearest neighbor code does not use a “same point” data for prediction the way regression does.
heres a study of n-nearest neighbor count effect on prediction accuracy and d0/d1 (green, red) distance weighting metrics for n from 1 to 50. d1 outperforms d0 for low values and d1 nearly declines asymptotically although both show late uptrends. its nice/reassuring to see a quick basic affirmation of the fundamental nearest neighbor/ radial basis algorithm & the smooth trend after all the prior dispiriting/ expensively-obtained noise (both in terms of coding effort and computational load). it also shows the max 10 nearest neighbor count used earlier is probably a bit too low, a better value might be ~15.
pondered a little what to do next. was next trying to figure out how to view correlation in general between test/ train for the n-way prediction units, is it really there? think this approach is somewhat clever and results highly encouraging. there are 7 coordinates and 2^7 possible subsets. one naturally wonders how train/ test error correlates between them. this code evaluates all of them for r=10 neighbors and the d1 weight algorithm, and then the graph is ordered by test error (red) with corresponding train error plotted in green, and count of n-way elements in blue on 2nd right axis which naturally declines left-to-right. 1:1 train/ test split. results are interesting and tend to vary with each run, which shows the sampling dependence/ variation.
in the 1st case the test error is consistently larger than the train error, almost nearly by a constant! in the 2nd, train error flatlines in the 1st region even as test error gradually increases with test error more expectedly underestimating train error, and in the 2nd region there is a flip and surprisingly test error consistently overestimates training error! in the 3rd theres a near “vice versa” reversal of those two regime trends flipping the underestimation and overestimation. graph 4 the train error is nearly equally staggered around the test error except only for very low/ high.
admittedly these results tend to point toward the importance of some better sampling algorithm because the prediction algorithm seems to be affected, even highly impacted by some unknown sample tendencies/ groupings/ classes, and results would be expected to vary substantially based on the class.
(1/18) 💡 ⭐ 😀 😎 ❗ this is some really cool code based on a very ingenious idea! the prior code had a major assumption. it assumed that adding a new vector representing a feature (aka “prediction unit”) would improve prediction. that feature chosen was simply the prior prediction. however, probably few prediction algorithms have this “recursive” property that adding its own best prediction to its logic will improve its prediction. linear regression certainly does not have this property. and the lackluster results suggest prior logic/ algorithm doesnt either.
so then the immediate question, what feature vector could be added to improve prediction? the remarkable idea here is that gradient descent can be used to solve for this unknown “feature” vector! in other words the hidden feature is simply the completely arbitrary function (sampled at the same positions) that when added to the radial basis function calculation, minimizes prediction error!
the details are rather tricky wrt radial basis function code, but basically depending on the expected values versus the predicted values, there is some hint on which way the new coordinates should be positioned. there is a “delta” that would affect relative distances. then the coordinates can be adjusted in pairs as either “pushing apart” or “pushing together”. tried it out, and it works! it decreases both train and test error in a very solid way, and moreover the # of iterations for convergence is low, typically under a dozen or so. that rapidity was surprising!
here is a graph of results. 1:1 train/ test split. the lines are the train/ test error (red, green) for iterations within each new vector added. ie at end of each separate line, the new feature vector is added. the algorithm terminates when there is no improvement in train error. feel like some general/ promising new theory might lie behind this! note this code is computing the radial basis function over all features and maybe some improvement can be had by limiting to a subset of features as earlier.
😀 ❗ this code has a 1:2:3 train/validate/test split, and also the “final train” data for the final test evaluation is the train+validate points instead of just train points alone (as in some earlier code). it also uses the reshuffling logic. all the code seems to come together in a very solid way and behaves exactly as designed/ intended, and there is a major improvement/ decrease in curve fit error (~50%!), and tight connection between train/ validation/ test error (red, green, blue), and its repeatable! cant really ask for any better results! the validation error as sometimes/ repeatably less than test or train seems like incredibly good luck! only a tiny quibble that maybe the 3rd run might have terminated too soon, maybe it needs a few retries on adding features. the last idea would then be to try to improve results some using feature subsets.
(1/19) this is some ambitious code that marries many of the prior ideas together. wanted to see how some different approaches work but its hard to compare them unless they run on the same sample data, which seems to vary a lot between runs. a simple approach would be to have a seeded random number generator (to generate same random run/ sample) and code that can be run with a command line parameter to change its behavior. however, then one has to decide how many iterations to run each algorithm whereas it seems useful to be able to terminate algorithms in the middle manually. so a more streamlined/ sophisticated approach is to run all the different algorithms in parallel. the code did not really support that immediately with all the use of global variables, but the refactor was not hard or timeconsuming.
this code has 3 different algorithms, each one with postprocessing that uses the hidden feature optimization addition. the 1st algorithm chooses top feature subsets of all possible subsets. the 2nd alternates between choosing subsets that include the latest feature and those that dont. the 3rd optimizes over all coordinates/ features. in this 1st run results are similar/ competitive/ excellent/ gratifying for all of them and it will take some more analysis to decide if one outperforms the others. 😀
(1/20) this code adds a 50 iteration “minimum buffer” on each of the algorithms linked to the test error to determine the termination condition and outputs the data for the best model found of the 3 algorithms.
😎 ⭐ ❗ this code analyzes the predicted vs actual curves for ‘h2’ and glide length via sort correlations as with the prior linear regression comparisons by loading the saved model. the curve fit is verging on fabulous! while there is underestimation in the ‘h2’ space for higher values (1st graph), an even more promising sign is that there seems to be no underestimation or overestimation, or apparent increasing error in the glide length space (2nd graph). but, this is misleading as seen in the 3rd graph which shows the error does apparently increase (with model seemingly more underestimating larger glides).
❓ a next step to confirm this further would be to do an adversarial attack on the model with the sequential bit flipping greedy algorithm to see if prediction error on glide length can be forced to indefinitely increase (ie without bound) for larger seeds, but that would be expected based on the graph. its also natural to wonder, do all such models have this underestimation property? it would seem that any model that did not might be nearly a resolution to the problem…. its also notable/ interesting that the underestimation at extreme values happens in both the ‘h2’ and actual glide length spaces which are otherwise apparently/ presumably not correlated much… more than a coincidence?
💡 other thoughts: how about instead of equally weighted errors across points, a machine learning algorithm that allows one to try to decrease the error on different points based on relative weighting? could it overcome this (biased) misestimation? a simple approach would be to use a function and inverse to transform the actual values away from the mean.
(1/23) 😳 😮 went to all the trouble to write some new regression fitting code for the current data, and then found it fit quite poorly… in the uncensored but colorful vernacular, crappy! was surprised! & was wondering quite a bit what happened and then decided to go back to original regression code for comparison/ replication. the 1st “successful” code was
fit5.rb and some variants. was a bit perplexed. looking back carefully, the prior regression analysis did not look at the ‘h2’ space much at all, instead focusing on the ‘ls’ (glide length) space which seemed to have some reasonable linear fitting. in retrospect/ 2020 hindsight, not looking into ‘h2’ fit was a major oversight/ gap.
fit5.rb code (very slightly, only changing the sort column) to look at the ‘ls’ space instead, got nearly the same results! the trajectory generation code is nearly the same with some minor differences. heres a graph of the ‘h2’ regression curve fit. (the “improve” neighbor logic has been ripped out.) the fit does not appear to be completely useless but basically reduces to a bit of lower estimation on the left and the rest is nearly entirely centered around the average. so the actual linear trend in the data was essentially very marginal to begin with! was possibly fooled somewhat by looking at regression coefficients that all looked meaningful.
two more observations: its quite counterintuitive that the ‘h2’ fit would be so much worse than the ‘ls’ fit, but ofc data science is always full of endless surprises and “curve balls”… another impressive aspect/ “silver lining” is that the nonlinear algorithm fits quite better even when there is relatively poor/ rather weak linear fit. another vindication is how effective sort graphs are in analyzing data/ algorithmic performance in general. it reminds me of spectrum analyzers and “bass vs midrange vs treble” analysis. maybe more than a mere analogy. even though this data is quite multidimensional, a lot of analysis can be gleaned/ reduced to clever use of 2d graphs, tieing into some of the raison d’etre of this style of attack, even the blog itself. “getting a lot of mileage out of it” so to speak 😀 😎
next, a possible direction, try to figure out what features are correlated with the signal/ non-average region in the fit; what is causing it to be shaped like that, and can the signal be further isolated/ improved on (eg via adjustments/ new features) so it somehow expands over the whole range? if it can it seems likely that the RBF nonlinear curve fit will pick up on the same signal.
(1/24) this is an elegant refactor of the code that has almost no functional change. the data is stored as lists of hashes with variable names as keys. it highly streamlines the code and analysis and think its so much more readable/ maintainable than trying to deal with columns as (numerical!) array indexes. there is some minor new function. the idea was to look at correlation between input variables and the predictions. the mystery is largely solved and it looks like a lot of the signal is correlated with ‘dlo’ variable. this is a plot of ‘dlo’ (red/ left tics) vs ‘h2p’ (green/ right tics) which is the predicted ‘h2’ value, the graph is sorted by ‘h2’.
(1/27) 😳 😮 the
fit15 variants (and
analyze based on them) are screwed up in a way that is not so obvious. ouch! did you spot it? was trying to think about how to do an incremental/ online version of the algorithm to set up for an (interactive) adversary-based use of it. and then started thinking closely about/ looking at the pairwise hidden feature adjustments. the code does not exclude the blind data from the calculation. big oops! am not sure how to rescue it right now, am having to think hard about how to do it. yeah, some things are just Too Good To Be True.™