collatz radial basis function back to basics

hi all. some extended wallowing in self appraisal/ reflection to begin with in this installment. last months installment of collatz had a big highlight, at least wrt this blogs history.  as has been stated in various places, & trying not to state the obvious here, part of the idea of this blog was to try to build up an audience… aka communication which (to nearly state the almost-canard) is well known to be a “two way street!” there are all kinds of audiences on the spectrum of passive to active, and in cyberspace those in the former camp are also long known semi-(un?)-affectionately as lurkers.

must admit do have some “blog envy” of some other bloggers and how active their audiences are wrt commenting. one that comes to mind is scott aaronson. wow! thought something like a fraction of that level would be achievable for this blog but now in its 5th year, and candidly/ honestly, it just aint really happening. have lots of very good rationalizations/ justifications/ excuses for that too. ofc it would help to have some breakthrough to post on the blog and drive traffic here through a viral media frenzy… as the beautiful women sometimes say, dream on… ah, so that just aint really happening either. 😐

however, there was a highlight from last month, for this blog something like a breakthrough, but also, as you might realize from the subtext on reading further, with some major leeway on where the bar is set (cyber lambada anyone?). got an anonymous, openminded, even almost/ verging )( on encouraging comment from someone who wrote perceptively and clearly had a pretty good rough idea of what was going on in that significantly complicated collatz analysis blog post, as if reading a substantial part of it and comprehending it, and getting to some of the crux/ gist of the ideas/ approach here. nice! 😎

(alas, full “open kimono”/ self-esteem-challenging disclosure… admittedly that is a very rare event on this blog, and despite immediate encouragement and my marginal/ long, gradually increasing desperation now verging on resignation/ acceptance, anonymous has so far not returned. this overall predicament is something of a nagging failure gap/ regret/ ongoing challenge wrt the original idealism/ enthusiasm/ conception of this blog. which reminds me, also, long ago there was an incisive/ discouraging/ naysaying/ cutting/ near-hostile/ unforgettable comment, and may get to “highlighting” that one too eventually as part of the overall yin/ yang balance etc after changing circumstances and/ or building up enough courage wrt my cyber-ego, long keeping in mind that other quirky aphorism, success is the best revenge…) 😈

anyway here is the comment again, suitably highlighted/ framed/ treasured forever at the top of this blog:

What is a “glide” and how is it related to the trajectory length? Have you defined it somewhere earlier? What are your input variables for the model? What’s the reason to believe that even if you have a good predictor for your “glide” it helps to prove the conjecture?

who was that masked (wo)man? now riding off/ disappeared into the sunset? can you not see some flicker of an unintentionally deep socratic/ zen question here? this commenter/ probable-mere-passerby has somewhat accidentally summarized/ cut to some of the core conjecture being explored here over several years…! ❗ 💡 😮

so anyway, jolting me out of the typically nearly solipsistic reverie/ stream-of-consciousness writing of this blog (momentarily!), this proves there is intelligent life/ consciousness out there, and in this case it only took a few years to make some brief, glancing, flittering, incidental contact with it. thx so much, anonymous & cyberspace! am feeling glimmers/ stirrings of enormous gratitude for this brief )( feedback, like “not all is wasted”. on the other hand, looking back at those rosy early days/ expectations at this blogs inception, nevertheless in a semi-crushed mood at the moment, am reminded of that old saying by (19th century!) general Von Moltke sometimes quoted in eg warfare or chess strategy, no plan survives contact with the enemy. another one from sometimes-sun-tzu-or-zen-like rumsfeld, you go to war with the army you have, not the army you want, ofc with shades of the lyrics to that old rolling stones song! 😮

⭐ ⭐ ⭐

ok, enough with the melodrama/ emotions (or those relating to publicity/ community engagement efforts), on to the latest installment. have been banging away on a lot of ideas related to general machine learning approaches on the collatz data from a radial basis function (RBF) angle. am very interested in recursive approaches. recently tried predicting/ fitting residuals of the RBF and didnt really get any results out of it; basically the residuals are entirely noise as far as the RBF is concerned, so to speak.
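roughly, the residual experiment had the shape of the following sketch (not the actual code from these runs; ‘rbf_predict’ and the toy ‘train’ data here are just illustrative stand-ins for the real distance-weighted predictor and the generated dataset):

```ruby
# toy stand-in data so the sketch runs; the real features come from the data generation code
train = Array.new(200) { xs = Array.new(4) { rand }; { x: xs, y: xs.sum + rand * 0.1 } }

def dist(a, b)
  Math.sqrt(a.zip(b).sum { |u, v| (u - v)**2 })
end

# basic "straight" RBF: inverse-distance weighting over the k nearest stored points
def rbf_predict(points, x, k = 10)
  near = points.min_by(k) { |p| dist(p[:x], x) }
  w    = near.map { |p| 1.0 / (dist(p[:x], x) + 1e-9) }
  near.zip(w).sum { |p, wi| p[:y] * wi } / w.sum
end

# stage 2: fit the residuals of a leave-one-out prediction over the train set
residuals = train.map do |p|
  { x: p[:x], y: p[:y] - rbf_predict(train - [p], p[:x]) }
end

# combined predictor = base fit + fitted residuals; in these runs the residuals
# behaved like pure noise, so the 2nd stage added essentially nothing
predict = ->(x) { rbf_predict(train, x) + rbf_predict(residuals, x) }
puts predict.call(Array.new(4) { rand })
```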

another idea tried: earlier code was computing a “hidden feature vector”, but one based on “cheating” somewhat by looking at the hidden/ blind data. then tried the idea of having the RBF itself compute the mapping from input data to the computed hidden feature. thought this was a very clever/ promising idea, but “reality intruded” (doncha just luv that expression, maybe a general theme for life etc!) & this also came up emptyhanded, because apparently as far as the RBF is concerned, the hidden feature is noise/ unpredictable from the input data… notable/ interesting findings, but still “null results”, too much of a hassle to write up in any more detail even though the code is quite involved/ delicate. (but did this morning just think up one more trick up my sleeve to try!)…

so for this RBF its a little disappointing/ frustrating to seem to have essentially no tunable parameters even after some massive effort in this direction, no shapely/ viable/ classic train/ validation/ test curves to stare at whatsoever, just a scattered/ desolate junkyard of laboriously-discarded full-of-many-moving-parts ideas/ “null results” (that word combination nearly an oxymoron!). 😦

nevertheless its been good exercise so far. and another way of looking at this is that maybe “straight” RBF is an inherently very effective machine learning approach that doesnt require much “training”. (another way of looking at training in machine learning is that its to try to summarize/ compress large amounts of data, whereas there is essentially no compression in RBF, at least in this version which includes every point in the dataset in the model in some sense, and that is likely an intrinsic part of the powerful qualities of RBFs.)
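to make the compression point concrete, a toy comparison (nothing to do with the real fit code, just an illustration): a linear model summarizes n samples into a handful of coefficients, whereas the “straight” RBF carries the full dataset around as its model.

```ruby
# toy illustration only: model "size" of regression vs a keep-every-point RBF
n, dim  = 1000, 5
samples = Array.new(n) { xs = Array.new(dim) { rand }; { x: xs, y: xs.sum } }

regression_model = Array.new(dim + 1, 0.0)   # dim weights + intercept, independent of n
rbf_model        = samples                   # the dataset itself, grows linearly with n

puts "regression stores #{regression_model.size} numbers"
puts "RBF stores ~#{rbf_model.size * (dim + 1)} numbers"
```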

so! without further ado, finally to the point! in this post, am starting here by getting/ scaling back to basics and just posting the basic RBF code which is not very complex but has been hammered at for days & has some new/ redeeming features. (and noticing just now after reviewing last post that never really did post that basic core code/ result, because was somewhat prematurely getting carried away/ jumping the gun with all the “bells and whistles” which mostly turned out to be duds…)

1st, the data generation is decoupled from the curve fit logic in this code, followed by the basic/ streamlined curve fit code. then theres a graph of the fit of ‘h2’; its noticeably better than the linear fitting from prior installments but has the same general shape, such that lower ‘h2’ values fit “not bad” but higher ones tend to get predictions of nearly average values. 2nd graph is the ‘ls’ fit. 3rd plot is the error in the ‘ls’ fit.

data.rb

fit20b.rb

fit20b

fit20bx

fit20by

⭐ ⭐ ⭐

(2/11) 💡 ❗ highlights! some )( evidence for non-total personal isolation/ reclusiveness… social media aka cyberfriends heather with her upcoming physics guest speaker session & DS mention collatz in physics meta. freudian slip? 🙂 😛

Careful with the Collatz conjecture. I can drive you mad.

Continuing to make progress with my two pieces of code (factoring/collatz). This has got to be one of the funnest projects I have ever worked on. 🙂 *

also here in a fun chat on collatz & other misc topics, heather promised to run every ruby program on collatz on this site… that could take a while, am not gonna hold her to that one! but, “dream on!” 😛 o_O

⭐ ⭐ ⭐

(2/14) 😳 😮 😦 o_O 😡 👿 crushing! setback! weeks of chasing ghosts/ phantoms! back to the drawing board! @#$& was always a bit suspicious of the data distribution. had a closer look. there is an initial output of a lot of points that have ls=2 but various ‘h2’ values. this is a quirk of the generating process. a tiny change to exclude the trajectories with ls<=3 leads to the following code, and refitting with fit20b.rb gives total noise prediction of ‘h2’ centered around the average. in other words there seems to be no predictive value in the input variables in their current form, and the prior predictability was due to the skewed/ biased distribution. 20/20 hindsight; in future will look more for bias in the distribution before jumping the gun and playing with fitting…! (is anything salvageable?) 😥
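the gist of the change, as a rough sketch only (datab.rb below is the actual code; ‘glide’ here is the usual count of iterates until the trajectory falls below its starting seed):

```ruby
# rough sketch of the filter idea only, not datab.rb itself:
# drop the degenerate short glides before they skew the fit

# glide length 'ls': iterates until the trajectory falls below the starting seed
def glide(n)
  c, m = 0, n
  loop do
    m = m.even? ? m / 2 : 3 * m + 1
    c += 1
    break if m < n || c > 100_000   # safety cap for the sketch
  end
  c
end

seeds   = (2..50_000).to_a.sample(5_000)          # stand-in seed pool
samples = seeds.map { |n| { n: n, ls: glide(n) } }
samples.reject! { |s| s[:ls] <= 3 }               # the "tiny change": exclude ls<=3
puts samples.size
```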

datab.rb

(2/15) 💡 ⭐ XD ❗ ❤ 🙄 (typical research as bipolar?) again retrenching and (maybe?) snatching some victory from the jaws of defeat. going back to some of the earlier findings. this is some new, more sophisticated data generation code and the basic linear fitting code, separated. the generation code consolidates points over 3 separate runs, excludes short glides ‘ls’<=20, and samples half the ‘ls’ range over each ‘h2’ “slice”. including the whole ‘ls’ range, in contrast, (sometimes) seems to lead to a nearly random fit. from the graph there is some impressive signal here, but is it only due to bias in the generation algorithm?

the question of bias in the generation is turning out to be rather subtle, and apparently it takes quite a bit of care to try to generate “nonbiased” samples. (and maybe that is the real story of the last many weeks.) somewhat counterintuitively, maybe seeds that are part of “unbiased” distributions are “hard” to find. (definitely have to think more about this!) the separation of code allows the sample distribution to be examined more directly via the data.txt file. one idea that is coming to mind is some older code that looked at a 2-way variable distribution to search for seeds, which could be generalized to some kind of multiway analysis/ frontier search.

addendum: maybe got an anomalous random run previously. changing to the full ‘ls’ range for each ‘h2’ slice sometimes leads to nearly the same results/ linear signal below. except the bottom edge tends to flatten out in that case.
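for concreteness, a rough sketch of the “half-slice” sampling idea (data2.rb below is the real generation code and its logic may differ; the guess here is that the upper half of each slice's ‘ls’ range is kept, and the toy ‘points’ array just stands in for generated samples):

```ruby
# sketch only: group points by 'h2' and, within each 'h2' slice,
# keep just the upper half of that slice's observed 'ls' range
points = Array.new(2_000) { { h2: rand(1..60), ls: rand(20..400), ns: rand(5..200) } }

half_sliced = points.group_by { |p| p[:h2] }.flat_map do |_h2, slice|
  lo, hi = slice.map { |p| p[:ls] }.minmax
  mid = (lo + hi) / 2.0
  slice.select { |p| p[:ls] >= mid }    # upper half of the 'ls' range for this slice
end

puts "kept #{half_sliced.size} of #{points.size} points"
```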

data2.rb

fit24.rb

fit24

(2/18) built some very sophisticated code that took hours to debug. it looked at the histograms of the frontier and selected set over 3 different axes, the ‘h2’, ‘ls’, ‘ns’ dimensions. it has a neat voting/ weighting algorithm that tries to add points based on different vote increments wrt gaps/ outliers in the selected histogram, accumulated over all 3 axes. however, wasnt understanding its behavior; it seemed to be largely selecting only small ‘h2’ values and larger ‘h2’ values were quite rare, and while sophisticated & running as intended, not ready to post it yet, wondering if it is still not performing ideally/ as desired.

had to do some more thinking/ analysis. then was led to this basic analysis, which maybe have done something like before (using somewhat different logic) & then didnt remember. this simple code tries to maximize ‘h2’ using more recent patterns. noticed ‘h2’ up to ~150 is findable. but on the other hand large ‘h2’ values seem to be confined to lower valued seeds and maybe dont even exist for higher seeds! it appears the ceiling may asymptotically decline for higher seeds to around ~25-30. this graph is ‘h2’ scale on left and ‘ns’, ‘ls’ scale on right.

an idea from this data is that maybe fitting larger ‘h2’ values does not make sense if they dont exist for seed sizes approaching infinity! it also shows how a dynamic histogram approach could get messed up if its range is skewed by high ‘h2’ values associated with lower seeds only; later seeds will seem to fall mainly in “lower bins”. and its an uncommon case of a statistic with a ceiling that maybe declines for larger seeds.

1st graph is maximizing by ‘ls’ (1st arg), which seems to perform best. 2nd graph is maximizing by ‘h2’, which tends to cause ‘ls’, ‘ns’ to run sideways.
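a simplified sketch of the greedy frontier idea just described (data4.rb below is the real code; the bit-append advance step and the ‘h2_stat’ placeholder here are guesses/ stand-ins, not the actual definitions):

```ruby
# sketch only: greedily expand the frontier point that maximizes the chosen metric
def glide(n)
  c, m = 0, n
  loop { m = m.even? ? m / 2 : 3 * m + 1; c += 1; break if m < n || c > 100_000 }
  c
end

def h2_stat(n)
  # hypothetical placeholder statistic; substitute the real 'h2' computation
  n.to_s(2).chars.each_cons(2).count { |a, b| a == "1" && b == "1" }
end

def stats(n)
  { n: n, ls: glide(n), h2: h2_stat(n), ns: n.bit_length }
end

metric   = (ARGV[0] || "ls").to_sym          # 1st arg: maximize by 'ls' or 'h2'
frontier = [3, 5, 7, 9, 11].map { |n| stats(n) }
record   = frontier.max_by { |p| p[:h2] }

1_000.times do
  best = frontier.max_by { |p| p[metric] }   # expand-and-retire the current best
  frontier.delete(best)
  [best[:n] * 2, best[:n] * 2 + 1].each do |m|
    q = stats(m)
    frontier << q
    record = q if q[:h2] > record[:h2]
  end
  frontier = frontier.max_by(500) { |p| p[metric] }   # cap the frontier size
end

puts "max h2 found: #{record[:h2]} at a #{record[:ns]} bit seed"
```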

data4.rb

data4

data4x

this is an interesting twist without too much new code & maybe comes close to what is desired. it seems 2 phases are needed, 1 to generate points and another to sample them in a more uniform way, because almost no matter what “raw” generation logic is chosen, “easy” (short) trajectories are quite common and the “hard” (long) trajectories are rare. this code actually contains some of this tradeoff.

the idea is that there are 2 strategies. 1 randomly samples points out of the frontier and is good at giving a balanced distribution & not getting trapped in local minima, but does not maximize the variables much except by accident. the other strategy looks at relative values of ‘ls’, ‘ns’, ‘h2’ and tries to greedily maximize the combination by choosing top frontier points. this 2nd strategy has the effect of putting more force into the maximization of the combination of variables, but then its biased away from “typical” samples with more “medium” values. here are 3 runs, using strategy 1 (arg0=1), strategy 2 (arg0=0), and an alternation between them that gives the general desired result of maximizing variables while at the same time sampling broadly over the whole distribution (arg0 null). the last step, not implemented yet, is to resample the points in a somewhat balanced way across the ‘h2’, ‘ls’, ‘ns’ dimensions. (graph order/ colors switched in this graph)
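a stripped-down sketch of just the 2-strategy selection logic (data5.rb below is the real code with all the moving parts; ‘combined_rank’ and the toy frontier here are stand-ins):

```ruby
# sketch only: strategy 1 = random frontier point, strategy 2 = greedy top point
# by a combined relative rank of 'ls', 'ns', 'h2'; no arg alternates the two
def combined_rank(p, frontier)
  [:ls, :ns, :h2].sum do |k|
    mx = frontier.map { |q| q[k] }.max.to_f
    mx.zero? ? 0.0 : p[k] / mx          # each variable relative to the frontier max
  end
end

def pick(frontier, iter, arg0)
  use_random = arg0.nil? ? iter.even? : arg0 == 1
  if use_random
    frontier.sample                                       # strategy 1: broad/ balanced
  else
    frontier.max_by { |p| combined_rank(p, frontier) }    # strategy 2: greedy maximize
  end
end

arg0     = ARGV[0] && ARGV[0].to_i
frontier = Array.new(50) { { ls: rand(20..300), ns: rand(5..100), h2: rand(1..50) } }
5.times { |i| p pick(frontier, i, arg0) }
```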

data5.rb

data5

data5x

data5y

(2/20) this is a huge rewrite of the prior data2 code/ algorithm, synthesizing prior ideas combined with some new features. it took a lot of care/ analysis/ debugging, partly due to a tricky-to-isolate but simple glitch. theres a lot to describe here & intend to get back to that detail, but for now, the point is that it has 6 separate powerful algorithms to sample/ advance the frontier and they are all alternated/ run “nearly” in parallel. 5000 iterations and 3 separate runs. in the following case that led to a total of 572 points and roughly the same results as earlier (maybe a bit weaker, in that the lower boundary/ edge is not increasing as much), overall indicating some general robustness and maybe/ hopefully lack of bias in the generation algorithm. again fit24 linear regression was used to analyze signal in the graph below. actually this code samples the “full h2 slice” instead of a “half-slice” at the end and that may account for some of the discrepancy. the 2nd graph is the nonlinear RBF code fit20b on data from a different run. the really weak performance of the computationally expensive RBF code esp vs regression is surprising/ disheartening, but at least theres some discernible signal. 😦

data6b.rb

data6b

data6bx

(2/23) this is a sophisticated idea again synthesizing many prior ideas/ code. it ran off the last data distribution from data6b.rb. was wondering about quantifying relative performance of regression vs RBF strategies. also came up with a new RBF idea that uses linear interpolation. seemed like an ingenious idea to me (it fits a trend line of value vs distance through the known points and interpolates at d=0) but was somewhat surprised/ disappointed that this idea does not perform well here compared to other distance weighting approaches. (maybe more tweaking can save it? maybe its very sensitive to # of neighbors & the 10 used here is too small or too large?)

this has 3 RBF weighting strategies including the new interpolation idea, and 1 regression approach. there were 551 points. the code just splits points between train/ test many times and looks at average error over many runs. here, after 100 runs, the 2nd 10-neighbor RBF algorithm (green #2) was found to be indistinguishable in performance from regression (magenta #4). the new interpolation method is blue #3. the 1st plot is the raw data and the 2nd is the averaged errors. in the raw data all 3 RBF approaches seem nearly the same but the average finds a measurable discrepancy.

this also explains the relatively noisy RBF fit in the prior graphs: it corresponds to RBF method red #1 here with “inverse distances”, which is noticeably inferior to both regression and the best RBF weighting method #2. its also newly heartening that one of the RBF methods is on par with regression, and am now wondering if maybe some very tight optimization may show at least a slight edge for RBF over regression. 🙂 on the other hand it would be nice/ highly desirable to find a method such that more computation leads to better accuracy; that is ultimately one of the big questions of big data/ machine learning. 😐
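for reference, a minimal sketch of the interpolate-at-d=0 weighting idea (4fit.rb below is the actual code; this just shows the core least-squares line through the (distance, value) pairs of the k nearest points):

```ruby
# sketch only: least-squares line of value vs distance over the k nearest points,
# evaluated at distance zero
def dist(a, b)
  Math.sqrt(a.zip(b).sum { |u, v| (u - v)**2 })
end

def interp_predict(train, x, k = 10)
  near = train.min_by(k) { |p| dist(p[:x], x) }
  ds   = near.map { |p| dist(p[:x], x) }
  ys   = near.map { |p| p[:y] }
  dm   = ds.sum / ds.size
  ym   = ys.sum / ys.size
  num  = ds.zip(ys).sum { |d, y| (d - dm) * (y - ym) }
  den  = ds.sum { |d| (d - dm)**2 }
  slope = den.zero? ? 0.0 : num / den
  ym - slope * dm                          # the fitted line's value at d = 0
end

train = Array.new(100) { xs = Array.new(3) { rand }; { x: xs, y: xs.sum } }  # toy data
puts interp_predict(train, Array.new(3) { rand })
```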

4fit.rb

4fit

4fitx

(2/24) this is an interesting/ neat pattern to figure out the best n-nearest-neighbor count while also trying to minimize function evaluations. the idea is that theres a noisy function but with some kind of true mean, and with more samples the average gets close to it with less uncertainty. but theres no need to sample ‘n’ far away from the optimum, and its better to minimize total sampling (computation). the idea here is to start with a rough estimate n=20, sample at n, n-1, n+1, and then move “left” or “right” (lower, higher) by a single increment depending on the best minimum found so far. the estimate at each “slot” is the average over all samples at that slot.

following are the results that show that even optimizing ‘n’ for each RBF method separately, (alas) this RBF does not outperform regression, method #4 magenta line, but one method does come close, #2 green line. notably, two of the RBF methods #1, #3 red/ blue interchanged in relative position based on optimizing ‘n’, showing the importance of doing so; blue #3 is the new linear interpolation method and its nice to see its improvable beyond the worst performer! (n=10 is much too few neighbors for this method and its error improves substantially around n=20.) not sure what the early noisy spikes are about right now! honestly, looks like some kind of intermittent glitch that could be hard to isolate, but maybe not affecting the results much…
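the left/ right sampling pattern itself is simple; a sketch (4fit2.rb below is the real code, and ‘measure_error’ here is a hypothetical stand-in for one train/ test split evaluation):

```ruby
# sketch only: hill-climb the neighbor count 'n' one step at a time, keeping a
# running average of the noisy error samples at each slot
def measure_error(n)
  (n - 23)**2 / 50.0 + rand     # toy noisy objective with a true minimum near n=23
end

samples = Hash.new { |h, k| h[k] = [] }
avg     = ->(m) { samples[m].sum / samples[m].size }

n = 20                          # rough initial estimate
200.times do
  [n - 1, n, n + 1].each { |m| samples[m] << measure_error(m) if m >= 1 }
  best = samples.keys.min_by { |m| avg.call(m) }   # best averaged slot so far
  n += (best <=> n)             # move a single increment toward it (or stay put)
end

puts "estimated best neighbor count: #{samples.keys.min_by { |m| avg.call(m) }}"
```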

4fit2.rb

4fit2y

(3/1) its looking like the linear regression code is very hard to beat even with sophisticated RBF code. this RBF code puts exponentially decaying functions based on distance around all the train points and does gradient descent to compute optimal exponents for each. the train error goes down to ~7, which is less than the prior regression test error of ~7.4 (red), but the test error for this new method is ~9 (blue), rather bad! note convergence is after ~20 iterations and the rest is mostly extraneous.
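the general shape of that idea, as a compact sketch (fit26.rb below is the real code; the exact kernel form, error function, and gradient computation there may differ; this version uses a normalized kernel average and a crude finite-difference gradient on toy data):

```ruby
# sketch only: one decay exponent beta_i per train point, kernel-weighted average
# prediction, exponents tuned by (finite-difference) gradient descent on train error
def dist(a, b)
  Math.sqrt(a.zip(b).sum { |u, v| (u - v)**2 })
end

def predict(train, betas, x)
  w = train.each_with_index.map { |p, i| Math.exp(-betas[i] * dist(p[:x], x)) }
  s = w.sum
  s.zero? ? 0.0 : train.zip(w).sum { |p, wi| p[:y] * wi } / s
end

def train_error(train, betas)
  errs = train.each_with_index.map do |p, i|
    rest  = train[0...i] + train[(i + 1)..]      # leave the point itself out
    brest = betas[0...i] + betas[(i + 1)..]
    (predict(rest, brest, p[:x]) - p[:y])**2
  end
  errs.sum / errs.size
end

train = Array.new(30) { xs = Array.new(3) { rand }; { x: xs, y: xs.sum } }  # toy data
betas = Array.new(train.size, 1.0)

10.times do
  base = train_error(train, betas)
  grad = betas.each_index.map do |i|
    betas[i] += 1e-4
    g = (train_error(train, betas) - base) / 1e-4
    betas[i] -= 1e-4
    g
  end
  betas = betas.zip(grad).map { |b, g| [b - 0.1 * g, 1e-3].max }  # keep exponents > 0
  puts train_error(train, betas)
end
```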

fit26.rb

fit26

this is some very cool code that combines ideas/ code from about 3 separate prior programs. the idea here is an old one of having 2 separate semi-intelligent systems, or optimizers, that work against each other, aka an adversarial approach that is big/ trendy in machine learning theory/ AI right now (eg with google Go). the idea is that one optimizer finds maximally bad fit candidates based on current analysis properties/ model. the other finds the best model to explain the latest stream of bad fit candidates. they work in parallel. all the prior code to do each separate function is lying around; it just needed to be combined, and that was not hard.

the distribution generation code has two separate approaches, one to choose a random point in the frontier to advance and the other to try to maximize ‘ls’. the idea is that one wants a nearly uniform distribution of ‘ls’ values that is steadily increasing. this took maintaining 2 separate pools of candidate trajectories, one with random ‘ls’ values over the whole range limited to 10K size and the other with maximal ‘ls’ values encountered so far limited to 2K size (smaller because its sorted, which is more time-expensive). the linear regression fit code runs on the last 400 candidates to generate the new model. the big question is, does the ‘ls’ error increase without bound? note that this code fits ‘h2’. the Huge Goal is to come up with some kind of adaptive model where ‘ls’ error is not steadily increasing. this is 100K iterations in about 3½ min. the graph samples 1/40 of the output points at random.

a bit startlingly, the error (blue) does not go up much here even as the ‘ls’ range increases (but note the increase in the ‘ls’ red points tends to level out after ~1000, and green points are estimated), and note final seeds are about a few hundred bits wide. also notable, there are some late ‘ls’ spikes where the corresponding error spikes do not always correlate in size, ie it is roughly correctly estimating even near-outlier spikes, although some spikes do correlate, eg the last 2 big ones. anyway though, because even with very sophisticated distribution generation logic attempting to maximize error, the error nevertheless does not increase much, all this seems verging on a breakthru at the moment! (although its not totally clear how much of the error limit is due to not finding larger ‘ls’ trajectories. but, already have another cool idea for generating a better distribution, a major shift in pov.) ⭐ ❗ 😮 😀 😎
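for orientation, a skeleton sketch of the adversarial loop structure just described (fit28.rb below is the real thing; ‘advance’, ‘fit_model’, and ‘model_error’ here are crude hypothetical stand-ins, and the real code fits ‘h2’ with multi-variable linear regression, has richer generation logic, and runs 100K iterations rather than the scaled-down count here):

```ruby
# skeleton sketch only: two candidate pools (broad random + maximal 'ls'),
# a model periodically refit on the most recent candidates, both sides alternating
def glide(n)
  c, m = 0, n
  loop { m = m.even? ? m / 2 : 3 * m + 1; c += 1; break if m < n || c > 100_000 }
  c
end

def advance(cand)               # stand-in frontier advance: append a random bit
  n = cand && cand[:n].bit_length < 64 ? cand[:n] * 2 + rand(2) : (rand(3..5_000) | 1)
  { n: n, ls: glide(n) }
end

def fit_model(window)           # stand-in "model": mean 'ls' over the fit window
  window.sum { |c| c[:ls] } / window.size.to_f
end

def model_error(model, cand)
  (cand[:ls] - model).abs
end

random_pool, max_pool, recent, model = [], [], [], nil

2_000.times do |iter|
  # adversary side: alternate between the broad random pool and the maximal-'ls' pool
  src  = iter.even? ? random_pool : max_pool
  cand = advance(src.sample)

  random_pool << cand
  random_pool.shift if random_pool.size > 10_000                       # 10K cap
  max_pool = (max_pool + [cand]).sort_by { |c| -c[:ls] }.first(2_000)  # 2K cap, sorted

  recent << cand
  recent.shift if recent.size > 400

  # model side: refit on the latest stream of candidates
  model = fit_model(recent) if recent.size == 400 && iter % 400 == 0

  puts [iter, cand[:ls], model && model_error(model, cand).round(2)].join("\t") if iter % 100 == 0
end
```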

fit28.rb

fit28
