collatz revisualized

starting these blogs out, sometimes dont really know where they will lead ahead of time, but its a new month and a new pov. the (“inherently a priori”) title is typically either “whats happening at the moment” or “some general/ larger theme intended to be pursued,” in this case the former. (pondering that now, am intending to revisualize the last collatz experimental angle in particular, but on the other hand, nearly the entire theme of this overall research prj/ program in general could be said to be “revisualization”!)

at 1st was thinking maybe the major extra effort for polished/ pretty visualization wasnt worth it at the moment, but couldnt resist, just wanted to see it, and was curious/ wondering about a few additional statistics. there was some real payoff from saving all the intermediate data and doing a major refactoring of the visualization code, ie decoupling the visualization and generation phases, which allowed a fairly massive rewrite without having to rerun the very expensive generation code. all easier imagined/ said than done! took quite awhile/ substantial effort.
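the decoupling idea in a nutshell: dump everything the expensive phase produces, then let the plotting code reload it cheaply and get reworked at will. heres a minimal ruby sketch of that idea only; the names (‘generate_run’, ‘run.dat’) are hypothetical stand-ins, not the actual data54b.rb routines:

    # minimal sketch of the generation/ visualization decoupling idea.
    # 'generate_run' and 'run.dat' are hypothetical stand-ins, not the actual
    # data54b.rb code. dump all intermediate data once, then iterate on the
    # plotting code without rerunning the expensive generation phase.
    def generate_run
      # stand-in for the expensive optimization/ generation phase
      { 'density' => Array.new(100) { rand }, 'wr_e' => Array.new(100) { rand } }
    end

    if ARGV[0] == 'gen'
      data = generate_run
      File.open('run.dat', 'wb') { |f| Marshal.dump(data, f) }   # save intermediates
    else
      data = File.open('run.dat', 'rb') { |f| Marshal.load(f) }  # cheap reload
      # visualization-only code goes here, tweaked/ rerun as often as desired
      puts "loaded #{data['density'].size} density points"
    end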

this code calls a complex general analysis/ visualizing subroutine for a shifting distribution twice, once for the density data and a 2nd time for the ‘wr_e’ data. the code generates the graphing commands rather than the prior manual entry. it basically does a scatterplot of both series using black points, and superimposes some new statistics. the max is plotted in green, the min in red, and the “baseline” in yellow. then there is some plotting of the “out” and the “in” points, namely the values of what was thrown out and what was added in the optimization, in blue and magenta respectively. finally the ‘ncc’ metric or ‘non converging count’ is graphed/ superimposed in lightblue on its own independent/ separate scale. colors, point and line thicknesses were all adjusted. these lead to new trends/ some insights.

  • one striking difference: the density plot carries the max density rightward, and in the previous plot it was misleadingly/ deceptively overplotted onto the graph boundary/ edge/ border. in other words theres a massive/ widening gap/ space between the constant max density and the 2nd max density, and the idea of an apparently narrowing range is mostly an illusion.
  • the ‘in’ and ‘out’ densities are not random at all and tend to trend, and theres a sideways swing toward/ away from the baseline. some of this could be guessed by looking at the distribution of density points alone but a hidden pattern is also revealed. also interestingly the ‘in’/’out’ densities tend to alternate around each other in phases.
  • another observation is that in ‘wr_e’ the genetic algorithm almost always finds points close to the current minimum, although that gap does widen/ spike around iteration 25 and not later.
  • the final observation (at the moment) is that ‘ncc’ does seem to correlate to the other trends in some way but its not so clear from this single run.
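as for the plot-command generation described above, heres a rough sketch of emitting gnuplot commands from ruby instead of entering them manually. the column layout, filenames, and series list here are hypothetical placeholders for illustration, not the real general subroutine:

    # rough sketch of generating gnuplot commands programmatically. columns,
    # filenames and titles are hypothetical placeholders for illustration only.
    def plot_cmds(datafile, outfile)
      cmds = []
      cmds << "set terminal png size 800,600"
      cmds << "set output '#{outfile}'"
      cmds << "set y2tics"   # independent/ separate scale for 'ncc'
      series = [
        "'#{datafile}' using 1:2 with points pt 7 ps 0.3 lc rgb 'black' title 'points'",
        "'' using 1:3 with lines lw 2 lc rgb 'green' title 'max'",
        "'' using 1:4 with lines lw 2 lc rgb 'red' title 'min'",
        "'' using 1:5 with lines lw 2 lc rgb 'yellow' title 'baseline'",
        "'' using 1:6 with points pt 5 ps 0.5 lc rgb 'blue' title 'out'",
        "'' using 1:7 with points pt 5 ps 0.5 lc rgb 'magenta' title 'in'",
        "'' using 1:8 axes x1y2 with lines lc rgb 'light-blue' title 'ncc'"
      ]
      cmds << "plot " + series.join(", ")
      cmds.join("\n")
    end

    File.write('plot.gnu', plot_cmds('run.txt', 'out.png'))
    # then eg: system("gnuplot plot.gnu")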

am now wanting to write some code that does many runs, ie a long multirun eg overnight or longer, look at what comes out, and try to find tendencies/ patterns. also, realized that its a real “miss”/ oversight to have left out saving the ‘wr_e’ data for the test set and definitely want to add/ examine that also; feel there is likely something quite significant/ meaningful in it esp in comparison to the training set. the density of the test set on the other hand is obviously equally (uniformly) distributed.

data54b.rb

data54bx

data54by

(8/4) ⭐ 😮 😎 ❗ 💡 ❓ heres some new cool code that does multiruns and has slightly modified storage/ analysis routines. computing the ‘wr_e’ for the test set densities is a low-expense operation so it can be done every iteration. the 3rd graph is the new ‘wr_e’ measurement for the test set. have been looking through very many runs for patterns, and am esp interested in anything that correlates with the ‘ncc’ optimization metric. but ‘ncc’ is only optimized indirectly in the code, and am not really sure how this is working; it seems a bit magical.
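the multirun driver is conceptually just a loop over independent seeded runs with incremental saving; a minimal sketch follows, where ‘single_run’ is a hypothetical stand-in for the real optimization loop (not the data55.rb code):

    # sketch of the overnight multirun idea: repeat independent runs, tag each
    # with a seed, and accumulate per-iteration stats for later comparison.
    # 'single_run' is a hypothetical stand-in for the real optimization loop.
    def single_run(seed)
      srand(seed)
      # would return per-iteration stats eg ncc, test-set wr_e, etc
      Array.new(10) { |i| { 'iter' => i, 'ncc' => rand(20), 'wr_e_test' => rand } }
    end

    runs = {}
    20.times do |r|
      runs[r] = single_run(r)
      File.open('multirun.dat', 'wb') { |f| Marshal.dump(runs, f) }  # save incrementally
      puts "run #{r} done, final ncc #{runs[r].last['ncc']}"
    end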

‘ncc’ typically goes sideways for awhile and then just drops cliff-like to zero. feel like have never seen anything like it! its as if something is constraining it somehow. it reminds me (old blog allusion here!) of road runner vs wile e coyote and the coyote dancing in midair and then falling off the cliff but only after he looks down… (and musing, here the metaphor has transformed from hopeless to hopeful!). from the new ‘wr_e’ test error, perfect convergence is found almost magically even as ‘wr_e’ is far from zero! somehow pictured that ‘wr_e’ test error could/ would/ should be driven closer to zero, but nothing whatsoever like that is happening.

something quite mysterious/ striking is happening here and it seems right now it could take quite awhile to figure out. was trying to find metrics that correlate with the ‘ncc’ rate and nothing really seems to line up with it; looked at the graph metrics and also the difference of the training density in/ out points. would like to get into the ‘ncc’ convergence dynamics more but on the other hand its somewhat a case of “ends justify the means”.

the convergence is quite consistent and has never failed so far in dozens of runs! admit some surprise at my good luck here! seems to be genuine further breakthrough territory. and feel its a “godsend” when almost everything else on this problem is only very hardwon through painstakingly direct/ targeted/ specific tuning. but still am left with a feeling of amazed wonderment almost bordering on cognitive dissonance. what the heck is going on here?

my next urge is to poke at the final solutions a lot in some way. maybe try more density samples etc, and also try to make an adversarial attack on the “perfect” (measured) convergence somehow, not exactly/ immediately sure how to do that, it doesnt seem entirely obvious at moment, have to think about it more. could it be enough to try to maximize trajectory or glide lengths somehow? ❓

but this is an embarrassment of riches and then just had another idea for more visualization. ‘wr_e’ seems to have a distinct distribution. there are not a lot of points, only 100, but it would be fascinating to see a histogram for the test/ train sets. for the test set, it seems to converge to something normal-looking. for training, it narrows to a very small region and its not clear from the diagram what the distribution might be. another idea is to increase the density sample points to get more resolution in the histogram. unfortunately however theres an “expensive” n^2 increase in the test set ‘ncc’ nonconvergence measurement time for increasing density sample points… although, its an “embarrassingly parallel” computation… wheres that cluster supercomputer when needed? ❓
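on the “embarrassingly parallel” point: each points nonconvergence check is independent of the others, so even without a cluster it can be farmed out across cores with plain unix forks. a minimal sketch, where ‘nonconverging?’ is a hypothetical stand-in for the real per-point trajectory check:

    # sketch of parallelizing the n^2 nonconvergence measurement across cores.
    # each point's check is independent, so fork one worker per slice and merge.
    # 'nonconverging?' is a hypothetical stand-in for the expensive real check.
    def nonconverging?(x)
      sleep 0.001   # stand-in for the expensive per-point trajectory check
      false
    end

    points = Array.new(250) { rand }
    ncores = 4
    readers = points.each_slice((points.size.to_f / ncores).ceil).map do |slice|
      r, w = IO.pipe
      fork do
        r.close
        w.write(Marshal.dump(slice.count { |x| nonconverging?(x) }))
        w.close
      end
      w.close
      r
    end
    ncc = readers.sum { |r| Marshal.load(r.read) }
    Process.waitall
    puts "ncc = #{ncc}"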

data55.rb

data55x

data55y

data55z

data55w

(8/5) this is maybe part of an answer/ secret to some of the prior questions. this code has a minor modification to compute histograms, and was rerun on the prior graph run using the existing stored data. it looks like the skewness of the train/ test distributions may play a key role. this can be guessed from the prior distribution diagrams but it didnt really jump out until displaying these histograms. more attestation to the power of targeted visualization.

1st graph is the train histogram of ‘wr_e’ and 2nd is the test histogram. in the 1st it moves from left/ dark to right/ red skewness, finally centered/ orange, but narrow. the 2nd moves from right/ dark skewness to centered, and the distribution is very smooth and curved, looking parabolic; visually there also seems to be some zig-zagging effect between adjacent histogram bins, but not sure if thats just noise or a visualization effect. (note, reviewing stats defns: left or right skewness refers to the side of the longer tail, which is the opposite side from the peak!) looking at them, a conjecture might be that both train and test distributions have to converge toward the center, and/ or the train distribution has to narrow, for ‘ncc’ to decrease or flatline?
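for reference, heres a minimal ruby sketch of binning a ‘wr_e’ array into a histogram and computing sample skewness (positive means a longer right tail with the peak on the left). the toy data and bin count are placeholders, not the data55c.rb code:

    # minimal sketch: histogram bins + sample skewness for an array of values.
    # toy data and bin count are placeholders; not the actual data55c.rb code.
    def histogram(vals, bins = 20)
      lo, hi = vals.min, vals.max
      w = (hi - lo) / bins.to_f
      counts = Array.new(bins, 0)
      vals.each { |v| counts[[((v - lo) / w).floor, bins - 1].min] += 1 }
      counts
    end

    def skewness(vals)
      n = vals.size.to_f
      m = vals.sum / n
      sd = Math.sqrt(vals.sum { |v| (v - m)**2 } / n)
      vals.sum { |v| ((v - m) / sd)**3 } / n   # > 0: long right tail, peak on left
    end

    wr_e = Array.new(100) { rand**2 }   # toy right-skewed sample
    p histogram(wr_e)
    p skewness(wr_e)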

💡 another idea occurred to me. what if the optimization were designed to try to alter the training distribution in some way, eg via some strategy of discarding outliers? it could try to throw out the point “farthest” from the mean, or alternate between min/ max, or possibly pick outliers based on the histogram, etc… ❓

data55b.rb

data55bx

data55by

(8/11) this is the code to try out some other selection logic and its a null result for the 2 alternatives. there are 3 methods, ‘a’, ‘b’, ‘c’. method ‘a’ is the prior logic. method ‘b’ finds the highest and lowest points based on the absolute value of ‘wr_e’ (throwing out the highest and seeking the lowest). method ‘c’ is similar except throwing out the farthest from/ seeking the closest to the average. ran method ‘b’ for over 1000 iterations and it didnt converge. method ‘c’ didnt converge after 500 iterations. in both cases ‘ncc’ seems to just oscillate sideways. for case ‘b’ all the training ‘wr_e’ converged to zero but the test values were distributed over a range and the distribution was not shifting.
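the ‘b’/ ‘c’ alternatives roughly amount to the following selection rules, written out as a paraphrase with hypothetical helper names (not the exact data56.rb code):

    # rough paraphrase of the 'b'/ 'c' selection alternatives; not the exact
    # data56.rb code. each returns [index to throw out, index to seek/ add].
    # 'b': discard the highest |wr_e| point, seek the lowest |wr_e| point
    def select_b(wr_e)
      abs = wr_e.map(&:abs)
      [abs.each_with_index.max[1], abs.each_with_index.min[1]]
    end

    # 'c': discard the point farthest from the average, seek the closest
    def select_c(wr_e)
      avg = wr_e.sum / wr_e.size.to_f
      dist = wr_e.map { |v| (v - avg).abs }
      [dist.each_with_index.max[1], dist.each_with_index.min[1]]
    end

    wr_e = Array.new(100) { rand - 0.5 }
    p select_b(wr_e)   # => [out index, in index]
    p select_c(wr_e)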

data56.rb

(8/16) 😳 poked at/ reanalyzed the solution with some code and found a glitch. the solutions so far are basically “degenerate”. what is happening in every solution examined is that the min/ max ‘wr’ found during the fit phase is between 0 and 1. this means that the algorithm is basically “cheating” by throwing out any train points that have trajectories longer than a single iteration. ie basically it selects starting seeds that immediately decrease from their starting position. oops!

however, this turned out to be easy to fix. instead of a min/ max limit on the formula computation, which is a nonlinear computation anyway, it makes sense to just call/ label the trajectory “nonconverging” if there is overflow to large numbers, in this case past 1e10 in the detect subroutine. unfortunately this meant the timeconsuming solution generation had to be rerun, but ofc that is maybe always the easiest part, ie just running the code.
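in pseudocode-ish ruby the overflow rule amounts to something like the following sketch; the ‘step’ dynamics and the convergence test here are hypothetical stand-ins for the actual meta function formula in the detect subroutine:

    # sketch of the overflow-based nonconvergence rule: rather than clamping the
    # formula with a min/ max, flag the trajectory as nonconverging once the value
    # blows up past 1e10. 'step' and the convergence test are hypothetical stand-ins.
    LIMIT = 1e10

    def step(x)
      x * (0.5 + rand)   # stand-in dynamics only, so the sketch runs
    end

    def detect(x0, max_iters)
      x = x0.to_f
      max_iters.times do |i|
        return i if x < 1.0              # converged (stand-in criterion)
        return max_iters if x > LIMIT    # overflow => count as nonconverging
        x = step(x)
      end
      max_iters                          # hit the iteration cap => nonconverging
    end

    p detect(1e3, 500)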

the other change is new code to reanalyze the solution. my idea was to look at the “nonconvergence map”. this is a natural 2d array/ matrix composed of the convergence step counts for the n x n ‘wr_e’ perturbations in the retest subroutine. the count2 subroutine is modified to return this array for later storage and the plot routine is modified to add it to the series of plots. its graphed with the gnuplot (surface plot) “matrix with image” code as below (have graphed stuff like this before but now feel should have figured it out/ used it a long time ago!). there is some minor variation in it between review runs (ie for a fixed model solution). another minor fix is adjusting the ‘y2tics’ logic in the plot. actually that was introduced in the last data56 code and carried “backward” so to speak (ie wrt program suffix/ numbering but not “branches”). the x/ y axes are ordered by low-to-high starting seed density.
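the gnuplot side of the map plot is essentially the “matrix with image” style; heres a minimal sketch of emitting it from ruby, with ‘map.dat’ as a placeholder name for the stored n x n step-count matrix (not the actual data55c.rb plot routine):

    # sketch of emitting the gnuplot "matrix with image" commands for the
    # nonconvergence map; 'map.dat' is a placeholder for the stored n x n matrix
    # (one row of step counts per line).
    cmds = [
      "set view map",
      "splot 'map.dat' matrix with image"
    ].join("\n")
    File.write('map.gnu', cmds)
    # then eg: system("gnuplot -persist map.gnu")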

data55c.rb

data55c

(8/31) did a lot of work, quite a few hours over quite a few days to tighten the analysis significantly and look at even broader patterns/ nonconvergence consistency.

  • this code runs the analysis without multi-iterations, ie only 1 iteration, denoted with the global variable $m; the max meta function iteration limit is now denoted with the global var $c.
  • there was a small glitch in the detect subroutine: it returned nil instead of $c for the overflowing case, which randomly affected runs.
  • this has a new stats2 routine for analysis, called/ triggered with args ‘x’. it starts out with a comparison of the total glide lengths of the real vs meta “sub-trajectories” over the overall optimization iterations. brighter colors are later iterations, graph “c”.
  • the points/ resolution in the prior “nonconvergence map” are increased from 100 to 250, that is starting seeds only ordered by density, graph “z”.
  • there is an analysis of the points and their errors for the iterates after the prior starting points, ie for (“all”) the points in the glide(s), using the same “nonconvergence map” approach. the glide iterates are “slightly larger” than the starting points. however there are many (a few thousand) total points in the glides, so this code does a limited (random) sampling over them, based on equal intervals after first ordering them by 5 different keys (sketched after this list). the keys ‘d’, ‘n’, ‘x’, ‘nl’, ‘y’ are density, starting iterate, index in glide, iterate bit width, and remaining glide length respectively.
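roughly, the equal-interval sampling idea looks like the following sketch; the point fields, the pick-per-interval detail, and the toy data are hypothetical placeholders, not the exact data57.rb logic:

    # sketch of the limited sampling over glide points: order all points by a key,
    # then take one sample per equal-width interval through the ordering. the keys
    # correspond to the 'd','n','x','nl','y' orders; point fields are placeholders.
    def sample_by(points, key, count = 250)
      sorted = points.sort_by { |pt| pt[key] }
      step = sorted.size / count.to_f
      (0...count).map { |i| sorted[(i * step).floor] }
    end

    # toy glide points carrying the five ordering fields
    points = Array.new(3000) do
      { 'd' => rand, 'n' => rand(2**20), 'x' => rand(100),
        'nl' => rand(10..200), 'y' => rand(100) }
    end

    %w(d n x nl y).each do |k|
      s = sample_by(points, k)
      puts "#{k}: #{s.size} samples, range #{s.first[k]}..#{s.last[k]}"
    end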

in graph 1 “c” there are two general curves representing the real and the meta glide lengths, running left to right by starting seed density low to high. clearly from the graph/ trend, the overall meta function is effectively estimating glide length as an increasing/ nonlinear function of starting seed density. the effect over optimization iterations is to generally decrease the glide length estimates, but not by a lot, ie for the meta curve the brighter lines (later optimization iterations) trend downward. the estimate is indeed roughly within the average range of the real glides.

the overall optimization convergence time was actually significantly better/ faster than the prior code with multi-iterations set to 50, and in some runs there was immediate “convergence” to the solution ie “1st try”. the other graphs show quite a bit of random scattering/ striping and not so many local patterns esp “horizontally”.

⭐ ❗ 😀 😎 the bottom line is that this code looks at how much the meta function convergence matches the real function within all encountered errors, and finds there are no failures at all over many separate runs (signified by no max iteration count runs of $c=500 encountered anywhere!). however, full disclosure: there were some runs with a few scattered nonconverging points or long meta trajectories close to the limit. nevertheless its overall very solid! this means this code has, as designed, indeed successfully found something like a loop invariant in the meta function index variable!

data57.rb

data57c

data57z

data57d

data57n

data57x

data57nl

data57y

(9/1) 💡 this is a simple riff that sorts the grid points after the sampling: the vertical axis is sorted by starting iterate density and the horizontal axis by ‘wr_e’. from the prior analysis this would predictably tend to lead to incrementally “hotter” gradient/ regions, and that is exactly the remarkable/ emergent/ smooth/ nice/ unifying/ eyecatching result and a simple way of conceptualizing the overall nonconvergence grid dynamics. even beautiful! now having some satisfied/ inspiring/ triumphal feelings lately that complexity can be tamed ❗ 😮 😎 😀
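the riff is basically a re-indexing of the sampled grid before plotting; a rough sketch with hypothetical grid/ point fields and toy data (not the data57b.rb code):

    # rough sketch of the axis re-sorting riff: re-index the sampled grid so the
    # vertical axis runs by starting iterate density and the horizontal by wr_e,
    # then write it back out in gnuplot matrix format. fields are placeholders.
    rows = Array.new(50) do
      { 'd' => rand,
        'cells' => Array.new(50) { { 'wr_e' => rand, 'steps' => rand(500) } } }
    end

    sorted = rows.sort_by { |r| r['d'] }.map do |r|
      r['cells'].sort_by { |c| c['wr_e'] }.map { |c| c['steps'] }
    end

    File.write('sorted_map.dat', sorted.map { |row| row.join(' ') }.join("\n"))
    # then plot with: splot 'sorted_map.dat' matrix with image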

data57b.rb

 
