💡 ❗ ⭐ this code has a lot of moving parts and took quite awhile but is also a culmination of many previous ideas. it has the same distribution generation mechanism as last run (slightly modified/ adjusted), but its a 1st cut on evaluating the significance/ accuracy of the MDE/RR model as a predictor of trajectory length. the better the accuracy of its modelling, the more “plausible” a (“analytic”!) model it is and viable for exploiting for further derivations.
this code generates starting seeds with smoothly varying densities to analyze the model. from experiments it was seen the model could have two approaches. the MDE/RR basically predicts ‘x’, the (bit) size ratio of one iterate to the next. the model starts with the initial seed and then iterates recursively using the matrix equations for the 7 indicator variables. so what is its logical estimate/ prediction for trajectory length? method #1 is a somewhat “naive” method for comparison that simply uses the starting seed bit density. the method #2 uses the MDE/RR eqn and multiplies out all the ‘x’ series and determines when its less than 1 (“bit length”) and uses that as the estimator. method #3 is based on experimental observation that the model sometimes predicts ‘x’ negative and its nearly the same as method #2 predictions as far as typical location in the trajectory.
the code calculates correlations between predictions and actual length. results are that correlation is significant starting at around ~0.5 and method #2/ #3 (green, blue) do indeed outperform method #1 (red) (~0.4) for estimating trajectory lengths. the near alignment of the two techniques is quite apparent. green #2 slightly outperforms blue #3. the more naive technique #1 red is inferior for bit widths up to about 40 (30 on the graph) and then all the methods tend to overlap. dont have an explanation for this yet, although might guess its revealing the limitations of the model computed/ generated on smaller iterates. magenta line/ right axis scale is the average trajectory width. overall this is some vindication of the general approach although feel the naive method performed surprisingly well in comparison. 😎
(5/3) 💡 ⭐ 😮 hindsight is 2020! the prior inclusion of the single variable density predictor was included almost as something of an afterthought, yet it performed surprisingly/ remarkably well, leading to the following direction. had an idea to scale back to basics, avoid the complex matrix eqn formulation, and focus just on the density effect on bit size/ “width” ratios, here called ‘wr’, and am a bit shocked that maybe never looked at something this basic/ fundamental/ key before, although presumably have come “close” in prior experiments. (also am having some whiff of deja vu and wondering if maybe discovered this once, maybe even early on with initial discovery of density property, and put it aside to write up later, accidentally indefinitely!)
this looks at iteration counts/ spans ‘t’ of 20 over all density ranges and 4 variations on the collatz compression (2 binary variables for compressing odd/ even runs). there is another (3rd) binary variable on whether to generate odd seeds only. then it does a simple single-variable linear fit.
testing all 8 combinations gives the best optimization as not odd only seeds, and no even or odd compression, ie combination
[false, true, true] with correlation coefficient a very impressive ~0.73. it would be quite remarkable if all the prior complicated assembly (incl the sophisticated code to find balanced distributions over ‘wr’) is really (mostly) just expressing this particular relation.
the bigger idea here is that this is “close” to a single-variable recurrence relation for iterate size, could something so simple have the earlier sketched out stability property sought? to build a 2-variable recurrence relation what is also needed is a formula for subsequent densities, which is already also quite apparent from previous work… ❓
😮 😳 another rather huge observation, an elephant in the room finally spotted at this point, is that again things seem to have gone in a circle, and this is circumstantial evidence maybe the prior very complex/ painstaking code to generate trajectories with varying bit size ratios is actually just finding starting seeds over smoothly varying densities!
[[false, false, false], 0.5563232157226472] [[true, false, false], 0.2562485822983529] [[false, true, false], 0.5889218554667478] [[true, true, false], 0.5764039403722678] [[false, false, true], 0.655020246814034] [[true, false, true], 0.5610258625610238] [[false, true, true], 0.7338845238137213] [[true, true, true], 0.6128443693547614]
this quick code variant does the simple calculation to find density from prior density (‘d2’) and finds good best correlation also (best is same parameter combination), but a tad less at ~0.58.
[[false, false, false], 0.16887564367232774] [[true, false, false], 0.22601164387824987] [[false, true, false], 0.057085499485387846] [[true, true, false], 0.4551754316324641] [[false, false, true], 0.45771331768904455] [[true, false, true], 0.3621724662709662] [[false, true, true], 0.582959960585929] [[true, true, true], 0.47686131190553194]
next it is natural/ simple to explore the effect of gap length (# of iterations) on correlation of ‘d2’ (red) and ‘wr’ (green) for the best model “6”. best fit is apparently for shortest gap (2).
❗ oops, its not totally sensible to do linear fits on the density changes. the correlation is real but from long earlier experiments (initial ones in the area), the transform is a nonlinear sigmoidal-looking function. its apparently “bent” somewhat in a way where a linear fit has significant correlation. need to work on creating a more exact formula/ function for the sigmoidal shape/ form and came up with a plausible function tinkering in gnuplot, need to write/ code it up in more detail. gnuplot has a powerful arbitrary formula fitting feature (based on gradient descent), would like to set that up also.
(5/4) this is kind of a hack on finding a sigmoidal-like function that fits the density mapping. the standard sigmoid goes from concave upward to concave downward. this observed function goes from concave downward to concave upward. so constructed a reflection about the diagonal line y=x. the parameters are hand-tuned and need more adjustment but the basic/ general idea is there (green). for comparison the linear fit line is displayed in blue. need to do algebraic simplification on this but wanted to capture the work for future adj. iteration span ‘t’=2.
User-Defined Functions: f(x)=0.5/10*x g(x,y)=1/(1+exp(-y*x))-0.5 h(x)=g(x,-0.5)+2*f(x)+0.5
💡 ❗ 😮 🙄 this is a remarkable pov/ big story and wish had figured it out long ago (dont recall actually observing it before, but looked at stuff close to it), all the hints were in place, nearly staring me in the face, but it took a very long time to think about the problem in this way! this code does a linear fit on the bit size ratio mapping for gap size t=1 to 50. very significantly this line will have an intercept at y=1 which is the threshhold of density ‘y1’ for which iterates go through the phase transition from smaller to larger. very significantly this threshhold starts at ~.65 and increases reliably/ essentially monotonically.
this is a huge explanation for the overall phenomenon of the problem! in general high density iterates increase in bit size but their density decreases as they do so. finally, apparently/ it seems, every iterate loses enough density to decrease in size. so density functions quite like the long earlier sought “potential energy” property/ phenomenon!
the first graph is density threshhold ‘y1’ red, linear fit slope ‘m’ green, correlation coefficient of the linear fit ‘wr’ blue. 2nd graph is the changing line fits using different colors, darker for low ‘t’ gap size and lighter for high ‘t’.
this is some code that investigates the # of iterations to reach the core versus the bit width, ie ratio of the two using the “extend” optimization method. it seems iterations to core increases without bound but thats probably due to seeds that increase in size also. what about the ratio ‘r’? the ratio seems to compress to around less than ~0.5 after many iterations. this code uses a rolling log in batches of 5k lines. on the 15th file ie 15*5k=75k iterations, the output looked like the following. ‘ns’ green right axis scale and ‘r’ in red left axis scale. its real data but (the universal disclaimer) not clear how much this corresponds to real existing outliers due to the possible “paint into corner”/ “vein following”/ lack of binary prefix diversity cases.
(5/8) this is the 2-variable model (next density, next bit size ratio) used for halt prediction using only the “bit size less than 1” criteria. it generates the 2-line model and then uses the “extend” optimization to find 1000 long-trajectory points to try out the fit/ prediction on. this graph is the actual trajectory length (red/ left axis scale) versus the prediction (green/ right axis scale). the model highly overestimates but there seems like good correlation. in some ways this is quite notable because the model is based only on density variation. the 2nd graph (another run) is starting density of the generated seeds which has a weak linear trend but its not clear yet how much the model is tied to that.
💡 a way to show a sort of counterexample to this model: generate some random seeds with the same density (say ½) and width (say 100). the seeds will have different trajectory lengths but the predictions will be the same.* (note/ correction, see below) in fact without recomputing the model, the predictions will be the same for arbitrary width seeds! so obviously this model is highly oversimplified. but its a small modification to at least take into account initial seed bit width effect on linear fit/ trend as in prior analysis ie recomputing the model for each width (actually the prior looping was over gap size instead of seed bit width, but is similar & am presuming there is a similar trend).
(5/10) this is a quick riff off
extend35 that tries to extend all variables simultaneously, ‘ls’, ‘c’, ‘ns’, ‘r’ partly because ‘r’/ ‘ns’ was found to plateau in that code. under this multivariate optimization ‘r’ compresses/ maybe declines more even as ‘ns’ increases. ‘c’ the iteration count to core also plateaus giving some (further) evidence for a limit/ bound. here is ‘r’ in the 5th of the 5k output files. also has a nice refactoring idea, contrasting with prior approaches, of encapsulating all the ranking logic in a single method.
(5/17) feel like there are several different directions to go in right now and am at something like a crossroads, good prospects but next step not obvious. one can try to increase the accuracy of the model or try to figure out if a somewhat inaccurate model could still be used to prove the problem based on the “stability” property conjectured.
btw previously comments on
data31c.rb said “the seeds will have different trajectory lengths but the predictions will be the same.” that needs some qualification. the trends (or local difference eqns) of the predictions will be the same. the predictions of trajectory lengths are based on initial seed bit widths, the latter of which can still vary even though the bit size ratio change formula is the same no matter what the current bit width. in other words this formulation apparently captures/ reveals the natural self-similarity of the problem based on density vs bit widths.
this code analyzes the simple 2-line fit function in another/ new/ basic way. 1st is a graph of the 2 lines found from the fit. red is density delta, green is width ratio minus 1, and 0 baseline in blue for comparison. 2nd graph is a plot of the density (red, left scale) and bit width (green, right scale) for 10 trajectories with density varying stepwise over whole range 0..1 and ‘w’ starting at 3. the curves converge for all densities. the bit widths are separate lines but they plot so adjacently as to look like a single shape (separate lines can be seen slightly better in the full resolution graph instead of shrunken version but also forgot to line plot them and instead the plot points are multipixel).
the density has a critical point/ attractor point at ~0.3 where densities above it decrease and below it increase (red line x intercept on prior graph 1). graph 3 is based on the same data in graph 2 and is just the sum of the two trends (density plus bit size), and its monotonically decreasing! suggesting a very natural/ simple invariant possibility. from a quick riff on the code however increasing the starting bit size ‘w’ from 3 up to 50 causes the sum to move into a nonmonotonic trend.
bottom line a meta function has been constructed to reflect the general trend of the collatz function and captures some of its basic dynamics such as some trajectories gliding and others not, a density transition point, and some of the concavity aspects of real glides/ trajectories. here the glides go from concave downward to concave upward, etc.; ie it replicates/ reproduces/ simulates the target function to some degree, hence a “toy model” with nontrivial properties and quite a bit of explanatory power. and this meta function has excellent analytic properties such as total “characterization” via the characteristic equation. admittedly its about time to start solving for it instead of the endless empirical brushes!
(5/18) ❗ 💡 ⭐ 😮 😀 this is a tiny )( few-line modification yet seems to vindicate some of the prior sketched ideas/ conjectures on stability and qualify as a breakthrough, and also is some nice “low hanging fruit”/ nearby big payoff/ affirmation/ results that can be demonstrated yet didnt require the more complex multivariate formulas. this tests adding small uniform random perturbations to each of the 2-variable recurrence relations at each iteration, max k=[0.005, 0.010, 0.015].
results are nearly identical as previously for the smallest delta ie convergence, aka stability! for the mid delta, the trajectories are less smooth, a bit more random and show some stronger similarity to real trajectories and still consistently converge (numerical/ empirical results, can this be proven in general?), but over a longer iteration count range. for the high delta, the trajectories diverge. its another transition point but in this case a remarkable one from stable to unstable. an immediate question is whether some kind of math analysis could predict that transition point delta range!
stability is of course closely related to the concept of convergence vs divergence long associated with collatz (ie property of single trajectories), but the former idea is intended to capture the aspects of semi-random walks with local perturbations.
😳 small oversight, this code has perturbations only as increments, ie positive deltas only; should definitely have included negative deltas, and suspect they may tend to “cancel out” the positive deltas and affect the stability transition point. ie immediate conjecture: combined positive & negative deltas of size ‘k’ will be more stable than positive deltas ony of size ‘k’. another small oversight, accidentally left out left side density scale on the plots but its identical to prior graph #2.
(5/19) this is a simple next step of studying the transition point in a little more detail. its 500 trajectories of smoothly varying density, and looking at the range ‘k’ in [0.007..0.009]. the transition point was found to bounce around a bit due to jitter/ randomness in the computed line fit coefficients so a fixed random seed was chosen. the graph is # of nonterminating/ nonconverging trajectories based on a 500 iteration count limit. note this must be a moderate overestimate because some trajectories would terminate/ converge with more iterations, ie false positive where detection is of nontermination.
this range of ‘k’ is crucial and can be called something like the particular tolerance of the stability property. next steps are to determine whether the delta between real collatz trajectories comes anywhere near the tolerance range. there was an earlier experiment along the lines to find those deltas (prior collatz post,
data22.rb experiment on 4/21).
this is all some extraordinary development at this point! to summarize, after years of wandering in the wilderness so to speak, the question of a proof can be reduced somewhat/ nearly/ partly to a numerical problem involving a single numerical gap measurement. presumably this gap would be lower with higher accuracy models than the 2-variate one eg the 7-variate models explored previously.
it was also observed that the delta between the meta function and real function may be higher for lower iterates (same 4/21 results/ graph), but this could possibly not stand in the way of a proof, because the proof hinges on “at the limit” gaps which maybe have precisely the needed property that they tend to narrow for increasingly higher iterates, ie the meta functions approximate the real function with more accuracy, ie lower iterates “below some limit” can be regarded as (finite!) special cases aka all part of the base case of the induction.
💡 this code also seems to reveal a key property associated with the study of semirandom walks, namely mean-returning (aka “convergent”) vs mean-avoiding (aka “divergent”) behavior. in this way its reminiscent of the deeper study in the area eg theory associated with the hurst exponent.
this is a jampacked graph that considers the error on the density and bit size ratio linear fits versus different bit sizes. the good news is that the errors are indeed decreasing for higher bit sizes! the “other” news is that the errors are not low enough to fit within the tolerance (k≈0.007 from previous experiment) by 1-2 orders of magnitude (very notably ‘w_r e_a’ green line comes very close at 100 bit width). ‘e_a’, ‘e_m’ are the average absolute error and max error. ‘e_a’ are plotted left side scale (red, green), ‘e_m’ (blue, magenta) are plotted right side scale. typically e_m ≈ 4 ⨯ e_a.
this graph also underlines that the density and bit size ratio lines have different errors associated with them but the prior experiments used only a single delta to determine tolerance (ie 2d errors vs 1d). but this general exercise shows the overall process of attempting to build/ test a meta function within the stability tolerance of the real function. as just noted there is also some hope of a tolerance-sufficient model/ meta function based on the declining/ asymptotic trend of the error, but here it mostly levels out without much hope of further substantial decline.
(5/22) this is a major reanalysis with some set of tweaks. a great idea is to try to analyze the transition point in terms of the max (fit) errors encountered, over each dimension. (wow, this all seems so simple and obvious in 2020 retrospect!) so theres a single new variable that determines the amount of delta applied to each of the 2 linear eqns in terms of the max error for each fit. but each fit is over a separate bit size (as in prior graph/ experiment), so an array [20..100] stores all the separate fits and errors, and is used in the evaluation ie slope/ y intercept are stored separately for each bit width also. there is a natural generalization to n-variate formulas and maybe will attack those soon, at least its an option. this code pins the density variable to the range [0..1]. it also analyzes in terms of a positive delta only, or a positive and negative delta.
the 1st run/ graph is positive delta only. the transition point is found in the range 0.22-0.26. this is “quite respectable”. the “magic #” is 1 and if a formula can be found where the lower transition point start is higher than 1, thats approaching sufficiency aka “proof territory”. left scale/ red line is the # of nonterminating runs (500 steps max). the right scale/ green line is total steps over 500 sample trajectories (of smoothly varying density).
the 2nd run/ graph is combined positive and negative deltas. the story was unexpected and hard to understand! there is another kind of transition point, but it was found that increasing +/- deltas actually improve/ accelerate convergence of the meta function. the transition point is about an order of magnitude larger in the range starting at ~3 and peaking at ~5 (notably/ comfortably/ substantially above the critical threshhold 1!), and thereafter declining! it is not at all clear yet how this is happening, but the code was analyzed closely for correctness and seems ok (there is only a single line difference between the two delta modes).
baseline or the “unmodified” meta function is about ~60k steps over 500 trajectories and convergence decreases ~3x to ~20k steps. the idea/ conjecture that positive plus negative deltas improves convergence is confirmed, but in a surprising way. am still not sure what to make of this yet or how to think about it, a bit perplexed at moment… am suspecting that the prediction accuracy of trajectory length probably is impacted adversely, maybe another transition point/ angle to study… ❓
(5/24) this code takes out the randomness from the perturbations and just adds a positive delta (again in terms of max error) which seems to maximally avoid/ prevent convergence of trajectories/ meta function. the prior code has a transition point k=~0.22. that code adds many sub-k deltas and this new code which always adds ‘k’ delta ie more average/ cumulative delta has a lower transition threshhold k=~0.11, about half prior k. actually thats exactly as predicted in that the prior code based on uniform randomness and positive delta adds 2x this k on average. the new code has new bilinear search code to efficiently find the transition point.
this code analyzes how much the transition point increases depending on the lowest bit widths fit in the linear formulas, ie a cutoff. x axis is bit width and y axis is the delta associated with the transition point. it demonstrates a very gradual increase in the transition point can be obtained by excluding lower bit widths from the meta function trajectory iterations. however the slope seems to flatten out suggesting maybe a quadratic formula or maybe an asymptotic limit. this is consistent/ aligns with the trend from
data35.rb graph except this graph expresses the property crucially in terms of gap distance with 1 the magic limit. so bottom line this simple 2-variate model covers ~12% of the gap and apparently some other method must be utilized to decrease/ cover the “remaining” ~88% gap. the main alternative/ hope at this point seems to be a multivariate meta function with better/ tighter fit and maybe there are other “tricks up the sleeve”.
(5/30) this is analysis of the transition point using the multivariate code with 7 variables. there are a total of 9 variables but 2 were excluded. it was found that certain combinations led to a strange oscillatory behavior. it seemed to be traceable to variables that are too “similar”. including 2 of 3 of the density variables seemed to be fine but including all 3 led to the oscillations. same with mx0/ mx1 the max bit run lengths. this code is improved and now smart enough to find the left and right sides of the transition point via efficient bilinear search in the
trans2 subroutine, and then stepwise analyze it higher resolution over 100 increments in
trans1. there are 2 modes and 2 output files, 1st/ 2nd graphs below. mode 0 adds (percent of) the max error over each dimension. mode 1 adds full range of random positive or negative max error using uniform random generator.
😦 😳 after all this substantial/ finetuned/ careful work the results are both a bit surprising and disappointing, a cold water splash in the face! the random model (2nd graph mode 1) which adds less than worst case error at each point has a transition point around ~1. however, whereas presumably this 7 variable model is more accurate, looking at the worst case model only (1st graph mode 0), the “gap coverage” is only at about ~5% which is less than half as effective as the prior experiments ~12%! this is a real bummer/ setback and means that something counterintuitive is going on that needs to be isolated to make any further progress.
another misc point to note is that the transition point on graph 1 isnt sigmoidal at all on the right end with a sharp inflection point/ plateau. am thinking some of this may be related to “false positive” on the termination detection. on inspection it looks like the curves may have a long gradual concave upward decline that means they still terminate after many iterations. am musing on how to deal with this effectively. ❓
⭐ ⭐ ⭐
💡 ⭐ (later that evening…) had an amazing idea rolling this over/ around in my head while driving in traffic. there is some strong difference between the random vs the worst case gap covers. was pondering why. another way to visualize error is that at each point there are errors in the different dimensions. but what is the distribution of those errors? this leads to the question of correlated vs uncorrelated errors in the dimensions. the worst case is when all dimensions have largest error, ie a sort of “correlation”, but only in a single point. whereas the concept of correlation is over multiple points; therefore this concept seems different, more like maybe a synchronization or coincidence concept. or maybe simply misalignment vs alignment.
which in turn reminds me of Talebs influential book on the black swan. (never did read it, have been meaning to! maybe need to go look it up soon!) am thinking maybe an apropos term would be something like “black coincidence” or, say, a straightfwd neologism like malcoincidence (vs, say, bene-coincidence?…)
anyway have some basic ideas of how to study this now, and think they are a priori quite promising seeming… am thinking/ conjecturing that “average” errors in this sense will fall somewhere between the two current models and maybe/ hopefully closer to the current random model which is roughly sufficient… this also leads to the question of provability properties of the black vs “white” coincidences… ie the key question is are misalignments common, uncommon, or impossible? am also wondering if/ hoping the multivariate models will have some major advantage here… ie is it a case of “the more the merrier” and extreme combined errors are less likely with more variables/ dimensions? in other words what are the dynamics of how errors over different dimensions “mesh”? all gives new meaning to the term wiggle room! ❓ ❗
(5/31) this code has a few enhancements. it outputs the “training errors” into a file
err.txt for inspection. the distribution is quite nonrandom wrt the density ordering of the training “seeds”/ iterates (for many or most variables that is), but didnt feel like including all the graphs at moment. next the bilinear search code has been improved in various ways eg to have a counter termination and to more accurately terminate based on finding both left and right sides of the target at least once, and returning a point that actually hits the target, not merely near to it.
next there are two error addition modes. the 1st is in terms of the dimension error averages instead of maximums. the 2nd is in terms of real (absolute) errors randomly sampled from the training data. as seen it turns out these two methods are nearly identical in terms of the transition point and also improve gap coverage substantially by roughly an order of magnitude to ~0.28. it falls short of the “magic/ golden limit” but this is encouraging/ desirable in more than one way and shows that errors are maybe somewhat “normally” distributed around their means at least in the sense that the mean is a very close standin for distribution samples. but this still largely leaves open the key question, what is it in particular about the overall setup that can have a major effect on increasing the gap coverage?