collatz standing on shoulders of giants (not!)

RJL has a nice new blog that cites collatz/ Taos recent work on it. wrote a comment on it and it led to a negligible # of hits. lol!

this is some recent hybrid code that has some minor improvements and didnt want to lose the changes. the basic idea was to start with limited 1-runs and see how much the trajectory size could be optimized. it has to throw away candidates that exceed the 1-run limit; the limit is determined by average statistics over ½ density iterates. the graph code is a little improved, eg by numbering multiple graphs. it was found the prior “init” routines were not working as expected and/ or had low probability of creating sub-threshold runs, so they were adjusted.
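to make the filtering idea concrete, heres a minimal hedged sketch (not the actual hybrid38.rb code; helper names are hypothetical): candidates whose longest 1-run exceeds a limit estimated from average statistics of random ½ density iterates get thrown out.

```ruby
# illustrative sketch only, not hybrid38.rb; helper names are hypothetical.
# candidates whose longest 1-run exceeds a limit derived from random
# 1/2-density iterates of the same width are rejected.

def max_one_run(n)
  n.to_s(2).scan(/1+/).map(&:length).max || 0
end

def random_half_density(width)
  bits = (0...width).map { rand < 0.5 ? '1' : '0' }
  bits[0] = '1'   # force the msb so the width is exact
  bits.join.to_i(2)
end

# estimate the 1-run limit from average statistics over 1/2-density iterates
def run_limit(width, samples = 200)
  runs = (0...samples).map { max_one_run(random_half_density(width)) }
  (runs.sum.to_f / samples).ceil
end

width = 100
limit = run_limit(width)
candidates = (0...50).map { random_half_density(width) }
kept = candidates.select { |n| max_one_run(n) <= limit }
puts "limit=#{limit} kept #{kept.size}/#{candidates.size}"
```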

(uh oh!) wordpress has a new editor. always run into funky quirks in the past, some of them nearly showstoppers eg incorrectly handling escaping of code blocks thereby screwing them up/ aka corrupting them every time saved. but maybe this one will be slightly better. you cannot imagine how many times have edited posts, saved them, then had to scroll back to the point where was previously editing. omg, who the @#%& asked for that? what an unbelievably tedious “feature” that shouldnt have gotten past UAT even one nanosecond and yet was default functionality for years, lol! (argh already found a code-set shortcut doesnt work, sigh) oh and it only took me about 30m to figure out how to insert the excerpt indicator… thanks guys! and it took yet more “fiddling” to find the rearranged controls for image embedding. welcome to 2020 the age of ubiquitous smartphones and endless fiddling, lol! Nero would be happy! also the idea of going back to old blogs to edit them, esp long ones, and trusting wordpress to “do the right thing” is a bit terrifying right now. welcome to modern software which is that weird mix of both extremely powerful and extremely fragile at the same time… its probably not unlike the ancient art of metal sword forging and the old word/ concept is called brittleness o_O

hybrid38.rb

(9/12) this is another idea that just occurred to me. it seems not to be ruled out by existing experiments.

consider all glides longer than some short threshold. they all seem to have “semiorderly non ufo drains”. in other words ~½ density parity sequences without outlier long runs.

some of the idea here is that the outlier ufos were found in drains but never in the drains of non-short glides. there is a long-standing earlier conjecture that the postdetermined glide drain is semiorderly; this extends it to the full drain. also, all this time “postdetermined glide” has been used but it needs a little more distinction, once again evoking the slipperiness of words and this problem. it could mean either

  1. the postdetermined section inside the glide. this is the typical usage. may be empty if the glide terminates in the predetermined range. doesnt include much of the drain, ie specifically excludes postglide drain.
  2. the entire postdetermined trajectory after the glide which includes the drain.

(9/16) this blog is copious with ideas now and my brain also, and its like RAM deciding on what to try/ pursue next. lately the image from construct101b and construct101c from 1/2020 has been on my mind some. it was alternatively called a blob, cloud, or an arch. the other idea that occurred at the time but was not mentioned was a brainlike structure where the stem is the drain region. just had an idea to look at the postdetermined glide only. the prior code aligned the trajectories based on the endpoint of the trajectory, but what about aligning on the endpoint of the glide? that change, along with omitting the predetermined range, led to this striking, extraordinary result. ofc its both unexpected and unsurprising at the same time. there has been some usage of the “alignment technique” ie horizontal or vertical alignment of trajectories based on various criteria, and here it turns out to be utterly pivotal.

the 1st graph looks at the parity sequence and generates a random walk with increments -1, 1 for the 0, 1 parities. it has a lot of signal, but on wondering about it, then came up with the idea of assigning increments -1, 3 instead. these may seem like magic numbers but they make a lot of sense. in a log2 graph, as long mentioned/ utilized (this shows up also decades ago in the Terras ideas), the semicompressed mapping leads to increments of roughly -1/2 and +3/2. simply multiplying by 2 leads to the -1, 3 increments and the 2nd graph, which is tighter.

this is a different way of looking at a phenomenon that has long been noticed/ examined: in short, (postdetermined intraglide) drains are nearly linear random walks. one of the earliest revelations of it was the “rosetta diagram”. even earlier were a series of experiments that looked at “slope correlation” or the linear correlation in the drain and found it to be very high. but these results are reaffirming and point to some new ideas/ angles about the “core inductive structure” of the problem. in a sense this is a self-similar/ scale invariant structure for the “remainder” of trajectory glide verification calculation.
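as a rough illustration of the “slope correlation” idea (a sketch only, not the original experiments; it assumes the semicompressed map n → (3n+1)/2 or n/2 and takes the drain as the post-peak part of the trajectory): compute the pearson correlation between step index and log2 of the iterate over the drain. for typical seeds it comes out very close to -1, ie nearly linear.

```ruby
# sketch only, not the original slope-correlation experiments; assumes the
# semicompressed map and takes the drain as the post-peak part of the trajectory.

def traj(n)
  out = [n]
  while n > 1
    n = n.odd? ? (3 * n + 1) / 2 : n / 2
    out << n
  end
  out
end

def pearson(xs, ys)
  n = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  cov / (Math.sqrt(xs.sum { |x| (x - mx)**2 }) * Math.sqrt(ys.sum { |y| (y - my)**2 }))
end

n0 = (1 << 59) | rand(1 << 59)          # random ~60 bit seed
t = traj(n0)
drain = t[t.index(t.max)..-1].map { |v| Math.log2(v) }
idx = (0...drain.size).map(&:to_f)
puts "drain slope correlation ~ #{pearson(idx, drain).round(4)}"
```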

construct159.rb

(9/17) 😳 oops, math/ numeric typos are so easy to make. the actual increments should be log(1/2) and log(3/2), or working backwards (as is the case here), log(2) and log(2/3), and neither of the prior increment sets matches that; what the plots show is that if one of the increments dominates the other, the end result will be more nearly linear. using the correct increments displays a wide spread in the subtrajectory random walks that closely matches the log-scaled iterate plot. another glitch is at line 46, carried over from the prior code and not spotted; the line is redundant because all trajectories are glides and the last glide index is always found.
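for reference, heres a small sketch of the corrected walk (hedged; not construct159.rb itself): add log(3/2) on odd parities and log(1/2) on even parities of the semicompressed map and compare against log of the actual iterate. the walk drops the “+1” terms, so it only tracks the true trajectory approximately.

```ruby
# hedged sketch, not construct159.rb: walk with increments log(3/2) on odd
# parities and log(1/2) on even parities, vs log of the actual iterate.

def walk_vs_actual(n0)
  n = n0
  w = Math.log(n0)
  rows = []
  until n <= 1
    if n.odd?
      n = (3 * n + 1) / 2
      w += Math.log(1.5)
    else
      n /= 2
      w += Math.log(0.5)
    end
    rows << [w, Math.log(n)]
  end
  rows
end

rows = walk_vs_actual((1 << 40) | rand(1 << 40))
rows.each_slice([rows.size / 8, 1].max) do |chunk|
  w, a = chunk.last
  puts format("walk=%9.3f  actual log=%9.3f", w, a)
end
```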

some other experiments reveal the following. the just cited construct101c led to a question that wasnt examined too closely at the time. one can see some “sideways” glides generated at random from ½ density iterates. but how does this mesh with the long noticed/ known finding that “most ½ density iterates chosen at random are drains”? it turns out, via simple code/ analysis, that the larger the iterates, the less likely glides are chosen at random, ie the more likely the iterate is a drain. however, from Terras ideas its clear there are many glides for higher iterates, and its also known how to generate many with close to ½ density starting seeds, eg the recent 0.64 glide parity density construction does this.
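a quick hedged sketch of that simple code/ analysis (the definitions here are approximations, not the originals): sample random ½ density seeds at several widths, call the glide length the number of semicompressed steps until the iterate first drops below the seed, and measure the fraction exceeding a small threshold.

```ruby
# hedged sketch: fraction of random 1/2-density seeds whose glide length
# (steps until the iterate first drops below the seed, semicompressed map)
# exceeds a small threshold, as a function of bit width.

def random_half_density(width)
  bits = (0...width).map { rand < 0.5 ? '1' : '0' }
  bits[0] = '1'
  bits.join.to_i(2)
end

def glide_length(n0, cap = 10_000)
  n = n0
  cap.times do |i|
    n = n.odd? ? (3 * n + 1) / 2 : n / 2
    return i + 1 if n < n0
  end
  cap
end

threshold = 5
[50, 100, 200, 400].each do |width|
  samples = 500
  glides = (0...samples).count { glide_length(random_half_density(width)) > threshold }
  puts "width=#{width} glide fraction ~ #{(glides.to_f / samples).round(3)}"
end
```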

so a theme long examined is, what is it about ½ density seeds with glides that makes them different from nongliding, aka draining ½ density seeds? after many months of analysis, maybe even years, it seems am coming up mostly emptyhanded on this basic question. it seems there is something very deep yet to find here, but it has been extraordinarily elusive so far. despite having quite an array of powerful tools, have not been able to turn up anything definitive so far. yet again more thinking outside the box is apparently required, even despite copious amounts/ pivots/ efforts/ time/ energy now expended so far.

(9/21) the 3-power sequence is starting to figure again in my thinking. there has been a relative lull in it over last ~½ year of blogs but its reemerging in my attn at the moment. last month had the crucial “twist/ turnaround” in that it reevaluated results that earlier seemed to come out as null but the null finding was in retrospect due to a miscalculation/ defect/ bug. had the urge to do an experiment but then it reminded me of prior ones, ie algorithmic/ logic/ code deja vu, and so then not remembering exactly, had urge to do a quick survey. its not easy to survey old blogs anymore, but heres a quick list. the sizeable list reflects an increase in apparent significance/ interest. many of the experiments are over the related match ratio. in short a lot of signal has been found in these different angles/ povs but some “mixed signal” in the form of limitations were found also.

8/2019 construct140, bitwise41, bitwise42b, construct41

9/2019 review134, review134b, construct42, construct43, construct45

12/2019 review140

1/2020 construct107, review152, review156, review159

8/2020 review134c

this is a sort of warmup exercise very similar to construct42 graph #3. am interested in the 3-power sequence associated with >0.64 Terras density iterates. a basic factoid seems to have been overlooked/ forgotten in all the other directions now “re-dawning” on me: the 3-power sequence density seems to be a basic “explanation” for a big “near-plaguing” question over years, ie the difference in ½ density iterates either gliding or draining.

(as mentioned/ mused recently eg 6/2020 blog title, these days the phrase “plaguing question” takes on distinctly new meaning…) this code does an exponential moving average instead of a running average as in construct42 and gets a significantly different result in the early trendlines, starting out in an increasing curve instead of a lower range moving into local maxima. so the density variations seem to have some deeper structure and need more attention to analyze/ isolate esp in the initial range. there are also some clear wavelike-trends for the higher densities in this graph somewhat similarly to the earlier construct42 graphs. something is going on here…
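for clarity, heres the smoothing difference in sketch form (alpha is a hypothetical smoothing parameter, not the value used in construct163.rb): a running (cumulative) average vs an exponential moving average over a generic noisy sequence.

```ruby
# sketch contrasting a running (cumulative) average with an exponential moving
# average; alpha is a placeholder smoothing parameter, not from construct163.rb.

def running_average(xs)
  sum = 0.0
  xs.each_with_index.map { |x, i| sum += x; sum / (i + 1) }
end

def ema(xs, alpha = 0.1)
  out = []
  xs.each { |x| out << (out.empty? ? x : alpha * x + (1 - alpha) * out.last) }
  out
end

noisy = (0...100).map { 0.5 + (rand - 0.5) * 0.2 }
puts "running avg tail: #{running_average(noisy).last.round(3)}  ema tail: #{ema(noisy).last.round(3)}"
```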

( 😡 re the new editor now just discovered that it locks/ loses scrolling on the ipad somehow due to a bug. @#%&, geez, argh! also, Evil Big Corp has announced they will no longer allow web mail access, which screws up my method of copying and pasting code, ie via saving drafts; they have years-long rejected pasting it into pastebin. they continue to tighten the screws, its like the scene with luke skywalker et al in the garbage compactor… pasting into a wordpress draft still works, lol sssh dont tell anyone!)

construct163.rb

(9/22) forgot that density is quite noisy in those sequences, and that explains some of the trends; (rerunning shows) they are somewhat random and related to initial noise in the sequences. old experiments involved eg heavy smoothing, and one sometimes tends to misleadingly/ near-mistakenly picture trends in the smoothed form; smoothing can disguise different types of noise.

this is some further analysis of the 3-power sequence starting from Terras density iterates, surprisingly a fairly basic idea that hasnt been looked at in prior results, although there are related povs; it could be related to/ nearly guessed from construct107 and the idea that longer glide Terras seeds have higher density, seen in the epic/ pivotal 2/2019 findings eg construct5, and the density/ entropy trajectory convergence trends were seen in construct5d. parameters: 100 width iterates, 100 length sequences (predetermined range), 400 sample average (unusually large to overcome the noise), 4 different starting Terras densities 0.65, 0.75, 0.85, 0.95, and the density, entropy, and the same for the 3-power sequence. the 3-power sequence is not “parity aligned” to the collatz sequence, where parity alignment typically tends to give better fits/ correlations.
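for readers not steeped in the earlier blogs, heres a hedged sketch of the two basic iterate metrics. density (fraction of 1 bits) is unambiguous; “entropy” is taken here as shannon entropy over adjacent bit-pair (digram) frequencies, which is only a guess at the definition used in the construct* code, not the actual one.

```ruby
# hedged sketch of the iterate metrics: density is the fraction of 1 bits;
# "entropy" here is shannon entropy of bit-pair (digram) frequencies, an
# assumed reading rather than the construct* code's actual definition.

def density(n)
  b = n.to_s(2)
  b.count('1').to_f / b.length
end

def entropy(n)
  b = n.to_s(2)
  pairs = (0...b.length - 1).map { |i| b[i, 2] }
  freq = pairs.tally.transform_values { |c| c.to_f / pairs.size }
  -freq.values.sum { |p| p * Math.log2(p) }
end

n = (1 << 99) | rand(1 << 99)   # ~100 bit iterate
puts "density=#{density(n).round(3)} entropy=#{entropy(n).round(3)}"
```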

the 3-power sequence converges to center density and entropy faster, and the concavity is reversed. but theres clearly a sense in which the 3-power sequence is approximating the collatz sequence. this analysis for the collatz sequence is quite similar to construct15 on 2/2019. not remarked on at the time, it is somewhat striking how the density and entropy converge to the center at exactly the 1st postdetermined point.

💡 an immediate idea comes to mind. what if one tried to construct an approximation of the collatz sequence from the 3-power sequence using current iterates? it seems in principle it could/ would/ should be possible to devise something, albeit subject to a lot of noise constraints.
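one plausible (hedged) realization of that idea, not necessarily the intended construction: estimate the iterate after i semicompressed steps with o odd parities as n0 · 3^o / 2^i, ie the terras-style estimate that drops the +1 terms.

```ruby
# hedged sketch: terras-style approximation of the trajectory from the 3-power
# accumulation, log2(n_i) ~ log2(n0) + o*log2(3) - i, vs the actual log2 iterate.
# this is one plausible reading of the idea above, not the author's construction.

def approx_and_actual(n0, steps)
  n = n0
  odd = 0
  (1..steps).map do |i|
    odd += 1 if n.odd?
    n = n.odd? ? (3 * n + 1) / 2 : n / 2
    [Math.log2(n0) + odd * Math.log2(3) - i, Math.log2(n)]
  end
end

n0 = (1 << 80) | rand(1 << 80)
approx_and_actual(n0, 40).each_slice(8) do |chunk|
  a, b = chunk.last
  puts format("approx log2=%9.3f  actual log2=%9.3f", a, b)
end
```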

construct163c.rb

(later) there is a remarkable finding in construct15 that was overlooked, and just now stumbled across it again. it is long known the low vs high bits sometimes have different characteristics, seen early on eg in construct5. did not notice this much earlier, but just now found over trajectories for Terras density iterates that the low density is sometimes stable at center while the high density varies!

💡 this led me to a remarkable idea. how could this be measured as a continuum instead of 2 separate values, one varying and one unvarying? it appears they are pointing to the idea of a center of weight. consider the 1 bits as having a weight (as the density concept already considers with hamming “weight”); then for a ½ density iterate, there will be approximately ½ of the width comprised of 1 bits. so where are they distributed? an even/ “uniform” distribution would put ¼ of all bits (ie ½ of the 1 bits) in the low half (lsb side) and ¼ of all bits in the high half (msb side). but the density low and high differences show that this is not the expected distribution, ie the center of weight is not at the center of the width.

so then came up with this simple metric, it measures center of weight as a fraction of the width, either the “left or right” (msb vs lsb) by counting ¼ of the 1 bits counting from left or right. then applied it to the existing code, and got high signal. there is also a lot of correlation going on. this needs a lot more analysis and investigation, but am just posting it in relatively raw form for now. this also uses parity alignment on the 3-power sequence. as in the last series, the noise tends to decrease for the higher Terras density iterates. another observation is that ‘mr’ roughly tracks entropy ‘e’ and ‘m3r’ roughly tracks 3-power entropy ‘e3’ as in prior graphs, but this center of weight metric would still tend to be near ½ if entropy and density were evenly distributed in the high/ msb vs low/ lsb ends of the iterate. another strange twist/ finding is that ‘m’, ‘m3’ center of weight is ½-random measured from the left, and contains signal measured from the right. this seems to show the msb bits are roughly evenly distributed and the lsb ones are not, ie tend to “bunch” to one side, even if both sides have ½ density/ entropy. ❓
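since the description is raw, heres a hedged sketch of one plausible reading of the center of weight metric (not necessarily the construct164.rb definition): scan from the msb (left) or lsb (right) end and report, as a fraction of the width, the position at which the cumulative 1-bit count reaches ¼ of the width; an evenly spread ½ density iterate comes out near ½.

```ruby
# hedged sketch of a "center of weight" metric, one plausible reading only:
# scanning from the msb (left) or lsb (right) end, report as a fraction of the
# width the position where the cumulative 1-bit count reaches 1/4 of the width.

def center_of_weight(n, from_right: false)
  bits = n.to_s(2).chars
  bits.reverse! if from_right
  target = bits.length / 4.0
  count = 0
  bits.each_with_index do |b, i|
    count += 1 if b == '1'
    return (i + 1).to_f / bits.length if count >= target
  end
  1.0
end

n = (1 << 99) | rand(1 << 99)
puts "from msb: #{center_of_weight(n).round(3)}  from lsb: #{center_of_weight(n, from_right: true).round(3)}"
```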

so anyway yeah its raw, but nevertheless…

😎 ⭐ birth of a new feature!

construct164.rb

(9/23) this is an analysis with large 800 width trajectories to allow a long 500 width running average on individual trajectories, 40 trajectories above ½ Terras density. the density and entropy are extremely noisy hence requiring the large average span to smooth out, and the averages come out very close to the center also as seen in construct15 but they nevertheless have strong signal. the density and entropy parity on the other hand have very strong signal away from the center ½. the postdetermined return-to-center tendency seen in recent series of graphs and construct15 is not much observed/ evident at all in averaging in this dimension/ direction. so have to think more about why thats happening and come up with some kind of picture/ explanation for that, its not a trivial thing to understand.

❗ bottom line though, given “sufficient samples,” these 4 metrics seem to be sufficient to approximately determine the “relative location” of the iterate in Terras density glide “spectrum.” sounds relatively innocuous right? but that could be a very big deal because it possibly essentially leads to a general formula for “current glide location”…

construct165.rb

(9/24) 💡 ⭐ 😎 ❗ ❓ had some new flash of insight last nite. some loose ends are starting to cohere/ mesh/ meld together.

it goes back to dynamical systems. just recently revised my thinking on that outlined on 7/19. and then reread this neat quanta profile from 12/2019 on Tao + Collatz. it reflects some of the same idea. have cited Taos recent result but forgot some about the quanta writeup. he also has an excellent “high school” level slideshow refd on his blog, my favorite slide is the squirrel at the end that also seems to ref Creation of Adam by Michelangelo. and ofc scrat… maybe a new logo for this blog?

But Tao realized there was something similar about them. With a PDE, you plug in some values, get other values out, and repeat the process — all to understand that future state of the system. For any given PDE, mathematicians want to know if some starting values eventually lead to infinite values as an output or whether an equation always yields finite values, regardless of the values you start with.

the article doesnt state this outright, but its quite obvious and has been around here for a long time. the collatz problem is a DDE, a discrete differential equation.

now, the big picture is finally starting to materialize. its a possible answer both to this problem and a more general technique, exactly the 2 main goals long outlined here. some of this was already sketched out in extensive years old work on the MDE, matrix difference equation ideas. this also relates to more recent ideas about traversals and a scaffold. but now have a better idea how to generalize the basic problem and describe the basic attack strategy! its strange how these ideas have all been circulating for years, but was running on partial intuition aka “going thru the motions” and some aspects were still eluding me until recently/ now!

  1. identify (quantitative) features of the problem. they typically must be expressed in scale invariant form. how to tell? scale variant properties in short are unbounded. example: iterate size is not scale invariant. iterate density is scale invariant.
  2. build a model that replicates the dynamics (“so well”), ie it determines the change in features from one iteration to the next. try to increase fidelity (fit) as much as possible. the model can be built using ML. (a toy sketch of this and the adversarial steps follows the list.)
  3. then work on an adversarial attack on the model. the optimization/ ML code then attempts to break the model in 2 different ways.
  4. the first way is to look at the termination properties of the model. the (adversarial) ML will try to find an iterate and its associated feature set that does not lead to termination in the model.
  5. the 2nd way is to try to find collatz iterates that when converted to the model lead to high/ “outlier” levels of error in the fit possibly not encountered during training.
  6. if both of these “stress tests” fail, ie fail to break the model, then the model (apparently!) always terminates and always stays within the fit error; this is a possible structure for a proof. the two stress tests are possibly interrelated, because a failure in one may lead to a failure in the other, or they may fail separately.
  7. one has to then analyze the model dynamics and attempt to build a proof out of the (simpler) model structure instead of the (more complex) actual original unmodelled “straight/ raw” problem. in particular the model may have basic properties of scale invariance related to the features, which play a crucial role in possible proof structure. the model serves as a map of the problem’s crucial features and their interrelationships/ dynamics.
  8. if something fails about the stress tests, one can either (a) adjust/ add/ subtract features (b) look at different ML models with different structure and/or more sophistication to fit the same data. the type of failure will point to how to proceed. for example if there are samples/ cases with high error rates, one can look for anomalies in them and how they “defy” existing features with an eye to building new ones (discovered anomalies are like outlines/ signs of new features!).
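heres a toy skeleton of the outline (a sketch only, under strong simplifying assumptions, not a working proof strategy): the features are scale invariant (density plus a crude max-1-run fraction), the “model” is a least-squares fit predicting the change in log2(n) over one semicompressed step from the features, and the adversarial pass is a crude random search for outlier prediction error, ie the second stress test (item 5) in the list.

```ruby
# toy skeleton of the outline above; a hedged sketch, not a working strategy.
# scale-invariant features -> fitted delta model -> adversarial error search.
require 'matrix'

def step(n)
  n.odd? ? (3 * n + 1) / 2 : n / 2
end

def features(n)
  b = n.to_s(2)
  [b.count('1').to_f / b.length,                               # density
   (b.scan(/1+/).map(&:length).max || 0).to_f / b.length]      # max 1-run fraction
end

def rand_iterate(width)
  (1 << (width - 1)) | rand(1 << (width - 1))
end

# fit delta_log2 ~ w0 + w1*density + w2*run_fraction by ordinary least squares
train = (0...500).map { rand_iterate(100 + rand(200)) }
x = Matrix.rows(train.map { |n| [1.0] + features(n) })
y = Matrix.column_vector(train.map { |n| Math.log2(step(n)) - Math.log2(n) })
w = ((x.transpose * x).inverse * x.transpose * y).column(0).to_a
predict = ->(n) { ([1.0] + features(n)).zip(w).sum { |a, b| a * b } }

# adversarial pass (stress test): random search for iterates with outlier error
err = ->(n) { (predict.call(n) - (Math.log2(step(n)) - Math.log2(n))).abs }
worst = (0...2000).map { rand_iterate(100 + rand(200)) }.max_by(&err)
puts "fit weights=#{w.map { |v| v.round(3) }}  worst error=#{err.call(worst).round(3)}"
```

ofc the real thing would model the full feature deltas and drive the search with an actual ML/ optimization algorithm rather than random sampling, but the skeleton shows how the pieces of the outline connect.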

much of these ideas were already carried out with the MDE Matrix Difference Equation approach. need to review again to remember more exactly why it failed, think it was related to lack of termination associated with the undifferentiated region. while it may have many variables, at heart it is a linear model, not esp sophisticated, and it was not esp expected to succeed, but was a worthwhile, even valuable exercise in attaining experience/ knowledge/ strategy.

the trolls have argued, even with some credibility in various places, against the approaches outlined in this blog, in that they cant possibly succeed. however, while ofc unconventional and frankly, revolutionary if it eventually/ ever succeeds, think the above outline cannot be “credibly discredited” a priori by anyone knowledgeable in the areas. to paraphrase the oft-maligned Wolfram, its a new kind of mathematics (+ Science + Engineering!) ❗ 😮 ⭐ ❤

(later) looking back 3/2018 had the final thwart8 code, last in the series showing limitations of the MDE prediction, which showed that the ‘w’ strategy tends to create long glides but with glide iterates in the ½ density region, and the MDE “remaining glide” predictions are poor/ flatline for long “undifferentiated” glides. in other words the MDE was mostly tuned for the differentiated region and tends to fail in long undifferentiated regions, which from later Terras work are known to be not rare at all ie quite copious. actually “tuned” is an understatement, but maybe only mentioned in passing at the time; the training samples were (iirc) almost and/or entirely from the density distribution which (in long/ knowledgeable retrospect) nearly explains everything about the applicability + success + limitations of the model…

(9/28) the above outline is straightfwd, coherent, fundamental. it literally took many years to formulate. it is not unprecedented. others are already pursuing similar ideas but maybe are not as close to comprehending the “big picture”. its a new technique for 21st century mathematics and computer science, a fusion which might be called algorithmics. its not (yet?) an answer, but it is a kind of map. pondering it, have realized there are 3 wrinkles or “gotchas/ caveats” both mentioned/ acknowledged and somewhat hidden. these are somewhat like “antipatterns” and are expected on hard problems, ofc no procedure is guaranteed due to the phenomenon of undecidability.

  1. the model may fail to halt. there was some example of this in the MDE analysis.
  2. the error may not be “in line” aka “in bounds/ limits.” a “typical” antipattern is probably along the lines that the error increases with larger iterates. cant exactly recall a clear case of this for scaled features so far but think have alluded to the idea previously. so the prior outline is probably pointing toward bounding the error by a constant. however it could be different (constants) across different features or maybe some overall error estimate across features is bounded.

❓ now, thinking about all the prior analysis, heres a/ the (potential) “big gotcha”

  1. as already discovered, there may be an indistinguishable/ undifferentiable aka “featureless” region. but here now is a somewhat new pov on what it may look like. a featureless region means that the pre/ post measured features of some collatz iteration all do not change substantially within noise effects.

but this also yields some insight into a possible strategy/ opening. one can focus on looking at whether current/ developed features are coming up as unaltered/ unshifted on some particular region/ edge cases, and look at tuning/ adjusting them and/ or adding new features that improve “discrimination.” significant work has already been expended in this area/ direction.

this also tends to point to the need for a multi-iterate/ “block” analysis approach/ direction because various/ many features generally do not change “much” in single iterations.

on further thought, and in the interests of being thorough here at this pivotal juncture outlining the big picture, there is yet another fineprint item/ gotcha worth mentioning that has been alluded to earlier, in contemplation of the Perron ideas in some older blogs, but am understandably holding off on thinking about/ addressing it for now.

  1. the prior ideas involve meshing the raw collatz data with the model within error parameters/ bounds/ limits. there is some assumption/ reliance on that “errors are randomly distributed.” in other words adjusting the raw data to the model approximations introduces a sort of “roundoff error.” the analogy to numerical computation error is apparently/ probably strong. some numerical computation errors compound in a random way and dont affect calculations/ overall “bottom line” results, others do, and cause erroneous results aka “overflow” etc.

(later) there has been a lot of work relating to this new idea. the basic theme is looking at change in some features over a single collatz mapping; in contrast a lot of other prior focus/ exploration is on longterm trends in feature changes. was reminded/ thinking of some of the work in the 1st category. quickly looking back, some early explorations of Terras glides (before a lot of the better/ later/ much more thorough/ oft-cited quantifying) were on 10/2018: experiments 121, gap2, gap3. these look at changes over 1 mapping in density, entropy, and max runs, among others. they increase confidence in this direction because they generally show feature variation over a mapping. this is a key concept and seems to deserve a new name; its related to differentiability but somewhat different.

define gradation as the change (“delta”) in one or more features over some iterations, ie a multiple/ batch collatz mapping. the key question of gradation is over all features over the undifferentiated region. the gradation may be steeper or milder; if it is too mild, it gets lost in the noise. call such cases subpar.
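as a minimal hedged sketch (the batch sizes and the density feature are just placeholders), gradation of a feature can be measured directly as its delta over k semicompressed mappings:

```ruby
# hedged sketch of measuring gradation: the delta in a feature (density here)
# over a batch of k semicompressed collatz mappings for a random 1/2-density
# iterate. "subpar" gradations are those that vanish into the sampling noise.

def step(n)
  n.odd? ? (3 * n + 1) / 2 : n / 2
end

def density(n)
  b = n.to_s(2)
  b.count('1').to_f / b.length
end

def gradation(n, k)
  m = n
  k.times { m = step(m) }
  density(m) - density(n)
end

n0 = (1 << 199) | rand(1 << 199)
[1, 5, 10, 20].each do |k|
  puts "k=#{k} density gradation ~ #{gradation(n0, k).round(4)}"
end
```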

💡 ❓ the adversarial approach is again relevant here. the adversarial algorithm could try to find sets or regions of iterates that cause mild to subpar gradations among the currently devised/ investigated features. maybe esp notable are features that lose their gradation “edge” as iterates get larger, they have issues with scale invariance and thereby probably also model fit error.

(9/29) could not resist some fast check on some of this/ “proof of concept” ie some form of feature gradation check/ testing. the idea is to quickly rule out any potential showstopping element(s). heres a quick implementation of some of those questions via optimization using the hybrid code. the formulation is to look at density and entropy metrics over 400 width iterates with a batch size here of 5, 10, 20. there are full, lo/ hi variables, and then also the center of weight aka called midpoint in the code is measured. the optimizer (1) restricts iterates to center density, throwing out those that arent and (2) attempts to minimize the distance to center of all the metrics. in this scheme most generated candidate iterates are thrown out due to not being ½ density. another kind of simple optimization technique could just generate random ½ density iterates as candidates but presumably would not achieve as high optimization, the difference relates to the genetic algorithm exploiting “similar properties” of “similar iterates.”

the finding is that the metrics cant be driven to zero (undifferentiability). also unsurprisingly the lowest/ smallest rightward wedge tends to be larger for a larger # of batch iterations. the distributions over the variables tend to be different at the different bit widths (seen in the different color distributions). results are sorted by the ‘z’ optimization value plotted on the right side scale; note there are 2 different z values (oops), the familiar multivariate optimization parameter measuring dispersion, and the 2nd is the batch iteration count.

overall/ bottom line, this is encouraging, a good sign, however as usual/ theres always a caveat; the issue of noise has to be taken into account. maybe the features (values) are being driven into a noise region such that any ML algorithm cant distinguish any termination properties with any certainty. also conspicuously the (adversarial) algorithm is not actually looking at deltas yet.

some prior work already looked at some similar ideas. eg looking at entropy or density distance from center for the 1st postdetermined iterate, and finding it can be driven low but probably not to zero, ie its an asymptotic curve toward some nonzero constant.

hybrid40.rb

some other thought on the overall outline procedure. the basic idea, if it isnt already obvious, is to convert the aspects/ dynamics of the problem into a feature space. the feature space is finite size/ bounded, unlike the problem, which is unbounded. this is not a new idea and another example would be the LSA technique, which via SVD/ PCA also shows up nearly identically in ratings systems eg netflix etc… in that analysis there is reference to “latent/ hidden variables” and “feature vectors” etc, with some relatively strong analogy. however, there is a key difference. the matrix approaches already work with finite sized matrices for both the raw data and the model (comprised of the principal components); the infinite/ unbounded aspect is not present, although some of these models are very large, eg # of netflix subscribers or amazon raters etc. however, another strong analogy arises in the concept of noise and (numerical) approximation: PCA is in a sense an approximation algorithm that removes or disregards noise.

💡 some other thoughts on looking at this, esp the last noisy graphs. there is an anthropomorphic tendency to assume features change in a continuous way which is naive at best and catastrophic at worst (or less dramatically, borrowing from psychology advice to avoid “catastrophic thinking,” simply a “showstopper/ dealbreaker”). one might call it similar to the gaussian “normal” bias. discontinuous feature changes are likely to disrupt the modelling.

💡 my other idea is that an adaptive algorithm sampling the problem vs feature space seems to make a lot of sense. somehow it needs to “focus/ zoom into” “problematic” regions in the feature space with undifferentiability or discontinuous aspects. in some cases higher sample resolution in a region/ area could improve the model, or it might lead to the discovery of the failure/ breakdown of the model in that region. not exactly sure how to measure this yet. but heres another idea.

💡 one can look at the feature space as a finite set of “representative” vectors as with nearest neighbor modelling. then one can look at how the vectors map onto each other, leading to a transition graph (somewhat similar to a FSM, finite state machine!). the problem of finding a nonhalting computation aka infinite trajectory then becomes reduced to finding loops in this feature transition graph. the proof reduces to proving no loops exist in the graph (within the fit/ modelling error). clearly, the way that noise interacts with features is going to play a crucial/ paramount role here. aka “potential showstopper.”
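heres a hedged sketch of that transition-graph idea (the grid size and features are arbitrary placeholder choices, not a worked-out design): quantize a feature vector into a coarse grid cell, record cell-to-cell transitions under the semicompressed map, and check the resulting graph for cycles.

```ruby
# hedged sketch of the transition-graph idea; grid size and features are
# placeholders. quantize a feature vector into a coarse grid cell, record
# cell -> cell transitions, and check the resulting graph for cycles (a loop
# would be a candidate nonterminating region, within modelling error).

def step(n)
  n.odd? ? (3 * n + 1) / 2 : n / 2
end

def cell(n, bins = 10)
  b = n.to_s(2)
  d = b.count('1').to_f / b.length
  r = (b.scan(/1+/).map(&:length).max || 0).to_f / b.length
  [(d * bins).floor, (r * bins).floor]
end

edges = Hash.new { |h, k| h[k] = [] }
200.times do
  n = (1 << 99) | rand(1 << 99)
  50.times do
    m = step(n)
    a, b = cell(n), cell(m)
    # self-transitions dominate single-step dynamics (the gradation issue again);
    # skip them so the cycle check below is not trivially true
    edges[a] << b unless a == b || edges[a].include?(b)
    n = m
  end
end

# simple cycle check: depth-first search carrying the current path as a stack
def cyclic?(edges, node, stack = [], seen = {})
  return true if stack.include?(node)
  return false if seen[node]
  seen[node] = true
  edges[node].any? { |nxt| cyclic?(edges, nxt, stack + [node], seen) }
end

puts "cycle found: #{edges.keys.any? { |node| cyclic?(edges, node) }}"
```

notably, run as-is this toy version will typically still find loops between neighboring cells, because single-step feature changes are small and noisy; that is exactly the gradation/ noise issue above, and a real version would need block/ multi-iterate transitions and much better features.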

💡 as for “failure/ breakdown of the model in some region,” again from knowledge of/ experience with fractals, another scenario arises: possibly the halt/ nonhalt boundary is “infinitely complex at all scales” in certain areas. one is reminded of the multi-magnet pendulum dynamics. have to think more how this manifests and how it can be counteracted. but there are a lot of tools/ angles now. the main ones are listed: new features, better ML algorithms, more sampling points, etc.
