# ML + ATP + collatz = ? (via weka)

💡 hi all here are some wild half baked ideas and slightly )( more on using/ combining ML (machine learning) with ATP (automated theorem proving). both topics have been big in this blog and lately am trying to harmmer out some (new) ideas on how to merge them in a grandiose 21st century vision. there are also some preliminary experiments here.

how would one apply an ML algorithm to collatz? the whole area is quite unexplored of uniting ML algorithms with number theory or TCS proofs but yet seems open to experimentation. number theoretic problems are similar to analyzing pseudorandom functions, or functions with “signal mixed with noise” much like with Big Data. have long had a nagging suspicion theres something potentially big lurking in this (right now seemingly) exotic marriage. ie a tip/ (very big?) iceberg floating around. ML is absolutely centric to empirical approaches to problems for decades and is now after years of sometimes painstaking development on very strong theoretical and practical footing.

it appears to me the sensible main/ natural idea on how to apply it would be to try to create a prediction algorithm for trajectory stopping distance, the key “apparently unpredictable” aspect. stopping distance is not known to be bounded (not sure why wrote those words, “lol”) but is bounded for all experimental values. suppose one came up with a learning algorithm that predicted the pseudorandom-looking stopping distances. the error scaling would be a key factor. does the error increase or worse, “blow up” somehow as the model scales to very large numbers? this is a nonobvious & maybe sort of subtle question that could lead to a lot of study wrt the different ML algorithms.

it is not so obvious how ML algorithms work under scaling (esp in the number theoretic sense of unbounded inputs/ outputs) and maybe an extended research question. obviously if error increases without bound then this is a kind of failure although not necessarily. what if the error itself can be bounded? (as already long encountered in these pages) this is a sort of infinite regress of some sort/ at times (the residuals after fitting seem just as unruly and random as the original data, and in some ways can be seen to have increasing rather than decreasing deviations), but yet somehow it does seem to be moving in the right overall conceptual direction.

a definite identifiable aspect would be the requirement of using “bignums” (arbitrary precision arithmetic). ML algorithms are almost never implemented with bignums but there is no conceptual reason they cant be, and it may be a key element of any possible success in ATP, number theory, or general math/ TCS ATP. lack of bignums would almost assuredly lead to error blowup.

so suppose one had an (ML-“derived”) algorithm for predicting stopping distance that was accurate and the error was bounded in a dramatic way, say even by a constant. for all finite values “plugged into the formula”. the inputs to the algorithm would be any provably/ guaranteed recursively computed metrics although one would have to narrow it down somehow in practice. (heres also where GAs might be very applicable/ impressive in recursively building recursive metrics, this has been demonstrated in some experiments eg video game playing algorithms or circuit constructions by Koza etc! and have a nagging suspicion that GP, genetic programming is particularly relevant via the Curry-Howard correspondence, where the evolved programs might have direct parallels to definition/ proof/ lemma structures.)

then the proof of the problem reduces to proving the “correctness” of this algorithm. but how does one do that? unfortunately this seems to be another key undecidable problem in TCS, namely attempting to prove two TMs are equivalent, ie the model and the original problem. however, in this case one of them (the model) is provably recursive. is there some technique(s) or trick to proving equivalence of a provably recursive TM with another that is of unknown status (either recursive or recursively enumerable)? that is the big question. but am leaving that unresolved even though this conceptual approach seems like a genuine advance. (in a sense it is known humans do this wrt theorem proving, although its presumably not generalizable, nevertheless the exact shape/ nature of that boundary is open to further inquiry… in the 21st century!)

so in summary the problem “breaks down” to these two admittedly-both-wild but not-completely-outlandish aspects:

• “machine learn” the execution time of a possibly undecidable/ non-recursive algorithm (like collatz) as accurately as possible. this model is necessarily recursive.
• prove the recursive model is mathematically equivalent to the original algorithm.

there is some strong similarity here to compression. the model must necessarily be of limited computational complexity, and for many algorithms it is nearly constant time for predictions (efficiency is a consideration/ design goal for many ML prediction designs), but collatz trajectories grow without bound. the ML algorithms would generally do “less” computation than the collatz trajectory “computational tableaus”. so in a sense the model is compressing the collatz computational tableaus, if such a model can be constructed.

one of the simplest “ML algorithms” is polynomial interpolation. this model indeed grows in an unbounded way for inputs. but what about other ML algorithms? was wondering about RBFs, radial basis functions. intuitively think they might grow without bound. also its a key question of how complex the model is for large numbers ie “in the limit”. does the algorithm converge to some simple approximation say linear in the (infinite) limit? then it is rather weak. but do some models stay highly nonlinear even in the limit? those would be more promising.

this blog has recorded some early ML stabs at the collatz problem and gotten nontrivial results with GAs. GAs still seem very promising to me in many ways because they do not seem to have limited algorithmic complexity (in the estimation models) like a lot of other ML algorithms say NNs, SVMs, etc.

finally decided to break down and approach this more as an ML problem. have been avoiding this as long as possible because of the complexity of ML tuning, but am finally ready to bite the bullet.

via reddit ran across a nice cloudacademy tutorial recently so took a brief detour looking at new capabilities of Amazon cloud to do ML, but it apparently so far turns out only to do regression! yikes! guess will have to wait a bit on that one for the more advanced algorithms!

looking elsewhere, the recent LHC Higgs ML contest on blogged here had a lot of positive approaches using WEKA in the forums. they even had a sample WEKA setup for the problem. they seemed to get good results using tree decompositions and WEKA seems fairly strong in that area. WEKA also has several clustering algorithms and those seem more promising to me for collatz. (for “noninsiders,” WEKA is one of the leading open source ML packages/ suites, made by students at Waikato university in New Zealand, and named after their indigenous bird.)

so, to learn the basics, installed WEKA and did a “simple”/ “open science” exercise found (aka “popped up”) on the web via the magic of google for a senior computer science class in datamining at U Delaware ECE/CIS dept. identified poisonous vs nonpoisonous mushrooms using a branching tree! the class even had the excel spreadsheet online as of this writing! (alas the class didnt seem to get into clustering algorithms at all, which was what am looking for, but found it an excellent intro to the software & trying/ applying the clustering worked mostly intuitively/ smoothly….)

⭐ ⭐ ⭐

#### “deformations”, “variable computational velocity”, random walk “inverses”, open problem half-solved

have some new ideas with collatz that were plugged into weka. 1st some bkg.

an earlier idea wanted to analyze which trajectories in a block to decrease to develop a “orderly decrease” in overall current trajectory sum. realized recently that there is a remarkable way to do this for a single trajectory, it feels fundamental in some way. have had a vague idea floating around on this for many months (the original rough requirement idea is now about a year old) but only just finally worked out the math/ conceptual details!

this can be formalized with lots of great math/ calculus formulas but thats always possible later, am holding off on that for the moment and approaching/ advancing it more computationally/ algorithmically right now (maybe that is backward from how the mathematicians might approach this but hey, thats my style for now) 😎

(maybe this math is found somewhere in some textbook or paper, it seems not inconceivable, it seems to relate to dynamic/ adaptive control theory… maybe have to look into that further at some pt also… for now its a lot of ideas packed “formally” into a little code…)

imagine the area under the curve of trajectories. suppose one wanted to advance trajectories such that this area was “emitted” at an orderly rate. this reflects a sort of continuous deformation in amount of computation (of the trajectory) per time instead of one iteration per time slice. in fact suppose one wants to “emit trajectory area” in some arbitrary rate. the most orderly rate to emit it is a monotone decreasing line. so create an arbitrary function with that property starting at (0,1) and decreasing linearly to (1,0). now, how can a single trajectory area be emitted at exactly that rate? its a simple calculus/ algebra problem. am not going to write out all the math for it, instead encoded it as code.

the 1st graph (red) is the desired orderly emission rate/ “target rate”. next overlaid graph (green) is the integral of the area up to that point, its quadratic. next graph integrates the area for a complex trajectory in this case n=27 and overlays this on the linear integrated graph (the parabola). the two match closely with normalized scale but note the collatz trajectory diverges by “spikes upward” in the middle. it also accelerates faster in the beginning where the trajectory is moderately low and has to “catch up” to the “faster” linear area emission.

the final graph 3 is a “deformation graph” and also representing/ measuring “variable computational velocity”. this graph shows that at time x, if y amount of the sequence is output, then the overall trajectory emission will exactly match the linearly decreasing target emission graph. the x axis is scaled to 100 max and the y axis is scaled to 110 total iterations for n=27 trajectory. note how the graph “slows down”/ levels off in the center where the trajectory is spiking the most in its middle.

other trajectories that dont have big nonlinearities/ “spikes” in their “descent” also not surprisingly result in a nearly linear deformation graphs. the trick of the ML algorithm is to attempt to (“locally”) identify features that “control/ predict” the complex spikes & nonlinear deformations over many trajectories. in short, there is some attempt to predict global properties (such as total stopping time) from local characteristics (such as “total trajectory area computed/ emitted/ seen” so far).

note this technique works in general for any monotonically decreasing target emission graph assuming the collatz conjecture is true! so it actually works for all possible random walks. and seems like some sense of an inverse or a conversion. this is a positive answer to an earlier conjecture posted on codegolf that a orderly nonmonotonically decreasing algorithm exists, but with fineprint, ie looking at/ precomputing the overall trajectory & not computed merely using local metrics. (as stated earlier that is part of the key difficulty in proving the problem, in connecting those two, global/ local, which hopefully an ML algorithm could target.)

now some brief starry-eyed musing: this code reminds me of the concept of “hidden variables” in physics and bohms concept of implicate vs explicate order. the complicated explicate order is the 3n+1 trajectories. a possible hidden “implicate order” as in the final graph would allow one to compute the trajectory in an orderly way. then the open problem is how to compute the implicate order from the explicate order, so to speak! as just observed, also the implicate order needs to be computed in a local way instead of a global way, and these together outline the general problem and hint at some general underlying theory also.

(including the short-but-tricky gnuplot code for overlays for future ref.)

speed.rb

`set ytics nomirror; set y2tics; plot 'out.txt' using 1 with line, 'out.txt' using 2 with line axes x1y2;`

#### weka experiment

the next experiment is to output a bunch of these deformations over not-that-many trajectories, only 20 here, as a start to learning weka (and ofc hopefully weka starting “machine” learning this problem). this outputs in csv format which is not preferred in WEKA because it cant identify the target attribute etc, it has to be manually set. the input is (very) simply scaled current “location” in trajectory length 1-100 and current trajectory value. note both are local (the 1st is merely a counter) whereas the final target deformation value requires full precomputation of the entire trajectory.

tried the Kmeans clustering algorithm with 4 centers (ie just a small scale warmup/ toy experiment). following are 2 of the weka graphs. the different colors (apparently) represent which is the nearest cluster. so this ML algorithm is basically attempting to learn the “deformations”.

the kmeans clustering/ RBF-like algorithm here assigns all the higher trajectories to a single cluster which suggests it wont scale well and that all higher values not in the training set would match that cluster (this is quite raw & havent even done blind training/ test separation/ “split” yet). but the estimation formula is crucial because does it depend only on the nearest cluster, or all of them, in which case it would still be nonlinear in the infinite limit case…? have to learn a lot more about this pkg and its algorithms. the good news is it was mostly painless to install and use, although (as quite typical with open source) the manual certainly leaves something to be desired. using an off-the-shelf system defn takes away some of the biting-the-bullet overhead in working with ML, esp if its not buggy!

(1st graph is basically deformation index 1-100 vs deform value over all iterations, and 2nd is all sequential samples, both with classes colored.)

speed2.rb

```=== Run information ===

Scheme:weka.clusterers.SimpleKMeans -N 4 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     out
Instances:    2000
Attributes:   3
i
t
z
Test mode:evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 17
Within cluster sum of squared errors: 54.54741902982577
Missing values globally replaced with mean/mode

Cluster centroids:
Cluster#
Attribute    Full Data          0          1          2          3
(2000)      (601)      (580)      (243)      (576)
==================================================================
i                 49.5    14.8835    48.5655         59    82.5521
t            10696.944  1989.2629   387.1138 81013.4568   499.2101
z               17.403     6.8319     7.4621    77.6296    13.0347

Time taken to build model (full training data) : 0.03 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       601 ( 30%)
1       580 ( 29%)
2       243 ( 12%)
3       576 ( 29%)

```

(5/23) wow was in for some big shock on better analyzing the deformations. its a natural question and bit of an oversight (ie so far missing) to see how close the resulting trajectory averages get to the linear descent. the algorithm can also intuitively be seen as a sequential bin assignment problem. it accumulates area to assign to sequential bins.

now this is tricky for various reasons. a key question is how large the trajectory is versus the “bin size” (p=100 in above runs); this is now obvious in retrospect. for a trajectory that is close to bin size, the algorithm is much more constrained on how to allocate the area per bin. so for n=27 the trajectory length is 110 and there are 100 bins and the algorithm has little flexibility in assigning area to the bins, the the resulting “area/bin allocation graph” looks quite “spiky”/ nonlinear, basically very nearly as spiky as the original graph. so the bin size is crucial, but also one immediately wonders about what averages over many allocation graphs look like.

so here is some code to integrate the bin allocation curves over many sequential trajectories, ie one large “block”. it took a large block size and small bin count to start to smooth out this bin curve to nearly linear. note an algorithm that looks at all trajectories simultaneously could do a much better job of juggling tradeoffs between them (the familiar “global vs local” design decision fork), this is a fairly simple algorithm that only balances each trajectory independently of the others.

notice here that neither bin count nor run count has a overly significant smoothing effect on the spikes. this is the now-familiar fractal quality of scale invariance and spikes that tend to be scale invariant. eg doubling the runs does not have a commensurate effect on decreasing spikes. this is something of a disappointment and a violation of the typical law of averages (which ofc is based on gaussian distributions) that so many runs does not smooth the curve much. even doubling again to 100K runs (not pictured) looks not much different than the 50K runs. so while noteworthy this is in some sense a failure in the design goal of this fairly simple algorithm. also its very likely the later trajectories are much “larger” and “swamping” the earlier trajectories.

but, the theory seems solid/ meaningful & this is interesting given a algorithm that can only assign area to bins in individual trajectories. the bin assignment logic here is fairly simple wrt bin boundaries and one can imagine better logic for balancing all bins together. other ideas to try are algorithms that balance across trajectories.

speedb.rb

graph args runs (odd sequential trajectories) / bins:
#1 10K 100
#2 20K 100
#3 20K 50
#4 50K 50

(5/29) pondering/ brainstorming on the last underperforming experiment, had a really neat idea for some interesting code that would attempt to partition the areas using a very intelligent but greedy algorithm (is that an oxymoron?). this is not easy to describe and took quite a bit of effort to code/ debug, but its very elegant in some ways and appears to be working correctly (its been tested some, but the code is very subtle in places). am adding it to the blog with the intention of describing it in some detail later.

the algorithm performs quite well and is another angle (near positive answer) on the open question of attempting to allocate area in a linearly descending way. this depends on a low number of bins but works flawlessly for some parameters. following are a few runs with trajectory counts, bin size (there are 2x bins where x is the 2nd parameter) to show its performance, which is very highly optimized for not-too-many-bins/ trajectories: ❗ 😀

1st: 100 3
2nd: 100 4
3rd: 400 4

between.rb

(6/1) the next obvious experiment is to look at subsequent blocks. this code is slightly modified and is run over 10 subsequent 400 trajectory blocks and 16 bins. the descent lines are quite smooth but interpenetrate/ overlap near the midpoints, and also there is not even distribution such that some of the lines nearly coincide.

between2.rb

#### reversibility and provability

am trying to map/ work out some subtle concepts in reversibility in computations versus theorem proving. in the above example, the descent lines are “nearly” calculated recursively. their exact positions are difficult to calculate, but for a moment imagine that there are recursive functions that generate them. then to prove that collatz is recursive, one has to work “backwards” from these “final” computations to derive earlier collatz trajectories. a proof of collatz is equivalent to a proof that this process would always succeed.

however notice that reversibility plays a key role here. there are reversible versus irreversible computations to calculate the lines starting from trajectories. adding a set of numbers into a final result is not reversible, the set of numbers cannot be extracted from the sum, it “loses information”. same with an operation like a sort. one could add “ancilla bits” like in QM during the computation to make it reversible and recover the original inputs from the outputs. however, would those ancilla bits end up looking nearly random?

this also reminds me of calculating matrix inverses by applying gaussian elimination to a matrix and “carrying along” the identity matrix and running the operations on it also. could there be some kind of general concept of an “algorithmic inverse” that is calculated while processing the trajectories, such that at the end one gets recursive functions, and then the inverse shows how to recursively calculate the trajectories from the outputs?

the aspect of nondeterminism also inherently enters in this area. suppose that one wants to reverse a sort. there are only a finite number of possibilities of original orderings (permutations) and a nondeterministic search can range over all of them. same with a sum, where (over naturals) the original inputs must be a partition (eg in number theory as studied long ago by Ramanujan/ Hardy etc).

but will these searches always succeed eg starting from the descent lines? that is part of the challenge of a proof. somehow reversibility and nondeterminism seem to enter the picture in some theoretical way but cant quite yet put my finger on it, the “big picture”…. ❓

#### collatz as an “oracle”

(6/4) some other funky conceptual ideas/ musings that dont actually lead to any code currently. the collatz function is not actually a “function” unless it can be proven to be recursive. this goes for either trajectory sequences or stopping distances. so in a weird way its more like an oracle. we have a function-like computation that may or may not be a function. also it might be natural to create a function F(x,y) where y is the limit on # of iterations of a collatz sequence or stopping distance computation. F(x,y) is obviously recursive. so then the question is can oracle F(x) a collatz related function be computed by F(x,y)?

this then reminds me of the concept of query complexity. F(x, y) must attempt to compute F(x) via individual “queries” so to speak. the collatz conjecture is that this is always possible. ie is the oracle actually recursive computable? (oracles are one of the more exotic elements of complexity theory & have never developed much intuition wrt them!) a simple algorithm to compute F(x) is to just call F(x, y) for increasing y, say either incremented by units or doubled every query or whatever.

(6/5) these are some surprising results worth briefly posting immediately but am not going to write it up very detailed for now (needs further analysis). the idea here is that when processing a trajectory, there are trajectories of lower starting seeds that match closely to begin with. the idea is to take off the “msb” of the trajectory starting seed and compare the unaltered seed trajectory with the msb-removed seed trajectory, and to do this repeatedly each time the trajectories diverge. each repetition is a intermediate point in the trajectory and it has a “prefix” associated with it which is all the bits “below” the msb (the current value “masked” by all the bits below the msb). this code counts the number of divergences but also the number of unique prefixes seen during the repetitions. the surprise is that for div-2 repeated sequences (w=2), there is only a single unique prefix! this seems significant right now! 😮 ❗

shave2.rb

PS 😳 oh! for div-2 repeated case, just noticed, the stripped seed trajectory does not terminate at 1 so it can just continually cycle 1-4-2-1 along with the straight seed. terminating the inner loop when the stripped seed trajectory hits 1 leads to different results.

this next code fixes that and then looks at the distribution of “prefix breakdowns” and histogram of total # of repetitions required. its very low and seems to be bounded by 4 max even for largest ranges tested, 229 (`ARGV[0]=29`), which took tens of minutes:

`{1=>134185965, 2=>31376, 3=>380, 4=>7}`

shave3.rb