the collatz saga continues. an earlier graph seemed to find some correlation between the density and the glide (upslope side). this is proving not so easy to replicate; the correlation was found not in individual glides but in an integrated curve, which is not necessarily highly correlated with individual glides. (its an interesting general question how trends in the integrated graphs reflect individual glide trends/ statistics.)

was thinking, need to better understand simply how the different hard seed strategies generate different glides. as hinted previously, some of them are not having much effect on glide length (#2, #3, #4, #8). red is the upslope length, green is the downslope length, and the downslope is typically longer than the upslope. notable is the flip of the two in #6.
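to pin down terms, here is a minimal sketch of the glide statistics used throughout (the helper name is mine, and it assumes the compressed collatz map x → (3x+1)/2 for odd x, x/2 for even x):

```ruby
# glide statistics for a seed n >= 2: the glide is the trajectory
# prefix while iterates stay >= n; the upslope runs from the seed to
# the max iterate, the downslope from the max to the end of the glide.
# uses the compressed map (3x+1)/2 for odd x (an assumption here).
def glide_stats(n)
  x, seq = n, [n]
  while x >= n
    x = x.even? ? x / 2 : (3 * x + 1) / 2
    seq << x
  end
  i = seq.index(seq.max)
  { glide: seq.size - 1, up: i, down: seq.size - 1 - i }
end
```

eg `glide_stats(3)` gives a glide of 4 steps with a 2-step upslope and 2-step downslope (3 → 5 → 8 → 4 → 2).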

another idea that is occurring to me: write some code that creates a database (in file format) of hard seeds with their basic statistics precomputed, eg glide length, upslope/ downslope length, location of max, etc, and just fetch them for various experiments; this would be sufficient for many/ most purposes and significantly cut down on generation run time.

(other misc note: my corp firewall/ proxy started blocking `gist.github.com` as an “inappropriate code sharing site”, long the same with `pastebin.com`. **@#%&**! oh geez it was just too easy before! they sure dont make it easy around here; & the near alternative, the wordpress editor, again seems to screw up code formatting for the js code embedding method. *its not easy being a hacker…* am thinking about looking into setting up a mobile wifi hotspot! do any cellphones do that nowadays?)

here is the database idea materialized. it outputs 100 trajectories for each strategy, retrying on duplicate seeds, and writes results into “`db.txt`”.
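roughly, the write/ fetch cycle looks like this (a sketch only; the line format `seed glide up down` and the compressed collatz map are my assumptions, not necessarily the actual `db.txt` layout):

```ruby
require 'tempfile'

# sketch: precompute seed statistics once, write 1 line per seed, and
# later experiments just parse the lines back instead of regenerating
# trajectories. line format and compressed map are assumptions.
def stats(n)
  x, seq = n, [n]
  while x >= n
    x = x.even? ? x / 2 : (3 * x + 1) / 2
    seq << x
  end
  i = seq.index(seq.max)
  [n, seq.size - 1, i, seq.size - 1 - i]   # seed, glide, up, down
end

db = Tempfile.new('db')                    # stands in for db.txt
[3, 7, 27].each { |n| db.puts(stats(n).join(' ')) }
db.rewind
rows = db.readlines.map { |l| l.split.map(&:to_i) }
```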

this is the 1st analyzer code. it looks for density differences in the nontrivial glide strategies #1, #5, #6, #7 and finds them. it subtracts statistics for the “left” and “right” (upslope/ downslope) sides. for trends near the center line there is no signal (the difference cancels), and trends away from the center line show up as signal. the 1st graph is the difference in averages and standard deviations of the densities for the full left and right sides (red/ green resp), the 2nd and 3rd are for the positive and negative densities resp. a big question is how much of this is related to idiosyncrasies of the hard seed generation, and is any of it more general?

this idea occurred to me: look at the “half density”, in other words treat the binary trajectory value as having different density regions, ie look at where the 1-bits are to see if they are evenly distributed, starting simply with an “upper half” (msb half) vs a “lower half” (lsb half). that intuition seems to be “dead on”: the upper and lower bit halves do not have identical density characteristics. this code subtracts the averages and deviations of the densities (here, for the full binary value vs the lower half); again this cancels to zero given no signal. the signal is quite prominent. red, green are the difference in average, deviation for the upslope in the 1st graph, downslope in the 2nd graph resp.
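the lsb-half vs full-value density comparison can be sketched as follows (helper names are mine):

```ruby
# 1-bit density of a bit string
def density(bits)
  bits.count('1').to_f / bits.size
end

# compare the full-value density against the lower (lsb) half of the
# binary representation, per the "half density" idea above
def half_densities(n)
  b = n.to_s(2)
  [density(b), density(b[b.size / 2..-1])]
end
```

eg for binary `1101` the full density is 0.75 but the lsb half `01` has density 0.5.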

**(2/26)** so it made sense to look at the density of iterates by “cross section”, where different regions of the value from lsb to msb are analyzed. this code splits the upslope and downslope regions into 20 slices each and the cross section into 10 bins; height in the surface plot is the count of 1-bits. density falls dramatically right as the trajectory starts its decline from max but is still elevated on the downslope side. there is a fairly striking ridge on the upslope half for the lsb region; this lines up with earlier research that found frequent 1-run lsbs. this graph is for the ‘w’ strategy; these are 2 angles on the same 3d data, integrated over 100 runs with a minimum upslope and downslope length of 5. normalizing by the total bin counts was crucial for this plot because the upslope size is consistently smaller than the downslope. on the msb side, there is a concave downward trend that is hidden (“swamped”) in the general decreasing-toward-center “core” density statistic.
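the per-iterate cross-section binning can be sketched like this (a sketch with the bin count as a parameter; the function name is mine):

```ruby
# split an iterate's bit string into equal-width cross-section bins
# (msb side first) and count the 1-bits falling in each; summing these
# counts per trajectory slice over many runs builds the surface plot
def cross_section(n, bins = 10)
  b = n.to_s(2)
  (0...bins).map do |i|
    b[i * b.size / bins...((i + 1) * b.size / bins)].count('1')
  end
end
```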

the ‘cg’ strategy plot is quite similar but has a curl in the front in comparison:

**(3/3)** earlier it was asserted that the sibling cascade trajectory lengths were roughly “linearly correlated.” that has huge fine print: its very noisy and the deviation increases maybe more than linearly for the longer trajectories. this code studies that in particular: 10 runs for the ‘w’ strategy at ½ hardness, graphing the lengths of the sibling trajectories from longest to shortest. so theres an obvious linear trend but its very noisy. the code also refactors/ streamlines the statistics logic to avoid global variable usage.

**(3/4)** this code looks at a bunch of metrics together, some visualized previously. over the 4 nontrivial glide strategies, it graphs top sibling “overlap” (# of identical prefix odd/ even steps, red), glide max index (green), glide length (blue), sibling glide max index (magenta), sibling glide length (cyan). there is a very strong correlation between prefix overlap steps (red) and sibling glide max index (magenta) for all but the last ‘r’ strategy, where they increasingly diverge (for harder seeds). put another way, the prefix overlap steps nearly/ roughly bound the sibling glide max index. there is a small change to the generation logic density ranking because the prior approach was found to lead to “loops” of non-hard seed sets for the ‘w’ strategy.

overall, another way to think about these statistics is that an inductive approach would attempt to predict the new trajectory statistics (green, blue) from the other “lower” statistics (red, magenta, cyan). another observation: these varying statistics seem to collectively discriminate the different hard generation strategies quite effectively.

**(3/11)** this is maybe somewhat simple in 2020 retrospect but took me quite awhile to figure out after wondering about it repeatedly. the concept of siblings, as noted earlier, hints at some inductive structure, but it was hard for me to picture it. the current approach is to take a long trajectory and look at a (smaller) sibling. but an inductive approach must attempt to resolve larger trajectories out of smaller ones (aka a magic trick, sort of like pulling a rabbit out of a hat).

so what is the inductive structure? in short one needs to relate larger seeds to smaller ones and generate the former out of the latter in some kind of inductive way. there is a simple/ natural/ easy way to generate base-2 numbers recursively by appending 0/1 to strings. but the inductive structure being sought seems to want to *prepend*, ie alter the higher bits. it took me strangely long to figure out this pattern, but here it is.

the inductive approach is to “move” the msb leftward and replace the vacated digit with either 0 or 1, ie a 1 msb prefix is replaced with a 10 or 11 prefix instead. here is some simple code that implements that and the output for 2^{5} cases.
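a minimal sketch of that rule, inserting a 0/1 bit just below the leading 1 (this reproduces the listing that follows):

```ruby
# expand a seed string by inserting a 0 or 1 just below the msb;
# starting from "1", all width-(w+1) strings appear after w rounds
def expand(s)
  ['0', '1'].map { |d| s[0] + d + s[1..-1] }
end

lists = [['1']]
5.times { lists << lists.last.flat_map { |s| expand(s) } }
lists.each { |l| p l }
```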

["1"] ["10", "11"] ["100", "110", "101", "111"] ["1000", "1100", "1010", "1110", "1001", "1101", "1011", "1111"] ["10000", "11000", "10100", "11100", "10010", "11010", "10110", "11110", "10001", "11001", "10101", "11101", "10011", "11011", "10111", "11111"] ["100000", "110000", "101000", "111000", "100100", "110100", "101100", "111100", "100010", "110010", "101010", "111010", "100110", "110110", "101110", "111110", "100001", "110001", "101001", "111001", "100101", "110101", "101101", "111101", "100011", "110011", "101011", "111011", "100111", "110111", "101111", "111111"]

so now there is a concept of a trajectory and its “higher siblings” instead of “lower” ones. this following code looks at how long the glide is for a trajectory and its 2 higher siblings, where it tries to generate hard (longer trajectory) higher siblings. the code discovers a hard seed, treats it as one of the higher siblings, and then determines the lower sibling. the lower sibling glide length is in red and the 2 higher siblings are in green, blue. as expected the computed “hard” higher sibling in green is higher. the 2nd computed sibling in blue tends to be close to the lower sibling glide length (red).

this code looks “successful”, and other graphs suggest that about 4-5x the lower trajectory glide length bounds the 2 higher sibling glide lengths. so then was musing/ thinking ahead about how this could be encoded in a proof somehow. it reminded me of the transducer construction studied for many years, and had some thoughts on how to put that together.

in brief words, admittedly hazily sketched out (possibly clearer in my mind than in words, but not enough time to write it out in detail right now): one would need to look at two transducer cascades side-by-side. one cascade computes say 5x iterations of the collatz function for every 1 iteration by the other cascade. then using a lot of FSM operations (as with the transducer calculation) one can extract the terminating set from one cascade and prove it overlaps the terminating set from the other cascade. then verification can proceed for some finite count n of trajectory lengths. this is a bit tricky in that have not constructed a transducer to compute glides only, just a transducer for full trajectories. so may have to look at these calculations for trajectories instead of glides as a workaround, but suspect the dynamics are somewhat similar right now…

**(3/25)** this is a kind of half baked idea *(in stark contrast to everything else posted, lol)* but am throwing it up here anyway. the general idea was that maybe adding a lot of random walks might smooth out the walks some, and that this could be quantified by measuring a smoothness metric like the nonmono run length. the best case scenario is taking a lot of increasing random walks, adding them, and getting something more monotonically increasing. the basic collatz random walk is of course the length of trajectories, eg determined for sequential starting seeds (one of the most basic empirical graphs, found early in the wikipedia article on the topic).

but there are other basic random measurements, eg calculated during the hard seed generation, such as glide length, glide max index, etcetera. there are about 6 of these already computed. then there are also 5 different/ unique “trajectory compression” approaches, for a total of 30 combinations. what would all these look like? the immediate answer is that all the separate statistics are extremely jagged/ spiky and not anywhere near linear, and adding them together doesnt improve smoothness at all. another basic issue is that all the spike “locations” are highly linearly correlated. the first 50-odd trajectories for each of the 30 walks are graphed below sequentially with different colors. anyway, so again back to the drawing board.
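for concreteness, a sketch of one plausible formalization of the nonmono run length smoothness metric (the exact definition here is my assumption):

```ruby
# one plausible smoothness metric for an upward-trending walk: the
# longest run of steps between successive new maxima (smaller value
# means closer to monotonically increasing)
def nonmono_run(walk)
  best = run = 0
  mx = -Float::INFINITY
  walk.each do |x|
    if x > mx
      mx = x
      run = 0
    else
      run += 1
      best = run if run > best
    end
  end
  best
end
```

a strictly increasing walk scores 0; flat or dipping stretches drive the score up.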

**(3/31)** this is an idea of enumerating the seeds and graphing glide lengths in a “binary reverse” order with a color palette. there is a 1-1 mapping between binary strings and reversed binary strings. this shows a very basic inductive/ fractal/ self-similar scaled structure. the top, 15th line has 2^14=16K points. each line has a higher max, equal to the longest glide for starting seeds of the given binary width. this all ties in with the prior analysis of sibling glide lengths; the siblings line up spatially/ vertically in this “natural” arrangement. it would be natural to modify this into a surface plot.
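the “binary reverse” ordering can be sketched as follows (helper name is mine):

```ruby
# order the width-w seeds by the value of their reversed bit string;
# siblings share their low-order bits, so reversal places them at
# nearby x positions across rows, giving the vertical alignment
def rev_order(w)
  (2**(w - 1)...2**w).sort_by { |n| n.to_s(2).reverse.to_i(2) }
end
```

eg `rev_order(3)` yields `[4, 6, 5, 7]`.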

Anonymous: It’s not clear what you are doing here for those who don’t like reading ruby code, but anyway, do you eliminate “non-interesting” numbers through the modular restrictions?

https://en.wikipedia.org/wiki/Collatz_conjecture#Optimizations

subsection “Modular restrictions”

vznvzn (Post author): thx for the interest. yes, some earlier code explores the “base-2 prefixes” that are constrained in the sense that only a small percent of prefixes lead to glides… will look it up for you specifically if you return (its all under the “collatz” section earlier). re ruby: the ruby code is close to pseudocode in its syntax. someone who doesnt like reading it might not like reading any (programming) language at all. the descriptions could be written in a more mathematical format but thats hard for humans to parse also! and my opinion is that any serious/ comprehensive attack on the problem must have some element of an algorithmic angle.

Anonymous: Whatever you do, try it with these numbers only and see how your statistics/visualization change:

27, 31, 47, 63, 71, 91, 103, 111, 155, 223, 447, 671, 703, 871, 927.

vznvzn (Post author): yes, 27 is a semifamous case. frequently use it for unit testing. have you ever played with collatz? am interested in building up a chat/ conversation/ study group/ club on the subj… do you use stackexchange? hope you can sign up. see eg http://chat.stackexchange.com/rooms/12070/number-theory or also the “chat” tab on this blog… can discuss/ show you the details at length if you sign up. much of the code on this site is actually highly oriented around the “mod 2^{n}” properties/ behaviors.