# collatz, loose ends

as mentioned in the last post, am zooming in on the “power iteration” algorithm. it is explained as, “if you want to find the dominant eigenvector, use the power iteration”. in my case, found it by discovering that “if you use the power iteration, it will lead to the dominant eigenvector”. kind of subtle right? maybe reminds me of that old saying “all paths lead to rome”. and then ofc, the classic, “rome wasnt built in a day”.

here is the code that compares the (normalized) current state vector to the dominant eigenvector, which apparently ruby organizes it as the leftmost column of the left eigen-decomposition matrix. it uses/ selects the 95th/100th density iteration which tends to lead to a longer trajectory. am in good company, as wikipedia notes the power iteration is used at the core of Google pagerank algorithm! 😀 😎 💡 ⭐ ❗ ❤

but, the perplexity continues in a new form. at 1st was thinking that maybe there is some kind of numerical overflow going on, eg maybe a very large statevector, which would imply a very large normalization constant. but as from these graphs the factor, namely ‘y_n’, thick gray, while maybe growing exponentially (but too early to tell in the curve), is not very large in the early region.

graph 1 is the difference between the parameters in the normalized state vector and the dominant eigenvector. it has a zig-zag odd-even iteration thrashing pattern. the 2nd and 3rd graphs are the odd and even iterations data separated. one of the alternations seems to converge, the other seems to diverge asymptotically. after multiple runs, it is not consistently the case that either the odd/ even iterations are diverging/ converging respectively, it seems to be a coin flip/ ie flips randomly. ‘yd_n’ thick red is the normalized error vector (difference between normalized state vector and dominant eigenvector) and shows/ confirms the alternating convergence/ divergence. it also nearly tracks ‘dl’ magenta, the lsb density.

also either ‘dh’ high density blue and ‘mx1’ max 1 values (differences) alternate negative. it is not easy to tell from these graphs because ‘wr’ is close to zero, but it is alternating between positive and negative.

another major finding with this code: it reuses the code to calculate the trajectory length based on the iterative, non eigen-decomposition method. using the eigen-decomposition method for the same # of iterations tends to lead to a length that is about exactly as long as until the parameters converge to the dominant eigenvector, ie a low nearly-constant error/ difference, about 0.05, but only on the “converging” half iterations.

❓ am still scratching head on this one! is this behavior analyzed anywhere? or maybe it is rarely analyzed in the literature as “divergent”? but it is known to show up in the leslie population dynamics, although in that case its a positive eigenvalue. the code doesnt expressly have anything to cause the alternation effect so am a bit mystified at the moment. seems like the proof is claiming that ‘yd_n’ should converge to zero but it only happens over half the iterations.

am starting to wonder if the wikipedia proof for convergence to the dominant eigenvector (power iteration wikipedia page) might require a positive eigenvalue in the fineprint, maybe glossed over! ie is maybe all this behavior attributable to the negative dominant eigenvalue? did spot one step of the proof that glossed over a subtle detail.

namely it claims that $b_{k+1}=\frac{Ab_k}{\lVert Ab_k \rVert}=\frac{A^{k+1}b_0}{\lVert A^{k+1}b_0 \rVert}$. ie it is asserting that a stepwise normalization is equivalent to a nonstepwise “cumulative” normalization. looks simple but wheres the proof of that?

matrix3.rb

(10/4) 💡 think have figured it out. on wikipedia the formula for the nth state vector in the power iteration is $b_k=e^{i \phi_k} \frac{c_1}{|c_1|}v_1 + r_k$ where $c_1$ is determined by representing the initial state vector $b_0$ as a linear combination of the eigenvectors ie $b_0 = c_1 v_1 + c_2 v_2 + \ldots + c_n v_n$ where $v_n$ are the eigenvectors, $v_1$ the dominant eigenvector and $\lVert r_k \rVert \rightarrow 0$ is the error term. the term $\frac{c_1}{|c_1|}$ is just the sign of $c_1$ if $c_1$ is real only, no imaginary part, and my current code is not calculating it/ taking it into account. $e^{i \phi_k} = \left(\frac{\lambda_1}{|\lambda_1|}\right)^k$ where $\lambda_1$ is the dominant eigenvalue, again reducing to just sign changes $(-1)^k$ if it has no imaginary part. ie in short $c_1$ sign is apparently determining the (initial) odd/ even alternation pattern and maybe correcting it will lead to convergence in both alternations (but honestly cant picture it exactly right now). 😐

further thoughts/ thinking out loud/ clarity: wikipedia doesnt seem to explain $\phi_k$ much and sort of acts like its a constant talking about $e^{i \phi_k}=1$ as a condition for convergence, but why then the k subscript? now realizing that $e^{i \phi_k}$ is alternating signs for a negative dominant eigenvalue. that would explain how a difference of two quantities a, b in the current code (here the coordinates of the normalized state vector dominant eigenvector) alternates if one of the quantities is missing an alternating sign correction/ term.

looking at that closely, am now thinking overall wikipedia proof is excluding my case of a negative dominant eigenvalue from “converging” case but omitting it from stated thm convergence conditions, however/ whereas a small alteration/ rearrangement in the proof accounting for alternating sign pattern can also account/ allow for converging in that “special” case. (or maybe wikipedia writer knew all this and was alluding to it in refs of “subsequences”?) am also thinking current code “solver” might work correctly for case of a positive dominant eigenvalue.

(11/7) this code took quite a bit of subtle thinking and debugging but basically fixes the messed up alternating negative factor(s) and the leading positive/ negative sign setup based on expressing the initial state vector as linear combination of the eigenvectors, found simply (not mentioned by wikipedia) by multiplying initial statevector by the inverse eigenvector matrix. also trying to understand what was not working in the prior code, went back to basics. noticed a rather big oversight, even after very heavy use and months of banging around, maybe have never really looked at the dynamics of the model directly, was immediately mostly focused on the difference between the current parameters and dominant eigenvector, and in general exploring misc properties/ behavior of the iterative model such as its crucial termination behavior, viewing it somewhat as a “black box”.

the 1st plot rectifies this oversight and shows the associated parameters of a basic trajectory. the error ‘e’ between the iterative model and the eigen decomposition calculation method is plotted in gray, right axis scale. it shows the parameters themselves tend to oscillate significantly. note ‘wr2’ is an adjusted value for graphing, (‘wr’-1.0)*50. the next 3 graphs are the same graphs as previously, the full iterations and the odd/ even half iterations for the parameter differences from the dominant eigenvector. ‘wr’ is scaled by 100 to fit/ spread in the graph. the alternating/ oscillating pattern in the differences is still evident but in this case as intended/ fixed it converges to 0 in both cases instead of only one.

there was 1 other major wrinkle, in 27/28 cases the eigen decomposition was in Jordan form ie the dominant eigenvalue/ eigenvector in 1st place. but there was one case that was an exception, it was in the 2nd column, so apparently ruby does not enforce that order and it was coincidental for (nearly all) this data. the code therefore finds/ utilizes it with a simple sort.

all this gives new insight into the overall behavior/ dynamics/ proof structure. in the model, for some trajectories there is a relatively smooth descent/ concave downward curve after the peak of a glide (‘nl’ parameter not pictured below although as graphed previously). for others, typically as seen in an earlier plot there is a smooth upglide and the downglide oscillates with increasing amplitude. this might actually be the “stable point” that is being sought in my case. in other words ‘wr’ tends to oscillate up and down with increasing amplitude and eventually/ apparently inevitably a “down” oscillation will be large enough to cause the trajectory to descend below its initial value.

overall this was an excellent/ very challenging exercise to translate/ implement math off the minimalist/ at-times terse wikipedia writeup into code and understand the basic dynamics of the power algorithm, piecing some parts together and filling in some blanks. to solve my particular problem involves a different direction (not exactly focused on the dominant eigenvector, which is a limit case) but this theory is highly adjacent/ applicable. the key step for me is to extract the single variable function for ‘wr’ and solve for $\prod wr_n < 1$ ie the end of the glide based on the model, and dont have a totally clear picture of how to do that yet.

💡 ⭐ another highlight! found a lot of overlapping material with the wiki pages under study for matrix difference eqn and matrix exponential in a book long lying around here, bought years ago. Matrices and Linear Algebra by Schneider and Barker has a very relevant chapter 8 Applications to Differential Eqns and is helping expose the relevant math in much more detail. saw/ skimmed/ pondered that chapter weeks ago but then didnt recognize the correspondence because it looked like another language. the eqn is expressed in the form of a “vector derivative” $x' = Ax$ but as sketched out earlier all the theory translates (for time steps $\Delta t = 1$ and 1st order Taylor approximations/ Euler differential eqn solver method etc.)

matrix3b.rb

(11/9) 💡 ❗ 😮 🙄 😀 ❓ hacked out some quick-and-dirty code yesterday that verifies the following simple property. as on the wikipedia page for matrix difference eqn, the dynamics is equivalent to $y_t = P D^t P^{-1} y_0$. now re “Extracting the dynamics of a single scalar variable from a first-order matrix system,” its not mentioned but almost trivial that the eqn for a single variable is the same eqn except $y_t = P_i D^t P^{-1} y_0$ instead where $P_i$ is the ith row vector corresponding to the variable in question. how straightfwd! now doing some math in my head and still needing to verify this, think the eqn for ‘wr’ is now apparent. it is in the form $\prod_n (c_1 \lambda_1^n + c_2 \lambda_2^n + \ldots + c_k \lambda_k ^n + c_0)$ where $c_0 \approx 1$ (from the inhomogenous “coordinate translation”) and $\lambda_1$ is the dominant negative eigenvalue, $k=8$ variables. now wrt all this a general proof sketch/ outline seems apparent/ possible/ plausible/ viable/ nearby/ in reach. after a possible initial increasing run $wr_n\,\textgreater\,1$ for $n\,\textless\,m$, in long-term asymptotic behavior $n\,\textgreater\,m$ each term seems to swing between values farther above/ below 1 which guarantees $wr_n \ll 1$. ie for $n\,\textgreater\,m$, $wr_n - 1$ alternates signs and $|wr_n-1|\,\textgreater\,|wr_{n-1}-1|$ or equivalently $(-1)^n wr_n-1\,\textgreater\,(-1)^{n-1} (wr_{n-1}-1)$ where indexing chosen so that $(-1)^n (wr_n - 1)\,\textgreater\,0$.

(later) more thoughts/ insight. its important to distinguish $w_n = \prod_n wr_n$, the current cumulative product. another way to look at it is that the product formula is a combination/ superposition of two functions $\prod_n (f(n) + g(n) + 1)$ where $f(n)$ is an oscillating increasing exponential (sawtooth-wedge shaped) associated with the dominant (negative) eigenvalue $\lambda_1\,\textless\,{-1}$ and $g(n)$ is a sum of exponentials of all the lesser eigenvalues $\lambda_k\,\textless\,1,k\,\textgreater\,1$ (as seen by inspection in all the data so far) associated with decreasing exponentials. ie/ in other words $f(n)$ is alternating sign and $g(n)$ is positive. at some point $n=m$ there is a shift from a steady increase in $w_n$, the initial upslope/ increase of the glide where $wr_n\,\textless\,1$ ie $f(n)\,\textless\,{-g(n)}$ for the 1st time. in the downslope/ decrease the oscillatory term increases from “moderate wedge” to “large wedge,” dominating the decreasing exponentials, and even more to the point $w_n \ll w_m$ and finally at some point $w_n\,\textless\,1$, terminating the glide.

another pov is that $wr_n$ behaves similarly to a derivative of $w_n$ except $w_n$ declines when $wr_n\,\textless\,1$ and “transition point” $n=m$ is the start of the glide decline.

(11/15) this code took quite a bit of careful adjustment. this computes the matrix difference eqn focusing on the ‘wr’ variable alone. wrt prior analysis it uses 4 separate methods that are all found to be equivalent (within small error). the 1st method extracts the single variable from the matrix eqns. the 2nd method uses the eigendecomposition left matrix ‘wr’ row only instead of the whole matrix. the 3rd method does some more algebra to compute the difference equation constants outside of the loop but still uses the diagonal eigenvalue matrix to compute eigenvalue powers. the 4th simply uses an array for the eigenvalues/ powers. the average of all 4 is computed and a max error of any measurement from the average is computed and found to be small over all runs. this is output in the 1st line.

some of the algebra expanded is as follows. as mentioned previously $P_i$ is the ‘wr’ single variable row of $P$. $P_i D^t$ is the same as a hadamard product of $P_i$ and $D^t$ eigenvalue powers both as column vectors. since hadamard product is commutative the order can be alternated, and also a hadamard product of two column vectors is equivalent to converting the 1st to a diagonal matrix, leading to the constant vector computed as $\mathrm{diag}(P_i) P^{-1} y_0$ computed outside/ before the loop. $P, P^{-1}$ are named ‘v, vi’ in the code.

the later lines of output capture lots of misc statistics for the longest sequence found such as eigenvalues, dominant eigenvalue, etc. following is a single run. it outputs the ‘wr’ eqn as ‘wr(t)’, and a graph is output.

there are four sample graphs for different runs confirming the basic prior analysis with some characteristic variations chosen below representing basically tradeoff between the dominant and nondominant eigenvalue parts. the dominant eigenvector oscillating wedge function is in red ‘wrx’, the composite nondominant eigenvectors function minus 1 is in (slightly thicker) green ‘wry’, and the composite of the two functions is in blue ‘wrz’. in graph #1 the red wedge is not very significant and the green (“composite declining”) exponential curve plays the big role. in #2 the wedge is more dominant. in #3 and #4 the exponential is instead less significant and wedge dominates.

some complications arose with a few of the runs where eigenvalues turned out to have imaginary parts and the code had to be adjusted to handle the general case. the final outputs did not have significant imaginary parts but there was a case with an imaginary eigenvalue part that was not insignificant (~0.07).

matrix4.rb

{"t"=>13, "e"=>4.440892098500626e-16, "i"=>472}
{"t"=>21, "e"=>2.220446049250313e-16, "i"=>494}
{"n"=>1267650600228229366312331083775, "ns"=>"1111111111111111111111111111111111111111111111111111110111111111111111111111111111110111111111111111", "nl"=>100, "d"=>0.98, "dh"=>1.0, "dl"=>0.96, "a1"=>0.32666666666666666, "sd1"=>0.16131404843417155, "mx1"=>0.54, "a0"=>0.01, "sd0"=>0.0, "mx0"=>0.01}
["eigvals", [-1.1781558446706701, 0.9911146773554002, 0.7924245248832454, -0.3837817023068492, 7.632783294297951e-17, -0.013677353227191647, -0.107812466973577, -0.10011183506035636]]
["dom eig, n", -1.1781558446706701, 0]
["c1, c1a", -0.036440670008669795, -1.0]
["c0, c", 0.9946060708372788, [0.00033211320363730253, 0.005994745299591639, 0.0015722083595888087, -0.0025473799816018473, -0.3153459580094538, 0.34097610818096974, -0.029929444090268074, 0.004341536200257558]]
wr(t) = 0.995 +0.000332(-1.18)^t+0.00599(0.991)^t+0.00157(0.792)^t+-0.00255(-0.384)^t+-0.315(7.63e-17)^t+0.341(-0.0137)^t+-0.0299(-0.108)^t+0.00434(-0.1)^t
["e", 2.220446049250313e-16]


(later) 😮 ❗ this is a fairly quick riff on prior code, mostly removing a lot of logic, that looks at some complete real trajectories and computes the distance from the “fixed/ stationary point” of the matrix difference eqn at each iterate, for 10 longest trajectories (glides). results were both roughly as expected and somewhat surprising. the graph is the distance measurement and also the current density, and color goes from darker/ longer sequences to lighter/ shorter sequences. upper curve is the density and lower curve is the fixed point (euclidean) distance. the two curves are nearly identical except for scale/ translation. not sure what it fully means yet but its apparently a key scale invariance phenomenon.

the decline to ~0.2 by the fixed point distance suggests that maybe the later flat/ plateau part is a finite lower-bound region of iterates that are the “base induction case” for the overall proof, and the induction starts at the leftward ascending iterates. another remarkable observation, some trajectories were seen to have nearly linear/ monotonic descent phases, even smoother than in this graph. need to try to capture some and examine them. this all begs for a lot of further analysis and my instinct was to immediately bang on it further but it was quite eyepopping and decided to post it immediately at the end of a days work.

matrix5.rb

(11/16) thought about it some more and looked over the wikipedia pg re matrix diff eqn. it says the eqn has a fixed point solution but only if all the eigenvalue absolute values are less than 1, which as mentioned is not the case here. so what is the fixed point vector meaning in this situation? now not exactly sure “what that code is doing” even though the results are notable. maybe need to compare with the scaled dominant eigenvector instead (maybe there is some relationship between dominant eigenvector and the fixed point vector?). another basic oversight is that the iterate bit width is not output/ graphed. after analyzing that, it looks like the vector difference/ distance is correlated with “distance” of the glide peak and/ or glide length! but need to do more coding to highlight that.

😮 💡 ❗ but after graphing seed width noticed this unexpected/ surprising remarkable pattern and decided to refactor the code to extract/ highlight it in something of a detour/ tangent. this logic just generated 500 glides, sorts the glides by max length and then for top 20 graphs ‘nl, a1, sd1, mx1’ variables color coded by glide lengths and finds remarkable/ striking self-similar patterns! it reminds me of results from a long-ago graph but iirc that was generated in a completely different way. may have to try to dig that one up and compare. the ‘0’ variables also have some rough trend but its not nearly as orderly. the “nearly linear/ overlapping” glide ascent in 1st graph seems a bit eyepopping, cant think of a simple/ immediate way to explain it… almost as striking as the few razor-smooth ‘mx1’ initial descent lines. 🙄

pattern.rb

this code occurred to me to test whether there is much prediction ability between glide length and the matrix diffeqn. the code computes expected # of iterations in the glide ‘t’ red via the diffeqn model and compares the initial vector to the final vector using euclidean distance ‘yn’ green. this code generates a lot of glides and selects/ samples them in linear decreasing length ‘c’ blue (right scale), throwing out very many to select the biased sample. a very rough correlation is indeed found and so this basic sanity test of the model is validated. one of the “better” correlations is seen below.

this reminds me of the weak correlation found between trajectory length and initial seed density. ofc all the machinery/ trappings of multiple variables is based on the idea that mere iterate binary density is not enough sophistication to lead to a proof, but that still remains to be seen. even an eventual multivariate proof finally nailed down does not preclude/ rule out existence of a simpler one.

matrix6.rb

(11/24) 💡 ❗ ⭐ 😮 🙄 after some extended mental wandering + pondering, think the answer to my earlier jacobian newton approximation question is straightfwd and was eluding me until now, think this is the solution. its the same as matrix differential eqn on wikipedia (not cited specifically yet, but many “nearby” entries cited). that considers eqn $\mathbf{\dot x = A' x(t)}$ where it is not pointed out but feel certain* $\mathbf{A'}$ is—2020-hindsight now obviously!—the jacobian! then a soln to my question is simple. the newton iteration $\mathbf{x_{n+1}=x_n - f(x_n)/ J(x_n)}$ is in the desired form $\mathbf{x_{n+1}=A x_n}$ when $\mathbf{J(x_n)=A'}$ and $\mathbf{A'=(1-A)^{-1}}$. notice that key value $\mathbf{(1-A)^{-1}}$ already is used for the fixed-point calculation in prior code, and the similarity of the form/ analogy with the prior derived single-variable case $b=(1-a)^{-1}$. in other words the matrix difference eqn is computing the fixed point $\mathbf{\dot x = 0}$ via newtons method and the soln is apparently the matrix exponential $\mathbf{x(t)}=e^{\mathbf{A'}t}$! (again still not sure how this “reduces” for A 1×1 dimension…?) these key conceptual connections are a bit unapparent from the at-times disconnected/ not highly coherent wikipedia entries/ style and the lack of some explicit statements in wikipedia seems verging on major oversight although Schneider+Barker book ref does understandably better job.

another interesting/ key/ foremost way of looking at all this: this blog series has long pondered the “global vs local” properties/ dichotomy of the problem as central to its solution. the “local” difference eqn gives future variables in terms of the current values of each variable but the “global” solution gives a eqn in terms only of the time variable and exponentials/ eigenvalues, not referencing the individual (eight) variables.

(11/28) 😳 😐 😦 ❓ * argh! wishful thinking, after some desperation? honestly, am really confused at moment. think there are some semi simple relationships between these seemingly disparate concepts but theyre still eluding me. was too hasty there & the prior algebra only holds if $\mathbf{f(x) = x}$ which doesnt look right/ makes no sense/ is probably a contradiction! and again now not really sure what the jacobian of the matrix differential equation is…? wheres a mathematician when you need one!?! @#\$&

(11/30) 💡 ❗ 😮 😳 🙄 so obvious in 2020 hindsight! the Euler integration method for the matrix differential eqn is $\Delta \mathbf{x}/ \Delta t = \mathbf{A' x}$$\mathbf{x}_{n+1} = \mathbf{x}_{n} + \Delta \mathbf{x} = \mathbf{x}_n + \mathbf{A' x}_n$ for $\Delta t = 1$. therefore the “integration/ sum” of the sequence of the recurrence relation $\mathbf{x}_{n+1}=\mathbf{A x}_n$ is the differential eqn solution for $\mathbf{1+A'=A}$$\mathbf{\dot{x}=(A-1)x}$. ⭐

❓ but honestly still really mystified, what is the analog of the jacobian $\mathbf{J(x)}$ for the differential eqn? part of the difficulty is that its multivalued/ vector value defined as a function of time (single variate), so is there a multivariate parameter analog for which theres a jacobian? is it hidden in all this somewhere, staring me in the face?

😡 👿 @#%& probably just blew nearly 1 hr trying unsuccessfully to add strikethrough to a formula above! geez doncha love cyberspace? the “strike” html cmd is incompatible with the mathjax in some browsers but not others. its not built into latex std or amsmath! and wordpress doesnt allow adding latex math packages! hey what do you expect for free? this stuff is like a toy sometimes… [1][2]