hi all. did some deep thinking and came up with some new/ old ideas. in other words, new ideas built on old ideas. intending to wrap up/ write that up soon. elsewhere…
background commentary… wow, this virus is really taking the joy out of existence isnt it? am able to get outside in the great parks/ outdoor areas/ nature spaces still left around my urban city, for walks, hikes, other exercise, and thats “good fortune” or what the religious call “a blessing,” worthy of gratitude. but it seems like even in the “normal times”/ “status quo” for our country, daily life is often reduced to a survival grind even for those with high(er) status jobs. roughly 5/7 days of our (adult) lives are spent working…
honestly do not really find a lot of joy in life at times even in “non pandemic” periods. more like, intermittent flashes of joy. and it almost seems like Big Corp esp Dark Side Manager is doing everything it (in)humanly can to take all the (last slivery vestiges of) joy out of the job, to crush any candle flickers of light in the oppressive grayness, to turn/ crank/ jam the anxiety/ stress knob up to 11 or beyond. sometimes, this world seems designed/ optimized for oppression. “welcome” to the Age of Anxiety, the Era of the Precariat. ok, ok, my and your ancestors dug ditches by hand, and here we’re typing on screens sitting in recliners in climate controlled rooms with huge panoramic windows, so in a way its all 1st World Problems… right? alas old canards seem to lose their zing…
am wondering at times lately about “hedonic normalization”, aka what psychologists call hedonic adaptation. this is the scientifically analyzable tendency of our human brains to acclimate to any situation, including positive ones. could it be a sort of hedonic normalization happens over ones life? its like in that expression “been there, done that, seen it all…” as one gets older, theres less novelty, maybe also giving a feeling of less variety (“the spice of life”). in a word, weltschmerz. seemingly, episodes from the past become more numerous, crowding into ones consciousness. is the current moment always better than prior ones? a sometimes haunting question for those getting older. maybe for mental health its better, the more one ages, the less to think about the past…
just read this book Lost Connections by johann hari. brilliant work! really deeply relate/ resonate to its striking contents/ message. overall a heavy message, hopeful in the end, but nevertheless can really relate to it. he quotes a scientist with an idea that occurred to me maybe 2 decades ago: “dont ask whats inside your brain, ask what your brain is inside of”. we live in some difficult urban environments, the steel, concrete, glass jungle. counterintuitively, not superficially but at least psychologically, it may even be more taxing than the savannah of our ancestors, and there is some scientific evidence for that. we might think that electricity + plumbing make our life easier, and they do, but on the other hand, from an evolutionary standpoint, modern (mega)cities/ life are in many ways an inhumane, bordering on alien environment.
the operative phrase is quality of life. theres a spectrum between surviving and thriving, as big as humanity! ~7B datapoints wide and still growing!
long had an idea to write a book on utopian ideas/ ideals for the 21st century. hari has already expertly captured many of my “inklings”/ musings. its a sophisticated, extremely ambitious roadmap for a positive future, but can humans pull it off? sometimes it seems the human herd is only blindly, mindlessly, hastily stampeding off of cliff(s) these days. actually thats a lot of the difficulty, its a conceptual/ overarching roadmap, but lacking on what might be called “design/ implementation details”. but its (far) better than nothing.
heavy thoughts, nevertheless in line with the zeitgeist!
⭐ ⭐ ⭐
this blog has long considered evocative or “out there” ie nearly outlandish analogies/ metaphors for the collatz problem. one that comes to mind from months ago was the extended fusion analogies explored. its part of the amusement, sustaining the full fascination. lately a few pandemic-related connections occurred to me.
a few times the similarity of collatz fractal properties to one of the worlds best fractal models (for human understanding) has been cited: snowflakes. snowflakes are both routine and exotic. most humans can relate to them even if they live in a snowflake-free climate. the idea was presented a few times here that different collatz trajectories are almost like snowflakes.
ah, but viruses are a new analogy-of-the-moment. heres the new pov. different viruses also seem to have fractal qualities in various aspects. eg their shape aka “morphology”.
now, theres a remarkable parallel. the immune system has to actually “learn about viruses”. the intruder is an alien structure that may be system-threatening, ie worse than dangerous, a potentially catastrophic/ deadly pathogen, or something utterly benign, say a pollen grain, involved in seemingly inconsequential “hay fever” allergies. (pollen grain shapes are possibly not so “different looking” than lethal viruses.) hows that for a problem with high stakes around false positives/ negatives!
the immune system cannot actually “see” any virus in the sense that we can look at 3d shapes. the immune system can only launch probe-like structures, antibodies, that may attach to the virus. these are like manipulators or handles, flags that may or may not stick to the foreign invader. in a sense the immune system has to manufacture molecular tools on the fly, in a life-or-death race. there is some randomness mixed with order to this process; immune system function is one of the extraordinary feats of biology (even across species) and the human immune system is probably the most highly developed on the planet.
the human immune system has some interesting parallels to machine learning. its a kind of machine that learns. much of its function is still not well understood; we have the outlines, we have sketches. in a sense antibodies function somewhat like molecular “feature detectors” in that they connect or dont connect (bind or dont bind), almost like on/off binary flags. the phrase “getting a handle on the problem” fits/ applies/ “sticks” in mind here.
there is a sense in which the immune system is solving a sort of dynamic protein folding problem wrt pathogen recognition/ harnessing. have not seen others remark on this, but the immune system generating antibodies that attach to the pathogen seems likely closely related to the protein folding properties of the antibody! not coincidentally Google started working on (“attacking!”) the protein folding problem with ML technology within the last few years.
so the immune system has an extraordinarily difficult job in that it has to recognize a pathogen only from surface properties, in a sense 3d reduced to some 2d aspect, like an inexact mapping (insert analogies of trying to map the earth onto 2d projections here). ofc the other major element that the immune system monitors for clues, so to speak, is “behavior of the intruder”. this behavior is abstract; it has to do with the interaction of the intruder with the organism. the immune system cannot peer into the virus to discover it contains genetic material (DNA or RNA) inside; the virus has a membrane-like structure aka coating that hides its inside from outside. in this way the ancient legend of pandoras box, or a trojan horse, seems almost like a direct metaphor for viruses.
alas, a cell can only “discover” the genetic payload of the intruding mechanism by unpackaging it, absorbing it through its membrane, which is exactly the same as infection. infection is a sort of behavior that is discovered by the immune system 2ndhand from circumstantial signals associated with malfunctioning organism cells, ie cells that are compromised and corrupted/ hijacked into virus-copying machines. in a sense the immune system can only find invaders based on the damage they cause. damage may be (typically) spotty/ peripheral or it may be massive, aka an attack.
viruses are like the ultimate in asymmetric warfare, a foremost political topic since 9/11. asymmetric warfare seems like the scourge/ bane/ PLAGUE (cutting/ lethal/ toxic pun noted!) of our age; its associated with major political events/ wars. ISIS, one of the great recent evils to (dis)grace the planet, seemed to master asymmetry, almost like a stateless, state-like virus composed of humans. a single virus can destroy the human organism of ~30 trillion cells, turning it into a pile/ puddle of flesh-goo. shades of the functions of viruses ring through bostroms infamous amoral paperclip-generating AI. maybe bostrom fails to realize that a dangerous AI-like entity already exists on the planet.
the cell has strong parallels to Turing machines. DNA operations such as replication and construction are similar to Turing-machine-like operations. this was first speculated on by von Neumann, who was one of the 1st to understand the concept of information encoding a control system/ machine blueprint. it is not a mere coincidence that real computers (Turing machine embodiments in the real world) have “real” viruses.
so, this is a dark but not unhinged thought (in dark times when our country is routinely said to be being led by a routinely unhinged leader), but this project of solving the collatz problem has shades of parallels to the way the immune system recognizes and finally counteracts/ vanquishes a virus. the immune system has to learn on the fly, to “wrap itself around” the virus, to isolate/ contain it, it has to learn the nature/ structure of the foreign, alien attacker/ attack before it can mount a response, or “die trying”. sustained hard/ deep work on the worlds most intractable problems is not merely coincidentally called “an attack”.
two decades ago Bill Joy said “the future doesnt need us” and warned us of the genetics, robotics, and nanotechnology of the future. it seems at least 1 of the trifecta was realized in only the 2 decades since his warning. oh, but robotics, if you listen to Yang, has decimated the middle class/ economy. as the old cliche goes, 2 out of three aint bad. and there is a rumor/ hint bouncing around remote dark corners of the internet that coronavirus was engineered, which in my mind would be a clear case of nanotechnology (mixed/ overlapping with genetics). 3 out of 3 is catastrophic/ apocalyptic.
⭐ ⭐ ⭐
(6/9) 💡 have been banging quite a bit on this next code. there are some deeper ideas driving it. as the old zen aphorism goes, in the beginners mind there are many possibilities, in the masters, only a few. a new general idea is around machine learning curve fitting algorithms, which have been explored here in many contexts, but a major revisit/ paradigm shift is in mind. in short, after several years, some fairly high quality features have been painstakingly “engineered”. none of them alone are enough to crack the problem. but what about the gestalt aka Big Picture? so far the gestalt is (mostly, apparently) humanly incomprehensible.
however machine learning algorithms excel, even specialize, in this kind of synthesis. am starting to think that success cannot lie in a human-centric attack alone; much prior work can be regarded as lying within that theme (in its defense, along with much musing about breaking out of it). some of the power of ML has been touched on, but possibly, arguably, its full fledged range has not yet been harnessed wrt this challenge, or in archimedes great/ old 1-word key concept: (fully) leveraged. one might say that ML is like the fundamental anti-fractal or de-fractal (complexity) tool. or maybe another image: if hard problems are a can of worms, ML algorithms are like worm sorters/ straighteners.
construct121 from 2/2020 seems to be the last time the 0.64 glides were used although there has been some behind-the-scenes (of the blog) usage. these are convenient for several reasons. they are (now, relatively) easy to generate, they have good variability in (glide/ total) length, and they tend to be in the ½ density disordered region. my other major “new” idea was to go back to the nearest-neighbors algorithm also; the last major investigation there was on 1/2020 (time flies when youre having fun). some of those ideas have been recurring to me; there are some loose ends there. this code builds a bunch of the known feature detectors on 1000 glides, with ⅕ in the test set. am actually getting ideas about how to build some variants of the standard nearest neighbor algorithm, ideas about pre/ postprocessing the data. this code focuses on preprocessing.
there are 2 basic “norms” that have been thrown around a lot over the years:
- a “z-norm”, ie a gaussian normalization using the mean/ standard deviation.
- a “rank norm”, which looks at the sorted order of all the values of a variable and then reassigns each value its index in that order. it has been used with good effect on optimization and sometimes seems to outperform the z-norm. after getting deep into this code and debugging it, found a remarkable not-before-seen glitch with the rank ordering, which has now been used for years in the optimization algorithms: the code does not address the key issue of duplicate values. for floating point ranges these are somewhat rare, but for discrete variables, such as “shortest 0/1 run lengths”, the issue of repeated values/ nonuniqueness can really stand out. the problem is that the prior logic can assign different ranks to identical values, and it took awhile to figure out this was affecting this code (more specifically, noticed/ discovered via the overall eye-glaringly-high performance). this code slightly alters the rank normalization to merge duplicate values. it is not immediately clear if this would improve performance of prior uses of rank normalization.
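the actual code isnt shown here, but a minimal python sketch of the duplicate-merging fix (illustrative only; the name rank_norm is made up): identical values get identical ranks, aka “dense” ranking, then ranks are scaled to [0, 1].

```python
def rank_norm(values):
    # merge duplicates: identical values receive identical ranks
    # ("dense" ranking), then scale ranks into [0, 1]
    distinct = sorted(set(values))
    rank = {v: i for i, v in enumerate(distinct)}
    hi = max(len(distinct) - 1, 1)
    return [rank[v] / hi for v in values]

# a discrete variable with many repeats, eg shortest 0/1 run lengths
print(rank_norm([2, 1, 2, 3, 1]))  # → [0.5, 0.0, 0.5, 1.0, 0.0]
```

note the two 2s (and the two 1s) now get the same rank, unlike a naive argsort-based ranking which would split them arbitrarily.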
another tricky subtlety that arose in careful analysis/ debugging of this code: for the standard optimizations used commonly around here there is no train/ test set separation, which significantly simplifies the code. for a ML problem, the separation of data is crucial; accidentally coded an early version of this code without it. the separation requires that the norm statistic/ “adjustment” be computed over the train set and then the same values used for the test set. in other words the statistic must be “blind” wrt the test set, otherwise there is “contamination” of data.
the buggy version of this code managed to find very high expected vs actual correlations using rank norm (but also the buggy “nonunique-aware” rank norm logic) approaching an eyepopping, yes, “too good to be true” ~97%. wow! usually a defect would tend to decrease performance but this was exactly the opposite case. in a sense 2 wrongs made a right… or appeared to. the logic to assign rank norm based on train vs test separation, now implemented here, is not so trivial, it involves basically interpolating ranks for unseen or not-previously-encountered values.
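the interpolation logic can be sketched as follows in python (illustrative only, function names made up): the rank mapping is fit on the train set only, and test values never seen in training get a linearly interpolated rank.

```python
from bisect import bisect_left

def fit_rank_norm(train):
    # compute the rank mapping on the train set only, so the test
    # set cannot "contaminate" the statistic
    xs = sorted(set(train))
    n = max(len(xs) - 1, 1)
    return xs, n

def apply_rank_norm(model, v):
    # rank-normalize a single value; unseen values get interpolated ranks
    xs, n = model
    if v <= xs[0]:
        return 0.0
    if v >= xs[-1]:
        return 1.0
    i = bisect_left(xs, v)
    if xs[i] == v:
        return i / n
    lo, hi = xs[i - 1], xs[i]
    frac = (v - lo) / (hi - lo)
    return (i - 1 + frac) / n

model = fit_rank_norm([1, 2, 2, 4])
print(apply_rank_norm(model, 3))  # → 0.75, an interpolated rank for an unseen value
```

out-of-range test values are clamped to the ends of the train range here; thats one arbitrary design choice among several possible.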
this code applies the nearest neighbors algorithm using 3 different normalization schemes, or more specifically the prior two plus 1 without normalization: ‘ks’ is none, ‘ks2’ is rank norm, and ‘ks3’ is z-norm; red, green, blue respectively. it takes quite a few samples (1000 points each run, from reshuffling the data) to find the difference, but its quite measurable and stable, and overall this exercise reveals the significance/ power/ even necessity of normalization. dashed lines are unaveraged data/ samples and unbroken lines are the (running) averages. the prediction is the logarithmic ratio of the collatz function 40 iterations forward. there have been similar prediction ideas in the past, mostly around total sequence length, but this was chosen because it is more predictable.
the z-norm is the top performer by a small margin but rank norm is a close 2nd and both are superior to the un-normed data. the performance is measured as the correlation coefficient between predicted and actual data, and numbers approaching ~0.7 here are very impressive for this fractal data. another striking aspect here is that the reshuffling leads to high correlations across the different algorithms. this seems to show that certain points/ samples are “more representative” than others and leads to the idea of algorithms that try to find “more ideal” samples. that idea was explored with success at length in the past with regression in a long series of experiments.
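for reference, the performance metric here is just the pearson correlation coefficient between predicted and actual values; a self-contained python version of the standard formula (illustrative, not the blog code):

```python
from math import sqrt

def corr(xs, ys):
    # pearson correlation coefficient between predicted and actual values
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

print(corr([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # close to 1 for a good fit
```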
(later) small, or big update/ footnote/ fineprint on that prior use of the words “measurable/ stable”—as it turns out, in a heisenberg sense, and again appearances can be deceiving. another run, ie regenerating the sample data, and the two top norm performances (green, blue) are reversed. surprising! in some other runs, they are almost perfectly coincident. so it seems the averages are measurable/ stable for a set of data, but each set of data tends to be distinct wrt the top 2 trends. the bottom unnormed trend seems to be consistent and still underlines the quantitative added value of a norm operator.
(6/15) now after a few weeks of pondering the idea and turning it over and around in my head, might as well lay out some of the (potentially big?) cards. there is nearly zero chance that anyone would “steal these ideas” so to speak, for multiple layers of reasons. returning now to an old idea, not written here for years but once a huge focus (it would be a real challenge to find it in the blog at this point!): the key to solving the problem is finding g(x) such that g(x) > g(f(x)), where f(x) is the function for a single iteration of the collatz sequence. then g(x) is monotonically decreasing “in that sense” ie “wrt the collatz mapping function.” since this is crucial, call this the decrease criterion. theres much more to be said here in specific detail, but much has also already been said largely along those lines.
the decrease criterion is a bridge between mathematics/ theorem proving and ML, in a sense coding induction into a function.
one possible g(x) is simply stopping distance. however, there may be other g(x) that are sufficient, maybe not as hard to compute as stopping distance, or some kind of approximation/ monotonic transform of it. stopping distance can of course be computed via collatz iterations, but is there some other way to compute it? in a sense, if there is, the problem is solved. that is, it seems in some sense that the key problem is computing g(x) and/ or stopping distance through a means other than collatz iterations, in 1 word, “equivalently”. and the big open question is, how accurate/ precise can this be and still lead to a proof? there are some tantalizing hints that it doesnt have to be highly accurate/ precise. in this way it diverges from the typical ML focus/ priority/ target. “something else is in play here.” this particular question/ angle has been occupying my thinking for a few weeks now, with some new pov(s) on it.
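the baseline can be sketched directly; a minimal python version of the decrease criterion, with stopping distance here taken as total iterations to reach 1 (one common convention, and obviously assuming the trajectory converges):

```python
def f(x):
    # one iteration of the collatz function
    return 3 * x + 1 if x % 2 else x // 2

def g(x):
    # stopping distance: # of iterations to reach 1 (assumes convergence)
    n = 0
    while x != 1:
        x = f(x)
        n += 1
    return n

# g trivially satisfies the decrease criterion g(x) > g(f(x)) for x > 1
assert all(g(x) > g(f(x)) for x in range(2, 200))
print(g(27))  # → 111, the famously long trajectory
```

ofc this g is computed *by* collatz iterations, which is exactly the circularity the text is trying to escape; the sketch just pins down the criterion itself.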
now from earlier work spanning a few years, it is indeed known that using many of the carefully/ painstakingly constructed diverse feature detectors, g(x) can indeed be approximated. the last exercise is a case study/ case in point. the most obvious idea to construct g(x) is to use a ML algorithm curve fitting on stopping distance. however, had some idea about how to construct g(x) using another means similar to some earlier ideas tried out with nearest neighbor algorithms.
one idea is to start by choosing a random g(x) and then iteratively adjust its points being fit so that they better match the decrease criterion over sampled points. then this turns into a sort of gradient descent application but in a very unusual, even unorthodox format. do not recall ever having seen this idea in the literature anywhere, but think theres potentially some large power here. the literature is almost always about fitting a function with known data points. the idea of adaptively adjusting/ shifting datapoint values to customize the function— has anyone ever tried that? the literature is so vast that presumably its occurred to someone, somewhere, but really cant recall seeing it anywhere in data science or ML literature so far…
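a toy python sketch of the adjustable-datapoints idea (a hypothetical scheme, simplified to one-sided margin updates rather than a full gradient step on a hinge loss): start with random g values over sampled points and iteratively shift the fitted values themselves until the decrease criterion holds on every sampled pair.

```python
import random

def f(x):
    # one collatz iteration
    return 3 * x + 1 if x % 2 else x // 2

# hypothetical scheme: g is a table of adjustable values over sampled points;
# start random, then sweep and shift values until the decrease criterion
# g(x) > g(f(x)) holds on every sampled pair
random.seed(1)
pts = list(range(2, 100))
g = {x: random.random() for x in pts}
for x in pts:
    g.setdefault(f(x), random.random())

margin = 0.1
for sweep in range(200):
    moved = False
    for x in pts:
        if g[x] <= g[f(x)]:          # criterion violated at this pair
            g[x] = g[f(x)] + margin  # shift the fitted value upward
            moved = True
    if not moved:
        break

print(sum(1 for x in pts if g[x] <= g[f(x)]))  # → 0, no violations remain
```

this converges here because the sampled pair constraints form a DAG (no collatz cycle among the sources 2..99); a real version would have to generalize beyond the sampled points, which is where the ML/ scaffold questions begin.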
(6/16) 💡 ❗ ⭐ sometimes noted previously, a very simple/ “naive” stopping distance estimator is simply current iterate bit width. now if a possibly not-high-fidelity estimator could lead to a proof, why not this one? thats the zen question of the moment, or of years…
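this naive estimator can be checked in a few lines of python (illustrative sketch). spoiler: bit width satisfies the pointwise criterion on exactly the halving steps, ie the even integers, and never on the odd ones (3x+1 cant shrink the bit width), which hints at why something scaffold-like rather than pointwise would be needed.

```python
def f(x):
    # one collatz iteration
    return 3 * x + 1 if x % 2 else x // 2

def g(x):
    # naive stopping distance estimator: current iterate bit width
    return x.bit_length()

# fraction of integers where the naive decrease criterion g(x) > g(f(x)) holds
n = 10000
hold = sum(1 for x in range(2, n) if g(x) > g(f(x)))
print(hold / (n - 2))  # → 0.5, exactly the even (halving) steps
```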
now, to expand on the prior ideas. the decrease criterion maybe holds for some set of points out of all integers. then one can wonder about the distribution of this property/ feature over those integers. the natural question is again about density. do the gaps between integers for which it holds increase indefinitely? very interestingly it would seem that small gaps are not exactly/ necessarily what is being sought here. what is apparently being sought in some sense are regular-sized gaps ie a scaffold that does not “break down” at larger scales. the image that comes to mind is a net, mesh or scaffold.
the scaffold is actually in some sense capturing the inherent self-similarity of the problem— an old idea possibly now finally emerging with some newfound structure/ clarity, fighting fractals with fractals so to speak! the estimator function builds a scaffold, and one wants to create a regular-shaped scaffold, but this does not automatically mean it (ie the gaps inside it) needs to be particularly “tight”.
what about the “holes” in the scaffold? these are the integers for which the constructed decrease criterion fails, but it seems likely wrt a proof structure that those points can be related to (or in some sense reduced/ converted to) the scaffold points!
this then led me to ponder the sharpe ratio from finance. it is an average divided by the spread (standard deviation). but here what seems to be needed is something similar: the inverse, spread (ie standard deviation of distances) divided by average distance between the points, aka the coefficient of variation of the gaps. it seems that is what one wants to optimize/ minimize about the threshold, to make it as “regular” as possible. this also has the effect of partially maximizing distance between points.
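a minimal python sketch of this “inverse sharpe” gap-regularity score (illustrative; the function name is made up):

```python
from statistics import mean, pstdev

def gap_regularity(points):
    # inverse-sharpe-style score: spread of the gaps divided by the
    # average gap; lower = more regular scaffold
    pts = sorted(points)
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    return pstdev(gaps) / mean(gaps)

print(gap_regularity([1, 11, 21, 31, 41]))  # → 0.0, perfectly regular
print(gap_regularity([1, 2, 3, 4, 100]))    # irregular, score well above 1
```

note minimizing this score says nothing about how *tight* the gaps are, matching the idea that the scaffold needs regularity rather than density.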
the concept of density of points among the integers is related to some ideas dating to early work on collatz such as by Terras who showed that in a sense, the “density” of trajectories that cant be proven to converge by a fairly basic argument approaches zero. notably, this proof is itself much like a “scaffold” type construct.
the scaffold/ density ideas are tricky from another angle. there is a sense in which larger numbers have fewer glides, as found early on in many experiments, or conversely, larger numbers are more likely to be drains. (as have written before, suspect this is the Terras density argument from another angle but havent examined it very closely yet.) this has to do with the binary prefix ideas. so to construct the scaffold it seems necessary to build it around larger glides, but those also seemingly “distort” it in some sense. in other words, getting the scaffold density concept correct is not simple and dont have a straightforward map for that yet.
scaffold ideas have probably already been encountered at length via the copious ideas/ experiments surrounding “traversal algorithms.” but here now is a key new refinement/ insight on how to actually build/ construct it.
⭐ ⭐ ⭐
this is a fairly quick coding of many of these concepts. this code constructs g(x) using the recent nearest neighbor logic. the initial code was naive about finding points on the scaffold. it uses the 0.64 glides and chooses random initial points in them, along with a particular offset between the 1st and 2nd point. on average most of a trajectory is a drain, so this will tend to “oversample” the drain points, in other words be biased toward down pairs, but crucially of course the scaffold has to work over the glide range. so the code was then modified to get a perfect balance of “up points vs down points” (ie up/ down refers to x vs f(x) for evaluation pairs g(x) vs g(f(x))). as the offset distance increases, it becomes less likely to find up points (ie pairs) and many down points have to be “thrown away” in the search for up points.
it turns out this throwaway count, tracked, increases exponentially with the linearly increasing gap size, red line, plotted on logarithmic scale right side. the criterion function accuracy (green line, correct up/ down labels out of 500 1st points/ pairs) goes up almost exactly correspondingly. theres probably some straightfwd explanation of all this but it doesnt jump out at me at the moment. some other interesting property(s) of the problem. scaffold dynamics like this will have to be looked at further/ more carefully as work proceeds in this direction.
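the up/ down balancing logic can be sketched roughly as follows in python (illustrative, not the actual code; the seeds, offset, and counts are arbitrary choices):

```python
import random

def f(x):
    # one collatz iteration
    return 3 * x + 1 if x % 2 else x // 2

def trajectory(x):
    # full trajectory down to 1
    seq = [x]
    while x != 1:
        x = f(x)
        seq.append(x)
    return seq

def balanced_pairs(seeds, offset, want, rng):
    # sample (t[i], t[i+offset]) pairs from random trajectories, keeping an
    # equal count of "up" pairs (2nd point larger) and "down" pairs, and
    # tracking the down pairs thrown away while hunting for rarer up pairs
    trajs = [trajectory(s) for s in seeds]
    ups, downs, thrown = [], [], 0
    while len(ups) < want or len(downs) < want:
        t = rng.choice(trajs)
        if len(t) <= offset:
            continue
        i = rng.randrange(len(t) - offset)
        a, b = t[i], t[i + offset]
        if b > a:
            if len(ups) < want:
                ups.append((a, b))
        elif len(downs) < want:
            downs.append((a, b))
        else:
            thrown += 1  # excess down pair discarded

    return ups, downs, thrown

rng = random.Random(0)
ups, downs, thrown = balanced_pairs(range(3, 200, 2), 8, 50, rng)
print(len(ups), len(downs), thrown)
```

as the offset grows, up pairs get scarcer and the thrown count blows up, which is the exponential trend described above.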
as for the idea of a scaffold “breaking down,” an idea used previously and applicable here: an adversarial algorithm is called for to try to find larger points that lead to greater gaps, trying to “break/ elude the scaffold” so to speak. if the adversarial algorithm fails, in the sense of not forcing a breakdown, its evidence for a proof! its remarkable that this tension between a construction algorithm (scaffold) vs adversary algorithm seems to lie at the heart of a proof. it appears that the interplay is involved in constructing some kind of “core fractal” that corresponds to a proof (“structure”).
quite notably, maybe even strikingly, this adversarial concept is also key in recent breakthru ML work such as Go. in a similar sense, AlphaGo is actually constructing a near-proof about the game of Go, also a net/ mesh/ scaffold concept that leads to the winning strategy out of nearly all possible game configurations. some of this is hinted in analysis about the algorithms possibility of discovering “perfect play” and it seems researchers are only a step away from analyzing/ building their algorithms wrt perfect play. analogously to this work, it seems that possibly a proof about “perfect play” for Go and/ or chess based on ML scaffold techniques could/ may be on the near horizon.
(6/24) was (re)thinking about some ideas from 2/2020. need to look into scalability of bit runs in the parity sequence (mainly the postdetermined glide) for long glides, eg those from the 0.64 Terras algorithm; its a simple exercise that could lead to some insights. that led me to look at construct120 again and try to understand better what is happening there. there is some scaling applied in the code between the runs and the scaled runs; recall wondering how it creates an overlapping plot, and revisited it. after a bunch of hacking, noticed the following trend. this is very simple code that just looks at the long glides and 1-runs unscaled (red) and scaled (green, right side scale) in the postdetermined glide, 5 equally distributed (low to high) in the 0.64 Terras parity density range. a remarkable pattern with strong signal emerges in the prepeak vs postpeak sides, somewhat more pronounced in the higher/ steeper glides, a sort of inversion/ transition point, centered around the peak! has this ever been noticed before? maybe hints somewhere… the huge question is whether this is a more general property. even after staring at it, still having trouble understanding the nature of this signal; there seem to be more implications.
(6/28) there was an idea about ‘cmnw’ for large iterates explored in construct95 12/2019. then as just mentioned, on 2/2020 construct120 looked at a lot of statistics for the 0.64 Terras glides, but in a bit of an oversight didnt look at ‘cmnw’. momentarily forgot the prior terminology; this relates to what was named the limbo range ie postdetermined climb range on 2/2019. later on 12/2019 the parity sequence associated with the limbo range was named a bridge aka postdetermined climb/ pre-peak range.
also had the idea to look at runs, density, and entropy of (subsequences of) the parity sequences associated with the glides. there is a postdetermined pre-peak “bridge” range ‘1’ and a postdetermined post-peak “drain” range ‘2’. this code has 2 modes which have been applied in previous analyses and are combined here, handled in the same subroutine. mode 1 uses fixed width iterates and varies through predetermined range parity densities starting at 0.64, with throwaway/ restart on the non-full predetermined ranges (actually a threshold/ cutoff of length 10). the 2nd mode increases predetermined range size and picks random parity densities larger than 0.64, again with throwaway/ restart logic. the 1st 3 graphs below are the 1st mode and the last 2 graphs are the 2nd mode. the similarity over the 2 modes is substantial and something to ponder. ❓
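for concreteness, the 3 statistics over a 0/1 parity sequence can be sketched in python (illustrative helpers, not the actual code):

```python
from math import log2

def parity_seq(x, n):
    # first n parities of the collatz trajectory of x
    out = []
    for _ in range(n):
        out.append(x % 2)
        x = 3 * x + 1 if x % 2 else x // 2
    return out

def density(bits):
    # fraction of 1s, aka parity density
    return sum(bits) / len(bits)

def max_run(bits, b):
    # longest run of symbol b
    best = cur = 0
    for v in bits:
        cur = cur + 1 if v == b else 0
        best = max(best, cur)
    return best

def entropy(bits):
    # shannon entropy of the 0/1 distribution, in bits
    p = density(bits)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

s = parity_seq(27, 40)
print(density(s), max_run(s, 1), entropy(s))
```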
- relooked at 2/2019 when these ideas were 1st emerging. 2 relevant experiments were
- a striking/ unexpected finding is that the density of the bridge ‘d1’ lightblue has a “typical” density around ~0.65 as range length increases—strangely/ coincidentally, nearly the same as the Terras density cutoff being explored! also what might be called the sideways parity density. this can be seen in the 2nd graph which plots ‘d1’ vs ‘cmnw’ range/ bridge length. the variance around the typical value decreases as ‘cmnw’ increases. this maybe explains some of the “low entropy” effects seen in eg
- detail on ‘mxx1’ and ‘mxx2’, the max run lengths in the 2 regions, in the 3rd graph: ‘mxx1’ red seems to be stationary and ‘mxx2’ seems to be (very gradually) increasing, although maybe with some plateauing at the end; more pronounced in the 5th graph.
- entropy did not seem to turn up any noticeable signal.
- the density/ entropy statistics are visibly affected by the restart/ throwaway logic for glides close to 0.64 parity density because “full” glides are rarely found at the sideways density and many have to be thrown out. also shorter glide lengths in graph 4 seem to have a similar effect.
full disclosure/ fineprint: there was some slight change to the restart logic after the graphs were generated for the 1st mode, and it was found the 2nd mode didnt run correctly (getting stuck); dont think it would alter the 1st mode results.