the promises and perils of datamining

hi all. theres been some great news on datamining recently. just wanted to collect some of my favorites in a single place. in this TMachine post: links galore, stuff on the stackexchange datamining contest with $20K prizes, new $60M simons compsci/datamining center at UC Berkeley headed by Karp & Papadimitriou as profiled in NYT by Markoff, dataminer & geek extraordinaire nate silver predicting the election with 100.0% accuracy, inside scoop on how the democrats played datamining “chess” to the republican checkers, the irate lesbian with the shark/“killer” lawyer who destroyed the $1M netflix contest and the connection to the Bork supreme court nomination and Video Privacy Protection Act of the late 80s, and the sinister dark/shadow side of datamining….

some of the biggest news is with the election. rj lipton covers nate silver in his blog. nate silver is the guy employed by NYT who correctly predicted the democratic vs republican outcome in all 50 states, significantly before the election. magic? voodoo? or statistics and datamining? the media is hyperventilating a little bit over this feat.

of course much of this is due to the partisan nature of his work. I doubt Nate is all that ideological and hes been very careful not to inject much of his own political beliefs into the process. but the republicans have accused him of lots of nefarious misdemeanors.

nate silver is basically a datamining and statistical geek. and yeah, successful geeks in full prowess can feel kind of threatening. as the geek girl on modern family recently said to her older sister, on confronting some geeks that stole a toy helicopter…. “some day your fans will work for my fans”.

is anyone reading this old enough to remember when “geek” was a pejorative? geeks used to be uncool. now post-Gates era, they are maybe still a little bit uncool but understood to be awe-inspiring, intimidating, and at times fearsome. quants working during the stock crash may have had something to do with this reputation also.

yes I discovered nate awhile back before it was cool. I say that with strong geek pride like some people talk about rock bands they knew before they were massive. (u2 anyone? I went to their elevation tour concert!)

so the media is alternately saying nate is either a game changer, or a spoilsport, or a fraud/charlatan, or no big deal, or ….? I personally think he’s a highly competent dataminer, and he waded into some very churning waters with his choice of research topic. of course he understood that, and so far he’s withstanding the withering scrutiny mostly unscathed, as far as I can tell. a geek hero! if you make fox news furious at you, I say you must be doing something right!

* * *

next, below is an article about obamas outstanding “data driven” campaign. holy cow, what a well kept secret. I read a lot about IT and the campaign and saw nothing about this operation until this article. and its very impressive. it sounds like a james bond archvillain operation. or maybe a counterterrorism center. but no, its an operational headquarters to reelect obama based on datamining. the republicans dont seem to know what hit them. after reading this article it sounded like they were playing checkers while obama was playing chess. or maybe chess on a supercomputer, wink…

also in the news, the stackexchange data mining contest. Im surprised nobody on or has said anything about it, even in meta. geez people! $20K of prizes is absolutely nothing to sneeze at is it? or are you academics unmotivated by cold hard cash, filthy lucre? what a bunch of wimps, snicker :P

the contest is run on, a really cool new site that specializes in datamining competitions with monetary prizes. if I had my own mansion on a private island, Id be crunching on these contests in my secret lair right now….

datamining is a big deal and the simons foundation announced a $60M grant to UC Berkeley (see NYT article) for an institute for the Theory of Computing with datamining as one element of it. headed by the famous Karp as director! and Papadimitriou as manager! that is way cool stuff. its really inspiring to see this hit the big time, and with celebrity names from computational complexity theory. 8-)

Part science and part engineering, computer science has long been viewed warily by scientists in other disciplines. But that is changing, not only because the computer has become the standard scientific instrument but also because “computational thinking” offers new ways to analyze the vast amounts of data now accessible to scientists. This new approach — what researchers call the “algorithmic” or “computational” lens — is transforming science in much the way the microscope and telescope did. When computer scientists train their sights on other disciplines, said Christos H. Papadimitriou, a Berkeley computer scientist who will help manage the institute, “truths come out that wouldn’t have come out otherwise.”

Moreover, the flood of experimental results generated by inexpensive sensors, combined with the Internet’s ubiquitous connectivity, is threatening to drown scientists in vast data sets often called “big data.”

“It’s analytics with big data, it’s the ability to compute and analyze in massive parallel architectures,” said Jeannette M. Wing, head of the computer science department at Carnegie Mellon University. “All the science and engineering disciplines realize this is part of the future.”

Ive been publicly advocating Big Data and datamining for many years, probably close to a decade, and its finally starting to gain serious traction in public consciousness and research directions. and complexity theory has finally found its “killer app” other than P vs NP to thrust it into the applied, rubber-meets-the-road limelight. hey, its a *teeny* sleight of hand there based on its history, but Im certainly not complaining.

out of the minor leagues and into the majors! with multimillion-dollar budgets to show for it! all we need next is a movie with tom cruise or keanu reeves! or maybe I am just too dated huh? how about ryan gosling?

and theres *yet another* geek angle. the simons foundation is funded by

Dr. Simons, who earned his doctorate in mathematics at Berkeley, was chairman of the math department at Stony Brook before creating Renaissance Technologies, a private investment firm. Forbes magazine estimates his current worth at $10.6 billion.

now if only a bit of that $60M could be awarded to bloggers or independent/“unaffiliated” researchers. oh well, dream on geeks. myself included…. guess I will just have to become a billionaire first and self-fund that one….

* * *

below I also include a whole bunch of links on the netflix contest. yep I grinded away on this contest for 3 solid years including buying a separate linux/ubuntu computer for it. what a great contest! alas, it ran into the lawyer department when they got sued because the released data could be reverse engineered to recover the identities of customers.

I dont know the exact history but I have always suspected the lawyers were prompted by a paper by Narayanan and Shmatikov called “robust deanonymization of large sparse datasets”. they showed that the netflix data could be used to find “apparent political preferences and other potentially sensitive information.” an article about a lesbian that sued netflix said “The researchers also made educated guesses about the customers’ politics and sexual orientation.” I believe that article is referring to that paper. although I cant find a reference to “sex” anywhere in their paper.

there is also a connection to an old political controversy from the late 80s when robert bork was being considered for the US Supreme Court and the media managed to get ahold of his video rental list! this was back when video rental on VHS was still rather avant-garde and it caused quite a commotion in the media, verging on a scandal! imho it was the equivalent of going through a trash bin (“dumpster diving”) to find records about someone, but it literally led to a later act of congress to protect video rental records! unfortunately Netflix did not seem to anticipate that fully. they did attempt to anonymize or “deidentify” the data but what research shows is that “deidentification” is rarely anywhere near an airtight process that can resist a determined attacker with good datamining resources.
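for the curious, heres a tiny toy sketch of why “deidentified” sparse ratings can still leak identity. to be loud about it: this is *not* the algorithm from the Narayanan and Shmatikov paper, just my own simplified illustration loosely in that spirit. all the data, the names (`released`, `rarity_weights`, `match_score`, `best_match`) and the parameters (`tolerance`, `min_gap`) are made up for the example. the idea is that an attacker who knows a few approximate ratings for someone, say from a public review, scores every released record against that knowledge, counts rare titles more heavily, and only claims a match when the top record clearly stands out.

```python
import math
from collections import Counter

# toy "anonymized" release: record id -> {movie: rating}. all data is invented.
released = {
    "user_001": {"blockbuster_a": 5, "blockbuster_b": 4, "obscure_doc": 2},
    "user_002": {"blockbuster_a": 4, "blockbuster_b": 4, "cult_film": 5},
    "user_003": {"blockbuster_a": 5, "obscure_doc": 5, "cult_film": 1},
}


def rarity_weights(records):
    """Weight each movie by 1 / log(1 + raters): rarely rated titles identify people better."""
    counts = Counter(movie for ratings in records.values() for movie in ratings)
    return {movie: 1.0 / math.log(1 + n) for movie, n in counts.items()}


def match_score(aux, ratings, weights, tolerance=1):
    """Sum weights of movies where the attacker's rating is within `tolerance` of the released one."""
    return sum(
        weights[movie]
        for movie, rating in aux.items()
        if movie in ratings and abs(ratings[movie] - rating) <= tolerance
    )


def best_match(aux, records, min_gap=0.5):
    """Return the record id that best fits `aux`, or None if no record clearly stands out."""
    weights = rarity_weights(records)
    scored = sorted(
        ((match_score(aux, ratings, weights), uid) for uid, ratings in records.items()),
        reverse=True,
    )
    (top, uid), (runner_up, _) = scored[0], scored[1]
    return uid if top - runner_up >= min_gap else None


# attacker knows two approximate ratings of rarely rated titles
print(best_match({"obscure_doc": 5, "cult_film": 2}, released))
# attacker only knows a rating of a title everybody rated
print(best_match({"blockbuster_a": 5}, released))
```

in this toy setup two slightly-off ratings of rarely rated titles are enough to single out one record, while a single rating of a title everybody rated matches everyone equally and gives the attacker nothing. real datasets are vastly bigger and messier, so take this as intuition only.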

so that lawsuit and netflix settlement was a huge wet blanket and setback for the field of datamining Id say, to say the least. *devastating!* netflix had already announced a sequel to the contest that was cancelled! netflix had one of the most forward-looking and visionary projects in the entire history of datamining and it was shot down by an overzealous lesbian with a very good lawyer and a quivering corporation that quickly caved. (a huge case of “the right hand doesnt know what the left hand is doing.”) the full scope of this tragedy has never been fully documented. the science of the contest was absolutely groundbreaking. it seems highly likely netflix could have continued its role as a benefactor of the field of datamining, but instead it now sits in the corner cowering in fear. politely stated, emasculated! more bluntly stated…. castrated!

so this datamining thing is really not quite so simple, is it? another point to make is that one of the main uses of datamining is by spy and surveillance agencies of the US, which hoo boy I sure dont want to get started on that, its a whole other post *at least*, wink…. guess I will just have to let the links speak for themselves for now….

