Scientific Publications 3.0.


What Science can learn from Google?
Chris Anderson


Moderators: Judith Simon, Luc Schneider, Giuseppe Veltri, Gloria Origgi, Roberto Casati

"All models are wrong, but some are useful."

So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don't have to settle for wrong models. Indeed, they don't have to settle for models at all.

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to, well, at petabytes we ran out of organizational analogies.

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising; it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Speaking at the O'Reilly Emerging Technology Conference this past March, Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science (hypothesize, model, test) is becoming obsolete.

Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture, but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses: the energies are too high, the accelerators too expensive, and so on.

Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility. In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip: a unique sequence that, being unlike any other sequence in the database, must represent a new species.

This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals: that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.

This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM's Tivoli and open source versions of Google File System and MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software. Learning to use a "computer" of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?

A softer science (0 replies)
Gloria Origgi, Jan 8, 2009 14:41 UT
On statistical reasoning
Judith Simon
Jan 6, 2009 14:40 UT

I have to admit that I always get a bit suspicious when new approaches are considered or portrayed to be successors of previous models instead of complementary approaches. As luck would have it, I just finished reading Ayres’ “Super Crunchers” (2007), Keen’s “The Cult of the Amateur” (2008) and Benkler’s “Coase's Penguin, or, Linux and The Nature of the Firm” (2002) in order to get a grip on some processes of collective reasoning and decision making. Thus, my reply is not only directed at Chris Anderson’s paper, but ranges a bit more widely over the topic. What Anderson and Ayres (2007), and also Benkler (2002), have in common is that they argue from a stochastic point of view. Anderson and Ayres follow to some extent Surowiecki’s (2004) stochastics-based ‘wisdom of crowds’ approach, arguing that expertise (or, in Anderson’s case, science as we know it) is outdated and should be replaced by supercrunching. While I am thoroughly convinced that the processing of huge amounts of data leads to extremely helpful new insights, I keep wondering why this has to replace all other forms of reasoning.

Benkler (2002), for instance, while employing stochastic models himself in modeling the actions of agents in collaborative environments, stresses the role of different forms of communication and interaction processes, different technical constraints and social norms involved in collective creation. Thus, while he starts from stochastic reasoning, his considerations do not stop there.

Ayres and especially Surowiecki, by contrast, follow a strictly stochastic approach and consider only aggregated independent individuals. This focus on independence instead of interdependence is clearly a side effect of modeling human co-creation too closely on stochastic information aggregation, where the independence of input variables is frequently recommended to make calculations easier, but not necessarily more adequate.
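
Why the independence assumption carries so much weight in aggregation can be illustrated with a small simulation. The sketch below is not taken from Ayres or Surowiecki. It assumes a hypothetical toy setup in which every guess is the true value (fixed at 0) plus a shared error common to the whole crowd plus a private error drawn independently per person. The names `crowd_error`, `shared_sd` and `private_sd` are illustrative only.

```python
import random
import statistics


def crowd_error(n_people, shared_sd, private_sd, trials=500, seed=0):
    """Mean absolute error of the crowd average around a true value of 0.

    Assumed toy model: guess = shared error (identical for everyone in a
    trial) + private error (drawn independently for each person).
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        shared = rng.gauss(0.0, shared_sd)
        guesses = [shared + rng.gauss(0.0, private_sd) for _ in range(n_people)]
        # The crowd's answer is the plain average of all guesses.
        errors.append(abs(statistics.fmean(guesses)))
    return statistics.fmean(errors)


if __name__ == "__main__":
    half = 0.5 ** 0.5  # splits a unit variance evenly between shared and private
    independent = crowd_error(200, shared_sd=0.0, private_sd=1.0)
    interdependent = crowd_error(200, shared_sd=half, private_sd=half)
    print("independent crowd error:   ", independent)
    print("interdependent crowd error:", interdependent)
```

In this toy model each individual guess is equally noisy in both runs. The private errors cancel in the average while the shared error does not, so enlarging the crowd helps only in the independent case. The simulation says nothing about which assumption fits real collaborative communities, which is exactly the point at issue.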

Another point that even Ayres (2007) recognizes is that statistics do not run on their own. Someone has to ask questions, someone has to pose hypotheses. I do not see how even supercrunching should work without asking questions to begin with. Sure enough, factor analysis has been used to generate hypotheses from data for a long time, but there has still been the need to then test these hypotheses independently and with different methods.
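
The need for independent testing of data-generated hypotheses can be made concrete with a small sketch. It uses plain correlation screening as a stand-in, not factor analysis, and everything in it is an assumption for illustration: all variables are pure random noise, and the names `pearson` and `screen_then_retest` are hypothetical. The code picks, from many unrelated candidates, the one that correlates best with a target on a small exploratory sample, and then re-measures that same correlation on fresh data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def screen_then_retest(n_vars=200, n_explore=30, n_confirm=2000, seed=1):
    """Screen noise variables on a small sample, then retest the winner.

    Returns (r_explore, r_confirm) for the candidate that looked best
    during exploration. All data are independent Gaussian noise, so any
    correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    total = n_explore + n_confirm
    target = [rng.gauss(0.0, 1.0) for _ in range(total)]
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(total)]
                  for _ in range(n_vars)]

    def explore_r(candidate):
        # Exploration only sees the first n_explore observations.
        return pearson(candidate[:n_explore], target[:n_explore])

    best = max(candidates, key=lambda c: abs(explore_r(c)))
    r_explore = explore_r(best)
    # Confirmation uses the remaining, previously unseen observations.
    r_confirm = pearson(best[n_explore:], target[n_explore:])
    return r_explore, r_confirm


if __name__ == "__main__":
    r_explore, r_confirm = screen_then_retest()
    print("correlation found by screening:", r_explore)
    print("same pair on fresh data:       ", r_confirm)
```

Under these assumptions the screened correlation looks substantial only because it was selected as the largest of many, and it shrinks toward zero on the fresh observations. This is a sketch of why a pattern found by crunching is a hypothesis rather than a result until it has been checked on data that played no part in finding it.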

I do appreciate Casati’s request for more meta-analyses; however, I have a question for clarification. What data do we have in mind when talking about meta-analyses? If we are talking about already quantified data, various methods for meta-analyses of experimental studies and the like do exist in the social sciences. However, with respect to the abundance of more qualitative data, taxonomies and ontologies (and along with them all the things that get sorted out, cf. Bowker & Star 1999), as well as the pros and cons of tagging, come into play and would have to be discussed.

Moreover, as Olson rightly insists, even when it comes to quantitative data, someone has to provide these data in the first place. And it surely makes a difference whether there is scarcity or over-abundance of data, whether we are analyzing data from a linear accelerator or the tediously transcribed observational data about the life cycle of some animals in Papua New Guinea. Consonant with Olson’s remarks, I was reminded of a conversation I had last month with an anthropologist. She usually does field studies in South America, but recently got interested in analyzing online fan communities of a famous Indian movie star. Ask her about the differences she encountered in analyzing the two kinds of data.

Thus, the question is: who provides the data beyond those that are already digitally available? And will data that are not digitally available lose relevance, as is already the case for literature that is not available electronically? Do we want this?

Furthermore, as is always the case, your results can only be as good as the input data. This problem surely becomes more pressing as we progress from original research via meta-analyses to supercrunching. Put differently, I am a bit afraid that assessing the quality of the input data receives less and less attention as the numbers increase.

Thus, I may conclude my comment with a question: what gets lost when a “statistical style of reasoning” (Hacking 1992) is in total control? Which questions can be asked, and which can be answered, by statistical means? Which questions might disappear? When one method is proposed as the only way to approach truth, I am strangely reminded of some tendencies in mainstream psychology (my home discipline), where the existence of things that can’t be measured is sometimes simply denied.

Literature

Ayres, I. (2007). Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart. New York, Bantam.
Benkler, Y. (2002). "Coase's Penguin, or, Linux and The Nature of the Firm." The Yale Law Journal 112: 369-446.
Bowker, G. C. and S. L. Star (1999). Sorting Things Out: Classification and Its Consequences. Cambridge, MIT Press.
Hacking, I. (1992). Statistical Language, Statistical Truth and Statistical Reason: The Self-Authentification of a Style of Scientific Reasoning. In E. McMullin (ed.), Social Dimensions of Science. Notre Dame, Indiana: 130-157.
Keen, A. (2008). The Cult of the Amateur. New York, Doubleday.
Surowiecki, J. (2004). The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. New York, Random House.

  1 reply to On statistical reasoning:
    PS: Meta-Analyses and Knowledge Organization
Judith Simon, Jan 6, 2009 21:09 UT
The Scope (and Target) of the Cloud (0 replies)
Matthew Doyle Olson, Jan 1, 2009 7:39 UT
Meta-analyses: A homework for Google-minded scientists (0 replies)
Roberto Casati, Dec 27, 2008 15:15 UT
 
 
© 2010 interdisciplines.