L’etat, c’est Google

Posted on 27 November 2004 to: Computing, Information Security, Networks

The web is surely a wonderful thing. A simple Google search can bring you information on almost any topic. Such as, oh, nitrogen tire inflation.

If you choose to scroll down the list of Google results for nitrogen tire inflation far enough, you’ll find a link to a previous weblog entry I wrote about the state of science education in the United States. (Nota Bene: This may not be true since the server change in December 2004. My argument still stands.) The gist of my point went something like this: Isn’t it depressing that tire stores run commercials advertising that it’s safe to mix air with nitrogen, given that anyone old enough to drive should know that air is 78% nitrogen?

However, Google doesn’t understand subtlety or the use of examples to make a point. All it understands is that the words “nitrogen tire inflation” appear in that post a fair number of times, and therefore that my post should be returned as a result whenever someone searches on those terms. For some combinations of search terms, I have been informed that my post is the first result returned by Google.

Google is doing nothing out of the ordinary here: The process by which my page has been indexed and ranked is the exact same process that ranks every page on the web. When someone types search terms into Google, she isn’t going to get a list of pages that answer her question. Instead, she’s going to get a list of pages that appear to match her search terms according to Google’s statistical rankings. Most of the time, these two sets of results are similar enough that there’s no effective difference between them. In the case of the post I wrote, however, it did nothing to answer questions an inquisitive shopper might have had about nitrogen tire inflation.
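To make the mismatch concrete, here is a toy sketch of ranking by raw term counts. It is emphatically not Google's algorithm, which is far more elaborate; the `score` and `rank` functions and the two sample pages are invented purely for illustration. The point is only that a page which repeats the search terms can outrank a page that actually answers the question.

```python
from collections import Counter

def score(page_text, query):
    """Count how often each query word occurs in the page (a crude stand-in for ranking)."""
    words = Counter(page_text.lower().split())
    return sum(words[term] for term in query.lower().split())

def rank(pages, query):
    """Order page names from highest to lowest score for the query."""
    return sorted(pages, key=lambda name: score(pages[name], query), reverse=True)

# Hypothetical pages: a rant that repeats the phrase, and a guide that answers the question.
pages = {
    "rant": "nitrogen tire inflation ads are silly because air is mostly nitrogen "
            "so nitrogen tire inflation teaches us nothing about tire safety",
    "guide": "how to decide whether filling a tire with nitrogen is worth the cost",
}

ranking = rank(pages, "nitrogen tire inflation")
print(ranking)
```

The rant wins simply because it says the magic words more often, which is more or less what happened to my post.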

I previously shrugged this phenomenon off, until I started getting a string of comments from a particular commenter on that nitrogen inflation post. He desperately wanted to know whether or not a chain of tire stores in Ohio was trustworthy, and kept asking the question even after I responded to tell him that I didn’t have a clue. (If you want to read his posts, you’re out of luck: I have since purged the comments in a manner reminiscent of Genghis Khan in order to protect the guilty.)

Most of the essays that I write here have their origin in some mundane bit of news that’s seemingly unrelated to what I end up talking about. For this essay, that lost websurfer’s series of comments was the seed crystal that helped to solidify several disparate lines of thought into a coherent whole. What alarms me about these comments is that this commenter treated Google as a “black box”: a mystical oracle which, when appeased with the proper incantations, would somehow return the proper answers to a question. In this websurfer’s eyes, Google’s insistence that my page had the answers even overrode my insistence that I did not.

The thing that struck me was that the commenter’s writing was polite and relatively literate. Clearly, he wasn’t just a random luser, but an intelligent individual who didn’t happen to be familiar with how Google worked. For all I know, he might be the furnace repairman who dropped by my house last year: a man who probably wasn’t familiar with the statistical ranking of webpages in a search engine, but who could find the error in a furnace control circuit with a little intuition and a butane lighter to test the switches. Google might be one of his “black boxes,” but the finer points of heating control systems are certainly one of mine.

The mere existence of “black boxes” isn’t particularly surprising: Technology is simply too diverse for any one person to understand the workings of everything he deals with in life, or even to understand the workings of those systems which he depends on to survive. I know nothing about mechanical combines, for instance, but they’re the reason that I’ll be eating dinner tonight. I know just as little about the workings of albuterol in the treatment of asthma, but it’s the main reason I lived past the age of 10. (Childhood asthma is an excellent way to learn that breathing is not an optional activity.)

The main problem with a black box arises when it goes wrong. In the case of some black boxes, failures can be spotted easily. I’ll notice if the furnace doesn’t turn on in the dead of winter, or if there’s no flour available at the supermarket, or (briefly) if my albuterol inhaler fails to treat a severe asthma attack. Occasionally, however, black boxes break in ways that are subtle and hard to notice. Consider, for example, the infamous Pentium FDIV bug, which caused errors in the floating-point division results of early Pentium chips for specific input values. This bug was not discovered until after the chips were actually in use, leading to a notable headache for anyone who had placed trust in the black box of solid-state electronics.
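For a sense of how subtle that failure was, here is a small sketch of the sanity check that circulated widely at the time: divide one particular pair of numbers, multiply back, and see what is left over. The function name `fdiv_residue` is my own invention; the two constants are the commonly cited test values. On a correct floating-point unit the residue is zero, while affected chips were widely reported to return 256.

```python
def fdiv_residue(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y, which should be essentially zero when division works."""
    return x - (x / y) * y

residue = fdiv_residue()
# A residue anywhere near 256 would indicate the flawed divider;
# anything vanishingly small means this particular check passed.
looks_sane = abs(residue) < 1e-6
print(residue, looks_sane)
```

Note that a passing check only tells you that this one division came out right, which is rather the point: you have to know where to look before you can open the box.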

In some cases, the flaws in black boxes are never discovered by users of the devices, but are only revealed by those who designed the devices years later. Such was the case with Ken Thompson’s introduction of a backdoor into some early versions of the UNIX operating system. Thompson accomplished this feat by modifying the UNIX C compiler in such a way that it would recognize the code of the UNIX login program and introduce a backdoor whenever that program was compiled from source, meaning that the backdoor would never show up in the program’s source code. Thompson then used this same technique to make the C compiler introduce this backdoor into any new versions of itself that it compiled, ensuring that any new version of the compiler would contain this “feature,” and making it extraordinarily difficult to remove the backdoor. Thompson relied on the trusted “black box” of the compiler to create a backdoor that would be practically impossible to detect and exceedingly difficult to remove.

Thompson never released this hacked version of UNIX into the wild, but it is still a sobering reminder of exactly what can be done by relying on someone’s trust in black boxes. In this case, most UNIX programmers implicitly trusted the “black box” of the compiler to accurately convert C source code into assembly language. By attacking that compiler, Thompson was able to create a backdoor that would evade anyone who didn’t understand the inner workings of that particular black box.
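The shape of the trick is easier to see in miniature. The sketch below is a toy model, not Thompson's code: the "compiler" here just turns lines of pretend source into a list of behaviors, and every name in it (`honest_compile`, `trojaned_compile`, the `BACKDOOR` and `TROJAN` strings, the sample sources) is hypothetical. What it illustrates is that both the login source and the compiler source stay perfectly clean while the compiled output does not.

```python
BACKDOOR = "accept a secret master password for any user"        # hypothetical payload
TROJAN = "tamper with login and with future compilers"            # the self-propagating part

def honest_compile(source):
    """Toy 'compiler': the binary simply records the behaviors the source asks for."""
    return {"behaviors": [line.strip() for line in source.splitlines() if line.strip()]}

def trojaned_compile(source):
    """Same interface as honest_compile, but it tampers with two specific programs."""
    binary = honest_compile(source)
    if "program: login" in source:
        binary["behaviors"].append(BACKDOOR)   # backdoor appears only in the output
    if "program: compiler" in source:
        binary["behaviors"].append(TROJAN)     # the next compiler inherits the trick
    return binary

# Both sources are clean: an audit of either one finds nothing suspicious.
clean_login_src = "program: login\ncheck password against the password file"
clean_compiler_src = "program: compiler\ntranslate source to binary"

login_bin = trojaned_compile(clean_login_src)
new_compiler_bin = trojaned_compile(clean_compiler_src)
print(login_bin["behaviors"])
print(new_compiler_bin["behaviors"])
```

Reading the source tells you nothing here; the only way to catch the tampering is to distrust the tool that everyone uses to build everything else.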

When viewed in this light, the concept of Google as a black box becomes exceedingly frightening. The problem with Google is not simply that it’s a black box, but that for many it is a trusted black box. More importantly, the information gained through Google may determine how much we trust other black boxes, or how much we trust Google itself. Much like Thompson’s compiler hack, a flaw in Google - or any other information source - risks creating positive feedback and catching us in a form of circular logic: Google is trustworthy because the sources I found using Google say that it’s trustworthy. Or, in a more likely scenario, fact X must be true, because Google found page A saying it’s true - but the person who wrote page A used Google to do his research, which led him to the erroneous page B. This sort of circular reasoning becomes almost inevitable whenever one’s trusted information source is a single black box: How can one tell whether or not the box is right?

At this point, most of my readers are scoffing to themselves: Sure, Google might be a major source of information, and maybe people trust it too much, but isn’t it a little bit preposterous to assume that it might be deliberately compromised? The problem with this reasoning is that Google results have already been deliberately compromised both for malicious and for non-malicious purposes. Consider Googlebombers, who use fake links to deliberately inflate the importance of a webpage for humorous or satirical effects. If that’s not enough, consider comment spammers, who fill weblog comments with links to inflate the pagerank of commercial sites. Not only are they deliberately attacking Google results, but they also think that they’re increasing revenue enough to make it worth their while. Clearly, someone sees opportunity in attacking Google results, albeit on a small scale.
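To see why link spam pays, consider a bare-bones version of the PageRank idea: a page is important if important pages link to it. The sketch below uses the simplified textbook iteration, not whatever Google actually runs, and the tiny web of pages (`review`, `forum`, `shop_a`, `shop_b`, and the spam pages) is made up for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                targets = pages  # a page with no outgoing links spreads its rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new[target] += share
        rank = new
    return rank

# A hypothetical honest web: shop_a has earned more links than shop_b.
honest_web = {
    "review": ["shop_a"],
    "forum": ["shop_a", "shop_b"],
    "shop_a": [],
    "shop_b": [],
}
# Twenty spam-laden pages that all link to shop_b.
spam_pages = {f"spam_comment_{i}": ["shop_b"] for i in range(20)}

before = pagerank(honest_web)
after = pagerank({**honest_web, **spam_pages})
print(before["shop_a"] > before["shop_b"], after["shop_b"] > after["shop_a"])
```

In this toy web, twenty junk links are enough to push the less-recommended shop past the better one, and nothing in the results page would tell a searcher that it happened.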

Looking at things from a historical perspective, a larger-scale attack on the validity of Google results would not be surprising. Indeed, I personally think that a major attack aimed at influencing public opinion is inevitable. Almost every medium that has served as a black box provider of information has, at one point or another, been manipulated by others or manipulated its own information to influence public opinion. In most cases, that manipulation of information had significant historical consequences. In the late 1890s, it was the newspapers that tried to keep readership high with sensational “yellow journalism.” In the process, these papers transformed a tragic coal bunker explosion on board the USS Maine in Havana harbor into a work of Spanish treachery, and managed to spark the public outrage that led to the Spanish-American War. In the 1960s, the press reported the Tet Offensive as a success for the Viet Cong, even though that offensive nearly broke the back of the Communist Vietnamese guerrilla forces without achieving any strategic objective of note. As a result, the United States ended up withdrawing from Vietnam and giving the nation to the Communists.

The most recent example of this pattern is the now-infamous Rathergate scandal, in which Dan Rather raised questions about George W. Bush’s National Guard service shortly before the election. However, investigation by webloggers found that the documents Dan Rather used on the air were forgeries, and within days the story changed from one of youthful Presidential misconduct into one of how a CBS-perpetrated fraud aimed at altering the results of a national election was exposed by a handful of graphic designers and lawyers.

On the surface, weblogging represents a new era of journalism: No longer do one or a few black box entities control access to information. Instead, weblogging represents a form of peer-reviewed journalism, as any weblogger is subject to the cross-examination of another. In the crass but succinct words of Ken Layne: “We can fact-check your ass.” The up-and-coming Machiavellian is thus faced with a seemingly insuperable obstacle the likes of which never confronted William Randolph Hearst: How can thousands of different weblogs be influenced at once?

What that Machiavellian will stumble across, sooner or later, is weblogging’s dirty little secret: We may all be fact checking each other, but we’re all using Google to do it. Just as Thompson’s C compiler was almost inevitably the trusted component in a UNIX system, so is a search engine the trusted component in a weblogger’s toolkit. (The links in this essay didn’t come from my personal stash of bookmarks!) The logical answer to this problem is to move one’s disinformation campaign a link further up the media chain: Rather than attempting to influence webloggers, why not influence Google? Webloggers are used to dealing with Google results that are inaccurate, irrelevant, and off-target. Are they ready to deal with Google results that are deliberately deceptive?

We’re already beginning to see attacks being made on Google search results in crude attempts to influence public opinion through censorship. In 2002, the Church of Scientology used the threat of the DMCA to have anti-Scientology pages removed from the Google index. (The pages in question have since been returned to the index.) More worrisome are recent indications that Google has begun filtering the Chinese version of Google News so as not to include links to blocked websites in China. China has occasionally blocked Google search entirely in the past. How much more attractive would it be to the Chinese to offer access to Google, but simply to filter out “objectionable” search results?

This problem isn’t limited to Google: It’s a fundamental issue that will exist whenever a black box is trusted as a source of information. The content provided by web search engines is less centrally controlled than the information provided by most previous black boxes, but the problem remains. It’s only a matter of time before some clever individual, company, or nation works out how to manipulate web search results transparently, and it won’t matter if the search engine to be manipulated is based in Mountain View or Redmond.

Unfortunately, there’s no easy answer to this problem. Replacing one black box with another certainly won’t solve the fundamental issue, and replacing one large black box with a few smaller black boxes will only help to a limited degree. Historically, solutions to these sorts of problems have been accomplished by using several radically different systems operating in parallel and cross-checking results between them. But that sort of independence is hard to achieve when one search engine provider appears to be seeding its index with the results from a competitor’s search engine.
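The cross-checking idea, and the way shared sources undermine it, can be sketched in a few lines. This is only an illustration under made-up assumptions: the `cross_check` function, the source names, and the sample answers are all hypothetical, and real verification is obviously messier than counting votes.

```python
from collections import Counter

def cross_check(answers, quorum=2):
    """Return the answer given by at least `quorum` sources, or None if there is no agreement."""
    if not answers:
        return None
    value, votes = Counter(answers.values()).most_common(1)[0]
    return value if votes >= quorum else None

# Three genuinely independent sources, one of which is wrong.
independent = {"engine_a": "78%", "engine_b": "78%", "textbook": "21%"}

# Same vote count, but engine_b quietly copies engine_a's (wrong) index.
copied = {"engine_a": "21%", "engine_b": "21%", "textbook": "78%"}

print(cross_check(independent))
print(cross_check(copied))
```

The vote only means something if the voters are independent. Once one engine feeds another, two agreeing answers are really one answer counted twice.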

Personally, I’m going to keep my eyes open, read as many primary sources as I can, and keep a critical eye on what Google turns up. I may not be able to know the details of Google’s workings, but I at least want to have a feel for when this particular black box starts putting out fishy data, and I encourage others to do the same.

For starters, you can take my word when I say that I’m not an expert on nitrogen tire inflation.

Oceania has always been at war with Eurasia. - George Orwell, 1984

1 Comment »

  1. That was a really cool paper. I’ve often wondered myself how much we all trust in Google and how small doses over such a period can affect us individually! I’m currently studying networks and web design in college, and it turns out your computer is communicating with Google even if Internet Explorer isn’t up and you’re not even doing something on the web! Interesting read. By the way, I found this link on Google, lol. Later.

    Comment by LogisticTitan — 17 April 2007 @ 17:21
