Thursday, July 16, 2009

The Rise of Google: Beating Yahoo at Its Own Game by Tom Hormby

Before 1995, search engines relied on databases of textual keywords to find relevant results. Whenever a user entered a search term, search engines such as AltaVista and Lycos would compare the search term to their databases of terms. The pages that had text most similar to the search term were considered to be more relevant and were featured higher in the list of search results.

Excite, a popular search engine in 1996.

Unfortunately, this automated process did not always return the most logical results. For instance, in a search for "Microsoft", pages for retailers selling Microsoft products might be featured before the Microsoft corporate homepage, because a single retail page might list dozens of Microsoft products.
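The weakness is easy to reproduce in miniature. The sketch below is only an illustration, not how AltaVista or Lycos actually scored pages: it assumes relevance is nothing more than a count of how often the search term appears, and the page names and text are made up for the example.

```python
# Hypothetical pages; the URLs and text are invented for illustration.
PAGES = {
    "microsoft.example/home": "Welcome to Microsoft",
    "retailer.example/catalog": (
        "Microsoft Word, Microsoft Excel, Microsoft Windows, "
        "Microsoft Office on sale"
    ),
}


def keyword_score(query, text):
    """Count how many whitespace-separated words equal the query term."""
    words = text.lower().split()
    return words.count(query.lower())


def rank_by_keyword(query, pages):
    """Return page names ordered from most to fewest keyword matches."""
    return sorted(
        pages,
        key=lambda name: keyword_score(query, pages[name]),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_by_keyword("Microsoft", PAGES):
        print(name, keyword_score("Microsoft", PAGES[name]))
```

Under this assumed scoring, the retailer's catalog outranks the corporate homepage simply because it repeats the word more often.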

By 1997, students at Stanford had discovered a better approach to Web search called Google. Google delivered unusually relevant results compared to the existing search engines. Not only did Google offer superior results, it lacked the growing clutter found on the popular search portals of the time.

Larry Page at Stanford

Despite its rapid ascent to search engine supremacy, the technology behind Google was not envisioned as a search engine - or even as a commercial product. Instead, it was the product of Stanford graduate student Larry Page's research.

Page, the son of a computer scientist from Lansing, Michigan, came to Stanford in 1995 at age 22 after graduating from the University of Michigan. Page selected Terry Winograd as his academic advisor but did not have a clear idea of what he wanted to write about in his dissertation, though he was intrigued by the mathematical implications of the World Wide Web.

At a time when Stanford graduates were launching lucrative startups all over Silicon Valley, Page was not interested in finding potential commercial applications for his work. The Web could easily lend itself to being diagrammed through nodes and branches. Such a graphical representation shows the interconnectedness of websites and how users move from site to site.

Page, along with many other Stanford researchers, decided to take advantage of the Digital Library Initiative (DLI) of the National Science Foundation, whose mission was to create a universal library of data including personal information, the information found in conventional libraries, and information published by scientific researchers. Archiving and making this data accessible was an important part of the project, and Page used some of this funding in his research (much of which resulted in published papers in peer-reviewed journals). Focusing on projects related to the Web and search, the DLI would ultimately distribute $4.3 million to different research initiatives and was largely responsible for paying for much of the research behind Google.

Patterns and Relevance

Page decided to create an algorithm to evaluate the relevance of pages by analyzing the patterns formed by hyperlinks. As described earlier, search engines of the time determined relevancy by looking at the text of a website, or, in Yahoo's case, by employing human editors to categorize and describe each page. Instead of using text or human beings, Page would calculate the relevancy (a numerical value) of a page based on the links embedded in its HTML and on outside links to the page.

Links originating on the website being examined are easy to gather: a spider (a piece of software that crawls the Internet looking for data) parses the HTML and can create a database of links. Incoming links are more difficult to capture, and analyzing them requires a more sophisticated spider. By following link after link after link ad infinitum, a spider is able to work backwards and create a list of links pointing towards each page. This process consumes a great deal of both CPU resources and bandwidth, but it provides a more accurate estimate of a site's relevancy, because linking to a website requires both human effort and human judgment.
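A minimal sketch of the two steps, assuming the pages have already been downloaded: outgoing links are read straight out of each page's HTML, and the incoming links are then recovered by inverting that list. The page names and HTML snippets are hypothetical, and nothing here is taken from Page and Brin's actual crawler.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def outgoing_links(html):
    """Return the links found in one page's HTML, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def incoming_links(crawled):
    """Invert page -> outgoing links into page -> set of pages linking to it."""
    backlinks = defaultdict(set)
    for page, html in crawled.items():
        for target in outgoing_links(html):
            backlinks[target].add(page)
    return backlinks


# Hypothetical crawl results: page name -> raw HTML.
CRAWLED = {
    "a.example": '<a href="b.example">B</a> <a href="c.example">C</a>',
    "b.example": '<a href="c.example">C</a>',
    "c.example": "<p>no links here</p>",
}

if __name__ == "__main__":
    for page, sources in sorted(incoming_links(CRAWLED).items()):
        print(page, "is linked from", sorted(sources))
```

The inversion is only as complete as the crawl: a page's list of incoming links can only include pages the spider has already visited, which is why the process demands so much bandwidth.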

Page's method of evaluating the relevancy of websites is very similar to how academic writing is evaluated. Readers pay attention to the abstract (the "keywords" of an academic paper) but determine the importance of a paper by finding the number of academic papers that it cites and the number of times it is cited in other papers. Papers that have a breadth of sources and that are cited by other authors are more likely to be useful than papers that are not. Furthermore, papers that cite, and are cited by, more prestigious journals are likely to be even more relevant than papers that use only obscure sources.

Page applied this method of evaluating academic writing to the relevancy of pages on the Web. The system is intuitive but complicated, because the relationships between pages become very elaborate. For example, links from the Yahoo homepage are going to be more authoritative than links from a child's GeoCities page. Page devised a system of assigning numerical values to each page based on the number of links it has and the number of times the page is linked to. This system was called PageRank, named after Page himself.
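The idea that a link from an authoritative page counts for more than a link from an obscure one can be sketched as a repeated calculation. The code below is a simplified illustration, not Google's implementation: each page repeatedly passes a share of its own score to the pages it links to. The damping value of 0.85, the fixed number of iterations, the handling of pages with no outgoing links, and the sample link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a score per page from a dict of page -> list of linked pages.

    Simplified sketch: every page starts with an equal score, then on each
    pass it divides its current score evenly among the pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline score on each pass.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages so the total stays constant.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "portal" is linked to by every other page.
LINKS = {
    "portal": ["news", "shop"],
    "news": ["portal"],
    "shop": ["portal"],
    "hobby": ["portal"],
}

if __name__ == "__main__":
    scores = pagerank(LINKS)
    for page in sorted(scores, key=scores.get, reverse=True):
        print(f"{page}: {scores[page]:.3f}")
```

In this sample graph the heavily linked "portal" page ends up with the highest score, and "hobby", which nothing links to, ends up with the lowest, even though "hobby" has as many outgoing links as "news" or "shop".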

Enter Sergey Brin

In 1996, Page teamed up with fellow Stanford graduate student Sergey Brin. Page had met Brin during a campus tour, and by all accounts, the two had not hit it off. As the group walked around the Stanford campus and nearby San Francisco, the two computer scientists argued incessantly, even discussing the finer points of urban planning amidst the hills of San Francisco.

Both of Google's cofounders recall the other as being obnoxious, but they stayed in touch. Page recruited Brin to help write the software that would keep track of each page's relevancy, a complex task that could easily overwhelm the network at Stanford if resources were not used efficiently.

Page and Brin's research thrived at Stanford, an institution known for incubating startups. Some of the most famous figures in the field of computer science have been on the faculty or are alumni of the university.

Proof of Concept

At the urging of Page's academic advisor, with whom he had coauthored several papers, a search engine based on PageRank was made available to Stanford students in August of 1996. It combined a text index of 24 million pages (like the databases maintained by search engines such as AltaVista) with databases of the links between these pages. By year's end, the search engine, named BackRub, was receiving 10,000 searches a day.

The crawl that built the initial PageRank index began from Page's personal website on Stanford's server. Following the handful of links on the bare-bones site, the index swelled to over 28 GB by the time Page and Brin left Stanford, a considerable size for 1996, when storage was still quite expensive.

Another advisor, Rajeev Motwani, famous for his work in the field of databases - and data mining in particular - helped develop the search tools that tapped into the PageRank database.

As the search engine became more popular within Stanford and with the general public, it was renamed Google, a corruption of the name googol given to the number with '1' followed by 100 zeroes. Soon, Google was using so much bandwidth crawling the Internet that it would occasionally overwhelm Stanford's connection. Students and faculty, many of whom were enthusiastic Google users, did not seem to mind, but it was time for the search engine to find a new home.

Incorporation and Independence

After a 15-minute presentation to Sun cofounder and venture capitalist Andy Bechtolsheim, Google received a $100,000 investment. This investment allowed Google to find a new home off campus (appropriately for a new startup, Google's new home was a Silicon Valley garage), but it also created some practical difficulties for Page and Brin. Google did not yet exist as a legal entity; it was a project being run out of student offices at Stanford's computer science department. As a result, the check was not deposited until September 4, 1998, the earliest date that Google could be incorporated.

Now separate from Stanford, Google continued to expand. Page and Brin were both driven to reach profitability as soon as possible and started expanding Google's operations with an eye towards controlling costs.

When Google was still at Stanford, Page and Brin would beg for components from other departments. A CPU was salvaged from the loading dock, and faulty hard drives were rescued from all over campus. Brin wrote a piece of software that made these broken drives usable, important for storing the very large (and ever-growing) databases holding PageRank ratings and text indexes.

This type of scrounging was impossible in the private sector, but Google was still frugal. Instead of buying dedicated servers running expensive software, Google's datacenter ran on Linux and used (literally) homemade rack-mountable servers. In pictures from the time, the server racks look like rats' nests, with tangles of cable and components scattered everywhere.

As Google found more outside funding, its datacenter became more professional, but it was always characterized by being efficient, stable, and cheap. Google was also careful to control hiring practices, avoiding piling up costs before it had a chance at becoming profitable.

Google received outside recognition in the December issue of PC Magazine, which hailed Google for having "an uncanny knack for returning extremely relevant results" and included it in the Top 100 Web Sites of 1999.

Focus on Search

Despite Google's increasing reach, Page and Brin refused to change the core focus on search. Brin, in a later interview, said that "with 100 services, they assumed they would be 100 times as successful. But they learned that not all services are created equal. Finding information is much more important to most people than horoscopes, stock quotes, or a whole range of other things." As a result, Google retained its now-famous user interface consisting of little more than a search box and an early version of the iconic logo.

Traffic to Google's website increased quickly with the publicity. By February of 1999, Google was handling 500,000 searches a day. The growing traffic attracted both venture capitalists and technology partners. Armed with the reputation of its original investor, Bechtolsheim (who had backed many other successful startups before), Google was a darling of the top-tier venture capital firms of Silicon Valley. Both Kleiner Perkins Caufield & Byers and Sequoia Capital were brought in as new investors for a total of $25 million of new capital, recognition of Google's rapidly increasing valuation.

Improving with Age

On June 26, 2000, Google made two major announcements. First, it had become the largest search engine in the world in terms of pages indexed, beating out much older competitors like Lycos and HotBot. The longer that Google's spiders crawled the Internet, building its index and recalculating PageRank, the better its results. Instead of becoming bloated and unfocused, like many of the other portals, Google was actually becoming better with age.

Google's results had become so accurate and so popular (it was now the second most popular search engine on the Internet, only behind Yahoo) that it was impossible to ignore. Short of writing a new search engine, Google's competition was unable to match Google's prowess at search. This forced some of the most popular portals on the Internet to turn to Google for their search technology, turning over a significant amount of their traffic to a third party and a competitor.

The second announcement: The first major portal to switch from its own search engine to Google was Yahoo. This was a major coup for Google, which had finally found a reliable revenue source (this was before AdWords) and whose homepage would receive even greater traffic because of the publicity of the deal.

Yahoo had compromised. The company could not afford to compete with Google - Google's results were too good, and Yahoo needed a stopgap before it would be able to compete. Yahoo feared that customers would realize how good the Google-powered results were and would just move to Google.com.

Only two years after incorporation, Google had beaten Yahoo at its own game.
