Engineering a search engine is a challenging task.
The web creates new challenges for information retrieval. Automated search engines that rely on keyword matching usually return too many low-quality matches. We chose the name Google for our search engine because it fits well with our goal of building a very large-scale search engine. The goal of our system is to address many of these problems, in both quality and scalability. Fast crawling technology is needed to gather web documents and keep them up to date.
Storage space must be used efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second. These tasks become more difficult as the web grows.
Some people believe that a complete search index will make it possible to find anything easily. Aside from its tremendous growth, the web has also become increasingly commercial over time. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly. Google utilizes link structure to improve search results. PageRank can be thought of as a model of user behavior.
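To make the iteration concrete, the following is a minimal sketch of computing PageRank by repeatedly applying the paper's formula PR(A) = (1 - d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The toy graph, damping factor of 0.85, and convergence tolerance are illustrative assumptions, not the production implementation.

```python
# Minimal sketch of PageRank power iteration, following the formula
# PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).
# The example graph, damping factor, and tolerance are illustrative only.

def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # uniform starting ranks
    for _ in range(max_iter):
        new_pr = {}
        for p in pages:
            # Sum contributions from every page q that links to p,
            # each divided by q's outgoing link count C(q).
            incoming = sum(pr[q] / len(links[q])
                           for q in links if p in links[q])
            new_pr[p] = (1 - d) + d * incoming
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            break
        pr = new_pr
    return pr

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(toy_graph).items()):
        print(page, round(rank, 4))
```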
The text of links is treated in a special way in our search engine. Using anchor text efficiently is technically difficult because of the large amount of data that must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed. Aside from PageRank and anchor text, Google keeps location information for all hits and makes extensive use of proximity in search. Words in a larger or bolder font are weighted more heavily. We believe standard information retrieval work needs to be extended to deal effectively with the web. The web is a vast collection of completely uncontrolled documents.
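As a hedged illustration of the core idea, the sketch below credits anchor words to the document a link points at, so a page can be found by words it never contains itself. The record format (source, target, anchor text) and the function name are simplifying assumptions, not the paper's actual anchors-file layout.

```python
# Illustrative sketch: propagating anchor text to link targets.
from collections import defaultdict

def index_anchors(anchor_records):
    """anchor_records: iterable of (source_url, target_url, anchor_text)."""
    anchor_index = defaultdict(set)
    for source, target, text in anchor_records:
        for word in text.lower().split():
            # Each anchor word is credited to the *target* document.
            anchor_index[word].add(target)
    return anchor_index

records = [
    ("a.com", "b.com", "great search engine"),
    ("c.com", "b.com", "search technology"),
]
print(sorted(index_anchors(records)["search"]))  # ['b.com']
```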
Another difference between the web and traditional well-controlled collections is that there is virtually no control over what people can put on the web. Google's data structures are optimized so that a large amount of data can be crawled, indexed, and searched at little cost. The choice of compression technique is a tradeoff between speed and compression ratio. We can rebuild all the data structures from the repository and a file which lists crawler errors. The document index keeps information about each document and is designed so that a record can be fetched in one disk seek during a search. The lexicon fits in memory on a machine of reasonable price. A hit list corresponds to the occurrences of a particular word in a particular document, including position, font, and capitalization information.
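To illustrate how compactly a hit can be stored, here is a sketch that packs one plain hit into two bytes: one capitalization bit, three bits of relative font size, and twelve bits of word position. The field widths follow the paper's description of plain hits, while the function names and the saturation rule for large positions are our assumptions.

```python
# Sketch of packing a "plain hit" into 16 bits:
# 1 capitalization bit, 3 bits of relative font size, 12 bits of position.

CAP_SHIFT, FONT_SHIFT = 15, 12
POS_MASK, FONT_MASK = 0xFFF, 0x7

def pack_hit(capitalized, font_size, position):
    position = min(position, POS_MASK)   # positions past 4095 saturate
    return ((int(capitalized) << CAP_SHIFT)
            | ((font_size & FONT_MASK) << FONT_SHIFT)
            | position)

def unpack_hit(hit):
    return (bool(hit >> CAP_SHIFT),
            (hit >> FONT_SHIFT) & FONT_MASK,
            hit & POS_MASK)

hit = pack_hit(capitalized=True, font_size=3, position=42)
assert unpack_hit(hit) == (True, 3, 42)
print(f"{hit:016b}")  # 1011000000101010
```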
Running a web crawler is a challenging task. There are reliability and performance issues, and even more importantly, there are social issues. In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. It turns out that running a crawler which connects to more than half a million servers generates a fair amount of phone calls and email, because of the vast number of people coming online.
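The paper's crawlers were written in Python and driven by a URL server; the sketch below shows only the general shape of such a concurrent fetcher. The seed URL, worker count, ten-second timeout, and variable names are illustrative assumptions, and a real system would add politeness delays, robots.txt handling, and link extraction.

```python
# Minimal sketch of a multi-worker crawler: a shared frontier of URLs,
# several fetcher threads, and a set of already-visited pages.
import queue
import threading
import urllib.request

frontier = queue.Queue()
visited, visited_lock = set(), threading.Lock()

def fetch_worker():
    while True:
        url = frontier.get()
        with visited_lock:
            if url in visited:
                frontier.task_done()
                continue
            visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                page = resp.read()
                print(f"fetched {len(page)} bytes from {url}")
                # A real crawler would parse links here and enqueue them.
        except Exception as err:  # dead servers, timeouts, bad URLs
            print(f"error fetching {url}: {err}")
        frontier.task_done()

frontier.put("http://example.com/")
for _ in range(4):  # four concurrent fetchers; the real system kept ~300 connections open
    threading.Thread(target=fetch_worker, daemon=True).start()
frontier.join()
```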
There are always those who do not know what a crawler is, because this is the first one they have seen. The goal of searching is to provide quality search results efficiently. Many search engines seem to have made good progress in terms of efficiency, so we have focused our research on search quality. Google maintains much more information about documents than typical search engines do. Combining all this information into a rank is difficult.
We have designed our ranking system so that no particular factor can have too much influence. A single-word query is a simple case. In order to rank a document for a single-word query, Google looks at the hit list for that word. For a multi-word search the situation is more complicated. The ranking system has many parameters, and figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine.
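A minimal sketch of single-word ranking in this spirit appears below: hits are counted by type, the counts are converted to tapered count-weights so that repetition cannot dominate, the result is dotted with a type-weight vector, and the IR score is combined with PageRank. All numeric weights, the tapering function, and the combination rule here are illustrative placeholders; the paper does not publish its actual values.

```python
# Sketch of single-word ranking: type-weights dotted with tapered
# count-weights, then combined with PageRank. Weights are illustrative.
import math

TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0, "plain": 1.0}

def count_weight(count, cap=10):
    # Counts help linearly at first, then taper off so that sheer
    # repetition of a word cannot dominate the score.
    return min(count, cap) + math.log1p(max(count - cap, 0))

def ir_score(hit_counts):
    """hit_counts maps hit type -> number of hits of that type."""
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())

def final_rank(hit_counts, pagerank, ir_weight=0.7):
    # How the IR score and PageRank are combined is itself a tunable parameter.
    return ir_weight * ir_score(hit_counts) + (1 - ir_weight) * pagerank

print(final_rank({"title": 1, "plain": 25}, pagerank=3.2))
```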
A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Our own experience with Google has shown that it produces better results than the major commercial search engines for most searches. Google relied on anchor text to determine that a page was a good answer for the query.
Similarly, another result was an email address which, of course, is not crawlable. Our immediate goals are to improve search efficiency and to scale to approximately 100 million pages. The biggest problem facing users of web search engines today is the quality of the results they get back.