Tuesday, November 23, 2010

Search Engines: Past, Present, Future...

Introduction

Whenever we come across a question we can't answer, the first thing that comes to mind for most people is "Google it." This simple two-word phrase shows how far we've come, technologically speaking; the majority of our information is either on the Internet or in the process of being added to it. Since the inception of search engines twenty years ago, both the Internet and search engines have evolved into a staple of everyday life for billions of people. Every second there are 500,000 web searches, ranging from gift ideas to celebrity gossip, and in 2010 it is impossible to imagine the world without search engines.

What is a Search Engine?

In order to search for information, people need search engines. But what is a search engine? One definition of a search engine is "a computer program that retrieves documents or files or data from a database or from a computer network (especially from the internet)[1]." Search engines are designed to search for information from the World Wide Web and FTP servers and return it in the form of hits nearly instantaneously. Search engines have to put accurate results at the top of the page because user studies show that most people only click the first few hits before making another search. Although search engines can vary in the data they search through, they all share a common element: they are all algorithms designed to sort through the mess of data available on the Internet[2].

Humble Beginnings?

All great inventions start with an idea; in the case of search engines, the idea was first published in 1945. That year The Atlantic Monthly published an article by Vannevar Bush in which he urged the creation of a collective body of knowledge for all of mankind. He wrote the article as World War II was ending, after a surge in scientific collaboration to support the war effort. He believed strongly in storing data and organizing that storage the way our brains organize memories, so that it would be most useful to humans. The most important passage of the article is: "A record, if it is to be useful to science, must be continuously extended, it must be stored, and above all it must be consulted[3]." The article was a forerunner of search engine technology in the same way Richard Feynman's famous lecture was a forerunner of nanotechnology: both scientists had the idea but lacked the technology of their time to make it a reality.

Jumping Ahead

In 1990, students at McGill University in Montreal created Archie, which is considered the first search engine. Archie is a tool for indexing FTP archives, allowing people to search for and find specific files. Earlier versions of Archie let you search the Internet only if you already knew the name of the file you were looking for. Indexing of file content did not come until a year later, when Gopher introduced this capability in 1991. The University of Nevada System Computing Services group created an Archie spin-off called Veronica. Veronica worked the same way as Archie but operated on plain text files[4].

You Have to Learn to Crawl Before You Search

Today, search engines use web crawling technology to sort through online content. This software records not only which words appear on a page but also how often they appear. It also makes use of hyperlinks and how often they are followed. The degree to which web pages cross-reference each other with hyperlinks provides a measure of each page's importance and relevance, giving developers a way to structure the sprawl of the web and rank pages accordingly[3]. This technology is what allowed Google to become the search engine giant it is today.
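To make the link-analysis idea concrete, here is a minimal sketch of the kind of iterative scoring that crawled link data can feed. The four-page link graph, the damping factor, and the function name are all invented for illustration; real engines use far more signals than this.

    # Toy link graph: each page lists the pages it links to (hypothetical data).
    links = {
        "home":     ["about", "products"],
        "about":    ["home"],
        "products": ["home", "reviews"],
        "reviews":  ["products"],
    }

    def rank_pages(links, damping=0.85, iterations=30):
        """Iteratively score pages by how often other pages link to them."""
        pages = list(links)
        score = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                share = score[page] / len(outgoing) if outgoing else 0.0
                for target in outgoing:
                    new[target] += damping * share
            score = new
        return score

    # Pages that many other pages point to end up with higher scores.
    for page, s in sorted(rank_pages(links).items(), key=lambda x: -x[1]):
        print(f"{page:10s} {s:.3f}")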

What Comprises a Search Engine?

A search engine is made up of three parts: spiders, an index, and the search interface and relevancy software. The search engine spiders follow links on the web and request pages that have either been updated since their last indexing or have yet to be indexed. These pages are said to be 'crawled' and are then added to the search engine's index. Also known as the catalog, the index represents a slightly out-of-date copy of the World Wide Web. When you enter a query into a search engine that uses crawling software, it searches its index and returns the results. Lastly, the search interface and relevancy software has several jobs once a query is entered. For example, it does the following (a rough sketch of the whole pipeline appears after this list)[3]:

  • Accept the user's query, checking for any advanced syntax and checking whether the query is misspelled so it can recommend more popular or correctly spelled variations.
  • Check to see if the query is relevant to other vertical search databases (such as news search or product search) and place relevant links to a few items from that type of search query near the regular search results.
  • Gather a list of relevant pages for the organic search results. These results are ranked based on page content, usage data, and link citation data.
  • Request a list of relevant ads to place near the search results.
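As a rough illustration of how the three parts fit together, the sketch below builds a tiny inverted index from pages a spider would have fetched and answers a query by counting matching terms. The page contents and file names are made-up stand-ins, and real relevancy software weighs page content, usage data, and link citation data rather than simple term counts.

    from collections import defaultdict

    # Stand-in for pages a spider has already crawled (hypothetical content).
    crawled_pages = {
        "page1.html": "cheap holiday gift ideas for kids",
        "page2.html": "celebrity gossip and holiday news",
        "page3.html": "gift wrapping ideas and holiday recipes",
    }

    # The index (catalog): each word maps to the set of pages containing it.
    index = defaultdict(set)
    for url, text in crawled_pages.items():
        for word in text.lower().split():
            index[word].add(url)

    def search(query):
        """Return pages ranked by how many of the query's words they contain."""
        hits = defaultdict(int)
        for word in query.lower().split():
            for url in index.get(word, ()):
                hits[url] += 1
        return sorted(hits.items(), key=lambda item: -item[1])

    print(search("holiday gift ideas"))  # pages matching more query words come first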

The Present

For information on what is happening with search engines right now, see Tony's PowerPoint and the handout he gave in lecture; both contain a wealth of information on the current world of search.

The Future

As previously mentioned, after users make a search they only scan the top ten results. Search engines have to take into account that users do not want to sift through a lot of information for fear of being overwhelmed. Search engines also face the tough task of keeping up with the billions of web pages added each day. Users expect search engines to surface the most current news, tweets, and Facebook updates, even though it takes web crawlers hours to crawl through and index this new information. Since searching is done against an out-of-date copy of the World Wide Web, it is important for search engines to add new information as fast as possible; otherwise the most current news will be old news by the time users can find it. This dilemma is leading search engines to pursue real-time results. Like most people, I would have assumed that real-time, faster results could be achieved simply by adding more hardware behind the search (i.e., more powerful computers); however, that is not the case[5]. To solve the news problem, search engine developers are now looking at strategies that index the contents of links posted in tweets. They are also looking at spotting newsworthy events by watching for spikes in smartphone posts coming from a particular location. A classic example would be a surge in the term 'earthquake' from a city that is prone to earthquakes. The biggest problem with this approach is the reputability of the links, tweets, and other posts from people at these locations. One idea is to prioritize by number of followers, but in my opinion follower count does not correlate with reliability.
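Here is a very rough sketch of the event-spotting idea: count how often a term appears in recent geotagged posts from each city and flag any city where the count far exceeds a baseline. The posts, the baseline rate, and the threshold are all invented for illustration, not any engine's actual method.

    from collections import Counter

    # Hypothetical stream of recent geotagged posts: (city, text).
    recent_posts = [
        ("san_francisco", "did anyone else feel that earthquake?"),
        ("san_francisco", "earthquake just shook my apartment"),
        ("san_francisco", "grabbing coffee downtown"),
        ("new_york",      "traffic is terrible today"),
        ("san_francisco", "whoa earthquake!"),
    ]

    def detect_spike(posts, term, baseline_per_window=0.5, threshold=4.0):
        """Flag cities where a term appears far more often than its usual rate."""
        counts = Counter(city for city, text in posts if term in text.lower())
        return {city: n for city, n in counts.items()
                if n / baseline_per_window >= threshold}

    print(detect_spike(recent_posts, "earthquake"))
    # {'san_francisco': 3} -- three mentions against an assumed baseline of 0.5 per window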

A more radical approach to the real-time search problem is to make users part of the solution. Start-up search engines Wowd and Faroo are looking at "decentralizing" the search engine by convincing users to download a piece of software. This software is essentially a file-sharing program that tracks the user's browsing habits. Instead of sitting in data centers, the index built from these browsing patterns would be stored on each user's local machine and could be accessed by other users. This could make searching much faster than what we experience today and could also serve as a way to rank page relevance: the more users visit a page, the more relevant it is[5]. Using data directly from users' machines would let searching be based on user-centric data and could be the way of the future.
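One way to picture a decentralized index, as a simplified sketch of my own rather than how Wowd or Faroo actually work: each participating machine is responsible for the terms that hash to it, so a query only needs to contact the peers that own its words. The peer names and posting lists below are made up.

    import hashlib

    peers = ["peer_a", "peer_b", "peer_c"]  # participating users' machines (hypothetical)

    def owner(term):
        """Deterministically assign each term to one peer by hashing it."""
        digest = int(hashlib.sha1(term.encode()).hexdigest(), 16)
        return peers[digest % len(peers)]

    # Each peer stores only the posting lists for the terms it owns.
    peer_index = {p: {} for p in peers}

    def publish(term, url):
        peer_index[owner(term)].setdefault(term, set()).add(url)

    def lookup(term):
        return peer_index[owner(term)].get(term, set())

    publish("holiday", "page1.html")
    publish("holiday", "page2.html")
    publish("gift", "page1.html")
    print(lookup("holiday"))  # pages known to whichever peer owns the term 'holiday'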

Another type of search currently in use has been dubbed the "help engine." An example is Aardvark's service, where users post questions whose answers cannot be found with traditional search engines, such as "What's a great biking path around Golden Gate Park?" The question is sent to users who are specialists in biking paths, Golden Gate Park, and so on, so the person asking can get a reliable answer from someone who has actually biked around that park. Aardvark chooses these specialists by looking at blog posts, tweets, and online profiles to determine which people are best at answering certain questions. According to Aardvark, 60% of questions are answered in less than ten minutes[6]. This past February, Google purchased Aardvark, which means help engines might be a big part of Google's searching future.
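A toy sketch of the routing idea follows: match a question's topics against interest profiles and send it to the users with the most overlap. The profiles and scoring below are invented for illustration and are not Aardvark's actual method.

    # Hypothetical specialist profiles built from blogs, tweets, and online profiles.
    specialists = {
        "alice": {"biking", "san_francisco", "golden_gate_park"},
        "bob":   {"cooking", "restaurants"},
        "carol": {"biking", "marin", "trails"},
    }

    def route_question(question_topics, k=2):
        """Send the question to the k users whose interests overlap it most."""
        scored = sorted(specialists.items(),
                        key=lambda item: -len(item[1] & question_topics))
        return [name for name, topics in scored[:k] if topics & question_topics]

    print(route_question({"biking", "golden_gate_park"}))
    # ['alice', 'carol']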

The distant future of searching is already in the making: a semantic search engine with natural language parsing. This would mean the search engine actually understands what the user wants and can deliver more accurate results. Work on semantic searching is ongoing, but progress has been slow because it relies on programmers and users to attach extra, computer-readable information to each web page. Once again Google is playing a major role in this form of search: it purchased Metaweb, a large database of tagged entries that can be read by computers, in hopes of making its already smart searching smarter. Now that Twitter and Facebook allow tagging and annotation of user posts, semantic searching might be here sooner than we think.
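To show what "computer-readable information attached to a page" might look like, here is a toy sketch of answering a structured query against tagged entries. The entities, tag names, and matching logic are invented for illustration; they are not Metaweb's actual schema.

    # Invented examples of tagged, machine-readable entries.
    entities = [
        {"name": "Golden Gate Park", "type": "park",   "city": "San Francisco"},
        {"name": "Central Park",     "type": "park",   "city": "New York"},
        {"name": "Ferry Building",   "type": "market", "city": "San Francisco"},
    ]

    def semantic_search(**constraints):
        """Return entities whose tags satisfy every constraint, not just keyword hits."""
        return [e["name"] for e in entities
                if all(e.get(key) == value for key, value in constraints.items())]

    print(semantic_search(type="park", city="San Francisco"))
    # ['Golden Gate Park']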
