
Crawling the Deep Web


A rather large chunk of the web is still undiscovered. We present ways to look deeper.

Google has become synonymous with search. Search engines that return results faster than we can say "search" make us believe we have all the information there is. While popular search engines can search a large part of the web, there are sites that lie below their radar, sites that you will probably never come across. Welcome to the Deep Web. The part of the web that search engine spiders are able to crawl is called the Surface Web, like the visible tip of an iceberg. In contrast, the Deep Web is the huge repository of information that search engine spiders do not have access to.

Why don’t search engines search the deep web?

Each search engine has a program called a spider, crawler or bot that constantly crawls the internet looking for information. It indexes the web pages it crawls and ranks them according to the relevance of their content. While crawling a web site, the bot also follows the links on that site, thus widening its footprint. Depending on its algorithms, a search engine can either note the presence of a page without indexing it, or index the page content and look for hyperlinks on the page. How often the spider revisits a web site is at the search engine’s discretion.
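The link-following behaviour described above can be illustrated with a minimal sketch. It assumes a tiny made-up "web" held in memory (the `PAGES` dictionary and its example.com URLs are hypothetical) rather than real network requests, and it is not how any particular search engine is implemented.

```python
from collections import deque

# Hypothetical in-memory web: each URL maps to (page text, outgoing links).
PAGES = {
    "http://example.com/": ("home page", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a about search", ["http://example.com/b"]),
    "http://example.com/b": ("page b", ["http://example.com/"]),
    # No page links to this one, so a link-following crawler never sees it.
    "http://example.com/hidden": ("reachable only through a form query", []),
}

def crawl(seed):
    """Breadth-first crawl from seed; returns {url: text} for pages reached via links."""
    index = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in index or url not in PAGES:
            continue  # already indexed, or a dead link
        text, links = PAGES[url]
        index[url] = text      # "index" the page content
        queue.extend(links)    # follow its hyperlinks later
    return index

index = crawl("http://example.com/")
print(sorted(index))
```

The page that nothing links to never enters the index, which is the basic reason link-based crawling misses part of the web.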

However, search engines have limitations. Because they operate on fixed algorithms, they sometimes fail to put a search query in context and return irrelevant results. Search engine bots also largely crawl static web pages, whereas much of the information on the net is stored in databases, which the spiders are not able to crawl. The search results therefore miss the data held in the databases of universities and government organizations, among others. Moreover, there are many anonymous-surfing web sites that search engines do not crawl. All of this adds up, so that search results cover only a fraction of the total data available.

The deep web

The obvious question at this point is: if Google cannot search this data, what exactly is it, and why is it not crawled? Let’s try to break down the deep web by the kind of content it holds.

A database stores information in tables, which are created with database software such as Microsoft Access, Oracle or other SQL-based systems. This data can only be retrieved by posing a query. When executed, the query searches the database and returns the results that match what was asked for. This is very different from static web pages, which crawlers can access directly.
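To make the difference concrete, here is a small sketch using Python's built-in `sqlite3` module. The `papers` table and its rows are invented for illustration; the point is only that the rows come back in response to a query and do not exist as pages a crawler could follow a link to.

```python
import sqlite3

# An in-memory database standing in for an organization's private data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, topic TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [
        ("Stem cell therapy overview", "medicine"),
        ("Crawler design notes", "computing"),
    ],
)

def search(topic):
    """Return the titles stored under the given topic; nothing is visible without a query."""
    rows = conn.execute(
        "SELECT title FROM papers WHERE topic = ? ORDER BY title", (topic,)
    )
    return [title for (title,) in rows]

results = search("medicine")
print(results)
```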

The database content can be text or non-text, such as multimedia files, pictures, presentations and PDFs, among other things. It may also include dynamically changing content. Many universities, libraries, medical organizations and scientific research organizations hold data that cannot appear in popular search engine results. It may be visible only to members who have access rights to those databases.

There are also web sites that do not want search engine crawlers to index them, for reasons best known to their owners.

The dark web

No matter what technology we have, some elements in society come up with ways to misuse it. The Deep Web has a sinister side as well, called the Dark Web. A significant portion of the deep web is used by nefarious elements to spread objectionable material.

One example is Freenet, a distributed, decentralized information storage and retrieval system that lets users share and browse content anonymously. It has drawn a lot of flak because it not only keeps users anonymous, so that their paths are untraceable, but also hides the fact that someone is using Freenet at all. This has given cyber criminals a lot of leeway to flourish under the guise of anonymity.

This anonymity is good for users in countries where the internet is heavily censored and freedom of speech is stifled. At the same time, a large number of sites devoted to paedophilia, terrorist activities, virus coding and other cyber-crimes are floating around on Freenet.

Another case in point is the Russian mafia arm called the Russian Business Network (RBN), which is synonymous with online cyber-crime. RBN takes advantage of unused or discarded web addresses lying in the Deep Web. It activates these addresses for a couple of minutes, sends out millions of spam emails and then deactivates the addresses, which makes it untraceable. It also hosts web sites carrying child porn, malware and spyware, among other such cyber-crimes.

Federated Search

Now that we know another web exists, let us look at ways to harness this repository of information. Popular search engines crawl web sites by following links but are not able to search databases. Federated search engines are a category of search engine that searches multiple online databases or web resources at once.

Federated search engines work with the help of software connectors. A connector is a program written to access one content source, which can be a database. Apart from knowing the URL of the source, a connector must know how to send search commands in the search syntax of that particular source. The complexity of a connector depends on the authentication methods, URL redirections, cookies and number of steps required to access data from that source.

A federated search engine uses these connectors to return the final results from various sources to you. The results from various sources are consolidated and checked for duplication and then presented to you on a single page.
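The connector idea and the consolidation step can be sketched as follows. This is a simplified illustration under stated assumptions: the two sources, their records and the `make_connector` helper are hypothetical, each "source" is just an in-memory list, and duplicates are detected by URL only. Real connectors would also handle authentication, redirects and cookies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    title: str
    url: str
    source: str

def make_connector(name, records):
    """Build a hypothetical connector that searches one source's records by title."""
    def connector(query):
        q = query.lower()
        return [Result(title, url, name) for title, url in records if q in title.lower()]
    return connector

# Two made-up sources that happen to share one document.
CONNECTORS = [
    make_connector("library_a", [
        ("Stem cell research review", "http://a.example/1"),
        ("Deep web crawling", "http://a.example/2"),
    ]),
    make_connector("library_b", [
        ("Stem cell research review", "http://a.example/1"),
        ("Stem cell ethics", "http://b.example/9"),
    ]),
]

def federated_search(query, connectors=CONNECTORS):
    """Query every connector, then merge the results and drop duplicate URLs."""
    seen = set()
    merged = []
    for connector in connectors:
        for result in connector(query):
            if result.url not in seen:
                seen.add(result.url)
                merged.append(result)
    return merged

hits = federated_search("stem cell")
for hit in hits:
    print(hit.source, hit.title)
```

The shared document appears once in the merged list, credited to the first source that returned it.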

This type of search engine acts as an intermediary between you and the sources of your content, so that you do not have to visit each individual web site that hosts a database to perform your search.

Federated search is different from meta search, in which a search engine queries more than one popular search engine at a time. The difference is that the engines underlying a meta search still rely on crawlers, whereas a federated search sends the query directly to the databases.

Juliana Freire, Associate Professor at the School of Computing at the University of Utah, and her team have come up with DeepPeep (www.deeppeep.org), which aims to crawl public databases for casual as well as expert users. “DeepPeep is a vertical search engine: it is specialized in web forms – the entry points to hidden web content. Currently, we index 45,000 pages that contain forms, spanning 7 distinct domains,” she said.

According to her, a lot of mainstream web sites are already searching the deep web to a certain extent. “Having the ability to combine visible and hidden web information in a search engine’s results can lead to higher quality information being presented to users. But if not done right, this can lead to further overwhelming users with information,” she added.


Federated search engines

www.deeppeep.org

DeepPeep is a search engine specialized in web forms. The current beta version tracks 45,000 forms across 7 domains. These domains include auto, airfare, biology, book, job and rental.

www.deepwebtech.com

Deep Web Technologies provides federated search for science and technology related queries. It runs federated search sites for particular topics, such as Mednar for medical research, Biznar for business related sources, and WorldWideScience for scientific content from databases across the world.

www.virtualprivatelibrary.com

Marcus Zillman, executive director of Virtual Private Library and a deep web expert, has taken a different approach and developed a subject tracer information blog, which has links to research web sites divided into neat categories. This blog does not have a search query box, so you will have to decide on your area of interest and then follow the relevant links on the Virtual Private Library’s home page.

www.pipl.com

This is an excellent site that scours the deep web to find people based on their names, email addresses, usernames or even phone numbers.

Benefits of federated search

Saves Time

Federated search engines are a boon to research students because they save a lot of time. A federated search engine performs simultaneous searches on different databases, so the user does not have to visit each individual site to perform a search. It consolidates the results of all the database searches onto one web page.

Quality of results

The quality of results from a federated search is often better than that of a popular search engine on the same topic. This is because the federated search gets its results from databases maintained by organizations that are authorities on the subject.

Let’s say, for instance, that a medical student wants details on stem cell research. A popular search engine will provide links to news articles or, say, the Wikipedia entry on stem cell research. A federated search engine like Mednar, on the other hand, will provide links to articles from government recognized health institutes. It’s a no-brainer that information from a medical institution will carry far more weight than information from, say, Wikipedia.

Real time search

Federated search works in real time, so you get the most up-to-date information from the source on your next query. In popular search engines, by contrast, the search results are updated only when the crawlers next crawl the web. Deep web search engines search each source live for every query they receive. So as soon as the parent database is updated to include a new document, the next search will find it.

However, a federated search engine takes considerably longer to return results than a popular search engine. This is because federated search results depend on how fast each underlying database search performs. The most common workaround is to populate the results page incrementally, as results arrive from the underlying sources.
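The incremental approach can be sketched with Python's standard `concurrent.futures` module. The two sources and their delays are made up, with `time.sleep` standing in for a slow database; this is only one way such a workaround might be written.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_source(name, delay, results):
    """Hypothetical source query; the sleep simulates a slow underlying database."""
    time.sleep(delay)
    return name, results

# (source name, simulated response time in seconds, results it returns)
SOURCES = [
    ("slow_db", 0.3, ["doc2", "doc3"]),
    ("fast_db", 0.05, ["doc1"]),
]

def search_incrementally(sources):
    """Query all sources in parallel and yield each result set as soon as it arrives."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [pool.submit(query_source, *source) for source in sources]
        for future in as_completed(futures):
            yield future.result()

arrival_order = []
for name, results in search_incrementally(SOURCES):
    arrival_order.append(name)
    print(name, results)  # a real page would append these to the results list
```

The user sees the fast source's results without waiting for the slow one, at the cost of a results page that keeps changing while the search runs.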

So next time you log on to the web, do not just limit your search to popular search engines. Do try to explore the deep web.