25 December 2013

Web Crawling

A web crawler allows one to search and scrape through document URLs, based on specific criteria, for indexing. It also needs to be approached in a netiquette-friendly way, conforming to each site's robots.txt rules. Scalability can be an issue as well, and different approaches can be devised for an optimal outcome. An algorithm-driven approach is vital for meeting requirements, and it might incorporate either an informed or an uninformed search strategy; at times a combination of the two, along with heuristics. Ultimately this means that, from the algorithmic point of view of a crawler, the web is treated as a graph to be searched, which lends itself well to linked data. Crawls can be conducted in a distributed fashion using a multiagent approach, or by singular agents.

Web crawlers can also be used for monitoring website usage and security, and for surfacing analytics that might otherwise be hidden from a webmaster. There are quite a few open source tools and services available to a developer. There is always a period in which testing needs to be done locally to work out the ideal, web-friendly approach. There is no single best solution out there if the needs go beyond the limitations of what existing libraries can offer; in that respect, it really means designing one's own custom search strategy. And, perhaps, making it open source to share with the community.
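To make the graph-search view concrete, below is a minimal sketch of an uninformed (breadth-first) crawl written with only the Python standard library. The seed URL, user agent string, and page limit are placeholders of my own choosing; a real crawl would also add politeness delays, content deduplication, and sturdier error handling, so treat this as an illustration rather than a recommendation over the libraries listed further down.

from collections import deque
from html.parser import HTMLParser
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

USER_AGENT = "ExampleCrawler/0.1"   # placeholder agent string
SEED_URL = "http://example.com/"    # placeholder seed
MAX_PAGES = 25                      # keep local test runs small and polite


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def allowed(url, robots_cache):
    """Check robots.txt for the URL's host, caching one parser per host."""
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if host not in robots_cache:
        parser = robotparser.RobotFileParser(host + "/robots.txt")
        try:
            parser.read()
        except OSError:
            parser = None   # unreadable robots.txt: skip the host to stay polite
        robots_cache[host] = parser
    parser = robots_cache[host]
    return parser is not None and parser.can_fetch(USER_AGENT, url)


def crawl(seed):
    # Uninformed (breadth-first) strategy: a FIFO frontier over the link graph.
    # Swapping the deque for a priority queue scored by some heuristic would
    # turn this into an informed, best-first crawl.
    frontier = deque([seed])
    seen = {seed}
    robots_cache = {}
    fetched = 0
    while frontier and fetched < MAX_PAGES:
        url = frontier.popleft()
        if not allowed(url, robots_cache):
            continue
        try:
            request = Request(url, headers={"User-Agent": USER_AGENT})
            html = urlopen(request, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue   # unreachable page or malformed URL: move on
        fetched += 1
        print("fetched", url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)


if __name__ == "__main__":
    crawl(SEED_URL)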

Python:

Java:

Linked Data:

Services:

Also, HBase in general appears to be a very good back-end for a crawler architecture, and it plays well with Hadoop.
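As a rough illustration of that pairing, the fetch loop of a crawler could hand pages to HBase through the happybase Thrift client. The connection host, the 'webpage' table, the 'f' column family, and the URL-as-row-key scheme below are assumptions made for the sketch, not a prescribed schema.

import happybase  # Python client for HBase over the Thrift gateway

# Assumed setup: an HBase Thrift server on localhost and a pre-created
# table named 'webpage' with one column family 'f'; both names are
# placeholders, e.g. created from the HBase shell with:
#   create 'webpage', 'f'
connection = happybase.Connection('localhost')
table = connection.table('webpage')


def store_page(url, html):
    # The URL is used directly as the row key for simplicity; a real crawler
    # would typically reverse the host part so pages from the same site sort
    # together, which helps region locality and downstream Hadoop scans.
    table.put(url.encode('utf-8'), {
        b'f:html': html.encode('utf-8'),
        b'f:status': b'fetched',
    })


def load_page(url):
    # Read the stored HTML back, e.g. for a Hadoop-based indexing job.
    row = table.row(url.encode('utf-8'))
    return row.get(b'f:html')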

Obviously, there are a lot more options out there, most of which likely come at a premium. The majority of the premium options have deliberately been left unmentioned here.

High Performance Distributed Web Crawler
High Performance Distributed Web Crawler Survey
Learning and Discovering Structure in Web Pages
UbiCrawler: A Scalable Fully Distributed Web Crawler
Searching the Web