What is a Crawler?
A crawler (also known as a spider or robot) is an automated program that browses a source of information in a methodical manner. It starts from a set of seeds, or initial URLs, such as file:///mydocuments or http://mydomain/resource. From there, the crawler follows the links from the seeds to other resources, storing the data it finds for further processing.
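The seed-and-follow idea can be illustrated with a minimal sketch. This is not ISCrawler's code: the function crawl, the get_links callback, and the LINKS table are hypothetical names chosen for illustration, and an in-memory dictionary stands in for real pages so that no network or disk access is needed.

```python
from collections import deque

def crawl(seeds, get_links):
    """Visit every resource reachable from the seeds, each exactly once.

    get_links is a caller-supplied function (an assumption of this sketch)
    that returns the outgoing links of a given URL.
    """
    start = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(start)
    queue = deque(start)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real crawler would fetch and store the content here
        for link in get_links(url):
            if link not in seen:  # never enqueue the same resource twice
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph standing in for real resources.
LINKS = {
    "http://mydomain/resource": ["http://mydomain/a", "http://mydomain/b"],
    "http://mydomain/a": ["http://mydomain/b", "http://mydomain/resource"],
    "http://mydomain/b": [],
}

order = crawl(["http://mydomain/resource"], lambda url: LINKS.get(url, []))
```

The seen set is what keeps the crawl from looping forever when resources link back to each other, as the seed and http://mydomain/a do in this example.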
ISCrawler
ISCrawler is the ISKMM module responsible for extracting digital content in different formats and from different sources. Because a large amount of textual content is stored today in free-text files, and because other ISKMM modules depend on this data, this module is essential for capturing information and making it available.
ISCrawler has two main behaviors: disk behavior and web behavior. Both use a recursive strategy to navigate a set of nodes, which can be directories on a hard disk or web pages.
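As a rough illustration of the recursive strategy applied to directories, the sketch below walks a directory tree and collects every regular file it finds. The function name crawl_directory is hypothetical, and the choice to skip symbolic links is an assumption of this sketch rather than documented ISCrawler behavior.

```python
import os

def crawl_directory(root):
    """Recursively collect the paths of all regular files under root."""
    found = []
    with os.scandir(root) as entries:
        # Sort by name so the traversal order is deterministic.
        ordered = sorted(entries, key=lambda entry: entry.name)
    for entry in ordered:
        if entry.is_dir(follow_symlinks=False):
            # A directory is a node with children: recurse into it.
            found.extend(crawl_directory(entry.path))
        elif entry.is_file(follow_symlinks=False):
            # A file is a leaf: this is where content would be extracted.
            found.append(entry.path)
    return found
```

Calling crawl_directory on a seed directory returns a flat list of file paths. Not following symbolic links avoids infinite recursion when a link points back to a parent directory.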
All crawled information is stored in an inverted index built with the Lucene API. Together, the inverted index structure and the Lucene API make accessing crawled data easy, fast, and reliable.
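To show why an inverted index makes lookups fast, here is a toy version in plain Python. It does not use Lucene and is far simpler than what Lucene builds. The names build_inverted_index and search, the word-splitting rule, and the sample documents are all assumptions made for this sketch.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive rule)."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical crawled documents, keyed by the URL they came from.
docs = {
    "file:///mydocuments/a.txt": "Lucene stores an inverted index",
    "http://mydomain/resource": "The crawler follows links and feeds the index",
}
index = build_inverted_index(docs)
hits = search(index, "inverted index")
```

A query reads only the entries for its own terms instead of scanning every document, which is the property that makes this structure suitable for searching crawled data.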