Instituto Stela

portalinovacao.info


ISCrawler

2007-08-15

What is a Crawler?
A crawler (also known as a spider or robot) is an automated program that methodically browses a source of information. It starts from a set of seeds, or initial URLs, such as file:///mydocuments or http://mydomain/resource. From the seeds, the crawler follows links to other resources, storing the data it finds for later processing.
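
The seed-and-follow idea described above can be sketched in a few lines of Python. This is only an illustration, not ISCrawler's code: the function name crawl, the get_links callback, and the in-memory LINKS table (which stands in for real fetching and link extraction) are all assumptions made for the example.

```python
from collections import deque

def crawl(seeds, get_links):
    """Return every resource reachable from the seeds, each visited once."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real crawler would store the content here
        for link in get_links(url):
            if link not in seen:  # skip resources already scheduled
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph used instead of real network or disk access.
LINKS = {
    "http://mydomain/resource": ["http://mydomain/a", "http://mydomain/b"],
    "http://mydomain/a": ["http://mydomain/b", "http://mydomain/resource"],
    "http://mydomain/b": [],
}

order = crawl(["http://mydomain/resource"], lambda url: LINKS.get(url, []))
print(order)
```

The seen set is what keeps the crawler from looping forever when two resources link to each other.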

ISCrawler

ISCrawler is an ISKMM module responsible for extracting digital content from different formats and sources. Because a large amount of textual content is stored in free-text files, and because other ISKMM modules depend on this data, the module is essential for capturing information and making it available.

ISCrawler has two main behaviors: disk behavior and web behavior. Both use a recursive strategy to navigate a set of nodes, which can be directories on a hard disk or web pages.
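
As a rough illustration of the recursive strategy in the disk case, the sketch below descends into each directory and yields the files it finds. The name walk_disk is hypothetical and the code uses only the Python standard library; it is not how ISCrawler itself is implemented.

```python
import os

def walk_disk(root):
    """Recursively yield the path of every regular file under root."""
    # Sort entries by name so the traversal order is predictable.
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            # A directory is a node with children: recurse into it.
            yield from walk_disk(entry.path)
        elif entry.is_file(follow_symlinks=False):
            # A file is a leaf: hand it over for content extraction.
            yield entry.path
```

The web case has the same shape, with pages as nodes and hyperlinks as the edges to recurse into, plus a record of visited pages because links can form cycles.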

All crawled information is stored in an inverted index using the Lucene API. Together, the inverted index structure and Lucene's API make crawled data easy, fast, and reliable to access.
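
To make the inverted index idea concrete, here is a toy version in plain Python. It does not use Lucene and leaves out almost everything Lucene provides (analysis, scoring, on-disk storage); the names build_index and search and the sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical crawled documents, keyed by the URL they came from.
docs = {
    "file:///mydocuments/a.txt": "A crawler follows links from its seeds",
    "http://mydomain/resource": "Lucene stores data in an inverted index",
    "http://mydomain/other": "An index of seeds",
}

index = build_index(docs)
hits = search(index, "inverted index")
print(hits)
```

A lookup goes from term to documents directly, so answering a query does not require rescanning the crawled text.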




Copyright © 2005-2009 Instituto Stela. All rights reserved.