SMILA 1.0 API documentation

org.eclipse.smila.importing.crawler.web.fetcher
Class SimpleFetcher

java.lang.Object
  extended by org.eclipse.smila.importing.crawler.web.fetcher.SimpleFetcher
All Implemented Interfaces:
Fetcher

public class SimpleFetcher
extends java.lang.Object
implements Fetcher

Example implementation of a Fetcher service. It uses the HTTP GET method to access the resource.

It does not (yet) follow any redirects and does not (yet) support authentication. It is based on Apache HttpClient 3.1.
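
The behavior described above (a plain GET, with redirects reported rather than followed) can be sketched as follows. This is not the SimpleFetcher source: the class is based on Apache HttpClient 3.1, while this sketch swaps in the JDK's java.net.HttpURLConnection so that it runs without extra libraries. The names PlainGetSketch and Result are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of SMILA.
final class PlainGetSketch {

  /** Hypothetical holder for status code, first value of each header, and body bytes. */
  static final class Result {
    final int status;
    final Map<String, String> headers;
    final byte[] body;

    Result(int status, Map<String, String> headers, byte[] body) {
      this.status = status;
      this.headers = headers;
      this.body = body;
    }
  }

  /** Issues one GET request and returns the response as-is, without following redirects. */
  static Result get(String url) throws IOException {
    HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
    con.setRequestMethod("GET");
    con.setInstanceFollowRedirects(false); // a 3xx response is returned to the caller
    try {
      int status = con.getResponseCode();
      // Header names are case-insensitive, so use a case-insensitive map.
      Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
      for (Map.Entry<String, List<String>> e : con.getHeaderFields().entrySet()) {
        // The status line is reported under a null key; skip it.
        if (e.getKey() != null && !e.getValue().isEmpty()) {
          headers.put(e.getKey(), e.getValue().get(0));
        }
      }
      InputStream in = status >= 400 ? con.getErrorStream() : con.getInputStream();
      byte[] body = in == null ? new byte[0] : in.readAllBytes();
      return new Result(status, headers, body);
    } finally {
      con.disconnect();
    }
  }
}
```

With redirects disabled, a caller sees the 3xx status and the Location header and has to decide itself what to do with them.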


Constructor Summary
SimpleFetcher()
          Initializes the HttpClient with redirects disabled.
 
Method Summary
 void crawl(Record linkRecord, AnyMap parameters, TaskLog taskLog)
          Invoked by WebCrawlerWorker to resolve the URL in an input record.
 void fetch(Record crawledRecord, AnyMap parameters, TaskLog taskLog)
          Invoked by WebFetcherWorker to get the content of a resource for which the crawler did not already attach the content.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SimpleFetcher

public SimpleFetcher()
Initializes the HttpClient with redirects disabled.

Method Detail

crawl

public void crawl(Record linkRecord,
                  AnyMap parameters,
                  TaskLog taskLog)
           throws WebCrawlerException
Description copied from interface: Fetcher
Invoked by WebCrawlerWorker to resolve the URL in an input record. Must write metadata from the HTTP headers to attributes and attach the content of resources that can be used for link extraction.

Specified by:
crawl in interface Fetcher
Parameters:
linkRecord - record containing the URL and possibly additional information needed to access the web resource.
parameters - configuration parameters, may be null.
taskLog - log facility provided by worker frame.
Throws:
WebCrawlerException - if the resource cannot be crawled. If the error is recoverable, the request should be retried later; otherwise the crawler worker should skip the record.
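
The two obligations of crawl (header metadata goes to attributes, content is attached only when it can be used for link extraction) can be sketched with plain maps instead of SMILA records. Everything here is an assumption made for illustration: the class name CrawlContractSketch, the attribute names, the choice of headers, and the rule that only HTML and XHTML responses count as usable for link extraction.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, not part of SMILA.
final class CrawlContractSketch {

  /** Copies selected HTTP response headers into attributes (attribute names are hypothetical). */
  static Map<String, String> headersToAttributes(Map<String, String> headers) {
    Map<String, String> attributes = new LinkedHashMap<>();
    for (Map.Entry<String, String> header : headers.entrySet()) {
      // HTTP header names are case-insensitive, so normalize before comparing.
      String name = header.getKey().toLowerCase(Locale.ROOT);
      if (name.equals("content-type")) {
        attributes.put("http.contentType", header.getValue());
      } else if (name.equals("content-length")) {
        attributes.put("http.size", header.getValue());
      } else if (name.equals("last-modified")) {
        attributes.put("http.lastModified", header.getValue());
      }
    }
    return attributes;
  }

  /** Assumption: only HTML and XHTML responses are attached for link extraction. */
  static boolean isUsableForLinkExtraction(String contentType) {
    if (contentType == null) {
      return false;
    }
    // Drop parameters such as "; charset=UTF-8" before comparing the media type.
    String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
  }
}
```

Under these assumptions, a PDF response would still get its header attributes written but would not have its content attached during crawling.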

fetch

public void fetch(Record crawledRecord,
                  AnyMap parameters,
                  TaskLog taskLog)
           throws WebCrawlerException
Description copied from interface: Fetcher
invoked by WebFetcherWorker to get the content of a resource for which the crawler did not already attach the content.

Specified by:
fetch in interface Fetcher
Parameters:
parameters - configuration parameters, may be null.
taskLog - log facility provided by worker frame.
Throws:
WebCrawlerException - if the resource cannot be fetched. If the error is recoverable, the request should be retried later; otherwise the crawler worker should skip the record.
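
The retry-or-skip rule in the Throws clause can be sketched from the caller's side. This page does not show how WebCrawlerException exposes the recoverable state, so the sketch uses a hypothetical stand-in exception with a boolean field; FetchCall, Outcome and RetryOrSkipSketch are likewise hypothetical names, and the real worker may handle this differently.

```java
// Hypothetical illustration, not part of SMILA.
final class RetryOrSkipSketch {

  /** Hypothetical stand-in for WebCrawlerException; the real class's API is not shown here. */
  static final class FetchFailure extends Exception {
    final boolean recoverable;

    FetchFailure(String message, boolean recoverable) {
      super(message);
      this.recoverable = recoverable;
    }
  }

  /** Hypothetical stand-in for a call such as fetcher.fetch(record, parameters, taskLog). */
  interface FetchCall {
    void run() throws FetchFailure;
  }

  enum Outcome { FETCHED, RETRY_LATER, SKIPPED }

  /** Runs the call and maps a failure to "retry later" or "skip the record". */
  static Outcome process(FetchCall call) {
    try {
      call.run();
      return Outcome.FETCHED;
    } catch (FetchFailure e) {
      // Recoverable: the request should be retried later. Otherwise: skip this record.
      return e.recoverable ? Outcome.RETRY_LATER : Outcome.SKIPPED;
    }
  }
}
```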
