SMILA 1.0 API documentation

org.eclipse.smila.importing.crawler.web.fetcher
Class DefaultFetcher

java.lang.Object
  extended by org.eclipse.smila.importing.crawler.web.fetcher.DefaultFetcher
All Implemented Interfaces:
ContentFetcher, Fetcher

public class DefaultFetcher
extends java.lang.Object
implements Fetcher

Example implementation of a Fetcher service. It uses the GET method to access the resource.

It does not (yet) support authentication. It is based on Apache HttpClient 4.1.


Constructor Summary
DefaultFetcher()
          initialize HttpClient with disabled redirects.
 
Method Summary
 void crawl(java.lang.String url, Record linkRecord, WebCrawlingContext context)
          invoked by WebCrawlerWorker to resolve the URL in an input record.
 void fetch(java.lang.String url, Record crawledRecord, WebCrawlingContext context)
          invoked by WebFetcherWorker to get the content of a resource for which the crawler did not already attach the content.
 java.io.InputStream getContent(Record crawledRecord, TaskContext taskContext)
          get a stream on a content object.
 void setLinkFilter(LinkFilter linkFilter)
          DS service reference injection method.
 void setVisitedLinks(VisitedLinksService visitedLinks)
          DS service reference injection method.
 void unsetLinkFilter(LinkFilter linkFilter)
          DS service reference removal method.
 void unsetVisitedLinks(VisitedLinksService visitedLinks)
          DS service reference removal method.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DefaultFetcher

public DefaultFetcher()
initialize HttpClient with disabled redirects.
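
The combination described here (GET requests, redirects not followed automatically) can be illustrated with a minimal sketch. It deliberately uses the JDK's java.net.HttpURLConnection instead of Apache HttpClient 4.1, so it is not the DefaultFetcher implementation; the class name NoRedirectGet and the method open are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of SMILA: shows a GET request with
// automatic redirect handling switched off, using only the JDK.
final class NoRedirectGet {
  private NoRedirectGet() {
  }

  /** Prepares (but does not yet send) a GET request that will not follow 3xx redirects. */
  static HttpURLConnection open(String url) throws IOException {
    // openConnection() only creates the connection object; no network traffic happens here.
    HttpURLConnection connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
    connection.setRequestMethod("GET");
    // With this flag off, a 3xx response is handed to the caller unchanged.
    connection.setInstanceFollowRedirects(false);
    return connection;
  }
}
```

With redirects disabled, the caller sees the 3xx status and the Location header itself and can decide what to do with the target, for example treating it as a new link to be filtered and crawled separately. Whether DefaultFetcher does exactly that is not stated here.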

Method Detail

crawl

public void crawl(java.lang.String url,
                  Record linkRecord,
                  WebCrawlingContext context)
           throws WebCrawlerException
Description copied from interface: Fetcher
invoked by WebCrawlerWorker to resolve the URL in an input record. Must write metadata from the HTTP headers to attributes and attach the content of resources that can be used for link extraction.

Specified by:
crawl in interface Fetcher
Parameters:
url - the url to crawl
linkRecord - record containing the URL and maybe additional information necessary to access the web resource.
Throws:
WebCrawlerException - if the resource cannot be crawled. If the error is recoverable, the request should be retried later; otherwise the record should be skipped by the crawler worker.
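
The retry-or-skip contract for this exception (stated for fetch as well) can be sketched from the caller's side. This is a simplified illustration, not SMILA worker code: CrawlException, its isRecoverable() accessor, UrlCrawler and CrawlLoop are hypothetical stand-ins, and the way the real WebCrawlerException exposes recoverability is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for WebCrawlerException; isRecoverable() is an assumed accessor.
final class CrawlException extends Exception {
  private final boolean recoverable;

  CrawlException(String message, boolean recoverable) {
    super(message);
    this.recoverable = recoverable;
  }

  boolean isRecoverable() {
    return recoverable;
  }
}

// Hypothetical stand-in for the crawl(url, record, context) call, reduced to the URL.
interface UrlCrawler {
  void crawl(String url) throws CrawlException;
}

final class CrawlLoop {
  private CrawlLoop() {
  }

  /** Crawls every URL and returns those that failed with a recoverable error. */
  static List<String> crawlAll(List<String> urls, UrlCrawler crawler) {
    List<String> retryLater = new ArrayList<>();
    for (String url : urls) {
      try {
        crawler.crawl(url);
      } catch (CrawlException e) {
        if (e.isRecoverable()) {
          // Recoverable: remember the URL so the request can be repeated later.
          retryLater.add(url);
        }
        // Non-recoverable: skip this URL and continue with the next one.
      }
    }
    return retryLater;
  }
}
```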

fetch

public void fetch(java.lang.String url,
                  Record crawledRecord,
                  WebCrawlingContext context)
           throws WebCrawlerException
Description copied from interface: Fetcher
invoked by WebFetcherWorker to get the content of a resource for which the crawler did not already attach the content.

Please note: the crawledRecord will already have been mapped.

Specified by:
fetch in interface Fetcher
Parameters:
url - the url to fetch into the record
Throws:
WebCrawlerException - if the resource cannot be fetched. If the error is recoverable, the request should be retried later; otherwise the record should be skipped by the crawler worker.

getContent

public java.io.InputStream getContent(Record crawledRecord,
                                      TaskContext taskContext)
                               throws ImportingException
get a stream on a content object. Make sure to close the stream after usage.

Please note: a mapped record is expected here (at least the URL must be mapped).

Specified by:
getContent in interface ContentFetcher
Parameters:
crawledRecord - a crawled record describing the content object.
taskContext - the TaskContext containing job parameters and more
Returns:
content stream
Throws:
ImportingException - error accessing the content object.
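
Because the caller is responsible for closing the returned stream, try-with-resources is a natural fit. The sketch below assumes a hypothetical StreamSource interface in place of getContent(crawledRecord, taskContext), and ContentReader is likewise an illustrative name; only java.io types are real.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a call such as getContent(crawledRecord, taskContext).
interface StreamSource {
  InputStream open() throws IOException;
}

final class ContentReader {
  private ContentReader() {
  }

  /** Reads the whole stream into memory and closes it, even if reading fails. */
  static byte[] readAll(StreamSource source) throws IOException {
    // try-with-resources closes the stream on both normal and exceptional exit.
    try (InputStream in = source.open()) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    }
  }
}
```

Buffering the full content in memory is only reasonable for small resources; for large ones the stream would be passed on and closed by whoever consumes it.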

setVisitedLinks

public void setVisitedLinks(VisitedLinksService visitedLinks)
DS service reference injection method.


unsetVisitedLinks

public void unsetVisitedLinks(VisitedLinksService visitedLinks)
DS service reference removal method.


setLinkFilter

public void setLinkFilter(LinkFilter linkFilter)
DS service reference injection method.


unsetLinkFilter

public void unsetLinkFilter(LinkFilter linkFilter)
DS service reference removal method.
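
The set.../unset... pairs above follow the bind/unbind convention of OSGi Declarative Services (DS). A common way to write such a pair is sketched below with stand-in types: LinkCheck, FetcherComponent and isAllowed are hypothetical, and both the identity check in the unset method and the accept-everything fallback are assumptions, not confirmed DefaultFetcher behavior.

```java
// Hypothetical stand-in for an injected service such as LinkFilter.
interface LinkCheck {
  boolean accept(String url);
}

final class FetcherComponent {
  // volatile: DS may call the bind/unbind methods from a different thread than the workers.
  private volatile LinkCheck linkCheck;

  /** Bind method: called by the DS runtime when the service becomes available. */
  void setLinkCheck(LinkCheck service) {
    this.linkCheck = service;
  }

  /** Unbind method: clears the reference only if it still points to the departing service. */
  void unsetLinkCheck(LinkCheck service) {
    if (this.linkCheck == service) {
      this.linkCheck = null;
    }
  }

  /** Assumed fallback: without a bound service every URL is accepted. */
  boolean isAllowed(String url) {
    LinkCheck current = linkCheck; // read once to avoid a race with unsetLinkCheck
    return current == null || current.accept(url);
  }
}
```

The identity check guards against a late unbind call for an old service instance removing a newer one that has already been bound.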

