DefaultFetcher (SMILA 1.2 API documentation)

java.lang.Object
- org.eclipse.smila.importing.crawler.web.fetcher.DefaultFetcher

All Implemented Interfaces:

ContentFetcher, Fetcher
```
public class DefaultFetcher
extends java.lang.Object
implements Fetcher
```
Example implementation of a Fetcher service. It uses the HTTP GET method to access resources.
- During crawling, it copies the content-length, content-type and last-modified values from the HTTP header into record attributes, and attaches the content of resources reported as MIME type "text/html".
- During fetching, it simply attaches the content of any resource.
It does not (yet) support authentication. It is based on Apache HttpClient 4.1.
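
The difference between the crawling and fetching behavior described above can be sketched roughly as follows. This is not SMILA code: `FetchSketch`, `shouldAttachContent` and `headerAttributes` are hypothetical names, plain maps stand in for SMILA records, and both the lower-case header keys and the stripping of parameters such as `; charset=UTF-8` are assumptions made for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in illustrating the crawl/fetch rules; not part of SMILA. */
final class FetchSketch {

  /** Crawling attaches content only for "text/html"; fetching attaches any content. */
  static boolean shouldAttachContent(boolean crawling, String contentType) {
    if (!crawling) {
      return true;
    }
    if (contentType == null) {
      return false;
    }
    // Assumption: header parameters (e.g. "; charset=UTF-8") are ignored when comparing.
    String mimeType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    return mimeType.equals("text/html");
  }

  /** Copies the three header values named in the class description, skipping absent ones. */
  static Map<String, String> headerAttributes(Map<String, String> headers) {
    Map<String, String> attributes = new LinkedHashMap<>();
    for (String name : new String[] {"content-length", "content-type", "last-modified"}) {
      String value = headers.get(name); // assumes lower-case header keys
      if (value != null) {
        attributes.put(name, value);
      }
    }
    return attributes;
  }
}
```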

Constructor Summary

Constructors
Constructor and Description
`DefaultFetcher()` initialize HttpClient with disabled redirects.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`crawl(java.lang.String url, Record linkRecord, WebCrawlingContext context)` invoked by WebCrawlerWorker to resolve the URL in an input record.
`void`	`fetch(java.lang.String url, Record crawledRecord, WebCrawlingContext context)` invoked by WebFetcherWorker to get the content of a resource for which the crawler did not already attach the content.
`java.io.InputStream`	`getContent(Record crawledRecord, TaskContext taskContext)` get a stream on a content object.
`void`	`setJobRunDataProvider(JobRunDataProvider jobRunDataProvider)` DS service reference injection method.
`void`	`setLinkFilter(LinkFilter linkFilter)` DS service reference injection method.
`void`	`setVisitedLinks(VisitedLinksService visitedLinks)` DS service reference injection method.
`void`	`unsetJobRunDataProvider(JobRunDataProvider jobRunDataProvider)` DS service reference removal method.
`void`	`unsetLinkFilter(LinkFilter linkFilter)` DS service reference removal method.
`void`	`unsetVisitedLinks(VisitedLinksService visitedLinks)` DS service reference removal method.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - DefaultFetcher
```
public DefaultFetcher()
```
    initialize HttpClient with disabled redirects.
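
    What "disabled redirects" means for a client can be illustrated with the JDK's own `java.net.http.HttpClient` (Java 11 or later). This is a substitute chosen to keep the sketch self-contained; it is not the Apache HttpClient 4.1 configuration used by this class, and `NoRedirectClientSketch` is a hypothetical name.

```java
import java.net.http.HttpClient;

/** Hypothetical illustration using the JDK client, not Apache HttpClient 4.1. */
final class NoRedirectClientSketch {

  /** Builds a client that returns 3xx responses to the caller instead of following them. */
  static HttpClient create() {
    return HttpClient.newBuilder()
        .followRedirects(HttpClient.Redirect.NEVER)
        .build();
  }
}
```

    With such a client, a redirect response reaches the calling code unchanged, so the caller decides what to do with the target location.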
- Method Detail
  - crawl
```
public void crawl(java.lang.String url,
         Record linkRecord,
         WebCrawlingContext context)
           throws WebCrawlerException
```
    Description copied from interface: Fetcher
    
    invoked by WebCrawlerWorker to resolve the URL in an input record. Must write metadata from the HTTP header to attributes and attach the content of resources that can be used for link extraction.
    
    Specified by:
    
    crawl in interface Fetcher
    
    Parameters:
    url - the url to crawl
    linkRecord - record containing the URL and maybe additional information necessary to access the web resource.
    
    Throws:
    
    WebCrawlerException - if resource cannot be crawled. If recoverable the request should be retried later, else the record should be skipped by the crawler worker.
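
    The retry-or-skip rule stated for the exception can be sketched from the caller's side. Everything here is a hypothetical stand-in: `SketchCrawlException` and its `isRecoverable()` accessor are assumptions made for the illustration and are not taken from the SMILA API.

```java
/** Hypothetical stand-in for an exception carrying a recoverable flag; not the SMILA class. */
final class SketchCrawlException extends Exception {
  private final boolean recoverable;

  SketchCrawlException(String message, boolean recoverable) {
    super(message);
    this.recoverable = recoverable;
  }

  boolean isRecoverable() {
    return recoverable;
  }
}

/** Hypothetical caller-side handling of a crawl attempt. */
final class CrawlErrorHandlingSketch {

  enum Outcome { CRAWLED, RETRY_LATER, SKIPPED }

  /** A crawl attempt that may fail with the stand-in exception. */
  interface CrawlCall {
    void run() throws SketchCrawlException;
  }

  /** Recoverable failures are retried later; all other failures cause the record to be skipped. */
  static Outcome handle(CrawlCall call) {
    try {
      call.run();
      return Outcome.CRAWLED;
    } catch (SketchCrawlException e) {
      return e.isRecoverable() ? Outcome.RETRY_LATER : Outcome.SKIPPED;
    }
  }
}
```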
  - fetch
```
public void fetch(java.lang.String url,
         Record crawledRecord,
         WebCrawlingContext context)
           throws WebCrawlerException
```
    Description copied from interface: Fetcher
    
    invoked by WebFetcherWorker to get the content of a resource for which the crawler did not already attach the content.
    Please note: the crawledRecord will already have been mapped.
    
    Specified by:
    
    fetch in interface Fetcher
    
    Parameters:
    url - the url to fetch into the record
    
    Throws:
    
    WebCrawlerException - if resource cannot be fetched. If recoverable the request should be retried later, else the record should be skipped by the crawler worker.
  - getContent
```
public java.io.InputStream getContent(Record crawledRecord,
                             TaskContext taskContext)
                               throws ImportingException
```
    get a stream on a content object. Make sure to close the stream after usage.
    Please note: a mapped record (at least URL must be mapped) is expected here!
    
    Specified by:
    
    getContent in interface ContentFetcher
    
    Parameters:
    crawledRecord - a crawled record describing the content object.
    taskContext - the TaskContext containing job parameters and more
    
    Returns:
    content stream
    
    Throws:
    
    ImportingException - error accessing the content object.
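
    The advice to close the stream after use is typically followed with try-with-resources. The sketch below is a minimal illustration under stated assumptions: `ContentStreamSketch` is a hypothetical name, an in-memory stream stands in for the one returned by `getContent`, and `readAllBytes` requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing how a caller might consume and close a content stream. */
final class ContentStreamSketch {

  /** Reads the whole stream as UTF-8 text and closes it, even if reading fails. */
  static String readAndClose(InputStream content) {
    try (InputStream in = content) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Stand-in for a stream a fetcher might return; not produced by SMILA. */
  static InputStream sampleContent() {
    return new ByteArrayInputStream("<html></html>".getBytes(StandardCharsets.UTF_8));
  }
}
```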
  - setVisitedLinks
```
public void setVisitedLinks(VisitedLinksService visitedLinks)
```
    DS service reference injection method.
  - unsetVisitedLinks
```
public void unsetVisitedLinks(VisitedLinksService visitedLinks)
```
    DS service reference removal method.
  - setLinkFilter
```
public void setLinkFilter(LinkFilter linkFilter)
```
    DS service reference injection method.
  - unsetLinkFilter
```
public void unsetLinkFilter(LinkFilter linkFilter)
```
    DS service reference removal method.
  - setJobRunDataProvider
```
public void setJobRunDataProvider(JobRunDataProvider jobRunDataProvider)
```
    DS service reference injection method.
  - unsetJobRunDataProvider
```
public void unsetJobRunDataProvider(JobRunDataProvider jobRunDataProvider)
```
    DS service reference removal method.
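
    The paired set/unset methods above follow the usual bind/unbind shape of OSGi Declarative Services (DS) references. A minimal sketch of that shape, assuming the common idiom of clearing the field only when the removed service is the one currently held; `ServiceBindingSketch`, `LinkFilterStandIn` and `hasLinkFilter` are hypothetical names and do not reflect the actual implementation of this class:

```java
/** Hypothetical component showing the bind/unbind pattern; not the SMILA implementation. */
final class ServiceBindingSketch {

  /** Stand-in for an injected service type. */
  interface LinkFilterStandIn { }

  // volatile so that a reference bound on one thread is visible to worker threads
  private volatile LinkFilterStandIn linkFilter;

  /** Injection method: called when the service becomes available. */
  void setLinkFilter(LinkFilterStandIn filter) {
    this.linkFilter = filter;
  }

  /** Removal method: clears the field only if it still holds the removed service. */
  void unsetLinkFilter(LinkFilterStandIn filter) {
    if (this.linkFilter == filter) {
      this.linkFilter = null;
    }
  }

  boolean hasLinkFilter() {
    return linkFilter != null;
  }
}
```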
