Fetcher (SMILA 1.0 API documentation)

org.eclipse.smila.importing.crawler.web
Interface Fetcher

All Superinterfaces: ContentFetcher

All Known Implementing Classes: DefaultFetcher

public interface Fetcher
extends ContentFetcher

Interface for the Fetcher service of the WebCrawlerWorker and WebFetcherWorker. The fetcher is responsible for getting the metadata and content of a web resource.

Author: scum36

Method Summary
`void`	`crawl(java.lang.String url, Record linkRecord, WebCrawlingContext context)` invoked by WebCrawlerWorker to resolve the URL in an input record.
`void`	`fetch(java.lang.String url, Record crawledRecord, WebCrawlingContext context)` invoked by WebFetcherWorker to get the content of a resource for which the crawler did not already attach the content.

Methods inherited from interface org.eclipse.smila.importing.ContentFetcher
`getContent`

Method Detail

crawl

void crawl(java.lang.String url,
           Record linkRecord,
           WebCrawlingContext context)
           throws WebCrawlerException

Invoked by WebCrawlerWorker to resolve the URL in an input record. Must write metadata from the HTTP header to attributes and attach the content of resources that can be used for link extraction.

Parameters: url - the URL to crawl; linkRecord - record containing the URL and possibly additional information necessary to access the web resource.
Throws: WebCrawlerException - if the resource cannot be crawled. If the error is recoverable, the request should be retried later; otherwise the record should be skipped by the crawler worker.
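The retry-or-skip rule for a failed crawl can be sketched as follows. This is only an illustration under stated assumptions: `SketchCrawlerException`, its `isRecoverable()` flag, `SketchFetcher`, and `Outcome` are hypothetical stand-ins defined here so that the example compiles on its own. They are not the SMILA types, and the record is reduced to a plain attribute map.

```java
import java.util.HashMap;
import java.util.Map;

public class CrawlErrorHandlingSketch {

  /** Hypothetical stand-in for WebCrawlerException; the recoverable flag is an assumption. */
  static class SketchCrawlerException extends Exception {
    private final boolean recoverable;

    SketchCrawlerException(String message, boolean recoverable) {
      super(message);
      this.recoverable = recoverable;
    }

    boolean isRecoverable() {
      return recoverable;
    }
  }

  /** Hypothetical stand-in for the crawl method; the record is reduced to an attribute map. */
  interface SketchFetcher {
    void crawl(String url, Map<String, Object> linkRecord) throws SketchCrawlerException;
  }

  /** What the calling worker decides to do with the record. */
  enum Outcome { CRAWLED, RETRY_LATER, SKIPPED }

  /** Crawls one URL and maps a failure to "retry later" or "skip this record". */
  static Outcome crawlOne(SketchFetcher fetcher, String url, Map<String, Object> linkRecord) {
    try {
      fetcher.crawl(url, linkRecord);
      return Outcome.CRAWLED;
    } catch (SketchCrawlerException e) {
      // Recoverable errors are retried later; all others cause the record to be skipped.
      return e.isRecoverable() ? Outcome.RETRY_LATER : Outcome.SKIPPED;
    }
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    SketchFetcher timingOut = (url, rec) -> {
      throw new SketchCrawlerException("connection timed out", true);
    };
    System.out.println(crawlOne(timingOut, "http://example.org/", record));
  }
}
```

Returning an outcome value instead of rethrowing keeps the decision in one place; a real worker might instead propagate the exception to its task framework.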

fetch

void fetch(java.lang.String url,
           Record crawledRecord,
           WebCrawlingContext context)
           throws WebCrawlerException

invoked by WebFetcherWorker to get the content of a resource for which the crawler did not already attach the content.

Please note: the crawledRecord will already have been mapped.

Parameters: url - the URL to fetch into the record; crawledRecord -
Throws: WebCrawlerException - if the resource cannot be fetched. If the error is recoverable, the request should be retried later; otherwise the record should be skipped by the crawler worker.
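The condition under which fetch is needed (the crawler did not already attach the content) can be sketched as below. This is a minimal sketch, not the SMILA implementation: `SketchRecord`, `SketchFetcher`, `ensureContent`, and the attachment name `"content"` are hypothetical and exist only to make the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class FetchIfMissingSketch {

  /** Hypothetical attachment name; the name used by SMILA is not stated in this page. */
  static final String CONTENT_ATTACHMENT = "content";

  /** Hypothetical stand-in for a record: attributes plus binary attachments. */
  static class SketchRecord {
    final Map<String, Object> attributes = new HashMap<>();
    final Map<String, byte[]> attachments = new HashMap<>();

    boolean hasAttachment(String name) {
      return attachments.containsKey(name);
    }
  }

  /** Hypothetical stand-in for the fetch method of the interface. */
  interface SketchFetcher {
    void fetch(String url, SketchRecord crawledRecord) throws Exception;
  }

  /**
   * Calls the fetcher only if the crawler did not already attach the content.
   * Returns true if a fetch was performed.
   */
  static boolean ensureContent(SketchFetcher fetcher, String url, SketchRecord record)
      throws Exception {
    if (record.hasAttachment(CONTENT_ATTACHMENT)) {
      return false; // content was attached during crawling, nothing to do
    }
    fetcher.fetch(url, record);
    return true;
  }

  public static void main(String[] args) throws Exception {
    // A fake fetcher that attaches fixed bytes instead of doing an HTTP request.
    SketchFetcher fake = (url, rec) ->
        rec.attachments.put(CONTENT_ATTACHMENT, "<html/>".getBytes(StandardCharsets.UTF_8));
    SketchRecord record = new SketchRecord();
    System.out.println(ensureContent(fake, "http://example.org/", record));
    System.out.println(ensureContent(fake, "http://example.org/", record));
  }
}
```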