SMILA 1.0 API documentation

org.eclipse.smila.importing.crawler.web
Class WebExtractorWorker

java.lang.Object
  extended by org.eclipse.smila.importing.compounds.ExtractorWorkerBase
      extended by org.eclipse.smila.importing.crawler.web.WebExtractorWorker
All Implemented Interfaces:
Worker

public class WebExtractorWorker
extends ExtractorWorkerBase

Compound extractor worker to use in web crawling workflows.


Field Summary
static java.lang.String NAME
          Name of the worker.
 
Constructor Summary
WebExtractorWorker()
           
 
Method Summary
protected  Record convertRecord(Record compoundRecord, Record extractedRecord, TaskContext taskContext)
          Creates a record from the extracted record that conforms to the records produced by the matching crawler.
protected  boolean filterRecord(Record record, TaskContext taskContext)
          Applies filters to extracted records: urlPatterns (applied to the name of the extracted file).
protected  ContentFetcher getContentFetcher()
          Gets a content fetcher for the data source type.
 java.lang.String getName()
          Returns the name of the worker.
protected  java.util.Iterator<Record> invokeExtractor(CompoundExtractor extractor, Record compoundRecord, java.io.InputStream compoundContent, TaskContext taskContext)
          Invokes the extractor with data from the crawled record.
protected  void mapRecord(Record record, TaskContext taskContext)
          Hook for subclasses to support mapping of the converted record according to mapping rules.
 void setFetcher(Fetcher fetcher)
          OSGi Declarative Services (DS) reference injection method.
 void unsetFetcher(Fetcher fetcher)
          OSGi Declarative Services (DS) reference removal method.
 
Methods inherited from class org.eclipse.smila.importing.compounds.ExtractorWorkerBase
concatAttributeValues, copyAttachment, copyAttribute, copyCompoundAttributes, copySetToStringAttribute, perform, setCompoundExtractor, unsetCompoundExtractor
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NAME

public static final java.lang.String NAME
Name of the worker.

See Also:
Constant Field Values
Constructor Detail

WebExtractorWorker

public WebExtractorWorker()
Method Detail

getName

public java.lang.String getName()

Returns:
the name of the worker. The worker function will be executed for tasks tied to this worker name.

invokeExtractor

protected java.util.Iterator<Record> invokeExtractor(CompoundExtractor extractor,
                                                     Record compoundRecord,
                                                     java.io.InputStream compoundContent,
                                                     TaskContext taskContext)
                                              throws CompoundExtractorException
Invokes the extractor with data from the crawled record.

Specified by:
invokeExtractor in class ExtractorWorkerBase
Throws:
CompoundExtractorException

convertRecord

protected Record convertRecord(Record compoundRecord,
                               Record extractedRecord,
                               TaskContext taskContext)
Creates a record from the extracted record that conforms to the records produced by the matching crawler.

Specified by:
convertRecord in class ExtractorWorkerBase

filterRecord

protected boolean filterRecord(Record record,
                               TaskContext taskContext)
Filters applied to extracted records: urlPatterns (applied to the name of the extracted file).

Overrides:
filterRecord in class ExtractorWorkerBase
Parameters:
record - the record to check
taskContext - the task context containing the task parameters
Returns:
true if the record passes the filter(s), false if not.
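A minimal sketch of how a URL pattern filter of this kind might decide whether an extracted file name passes. It is an illustration under stated assumptions, not the SMILA implementation: the class UrlPatternFilterSketch, the split into include and exclude lists, and the use of full-string regex matching are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a URL pattern filter; not a SMILA class.
final class UrlPatternFilterSketch {
  private final List<Pattern> includes = new ArrayList<>();
  private final List<Pattern> excludes = new ArrayList<>();

  // Assumption: patterns are Java regular expressions.
  UrlPatternFilterSketch(List<String> includeRegexes, List<String> excludeRegexes) {
    for (String regex : includeRegexes) {
      includes.add(Pattern.compile(regex));
    }
    for (String regex : excludeRegexes) {
      excludes.add(Pattern.compile(regex));
    }
  }

  // Returns true if the name passes the filter, false if not.
  boolean accepts(String name) {
    // Assumption: an exclude match always rejects the name.
    for (Pattern exclude : excludes) {
      if (exclude.matcher(name).matches()) {
        return false;
      }
    }
    // Assumption: with no include patterns, everything not excluded passes.
    if (includes.isEmpty()) {
      return true;
    }
    for (Pattern include : includes) {
      if (include.matcher(name).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    UrlPatternFilterSketch filter = new UrlPatternFilterSketch(
        Arrays.asList(".*\\.html"), Arrays.asList(".*/private/.*"));
    System.out.println(filter.accepts("docs/index.html"));         // true
    System.out.println(filter.accepts("site/private/index.html")); // false
    System.out.println(filter.accepts("docs/logo.png"));           // false
  }
}
```

Checking excludes first keeps the decision simple: a name is rejected as soon as any exclude pattern matches, regardless of the include list.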

mapRecord

protected void mapRecord(Record record,
                         TaskContext taskContext)
Hook for subclasses to support mapping of the converted record according to mapping rules.

Overrides:
mapRecord in class ExtractorWorkerBase
Parameters:
record - the Record
taskContext - the TaskContext
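One way to picture a mapping hook like this is as a rename of record attributes according to a rule table. The sketch below is only an assumption about what "mapping rules" could look like: it models a record as a plain Map and the rules as source-name to target-name pairs. RecordMappingSketch and its method are hypothetical and do not use the SMILA Record or TaskContext types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of attribute mapping; not a SMILA class.
final class RecordMappingSketch {

  // Renames attributes in place: each rule maps a source name to a target name.
  static void mapRecord(Map<String, Object> record, Map<String, String> rules) {
    // Collect renamed values first so that one rule cannot feed another.
    Map<String, Object> renamed = new LinkedHashMap<>();
    for (Map.Entry<String, String> rule : rules.entrySet()) {
      if (record.containsKey(rule.getKey())) {
        renamed.put(rule.getValue(), record.remove(rule.getKey()));
      }
    }
    record.putAll(renamed);
  }

  public static void main(String[] args) {
    Map<String, Object> record = new LinkedHashMap<>();
    record.put("fileName", "docs/index.html");
    record.put("size", 1024);

    Map<String, String> rules = new LinkedHashMap<>();
    rules.put("fileName", "Url");

    mapRecord(record, rules);
    System.out.println(record); // {size=1024, Url=docs/index.html}
  }
}
```

Attributes without a matching rule are left untouched, which mirrors the idea of a hook that adjusts a converted record rather than rebuilding it.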

getContentFetcher

protected ContentFetcher getContentFetcher()
Gets a content fetcher for the data source type.

Specified by:
getContentFetcher in class ExtractorWorkerBase

setFetcher

public void setFetcher(Fetcher fetcher)
OSGi Declarative Services (DS) reference injection method.


unsetFetcher

public void unsetFetcher(Fetcher fetcher)
OSGi Declarative Services (DS) reference removal method.
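
The setFetcher/unsetFetcher pair follows the usual bind/unbind shape for an injected service reference. The sketch below shows that general idiom with hypothetical stand-in types (FetcherSketch, FetcherHolderSketch); it makes no claim about how WebExtractorWorker stores its reference internally.

```java
// Hypothetical stand-in for the injected service type; not the SMILA Fetcher.
interface FetcherSketch {
  String fetch(String url);
}

// Hypothetical component holding a dynamically bound service reference.
final class FetcherHolderSketch {
  // volatile: the framework may bind and unbind from a different thread.
  private volatile FetcherSketch fetcher;

  // Bind method: called when the service becomes available.
  public void setFetcher(FetcherSketch fetcher) {
    this.fetcher = fetcher;
  }

  // Unbind method: clear the reference only if it is the instance being removed,
  // so that a late unbind of an old service does not drop a newer one.
  public void unsetFetcher(FetcherSketch fetcher) {
    if (this.fetcher == fetcher) {
      this.fetcher = null;
    }
  }

  FetcherSketch getFetcher() {
    return fetcher;
  }

  public static void main(String[] args) {
    FetcherHolderSketch holder = new FetcherHolderSketch();
    FetcherSketch first = url -> "first:" + url;
    FetcherSketch second = url -> "second:" + url;

    holder.setFetcher(first);
    holder.setFetcher(second);
    holder.unsetFetcher(first); // ignored, "second" is the current reference
    System.out.println(holder.getFetcher().fetch("x")); // second:x
    holder.unsetFetcher(second);
    System.out.println(holder.getFetcher()); // null
  }
}
```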

