SMILA 1.0 API documentation

org.eclipse.smila.processing.pipelets
Class HtmlToTextPipelet

java.lang.Object
  extended by org.eclipse.smila.processing.pipelets.ATransformationPipelet
      extended by org.eclipse.smila.processing.pipelets.HtmlToTextPipelet
All Implemented Interfaces:
Pipelet

public class HtmlToTextPipelet
extends ATransformationPipelet

Simple HTML-to-text extractor pipelet that uses the NekoHTML parser.

Author:
jschumacher

Nested Class Summary
 class HtmlToTextPipelet.CommentRemover
          Removes comments from HTML files.
 class HtmlToTextPipelet.MetadataExtractor
          Extracts metadata from META tags.
 class HtmlToTextPipelet.PlainTextWriter
          Appends plain text from the document to a string builder.
 
Field Summary
 
Fields inherited from class org.eclipse.smila.processing.pipelets.ATransformationPipelet
_config, ENCODING_ATTACHMENT, PROP_INPUT_NAME, PROP_INPUT_TYPE, PROP_OUTPUT_NAME, PROP_OUTPUT_TYPE
 
Constructor Summary
HtmlToTextPipelet()
           
 
Method Summary
 void configure(AnyMap configuration)
          Sets the configuration of the pipelet. Called once after instantiation, before the pipelet is actually used in a workflow.
protected  java.lang.String getDefaultEncoding(ParameterAccessor paramAccessor)
           
protected  java.lang.String[] getRemoveContentTags(ParameterAccessor paramAccessor)
           
 java.lang.String[] process(Blackboard blackboard, java.lang.String[] recordIds)
          Processes the given records.
 
Methods inherited from class org.eclipse.smila.processing.pipelets.ATransformationPipelet
getInputName, getInputType, getOutputName, getOutputType, isReadFromAttribute, isStoreInAttribute, readInput, readStringInput, storeResult, storeResult, storeResults
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlToTextPipelet

public HtmlToTextPipelet()
Method Detail

getDefaultEncoding

protected java.lang.String getDefaultEncoding(ParameterAccessor paramAccessor)
                                       throws MissingParameterException
Returns:
the default encoding parameter.
Throws:
MissingParameterException

getRemoveContentTags

protected final java.lang.String[] getRemoveContentTags(ParameterAccessor paramAccessor)
                                                 throws MissingParameterException
Returns:
the names of the tags whose complete content is removed from the result.
Throws:
MissingParameterException
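Example (illustrative only): the following JDK-only sketch shows what removing the "complete content" of a tag means in practice, as opposed to stripping only the markup. It is not the pipelet's implementation, which is based on the NekoHTML parser; regular expressions stand in for the parser here, and the class name RemoveContentTagsSketch and the method toPlainText are hypothetical names introduced for this illustration.

```java
import java.util.regex.Pattern;

public class RemoveContentTagsSketch {

  /**
   * Strips markup from the given HTML. Elements whose names are listed in
   * removeContentTags are dropped together with everything between their
   * start and end tag; for all other elements only the tags are removed.
   * Regex-based simplification, not a real HTML parser.
   */
  public static String toPlainText(String html, String[] removeContentTags) {
    String text = html;
    for (String tag : removeContentTags) {
      // (?is): case-insensitive, and '.' also matches line breaks.
      String element = "(?is)<" + Pattern.quote(tag) + "\\b[^>]*>.*?</"
          + Pattern.quote(tag) + "\\s*>";
      text = text.replaceAll(element, " ");
    }
    // Drop comments, then every remaining tag, keeping the text in between.
    text = text.replaceAll("(?s)<!--.*?-->", " ");
    text = text.replaceAll("<[^>]+>", " ");
    // Collapse runs of whitespace left behind by the removed markup.
    return text.replaceAll("\\s+", " ").trim();
  }

  public static void main(String[] args) {
    String html = "<html><head><style>p { color: red; }</style></head>"
        + "<body><p>Hello <b>world</b></p><script>alert('x');</script></body></html>";
    // prints "Hello world"
    System.out.println(toPlainText(html, new String[] {"script", "style"}));
  }
}
```

With an empty tag list, the text inside script and style elements would remain in the output. That is why such tags are typical candidates for this parameter.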

configure

public void configure(AnyMap configuration)
               throws ProcessingException
Sets the configuration of the pipelet. Called once after instantiation, before the pipelet is actually used in a workflow. Note: additionally configures the meta attribute mapping (which is not applicable via the parameter accessor).

Specified by:
configure in interface Pipelet
Overrides:
configure in class ATransformationPipelet
Parameters:
configuration - configuration of pipelet.
Throws:
ProcessingException - if the configuration is not applicable to the pipelet (missing properties, wrong data types)

process

public java.lang.String[] process(Blackboard blackboard,
                                  java.lang.String[] recordIds)
                           throws ProcessingException
Processes the given records.

Parameters:
blackboard - Blackboard holding and managing the records.
recordIds - IDs of the records to process.
Returns:
IDs of the records to be passed to the next pipelet. By default these should be the same as the recordIds passed in, unless there is a specific (business logic) reason not to do so.
Throws:
ProcessingException - error during processing.
