SMILA (incubation) API documentation

org.eclipse.smila.connectivity.framework.crawler.web
Class IndexDocument

java.lang.Object
  extended by org.eclipse.smila.connectivity.framework.crawler.web.IndexDocument
All Implemented Interfaces:
java.io.Serializable

public class IndexDocument
extends java.lang.Object
implements java.io.Serializable

This class contains all data from a web page that is relevant for indexing.

See Also:
Serialized Form

Constructor Summary
IndexDocument(java.lang.String url, java.lang.String title, byte[] content, java.util.List<java.lang.String> responseHeaders, java.util.List<java.lang.String> htmlMetaData, java.util.List<java.lang.String> metaDataWithResponseHeaderFallBack)
          Constructor.
 
Method Summary
 java.lang.String extractFromResponseHeaders(java.util.regex.Pattern pattern, int group)
          Extracts a value from the response headers using a regular expression.
 byte[] getContent()
          Returns content of the downloaded document.
 java.util.List<java.lang.String> getHtmlMetaData()
          Returns the list of HTML meta data extracted from HTML meta tags.
 java.util.List<java.lang.String> getMetaDataWithResponseHeaderFallBack()
          Returns combination of response headers and HTML meta data.
 java.util.List<java.lang.String> getResponseHeaders()
          Returns response headers.
 java.lang.String getTitle()
          Returns title of the web page.
 java.lang.String getUrl()
          Returns URL of the page.
 void setContent(byte[] content)
          Assigns content of the web page, as a byte array, to the index document.
 void setHtmlMetaData(java.util.List<java.lang.String> metaData)
          Assigns HTML meta data to the index document.
 void setMetaDataWithResponseHeaderFallBack(java.util.List<java.lang.String> metaDataWithResponseHeaderFallBack)
          Assigns combination of response headers and HTML meta data to the index document.
 void setResponseHeaders(java.util.List<java.lang.String> headers)
          Assigns response headers to the index document.
 void setTitle(java.lang.String title)
          Assigns title of the page.
 void setUrl(java.lang.String url)
          Assigns URL of the page.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

IndexDocument

public IndexDocument(java.lang.String url,
                     java.lang.String title,
                     byte[] content,
                     java.util.List<java.lang.String> responseHeaders,
                     java.util.List<java.lang.String> htmlMetaData,
                     java.util.List<java.lang.String> metaDataWithResponseHeaderFallBack)
Constructor.

Parameters:
url - URL of the web page
title - title of the web page
content - extracted content
responseHeaders - list of response headers
htmlMetaData - list of extracted HTML meta data
metaDataWithResponseHeaderFallBack - responseHeaders and htmlMetaData merged together
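A minimal sketch of how the six constructor arguments fit together. The class IndexDocumentSketch is a hypothetical stand-in that mirrors only the documented constructor and a few getters so the example compiles on its own; it is not the SMILA implementation. The "Name: value" string format of the list entries and the simple concatenation used for the merged list are assumptions, since this page does not specify either.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for IndexDocument; mirrors the documented constructor only.
class IndexDocumentSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String url;
  private final String title;
  private final byte[] content;
  private final List<String> responseHeaders;
  private final List<String> htmlMetaData;
  private final List<String> metaDataWithResponseHeaderFallBack;

  IndexDocumentSketch(String url, String title, byte[] content,
      List<String> responseHeaders, List<String> htmlMetaData,
      List<String> metaDataWithResponseHeaderFallBack) {
    this.url = url;
    this.title = title;
    this.content = content;
    this.responseHeaders = responseHeaders;
    this.htmlMetaData = htmlMetaData;
    this.metaDataWithResponseHeaderFallBack = metaDataWithResponseHeaderFallBack;
  }

  String getUrl() { return url; }
  String getTitle() { return title; }
  byte[] getContent() { return content; }
  List<String> getMetaDataWithResponseHeaderFallBack() {
    return metaDataWithResponseHeaderFallBack;
  }

  // Builds one document from illustrative values.
  static IndexDocumentSketch example() {
    // Assumption: each list entry is a "Name: value" string.
    List<String> headers = Arrays.asList("Content-Type: text/html; charset=UTF-8");
    List<String> meta = Arrays.asList("author: Jane Doe");

    // Assumption: the merged list is meta data followed by response headers.
    List<String> merged = new ArrayList<>(meta);
    merged.addAll(headers);

    byte[] content = "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8);
    return new IndexDocumentSketch("http://example.org/index.html", "Example page",
        content, headers, meta, merged);
  }
}
```

With the real class, the same argument order applies to new IndexDocument(...).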
Method Detail

getContent

public byte[] getContent()
Returns content of the downloaded document.

Returns:
byte[]

setContent

public void setContent(byte[] content)
Assigns content of the web page, as a byte array, to the index document.

Parameters:
content - the content as a byte array
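Because the content is exchanged as byte[], callers that work with text have to convert explicitly. The helper below is a small sketch of that conversion; the class name ContentBytesSketch is hypothetical, and UTF-8 is an assumption, since this page does not say which character encoding the crawler uses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of SMILA.
class ContentBytesSketch {
  // Encodes text into the byte[] form accepted by setContent(byte[]).
  static byte[] encode(String text) {
    return text.getBytes(StandardCharsets.UTF_8);
  }

  // Decodes a byte[] such as the one returned by getContent() back into text.
  // Assumption: the bytes are UTF-8 encoded.
  static String decode(byte[] content) {
    return new String(content, StandardCharsets.UTF_8);
  }
}
```

In practice the charset could be taken from the Content-Type response header rather than hard-coded.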

getTitle

public java.lang.String getTitle()
Returns title of the web page.

Returns:
String

setTitle

public void setTitle(java.lang.String title)
Assigns title of the page.

Parameters:
title - String

getUrl

public java.lang.String getUrl()
Returns URL of the page.

Returns:
String

setUrl

public void setUrl(java.lang.String url)
Assigns URL of the page.

Parameters:
url - String

getHtmlMetaData

public java.util.List<java.lang.String> getHtmlMetaData()
Returns the list of HTML meta data extracted from HTML meta tags.

Returns:
List of strings

setHtmlMetaData

public void setHtmlMetaData(java.util.List<java.lang.String> metaData)
Assigns HTML meta data to the index document.

Parameters:
metaData - list of HTML meta data strings

getResponseHeaders

public java.util.List<java.lang.String> getResponseHeaders()
Returns response headers.

Returns:
list of response header strings

setResponseHeaders

public void setResponseHeaders(java.util.List<java.lang.String> headers)
Assigns response headers to the index document.

Parameters:
headers - list of response header strings

getMetaDataWithResponseHeaderFallBack

public java.util.List<java.lang.String> getMetaDataWithResponseHeaderFallBack()
Returns combination of response headers and HTML meta data.

Returns:
list combining response headers and HTML meta data

setMetaDataWithResponseHeaderFallBack

public void setMetaDataWithResponseHeaderFallBack(java.util.List<java.lang.String> metaDataWithResponseHeaderFallBack)
Assigns combination of response headers and HTML meta data to the index document.

Parameters:
metaDataWithResponseHeaderFallBack - list combining response headers and HTML meta data

extractFromResponseHeaders

public java.lang.String extractFromResponseHeaders(java.util.regex.Pattern pattern,
                                                   int group)
Extracts a value from the response headers. The pattern is tested against each response header until one matches; the value of the requested group from that match is then returned.

Parameters:
pattern - a regular expression
group - index of group in regular expression to return
Returns:
the value of the requested group (for example a MIME type), if a matching header could be found.
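The matching behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the SMILA source: the class HeaderExtractSketch and its methods are hypothetical, the headers are assumed to be "Name: value" strings, Matcher.find() is used where the real method might use matches(), and returning null when nothing matches is a guess.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the documented behaviour; not the SMILA implementation.
class HeaderExtractSketch {
  // Tests the pattern against each header in order and returns the requested
  // group of the first match. Returns null if no header matches (assumption).
  static String extractFromHeaders(List<String> headers, Pattern pattern, int group) {
    for (String header : headers) {
      Matcher matcher = pattern.matcher(header);
      if (matcher.find()) { // assumption: partial match is sufficient
        return matcher.group(group);
      }
    }
    return null;
  }

  // Usage example: pull the MIME type out of a Content-Type header.
  static String exampleMimeType() {
    List<String> headers = Arrays.asList(
        "Server: Apache",
        "Content-Type: text/html; charset=UTF-8");
    // Group 1 captures everything after the header name up to ';' or whitespace.
    Pattern contentType = Pattern.compile("(?i)^Content-Type:\\s*([^;\\s]+)");
    return extractFromHeaders(headers, contentType, 1);
  }
}
```

The group index must refer to a capturing group that exists in the pattern; otherwise Matcher.group(int) throws an IndexOutOfBoundsException.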
