SMILA 1.0 API documentation

org.eclipse.smila.importing
Interface VisitedLinksService

All Known Implementing Classes:
ObjectStoreVisitedLinksService

public interface VisitedLinksService

Service interface for checking whether a link found during crawling has already been visited.


Method Summary
 void clearAll()
          Deletes all state information in the service about all data sources.
 void clearSource(java.lang.String sourceId)
          Deletes all state information in the service about the given data source.
 long countEntries(java.lang.String sourceId, boolean countExact)
          Returns the number of entries for the given source id.
 java.util.Collection<java.lang.String> getSourceIds()
          Gets the ids of all sources that currently have entries in the VisitedLinksService.
 boolean isVisited(java.lang.String sourceId, java.lang.String link, java.lang.String jobRunId, java.lang.String inputBulkId)
          Determines if the link was already visited for this sourceId or not.
 void markAsVisited(java.lang.String sourceId, java.lang.String link, java.lang.String jobRunId, java.lang.String inputBulkId)
          Marks the link as visited in the current crawl job run.
 

Method Detail

isVisited

boolean isVisited(java.lang.String sourceId,
                  java.lang.String link,
                  java.lang.String jobRunId,
                  java.lang.String inputBulkId)
                  throws VisitedLinksException
Determines if the link was already visited for this sourceId or not.

Parameters:
sourceId - the name of the data source that contains the link.
link - the link to check, e.g. a URL.
jobRunId - the current job run id in which the crawler is running.
inputBulkId - the id of the inputBulk from which the link to check originates.
Returns:
true if the link was already visited for this sourceId, false otherwise.
Throws:
VisitedLinksException

markAsVisited

void markAsVisited(java.lang.String sourceId,
                   java.lang.String link,
                   java.lang.String jobRunId,
                   java.lang.String inputBulkId)
                   throws VisitedLinksException
Marks the link as visited in the current crawl job run.

Parameters:
sourceId - the name of the data source that contains the link.
link - the link to mark, e.g. a URL.
jobRunId - the current job run id in which the crawler is running.
inputBulkId - the id of the inputBulk from which the link to mark originates.
Throws:
VisitedLinksException
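
The two methods above are naturally used together: check a link, and mark it only if it is new. The following is a minimal sketch of that pattern, not SMILA code. VisitedLinksSketch, InMemoryVisitedLinks and LinkFilter are hypothetical names introduced for illustration; the stand-in keys entries by sourceId and link only and ignores jobRunId and inputBulkId, which the real service may treat differently. The helper wraps checked exceptions in an unchecked one purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in mirroring two signatures of VisitedLinksService; not the SMILA type. */
interface VisitedLinksSketch {
  boolean isVisited(String sourceId, String link, String jobRunId, String inputBulkId) throws Exception;

  void markAsVisited(String sourceId, String link, String jobRunId, String inputBulkId) throws Exception;
}

/** Hypothetical in-memory implementation. Assumption: entries are keyed by sourceId and link only. */
class InMemoryVisitedLinks implements VisitedLinksSketch {
  private final Map<String, Set<String>> linksBySource = new HashMap<>();

  @Override
  public boolean isVisited(String sourceId, String link, String jobRunId, String inputBulkId) {
    Set<String> links = linksBySource.get(sourceId);
    return links != null && links.contains(link);
  }

  @Override
  public void markAsVisited(String sourceId, String link, String jobRunId, String inputBulkId) {
    linksBySource.computeIfAbsent(sourceId, id -> new HashSet<>()).add(link);
  }
}

/** Hypothetical helper showing the check-then-mark pattern. */
class LinkFilter {
  /** Returns the links not seen before for sourceId, marking each one as visited on the way. */
  static List<String> filterUnvisited(VisitedLinksSketch service, String sourceId, List<String> links,
      String jobRunId, String inputBulkId) {
    List<String> fresh = new ArrayList<>();
    try {
      for (String link : links) {
        if (!service.isVisited(sourceId, link, jobRunId, inputBulkId)) {
          service.markAsVisited(sourceId, link, jobRunId, inputBulkId);
          fresh.add(link);
        }
      }
    } catch (Exception e) {
      // Wrapped only to keep the sketch compact; real code would likely handle VisitedLinksException.
      throw new IllegalStateException("visited-links lookup failed", e);
    }
    return fresh;
  }
}
```

Note that the check and the mark are two separate calls, so this sketch is not safe against concurrent callers; whether the real service offers stronger guarantees is not stated on this page.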

clearSource

void clearSource(java.lang.String sourceId)
                 throws VisitedLinksException
Deletes all state information in the service about the given data source.

Parameters:
sourceId - data source name.
Throws:
VisitedLinksException

clearAll

void clearAll()
              throws VisitedLinksException
Deletes all state information in the service about all data sources.

Throws:
VisitedLinksException

getSourceIds

java.util.Collection<java.lang.String> getSourceIds()
                                                    throws VisitedLinksException
Gets the ids of all sources that currently have entries in the VisitedLinksService.

Throws:
VisitedLinksException
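
getSourceIds and clearSource can be combined to reset only some of the data sources, where clearAll would be too broad. The sketch below illustrates this under stated assumptions: SourceStateSketch, InMemorySourceState and SourceCleanup are hypothetical names, not SMILA types, and selecting sources by id prefix is just an example criterion.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in mirroring two signatures of VisitedLinksService; not the SMILA type. */
interface SourceStateSketch {
  Collection<String> getSourceIds() throws Exception;

  void clearSource(String sourceId) throws Exception;
}

/** Hypothetical in-memory implementation used only to make the sketch runnable. */
class InMemorySourceState implements SourceStateSketch {
  private final Map<String, Set<String>> linksBySource = new HashMap<>();

  /** Test helper (not part of the documented interface): records one link for a source. */
  void add(String sourceId, String link) {
    linksBySource.computeIfAbsent(sourceId, id -> new HashSet<>()).add(link);
  }

  @Override
  public Collection<String> getSourceIds() {
    // Return a copy so callers can clear sources while iterating over the result.
    return new ArrayList<>(linksBySource.keySet());
  }

  @Override
  public void clearSource(String sourceId) {
    linksBySource.remove(sourceId);
  }
}

/** Hypothetical helper: clears every source whose id starts with the given prefix. */
class SourceCleanup {
  static List<String> clearSourcesWithPrefix(SourceStateSketch service, String prefix) {
    List<String> cleared = new ArrayList<>();
    try {
      // Defensive copy: the page does not say whether the returned collection is a live view.
      for (String sourceId : new ArrayList<>(service.getSourceIds())) {
        if (sourceId.startsWith(prefix)) {
          service.clearSource(sourceId);
          cleared.add(sourceId);
        }
      }
    } catch (Exception e) {
      throw new IllegalStateException("clearing visited-links state failed", e);
    }
    return cleared;
  }
}
```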

countEntries

long countEntries(java.lang.String sourceId,
                  boolean countExact)
                  throws DeltaException
Counts the entries stored for the given data source.

Parameters:
sourceId - data source name.
countExact - set to true to get an exact result, which may take some time; otherwise the service may return only an estimated value.
Returns:
the number of entries for the given source id.
Throws:
DeltaException
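
Because an exact count may be slow, a caller might ask for the estimate first and request the exact value only for small sources. The following is a rough sketch of that idea, not SMILA code. EntryCounterSketch, FixedEntryCounter and EntryCountReport are hypothetical names, and the way the stand-in derives its estimate (rounding down to a multiple of 10) is invented purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in mirroring the countEntries signature; not the SMILA type. */
interface EntryCounterSketch {
  long countEntries(String sourceId, boolean countExact) throws Exception;
}

/** Hypothetical implementation with fixed counts; the estimate logic is an assumption. */
class FixedEntryCounter implements EntryCounterSketch {
  private final Map<String, Long> counts = new HashMap<>();
  int exactCalls = 0; // how often the (potentially slow) exact path was used

  void put(String sourceId, long count) {
    counts.put(sourceId, count);
  }

  @Override
  public long countEntries(String sourceId, boolean countExact) {
    long exact = counts.getOrDefault(sourceId, 0L);
    if (countExact) {
      exactCalls++;
      return exact;
    }
    return (exact / 10) * 10; // coarse estimate, for illustration only
  }
}

/** Hypothetical helper: estimate first, exact count only when the source looks small. */
class EntryCountReport {
  static long countForReport(EntryCounterSketch service, String sourceId, long exactThreshold) {
    try {
      long estimate = service.countEntries(sourceId, false);
      if (estimate <= exactThreshold) {
        return service.countEntries(sourceId, true);
      }
      return estimate;
    } catch (Exception e) {
      throw new IllegalStateException("counting entries failed", e);
    }
  }
}
```

With a threshold of 100, a source holding 7 entries is counted exactly, while one holding 1234 entries is reported with the stand-in's estimate of 1230.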
