SMILA 1.0 API documentation

org.eclipse.smila.importing
Interface DeltaService

All Known Implementing Classes:
ObjectStoreDeltaService

public interface DeltaService

Service interface for checking if a crawled record must be sent to the processing job.


Nested Class Summary
static class DeltaService.EntryId
          Identifier of an entry, as returned by getUnvisitedEntries(String, String).
 
Method Summary
 State checkState(java.lang.String sourceId, java.lang.String recordId, java.lang.String jobRunId, java.lang.String hashCode)
          Determine delta state of record identified by sourceId and recordId.
 State checkState(java.lang.String sourceId, java.lang.String recordId, java.lang.String compoundRecordId, java.lang.String jobRunId, java.lang.String hashCode)
          Determine delta state of record identified by sourceId and recordId, which may have been extracted from the compound identified by compoundRecordId.
 void clearAll()
          Delete all state information in the service about all data sources.
 void clearSource(java.lang.String sourceId)
          Delete all state information in the service about the given data source.
 long countEntries(java.lang.String sourceId, boolean countExact)
          Count the entries for the given source ID, either exactly or as an estimate.
 void deleteEntry(java.lang.String sourceId, DeltaService.EntryId entryId)
          Remove an entry, e.g. after the corresponding record has been deleted.
 java.util.Collection<java.lang.String> getShardPrefixes(java.lang.String sourceId)
          Get possible input values for getUnvisitedEntries(String, String).
 java.util.Collection<java.lang.String> getSourceIds()
          Get the IDs of all sources that currently have entries in the DeltaService.
 java.util.Collection<DeltaService.EntryId> getUnvisitedEntries(java.lang.String sourceAndShardPrefix, java.lang.String jobRunId)
          Get the record IDs in the given data source and shard that have not been visited in the given job run and therefore must be sent as deleted records to the target job.
 void markAsUpdated(java.lang.String sourceId, java.lang.String recordId, java.lang.String jobRunId, java.lang.String hashCode)
          Mark the record as visited in the current crawl job run.
 void markAsUpdated(java.lang.String sourceId, java.lang.String recordId, java.lang.String compoundRecordId, java.lang.String jobRunId, java.lang.String hashCode)
          Mark the record that was extracted from a compound as visited in the current crawl job run.
 void markCompoundElementsVisited(java.lang.String sourceId, java.lang.String compoundRecordId, java.lang.String jobRunId)
          Set jobRunId of all elements of the given compound record, because the compound itself has not changed.
 

Method Detail

checkState

State checkState(java.lang.String sourceId,
                 java.lang.String recordId,
                 java.lang.String jobRunId,
                 java.lang.String hashCode)
                 throws DeltaException
Determine the delta state of the record identified by sourceId and recordId. If the result is State.UPTODATE, the service has already marked the record as visited in the current crawl job run, so there is no need to call markAsUpdated(String, String, String, String) afterwards. In all other cases the crawler should call markAsUpdated(String, String, String, String) only if the record is actually submitted to a processing job.

Parameters:
sourceId - the name of the data source that contains the record.
recordId - the record id
jobRunId - the current job run id in which the crawler is running.
hashCode - a string that reflects changes in the record content. This can be as simple as a version identifier if such is available in record metadata, or even a hash calculated on the actual content of the record.
Returns:
an appropriate State value.
Throws:
DeltaException
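The call sequence described above can be sketched as follows. This is a minimal, self-contained illustration and not SMILA code: MiniDelta is a hypothetical in-memory stand-in for the two DeltaService methods involved, the State values NEW and CHANGED are assumptions (only UPTODATE is named in this documentation), and DeltaException handling is left out.

```java
import java.util.HashMap;
import java.util.Map;

class CheckStateSketch {
  /** Hypothetical delta states; only UPTODATE appears in the documentation. */
  enum State { NEW, CHANGED, UPTODATE }

  /** Hypothetical in-memory stand-in for checkState and markAsUpdated. */
  static class MiniDelta {
    // key: sourceId + "/" + recordId, value: { jobRunId, hashCode }
    private final Map<String, String[]> entries = new HashMap<>();

    State checkState(String sourceId, String recordId, String jobRunId, String hashCode) {
      String[] entry = entries.get(sourceId + "/" + recordId);
      if (entry == null) {
        return State.NEW;
      }
      if (!entry[1].equals(hashCode)) {
        return State.CHANGED;
      }
      entry[0] = jobRunId; // up to date: mark as visited right away
      return State.UPTODATE;
    }

    void markAsUpdated(String sourceId, String recordId, String jobRunId, String hashCode) {
      entries.put(sourceId + "/" + recordId, new String[] { jobRunId, hashCode });
    }
  }

  /** Returns true if the record was submitted for processing. */
  static boolean handleRecord(MiniDelta delta, String sourceId, String recordId,
      String jobRunId, String hashCode) {
    State state = delta.checkState(sourceId, recordId, jobRunId, hashCode);
    if (state == State.UPTODATE) {
      return false; // checkState has already marked the record as visited
    }
    // ... submit the record to the processing job here (omitted) ...
    delta.markAsUpdated(sourceId, recordId, jobRunId, hashCode);
    return true;
  }

  public static void main(String[] args) {
    MiniDelta delta = new MiniDelta();
    System.out.println(handleRecord(delta, "files", "a.txt", "run1", "v1"));
    System.out.println(handleRecord(delta, "files", "a.txt", "run2", "v1"));
    System.out.println(handleRecord(delta, "files", "a.txt", "run2", "v2"));
  }
}
```

The point of the sketch is the ordering: markAsUpdated is called only after the record has actually been submitted, so a failed submission leaves the old entry in place and the record is reported as changed again on the next crawl.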

checkState

State checkState(java.lang.String sourceId,
                 java.lang.String recordId,
                 java.lang.String compoundRecordId,
                 java.lang.String jobRunId,
                 java.lang.String hashCode)
                 throws DeltaException
Determine delta state of record identified by sourceId and recordId. If the result is State.UPTODATE the service also marks the record as visited in the current crawl job run already, so there is no need to call markAsUpdated(String, String, String, String) afterwards. In the other cases the crawler should call markAsUpdated(String, String, String, String) only if the record is actually submitted to a processing job.

Parameters:
sourceId - the name of the data source that contains the record.
recordId - the record id
compoundRecordId - the record id of the compound this record was extracted from. May be null.
jobRunId - the current job run id in which the crawler is running.
hashCode - a string that reflects changes in the record content. This can be as simple as a version identifier if such is available in record metadata, or even a hash calculated on the actual content of the record.
Returns:
an appropriate State value.
Throws:
DeltaException

markCompoundElementsVisited

void markCompoundElementsVisited(java.lang.String sourceId,
                                 java.lang.String compoundRecordId,
                                 java.lang.String jobRunId)
                                 throws DeltaException
Set jobRunId of all elements of the given compound record, because the compound itself has not changed.

Parameters:
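sourceId - the name of the data source that contains the compound record.
compoundRecordId - the record id of the compound whose elements are to be marked as visited.
jobRunId - the current job run id in which the crawler is running.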
Throws:
DeltaException
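Why this method is needed can be illustrated with a small in-memory model. The sketch below is not the SMILA implementation: the Element class, the visit and unvisited helpers, and the storage layout are all hypothetical. It only shows the effect of setting the job run id on every element of an unchanged compound, so that those elements are not reported as unvisited later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompoundSketch {
  /** Hypothetical entry layout; the real storage format is not described here. */
  static class Element {
    final String compoundRecordId; // null if the record was not extracted from a compound
    String jobRunId;               // job run in which the record was last visited

    Element(String compoundRecordId, String jobRunId) {
      this.compoundRecordId = compoundRecordId;
      this.jobRunId = jobRunId;
    }
  }

  // sourceId -> recordId -> entry
  private final Map<String, Map<String, Element>> bySource = new TreeMap<>();

  /** Hypothetical helper: records a visit of one record in a job run. */
  void visit(String sourceId, String recordId, String compoundRecordId, String jobRunId) {
    bySource.computeIfAbsent(sourceId, k -> new TreeMap<>())
        .put(recordId, new Element(compoundRecordId, jobRunId));
  }

  /** Sets the job run id on all elements extracted from the given compound. */
  void markCompoundElementsVisited(String sourceId, String compoundRecordId, String jobRunId) {
    for (Element element : bySource.getOrDefault(sourceId, new TreeMap<>()).values()) {
      if (compoundRecordId.equals(element.compoundRecordId)) {
        element.jobRunId = jobRunId;
      }
    }
  }

  /** Hypothetical helper: record ids not visited in the given job run, sorted. */
  List<String> unvisited(String sourceId, String jobRunId) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Element> e : bySource.getOrDefault(sourceId, new TreeMap<>()).entrySet()) {
      if (!jobRunId.equals(e.getValue().jobRunId)) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    CompoundSketch delta = new CompoundSketch();
    delta.visit("files", "archive.zip", null, "run1");
    delta.visit("files", "archive.zip/a.txt", "archive.zip", "run1");
    // run2: the compound is unchanged, so it is not unpacked again
    delta.visit("files", "archive.zip", null, "run2");
    delta.markCompoundElementsVisited("files", "archive.zip", "run2");
    System.out.println(delta.unvisited("files", "run2"));
  }
}
```

Without the markCompoundElementsVisited call, the element archive.zip/a.txt would still carry the job run id of the previous crawl and would look like a deleted record.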

markAsUpdated

void markAsUpdated(java.lang.String sourceId,
                   java.lang.String recordId,
                   java.lang.String jobRunId,
                   java.lang.String hashCode)
                   throws DeltaException
Mark the record as visited in the current crawl job run.

Parameters:
sourceId - the name of the data source that contains the record.
recordId - the record id
jobRunId - the current job run id in which the crawler is running.
hashCode - a string that reflects changes in the record content. This can be as simple as a version identifier if such is available in record metadata, or even a hash calculated on the actual content of the record.
Throws:
DeltaException
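For sources that provide no version identifier, the hashCode argument can be derived from the record content. The following sketch shows one possible way using the standard java.security.MessageDigest class with SHA-256. The class and method names are hypothetical, and the choice of SHA-256 and hex encoding is an assumption; the documentation only asks for a string that changes when the content changes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentHashSketch {
  /** Returns the SHA-256 digest of the content as a lowercase hex string. */
  static String contentHash(byte[] content) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(content);
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a required algorithm on every Java platform
      throw new IllegalStateException("SHA-256 not available", e);
    }
  }

  public static void main(String[] args) {
    byte[] content = "record content".getBytes(StandardCharsets.UTF_8);
    // The resulting string could be passed as the hashCode argument.
    System.out.println(contentHash(content));
  }
}
```

A version identifier or modification timestamp from the record metadata is cheaper when available, because it avoids reading the full content just to decide whether the record has changed.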

markAsUpdated

void markAsUpdated(java.lang.String sourceId,
                   java.lang.String recordId,
                   java.lang.String compoundRecordId,
                   java.lang.String jobRunId,
                   java.lang.String hashCode)
                   throws DeltaException
Mark the record that was extracted from a compound as visited in the current crawl job run.

Parameters:
sourceId - the name of the data source that contains the record.
recordId - the record id
compoundRecordId - the record id of the compound this record was extracted from. May be null.
jobRunId - the current job run id in which the crawler is running.
hashCode - a string that reflects changes in the record content. This can be as simple as a version identifier if such is available in record metadata, or even a hash calculated on the actual content of the record.
Throws:
DeltaException

clearSource

void clearSource(java.lang.String sourceId)
                 throws DeltaException
Delete all state information in the service about the given data source.

Parameters:
sourceId - data source name.
Throws:
DeltaException

clearAll

void clearAll()
              throws DeltaException
Delete all state information in the service about all data sources.

Throws:
DeltaException

getSourceIds

java.util.Collection<java.lang.String> getSourceIds()
                                                    throws DeltaException
Get the IDs of all sources that currently have entries in the DeltaService.

Throws:
DeltaException

countEntries

long countEntries(java.lang.String sourceId,
                  boolean countExact)
                  throws DeltaException
Parameters:
sourceId - the name of the data source to examine
countExact - set to true to get an exact result, which may take some time. Otherwise the service may return only an estimated value.
Returns:
number of entries for given source id.
Throws:
DeltaException

getShardPrefixes

java.util.Collection<java.lang.String> getShardPrefixes(java.lang.String sourceId)
                                                        throws DeltaException
Get possible input values for getUnvisitedEntries(String, String). This makes it possible to parallelize and distribute the check for records to delete.

Parameters:
sourceId - the name of the data source to examine.
Throws:
DeltaException

getUnvisitedEntries

java.util.Collection<DeltaService.EntryId> getUnvisitedEntries(java.lang.String sourceAndShardPrefix,
                                                               java.lang.String jobRunId)
                                                               throws DeltaException
Get the record IDs in the given data source and shard that have not been visited in the given job run and therefore must be sent as deleted records to the target job. To get all unvisited records in the source, the caller must iterate over the result of getShardPrefixes(String) and call this method with each of the shard-prefix values.

Parameters:
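sourceAndShardPrefix - one of the values returned by getShardPrefixes(String)
jobRunId - the job run id to check against; entries not visited in this job run are returned.
Returns:
the entries that have not been visited in the given job run.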
Throws:
DeltaException
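The iteration over shard prefixes described above can be sketched as follows. This is an illustration under stated assumptions and not SMILA code: the Delta interface is a hypothetical subset of DeltaService with String standing in for DeltaService.EntryId, the "sourceId/shard" prefix format is invented for the in-memory implementation, and DeltaException handling is left out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DeleteSweepSketch {
  /** Hypothetical subset of DeltaService; String replaces DeltaService.EntryId. */
  interface Delta {
    Collection<String> getShardPrefixes(String sourceId);
    Collection<String> getUnvisitedEntries(String sourceAndShardPrefix, String jobRunId);
    void deleteEntry(String sourceId, String entryId);
  }

  /** Visits every shard of the source and removes all entries not seen in the job run. */
  static List<String> sweep(Delta delta, String sourceId, String jobRunId) {
    List<String> deleted = new ArrayList<>();
    for (String prefix : delta.getShardPrefixes(sourceId)) {
      for (String entryId : delta.getUnvisitedEntries(prefix, jobRunId)) {
        // ... send a "deleted" record to the target job here (omitted) ...
        delta.deleteEntry(sourceId, entryId);
        deleted.add(entryId);
      }
    }
    return deleted;
  }

  /** Hypothetical in-memory implementation, only to make the sketch runnable. */
  static class InMemoryDelta implements Delta {
    // shard prefix (assumed format "sourceId/shard") -> entryId -> jobRunId of last visit
    private final Map<String, Map<String, String>> shards = new TreeMap<>();

    void visit(String shardPrefix, String entryId, String jobRunId) {
      shards.computeIfAbsent(shardPrefix, k -> new TreeMap<>()).put(entryId, jobRunId);
    }

    public Collection<String> getShardPrefixes(String sourceId) {
      List<String> result = new ArrayList<>();
      for (String prefix : shards.keySet()) {
        if (prefix.startsWith(sourceId + "/")) {
          result.add(prefix);
        }
      }
      return result;
    }

    public Collection<String> getUnvisitedEntries(String sourceAndShardPrefix, String jobRunId) {
      List<String> result = new ArrayList<>(); // a copy, so callers may delete while iterating
      Map<String, String> shard =
          shards.getOrDefault(sourceAndShardPrefix, Collections.<String, String>emptyMap());
      for (Map.Entry<String, String> e : shard.entrySet()) {
        if (!jobRunId.equals(e.getValue())) {
          result.add(e.getKey());
        }
      }
      return result;
    }

    public void deleteEntry(String sourceId, String entryId) {
      for (String prefix : getShardPrefixes(sourceId)) {
        shards.get(prefix).remove(entryId);
      }
    }
  }

  public static void main(String[] args) {
    InMemoryDelta delta = new InMemoryDelta();
    delta.visit("files/0", "a.txt", "run1");
    delta.visit("files/0", "b.txt", "run2");
    delta.visit("files/1", "c.txt", "run1");
    System.out.println(sweep(delta, "files", "run2"));
  }
}
```

Because each shard prefix is handled independently, the outer loop is the natural place to distribute the work, for example by handing each prefix to a separate task.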

deleteEntry

void deleteEntry(java.lang.String sourceId,
                 DeltaService.EntryId entryId)
                 throws DeltaException
Remove an entry, e.g. after the corresponding record has been deleted.

Parameters:
sourceId - data source Id
entryId - ID of the entry, e.g. as returned by getUnvisitedEntries(String, String)
Throws:
DeltaException
