org.eclipse.smila.importing.crawler.web.filter
Class SimpleLinkFilter
java.lang.Object
org.eclipse.smila.importing.crawler.web.filter.SimpleLinkFilter
- All Implemented Interfaces:
- LinkFilter
public class SimpleLinkFilter
- extends java.lang.Object
- implements LinkFilter
Simple example implementation:
- Removes fragment parts from URLs ("#...").
- Filters out links with parameters ("?...").
- Filters out links to other hosts.
Also removes duplicates with exactly the same URL.
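The rules listed above can be illustrated with a minimal sketch that works on plain URL strings instead of SMILA Record objects. It is not the actual SimpleLinkFilter code: the class LinkFilterSketch and its methods are hypothetical names, links are assumed to be absolute URLs, and "other hosts" is assumed to mean a host different from that of the source link.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the filter rules; not the SMILA implementation. */
final class LinkFilterSketch {

  /** Applies the three rules plus de-duplication to plain URL strings. */
  static List<String> filter(Collection<String> extractedUrls, String sourceUrl) {
    String sourceHost = hostOf(sourceUrl);
    // LinkedHashSet keeps the original order and drops exact duplicates.
    Set<String> kept = new LinkedHashSet<>();
    for (String url : extractedUrls) {
      // Rule 1: remove the fragment part ("#...").
      String withoutFragment = stripFragment(url);
      // Rule 2: drop links that carry parameters ("?...").
      if (withoutFragment.indexOf('?') >= 0) {
        continue;
      }
      // Rule 3: drop links to other hosts (and links whose host cannot be determined).
      String host = hostOf(withoutFragment);
      if (host == null || !host.equalsIgnoreCase(sourceHost)) {
        continue;
      }
      kept.add(withoutFragment);
    }
    return new ArrayList<>(kept);
  }

  /** Returns the URL without its fragment part, if any. */
  static String stripFragment(String url) {
    int hash = url.indexOf('#');
    return hash < 0 ? url : url.substring(0, hash);
  }

  /** Returns the host of the URL, or null if it is missing or the URL is malformed. */
  static String hostOf(String url) {
    try {
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
  }
}
```

In this sketch the fragment is stripped before duplicates are detected, so "http://example.org/a#top" and "http://example.org/a" collapse into one link. Whether the real class applies the steps in the same order is not stated in this documentation.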
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SimpleLinkFilter
public SimpleLinkFilter()
filterLinks
public java.util.Collection<Record> filterLinks(java.util.Collection<Record> extractedLinks,
Record sourceLink,
AnyMap parameters,
TaskLog taskLog)
throws WebCrawlerException
- Description copied from interface:
LinkFilter
- filter extracted links.
- Specified by:
filterLinks in interface LinkFilter
- Parameters:
extractedLinks - result from the LinkExtractor service.
sourceLink - record from which the links were extracted.
parameters - task parameters; can configure the operation.
taskLog - log facility provided by the WorkerManager.
- Returns:
- links to follow in follow-up tasks
- Throws:
WebCrawlerException - error in processing the links.
getHost
protected static java.lang.String getHost(java.lang.String urlString,
TaskLog log)
- Returns:
- host part of URL in link record.
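As a rough sketch of what a helper like getHost might do, the following extracts the host with java.net.URI and reports unparsable URLs to a log. The class HostSketch is a hypothetical name, java.util.logging.Logger stands in for TaskLog, and returning null on a malformed URL is an assumption; this page does not document the real error behavior.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.logging.Logger;

/** Hypothetical stand-in for the documented helper; not the SMILA implementation. */
final class HostSketch {

  /**
   * Returns the host part of urlString.
   * Assumption: a URL that cannot be parsed is logged as a warning and yields null.
   */
  static String getHost(String urlString, Logger log) {
    try {
      return new URI(urlString).getHost();
    } catch (URISyntaxException e) {
      log.warning("Cannot parse URL '" + urlString + "': " + e.getMessage());
      return null;
    }
  }
}
```

Note that the port is not part of the host: for "http://example.org:8080/a/b" this sketch returns "example.org".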