SMILA (incubation) API documentation

org.eclipse.smila.connectivity.framework.crawler.web.http
Class RobotRulesParser

java.lang.Object
  extended by org.eclipse.smila.connectivity.framework.crawler.web.http.RobotRulesParser
All Implemented Interfaces:
Configurable

public class RobotRulesParser
extends java.lang.Object
implements Configurable

This class handles the parsing of robots.txt files. It produces RobotRulesParser.RobotRuleSet objects, which describe the download permissions declared in a robots.txt file.
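The sketch below is a minimal, hypothetical usage example. It assumes that a Configuration instance and an HttpBase instance (used to retrieve the robots.txt contents) are already available from the surrounding crawler code; only the constructor and methods documented on this page are used.

    // Minimal usage sketch; conf and http are assumed to be supplied by the crawler setup.
    boolean mayFetch(Configuration conf, HttpBase http, String pageUrl)
        throws java.net.MalformedURLException {
      RobotRulesParser parser = new RobotRulesParser(conf);
      java.net.URL url = new java.net.URL(pageUrl);
      // isAllowed fetches and parses the site's robots.txt via the HttpBase instance
      return parser.isAllowed(http, url);
    }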


Nested Class Summary
static class RobotRulesParser.RobotRuleSet
          This class holds the rules which were parsed from a robots.txt file, and can test paths against those rules.
 
Constructor Summary
RobotRulesParser(Configuration conf)
          Creates a new RobotRulesParser with the given configuration.
 
Method Summary
 Configuration getConf()
          Return the configuration used by this object.
 long getCrawlDelay(HttpBase http, java.net.URL url)
          Returns the Crawl-Delay value extracted from the robots.txt file.
 boolean isAllowed(HttpBase http, java.net.URL url)
          Returns true if the URL is allowed for fetching and false otherwise.
 void setConf(Configuration conf)
          Set the configuration to be used by this object.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RobotRulesParser

public RobotRulesParser(Configuration conf)
Creates a new RobotRulesParser with the given configuration.

Parameters:
conf - the Configuration to use.
Method Detail

setConf

public void setConf(Configuration conf)
Set the configuration to be used by this object.

Specified by:
setConf in interface Configurable
Parameters:
conf - the Configuration to be used by this object.

getConf

public Configuration getConf()
Return the configuration used by this object.

Specified by:
getConf in interface Configurable
Returns:
the Configuration used by this object.

isAllowed

public boolean isAllowed(HttpBase http,
                         java.net.URL url)
Returns true if the URL is allowed for fetching and false otherwise.

Parameters:
http - HttpBase object that is used to get the robots.txt contents.
url - URL to be checked.
Returns:
true if fetching the URL is allowed by the site's robots.txt rules, false otherwise.
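A hedged sketch of how the result might be used to filter URLs before fetching. The parser and http objects are assumed to be initialized as in the class-level example above, and queueForFetching is a hypothetical callback in the calling crawler code.

    // Sketch: skip URLs that the site's robots.txt disallows.
    java.net.URL url = new java.net.URL("http://www.example.com/private/data.html");
    if (parser.isAllowed(http, url)) {
      queueForFetching(url);   // hypothetical callback in the calling crawler code
    }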

getCrawlDelay

public long getCrawlDelay(HttpBase http,
                          java.net.URL url)
Returns the Crawl-Delay value extracted from the robots.txt file.

Parameters:
http - HttpBase object that is used to get the robots.txt contents.
url - URL to be checked.
Returns:
the Crawl-Delay value in milliseconds, or -1 if no Crawl-Delay is set.
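A sketch showing how the returned value might be honored between fetches; parser, http, and url are assumed to be initialized as in the examples above.

    // Sketch: honor the Crawl-Delay directive, if any, before fetching the URL.
    long delay = parser.getCrawlDelay(http, url);
    if (delay > 0) {              // getCrawlDelay returns -1 when no Crawl-Delay is set
      try {
        Thread.sleep(delay);      // the value is already in milliseconds
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }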
