SMILA 1.0 API documentation
java.lang.Object
  org.eclipse.smila.connectivity.framework.crawler.web.parse.html.DOMContentUtils
public class DOMContentUtils
A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
| Nested Class Summary | |
|---|---|
static class |
DOMContentUtils.LinkParams
|
| Constructor Summary | |
|---|---|
DOMContentUtils(Configuration conf)
|
|
| Method Summary | |
|---|---|
java.net.URL |
getBase(org.w3c.dom.Node node)
If Node contains a BASE tag then its HREF is returned. |
void |
getJavascriptOutlinks(java.lang.String base,
java.util.List<Outlink> outlinks,
org.w3c.dom.Node node)
|
void |
getOutlinks(java.net.URL base,
java.util.List<Outlink> outlinks,
org.w3c.dom.Node node)
This method finds all anchors below the supplied DOM node, creates an appropriate Outlink
record for each (relative to the supplied base URL), and adds them to the outlinks
list. |
void |
getText(java.lang.StringBuffer sb,
org.w3c.dom.Node node)
This is a convenience method, equivalent to getText(sb, node, false). |
boolean |
getText(java.lang.StringBuffer sb,
org.w3c.dom.Node node,
boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node, and will append all the content text found
beneath the DOM node to the StringBuffer. |
boolean |
getTitle(java.lang.StringBuffer sb,
org.w3c.dom.Node node)
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath
the first title node to the StringBuffer. |
void |
setConf(Configuration conf)
|
void |
setJavascriptParser(Parser javascriptParser)
|
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public DOMContentUtils(Configuration conf)
| Method Detail |
|---|
public void setConf(Configuration conf)
public boolean getText(java.lang.StringBuffer sb,
org.w3c.dom.Node node,
boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node, and will append all the content text found
beneath the DOM node to the StringBuffer.
If abortOnNestedAnchors is true, DOM traversal will be aborted and the StringBuffer
will not contain any text encountered after a nested anchor is found.
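As a rough illustration of the traversal described above, the following sketch walks a DOM subtree using only JDK classes. It is not the SMILA implementation: the class TextWalkSketch, the helpers parse and collectText, the whitespace handling, and the meaning of the boolean result (true when traversal stopped at a nested anchor) are all assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the text extraction step; names are illustrative only.
class TextWalkSketch {

  /** Parses well-formed markup into a DOM tree, wrapping checked exceptions. */
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /**
   * Appends the text found beneath node to sb.
   * Assumed return value: true if traversal was aborted at a nested anchor.
   */
  static boolean collectText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
    return walk(sb, node, abortOnNestedAnchors, 0);
  }

  private static boolean walk(StringBuffer sb, Node node, boolean abort, int anchorDepth) {
    if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
      if (abort && anchorDepth > 0) {
        return true; // an anchor inside another anchor: stop here
      }
      anchorDepth++;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' '); // separate text fragments with a single space
        }
        sb.append(text);
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (walk(sb, c, abort, anchorDepth)) {
        return true; // propagate the abort upwards
      }
    }
    return false;
  }
}
```

With abortOnNestedAnchors set to false the walk visits the whole subtree; with true, text after the nested anchor never reaches the buffer, which matches the behaviour described above.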
public void getText(java.lang.StringBuffer sb,
org.w3c.dom.Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public boolean getTitle(java.lang.StringBuffer sb,
org.w3c.dom.Node node)
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath
the first title node to the StringBuffer.
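A minimal sketch of the "first title node" lookup, written against the JDK DOM API only. The class TitleSketch and its methods are hypothetical, and the assumption that the boolean result signals whether a title element was found is not confirmed by this page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the SMILA code.
class TitleSketch {

  /** Parses well-formed markup into a DOM tree, wrapping checked exceptions. */
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /**
   * Depth-first search for the first element named "title"; appends its
   * trimmed text to sb. Assumed return value: true if a title was found.
   */
  static boolean appendTitle(StringBuffer sb, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE && "title".equalsIgnoreCase(node.getNodeName())) {
      sb.append(node.getTextContent().trim());
      return true;
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (appendTitle(sb, c)) {
        return true; // stop at the first match in document order
      }
    }
    return false;
  }
}
```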
public java.net.URL getBase(org.w3c.dom.Node node)
If Node contains a BASE tag then its HREF is returned.
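The BASE lookup can be pictured as a small recursive search. The sketch below is an assumption-laden illustration rather than the library code: BaseSketch and findBase are made-up names, the attribute is assumed to be spelled href in lower case, java.net.URI is used in place of java.net.URL to keep the example free of checked exceptions, and returning null when no BASE tag exists is a guess.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the SMILA code.
class BaseSketch {

  /** Parses well-formed markup into a DOM tree, wrapping checked exceptions. */
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /** Returns the href of the first base element below node, or null (assumed) if there is none. */
  static URI findBase(Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE && "base".equalsIgnoreCase(node.getNodeName())) {
      Node href = node.getAttributes().getNamedItem("href");
      if (href != null) {
        return URI.create(href.getNodeValue());
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      URI found = findBase(c);
      if (found != null) {
        return found;
      }
    }
    return null;
  }
}
```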
public void getOutlinks(java.net.URL base,
java.util.List<Outlink> outlinks,
org.w3c.dom.Node node)
This method finds all anchors below the supplied DOM node, creates an appropriate Outlink
record for each (relative to the supplied base URL), and adds them to the outlinks
list.
Links without inner structure (tags, text, etc) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
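The discard rule above can be sketched as follows. This is a simplified reading, not the actual implementation: OutlinkSketch, shouldDiscard and collectLinks are hypothetical names, plain URL strings stand in for Outlink records, java.net.URI replaces java.net.URL, and only a elements with an href attribute are considered.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the SMILA code.
class OutlinkSketch {

  /** Parses well-formed markup into a DOM tree, wrapping checked exceptions. */
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /** True for anchors with no children, or with exactly one nested anchor plus only blank text. */
  static boolean shouldDiscard(Node anchor) {
    if (!anchor.hasChildNodes()) {
      return true; // no inner structure at all
    }
    int nestedAnchors = 0;
    for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (c.getNodeType() == Node.TEXT_NODE) {
        if (!c.getNodeValue().trim().isEmpty()) {
          return false; // real anchor text: keep the link
        }
      } else if (c.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(c.getNodeName())) {
        nestedAnchors++;
      } else {
        return false; // some other inner structure: keep the link
      }
    }
    return nestedAnchors == 1;
  }

  /** Adds the resolved target of every kept anchor below node to out. */
  static void collectLinks(URI base, List<String> out, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
      String href = ((Element) node).getAttribute("href");
      if (!href.isEmpty() && !shouldDiscard(node)) {
        out.add(base.resolve(href).toString()); // resolve relative to the base
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectLinks(base, out, c);
    }
  }
}
```

Note that the nested anchor itself is still visited, so a wrapper link produced by DOM fix-up is dropped while the real link inside it is kept.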
public void getJavascriptOutlinks(java.lang.String base,
java.util.List<Outlink> outlinks,
org.w3c.dom.Node node)
public void setJavascriptParser(Parser javascriptParser)