
Class Webcrawler.Crawler.HTMLNode

java.lang.Object
   |
   +----Webcrawler.Crawler.URLNode
           |
           +----Webcrawler.Crawler.LoadableNode
                   |
                   +----Webcrawler.Crawler.HTMLNode

public class HTMLNode
extends LoadableNode
This class is derived from LoadableNode and implements a node class for storing HTML file data, such as the title and the links found in the file. The links (the sons of this node) can be accessed through an Enumeration, or the links Vector can be retrieved directly. To find HTML-page-specific information such as the TITLE, use the findHTMLPageInfos() method, which parses the local file and sets the fields of this object according to the information found. An object of this class can call the findSons() method, which uses a Parser to find the links and connect them to the node as sons.

See Also:
URLNode, LoadableNode, Parser

Variable Index

 o links
Sons of this node.
 o title
Title of the HTML page (if present).
 o willSonsBeLoaded
Whether the sons of this node will be loaded in the future.

Constructor Index

 o HTMLNode()
 o HTMLNode(String)
 o HTMLNode(URL, String)

Method Index

 o check_Connect(String)
Determines the type of the specified URL (HTML, ...) and connects a new node of the corresponding node type (LoadableNode, HTMLNode, URLNode) to this node.
 o ConnectDeadSon(String)
Connects a new son to this node, whose URLType is set to dead.
 o ConnectMalformedSon(String)
Creates a new LoadableNode, sets its URLType to malformed and its infoText to "url couldn't be resolved....".
 o ConnectSon(URLNode)
Connects the given node to this node.
 o copy(HTMLNode)
Copies the title but not the links: a recursive URL is also a leaf in the tree, so it does not need any links.
 o findHTMLPageInfos()
Parses through localfile and sets the HTML-page specific fields like TITLE.
 o getLinks()
 o getNoOfSons()
 o getSonEnumeration()
Access to the sons of the node via an Enumeration
 o getTitle()
 o getWillSonsBeLoaded()

Variables

 o title
 protected String title
Title of the HTML page (if present).

 o links
 protected Vector links
Sons of this node.

 o willSonsBeLoaded
 protected boolean willSonsBeLoaded
Whether the sons of this node will be loaded in the future. This information matters to the Parsers, because they can skip the +"/index.html" check for every son that will not be loaded.

Constructors

 o HTMLNode
 public HTMLNode()
See Also:
URLNode, LoadableNode
 o HTMLNode
 public HTMLNode(String url) throws MalformedURLException
See Also:
URLNode, LoadableNode
 o HTMLNode
 public HTMLNode(URL context,
                 String spec) throws MalformedURLException
See Also:
URLNode, LoadableNode
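
The (URL, String) constructor takes the same argument pair as java.net.URL(URL context, String spec). The sketch below assumes, without confirmation from this page, that HTMLNode resolves spec against context the same way. It uses only java.net.URL; the class name ContextUrlSketch and the example.com addresses are made up for illustration. Recent JDK releases deprecate these URL constructors in favor of java.net.URI, but they still compile and run.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ContextUrlSketch {
    // Resolves a (possibly relative) link against the page it was found on.
    // The checked exception is wrapped so callers need no throws clause.
    public static String resolve(String base, String spec) {
        try {
            URL context = new URL(base);
            return new URL(context, spec).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("cannot resolve " + spec, e);
        }
    }

    public static void main(String[] args) {
        // A relative link replaces the last path segment of the context URL.
        System.out.println(resolve("http://example.com/docs/index.html", "page2.html"));
        // An absolute spec ignores the context entirely.
        System.out.println(resolve("http://example.com/docs/index.html", "http://example.org/x.html"));
    }
}
```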

Methods

 o copy
 public void copy(HTMLNode from)
Copies the title but not the links: a recursive URL is also a leaf in the tree, so it does not need any links.
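
The copy semantics described here can be sketched with a stand-in class. PageNode below is hypothetical and only mimics the title and links fields listed on this page; it is not the crawler's actual implementation.

```java
import java.util.Vector;

public class CopySketch {
    // Hypothetical stand-in for HTMLNode with only the two relevant fields.
    static class PageNode {
        String title;
        final Vector<PageNode> links = new Vector<>();

        // Copies the title only; the links of 'from' are deliberately not copied,
        // so the copy stays a leaf.
        void copy(PageNode from) {
            this.title = from.title;
        }
    }

    // Builds a node with one son, copies it, and reports the copy's state.
    public static String describeCopy() {
        PageNode original = new PageNode();
        original.title = "Example page";
        original.links.addElement(new PageNode());

        PageNode duplicate = new PageNode();
        duplicate.copy(original);
        return duplicate.title + " / sons: " + duplicate.links.size();
    }

    public static void main(String[] args) {
        System.out.println(describeCopy());
    }
}
```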

 o getTitle
 public String getTitle()
Returns:
The Title of this HTML-page
 o getNoOfSons
 public int getNoOfSons()
Returns:
The number of sons connected to this node
 o getSonEnumeration
 public Enumeration getSonEnumeration()
Access to the sons of the node via an Enumeration

Returns:
An Enumeration over all the sons
See Also:
Enumeration
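
A minimal sketch of walking the sons through an Enumeration follows. It assumes the Enumeration comes from Vector.elements() on the links field, which this page does not state. Node is a hypothetical stand-in for HTMLNode.

```java
import java.util.Enumeration;
import java.util.Vector;

public class SonEnumerationSketch {
    // Hypothetical stand-in for HTMLNode; only the link storage is modeled.
    static class Node {
        final String url;
        final Vector<Node> links = new Vector<>();

        Node(String url) { this.url = url; }

        Enumeration<Node> getSonEnumeration() { return links.elements(); }
        int getNoOfSons() { return links.size(); }
    }

    // Visits every son in insertion order and joins their URLs with commas.
    public static String joinSonUrls() {
        Node parent = new Node("index.html");
        parent.links.addElement(new Node("a.html"));
        parent.links.addElement(new Node("b.html"));

        StringBuilder out = new StringBuilder();
        for (Enumeration<Node> e = parent.getSonEnumeration(); e.hasMoreElements();) {
            if (out.length() > 0) out.append(',');
            out.append(e.nextElement().url);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinSonUrls());
    }
}
```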
 o getLinks
 public Vector getLinks()
Returns:
the Vector links stored in this node
 o getWillSonsBeLoaded
 public boolean getWillSonsBeLoaded()
Returns:
true if this node's sons will be loaded in the future
See Also:
Parsers
 o ConnectSon
 public URLNode ConnectSon(URLNode n)
Connects the given node to this node. Sets the depth and father fields of node n before connecting it.

Returns:
the connected node
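
The bookkeeping described for ConnectSon might look roughly like the sketch below. The field names father and depth follow the description above. The rule that a son's depth is its father's depth plus one is an assumption, and Node is a hypothetical stand-in class.

```java
import java.util.Vector;

public class ConnectSonSketch {
    // Hypothetical stand-in for the node classes of the crawler.
    static class Node {
        final String url;
        Node father;   // null for the root
        int depth;     // assumed: root has depth 0
        final Vector<Node> links = new Vector<>();

        Node(String url) { this.url = url; }

        // Sets father and depth on n first, then stores it as a son.
        Node connectSon(Node n) {
            n.father = this;
            n.depth = this.depth + 1;  // assumption, not stated in the docs
            links.addElement(n);
            return n;
        }
    }

    // Chains two connections and returns the depth of the innermost node.
    public static int depthOfGrandchild() {
        Node root = new Node("http://example.com/");
        Node child = root.connectSon(new Node("http://example.com/a.html"));
        Node grandchild = child.connectSon(new Node("http://example.com/b.html"));
        return grandchild.depth;
    }

    public static void main(String[] args) {
        System.out.println(depthOfGrandchild());
    }
}
```

Returning the connected node, as the real method does, is what allows the calls to be chained as shown.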
 o ConnectMalformedSon
 public URLNode ConnectMalformedSon(String url)
Creates a new LoadableNode, sets its URLType to malformed and its infoText to "url couldn't be resolved....". Connects that son to this node.

Returns:
the connected node
 o ConnectDeadSon
 public URLNode ConnectDeadSon(String url)
Connects a new son to this node, whose URLType is set to dead.

Returns:
the connected node
 o check_Connect
 public URLNode check_Connect(String url)
Determines the type of the specified URL (HTML, ...) and connects a new node of the corresponding node type (LoadableNode, HTMLNode, URLNode) to this node.

Returns:
the connected node (null if an error occurred)
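
This page does not say how check_Connect decides on a type. Purely as an illustration, the sketch below classifies a URL string by its path suffix and treats unparsable strings as malformed. The Kind enum, the suffix rule, and the class name are all assumptions; the real method may well inspect the server response instead.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class CheckConnectSketch {
    // Hypothetical categories, loosely matching the node types named above.
    public enum Kind { HTML, LOADABLE, MALFORMED }

    // Guesses the kind of a URL from its syntax alone (no network access).
    public static Kind classify(String url) {
        final URL parsed;
        try {
            parsed = new URL(url);
        } catch (MalformedURLException e) {
            return Kind.MALFORMED;  // would map to ConnectMalformedSon
        }
        String path = parsed.getPath().toLowerCase(Locale.ROOT);
        boolean looksLikeHtml = path.isEmpty() || path.endsWith("/")
                || path.endsWith(".html") || path.endsWith(".htm");
        return looksLikeHtml ? Kind.HTML : Kind.LOADABLE;
    }

    public static void main(String[] args) {
        System.out.println(classify("http://example.com/index.html"));
        System.out.println(classify("http://example.com/logo.gif"));
        System.out.println(classify("not a url"));
    }
}
```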
 o findHTMLPageInfos
 public void findHTMLPageInfos()
Parses through localfile and sets the HTML-page specific fields like TITLE.

