
Class Webcrawler.Crawler.HTMLParser

java.lang.Object
   |
   +----Webcrawler.Crawler.HTMLParser

public class HTMLParser
extends Object
An HTMLParser object scans a local file and extracts HTML-specific information: the tags (with their attributes and values) and the text between the tags. If you are only interested in the tags, call readTag() repeatedly; it skips the text between the tags. readInfo() works the same way, but skips the tags and reads only the text. If you need the entire HTML file, call comesTag() and comesInfo() to find out what comes next, then call readTag() or readInfo() accordingly.

A tag can consist of several attributes and their values. To scan through the attributes of the tag just read, call getCurrAttributes() and cast the elements of the enumeration to HTMLAttribute. The first item in a tag is the element name (e.g. IMG); retrieve it with getCurrElement() after calling readTag(). If readTag() finds that the next item to be read is a comment, it reads the entire comment and stores it in the currComment field, and currElement is then null. Otherwise currComment is null.
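The reading loop described above can be sketched as follows. Because the HTMLParser class itself is not reproduced on this page, the sketch codes against a hypothetical `TagReader` interface that mirrors only the documented public methods, and `CannedReader` is a made-up stand-in that replays a fixed token sequence so the example runs on its own. How the real parser behaves at end of file is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

public class ParserLoopSketch {

    /** Hypothetical interface: mirrors the documented public methods only. */
    public interface TagReader {
        boolean comesTag();
        boolean readTag();
        boolean readInfo();
        String getCurrElement();
        String getCurrComment();
        String getCurrInfo();
    }

    /** Made-up stand-in that replays fixed {kind, value} tokens; not the real parser. */
    public static class CannedReader implements TagReader {
        private final String[][] tokens; // kind is "tag", "comment" or "info"
        private int next = 0;
        private String element, comment, info;

        public CannedReader(String[][] tokens) { this.tokens = tokens; }

        private boolean nextIsInfo() {
            return next < tokens.length && tokens[next][0].equals("info");
        }

        public boolean comesTag() { return next < tokens.length && !nextIsInfo(); }

        public boolean readTag() {
            while (nextIsInfo()) next++;              // skip text before the tag
            if (next >= tokens.length) return false;  // end of input
            String[] t = tokens[next++];
            element = t[0].equals("tag") ? t[1] : null;
            comment = t[0].equals("comment") ? t[1] : null;
            return true;
        }

        public boolean readInfo() {
            while (comesTag()) next++;                // skip tags before the text
            if (next >= tokens.length) return false;  // end of input
            info = tokens[next++][1];
            return true;
        }

        public String getCurrElement() { return element; }
        public String getCurrComment() { return comment; }
        public String getCurrInfo() { return info; }
    }

    /** Walks the whole input, choosing readTag() or readInfo() via comesTag(). */
    public static List<String> describe(TagReader p) {
        List<String> out = new ArrayList<>();
        while (true) {
            if (p.comesTag()) {
                if (!p.readTag()) break;
                if (p.getCurrElement() != null) {
                    out.add("tag:" + p.getCurrElement());
                } else {
                    // A null element means the tag was a comment.
                    out.add("comment:" + p.getCurrComment());
                }
            } else {
                if (!p.readInfo()) break;
                out.add("text:" + p.getCurrInfo());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TagReader p = new CannedReader(new String[][] {
            {"tag", "P"}, {"info", "Hello"}, {"comment", "note"}, {"tag", "IMG"}
        });
        System.out.println(describe(p)); // [tag:P, text:Hello, comment:note, tag:IMG]
    }
}
```

With a real HTMLParser the same loop shape would apply, with the parser constructed from a file name and close() called when done.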


Variable Index

 o currAttributes
Stores the currently read attributes (=list of HTMLAttributes)
 o currChar
 o currComment
Stores the currently read comment (<!-- ... -->)
 o currElement
Stores the currently read Element (e.g: "BODY")
 o currInfo
Stores the currently read info (=text between tags)
 o in
 o parsedCharsNum
 o success

Constructor Index

 o HTMLParser(String)
Creates a new parser on the local file htmlFile. Use readTag() and readInfo() to read from the file, then use getCurrElement(), getCurrAttributes() and getCurrInfo() to find out what was read.

Method Index

 o afterBlanks()
Skips blanks, carriage returns, newlines and tabs.
 o close()
 o comesInfo()
 o comesTag()
 o finalize()
 o getCurrAttributes()
 o getCurrComment()
 o getCurrElement()
 o getCurrInfo()
 o getParsedCharsNum()
 o isWhiteSpace(char)
 o read()
 o readAttribute()
 o readComment(String)
 o readElement()
 o readInfo()
Reads the next info between tags.
 o readTag()
Reads the next tag.
 o readValue()
 o success()

Variables

 o in
 protected FileInputStream in
 o success
 protected boolean success
 o currChar
 protected char currChar
 o parsedCharsNum
 protected int parsedCharsNum
 o currElement
 protected String currElement
Stores the currently read Element (e.g: "BODY")

 o currComment
 protected String currComment
Stores the currently read comment (<!-- ... -->)

 o currAttributes
 protected Vector currAttributes
Stores the currently read attributes (=list of HTMLAttributes)

See Also:
HTMLAttribute
 o currInfo
 protected String currInfo
Stores the currently read info (=text between tags)

Constructors

 o HTMLParser
 public HTMLParser(String htmlFile)
Creates a new parser on the local file htmlFile. Use readTag() and readInfo() to read from the file, then use getCurrElement(), getCurrAttributes() and getCurrInfo() to find out what was read.

Methods

 o finalize
 protected void finalize() throws Throwable
Overrides:
finalize in class Object
 o close
 public void close()
 o read
 private char read()
 o isWhiteSpace
 private boolean isWhiteSpace(char ch)
 o afterBlanks
 private char afterBlanks()
Skips blanks, carriage returns, newlines and tabs.
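A minimal sketch of the skipping behaviour, assuming "blanks" means the space character. It works on a String and an index rather than on the parser's input stream; the class name `WhitespaceSketch`, the method `skipBlanks` and the String-based signature are illustrative and are not the class's actual private implementation.

```java
public class WhitespaceSketch {

    /** True for the four characters the description lists. */
    public static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\r' || ch == '\n' || ch == '\t';
    }

    /**
     * Returns the index of the first non-whitespace character at or
     * after 'from', or text.length() if there is none.
     */
    public static int skipBlanks(String text, int from) {
        int i = from;
        while (i < text.length() && isWhiteSpace(text.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String tag = "<IMG \t\r\n SRC=\"a.gif\">";
        int i = skipBlanks(tag, 4);         // start right after "<IMG"
        System.out.println(tag.charAt(i));  // S
    }
}
```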

 o readTag
 public boolean readTag()
Reads the next tag. If the next item to be read is info (text), it is skipped. Afterwards currChar contains the closing > of the tag. Get the element name of the tag with getCurrElement(). Get the tag's attributes with getCurrAttributes(), which returns an Enumeration over HTMLAttribute objects. Attribute values keep their quotation marks, if there are any.

Returns:
success (false if error or eof)
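Since attribute values keep their quotation marks, callers will often want to strip them. The helper below is a sketch of one way to do that on a plain String; `QuoteSketch` and `stripQuotes` are hypothetical names, and how the value string is obtained from an HTMLAttribute object is not documented on this page, so that step is left out.

```java
public class QuoteSketch {

    /**
     * Removes one pair of matching single or double quotes around a value.
     * Values without a matching pair are returned unchanged.
     */
    public static String stripQuotes(String value) {
        if (value == null || value.length() < 2) {
            return value;
        }
        char first = value.charAt(0);
        char last = value.charAt(value.length() - 1);
        if ((first == '"' || first == '\'') && first == last) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"a.gif\"")); // a.gif
        System.out.println(stripQuotes("5"));         // 5
    }
}
```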
 o readElement
 private String readElement()
 o readComment
 private String readComment(String element)
 o readAttribute
 private String readAttribute()
 o readValue
 private String readValue()
 o readInfo
 public boolean readInfo()
Reads the next info (text) between tags. If the next item to be read is a tag, it is skipped. Afterwards currChar contains the < of the next tag. Get the text that was read with getCurrInfo().

Returns:
success (false if error or eof)
 o comesTag
 public boolean comesTag()
Returns:
true if the next item to be read is a tag
 o comesInfo
 public boolean comesInfo()
Returns:
true if the next item to be read is info (text)
 o success
 public boolean success()
Returns:
the success flag (the same kind of value that readTag() and readInfo() return)
 o getParsedCharsNum
 public int getParsedCharsNum()
 o getCurrElement
 public String getCurrElement()
Returns:
the Element-name (e.g: BODY) of the currently read tag
 o getCurrComment
 public String getCurrComment()
Returns:
the comment that was read; when getCurrElement() returns null, the last tag was a comment and its text is available here
 o getCurrAttributes
 public Enumeration getCurrAttributes()
Returns:
Enumeration over the attributes of the last read HTML-tag
See Also:
HTMLAttribute
 o getCurrInfo
 public String getCurrInfo()
Returns:
The info/text between HTML-tags (=the "real" text of the page)
