Class HTMLTokenizer

java.lang.Object
  |
  +--HTMLTokenizer

public class HTMLTokenizer
extends java.lang.Object

The HTMLTokenizer class takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time. The tokenizer recognizes tags, entities, and raw text.

Whitespace that is ignored by browsers is also ignored by instances of this class.

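Each instance has two flags. One indicates whether HTML entities are returned as separate tokens (set with entityMode); the other indicates whether all tokens are returned in lower case (set with lowerCaseMode).

Both flags are false by default, so you can safely ignore the material on HTML entities below unless you call entityMode(true).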

A typical application first constructs an instance of this class and then loops, calling the nextToken method in each iteration until it returns the value TT_EOF.

Note: TT_EOF is also returned when the HTML is invalid (for example, an open-ended tag or entity).
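The loop described above can be sketched as follows. Since the HTMLTokenizer source is not part of this page, the example uses a hypothetical stand-in, SketchTokenizer, that mimics only the documented nextToken()/getToken() contract for tags and raw text. Its constant values, its whitespace trimming, and the names SketchTokenizer and TokenLoopSketch are assumptions made for illustration; with the real class the loop body would have the same shape.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT the HTMLTokenizer source. It handles tags and
// raw text only, and its constant values are assumed.
class SketchTokenizer {
    static final int TT_EOF = -1;
    static final int TT_TAG = 1;
    static final int TT_TEXT = 2;

    private final Reader in;
    private int pending = -2;   // one character of lookahead; -2 means "none"
    private String token;

    SketchTokenizer(Reader in) { this.in = in; }

    private int read() throws IOException {
        if (pending != -2) { int c = pending; pending = -2; return c; }
        return in.read();
    }

    int nextToken() throws IOException {
        token = null;
        int c = read();
        // Skip whitespace between tokens (a simplification of browser rules).
        while (c != -1 && Character.isWhitespace(c)) c = read();
        if (c == -1) return TT_EOF;
        StringBuilder sb = new StringBuilder();
        if (c == '<') {
            // Collect the text between the opening and closing brackets.
            while ((c = read()) != -1 && c != '>') sb.append((char) c);
            if (c == -1) return TT_EOF;   // open-ended tag is reported as end of stream
            token = sb.toString();
            return TT_TAG;
        }
        // Raw text runs until the next tag or the end of the input.
        while (c != -1 && c != '<') { sb.append((char) c); c = read(); }
        if (c == '<') pending = c;        // keep the bracket for the next call
        token = sb.toString().trim();
        return TT_TEXT;
    }

    String getToken() { return token; }
}

class TokenLoopSketch {
    // The typical loop: call nextToken() until it returns TT_EOF.
    static List<String> collect(String html) throws IOException {
        SketchTokenizer t = new SketchTokenizer(new StringReader(html));
        List<String> out = new ArrayList<>();
        int type;
        while ((type = t.nextToken()) != SketchTokenizer.TT_EOF) {
            out.add((type == SketchTokenizer.TT_TAG ? "TAG:" : "TEXT:") + t.getToken());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(collect("<p>Hello <b>world</b></p>"));
    }
}
```

Because an open-ended tag also ends the loop, a caller that needs to tell truncated HTML apart from a clean end of input would have to check for that separately; the documented API does not appear to expose the difference.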

See Also:
java.io.StreamTokenizer

Field Summary
static int TT_ENTITY
          A constant indicating that an HTML entity has been read.
static int TT_EOF
          A constant indicating that the end of the stream has been read.
static int TT_TAG
          A constant indicating that an HTML tag has been read.
static int TT_TEXT
          A constant indicating that raw text has been read.
 
Constructor Summary
HTMLTokenizer(java.io.Reader r)
          Creates an HTMLTokenizer for a Reader input stream.
HTMLTokenizer(java.lang.String url)
          Creates an HTMLTokenizer for a file with a given URL.
 
Method Summary
 void entityMode(boolean flag)
          entityMode(false) indicates that html entities should be treated as normal text.
 java.lang.String getToken()
          If the type is TT_TEXT, TT_TAG, or TT_ENTITY, the corresponding text will be returned.
 void lowerCaseMode(boolean flag)
          lowerCaseMode(true) indicates that all tokens should be returned as lower-case.
 boolean nextEntityMatch(java.lang.String entity)
          nextEntityMatch repeatedly calls nextToken() until EOF is reached or an exact match is found between the specified entity and a type TT_ENTITY token.
 java.lang.String nextEntitySubstring(java.lang.String entity)
          nextEntitySubstring repeatedly calls nextToken() until EOF is reached or the specified entity is found as a substring of an html entity.
 java.lang.String nextEntityToken()
          nextEntityToken repeatedly calls nextToken() until EOF is reached or an entity type token is found.
 boolean nextTagMatch(java.lang.String tag)
          nextTagMatch repeatedly calls nextToken() until EOF is reached or an exact match is found between the specified tag and a type TT_TAG token.
 java.lang.String nextTagSubstring(java.lang.String tag)
          nextTagSubstring repeatedly calls nextToken() until EOF is reached or the specified tag is found as a substring of an html tag.
 java.lang.String nextTagToken()
          nextTagToken repeatedly calls nextToken() until EOF is reached or a tag type token is found.
 boolean nextTextMatch(java.lang.String phrase)
          nextTextMatch repeatedly calls nextToken() until EOF is reached or an exact match is found between the specified phrase and a type TT_TEXT token.
 java.lang.String nextTextSubstring(java.lang.String phrase)
          nextTextSubstring repeatedly calls nextToken() until EOF is reached or the specified phrase is found in the html text.
 java.lang.String nextTextToken()
          nextTextToken repeatedly calls nextToken() until EOF is reached or a text type token is found.
 int nextToken()
          nextToken() returns the type of the token read.
 void pushBack()
          The pushBack() method allows you to 'unread' the last token so that the next call to nextToken() will return the same value.
 java.lang.String toString()
          The method toString() returns a string representation of the current token.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

TT_EOF

public static final int TT_EOF
A constant indicating that the end of the stream has been read. In the case of an open-ended tag or open-ended entity (because of invalid HTML), this constant will be returned.

TT_TAG

public static final int TT_TAG
A constant indicating that an HTML tag has been read.

TT_ENTITY

public static final int TT_ENTITY
A constant indicating that an HTML entity has been read.

TT_TEXT

public static final int TT_TEXT
A constant indicating that raw text has been read.
Constructor Detail

HTMLTokenizer

public HTMLTokenizer(java.io.Reader r)
Creates an HTMLTokenizer for a Reader input stream.

HTMLTokenizer

public HTMLTokenizer(java.lang.String url)
              throws java.io.FileNotFoundException,
                     java.net.MalformedURLException,
                     java.io.IOException
Creates an HTMLTokenizer for a file with a given URL. If the URL string doesn't start with "http://", the string is treated as a local file name.
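The rule for choosing between a URL and a local file can be sketched as below. This is only an illustration of the documented prefix test, not the constructor's actual code; SourceOpenerSketch, isRemote, and open are hypothetical names. Note that under this rule a string such as "https://..." would be treated as a local file name.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;

// Hypothetical helper illustrating the documented "http://" prefix rule.
class SourceOpenerSketch {
    // Only strings that start with "http://" are treated as URLs.
    static boolean isRemote(String url) {
        return url.startsWith("http://");
    }

    // Returns a Reader for either kind of source. The declared IOException
    // also covers its subclasses FileNotFoundException and MalformedURLException.
    static Reader open(String url) throws IOException {
        if (isRemote(url)) {
            // toURL() may throw MalformedURLException; openStream() may throw IOException.
            return new BufferedReader(
                    new InputStreamReader(URI.create(url).toURL().openStream()));
        }
        // FileReader throws FileNotFoundException when the local file is missing.
        return new BufferedReader(new FileReader(url));
    }
}
```

A Reader obtained this way could then be passed to the HTMLTokenizer(java.io.Reader) constructor.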
Method Detail

entityMode

public void entityMode(boolean flag)
entityMode(false) indicates that HTML entities should be treated as normal text; entityMode(true) indicates that they should be returned as separate TT_ENTITY tokens. The default is false.

lowerCaseMode

public void lowerCaseMode(boolean flag)
lowerCaseMode(true) indicates that all tokens should be returned as lower-case. The default is false.

getToken

public java.lang.String getToken()
If the type is TT_TEXT, TT_TAG, or TT_ENTITY, the corresponding text will be returned. Otherwise, null will be returned. For type TT_TEXT, the text returned will be the text between tags and entities (or including entities if entityMode is false). For TT_TAG, the text between the opening and closing brackets will be returned. For TT_ENTITY, the text between the opening ampersand and the closing semi-colon will be returned.
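To make the effect of entityMode on the returned text concrete, here is a small sketch that splits one run of raw text the way the description above suggests. It is an assumption-laden illustration, not the class's code: EntityModeSketch and split are hypothetical names, the "TEXT:"/"ENTITY:" prefixes exist only for display, and trimming the text pieces is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of what getToken() is described as returning
// for a run of text, with and without entityMode.
class EntityModeSketch {
    static List<String> split(String text, boolean entityMode) {
        List<String> out = new ArrayList<>();
        if (!entityMode) {
            // Entities are treated as normal text and stay inside the token.
            out.add("TEXT:" + text);
            return out;
        }
        int pos = 0;
        while (pos < text.length()) {
            int amp = text.indexOf('&', pos);
            if (amp < 0) {                       // no further entity
                addText(out, text.substring(pos));
                break;
            }
            addText(out, text.substring(pos, amp));
            int semi = text.indexOf(';', amp);
            if (semi < 0) break;                 // open-ended entity: documented as TT_EOF
            // Entity text is what lies between the ampersand and the semicolon.
            out.add("ENTITY:" + text.substring(amp + 1, semi));
            pos = semi + 1;
        }
        return out;
    }

    private static void addText(List<String> out, String s) {
        String t = s.trim();
        if (!t.isEmpty()) out.add("TEXT:" + t);
    }

    public static void main(String[] args) {
        System.out.println(split("Fish &amp; chips", false));
        System.out.println(split("Fish &amp; chips", true));
    }
}
```

With entityMode off, the whole string comes back as a single text token; with it on, the entity is reported by name without the surrounding ampersand and semicolon.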

nextToken

public int nextToken()
              throws java.io.IOException
nextToken() returns the type of the token read. To get the string of the token, use getToken().

nextTextToken

public java.lang.String nextTextToken()
                               throws java.io.IOException
nextTextToken repeatedly calls nextToken() until EOF is reached or a text type token is found. If the token is found, the text is returned. Otherwise, null is returned.

nextTagToken

public java.lang.String nextTagToken()
                              throws java.io.IOException
nextTagToken repeatedly calls nextToken() until EOF is reached or a tag type token is found. If the token is found, the text of the tag is returned. Otherwise, null is returned.

nextEntityToken

public java.lang.String nextEntityToken()
                                 throws java.lang.IllegalStateException,
                                        java.io.IOException
nextEntityToken repeatedly calls nextToken() until EOF is reached or an entity type token is found. If the token is found, the text of the entity is returned. Otherwise, null is returned.

nextTextMatch

public boolean nextTextMatch(java.lang.String phrase)
                      throws java.io.IOException,
                             java.lang.IllegalArgumentException
nextTextMatch repeatedly calls nextToken() until EOF is reached or an exact match is found between the specified phrase and a type TT_TEXT token. Use the method lowerCaseMode() to set the case sensitivity. The method returns a boolean value indicating whether a match was found.

nextTextSubstring

public java.lang.String nextTextSubstring(java.lang.String phrase)
                                   throws java.io.IOException,
                                          java.lang.IllegalArgumentException
nextTextSubstring repeatedly calls nextToken() until EOF is reached or the specified phrase is found in the HTML text. Use the method lowerCaseMode() to set the case sensitivity. The method returns the entire String token in which the phrase was found. For example, nextTextSubstring("auction") could return the string "The auction began on Monday, July 5, 2000". If the phrase is not found, null is returned.
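The search behaviour described above can be sketched over a plain sequence of text tokens. This is a hedged illustration only: TextSubstringSketch and its iterator-based signature are hypothetical, and it assumes that lower-case mode lower-cases the tokens but not the phrase, so the phrase should be supplied in lower case when that mode is on.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the documented search: scan text tokens in
// order and return the first whole token that contains the phrase.
class TextSubstringSketch {
    static String nextTextSubstring(Iterator<String> textTokens, String phrase, boolean lowerCase) {
        while (textTokens.hasNext()) {
            String token = textTokens.next();
            // In lower-case mode tokens are returned in lower case (assumed
            // to apply before matching).
            if (lowerCase) token = token.toLowerCase(Locale.ROOT);
            if (token.contains(phrase)) return token;   // the entire token, not just the phrase
        }
        return null;                                    // ran out of tokens: not found
    }

    public static void main(String[] args) {
        Iterator<String> tokens = List.of(
                "Lot 12",
                "The auction began on Monday, July 5, 2000",
                "Bids close Friday").iterator();
        System.out.println(nextTextSubstring(tokens, "auction", false));
    }
}
```

As with the documented method, tokens skipped during the search are consumed, so a later call continues from the token after the match.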

nextTagMatch

public boolean nextTagMatch(java.lang.String tag)
                     throws java.io.IOException,
                            java.lang.IllegalArgumentException
nextTagMatch repeatedly calls nextToken() until EOF is reached or an exact match is found between the specified tag and a type TT_TAG token. Use the method lowerCaseMode() to set the case sensitivity. The method returns a boolean value indicating whether a match was found.

nextTagSubstring

public java.lang.String nextTagSubstring(java.lang.String tag)
                                  throws java.io.IOException,
                                         java.lang.IllegalArgumentException
nextTagSubstring repeatedly calls nextToken() until EOF is reached or the specified tag is found as a substring of an html tag. Use the method lowerCaseMode() to set the case sensitivity. The method returns the entire text of the tag in which it was found or null if it isn't found.

nextEntityMatch

public boolean nextEntityMatch(java.lang.String entity)
                        throws java.io.IOException,
                               java.lang.IllegalArgumentException,
                               java.lang.IllegalStateException
nextEntityMatch repeatedly calls nextToken() until EOF is reached or an exact match is found between the specified entity and a type TT_ENTITY token. Use the method lowerCaseMode() to set the case sensitivity. The method returns a boolean value indicating whether a match was found.

nextEntitySubstring

public java.lang.String nextEntitySubstring(java.lang.String entity)
                                     throws java.io.IOException,
                                            java.lang.IllegalArgumentException,
                                            java.lang.IllegalStateException
nextEntitySubstring repeatedly calls nextToken() until EOF is reached or the specified entity is found as a substring of an html entity. Use the method lowerCaseMode() to set the case sensitivity. The method returns the entire text of the entity in which it was found or null if it isn't found.

pushBack

public void pushBack()
The pushBack() method allows you to 'unread' the last token so that the next call to nextToken() will return the same value.
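The same 'unread' idiom exists in java.io.StreamTokenizer, which this class lists under See Also, so it can be demonstrated with the standard library alone. The sketch below uses StreamTokenizer rather than HTMLTokenizer; it assumes, without confirmation from this page, that HTMLTokenizer.pushBack() behaves analogously. PushBackSketch is a hypothetical name.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

// Demonstrates pushBack() with java.io.StreamTokenizer (standard library),
// used here as an analogue of the HTMLTokenizer method.
class PushBackSketch {
    static String[] readWithPushBack(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.nextToken();
        String first = st.sval;     // first word
        st.pushBack();              // 'unread' it
        st.nextToken();
        String again = st.sval;     // the same word is delivered again
        st.nextToken();
        String second = st.sval;    // now the tokenizer moves on
        return new String[] { first, again, second };
    }

    public static void main(String[] args) throws IOException {
        System.out.println(String.join(", ", readWithPushBack("alpha beta")));
    }
}
```

This is useful when a loop needs to peek at a token to decide which handler should consume it.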

toString

public java.lang.String toString()
The method toString() returns a string representation of the current token.
Overrides:
toString in class java.lang.Object