--- doc/gutshtml/SessionFou1.html 2002/06/28 20:30:29 1.1 +++ doc/gutshtml/SessionFou1.html 2003/07/22 14:47:00 1.2 @@ -1,430 +1,860 @@ - - - - - -Session Four: XML Handler (Simple tags, Globals, Multiple Targets, Style -Files) (Guy) - - - -
-

Session Four: XML Handler (Simple tags, Globals, Multiple Targets, Style - Files) (Guy)

-

XML Files

-

All HTML / XML files are run through the lonxml - handler before being served to a user. This allows us to rewrite many portions - of a document and to support server-side tags. There are two ways to add new - tags to the XML parsing engine: through LON-CAPA style files or by - writing Perl tag handlers for the desired tags.

-

Global Variables

-

*          - $Apache::lonxml::debug - debugging control

-

*          - @Apache::lonxml::pwd - path to the directory containing the file currently being - processed

-

*          - @Apache::lonxml::outputstack

-

$Apache::lonxml::redirection - this and @Apache::lonxml::outputstack are used for capturing a subset of the output - for later processing. Do not touch them directly; use &startredirection - and &endredirection instead

-

*          - $Apache::lonxml::import - controls whether the <import> tag actually does anything -

-

*          - @Apache::lonxml::extlinks - a list of URLs that the user is allowed to look at because - of the current resource (images, and links)

-

*          - $Apache::lonxml::metamode - some output is turned off because the meta target wants a specific - subset; use <output> to guarantee that the contained data will be in - the parsing output

-

*          - $Apache::lonxml::evaluate - controls whether run::evaluate actually dereferences variable - references

-

*          - %Apache::lonxml::insertlist - data structure for edit mode, determines what tags can - go into what other tags

-

*          - @Apache::lonxml::namespace - stores the list of tag namespaces used in the insertlist.tab - file that are currently active, used only in edit mode.

-

*          - $Apache::lonxml::registered - set to 1 once the remote has been updated to know what - resource we are looking at.

-

*          - $Apache::lonxml::request - current Apache request object, or undef

-

*          - $Apache::lonxml::curdepth - current depth in the overall parse. It is a string - like 2_3_1 (the first tag in the third second-level tag in the second top-level - tag). It is set by callsub and can be used in Perl tag implementations. - It relies on the internal globals @Apache::lonxml::depthcounter, $Apache::lonxml::depth, and $Apache::lonxml::olddepth

-

*          - $Apache::lonxml::prevent_entity_encode - By default the XML parser will try to re-encode any 8-bit - characters into HTML entity codes. If this is set to a true value, that re-encoding is - prevented.

-

In common usage, $Apache::lonxml::prevent_entity_encode, $Apache::lonxml::evaluate, $Apache::lonxml::metamode, and $Apache::lonxml::import should never be set to a value directly; instead, increment them - when you want the effect on and decrement them when you want the effect off. -

-

Notable Perl subroutines

-

If not specified these functions are in Apache::lonxml -

-

*          - xmlparse - see the XMLPARSE figure - also not callable from inside - a tag. If one needs to restart parsing, either add a new LCParser to - the parser stack using the newparser function, or call inner_xmlparse; - see the xmlparse function in scripttag.pm

-

*          - recurse - acts just like xmlparse, except that it doesn't do the style definition check; it always - calls callsub

-

*          - callsub - checks whether a Perl subroutine is defined for the current - tag and calls it. Otherwise it just returns the tag as it was read in. It also - adds a default editing interface unless the tag has a defined subroutine - that either returns something or requests that callsub not add the editing - interface.

-

*          - afterburn - called on the output of xmlparse; it can add highlights, - anchors, and links to regular expression matches in the output.

-

*          - register_insert - builds the %Apache::lonxml::insertlist structure of what - tags can have what other tags inside.

-

*          - whichuser - returns a list of $symb, $courseid, $domain, $name that - is correct for calls to lonnet functions for this setup. Uses form.grade_ - parameters, if the user is allowed to mgr in the course

-

*          - setup_globals - initializes all lonxml globals when xmlparse is called. - If you intend to create a new target you will likely need to tweak how the - globals are setup upon start up.

-

*          - init_safespace - creates Holes to external functions, creates some global - variables, and sets the permitted operators of the global Safe space interpreter. -

-

Functions Tag Handlers can use

-

If not specified these functions are in Apache::lonxml -

-

*          - debug - a function to call to print out debugging messages. It will - only print when $Apache::lonxml::debug is set to 1

-

*          - warning - a function to use for warning messages. The message will - appear at the top of a resource when it is viewed in construction space only. -

-

*          - error - a function to use for error messages. The message will - appear at the top of a resource when it is viewed in construction space, and - will message the resource author and course instructor, while informing the - student that an error has occurred otherwise.

-

*          - get_all_text - 2 args, tag to look for (need to use /tag to look for an - end tag) and an HTML::TokeParser reference. It will repeatedly get text from - the TokeParser until the requested tag is found, and it will return all of the - document it pulled from the TokeParser. (See Apache::scripttag::start_script - for an example of usage.)

-

*          - get_param - 4 arguments, first is a scalar string of the argument needed, - second is a reference to the parser arguments stack, third is a reference - to the Safe space, and fourth is an optional "context" value. This - subroutine allows a tag to get a tag argument, after being interpolated inside - the Safe space. This should be used if the tag might use a safe space variable - reference for the tag argument. (See Apache::scripttag::start_script for an - example.) This version only handles scalar variables.

-

*          - get_param_var - 4 arguments, first is a scalar string of the argument needed, - second is a reference to the parser arguments stack, third is a reference - to the Safe space, and fourth is an optional "context" value. This - subroutine allows a tag to get a tag argument, after being interpolated inside - the Safe space. This should be used if the tag might use a safe space variable - reference for the tag argument. (See Apache::scripttag::start_script for an - example.) This version can handle list or hash variables properly.

-

*          - description - 1 argument, the token object. This will return the textual - description of the current tag from the insertlist.tab file.

-

*          - whichuser - 0 arguments. This will take a look at the current environment - setting and return the current $symb, $courseid, $udom, $uname. You should - always use this function if you want to determine who the current user is. - (An instructor might be trying to view a student's version of a resource.) -

-

*          - inner_xmlparse - 6 arguments, the target, an array pointer to the current - stack of tags, an array pointer to the current stack of tag arguments, an - array pointer to the current stack of LCParsers, a pointer to the current - Safe space, a pointer to the hash of current style definitions

-

*          - newparser - 3 args, first is a reference to the parser stack, second - should be a reference to a scalar string containing the text the new parser should - run over, third should be a scalar holding the path of the directory that contains the file - being parsed. (See Apache::scripttag::start_import for an example.)

-

*          - register - should be called in a file's BEGIN block. 2 arguments, - a scalar string and a list of strings. This allows a file to register which - tags it handles and what the namespace of those tags is. Example:

-

sub BEGIN {

-

  &Apache::lonxml::register('Apache::scripttag',('script','display'));

-

}

-

This tells xmlparse that in Apache::scripttag it - can find handlers for <script> and <display>. If one registers - a tag that was already registered, the previous registration is remembered and will - be restored on a deregister.

-

*          - deregister - used to remove a previously registered tag implementation. - It will restore the previous registration if there was one.

-

*          - startredirection - used when a tag wants to save a portion of the document - for its end tag to use, but wants the intervening document to be normally - processed. (See Apache::scripttag::start_window for an example.)

-

*          - endredirection - used to stop xmlparse from hiding output. The - return value is everything that xmlparse has processed since the corresponding - startredirection. (See Apache::scripttag::end_window for an example.)

-

*          - Apache::run::evaluate - 3 args, first a string, second a reference to the Safe - space, third a string to be evaluated before the first arg. This subroutine will - do variable interpolation and simple function interpolations on the first - argument. (See Apache::lonxml::inner_xmlparse for an example.)

-

*          - Apache::run::run - 2 args, first a string, second a reference to the Safe - space. This handles passing the passed string into the Safe space for evaluation - and then returns the result. (See Apache::scripttag::start_script for an example.)

-

Style Files

-

-

Fig. 2.4.1 - Using a style file

-

Style File specific tags

-

<definetag> - 2 arguments: name, - the name of the new tag being defined (if preceded by a /, it defines an end tag), - required; parms, the parameters of the - new tag. The value of each parameter can be accessed as $parametername.

-

*          - <render> - define what the new tag does for a non meta target

-

*          - <meta> - define what the new tag does for a meta target

-

*          - <tex> / <web> / <latexsource> - - define what a new tag does for a specific non-meta target. All data inside - a <render> is rendered to all targets except when surrounded by a specific - target tag.

-

-

Fig. 2.4.2 - The parser

-

HTML::LCParser - Alternative HTML::Parser interface

-

SYNOPSIS

-

 require HTML::LCParser;

-

 $p = HTML::LCParser->new("index.html") - || die "Can't open: $!";

-

 while (my $token = $p->get_token) {

-

     #...

-

 }

-

DESCRIPTION

-

The C<HTML::LCParser> is an alternative interface - to the

-

C<HTML::Parser> class.  It is an C<HTML::PullParser> - subclass.

-

The following methods are available:

-

* $p = HTML::LCParser->new( $file_or_doc );

-

The object constructor argument is either a file name, - a file handle

-

object, or the complete document to be parsed.

-

If the argument is a plain scalar, then it is taken as - the name of a

-

file to be opened and parsed.  If the file can't - be opened for

-

reading, then the constructor will return an undefined - value and $!

-

will tell you why it failed.

-

If the argument is a reference to a plain scalar, then - this scalar is

-

taken to be the literal document to parse.  The value - of this

-

scalar should not be changed before all tokens have been - extracted.

-

Otherwise the argument is taken to be some object that - the

-

C<HTML::LCParser> can read() from when it needs - more data.  Typically

-

it will be a filehandle of some kind.  The stream - will be read() until

-

EOF, but not closed.

-

It also will turn attr_encoded on by default.

-

* $p->get_token

-

This method will return the next I<token> found - in the HTML document,

-

or C<undef> at the end of the document.  The - token is returned as an

-

array reference.  The first element of the array - will be a (mostly)

-

single character string denoting the type of this token: - "S" for start

-

tag, "E" for end tag, "T" for text, - "C" for comment, "D" for

-

declaration, and "PI" for processing instructions.  - The rest of the array

-

is the same as the arguments passed to the corresponding - HTML::Parser

-

v2 compatible callbacks (see L<HTML::Parser>).  - In summary, returned

-

tokens look like this:

-

  ["S",  $tag, $attr, $attrseq, $text, - $line]

-

  ["E",  $tag, $text, $line]

-

  ["T",  $text, $is_data, $line]

-

  ["C",  $text, $line]

-

  ["D",  $text, $line]

-

  ["PI", $token0, $text, $line]

-

where $attr is a hash reference, $attrseq is an array - reference and

-

the rest are plain scalars.

-

* $p->unget_token($token,...)

-

If you find out you have read too many tokens you can - push them back,

-

so that they are returned the next time $p->get_token - is called.

-

* $p->get_tag( [$tag, ...] )

-

This method returns the next start or end tag (skipping - any other

-

tokens), or C<undef> if there are no more tags in - the document.  If

-

one or more arguments are given, then we skip tokens until - one of the

-

specified tag types is found.  For example:

-

   $p->get_tag("font", "/font");

-

will find the next start or end tag for a font-element.

-

The tag information is returned as an array reference - in the same form

-

as for $p->get_token above, but the type code (first - element) is

-

missing. A start tag will be returned like this:

-

  [$tag, $attr, $attrseq, $text]

-

The tag names of end tags are prefixed with "/", - i.e. an end tag is

-

returned like this:

-

  ["/$tag", $text]

-

* $p->get_text( [$endtag] )

-

This method returns all text found at the current position. - It will

-

return a zero length string if the next token is not text.  - The

-

optional $endtag argument specifies that any text occurring - before the

-

given tag is to be returned. All entities are unmodified.

-

The $p->{textify} attribute is a hash that defines - how certain tags can

-

be treated as text.  If the name of a start tag matches - a key in this

-

hash then this tag is converted to text.  The hash - value is used to

-

specify which tag attribute to obtain the text from.  - If this tag

-

attribute is missing, then the upper case name of the - tag enclosed in

-

brackets is returned, e.g. "[IMG]".  The - hash value can also be a

-

subroutine reference.  In this case the routine is - called with the

-

start tag token content as its argument and the return - value is treated

-

as the text.

-

The default $p->{textify} value is:

-

  {img => "alt", applet => "alt"}

-

This means that <IMG> and <APPLET> tags are - treated as text, and that

-

the text to substitute can be found in the ALT attribute.

-

* $p->get_trimmed_text( [$endtag] )

-

Same as $p->get_text above, but will collapse any sequences - of white

-

space to a single space character.  Leading and trailing - white space is

-

removed.

-

EXAMPLES

-

This example extracts all links from a document.  - It will print one

-

line for each link, containing the URL and the textual - description

-

between the <A>...</A> tags:

-

  use HTML::LCParser;

-

  $p = HTML::LCParser->new(shift||"index.html");

-

  while (my $token = $p->get_tag("a")) - {

-

      my $url = $token->[1]{href} - || "-";

-

      my $text = $p->get_trimmed_text("/a");

-

      print "$url\t$text\n";

-

  }

-

This example extracts the <TITLE> from the document:

-

  use HTML::LCParser;

-

  $p = HTML::LCParser->new(shift||"index.html");

-

  if ($p->get_tag("title")) {

-

      my $title = $p->get_trimmed_text;

-

      print "Title: $title\n";

-

  }

-
-
-
- - + + + + + + + + + + +Session Four: XML Handler (Simple tags, Globals, Multiple Targets, Style + +Files) (Guy) + + + + + + + +
+ +

Session Four: XML Handler (Simple tags, Globals, Multiple Targets, Style + + Files) (Guy)

+ +

XML Files

+ +

All HTML / XML files are run through the lonxml + + handler before being served to a user. This allows us to rewrite many portions + + of a document and to support server-side tags. There are two ways to add new + + tags to the XML parsing engine: through LON-CAPA style files or by + + writing Perl tag handlers for the desired tags.

+ +

Global Variables

+ +

*          + + $Apache::lonxml::debug - debugging control

+ +

*          + + @Apache::lonxml::pwd - path to the directory containing the file currently being + + processed

+ +

*          + + @Apache::lonxml::outputstack

+ +

$Apache::lonxml::redirection - this and @Apache::lonxml::outputstack are used for capturing a subset of the output + + for later processing. Do not touch them directly; use &startredirection + + and &endredirection instead

+ +

*          + + $Apache::lonxml::import - controls whether the <import> tag actually does anything + +

+ +

*          + + @Apache::lonxml::extlinks - a list of URLs that the user is allowed to look at because + + of the current resource (images, and links)

+ +

*          + + $Apache::lonxml::metamode - some output is turned off because the meta target wants a specific + + subset; use <output> to guarantee that the contained data will be in + + the parsing output

+ +

*          + + $Apache::lonxml::evaluate - controls whether run::evaluate actually dereferences variable + + references

+ +

*          + + %Apache::lonxml::insertlist - data structure for edit mode, determines what tags can + + go into what other tags

+ +

*          + + @Apache::lonxml::namespace - stores the list of tag namespaces used in the insertlist.tab + + file that are currently active, used only in edit mode.

+ +

*          + + $Apache::lonxml::registered - set to 1 once the remote has been updated to know what + + resource we are looking at.

+ +

*          + + $Apache::lonxml::request - current Apache request object, or undef

+ +

*          + + $Apache::lonxml::curdepth - current depth in the overall parse. It is a string + + like 2_3_1 (the first tag in the third second-level tag in the second top-level + + tag). It is set by callsub and can be used in Perl tag implementations. + + It relies on the internal globals @Apache::lonxml::depthcounter, $Apache::lonxml::depth, and $Apache::lonxml::olddepth

+ +

*          + + $Apache::lonxml::prevent_entity_encode - By default the XML parser will try to re-encode any 8-bit + + characters into HTML entity codes. If this is set to a true value, that re-encoding is + + prevented.
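The 2_3_1 form of $Apache::lonxml::curdepth can be illustrated with a small stand-alone sketch. It is only an assumption about how such a string might be derived from a per-level counter array; the names @depthcounter, $depth, open_tag and close_tag are stand-ins chosen for this illustration and are not taken from lonxml.pm.

```perl
use strict;
use warnings;

my @depthcounter;   # stand-in: one counter per nesting level
my $depth = -1;     # stand-in: index of the current nesting level

# Entering a tag: go one level deeper and count this tag at that level.
sub open_tag {
    $depth++;
    $depthcounter[$depth]++;
    return join('_', @depthcounter[0 .. $depth]);
}

# Leaving a tag: forget the counters of its children, then step back up.
sub close_tag {
    $#depthcounter = $depth;
    $depth--;
}

open_tag();  close_tag();                      # first top-level tag
open_tag();                                    # second top-level tag
open_tag();  close_tag();                      # its first child
open_tag();  close_tag();                      # its second child
open_tag();                                    # its third child
my $curdepth = open_tag();                     # first tag inside that child
print "$curdepth\n";
```

Under these assumptions the final string reads as "second top-level tag, third second-level tag, first tag inside it", matching the description of curdepth.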

+ +

In common usage, $Apache::lonxml::prevent_entity_encode, $Apache::lonxml::evaluate, $Apache::lonxml::metamode, and $Apache::lonxml::import should never be set to a value directly; instead, increment them + + when you want the effect on and decrement them when you want the effect off. + +
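The increment/decrement convention matters when handlers nest: an inner handler that switches an effect off must not cancel the request of an outer one. The following is a minimal sketch of that pattern. The helper with_flag_on is hypothetical, and the script only declares its own stand-in for $Apache::lonxml::metamode rather than loading lonxml.

```perl
use strict;
use warnings;

# Stand-in for the real global, which lives in Apache::lonxml.
$Apache::lonxml::metamode = 0;

# Hypothetical helper: run $code with the effect on, then undo our own request.
sub with_flag_on {
    my ($flag_ref, $code) = @_;
    $$flag_ref++;          # switch on without clobbering an outer caller
    $code->();
    $$flag_ref--;          # switch off again; outer requests stay counted
    return;
}

my @seen;
with_flag_on(\$Apache::lonxml::metamode, sub {
    push @seen, $Apache::lonxml::metamode;             # on
    with_flag_on(\$Apache::lonxml::metamode, sub {
        push @seen, $Apache::lonxml::metamode;         # nested request
    });
    push @seen, $Apache::lonxml::metamode;             # still on after inner exit
});
push @seen, $Apache::lonxml::metamode;                 # off again
print "@seen\n";
```

Had the inner block assigned 0 on exit instead of decrementing, the outer block would have lost the effect midway through.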

+ +

Notable Perl subroutines

+ +

If not specified these functions are in Apache::lonxml + +

+ +

*          + + xmlparse - see the XMLPARSE figure - also not callable from inside + + a tag. If one needs to restart parsing, either add a new LCParser to + + the parser stack using the newparser function, or call inner_xmlparse; + + see the xmlparse function in scripttag.pm

+ +

*          + + recurse - acts just like xmlparse, except that it doesn't do the style definition check; it always + + calls callsub

+ +

*          + + callsub - checks whether a Perl subroutine is defined for the current + + tag and calls it. Otherwise it just returns the tag as it was read in. It also + + adds a default editing interface unless the tag has a defined subroutine + + that either returns something or requests that callsub not add the editing + + interface.

+ +

*          + + afterburn - called on the output of xmlparse; it can add highlights, + + anchors, and links to regular expression matches in the output.

+ +

*          + + register_insert - builds the %Apache::lonxml::insertlist structure of what + + tags can have what other tags inside.

+ +

*          + + whichuser - returns a list of $symb, $courseid, $domain, $name that + + is correct for calls to lonnet functions for this setup. Uses form.grade_ + + parameters, if the user is allowed to mgr in the course

+ +

*          + + setup_globals - initializes all lonxml globals when xmlparse is called. + + If you intend to create a new target you will likely need to tweak how the + + globals are setup upon start up.

+ +

*          + + init_safespace - creates Holes to external functions, creates some global + + variables, and sets the permitted operators of the global Safe space interpreter. + +

+ +

Functions Tag Handlers can use

+ +

If not specified these functions are in Apache::lonxml + +

+ +

*          + + debug - a function to call to print out debugging messages. It will + + only print when $Apache::lonxml::debug is set to 1

+ +

*          + + warning - a function to use for warning messages. The message will + + appear at the top of a resource when it is viewed in construction space only. + +

+ +

*          + + error - a function to use for error messages. The message will + + appear at the top of a resource when it is viewed in construction space, and + + will message the resource author and course instructor, while informing the + + student that an error has occurred otherwise.

+ +

*          + + get_all_text - 2 args, tag to look for (need to use /tag to look for an + + end tag) and an HTML::TokeParser reference. It will repeatedly get text from + + the TokeParser until the requested tag is found, and it will return all of the + + document it pulled from the TokeParser. (See Apache::scripttag::start_script + + for an example of usage.)

+ +

*          + + get_param - 4 arguments, first is a scalar string of the argument needed, + + second is a reference to the parser arguments stack, third is a reference + + to the Safe space, and fourth is an optional "context" value. This + + subroutine allows a tag to get a tag argument, after being interpolated inside + + the Safe space. This should be used if the tag might use a safe space variable + + reference for the tag argument. (See Apache::scripttag::start_script for an + + example.) This version only handles scalar variables.

+ +

*          + + get_param_var - 4 arguments, first is a scalar string of the argument needed, + + second is a reference to the parser arguments stack, third is a reference + + to the Safe space, and fourth is an optional "context" value. This + + subroutine allows a tag to get a tag argument, after being interpolated inside + + the Safe space. This should be used if the tag might use a safe space variable + + reference for the tag argument. (See Apache::scripttag::start_script for an + + example.) This version can handle list or hash variables properly.

+ +

*          + + description - 1 argument, the token object. This will return the textual + + description of the current tag from the insertlist.tab file.

+ +

*          + + whichuser - 0 arguments. This will take a look at the current environment + + setting and return the current $symb, $courseid, $udom, $uname. You should + + always use this function if you want to determine who the current user is. + + (An instructor might be trying to view a student's version of a resource.) + +

+ +

*          + + inner_xmlparse - 6 arguments, the target, an array pointer to the current + + stack of tags, an array pointer to the current stack of tag arguments, an + + array pointer to the current stack of LCParsers, a pointer to the current + + Safe space, a pointer to the hash of current style definitions

+ +

*          + + newparser - 3 args, first is a reference to the parser stack, second + + should be a reference to a scalar string containing the text the new parser should + + run over, third should be a scalar holding the path of the directory that contains the file + + being parsed. (See Apache::scripttag::start_import for an example.)

+ +

*          + + register - should be called in a file's BEGIN block. 2 arguments, + + a scalar string and a list of strings. This allows a file to register which + + tags it handles and what the namespace of those tags is. Example:

+ +

sub BEGIN {

+ +

  &Apache::lonxml::register('Apache::scripttag',('script','display'));

+ +

}

+ +

This tells xmlparse that in Apache::scripttag it + + can find handlers for <script> and <display>. If one registers + + a tag that was already registered, the previous registration is remembered and will + + be restored on a deregister.

+ +

*          + + deregister - used to remove a previously registered tag implementation. + + It will restore the previous registration if there was one.

+ +

*          + + startredirection - used when a tag wants to save a portion of the document + + for its end tag to use, but wants the intervening document to be normally + + processed. (See Apache::scripttag::start_window for an example.)

+ +

*          + + endredirection - used to stop xmlparse from hiding output. The + + return value is everything that xmlparse has processed since the corresponding + + startredirection. (See Apache::scripttag::end_window for an example.)

+ +

*          + + Apache::run::evaluate - 3 args, first a string, second a reference to the Safe + + space, third a string to be evaluated before the first arg. This subroutine will + + do variable interpolation and simple function interpolations on the first + + argument. (See Apache::lonxml::inner_xmlparse for an example.)

+ +

*          + + Apache::run::run - 2 args, first a string, second a reference to the Safe + + space. This handles passing the passed string into the Safe space for evaluation + + and then returns the result. (See Apache::scripttag::start_script for an example.)
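The register / deregister behaviour described in the list above (a later registration shadows an earlier one, and deregistering restores it) can be pictured as one stack of namespaces per tag. This is a self-contained sketch of that idea only. The %alltags hash, the handler_for helper, and the My::override namespace are assumptions made for the illustration, not the bookkeeping lonxml actually uses.

```perl
use strict;
use warnings;

my %alltags;   # hypothetical: tag name => stack of namespaces, newest last

# Record that $space handles each of @tags, on top of earlier registrations.
sub register {
    my ($space, @tags) = @_;
    push @{ $alltags{$_} }, $space for @tags;
    return;
}

# Remove the most recent registration of $space for each tag.
sub deregister {
    my ($space, @tags) = @_;
    for my $tag (@tags) {
        my $stack = $alltags{$tag} or next;
        for my $i (reverse 0 .. $#$stack) {
            if ($stack->[$i] eq $space) { splice @$stack, $i, 1; last; }
        }
        delete $alltags{$tag} unless @$stack;
    }
    return;
}

# Hypothetical lookup: the namespace currently responsible for a tag.
sub handler_for {
    my ($tag) = @_;
    my $stack = $alltags{$tag} or return undef;
    return $stack->[-1];
}

register('Apache::scripttag', 'script', 'display');
register('My::override', 'script');              # shadows the first one
my $during = handler_for('script');
deregister('My::override', 'script');            # previous one is restored
my $after = handler_for('script');
print "$during -> $after\n";
```

Keeping a stack rather than a single slot is what lets a temporary override be undone without knowing what it replaced.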

+ +

Style Files

+ +

+ +

Fig. 2.4.1 - Using a style file

+ +

Style File specific tags

+ +

<definetag> - 2 arguments: name, + + the name of the new tag being defined (if preceded by a /, it defines an end tag), + + required; parms, the parameters of the + + new tag. The value of each parameter can be accessed as $parametername.

+ +

*          + + <render> - define what the new tag does for a non meta target

+ +

*          + + <meta> - define what the new tag does for a meta target

+ +

*          + + <tex> / <web> / <latexsource> + + - define what a new tag does for a specific non-meta target. All data inside + + a <render> is rendered to all targets except when surrounded by a specific + + target tag.

+ +

+ +

Fig. 2.4.2 - The parser

+ +

HTML::LCParser - Alternative HTML::Parser interface

+ +

SYNOPSIS

+ +

 require HTML::LCParser;

+ +

 $p = HTML::LCParser->new("index.html") + + || die "Can't open: $!";

+ +

 while (my $token = $p->get_token) {

+ +

     #...

+ +

 }

+ +

DESCRIPTION

+ +

The C<HTML::LCParser> is an alternative interface + + to the

+ +

C<HTML::Parser> class.  It is an C<HTML::PullParser> + + subclass.

+ +

The following methods are available:

+ +

* $p = HTML::LCParser->new( $file_or_doc );

+ +

The object constructor argument is either a file name, + + a file handle

+ +

object, or the complete document to be parsed.

+ +

If the argument is a plain scalar, then it is taken as + + the name of a

+ +

file to be opened and parsed.  If the file can't + + be opened for

+ +

reading, then the constructor will return an undefined + + value and $!

+ +

will tell you why it failed.

+ +

If the argument is a reference to a plain scalar, then + + this scalar is

+ +

taken to be the literal document to parse.  The value + + of this

+ +

scalar should not be changed before all tokens have been + + extracted.

+ +

Otherwise the argument is taken to be some object that + + the

+ +

C<HTML::LCParser> can read() from when it needs + + more data.  Typically

+ +

it will be a filehandle of some kind.  The stream + + will be read() until

+ +

EOF, but not closed.

+ +

It also will turn attr_encoded on by default.

+ +

* $p->get_token

+ +

This method will return the next I<token> found + + in the HTML document,

+ +

or C<undef> at the end of the document.  The + + token is returned as an

+ +

array reference.  The first element of the array + + will be a (mostly)

+ +

single character string denoting the type of this token: + + "S" for start

+ +

tag, "E" for end tag, "T" for text, + + "C" for comment, "D" for

+ +

declaration, and "PI" for processing instructions.  + + The rest of the array

+ +

is the same as the arguments passed to the corresponding + + HTML::Parser

+ +

v2 compatible callbacks (see L<HTML::Parser>).  + + In summary, returned

+ +

tokens look like this:

+ +

  ["S",  $tag, $attr, $attrseq, $text, + + $line]

+ +

  ["E",  $tag, $text, $line]

+ +

  ["T",  $text, $is_data, $line]

+ +

  ["C",  $text, $line]

+ +

  ["D",  $text, $line]

+ +

  ["PI", $token0, $text, $line]

+ +

where $attr is a hash reference, $attrseq is an array + + reference and

+ +

the rest are plain scalars.
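To make the token shapes above concrete, here is a small sketch that dispatches on the type code. It does not load HTML::LCParser. The tokens are built by hand in the listed form, and describe_token is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;

# Summarize one token given in the array-reference form listed above.
sub describe_token {
    my ($token) = @_;
    my ($type, @rest) = @$token;
    if ($type eq 'S') {
        my ($tag, $attr, $attrseq, $text, $line) = @rest;
        # $attrseq keeps the attribute names in source order.
        my @pairs = map { "$_=$attr->{$_}" } @$attrseq;
        return "start $tag (" . join(', ', @pairs) . ") line $line";
    }
    if ($type eq 'E') {
        my ($tag, $text, $line) = @rest;
        return "end $tag line $line";
    }
    if ($type eq 'T') {
        my ($text, $is_data, $line) = @rest;
        return "text '$text' line $line";
    }
    return "other ($type)";    # "C", "D" and "PI" tokens
}

# Hand-built tokens standing in for what $p->get_token would return.
my @tokens = (
    [ 'S', 'a', { href => 'x.html' }, ['href'], '<a href="x.html">', 1 ],
    [ 'T', 'here', 0, 1 ],
    [ 'E', 'a', '</a>', 1 ],
);
my @summary = map { describe_token($_) } @tokens;
print "$_\n" for @summary;
```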

+ +

* $p->unget_token($token,...)

+ +

If you find out you have read too many tokens you can + + push them back,

+ +

so that they are returned the next time $p->get_token + + is called.

+ +

* $p->get_tag( [$tag, ...] )

+ +

This method returns the next start or end tag (skipping + + any other

+ +

tokens), or C<undef> if there are no more tags in + + the document.  If

+ +

one or more arguments are given, then we skip tokens until + + one of the

+ +

specified tag types is found.  For example:

+ +

   $p->get_tag("font", "/font");

+ +

will find the next start or end tag for a font-element.

+ +

The tag information is returned as an array reference + + in the same form

+ +

as for $p->get_token above, but the type code (first + + element) is

+ +

missing. A start tag will be returned like this:

+ +

  [$tag, $attr, $attrseq, $text]

+ +

The tag names of end tags are prefixed with "/", + + i.e. an end tag is

+ +

returned like this:

+ +

  ["/$tag", $text]

+ +

* $p->get_text( [$endtag] )

+ +

This method returns all text found at the current position. + + It will

+ +

return a zero length string if the next token is not text.  + + The

+ +

optional $endtag argument specifies that any text occurring + + before the

+ +

given tag is to be returned. All entities are unmodified.

+ +

The $p->{textify} attribute is a hash that defines + + how certain tags can

+ +

be treated as text.  If the name of a start tag matches + + a key in this

+ +

hash then this tag is converted to text.  The hash + + value is used to

+ +

specify which tag attribute to obtain the text from.  + + If this tag

+ +

attribute is missing, then the upper case name of the + + tag enclosed in

+ +

brackets is returned, e.g. "[IMG]".  The + + hash value can also be a

+ +

subroutine reference.  In this case the routine is + + called with the

+ +

start tag token content as its argument and the return + + value is treated

+ +

as the text.

+ +

The default $p->{textify} value is:

+ +

  {img => "alt", applet => "alt"}

+ +

This means that <IMG> and <APPLET> tags are + + treated as text, and that

+ +

the text to substitute can be found in the ALT attribute.
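The textify rules can be restated as a few lines of stand-alone Perl. This is a sketch of the lookup as described above, not the parser's own code: textify_start_tag is a hypothetical helper, and the hr entry is an invented example of the subroutine form.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a start tag into text following the rules above.
# @token holds the start tag token content: ($tag, $attr, $attrseq, $text).
sub textify_start_tag {
    my ($textify, @token) = @_;
    my ($tag, $attr) = @token;
    my $how = $textify->{$tag};
    return '' unless defined $how;                 # tag is not treated as text
    return $how->(@token) if ref $how eq 'CODE';   # subroutine form
    # Attribute form: fall back to the bracketed upper-case tag name.
    return defined $attr->{$how} ? $attr->{$how} : '[' . uc($tag) . ']';
}

my %textify = (
    img    => 'alt',
    applet => 'alt',
    hr     => sub { '----' },                      # invented example entry
);

my $with_alt    = textify_start_tag(\%textify, 'img', { alt => 'Logo' }, ['alt'], '<img alt="Logo">');
my $without_alt = textify_start_tag(\%textify, 'img', {}, [], '<img>');
my $from_sub    = textify_start_tag(\%textify, 'hr', {}, [], '<hr>');
print "$with_alt $without_alt $from_sub\n";
```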

+ +

* $p->get_trimmed_text( [$endtag] )

+ +

Same as $p->get_text above, but will collapse any sequences + + of white

+ +

space to a single space character.  Leading and trailing + + white space is

+ +

removed.

+ +

EXAMPLES

+ +

This example extracts all links from a document.  + + It will print one

+ +

line for each link, containing the URL and the textual + + description

+ +

between the <A>...</A> tags:

+ +

  use HTML::LCParser;

+ +

  $p = HTML::LCParser->new(shift||"index.html");

+ +

  while (my $token = $p->get_tag("a")) + + {

+ +

      my $url = $token->[1]{href} + + || "-";

+ +

      my $text = $p->get_trimmed_text("/a");

+ +

      print "$url\t$text\n";

+ +

  }

+ +

This example extracts the <TITLE> from the document:

+ +

  use HTML::LCParser;

+ +

  $p = HTML::LCParser->new(shift||"index.html");

+ +

  if ($p->get_tag("title")) {

+ +

      my $title = $p->get_trimmed_text;

+ +

      print "Title: $title\n";

+ +

  }

+ +
+ +
+ +
+ + + + +