%# use Data::Dumper;
%# $Data::Dumper::Indent=1;
FilterProxy::Rewrite Config
% if(defined $CGI->param('showfiltering')) {
Names in BLUE are the names of the rules that made the modification.
Stuff in RED would have been stripped by the rule.
Stuff in GREEN would have replaced the stuff in RED (rewrite rule).
Notes:
  1. Overlapping matches (one inside another) will appear next to each other here, in the reverse of the order in which they were stripped or rewritten. The highlighted text corresponds to what the rule actually saw when it examined the document.
  2. A lot of HTML is generated by scripts and not really meant for human eyes. You may have to scroll far to the right to see things that were filtered.

<% $MESSAGE %>
% } elsif(defined $CGI->param('editfiltering')) {

FilterProxy::Rewrite Config for <%$site%>

<% $MESSAGE %>

Back to FilterProxy main configuration.
% foreach my $sitere (keys %{$FilterProxy::CONFIG->{'filters'}}) {
%   if($site =~ $sitere && defined $FilterProxy::CONFIG->{'filters'}->{$sitere}->{'Rewrite'}) {
%     $siteconfig{$sitere} = $FilterProxy::CONFIG->{'filters'}->{$sitere}->{'Rewrite'};
%   }
% }
% foreach my $sitere (keys %siteconfig) {
%   my $escsitere = $sitere;
%   $escsitere =~ s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
%   $SITECONFIG = $siteconfig{$sitere};
Rules from site regex <%$sitere%>
%   my($name, $filter, $escfilter, $action);
%   if(!defined $SITECONFIG->{filters}) {
%     $SITECONFIG->{filters} = [];
%   }
%   foreach $filter (sort @{$SITECONFIG->{filters}}) {
%     # escape the meta-characters in the site.
%     $escfilter = $filter;
%#    $escfilter =~ s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
%     if($filter =~ /^(?:(\w+):)?\s*(\w+)\s(.*)$/s) {
%       $name = (defined $1)?$1:"";
%       $action = $2;
%       $escfilter = $filter = $3;
%     } else {

Error in filter:

<% $filter %>
%       return;
%     }
%     $escfilter =~ s/>/&gt\;/g;
%     $escfilter =~ s/
%   } # foreach $filter
Name Action Finder Operation Submit
<% $name?$name:" " %>
% } # foreach site
% } else {

FilterProxy::Rewrite Config

<% $MESSAGE %>

Back to FilterProxy main configuration.
% if(defined $SITECONFIG) {
%   my($name, $filter, $escfilter, $action);
%   if(!defined $SITECONFIG->{filters}) {
%     $SITECONFIG->{filters} = [];
%   }
%   foreach $filter (sort @{$SITECONFIG->{filters}}) {
%     # escape the meta-characters in the site.
%     $escfilter = $filter;
%#    $escfilter =~ s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
%     if($filter =~ /^(?:(\w+):)?\s*(\w+)\s(.*)$/s) {
%       $name = (defined $1)?$1:"";
%       $action = $2;
%       $escfilter = $filter = $3;
%     } else {

Error in filter:

<% $filter %>
%       return;
%     }
%     $escfilter =~ s/>/&gt\;/g;
%     $escfilter =~ s/
%   } # foreach $filter
Name Action Finder Operation Submit
<% $name?$name:" " %>

How do Matchers work?

A "Matcher" (currently either tag, attrib, or regex) is applied to the file to find the content desired. So a Matcher like

tag <img src>
will match all 'img' tags which have a src attribute (regardless of the value of that attribute). A matcher like
regex /blue chickens/
will match all occurrences of the string "blue chickens". You can then apply "add" to expand the match beyond the initial tag or regex. For example,
tag </(a|img)/ /(src|href)/> add tagblock <script>
will match any 'a' tag with an 'href' attribute, or any 'img' tag with a 'src' attribute (and also <a src> and <img href>, but these are nonsensical). The add will then expand the match to include a <script> block that follows the initial <a href> or <img src>. You can use the encloser matcher instead of tagblock to make the match grow to a <script> block that encloses the initial <a href> or <img src>.
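
To make the two steps above concrete, here is a rough Python sketch of a tag matcher followed by add tagblock. It is not FilterProxy's own code: find_tag and add_tagblock are hypothetical helpers, the name and attribute regexes are assumed to match the whole name, and nested blocks are not handled.

```python
import re

TAG = re.compile(r"<(\w+)([^<>]*)>")  # an opening tag and its raw attribute text
ATTR_NAME = re.compile(r"([\w-]+)(?:\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\s>]+))?")

def find_tag(html, name_re, attr_re, pos=0):
    """Span of the first opening tag at or after pos whose name and at least
    one attribute name match the given regexes (hypothetical helper)."""
    for m in TAG.finditer(html, pos):
        names = [a.group(1) for a in ATTR_NAME.finditer(m.group(2))]
        if re.fullmatch(name_re, m.group(1), re.I) and any(
            re.fullmatch(attr_re, n, re.I) for n in names
        ):
            return m.span()
    return None

def add_tagblock(html, span, tagname):
    """Grow span forward to the end of the next <tagname>...</tagname> block.
    Like an expanding predicate, it leaves the span unchanged if none is found."""
    block = re.compile(rf"<{tagname}\b[^<>]*>.*?</{tagname}\s*>", re.I | re.S)
    m = block.search(html, span[1])
    return (span[0], m.end()) if m else span

html = '<p>hi</p><a href="/ad"><img src="ad.gif"></a><script>track()</script><p>bye</p>'
span = find_tag(html, "a|img", "src|href")
span = add_tagblock(html, span, "script")
print(html[span[0]:span[1]])
```

The printed text is what a strip rule would remove in this simplified model: everything from the first matching tag through the end of the following script block.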

add alternate adds "alternate content". For example, if you match a <script> block, its alternate content is a <noscript> block. A <noscript> block is usually used to show banner ads to browsers that don't support JavaScript or have it turned off. Often it's easy to match the ad inside <noscript> but almost impossible to match a JavaScript ad. Since these are often right next to each other in a page, alternate will consider them one block. alternate also knows about <layer> and <ilayer> and their alternate content <nolayer>, and <frame> and its alternate content <noframe>.
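
The idea behind alternate can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: add_alternate is a hypothetical helper, the pairing table is taken from the paragraph above, and only alternate content that directly follows the match is handled.

```python
import re

# Tag -> tag holding its alternate content, as listed in the paragraph above.
ALTERNATES = {"script": "noscript", "layer": "nolayer",
              "ilayer": "nolayer", "frame": "noframe"}

def add_alternate(html, start, end):
    """Grow (start, end) to include the alternate-content block that
    directly follows the match, if there is one."""
    first_tag = re.match(r"<(\w+)", html[start:end])
    alt = ALTERNATES.get(first_tag.group(1).lower()) if first_tag else None
    if alt is None:
        return start, end
    block = re.match(rf"\s*<{alt}\b[^<>]*>.*?</{alt}\s*>", html[end:], re.I | re.S)
    return (start, end + block.end()) if block else (start, end)

html = '<script src="ad.js"></script>\n<noscript><img src="ad.gif"></noscript><p>body</p>'
end = html.index("</script>") + len("</script>")
start, end = add_alternate(html, 0, end)
print(html[start:end])
```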

add balanced adds "balanced enclosers". In other words, if you match <img src=...> and it has a <center> preceding it and a </center> trailing it, balanced will consider the center tags part of the match. It keeps adding balanced enclosers until it reaches a leading tag that does not have a corresponding closing tag trailing the match. balanced ignores whitespace, comments, and a few other tags like <br> and <p>.
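
A minimal Python sketch of the balanced idea follows. It is an approximation rather than the real implementation: add_balanced is a hypothetical helper, and it skips only whitespace between the match and its enclosers, not comments or tags such as <br> and <p>.

```python
import re

def add_balanced(html, start, end):
    """Grow (start, end) while an opening tag sits right before the match and
    the corresponding closing tag sits right after it."""
    while True:
        opener = re.search(r"<(\w+)[^<>]*>\s*$", html[:start])
        if not opener:
            return start, end
        closer = re.match(rf"\s*</{opener.group(1)}\s*>", html[end:], re.I)
        if not closer:
            return start, end  # leading tag has no closing tag trailing the match
        start, end = opener.start(), end + closer.end()

html = '<td><center> <img src="ad.gif"> </center></td><td>text</td>'
start = html.index("<img")
end = html.index(">", start) + 1
start, end = add_balanced(html, start, end)
print(html[start:end])
```

Here the match grows twice, first to the center tags and then to the table cell around them, and stops because nothing precedes the cell.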

Clever combinations of add, balanced, alternate, and encloser can make most pages look as if they never had an ad.

Once your match is found, it is either stripped or rewritten. Strip removes the match from the page. Rewrite requires the matcher to be followed by as [block]; the match is replaced by the text following the as keyword. The as part is not interpreted in any way; it is inserted verbatim.
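
As a small illustration of the difference between the two commands, the following Python sketch applies them to a plain regex match. The function names are made up for this example. The point worth noticing is the verbatim replacement: a function is passed to re.sub so that backslash sequences in the replacement text are not expanded.

```python
import re

def strip(html, pattern):
    """Remove every match of pattern from the page."""
    return re.sub(pattern, "", html)

def rewrite(html, pattern, as_text):
    """Replace every match of pattern with as_text, inserted verbatim.
    A plain string replacement would have sequences such as \\1 expanded;
    returning as_text from a function avoids that."""
    return re.sub(pattern, lambda m: as_text, html)

print(strip("red blue chickens here", r"blue chickens "))
print(rewrite("a blue chickens b", r"blue chickens", r"<b>\1</b>"))
```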

How do names and 'ignore' work?

Each rule can be named, so that if a rule BADRULE destroys the layout of one page, you can create a site regex for it (on the FilterProxy main page) which will contain the rule ignore BADRULE. This will cause BADRULE not to be applied to sites matching that site regex. You don't have to name your rules if you don't want to. You can even name ignore rules, so that you can ignore your ignore rules. But that is probably a little silly. Rules and ignores are processed in alphabetical order, so if you want one rule or ignore to be processed before another, you can precede the name with a number (e.g. 1_MYRULE), or just name it something that comes before it alphabetically.
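
The interaction between names, ignore, and alphabetical ordering can be pictured with this Python sketch. It rests on assumptions the text above does not spell out: rules from all matching site regexes are taken as one list of (name, command, rest) tuples, and an ignore only affects rules processed after it. The function active_rules is hypothetical.

```python
def active_rules(rules):
    """Return the rules that would still be applied, in processing order.
    rules is a list of (name, command, rest) tuples; unnamed rules use ""."""
    ignored, active = set(), []
    for name, command, rest in sorted(rules, key=lambda r: r[0]):
        if name and name in ignored:
            continue  # this rule (or ignore) was itself ignored earlier
        if command == "ignore":
            ignored.update(rest.split())  # ignore can list several names
        else:
            active.append((name, command, rest))
    return active

rules = [
    ("BADRULE", "strip", "tag <img src>"),
    ("1_SKIP", "ignore", "BADRULE"),
    ("GOOD", "strip", "regex /blue chickens/"),
]
print([name for name, _, _ in active_rules(rules)])
```

Adding a rule such as ("0_UNDO", "ignore", "1_SKIP") would, in this model, ignore the ignore and bring BADRULE back.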

Hints

Speed Considerations

Terse syntax description:

  The basic syntax is:
    [NAME:] command matcher [[qualifying predicate] [expanding predicate] ...]

    Note: [] means "optional", {} means "mandatory", ... means "more than one"
          <> are literal, and must be included as part of the rule.
  Commands:
    strip                         remove from file
    rewrite {matcher} as {html}   change matched text to something else
    ignore {NAME} [...]           ignore a named rule (can specify more than one)

  Expanding Predicates: (modifiers that expand an existing match, but will not
                         cause the match to fail if not found)
    add {matcher}
          grow match to include text matched by {matcher} (use a matcher below)
          (can apply more than once, order matters).  If {matcher} is one of
          (tag, tagblock, regex, attrib), the match will grow forward from the
          previous point until it finds {matcher}.

  Qualifying Predicates: (modifiers that also must match in order to consider it
                          a match)
    inside {matcher}
          like encloser, except that the match that precedes it must be *inside*
          the match that follows it.  This does not change the original match
          (use add encloser instead if you want to strip the thing it's inside).
          Note that currently, the only matcher that will succeed is 'encloser',
          since none of the other matchers can search backwards.

    containing {matcher}
          the matched block must contain {matcher}

    preceeding {matcher}
          the matched block must come before {matcher}.  {matcher} is not considered
          a part of the match.


  Matchers:  Each matcher "finds" a block of text that gets passed to the
      predicates that follow it.

    tag [options] <{tagname} [attrib[=value]] [...]>  
          Will grab a single tag (without corresponding closing tag).  Any of
          tagname, attrib, or value can be a regular expression by enclosing
          them in one of these regex delimiters: [/#%&!,=:].  An empty regex
          (//) is valid (and will match everything).  If a value is not
          enclosed by regex delimiters, it will match all valid means of
          specifying that value, regardless of the quote characters surrounding
          the value in the rule, or the matched HTML.  (i.e. "tag <img
          width=1>" will match <img width="1">, <img width='1'>
          and <img width=1>) If more attribs are present in the HTML than
          are specified in the rule, it will still match.  'tag', 'tagblock',
          'attrib', and 'encloser' all use this method of specifying the tag to
          be matched.

    tagblock <{tagname} [attrib[=value]] [...]>  
          Matches the block starting at the specified tag, and ending at the
          corresponding closing tag. (like old 'tag -tagblock')  (see 'tag')

    attrib <{tagname} attrib[=value] [attrib[=value]] [...]>     
          Will grab the attribute specified.  Note that you can specify more
          than one attribute, and the *first* one is the one that will be
          stripped/rewritten, but the tagname must match and other attribs are
          required to be present.  (see 'tag')

    regex /regex/
          Match any (perl) regex.  Regex must be delimited by one of:
          [/#%&!,=:].  Note that this does matching (m//), not s///, 
          tr/// or y///. (yet)

  Expanding Matchers:  These matchers must be given as an argument to 'add'.

    encloser <{tagname} [attrib[=value]] [...]> 
          Like tagblock, except that the block must enclose the previous match.
          (only makes sense as argument to 'add', and should really be named 
          "enclosing tag block" but that's too long)  (see 'tag')

    leader <{tagname} [attrib[=value]] [...]>
          Like tag, except that it searches backwards from the current match.
          (only makes sense as argument to 'add')  Note: There is no 'trailer' 
          or 'follower' matcher.  Use 'add tag ...' to search forward in a 
          similar manner.

    balanced                                      
          Grow match to include "balanced" tags that have the tag preceding
          the match, and the corresponding closing tag trailing the match (with
          nothing in between).  Only makes sense as argument to 'add'.

    alternate                                     
          Grow the match to include "alternate content".  i.e. script/noscript,
          frame/noframe, layer/nolayer etc.  Only makes sense as argument to 
          'add'.  "alternate content" may precede or trail the original match.

    whitespace
          Grow the match to include whitespace (' ', '\n', '\t') as well as
          whitespace-like tags such as <p>, <br>, <hr>, and
          entities like &nbsp;.  Note that only tags that precede the
          match will be added for speed. (searching backwards is hard)

  In all cases more than one attrib can be specified.  You may chain as many
  matchers and predicates as you like, but if it starts to get too long it will
  probably be ambiguous and not do what you might expect.  (I need a BNF form
  grammar for this syntax...see this discussion on perlmonks.org.)
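
To see how the "[NAME:] command matcher ..." layout breaks apart, here is a short Python sketch. It reuses the shape of the regular expression this page applies to stored rules; split_rule and the rule strings are made up for the example, and the matcher and predicates are left unparsed in the third field.

```python
import re

# Same shape as the pattern the configuration page uses to split a stored rule:
# optional "NAME:", then the command word, then everything else.
RULE = re.compile(r"^(?:(\w+):)?\s*(\w+)\s(.*)$", re.S)

def split_rule(rule):
    """Return (name, command, rest), or None if the rule is malformed.
    An unnamed rule gets an empty name."""
    m = RULE.match(rule)
    if m is None:
        return None
    name, command, rest = m.groups()
    return (name or "", command, rest)

print(split_rule("NOADS: strip tag <img src>"))
print(split_rule("rewrite regex /blue chickens/ as red chickens"))
```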
  
% } else { # if(defined $SITECONFIG)

Rewrite has no global configuration. Please select a site from the main FilterProxy Config page.
% }


Rewrite was written by Bob McElrath. Please see the README, BUGS, and any relevant module documentation before mailing me with problems.
% }