Proxomitron.Info  |  Online version of the Naoko 4.5j Help file  |  © Scott R. Lemmon
Matching Rule Tips and Tricks

The matching rules are the most complex part of the Proxomitron. Understanding them can be confusing at first - especially if you've never used a pattern matching language before. However, don't despair: even very simple rules can accomplish quite a bit. Take it a step at a time and soon it'll become second nature. To get you started, here are a few tips covering some basic HTML matching tasks.

Note: This section assumes you know a little about HTML - if not, there are many excellent tutorials available on the net. Also, if you don't intend to write your own rules, you can ignore this information entirely.

Formatting your matching rules

Complex matching rules can often be hard to read. However, to make them a little more legible, both the matching expressions and replacement text can be split over multiple lines when you enter them into the filter editor. These line breaks have no effect on the actual filter. To actually include a line break in the HTML you're sending to the browser use "\n".

Also because spaces always match true, you can normally use them freely to separate elements of a matching expression. Just be aware that they do consume spaces they find.

Some general info

When designing a new rule it's more common for it to not match at all rather than match too much. Always start simple - then add refinements as needed. That way, when a rule suddenly doesn't match when you expected it to, you'll have an idea which part is causing the trouble.

Use the log window to see when a filter matches, and especially use the "HTML debug info" option in the log (or type "dbug.." in front of the URL's hostname) to view webpages as source and see what your filters are up to. These are two very helpful tools when designing rules. Even more helpful is the new Match Testing Window, which allows you to see exactly how a filter will change a bit of HTML text.

Cutting and pasting HTML

Often a good way to get started on a new filter is to cut and paste the HTML you're interested in directly into the matching clause. One thing to watch out for here is line endings - since the matching clause ignores line breaks, a line of HTML that looks like so (two lines)...

<br>
<p>

seems just like "<br><p>" (no spaces) to it. This can cause trouble since the actual HTML also has a newline character that must be matched. The solution is simply to place a space at the beginning or end of each line you insert (if it doesn't already have one), like "<br> (space) <p>". The space will match all "whitespace" including any newlines.
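
If you'd like to see the same pitfall outside the Proxomitron, here's a rough sketch using JavaScript regular expressions. This is only an analogy: JavaScript's regex syntax is not the Proxomitron's matching language, and the "\s*" below is just a stand-in I've assumed for what a space does in a matching rule.

```javascript
// The page source: two tags separated by a newline.
const html = "<br>\n<p>";

// What the pasted two lines amount to once the line break is ignored.
const pastedAsIs = /<br><p>/;

// Rough analogue of adding a space: allow any amount of whitespace (or none).
const withWhitespace = /<br>\s*<p>/;

console.log(pastedAsIs.test(html));          // the newline gets in the way
console.log(withWhitespace.test(html));      // the newline is absorbed
console.log(withWhitespace.test("<br><p>")); // still fine with no whitespace at all
```

The last line mirrors the point made earlier that spaces always match true: allowing whitespace costs nothing when there isn't any.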

Disabling tags and tag elements

Since browsers will ignore any tags and elements they don't understand, a "quick and dirty" but effective way to disable a tag or one of its elements is to rename it. This comes in especially useful when the same element may be used by several different tags. Take "onload" for example: this element auto-runs a JavaScript. Although normally in the "<body ... >" tag, it may occur elsewhere as well. To stop it you could use...

Matching: onload=
Replace: LoadOff=

Which would make a tag like...

<body background="bak.gif" onload="window.open(myadd);" >

become...

<body background="bak.gif" LoadOff="window.open(myadd);" >

Notice how simple this rule is! It's also a bit risky, since there's a chance the phrase "onload=" could occur outside a tag in the actual text of a web page. In practice, however, this seldom happens (including the equal sign helps guard against this a bit). Even if it does, it's normally not a big deal as long as you realize what's happened.

Killing two stones with one bird

Here's a simple trick for changing both a start and end tag with the same rule. This trick is used by the "Blink to Bold" rule among others. In this rule we want to convert "<blink>" to "<b>" and "</blink>" to "</b>" - let's take a look at how it's done...

Matching: <\1blink>
Replace: <\1b>

By using the "\1" meta character, the rule will match both the start tag: "<blink>" and also the end tag: "</blink>". Additionally, the "\1" captures the end tag's slash for use in the replacement text. A safer, but more complex, version of the rule might be...

Matching: < ( / | )\1 blink>
Replace: <\1b>

Can you tell why? If not read "Testing for something or nothing".

Capturing a tag's contents

Often you'll want to change only one element of a tag while leaving the rest as they are. This is where the number variable "\#" matching is very useful. Take the following example of a rule to kill web page backgrounds.

Matching: <body \1 background=\w \2 >
Replace: <body \1 \2>

When they don't directly follow parentheses "( ... )", the \# variables act just like an asterisk "*". Here, the "\1" captures anything before the background element, while the "\2" captures everything afterwards. In the replacement text, the background element is simply left out, but you could also include your own background here.

Adding a new element to a tag

Here's a quick trick to add an element to a tag. Although the "proper" method would be to replace an element if it already exists and add it only if it doesn't, this can sometimes be difficult. It's often simpler to just add the element regardless. We just need to make sure the browser will use our element instead of any pre-existing one. For example, to add a border to all "<img ... >" tags, you could use...

Matching: <img \1 >
Replace: <img border=1 \1 border=1>

Why add border twice? Well ideally, when a browser finds a duplicate element it should use the first one and ignore the rest. This is in fact what Netscape does, but as it turns out, Internet Explorer does the exact opposite! (what a surprise huh?) By placing the element at the beginning and end of the tag, it works in either case. Note that being browser independent isn't as important here as it is for designing web pages. You're likely to know what browser you intend to use, so it's ok to just arrange things in the way your browser expects.

Capturing specific tag attribute values

The values of a tag's attributes can often be tricky to match. Take "<a href=... >" for instance. "href" indicates a URL, but the URL value could be surrounded by single quotes, double quotes, or even no quotes at all. This is where the special $AVQ(...) (Attribute Value & Quotes) matching command comes in very handy. It will match all of the value, including any quotes it may find. For example, if you wanted to capture the URL into the \1 variable you could use...

<a * href=$AVQ(\1) * >

Remember that when a "\1" or other \(number) immediately follows parentheses it captures whatever text those parentheses matched? Say you now want to capture a value only if it contains specific stuff. To do this we can use parens (...) and the $AV(...) command...

<a * href=( $AV(*(banner|advert|cgi)*) )\1 * >

Which would only match URLs containing the words "banner", "advert", or "cgi". We now have the beginnings of a "banner blaster" type rule. $AV(...) is like $AVQ(...) except the quotes are not included as part of the inner match, so you don't have to bother checking if it actually has quotes or not. However, since we also want to capture the whole thing (including any quotes), we'll just wrap it all in parens and tack a \1 on the end. This will stuff the entire attribute value into \1.

Testing for something or nothing

Often you'll find you'll want an expression to match whether a particular value is there or not. To do that use the following rule... "( something | )"

This will first test for the word "something" but if it isn't found the expression is still true. Why? Notice there's an OR symbol (vertical bar) with nothing between it and the closing parenthesis. This creates an empty expression and an empty expression is always true and consumes no characters. Read this as saying - match "something" OR nothing.

Note that if the expression had been written "(|something)", the word "something" would never be matched! Since ORs are processed from left to right, the empty expression would always be tested (and match) first before the word "something" got a chance.
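
The same left-to-right behavior can be seen in JavaScript regular expressions, which also try alternatives in order. Here's a small sketch, offered only as an analogy (the syntax is JavaScript's, not the Proxomitron's, and the sample text is made up):

```javascript
const text = "something else";

// "something OR nothing": the word gets first chance at matching.
const somethingFirst = /^(something|)/.exec(text);

// "nothing OR something": the empty alternative matches first, so the word is never tried.
const emptyFirst = /^(|something)/.exec(text);

console.log(JSON.stringify(somethingFirst[1]));
console.log(JSON.stringify(emptyFirst[1]));

// When the word is missing, the first form still matches, just with nothing captured.
const missing = /^(something|)/.exec("nothing here");
console.log(missing !== null, JSON.stringify(missing[1]));
```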

A practical example is ( " | ) * ( " | ) which tests for something that may or may not be surrounded by quotes.

Here's a more elaborate example which grabs the "border" value from an "<img ... >" tag if it exists and places it into the variable \1

<img ( * (border=\w)\1 | ) * >

Be careful of the placement of asterisks here, for example "<img*(border=\w|)\1*>" might not do what you expect. Upon scanning the first character after "<img ", if it turned out not to be "border" the sub-expression would still match! Then when "border" occurred later in the string, it would be matched by the second asterisk instead, since the initial test had already passed by.

Using "AND" to capture multiple tag attributes

By using the ampersand "&" you can capture certain tag attributes regardless of the order they're found in. For example, say you wanted to re-write an "<img ... >" tag to contain an image of your own, but wanted to keep the original "width" and "height" values. You might use...

Matching: <img ( (*(height=\w)\1*| ) & (*(width=\w)\2*| ) ) >
Replace: <img src="file://d|/my_pictures/shonen.gif" \1 \2 >

Note that the height is captured into variable \1 and the width into \2. Also by using the "something or nothing" syntax described above, the expression will still match even if the width or height value is missing from the tag, in which case the corresponding \# variables will simply be blank.
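
For comparison, here's a rough JavaScript sketch of the same idea: each attribute is looked for independently, so their order in the tag doesn't matter, and a missing one just comes back blank. The function names (captureSize, replaceImage) and the sample tags are made up for illustration, and the regular expressions only approximate what "\w" does in a matching rule.

```javascript
// Look for height and width separately, so their order in the tag is irrelevant.
function captureSize(tag) {
  const height = /\bheight=("[^"]*"|'[^']*'|[^\s>]+)/i.exec(tag);
  const width = /\bwidth=("[^"]*"|'[^']*'|[^\s>]+)/i.exec(tag);
  return {
    height: height ? height[0] : "", // plays the role of \1
    width: width ? width[0] : "",    // plays the role of \2
  };
}

// Rebuild the tag around our own picture, keeping whatever sizes were found.
function replaceImage(tag) {
  const { height, width } = captureSize(tag);
  return `<img src="file://d|/my_pictures/shonen.gif" ${height} ${width} >`;
}

console.log(replaceImage('<img width=100 src="pic.gif" height="50">'));
console.log(replaceImage("<img src=pic.gif>"));
```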

Using "smart" quotes

Most of the time "\w" works well for capturing a tag's attributes. However there are times when you need something more. Something like an "<img ... >" tag's "alt" element often contains spaces as in... alt="this is some text" or alt='also some text'. To capture this sort of thing use the double quotes...

Matching: alt=( " * " )\1

An even more complex situation often arises with JavaScript. Here it's common to have "quotes within quotes" as in the following...

onmouseover="javascript:window.open( ' mywindow.html ' ); "

capturing this would normally be difficult as it could also be written...

onmouseover= ' javascript:window.open( " mywindow.html " ); '

This is where the single quote comes in. When it follows a double quote in a match, it looks for the correct closing quote for the previous starting quote...

Matching: onmouseover=( " * ' )\1

This will get single quotes, double quotes, or nested quotes! However, in this case, there's an even better way - remember the $AVQ( ) command? It also will handle all kinds of quotes (or even no quotes at all) and is great for capturing attribute values...

Matching: onmouseover=$AVQ(\1)

Using file URLs to include your own stuff

A "file URL" is a URL that points to a file on your hard drive rather than some location on the Internet. Browsers use file URLs to view web pages stored off-line, but they are also a very handy way to insert your own pictures, web pages, even JavaScripts into pages you view.

The Proxomitron now makes it very easy to insert a file URL into a matching rule's replacement text. First position your cursor where you wish to insert the URL then right-click and select "Insert file URL" from the context menu. A file requester will open up allowing you to choose the file to insert.

Here's an example of a "background replacer" rule that uses a file URL

Matching: <body ( \1 background=\w | ) \2 >
Replace: <body \1 background="file://c|/pictures/background.gif" \2 >

Note the matching expression has a space between the ")" and the "\2" - it's a common mistake to forget this space! This would result in the \2 containing what was matched in the "( ... )" phrase instead of whatever follows the background attribute.

Inserting JavaScript or other items in every web page you view.

Here's a trick for really taking control of a web page. JavaScript can be a very powerful tool - in the right hands... Now those hands can be your own! To insert a JavaScript, or for that matter anything else, into every web page you view, search for a tag you know will always be there - "<html>", "<head>" or "<body>" are often good choices. Take for example, a rule to stop JavaScript error boxes from popping up. For Netscape we need to execute the following bit of JavaScript code on every page "<script> this.onerror=null; </script>" The rule would look like so...

Matching: <html>
Replace: <html>\n<script> onerror=null; </script>$STOP( )

This basically inserts the script directly after the <html> tag. Note the "\n" - this causes a line break between the <html> tag and the start of the script. Though not really necessary, it makes the web page source easier to view if you want to check your results. Also, when using this trick it's best to have the "Allow for multiple matches" check box enabled. This allows any other rules that may use the same trick to insert their text as well.

Also notice the use of the $STOP( ) command. This will turn the filter off for the rest of the page and makes sure our script will only be inserted once (especially important when you also use "multi-match").

Small scripts like the one above can easily be included directly in the replacement text, but for larger scripts this might be cumbersome. A better solution might be to have the "<script ...>" tag contain a "file URL" pointing to a file with the actual script. As in...

<html>\n<script src= "file://c|/scripts/myscript.js" ></script>

An even better way to insert items into every web page you view.

I described the above method first because it's a useful trick when you need to insert something into a certain area of the page (like after the <body ... > tag for instance). However, if all you want to do is put it at the start or end of the page there's an even easier way....

The matching clause can have two special values <start> and <end> - they simply insert the replacement text either at the beginning or end of a web page. They're easy to use and very efficient since no searching has to be done. Using <start> the above rule could be written...

Matching: <start>
Replace: <script> onerror=null; </script>

This works great for JavaScripts (and overriding JavaScript functions - see below). Also there's no need to worry about allowing for multiple matches!

Overriding JavaScript Functions with your own

In Netscape and Internet Explorer 4.0+ there's a very effective trick for warping JavaScripts. Any JavaScript function - even built-in ones - can be redefined to do whatever we want. Say we wanted to get rid of those "alert( ... )" and "confirm( ... )" boxes. We could do it by simply inserting the following script at the start of a web page (use the <start> technique above)....

<script>
function alert( ){ return(1); }
function confirm( ){ return(1); }
</script>

Now whenever any other script on the page attempts to call an alert or confirm box, our functions get called instead. By returning a "1" we even make the unsuspecting script think we answered "yes" to the confirm box's prompt!

This is really a very powerful concept - although the functions in this example don't really do anything, more complex replacement functions could filter only certain alert boxes or even do something else entirely. There's really no limit.
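
As a minimal sketch of what a "more complex replacement function" might look like, here's one that swallows only alert boxes whose message matches a pattern and lets the rest through. The helper name makeFilteredAlert, the sample pattern, and the stand-in fakeAlert are all my own assumptions for illustration. It's also written in modern JavaScript, so it won't run as-is in the older browsers discussed here.

```javascript
// Hypothetical helper: builds a replacement for alert() that suppresses
// messages matching `blocked` and hands everything else to `original`.
function makeFilteredAlert(original, blocked) {
  return function (message) {
    if (blocked.test(String(message))) {
      return undefined; // swallow the unwanted box
    }
    return original(message);
  };
}

// Stand-in for the browser's real alert so the sketch runs anywhere.
const shown = [];
const fakeAlert = (message) => { shown.push(message); };

const filteredAlert = makeFilteredAlert(fakeAlert, /click here|you have won/i);
filteredAlert("You have won a prize!");     // suppressed
filteredAlert("Your session has expired."); // passed through
console.log(shown);
```

In a real page you would presumably keep a reference to the browser's own alert and then assign the wrapper in its place, inserting the script with the <start> technique described above.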

Since it's so efficient, the Proxomitron's default rules make good use of this trick. One drawback: while it works with almost all versions of Netscape, and IE 4.0 or higher, it won't work with Microsoft's JScript in IE 3.x. The Proxomitron provides an alternate rule set for IE 3.0 users which uses the normal search and replace techniques to accomplish these things instead.

How to use recursive matching

Recursive matching is when a rule matches its own previous results. Normally this is something to avoid, especially if it leads to infinite recursion - where the rule matches endlessly against itself. However, used properly, this can be a powerful technique. Take this scenario - say you want to eliminate any pop-up windows that occur between the "<script ..." and "</script>" tags. Since JavaScript uses the "open(...)" command to pop open a window, the resulting rule might look like this...

Matching: <script \1 open \( * \) \2 </script>
Replace: <script \1 \2 </script>

(In actuality, you'd also want to use scope here, but we'll discuss that later - also note the use of the "\" to escape the special meaning of the parentheses). This might work if there was only one open command in the JavaScript, but if there were more than one, only the first would be eliminated. The solution? Well, two things. First, check "Allow for multiple matches" to let the rule match against its own results. Secondly, we need to change the replacement text to read...

Replace: \n<script \1 \2 </script>

Why? Well, in order to help guard against accidental infinite recursion, the Proxomitron's matching engine always advances one character forward after all rules have been checked. This means the next time through our rule would see only "script ..." instead of "<script ...". To get around this we simply insert a "newline" character in front of the replacement text. Although in the final output this will create a blank line before the "<script ..." tag, the browser will ignore it. However it lets the rule see "<script ..." next time around. A leading space could have also been used, in fact so could anything that doesn't affect the web browser's function. The idea is simply to push the entire result at least one character forward.

Once all the "open(...)" commands have been removed, the rule will no longer match, so there's no danger of infinite recursion.

Actually, although this is a useful trick, there's a better (and safer) way to do this...

How to NOT use recursive matching

Proxomitron also has a feature called the "replacement stack". This uses the \# variable to collect up multiple values. Perhaps the easiest way to think of this is that \# works just like \0 through \9 and can be used in exactly the same places. The difference is if it's called more than once - either in a loop or by using it multiple times - the value is stuck onto the end rather than replacing what's already there. By using this we can match multiple items without having to get low-down recursive...

Match: <script (\#.open \( * \))+ \# </script>
Replace: <script \@ </script>

Here we create a loop using the plus after a set of parens: "(...)+". This will loop as long as it can find a match for what's inside. This particular loop searches for ".open()" commands. Anything before the open gets stuck into the first \# and as the loop repeats anything before the next open gets added onto \#. This continues until we run out of open's entirely. Then, anything left over gets stuck on the end by the second \#.

Essentially we've captured everything between the two script tags except the open commands. In the replace section we just call \@ which will dump out everything \# has stored up so far. This has much the same effect as the recursive match, but is more efficient and less dangerous (since it should never get stuck in an infinite loop).
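
To make the "collect everything except the open commands" idea a little more concrete, here's a loose JavaScript sketch of the same bookkeeping. It is not how the Proxomitron works internally - the function name stripOpenCalls, the `kept` array standing in for the \# stack, and the sample script are all assumptions of mine, and the simple pattern doesn't cope with nested parentheses.

```javascript
// Hypothetical helper: returns the text with simple ".open( ... )" calls left out.
function stripOpenCalls(scriptText) {
  const openCall = /\.open\s*\([^)]*\)/g;
  const kept = []; // plays the role of the \# stack
  let lastIndex = 0;
  for (const match of scriptText.matchAll(openCall)) {
    // Everything before this open call gets pushed, like the first \# in the loop.
    kept.push(scriptText.slice(lastIndex, match.index));
    lastIndex = match.index + match[0].length;
  }
  // Whatever is left over goes on the end, like the second \#.
  kept.push(scriptText.slice(lastIndex));
  // Dumping the stack in order is the analogue of \@ in the replacement text.
  return kept.join("");
}

const sample =
  '<script> window.open("a.html"); var x = 1; window.open("b.html"); </script>';
console.log(stripOpenCalls(sample));
```

Because the text is walked once from start to finish, there's nothing here that could loop forever, which is the same reason the stack version of the rule is safer than the recursive one.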

A word about Scope Bounds

For simplicity, the examples here have made no use of the web filter's scope bounds settings. Bounds can be used to control how far ahead the Proxomitron's matching engine scans a web page when searching for a match (see the web page filter editor for a more detailed explanation).

With bounds you normally give the rule fixed starting and ending points to search between. Reusing an example from above, take the following rule...

Matching: <script \1 open \( * \) \2 </script>
Replace: <script \1 \2 </script>

Written using bounds, it might look like...

Bounds: <script\s*</script>
Byte Limit: 4096
Matching: \1 open \( * \) \2
Replace: \1 \2

Notice we just move the fixed beginning and ending text into the scope's bounds check. The byte limit is the maximum number of characters to search before giving up. Nothing to it really. Also we've used the \1 and \2 in the matching expression to capture the start and end of what the bounds matched - including <script and </script>. So we no longer need to put them in the replacement text.

The main thing to remember when using bounds is this: The Match: section must match exactly the same amount of text as the Bounds: section.

So, if the bounds: matches from say "<HEAD>" to "</HEAD>", the match must also! The example above shows how to easily do this - just capture the start and end of the match into variables you can use in the replace section.

That's all there is... For now

Those are all the tips I can think of. If you have any ideas of your own please let me know. Also remember you can look at the rules that come with the Proxomitron! Use them as a starting point when writing new rules - often you'll find one that already comes close to what you want to do.

