Proxomitron.Info  |  Online version of the Naoko 4.5j Help file  |  © Scott R. Lemmon
Proxomitron's Text Matching Language

The text matching language is the key to understanding how the Proxomitron's filters work. It allows you to match complex combinations of HTML tags and store parts of the matched text into variables which can later be used in the replacement text.

If you're familiar with DOS and UNIX style filename wildcards (*,?,[...]) or regular expressions, you'll find much that's familiar in Proxomitron's matching rules. My original goal, in fact, was to create a matching language as easy to use as wildcards, but with much of the added power of regular expressions. I'm not exactly sure I succeeded, but somehow I got it all to work! ;-)

If this kind of stuff is new to you, check out An Introduction to Text Matching for the basics.

Many of the rules have also been specifically designed to make working with HTML easier. For instance, since case is seldom important in HTML, all matching is case insensitive - saving you the trouble of testing for both upper and lower case.

Proxomitron's meta characters

Here is a complete list of all the Proxomitron meta characters and what they do.

* The asterisk will match any string of characters. For example, "foo*bar" would match "foobar", "fooma babar", or even "foo goat bat bison bar". Basically the asterisk means, "search forward matching anything until you find what follows the asterisk".
? The question mark matches any single character no matter what it is. "?oat" would match "boat" or "goat" or even "<oat".
[abc...] Square brackets match any single character listed within the '[' and ']'. Ranges can also be checked by using a dash: "[A-Z]" will match any letter "A" through "Z" while "[0-9]" will match any single digit. If the first character is a "^" it will match any character not within the brackets - "[^0-9abc]" will match any character that's not a digit and not "a", "b", or "c".
[#n:n] Special numeric match. This is used to easily check for numeric value ranges in HTML tags. For example, to check for a number between 100 and 150 use "[#100:150]". If the second number is a '*' it acts as if it's infinitely large, so "[#40:*]" would match any number greater than 40. To check for a number less than 40 simply use "[#0:40]". To check for an exact number the second number can be left out (as in "[#100]"). The numeric match will match regardless of leading zeros or quotes surrounding a number, as in tag="0100". Earlier versions of Proxomitron used "-" instead of ":" to separate the numbers, but this made testing for negative numbers difficult. Currently Proxomitron will accept either way, but if your test includes a negative value (like [#-200:150]) you'll have to use ":".
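
As a rough illustration of the numeric match rule, here is a small Python sketch. It is not how Proxomitron implements the rule; the helper name numeric_match is hypothetical, and treating both ends of the range as inclusive is an assumption, since the description above does not say whether the endpoints themselves match.

```python
import re

def numeric_match(value, low, high=None):
    # Rough analogue of [#low:high].
    # high=None means "exact match on low"; high="*" means "no upper limit".
    # Inclusive bounds are an assumption of this sketch.
    text = value.strip().strip("\"'")   # ignore surrounding quotes
    if not re.fullmatch(r"-?\d+", text):
        return False
    number = int(text)                  # int() ignores leading zeros
    if high is None:
        return number == low
    if high == "*":
        return number >= low
    return low <= number <= high

print(numeric_match('"0100"', 100, 150))   # quoted value with a leading zero
print(numeric_match("39", 40, "*"))        # below an open-ended range
print(numeric_match("-50", -200, 150))     # negative lower bound
```

The three calls mirror "[#100:150]", "[#40:*]" and "[#-200:150]" from the text.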

" "

A space always matches - but it also consumes any number of spaces or tabs it may find. Use it where there may or may not be a space between items. For example:

"<tag value>" would match "<tag value>" or "<tag   value>" or even "<tagvalue>".

\s Backslash-s: Like the space, will consume any number of spaces or tabs, but there must be at least one for it to match. For example "<tag\s>" would match "<tag >" or "<tag   >" but not "<tag>".
\w Backslash-w: Word match. Will match any number of non-space characters. It's basically the opposite of "\s", but in some ways it's also similar to "*". The difference is that it will stop if it hits a space or a ">" (which marks the end of an HTML tag). It comes in very handy when matching tag values and URLs (see tips and tricks).
\t Matches a tab character explicitly.
\r Matches a carriage return character explicitly.
\n Matches a newline character explicitly.
\0-9 Backslash+Digit 0-9: Match into a variable. This is one of Proxomitron's key matching rules. It matches just like the "*" character, but stores whatever is matched into one of ten variables. These variables can then be used in the replacement text to include parts of the original HTML. Use it to change only part of a tag while leaving other parts intact. For example, to change only the background in a <body ... > tag you could use...

Matching: <body \1 background="*" \2 >

Replace: <body \1 background="mybackground.gif" \2 >

This way, anything else originally within the body tag, both before and after the background attribute, will be included in the replacement text.

More complex matches can be captured by placing the \0-9 directly after a set of parentheses with no spaces in between, as in "(foo*bar)\1". In this case anything matched within those parentheses will be placed into the variable. This is similar to regular expressions, but with the added benefit of being able to choose which variable gets used.
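
For readers who know regular expressions better, the capture-and-reuse idea above can be approximated in Python. This is only an analogue, not Proxomitron's own engine: the function name replace_background is made up for the example, and the lazy groups that stop at ">" are an assumption used to keep the match inside the tag.

```python
import re

# Two capture groups stand in for \1 and \2; re.IGNORECASE mirrors
# Proxomitron's case-insensitive matching.
BODY_TAG = re.compile(
    r'<body\s*([^>]*?)\s*background\s*=\s*["\'][^"\']*["\']\s*([^>]*?)\s*>',
    re.IGNORECASE,
)

def replace_background(html, new_image):
    # Rebuild the tag, keeping whatever surrounded the background attribute.
    return BODY_TAG.sub(
        lambda m: f'<body {m.group(1)} background="{new_image}" {m.group(2)} >',
        html,
    )

original = '<body text="black" background="old.gif" link="blue">'
print(replace_background(original, "mybackground.gif"))
```

Only the background value changes; the text and link attributes are carried over through the two groups.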

\# Backslash-hash (or pound sign) works much like \0 through \9 except each time it's used the value is pushed onto the Replacement Stack. This can be thought of as sticking the new value onto the end of the variable instead of just replacing its previous value. For example, if \# first matched "foo" and then matched "bar", it would contain both values. Use "\@" in the replacement section to print out all the values captured, in this case "foobar".
| Use the vertical bar as an "OR" function. For example "foo|bar" would match either "foo" or "bar".
& Use the ampersand as an "AND" function. For example "*foo&*bar" would match "foo bar" or "bar foo" but not "foo foo". Note the use of the asterisk - something like this is always needed with the AND function since a word could never be both "foo" and "bar" at the same place and time ;). AND is useful for situations where tag values may come in any order...

<img src="picture" height=60 width=200>

which could also be written...

<img width=200 height=60 src="picture">

both would be matched by...

<img (*src="picture" & *height=60 & *width=200)*>
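
In regular-expression terms, the closest analogue to this AND is a series of lookaheads that all start from the same position. The Python sketch below is an approximation only; limiting each lookahead to "[^>]*" (so it stays inside the tag) is an assumption of the sketch and is stricter than a bare "*".

```python
import re

# Each (?=...) lookahead tests one attribute from the same starting point,
# so the attributes may appear in any order.
IMG_TAG = re.compile(
    r'<img\s'
    r'(?=[^>]*\bsrc\s*=\s*["\']picture["\'])'
    r'(?=[^>]*\bheight\s*=\s*["\']?60\b)'
    r'(?=[^>]*\bwidth\s*=\s*["\']?200\b)'
    r'[^>]*>',
    re.IGNORECASE,
)

samples = [
    '<img src="picture" height=60 width=200>',
    '<img width=200 height=60 src="picture">',
    '<img src="picture" height=60>',          # width is missing
]
for tag in samples:
    print(tag, bool(IMG_TAG.fullmatch(tag)))
```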

&& The double ampersand (or AND-AND) function works similarly to the single "&" with one important (but useful) difference - the second half of the AND is limited to matching exactly what the first part did. Confused? It's actually pretty simple. Say you have an expression like this...

(<img * > && \1 )

The "\1" normally acts as a "*" and, given a regular AND, would match from the start of "<img " all the way to the end of the available text (and well past the end of the image tag). The AND-AND limits it to matching only the contents of the <img ... > tag and no more (so the \1 will only capture the <img ...> tag). You can use this like a boundary to limit the scope of a match and prevent "run-away" expressions.
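
One way to picture the bounding effect is to run the second half of the match only against the text the first half matched. The Python sketch below does exactly that; it is an illustration of the idea under that assumption, not a description of how Proxomitron evaluates "&&", and the sample HTML is made up.

```python
import re

html = '<img src="a.gif" width=10> some text <b>bold</b>'

# First half: find the tag itself.
tag = re.search(r"<img\s[^>]*>", html, re.IGNORECASE)

# Second half, AND-AND style: a match-anything capture that is only
# allowed to see what the first half matched.
bounded = re.fullmatch(r"(.*)", tag.group(0), re.DOTALL).group(1)

# Plain AND style: the same capture free to run to the end of the text.
unbounded = html[tag.start():]

print(bounded)
print(unbounded)
```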

( ... ) Use parentheses to create matching sub-expressions within matching phrases. For example "foo(bar|bear|goat)" would match "foobar", "foobear" or even "foogoat". Parentheses can be nested, as in "foo(bar|(black|brown|puce) bear|goat)" which would match "foobar", "fooblackbear", "foobrownbear", etc. Also, as with "[...]", if the first character following a "(" is "^" the expression will match only if the expression within does not match. For example, "(^foo|bar)" would match anything that's not "foo" or "bar". Note that a negated expression consumes no characters - it just tests them. I think Perl calls this a "negative lookahead assertion".
+ The plus sign indicates a run of repeating characters. For instance, "a+" would match "a", "aa", or "aaaa". You can use it after other meta characters or parentheses for more complex runs. For example...

[abc]+ would match a run of any characters "a","b",or "c" like "ababccba"

([a-z]&[^n])+ would match a run of letters "a" through "z" but not "n"

(foo)+ would match "foo", "foofoo", "foofoofoo", etc.

An important point to make about + is that it's a "blind" run. This means it repeats as long as the condition it's testing is true, regardless of anything that follows it! For example "(foo)+foobar" could never match. Why? Well, the loop will eat up all the "foo's" there are, leaving no "foo" for "foobar"! This can actually be very useful sometimes, but if it's not what you want try "++" instead.

++ A double-plus acts much like the single "+" except it also pays attention to what comes afterwards (it can "see" so to speak). It only loops until it finds the rest of the expression matches. This is very similar to how the "*" operator works in normal regular expressions.
{n,n} Either "+" or "++" can be followed by a pair of "curly braces". These can be used to control the minimum and maximum times the expression may loop. For example, "[a]+{4,10}" would match a string of four to ten "A's" while "[b]+{20}" would match a string of exactly twenty "B's". An asterisk "*" denotes "infinity" so for example, "[c]+{10,*}" would match ten or more "C's". (For the regexp people in the audience, one difference to keep in mind is this must follow either "+" or "++" and cannot be used on its own.)
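
The difference between the blind "+" and the seeing "++" can be sketched with two tiny Python loops. These helpers (blind_run and seeing_run) are hypothetical names and only model literal text; requiring at least one repeat is an assumption of the sketch.

```python
def blind_run(text, unit, rest):
    # Like "(unit)+rest": eat every repeat of unit, then try rest once.
    pos = 0
    count = 0
    while text.startswith(unit, pos):
        pos += len(unit)
        count += 1
    return count >= 1 and text.startswith(rest, pos)

def seeing_run(text, unit, rest):
    # Like "(unit)++rest": after each repeat, stop as soon as rest matches.
    pos = 0
    while text.startswith(unit, pos):
        pos += len(unit)
        if text.startswith(rest, pos):
            return True
    return False

sample = "foofoofoobar"
print(blind_run(sample, "foo", "foobar"))   # the run leaves no "foo" for "foobar"
print(seeing_run(sample, "foo", "foobar"))  # the run stops early enough
```

The blind loop consumes all three "foo"s and finds only "bar" left, while the seeing loop stops after the second "foo" because "foobar" matches from there.
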
\ The backslash can be used to "escape" any character that has special meaning and treat it as a normal character. For example, to match a parenthesis in the HTML text use "\(", to match a backslash itself use "\\".
= The equal character has special "magic". It will match not only the "=" itself, but also any whitespace before or after - making tests for tag values easier. For example, foo="bar" also matches foo= "bar" or foo = "bar".
" The double quote - it will match either double or single quotes (since either may be used in HTML). For example " * " would match...

"oh happy mongoose" ...or... 'oh happy mengeese'

If you want to test for double quotes explicitly just use the backslash - \"

' The single quote is smarter than your average quote: It attempts to match the appropriate ending quote for any quote previously matched by the double quote - even if there are other quotes in between! Confused? Don't be - in HTML it's common to use a mixture of single and double quotes when you need "quotes within quotes" - look at the following examples...

single within double: href=" ' bison.html ' ); " or...

double within single: href=' " bison.html " ); '

Both of these could be matched by href=( " * ' ) - simply use the double quote to match the initial quote and the single quote to match the ending quote. There are some restrictions here. First, both the starting and ending quotes must be in the same sub-expression - that means in the same set of nested parentheses. For example...

" some text ' works...

( " some text ' ) works too...

" ( some text | other text) ' even works... but

" ( some text ' ) won't work worth a darn

( " | ) some text ( ' | ) sadly, won't work either

Another restriction is start and end quotes can't be nested in the same sub-expression - a matching clause of...

" something " something else ' end of something ' won't work

However, you can nest them using a different sub-expression, like so...

" something ( " something else ' ) end of something '

It's also worth noting that if no previous double quote was matched, the single quote just matches a normal single quote. Still it's safer to use \' to explicitly check for a single quote if you need to.
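
In regular expressions the nearest equivalent of this quote pairing is a backreference: capture whichever quote opened the value, then require the same one to close it. The Python sketch below shows that analogue; it is an approximation and does not model the sub-expression restrictions described above.

```python
import re

# (["']) plays the role of Proxomitron's double quote (either quote type);
# \1 plays the role of the single quote (the matching closing quote).
HREF = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)

samples = [
    '''href=" ' bison.html ' ); "''',   # single within double
    """href=' " bison.html " ); '""",   # double within single
]
for text in samples:
    print(repr(HREF.search(text).group(2)))
```

In both cases the inner quotes are skipped because they differ from the quote that opened the value.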


Special Replacement Text codes

Besides the matching rules, there are a few special codes that are used in the replacement text.

First "\0" through "\9" are used to insert values stored into the corresponding variables from the matching expression. For stuff captured by "\#" you can either use "\@" which will print everything that's been stored or "\#" again which will print the next item it stored each time you use it in the replacement (so if for example you stored three items, using "\# \# \#" would print them with spaces in between).
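
The behaviour of "\#" and "\@" can be pictured with an ordinary Python queue. This is a mental model only: the variable names are invented for the example, and printing items oldest first is an assumption, since the text above only says "the next item".

```python
from collections import deque

# Stand-in for the Replacement Stack: each \# match appends one value.
stack = deque()
for captured in ["foo", "bar"]:
    stack.append(captured)

# \@ in the replacement: everything stored, joined together.
all_values = "".join(stack)

# "\# \#" in the replacement: one stored item per use, oldest first (assumed).
one_by_one = " ".join(stack.popleft() for _ in range(2))

print(all_values)
print(one_by_one)
```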

There are also a few other special codes you can use in the replace section...

\u Includes the full URL of the current web page.
\k Kills the current connection - useful in HTTP headers to ban specific URLs and in web page filters to skip loading the rest of a page.
\h Includes the host portion of the URL.
\p Includes the path portion.
\q Includes any query string from a URL (anything following a "?").
\a Includes any anchor text from a URL (anything following a "#").
\d Includes the application's base directory in a "file://" URL format (used to aid in including substitute images or data).
\x Includes the URL command prefix if you've set one (careful with this! You don't want to accidentally give it to a remote server).

Note: \h, \p, \q, \a and \u can actually be used in the matching section as well. \h in particular can sometimes be useful to see if a URL on the page has the same hostname as the page itself (or is located on a different server).
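
To see which piece of a URL each code refers to, it can help to split one with Python's standard urllib.parse module. The URL is made up, and the mapping is only a guide: whether Proxomitron's \p keeps the leading "/" or \q keeps the "?" is not stated above, so the exact boundaries here are those of urlsplit, not necessarily Proxomitron's.

```python
from urllib.parse import urlsplit

url = "http://www.example.com/images/logo.gif?size=large#top"
parts = urlsplit(url)

codes = {
    r"\u": url,             # full URL
    r"\h": parts.netloc,    # host portion
    r"\p": parts.path,      # path portion
    r"\q": parts.query,     # query string (after "?")
    r"\a": parts.fragment,  # anchor (after "#")
}
for code, value in codes.items():
    print(code, value)

def same_host(page_url, link_url):
    # The kind of check \h makes possible: is the link on the page's own server?
    return urlsplit(page_url).netloc.lower() == urlsplit(link_url).netloc.lower()

print(same_host(url, "http://ads.example.net/banner.gif"))
```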


Matching commands (extended functions)

Besides the normal meta-characters above, Proxomitron also now features special matching commands. They extend the normal matching rules and add all sorts of useful functions. They all begin with a "$", followed by an upper case name and then parentheses "(...)" containing the command parameters. One example is the $LST(...) command, which is used to check a blocklist from anywhere within a match - for example...

<img * src="$LST(ImageURLCheck)" * >

This might be used to check whether the URL of an image tag is matched somewhere in the blocklist named "ImageURLCheck". Many more such commands are available and using them can make filter design much easier. While the example above is used in the match section, others are used in the replace (and some in both). Click here for full details.
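
The idea of checking part of a match against a named list can be sketched in Python as follows. Everything here is illustrative: the list contents are invented, the entries use shell-style wildcards from Python's fnmatch module rather than Proxomitron's own list syntax, and is_blocked is a hypothetical helper.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical stand-in for the "ImageURLCheck" blocklist.
IMAGE_URL_CHECK = ["*/ads/*", "*banner*"]

# Pull the src value out of an <img ...> tag, wherever it appears in the tag.
IMG_SRC = re.compile(r'<img\s[^>]*?\bsrc\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

def is_blocked(tag):
    found = IMG_SRC.search(tag)
    if found is None:
        return False
    url = found.group(1).lower()   # compare case-insensitively
    return any(fnmatchcase(url, entry) for entry in IMAGE_URL_CHECK)

print(is_blocked('<img width=1 src="http://example.com/ads/1.gif" height=1>'))
print(is_blocked('<img src="photo.jpg">'))
```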

This command syntax allows me to easily add new commands without creating conflicts or making even more characters "meta" characters.

The End...?

Well, that's all the rules in a nutshell. For more examples on how to use them, see tips and tricks.
