2006-04-29

.NET regular expressions: finding material outside quotes and general RegEx advice

I was surprised when I couldn't find through Google a good recipe for finding "unquoted" material (i.e. material which is outside quotes) with regular expressions (RegExps). If that's all you want, skip through to Solution below.

Background
As a long-term user of Lucene, and in particular the series of Lucene .NET ports which went from Lucene.NET to dotLucene and now back again, we've done a lot of work to create helper interfaces to Lucene, and since early 2005, our site of 150 000 documents or so has had its search engine powered by Lucene.NET. Essentially we pre-parse user input to try and provide an "intuitive" search function - it parses the users' search terms and tries to recognise terms related to categories we use, constructing a Lucene search which is as relevant as possible.

I found a bug in which some of our substitutions were applied even when the material subsituted is inside quotes, which in our search means "an exact phrase match". So a user looking for the phrase "United States president" would get an error, because United States is one of the special substitutions we apply.

Strategy: only to apply tests on parts of the user's search query which are not inside quotes.

I started out with a solution for CSV data splitting on the 4GuysFromRolla.com site, but CSV data gives you a convenient anchor to start the search from (the comma). In any case a rather hurried reading suggests it would be liable to failure in several circumstances.

Solution
It's an easier problem to find text inside quotes because you have known anchors on either side. The solution for the problem of finding text outside quotes is based on combining zero-width negative look behind and positive look-ahead assertions. In general, you would try and avoid these because they are slower to perform than "normal" RegExps (particularly in PERL - I haven't seen any benchmarks in .NET), but I think it's the only reliable way of doing it.

The idea is the same as the rolla.com link above - to look behind for zero or an even number of quotes, and forward for the same. If you can do this, than everything at your current point is unquoted. So

(?<=^([^\"]*\"[^\"]*\")*(?![\"]*\"))(?<1>[^\"]*))

as your pattern will put any continuous piece of text outside quotes into your 1st capturing group.

Breaking this down, it says roughly: look back, and see if there have been an even number or zero double quotes before this point which are not followed by a single double quote. If that's right, pick up all the non-quote characters you can. This works fine for matching multiple times, beause the look-behind and look ahead assertions are zero-width, i.e. they do not "consume" any characters from the match. Only my explicit group "consumes" characters.

Note that, if you are intending to do any work on the captured material, it is best to signal that you are capturing only explicitly. In my expression above. (?<1>...) indicates that the expression should capture this group if it is matched. Microsoft describe this syntax and also explain how to exclude non-capturing groups using System.Text.RegularExpressions.RegexOptions.ExplicitCapture.

In general I would advise all but the simplest regular expressions to use explicit capture, as it simplifies keeping track of what will be captured in which group - especially important if you intend to use constructs like ? or * to indicate that some groups may or may not exist within a match. And if you need to capture groups which may or may not exist in the match, then it is far preferable to capture groups into a name, using the syntax

(?<this_group_may_exist>I\swas\sglad) ?

(match 0 or 1 of the phrase "I was glad" and store it in a group called "this_group_may_exist"

or perhaps more readably,

(?'this_group_may_exist'I\swas\sglad) ?

How we use this is to build and pre-compile regular expressions with a pattern of the general form

(?<=^([^\"]*\"[^\"]*\")*(?![\"]*\")[^\"]*)\b(?<1>United States|US|USA))\b

and then replacing (in this case) with the text phrase "United States" (including quotes). It's a bit more complicated than that, to minimise the actual number of RegExps applied, but it works OK.

Our solution works well, being both fast and reliable, but if we had the money, I'd probably still use a commercial solution that builds in a lot of thesaurus-like behaviour and a commercial-grade relevance engine.

0 Comments:

Post a Comment

<< Home