2006-04-29

.NET regular expressions: finding material outside quotes and general RegEx advice

I was surprised when I couldn't find through Google a good recipe for finding "unquoted" material (i.e. material which is outside quotes) with regular expressions (RegExps). If that's all you want, skip through to Solution below.

Background
As a long-term user of Lucene, and in particular the series of Lucene .NET ports which went from Lucene.NET to dotLucene and now back again, we've done a lot of work to create helper interfaces to Lucene, and since early 2005, our site of 150 000 documents or so has had its search engine powered by Lucene.NET. Essentially we pre-parse user input to try and provide an "intuitive" search function - it parses the users' search terms and tries to recognise terms related to categories we use, constructing a Lucene search which is as relevant as possible.

I found a bug in which some of our substitutions were applied even when the material subsituted is inside quotes, which in our search means "an exact phrase match". So a user looking for the phrase "United States president" would get an error, because United States is one of the special substitutions we apply.

Strategy: only to apply tests on parts of the user's search query which are not inside quotes.

I started out with a solution for CSV data splitting on the 4GuysFromRolla.com site, but CSV data gives you a convenient anchor to start the search from (the comma). In any case a rather hurried reading suggests it would be liable to failure in several circumstances.

Solution
It's an easier problem to find text inside quotes because you have known anchors on either side. The solution for the problem of finding text outside quotes is based on combining zero-width negative look behind and positive look-ahead assertions. In general, you would try and avoid these because they are slower to perform than "normal" RegExps (particularly in PERL - I haven't seen any benchmarks in .NET), but I think it's the only reliable way of doing it.

The idea is the same as the rolla.com link above - to look behind for zero or an even number of quotes, and forward for the same. If you can do this, than everything at your current point is unquoted. So

(?<=^([^\"]*\"[^\"]*\")*(?![\"]*\"))(?<1>[^\"]*))

as your pattern will put any continuous piece of text outside quotes into your 1st capturing group.

Breaking this down, it says roughly: look back, and see if there have been an even number or zero double quotes before this point which are not followed by a single double quote. If that's right, pick up all the non-quote characters you can. This works fine for matching multiple times, beause the look-behind and look ahead assertions are zero-width, i.e. they do not "consume" any characters from the match. Only my explicit group "consumes" characters.

Note that, if you are intending to do any work on the captured material, it is best to signal that you are capturing only explicitly. In my expression above. (?<1>...) indicates that the expression should capture this group if it is matched. Microsoft describe this syntax and also explain how to exclude non-capturing groups using System.Text.RegularExpressions.RegexOptions.ExplicitCapture.

In general I would advise all but the simplest regular expressions to use explicit capture, as it simplifies keeping track of what will be captured in which group - especially important if you intend to use constructs like ? or * to indicate that some groups may or may not exist within a match. And if you need to capture groups which may or may not exist in the match, then it is far preferable to capture groups into a name, using the syntax

(?<this_group_may_exist>I\swas\sglad) ?

(match 0 or 1 of the phrase "I was glad" and store it in a group called "this_group_may_exist"

or perhaps more readably,

(?'this_group_may_exist'I\swas\sglad) ?

How we use this is to build and pre-compile regular expressions with a pattern of the general form

(?<=^([^\"]*\"[^\"]*\")*(?![\"]*\")[^\"]*)\b(?<1>United States|US|USA))\b

and then replacing (in this case) with the text phrase "United States" (including quotes). It's a bit more complicated than that, to minimise the actual number of RegExps applied, but it works OK.

Our solution works well, being both fast and reliable, but if we had the money, I'd probably still use a commercial solution that builds in a lot of thesaurus-like behaviour and a commercial-grade relevance engine.

2006-04-05

Pun Corner

OK, my wife thinks these aren't funny. I suspect she's right. But at least they're my own.

What happens when the King of France tries to invade Norman England?

Resistance is feudal.

The US president sends his top negotiator over to the WTO talks because of an outcry by European fruit growers over protectionism in the Californian dried fruit market. Republican votes are precious in California. When she gets there, she finds that opinion is dead set against the American position and that given America's need to avoid tit-for-tat duties, she is in an impossible position. She cables back to her boss:

Raisin stance is futile.

One night an inebriated American scholar tries to put Archimedes' Lever to the test. But instead of moving the earth, his great desire is to move the stars around so they spell out his name. Having secured a long pole of extremely durable construction several million light years long, he manages to waggle it vaguely in the right direction. However, every time he manages to knock one out of position, gravitational fields quickly bring it back into place. His conclusion?

Raisin' stars is futile.

This probably runs in the family. Cassandra of the Daily Mail was a vague relative, I believe, and he was one of the greatest punsters of the 20th century. Two of his:

  1. Charity worker composes a letter asking potential donors to send back money for their appeals in Nigeria, Fiji and Greenland. Boss writes back saying that the message is too confusing: "Don't put all your begs in one ask-it."
  2. Construction working at a Bilbao bull-ring means that the ways in and out are much reduced. A bull escapes during the fight, and spectators stampede. Many are trampled to death at the egress. Moral of the story? Don't put all your Basques in one exit.

Which reminds me of the old Barnum trick "1 shilling to see the egress..."

But enough puns for now, any rate.

Short but successful

Off on holiday tomorrow, so keeping it short but sweet on the floor.
  • Completed a site deployment on Sunday with minimal downtime which enables a new form of portal service for some new styles of client who are integrating some of our service.
  • Closer to agreeing spec on new product.
  • Implementing feed technologies for our product is looking good. Some great ideas here, but marketing the implementation correctly will be more important.
  • Closed down some options on office migration.
Good week. Short.
O

2006-04-04

Db - successor to C#

OK, how about a language where you declare different protection levels on your classes by declaring them LowerClass, MiddleClass, or UpperClass?

See this CodeProject article