Multiline regex pattern


Task: Parse a file and capture whatever text appears between a pair of double quotes like the following:

"Catch me"

That is not so difficult; you could use the following regex:

".*"

This will match any character within double quotes.
But does it really match any character?

Well, not quite. Consider text that spans several lines (that is, text containing CR / LF characters), like the following:

"Catch me
if you can"

The . special character is usually described as “any character”, but it actually means “any character except newlines”, so it won’t work in our case. You also can’t put it inside a set of characters like [.\s], which you might expect to mean:

  • . match any character
  • \s plus any whitespace character (including newlines)
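A quick way to check how such a set really behaves is to test it against single characters. The following is only a minimal Python sketch; the variable names are illustrative and not part of the original example.

```python
import re

# The character set discussed above: a dot and \s inside square brackets
char_set = re.compile(r'[.\s]')

# fullmatch() succeeds only if the whole one-character string is matched
matches_dot = char_set.fullmatch('.') is not None       # a literal dot
matches_letter = char_set.fullmatch('a') is not None    # an ordinary letter
matches_newline = char_set.fullmatch('\n') is not None  # a newline (whitespace)

print(matches_dot, matches_letter, matches_newline)
```

The set accepts a literal dot and a newline but rejects the letter, so it cannot stand in for “any character”.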

The problem is that inside [] special characters like . or * or ? lose their meaning and are treated as literals. Python addresses this issue with the option re.DOTALL, which makes a dot really mean any character, even a newline. If your regex library lacks such an option, or you want a pattern that behaves the same regardless of flags, you can use this trick: "[\w\W]*" (.NET does have an equivalent option, RegexOptions.Singleline, but the trick works in C# too.)

The [\w\W] means: “match any word character or any non-word character”. You could get the same result by combining other special characters, but I find this way especially clear: since you are just adding two complementary sets, there is no doubt that you will truly match any character.
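To compare the three approaches side by side, here is a small Python sketch. It assumes the two-line sample text from above; the variable names are only illustrative.

```python
import re

text = '"Catch me\nif you can"'

# Plain dot: cannot cross the newline, so no match is found
plain = re.search(r'".*"', text)

# re.DOTALL: the dot now matches newlines as well
dotall = re.search(r'".*"', text, re.DOTALL)

# Complementary sets: matches any character without needing a flag
trick = re.search(r'"[\w\W]*"', text)

print(plain)
print(dotall.group(0) == trick.group(0))
```

The plain pattern finds nothing, while the re.DOTALL version and the [\w\W] version both match the whole quoted text.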

Making things more complex

If you have a file like this:

"Catch me
if you can"
"other line"
"and the last one"

The regex will match from the first double quote to the very last one, swallowing all three quoted strings in a single match.

When you use * you are saying “match zero or more repetitions of the preceding element”, and the regex engine will choose the longest match possible, because * is a greedy quantifier. To solve this, use the non-greedy quantifier *?, as in this regex:

"[\w\W]*?"

And to put everything except the double quotes in a group (so you can, for instance, iterate over the results), just add parentheses right inside the quotes:

"([\w\W]*?)"
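The difference between the two quantifiers is easy to see with a short Python sketch. It assumes the four-line sample text shown earlier; the variable names are illustrative.

```python
import re

text = '"Catch me\nif you can"\n"other line"\n"and the last one"'

# Greedy: one single match running from the first quote to the last one
greedy = re.findall(r'"([\w\W]*)"', text)

# Non-greedy: each match stops at the nearest closing quote
lazy = re.findall(r'"([\w\W]*?)"', text)

print(len(greedy))
print(lazy)
```

With a single group in the pattern, re.findall returns the captured text of each match, so the greedy list holds one long string and the non-greedy list holds the three quoted texts separately.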

Code examples

In C# (Visual C# 2010) we need to do the following:

using System;
using System.IO;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class TestRegularExpressions
    {
        static void Main()
        {
            // Doubled "" escape a double quote inside a verbatim string
            // "?<quoted_line>" gives the capture group a simple name
            // @ makes this a verbatim string literal, so backslashes are not
            // treated as C# escape sequences (handy when writing regex patterns)
            string pattern = @"""(?<quoted_line>[\w\W]*?)""";

            Regex regex = new Regex(pattern);

            // Read the whole file into a single string
            string text = File.ReadAllText(@"c:\Users\adrian\test.txt");
            /*
            * Suppose that c:\Users\adrian\test.txt has the following content:
            *
            "Catch me
            if you can"
            "other line"
            "and the last one"
            */

            Match m = regex.Match(text);

            // Iterate over all the matches
            while (m.Success)
            {
                Console.WriteLine("Captured line: " + m.Groups["quoted_line"]);
                m = m.NextMatch();
            }

            Console.WriteLine();

        }
    }
}

This will print:
Captured line: Catch me
if you can
Captured line: other line
Captured line: and the last one


Of course, in Python it takes less effort to get the same result.

import re

pattern = r'"(?P<quoted_line>.*?)"'

text = """"Catch me
if you can"
"other line"
"and the last one" """

# Retrieve group(s) by name
for m in re.finditer(pattern, text, re.DOTALL):
    print("Captured line: %s" % m.group("quoted_line"))

The output is the same as before:

Captured line: Catch me
if you can
Captured line: other line
Captured line: and the last one

Please note the differences between Python and C#:

  • As mentioned earlier, in Python you can use re.DOTALL so that the dot also matches newlines.
  • To name a group “quoted_line” you write ?P<quoted_line> in Python instead of the C# version ?<quoted_line>
  • You write less and get more!
