Querying with regexes
When parsing simple data formats, it is often easier to write a regular expression (regex for short) than a full parser. Rust has solid support for this through the regex crate.
Getting ready
In order to really understand this chapter, you should be familiar with regexes. There are countless free online resources for this, like regexone (https://www.regexone.com/).
Note
This recipe will not pass clippy without warnings. We intentionally kept the regexes very simple in order to keep the focus of the recipe on the code rather than on the regexes themselves. Some of the examples shown could have been rewritten to use .contains() instead.
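As a rough illustration of the note above, a pattern without any special symbols can usually be replaced by a plain string method. The sketch below uses only the standard library; the helper name mentions_rust is hypothetical and not part of the recipe.

```rust
// Hypothetical helper: looks for a fixed substring without compiling a regex.
// For a pattern with no special symbols, this does the same job as
// Regex::new(r"rust") followed by is_match().
fn mentions_rust(text: &str) -> bool {
    text.contains("rust")
}

fn main() {
    println!("Contains 'rust'? {}", mentions_rust("I like rust"));
    println!("Contains 'rust'? {}", mentions_rust("I like parsers"));
}
```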
How to do it...
- Open the Cargo.toml file that was generated earlier for you.
- Under [dependencies], add the following line:
regex = "0.2"
- If you want, you can go to regex's crates.io page (https://crates.io/crates/regex) to check for the newest version and use that one instead.
- In the bin folder, create a file called regex.rs.
- Add the following code and run it with cargo run --bin regex:
1  extern crate regex;
2
3  fn main() {
4      use regex::Regex;
5      // Beginning a string with 'r' makes it a raw string,
6      // in which you don't need to escape any symbols
7      let date_regex = Regex::new(r"^\d{2}.\d{2}.\d{4}$").expect("Failed to create regex");
8      let date = "15.10.2017";
9      // Check for a match
10     let is_date = date_regex.is_match(date);
11     println!("Is '{}' a date? {}", date, is_date);
12
13     // Let's use capture groups now
14     let date_regex = Regex::new(r"(\d{2}).(\d{2}).(\d{4})").expect("Failed to create regex");
15     let text_with_dates = "Alan Turing was born on 23.06.1912 and died on 07.06.1954. \
16         A movie about his life called 'The Imitation Game' came out on 14.11.2017";
17     // Iterate over the matches
18     for cap in date_regex.captures_iter(text_with_dates) {
19         println!("Found date {}", &cap[0]);
20         println!("Year: {} Month: {} Day: {}", &cap[3], &cap[2], &cap[1]);
21     }
22     // Replace the date format
23     println!("Original text:\t\t{}", text_with_dates);
24     let text_with_indian_dates = date_regex.replace_all(text_with_dates, "$1-$2-$3");
25     println!("In indian format:\t{}", text_with_indian_dates);
26
27     // Replacing groups is easier when we name them
28     // ?P<somename> gives a capture group a name
29     let date_regex = Regex::new(r"(?P<day>\d{2}).(?P<month>\d{2}).(?P<year>\d{4})")
30         .expect("Failed to create regex");
31     let text_with_american_dates = date_regex.replace_all(text_with_dates, "$month/$day/$year");
32     println!("In american format:\t{}", text_with_american_dates);
33     let rust_regex = Regex::new(r"(?i)rust").expect("Failed to create regex");
34     println!("Do we match RuSt? {}", rust_regex.is_match("RuSt"));
35     use regex::RegexBuilder;
36     let rust_regex = RegexBuilder::new(r"rust")
37         .case_insensitive(true)
38         .build()
39         .expect("Failed to create regex");
40     println!("Do we still match RuSt? {}", rust_regex.is_match("RuSt"));
41 }
How it works...
You can construct a regex object by calling Regex::new() with a valid regex string [7]. Most of the time, you will want to pass a raw string in the form of r"...". Raw means that all symbols in the string are taken literally, with no escape sequences being processed. This is important because of the backslash (\) character, which regex uses to represent a couple of important concepts, such as digits (\d) or whitespace (\s). However, Rust already uses the backslash in normal strings to escape special symbols, such as the newline (\n) or the tab (\t) [23]. If we wanted to use a backslash in a normal string, we would have to escape it by doubling it (\\). The regex on line [14] would then have to be rewritten as:
"(\\d{2}).(\\d{2}).(\\d{4})"
Worse yet, if we wanted to match the backslash itself, we would have to escape it in the regex as well. With normal strings, we would have to write four backslashes (\\\\) to match a single one!
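The equivalence between raw and escaped strings can be checked directly. The following is a minimal sketch using only the standard library; the function names are hypothetical and exist only to give the two spellings a label.

```rust
// Hypothetical helpers that return the same pattern, written in two ways.
fn raw_pattern() -> &'static str {
    // Raw string: backslashes are taken literally
    r"(\d{2}).(\d{2}).(\d{4})"
}

fn escaped_pattern() -> &'static str {
    // Normal string: every backslash has to be doubled
    "(\\d{2}).(\\d{2}).(\\d{4})"
}

fn main() {
    // Both literals produce exactly the same characters
    assert_eq!(raw_pattern(), escaped_pattern());

    // A regex that matches one literal backslash consists of two characters,
    // which a normal string literal has to spell with four backslashes
    let raw_backslash = r"\\";
    let escaped_backslash = "\\\\";
    assert_eq!(raw_backslash, escaped_backslash);
    assert_eq!(raw_backslash.len(), 2);

    println!("Both spellings are identical: {}", raw_pattern());
}
```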
We can save ourselves this loss of readability and the resulting confusion by using raw strings and writing our regex normally. In fact, it is considered good style to use raw strings for every regex, even when it doesn't contain any backslashes [33]. This helps your future self when you notice down the line that you would actually like to use a feature that requires a backslash.
We can iterate over the matches of our regex [18]. The object we get for every match is a collection of our capture groups. Keep in mind that index zero is always the entire match [19]. Index one is then the string from our first capture group, index two is the string of the second capture group, and so on [20]. Unfortunately, we do not get a compile-time check on our index, so if we accessed &cap[4], our program would compile but then panic at runtime.
When replacing, we follow the same concept: $0 is the entire match, $1 the result of the first capture group, and so on. To make our life easier, we can give the capture groups names by starting them with ?P<somename> [29] and then use these names when replacing [31].
There are many flags that you can specify, in the form of (?flag), for fine-tuning, such as i, which makes the match case insensitive [33], or x, which ignores whitespace in the regex string. If you want to read up on them, visit the crate's documentation (https://doc.rust-lang.org/regex/regex/index.html). Most of the time though, you can get the same result by using the RegexBuilder that is also in the regex crate [36]. Both of the rust_regex objects we generate on lines [33] and [36] are equivalent. While the second version is definitely more verbose, it is also much easier to understand at first glance.
There's more...
Regexes work by compiling their pattern strings into an internal matching program on creation, which is a relatively expensive step. For performance reasons, you are advised to reuse your regexes instead of creating them anew every time you use them. A good way of doing this is by using the lazy_static crate, which we will look at later in the book, in the Creating lazy static objects section in Chapter 5, Advanced Data Structures.
Note
Be careful not to overdo it with regexes. As they say, "When all you have is a hammer, everything looks like a nail." If you parse complicated data, regexes can quickly become an unbelievably complex mess. When you notice that your regex has become too big to understand at first glance, try to rewrite it as a parser.
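As a sketch of what rewriting as a parser can look like for the date format used in this recipe, the following function splits the text by hand and checks the ranges of the day and month, which would be awkward to express in a regex. It uses only the standard library; parse_date is a hypothetical name, and the range check is deliberately rough (it does not know how many days each month has).

```rust
// Hypothetical helper: parses "dd.mm.yyyy" into (day, month, year).
fn parse_date(text: &str) -> Option<(u32, u32, u32)> {
    let mut parts = text.split('.');
    let day: u32 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let year: u32 = parts.next()?.parse().ok()?;
    // Reject trailing parts, as in "15.10.2017.1"
    if parts.next().is_some() {
        return None;
    }
    // Rough plausibility check on the values
    if day >= 1 && day <= 31 && month >= 1 && month <= 12 {
        Some((day, month, year))
    } else {
        None
    }
}

fn main() {
    println!("{:?}", parse_date("15.10.2017"));
    println!("{:?}", parse_date("99.99.2017"));
}
```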
See also
- Creating lazy static objects recipe in Chapter 5, Advanced Data Structures