Rust Standard Library Cookbook: Over 75 recipes to leverage the power of Rust

Product type: Paperback
Published in: Mar 2018
Publisher: Packt
ISBN-13: 9781788623926
Length: 360 pages
Edition: 1st Edition
Languages: Rust
Concepts: Programming Language
Authors (2): Jan Hohenheim, Daniel Durante

Querying with regexes

When parsing simple data formats, it is often easier to write a regular expression (regex for short) than a full parser. Rust supports this well through its regex crate.

Getting ready

To get the most out of this recipe, you should be familiar with regexes. There are countless free online resources for this, such as RegexOne (https://www.regexone.com/).

Note

This recipe will not conform to clippy. We intentionally kept the regexes very simple so that the focus stays on the code rather than on the regex, and some of the examples shown could have been rewritten to use .contains() instead.

How to do it...

1. Open the Cargo.toml file that was generated earlier for you.
2. Under [dependencies], add the following line:

regex = "0.2"

3. If you want, you can go to regex's crates.io page (https://crates.io/crates/regex) to check for the newest version and use that one instead.
4. In the bin folder, create a file called regex.rs.
5. Add the following code and run it with cargo run --bin regex:

1   extern crate regex;
2
3   fn main() {
4     use regex::Regex;
5     // Beginning a string with 'r' makes it a raw string,
6     // in which you don't need to escape any symbols
7     let date_regex = Regex::new(r"^\d{2}.\d{2}.\d{4}$").expect("Failed to create regex");
8     let date = "15.10.2017";
9     // Check for a match
10    let is_date = date_regex.is_match(date);
11    println!("Is '{}' a date? {}", date, is_date);
12
13    // Let's use capture groups now
14    let date_regex = Regex::new(r"(\d{2}).(\d{2}).(\d{4})").expect("Failed to create regex");
15    let text_with_dates = "Alan Turing was born on 23.06.1912 and died on 07.06.1954. \
16      A movie about his life called 'The Imitation Game' came out on 14.11.2014";
17    // Iterate over the matches
18    for cap in date_regex.captures_iter(text_with_dates) {
19      println!("Found date {}", &cap[0]);
20      println!("Year: {} Month: {} Day: {}", &cap[3], &cap[2], &cap[1]);
21    }
22    // Replace the date format
23    println!("Original text:\t\t{}", text_with_dates);
24    let text_with_indian_dates = date_regex.replace_all(text_with_dates, "$1-$2-$3");
25    println!("In Indian format:\t{}", text_with_indian_dates);
26
27    // Replacing groups is easier when we name them
28    // ?P<somename> gives a capture group a name
29    let date_regex = Regex::new(r"(?P<day>\d{2}).(?P<month>\d{2}).(?P<year>\d{4})")
30      .expect("Failed to create regex");
31    let text_with_american_dates = date_regex.replace_all(text_with_dates, "$month/$day/$year");
32    println!("In American format:\t{}", text_with_american_dates);
33    let rust_regex = Regex::new(r"(?i)rust").expect("Failed to create regex");
34    println!("Do we match RuSt? {}", rust_regex.is_match("RuSt"));
35    use regex::RegexBuilder;
36    let rust_regex = RegexBuilder::new(r"rust")
37      .case_insensitive(true)
38      .build()
39      .expect("Failed to create regex");
40    println!("Do we still match RuSt? {}", rust_regex.is_match("RuSt"));
41  }

How it works...

You can construct a regex object by calling Regex::new() with a valid regex string [7]. Most of the time, you will want to pass a raw string in the form of r"...". Raw means that every character in the string is taken literally, with no escape sequences being processed. This matters because regexes use the backslash (\) to express important concepts such as digits (\d) or whitespace (\s), while Rust already uses the backslash in normal strings to escape special characters such as the newline (\n) or the tab (\t) [23]. To put a backslash into a normal string, we would have to escape it by doubling it (\\). The regex on line [14] would then have to be rewritten as:

"(\\d{2}).(\\d{2}).(\\d{4})"

Worse yet, to match a literal backslash, we would have to escape it for the regex engine as well. In a normal string, that means writing four backslashes (\\\\)! Raw strings spare us this loss of readability and let us write the regex as it is. In fact, it is considered good style to use raw strings for every regex, even when it contains no backslashes [33]. This helps your future self if you notice down the line that you need a feature that requires a backslash.
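The escaping rules above can be checked without the regex crate, since raw strings are a plain Rust feature. The following is a minimal sketch using only the standard library; the helper name same_text is hypothetical and exists only for illustration.

```rust
/// Returns true when the escaped and the raw spelling denote the same text.
/// (Hypothetical helper, used only to make the comparison explicit.)
fn same_text(escaped: &str, raw: &str) -> bool {
    escaped == raw
}

fn main() {
    // A normal string needs a doubled backslash for every regex backslash...
    let escaped = "(\\d{2}).(\\d{2}).(\\d{4})";
    // ...while a raw string is written exactly as the regex engine sees it
    let raw = r"(\d{2}).(\d{2}).(\d{4})";
    assert!(same_text(escaped, raw));

    // A regex that matches one literal backslash consists of two characters
    let escaped_backslash = "\\\\";
    let raw_backslash = r"\\";
    assert!(same_text(escaped_backslash, raw_backslash));
    assert_eq!(raw_backslash.len(), 2);

    // A raw string that contains double quotes can be delimited with r#"..."#
    let quoted = r#"name="(\w+)""#;
    assert!(same_text("name=\"(\\w+)\"", quoted));

    println!("{}", raw);
}
```

Both spellings produce identical strings, so the choice is purely about readability of the source code.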

We can iterate over the matches of our regex [18]. The object we get for every match is a collection of its capture groups. Keep in mind that index zero is always the entire match [19]. Index one is then the string from our first capture group, index two is the string from the second capture group, and so on [20]. Unfortunately, we do not get a compile-time check on our index, so if we accessed &cap[4], our program would compile but then panic at runtime.

When replacing, we follow the same concept: $0 is the entire match, $1 is the result of the first capture group, and so on. To make our life easier, we can give the capture groups names by starting them with ?P<somename> [29] and then use these names when replacing [31].

There are many flags that you can specify in the form of (?flag) for fine-tuning, such as i, which makes the match case insensitive [33], or x, which ignores whitespace in the regex string. If you want to read up on them, visit the regex documentation (https://doc.rust-lang.org/regex/regex/index.html). Most of the time though, you can get the same result by using the RegexBuilder that is also in the regex crate [36]. The two rust_regex objects we create in lines [33] and [36] behave identically. While the second version is definitely more verbose, it is also much easier to understand at first glance.

There's more...

A Regex is compiled from its pattern string at runtime, when it is created, which is a comparatively expensive step. For performance reasons, you are advised to reuse your regexes instead of creating them anew every time you use them. A good way of doing this is by using the lazy_static crate, which we will look at later in the book, in the Creating lazy static objects section in Chapter 5, Advanced Data Structures.

Note

Be careful not to overdo it with regexes. As they say, "When all you have is a hammer, everything looks like a nail." If you parse complicated data, regexes can quickly become an unbelievably complex mess. When you notice that your regex has become too big to understand at first glance, try to rewrite it as a parser.
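To give an impression of what such a rewrite can look like, here is a minimal hand-written parser for the dd.mm.yyyy format used in this recipe. It is only a sketch using the standard library; the Date struct and the function names are hypothetical, and the range checks are deliberately rough (they do not know how many days each month has).

```rust
/// A calendar date as plain numbers. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
struct Date {
    day: u32,
    month: u32,
    year: u32,
}

/// Parses a field that must consist of exactly `len` ASCII digits.
fn parse_field(field: &str, len: usize) -> Option<u32> {
    if field.len() == len && field.bytes().all(|b| b.is_ascii_digit()) {
        field.parse().ok()
    } else {
        None
    }
}

/// Parses a date of the form dd.mm.yyyy, returning None for anything else.
fn parse_date(input: &str) -> Option<Date> {
    let mut parts = input.split('.');
    let day = parse_field(parts.next()?, 2)?;
    let month = parse_field(parts.next()?, 2)?;
    let year = parse_field(parts.next()?, 4)?;
    // Reject trailing fields such as "15.10.2017.1"
    if parts.next().is_some() {
        return None;
    }
    // Rough plausibility checks that a simple regex would not give us
    if day < 1 || day > 31 || month < 1 || month > 12 {
        return None;
    }
    Some(Date { day, month, year })
}

fn main() {
    let expected = Date { day: 15, month: 10, year: 2017 };
    assert_eq!(parse_date("15.10.2017"), Some(expected));
    assert_eq!(parse_date("15.13.2017"), None); // there is no 13th month
    assert_eq!(parse_date("15-10-2017"), None); // wrong separator
    println!("{:?}", parse_date("15.10.2017"));
}
```

The parser is longer than the regex, but each rule is spelled out and can be extended or tested on its own.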