
Querying and Filtering Data

2015-06-25

In this article by Edwood Ng and Vineeth Mohan, authors of the book Lucene 4 Cookbook, we will cover the following recipes:

  • Performing advanced filtering
  • Creating a custom filter
  • Searching with QueryParser
  • TermQuery and TermRangeQuery
  • BooleanQuery
  • PrefixQuery and WildcardQuery
  • PhraseQuery and MultiPhraseQuery
  • FuzzyQuery


When it comes to search applications, usability is a key element that either makes or breaks the user's impression. Lucene does an excellent job of giving you the essential tools to build and search an index. In this article, we will look into some more advanced techniques to query and filter data, adding tools to your toolbox that will help you build a user-friendly search application.

Performing advanced filtering

Before we start, let us revisit two questions: what is a filter, and what is it for? In simple terms, a filter is used to narrow the search space or, in other words, to search within a search. Filter and Query may seem to provide the same functionality, but there is a significant difference between the two: a query calculates scores to rank results by their relevancy to the search terms, while a filter has no effect on scores. It's not uncommon for users to prefer navigating through a hierarchy of filters in order to land on the relevant results. You may often find yourself in a situation where it is necessary to refine a result set so that users can continue to search or navigate within a subset. With the ability to apply filters, we can easily provide such search refinements. Another situation is data security, where some parts of the data in the index are protected. You may need to include an additional filter behind the scenes, based on user access level, so that users are restricted to seeing only the items they are permitted to access. In both of these contexts, Lucene's filtering features provide the capability to achieve these objectives.

Lucene has a few built-in filters that are designed to fit most real-world applications. If you do find yourself in a position where none of the built-in filters is suitable for the job, you can rest assured that Lucene's extensibility will allow you to build your own custom filters. Let us take a look at Lucene's built-in filters:

  • TermRangeFilter: This filter restricts results to a range of terms defined by the lower and upper bounds of a submitted range. It is best used on a single-valued field because, on a tokenized field, any token that falls within the range will be matched by this filter. This is for textual data only.
  • NumericRangeFilter: Similar to TermRangeFilter, this filter restricts results to a range of numeric values.
  • FieldCacheRangeFilter: This filter runs on top of a number of range filters, including TermRangeFilter and NumericRangeFilter. It caches filtered results using FieldCache for improved performance. FieldCache is stored in memory, so the performance boost can be upward of 100x compared to the normal range filters. Because it uses FieldCache, it's best applied to single-valued fields only; it is not applicable to multivalued fields, or when the available memory is limited, since it maintains the FieldCache (in memory) on filtered results.
  • QueryWrapperFilter: This filter acts as a wrapper around a Query object. It is useful when you have complex business rules already defined in a Query and would like to reuse them for other business purposes. It makes a Query act like a filter so that it can be applied to other Queries. Because this is a filter, scoring within the wrapped Query is irrelevant.
  • PrefixFilter: This filter restricts results that match what's defined in the prefix. This is similar to a substring match, but limited to matching results with a leading substring only.
  • FieldCacheTermsFilter: This is a term filter that uses FieldCache to store the calculated results in memory. This filter works on a single-valued field only. One use of it is when you have a category field where results are usually shown by categories in different pages. The filter can be used as a demarcation by categories.
  • FieldValueFilter: This filter returns documents containing one or more values in the specified field. It is useful as a preliminary filter to ensure that a certain field exists before querying.
  • CachingWrapperFilter: This is a wrapper that adds a caching layer to a filter to boost performance. Note that this filter provides a general caching layer; it should be applied on a filter that produces a reasonably small result set, such as an exact match. Otherwise, larger results may unnecessarily drain the system's resources and can actually introduce performance issues.

If none of the above filters fulfills your business requirements, you can build your own by extending the Filter class and implementing its abstract method getDocIdSet(AtomicReaderContext, Bits).

How to do it...

Let's set up our test case with the following code:

Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
StringField stringField = new StringField("name", "", Field.Store.YES);
TextField textField = new TextField("content", "", Field.Store.YES);
IntField intField = new IntField("num", 0, Field.Store.YES);
doc.removeField("name"); doc.removeField("content");
doc.removeField("num");
stringField.setStringValue("First");
textField.setStringValue("Humpty Dumpty sat on a wall,");
intField.setIntValue(100);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);
doc.removeField("name"); doc.removeField("content");
doc.removeField("num");
stringField.setStringValue("Second");
textField.setStringValue("Humpty Dumpty had a great fall.");
intField.setIntValue(200);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);
doc.removeField("name"); doc.removeField("content");
doc.removeField("num");
stringField.setStringValue("Third");
textField.setStringValue("All the king's horses and all the king's men");
intField.setIntValue(300);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);
doc.removeField("name"); doc.removeField("content");
doc.removeField("num");
stringField.setStringValue("Fourth");
textField.setStringValue("Couldn't put Humpty together again.");
intField.setIntValue(400);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);
indexWriter.commit();
indexWriter.close();
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

How it works…

The preceding code adds four documents into an index. The four documents are:

  • Document 1

    Name: First

    Content: Humpty Dumpty sat on a wall,

    Num: 100

  • Document 2

    Name: Second

    Content: Humpty Dumpty had a great fall.

    Num: 200

  • Document 3

    Name: Third

    Content: All the king's horses and all the king's men

    Num: 300

  • Document 4

    Name: Fourth

    Content: Couldn't put Humpty together again.

    Num: 400

Here is our standard test case:

IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Query query = new TermQuery(new Term("content", "humpty"));
TopDocs topDocs = indexSearcher.search(query, FILTER, 100);
System.out.println("Searching 'humpty'");
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
   doc = indexReader.document(scoreDoc.doc);
   System.out.println("name: " + doc.getField("name").stringValue() +
       " - content: " + doc.getField("content").stringValue() + " - num: " + doc.getField("num").stringValue());
}
indexReader.close();

Running the code as it is will produce the following output, assuming the FILTER variable is declared:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Second - content: Humpty Dumpty had a great fall. - num: 200
name: Fourth - content: Couldn't put Humpty together again. - num: 400

This is a simple search on the word humpty. The search would return the first, second, and fourth sentences.

Now, let's take a look at a TermRangeFilter example:

TermRangeFilter termRangeFilter = TermRangeFilter.newStringRange("name", "A", "G", true, true);

Applying this filter to preceding search (by setting FILTER as termRangeFilter) will produce the following output:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400

Note that the second sentence is missing from the results because of this filter, which removes documents whose name falls outside the range A through G. The name values of the first and fourth documents, First and Fourth, begin with F, which is within the range, so those documents are included. The second document's name value, Second, is outside the range, so that document is not considered by the query.

Let's move on to NumericRangeFilter:

NumericRangeFilter numericRangeFilter = NumericRangeFilter.newIntRange("num", 200, 400, true, true);

This filter will produce the following results:

Searching 'humpty'
name: Second - content: Humpty Dumpty had a great fall. - num: 200
name: Fourth - content: Couldn't put Humpty together again. - num: 400

Note that the first sentence is missing from the results because its num value, 100, is outside the numeric range 200 to 400 specified in NumericRangeFilter.

Next one is FieldCacheRangeFilter:

FieldCacheRangeFilter fieldCacheTermRangeFilter = FieldCacheRangeFilter.newStringRange("name", "A", "G", true, true);

The output of this filter is similar to the TermRangeFilter example:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400

This filter provides a caching layer on top of TermRangeFilter. The results are the same, but performance is a lot better because the calculated results are cached in memory for subsequent retrievals.

Next is QueryWrapperFilter:

QueryWrapperFilter queryWrapperFilter = new QueryWrapperFilter(new TermQuery(new Term("content", "together")));

This example will produce this result:

Searching 'humpty'
name: Fourth - content: Couldn't put Humpty together again. - num: 400

This filter wraps a TermQuery on the term together in the content field. Since the fourth sentence is the only one that contains the word "together", the search results are limited to that sentence only.

Next one is PrefixFilter:

PrefixFilter prefixFilter = new PrefixFilter(new Term("name", "F"));

This filter produces the following:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400

This filter limits results to documents whose name field begins with the letter F. In this case, the first and fourth documents both have name values beginning with F (First and Fourth); hence the results.

Next is FieldCacheTermsFilter:

FieldCacheTermsFilter fieldCacheTermsFilter = new FieldCacheTermsFilter("name", "First");

This filter produces the following:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100

This filter limits results to documents whose name field holds the value First. Since the first document is the only one whose name is First, only one sentence is returned in the search results.

Next is FieldValueFilter:

FieldValueFilter fieldValueFilter = new FieldValueFilter("name1");

This would produce the following:

Searching 'humpty'

Note that there are no results because this filter limits results to documents in which there is at least one value in the field name1. Since the name1 field doesn't exist in our example, no documents are returned by this filter; hence, zero results.

Next is CachingWrapperFilter:

TermRangeFilter termRangeFilter = TermRangeFilter.newStringRange("name", "A", "G", true, true);
CachingWrapperFilter cachingWrapperFilter = new CachingWrapperFilter(termRangeFilter);

This wrapper wraps the same TermRangeFilter from above, so it produces the same results:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400

Filters work in conjunction with Queries to refine the search results. As you may have already noticed, the benefit of a Filter is its ability to cache results, while a Query calculates in real time. When choosing between Filter and Query, ask yourself whether the search (or filtering) will be repeated. Provided you have enough memory, a cached Filter will always have a positive impact on the search experience.

Creating a custom filter

Now that we've seen numerous examples on Lucene's built-in Filters, we are ready for a more advanced topic, custom filters. There are a few important components we need to go over before we start: FieldCache, SortedDocValues, and DocIdSet. We will be using these items in our example to help you gain practical knowledge on the subject.

FieldCache, as you have already learned, is a cache that stores field values in memory in an array structure. It's a very simple data structure, as the slots in the array correspond directly to DocIds. This is also the reason why FieldCache only works on single-valued fields: a slot in the array can hold only a single value. Since this is just an array, lookup time is constant and very fast.

SortedDocValues maintains two internal data mappings for value lookup: a dictionary that maps an ordinal to a field value, and a mapping from DocId to ordinal. In the dictionary data structure, the values are deduplicated, dereferenced, and sorted. There are two methods of interest in this class: getOrd(int), which returns the ordinal for a DocId, and lookupTerm(BytesRef), which returns the ordinal for a field value. This data structure is the opposite of the inverted index: it provides a DocId-to-value lookup (similar to FieldCache), instead of a value-to-DocId lookup.
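The dual mapping can be sketched conceptually in plain Java. This is only an illustration of the idea: the class name SortedDocValuesSketch is made up here, and Lucene's real implementation is far more compact and works on BytesRef terms rather than Strings.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

public class SortedDocValuesSketch {
    // Conceptual model: a sorted, deduplicated dictionary of field values,
    // plus a per-document array of ordinals pointing into that dictionary.
    private final String[] dictionary;   // ordinal -> field value (sorted, deduped)
    private final int[] ords;            // docId -> ordinal

    public SortedDocValuesSketch(String[] valuePerDoc) {
        SortedSet<String> unique = new TreeSet<>(Arrays.asList(valuePerDoc));
        dictionary = unique.toArray(new String[0]);
        ords = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            ords[doc] = Arrays.binarySearch(dictionary, valuePerDoc[doc]);
        }
    }

    // Analogous to SortedDocValues.getOrd(int): DocId -> ordinal.
    public int getOrd(int doc) {
        return ords[doc];
    }

    // Analogous to SortedDocValues.lookupTerm(BytesRef): field value -> ordinal.
    public int lookupTerm(String value) {
        return Arrays.binarySearch(dictionary, value);
    }

    public static void main(String[] args) {
        SortedDocValuesSketch dv =
            new SortedDocValuesSketch(new String[]{"Second", "First", "First"});
        System.out.println(dv.lookupTerm("First")); // "First" sorts before "Second"
        System.out.println(dv.getOrd(0));           // doc 0 holds "Second"
    }
}
```

Note how deduplication means documents 1 and 2, which share the value First, point at the same dictionary slot; comparing ordinals is then as good as comparing the values themselves, which is what makes ordinal-based filtering cheap.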


DocIdSet, as the name implies, is a set of DocIds. The subclass we will be using, FieldCacheDocIdSet, combines this set with FieldCache. It iterates through the set and calls matchDoc(int) to find all the matching documents to be returned.

In our example, we will be building a simple user security Filter to determine which documents are eligible to be viewed by a user, based on the user ID and group ID. The group ID is assumed to be hierarchical, where a smaller ID inherits the rights of larger IDs. For example, the following is the group ID model in our implementation:

10 – admin
20 – manager
30 – user
40 – guest

A user with group ID 10 will be able to access documents whose group ID is 10 or above.
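The access rule can be sketched in plain Java before we wire it into a Filter. GroupAccessRule is a hypothetical helper for illustration only; the real filter below applies the same rule, but compares FieldCache ordinals rather than plain values.

```java
public class GroupAccessRule {
    // A user may view a document when the user IDs match, or when the
    // document's group ID is greater than or equal to the user's group ID
    // (smaller IDs inherit the rights of larger ones: 10=admin ... 40=guest).
    public static boolean canView(String userId, int userGroupId,
                                  String docUserId, int docGroupId) {
        return userId.equals(docUserId) || docGroupId >= userGroupId;
    }

    public static void main(String[] args) {
        System.out.println(canView("1001", 10, "9999", 20)); // admin sees group-20 docs
        System.out.println(canView("1001", 30, "9999", 20)); // a user cannot see manager docs
        System.out.println(canView("1001", 30, "1001", 20)); // but always sees their own docs
    }
}
```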

How to do it...

Here is our custom Filter, UserSecurityFilter:

public class UserSecurityFilter extends Filter {
 
private String userIdField;
private String groupIdField;
private String userId;
private String groupId;
 
public UserSecurityFilter(String userIdField, String groupIdField, String userId, String groupId) {
   this.userIdField = userIdField;
   this.groupIdField = groupIdField;
   this.userId = userId;
   this.groupId = groupId;
}
 
public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
   final SortedDocValues userIdDocValues = FieldCache.DEFAULT.getTermsIndex(context.reader(), userIdField);
   final SortedDocValues groupIdDocValues = FieldCache.DEFAULT.getTermsIndex(context.reader(), groupIdField);
 
   final int userIdOrd = userIdDocValues.lookupTerm(new BytesRef(userId));
   final int groupIdOrd = groupIdDocValues.lookupTerm(new BytesRef(groupId));
 
   return new FieldCacheDocIdSet(context.reader().maxDoc(), acceptDocs) {
     @Override
     protected final boolean matchDoc(int doc) {
       final int userIdDocOrd = userIdDocValues.getOrd(doc);
       final int groupIdDocOrd = groupIdDocValues.getOrd(doc);
       return userIdDocOrd == userIdOrd || groupIdDocOrd >= groupIdOrd;
     }
   };
}
}

This Filter accepts four arguments in its constructor:

  • userIdField: This is the field name for user ID
  • groupIdField: This is the field name for group ID
  • userId: This is the current session's user ID
  • groupId: This is the current session's group ID of the user

Then, we implement getDocIdSet(AtomicReaderContext, Bits) to perform our filtering by userId and groupId. We first acquire two SortedDocValues, one for the user ID and one for the group ID, based on the field names we obtained from the constructor. Then, we look up the ordinal values for the current session's user ID and group ID. The return value is a new FieldCacheDocIdSet object implementing its matchDoc(int) method. This is where we compare both the user ID and group ID to determine whether a document is viewable by the user. A match is found when the user ID matches, or when the document's group ID is greater than or equal to the user's group ID.

To test this Filter, we will set up our index as follows:

   Analyzer analyzer = new StandardAnalyzer();
   Directory directory = new RAMDirectory();
   IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
   IndexWriter indexWriter = new IndexWriter(directory, config);
   Document doc = new Document();
   StringField stringFieldFile = new StringField("file", "", Field.Store.YES);
   StringField stringFieldUserId = new StringField("userId", "", Field.Store.YES);
   StringField stringFieldGroupId = new StringField("groupId", "", Field.Store.YES);
 
   doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");
   stringFieldFile.setStringValue("Z:\\shared\\finance\\2014-sales.xls");
   stringFieldUserId.setStringValue("1001");
   stringFieldGroupId.setStringValue("20");
   doc.add(stringFieldFile); doc.add(stringFieldUserId); doc.add(stringFieldGroupId);
   indexWriter.addDocument(doc);
 
   doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");
   stringFieldFile.setStringValue("Z:\\shared\\company\\2014-policy.doc");
   stringFieldUserId.setStringValue("1101");
   stringFieldGroupId.setStringValue("30");
   doc.add(stringFieldFile); doc.add(stringFieldUserId);
   doc.add(stringFieldGroupId);
   indexWriter.addDocument(doc);
   doc.removeField("file"); doc.removeField("userId");
   doc.removeField("groupId");
   stringFieldFile.setStringValue("Z:\\shared\\company\\2014-terms-and-conditions.doc");
   stringFieldUserId.setStringValue("1205");
   stringFieldGroupId.setStringValue("40");
   doc.add(stringFieldFile); doc.add(stringFieldUserId);
   doc.add(stringFieldGroupId);
   indexWriter.addDocument(doc);
   indexWriter.commit();
   indexWriter.close();

The setup adds three documents to our index with different user IDs and group ID settings in each document, as follows:

UserSecurityFilter userSecurityFilter = new UserSecurityFilter("userId", "groupId", "1001", "40");
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Query query = new MatchAllDocsQuery();
TopDocs topDocs = indexSearcher.search(query, userSecurityFilter,   100);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
   doc = indexReader.document(scoreDoc.doc);
   System.out.println("file: " + doc.getField("file").stringValue() +
       " - userId: " + doc.getField("userId").stringValue() +
       " - groupId: " + doc.getField("groupId").stringValue());
}
indexReader.close();

We initialize UserSecurityFilter with the matching names for user ID and group ID fields, and set it up with user ID 1001 and group ID 40. For our test and search, we use MatchAllDocsQuery to basically search without any queries (as it will return all the documents). Here is the output from the code:

file: Z:\shared\finance\2014-sales.xls - userId: 1001 - groupId: 20
file: Z:\shared\company\2014-terms-and-conditions.doc - userId: 1205 - groupId: 40

The search specifically filters by user ID 1001, so the first document is returned because its user ID is also 1001. The third document is returned because its group ID, 40, is greater than or equal to the user's group ID, which is also 40.

Searching with QueryParser

QueryParser is an interpreter tool that transforms a search string into a series of Query clauses. It's not absolutely necessary to use QueryParser to perform a search, but it's a great feature that empowers users by allowing the use of search modifiers. A user can specify a phrase match by putting quotes (") around a phrase. A user can also control whether a certain term or phrase is required by putting a plus ("+") sign in front of the term or phrase, or use a minus ("-") sign to indicate that the term or phrase must not exist in results. For Boolean searches, the user can use AND and OR to control whether all terms or phrases are required.

To do a field-specific search, you can use a colon (":") to specify a field for a search (for example, content:humpty would search for the term "humpty" in the field "content"). For wildcard searches, you can use the standard wildcard character asterisk ("*") to match 0 or more characters, or a question mark ("?") for matching a single character. As you can see, the general syntax for a search query is not complicated, though the more advanced modifiers can seem daunting to new users. In this article, we will cover more advanced QueryParser features to show you what you can do to customize a search.

How to do it...

Let's look at the options that we can set in QueryParser. The following is a code snippet for our setup:

Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
StringField stringField = new StringField("name", "", Field.Store.YES);
TextField textField = new TextField("content", "", Field.Store.YES);
IntField intField = new IntField("num", 0, Field.Store.YES);
 
doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("First");
textField.setStringValue("Humpty Dumpty sat on a wall,");
intField.setIntValue(100);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);
 
doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Second");
textField.setStringValue("Humpty Dumpty had a great fall.");
intField.setIntValue(200);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);
 
doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Third");
textField.setStringValue("All the king's horses and all the king's men");
intField.setIntValue(300);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);
 
doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Fourth");
textField.setStringValue("Couldn't put Humpty together again.");
intField.setIntValue(400);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);
 
indexWriter.commit();
indexWriter.close();
 
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser queryParser = new QueryParser("content", analyzer);
// configure queryParser here
Query query = queryParser.parse("humpty");
TopDocs topDocs = indexSearcher.search(query, 100);

We add four documents and instantiate a QueryParser object with a default field and an analyzer. We will be using the same analyzer that was used in indexing to ensure that we apply the same text treatment to maximize matching capability.

Wildcard search

The query syntax for a wildcard search is the asterisk ("*") or question mark ("?") character. Here is a sample query:

Query query = queryParser.parse("humpty*");

This query will return the first, second, and fourth sentences. By default, QueryParser does not allow a leading wildcard character because it has a significant performance impact. A leading wildcard would trigger a full scan on the index since any term can be a potential match. In essence, even an inverted index would become rather useless for a leading wildcard character search. However, it's possible to override this default setting to allow a leading wildcard character by calling setAllowLeadingWildcard(true). You can go ahead and run this example with different search strings to see how this feature works.

Depending on where the wildcard character(s) is placed, QueryParser will produce either a PrefixQuery or WildcardQuery. In this specific example in which there is only one wildcard character and it's not the leading character, a PrefixQuery will be produced.
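The wildcard semantics themselves can be illustrated in plain Java by translating a pattern into a regular expression. This is a conceptual sketch only: WildcardSketch is a made-up name, and Lucene does not match wildcards with regex internally; it walks the indexed terms instead.

```java
import java.util.regex.Pattern;

public class WildcardSketch {
    // Translate a search-style wildcard pattern ('*' matches any run of
    // characters, '?' matches exactly one character) into an equivalent
    // regular expression, then test the term against it.
    public static boolean wildcardMatch(String term, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*')      regex.append(".*");
            else if (c == '?') regex.append(".");
            else               regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(wildcardMatch("humpty", "hump*"));  // trailing '*': prefix match
        System.out.println(wildcardMatch("humpty", "h?mpty")); // '?': exactly one character
        System.out.println(wildcardMatch("humpty", "h?pty"));  // too few characters: no match
    }
}
```

The sketch also shows why a leading `*` is expensive: with nothing fixed at the front of the pattern, every term in the index is a candidate and must be tested.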

Term range search

We can produce a TermRangeQuery by using TO in a search string. The range has the following syntax:

[start TO end] – inclusive
{start TO end} – exclusive

As indicated, square brackets ([ and ]) are inclusive of the start and end terms, and curly brackets ({ and }) are exclusive of them. It's also possible to mix the brackets, making the range inclusive on one side and exclusive on the other.

Here is a code snippet:

Query query = queryParser.parse("[aa TO c]");

This search will return the third and fourth sentences, as their beginning words are All and Couldn't, which are within the range. You can optionally analyze the range terms with the same analyzer by setting setAnalyzeRangeTerms(true).
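The inclusive/exclusive bound semantics can be sketched in plain Java using lexicographic comparison, which is how Lucene orders textual terms. RangeSketch is a hypothetical helper for illustration, not a Lucene class.

```java
public class RangeSketch {
    // Check whether a term falls between start and end, with configurable
    // inclusiveness on each bound, using lexicographic (String) order.
    public static boolean inRange(String term, String start, String end,
                                  boolean includeStart, boolean includeEnd) {
        int lo = term.compareTo(start);
        int hi = term.compareTo(end);
        boolean aboveStart = includeStart ? lo >= 0 : lo > 0;
        boolean belowEnd   = includeEnd   ? hi <= 0 : hi < 0;
        return aboveStart && belowEnd;
    }

    public static void main(String[] args) {
        System.out.println(inRange("aa", "aa", "c", true, true));  // [aa TO c]: boundary included
        System.out.println(inRange("aa", "aa", "c", false, true)); // {aa TO c]: boundary excluded
        System.out.println(inRange("c",  "aa", "c", true, false)); // [aa TO c}: end excluded
    }
}
```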

Autogenerated phrase query

QueryParser can automatically generate a PhraseQuery when there is more than one term in a search string. Here is a code snippet:

queryParser.setAutoGeneratePhraseQueries(true);
Query query = queryParser.parse("humpty+dumpty+sat");

This search will generate a PhraseQuery on the phrase humpty dumpty sat and will return the first sentence.

Date resolution

If you have a date field (by using DateTools to convert date to a string format) and would like to do a range search on date, it may be necessary to match the date resolution on a specific field. Here is a code snippet on setting the Date resolution:

queryParser.setDateResolution("date", DateTools.Resolution.DAY);
queryParser.setLocale(Locale.US);
queryParser.setTimeZone(TimeZone.getTimeZone("America/New_York"));

This example sets the resolution to day granularity, locale to US, and time zone to New York. The locale and time zone settings are specific to the date format only.

Default operator

The default operator on a multiterm search string is OR. You can change the default to AND so all the terms are required. Here is a code snippet that will require all the terms in a search string:

queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("humpty dumpty");

This example will return first and second sentences as these are the only two sentences with both humpty and dumpty.

Enable position increments

This setting is enabled by default. Its purpose is to maintain a position increment of the token that follows an omitted token, such as a token filtered by a StopFilter. This is useful in phrase queries when position increments may be important for scoring. Here is an example on how to enable this setting:

queryParser.setEnablePositionIncrements(true);
Query query = queryParser.parse("\"humpty dumpty\"");

In our scenario, it won't change our search results. This attribute only enables position increments information to be available in the resulting PhraseQuery.

Fuzzy query

Lucene's fuzzy search implementation is based on the Levenshtein distance. It compares two strings and finds the number of single-character changes that are needed to transform one string into the other. The resulting number indicates the closeness of the two strings. In a fuzzy search, a threshold number of edits is used to determine whether the two strings match. To trigger a fuzzy match in QueryParser, you use the tilde (~) character. There are a couple of configurations in QueryParser to tune this type of query. Here is a code snippet:

queryParser.setFuzzyMinSim(2f);
queryParser.setFuzzyPrefixLength(3);
Query query = queryParser.parse("hump~");

This example will return the first, second, and fourth sentences, as the fuzzy match matches hump to humpty: the two words are two character edits apart. We set the fuzzy minimum similarity to two in this example.
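The edit distance underlying fuzzy matching can be sketched with the textbook dynamic-programming algorithm. This is for illustration only; Lucene 4's FuzzyQuery actually compiles the pattern into a Levenshtein automaton for speed rather than computing the full table per term.

```java
public class LevenshteinSketch {
    // Classic dynamic-programming edit distance: the number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("hump", "humpty")); // 2: two insertions
    }
}
```

With distance("hump", "humpty") equal to 2, a minimum similarity of two edits is exactly the threshold at which hump~ starts matching humpty.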

Lowercase expanded term

This configuration determines whether to automatically lowercase multiterm queries. An analyzer can do this already, so this is more like an overriding configuration that forces multiterm queries to be lowercased. Here is a code snippet:

queryParser.setLowercaseExpandedTerms(true);
Query query = queryParser.parse("\"Humpty Dumpty\"");

This code will lowercase our search string before search execution.

Phrase slop

Phrase search can be tuned to allow some flexibility in phrase matching. By default, phrase match is exact. Setting a slop value will give it some tolerance on terms that may not always be matched consecutively. Here is a code snippet that will demonstrate this feature:

queryParser.setPhraseSlop(3);
Query query = queryParser.parse("\"Humpty Dumpty wall\"");

Without setting a phrase slop, this phrase Humpty Dumpty wall will not have any matches. By setting phrase slop to three, it allows some tolerance so that this search will now return the first sentence. Go ahead and play around with this setting in order to get more familiarized with its behavior.

TermQuery and TermRangeQuery

A TermQuery is a very simple query that matches documents containing a specific term. The TermRangeQuery is, as its name implies, a term range with a lower and upper boundary for matching.

How to do it...

Here are a couple of examples on TermQuery and TermRangeQuery:

query = new TermQuery(new Term("content", "humpty"));
query = new TermRangeQuery("content", new BytesRef("a"), new BytesRef("c"), true, true);

The first line is a simple query matching the term humpty in the content field. The second line is a range query matching documents whose content contains terms that sort between a and c.

BooleanQuery

A BooleanQuery is a combination of other queries in which you can specify whether each subquery must, must not, or should match. These options provide the foundation to build up to logical operators of AND, OR, and NOT, which you can use in QueryParser. Here is a quick review on QueryParser syntax for BooleanQuery:

  • "+" means must match; for example, the search string +humpty dumpty equates to must match humpty and should match dumpty
  • "-" means must not match; for example, the search string -humpty dumpty equates to must not match humpty and should match dumpty
  • AND, OR, and NOT are pseudo Boolean operators. Under the hood, Lucene models these with BooleanClause.Occur, whose options are MUST, MUST_NOT, and SHOULD. In an AND query, both terms must match; in an OR query, at least one of the terms should match; and a NOT query requires that the term must not exist. For example, humpty AND dumpty means both humpty and dumpty must match, humpty OR dumpty means either or both of humpty and dumpty should match, and NOT humpty means the term humpty must not exist in a match.

As mentioned, the clauses of a BooleanQuery come with three options: must match, must not match, and should match. These options allow us to build Boolean operations programmatically through the API.

How to do it...

Here is a code snippet that demonstrates BooleanQuery:

BooleanQuery query = new BooleanQuery();
query.add(new BooleanClause(
    new TermQuery(new Term("content", "humpty")),
    BooleanClause.Occur.MUST));
query.add(new BooleanClause(
    new TermQuery(new Term("content", "dumpty")),
    BooleanClause.Occur.MUST));
query.add(new BooleanClause(
    new TermQuery(new Term("content", "wall")),
    BooleanClause.Occur.SHOULD));
query.add(new BooleanClause(
    new TermQuery(new Term("content", "sat")),
    BooleanClause.Occur.MUST_NOT));

How it works…

In this demonstration, we use TermQuery to illustrate the building of BooleanClauses. The query is equivalent to the query string +humpty +dumpty wall -sat: both humpty and dumpty must match, wall is optional (matching it only improves the score), and sat must not match. This code will return the second sentence from our setup; because of the final MUST_NOT clause on the word sat, the first sentence is filtered from the results. Note that BooleanClause accepts two arguments: a Query and a BooleanClause.Occur, which is where you specify the matching option: MUST, MUST_NOT, or SHOULD.
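The interplay of the three occur options can be sketched as a simple predicate over a document's terms. This is a conceptual model only, not Lucene's scoring implementation; in particular, note that SHOULD clauses are required to match (at least one of them) only when the query has no MUST clauses:

```java
import java.util.Map;
import java.util.Set;

// Sketch of BooleanQuery matching semantics over a document's term set.
public class BooleanDemo {

    enum Occur { MUST, MUST_NOT, SHOULD }

    static boolean matches(Set<String> docTerms, Map<String, Occur> clauses) {
        boolean hasMust = false, hasShould = false, anyShouldMatched = false;
        for (Map.Entry<String, Occur> e : clauses.entrySet()) {
            boolean present = docTerms.contains(e.getKey());
            switch (e.getValue()) {
                case MUST:
                    hasMust = true;
                    if (!present) return false;   // every MUST term is required
                    break;
                case MUST_NOT:
                    if (present) return false;    // any MUST_NOT term disqualifies
                    break;
                case SHOULD:
                    hasShould = true;
                    if (present) anyShouldMatched = true;
                    break;
            }
        }
        // with no MUST clauses, at least one SHOULD clause has to match
        return hasMust || !hasShould || anyShouldMatched;
    }

    public static void main(String[] args) {
        Map<String, Occur> clauses = Map.of(
            "humpty", Occur.MUST,
            "dumpty", Occur.MUST,
            "wall", Occur.SHOULD,
            "sat", Occur.MUST_NOT);
        System.out.println(matches(Set.of("humpty", "dumpty", "fall"), clauses)); // true
        System.out.println(matches(Set.of("humpty", "dumpty", "sat", "wall"), clauses)); // false
    }
}
```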

PrefixQuery and WildcardQuery

PrefixQuery, as the name implies, matches documents with terms starting with a specified prefix. WildcardQuery allows you to use wildcard characters for wildcard matching.

A PrefixQuery is in effect a special case of WildcardQuery in which a single wildcard character appears at the end of the search string. When parsing a wildcard search, QueryParser returns either a PrefixQuery or a WildcardQuery, depending on the position of the wildcard character. PrefixQuery is simpler and more efficient than WildcardQuery, so it's preferable to use PrefixQuery whenever possible, and that's exactly what QueryParser does.

How to do it...

Here is a code snippet to demonstrate both Query types:

PrefixQuery query = new PrefixQuery(new Term("content", "hum"));
WildcardQuery query2 = new WildcardQuery(new Term("content", "*um*"));

How it works…

Both queries would return the same results from our setup. The PrefixQuery will match anything that starts with hum and the WildcardQuery would match anything that contains um.
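The underlying tests are easy to picture: a prefix match is a startsWith check on each indexed term, and a wildcard pattern can be read as a regular expression where * matches any run of characters and ? matches exactly one. The regex translation below is purely illustrative; Lucene actually compiles wildcard patterns into automata and walks the term dictionary with them:

```java
import java.util.regex.Pattern;

// Sketch of prefix and wildcard term matching semantics.
public class WildcardDemo {

    // Prefix matching: a simple startsWith test on the indexed term.
    static boolean prefixMatch(String term, String prefix) {
        return term.startsWith(prefix);
    }

    // Wildcard matching: '*' matches any run of characters, '?' exactly one.
    static boolean wildcardMatch(String term, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append('.');
            else regex.append(Pattern.quote(String.valueOf(c))); // escape literals
        }
        return term.matches(regex.toString());
    }

    public static void main(String[] args) {
        System.out.println(prefixMatch("humpty", "hum"));    // true
        System.out.println(wildcardMatch("humpty", "*um*")); // true
        System.out.println(wildcardMatch("wall", "*um*"));   // false
    }
}
```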

PhraseQuery and MultiPhraseQuery

A PhraseQuery matches a particular sequence of terms, while a MultiPhraseQuery gives you an option to match multiple terms in the same position. For example, MultiPhraseQuery supports a phrase such as humpty (dumpty OR together), in which it matches humpty in position 0 and either dumpty or together in position 1.

How to do it...

Here is a code snippet to demonstrate both Query types:

PhraseQuery query = new PhraseQuery();
query.add(new Term("content", "humpty"));
query.add(new Term("content", "together"));

MultiPhraseQuery query2 = new MultiPhraseQuery();
Term[] terms1 = new Term[] { new Term("content", "humpty") };
Term[] terms2 = new Term[] {
    new Term("content", "dumpty"),
    new Term("content", "together")
};
query2.add(terms1);
query2.add(terms2);

How it works…

The first query, PhraseQuery, searches for the phrase humpty together. The second query, MultiPhraseQuery, searches for the phrase humpty (dumpty OR together). The first query would return sentence four from our setup, while the second query would return sentences one, two, and four. Note that in MultiPhraseQuery, multiple terms in the same position are added as an array.
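Conceptually, a MultiPhraseQuery is a phrase where each position accepts a set of terms rather than a single term. Here is a sketch of that semantics over a tokenized document (an illustration only; Lucene matches against term positions in the index, not a token list):

```java
import java.util.List;
import java.util.Set;

// Sketch of MultiPhraseQuery semantics: one Set of acceptable terms per position.
public class MultiPhraseDemo {

    static boolean matches(List<String> tokens, List<Set<String>> phrase) {
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size() && ok; i++) {
                ok = phrase.get(i).contains(tokens.get(start + i));
            }
            if (ok) return true; // every position accepted its token
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("humpty", "dumpty", "sat", "on", "a", "wall");
        // humpty (dumpty OR together)
        List<Set<String>> phrase = List.of(
            Set.of("humpty"),
            Set.of("dumpty", "together"));
        System.out.println(matches(doc, phrase)); // true
    }
}
```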

FuzzyQuery

A FuzzyQuery matches terms based on similarity, measured by Damerau-Levenshtein edit distance. We will not go into the details of the algorithm, as it is outside our topic; what we need to know is that fuzzy matching is measured in the number of single-character edits between terms. FuzzyQuery allows a maximum of two edits. For example, humptX is one edit away from humpty, and humpXX is two edits away from humpty. There is also a requirement that the number of edits must be less than the minimum term length (of either the input term or the candidate term). For example, ab and abcd would not match: the edit distance between the two terms is 2, which is not less than the length of ab, which is also 2.

How to do it...

Here is a code snippet to demonstrate FuzzyQuery:

FuzzyQuery query = new FuzzyQuery(new Term("content", "humpXX"));

How it works…

This query will return sentences one, two, and four from our setup, as humpXX matches humpty within two edits. In QueryParser, a FuzzyQuery is triggered by the tilde (~) sign; an equivalent search string would be humpXX~.
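To make the edit-count rule concrete, here is a sketch of fuzzy matching using plain Levenshtein distance (insertions, deletions, and substitutions). Lucene's FuzzyQuery uses Damerau-Levenshtein, which additionally counts a transposition of adjacent characters as a single edit, and evaluates terms with a compiled automaton rather than a distance matrix:

```java
// Sketch of FuzzyQuery's matching rule using plain Levenshtein distance.
public class FuzzyDemo {

    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // delete / insert
                    d[i - 1][j - 1] + cost);                     // substitute
            }
        }
        return d[a.length()][b.length()];
    }

    static boolean fuzzyMatch(String a, String b, int maxEdits) {
        int dist = editDistance(a, b);
        // the edit count must also be less than the shorter term's length
        return dist <= maxEdits && dist < Math.min(a.length(), b.length());
    }

    public static void main(String[] args) {
        System.out.println(editDistance("humpXX", "humpty")); // 2
        System.out.println(fuzzyMatch("humpXX", "humpty", 2)); // true
        System.out.println(fuzzyMatch("ab", "abcd", 2));       // false: 2 edits, not < length 2
    }
}
```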

Summary

This article gave you a glimpse of the querying and filtering features that go into building a successful search application.
