





















































In this article by Edwood Ng and Vineeth Mohan, authors of the book Lucene 4 Cookbook, we will cover the following recipes:
When it comes to search applications, usability is always a key element that either makes or breaks the user's impression. Lucene does an excellent job of giving you the essential tools to build and search an index. In this article, we will look into some more advanced techniques for querying and filtering data. We will arm you with more knowledge to put into your toolbox so that you can leverage your Lucene knowledge to build a user-friendly search application.
Before we start, let us revisit these questions: what is a filter, and what is it for? In simple terms, a filter is used to narrow the search space or, in other words, to search within a search. Filter and Query may seem to provide the same functionality, but there is a significant difference between the two: a query computes scores to rank results by their relevancy to the search terms, while a filter has no effect on scores. It's not uncommon for users to prefer navigating through a hierarchy of filters in order to land on the relevant results. You may often find yourself in a situation where it is necessary to refine a result set so that users can continue to search or navigate within a subset. With the ability to apply filters, we can easily provide such search refinements. Another situation is data security, where some parts of the data in the index are protected. You may need to apply an additional filter behind the scenes, based on the user's access level, so that users are restricted to seeing only the items they are permitted to access. In both of these contexts, Lucene's filtering features provide the capability to achieve these objectives.
Lucene has a few built-in filters that are designed to fit most real-world applications. If you do find yourself in a position where none of the built-in filters is suitable for the job, you can rest assured that Lucene's extensibility will allow you to build your own custom filters. Let us take a look at Lucene's built-in filters:

- TermRangeFilter: filters on a range of string terms
- NumericRangeFilter: filters on a range of numeric values
- FieldCacheRangeFilter: a range filter backed by the FieldCache
- QueryWrapperFilter: wraps any Query so that it can be used as a filter
- PrefixFilter: filters on terms starting with a given prefix
- FieldCacheTermsFilter: a term filter backed by the FieldCache
- FieldValueFilter: filters on documents that have a value in a given field
- CachingWrapperFilter: caches the results of another filter
If none of the above filters fulfills your business requirements, you can build your own by extending the Filter class and implementing its abstract method getDocIdSet(AtomicReaderContext, Bits).
Let's set up our test case with the following code:
Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);

Document doc = new Document();
StringField stringField = new StringField("name", "", Field.Store.YES);
TextField textField = new TextField("content", "", Field.Store.YES);
IntField intField = new IntField("num", 0, Field.Store.YES);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("First");
textField.setStringValue("Humpty Dumpty sat on a wall,");
intField.setIntValue(100);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Second");
textField.setStringValue("Humpty Dumpty had a great fall.");
intField.setIntValue(200);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Third");
textField.setStringValue("All the king's horses and all the king's men");
intField.setIntValue(300);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Fourth");
textField.setStringValue("Couldn't put Humpty together again.");
intField.setIntValue(400);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

indexWriter.commit();
indexWriter.close();

IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
The preceding code adds four documents into an index. The four documents are:
Document 1
Name: First
Content: Humpty Dumpty sat on a wall,
Num: 100
Document 2
Name: Second
Content: Humpty Dumpty had a great fall.
Num: 200
Document 3
Name: Third
Content: All the king's horses and all the king's men
Num: 300
Document 4
Name: Fourth
Content: Couldn't put Humpty together again.
Num: 400
Here is our standard test case:
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Query query = new TermQuery(new Term("content", "humpty"));
TopDocs topDocs = indexSearcher.search(query, FILTER, 100);
System.out.println("Searching 'humpty'");
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    doc = indexReader.document(scoreDoc.doc);
    System.out.println("name: " + doc.getField("name").stringValue()
        + " - content: " + doc.getField("content").stringValue()
        + " - num: " + doc.getField("num").stringValue());
}
indexReader.close();
Running the code as it is, with the FILTER variable set to null (that is, with no filter applied), will produce the following output:
Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Second - content: Humpty Dumpty had a great fall. - num: 200
name: Fourth - content: Couldn't put Humpty together again. - num: 400
This is a simple search on the word humpty. The search would return the first, second, and fourth sentences.
Now, let's take a look at a TermRangeFilter example:
TermRangeFilter termRangeFilter = TermRangeFilter.newStringRange("name", "A", "G", true, true);
Applying this filter to preceding search (by setting FILTER as termRangeFilter) will produce the following output:
Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400
Note that the second sentence is missing from the results due to this filter. The filter excludes documents whose name falls outside the range A through G. Both First and Fourth start with F, which is within the range, so those documents are included. The second sentence's name value, Second, is outside the range, so that document is not considered by the query.
Let's move on to NumericRangeFilter:
NumericRangeFilter numericRangeFilter = NumericRangeFilter.newIntRange("num", 200, 400, true, true);
This filter will produce the following results:
Searching 'humpty'
name: Second - content: Humpty Dumpty had a great fall. - num: 200
name: Fourth - content: Couldn't put Humpty together again. - num: 400
Note that the first sentence is missing from the results because its num value, 100, is outside the numeric range of 200 to 400 specified in NumericRangeFilter.
Next one is FieldCacheRangeFilter:
FieldCacheRangeFilter fieldCacheTermRangeFilter = FieldCacheRangeFilter.newStringRange("name", "A", "G", true, true);
The output of this filter is similar to the TermRangeFilter example:
Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400
This filter provides a caching layer on top of a term range filter. The results are the same as TermRangeFilter's, but performance is better because the computed results are cached in memory for subsequent retrievals.
Next is QueryWrapperFilter:
QueryWrapperFilter queryWrapperFilter = new QueryWrapperFilter(new TermQuery(new Term("content", "together")));
This example will produce this result:
Searching 'humpty'
name: Fourth - content: Couldn't put Humpty together again. - num: 400
This filter wraps a TermQuery on the term together in the content field. Since the fourth sentence is the only one that contains the word "together", the search results are limited to this sentence only.
Next one is PrefixFilter:
PrefixFilter prefixFilter = new PrefixFilter(new Term("name", "F"));
This filter produces the following:
Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400
This filter limits results where the name field begins with letter F. In this case, the first and fourth sentences both have the name field that begins with F (First and Fourth); hence, the results.
Next is FieldCacheTermsFilter:
FieldCacheTermsFilter fieldCacheTermsFilter = new FieldCacheTermsFilter("name", "First");
This filter produces the following:
Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
This filter limits results to documents whose name field contains the term First. Since the first sentence is the only one whose name field has that value, only one sentence is returned in the search results.
Next is FieldValueFilter:
FieldValueFilter fieldValueFilter = new FieldValueFilter("name1");
This would produce the following:
Searching 'humpty'
Note that there are no results because this filter limits results to documents in which there is at least one value in the field name1. Since the name1 field doesn't exist in our current example, no documents pass this filter; hence, zero results.
Next is CachingWrapperFilter:
TermRangeFilter termRangeFilter = TermRangeFilter.newStringRange("name", "A", "G", true, true);
CachingWrapperFilter cachingWrapperFilter = new CachingWrapperFilter(termRangeFilter);
This wrapper wraps around the same TermRangeFilter from above, so the result produced is similar:
Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400
Filters work in conjunction with Queries to refine search results. As you may have already noticed, the benefit of a Filter is its ability to cache results, while a Query calculates in real time. When choosing between a Filter and a Query, ask yourself whether the search (or filtering) will be repeated. Provided you have enough memory, a cached Filter will almost always have a positive impact on the search experience.
Now that we've seen numerous examples on Lucene's built-in Filters, we are ready for a more advanced topic, custom filters. There are a few important components we need to go over before we start: FieldCache, SortedDocValues, and DocIdSet. We will be using these items in our example to help you gain practical knowledge on the subject.
The FieldCache, as you already learned, is a cache that stores field values in memory in an array structure. It's a very simple data structure, as the slots in the array correspond to DocIds. This is also the reason why FieldCache only works for single-valued fields: a slot in the array can hold only a single value. Since this is just an array, the lookup time is constant and very fast.
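The array layout can be pictured with a small plain-Java sketch (illustrative only, not Lucene's actual implementation); the class and method names here are hypothetical:

```java
// Illustrative sketch of a FieldCache-style structure (not Lucene code):
// one slot per DocId, so a field-value lookup is a constant-time array access.
public class FieldCacheSketch {
    // Values of the "name" field from our setup, indexed by DocId 0..3.
    static final String[] NAME_CACHE = {"First", "Second", "Third", "Fourth"};

    // The DocId is the array index: lookup is a single array access.
    static String nameFor(int docId) {
        return NAME_CACHE[docId];
    }

    public static void main(String[] args) {
        System.out.println(nameFor(2));
    }
}
```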
SortedDocValues has two internal data mappings for value lookup: a dictionary mapping ordinal values to field values, and a mapping from DocIds to ordinal values (for the field values). In the dictionary data structure, the values are deduplicated, dereferenced, and sorted. There are two methods of interest in this class: getOrd(int) and lookupTerm(BytesRef). getOrd(int) returns the ordinal for a DocId, and lookupTerm(BytesRef) returns the ordinal for a field value. This data structure is the opposite of the inverted index structure, as it provides a DocId-to-value lookup (similar to FieldCache) instead of a value-to-DocId lookup.
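The two mappings can be sketched in plain Java (an illustrative model of the idea, not Lucene's implementation; the class and arrays are hypothetical), using the name values from our four-document setup:

```java
import java.util.Arrays;

// Illustrative sketch of SortedDocValues' two mappings (not Lucene code):
// a sorted, deduplicated dictionary of field values, plus a per-DocId ordinal.
public class SortedDocValuesSketch {
    // Dictionary: ordinal -> value, sorted and deduplicated.
    static final String[] DICTIONARY = {"First", "Fourth", "Second", "Third"};
    // Per-document mapping: DocId -> ordinal into the dictionary.
    // Doc 0 = First, doc 1 = Second, doc 2 = Third, doc 3 = Fourth.
    static final int[] ORDS = {0, 2, 3, 1};

    // Equivalent of getOrd(int): DocId -> ordinal.
    static int getOrd(int docId) {
        return ORDS[docId];
    }

    // Equivalent of lookupTerm(BytesRef): value -> ordinal.
    // Because the dictionary is sorted, a binary search suffices.
    static int lookupTerm(String value) {
        return Arrays.binarySearch(DICTIONARY, value);
    }

    public static void main(String[] args) {
        System.out.println(DICTIONARY[getOrd(1)]); // value for DocId 1
        System.out.println(lookupTerm("Third"));   // ordinal for a value
    }
}
```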
DocIdSet, as the name implies, is a set of DocIds. The FieldCacheDocIdSet subclass we will be using combines this set with the FieldCache. It iterates through the set, calling matchDoc(int) to find all the matching documents to be returned.
In our example, we will build a simple user security Filter to determine which documents a user is eligible to view, based on the user ID and group ID. The group ID is assumed to be hereditary, where a smaller ID inherits the rights of larger IDs. For example, the following will be the group ID model in our implementation:
10 – admin
20 – manager
30 – user
40 – guest
A user with group ID 10 will be able to access documents whose group ID is 10 or above.
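The access rule can be captured as a small standalone check before we wire it into a Filter. This is a hypothetical helper for illustration only; the actual Filter below implements the same comparison against ordinals:

```java
// Hypothetical standalone version of the access rule, for illustration only:
// a document is viewable when the user IDs match, or when the document's
// group ID is greater than or equal to the user's group ID.
public class AccessRule {
    static boolean canView(String userId, int userGroupId,
                           String docUserId, int docGroupId) {
        return userId.equals(docUserId) || docGroupId >= userGroupId;
    }

    public static void main(String[] args) {
        // An admin (group 10) can see a user-level (group 30) document...
        System.out.println(canView("1001", 10, "9999", 30));
        // ...but a guest (group 40) cannot see a manager-level (group 20) one.
        System.out.println(canView("1001", 40, "9999", 20));
    }
}
```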
Here is our custom Filter, UserSecurityFilter:
public class UserSecurityFilter extends Filter {

    private String userIdField;
    private String groupIdField;
    private String userId;
    private String groupId;

    public UserSecurityFilter(String userIdField, String groupIdField,
            String userId, String groupId) {
        this.userIdField = userIdField;
        this.groupIdField = groupIdField;
        this.userId = userId;
        this.groupId = groupId;
    }

    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
            throws IOException {
        final SortedDocValues userIdDocValues =
            FieldCache.DEFAULT.getTermsIndex(context.reader(), userIdField);
        final SortedDocValues groupIdDocValues =
            FieldCache.DEFAULT.getTermsIndex(context.reader(), groupIdField);
        final int userIdOrd = userIdDocValues.lookupTerm(new BytesRef(userId));
        final int groupIdOrd = groupIdDocValues.lookupTerm(new BytesRef(groupId));

        return new FieldCacheDocIdSet(context.reader().maxDoc(), acceptDocs) {
            @Override
            protected final boolean matchDoc(int doc) {
                final int userIdDocOrd = userIdDocValues.getOrd(doc);
                final int groupIdDocOrd = groupIdDocValues.getOrd(doc);
                return userIdDocOrd == userIdOrd || groupIdDocOrd >= groupIdOrd;
            }
        };
    }
}
This Filter accepts four arguments in its constructor:

- userIdField: the name of the field containing the user ID
- groupIdField: the name of the field containing the group ID
- userId: the current session's user ID
- groupId: the current session's group ID
Then, we implement getDocIdSet(AtomicReaderContext, Bits) to perform our filtering by userId and groupId. We first acquire two SortedDocValues, one for the user ID and one for the group ID, based on the Field names we obtained from the constructor. Then, we look up the ordinal values for the current session's user ID and group ID. The return value is a new FieldCacheDocIdSet object implementing its matchDoc(int) method. This is where we compare both the user ID and group ID to determine whether a document is viewable by the user. A match is true when the user ID matches and the document's group ID is greater than or equal to the user's group ID.
To test this Filter, we will set up our index as follows:
Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);

Document doc = new Document();
StringField stringFieldFile = new StringField("file", "", Field.Store.YES);
StringField stringFieldUserId = new StringField("userId", "", Field.Store.YES);
StringField stringFieldGroupId = new StringField("groupId", "", Field.Store.YES);

doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");
stringFieldFile.setStringValue("Z:\\shared\\finance\\2014-sales.xls");
stringFieldUserId.setStringValue("1001");
stringFieldGroupId.setStringValue("20");
doc.add(stringFieldFile); doc.add(stringFieldUserId); doc.add(stringFieldGroupId);
indexWriter.addDocument(doc);

doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");
stringFieldFile.setStringValue("Z:\\shared\\company\\2014-policy.doc");
stringFieldUserId.setStringValue("1101");
stringFieldGroupId.setStringValue("30");
doc.add(stringFieldFile); doc.add(stringFieldUserId); doc.add(stringFieldGroupId);
indexWriter.addDocument(doc);

doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");
stringFieldFile.setStringValue("Z:\\shared\\company\\2014-terms-and-conditions.doc");
stringFieldUserId.setStringValue("1205");
stringFieldGroupId.setStringValue("40");
doc.add(stringFieldFile); doc.add(stringFieldUserId); doc.add(stringFieldGroupId);
indexWriter.addDocument(doc);

indexWriter.commit();
indexWriter.close();
The setup adds three documents to our index with different user ID and group ID settings in each document, as follows:

- Document 1: userId 1001, groupId 20
- Document 2: userId 1101, groupId 30
- Document 3: userId 1205, groupId 40
UserSecurityFilter userSecurityFilter =
    new UserSecurityFilter("userId", "groupId", "1001", "40");

IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Query query = new MatchAllDocsQuery();
TopDocs topDocs = indexSearcher.search(query, userSecurityFilter, 100);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    doc = indexReader.document(scoreDoc.doc);
    System.out.println("file: " + doc.getField("file").stringValue()
        + " - userId: " + doc.getField("userId").stringValue()
        + " - groupId: " + doc.getField("groupId").stringValue());
}
indexReader.close();
We initialize UserSecurityFilter with the matching names for user ID and group ID fields, and set it up with user ID 1001 and group ID 40. For our test and search, we use MatchAllDocsQuery to basically search without any queries (as it will return all the documents). Here is the output from the code:
file: Z:\shared\finance\2014-sales.xls - userId: 1001 - groupId: 20
file: Z:\shared\company\2014-terms-and-conditions.doc - userId: 1205 - groupId: 40
The search specifically filters by user ID 1001, so the first document is returned because its user ID is also 1001. The third document is returned because its group ID, 40, is greater than or equal to the user's group ID, which is also 40. The second document is excluded because its user ID doesn't match and its group ID, 30, is below the user's group ID.
QueryParser is an interpreter tool that transforms a search string into a series of Query clauses. It's not absolutely necessary to use QueryParser to perform a search, but it's a great feature that empowers users by allowing the use of search modifiers. A user can specify a phrase match by putting quotes (") around a phrase. A user can also control whether a certain term or phrase is required by putting a plus ("+") sign in front of the term or phrase, or use a minus ("-") sign to indicate that the term or phrase must not exist in results. For Boolean searches, the user can use AND and OR to control whether all terms or phrases are required.
To do a field-specific search, you can use a colon (":") to specify a field for a search (for example, content:humpty would search for the term "humpty" in the field "content"). For wildcard searches, you can use the standard wildcard character asterisk ("*") to match 0 or more characters, or a question mark ("?") for matching a single character. As you can see, the general syntax for a search query is not complicated, though the more advanced modifiers can seem daunting to new users. In this article, we will cover more advanced QueryParser features to show you what you can do to customize a search.
Let's look at the options that we can set in QueryParser. The following code snippet is our setup:
Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);

Document doc = new Document();
StringField stringField = new StringField("name", "", Field.Store.YES);
TextField textField = new TextField("content", "", Field.Store.YES);
IntField intField = new IntField("num", 0, Field.Store.YES);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("First");
textField.setStringValue("Humpty Dumpty sat on a wall,");
intField.setIntValue(100);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Second");
textField.setStringValue("Humpty Dumpty had a great fall.");
intField.setIntValue(200);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Third");
textField.setStringValue("All the king's horses and all the king's men");
intField.setIntValue(300);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Fourth");
textField.setStringValue("Couldn't put Humpty together again.");
intField.setIntValue(400);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

indexWriter.commit();
indexWriter.close();

IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

QueryParser queryParser = new QueryParser("content", analyzer);
// configure queryParser here
Query query = queryParser.parse("humpty");
TopDocs topDocs = indexSearcher.search(query, 100);
We add four documents and instantiate a QueryParser object with a default field and an analyzer. We will be using the same analyzer that was used in indexing to ensure that we apply the same text treatment to maximize matching capability.
The query syntax for a wildcard search is the asterisk ("*") or question mark ("?") character. Here is a sample query:
Query query = queryParser.parse("humpty*");
This query will return the first, second, and fourth sentences. By default, QueryParser does not allow a leading wildcard character because it has a significant performance impact. A leading wildcard would trigger a full scan on the index since any term can be a potential match. In essence, even an inverted index would become rather useless for a leading wildcard character search. However, it's possible to override this default setting to allow a leading wildcard character by calling setAllowLeadingWildcard(true). You can go ahead and run this example with different search strings to see how this feature works.
Depending on where the wildcard character(s) is placed, QueryParser will produce either a PrefixQuery or WildcardQuery. In this specific example in which there is only one wildcard character and it's not the leading character, a PrefixQuery will be produced.
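The reason a leading wildcard is so expensive can be pictured with a standalone sketch (illustrative only; Lucene's term dictionary and automata are far more sophisticated, and the class here is hypothetical). A sorted term dictionary lets a prefix query seek directly to its range, while a leading wildcard forces a scan of every term:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch (not Lucene code) of prefix seek versus full scan
// over a sorted term dictionary.
public class TermDictionarySketch {
    static final String[] TERMS = {
        "again", "all", "dumpty", "fall", "great", "humpty", "sat", "wall"
    };

    // Prefix lookup: binary-search to the start of the range, then read
    // only the contiguous run of matching terms.
    static List<String> prefixSeek(String prefix) {
        List<String> result = new ArrayList<>();
        int pos = Arrays.binarySearch(TERMS, prefix);
        if (pos < 0) pos = -pos - 1; // insertion point
        for (int i = pos; i < TERMS.length && TERMS[i].startsWith(prefix); i++) {
            result.add(TERMS[i]);
        }
        return result;
    }

    // Leading wildcard (*suffix): sort order doesn't help, so every term
    // in the dictionary has to be examined.
    static List<String> fullScan(String suffix) {
        List<String> result = new ArrayList<>();
        for (String term : TERMS) {
            if (term.endsWith(suffix)) result.add(term);
        }
        return result;
    }
}
```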
We can produce a TermRangeQuery by using TO in a search string. The range has the following syntax:
[start TO end] – inclusive
{start TO end} – exclusive
As indicated, the square brackets ( [ and ] ) include the start and end terms, while the curly brackets ( { and } ) exclude them. It's also possible to mix these brackets so that the range is inclusive on one side and exclusive on the other.
Here is a code snippet:
Query query = queryParser.parse("[aa TO c]");
This search will return the third and fourth sentences, as each contains a term within the range: all in the third and again in the fourth. You can optionally analyze the range terms with the same analyzer by setting setAnalyzeRangeTerms(true).
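The inclusive/exclusive semantics can be mimicked with plain string comparisons. This is an illustrative sketch of the bracket semantics only, not how TermRangeQuery is implemented, and the class name is hypothetical:

```java
// Illustrative sketch of inclusive/exclusive range semantics using
// lexicographic comparison (not Lucene's actual implementation).
public class RangeSketch {
    static boolean inRange(String term, String start, String end,
                           boolean includeStart, boolean includeEnd) {
        int lo = term.compareTo(start);
        int hi = term.compareTo(end);
        boolean aboveStart = includeStart ? lo >= 0 : lo > 0;
        boolean belowEnd = includeEnd ? hi <= 0 : hi < 0;
        return aboveStart && belowEnd;
    }

    public static void main(String[] args) {
        System.out.println(inRange("all", "aa", "c", true, true));  // [aa TO c]
        System.out.println(inRange("c", "aa", "c", true, false));   // end excluded
    }
}
```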
QueryParser can automatically generate a PhraseQuery when there is more than one term in a search string. Here is a code snippet:
queryParser.setAutoGeneratePhraseQueries(true);
Query query = queryParser.parse("humpty+dumpty+sat");
This search will generate a PhraseQuery on the phrase humpty dumpty sat and will return the first sentence.
If you have a date field (by using DateTools to convert date to a string format) and would like to do a range search on date, it may be necessary to match the date resolution on a specific field. Here is a code snippet on setting the Date resolution:
queryParser.setDateResolution("date", DateTools.Resolution.DAY);
queryParser.setLocale(Locale.US);
queryParser.setTimeZone(TimeZone.getTimeZone("America/New_York"));
This example sets the resolution to day granularity, locale to US, and time zone to New York. The locale and time zone settings are specific to the date format only.
The default operator on a multiterm search string is OR. You can change the default to AND so all the terms are required. Here is a code snippet that will require all the terms in a search string:
queryParser.setDefaultOperator(QueryParser.Operator.AND); Query query = queryParser.parse("humpty dumpty");
This example will return the first and second sentences, as these are the only two sentences containing both humpty and dumpty.
This setting is enabled by default. Its purpose is to maintain a position increment of the token that follows an omitted token, such as a token filtered by a StopFilter. This is useful in phrase queries when position increments may be important for scoring. Here is an example on how to enable this setting:
queryParser.setEnablePositionIncrements(true);
Query query = queryParser.parse("\"humpty dumpty\"");
In our scenario, it won't change our search results. This attribute only enables position increments information to be available in the resulting PhraseQuery.
Lucene's fuzzy search implementation is based on Levenshtein distance. It compares two strings and finds out the number of single character changes that are needed to transform one string to another. The resulting number indicates the closeness of the two strings. In a fuzzy search, a threshold number of edits is used to determine if the two strings are matched. To trigger a fuzzy match in QueryParser, you can use the tilde ~ character. There are a couple configurations in QueryParser to tune this type of query. Here is a code snippet:
queryParser.setFuzzyMinSim(2f);
queryParser.setFuzzyPrefixLength(3);
Query query = queryParser.parse("hump~");
This example will return the first, second, and fourth sentences, as the fuzzy match matches hump to humpty; the two words are two edits apart. We tuned the fuzzy query to a minimum similarity of two in this example.
This configuration determines whether to automatically lowercase multiterm queries. An analyzer can do this already, so this is more like an overriding configuration that forces multiterm queries to be lowercased. Here is a code snippet:
queryParser.setLowercaseExpandedTerms(true);
Query query = queryParser.parse("\"Humpty Dumpty\"");
This code will lowercase our search string before search execution.
Phrase search can be tuned to allow some flexibility in phrase matching. By default, phrase match is exact. Setting a slop value will give it some tolerance on terms that may not always be matched consecutively. Here is a code snippet that will demonstrate this feature:
queryParser.setPhraseSlop(3);
Query query = queryParser.parse("\"Humpty Dumpty wall\"");
Without setting a phrase slop, this phrase Humpty Dumpty wall will not have any matches. By setting phrase slop to three, it allows some tolerance so that this search will now return the first sentence. Go ahead and play around with this setting in order to get more familiarized with its behavior.
A TermQuery is a very simple query that matches documents containing a specific term. The TermRangeQuery is, as its name implies, a term range with a lower and upper boundary for matching.
Here are a couple of examples on TermQuery and TermRangeQuery:
query = new TermQuery(new Term("content", "humpty"));
query = new TermRangeQuery("content", new BytesRef("a"), new BytesRef("c"),
    true, true);
The first line is a simple query that matches the term humpty in the content field. The second line is a range query matching documents whose content field contains a term that sorts between a and c.
A BooleanQuery is a combination of other queries in which you can specify whether each subquery must, must not, or should match. These options provide the foundation to build up to logical operators of AND, OR, and NOT, which you can use in QueryParser. Here is a quick review on QueryParser syntax for BooleanQuery:
As mentioned, the rudimentary clauses of BooleanQuery have three options: must match, must not match, and should match. These options allow us to programmatically create Boolean operations through an API.
Here is a code snippet that demonstrates BooleanQuery:
BooleanQuery query = new BooleanQuery();
query.add(new BooleanClause(new TermQuery(new Term("content", "humpty")),
    BooleanClause.Occur.MUST));
query.add(new BooleanClause(new TermQuery(new Term("content", "dumpty")),
    BooleanClause.Occur.MUST));
query.add(new BooleanClause(new TermQuery(new Term("content", "wall")),
    BooleanClause.Occur.SHOULD));
query.add(new BooleanClause(new TermQuery(new Term("content", "sat")),
    BooleanClause.Occur.MUST_NOT));
In this demonstration, we use TermQuery to illustrate the building of BooleanClauses. The query requires both humpty and dumpty, optionally matches wall, and rejects sat: roughly (humpty AND dumpty) NOT sat, with wall serving only to boost relevancy. This code will return the second sentence from our setup; because of the MUST_NOT BooleanClause on the word "sat", the first sentence is filtered from the results. Note that BooleanClause accepts two arguments: a Query and a BooleanClause.Occur. BooleanClause.Occur is where you specify the matching options: MUST, MUST_NOT, and SHOULD.
PrefixQuery, as the name implies, matches documents with terms starting with a specified prefix. WildcardQuery allows you to use wildcard characters for wildcard matching.
A PrefixQuery is somewhat similar to a WildcardQuery in which there is only one wildcard character at the end of a search string. When doing a wildcard search in QueryParser, it would return either a PrefixQuery or WildcardQuery, depending on the wildcard character's location. PrefixQuery is simpler and more efficient than WildcardQuery, so it's preferable to use PrefixQuery whenever possible. That's exactly what QueryParser does.
Here is a code snippet to demonstrate both Query types:
PrefixQuery query = new PrefixQuery(new Term("content", "hum")); WildcardQuery query2 = new WildcardQuery(new Term("content", "*um*"));
Both queries would return the same results from our setup. The PrefixQuery will match anything that starts with hum and the WildcardQuery would match anything that contains um.
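The matching semantics of the two query types can be sketched in plain Java. This is an illustrative matcher only (the class is hypothetical); Lucene actually compiles wildcard patterns into automata rather than using recursion like this:

```java
// Illustrative sketch of prefix and wildcard matching semantics
// (not Lucene's implementation).
public class WildcardSketch {
    // Prefix match: a single, cheap comparison.
    static boolean prefixMatch(String term, String prefix) {
        return term.startsWith(prefix);
    }

    // General wildcard match supporting '*' (any run of characters)
    // and '?' (exactly one character). Simple recursion for clarity.
    static boolean wildcardMatch(String term, String pattern) {
        if (pattern.isEmpty()) return term.isEmpty();
        char p = pattern.charAt(0);
        if (p == '*') {
            // '*' matches the empty string, or consumes one character of term.
            return wildcardMatch(term, pattern.substring(1))
                || (!term.isEmpty() && wildcardMatch(term.substring(1), pattern));
        }
        return !term.isEmpty()
            && (p == '?' || p == term.charAt(0))
            && wildcardMatch(term.substring(1), pattern.substring(1));
    }

    public static void main(String[] args) {
        System.out.println(prefixMatch("humpty", "hum"));
        System.out.println(wildcardMatch("dumpty", "*um*"));
    }
}
```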
A PhraseQuery matches a particular sequence of terms, while a MultiPhraseQuery gives you an option to match multiple terms in the same position. For example, MultiPhraseQuery supports a phrase such as humpty (dumpty OR together), in which it matches humpty in position 0 and dumpty or together in position 1.
Here is a code snippet to demonstrate both Query types:
PhraseQuery query = new PhraseQuery();
query.add(new Term("content", "humpty"));
query.add(new Term("content", "together"));

MultiPhraseQuery query2 = new MultiPhraseQuery();
Term[] terms1 = new Term[1];
terms1[0] = new Term("content", "humpty");
Term[] terms2 = new Term[2];
terms2[0] = new Term("content", "dumpty");
terms2[1] = new Term("content", "together");
query2.add(terms1);
query2.add(terms2);
The first Query, PhraseQuery, searches for the phrase humpty together. The second Query, MultiPhraseQuery, searches for the phrase humpty (dumpty OR together). The first Query would return the fourth sentence from our setup, while the second Query would return sentences one, two, and four. Note that in MultiPhraseQuery, multiple terms in the same position are added as an array.
A FuzzyQuery matches terms based on similarity, using the Damerau-Levenshtein algorithm. We are not going into the details of the algorithm, as it is outside our topic. What we need to know is that a fuzzy match is measured by the number of edits between terms, and that FuzzyQuery allows a maximum of two edits. For example, humptX and humpty are one edit apart, and humpXX and humpty are two edits apart. There is also a requirement that the number of edits must be less than the minimum term length (of either the input term or the candidate term). For example, ab and abcd would not match because the number of edits between the two terms is 2, which is not less than the length of ab, which is also 2.
Here is a code snippet to demonstrate FuzzyQuery:
FuzzyQuery query = new FuzzyQuery(new Term("content", "humpXX"));
This Query will return sentences one, two, and four from our setup, as humpXX matches humpty within the two edits. In QueryParser, FuzzyQuery can be triggered by the tilde ( ~ ) sign. An equivalent search string would be humpXX~.
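The edit counts quoted above can be verified with a small standalone edit-distance implementation. Note the hedges: this is the classic dynamic-programming Levenshtein distance (without the transposition edit of Damerau-Levenshtein), written for clarity; Lucene itself uses a much more efficient automaton-based approach:

```java
public class EditDistance {
    // Classic dynamic-programming Levenshtein distance: the number of
    // single-character insertions, deletions, and substitutions needed
    // to turn one string into the other.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("humptX", "humpty")); // one edit
        System.out.println(levenshtein("humpXX", "humpty")); // two edits
        System.out.println(levenshtein("ab", "abcd"));       // two edits
    }
}
```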
This gives you a glimpse of the various querying and filtering features that you can leverage to build a successful search engine.