Authors: Michael McCandless / Erik Hatcher / Otis Gospodnetic
Publisher: Manning Publications
Subtitle: Covers Apache Lucene 3.0
Publication date: 2010-7-28
Pages: 475
Price: USD 49.99
Binding: Paperback
ISBN: 9781933988177
- Meet Lucene
- Building a search index
- Index File Formats
- Adding search to your application
- Implementing a simple search feature
- Using IndexSearcher
- Understanding Lucene scoring
- Lucene’s diverse queries
- Searching by term: TermQuery
- Searching within a term range: TermRangeQuery
- Searching within a numeric range: NumericRangeQuery
- Searching on a string: PrefixQuery
- Combining queries: BooleanQuery
- Searching by phrase: PhraseQuery
- Searching by wildcard: WildcardQuery
- Searching for similar terms: FuzzyQuery
- Matching all documents: MatchAllDocsQuery
- Parsing query expressions: QueryParser
- Lucene’s analysis process
Meet Lucene
What is Lucene?
Lucene is a high-performance, scalable information retrieval (IR) library. IR refers to the process of searching for documents, information within documents, or metadata about documents.
Lucene and the components of a search application
Figure 1.4. Typical components of a search application; the shaded components show which parts Lucene handles.
Components for indexing
- Acquire Content: Using a crawler or spider, gather and scope the content that needs to be indexed.
- Build Document: Once you have the raw content that needs to be indexed, you must translate it into the units (usually called documents) used by the search engine. A document typically consists of several separately named fields with values, such as title, body, abstract, author, and url.
- Analyze Document: No search engine indexes text directly; rather, the text must be broken into a series of individual atomic elements called tokens.
- Index Document: During the indexing step, the document is added to the index.
Components for searching
- Search User Interface: The user interface is what users actually see, in the web browser, desktop application, or mobile device, when they interact with your search application.
- Build Query: You must translate the user's request into the search engine's Query object. Lucene provides a powerful package, called QueryParser, to process the user's text into a query object according to a common search syntax.
- Search Query: The process of consulting the search index and retrieving the documents matching the Query, sorted in the requested sort order. There are three common theoretical models of search:
  - Pure Boolean model — Documents either match or don't match the provided query, and no scoring is done. There are no relevance scores associated with matching documents, and the matching documents are unordered; a query simply identifies a subset of the overall corpus as matching the query.
  - Vector space model — Both queries and documents are modeled as vectors in a high-dimensional space, where each unique term is a dimension. Relevance, or similarity, between a query and a document is computed by a vector distance measure between these vectors.
  - Probabilistic model — You compute the probability that a document is a good match to a query using a full probabilistic approach.
- Render Results: Once you have the raw set of documents that match the query, sorted in the right order, you render them to the user in an intuitive, consumable manner.
The rest of the search application
- Administration Interface: Tune the size of the RAM buffer, how many segments to merge at once, how often to commit changes, or when to optimize and purge deletes from the index.
- Analytics Interface: Lucene-specific metrics that could feed the analytics interface include:
  - How often each kind of query (single term, phrase, Boolean, etc.) is run
  - Queries that hit low relevance
  - Queries where the user didn't click on any results (if your application tracks click-throughs)
  - How often users sort by specific fields instead of relevance
  - The breakdown of Lucene's search time
- Scaling: There are two dimensions to scaling: net amount of content and net query throughput.
Lucene In Action: A Sample Application
Creating an index
Listing 1.1 shows the Indexer command-line program, originally written for Erik’s introductory Lucene article on java.net. It takes two arguments:
- A path to a directory where we store the Lucene index
- A path to a directory that contains the files we want to index
Listing 1.1. Indexer, which indexes .txt files
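The listing body isn't reproduced in these notes. As a rough orientation, a minimal Indexer along those lines might look like the following sketch (class layout and field names are assumptions, not the book's exact code):

```java
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
  public static void main(String[] args) throws Exception {
    String indexDir = args[0];  // directory where the Lucene index is stored
    String dataDir = args[1];   // directory containing the .txt files to index

    Directory dir = FSDirectory.open(new File(indexDir));
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        true,                                    // create a new index
        IndexWriter.MaxFieldLength.UNLIMITED);   // don't truncate any fields

    for (File f : new File(dataDir).listFiles()) {
      if (!f.getName().endsWith(".txt")) continue;
      System.out.println("Indexing " + f.getCanonicalPath());
      Document doc = new Document();
      doc.add(new Field("contents", new FileReader(f)));             // tokenized, not stored
      doc.add(new Field("filename", f.getCanonicalPath(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED)); // stored, exact value
      writer.addDocument(doc);
    }
    System.out.println("Indexed " + writer.numDocs() + " documents");
    writer.close();
    dir.close();
  }
}
```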
Go ahead and type ant Indexer, and you should see output like this:
% ant Indexer
Index *.txt files in a directory into a Lucene index.
Use the Searcher target to search this index.
Indexer is covered in the "Meet Lucene" chapter.
Press return to continue...
Directory for new Lucene index: [indexes/MeetLucene]
Directory with .txt files to index: [src/lia/meetlucene/data]
Overwrite indexes/MeetLucene? (y, n) y
Running lia.meetlucene.Indexer...
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/apache1.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/apache1.1.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/apache2.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/cpl1.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/epl1.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/freebsd.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/gpl1.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/gpl2.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/gpl3.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/lgpl2.1.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/lgpl3.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/lpgl2.0.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/mit.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/mozilla1.1.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/
mozilla_eula_firefox3.txt
Indexing /Users/mike/lia2e/src/lia/meetlucene/data/
mozilla_eula_thunderbird2.txt
Indexing 16 files took 757 milliseconds
BUILD SUCCESSFUL
Searching an index
The Searcher program, originally written for Erik’s introductory Lucene article on java.net, complements Indexer and provides command-line searching capability. Listing 1.2 shows Searcher in its entirety. It takes two command-line arguments:
- The path to the index created with Indexer
- A query to use to search the index
Listing 1.2. Searcher, which searches a Lucene index
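The listing body isn't included here either; a minimal sketch of such a Searcher, assuming the field names used in the Indexer sketch above, might be:

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {
  public static void main(String[] args) throws Exception {
    String indexDir = args[0];  // index created by Indexer
    String q = args[1];         // query string, e.g. "patent"

    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher searcher = new IndexSearcher(dir);
    QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
        new StandardAnalyzer(Version.LUCENE_30));
    Query query = parser.parse(q);

    long start = System.currentTimeMillis();
    TopDocs hits = searcher.search(query, 10);
    long end = System.currentTimeMillis();

    System.out.println("Found " + hits.totalHits + " document(s) (in "
        + (end - start) + " milliseconds) that matched query '" + q + "':");
    for (ScoreDoc sd : hits.scoreDocs) {
      Document doc = searcher.doc(sd.doc);
      System.out.println(doc.get("filename"));  // stored field added by Indexer
    }
    searcher.close();
    dir.close();
  }
}
```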
Let's run Searcher and find documents in our index using the query 'patent':
% ant Searcher
Search an index built using Indexer.
Searcher is described in the "Meet Lucene" chapter.
Press return to continue...
Directory of existing Lucene index built by
Indexer: [indexes/MeetLucene]
Query: [patent]
Running lia.meetlucene.Searcher...
Found 8 document(s) (in 11 milliseconds) that
matched query 'patent':
/Users/mike/lia2e/src/lia/meetlucene/data/cpl1.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/mozilla1.1.txt
/Users/mike/lia2e/src/lia/meetlucene/data/epl1.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/gpl3.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/apache2.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/lpgl2.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/gpl2.0.txt
/Users/mike/lia2e/src/lia/meetlucene/data/lgpl2.1.txt
BUILD SUCCESSFUL
Total time: 4 seconds
Understanding the core indexing classes
As you saw in our Indexer class, you need the following classes to perform the simplest indexing procedure:
- IndexWriter
- Directory
- Analyzer
- Document
- Field
Figure 1.5. Classes used when indexing documents with Lucene
IndexWriter
IndexWriter
is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.
Directory
The Directory
class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to IndexWriter’s constructor.
Analyzer
Before text is indexed, it's passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting those tokens out of text that should be indexed and eliminating the rest. Analyzer is an abstract class, but Lucene comes with several implementations of it:
- Some of them deal with skipping stop words (frequently used words that don't help distinguish one document from another, such as a, an, the, in, and on).
- Some deal with conversion of tokens to lowercase letters, so that searches aren't case sensitive.
- And so on.
Document
The Document
class represents a collection of fields. Think of it as a virtual document—a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time.
Field
Each document in an index contains one or more named fields, embodied in a class called Field. Each field has a name and corresponding value, and a bunch of options, described in section 2.4, that control precisely how Lucene will index the field's value.
Understanding the core searching classes
The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation:
- IndexSearcher
- Term
- Query
- TermQuery
- TopDocs
IndexSearcher
IndexSearcher
is to searching what IndexWriter is to indexing: the central link to the index that exposes several search methods. You can think of IndexSearcher as a class that opens an index in a read-only mode. It requires a Directory instance, holding the previously created index, and then offers a number of search methods, some of which are implemented in its abstract parent class Searcher; the simplest takes a Query object and an int topN count as parameters and returns a TopDocs object. A typical use of this method looks like this:
Directory dir = FSDirectory.open(new File("/tmp/index"));
IndexSearcher searcher = new IndexSearcher(dir);
Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);
searcher.close();
Term
A Term
is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the word (text value) of that field.
During searching, you may construct Term objects and use them together with TermQuery:
Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);
Query
Lucene comes with a number of concrete Query
subclasses. So far in this chapter we’ve mentioned only the most basic Lucene Query: TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, TermRangeQuery, NumericRangeQuery, FilteredQuery, and SpanQuery.
TermQuery
TermQuery
is the most basic type of query supported by Lucene, and it’s one of the primitive query types. It’s used for matching documents that contain fields with specific values, as you’ve seen in the last few paragraphs.
TopDocs
The TopDocs
class is a simple container of pointers to the top N ranked search results—documents that match a given query. For each of the top N results, TopDocs records the int docID (which you can use to retrieve the document) as well as the float score.
Building a search index
How Lucene models content
Documents and fields
At a high level, there are three things Lucene can do with each field:
- The value may be indexed (or not).
- If it’s indexed, the field may also optionally store term vectors, which are collectively a miniature inverted index for that one field, allowing you to retrieve all of its tokens.
- Separately, the field’s value may be stored, meaning a verbatim copy of the unanalyzed value is written away in the index so that it can later be retrieved. This is useful for fields you’d like to present unchanged to the user, such as the document’s title or abstract.
Flexible schema
Unlike a database, Lucene has no notion of a fixed global schema.
The second major difference between Lucene and databases is that Lucene requires you to flatten, or denormalize, your content when you index it.
Denormalization
Lucene documents are flat: any recursion and joins present in the original content must be denormalized when creating your documents.
Understanding the indexing process
Figure 2.1. Indexing with Lucene breaks down into three main operations: extracting text from source documents, analyzing it, and saving it to the index.
- Extracting text and creating the document
- Analysis
- Adding to the index
Figure 2.2. Segmented structure of a Lucene inverted index
Basic index operations
Adding documents to an index
There are two methods for adding documents:
- addDocument(Document)—Adds the document using the default analyzer, which you specified when creating the IndexWriter, for tokenization.
- addDocument(Document, Analyzer)—Adds the document using the provided analyzer for tokenization. But be careful! In order for searches to work correctly, you need the analyzer used at search time to “match” the tokens produced by the analyzers at indexing time.
Listing 2.1. Adding documents to an index
In the getWriter method, we create the IndexWriter with three arguments:
- Directory, where the index is stored.
- The analyzer to use when indexing tokenized fields.
- MaxFieldLength.UNLIMITED, a required argument that tells IndexWriter to index all tokens in the document.
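The body of Listing 2.1 isn't included in these notes; a minimal sketch of such a getWriter helper, under those assumptions, might be:

```java
// Hypothetical helper mirroring the getWriter method described above; 'directory' is
// assumed to be a Directory field on the enclosing test class.
private IndexWriter getWriter() throws IOException {
  return new IndexWriter(directory,
      new WhitespaceAnalyzer(),                // analyzer used for tokenized fields
      IndexWriter.MaxFieldLength.UNLIMITED);   // index all tokens, no truncation
}
```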
public static int hitCount(IndexSearcher searcher, Query query)
throws IOException {
return searcher.search(query, 1).totalHits;
}
This method runs the search and returns the total number of hits that matched.
Deleting documents from an index
IndexWriter provides various methods to remove documents from an index:
- deleteDocuments(Term) deletes all documents containing the provided term.
- deleteDocuments(Term[]) deletes all documents containing any of the terms in the provided array.
- deleteDocuments(Query) deletes all documents matching the provided query.
- deleteDocuments(Query[]) deletes all documents matching any of the queries in the provided array.
- deleteAll() deletes all documents in the index. This is exactly the same as closing the writer and opening a new writer with create=true, without having to close your writer.
Listing 2.2. Deleting documents from an index
- maxDoc() returns the total number of documents in the index, whether deleted or not.
- numDocs() returns only the number of undeleted documents.
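A hedged sketch in the spirit of Listing 2.2, showing how maxDoc() and numDocs() diverge after a delete (it assumes the getWriter helper above and a test fixture with two documents whose id field holds "1" and "2"):

```java
public void testDeleteBeforeOptimize() throws IOException {
  IndexWriter writer = getWriter();                 // hypothetical helper from Listing 2.1
  assertEquals(2, writer.numDocs());                // two docs indexed in setUp
  writer.deleteDocuments(new Term("id", "1"));      // mark doc with id=1 as deleted
  writer.commit();
  assertTrue(writer.hasDeletions());                // deletion recorded, not yet merged away
  assertEquals(2, writer.maxDoc());                 // deleted doc still counted
  assertEquals(1, writer.numDocs());                // but only one undeleted doc remains
  writer.close();
}
```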
Updating documents in the index
IndexWriter provides two convenience methods to replace a document in the index:
- updateDocument(Term, Document) first deletes all documents containing the provided term and then adds the new document using the writer’s default analyzer.
- updateDocument(Term, Document, Analyzer) does the same but uses the provided analyzer instead of the writer’s default analyzer.
Listing 2.3. Updating indexed Documents
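Listing 2.3 isn't reproduced here; a small sketch of updateDocument usage under the same assumptions:

```java
// Replace the document whose "id" is "1" with a new version; one call performs the
// delete-then-add for you.
IndexWriter writer = getWriter();                 // hypothetical helper from Listing 2.1
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
doc.add(new Field("contents", "den haag has a lot of museums",
                  Field.Store.NO, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "1"), doc);
writer.close();
```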
Field options
Field options for indexing
The options for indexing (Field.Index.*) control how the text in the field will be made searchable via the inverted index. Here are the choices:
- Index.ANALYZED—Use the analyzer to break the field's value into a stream of separate tokens and make each token searchable. This option is useful for normal text fields (body, title, abstract, etc.).
- Index.NOT_ANALYZED—Do index the field, but don't analyze the String value. Instead, treat the field's entire value as a single token and make that token searchable. This option is useful for fields that you'd like to search on but that shouldn't be broken up, such as URLs, file system paths, dates, personal names, Social Security numbers, and telephone numbers. It's especially useful for enabling "exact match" searching.
- Index.ANALYZED_NO_NORMS—A variant of Index.ANALYZED that doesn't store norms information in the index. Norms record index-time boost information but can consume memory when you're searching.
- Index.NOT_ANALYZED_NO_NORMS—Just like Index.NOT_ANALYZED, but also doesn't store norms. This option is frequently used to save index space and memory during searching, because single-token fields don't need the norms information unless they're boosted.
- Index.NO—Don't make this field's value available for searching.
Field options for storing fields
The options for stored fields (Field.Store.*) determine whether the field’s exact value should be stored away so that you can later retrieve it during searching:
- Store.YES —Stores the value. When the value is stored, the original String in its entirety is recorded in the index and may be retrieved by an IndexReader. This option is useful for fields that you’d like to use when displaying the search results (such as a URL, title, or database primary key). Try not to store very large fields, if index size is a concern, as stored fields consume space in the index.
- Store.NO —Doesn’t store the value. This option is often used along with Index.ANALYZED to index a large text field that doesn’t need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.
Field options for term vectors
Term vectors are a mix between an indexed field and a stored field. They’re similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID. But then, they’re keyed secondarily by term, meaning they store a miniature inverted index for that one document. Unlike a stored field, where the original String content is stored verbatim, term vectors store the actual separate terms that were produced by the analyzer, allowing you to retrieve all terms for each field, and the frequency of their occurrence within the document, sorted in lexicographic order.
You can choose separately whether these details are also stored in your term vectors by passing these constants as the fourth argument to the Field constructor:
- TermVector.YES—Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information
- TermVector.WITH_POSITIONS—Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets
- TermVector.WITH_OFFSETS—Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions
- TermVector.WITH_POSITIONS_OFFSETS—Stores unique terms and their counts, along with positions and offsets
- TermVector.NO—Doesn’t store any term vector information
Reader, TokenStream, and byte[] field values
There are a few other constructors for the Field object that allow you to use values other than String:
- Field(String name, Reader value, TermVector termVector) uses a Reader instead of a String to represent the value. In this case, the value can’t be stored (the option is hardwired to Store.NO) and is always analyzed and indexed (Index.ANALYZED). This can be useful when holding the full String in memory might be too costly or inconvenient—for example, for very large values.
- Field(String name, Reader value), like the previous constructor, uses a Reader instead of a String to represent the value, but defaults termVector to TermVector.NO.
- Field(String name, TokenStream tokenStream, TermVector termVector) allows you to preanalyze the field value into a TokenStream. Likewise, such fields aren't stored and are always analyzed and indexed.
- Field(String name, TokenStream tokenStream), like the previous constructor, allows you to preanalyze the field value into a TokenStream but defaults termVector to TermVector.NO.
- Field(String name, byte[] value, Store store) is used to store a binary field. Such fields are never indexed (Index.NO) and have no term vectors (TermVector.NO). The store argument must be Store.YES.
- Field(String name, byte[] value, int offset, int length, Store store), like the previous constructor, stores a binary field but allows you to reference a sub-slice of the bytes starting at offset and running for length bytes.
Field option combinations
Table 2.1. A summary of various field characteristics, showing you how fields are created, along with common usage examples
Index | Store | TermVector | Example usage |
---|---|---|---|
NOT_ANALYZED_NO_NORMS | YES | NO | Identifiers (filenames, primary keys), telephone and Social Security numbers, URLs, personal names, dates, and textual fields for sorting |
ANALYZED | YES | WITH_POSITIONS_OFFSETS | Document title, document abstract |
ANALYZED | NO | WITH_POSITIONS_OFFSETS | Document body |
NO | YES | NO | Document type, database primary key (if not used for searching) |
NOT_ANALYZED | NO | NO | Hidden keywords |
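To make Table 2.1 concrete, here's an illustrative (not book-verbatim) document built with those option combinations; the field names and the bodyText variable are assumptions:

```java
Document doc = new Document();
doc.add(new Field("isbn", "9781933988177",
                  Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));  // identifier
doc.add(new Field("title", "Lucene in Action, Second Edition",
                  Field.Store.YES, Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS_OFFSETS));             // document title
doc.add(new Field("contents", bodyText,                                  // bodyText: assumed String
                  Field.Store.NO, Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS_OFFSETS));             // document body
```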
Field options for sorting
- If the field is numeric, use NumericField.
- If the field is textual, such as the sender's name in an email message, you must add it as a Field that's indexed but not analyzed, using Field.Index.NOT_ANALYZED.
- If you aren't doing any boosting for the field, you should index it without norms, to save disk space and memory, using Field.Index.NOT_ANALYZED_NO_NORMS:
new Field("author", "Arthur C. Clark", Field.Store.YES,
Field.Index.NOT_ANALYZED_NO_NORMS);
Fields used for sorting must be indexed and must contain one token per document. Typically this means using Field.Index.NOT_ANALYZED or Field.Index.NOT_ANALYZED_NO_NORMS (if you're not boosting documents or fields), but if your analyzer will always produce only one token, such as KeywordAnalyzer, Field.Index.ANALYZED or Field.Index.ANALYZED_NO_NORMS will work as well.
Multivalued fields
Adding the same field name to a document multiple times is perfectly acceptable and encouraged; it's a natural way to represent a field that legitimately has multiple values, as in the sketch below.
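For example, a document with several authors can simply add the "author" field repeatedly (a sketch, not the book's exact code):

```java
Document doc = new Document();
// The "author" field is multivalued: Lucene logically concatenates the tokens of each
// value for searching purposes.
doc.add(new Field("author", "Michael McCandless", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("author", "Erik Hatcher", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("author", "Otis Gospodnetic", Field.Store.YES, Field.Index.ANALYZED));
```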
Boosting documents and fields
Boosting documents
By changing a document’s boost factor, you can instruct Lucene to consider it more or less important with respect to other documents in the index when computing relevance.
Listing 2.4. Selectively boosting documents and fields
Boosting fields
Just as you can boost documents, you can also boost individual fields.
To achieve this behavior, we use the setBoost(float) method of the Field class:
Field subjectField = new Field("subject", subject,
Field.Store.YES,
Field.Index.ANALYZED);
subjectField.setBoost(1.2F);
Norms
During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.
Indexing numbers, dates, and times
Indexing numbers
doc.add(new NumericField("price").setDoubleValue(19.99));
Indexing dates and times
doc.add(new NumericField("timestamp").setLongValue(new Date().getTime()));
doc.add(new NumericField("day").setIntValue((int) (new Date().getTime()/24/3600)));
Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth").setIntValue(cal.get(Calendar.DAY_OF_MONTH)));
Field truncation
IndexWriter allows you to truncate per-field indexing so that only the first N terms are indexed for an analyzed field. When you instantiate IndexWriter, you must pass in a MaxFieldLength instance expressing this limit. MaxFieldLength provides two convenient default instances: MaxFieldLength.UNLIMITED, which means no truncation will take place, and MaxFieldLength.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit.
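For example, a writer that indexes only the first 100 terms of each analyzed field (a sketch; dir and analyzer are assumed to already exist):

```java
IndexWriter writer = new IndexWriter(dir, analyzer,
    new IndexWriter.MaxFieldLength(100));   // index only the first 100 terms per field
```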
Near-real-time search
IndexReader getReader()
This method immediately flushes any buffered added or deleted documents, and then creates a new read-only IndexReader that includes those documents.
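A minimal near-real-time sketch, assuming dir and analyzer are already set up (the names are illustrative):

```java
IndexWriter writer = new IndexWriter(dir, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
// ... addDocument / deleteDocuments calls ...
IndexReader reader = writer.getReader();        // flushes buffered changes, read-only view
IndexSearcher searcher = new IndexSearcher(reader);
// ... run searches; later, pick up further changes cheaply:
IndexReader newReader = reader.reopen();
if (newReader != reader) {
  reader.close();
  reader = newReader;
  searcher = new IndexSearcher(reader);
}
```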
Optimizing an index
Optimizing only improves searching speed, not indexing speed.
IndexWriter exposes four methods to optimize:
- optimize() reduces the index to a single segment, not returning until the operation is finished.
- optimize(int maxNumSegments), also known as partial optimize, reduces the index to at most maxNumSegments segments. Because the final merge down to one segment is the most costly, optimizing to, say, five segments should be quite a bit faster than optimizing down to one segment, letting you trade somewhat slower searches for less optimization time.
- optimize(boolean doWait) is just like optimize, except if doWait is false then the call returns immediately while the necessary merges take place in the background. Note that doWait=false only works for a merge scheduler that runs merges in background threads, such as the default ConcurrentMergeScheduler.
- optimize(int maxNumSegments, boolean doWait) is a partial optimize that runs in the background if doWait is false.
Other directory implementations
Table 2.2. Lucene’s several core Directory implementations
Directory | Description |
---|---|
SimpleFSDirectory | A simplistic Directory that stores files in the file system, using java.io.* APIs. It doesn’t scale well with many threads. |
NIOFSDirectory | A Directory that stores files in the file system, using java.nio.* APIs. This does scale well with threads on all platforms except Microsoft Windows, due to a longstanding issue with Sun’s Java Runtime Environment (JRE). |
MMapDirectory | A Directory that uses memory-mapped I/O to access files. This is a good choice on 64-bit JREs, or on 32-bit JREs where the size of the index is relatively small. |
RAMDirectory | A Directory that stores all files in RAM. |
FileSwitchDirectory | A Directory that takes two directories in, and switches between these directories based on file extension. |
Concurrency, thread safety, and locking issues
Thread and multi-JVM safety
Figure 2.3. A single IndexWriter can be shared by multiple threads.
Accessing an index over a remote file system
Table 2.3. Issues related to accessing a Lucene index across remote file systems
Remote file system | Notes |
---|---|
Samba/CIFS 1.0 | The standard remote file system for Windows computers. Sharing a Lucene index works fine. |
Samba/CIFS 2.0 | The new version of Samba/CIFS that's the default for Windows Server 2008 and Windows Vista. Lucene has trouble due to incoherent client-side caches. |
Network File System (NFS) | The standard remote file system for most Unix OSs. Lucene has trouble due to both incoherent client-side caches and how NFS handles deletion of files that are held open by another computer. |
Apple File Protocol (AFP) | Apple’s standard remote file system protocol. Lucene has trouble due to incoherent client-side caches. |
Index locking
Table 2.4. Locking implementations provided by Lucene
Locking class name | Description |
---|---|
NativeFSLockFactory | This is the default locking for FSDirectory, using java.nio native OS locking, which will never leave leftover lock files when the JVM exits. But this locking implementation may not work correctly over certain shared file systems, notably NFS. |
SimpleFSLockFactory | Uses Java’s File.createNewFile API, which may be more portable across different file systems than NativeFSLockFactory. Be aware that if the JVM crashes or IndexWriter isn’t closed before the JVM exits, this may leave a leftover write.lock file, which you must manually remove. |
SingleInstanceLockFactory | Creates a lock entirely in memory. This is the default locking implementation for RAMDirectory. Use this when you know all IndexWriters will be instantiated in a single JVM. |
NoLockFactory | Disables locking entirely. Be careful! Only use this when you are absolutely certain that Lucene’s normal locking safeguard isn’t necessary—for example, when using a private RAMDirectory with a single IndexWriter instance. |
Listing 2.5. Using file-based locks to enforce a single writer at a time
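The listing itself isn't reproduced; a sketch of the write-lock behavior it demonstrates (the path and analyzer are assumptions):

```java
Directory dir = FSDirectory.open(new File("/tmp/lockindex"));   // illustrative path
IndexWriter writer1 = new IndexWriter(dir, new SimpleAnalyzer(),
                                      IndexWriter.MaxFieldLength.UNLIMITED);
try {
  IndexWriter writer2 = new IndexWriter(dir, new SimpleAnalyzer(),
                                        IndexWriter.MaxFieldLength.UNLIMITED);
  // never reached: the second writer can't obtain write.lock
} catch (LockObtainFailedException expected) {
  // the index is already locked by writer1
}
writer1.close();
```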
Index File Formats
http://lucene.apache.org/core/3_5_0/fileformats.html
Definitions
- An index contains a sequence of documents.
- A document is a sequence of fields.
- A field is a named sequence of terms.
- A term is a string.
Types of Fields
In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner. Fields that are inverted are called indexed. A field may be both stored and indexed.
The text of a field may be tokenized into terms to be indexed, or the text of a field may be used literally as a term to be indexed. Most fields are tokenized, but sometimes it is useful for certain identifier fields to be indexed literally.
Segments
Lucene indexes may be composed of multiple sub-indexes, or segments. Each segment is a fully independent index, which could be searched separately. Indexes evolve by:
- Creating new segments for newly added documents.
- Merging existing segments.
Document Numbers
Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered zero, and each subsequent document added gets a number one greater than the previous.
In particular, numbers may change in the following situations:
- The numbers stored in each segment are unique only within the segment, and must be converted before they can be used in a larger context. The standard technique is to allocate each segment a range of values, based on the range of numbers used in that segment. To convert a document number from a segment to an external value, the segment’s base document number is added. To convert an external value back to a segment-specific value, the segment is identified by the range that the external value is in, and the segment’s base value is subtracted. For example two five document segments might be combined, so that the first segment has a base value of zero, and the second of five. Document three from the second segment would have an external value of eight.
- When documents are deleted, gaps are created in the numbering. These are eventually removed as the index evolves through merging. Deleted documents are dropped when segments are merged. A freshly-merged segment thus has no gaps in its numbering.
Overview
Each segment index maintains the following:
- Field names. The set of field names used in the index.
- Stored field values. For each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.
- Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents that contain each term, and pointers to the term's frequency and proximity data.
- Term frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
- Term proximity data. For each term in the dictionary, the positions at which the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.
- Normalization factors. For each field in each document, a value that is multiplied into the score for hits on that field.
- Term vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency.
- Deleted documents. An optional file indicating which documents are deleted.
File Naming
All files belonging to a segment have the same name with varying extensions. The extensions correspond to the different file formats described below. When using the compound file format (the default in 1.4 and greater) these files are collapsed into a single .cfs file (see below for details).
Typically, all segments in an index are stored in a single directory, although this is not required.
As of version 2.1 (lock-less commits), file names are never reused (with one exception, "segments.gen"; see below). That is, when any file is saved to the Directory it is given a never-before-used filename. This is achieved using a simple generations approach. For example, the first segments file is segments_1, then segments_2, and so on. The generation is a sequential long integer represented in alphanumeric (base 36) form.
Adding search to your application
Table 3.1. Lucene’s primary searching API
Class | Purpose |
---|---|
IndexSearcher | Gateway to searching an index. All searches come through an IndexSearcher instance using any of the several overloaded search methods. |
Query (and subclasses) | Concrete subclasses encapsulate logic for a particular query type. Instances of Query are passed to an IndexSearcher’s search method. |
QueryParser | Processes a human-entered (and readable) expression into a concrete Query object. |
TopDocs | Holds the top scoring documents, returned by IndexSearcher.search. |
ScoreDoc | Provides access to each search result in TopDocs. |
Implementing a simple search feature
Searching for a specific term
Listing 3.1. Simple searching with TermQuery
Parsing a user-entered query expression: QueryParser
Figure 3.1. QueryParser translates a textual expression from the end user into an arbitrarily complex query for searching.
Listing 3.2. QueryParser, which makes it trivial to translate search text into a Query
Table 3.2. Expression examples that QueryParser handles
Query expression | Matches documents that… |
---|---|
java | Contain the term java in the default field |
java junit | Contain the term java or junit, or both, in the default field |
java OR junit | - |
+java +junit | Contain both java and junit in the default field |
java AND junit | - |
title:ant | Contain the term ant in the title field |
title:extreme -subject:sports | Contain extreme in the title field and don't have sports in the subject field |
title:extreme AND NOT subject:sports | - |
(agile OR extreme) AND methodology | Contain methodology and must also contain agile and/or extreme, all in the default field |
title:"junit in action" | Contain the exact phrase "junit in action" in the title field |
title:"junit action"~5 | Contain the terms junit and action within five positions of one another, in the title field |
java* | Contain terms that begin with java, like javaspaces, javaserver, java.net, and the exact term java itself |
java~ | Contain terms that are close to the word java, such as lava |
lastmodified: [1/1/09 TO 12/31/09] | Have lastmodified field values between the dates January 1, 2009 and December 31, 2009 |
Using IndexSearcher
Creating an IndexSearcher
Figure 3.2. The relationship between the common classes used for searching
Performing searches
Table 3.3. Primary IndexSearcher search methods
IndexSearcher.search method signature | When to use |
---|---|
TopDocs search(Query query, int n) | Straightforward searches. The int n parameter specifies how many top-scoring documents to return. |
TopDocs search(Query query, Filter filter, int n) | Searches constrained to a subset of available documents, based on filter criteria. |
TopFieldDocs search(Query query, Filter filter, int n, Sort sort) | Searches constrained to a subset of available documents based on filter criteria, and sorted by a custom Sort object |
void search(Query query, Collector results) | Used when you have custom logic to implement for each document visited, or you’d like to collect a different subset of documents than the top N by the sort criteria. |
void search(Query query, Filter filter, Collector results) | Same as previous, except documents are only accepted if they pass the filter criteria. |
Working with TopDocs
Table 3.4. TopDocs methods for efficiently accessing search results
TopDocs method or attribute | Return value |
---|---|
totalHits | Number of documents that matched the search |
scoreDocs | Array of ScoreDoc instances that contains the results |
getMaxScore() | Returns best score of all matches, if scoring was done while searching (when sorting by field, you separately control whether scores are computed) |
Paging through results
You can choose from a couple of implementation approaches:
- Gather multiple pages’ worth of results on the initial search and keep the resulting ScoreDocs and IndexSearcher instances available while the user is navigating the search results.
- Requery each time the user navigates to a new page.
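A sketch of the second (requery) approach, which is usually simpler and works well because the OS tends to cache the index files; pageSize, pageNum, and the title field are assumptions:

```java
int pageSize = 10;
int pageNum = 3;                                       // 1-based page number requested
TopDocs hits = searcher.search(query, pageNum * pageSize);   // fetch enough to cover the page
int start = (pageNum - 1) * pageSize;
int end = Math.min(hits.scoreDocs.length, pageNum * pageSize);
for (int i = start; i < end; i++) {
  Document doc = searcher.doc(hits.scoreDocs[i].doc);
  System.out.println(doc.get("title"));                // display just this page's slice
}
```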
Near-real-time search
Listing 3.3. Near-real-time search
Understanding Lucene scoring
How Lucene scores
Figure 3.3. Lucene uses this formula to determine a document score based on a query.
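The figure itself isn't reproduced in these notes; the classic Lucene scoring formula it shows, matching the factors listed in Table 3.5, is:

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot \sum_{t \in q}\bigl(\mathrm{tf}(t\ \text{in}\ d)\cdot \mathrm{idf}(t)^2\cdot \mathrm{boost}(t.\text{field in}\ d)\cdot \mathrm{lengthNorm}(t.\text{field in}\ d)\bigr)$$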
Table 3.5. Factors in the scoring formula
Factor | Description |
---|---|
tf(t in d) | Term frequency factor for the term (t) in the document (d)—how many times the term t occurs in the document. |
idf(t) | Inverse document frequency of the term: a measure of how “unique” the term is. Very common terms have a low idf; very rare terms have a high idf. |
boost(t.field in d) | Field and document boost, as set during indexing (see section 2.5). You may use this to statically boost certain fields and certain documents over others. |
lengthNorm(t.field in d) | Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor. |
coord(q, d) | Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents. |
queryNorm(q) | Normalization value for a query, given the sum of the squared weights of each of the query terms. |
Using explain() to understand hit scoring
Listing 3.4. The explain() method
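Listing 3.4 isn't reproduced; the core call it demonstrates is IndexSearcher.explain, roughly like this (query and searcher are assumed to exist):

```java
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc sd : hits.scoreDocs) {
  Explanation explanation = searcher.explain(query, sd.doc);
  System.out.println(searcher.doc(sd.doc).get("title"));
  System.out.println(explanation.toString());   // nested breakdown of tf, idf, boosts, norms
}
```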
Lucene’s diverse queries
Searching by term: TermQuery
public void testKeyword() throws Exception {
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
Term t = new Term("isbn", "9781935182023");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 10);
assertEquals("JUnit in Action, Second Edition",
1, docs.totalHits);
searcher.close();
dir.close();
}
Searching within a term range: TermRangeQuery
public void testTermRangeQuery() throws Exception {
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
TermRangeQuery query = new TermRangeQuery("title2", "d", "j",
true, true);
TopDocs matches = searcher.search(query, 100);
assertEquals(3, matches.totalHits);
searcher.close();
dir.close();
}
Searching within a numeric range: NumericRangeQuery
public void testInclusive() throws Exception {
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
// pub date of TTC was September 2006
NumericRangeQuery query = NumericRangeQuery.newIntRange("pubmonth",
200605,
200609,
true,
true);
TopDocs matches = searcher.search(query, 10);
assertEquals(1, matches.totalHits);
searcher.close();
dir.close();
}
public void testExclusive() throws Exception {
Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
// pub date of TTC was September 2006
NumericRangeQuery query = NumericRangeQuery.newIntRange("pubmonth",
200605,
200609,
false,
false);
TopDocs matches = searcher.search(query, 10);
assertEquals(0, matches.totalHits);
searcher.close();
dir.close();
}
Searching on a string: PrefixQuery
Listing 3.5. PrefixQuery
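The listing is omitted here; the essence of PrefixQuery is just this (the field name and value are assumptions):

```java
Term term = new Term("category", "/technology/computers/programming");
PrefixQuery query = new PrefixQuery(term);   // matches the term itself and any term starting with it
TopDocs matches = searcher.search(query, 10);
```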
Combining queries: BooleanQuery
Listing 3.6. Using BooleanQuery to combine required subqueries
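The listing body is omitted in these notes; a sketch of combining required subqueries with BooleanQuery (the field names follow the book's sample index but are assumptions here):

```java
TermQuery searchingBooks = new TermQuery(new Term("subject", "search"));
Query books2010 = NumericRangeQuery.newIntRange("pubmonth", 201001, 201012, true, true);
BooleanQuery searchingBooks2010 = new BooleanQuery();
searchingBooks2010.add(searchingBooks, BooleanClause.Occur.MUST);   // both clauses required
searchingBooks2010.add(books2010, BooleanClause.Occur.MUST);
TopDocs matches = searcher.search(searchingBooks2010, 10);
```

The helper below (used by the book's tests as TestUtil.hitsIncludeTitle) checks whether a given title appears among the hits: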
public static boolean hitsIncludeTitle(IndexSearcher searcher,
TopDocs hits, String title)
throws IOException {
for (ScoreDoc match : hits.scoreDocs) {
Document doc = searcher.doc(match.doc);
if (title.equals(doc.get("title"))) {
return true;
}
}
System.out.println("title '" + title + "' not found");
return false;
}
Listing 3.7. Using BooleanQuery to combine optional subqueries.
Searching by phrase: PhraseQuery
Listing 3.8. PhraseQuery
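Again only the caption is kept; the core of a PhraseQuery looks like this (field name, terms, and slop are assumptions):

```java
PhraseQuery query = new PhraseQuery();
query.setSlop(1);                        // allow one intervening position between the terms
query.add(new Term("contents", "quick"));
query.add(new Term("contents", "fox"));
TopDocs matches = searcher.search(query, 10);
```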
Searching by wildcard: WildcardQuery
Listing 3.9. WildcardQuery
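The listing is omitted; a minimal WildcardQuery sketch, where '*' matches any number of characters and '?' exactly one (the pattern is an assumption):

```java
Query query = new WildcardQuery(new Term("contents", "?ild*"));   // matches wild, mild, wildcard, ...
TopDocs matches = searcher.search(query, 10);
```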
Searching for similar terms: FuzzyQuery
public void testFuzzy() throws Exception {
indexSingleFieldDocs(new Field[] { new Field("contents",
"fuzzy",
Field.Store.YES,
Field.Index.ANALYZED),
new Field("contents",
"wuzzy",
Field.Store.YES,
Field.Index.ANALYZED)
});
IndexSearcher searcher = new IndexSearcher(directory);
Query query = new FuzzyQuery(new Term("contents", "wuzza"));
TopDocs matches = searcher.search(query, 10);
assertEquals("both close enough", 2, matches.totalHits);
assertTrue("wuzzy closer than fuzzy",
matches.scoreDocs[0].score != matches.scoreDocs[1].score);
Document doc = searcher.doc(matches.scoreDocs[0].doc);
assertEquals("wuzza bear", "wuzzy", doc.get("contents"));
searcher.close();
}
Matching all documents: MatchAllDocsQuery
Query query = new MatchAllDocsQuery(field);
Parsing query expressions: QueryParser
Query.toString
public void testToString() throws Exception {
BooleanQuery query = new BooleanQuery();
query.add(new FuzzyQuery(new Term("field", "kountry")),
BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("title", "western")),
BooleanClause.Occur.SHOULD);
assertEquals("both kinds", "+kountry~0.5 title:western",
query.toString("field"));
}
TermQuery
public void testTermQuery() throws Exception {
QueryParser parser = new QueryParser(Version.LUCENE_30,
"subject", analyzer);
Query query = parser.parse("computers");
System.out.println("term: " + query);
}
produces this output:
term: subject:computers
Term range searches
Listing 3.10. Creating a TermRangeQuery using QueryParser
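Listing 3.10 isn't reproduced; the gist is that QueryParser turns bracketed ranges into a TermRangeQuery (the field and bounds are assumptions):

```java
QueryParser parser = new QueryParser(Version.LUCENE_30, "subject", analyzer);
Query query = parser.parse("title2:[Q TO V]");   // square brackets: inclusive range
// parser.parse("title2:{Q TO V}") would produce an exclusive range instead
TopDocs matches = searcher.search(query, 10);
```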
Numeric and date range searches
QueryParser does include certain built-in logic for parsing dates when they appear as part of a range query, but the logic doesn’t work when you’ve indexed your dates using NumericField.
Prefix and wildcard queries
public void testLowercasing() throws Exception {
Query q = new QueryParser(Version.LUCENE_30,
"field", analyzer).parse("PrefixQuery*");
assertEquals("lowercased",
"prefixquery*", q.toString("field"));
QueryParser qp = new QueryParser(Version.LUCENE_30,
"field", analyzer);
qp.setLowercaseExpandedTerms(false);
q = qp.parse("PrefixQuery*");
assertEquals("not lowercased",
"PrefixQuery*", q.toString("field"));
}
Boolean operators
QueryParser parser = new QueryParser(Version.LUCENE_30,
"contents", analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Table 3.6. Boolean query operator shortcuts
Verbose syntax | Shortcut syntax |
---|---|
a AND b | +a +b |
a OR b | a b |
a AND NOT b | +a -b |
Phrase queries
public void testPhraseQuery() throws Exception {
Query q = new QueryParser(Version.LUCENE_30,
"field",
new StandardAnalyzer(
Version.LUCENE_30))
.parse("\"This is Some Phrase*\"");
assertEquals("analyzed",
"\"? ? some phrase\"", q.toString("field"));
q = new QueryParser(Version.LUCENE_30,
"field", analyzer)
.parse("\"term\"");
assertTrue("reduced to TermQuery", q instanceof TermQuery);
}
public void testSlop() throws Exception {
Query q = new QueryParser(Version.LUCENE_30,
"field", analyzer)
.parse("\"exact phrase\"");
assertEquals("zero slop",
"\"exact phrase\"", q.toString("field"));
QueryParser qp = new QueryParser(Version.LUCENE_30,
"field", analyzer);
qp.setPhraseSlop(5);
q = qp.parse("\"sloppy phrase\"");
assertEquals("sloppy, implicitly",
"\"sloppy phrase\"~5", q.toString("field"));
}
Fuzzy queries
public void testFuzzyQuery() throws Exception {
QueryParser parser = new QueryParser(Version.LUCENE_30,
"subject", analyzer);
Query query = parser.parse("kountry~");
System.out.println("fuzzy: " + query);
query = parser.parse("kountry~0.7");
System.out.println("fuzzy 2: " + query);
}
This produces the following output:
fuzzy: subject:kountry~0.5
fuzzy 2: subject:kountry~0.7
MatchAllDocsQuery
QueryParser produces the MatchAllDocsQuery when you enter *:*
.
Grouping
public void testGrouping() throws Exception {
Query query = new QueryParser(
Version.LUCENE_30,
"subject",
analyzer).parse("(agile OR extreme) AND methodology");
TopDocs matches = searcher.search(query, 10);
assertTrue(TestUtil.hitsIncludeTitle(searcher, matches,
"Extreme Programming Explained"));
assertTrue(TestUtil.hitsIncludeTitle(searcher,
matches,
"The Pragmatic Programmer"));
}
Figure 3.7. A Query can have an arbitrary nested structure, easily expressed with QueryParser's grouping. This query is achieved by parsing the expression (+"brown fox" +quick) "red dog".
Setting the boost for a subquery
A caret (^) followed by a floating-point number sets the boost factor for the preceding query. For example, the query expression junit^2.0 testing sets the junit TermQuery to a boost of 2.0 and leaves the testing TermQuery at the default boost of 1.0.
Lucene’s analysis process
Using analyzers
Analyzing "The quick brown fox jumped over the lazy dog"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
In the meantime, here’s a summary of each of these analyzers:
- WhitespaceAnalyzer, as the name implies, splits text into tokens on whitespace characters and makes no other effort to normalize the tokens. It doesn't lowercase each token.
- SimpleAnalyzer first splits tokens at nonletter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters but keeps all other characters.
- StopAnalyzer is the same as SimpleAnalyzer, except it removes common words. By default, it removes common words specific to the English language (the, a, etc.), though you can pass in your own set.
- StandardAnalyzer is Lucene's most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names, email addresses, and hostnames. It also lowercases each token and removes stop words and punctuation.
Indexing analysis
Figure 4.1. Analysis process during indexing. Fields 1 and 2 are analyzed, producing a sequence of tokens; Field 3 is unanalyzed, causing its entire value to be indexed as a single token.
QueryParser analysis
QueryParser parser = new QueryParser(Version.LUCENE_30,
"contents", analyzer);
Query query = parser.parse(expression);
Parsing vs. analysis: when an analyzer isn’t appropriate
- Analyzers are used to analyze a specific field at a time and break things into tokens only within that field.
- Analyzers don’t help in field separation because their scope is to deal with a single field at a time. Instead, parsing these documents prior to analysis is required.
What’s inside an analyzer
public final class SimpleAnalyzer extends Analyzer {
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
return new LowerCaseTokenizer(reader);
}
@Override
public TokenStream reusableTokenStream(String fieldName, Reader reader)
throws IOException {
Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
if (tokenizer == null) {
tokenizer = new LowerCaseTokenizer(reader);
setPreviousTokenStream(tokenizer);
} else
tokenizer.reset(reader);
return tokenizer;
}
}
What’s in a token?
Figure 4.2. A token stream with positional and offset information
TokenStream uncensored
Figure 4.3. The hierarchy of classes used to produce tokens: TokenStream is the abstract base class; Tokenizer creates tokens from a Reader; and TokenFilter filters any other TokenStream.
Figure 4.4. An analyzer chain starts with a Tokenizer, to produce initial tokens from the characters read from a Reader, then modifies the tokens with any number of chained TokenFilters.
Figure 4.5. TokenFilter and Tokenizer class hierarchy
To illustrate the analyzer chain in code, here’s a simple example analyzer:
public TokenStream tokenStream(String fieldName, Reader reader) {
return new StopFilter(true,
new LowerCaseTokenizer(reader),
stopWords);
}
Visualizing analyzers
Listing 4.1. AnalyzerDemo: seeing analysis in action
Listing 4.2. AnalyzerUtils: delving into an analyzer
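The listings themselves aren't copied here; the heart of AnalyzerUtils is a small loop over the TokenStream's attributes, roughly like this sketch using the 3.0 attribute API:

```java
public static void displayTokens(Analyzer analyzer, String text) throws IOException {
  TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
  TermAttribute term = stream.addAttribute(TermAttribute.class);
  while (stream.incrementToken()) {
    System.out.print("[" + term.term() + "] ");   // print each token in brackets
  }
  System.out.println();
}
```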
The AnalyzerDemo application lets you specify one or more strings from the command line to be analyzed instead of the embedded example ones:
%java lia.analysis.AnalyzerDemo "No Fluff, Just Stuff"
Analyzing "No Fluff, Just Stuff"
org.apache.lucene.analysis.WhitespaceAnalyzer:
[No] [Fluff,] [Just] [Stuff]
org.apache.lucene.analysis.SimpleAnalyzer:
[no] [fluff] [just] [stuff]
org.apache.lucene.analysis.StopAnalyzer:
[fluff] [just] [stuff]
org.apache.lucene.analysis.standard.StandardAnalyzer:
[fluff] [just] [stuff]
Listing 4.3. Seeing the term, offsets, type, and position increment of each token
We display all token information on the example phrase using SimpleAnalyzer:
public static void main(String[] args) throws IOException {
AnalyzerUtils.displayTokensWithFullDetails(new SimpleAnalyzer(),
"The quick brown fox....");
}
Here’s the output:
1: [the:0->3:word]
2: [quick:4->9:word]
3: [brown:10->15:word]
4: [fox:16->19:word]
Analyzing the phrase “I’ll email you at xyz@example.com” with StandardAnalyzer
produces this interesting output:
1: [i'll:0->4:<APOSTROPHE>]
2: [email:5->10:<ALPHANUM>]
3: [you:11->14:<ALPHANUM>]
5: [xyz@example.com:18->33:<EMAIL>]
Using the built-in analyzers
Table 4.3. Primary analyzers available in Lucene
Analyzer | Steps taken |
---|---|
WhitespaceAnalyzer | Splits tokens at whitespace. |
SimpleAnalyzer | Divides text at nonletter characters and lowercases. |
StopAnalyzer | Divides text at nonletter characters, lowercases, and removes stop words. |
KeywordAnalyzer | Treats entire text as a single token. |
StandardAnalyzer | Tokenizes based on a sophisticated grammar that recognizes email addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more. It also lowercases and removes stop words. |
Sounds-like querying
We chose the Metaphone algorithm as an example, but other algorithms are available, such as Soundex.
Listing 4.4. Searching for words that sound like one another
The trick lies in the MetaphoneReplacementAnalyzer:
public class MetaphoneReplacementAnalyzer extends Analyzer {
public TokenStream tokenStream(String fieldName, Reader reader) {
return new MetaphoneReplacementFilter(
new LetterTokenizer(reader));
}
}
Listing 4.5. TokenFilter that replaces tokens with their metaphone equivalents
Using our AnalyzerUtils, two phrases that sound similar yet are spelled differently are tokenized and displayed:
public static void main(String[] args) throws IOException {
MetaphoneReplacementAnalyzer analyzer =
new MetaphoneReplacementAnalyzer();
AnalyzerUtils.displayTokens(analyzer,
"The quick brown fox jumped over the lazy dog");
System.out.println("");
AnalyzerUtils.displayTokens(analyzer,
"Tha quik brown phox jumpd ovvar tha lazi dag");
}
Both phrases produce identical metaphone encodings, shown here:
[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]
[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]
Synonyms, aliases, and words that mean the same
Listing 4.6. Testing the synonym analyzer
Creating SynonymAnalyzer
Listing 4.7. SynonymAnalyzer implementation
public class SynonymAnalyzer extends Analyzer {
private SynonymEngine engine;
public SynonymAnalyzer(SynonymEngine engine) {
this.engine = engine;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new SynonymFilter(
new StopFilter(true,
new LowerCaseFilter(
new StandardFilter(
new StandardTokenizer(
Version.LUCENE_30, reader))),
StopAnalyzer.ENGLISH_STOP_WORDS_SET),
engine
);
return result;
}
}
Listing 4.8. SynonymFilter: buffering tokens and emitting one at a time
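The full listing is omitted in these notes; the core idea, sketched with the 3.0 attribute API (details such as token types are simplified), is to push each token's synonyms onto a stack and emit them with a position increment of zero, using the SynonymEngine interface shown just below:

```java
public class SynonymFilter extends TokenFilter {
  private final Stack<String> synonymStack = new Stack<String>();
  private final SynonymEngine engine;
  private AttributeSource.State current;
  private final TermAttribute termAttr;
  private final PositionIncrementAttribute posIncrAttr;

  public SynonymFilter(TokenStream in, SynonymEngine engine) {
    super(in);
    this.engine = engine;
    this.termAttr = addAttribute(TermAttribute.class);
    this.posIncrAttr = addAttribute(PositionIncrementAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!synonymStack.isEmpty()) {           // emit a buffered synonym
      String syn = synonymStack.pop();
      restoreState(current);                 // copy the original token's attributes
      termAttr.setTermBuffer(syn);
      posIncrAttr.setPositionIncrement(0);   // same position as the original token
      return true;
    }
    if (!input.incrementToken()) return false;
    String[] synonyms = engine.getSynonyms(termAttr.term());
    if (synonyms != null) {
      for (String syn : synonyms) synonymStack.push(syn);
      current = captureState();              // remember the original token's attributes
    }
    return true;
  }
}
```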
The design of SynonymAnalyzer allows for pluggable SynonymEngine implementations. SynonymEngine is a one-method interface:
public interface SynonymEngine {
String[] getSynonyms(String s) throws IOException;
}
public class TestSynonymEngine implements SynonymEngine {
private static HashMap<String, String[]> map =
new HashMap<String, String[]>();
static {
map.put("quick", new String[] {"fast", "speedy"});
map.put("jumps", new String[] {"leaps", "hops"});
map.put("over", new String[] {"above"});
map.put("lazy", new String[] {"apathetic", "sluggish"});
map.put("dog", new String[] {"canine", "pooch"});
}
public String[] getSynonyms(String s) {
return map.get(s);
}
}
Listing 4.9. SynonymAnalyzerTest: showing that synonym queries work
Listing 4.10. Testing SynonymAnalyzer with QueryParser
The test produces the following output:
With SynonymAnalyzer, "fox jumps" parses to "fox (jumps hops leaps)"
With StandardAnalyzer, "fox jumps" parses to "fox jumps"
Visualizing token positions
Listing 4.11. Visualizing the position increment of each token
public static void displayTokensWithPositions
(Analyzer analyzer, String text) throws IOException {
TokenStream stream = analyzer.tokenStream("contents",
new StringReader(text));
TermAttribute term = stream.addAttribute(TermAttribute.class);
PositionIncrementAttribute posIncr =
stream.addAttribute(PositionIncrementAttribute.class);
int position = 0;
while(stream.incrementToken()) {
int increment = posIncr.getPositionIncrement();
if (increment > 0) {
position = position + increment;
System.out.println();
System.out.print(position + ": ");
}
System.out.print("[" + term.term() + "] ");
}
System.out.println();
}
We wrote a quick piece of code to see what our SynonymAnalyzer is doing:
public class SynonymAnalyzerViewer {
public static void main(String[] args) throws IOException {
SynonymEngine engine = new TestSynonymEngine();
AnalyzerUtils.displayTokensWithPositions(
new SynonymAnalyzer(engine),
"The quick brown fox jumps over the lazy dog");
}
}
And we can now visualize the synonyms placed in the same positions as the original words:
2: [quick] [speedy] [fast]
3: [brown]
4: [fox]
5: [jumps] [hops] [leaps]
6: [over] [above]
8: [lazy] [sluggish] [apathetic]
9: [dog] [pooch] [canine]
Stemming analysis
StopFilter leaves holes
This is illustrated by the output of AnalyzerUtils.displayTokensWithPositions:
2: [quick]
3: [brown]
4: [fox]
5: [jump]
6: [over]
8: [lazi]
9: [dog]
Combining stemming and stop-word removal
Listing 4.12. PositionalPorterStopAnalyzer: stemming and stop word removal
public class PositionalPorterStopAnalyzer extends Analyzer {
private Set stopWords;
public PositionalPorterStopAnalyzer() {
this(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
}
public PositionalPorterStopAnalyzer(Set stopWords) {
this.stopWords = stopWords;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
StopFilter stopFilter = new StopFilter(true,
new LowerCaseTokenizer(reader),
stopWords);
stopFilter.setEnablePositionIncrements(true);
return new PorterStemFilter(stopFilter);
}
}