org.apache.lucene.index.memory
public class PatternAnalyzer extends Analyzer
If you are unsure what a regular expression should look like, consider prototyping by simply trying various expressions on some test texts via String#split(String). Once you are satisfied, give that regex to PatternAnalyzer. Also see the Java Regular Expression Tutorial.
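The prototyping step suggested above needs nothing beyond the JDK; a candidate regex can be tried directly with String#split(String) before handing it to PatternAnalyzer:

```java
import java.util.Arrays;

public class SplitPrototype {
    public static void main(String[] args) {
        String text = "James is running round in the woods";
        // Candidate regex: split at runs of non-word characters.
        String[] tokens = text.split("\\W+");
        System.out.println(Arrays.toString(tokens));
        // Prints: [James, is, running, round, in, the, woods]
    }
}
```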
This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene TokenFilter chain, for example in this stemming example:

    PatternAnalyzer pat = ...
    TokenStream tokenStream = new SnowballFilter(
        pat.tokenStream("content", "James is running round in the woods"),
        "English");
Field Summary

| Field | Description |
|---|---|
| static PatternAnalyzer DEFAULT_ANALYZER | A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader. |
| static PatternAnalyzer EXTENDED_ANALYZER | A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. |
| static Pattern NON_WORD_PATTERN ("\\W+") | Divides text at non-letters (NOT Character.isLetter(c)). |
| static Pattern WHITESPACE_PATTERN ("\\s+") | Divides text at whitespaces (Character.isWhitespace(c)). |
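The difference between the two predefined patterns shows up on punctuated input; a short sketch using only java.util.regex (the same regexes, without PatternAnalyzer itself):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PatternDemo {
    public static void main(String[] args) {
        Pattern nonWord = Pattern.compile("\\W+");      // like NON_WORD_PATTERN
        Pattern whitespace = Pattern.compile("\\s+");   // like WHITESPACE_PATTERN
        String text = "state-of-the-art search";
        // NON_WORD_PATTERN also splits on hyphens and other punctuation...
        System.out.println(Arrays.toString(nonWord.split(text)));
        // [state, of, the, art, search]
        // ...while WHITESPACE_PATTERN keeps "state-of-the-art" as one token.
        System.out.println(Arrays.toString(whitespace.split(text)));
        // [state-of-the-art, search]
    }
}
```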
Constructor Summary

| Constructor | Description |
|---|---|
| PatternAnalyzer(Pattern pattern, boolean toLowerCase, Set stopWords) | Constructs a new instance with the given parameters. |
Method Summary

| Method | Description |
|---|---|
| boolean equals(Object other) | Indicates whether some other object is "equal to" this one. |
| int hashCode() | Returns a hash code value for the object. |
| TokenStream tokenStream(String fieldName, String text) | Creates a token stream that tokenizes the given string into token terms (aka words). |
| TokenStream tokenStream(String fieldName, Reader reader) | Creates a token stream that tokenizes all the text in the given Reader; this implementation forwards to tokenStream(String, String) and is less efficient than tokenStream(String, String). |
PatternAnalyzer(Pattern pattern, boolean toLowerCase, Set stopWords)
Parameters:
pattern - a regular expression delimiting tokens
toLowerCase - if true, returns tokens after applying String.toLowerCase()
stopWords - if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via StopFilter.makeStopSet(String[]) and/or WordlistLoader as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")) or other stop word lists.
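The interplay of the three constructor parameters can be approximated in plain Java; this is a sketch of the analyzer's per-token semantics (split, optionally lower-case, drop stop words), not the PatternAnalyzer implementation itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class AnalyzerSketch {
    // Approximation of PatternAnalyzer's tokenization: split on the
    // pattern, optionally lower-case each token, then filter stop words.
    static List<String> analyze(Pattern pattern, boolean toLowerCase,
                                Set<String> stopWords, String text) {
        List<String> out = new ArrayList<>();
        for (String token : pattern.split(text)) {
            if (token.isEmpty()) continue;            // skip empty leading split
            if (toLowerCase) token = token.toLowerCase();
            if (stopWords != null && stopWords.contains(token)) continue;
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("is", "in", "the"));
        System.out.println(analyze(Pattern.compile("\\W+"), true, stops,
                "James is running round in the woods"));
        // [james, running, round, woods]
    }
}
```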
equals(Object other)
Parameters: other - the reference object with which to compare.
Returns: true if equal, false otherwise

hashCode()
Returns: the hash code.
tokenStream(String fieldName, String text)
Parameters:
fieldName - the name of the field to tokenize (currently ignored)
text - the string to tokenize
Returns: a new token stream
tokenStream(String fieldName, Reader reader)
Creates a token stream that tokenizes all the text in the given Reader; this implementation forwards to tokenStream(String, String) and is less efficient than tokenStream(String, String).
Parameters:
fieldName - the name of the field to tokenize (currently ignored)
reader - the reader delivering the text
Returns: a new token stream
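The inefficiency noted above comes from draining the Reader into a String before delegating. A minimal sketch of that forwarding step, using only the JDK (the helper name is illustrative, not part of the Lucene API):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderToString {
    // Drain a Reader fully into a String, as a forwarding
    // tokenStream(String, Reader) implementation must do before
    // delegating to tokenStream(String, String).
    static String drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(drain(new StringReader("hello reader")));
        // prints: hello reader
    }
}
```

Buffering the entire text is what makes this path less efficient than passing a String directly: the content is materialized in memory a second time before tokenization even begins.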