Parse/tokenize a string with Lucene 7.3.0

The goal is to tokenize a string with Lucene through a call like this:

    List<String> terms = parseText("This is a test");

Add the following Maven dependency:

    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>7.3.0</version>
    </dependency>

Then add the imports:

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

And the code, placed inside any class:

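    // Analyzers are thread-safe, so create one instance and reuse it.
    // Note: in Lucene 7.x the no-argument StandardAnalyzer lowercases terms
    // and drops common English stop words such as "this", "is" and "a".
    private static final Analyzer analyzer = new StandardAnalyzer();

    public static List<String> parseText(String text) {
        if (text == null) {
            return new ArrayList<>();
        }

        List<String> result = new ArrayList<>();

        // try-with-resources closes the stream so the analyzer can reuse it.
        // The field name is arbitrary here because nothing is being indexed.
        try (TokenStream stream = analyzer.tokenStream("nofield", new StringReader(text))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // required before the first incrementToken()

            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end(); // finish end-of-stream bookkeeping
        } catch (IOException e) {
            // Not expected for an in-memory reader; surface it instead of hiding it.
            throw new RuntimeException("Tokenizing failed", e);
        }

        return result;
    }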
 
