
[Java Tutorial] Building the Ansj Plugin for Solr


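Ansj is a fairly good Chinese word segmentation component; its details are outside the scope of this article. The ansj author provides support for the Lucene interface in the official code. To use it under Solr, a small extension is still needed.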

1. Managing the project with Maven

ansj is developed and managed with Maven, so the first step is to modify its pom.

  

<project 

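In the pom, the dependency setting <scope>provided</scope> means the dependency is used only at compile time: it is on the classpath while the code is built, but it is not packaged with the artifact, because the runtime environment (here, Solr) is expected to supply it. Once the dependencies are sorted out, write a TokenizerFactory class that can be referenced in the Solr configuration. The code is as follows: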

package org.ansj.solr;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.ansj.lucene.util.AnsjTokenizer;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class AnsjTokenizerFactory extends TokenizerFactory {
  boolean pstemming;
  boolean isQuery;
  private String stopwordsDir;
  public Set<String> filter;

  public AnsjTokenizerFactory(Map<String, String> args) {
    super(args);
    assureMatchVersion();
    isQuery = getBoolean(args, "isQuery", true);
    pstemming = getBoolean(args, "pstemming", false);
    stopwordsDir = get(args, "words");
    addStopwords(stopwordsDir);
  }

  // Load the stop-word list (one word per line, UTF-8) into the filter set
  private void addStopwords(String dir) {
    if (dir == null) {
      System.out.println("no stopwords dir");
      return;
    }
    System.out.println("stopwords: " + dir);
    filter = new HashSet<String>();
    File file = new File(dir);
    BufferedReader br = null;
    try {
      br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
      String word = br.readLine();
      while (word != null) {
        filter.add(word);
        word = br.readLine();
      }
    } catch (FileNotFoundException e) {
      System.out.println("No stopword file found");
    } catch (IOException e) {
      System.out.println("stopword file io exception");
    } finally {
      // Close the reader so the file handle is not leaked
      if (br != null) {
        try {
          br.close();
        } catch (IOException ignored) {
          // nothing useful to do if close fails
        }
      }
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory, Reader input) {
    if (isQuery) {
      // query: coarser segmentation
      return new AnsjTokenizer(new ToAnalysis(new BufferedReader(input)), input, filter, pstemming);
    } else {
      // index: finer segmentation
      return new AnsjTokenizer(new IndexAnalysis(new BufferedReader(input)), input, filter, pstemming);
    }
  }
}

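The pstemming parameter is required by ansj; the factory passes it straight through to AnsjTokenizer, and it defaults to false.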

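The isQuery parameter tells the factory whether it is being used for querying or for indexing. In search, index-time segmentation is generally finer-grained (IndexAnalysis), while query-time segmentation is coarser (ToAnalysis). It defaults to true.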
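The stop-word loading done by addStopwords above can also be sketched on its own, which makes it easy to try outside Solr. The following is only a minimal sketch under some assumptions: the class name StopwordLoader and the method load are hypothetical and are not part of ansj or Solr, and it additionally trims each line and skips blank ones, which the factory above does not do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of ansj or Solr.
final class StopwordLoader {

    // Reads one stop word per line from a UTF-8 file.
    // Returns an empty set when no path is configured (path == null).
    static Set<String> load(String path) throws IOException {
        Set<String> words = new HashSet<>();
        if (path == null) {
            return words;
        }
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) { // assumption: blank lines are not stop words
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Optional first argument: path to the stop-word file
        String path = args.length > 0 ? args[0] : null;
        System.out.println("loaded " + load(path).size() + " stop words");
    }
}
```

Returning an empty set instead of leaving the field null is a design choice of this sketch: callers never need a null check. Whether AnsjTokenizer accepts a null filter is not stated in the article, so the factory above is left with its original behavior.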

2. Building the jar

The code structure is as follows:

   

Write the Maven build command: mvn install -DskipTests=true   # skips running the unit tests (the test sources are still compiled)

  

Run the build:

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building ansj_lucene4_plug 2.0.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ ansj_lucene4_plug ---
[INFO] Deleting R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target
[INFO]
[INFO] --- maven-resources-plugin:2.4.3:resources (default-resources) @ ansj_lucene4_plug ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\src\main\resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ ansj_lucene4_plug ---
[INFO] Compiling 5 source files to R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target\classes
[INFO]
[INFO] --- maven-resources-plugin:2.4.3:testResources (default-testResources) @ ansj_lucene4_plug ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\src\test\resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ ansj_lucene4_plug ---
[INFO] Compiling 3 source files to R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target\test-classes
[INFO]
[INFO] --- maven-surefire-plugin:2.7.1:test (default-test) @ ansj_lucene4_plug ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.3.1:jar (default-jar) @ ansj_lucene4_plug ---
[INFO] Building jar: R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target\ansj_lucene4_plug-2.0.2.jar
[INFO]
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ ansj_lucene4_plug ---
[INFO] Installing R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target\ansj_lucene4_plug-2.0.2.jar to C:\Users\GCZX-016\.m2\repository\org\ansj\ansj_lucene4_plug\2.0.2\ansj_lucene4_plug-2.0.2.jar
[INFO] Installing R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\pom.