Lucene

Lucene is a powerful text search library, available from the fine people at the Apache Software Foundation.

It’s not a fully-featured web search engine. In particular, it doesn’t know how to extract text from most common file formats, and it doesn’t know how to crawl the web. More specifically, Lucene is the part that builds a text index and then finds text in it quickly, even when that index is giant.
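To make the idea of a text index a little more concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's code or its API: the class name TinyInvertedIndex and its add and search methods are made up for illustration, and a real engine adds analysis, ranking, and on-disk storage on top of this. The point is only that a lookup goes from term to matching documents without scanning the documents themselves.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): maps each term to the ids of documents containing it. */
public class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Splits the text into lowercase terms and records docId under each term. */
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents that contain every term in the query (AND semantics). */
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its posting list
            } else {
                result.retainAll(docs);         // later terms: keep only documents seen so far
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a text search library");
        index.add(2, "Nutch builds a web search engine on Lucene");
        index.add(3, "POI extracts text from Office files");
        System.out.println(index.search("lucene search")); // prints [1, 2]
        System.out.println(index.search("text"));          // prints [1, 3]
    }
}
```

Each query term costs one hash lookup plus a set intersection, which is why the search stays fast as the number of documents grows.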

Ironically, if you’re trying to find something on the Lucene site, the site search uses Google.

Nutch

A newer project called Nutch is being built on top of Lucene, and it provides the rest of the features you need to create a web search engine.

Extractors

New Extractors with POI

The current POI development version (labelled alpha 2) contains a scratchpad jar with a few extractors that are extremely useful for getting the text out of MS Office files.

For example, the WordExtractor makes it possible to get all the text from a Word document like this:

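import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Open the Word file, print all of its text, and always close the stream.
InputStream input = new FileInputStream("myDocument.doc");
try {
  System.out.println(new WordExtractor(input).getText());
} finally {
  input.close();
}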

The only bad thing I can say about that is that it may be a while before a full release is available.
