Competition between Google and the other major search sites has led us to expect relevant results for our searches, along with a variety of associated tools that highlight contextual information and support burrowing into the endless collection of documents available on the web.
Shifting the view a little closer to home, searching across an organization’s many sites and information stores is often a painful chore in comparison. Searching is typically limited to a single system at a time and often produces nothing more than a plain list of results.
Because an enterprise tends to grow incrementally, with old systems and data preserved while only new data goes into their replacements, the ability to search across many systems at once becomes critical.
We faced a similar problem: one of our longer-running projects had to contend with documents held across four wikis and web sites, two issue-tracking systems, and many other build, code, and analysis systems. Searches become painful and so are often undertaken only as a last resort, which is far removed from the ease of searching online.
There are a number of commercial systems that support enterprise-wide searching, but they are often expensive. There is also some hesitation over whether the “Google Magic” applies to such a small corpus of documents with very low linkage between them, and, if not, whether these systems provide enough benefit over the well-known core indexing and searching capabilities now found in many open-source implementations such as Lucene.
Another factor is that organizations are in a position to know much more about what is important to the way they work and about the specific characteristics of their documents and systems. With this in mind, and recognizing that there were technologies here that might be of value to our custom development efforts, we ended up assembling an internal search site based on the open-source Solr, Lucene, and Nutch tools, with a custom front-end for users to access the search engine.
The heart of the system is an Apache Lucene instance that contains the indexes for all the documents and executes searches against them. The integration point is the Apache Solr search service that wraps Lucene and provides an HTTP API for uploading new documents to be indexed and also for triggering search queries.
Documents from across the domain are found and parsed using the Nutch web crawler and then submitted to Solr for inclusion in the Lucene index. Users interact with the system through a custom website implemented in Wicket that has been modelled after a basic form of Google with some extensions specific to the interests of our company.
Indexing Documents with Nutch
The Solr search service with its API provides a number of ways to submit data for indexing: directly using the HTTP API in XML or JSON format, using the Java client wrapper around the API, uploading CSV, and importing from a relational database using the DataImportHandler.
There also seem to be more specific integrations for a number of other libraries and tools, such as Spring and Liferay. Beyond finding a document, it also has to be broken down into a sequence of terms that can then be indexed. One common library for this seems to be Apache Tika, which can analyse HTML, Microsoft Office, and many other document formats.
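As a minimal sketch of the XML route, the update handler accepts documents wrapped in an `<add><doc>` envelope of named fields; the builder class below and its field names are illustrative, not part of Solr itself:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds the XML body for Solr's /update handler. The class and field
// names here are illustrative; only the <add><doc><field> envelope is
// Solr's own format.
public class SolrXmlBuilder {

    // Minimal XML escaping for field values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps one document's fields in the envelope the update handler expects.
    public static String buildAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("<field name=\"").append(f.getKey()).append("\">")
               .append(escape(f.getValue())).append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://wiki.example.com/page1");
        doc.put("title", "Release & Deployment Notes");
        doc.put("content", "How we cut a release.");
        // POST this string to the Solr server's /update path, then send
        // a <commit/> to make the document searchable.
        System.out.println(buildAddXml(doc));
    }
}
```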
Considering the number of systems we wanted to integrate with, about a dozen initially, we started with the web-crawling approach using the open-source Nutch tool. Conceptually, this works in an intranet setting by defining a set of seed URLs to start the crawl from, along with the boundary of the internal network past which it will not search. The documents at each of the URLs are downloaded, parsed, and uploaded to Solr for indexing by Lucene; then any URLs referenced from the source documents are processed, and so on until all of the reachable documents on the network have been processed. The crawl can then be left running and, based on configured refresh rates, will pick up new or updated documents over time.
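As a sketch, the seed URLs and the intranet boundary are expressed in Nutch’s seed list and regex URL filter files; the host names and patterns here are illustrative, and exact file locations vary by Nutch version:

```
# urls/seed.txt - starting points for the crawl
http://wiki.example.com/
http://bugzilla.example.com/

# conf/regex-urlfilter.txt - stay inside the intranet
+^http://([a-z0-9-]+\.)*example\.com/
# skip everything else
-.
```

A crawl is then kicked off from the Nutch command line (the exact invocation differs between versions) and pushes the parsed documents to Solr for indexing.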
Although this approach seems a very inviting way to get lots of documents into the system quickly, in practice it turned out to be a long slog. Many of the sites on an internal network aren’t by nature suitable for web crawling: Bugzilla, for example, refuses robots by default, and form-based authentication can require custom coding to surmount.
Achieving a consistent level of document and meta-information from crawled documents, enough to support date filtering or accurate search results, can be difficult. For many sites we therefore filter out navigation and time-sensitive sections of the HTML and calculate a relevant last-modified date based on knowledge of the application.
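A sketch of that pre-index clean-up, assuming hypothetical comment markers and a hypothetical date stamp in the page body; each real site needed its own rules:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative page clean-up before indexing. The <!--nav--> markers and
// the "Last changed:" stamp are assumptions for this sketch, not a
// standard convention.
public class PageCleaner {

    // Drops everything between <!--nav--> ... <!--/nav--> markers so that
    // repeated navigation text does not pollute the index.
    public static String stripNavigation(String html) {
        return html.replaceAll("(?s)<!--nav-->.*?<!--/nav-->", "");
    }

    // Pulls an application-specific "Last changed: yyyy-MM-dd" stamp out of
    // the page body to use as the document's last-modified date.
    public static String lastModified(String html) {
        Matcher m = Pattern.compile("Last changed: (\\d{4}-\\d{2}-\\d{2})").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<!--nav-->Home | Search<!--/nav-->"
                + "<p>Build 42 notes. Last changed: 2010-06-01</p>";
        System.out.println(stripNavigation(page)); // navigation removed
        System.out.println(lastModified(page));    // 2010-06-01
    }
}
```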
Nutch proved the most frustrating of the technologies used for the system. Without a book to lead one through its usage within an intranet scope, I had to rely on a lot of trial and error to reach a reasonable ongoing setup.
Solr Search Service
The Solr search server is an Apache project connected to the Lucene index engine. Solr acts as the central integration point of the system and is used in turn by the web crawler and the user front-end website.
Although Lucene does much of the heavy lifting, it is wrapped so thoroughly by Solr that you can remain mostly ignorant of it. One area where it is closer to the surface is the index schema, which has to be defined before any documents are indexed. The schema defines all the fields for a single record, where each field has a name, whether it is stored or indexed, and whether any index-time or query-time processing is applied to its content.
I started with the schema supplied by Nutch for its Solr integration, which has a document-based record whose fields include title, content, URL, last-modified date, and so on. Being able to specify a schema allows a closer match to the model of your application: for a CD application, for example, there could be fields for the artist, composer, album name, and track name.
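For the CD example, the fields might be declared in schema.xml along these lines; this fragment is illustrative and assumes field types like those in Solr’s example schema:

```xml
<!-- Illustrative schema.xml fragment for the CD example. -->
<fields>
  <field name="artist"   type="text"   indexed="true" stored="true"/>
  <field name="composer" type="text"   indexed="true" stored="true"/>
  <field name="album"    type="text"   indexed="true" stored="true"/>
  <field name="track"    type="text"   indexed="true" stored="true"/>
</fields>
```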
Solr comes with an admin-focused website that provides details on the setup of the site, its indexes, and some metrics on its usage. One nice set of features supporting production deployments is cores. A core is a packaged-up Lucene index and Solr configuration that can be accessed on its own URL path. Many cores can be active at once on a single Solr server, and there are commands for reloading a core’s configuration and for atomic switches between cores.
In our case we have a production core that services requests from users, and a second core in preparation to which updated crawled documents are being added. Once we are ready, we promote the preparation core to production using the core-switch feature, which seems to take less than a second and is atomic from the perspective of client applications. Solr also has a story for scaling both vertically and horizontally; so far, however, a single machine for the entire search system has been sufficient for us.
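The promotion step maps onto the CoreAdmin API’s SWAP action (RELOAD is the configuration-reload command). A small sketch that only builds the request URLs; the host and core names are examples:

```java
// Builds CoreAdmin request URLs for the promote-by-swap flow. SWAP and
// RELOAD are standard CoreAdmin actions; the base URL and core names
// below are illustrative.
public class CoreAdminUrls {

    public static String swap(String solrBase, String core, String other) {
        return solrBase + "/admin/cores?action=SWAP&core=" + core + "&other=" + other;
    }

    public static String reload(String solrBase, String core) {
        return solrBase + "/admin/cores?action=RELOAD&core=" + core;
    }

    public static void main(String[] args) {
        // After the swap, clients querying the "production" path see the
        // freshly crawled index atomically.
        System.out.println(swap("http://localhost:8983/solr", "production", "preparation"));
    }
}
```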
I found the book Solr 1.4 Enterprise Search Server by David Smiley and Eric Pugh very helpful in leading me through the features of the system; many of the things I wanted to do were covered in it. Once the book runs out, the Solr wiki has a lot of material, although I didn’t find much beyond it and the Lucid site.
The Search Front-End

One of the frustrations when first learning about Solr, hoping for a turn-key enterprise-wide search system based on it, was finding that it doesn’t come with a suitable website for users to search from. The only user site it comes with is for administration.
The basic feature list is simply derived from the typical modern search interface. Solr makes it straightforward to implement many of these core features, so all of the following were directly supported by the Solr API:
- Producing a list of results by page where specific fields from the results are returned. This includes a relevant selection from the content with the matching terms highlighted.
- Auto-completion for the search query.
- Faceted search so that the total number of hits by each of the indexed sites is shown. Results can also be filtered to specific sites using the same mechanism.
- Additional filters can also be overlaid on the query typed in by the user. These can apply across the common fields or specific fields. This is used to support filtering by projects based on their name but also against the URL for sites with a hierarchical URL.
- Spelling correction that can suggest an alternative search query if insufficient results are returned for the original query.
- Showing similar pages to a selected result page based on common terms between the documents.
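Most of the features above map directly onto Solr’s standard request parameters. A sketch that assembles them into a query string; the parameter names (hl, facet, fq, spellcheck) are real Solr parameters, while the field names and helper class are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Assembles Solr request parameters covering paging, highlighting,
// faceting, filtering, and spelling suggestions. The "content" and
// "site" field names are assumptions for this sketch.
public class SearchParams {

    public static String build(String query, int page, int pageSize, String siteFilter) {
        List<String> p = new ArrayList<>();
        p.add("q=" + query); // (URL-encoding of the query omitted for brevity)
        // paging through the result list
        p.add("start=" + (page * pageSize));
        p.add("rows=" + pageSize);
        // highlighted snippet taken from the content field
        p.add("hl=true");
        p.add("hl.fl=content");
        // facet counts per indexed site
        p.add("facet=true");
        p.add("facet.field=site");
        // optional filter down to a single site
        if (siteFilter != null) {
            p.add("fq=site:" + siteFilter);
        }
        // "did you mean" suggestions for sparse results
        p.add("spellcheck=true");
        return String.join("&", p);
    }

    public static void main(String[] args) {
        System.out.println(build("release", 0, 10, "wiki"));
    }
}
```

The similar-pages feature is the one exception: it maps onto Solr’s MoreLikeThis support rather than these common parameters.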
With Solr providing a way to implement all of the above we have been able to start focusing on extending the search based on the specifics of our company. One direction has been to use the faceting and filter capabilities to include the different source sites and the project nature of our work.
Another feature of search sites in the last few years has been to add some variety to the way results are shown. A search for a retail store might show a map of nearby locations at the head of the list, or, if images seem associated with the query, include a sample of images with a way to trigger an image-only search.
We follow a similar approach so that, for example, a search for a person will show at the head of the results list their photograph, contact details, current project room, and so on. One way of looking at this is that, based on the query or the returned results, a more specific structured or database search has been triggered rather than a text search, with a custom rendering of the results. The custom results start to look like a form of dashboard for integrating with other systems.
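The dispatch itself can be very simple: check the query against a structured source first, and render a card ahead of the plain text results on a hit. A sketch, in which the staff directory and card format are entirely hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative structured-search dispatch: a query that names someone in
// the staff directory gets a person "card" rendered above the normal
// text-search results. The directory contents are made up.
public class ResultDispatcher {

    static final Map<String, String> DIRECTORY = new HashMap<>();
    static {
        DIRECTORY.put("jane doe", "Jane Doe | ext. 1234 | Project room B2");
    }

    // Returns the structured header for a person query, or null to fall
    // through to the plain text-search results.
    public static String personCard(String query) {
        return DIRECTORY.get(query.toLowerCase());
    }

    public static void main(String[] args) {
        String card = personCard("Jane Doe");
        if (card != null) {
            System.out.println(card); // shown above the text results
        }
    }
}
```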