iSpace Solr Component

The Solr component allows messages to be sent to a Solr endpoint. The messages are indexed and can thereafter be searched.

This component is based on the SolrJ API. All options supported by the SolrJ API are also supported by this component.

To use this component, make sure you have installed and configured a Solr server. The Solr component works with the example schema provided with the Solr server. You can also tailor the schema to classify data using different 'categories' (facets). The Solr component supports the storage of tailored fields as well.

 

URI Format

solr:[name]?options

 

Options

 

Option | Default Value | Description
solrServerUrl | http://localhost:8080/apache-solr-3.3.0/ | The URL of the Solr server.
contentFieldName | "text" (default in the Solr example configuration) | The name of the field in the Solr server schema configuration that contains the main text of the document.
assignUniquekey | true | If set to true, the endpoint automatically ensures that all documents submitted for storage have a value set for the field configured as the 'unique key'. If the schema has no unique key (non-default), this should be set to false.
uniquekeyFieldName | "id" (default in the Solr example configuration) | The name of the unique key field of a document stored in Solr. If the schema has no unique key (non-default), this option has no effect. If the option 'assignUniquekey' is set to true, a UUID is automatically assigned to this field if no value is specified in the submitted document.
forceCommit | true | Whether a commit is forced on Solr upon reception of each document for storage. Note that commits can also be forced only for specific documents by setting the field 'solr.forcecommit' in the message header.
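
The interplay between 'assignUniquekey' and 'uniquekeyFieldName' can be sketched as follows. This is a minimal illustration and not the component's actual code: the class and method names are hypothetical, a plain Map stands in for the document submitted for storage, and treating an empty value as missing is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class UniqueKeySketch {

    /**
     * Hypothetical helper: sets a random UUID on the unique key field
     * when assignment is enabled and the document carries no value for it.
     */
    static void ensureUniqueKey(Map<String, Object> document, String keyField, boolean assignUniqueKey) {
        if (!assignUniqueKey) {
            return; // schema without a unique key: leave the document untouched
        }
        Object current = document.get(keyField);
        // Assumption: an empty value is treated like a missing one.
        if (current == null || current.toString().isEmpty()) {
            document.put(keyField, UUID.randomUUID().toString());
        }
    }

    public static void main(String[] args) {
        Map<String, Object> document = new HashMap<>();
        document.put("text", "Some body text");
        ensureUniqueKey(document, "id", true);
        System.out.println("id assigned: " + document.containsKey("id"));
    }
}
```

A value already present in the key field is left unchanged, so documents that carry their own identifiers keep them.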

 

Indexing a Document

A message without any of the header flags described in the following sections is interpreted as an 'insert' request.

The Solr endpoint expects the body of the IN message to contain the core textual part of the message, which is stored in the 'content field' and indexed. The body may be empty. The headers of the message can be used to enrich the document with additional fields, as well as to boost the relevance of the document (increase the relevance score when searching). Each IN message header field in the format

solr.field.[name]

will be mapped to a Solr field [name] with the value(s) set in the header. The header value can be a list, in which case the corresponding Solr schema field must be configured with 'multiValued="true"'.
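
The header naming convention can be illustrated with a small sketch. It is not the component's implementation: the class and method names are hypothetical, and a plain Map stands in for the IN message headers. It only shows how headers carrying the 'solr.field.' prefix could be collected under their bare Solr field names, with a list value for a multi-valued field.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrFieldHeaders {

    /** Header prefix described in the text above. */
    static final String FIELD_PREFIX = "solr.field.";

    /**
     * Illustrative only: collects every header named 'solr.field.[name]'
     * into a map keyed by the bare Solr field name.
     */
    static Map<String, Object> toSolrFields(Map<String, Object> headers) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : headers.entrySet()) {
            if (entry.getKey().startsWith(FIELD_PREFIX)) {
                fields.put(entry.getKey().substring(FIELD_PREFIX.length()), entry.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> headers = new LinkedHashMap<>();
        headers.put("solr.field.format", "pdf");
        // A list value requires a multi-valued field in the Solr schema.
        headers.put("solr.field.type", Arrays.asList("report", "manual"));
        headers.put("CamelFileName", "guide.pdf"); // unrelated header, skipped
        System.out.println(toSolrFields(headers)); // prints {format=pdf, type=[report, manual]}
    }
}
```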

Notice that the Solr component does not extract the textual content, i.e. it will take the body as a textual string and store it. To extract the textual content of a data source, see the Aperture Component.

 

Example: Indexing the Files in a Directory

The following Camel route definition monitors the local directory 'c:/test/testdata', iterates through all subfolders (recursive=true) every minute (consumer.delay=60000), leaves the files in place (noop=true), remembers which files have been processed (idempotent=true), stores that state in a separate file (idempotentRepository=#fileStore), picks up any changes to files, extracts text and title, and routes the messages to the Solr store.

 

<?xml version="1.0" encoding="UTF-8"?>
 
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:util="http://www.springframework.org/schema/util"
       xmlns:p="http://www.springframework.org/schema/p"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
           http://www.springframework.org/schema/context
           http://www.springframework.org/schema/context/spring-context-2.5.xsd
           http://www.springframework.org/schema/util
           http://www.springframework.org/schema/util/spring-util-2.5.xsd
           http://camel.apache.org/schema/spring
           http://camel.apache.org/schema/spring/camel-spring.xsd">
 
<context:annotation-config/>
 
<!-- The iSpace Aperture based splitter of text from documents. -->
<bean id="extractor" class="com.villemos.ispace.aperture.DocumentProcessor"/>
 
<!-- The persistent store used by the Camel file component to persist the state. -->
<bean id="fileStore" class="org.apache.camel.processor.idempotent.FileIdempotentRepository">
   <property name="fileStore" value="c:/test/filestore.dat"/>
</bean>
 
<camelContext id="context" xmlns="http://camel.apache.org/schema/spring">
    <!-- Route for crawling the documents. -->
    <route id="DocumentETL">
        <from uri="file://c:/test/testdata?recursive=true&amp;noop=true&amp;consumer.delay=60000&amp;idempotent=true&amp;idempotentRepository=#fileStore" />
        <split>
            <method bean="extractor"/>
            <to uri="solr:documentstore"/>
        </split>
    </route>
</camelContext>
 
</beans>

For a detailed description of the file directory related configuration, see the excellent documentation of the Camel file component on the Apache Camel site. 

For a detailed description of the Aperture text extraction component, see the Camel Aperture component.

 

Example: Indexing against a Tailored Solr Schema

To tailor Solr for your needs, you can and should update the Solr configuration files. The two key files are the schema.xml file (the schema file) and the solrconfig.xml file (the Solr configuration file). The fields of each entry are configured in the schema file. To understand the configuration, read through the default Solr configuration file and/or visit the Solr site.

In this example, the Solr server has been configured with a schema file containing:

...

<field name="format" type="string" indexed="true" stored="true" multiValued="false" required="true" omitNorms="true"/>

<field name="type" type="string" indexed="true" stored="true" multiValued="true" required="true" omitNorms="true"/>

<field name="mainText" type="string" indexed="true" stored="true" multiValued="false" required="true" omitNorms="true"/>

...

Two new fields have been added ('format' and 'type'), of which 'type' is multi-valued, and the main content field has been renamed from the default name 'text' to 'mainText'.

The following route sends the files in the directory 'c:/test/testdata' first to two custom beans (i.e. beans you have to implement, or one of the existing processors), which set the message header fields 'solr.field.format' and 'solr.field.type', and then to the Solr endpoint, which is configured to store the body in the 'mainText' field.

 

 

<bean id="type.setter" class="my.class.for.setting.the.type"/>

<bean id="format.setter" class="my.class.for.setting.the.format"/>

<bean id="fileStore" class="org.apache.camel.processor.idempotent.FileIdempotentRepository">
   <property name="fileStore" value="c:/test/filestore.dat"/>
   <property name="maxFileStoreSize" value="512000"/>
   <property name="cacheSize" value="250"/>
</bean>

<camelContext>
    <route id="filesystem">
        <from uri="file://c:/test/testdata?recursive=true&amp;noop=true&amp;consumer.delay=60000&amp;idempotent=true&amp;idempotentRepository=#fileStore" />
        <to uri="bean:type.setter"/>
        <to uri="bean:format.setter"/>
        <to uri="solr:store?contentFieldName=mainText"/>
    </route>
</camelContext>

 

 

Example: Timestamping Documents upon Indexing

To timestamp documents upon storage, change the default Solr schema by uncommenting the line that defines the field 'timestamp'. You should end up with the following entry in your Solr schema.xml file:

<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

 

Example: Forcing Commit

Solr actively caches data upon storage to optimize performance. Cached data is not available for retrieval until it has been committed. The commit policy is configured in the Solr 'solrconfig.xml' file and is typically based on size as well as a maximum cache time.

The Solr endpoint supports two ways of influencing commits from outside Solr:

1. Per endpoint. A Solr endpoint can be configured to always force commits. This setting can be changed using the 'forceCommit' option.

2. Per message. To force a commit for an individual document regardless of the endpoint configuration, insert a header field 'solr.commit' in the exchange's IN message.

Note that frequent commits will impact storage performance. The defaults in Solr are pretty good, so only change this if you really need to.

The following configuration defines two Solr endpoints: one for 'priority' messages, which are committed immediately upon reception, and one for 'whenever' messages, which use the delayed commit configured in the Solr configuration file.

 

<camelContext>

    <route id="priority">
        <from uri="direct:priority" />
        <to uri="solr:store"/>
    </route>

    <route id="whenever">
        <from uri="direct:whenever" />
        <to uri="solr:store?forceCommit=false"/>
    </route>

</camelContext>
 

Submitting a document to 'direct:priority' will always lead to a commit (default). The message header 'solr.commit' has no effect; setting it to 'false' is ignored as well.

Submitting a document to 'direct:whenever' will not lead to a commit. The document will be indexed, but cached and not immediately available for search. To force the commit, insert the message header field 'solr.commit=true'.
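
The behaviour of the two endpoints can be summarised as a small decision function. This is a sketch of the rules as described here, not the component's code: the class and method names are hypothetical, and it assumes that a 'solr.commit' header only counts when its value reads as "true".

```java
public class CommitDecisionSketch {

    /**
     * Hypothetical helper combining the endpoint option 'forceCommit'
     * with the per-message header 'solr.commit'.
     */
    static boolean shouldCommit(boolean endpointForceCommit, Object commitHeader) {
        if (endpointForceCommit) {
            return true; // the endpoint setting wins; the header is ignored
        }
        // Assumption: the header forces a commit only when its value reads as "true".
        return commitHeader != null && Boolean.parseBoolean(commitHeader.toString());
    }

    public static void main(String[] args) {
        System.out.println(shouldCommit(true, "false")); // direct:priority, header ignored: true
        System.out.println(shouldCommit(false, null));   // direct:whenever, no header: false
        System.out.println(shouldCommit(false, "true")); // direct:whenever, header set: true
    }
}
```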

 

Retrieving Documents

The main retrieval filter is set through the IN message header field 'solr.query'. The value should be the filter search string.

The Solr endpoint is based on the SolrJ API. All query configuration options available through the SolrJ API are also available in the Solr endpoint. The configuration is done through the IN message header fields. Each header field with the syntax

solr.option.[option] 

will be mapped to a SolrJ API call 'set[Option]' or 'add[Option]'.
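
One way such a mapping could work is a case-insensitive reflective lookup. The sketch below is an assumption about the mechanism, not the component's actual code. 'QueryStub' is a stand-in written for this example (it is not the SolrJ query class), and 'findMethod' is a hypothetical helper.

```java
import java.lang.reflect.Method;

public class OptionMapping {

    /** Stand-in for a query object; NOT the SolrJ class, just enough to show the lookup. */
    public static class QueryStub {
        public int rows;
        public int start;
        public void setRows(Integer rows) { this.rows = rows; }
        public void setStart(Integer start) { this.start = start; }
    }

    static final String OPTION_PREFIX = "solr.option.";

    /**
     * Hypothetical lookup: resolves 'solr.option.[option]' to a public method
     * named 'set[Option]' or 'add[Option]', ignoring case. Returns null if none matches.
     */
    static Method findMethod(Class<?> type, String headerName) {
        if (!headerName.startsWith(OPTION_PREFIX)) {
            return null;
        }
        String option = headerName.substring(OPTION_PREFIX.length());
        for (Method method : type.getMethods()) {
            if (method.getName().equalsIgnoreCase("set" + option)
                    || method.getName().equalsIgnoreCase("add" + option)) {
                return method;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        QueryStub query = new QueryStub();
        Method setter = findMethod(QueryStub.class, "solr.option.rows");
        setter.invoke(query, 100); // the header value becomes the method argument
        System.out.println(setter.getName() + " -> rows=" + query.rows); // prints setRows -> rows=100
    }
}
```

Under this assumption the header value must match the parameter type of the target method, so it is worth checking the SolrJ signature of each option you set.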

Results can be returned either as a stream of individual documents or as a batch in the OUT message body. To configure delivery as a stream, the IN header field 'solr.stream' must be set and the body of the IN message must contain an object that implements the ICallback interface. The callback will thereafter be called for each document matching the search criterion. When delivering in batch mode, the result set, a SolrJ API 'SolrDocumentList' with zero or more results, will be set as the OUT message body.

 

Example: Keyword Retrieval

The following IN message

Message in = exchange.getIn();
in.setHeader("solr.query", "test");
in.setHeader("solr.option.rows", 100);
in.setHeader("solr.option.start", 200);
in.setHeader("solr.option.highlight", true);
in.setHeader("solr.option.includescore", true);
in.setHeader("solr.option.sortfield", new Object[] {"score", ORDER.desc});

will retrieve a maximum of 100 entries starting from position 200 (i.e. entries 200 to 299), using the filter "test", and provide snippets with highlights for each result. The result set will include the score for each result and be sorted by score in descending order.

The OUT message body will contain a SolrJ 'SolrDocumentList'.

 

 

Example: Streaming Results of a Keyword Retrieval

The following IN message

ICallback callback = new MyCallback();
 
Message in = exchange.getIn();
in.setHeader("solr.query", "test");
in.setHeader("solr.stream", "");
in.setHeader("solr.option.rows", 100);
in.setHeader("solr.option.start", 200);
in.setHeader("solr.option.highlight", true);
in.setHeader("solr.option.includescore", true);
in.setHeader("solr.option.sortfield", new Object[] {"score", ORDER.desc});
in.setBody(callback);

Will retrieve all entries starting from position 200 (i.e. entry 200 to N), using the filter "test", and provide snippets with highlights for each result. The result set will include the score for each result and be sorted based on the score in descending order.

The results will be returned through the callback, one SolrDocument at a time.
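
The callback style of delivery can be pictured with the following sketch. The method signature of ICallback is not shown on this page, so the example uses a hypothetical stand-in interface ('ResultCallback') and plain strings in place of SolrDocument objects. It illustrates the pattern only, not the real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StreamingSketch {

    /** Hypothetical stand-in for ICallback; the real interface may differ. */
    interface ResultCallback {
        void receive(String document);
    }

    /** Stream delivery: hand each matching document to the callback, one at a time. */
    static void deliverAsStream(List<String> matches, ResultCallback callback) {
        for (String document : matches) {
            callback.receive(document);
        }
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        // The callback here simply collects what it is given.
        deliverAsStream(Arrays.asList("doc-200", "doc-201"), received::add);
        System.out.println(received); // prints [doc-200, doc-201]
    }
}
```

Compared with batch delivery, the caller never holds the complete result set, which matters when a query matches many documents.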

 

Retrieving Facets

Facet retrieval is requested through the IN message header field 'solr.facetquery'.

Example: Facet Retrieval

The following IN message

Message in = exchange.getIn();
in.setHeader("solr.facetquery", "");
in.setHeader("solr.option.facetfield", new String[] {"mimetype"});
in.setHeader("solr.option.facetlimit", -1);
in.setHeader("solr.option.rows", 0);
in.setHeader("solr.option.facetsort", "count");

will retrieve all facet values for the Solr field 'mimetype' (the field must exist in the Solr schema file) and sort the returned facet values by their frequency (count).

The OUT message body will contain a 'List<FacetField>'.


 
Warning: The 'facetfield' option value must be a String[]. SolrJ uses a varargs method signature ('String... arg'), which corresponds to an array.
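
Why the array matters becomes visible when a varargs method is called through reflection. The sketch below assumes that option values are passed on reflectively, which this page does not state. 'addFacetField' here is a local stand-in with a 'String...' signature, not the SolrJ method itself.

```java
import java.lang.reflect.Method;

public class VarargsDemo {

    /** Local stand-in with a varargs signature; returns the number of fields received. */
    public static int addFacetField(String... fields) {
        return fields.length;
    }

    /** Returns true if the reflective call accepts the given argument. */
    static boolean invokeReflectively(Object argument) {
        try {
            // At the reflection level, 'String...' is a single String[] parameter.
            Method method = VarargsDemo.class.getMethod("addFacetField", String[].class);
            method.invoke(null, argument);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // argument type does not match String[]
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(invokeReflectively(new String[] {"mimetype"})); // prints true
        System.out.println(invokeReflectively("mimetype"));                // prints false
    }
}
```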

 
