solr Create custom field type from available field types


Example

Let's get some theoretical knowledge before moving to the example. There are three important terms being used here Analyzers, Tokenizers, and Filters. To create such custom field you will need to create an analyzer with one tokenizer and one or more filters. As mentioned here, you can have only one tokenizer per analyzer but there are ways to overcome this limitation.

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" replace="all" replacement="" pattern="([^a-z])"/>
  </analyzer>
</fieldType>

Another example:

<fieldType name="lowercase_text" class="solr.TextField" positionIncrementGap="150">
  <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

One more example with description:

<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>

This example starts with Solr's standard tokenizer, which breaks the field's text into tokens. Those tokens then pass through Solr's standard filter, which removes dots from acronyms, and performs a few other common operations. All the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time. The last filter in the above example is a stemmer filter that uses the Porter stemming algorithm. A stemmer is basically a set of mapping rules that maps the various forms of a word back to the base, or stem, word from which they derive. For example, in English the words "hugs", "hugging" and "hugged" are all forms of the stem word "hug". The stemmer will replace all of these terms with "hug", which is what will be indexed. This means that a query for "hug" will match the term "hugged", but not "huge".

Example usage of such custom field:

<field name="keywords" type="text_stem" indexed="true" stored="true" />

List of available tokenizer types: list of tokenizer types

List of available filter types: list of filter types