For the first time ever, Solr can support a multi-sharded, distributed join query. As of Solr 8.6, there is a new join method: the Cross Collection Join Query (SOLR-13749). Solr already offers several ways to perform a join: the Block Join query parser, the Join query parser, the Graph query parser, and Streaming Expressions all provide some notion of join capability. As the name implies, the Cross Collection Join query is built specifically to join documents across two different, distributed collections. This post covers the technical implementation of the Cross Collection Join. We’ll follow up soon with another post dedicated to sample use cases and benchmarks for the new query.

How it works

The Cross Collection Join is a new join method that executes a query against a remote Solr collection to retrieve the set of join keys, which are then used as a filter query against the local Solr collection. The CrossCollectionJoinQuery first queries the remote Solr collection and gets back the join keys as a streaming expression result. As the join keys are streamed to the node, a bitset of the matching documents in the local index is built up, which avoids keeping the full set of join keys in memory at any given time. Upon successful execution, this bitset is inserted into the filter cache, following the normal behavior of the Solr filter cache.
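To make the flow above concrete, here is a minimal Python sketch of the idea, not the actual Lucene implementation. The generator stands in for the streamed remote result, and `local_index` is a hypothetical in-memory mapping from join key to local document ids. The field name `product_id` and the sample documents are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List

MAX_DOC = 5  # hypothetical number of documents in the local index


def stream_join_keys(remote_docs: Iterable[dict], from_field: str) -> Iterator[str]:
    """Yield join keys one at a time, standing in for the streamed remote result."""
    for doc in remote_docs:
        yield doc[from_field]


def build_bitset(keys: Iterable[str], local_index: Dict[str, List[int]]) -> int:
    """Set one bit per matching local document as each key arrives.

    Only the current key is held at any time, never the full key set.
    """
    bits = 0
    for key in keys:
        for doc_id in local_index.get(key, ()):
            bits |= 1 << doc_id
    return bits


# Hypothetical remote result and local "to" field index (join key -> doc ids).
remote_docs = [{"product_id": "p1"}, {"product_id": "p3"}, {"product_id": "p9"}]
local_index = {"p1": [0, 4], "p2": [1], "p3": [2]}

bitset = build_bitset(stream_join_keys(remote_docs, "product_id"), local_index)
matching_docs = [doc_id for doc_id in range(MAX_DOC) if (bitset >> doc_id) & 1]
print(matching_docs)
```

The resulting bitset is what would be cached and applied as the filter. The remote key `p9` has no local match and sets no bits.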


Additionally, if the local index is sharded according to the join key field, the Cross Collection Join query can leverage a secondary query parser called the “hash_range” query parser. The hash_range query parser is responsible for returning only the documents that hash to a given range of values. If the local collection is routed based on the join key used, each shard participating in the Cross Collection Join Query retrieves only the join keys from the remote collection that could potentially match in that local shard. As a result, network traffic does not grow as the number of shards increases, which allows for much greater scalability.
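The effect of the hash range filter can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: CRC32 is used as a stand-in hash (Solr's router uses its own hash function, which is not reproduced here), the hash space is split into equal unsigned ranges, and the key names are invented.

```python
import zlib
from typing import List, Tuple

HASH_SPACE = 2 ** 32


def key_hash(key: str) -> int:
    # Stand-in hash for illustration only; not the hash Solr uses for routing.
    return zlib.crc32(key.encode("utf-8"))


def shard_ranges(num_shards: int) -> List[Tuple[int, int]]:
    """Split the 32-bit hash space into contiguous, inclusive ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        lower = i * step
        upper = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((lower, upper))
    return ranges


def keys_for_shard(keys: List[str], lower: int, upper: int) -> List[str]:
    """Keep only the keys whose hash falls inside this shard's range."""
    return [k for k in keys if lower <= key_hash(k) <= upper]


join_keys = [f"product-{n}" for n in range(1000)]
per_shard = [keys_for_shard(join_keys, lo, hi) for lo, hi in shard_ranges(4)]
total_transferred = sum(len(keys) for keys in per_shard)
print(total_transferred)  # same as len(join_keys): each key goes to one shard
```

Because the ranges partition the hash space, every key lands in exactly one shard's subset, so the total number of keys sent over the network stays constant no matter how many shards there are.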

The Cross Collection Join Query works with both String and Point field types. The field used as the join key in the remote collection must have docValues enabled. To take advantage of the hash range optimization described above, the field used as the join key in the local collection must be a single-valued string field, and the documents in the local collection must be routed to shards based on that join key field. The remote Solr collection does not have any specific sharding requirements.

Example Usage

{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField"}*:*
 
or
 
{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField" v="*:*"}    (preferred)
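As a rough illustration of how the preferred form might be sent to Solr, the Python sketch below builds a request URL that passes the join as a filter query and supplies the remote query through a separate request parameter referenced as `v=$joinQuery`. The host, port, collection name `localCollection`, parameter name `joinQuery`, and the query `category:books` are all hypothetical placeholders.

```python
from urllib.parse import urlencode

# The join is applied as a filter query on the local collection. The remote
# query lives in its own request parameter, so it needs no escaping inside
# the local params string.
params = {
    "q": "*:*",
    "fq": '{!join method="crossCollection" fromIndex="otherCollection" '
          'from="fromField" to="toField" v=$joinQuery}',
    "joinQuery": "category:books",  # hypothetical query run on the remote collection
}

query_string = urlencode(params)
# Hypothetical host and local collection name.
url = "http://localhost:8983/solr/localCollection/select?" + query_string
print(url)
```

Keeping the remote query in its own parameter means quotes, spaces, and colons in that query are URL-encoded once and never have to be nested inside the `{!join ...}` string.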

Glossary of Terms

Term: Description

Local Collection: This is the main collection that is being queried.

Remote Collection: This is the collection that the Cross Collection Join Query will query to resolve the join keys.

CrossCollectionJoinQuery: The Lucene query that executes a search to get back a set of join keys from a remote collection.

HashRangeQuery: The Lucene query that matches only the documents whose hash code on a field falls within a specified range.

Cross Collection Join Query Parameters

Param (Required/Optional): Description

All others: Any normal Solr parameter can also be specified as a local param.

fromIndex (Required): The name of the external Solr collection to be queried to retrieve the set of join key values.

from (Required): The join key field name in the external collection.

routed (Optional): true / false. If true, the Cross Collection Join query will use each shard’s hash range to determine the set of join keys to retrieve for that shard. This parameter improves the performance of the cross-collection join, but it depends on the local collection being routed by the “to” field. If this parameter is not specified, the Cross Collection Join query will try to determine the correct value automatically.

solrUrl (Optional): The URL of the external Solr node to be queried. The solrUrl must be added to the “allowSolrUrls” list in the solrconfig.xml file.

to (Required): The join key field name in the local collection.

ttl (Optional): The length of time, in seconds, that a Cross Collection Join query in the cache will be considered valid. Defaults to 3600 (one hour). The Cross Collection Join query is not aware of changes to the remote collection, so if the remote collection is updated, cached Cross Collection Join queries may give inaccurate results. After the ttl period has expired, the Cross Collection Join query will re-execute the join against the remote collection.

v (See Note): The query to be executed against the external Solr collection to retrieve the set of join key values. Note: The original query can be passed at the end of the string or as the “v” parameter. It’s recommended to use query parameter substitution with the “v” parameter to ensure no issues arise with the default query parsers.

zkHost (Optional): The connection string to be used to connect to ZooKeeper. zkHost and solrUrl are both optional parameters, and at most one of them should be specified. If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster will be used.
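The ttl behavior can be pictured with a small Python sketch. This is not Solr's filter cache, only an illustration of the semantics described for ttl: a cached join result is reused until it is older than the ttl, then the join is run again. The class name `TtlJoinCache`, the `run_remote_join` callback, and the sample query are hypothetical, and the clock is injectable so the example runs without waiting.

```python
import time
from typing import Callable, Dict, FrozenSet, Tuple


class TtlJoinCache:
    """Illustrative cache: reuse a join result until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, FrozenSet[str]]] = {}

    def get(self, join_query: str,
            execute: Callable[[str], FrozenSet[str]]) -> FrozenSet[str]:
        now = self.clock()
        entry = self._entries.get(join_query)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still valid; remote changes are not seen
        result = execute(join_query)  # expired or missing: re-run the join
        self._entries[join_query] = (now, result)
        return result


fake_now = [0.0]  # controllable clock for the example
calls = []


def run_remote_join(query: str) -> FrozenSet[str]:
    calls.append(query)
    return frozenset({"p1", "p3"})


cache = TtlJoinCache(ttl_seconds=3600, clock=lambda: fake_now[0])
cache.get("category:books", run_remote_join)  # first use: executes
fake_now[0] = 1800.0
cache.get("category:books", run_remote_join)  # within ttl: served from cache
fake_now[0] = 3600.0
cache.get("category:books", run_remote_join)  # ttl expired: executes again
print(len(calls))
```

The middle call shows the trade-off: any update to the remote collection during the ttl window is invisible to the cached result.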

Installation

In order to use the Hash Range query optimization, the solrconfig.xml files for the local and remote collections need to be updated as shown below.

Solrconfig.xml – Local Collection

<queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin">
  <str name="routerField">product_id</str>
</queryParser>

Normally, this query parser is registered automatically by Solr. To take advantage of the Hash Range optimization, the routerField property must be set on the JoinQParserPlugin in solrconfig.xml. In this example, we’ve specified “product_id” as the field that will support the routed=true parameter optimization.

Solrconfig.xml – Remote Collection

<cache name="hash_product_id"
       class="solr.LRUCache"
       size="128"
       initialSize="0"
       regenerator="solr.NoOpRegenerator"/>

Make sure to define the cache for the Hash Range query with a size equal to the maximum number of segments that are expected to be in the index (typically this number stays well below 100).

Potential Issues

  1. A large number of join keys will take a lot of network bandwidth to transmit to all the shards.
  2. The hash_product_id cache will not explicitly expunge entries until the max size is reached.
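To get a feel for the first issue, here is a back-of-the-envelope Python estimate. It assumes that, without the hash range optimization, every local shard retrieves the full set of join keys, and that with routed=true each key is retrieved by exactly one shard. The key and shard counts are made-up numbers, not measurements.

```python
def keys_transferred(num_join_keys: int, num_shards: int, routed: bool) -> int:
    """Estimate the total join keys sent over the network for one join.

    Assumption: unrouted shards each pull every key, while routed shards
    pull only the keys that hash into their own range.
    """
    if routed:
        return num_join_keys
    return num_join_keys * num_shards


# Hypothetical workload: one million join keys, eight local shards.
unrouted = keys_transferred(1_000_000, 8, routed=False)
routed = keys_transferred(1_000_000, 8, routed=True)
print(unrouted, routed)
```

Under these assumptions the unrouted cost grows linearly with the shard count, which is why routing the local collection by the join key matters most for large key sets.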

