The Cross Collection Join Query

By Dan Fox July 15, 2020

For the first time ever Solr can now support a multi-sharded, distributed join query. As of Solr 8.6, there is a new join method; the Cross Collection Join Query (SOLR-13749). Currently there are multiple ways to perform a join query in Solr: the Block Join query parser, Join query parser, Graph query parser and Streaming Expressions all can provide some notion of join capabilities. As the name implies, the Cross Collection Join query is specifically built to allow for queries to join documents across two different, distributed collections. This post covers the technical implementation of the Cross Collection Join, we’ll be following this up soon with another post dedicated to sample use cases and benchmarks for using the new Cross Collection Join query.

How it works

The Cross Collection Join is a new join method that executes a query against a remote Solr collection to retrieve the set of join keys that will be used as a filter query against the local Solr collection. The CrossCollectionJoinQuery will first query a remote solr collection and get back a streaming expression result of the join keys. As the join keys are streamed to the node, a bitset of the matching documents in the local index is built up. This avoids keeping the full set of join keys in memory at any given time. This bitset is then inserted into the filter cache upon successful execution as with the normal behavior of the solr filter cache.

Additionally, if the local index is sharded according to the join key field, the Cross Collection query can leverage a secondary query parser called the “hash_range” query parser. The hash_range query parser is responsible for returning only the documents that hash to a given range of values. If the local collection is routed based on the join key used, each shard participating in the Cross Collection Join Query will be able to retrieve only the join keys from the remote collection that potentially match in the local shard. This has the benefit of making sure that network traffic doesn’t increase as the number of shards increases and allows for much greater scalability.

The Cross Collection Join Query works with both String and Point types of fields. The field that is being used for the join key in the remote collection must also have docValues enabled. The field that is being used for the join key in the local collection must be a single valued string field in order to take advantage for the hash range optimization described above. Additionally, the documents in the local collection must be routed to shard based on the join key field for the optimization.
The remote solr collection does not have any specific sharding requirements.

Example usage

{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField"}*:*
 
or
 
{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField" v="*:*"}    (preferred)

{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField"}*:*
 
or
 
{!join method="crossCollection" fromIndex="otherCollection" from="fromField" to="toField" v="*:*"}    (preferred)

glossary of terms

Term	Description
Local Collection	This is the main collection that is being queried.
Remote Collection	This is the collection that the Cross Collection Join Query will query to resolve the join keys.
CrossCollectionJoinQuery	The lucene query that executes a search to get back a set of join keys from a remote collection
HashRangeQuery	The lucene query that matches only the documents whose hash code on a field falls within a specified range.

cross collection join query parameters

Param	Required	Description
All others		Any normal Solr parameter can also be specified as a local param.
fromIndex	Required	The name of the external Solr collection to be queried to retrieve the set of join key values (required)
from	Required	The join key field name in the external collection (required)
routed	Optional	true / false. If true, the Cross Collection Join query will use each shard’s hash range to determine the set of join keys to retrieve for that shard. This parameter improves the performance of the cross-collection join, but it depends on the local collection being routed by the toField. If this parameter is not specified, the Cross Collection Join query will try to determine the correct value automatically.
solrUrl	Optional	The URL of the external Solr node to be queried (optional). The solrUrl must be added to the “allowSolrUrls” list in the solrconfig.xml file
to	Required	The join key field name in the local collection
ttl	Optional	The length of time that a Cross Collection Join query in the cache will be considered valid, in seconds. Defaults to 3600 (one hour). The Cross Collection Join query will not be aware of changes to the remote collection, so if the remote collection is updated, cached Cross Collection Join queries may give inaccurate results. After the ttl period has expired, the Cross Collection Join query will re-execute the join against the remote collection.
v	See Note	The query to be executed against the external Solr collection to retrieve the set of join key values. Note: The original query can be passed at the end of the string or as the “v” parameter. It’s recommended to use query parameter substitution with the “v” parameter to ensure no issues arise with the default query parsers.
zkHost	Optional	The connection string to be used to connect to Zookeeper. zkHost and solrUrl are both optional parameters, and at most one of them should be specified. If neither of zkHost or solrUrl are specified, the local Zookeeper cluster will be used. (optional)

Installation

In order to use the Hash Range query optimization, the solrconfig.xml for local and remote collections needs to be updated as defined below.

Solrconfig.xml – Local Collection

<queryParser name="join" class="org.apache.solr.search.JoinQParserPlugin">
  <str name="routerField">product_id</str>
</queryParser>

Normally, this query parser is automatically registered by Solr. In order to take advantage of the Hash Range optimization the routedField property must be set on the JoinQParser in the Solrconfig .xml. In this example, we’ve specified the “product_id” field to indicate the field name that will support the routed=true parameter optimization.

Solrconfig.xml – Remote Collection

<cache name="hash_product_id"
       class="solr.LRUCache"
       size="128"
       initialSize="0"
       regenerator="solr.NoOpRegenerator"/>

Make sure to define the cache for the Hash Range query with a size equal to the maximum number of segments that are expected to be in the index. (typically this number stays well below 100.)

potential issues

A large number of join keys will take a lot of network bandwidth to transmit to all the shards.
The hash_product_id cache will not explicitly expunge entries until the max size is reached.

Comments (1)

Building A Vector Search Application on Opensearch - KMW Technology

March 30, 2023 at 10:50 pm

[…] 2, 2022 Solr Payload Inequalities By Kevin Watters June 12, 2021 The Cross Collection Join Query By Dan Fox July 15, 2020 Solr JSON Facets for Reporting and Data Aggregation […]

Comments are closed.

How it works

Example usage

glossary of terms

cross collection join query parameters

Installation

Solrconfig.xml – Local Collection

Solrconfig.xml – Remote Collection

potential issues

Comments (1)

Privacy Policy

INTERPRETATION AND DEFINITIONS

INTERPRETATION

DEFINITIONS

COLLECTING AND USING YOUR PERSONAL DATA

TYPES OF DATA COLLECTED

PERSONAL DATA

USAGE DATA

TRACKING TECHNOLOGIES AND COOKIES

EMBEDDED CONTENT & PLUGINS

GOOGLE WEB FONTS

YOUTUBE

AKISTMET

USE OF YOUR PERSONAL DATA

RETENTION OF YOUR PERSONAL DATA

TRANSFER OF YOUR PERSONAL DATA

DELETE YOUR PERSONAL DATA

DISCLOSURE OF YOUR PERSONAL DATA

BUSINESS TRANSACTIONS

LAW ENFORCEMENT

OTHER LEGAL REQUIREMENTS

SECURITY OF YOUR PERSONAL DATA

DETAILED INFORMATION ON THE PROCESSING OF YOUR PERSONAL DATA

ANALYTICS

GDPR PRIVACY

LEGAL BASIS FOR PROCESSING PERSONAL DATA UNDER GDPR

YOUR RIGHTS UNDER THE GDPR

EXERCISING OF YOUR GDPR DATA PROTECTION RIGHTS

CCPA PRIVACY

CATEGORIES OF PERSONAL INFORMATION COLLECTED

SOURCES OF PERSONAL INFORMATION

USE OF PERSONAL INFORMATION FOR BUSINESS PURPOSES OR COMMERCIAL PURPOSES

DISCLOSURE OF PERSONAL INFORMATION FOR BUSINESS PURPOSES OR COMMERCIAL PURPOSES

SALE OF PERSONAL INFORMATION

SHARE OF PERSONAL INFORMATION

SALE OF PERSONAL INFORMATION OF MINORS UNDER 16 YEARS OF AGE

YOUR RIGHTS UNDER THE CCPA

EXERCISING YOUR CCPA DATA PROTECTION RIGHTS

DO NOT SELL MY PERSONAL INFORMATION

WEBSITE

MOBILE DEVICES

“DO NOT TRACK” POLICY AS REQUIRED BY CALIFORNIA ONLINE PRIVACY PROTECTION ACT (CALOPPA)

CHILDREN’S PRIVACY

INFORMATION COLLECTED FROM CHILDREN UNDER THE AGE OF 13

PARENTAL ACCESS

YOUR CALIFORNIA PRIVACY RIGHTS (CALIFORNIA’S SHINE THE LIGHT LAW)

CALIFORNIA PRIVACY RIGHTS FOR MINOR USERS (CALIFORNIA BUSINESS AND PROFESSIONS CODE SECTION 22581)

LINKS TO OTHER WEBSITES

CHANGES TO THIS PRIVACY POLICY

CONTACT US