In the beginning…

In the beginning, databases taught us to normalize our data. Normalization made inserts and updates fast and cheap: you could update one row in a table and all of the related objects would reflect the change. Then people realized that you also need to select data, and performing joins at select time across many tables became problematic at scale. So search engines came along and denormalized your data. This was great because people could now run selects/searches over large amounts of data very quickly. Then people realized that when data is denormalized, a single value change can force you to re-index large amounts of data.

Search engines have traditionally lacked the ability to do relational types of queries. That was acceptable because the data in the search index was flattened onto a document, so a simple query did not have to span multiple "tables". Now, that is no longer an issue. Solr introduced a JOIN query that allows users to model their data as two separate data sets and join across them. This solved a vast number of problems for data sets that were relational in nature. However, in this modern age, we're discovering that data tends to be organized more like a graph. This is akin to how the human mind organizes information, so it seems reasonable that systems designed by humans would mimic this sort of data modeling. In traditional search, you have search terms and filters that let you select a set of documents containing those terms. But what happens when you want to select a set of documents that relate to a set of terms or filters?

Enter the graph query

Let's take a concrete example: indexing the files in a directory. Say you have a folder on a drive somewhere, that folder contains many folders, and those folders contain both files and other folders. This nested folder structure is what is referred to as a hierarchy. It's pretty simple to have a file connector recursively walk those folders and index the files. This is nice: you can then search for documents that contain "foo" and they'll be returned. But what if you want to find only the files that contain "foo" in a particular directory? OK, pretty simple: search for "foo" and folder:"/home/foo". Great, but now let's say you want to search for all the files that contain "foo" and are in one of the subfolders. Hmm, this gets a little tricky, and this is where we can start playing tricks with how we flatten/denormalize the data. One approach is to expand out the folder path on every document. You might have a document in the folder /home/foo/bar/baz/bap; in the index, a fancy trick is to expand out all of the sub-paths, so that any document under that directory has an exact match for the folder "/home/foo". This approach is sometimes referred to as "shingling". It adds a field to the document containing the following list of values:

"/home", "/home/foo", "/home/foo/bar", "/home/foo/bar/baz", "/home/foo/bar/baz/bap"
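
Here's a minimal sketch of how such a field could be generated at index time. This is purely illustrative; the function name path_shingles is ours:

def path_shingles(path):
    # Expand a folder path into every ancestor prefix, so that an exact
    # match on any ancestor folder selects this document.
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

print(path_shingles("/home/foo/bar/baz/bap"))
# ['/home', '/home/foo', '/home/foo/bar', '/home/foo/bar/baz', '/home/foo/bar/baz/bap']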

Great: now if we query for "foo" and folder:"/home/foo", all of the documents under that folder will match, because each of them carries the term "/home/foo" thanks to this little flattening and denormalization trick.

OK, now let's say someone renames a folder... Uh oh, time to re-index! That's a bit of a bummer if you have multiple terabytes of data under that folder. It's especially painful if many of those documents are complex formats, like PDFs that need to be processed with OCR, or very large log files. That re-indexing is likely to be non-trivial.

Wouldn't it be nice if you could just send one update for the folder name? Well, with the graph query, you can.

We can think of this use case as a graph problem. The directory tree can be represented as a graph, and all of the documents reside in one of the folders in that tree. As we index the files, we can also index a document that represents each folder. The folder record only needs a pointer to its parent directory. When files are indexed, they get a pointer to the id of the folder that contains them.

So now we have two types of records in the index: one that represents the folders you are searching, and one that represents the files/documents in those folders.
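
To make that concrete, here is a sketch of what indexing one record of each type might look like against Solr's JSON update endpoint. The field names (table, folder_id, parent_id) are the ones used in this post; the core name "files", the URL, the ids, and the "body" field are assumptions made up for the example:

import requests

docs = [
    # A folder record: it only carries a pointer to its parent folder.
    {"id": "folder-123", "table": "folder", "folder_id": "123", "parent_id": "122"},
    # A file record: it points at the folder_id of the folder that contains it.
    {"id": "doc-1", "table": "document", "folder_id": "123", "body": "... foo ..."},
]

requests.post(
    "http://localhost:8983/solr/files/update?commit=true",
    json=docs,
).raise_for_status()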


So, how does it all work? In order to understand it, we must think recursively. That is, relationships beget relationships beget relationships. It helps to imagine yourself walking from document to document, following the path to the answer.

Let's take our original example: searching for documents containing "foo", but restricted to just the files that are in a subdirectory of "/home/foo". We can start by thinking about it the way a human would: look in /home/foo, find all of the subdirectories, and search the files in them.

So, how do we do the first part? Assuming we have a document in the index that represents each folder, with a link to its parent folder, we can use the Solr graph query to identify all the subfolders:

{!graph from="folder_id" to="parent_id" traversalFilter="table:folder"}folder_id:123

The result of this is all of the folder records that are descendants of the folder identified by the starting "root node query" of folder_id:123. In more detail: we start the graph traversal with the search result for folder_id:123. The values in the "folder_id" field of those results are accumulated and then turned around to search the "parent_id" fields. This process repeats until no new records are discovered. At each "hop" of the traversal, the additional filter of "table:folder" is applied, which makes sure that while we're traversing the graph we only consider the folder records.
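
If you want to try it, here is a sketch of running that traversal through Solr's standard /select endpoint (the "files" core name is the same assumption as before):

import requests

params = {
    "q": '{!graph from="folder_id" to="parent_id" traversalFilter="table:folder"}folder_id:123',
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/files/select", params=params)
# Every hit is a folder record somewhere in the subtree rooted at folder 123.
print(resp.json()["response"]["numFound"])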


Now what? We weren't looking for folders, we were looking for documents. Well, recall that the documents have a folder id on them, so if we use the folders from the previous query as the starting point for a query against the documents, we can do an inner join between the folder records and the document records:

+{!graph from="folder_id" to="folder_id" traversalFilter="table:document"}({!graph from="folder_id" to="parent_id" traversalFilter="table:folder"}folder_id:123) +foo

Yes, that's right: we can nest graph queries arbitrarily. The example above is a graph query that traverses the folder hierarchy, nested inside a graph query that traverses to the documents, and the final result set is ANDed with the query term "foo".
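
Over HTTP it is just a longer q parameter. A sketch, under the same assumptions as the earlier examples:

import requests

q = (
    '+{!graph from="folder_id" to="folder_id" traversalFilter="table:document"}'
    '({!graph from="folder_id" to="parent_id" traversalFilter="table:folder"}folder_id:123)'
    ' +foo'
)
resp = requests.get("http://localhost:8983/solr/files/select", params={"q": q, "wt": "json"})
# Every hit is a document containing "foo" that lives somewhere under folder 123.
print(resp.json()["response"]["numFound"])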


OK, this seems rather complex; how does it help us? Well, in the scenario mentioned earlier, when a folder is moved or renamed, we only need to re-index the one folder record whose parent folder id changes, and, like magic, the rest of the index is properly updated. We avoided re-indexing tons of data!
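
That one update can be a standard Solr atomic update. A sketch, with a hypothetical new parent id:

import requests

# Re-parent folder 123 by atomically setting parent_id on its single folder
# record; none of the documents in the subtree need to be touched.
update = [{"id": "folder-123", "parent_id": {"set": "999"}}]
requests.post(
    "http://localhost:8983/solr/files/update?commit=true",
    json=update,
).raise_for_status()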


Hooray! We can now handle highly relational data inside of a search index! And we are only scratching the surface of what's possible. Do you have a graph problem? Let us know, we'd love to hear from you!

Graph Query Parser Syntax Reference Guide:

{!graph param="value" … }

Parameter        Default  Description
from                      Field containing the node id.
to                        Field containing the edge id(s).
maxDepth         -1       The number of hops to traverse from the root of the graph. -1 means traverse until all edges and documents have been collected. maxDepth=1 gives behavior similar to a JOIN.
traversalFilter  null     An arbitrary query string to apply at each hop of the traversal.
returnRoot       true     true|false. Indicates whether the documents matching the root query should be returned.
leafNodesOnly    false    true|false. Indicates that only documents with no value in the "to" field should be returned.
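
For example, combining a few of these parameters, the query below (using the same field names as above) would return only the direct child folders of folder 123: maxDepth=1 limits the traversal to one hop, and returnRoot=false excludes the starting folder itself.

{!graph from="folder_id" to="parent_id" maxDepth=1 returnRoot=false traversalFilter="table:folder"}folder_id:123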