Jackrabbit 2.x / CQ5
Back in Jackrabbit 2.x, and therefore in CQ/AEM 5.x, everything was indexed by default unless you stated otherwise.
This meant that every time you ran a query, Lucene was there to serve an indexed answer.
In this scenario it didn’t really matter what property names you used for your application or whether you defined additional node types.
The advantage was that an index was almost always available to serve your query, and you didn’t have to think about it.
On the other hand, we all know that the bigger the index is, the slower it will be in serving you the result set, as it simply has more data to analyse.
Jackrabbit Oak / AEM6
Nowadays Apache Jackrabbit Oak, aka Jackrabbit 3.x, is the foundation of AEM6.
As opposed to JR2, in Oak almost nothing is indexed by default. This means that if you take a vanilla Oak and run a query, there is a very good chance you will traverse the repository (depending on your query).
This has the advantage that you can create very dedicated indexes that will overall perform better as they will be as tailored as possible to your query.
The disadvantages are that you’ll have to define each index and that you’ll have to know how to fine-tune your queries to get the most out of this approach.
Without going deep into the configuration of each available index type, I think the two main properties you’ll end up tuning for better performance are:
- propertyNames
- declaringNodeTypes
The first defines which property your index is going to index, while the second restricts the index to a specific node type. In other words, the conditions for a node to be included in an index are:
$nodetype in ($declaringNodeTypes) AND $property = $propertyNames
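The condition above can be sketched as a small predicate. This is only a toy model in plain Python, not Oak's implementation: nodes are represented as dicts, node type inheritance and mixins are ignored, and the helper name `is_indexed` is made up for illustration. The node type and property names are borrowed from the example later in this post.

```python
# Toy model of the inclusion rule; this is NOT how Oak is implemented.
# Assumption: a node is a plain dict and only its primary type is checked.

def is_indexed(node, declaring_node_types, property_names):
    """Return True if the node would be included in the index."""
    type_matches = node["jcr:primaryType"] in declaring_node_types
    has_indexed_property = any(name in node for name in property_names)
    return type_matches and has_indexed_property


comment = {"jcr:primaryType": "my:comments", "myLastModified": "2014-06-01"}
page = {"jcr:primaryType": "cq:Page", "myLastModified": "2014-06-01"}
bare_comment = {"jcr:primaryType": "my:comments"}

for node in (comment, page, bare_comment):
    included = is_indexed(node, {"my:comments"}, {"myLastModified"})
    print(node["jcr:primaryType"], included)
```

Only the first node satisfies both halves of the condition: the second has the wrong node type and the third lacks the indexed property.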
Caveats
- indexes on more than one property are not supported (yet)
- an index cannot serve conditions where you ask something like WHERE property IS NULL.
This takes us to the very topic of this post: be careful how you use your properties and how you structure your queries.
Remember the rule: the smaller the index the more efficient the query.
Let’s see, with an example, how important a property and a node type are.
If you have a custom application in which you want to extract nodes modified after a specific date, one way of doing so would be
SELECT * FROM [nt:base] WHERE [jcr:lastModified] >= CAST('...' AS DATE)
This query is very bad: it can’t really make use of any index.
Let’s say you create an index on jcr:lastModified. The index itself will be almost as big as the repository, because by default in AEM (almost?) every node has mix:lastModified.
A better way would be
SELECT * FROM [nt:base] WHERE [myLastModified] >= CAST('...' AS DATE)
This will allow you to define an index on the property myLastModified, which you know will contain only your application data. But we can do even better.
Let’s assume you have a very sparse and large content structure, so you can’t apply path filters, and on the other hand you don’t want to create tons of myLastModified-like properties to address different aspects of your information.
Let’s assume then, for the sake of example, that you categorise your data into:
- comments
- news
- articles.
What you could do is create three different node types:
- my:comments
- my:news
- my:articles
Now you can define three different, very dedicated indexes:
- declaringNodeTypes = my:comments AND propertyNames = myLastModified
- declaringNodeTypes = my:news AND propertyNames = myLastModified
- declaringNodeTypes = my:articles AND propertyNames = myLastModified
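As a rough sketch, the three definitions above could be written down as follows. Each dict stands for the properties of one Oak property index definition node (`oak:QueryIndexDefinition` with `type = property`) that would live under /oak:index. The index node names and the helper `property_index_definition` are hypothetical, and plain Python dicts are used here only to keep the example self-contained; in a real repository you would create these nodes with whatever tooling you normally use.

```python
# Sketch of three Oak property index definitions as plain dicts.
# Assumption: index node names (the dict keys) are invented for illustration.

def property_index_definition(node_type, property_name):
    """Properties of one property index definition node."""
    return {
        "jcr:primaryType": "oak:QueryIndexDefinition",
        "type": "property",
        "propertyNames": [property_name],
        "declaringNodeTypes": [node_type],
        "reindex": True,  # asks Oak to build the index content
    }


index_definitions = {
    "myCommentsLastModified": property_index_definition("my:comments", "myLastModified"),
    "myNewsLastModified": property_index_definition("my:news", "myLastModified"),
    "myArticlesLastModified": property_index_definition("my:articles", "myLastModified"),
}

for name, definition in index_definitions.items():
    print("/oak:index/" + name,
          definition["declaringNodeTypes"],
          definition["propertyNames"])
```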
A resulting query will look like
SELECT * FROM [my:comments] WHERE [myLastModified] >= CAST('...' AS DATE)
Actually, in the example above, assuming your nodes come with mix:lastModified, as soon as you create a custom node type you could simply have used jcr:lastModified, as the two indexes will be (I expect) the same size. You can repeat the exercise above with any property name, such as colours, size, tags, etc.
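To make the size argument a bit more tangible, here is a toy count of how many nodes each index would have to hold. The repository contents and node counts are invented, nodes are plain dicts, and the helpers `make_nodes` and `index_size` are hypothetical; real index sizes depend on much more than the number of entries.

```python
# Toy comparison of index sizes; all numbers are made up for illustration.

def make_nodes(count, primary_type, *property_names):
    """Create `count` fake nodes of one type carrying the given properties."""
    return [
        {"jcr:primaryType": primary_type,
         **{name: "2014-06-01" for name in property_names}}
        for _ in range(count)
    ]


repository = (
    make_nodes(1000, "nt:unstructured", "jcr:lastModified")
    + make_nodes(50, "my:comments", "jcr:lastModified", "myLastModified")
    + make_nodes(30, "my:news", "jcr:lastModified", "myLastModified")
    + make_nodes(20, "my:articles", "jcr:lastModified", "myLastModified")
)


def index_size(nodes, property_name, declaring_node_type=None):
    """Count the nodes an index on `property_name` would include."""
    return sum(
        1 for node in nodes
        if property_name in node
        and (declaring_node_type is None
             or node["jcr:primaryType"] == declaring_node_type)
    )


print(index_size(repository, "jcr:lastModified"))                 # 1100
print(index_size(repository, "myLastModified"))                   # 100
print(index_size(repository, "myLastModified", "my:comments"))    # 50
print(index_size(repository, "jcr:lastModified", "my:comments"))  # 50
```

In this toy setup the last two counts are equal, which matches the closing remark: once the index is restricted to a custom node type, the choice between jcr:lastModified and a custom property no longer changes how many nodes are indexed.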