JOINs at scale: Comparing Hypi’s ArcQL to SQL

Introduction

A DAO which uses the Arc io.hypi.arc.codegen.descriptor model to represent GraphQL models/relationships across Ignite caches. Every non-scalar GraphQL type creates one Ignite cache and one Lucene index. Each record of a type creates one lucene org.apache.lucene.document.Document.

Relationships (one to one and one to many)

Relationships are created by creating a lucene org.apache.lucene.document.Document for each link. i.e. if type A has a field b of type B then a org.apache.lucene.document.Document is created in B’s index with two keys
ForeignRef.ARC_FOREIGN_KEY_REF is added which stores the ID of the A and
ForeignRef.ARC_PK_FIELD_NAME is added which stores the ID of the B being referenced
The end result is that two writes are done to B. The first is the ForeignRef and the second is record for B. Since two records are written to B’s index there’s no need for a transfer cache to write references to and pull from in the io.hypi.arc.lucene.LuceneIndexingSpi. The ForeignRef written uses an arbitrary but unique ID for its key so as not to collide with any keys in B’s cache.

Query Generation

As a result of the ForeignRef in B’s index, when search(ReqCtx) is run, it wraps the original query in a org.apache.lucene.search.BooleanQuery that includes a org.apache.lucene.search.BooleanClause.Occur.MUST on the original query and a org.apache.lucene.search.BooleanClause.Occur.MUST_NOT on a generated query which excludes documents in B’s index that have the ForeignRef.ARC_FOREIGN_KEY_REF field.

If the lucene index becomes corrupt or otherwise wrong, an index(ReqCtx, LuceneIndexingSpiWrapper) query must be done on A to rebuild the indices. This can be extremely taxing on resources and should be done with care and rarely.

During resolution of the foreign fields the query is a org.apache.lucene.search.BooleanQuery again that org.apache.lucene.search.BooleanClause.Occur.MUST match the user’s sub-query ForeignRef.ARC_FOREIGN_KEY_REF org.apache.lucene.search.BooleanClause.Occur.MUST match.

Foreign key joins

This implementation makes use of Lucene’s org.apache.lucene.search.join.JoinUtil. First, a query is executed on B’s index to find all foreign references for the current doc. Lucene’s JoinUtil gathers all matching refs into a query.

Second, the query from the join is used in a org.apache.lucene.search.BooleanQuery with a org.apache.lucene.search.BooleanClause.Occur.MUST on B’s index and another org.apache.lucene.search.BooleanClause.Occur.MUST with the user’s sub-query, paging, limits etc. If no user sub-query is given then the second query is a match-all.

Note that this is the same whether the relationship is one to one or one to many. All foreign references work the same.

Joins and sub-queries

Foreign key relationships can be queried with dot syntax e.g a.b.c where a is the source, b is a foreign object being referenced and c is a field on the foreign object. This creates implicit join queries.

The default io.hypi.arc.lucene.ArcQLParser sub-query on foreign keys is ForeignRef.ARC_DEFAULT_FOREIGN_FILTER, which returns the 10 newest entries of B.

One other important factor in the EntityGraph is how sub-queries are done. Two options are either to perform a analysis after io.hypi.arc.lucene.ArcQLParser has generated a query or to generate multiple queries, one for each field as the ArcQL tree is traversed.

In the second model, the io.hypi.arc.codegen.descriptor.SchemaDescriptor is used to understand what is or isn’t a foreign key.
So what does it mean to join? Let’s revise what joins mean in SQL and the types of JOINs that are typically available.

CROSS JOIN

A CROSS JOIN is a cartesian product, i.e. a product as in “multiplication”. The mathematical notation uses the multiplication sign to express this operation: A × B
CROSS JOIN combines all results from one table with all results from another table. If either table is empty then no results i.e times anything by 0 is 0 since it’s a cartesian product

INNER JOIN

(Equi when = is used and Theta when other comparators are used)
Builds on a CROSS join and allows filtering the rows by some predicate. So it’s a filtered CROSS JOIN as such if either table is empty then there are no results.

NATURAL JOIN and USING keyword

A NATURAL JOIN takes columns from both tables with the same name and uses them with the = comparator i.e. it is an equi join which saves you having to write on a.x = b.x

OUTER JOIN

LEFT OUTER JOIN

Similar to INNER join except, if no results are found in the right table it returns null for the rows which are missing instead of no rows at all. A left outer join is an INNER JOIN with a UNION

RIGHT OUTER JOIN

Exact opposite, retains rows from the right table even if there are no matching rows in the left table.

FULL OUTER JOIN

Retains rows from both tables with null on either side where there is no matching row on the other.

SEMI JOIN

(doesn’t have key word in ANSI SQL)
Achieved using an exist with a sub-query, the same query can be changed to use the IN keyword to get the same results SELECT * FROM actor a WHERE EXISTS ( SELECT * FROM film_actor fa WHERE a.actor_id = fa.actor_id )
Find actors that have been in at least one film i.e. we want the actor rows but not the film rows its just that there happens to be a relationship between them.

Others

ANTI JOIN, opposite of SEMI JOIn achieved using NOT IN but be careful as

NOT IN is not equivalent to NOT EXIST LATERAL JOIN… MULTISET

EntityGraph LEFT OUTER JOIN

Hypi implements an equivalent to SQL LEFT OUTER JOINs i.e. if no records are found in the referenced type results are returned from the left and the field is returned as an empty array or null for one to one references.
By Example
Take this as the model for the examples.

GraphQL Query

ArcQL

Selecting primitives only, no join is done

In frist example, don’t select image repeat example

next select image and show it adds a left join on Image in next example

select tweets.media.url and show what that join looks like

next do AND and OR with 1 to N single fields,
and single field to group
and group to group

see if we can arrive at a syntax that leads to 1 or more sub queries on a field with non-ambiguous way of doing OR and ANDs

potentially a way to force a field assertion to be treated as an INNER JOIN

instead of LEFT OUTER JOIN so in a.b = 1 on vals with ref to b = 1 are returned from left table
findUser
username = ‘courtney’
SELECT * FROM User WHERE username = ‘courtney’ ORDER BY A.hypi.created LIMIT n
1.0

findUser
tweets.content ~ ‘low code’
SELECT * FROM User LEFT OUTER JOIN B ON A.hypi.id = B.hypi.id WHERE B.i = 10 ORDER BY B.hypi.created LIMIT n

Share on facebook
Facebook
Share on twitter
Twitter
Share on linkedin
LinkedIn
follow:

Related Posts

An Introduction to Kubernetes

 Kubernetes is an open-source platform that adds automation to containerization processes.This way you can eliminate