Can joins be parallelized?

Question

Suppose we want to join two relations on a predicate. Is this in NC?

I realize that a proof of it not being in NC would amount to a proof that $P\not=NC$, so I'd accept evidence of it being an open problem as an answer.

I'm interested in the general case as well as specific cases (e.g. perhaps with some specific data structure it can be parallelized).

EDIT: to bring some clarifications from the comments into this post:

We could consider an equijoin $A.x = B.y$. On a single processor, a hash-based algorithm runs in $O(|A|+|B|)$, and this is the best we can do since we have to read each set.
If the predicate is a "black box" that has to be checked on each pair, there are $ab$ pairs (writing $a = |A|$ and $b = |B|$), and each one could be in the result or not, so there are $2^{ab}$ possibilities. Checking a pair divides the possibilities in half at best, so at least $ab$ checks are needed and the best we can do is $O(ab)$.
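As a rough single-processor illustration of the two cases above, here is a minimal Python sketch. The function names and the `key_a`/`key_b`/`pred` parameters are made up for this example and are not from the original post. The hash join runs in expected time linear in the inputs plus the size of the output, while the black-box version evaluates the predicate on every pair.

```python
from collections import defaultdict

def hash_equijoin(A, B, key_a, key_b):
    """Equijoin key_a(a) == key_b(b) using a hash index on A (hypothetical helper)."""
    index = defaultdict(list)
    for a in A:                          # build phase: one pass over A
        index[key_a(a)].append(a)
    # probe phase: one pass over B, emitting every matching pair
    return [(a, b) for b in B for a in index.get(key_b(b), [])]

def nested_loop_join(A, B, pred):
    """Black-box join: evaluates pred on all |A| * |B| pairs (hypothetical helper)."""
    return [(a, b) for a in A for b in B if pred(a, b)]

A = [(1, "a"), (2, "b"), (2, "c")]
B = [(2, "x"), (3, "y")]
by_hash = hash_equijoin(A, B, lambda a: a[0], lambda b: b[0])
by_loop = nested_loop_join(A, B, lambda a, b: a[0] == b[0])
```

Both calls should return the same set of pairs; only the amount of work differs.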

Could either of these (or some third type of join) be improved to $\log^k n$ on multiple processors?

If this question is motivated by a practical problem, keep in mind that NC might not be the most suitable notion of "parallelisable". – Raphael, Jun 05 '12 at 18:57
@Raphael: it's not, but could you link to something about why? I can ask this as a separate question if that's more appropriate. – Xodarap, Jun 05 '12 at 19:28
It is not clear to me what you are asking. What is the base relational database query language to which you are adding the join operator? Or are you asking about the complexity of queries that contain only join operators? Or is your real question whether it is possible to run join operators "in parallel" to achieve better time complexity (similar to the way that, say, AND can be done in parallel)? Also note that (safe) SQL queries correspond to FOL(Count). – Kaveh, Jun 05 '12 at 20:35
Or are you asking what the best known upper and lower bounds (complexity classes) are on the complexity of computing the join, given two relational databases as input? – Kaveh, Jun 05 '12 at 20:40
@Kaveh: is this clearer: given $A, B$ and a predicate $P$ which runs in constant time for any two tuples $a, b$, how fast can we find $C = \{(a,b) : P(a,b), a\in A, b\in B\}$? – Xodarap, Jun 05 '12 at 20:43
How is $P$ given? What does it mean that it runs in constant time? – Kaveh, Jun 05 '12 at 20:45
@Kaveh: Let us assume it's a simple equijoin ($A.x = B.y$). But I am interested in the general problem, so I'd be interested in remarks on how other types of predicates may be better or worse. And what I mean by "runs in constant time" is that we can determine $P(a,b)$ in $O(1)$ for any $a,b$. – Xodarap, Jun 05 '12 at 20:53
@Xodarap: You might find the answers and comments on this question of mine instructive; I know I did. Kruskal et al. (1990) is also a good read. – Raphael, Jun 06 '12 at 09:22

Xodarap · Accepted Answer · 2012-07-07T01:35:55.073

If each relation has at most $n$ tuples, $n^2$ processors can evaluate the predicate on all $|A|\cdot|B| \le n^2$ pairs in constant depth, so yes, it's in NC.
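A small Python sketch of this idea, assuming a constant-time predicate: every pair is an independent task, so with one worker per pair all predicate evaluations would happen in a single round. The thread pool here only stands in for the $n^2$ processors, and `all_pairs_join` and `workers` are names invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def all_pairs_join(A, B, pred, workers=4):
    """Evaluate pred on every (a, b) pair as independent tasks (illustrative only)."""
    pairs = list(product(A, B))              # |A| * |B| tasks, no dependencies between them
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so flags[i] belongs to pairs[i]
        flags = list(pool.map(lambda ab: pred(*ab), pairs))
    # Gathering the marked pairs is done sequentially here for simplicity.
    return [p for p, keep in zip(pairs, flags) if keep]

result = all_pairs_join([1, 2, 3], [2, 3, 4], lambda a, b: a == b)
```

The sketch shows only that the predicate evaluations do not depend on one another; it is not a model of an actual PRAM or circuit.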

edited Jul 07 '12 at 01:35

answered Jun 19 '12 at 14:51

Xodarap

If you are going to take OR, the depth will be logarithmic. – sdcvvc Jul 07 '12 at 09:56
@sdcvvc: Fair enough. At the extreme you could encode 3SAT in the relational calculus, so this result really only holds if your selections are simple (i.e. constant time). – Xodarap Jul 07 '12 at 16:48
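The logarithmic depth mentioned in the comments can be sketched as a pairwise tree reduction. This is a minimal illustration assuming gates of fan-in 2, for example to decide whether a tuple $a$ has any matching $b$ by taking the OR of $P(a,b)$ over all $b$. The name `tree_or` is hypothetical.

```python
def tree_or(bits):
    """OR the inputs by combining neighbours pairwise; returns (result, rounds)."""
    level = [bool(x) for x in bits] or [False]   # the empty OR is False
    rounds = 0
    while len(level) > 1:
        # one parallel round: each fan-in-2 gate ORs two adjacent values
        level = [any(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

value, depth = tree_or([False] * 7 + [True])     # 8 inputs shrink as 8 -> 4 -> 2 -> 1
```

The number of rounds grows as $\lceil \log_2 m \rceil$ for $m$ inputs, which is still polylogarithmic and therefore consistent with membership in NC.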
