DeAL Tutorial Five Basic Aggregates in Recursion

Five Basic Aggregates in Recursion

DeAL allows the use of the following five aggregates in recursive rules: max, min, mcount, count and sum as discussed next. Although min, max, count and sum define mappings that are non-monotonic w.r.t set containment, their use in rules that have the pre-mappability property does not compromise the declarative semantics of the programs. While people interested in a formal treatment of this topic should consult paper and technical report, we next present a few examples that should enable our reader to verify pre-mappability (PreM for short) in applications of practical interest.

max

Bill of Materials (BOM) is a classical recursive application for traditional Database Management Systems. In BOM, the database contains a large set of assbl(Part, Subpart, Qty) facts which describe how a given part is assembled using various subparts, each in a given quantity. Not all subparts are assembled, basic parts are instead supplied by external suppliers in a given number of days, as per the facts basic(Part, Days). Simple assemblies, such as bicycles, can be put together the very same day in which the last basic part arrives. Thus, the time needed to deliver a bicycle is the maximum of the number of days that the various basic parts require to arrive.

Example: Number of days until delivery

delivery(Part, Days) <- basic(Part, Days).
delivery(Part, max<Days>) <- assbl(Part, Sub), delivery(Sub, Days).
actualDays(Part, Days) <- delivery(part, Days).

An equivalent but less efficient approach consists in removing the max from the second rule and just use it in the last rule. Rather than doing that, we will leave the max in the head of the second rule, and check that PreM holds by checking the effect of adding a max constraint (Sub,max) to the delivery goal in the second rule. Now, such constraint will eliminate (Sub,Days) pairs that are not maximal, and as such they do not produce maximal Days for the corresponding part in the head. Thus the pre-map of the max constraint does not change the mapping defined by this recursive rule and PreM of max holds for this rule.

In the current version of our system, a warning message reminds users that they have the responsibility to ensure that PreM holds, but compiler extensions that check PreM for the program at hand are planned for future releases. The current version of DeAL, support the use of deprecated mmax construct that provides a target for max in the non-recursive rule to expresses the same application as follows (and dual observations hold for mmin):

Example: A deprecated syntax for the previous example.

delivery(Part, mmax<Days>) <- basic(Part, Days).
delivery(Part, mmax<Days>) <- assbl(Part, Sub), delivery(Sub, Days).
actualDays(Part, max<Days>) <- delivery(part, Days).

The next example assumes that we have an acyclic directed graph where the arcs are labelled with the probability that the connection represented by the arc is actually functional. Thus the maximum probability paths between nodes can be computed as follows:

Example: Maximum Probability Path

reach(X, Y, max<P>) <- net(X, Y, P).
reach(X, Z, max<P>) <- reach(X, Y, P1), reach(Y, Z, P2), P = P1 * P2.

In general, PreM of a min or max (i) constraint on a predicate p is check by applying the constraint to p-heads of the rule, and then (ii) to the p-goals in the rules, and asking if (ii) it has changed the mapping defined in (i). Thus, in the above example, the first rule is non-recursive and thus satisfies PreM by definition. For the second rule what is the effect of applying max to both of the goals in the recursive rules, i.e. the effect of only considering the max of P1 and the max of P2 and eliminating the smaller values. Here too, those smaller values cannot produce max value in the head: thus PreM holds and the above program is correct.

min

The min aggregate supports the expression of minimum-cost algorithms.

Example: All Pairs Shortest Paths

spath(X,Y,D) <- arc(X,Y,D).
spath(X,Y,min<D>) <- spath(X,Z,D1), arc(Z,Y,D2), D=D1+D2.

The All Pairs Shortest Paths (APSP) program finds the length of the shorest path between two connected vertices in a directed graph with edge labels. In the first rule, the arc relation denotes edges of the graph where D is the cost from X to Y. The first rule finds the shortest arc from X to Y . The second rule, takes arcs originating in Z and appended them to previously produced paths terminating at Z , where the length of the new arc is D=D1+D2. Observe here that min is pre-mappable to our recursive rule, inasmuch as the enforcement of, say, (X, Z, min<D1>) in the body of the rule will only filter out values that generate non-minimal (X, Y, D) values in the head.

Example: Connected Components of a Graph

cc(A,A) <- arc(A,_).
cc(C,min<B>) <- cc(A,B), arc(A,C).
concomp(countd<A>) <- cc(_,A).

The connected components program above produces the number of connected components in an undirected graph. The arc relation denotes edges in the graph with integer vertex ids. The program works as follows. Initially, a vertex will be assigned its vertex id (first rule). This value will be refined to a lower vertex id, if the vertex is indeed connected to a vertex with a lower id (second rule) and a new cc fact is produced. Rule three is necessary to select only the cc fact for A with the lowest vertex id, which will indicate which component of the graph A belongs to. The last rule counts the number of distinct low vertex ids to determine the count of components. Here too it is simple to see that PreM holds.

count and mcount

The mcount monotonic aggregate performs a continuous count of the number of distinct occurrences and allows us to express many queries that could not be expressed efficiently, or could not be expressed at all, in Datalog with stratified negation or aggregates. For instance, the following query that counts the paths between nodes in a directed acyclic graph was not expressible in Datalog with stratified aggregates (Mumick and Shmueli. How expressive is stratified aggregation? Annals of Mathematics and Artificial Intelligence, 15:407-435, 1995.).

Example: Counting Paths

cpaths(X, Y, mcount<X>) <- arc(X, Y).
cpaths(X, Y, mcount<(Z, C)>) <- cpaths(X, Z, C), arc(Z, Y).
countpaths(X, Y, max<C>) <- cpaths(X, Y, C).

The first rule states that each arc counts as one path between its nodes. The second rule states that if there are C distinct paths from X to Z, then the presence of arc(Z, Y) establishes that there also are C occurrences of distinct paths from X to Y through Z. The mcount<(Z, C)> aggregate in the head performs the continuous count of these occurrences.

Thus, if we have two paths from X to Y, going say through Z' and Z" as we enumerate those going through Z' and then those going through Z" the mcount in the head returns all the values from 1 to their sum. Now if we check PreM for max in this example, we see that it does hold since max from several K triplets sharing the same values X, Z will only preserve one, and mcount in the head will also deliver one value instead of K. Thus in this example mcount cannot be replaced by count. The countpaths application will be re-expressed later by using sum. For now let us consider an example where max can be pre-mapped to mcount.

Example: Join the party once you see that three of your friends have joined

attend(X) <- organizer(X).
attend(X) <- cntfriends(X, Nfx), Nfx >= 3.
cntfriends(Y, mcount<X>) <- attend(X), friend(Y, X).
finalcount(Y, max<N>) <- cntfriend(Y, N).

Here we assume that organizer of the party will attend, no matter what; however other people will attend if the number of their friends attending is greater or equal to 3, i.e., Nfx >= 3.

mcount is monotonic w.r.t. set-containment and thus it can be used without restrictions in recursive rules. However in situations where max can be pre-mapped to recursive rules max and mcount can be combined to yield the normal count aggregate, which often results in a faster computation. Thus our previous application can be written as follows:

Example: Join the party once you see that three of your friends have joined

attend(X) <- organizer(X).
attend(X) <- cntfriends(X, Nfx), Nfx >= 3.
cntfriends(Y, count<X>) <- attend(X), friend(Y, X).
finalcount(Y, N) <- cntfriend(Y, N).

Observe here that the pre-map of max to the third rule has changed mcount to count in the third rule, but has not changed the meaning of the program since the second rule is satisfied iff it is satisfied for the max value of Nfx.

sum

The sum aggregate can be defined as the max msum whose semantics is based on that of msum and then mcount. Indeed, while mcount presented with a positive integer C treats this as a single occurrence, msum treats C a short- hand for the C occurrences of the positive integers that range from 1 to C, whereby msum can be computed as mcount on these C integers. Now, since mcount and the rules generating the C integers are monotonic msum is monotonic and thus can be freely used in recursive definitions. Therefore, we now have the following alternative program for counting paths:

Example: Counting Paths using msum: scpaths sums the counts of paths.

scpaths(X, Y, 1) <- arc(X, Y).
scpaths(X, Y, msum<(Z, C)>) <- scpaths(X, Z, C), arc(Z, Y).
scountpaths(X, Y, max<C>) <- scpaths(X, Y, C).

As in the case of mcount, we need to account for the contributions of different Zs: thus we use a similar syntax, i.e. msum<(Z, C)>, to specify that multiple occurrences of the same value of C generated by different Zs are considered independently in the computation of sum. Observe now that the pre-mappability of the max aggregate to the C value in the body of the recursive rule, might eliminate non-maximal values of C, but this does change the result produced by the head by msum, since this is defined as mcount of all integers up to the max. Since the result of msum does not change, neither does the result of max applied to it. Therefore, PreM holds and we can transfer max into our recursive rule and obtain the following optimized program:

Example: Counting Paths using sum.

scpaths(X, Y, 1) <- arc(X, Y).
scpaths(X, Y, sum<(Z, C)>) <- scpaths(X, Z, C), arc(Z, Y).

Since the implementation of msum is prone to inefficiencies and its can be replaced that of mcount, msum is henceforth deprecated and will not be supported in future releases. On the other hand, the use of sum enables the efficient expression of many useful applications including those that apply sum to positive floating-point numbers (since they can be viewed as the numerators over a common denominator defined by the length of the exponent used in the mantissa+exponent representation of floating-point numbers).

The Company Control application was proposed in Mumick, Pirahesh and Ramakrishnan, The magic of duplicates and aggregates. In VLDB 1990. Companies can purchase shares of other companies; in addition to the fraction of total shares that a company owns directly, a company A controls the shares controlled by a company B when A has a controlling majority (> .50 of the total number) of B's shares. The fact ownedshares(a, b, .51) indicates company a directly owns 51% of company b.

Example: Company Control

cshares(C2, C3, P) <- owned_shares(C2, C3, P).
cshares(C1, C3, sum<(C2, P)>) <- bought(C1, C2), cshares(C2, C3, P).
bought(C1, C2) <- cshares(C1, C2, P), C1 != C2, P > 0.50.

To confirm that our program has the correct semantics, we must assume that we originally had a program using msum that has been turned into sum by the application of the max constraint to the cshare heads; then, PreM is verified by applying the same constraint to each cshare goal and checking that this does not affect the result. In the second rule, the elimination of all the values of P smaller than the max will not affect the computation of msum, which by definition reinstates all the positive integers up to P, and this also implies that the max of msum remains unaffected. In the third rule, we only need to check the condition P>0.5 for the max value of P produced by the goal cshares(C1, C2, P).


Last Updated 2014