# DeAL Tutorial Five Basic Aggregates in Recursion

### Five Basic Aggregates in Recursion

DeAL allows the use of the following five aggregates in recursive rules: `max`, `min`, `mcount`, `count` and `sum` as discussed next. Although `min`, `max`, `count` and `sum` define mappings that are non-monotonic w.r.t set containment, their use in rules that have the pre-mappability property does not compromise the declarative semantics of the programs. While people interested in a formal treatment of this topic should consult paper and technical report, we next present a few examples that should enable our reader to verify pre-mappability (PreM for short) in applications of practical interest.

`max`

Bill of Materials (BOM) is a classical recursive application for traditional Database Management Systems. In BOM, the database contains a large set of `assbl(Part, Subpart, Qty)` facts which describe how a given part is assembled using various subparts, each in a given quantity. Not all subparts are assembled, basic parts are instead supplied by external suppliers in a given number of days, as per the facts `basic(Part, Days)`. Simple assemblies, such as bicycles, can be put together the very same day in which the last basic part arrives. Thus, the time needed to deliver a bicycle is the maximum of the number of days that the various basic parts require to arrive.

Example: *Number of days until delivery*

delivery(Part, Days) <- basic(Part, Days). delivery(Part, max<Days>) <- assbl(Part, Sub), delivery(Sub, Days). actualDays(Part, Days) <- delivery(part, Days).

An equivalent but less efficient approach consists in removing the max from the second rule and just use it in the
last rule. Rather than doing that, we will leave the max in the head of the second rule, and check that *PreM*
holds by checking the effect of adding a `max` constraint `(Sub,max )` to the delivery goal in the second rule.
Now, such constraint will eliminate

`(Sub,Days)`pairs that are not maximal, and as such they do not produce maximal Days for the corresponding part in the head. Thus the pre-map of the max constraint does not change the mapping defined by this recursive rule and

*PreM*of max holds for this rule.

In the current version of our system, a warning message reminds users that they have the responsibility to ensure
that PreM holds, but compiler extensions that check *PreM* for the program at hand are planned for
future releases. The current version of DeAL, support the use of deprecated `mmax` construct that provides a
target for `max` in the non-recursive rule to expresses the same application as follows (and dual observations hold
for `mmin`):

Example: *A deprecated syntax for the previous example.*

delivery(Part, mmax<Days>) <- basic(Part, Days). delivery(Part, mmax<Days>) <- assbl(Part, Sub), delivery(Sub, Days). actualDays(Part, max<Days>) <- delivery(part, Days).

The next example assumes that we have an acyclic directed graph where the arcs are labelled with the probability that the connection represented by the arc is actually functional. Thus the maximum probability paths between nodes can be computed as follows:

Example: *Maximum Probability Path*

reach(X, Y, max<P>) <- net(X, Y, P). reach(X, Z, max<P>) <- reach(X, Y, P1), reach(Y, Z, P2), P = P1 * P2.

In general, PreM of a `min` or `max` (i) constraint on a predicate p is check by applying the constraint to p-heads of the rule, and then (ii) to the p-goals in the rules, and asking if (ii) it has changed the mapping defined in (i). Thus, in the above example, the first rule is non-recursive and thus satisfies *PreM* by definition. For the
second rule what is the effect of applying max to both of the goals in the recursive rules, i.e. the effect of only
considering the max of P1 and the max of P2 and eliminating the smaller values. Here too, those smaller values
cannot produce max value in the head: thus *PreM* holds and the above program is correct.

`min`

The `min` aggregate supports the expression of minimum-cost algorithms.

Example: *All Pairs Shortest Paths*

spath(X,Y,D) <- arc(X,Y,D). spath(X,Y,min<D>) <- spath(X,Z,D1), arc(Z,Y,D2), D=D1+D2.

The All Pairs Shortest Paths (APSP) program finds the length of the shorest path between two connected vertices
in a directed graph with edge labels. In the first rule, the arc relation denotes edges of the graph where `D` is the
cost from `X` to `Y`. The first rule finds the shortest arc from `X` to `Y` . The second rule, takes arcs originating in `Z` and
appended them to previously produced paths terminating at `Z` , where the length of the new arc is `D=D1+D2`.
Observe here that `min` is pre-mappable to our recursive rule, inasmuch as the enforcement of, say, `(X, Z, min<D1>)` in the body of the rule will only filter out values that generate non-minimal `(X, Y, D)` values in the head.

Example: *Connected Components of a Graph*

cc(A,A) <- arc(A,_). cc(C,min<B>) <- cc(A,B), arc(A,C). concomp(countd<A>) <- cc(_,A).

The connected components program above produces the number of connected components in an undirected graph.
The arc relation denotes edges in the graph with integer vertex ids. The program works as follows. Initially, a
vertex will be assigned its vertex id (first rule). This value will be refined to a lower vertex id, if the vertex is
indeed connected to a vertex with a lower id (second rule) and a new `cc` fact is produced. Rule three is necessary
to select only the `cc` fact for `A` with the lowest vertex id, which will indicate which component of the graph `A`
belongs to. The last rule counts the number of distinct low vertex ids to determine the count of components. Here too it is simple to see that *PreM* holds.

`count` and `mcount`

The `mcount` monotonic aggregate performs a continuous count of the number of distinct occurrences and allows us to express
many queries that could not be expressed efficiently, or could not be expressed at all, in Datalog with stratified negation or
aggregates. For instance, the following query that counts the paths between nodes in a directed acyclic graph was not expressible
in Datalog with stratified aggregates (Mumick and Shmueli. How expressive is stratified aggregation? *Annals of Mathematics and Artificial Intelligence*, 15:407-435, 1995.).

Example: *Counting Paths*

cpaths(X, Y, mcount<X>) <- arc(X, Y). cpaths(X, Y, mcount<(Z, C)>) <- cpaths(X, Z, C), arc(Z, Y). countpaths(X, Y, max<C>) <- cpaths(X, Y, C).

The first rule states that each `arc` counts as one path between its nodes. The second rule states that if there are `C`
distinct paths from `X` to `Z`, then the presence of `arc(Z, Y)` establishes that there also are `C` occurrences of distinct paths
from `X` to `Y` through `Z`. The `mcount<(Z, C)>` aggregate in the head performs the continuous count of these occurrences.

`mcount`in the head returns all the values from 1 to their sum. Now if we check PreM for max in this example, we see that it does hold since max from several K triplets sharing the same values X, Z will only preserve one, and

`mcount`in the head will also deliver one value instead of K. Thus in this example

`mcount`cannot be replaced by

`count`. The countpaths application will be re-expressed later by using

`sum`. For now let us consider an example where

`max`can be pre-mapped to

`mcount`.

Example: *Join the party once you see that three of your friends have joined*

attend(X) <- organizer(X). attend(X) <- cntfriends(X, Nfx), Nfx >= 3. cntfriends(Y, mcount<X>) <- attend(X), friend(Y, X). finalcount(Y, max<N>) <- cntfriend(Y, N).

Here we assume that organizer of the party will attend, no matter what; however other people will attend if the number of their friends attending is greater or equal to 3, i.e., Nfx >= 3.

`mcount`
is monotonic w.r.t. set-containment and thus it can be used without restrictions in recursive
rules. However in situations where max can be pre-mapped to recursive rules max and `mcount` can be combined
to yield the normal `count` aggregate, which often results in a faster computation. Thus our previous application
can be written as follows:

Example: *Join the party once you see that three of your friends have joined*

attend(X) <- organizer(X). attend(X) <- cntfriends(X, Nfx), Nfx >= 3. cntfriends(Y, count<X>) <- attend(X), friend(Y, X). finalcount(Y, N) <- cntfriend(Y, N).

Observe here that the pre-map of `max` to the third rule has changed `mcount` to `count` in the third rule, but has
not changed the meaning of the program since the second rule is satisfied iff it is satisfied for the max value of
Nfx.

`sum`

The `sum` aggregate can be defined as the max `msum` whose semantics is based on that of `msum` and then `mcount`.
Indeed, while `mcount` presented with a positive integer C treats this as a single occurrence, `msum` treats C a short-
hand for the C occurrences of the positive integers that range from 1 to C, whereby `msum` can be computed as
`mcount` on these C integers. Now, since `mcount` and the rules generating the C integers are monotonic `msum` is
monotonic and thus can be freely used in recursive definitions. Therefore, we now have the following alternative
program for counting paths:

Example: *Counting Paths using msum: scpaths sums the counts of paths.*

scpaths(X, Y, 1) <- arc(X, Y). scpaths(X, Y, msum<(Z, C)>) <- scpaths(X, Z, C), arc(Z, Y). scountpaths(X, Y, max<C>) <- scpaths(X, Y, C).

As in the case of `mcount`, we need to account for the contributions of different Zs: thus we use a similar syntax,
i.e. msum<(Z, C)>, to specify that multiple occurrences of the same value of C generated by different Zs are
considered independently in the computation of sum. Observe now that the pre-mappability of the `max` aggregate to
the C value in the body of the recursive rule, might eliminate non-maximal values of C, but this does change the
result produced by the head by `msum`, since this is defined as `mcount` of all integers up to the max. Since the
result of `msum` does not change, neither does the result of max applied to it. Therefore, PreM holds and
we can transfer max into our recursive rule and obtain the following optimized program:

Example: *Counting Paths using sum.*

scpaths(X, Y, 1) <- arc(X, Y). scpaths(X, Y, sum<(Z, C)>) <- scpaths(X, Z, C), arc(Z, Y).

Since the implementation of `msum` is prone to inefficiencies and its can be replaced that of `mcount`, msum is henceforth deprecated and will not be supported in future releases.
On the other hand, the use of sum enables the efficient
expression of many useful applications including those that apply sum to positive floating-point numbers (since
they can be viewed as the numerators over a common denominator defined by the length of the exponent used in
the mantissa+exponent representation of floating-point numbers).

The Company Control application was proposed in Mumick, Pirahesh and Ramakrishnan, The magic of duplicates and aggregates. *In VLDB* 1990.
Companies can purchase shares of other companies; in addition to the fraction of total shares that a company owns
directly, a company A controls the shares controlled by a company B when A has a controlling majority (> .50 of the total number) of B's shares.
The fact `ownedshares(a, b, .51)` indicates company a directly owns 51% of company b.

Example: *Company Control*

cshares(C2, C3, P) <- owned_shares(C2, C3, P). cshares(C1, C3, sum<(C2, P)>) <- bought(C1, C2), cshares(C2, C3, P). bought(C1, C2) <- cshares(C1, C2, P), C1 != C2, P > 0.50.

To confirm that our program has the correct semantics, we must assume that we originally had a program using
`msum` that has been turned into `sum` by the application of the max constraint to the cshare heads; then,
PreM is verified by applying the same constraint to each cshare goal and checking that this does not
affect the result. In the second rule, the elimination of all the values of P smaller than the max will not affect the
computation of `msum`, which by definition reinstates all the positive integers up to P, and this also implies that the
max of `msum` remains unaffected. In the third rule, we only need to check the condition P>0.5 for the max value
of P produced by the goal cshares(C1, C2, P).