# DeAL TutorialMonotonic Aggregates

### Monotonic Aggregates (.deal | .fac)

DeAL supports four monotonic aggregates that can be used freely in recursive definitions. The formal semantics of the `mmax` and `mcount`
aggregate functions are based on the DatalogFS work
which proved that a least-fixpoint semantics, and its equivalent unique minimal model semantics, exist for these.
DatalogFS semantics have been extended in DeAL, to interval semantics that support of the `mmin` aggregate, the dual to `mmax`.
The extended semantics also allow for support of negative numbers. Lastly, DeAL also supports the `msum` aggregate, which performs a monotonic sum.

*For assignments with monotonic aggregates, get latest DeALS Version 0.5.1 (released 10/19/2014) from department servers.*

`mmax`

Bill of Materials (BOM) is a classical recursive application for traditional Database Management Systems. In BOM, the
database contains a large set of `assbl(Part, Subpart, Qty)` facts which describe how a given part is assembled using
various subparts, each in a given quantity. Not all subparts are assembled, basic parts are instead supplied by external
suppliers in a given number of days, as per the facts `basic(Part, Days)`. Simple assemblies, such as bicycles, can
be put together the very same day in which the last basic part arrives. Thus, the time needed to deliver a bicycle is
the maximum of the number of days that the various basic parts require to arrive.

Example: *Number of days until delivery*

delivery(Part, mmax<Days>) <- basic(Part, Days). delivery(Part, mmax<Days>) <- assbl(Part, Sub), delivery(Sub, Days). actualDays(Part, max<Days>) <- delivery(part, Days).

For `mmax` monotonicity is achieved by simply viewing this as a continuous function, which along with producing
the maximum value of days required for a part to arrive, also produces all the positive integers up to and including
the max. In other words, where max might return the pair `(Bolt, 4)`, `mmax` will instead return `(Bolt, 4)`,
`(Bolt, 3)`, `(Bolt, 2)`, `(Bolt, 1)`. Thus if, the recursive computation also produces `(Bolt, 3)`, `(Bolt, 2)`,
`(Bolt, 1)`, this can be discarded because w.r.t. set containment semantics, this is a subset of the previous. Thus only maxima are significant according to
the value-based set semantics of Datalog. Therefore we now have maxima in recursion while preserving the declarative
least-fixpoint semantics of Datalog.

The `mmax` aggregate can be used on arbitrary positive numbers not just integers. For instance, the following program computes the maximum probability
path between two nodes in a network where `net(X, Y, P)` denotes the probability of reaching `Y` starting from `X` is `P`.

Example: *Maximum Probability Path*

reach(X, Y, mmax<P>) <- net(X, Y, P). reach(X, Z, mmax<P>) <- reach(X, Y, P1), reach(Y, Z, P2), P = P1 * P2. maxP(X, Y, max<P>) <- reach(X, Y, P).

`mmin`

The `mmin` aggregate supports the expression of minimum-cost algorithms.

Example: *All Pairs Shortest Paths*

spath(X,Y,mmin<D>) <- arc(X,Y,D). spath(X,Y,mmin<D>) <- spath(X,Z,D1), arc(Z,Y,D2), D=D1+D2. shortestpaths(X,Y,min<D>) <- spath(X,Y,D).

The All Pairs Shortest Paths (APSP) program finds the length of the shorest path between two connected vertices in a directed graph with edge labels. In the first rule,
the `arc` relation denotes edges of the graph where D is the cost from `X` to `Y`. The first rule finds the shortest arc from `X` to `Y`.
In the second rule, `arc`s from `Z` are appended to any previously produced paths terminating at `Z`, the lengths (`D=D1+D2`) are summed to produce
the new path length, and if `D` is indeed a new shortest distance from `X` to `Y` it will become the new value for `X,Y` and a new `spath(X,Y,D)`
fact is produced. Lastly rule three will produce a single fact for each `X,Y` in `spath` with the shortest path length `D`.

Example: *Connected Components of a Graph*

cc(A,mmin<A>) <- arc(A,_). cc(C,mmin<B>) <- cc(A,B), arc(A,C). cc2(A,min<B>) <- cc(A,B). concomp(countd<A>) <- cc2(_,A).

The connected components program above produces the number of connected components in an undirected graph. The `arc` relation denotes edges in the graph with integer vertex ids.
The program works as follows. Initially, a vertex will be assigned its vertex id (first rule). This value will be refined to a lower vertex id, if the vertex is indeed connected to a
vertex with a lower id (second rule) and a new `cc` fact is produced. Rule three is necessary to select only the `cc` fact for `A` with the lowest vertex id, which will indicate which
component of the graph `A` belongs to. The last rule counts the number of distinct low vertex ids to deteremine the count of components.

`mcount`

The `mcount` monotonic aggregate performs a continuous count of the number of distinct occurrences and allows us to express
many queries that could not be expressed efficiently, or could not be expressed at all, in Datalog with stratified negation or
aggregates. For instance, the following query that counts the paths between nodes in a directed acyclic graph was not expressible
in Datalog with stratified aggregates (Mumick and Shmueli. How expressive is stratified aggregation? *Annals of Mathematics and Artificial Intelligence*, 15:407-435, 1995.).

Example: *Counting Paths*

cpaths(X, Y, mcount<X>) <- arc(X, Y). cpaths(X, Y, mcount<(Z, C)>) <- cpaths(X, Z, C), arc(Z, Y). countpaths(X, Y, max<C>) <- cpaths(X, Y, C).

The first rule states that each `arc` counts as one path between its nodes. The second rule states that if there are `C`
distinct paths from `X` to `Z`, then the presence of `arc(Z, Y)` establishes that there also are `C` occurrences of distinct paths
from `X` to `Y` through `Z`. The `mcount<(Z, C)>` aggregate in the head performs the continuous count of these occurrences.

The Company Control application was proposed in Mumick, Pirahesh and Ramakrishnan, The magic of duplicates and aggregates. *In VLDB* 1990.
Companies can purchase shares of other companies; in addition to the fraction of total shares that a company owns
directly, a company A controls the shares controlled by a company B when A has a controlling majority (> .50 of the total number) of B's shares.
The fact `ownedshares(a, b, 51)` indicates company a directly owns 51% of company b.

Example: *Company Control*

cshares(C2, C3, dirct, mmax<P>) <- owned_shares(C2, C3, P). cshares(C1, C3, indirct, mcount<(C2, P)>) <- bought(C1, C2), cshares(C2, C3, _, P). bought(C1, C2) <- cshares(C1, C2, _, P), C1 != C2, P > 50.

`msum`

The `msum` aggregate supports monotonic summation. With `msum`, only positive numbers are permitted to be aggregated.

Example: *Number of Basic Subparts in a Part*

cassb(Part, Sub, mmax<Qty>) <- assbl(Part, Sub, Qty). cbasic(Pno, mcount<(Pno, 1)>) <- basic(Pno, _). cbasic(Part, msum<(Sub,K)>) <- cassb(Part, Sub, Qty), cbasic(Sub, N), K=Qty*N. countbasicsubparts(Prt, max<Qty>) <- cbasic(Prt, Qty).

The example above finds the total number of basic subparts for a part, where basic parts are parts not composed of other parts. The program recursively sums the basic subparts for a part's subparts.
The `basic` predicate holds the basic parts and the `assbl` predicate contains the mapping between part and each subpart and the number of subparts needed. The `cassb` rule is used to ensure the maximum number of subparts for each part is used. The first `cbasic` rule counts 1 for each basic part.
Then, the second `cbasic` rule, to compute the number of basic subparts for a part, multiplies the number of direct subparts for the part (`cassb`) with the number of basic parts computed thus far for the subpart (`cbasic`). Lastly, `countbasicsubparts` returns the maximum count for each part.

Example: *Markov Chain*

p_state(X, mmax<P>) <- p_state_init(X,P). p_state(X, msum<(Y,K)>) <- p_state(Y,C), w_matrix(Y,X,W), K = C * W. rank(X, max<K>) <- p_state(X,K). sum_rank(sum<A>) <- rank(_,A). p_norm(X,Pr) <- sum_rank(SR), rank(X,R), Pr = R / SR.

Monotonic aggregates can be used in many machine learning programs. The example above is the monotonic aggregate variation of a Markov chain, much like PageRank. The intuition behind this program is that the markov chain is computed until no new higher aggregate transition probabilities are derived. Then the results of `rank`
are normalized by `p_norm` to produce the final transition probabilities. `p_state_init` contains the initial transition probability for each node and `w_matrix` denotes the transition probability contribution `W` of node `Y` to node `X`.

#### Alternative Naming

Each monotonic aggregate has an alternate naming to reflect the aggregates' origin from DatalogFS.

mmax | fsmax |

mmin | fsmin |

mcount | fscnt |

msum | fssum |

* If compilation errors occur with 'm' prefixed aggregates, replace it with the equivalent 'fs' agggregate (i.e. replace 'mmax' with 'fsmax').