DeAL TutorialMonotonic Aggregates

Monotonic Aggregates (.deal | .fac)

DeAL supports four monotonic aggregates that can be used freely in recursive definitions. The formal semantics of the mmax and mcount aggregate functions are based on the DatalogFS work which proved that a least-fixpoint semantics, and its equivalent unique minimal model semantics, exist for these. DatalogFS semantics have been extended in DeAL, to interval semantics that support of the mmin aggregate, the dual to mmax. The extended semantics also allow for support of negative numbers. Lastly, DeAL also supports the msum aggregate, which performs a monotonic sum.

For assignments with monotonic aggregates, get latest DeALS Version 0.5.1 (released 10/19/2014) from department servers.

mmax

Bill of Materials (BOM) is a classical recursive application for traditional Database Management Systems. In BOM, the database contains a large set of assbl(Part, Subpart, Qty) facts which describe how a given part is assembled using various subparts, each in a given quantity. Not all subparts are assembled, basic parts are instead supplied by external suppliers in a given number of days, as per the facts basic(Part, Days). Simple assemblies, such as bicycles, can be put together the very same day in which the last basic part arrives. Thus, the time needed to deliver a bicycle is the maximum of the number of days that the various basic parts require to arrive.

Example: Number of days until delivery

delivery(Part, mmax<Days>) <- basic(Part, Days).
delivery(Part, mmax<Days>) <- assbl(Part, Sub), delivery(Sub, Days).
actualDays(Part, max<Days>) <- delivery(part, Days).

For mmax monotonicity is achieved by simply viewing this as a continuous function, which along with producing the maximum value of days required for a part to arrive, also produces all the positive integers up to and including the max. In other words, where max might return the pair (Bolt, 4), mmax will instead return (Bolt, 4), (Bolt, 3), (Bolt, 2), (Bolt, 1). Thus if, the recursive computation also produces (Bolt, 3), (Bolt, 2), (Bolt, 1), this can be discarded because w.r.t. set containment semantics, this is a subset of the previous. Thus only maxima are significant according to the value-based set semantics of Datalog. Therefore we now have maxima in recursion while preserving the declarative least-fixpoint semantics of Datalog.

The mmax aggregate can be used on arbitrary positive numbers not just integers. For instance, the following program computes the maximum probability path between two nodes in a network where net(X, Y, P) denotes the probability of reaching Y starting from X is P.

Example: Maximum Probability Path

reach(X, Y, mmax<P>) <- net(X, Y, P).
reach(X, Z, mmax<P>) <- reach(X, Y, P1), reach(Y, Z, P2), P = P1 * P2.
maxP(X, Y, max<P>) <- reach(X, Y, P).

mmin

The mmin aggregate supports the expression of minimum-cost algorithms.

Example: All Pairs Shortest Paths

spath(X,Y,mmin<D>) <- arc(X,Y,D).
spath(X,Y,mmin<D>) <- spath(X,Z,D1), arc(Z,Y,D2), D=D1+D2.
shortestpaths(X,Y,min<D>) <- spath(X,Y,D).

The All Pairs Shortest Paths (APSP) program finds the length of the shorest path between two connected vertices in a directed graph with edge labels. In the first rule, the arc relation denotes edges of the graph where D is the cost from X to Y. The first rule finds the shortest arc from X to Y. In the second rule, arcs from Z are appended to any previously produced paths terminating at Z, the lengths (D=D1+D2) are summed to produce the new path length, and if D is indeed a new shortest distance from X to Y it will become the new value for X,Y and a new spath(X,Y,D) fact is produced. Lastly rule three will produce a single fact for each X,Y in spath with the shortest path length D.

Example: Connected Components of a Graph

cc(A,mmin<A>) <- arc(A,_).
cc(C,mmin<B>) <- cc(A,B), arc(A,C).
cc2(A,min<B>) <- cc(A,B).
concomp(countd<A>) <- cc2(_,A).

The connected components program above produces the number of connected components in an undirected graph. The arc relation denotes edges in the graph with integer vertex ids. The program works as follows. Initially, a vertex will be assigned its vertex id (first rule). This value will be refined to a lower vertex id, if the vertex is indeed connected to a vertex with a lower id (second rule) and a new cc fact is produced. Rule three is necessary to select only the cc fact for A with the lowest vertex id, which will indicate which component of the graph A belongs to. The last rule counts the number of distinct low vertex ids to deteremine the count of components.

mcount

The mcount monotonic aggregate performs a continuous count of the number of distinct occurrences and allows us to express many queries that could not be expressed efficiently, or could not be expressed at all, in Datalog with stratified negation or aggregates. For instance, the following query that counts the paths between nodes in a directed acyclic graph was not expressible in Datalog with stratified aggregates (Mumick and Shmueli. How expressive is stratified aggregation? Annals of Mathematics and Artificial Intelligence, 15:407-435, 1995.).

Example: Counting Paths

cpaths(X, Y, mcount<X>) <- arc(X, Y).
cpaths(X, Y, mcount<(Z, C)>) <- cpaths(X, Z, C), arc(Z, Y).
countpaths(X, Y, max<C>) <- cpaths(X, Y, C).

The first rule states that each arc counts as one path between its nodes. The second rule states that if there are C distinct paths from X to Z, then the presence of arc(Z, Y) establishes that there also are C occurrences of distinct paths from X to Y through Z. The mcount<(Z, C)> aggregate in the head performs the continuous count of these occurrences.

The Company Control application was proposed in Mumick, Pirahesh and Ramakrishnan, The magic of duplicates and aggregates. In VLDB 1990. Companies can purchase shares of other companies; in addition to the fraction of total shares that a company owns directly, a company A controls the shares controlled by a company B when A has a controlling majority (> .50 of the total number) of B's shares. The fact ownedshares(a, b, 51) indicates company a directly owns 51% of company b.

Example: Company Control

cshares(C2, C3, dirct, mmax<P>) <- owned_shares(C2, C3, P).
cshares(C1, C3, indirct, mcount<(C2, P)>) <- bought(C1, C2), cshares(C2, C3, _, P).
bought(C1, C2) <- cshares(C1, C2, _, P), C1 != C2, P > 50.

msum

The msum aggregate supports monotonic summation. With msum, only positive numbers are permitted to be aggregated.

Example: Number of Basic Subparts in a Part

cassb(Part, Sub, mmax<Qty>) <- assbl(Part, Sub, Qty).
cbasic(Pno, mcount<(Pno, 1)>) <- basic(Pno, _).
cbasic(Part, msum<(Sub,K)>) <- cassb(Part, Sub, Qty), cbasic(Sub, N), K=Qty*N.
countbasicsubparts(Prt, max<Qty>) <- cbasic(Prt, Qty).

The example above finds the total number of basic subparts for a part, where basic parts are parts not composed of other parts. The program recursively sums the basic subparts for a part's subparts. The basic predicate holds the basic parts and the assbl predicate contains the mapping between part and each subpart and the number of subparts needed. The cassb rule is used to ensure the maximum number of subparts for each part is used. The first cbasic rule counts 1 for each basic part. Then, the second cbasic rule, to compute the number of basic subparts for a part, multiplies the number of direct subparts for the part (cassb) with the number of basic parts computed thus far for the subpart (cbasic). Lastly, countbasicsubparts returns the maximum count for each part.

Example: Markov Chain

p_state(X, mmax<P>) <- p_state_init(X,P).
p_state(X, msum<(Y,K)>) <- p_state(Y,C), w_matrix(Y,X,W), K = C * W.
rank(X, max<K>) <- p_state(X,K).

sum_rank(sum<A>) <- rank(_,A).
p_norm(X,Pr) <- sum_rank(SR), rank(X,R), Pr = R / SR.

Monotonic aggregates can be used in many machine learning programs. The example above is the monotonic aggregate variation of a Markov chain, much like PageRank. The intuition behind this program is that the markov chain is computed until no new higher aggregate transition probabilities are derived. Then the results of rank are normalized by p_norm to produce the final transition probabilities. p_state_init contains the initial transition probability for each node and w_matrix denotes the transition probability contribution W of node Y to node X.

Alternative Naming

Each monotonic aggregate has an alternate naming to reflect the aggregates' origin from DatalogFS.

mmaxfsmax
mminfsmin
mcountfscnt
msumfssum

* If compilation errors occur with 'm' prefixed aggregates, replace it with the equivalent 'fs' agggregate (i.e. replace 'mmax' with 'fsmax').


Last Updated 2014