Chapter 3—The ATLaS User Manual
 


Links

1. Introduction
2. ATLaS SQL on Tables
3. UDAs
4. Table Functions
5. Programming in ATLaS
5.1 Recursion
5.2 ROLAPs
5.3 References &Data Structures
5.4 Apriori Algorithm
6.External Functions
6.1 Scalar Functions
6.2 Table Functions
6.3 Built-ins

References

ATLaS Syntax
 Program Structure
 Table Dcl
 SQL Statements
 UDAs
 Table Function

 

 

 

 

User-Defined Aggregates

ATLaS supports the standard five aggregates count, sum, avg, min, and max without the DISTINCT option. But the real power of ATLaS follows from its User-Defined Aggregates (UDAs) discussed next. As a first example, we define an aggregate equivalent to the standard avg aggregate in SQL.

Standard Average The first line of this aggregate function declares a local table, state, to keep the sum and count of the values processed so far. While, for this particular example, state contains only one tuple, it is in fact a table that can be queried and updated using SQL statements and can contain any number of tuples (see later examples). These SQL statements are grouped into the three blocks labelled respectively INITIALIZE, ITERATE, and TERMINATE. To compute the average, the SQL statement in INITIALIZE inserts the value taken from the input stream and sets the count to 1. The ITERATE statement updates the table by adding the new input value to the sum and 1 to the count. The TERMINATE statement returns the final result(s) of computation by INSERT INTO RETURN (to conform to SQL syntax, RETURN is treated as a virtual table; however, it is not a stored table and cannot be used in any other role):

AGGREGATE myavg(Next Int) : Real
{ TABLE state(sum Int, cnt Int);
INITIALIZE : {
INSERT INTO state VALUES (Next, 1);
}
 ITERATE : {
UPDATE state SET sum=sum+Next, cnt=cnt+1;
}
TERMINATE : {
INSERT INTO RETURN SELECT sum/cnt FROM state;
}
}

The basic initialize-iterate-terminate template used to define the average aggregate of SQL-2, can now be used to defined powerful new aggregates required by new database applications.

OnLine AverageFor instance, there is much current interest in online aggregates [2]. Since averages converge toward the correct value well before all the tuples in the set have been visited, we can have an online aggregate that returns the average-so-far every, say, 200 input tuples. (In this way, the user or the calling application can stop the computation as soon as convergence is detected.) Online averages can be expressed in ATLaS as follows:

AGGREGATE online_avg(Next Int) : Real
{ TABLE state(sum Int, cnt Int);
INITIALIZE : {
INSERT INTO state VALUES (Next, 1);
}
ITERATE: {
UPDATE state SET sum=sum+Next, cnt=cnt+1;
INSERT INTO RETURN
SELECT sum/cnt FROM state WHERE cnt % 200 = 0;
}
TERMINATE : {  }
}

Therefore, the online average program has been obtained from the traditional average program by removing the statements from TERMINATE and adding a RETURN statement to ITERATE. Our UDA online_avg takes a stream of values as input and returns a stream of values as output (one every 200 tuples). In this example only one tuple is added to output by the the INSER INTO RETURN statement; in general, however, such statement can produce (a stream of) several tuples. Thus ATLaS UDAs operate as general stream transformers.

ATLaS uses the same basic framework to define both traditional aggregates and non-blocking aggregates. ATLaS UDAs are non-blocking when their TERMINATE clause is either empty or absent.

The typical default semantics for SQL aggregates is that the data is first sorted according to the GROUP-BY attributes; this is a blocking operation. However, ATLaS’s default semantics for UDAs is that the data is pipelined through the INITIALIZE and ITERATE clauses where the input stream is transformed into the output stream: the only blocking operations (if any) are those specified in TERMINATE, and only take place at the end of the computation.

Calling User-Defined Aggregates (UDAs) UDAs are called as any other builtin aggregate. For instance, given a database table employee(Eno, Name, Sex, Dept, Sal), the following statement computes the average salary of employees in department 1024 by their gender:

SELECT Sex, online_avg(Sal)
FROM employee WHERE Dept=1024 GROUP BY Sex;

Thus the results of the selection, defined by Dept= 1024, are pipelined to the aggregate in a stream-like fashion.

SQLCODE This a convenient labor-saving device found in most SQL systems, that comes very handy for the ATLaS programmer who wants to correlate a statement with the next. SQLCODE is set to a positive value when the last statement had a null effect, and to zero otherwise. Thus to tell the user that no employee was found in department 1024, we can modify the previous program as follows:

SELECT Sex, online_avg(Sal)
FROM employee WHERE Dept=1024 GROUP BY Sex;
select ’Nobody found in that department’
where SQLCODE >0;

In the last statement, the predicates in the WHERE clause controls its conditional execution, in a fashion similar to that of the IF clauses in a procedural programming language. In fact, the ATLaS compiler recognizes, and optimizes execution of, such conditional predicates.

 

 

Minima: Points and Values. In the next Example, we have a sequence of point-value pairs, and we define a minpair aggregate that returns the point where a minimum occurs along with its value at the minimum.

AGGREGATE minpair(iPoint Int, iValue Int): (mPoint Int, mValue Int)
{ TABLE mvalue(value Int) MEMORY; TABLE mpoints(point Int) MEMORY;
INITIALIZE: {
INSERT INTO mvalue VALUES (iValue);
INSERT INTO mpoints VALUES(iPoint);
}
ITERATE: {
UPDATE mvalue SET value = iValue WHERE iValue < value;
DELETE FROM mpoints WHERE SQLCODE = 0;
INSERT INTO mpoints SELECT iPoint FROM mvalue
WHERE iValue =mvalue.value;
}
TERMINATE: {
INSERT INTO RETURN SELECT point, value FROM mpoints, mvalue;
}
}

Here have used two internal tables: the mvalue table holds, as its only entry, the current min value, while points holds all the points where this value occurs. In the ITERATE statement we have used SQLCODE to ‘remember’ if the previous statement updated mvalue; this is the situation in which the old value was larger than the new one and the old points must be discarded.

Then, the last statement in ITERATE adds the new iPoint to mpoints if the input value is equal to the current min value. In the UDA definition the formal parameters of the UDA function are treated as constants in the SQL statements. Thus, this third INSERT statement adds the constant value of iPoint to the mpoints relation, provided that iValue is the same as the value in mvalue--thus the FROM and WHERE clauses operate here as conditionals. The RETURN statement returns the final list of min pairs as a stream.

For instance, say that we have a time series containing the daily closing prices of certain stocks arranged in temporal sequence (i.e. the table stock_prices, below). Then the following program computes the local minima for each stock:

/* The declaration of AGGREGATE minpair should go here*/
TABLE stock_prices(Day Int, Stock char(4), Cprice Real)
       source ’D:\mydabase\stock_prices’
select Stock, minpair(Day, Stock) --> iPoint, minpair(Day, Stock)--> iValue
from stock_prices
group by Stock ordered by Stock, minpair(Day, Stock) --> iPoint

Observe the use of “-->” to identify the different components in the two-column tuples returned by the aggregate minpair. Since temporal data types are not yet supported in the current ATLaS version, we are using integers to represent dates: thus May, 27, 1999 is represented as 19990527.

The next table summarizes the syntax for declaring new aggregates.



Figure 3.1: The declaration of UDAs

Initializing Tables and Combining Blocks. Let us now introduce the following two syntactic variations of convenience supported in ATLaS:

  • Tables declared in UDAs can be initialized as part of their declaration, via an SQL statement. This executes at the time when the first tuple is processed, thus the result is the same as if the initialization had been executed in the INITIALIZE block.
  • Different blocks can be merged together when they perform the same function. In the next example the INITIALIZE and ITERATE blocks are merged together.

Our Online Averages UDA could also have been written as follows:

AGGREGATE online_avg(Next Int) : Real
{ TABLE state(sum Int, cnt Int) AS VALUES(0, 0);
INITIALIZE:ITERATE: {
UPDATE state SET sum=sum+Next, cnt=cnt+1;
INSERT INTO RETURN
SELECT sum/cnt FROM state WHERE cnt % 200 = 0;
}
}

In the previous example, the statement has been omitted: this is equivalent to writing ‘TERMINATE: { } ’. An empty INITIALIZE statement can also be omitted in a similar fashion.

The results produced by online averages depend on the order in which the data is streamed through the UDA. This illustrate a common situation in stream processing: the abstract semantics of the aggregate used is order-independent, but approximations must be used because of efficiency and real-time requirements (e.g., nonblocking computations); often, the approximate UDA is order-dependent.

In other situations, no approximation is involved, and the dependence on order follows from the very semantics of the UDA. For instance, this is the case of the rising aggregate described below.

Rising. In addition to temporal extensions of standard aggregates (suggested homework: write them in ATLaS), TSQL2 [7] proposes this new aggregate to return the maximal time periods during which a certain attribute values has been increasing monotonically. We can apply this aggregate to our stock_prices(Day Int, Stock char(4), Cprice Real) table to find the periods during which different stocks have been rising, as follows:

select Stock, rising(Day, Cprice) --> Start, rising(Day, Cprice)--> End
from stock_prices
group by Stock

here rising is defined as follows:

AGGREGATE rising(iPoint Int, iValue Real) : (Start Int, End Int)
{ TABLE rperiod(First Int, Last Int, Value Real) MEMORY;
INITIALIZE: {
INSERT INTO rperiod VALUES (iPoint, iPoint, iValue);
}
ITERATE: {
INSERT INTO return SELECT First, Last
FROM rperiod
WHERE iValue <= Value AND First < Last;
UPDATE rperiod SET Last=iPoint, Value=iValue
WHERE iValue > Value;
UPDATE rperiod SET First=iPoint, Last=iPoint, Value=iValue
WHERE SQLCODE > 0;
}
TERMINATE:{ INSERT INTO return SELECT First, Last
FROM rperiod
WHERE First < Last; }
}
}

Therefore we have a sequence of time-value ordered by increasing time. We store a zero length period iPoint, iPoint whenever the new iValue is not increasing (also at INITIALIZE). Also a non-zero length period is currently held in rperiod we return it. When the new iValue is larger than the previous stored value, we advance the End of the current period to the current point.