UCLA Information System Laboratory

  • Supporting Schema Evolution in Information Systems and Historical Databases (2008)
  • Data Stream Processing and Query Optimization Techniques (2007)
  • Efficient Support for XML Queries on Data Base and Data Stream Management Systems (2006)
  • Models and Operators for Continuous Queries on Data Streams (2006)
  • Mining Techniques for Data Streams and Sequences (2005)
  • XML-Based Support for Database Histories and Document Versions (2004)
  • Managing and Querying Multiversion XML Documents (2001)
  • Data Models and Query Languages of Spatio-Temporal Information (2001)
  • Optimization of Sequence Queries in Database Systems (2001)
  • User-Defined Aggregates for Advanced Database Applications (2000)
  • Temporal Reasoning in Active Databases (1997)
    • Supporting Schema Evolution in Information Systems and Historical Databases

      Hyun Jin Moon

      In this dissertation, we study the problem of supporting schema evolution in snapshot databases and historical databases. Schema evolution requires several error-prone and time-consuming tasks, including migration of stored data, rewriting of applications, and, in historical databases, management of heterogeneous data. Thus, our primary goal is to relieve users of the burden of performing these tasks manually by having the system perform them automatically.

      The outline and the main contributions of this dissertation are as follows. First, we design schema modification operators (SMOs), for high-level specification of schema changes. Then, based on SMOs, we develop a workbench system called PRISM, where users can evaluate the effect of schema changes ("what-if" scenario) and (semi-)automatically execute data migration and application query translation. For query translation, the system computes for the user possible inverse schema mappings and performs automatic query translation according to the inverse selected by the user.
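
      For illustration, the following minimal Python sketch (not PRISM itself) shows how a couple of hypothetical SMOs can be applied to a toy schema, with each operator producing both the evolved schema and a candidate SQL migration statement; all table and column names are invented for the example, and a complete migration would also drop the moved column from the source table.

          # Minimal sketch: schema modification operators (SMOs) applied to an
          # in-memory schema, each also emitting a data-migration statement.

          schema = {"employee": ["id", "name", "dept", "salary"]}   # hypothetical table

          def rename_column(schema, table, old, new):
              """SMO: rename a column of an existing table."""
              cols = schema[table]
              cols[cols.index(old)] = new
              return f"ALTER TABLE {table} RENAME COLUMN {old} TO {new};"

          def partition_table(schema, table, new_table, moved_cols, key):
              """SMO: move some columns (plus the key) into a new table."""
              schema[new_table] = [key] + moved_cols
              schema[table] = [c for c in schema[table] if c not in moved_cols]
              return (f"CREATE TABLE {new_table} AS SELECT {key}, {', '.join(moved_cols)} "
                      f"FROM {table};")

          migration = [
              rename_column(schema, "employee", "dept", "department"),
              partition_table(schema, "employee", "payroll", ["salary"], "id"),
          ]
          print(schema)                  # the evolved schema after both SMOs
          print("\n".join(migration))    # the generated migration script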

      Evolution of schemas also affects historical data management: repeated schema changes lead to the accumulation of successive schema versions in history, since historical data must be archived according to the schema version under which they were originally created, so as to assure faithful preservation of the database history. To write a historical query on such databases, users have to refer to every schema version containing relevant parts of the historical database. Since this is too cumbersome for typical applications, we let users write a single historical query on the current schema version, as if all historical data were stored under that version. Thus, our PRIMA system uses efficient rewriting techniques to map the queries formulated against the current schema onto the relevant schema versions.

      We also study the problem of efficient support for transaction-time databases with evolving schemas and propose novel optimization techniques for temporal queries. In particular, we propose an efficient temporal coalescing technique called CNesT, which outperforms the best coalescing algorithm to date by almost two orders of magnitude.
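
      To make the notion of temporal coalescing concrete, the following Python sketch merges value-equivalent tuples whose validity periods overlap or meet; it illustrates only the basic operation, not the CNesT algorithm itself.

          # Illustrative temporal coalescing: merge tuples that carry the same
          # value and have adjacent or overlapping [start, end) periods.

          def coalesce(rows):
              """rows: list of (value, start, end); returns the coalesced history."""
              rows = sorted(rows, key=lambda r: (r[0], r[1]))      # group by value, then time
              out = []
              for value, start, end in rows:
                  if out and out[-1][0] == value and start <= out[-1][2]:
                      prev = out[-1]
                      out[-1] = (value, prev[1], max(prev[2], end))  # extend previous period
                  else:
                      out.append((value, start, end))
              return out

          history = [("clerk", 1, 5), ("clerk", 5, 9), ("manager", 9, 12)]
          print(coalesce(history))       # [('clerk', 1, 9), ('manager', 9, 12)]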

      Thesis Download

        Data Stream Processing and Query Optimization Techniques

        Yijian Bai

        Many modern applications require continuous processing of massive data streams, creating difficult research challenges on multiple fronts. Foremost among these is the design and implementation of Data Stream Management Systems (DSMSs) that optimize the execution of multiple continuous queries on massive and bursty data streams to produce fast, near-real-time responses upon tuple arrival. Existing query constructs and query processing techniques of Database Management Systems must be greatly extended and revised to efficiently support continuous queries in DSMSs.

        In this dissertation, we propose novel query scheduling algorithms and query constructs whose effectiveness was demonstrated in the Stream Mill DSMS prototype. Thus, we introduce on-demand timestamp generation and propagation mechanisms based on a flexible query-execution model, to solve the idle-waiting problem, improve query response times, and reduce memory consumption. Moreover, we introduce practical scheduling algorithms, which assume uniform cost for tuples on the same operator, to handle query graphs of arbitrary complexity. Finally, we compare the performance of these practical algorithms against the theoretically optimal ones that assume perfect knowledge of the costs of the individual tuples being processed. Our experiments show that the performance of the former closely approximates that of the latter.
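
        As a rough illustration of operator scheduling in a DSMS (not the scheduling algorithms proposed in the thesis), the Python sketch below runs, at each step, the operator with the longest input queue, under the simplifying assumption of uniform per-tuple cost; the tiny query graph and its operators are hypothetical.

            from collections import deque

            # Sketch of a simple DSMS operator scheduler: operators form a query
            # graph with input queues; each step runs the most backlogged operator
            # on one tuple and pushes the result downstream.

            class Operator:
                def __init__(self, name, fn, downstream=None):
                    self.name, self.fn, self.downstream = name, fn, downstream
                    self.queue = deque()

                def run_one(self):
                    out = self.fn(self.queue.popleft())
                    if out is not None and self.downstream is not None:
                        self.downstream.queue.append(out)

            def schedule(operators, steps):
                for _ in range(steps):
                    backlogged = [op for op in operators if op.queue]
                    if not backlogged:
                        break
                    max(backlogged, key=lambda op: len(op.queue)).run_one()

            sink = Operator("sink", lambda t: print("result:", t))
            select = Operator("select", lambda t: t if t % 2 == 0 else None, downstream=sink)
            select.queue.extend([1, 2, 3, 4])
            schedule([select, sink], steps=10)   # prints the even tuples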

        In order to utilize DBMS languages for data stream processing, stream-oriented constructs, such as sliding windows, have to be defined and integrated into the query language. Thus, we introduce novel constructs and techniques for supporting sliding windows on arbitrary User-Defined Aggregates, so as to achieve their incremental maintenance on logical and physical windows.
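
        The following Python sketch illustrates the incremental-maintenance idea for a window aggregate, here a running average over a physical (count-based) window; it conveys the delta-maintenance principle rather than the actual Stream Mill constructs.

            from collections import deque

            # Incremental maintenance of an aggregate over a count-based sliding
            # window: each arriving tuple adds its contribution and expired tuples
            # are removed, so the aggregate is never recomputed from scratch.

            class WindowAvg:
                def __init__(self, size):
                    self.size, self.win, self.total = size, deque(), 0.0

                def iterate(self, value):
                    """Called once per arriving tuple; returns the current window average."""
                    self.win.append(value)
                    self.total += value
                    if len(self.win) > self.size:          # expire the oldest tuple
                        self.total -= self.win.popleft()
                    return self.total / len(self.win)

            avg = WindowAvg(size=3)
            for v in [10, 20, 30, 40]:
                print(avg.iterate(v))                      # 10.0, 15.0, 20.0, 30.0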

        Finally, we explore some performance optimization issues in advanced applications, including RFID (Radio Frequency IDentification) applications, and load-shedding in multi-source stream classifiers.

        Thesis Download

        Efficient Support for XML Queries on Data Base and Data Stream Management Systems

        Xin Zhou

        The eXtensible Markup Language (XML) is rapidly becoming the de facto standard for retrieving and exchanging web information, and is having a significant impact on the evolution of existing database management systems. Commercial SQL-compliant database systems are moving aggressively to support the integrated management of XML documents and XML-published relational data. This is achieved by supporting XML query languages, such as XQuery, along with extensions of relational languages such as SQL/XML.

        In this dissertation, we focus our work on three areas that present research issues of particular significance and interest. The first area involves managing and querying the temporal history of databases using XML, since its data model and query languages are more supportive of temporal information than relational databases and SQL. We then discuss the problem of supporting this XML historical view efficiently by mapping back to relational databases, since these achieve better scalability and performance. We finally introduce an algorithm which can process temporal coalescing efficiently, within the current SQL framework.

        The second and fast-growing opportunity area is represented by data streams, whereby query languages need to be extended to support continuous query applications on incoming relational streams and XML streams. While current approaches design separate data stream management systems for relational and XML data, the focus of my research has been on providing a unified support for both kinds of data streams. Toward this goal, XML SAX event streams and relational streams need to be integrated, and SQL and XQuery must be extended to overcome their limitations with certain data stream queries. We also overview the state-of-the-art technologies based on the Finite State Automata (FSA) model to process multiple queries, and use UDAs to simulate such FSAs with comparable performance. These techniques allow us to build a system which integrates the management of relational and XML streams.
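
        As an illustration of FSA-based processing of path queries over SAX events (a simplified sketch, not the system described in the thesis), the Python fragment below matches the hypothetical path query /bib/book/title against a stream of start/end/text events, using a small automaton whose current state is tracked with a stack.

            # A tiny FSA for the path /bib/book/title run over SAX-like events;
            # the stack records the automaton state at each nesting level, and
            # text is collected while the automaton sits in the accepting state.

            PATH = ["bib", "book", "title"]                # hypothetical path query

            def match(events):
                stack, results, text = [0], [], None
                for kind, payload in events:
                    if kind == "start":
                        state = stack[-1]
                        nxt = state + 1 if 0 <= state < len(PATH) and payload == PATH[state] else -1
                        stack.append(nxt)
                        if nxt == len(PATH):
                            text = ""                      # entered the accepting state
                    elif kind == "text" and text is not None:
                        text += payload
                    elif kind == "end":
                        if stack[-1] == len(PATH) and text is not None:
                            results.append(text)
                            text = None
                        stack.pop()
                return results

            events = [("start", "bib"), ("start", "book"), ("start", "title"),
                      ("text", "Data Streams"), ("end", "title"), ("end", "book"), ("end", "bib")]
            print(match(events))                           # ['Data Streams']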

        The last but not least research problem in my thesis is OLAP applications on XML data. With XML gaining importance as the standard for representing business data, XQuery must support the types of queries that are common in business analytics. One such class of queries is OLAP-style aggregation queries. We review several recent research proposals which introduce new constructs into XQuery, pinpoint their disadvantages, and bring forward our function import mechanism. Basically, we allow XQuery UDFs to import functions written in SQL:2003 and UDAs. All complex moving-window queries and grouping-set queries can thus be specified easily in SQL using current SQL standards, and optimized using the mature query optimization technology of relational DBMSs. We show that our approach provides clear expressive power at the language level and efficient query performance.

        Thesis Download

        Models and Operators for Continuous Queries on Data Streams

        Yan Nei Law

        A new generation of data-intensive applications is emerging for managing and querying information that, rather than residing in databases, flows continuously through the network in the form of massive data streams. Hence, there is much research work on designing Data Stream Management Systems, and the approach favored by many research projects consists in extending database languages and technology for data streams. However, the new computational environment brings significant research challenges in areas such as query languages, query processing, and advanced applications.

        In this dissertation, we first focus on the limitations of relational languages in expressing continuous stream queries. A main limitation follows from the fact that only nonblocking operators can be used in continuous queries, which makes relational languages incomplete on these queries. To address this problem, we investigate user-defined aggregates natively defined in SQL itself, and prove that these make SQL (i) Turing-complete on stored data, and (ii) complete on data streams. Furthermore, we illustrate the effectiveness of the proposed extensions on complex applications involving time-series queries, and mining queries.
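
        The blocking versus nonblocking distinction can be pictured with a small Python sketch, independent of the SQL extensions proposed in the dissertation: a blocking aggregate such as an exact median cannot return anything before its (unbounded) input ends, whereas a nonblocking aggregate like a running maximum can emit a result after every tuple.

            import itertools

            def stream():                      # an unbounded input stream
                yield from itertools.count(1)

            def running_max(s):                # nonblocking: emits after each tuple
                best = None
                for x in s:
                    best = x if best is None else max(best, x)
                    yield best                 # partial results flow downstream immediately

            for result in itertools.islice(running_max(stream()), 5):
                print(result)                  # 1 2 3 4 5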

        For advanced applications, we focus on data-stream mining algorithms, which must now be redesigned to make lighter demands on resources and display greater adaptability than those on stored data. In this dissertation, we introduce ANNCAD, which uses multi-resolution data representation to classify new test points using the nearest-neighbors principle. The incremental property and very fast update speed make ANNCAD very suitable for mining data streams. Our experiments show that ANNCAD is adaptive and works well in many applications, including image recognition and census surveying.
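
        The following toy Python sketch conveys the flavor of grid-based, incrementally updatable classification (it is not the ANNCAD algorithm itself): class counts are kept per grid cell at a few resolutions, training is a constant-time counter update, and classification backs off to coarser cells when the finest cell is empty; the resolutions and data points are made up.

            from collections import defaultdict

            LEVELS = [8, 4, 2]                                 # grid resolutions, finest first

            def cell(point, res):
                return tuple(int(coord * res) for coord in point)   # points assumed in [0, 1)

            counts = {res: defaultdict(lambda: defaultdict(int)) for res in LEVELS}

            def train(point, label):                           # one pass, constant work per tuple
                for res in LEVELS:
                    counts[res][cell(point, res)][label] += 1

            def classify(point):
                for res in LEVELS:                             # the finest non-empty cell votes
                    votes = counts[res][cell(point, res)]
                    if votes:
                        return max(votes, key=votes.get)
                return None

            for p, y in [((0.1, 0.2), "A"), ((0.15, 0.22), "A"), ((0.8, 0.9), "B")]:
                train(p, y)
            print(classify((0.12, 0.21)))                      # 'A'
            print(classify((0.7, 0.85)))                       # 'B' (found at a coarser level)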

        We then study the problem of stream query processing. We propose a load-shedding technique for multi-way joins, called MSketch, which makes decisions based on the productivity of tuples, rather than only on the content of the joined stream pair. A thorough study shows that MSketch outperforms other existing algorithms. Finally, we propose general techniques for optimizing the accuracy of window aggregates, statistical aggregates and mining queries in the presence of sampling. Our method incorporates prior knowledge into an error model that is used to reduce the uncertainty introduced by sampling. We also extend the method to adjust to concept shifts.
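
        As a rough illustration of productivity-based load shedding (not MSketch itself), the Python sketch below keeps, under overload, the tuples expected to produce the most join results, estimated here from hypothetical per-key frequency statistics on the other stream.

            # Keep at most `capacity` tuples, preferring the most productive ones,
            # i.e., those whose join key is frequent on the other input stream.

            other_stream_freq = {"k1": 50, "k2": 3, "k3": 0}   # hypothetical join-key statistics

            def shed(batch, capacity):
                ranked = sorted(batch, key=lambda t: other_stream_freq.get(t["key"], 0), reverse=True)
                return ranked[:capacity]

            batch = [{"key": "k3", "v": 1}, {"key": "k1", "v": 2}, {"key": "k2", "v": 3}]
            print(shed(batch, capacity=2))                     # keeps the k1 and k2 tuples, sheds k3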

        Thesis Download

        Mining Techniques for Data Streams and Sequences

        Andrea Fang Chu

        Data stream mining and sequence mining have many applications and pose challenging research problems. Typical applications, such as network monitoring, web searching, telephone services and credit card purchases, are characterized by the need to mine continuously massive data streams to discover up-to-date patterns, which are invaluable for timely strategic decisions. These new requirements call for the design of new mining methods to replace the traditional ones, since those would require the data to be first stored and then processed off-line using complex algorithms that make several passes over the data. Therefore, a first research challenge is designing fast and light mining methods for data streams, e.g., algorithms that only require one pass over the data and work with limited memory. Another challenge is created by the highly dynamic nature of data streams, whereby stream mining algorithms need to promptly detect changing concepts and data distributions and adapt to them. While noise represents a general problem in data mining, it poses new challenges on data streams insofar as adaptability becomes more difficult when the data stream contains noise.

        The main limitation of data stream mining methods is that they cannot reveal long-term trends, as they only keep a small snapshot of the most recent data. Neither can they discover very complicated patterns that can be detected by methods that require extensive computational resources. However, these patterns can be discovered by off-line mining after data streams are stored as sequences. Sequence mining, in general, can reveal long-term trends and more complicated patterns, defined in a multidimensional space via some similarity criteria. The key research challenges that arise in this context include (i) designing metrics that measure the similarity of sequences, (ii) dealing with high dimensionality, and (iii) achieving scalability.

        This dissertation makes a number of contributions toward the solution of these problems, including the following ones:

        1. Adaptive Boosting: A stream ensemble method is proposed that maintains a very accurate predictive model with fast learning and light memory consumption. The method is also highly adaptive through novel change detection techniques.

        2. Robust Regression Ensemble: This method enhances ensemble methods with outlier detection and statistical learning theory.

        3. SeqClus: A pattern-based subspace clustering algorithm is introduced along with a novel pattern similarity metric for sequences. The algorithm is scalable and efficient.

        4. Mining Quality: To deal with noise and improve mining quality, a general approach is introduced based on data dependency. The approach exploits local data dependency between samples using pairwise Markov Networks and Bayesian belief propagation techniques.

        The efficacy of the techniques proposed was demonstrated through extensive experiments, both on synthetic and on real-life data.

        Thesis Download

        XML-Based Support for Database Histories and Document Versions

        Fusheng Wang

        Current database systems do not provide effective means for archiving the database history and querying past snapshots of the database and its temporal evolution. Better support for temporal applications by database systems represents an important objective that is difficult to achieve, since it requires an integrated solution for technical problems that are challenging on their own, including (i) expressive temporal representations and data models, (ii) powerful languages for temporal queries and snapshot queries, (iii) indexing, clustering and query optimization techniques for managing temporal information efficiently, and (iv) architectures that bring together the different pieces of enabling technology into a robust system.

        There is much current interest in publishing and viewing databases as XML documents. The general benefits of this approach follow from the popularity of XML and the tool set available for processing information encoded in this universal standard. In this dissertation, we explore the additional and unique benefits achieved by this approach on temporal database applications. We show that XML with XQuery can provide surprisingly effective solutions to the problem of supporting historical queries on past contents of database relations and their evolution. Indeed, using XML, the histories of database relations can be represented naturally using a temporally grouped data model, and complex temporal queries can be expressed in XQuery without requiring extensions to the current standard.

        Therefore, we present the ArchIS system that achieves these benefits. ArchIS' architecture uses (a) XML to support a temporally grouped (virtual) representation of the database history, (b) XQuery to express powerful temporal queries on such views, (c) temporal clustering and indexing techniques for managing the actual historical data in an RDBMS, and (d) SQL/XML for executing the queries on the XML views as equivalent queries on the relational database. Extensive performance studies show that ArchIS is quite effective at storing the transaction-time history of relational databases and at retrieving it under complex query conditions. By supporting database compression as an option, ArchIS also achieves excellent storage efficiency for archived histories. Thus, ArchIS delivers full-functionality transaction-time databases without requiring temporal extensions in XML or database standards. Moreover, these techniques can be extended to valid-time and bitemporal databases, where complex temporal queries can also be expressed in standard XQuery.
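
        To illustrate what a temporally grouped XML view of a relation's history can look like (a simplified sketch, not ArchIS's actual schema or element names), the Python fragment below publishes a hypothetical attribute-level history as XML in which each attribute element carries its own timestamped versions.

            import xml.etree.ElementTree as ET

            # Hypothetical history of employee(id, salary, title):
            # tuples of (row id, attribute, value, tstart, tend).
            history = [
                (101, "salary", "50000", "2001-01-01", "2003-06-30"),
                (101, "salary", "60000", "2003-07-01", "9999-12-31"),   # current value
                (101, "title",  "clerk", "2001-01-01", "9999-12-31"),
            ]

            root, rows = ET.Element("employees"), {}
            for rid, attr, value, ts, te in history:
                if rid not in rows:
                    rows[rid] = ET.SubElement(root, "employee", id=str(rid))
                version = ET.SubElement(rows[rid], attr, tstart=ts, tend=te)
                version.text = value

            print(ET.tostring(root, encoding="unicode"))   # the temporally grouped view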

        The temporal modeling and querying approach proposed in this thesis is very general and can be extended to arbitrary XML documents. In our extension, we manage the XML document revision history effectively by (i) representing concisely the successive versions of a document as another XML document that implements a temporally grouped data model, (ii) using structured diff algorithms to build such documents, and (iii) using XQuery to express complex temporal queries on the evolution of the document structure and its content.

        Thesis Download       Defense Slides Download

        Managing and Querying Multiversion XML Documents

        Shu-Yao Chien

        Managing multiversion documents is important for traditional applications, such as software configuration and cooperative work, and for a fast-growing array of web-based applications. Traditional document version control systems, e.g., RCS, which model documents as a sequence of lines of text and use the shortest edit script to represent version differences, can be inefficient and do not preserve the logical structure of the original document. We first propose a new storage strategy, called Usefulness-Based Copy Control (UBCC), which improves on RCS by assuring more efficient version reconstruction with a small storage overhead. The improvements provided by UBCC notwithstanding, edit-based representations suffer from limited generality and flexibility --- and do not support queries well. To solve these problems, we then propose a reference-based version scheme that preserves the logical structure of the evolving document via the use of object references. By preserving the document structure among versions, the new scheme facilitates efficient query support and provides an effective representation of multiversion XML documents, both at the storage level and the transport level. In fact, for the storage level, we develop a usefulness-based management policy similar to the UBCC strategy, and for the transport level, we represent the history of a multiversion document as yet another XML document.

        Finally, we explore another direction of modeling the versioned document structure --- the SPaR scheme. The SPaR scheme quantifies the document structure by assigning a unique node number to each element. In addition to traditional queries, this approach is efficient for new XML-oriented queries, including searches by regular path expressions that are used frequently in XML query languages.
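
        As a toy illustration of the storage trade-off behind usefulness-based copy control (not the exact UBCC policy), the Python sketch below stores each version as a line diff against its predecessor but inserts a full copy whenever too many deltas would otherwise be needed to rebuild the latest version, trading a small storage overhead for bounded reconstruction cost; the bound and the document contents are hypothetical.

            import difflib

            MAX_CHAIN = 3                      # reconstruction-cost bound (hypothetical)

            class VersionStore:
                def __init__(self):
                    self.entries = []          # ("full", lines) or ("delta", unified diff)
                    self.chain = 0             # deltas accumulated since the last full copy

                def add(self, prev_lines, new_lines):
                    if not self.entries or self.chain >= MAX_CHAIN:
                        self.entries.append(("full", list(new_lines)))   # reset the chain
                        self.chain = 0
                    else:
                        diff = list(difflib.unified_diff(prev_lines, new_lines, lineterm=""))
                        self.entries.append(("delta", diff))
                        self.chain += 1

            store = VersionStore()
            v1 = ["<doc>", "<p>hello</p>", "</doc>"]
            v2 = ["<doc>", "<p>hello world</p>", "</doc>"]
            store.add([], v1)
            store.add(v1, v2)
            print([kind for kind, _ in store.entries])   # ['full', 'delta']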

        Thesis Download       Defense Slides Download

        Data Models and Query Languages of Spatio-Temporal Information

        Cindy Chen

        In this dissertation, we extend database models and query languages to support spatio-temporal information, including representations for changing positions and shapes. Furthermore, we propose techniques to support spatio-temporal extensions on Object-Relational systems and achieve end-user extensibility.

        We begin with temporal data models and query languages. Our approach is based upon a point-based representation of time, enhanced with user-defined aggregates that support interval-oriented operators such as duration and during. This approach provides several advantages, since it solves the coalescing problem, minimizes the extensions required from SQL, and is applicable to all query languages.

        Then, we extend the time model at the physical level to spatio-temporal databases. As the basis of our implementation, we introduce counterclockwise-directed triangles as our core spatial abstract data type. Next, user-defined aggregates are introduced to support spatial operators such as contain and overlap, and spatio-temporal operators such as moving_distance. Finally, we represent moving objects as sequences of snapshots of spatial objects such as points, lines and polygons.
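
        A contain operator over counterclockwise-directed triangles reduces to three sign tests, as the following Python sketch shows; this is a generic computational-geometry illustration of the primitive, not the SQLST implementation.

            # A point is inside a CCW triangle iff it lies to the left of (or on)
            # each directed edge, i.e., three cross-product sign tests.

            def cross(o, a, b):
                """z-component of (a - o) x (b - o); > 0 when o->a->b turns left."""
                return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

            def ccw_triangle_contains(tri, p):
                a, b, c = tri                              # vertices listed counterclockwise
                return cross(a, b, p) >= 0 and cross(b, c, p) >= 0 and cross(c, a, p) >= 0

            tri = ((0, 0), (4, 0), (0, 4))                 # a counterclockwise triangle
            print(ccw_triangle_contains(tri, (1, 1)))      # True
            print(ccw_triangle_contains(tri, (5, 5)))      # False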

        To reconcile logical and physical requirements, we follow an implementation approach based on a layered architecture. Thus, for temporal information, point-based representations are mapped to an interval-oriented representation at the internal level. Likewise, we map polygon-based representations into a triangle-based internal representation. We implement these mappings through query transformations, and user-defined aggregates.

        A final advantage of our approach is that it allows end-users to further extend and customize their system: in fact in our SQLST system, users can extend database functionality by writing new user-defined aggregates in SQLST itself.

        In summary, the spatio-temporal data models and query languages introduced in this dissertation offer several conceptual advantages; furthermore, our implementation approach provides considerable practical benefits, including compatibility with Object-Relational systems, flexibility, robustness, and ease of customization by end-users.

        Thesis Download       Defense Slides Download

        Optimization of Sequence Queries in Database Systems

        Reza Sadri


        The need to search for complex and recurring patterns in database sequences is shared by many applications. In this work, we discuss how to express and efficiently support sophisticated sequential pattern queries in relational database systems. Thus, we first introduce SQL-TS, an extension of SQL, to express these patterns, and then we study how to optimize search queries for this language. We take the optimal text search algorithm of Knuth, Morris and Pratt, and generalize it to handle complex queries on sequences. Our algorithm exploits the interdependencies between the elements of a sequential pattern to minimize repeated passes over the same data. We then present extensions of the algorithm for detecting repeated patterns and disjunctive patterns. We also provide methods for finding the interdependencies between the pattern elements for important domains, including intervals and vector time-series. In addition, a logic-based semantics for SQL-TS is given. Experimental results on typical sequence queries, such as double bottom queries, confirm that substantial speedups are achieved by our new optimization techniques.
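
        The Python sketch below shows what such a sequential pattern query computes, using a naive matcher that restarts the search at every position for a simple V-shaped price dip; the KMP-style optimization studied in the thesis instead derives, from the interdependencies among the pattern's predicates, how far the search can safely skip ahead. The pattern and data are hypothetical.

            # A sequential pattern as a list of per-step predicates over consecutive
            # tuples: price falls twice, then rises twice (a V-shaped dip).
            pattern = [
                lambda prev, cur: cur < prev,    # falling
                lambda prev, cur: cur < prev,    # falling
                lambda prev, cur: cur > prev,    # rising
                lambda prev, cur: cur > prev,    # rising
            ]

            def find_matches(prices):
                matches = []
                for i in range(1, len(prices) - len(pattern) + 1):   # naive restart at each i
                    if all(p(prices[i + k - 1], prices[i + k]) for k, p in enumerate(pattern)):
                        matches.append(i)
                return matches

            prices = [10, 9, 8, 9, 11, 10, 8, 9, 12]
            print(find_matches(prices))          # [1, 5]: start positions of the two dips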

        Thesis Download       Slides Download(1)       Slides Download(2)

        User-Defined Aggregates for Advanced Database Applications

        Haixun Wang


        A mounting wave of data-intensive and knowledge-based applications, such as data mining, data warehousing, and Online Analytical Processing (OLAP), has created a strong demand for more powerful database languages and systems. Several data model extensions (e.g., Object-Relational models), new language constructs (e.g., recursion and OLAP constructs), and various database extenders based on user-defined functions have been proposed to enhance current Database Management Systems (DBMSs). However, state-of-the-art DBMSs are not powerful and general enough for many advanced database applications, and in particular for data mining.

        In this thesis, we claim that User-Defined Aggregates (UDAs) provide a versatile mechanism for extending the power and applicability of Object-Relational Databases (O-R DBs). We first define the formal semantics of UDAs in logic and then we apply them to SQL DBMSs. After building a series of language prototypes, we designed and implemented AXL. AXL is easy to learn and use for database programmers because it preserves the constructs, programming paradigm and data types of SQL (whereas there is an `impedance mismatch' between SQL and the procedural languages of user-defined functions currently used in O-R DBs). Data independence and parallelizability represent two additional qualities that AXL inherits from database systems. In this thesis, we show that, while adding only minimal extensions to SQL, AXL is very powerful and capable of expressing complex algorithms efficiently. We demonstrate this by coding data mining functions and other advanced applications that, previously, had been a major problem for SQL databases.

        Due to its flexibility, SQL-compatibility and ease of use, the AXL approach offers better extensibility mechanisms, in several application domains, than the function libraries now offered by commercial O-R DBs under names such as Datablades or DB-Extenders.
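
        The execution model of such aggregates can be pictured with a small Python sketch that mimics the INITIALIZE, ITERATE, and TERMINATE steps of a UDA computing a simple average; in AXL these steps are written as SQL statements over a local state table, so the sketch illustrates the execution model only, not AXL syntax.

            class AvgUDA:
                def initialize(self, value):          # first input tuple: set up the state
                    self.state = {"sum": value, "cnt": 1}

                def iterate(self, value):             # every further tuple updates the state
                    self.state["sum"] += value
                    self.state["cnt"] += 1

                def terminate(self):                  # end of input: return the final result
                    return self.state["sum"] / self.state["cnt"]

            def run_uda(uda, values):
                it = iter(values)
                uda.initialize(next(it))
                for v in it:
                    uda.iterate(v)
                return uda.terminate()

            print(run_uda(AvgUDA(), [2, 4, 9]))       # 5.0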

        Thesis Download       Defense Slides Download

        Temporal Reasoning in Active Databases

        Iakovos Motakis