UCLA Information System LaboratoryGo Bruins!
 
Professor
Professor
Professor

Professor

Professor

Professor

Professor

Professor

Professor

Professor

Professor

Professor

Professor

Professor

Professor
















  • Mining Techniques for Data Streams and Sequences (2005)

  • XML-Based Support for Database Histories and Document Versions (2004)

  • Managing and Querying Multiversion XML Documents (2001)

  • Data Models and Query Languages of Spatio-Temporal Information (2001)

  • Optimization of Sequence Queries in Database Systems (2001)

  • User-Defined Aggregates for Advanced Database Applications (2000)

  • Temporal Reasoning in Active Databases (1997)

    • Mining Techniques for Data Streams and Sequences

      Andrea Fang Chu

      Data stream mining and sequence mining have many applications and pose challenging research problems. Typical applications, such as network monitoring, web searching, telephone services and credit card purchases, are characterized by the need to mine continuously massive data streams to discover up-to-date patterns, which are invaluable for timely strategic decisions. These new requirements call for the design of new mining methods to replace the traditional ones, since those would require the data to be first stored and then processed off-line using complex algorithms that make several passes over the data. Therefore, a first research challenge is designing fast and light mining methods for data streams !* e.g., algorithms that only require one pass over the data and work with limited memory. Another challenge is created by the highly dynamic nature of data streams, whereby the stream mining algorithms need to detect promptly changing concepts and data distribution and adapt to them. While noise represents a general problem in data mining, it poses new challenges on data streams insofar as adaptability becomes more difficult when the data stream contains noise.

      The main limitation of data stream mining methods is that they cannot reveal long term trends as they only keep a small snapshot of most recent data. Neither can they discover very complicated patterns that can be detected by methods that require extensive computational resources. However, these patterns can be discovered by off-line mining after data streams are stored as sequences. Sequence mining, in general, can reveal long-term trends and more complicated patterns, defined in a multidimensional space via some similarity criteria. The key research challenges that arise in this context include on (i) designing metrics that measure the similarity of sequences, (ii) dealing with high dimensionality, and (iii) achieving scalability.

      This dissertation makes a number of contributions toward the solution of these problems, including the following ones:

      1. Adaptive Boosting: A stream ensemble method is proposed that maintains a very accurate predictive model with fast learning and light memory consumption. The method is also highly adaptive through novel change detection techniques.

      2. Robust Regression Ensemble: This method enhances the ensemble methods with outlier detection and a statistical learning theory.

      3. SeqClus: A pattern-based subspace clustering algorithm is introduced along with a novel pattern similarity metric for sequences. The algorithm is scalable and efficient.

      4. Mining Quality: To deal with noise and improve mining quality, a general approach is introduced based on data dependency. The approach exploits local data dependency between samples using pairwise Markov Networks and Bayesian belief propagation techniques.

      The efficacy of the techniques proposed was demonstrated through extensive experiments, both on synthetic and on real-life data.

      Thesis Download

      XML-Based Support for Database Histories and Document Versions

      Fusheng Wang

      Current database systems do not provide effective means for archiving the database history and querying past snapshots of the database and its temporal evolution. Better support for temporal applications by database systems represents an important objective that is difficult to achieve, since it requires an integrated solution for technical problems that are challenging on their own, including (i) expressive temporal representations and data models, (ii) powerful languages for temporal queries and snapshot queries, (iii) indexing, clustering and query optimization techniques for managing temporal information efficiently, and (iv) architectures that bring together the different pieces of enabling technology into a robust system.

      There is much current interest in publishing and viewing databases as XML documents. The general benefits of this approach follow from the popularity of XML and the tool set available for processing information encoded in this universal standard. In this dissertation, we explore the additional and unique benefits achieved by this approach on temporal database applications. We show that XML with XQuery can provide surprisingly effective solutions to the problem of supporting historical queries on past contents of database relations and their evolution. Indeed, using XML, the histories of database relations can be represented naturally using a temporally grouped data model, and complex temporal queries can be expressed in XQuery without requiring extensions to the current standard.

      Therefore, we present the ArchIS system that achieves these benefits. ArchIS' architecture uses (a) XML to support a temporally grouped (virtual) representation of the database history, (b) XQuery to express powerful temporal queries on such views, (c) temporal clustering and indexing techniques for managing the actual historical data in an RDBMS, and (d) SQL/XML for executing the queries on the XML views as equivalent queries on the relational database. Extensive performance studies show that ArchIS is quite effective at storing and retrieving under complex query conditions the transaction-time history of relational databases. By supporting database compression as an option, ArchIS also achieves excellent storage efficiency for archived histories. Therefore, ArchIS delivers full-functionality transaction-time databases without requiring temporal extensions in XML or database standards. Moreover, these techniques can be extended to valid-time and bitemporal databases, where complex temporal queries can also be expressed in standard XQuery.

      The temporal modeling and querying approach proposed in this thesis is very general and can be extended to arbitrary XML documents. In our extension, we manage the XML document revision history effectively by (i) representing concisely the successive versions of a document as another XML document that implements a temporally grouped data model, ii) using structured diff algorithms to build such documents, and iii) using XQuery to express complex temporal queries on the evolution of the document structure and its content.

      Thesis Download       Defense Slides Download

      Managing and Querying Multiversion XML Documents

      Shu-Yao Chien

      Managing multiversion documents is important for traditional applications, such as software configuration and cooperative work, and for a fast-growing array of web-based applications. Traditional document version control systems, e.g. RCS, which model documents as a sequence of lines of text and use the shortest edit script to represent version differences, can be inefficient and do not preserve the logical structure of the original document. We first propose a new storage strategy, called the Usefulness Based Copy Control (UBCC), which improves on RCS by assuring more efficient version reconstruction with a small storage overhead. The improvements provided by UBCC notwithstanding, edit-based representations suffer from limited generality and flexibility --- and do not support queries well. To solve these problems, we then propose a reference-based version scheme that preserves the logical structure of the evolving document via the use of object references. By preserving the document structure among versions, the new scheme facilitates efficient query support and provides an effective representation of multiversion XML documents, both at the storage level and the transport level. In fact, for the storage level, we develop a usefulness-based management policy similar to the UBCC strategy, and for the transport level, we represent the history of a multiversion document as yet another XML document.
      Finally, we explore another direction of modeling the versioned document structure---the SPaR scheme. The SPaR scheme quantifies the document structure by assigning a unique node number to each element. In addition to traditional queries, this approach is efficient for new XML-oriented queries, including searches by regular path expressions that are used frequently in XML query languages.

      Thesis Download       Defense Slides Download

      Data Models and Query Languages of Spatio-Temporal Information

      Cindy Chen

      In this dissertation, we extend database models and query languages to support spatio-temporal information, including representations for changing positions and shapes. Furthermore, we propose techniques to support spatio-temporal extensions on Object-Relational systems and achieve end-user extensibility.

      We begin from temporal data models and query languages. Our approach is based upon a point-based representation of time enhanced with user-defined aggregates that support interval-oriented operators such as duration and during. This approach provides several advantages, since it solves the coalescing problem, minimizes extensions required from SQL, and it is applicable to all query languages.

      Then, we extend the time model at the physical level to spatio-temporal databases. As the basis of our implementation, we introduce counterclockwise-directed triangles as our core spatial abstract data type. Then, user-defined-aggregates are introduced to support spatial operators such as contain and overlap, and spatial-temporal operators such as moving_distance. Finally, we represent moving objects as sequences of snapshots of spatial objects such as points, lines and polygons.

      To reconcile logical and physical requirements, we follow an implementation approach based on a layered architecture. Thus, for temporal information, point-based representations are mapped to an interval-oriented representation at the internal level. Likewise, we map polygon-based representations into a triangle-based internal representation. We implement these mappings through query transformations, and user-defined aggregates.

      A final advantage of our approach is that it allows end-users to further extend and customize their system: in fact in our SQLST system, users can extend database functionality by writing new user-defined aggregates in SQLST itself.

      In summary, the spatio-temporal data models and query languages introduced in this dissertation offer several conceptual advantages; furthermore, our implementation approach provides considerable practical benefits, including compatibility with Object-Relational systems, flexibility, robustness, and ease of customization by end-users.

      Thesis Download       Defense Slides Download

      Optimization of Sequence Queries in Database Systems

      Reza Sadri


      The need to search for complex and recurring patterns in database sequences is shared by many applications. In this work, we discuss how to express and support e.ciently sophisticated sequential pattern queries in relational database systems. Thus, we first introduce SQL-TS, an extension of SQL, to express these patterns, and then we study how to optimize search queries for this language. We take the optimal text search algorithm of Knuth, Morris and Pratt, and generalize it to handle complex queries on sequences. Our algorithm exploits the inter- dependencies between the elements of a sequential pattern to minimize repeated passes over the same data. We then present extensions of the algorithm for detecting repeated patterns and disjunctive patterns. We also provide methods for finding the inter-dependencies between the pattern elements for important domains including intervals and vector time-series. In addition, a logic based semantics for SQL-TS is given. Experimental results on typical sequence queries, such as double bottom queries, confirm that substantial speedups are achieved by our new optimization techniques.

      Thesis Download       Slides Download(1)       Slides Download(2)

      User-Defined Aggregates for Advanced Database Applications

      Haixun Wang


      A mounting wave of data­intensive and knowledge­based applications, such as data mining, data warehousing, and Online Analytical Processing (OLAP), have created a strong demand for more powerfule database languages and systems. Several data model extensions (e.g., Object­Relational models), new language constructs (e.g. recursion and OLAP constructs), and various database extenders based on user­defined functions, have been proposed to enhance the current Dababase Management Systems (DBMSs). However, state­of­the­art DBMSs are not powerful and general enough for many advanced database applications, and in particular for data mining.
      In this thesis, we claim that User­defined Aggregates (UDAs) provide a versatile mechanism for extending the power and applicability of Object­Relational Databases (O­R DBs). We first define the formal semantics of UDAs in logic and then we apply them to SQL DBMSs. After building a series of language prototypes, we designed and implemented AXL. AXL is easy to learn and use for database programmers because it preserves the constructs, programming paradigm and data types of SQL (whereas there is an `impedance mismatch' between SQL and the procedural languages of user­defined functions currently used in O­R DBs). Data independence and parallelizability represent two additional qualities that AXL inherits from database systems. In this thesis, we show that, while adding only minimal extensions to SQL, AXL is very powerful and capable of expressing complex algorithms efficiently. We demonstrate this by coding data mining functions and other advanced applications that, previously, had been a major problem for SQL databases.
      Due to its flexibility, SQL­compatibility and ease of use, the AXL approach offers better extensibility mechanisms, in several application domains, than the function libraries now offered by commercial O­R DBs under names such as Dat­ ablades or DB­Extenders.

      Thesis Download       Defense Slides Download

      Temporal Reasoning in Active Databases

      Iakovos Motakis