![]() ![]() |
|
|
Data stream mining and sequence mining have many applications and pose challenging research problems. Typical applications, such as network monitoring, web searching, telephone services and credit card purchases, are characterized by the need to mine continuously massive data streams to discover up-to-date patterns, which are invaluable for timely strategic decisions. These new requirements call for the design of new mining methods to replace the traditional ones, since those would require the data to be first stored and then processed off-line using complex algorithms that make several passes over the data. Therefore, a first research challenge is designing fast and light mining methods for data streams !* e.g., algorithms that only require one pass over the data and work with limited memory. Another challenge is created by the highly dynamic nature of data streams, whereby the stream mining algorithms need to detect promptly changing concepts and data distribution and adapt to them. While noise represents a general problem in data mining, it poses new challenges on data streams insofar as adaptability becomes more difficult when the data stream contains noise. The main limitation of data stream mining methods is that they cannot reveal long term trends as they only keep a small snapshot of most recent data. Neither can they discover very complicated patterns that can be detected by methods that require extensive computational resources. However, these patterns can be discovered by off-line mining after data streams are stored as sequences. Sequence mining, in general, can reveal long-term trends and more complicated patterns, defined in a multidimensional space via some similarity criteria. The key research challenges that arise in this context include on (i) designing metrics that measure the similarity of sequences, (ii) dealing with high dimensionality, and (iii) achieving scalability. This dissertation makes a number of contributions toward the solution of these problems, including the following ones: 1. Adaptive Boosting: A stream ensemble method is proposed that maintains a very accurate predictive model with fast learning and light memory consumption. The method is also highly adaptive through novel change detection techniques. 2. Robust Regression Ensemble: This method enhances the ensemble methods with outlier detection and a statistical learning theory. 3. SeqClus: A pattern-based subspace clustering algorithm is introduced along with a novel pattern similarity metric for sequences. The algorithm is scalable and efficient. 4. Mining Quality: To deal with noise and improve mining quality, a general approach is introduced based on data dependency. The approach exploits local data dependency between samples using pairwise Markov Networks and Bayesian belief propagation techniques. The efficacy of the techniques proposed was demonstrated through extensive experiments, both on synthetic and on real-life data.
Current database systems do not provide effective means for archiving the database history and querying past snapshots of the database and its temporal evolution. Better support for temporal applications by database systems represents an important objective that is difficult to achieve, since it requires an integrated solution for technical problems that are challenging on their own, including (i) expressive temporal representations and data models, (ii) powerful languages for temporal queries and snapshot queries, (iii) indexing, clustering and query optimization techniques for managing temporal information efficiently, and (iv) architectures that bring together the different pieces of enabling technology into a robust system. There is much current interest in publishing and viewing databases as XML documents. The general benefits of this approach follow from the popularity of XML and the tool set available for processing information encoded in this universal standard. In this dissertation, we explore the additional and unique benefits achieved by this approach on temporal database applications. We show that XML with XQuery can provide surprisingly effective solutions to the problem of supporting historical queries on past contents of database relations and their evolution. Indeed, using XML, the histories of database relations can be represented naturally using a temporally grouped data model, and complex temporal queries can be expressed in XQuery without requiring extensions to the current standard. Therefore, we present the ArchIS system that achieves these benefits. ArchIS' architecture uses (a) XML to support a temporally grouped (virtual) representation of the database history, (b) XQuery to express powerful temporal queries on such views, (c) temporal clustering and indexing techniques for managing the actual historical data in an RDBMS, and (d) SQL/XML for executing the queries on the XML views as equivalent queries on the relational database. Extensive performance studies show that ArchIS is quite effective at storing and retrieving under complex query conditions the transaction-time history of relational databases. By supporting database compression as an option, ArchIS also achieves excellent storage efficiency for archived histories. Therefore, ArchIS delivers full-functionality transaction-time databases without requiring temporal extensions in XML or database standards. Moreover, these techniques can be extended to valid-time and bitemporal databases, where complex temporal queries can also be expressed in standard XQuery. The temporal modeling and querying approach proposed in this thesis is very general and can be extended to arbitrary XML documents. In our extension, we manage the XML document revision history effectively by (i) representing concisely the successive versions of a document as another XML document that implements a temporally grouped data model, ii) using structured diff algorithms to build such documents, and iii) using XQuery to express complex temporal queries on the evolution of the document structure and its content.
In this dissertation, we extend database models and query languages to support spatio-temporal information, including representations for changing positions and shapes. Furthermore, we propose techniques to support spatio-temporal extensions on Object-Relational systems and achieve end-user extensibility. We begin from temporal data models and query languages. Our approach is based upon a point-based representation of time enhanced with user-defined aggregates that support interval-oriented operators such as duration and during. This approach provides several advantages, since it solves the coalescing problem, minimizes extensions required from SQL, and it is applicable to all query languages. Then, we extend the time model at the physical level to spatio-temporal databases. As the basis of our implementation, we introduce counterclockwise-directed triangles as our core spatial abstract data type. Then, user-defined-aggregates are introduced to support spatial operators such as contain and overlap, and spatial-temporal operators such as moving_distance. Finally, we represent moving objects as sequences of snapshots of spatial objects such as points, lines and polygons. To reconcile logical and physical requirements, we follow an implementation approach based on a layered architecture. Thus, for temporal information, point-based representations are mapped to an interval-oriented representation at the internal level. Likewise, we map polygon-based representations into a triangle-based internal representation. We implement these mappings through query transformations, and user-defined aggregates. A final advantage of our approach is that it allows end-users to further extend and customize their system: in fact in our SQLST system, users can extend database functionality by writing new user-defined aggregates in SQLST itself. In summary, the spatio-temporal data models and query languages introduced in this dissertation offer several conceptual advantages; furthermore, our implementation approach provides considerable practical benefits, including compatibility with Object-Relational systems, flexibility, robustness, and ease of customization by end-users.
The need to search for complex and recurring patterns in database sequences is shared by many applications. In this work, we discuss how to express and support e.ciently sophisticated sequential pattern queries in relational database systems. Thus, we first introduce SQL-TS, an extension of SQL, to express these patterns, and then we study how to optimize search queries for this language. We take the optimal text search algorithm of Knuth, Morris and Pratt, and generalize it to handle complex queries on sequences. Our algorithm exploits the inter- dependencies between the elements of a sequential pattern to minimize repeated passes over the same data. We then present extensions of the algorithm for detecting repeated patterns and disjunctive patterns. We also provide methods for finding the inter-dependencies between the pattern elements for important domains including intervals and vector time-series. In addition, a logic based semantics for SQL-TS is given. Experimental results on typical sequence queries, such as double bottom queries, confirm that substantial speedups are achieved by our new optimization techniques.
|