-
Mining frequent itemsets in a stream
Publication year: 2012 Source: Information Systems, Available online 30 January 2012 Toon Calders, Nele Dexters, Joris J.M. Gillis, Bart Goethals Mining frequent itemsets in a datastream proves to be a difficult problem, as itemsets arrive in rapid succession and storing parts of the stream is typically impossible. Nonetheless, it has many useful applications; e.g. opinion and sentiment analysis from social networks. Current stream mining algorithms are based on approximations. In earlier work, mining frequent items in a stream under the max-frequency measure proved to be effective for items. In this article, we extended our work from items to itemsets. Firstly, an optimized incremental algorithm for mining frequent itemsets in a stream is presented. The algorithm maintains a very compact summary of the stream for selected itemsets. Secondly, we show that further compacting the summary is non-trivial. Thirdly, we establish a connection between the size of a summary and results from number theory. Fourthly, we report results of extensive experimentation, both of synthetic and real-world datasets, showing the efficiency of the algorithm both in terms of time and space.
-
CIRCE: correcting imprecise readings and compressing excrescent points for querying common patterns in uncertain sensor streams
Publication year: 2012 Source: Information Systems, Available online 25 January 2012 Jing He, Yanchun Zhang, Guangyan Huang, Paulo de Souza Continuous sensor stream data are often recorded as a series of discrete points in a database from which knowledge can be retrieved through queries. Two classes of uncertainties inevitably happen in sensor streams that we present as follows. The first is Uncertainty due to Discrete Sampling (DS Uncertainty); even if every discrete point is correct, the discrete sensor stream is uncertain - that is, it is not exactly like the continuous stream - since some critical points are missing due to the limited capabilities of the sensing equipment and the database server. The second is Uncertainty due to Sampling Error (SE Uncertainty); sensor readings for the same situation cannot be repeated exactly when we record them at different times or use different sensors since different sampling errors exist. These two uncertainties reduce the efficiency and accuracy of querying common patterns. However, already known algorithms generally only resolve SE Uncertainty. In this paper, we propose a novel method of Correcting Imprecise Readings and Compressing Excrescent points (CIRCE). Particularly, to resolve DS Uncertainty, a novel CIRCE core algorithm is developed in the CIRCE method to correct the missing critical points while compressing the original sensor streams. The experimental study based on various sizes of sensor stream datasets validates that the CIRCE core algorithm is more efficient and more accurate than a counterpart algorithm to compress sensor streams. We also resolve the SE Uncertainty problem in the CIRCE method. The application for querying Longest Common Route patterns validates the effectiveness of our CIRCE method.
-
Efficient tracking of moving objects using a relational database
Publication year: 2012 Source: Information Systems, Available online 20 January 2012 Andreas Behrend, Gereon Schüller, Monika Wieneke Tracking uncooperative moving objects by means of radar is a complex task due to clutter and association problems in multi-target scenarios. An approach to solve this problem isprobabilistic multiple hypothesis tracking(PMHT). This method combines classical track filtering with a likelihood ratio test for the estimation of the plot-to-track association. The basics of PMHT and similar algorithms have gained much attention recently. However, the efficient implementation of real world applications of this technique still represents a challenging task. Since a common requirement in this context is the reliable storage of track data in a database, an implementation of the tracker's calculation inside a database management system (DBMS) using SQL views is desirable. A naive implementation of PMHT using a commercial DBMS, however, usually leads to performance problems because of the high frequency of measurement updates. In this paper, we propose possible optimizations for solving these performance problems. Their usage leads to a dramatic run-time improvement in our sample case and makes the implementation of PMHT in a database context feasible.
-
Recognizing patterns in streams with imprecise timestamps
Publication year: 2012 Source: Information Systems, Available online 18 January 2012 Haopeng Zhang, Yanlei Diao, Neil Immerman Large-scale event systems are becoming increasingly popular in a variety of domains. Event pattern evaluation plays a key role in monitoring applications in these domains. identifies matches of user-defined patterns on high-volume event streams. Existing work on pattern evaluation, however, assumes that the occurrence time of each event is known precisely and the events from various sources can be merged into a single stream with a total or partial order. We observe that in real-world applications event occurrence times are often unknown or imprecise. Therefore, we propose a temporal model that assigns a time interval to each event to represent all of its possible occurrence times and revisit pattern evaluation under this model. In particular, we propose the formal semantics of such pattern evaluation, two evaluation frameworks, and algorithms and optimizations in these frameworks. Our evaluation results using both real traces and synthetic systems show that the event-based framework always outperforms the point-based framework and with optimizations, it achieves high efficiency for a wide range of workloads tested.
-
On the refactoring of activity labels in business process models
Publication year: 2012 Source: Information Systems, Available online 14 January 2012 Henrik Leopold, Sergey Smirnov, Jan Mendling Large corporations increasingly utilize business process models for documenting and redesigning their operations. The extent of such modeling initiatives with several hundred models and dozens of often hardly trained modelers calls for automated quality assurance. While formal properties of control flow can easily be checked by existing tools, there is a notable gap for checking the quality of the textual content of models, in particular, its activity labels. In this paper, we address the problem of activity label quality in business process models. We designed a technique for the recognition of labeling styles, and the automatic refactoring of labels with quality issues. More specifically, we developed a parsing algorithm that is able to deal with the shortness of activity labels, which integrates natural language tools like WordNet and the Stanford Parser. Using three business process model collections from practice with differing labeling style distributions, we demonstrate the applicability of our technique. In comparison to a straightforward application of standard natural language tools, our technique provides much more stable results. As an outcome, the technique shifts the boundary of process model quality issues that can be checked automatically from syntactic to semantic aspects.
Highlights? We present a technique for automatically refactoring poor activity labels. ? Our approach is capable of handling the shortness of activity labels. ? We demonstrate the applicability using three different industry model collections. ? Our method outperforms the standard application of natural language processing tools. ? We pave the way for automated checking of semantic process model quality aspects.
|