eBook - ePub

Querying and Mining Uncertain Data Streams

Name: Querying and Mining Uncertain Data Streams
Author: Cheqing Jin, Aoying Zhou

Cheqing Jin,

Aoying Zhou,

164 pages
English
ePUB (mobile friendly)
Available on iOS & Android

eBook - ePub

Querying and Mining Uncertain Data Streams

Cheqing Jin,

Aoying Zhou,

Book details

Book preview

Table of contents

Citations

About This Book

Data uncertainty widely exists in many applications, and an uncertain data stream is a series of uncertain tuples that arrive rapidly. However, traditional techniques for deterministic data streams cannot be applied to deal with data uncertainty directly due to the exponential growth of possible solution space.

This book provides a comprehensive overview of the authors' work on querying and mining uncertain data streams. Its contents include some important discoveries dealing with typical topics such as top-k query, ER-Topk query, rarity estimation, set similarity, and clustering.

Querying and Mining Uncertain Data Streams is written for professionals, researchers, and graduate students in data mining and its various related fields.

Contents:

Introduction
Top- k Queries Over the Sliding-window Model
ER-Top k Query Over the Landmark Model
Rarity Estimation
Set Similarity
Clustering
Conclusion

Readership: Students and Professionals involved in data mining, big data, and data gathering.
Key Features:

The first book on uncertain data stream management
There exist significant contributions on typical topics

Frequently asked questions

Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.

At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.

Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.

Yes, you can access Querying and Mining Uncertain Data Streams by Cheqing Jin, Aoying Zhou in PDF and/or ePUB format, as well as other popular books in Computer Science & Data Processing. We have over one million books available in our catalogue for you to explore.

Information

Publisher

WSPC

Year

2016

ISBN

9789813142923

Topic

Computer Science

Subtopic

Data Processing

Index

Computer Science

Chapter 1 Introduction

1.1Background

During the past four decades, huge efforts have been dedicated to manage data that is deterministic, and the successful deployment of commercial DBMSs deeply influences human life. Most recently, it becomes critical to process uncertain (probabilistic) data, since data uncertainty widely exists in many applications, such as economy, military, logistic, finance, and telecommunication, etc. The causes for uncertain data include, but are not limited to the following: low-end physical equipments, manual intervention, data missing, and data integration [82].

Low-end physical equipments. Data may be imprecise due to low-end physical equipments, such as sensor networks [29]. Data accuracy is also affected by bandwidth, transmission delay and energy during wireless network transmission process. It is reported that more than 30% of tag readings in the RFID applications are dropped [46].
Manual intervention. Manual intervention helps to achieve special targets, such as privacy preserving. For example, modern GPS-enabled electronic devices can track the trajectories of moving objects, such as intelligent mobile phones. A common way to protect location privacy (called k-anonymity) uses an area to describe the actual location of a user, so that he (she) cannot be distinguished from the other k − 1 users.
Data missing. Data missing occurs in many cases, including equipment failure, invalid access, data inconsistent, and historical usage. Although some existing data cleaning mechanisms can eliminate uncertainty by assuming the underlying data follows a distribution or simply ignoring incomplete records, they cannot describe the nature of data uncertainty well.
Data integration. Data integration also brings in uncertainty, since the data from different sources may be inconsistent. By integrating multiple sources of uncertain data, it is helpful to resolve portions of the uncertainty to achieve more accurate results [7].

A significant trend in the modern applications is that the uncertain data tends to be gathered in a streaming fashion. For example, each sensor node in a senor network gathers and transfers data to a main site continuously; RFID tags are read continuously. Table 1.1 illustrates a small example of modern traffic monitoring applications, where radars are used to detect the speed of vehicles. There are four tuples, each described as a discrete probability distribution function rather than a single value, since some errors may be caused for complicated reasons, such as nearby high voltage lines, close cars’ interference, human operator mistakes, etc. For instance, the 1st record observes a Buick car (No. Z-333) running through the monitoring area at 10:33AM with the speed estimated as 50 (miles per hour) with probability 0.6, and 70 with probability 0.4 respectively. In addition, a range is used to test the validation of a speed reading (e.g., [0, 120]). Once the reading exceeds the range, this information is removed from the tuple description, which makes the total confidence smaller than 1. For example, the 3rd record estimates the speed of 80 with probability 0.4, and of invalidation with probability 0.6 (= 1 − 0.4).

Table 1.1 A Radar Reading Data Stream

There are two kinds of uncertainties: attribute-level uncertainty and existential uncertainty. The latter is also called as tuple-level uncertainty in some literatures [2]. The attribute-level uncertainty, commonly described as discrete probability distribution function or probability density function, illustrates the imprecision of a tuple’s attributes, e.g., the first tuple. The existential uncertainty describes the confidence of a tuple, e.g., the third tuple. A tuple may have two kinds of uncertainties at the same time, such as the last tuple.

Uncertain data streams have attracted a lot of attention [5, 20, 44, 45, 78]. Since large amounts of such streaming data arrive rapidly, the goal is to design both space- and time-efficient query processing techniques, which only require to visit each tuple once. On the other hand, streaming data is also highly time-sensitive: people are generally more interested in the recent tuples than those in the far past, which further adds the challenges of uncertain stream processing.

1.2Data Model

The model for uncertain data streams is actually the integration of data uncertainty and streaming property. Most uncertain data models adopt the possible world semantics [1, 36], in which the possible world space contains a huge number of possible world instances (also shorten as possible world), each consisting of a set of certain values from uncertain tuples. A possible world instance is affiliated with a probability value, computed by the product of all tuples within the instance and the product of the non-existing confidence of all tuples outside.

Table 1.2 illustrates the possible world space of total 12 instances for the data stream in Table 1.1. Each column is a possible world instance with the probability listed below. For example, tuples t₁, t₂ and t₃ occur in pw₁₀ at the same time, while t₄ does not occur, so the probability of pw₁₀ is computed as 0.016 (= 0.4 × 1.0 × 0.4 × (1 − 0.4 − 0.5)).

Table 1.2 The Possible World Space for Table 1.1

In general, an uncertain stream S is defined as S = {t₁, t₂, . . .}, where an uncertain tuple t_i appears at timestamp i. Each tuple is described as a discrete probability distribution function, t_i = 〈(v_i,1, p_i,1,), . . . ,(v_{i,m_i}, p_{i,m_i})〉, meaning that t_i can be one of m_i different certain values, denoted as v_i,1, . . . , v_{i,m_i}, with confidence of pi,1, . . . , pi,m_i. The sum of the probabilities of all choices is no greater than 1, i.e.,

. Especially, if only existential uncertainty exists, ∀i ≥ 1, m_i = 1 [20].

The streaming models can be categorized into the landmark, time-decay, and sliding-window models according to time aspect. The landmark model considers all tuples from a fixed time point to now. The time-decay model assigns a weight to each tuple, and the weight will decrease over time. This model usually works together with statistical aggregates, such as averages, histograms, and heavy hitters [21, 24]. The sliding-window model only considers all tuples within a pre-defined window W. In other words, the rest ones outside the window are filtered. For example, assuming W = 5, the window is [6, 10] at time 10.

One major challenge of uncertain data management is that the possible world space will grow up exponentially to the size of the uncertain database. Suppose there are N independent tuples in an uncertain database, the number of the possible world instances is O(2^N). Hence, it is infeasible to access all possible world instances one by one even when N is not tiny. Therefore, it is necessary to make a trade-off between the accuracy and overhead, so as to obtain high-quality approximate results with small computational overhead. Moreover, the possible world space will change a lot when a new tuple arrives, no matter upon the landmark model or the sliding-window model. For example, assuming W = 3, Tables 1.3(a) and 1.3(b) show the possible world space for the data in Table 1.1 at times 3 and 4 respectively. We observe that all possible world instances change.

Table 1.3 The Possible World Space for Table 1.1 over the Sliding-window Model (W = 3)

Furthermore, we also study a more comprehensive model in Chapter 5 where each tuple in the data stream is no longer a single value, but a set of probabilistic tuples. An example data stream may be: {{1:0.7, 2, 1.0}, {2:0.5, 3:0.7, 4:0.4}, . . .}. In this stream, the first set, {1:0.7, 2,1.0}, contains two tuples, 1 and 2, each with probability 0.7 and 1.0. The second set in the data stream contains three tuples, and so on. Similarly, we can construct a joint-possible world space based on any of two probabilistic sets.

For example, consider two probabilistic sets A and B in Table 1.4. They have eight (joint) possible worlds, which are listed together with their similarities and existential probabilities in Table 1.4(b). For example, the ...

Cover Page
Title
Copyright
Contents
East China Normal University Scientific Reports
Preface
Acknowledgments
About the Authors
1. Introduction
2. Top-k Queries Over the Sliding-window Model
3. ER-Topk Query Over the Landmark Model
4. Rarity Estimation
5. Set Similarity
6. Clustering
7. Conclusion
References
Index