Discovery And Fusion Of Uncertain Knowledge In Data

  1. 104 pages
  2. English

About This Book

Data analysis is of utmost importance in the mining of big data, where knowledge discovery and inference are the basis for intelligent systems that support real-world applications. The process involves knowledge acquisition, representation, inference and data fusion, and the Bayesian network (BN) plays a key role in knowledge representation, paving the way for coping with incomplete and fuzzy data when solving real-life problems.

This book presents the Bayesian network as a technology to support data-intensive and incremental learning in knowledge discovery, inference and data fusion in uncertain environments.

Contents:

  • Introduction
  • Data-Intensive Learning of Uncertain Knowledge
  • Data-Intensive Inferences of Large-Scale Bayesian Networks
  • Uncertain Knowledge Representation and Inference for Lineage Processing over Uncertain Data
  • Uncertain Knowledge Representation and Inference for Tracing Errors in Uncertain Data
  • Fusing Uncertain Knowledge in Time-Series Data
  • Summary

Readership: Graduate students, researchers and professionals in the field of artificial intelligence/machine learning and information sciences, especially in databases.

Keywords: Uncertain Knowledge; Bayesian Network; Data-Intensive Computing; Lineage; Inference; Fusion

Key Features:

  • Building on the preliminaries of BN (Pearl, 1988), this book establishes the connection between massive/uncertain/dynamic data management and uncertainty in artificial intelligence, taking BN as the knowledge framework. Unlike Pearl (1988) and Russell & Norvig (2010), it treats uncertain knowledge representation and the corresponding inferences from a data-driven perspective, focusing on the construction of knowledge models for specific applications. Unlike Han (2011), it focuses on the critical problems of knowledge engineering with BN as the framework, rather than on mining previously unknown patterns from data
  • This book presents theoretical conclusions, algorithmic strategies, running examples and empirical studies, emphasizing the soundness of the methods for discovery and fusion of uncertain knowledge in data from both the theoretical/semantic and the practical/executable perspectives
  • This book is suitable as a reference for researchers in the fields of massive data analysis, artificial intelligence and knowledge engineering. It can also be adopted as a textbook for graduate students majoring in data mining and knowledge discovery, intelligent data analysis, etc.

Authors: Kun Yue, Weiyi Liu, Hao Wu, Dapeng Tao, Ming Gao
Category: Computer Science & Data Modelling & Design

Information

Publisher: WSPC
Year: 2017
ISBN: 9789813227156

Chapter 1

Introduction

1.1 Background and Motivation

Discovering knowledge from data has long been a subject of great interest and importance in data mining, machine learning and artificial intelligence. Knowledge discovered from data can be integrated with knowledge specified by experts, and provides the basis for prediction and decision making. With the development of data acquisition, socialization and IT infrastructures, more and more heterogeneous data are generated dynamically and stored in distributed systems. For example, the scale of Web data is increasing rapidly in information services based on Web 2.0 and cloud computing, such as e-commerce, online advertising and microblogging [5, 15]; applications such as sensor data analysis, RFID networks, multimedia databases, location-based services, object identification and data integration mean that real-world data often exhibit uncertainty and imprecision [42, 128, 142, 177].
Indeed, a large number of applications have become data-centered, and big data analysis and knowledge discovery have attracted much attention in both academia and industry because of the indispensable requirements of data understanding, data utilization and information services [79, 92, 167, 177, 178]. From the perspective of knowledge engineering, acquisition, representation, inference and fusion are the critical problems and techniques in real-world applications of intelligent systems. Obtaining implicit conclusions from the described knowledge is the knowledge inference problem, one of the important tasks in knowledge discovery. Combining knowledge frameworks that rest on different theoretical foundations or correspond to different databases is the knowledge fusion problem, which is increasingly important in the big data setting [134, 167]. The basic ideas and classic mechanisms of knowledge engineering strengthen big data analysis, while the constantly emerging characteristics of data give knowledge engineering both new significance and new challenges.
Thus, it is desirable to develop appropriate methods for knowledge discovery and fusion in view of the inherent characteristics of massive, distributed, uncertain and dynamically changing data. Researchers have proposed various methods from different perspectives and on the basis of different underlying theories and techniques. In particular, uncertainty is ubiquitous in real applications, and uncertain knowledge is ubiquitously implied in data. For example, Cancer holds with a probability of 64% if both Smoking and X-ray hold; Cancer is caused by Smoking with a probability of 32%. More generally, uncertainty in artificial intelligence is a new development of artificial intelligence in the big data era [31, 142, 177].
Based on the above, we aim to establish the connection between massive/uncertain/dynamic data management and knowledge engineering, and specifically to explore methods for the acquisition, representation, inference and fusion of uncertain knowledge implied in data. As one of the important probabilistic graphical models, the Bayesian network (BN) is an effective framework for representing and inferring uncertain knowledge [128]. A BN is a directed acyclic graph (DAG) whose nodes are random variables, each of which has a conditional probability table (CPT) that quantifies its dependence on its parent variables. BNs have been widely used in realistic applications of data analysis, prediction and decision making, since they provide a concise graphical mechanism for describing dependencies and simplifying joint probability distributions.
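As an illustration of how a DAG with CPTs simplifies a joint distribution, the following is a minimal Python sketch. It assumes a hypothetical three-node structure in which Smoking and Xray are both parents of Cancer; only the 0.64 entry is taken from the example above, and every other probability, as well as the names `parents`, `cpt`, `joint_prob` and `marginal`, is an assumption made for illustration.

```python
from itertools import product

# Hypothetical DAG: each node maps to the tuple of its parents.
parents = {"Smoking": (), "Xray": (), "Cancer": ("Smoking", "Xray")}

# CPTs: P(node = True | parent values). Only 0.64 comes from the text;
# all other numbers are made-up placeholders.
cpt = {
    "Smoking": {(): 0.30},
    "Xray": {(): 0.10},
    "Cancer": {
        (True, True): 0.64,
        (True, False): 0.30,
        (False, True): 0.20,
        (False, False): 0.05,
    },
}


def local_prob(node, assignment):
    """P(node = assignment[node] | values of its parents in assignment)."""
    key = tuple(assignment[p] for p in parents[node])
    p_true = cpt[node][key]
    return p_true if assignment[node] else 1.0 - p_true


def joint_prob(assignment):
    """Joint probability as the product of one CPT entry per node."""
    prob = 1.0
    for node in parents:
        prob *= local_prob(node, assignment)
    return prob


def marginal(node, value=True):
    """Marginal of one node by brute-force summation over all assignments."""
    names = list(parents)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if assignment[node] == value:
            total += joint_prob(assignment)
    return total
```

The brute-force `marginal` enumerates every assignment, so its cost grows exponentially with the number of nodes; it is meant only to make the factorization concrete, not to be an efficient inference procedure.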
Adopting BN as the framework for knowledge representation and inference, in this book we present our research findings on uncertain knowledge discovery and fusion, incorporating the massive, distributed, uncertain and dynamically changing characteristics of data encountered in data analysis applications. For this purpose, it is necessary to consider two fundamental problems, model construction and knowledge inference, in line with the basic roadmap of uncertain knowledge research [44, 142, 167].
For the first problem, constructing the DAG of a BN is the most important and challenging step, after which the CPTs can be obtained easily. We consider the following two roadmaps:
(1) The DAG can be learned from data by measuring the likelihood-based coincidence between a candidate model and the sample data, so that the optimal model found by hill-climbing search is the required one. From this model-learning point of view, the massive, distributed and dynamically changing characteristics of data make it necessary to develop a parallel, incremental and efficient approach.
(2) The DAG can be transformed directly from specific data processing tasks (e.g., query processing) or from domain knowledge (e.g., a knowledge graph). Specifically, the DAG can be obtained by transformation from another kind of knowledge framework (e.g., a lineage expression in first-order predicate logic). The DAG can also be obtained by fusing relevant knowledge frameworks that rest on different theoretical foundations or correspond to different databases. From this model-transformation point of view, the need to preserve semantics or guarantee equivalence makes it necessary to develop theoretically sound algorithms.
For the second problem, we consider the following two concerns:
(1) For large-scale BNs learned from massive data or constructed for complex applications, inference by classic algorithms is infeasible because of the exponential complexity with respect to the large numbers of nodes and probability parameters. This makes it necessary to develop data-intensive inference algorithms that regard the BN itself as a massive dataset.
(2) Probabilistic inference is the fundamental step in knowledge-based applications (e.g., computing the probabilities of query results, or the probabilities that inputs induce errors over uncertain data), which makes it necessary to develop effective inference algorithms oriented to specific applications.
To address these two fundamental problems, we use MapReduce [48, 49] as the programming model for data-intensive computing to implement aggregation computations over massive data. We also incorporate the inherent relationships among first-order predicate logic, logical implication and DAGs into the construction of BNs, and the concepts of Markov equivalence [161], blame [35] and evidence fusion [150] into the semantic description and induction of uncertain knowledge. The contents of this book are summarized as follows:
(1) By extending the classic algorithm for learning a BN from data, we give a parallel and incremental method for learning a BN from distributed, massive and changing data using MapReduce [62, 184, 185].
(2) By extending the classic algorithm for inferring uncertain knowledge with a BN, we give a parallel inference method for computing joint probability distributions with large-scale BNs using MapReduce, together with a case study of user similarity discovery using the proposed inference algorithm [170, 178].
(3) Taking lineage analysis over uncertain data as the representative paradigm, we give a method for transforming lineage expressions (i.e., logical knowledge) into a BN (i.e., probabilistic knowledge), and then give algorithms for lineage processing based on probabilistic inference [182, 187].
(4) Taking BN as the framework for uncertain knowledge representation and inference, we further give an algorithm for detecting errors in query processing over uncertain data [57, 58].
(5) Adopting the qualitative probabilistic network (QPN) [163], the qualitative abstraction of the classic BN, as the knowledge framework, we give a semantics-preserving method for fusing uncertain knowledge in time-series data [183, 186].

1.2 Challenges, Research Issues and Basic Ideas

1.2.1 Learning and inferring uncertain knowledge in massive and changing data using MapReduce

Classic algorithms for BN learning and inference are not suited to practical applications of big data analysis. BN learning should be consistent with physically distributed datasets, and is more critical and challenging than in classic settings, since there is no universal theoretical or practical roadmap; instead, the task is explicitly application-driven and uncertainty is pervasively implied.
The first challenge, and the most important step for BN learning and subsequent inference, is to construct the DAG from distributed massive data. We extend the scoring & search algorithm [37] for model evaluation and incorporate hill-climbing search [76, 158] for model selection, adopting specific scoring metrics (e.g., minimum description length (MDL) [141, 153]). Because computing the score is expensive on large datasets, we use MapReduce for data-intensive processing of the aggregation queries for the marginal probabilities in MDL, and compute the MDL score for a given model and a massive dataset.
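The count aggregation behind such a score can be sketched as follows. This is a single-process Python imitation of the map and reduce steps, not the method developed in the book: it assumes one common textbook form of the MDL score (a penalty of half the base-2 logarithm of the sample size per free parameter, minus the log-likelihood; lower is better), and the function names, the `partitions` layout and the toy data are all hypothetical.

```python
import math
from collections import Counter
from functools import reduce


def map_counts(partition, parents):
    """Map step: local counts of (node, parent values, node value) in one partition."""
    counts = Counter()
    for row in partition:
        for node, pa in parents.items():
            pa_vals = tuple(row[p] for p in pa)
            counts[(node, pa_vals, row[node])] += 1
    return counts


def reduce_counts(a, b):
    """Reduce step: add the counters produced by two partitions."""
    return a + b


def mdl_score(partitions, parents, arity):
    """Assumed MDL form: 0.5 * log2(N) * (#free parameters) - log-likelihood."""
    counts = reduce(reduce_counts,
                    (map_counts(p, parents) for p in partitions), Counter())
    n = sum(len(p) for p in partitions)
    # Totals per (node, parent configuration), needed for conditional frequencies.
    pa_totals = Counter()
    for (node, pa_vals, _), c in counts.items():
        pa_totals[(node, pa_vals)] += c
    log_lik = sum(c * math.log2(c / pa_totals[(node, pa_vals)])
                  for (node, pa_vals, _), c in counts.items())
    n_params = sum((arity[node] - 1) * math.prod(arity[p] for p in pa)
                   for node, pa in parents.items())
    return 0.5 * math.log2(n) * n_params - log_lik


# Toy data split into two "distributed" partitions; B always equals A.
partitions = [
    [{"A": 0, "B": 0}, {"A": 1, "B": 1}],
    [{"A": 0, "B": 0}, {"A": 1, "B": 1}],
]
arity = {"A": 2, "B": 2}
dependent = {"A": (), "B": ("A",)}    # candidate DAG with edge A -> B
independent = {"A": (), "B": ()}      # candidate DAG with no edge
```

On this toy data the candidate with the edge A -> B receives the lower score, which is the kind of comparison a hill-climbing search would make between neighbouring candidate structures.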
Second, keeping the model consistent with dynamically changing data is important but challenging, and requires incremental revision of the learned model in response to new distributed data. We focus on incremental revision of the DAG rather than just the CPTs; the former is more challenging than the latter, which is what state-of-the-art work on incremental BN learning addresses. For this purpose, we define the concept of influence degree, based on likelihood, to measure the coincidence between the current BN and the new data. We then compute the influence degree of each node using MapReduce to determine the nodes around which the model should be revised.
Third, performing probabilistic inference efficiently with large-scale BNs is challenging, since the execution time is exponential in the number of nodes when joint probabilities are computed according to the chain rule. From the data-intensive computing point of view, we regard a large-scale BN as a massive dataset and store it as 〈key, value〉 pairs in a distributed file system (DFS). We then transform the operations involved in probabilistic inference into operations on the DFS, and consequently perform probabilistic inference using MapReduce. Further, we give a case study that discovers user similarities in social media using the proposed algorithm for data-intensive probabilistic inference.
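The idea of treating a BN as key-value data can be sketched in a few lines of Python. The key layout below (node, node value, parent assignment) is a hypothetical choice, since the schema actually used is not given in this excerpt; an in-memory dictionary stands in for the DFS, the chain A -> B -> C and its probabilities are invented, and the two functions only imitate the map and reduce phases within one process.

```python
from collections import defaultdict

# Hypothetical chain-shaped BN: A -> B -> C.
parents = {"A": (), "B": ("A",), "C": ("B",)}

# Hypothetical key-value layout standing in for records on a DFS:
# key = (node, node value, parent assignment), value = conditional probability.
kv_store = {
    ("A", 1, ()): 0.6,
    ("A", 0, ()): 0.4,
    ("B", 1, (("A", 1),)): 0.7,
    ("B", 0, (("A", 1),)): 0.3,
    ("B", 1, (("A", 0),)): 0.2,
    ("B", 0, (("A", 0),)): 0.8,
    ("C", 1, (("B", 1),)): 0.9,
    ("C", 0, (("B", 1),)): 0.1,
    ("C", 1, (("B", 0),)): 0.5,
    ("C", 0, (("B", 0),)): 0.5,
}


def map_phase(queries):
    """Map: for each query, emit (query id, CPT entry) for every node."""
    for qid, assignment in queries.items():
        for node, pa in parents.items():
            key = (node, assignment[node], tuple((p, assignment[p]) for p in pa))
            yield qid, kv_store[key]


def reduce_phase(pairs):
    """Reduce: multiply the factors sharing a query id (chain rule)."""
    result = defaultdict(lambda: 1.0)
    for qid, factor in pairs:
        result[qid] *= factor
    return dict(result)


queries = {
    "q1": {"A": 1, "B": 1, "C": 1},
    "q2": {"A": 0, "B": 0, "C": 1},
}
joint = reduce_phase(map_phase(queries))
```

Here `joint["q1"]` is the product 0.6 × 0.7 × 0.9 and `joint["q2"]` is 0.4 × 0.8 × 0.5. Grouping factors by query id is what would let many joint-probability queries be evaluated in one pass over the stored pairs.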

1.2.2 Representing and inferring uncertain knowledge for lineage processing over uncertain data

With respect to the uncertain characteristics of data, lineage (a.k.a. provenance) over uncertain data facilitates the correlation and coordination of uncertainty in query results with uncertainty in the input data, and lineage processing consists of tracing the origin of uncertainties based on the process of data production and evolution [12, 67].
First, probabilistic inference is the most challenging issue in the paradigm of uncertain data management [43], which is based up...

Table of contents

  1. Cover
  2. Halftitle
  3. Series Editors
  4. Title
  5. Copyright
  6. Dedication
  7. Preface
  8. About the Authors
  9. Acknowledgments
  10. Contents
  11. Chapter 1. Introduction
  12. Chapter 2. Data-Intensive Learning of Uncertain Knowledge
  13. Chapter 3. Data-Intensive Inferences of Large-Scale Bayesian Networks
  14. Chapter 4. Uncertain Knowledge Representation and Inference for Lineage Processing over Uncertain Data
  15. Chapter 5. Uncertain Knowledge Representation and Inference for Tracing Errors in Uncertain Data
  16. Chapter 6. Fusing Uncertain Knowledge in Time-Series Data
  17. Chapter 7. Summary
  18. References
  19. Index