Data Lake Development with Big Data


About This Book

Explore architectural approaches to building Data Lakes that ingest, index, manage, and analyze massive amounts of data using Big Data technologies


  • Comprehend the intricacies of architecting a Data Lake and build a data strategy around your current data architecture
  • Efficiently manage vast amounts of data and deliver it to multiple applications and systems with a high degree of performance and scalability
  • Packed with industry best practices and use-case scenarios to get you up-and-running

Who This Book Is For

This book is for architects and senior managers who are responsible for building a strategy around their current data architecture; it helps them identify the need for a Data Lake implementation in an enterprise context. Readers will need a good knowledge of master data management and information lifecycle management, along with experience with Big Data technologies.

What You Will Learn

  • Identify the need for a Data Lake in your enterprise context and learn to architect a Data Lake
  • Learn to build various tiers of a Data Lake, such as data intake, management, consumption, and governance, with a focus on practical implementation scenarios
  • Find out the key considerations to be taken into account while building each tier of the Data Lake
  • Understand Hadoop-oriented data transfer mechanisms to ingest data in batch, micro-batch, and real-time modes
  • Explore various data integration needs and learn how to perform data enrichment and data transformations using Big Data technologies
  • Enable data discovery on the Data Lake so that users can easily find and explore data
  • Discover how data is packaged and provisioned for consumption
  • Comprehend the importance of including data governance disciplines while building a Data Lake
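To make the batch versus micro-batch distinction above concrete, here is a minimal, illustrative Python sketch (not from the book) in which records arrive one at a time, as in a real-time feed, but are handed downstream in small fixed-size groups, trading a little latency for cheaper per-write overhead on HDFS-style storage. The function name and batch size are hypothetical choices for illustration.

```python
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a stream of events into fixed-size micro-batches.

    Events are consumed one at a time (as from a streaming source)
    and yielded in groups of at most ``batch_size`` records.
    """
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


# Example: 7 events with a micro-batch size of 3 yield batches of 3, 3, and 1
events = [{"id": i} for i in range(7)]
batches = list(micro_batches(events, batch_size=3))
```

A batch_size of 1 degenerates to per-record (real-time) handling, while a very large batch_size approximates pure batch ingestion, so the same pipeline shape can serve all three modes.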

In Detail

A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and the architectural approaches to building them, so that they can ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed on demand by other Big Data applications.

Using industry best practices, this book guides readers through developing a Data Lake's capabilities, focusing on data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a solid understanding of how to build a Data Lake for Big Data.
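Lineage tracking and metadata capture of the kind mentioned here can be sketched very simply: each ingested object gets a metadata record noting its source, ingestion time, and checksum, so that downstream zones can later verify integrity and trace origin. The record layout and field names below are a hypothetical illustration, not the book's schema.

```python
import hashlib
from datetime import datetime, timezone


def ingest_metadata(source: str, payload: bytes) -> dict:
    """Build a lineage record for one ingested object.

    The SHA-256 checksum supports later bit-level integrity checks;
    the source and timestamp let consumers trace where and when the
    data entered the lake. Field names here are illustrative.
    """
    return {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


# Example: record lineage metadata for a small CSV payload
record = ingest_metadata("crm/customers.csv", b"id,name\n1,Ada\n")
```

In a real lake such records would be written to a metadata catalog alongside the raw data, enabling the periodic checksum checks and lineage queries described in the governance chapters.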

Style and approach

Data Lake Development with Big Data provides architectural approaches to building a Data Lake. It follows a use case-based approach in which practical implementation scenarios for each key component are explained, helping you understand how these use cases are realized in a Data Lake. The chapters are organized to mirror the sequential flow of data through a Data Lake.

Data Lake Development with Big Data by Pradeep Pasupuleti and Beulah Salome Purra is available in PDF and ePUB formats, in the Computer Science & Databases category.

Information

Year: 2015
ISBN: 9781785888083
Edition: 1



Table of Contents

Data Lake Development with Big Data
Credits
About the Authors
Acknowledgement
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. The Need for Data Lake
Before the Data Lake
Need for Data Lake
Defining Data Lake
Key benefits of Data Lake
Challenges in implementing a Data Lake
When to go for a Data Lake implementation
Data Lake architecture
Architectural considerations
Architectural composition
Architectural details
Understanding Data Lake layers
The Data Governance and Security Layer
The Information Lifecycle Management layer
The Metadata Layer
Understanding Data Lake tiers
The Data Intake tier
The Source System Zone
The Transient Zone
The Raw Zone
Batch Raw Storage
Real-time Raw Storage
The Data Management tier
The Integration Zone
The Enrichment Zone
The Data Hub Zone
The Data Consumption tier
The Data Discovery Zone
The Data Provisioning Zone
Summary
2. Data Intake
Understanding Intake tier zones
Source System Zone functionalities
Understanding connectivity processing
Understanding Intake Processing for data variety
Structured data
The need for integrating Structured Data in the Data Lake
Structured data loading approaches
Semi-structured data
The need for integrating semi-structured data in the Data Lake
Semi-structured data loading approaches
Unstructured data
The need for integrating Unstructured data in the Data Lake
Unstructured data loading approaches
Transient Landing Zone functionalities
File validation checks
File duplication checks
File integrity checks
File size checks
File periodicity checks
Data Integrity checks
Checking record counts
Checking for column counts
Schema validation checks
Raw Storage Zone functionalities
Data lineage processes
Watermarking process
Metadata capture
Deep Integrity checks
Bit Level Integrity checks
Periodic checksum checks
Security and governance
Information Lifecycle Management
Practical Data Ingestion scenarios
Architectural guidance
Structured data use cases
Semi-structured and unstructured data use cases
Big Data tools and technologies
Ingestion of structured data
Sqoop
Use case scenarios for Sqoop
WebHDFS
Use case scenarios for WebHDFS
Ingestion of streaming data
Apache Flume
Use case scenarios for Flume
Fluentd
Use case scenarios for Fluentd
Kafka
Use case scenarios for Kafka
Amazon Kinesis
Use case scenarios for Kinesis
Apache Storm
Use case scenarios for Storm
Summary
3. Data Integration, Quality, and Enrichment
Introduction to the Data Management Tier
Understanding Data Integration
Introduction to Data Integration
Prominent features of Data Integration
Loosely coupled Integration
Ease of use
Secure access
High-quality data
Lineage tracking
Practical Data Integration scenarios
The workings of Data Integration
Raw data discovery
Data quality assessment
Profiling the data
Data cleansing
Deletion of missing, null, or invalid values
Imputation of missing, null, or invalid values
Data transformations
Unstructured text transformation techniques
Structured data transformations
Data enrichment
Collect metadata and track data lineage
Traditional Data Integration versus Data Lake
Data pipelines
Addressing the limitations using Data Lake
Data partitioning
Addressing the limitations using Data Lake
Scale on demand
Addressing the limitations using Data Lake
Data ingest parallelism
Addressing the limitations using Data Lake
Extensibility
Addressing the limitations using Data Lake
Big Data tools and technologies
Syncsort
Use case scenarios for Syncsort
Talend
Use case scenarios for Talend
Pentaho
Use case scenarios for Pentaho
Summary
4. Data Discovery and Consumption
Understanding the Data Consumption tier
Data Consumption – Traditional versus Data Lake
An introduction to Data Consumption
Practical Data Consumption scenarios
Data Discovery and metadata
Enabling Data Discovery
Data classification
Classifying unstructured data
Named entity recognition
Topic modeling
Text clustering
Applications of data classification
Relation extraction
Extracting relationships from unstructured data
Feature-based methods
Understanding how feature-based methods work
Implementation
Semantic technologies
Understanding how semantic technologies work
Implementation
Extracting Relationships from structured data
Applications of relation extraction
Indexing data
Inverted index
Understanding how inverted index works
Implementation
Applications of Indexing
Performing Data Discovery
Semantic search
Word sense disambiguation
Latent Semantic Analysis
Faceted search
Fuzzy search
Edit distance
Wildcard and regular expressions
Data Provisioning and metadata
Data publication
Data subscription
Data Provisioning functionalities
Data formatting
Data selection
Data Provisioning approaches
Post-provisioning processes
Architectural guidance
Data Discovery
Big Data tools and technologies
Elasticsearch
Use case scenarios for Elasticsearch
IBM InfoSphere Data Explorer
Use case scenarios for IBM InfoSphere Data Explorer
Tableau
Use case scenarios for Tableau
Splunk
Use case scenarios for Splunk
Data Provisioning
Big Data tools and technologies
Data Dispatch
Use case scenarios for Data Dispatch
Summary
5. Data Governance
Understanding Data Governance
Introduction to Data Governance
The need for Data Governance
Governing Big Data in the Data Lake
Data Governance – Traditional versus Data Lake
Practical Data Governance scenarios
Data Governance components
Metadata management and lineage tracking
Data security and privacy
Big Data implications for security and privacy
Security issues in the Data Lake tiers
The Intake Tier
The Management Tier
The Consumption Tier
Information Lifecycle Management
Big Data implications for ILM
Implementing ILM using Data Lake
The Intake Tier
The Management Tier
The Consumption Tier
Architectural guidance
Big Data tools and technologies
Apache Falcon
Understanding how Falcon works
Use case scenarios for Falcon
Apache Atlas
Understanding how Atlas works
Use case scenarios for Atlas
IBM Big Data platform
Understanding how governance is provided in IBM Big Data platform
Use case scenarios for IBM Big Data platform
The current and future trends
Data Lake and future enterprise trajectories
Future Data Lake technologies
Summary
Index

Data Lake Development with Big Data

Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either...
