eBook - ePub

Scala for Data Science

Name: Scala for Data Science
Author: Pascal Bugnion

416 pages
English
ePUB (mobile friendly)
About This Book

Leverage the power of Scala with different tools to build scalable, robust data science applications

A complete guide for scalable data science solutions, from data ingestion to data visualization
Deploy horizontally scalable data processing pipelines and take advantage of web frameworks to build engaging visualizations
Build functional, type-safe routines to interact with relational and NoSQL databases with the help of tutorials and examples provided

Who This Book Is For

If you are a Scala developer or data scientist, or if you want to enter the field of data science, then this book will give you all the tools you need to implement data science solutions.

What You Will Learn

Transform and filter tabular data to extract features for machine learning
Implement your own algorithms or take advantage of MLlib's extensive suite of models to build distributed machine learning pipelines
Read, transform, and write data to both SQL and NoSQL databases in a functional manner
Write robust routines to query web APIs
Read data from web APIs such as the GitHub or Twitter API
Use Scala to interact with MongoDB, which offers high performance and helps to store large data sets with uncertain query requirements
Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive

In Detail

Scala is a multi-paradigm programming language (it supports both object-oriented and functional programming) and scripting language used to build applications for the JVM. Languages such as R, Python, and Java are the ones most commonly used for data science. Scala is particularly good at analyzing large data sets without a significant impact on performance, so it is being adopted by many developers and data scientists. Data scientists will be aware that building applications that are truly scalable is hard. Scala, with its powerful functional libraries for interacting with databases and building scalable frameworks, will give you the tools to construct robust data pipelines.

This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala.

Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and model your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. Alongside Scala's emphasis on functional structures and immutability, you will learn how to use the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks.
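
As a rough illustration of the "simple concurrency constructs" mentioned above, the following is a minimal sketch of composing futures using only the Scala standard library. It is not taken from the book: the object name `FutureSketch`, the helper `wordCount`, and the sample strings are hypothetical and stand in for whatever slow, independent tasks (such as web API calls) a real pipeline would run.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureSketch {
  // Hypothetical stand-in for a slow, independent task such as a web API call.
  def wordCount(text: String): Int = text.split("\\s+").count(_.nonEmpty)

  // Start one future per document on the global execution context,
  // then combine the individual results into a single future total.
  def totalWords(docs: List[String]): Future[Int] =
    Future.sequence(docs.map(d => Future(wordCount(d)))).map(_.sum)

  def main(args: Array[String]): Unit = {
    val docs = List("scala for data science", "parallel collections and futures")
    // Blocking is acceptable at the edge of a small program; library code
    // would normally keep composing the future instead.
    val total = Await.result(totalWords(docs), 5.seconds)
    println(total) // prints 8
  }
}
```

The documents are immutable values and `wordCount` has no side effects, so the tasks can run concurrently without locks. That is the link between immutability and easier parallelism that the description alludes to.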

This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.

Style and approach

A tutorial with complete examples, this book will give you the tools to start building useful data engineering and data science solutions straightaway.

Information

Publisher

Packt Publishing

Year

2016

ISBN

9781785281372

Topic

Computer Science

Subtopic

Object Oriented Programming

Index

Scala for Data Science

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Installing the JDK

Installing and using SBT

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

eBooks, discount offers, and more

Questions

1. Scala and Data Science

Data science

Programming in data science

Why Scala?

Static typing and type inference

Scala encourages immutability

Scala and functional programs

Null pointer uncertainty

Easier parallelism

Interoperability with Java

When not to use Scala

Summary

References

2. Manipulating Data with Breeze

Code examples

Installing Breeze

Getting help on Breeze

Basic Breeze data types

Vectors

Dense and sparse vectors and the vector trait

Matrices

Building vectors and matrices

Advanced indexing and slicing

Mutating vectors and matrices

Matrix multiplication, transposition, and the orientation of vectors

Data preprocessing and feature engineering

Breeze – function optimization

Numerical derivatives

Regularization

An example – logistic regression

Towards re-usable code

Alternatives to Breeze

Summary

References

3. Plotting with breeze-viz

Diving into Breeze

Customizing plots

Customizing the line type

More advanced scatter plots

Multi-plot example – scatterplot matrix plots

Managing without documentation

Breeze-viz reference

Data visualization beyond breeze-viz

Summary

4. Parallel Collections and Futures

Parallel collections

Limitations of parallel collections

Error handling

Setting the parallelism level

An example – cross-validation with parallel collections

Futures

Future composition – using a future's result

Blocking until completion

Controlling parallel execution with execution contexts

Futures example – stock price fetcher

Summary

References

5. Scala and SQL through JDBC

Interacting with JDBC

First steps with JDBC

Connecting to a database server

Creating tables

Inserting data

Reading data

JDBC summary

Functional wrappers for JDBC

Safer JDBC connections with the loan pattern

Enriching JDBC statements with the "pimp my library" pattern

Wrapping result sets in a stream

Looser coupling with type classes

Type classes

Coding against type classes

When to use type classes

Benefits of type classes

Creating a data access layer

Summary

References

6. Slick – A Functional Interface for SQL

FEC data

Importing Slick

Defining the schema

Connecting to the database

Creating tables

Inserting data

Querying data

Invokers

Operations on columns

Aggregations with "Group by"

Accessing database metadata

Slick versus JDBC

Summary

References

7. Web APIs

A whirlwind tour of JSON

Querying web APIs

JSON in Scala – an exercise in pattern matching

JSON4S types

Extracting fields using XPath

Extraction using case classes

Concurrency and exception handling with futures

Authentication – adding HTTP headers

HTTP – a whirlwind overview

Adding headers to HTTP requests in Scala

Summary

References

8. Scala and MongoDB

MongoDB

Connecting to MongoDB with Casbah

Connecting with authentication

Inserting documents

Extracting objects from the database

Complex queries

Casbah query DSL

Custom type serialization

Beyond Casbah

Summary

References

9. Concurrency with Akka

GitHub follower graph

Actors as people

Hello world with Akka

Case classes as messages

Actor construction

Anatomy of an actor

Follower network crawler

Fetcher actors

Routing

Message passing between actors

Queue control and the pull pattern

Accessing the sender of a message

Stateful actors

Follower network crawler

Fault tolerance

Custom supervisor strategies

Life-cycle hooks

What we have not talked about

Summary

References

10. Distributed Batch Processing with Spark

Installing Spark

Acquiring the example data

Resilient distributed datasets

RDDs are immutable

RDDs are lazy

RDDs know their lineage

RDDs are resilient

RDDs are distributed

Transformations and actions on RDDs

Persisting RDDs

Key-value RDDs

Double RDDs

Building and running standalone programs

Running Spark applications locally

Reducing logging output and Spark configuration

Running Spark applications on EC2

Spam filtering

Lifting the hood

Data shuffling and partitions

Summary

Reference

11. Spark SQL and DataFrames

DataFrames – a whirlwind introduction

Aggregation operations

Joining DataFrames together

Custom functions on DataFrames

DataFrame immutability and persistence

SQL statements on DataFrames

Complex data types – arrays, maps, and structs

Structs

Arrays

Maps

Interacting with data sources

JSON files

Parquet files

Standalone programs

Summary

References

12. Distributed Machine Learning with MLlib

Introducing MLlib – Spam classification

Pipeline components

Transformers

Estimators

Evaluation

Regularization in logistic regression

Cross-validation and model selection

Beyond logistic regression

Summary

References

13. Web APIs with Play

Client-server applications

Introduction to web frameworks

Model-View-Controller architecture

Single page applications

Building an application

The Play framework

Dynamic routing

Actions

Composing the response

Understanding and parsing the request

Interacting with JSON

Querying external APIs and consuming JSON

Calling external web services

Parsing JSON

Asynchronous actions

Creating APIs with Play: a summary

Rest APIs: best practice

Summary

References

14. Visualization with D3 and the Play Framework

GitHub user data

Do I need a backend?

JavaScript dependencies through web-jars

Towards a web application: HTML templates

Modular JavaScript through RequireJS

Bootstrapping the applications

Client-side program architecture

Designing the model

The event bus

AJAX calls through JQuery

Response views

Drawing plots with NVD3

Summary

References

A. Pattern Matching and Extractors

Pattern matching in for comprehen...