Scala for Data Science
eBook - ePub

  1. 416 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

Leverage the power of Scala with different tools to build scalable, robust data science applications

  • A complete guide for scalable data science solutions, from data ingestion to data visualization
  • Deploy horizontally scalable data processing pipelines and take advantage of web frameworks to build engaging visualizations
  • Build functional, type-safe routines to interact with relational and NoSQL databases with the help of tutorials and examples provided

Who This Book Is For

If you are a Scala developer or data scientist, or if you want to enter the field of data science, then this book will give you all the tools you need to implement data science solutions.

What You Will Learn

  • Transform and filter tabular data to extract features for machine learning
  • Implement your own algorithms or take advantage of MLlib's extensive suite of models to build distributed machine learning pipelines
  • Read, transform, and write data to both SQL and NoSQL databases in a functional manner
  • Write robust routines to query web APIs
  • Read data from web APIs such as the GitHub or Twitter API
  • Use Scala to interact with MongoDB, which offers high performance and helps to store large data sets with uncertain query requirements
  • Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
  • Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive

In Detail

Scala is a multi-paradigm programming language (it supports both object-oriented and functional programming) and scripting language used to build applications for the JVM. Languages such as R, Python, and Java are the ones most commonly used for data science. Scala, however, is particularly good at analyzing large data sets without any significant impact on performance, and so it is being adopted by many developers and data scientists. Data scientists might be aware that building applications that are truly scalable is hard. Scala, with its powerful functional libraries for interacting with databases and building scalable frameworks, will give you the tools to construct robust data pipelines.

This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala.

Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and model your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. Building on Scala's emphasis on functional structures and immutability, you will learn how to use the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks.
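
To give a flavour of the simpler concurrency constructs mentioned above, here is a minimal sketch of composing two futures using only the Scala standard library. It is not taken from the book: the object `FutureSketch` and the helpers `mean` and `meanDifference` are hypothetical names chosen for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureSketch {
  // Hypothetical helper: arithmetic mean of a non-empty sequence.
  def mean(values: Seq[Double]): Double = values.sum / values.size

  // Compute two means concurrently and return their difference.
  def meanDifference(xs: Seq[Double], ys: Seq[Double]): Future[Double] = {
    // Start both futures before composing them, so they can run in parallel.
    val meanX = Future(mean(xs))
    val meanY = Future(mean(ys))
    for {
      mx <- meanX
      my <- meanY
    } yield mx - my
  }

  def main(args: Array[String]): Unit = {
    val difference = meanDifference(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0))
    // Blocking is acceptable at the edge of a small program like this one.
    println(Await.result(difference, 5.seconds))
  }
}
```

Creating both futures before the `for` comprehension matters: if they were created inside it, the second computation would only start once the first had finished.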

This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.

Style and approach

A tutorial with complete examples, this book will give you the tools to start building useful data engineering and data science solutions straight away.

Frequently asked questions

Yes, you can access Scala for Data Science by Pascal Bugnion in PDF and/or ePUB format, as well as other popular books in Computer Science & Object Oriented Programming. We have over one million books available in our catalogue for you to explore.

Information

Year
2016
ISBN
9781785281372
Edition
1

Table of Contents

Scala for Data Science
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Installing the JDK
Installing and using SBT
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
eBooks, discount offers, and more
Questions
1. Scala and Data Science
Data science
Programming in data science
Why Scala?
Static typing and type inference
Scala encourages immutability
Scala and functional programs
Null pointer uncertainty
Easier parallelism
Interoperability with Java
When not to use Scala
Summary
References
2. Manipulating Data with Breeze
Code examples
Installing Breeze
Getting help on Breeze
Basic Breeze data types
Vectors
Dense and sparse vectors and the vector trait
Matrices
Building vectors and matrices
Advanced indexing and slicing
Mutating vectors and matrices
Matrix multiplication, transposition, and the orientation of vectors
Data preprocessing and feature engineering
Breeze – function optimization
Numerical derivatives
Regularization
An example – logistic regression
Towards re-usable code
Alternatives to Breeze
Summary
References
3. Plotting with breeze-viz
Diving into Breeze
Customizing plots
Customizing the line type
More advanced scatter plots
Multi-plot example – scatterplot matrix plots
Managing without documentation
Breeze-viz reference
Data visualization beyond breeze-viz
Summary
4. Parallel Collections and Futures
Parallel collections
Limitations of parallel collections
Error handling
Setting the parallelism level
An example – cross-validation with parallel collections
Futures
Future composition – using a future's result
Blocking until completion
Controlling parallel execution with execution contexts
Futures example – stock price fetcher
Summary
References
5. Scala and SQL through JDBC
Interacting with JDBC
First steps with JDBC
Connecting to a database server
Creating tables
Inserting data
Reading data
JDBC summary
Functional wrappers for JDBC
Safer JDBC connections with the loan pattern
Enriching JDBC statements with the "pimp my library" pattern
Wrapping result sets in a stream
Looser coupling with type classes
Type classes
Coding against type classes
When to use type classes
Benefits of type classes
Creating a data access layer
Summary
References
6. Slick – A Functional Interface for SQL
FEC data
Importing Slick
Defining the schema
Connecting to the database
Creating tables
Inserting data
Querying data
Invokers
Operations on columns
Aggregations with "Group by"
Accessing database metadata
Slick versus JDBC
Summary
References
7. Web APIs
A whirlwind tour of JSON
Querying web APIs
JSON in Scala – an exercise in pattern matching
JSON4S types
Extracting fields using XPath
Extraction using case classes
Concurrency and exception handling with futures
Authentication – adding HTTP headers
HTTP – a whirlwind overview
Adding headers to HTTP requests in Scala
Summary
References
8. Scala and MongoDB
MongoDB
Connecting to MongoDB with Casbah
Connecting with authentication
Inserting documents
Extracting objects from the database
Complex queries
Casbah query DSL
Custom type serialization
Beyond Casbah
Summary
References
9. Concurrency with Akka
GitHub follower graph
Actors as people
Hello world with Akka
Case classes as messages
Actor construction
Anatomy of an actor
Follower network crawler
Fetcher actors
Routing
Message passing between actors
Queue control and the pull pattern
Accessing the sender of a message
Stateful actors
Follower network crawler
Fault tolerance
Custom supervisor strategies
Life-cycle hooks
What we have not talked about
Summary
References
10. Distributed Batch Processing with Spark
Installing Spark
Acquiring the example data
Resilient distributed datasets
RDDs are immutable
RDDs are lazy
RDDs know their lineage
RDDs are resilient
RDDs are distributed
Transformations and actions on RDDs
Persisting RDDs
Key-value RDDs
Double RDDs
Building and running standalone programs
Running Spark applications locally
Reducing logging output and Spark configuration
Running Spark applications on EC2
Spam filtering
Lifting the hood
Data shuffling and partitions
Summary
Reference
11. Spark SQL and DataFrames
DataFrames – a whirlwind introduction
Aggregation operations
Joining DataFrames together
Custom functions on DataFrames
DataFrame immutability and persistence
SQL statements on DataFrames
Complex data types – arrays, maps, and structs
Structs
Arrays
Maps
Interacting with data sources
JSON files
Parquet files
Standalone programs
Summary
References
12. Distributed Machine Learning with MLlib
Introducing MLlib – Spam classification
Pipeline components
Transformers
Estimators
Evaluation
Regularization in logistic regression
Cross-validation and model selection
Beyond logistic regression
Summary
References
13. Web APIs with Play
Client-server applications
Introduction to web frameworks
Model-View-Controller architecture
Single page applications
Building an application
The Play framework
Dynamic routing
Actions
Composing the response
Understanding and parsing the request
Interacting with JSON
Querying external APIs and consuming JSON
Calling external web services
Parsing JSON
Asynchronous actions
Creating APIs with Play: a summary
Rest APIs: best practice
Summary
References
14. Visualization with D3 and the Play Framework
GitHub user data
Do I need a backend?
JavaScript dependencies through web-jars
Towards a web application: HTML templates
Modular JavaScript through RequireJS
Bootstrapping the applications
Client-side program architecture
Designing the model
The event bus
AJAX calls through JQuery
Response views
Drawing plots with NVD3
Summary
References
A. Pattern Matching and Extractors
Pattern matching in for comprehen...
