Hands-On Data Analysis with Scala

Perform data collection, processing, manipulation, and visualization with Scala

  1. 298 pages
  2. English

About This Book

Master Scala's advanced techniques to solve real-world problems in data analysis and gain valuable insights from your data

Key Features

  • A beginner's guide for performing data analysis loaded with numerous rich, practical examples
  • Access to popular Scala libraries such as Breeze and Saddle for efficient data manipulation and exploratory analysis
  • Develop applications in Scala for real-time analysis and machine learning in Apache Spark

Book Description

Efficient business decisions based on an accurate understanding of business data help deliver better performance across products and services. This book helps you leverage popular Scala libraries and tools to perform core data analysis tasks with ease.

The book begins with a quick overview of the building blocks of a standard data analysis process. You will learn to perform basic tasks such as extraction, staging, validation, cleaning, and shaping of datasets. You will then take a deep dive into the data exploration and visualization areas of the data analysis life cycle. You will make use of popular Scala libraries such as Saddle, Breeze, Vegas, and PredictionIO for processing your datasets. You will learn statistical methods for deriving meaningful insights from data. You will also learn to create Apache Spark 2.x applications for complex, real-time data analysis. You will discover traditional machine learning techniques for data analysis. Furthermore, you will be introduced to neural networks and deep learning from a data analysis standpoint.

By the end of this book, you will be capable of handling large sets of structured and unstructured data, performing exploratory analysis, and building efficient Scala applications for discovering and delivering insights.

What you will learn

  • Learn techniques to determine the validity and confidence level of data
  • Apply quartiles and n-tiles to datasets to see how data is distributed into many buckets
  • Create data pipelines that combine multiple data lifecycle steps
  • Use built-in features to gain a deeper understanding of the data
  • Apply the Lasso regression analysis method to your data
  • Compare Apache Spark API with traditional Apache Spark data analysis

Who this book is for

If you are a data scientist or a data analyst who wants to learn how to perform data analysis using Scala, this book is for you. All you need is knowledge of the fundamentals of Scala programming.

Hands-On Data Analysis with Scala by Rajesh Gupta is available in PDF and/or ePUB format, alongside other popular books in Computer Science & Data Processing.

Information

Year
2019
ISBN
9781789344264

Section 1: Scala and Data Analysis Life Cycle

In this section, you will gain an insight into what data is, how it is prepared and processed, and how it is analyzed and stored. This section also introduces the Scala language and how we use Scala with data. You will learn some basic commands in Scala. You will also understand what a data pipeline is and the tasks involved in this pipeline.
This section will contain the following chapters:
  • Chapter 1, Scala Overview
  • Chapter 2, Data Analysis Life Cycle
  • Chapter 3, Data Ingestion
  • Chapter 4, Data Exploration and Visualization
  • Chapter 5, Applying Statistics and Hypothesis Testing

Scala Overview

Scala is a popular general-purpose, high-level programming language that typically runs on the Java Virtual Machine (JVM). The JVM is a time-tested platform that has proven itself in terms of stability and performance. A large number of very powerful libraries and frameworks have been built using Java. For instance, in the context of data analysis, there are many Java libraries available to handle different data formats, such as XML, JSON, Avro, and so on. Scala's interoperability with such well-tested libraries greatly increases a Scala programmer's productivity.
When it comes to data analysis and processing, it is often the case that there is an abundance of data transformation tasks that need to be performed. Some examples of such tasks are mapping from one representation to another, filtering irrelevant data, and joining one set of data with another set. Trying to solve such problems using the object-oriented paradigm often means that we have to write a significant amount of boilerplate code even to perform a fairly simple task. Oftentimes, solving data problems requires thinking in terms of input and transformations that are to be applied to this input. Scala's functional programming model provides a set of features that facilitate writing code that is concise and expressive. Spark is a popular distributed data analytics engine that has almost entirely been written in Scala. In fact, there is a strong resemblance between the Scala collection API and the Spark API.
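The three kinds of tasks mentioned above (mapping, filtering, and joining) can be sketched with the standard Scala collection API. This is only a minimal illustration: the `TransformSketch` object, the sample records, and the rule that a non-positive amount is irrelevant are all assumptions made up for this example.

```scala
// Hypothetical records used only for illustration.
object TransformSketch {
  // (userId, country) pairs
  val users: List[(Int, String)] = List((1, "US"), (2, "DE"), (3, "US"))
  // (userId, amount) pairs; we assume a non-positive amount is irrelevant
  val orders: List[(Int, Double)] =
    List((1, 20.0), (2, -5.0), (3, 12.5), (2, 7.5))

  // Filtering: drop the records we consider irrelevant.
  val validOrders: List[(Int, Double)] =
    orders.filter { case (_, amount) => amount > 0 }

  // Mapping from one representation to another: a list of pairs
  // becomes a lookup table from user id to country.
  val countryById: Map[Int, String] = users.toMap

  // Joining: pair each valid order with the country of its user.
  val joined: List[(String, Double)] =
    for {
      (userId, amount) <- validOrders
      country <- countryById.get(userId).toList
    } yield (country, amount)

  // Aggregating the joined data per country.
  val totalByCountry: Map[String, Double] =
    joined.groupBy(_._1).map { case (country, rows) =>
      country -> rows.map(_._2).sum
    }

  def main(args: Array[String]): Unit = println(totalByCountry)
}
```

Each step is a single expression that describes an input and a transformation, with no mutable state or explicit loops.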
Most of the Java libraries can be used with relative ease from Scala code. One can easily mix object-oriented and functional styles of programming in the same Scala code base. This ability provides a very simple pathway to a great deal of productivity. Some of the major benefits of using Scala are as follows:
  • Most Java libraries and frameworks can be reused from Scala. Scala code is compiled into Java bytecode and runs on the JVM. This makes it seamless to use existing Java code from a Scala program. In fact, it is not uncommon to have a mix of both Java and Scala code within a single project.
  • Scala's functional constructs can be used to write code that is simple, concise, and expressive.
  • We can still use object-oriented features where they are a better fit.
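As a rough sketch of the points in this list, the following example calls Java classes (`java.time.LocalDate`, `java.util.ArrayList`) directly from Scala and mixes a small object-oriented class with functional collection operations. The `InteropSketch` and `Event` names and the sample dates are hypothetical. The example uses `scala.collection.JavaConverters`, which matches the Scala 2.12 release discussed in this chapter; newer Scala versions deprecate it in favor of `scala.jdk.CollectionConverters`.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._

object InteropSketch {
  // Object-oriented style: a small class wrapping a Java type.
  class Event(val name: String, val date: LocalDate) {
    def daysUntil(other: LocalDate): Long =
      ChronoUnit.DAYS.between(date, other)
  }

  // A plain Java collection, as a Java library might return it.
  val javaDates = new JArrayList[String]()
  javaDates.add("2019-01-15")
  javaDates.add("2019-03-01")

  // Functional style: convert to a Scala collection and transform it.
  val events: List[Event] =
    javaDates.asScala.toList.zipWithIndex.map { case (text, i) =>
      new Event(s"event-$i", LocalDate.parse(text))
    }

  // Number of days between the first and the last event.
  val gapInDays: Long = events.head.daysUntil(events.last.date)

  def main(args: Array[String]): Unit =
    println(s"Gap between events: $gapInDays days")
}
```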
There are many useful data libraries and frameworks that are built using Scala. These are summarized later in this chapter. Apache Spark deserves a special mention. Apache Spark has become a de facto standard for performing distributed data analysis at scale. Since Spark is almost entirely written in Scala, its integration with Scala is the most complete, even though it supports Java, Python, and R as well. Spark's API has been heavily influenced by Scala's collection API. It also leverages Scala's case classes in its Dataset API, which significantly reduces the boilerplate code that would otherwise be necessary in Java.
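To give a feel for why case classes cut down on boilerplate, here is a minimal sketch that uses only the Scala standard library, so it runs without a Spark dependency. The `CaseClassSketch` object, the `Person` class, and the sample data are hypothetical. The `filter` and `map` calls shown on a `List` have similarly named counterparts on a Spark `Dataset`.

```scala
object CaseClassSketch {
  // One line provides a constructor, field accessors, equals/hashCode,
  // toString, copy, and pattern-matching support.
  case class Person(name: String, age: Int)

  val people: List[Person] =
    List(Person("Ana", 34), Person("Bo", 19), Person("Cy", 52))

  // Structural equality comes for free.
  val sameValue: Boolean = Person("Ana", 34) == people.head

  // copy creates a modified instance without mutating the original.
  val older: Person = people.head.copy(age = 35)

  // Collection API: filter and map over typed records.
  val adultNames: List[String] = people.filter(_.age >= 21).map(_.name)

  def main(args: Array[String]): Unit = {
    println(older)
    println(adultNames)
  }
}
```

An equivalent Java class would typically need hand-written or generated getters, `equals`, `hashCode`, and `toString` methods.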
The following topics will be covered in this chapter:
  • Installing and getting started with Scala
  • Object-oriented and functional programming overview
  • Scala case classes and the collection API
  • Overview of Scala libraries for data analysis

Getting started with Scala

At the time of writing, Scala version 2.12.8 (https://www.scala-lang.org/) is the most recent version of the language. We have the option of running Scala code online in a browser or installing and running it on our own machine. Running it online is a good first step for getting started with Scala; however, you will need to install it on your computer to learn the language's in-depth features and make use of it for data analysis.

Running Scala code online

There are some great resources available online to run code in your web browser. These are good for trying out small snippets of Scala code and gaining a better understanding of how Scala works. Some of these online resources allow you to share your code with someone else by generating a static URL that is typically valid for a few months. This feature could be useful for quick code review or collaboration with someone on the internet.
Let's look at two such online resources:
  • Scastie
  • ScalaFiddle

Scastie

Scastie (https://scastie.scala-lang.org) is a great online resource for trying out small Scala code snippets. All that is needed is a web browser and access to the internet.
The main screen is divided into two parts: the top part consists of a program and its output and the bottom part is the output from the backend server that compiles and runs the code. You can modify the code and run it any number of times by using the Run option, as shown in the following screenshot:

When run in the browser, this simple example displays what is going on in each step. In our example, the Scala expression produced Hello, World!, which is of the java.lang.String type. The expression is of the following pattern:
List of strings joined together by a comma with an exclamation mark suffix.
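An expression matching this pattern can be written with `mkString` from the standard library. The sketch below is one way to produce the described output, not necessarily the exact snippet shown on the Scastie page; the `HelloSketch` object name is an assumption.

```scala
object HelloSketch {
  // mkString(start, separator, end): join the list elements with ", "
  // and append "!" as the suffix.
  val greeting: String = List("Hello", "World").mkString("", ", ", "!")

  def main(args: Array[String]): Unit = {
    println(greeting)                  // Hello, World!
    println(greeting.getClass.getName) // java.lang.String
  }
}
```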

ScalaFiddle

ScalaFiddle (https://scalafiddle.io/) is another good online resource for running Scala code. This is a good resource for sharing your code with someone else by generating a URL after you save; however, it does require the user to be logged in to a GitHub account. The ScalaFiddle homepage is shown in the following s...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. About Packt
  5. Contributors
  6. Preface
  7. Section 1: Scala and Data Analysis Life Cycle
  8. Scala Overview
  9. Data Analysis Life Cycle
  10. Data Ingestion
  11. Data Exploration and Visualization
  12. Applying Statistics and Hypothesis Testing
  13. Section 2: Advanced Data Analysis and Machine Learning
  14. Introduction to Spark for Distributed Data Analysis
  15. Traditional Machine Learning for Data Analysis
  16. Section 3: Real-Time Data Analysis and Scalability
  17. Near Real-Time Data Analysis Using Streaming
  18. Working with Data at Scale
  19. Another Book You May Enjoy