Apache Solr for Indexing Data

  1. 160 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

Enhance your Solr indexing experience with advanced techniques and the built-in functionalities available in Apache Solr

  • Learn about distributed indexing and real-time optimization to change index data on the fly
  • Index data from various sources and web crawlers using built-in analyzers and tokenizers
  • This step-by-step guide is packed with real-life examples on indexing data

Who This Book Is For

This book is for developers who want to deepen their experience of indexing in Solr by learning about the various index handlers, analyzers, and methods available in Solr. Beginner-level Solr development skills are expected.

What You Will Learn

  • Get to know the basic features of Solr indexing and the analyzers/tokenizers available
  • Index XML/JSON data in Solr using the HTTP POST tool and the cURL command
  • Work with Data Import Handler to index data from a database
  • Use Apache Tika with Solr to index Word documents, PDFs, and much more
  • Utilize Apache Nutch and Solr integration to index crawled data from web pages
  • Update indexes in real time from data feeds
  • Discover techniques to index multi-language and distributed data in Solr
  • Combine the various indexing techniques into a real-life working example of an online shopping web application
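
As a rough illustration of the JSON indexing item above, the following Python sketch builds the kind of HTTP POST that the cURL command would send to Solr's `/update` endpoint. The host and port, the `musicCatalogue` core name (borrowed from the chapter title in the table of contents), the field names, and the helper `build_update_request` are all assumptions made for illustration; they are not taken from the book's example code.

```python
import json
import urllib.request


def build_update_request(solr_url, core, docs, commit=True):
    """Build (but do not send) an HTTP POST that adds JSON documents to a core."""
    url = f"{solr_url}/{core}/update"
    if commit:
        # Ask Solr to commit right away so the documents become searchable.
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; the field names are assumptions for illustration.
docs = [
    {"id": "1", "title": "Kind of Blue", "artist": "Miles Davis"},
    {"id": "2", "title": "Blue Train", "artist": "John Coltrane"},
]
request = build_update_request("http://localhost:8983/solr", "musicCatalogue", docs)
print(request.get_method(), request.full_url)

# To send it against a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the sketch runnable without a Solr server; the commented lines show where the actual network call would go.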

In Detail

Apache Solr is a widely used, open source enterprise search server that delivers powerful indexing and searching features. These features help fetch relevant information from various sources and documentation. Solr also combines with other open source tools such as Apache Tika and Apache Nutch to provide more powerful features.

This fast-paced guide starts by helping you set up Solr and get acquainted with its basic building blocks, to give you a better understanding of Solr indexing. You'll quickly move on to indexing text and improving indexing time. Next, you'll focus on basic indexing techniques, various index handlers designed to modify documents, and indexing a structured data source through Data Import Handler.

Moving on, you will learn techniques to perform real-time indexing and atomic updates, as well as more advanced indexing techniques such as de-duplication. Later on, we'll help you set up a cluster of Solr servers that combine fault tolerance and high availability. You will also gain insights into working scenarios of different aspects of Solr and how to use Solr with e-commerce data.
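
To make the idea of an atomic update concrete, here is a minimal Python sketch of the JSON payload such an update uses: instead of resending the whole document, each changed field carries a modifier such as `set`. The helper name `atomic_update`, the `price` field, and the assumption that the unique key field is called `id` are hypothetical and not taken from the book.

```python
import json


def atomic_update(doc_id, **changes):
    """Return a Solr atomic-update document that applies "set" to each given field."""
    doc = {"id": doc_id}  # assumes the unique key field is named "id"
    for field, value in changes.items():
        # {"set": value} replaces only this field; other fields stay untouched.
        doc[field] = {"set": value}
    return doc


# Hypothetical example: change only the price of document "1".
payload = json.dumps([atomic_update("1", price=9.99)])
print(payload)
```

A payload like this would be posted to the same `/update` endpoint used for ordinary JSON indexing; adding `softCommit=true` to the request makes the change visible to searches without a full hard commit.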

By the end of the book, you will be confident working with Solr indexing and will have a solid knowledge base for programming your own indexing solutions efficiently.

Style and approach

This fast-paced guide is packed with examples that are written in an easy-to-follow style and are accompanied by detailed explanations. Working examples are included to help you get better results for your applications.

Apache Solr for Indexing Data by Sachin Handiekar and Anshul Johri is available in PDF and/or ePUB format.

Information

Year: 2015
ISBN: 9781783553235
Edition: 1

Table of Contents

Apache Solr for Indexing Data
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started
Overview and installation of Solr
Installing Solr in OS X (Mac)
Running Solr
Installing Solr in Windows
Installing Solr on Linux
The Solr architecture and directory structure
Solr directory structure
Cores in Solr (Multicore Solr)
Summary
2. Understanding Analyzers, Tokenizers, and Filters
Introducing analyzers
Analysis phases
Tokenizers
Standard tokenizer
Keyword tokenizer
Lowercase tokenizer
N-gram tokenizer
Filters
Lowercase filter
Synonym filter
Porter stem filter
Running your analyzer
Summary
3. Indexing Data
Indexing data in Solr
Introducing field types
Defining fields
Defining a unique key
Copy fields and dynamic fields
Building our musicCatalogue example
Using the Solr Admin UI
Facet searching
Summary
4. Indexing Data – The Basic Technique and Using Index Handlers
Inserting data into Solr
Configuring UpdateRequestHandler
Indexing documents using XML
Adding and updating documents
Deleting a document
Indexing documents using JSON
Adding a single document
Adding multiple JSON documents
Sequential JSON update commands
Indexing updates using CSV
Summary
5. Indexing Data with the Help of Structured Datasources – Using DIH
Indexing data from MySQL
Configuring datasource
DIH commands
Indexing data using XPath
Summary
6. Indexing Data Using Apache Tika
Introducing Apache Tika
Configuring Apache Tika in Solr
Indexing PDF and Word documents
Summary
7. Apache Nutch
Introducing Apache Nutch
Installing Apache Nutch
Configuring Solr with Nutch
Summary
8. Commits, Real-Time Index Optimizations, and Atomic Updates
Understanding soft commit, optimize, and hard commit
Using atomic updates in Solr
Using RealTime Get
Summary
9. Advanced Topics – Multilanguage, Deduplication, and Others
Multilanguage indexing
Removing duplicate documents (deduplication)
Content streaming
UIMA integration with Solr
Summary
10. Distributed Indexing
Setting up SolrCloud
The collections API
Updating configuration files
Distributed indexing and searching
Summary
11. Case Study of Using Solr in E-Commerce
Creating an AutoSuggest feature
Facet navigation
Search filtering and sorting
Relevancy boosting
Summary
Index

Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2015
Production reference: 1151215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-323-5
www.packtpub.com

Credits

Authors
Sachin Handiekar
Anshul Johri
Reviewers
Damiano Braga
Florian Hopf
Commissioning Editor
Ashwin Nair
Acquisition Editors
Rebecca Pedley
Reshma Raman
Content Development Editor
Rohit Kumar Singh
Technical Editor
Utkarsha S. Kadam
Copy Editor
Vikrant Phadke
Project Coordinator
Mary Alex
Proofreader
Safis Editing
Indexer
Rekha Nair
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph

About the Authors

Sachin Handiekar is a senior software developer with over 5 years of experience in Java EE development. He graduated in computer science from the University of Greenwich, London, and currently works for a global consulting company, developing enterprise applications using various open source technologies, such as Apache Camel, ServiceMix, ActiveMQ, and ZooKeeper.
He has a strong interest in open source projects, has contributed code to Apache Camel, and has developed plugins for Spring Social, which can be found on GitHub at https://github.com/sachin-handiekar.
He also actively writes about enterprise application development on his blog (http://www.sachinhandiekar.com/).
Anshul Johri has more than 10 years of technical experience in software engineering. He earned his master's degree in computer science from the University of Pune. Anshul has always had a start-up mindset, working on fast-paced development using cutting-edge technologies and doing multiple things at a time. His core strength has always been search technology, and Solr has played an important role in his career. Anshul started using Solr around 9 years ago and has never looked back. His work with Solr has grown steadily, both in using it and in contributing to the open source search community. He has used Solr extensively in all his organizations across various projects.
Because of his start-up mindset, he has worked with many start-ups in his career so far, including early-stage and mid-size ones such as Ibibo.com, Asklaila.com, and Bookadda.com. His last company was Amazon, where he spent around 2 years building scalable systems for Amazon Prime (a global product). Anshul recently started his own company in India with another friend from Amazon and founded http://www.r...
