What Is Big Data?
Today, roughly half of the world's population interacts with online services. Data are generated at an unprecedented scale from a wide range of sources. The way we view and manipulate the data is also changing, as we find new ways of extracting insights from unstructured data sources. Managing data volume has changed considerably over recent years (Malik, 2013), because we need to cope with demands to deal with terabytes, petabytes, and now even zettabytes. We now need a vision that includes what the data might be used for in the future, so that we can begin to plan and budget for likely resources. A commercial business organization quickly generates a few terabytes of data, and individuals are starting to accumulate this amount of personal data. Storage capacity has roughly doubled every 14 months over the past 3 decades. Concurrently, the price of data storage has fallen, which has affected the storage strategies that enterprises employ (Kumar et al., 2012): they buy more storage rather than determine what to delete. Because enterprises have started to discover new value in data, they are treating it like a tangible asset (Laney, 2001). This enormous generation of data, along with the adoption of new strategies to deal with the data, has caused the emergence of a new era of data management, commonly referred to as Big Data.
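To put the doubling figure in perspective, the compounding can be worked out directly. The following is a minimal sketch that takes the 14-month doubling period and the 30-year span from the text as given; the function name growth_factor is made up for this illustration.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Return how many times capacity multiplies over `years`
    if it doubles once every `doubling_months` months."""
    doublings = (years * 12) / doubling_months
    return 2 ** doublings


# Figures taken from the text: doubling every 14 months for 3 decades.
factor = growth_factor(years=30, doubling_months=14)
print(f"{360 / 14:.1f} doublings, growth factor of about {factor:.2e}")
```

Under these assumptions, capacity grows by a factor in the tens of millions, which helps explain why buying more storage has become the default strategy.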
Big Data has a multitude of definitions, with some research suggesting that the term itself is a misnomer (Eaton et al., 2012). Big Data exposes the wide gap between the analytical techniques historically used for data management and those we require now (Barlow, 2013). Datasets have always grown over the years, but we are currently adopting improved practices for large-scale processing and storage. Big Data is not only huge in terms of volume; it is also dynamic and takes various forms. On the whole, we have never seen these kinds of data in the history of technology.
Broadly speaking, Big Data can be defined as the emergence of new datasets with massive volume that change at a rapid pace, are very complex, and exceed the reach of the analytical capabilities of commonly used hardware environments and software tools for data management. In short, the volume of data has become too large to handle with conventional tools and methods.
With advances in science, medicine, and business, the sources that generate data increase every day, especially from electronic communications as a result of human activities. Such data are generated from e-mail, radiofrequency identification, mobile communication, social media, health care systems and records, enterprise data such as retail, transport, and utilities, and operational data from sensors and satellites. The data generated from these sources are usually unprocessed (raw) and require various stages of processing for analytics. Generally, some processing converts unstructured data into semi-structured data; if they are processed further, the data are regarded as structured. About 80% of the world's data are semi-structured or unstructured. Some enterprises largely dealing with Big Data are Facebook, Twitter, Google, and Yahoo, because the bulk of their data are regarded as unstructured. As a consequence, these enterprises were early adopters of Big Data technology.
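The progression from raw to semi-structured to structured data can be illustrated with a small sketch. The log line format, the field names, and the Reading record are all invented for this example and do not come from any particular system.

```python
from dataclasses import dataclass

# Hypothetical raw (unprocessed) input: a line of free text from a sensor log.
RAW = "2013-05-04 12:30:01 sensor=A7 temp=21.5C status=ok"


def to_semi_structured(line: str) -> dict:
    """First pass: pull key=value pairs into a self-describing dict.
    Values are still strings, and the set of keys may vary per line."""
    date, time, *rest = line.split()
    record = {"timestamp": f"{date} {time}"}
    for token in rest:
        key, _, value = token.partition("=")
        record[key] = value
    return record


@dataclass
class Reading:
    """Structured form: a fixed set of fields with fixed types."""
    timestamp: str
    sensor: str
    temp_celsius: float
    ok: bool


def to_structured(record: dict) -> Reading:
    """Second pass: enforce a schema and convert values to typed fields."""
    return Reading(
        timestamp=record["timestamp"],
        sensor=record["sensor"],
        temp_celsius=float(record["temp"].rstrip("C")),
        ok=record["status"] == "ok",
    )


semi = to_semi_structured(RAW)
reading = to_structured(semi)
print(semi)
print(reading)
```

Each pass adds constraints: the dictionary names its own fields but accepts anything, whereas the typed record rejects input that does not fit the schema.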
The Internet of Things (IoT) has increased data generation dramatically, because patterns of usage of IoT devices have changed recently. A simple snapshot has become a data generation activity. Along with image recognition, today's technology allows users to take and name a photograph, identify the individuals in the picture, and include the geographical location, time, and date, before uploading the photo over the Internet within an instant. This is a quick data generation activity with considerable volume, velocity, and variety.
How Different Is Big Data?
The concept of Big Data is not new to the technological community. It can be seen as the logical extension of existing technology such as storage and access strategies and processing techniques. Storing data is not new, but doing something meaningful (Hofstee et al., 2013), and doing it quickly, with the stored data is the challenge with Big Data (Gartner, 2011). Big Data analytics has more to do with information technology management than with simply handling databases. Enterprises used to retrieve historical data for processing to produce a result. Now, Big Data deals with real-time processing of the data and producing quick results (Biem et al., 2013). As a result, months, weeks, and days of processing have been reduced to minutes, seconds, and even fractions of seconds. In reality, the concept of Big Data is making things possible that would have been considered impossible not long ago.
Most existing storage strategies followed a knowledge management-based storage approach, using data warehouses (DW). This approach follows a hierarchy flowing from data to information, knowledge, and wisdom, known as the DIKW hierarchy. Elements at each level are the building blocks of the next level. This architecture makes access policies more complex, and most existing databases are no longer able to support Big Data. Big Data storage models need more accuracy, and the semi-structured and unstructured nature of Big Data is driving the adoption of storage models that use cross-linked data. Even when related data are physically located in different parts of the DW, a logical connection between them remains. Typically we use algorithms to process data on standalone machines and over the Internet. Most or all of these algorithms are bounded by space and time constraints, and they might stop functioning correctly if an attempt is made to exceed those bounds. Big Data is processed with algorithms (Gualtieri, 2013) that can run on a logically connected cluster of machines and so are not confined by the same time and space constraints.
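One common way to run an algorithm across many machines is to split the data into partitions, process each partition independently, and merge the partial results. The sketch below imitates that pattern on a single machine, with plain Python lists standing in for cluster nodes. It illustrates the general idea only and is not the specific class of algorithms the cited source describes; all function names and the sample lines are assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def partition(items: list, n_parts: int) -> list:
    """Split items into n_parts chunks, one per simulated node."""
    return [items[i::n_parts] for i in range(n_parts)]


def count_words(lines: list) -> Counter:
    """Work done independently on each partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(a: Counter, b: Counter) -> Counter:
    """Combine two partial results into one."""
    return a + b


lines = [
    "big data is big",
    "data moves fast",
    "fast data is varied",
]

# Each partition is counted separately, then the partial counts are merged.
partials = [count_words(part) for part in partition(lines, n_parts=2)]
total = reduce(merge, partials, Counter())
print(total)
```

Because each partition is processed without reference to the others, adding more nodes raises the amount of data that can be handled, provided the merge step stays cheap.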
Big Data processing is expected to produce results in real time or near-real time, and it is not meaningful to produce results after a prolonged period of processing. For instance, as users search for information using a search engine, the results that are displayed may be interspersed with advertisements. The advertisements will be for products or services that are related to the user's query. This is an example of the real-time response upon which Big Data solutions are focused.
More on Big Data: Types and Sources
Big Data arises from a wide variety of sources and is categorized according to the nature of the data, the complexity of processing them, and the analysis intended to extract value from them. On this basis, Big Data is classified as structured data, unstructured data, and semi-structured data.
Structured Data
Most of the data contained in traditional database systems are regarded as structured. These data are particularly suited to further analysis because they are less complex with defined length, semantics, and format. Records have well-defined fields with a high degree of organization (rows and columns), and the data usually possess meaningful codes in a standard form that computers can easily read. Often, data are organized into semantic chunks, and similar chunks with common description are usually grouped together. Structured data can be easily stored in databases and show reduced analytical complexity in searching, retrieving, categorizing, sorting, and analyzing with defined criteria.
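A short sketch shows why well-defined fields make analysis straightforward. It uses Python's built-in sqlite3 module with an in-memory database; the customers table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, city TEXT, spend REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, city, spend) VALUES (?, ?, ?)",
    [
        ("Ada", "London", 120.0),
        ("Grace", "New York", 75.5),
        ("Alan", "London", 42.0),
    ],
)

# Because every row has the same fields, grouping and sorting can be
# expressed with defined criteria in a single query.
rows = conn.execute(
    "SELECT city, COUNT(*), SUM(spend) FROM customers "
    "GROUP BY city ORDER BY SUM(spend) DESC"
).fetchall()
print(rows)
conn.close()
```

The query needs no parsing or interpretation step, because the schema already fixes what each column means.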
Structured data come from both machine- and human-generated sources. Machine-generated datasets, produced without human intervention, include sensor data, Web log data, call center detail records, data from smart meters, and data from trading systems. Human-generated data arise when people interact with computers and include input data, XML data, click stream data, and traditional enterprise data such as customer information from customer relationship management systems, enterprise resource planning data, general ledger data, and financial data.
Unstructured Data
Conversely, unstructured data lack a predefined data format and do not fit well into traditional relational database systems. Such data do not follow any rules or recognizable patterns and can be unpredictable. These data are more complex to explore, and their analytical complexity is high in terms of capturing, storing, and processing them and answering meaningful queries against them. More than 80% of data generated today are unstructured as a result of recording event data from daily activities.
Unstructured data are also generated by both machine and human sources. Some machine-generated data include image and video files generated from satellite and traffic sensors, geographical data from radars and sonar, and surveillance and security data from closed-circuit television (CCTV) sources. Human-generated data include social media data (e.g., Facebook and Twitter updates) (Murtagh, 2013; Wigan and Clarke, 2012), data from mobile communications, Web sources such as YouTube and Flickr, e-mails, documents, and spreadsheets.
Semi-structured Data
Semi-structured data are a combination of both structured and unstructured data. They still have the data organized in chunks, with similar chunks grouped together. However, the description of the chunks in the same group may not necessarily be the same. Some of the attributes of the data may be defined, and there is often a self-describing data model, but it is not as rigid as structured data. In this sense, semi-structu...