Part I: Introduction
1 Background
Database systems are fundamental for the information society. Every day, an inestimable amount of data is produced, collected, stored and processed: online shopping, sending emails, using social media, or seeing your physician are just some of the day-to-day activities that involve data management. A properly working database management system is hence crucial for the smooth operation of these activities. In this chapter, we introduce the principles and properties that a database system should fulfill. Database management systems and their components as well as data modeling are the other two basic concepts treated in this chapter.
1.1 Database Properties
As data storage plays such a crucial role in most applications, database systems should guarantee a correct and reliable execution in several use cases. From an abstract perspective, we desire that a database system fulfill the following properties:
Data management. A database system not only stores data; it must also support operations for retrieving, searching and updating data. To enable interoperability with external applications, the database system must provide communication interfaces or application programming interfaces for several communication protocols or programming languages. A database system should also support transactions: a transaction is a sequence of operations on data in a database that must not be interrupted. In other words, the database executes operations within a transaction according to the "all or nothing" principle: either all operations succeed to their full extent or none of the operations is executed (and the subsequence of operations that was already executed is undone).
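The "all or nothing" behavior of a transaction can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module and a hypothetical account table with a non-negative balance constraint; the table, the names and the transfer function are our own illustration and are not taken from any particular system. The transfer first credits the receiver and then debits the sender, so that a failing debit shows how the already executed credit is undone.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` inside one transaction (hypothetical helper)."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception leaves the block.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint was violated; the credit above has been undone.
        return False

def balances(conn):
    """Return the current balances as a dict {name: balance}."""
    return dict(conn.execute("SELECT name, balance FROM account"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# Alice cannot pay 500: the debit fails, so the whole transfer is rolled back.
outcome = transfer(conn, "alice", "bob", 500)
print(outcome, balances(conn))
```

After the failed transfer, both balances are unchanged, although the credit to the receiver had already been executed when the debit failed.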
Scalability. The amount of data processed daily with modern information technology is tremendous. Processing these data can only be achieved by distribution of data in a network of database servers and a high level of parallelization. Database systems must flexibly react and adapt to a higher workload.
Heterogeneity. When collecting data or producing data (as output of some program), these data are usually not tailored to being stored in a relational table format. While the data in relational format are called structured and have a fixed schema which prescribes the structure of the data, data often come in different formats. Data that have a more flexible structure than the table format are called semi-structured; these can be tree-like structures (as used in XML documents) or, more generally, graph structures. Furthermore, data can be entirely unstructured (like arbitrary text documents).
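The three kinds of data can be contrasted in a brief sketch. It assumes Python with the standard xml.etree.ElementTree module; the sample records, the tag names and the child_tags helper are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Structured: every row follows the same fixed schema (name, city).
schema = ("name", "city")
rows = [("Alice", "Berlin"), ("Bob", "Oslo")]

# Semi-structured: a tree in which elements may have different children.
doc = ET.fromstring(
    "<people>"
    "<person><name>Alice</name><city>Berlin</city></person>"
    "<person><name>Bob</name><phone>555-0100</phone><phone>555-0101</phone></person>"
    "</people>"
)

def child_tags(element):
    """List the tags of the direct children of an element (hypothetical helper)."""
    return [child.tag for child in element]

# Each person element has its own shape; no schema forces them to agree.
shapes = [child_tags(person) for person in doc.findall("person")]

# Unstructured: free text without any schema.
note = "Alice moved to Berlin last year; Bob can be reached by phone."
```

Here all rows share one schema, whereas the two person elements differ in the number and kind of their children, and the text note has no structure that a database system could exploit directly.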
Efficiency. The majority of database applications need fast database systems. Online shopping and web searches rely on high-performance search and retrieval operations. Likewise, other database operations like store and update must be executed in a speedy fashion to ensure operability of database applications.
Persistence. The main purpose of a database system is to provide a long-term storage facility for data. Some modern database applications (like data stream processing) just require a kind of selective persistence: only some designated output data have to be stored onto long-term storage devices, whereas the majority of the data is processed in volatile main memory and discarded afterwards.
Reliability. Database systems must prevent data loss. Data stored in the database system should not be distorted unintentionally: data integrity must be maintained by the database system. Storing copies of data on other servers or storage media (a mechanism called physical redundancy or replication) is crucial for data recovery after a failure of a database server.
Consistency. The database system must do its best to ensure that no incorrect or contradictory data persist in the system. This involves the automatic verification of consistency constraints (data dependencies like primary key or foreign key constraints) and the automatic update of distributed data copies (the replicas).
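How a database system verifies consistency constraints automatically can be sketched as follows. The example assumes Python's built-in sqlite3 module; the customer and orders tables and the try_insert helper are hypothetical. Note that SQLite checks foreign keys only after they have been enabled with the foreign_keys pragma.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customer(id))"
)
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.commit()

def try_insert(conn, sql, params):
    """Run one insert and report whether the constraints accepted it (hypothetical helper)."""
    try:
        with conn:  # commit on success, roll back on a constraint violation
            conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

# Violates the primary key constraint: id 1 already exists.
duplicate = try_insert(conn, "INSERT INTO customer VALUES (?, ?)", (1, "Bob"))
# Violates the foreign key constraint: there is no customer 99.
dangling = try_insert(conn, "INSERT INTO orders VALUES (?, ?)", (10, 99))
# Satisfies both constraints.
valid = try_insert(conn, "INSERT INTO orders VALUES (?, ?)", (11, 1))
```

The application does not have to check these conditions itself; the database system rejects the two inconsistent inserts and accepts the valid one.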
Non-redundancy. While physical redundancy is decisive for the reliability of a database system, duplication of values inside the stored data sets (that is, logical redundancy) should best be avoided. First of all, logical redundancy wastes space on the storage media. Moreover, data sets with logical redundancy are prone to different forms of anomalies that can lead to erroneous or inconsistent data. Normalization is one way to transform data sets into a non-redundant format.
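A minimal sketch may illustrate why logical redundancy invites anomalies and how a normalized layout avoids them. The order data and the helper functions are invented for illustration; the sketch further assumes that each customer has exactly one city, so that the city can be moved into a separate customer table.

```python
# Logically redundant: the city of a customer is repeated in every order row.
orders_flat = [
    {"order_id": 1, "customer": "alice", "city": "Berlin"},
    {"order_id": 2, "customer": "alice", "city": "Berlin"},
    {"order_id": 3, "customer": "bob", "city": "Oslo"},
]

def normalize(rows):
    """Split rows into a customer table and an order table that references it.

    Assumes that the customer determines the city; if the input already
    contradicts itself, the last city seen for a customer wins.
    """
    customers = {}
    orders = []
    for row in rows:
        customers[row["customer"]] = {"city": row["city"]}
        orders.append({"order_id": row["order_id"], "customer": row["customer"]})
    return customers, orders

def cities_of(rows, customer):
    """Collect all cities recorded for one customer in the flat layout."""
    return {row["city"] for row in rows if row["customer"] == customer}

# Update anomaly: changing only one of the redundant copies
# leaves two contradictory cities for the same customer.
flat_copy = [dict(row) for row in orders_flat]
flat_copy[0]["city"] = "Munich"
contradictory = cities_of(flat_copy, "alice")

# In the normalized layout the city is stored once, so one update suffices.
customers, orders = normalize(orders_flat)
customers["alice"]["city"] = "Munich"
```

In the flat layout, a partial update leaves the data set inconsistent; in the normalized layout, there is only one place where the city can be changed.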
Multi-User Support. Modern database systems must support concurrent accesses by multiple users or applications. Those independent accesses should run in isolation and not interfere with each other so that a user does not notice that other users are accessing the database system at the same time. Another major issue with multi-user support is the need for access control: data of one user should be protected from unwanted accesses by other users. A simple strategy for access control is to only allow users access to certain views on the data sets. A well-defined authentication mechanism is crucial to implement access control.
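The idea of restricting users to certain views can be sketched as follows, again assuming Python's built-in sqlite3 module. The patient table, the patient_directory view and the visible_columns helper are hypothetical. SQLite itself has no user accounts, so the sketch only shows the restricted projection; in a multi-user database system, one would additionally grant a user access to the view but not to the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, diagnosis TEXT)"
)
conn.execute("INSERT INTO patient VALUES (1, 'Alice', 'flu'), (2, 'Bob', 'asthma')")
# The view exposes names only; the sensitive diagnosis column is left out.
conn.execute("CREATE VIEW patient_directory AS SELECT id, name FROM patient")
conn.commit()

def visible_columns(conn, relation):
    """Return the column names of a table or view (hypothetical helper).

    `relation` must be a trusted, hard-coded name, because it is
    inserted into the SQL text directly.
    """
    cursor = conn.execute(f"SELECT * FROM {relation} LIMIT 0")
    return [column[0] for column in cursor.description]

directory_columns = visible_columns(conn, "patient_directory")
directory_rows = conn.execute(
    "SELECT * FROM patient_directory ORDER BY id"
).fetchall()
```

A user who may only query patient_directory can look up patients by name but never sees a diagnosis.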
| A database system should manage large amounts of heterogeneous data in an efficient, persistent, reliable, consistent, non-redundant way for multiple users. |
Database systems often do not satisfy all of these requirements, or satisfy them only to a certain extent. When choosing a database system for a specific application, clarifying all mandatory requirements and weighing the pros and cons of the available systems is the first and foremost task.
Fig. 1.1. Database management system and interacting components
1.2 Database Components
The software component that is in charge of all database operations is the database management system (DBMS). Several other systems and components interact with the DBMS as shown in Figure 1.1. The DBMS relies on the operating system and the file system of the database server to store the data on disk. The DBMS also relies on the operating system to be able to use the network interfaces for communication with external applications or other database servers.
The low-level file system (or the operating system) has no knowledge of the internal structure or meaning of the stored data; it just handles the stored data as arbitrary records. Hence, the purpose of the database management system is to provide the users with a higher-level interface and more structured data storage and retrieval operations. The DBMS operates on data in the main memory; more precisely, it handles data in a particular portion of the main memory (called page buffer) that is reserved for the DBMS. The typical storage unit on disk is a "block" of data; often this data block is ca...