1 Introduction
Research methods for large databases
Mark Casson and Nigar Hashimzade
1.1 The research agenda
Three recent developments have opened up new opportunities for research in economic history: digitisation of primary sources, collaborative research linking different data sets and the publication of databases on the internet. Systematic exploitation of source materials now makes it possible to generate large representative samples, such as plots of land in a country or region, or comprehensive population data, such as the stock of ploughing engines in use at any given time. Collaborative research grants have funded the development of new long-run annual time series on prices, outputs, money supply, etc., for up to 750 years (1250–2000), and publication on the internet has widened access to such material. Panel data sets can be constructed that track the same sample over time, for example, parish population from Census data or attendance at meetings by members of an organisation. Linking data from different sources, such as railways, geology and population, widens the range of research questions that can be addressed.
New statistical methods have facilitated the discovery of hidden patterns in long-run data, involving the analysis of autocorrelation, regression to the mean, stochastic and deterministic trends, co-integrating relationships between co-evolving series, and structural breaks. In addition, panel estimation techniques have facilitated the synthesis of time-series and cross-section data. In analytical work, large data sets make asymptotic theory more relevant. Greater computing power and modern software packages make estimation of complex models using large data sets very quick.
These developments have made it possible to analyse very long-run processes that are important agents of economic change, including technological progress, market integration, political integration and institution-building. They can be analysed in a systematic way using structural models. Economic historians can now move from simple questions answered by descriptive statistics to more complex questions answered by estimated models, and from questions about a single variable (e.g. standard of living) to questions about the co-movement of variables (e.g. prices, population and money supply). Causation can be considered more explicitly using dynamic models, and multiple causation, as well as the direction of causation, can be analysed more fully.
The main challenges in utilising the advantages offered by these new developments are technical ones. For numerical analysis the historical data, especially the qualitative data, must be codified in an appropriate form. Historical data are prone to missing observations and data input errors, requiring considerable care to avoid mistakes in estimation. It is important to specify the right model and to interpret statistical results correctly, keeping in mind that the results may only be valid under certain explicit or implicit assumptions. The analysis should aim at finding all the patterns concealed within the data, ensuring that what is unexplained is, in some sense, truly random.
Instead of relying on generalisations from a range of specific studies, often carried out using different methodologies, it is possible to construct a single coherent account based on evidence of a consistent standard. The book reports the results of new research of this type.
The book is intended for use by doctoral and post-doctoral researchers in business history, economic history and social history. The case studies will also appeal to historical geographers and applied econometricians, and the techniques explained in the book are potentially useful to government policy-makers too. This book demonstrates how to create ‘big data’, and, above all, how to exploit it to the full. Many historians only ‘scratch the surface’ of the data they collect. There are often significant patterns hidden in their data that they fail to discover. This book shows how to unlock hidden patterns, and hence get more information out of the data.
Unlike conventional statistical texts, this book demonstrates how to put principles into practice with the aid of practical historical studies. This agenda leads, in some cases, to a reappraisal of conventional wisdom on such important issues as the development of the land market, the pricing of commodities, monetary instability, the economic impact of railways, the diffusion of steam technology and the role of women in the economy.
Many readers will be familiar with general statistical texts such as Wooldridge (2006). They will also be aware of the ‘cliometrics’ literature, as summarised recently in Greasley and Oxley (2011). The case studies in this book build upon previous research in cliometrics. However, despite some similarities, the new research differs from earlier research in important respects.
Most cliometrics relies on single equation models and makes limited use of simultaneous equation models, stochastic trends and other concepts featured in the book. There is now more emphasis on qualitative evidence. Early cliometric research emphasised quantification (e.g. using heights as indicators of health and welfare) whereas much of the evidence in modern databases is qualitative. Recent research has tended to combine quantitative and qualitative evidence by using binary variables; this is particularly useful for testing institutional theories of economic change.
The book emphasises the importance of testing alternative theories rather than fitting models based on one specific theory. Cliometric research in the 1970s and 1980s tended to react against the Marxist turn in economic history during the 1960s by emphasising the ubiquitous and providential role of market forces. Recent research has paid closer attention to the speed of market adjustment, and has suggested that market adjustment is often a rather sluggish process, and certainly not an instantaneous one. The research reported in this book takes no ideological position on the role of markets whatsoever. The philosophy is simply to ‘let the data speak’; in practice, this means comparing the effectiveness of alternative models in explaining the patterns that are revealed by statistical analysis. Models therefore need to allow for variable speeds of adjustment in market processes.
1.2 Structure and content of the book
The structure of the book is as follows. Chapter 2 examines the covariation between the prices of eight widely traded commodities in England, 1250–1914. Chapter 3 presents new annual estimates for stocks of gold and silver coin, 1220–1750. These estimates are combined with price and output data to test the Quantity Theory of Money. Chapter 4 reviews the evidence on medieval international financial transactions, and shows how econometric studies of medieval finance can be used to identify structural breaks in economic behaviour. Chapter 5 analyses time-series data on land and property values from feet of fines for two English counties, Essex and Warwickshire, 1300–1500. It demonstrates the changing uses of land and the differential movements in the values of various types of property, including agricultural land, mills and manorial rights. It also highlights the decline in smallholdings and the build-up of complex estates at the end of the fifteenth century. Chapter 6 shows how visual analytics can be used to summarise the structure of complex social networks. It presents a case study of Liverpool business networks at the end of the eighteenth century which demonstrates how a large binary data set of inter-personal relationships can be used to analyse the structure and dynamics of historical social networks. Chapter 7 develops and tests a model of the equilibrium distribution of population across towns and villages, using decadal data on local population units from the UK Census of Population 1801–1891 for the counties of Northamptonshire and Rutland. Chapter 8 develops and tests a theory of the role of women in land ownership, with special reference to nineteenth-century England, by modelling competition between men, women and institutions in the property market. For this purpose a database of 24,000 individual plots of land is created.
Chapter 9 presents the first comprehensive database on the production and use of steam ploughing engines in English agriculture, 1859–1930. It examines the rise and decline of steam ploughing by analysing both spatial and temporal patterns in diffusion. Chapter 10 investigates changes in consumption patterns in eighteenth- and nineteenth-century London by examining Old Bailey criminal records. Using a remarkable database linking individual burglars to the commodities they stole and the date of the theft, it shows how patterns of theft can inform recent historical debates over fashions and trends in consumer tastes.
The remainder of this chapter reviews the methodology that is common to all the following chapters. It presents the key concepts used in building economic models and in developing hypotheses regarding the long-run development of the economy. It outlines the statistical techniques that can be used to estimate the parameters of such models and to test hypotheses associated with them.
1.3 Fundamental concepts of data analysis in economic history
Observational nature of the data
A fundamental difficulty in establishing a relationship between economic variables is the observational nature of economic data. Suppose that we want to know whether changes in the price of wheat have an effect on the price of barley. One would expect there to be such an effect because these two commodities have similar uses. We cannot conduct an experiment, as we would in the natural sciences, by deliberately changing the wheat price and recording the response in the barley price. In practice, we have only a set of observations on both prices and, furthermore, the recorded prices are likely to have been affected by numerous other factors. It is often impossible to isolate the effect of these factors, even when we happen to have the relevant data. Therefore, statistical methods must be adapted in order to extract reliable information from the historical data in the most efficient way.
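The difficulty can be illustrated with a small simulation. The following is a minimal sketch using synthetic data, not historical prices. It assumes a hypothetical common "harvest" shock that moves both prices, with no direct effect of wheat on barley. The function names, the price levels and the coefficients are illustrative assumptions only.

```python
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_prices(n_years=500, seed=0):
    """Generate synthetic wheat and barley prices driven by a shared shock.

    Assumption: a good harvest (positive shock) lowers both prices.
    Wheat never enters the barley equation, so any co-movement comes
    from the common factor alone.
    """
    rng = random.Random(seed)
    wheat, barley = [], []
    for _ in range(n_years):
        harvest = rng.gauss(0.0, 1.0)  # common factor, usually unrecorded
        wheat.append(10.0 - 2.0 * harvest + rng.gauss(0.0, 0.5))
        barley.append(6.0 - 1.5 * harvest + rng.gauss(0.0, 0.5))
    return wheat, barley


wheat, barley = simulate_prices()
r = pearson(wheat, barley)
print(f"correlation between wheat and barley prices: {r:.2f}")
```

In this sketch the two series are strongly correlated even though, by construction, neither price affects the other. An observed correlation alone therefore cannot distinguish a direct effect from the influence of a shared factor, which is why the choice of model matters.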
Use and limitations of descriptive statistics
A common way of summarising the properties of historical data on economic variables is to use descriptive statistics, such as the sample mean as a measure of central tendency and the sample variance (or the standard deviation) as a measure of dispersion, that is, the spread of observations about the sample mean. Other frequently used statistics are the range and the minimum and maximum values in the sample. While these characteristics of the data are often helpful, in many situations they do not reflect certain patterns in the data that can be the most important for the research question. Typical examples are a trend in prices or seasonal fluctuations in trade volumes. Furthermore, the values of descriptive statistics can be driven entirely by one outlier and, therefore, give a poor picture of the bulk of the sampl...