PART I
Big Data’s Impact on the Provision and Regulation of Insurance
1
Big Data and Predictive Analytics
I. Big Data: Definition and Techniques
Before assessing its impact on the insurance industry, it is necessary to consider what the term ‘big data’ means. Data is, in essence, digitised information. Big data has been defined as ‘high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making’.1 Big data encompasses all types of information, not just that specific to an individual consumer. Mayer-Schonberger and Cukier explain that ‘Big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organisations, the relationship between citizens and governments and more’.2 They go on to explain that ‘Big data is not about trying to “teach” a computer to “think” like humans, instead it’s about applying math to huge quantities of data in order to infer probabilities … [T]hese systems perform well because they are fed with lots of data on which to base their predictions.’3 Greengard observes that big data ‘centres on collecting, storing, and using data-sets generated from both structured data (which resides in a database) and unstructured data (which exists outside a database), typically in the form of messaging streams, text documents, photos, video images, audio files and social media’.4 Previously, the storage of large amounts of data required supercomputers and was prohibitively expensive. Now, as data storage and processing are so much cheaper and faster, it has become possible to record and analyse data in volumes and forms that were, until relatively recently, inconceivable. As technology historian George Dyson puts it, ‘big data is what happened when the cost of storing information became less than the cost of throwing it away’.5
Big data identifies correlations, not causes. It works not by seeking to explain why certain variables give rise to certain outcomes but simply by identifying the existence of the correlation in the first place; big data analytics are not used to test particular hypotheses or identify particular causes in the manner of traditional scientific method. If a pattern reliably emerges from the detailed examination of datasets, that is enough to provide a basis for the operation of predictive analytics, a core part of big data’s offering in the insurance industry. Analytics may be divided into descriptive and predictive analytics. Descriptive analytics is designed to uncover and summarise patterns or features that exist in datasets. By contrast, predictive analytics refers to the use of statistical models to generate new data. Predictive analytics uses algorithms to analyse data in order to identify the likelihood of a certain event before it happens.6 In the context of insurance, the algorithm will be trained to analyse large datasets to identify both good and bad risks, that is, the likelihood of insured perils occurring and claims being made. Algorithms can already predict when a customer is ready to buy a certain product, a car needs servicing or a person is at risk of disease.7 But in being focussed on correlation, not cause, the use of predictive analytics gives rise to its own risks.
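The distinction between descriptive and predictive analytics can be illustrated with a minimal sketch. The records, the mileage threshold and the function names below are all invented for illustration and are not drawn from any insurer's practice. The descriptive step summarises a pattern that exists in past data; the predictive step applies that pattern to a new applicant, without any attempt to explain why mileage and claims are correlated.

```python
from collections import defaultdict

# Hypothetical history: (annual_mileage, made_claim). Invented for illustration.
HISTORY = [
    (4000, False), (5000, False), (6000, False), (7000, True),
    (12000, False), (13000, True), (15000, True), (16000, True),
]


def mileage_band(mileage):
    """Assumed banding rule: under 10,000 miles a year counts as 'low'."""
    return "low" if mileage < 10000 else "high"


def describe(records):
    """Descriptive analytics: summarise the observed claim rate per band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [claims, policies]
    for mileage, claimed in records:
        band = mileage_band(mileage)
        counts[band][0] += int(claimed)
        counts[band][1] += 1
    return {band: claims / total for band, (claims, total) in counts.items()}


def predict_claim_probability(records, mileage):
    """Predictive analytics: apply the observed correlation to a new case."""
    return describe(records)[mileage_band(mileage)]


rates = describe(HISTORY)
print(rates)
print(predict_claim_probability(HISTORY, 14000))
```

The prediction rests entirely on the correlation found in the historical records. Nothing in the sketch tests whether higher mileage causes claims, which is precisely the limitation described above.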
Algorithms are designed by reducing a series of steps into code. Those steps, so encoded, embody a process whereby the algorithms make sense of the data which they have trawled. But because the steps in that process are determined, at least in the first instance, by humans, the coding is laced with value choices: the analysis of data is not value-neutral.8 The Financial Services Users Group (FSUG)9 makes the point clearly:
Even as the public gradually becomes more familiar with the way platforms work with data, and even with more pointed data scrutiny, it is still a common belief that data and algorithms systematically and impartially uncover genuine patterns of user activity. Big data and algorithms cannot and do not work without some form of human intervention/supervision. In fact, algorithms just do what they are programmed to do very quickly and with a vast amount of data. Algorithms identify patterns in data, but they cannot operate a judgment on that data. In the end, people are behind the algorithms because they produce the activity being measured, they design the algorithms and set their evaluative criteria, they decide what counts as a trend, they name and summarize them, etc. In the end these people who programmed the algorithms employ their own human judgment in its design, clouding such reality by assuming algorithms provide analytical certainty. But who are the people behind the algorithms and who gives them instructions? They are the holders of an incredible power in dictating what is and is not displayed, what is and is not presented, the logics, etc … This process is far from being neutral and there is always a strong human bias, which in this case is the financial services industry bias and/or interests.10
The next stage in predictive analytics involves machine learning, a subset of Artificial Intelligence (AI).11 Machine learning relies on algorithms that are designed to improve how they evaluate data over time.12 Machine learning can be separated into two types of learning: supervised and unsupervised. In supervised learning, algorithms are developed based on labelled datasets. In this sense, the algorithms have been trained to map from input to output by the provision of data with ‘correct’ values already assigned to it. This initial ‘training’ phase creates models of the world on which predictions can then be made in the second ‘prediction’ phase, the phase of most interest to insurers seeking to model risk.13 Conversely, in unsupervised learning the algorithms are not trained and are instead left to find regularities in input data without any instructions as to what to look for.14 As the Information Commissioner notes, in both cases, it is the ability of the algorithms to change their output based on experience that gives machine learning its power.15 Algorithms search through and analyse data, assimilating the lessons of previous searches to undertake the next round of searching and analysis more precisely.16 Increasing the size of the data enables the algorithms to refine their analysis of the correlations identified. Having learned from the new data and refined the correlations, the algorithms are then able to fine-tune their predictive power and to make automated decisions, such as determining an applicant’s eligibility for insurance. Machine learning already excels in spotting unusual patterns of transactions which can indicate fraud. Startups such as Shift Technology are already offering such services to insurers.
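The difference between supervised and unsupervised learning can be sketched in a few lines. The example below is a deliberately simplified illustration, not a description of any insurer's model: the data, the single feature (number of past claims) and the function names are hypothetical, and a nearest-centroid rule and a one-dimensional two-group clustering stand in for the far richer techniques used in practice.

```python
from statistics import mean

# Hypothetical labelled data: (claims in last five years, 'correct' label).
LABELLED = [(0, "good"), (1, "good"), (0, "good"),
            (4, "bad"), (5, "bad"), (3, "bad")]


def train(examples):
    """Supervised 'training' phase: learn one centroid per label."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(model, value):
    """Supervised 'prediction' phase: pick the label with the nearest centroid."""
    return min(model, key=lambda label: abs(model[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled values into two groups (1-D k-means, k=2).

    Assumes the input contains at least two distinct values.
    """
    low, high = float(min(values)), float(max(values))
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(group_low), mean(group_high)
    return sorted(group_low), sorted(group_high)


model = train(LABELLED)
print(predict(model, 5))

# The same figures with the labels removed: the algorithm is told nothing
# about 'good' or 'bad' and simply finds two regularities in the input.
print(two_means([value for value, _ in LABELLED]))
```

In the supervised case a human has supplied the labels, and with them a judgment about what counts as a good or bad risk. In the unsupervised case the groups emerge from the data alone, and it is left to a human to decide what, if anything, they mean.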
II. The Nature, Collection, Sources and Aggregation of Data
A. Types of Data
The Information Commissioner distinguishes between four types of data which may be categorised based on the way in which that data is collected or generated:17
1. ‘Provided data’ is data consciously given by individuals, e.g. when filling in an online form;
2. ‘Observed data’ is data that is recorded automatically, e.g. by online cookies, geo-location data or CCTV linked to facial recognition technology;
3. ‘Derived data’ is data that is produced from other data in a relatively simple fashion, e.g. the profit on an individual insured, calculated by comparing premium income received against claims paid; and
4. ‘Inferred data’ is data produced by considering, for example, how the presence of certain identified risk factors might enable an insurer to predict the likelihood of future behaviours or outcomes. Inferred data is based on probabilities and is therefore less certain than derived data.
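The contrast between derived and inferred data can be made concrete with a short sketch. The class, the risk factors and the numerical uplifts below are hypothetical and chosen only to illustrate the point: derived data follows mechanically from known figures, whereas inferred data is an estimate of probability.

```python
from dataclasses import dataclass


@dataclass
class InsuredAccount:
    premium_income: float  # premiums received from the insured
    claims_paid: float     # claims paid out to the insured


def derived_profit(account):
    """Derived data: a simple, deterministic calculation on known figures."""
    return account.premium_income - account.claims_paid


# Hypothetical uplift attached to each risk factor; invented for illustration.
RISK_UPLIFT = {"recent_conviction": 0.10, "vehicle_modified": 0.05}


def inferred_claim_probability(factors, base_rate=0.05):
    """Inferred data: a probabilistic estimate, not a recorded fact."""
    uplift = sum(RISK_UPLIFT.get(factor, 0.0) for factor in factors)
    return min(1.0, base_rate + uplift)


account = InsuredAccount(premium_income=600.0, claims_paid=250.0)
print(derived_profit(account))
print(inferred_claim_probability({"recent_conviction"}))
```

The profit figure can be checked against the underlying records. The inferred probability cannot be verified in the same way: it depends on the weights chosen by whoever designed the model.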
B. Collection of Data: First Party or Third Party
Businesses collect personal data18 directly from consumers as first parties. Data provided directly by consumers to first parties is often likely to be the most detailed and accurate form of consumer data. In the insurance context, insurers receive applications or proposal forms for insurance directly from prospective insureds (or their brokers).19 Taking motor insurance as an example, the information provided directly by the insured in seeking a quotation falls into three broad categories: (i) information about the insured (e.g. the driver for a motor policy: such details will include, for example, date of birth, occupation, no claims bonus (NCB), claims and convictions record, home ownership, annual mileage); (ii) information about the vehicle (e.g. make, model, value, transmission, security devices, modifications, year, colour); and (iii) information about the location (e.g. address, whether the vehicle is kept on the road, on a driveway or in a garage). That information is known to the insured and, barring transcription errors or conscious fraud, is likely to be correct and in any event objectively verifiable.
The CMA’s research into the commercial use of consumer data noted how the timing and frequency of interactions between businesses and consumers varied between sectors. In the motor insurance sector, the (erstwhile) annual nature of most cover means that insurers typically collect data from actual and potential customers as a snapshot once a year, close to renewal (unless a claim is made or policy details are changed). While historic data is important to the development of predictive models of risk, annual data on individuals can degrade in value even within a year.20 However, as insurance is increasingly provided on a pay-as-you-go basis (Pay-As-You-Drive in motor insurance; Pay-As-You-Live for life and health insurance), the frequency of direct data collection will increase.
Third parties with no direct relationship with a potential insured may also collect data on that individual. Such data may be observed, derived or inferred. Third parties collect data from and about consumers in various ways. For example, a business may acquire data from a first party or another third party through purchase, licensing or exchange. Third parties may collect and analyse the data themselves or they may conduct analysis for other firms that may lack the required technical resources and skills. As the CMA notes, in practice there is a substantial amount of data sharing occurring between firms – for instance in support of first party service delivery. For example, first parties may commission third parties to gather data on their behalf and to pursue their own commercial interests (such as advertising and product development) by, for example, (i) enabling third parties to embed and control cookies on the first party’s website to track user visits to the site (on cookies, see further below); (ii) commissioning surveys and other market research; and (iii) using specialist data collection tools, such as ‘black box’ telematics devices.21
C. Sources of Data
Insurers and third party data collectors have access to internal and external sources of information on insureds and prospective insureds. Internal sources of information on insureds will include data held in paper files from legacy systems, and their newer generation of electronic databases. In addition, there are several external sources of information that can be aggregated to enable algorithms to identify good and bad risks. Those sources are public and private and may be obtained offline or online.
i. Public Sources of Data
Public sources of information that may be relevant to insurers to enable them to profile prospective insureds in detail include the land registry, court and insolvency records, Companies House, the electoral roll,22 and census information. In addition, the release of previously unavailable or inaccessible public-sector data has greatly expanded potential sources of third-party data. The US and UK Governments and the European Union have recently launched ‘open data’ websites to make available very substantial amounts of government statistics, including health, education, worker safety and energy data, amongst others.23 Further, insurers will routinely search the databases available to them, principally as a means of cross-checking information and preventing fraud.24
ii. Private Sources of Data
Private sources of information may be obtained offline and online. Insurers can gain access to proprietary data, meaning data acquired from connected companies. That data will have been provided by the insured to that connected company on the basis that the company may share it with other connected companies. A further category of offline information is purchasing information. So...