Before you can put data into a database, you actually need to think about how it will be represented and manipulated. Most programmers have never heard of measurement theory or thought about the best way to represent their data. They either use whatever was there before or invent their own schemes on the fly. Most of the time, the data is put into the database in the units in which it was collected without regard to even a quick validation. It is assumed the input is in an appropriate unit, with appropriate scale and precision. In short, application programmers and users are perfect. This tendency to believe the computer, no matter how absurd the data, is called the “Garbage In, Gospel Out” principle in IT folklore.
This unwillingness to do validation and verification is probably the major reason for the lack of data quality.
1.1. Measurement Theory
“Measure all that is measurable and attempt to make measurable that which is not yet so.”
- Galileo (1564–1642)
Measurement theory is a branch of applied mathematics that is useful in data analysis and database design. Measurements are not the same as the attribute being measured. Measurement is not just assigning numbers to things or their attributes so much as it is finding a property in things that can be expressed in numbers or other computable symbols. This structure is the scale used to take the measurement; the numbers or symbols represent units of measure.
Strange as it might seem, measurement theory came from psychology, not mathematics, statistics, or computer science. S. S. Stevens originated the idea of levels of measurement and classification of scales in 1946 for psychological testing. This is more recent than you might have thought. Scales are classified into types by the properties they do or do not have. The properties with which we are concerned are the following.
1. There is a natural origin point on the scale. This is sometimes called a zero, but it does not literally have to be a numeric zero. For example, if the measurement is the distance between objects, the natural zero is zero meters; you cannot get any closer than that. If the measurement is the temperature of objects, the natural zero is absolute zero; nothing can get any colder. However, consider time; it goes from an eternal past into an eternal future, so you cannot find a natural origin for it.
2. Meaningful operations can be performed on the units. It makes sense to add weights together to get a new weight. Adding temperatures has to consider mass. Dates can be subtracted to give a duration in days. However, adding names or shoe sizes together is absurd.
3. There is a natural ordering to the units. It makes sense to speak about events occurring before or after one another in time or a physical object being heavier, longer, or hotter than another object.
But the alphabetical order imposed on a list of names is arbitrary, not natural: a foreign language, with different names for the same objects, would impose another alphabetical ordering. And that assumes the other language even had an alphabet for an ordering; Chinese, for example, does not.
4. There is a natural metric function on the units. A metric function has nothing to do with the “metric system” of measurements, which is more properly called SI, for “Système international d'unités” in French. Metric functions have the following three properties:
a. The metric between an object and itself is the natural origin of the scale. We can write this in a notation as M(a, a) = 0.
b. The order of the objects in the metric function does not matter. Again in the semimathematical notation, M(a, b) = M(b, a).
c. There is a natural additive function that obeys the rule that M(a, b) + M(b, c) >= M(a, c), which is also known as the triangle inequality.
This notation is meant to be more general than just arithmetic. The “zero” in the first property is the origin of the scale, not just a numeric zero. The third property, defined with a “plus” and a “greater than or equal” sign, is a symbolic way of expressing general ordering relationships. The “greater than or equal” sign refers to a natural ordering on the attribute being measured. The “plus” sign refers to a meaningful operation in regard to that ordering, not just arithmetic addition.
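For the ordinary numeric case, the three properties can be checked mechanically on a finite sample of objects. What follows is only a minimal Python sketch, not something from the original text: the names `check_metric`, `abs_diff`, and `squared_diff` and the sample values are hypothetical, and plain arithmetic addition and comparison stand in for the more general “plus” and “greater than or equal” described above.

```python
from itertools import product


def check_metric(points, m, tol=1e-9):
    """Report which of the three metric properties hold for m on a finite sample."""
    # Property a: M(a, a) = 0
    identity = all(abs(m(a, a)) <= tol for a in points)
    # Property b: M(a, b) = M(b, a)
    symmetry = all(abs(m(a, b) - m(b, a)) <= tol
                   for a, b in product(points, repeat=2))
    # Property c: M(a, b) + M(b, c) >= M(a, c)
    triangle = all(m(a, b) + m(b, c) + tol >= m(a, c)
                   for a, b, c in product(points, repeat=3))
    return {"identity": identity, "symmetry": symmetry, "triangle": triangle}


def abs_diff(a, b):
    """Ordinary distance along a line."""
    return abs(a - b)


def squared_diff(a, b):
    """Squared distance: symmetric and zero on the diagonal, but not a metric."""
    return (a - b) ** 2


sample = [0.0, 1.5, 4.0, 10.0]  # hypothetical positions along a line
print(check_metric(sample, abs_diff))
print(check_metric(sample, squared_diff))
```

The absolute difference passes all three checks on this sample, while the squared difference fails the third one. A finite sample can only refute a property, of course; it cannot prove that the property holds for the whole scale.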
The special case of the third property, where the “greater than or equal to” is always “greater than,” is very desirable to people because it means that they can use numbers for units and do simple arithmetic with the scales. This is called a strong metric property. Not every attribute is that convenient. For example, human perceptions of sound and light intensity follow roughly a cube root law; that is, if you double the intensity of light, the perceived intensity increases by only about 23% (Stevens 1957). Stated in English, the actual formula is “physical intensity to the 0.3 power equals perceived intensity.” Knowing this, designers of stereo equipment use controls that work on a logarithmic scale internally but that show evenly spaced marks on the control panel of the amplifier.
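The arithmetic behind the power law is easy to try out. The sketch below reads the quoted formula literally as perceived = physical ** 0.3, with no scaling constant and arbitrary units; that reading, and the names `perceived` and `perceived_gain`, are my assumptions rather than anything taken from Stevens.

```python
EXPONENT = 0.3  # the exponent quoted in the text


def perceived(physical_intensity, exponent=EXPONENT):
    """Perceived intensity under a simple power law, in arbitrary units."""
    if physical_intensity < 0:
        raise ValueError("physical intensity cannot be negative")
    return physical_intensity ** exponent


def perceived_gain(factor, exponent=EXPONENT):
    """Ratio of perceived intensities when the physical intensity is multiplied by factor."""
    return factor ** exponent


print(f"doubling the stimulus multiplies perception by {perceived_gain(2.0):.3f}")
print(f"a tenfold stimulus multiplies perception by {perceived_gain(10.0):.3f}")
```

Under this assumed form, it takes roughly a tenfold increase in physical intensity to double the perceived intensity, which is why evenly spaced marks on a volume knob cannot correspond to evenly spaced steps in power.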
It is possible to have a scale that has any combination of the metric properties. For example, instead of measuring the distance between two places in meters, you can measure it in units of effort. This is the old Chinese system, which had uphill and downhill units of distance, so you could estimate the time required to make a journey on foot.
Does this system of distances have the property that M(a, a) = 0? Yes; it takes no effort to get to where you are already located. Does it have the property that M(a, b) = M(b, a)? No; it takes less effort to go downhill than to go uphill. Does it have the property that M(a, b) + M(b, c) >= M(a, c)? Yes, with the direction considered; the effort needed to go directly to a place will never be more than the effort of making another stop along the way.
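A toy model makes the three answers concrete. Everything in this sketch is hypothetical: the place names, the positions and elevations, the two cost constants, and the `effort` function itself, which charges for horizontal distance plus a higher rate per meter climbed than per meter descended. It is not a reconstruction of the historical Chinese units.

```python
from itertools import permutations

# Hypothetical places along one road: (position in km, elevation in m).
PLACES = {
    "valley": (0.0, 100.0),
    "pass": (5.0, 900.0),
    "village": (9.0, 300.0),
}

UPHILL_COST = 0.01     # assumed effort units per meter climbed
DOWNHILL_COST = 0.002  # assumed effort units per meter descended


def effort(a, b):
    """Effort to walk from place a to place b; direction matters."""
    (xa, ha), (xb, hb) = PLACES[a], PLACES[b]
    flat = abs(xb - xa)
    climb = max(0.0, hb - ha) * UPHILL_COST
    descent = max(0.0, ha - hb) * DOWNHILL_COST
    return flat + climb + descent


names = list(PLACES)
identity = all(effort(p, p) == 0 for p in names)
symmetric = all(effort(a, b) == effort(b, a) for a, b in permutations(names, 2))
triangle = all(effort(a, b) + effort(b, c) + 1e-9 >= effort(a, c)
               for a, b, c in permutations(names, 3))

print("M(a, a) = 0:", identity)
print("M(a, b) = M(b, a):", symmetric)
print("M(a, b) + M(b, c) >= M(a, c):", triangle)
```

With these made-up numbers the first and third properties hold and the second fails, which matches the answers given above. A function like this is sometimes called a quasi-metric.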
As you can see, these properties can be more intuitive than mathematical. Obviously, we like the more mathematical side of this model because it fits into a database, but you have to be aware of the intuitive side.
1.1.1. Range, Granularity, and Your Instruments
“The only man who behaves sensibly is my tailor; he takes my measurements anew every time he sees me, while all the rest go on with their old measurements and expect me to fit them.”
- George Bernard Shaw
Range and granularity are properties of the way the measurements are made. Since we have to store data in a database within certain limits, they are very important to a database designer. The type of scale is unrelated to whether you use discrete or continuous variables. While measurements in a database are always discrete due to finite precision, attributes can be conceptually either discrete or continuous regardless of measurement level. Temperature is usually regarded as a continuous attribute, so temperature measurement to the nearest kelvin is a ratio-level measurement of a continuous attribute. (The Celsius scale has an arbitrary zero, so a reading in degrees Celsius is only interval-level.)
However, quantum mechanics suggests that the universe may be fundamentally discrete, so temperature may actually be a discrete attribute. In ordinal scales for continuous attributes, ties are impossible (or have probability zero). In ordinal scales for discrete attributes, ties are possible. Nominal scales usually apply to discrete attributes. Nominal scales for continuous attributes can be modeled but are rarely used.
Aside from these philosophical considerations, there is the practical aspect of the instrument used for the measurement. A radio telescope, a surveyor's transit, a meter stick, and a micrometer are all tools that measure distance. Nobody would claim that they are interchangeable. I can use a measuring tape to fit furniture in my house but not to make a mechanical wristwatch or to measure the distance to the moon.
From a purely scientific viewpoint, measurements should be reduced to the least precise instrument's readings. This means that you can be certain that the final results of calculations can be justified.
From a practical viewpoint, measurements are often adjusted by statistical considerations. This means that final results of calculations will be closer to reality, assuming that the adjustments were valid. This is particularly true for missing data, which we will discuss later.
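As an illustration of the purely scientific rule, here is a minimal sketch that rounds a set of readings to the granularity of the coarsest instrument before they are combined. The readings, the step sizes, the helper names `coarsest_step` and `reduce_to_step`, and the choice of round-half-even are all assumptions made for the example.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def coarsest_step(steps):
    """Largest step size, that is, the granularity of the least precise instrument."""
    return max(steps)


def reduce_to_step(value, step):
    """Round value to the nearest multiple of step."""
    multiples = (value / step).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
    return multiples * step


# Hypothetical (reading, instrument step) pairs, all in the same unit.
readings = [
    (Decimal("12.3456"), Decimal("0.0001")),  # micrometer-like instrument
    (Decimal("12.35"), Decimal("0.01")),
    (Decimal("12.4"), Decimal("0.1")),        # meter-stick-like instrument
]

step = coarsest_step(s for _, s in readings)
reduced = [reduce_to_step(v, step) for v, _ in readings]
print(step, reduced)
```

Decimal arithmetic is used instead of binary floating point so that a step such as 0.1 is represented exactly; a fixed-scale DECIMAL or NUMERIC column plays the same role in a database.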
But for now consider the simple example of a database showing that Joe Celko bought 500 bananas this week. Unless I just started a gorilla ranch, this is absurd and probably ought to be adjusted to five bananas or fewer. On the other hand, if the Dairy Queen Company orders five bananas this week, that is just as absurd. They are a corporation that had about 6000 restaurants in the United States, Canada, and 20 foreign countries in 2007, all of which make a lot of banana splits every day.
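The banana example amounts to saying that a plausible range depends on who is ordering. A minimal sketch of that idea follows; the customer types, the numeric limits, and the names `PlausibleRange` and `review_order` are invented for illustration, and real limits would have to come from the business itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlausibleRange:
    low: int
    high: int

    def contains(self, quantity):
        return self.low <= quantity <= self.high


# Hypothetical weekly banana limits per customer type.
WEEKLY_BANANA_LIMITS = {
    "household": PlausibleRange(0, 50),
    "restaurant_chain": PlausibleRange(1_000, 1_000_000),
}


def review_order(customer_type, quantity):
    """Return 'accept' or 'flag'; unknown customer types are flagged too."""
    limits = WEEKLY_BANANA_LIMITS.get(customer_type)
    if limits is None or not limits.contains(quantity):
        return "flag"
    return "accept"


print(review_order("household", 500))        # implausibly many for one person
print(review_order("restaurant_chain", 5))   # implausibly few for a chain
```

This sketch only flags a suspicious value for review instead of adjusting it automatically. Whether to adjust, and to what, is the statistical judgment the text describes, and it depends on knowing that the adjustment is valid.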
1.1.2. Range
A scale also has other properties that are of interest to someone building a database. First, scales have a range: what are the highest and lowest values that can appear on the scale? It is possible to have a finite or an infinite limit on either the lower or the upper bound. Overflow ...