1oct2000

Outline:

UCLA Soc. 210A, Topic 2, Data and Their Computerization

Professor: David D. McFarland


Web Pages for Fall 1999


Topic 2: Data and Their Computerization

Here we get set up with several real datasets, ranging from a tiny one suitable for keyboard entry, to a survey of approximately 3,000 respondents that is already in a Stata-format computer file.

Then, with actual examples of data at hand, we will take a closer look at "scales of measurement", the ways in which numbers are assigned to empirical observations. Assigning them is not straightforward or automatic; it requires substantive sociological judgement. The level of measurement of a variable has implications for the types of statistical procedures that can be conducted without distorting the substantive information encoded in the numbers.

Three Real Datasets

At various points in this course we will make use of three real datasets, for illustrations and exercises. The first, a tiny dataset originally on paper, will be used for practice in getting data into a computer for Stata calculations. The second is coordinated with the Hamilton book. The third is one full year's data from the General Social Survey.

Aspects of Data Organization

Social science datasets typically have a cases x variables organization: a rectangular array in which each row represents a case, each column represents a variable, and each cell holds the score of one case on one variable.

Not all social science datasets are like that; some are not even approximately like that. For example, sociologists studying social networks sometimes have data organized in square arrays, with both rows and columns representing the same cases, and cell entries representing presence or level of some sort of dyadic relationship between the row and column cases. Similarly, sociologists studying social stratification sometimes have data organized as square arrays, with both rows and columns representing the same occupational categories, and cell entries representing the level of mobility between those categories over time.
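The two layouts can be sketched in a few lines of Python. This is only an illustration: the cases, variable names, people, and scores below are all invented, and the helper functions score and out_degree are hypothetical names, not part of any package.

```python
# Cases x variables layout: one row per case, one column per variable.
# All names and scores here are invented for illustration.
variables = ["age", "educ", "sex"]
cases = [
    [34, 12, "F"],
    [51, 16, "M"],
    [27, 14, "F"],
]

def score(case_index, variable):
    """Look up the score of one case on one variable."""
    return cases[case_index][variables.index(variable)]

# Square array for a social network: rows and columns are the SAME cases,
# and each cell records a dyadic relationship
# (1 = the row person names the column person as a friend).
people = ["Ann", "Bob", "Cal"]
names_as_friend = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]

def out_degree(person):
    """Number of others that the given person names."""
    return sum(names_as_friend[people.index(person)])

print(score(1, "educ"))
print(out_degree("Bob"))
```

In the first layout the rows and columns mean different things (cases and variables); in the second they index the same set of cases, which is why the array must be square.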

Even in the usual dataset, there are commonly loose ends that need to be fixed, to fit the data into the cases x variables format. In particular, there are missing data cases in which no score has been obtained on some variable. The scores may be missing for various reasons, such as the following which arise in survey research:

These are handled by keeping the case and the variable in the dataset, but placing a "missing data code" in the cell where the score would otherwise be.

The General Social Survey does not have a consistent missing data code, variously using such things as 99, 0, or -1; and occasionally using such easily overlooked things as 8 or 22. When using GSS data in native form, one needs to consult the codebook for each variable, to determine which numerical values represent not valid scores but one or another type of missing data.

Stata uses a consistent missing data code, entered and displayed as a dot (a period, or decimal point, with no sentence or number for it to be punctuating).
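As a rough sketch of what consulting the codebook for each variable amounts to, the Python below replaces variable-specific missing data codes with a single marker (None, standing in for Stata's dot). The variable names are lowercase stand-ins, and the codes in MISSING_CODES are assumptions made up for this example; the real codes must be taken from the GSS codebook.

```python
# Hypothetical codebook: for each variable, the numeric codes that mean
# "missing" rather than a valid score. These values are invented for
# illustration and must not be taken as the actual GSS codes.
MISSING_CODES = {
    "educ": {98, 99},
    "childs": {9},
    "age": {0, 99},
}

def recode_missing(record, missing_codes=MISSING_CODES):
    """Return a copy of one case with missing-data codes replaced by None."""
    cleaned = {}
    for variable, value in record.items():
        if value in missing_codes.get(variable, set()):
            cleaned[variable] = None   # plays the role of Stata's "."
        else:
            cleaned[variable] = value
    return cleaned

def mean_of_valid(records, variable):
    """Mean over cases with a valid score; cases scored None are skipped."""
    valid = [r[variable] for r in records if r[variable] is not None]
    return sum(valid) / len(valid)

raw = [
    {"educ": 12, "childs": 2, "age": 40},
    {"educ": 99, "childs": 9, "age": 35},
    {"educ": 16, "childs": 0, "age": 99},
]
clean = [recode_missing(r) for r in raw]

naive_mean = sum(r["educ"] for r in raw) / len(raw)   # treats 99 as a score
valid_mean = mean_of_valid(clean, "educ")             # skips the missing case
print(naive_mean, valid_mean)
```

Note that the same number can be a valid score on one variable and a missing data code on another: in this made-up codebook, 0 is a valid number of children but a missing code for age. That is why the recoding has to be done variable by variable.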

Scales of Measurement

Variables are sometimes classified as either "categorical" or "numerical". Moore and McCabe use a similar distinction, categorical and quantitative (pp 5, 22). They give gender (with categories male and female) as an example of a categorical variable, and height and salary (measured in centimeters and dollars respectively) as examples of quantitative variables.

Some authors use "qualitative" in place of "categorical", but that has a disparaging connotation, as if numbers were somehow antithetical to quality.

The categorical/quantitative dichotomy is useful, as far as it goes, and is sufficiently detailed for most of the things we will cover this quarter. It does have its limitations, however. Moore and McCabe's attempt to distinguish between histograms and (other) bar charts (p 16), for example, is really in need of a richer vocabulary of scale properties.

Their subsequent discussion of procedures such as the sign test, which discards the magnitude of a difference but retains its sign (p 521), would seem less baffling if they had the concept of "ordinal scale" available, and similarly for some other discussions that deal with such matters as nonparametric procedures, or violations of normality assumptions.
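To make the point about the sign test concrete, here is a minimal Python sketch of a two-sided sign test. It is written for illustration, not taken from Moore and McCabe; the function name sign_test_p_value and the paired differences are assumptions of this example.

```python
from math import comb

def sign_test_p_value(differences):
    """Two-sided sign test p-value for paired differences.

    Only the sign of each difference is used; zeros are dropped.
    Under the null hypothesis the number of positive signs is
    Binomial(n, 1/2).
    """
    positives = sum(1 for d in differences if d > 0)
    negatives = sum(1 for d in differences if d < 0)
    n = positives + negatives
    k = min(positives, negatives)
    # Probability of a split at least this lopsided in one direction,
    # doubled for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

diffs = [2.5, 0.3, 1.1, -0.4, 3.0, 0.8, 1.9, 0.2]   # made-up data
p = sign_test_p_value(diffs)

# Cubing each difference changes every magnitude but no sign,
# so the test gives exactly the same answer.
p_cubed = sign_test_p_value([d ** 3 for d in diffs])
print(p, p_cubed)
```

Because the procedure asks only which member of each pair is greater, it uses no more than ordinal information, which is the sense in which an "ordinal scale" vocabulary makes it less baffling.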

Outside the Moore and McCabe text, one encounters other terms for types of variables. Hamilton, for example, defines "string variables" (p 14) and "numerical variables" (p 14) and mentions without defining "categorical variables" (p 23 passim) and "measurement variables" (p 81).

The classic classification of measurement scales dates back to a 1951 publication by psychologist S. S. Stevens, which distinguished nominal, ordinal, interval, and ratio scales, and implicitly distinguished all of those from what I sometimes call mere lists. Here I will go over Stevens' types of scales, and then go on to several additional issues that arise in sociological research.

As stated above, the categorical/quantitative distinction of Moore and McCabe will suffice for most of our purposes this quarter, and the following is mostly for those occasions when that simplified classification seems inadequate.

S. S. Stevens: types of scales along a single dimension

0. Not a scale (McFarland's "mere list")

1. Nominal Scale

Categories are both mutually exclusive and exhaustive.

Empirical operation: determination of equality.
Invariance: any permutation (one-to-one transformation).
Invariant statistics: number of cases, mode.

2. Ordinal Scale

Meaningful order along a single dimension.

Empirical operation: determination of greater or less.
Invariance: any order-preserving transformation.
Invariant statistics: median, percentiles.
Ex: Many attitude items, such as XMARSEX: always, almost always, only sometimes, not at all.

3. Interval Scale

Meaningful unit of measurement.

Empirical operation: determination of equality of intervals or differences.
Invariance: any linear transformation, replacing x with y=a+bx.
Invariant statistics: mean, standard deviation.
Ex: Time, with different calendars using vastly different zeros, but more-or-less agreeing on length of a year.

4. Ratio Scale

Meaningful zero point.

Empirical operation: Determination of equality of ratios.
Invariance: proportional change of scale, y = bx.
Invariant statistics: geometric mean, coefficient of variation.

Ex: Years of schooling, EDUC
Ex: Dollar amount of money (but not unequally wide categories as in RINCOM91)
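The invariance entries above can be checked numerically. The Python sketch below takes one reading of "invariant statistic": the statistic computed on transformed scores equals the transformed statistic (or, for the coefficient of variation, is unchanged). The data, the relabeling, and the constants a and b are made up for this example.

```python
from math import isclose, sqrt
from statistics import mean, median, mode, pstdev

def cv(xs):
    """Coefficient of variation: standard deviation divided by mean."""
    return pstdev(xs) / mean(xs)

# Nominal: relabeling the categories (a permutation) relabels the mode.
codes = [1, 1, 2, 3, 3, 3]
relabel = {1: 3, 2: 1, 3: 2}
nominal_ok = mode([relabel[c] for c in codes]) == relabel[mode(codes)]

# Ordinal: an order-preserving transformation (here the square root)
# carries the median along with it, but not, in general, the mean.
x = [1, 4, 9, 16, 100]
sqrt_x = [sqrt(v) for v in x]
median_ok = median(sqrt_x) == sqrt(median(x))
mean_breaks = not isclose(mean(sqrt_x), sqrt(mean(x)))

# Interval: under y = a + b*x (with b > 0) the mean follows the same rule
# and the standard deviation is multiplied by b, but the coefficient of
# variation changes because the zero point has moved.
a, b = 32.0, 1.8
y = [a + b * v for v in x]
interval_ok = (isclose(mean(y), a + b * mean(x))
               and isclose(pstdev(y), b * pstdev(x)))
cv_breaks = not isclose(cv(y), cv(x))

# Ratio: under y = b*x the zero point stays put,
# and the coefficient of variation is unchanged.
z = [b * v for v in x]
ratio_ok = isclose(cv(z), cv(x))

print(nominal_ok, median_ok, mean_breaks, interval_ok, cv_breaks, ratio_ok)
```

Each scale type admits fewer transformations than the one before it, so more statistics survive: the mean is not meaningful for merely ordinal scores, and the coefficient of variation is not meaningful without a fixed zero.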

Other considerations beyond Stevens

  1. Distinction between how a variable is conceptualized and how it is measured and recorded.
    Ex: RINCOM91 vs Income expressed to nearest thousand dollars
  2. Is it fundamentally not a single dimension?
    Ex: Season-related variables
    Ex: MARITAL?
    Ex: FEMARRY? This has 4 responses that combine aspects of {early, later} x {alone, live-in, spouse}
  3. Is it mostly on a single dimension, with exceptions such as "Not Applicable", "Other", or "More than one of the above"?
    Ex: NEARGOD for believer in omnipresent deity, or atheist
  4. In an ordinal or nominal scale, how firm or arbitrary are the number of categories and their boundaries?
    Ex: CLASS vs CLASSY
  5. In an interval scale, are intervals "subjectively" equal, or in some sense "objectively" equal?
    Ex: PARTNERS vs. MYFAITH
  6. Is there a meaningful upper anchor point for the scale?
    Ex: 1.0? 100%? 212 degrees F?
    Ex: Total RDA (recommended daily allowance) for nutrients?
    Is the upper anchor point a maximum, or can it be exceeded?
  7. Are negative values meaningful?
    Ex: Loss = -(Profit)
    Ex: Decline = -(Increase)
    Ex: Attitude scales symmetric around a zero-like value: strongly agree, agree, undecided, disagree, strongly disagree.
  8. Can the value of the variable change over time?
    Ex: FAEDUC compared to EDUC
  9. In an ordinal scale, can, or must, an individual case move through the categories in order?
    Ex: AGE vs. CHILDS vs. POLVIEWS
  10. If a variable can change, can it go either direction?
    Ex: AGE vs POLVIEWS
  11. Who or what is hypothesized as being able to change the value of the variable?

References:

Stevens, Stanley Smith 1951. "Mathematics, Measurement and Psychophysics." Ch. 1, pp. 1-49, in: S. S. Stevens, ed. 1951. Handbook of Experimental Psychology. New York: Wiley.

Luce, R. Duncan, and Carol L. Krumhansl 1988. "Measurement, Scaling, and Psychophysics." Ch. 1, pp. 3-74, in: Richard C. Atkinson, Richard J. Herrnstein, Gardner Lindzey, and R. Duncan Luce, eds. 1988. Stevens' Handbook of Experimental Psychology. 2nd edn. New York: Wiley.