UCLA Soc. 210A, Topic 2, Data and Their Computerization

1oct2000

Professor: David D. McFarland

Web Pages for Fall 1999

Syllabus for logistics
ClassWeb site for announcements, discussion board
Outline for course content

Topic 2: Data and their Computerization

Assignment 2

Here we get set up with several real datasets, ranging from a tiny one suitable for keyboard entry, to a survey of approximately 3,000 respondents that is already in a Stata-format computer file.

Then, with actual examples of data at hand, we will take a closer look at "scales of measurement", the ways in which numbers are assigned, corresponding to various empirical observations. This is a process that requires substantive sociological judgement, not something straightforward or automatic. The level of measurement of a variable has implications for the types of statistical procedures that can be conducted without distorting the substantive information encoded in the numbers.

Three Real Datasets

At various points in this course we will make use of three real datasets, for illustrations and exercises. The first, a tiny dataset originally on paper, will be used for practice in getting data into a computer for Stata calculations. The second is coordinated with the Hamilton book. The third is one full year's data from the General Social Survey.

wstates.dta is a file you will create by typing in and editing the Western States dataset that you received previously as a one-page printed handout.
Selected Stata commands are outlined on my quick reference sheet. This is no substitute for the more detailed treatment in the Hamilton book, and that in turn omits many details covered in Stata's own reference manuals.
hamilton.exe is a self-extracting archive file, available for download on this course's ClassWeb site. Running it produces the various .dta files for use in conjunction with the Hamilton book.
gss94.dta, also available for download from this course's ClassWeb site, is a 1994-only extract from the GSS data, in Stata format. We shall use it in two different ways:
- as a representative sample of US adults (this will be qualified when we discuss survey methods in detail)
- as a known population, with which to compare (sub)samples randomly selected from it, when studying the statistical properties of random sampling.
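The second use can be sketched as follows, in Python rather than Stata. This is only an illustration under stated assumptions: the "population" here is 3,000 randomly generated scores, not the GSS data, and the seed, sample size, and number of samples are arbitrary choices.

```python
import random
import statistics

rng = random.Random(210)  # fixed seed so that a rerun gives the same draws

# Hypothetical "known population" of 3,000 scores (NOT actual GSS data).
population = [rng.randint(0, 20) for _ in range(3000)]
pop_mean = statistics.mean(population)

def sample_mean(pop, n, rng):
    """Mean of a simple random sample of size n drawn without replacement."""
    return statistics.mean(rng.sample(pop, n))

# Draw many samples and see how their means scatter around the known value.
means = [sample_mean(population, 100, rng) for _ in range(200)]
avg_of_means = statistics.mean(means)
spread_of_means = statistics.stdev(means)

print("population mean:", round(pop_mean, 2))
print("average of 200 sample means:", round(avg_of_means, 2))
print("standard deviation of sample means:", round(spread_of_means, 2))
```

Because the population mean is known exactly, each sample mean can be compared against it directly, which is not possible when the sample is all one has.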
The online codebook links are:
- Main Codebook Pages
- Codebook Appendices

Aspects of Data Organization

Social science datasets typically have the following type of organization:

Rectangular array
Row = case (e.g., a particular state, or a particular survey respondent)
Column = variable (e.g., number of congressmen, or level of schooling completed)
Cell entry = that case's score on that variable.
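As a minimal sketch of this layout, in Python rather than Stata, the rows below are cases and the columns are variables. The case labels, variable names, and scores are all made up for illustration; they are not taken from the Western States handout.

```python
# Hypothetical cases-by-variables dataset (all values invented).
variables = ["case_id", "seats", "schooling"]
rows = [
    ["A", 2, 12],
    ["B", 5, 16],
    ["C", 1, 14],
]

def cell(rows, variables, case_index, variable):
    """Return one case's score on one variable."""
    return rows[case_index][variables.index(variable)]

# "Rectangular" means every case has exactly one cell per variable.
is_rectangular = all(len(row) == len(variables) for row in rows)

print(is_rectangular)
print(cell(rows, variables, 1, "schooling"))
```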

Not all social science datasets are like that, and some are not even approximately like that. For example, sociologists studying social networks sometimes have data organized in square arrays, with both rows and columns representing the same cases, and cell entries representing presence or level of some sort of dyadic relationship between the row and column cases. Similarly, sociologists studying social stratification sometimes have data organized as square arrays, with both rows and columns representing the same occupational categories, and cell entries representing the level of mobility between those categories over time.
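A square array of the network kind might look like the following sketch, in Python rather than Stata. The three cases and their ties are invented for illustration.

```python
# Hypothetical directed network among three cases (names and ties invented).
cases = ["Ann", "Bob", "Cy"]

# ties[i][j] = 1 if the row case names the column case, else 0.
ties = [
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
]

# Square: the same cases index both the rows and the columns.
is_square = len(ties) == len(cases) and all(len(row) == len(cases) for row in ties)

# Row sums count ties sent; column sums count ties received.
out_degree = {c: sum(ties[i]) for i, c in enumerate(cases)}
in_degree = {c: sum(row[j] for row in ties) for j, c in enumerate(cases)}

print(is_square, out_degree, in_degree)
```

Note that a cell here describes a pair of cases, not one case's score on one variable, which is why such data do not fit the cases x variables format directly.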

Even in the usual dataset, there are commonly loose ends that need to be fixed, to fit the data into the cases x variables format. In particular, there are missing data: cells for which no score has been obtained for some case on some variable. The scores may be missing for various reasons, such as the following, which arise in survey research:

inapplicable questions (e.g., spouse's age for an unmarried respondent)
respondent didn't know the answer
respondent knew but refused to state the answer
interviewer neglected to ask the question, or neglected to record the answer.

These are handled by keeping the case and the variable in the dataset, but placing a "missing data code" in the cell where the score would otherwise be.

The General Social Survey does not have a consistent missing data code, variously using such things as 99, 0, or -1; and occasionally using such easily overlooked things as 8 or 22. When using GSS data in native form, one needs to consult the codebook for each variable, to determine which numerical values represent not valid scores but one or another type of missing data.

Stata uses a consistent missing data code, entered and displayed as a dot (a period, or decimal point, with no sentence or number for it to be punctuating).

The command replace age=. if age==99 is an example of the kind of Stata command used to replace invalid numerical values with missing data codes. This is covered in Hamilton ch. 2, along with other aspects of data management to which we will return at various points in the course.
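The same idea can be sketched outside Stata. The Python fragment below is only an illustration: NaN stands in for Stata's dot, and the per-variable missing-data codes and the three respondents are hypothetical, standing in for what one would actually look up in the codebook.

```python
import math

MISSING = float("nan")  # stand-in for Stata's "." in this sketch

# Hypothetical missing-data codes per variable (consult the real codebook).
missing_codes = {"age": {99}, "educ": {97, 98, 99}}

def recode_missing(rows, codes):
    """Return copies of the rows with missing-data codes replaced by NaN."""
    return [
        {var: (MISSING if value in codes.get(var, set()) else value)
         for var, value in row.items()}
        for row in rows
    ]

def valid_mean(rows, var):
    """Mean over the cases that have a valid (non-missing) score on var."""
    values = [r[var] for r in rows if not math.isnan(r[var])]
    return sum(values) / len(values)

# Three made-up respondents; 99 and 98 are codes, not real scores.
rows = [{"age": 34, "educ": 12}, {"age": 99, "educ": 16}, {"age": 50, "educ": 98}]
cleaned = recode_missing(rows, missing_codes)

naive_age_mean = sum(r["age"] for r in rows) / len(rows)  # treats 99 as an age
print(naive_age_mean, valid_mean(cleaned, "age"))
```

Leaving the code 99 in place inflates the mean age in this toy example, which is the kind of distortion the recoding is meant to prevent.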

Scales of Measurement

Variables are sometimes classified as either "categorical" or "numerical". Moore and McCabe use a similar distinction, categorical and quantitative (pp 5, 22):

A categorical variable places an individual into one of several groups or categories.
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.

Moore and McCabe give gender (with categories male and female) as an example of a categorical variable, and height and salary (measured in centimeters and dollars respectively) as examples of quantitative variables.

Some authors use "qualitative" in place of "categorical", but that has a disparaging connotation, as if numbers were somehow antithetical to quality.

The categorical/quantitative dichotomy is useful, as far as it goes, and is sufficiently detailed for most of the things we will cover this quarter. It does have its limitations, however. Moore and McCabe's attempt to distinguish between histograms and (other) bar charts (p 16), for example, is really in need of a richer vocabulary of scale properties.

Their subsequent discussion of procedures such as the sign test, which discards the magnitude of a difference but retains its sign (p 521), would seem less baffling if they had the concept of "ordinal scale" available, and similarly for some other discussions that deal with such matters as nonparametric procedures, or violations of normality assumptions.
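To see what "discards the magnitude but retains the sign" amounts to, here is a rough Python sketch of a two-sided exact sign test on paired scores. The function name and the paired scores are invented for illustration, and this is a textbook-style version, not a reproduction of Moore and McCabe's presentation.

```python
from math import comb

def sign_test(before, after):
    """Two-sided exact sign test; returns (n_positive, n_negative, p_value)."""
    diffs = [a - b for b, a in zip(before, after)]
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg            # zero differences (ties) are dropped
    k = min(pos, neg)
    # Probability of k or fewer of the rarer sign if + and - are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return pos, neg, min(1.0, 2 * tail)

# Hypothetical paired scores; the last pair differs hugely in the first
# version and only slightly in the second.
before = [10, 12, 9, 14, 11, 13]
after_big = [12, 15, 9, 13, 16, 40]
after_small = [12, 15, 9, 13, 16, 14]

print(sign_test(before, after_big))
print(sign_test(before, after_small))
```

Both calls give the same result, because only the direction of each difference enters the calculation. That is an ordinal-level use of the data.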

Outside the Moore and McCabe text, one encounters other terms for types of variables. Hamilton, for example, defines "string variables" (p 14) and "numerical variables" (p 14) and mentions without defining "categorical variables" (p 23 passim) and "measurement variables" (p 81).

The classic classification of measurement scales dates back to a 1951 publication by psychologist S. S. Stevens, which distinguished nominal, ordinal, interval, and ratio scales, and implicitly distinguished all of those from what I sometimes call mere lists. Here I will go over Stevens' types of scales, and then go on to several additional issues that arise in sociological research.

As stated above, the categorical/quantitative distinction of Moore and McCabe will suffice for most of our purposes this quarter, and the following is mostly for those occasions when that simplified classification seems inadequate.

S. S. Stevens: types of scales along a single dimension

0. Not a scale (McFarland's "mere list")

Categories overlap (aren't mutually exclusive).
- Ex: Ethnic group of person whose parents are of two different ethnic groups on the list.
- Fixes: "Choose the one response closest to your opinion"; or interviewer codes which response was given first.
- Or not: "Hispanics may be of either race" in census tables, and modification for 2000 census took ethnicity even farther from mutually exclusive categories.
Categories aren't exhaustive.
- Ex: Protestant/Catholic/Jewish doesn't provide categories for people with other religions or no religion.
- Fixes: "None of the above", "n.e.c.", "Other"; or limit the scope, e.g., "Among members of Judeo-Christian religious groups..."
Or both.
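Whether a category scheme fails in either of these ways can be checked mechanically once one knows which categories describe each case. The Python sketch below assumes a hypothetical mapping from respondents to the set of labels that apply to them; the respondents and the function name are invented.

```python
def check_scheme(categories, applicable):
    """Find cases that fit more than one listed category, or none of them."""
    overlapping = sorted(case for case, labels in applicable.items()
                         if len(labels & categories) > 1)
    uncovered = sorted(case for case, labels in applicable.items()
                       if not labels & categories)
    return overlapping, uncovered

categories = {"Protestant", "Catholic", "Jewish"}

# Hypothetical respondents and the labels that describe each of them.
applicable = {
    "r1": {"Catholic"},
    "r2": {"Buddhist"},              # no listed category fits
    "r3": {"Protestant", "Jewish"},  # two listed categories fit
}

overlapping, uncovered = check_scheme(categories, applicable)
is_nominal_scale = not overlapping and not uncovered

print(overlapping, uncovered, is_nominal_scale)
```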

1. Nominal Scale

Categories are both mutually exclusive and exhaustive.

Empirical operation: determination of equality.
Invariance: any permutation (one-to-one transformation).
Invariant statistics: number of cases, mode.
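The invariance claim can be illustrated with a small Python sketch. The codes and the relabeling are arbitrary, hypothetical choices.

```python
from collections import Counter

scores = [1, 2, 2, 3, 2, 1]        # hypothetical nominal codes
relabel = {1: 30, 2: 10, 3: 20}    # an arbitrary one-to-one recoding
recoded = [relabel[s] for s in scores]

def mode(values):
    """Most frequent value (the example has a unique mode)."""
    return Counter(values).most_common(1)[0][0]

# The modal category is the same category under either labeling.
same_mode = relabel[mode(scores)] == mode(recoded)

# The category frequencies are unchanged, apart from their labels.
same_counts = sorted(Counter(scores).values()) == sorted(Counter(recoded).values())

print(same_mode, same_counts)
```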

2. Ordinal Scale

Meaningful order along a single dimension.

Empirical operation: determination of greater or less.
Invariance: any order-preserving transformation.
Invariant statistics: median, percentiles.
Ex: Many attitude items, such as XMARSEX: always wrong, almost always wrong, wrong only sometimes, not wrong at all.
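A small Python sketch shows why the median survives an order-preserving recoding while the mean does not. The scores and the squaring transformation are arbitrary, hypothetical choices.

```python
import statistics

scores = [1, 1, 2, 4, 4]   # hypothetical ordinal codes (odd number of cases)

def stretch(x):
    """An order-preserving recoding for positive codes."""
    return x ** 2

recoded = [stretch(x) for x in scores]

# Recoding the median gives the median of the recoded scores.
median_commutes = statistics.median(recoded) == stretch(statistics.median(scores))

# The same is not true of the mean.
mean_commutes = statistics.mean(recoded) == stretch(statistics.mean(scores))

print(median_commutes, mean_commutes)
```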

3. Interval Scale

Meaningful unit of measurement.

Empirical operation: determination of equality of intervals or differences.
Invariance: any linear transformation, replacing x with y = a + bx (with b > 0, so that order is preserved).
Invariant statistics: mean, standard deviation.
Ex: Time, with different calendars using vastly different zeros, but more-or-less agreeing on length of a year.
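A Python sketch using temperature, another familiar interval scale, illustrates what does and does not carry over under y = a + bx. The three readings are hypothetical; the Celsius-to-Fahrenheit conversion is the standard one.

```python
import math
import statistics

celsius = [10.0, 20.0, 40.0]                    # hypothetical readings
fahrenheit = [32.0 + 1.8 * c for c in celsius]  # y = a + bx, a = 32, b = 1.8

# The mean and standard deviation follow the same change of scale.
mean_ok = math.isclose(statistics.mean(fahrenheit),
                       32.0 + 1.8 * statistics.mean(celsius))
sd_ok = math.isclose(statistics.pstdev(fahrenheit),
                     1.8 * statistics.pstdev(celsius))

# Ratios of differences (intervals) are preserved ...
diff_ratio_c = (celsius[2] - celsius[1]) / (celsius[1] - celsius[0])
diff_ratio_f = (fahrenheit[2] - fahrenheit[1]) / (fahrenheit[1] - fahrenheit[0])

# ... but ratios of the scores themselves are not.
value_ratio_c = celsius[1] / celsius[0]
value_ratio_f = fahrenheit[1] / fahrenheit[0]

print(mean_ok, sd_ok, diff_ratio_c, diff_ratio_f, value_ratio_c, value_ratio_f)
```

So "twice as large an interval" is meaningful on an interval scale, while "twice as hot" is not.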

4. Ratio Scale

Meaningful zero point.

Empirical operation: Determination of equality of ratios.
Invariance: proportional change of scale, y = bx.
Invariant statistics: geometric mean, coefficient of variation.

Ex: Years of schooling, EDUC
Ex: Dollar amount of money (but not unequally wide categories as in RINCOM91)
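The following Python sketch illustrates the ratio-scale invariances with hypothetical dollar amounts: a change of unit (y = bx) leaves the coefficient of variation alone, while adding a constant, which moves the zero point, does not.

```python
import math
import statistics

def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

dollars = [20000.0, 30000.0, 70000.0]       # hypothetical incomes
thousands = [d / 1000 for d in dollars]     # same incomes, different unit
shifted = [d + 10000 for d in dollars]      # zero point moved

cv_same_under_rescaling = math.isclose(cv(dollars), cv(thousands))
cv_same_under_shift = math.isclose(cv(dollars), cv(shifted))

# The geometric mean simply changes unit along with the data.
gm_rescales = math.isclose(geometric_mean(dollars),
                           1000 * geometric_mean(thousands))

print(cv_same_under_rescaling, cv_same_under_shift, gm_rescales)
```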

Other considerations beyond Stevens

Distinction between how variable is conceptualized and how it is measured and recorded.
Ex: RINCOM91 vs Income expressed to nearest thousand dollars
Is it fundamentally not a single dimension?
Ex: Season-related variables
Ex: MARITAL?
Ex: FEMARRY? This has 4 responses that combine aspects of {early, later} x {alone, live-in, spouse}
Is it mostly on a single dimension, the exceptions being such as "Not Applicable" or "Other" or "More than one of the above"?
Ex: NEARGOD for believer in omnipresent deity, or atheist
In an ordinal or nominal scale, how firm or arbitrary are the number of categories and their boundaries?
Ex: CLASS vs CLASSY
In an interval scale, are intervals "subjectively" equal, or in some sense "objectively" equal?
Ex: PARTNERS vs MYFAITH
Is there a meaningful upper anchor point for the scale?
Ex: 1.0? 100%? 212 degrees F?
Ex: Total RDA (recommended daily allowance) for nutrients?
Is the upper anchor point a maximum, or can it be exceeded?
Are negative values meaningful?
Ex: Loss = -(Profit)
Ex: Decline = -(Increase)
Ex: Attitude scales symmetric around a zero-like value: strongly agree, agree, undecided, disagree, strongly disagree.
Can the value of the variable change over time?
Ex: FAEDUC compared to EDUC
In an ordinal scale, can, or must, an individual case move through the categories in order?
Ex: AGE vs CHILDS vs POLVIEWS
If a variable can change, can it go either direction?
Ex: AGE vs POLVIEWS
Who or what is hypothesized as being able to change the value of the variable?

References:

Stevens, Stanley Smith 1951. "Mathematics, Measurement and Psychophysics." Ch. 1, pp. 1-49, in: S. S. Stevens, ed. 1951. Handbook of Experimental Psychology. New York: Wiley.

Luce, R. Duncan, and Carol L. Krumhansl 1988. "Measurement, Scaling, and Psychophysics." Ch. 1, pp. 3-74, in: Richard C. Atkinson, Richard J. Herrnstein, Gardner Lindzey, and R. Duncan Luce, eds. 1988. Stevens' Handbook of Experimental Psychology. 2nd edn. New York: Wiley.