1oct2000
Outline:
UCLA Soc. 210A, Topic 2, Data and Their Computerization
Web Pages for Fall 1999
Topic 2: Data and Their Computerization
Here we get set up with several real datasets, ranging from a
tiny one suitable for keyboard entry to a survey of
approximately 3,000 respondents that is already in a Stata-format
computer file.
Then, with actual examples of data at hand, we will take a closer
look at "scales of measurement", the ways in which numbers are
assigned, corresponding to various empirical observations. This
is a process that requires substantive sociological judgement,
not something straightforward or automatic. The level of
measurement of a variable has implications for the types of
statistical procedures that can be conducted without distorting
the substantive information encoded in the numbers.
Three Real Datasets
At various points in this course we will make use of three real
datasets, for illustrations and exercises. The first, a tiny
dataset originally on paper, will be used for practice in getting
data into a computer for Stata calculations. The second is
coordinated with the Hamilton book. The third is one full year's
data from the General Social Survey.
- wstates.dta is a file you will create by typing
in and editing the Western States dataset that you received
previously as a one-page printed handout.
- Selected Stata commands
are outlined on my quick reference sheet. This is no substitute
for the more detailed treatment in the Hamilton book, and that in
turn omits many details covered in Stata's own reference
manuals.
- hamilton.exe is a self-extracting archive file,
available for download on this course's ClassWeb site. Running
it produces the various .dta files for use in conjunction with
the Hamilton book.
- gss94.dta, also available for download from this
course's ClassWeb site, is a 1994-only extract from the GSS data,
in Stata format. We shall use it in two different ways:
- as a representative sample of US adults (this will be
qualified when we discuss survey methods in detail)
- as a known population, with which to compare
(sub)samples randomly selected from it, when studying the
statistical properties of random sampling.
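The second use can be sketched in a few lines. This is a minimal Python illustration, not a Stata procedure and not part of the course materials. The population list is randomly generated stand-in data (hypothetical years of schooling), not actual GSS scores, and draw_subsample is a name introduced here only for the example.

```python
import random
import statistics

def draw_subsample(population, n, seed):
    """Draw a simple random sample of n cases, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical stand-in for one variable on about 3,000 cases.
# These are made-up numbers, NOT values from gss94.dta.
maker = random.Random(1994)
population = [maker.randint(0, 20) for _ in range(3000)]
population_mean = statistics.mean(population)

# Treat the full list as a known population, draw several subsamples,
# and compare each subsample mean with the known population mean.
sample_means = [
    statistics.mean(draw_subsample(population, 100, seed))
    for seed in range(5)
]

print("population mean:", round(population_mean, 2))
for seed, sample_mean in enumerate(sample_means):
    print("subsample", seed, "mean:", round(sample_mean, 2))
```

Because the population mean is known exactly, the spread of the subsample means around it can be inspected directly, which is the point of treating the survey as a population.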
The online codebook links are:
Aspects of Data Organization
Social science datasets typically have the following type of
organization:
- Rectangular array
- Row = case (e.g., a particular state, or a particular
survey respondent)
- Column = variable (e.g., number of congressmen, or
level of schooling completed)
- Cell entry = that case's score on that variable.
Not all social science datasets are like that; some are not even
approximately like that. For example, sociologists studying
social networks sometimes have data organized in square arrays,
with both rows and columns representing the same cases, and cell
entries representing presence or level of some sort of dyadic
relationship between the row and column cases. Similarly,
sociologists studying social stratification sometimes have data
organized as square arrays, with both rows and columns
representing the same occupational categories, and cell entries
representing the level of mobility between those categories over
time.
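As a rough illustration of the square-array layout for network data, here is a small Python sketch. The names and the list of dyads are invented for the example, and dyads_to_square_array is a hypothetical helper, not something from the course materials.

```python
def dyads_to_square_array(cases, dyads):
    """Build a square array: rows and columns are the same cases,
    and a cell is 1 if the row case names the column case, else 0."""
    index = {case: i for i, case in enumerate(cases)}
    matrix = [[0] * len(cases) for _ in cases]
    for sender, receiver in dyads:
        matrix[index[sender]][index[receiver]] = 1
    return matrix

# Invented example: who names whom as a friend.
cases = ["Ann", "Bob", "Cal"]
dyads = [("Ann", "Bob"), ("Bob", "Ann"), ("Cal", "Ann")]

friendship = dyads_to_square_array(cases, dyads)
for name, row in zip(cases, friendship):
    print(name, row)
```

Note that the array need not be symmetric: here Cal names Ann, but Ann does not name Cal.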
Even in the usual dataset, there are commonly loose ends that
need to be fixed, to fit the data into the cases x variables
format. In particular, there are "missing data": cases in
which no score has been obtained on some variable. The scores
may be missing for various reasons, such as the following, which
arise in survey research:
- inapplicable questions (e.g., spouse's age for an
unmarried respondent)
- respondent didn't know the answer
- respondent knew but refused to state the answer
- interviewer neglected to ask the question, or
neglected to record the answer.
These are handled by keeping the case and the variable in
the dataset, but placing a "missing data code" in the cell where
the score would otherwise be.
The General Social Survey does not have a consistent missing data
code, variously using such things as 99, 0, or -1; and
occasionally using such easily overlooked things as 8 or 22.
When using GSS data in native form, one needs to consult the
codebook for each variable, to determine which numerical values
represent not valid scores but one or another type of missing
data.
Stata uses a consistent missing data code, entered and displayed
as a dot (a period, or decimal point, with no sentence or number
for it to be punctuating).
- replace age=. if age==99 is an example of the
kinds of Stata commands used to replace invalid numerical values
with missing data codes. This is covered in Hamilton ch 2, along
with other aspects of data management to which we will return at
various points in the course.
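The same idea, with a separate set of missing data codes for each variable, can be sketched in Python. This is an illustration only, not a replacement for the Stata command above. The codes in MISSING_CODES and the three cases are hypothetical; the real codes must be taken from the codebook for each variable. Python's None plays the role of Stata's dot.

```python
import statistics

# Hypothetical per-variable missing data codes; consult the codebook
# for the real ones.
MISSING_CODES = {"age": {99}, "educ": {98, 99}}

def recode_missing(rows, missing_codes):
    """Return a copy of rows with each missing data code replaced by None."""
    cleaned = []
    for row in rows:
        cleaned.append({
            var: (None if value in missing_codes.get(var, set()) else value)
            for var, value in row.items()
        })
    return cleaned

# Three invented cases (rows) with two variables (columns).
rows = [
    {"age": 34, "educ": 12},
    {"age": 99, "educ": 16},   # age not obtained
    {"age": 51, "educ": 98},   # educ not obtained
]

cleaned = recode_missing(rows, MISSING_CODES)

# Statistics should use only the valid scores.
valid_ages = [row["age"] for row in cleaned if row["age"] is not None]
mean_age = statistics.mean(valid_ages)
print(cleaned)
print("mean age over valid cases:", mean_age)
```

Leaving the code 99 in place would have pulled the mean age upward, which is why the recoding has to happen before any calculation.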
Scales of Measurement
Variables are sometimes classified as either "categorical" or
"numerical". Moore and McCabe use a similar distinction,
categorical and quantitative (pp 5, 22):
- A categorical variable places an individual
into one of several groups or categories.
- A quantitative variable takes numerical values
for which arithmetic operations such as adding and averaging make
sense.
Moore and McCabe give gender (with categories male and female) as
an example of a categorical variable, and height and salary
(measured in centimeters and dollars respectively) as examples of
quantitative variables.
Some authors use "qualitative" in place of "categorical", but
that has a disparaging connotation, as if numbers were somehow
antithetical to quality.
The categorical/quantitative dichotomy is useful, as far as it
goes, and is sufficiently detailed for most of the things we will
cover this quarter. It does have its limitations, however.
Moore and McCabe's attempt to distinguish between histograms and
(other) bar charts (p 16), for example, is really in need of a
richer vocabulary of scale properties.
Their subsequent discussion of procedures such as the sign test,
which discards the magnitude of a difference but retains its sign
(p 521), would seem less baffling if they had the concept of
"ordinal scale" available, and similarly for some other
discussions that deal with such matters as nonparametric
procedures, or violations of normality assumptions.
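To make concrete what "discards the magnitude but retains its sign" means, here is a minimal Python sketch of a two-sided sign test. It assumes the usual null hypothesis that positive and negative differences are equally likely, and it drops zero differences. The differences are invented, and sign_test_p_value is a name introduced here, not taken from Moore and McCabe.

```python
from math import comb

def sign_test_p_value(differences):
    """Two-sided sign test: uses only the signs of the nonzero differences."""
    signs = [d > 0 for d in differences if d != 0]
    n = len(signs)                 # number of nonzero differences
    k = sum(signs)                 # number of positive differences
    # Binomial(n, 1/2) probability of a split at least this lopsided,
    # counted in one tail and then doubled.
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented paired differences.
differences = [3, -1, 4, 2, 5, -2, 6, 1]

# Same signs, very different magnitudes.
distorted = [d ** 3 for d in differences]

p_original = sign_test_p_value(differences)
p_distorted = sign_test_p_value(distorted)
print(p_original, p_distorted)
```

The two p-values are identical because cubing changes every magnitude but no sign, so the test sees the same information in both lists.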
Outside the Moore and McCabe text, one encounters other terms for
types of variables. Hamilton, for example, defines "string
variables" (p 14) and "numerical variables" (p 14), and mentions
without defining them "categorical variables" (p 23 passim) and
"measurement variables" (p 81).
The classic classification of measurement scales dates back to a
1951 publication by psychologist S. S. Stevens, which
distinguished nominal, ordinal, interval, and ratio scales, and
implicitly distinguished all of those from what I sometimes call
mere lists. Here I will go over Stevens' types of scales, and
then go on to several additional issues that arise in
sociological research.
As stated above, the categorical/quantitative distinction of
Moore and McCabe will suffice for most of our purposes this
quarter, and the following is mostly for those occasions when
that simplified classification seems inadequate.
S. S. Stevens: types of scales along a single dimension
0. Not a scale (McFarland's "mere list")
- Categories overlap (aren't mutually exclusive).
- Ex: Ethnic group of person whose parents are of two
different ethnic groups on the list.
- Fixes: "Choose the one response closest to your
opinion"; or interviewer codes which response was
given first.
- Or not: "Hispanics may be of either race" in census
tables, and modification for 2000 census took
ethnicity even farther from mutually exclusive
categories.
- Categories aren't exhaustive.
- Ex: Protestant/Catholic/Jewish doesn't provide
categories for people with other religions or no
religion.
- Fixes: "None of the above", "n.e.c.", "Other";
or limit the scope, e.g., "Among members of
Judeo-Christian religious groups..."
- Or both.
1. Nominal Scale
Categories are both mutually exclusive and exhaustive.
Empirical operation: determination of equality.
Invariance: any permutation (one-to-one transformation).
Invariant statistics: number of cases, mode.
2. Ordinal Scale
Meaningful order along a single dimension.
Empirical operation: determination of greater or less.
Invariance: any order-preserving transformation.
Invariant statistics: median, percentiles.
Ex: Many attitude items, such as XMARSEX:
always wrong, almost always wrong, wrong only sometimes, not wrong at all.
3. Interval Scale
Meaningful unit of measurement.
Empirical operation: determination of equality of intervals
or differences.
Invariance: any linear transformation, replacing x with
y = a + bx (with b > 0).
Invariant statistics: mean, standard deviation.
Ex: Time, with different calendars using vastly different zeros,
but more-or-less agreeing on length of a year.
4. Ratio Scale
Meaningful zero point.
Empirical operation: determination of equality of
ratios.
Invariance: proportional change of scale, y = bx.
Invariant statistics: geometric mean, coefficient of
variation.
Ex: Years of schooling, EDUC
Ex: Dollar amount of money (but not unequally wide categories
as in RINCOM91)
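One way to read "invariant statistics" in the outline above is that the statistic gives the same answer whether the permitted transformation is applied before or after computing it. The following Python sketch checks this on a handful of made-up scores. The scores, the particular transformations, and the helper names commutes and cv are all assumptions introduced for illustration, not part of Stevens' text.

```python
import math
import statistics

scores = [1, 2, 2, 3, 7]   # made-up scores

def commutes(statistic, transform, data):
    """True if transform-then-summarize equals summarize-then-transform."""
    return math.isclose(statistic([transform(x) for x in data]),
                        transform(statistic(data)))

def cv(data):
    """Coefficient of variation: standard deviation divided by mean."""
    return statistics.stdev(data) / statistics.mean(data)

# 1. Nominal: any one-to-one relabeling leaves the mode in place.
relabel = {1: "c", 2: "a", 3: "b", 7: "d"}
mode_ok = (statistics.mode([relabel[x] for x in scores])
           == relabel[statistics.mode(scores)])

# 2. Ordinal: an order-preserving transformation (here, cubing)
#    carries the median along with it, but not the mean.
def cube(x):
    return x ** 3
median_ok = commutes(statistics.median, cube, scores)
mean_under_cube_ok = commutes(statistics.mean, cube, scores)

# 3. Interval: a linear transformation y = a + bx carries the mean
#    along, and multiplies the standard deviation by b.
a, b = 32.0, 1.8
def linear(x):
    return a + b * x
shifted = [linear(x) for x in scores]
mean_ok = commutes(statistics.mean, linear, scores)
stdev_ok = math.isclose(statistics.stdev(shifted),
                        b * statistics.stdev(scores))

# 4. Ratio: a change of unit y = bx leaves the coefficient of variation
#    unchanged, but moving the zero point (a != 0) does not.
rescaled = [2.5 * x for x in scores]
cv_ok = math.isclose(cv(rescaled), cv(scores))
cv_under_shift_ok = math.isclose(cv(shifted), cv(scores))

print(mode_ok, median_ok, mean_under_cube_ok)
print(mean_ok, stdev_ok, cv_ok, cv_under_shift_ok)
```

The two checks that come out false (the mean under cubing, and the coefficient of variation under a shifted zero) show why each statistic is listed only from the scale level at which it first becomes meaningful.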
Other considerations beyond Stevens
- Distinction between how variable is conceptualized and how it
is measured and recorded.
Ex: RINCOM91 vs Income expressed to nearest thousand dollars
- Is it fundamentally not a single dimension?
Ex: Season-related variables
Ex: MARITAL?
Ex: FEMARRY? This has 4 responses that combine aspects of
{early, later} x {alone, live-in, spouse}
- Is it mostly on a single dimension, the exceptions being such
as "Not Applicable" or "Other" or "More than one of the
above"?
Ex: NEARGOD for believer in omnipresent deity, or atheist
- In an ordinal or nominal scale, how firm or arbitrary are the
number of categories and their boundaries?
Ex: CLASS vs CLASSY
- In an interval scale, are intervals "subjectively" equal, or
in some sense "objectively" equal?
Ex: PARTNERS vs. MYFAITH
- Is there a meaningful upper anchor point for the scale?
Ex: 1.0 ? 100% ? 212 degrees F ?
Ex: Total RDA (recommended daily allowance) for nutrients ?
Is the upper anchor point a maximum, or can it be exceeded?
- Are negative values meaningful?
Ex: Loss = -(Profit)
Ex: Decline = -(Increase)
Ex: Attitude scales symmetric around a zero-like value:
strongly agree, agree, undecided, disagree, strongly disagree.
- Can the value of the variable change over time?
Ex: FAEDUC compared to EDUC
- In an ordinal scale, can, or must, an individual case move
through the categories in order?
Ex: AGE vs. CHILDS vs. POLVIEWS
- If a variable can change, can it go either direction?
Ex: AGE vs POLVIEWS
- Who or what is hypothesized as being able to change the value
of the variable?
References:
Stevens, Stanley Smith 1951. "Mathematics, Measurement and
Psychophysics." Ch. 1, pp. 1-49, in: S. S. Stevens, ed. 1951.
Handbook of Experimental Psychology. New York: Wiley.
Luce, R. Duncan, and Carol L. Krumhansl 1988. "Measurement,
Scaling, and Psychophysics." Ch. 1, pp. 3-74, in: Richard C.
Atkinson, Richard J. Herrnstein, Gardner Lindzey, and R. Duncan
Luce, eds. 1988. Stevens' Handbook of Experimental Psychology.
2nd edn. New York: Wiley.