RESEARCH ISSUES
SHAWNEE VICKERY, Feature Editor, Eli Broad Graduate School of Management, Michigan State University
How Valid Are Measurements?
by Cornelia Dröge, Eli Broad Graduate School of Management, Michigan State University
This article is the first of two that address the following question: how valid are our
measurements? Measurement is the process by which the
observable indicator(s) of an attribute are classified and/or
quantified so that a degree of isomorphism exists with an
underlying unobservable concept. Assessing whether measurement is
valid involves assessing the relationship between empirical
indicators and abstract concepts. There are two basic types of
assessments that need to be done before the researcher can claim
that the measurement process is sufficiently valid. The first is
reliability, which is concerned with the extent to which
the
measurement process yields consistent results when the process is
"repeated" in some way (i.e., not necessarily duplicated).
Reliability is necessary for measurement to be valid, but not
sufficient. The second assessment is validity, a term that
is usually preceded by an adjective such as "construct," "face,"
"statistical conclusion," "discriminant" and so on. Validity
assessment involves demonstrating that the theoretical construct
supposedly measured by the indicator(s) is actually being
measured
by that indicator(s).
Perfect reliability and/or validity are unachievable; rather, the
goal should be to achieve sufficient reliability and validity for
the particular purpose of the researcher. Thus, standards for
reliability and validity are generally lower for exploratory
research than for confirmatory or causal research. No study can
address all issues in measurement, but every study should
consider
at least some aspects of reliability and validity. The reason is
simple. If a carefully developed research hypothesis is supported
(or not supported) by empirical research, researchers want to be
able to state with some confidence that the substantive
underlying
theory is supported (or not supported). If significant
reliability
or validity problems exist, then substantive theoretical
conclusions cannot be drawn because the replicability and meaning
of the results will be questionable: has the theory been deemed
correct (or incorrect) because of measurement artifacts?
An example will serve to illustrate the principles discussed so
far. Suppose a researcher is interested in the relationship
between
manufacturing flexibility and firm performance. To test whether a
hypothesized positive relationship exists, the concepts
"flexibility" and "performance" must be measured. For firm
performance, the researcher might specify that "firm" means SBU
and
that "performance" means 1996 ROI, 1996 ROA and growth in these
two
over the last three years. Note that of the entire hypothetical
domain of attributes of firm performance, four were chosen and
others, such as growth in market share, return on sales (ROS) and
sales growth, were rejected. A certain time frame was also
chosen,
and it was decided not to focus on one particular industry.
Next, the researcher decides to obtain data from CEOs in two
ways:
(1) ratings on 7-point scales, and (2) actual estimates of the four
chosen
attributes. Both involve the CEO making marks on a paper
questionnaire. Note that the researcher wishes to make statements
about firm performance, but (strictly speaking) is asking for
CEOs'
perceptions of firm performance. Questions must be crafted and
response scales designed. For example, the scale descriptors
could
focus on "best"/"worst" in the industry or "best"/"worst"
in
the SBU's history. The researcher decides to ask for two "marks"
because past research has found that CEOs who are reluctant to
give
actual estimates will often respond on scales.
From this sketchy initial outline of a measurement process, a
number of questions are immediately obvious. Will the CEOs' data
from the 7-point scales be consistent with the estimated numbers?
If the researcher also obtains so-called "hard" data from a
public
source or data from a different executive, will the results be
consistent with those obtained from the CEOs? If the same CEOs
were
asked the same questions two days later, would consistent results
be obtained? If slightly different scale descriptors (or a
different question format) were used, would the results be about
the same? Do the four attributes of firm performance chosen
actually tap the concept of firm performance for the purpose of
examining its relationship to flexibility? (Or will the effect of
enhanced flexibility be seen in market share growth but not in
ROI
or ROI growth?) Will flexibility and performance be found to be
related because the CEOs attributed performance to flexibility
(i.e., attribution was actually measured, and other indicators of
these constructs will show flexibility and performance are
unrelated)? Alternatively, will flexibility and performance be
found
to be unrelated in a particular study because unreliable
indicators
of each have been used (i.e., in fact they are related, but
measurement error has masked the relationship)? These and many
other questions have to do with the reliability and validity of
the
measurement process.
The remainder of this article will be devoted to reliability
assessment. The purpose is to list some of the reliability
assessments that are most relevant in business research. The
reader
is invited to consult the seminal references listed below for
complete discussions about reliability, validity, and the
research
process.
One of the most commonly reported reliability checks is
Cronbach's
alpha. This coefficient assesses the internal consistency
of
multiple indicators of the same construct. A common use is to
demonstrate sufficient reliability to justify taking the sum or
mean of the set of indicators of a construct, and then using that
sum or mean in the testing of hypothesized relationships between
constructs. The value of alpha depends on the average interitem
correlation ($\bar{r}$) and the number of items ($n$):
$\alpha = n\bar{r} / [1 + \bar{r}(n - 1)]$. Suppose the researcher has four
indicators,
as in the above example of firm performance. Clearly, the more
correlated the four indicators are, the more confidence the
researcher has that the four are actually measuring the same
construct. Alpha expresses this intuitively appealing idea.
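The dependence of alpha on the average interitem correlation and the number of items can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the formula alpha = n * r_bar / [1 + r_bar * (n - 1)] as the intended reading of the expression above, and the function names (`pearson`, `alpha_from_average_correlation`, `standardized_alpha`) are hypothetical, not taken from any package.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def alpha_from_average_correlation(n, r_bar):
    """Alpha from the number of items n and the average interitem correlation."""
    return n * r_bar / (1 + r_bar * (n - 1))


def standardized_alpha(items):
    """Alpha for a list of indicators.

    Each indicator is a list of scores, one score per respondent.
    """
    pairs = list(combinations(items, 2))
    r_bar = sum(pearson(x, y) for x, y in pairs) / len(pairs)
    return alpha_from_average_correlation(len(items), r_bar)
```

For instance, four indicators with an average interitem correlation of .5 give 4(.5) / [1 + .5(3)] = .8 under this formula.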
Cronbach's alpha has the following characteristics. First, the
range is normally 0 to 1 (a negative value can arise only when the average interitem correlation is negative). When alpha is .8 or over, the set of indicators
is
often deemed sufficiently reliable for confirmatory research (.6
for exploratory research).
Second, the mechanics necessary for the researcher to be able to
calculate alpha are relatively simple. One administration of the
data collection instrument can contain the multiple items, unlike
the determination of test-retest reliability, which requires
at least two administrations. Alternate forms of the same item
need
not be constructed, a frequent difficulty in assessing
alternate
forms reliability (and alpha can be interpreted as the
expected
correlation between the actual set of items and a hypothetical
alternate form). Alpha is unique, unlike split-halves
reliability, where the reliability estimate depends on the
split
made of the items. Alpha is also easily calculated from the
correlation matrix by hand, calculator or computer.
Third, alpha is a lower bound for reliability of n unweighted
items. Thus alpha is a conservative estimate of reliability. For
example, coefficient theta, calculated from principal-component
factor analysis as $\theta = [n/(n-1)][1 - 1/\lambda_1]$, where $\lambda_1$ is the largest
eigenvalue, is at least as large as alpha. Theta can be interpreted as the maximum
alpha achievable when the items are weighted. This connection
between factor analysis and reliability deserves mention because
factor analysis results are often also used as evidence of
unidimensionality. For example, four items supposedly
measure the same firm performance construct if they load on one
factor. Unidimensionality is evidence of validity.
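As a rough sketch of the theta calculation, the Python below takes an interitem correlation matrix, estimates its largest eigenvalue, and applies theta = [n/(n-1)][1 - 1/(largest eigenvalue)]. Two things are assumptions of this illustration rather than part of the article: the eigenvalue is obtained by simple power iteration (a statistics package would normally be used), and the function names are hypothetical.

```python
def largest_eigenvalue(matrix, iterations=200):
    """Estimate the largest eigenvalue of a correlation matrix.

    Uses power iteration starting from a vector of ones, which is adequate
    for the mostly positive correlation matrices considered here.
    """
    size = len(matrix)
    vector = [1.0] * size
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
        norm = sum(v * v for v in product) ** 0.5
        vector = [v / norm for v in product]
        # Rayleigh quotient of the normalized vector estimates the eigenvalue.
        image = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
        eigenvalue = sum(a * b for a, b in zip(vector, image))
    return eigenvalue


def theta(correlations):
    """Coefficient theta from an n x n interitem correlation matrix."""
    n = len(correlations)
    return (n / (n - 1)) * (1 - 1 / largest_eigenvalue(correlations))


# Hypothetical correlation matrix for three items (illustration only).
example_matrix = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]
example_theta = theta(example_matrix)
```

In this made-up matrix the average interitem correlation is .4, so alpha is 3(.4) / [1 + .4(2)], or about .67, and theta comes out above that value. When all interitem correlations are equal, the two coefficients coincide.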
Finally, alpha is a generalization of KR20, a reliability
estimate
for dichotomously scaled items. Alpha is used for interval or
ratio
scaled items, but sometimes research interest lies in dichotomous
items. For example, suppose "the extent of performance
measurement"
is the construct of interest, and 26 items are identified that
the
firm could be tracking. On a questionnaire, the researcher asks
the
respondent to tick the items actually tracked. The number of
ticks
is counted (each item ticked is 1, each not ticked is 0). How
reliable is this count as a measure of the construct? Alpha
cannot
be used here; the appropriate reliability estimate is KR20.
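The article does not state the KR20 formula. The sketch below assumes the standard textbook form, KR20 = [k/(k-1)][1 - (sum of p_i * q_i) / (variance of the total count)], where k is the number of items, p_i is the proportion of respondents ticking item i, and q_i = 1 - p_i. The function name and the tiny data set are hypothetical and serve only to show the mechanics.

```python
def kr20(responses):
    """KR20 for dichotomous items.

    `responses` holds one list per respondent; each entry is 1 if the
    item was ticked and 0 if it was not.
    """
    n_respondents = len(responses)
    k = len(responses[0])

    # Variance of the total count (divisor N, to match the p * q terms).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_respondents
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_respondents

    # Sum of item variances p * q.
    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n_respondents
        pq_sum += p * (1 - p)

    return (k / (k - 1)) * (1 - pq_sum / var_total)


# Hypothetical ticks from four respondents on three items.
ticks = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
reliability_of_count = kr20(ticks)
```

With real data, each row would hold one firm's 26 ticks, and the resulting coefficient would be read against the same thresholds used for alpha.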
In the above discussion about internal consistency, several other
reliability estimates were mentioned: test-retest, alternate
forms
and split-halves. The first two involve administering two data
collection procedures to the same respondents. In test-retest,
the
same questionnaire or part of a questionnaire (for example) is
answered twice, while in alternate forms two questionnaires
purporting to be substantively the same are answered. In
split-halves, a set of items is divided in half and the summed
scores of the halves are compared. In each of these assessments,
simple correlations are used to demonstrate reliability.
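To illustrate how a split-halves estimate depends on the split made of the items, the following sketch correlates the summed scores of two halves for two different splits of the same four items. The scores are invented for illustration, and the helper names are hypothetical. The same `pearson` helper would serve for test-retest or alternate forms, where the two lists are simply the scores from the two administrations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def split_half_correlation(items, first_half):
    """Correlate summed scores of the items in `first_half` with the rest.

    `items` is a list of indicators (one list of scores per indicator);
    `first_half` is a tuple of item positions forming one half.
    """
    n_respondents = len(items[0])
    others = [i for i in range(len(items)) if i not in first_half]
    half_a = [sum(items[i][r] for i in first_half) for r in range(n_respondents)]
    half_b = [sum(items[i][r] for i in others) for r in range(n_respondents)]
    return pearson(half_a, half_b)


# Invented scores: four items answered by five respondents.
items = [
    [1, 2, 3, 4, 5],
    [1, 3, 2, 5, 4],
    [2, 1, 4, 3, 5],
    [1, 2, 3, 5, 4],
]
split_one = split_half_correlation(items, (0, 1))  # items 1+2 versus 3+4
split_two = split_half_correlation(items, (0, 2))  # items 1+3 versus 2+4
```

The two splits yield noticeably different correlations from identical data, which is the sense in which the split-halves estimate is not unique.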
In the example about flexibility and performance described
previously, the researcher may also consider demonstrating key
informant reliability. Here, the respondents' answers are
compared
(e.g., correlated) to data obtained elsewhere. For example, the
CEOs' answers for ROI and ROA could be compared to: (1) those
published in publicly available reports of some kind; and/or (2)
answers to the same questions obtained from some other
knowledgeable executives. In the former case, correlation with
"hard" data on key items is often used to liberally declare key
informant reliability for the entire questionnaire. (The
reliability of so-called "hard" data is another issue!) In the
latter case, the researcher is asking whether two "judges" of the
same thing will give the same answer. After all, firm performance
is being judged by the CEO but is not a construct about the CEO.
To
the extent that interjudge reliability is established, the
researcher can be more confident that firm characteristics are
being tapped rather than CEO characteristics (i.e., the informant
about the firm is reliable).
As a final note, consider the issue of reliability assessment
when
measurement focuses on classifying rather than quantifying. The
core questions about reliability are similar. Someone (or perhaps
a computer program) must do the classification. If the same coder
classifies the same content or items again, will the results be
consistent? Stability of classification is similar to
test-retest reliability. If a different coder given the same
mandate classifies the same content or items, will the results be
the same? Reproducibility or intercoder reliability is similar to
interjudge reliability discussed earlier. If a standard or norm
exists for a particular content or set of items, will the coder
replicate that standard or norm? Accuracy of the
classification by that coder is similar to key informant
reliability when the respondent's answer (the coder's
classification) is compared to "hard" data (the standard or
norm).
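The three classification checks can be sketched with a single comparison routine. The article does not prescribe an index; simple proportion of agreement is assumed here purely for illustration (it does not correct for agreement expected by chance), and the category labels and coder data are invented.

```python
def proportion_agreement(codes_1, codes_2):
    """Share of items that two sets of classifications assign identically."""
    if len(codes_1) != len(codes_2):
        raise ValueError("both lists must classify the same items")
    matches = sum(1 for a, b in zip(codes_1, codes_2) if a == b)
    return matches / len(codes_1)


# Invented classifications of the same five items.
coder_a_first = ["cost", "quality", "cost", "delivery", "quality"]
coder_a_second = ["cost", "quality", "cost", "quality", "quality"]
coder_b = ["cost", "cost", "delivery", "delivery", "quality"]
standard = ["cost", "quality", "cost", "delivery", "cost"]

# Stability: the same coder classifies the same items twice.
stability = proportion_agreement(coder_a_first, coder_a_second)
# Reproducibility (intercoder): a different coder, same mandate.
reproducibility = proportion_agreement(coder_a_first, coder_b)
# Accuracy: the coder against an existing standard or norm.
accuracy = proportion_agreement(coder_a_first, standard)
```

Only the second argument changes across the three checks, which mirrors the parallel with test-retest, interjudge and key informant reliability drawn above.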
Thomas D. Cook and Donald T. Campbell (1979), Quasi-Experimentation: Design and Analysis Issues for Field Settings, Houghton Mifflin Co., Boston, MA.
Fred N. Kerlinger (1986), Foundations of Behavioral Research (Third Edition), Harcourt Brace Jovanovich College Publishers, Orlando, FL.
Jum C. Nunnally (1978), Psychometric Theory (Second Edition), McGraw-Hill Inc., New York.