RESEARCH ISSUES
SHAWNEE VICKERY, Feature Editor, Eli Broad Graduate School of Management, Michigan State University

Assessments of Validity
by Cornelia Dröge, Eli Broad Graduate School of Management, Michigan State University

This article is the second of two that addresses the following
question: how valid are our measurements? The first
article appeared in the September/October 1996 issue of Decision
Line. Its content can be summarized as follows. The terms measurement,
reliability, and validity were defined. The impact of measurement
problems on the ability of researchers to draw substantive theoretical conclusions
was discussed. An example focusing on a hypothesized positive relationship
between manufacturing flexibility and firm performance was used as an illustration.
After key terms were defined and the importance of measurement in
the research process was discussed, the article turned to the
topic of reliability. Several ways of assessing aspects of
reliability were discussed, including Cronbach's alpha for internal
consistency, theta from factor analysis, KR20 for a set of
dichotomously scaled items, and key informant reliability. The
article concluded with a very brief discussion of the meaning of
reliability when measurement focuses on classifying rather than
quantifying.
It is very important to note that measurement can be reliable but
not valid. The simple classic example is a thermometer that
always registers a temperature T that is 5 degrees higher than the
actual temperature. T and the actual temperature are perfectly
correlated, and a thousand judges could agree on what T is; i.e.,
T could be deemed perfectly reliable. It is only because we have
many other indicators of temperature that norms for measurement are
sufficiently clear and we know that the temperature T is not valid.
Unfortunately, for the topics of interest to business
researchers, measurement is not at a comparable level of
development. Even for ``hard'' data such as the Consumer Price
Index, questions about validity can be raised: Does the CPI
overestimate the inflation rate? The Boskin Senate Finance
Committee has said that it does, by 1.1%. Overall, it is
essential for researchers to assess both reliability and
validity.
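The thermometer example can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `thermometer_check`, the simulated "actual" temperatures, and the use of NumPy are assumptions introduced here, not part of the original discussion. It shows that a constant 5-degree bias leaves the correlation with the actual temperature at 1 while every reading is still wrong by 5 degrees.

```python
import numpy as np

def thermometer_check(actual, bias=5.0):
    """Return (correlation, mean error) for a thermometer with a constant bias.

    `actual` is an array of hypothetical true temperatures; `bias` is the
    constant amount the thermometer over-reads (5 degrees in the example).
    """
    readings = actual + bias                      # the biased readings T
    r = float(np.corrcoef(actual, readings)[0, 1])  # consistency of T with the truth
    mean_error = float(np.mean(readings - actual))  # systematic error of T
    return r, mean_error

# Hypothetical true temperatures, simulated purely for illustration.
rng = np.random.default_rng(0)
actual = rng.uniform(0.0, 30.0, size=1000)

r, mean_error = thermometer_check(actual)
print("correlation with actual temperature:", round(r, 6))
print("mean error in degrees:", round(mean_error, 6))
```

A correlation-based check alone would therefore never reveal the bias; only comparison against an independent standard does.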
The purpose of this second article is to list some of the types
of validity assessment that are most relevant in business
research. The reader is invited to consult the seminal reference
books listed below for complete discussions about reliability,
validity, and the research process. Validity assessment involves
demonstrating that the theoretical construct supposedly being
measured is actually being measured by the empirical
indicator(s). ``Validity'' is usually preceded by an adjective
(such as ``construct'' or ``discriminant'') that indicates what
type of validity is being assessed.
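Different authors suggest additional criteria for claiming
construct validity, but the criteria discussed here
(unidimensionality, convergent validity, discriminant validity,
and nomological validity) certainly form the core of
every author's recommendations.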
Assessing unidimensionality means determining whether a set of
indicators reflects one underlying factor (as opposed to more than
one). In the example discussed in the first part of this series,
the unidimensionality of firm performance as measured by ROI,
ROA, ROI growth, and ROA growth could be determined by conducting
an exploratory factor analysis and finding that only the first
factor has an eigenvalue of over one. Lately, unidimensionality
has been identified as a crucial implicit assumption of valid
measurement (Hattie, 1985). If the set of indicators supposedly
measuring the same theoretical construct actually reflects more
than one factor, then construct validity cannot be claimed and
reliability measures such as Cronbach's alpha may be meaningless
(Gerbing and Anderson, 1988).
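The eigenvalue-over-one check described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it uses the eigenvalues of the indicator correlation matrix (the quantity behind the eigenvalue-over-one rule) rather than a full exploratory factor analysis routine, and the data are simulated. The function names, the sample size, and the noise level are all hypothetical choices made for the example.

```python
import numpy as np

def correlation_eigenvalues(data):
    """Eigenvalues of the indicator correlation matrix, largest first.

    `data` has one row per firm and one column per indicator.
    """
    corr = np.corrcoef(data, rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]

def factors_with_eigenvalue_over_one(data):
    """Number of eigenvalues greater than one."""
    return int(np.sum(correlation_eigenvalues(data) > 1.0))

rng = np.random.default_rng(42)
n_firms = 500  # hypothetical sample size

# Case 1: four indicators driven by a single latent factor plus noise
# (standing in for ROI, ROA, ROI growth, and ROA growth).
latent = rng.normal(size=(n_firms, 1))
one_factor = latent + rng.normal(scale=0.5, size=(n_firms, 4))

# Case 2: two independent latent factors, each driving two indicators.
two_latents = rng.normal(size=(n_firms, 2))
two_factor = np.repeat(two_latents, 2, axis=1) + rng.normal(scale=0.5, size=(n_firms, 4))

print("one-factor data:", factors_with_eigenvalue_over_one(one_factor))
print("two-factor data:", factors_with_eigenvalue_over_one(two_factor))
```

In the second case more than one eigenvalue exceeds one, which is the situation in which a single composite score, and an alpha computed over all four items, would be hard to interpret.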
Convergent and discriminant validities are often evaluated
together. Convergent validity is the degree of convergence
seen when two attempts are made to measure the same construct
through maximally different methods. Different methods are
necessary so that common method variance is minimized.
Discriminant validity assesses the degree to which a
concept and its indicators differ from another concept and its
indicators.
One of the early and still classic ways of investigating
convergent and discriminant validities is Campbell and Fiske's
multitrait-multimethod approach. This approach is based on an
analysis of the correlations among indicators of different
constructs along four criteria. The four criteria may lead to
different conclusions about overall validity. The basic idea is
that the correlations among indicators of the same construct
should be (1) ``sufficiently'' different from zero and (2)
greater than the correlations with indicators of different
constructs. A more sophisticated approach to evaluating
convergent and discriminant validities is confirmatory factor
analysis, using LISREL, for example. This approach has the
advantage of allowing an overall inference about validity, as
well as simultaneously giving estimates of indicator and
composite reliabilities.
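The basic idea behind the multitrait-multimethod comparison can be illustrated with a small sketch. It implements only the two basic checks stated above, not Campbell and Fiske's full set of four criteria. The correlation values, the method names ("survey", "archival"), the function name `basic_mtmm_check`, and the 0.5 cutoff for ``sufficiently'' different from zero are all assumptions made up for this example; the sketch also assumes exactly two methods per construct.

```python
import numpy as np

# Each indicator is labeled (construct, method). All values are hypothetical.
labels = [
    ("flexibility", "survey"),
    ("flexibility", "archival"),
    ("performance", "survey"),
    ("performance", "archival"),
]

corr = np.array([
    [1.00, 0.62, 0.30, 0.18],
    [0.62, 1.00, 0.21, 0.27],
    [0.30, 0.21, 1.00, 0.58],
    [0.18, 0.27, 0.58, 1.00],
])

def basic_mtmm_check(corr, labels, min_convergent=0.5):
    """Apply the two basic checks to every same-construct, different-method pair.

    (1) convergent: the same-construct correlation is at least `min_convergent`
        (an assumed cutoff; the text leaves "sufficiently" unquantified).
    (2) discriminant: it exceeds every correlation between either indicator
        and the indicators of other constructs.
    """
    results = []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            same_construct = labels[i][0] == labels[j][0]
            different_method = labels[i][1] != labels[j][1]
            if same_construct and different_method:
                convergent_r = float(corr[i, j])
                cross = [abs(float(corr[a, b]))
                         for a in (i, j)
                         for b in range(n)
                         if labels[b][0] != labels[a][0]]
                results.append({
                    "construct": labels[i][0],
                    "convergent_r": convergent_r,
                    "convergent_ok": convergent_r >= min_convergent,
                    "discriminant_ok": convergent_r > max(cross),
                })
    return results

for row in basic_mtmm_check(corr, labels):
    print(row)
```

A check of this kind gives pair-by-pair verdicts only; it does not produce the single overall inference that the confirmatory factor analysis approach offers.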
The final common criterion for construct validity is
nomological validity, or the degree to which the construct
as measured by a set of indicators predicts other constructs that
past theoretical and empirical work says it should predict.
Suppose a researcher proposes an entirely new way to measure
manufacturing flexibility and has demonstrated reliability,
unidimensionality, and convergent and discriminant validities.
However, this new construct is not related to other constructs in
established ways that past research strongly supports. ``Lack of
nomological validity'' in this case is an expression of the
intuitively appealing idea that the new construct is suspect:
although past measurement could be faulty or so-called
established relationships could be wrong, the burden of proof is
on the researcher proposing the new way to measure the construct.
Two final comments about construct validity are noteworthy.
First, from the discussion about the various aspects of construct
validity it should be clear that a theoretical framework is
necessary in order to do an assessment. A single construct cannot
be assessed in isolation. The theory that specifies which
indicators measure which constructs and the theory that specifies
which constructs are related to which constructs are intimately
fused when it comes to assessing construct validity. Second,
there is a completely different approach to construct validation.
Cook and Campbell (1979) propose that known threats to validity
should be evaluated lest they lead to confounding. Their seminal
book, which lists and explains these threats, is a particularly
valuable reference for those researchers contemplating
experimental research. Due to lack of space, this approach cannot
be discussed here.
The correlation between the test result and the criterion is the
most important indicator of criterion-related validity, and thus
this form of validity is far less dependent than construct
validity on the theoretical net of constructs. Of course, the
test items are not chosen in an atheoretical manner in practice.
But as far as assessing criterion-related validity is concerned,
the theoretical relationship between the test and the criterion
need not be shown. The test items also need not be
unidimensional, and indeed are often chosen to cover a multitude
of dimensions. Finally, a potentially serious problem in
assessing criterion-related validity is the selection and
measurement of an appropriate criterion. Referring to the
examples given earlier, how is ``inflation rate,'' ``job
performance,'' or ``success in MBA studies'' measured? If the
criterion is measured unreliably, the simple correlation between
the test and the criterion will be attenuated and the researcher
may wrongly conclude that the test is not valid. Even the
selection of a criterion is problematic when the research topic
of interest is leadership, self-esteem, or consumer personality,
to name just a few.
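The size of the attenuation mentioned above can be illustrated with the standard relationship from classical test theory, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. This formula is not stated in the article and is brought in here as an assumption for illustration; the function name and the numerical values are hypothetical.

```python
import math

def attenuated_correlation(true_r, test_reliability, criterion_reliability):
    """Observed test-criterion correlation under classical test theory.

    observed r = true r * sqrt(test reliability * criterion reliability)
    """
    return true_r * math.sqrt(test_reliability * criterion_reliability)

# Hypothetical values: a true correlation of 0.60, a test with reliability
# 0.90, and a criterion measured with reliability of only 0.50.
observed = attenuated_correlation(0.60, 0.90, 0.50)
print("observed correlation:", round(observed, 3))

# With a perfectly reliable test and criterion, nothing is lost.
print("no attenuation:", attenuated_correlation(0.60, 1.0, 1.0))
```

Under these assumed values, roughly a third of the true correlation disappears, mostly because of the unreliable criterion rather than any weakness in the test itself.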
Once the domain has been specified, the researcher must construct
items or select items from the literature's pool so that the
items capture the meaning either of the entire concept or of each
of the concept's dimensions. As a practical matter, it is always
better to have too many rather than too few items, since items
can be dropped but rarely added during later research stages.
Unfortunately, there is no rigorous way (not even through
correlations) to assess the content validity of the set of items
that are chosen and then operationalized. The researcher must
argue that the meanings as stated by the literature's experts are
captured in the current research, or alternatively, that the
current research's focus is on particular dimensions of the
concept's domain. The researcher might also ask certain experts
directly for their opinion about the set of items as it is being
developed. These experts may be other researchers, industry
experts, managers, and/or others with expertise, and the purpose
of soliciting their opinion is to assess whether the instrument
is appropriate ``on the face of it'' (i.e., face
validity).

Gerbing, David W. and James C. Anderson (1988), ``An Updated
Paradigm for Scale Development Incorporating Unidimensionality
and Its Assessment'', Journal of Marketing Research, 25
(May), pp. 186-192.
Hattie, John R. (1985), ``Methodological Review: Assessing
Unidimensionality of Tests and Items'', Applied Psychological
Measurement, 9 (June), pp. 139-164.
Kerlinger, Fred N. (1986), Foundations of Behavioral
Research (Third Edition), Harcourt Brace Jovanovich College
Publishers, Orlando, FL.
Nunnally, Jum C. (1978), Psychometric Theory (Second
Edition), McGraw-Hill Inc., New York.