RESEARCH ISSUES
SHAWNEE VICKERY, Feature Editor, Eli Broad Graduate School of Management, Michigan State University

How Valid Are Measurements?

by Cornelia Dröge, Eli Broad Graduate School of Management, Michigan State University

This article is the first of two that address the following question: how valid are our measurements? Measurement is the process by which the observable indicator(s) of an attribute are classified and/or quantified so that a degree of isomorphism exists with an underlying unobservable concept. Assessing whether measurement is valid involves assessing the relationship between empirical indicators and abstract concepts. There are two basic types of assessments that need to be done before the researcher can claim that the measurement process is sufficiently valid. The first is reliability, which is concerned with the extent to which the measurement process yields consistent results when the process is "repeated" in some way (i.e., not necessarily duplicated). Reliability is necessary for measurement to be valid, but not sufficient. The second assessment is validity, a term that is usually preceded by an adjective such as "construct," "face," "statistical conclusion," "discriminant" and so on. Validity assessment involves demonstrating that the theoretical construct supposedly measured by the indicator(s) is actually being measured by the indicator(s).

Perfect reliability and/or validity are unachievable; rather the goal should be achieving sufficient reliability and validity for the particular purpose of the researcher. Thus, standards for reliability and validity are generally lower for exploratory research than for confirmatory or causal research. No study can address all issues in measurement, but every study should consider at least some aspects of reliability and validity. The reason is simple. If a carefully developed research hypothesis is supported (or not supported) by empirical research, researchers want to be able to state with some confidence that the substantive underlying theory is supported (or not supported). If significant reliability or validity problems exist, then substantive theoretical conclusions cannot be drawn because the replicability and meaning of the results will be questionable: has the theory been deemed correct (or incorrect) because of measurement artifacts?

An example will serve to illustrate the principles discussed so far. Suppose a researcher is interested in the relationship between manufacturing flexibility and firm performance. To test whether a hypothesized positive relationship exists, the concepts "flexibility" and "performance" must be measured. For firm performance, the researcher might specify that "firm" means SBU and that "performance" means 1996 ROI, 1996 ROA and growth in these two over the last three years. Note that of the entire hypothetical domain of attributes of firm performance, four were chosen and others such as growth in market share, return on sales (ROS) and sales growth, were rejected. A certain time frame was also chosen, and it was decided not to focus on one particular industry.

Next, the researcher decides to obtain data from CEOs in two ways: (1) on 7-point scales, and (2) actual estimates of the four chosen attributes. Both involve the CEO making marks on a paper questionnaire. Note that the researcher wishes to make statements about firm performance, but (strictly speaking) is asking for CEOs' perceptions of firm performance. Questions must be crafted and response scales designed. For example, the scale descriptors could focus on "best"/"worst" in the industry or "best"/"worst" in the SBU's history. The researcher decides to ask for two "marks" because past research has found that CEOs who are reluctant to give actual estimates will often respond on scales.

From this sketchy initial outline of a measurement process, a number of questions are immediately obvious. Will the CEOs' data from the 7-point scales be consistent with the estimated numbers? If the researcher also obtains so-called "hard" data from a public source or data from a different executive, will the results be consistent with those obtained from the CEOs? If the same CEOs were asked the same questions two days later, would consistent results be obtained? If slightly different scale descriptors (or a different question format) were used, would the results be about the same? Do the four attributes of firm performance chosen actually tap the concept of firm performance for the purpose of examining its relationship to flexibility? (Or will the effect of enhanced flexibility be seen in market share growth but not in ROI or ROI growth?) Will flexibility and performance be found to be related because the CEOs attributed performance to flexibility (i.e., attribution was actually measured, and other indicators of these constructs will show flexibility and performance are unrelated)? Alternately, will flexibility and performance be found to be unrelated in a particular study because unreliable indicators of each have been used (i.e., in fact they are related, but measurement error has masked the relationship)? These and many other questions have to do with the reliability and validity of the measurement process.

The remainder of this article will be devoted to reliability assessment. The purpose is to list some of the reliability assessments that are most relevant in business research. The reader is invited to consult the seminal references listed below for complete discussions about reliability, validity, and the research process.

One of the most commonly reported reliability checks is Cronbach's alpha. This coefficient assesses the internal consistency of multiple indicators of the same construct. A common use is to demonstrate sufficient reliability to justify taking the sum or mean of the set of indicators of a construct, and then using that sum or mean in the testing of hypothesized relationships between constructs. The value of alpha depends on the average interitem correlation ($\bar{r}$) and the number of items ($n$): $\alpha = n\bar{r} / [1 + \bar{r}(n - 1)]$. Suppose the researcher has four indicators, as in the above example of firm performance. Clearly, the more correlated the four indicators are, the more confidence the researcher has that the four are actually measuring the same construct. Alpha expresses this intuitively appealing idea.

Cronbach's alpha has the following characteristics. First, the range is 0 to 1. When alpha is .8 or over, the set of indicators is often deemed sufficiently reliable for confirmatory research (.6 for exploratory research).

Second, the mechanics necessary for the researcher to be able to calculate alpha are relatively simple. One administration of the data collection instrument can contain the multiple items, unlike the determination of test-retest reliability which requires at least two administrations. Alternate forms of the same item need not be constructed, a frequent difficulty in assessing alternate forms reliability (and alpha can be interpreted as the expected correlation between the actual set of items and a hypothetical alternate form). Alpha is unique, unlike split-halves reliability where the reliability estimate depends on the split made of the items. Alpha is also easily calculated from the correlation matrix by hand, calculator or computer.

Third, alpha is a lower bound for reliability of n unweighted items. Thus alpha is a conservative estimate of reliability. For example, coefficient theta, calculated from principal-component factor analysis as $\theta = [n/(n-1)][1 - 1/\lambda_1]$, where $\lambda_1$ is the largest eigenvalue, is at least as large as alpha. Theta can be interpreted as the maximum alpha achievable when the items are weighted. This connection between factor analysis and reliability deserves mention because factor analysis results are often also used as evidence of unidimensionality. For example, four items supposedly measure the same firm performance construct if they load on one factor. Unidimensionality is evidence of validity.

Finally, alpha is a generalization of KR20, a reliability estimate for dichotomously scaled items. Alpha is used for interval or ratio scaled items, but sometimes research interest lies in dichotomous items. For example, suppose "the extent of performance measurement" is the construct of interest, and 26 items are identified that the firm could be tracking. On a questionnaire, the researcher asks the respondent to tick the items actually tracked. The number of ticks is counted (each item ticked is 1, each not ticked is 0). How reliable is this count as a measure of the construct? Alpha cannot be used here; the appropriate reliability estimate is KR20.
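The article does not give the KR20 formula. The sketch below uses the standard Kuder-Richardson form, KR20 = [k/(k-1)][1 - (sum of p_i q_i)/(variance of total scores)], where p_i is the proportion of respondents ticking item i and q_i = 1 - p_i. The data are hypothetical and use five items rather than 26 to keep the example short.

```python
from statistics import pvariance


def kr20(responses):
    """KR20 for dichotomous (0/1) items.

    `responses` holds one row per respondent, each row a list of 0/1 scores.
    Population variances are used throughout so that item variances equal p*q.
    Raises ZeroDivisionError if every respondent has the same total score.
    """
    n_respondents = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]  # number of ticks per respondent
    total_variance = pvariance(totals)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_respondents
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical data: six firms, five performance measures (1 = tracked, 0 = not).
ticks = [
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
]
print(f"KR20 = {kr20(ticks):.3f}")
```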

In the above discussion about internal consistency, several other reliability estimates were mentioned: test-retest, alternate forms and split-halves. The first two involve administering two data collection procedures to the same respondents. In test-retest, the same questionnaire or part of a questionnaire (for example) is answered twice, while in alternate forms two questionnaires purporting to be substantively the same are answered. In split-halves, a set of items is divided in half and the summed scores of the halves are compared. In each of these assessments, simple correlations are used to demonstrate reliability.

In the example about flexibility and performance described previously, the researcher may also consider demonstrating key informant reliability. Here, the respondents' answers are compared (e.g., correlated) to data obtained elsewhere. For example, the CEOs' answers for ROI and ROA could be compared to: (1) those published in publicly available reports of some kind; and/or (2) answers to the same questions obtained from some other knowledgeable executives. In the former case, correlation with "hard" data on key items is often used to liberally declare key informant reliability for the entire questionnaire. (The reliability of so-called "hard" data is another issue!) In the latter case, the researcher is asking whether two "judges" of the same thing will give the same answer. After all, firm performance is being judged by the CEO but is not a construct about the CEO. To the extent that interjudge reliability is established, the researcher can be more confident that firm characteristics are being tapped rather than CEO characteristics (i.e., the informant about the firm is reliable).

As a final note, consider the issue of reliability assessment when measurement focuses on classifying rather than quantifying. The core questions about reliability are similar. Someone (or perhaps a computer program) must do the classification. If the same coder classifies the same content or items again, will the results be consistent? Stability of classification is similar to test-retest reliability. If a different coder given the same mandate classifies the same content or items, will the results be the same? Reproducibility or intercoder reliability is similar to interjudge reliability discussed earlier. If a standard or norm exists for a particular content or set of items, will the coder replicate that standard or norm? Accuracy of the classification by that coder is similar to key informant reliability when the respondent's answer (the coder's classification) is compared to "hard" data (the standard or norm).
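One simple way to express stability, reproducibility and accuracy is the proportion of items on which two classifications agree. The sketch below assumes this raw agreement measure, which the article does not prescribe and which does not correct for agreement expected by chance; the category labels and codings are hypothetical.

```python
def proportion_agreement(codes_a, codes_b):
    """Share of items assigned the same category in two classifications."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both classifications must cover the same items")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


# Hypothetical codings of the same eight items.
coder1_first = ["cost", "quality", "quality", "delivery",
                "cost", "flexibility", "quality", "delivery"]
coder1_second = ["cost", "quality", "quality", "delivery",
                 "cost", "flexibility", "cost", "delivery"]
coder2 = ["cost", "quality", "delivery", "delivery",
          "cost", "flexibility", "quality", "cost"]
standard = ["cost", "quality", "quality", "delivery",
            "quality", "flexibility", "quality", "delivery"]

stability = proportion_agreement(coder1_first, coder1_second)  # same coder, twice
reproducibility = proportion_agreement(coder1_first, coder2)   # two coders
accuracy = proportion_agreement(coder1_first, standard)        # coder vs. norm
print(stability, reproducibility, accuracy)  # 0.875 0.75 0.875
```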

References

Thomas D. Cook and Donald T. Campbell (1979), Quasi-Experimentation: Design and Analysis Issues for Field Settings, Houghton Mifflin Co., Boston, MA.

Fred N. Kerlinger (1986), Foundations of Behavioral Research (Third Edition), Harcourt Brace Jovanovich College Publishers, Orlando, FL.

Jum C. Nunnally (1978), Psychometric Theory (Second Edition), McGraw-Hill Inc., New York.