Sustainability | Free Full-Text | A Flexible Inventory of Survey Items for Environmental Concepts Generated via Special Attention to Content Validity and Item Response Theory


5.1. Item Reduction via Item Response Theory

To assess the items and concepts, we used the statistical program IRTPRO v4.2 and a graded response model [30] with the Bock–Aitkin estimation method. Classical test theory holds that multiple, redundant items are “the root of precision” (p. 57 in [72]). In item response theory, however, the basis for judging the adequacy of a set of items is the discrimination parameters and threshold values. We use that basis with the aim of reducing the number of items for each concept to three or four.
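For reference, the graded response model can be written in its standard textbook form as below. This is a sketch of the usual parameterization, not a transcription of the IRTPRO output: the symbols $a_j$ (discrimination parameter of item $j$), $b_{jk}$ (threshold $k$ of item $j$), $\theta$ (latent trait), and $m_j$ (highest response category of item $j$) are notation introduced here, and some software reports an equivalent slope-intercept form instead.

```latex
% Cumulative probability of responding in category k or higher on item j
P^{*}_{jk}(\theta) = P(X_j \ge k \mid \theta)
  = \frac{1}{1 + \exp\!\left[-a_j\,(\theta - b_{jk})\right]},
  \qquad k = 1, \dots, m_j

% Probability of responding in exactly category k (categories 0, ..., m_j)
P(X_j = k \mid \theta) = P^{*}_{jk}(\theta) - P^{*}_{j,k+1}(\theta),
  \qquad P^{*}_{j0}(\theta) = 1, \quad P^{*}_{j,m_j+1}(\theta) = 0
```

In this form, a larger $a_j$ means the item separates respondents more sharply near its thresholds, and the thresholds $b_{jk}$ locate the item along the latent trait. These are the two quantities used for item reduction in this section.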
More specifically, to guide the item reduction process, we used discrimination parameters (DPs) and threshold values. Higher DPs are better than lower DPs, ceteris paribus. While there is no hard rule, DPs > 1.34 are more than adequate [73]. With respect to threshold values, sets of survey items whose threshold values cover the range [–2.0, 2.0] are good when the interest is to distinguish people across a broad range of the latent trait [29]. When DP- and threshold-based criteria provided an equivocal basis for item reduction, we considered other properties, such as the grammatical simplicity of an item or a post hoc evaluation of content validity.
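The two screening criteria can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the item names and parameter values are hypothetical (they are not taken from the tables), the functions are not part of IRTPRO, and treating 1.34 as a lower cutoff is one reading of the guideline in [73].

```python
# Hypothetical item parameters for illustration only; not values from the paper.
ITEMS = {
    "X1": {"dp": 3.1, "thresholds": [-2.5, -1.4, -0.3, 0.8]},
    "X2": {"dp": 1.5, "thresholds": [-1.9, -0.6, 0.4, 1.5]},
    "X3": {"dp": 0.7, "thresholds": [-3.0, -1.0, 1.0, 3.0]},
}

ADEQUATE_DP = 1.34           # assumed lower cutoff for a "more than adequate" DP
TARGET_RANGE = (-2.0, 2.0)   # range of the latent trait we would like to cover


def adequate_items(items, cutoff=ADEQUATE_DP):
    """Return the names of items whose DP exceeds the cutoff."""
    return [name for name, params in items.items() if params["dp"] > cutoff]


def threshold_range(items, names):
    """Return (lowest, highest) threshold across the named items."""
    values = [t for name in names for t in items[name]["thresholds"]]
    return min(values), max(values)


def covers(observed, target=TARGET_RANGE):
    """True if the observed threshold range spans the whole target range."""
    low, high = observed
    return low <= target[0] and high >= target[1]


selected = adequate_items(ITEMS)
observed = threshold_range(ITEMS, selected)
print(selected, observed, covers(observed))
```

A set that fails the coverage check at one end, as this hypothetical set does at the upper end, corresponds to the cases below where we note the value of developing further items for that part of the scale.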
The results below are ordered so that similarly performing concepts appear, more or less, in succession. These results are also detailed in Tables 2–14. In those tables, each trial item is identified with notation such as H1 for item 1 from Hope and Sa2 for item 2 from Sacredness. We use such notation beginning in the next sentence.
Sacredness: The three items with the highest DPs were Sa5 (3.1), Sa6 (4.0), and Sa7 (3.0) (Table 2). In the previous sentence, and from this point forward, unlabeled numbers in rounded parentheses are DPs. The range of threshold values for that set of items was [–2.5, 0.8]. However, post hoc consideration of Sa6's content validity suggests that it might overlap with Connectedness. Furthermore, removing Sa6 does not reduce the range of thresholds. For those reasons, we removed it from consideration.

The next most useful item was Sa4, whose DP was 1.5. Its inclusion with Sa5 and Sa7 increases the range of threshold values to [–2.5, 1.5]. These items (Sa4, Sa5, Sa7) are likely to be a useful scale for Sacredness, but there would be value in future research that developed items with good discrimination at the top end of the scale.

Hope: The three items with the highest DPs were H2 (DP = 2.2), H4 (DP = 2.2), and H5 (DP = 1.8) (Table 3). Item H3 also had an acceptably high DP (1.5). Because H3 and H5 had similar DPs, we recommend selecting H3 instead of H5 for two reasons. Namely, H5 has more complex grammar and its content validity might be obscured by the prominent reference to government and business. The range of threshold values for that set of items (H2, H3, H4) was [–2.0, 2.3]. Other items had unacceptably low DPs or did not lead to an increased range of threshold values. While future research might develop items with higher DPs at the higher and lower ends of the scale, this set of items is at least a good starting place for assessing Hope.
Doubting Others: The items DO2 (3.0) and DO4 (2.6) had significantly higher discrimination than the remaining items (Table 4). Three other items (DO1, DO3, and DO6) had acceptable and similar DPs. Of those three, one had threshold values that covered the lower end of the scale (DO6), and another had threshold values that covered the higher end of the scale (DO1). For these reasons, we recommend retaining four items to represent Doubting Others. This set of four items has threshold values covering the range [–2.4, 2.6].
Nature’s Breadth: We eliminated several items for having DPs less than one (NB4, NB5, NB6, and NB7, Table 5). The remaining three items had acceptable DPs, i.e., NB1 (3.2), NB2 (1.8), and NB3 (1.2). The range of threshold values for those items was [–1.7, 1.7]. There would be value in developing items with good discrimination at the lower and upper portions of the scale in future research.
Connectedness: We eliminated two items for having low DPs. They are Cn1 (0.6) and Cn4 (0.9) (Table 6). The remaining three items had acceptable DPs (1.0, 1.5, and 4.6). The range of threshold values for this set of three items was [–5, 2.2].
Comfort: All five items performed similarly in the sense of having acceptable DPs and having similar ranges of threshold values (Table 7). As such, we picked the three items with the highest DPs, Cm1 (4.4), Cm2 (3.7), and Cm3 (4.6). As a set, these items had a range of [–2.3, –0.4]. There would be value in developing items with good discrimination at the lower and upper portions of the scale in future research.

For each of the next four concepts, the trial items exhibited a trade-off in the sense that DPs were negatively correlated with both the range of threshold scores and the maximum threshold score. In other words, items with better discrimination tended to represent the lower portion of the scale (i.e., individuals with low values of the trait), but not the upper portion.

For these concepts, the first step we took in item reduction was to eliminate items with low discrimination. Then, we identified an item with high discrimination (and tending to cover the low end of the scale) and two items that had better coverage at the high end of the scale (and tending to have lower discrimination). When that second criterion identified more than two items, we selected the two items with simpler grammar or better content validity.
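The reduction steps just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure as we ran it: the item names and parameters are hypothetical, the minimum DP of 1.0 is an assumed cutoff, and the final tie-break on grammar or content validity is a judgment call that the code does not model.

```python
# Hypothetical trial items; "thresholds" holds the lowest and highest threshold.
TRIAL_ITEMS = {
    "T1": {"dp": 0.8, "thresholds": [-3.5, 2.8]},
    "T2": {"dp": 2.4, "thresholds": [-2.6, -0.2]},
    "T3": {"dp": 1.6, "thresholds": [-2.0, 1.9]},
    "T4": {"dp": 1.2, "thresholds": [-1.8, 1.4]},
    "T5": {"dp": 1.9, "thresholds": [-2.4, 0.3]},
}


def reduce_items(items, min_dp=1.0, n_high_end=2):
    """Pick one high-discrimination item plus items covering the high end.

    min_dp is an assumed cutoff for "low discrimination".
    """
    # Step 1: eliminate items with low discrimination.
    kept = {name: p for name, p in items.items() if p["dp"] >= min_dp}

    # Step 2: choose the item with the highest discrimination.
    anchor = max(kept, key=lambda name: kept[name]["dp"])

    # Step 3: among the rest, prefer items whose top threshold is highest,
    # i.e., items with the best coverage at the high end of the scale.
    rest = [name for name in kept if name != anchor]
    rest.sort(key=lambda name: max(kept[name]["thresholds"]), reverse=True)

    return [anchor] + rest[:n_high_end]


print(reduce_items(TRIAL_ITEMS))
```

When step 3 leaves several candidates with similar coverage, the choice among them would be made by hand on the basis of simpler grammar or better content validity.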

While each of the next three scales is likely to be useful, there would be value in developing items with higher discrimination at the top end of the scale in future research.

Shared traits: For these nine trial items (Table 8), the DPs were negatively correlated with both the range of threshold scores (r = 0.87 ± 0.10 SE) and the score of the upper threshold (r = 0.82 ± 0.14). We eliminated ST1 from consideration for having lower discrimination (0.8). The item that we selected with higher discrimination was ST6 (1.9). The items that we picked with coverage at the higher end of the scale were ST3 (1.6) and ST7 (1.2). The range of threshold values for this set of three items is [–3.2, 1.9]. Note that we recommend selecting ST3 over ST8 because ST8 has considerably more complex grammar. For example, the Flesch–Kincaid index for sentence complexity indicates that ST3 is readable at a 4th-grade level, whereas ST8 is rated at a 12th-grade level or higher.
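The Flesch–Kincaid grade level referred to above is computed from word, sentence, and syllable counts. The sketch below implements the standard published formula; the function name is ours, the counts must be supplied by the caller (syllable counting is not attempted here), and the tool actually used to score ST3 and ST8 is not specified in the text.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts.

    grade = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: same sentence length, but more syllables per word
# in the second item, which raises the estimated grade level.
simple_item = flesch_kincaid_grade(words=10, sentences=1, syllables=12)
complex_item = flesch_kincaid_grade(words=10, sentences=1, syllables=20)
print(simple_item < complex_item)
```

Because the index rewards short sentences and short words, it is only a rough proxy for the grammatical simplicity of a survey item.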
Fragility: For these 14 trial items (Table 9), the DPs were negatively correlated with both the range of threshold scores (r = 0.84 ± 0.09) and the score of the upper threshold (r = 0.78 ± 0.12). None of the items needed to be eliminated for having unacceptably low discrimination. The item that we selected with higher discrimination was F8 (2.5). The items that we picked with coverage at the higher end of the scale were F1 (1.9) and F4 (1.8). The range of threshold values for this set of three items is [–2.7, 1.4].
Dependency: For these nine trial items (Table 10), the DPs were negatively correlated with both the range of threshold scores (r = 0.90 ± 0.08) and the score of the upper threshold (r = 0.62 ± 0.25). We eliminated D1 from consideration for having low discrimination (0.83). The item that we selected with higher discrimination was D6 (3.0). The items that we picked with coverage at the higher end of the scale were D6 (1.6) and D7 (2.1). The range of threshold values for this set of three items is [–2.2, 1.0].
Non-anthropocentrism: Three items had a wider range of threshold values, but notably lower discriminations (Table 11). In particular, the items Anta1 (0.9), Anta2 (1.4), and Anta3 (2.6) had threshold values whose range was [–4.2, 0.56]. These three items were also characterized by the word “nature” being the subject of the statements.

The other three items had more specific subjects (e.g., rivers or forests). Those items had especially high DPs (>9), but an especially narrow range of threshold values, i.e., [–1.3, –0.2].

Because we could not say in advance whether items with a subject that is generic (nature) or specific (rivers or forests) would have better predictive ability, we treated the first three items as being indicative of a dimension that we labeled Non-anthropocentrism a and the second three items as indicative of Non-anthropocentrism b.

For each of the next three dimensions, we developed only three trial items, because we were unable to think of additional items that would have high content validity without being especially redundant with other items. Consequently, there is no need to reduce the number of items. But there is value in evaluating their performance.

Animism: All three items had DPs > 2 (Table 12). The threshold values for this set of items covered the range [–2.1, 0.7]. There would be value in developing items with good discrimination at the upper portions of the scale in future research.
Stability: All three items had DPs > 1 (Table 13). Item Stb3 had especially high discrimination (28) and a range of threshold values that covered the highest end of the scale, i.e., [–0.6, 1.6]. The range of threshold values for this set of three items was [–2.2, 3.3]. There would be value in developing items with better discrimination in future research. We guess that the higher discrimination of Stb3 is associated with word choices that are more meaningful to a general audience.
Holism: Item Hol1 was eliminated because its DP was 0.3 (Table 14). The remaining two items had high discrimination, i.e., Hol2 (25) and Hol3 (2.9). The range of thresholds for this pair of items was [–1.8, 0.7].

The item with poor discrimination was a semantic differential that many survey participants may have seen as a false dichotomy. We recommend restructuring that item to be like the items that performed better, so that it reads: Forests are composed of many individual living things (individual trees and animals). But is a forest itself an individual living thing?

In any case, there would be value in developing items with good discrimination at the lower and upper portions of the scale in future research.

5.3. Predictive Ability

A survey instrument is often said to have predictive validity if it is correlated with another well-established instrument representing the same or similar underlying construct [74]. Assessing this kind of validity can be important if there are doubts about what an instrument is measuring. Predictive validity is not the goal of the analysis described in this section. (We write about predictive validity in Section 6.4.)
Rather, the purpose here is two-fold. The first purpose is to assess hypotheses about the degree to which the environmental concepts that we operationalized are predictive of measures for environmental behaviors, behavioral intentions, and two overarching attitudes about the environment. This predictive ability is taken as evidence for or against those hypotheses, not as evidence for the quality of the items’ ability to measure what they purport to measure. The second purpose is to assess predictive ability while being mindful of the often-severe constraints on survey length, which will sometimes prevent an analyst from presenting as many survey items as we have developed. We fulfill these purposes with best subsets regression, described just below. For clarity, our purpose here is not to test any particular behavioral theory (e.g., value-belief-norm theory), though we discuss the relevance of such theories in Section 6.2.

To perform this analysis, we used regression and four dependent variables to quantify the predictive ability of the survey items that we developed. The dependent variables were:

  • Responses to a survey item about the overall importance of environmental issues (see Table 16 for details).
  • Responses to a survey item about how well humans treat nature (see Table 17 for details).
  • A scale based on responses to five items pertaining to pro-environmental behavioral intentions (PEBIs).

  • A scale based on responses to four items pertaining to pro-environmental behaviors (PEBs).

We developed items about PEBs and PEBIs to represent behaviors that seemed likely to impact the environment (sensu [75]). The items also represented behaviors that we believe were not entrained by habit for most participants. The PEBIs also represented sociopolitical actions (e.g., writing a politician about an environmental issue) that can be accurately recalled and reported. The items representing PEBs and PEBIs are presented in Table 16. Other details pertaining to those items are presented in Appendix D.
We used best subsets regression (i.e., the function regsubsets() in R), which finds the best model with k predictors. We examined models for k = 1, 2, …, 8. The model results are reported in Tables 17–20.
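To make the idea of best subsets regression concrete, the following Python sketch performs an exhaustive search: for each k it fits an ordinary least squares model to every combination of k predictors and keeps the one with the highest R². It is an illustration only and not the code used for the analysis, which relied on regsubsets() in R (provided by the leaps package). The function names are ours, the data are synthetic, and R² is used as the selection criterion as an assumption.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit that includes an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def best_subsets(X, y, max_k):
    """For each k in 1..max_k, return (best column indices, R^2)."""
    n_predictors = X.shape[1]
    results = {}
    for k in range(1, max_k + 1):
        scored = [
            (r_squared(X[:, list(cols)], y), cols)
            for cols in combinations(range(n_predictors), k)
        ]
        best_r2, best_cols = max(scored)
        results[k] = (best_cols, best_r2)
    return results


# Synthetic data: the response depends only on predictors 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

results = best_subsets(X, y, max_k=3)
for k, (cols, r2) in results.items():
    print(k, cols, round(r2, 3))
```

The number of candidate models grows combinatorially with the number of predictors, which is why dedicated implementations use branch-and-bound search rather than the brute-force loop shown here. Note also that the best R² can only increase with k, so comparing models of different sizes requires a further criterion, such as the statistical significance of the predictors.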
Considering only the models in which every predictor was statistically significant, the most predictive models explained 44% and 40% of the variance for the survey items pertaining to the importance and treatment of the environment (Table 17 and Table 18). For PEBIs and PEBs, the most predictive models composed of only statistically significant predictors explained 19% (PEBIs) and 24% (PEBs) of the variance (Table 19 and Table 20).

The regression results suggest that a plausible ranking of the concepts’ predictive ability would be as follows.

Most predictive:

  • Sacredness was a strong predictor for three of the four responses (import, PEBI, PEB).

  • Hope was a strong predictor for two responses (treatment, PEBI), an important predictor for a third (PEB), and possibly a weak predictor for a fourth response (import).

  • Doubting Others was a strong predictor for three responses (import, PEBI, PEB) and possibly a weak predictor for a fourth response (treatment).

  • Dependency was a strong predictor for two responses (treatment, PEBI), and a weak predictor for two responses (treatment, PEB).

Moderately predictive:

  • Fragility was a strong predictor for one response (treatment), an important predictor of another response (PEB), and a weak predictor for another response (PEBI).

  • Connectedness was a strong predictor of two responses (import, PEBI).

  • Nature’s breadth was a strong predictor of one response (PEBI) and an important predictor of another (treatment).

  • Non-anthropocentrism a was a strong predictor of one response (import) and an important predictor of another response (treatment).

Minimally predictive:

  • Non-anthropocentrism b was an important predictor of one response (treatment) and perhaps a weak predictor of another response (import).

  • Animism was perhaps a weak predictor of two responses (treatment, PEBI).

Least predictive:

  • Comfort was perhaps a weak predictor of one response (import).

  • Holism was perhaps a weak predictor of one response (PEB).

  • Shared traits was perhaps a weak predictor of one response (PEBI).

  • Stability was not a predictor of any response.

This ranking is intended to be no more than a qualitative summary of Tables 17–20; it is not a claim about the ability of survey items to measure what they purport to measure, nor is it a broad claim about the general importance of those aspects of environmental beliefs for predicting other phenomena (see Section 6.4). Rather, it is a claim about the dataset that we analyzed. For additional context, see Table A1, Appendix E for a matrix of bivariate correlations among the variables.
