When a serious game is commissioned, it is expected that in-game learning will transfer to the workplace or a clinical setting, not merely lead to improvements in game play.

Vegas Effect

Evidence of transfer should be a priority in serious game development; there should be evidence that learning acquired in a game is applicable outside of the game.

The Vegas Effect is not unique to games; however, serious games will need to provide evidence that learning that happens in games does not stay in games.

The tradition of psychometrics may provide methods for data collection and analysis so that serious games may eventually serve as empirically validated diagnostic tools and measures of learning—applicable inside and outside of the game. With tools for measuring training effectiveness from psychometrics, ROI analysis of training solutions and clinical tools can be conducted, and the risk associated with the costs of game development may be diminished.

Serious games and assessment

Serious games are very much like the tools used in psychological assessments and evaluations. Three types of assessment from psychometric methods are relevant:

  • Formative assessments – measurement tools used to gauge growth and progress in a learning activity; in games, they can be used to alter subsequent learning experiences. Formative assessments are external to the learning activity and typically occur leading up to a summative evaluation.
  • Summative assessments provide an evaluation or a final summarization of learning. Summative assessment is characterized as assessment of learning, in contrast with formative assessment, which is assessment for learning. Summative assessments are also external to the learning activity and typically occur at the end of the intervention, conducted with a tool that is not part of the training.
  • An informative assessment guides and facilitates learning as part of the assessment; the assessment is the intervention. Successful participation in the learning activity itself provides evidence that learning has taken place, with no external measures added on for assessment.

Games are typically cited when defining what an informative assessment is. This makes sense: a game, by its very nature, provides an activity along with assessments, measures, and evaluation. What, why, and how a game measures learning is of primary importance, and this is why serious game designers must learn assessment methods from the field of psychometrics if serious games are to grow as diagnostic tools, assessments, and evaluations.

If a game is to act as an informative assessment, it will stress meaningful, timely, and continuous feedback about learning concepts and processes that are accurately depicted. As in an informative assessment, feedback in a game can be a powerful part of the assessment process. As players act within the context of the game's rule environment, they may learn the rules and tools through trial and error, eventually developing tactical approaches and potentially formulating strategies from the possibilities for action deduced from the in-game assessment criteria. This can be powerful.

The evidence supports this. Research findings from over 4,000 studies indicate that informative assessment has the most significant impact on achievement (Wiliam, 2007). When serious games are built with the same care as an informative assessment, using methods from psychometrics, serious games can be as effective as an informative assessment.

Currently, most games are not designed as informative assessments. This means that learning in a serious game might suffer from the Vegas Effect. For a game to act as an informative assessment, the game must accurately measure learning of the concepts, and the concepts from the game must transfer to other performance contexts beyond the game. To achieve this, the issue of construct validity must be addressed.

For a serious game to have construct validity, the training intervention it presents must be designed with emphasis on the creation of internal and external validity, that is, what we model, how we measure it, and how it is presented in the game:

  • External validity: the ability to generalize in-game learning to other contexts. To what extent can a training effect from a game be generalized to other populations (population validity), other settings (ecological validity), other treatment variables, and other measurement variables?
  • Internal validity: the adequacy of the study design, in this case the game, in establishing that the intervention was the only possible cause of a change in the player's learning.

To do this, serious game development requires valid concepts for modeling, implementing, and assessing what is to be learned, as well as for how it will be measured outside the game. This is essential for ROI (return on investment) analysis. Serious game development requires research and construct validity to conduct ROI analysis and to avoid the Vegas Effect. Learning that happens in games should not stay in games.

 

Leaving Las Vegas:

I have come across few, if any, games that have been designed with the kind of careful attention to research methodology that would be expected when measuring learning, intelligence, personality, or depression. Methods that ensure construct validity are expected in the field of psychometrics and the learning sciences, and may soon emerge as standard practice in serious game design.

Games are often designed to have surface validity. This means that the game APPEARS to measure what it is supposed to measure. Surface validity is a useful beginning, but it should only be considered a step toward a valid assessment. Building a serious game on surface validity alone should be considered a gamble: it increases the likelihood of the Vegas Effect.

To reduce the likelihood of the Vegas Effect, a serious game designer could correlate learning outcomes from their game with validated tools external to the serious game, such as formative and summative assessments. This method of validation is called criterion validity. To do this, the game designer might correlate success in the game with other diagnostic measures with verified content validity. For example, a claim may be made that a game improves working memory. This claim may be validated using the Dual N-Back Task as a measure of working memory: the game designer might have a sample of individuals take the Dual N-Back Task, play the game, and then take the Dual N-Back Task again, using it as the criterion for measuring changes in working memory.
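As a minimal sketch of this criterion-validity check, one could correlate each player's in-game score with their score on the external criterion measure. The data below are hypothetical, invented only to illustrate the calculation:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: one in-game score and one Dual N-Back score per player.
game_scores = [12, 18, 9, 22, 15, 20]
nback_scores = [3, 5, 2, 6, 4, 5]

r = pearson_r(game_scores, nback_scores)
```

A high correlation here would suggest the game score tracks the external criterion; a low one would suggest the game is measuring something else.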

Criterion validity is a powerful way to claim effectiveness and reduce the likelihood of a Vegas Effect. However, the research design is essential when using criterion validity. One cannot simply have someone play their serious game and then attribute changes in the Dual N-Back score, by correlation, to having played the serious game; correlation does not imply causation. To validate the serious game against improvements in working memory on the Dual N-Back Task, the serious game developer should employ methods from psychometrics such as a repeated measures design, with attention to sampling.
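The pre/post comparison in a repeated measures design can be sketched with a paired t statistic on the difference scores. The scores below are hypothetical, and a real validation would also require a control group and careful sampling, as noted above:

```python
from statistics import mean, stdev
from math import sqrt

def paired_t(pre, post):
    """t statistic for a repeated-measures (pre/post) comparison:
    mean difference divided by its standard error."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical Dual N-Back scores for the same players, before and
# after playing the serious game.
pre = [2, 3, 4, 3, 2, 4, 3, 2]
post = [3, 4, 5, 3, 4, 5, 4, 3]

t = paired_t(pre, post)  # compare against a t distribution with n-1 df
```

The resulting t value is then checked against a t distribution with n-1 degrees of freedom to judge whether the pre/post change is larger than chance would predict.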

To truly avoid the Vegas Effect, the serious game developer should adopt the gold standard: construct validity. This means that the learning designed into the game is measured with the same rigor as the diagnostic tools of psychometrics. By designing games with construct validity, the game scenarios can be shown to definitively deliver and measure the theoretical construct. Although this is the gold standard, it requires a significant investment of time and money. There are, however, some methods from psychometrics that can be adopted in the design process of a serious game to reduce the probability of the Vegas Effect.

One methodological step toward construct validity is to conduct a study of inter-rater agreement on the game elements that deliver instruction. The inter-rater reliability method can be used to produce a score of how much agreement there is on whether the game content is what we say it is. One way to do this is to individually present the game content to a number of sequestered subject matter experts and ask them to judge it. For example, we might present judges with a number of scenarios from a game about decision-making stages based upon B. Aubrey Fisher's four stages of group decision making (Fisher, 1970). The game developer might present each game scenario to an expert on the topic and ask them to judge whether the scenario is an example of Fisher's Orientation stage in group decision making. Here is the definition:

Orientation stage – the phase in which members meet for the first time and start to get to know each other.

Once the experts have judged the scenarios, the responses from all the judges can be gathered and inter-rater reliability calculated from the responses using Cohen's Kappa. If agreement is low, either the scale (the game scenario) is defective or the raters need to be retrained. If agreement is high, the game scenario is a step closer to construct validity.
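A minimal sketch of the Cohen's Kappa calculation for two raters follows. Note that Cohen's Kappa is defined for a pair of raters; with more than two judges, pairwise kappas can be averaged, or a multi-rater statistic such as Fleiss' Kappa used instead. The judgments below are hypothetical:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters' categorical judgments of the same items:
    observed agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical judgments, one per game scenario: is this scenario an
# example of Fisher's Orientation stage? ("yes"/"no")
rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
rater_b = ["yes", "yes", "no", "no", "no", "no", "yes", "yes"]

kappa = cohens_kappa(rater_a, rater_b)
```

Kappa near 1 indicates strong agreement beyond chance; values near 0 indicate agreement no better than chance, signaling a defective scenario or raters in need of retraining.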

Inter-rater agreement is a simple, low-cost method for increasing assessment and content validity. It is an example of how traditional research methods from psychometrics can be integrated into the design process from the beginning. As suggested here, an early step in the design process is to conduct tests of inter-rater agreement.

 

This is an excerpt from:

Dubbels, B.R. (in preparation) The Importance of Construct Validity in Designing Serious Games for Return on Investment.

Works cited:

Fisher, B. A. (1970). Decision emergence: Phases in group decision making. Speech Monographs, 37, 53-66.