1 Introduction

Recent years in the Human-Computer Interaction (HCI) literature have been characterized by an intense and rich exploration of user experience (UX) related concepts. Researchers have been investigating definitions and understandings of UX across different cultures and perspectives [1,2,3]; establishing concepts, frameworks and models for supporting design and development processes [4, 5]; and developing and evaluating methods, techniques, instruments and measures for evaluating UX [6,7,8,9]. In particular, researchers have been calling attention to the relevance of and need for a theoretical discussion around UX research and practice [10]. The theoretical roots of different types of UX work remain a broad work in progress, encompassing a range of theories, models and conceptual frameworks [1]. For instance, UX research has been based on psychological models and theories, formalist aesthetics and product semiotics, Gestalt theory, theories about communication, and theories inspired by the art and design fields [1]. Consequently, there are several possible understandings of the meaning of UX, each proposing different approaches for evaluating its qualities, which results in broadly different evaluation methods, techniques, and instruments.

Despite the established standards that define usability (ISO 9241-11:2018) and UX (ISO 9241-210), there is also a growing discussion aiming to clarify the distinction between these often confused concepts [78: 104]. Because UX remains a rather vague concept, difficult to fully grasp for both researchers and practitioners [11, 12], UX measurement is frequently confused with usability measurement, while satisfaction, which is a component of usability [13, 14], is indistinctly treated as a UX quality - sometimes as the one and only quality necessary to assess user experience [15]. Moreover, the literature has demonstrated that several factors, including social and cultural changes, can directly interfere with the way UX is understood and hence practiced [12, 16].

The problem with UX remaining an underdeveloped concept is the danger that user experience and its related concepts, such as trust, loyalty, identity, and engagement, will not be fully realized in studies of people and technology [17]. In this scenario, selecting a combination of UX evaluation methods commonly relies on individuals' experience and expertise rather than on information about the UX constructs that can be measured in empirical studies [18] and on which instruments can support UX measurement [19]. Although the literature has not yet established standard UX metrics, and several philosophical arguments on UX measurement have been raised [18], evaluators should not conduct UX evaluations based mostly on their personal experience and a very restricted knowledge of the methods and instruments employed [20]. There is a pressing need for UX professionals, researchers and HCI learners to make informed and conscious choices when selecting instruments and methods to evaluate UX qualities [21].

Aiming to help fill this gap, in this work we present the results of a systematic snowballing procedure [22] conducted to investigate the characteristics of the UX evaluation instruments that have been proposed and used by the HCI community in recent years. Our main goal in this research is to compile a large body of knowledge covering a wide variety of types of UX evaluation instruments, updating the literature on UX evaluation, and to provide researchers and practitioners with a useful catalog of UX instruments. We present a compilation of 116 instruments to assist researchers and practitioners in making informed choices about which instruments can support UX data collection, according to their research goals. In addition, the data analysis provided a glimpse of how the initial list of instruments evolved, allowing us to contribute a discussion of the directions research on UX evaluation instruments is taking.

2 Related Work

For a long time, usability figured as the main HCI criterion on which researchers and practitioners relied for measuring the quality of interaction with interactive systems. According to Bargas-Avila and Hornbæk, usability's focus on efficiency and the accomplishment of tasks was one of the instigating factors for the development of user experience as a concept of quality of use that addresses hedonic qualities and emotional factors, in addition to the utility and pragmatic aspects commonly covered by usability [15]. Bevan and Macleod presented an overview of tools developed to assess user performance, user satisfaction, cognitive workload, and analytic measures [23]. Some of the concepts they analyzed, such as perceived usability, were later understood as part of UX qualities.

Agarwal and Meyer conducted a survey to list existing instruments, motivated by the goal of identifying methods that went beyond usability, i.e., methods that more explicitly included emotions and related directly to user experience [24]. They identified verbal, nonverbal and physiological measurement tools and argued that good usability metrics are often indicative of good user experience. Roto, Obrist and Väänänen-Vainio-Mattila categorized User Experience Evaluation Methods (UXEM) for academic and industrial contexts, gathered in a special interest group (SIG) session [25]. They distinguished UX and usability methods based on the pragmatic/hedonic model [26] and classified them according to their methodology.

Reviews have also been conducted to investigate UX evaluation in specific application domains and UX measurement. Ganglbauer et al. conducted an overall review of psychophysiological methods used in HCI, describing in detail methods such as electroencephalography (EEG), electromyography (EMG), and electrodermal activity (EDA) [27]. Nacke, Drachen and Göbel presented a classification of methods to measure game experience, defining three categories of experience related to games: (1) quality of product, (2) quality of human-product interaction, and (3) quality of interaction in a given social, temporal, spatial or other context. Yiing, Chee and Robert described categories of qualitative HCI methods used to evaluate video-game interfaces, focusing on Affective User-Centered Design [28]. Aiming to guide professionals in choosing UX methods, they classified methods into user-feedback and non-invasive categories. Hung and Parsons conducted a survey with emphasis on the engagement construct, cataloging self-reported instruments related to UX, engagement, communication, emotion, and other qualities, later excluding those that did not belong to the HCI field [29].

Although several important studies have investigated different types of UX evaluation methods and instruments, Vermeeren et al.'s list of UX evaluation methods was used as the basis for our work, as it is one of the most complete and well-known compilations of methods in the UX literature [30]. They collected data from workshops and SIGs, and also searched the literature for previous categorizations of UX methods. As a result, they categorized 96 methods according to specific information, such as study type, development phase, requirements, type of approach and applications. In the present work, we chose to focus on practical and well-defined evaluation instruments, instead of including UX frameworks, techniques, methods and models. Hence, we analyze the original instruments listed by Vermeeren et al. from a different point of view, and include new instruments.

3 Methodology

The present study classifies and catalogs a set of 116 UX evaluation instruments gathered through snowball sampling [22], which consists of gathering research subjects by identifying an initial subject that is then used to reach other related subjects. Our initial subject was a subset of the papers listed by Vermeeren et al. [30], a seminal and highly cited paper in the area. This subset consists of the 39 papers in Vermeeren et al.'s list that describe UX tools and instruments, since the list also includes methods, models and frameworks, which are outside the scope of this work. The 39 papers - which describe 49 UX evaluation instruments - were used as the start set in the snowballing technique. As a result, we obtained a final set of 116 instruments, which includes updated versions of the ones originally listed, in addition to novel instruments proposed for different domains, such as the Internet of Things, and for specific audiences, for example, children.

Figure 1 provides a schematic of the methodology followed in this research, which has three main steps: (1) Selection of Initial Set of Instruments, (2) Snowballing [22] and (3) Instrument Cataloging.

Fig. 1. Summary of methodology steps.

The first step of this research methodology was to select a start set of papers for the snowballing procedure. Having chosen Vermeeren et al.'s list as our basis, we analyzed the 86 UX evaluation artifacts made available by the authors on their site allaboutux.com. For each artifact, two researchers independently read the name, description, and intended applications. For the purpose of selecting the papers to be included in the start set, we considered UX evaluation instruments to be planned and validated tools designed to systematically collect qualitative data or measure quantitative data related to UX constructs from a variety of participants, producing results based on psychometric properties in a format ready for analysis and interpretation. The two sets selected by the researchers were later compared and consolidated, with an expert researcher checking them and resolving inconsistencies. The inclusion criteria were: (a) the artifact must match the adopted definition of UX evaluation instrument; (b) the artifact's paper must be available in Portuguese, English or Spanish; and (c) the paper must be available in a digital library. By the end of this phase, we had selected 39 papers as our start set.

Then, to execute the snowballing sampling, the 39 papers were distributed between two researchers, who independently applied a forward snowballing technique to find new instruments. Forward snowballing refers to identifying new papers based on the papers citing the paper being examined [22]. The citations to the paper being examined were retrieved using Google Scholar and, for each original paper, the researchers verified how many times it had been cited by others. If the number of citations was greater than 100, they selected the 25 most relevant papers plus the 10 most recent articles, totaling 35 new papers for each original paper with more than 100 citations. If the original paper had fewer than 100 citations, the 25 most relevant papers were included. We acknowledge this procedure limited our capacity to catalog as many instruments as possible. Given our constraints, however, we adopted it to make significant work more likely to be included, as well as papers from authors who regularly publish in the area.
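
To make the selection rule concrete, the sketch below expresses it in Python. It is only an illustration of the rule as described above: the function and field names are our own, and the actual screening was performed manually over Google Scholar results.

```python
def select_citing_papers(citing_papers):
    """Selection rule applied to the citing papers of one original paper.

    `citing_papers` is assumed to be a list of dicts with a
    'relevance_rank' field (position in Google Scholar's relevance
    ordering) and a 'year' field; both names are illustrative.
    """
    by_relevance = sorted(citing_papers, key=lambda p: p["relevance_rank"])
    if len(citing_papers) > 100:
        # More than 100 citations: the 25 most relevant papers plus the
        # 10 most recent ones (35 candidates per original paper).
        by_recency = sorted(citing_papers, key=lambda p: p["year"], reverse=True)
        return by_relevance[:25] + by_recency[:10]
    # 100 citations or fewer: only the 25 most relevant papers.
    return by_relevance[:25]
```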

Given these criteria, each candidate paper citing an original paper was examined. The first screening was based on reading the paper's title, abstract, and keywords. If this information was insufficient for a decision, the citing paper was studied in more detail and the passage citing the already-included paper was examined. If this too was insufficient, the full text was studied to reach a decision about the new paper. The goal was to identify any evidence that the citing paper proposed a new UX evaluation artifact or an update of an existing one. In this phase, starting from the 39 start-set papers, 1001 citing papers were screened and 221 papers were read and analyzed, resulting in the inclusion of 51 papers. By the end of this phase, we had a set of 96 papers describing 103 UX evaluation instruments.

Finally, the instrument cataloging step consisted of extracting data from the selected papers. In this step, 13 new papers were included on the recommendation of a senior researcher, helping to mitigate the limitations of our paper search process. Two researchers read the full text of the papers describing the 116 UX evaluation instruments - as some papers described more than one instrument [e.g. 31 and 32] - and cataloged them. The cataloging process consisted of extracting and tabulating the following data for each instrument: reference, publication year, instrument name, type of instrument (scales, psychophysiology, post-test pictures, two-dimensional graph area, other [21]), UX qualities (overall UX, affect, emotion, fun, enjoyment, aesthetics, hedonic, engagement, flow, motivation [15, 21]), type of approach (quantitative, qualitative or quali-quantitative), main idea, general procedure, applications, and target users.
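
To give a concrete picture of what one cataloged record contains, a minimal sketch of the data structure is shown below. The field names mirror the data listed above, but the class itself is our illustration, not an artifact of the original study.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CatalogEntry:
    """One cataloged UX evaluation instrument (fields as extracted above)."""
    reference: str            # bibliographic reference of the source paper
    publication_year: int
    instrument_name: str
    instrument_type: str      # e.g. "scale/questionnaire", "psychophysiology"
    ux_qualities: List[str]   # e.g. ["emotion"] or ["overall UX"]
    approach: str             # "quantitative", "qualitative" or "quali-quantitative"
    main_idea: str            # summarized purpose of the instrument
    general_procedure: str    # how to conduct an evaluation with it
    applications: List[str]   # e.g. ["games and virtual environments"]
    target_users: str         # e.g. "all types of users", "children"
```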

The oldest instrument cataloged dates from 1982 [33], and the newest are from 2018 [34, 35]. The complete categorization of the 116 UX evaluation instruments is available at https://bit.ly/2N7K2ly. We intend to periodically update and expand the information available.

4 Results

Of the 116 UX evaluation instruments identified, 48 (41.38%) come from the start set of papers gathered from Vermeeren et al.'s list, and 68 (58.62%) are instruments developed from 2011 onwards, identified using the methodology described above. The cataloged instruments report addressing 29 different UX qualities, which can be evaluated by eight different types of instruments, as exemplified in Table 1.

Table 1. Examples of UX qualities evaluated by the different types of instruments

Scales and questionnaires constitute 62.07% of the 116 instruments identified. The second most common type of instrument is psychophysiology (10.34%), followed by two-dimensional diagrams/graph area (7.76%) and software/equipment (7.76%), and post-test picture/object (6.90%). Other types of instruments occurred less frequently, usually being developed for specific contexts, such as diary templates [36, 37], a scale combined with a two-dimensional graph area [16, 38], and an observational checklist [34]. These trends are suggestive of the directions research has taken in this field, and are further described in the remainder of this section.

We classified scales and questionnaires in the same category (“scale/questionnaire”), although we acknowledge there is a conceptual difference between their definitions. Still, we grouped them together because authors often use the terms interchangeably and, in some cases, scales are developed for specific questionnaires [45]. A good questionnaire can be described as a well-defined and well-written set of open-ended or closed-ended questions to which an individual is asked to respond [46]. Scales are used in closed-ended questions to support an ordered response from a number of given choices, in some logical order [47].

The prevalence of self-reported UX data collection is clear in the 72 scales/questionnaires identified, which reportedly evaluate a range of 26 different UX qualities (Table 2). Of these, seven (9.72%) evaluate general aspects of UX (i.e. the authors do not describe any specific UX quality), such as [48]; seven (9.72%) evaluate specific sets of UX qualities [49], as shown in Table 3; and six (8.33%) evaluate satisfaction [50]. It is important to note that analyzing whether different terms employed by authors refer to the same UX quality is outside the scope of this research.

Table 2. Examples of UX qualities evaluated by scales/questionnaires
Table 3. Examples of specific sets of UX qualities evaluated by scales/questionnaires

The UX scales found target nine different types of application. Thirty-seven (51.39%) are classified as “application-independent” (i.e. they are reportedly suitable for evaluating UX in three or more types of application), such as [51]. Thirteen (18.06%) aim to evaluate UX in games and virtual environments [52], eight (11.11%) are focused on online platforms [45], four (5.56%) are for mobile devices, and three (4.17%) target intelligent systems, environments and objects [53].

The variety of UX qualities evaluated by scales and questionnaires is greater than in other types of instruments. While the 72 scales and questionnaires measure 26 different UX qualities, the remaining 44 instruments evaluate only 8 different qualities. Regarding target users, 58 scales/questionnaires (80.56%) aim to evaluate user experience for all types of users [51], while eight (11.11%) are aimed at children [61], five (6.94%) were designed for users performing specific roles, such as journalists [48] and consumers [62], and one scale/questionnaire (1.39%) is aimed at people with disabilities [50]. Scales and questionnaires are more common than other types of instruments; nevertheless, judging by the cataloged instruments, this predominance seems to have been decreasing in recent years. Between 1982 and 1999, 14 of the 16 (87.50%) instruments are scales/questionnaires; in the following decade (2000 to 2009) their share drops to 33 of 48 (68.75%); and from 2010 to 2018 they constitute 25 of the 52 (48.08%) UX instruments identified (Fig. 2).

Fig. 2. Comparison between scales/questionnaires and other instruments by year.

4.1 Psychophysiological, Graphs, Software Instruments and Post-Test Pictures

The second most recurrent type of UX evaluation instrument cataloged is psychophysiological, in which users' physiological responses are recorded and measured, usually with sensors attached to the participant. The 12 psychophysiological instruments (10.34%) identified reportedly evaluate four different UX qualities: affect, emotion, generic user experience, and specific sets of qualities (Table 4). Nine of the psychophysiological instruments found (75%) evaluate emotion [63]; one evaluates affect [64], another evaluates generic user experience [65], and the other targets a set of UX qualities: emotion and perception [66]. Most of these instruments are application-independent (91.67%), and one is specific to evaluating UX in audiovisual applications [65]. The most common purpose of the psychophysiological instruments found is to measure emotion in any type of application (75%). All 12 psychophysiological instruments aim to evaluate user experience for all types of users.

Table 4. Examples of UX qualities evaluated by psychophysiological, two-dimensional diagrams/graph area and software/equipment instruments.

The third most common type of instrument identified is two-dimensional diagrams/graph area (7.76%). This category covers diagrams, charts, timelines and two-dimensional graph areas through which users can report their experiences. We found nine instruments in this category, which evaluate three types of UX qualities. Four of these instruments (44.44%) evaluate emotion [63], four (44.44%) evaluate specific sets of UX qualities, such as attractiveness (appeal) of the product, ease of use and utility [67], and usability, challenge, quantity of play and general impression [19], while one (11.11%) evaluates affect [68]. In addition to application-independent instruments [63], which were the most common in this category (66.67%), they target three specific types of application: audiovisual [69], games and virtual environments [19], and intelligent systems, environments and objects [70]. Eight of the two-dimensional diagrams/graph area instruments (88.89%) target all types of users, and one aims to evaluate UX specifically for children [19].

We also identified nine UX evaluation instruments developed as software or specific equipment. They evaluate seven different UX qualities: affect [71], aspects of game experience [72], behavior [73], emotion [35], feelings [74], stress [75] and generic user experience [76]. The most common target of software instruments is aspects of game experience (33.33%). The software/equipment group aims to evaluate games and virtual environments [77] and online platforms [76], besides those that are application-independent [75]. With regard to target users, eight (88.89%) software/equipment instruments evaluate user experience for all types of users and one (11.11%) is specific to product customers [74].

Eight of the 116 instruments are post-test pictures/objects. Among these, three different UX qualities were identified: emotion, evaluated by five (62.50%) instruments; affect, evaluated by two (25.00%) instruments; and a specific set of UX qualities, evaluated by one (12.50%) instrument [78] and consisting of emotion, ease of use, usefulness and intention to use. Six (75.00%) of the post-test pictures/objects instruments are application-independent and two (25.00%) are for intelligent systems, environments and objects. One (12.50%) of these instruments is specifically aimed at evaluating UX with children [79], and the other seven (87.50%) are suitable for all types of users.

4.2 Catalog of UX Evaluation Instruments

The set of UX evaluation instruments identified was organized as a catalog, systematizing and relating the data extracted from the papers describing each instrument. The catalog compiles 116 instruments, intending to assist researchers and practitioners in making informed choices about which instruments can support UX data collection, according to their research goals. For now, the catalog is presented as a set of spreadsheets, but as this research progresses we will periodically update the information available. As future work, we will develop and make available an interactive version of the catalog. For each instrument, the catalog describes: reference, publication year, instrument name, main idea, general procedure, type of instrument, type of approach, UX quality, target users and applications.

Main idea and general procedure are summarized textual descriptions that provide the reader, respectively, with an overall understanding of the purpose of an instrument and with information on how to conduct an evaluation using it. Instruments are grouped into six categories: scale/questionnaire, psychophysiological, software/equipment, two-dimensional diagrams/graph area, post-test pictures/objects, and others. Each category groups two or more types of instruments (Table 5).

Table 5. Categories of types of instruments

The type of approach can be qualitative, quali-quantitative, or quantitative. Applications are divided into eight categories: (i) online platform, (ii) audiovisual, (iii) intelligent systems, environments and objects, (iv) games and virtual environments, (v) hardware and robotics, (vi) mobile devices, and (vii) e-learning, plus the application-independent category, which describes instruments aiming to evaluate UX in three or more different types of applications. Each category includes two or more types of applications cited in the instruments' papers (Table 6). Finally, target users are categorized into four main groups: children, people with disabilities, role-specific (a category that characterizes instruments developed to evaluate UX with persons performing specific functions or roles), and “all types of users”, which describes instruments intended to be used with any users.

Table 6. Categories of types of applications

All these aspects are presented in the catalog as filters, to help people interested in conducting UX evaluation analyze which instruments to choose, depending on the goals of their evaluation, the type of application, the UX qualities to be evaluated, and the target users. In addition to presenting the classification of each instrument, the catalog also shows the relationships between categories and instrument descriptors, as depicted in Fig. 3. The full version of the catalog can be accessed at https://bit.ly/2N7K2ly.
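
To illustrate how these filters can be combined, the sketch below filters records shaped like the CatalogEntry structure sketched in Sect. 3. The function is hypothetical, not an interface the catalog itself provides, since the catalog is currently a set of spreadsheets.

```python
def filter_catalog(entries, instrument_type=None, ux_quality=None,
                   application=None, target_users=None):
    """Return the entries matching every filter that is set (None = ignore)."""
    matches = []
    for e in entries:
        if instrument_type and e.instrument_type != instrument_type:
            continue
        if ux_quality and ux_quality not in e.ux_qualities:
            continue
        if application and application not in e.applications:
            continue
        if target_users and e.target_users != target_users:
            continue
        matches.append(e)
    return matches

# Example: scales/questionnaires evaluating emotion, aimed at children.
# selected = filter_catalog(entries,
#                           instrument_type="scale/questionnaire",
#                           ux_quality="emotion",
#                           target_users="children")
```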

Fig. 3. Portion of the UX instruments catalog.

5 Discussion

The term “instrument” is traditionally associated with the measurability of UX qualities [18, 21]. Although it is far from our intention to define what characterizes a user experience instrument, in this research UX instruments are addressed in a broader way. They are seen as evaluation artifacts designed to collect user data and to facilitate observation and/or measurement of UX qualities. In our understanding, given the nature of user experience, qualitative and quantitative approaches have to be articulated for a thorough evaluation and a deeper understanding of UX qualities. Hence, our focus is on stimulating practical UX work by cataloging tools with diverse approaches, designed to systematically collect data related to UX constructs from a variety of participants.

In the remainder of this section we discuss insights, trends and concerns that emerged during data analysis regarding the use and development of UX evaluation instruments and how they incorporate UX qualities.

5.1 (Re)Use of UX Evaluation Scales/Questionnaires

Overall, our findings paint a consistent picture indicating that scales and questionnaires are the most common types of UX instruments, also addressing a greater variety of UX qualities than all other types of instruments. This indicates that the trend identified in the early 2010s [21], that scales are commonly used with most UX qualities, remains unchanged. However, a rising trend seems to be the combination of traditional techniques for capturing self-reported data with UX measurement, in quali-quantitative approaches.

Some parsimony is necessary in the development and use of UX evaluation questionnaires, since this type of instrument can either be structured, well-tested, and robust, resulting in data with a high level of validity, or poorly done, resulting in data of questionable validity [46]. This type of instrument is often used not because it is the most appropriate method but because it is the easiest [46]. A clear example of this situation is the experience report presented by Lallemand and Koenig [7], in which they describe a bad experience with a UX questionnaire that was supposed to be standardized and validated. The problem they faced stems from the fact that scales are often considered validated after a single validation study, leading to the conclusion that the scale's psychometric properties are good and that it can therefore be considered valid.

Hence, before creating new UX scales, we must consider whether, given the great quantity and variety of existing instruments, it is really necessary to create new ones. Wouldn't these instruments be more robust if we focused our efforts on validating, translating, expanding and improving already existing scales? It would be an effective way to improve UX instruments and make them suitable for the widest possible range of users. This is the case of MemoLine [19], an adaptation for children derived from the UX Curve [67]; AttrakWork [48], which proposes to “support the evaluation of user experience of mobile systems in the context of mobile news journalism” and is based on AttrakDiff [79]; and TangiSAM [80], a Braille adaptation of the Self-Assessment Manikin [40]. Researchers have also been discussing the tendency of holistic UX questionnaires to follow a “one size fits all” approach [7]. In this regard, we agree with Lallemand and Koenig [7] when they state that the development of more specific methods, targeted at particular application domains, is necessary. We further add that existing, more generic evaluation instruments should be used as the basis for this development.

5.2 Different Perspectives on How to Consider UX Qualities

Several instruments propose evaluating specific UX qualities, such as Emotion, Affect, Presence and Immersion, or even a specific set of qualities, such as Aesthetics and Emotion combined [53]. The fact that in recent years more instruments have focused on evaluating the subjective components of user interaction is positive, because it demonstrates that researchers have begun to reflect more deeply on the specificities of user experience. Thus, a broad spectrum of UX qualities has been evaluated, addressing particular types of applications and user characteristics. Some important examples of contributions designed to evaluate UX in specific and complex situations are a questionnaire developed for measuring the emotions and satisfaction of Brazilian deaf users [50], and a scale developed to measure specific sets of UX qualities with preschoolers: (1) challenge and control, (2) fantasy, (3) creative and constructive expressions, (4) social experiences and (5) body and senses [61].

In a different direction, as listed in Sect. 4, some instruments have been proposed to evaluate UX without specifying which qualities are taken into account [37], following a more generalist UX evaluation approach. This can be a consequence of the lack of consensus about what User Experience means [12], since the different understandings of this concept impact the effectiveness, development and even teaching of this discipline [81]. There are also instruments that define User Experience as a sum of qualities [e.g. 82 and 31]. We classified those as instruments that measure specific sets of UX qualities. However, the set of qualities that characterize UX varies widely from one instrument to another, which is, again, a consequence of the lack of a shared UX definition.

These situations depict a scenario in which the term User Experience seems to be used almost instinctively in some cases, making it hard to know what is assessed when an instrument claims to evaluate UX. For instance, [83] and [67] are, respectively, a questionnaire and a two-dimensional graph area, both aimed at evaluating experience with a product/artifact focus, targeted at all types of users, and application-independent. However, the first understands UX as usability, desirability, credibility, aesthetics, technical adequacy and usefulness, while the second considers attractiveness of the product, ease of use and utility.

A similar scenario occurs for specific UX qualities, such as emotion, the most frequently evaluated UX quality according to our results. For measuring emotion, [84] examines levels of desire, surprise, inspiration, amusement, admiration and satisfaction, while [85] measures valence, arousal and engagement, and [86] analyzes anger, fear, happiness, and sadness. Although one may argue that these instruments were constructed under the assumptions of different theoretical roots, the reasoning behind their psychometrics is often not explicit. Consequently, the evaluator - especially in the case of professionals - may not even be aware of these differences when choosing a UX evaluation instrument or method.

However, some UX qualities seem to be better established in the literature, such as affect. Most of the instruments that measure affect are based on or adapted from the PAD scale [87] and the PANAS scale [88], evaluating a commonly defined set of aspects that describe affect. In this context, it is important to highlight that instruments with good psychometric properties in one culture may not have the same properties when translated to another culture; hence, the relations between UX components have to be validated [89].

Although the concept of UX still needs to be better established, the commitment of researchers and practitioners to investigating definitions and improving the understanding of UX factors has been very constructive for the community. The joint efforts to develop effective evaluation methods have resulted in a variety of instruments for diverse application domains and groups of users. Also, psychophysiological measures like [64] provide the opportunity to cross self-reported and observational measures with psychophysiological information, enriching the data. This type of instrument was the second most frequently found in our cataloging, which seems to indicate that the HCI community is following the UX research agenda proposed by Law and Van Schaik [89].

6 Conclusion

This work presented an analysis and compilation of a variety of types of UX evaluation instruments and qualities, providing researchers and practitioners with a systematized catalog of UX instruments. Although this research has some limitations, discussed previously, our goal is to help support researchers and professionals in making informed decisions about the choice of instruments for UX evaluation in their everyday work. We also shared some insights and concerns about the directions research on UX evaluation has been taking, which we expect to inspire the community. Our future work includes expanding the collection and analysis of UX evaluation instruments, comprising more categories of analysis, types of applications and target users, and developing an interactive version of the catalog presented in this paper.