
1 Introduction

Advancements in technology over the past decades have given rise to the development of autonomous agents and robots. Although robots are deployed in various domains for specific tasks such as manufacturing and bomb disposal, they are mostly teleoperated, require detailed commands, and are ineffective at unstructured tasks in novel environments. Nevertheless, recent efforts have been made to develop robots with task-level autonomy. These robots are capable of multimodal communication through gestures, natural speech, and implicit communication [1]. These communication channels support direct interaction between robots and humans. Direct interaction is often found in human teams and facilitates more dialogue and collaboration, which would help robots transition from tools to teammates [2].

1.1 A Closed-Loop System

Implicit communication involves information being transmitted from the human to the robot without the explicit intent of sending a command or instruction. It includes sensing operator fatigue, workload, and other psychological states, which can convey an unarticulated need for assistance with a task. Research has shown that human operators are not always aware of their own mental state and workload, and even when they are, they may not be able to accurately assess when they would benefit from aid [3]. In adaptive systems, the use of physiological workload measures is advantageous over self-report measures, as physiological measures are more objective, allow continuous assessment of workload state with high temporal resolution, and do not require any overt response from the operator that may interfere with the task.

A closed-loop system that incorporates physiological workload measures would enable a robot to reprioritize its tasks to initiate aid without the human teammate explicitly requesting it. The robot’s response to an unstated need would provide relief in cases when the human is in a high workload state and at jeopardy of poor performance.

Information about the human teammate’s state can be used as inputs. When workload reaches a level deemed “high”, the robot would adapt its behavior to alleviate the workload. Such adaptations may include active or passive, direct or indirect behaviors, such as taking over the main task or preventing the operator from being hindered by a secondary task [4]. When the closed-loop system senses that the human teammate’s workload has returned to a level considered “low” or manageable, it would trigger robot behaviors that allow the human to resume all duties and full control. This feature minimizes the performance issues (e.g., loss of situational awareness, skill atrophy, over-reliance) associated with having the human “out-of-the-loop” [5].

1.2 Modeling Workload

In order to determine the levels of “high” and “low” workload as assessed by the various physiological workload measures, a workload model is needed. There are several approaches to modeling workload such as using tools like the Improved Performance Research Integration Tool (IMPRINT), or discrete-event simulations [6]. However, these approaches do not clearly distinguish the concepts of workload and task demands, and seem to define workload with respect to the objective demands of the task. This assumes a relatively simple relationship between task demands and workload which may not be true. Instead, in this paper, we view workload as “a mental construct that reflects the mental strain resulting from performing a task under specific environmental and operational conditions, coupled with the capability of the operator to respond to those demands” [7]. Workload is more akin to an operator’s dynamic response to task demands. By this definition, the assessment of workload requires inputs from the operator which can be in the form of physiological measures of workload.

1.3 Using Physiological Workload Measures to Classify Workload State

Physiological measures such as heart rate, heart rate variability, brain activity, and pupil size have been found to index workload and operator state [8], and have been used in adaptive systems that classify workload with some success [9]. Classifiers such as stepwise discriminant analysis (SWDA) [10] and artificial neural networks (ANNs) [11] have been used to classify workload states. However, although these often achieve high classification accuracy, their diagnosticity is limited, as their algorithms provide little information to inform the design of adaptive aids [12].

For the present study, multiple physiological workload measures were used because different physiological measures assess workload differently [13]. Some measures tap metabolic responses, reflect a more global state, and respond more slowly to changes in workload state (e.g., cerebral blood flow velocity, CBFV), while others respond to changes in workload more immediately (e.g., eye fixation durations). In addition, given the multidimensional nature of workload [14] and the workload in multi-tasking environments, no single workload measure can be expected to capture workload changes in all tasks: while a measure may respond to a particular task manipulation, it may not respond to other types of task manipulations [14]. Thus, instead of relying on just one or two measures, using multiple workload measures to classify workload state provides a more complete picture of the workload experienced [15].

1.4 Developing and Validating Models

Two datasets were used to develop and validate the workload classification model: a training dataset from which the model was derived, and a validation dataset to determine if the model was robust enough to classify workload states on a separate sample. The training dataset comprised data from a previous study, Abich [16], while the validation dataset was data from another study, the Validation Study. The Abich [16] study administered two tasks that typified an intelligence, surveillance and reconnaissance (ISR) mission. Hence the present effort sought to model the workload experienced while performing tasks related to an ISR mission. The two tasks administered in the Abich [16] study were (i) a Change Detection task, and (ii) a Threat Detection task. The Change Detection (CD) task required participants to detect and identify changes to icons, representing enemy assets and activities, overlaid on a map of an area of interest (AOI). The second task was a Threat Detection (TD) task in which participants viewed and identified characters who were pre-defined as threats, from a video feed of characters lined along the streets in a geotypical Afghan environment. The task parameters used in both tasks have shown successful workload manipulation in past studies (i.e., Abich [16]). In addition to the suite of physiological measures, workload was also assessed with the NASA Task Load Index (NASA-TLX [17]), which taps six sources of workload as well as a global index of workload.

2 A Novel Approach to Modeling Workload

In the Abich [16] study (training dataset), participants underwent the following four study scenarios in a within-subjects design (see Table 1):

Table 1. Abich [16] study scenarios

The performance and workload responses in the single task scenarios (i.e., Scenarios 1 and 3) were clearly distinct from those in the dual task scenarios (i.e., Scenarios 2 and 4), and the direction of scores showed that the single task scenarios elicited low workload while the dual task scenarios elicited high workload, as reported in the NASA-TLX ratings. This was also true for the physiological workload measures, suggesting that the single-dual task manipulation of workload was robust. The physiological workload measures included electroencephalography (EEG) tapping brain activity in different lobes, electrocardiography (ECG) measures such as heart rate variability (HRV), measures of regional oxygen saturation (rSO2) from functional near-infrared spectroscopy (fNIRS), cerebral blood flow velocity (CBFV) from transcranial Doppler ultrasonography (TCD), and a variety of ocular measures such as fixation duration.

2.1 Matching Difference Scores

Observation of the robust differences in workload found between the single and dual task scenarios led to the computation of difference scores that reflected the change in workload response between a low (single) and high (dual) workload task. For instance, a difference score for HRV was obtained from HRV in Scenario 1 (single task eliciting low workload) and HRV in Scenario 2 (dual task eliciting high workload). Another difference score for HRV was computed from HRV in Scenario 1 (single task eliciting low workload) and HRV in Scenario 4 (dual task eliciting high workload). The correlations of these difference scores, obtained from different pairs of single-dual tasks, were positive and significant (p < 0.05), and ranged from 0.299 (Theta at F3) to 0.820 (mean fixation duration). This was further evidence that, for the physiological workload measures, the magnitude of the difference in workload response between single task (low workload) and dual task (high workload) was large and stable enough to be exploited as the basis of determining the level of workload for a new task that elicited an unknown level of workload. If the difference score obtained from the single task (low workload) and new task (unknown workload) matched the difference score obtained from the same single task (low workload) and dual task (high workload), then the new task would have elicited the same high workload response as the dual task.

In this approach, there is a Single Task Baseline, a single task condition known to elicit low workload, and a Dual Task Baseline, a dual task condition known to elicit high workload. A task pair is the pairing of any two scenarios/conditions to obtain a difference score. The Benchmark Difference Score is obtained from a task pair comprising the Single Task Baseline condition and the Dual Task Baseline condition, while the Test Difference Score is computed from a task pair consisting of the same Single Task Baseline condition and a new task condition that elicits an unknown level of workload. If the Benchmark Difference Score and Test Difference Score match, then the new task condition would have elicited a similarly high workload response as the Dual Task Baseline condition. Given the consistently large differences in physiological workload scores found between Single and Dual Task conditions, matches are more likely to occur when the Test Difference Score, like the Benchmark Difference Score, is from a task pair that comprises a Single and a Dual Task condition.
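As a concrete illustration, the difference-score approach described above can be sketched as follows. This is a minimal sketch: the measure names and standardized values are hypothetical, and the actual measures and scaling in the study may differ.

```python
def difference_scores(single_baseline, other):
    """Change in each physiological measure from the Single Task
    Baseline (low workload) to another condition."""
    return {m: other[m] - single_baseline[m] for m in single_baseline}

# Hypothetical standardized readings for one participant per condition.
single_base = {"hrv": 0.1, "theta_f3": -0.2, "fix_dur": 0.0}   # low workload
dual_base   = {"hrv": -0.6, "theta_f3": 0.5, "fix_dur": 0.9}   # known high workload
new_task    = {"hrv": -0.5, "theta_f3": 0.4, "fix_dur": 0.8}   # unknown workload

benchmark = difference_scores(single_base, dual_base)  # Benchmark Difference Scores
test = difference_scores(single_base, new_task)        # Test Difference Scores
# If `test` closely matches `benchmark`, the new task is inferred to have
# elicited a similarly high workload as the Dual Task Baseline.
```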

3 Algorithms to Combine Multiple Physiological Measures

Although the matching of difference scores enabled the workload level of a new task to be determined, an algorithm was needed to combine the difference scores from various physiological measures to form a workload index. The workload index would reflect the match of physiological workload responses between the set of Benchmark Difference Scores and the set of Test Difference Scores. A high degree of match would indicate that the physiological workload response to the new task was similar to that of a Dual task, which had been established as eliciting high workload.

Studies suggest that there is sizeable individual variability in physiological responses to workload [18]; for example, some individuals may show a marked difference in ECG measures between low and high workload-eliciting tasks, whereas others may show a larger change in EEG measures. Hence, a robust algorithm would need to account for individual differences in physiological response to workload and allow the computation of the workload index to be customized to these differences. Although several algorithms were explored and evaluated, the two that appeared most promising for accommodating variability in workload responses are described below.

3.1 Algorithm 1: Proportion of Repeated Markers

The physiological markers of workload for the individual were first identified. These were the measures that, for the individual, showed a marked difference in response between the Single Task Baseline and Dual Task Baseline conditions, i.e., were sensitive to dual-tasking. The markers were obtained from the set of Benchmark Difference Scores, and these were compared to the markers obtained from the set of Test Difference Scores. The workload index was the proportion of markers in the Benchmark Difference Scores that re-emerged as markers (“repeated markers”) among the Test Difference Scores. A large proportion of repeated markers would mean that the physiological responses evoked by the new task were similar to those elicited by the Dual Task Baseline (i.e., the high workload task), and the workload index would approach 1 (Fig. 1).

Fig. 1. Computation of workload index reflecting proportion of repeated markers
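A minimal sketch of Algorithm 1, assuming per-measure difference scores are stored in dictionaries. The threshold used here to flag a “marked” difference is an arbitrary illustrative value, not the study’s actual criterion, and the measure names and values are hypothetical.

```python
def repeated_marker_index(benchmark, test, threshold=0.5):
    """Algorithm 1: proportion of the individual's markers (measures with
    a marked Benchmark Difference Score) that re-emerge as markers among
    the Test Difference Scores."""
    markers = {m for m, d in benchmark.items() if abs(d) >= threshold}
    if not markers:
        return 0.0
    repeated = {m for m in markers if abs(test[m]) >= threshold}
    return len(repeated) / len(markers)

# Hypothetical difference scores: three benchmark markers (hrv, theta_f3,
# fix_dur); two of them (hrv, fix_dur) re-emerge in the test scores.
benchmark = {"hrv": -0.7, "theta_f3": 0.7, "fix_dur": 0.9, "cbfv": 0.1}
test      = {"hrv": -0.6, "theta_f3": 0.2, "fix_dur": 0.8, "cbfv": 0.0}
index = repeated_marker_index(benchmark, test)  # 2/3 of markers repeated
```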

3.2 Algorithm 2: Distance Between All Difference Scores

Similarity in physiological responses (the workload index) was quantified as the Euclidean distance between the set of Benchmark Difference Scores and the set of Test Difference Scores. Smaller Euclidean distances denoted a higher degree of similarity between physiological workload responses. There is no set range for the Euclidean distance (d), which is computed as follows (see Fig. 2):

Fig. 2. Computation of workload index based on Euclidean distance
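Under the same assumed per-measure dictionaries as above (hypothetical names and values), Algorithm 2’s index can be sketched as:

```python
import math

def euclidean_index(benchmark, test):
    """Algorithm 2: Euclidean distance between the full sets of Benchmark
    and Test Difference Scores; a smaller distance denotes a more similar
    workload response."""
    return math.sqrt(sum((benchmark[m] - test[m]) ** 2 for m in benchmark))

benchmark       = {"hrv": -0.7, "theta_f3": 0.7, "fix_dur": 0.9}
test_similar    = {"hrv": -0.6, "theta_f3": 0.6, "fix_dur": 0.8}
test_dissimilar = {"hrv": 0.2, "theta_f3": -0.1, "fix_dur": 0.1}

d_similar = euclidean_index(benchmark, test_similar)        # small distance
d_dissimilar = euclidean_index(benchmark, test_dissimilar)  # larger distance
```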

4 Algorithm Evaluation

The algorithms were evaluated for use in the closed-loop system on the following:

  1. How distinguishable the mean workload index values from similar task pairs were from those obtained by chance.

  2. How distinct the distributions of workload index values from similar task pairs were from those obtained from random data.

  3. How discriminating the workload index values were in a mock-up of the closed-loop system, and how easily a cutoff for high workload could be set.

  4. How well sensitivity (d′) could be used to fine-tune the cutoff parameter.

To validate the model derived from each algorithm, a separate cross-validation dataset from a new sample of participants was used (Validation Study). This sample was also administered tasks that typified an ISR mission, including the same Change Detection task. However, instead of the Threat Detection task, they were administered a Peripheral task as the second task. The purpose was to determine the robustness of the model when applied to workload from tasks within the same ISR context, but not featuring the exact same tasks as the training dataset. The Peripheral task required participants to monitor a video feed of their robot teammate scouting the AOI, and maintain awareness of the robot’s whereabouts and the features in the environment. They had to respond to auditory prompts such as “In which direction was the robot heading before the last turn?”, and “Did the robot pass any men since the last turn?” The study scenarios for this dataset (Validation Study) were as follows (see Table 2):

Table 2. Validation study scenarios

4.1 Mean Workload Index Values

First, different task pairs were formed, yielding various sets of difference scores. The difference scores denoted the changes in physiological responses between the scenarios. Next, pairs of difference-score sets were formed. The pairs always contained data from the same sample/study because the intent was to have the first set of difference scores reflect the difference in workload response between low (Single Task) and high (Dual Task) workload (i.e., the Benchmark Difference Scores), and the second set of difference scores reflect the difference in workload response between the low workload (Single Task) condition and a new condition that elicited an unknown level of workload. The Benchmark Difference Scores and Test Difference Scores were expected to match when both sets originated from Single-Dual Task Pairs, and not expected to match when the Test Difference Scores involved random data.

The mean workload index values under the two algorithms were computed for different pairs of Benchmark and Test Difference Scores. Using Algorithm 1, when both the Benchmark Difference Scores and Test Difference Scores were from Single-Dual Task Pairs, the mean workload indices were similar, falling within a narrow range of 0.509 to 0.552. This indicated that between 50.9 % and 55.2 % of markers identified from the Benchmark Difference Scores were also markers in the Test Difference Scores, indicating relatively comparable workload responses. In contrast, when the Test Difference Scores included random data, the mean workload indices, or proportion of repeated markers, declined to 28.7 % and 32.3 %, denoting little similarity in workload responses. A similar pattern of results was obtained with Algorithm 2. Smaller Euclidean distances, indicating greater similarity, were obtained when both the Benchmark Difference Scores and Test Difference Scores were from Single-Dual Task Pairs (distances ranging from 4.677 to 4.978), compared to indices that utilized random data (distances were larger at 7.518 and 7.624).

All these findings indicated that as long as the Benchmark Difference Scores and Test Difference Scores were derived from Single-Dual Task Pairs, the workload indices would indicate a degree of match or similarity in physiological responses that was substantially higher than what would be obtained by chance. In addition, for both algorithms, the mean workload index values computed from Validation Study data were comparable to those from the Abich study. As these datasets were from different samples, this provided evidence of cross-validation of the algorithms.

4.2 Distribution of Workload Index Values

Apart from the mean of the workload index values, the algorithms were evaluated on the range and distribution of the index values. This was to check if, in addition to the mean, the range of index values from similar task pairs was also distinct from the range of values obtained by chance. Results revealed that, for both Algorithms 1 and 2, the distribution of index values when both the Benchmark Task Pair and Test Task Pair were similar Single-Dual Task Pairs was distinct from the distribution of index values when the pairs were dissimilar and included random data. This provided evidence that workload index values yielded by Algorithms 1 and 2 are likely to be sufficiently distinct from values that would be obtained by chance. Hence, if a new task induced a similar level of workload as the Dual Task Baseline (i.e., high workload), the workload index computed from these algorithms is likely to reflect that similarity.

4.3 Workload Index Values in Mock-Up of Closed-Loop System

The algorithms were also evaluated in a mock-up of the closed-loop system with the Abich data. From the mock-up, an algorithm would be selected for use in the closed-loop system, and a cutoff point for classifying the level of workload would be derived.

In addition to Algorithms 1 and 2, a derivative of Algorithm 2 was also evaluated. Instead of using all the physiological measures in the computation of Euclidean distance, Algorithm 2a used the top ten measures on which the individual exhibited the greatest change in physiological response between the Single task (low workload) and Dual task (high workload) conditions (i.e., the measures showing the largest absolute difference). This procedure further individualized the algorithm, as the workload index would include the measures that are sensitive to the workload experienced by the individual, i.e., their markers of workload.
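Algorithm 2a’s refinement can be sketched as follows, with k = 10 as in the study; the measure names and values below are hypothetical, and a smaller k is used in the example so the selection is visible.

```python
import math

def top_k_euclidean_index(benchmark, test, k=10):
    """Algorithm 2a: Euclidean distance restricted to the k measures with
    the largest absolute Benchmark Difference Score, i.e., the measures
    most sensitive to the individual's workload (their markers)."""
    top = sorted(benchmark, key=lambda m: abs(benchmark[m]), reverse=True)[:k]
    return math.sqrt(sum((benchmark[m] - test[m]) ** 2 for m in top))

# Hypothetical difference scores; with k=2, only fix_dur and hrv (the two
# largest absolute benchmark differences) enter the distance computation.
benchmark = {"hrv": -0.7, "theta_f3": 0.3, "fix_dur": 0.9}
test      = {"hrv": -0.5, "theta_f3": -0.4, "fix_dur": 0.8}
index = top_k_euclidean_index(benchmark, test, k=2)
```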

The mock-up simulated the “streaming” of data blocks every 30 s. Each data block or sample comprised data collected over 2 min. This feature was to ensure that there were enough data from the various sensors, all of which have different sampling rates, to compute a meaningful index that reflected the state of workload at that time.
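The streaming scheme above amounts to a sliding window over the sensor data. The sketch below assumes a single stream with a fixed sampling rate for simplicity; the signal and rate are placeholders.

```python
def stream_windows(samples, sample_rate_hz, window_s=120, step_s=30):
    """Sketch of the mock-up's streaming: yield overlapping 2-min windows
    of sensor samples, advancing 30 s per data block, so that each block
    holds enough data to compute a meaningful workload index."""
    window = int(window_s * sample_rate_hz)
    step = int(step_s * sample_rate_hz)
    for start in range(0, len(samples) - window + 1, step):
        yield samples[start:start + window]

# e.g., a 1 Hz sensor recorded for 5 minutes yields 2-min blocks every 30 s.
blocks = list(stream_windows(list(range(300)), sample_rate_hz=1))
```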

To further examine the contrast between workload index values from similar and dissimilar task pairs, other task pairs were formed from the study scenarios in the Abich study (see Table 1). As the Benchmark Task Pair was always a Single-Dual Task Pair (i.e., Scenario 1 and Scenario 2, or S12), the greatest similarity would be expected if the Test Task Pair was also a Single-Dual Task Pair (e.g., Scenario 1 and Scenario 4, or S14), and greater dissimilarity would result from a Test Task Pair that was a Single-Single Task Pair (e.g., Scenario 1 and Scenario 3, or S13). Based on this rationale, the expected similarity would vary across different task pairs, and this should be captured by the selected algorithm. The task pair expected to show the greatest similarity over all data samples was S12 and S12 (or S12_S12), followed by S12_S14, then S12_S13, and lastly S12_S11, which was expected to be very dissimilar.

In the mock-up with Algorithm 1, the expected order of task pairs from the most similar (S12_S12) to the most dissimilar (S12_S11) was observed (see Fig. 3). Closer examination of the index values for S12_S13 and S12_S14 (i.e., the middle two sets of task pairs, which were the most easily confounded) revealed a sufficient distance between index values for all data samples except data sample 7. Index values were relatively stable over all data samples, and there was a possible cutoff at 0.62 (i.e., if at least 62 % of physiological markers were repeated with the new task, the new task would be considered to have induced a similarly high workload as the Dual Task Baseline condition).

Fig. 3. Workload index values under Algorithm 1 (larger index values denote greater similarity)

With Algorithm 2, the expected order of task pairs from most similar to most dissimilar was not obtained (see Fig. 4). The index values for S12_S13 (a dissimilar task pair) indicated greater similarity than the values for S12_S14 (a similar task pair). Moreover, there was greater variability in the index values across the data samples despite a constant level of task load. Although a cutoff of 7.2 for this algorithm seems plausible, it is likely that the workload levels indicated by data samples 1, 3, and 7 would be erroneously classified (see Fig. 4). Because Algorithm 2 could not correctly identify when workload responses were similar or dissimilar (it showed S12_S13 as being more similar than S12_S14), it was excluded from further consideration.

Fig. 4. Workload index values under Algorithm 2 (larger index values denote lower similarity)

Algorithm 2a computed the workload index from the ten measures that were most sensitive (markers) to changes in the individual’s workload. Although the expected order of similarity was obtained, the index values at data samples 1, 2, 5, 6, and 7 would likely misclassify workload levels with a cutoff score of 4.4 (see Fig. 5).

Fig. 5. Workload index values under Algorithm 2a (larger index values denote lower similarity)

4.4 Sensitivity of Algorithms

The cutoff scores obtained from the mock-up served as decision thresholds for classifying workload in the closed-loop system. Hits comprised instances where high workload was experienced and aid was appropriately evoked (i.e., workload was correctly classified as “high”). Correct rejections were instances where low workload was experienced and no aid was evoked (i.e., workload was correctly classified as “low”). False alarms occurred when low workload was erroneously classified as “high” and aid was rendered when it was not required, while misses were cases where high workload was incorrectly classified as “low” and aid was not provided when it was needed. The optimal cutoff should maximize hit and correct rejection rates without inflating false alarm and miss rates.
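The four signal detection outcomes, applied to Algorithm 1’s index (where aid is evoked when the index reaches the cutoff), can be sketched as follows; the function and its interface are illustrative, not the study’s implementation.

```python
def classify_outcome(high_workload, index, cutoff=0.62):
    """Map an Algorithm 1 workload index and the operator's true state to
    a signal detection outcome; aid is evoked when index >= cutoff."""
    aid_evoked = index >= cutoff
    if high_workload and aid_evoked:
        return "hit"                # aid appropriately evoked
    if high_workload and not aid_evoked:
        return "miss"               # needed aid not provided
    if aid_evoked:
        return "false alarm"        # aid rendered when not required
    return "correct rejection"      # no aid needed, none given
```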

For Algorithm 1, moving the cutoff from a less conservative 0.55 to a more conservative 0.62 decreased hit rates as expected: the percentage of the sample who would have been given aid declined from 89.9 % to 88.6 % in S12_S12, and from 68.5 % to 63.1 % in S12_S14. However, as expected, false alarms also declined with the more conservative cutoff. For Algorithm 2a, shifting the cutoff from a Euclidean distance of 4.4 to a more conservative cutoff of 3.5 (a lower distance denotes greater similarity) led to the anticipated reduction in hit and false alarm rates, but the declines were much sharper. Hit rates declined from 91.95 % to 79.87 % for S12_S12 and from 75.17 % to 50.34 % for S12_S14. False alarm rates decreased from 60.13 % to 44.30 % for S12_S13, and from 68.76 % to 40.94 % for S12_S11. This suggests that with a less-than-optimal cutoff, Algorithm 2a can result in drastic changes in classification.

The algorithms were next evaluated on the signal detection measure of sensitivity, or d-Prime (d’), which is computed from hit and false alarm rates.

$$ d' = Z(\text{proportion of hits}) - Z(\text{proportion of false alarms}) $$

The average d’ for all task pairs under Algorithm 1 was 0.788 for the cutoff of 0.55, and 0.811 for the cutoff of 0.62. For Algorithm 2a, d’ was 0.574 for a cutoff of 3.5, and 0.552 for a cutoff of 4.4. Since Algorithm 1 with the cutoff of 0.62 had the highest sensitivity, it was selected for the workload model for the closed-loop system.
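The d′ computation can be reproduced with the standard library’s inverse normal CDF; the hit and false alarm rates below are illustrative values, not the study’s.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity d' = Z(hit rate) - Z(false alarm rate), where Z is the
    inverse of the standard normal cumulative distribution function."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Illustrative rates: 80 % hits with 40 % false alarms gives d' ≈ 1.09;
# equal hit and false alarm rates give d' = 0 (no sensitivity).
sensitivity = d_prime(0.80, 0.40)
```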

5 Conclusion

The present study described the development and validation of a workload model for a closed-loop system that accommodates variability in physiological workload responses across individuals. It defined a systematic method for evaluating workload classification algorithms, which includes comparisons with index values obtained by chance. In addition, the study offers a viable approach to developing an individualized workload model and informs the direction of future modeling efforts. Future research is needed to implement such a model in a closed-loop system in which adaptive robot aiding would be driven by physiological measures.