Research into the phenomenon of memory has interested psychologists for well over a century (Ebbinghaus, 1885; James, 1890). First popularized by Ebbinghaus, studies of human memory for stimuli generated in a laboratory, such as words and pictures, have produced myriad insights into the characteristics of memory. With tight control of encoding conditions, this approach continues to generate vital findings in the field. Yet, aspects such as deep personal relevance and entangled multi-sensorial environments cannot be captured within a laboratory setting. Accordingly, the 1970s saw a growing interest in exploring memory for real-world experiences, namely, autobiographical memory (AM). This prompted debate as to whether findings from laboratory-based studies could translate to AM and, in turn, how AM might be measured for scientific study (e.g., Banaji & Crowder, 1989; Neisser, 1978, 1982). The works of Tulving (1972), Crovitz and Schiffman (1974), Rubin, Wetzler, and Nebes (1986), Neisser (1982), Kopelman, Wilson, and Baddeley (1989), and Conway and Bekerian (1987) brought forth new methods for quantifying AM, expanding the scope of memory research by integrating the systematic study of participant narratives (see Sheldon et al., 2018, for a recent review). Critically, research has shown that AM performance can be dissociated from performance on laboratory tests of memory, in terms of both behavioral and neural processes (Conway & Rubin, 1993; Diamond, Abdi, & Levine, 2020; Gilboa, 2004; LePort, Stark, McGaugh, & Stark, 2017; McDermott, Szpunar, & Christ, 2009; Palombo, Alain, Söderlund, Khuu, & Levine, 2015).

Accordingly, AM is today recognized as a vital field of study, with broad implications for understanding healthy individuals, aging, dementia, amnesia, depression, post-traumatic stress disorder (PTSD), and beyond. To briefly highlight a few examples, studies of AM have shown that, relative to younger adults, older adults produce fewer episodic details when describing events, yet a greater number of external (non-event-specific) details, a pattern attributed to compensatory processes that “fill in” for impoverished episodic detail (Levine et al., 2002; also see Gaesser, Sacchetti, Addis, & Schacter, 2011; Devitt, Addis, & Schacter, 2017; Addis, Musicaro, Pan, & Schacter, 2010). A similar pattern has been observed in PTSD (Brown et al., 2014). AM research in depression has revealed reduced retrieval of specific memories (e.g., Dalgleish et al., 2007; Williams et al., 2007), with more recent work showing that this deficit may be more pervasive, extending to general (categorical) AMs as well (Hitchcock et al., 2019). Other work shows a deficit in the production of episodic AM details in depression (Söderlund et al., 2014). AM approaches have been particularly useful in shedding light on the nature of remote memory loss (which cannot be readily captured by laboratory approaches) in patients with certain forms of amnesia (e.g., Reed & Squire, 1998; Nadel, Samsonovich, Ryan, & Moscovitch, 2000; Irish et al., 2011).

To study AM, researchers rely on participants to narrate their personal past experiences (Footnote 1). However, as crucial as narrative studies are in providing real-world context to research, they can also be elaborate and time-consuming. This paper aims to facilitate narrative methodologies in AM research by providing a simple protocol for augmenting processing and scoring procedures. We have developed a semi-automated, paperless transcribing and scoring protocol that employs computer programming to accurately and automatically summate data, improving the efficiency of (but not replacing) manual transcribing and scoring (Fig. 1). This paper provides a guide for running narrative studies using this approach (with documentation and code included), making this methodology more accessible for future AM research. An ancillary goal is to provide some “best practices” to further facilitate transcribing and scoring narratives (see Adler et al., 2017; Syed & Nelson, 2015).

Fig. 1 Stages of the semi-automated transcribing and scoring procedure

Data collection

The current procedure was developed with data collected using the “Autobiographical Interview” (AI) protocol (see Levine et al., 2002; also see Addis, Wong, & Schacter, 2008). The AI is a standardized semi-structured interview and scoring method that has been used to examine autobiographical narratives in over 200 narrative studies (see AutobiographicalInterview.com; Footnote 2). Briefly, in the AI protocol, participants are asked to select events from their lives that are specific to a time and place (i.e., episodic memories) and then to describe these events in as much detail as possible (i.e., “Free Recall”; Footnote 3).

The narratives that participants provide are aided by two stages of probes to elicit more mnemonic details: The “General Probe” prompts participants to recall any additional details or helps guide them towards recalling a specific event, if one was not selected in Free Recall. The “Specific Probe” then consists of direct questions related to the experience of the event (for further information, see Levine et al., 2002). For a variety of reasons, some researchers opt not to administer the Specific Probe, although it can be very useful, particularly in clinical populations, for whom the additional probes provide scaffolding to cue memory recall. In the protocol described below, we do not include the Specific Probe. Although a review of AI findings is beyond the scope of this paper, we note that the AI has been used successfully to characterize patterns of memory performance in studies of healthy individuals, aging, development, patients with brain lesions, psychiatric populations, and neurodegeneration. Other studies have applied the AI to examine imaginative processes, including future thinking (see Sheldon et al., 2018, for a review).

The AI administration is captured using a digital recording device for subsequent transcription and scoring. We used a Sony PX370 Mono Digital Voice Recorder due to its ease of use and long battery life. Placing the recording device between the experimenter and participant in a quiet laboratory room will result in a high-quality recording. Clear audio is paramount, as the recording provides the original record of the narrative data that will be transcribed for analysis. After a testing session is complete, the audio file can be downloaded and saved to a secure server.

Transcribing

Following the interview, the narrative provided must be transcribed. As the transcripts developed during this process will directly impact how the data are scored, accuracy is paramount. Transcripts must be written verbatim, capturing the interview such that the written text represents it as faithfully as possible (also see Adler et al., 2017). The process can be tedious and very time-intensive. If you are using a large team of transcribers (as many labs do), developing clear protocols for the format of the final transcripts is crucial to ensure consistency. Whether using a pre-existing transcribing system (such as the Jefferson Transcription System, which captures not only what was said but how it was said; Jefferson, 1984) or developing your own (as we have done; see Appendix 1), a well-defined protocol outlining punctuation, filler words (e.g., “um” or “like”), and de-identification (e.g., personal names, addresses) relieves some of the difficulty of the transcribing process. Establishing a template document for final transcripts that marks speakers, data identification, and any other information pertinent to the study at hand simplifies the formatting of draft transcripts. As our template is formatted for the Python code used for counting scored details (see below), we recommend using this format (see Fig. 2 for an example; also see Appendix 1).

Fig. 2 Transcription of a mock Free Recall and General Probe formatted in our transcription template and scored with the Autobiographical Interview protocol (Levine et al., 2002; also see Fig. 3 for the scoring legend)

Our laboratory has opted to use Nuance’s Dragon NaturallySpeaking transcription software (version 15; 2016) to further augment human transcribing, although a variety of other transcription software options are available. Dragon translates audio files into draft transcripts and saves the transcript locally. (As Dragon does not rely on cloud services, it allows for the use of transcription software without added risks to confidentiality.) Notably, there are two ways in which one can employ Dragon. The first is during data collection, wherein Dragon transcribes speech in the moment using voice recognition (hereafter referred to as “online transcribing”) and writes the transcription to a document file (e.g., .doc). To augment accuracy, prior to the interview, Dragon can be trained on the participant’s voice by selecting from a list of accents (by region, e.g., “US; English with Chinese accent” or “Canada; English with Spanish accent”) and having the participant read a brief passage into a microphone. This training tailors Dragon to the individual’s unique manner of speech, optimizing its ability to accurately transcribe the speaker.

Alternatively, when Dragon is not used initially, a prerecorded audio file can be fed through the software to generate a draft transcript from the recording (i.e., “offline transcribing”). Launching Dragon at this stage tends to generate a slightly less accurate transcription (even with training). When a recording is of poor quality (e.g., due to excessive background noise or when the participant does not speak clearly), Dragon will not perform well. In such cases, having employed online transcription is even more valuable, as writing a transcript from scratch with challenging audio can greatly impair the progress of the study. Notably, some use the “listen and repeat” technique, in which a researcher who has trained the software with their own voice listens to the prerecorded file and vocalizes what they hear, producing a post-interview automated transcription more akin to our online transcribing method (see Matheson, 2007). As Dragon is self-learning software that “learns” your style of speaking with use, this approach is advantageous in that Dragon will become increasingly proficient at understanding the researcher employing it.

Critically, depending on the method of data collection, dialogue will alternate between the experimenter and the participant at different frequencies throughout the interview. For both online and offline approaches, Dragon produces a continuous block of text that does not differentiate between the participant and the experimenter (or between memories). Further, Dragon tailors to one voice at a time, which can present challenges in capturing dialogue between two people, i.e., the participant and the experimenter. If opting for online transcribing, one way to mitigate the blocked-text issue is to place the cursor in the appropriate location in a template file before the participant begins speaking (and again when the experimenter begins speaking), as this separates the text before the speaker changes (or a new memory begins). However, doing so can interrupt the natural flow of the interview and is thus not encouraged for all studies. This approach may be useful in studies with a simplified protocol (e.g., studies that employ only one probe). Alternatively, we utilize an editing process in which researchers meticulously examine draft transcripts and split the text onto new lines when the speaker changes and when a new memory begins. We find that in doing so, transitioning the draft transcript into our final transcription template is simplified (see Appendix 1). Given the semi-structured format of the AI, the experimenter’s speech is scripted and thus can be identified with relative ease. In the case of interview procedures with less structured experimenter-participant interactions, the “listen and repeat” technique discussed above may provide the most efficient method, as it works within Dragon’s limitation of tailoring to one voice at a time and allows the researcher to separate dialogue as they repeat the interview.

Regardless of interview structure, and whether online or offline transcribing is employed, the transcript produced by Dragon must be reviewed for errors by comparing it to the audio recording of the interview. To do so, we opted to use Express Scribe Transcription Software (version 8; 2019) to play back the audio file, coupled with an Infinity IN-USB-2 foot pedal. A foot pedal allows transcribers to easily navigate through the recording by simply using their foot to pause, fast forward, and rewind, leaving their hands free for typing. Researchers new to transcribing benefit from slowing the speed of the audio playback; however, with practice, transcriptions can easily be edited in real time. If at any point researchers are unable to discern the dialogue, they insert the word “inaudible”, followed by a timestamp noting the point in the interview at which the unclear audio occurred (see Appendix 1). This allows senior transcribers to easily find unfinished portions of the transcript by searching “inaudible” via the “Find” function in Microsoft Word (also see Footnote 4) and navigate to the appropriate section of the audio file in order to decipher what was said. (Occasionally, the inaudible text cannot be recovered, and this is noted in the file.)
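Where a laboratory prefers to batch this check across many files, the same search can be scripted. Below is a minimal sketch (not part of the scoreAI package) that lists every paragraph still containing an “inaudible” marker, assuming draft transcripts are saved as .docx files in a single folder; the folder name is a placeholder.

```python
# Minimal sketch: flag unresolved "inaudible" markers across draft transcripts.
# Assumes drafts are .docx files in the folder below (a placeholder path) and
# that python-docx is installed.
import glob

from docx import Document  # python-docx

for path in glob.glob("draft_transcripts/*.docx"):
    for i, para in enumerate(Document(path).paragraphs, start=1):
        if "inaudible" in para.text.lower():
            # Print the full paragraph so the transcriber's timestamp is
            # visible alongside the file name and paragraph number.
            print(f"{path} (paragraph {i}): {para.text.strip()}")
```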

While editing and formatting raw transcripts is laborious, our experience suggests that editing the initial draft outputted by Dragon is still much less time-consuming than manually writing transcripts directly from the audio file. Alternatively, some studies opt to collect narratives in written, rather than oral, format (e.g., Ison, 2009). In such cases, where transcription is not needed, our scoring protocol (described below) may still be useful.

Scoring

A brief overview of the Levine et al. (2002) scoring protocol

Once transcripts have been edited, data must be extracted from the narratives. By implementing scoring procedures, these qualitative data can be quantified for statistical analysis. Below, we first review key features of the AI scoring procedure (Levine et al., 2002) before turning to our pipeline for augmenting the scoring process. In the AI procedure, details are subdivided into two overarching groups: “internal” and “external.” Internal tags are given to any information pertaining to the event that the participant identified as the memory (e.g., “it was a sunny day in Vancouver”). That is, internal details represent episodic memory. Internal tags are further subdivided into five detail types to classify the content of the information provided regarding the memory: event, perception, emotion/thought, time, and place. External tags are given to any information that does not reflect the specific event (e.g., “I always loved Vancouver”). This group is further subdivided into detail types, including external events, semantic details, repetition, and other comments (such as metacognitions or clarifications; see Levine et al., 2002, for a full breakdown of detail types). More recent analyses of AI data have expanded these initial categories, for example by classifying semantic details as “personal” or “general” (Strikwerda-Brown, Mothakunnel, Hodges, Piguet, & Irish, 2019; Renoult et al., 2020). The dissociation between types of details has proved important for understanding memory performance in a variety of populations, particularly those discussed above (see Sheldon et al., 2018), but is not reviewed further here for brevity. Finally, experimenter ratings are assigned to each narrative, including a rating of episodic richness, which captures the extent to which the participant was able to evoke a sense of re-experiencing an event that is specific in time and place (see Fig. 2). For simplicity, our protocol includes only the episodic richness rating, but the reader is encouraged to see Levine et al. (2002) for the full list of ratings.
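For readers planning to tally these categories in their own code, the taxonomy can be represented as a simple mapping from scoring tags to detail types. The sketch below is purely illustrative: apart from “Int_PL” (which appears in Fig. 2), the tag strings are placeholders and should be replaced with whatever convention your laboratory adopts.

```python
# Illustrative mapping of Autobiographical Interview detail categories to
# scoring tags. Only "Int_PL" is taken from our template (Fig. 2); the other
# tag strings are hypothetical placeholders.
DETAIL_CATEGORIES = {
    "internal": {
        "Int_EV": "event",
        "Int_PER": "perception",
        "Int_EMO": "emotion/thought",
        "Int_TIME": "time",
        "Int_PL": "place",
    },
    "external": {
        "Ext_EV": "external event",
        "Ext_SEM": "semantic detail",
        "Ext_REP": "repetition",
        "Ext_OTH": "other (e.g., metacognition, clarification)",
    },
}
```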

Learning the scoring protocol is not trivial and requires practice. Briefly, our laboratory follows the scoring training protocol of the Levine Lab, wherein new scorers practice on an initial pool of memories and then move on to an established set of 20 additional memories for further training. To assess reliability, new scorers are compared to the established Levine training set (composed of seasoned scorers’ data) via intraclass correlations (see Syed & Nelson, 2015; also see Miloyan, McFarlane, & Echeverría, 2019, for a more detailed discussion of best practices for the AI specifically). Under this approach, it is not uncommon for a study to involve multiple primary scorers, who are randomly assigned memories from a pooled set of narratives. In another common approach, a primary scorer is identified who scores all the memories from a study, while a second scorer scores a random subset of these memories (e.g., 10–20%) so that interrater reliability can be computed and reported.
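As an illustration of how such a reliability check might be computed, the sketch below calculates intraclass correlations from a long-format table of detail counts using the pingouin package; the file name and column names are placeholders, and the ICC variant should be chosen to match the design of the reliability analysis.

```python
# Minimal sketch of an interrater reliability check on detail counts,
# assuming a long-format .csv with one row per memory per scorer.
# File and column names are placeholders.
import pandas as pd
import pingouin as pg

# Expected columns: memory_id, scorer, internal_count
df = pd.read_csv("training_set_counts.csv")

icc = pg.intraclass_corr(data=df, targets="memory_id",
                         raters="scorer", ratings="internal_count")

# ICC2 (two-way random effects, absolute agreement) is one commonly reported
# variant; inspect the full table to select the form appropriate to your design.
print(icc.set_index("Type").loc[["ICC2"], ["ICC", "CI95%"]])
```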

Regardless of what approach is used, we also recommend performing “drift checks” on additional practice memories (i.e., having multiple scorers compare their scored memories with each other over time) to ensure that scorers within a laboratory are not implicitly deviating from the original AI scoring over time. Moreover, where possible, it is optimal for the scorers to be blind to experimental conditions or groups.

Paperless scoring pipeline

Traditionally, scoring has been done on paper copies of the interview. Our procedure, however, offers researchers a paperless method of scoring, which not only conserves resources but, importantly, also minimizes the chance of error: by scoring in an electronic format, tallying what was scored can be accomplished automatically by computer software rather than by hand. Some software packages exist that automate the scoring and tallying procedure, such as the commercially available, general-purpose NVivo software (version 12; 2018) or the freely available “Autobiographical Interview Scoring” (AIS) software. NVivo allows for themes to be coded in transcripts (e.g., each detail type from the AI can be coded as a theme), from which a report is produced providing the total number of references to the theme in each transcript as well as the raw text that was initially coded. In contrast, the AIS is designed specifically for use with the AI (Wickner, Englert, & Addis, 2015) and allows for digital scoring and tallying of details that can be exported into a spreadsheet for analysis. Here, we have developed an additional pipeline for scoring that can be tailored to a range of AI procedures, dubbed “scoreAI” (Scoring the Autobiographical Interview). Our protocol is conceptually similar to the AIS but is more extensive, as it spans the entire processing pipeline from transcription to analysis.

First, to insert a given detail tag within a narrative, we altered keyboard shortcuts in Microsoft Office 365 Word 2019 (Footnote 4) so that simple keystrokes insert complete tags for detail types after the relevant to-be-scored clause (see Appendix 2 for complete instructions). That is, whenever a detail is identified, a tag is inserted into the Word document via the tailored keyboard shortcut. For example, if the experimenter wanted to score a detail, such as “we were at the Cheesecake Factory”, as an internal place detail, they would insert a tag (in this case “Int_PL”) after the appropriate clause (see Fig. 2). We also created a labeled keyboard cover (using a keyboard skin protector) to assist with the scoring process (see Fig. 3 for a schematic).

Fig. 3 Top: Legend of keyboard shortcuts for internal and external details and the episodic richness rating from the Autobiographical Interview protocol (Levine et al., 2002). Bottom: Example layout for a keyboard cover with keyboard shortcuts designated for each detail type. Stickers can be used to label the keyboard

Analysis

Finally, to automatically summate all the detail types, the scored transcripts are fed through a Python (2020) script and the results saved in a .csv file for subsequent analyses. This Python script is provided along with instructions for use (see Appendix 2). Briefly, the script uses the python-docx module (Canny, 2019) to read in the Microsoft Word document, based on the formatting indicated earlier (i.e., the template), and isolates the portions of the text associated with each transcribed and scored memory. The counts associated with each of the tags from the scoring procedure are then calculated, the episodic richness rating is extracted, and all values are collated into a summary table. This procedure is repeated for all available Word documents to generate a single summary table for all participants and all scored memories. We note that researchers differ in terms of whether they examine individual detail types or composite internal versus external scores. Moreover, depending on the goal of the study, the researcher may opt to control for verbal output by computing an internal-to-total ratio score (see Miloyan et al., 2019).
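To illustrate the counting step, the sketch below shows the general logic (reading scored Word documents with python-docx, counting tag occurrences, and writing a .csv with internal and external totals and an internal-to-total ratio). It is a simplified stand-in rather than the scoreAI script itself, which additionally parses the template to separate memories and extract the episodic richness rating; the tag strings (other than “Int_PL”), folder, and file names are placeholders.

```python
# Minimal sketch of tag counting across scored transcripts. This is not the
# scoreAI script; it only illustrates the counting and .csv output steps.
import csv
import glob
import re

from docx import Document  # python-docx

# Hypothetical tag set; adjust to match your scoring convention.
TAGS = ["Int_EV", "Int_PER", "Int_EMO", "Int_TIME", "Int_PL",
        "Ext_EV", "Ext_SEM", "Ext_REP", "Ext_OTH"]

rows = []
for path in glob.glob("scored_transcripts/*.docx"):
    text = "\n".join(p.text for p in Document(path).paragraphs)
    counts = {tag: len(re.findall(re.escape(tag), text)) for tag in TAGS}
    counts["internal_total"] = sum(counts[t] for t in TAGS if t.startswith("Int_"))
    counts["external_total"] = sum(counts[t] for t in TAGS if t.startswith("Ext_"))
    total = counts["internal_total"] + counts["external_total"]
    # The internal-to-total ratio controls for overall verbal output.
    counts["internal_ratio"] = counts["internal_total"] / total if total else 0.0
    rows.append({"file": path, **counts})

if rows:
    with open("detail_counts.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```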

To capture nuances of different applications of the AI, or other narrative methodologies, modifications to the template or code may be needed. Both the template and code are easily adaptable to such modifications. For example, the “tags” used for scoring can be easily adjusted in the code to tailor output towards the intended measure (for additional information, see https://github.com/cMadan/scoreAI; Madan, 2020).

We provide three practice memories and an accompanying output file for the reader to run through the code to ensure the Python script is being used correctly (see Appendix 2, Fig. 5). (We note that these practice memories are scored based on our interpretation of the Levine et al., 2002, protocol and the instructions provided by the Levine laboratory.) We encourage the reader to perform “spot checks” on a small subset of their actual data to ensure that the outputted Python results line up with manual counting.

Discussion

In the current paper, we presented a novel, semi-automated, paperless transcribing and scoring procedure tailored to AM research, particularly research that employs the AI protocol (Levine et al., 2002). For transcribing, we presented two ways of applying automatic transcription software (in this case, Dragon) to aid the transcription of participant interviews. Transcribing software does not replace human labor but accelerates it considerably. We also provided some recommendations for editing transcriptions to ensure consistency across narratives.

We then introduced an electronic scoring procedure for AM details that incorporates basic keyboard shortcuts in Microsoft Word to facilitate the standard Levine et al. (2002) scoring procedure. We also introduced a simple Python script (scoreAI) written by our group that performs automated detail counting and generates a user-friendly output file. The data in the output can then be easily analyzed with a variety of statistical procedures.

Although these procedures do not eliminate the time commitment and human labor required for AM narrative studies, they streamline the process and reduce error, making this methodology more accessible for future research.