Practice effects in healthy adults: A longitudinal study on frequent repetitive cognitive testing

Background: Cognitive deterioration is a core symptom of many neuropsychiatric disorders and a target of increasing significance for novel treatment strategies. Hence, its reliable capture in long-term follow-up studies is a prerequisite for recording the natural course of diseases and for estimating potential benefits of therapeutic interventions. Since repeated neuropsychological testing is required for respective longitudinal study designs, the occurrence, time pattern and magnitude of practice effects on cognition have to be understood first under healthy good-performance conditions to enable design optimization and result interpretation in disease trials.

Methods: Healthy adults (N = 36; 47.3 ± 12.0 years; mean IQ 127.0 ± 14.1; 58% males) completed 7 testing sessions, distributed asymmetrically from high to low frequency, over 1 year (baseline, weeks 2-3, 6, 9, months 3, 6, 12). The neuropsychological test battery covered 6 major cognitive domains, each with several well-established tests.

Results: Most tests exhibited a similar pattern upon repetition: (1) clinically relevant practice effects during high-frequency testing until month 3 (Cohen's d 0.36-1.19), most pronounced early on, and (2) a performance plateau thereafter upon low-frequency testing. Few tests were non-susceptible to practice or limited by ceiling effects. The influence of confounding variables (age, IQ, personality) was minor.

Conclusions: Practice effects are prominent particularly in the early phase of high-frequency repetitive cognitive testing of healthy well-performing subjects. An optimal combination and timing of tests, as extractable from this study, will aid in controlling their impact. Moreover, normative data for serial testing may now be collected to assess normal learning curves as an important comparative readout of pathological cognitive processes.


Background
Cognitive decline is a common feature of many neuropsychiatric diseases and among the strongest determinants of real-world functioning and quality of life in affected individuals [e.g. [1][2][3][4]]. Moreover, it imposes enormous and ever-increasing costs on the progressively aging industrial societies.
Efficient treatment of cognitive impairment is urgently needed but not yet available. Therapeutically addressing cognitive outcome requires careful assessment based on comprehensive neuropsychological examination of the relevant cognitive domains. Cognitive tests can be applied cross-sectionally to obtain initial diagnostic information, but solid clinical judgments as well as research require longitudinal observation. In clinical neuropsychology, serial test administration is essential for (1) monitoring disease progression and/or potential recovery, or (2) evaluating the efficacy of a therapeutic agent or other interventions (e.g. rehabilitation programs), in randomized clinical trials as well as in the clinical follow-up of single cases. Depending on the underlying questions, testing frequencies have to be adapted to enable measurement of short-term or long-term processes. With repeated testing, however, the phenomenon of 'practice effects', reflecting the capability of an individual to learn and adjust, represents not only an additional important cognitive readout but also an interfering variable that complicates result interpretation [5][6][7].
If not properly integrated into the interpretation of cognitive results, practice effects can easily lead to false conclusions: (1) degenerative processes obscured by practice may be underestimated [30], or (2) treatment effects might be overestimated, particularly in the absence of adequate control groups [31]. Even though the integration of appropriate control groups into clinical treatment trials remains indispensable, a solid prediction of expected practice effects under healthy good-performance conditions is essential for accurate effect size estimation and for the selection of a suitable test set. In single-case follow-up, such a prediction may actually be the only reasonable basis of judgment. It is therefore surprising that, despite a number of pivotal prior studies dealing with practice effects, comprehensive characteristics of normal performance over time are not available. Specifically, the impact of practice effects during frequent repetitive neuropsychological testing over as long as 1 year, comprising all major cognitive domains, has not been systematically studied in healthy well-performing individuals.
The first objective of the present study was to start filling this gap by exploring test-specific practice effects on performance in the 6 major cognitive domains upon repetitive testing of healthy subjects over a whole year. The intertest intervals were chosen to meet typical requirements of neuroprotective trials, with short-term high-frequency followed by long-term low-frequency testing. The second objective was to extract recommendations from the present findings for designing an optimal neuropsychological testing procedure for future longitudinal clinical research or routine. Finally, the third objective was to lay the groundwork for the future collection of normative data on longitudinal learning curves as an important but thus far largely ignored diagnostic readout of cognitive abilities.

Participants
The present study was approved by the local ethical committee (Ethikkommission der Medizinischen Fakultät der Georg-August-Universität Göttingen). All study participants gave written informed consent after complete description of the study. Native German-speaking healthy subjects were recruited via public advertising and financially compensated upon completion of all follow-up sessions. (In our experience, financial compensation increases the motivation of subjects to keep their appointments but is highly unlikely to influence cognitive performance itself.) A total of 36 healthy individuals (21 males and 15 females) with a mean age of 47.3 ± 12.0 years (range 24-69 years) at study entry participated. Prior to enrolment, a standardized, semistructured interview and a physical screening examination confirmed that subjects were free of significant medical conditions or neuropsychiatric diseases (past or current). Psychopathological ratings (HAMD, PANSS [32,33]) were additionally obtained at each testing session (see Study design).

Study design
All screened and included subjects underwent comprehensive neuropsychological and psychopathological testing of approximately 2 h duration under standardized conditions (fixed test sequence, fixed time of day per subject) on 7 occasions in total. The entire study was performed by 2 examiners (trained psychologists). Tests were administered according to standard instructions and, where available, alternate forms were used (for an overview see Additional file 1). The longitudinal study design comprised a short-term high-frequency testing phase with a 3-week intertest interval (baseline, week 2-3, week 6, week 9 and month 3) and a long-term low-frequency testing phase (months 6 and 12), amounting to a total duration of 1 year per individual (Figure 1). The rationale for this testing schedule is derived from neuroprotective treatment trials on cognitive outcomes [e.g. [35]]. All included subjects completed all 7 testing sessions as scheduled (no dropouts), resulting in a complete data set without any missing data.

Neuropsychological test battery
A total of 25 tests were selected to cover the major cognitive domains: (1) attention, including TAP subtests Alertness and Visual Scanning, RBANS subtests Digit Span and Coding, and Trail Making Test A; (2) learning and memory, including RBANS subtests List Learning, List Recall, List Recognition and Figure Recall; (3) executive functions, including TAP subtest Working Memory, WCST-64, WMS-III Letter Number Sequencing, and the Regensburger Wortflüssigkeits-Test (RWT [41]) subtest phonemic verbal fluency; (4) motor functions: 9-Hole Peg Test [42], Purdue Pegboard Test [43], MacQuarrie Test for Mechanical Ability [44] subtests Tapping and Dotting; (5) language: RBANS subtests Picture Naming and Semantic Fluency; and (6) visuospatial functions: RBANS subtests Lines and Figure Copy. All listed tests are well-established and have been described in detail as referenced (for a short description see Additional file 1). For each test, only the most relevant parameter is presented to avoid overrepresentation of any one test. To minimize expected strong recall effects, the RBANS short- and long-term memory tests, visuospatial and language functions, as well as the WCST-64, were performed less frequently (baseline, week 6, months 3, 6, 12). Intelligence measures [premorbid intelligence (Mehrfachwahl-Wortschatz-Intelligenz-Test, MWT-B [45]) and state intelligence (revised German version of the Wechsler Adult Intelligence Scale, short version, HAWIE-R [46])], as well as personality measures (NEO-PI-R), were administered only at baseline to explore their potential influence on the course of cognitive performance. Current psychopathological symptoms (HAMD, PANSS [32,33]) and quality of life (visual analogue scale ranging from 0-10) were assessed at each testing time-point (Figure 1) to control for their potentially fluctuating nature.

Statistical analysis
All numerical results are presented as mean ± SD in text and tables.

Results
During the long intertest intervals after month 3 until study end (month 12), no significant performance changes were found in 23 of 25 tests, i.e. performance levels acquired by high-frequency testing remained stable and did not return to baseline values. Only one test, RWT phonemic verbal fluency, showed further enhanced test scores (p < 0.001). A ceiling effect in RBANS Picture Naming during both the high- and low-frequency testing phases artificially produced significant results (Table 1).
The longitudinal course of performance in all 6 cognitive domains (data of single tests were combined to yield the respective superordinate cognitive categories) is illustrated in Figure 2. ANOVAs conducted on these data confirmed the time pattern of the single test comparisons: strong practice effects upon high-frequency testing and a plateau once testing frequency decreased. Regarding cognitive domains, the most pronounced changes until month 3 occurred in executive functions (14.0 ± 10.7%), followed by learning/memory (13.3 ± 12.3%) and attention (11.9 ± 10.6%) (Figure 2). Improvement from baseline to the second testing accounts for the largest proportion of change in all cognitive domains (Figure 3). Accordingly, for most tests, Cohen's d was highest for the baseline to week 2-3 interval (d = 0.30-0.55 for domains, d = 0.22-0.71 for single tests). In contrast, Cohen's d calculated for the late between-assessment intervals, i.e. from month 3 to 6 or 12, would mainly show 'no effect' (exceptions: d = 0.28 for RWT phonemic verbal fluency and d = 0.50 for RBANS Picture Naming).
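The effect sizes above can be sketched with a simple repeated-measures Cohen's d. The paper does not state its exact computation, so the pooled-SD convention and all scores below are assumptions for illustration only:

```python
from statistics import mean, stdev

def cohens_d_paired(pre, post):
    """Cohen's d for two paired assessments: mean change divided by the
    pooled SD of the two sessions (one of several common conventions;
    other variants divide by the baseline SD or the SD of change scores)."""
    diffs = [b - a for a, b in zip(pre, post)]
    pooled_sd = ((stdev(pre) ** 2 + stdev(post) ** 2) / 2) ** 0.5
    return mean(diffs) / pooled_sd

# Hypothetical baseline and week 2-3 scores, for illustration only
baseline = [10, 12, 9, 11, 13, 10, 12, 11]
week2 = [12, 13, 11, 12, 15, 11, 14, 12]
d = cohens_d_paired(baseline, week2)   # large early practice effect
```

Which SD enters the denominator matters when comparing d values across studies, since change-score SDs shrink with high test-retest correlations.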
To address the important question of ceiling effects, the proportion of subjects reaching a defined performance maximum was determined for each test. In almost all cognitive domains, changes in total test scores over time exhibited a similar practice pattern: significant improvement during the high-frequency testing phase and stabilization of performance during the low-frequency testing phase. The most pronounced score increases were seen in executive functions as well as in learning and memory, whereas changes in visuospatial performance failed to reach significance (Figure 2; significance refers to a main effect of time determined with ANOVA for repeated measures, including all testing time-points from baseline to month 3, or from month 3 to month 12, respectively; mean ± SEM given; ***p < 0.001; *p < 0.05; n.s., not significant).

Taken together, the evaluation of potential modulators of cognitive performance and practice effects (age, IQ, personality factors, QoL, degree of depression and psychopathology) revealed only isolated findings for single cognitive tests at baseline (20 of 275 correlations significant) or for the course of cognitive performance (only 3 significant time × covariate interactions out of 200 ANCOVAs). Using a conservative approach of alpha adjustment for multiple testing, even these isolated findings disappear. Thus, none of the analyses (Pearson's correlations, repeated-measures ANCOVAs) suggests that the cognitive performance pattern was due to pre-existing intellectual, personality or sociodemographic differences, or to current psychopathological differences, that systematically affected the slope of practice effects. All aforementioned data on cognition are therefore presented without any of the explored covariates.
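The conservative alpha adjustment mentioned above can be sketched as a simple Bonferroni correction. The p-values below are hypothetical, chosen only to mirror the reported counts (275 correlations, 20 nominally significant at p < 0.05):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return indices of tests that survive a Bonferroni correction,
    i.e. p < alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]

# Illustrative only: 20 nominally significant p-values among 275 tests;
# none are extreme enough to survive the corrected threshold (~0.00018)
p_values = [0.02] * 20 + [0.5] * 255
surviving = bonferroni_significant(p_values)
```

With 275 comparisons the per-test threshold drops to roughly 0.00018, which is how a handful of nominal findings can "disappear" after correction.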

Discussion
In the present study, we provide for the first time comprehensive data on clinically relevant practice effects in healthy well-performing subjects over a 1-year period of frequent repetitive testing across 6 distinct cognitive domains. During the initial phase of high-frequency testing for 3 months, strong practice effects occur early on, most prominently in executive functions and learning/memory. After 3 months and upon reduced testing frequency, a stabilization/plateau of the acquired cognitive level until study end is observed. Age, intellectual capacity, personality features and psychopathological scores have no consistent influence on the course of cognitive performance.
Generally, comparisons between the present and previous studies are confounded by differences in design, including the use of diverse cognitive tests, fewer repetitions, and/or varying intertest intervals. The finding that the strongest changes in performance occur from baseline to the second testing, however, complies well with a number of similar results on the distribution of practice effects [10,12,15,17,18,48]. The extent of practice effects observed here even exceeds the effect sizes described by Hausknecht et al. [10] (d = 0.26) or Bird et al. [20], who used comparable intervals.
In contrast to previous studies showing a similar magnitude of practice effects short-term [9,12,48], our longitudinal design addresses the particular needs of neuroprotective/neuroregenerative treatment trials by including both a practice and a retention phase. Only McCaffrey et al. [25] had a somewhat related long-term design, but with just 4 sessions in total (baseline, week 2, months 3 and 6), the last testing at month 6, and a much shorter test battery. The essential findings of that study are in agreement with the respective parts of the present work. Another study worth mentioning here provided useful information about practice-dependent test selection to build on, but used only a high-frequency testing schedule (20 sessions in 4 weeks) without long-term follow-up and without change in testing frequency [49].
Regarding the different cognitive domains, executive functions showed the highest score increases over time, followed by learning and memory. For executive functions, the results of other studies are contradictory [e.g. [17,20,50]], ranging from absent through small to strong practice effects. The strong practice effects in almost all executive functions found here are most likely the result of a higher repetition rate (as compared to [20,50]) or the use of fewer alternate forms (as compared to [17]). In line with our findings, there is wide agreement that memory functions benefit most from practice [7,25,48,51], with gains evident even when alternate test forms are applied [10,14,15,17,52]. Since parallel forms were also administered in the present study, and the respective tests were reduced to 4 repetitions, test sophistication [8] as well as improvement of the underlying functions, rather than simple recall effects, may have contributed to the improved performance.
On the basis of single test characteristics alone, no prediction can be made regarding the impact of repetitive testing; practice effects seem to be unrelated to task complexity or modality. On the other hand, the present work provides more than test-specific information: cognitive domains, assessed with an extensive test battery covering each domain with several tests, revealed very homogeneous effect sizes within a given domain, i.e. similar practice effects irrespective of the test used, pointing to genuine change in the underlying target domain (transfer effects). Only within the attention domain did highly varying effect sizes of individual tests indicate test specificity [53]: in our study, TAP Visual Scanning displayed the largest practice effects, whereas RBANS Digit Span revealed none. In the overall picture of transfer effects, the few tests with ceiling effects did not play a role.
Logically, our findings on practice effects raise the question whether, after 3 months of regular practice, the maximum possible improvement is already achieved, or whether continued practice would lead to an even further enhanced performance. Even though this was not the objective of the present study, it would be interesting to investigate how many additional sessions within the high-frequency period are required until the individual upper performance limit is reached.
Although the majority of tests showed considerable practice effects, at least one test in most of the cognitive domains proved resistant to practice. Again, task complexity does not seem to be the factor underlying this resistance. For more 'deficit-oriented' subtests like RBANS List Recognition, Lines and Figure Copy, ceiling effects (expected especially in high-IQ subjects) did not allow further improvement of test scores. For most other tests this was not the case, since the majority of subjects, despite high IQ, did not score above average. Nevertheless, the high IQ level of our sample may have contributed to the observed strong practice effects, in line with studies showing that high-IQ subjects benefit more from prior test exposure ('the rich get richer' [16,18]). This greater benefit of high IQ, however, is still equivocal, as is a potential influence of age [20,50]. In fact, neither age nor IQ, applied as covariates, revealed a clear effect in the present work. Other covariates, i.e. personality and psychopathology ratings, also failed to show any appreciable impact on learning curves. The most plausible explanation is that healthy volunteers scored within a very restricted 'normal' range in these categories; such a restricted range holds similarly true for IQ.
The aim of the present study, apart from the long-term analysis of practice effects, was to provide recommendations for an 'ideal' neuropsychological test battery suitable for serial testing in research and routine. As is evident from our results, two major points have to be considered in this recommendation: test selection and timing. Tests of first choice are those that are essentially resistant to practice: TAP Alertness or RBANS Digit Span for attention; TAP Working Memory for executive functions; MacQuarrie Dotting for motor functions; RBANS Semantic Fluency for language.
For learning and memory, no practice-resistant valid test could be identified. Therefore, for the evaluation of this particular domain, a 'dual baseline' approach [5,6] is suggested to partly cut off early practice effects: since the most prominent improvement occurs from the first to the second assessment, the second may serve as baseline for subsequent assessments. For the domain learning and memory, this applies to RBANS Figure Recall, List Recall and List Learning.
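The dual baseline idea can be sketched in a few lines: change scores are referenced to the second session rather than the first, discarding the steep initial practice gain. The session scores below are hypothetical, for illustration only:

```python
def dual_baseline_change(scores):
    """Change scores under a 'dual baseline' approach: the second
    assessment, not the first, serves as the reference, so the large
    first-to-second practice gain is excluded from the change metric."""
    baseline = scores[1]          # session 2 is the working baseline
    return [s - baseline for s in scores[2:]]

# Hypothetical list-learning scores over 5 sessions: a big jump from
# session 1 to 2 (practice), then near-stable performance
sessions = [20, 26, 27, 28, 27]
changes = dual_baseline_change(sessions)   # [1, 2, 1]
```

Referenced this way, the residual changes are small, which is the intended behavior when most of the improvement is practice rather than true change.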
As possible alternatives to the above-listed domain-specific, practice-resistant tests, the dual baseline approach may be used for TMT A and RBANS Coding (attention), WCST-64, WMS-III Letter Number Sequencing and RWT phonemic verbal fluency (executive functions), and MacQuarrie Tapping (motor functions). Of all the explored cognitive domains, only for visuospatial functions can no valid test recommendation be made at this point.
The selection of tests for a neuropsychological battery is often a matter of compromises and limitations. Due to time restrictions and fatigue effects, it is impossible to completely cover all relevant cognitive domains with all their facets in one session. For instance, in this comprehensive test battery, assessment of inhibitory control or interference resolution, important aspects of executive function, had to be omitted due to these restrictions. On the other hand, some deficit-oriented tests, essential for clinical studies, were selected that ultimately displayed ceiling effects in this healthy sample. Especially for the domains of visuospatial functions and language, not only more tests but also more suitable tests have to be identified and investigated longitudinally.
In addition to our recommendations for an optimal, practice-resistant test battery, our data on the tests with the strongest practice effects are also useful for future applications. Based on reliable change index calculations, hierarchical linear modelling or regression models, it will now be possible to discriminate whether the performance change of an individual or a group is clinically meaningful or whether it simply reflects change due to the practice effects described here.
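As an illustration of such a reliable change index calculation, the sketch below uses a practice-adjusted variant (observed change minus the expected practice gain, scaled by the standard error of the difference). All numbers are hypothetical, not taken from this study:

```python
import math

def reliable_change_index(x1, x2, sd_baseline, r_xx, practice=0.0):
    """Practice-adjusted reliable change index: observed retest change,
    minus the mean practice gain expected from normative serial data,
    divided by the standard error of the difference. |RCI| > 1.96
    suggests change beyond measurement error at ~95% confidence."""
    se_measurement = sd_baseline * math.sqrt(1.0 - r_xx)
    se_difference = math.sqrt(2.0) * se_measurement
    return (x2 - x1 - practice) / se_difference

# Hypothetical values: baseline SD of 4, test-retest reliability .90,
# and a mean practice gain of 2 points taken from (assumed) norms
rci = reliable_change_index(x1=50, x2=58, sd_baseline=4.0,
                            r_xx=0.90, practice=2.0)
```

Here an 8-point gain still clears the 1.96 criterion after subtracting the assumed 2-point practice gain, i.e. the change would be judged reliable rather than mere practice; this is exactly the kind of judgment that normative serial-testing data would enable.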

Conclusions
Although the present study with its asymmetrical testing design particularly addresses the needs of neuroprotective trials, the principal findings on practice effects also apply to all kinds of clinical and non-clinical studies with repetitive short- and long-term neuropsychological testing. Based on the results reported here, an essentially complete cognitive test battery covering all major cognitive domains can be composed. This battery should be largely resistant to practice or at least allow a valid estimate of practice effects. Thus, true cognitive improvement will be better discernible in healthy individuals, and even more so in patient populations with expectedly reduced capability to learn [31,49,54,55]. Along these lines, the collection of normative data for serial test administration, as important information on individual longitudinal learning, can now easily be initiated.

Additional material
Additional file 1: Detailed descriptive information on the neuropsychological test battery. This file contains descriptive information on the neuropsychological tests of the presented study (underlying function, procedure) and an overview of alternate test versions.