
Practice effects in healthy adults: A longitudinal study on frequent repetitive cognitive testing

Abstract

Background

Cognitive deterioration is a core symptom of many neuropsychiatric disorders and a target of increasing significance for novel treatment strategies. Hence, its reliable capture in long-term follow-up studies is a prerequisite for recording the natural course of diseases and for estimating potential benefits of therapeutic interventions. Since repeated neuropsychological testing is required for such longitudinal study designs, the occurrence, time pattern and magnitude of practice effects on cognition first have to be understood under healthy, high-performance conditions to enable design optimization and result interpretation in disease trials.

Methods

Healthy adults (N = 36; 47.3 ± 12.0 years; mean IQ 127.0 ± 14.1; 58% males) completed 7 testing sessions, distributed asymmetrically from high to low frequency, over 1 year (baseline, weeks 2-3, 6, 9, months 3, 6, 12). The neuropsychological test battery covered 6 major cognitive domains, each with several well-established tests.

Results

Most tests exhibited a similar pattern upon repetition: (1) clinically relevant practice effects during high-frequency testing until month 3 (Cohen's d 0.36-1.19), most pronounced early on, and (2) a performance plateau thereafter under low-frequency testing. A few tests were either not susceptible to practice or limited by ceiling effects. The influence of confounding variables (age, IQ, personality) was minor.

Conclusions

Practice effects are prominent particularly in the early phase of high-frequency repetitive cognitive testing of healthy, well-performing subjects. An optimal combination and timing of tests, as can be extracted from this study, will help control their impact. Moreover, normative data for serial testing may now be collected to assess normal learning curves as an important comparative readout for pathological cognitive processes.

Background

Cognitive decline is a common feature of many neuropsychiatric diseases and among the strongest determinants of real-world functioning and quality of life in affected individuals [e.g. [1–4]]. Moreover, it imposes enormous and ever-increasing costs on progressively aging industrial societies.

Efficient treatment of cognitive impairment is urgently needed but not yet available. Therapeutically addressing cognitive outcome requires careful assessment based on comprehensive neuropsychological examination of the relevant cognitive domains. Cognitive tests can be applied cross-sectionally to obtain initial diagnostic information, but solid clinical judgments as well as research require longitudinal observation. In clinical neuropsychology, serial test administration is essential for (1) monitoring disease progression and/or potential recovery, or (2) evaluating the efficacy of a therapeutic agent or other intervention (e.g. rehabilitation programs) in both randomized clinical trials and clinical follow-up of single cases. Depending on the underlying questions, testing frequencies have to be adapted to enable measurement of short-term or long-term processes. With repeated testing, however, the phenomenon of 'practice effects', reflecting the capability of an individual to learn and adjust, represents not only an additional important cognitive readout but also an interfering variable that complicates result interpretation [5–7].

Practice effects are defined as an increase in a subject's test score from one administration to the next in the absence of any intervention. Various explanations for practice-induced score gains have been discussed, such as reduced anxiety in or growing familiarity with the testing environment, recall effects, improvement of underlying functions, procedural learning, test sophistication, or regression to the mean [5, 8–10]. Furthermore, practice effects appear to be influenced by test characteristics (complexity, modality, alternate forms introduced to make tests 'repeatable') [9–17], test-taker characteristics (IQ, age, personality, mood, motivation) [5, 6, 13, 16, 18–24], and the intertest interval [11, 19, 25]. Surprisingly, for most cognitive instruments, only test-retest reliability coefficients, usually based on no more than 2 sessions, are available. Even the useful concept of the 'reliable change index' corrected for practice, and its extensions [e.g. [26–29]], has not been systematically integrated into most test manuals, owing to the low repetition rate of tests.
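For orientation, one common formulation of such a practice-corrected reliable change index (a sketch following the approach of Chelune and colleagues, not a formula taken from the manuals cited here) subtracts the mean practice gain of a reference group from an individual's raw change:

$$\mathrm{RCI}_{\mathrm{practice}} = \frac{(X_{2}-X_{1})-(\bar{M}_{2}-\bar{M}_{1})}{S_{\mathrm{diff}}}$$

where X₁ and X₂ are the individual's scores, M̄₁ and M̄₂ the reference-group means at the two administrations, and S_diff the standard deviation of the reference group's difference scores; values beyond approximately ±1.645 indicate change exceeding practice at the 90% confidence level.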

If not properly integrated into the interpretation of cognitive results, practice effects can easily lead to false conclusions: (1) degenerative processes obscured by practice may be underestimated [30], or (2) treatment effects might be overestimated, particularly in the absence of adequate control groups [31]. Even though the integration of appropriate control groups into clinical treatment trials remains indispensable, a solid prediction of the practice effects expected under healthy, high-performance conditions is essential for accurate effect size estimation and selection of a suitable test set. In a single-case follow-up, this may actually be the only reasonable basis for judgment. It is therefore surprising that, despite a number of pivotal prior studies on practice effects, comprehensive characteristics of normal performance over time are not available. Specifically, the impact of practice effects on frequent repetitive neuropsychological testing over as long as 1 year, comprising all major cognitive domains, has not been systematically studied in healthy, well-performing individuals.

The first objective of the present study was to start filling this gap by exploring test-specific practice effects on performance in the 6 major cognitive domains upon repetitive testing of healthy subjects over a whole year. The intertest intervals were chosen to meet typical requirements of neuroprotective trials, with short-term high-frequency followed by long-term low-frequency testing. The second objective was to extract recommendations from the present findings for designing an optimal neuropsychological testing procedure for future longitudinal clinical research or routine. Finally, the third objective was to lay the groundwork for future collection of normative data on longitudinal learning curves as an important but thus far largely ignored diagnostic readout of cognitive abilities.

Methods

Participants

The present study was approved by the local ethics committee (Ethikkommission der Medizinischen Fakultät der Georg-August-Universität Göttingen). All study participants gave written informed consent after a complete description of the study. Healthy native German-speaking subjects were recruited via public advertising and financially compensated upon completion of all follow-up sessions. (In our experience, financial compensation increases the motivation of subjects to keep their appointments but is highly unlikely to influence cognitive performance itself.) A total of 36 healthy individuals (21 males and 15 females) with a mean age of 47.3 ± 12.0 years (range 24-69 years) at study entry participated. Prior to enrolment, a standardized, semi-structured interview and a physical screening examination confirmed that subjects were free of significant medical conditions or neuropsychiatric diseases (past or current). Psychopathological ratings (Hamilton Rating Scale for Depression, HAMD [32], mean 2.1 ± 2.9; Positive and Negative Syndrome Scale, PANSS [33], mean 32.3 ± 3.5) further ascertained a healthy sample with an overall high intellectual level of performance, as underlined by premorbid (mean 124.2 ± 12.8) and state (mean 127.0 ± 14.1) intelligence quotients. Additionally, a personality questionnaire (revised NEO Personality Inventory, NEO-PI-R [34]) yielded results in the middle normal range (mean T values for personality factors: Neuroticism 46.3 ± 9.3; Extraversion 51.7 ± 10.7; Openness 52.2 ± 9.3; Agreeableness 53.4 ± 8.3; Conscientiousness 50.6 ± 8.3). Exclusion criteria comprised use of illicit and prescribed psychoactive drugs as well as nicotine. Urine drug screening (testing for ethanol, cannabinoids, benzodiazepines and cocaine) at baseline and unheralded repeat screens at random intervals verified the drug-free state of all individuals.

Study design

All screened and included subjects underwent comprehensive neuropsychological and psychopathological testing of approximately 2 h duration under standardized conditions (fixed sequence, fixed time of day per subject) on 7 occasions in total. The entire study was performed by 2 examiners (trained psychologists). Tests were administered according to standard instructions and, where available, alternate forms were used (for an overview see Additional file 1). The longitudinal study design comprised a short-term high-frequency testing phase with a 3-week intertest interval (baseline, week 2-3, week 6, week 9 and month 3) and a long-term low-frequency testing phase (months 6 and 12), amounting to a total duration of 1 year per individual (Figure 1). The rationale for this testing schedule derives from neuroprotective treatment trials on cognitive outcomes [e.g. [35]]. All included subjects completed all 7 testing sessions as scheduled (no dropouts), resulting in a complete data set without any missing data.

Figure 1

Study design comprising a comprehensive cross-sectional baseline evaluation and 2 longitudinal phases with high-frequency followed by low-frequency testing. With a semi-structured interview, sociodemographic variables and medical history were collected. Drug screening: urine samples of all subjects were tested for ethanol, cannabinoids, benzodiazepines and cocaine at baseline, and tests were repeated randomly afterwards. MWT-B, Mehrfachwahl-Wortschatz-Intelligenz-Test (premorbid intelligence measure); HAWIE-R, revised German version of the Wechsler Adult Intelligence Scale (subtests Information, Similarities, Picture Completion, Block Design); NEO-PI-R, revised NEO Personality Inventory; QoL, quality of life visual-analogue scale; PANSS, Positive and Negative Syndrome Scale; HAMD, Hamilton Rating Scale for Depression; MacQuarrie, MacQuarrie Tapping and Dotting tests; Purdue Pegboard, Purdue Pegboard Test; TAP, Test for Attentional Performance (subtests Alertness, Visual Scanning, Working Memory, Flexibility); TMT, Trail Making Test (A and B); WMS-III, Wechsler Memory Scale - 3rd edition (subtest Letter Number Sequencing); RWT, Regensburger Wortflüssigkeits-Test (subtest phonemic verbal fluency); RBANS, Repeatable Battery for the Assessment of Neuropsychological Status (Full Scale, complete RBANS performed; Attention, RBANS Attention subtests only); WCST-64, Wisconsin Card Sorting Test - 64 Card Version.

Neuropsychological test battery

A total of 25 tests were selected to cover the major cognitive domains: (1) attention: Trail Making Test - part A (TMT A [36]), Repeatable Battery for the Assessment of Neuropsychological Status (RBANS [37]) subtests Coding and Digit Span, Test for Attentional Performance (TAP [38]) subtests Visual Scanning and Alertness; (2) learning and memory: RBANS subtests Figure Recall, List Learning, List Recall, List Recognition, Story Memory, Story Recall; (3) executive functions: Trail Making Test - part B (TMT B [36]), Wisconsin Card Sorting Test - 64 Card Version (WCST-64 [39]), TAP subtests Flexibility and Working Memory, Wechsler Memory Scale - 3rd edition (WMS-III [40]) subtest Letter Number Sequencing, Regensburger Wortflüssigkeits-Test (RWT [41]) subtest phonemic verbal fluency; (4) motor functions: 9-Hole Peg Test [42], Purdue Pegboard Test [43], MacQuarrie Test for Mechanical Ability [44] subtests Tapping and Dotting; (5) language: RBANS subtests Picture Naming and Semantic Fluency; and (6) visuospatial functions: RBANS subtests Lines and Figure Copy. All listed tests are well established and have been described in detail as referenced (for a short description see Additional file 1). For each test, only the most relevant parameter is presented, to avoid overrepresentation of any one test. To minimize expected strong recall effects, the RBANS short- and long-term memory tests, visuospatial and language functions, as well as the WCST-64 were performed less frequently (baseline, week 6, months 3, 6, 12). Intelligence measures [premorbid intelligence (Mehrfachwahl-Wortschatz-Intelligenz-Test, MWT-B [45]) and state intelligence (revised German version of the Wechsler Adult Intelligence Scale, short version, HAWIE-R [46])] as well as personality measures (NEO-PI-R) were administered only at baseline to explore their potential influence on the course of cognitive performance. Current psychopathological symptoms (HAMD, PANSS [32, 33]) and quality of life (visual analogue scale ranging from 0-10) were assessed at each testing time-point (Figure 1) to control for their potentially fluctuating nature.

Statistical analysis

All numerical results are presented as mean ± SD in text and tables and as mean ± SEM in figures. Statistical tests were two-tailed for all analyses, with a conservative significance level of p < 0.01 to account for multiple statistical testing (p < 0.05 was considered only marginally significant). Analysis of variance (ANOVA) for repeated measures, with time as independent variable, was applied to all individual cognitive tests in order to assess the significance of score changes over time (practice effects). For analysis of cognitive domains, the data of all single cognitive tests (always expressed as % of the individual baseline of the respective test) were combined to yield super-ordinate cognitive mean curves. Two sets of ANOVAs were calculated, examining performance changes during (1) the high-frequency testing phase, including all testing time-points from baseline to month 3; and (2) the low-frequency testing phase, including all testing time-points from month 3 to 12. Only if differences over time reached significance were effect sizes (Cohen's d [47]) calculated, as d = (μ_time-point n − μ_baseline)/σ_pooled, to determine the clinical relevance of practice effects. Thus, improvement in speed tests (i.e. reduction in reaction time) or reduction in error rate (WCST-64) resulted in negative effect sizes. For consistency, these negative effect sizes were inverted to express enhanced test performance as positive Cohen's d. Altogether, d = 0.20-0.49 was considered to represent small, d = 0.50-0.79 moderate, and d ≥ 0.80 large effects [47]. To detect potential confounders of cognitive performance, Pearson's correlations were calculated between age, IQ (MWT-B or HAWIE-R IQ), all 5 NEO-PI-R personality factors, HAMD and PANSS total scores, and the single cognitive measures at baseline. Their influence on the course of cognitive performance was further explored using repeated-measures analyses of covariance (ANCOVA) with time as independent variable (baseline to month 12) and the respective potential confounder as covariate. All analyses (Pearson's correlations, repeated-measures ANOVAs and ANCOVAs) were carried out using SPSS 17.0.
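For illustration, a minimal sketch of this effect-size computation (the exact pooling convention is not spelled out in the text; equal-sized samples at both time-points, as given in this repeated-measures design, are assumed here, and all numbers are hypothetical):

```python
import numpy as np

def cohens_d(scores_t, scores_b, higher_is_better=True):
    """Practice effect size between baseline and a later time-point.

    d = (mean_t - mean_b) / pooled SD, with the sign inverted for
    speed/error measures so that positive d always means improvement.
    Assumes equal-sized samples (the same subjects at both sessions).
    """
    scores_t, scores_b = np.asarray(scores_t), np.asarray(scores_b)
    mean_diff = scores_t.mean() - scores_b.mean()
    s_pooled = np.sqrt((scores_t.var(ddof=1) + scores_b.var(ddof=1)) / 2)
    d = mean_diff / s_pooled
    return d if higher_is_better else -d

# Hypothetical reaction times (ms) for N = 36 subjects: faster at month 3
rng = np.random.default_rng(0)
baseline = rng.normal(650, 80, 36)
month_3 = baseline - rng.normal(60, 40, 36)
print(round(cohens_d(month_3, baseline, higher_is_better=False), 2))
```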

Results

Significance and clinical relevance of practice effects by test and cognitive domain: main effect of time in repeated-measures ANOVA and Cohen's d

Descriptive statistics for all cognitive tests at all 7 time-points are presented in Table 1. As expected, repeated-measures ANOVA across the first 5 sessions of the high-frequency testing phase revealed highly significant score increases over time in the vast majority of tests (practice effects in 17 of 25 tests; all p ≤ 0.006); the exceptions were TAP Alertness and Working Memory, the RBANS subtests Digit Span, List Recognition, Semantic Fluency, Lines and Figure Copy, and MacQuarrie Dotting. Even after a most conservative Bonferroni correction for 25 comparisons, resulting in an adjusted alpha of 0.002, the significance of the results is preserved, with the exception of TMT A and WCST-64 (both p = 0.006).

Table 1 Neuropsychological follow-up data (N = 36)

To estimate the magnitude of the observed significant practice effects, most prominent from baseline to month 3, the Cohen's d statistic for each test and cognitive domain was calculated for the baseline - month 3 comparison (Table 1; hierarchical listing of tests per domain according to effect sizes). Effect sizes for the high-frequency testing phase (baseline - month 3) lie mainly within the range of moderate effects (d = 0.51-0.75) and are congruent with the repeated-measures ANOVA results. A domain-wise comparison of effect sizes demonstrates a homogeneous pattern in all cognitive domains (mainly moderate effects) except attentional functions (high variance, with small to large effect sizes).

With the long intertest intervals after month 3 until study end (month 12), no significant performance changes were found in 23 of 25 tests, i.e. the performance levels acquired through high-frequency testing remained stable and did not return to baseline values. Only one test, RWT phonemic verbal fluency, showed further enhanced test scores (p < 0.001). A ceiling effect in RBANS Picture Naming during the high- and low-frequency testing phases artificially produced significant results (Table 1).

The longitudinal course of performance in all 6 cognitive domains (data of single tests were combined to yield the respective super-ordinate cognitive categories) is illustrated in Figure 2. ANOVAs conducted on these data confirmed the time pattern seen in the single-test comparisons: strong practice effects upon high-frequency testing and a plateau maintained with decreasing frequency. Regarding cognitive domains, the most pronounced changes until month 3 occur in executive functions (14.0 ± 10.7%), followed by learning/memory (13.3 ± 12.3%) and attention (11.9 ± 10.6%) (Figure 2).

Figure 2

Pattern of practice effects in cognitive domains over time. Data of all single tests (always expressed as % of the individual baseline of the respective test) representing one particular cognitive domain were combined to yield super-ordinate cognitive mean curves. In almost all cognitive domains, changes in total test scores over time exhibit a similar practice pattern: significant improvement during the high-frequency testing phase and stabilization of performance during the low-frequency testing phase. The most pronounced score increases are seen in executive functions as well as in learning and memory, whereas changes in visuospatial performance fail to reach significance. Significance refers to a main effect of time determined with ANOVA for repeated measures, including all testing time-points from baseline to month 3, or from month 3 to month 12, respectively. Mean ± SEM given. ***p < 0.001; *p < 0.05; n.s., not significant.

Improvement from baseline to the second testing accounts for the largest proportion of change in all cognitive domains (Figure 3). Accordingly, for most tests, Cohen's d was highest for the baseline - week 2-3 interval (d = 0.30-0.55 for domains, d = 0.22-0.71 for single tests). In contrast, Cohen's d calculated for the late between-assessment intervals, i.e. from month 3 to 6 or 12, would show mainly 'no effect' (exceptions: d = 0.28 for RWT phonemic verbal fluency and d = 0.50 for RBANS Picture Naming).

Figure 3

Distribution of practice effects: Changes from one testing to the next. All cognitive domains (respective single tests grouped as described in Figure 2) show most pronounced improvement in performance from baseline to the 2nd testing time-point. At late testing time-points with long intertest intervals (5th to 6th, and 6th to 7th testing), test scores show a slight tendency to decrease. Mean of % change given. Lines indicate logarithmic trends.

To address the important question of ceiling effects, the proportion of subjects reaching a defined performance level at baseline, month 3 and month 12 was calculated (Figure 4; clinical classifications based on test-specific normative data). The changes in clinical classification over time exclude ceiling effects for all cognitive domains except visuospatial functions.

Figure 4

Magnitude of practice effects: development of clinical classification over 1 year. Clinical classification of baseline performance shows that cognitive performance is distributed across all categories (below- to above-average percentile ranks, PR) despite a high-IQ sample of healthy individuals. In all depicted cognitive domains, score gains lead to better clinical classification over time (selected time-points, months 3 and 12, presented) without reaching upper limits for most subjects. Only in visuospatial functions did the majority of subjects achieve the highest scores already at baseline, with only modest subsequent changes, altogether pointing to a ceiling effect. Clinical classifications of individual performance are based on test-specific normative data and averaged by cognitive domain. For executive functions, normative data for RWT phonemic verbal fluency are unavailable. Data on motor tests are not presented due to insufficient normative data.

Control of potentially confounding factors: Pearson's correlations at baseline and ANCOVAs on the course of cognitive performance up to month 12

Age

Since consistently significant correlations between age and cognitive baseline values were found for most tests with a speed component, age was included as a covariate in repeated-measures ANCOVAs for all cognitive tests. Only for TAP Flexibility could a significant time x age interaction effect be shown (F = 3.43; p = 0.01); all other learning curves were not influenced by age.

Intelligence

Neither MWT-B premorbid IQ nor HAWIE-R full-scale IQ correlated systematically with cognitive test scores at baseline (only MWT-B IQ - RWT phonemic verbal fluency: r = -0.47, p = 0.004; HAWIE-R IQ - TMT A: r = -0.42, p = 0.01; HAWIE-R IQ - RBANS Figure Copy: r = 0.57, p < 0.001). Accordingly, IQ did not mediate the course of cognitive performance except for RBANS Figure Copy (time x HAWIE-R IQ interaction F = 5.87; p = 0.001). These findings are most likely due to the high and homogeneous IQ of our sample (mean MWT-B IQ: 124.2 ± 12.8; mean HAWIE-R IQ: 127.0 ± 14.1).

Personality

Explorative correlational analyses of each NEO-PI-R personality factor with each cognitive test at baseline revealed only isolated significant results for Agreeableness - MacQuarrie Tapping and Dotting (r = -0.41, p = 0.01 for both) and Conscientiousness - RBANS Story Memory (r = -0.41, p = 0.01). Likewise, none of the 5 personality factors consistently influenced practice effects (only the interaction time x Agreeableness, F = 5.17, p = 0.001, for RBANS Coding).

Psychopathological symptoms and quality of life

As expected in a sample of healthy volunteers, HAMD and PANSS scores were low and showed small variance ('floor effect'; see description of subjects). Only 3 significant correlations with cognition could be determined at baseline: HAMD - TAP Flexibility (r = 0.47, p = 0.004), HAMD - TMT A (r = 0.53, p = 0.001), and PANSS - TMT A (r = 0.42, p = 0.01). Over the study period, HAMD remained unchanged (2.1 ± 2.9 to 2.6 ± 2.8; F = 0.545, p = 0.77), and PANSS scores increased marginally from baseline to month 12 (32.3 ± 3.5 to 33.0 ± 3.5; F = 4.97, p = 0.001) but never reached a pathological level, i.e. the increase is clinically irrelevant. A similar pattern was obtained for QoL and cognition: QoL stayed stable over time and yielded only 2 significant and counter-intuitive correlations with cognitive tests at baseline (QoL - RBANS Digit Span: r = -0.42, p = 0.01; QoL - WMS-III Letter Number Sequencing: r = 0.43, p = 0.01).

Taken together, the evaluation of potential modulators of cognitive performance and practice effects (age, IQ, personality factors, QoL, degree of depression and psychopathology) revealed only isolated findings for single cognitive tests at baseline (20 significant of 275 correlations) or for the course of cognitive performance (only 3 significant time x covariate interactions out of 200 ANCOVAs). With a conservative alpha adjustment for multiple testing, even these isolated findings disappear. Thus, none of the analyses (Pearson's correlations, repeated-measures ANCOVAs) suggest that the cognitive performance pattern was due to pre-existing intellectual, personality or sociodemographic differences, or to current psychopathological differences, that systematically affected the slope of practice effects. All of the aforementioned cognition data are therefore presented without any of the explored covariates.

Discussion

In the present study, we provide for the first time comprehensive data on clinically relevant practice effects in healthy, well-performing subjects over a 1-year period of frequent repetitive testing across 6 distinct cognitive domains. During the initial phase of high-frequency testing over 3 months, strong practice effects occur early on, most prominently in executive functions and learning/memory. After 3 months and upon reduced testing frequency, a stabilization/plateau of the acquired cognitive level is observed until study end. Age, intellectual capacity, personality features and psychopathological scores had no consistent influence on the course of cognitive performance.

Generally, comparisons between the present and previous studies are confounded by different designs, including the use of diverse cognitive tests, fewer repetitions, and/or varying intertest intervals. The finding that the strongest changes in performance occur from baseline to the second testing, however, agrees well with a number of similar results on the distribution of practice effects [10, 12, 15, 17, 18, 48]. The extent of practice effects observed here even exceeds the effect sizes described by Hausknecht et al [10] (d = 0.26) or Bird et al [20], who used comparable intervals.

In contrast to previous studies showing a similar magnitude of short-term practice effects [9, 12, 48], our longitudinal design addresses the particular needs of neuroprotective/neuroregenerative treatment trials, including both a practice and a retention phase. Only McCaffrey et al [25] had a somewhat related long-term design, but with just 4 sessions in total (baseline, week 2, months 3 and 6), the last testing at month 6, and a much shorter test battery. The essential findings of that study are in agreement with the respective parts of the present work. Another study worth mentioning here provided useful information about practice-dependent test selection to build on, but used only a high-frequency testing schedule (20 sessions in 4 weeks) without long-term follow-up and without a change in testing frequency [49].

Regarding the different cognitive domains, executive functions showed the highest score increases over time, followed by learning and memory. For executive functions, the results of other studies are contradictory [e.g. [17, 20, 50]], ranging from no through small to strong practice effects. The strong practice effects in almost all executive functions found here are most likely the result of a higher repetition rate (as compared to [20, 50]) or the use of fewer alternate forms (as compared to [17]). In line with our findings, there is wide agreement that memory functions benefit most from practice [7, 25, 48, 51], with gains evident even when alternate test forms are applied [10, 14, 15, 17, 52]. Since parallel forms were also administered in the present study, and the respective tests were reduced to 4 repetitions, test sophistication [8] as well as improvement of the underlying functions, rather than simple recall effects, may have contributed to the improved performance.

On the basis of single-test characteristics and results over time, no prediction can be made regarding the impact of repetitive testing: practice effects seem to be unrelated to task complexity or modality. On the other hand, the present work provides more than test-specific information: the cognitive domains, assessed with an extensive test battery covering each domain with several tests, revealed very homogeneous effect sizes within a given domain, i.e. similar practice effects irrespective of the test used, pointing to genuine change in the underlying target domain (transfer effects). Only within the attention domain did effect sizes of individual tests vary highly, which may indicate respective test specificity [53]; in our study, for example, TAP Visual Scanning displayed the largest practice effects, whereas RBANS Digit Span revealed none. In the overall picture of transfer effects, the few tests with ceiling effects did not play a role.

Logically, our findings on practice effects raise the question whether, after 3 months of regular practice, the maximal possible improvement has already been achieved or whether continued practice would lead to even further enhanced performance. Even though this was not the objective of the present study, it would be interesting to investigate how many additional sessions within the high-frequency period are required until the individual upper performance limit is reached.

Although the majority of tests showed considerable practice effects, at least one test in most of the cognitive domains was found to be resistant to practice. Again, task complexity does not seem to be the underlying factor explaining resistance. For more 'deficit-oriented' subtests like RBANS List Recognition, Lines and Figure Copy, ceiling effects (expected especially in high-IQ subjects) did not allow further improvement of test scores. For most other tests this was not the case, since the majority of subjects, despite high IQ, did not score in the above-average range. Nevertheless, the high IQ level of our sample may have contributed to the observed strong practice effects, as reported in studies showing that high-IQ subjects benefit more from prior test exposure ('the rich get richer' [16, 18]). This greater benefit of high IQ, however, is still equivocal, as is a potential influence of age [20, 50]. In fact, neither age nor IQ, applied as covariates, revealed a clear effect in the present work. Other covariates, i.e. personality and psychopathology ratings, also failed to show any appreciable impact on learning curves. The most plausible explanation is that the healthy volunteers scored in a very restricted 'normal' range in these categories; a similarly restricted range holds true for IQ.

The aim of the present study, apart from the long-term analysis of practice effects, was to provide recommendations for an 'ideal' neuropsychological test battery suitable for serial testing in research and routine. As our results show, two major points have to be considered in this recommendation: test selection and timing. Tests of first choice are those that are essentially resistant to practice: TAP Alertness or RBANS Digit Span for attention; TAP Working Memory for executive functions; MacQuarrie Dotting for motor functions; RBANS Semantic Fluency for language.

For learning and memory, no practice-resistant valid test could be identified. Therefore, for evaluation of this particular domain, a 'dual baseline' approach [5, 6] is suggested to partly cut off early practice effects: if the most prominent improvement occurs from the first to the second assessment, the second may serve as baseline for subsequent assessments (see the sketch after the next paragraph). For the domain learning and memory, this applies to RBANS Figure Recall, List Recall and List Learning.

As possible alternatives to the above-listed domain-specific, practice-resistant tests, the dual-baseline approach may also be used for TMT A and RBANS Coding (attention), WCST-64, WMS-III Letter Number Sequencing and RWT phonemic verbal fluency (executive functions), and MacQuarrie Tapping (motor functions). Of all the explored cognitive domains, only for visuospatial functions can no valid test recommendation be made at this point.
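To make the dual-baseline idea concrete, here is a minimal sketch with hypothetical session means (not values from Table 1), showing how rebaselining to the second session removes the steep initial gain:

```python
import numpy as np

# Hypothetical group means for one memory test across the 7 sessions
# (baseline, week 2-3, week 6, week 9, month 3, month 6, month 12):
# a steep gain from session 1 to 2, then a plateau, as reported above.
scores = np.array([20.0, 24.0, 25.0, 25.5, 26.0, 26.0, 25.8])

# Conventional change, referenced to the 1st session (inflated by practice)
change_vs_first = 100 * (scores - scores[0]) / scores[0]

# Dual-baseline change, referenced to the 2nd session: the steep initial
# practice gain is cut off and only subsequent change is evaluated
change_vs_second = 100 * (scores[1:] - scores[1]) / scores[1]

print(change_vs_first.round(1))   # [ 0.  20.  25.  27.5 30.  30.  29. ]
print(change_vs_second.round(1))  # [0.  4.2 6.2 8.3 8.3 7.5]
```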

The selection of tests for a neuropsychological battery is often a matter of compromises and limitations. Due to time restrictions and fatigue effects, it is impossible to completely cover all relevant cognitive domains with all their facets in one session. For instance, in this comprehensive test battery, data on inhibitory control or interference resolution, important aspects of executive function, had to be omitted for these reasons. On the other hand, some deficit-oriented tests, essential for clinical studies, were selected that ultimately displayed ceiling effects in the healthy sample. Especially for the domains of visuospatial functions and language, not only more tests but also more suitable tests have to be identified and investigated longitudinally.

In addition to our recommendations for an optimal, practice-resistant test battery, our data on the tests with the strongest practice effects are also useful for future applications. Based on reliable change index calculations, hierarchical linear modelling or regression models, it will now be possible to discriminate whether a performance change of an individual or a group is clinically meaningful or whether it simply reflects change due to the practice effects described here.
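As a sketch of how such a discrimination could look in code, the following implements the practice-corrected reliable change index shown in the Background; all numbers are hypothetical placeholders, not values from Table 1:

```python
import numpy as np

def practice_corrected_rci(x1, x2, ctrl_t1, ctrl_t2):
    """Reliable change index corrected for practice.

    Subtracts the control group's mean practice gain from an individual's
    raw change and scales by the SD of the control difference scores;
    |RCI| > 1.645 suggests reliable change at the 90% confidence level.
    """
    diff = np.asarray(ctrl_t2) - np.asarray(ctrl_t1)
    return ((x2 - x1) - diff.mean()) / diff.std(ddof=1)

# Hypothetical illustration: a healthy reference sample (N = 36) gains
# ~5 points from baseline to month 3 through practice alone
rng = np.random.default_rng(1)
ctrl_t1 = rng.normal(50, 10, 36)
ctrl_t2 = ctrl_t1 + rng.normal(5, 4, 36)

rci = practice_corrected_rci(48, 60, ctrl_t1, ctrl_t2)  # one individual's scores
print(f"RCI = {rci:.2f}; reliable change: {abs(rci) > 1.645}")
```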

Conclusions

Although the present study, with its asymmetrical testing design, particularly addresses the needs of neuroprotective trials, the principal findings on practice effects also apply to all kinds of clinical and non-clinical studies with repetitive short- and long-term neuropsychological testing. Based on the results reported here, an essentially complete cognitive test battery covering all major cognitive domains can be composed. Such a battery should be largely resistant to practice or at least allow a valid estimate of practice effects. Thus, true cognitive improvement will be better discernible in healthy individuals, and even more so in patient populations with expectedly reduced capabilities to learn [31, 49, 54, 55]. Along these lines, the collection of normative data for serial test administration, as important information on individual longitudinal learning, can now easily be initiated.

References

  1. Green MF: What are the functional consequences of neurocognitive deficits in schizophrenia?. Am J Psychiatry. 1996, 153: 321-330.

  2. Ritsner MS: Predicting quality of life impairment in chronic schizophrenia from cognitive variables. Qual Life Res. 2007, 16: 929-937. 10.1007/s11136-007-9195-3.

  3. Farias ST, Harrell E, Neumann C, Houtz A: The relationship between neuropsychological performance and daily functioning in individuals with Alzheimer's disease: ecological validity of neuropsychological tests. Arch Clin Neuropsychol. 2003, 18: 655-672.

  4. Nys GM, van Zandvoort MJ, van der Worp HB, de Haan EH, de Kort PL, Jansen BP, Kappelle LJ: Early cognitive impairment predicts long-term depressive symptoms and quality of life after stroke. J Neurol Sci. 2006, 247: 149-156. 10.1016/j.jns.2006.04.005.

  5. McCaffrey RJ, Duff K, Westervelt HJ: Practitioner's guide to evaluating change with neuropsychological assessment instruments. 2000, New York: Kluwer Academic/Plenum Publishers

  6. McCaffrey RJ, Westervelt HJ: Issues associated with repeated neuropsychological assessments. Neuropsychol Rev. 1995, 5: 203-221. 10.1007/BF02214762.

  7. Lezak MD, Howieson DB, Loring DW, Hannay HJ, Fischer JS: Neuropsychological Assessment. 2004, New York: Oxford University Press, 4

  8. Anastasi A: Psychological Testing. 1988, New York: Macmillan Publishing Company

  9. Benedict RHB, Zgaljardic DJ: Practice effects during repeated administrations of memory tests with and without alternate forms. J Clin Exp Neuropsychol. 1998, 20: 339-352. 10.1076/jcen.20.3.339.822.

  10. Hausknecht JP, Halpert JA, Di Paolo NT, Gerrard MOM: Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. J Appl Psychol. 2007, 92: 373-385. 10.1037/0021-9010.92.2.373.

  11. Donovan JJ, Radosevich DJ: A Meta-Analytic Review of the Distribution of Practice Effect: Now You See It, Now You Don't. J Appl Psychol. 1999, 84: 795-805. 10.1037/0021-9010.84.5.795.

  12. Collie A, Maruff P, Darby DG, McStephen M: The effects of practice on the cognitive test performance of neurologically normal individuals assessed at brief test-retest intervals. J Int Neuropsychol Soc. 2003, 9: 419-428. 10.1017/S1355617703930074.

  13. Reeve CL, Lam H: The Relation Between Practice Effects, Test-Taker Characteristics and Degree of g-Saturation. International Journal of Testing. 2007, 7: 225-242. 10.1080/15305050701193595.

  14. Watson FL, Pasteur M-AL, Healy DT, Hughes EA: Nine Parallel Versions of Four Memory Tests: An Assessment of Form Equivalence and the Effects of Practice on Performance. Hum Psychopharmacol. 1994, 9: 51-61. 10.1002/hup.470090107.

  15. Salthouse TA, Tucker-Drob EM: Implications of short-term retest effects for the interpretation of longitudinal change. Neuropsychology. 2008, 22: 800-811. 10.1037/a0013091.

  16. Kulik JA, Kulik CLC, Bangert RL: Effects of Practice on Aptitude and Achievement-Test Scores. Am Educ Res J. 1984, 21: 435-447.

  17. Beglinger LJ, Gaydos B, Tangphao-Daniels O, Duff K, Kareken DA, Crawford J, Fastenau PS, Siemers ER: Practice effects and the use of alternate forms in serial neuropsychological testing. Arch Clin Neuropsychol. 2005, 20: 517-529. 10.1016/j.acn.2004.12.003.

  18. Rapport LJ, Brines DB, Axelrod BN, Theisen ME: Full scale IQ as mediator of practice effects: The rich get richer. Clin Neuropsychol. 1997, 11: 375-380. 10.1080/13854049708400466.

  19. Dikmen SS, Heaton RK, Grant I, Temkin NR: Test-retest reliability and practice effects of Expanded Halstead-Reitan neuropsychological test battery. J Int Neuropsychol Soc. 1999, 5: 346-356. 10.1017/S1355617799544056.

  20. Bird CM, Papadopoulou K, Ricciardelli P, Rossor MN, Cipolotti L: Monitoring cognitive changes: Psychometric properties of six cognitive tests. Br J Clin Psychol. 2004, 43: 197-210. 10.1348/014466504323088051.

  21. Mitrushina M, Satz P: Effect of Repeated Administration of a Neuropsychological Battery in the Elderly. J Clin Psychol. 1991, 47: 790-801. 10.1002/1097-4679(199111)47:6<790::AID-JCLP2270470610>3.0.CO;2-C.

  22. Horton AM: Neuropsychological practice effects x age: a brief note. Percept Mot Skills. 1992, 75: 257-258. 10.2466/PMS.75.4.257-258.

  23. Schaie KW, Willis SL, Caskie GI: The Seattle longitudinal study: relationship between personality and cognition. Neuropsychol Dev Cogn B Aging Neuropsychol Cogn. 2004, 11: 304-324.

  24. Yaffe K, Blackwell T, Gore R, Sands L, Reus V, Browner WS: Depressive symptoms and cognitive decline in nondemented elderly women: a prospective study. Arch Gen Psychiatry. 1999, 56: 425-430. 10.1001/archpsyc.56.5.425.

  25. McCaffrey RJ, Ortega A, Haase RF: Effects of repeated neuropsychological assessments. Arch Clin Neuropsychol. 1993, 8: 519-524.

  26. Heaton RK, Temkin N, Dikmen S, Avitable N, Taylor MJ, Marcotte TD, Grant I: Detecting change: A comparison of three neuropsychological methods, using normal and clinical samples. Arch Clin Neuropsychol. 2001, 16: 75-91.

  27. Jacobson NS, Truax P: Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol. 1991, 59: 12-19. 10.1037/0022-006X.59.1.12.

  28. Temkin NR, Heaton RK, Grant I, Dikmen SS: Detecting significant change in neuropsychological test performance: a comparison of four models. J Int Neuropsychol Soc. 1999, 5: 357-369. 10.1017/S1355617799544068.

  29. Parsons TD, Notebaert AJ, Shields EW, Guskiewicz KM: Application of reliable change indices to computerized neuropsychological measures of concussion. Int J Neurosci. 2009, 119: 492-507. 10.1080/00207450802330876.

  30. Zehnder AE, Bläsi S, Berres M, Spiegel R, Monsch AU: Lack of practice effects on neuropsychological tests as early cognitive markers of Alzheimer disease?. Am J Alzheimers Dis Other Demen. 2007, 22: 416-426. 10.1177/1533317507302448.

  31. Goldberg TE, Goldman RS, Burdick KE, Malhotra AK, Lencz T, Patel RC, Woerner MG, Schooler NR, Kane JM, Robinson DG: Cognitive improvement after treatment with second-generation antipsychotic medications in first-episode schizophrenia - Is it a practice effect?. Arch Gen Psychiatry. 2007, 64: 1115-1122. 10.1001/archpsyc.64.10.1115.

  32. Hamilton M: A rating scale for depression. J Neurol Neurosurg Psychiatry. 1960, 23: 56-62.

  33. Kay SR, Fiszbein A, Opler LA: The positive and negative syndrome scale (PANSS) for schizophrenia. Schizophr Bull. 1987, 13: 261-276.

  34. Costa PT, McCrae RR: Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI), Professional Manual. 1992, Odessa, FL: Psychological Assessment Resources, Inc

  35. Ehrenreich H, Fischer B, Norra C, Schellenberger F, Stender N, Stiefel M, Siren AL, Paulus W, Nave KA, Gold R, Bartels C: Exploring recombinant human erythropoietin in chronic progressive multiple sclerosis. Brain. 2007, 130: 2577-2588. 10.1093/brain/awm203.

  36. Reitan RM: Validity of the Trail Making Test as an indicator of organic brain damage. Percept Mot Skills. 1958, 8: 271-276. 10.2466/PMS.8.7.271-276.

  37. Randolph C: RBANS Manual - Repeatable Battery for the Assessment of Neuropsychological Status. 1998, San Antonio, TX: The Psychological Corporation (Harcourt)

  38. Zimmermann P, Fimm B: Test for Attentional Performance (TAP; German:Testbatterie zur Aufmerksamkeitsprüfung). 1995, Herzogenrath: Psytest

  39. Kongs SK, Thompson LL, Iverson GL, Heaton RK: WCST-64: Wisconsin Card Sorting Test-64 Card Version; Professional Manual. 2000, Odessa, FL: Psychological Assessment Resources, Inc

  40. Wechsler D: Wechsler Memory Scale - 3rd edition (WMS-III). 1998, Harcourt, TX: The Psychological Corporation

  41. Aschenbrenner S, Tucha O, Lange KW: Regensburger Verbal Fluency Test (RWT; German: Regensburger Wortflüssigkeits-Test). 2000, Göttingen: Hogrefe

  42. Cutter GR, Baier ML, Rudick RA, Cookfair DL, Fischer JS, Petkau J, Syndulko K, Weinshenker BG, Antel JP, Confavreux C: Development of a multiple sclerosis functional composite as a clinical trial outcome measure. Brain. 1999, 122: 871-882. 10.1093/brain/122.5.871.

  43. Tiffin J: Purdue Pegboard examiner manual. 1968, Chicago, IL: Science Research Associates

  44. MacQuarrie TW: MacQuarrie Test for Mechanical Ability. 1925, Monterey, CA: California Test Bureau/McGraw-Hill, 1953

  45. Lehrl S: Multiple Choice Vocabulary Intelligence Test (MWT-B; German: Mehrfachwahl-Wortschatz-Intelligenztest). 1995, Erlangen: Straube

  46. Tewes U: German Adaptation of the Revised Wechsler Adult Intelligence Scale (HAWIE-R; German: Hamburg-Wechsler Intelligenztest für Erwachsene - Revision). 1991, Bern: Hans Huber

  47. Cohen J: Statistical Power Analysis for the Behavioral Sciences. 1988, Hillsdale, NJ: Lawrence Erlbaum, 2

  48. Theisen ME, Rapport LJ, Axelrod BN, Brines DB: Effects of practice in repeated administrations of the Wechsler memory scale-revised in normal adults. Assessment. 1998, 5: 85-92. 10.1177/107319119800500110.

  49. Wilson BA, Watson PC, Baddeley AD, Emslie H, Evans JJ: Improvement or simply practice? The effects of twenty repeated assessments on people with and without brain injury. J Int Neuropsychol Soc. 2000, 6: 469-479. 10.1017/S1355617700644053.

  50. Basso MR, Bornstein RA, Lang JM: Practice effects on commonly used measures of executive function across twelve months. Clin Neuropsychol. 1999, 13: 283-292.

  51. McCaffrey RJ, Ortega A, Orsillo SM, Nelles WB, Haase RF: Practice effects in repeated neuropsychological assessments. Clin Neuropsychol. 1992, 6: 32-42. 10.1080/13854049208404115.

  52. Hinton-Bayre A, Geffen G: Comparability, reliability, and practice effects on alternate forms of the Digit Symbol Substitution and Symbol Digit Modalities tests. Psychol Assess. 2005, 17: 237-241. 10.1037/1040-3590.17.2.237.

  53. Feinstein A, Brown R, Ron M: Effects of Practice of Serial Tests of Attention in Healthy-Subjects. J Clin Exp Neuropsychol. 1994, 16: 436-447. 10.1080/01688639408402654.

  54. Dikmen S, Machamer J, Temkin N, McLean A: Neuropsychological recovery in patients with moderate to severe head injury: 2 year follow-up. J Clin Exp Neuropsychol. 1990, 12: 507-519. 10.1080/01688639008400997.

  55. Weingartner H, Eckardt M, Grafman J, Molchan S, Putnam K, Rawlings R, Sunderland T: The effects of repetition on memory performance in cognitively impaired patients. Neuropsychology. 1993, 7: 385-385. 10.1037/0894-4105.7.3.385.

Acknowledgements

This study was supported by the Max Planck Society and the DFG Research Center for Molecular Physiology of the Brain (CMPB). Neither had any role in study design; in the collection, analysis and interpretation of data; in the writing of the manuscript; or in the decision to submit the manuscript for publication. We would like to express our gratitude to all participants.

Author information

Correspondence to Claudia Bartels or Hannelore Ehrenreich.

Additional information

Authors' contributions

All authors qualify for authorship based on their substantive intellectual contributions to the study. HE and CB designed the present study and wrote the manuscript. CB, MW, AW and VA participated in the acquisition and analysis of the presented data. MW, AW and VA also contributed to the interpretation of the data and helped to revise the manuscript critically for important intellectual content. All authors gave final approval of the version to be published.

Electronic supplementary material

Additional file 1: Detailed descriptive information on the neuropsychological test battery. This file contains descriptive information on the neuropsychological tests of the presented study (underlying function, procedure) and an overview of alternate test versions. (DOC 100 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Bartels, C., Wegrzyn, M., Wiedl, A. et al. Practice effects in healthy adults: A longitudinal study on frequent repetitive cognitive testing. BMC Neurosci 11, 118 (2010). https://doi.org/10.1186/1471-2202-11-118
