Capturing multiple interaction effects in L1 and L2 object-naming reaction times in healthy bilinguals: a mixed-effects multiple regression analysis

Background It is difficult to set up a balanced higher-order full-factorial experiment that can capture multiple intricate interactions between cognitive and psycholinguistic factors underlying bilingual speech production. To capture interactions more fully in one study, we analyzed object-naming reaction times (RTs) by using mixed-effects multiple regression. Methods Ten healthy bilinguals (median age: 23 years, seven females) were asked to name 131 colored pictures of common objects in each of their languages. RTs were analyzed based on language status, proficiency, word choice, word frequency, word duration, initial phoneme, time series, and participant’s gender. Results Among five significant interactions, new findings include a facilitating effect of a cross-language shared initial phoneme (mean RT for shared phoneme: 974 ms vs. mean RT for different phoneme: 1020 ms), which profited males less (mean profit: 10 ms) than females (mean profit: 47 ms). Conclusions Our data support language-independent phonological activation and a gender difference in inhibitory cognitive language control. Single word production process in healthy adult bilinguals is affected by interactions among cognitive, phonological, and semantic factors.


Background
Bilinguals and the language phenomena specific to them have long puzzled researchers, due to their deviation from monolinguals in multiple regards. For the process of word production in monolinguals, there is a general understanding of a sequential process that a person performs when naming an object. After first defining the concept to be expressed, a lemma is selected, a phonological code is retrieved, syllabified, and phonetically encoded before articulation ensues. This model was derived from a body of research that has identified specific time windows for each single step in word production [1,2]. However, there are competing hypotheses to the proposed serial models. Indeed, some studies argued for cascade models in which a set of semantic candidates unselected could enter into the phonological stage and the corresponding multiple phonological codes are activated [3,4]. Schramm et al. BMC Neurosci (2020) 21:3 Bilinguals show behavior yet to be fully explained by the current models. Compared to monolinguals, they possess a slower reaction time (RT) when confronted with an object-naming task, both in their first language (L1) and their second language (L2). Also, responses given in the L1 generally happen faster than in the L2 when L1 is of currently dominant use, but the reverse pattern has also been observed [5][6][7]. By now, a multitude of competing explanation attempts focusing on different specific steps of the word production process exist [5,8].
Regarding the selection of the task-relevant language, phonological activations were shown to occur both in L1 and L2, suggesting that task-relevant language selection does not occur in the semantic/lemma selection stages [9,10]. The inhibitory control (IC) model introduced the selection of task-relevant language earlier at the stage of semantic/lemma selection [11]. Herein, lexical representations are equipped with a mark indicating the corresponding language. A higher-level control system would then, depending on the task, inhibit all representations with the L1 or L2 mark respectively (resulting in effectively a L1-vs. a L2-mode), allowing for the correct lexical route to be taken [11]. Because these language modes would hardly ever be used equally, different levels of basal activation would result and make activation of one of the two languages more time-consuming.
Concerning variables related to the semantic/lemma selection stage, both language proficiency and age of acquisition of L2 have been shown to impact the RT. At present, it has been hypothesized that both earlier acquisition and higher proficiency can lead to stronger activation levels of lemmas and thereby faster RT, and that this effect may arise out of a modulation of cortical activity patterns, making the L2 activity progressively more (or less) similar to the L1 activity [5,12,13]. Because these findings stress the influence of lemma activation level on RT, we formed the hypothesis that obvious responses to a given stimulus (the modal response) should be given faster than less obvious responses (the non-modal response), due to the latter case reflecting a conflict in lemma selection, which would add time to the RT.
At the stage of phonological code retrieval, the word frequency (WF) effect must be mentioned. This phenomenon describes the tendency that the RT length diminishes when the target word is a commonly used one [14]. Currently, research indicates that the WF effect occurs relatively late in the word production process and at least partly reflects the process of phonological code retrieval [15]. A hypothesis on bilingual word-production delay focused on the WF effect is represented by the weakerlinks hypothesis, which will be examined more closely later.
For an effect again more exclusive to bilinguals, we have to consider the language-independent, simultaneous activation of L1-L2 phonological representations that multiple studies point to [9,10]. Such an activation implies a conflict having to be resolved between the L1 and L2 in choosing the phonological code in the task language. This would then impact the RT. In contrast, a shared initial phoneme increases the activation of the target phonological code, yielding a facilitating effect on RT. As a result, another hypothesis was formed for our present paper: analogous to one of the experiments performed by Colomé and Miozzo, we would expect a facilitating effect on RT when comparing target words that share the same initial phoneme between both their L1-L2 translations with target words that do not, arising out of an additive activation of the shared phoneme [10].
After the phonological code is retrieved and syllabification and phonetic encoding are applied, articulation ensues. Bilinguals herein commonly deal with differences in articulation efforts, because many times the different translations of a given target word contain varying numbers of syllables or even just differences in pronunciation, which impact the plan for forming the corresponding sound sequences. For example, German target words include more complex affricate "pf " or "ts" sounds that take longer than a simple obstruent "p", which is rather prevalent in other languages. These and similar duration differences among phonemes in speech-motor planning influence the effort of articulation and cannot be captured by the number of phonemes, but could be assessed by measuring the actual word duration (WD).
Furthermore, we may not forget the importance of higher-level executive functions. New meta analyses seem to indicate that the widely presumed bilingual advantage over monolinguals for executive functions may in fact be less powerful than previously assumed, or even an artifact due to publication bias [16]. Considering this, findings suggesting such a possible positive effect of bilingualism on executive function should be reconsidered [17]. This justifies investigating the possibility of a bilingual disadvantage in some regards. As has been previously put forth, bilingualism may be connected to the expenditure of additional cognitive resources due to a higher need for self-monitoring during speech compared to monolinguals [18]. This might enable a fatigue effect, slowing down RT over time. Our study's specific setup allowed for not only an analysis of such a fatigue effect, but also a learning effect and a possible interaction of both. Moreover, we formulated the hypothesis that due to motivational factors, an inverse relationship between accuracy in an object-naming task and RT is possible. This is based on the scenario of a speed-accuracy tradeoff that participants face when making the decision of either putting sustained effort into finding the correct response or, instead, focusing on minimizing the respective RT.
Another way in which the effects of bilingualism on higher cognitive functions have been evaluated is with the Simon task. Recently, a gender-dependent executive effect has been pointed out, with females being more easily distracted by the unnecessary spatial information presented during the task [19]. Translating this finding to a within-bilingual framework, it remains to be seen whether a similar gender effect can be shown in the context of bilingual language control regarding the suppression of task-irrelevant language. The respective hypothesis we defined in the current study was based on the previously mentioned effect of a cross-linguistically shared initial phoneme. If a gender-dependent cognitivecontrol advantage of suppressing the phonological information in the task-irrelevant language exists for males, they should profit less from the facilitating effect of a cross-linguistically shared initial phoneme compared to females.
Other important variables modulating cognitive control in different tasks are represented by age and age of L2 acquisition [5,20]. Bilingualism has been linked to improved retention of cognitive skills in later life periods compared to monolingualism [21]. Furthermore, inhibitory cognitive control decreases with age as shown by the existing literature [21,22]. This decrease would take effect in tasks relying on inhibitory control, such as finding appropriate non-modal responses when no modal response is present. Thus, one would therefore expect an increase in the difference in RT between modal-and non-modal responses with age. Regarding age of L2 acquisition, studies have struggled to clearly identify both the locus of influence as well as the mechanism of mediation [5]. Its connection to cognitive control mechanisms has hardly been explored, even though it is argued that there is a fundamental difference in network organization based on this variable [23]. Furthermore, age of L2 acquisition has been theorized to determine the size of phonological representations, with earlier learned words saved as blocks and expressions acquired later being deconstructed into phonological elements [24]. Thus, we hypothesize that the influence of the initial phoneme status (shared or different between L1 and L2) on RT would be stronger the later the L2 was acquired.
Importantly, there is not only the possibility of these factors acting isolated, but rather in combination with one another. Here one has to point toward the weakerlinks hypothesis, according to which bilinguals possess a weaker connection between their semantic and phonological representations when compared to monolinguals. This is believed to occur due to the former having to split their phonological activations between two different sets of representations due to language specificity of phonological codes, while the latter are able to focus the entire activation on one single set [8,25]. The hypothesis in this regard bears similarities to the theoretical underpinnings of the WF effect, but with the degree of the WF effect varying with language use. WF initially benefits high-frequency words, but eventually the low-frequency words catch up. Thus, it is hypothesized that RT difference between the high-frequency words and the low-frequency words would be smaller for the language of longer use than for the language of shorter use, and RTs would be shorter for the language of the currently dominant use than for the language of the currently non-dominant use. With regards to this theoretical construct, we set out to investigate the interaction between language dominance and WF on one hand and participant age and WF on the other hand.
In addition to the predictions we derived so far from the serial model extended to bilinguals, testing an interaction effect between word choice (WC; semantic/lemma selection) and phonological encoding in a single language (German) could distinguish the purely serial model, in which phonological encoding occurs only after semantic/lemma selection, from cascade models, in which a set of unselected semantic/lemmata activates phonological codes of these candidate lemmata [4]. Cascade models typically predict that WF effects increase when multiple lemma candidates remain unselected, which may be the case when non-modal word response is made, compared to when a modal word response is made [3]. To investigate the influence of the mentioned variables, the present study uses mixed-effects multiple regression and intends to test the effects of various established psycholinguistic and cognitive factors and new two-way interactions between these established factors in one statistical approach [26].

Participants and study design
The entire data used was collected from twenty healthy volunteers (median age: 24 years, 10 females; Table 1), 10 of which were collected in the context of a study on cortical language representations investigated by navigated transcranial magnetic stimulation (nTMS) [27]. The additional 10 volunteers were collected for analyses 3 and 4 (see below). The participants confirmed to the Kohnert definition of bilingualism, as each of them reported regular exposure to both L1 and L2 before the age of 10 years [28]. The following inclusion criteria were considered: age of at least 18 years, right-handedness according to the Edinburgh Handedness Inventory, and acquisition of two languages before the age of 10 years [27]. The data collection took place on two appointments separated by at least 14 days to exclude nTMS aftereffects [27]. In the present study, we utilized the data taken as "baseline", meaning that object-naming performance prior to nTMS application was analyzed.

Object-naming task
The same object-naming task was carried out on both appointments (one per language, two consecutive runs per appointment) using a NexSpeech module (version 4.3; Nexstim Plc., Helsinki, Finland) [27]. It consisted of 131 colored pictures of different concrete animate and inanimate objects (such as "baby", "rake", or "orange") in a sequence randomized for each run [27,[29][30][31].
During each appointment, the language used in the task was invariant. The sequence of languages was counterbalanced. Each participant was instructed to name the pictured object as simply, quickly, and plausibly as possible [27,[29][30][31]. One initial object-naming run was followed by another containing only the objects that the investigator deemed correctly named in the first run. The objects were displayed for 700 ms each, with an interval of 2500 ms between the display of two consecutive objects [27,[29][30][31].

Audio extraction and measurement of reaction times
We used the built-in report mechanism of the Nex-Speech module to get information on when each single trial began (trial start time). The recorded video files of .asf data type were copied to an external computer, where an in-house Matlab script was used that first separated the audio track from the video and then saved each audio track in the form of a .wav file. Subsequently, we performed RT measurements on the audio tracks using Praat (version 6.0.28; http://www.fon.hum.uva.nl/praat /; Fig. 1).
The respective response to each object was documented for each trial, and both voice onset (time at which the response to a given trial began) and voice offset (time at which the response was finished) were measured and noted. This allowed for immediate calculation of both WD and RT by subtraction of voice onset from voice offset (for WD) and subtraction of trial start time from voice onset (for RT), respectively. Different categories of possible errors were defined to characterize incorrect responses. There was a no response (NR) when the participant did not give any response or audibly indicated not knowing a proper response (e.g., "ehh…"), a performance error (PE) if the word was articulated in a flawed way such as by distorted pronunciation or through the interjection of inappropriate utterances (e.g., "tea-ehhh-pot"), and a semantic error (SE) if the given response was correctly articulated, but from a semantic point of view not adequate to describe the pictured object (e.g., "tomato" as a response to a picture of an orange) [27,[29][30][31].

Data analyses Statistical method
We performed confirmatory forward mixed-effects multiple regression analysis on the RTs of the L1 and L2. We used a mixed-effects multiple regression analysis, a method introduced into RT analyses of psycholinguistic studies to overcome problems regarding factorial study designs [26]. Mixed-effects multiple regression allows (1) to analyze all observations without averaging, (2) to test multiple, possibly interacting nominal and continuous factors, and (3) to estimate the genuine effect of each factor by partialling out the information common between fixed-effects factors and the random effects of participants and objects. In other words, mixed-effects regression allows to partial out the idiosyncrasies that participants and objects brought with them into the object-naming datasets in one model. Moreover, if the by-participant random intercept and the by-picture random intercept are significant, it means that the studied sample is diverse enough in terms of participants and objects. The significant diversity of the sample data, in turn, allows to generalize the results of the significant fixed-effects factors beyond the sample of participants and objects used in the present study. Our approach is confirmatory in the sense that we tested preselected factors known to influence the RT according to previous studies as aforementioned. However, past studies have not shown the individual degree to which each factor accounts for the variance in its corresponding level of word production. Therefore, we performed a forward step-wise model comparison instead of a hierarchical model comparison.

Factors
Regarding the random effects, we tested the by-participant random intercept and the by-picture random intercept. Regarding the fixed-effects factors, we tested five types of variables. These variables include factors related to cognitive states (practice effect and/or fatigue effect) that would change over time (run numbers 1 vs. 2; trial numbers in each run from 1 to maximum 131), a factor related to language status (L1 vs. L2), factors related to semantic/lemma selection (test language run 1 percent correct [L1: 0.73-0.93; L2: 0.65-0.94]; task-relevant German run 1 percent correct [0.65-0.95]; WC: modal word vs. others), factors related to phonological code retrieval (log10 WF; first phoneme difference: same vs. different), factors related to articulatory load such as WD for included objects (ranging from 159 ms for "bi" in L1 Chinese ["fountain pen"] to 2165 ms for "panchina" in L2 Italian ["bench"]) and WD difference (L2-L1: ranging from − 1520 ms for "Mais" in L2 German ["corn on the cob"] to 1924 ms for "Trommel", a non-modal name in L2 German for "Fass" ["barrel"]), as well as other demographic factors (age: 19 to 27 years; age of L2 acquisition: 0 to 10 years; gender: female vs. male). For the grouping factors, the slope was calculated as the change from the subgroup listed first to the subgroup listed second.
Regarding the fixed-effects interactions, we tested four interaction effects motivated by the current literature: language status × log10 WF interaction, age × log10 WF interaction, gender × first phoneme difference, and run number × trial number interaction. The factors of each interaction term are ordered so that the coefficient estimated for the interaction term is used to adjust the coefficient of the second factor for the first factor's second subgroup. The information about the log10 WF for the object target names for the picture set used in the present study was taken from the SUBTLEX-DE [32].
Because our ultimate goal was to identify the contexts in which longer RTs are likely to occur during the objectnaming task, instead of removing outlying longer RTs, the positively-skewed RT distribution was corrected by inverse-transforming the RT. Moreover, because WD and WD difference were also positively skewed, they were log10 transformed.

Local purposes
We planned two analyses for different foci. Analysis 1 was intended to compare the RTs of L1 and L2, with special attention to the first phoneme difference factor (same vs. different) and the WD difference factor. Analysis 2 was intended to compare only German RTs, with special attention to the WC factor (modal word vs. others) in addition to the first phoneme difference factor and the WD difference factor.

Data selection
We took three steps to select trials from the baseline datasets. First, we selected trials for which verbal responses were made in the 2500 ms time window and for which we were able to measure the RT properly. For Analysis 1, we paired up L1 and L2 trials for each object in each run of each participant, enabling us to calculate WD differences for each pair. We further grouped the paired words into one set in which the L1 and L2 translation did share the same first phoneme and one set in which they did not (3506 trials). Then, for Analysis 2, we selected German trials (1448 trials) from the trials selected for Analysis 1 and divided them according to whether or not the specific response was a modal or non-modal response.

Statistical procedures
Prior to the regression analysis, we corrected the positive skewness of the RT distribution by inverse transformation, in addition to log10 transformation of WD and WD difference mentioned earlier. We performed a forward model comparison, selecting at each step the factor that reduced the variance most among the factors that independently significantly reduced the variability in objectnaming RT, with the threshold set at 0.05 for alpha. For the forward model comparison, the empty model with only the fixed intercept was calculated first. Against this empty model, by-participant random intercept was tested. Next, the by-picture random intercept was tested. Then, the preselected fixed effects factors were tested one by one. Afterwards, the by-participant random slopes for fixed-effects factors and by-picture random slopes for fixed-effects factors were tested. Then, the two-way interaction effects between fixed factors were tested. The final model was rerun by using the restricted maximum likelihood method to obtain the unbiased variance components. In the final model, the order of factors in the regression equation was rearranged so that the analysis program forms the interaction terms consistent with the interaction hypotheses of the fixed effects. When a theoretically motivated 2-way interaction was significant, a-theoretic 3-way interactions were additionally tested to see if there was a significant 3-way interaction that would make the 2-way interaction non-significant and reduce the remaining variance significantly. It was also used to help localize the source of the effect of interest.
The assumptions for multiple regressions were examined for each final model, following Baayen [33]. To see if the residuals are normally distributed, standardized residuals were calculated and a density plot was generated for visual inspection. The skewness of the distribution was calculated to see if it would fall in the normal range between − 0.5 and + 0.5. To check the homoscedasticity assumption by visual inspection, fitted values are plotted along the horizontal axis and the corresponding standardized residuals were plotted along the vertical axis with the reference lines drawn at ± 2.5 for the standardized residuals. Trials with residuals that fell outside the ± 2.5 standard deviation (SD) were identified and tagged with actual reaction times and participants in order to find where in the range of reaction times the deviated residuals lay and see if they exclusively belonged to one or two participants.
For the final mixed-effects model, because there is no agreed-upon way of determining the degrees of freedom to translate the obtained t-values for the coefficient of each factor into p-values, p-values based on degrees of freedom returned by statistical programs may be misleading [34,35]. Therefore, to complement the information, we provide the bootstrap confidence intervals (CIs) of each factor's coefficient obtained by 10,000 times of bootstrapping in addition to providing the p-values determined by using the degrees of freedom calculated by Kenward and Roger's method [36][37][38]. Additionally, the proportion of variance accounted for was calculated for the final model, the fixed effects, and the random effects.
In the results section, we report means and CIs of the back-transformed fitted RTs indicated by the subscript btf. To perform this series of statistical analysis, we used R (version 3.1.1; The R Foundation for Statistical Computing, Vienna, Austria) in combination with the lme4 package, the nlme package, lmerTest, krbttest, the MuMIn package, and the effects package [34,36,37,[39][40][41][42].

Analyses extended with a larger more representative and gender-balanced sample
Irreproducibility of results is a recently surging concern in neurobiology of language. The male sample (n = 3, contributing 561 trials) may not be representative to claim the gender effect and/or the first phoneme difference × gender effect even if 10,000-times bootstrap replications confirmed them. To address this concern, additional data were collected to see if the results of the first sample could be replicated with a larger, more representative, and more gender-balanced sample (n female = 10, n male = 10, in 7145 trials in total). With the time constraints imposed on the study 2 completion, the data were collected with a simplified procedure, scheduling the L1 and L2 sessions on the same day without the nTMSrelated steps. In addition, the two samples differ in gender composite (7:3 vs. 3:7). Here, our report focuses on the replicability test of the effects detected in the sample that may be less representative and gender-imbalanced. At the end of the result section, a brief report was added to mention two of the interactions that were part of the decomposition of sample difference and relevant to the present-theory testing investigation.
The data from the previous analysis was combined with the new data set. Using this larger, more representative, and gender-balanced data set, the final models of analysis 1 and analysis 2 were tested. Where applicable, the hypothesized effects that were not significant in sample 1 were added to the final model to see if they would become significant with a larger, more representative, gender-balanced sample. These hypotheses included word frequency × age (or age of L2 acquisition) for the Weaker Links hypothesis from Analysis 1 and word frequency × word choice interaction for the Cascade model from analysis 2. Because the first replicability test asks if there are non-contributing terms in the proposed final model, backward model comparisons for elimination was performed instead of forward model comparison. To be consistent, subsequent testing of the previously nonsignificant terms was also performed by backward model comparison. The threshold for elimination was set at α = 0.05. As the model increases its complexity with the doubled sample size, calculating the Kenward and Roger degrees of freedom became impractically time-consuming. The default method of calculating the degrees of freedom (Satterthwaite method) was used. The bootstrap test was performed with 10,000 replications as was done in the previous analyses. When the effects package did not generate the plot to show the specific aspect of the interaction between a continuous variable and a categorical variable or between continuous variables, the fitted means and confidence intervals were calculated in the effect package and the result was reorganized and plotted by our custom scripts.

Analysis 1: Analysis including L1 vs. L2 comparisons Analysis 1: Overview
3506 trials from 10 participants in responses to 131 objects were analyzed. As shown by the model comparison (Table 2), the forward-model comparisons arrived at the final model that consisted of the by-participant random intercept, the by-picture random intercept, five fixed-effects factors (run number, trial number, first phoneme difference, language status, and log10 WF), and three two-way interactions (run number × log10 WF, language status × log10 WF, and run number × trial number). The final model accounted for 34.91% of the variance. The by-participant random intercept and the by-picture random intercept jointly accounted for 22.38% of the variance. The five fixed-effects terms and the three interaction terms jointly accounted for 12.53% of the variance. The variables related to the articulatory effort were not contributing factors.
For the verbal summary about the continuous variable factors (log10 WF and trial numbers), means and CIs of the RTs are represented at log10 WF = 1 (10 occurrences per million) as low frequency, log10 WF = 4 (10,000 occurrences per million) as high frequency, trial number 20th as earlier trials and trial number 120th as later trials. For an overview, see Tables 2, 3   The skewness of the distribution of the residuals fell in the range of normal distribution (skewness − 0.49). Homoscedasticity assumption was not violated by visual inspection. Residuals outside 2.5 SD occupied 1.96% of the trials (69 out of 3506) and all participants in the analysis contributed 3-14 trials (median = 5.5). The 10,000 times bootstrap test showed that all the significant factors and interactions were stable.

Analysis 1: Random effects
Regarding the random effects, adding the by-participant random intercept first (χ 2 (1) = 439.51, p < 0.0001) and adding the by-picture random intercept second (χ 2 (1) = 325.41, p < 2.2e−16) both significantly reduced the variance (Table 2). These results suggest that for the final model reported, by partialling out the idiosyncrasies of the participants and the objects in the sample, significant effects of the fixed-effects factors and their interactions are generalizable beyond the participants and the objects employed in the present study. Regarding the byparticipant random intercept (SD = 6.603e−05, 95% CI 3.590e−05, 9.617e−05), the back-transformed adjusted random intercepts ranged from 936 to 1162 ms. Regarding the by-picture random intercept (SD = 5.566e−05, 95% CI 4.678e−05, 6.414e−05), the back-transformed adjusted random intercepts ranged from 938 ms for "Schlange" ("snake") to 1211 ms for "Kommode" ("dresser").

(L1 and L2 combined): final model fixed effects
This table provides an overview reflecting the final statistical model used for the comparison of the first language (L1) and second language (L2). In the table, the name of the subgroup in parentheses is the subgroup to which the regression slope is calculated as the change from the other subgroup. A pair of factors of an interaction term is ordered in a way that the coefficient estimated for the interaction term is used to adjust the coefficient of the second factor for the first factor's second subgroup   Table 2), the coefficient of the language status factor was non-significant in the final model (b = − 2.117e−05, t(3361)= − 1.578, p = 0.1147; Table 3). More specifically, RTs were, on average, shorter for the high-frequency words (log10 WF = 4) than for the low-frequency words (log10 WF = 1), but the difference between the high-frequency words and the low-frequency words was greater for L2 (L2 high: M btf = 916 ms, 95% CI btf 875 ms, 960 ms; L2 low: M btf = 1043 ms, 95% CI btf 996 ms, 1095 ms) than for L1 (L1 high: M btf = 973 ms, 95% CI btf 927 ms, 1024 ms; L1: low M btf = 1044 ms, 95% CI btf 996 ms, 1096 ms) and the L2 high-frequency words received the shortest RTs (Fig. 2b, Table 4). The RTs were, on average, shorter during L2 object naming than during the L1 object naming in the present sample. Now even though the L2 may thus be the language of currently dominant use, the hypothesis that the difference between the high-frequency words and the low-frequency words being greater in the L2 than in the L1 nevertheless correctly distinguished the L2 from the L1 in the present sample.
None of the additional a-theoretical 3-way interactions (language status × word frequency × run number, or × trial number, or × first phoneme difference) were significant, made the significant two-way interaction nonsignificant, or significantly reduced the variance at the same time.
Factors related to cognitive states Run number × log10 word frequency degree interaction There was a significant effect of the run-number factor (b = 1.467e−04, t(3405)= 8.991, p < 2e−16) in addition to the significant effect of the log10 WF factor reported earlier. More importantly, there was a significant degree interaction effect between the run number factor and the log10 WF factor (b interaction = − 2.257e−05, t(3389)= − 3.837, p = 1.27e−04; Table 3). More specifically, RT was, on average, shorter for the high-frequency words than for for the fixed-effects factors and the interaction terms visible in a-d with the right vertical axis annotated with back-transformed reaction times in ms. RT is shorter as it is higher up along the vertical axis the low-frequency words. Besides, RT was shorter in run 2 than in run 1, which suggests a practice effect. Furthermore, the RT difference between the high-frequency words and the low-frequency words was smaller in run 2 (Run 2 high: M btf = 929 ms, 95% CI btf 887 ms, 976 ms; Run 2 low: M btf = 989 ms, 95% CI btf 946 ms, 1036 ms) than in run 1 (Run 1 high: M btf = 956 ms, 95% CI btf 912 ms, 1004 ms; Run 1 low: M btf = 1094 ms, 95% CI btf 1042 ms, 1151 ms), possibly due to fatigue effects depriving the high-frequency words of their advantage (Fig. 2c, Table 4).
None of the additional a-theoretical 3-way interactions (run number × word frequency × trial number, or × first phoneme difference, or × language status) were significant, made the significant two-way interaction non-significant, or significantly reduced the variance at the same time.
Run number × trial number degree interaction There was a significant effect of the run-number factor and a significant effect of the trial-number factor (b = − 5.155e−07, t(3425) = − 5.827, p = 6.17e−09). More important, there was a significant degree interaction effect between the run-number factor and the trial-number factor (b interaction = − 4.414e−07, t(3418)= − 2.951, p = 0.0032, Table 3). More specifically, RT was, on average, shorter in run 2 than in run 1, suggesting a practice effect. Also, RT was shorter for the earlier trials than for the later trials, suggesting a fatigue effect developing over 131 trials. Furthermore, the RT difference between the earlier trials and the later trials was greater for run 2 (Run 2 20th trial: M btf = 930 ms, 95% CI btf 894 ms, 969 ms; Run 2 120th trial: M btf = 1021 ms, 95% CI btf 976 ms, 1071 ms) than for run 1 (Run 1 20th trial: M btf = 1014 ms, 95% CI btf 927 ms, 1061 ms; Run 1 120th trial: M btf = 1070 ms, 95% CI btf 1023 ms, 1123 ms) depriving the later trials in run 2 of the practice effect advantage (Fig. 2d, Table 4).
None of the additional a-theoretical 3-way interactions (run number × trial number × first phoneme difference, or × language status, or × word frequency) were significant, made the significant two-way interaction non-significant, or significantly reduced the variance at the same time.

Analysis 2: German object naming only Analysis 2: Overview
1448 trials from eight participants in responses to 131 objects were analyzed. The forward-model comparisons arrived at the final model that consisted of the byparticipant random intercept, the by-picture random intercept, seven fixed-effects factors (run number, trial number, log10 WF, first phoneme difference, WC, German run 1 percent correct, and participant's gender) and two two-way interactions (gender × first phoneme difference, German run 1 percent correct × log10 WF).
The skewness of the distribution of the residuals fell in the range of normal distribution (skewness − 0.49). Homoscedasticity assumption was not violated by visual inspection. Residuals outside 2.5 SD occupied 2.14% of the trials (31 out of 1448) and all participants in the analysis contributed 1-9 trials (median = 3.5). The 10,000 times bootstrap test showed that all the significant factors and interactions were stable.
The final model accounted for 48.41% of the variance. More specifically, the by-participant intercept and the by-picture intercept jointly accounted for 20.99% of the variance, while the seven fixed-effects terms and the two fixed-effects interaction terms jointly accounted  Fig. 3.

Analysis 2: Random effects
Regarding the random effects, adding the by-participant random intercept first (χ 2 (1) = 292.27, p < 0.0001) and adding the by-picture random intercept second (χ 2 (1) = 159.41, p < 2.2e−16) both significantly reduced the variance (Table 5). These results suggest that, for the final model reported below, by partialling out the idiosyncrasies of the participants and the objects in the sample, significant effects of the fixed-effects factors and their interactions are generalizable beyond the participants and the objects employed in the present study. Regarding the by-participant random intercept (SD = 5.878e−05, 95% CI 2.143e−05, 9.512e−05), the back-transformed adjusted intercepts ranged from 722 ms to 811 ms. Regarding the by-picture random intercept (SD = 6.634e−05, 95% CI 5.418e−05, 7.790e−05), the back-transformed adjusted intercepts ranged from 692 ms for "Schreibtischstuhl" ("desk chair") to 858 ms for "Kamera" ("camera").

Analysis 2: Fixed effects
Fixed-effects factors related to semantic or lemma selection Word choice The word-choice factor was significant (b = − 2.916E−05, t(1434) = − 2.688, p = 0.0073) and did not interact with other factors. More specifically, RT was, on average, longer for the naming responses of non-modal words (M btf = 1012 ms, 95% CI btf 966-1062 ms) than for the naming responses of modal words (M btf = 983 ms, 95% CI btf 942-1027 ms), regardless of other factors (Fig. 3c).
More specifically, RT was, on average, shorter for higher-frequency words than for lower-frequency words. Furthermore, the difference between the high-frequency words and the low-frequency words was greater for the participants with lower German run 1 percent correct (70% correct, high frequency: M btf = 803 ms, 95% CI btf 741 ms, 878 ms; 70% correct, low frequency: M btf = 950 ms, 95% CI btf 869 ms, 1049 ms) than for the participants with higher German run 1 percent correct (95% correct, high frequency: M btf = 976 ms, 95% CI btf 915 ms, 1047 ms; 95% correct, low frequency: M btf = 1076 ms, 95% CI btf 1008 ms, 1154 ms), with the advantage associated with higher frequency words attenuated for those high in German run 1 percent correct (Fig. 3e, Table 7).
None of the additional a-theoretical 3-way interactions (German run 1 percent correct × word frequency × run number, or × trial number, or × first phoneme difference, or × word choice, or × gender) were significant, made the Factors related to phonological code retrieval In addition to the log10 WF factor reported earlier, the first-phoneme difference factor was significant (b = − 5.199e−05, t(1410) = − 4.281, p = 1.98e−05). Moreover, there was a significant degree interaction between the gender factor and the first-phoneme difference factor (b interaction = 4.277E−05, t(1388) = 2.302, p = 0.0215, Table 6). The RT was, on average, shorter for the trials of L1-L2 target words sharing the same first phoneme than for the trials in which L1-L2 target words started with different phonemes. More importantly, the RT difference between the trials of the L1-L2 target words starting with different first phonemes and the trials of the L1-L2 target words sharing the same first phoneme was smaller for male participants (male, first phoneme diff: M btf = 1041 ms, 95% CI btf 971 ms, 1122 ms; male, first phoneme same: M btf = 1031 ms, 95% CI btf 959 ms, 1115 ms) than for female participants (female, first phoneme diff: M btf = 976 ms, 95% CI btf 926 ms, 1032 ms; female, first phoneme same: M btf = 929 ms, 95% CI btf 882 ms, 981 ms, Fig. 3d, Table 7).
Four of the additional a-theoretical 3-way interactions (first phoneme difference × gender × trial number, or × word frequency, or × word choice, or × German run 1 percent correct) were non-significant and did not significantly reduced the remaining variance. First phoneme difference × gender × run number was significant (p = 0.003) and significantly reduced the remaining variance jointly with the other two automatically added a-theoretical two-way interactions (p = 0.008). The theoretically motivated two-way interaction (first phoneme difference × gender) became non-significant, whereas one of the automatically added a-theoretic twoway interaction run number × gender was significant (p = 0.0006). The first phoneme factor remained significant with the benefit by the same first phonemes. These results together showed that the significant run number × gender interaction depended on the first phoneme difference factor (Fig. 4). Female participants benefitted from the second run regardless of the first phoneme difference factor. In contrast, male participants benefitted in the second run when the first phonemes were different, whereas they did not benefit from the second run when the first phonemes were the same. Therefore, the source of the lack of language-independent phonological activation in male participants was localized in this condition (Fig. 4, right bottom panel).

Analysis 3 (Analysis 1 extended with n = 20) Analysis 3 Overview
7145 trials from 20 participants in responses to 131 objects were analyzed. The data set consisted of 3471 trials from 10 female participants and 3674 trials from 10 male participants, and thus, it was gender-balanced. The final model consisted of the fixed intercept, the byparticipant random intercept, the by-picture random intercept, six fixed-effects factors and four 2-way interactions (Tables 8 and 9, Fig. 5). First phoneme difference, word frequency × language status, word frequency × run number, and trial number × run number were replicated. Word frequency × age became significant with this large sample. The 2-way interaction was predicted by the Weaker Links hypothesis. However, contrary to the prediction, the advantage of the higher frequency words over lower frequency words was greater for older participants than for the younger participants.
The extended model accounted for 40.79% of the variance. The by-participant random intercept and the bypicture random intercept jointly accounted for 29.42% of the variance. The six simple fixed-effects terms and the four interaction terms jointly accounted for 11.36% of the variance. The skewness of the distribution of the residuals fell in the range of normal distribution (skewness − 0.428). Homoscedasticity assumption was not violated by visual inspection. Residuals outside ± 2.5 SD occupied 1.89% of the trials (135 out of 7245) and 19 out of 20 participants in this larger data set contributed 1-23 trials (median = 3.5). When these 135 trials with outlying residuals were removed, all the significant terms remained significant and all the non-significant terms remained non-significant. Therefore, none of the results were driven by these trials. Moreover, the 10,000-times bootstrap test showed that all the significant factors and interactions were stable (Tables 8, 9 and 10; Fig. 5).

Fixed effects
First phoneme difference First phoneme difference was significant. It did not interact with gender or age. Reaction times were shorter for names with the same first phoneme (M btf = 935 ms, 95% CI btf 895 ms, 979 ms) than for those with the different first phonemes (M btf = 992 ms, 95% CI btf 947 ms, 1042 ms), (b = − 6.144e−04, t(7053) = − 11.291, p = 2.593e−29; Tables 8, 9 and 10, Fig. 5a). The direction of the difference was the same as observed in analysis 1. Thus, the effect of first phoneme difference was replicated.
Word frequency × language status The word frequency × language status interaction was significant. Reaction times were shorter for high frequency names than for low frequency names. However, the advantage of higher frequency names over lower frequency names was greater in L2 (typically currently dominant-use)  Fig. 5b). The pattern of the directions of the reaction time difference was the same as observed in analysis 1. Thus, the effect of the word frequency × language status interaction was replicated.
Word frequency × run number The word frequency × run number interaction was significant. Reaction times were shorter for higher frequency names than for lower frequency names. However, the advantage of higher frequency names over lower frequency names was greater in run  Fig. 5c). The pattern of the directions of the reaction time difference was the same as observed in analysis 1. Thus, the effect of the word frequency × run number interaction was replicated.
Trial number × run number The trial number × run number interaction was significant. Reaction times were longer for later trials than for earlier trials. However, the advantage of earlier trials over later trials was greater in run 2 (run 2, early trial: M btf = 881 ms, 95% CI btf 844 ms, 921 ms; run 2, later trial: M btf = 966 ms, 95% CI btf 921 ms, 1015 ms) than in run 1 (run 1, early  Fig. 5d). The pattern of the directions of the reaction time difference was the same as observed in analysis 1. Thus, the trial number × run number interaction was replicated.
Word frequency × age The trial number × age interaction was significant. It was a degree interaction. Reaction times were shorter for higher frequency names than for lower frequency names.  Fig. 5e). Thus, the word frequency effect was replicated. However, the pattern of the directions of the reaction time difference was not consistent with the prediction derived by the Weaker Links hypothesis. Thus, the Weaker Links hypothesis was not supported.

Analysis 4 (Analysis 2 extended with n = 18) Analysis 4 Overview
The data set of analysis 4 consisted of 3267 German trials from the data set of analysis 3. The data consisted of 1430 trials from eight female participants and 1837 trials from 10 male participants, and thus, it was gender-balanced. The final model consisted of the fixed intercept, the by-participant random intercept, the by-picture random intercept, eight fixed-effects factors, five 2-way interactions, and one 3-way interaction (Tables 11,  12 and 13, Figs. 6 and 7). Among the terms that were significant in sample 1, run number, trial number, and word frequency × German run 1 percent correct remained significant, without changing the direction of reaction time difference. Thus, each of their effects were replicated (Tables 11, 12 and 13; Fig. 6a-c). In contrast, word choice interacted with word frequency. The advantage of modal names over non-modal names was replicated. However, the reaction time difference was not greater for non-modal names than for modal names. Thus, the Cascade hypothesis was not supported (Tables 11, 12 and 13; Fig. 6d). First phoneme difference × gender interacted with age (Tables 11, 12 and 13; Fig. 7e1-e5). The superior inhibitory control of male participants decreased with the increase of age (Tables 12, 13, Fig. 7e1-e5). The pattern of first phoneme difference × gender in analysis 1 was replicated among younger participants (Tables 12, 13; Fig. 7e1, e2) but it was not replicated among the older participants (Tables 12, 13; Fig. 7e3-e5).
The extended model accounted for 47.90% of the variance. The by-participant random intercept and the bypicture random intercept jointly accounted for 33.76% of the variance. The fixed-effects terms jointly accounted for 14.14% of the variance. The skewness of the distribution of the residuals fell in the range of normal distribution (skewness − 0.335). Homoscedasticity assumption was not violated by visual inspection. Residuals outside ± 2.5 SD occupied 1.87% of the trials (61 out of 3267) and 17 out of 18 participants in the data set contributed 1-10 trials (median = 2.5). When these 61 trials with outlying residuals were removed, all the significant terms remained significant and all the non-significant terms remained non-significant. Therefore, none of the results were driven by these trials. Moreover, the 10,000-times bootstrap test showed that all the significant factors and interactions were stable (Table 12).

Random effects
The by-participant random intercept was significant (χ 2 (1) = 791.39, p = 4.023e−174, SD = 1.200e−04, 95% CI 7.395e−05, 1.662e−04). Likewise, the by-picture random intercept was significant (χ 2 (1) = 396.02, p = 4.04576e−88, SD = 7.021e−04, 95% CI 5.916e−05, Fig. 6 Interactions determined in analysis 4. This figure details findings made in analysis 4. This entails the replication of the effects of run number (a), trial number (b) and word frequency × German run 1 (c) on reaction time (RT). While the benefit of modal names over non-modal names was replicated (d), the difference in RT was not greater for non-modal names than for modal names 8.132e−05). Regarding the representativeness of each gender group, the by-participant random intercept of the female sample was significant (χ 2 (1) = 456.26, p = 3.128e−101). Likewise, the by-participant random intercept of the male sample was significant (χ 2 (1) = 291.9298, p = 1.888e−65). These results suggest that each gender group consisted of sufficiently diverse participants, and thus, for the final model reported below, significant effects of the gender factor and their interactions as well as other significant effects are generalizable beyond the participants in the present study.
Word frequency × German run 1 percent correct The effect of word frequency × German run 1 percent correct was significant. Reaction times were longer for lower frequency words than for higher frequency words. This difference was greater for participants with lower German run 1 percent correct (70% correct, high frequency: M btf = 830 ms, 95% CI btf 756 ms, 920 ms; 70% correct, low frequency: M btf = 1062 ms, 95% CI btf 942 ms, 1217 ms) than for those with higher German run 1 percent correct (90% correct, high frequency: M btf = 880 ms, 95% CI btf 827 ms, 941 ms; 90% correct, low frequency: M btf = 1050 ms, 95% CI btf 974 ms, 1139 ms), (b interaction = − 1.003e−04, t(3158) = − 2.470, p = 0.014; Tables 12, 13, Fig. 6c). The Fig. 7 Age based modulation of gender × first phoneme interaction. This figure visualizes the effect of gender × first_phoneme_difference on reaction time (RT) split by age groups. While the facilitatory effect of shared first phoneme was for younger age groups only present in females (e1, e2), the gender difference disappeared for older age groups (e3-e5) direction of the reaction time difference was the same as observed in analysis 2. Thus, the effect of word frequency × German run 1 percent correct was replicated.
Word choice and word choice × word frequency Word choice × word frequency was significant. Reaction times were shorter for modal names than for non-modal names. The advantage of higher frequency words over lower frequency words was greater for modal names (modal, high frequency: M btf = 851 ms, 95% CI btf 806 ms, 903 ms; 70% correct, modal, low frequency: M btf = 1053 ms, 95% CI btf 983 ms, 1135 ms) than for the non-modal names (nonmodal, high frequency: M btf = 951 ms, 95% CI btf 888 ms, 1023 ms; non-modal, low frequency: M btf = 1050 ms, 95% CI btf 975 ms, 1138 ms), (b interaction = − 3.185e−05, t(3239) = − 3.276, p = 0.001; Tables 12, 13, Fig. 6d). Thus, the effect of word choice was replicated. The word choice × word frequency interaction became significant in this larger sample. However, the pattern of the directions of the reaction time difference was not consistent with the prediction by the Cascade hypothesis. Thus, the Cascade hypothesis was not supported.
First phoneme difference × gender × age The effect of first phoneme difference × gender was qualified by age. Among younger participants (e.g., below 26 years old), the advantage of the same first phoneme over the different first phonemes was smaller for males ( Fig. 7e1, e2). However, among older participants (e.g., over 26 years old), the advantage of the same first phoneme over the different first phonemes increased in males (male: age 32 Fig. 7e3-e5). These results were consistent with the prediction by the decrease of the inhibitory cognitive control with the increase of age.

Sample difference and theoretically-relevant participant-related variables
Part of the sample difference was the increase of the age range. Here we briefly report two of the age-related results that were significant in a separate comprehensive study of sample difference decomposition.

First phoneme difference × age of L2 acquisition
In a complex model to systematically decompose the sample difference present in analysis 3, first phoneme difference × age of L2 acquisition was one of the significant interactions that involved participant-related variables. The advantage of the same initial phoneme across both languages was smaller as the age of L2 acquisition was earlier (Fig. 8a). This result was consistent with the prediction derived by the different phonological encoding hypothesis.

Word choice × age
In a complex model to systematically decompose the sample difference present in analysis 4, word choice × age was one of the significant interactions that involved participant-related variables. The advantage of the modal names over non-modal names was smaller as the participants were younger (Fig. 8b). This result was consistent with the prediction by the decline of cognitive control with the increase of age.

Discussion
The present study investigated in what context longer RTs for object naming are likely to occur along the various stages of single-word production in healthy proficient bilingual adults. We tested preselected factors well-established in bilingual cognition and general psycholinguistic word production theories. We also tested interactions between these factors. This could help to gain a better in toto understanding of the inter-language competition processes.
We have found that longer RTs of our proficient bilingual adults were associated with factors taken to reflect the difficulty in the semantic/lemma selection stage and the phonological code retrieval stage of single-word production interacting with cognitive states changing over trials and runs. These factors include (1) the fatigue effect building over the 131 trials for about 5 min 30 s and over 2 runs, (2) the difficulty in the semantic/lemma selection reflected in non-modal WC and the German run 1 naming accuracy, (3) the difficulty in phonological code retrieval associated with low-frequency words and words with the non-overlapping initial phoneme in the two languages, and (4) the reduced advantage of the run 2 practice effect due to the increasing fatigue effect in later trials and the minimal advantage of practice effect on high-frequency words in the second run. These findings would imply the same phenomenon to occur in settings not confined to the frame of study. Prolonged word production could, for example, play a role in the increased frequency of tip-of-tongue states for bilinguals, or possible involuntary switches between L1 and L2 partly due to exhausted executive functions [43]. This hypothesis should however be considered tentatively, because it is unclear whether exhaustion similar to the one in a test setting tends to occur outside of long and strenuous study tasks.
The most intriguing interaction was observed where phonological factors interacted with other aspects of cognitive control. The gender difference in the inhibitory control of task-irrelevant information interacted with the bilingual advantage of enhanced phonological activation from L1 to L2 shared initial phonemes, which adversely affected the male speakers. Here, their presumed superior inhibitory control suppressed the facilitative phonological activation associated with the task-irrelevant language. The female speakers on the other hand benefitted from the doubled phonological activation regarding their presumed inferior inhibitory control of the task-irrelevant information.
Another important interaction concerned the speedaccuracy tradeoff. Speakers with higher accuracy in German object naming were associated with longer RTs. Also, an interaction with WF was observed. The WF effect was smaller for slower but highly accurate participants than for quick but less accurate participants. The accuracy difference likely arises at the stage of phonological code retrieval.

Support of language-independent phonological activation
In the present study, a facilitatory effect on RTs was demonstrated when both the L1 and the L2 target word shared the same initial phoneme. The presence of this effect confirms our initial hypothesis. Herein, we suspected a possible increased activation of the initial part of the target word building up by both languages providing a converging access on the level of phonological representations. As a result, a faster phonological-code retrieval process occurs compared to cases not sharing the initial phoneme. In this line, our findings support the hypothesis established by Colomé and Miozzo, which argues that during bilingual speech production, phonological representations of a given concept are activated in both languages [9,10]. Additionally, an influence of task language status was not shown. Therefore, the lack of the language-status effect in this dataset cannot be taken as evidence for language-specific activation or the inhibitory control model [4,11]. We suspect the lack of the language-status effect to be due to the high proficiency that our participants possess.
Additionally, we observed a significant interaction of first phoneme status with age of L2 acquisition. This falls in line with the discussion on language-independent phonological activation above, but more importantly supports the notion that age of L2 acquisition plays a role in organizing phonological representations as postulated before [24]. We can, however, make no claims regarding whether there are additional loci influenced by age of acquisition.

Gender difference in inhibitory control
Our working hypothesis with regards to a gender difference in inhibitory control in bilingual object naming was built on previous findings implying such a difference for certain processes relying on self-monitoring. The measure previously used was the Simon task, which requires suppressing task-irrelevant location information to correctly process task-relevant direction information and at which females were shown to perform worse [19]. While a very recent study provides compelling evidence for the case that bilingual language control is in fact isolated from other inhibitory control, such as tested in the Simon task, the possibility of an unrelated yet analogous influence of gender on language control was not addressed [44]. Thus we extended the gender difference in suppressing task-irrelevant information from the spatial domain to the language domain. Here we would, therefore, expect a gender-dependent difference in profit from other facilitating effects, such as the shared initial phoneme facilitation.
For our primary sample, the facilitatory effect of a cross-linguistically shared initial phoneme occurred in females, but not in males to the same extent. A confirmatory analysis with our secondary sample however revealed a slightly different finding, namely an interaction between first phoneme status, gender, and age. While for ages below 26 years, same initial phonemes across languages did shorten RT in women and not in men, the same was not true for ages above 26 years. We interpret this difference to signify a stronger basal level of self-monitoring about task relevance in language that is prominent in bilingual males compared to bilingual females, but is notably influenced by the worsening of cognitive control during the ageing process [21,22]. One possible mechanism could be a stronger a priori inhibition of the non-target language, which would render any facilitation on RTs by means of a cumulative activation of phonological representation null. However, a priori inhibition of the task-irrelevant language already from the semantic process on is not consistent with our data that showed the simultaneous bilingual phonological activation.

Support for the weaker-links hypothesis
In our findings, the L2 responses were generally given faster than the L1 responses. This finding stands in contrast with frequent reports of the L2 being slower in word production than the L1 [5,6]. A similar situation was reported by Christoffels and colleagues, where behavioral data showed a faster RT for the L2 than for the L1 [7]. In this study, however, the effect only occurred in languagemixed settings, whereas it disappeared in same-language block design such as the one used by us.
A possible explanation might be found in the weakerlinks hypothesis, which stresses the importance of differences in WF as a highly relevant factor leading to different RTs [5,8]. Since 75% of our participants reported German, presumably the dominant language at the time of the experiment, as their L2, the higher WF gained through the German language dominance might lead to a situation in which this paradoxical RT effect occurs. It did no escape our view however, that the WF effect showed to be stronger for the L2 as well. This in turn conflicts, on first view, the weaker-links hypothesis, which predicts that language dominance should be related to a smaller WF effect [25]. This interaction effect could be explained in two different ways.
First, it should be reminded of how the smaller WF effects is achieved along the time course of language development: WF first benefits high-frequency words in reducing RTs before low-frequency words catch up [25]. Therefore, following this line and counterfactually going back the timeline, if L1 had been the language of dominant use and L2 had been the language of non-dominant use until a point in time, RTs would have been, on average, shorter for L1 than for L2 and the WF effect would have been smaller for L1 than for L2 at that time point. Then, as L1 became the language of non-dominant use as with the bilinguals in the present study, RT increased on average for L1, keeping the previously achieved smaller WF effect for L1 but increasing the L1 RTs until RTs for L1 low-frequency word match RTs for L2 low-frequency words.
An alternative possible explanation for the conundrum of the interaction effect could come in the consideration of not only ceiling effects playing a role in activation, but also floor effects, affecting high-frequency words of non-dominant L1 adversely. There is the possibility of L1, being the predominantly non-dominant language in our dataset, summarily having reached an activation floor level through continued non-use. If in such a scenario even words with a relatively high frequency are rarely used simply due to them belonging to the L1, this attenuated activation would mean that even these highfrequency words rest on a, compared to the much more dominant L2, minor level of activation. The L2, which is summarily more activated due to its dominance, could in this context profit far more from the WF effect: only low-frequency words would rest at an activation floor, while the more often used words would experience the usual acceleration in RTs via the WF effect. This difference could explain a stronger WF effect for a dominant language; it is however a highly speculative hypothesis deserving of further critical thought.

Distinguishing word choice, proficiency and age of L2 acquisition
WC, proficiency in terms of naming accuracy and age of L2 acquisition are variables shown to affect semantic/lemma selection in the aforementioned studies. We intended to distinguish these variables. The choice of modal vs. non-modal responses reflects semantic decision processes at the start of word production. As expected, analysis demonstrated a significant effect of WC on RTs. Responses containing non-modal words arguably involve a more difficult semantic decision for the participant than trials in which the modal word is the obvious choice. This process of decision-making seems to take up enough time to impact the resulting RTs (by 100-200 ms on depending on word frequency). While WC does therefore still seem to be a viable measure of processing difficulty at the semantic/lemma selection stage, this study identified age as a factor that has to be taken into account. As previous studies have pointed out, bilinguals do seem to possess distinct advantages in retaining age-dependent loss of cognitive ability compared to monolinguals [21]. In this withingroup setting the effects of age are still detectable, and awareness of possible confounding effects via this interaction is important.
Regarding the speed-accuracy tradeoff, our initial hypothesis concerning the inverse relationship between naming accuracy and RT speed was confirmed. As a significant main effect, a higher percentage of initially correctly named objects went in conjunction with slower RTs. In contrast to the factor of WC, naming accuracy did interact with another factor, namely WF, a variable of phonological code retrieval. Naming accuracy therefore seems to be less suited as a reflection of a purely semantic/lemma selection level than WC. This interaction could however be related to cascade models, which predict a semantic-phonological interaction. For instance, the size of the unselected semantic/lemma candidates interact with WF, which indexes phonological code retrieval. The more limited the set of candidates is, the smaller the WF effect will be [3]. Therefore, naming accuracy might be connected to a higher, task-controlling level rather than to the purely semantic/lemma selection level. From there, it would be possible for naming accuracy to influence the efficacy of word production via modulation of internal monitoring, effectively creating internal constraint on semantic/lemma selection.
This additional hypothetical link is further confirmed by the direction of the significant interaction effect that the WF effect was stronger for lower accuracy naming than for higher accuracy in our dataset. Less self-monitoring means relying more on the established activation patterns given by the WF effect, while a stronger monitoring results in a stricter internal constraint with less reliance on established activation levels. This pattern falls in line with previous research, showing an inverse relationship between semantic constraint and WF effect in object naming predicted by cascade models [3]. If we hypothesize that naming accuracy is part of the higherlevel constraint generating system, it remains to be seen, in future studies, specifically on what aspect the naming accuracy variable imposes a top-down constraint. Here, the soon to be made available name-and image-agreement rating scores specific to our set of objects will certainly prove to be helpful.
Age of acquisition did not turn out to be a significant factor on the level of semantic/lemma selection in our study. We conclude that for the purpose of reflecting semantic processing, WC is the most well-suited variable in the present study [5].

Limitations
Data analysis under factorial study designs with analysis of variance without the use of mixed-effects multiple regression usually requires a very extensive set of data. Considering the huge sample sizes common for variancebased analyses, we have to acknowledge that our small sample size limits our interpretations.
We circumvented this by taking advantage of the flexibility that the mixed-effects multiple regression analysis offers but that the conventional analysis of variance does not. By using mixed-effects multiple regression, the present study detected the effects of 10 theoretically motivated categorical factors and continuous factors and their interactions on trial-by-trial RT measured for 7145 trials for analysis 1 and 3267 trials for analysis 2. In addition to the advantage of multiple regression analysis that is able to compute the effects of fixed-factors, controlling for all other factors in the model, mixed-effects multiple regression performs by-participant analysis and by-item analysis standardly required from psycholinguistic study in one analysis and partialled out the significant participantrandom effect (idiosyncrasy of the study participants) and the significant item-random effect (idiosyncrasy of objects used in the study); thus, the significant effects of the fixed factors should be generalizable to people and stimuli outside the samples used in the study.
Furthermore, our data is subject to an imbalanced language distribution. 75% of our participants reported German as their L2, which may be enough to heavily influence the results, but not enough to clearly attribute any specific observations to. This imbalance would pose a problem if the statistical method was insufficient to partial out the effects of other fixed factors and random effects of participants and items. The consequences might include: • 25% non-dominant L2 masking an even stronger WF effect for L2, which could, if present, be interpreted to disconfirm the weaker-links hypothesis. • 15% dominant L1 feigning a bigger WF effect for the L1. If this were the case, it might also be interpreted against the weaker-links hypothesis.
• Skewing of RT towards a German language-specific average, weakening the potential for generalization of our data interpretations [45].
Outside statistics, regarding language dominance, we assume German language dominance due to the experiment taking place in a German-speaking frame, yet there was no specific data lifted regarding the amount of usage of each participant's languages.
Similarly, because the source study for which we measured object naming RT does not have supplementary language proficiency scores measured on established batteries in languages of the participants (German, English, French, Italian, Luxembourgian, Slovakian, Chinese, Bosnian, Croatian, Spanish, and Cantonese) beyond object naming accuracy, we can make no hard statements regarding individual language proficiency, a factor that has been suspected to strongly influence bilingual word production peculiarities [7,46].
Lastly, we have to concede that for variables such as gender, it is impossible for us to control for any unknown third factors across the grouping variable. To solve this problem, a much bigger sample size across many different personal backgrounds would be required, which we unfortunately did not have access to.

Conclusions
Our mixed-effects multiple regression analysis of bilingual object naming RT revealed that the single word production process in healthy adult bilinguals is affected by interactions among cognitive, phonological, and semantic factors. Bilingual phonological activation interacted with gender in the inhibitory control of task-irrelevant language. Phonological code retrieval interacted with language status, language dominance, practice effect and speed-accuracy tradeoff. The practice and fatigue effects interacted as well. Age of acquisition appears to modulate phonological word representations. Our analysis revealed that WC stands out as a robust predictor, unaffected by other factors, to detect failures in semantic/lemma selection. Taken together, dense interactions between phonological factors and other factors revealed in the present study have confirmed that meaning-sound mappings are arbitrary within and across different languages and bilingual brains orchestrate cognitive, psycholinguistic, and functional components to enable speedy and accurate single word production.