Probabilistic model for individual assessment of central hyperexcitability using the nociceptive withdrawal reflex: a biomarker for chronic low back and neck pain

Background The nociceptive withdrawal reflex (NWR) has been proven to be a valuable tool in the objective assessment of central hyperexcitability in the nociceptive system at spinal level that is present in some chronic pain disorders, particularly chronic low back and neck pain. However, most of the studies on objective assessment of central hyperexcitability focus on population differences between patients and healthy individuals and do not provide tools for individual assessment. In this study, a prediction model was developed to objectively assess central hyperexcitability in individuals. The method is based on statistical properties of the EMG signals associated with the nociceptive withdrawal reflex. The model also supports individualized assessment of patients, including an estimation of the confidence of the predicted result. Results up to 80% classification rates were achieved when differentiating between healthy volunteers and chronic low back and neck pain patients. EMG signals recorded after stimulation of the anterolateral and heel regions and of the sole of the foot presented the best prediction rates. Conclusions A prediction model was proposed and successfully tested as a new approach for objective assessment of central hyperexcitability in the nociceptive system, based on statistical properties of EMG signals recorded after eliciting the NWR. Therefore, the present statistical prediction model constitutes a first step towards potential applications in clinical practice.


Background
Chronic pain states are associated with changes in the nociceptive system that may lead to hypersensitivity, i.e., pain after innocuous stimulation or exaggerated pain after low-intensity nociceptive stimulation [1]. Patients with chronic pain syndromes, such as whiplash, fibromyalgia, osteoarthritis, basal ganglia disorders, migraine and tension-type headache, endometriosis or chronic low back pain may display such pain hypersensitivity after stimulation of healthy tissues, most likely resulting from increased excitability in the central processing of sensory input [2,3].
The assessment of these conditions may be hampered by the subjective and unreliable nature of self-report based instruments. The establishment of objective, affordable and reliable measures of pain hypersensitivity would advance the understanding of neural mechanisms behind chronic pain, provide a basis for improved clinical management of pain, and establish much needed objective measures of treatment success or failure [4].
In this regard, the nociceptive withdrawal reflex (NWR) is a potential biomarker that has already been proven useful in the assessment of physiological, chemical and pharmacological modulation of nociceptive transmission/processing [5][6][7][8][9]. Moreover, it has been shown that the NWR is a valuable tool in the objective assessment of central hyperexcitability in the spinal nociceptive system that is present in many chronic pain disorders [7,9]. However, most of the studies on objective assessment of central hyperexcitability focus on population differences between patients and healthy individuals and do not provide tools for individual assessment [7,8].
The aim of this study was to develop a method to provide objective and individual assessment of central hyperexcitability using a biomarker derived from the NWR. In order to accomplish this, data from chronic low back and neck pain patients and healthy subjects was used to construct and test a prediction model based on statistical properties of the NWR signals. The methods and results of this model are presented and the relevance of such biomarker is discussed in the context of individual assessment of central hyperexcitability in clinical settings.

Participants
Data from 280 subjects was collected and divided in two groups. One group contained data from 140 patients (70 males and 70 females) with chronic low back pain or chronic neck pain. Inclusion criteria for chronic pain patients were: daily pain of at least 6 months duration and pain at the time of testing with an intensity of at least 3 using a 10-cm visual analogue scale (VAS), whereby 0 = no pain and 10 = worst pain imaginable. Exclusion criteria for chronic pain patients were: radicular pain (as defined by leg pain associated with a magnetic resonance imaging finding of herniated disc or foraminal stenosis with contact to a nerve root); peripheral or central neurological disorders, diabetes mellitus, insufficient knowledge of the German language, pregnancy (as ruled out by pregnancy test), breast feeding, intake of oral contraceptives or hormones, intake of opioids and antidepressants during the previous 2 weeks, and intake of other analgesics during the 48 hours before testing. The second group contained data from 140 healthy subjects (70 males and 70 females). Exclusion criteria for healthy subjects were the same as for the patient group, plus any pain at the time of testing or history of chronic pain syndrome of any nature. Both groups were evenly distributed with respect to gender and age. The age of the subjects ranged from 20 to 80 years with a mean age of 50 years. Both groups were tested using the same experimental setup for the recording of the NWR. All subjects were recruited at the University Hospital of Bern, Inselspital. The dataset was originally collected in order to establish reference values for the NWR and reflex receptive fields (RRF, the area from which the NWR can be elicited) in healthy subjects, and to determine if there are population differences compared to chronic pain patients (for further details, please refer to [9][10][11][12]). Written informed consent was obtained prior to participation and the Declaration of Helsinki was respected. The study was approved by the local ethics committee of the Canton of Bern (KEK 147/04).

Electrical stimulation
During testing, participants were lying in a bed, in a quiet room. A leg rest was placed under the knees to obtain a 30º semiflexion during electrophysiological testing. Ten surface stimulation electrodes (15 x 15 mm, type Neuroline 700, Ambu A/S, Denmark) were non-uniformly mounted on the sole of the foot and a common anode (5 x 9 cm) was placed on the dorsum of the foot, which are subsequently denoted as sites 1,2,…,10. Each stimulus consisted of a constant-current pulse train of 5 individual 1-ms pulses delivered at 200 Hz (Noxitest IES 230 stimulator, Aalborg, Denmark). This stimulus is perceived by the subjects as a single pricking/pinching stimulus, and potentially evokes a single ankle dorsiflexion/plantarflexion withdrawal reaction (depending on the stimulation site). A computer-controlled stimulator delivered a stimulus to one electrode at a time in a randomized order, with a random inter-stimulus interval ranging from 10 to 15 s. For each electrode site, the lowest stimulus intensity that evoked pain (i.e., the pain threshold) was assessed using a staircase procedure, and a stimulation intensity of 1.5 times the pain threshold was selected for eliciting the NWR [13], in order to ensure that a NWR response was evoked by most of the stimulations. Each electrode site was stimulated 4 times.

Signal recording
Activity in the tibialis anterior muscle was measured using surface electromyography (EMG), as can be seen in Figure 1.A. Initially the skin was lightly abraded, and then two surface electrodes (30 x 22 mm, type Neuroline 720, Ambu A/S, Denmark) were placed along the muscle fiber direction over the muscle with an inter-electrode distance of 20 mm. The signal was amplified (up to 20000 times) and filtered (5-500 Hz, 2nd order) by custom-made EMG amplifiers (Aalborg University, Denmark). Afterwards, it was sampled (2000 Hz) and stored (1000 ms window including 200 ms of pre-stimulation activity, commonly used to verify that there is no muscle activity before the stimulation), using a software specifically developed for NWR acquisition and analysis [14]. The NWR was quantified within the 60-180 ms post-stimulation interval (Figure 1.B).

Prediction model Data preparation
Data was divided into three disjoint subsets: training set (TR), validation set (V) and test set (TE) [15][16][17]. The training set TR contains the data from which the model was derived. The validation set V was used to adjust and find the optimal model. Finally, the test set TE was used to assess the performance of the optimal model. Data was split as 70%-15%-15% of the whole dataset for TR (200 subjects), V (40 subjects) and TE (40 subjects) sets, respectively.
Two restrictions were applied when assigning subjects into each set. First, the number of patients and healthy subjects should be balanced. Second, subjects within each group (patient or healthy) should be distributed evenly with respect to gender and age. Assuming a data vector of N samples X ¼ X i f g i¼1;N ---from which n samples need to be selected following the aforementioned restrictions; two age classes: A 1 = {X i < 50y} and A 2 = {X i ≥ 50y}, and two genders G 1 = {X i is male} and G 1 = {X i is female} were included. The set S was derived as: and ‖S i ‖ i¼1::4 ¼ n 4 = . Following these restrictions, the data was finally divided as shown in Table 1.

Feature selection
Several features were derived from the EMG recordings of the NWR in order to perform an initial exploratory analysis of their discriminative capacity. Preliminary tests showed promising results using the EMG amplitudes of the NWR, in line with previous investigations [18][19][20][21]. Specifically, the exploratory analysis showed some differences between the probability distributions of EMG amplitudes between patients and healthy subjects, so it was hypothesized that those differences could be used for classification purposes.
Since the stimulation was repeated four times at each site, each sample X i ∈ X had four signals F ij ; j ¼ 1; 4 --. For each signal, a probability distribution histogram was constructed to be used as classification feature. To compute a probability distribution histogram, it was required to determine the number of sub-ranges or bins that should be used. To this end, "EMG amplitudes of all signals {F ij } from all training samples X i were taken into account," where X i ∈ TR. The EMG amplitude range was defined as Rg = [min (F ij ), max (F ij )]. Since the EMG signal values have rather large range Rg, the selected  All 140 patients and 140 healthy subjects were divided into a training set (TR), a validation set (V) and a test set (TE). Each set satisfied two criteria: balance between patients and healthy subjects, and even distribution on gender and age.
number of bins should be reasonably high. Therefore, different numbers of bins, namely 100, 150, 200 and 300 were tested. The experiments showed that in all the cases, there were very few extreme EMG amplitude values, in the order of~1/1000, which means that the probability distribution of EMG amplitudes in the area of the two ends of Rg is~0. The data was mainly distributed within the range rg = [μ-σ, μ + σ], where μ and σ denote the mean and the standard deviation of {F ij }, respectively, so this new, restricted range was used. With the new range rg and selecting the model using a leave-one-out cross-validation procedure [22], only 30 bins were required to construct a probability distribution histogram, effectively reducing the number of features to be used as input to the classifier. Figure 2 shows the average histogram (an estimate of the probability distributions of EMG amplitudes) for patients and healthy volunteers.

Prediction scheme
Given a set of training subjects {X i ∈ TR}, which contain N h and N p training samples from healthy subjects and patients respectively, probability distributions P ij were computed for each signal The subject X i was represented by the probability P i , where: X i was assigned with a label L i = h if X i was a healthy subject and L i = p otherwise. When a query subject was sent to the prediction model to be classified as a patient or a healthy subject, the decision was made based on nearby training subjects with respect to the query subject. This approach is known as the k-nearest neighbour (kNN) [23]. kNN is an efficient algorithm in machine learning, showing comparable classification performance to more complex algorithms [16,24]. Briefly, when given a query subject, the kNN algorithm searches in the training set for the k subjects that are closest to the query, based on some predefined criterion to measure closeness (e.g. Euclidean distance). The query subject is then assigned to the group which has the majority among k subjects. Regarding the value of k, there is no specific approach for selecting an optimal value, as this strongly depends on the data structure [25]. Using leave-one-out cross-validation [22], a k value of 5 was finally selected.
The proposed prediction scheme was implemented as follows: given an unknown subject X q (X q ∈ V or X q ∈ TE), a probability distribution P qj was computed for each signal F qj ; j ¼ 1; 4 --. Then, each probability distribution P qj was compared with all training probability distri- The Euclidean distance between these distributions was calculated as: Where k training subjects with smallest distance were considered, and x ¼ 1; 30 --represents the size of each classification feature (i.e. the 30 bins of each histogram). If the majority of these k training subjects had label h then the signal F qj was labelled as healthy, and vice versa. Therefore, X q had four predicted labels, one for each of the four signals F qj . The final prediction result for the unknown subject X q was decided based on a voting mechanism, where the majority of votes was chosen to classify the subject as belonging to a patient or a healthy group. With four signals, there were cases where voting result was equal, i.e. 50:50 prediction results. From a clinical perspective, it is better to minimize the chance of missing any potential patients. Therefore, in case of equal voting, the unknown subject X q was classified as a patient.

Individual assessment
The prediction model also allows for individual assessment; as mentioned in the previous section, each signal F qj of an unknown subject X q returned a predicted label, according to the majority of k nearest neighbours. This label can be described as a probability value. The following formulas were used to compute the probability value for each signal F qj to be labelled as patient or healthy, respectively: Where L k denotes the labels of k nearest neighbours at the j-th signal. For example, for the j-th signal, if all k nearest neighbours had the same label p, then that particular signal F qj was labelled p with a probability of 100%. Another example: if 3 ⁄ 5 neighbours are labelled p and 2 / 5 neighbours are labelled h, then the signal F qj was labelled p with probability of~60%. In other words, the signal F qj had~60% probability of belonging to a chronic pain patient and~40% probability of belonging to a healthy subject. With four signals for each subject, the individual assessment can be obtained by: If l p q is higher or equal to 50%, the query subject X q is finally classified as patient, otherwise it is classified as healthy. Furthermore, the above value indicates how likely a subject is predicted as a patient: higher values result in higher confidence in the assessment. Figure 3 shows a general overview of the proposed prediction scheme.

Model validation
The validation set V was used to find an optimal combination of different factors in order to achieve the highest prediction rate. As mentioned in the Data preparation section, 40 subjects (of which 20 are healthy subjects and 20 are patients) were included in V. Further experiments to validate the prediction model were conducted with different number of training subjects, since this number can influence the performance of the model [10]. Out of the 200 training subjects (100 patients and 100 healthy subjects), different subsets were extracted as a new training sets. N h and N p denote the number of selected training healthy subjects and patients, respectively. Since the training set had to follow the rules established in the Data preparation section, N h and N p were selected such that they were modulus 4: When N h l and N p l < 100, more than one combination among training subjects to select a subset were available. For the model validation, 10 combinations were randomly chosen for each case. It should be noted that all combinations should obey the even distribution restrictions. Therefore, for each set of values N h l ; N p l À Á ; l ¼ 1; 6 --, 10 training sets were collected, namely TR l t ; t ¼ 1; 10 ---. The same validation set V was used in all cases to test the prediction model.

Model evaluation
Different parameters might affect the performance of the prediction model, such as selection of training set, number of training subjects and stimulation sites (see next section). The validation set V was first used to tune these parameters. Once the model was optimized, the test set TE was used to evaluate the real performance of the model, without further parameter tuning. Since it is known in advance which subjects belong to the healthy group and which ones belong to the patient group, this knowledge was used to evaluate the prediction model. The evaluation of the model's performance was based on prediction rates: Where M þ denotes the number of correct classified subjects among total number of subjects for validation or test, respectively, i.e. M = ‖V‖or M = TE depending on whether the validation set or the test set was used. Higher prediction rates indicate better performance of the model. To avoid cases where the prediction rate was high but most of the query subjects were classified as either patient or healthy, the previous equation was also applied to each group separately. In other words, it was extended as Where M þ h denotes the number of healthy subjects which were correctly classified and M þ p denotes the number of patients which were correctly classified. M h and M p are the total number of healthy subjects and patients for validation or test, respectively.

Model validation Stimulation site evaluation
Several factors can influence the EMG signals recorded after stimulation of each site depending on the location of the electrode on the sole of the foot, such as skin thickness or nerve fiber density [24]. Eventually, sites with low prediction rates should be eliminated, meaning that fewer electrodes will be needed to place on a subject's sole of the foot in a potential future application of the NWR as a biomarker for individual assessment of pain. Therefore, the first part of the model's validation was carried out to compare the performance using each of the ten stimulation sites separately. Following the scheme (Figure 3), each subject from the validation set V was sent as input. kNN was applied with k=5. The average prediction rate r was reported. Given N h l ; N p l À Á ; l ¼ 1; 6 -training subjects, average prediction performance over 10 runs with TR l t ; t ¼ 1; 10 --at each site was reported. As there was only one set of N h 7 ; N p 7 À Á training subjects, only one prediction result was obtained for this case (see Table 2).
The best performance at each site over different number of training subjects is displayed in Table 2. To decide which sites should be discarded, a threshold of 75% for the prediction rate was used. From this, 3 sites were selected whereas the remaining 7 were discarded (see Figure 1.B for an illustration of the locations of selected and eliminated sites). The final results on stimulation site evaluation suggested that signals recorded by stimulation of sites 3, 9 and 10 should be chosen for further evaluation.

Site combination
Following with the model's validation, the analysis was focused on the remaining 3 sites, namely sites {3, 9, 10}. Different combinations among those sites were also tested: {3, 9}, {3, 10}, {9, 10} and {3, 9, 10}. When more than one site was used, the voting mechanism was applied. Results are displayed in Table 3 for comparisons between single site and combinations of sites. Results showed that in general, the prediction rates were improved by combining sites compared to using Figure 3 Scheme of the proposed probabilistic prediction model. (A) Given a query subject X q ∈ TE, a set of EMG signals F qj ; j ¼ 1; 4 are obtained as a response to repeated electrical stimulation of ten sites on the sole of the foot. (B) A probability distribution histogram P qj is constructed from each signal F qj (or combination of signals from multiple sites) to be used as classification feature. (C) The signal F qj is labelled p (for patient) or h (for healthy), depending on the distances d qj to the closest neighbouring histograms P i , derived from the set of training subjects {X i ∈ TR}. (D) The final prediction for the subject X q is carried out based on the labels l qj derived from the individual assessment of all four signals. Query subjects X q ∈ V were used instead for all validation procedures (site combination and training set selection). a single site. Overall, the best performance (85% correct predictions) was reached by a combination of EMG signals recorded after stimulation of sites {9, 10}, {3, 9, 10} and {9}. The combination {9, 10} was finally chosen because it gave the best average prediction with different numbers of training subjects.

Training set selection
In general, higher numbers of training samples do not always lead to better training model [15], since at a certain point the model will not improve further or even drop down in performance. To find an optimal number of training samples, an empirical approach is often used [24,26]. The performance of the prediction model was influenced by the number of training subjects, as shown in Tables 2 and 3 : r, r h and r p , representing the average prediction rate, the prediction rate for healthy subjects and the prediction rate for patients, respectively. From previous validation results, a combination of EMG signals recorded after stimulation of sites 9 and 10 was used. Figure 4 shows the prediction result for each case, demonstrating that even with only 12 training subjects for each group, a well selected subset can reach up to 80% correct classified on average (Figure 4.A, t = 7). It was also observed that some of the training subsets did not perform very well, with prediction rates lower than 60%. In general, higher number of training subjects (48, 60 and 80, Figures 4.D, 4.E and 4.F, respectively) resulted in an improvement in the performance over lower number of training subjects (12, 24 and 32, Figures 4.A, 4.B and 4.C, respectively). With higher number of training subjects (48, 60 and 80), performances among different subsets were rather comparable. In case of N h 7 ; N p 7 À Á with t = 1 (only one training set), results were 75%, 95% and 85% for r p , r h and r, respectively.
The case with highest prediction rate was selected for each set N h l ; N p l À Á . As mentioned before, balance between r h and r p is also important. This means that Sites are sorted based on best performance in the last column. Prediction using four combinations of site were reported and compared to the classification performance of each of the three selected sites separately. Sites are sorted based on best performances displayed in the last column (when best performance is similar, sites are sorted based on average performance). the selected model should have a high average prediction rate r, and at the same time, the difference between r h and r p should be small. For instance, in Figure 4.B, at t = 10, the average prediction rate was r = 72% but difference between r h = 50% and r p = 95% was large, i.e., most of the subjects were predicted as patients. Therefore, this model should not be chosen. In Figure 4.F, average prediction rates at t = 3 and t = 9 were equal, but case t = 9 was selected because ‖r h − r p ‖ was smaller than in case t = 3. Comparison among best prediction rates with different numbers of training subjects is displayed in was selected with t = 9 as the training set for the proposed model.

Model evaluation
The model validation stage was used to determine optimal parameters for the prediction model, resulting in a combination of EMG signals recorded after stimulation of sites 9 and 10 and 80 training subjects for each group (t = 9). The optimized model was tested with the test set TE. The test set contained 40 subjects (20 patients and 20 healthy volunteers). Following the scheme in Figure 3, the evaluation with the test set returned an average prediction r = 80% with r h = r p = 80%. This means that with 40 query subjects, 8 were misclassified, in which 4 healthy subjects were misclassified as patients and 4 patients were misclassified as healthy subjects. Since the test set was evenly distributed with respect to gender and age, misclassified subjects were grouped based on these two factors to see how they affected prediction results. The following values were computed: Results revealed differences among misclassified subject based on both factors. There were 7 female subjects misclassified υ G 2 ¼ 87:5% ð Þ . The age factor also showed rather strong influence on the prediction result: υ A 2 ¼ 75% meaning that 6 out of 8 misclassified subjects were from the elder age group (age > 50 years).

Comparison with current methods
Critical values to assess widespread central hyperexcitability using the NWR and RRF have recently been published [10]. In particular, estimates of 95 th , 90 th and 75 th percentile values of the distribution of the test responses have been obtained (named p 95 , p 90 and p 75 , respectively) by computing quantile regressions for the each assessment method (e.g. NWR thresholds, RRF areas). More extreme values (e.g. p 95 ) are more likely to lead to the correct identification of patients, but could leave out a number of subjects that could also potentially be at risk, whereas critical values that correspond to more central percentiles of the distribution (e.g. p 75 ) would include more subjects potentially at risk, but at the cost of misclassifying more healthy subjects as presenting hyperexcitability.
With the present dataset, it was possible to compute individual RRF areas for chronic pain patients and healthy subjects (for details, please refer to [27]), and compare them to these critical values to obtain equivalent classification rates r p , r h and r using the same test set TE. The classification rates for the most restrictive conditions (p 95 and p 90 ) were r p = 0%, r h = 100% and r = 50%, whereas for the least restrictive condition (p 75 ) the classification rates were r p = 20%, r h = 95% and r = 57.5%. This means that the distribution of RRF areas of patients and healthy controls largely overlap resulting in criteria that is too restrictive for detecting central hyperexcitability, as evidenced by the very low classification rates for patients obtained with this method.

Discussion
Experimental and clinical studies in diverse cohorts of patients (e.g., whiplash, fibromyalgia, osteoarthritis, musculoskeletal disorders, headache, and neuropathic, visceral and post-surgical pain) have shown that these pathologies share common features, which are likely to reflect alterations in central nociceptive processing [28,29] leading to exaggerated pain sensitivity. It has been previously established that changes in central nociceptive processing can be detected by electrophysiological tests, such as those based on the NWR. In the past, the NWR has been widely used as a biomarker in the assessment of the state of the nociceptive system [5,6,30], and it has been proposed as a key tool in the research of central sensitization mechanisms, which are believed to be linked to the development of chronic pain [5,7,8,31,32].
In this regard, a number of studies showed that several patient groups present lower NWR thresholds compared to control groups of healthy volunteers [7,8,33,34]. Moreover, it was also demonstrated that chronic pain patients (endometriosis, chronic low back and chronic neck pain) present larger RRF compared to pain free subjects [12,31]. Lower NWR thresholds and enlargement of RRF are objective signs of central hyperexcitability, which could be a consequence of increased number of responsive spinal neurons or an expansion of the receptive fields of spinal neurons as a result of increased synaptic sensitivity [35,36]. In the light of these facts, there is clear evidence that groups of patients with different chronic pain conditions display on average altered central pain processing. However, the next translational step in this field involves the definition of diagnostic criteria in individual patients, in order to develop treatments that are tailored to detect individual disturbances in central pain processing [29].
In this study, a set of features derived from the NWR of chronic low back and neck pain patients and healthy volunteers was used as input to a prediction model, in order to test the hypothesis that the NWR contains specific information that would allow individual classification regarding the condition of the test subjects. Several features derived from the NWR have been used in the past for detection or quantification purposes: NWR latencies, raw EMG amplitudes, mean and peak EMG values, EMG probability distribution, EMG root-mean-square (RMS), z-scores and RRF area size, among others [6,27,32,[37][38][39][40][41]. Additionally, other EMG features have also been used in classifications tasks in other fields (most notably myoelectric control systems), such as number of zeros crossings, slope sign changes, spectral moments, as well as frequency domain and time-scale features [42][43][44].
A preliminary analysis showed that, among these variables, EMG probability distributions showed the most promising results in terms of discriminating potential for classifying between patients and healthy volunteers. Thus, they were selected for further development of the prediction model. However, the EMG signals showed a rather large range, thus requiring a high number of bins for their histogram representation. In order to overcome this, a new range was defined, restricting the original range around one standard deviation of the mean, In this way, less bins were required for the representation (as a simple method of feature selection), effectively reducing the number of features to be fed to the prediction model.
The evaluation of stimulation sites for eliciting the NWR revealed that EMG signals recorded after stimulation of electrodes located in the anterolateral (site 3) and heel (sites 9 and 10) regions and of the sole of the foot presented the best prediction rates. This is in accordance with previous research showing that the RRF in chronic low back and neck pain patients are expanded compared to healthy volunteers, precisely towards these regions [31,45]. On the other hand, EMG signals recorded after stimulation of sites located at or around the arch on the sole of the foot resulted in the worst prediction rates. These locations often have thin skin layer which lead to higher pain sensitivity and large reflexes regardless of whether they are patients or healthy subjects [46].
The final model evaluation showed an average prediction rate of 80%. For that particular choice of model, there were no differences in the misclassification rates between healthy subjects and patients. A more detailed analysis of the results focusing on the demographics of the two groups, revealed that women and elder subjects are more likely to be misclassified using the selected model. To date, there are no studies describing age or gender differences in EMG signals recorded from chronic pain patients compared healthy volunteers in relation to the NWR, since most of the research is focused on other biomarkers, most notably the NWR thresholds to single and repeated stimulation (temporal summation), and the RRF areas [9,26,32]. In this regard, there is still no agreement on the effects of age and gender on the NWR, although most of the evidence seems to point towards generally lower NWR thresholds in women and elderly subjects, most likely due to reduced endogenous analgesic mechanisms [6,10,47,48].
To date, only population differences have been reported between chronic low back and neck pain patients compared to healthy volunteers, showing an enlargement of the RRF in patients [12]. More recently, however, reference ranges for the NWR and RRF have been established for healthy subjects [10]. These ranges establish critical values for several parameters derived from the NWR (e.g. NWR threshold to single and repeated electrical stimulation, RRF area), above which an individual subject can be considered to present widespread central hyperexcitability. Results using this method with RRF areas as the classification parameters showed lower average classification rates (r = 57.5%) and very low classification rates for patients (r p = 20%) compared to the prediction model. This is most likely due to the large inter-individual variation of the RRF areas and the high overlap that exists between the probability distribution of RRF areas in patients and healthy subjects.

Limitations and future work
This work focused on the assessment of central hyperexcitability in individual chronic low back and neck pain patients using the NWR. Although the quantification of the NWR does not rely on subjective self-reports of pain sensation, it is subjected to supraspinal modulation. External factors involving affective and cognitive processes or other ongoing nociceptive processes (e.g. endogenous pain modulatory mechanisms) can affect the NWR characteristics [6,24], so these factors have to be carefully controlled for in order to provide reliable outcomes. Further tests in other patient groups should be conducted in order to test if this model could also be used to characterize other pain conditions. Furthermore, this is the first known attempt at individualized classification between healthy subjects and chronic pain patients based on the assessment of central hypersensitivity provided by the NWR. As it is common in classification tasks, there are several variables that require a careful selection, such as the choice of features to be used as input to the prediction model (in this case, the EMG probability distribution), the parameters of the classifier (for kNN, the number of neighbours), the size of the datasets for classification, validation and test, and the number and location of stimulation sites selected. Some of these variables were chosen based on prior knowledge and/or empirical tests, so whereas the proposed statistical model is able to achieve high prediction rates, future research could focus on the application of more advanced signal processing methods, e.g. alternative methods for feature generation and selection, adaptive histograms, adaptive kernel density estimators and optimal parameter selection for the classifier, among others.

Conclusions
A prediction model was proposed as a new approach for objective and individual assessment of central hyperexcitability in the nociceptive system. The model was developed using statistical properties of EMG signals recorded after eliciting the nociceptive withdrawal reflex. The model supports individualized assessment of patients, including an estimation of the confidence of the predicted result. Evaluation was carried out using an independent test set of healthy subjects and chronic pain patients and a high prediction rate of 80% was achieved. Therefore, the present statistical prediction model constitutes a first step towards potential applications in clinical practice.