INTER-OBSERVER REPRODUCIBILITY OF 15 TESTS USED FOR PREDICTING DIFFICULT INTUBATION

a Department of Anesthesiology and Intensive Care Medicine, University Hospital Olomouc and Faculty of Medicine and Dentistry, Palacky University Olomouc, Czech Republic
b Faculty of Medicine and Dentistry, Palacky University Olomouc
c Department of Neurosurgery, University Hospital Olomouc and Faculty of Medicine and Dentistry, Palacky University Olomouc
d Department of Medical Biophysics, Faculty of Medicine and Dentistry, Palacky University Olomouc
e Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacky University Olomouc
f Department of Epidemiology and Public Health, Faculty of Medicine, Ostrava University
g Department of Preventive Medicine, Faculty of Medicine and Dentistry, Palacky University Olomouc
E-mail: milan.adamus@seznam.cz


INTRODUCTION
Tracheal intubation is a mainstay of airway management during general anesthesia and is usually performed uneventfully. However, if intubation proves difficult or impossible after induction of anesthesia, critical oxygen desaturation may occur. Unanticipated difficult intubation can be more dangerous than a predicted one, in which a potential airway problem is detected before anesthesia.
There are many clinical tests for predicting difficult intubation (DI) [2][3][4]. Combining several tests may improve the accuracy of the assessment. However, studies have reported conflicting results on the predictive value of models built from different test combinations [5][6][7]. A sound model must both classify patients' airways reliably and give the same results when performed by different assessors (reproducibility).
The aim of this study was to determine the inter-rater agreement between two assessors (medical students) using fifteen parameters for predicting difficult intubation.

MATERIALS AND METHODS
Following local ethics committee approval and informed consent, 101 volunteers (medical students) were examined (2-5 per day) with 15 tests for predicting difficult intubation. In random order, each volunteer was examined in one session independently by two co-authors (O.J., T.V.) who had been thoroughly instructed in carrying out the tests. Examinations performed by assessor O.J. formed group O, while group T comprised the corresponding examinations performed by T.V. The measurements were done under standardized conditions and each assessor was blinded to the results of the other. The airway assessment consisted of fifteen parameters and measurements (see Table 1). The data were recorded in an Excel spreadsheet (Microsoft Office 2007 SP2, Microsoft Corporation) and statistically analyzed with SPSS v. 15.0 (SPSS Inc., Chicago, USA).

Descriptive statistics were used to summarize the demographic data of the volunteers, and the genders were compared with the Mann-Whitney U test. A p-value less than 0.05 was considered significant. Agreement on qualitative parameters was assessed using the percentage agreement between assessors, Cohen's kappa (κ), or the first-order agreement coefficient (AC1). Inter-rater agreement on quantitative parameters was analyzed with the intraclass correlation coefficient (ICC) with 95% confidence intervals and with Pearson's or Spearman's correlation coefficients. We used the following interpretation [8] of inter-rater agreement for the kappa values and correlation coefficients: poor (< 0.20), fair (0.21-0.40), satisfactory (0.41-0.60), good (0.61-0.80), and excellent (0.81-1.00). The distribution of inter-observer differences was tested with the Kolmogorov-Smirnov test. Data with normal distributions were compared using the paired Student's t-test; the Wilcoxon signed-rank test was used for data that did not pass the normality test. Scatter plots and Bland-Altman plots [9] were used to show systematic bias in measurements between assessors.

There was no significant difference in age between the two genders (p=0.847). Compared to females, males were significantly taller (p<0.0001), heavier (p<0.0001) and had a higher BMI (p<0.0001).
Two tests (positive history of DI and retrogenia) were excluded from the calculations because no positive cases were found. The coefficients of inter-rater agreement for the qualitative and quantitative tests are given in Tables 2 and 3, respectively.
There was a systematic bias between assessors for measurements of all quantitative parameters (see Table 4, Fig. 1 and Fig. 2).
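As an illustration of how such a systematic bias can be quantified, the following minimal sketch computes the Bland-Altman bias (mean difference) and the conventional 95% limits of agreement (bias ± 1.96 SD). The neck circumference values are hypothetical and are not data from this study; the function name and the use of the 1.96 SD limits are assumptions made for the example.

```python
from statistics import mean, stdev

def bland_altman(rater_o, rater_t):
    """Return (pair means, differences, bias, lower limit, upper limit).

    Differences are taken as O - T, so a positive bias means that
    assessor T measured lower values than assessor O on average.
    """
    diffs = [o - t for o, t in zip(rater_o, rater_t)]
    pair_means = [(o + t) / 2 for o, t in zip(rater_o, rater_t)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return pair_means, diffs, bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired neck circumference measurements in cm (NOT study data).
neck_o = [34.0, 36.5, 38.0, 33.5, 40.0, 35.0]
neck_t = [33.0, 36.0, 37.0, 33.0, 38.5, 34.5]

pair_means, diffs, bias, lower, upper = bland_altman(neck_o, neck_t)
print(f"bias = {bias:.2f} cm, limits of agreement = {lower:.2f} to {upper:.2f} cm")
```

In a Bland-Altman plot, the differences are drawn against the pair means; a bias clearly different from zero, as in this invented example, corresponds to one assessor measuring systematically lower than the other.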

DISCUSSION
We demonstrated variable inter-observer reproducibility of tests for predicting DI. The best agreement between assessors was found for neck circumference; the worst results were obtained for the goniometric measurements (anteroflexion and retroflexion of the cervical spine). Inter-observer reproducibility of a test depends upon factors related to both the rater and the person examined [10]. Rater-related errors include incorrect or inconsistent measurement technique, which may be due to insufficient instructions and/or inaccurate methodology. Each test must be described as simply as possible, but the accuracy of the measurement procedure has to be maintained. To reduce this potential source of error, we used both a written description of all measurements and practical training of the assessors before the start of the study. However, it is doubtful whether these steps were sufficient because the inter-rater bias in the measurements of quantitative parameters was substantial. Factors related to the examined volunteer may arise from misunderstanding the instructions or not following them appropriately. When necessary, the required maneuvers were clearly described several times and demonstrated repeatedly [11].

This study is relevant not only to pre-anesthetic airway assessment and predicting DI. The results present a statistical challenge, too. Cohen's kappa coefficient (κ) is a statistical measure of inter-rater agreement for qualitative (categorical) items [12][13]. Kappa values range from -1.0 to 1.0. Negative values occur when agreement is weaker than expected by chance. A value of κ=0 means that agreement is the same as would be expected by chance, and κ=1.0 indicates perfect agreement. However, in our study, the high percentage agreement between assessors for some parameters contrasted with low κ-values (see Table 2). For these parameters, the first-order agreement coefficient (AC1) was calculated as an alternative to the κ coefficient. Some authors believe that the AC1 value reflects inter-rater agreement more realistically than Cohen's kappa coefficient [14][15]. The limitation of AC1 is that it can be used for 2 × 2 contingency tables only. When a qualitative parameter has more than two values, either another agreement coefficient (AC2) has to be used or some groups have to be merged for AC1 calculations. In our study, the MMT (Modified Mallampati test) [16], the ULBT (Upper lip bite test) [17][18][19] and temporo-mandibular (TM) joint movement were the parameters suitable for merging groups.
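For reference, the two coefficients compared above can be written for a 2 × 2 table as follows. These are the standard textbook definitions rather than formulas quoted from this paper, and the notation is an assumption made here: p_o is the observed proportion of agreement, and p_O and p_T are the proportions of positive findings recorded by assessors O and T.

```latex
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = p_O\,p_T + (1 - p_O)(1 - p_T)
\]
\[
\mathrm{AC1} = \frac{p_o - p_e^{\gamma}}{1 - p_e^{\gamma}},
\qquad
p_e^{\gamma} = 2\pi(1 - \pi),
\qquad
\pi = \frac{p_O + p_T}{2}
\]
```

When positive findings are rare, p_e approaches 1, so the denominator of κ becomes very small and a single disagreement can push κ towards zero or below. In the same situation p_e^γ approaches 0, so AC1 stays close to the observed agreement p_o.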
In three volunteers, the clinical impression of potential DI was positive (one in the O group, two in the T group). This parameter showed poor inter-observer agreement when measured with Cohen's kappa coefficient (κ=-0.013), but excellent agreement when AC1 was used (AC1=0.969). The very low incidence of positive cases is a limitation of these results.
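The contrast between κ and AC1 for low-prevalence parameters can be checked with a short calculation. The sketch below uses the standard two-rater, two-category definitions of both coefficients. The 2 × 2 cell counts are not given in the text and are assumptions: for the clinical impression of DI, the three positive findings are assumed not to overlap between assessors; for pathologies associated with DI, one volunteer is assumed positive for both assessors and one for a single assessor. The function names are hypothetical.

```python
def agreement_2x2(both_pos, only_o, only_t, both_neg):
    """Return (observed agreement, Cohen's kappa, AC1) for two raters.

    Undefined (division by zero) when neither rater records any positive
    or any negative case, since chance agreement for kappa is then 1.
    """
    n = both_pos + only_o + only_t + both_neg
    p_obs = (both_pos + both_neg) / n
    pos_o = (both_pos + only_o) / n  # proportion rated positive by O
    pos_t = (both_pos + only_t) / n  # proportion rated positive by T

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = pos_o * pos_t + (1 - pos_o) * (1 - pos_t)
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)

    # AC1: chance agreement from the mean prevalence of positive ratings.
    prevalence = (pos_o + pos_t) / 2
    pe_ac1 = 2 * prevalence * (1 - prevalence)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)
    return p_obs, kappa, ac1

def strength(coefficient):
    """Grade a coefficient with the scale given in Materials and Methods."""
    if coefficient > 0.80:
        return "excellent"
    if coefficient > 0.60:
        return "good"
    if coefficient > 0.40:
        return "satisfactory"
    if coefficient > 0.20:
        return "fair"
    return "poor"

# ASSUMED cell counts for the 101 volunteers (not reported in the text).
tables = {
    "clinical impression of DI": (0, 1, 2, 98),
    "pathologies associated with DI": (1, 1, 0, 99),
}
for name, cells in tables.items():
    p_obs, kappa, ac1 = agreement_2x2(*cells)
    print(f"{name}: agreement={p_obs:.1%}, "
          f"kappa={kappa:.3f} ({strength(kappa)}), "
          f"AC1={ac1:.3f} ({strength(ac1)})")
```

Under these assumed counts the calculation gives κ=-0.013 and AC1=0.969 for the clinical impression, and κ=0.662 and AC1=0.990 for the pathologies, consistent with the coefficients reported in this section.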
No volunteer declared a positive history of DI. This could be for two reasons: either the volunteer had never had anesthesia, or intubation during a previous anesthesia was not difficult. As we were unable to distinguish between these two groups and the incidence of a positive history of DI was zero, this parameter was excluded from the statistical analysis. The same applied to retrogenia: there was no positive case in either group.
Pathologies associated with DI were detected in only two volunteers. The agreement between the assessors was as high as 99%. Based on Cohen's kappa coefficient, the agreement was good (κ=0.662). When AC1 was used, the strength of agreement was graded as excellent (AC1=0.99). These results show the advantage of AC1 over Cohen's kappa coefficient: intuitively, when the assessors' examinations were identical in 99% of cases, the degree of agreement should be described as excellent. On the other hand, one must take into consideration that such pathologies were rarely detected and were probably largely absent in the study group.

[...] ease of tracheal intubation could not be determined. For this reason, in spite of inter-observer bias in most parameters, we were not able to determine which measurement, if any, reflected reality.
To be useful in clinical settings, a model for predicting DI should be simple and feasible, with high accuracy, sensitivity and positive predictive value to identify all patients in whom intubation will be difficult [5]. These criteria can only be met when the input data are correct and consistent. If they are not, the construction of a model predicting DI may become more a mathematical entertainment for the particular rater than a valuable clinical tool.

CONCLUSION
Although performed under standardized conditions, not all tests for predicting DI achieved acceptable inter-observer reproducibility in our study. The best agreement was demonstrated for the assessment of neck circumference, while the largest discrepancies between raters were in the goniometrically measured mobility of the C-spine (maximal anteroflexion and retroflexion). The high inter-observer variability of examinations may be one reason why models for predicting DI are not reliable in all cases.

[...]-19) | Biting the upper lip with the lower incisors | 1 = the incisors in front of the lip; 2 = the lip partly visible; 3 = the lip visible
Retrogenia (receding mandible) | A line drawn from the upper eye lid to the maxilla | yes = the chin behind the line; no = the chin in front of the line
Hyo-mental distance (HMD) | Distance: the body of the hyoid bone - the mentum | mm
TM joint movement | Full mouth opening (IIG) + slux | 1 = IIG > 50 mm + slux > 0; 2 = IIG < 50 mm + slux > 0; 3 = IIG < 50 mm + slux < 0
Maximal anteroflexion of the C-spine | The goniometer head to the ear canal; first arm in the long axis of the neck above the ear; second arm to the nasal wing | degrees
Maximal retroflexion of the C-spine | | degrees
Mandibular length | Distance: outer angle - the middle of the chin (follow the shape of the mandible) | cm
Neck circumference | Measured at the level of the cricoid, perpendicular to the long axis of the neck | cm
Thyro-mental distance (TMD) | Distance: superior thyroid notch - the lower edge of the middle of the chin | mm
Sterno-mental distance (SMD) | Distance: jugulum - the lower edge of the middle of the chin | mm
Inter-incisor gap (IIG) | Full mouth opening, distance between the incisors (gums) | mm
DI = difficult intubation, TM = temporo-mandibular, slux = subluxation (maximal forward protrusion of the lower incisors beyond the upper incisors)

RESULTS
A total of 101 volunteers were enrolled and all of them completed the study with no drop-outs. Thirty (29.7%) were males (median age 23 years, range 20-26 years; median height 184 cm, range 173-190 cm; median weight 82 kg, range 66-100 kg; median BMI 24.0 kg/m², range 21.1-31.7 kg/m²) and 71 (70.3%) were females (median age 23 years, range 20-25 years; median height 168 cm, range 153-182 cm; median weight 61 kg, range 45-83 kg; median BMI 21.5 kg/m², range 16.2-30.5 kg/m²).

Fig. 1. Scatter plot of assessor T's measurements of neck circumference against the paired measurements of assessor O. Bias is present: the measurements of assessor T were systematically lower than the measurements of assessor O.

Fig. 2. Bland-Altman plot of differences between assessors in paired measurements of neck circumference against mean neck circumference, in 101 volunteers.

Table 1. Parameters for predicting difficult intubation used in the study.

Table 4. Systematic deviation (bias) in the measurements of quantitative parameters. Minimum, maximum, mean, and percentiles of differences between assessors.