Katarzyna Dobruch-Sobczak / Bartosz Migda * / Agnieszka Krauze / Krzysztof Mlosek / Rafał Z. Słapa / Paweł Wareluk / Elwira Bakuła-Zalewska / Zbigniew Adamczewski / Andrzej Lewiński / Wiesław Jakubowski / Marek Dedecjus
Citation Information : Journal of Ultrasonography. Volume 19, Issue 78, Pages 198-206, DOI: https://doi.org/10.15557/JoU.2019.0030
License : (CC-BY-NC-ND 4.0)
Received Date : 20-January-2019 / Accepted: 29-March-2019 / Published Online: 30-September-2019
The worldwide incidence of thyroid cancer is steadily increasing(1). According to the American Thyroid Association (ATA), thyroid nodules are a common clinical problem, with nearly 68% of all examined adult patients diagnosed with these lesions. In 7–15% of these cases, the nodules were found to be carcinomas(2). In Poland, 3,529 new cases of thyroid cancer were diagnosed in 2015. The annual incidence rate has increased from 3.8 per 100,000 in 2000 to 9.2 per 100,000 in 2015(3,4).
Although thyroid nodules are a common occurrence, it is usually difficult to detect them without imaging techniques (only 4–7% can be palpated)(5). Thus, ultrasound (US) examinations play an important role in detection. US is a non-invasive, cost-effective, and widely available technique used to discern specific features of nodules and guide fine-needle aspiration biopsy (FNAB)(5).
Published studies, including the ATA Management Guidelines, have demonstrated that hypoechogenicity, irregular margins, microcalcifications, and a taller-than-wide shape are the B-mode features with the highest level of specificity for the detection of malignant thyroid nodules(6–8). However, none of these features, taken individually, are exclusive to malignant lesions, and benign nodules with a single abnormal feature are relatively common(2,9–11).
Currently, two main types of elastography are available: shear wave elastography (SWE) and strain elastography (SE). Some authors have suggested that SWE is superior to SE in thyroid nodule stratification because of its operator independence, but recent meta-analyses have surprisingly shown that SE has better diagnostic accuracy than SWE(14,15). In addition, Dighe et al. described SWE artifacts and their impact on final results(16). In that paper, the authors reported that almost 70% of SWE scans had artifacts, over 18% of scans were unsuitable for final assessment, and over 43% of the artifacts in unsuitable SWE evaluations were operator dependent(16). SE also has known limitations. Its results are highly dependent on the presence of calcifications (macro- and rim calcifications) in the tumor, as well as on lesion location (deep-lying lesions) and tumor type (papillary thyroid cancers, PTC, are often more suspicious than follicular thyroid cancers, FTC). Assessments of thyroid nodule malignancy carried out with strain elastography and US depend on the examiner’s experience level and are characterized by substantial inter-observer variation(5,17–20), but they are useful in monitoring patients who have undergone FNAB(21–23).
Few studies analyzed inter- and intra-observer variability in US diagnosis and even fewer evaluated variability in the case of sonoelastography(5,17–20). Therefore, we investigated these two variabilities in thyroid nodule evaluations carried out with US and sonoelastography.
In this prospective study, patients first gave informed consent to participate in the research. They then underwent US examination of the thyroid, followed by US-guided biopsy or surgical procedures. The study protocol was approved by the institutional review board of the Maria Skłodowska-Curie Memorial Cancer Centre and Institute of Oncology, Warsaw, Poland. From February to November 2017, 92 consecutive patients (22 men, 70 women) with a total of 149 thyroid nodules were included and examined in the Department of Oncological Endocrinology and Nuclear Medicine, Maria Skłodowska-Curie Memorial Cancer Centre and Institute of Oncology, Warsaw, Poland. Of these, 18 patients (4 men, 14 women) aged 21–78 years, with a total of 20 thyroid nodules, were randomly selected for the study. The nodules included eight malignant and 12 benign diagnoses. All malignant lesions and eight benign nodules were confirmed by postoperative histopathology. The remaining four benign nodules, classified as CII in cytology, were not operated on because surgery without any indications would have been unethical. In this group, we performed US follow-up at 6-month intervals (Fig. 1).
The inclusion criterion was the presence of a thyroid nodule that underwent US-guided FNAB (according to the Guidelines of Polish National Societies(23), prepared on the initiative of the Polish Group for Endocrine Tumours and the ATA) and was either operated on or kept under active observation, including repeated FNAB. The following criteria excluded nodules from further analysis: pure cystic lesions, eggshell calcifications, or non-diagnostic cytology results. The researchers were blinded to the cytological and/or histological results.
Fourteen patients underwent thyroidectomy and FNAB, while four underwent FNAB only. Histologic and cytological findings were used as study endpoints. For patients with benign FNAB results, a US examination was performed within six months. FNABs were performed with 22- to 24-gauge needles, and aspirates were fixed in 75% ethanol and stained with hematoxylin and eosin (H&E). Lesions were assigned to the Bethesda I–VI category(24) based on FNAB findings. FNAB was repeated for nodules classified as CI (non-diagnostic specimen, for example cystic fluid only or an acellular specimen), CIII (AUS/FLUS: Atypia of Undetermined Significance/Follicular Lesion of Undetermined Significance), and small CIV nodules (Suspicion of Follicular Lesion in nodules under 1 cm). If possible, a specific histotype was suggested. Cytological results (CV and CVI) were verified by an independent, second pathologist. Surgical specimens were immediately fixed in 10% buffered formalin. Representative sections from these specimens were processed and routinely stained with H&E for histopathologic (microscopic) examinations.
Five radiologists (one from an oncology centre and four from a clinical centre), with experience in thyroid B-mode and color Doppler US (CDUS) imaging ranging from six to 22 years and experience in US elastography ranging from one to seven years, performed and assessed the examinations. Before the study began, the radiologists were trained in our lexicon: composition; echo pattern in comparison to thyroid parenchyma (Echo-Pa); dominating echo pattern in comparison to thyroid parenchyma, i.e. the dominating component in the case of mixed echogenicity (Echo-Pb); echo pattern in comparison to muscles (Echo-M); margins; the ‘halo’ phenomenon; extrathyroidal extension (the observers were asked to determine whether the extrathyroidal extension modeled the shape of the thyroid and its capsule or extended beyond it); macrocalcification; microcalcification; and elasticity score (according to the Asteria scale). All features are listed in Table 1. All examinations were performed with the same protocol, described below.
The US probe was gently placed on the thyroid in a transverse and longitudinal orientation while the patient was in the supine position. The thyroid gland was scanned from superior to inferior in transverse cross sections and from the outer to the inner margin in longitudinal cross sections. The anteroposterior, transverse, and longitudinal measurements of the gland and nodules were taken on frozen images during examination and archived as well. The other B-mode features of the lexicon, as well as CDUS and SE, were assessed retrospectively on archived AVI films and frozen images. CDUS was performed in all cases with the same scale settings (maximal velocity of 2.5 cm/s). The gain of CDUS was adjusted to each patient individually, achieving the appropriate highest sensitivity without blooming artifacts. In the case of SE, since nodules become stiffer during compression, all radiologists avoided pressing the neck with the probe during examinations to minimize false-positive findings. Grey-scale conventional US with CDUS and SE were performed using an iU22 US machine (Philips Medical Systems, Bothell, WA) equipped with a 5–12 MHz linear array transducer. Sonoelastography was assessed qualitatively using the Asteria four-point scale criteria (Tab. 1)(25). The lesion features assessed in US and SE examinations are listed in Tab. 1. We excluded nodule shape (the taller-than-wide parameter) because its assessment is more objective: it is based on comparing nodule measurements (height and width), which in this research were taken prospectively. Including this parameter could have overestimated the inter- and intra-observer agreement.
From the 149 examined thyroid nodules, the records of 20 nodules were drawn at random using MS Excel. The 20 original US records from B-mode, CDUS, and SE were duplicated. The resulting 40 records were numbered and arranged randomly in a final file. All researchers received the same set of files for evaluation. The five radiologists first evaluated records (AVI loops and JPG images) containing transverse and longitudinal B-mode cross sections of the thyroid lobes. Next, they assessed the CDUS and SE records (AVI loops and JPG images) of these nodules.
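The record preparation described above can be sketched in a few lines of Python. This is an illustration only: the study used MS Excel, and the function name, the fixed seed, and the use of the integers 1–149 as nodule identifiers are assumptions made for this sketch.

```python
import random

def build_blinded_set(nodule_ids, n_select=20, seed=0):
    """Draw n_select nodules, duplicate each record, and shuffle the copies.

    Hypothetical helper; not the procedure actually run in MS Excel.
    """
    rng = random.Random(seed)                 # fixed seed only so the sketch is reproducible
    selected = rng.sample(nodule_ids, n_select)
    records = selected * 2                    # every selected nodule appears exactly twice
    rng.shuffle(records)
    # Map an anonymous record number (1..2*n_select) to the underlying nodule.
    return {number: nodule for number, nodule in enumerate(records, start=1)}

key = build_blinded_set(list(range(1, 150)))  # 149 nodules, assumed numbered 1..149
```

The mapping from record number to nodule is kept by the coordinator only, so that observers rate each duplicated record without knowing which pairs belong together.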
The scoring results for all five observers were analyzed using Dell Statistica (data analysis software system), version 13 (Dell Inc., 2016; software.dell.com). An overall kappa value (κ-value) was estimated for multiple observers. Cohen’s kappa coefficient was used to determine the degree of intra-observer agreement between duplicated records of the same patient, after correcting for agreement expected by chance. For inter-observer agreement, the Fleiss kappa statistic was used, again with correction for agreement expected by chance(26).
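For readers unfamiliar with these statistics, the sketch below implements the standard textbook definitions of Cohen's and Fleiss' kappa in plain Python. It is not the Statistica routine used in the study; the function names and the input layout are assumptions chosen for illustration.

```python
from collections import Counter

def cohen_kappa(first, second):
    """Cohen's kappa for two rating lists covering the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    freq1, freq2 = Counter(first), Counter(second)
    # Chance agreement from the marginal category frequencies of each reading.
    expected = sum(freq1[c] * freq2[c] for c in freq1) / (n * n)
    return (observed - expected) / (1 - expected)

def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] lists the category each rater gave to item i.

    Assumes every item was rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(row) for row in ratings]
    categories = set().union(*counts)
    # Mean proportion of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in row.values()) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    p_e = sum(
        (sum(row[cat] for row in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)
```

In the design described here, `cohen_kappa` would be applied to one observer's two readings of the duplicated records, and `fleiss_kappa` to the five observers' readings of each record.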
The kappa values were interpreted according to Landis and Koch(27), i.e., κ <0.00 corresponds to poor agreement, κ = 0.00–0.20 to slight agreement, κ = 0.21–0.40 to fair agreement, κ = 0.41–0.60 to moderate agreement, κ = 0.61–0.80 to substantial agreement, and κ = 0.81–1.00 to almost perfect agreement.
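The interpretation bands above translate directly into a small lookup function. The sketch below is illustrative; the function name is hypothetical, and values falling between the printed bands (for example 0.205) are assigned to the higher band, which is an assumption, since the published ranges are stated to two decimals.

```python
def landis_koch_label(kappa):
    """Return the Landis and Koch agreement label for a kappa value."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

For example, `landis_koch_label(0.33)` returns "fair" and `landis_koch_label(0.61)` returns "substantial".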
Finally, the accuracies of all researchers were assessed and compared, and the mode was determined for every descriptor in this set of data. This value was assumed to be the correct one for a given descriptor. Researchers who agreed with this value were given an accuracy value of 1; the rest were given an accuracy value of 0. Next, the total accuracy score for every researcher was calculated independently for every descriptor.
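One possible reading of this scoring rule is sketched below for a single descriptor. It assumes that the mode is taken across observers for each record; the function name, the data layout, and the tie-breaking rule (the first value encountered wins) are assumptions, as the text does not specify them.

```python
from collections import Counter

def accuracy_against_mode(scores):
    """Per-observer accuracy for one descriptor.

    scores[observer] is that observer's list of ratings, one per record.
    """
    observers = list(scores)
    n_records = len(scores[observers[0]])
    hits = {obs: 0 for obs in observers}
    for i in range(n_records):
        votes = [scores[obs][i] for obs in observers]
        # Most frequent rating for this record; ties go to the first value seen (assumption).
        mode = Counter(votes).most_common(1)[0][0]
        for obs in observers:
            hits[obs] += scores[obs][i] == mode   # 1 if the observer matches the mode, else 0
    return {obs: hits[obs] / n_records for obs in observers}

# Toy data for three hypothetical observers and three records.
example = accuracy_against_mode({"A": [1, 0, 1], "B": [1, 1, 1], "C": [0, 0, 1]})
```

Note that this measures agreement with the majority reading rather than with histopathology, so a high score does not by itself imply diagnostic correctness.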
Our randomly selected group consisted of 20 nodules in 18 patients (12 nodules were benign, 8 were malignant). The maximum length of the tumors ranged from 6 to 46 mm (mean length 9.7 ± 5.6 mm). Five of them were PTC (papillary thyroid cancer), two were FTC (follicular thyroid cancer) and one was MTC (medullary thyroid cancer). In the benign group, eight were verified by histology and most (7/8) of them were hyperplastic nodules (Fig. 1).
The results of the accuracy assessment of the five radiologists are presented in Table 2. The mean accuracy rates for all radiologists for all features ranged from 82.7 to 87.8%. All radiologists achieved accuracy rates ranging from 65.0 to 100% for B-mode examination, and from 47.4 to 86.8% for SE. The highest level of accuracy among all observers was noted when the following features were analyzed: macrocalcifications (from 90.0 to 100%), microcalcifications (from 85.0 to 100%), and evaluation of echo pattern in comparison to strap muscles (from 87.5 to 95.0%). The intra- and inter-observer variabilities for US, CDUS, and SE features of thyroid nodules are presented in Table 3.
Concerning intra-observer variability, almost perfect agreement was noted for three observers: the second, third, and fourth observers achieved mean κ-values of 0.82, 0.86, and 0.81, respectively. Substantial agreement in mean κ-values was noted for the first and fifth observers. Inter-observer agreement, demonstrated by κ-values, ranged from 0.61 for macrocalcifications (substantial agreement) to 0.33 for Asteria criteria (fair agreement).
Inter-observer variability for the majority of US features showed moderate agreement in the estimation of composition (κ = 0.55), echo pattern (Echo-Pa, Echo-Pb, Echo-M) (κ ranging from 0.48 to 0.50), capsule assessment (κ = 0.40) (Fig. 3A), and microcalcifications (κ = 0.57) (Fig. 2). When assessing vascularity, overall agreement was fair (κ = 0.34). The mean inter-observer agreement for all US and SE features was 0.42, corresponding to moderate agreement (Fig. 3B).
Ultrasonography is a widely accepted imaging technique that accurately detects thyroid nodules and architectural distortion. Over the past decade, significant improvements have been made in US machine technology and high-resolution probes. Therefore, US features specific to thyroid tumors such as lesion stiffness, microcalcification, vascular pattern or margins, can be observed with high accuracy. These US features enable better stratification of malignancy risk and were used to create several Thyroid Imaging Reporting and Data System (TIRADS) classifications, although none were used in clinical practice in Poland(9,28–32). The primary objective of our study was to evaluate inter- and intra-observer agreement for the selected US and SE features as a first step towards proposing the TIRADS classification.
In our study, the most substantial agreement was obtained when macrocalcifications were evaluated: κ was 0.61 for inter-observer agreement and between 0.64 and 1.0 for intra-observer agreement. Microcalcifications are more strongly associated with tumor malignancy than macrocalcifications(33); therefore, we assessed the two separately. For microcalcifications, we achieved moderate agreement. Our results are similar to those reported by Park et al.(18), who used the same definition: microcalcification ≤1 mm, macrocalcification >1 mm. In that study, five radiologists, each with one to six years of experience in the assessment of thyroid nodules, obtained κ-values of 0.40 for macrocalcifications and 0.54 for microcalcifications.
In another study, in which Park et al.(20) evaluated thyroid carcinomas only (51 of 52 were PTCs), calcifications were observed in over half of the nodules (depending on the observer, in between 34 and 42 of 52 nodules; 65.4–80.7%). These authors achieved similar results to those presented here, with κ-values ranging from 0.47 to 0.62 for all types of calcifications. In our study, microcalcifications were found in 7 of 8 thyroid carcinomas, while macrocalcifications were present in 2 of 8 thyroid carcinomas. Moreover, the rate of agreement in the assessment of calcifications in our study was comparable to the results of Choi et al.(5). The presence of microcalcifications in the sonographic pattern indicates the need for a biopsy and, more importantly, for accurate evaluation of the sample(6).
In this study, assessment of the final results revealed that disagreement in terms of microcalcifications appeared in nodules that were more normoechogenic or had hyperechogenic components (in the case of mixed echogenicity, where microcalcifications were present in the hyperechogenic component). This could affect the contrast between a spot-like <1 mm reflection and the surrounding solid parts of the nodule. Unfortunately, this disagreement was found in three malignant lesions: one PTC, one follicular variant of PTC, and one FTC (Fig. 2). The follicular variant of PTC and the FTC were normoechogenic, which could decrease the contrast mentioned above. The PTC was hypoechogenic, but its dimension was under 10 mm, which could be another limitation in the evaluation of microcalcifications.
To assess the echogenicity of a nodule, we compared it with the thyroid parenchyma or the strap muscles, or used the dominant echo pattern. Inter-observer analysis of this parameter revealed moderate agreement (κ-values ranging from 0.48 to 0.50). This result may be partially explained by the complex echogenicity of thyroid tumors and the background parenchyma. Data analysis revealed that, besides complex echogenicity, the structure of the nodule was also important: more disagreement occurred for nodules with a mixed solid-fluid structure. The size of the nodules also mattered. In terms of echogenicity in relation to parenchyma, there was more disagreement for large nodules filling the whole lobe than for smaller ones, possibly because less surrounding parenchyma was available for comparison. Choi et al. found fair agreement when these features were evaluated (κ-values were 0.34 for the first observer, 0.45 for the second), subdividing this category into only four types: hyperechoic, isoechoic, hypoechoic, and marked hypoechoic. In contrast, we observed almost perfect intra-observer agreement (κ-values ranging from 0.83 to 1), even though we provided a very detailed definition of this feature by assigning it seven possible characteristics (Tab. 1).
The characteristics of lesion margins are an important feature when evaluating malignancy. When differentiating between benign and malignant thyroid nodules, nodules with circumscribed margins are more likely to be benign. However, this feature has low sensitivity, as 33 to 93% of malignant tumors may also have circumscribed margins(34). It is difficult to identify margins when the surrounding thyroid gland is heterogeneous or the borders of the nodules overlap. The results presented by other researchers demonstrated a high degree of inter-observer variability when nodule margins were assessed(11). In our study, the margins could be described as either well-circumscribed or not circumscribed (lobular, spiculated, angular, jagged). Evaluation of this feature resulted in the lowest level of inter-observer agreement among the B-mode features (κ-value of 0.39) and satisfactory intra-observer agreement (κ-values ranging from 0.65 to 0.90). Choi et al. and Park et al. used the same categorization of margins and obtained similar results, with κ-values of 0.42 and 0.03–0.29, respectively(5,20). In our study, most of the disagreement concerned nodules positioned tangentially to the thyroid capsule, nodules located between the isthmus and a lobe, or nodules that brought out the capsule. These differences may be caused by the uneven thickness of the thyroid capsule in contact with the nodule.
The subsequent features assessed were the “halo” phenomenon and capsule invasion. Here, observer agreement was moderate, indicating that evaluation of these features is characterized by limited accuracy. Park et al. showed even less favorable results, with only fair agreement (κ-value of 0.32) for determination of capsule invasion(18). For both the “halo” phenomenon and capsule invasion, disagreement occurred mostly in large nodules that brought out the gland capsule, in nodules that were in contact with the capsule (Fig. 3A), and in nodules that were part of a nodule conglomerate.
In our study, the determination of lesion stiffness using a 4-grade scale was a difficult task for all observers: inter-observer agreement was only fair, with a κ-value of 0.33. Four radiologists experienced in SE assessment achieved levels of accuracy from 68.4 to 86.8%, whereas one radiologist, with only one year of experience, achieved a 47.4% level of accuracy (Fig. 3B). The fair agreement may therefore be associated with the lower experience of one of the researchers in this technique. In published papers, results vary between research centers(19,20). Friedrich-Rust et al. used the same 4-grade scale for qualitative SE and obtained substantial agreement between three observers (κ = 0.66). In contrast, Park et al. obtained only slight agreement between observers for this technique (ranging from κ = 0.08 to κ = 0.22)(20). However, a meta-analysis of SE from 2010 reported high mean values of sensitivity and specificity, 92% and 90%, respectively(34), testifying to the efficacy of this method, which, in our opinion, could be an important accessory in an experienced hand. In our study, all observers assessed the same copies of the original files (AVI videos) along with the B-mode real-time records. This may explain why accuracy was higher than in the study by Park et al., who used still images only. The results of our study suggest the need for a discussion concerning whether SE, which is still a new and rarely used technique, should be part of the lexicon in further research.
Our study had several limitations. We included patients from the Department of Oncological Endocrinology and Nuclear Medicine pre-diagnosed with suspicious nodules or in whom carcinomas were detected. Therefore, the group of patients differed from a general screening population; the proportion of malignant lesions in our group was 45%. This is a general limitation of most studies performed in endocrinology and oncology centers, in which there are generally high percentages of malignant cases. The proposed lexicon was very detailed and, despite previous training for all radiologists, some misunderstandings occurred. Our results suggest that too many US features were used for nodule assessment, and further research should reevaluate them. We used operator-dependent strain elastography, which has some limitations: probe placement in relation to the common carotid artery (CCA), with more noise when the probe is in transverse section close to the CCA; probe compression; the presence of rim calcifications or multiple macrocalcifications covering the nodule; and fluid parts of the nodule. However, although SWE is thought to be more operator independent, recent reports have pointed out that this technique also has operator-dependent artifacts and limitations(16). In addition, we used a single US machine and did not compare SE from different US systems. It could be assumed that the use of SE from different manufacturers could cause differences in the final results, but this should be further analyzed in a prospective study. Hence, the US machine software and hardware should be considered when creating the TIRADS lexicon.
In this study, five radiologists, each with at least six years of experience in thyroid B-mode imaging, assessed 40 records of 20 thyroid nodules, with relatively good inter-observer agreement and excellent intra-observer agreement in the assessment of thyroid nodules using US, and fair agreement in the case of sonoelastography. The highest disagreement was found for capsule invasion, the “halo” phenomenon, and the margins of large nodules, especially those filling most of the thyroid lobe and/or found in the vicinity of the thyroid capsule. In the case of microcalcifications, the differences appeared mostly in normoechogenic nodules or nodules with a hyperechogenic component.
Sonographers must be watchful when assessing margins and capsule invasion in large nodules that fill a significant part of the lobe or lie near the capsule, as well as when assessing microcalcifications in normoechogenic nodules or nodules with hyperechogenic components.
All results suggest relatively good inter-observer and excellent intra-observer agreement in the assessment of thyroid nodules using US, and fair agreement in the case of sonoelastography.