Staging cancer through text mining of pathology records

Published in Studies in Classification, Data Analysis, and Knowledge Organization, 2020

Recommended citation: P. Belloni, G. Boccuzzo, S. Guzzinati, I. Italiano, C. R. Rossi, M. Rugge, M. Zorzi. Staging Cancer Through Text Mining of Pathology Records. In: Mariani P., Zenga M. (eds). Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-51222-4_4

Abstract: Valuable information is stored in healthcare record systems, and over 40% of it is estimated to be unstructured, in the form of free clinical text. The Veneto Cancer Registry provided a collection of pathology records: these medical records refer to cases of melanoma and contain free text, in particular the diagnosis. The aim of this research is to extract from the free text the size of the primary tumour, the involvement of lymph nodes, the presence of metastasis, and the cancer stage. This goal is pursued with text mining techniques based on a supervised statistical approach. Since information extraction from free text can be framed as a statistical classification problem, we apply several machine learning models to extract the variables mentioned above from the text. A gold standard for these variables is available: the clinical records have already been assessed case by case by an expert. Of the estimated models, gradient boosting performs best. Despite its good performance, the classification error is not low enough to allow this kind of text mining procedure to be used in a Cancer Registry as proposed.
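
As a rough illustration of how extracting a variable from free text reduces to supervised classification, the sketch below turns each report into a bag of words and fits a very small gradient boosting model (squared loss, one-word decision stumps) against expert labels. It is not the authors' pipeline: the six reports are invented toy examples, the binary target "lymph node involvement" is a simplification of the variables named in the abstract, and the function names (`tokenize`, `fit_stump`, `fit_boosting`, `predict`) and parameter values are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a report and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_stump(X, residuals):
    """Pick the word whose presence/absence best splits the residuals (least squares)."""
    best = None
    for word in sorted(set().union(*X)):
        present = [r for x, r in zip(X, residuals) if word in x]
        absent = [r for x, r in zip(X, residuals) if word not in x]
        if not present or not absent:
            continue  # this word does not split the reports
        mean_p = sum(present) / len(present)
        mean_a = sum(absent) / len(absent)
        sse = (sum((r - mean_p) ** 2 for r in present)
               + sum((r - mean_a) ** 2 for r in absent))
        if best is None or sse < best[0]:
            best = (sse, word, mean_p, mean_a)
    return None if best is None else best[1:]

def fit_boosting(texts, labels, n_rounds=20, learning_rate=0.5):
    """Gradient boosting with squared loss: each stump is fitted to the current residuals."""
    X = [set(tokenize(t)) for t in texts]
    base = sum(labels) / len(labels)          # initial constant prediction
    preds = [base] * len(labels)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(labels, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        word, mean_p, mean_a = stump
        stumps.append(stump)
        preds = [p + learning_rate * (mean_p if word in x else mean_a)
                 for x, p in zip(X, preds)]
    return base, learning_rate, stumps

def predict(model, text):
    """Return 1 if the boosted score reaches 0.5, else 0."""
    base, learning_rate, stumps = model
    tokens = set(tokenize(text))
    score = base + sum(learning_rate * (mp if w in tokens else ma)
                       for w, mp, ma in stumps)
    return int(score >= 0.5)

# Invented toy reports (NOT registry data); label 1 = lymph nodes involved.
texts = [
    "melanoma with positive sentinel lymph node",
    "two lymph nodes positive for melanoma cells",
    "nodal deposit found, sentinel node positive",
    "superficial spreading melanoma, margins clear",
    "thin melanoma, sentinel lymph node negative",
    "nodular melanoma, nodes negative, margins clear",
]
labels = [1, 1, 1, 0, 0, 0]  # stands in for the expert gold standard

model = fit_boosting(texts, labels)
print(predict(model, "sentinel lymph node positive"))
print(predict(model, "lymph node negative, margins clear"))
```

On real pathology text a bag of words like this ignores negation and word order, which is one plausible source of the residual classification error that the abstract reports.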