Markus Grasmair, Department of Mathematical Sciences, NTNU

Cervical cancer is one of the most common types of cancer in women worldwide and accounts for over 300,000 deaths per year. Since the 1960s, however, the introduction of national screening programmes in high-income countries, in which all women are regularly called in, has led to a significant decrease in mortality. In the Nordic countries, for instance, it is estimated that about half of the malignant cervical cancer cases have been prevented through screening alone. In addition, the discovery of the link between cervical cancer and infection with the human papillomavirus (HPV), and the subsequent development of effective vaccines, is expected to lead to a further decrease in the future.
Personalised Cancer Screening
Still, not everybody is reached by the screening and vaccination programmes. It is also known that the risk of developing cervical cancer varies significantly between individuals, with personal lifestyle having a large effect. As a consequence, mass screening programmes whose fixed screening intervals depend only on age lead to a large number of unnecessary screenings of women at insignificant risk. At the same time, high-grade lesions in high-risk individuals might remain undetected and develop into cancer because screening is too infrequent. To prevent this simultaneous over-screening and under-treatment, it is desirable to introduce personalised screening strategies, in which screening intervals are based on an individual risk assessment. For this to be successful, though, it is necessary to have a good understanding of the factors that lead to the development of cervical cancer.
In the DeCipher project, funded by the Research Council of Norway and jointly led by the Cancer Registry of Norway (https://www.kreftregisteret.no/en/) and Simula Research Lab (https://www.simula.no), we have used various statistical and machine learning techniques to improve our knowledge about risk factors for the development of cervical cancer. For this, we relied on two complementary approaches that use different data sources.
The DeCipher Project
The first approach applied modern machine learning techniques to large-scale questionnaires that were sent out in 2004/05 and 2011/12 to selected women in Norway in connection with their screening. Using generalised low rank models (GLRMs), we tried to identify phenotypes of risk groups for cervical cancer, which may in the future be used to formulate refined screening strategies. The results indicated phenotypes related to the age of the first sexual partner, hormonal contraception, the number of sexual partners, and contraception usage [2].
The second approach relied on the vast existing databases of past screening results that have been built up over more than 60 years of mass screening programmes. In Norway alone, these contain the screening results of 1.8 million women. Here the question was to what extent an individual's future development can be inferred from past screening results alone. For this, we have tested different prediction methods based on matrix factorisation, geometric deep learning, and hidden Markov models, amongst others [1,3,4]. A major difficulty here is the scarcity and imbalance of the data: only a few screenings are available per woman, and the vast majority of them show normal results. Still, the results indicate that this approach might lead to improved screening recommendations in the future, especially if it can be coupled with the phenotyping results from the first approach.
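To give a rough idea of how matrix factorisation can fill in sparsely observed screening histories, here is a minimal sketch. It is not the method of [3]: it uses plain regularised alternating least squares on synthetic data, and the function name `factorize`, the parameters `rank` and `lam`, and the matrix layout (one row per woman, one column per time bin) are all assumptions made for illustration.

```python
import numpy as np

def factorize(X, mask, rank=2, lam=0.1, n_iters=30, seed=0):
    """Fit X ~ U @ V.T using only the entries where mask is True.

    Hypothetical helper: regularised alternating least squares.
    """
    rng = np.random.default_rng(seed)
    n, t = X.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(t, rank))
    reg = lam * np.eye(rank)
    for _ in range(n_iters):
        # Update each row factor from the columns observed in that row.
        for i in range(n):
            obs = mask[i]
            if obs.any():
                Vo = V[obs]
                U[i] = np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ X[i, obs])
        # Update each column factor from the rows observed in that column.
        for j in range(t):
            obs = mask[:, j]
            if obs.any():
                Uo = U[obs]
                V[j] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ X[obs, j])
    return U, V

# Synthetic stand-in for screening data (assumed, not real registry data):
# a low-rank "risk level" matrix of which only about 30% of entries are seen.
rng = np.random.default_rng(42)
n_women, n_bins, true_rank = 200, 20, 2
X_true = rng.normal(size=(n_women, true_rank)) @ rng.normal(size=(n_bins, true_rank)).T
mask = rng.random((n_women, n_bins)) < 0.3  # True where a screening was observed

U, V = factorize(np.where(mask, X_true, 0.0), mask, rank=true_rank)
X_hat = U @ V.T  # dense reconstruction, including unobserved entries

rmse_observed = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
rmse_baseline = np.sqrt(np.mean(X_true[mask] ** 2))  # predicting zero everywhere
print(f"observed-entry RMSE: {rmse_observed:.3f} (zero baseline: {rmse_baseline:.3f})")
```

In this toy setting the unobserved entries of `X_hat` play the role of predicted results at times when no screening took place. Real screening data are ordinal and heavily imbalanced towards normal results, which this sketch does not attempt to model.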

Characteristics of a subset of the screening data used for the training of our models.
Left: Lexis diagram illustrating different screening histories. A history is depicted as a gray line from first to last visit; each visit is indicated by a marker for the type and result of the exam. Middle: Histogram of the time between visits. Right: The proportion of women in the age groups 16-35/35-45/45+ with normal results (blue), low-grade lesions (orange), and high-grade lesions (red). The low number of abnormal results poses a significant challenge for machine learning algorithms.
References:
[1] Gogineni, V. C., Langberg, S. R., Naumova, V., Nygård, J. F., Nygård, M., Grasmair, M., & Werner, S. (2021, October). Recurrent Time-Varying Multi-Graph Convolutional Neural Network for Personalized Cervical Cancer Risk Prediction. In 2021 55th Asilomar Conference on Signals, Systems, and Computers (pp. 1541-1545). IEEE.
[2] Becker, F., Nygård, M., Nygård, J., Smilde, A., & Acar, E. (2022, May). Phenotyping of cervical cancer risk groups via generalized low-rank models using medical questionnaires. In Symposium of the Norwegian AI Society (pp. 94-110). Cham: Springer International Publishing.
[3] Langberg, G. S. R., Stapnes, M., Nygård, J. F., Nygård, M., Grasmair, M., & Naumova, V. (2022). Matrix factorization for the reconstruction of cervical cancer screening histories and prediction of future screening results. BMC Bioinformatics, 23(12), 1-15.
[4] Langberg, G. S. R., Nygård, J. F., Gogineni, V. C., Nygård, M., Grasmair, M., & Naumova, V. (2022). Towards a data-driven system for personalized cervical cancer risk stratification. Scientific Reports, 12(1), 12083.

