Abstract

Benefits for adjuvant chemotherapy of resected colon carcinoma examined by Artificial Intelligence Laboratory (NCI-Br) through synthetic data and machine learning: A tribute to Prof. Charles George Moertel.

Author
Rubens Kesley (Faculdade de Ciências Médicas, Rio De Janeiro, Brazil), Leonaldson Dos Santos Castro, José Humberto Simões Correa
Full text
Organizations: Faculdade de Ciências Médicas, Rio De Janeiro, Brazil; National Cancer Institute - Brazil, Niterói, Brazil; Instituto Nacional de Câncer - Brasil, Rio De Janeiro, Brazil
Research Funding: No funding sources reported
Background: Data are considered the basis of the modern economy, and the production of synthetic data can revolutionize the applicability of artificial intelligence in healthcare. However, few synthetic databases exist in biostatistics, and few mechanisms are capable of generating them. The objective of this study is to develop a synthetic database based on the seminal article by Moertel et al (1990), "Levamisole and Fluorouracil for Adjuvant Therapy of Resected Colon Carcinoma," which revolutionized oncological treatment.
Methods: An algorithm combining probabilistic and logical methods with specialized knowledge was used to reconstruct the dataset with the variables studied by Moertel et al (1990). Based on the percentages observed in that study and logical programming rules in the R language, we propose a model capable of simulating the original dataset with its epidemiological variables (age and sex) and anatomopathological variables (tumor location, depth of tumor invasion, involvement of adjacent organs, obstruction, perforation, peritoneal implants, number of metastatic lymph nodes, and histological differentiation type). A decreasing exponential function was used to simulate the decay of the probability density function of the original sample for recurrence and survival. The synthesized data were subjected to traditional statistical analysis and to machine learning using decision trees.
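The generation step described in the Methods can be illustrated with a minimal sketch. It is written in Python for illustration only (the study itself reports using R), and it is not the authors' algorithm: the variable names, the marginal percentages in `MARGINALS`, and the hazard rates in `HAZARD` are hypothetical placeholders, not values taken from Moertel et al (1990). It shows two ideas from the text: drawing categorical covariates from stated percentages, and drawing event times from a decreasing exponential survival function S(t) = exp(-hazard * t) by inverse-transform sampling.

```python
import math
import random

# Hypothetical marginal percentages (placeholders, NOT the published values).
MARGINALS = {
    "sex": {"male": 0.5, "female": 0.5},
    "obstruction": {"yes": 0.2, "no": 0.8},
    "positive_nodes": {"1-4": 0.7, ">4": 0.3},
}

# Hypothetical yearly recurrence hazard per lymph-node group (placeholders).
HAZARD = {"1-4": 0.10, ">4": 0.25}


def sample_category(rng, probs):
    """Draw one label according to its marginal probability."""
    labels = list(probs)
    return rng.choices(labels, weights=[probs[k] for k in labels], k=1)[0]


def sample_exponential_time(rng, hazard):
    """Inverse-transform sampling from S(t) = exp(-hazard * t)."""
    u = rng.random()  # uniform in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / hazard


def make_patient(rng, follow_up_years=5.0):
    """Build one synthetic record with covariates, time, and event flag."""
    patient = {name: sample_category(rng, probs) for name, probs in MARGINALS.items()}
    t = sample_exponential_time(rng, HAZARD[patient["positive_nodes"]])
    # Administrative censoring at the end of follow-up.
    patient["time"] = min(t, follow_up_years)
    patient["event"] = t <= follow_up_years
    return patient


def make_dataset(n, seed=0):
    """Generate n synthetic records reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_patient(rng) for _ in range(n)]
```

Calling `make_dataset(1000)` returns a list of dictionaries whose covariate frequencies approach the configured percentages as the sample grows, which is also how a sample could be amplified or reduced. Logical rules from a domain expert, such as forbidding clinically impossible combinations, would be applied as extra checks inside `make_patient`; none are shown here because the abstract does not list them.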
Results: Statistical analyses of the independent explanatory variables for relapse (invasion, metastatic lymph nodes, cellular differentiation) and survival (tumor location, invasion, metastatic lymph nodes, cellular differentiation) demonstrated statistical significance (p = 0.001). The behavior of the explanatory variables for recurrence and survival was similar to the recurrence and survival curves obtained with the Kaplan-Meier method (p = 0.001). Decision trees built with amplified samples demonstrated the importance of the number of lymph nodes, tumor invasion, and chemotherapy.
Conclusions: The proposed algorithm showed that it is possible to recreate a dataset with statistical similarities to the original, based on detailed analysis, knowledge of data science and analytics, mathematical methods, and rules determined by the domain expert. Studying a synthetic dataset allows evaluating the quality of the source article, amplifying or reducing the sample, mapping interrelationships between explanatory and dependent variables, and analyzing the dataset with artificial intelligence methods, demonstrating that artificial intelligence laboratories are essential for analyzing health data.
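The Kaplan-Meier curves referred to in the Results can be computed from any set of follow-up times and event flags. The sketch below is a plain-Python implementation of the standard product-limit estimator, given only as an illustration; the function name `kaplan_meier` and the toy inputs are assumptions, and the abstract does not state which software produced its curves.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator).

    times  -- follow-up time for each subject
    events -- True if the event (recurrence or death) was observed,
              False if the subject was censored at that time
    """
    data = sorted(zip(times, events))  # order subjects by follow-up time
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events at time t
        removed = 0   # events plus censorings at time t
        # Group all subjects that share the same time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                observed += 1
            removed += 1
            i += 1
        if observed:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve
```

For the toy input `kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])`, the estimate steps down to about 0.8, 0.6, and 0.3 at times 1, 2, and 3. Comparing such a curve from a synthetic sample with the published one is one way to check similarity to the original data.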
Clinical status
Pre-clinical
