Lung cancer remains the leading cause of cancer-related mortality worldwide, largely because most cases are detected at advanced stages. This study develops and validates multifactorial machine-learning models that integrate demographic, behavioural, psychological, symptom-based and comorbidity variables to identify individuals at high risk of lung cancer. An anonymised dataset of 13,000 subjects (74% lung-cancer positive), obtained from the public “Lung Cancer Patient Records” repository, was pre-processed through recoding, one-hot encoding and stratified train/test partitioning. To address class imbalance, the training subset was balanced with the Synthetic Minority Oversampling Technique (SMOTE). Three supervised algorithms (Logistic Regression, Random Forest and Extreme Gradient Boosting, XGBoost) were tuned via grid search with five-fold stratified cross-validation, optimising the area under the receiver-operating-characteristic curve (AUC). On the independent hold-out set, XGBoost achieved the best discrimination (AUC = 0.93), sensitivity (0.95) and F1-score (0.93), followed closely by Random Forest (AUC = 0.91). Univariate analyses confirmed significant associations (p
Keywords: lung cancer risk, machine learning, early detection, XGBoost, digital health
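
The resampling and evaluation steps summarised in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' code: the toy data are randomly generated, the helper names (`stratified_split`, `smote`, `roc_auc`) are hypothetical, and the SMOTE routine is a simplified interpolation variant. A real analysis would more plausibly rely on established implementations such as imbalanced-learn and scikit-learn. The sketch only shows the order of operations: stratify first, oversample the minority class in the training subset only, then score the untouched hold-out set by AUC.

```python
import random
from math import dist


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), preserving the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: each synthetic point lies on the segment between a
    random minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x), key=lambda p: dist(x, p)
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 200 subjects, 74% positive, two numeric features.
data_rng = random.Random(42)
y = [1] * 148 + [0] * 52
X = [[data_rng.gauss(float(label), 1.0), data_rng.gauss(float(label), 1.0)]
     for label in y]

train_idx, test_idx = stratified_split(y, test_frac=0.2, seed=1)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

# Oversample the minority class (here the negatives) in the training data only.
minority = [x for x, label in zip(X_train, y_train) if label == 0]
n_new = y_train.count(1) - len(minority)
X_bal = X_train + smote(minority, n_new)
y_bal = y_train + [0] * n_new

# Stand-in "model" (assumption): score each hold-out subject by its feature sum.
test_scores = [sum(X[i]) for i in test_idx]
auc = roc_auc(y_test, test_scores)
print(f"balanced training classes: {y_bal.count(0)} vs {y_bal.count(1)}")
print(f"hold-out AUC of the stand-in score: {auc:.2f}")
```

Because 74% of subjects are positive, the class being oversampled is the lung-cancer-negative group. Keeping the hold-out set outside the resampling step avoids leaking synthetic points into the evaluation.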