Abstract
Lung cancer remains one of the deadliest cancers
worldwide, and cigarette smoking continues to be its most significant risk factor. A major reason for its high fatality rate is that
many patients are diagnosed only after the disease has progressed
to its later stages, where treatment becomes far less effective.
The disease is also more common among older adults, with most
diagnoses occurring in individuals aged 65 and above. To support
earlier and more accurate risk assessment, this study explores the
use of machine learning for predicting lung cancer stages based
on a range of clinical, behavioral, and environmental factors. The
research uses the Smoking and Cancer Risk Analysis dataset,
which includes 3,000 patient records and 17 attributes. The
dataset covers demographic information, smoking intensity and
duration, alcohol use, physical activity, diet quality, secondhand
smoke exposure, air pollution levels, BMI, chronic symptoms,
family medical history, and the final cancer stage. Its combination
of numerical and categorical features makes it suitable for
statistical analysis and predictive modeling. Five widely used
machine learning algorithms (Logistic Regression (LR), Random
Forest, Gradient Boosting, AdaBoost, and Extra Trees) were trained
and evaluated using 3-fold cross-validation to obtain consistent
and reliable performance estimates. All models demonstrated strong
predictive capability. Binary LR achieved an accuracy of 99.5%.
Multinomial LR was slightly lower at 97.7%, and Random Forest
slightly higher than Multinomial LR at 98.17%. Gradient Boosting
reached 97.33%, followed by Extra Trees at 96.5% and AdaBoost
at 95.33%. Overall,
the findings highlight the effectiveness of tree-based ensemble
methods, particularly Random Forest and Gradient Boosting, for
accurately assessing lung cancer risk associated with long-term
smoking behavior.
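
The evaluation protocol described above (five classifiers compared under 3-fold cross-validation) can be illustrated with a minimal sketch. It assumes scikit-learn and uses a synthetic stand-in for the dataset: the 16 predictors, the four-class target, the hyperparameters, and the stratified shuffled folds are all illustrative assumptions, not the study's actual configuration, so the scores it prints will not match the reported accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (assumption): 3,000 rows, 16 predictors,
# and a 4-class label standing in for the cancer stage.
X, y = make_classification(
    n_samples=3000,
    n_features=16,
    n_informative=8,
    n_classes=4,
    random_state=0,
)

# The five algorithm families named in the abstract, with illustrative
# (not the study's) hyperparameters. LR is scaled first so it converges.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# 3-fold cross-validation; stratification keeps class proportions
# similar in every fold (an assumption about how folds were drawn).
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Mean accuracy across the three held-out folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} mean accuracy = {acc:.4f}")
```

Using the same `cv` object for every model means all classifiers are scored on identical folds, which keeps the comparison between them fair.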