Lung cancer prediction from survey data: what a small, imbalanced dataset taught me

This was one of my first "real" ML projects — a Jupyter notebook that takes a survey dataset (lifestyle factors, symptoms, age, etc.) and predicts whether a respondent is likely to have lung cancer. Same caveats as my other healthcare project: I'm a student, this is educational, do not use this to diagnose anyone.

I'm writing this up because the dataset taught me three things I hadn't really understood before:

Accuracy is a liar on imbalanced data.
SMOTE is a tool, not a magic wand.
Simple models often win.

The dataset

A few hundred rows of survey responses with categorical features (yes/no for things like smoking, fatigue, wheezing) and a binary target (lung cancer: yes/no). The class distribution was heavily skewed — way more positives than negatives, actually, which was the opposite of what I expected. Real-world skews can go either direction.

Why accuracy lied to me

My first model was Logistic Regression, untouched dataset. Accuracy: 87%. I almost wrote "done!" in a commit message and moved on.

Then I looked at the confusion matrix. The model correctly identified almost every positive case and missed roughly half the negatives. On an imbalanced dataset, high accuracy can mean little more than "guess the majority class." If 87% of your data is one class, always predicting that class gives you 87% accuracy. That's not a model, it's a constant.

That moment is the entire reason I now report precision, recall, F1, and a confusion matrix on every classification project, no exceptions. Accuracy on its own is a vibe, not a metric.
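To make the majority-class trap concrete, here is a minimal sketch in plain Python. The 87/13 split is made up to match the 87% figure above, not taken from the real dataset, and the function names are mine. It scores a "model" that always predicts the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return 0.0 instead of raising when a metric has an empty denominator."""
    return num / den if den else 0.0


def report(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # recall on the positive class
    specificity = safe_div(tn, tn + fp)   # recall on the negative class
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical data: 87 positives, 13 negatives (illustrative only).
y_true = [1] * 87 + [0] * 13
always_positive = [1] * len(y_true)

baseline = report(y_true, always_positive)
# accuracy is 0.87, specificity is 0.0, balanced accuracy is 0.5
```

One thing this shows: when the positives are the majority, positive-class precision, recall, and F1 all look healthy for the constant predictor too. The damage only shows up in the negative-class recall (specificity) and in balanced accuracy.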

SMOTE: useful, but not a wand

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples of the minority class by interpolating between real ones. I applied it. The model got better on minority recall — by a meaningful amount.

What I didn't realize until I read more carefully: SMOTE can make your model overconfident if you're not careful. Synthetic samples sit between real samples in feature space, so the model learns the boundary as if those interpolated points are real observations. They aren't. On a tiny dataset like this one, that's a real risk.
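The interpolation idea fits in a few lines. This is a simplified, SMOTE-style sketch in plain Python, not the imbalanced-learn implementation and not the notebook's code. The function name, the toy rows, and the choice of k are all assumptions for illustration.

```python
import random


def smote_like(minority_rows, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        others = [row for row in minority_rows if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # how far to move from base towards the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy yes/no survey answers encoded as 0/1 (made-up rows).
minority = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
synthetic = smote_like(minority, n_new=5)
```

Every synthetic value lands somewhere between the two real answers, so a yes/no feature can come out as a fraction that no respondent could have given. That is the "these points aren't real observations" problem in miniature. imbalanced-learn also ships SMOTENC and SMOTEN variants aimed at categorical features, which may be a better fit for data like this.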

The fix is to only SMOTE the training set (never the test set) and to be careful about cross-validation — apply SMOTE inside each fold, not before splitting. I got this wrong on my first attempt. AI caught it when I described my pipeline.
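The ordering is easier to see in code. Below is a plain-Python sketch of "split first, resample only the training part of each fold". It assumes a simple duplicate-the-minority resampler as a stand-in for SMOTE and a placeholder in place of real model fitting. All names and data here are hypothetical.

```python
import random


def stratified_folds(labels, n_splits, seed=0):
    """Split row indices into n_splits folds with similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % n_splits].append(i)
    return folds


def duplicate_minority(rows, labels, rng):
    """Stand-in resampler: repeat minority rows until the classes match.
    A real pipeline would call SMOTE at this point instead."""
    counts = {cls: labels.count(cls) for cls in set(labels)}
    minority = min(counts, key=counts.get)
    pool = [row for row, y in zip(rows, labels) if y == minority]
    n_extra = max(counts.values()) - counts[minority]
    extra = [rng.choice(pool) for _ in range(n_extra)]
    return rows + extra, labels + [minority] * n_extra


def cross_validate(rows, labels, n_splits, fit_and_score, seed=0):
    rng = random.Random(seed)
    scores = []
    for fold in stratified_folds(labels, n_splits, seed):
        held_out = set(fold)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        test_rows = [rows[i] for i in fold]
        test_labels = [labels[i] for i in fold]
        # Resample AFTER splitting, and only the training part of the fold.
        train_rows, train_labels = duplicate_minority(train_rows, train_labels, rng)
        scores.append(fit_and_score(train_rows, train_labels, test_rows, test_labels))
    return scores


# Made-up data: 32 positives, 8 negatives.
rows = [[i % 2, (i // 2) % 2] for i in range(40)]
labels = [1] * 32 + [0] * 8


def describe(train_rows, train_labels, test_rows, test_labels):
    # Placeholder for fitting and scoring: reports what each stage would see.
    return (train_labels.count(0), train_labels.count(1), len(test_rows))


fold_summaries = cross_validate(rows, labels, n_splits=4, fit_and_score=describe)
```

Each training fold ends up balanced while each test fold keeps only original rows at the original class ratio, so the score reflects data the model could actually meet. With imbalanced-learn, putting the sampler and the classifier into that library's own Pipeline gives the same ordering, because samplers there run during fitting and are skipped at prediction time.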

Feature engineering

I added polynomial features and used mutual information for selection. Mutual information is great for survey data because most features are categorical and the relationships aren't linear. AI was useful here for explaining why mutual information makes sense for categorical-target tasks — I'd been defaulting to chi-squared without thinking.
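For intuition about what the score measures, here is mutual information between one categorical feature and the target, computed by hand in plain Python. This is a sketch of the textbook definition with made-up toy columns. It is not the notebook's code, which would more likely call a library routine such as scikit-learn's mutual_info_classif.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """Mutual information (in nats) between two discrete sequences:
    sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    p_feature = Counter(feature)
    p_target = Counter(target)
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        total += p_xy * math.log(p_xy / ((p_feature[x] / n) * (p_target[y] / n)))
    return total


# Toy columns (hypothetical): one feature mirrors the target, one ignores it.
target = [0, 0, 1, 1]
mirrors_target = [0, 0, 1, 1]
ignores_target = [0, 1, 0, 1]

mi_informative = mutual_information(mirrors_target, target)    # log(2), about 0.693
mi_uninformative = mutual_information(ignores_target, target)  # 0.0
```

The score is zero when knowing the answer tells you nothing about the label, and it grows as the answer narrows the label down. It makes no assumption about the shape of the relationship, which is why it suits yes/no survey features.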

Polynomial features helped a little. Not as much as I hoped. The signal in survey data tops out fast.

The model bake-off

I compared seven approaches:

Logistic Regression (baseline)
KNN
Decision Tree
SVM
Naive Bayes
Random Forest
A stacking ensemble of the above

Random Forest and the stacking ensemble were the best on F1 and balanced accuracy. KNN was the worst (small dataset + many categorical features = KNN suffers). Naive Bayes did surprisingly well, which makes sense in hindsight: Naive Bayes suits categorical features that are roughly independent of each other.

The takeaway: the bake-off itself is the value. Don't pick a model and stick with it. Run five. Compare on a metric that's appropriate for your imbalance.

What AI specifically helped with

Catching the SMOTE-before-split mistake. That bug is silent — your reported metrics look great because the test set is contaminated. AI caught it from a description of my pipeline. I'm grateful.
Explaining mutual information vs chi-squared. I'd been using chi-squared because every tutorial uses it. AI walked me through when mutual information is a better fit.
Writing the disclaimer. Healthcare datasets get used in scary ways. I asked Claude to help me phrase the README disclaimer in a way that was honest without being preachy. I think the result is okay.

What's next

The notebook is fine as a learning artifact, but it's not reproducible. The next step is extracting the pipeline into a Python script with a Makefile, so anyone can run "make train" and get my numbers. I also want a tiny static dashboard that shows the model tradeoffs visually: confusion matrices, ROC curves, the band table.

If you're a student starting on classification: respect the imbalance. Accuracy is the easiest metric to compute and the easiest one to be wrong about.
