© 2026 Naveen Kumar Pendyam. All rights reserved.

    machine-learning
    python
    scikit-learn
    healthcare
    student-journey

    Lung cancer prediction from survey data: what a small, imbalanced dataset taught me

    Naveen Kumar Pendyam
    Thursday, May 29, 2025

    This was one of my first "real" ML projects — a Jupyter notebook that takes a survey dataset (lifestyle factors, symptoms, age, etc.) and predicts whether a respondent is likely to have lung cancer. Same caveats as my other healthcare project: I'm a student, this is educational, do not use this to diagnose anyone.

    I'm writing this up because the dataset taught me three things I hadn't really understood before:

    1. Accuracy is a liar on imbalanced data.
    2. SMOTE is a tool, not a magic wand.
    3. Simple models often win.

    The dataset

    A few hundred rows of survey responses with categorical features (yes/no for things like smoking, fatigue, wheezing) and a binary target (lung cancer: yes/no). The class distribution was heavily skewed — way more positives than negatives, actually, which was the opposite of what I expected. Real-world skews can go either direction.

    Why accuracy lied to me

    My first model was Logistic Regression on the untouched dataset. Accuracy: 87%. I almost wrote "done!" in a commit message and moved on.

    Then I looked at the confusion matrix. The model was correctly identifying almost every positive case and missing roughly half the negatives. On an imbalanced dataset, a high accuracy can mean little more than "guess the majority class." If 87% of your data is one class, always predicting that class gives you 87% accuracy. That's not a model, it's a constant.

    That moment is the entire reason I now report precision, recall, F1, and a confusion matrix on every classification project, no exceptions. Accuracy on its own is a vibe, not a metric.
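
    To make the majority-class trap concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (87 positives and 13 negatives, echoing the 87% figure above) and are not the actual dataset; the helper names are hypothetical.

```python
# Made-up labels for illustration only: 87 positives (1), 13 negatives (0).
y_true = [1] * 87 + [0] * 13

# A "model" that ignores its input and always predicts the majority class.
y_pred = [1] * len(y_true)


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating 1 as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def report(y_true, y_pred):
    """Compute accuracy next to per-class recall and balanced accuracy."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall_positive": recall_pos,
        "recall_negative": recall_neg,
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
    }


metrics = report(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

    Accuracy comes out at 0.87, but recall on the negative class is 0.00 and balanced accuracy is 0.50, which is the score of a model that has learned nothing about the minority class.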

    SMOTE: useful, but not a wand

    SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples of the minority class by interpolating between real ones. I applied it. The model got better on minority recall — by a meaningful amount.

    What I didn't realize until I read more carefully: SMOTE can make your model overconfident if you're not careful. Synthetic samples sit between real samples in feature space, so the model learns the boundary as if those interpolated points are real observations. They aren't. On a tiny dataset like this one, that's a real risk.
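
    The interpolation step can be sketched in a few lines. This is a simplified illustration, not the imbalanced-learn implementation: real SMOTE picks the neighbor from the k nearest minority samples, while here the two rows are hand-picked, hypothetical respondents and the function name is my own.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the line segment between two real points."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)

# Hypothetical minority-class rows: [smoking (0/1), wheezing (0/1), age].
row_a = [0.0, 1.0, 35.0]
row_b = [1.0, 1.0, 55.0]

synthetic = smote_like_sample(row_a, row_b, rng)
print(synthetic)
```

    The synthetic row keeps wheezing at 1.0 because both parents agree, but smoking lands somewhere between 0 and 1, an answer no respondent actually gave. That is the sense in which interpolated points are not real observations.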

    The fix is to only SMOTE the training set (never the test set) and to be careful about cross-validation — apply SMOTE inside each fold, not before splitting. I got this wrong on my first attempt. AI caught it when I described my pipeline.
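
    Here is a rough sketch of what "resample inside each fold" looks like. To keep it runnable with only NumPy and scikit-learn, it swaps SMOTE for plain random oversampling (duplicating minority rows); the data is synthetic, and the function name is hypothetical. The point is the order of operations, not the resampler.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random minority rows until both classes are the same size.

    Stand-in for SMOTE. With imbalanced-learn installed, this call could be
    replaced by SMOTE(random_state=0).fit_resample(X, y).
    """
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


# Synthetic stand-in data: 300 rows, 8 yes/no features, roughly 87% positives.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8)).astype(float)
y = (rng.random(300) < 0.87).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in cv.split(X, y):
    # Split first, then resample the training fold only.
    X_train, y_train = oversample_minority(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # The test fold is untouched, so it keeps the real class distribution.
    y_hat = model.predict(X[test_idx])
    fold_scores.append(balanced_accuracy_score(y[test_idx], y_hat))

print(f"mean balanced accuracy: {np.mean(fold_scores):.3f}")
```

    The labels here are random, so the score should hover around chance. With imbalanced-learn, the same ordering can be had by putting the sampler in an imblearn.pipeline.Pipeline and passing that pipeline to cross-validation.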

    Feature engineering

    I added polynomial features and used mutual information for selection. Mutual information is great for survey data because most features are categorical and the relationships aren't linear. AI was useful here for explaining why mutual information makes sense for categorical-target tasks — I'd been defaulting to chi-squared without thinking.

    Polynomial features helped a little. Not as much as I hoped. The signal in survey data tops out fast.

    The model bake-off

    I compared seven approaches:

    • Logistic Regression (baseline)
    • KNN
    • Decision Tree
    • SVM
    • Naive Bayes
    • Random Forest
    • A stacking ensemble of the above

    Random Forest and the stacking ensemble were the best on F1 and balanced accuracy. KNN was the worst (small dataset + many categorical features = KNN suffers). Naive Bayes did surprisingly well, though in hindsight it makes sense: Naive Bayes copes well with roughly independent categorical features.

    The takeaway: the bake-off itself is the value. Don't pick a model and stick with it. Run five. Compare on a metric that's appropriate for your imbalance.
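
    A bake-off like this fits in one loop. The sketch below is not the notebook's code: it uses synthetic data from make_classification, default hyperparameters, GaussianNB as the Naive Bayes variant, and macro-averaged F1, all of which are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data: 300 rows, about 87% positives.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    weights=[0.13, 0.87], random_state=0,
)

base_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=0),
}

# The stacking ensemble combines all six base models.
models = dict(base_models)
models["stacking"] = StackingClassifier(
    estimators=list(base_models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
)

# Same stratified folds for every model, so the comparison is fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["f1_macro", "balanced_accuracy"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = (
        scores["test_f1_macro"].mean(),
        scores["test_balanced_accuracy"].mean(),
    )

ordered = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
for name, (f1, bal_acc) in ordered:
    print(f"{name:20s} f1_macro={f1:.3f} balanced_accuracy={bal_acc:.3f}")
```

    The ranking on synthetic data says nothing about the ranking on the survey. What carries over is the structure: one set of folds, one set of imbalance-aware metrics, every model scored the same way.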

    What AI specifically helped with

    • Catching the SMOTE-before-split mistake. That bug is silent — your reported metrics look great because the test set is contaminated. AI caught it from a description of my pipeline. I'm grateful.
    • Explaining mutual information vs chi-squared. I'd been using chi-squared because every tutorial uses it. AI walked me through when mutual information is a better fit.
    • Writing the disclaimer. Healthcare datasets get used in scary ways. I asked Claude to help me phrase the README disclaimer in a way that was honest without being preachy. I think the result is okay.

    What's next

    The notebook is fine as a learning artifact, but it's not reproducible. The next step is extracting the pipeline into a Python script with a Makefile, so anyone can run "make train" and get my numbers. I also want a tiny static dashboard that shows the model tradeoffs visually: confusion matrices, ROC curves, the band table.

    If you're a student starting on classification: respect the imbalance. Accuracy is the easiest metric to compute and the easiest one to be wrong about.
