2022 · Machine Learning
Adult Income Classification
Machine Learning Classification Python
BITS Pilani | Jul 2022 – Aug 2022
Python Scikit-Learn XGBoost Feature Engineering UCI Dataset
Project Overview
Developed a machine learning pipeline to predict whether an individual’s income exceeds $50K from census data, comparing several classification algorithms and feature engineering techniques.
Key Contributions
- Data Preprocessing: Performed exploratory data analysis (EDA) on the UCI Adult Census dataset (48,842 instances), handled missing values (7%, missing completely at random, MCAR), and encoded categorical features
- Model Comparison: Benchmarked Logistic Regression, Random Forest, and Gradient Boosting classifiers; Gradient Boosting performed best at 87% accuracy and 0.82 AUC-ROC
- Feature Analysis: Identified education level, capital gain, and occupation as the top predictors using SHAP analysis and permutation importance
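The preprocessing step could look roughly like the sketch below. It is not the project's actual code: the toy rows, the column subset, the median and most-frequent imputation strategies, and the one-hot encoding are all assumptions chosen for illustration. The only detail taken from the dataset itself is that the raw UCI Adult files mark missing values with `?`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hand-made rows standing in for the Adult data (illustrative values only).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "hours_per_week": [40, 13, 40, 40, 40, 40],
    "workclass": ["State-gov", "Self-emp", "Private", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "Exec-managerial", "?",
                   "Handlers-cleaners", "Prof-specialty", "Exec-managerial"],
})

# Turn the "?" placeholders into proper missing values.
toy = toy.replace("?", np.nan)

numeric_cols = ["age", "hours_per_week"]
categorical_cols = ["workclass", "occupation"]

# Assumed strategy: median for numeric columns, most frequent value for
# categorical columns, followed by scaling and one-hot encoding.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols),
     ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(toy)
print(X.shape)
```

Wrapping the steps in a `ColumnTransformer` keeps imputation and encoding inside the model pipeline, so statistics such as the median are learned from training folds only.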
Technologies Used
- Languages: Python
- Libraries: Scikit-Learn, XGBoost, pandas
- Methods: Ensemble Learning, Feature Engineering
- Dataset: UCI Adult Census (48,842 instances)
Key Results
| Metric | Value |
|---|---|
| Accuracy | 87% |
| AUC-ROC | 0.82 |
| Dataset Size | 48,842 instances |
| Missing Data | 7% (MCAR) |
| Features | 14 |
Feature Importance (Top 5)
- Education level
- Capital gain
- Occupation
- Hours per week
- Age
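A ranking like the one above can be produced with permutation importance, one of the two methods named in this write-up. The sketch below uses scikit-learn's `permutation_importance` on synthetic data rather than the census features; the feature names `signal`, `noise_a`, and `noise_b` are hypothetical, and the label is constructed so that only the first column is informative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in: column 0 drives the label, columns 1 and 2 are noise.
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in AUC-ROC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)

# Sort features from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Computing the importance on a held-out split, as here, measures what the model relies on for generalization rather than what it memorized during training.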
Models Compared
| Model | Accuracy | AUC-ROC |
|---|---|---|
| Logistic Regression | 84% | 0.78 |
| Random Forest | 86% | 0.80 |
| Gradient Boosting | 87% | 0.82 |
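A comparison of this kind can be sketched as a loop over candidate models scored on the same held-out split. The example below is illustrative only: it runs on synthetic data from `make_classification` (14 features, matching the feature count above), uses scikit-learn's `GradientBoostingClassifier` as a stand-in for whichever boosting implementation the project used, and picks hyperparameters arbitrarily, so its scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for the census data.
X, y = make_classification(
    n_samples=1000, n_features=14, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate models; hyperparameters are arbitrary choices for the sketch.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC-ROC is computed from predicted probabilities of the positive class,
    # accuracy from the hard class predictions.
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc_roc": roc_auc_score(y_test, proba),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, auc_roc={scores['auc_roc']:.3f}")
```

Scoring every model on the same stratified split keeps the comparison fair; cross-validation would give more stable estimates at higher cost.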
