2022 · Machine Learning

Adult Income Classification

Machine Learning · Classification · Python

BITS Pilani | Jul 2022 – Aug 2022

Python · Scikit-Learn · XGBoost · Feature Engineering · UCI Dataset

Project Overview

Developed a machine learning pipeline to predict whether an individual’s income exceeds $50K from census data, comparing several classification algorithms and feature engineering techniques.

Key Contributions

  • Data Preprocessing: Performed exploratory data analysis (EDA) on the UCI Adult Census dataset (48,842 instances), handled missing values (7%, MCAR), and encoded categorical features.
  • Model Comparison: Benchmarked Logistic Regression, Random Forest, and Gradient Boosting classifiers; Gradient Boosting performed best, at 87% accuracy and 0.82 AUC-ROC.
  • Feature Analysis: Identified education level, capital gain, and occupation as the top predictors using SHAP analysis and permutation importance.
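The preprocessing step described above (imputing missing values, then encoding categorical features) could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual code: the toy rows, the column names, and the choice of median and most-frequent imputation are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a few Adult Census columns (values are made up for illustration).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "hours_per_week": [40, 13, 40, 40, 40, 45],
    "workclass": ["State-gov", "Self-emp", "Private", np.nan, "Private", "Private"],
    "occupation": ["Adm-clerical", "Exec-managerial", np.nan,
                   "Handlers", "Prof-specialty", "Exec-managerial"],
}).astype({"workclass": object, "occupation": object})

numeric_cols = ["age", "hours_per_week"]
categorical_cols = ["workclass", "occupation"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
# (Assumed strategy; the source only says missing values were handled.)
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X = preprocess.fit_transform(toy)
print(X.shape)
```

Wrapping the steps in a `ColumnTransformer` means the same fitted imputation and encoding can be reused on held-out data, which avoids leaking test-set statistics into training.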

Technologies Used

  • Languages: Python
  • Libraries: Scikit-Learn, XGBoost, pandas, SHAP
  • Methods: Ensemble Learning, Feature Engineering
  • Dataset: UCI Adult Census (48,842 instances)

Key Results

Metric          Value
Accuracy        87%
AUC-ROC         0.82
Dataset Size    48,842 instances
Missing Data    7% (MCAR)
Features        14

Feature Importance (Top 5)

  1. Education level
  2. Capital gain
  3. Occupation
  4. Hours per week
  5. Age
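A ranking like the one above can be produced with permutation importance, one of the two methods named in the contributions. The sketch below is illustrative only: it uses synthetic data in which a single hypothetical feature determines the label, and a Random Forest as an assumed model, so it shows the mechanics rather than reproducing the project's ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic features: only the first one drives the label (names are hypothetical).
informative = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
X = np.column_stack([informative, noise_a, noise_b])
y = (informative > 0).astype(int)
feature_names = ["informative", "noise_a", "noise_b"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0
)

# Sort features from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Computing the importances on held-out data rather than the training set keeps the scores tied to generalization rather than to what the model memorized.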

Models Compared

Model                  Accuracy    AUC-ROC
Logistic Regression    84%         0.78
Random Forest          86%         0.80
Gradient Boosting      87%         0.82
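A comparison of this kind can be sketched as a loop that fits each model on the same split and records accuracy and AUC-ROC. The example below makes several assumptions: it runs on synthetic data from `make_classification` rather than the Adult dataset, uses default-ish hyperparameters, and uses scikit-learn's `GradientBoostingClassifier` as a stand-in, since the source does not say which implementation produced the Gradient Boosting row. Its numbers will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a placeholder for the real features).
X, y = make_classification(
    n_samples=1000, n_features=14, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # AUC-ROC needs scores, so use the predicted probability of the positive class.
    scores = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "auc_roc": roc_auc_score(y_test, scores),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"auc_roc={results[name]['auc_roc']:.3f}")
```

Evaluating every model on one fixed, stratified split keeps the comparison fair; cross-validation would give more stable estimates at a higher compute cost.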