US Household Income - Data Cleaning & Exploratory Analysis
OBJECTIVE: Prepare, clean, and analyze U.S. household income data to uncover geographic income patterns across states and counties. This project focuses on resolving data quality issues and conducting exploratory analysis to support insights into income distribution, land and water area, and regional disparities.
BACKGROUND: Accurate household income data is essential for economic and social policy planning. However, raw datasets often contain issues such as misspellings, duplicate records, and missing values that hinder meaningful analysis. This project addresses those challenges through a structured data cleaning process, followed by SQL-based exploration of trends in income levels and geographic attributes.
By addressing data quality and performing exploratory queries, this project enables analysis of important questions such as:
Which five U.S. states, districts, or territories have the highest average household income?
Which five U.S. states, districts, or territories have the lowest average household income?
What are the top 10 largest states by land area?
What are the top 10 largest states by water area?
TECH STACK:
SQL: Used for all cleaning, transformation, and analysis
Window Functions: Used
ROW_NUMBER()
to identify and remove duplicatesUpdate & Filtering Queries: Combined datasets and computed averages
Joins & Aggregations: Addressed blank cells and data categorization
Microsoft Excel: Used to review raw data and verify results post-cleaning
PROCESS:
Data Cleaning in SQL:
◇ Duplicates
- Identified repeatedid
values usingROW_NUMBER()
- Removed non-unique rows, preserving one entry perid
◇ Misspellings & Standardization
- Corrected state name issues like'georia'
→'Georgia'
- Normalized inconsistentType
values such as'Boroughs'
→'Borough'
◇ Missing Values
- Located blankPlace
fields and updated based on matching county/city
- Ensured every location entry was complete for joining datasetsExploratory Analysis in SQL:
◇ Geographic Size Analysis
- Summed land (ALand
) and water (AWater
) areas by state
- Ranked top 10 states by land and water area separately
◇ State Income Ranking
- Joined household and statistics tables onid
- Calculated averageMean
andMedian
incomes per state
- Identified top 5 and bottom 5 states by average income
KEY INSIGHTS:
Cleaned Dataset: All entries now have valid IDs, correct state names, and filled place fields for reliable analysis
Top Income States: States like Maryland and D.C. rank highest in average income, while Mississippi and West Virginia rank lowest
Income Gaps: Clear regional disparities exist between average mean and median incomes
Land vs. Water Comparison: Alaska dominates in both land and water area, while states like Michigan show large water area proportions
CHALLENGES & SOLUTIONS:
⚠️ Challenge 1: Duplicate id
entries in raw data
✅ Solution: Used window functions to isolate and delete duplicates
⚠️ Challenge 2: Misspelled and inconsistent categorical fields
✅ Solution: Manually reviewed and updated known issues using UPDATE
⚠️ Challenge 3: Missing location values preventing accurate joins
✅ Solution: Filled blank Place
fields by inferring from context (County, City)
⚠️ Challenge 4: Outliers and zero income values skewing results
✅ Solution: Excluded zero income values in average calculations
DATA SOURCES:
Dataset: us_household_income_raw.csv
Source: Public dataset on U.S. household income across ZIP codes, counties, and states
Timeframe: Covers all 50 states, Washington D.C., and Puerto Rico grouped by household type, geography, and demographic classification
DATA DICTIONARY:
id: Unique identifier for each household record
State_Name: Full name of the state (e.g., California)
State_ab: Two-letter abbreviation (e.g., CA)
Place: Specific sub-region within a city/county
Type: Category like City, Town, or Borough
ALand: Area of land in square meters
AWater: Area of water in square meters
Mean: Average household income
Median: Middle value of household income
Primary: Indicator of whether this record is the primary income entry
Type (Statistics): Classification of income data (e.g., All Households)
VIEW THE SQL CODE:
Want to see how the cleaning process and analysis were built in SQL?
👉 Click here to view the full SQL cleaning code — includes all cleaning operations and data fixes
👉 Click here to view the full SQL analysis code — covers exploratory queries, joins, and income analysis
The codes showcase a clear and reproducible workflow for transforming raw, messy household income data into a structured and analysis-ready format.
The GitHub repository is fully transparent, organized, and well-commented to guide you through each step of the project.
VISUAL WALKTHROUGH: The SQL queries shown below address only the key questions explored in this project. For additional queries including full joins, data validation steps, and state-level income breakdowns. Please refer to the full SQL code linked above in the View the SQL Code section.
Click on any image to enlarge it to full screen.
QUESTION 1:
Which five U.S. states, districts, or territories have the highest average household income?
ANSWER:
See SQL query output in the image above.
QUESTION 3:
What are the top 10 largest states by land area?
ANSWER:
See SQL query output in the image above.
QUESTION 2:
Which five U.S. states, districts, or territories have the lowest average household income?
ANSWER:
See SQL query output in the image above.
QUESTION 4:
What are the top 10 largest states by water area?
ANSWER:
See SQL query output in the image above.