US Household Income - Data Cleaning & Exploratory Analysis

ExcelSQL

Mar 11

Written By Midnight Investor | Trader & Investor

OBJECTIVE: Prepare, clean, and analyze U.S. household income data to uncover geographic income patterns across states and counties. This project focuses on resolving data quality issues and conducting exploratory analysis to support insights into income distribution, land and water area, and regional disparities.

BACKGROUND: Accurate household income data is essential for economic and social policy planning. However, raw datasets often contain issues such as misspellings, duplicate records, and missing values that hinder meaningful analysis. This project addresses those challenges through a structured data cleaning process, followed by SQL-based exploration of trends in income levels and geographic attributes.

By addressing data quality and performing exploratory queries, this project enables analysis of important questions such as:

Which five U.S. states, districts, or territories have the highest average household income?
Which five U.S. states, districts, or territories have the lowest average household income?
What are the top 10 largest states by land area?
What are the top 10 largest states by water area?

TECH STACK:

SQL: Used for all cleaning, transformation, and analysis
Window Functions: Used ROW_NUMBER() to identify and remove duplicates
Update & Filtering Queries: Combined datasets and computed averages
Joins & Aggregations: Addressed blank cells and data categorization
Microsoft Excel: Used to review raw data and verify results post-cleaning

PROCESS:

Data Cleaning in SQL:
◇ Duplicates
- Identified repeated id values using ROW_NUMBER()
- Removed non-unique rows, preserving one entry per id
◇ Misspellings & Standardization
- Corrected state name issues like 'georia' → 'Georgia'
- Normalized inconsistent Type values such as 'Boroughs' → 'Borough'◇ Missing Values
- Located blank Place fields and updated based on matching county/city
- Ensured every location entry was complete for joining datasets
Exploratory Analysis in SQL:
◇ Geographic Size Analysis
- Summed land (ALand) and water (AWater) areas by state
- Ranked top 10 states by land and water area separately
◇ State Income Ranking
- Joined household and statistics tables on id
- Calculated average Mean and Median incomes per state
- Identified top 5 and bottom 5 states by average income

KEY INSIGHTS:

Cleaned Dataset: All entries now have valid IDs, correct state names, and filled place fields for reliable analysis
Top Income States: States like Maryland and D.C. rank highest in average income, while Mississippi and West Virginia rank lowest
Income Gaps: Clear regional disparities exist between average mean and median incomes
Land vs. Water Comparison: Alaska dominates in both land and water area, while states like Michigan show large water area proportions

CHALLENGES & SOLUTIONS:

⚠️ Challenge 1: Duplicate id entries in raw data
✅ Solution: Used window functions to isolate and delete duplicates

⚠️ Challenge 2: Misspelled and inconsistent categorical fields
✅ Solution: Manually reviewed and updated known issues using UPDATE

⚠️ Challenge 3: Missing location values preventing accurate joins
✅ Solution: Filled blank Place fields by inferring from context (County, City)

⚠️ Challenge 4: Outliers and zero income values skewing results
✅ Solution: Excluded zero income values in average calculations

DATA SOURCES:

Dataset: us_household_income_raw.csv
Dataset: us_household_income_statistics_raw.csv
Source: Public dataset on U.S. household income across ZIP codes, counties, and states
Timeframe: Covers all 50 states, Washington D.C., and Puerto Rico grouped by household type, geography, and demographic classification

DATA DICTIONARY:

id: Unique identifier for each household record
State_Name: Full name of the state (e.g., California)
State_ab: Two-letter abbreviation (e.g., CA)
Place: Specific sub-region within a city/county
Type: Category like City, Town, or Borough
ALand: Area of land in square meters
AWater: Area of water in square meters
Mean: Average household income
Median: Middle value of household income
Primary: Indicator of whether this record is the primary income entry
Type (Statistics): Classification of income data (e.g., All Households)

VIEW THE SQL CODE:

Want to see how the cleaning process and analysis were built in SQL?

👉 Click here to view the full SQL cleaning code — includes all cleaning operations and data fixes

👉 Click here to view the full SQL analysis code — covers exploratory queries, joins, and income analysis

The codes showcase a clear and reproducible workflow for transforming raw, messy household income data into a structured and analysis-ready format.

The GitHub repository is fully transparent, organized, and well-commented to guide you through each step of the project.

VISUAL WALKTHROUGH: The SQL queries shown below address only the key questions explored in this project. For additional queries including full joins, data validation steps, and state-level income breakdowns. Please refer to the full SQL code linked above in the View the SQL Code section.

Click on any image to enlarge it to full screen.

**QUESTION 1:**
*Which five U.S. states, districts, or territories have the highest average household income?*
**ANSWER:**
See SQL query output in the image above.

**QUESTION 3:**
*What are the top 10 largest states by land area?*
**ANSWER:**
See SQL query output in the image above.

**QUESTION 2:**
*Which five U.S. states, districts, or territories have the lowest average household income?*
**ANSWER:**
See SQL query output in the image above.

**QUESTION 4:**
*What are the top 10 largest states by water area?*
**ANSWER:**
See SQL query output in the image above.

Midnight Investor | Trader & Investor

US Household Income - Data Cleaning & Exploratory Analysis

Product Segmentation & Profitability Analysis