US Household Income - Data Cleaning & Exploratory Analysis

OBJECTIVE: Prepare, clean, and analyze U.S. household income data to uncover geographic income patterns across states and counties. This project focuses on resolving data quality issues and conducting exploratory analysis to support insights into income distribution, land and water area, and regional disparities.

BACKGROUND: Accurate household income data is essential for economic and social policy planning. However, raw datasets often contain issues such as misspellings, duplicate records, and missing values that hinder meaningful analysis. This project addresses those challenges through a structured data cleaning process, followed by SQL-based exploration of trends in income levels and geographic attributes.

By addressing data quality and performing exploratory queries, this project enables analysis of important questions such as:

  1. Which five U.S. states, districts, or territories have the highest average household income?

  2. Which five U.S. states, districts, or territories have the lowest average household income?

  3. What are the top 10 largest states by land area?

  4. What are the top 10 largest states by water area?

TECH STACK:

  1. SQL: Used for all cleaning, transformation, and analysis

  2. Window Functions: Used ROW_NUMBER() to identify and remove duplicates

  3. Update & Filtering Queries: Combined datasets and computed averages

  4. Joins & Aggregations: Addressed blank cells and data categorization

  5. Microsoft Excel: Used to review raw data and verify results post-cleaning

PROCESS:

  1. Data Cleaning in SQL:
    ◇ Duplicates
    - Identified repeated id values using ROW_NUMBER()
    - Removed non-unique rows, preserving one entry per id

    ◇ Misspellings & Standardization
    - Corrected state name issues like 'georia''Georgia'
    - Normalized inconsistent Type values such as 'Boroughs''Borough'
    ◇ Missing Values
    - Located blank Place fields and updated based on matching county/city
    - Ensured every location entry was complete for joining datasets

  2. Exploratory Analysis in SQL:
    ◇ Geographic Size Analysis
    - Summed land (ALand) and water (AWater) areas by state
    - Ranked top 10 states by land and water area separately
    ◇ State Income Ranking
    - Joined household and statistics tables on id
    - Calculated average Mean and Median incomes per state
    - Identified top 5 and bottom 5 states by average income

KEY INSIGHTS:

  • Cleaned Dataset: All entries now have valid IDs, correct state names, and filled place fields for reliable analysis

  • Top Income States: States like Maryland and D.C. rank highest in average income, while Mississippi and West Virginia rank lowest

  • Income Gaps: Clear regional disparities exist between average mean and median incomes

  • Land vs. Water Comparison: Alaska dominates in both land and water area, while states like Michigan show large water area proportions

CHALLENGES & SOLUTIONS:

⚠️ Challenge 1: Duplicate id entries in raw data
Solution: Used window functions to isolate and delete duplicates

⚠️ Challenge 2: Misspelled and inconsistent categorical fields
Solution: Manually reviewed and updated known issues using UPDATE

⚠️ Challenge 3: Missing location values preventing accurate joins
Solution: Filled blank Place fields by inferring from context (County, City)

⚠️ Challenge 4: Outliers and zero income values skewing results
Solution: Excluded zero income values in average calculations

DATA SOURCES:

DATA DICTIONARY:

  • id: Unique identifier for each household record

  • State_Name: Full name of the state (e.g., California)

  • State_ab: Two-letter abbreviation (e.g., CA)

  • Place: Specific sub-region within a city/county

  • Type: Category like City, Town, or Borough

  • ALand: Area of land in square meters

  • AWater: Area of water in square meters

  • Mean: Average household income

  • Median: Middle value of household income

  • Primary: Indicator of whether this record is the primary income entry

  • Type (Statistics): Classification of income data (e.g., All Households)

VIEW THE SQL CODE:

Want to see how the cleaning process and analysis were built in SQL?

👉 Click here to view the full SQL cleaning code — includes all cleaning operations and data fixes

👉 Click here to view the full SQL analysis code — covers exploratory queries, joins, and income analysis

The codes showcase a clear and reproducible workflow for transforming raw, messy household income data into a structured and analysis-ready format.

The GitHub repository is fully transparent, organized, and well-commented to guide you through each step of the project.

VISUAL WALKTHROUGH: The SQL queries shown below address only the key questions explored in this project. For additional queries including full joins, data validation steps, and state-level income breakdowns. Please refer to the full SQL code linked above in the View the SQL Code section.

Click on any image to enlarge it to full screen.

QUESTION 1:
Which five U.S. states, districts, or territories have the highest average household income?

ANSWER:
See SQL query output in the image above.

QUESTION 3:
What are the top 10 largest states by land area?

ANSWER:
See SQL query output in the image above.

QUESTION 2:
Which five U.S. states, districts, or territories have the lowest average household income?

ANSWER:
See SQL query output in the image above.

QUESTION 4:
What are the top 10 largest states by water area?

ANSWER:
See SQL query output in the image above.

Previous
Previous

Product Segmentation & Profitability Analysis