World Life Expectancy - Part 1: Data Cleaning & Preparation

OBJECTIVE: Prepare and clean a global health dataset to enable accurate life expectancy analysis across countries and years. This project focused on removing duplicates, filling in missing values, and standardizing classifications to support analysis.

BACKGROUND: Reliable life expectancy data is crucial for assessing the progress of global health initiatives and identifying disparities between countries. However, raw datasets often contain inconsistencies, blanks, and duplicates that can distort findings. This project tackles those issues to ensure high data integrity and usability for future exploration.

The cleaning process in this project was designed to remove structural obstacles that could otherwise interfere with analysis. By resolving issues such as duplication, blank values, and inconsistent classifications, the dataset is now well-positioned for deeper investigation in follow-up stages.

For example, with a clean dataset in place, future analysis could potentially explore questions such as:

  1. Are there duplicate entries for country-year combinations?

  2. How are missing or inconsistent values treated?

  3. Can we infer missing life expectancy values based on nearby years?

TECH STACK:

  1. SQL: Core data cleaning, deduplication, updates

  2. Microsoft Excel: Pre- and post-cleaning exploration, formatting

  3. Window Functions: Used ROW_NUMBER() to identify and remove duplicates

  4. Self-Joins: Used to identify and fill in missing values based on matching records from the same table

  5. Update & Filtering Queries: Addressed blank cells and data categorization

PROCESS:

  1. Data Cleaning in SQL:
    ◇ Duplicates
    - Identified duplicate country-year entries by concatenating values with CONCAT(Country, Year) and grouping to count occurrences
    - Applied ROW_NUMBER() in a Common Table Expression (CTE) to isolate duplicate rows
    - Deleted rows where ROW_NUMBER() > 1 to retain only the first occurrence

    ◇ Missing & Inconsistent Values
    - Checked for blanks in the Status column
    - Used a self-join to fill missing Status values based on other entries of the same country
    - Specifically imputed missing 2018 life expectancy values for Afghanistan and Albania by averaging 2017 and 2019 values

  2. Excel Review:
    ◇ Verified that cleaned data aligns with expected patterns
    ◇ Ensured each country-year combination is unique
    ◇ Rechecked column formats and resolved edge cases

KEY INSIGHTS:

  • Duplicates Removed: Country-year pairs were successfully deduplicated, ensuring unique yearly records per country

  • Status Cleaned: Previously missing values for “Developing” or “Developed” were inferred using existing data from the same country

  • Imputation Strategy: Missing life expectancy values were averaged using the years before and after to maintain continuity

  • Consistency Achieved: Dataset is now ready for statistical and visual analysis without data quality concerns

CHALLENGES & SOLUTIONS:

⚠️ Challenge 1: Multiple entries for the same country and year
Solution: Used ROW_NUMBER() to identify and remove all duplicates

⚠️ Challenge 2: Missing values in the Status column
Solution: Applied a self-join to assign known classifications from matching countries

⚠️ Challenge 3: Missing life expectancy values for 2018 in Afghanistan and Albania
Solution: Used a 3-way join to average 2017 and 2019 values and fill in 2018

⚠️ Challenge 4: Inconsistent use of blank strings instead of nulls
Solution: Filtered and standardized blanks in SQL using WHERE column = ''

DATA SOURCES:

  • Dataset: world_life_expectancy_data_raw.csv

  • Source: Provided as part of a data cleaning exercise for global life expectancy

  • Timeframe: Covers multiple years by country — data includes life expectancy, status (developed/developing), and other fields

DATA DICTIONARY:

  • Country: Nation or territory name

  • Year: Calendar year of the observation

  • Status: Classification as "Developed" or "Developing"

  • Life expectancy: Average number of years a newborn is expected to live, based on current mortality rates

VIEW THE SQL CLEANING CODE:

Want to see how the cleaning process was built in SQL?

👉 Click here to view the full SQL cleaning code — includes duplicate detection, blank value updates, status imputation using self-joins, and estimation of missing life expectancy data.

This code showcases a clear and reproducible workflow for transforming raw, messy global life expectancy data into a structured and analysis-ready format.

The GitHub repository is fully transparent, organized, and well-commented to support each step of the cleaning process.

Previous
Previous

US Debt Analysis & Forecasting

Next
Next

World Life Expectancy - Part 2: Exploratory Data Analysis