World Life Expectancy - Part 1: Data Cleaning & Preparation
OBJECTIVE: Prepare and clean a global health dataset to enable accurate life expectancy analysis across countries and years. This project focused on removing duplicates, filling in missing values, and standardizing classifications to support analysis.
BACKGROUND: Reliable life expectancy data is crucial for assessing the progress of global health initiatives and identifying disparities between countries. However, raw datasets often contain inconsistencies, blanks, and duplicates that can distort findings. This project tackles those issues to ensure high data integrity and usability for future exploration.
The cleaning process in this project was designed to remove structural obstacles that could otherwise interfere with analysis. By resolving issues such as duplication, blank values, and inconsistent classifications, the dataset is now well-positioned for deeper investigation in follow-up stages.
For example, with a clean dataset in place, future analysis could potentially explore questions such as:
Are there duplicate entries for country-year combinations?
How are missing or inconsistent values treated?
Can we infer missing life expectancy values based on nearby years?
TECH STACK:
SQL: Core data cleaning, deduplication, updates
Microsoft Excel: Pre- and post-cleaning exploration, formatting
Window Functions: Used
ROW_NUMBER()
to identify and remove duplicatesSelf-Joins: Used to identify and fill in missing values based on matching records from the same table
Update & Filtering Queries: Addressed blank cells and data categorization
PROCESS:
Data Cleaning in SQL:
◇ Duplicates
- Identified duplicate country-year entries by concatenating values withCONCAT(Country, Year)
and grouping to count occurrences
- AppliedROW_NUMBER()
in a Common Table Expression (CTE) to isolate duplicate rows
- Deleted rows whereROW_NUMBER()
> 1 to retain only the first occurrence◇ Missing & Inconsistent Values
- Checked for blanks in theStatus
column
- Used a self-join to fill missingStatus
values based on other entries of the same country
- Specifically imputed missing 2018 life expectancy values for Afghanistan and Albania by averaging 2017 and 2019 valuesExcel Review:
◇ Verified that cleaned data aligns with expected patterns
◇ Ensured each country-year combination is unique
◇ Rechecked column formats and resolved edge cases
KEY INSIGHTS:
Duplicates Removed: Country-year pairs were successfully deduplicated, ensuring unique yearly records per country
Status Cleaned: Previously missing values for “Developing” or “Developed” were inferred using existing data from the same country
Imputation Strategy: Missing life expectancy values were averaged using the years before and after to maintain continuity
Consistency Achieved: Dataset is now ready for statistical and visual analysis without data quality concerns
CHALLENGES & SOLUTIONS:
⚠️ Challenge 1: Multiple entries for the same country and year
✅ Solution: Used ROW_NUMBER()
to identify and remove all duplicates
⚠️ Challenge 2: Missing values in the Status
column
✅ Solution: Applied a self-join to assign known classifications from matching countries
⚠️ Challenge 3: Missing life expectancy values for 2018 in Afghanistan and Albania
✅ Solution: Used a 3-way join to average 2017 and 2019 values and fill in 2018
⚠️ Challenge 4: Inconsistent use of blank strings instead of nulls
✅ Solution: Filtered and standardized blanks in SQL using WHERE column = ''
DATA SOURCES:
Dataset: world_life_expectancy_data_raw.csv
Source: Provided as part of a data cleaning exercise for global life expectancy
Timeframe: Covers multiple years by country — data includes life expectancy, status (developed/developing), and other fields
DATA DICTIONARY:
Country: Nation or territory name
Year: Calendar year of the observation
Status: Classification as "Developed" or "Developing"
Life expectancy: Average number of years a newborn is expected to live, based on current mortality rates
VIEW THE SQL CLEANING CODE:
Want to see how the cleaning process was built in SQL?
👉 Click here to view the full SQL cleaning code — includes duplicate detection, blank value updates, status imputation using self-joins, and estimation of missing life expectancy data.
This code showcases a clear and reproducible workflow for transforming raw, messy global life expectancy data into a structured and analysis-ready format.
The GitHub repository is fully transparent, organized, and well-commented to support each step of the cleaning process.