From Chaos to Clarity: Cleaning Real-World Data in Excel
OBJECTIVE: Transform a messy real-world dataset of U.S. presidents into a clean, analysis-ready format using Excel. This project focuses on standardizing inconsistent text entries, removing middle initials, fixing formatting issues, and organizing structured data for analysis or visualization.
TECH STACK:
Microsoft Excel: Primary tool for cleaning, transforming, and organizing data
Built-in Excel Functions:
◇PROPER()
– Capitalize names consistently (e.g., "john adams" → "John Adams") * Could also useUPPER()
orLOWER()
depending on an organization’s data formatting standards
◇TRIM()
– Remove extra spaces from names and text fields
◇MID()
,LEFT()
,RIGHT()
– Extract structured data like street, city, state, and ZIP from full addresses
◇FIND()
– Used withMID()
to parse out parts of strings
◇LEN()
– Evaluate length of fields to verify data structure (e.g., character count of age values)Formatting Tools:
◇ Applied Short Date format for consistency in date fields
◇ Used a color-coded legend (color_key
sheet) to distinguish functions, helper columns, formatting decisions, and data anomalies needing further research
PROCESS BREAKDOWN:
To ensure clarity and verifiability, the Excel workbook is organized into four structured sheets:
color_key:
◇ Provides a visual legend for functions, helper columns, and formatting choices
◇ Helps to understand the logic behind each transformationraw:
◇ Original dataset as imported
◇ Demonstrates real-world inconsistencies and formatting issuescleaning_process:
◇ Step-by-step data cleaning with labeled helper columns
◇ Includes formula-driven transformations such as:
- Standardizing names
- Removing middle initials
- Parsing address components
- Cleaning party affiliations
- Measuring string lengths for validation
◇ Intermediate steps preserved to show logic and reproducibility
◇ Columns containing numerical data were formatted as numbers to ensure consistency and prevent issues during import into visualization tools like Tableau or Power BI.cleaned:
◇ Final, polished dataset with all formulas removed (values only).
◇ Ready for analysis via PivotTables, charts, or import into tools like Tableau, Power BI, or R
DATA QUALITY CHECKS & INSIGHTS:
To demonstrate analytical thinking beyond cleaning, the cleaning_process
sheet includes a summary section that highlights:
Minimum & Maximum Inauguration Age: Identifies extremes in inauguration age for further review
Minimum & Maximum Salary: Flags compensation range across administrations
Blank Value Count: Uses
COUNTBLANK()
to detect missing valuesCount Over Age 50: Uses
COUNTIF()
to profile age demographics at inaugurationHistorical Party Annotation: Includes a note that the Democratic-Republican party existed from the 1790s to the 1820s for context on early presidential affiliations
This section shows how simple functions can be used for effective exploratory data analysis (EDA) and data validation.
DATA SOURCES:
Dataset: Presidential_Dataset.xlsx
Source: Kaggle (user-contributed dataset containing historical U.S. presidential information)
Timeframe: Covers all U.S. presidents from George Washington to present
KEY LEARNINGS & HIGHLIGHTS:
Showcased mastery of Excel cleaning functions and formula chaining
Demonstrated real-world string handling: name standardization, text splitting, and formatting validation
Built a fully auditable workflow with clean, labeled sheets and transformation logic
Used best practices to prepare raw data for real-world analytical tools and reporting