From Chaos to Clarity: Cleaning Real-World Data in Excel

Excel

Jun 18

Written By Midnight Investor | Trader & Investor

OBJECTIVE: Transform a messy real-world dataset of U.S. presidents into a clean, analysis-ready format using Excel. This project focuses on standardizing inconsistent text entries, removing middle initials, fixing formatting issues, and organizing structured data for analysis or visualization.

TECH STACK:

Microsoft Excel: Primary tool for cleaning, transforming, and organizing data
Built-in Excel Functions:
◇ PROPER() – Capitalize names consistently (e.g., "john adams" → "John Adams") * Could also use UPPER() or LOWER() depending on an organization’s data formatting standards
◇ TRIM() – Remove extra spaces from names and text fields
◇ MID(), LEFT(), RIGHT() – Extract structured data like street, city, state, and ZIP from full addresses
◇ FIND() – Used with MID() to parse out parts of strings
◇ LEN() – Evaluate length of fields to verify data structure (e.g., character count of age values)
Formatting Tools:
◇ Applied Short Date format for consistency in date fields
◇ Used a color-coded legend (color_key sheet) to distinguish functions, helper columns, formatting decisions, and data anomalies needing further research

PROCESS BREAKDOWN:
To ensure clarity and verifiability, the Excel workbook is organized into four structured sheets:

color_key:
◇ Provides a visual legend for functions, helper columns, and formatting choices
◇ Helps to understand the logic behind each transformation
raw:
◇ Original dataset as imported
◇ Demonstrates real-world inconsistencies and formatting issues
cleaning_process:
◇ Step-by-step data cleaning with labeled helper columns
◇ Includes formula-driven transformations such as:
- Standardizing names
- Removing middle initials
- Parsing address components
- Cleaning party affiliations
- Measuring string lengths for validation
◇ Intermediate steps preserved to show logic and reproducibility
◇ Columns containing numerical data were formatted as numbers to ensure consistency and prevent issues during import into visualization tools like Tableau or Power BI.
cleaned:
◇ Final, polished dataset with all formulas removed (values only).
◇ Ready for analysis via PivotTables, charts, or import into tools like Tableau, Power BI, or R

DATA QUALITY CHECKS & INSIGHTS:

To demonstrate analytical thinking beyond cleaning, the cleaning_process sheet includes a summary section that highlights:

Minimum & Maximum Inauguration Age: Identifies extremes in inauguration age for further review
Minimum & Maximum Salary: Flags compensation range across administrations
Blank Value Count: Uses COUNTBLANK() to detect missing values
Count Over Age 50: Uses COUNTIF() to profile age demographics at inauguration
Historical Party Annotation: Includes a note that the Democratic-Republican party existed from the 1790s to the 1820s for context on early presidential affiliations

This section shows how simple functions can be used for effective exploratory data analysis (EDA) and data validation.

DATA SOURCES:

Dataset: Presidential_Dataset.xlsx
Source: Kaggle (user-contributed dataset containing historical U.S. presidential information)
Timeframe: Covers all U.S. presidents from George Washington to present

KEY LEARNINGS & HIGHLIGHTS:

Showcased mastery of Excel cleaning functions and formula chaining
Demonstrated real-world string handling: name standardization, text splitting, and formatting validation
Built a fully auditable workflow with clean, labeled sheets and transformation logic
Used best practices to prepare raw data for real-world analytical tools and reporting

Midnight Investor | Trader & Investor

From Chaos to Clarity: Cleaning Real-World Data in Excel

Part 1: Tesla’s Market Share & EV Trends in Washington