From Chaos to Clarity: Cleaning Real-World Data in Excel

OBJECTIVE: Transform a messy real-world dataset of U.S. presidents into a clean, analysis-ready format using Excel. This project focuses on standardizing inconsistent text entries, removing middle initials, fixing formatting issues, and organizing structured data for analysis or visualization.

TECH STACK:

  1. Microsoft Excel: Primary tool for cleaning, transforming, and organizing data

  2. Built-in Excel Functions:
    PROPER() – Capitalize names consistently (e.g., "john adams" → "John Adams") * Could also use UPPER() or LOWER() depending on an organization’s data formatting standards
    TRIM() – Remove extra spaces from names and text fields
    MID(), LEFT(), RIGHT() – Extract structured data like street, city, state, and ZIP from full addresses
    FIND() – Used with MID() to parse out parts of strings
    LEN() – Evaluate length of fields to verify data structure (e.g., character count of age values)

  3. Formatting Tools:
    ◇ Applied Short Date format for consistency in date fields
    ◇ Used a color-coded legend (color_key sheet) to distinguish functions, helper columns, formatting decisions, and data anomalies needing further research

PROCESS BREAKDOWN:
To ensure clarity and verifiability, the Excel workbook is organized into four structured sheets:

  1. color_key:
    ◇ Provides a visual legend for functions, helper columns, and formatting choices
    ◇ Helps to understand the logic behind each transformation

  2. raw:
    ◇ Original dataset as imported
    ◇ Demonstrates real-world inconsistencies and formatting issues

  3. cleaning_process:
    ◇ Step-by-step data cleaning with labeled helper columns
    ◇ Includes formula-driven transformations such as:
    - Standardizing names
    - Removing middle initials
    - Parsing address components
    - Cleaning party affiliations
    - Measuring string lengths for validation
    ◇ Intermediate steps preserved to show logic and reproducibility
    ◇ Columns containing numerical data were formatted as numbers to ensure consistency and prevent issues during import into visualization tools like Tableau or Power BI.

  4. cleaned:
    ◇ Final, polished dataset with all formulas removed (values only).
    ◇ Ready for analysis via PivotTables, charts, or import into tools like Tableau, Power BI, or R

DATA QUALITY CHECKS & INSIGHTS:

To demonstrate analytical thinking beyond cleaning, the cleaning_process sheet includes a summary section that highlights:

  • Minimum & Maximum Inauguration Age: Identifies extremes in inauguration age for further review

  • Minimum & Maximum Salary: Flags compensation range across administrations

  • Blank Value Count: Uses COUNTBLANK() to detect missing values

  • Count Over Age 50: Uses COUNTIF() to profile age demographics at inauguration

  • Historical Party Annotation: Includes a note that the Democratic-Republican party existed from the 1790s to the 1820s for context on early presidential affiliations

This section shows how simple functions can be used for effective exploratory data analysis (EDA) and data validation.

DATA SOURCES:

  • Dataset: Presidential_Dataset.xlsx

  • Source: Kaggle (user-contributed dataset containing historical U.S. presidential information)

  • Timeframe: Covers all U.S. presidents from George Washington to present

KEY LEARNINGS & HIGHLIGHTS:

  1. Showcased mastery of Excel cleaning functions and formula chaining

  2. Demonstrated real-world string handling: name standardization, text splitting, and formatting validation

  3. Built a fully auditable workflow with clean, labeled sheets and transformation logic

  4. Used best practices to prepare raw data for real-world analytical tools and reporting

Next
Next

Part 1: Tesla’s Market Share & EV Trends in Washington