Data Cleaning Prompt Templates
AI prompt templates for data cleaning. Prepare, transform, and normalize messy datasets.
Overview
Messy data is the norm, not the exception. These prompts help you handle the grunt work of data cleaning: fixing formats, handling missing values, removing duplicates, and standardizing inconsistent entries. Whether you're working in Python, SQL, or spreadsheets, you'll find templates for the most common cleaning tasks.
Best Practices
Describe the mess. What does the bad data actually look like?
Specify your tool (Python/pandas, SQL, Excel, R) so you get usable code
Share sample rows showing both clean and dirty data
Mention how you want to handle edge cases, like missing values
State if you need to preserve the original data or can modify in place
Prompt Templates
1. Data Quality Assessment
Analyze this dataset for data quality issues and create a cleaning plan. Dataset description: [DESCRIPTION] Columns: [COLUMN_LIST] Sample rows (including problematic ones): [SAMPLE_DATA] Identify: missing values, duplicates, inconsistent formats, outliers, and invalid entries. Prioritize issues by impact.
Dataset: Customer records from CRM export. Columns: customer_id, name, email, phone, signup_date, country. Sample rows: 1) C001, John Smith, john@email.com, 555-1234, 2024-01-15, USA 2) C002, jane doe, JANE@EMAIL.COM, (555) 5678, 01/15/2024, United States 3) C001, John Smith, john@email.com, 555-1234, 2024-01-15, USA 4) C003, Bob, 555.9999, 2024-13-45, US
Data Quality Issues Found:
1. DUPLICATES (High Impact): Rows 1 and 3 are exact duplicates. Check for more by grouping on customer_id or email.
2. INCONSISTENT FORMATS (High Impact):
- Names: Mixed case (John Smith vs jane doe)
- Phone: Three different formats (555-1234, (555) 5678, 555.9999)
- Dates: Two formats (2024-01-15 vs 01/15/2024)
- Country: Three variations (USA, United States, US)
3. MISSING VALUES (Medium Impact): Row 4 has no email. Count total missing per column.
4. INVALID DATA (High Impact): Row 4 has date 2024-13-45, which is impossible (no month 13).
Cleaning Plan:
1. Remove exact duplicates
2. Standardize country to ISO codes
3. Parse and reformat all phone numbers
4. Convert all dates to YYYY-MM-DD, flag invalid dates for review
5. Normalize name capitalization
6. Decide on missing email handling (flag, exclude, or lookup)
When to use: Before starting any analysis, to understand what needs fixing
- Include your worst rows in the sample, not just clean examples
- Ask for impact prioritization so you fix the biggest problems first
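The assessment checklist above can be run mechanically. Below is a minimal pandas sketch over the sample CRM rows; the strict `YYYY-MM-DD` check is an assumption and deliberately flags both impossible dates and off-format ones:

```python
import io
import pandas as pd

# The sample CRM rows from the example above (row 4 has no email)
csv = io.StringIO(
    "customer_id,name,email,phone,signup_date,country\n"
    "C001,John Smith,john@email.com,555-1234,2024-01-15,USA\n"
    "C002,jane doe,JANE@EMAIL.COM,(555) 5678,01/15/2024,United States\n"
    "C001,John Smith,john@email.com,555-1234,2024-01-15,USA\n"
    "C003,Bob,,555.9999,2024-13-45,US\n"
)
df = pd.read_csv(csv)

# 1. Missing values per column
missing = df.isna().sum()

# 2. Exact duplicates
n_dupes = df.duplicated().sum()

# 3. Invalid or off-format dates: strict parsing turns them into NaT
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
bad_dates = df.loc[parsed.isna(), "signup_date"].tolist()

print(missing["email"], n_dupes, bad_dates)
```

Country and phone checks would follow the same pattern: normalize, then count the unique values that remain.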
2. Missing Value Handler
Help me handle missing values in [COLUMN_NAME]. Column type: [NUMERIC/CATEGORICAL/DATE/TEXT] Percent missing: [PERCENTAGE] Context: [WHAT_THIS_COLUMN_REPRESENTS] Downstream use: [HOW_YOU'LL_USE_THIS_DATA] Tool: [PYTHON/SQL/EXCEL] Options I'm considering: [OPTIONS]
Help me handle missing values in the 'income' column. Column type: Numeric. Percent missing: 12%. Context: Annual household income in a customer survey. Downstream use: Building a customer segmentation model. Tool: Python/Pandas. Options I'm considering: Drop rows, fill with mean, fill with median, or predict from other columns.
For income data going into a segmentation model, here's my recommendation:
Don't use mean: Income is usually right-skewed (a few high earners pull up the average), so mean imputation would inflate the filled-in values.
Median is safer but still loses information.
Best approach: Impute based on correlated columns if available (age, education, zip code are often good predictors).
Python code for median imputation:
```python
df['income'] = df['income'].fillna(df['income'].median())
```
Python code for group-based imputation (better):
```python
df['income'] = df.groupby('education_level')['income'].transform(
    lambda x: x.fillna(x.median())
)
```
Python code for model-based imputation (best but more complex):
```python
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df[['age', 'income', 'education_years']] = imputer.fit_transform(
    df[['age', 'income', 'education_years']]
)
```
At 12% missing, you have enough data to impute well. I'd avoid dropping rows since that's a lot of lost records.
When to use: When you have missing data and need to decide how to handle it
- Always check if data is missing at random or if there's a pattern
- For categorical data, consider adding a 'Missing' category instead of imputing
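Both tips above can be sketched in a few lines, using hypothetical survey columns (`age`, `income`, `segment` are invented for illustration): compare another column's distribution between missing and present rows to spot non-random gaps, and keep missingness visible in categoricals instead of imputing it away:

```python
import numpy as np
import pandas as pd

# Hypothetical survey: income is deliberately missing for younger respondents
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=500),
    "income": rng.normal(60000, 15000, size=500),
    "segment": rng.choice(["A", "B", None], size=500, p=[0.45, 0.45, 0.10]),
})
df.loc[df["age"] < 30, "income"] = np.nan

# If another column's distribution differs between missing and present rows,
# the data is probably NOT missing at random
mean_age_missing = df.loc[df["income"].isna(), "age"].mean()
mean_age_present = df.loc[df["income"].notna(), "age"].mean()
print(mean_age_missing, mean_age_present)  # missing rows skew young here

# For categoricals, an explicit 'Missing' level preserves the signal
df["segment"] = df["segment"].fillna("Missing")
```

A gap like this one (missing income concentrated among young respondents) would bias any global imputation, which is exactly why the check comes first.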
3. Text Standardization
Standardize text values in [COLUMN_NAME] that have inconsistent formatting. Current variations found: [LIST_OF_VARIATIONS] Target format: [DESIRED_FORMAT] Tool: [PYTHON/SQL/EXCEL] Additional rules: [ANY_SPECIAL_CASES]
Standardize company names that have inconsistent formatting. Current variations: 'ACME Inc.', 'Acme, Inc', 'acme incorporated', 'ACME', 'Acme Inc', 'The Acme Company'. Target format: Title case, no punctuation, no legal suffixes (Inc, LLC, Corp). Tool: Python. Additional rules: Remove 'The' prefix if present.
```python
import re

import pandas as pd

def standardize_company_name(name):
    if pd.isna(name):
        return name
    # Convert to lowercase for processing
    name = str(name).lower().strip()
    # Remove common legal suffixes
    suffixes = r'\b(inc\.?|incorporated|llc|corp\.?|corporation|ltd\.?|limited|co\.?)\b'
    name = re.sub(suffixes, '', name, flags=re.IGNORECASE)
    # Remove 'the' prefix
    name = re.sub(r'^the\s+', '', name)
    # Remove punctuation
    name = re.sub(r'[^\w\s]', '', name)
    # Remove extra whitespace
    name = ' '.join(name.split())
    # Convert to title case
    name = name.title()
    return name

df['company_clean'] = df['company_name'].apply(standardize_company_name)
```
Results:
- 'ACME Inc.' → 'Acme'
- 'Acme, Inc' → 'Acme'
- 'The Acme Company' → 'Acme Company'
Tip: After standardizing, check unique values to catch edge cases. Run `df['company_clean'].value_counts()` and look for near-duplicates that might need fuzzy matching.
When to use: When the same thing is written multiple ways and you need consistency
- Build the function incrementally, checking output after each rule
- Keep the original column so you can debug if standardization goes wrong
4. Duplicate Detection and Removal
Find and handle duplicates in my dataset. Duplicate type: [EXACT/FUZZY/BASED_ON_KEY] Columns to check: [COLUMNS] Keep rule: [FIRST/LAST/MOST_COMPLETE/CUSTOM] Tool: [PYTHON/SQL/EXCEL] Sample data showing potential duplicates: [SAMPLE]
Find and handle duplicates in my customer database. Duplicate type: Fuzzy (same person, slightly different records). Columns to check: name, email, phone. Keep rule: Most complete record (fewest nulls). Tool: Python. Sample: Row 1: John Smith, john@email.com, 555-1234. Row 2: John A. Smith, john@email.com, NULL. Row 3: Jon Smith, NULL, 555-1234.
This needs fuzzy matching since the records aren't exact duplicates.
```python
import re
from itertools import combinations

import pandas as pd
from fuzzywuzzy import fuzz  # the maintained fork is published as 'thefuzz'

def find_fuzzy_duplicates(df, name_col='name', email_col='email',
                          phone_col='phone', threshold=80):
    duplicates = []
    for i, j in combinations(df.index, 2):
        matches = 0
        # Compare names (fuzzy)
        if pd.notna(df.loc[i, name_col]) and pd.notna(df.loc[j, name_col]):
            name_score = fuzz.ratio(df.loc[i, name_col].lower(),
                                    df.loc[j, name_col].lower())
            if name_score >= threshold:
                matches += 1
        # Compare email (exact, case-insensitive)
        if pd.notna(df.loc[i, email_col]) and pd.notna(df.loc[j, email_col]):
            if df.loc[i, email_col].lower() == df.loc[j, email_col].lower():
                matches += 1
        # Compare phone (exact, after stripping non-digits)
        if pd.notna(df.loc[i, phone_col]) and pd.notna(df.loc[j, phone_col]):
            phone_i = re.sub(r'\D', '', str(df.loc[i, phone_col]))
            phone_j = re.sub(r'\D', '', str(df.loc[j, phone_col]))
            if phone_i == phone_j:
                matches += 1
        if matches >= 2:  # at least 2 fields must match
            duplicates.append((i, j))
    return duplicates

# Keep the most complete record: count non-null fields per row
def completeness_score(row):
    return row.notna().sum()

# For each duplicate pair, drop the record with less data
dupe_pairs = find_fuzzy_duplicates(df)
to_drop = []
for i, j in dupe_pairs:
    if completeness_score(df.loc[i]) >= completeness_score(df.loc[j]):
        to_drop.append(j)
    else:
        to_drop.append(i)
df_clean = df.drop(index=list(set(to_drop)))
```
For large datasets, this pairwise approach gets slow fast; consider the recordlinkage or dedupe libraries instead.
When to use: When you suspect duplicates but they're not exact matches
- Start with exact duplicate check before fuzzy matching
- Review a sample of flagged duplicates manually before bulk deletion
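The "exact before fuzzy" order can be sketched as a staged pipeline (toy records invented for illustration): drop exact duplicates first, then run a normalized-key pass so formatting noise doesn't hide exact matches, and send only the leftovers to the expensive fuzzy step:

```python
import pandas as pd

# Toy records: row 1 is an exact duplicate, row 4 differs only in formatting
df = pd.DataFrame({
    "name":  ["John Smith", "John Smith", "John A. Smith", "Jon Smith", "John Smith"],
    "email": ["john@email.com", "john@email.com", "john@email.com", None, "JOHN@EMAIL.COM"],
    "phone": ["555-1234", "555-1234", None, "555-1234", "(555) 1234"],
})

# Pass 1: exact duplicates (cheap, zero false positives)
df = df.drop_duplicates()

# Pass 2: normalize key fields, then look for exact matches on the clean keys
df["email_norm"] = df["email"].str.lower().str.strip()
df["phone_norm"] = df["phone"].str.replace(r"\D", "", regex=True)
key_dupes = (
    df.duplicated(subset=["email_norm", "phone_norm"], keep=False)
    & df["email_norm"].notna()
)

# Only rows surviving both passes need the pairwise fuzzy comparison
fuzzy_candidates = df[~key_dupes]
print(len(df), int(key_dupes.sum()), len(fuzzy_candidates))
```

Each pass shrinks the input to the next one, which matters because the pairwise fuzzy step is O(n²).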
Common Mistakes to Avoid
Cleaning data before understanding it. Always explore first, then clean
Overwriting original data without keeping a backup or audit trail
Applying global rules without checking for edge cases that need special handling
Related Templates
SQL Query Prompt Templates
AI prompt templates for writing SQL queries. Create SELECT, JOIN, aggregate, and complex queries.
Data Analysis Prompt Templates
AI prompt templates for data analysis. Extract insights, identify patterns, and interpret results.
Data Visualization Prompt Templates
AI prompt templates for data visualization. Create effective charts, dashboards, and visual reports.