Data Cleaning Prompt Templates
AI prompt templates for data cleaning. Prepare, transform, and normalize messy datasets.
Overview
Messy data is the norm, not the exception. These prompts help you handle the grunt work of data cleaning: fixing formats, handling missing values, removing duplicates, and standardizing inconsistent entries. Whether you're working in Python, SQL, or spreadsheets, you'll find templates for the most common cleaning tasks.
Best Practices
Describe the mess. What does the bad data actually look like?
Specify your tool (Python/pandas, SQL, Excel, R) so you get usable code
Share sample rows showing both clean and dirty data
Mention how you want to handle edge cases, like missing values
State if you need to preserve the original data or can modify in place
Prompt Templates
1. Data Quality Assessment
Analyze this dataset for data quality issues and create a cleaning plan. Dataset description: [DESCRIPTION] Columns: [COLUMN_LIST] Sample rows (including problematic ones): [SAMPLE_DATA] Identify: missing values, duplicates, inconsistent formats, outliers, and invalid entries. Prioritize issues by impact.
Dataset: Customer records from CRM export. Columns: customer_id, name, email, phone, signup_date, country. Sample rows: 1) C001, John Smith, john@email.com, 555-1234, 2024-01-15, USA 2) C002, jane doe, JANE@EMAIL.COM, (555) 5678, 01/15/2024, United States 3) C001, John Smith, john@email.com, 555-1234, 2024-01-15, USA 4) C003, Bob, 555.9999, 2024-13-45, US
Data Quality Issues Found:
1. DUPLICATES (High Impact): Rows 1 and 3 are exact duplicates. Check for more by grouping on customer_id or email.
2. INCONSISTENT FORMATS (High Impact):
- Names: Mixed case (John Smith vs jane doe)
- Phone: Three different formats (555-1234, (555) 5678, 555.9999)
- Dates: Two formats (2024-01-15 vs 01/15/2024)
- Country: Three variations (USA, United States, US)
3. MISSING VALUES (Medium Impact): Row 4 has no email. Count total missing per column.
4. INVALID DATA (High Impact): Row 4 has date 2024-13-45, which is impossible (no month 13).
Cleaning Plan:
1. Remove exact duplicates
2. Standardize country to ISO codes
3. Parse and reformat all phone numbers
4. Convert all dates to YYYY-MM-DD, flag invalid dates for review
5. Normalize name capitalization
6. Decide on missing email handling (flag, exclude, or lookup)
When to use: Before starting any analysis, to understand what needs fixing
- Include your worst rows in the sample, not just clean examples
- Ask for impact prioritization so you fix the biggest problems first
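The assessment checklist above can be run mechanically. Below is a minimal pandas sketch over the sample CRM rows; the strict `YYYY-MM-DD` check is an assumption and deliberately flags both impossible dates and off-format ones:

```python
import io
import pandas as pd

# The sample CRM rows from the example above (row 4 has no email)
csv = io.StringIO(
    "customer_id,name,email,phone,signup_date,country\n"
    "C001,John Smith,john@email.com,555-1234,2024-01-15,USA\n"
    "C002,jane doe,JANE@EMAIL.COM,(555) 5678,01/15/2024,United States\n"
    "C001,John Smith,john@email.com,555-1234,2024-01-15,USA\n"
    "C003,Bob,,555.9999,2024-13-45,US\n"
)
df = pd.read_csv(csv)

# 1. Missing values per column
missing = df.isna().sum()

# 2. Exact duplicates
n_dupes = df.duplicated().sum()

# 3. Invalid or off-format dates: strict parsing turns them into NaT
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
bad_dates = df.loc[parsed.isna(), "signup_date"].tolist()

print(missing["email"], n_dupes, bad_dates)
```

Country and phone checks would follow the same pattern: normalize, then count the unique values that remain.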
2. Missing Value Handler
Help me handle missing values in [COLUMN_NAME]. Column type: [NUMERIC/CATEGORICAL/DATE/TEXT] Percent missing: [PERCENTAGE] Context: [WHAT_THIS_COLUMN_REPRESENTS] Downstream use: [HOW_YOU'LL_USE_THIS_DATA] Tool: [PYTHON/SQL/EXCEL] Options I'm considering: [OPTIONS]
Help me handle missing values in the 'income' column. Column type: Numeric. Percent missing: 12%. Context: Annual household income in a customer survey. Downstream use: Building a customer segmentation model. Tool: Python/Pandas. Options I'm considering: Drop rows, fill with mean, fill with median, or predict from other columns.
For income data going into a segmentation model, here's my recommendation:
Don't use mean: Income is usually right-skewed (a few high earners pull up the average), so mean imputation would inflate the filled-in values.
Median is safer but still loses information.
Best approach: Impute based on correlated columns if available (age, education, zip code are often good predictors).
Python code for median imputation:
```python
df['income'] = df['income'].fillna(df['income'].median())
```
Python code for group-based imputation (better):
```python
df['income'] = df.groupby('education_level')['income'].transform(
    lambda x: x.fillna(x.median())
)
```
Python code for model-based imputation (best but more complex):
```python
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df[['age', 'income', 'education_years']] = imputer.fit_transform(
    df[['age', 'income', 'education_years']]
)
```
At 12% missing, you have enough data to impute well. I'd avoid dropping rows since that's a lot of lost records.
When to use: When you have missing data and need to decide how to handle it
- Always check if data is missing at random or if there's a pattern
- For categorical data, consider adding a 'Missing' category instead of imputing
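Both tips above can be sketched in a few lines, using hypothetical survey columns (`age`, `income`, `segment` are invented for illustration): compare another column's distribution between missing and present rows to spot non-random gaps, and keep missingness visible in categoricals instead of imputing it away:

```python
import numpy as np
import pandas as pd

# Hypothetical survey: income is deliberately missing for younger respondents
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=500),
    "income": rng.normal(60000, 15000, size=500),
    "segment": rng.choice(["A", "B", None], size=500, p=[0.45, 0.45, 0.10]),
})
df.loc[df["age"] < 30, "income"] = np.nan

# If another column's distribution differs between missing and present rows,
# the data is probably NOT missing at random
mean_age_missing = df.loc[df["income"].isna(), "age"].mean()
mean_age_present = df.loc[df["income"].notna(), "age"].mean()
print(mean_age_missing, mean_age_present)  # missing rows skew young here

# For categoricals, an explicit 'Missing' level preserves the signal
df["segment"] = df["segment"].fillna("Missing")
```

A gap like this one (missing income concentrated among young respondents) would bias any global imputation, which is exactly why the check comes first.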
3. Text Standardization
Standardize text values in [COLUMN_NAME] that have inconsistent formatting. Current variations found: [LIST_OF_VARIATIONS] Target format: [DESIRED_FORMAT] Tool: [PYTHON/SQL/EXCEL] Additional rules: [ANY_SPECIAL_CASES]
Standardize company names that have inconsistent formatting. Current variations: 'ACME Inc.', 'Acme, Inc', 'acme incorporated', 'ACME', 'Acme Inc', 'The Acme Company'. Target format: Title case, no punctuation, no legal suffixes (Inc, LLC, Corp). Tool: Python. Additional rules: Remove 'The' prefix if present.
```python
import re

import pandas as pd

def standardize_company_name(name):
    if pd.isna(name):
        return name
    # Convert to lowercase for processing
    name = str(name).lower().strip()
    # Remove common legal suffixes
    suffixes = r'\b(inc\.?|incorporated|llc|corp\.?|corporation|ltd\.?|limited|co\.?)\b'
    name = re.sub(suffixes, '', name, flags=re.IGNORECASE)
    # Remove 'the' prefix
    name = re.sub(r'^the\s+', '', name)
    # Remove punctuation
    name = re.sub(r'[^\w\s]', '', name)
    # Remove extra whitespace
    name = ' '.join(name.split())
    # Convert to title case
    name = name.title()
    return name

df['company_clean'] = df['company_name'].apply(standardize_company_name)
```
Results:
- 'ACME Inc.' → 'Acme'
- 'Acme, Inc' → 'Acme'
- 'The Acme Company' → 'Acme Company'
Tip: After standardizing, check unique values to catch edge cases. Run `df['company_clean'].value_counts()` and look for near-duplicates that might need fuzzy matching.
When to use: When the same thing is written multiple ways and you need consistency
- Build the function incrementally, checking output after each rule
- Keep the original column so you can debug if standardization goes wrong
4. Duplicate Detection and Removal
Find and handle duplicates in my dataset. Duplicate type: [EXACT/FUZZY/BASED_ON_KEY] Columns to check: [COLUMNS] Keep rule: [FIRST/LAST/MOST_COMPLETE/CUSTOM] Tool: [PYTHON/SQL/EXCEL] Sample data showing potential duplicates: [SAMPLE]
Find and handle duplicates in my customer database. Duplicate type: Fuzzy (same person, slightly different records). Columns to check: name, email, phone. Keep rule: Most complete record (fewest nulls). Tool: Python. Sample: Row 1: John Smith, john@email.com, 555-1234. Row 2: John A. Smith, john@email.com, NULL. Row 3: Jon Smith, NULL, 555-1234.
This needs fuzzy matching since the records aren't exact duplicates.
```python
import re
from itertools import combinations

import pandas as pd
from fuzzywuzzy import fuzz  # the maintained fork is published as 'thefuzz'

def find_fuzzy_duplicates(df, name_col='name', email_col='email',
                          phone_col='phone', threshold=80):
    duplicates = []
    for i, j in combinations(df.index, 2):
        matches = 0
        # Compare names (fuzzy)
        if pd.notna(df.loc[i, name_col]) and pd.notna(df.loc[j, name_col]):
            name_score = fuzz.ratio(df.loc[i, name_col].lower(),
                                    df.loc[j, name_col].lower())
            if name_score >= threshold:
                matches += 1
        # Compare email (exact, case-insensitive)
        if pd.notna(df.loc[i, email_col]) and pd.notna(df.loc[j, email_col]):
            if df.loc[i, email_col].lower() == df.loc[j, email_col].lower():
                matches += 1
        # Compare phone (exact, after stripping non-digits)
        if pd.notna(df.loc[i, phone_col]) and pd.notna(df.loc[j, phone_col]):
            phone_i = re.sub(r'\D', '', str(df.loc[i, phone_col]))
            phone_j = re.sub(r'\D', '', str(df.loc[j, phone_col]))
            if phone_i == phone_j:
                matches += 1
        if matches >= 2:  # at least 2 fields must match
            duplicates.append((i, j))
    return duplicates

# Keep the most complete record: count non-null fields per row
def completeness_score(row):
    return row.notna().sum()

# For each duplicate pair, drop the record with less data
dupe_pairs = find_fuzzy_duplicates(df)
to_drop = []
for i, j in dupe_pairs:
    if completeness_score(df.loc[i]) >= completeness_score(df.loc[j]):
        to_drop.append(j)
    else:
        to_drop.append(i)
df_clean = df.drop(index=list(set(to_drop)))
```
For large datasets, this pairwise approach gets slow fast; consider the recordlinkage or dedupe libraries instead.
When to use: When you suspect duplicates but they're not exact matches
- Start with exact duplicate check before fuzzy matching
- Review a sample of flagged duplicates manually before bulk deletion
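The "exact before fuzzy" order can be sketched as a staged pipeline (toy records invented for illustration): drop exact duplicates first, then run a normalized-key pass so formatting noise doesn't hide exact matches, and send only the leftovers to the expensive fuzzy step:

```python
import pandas as pd

# Toy records: row 1 is an exact duplicate, row 4 differs only in formatting
df = pd.DataFrame({
    "name":  ["John Smith", "John Smith", "John A. Smith", "Jon Smith", "John Smith"],
    "email": ["john@email.com", "john@email.com", "john@email.com", None, "JOHN@EMAIL.COM"],
    "phone": ["555-1234", "555-1234", None, "555-1234", "(555) 1234"],
})

# Pass 1: exact duplicates (cheap, zero false positives)
df = df.drop_duplicates()

# Pass 2: normalize key fields, then look for exact matches on the clean keys
df["email_norm"] = df["email"].str.lower().str.strip()
df["phone_norm"] = df["phone"].str.replace(r"\D", "", regex=True)
key_dupes = (
    df.duplicated(subset=["email_norm", "phone_norm"], keep=False)
    & df["email_norm"].notna()
)

# Only rows surviving both passes need the pairwise fuzzy comparison
fuzzy_candidates = df[~key_dupes]
print(len(df), int(key_dupes.sum()), len(fuzzy_candidates))
```

Each pass shrinks the input to the next one, which matters because the pairwise fuzzy step is O(n²).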
Common Mistakes to Avoid
Cleaning data before understanding it. Always explore first, then clean
Overwriting original data without keeping a backup or audit trail
Applying global rules without checking for edge cases that need special handling
Related Templates
SQL Query Prompt Templates
AI prompt templates for writing SQL queries. Create SELECT, JOIN, aggregate, and complex queries.
Data Analysis Prompt Templates
AI prompt templates for data analysis. Extract insights, identify patterns, and interpret results.
Data Visualization Prompt Templates
AI prompt templates for data visualization. Create effective charts, dashboards, and visual reports.