Python Data Analysis Prompt Templates
AI prompt templates for Python data analysis. Use Pandas, NumPy, and other libraries effectively.
Overview
Python is the go-to language for data analysis, and Pandas is the library that makes it practical. These prompts help you write efficient data manipulation code, avoid common Pandas pitfalls, and get working solutions faster. Whether you're loading messy CSVs or building complex transformations, these templates have you covered.
Best Practices
Include a sample of your data or describe the DataFrame structure (columns, types)
Mention the size of your dataset if performance matters
Specify which libraries you have available (Pandas, NumPy, Polars)
Describe the output format you need (DataFrame, Series, dict, CSV)
Ask for explanations if you want to learn, not just get working code
Prompt Templates
1. DataFrame Manipulation
Write Pandas code to [TASK]. DataFrame structure: [COLUMNS_AND_TYPES] Sample data: [SAMPLE_ROWS] Expected output: [OUTPUT_DESCRIPTION] Performance note: [DATASET_SIZE]
Write Pandas code to calculate the 7-day rolling average of sales for each product. DataFrame structure: date (datetime), product_id (str), sales (float). Sample data: 2024-01-01, SKU001, 150.00 | 2024-01-02, SKU001, 200.00 | 2024-01-01, SKU002, 75.00. Expected output: Same DataFrame with new column 'sales_7d_avg'. Performance note: 1M rows, need it to be efficient.
```python
import pandas as pd
# Sort by product and date for correct rolling calculation
df = df.sort_values(['product_id', 'date'])
# Calculate rolling average per product
df['sales_7d_avg'] = df.groupby('product_id')['sales'].transform(
    lambda x: x.rolling(window=7, min_periods=1).mean()
)
```
For 1M rows, this should complete in a few seconds. If you need faster performance:
```python
# Option 2: use the numba execution engine on very large datasets
# (requires the numba package; supported in Pandas 1.3+)
df['sales_7d_avg'] = df.groupby('product_id')['sales'].transform(
    lambda x: x.rolling(7, min_periods=1).mean(engine='numba')
)
```
Note: `min_periods=1` means the first 6 days will have averages based on fewer than 7 days. Set `min_periods=7` if you want NaN for those rows instead.
When to use: Transforming, filtering, or aggregating data in Pandas.
- Always show sample data. 'I have a DataFrame with sales' is too vague
- Mention whether you need the operation done in place or returned as a new DataFrame
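The `min_periods` distinction above can be seen in a tiny, self-contained sketch (hypothetical three-row data, so no window ever reaches 7 rows):

```python
import pandas as pd

df = pd.DataFrame({'sales': [10.0, 20.0, 30.0]})

# min_periods=1: partial windows still produce a value from day one
partial = df['sales'].rolling(window=7, min_periods=1).mean()

# min_periods=7: rows without a full 7-day window become NaN
strict = df['sales'].rolling(window=7, min_periods=7).mean()

print(partial.tolist())      # [10.0, 15.0, 20.0]
print(strict.isna().all())   # True
```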
2. Data Loading and Export
Help me load data from [SOURCE] into Pandas and handle these issues: File/source details: [DETAILS] Problems I'm seeing: [ISSUES] Desired result: [WHAT_CLEAN_DATA_LOOKS_LIKE] Also show how to export to [OUTPUT_FORMAT] when done.
Load data from a CSV with messy formatting. File details: sales_data.csv, has mixed date formats (MM/DD/YYYY and YYYY-MM-DD), some numeric columns have commas in numbers ($1,234.56), first row is a title (not headers). Problems: Dates parse wrong, numbers load as strings. Desired result: Clean DataFrame with proper datetime and float types. Export to: Parquet for faster future loading.
```python
import pandas as pd
# Load CSV, skip title row, handle encoding
df = pd.read_csv(
    'sales_data.csv',
    skiprows=1,        # Skip the title row
    encoding='utf-8',  # Try 'latin-1' if this fails
    thousands=',',     # Handle comma separators in numbers
    dtype=str          # Load everything as string first for cleaning
)
# Clean currency columns (remove $ and convert to float)
currency_cols = ['revenue', 'cost'] # Adjust to your column names
for col in currency_cols:
    df[col] = df[col].str.replace('[$,]', '', regex=True).astype(float)
# Parse mixed-format dates
df['date'] = pd.to_datetime(df['date'], format='mixed', dayfirst=False)
# Verify dtypes
print(df.dtypes)
print(df.head())
# Export to Parquet (much faster to read later)
df.to_parquet('sales_data.parquet', index=False)
# Next time, load with:
# df = pd.read_parquet('sales_data.parquet')
```
Tips:
- `format='mixed'` handles multiple date formats (Pandas 2.0+)
- Parquet files are typically 2-10x smaller than CSV and load much faster
- For really large files, use the `chunksize` parameter to load in batches
When to use: Getting data into and out of Pandas, especially messy files.
- Load as strings first (`dtype=str`) when you have mixed or messy data, then clean and convert
- Parquet is almost always better than CSV for intermediate storage
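The `chunksize` tip can be sketched like this; an in-memory `StringIO` stands in for a large file on disk, and the column name is hypothetical:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk
csv_data = io.StringIO('amount\n100\n200\n300\n400\n')

# Process the file in batches instead of loading it all at once
total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['amount'].sum()

print(total)  # 1000.0
```

Each `chunk` is an ordinary DataFrame, so any per-batch cleaning or aggregation works inside the loop.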
3. GroupBy and Aggregation
Write Pandas code to group by [GROUP_COLUMNS] and calculate [AGGREGATIONS]. DataFrame: [DESCRIBE_DATA] Group by: [COLUMNS] For each group, calculate: [LIST_OF_AGGREGATIONS] Output format: [FLAT_DF/MULTI_INDEX/DICT]
Group by region and product_category, then calculate: total sales, average order value, order count, percentage of total sales. DataFrame: orders with columns order_id, region, product_category, amount. Output format: Flat DataFrame with readable column names.
```python
import pandas as pd
# Calculate total for percentage later
total_sales = df['amount'].sum()
# Group and aggregate
result = df.groupby(['region', 'product_category']).agg(
    total_sales=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    order_count=('order_id', 'count')
).reset_index()
# Add percentage of total
result['pct_of_total'] = (result['total_sales'] / total_sales * 100).round(2)
# Sort by total sales descending
result = result.sort_values('total_sales', ascending=False)
print(result)
```
Output columns: region, product_category, total_sales, avg_order_value, order_count, pct_of_total
Alternative using explicit `pd.NamedAgg` and `as_index=False` (avoids the separate `reset_index()` call):
```python
result = df.groupby(['region', 'product_category'], as_index=False).agg(
    total_sales=pd.NamedAgg(column='amount', aggfunc='sum'),
    avg_order_value=pd.NamedAgg(column='amount', aggfunc='mean'),
    order_count=pd.NamedAgg(column='order_id', aggfunc='count')
)
```
Note: `reset_index()` or `as_index=False` gives you a flat DataFrame instead of a MultiIndex.
When to use: Summarizing data by categories.
- Use named aggregation (`pd.NamedAgg`) for readable code when doing multiple aggregations
- Add `as_index=False` to `groupby()` if you want a flat DataFrame without an extra `reset_index()` call
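A related pattern worth knowing: to compute each row's share *within its group* (rather than of the grand total, as in the example above), `groupby().transform()` keeps the DataFrame at full length. A minimal sketch with hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'East', 'West'],
    'amount': [100.0, 300.0, 50.0],
})

# Share of each row within its own region, without collapsing the DataFrame
df['pct_of_region'] = (
    df['amount'] / df.groupby('region')['amount'].transform('sum') * 100
)
print(df['pct_of_region'].tolist())  # [25.0, 75.0, 100.0]
```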
4. Merge and Join Operations
Help me combine these DataFrames: DataFrame 1: [DESCRIPTION] DataFrame 2: [DESCRIPTION] Join type needed: [LEFT/RIGHT/INNER/OUTER] Join keys: [COLUMNS] Potential issues: [DUPLICATES/MISSING_KEYS/TYPE_MISMATCH]
DataFrame 1: orders (order_id, customer_id, amount, date) - 50K rows. DataFrame 2: customers (customer_id, name, email, segment) - 10K rows. Join type: Left join (keep all orders even if customer info missing). Join keys: customer_id. Potential issues: customer_id is int in orders but string in customers, some orders have customer_id = 0 which doesn't exist in customers.
```python
import pandas as pd
# First, fix the type mismatch
# Convert both to same type (string is safer for IDs)
orders['customer_id'] = orders['customer_id'].astype(str)
customers['customer_id'] = customers['customer_id'].astype(str)
# Handle the invalid customer_id = '0' case
# Option 1: Replace with NaN before merge
orders.loc[orders['customer_id'] == '0', 'customer_id'] = pd.NA
# Perform the left join
result = orders.merge(
    customers,
    on='customer_id',
    how='left',
    indicator=True  # Adds column showing match status
)
# Check the merge results
print(result['_merge'].value_counts())
# both = matched
# left_only = order with no customer match
# right_only = (won't appear in left join)
# Optional: drop the indicator column if not needed
result = result.drop('_merge', axis=1)
```
Diagnostics to run:
```python
# Check for duplicate keys (can cause row explosion)
print(f"Orders with duplicate customer_id: {orders['customer_id'].duplicated().sum()}")
print(f"Customers with duplicate customer_id: {customers['customer_id'].duplicated().sum()}")
# If customers has duplicates, the merge will create multiple rows per order!
# Fix by deduplicating first:
# customers = customers.drop_duplicates(subset='customer_id', keep='first')
```
Expected result: 50K rows (same as orders, since a left join preserves all left rows). If you get more rows, you have duplicate keys in the customers table.
When to use: Combining data from multiple sources.
- Always check row counts before and after a merge. An unexpected increase means duplicate keys
- Use `indicator=True` to see which rows matched and which didn't
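Pandas can also enforce the expected key relationship for you: `merge(validate=...)` raises `MergeError` at merge time instead of silently multiplying rows. A small sketch with hypothetical data:

```python
import pandas as pd

orders = pd.DataFrame({'customer_id': ['1', '1', '2'],
                       'amount': [10, 20, 30]})
customers = pd.DataFrame({'customer_id': ['1', '2'],
                          'name': ['Ann', 'Bob']})

# validate='many_to_one' asserts the right side has unique keys
result = orders.merge(customers, on='customer_id', how='left',
                      validate='many_to_one')
print(len(result))  # 3 — row count preserved

# With duplicate customer IDs the same call raises MergeError
dup_customers = pd.concat([customers, customers])
caught = False
try:
    orders.merge(dup_customers, on='customer_id', how='left',
                 validate='many_to_one')
except pd.errors.MergeError:
    caught = True
print('duplicate keys detected:', caught)
```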
5. Performance Optimization
My Pandas code is too slow. Help me optimize it. Current code: [YOUR_CODE] Dataset size: [ROWS_AND_COLUMNS] Current runtime: [TIME] Target runtime: [GOAL] What's causing the slowdown and how can I fix it?
Current code: for index, row in df.iterrows(): df.loc[index, 'new_col'] = row['price'] * row['quantity'] * (1 - row['discount']). Dataset size: 2 million rows. Current runtime: 15 minutes. Target: Under 1 second.
**Problem**: `iterrows()` is almost always the wrong approach in Pandas. It processes one row at a time in Python, which is extremely slow.
**Solution**: Use vectorized operations.
```python
# Replace the loop with a single vectorized operation
df['new_col'] = df['price'] * df['quantity'] * (1 - df['discount'])
```
This will run in milliseconds, not minutes.
**Why it's faster**:
- Vectorized operations use optimized C code under the hood
- They process the entire column at once instead of row by row
- No Python interpreter overhead for each iteration
**General rules to avoid slow Pandas:**
1. **Never use iterrows() for calculations** - Use vectorized operations instead
2. **Avoid apply() when vectorized alternatives exist** - apply() is faster than iterrows but still slow
3. **Use appropriate dtypes** - category for low-cardinality strings, int32 instead of int64 when possible
4. **Filter early** - If you only need 10% of rows, filter before doing expensive operations
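One common `apply()` temptation from rule 2 — row-wise if/elif/else logic — has a vectorized replacement in `np.select`. A sketch with hypothetical thresholds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'amount': [50.0, 500.0, 5000.0]})

# Vectorized if/elif/else: conditions are checked in order,
# and rows matching none fall through to the default
conditions = [df['amount'] >= 1000, df['amount'] >= 100]
choices = ['large', 'medium']
df['size'] = np.select(conditions, choices, default='small')
print(df['size'].tolist())  # ['small', 'medium', 'large']
```

For a single condition, `np.where(cond, a, b)` is the simpler two-branch version.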
**If you must loop** (rare cases like complex state-dependent logic):
```python
# Use itertuples() instead of iterrows() - 10x faster
results = []
for row in df.itertuples():
    results.append(row.price * row.quantity * (1 - row.discount))
df['new_col'] = results
```
**For truly large datasets (10M+ rows)**, consider:
- Polars instead of Pandas (often 5-10x faster)
- Dask for parallel processing
- Loading only needed columns: `pd.read_csv('file.csv', usecols=['price', 'quantity', 'discount'])`
When to use: When your Pandas code takes too long to run.
- Profile before improving. Use `%timeit` in Jupyter to measure what's actually slow
- Sometimes the best optimization is using less data (filter, sample, or aggregate earlier)
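Outside Jupyter, where `%timeit` isn't available, `time.perf_counter()` gives a rough measurement. A sketch timing the vectorized calculation from the example above on synthetic data:

```python
import time
import numpy as np
import pandas as pd

# Synthetic 100K-row DataFrame standing in for real data
df = pd.DataFrame({
    'price': np.random.rand(100_000),
    'quantity': np.random.randint(1, 10, 100_000),
    'discount': np.random.rand(100_000) * 0.3,
})

start = time.perf_counter()
df['total'] = df['price'] * df['quantity'] * (1 - df['discount'])
elapsed = time.perf_counter() - start
print(f'vectorized: {elapsed:.4f}s')  # typically well under a second
```

Time the slowest candidate first; optimizing code that accounts for 1% of the runtime is wasted effort.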
Common Mistakes to Avoid
- Using `iterrows()` for calculations instead of vectorized operations. This can make code 1000x slower than necessary
- Not checking DataFrame shape after merge operations. Duplicate keys cause silent row multiplication
- Forgetting that many Pandas operations return copies rather than modifying in place. Assign the result back instead of relying on `inplace=True`
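The third mistake in two lines, using a hypothetical one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2]})

# sort_values returns a new DataFrame; the original is unchanged
sorted_df = df.sort_values('a')
print(df['a'].tolist())         # [3, 1, 2] — original untouched
print(sorted_df['a'].tolist())  # [1, 2, 3]

# Assign the result back (preferred over inplace=True)
df = df.sort_values('a').reset_index(drop=True)
```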
Related Templates
SQL Query Prompt Templates
AI prompt templates for writing SQL queries. Create SELECT, JOIN, aggregate, and complex queries.
Data Analysis Prompt Templates
AI prompt templates for data analysis. Extract insights, identify patterns, and interpret results.
Data Visualization Prompt Templates
AI prompt templates for data visualization. Create effective charts, dashboards, and visual reports.