Python Data Analysis Prompt Templates
AI prompt templates for Python data analysis. Use Pandas, NumPy, and other libraries effectively.
Overview
Python is the go-to language for data analysis, and Pandas is the library that makes it practical. These prompts help you write efficient data manipulation code, avoid common Pandas pitfalls, and get working solutions faster. Whether you're loading messy CSVs or building complex transformations, these templates have you covered.
Best Practices
Include a sample of your data or describe the DataFrame structure (columns, types)
Mention the size of your dataset if performance matters
Specify which libraries you have available (Pandas, NumPy, Polars)
Describe the output format you need (DataFrame, Series, dict, CSV)
Ask for explanations if you want to learn, not just get working code
Prompt Templates
1. DataFrame Manipulation
Write Pandas code to [TASK]. DataFrame structure: [COLUMNS_AND_TYPES] Sample data: [SAMPLE_ROWS] Expected output: [OUTPUT_DESCRIPTION] Performance note: [DATASET_SIZE]
Write Pandas code to calculate the 7-day rolling average of sales for each product. DataFrame structure: date (datetime), product_id (str), sales (float). Sample data: 2024-01-01, SKU001, 150.00 | 2024-01-02, SKU001, 200.00 | 2024-01-01, SKU002, 75.00. Expected output: Same DataFrame with new column 'sales_7d_avg'. Performance note: 1M rows, need it to be efficient.
```python
import pandas as pd
# Sort by product and date for correct rolling calculation
df = df.sort_values(['product_id', 'date'])
# Calculate rolling average per product
df['sales_7d_avg'] = df.groupby('product_id')['sales'].transform(
    lambda x: x.rolling(window=7, min_periods=1).mean()
)
```
For 1M rows, this should complete in a few seconds. If you need faster performance:
```python
# Option 2: use the numba execution engine on very large datasets
# (requires the numba package; supported in Pandas 1.3+)
df['sales_7d_avg'] = df.groupby('product_id')['sales'].transform(
    lambda x: x.rolling(7, min_periods=1).mean(engine='numba')
)
```
Note: `min_periods=1` means the first 6 days will have averages based on fewer than 7 days. Set `min_periods=7` if you want NaN for those rows instead.
When to use: Transforming, filtering, or aggregating data in Pandas.
- Always show sample data. 'I have a DataFrame with sales' is too vague
- Mention whether you need the operation done in place or returned as a new DataFrame
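The `min_periods` distinction above can be seen in a tiny, self-contained sketch (hypothetical three-row data, so no window ever reaches 7 rows):

```python
import pandas as pd

df = pd.DataFrame({'sales': [10.0, 20.0, 30.0]})

# min_periods=1: partial windows still produce a value from day one
partial = df['sales'].rolling(window=7, min_periods=1).mean()

# min_periods=7: rows without a full 7-day window become NaN
strict = df['sales'].rolling(window=7, min_periods=7).mean()

print(partial.tolist())      # [10.0, 15.0, 20.0]
print(strict.isna().all())   # True
```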
2. Data Loading and Export
Help me load data from [SOURCE] into Pandas and handle these issues: File/source details: [DETAILS] Problems I'm seeing: [ISSUES] Desired result: [WHAT_CLEAN_DATA_LOOKS_LIKE] Also show how to export to [OUTPUT_FORMAT] when done.
Load data from a CSV with messy formatting. File details: sales_data.csv, has mixed date formats (MM/DD/YYYY and YYYY-MM-DD), some numeric columns have commas in numbers ($1,234.56), first row is a title (not headers). Problems: Dates parse wrong, numbers load as strings. Desired result: Clean DataFrame with proper datetime and float types. Export to: Parquet for faster future loading.
```python
import pandas as pd
# Load CSV, skip title row, handle encoding
df = pd.read_csv(
    'sales_data.csv',
    skiprows=1,        # Skip the title row
    encoding='utf-8',  # Try 'latin-1' if this fails
    thousands=',',     # Handle comma separators in numbers
    dtype=str          # Load everything as string first for cleaning
)
# Clean currency columns (remove $ and convert to float)
currency_cols = ['revenue', 'cost'] # Adjust to your column names
for col in currency_cols:
    df[col] = df[col].str.replace('[$,]', '', regex=True).astype(float)
# Parse mixed-format dates
df['date'] = pd.to_datetime(df['date'], format='mixed', dayfirst=False)
# Verify dtypes
print(df.dtypes)
print(df.head())
# Export to Parquet (much faster to read later)
df.to_parquet('sales_data.parquet', index=False)
# Next time, load with:
# df = pd.read_parquet('sales_data.parquet')
```
Tips:
- `format='mixed'` handles multiple date formats (Pandas 2.0+)
- Parquet files are typically 2-10x smaller than CSV and load much faster
- For really large files, use the `chunksize` parameter to load in batches
When to use: Getting data into and out of Pandas, especially messy files.
- Load as strings first (`dtype=str`) when you have mixed or messy data, then clean and convert
- Parquet is almost always better than CSV for intermediate storage
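The `chunksize` tip can be sketched like this; an in-memory `StringIO` stands in for a large file on disk, and the column name is hypothetical:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk
csv_data = io.StringIO('amount\n100\n200\n300\n400\n')

# Process the file in batches instead of loading it all at once
total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['amount'].sum()

print(total)  # 1000.0
```

Each `chunk` is an ordinary DataFrame, so any per-batch cleaning or aggregation works inside the loop.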
3. GroupBy and Aggregation
Write Pandas code to group by [GROUP_COLUMNS] and calculate [AGGREGATIONS]. DataFrame: [DESCRIBE_DATA] Group by: [COLUMNS] For each group, calculate: [LIST_OF_AGGREGATIONS] Output format: [FLAT_DF/MULTI_INDEX/DICT]
Group by region and product_category, then calculate: total sales, average order value, order count, percentage of total sales. DataFrame: orders with columns order_id, region, product_category, amount. Output format: Flat DataFrame with readable column names.
```python
import pandas as pd
# Calculate total for percentage later
total_sales = df['amount'].sum()
# Group and aggregate
result = df.groupby(['region', 'product_category']).agg(
    total_sales=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    order_count=('order_id', 'count')
).reset_index()
# Add percentage of total
result['pct_of_total'] = (result['total_sales'] / total_sales * 100).round(2)
# Sort by total sales descending
result = result.sort_values('total_sales', ascending=False)
print(result)
```
Output columns: region, product_category, total_sales, avg_order_value, order_count, pct_of_total
Alternative using explicit `pd.NamedAgg` and `as_index=False` (avoids the separate `reset_index()` call):
```python
result = df.groupby(['region', 'product_category'], as_index=False).agg(
    total_sales=pd.NamedAgg(column='amount', aggfunc='sum'),
    avg_order_value=pd.NamedAgg(column='amount', aggfunc='mean'),
    order_count=pd.NamedAgg(column='order_id', aggfunc='count')
)
```
Note: `reset_index()` or `as_index=False` gives you a flat DataFrame instead of a MultiIndex.
When to use: Summarizing data by categories.
- Use named aggregation (`pd.NamedAgg`) for readable code when doing multiple aggregations
- Add `as_index=False` to `groupby()` if you want a flat DataFrame without an extra `reset_index()` call
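A related pattern worth knowing: to compute each row's share *within its group* (rather than of the grand total, as in the example above), `groupby().transform()` keeps the DataFrame at full length. A minimal sketch with hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'East', 'West'],
    'amount': [100.0, 300.0, 50.0],
})

# Share of each row within its own region, without collapsing the DataFrame
df['pct_of_region'] = (
    df['amount'] / df.groupby('region')['amount'].transform('sum') * 100
)
print(df['pct_of_region'].tolist())  # [25.0, 75.0, 100.0]
```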
4. Merge and Join Operations
Help me combine these DataFrames: DataFrame 1: [DESCRIPTION] DataFrame 2: [DESCRIPTION] Join type needed: [LEFT/RIGHT/INNER/OUTER] Join keys: [COLUMNS] Potential issues: [DUPLICATES/MISSING_KEYS/TYPE_MISMATCH]
DataFrame 1: orders (order_id, customer_id, amount, date) - 50K rows. DataFrame 2: customers (customer_id, name, email, segment) - 10K rows. Join type: Left join (keep all orders even if customer info missing). Join keys: customer_id. Potential issues: customer_id is int in orders but string in customers, some orders have customer_id = 0 which doesn't exist in customers.
```python
import pandas as pd
# First, fix the type mismatch
# Convert both to same type (string is safer for IDs)
orders['customer_id'] = orders['customer_id'].astype(str)
customers['customer_id'] = customers['customer_id'].astype(str)
# Handle the invalid customer_id = '0' case
# Option 1: Replace with NaN before merge
orders.loc[orders['customer_id'] == '0', 'customer_id'] = pd.NA
# Perform the left join
result = orders.merge(
    customers,
    on='customer_id',
    how='left',
    indicator=True  # Adds column showing match status
)
# Check the merge results
print(result['_merge'].value_counts())
# both = matched
# left_only = order with no customer match
# right_only = (won't appear in left join)
# Optional: drop the indicator column if not needed
result = result.drop('_merge', axis=1)
```
Diagnostics to run:
```python
# Check for duplicate keys (can cause row explosion)
print(f"Orders with duplicate customer_id: {orders['customer_id'].duplicated().sum()}")
print(f"Customers with duplicate customer_id: {customers['customer_id'].duplicated().sum()}")
# If customers has duplicates, the merge will create multiple rows per order!
# Fix by deduplicating first:
# customers = customers.drop_duplicates(subset='customer_id', keep='first')
```
Expected result: 50K rows (same as orders, since a left join preserves all left rows). If you get more rows, you have duplicate keys in the customers table.
When to use: Combining data from multiple sources.
- Always check row counts before and after a merge. An unexpected increase means duplicate keys
- Use `indicator=True` to see which rows matched and which didn't
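Pandas can also enforce the expected key relationship for you: `merge(validate=...)` raises `MergeError` at merge time instead of silently multiplying rows. A small sketch with hypothetical data:

```python
import pandas as pd

orders = pd.DataFrame({'customer_id': ['1', '1', '2'],
                       'amount': [10, 20, 30]})
customers = pd.DataFrame({'customer_id': ['1', '2'],
                          'name': ['Ann', 'Bob']})

# validate='many_to_one' asserts the right side has unique keys
result = orders.merge(customers, on='customer_id', how='left',
                      validate='many_to_one')
print(len(result))  # 3 — row count preserved

# With duplicate customer IDs the same call raises MergeError
dup_customers = pd.concat([customers, customers])
caught = False
try:
    orders.merge(dup_customers, on='customer_id', how='left',
                 validate='many_to_one')
except pd.errors.MergeError:
    caught = True
print('duplicate keys detected:', caught)
```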
5. Performance Optimization
My Pandas code is too slow. Help me optimize it. Current code: [YOUR_CODE] Dataset size: [ROWS_AND_COLUMNS] Current runtime: [TIME] Target runtime: [GOAL] What's causing the slowdown and how can I fix it?
Current code: for index, row in df.iterrows(): df.loc[index, 'new_col'] = row['price'] * row['quantity'] * (1 - row['discount']). Dataset size: 2 million rows. Current runtime: 15 minutes. Target: Under 1 second.
**Problem**: `iterrows()` is almost always the wrong approach in Pandas. It processes one row at a time in Python, which is extremely slow.
**Solution**: Use vectorized operations.
```python
# Replace the loop with a single vectorized operation
df['new_col'] = df['price'] * df['quantity'] * (1 - df['discount'])
```
This will run in milliseconds, not minutes.
**Why it's faster**:
- Vectorized operations use optimized C code under the hood
- They process the entire column at once instead of row by row
- No Python interpreter overhead for each iteration
**General rules to avoid slow Pandas:**
1. **Never use iterrows() for calculations** - Use vectorized operations instead
2. **Avoid apply() when vectorized alternatives exist** - apply() is faster than iterrows but still slow
3. **Use appropriate dtypes** - category for low-cardinality strings, int32 instead of int64 when possible
4. **Filter early** - If you only need 10% of rows, filter before doing expensive operations
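One common `apply()` temptation from rule 2 — row-wise if/elif/else logic — has a vectorized replacement in `np.select`. A sketch with hypothetical thresholds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'amount': [50.0, 500.0, 5000.0]})

# Vectorized if/elif/else: conditions are checked in order,
# and rows matching none fall through to the default
conditions = [df['amount'] >= 1000, df['amount'] >= 100]
choices = ['large', 'medium']
df['size'] = np.select(conditions, choices, default='small')
print(df['size'].tolist())  # ['small', 'medium', 'large']
```

For a single condition, `np.where(cond, a, b)` is the simpler two-branch version.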
**If you must loop** (rare cases like complex state-dependent logic):
```python
# Use itertuples() instead of iterrows() - 10x faster
results = []
for row in df.itertuples():
    results.append(row.price * row.quantity * (1 - row.discount))
df['new_col'] = results
```
**For truly large datasets (10M+ rows)**, consider:
- Polars instead of Pandas (often 5-10x faster)
- Dask for parallel processing
- Loading only needed columns: `pd.read_csv('file.csv', usecols=['price', 'quantity', 'discount'])`
When to use: When your Pandas code takes too long to run.
- Profile before improving. Use `%timeit` in Jupyter to measure what's actually slow
- Sometimes the best optimization is using less data (filter, sample, or aggregate earlier)
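Outside Jupyter, where `%timeit` isn't available, `time.perf_counter()` gives a rough measurement. A sketch timing the vectorized calculation from the example above on synthetic data:

```python
import time
import numpy as np
import pandas as pd

# Synthetic 100K-row DataFrame standing in for real data
df = pd.DataFrame({
    'price': np.random.rand(100_000),
    'quantity': np.random.randint(1, 10, 100_000),
    'discount': np.random.rand(100_000) * 0.3,
})

start = time.perf_counter()
df['total'] = df['price'] * df['quantity'] * (1 - df['discount'])
elapsed = time.perf_counter() - start
print(f'vectorized: {elapsed:.4f}s')  # typically well under a second
```

Time the slowest candidate first; optimizing code that accounts for 1% of the runtime is wasted effort.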
Common Mistakes to Avoid
- Using `iterrows()` for calculations instead of vectorized operations. This can make code 1000x slower than necessary
- Not checking DataFrame shape after merge operations. Duplicate keys cause silent row multiplication
- Forgetting that many Pandas operations return copies rather than modifying in place. Assign the result back instead of relying on `inplace=True`
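The third mistake in two lines, using a hypothetical one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2]})

# sort_values returns a new DataFrame; the original is unchanged
sorted_df = df.sort_values('a')
print(df['a'].tolist())         # [3, 1, 2] — original untouched
print(sorted_df['a'].tolist())  # [1, 2, 3]

# Assign the result back (preferred over inplace=True)
df = df.sort_values('a').reset_index(drop=True)
```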
Related Templates
SQL Query Prompt Templates
AI prompt templates for writing SQL queries. Create SELECT, JOIN, aggregate, and complex queries.
Data Analysis Prompt Templates
AI prompt templates for data analysis. Extract insights, identify patterns, and interpret results.
Data Visualization Prompt Templates
AI prompt templates for data visualization. Create effective charts, dashboards, and visual reports.