TL;DR - Best AI Prompts for Data Scientists & Analysts
Looking for ready-to-use AI prompts for data science? This guide contains 40+ copy-paste prompts that work with ChatGPT, Claude, and Gemini. Each prompt includes placeholders you can customize for your specific analysis needs. For foundational prompting skills, see the Prompt Engineering Fundamentals guide.
What’s included:
- SQL Prompts — Generate complex queries, optimize performance, and debug errors
- EDA Prompts — Explore datasets, generate summary statistics, and find patterns
- Data Cleaning Prompts — Handle missing values, outliers, and data quality issues
- Statistical Analysis Prompts — Hypothesis testing, A/B tests, regression analysis
- Machine Learning Prompts — Model selection, feature engineering, evaluation
- Visualization Prompts — Create charts, dashboards, and presentation-ready graphics
- Communication Prompts — Write reports, explain findings, and tell data stories
💡 Pro tip: Include your data schema, sample data, and business context. The more specific you are about your data and objectives, the better the output. For advanced prompting techniques, see the Advanced Prompt Engineering guide.
How to Use These Data Science Prompts
Each prompt below is ready to copy and paste. Here’s how they work:
- Copy the entire prompt from the code block
- Replace the placeholder comments (lines starting with # REPLACE:) with your actual content
- Paste into ChatGPT, Claude, or your preferred AI
- Get your result and iterate if needed
Adding Context for Better Results
Data science prompts work best with rich context:
=== DATA CONTEXT (include in any prompt) ===
Dataset: [Name and description]
Size: [Rows x columns, file size]
Key Columns: [Important columns with types]
Business Context: [What business problem are you solving]
Tools: [Python/R, specific libraries]
Output: [Code only / Code + explanation / Explanation only]
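To see what a filled-in context block looks like, here's an illustrative example (every value below is made up):

=== DATA CONTEXT ===
Dataset: ecommerce_orders (one row per completed order)
Size: 1.2M rows x 14 columns, ~300 MB
Key Columns: order_id (int), customer_id (int), amount (float, USD), created_at (timestamp)
Business Context: Identify which customer segments drive repeat purchases
Tools: Python (pandas, seaborn)
Output: Code + explanation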
SQL Query Prompts
Generate SQL Query from Natural Language
Use this to convert plain English descriptions into SQL.
Write a SQL query for the following request.
=== DATABASE ===
Database type:
# REPLACE: PostgreSQL / MySQL / SQL Server / BigQuery / Snowflake
=== SCHEMA ===
# REPLACE: Describe or paste your table schemas
# Example:
# users (user_id INT PK, email VARCHAR, created_at TIMESTAMP, plan VARCHAR)
# orders (order_id INT PK, user_id INT FK, amount DECIMAL, status VARCHAR, created_at TIMESTAMP)
# products (product_id INT PK, name VARCHAR, category VARCHAR, price DECIMAL)
=== REQUEST ===
# REPLACE: Describe what you need in plain English
# Example: Get monthly revenue by product category for the last 12 months,
# showing month-over-month growth percentage
=== REQUIREMENTS ===
# REPLACE: Any specific requirements
# - Must handle NULLs
# - Needs pagination
# - Performance is critical (table has 10M+ rows)
=== OUTPUT ===
1. The SQL query with comments explaining each section
2. Sample output showing expected columns
3. Suggested indexes for performance (if applicable)
4. Alternative approaches (if any)
Optimize Slow SQL Query
Use this to optimize query performance.
Optimize this SQL query for better performance.
=== CURRENT QUERY ===
# REPLACE: Paste your slow query
=== CONTEXT ===
Database:
# REPLACE: PostgreSQL / MySQL / etc.
Table sizes:
# REPLACE: Approximate row counts
# - users: 1M rows
# - orders: 50M rows
Current execution time:
# REPLACE: e.g., 45 seconds
Target execution time:
# REPLACE: e.g., under 5 seconds
Existing indexes:
# REPLACE: List current indexes or "Unknown"
=== ANALYZE AND OPTIMIZE ===
Provide:
1. Analysis of why it's slow
2. Optimized query with explanation
3. Recommended indexes (CREATE INDEX statements)
4. Query execution plan considerations
5. Alternative approaches if applicable
Debug SQL Error
Use this when you have a SQL error to fix.
Help me debug this SQL error.
=== ERROR MESSAGE ===
# REPLACE: Paste the full error message
=== QUERY ===
# REPLACE: Paste your query
=== SCHEMA (relevant tables) ===
# REPLACE: Table definitions
=== DATABASE ===
# REPLACE: Database type and version
Please:
1. Explain what the error means
2. Identify the exact cause
3. Provide the corrected query
4. Explain how to avoid this in the future
Complex SQL with CTEs and Window Functions
Use this for advanced analytical queries.
Write an advanced SQL query using CTEs and window functions.
=== ANALYSIS NEEDED ===
# REPLACE: Describe the analysis
# Example: Calculate running total of revenue per customer,
# rank customers by lifetime value, and identify customers
# whose spending dropped more than 50% month-over-month
=== SCHEMA ===
# REPLACE: Relevant tables
=== REQUIREMENTS ===
# REPLACE: Specific requirements
# - Use CTEs for readability
# - Include window functions where appropriate
# - Handle edge cases (new customers, null values)
=== OUTPUT FORMAT ===
| Column | Description |
|--------|-------------|
# REPLACE: Expected output columns
Generate query with:
1. CTEs for each logical step
2. Appropriate window functions
3. Comments explaining the logic
4. Sample output
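If you want to sanity-check the SQL the AI returns, here's a minimal pandas sketch of the same kind of analysis (running totals, LTV ranking, month-over-month drops); the DataFrame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical data: one row per customer per month
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_month": ["2024-01", "2024-02", "2024-01", "2024-02", "2024-03"],
    "revenue": [100.0, 50.0, 200.0, 80.0, 30.0],
})
df = df.sort_values(["customer_id", "order_month"])

# Running total per customer (SQL: SUM(...) OVER (PARTITION BY ... ORDER BY ...))
df["running_revenue"] = df.groupby("customer_id")["revenue"].cumsum()

# Rank customers by lifetime value (SQL: RANK() OVER (ORDER BY ... DESC))
ltv = df.groupby("customer_id")["revenue"].sum()
df["ltv_rank"] = df["customer_id"].map(ltv.rank(method="dense", ascending=False))

# Flag customers whose spend dropped more than 50% month-over-month
df["mom_change"] = df.groupby("customer_id")["revenue"].pct_change()
print(df[df["mom_change"] < -0.5])
```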
Exploratory Data Analysis Prompts
Initial Dataset Exploration
Use this when you first receive a new dataset. For more on how AI handles data, see the Tokens, Context Windows & Parameters guide.
Generate Python code for initial exploratory data analysis.
=== DATASET ===
Name:
# REPLACE: Dataset name
Description:
# REPLACE: What does this data represent?
Format:
# REPLACE: CSV / Parquet / JSON / Database table
=== SAMPLE DATA (optional) ===
# REPLACE: Paste first few rows if available
=== COLUMNS (if known) ===
# REPLACE: List columns with expected types
# - customer_id: int (unique identifier)
# - purchase_amount: float (in USD)
# - purchase_date: datetime
=== GENERATE EDA CODE ===
Provide Python code (using pandas, numpy, matplotlib, seaborn) for:
1. **Data Loading & First Look**
- Load data
- .head(), .info(), .describe()
- Column types and memory usage
2. **Missing Values Analysis**
- Count and percentage by column
- Visualize missing patterns
3. **Univariate Analysis**
- Distributions for numerical columns
- Value counts for categorical columns
- Statistical summaries
4. **Bivariate Analysis**
- Correlation matrix
- Key relationships
- Scatter plots for important pairs
5. **Data Quality Checks**
- Duplicates
- Outliers
- Inconsistent values
6. **Initial Insights**
- Key observations
- Potential issues
- Recommended next steps
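As a baseline for what good output looks like, here's a minimal sketch covering the first-look, missing-values, and duplicate checks (the file path is a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")  # placeholder path

# 1. First look
print(df.head())
print(df.info())
print(df.describe(include="all"))

# 2. Missing values: count and percentage per column
missing = df.isna().sum().to_frame("count")
missing["pct"] = 100 * missing["count"] / len(df)
print(missing.sort_values("pct", ascending=False))

# 3. Quick distributions for numeric columns
df.select_dtypes("number").hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# 5. Duplicate check
print(f"Duplicate rows: {df.duplicated().sum()}")
```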
Analyze Specific Variables
Use this to deep-dive into specific columns.
Analyze these specific variables in my dataset.
=== VARIABLES TO ANALYZE ===
# REPLACE: List variables with their types
# - revenue (continuous, numeric)
# - customer_segment (categorical)
# - signup_date (datetime)
=== ANALYSIS GOALS ===
# REPLACE: What questions are you trying to answer?
# - Is revenue normally distributed?
# - How do segments differ in behavior?
# - Are there seasonal patterns?
=== SAMPLE DATA ===
# REPLACE: Paste sample or describe data characteristics
=== GENERATE ANALYSIS CODE ===
Python code for:
1. **Numerical Variables**
- Distribution (histogram, KDE)
- Summary statistics (mean, median, std, quartiles)
- Skewness and kurtosis
- Outlier detection (IQR, Z-score)
- Transformation suggestions
2. **Categorical Variables**
- Value counts and proportions
- Bar charts
- Chi-square test for independence (if applicable)
- Cardinality assessment
3. **Datetime Variables**
- Time range and gaps
- Patterns (daily, weekly, monthly, yearly)
- Trend decomposition
4. **Variable Relationships**
- Cross-tabulations
- Group comparisons
- Visualizations by segment
5. **Summary of Findings**
- Key insights
- Anomalies detected
- Recommendations
Correlation and Relationship Analysis
Use this to understand variable relationships.
Analyze correlations and relationships in my dataset.
=== DATASET DESCRIPTION ===
# REPLACE: Describe your dataset and key columns
=== TARGET VARIABLE (if applicable) ===
# REPLACE: What are you trying to predict/explain?
=== SAMPLE DATA ===
# REPLACE: Paste sample data or schema
=== GENERATE ANALYSIS ===
Python code for:
1. **Correlation Matrix**
- Pearson correlation for numeric variables
- Heatmap visualization
- Highlight strong correlations (|r| > 0.7)
2. **Multicollinearity Check**
- VIF (Variance Inflation Factor)
- Recommendations for feature selection
3. **Target Relationships**
- Correlation with target variable
- Top predictive features
- Scatter plots with trend lines
4. **Non-Linear Relationships**
- Spearman correlation
- Visual inspection for curves
- Polynomial relationship detection
5. **Categorical Associations**
- Cramér's V for categorical variables
- Point-biserial for mixed types
6. **Insights Summary**
- Key relationships found
- Potential issues (multicollinearity)
- Feature recommendations
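For reference, the multicollinearity check in step 2 typically looks something like this (assumes a DataFrame df is already loaded; a VIF above roughly 5 to 10 is a common warning threshold):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Assumes df is your loaded DataFrame; keep only complete numeric rows
X = add_constant(df.select_dtypes("number").dropna())

# VIF for each column: how much its variance is inflated by the other features
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False))
```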
Data Cleaning Prompts
Handle Missing Values
Use this to address missing data.
Generate code to handle missing values in my dataset.
=== DATASET ===
# REPLACE: Dataset description
=== MISSING VALUE PATTERN ===
# REPLACE: Describe what you know about missing values
# - Column X: 15% missing, likely MCAR
# - Column Y: 30% missing, possibly related to Column Z
# - Column W: Completely missing for certain dates
=== CONSTRAINTS ===
# REPLACE: Any constraints on imputation
# - Cannot remove rows (need all records)
# - Must handle categorical and numeric differently
# - Need to preserve distributions
=== GENERATE SOLUTION ===
Python code for:
1. **Missing Value Assessment**
- Missing counts and percentages
- Missing patterns (MCAR, MAR, MNAR analysis)
- Visualization of missingness
2. **Handling Strategy by Column**
For each column with missing values:
- **Numeric columns:**
- Mean/median imputation
- KNN imputation
- Regression imputation
- **Categorical columns:**
- Mode imputation
- Create "Unknown" category
- Predictive imputation
- **Time-based:**
- Forward/backward fill
- Interpolation
3. **Imputation Code**
- Reusable functions
- Before/after comparison
- Validation
4. **Documentation**
- What was done to each column
- Justification for approach
- Potential impact on analysis
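As one example of the kind of code this prompt should produce, here's a sketch of KNN imputation for numeric columns and mode imputation for categoricals (assumes a DataFrame df; whether these strategies fit depends on your missingness pattern):

```python
from sklearn.impute import KNNImputer, SimpleImputer

# Assumes df is your loaded DataFrame
num_cols = df.select_dtypes("number").columns
cat_cols = df.select_dtypes("object").columns

# Numeric: KNN imputation preserves local structure better than a global mean
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])

# Categorical: mode imputation (an explicit "Unknown" category is an alternative)
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
```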
Detect and Handle Outliers
Use this for outlier analysis.
Analyze and handle outliers in my dataset.
=== DATASET ===
# REPLACE: Describe dataset
=== COLUMNS TO CHECK ===
# REPLACE: List numerical columns to check for outliers
# - purchase_amount
# - session_duration
# - age
=== BUSINESS CONTEXT ===
# REPLACE: What constitutes a valid extreme value?
# - Purchase amounts over $10K are possible but rare
# - Negative ages are errors
# - Session durations over 24 hours are logging errors
=== GENERATE ANALYSIS ===
Python code for:
1. **Outlier Detection**
- IQR method (1.5 * IQR rule)
- Z-score method (|z| > 3)
- Isolation Forest (for multivariate)
- Visual detection (box plots, scatter)
2. **Outlier Report**
| Column | Method | # Outliers | % of Data | Min/Max Outlier |
|--------|--------|------------|-----------|-----------------|
3. **Outlier Investigation**
- Are outliers errors or valid extremes?
- Pattern analysis (when do they occur?)
- Source investigation
4. **Handling Strategies**
- Remove (with justification)
- Cap/Winsorize (to percentile)
- Transform (log, sqrt)
- Keep (if valid business cases)
- Flag (create indicator variable)
5. **Implementation Code**
- Functions for each strategy
- Before/after visualizations
- Impact assessment
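Here's a minimal example of the cap/winsorize strategy from step 4, using the IQR rule (assumes df with the purchase_amount column from the example above):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Winsorize a numeric series to the [Q1 - k*IQR, Q3 + k*IQR] range."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Keep the original column and add a capped version for comparison
df["purchase_amount_capped"] = cap_outliers_iqr(df["purchase_amount"])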
Data Type and Format Cleaning
Use this for data type issues.
Clean data types and formats in my dataset.
=== DATA ISSUES ===
# REPLACE: Describe your data type issues
# - date_column is string format "MM/DD/YYYY"
# - price has $ symbol and commas
# - boolean stored as "Yes"/"No"
# - phone numbers inconsistent formats
=== SAMPLE DATA ===
# REPLACE: Paste sample showing the issues
=== TARGET FORMAT ===
# REPLACE: What formats do you need?
# - Dates: datetime64
# - Prices: float
# - Booleans: True/False
=== GENERATE CLEANING CODE ===
Python code for:
1. **Data Type Assessment**
- Current types
- Suggested types
- Conversion risks
2. **Cleaning Functions**
For each issue:
```python
def clean_[column](value):
    # Handle the specific issue
    # Return cleaned value
```
3. **Validation**
- Check for conversion errors
- Handle edge cases
- Log problematic values
4. **Pipeline**
- Complete cleaning function
- Error handling
- Before/after summary
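As a concrete instance of the cleaning functions in step 2, here's a sketch handling the example issues above (the is_active column name is hypothetical):

```python
import pandas as pd

def clean_price(value) -> float:
    """Strip currency symbols and commas: '$1,234.56' -> 1234.56."""
    if pd.isna(value):
        return float("nan")
    return float(str(value).replace("$", "").replace(",", ""))

df["price"] = df["price"].map(clean_price)
df["date_column"] = pd.to_datetime(df["date_column"], format="%m/%d/%Y")
df["is_active"] = df["is_active"].map({"Yes": True, "No": False})
```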
Statistical Analysis Prompts
Hypothesis Testing
Use this to conduct statistical tests. For understanding AI limitations in analysis, see the Understanding AI Safety, Ethics, and Limitations guide.
Conduct a hypothesis test for my analysis.
=== RESEARCH QUESTION ===
# REPLACE: What are you trying to determine?
# Example: Is there a significant difference in conversion rates
# between the new and old landing page designs?
=== HYPOTHESIS ===
H0 (null):
# REPLACE: e.g., There is no difference in conversion rates
H1 (alternative):
# REPLACE: e.g., The new design has a higher conversion rate
=== DATA ===
# REPLACE: Describe your data
# - Group A (old design): 5000 visitors, 150 conversions
# - Group B (new design): 5000 visitors, 185 conversions
=== GENERATE ANALYSIS ===
Python code for:
1. **Test Selection**
- Appropriate test based on:
- Data type (continuous, categorical)
- Number of groups
- Assumptions (normality, independence)
- Justification for chosen test
2. **Assumption Checks**
- Normality (Shapiro-Wilk, Q-Q plots)
- Homogeneity of variance (Levene's)
- Independence
- Sample size adequacy
3. **Conduct Test**
- Calculate test statistic
- P-value
- Effect size (Cohen's d, odds ratio, etc.)
- Confidence interval
4. **Results Interpretation**
- Statistical conclusion (reject/fail to reject H0)
- Practical significance
- Limitations
5. **Report Format**
- APA-style result statement
- Visualization
- Business recommendation
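Using the example numbers above, the core of the test might look like this two-proportion z-test (one-sided, matching the directional H1; statsmodels is assumed to be installed):

```python
from statsmodels.stats.proportion import proportions_ztest

# From the example above: 185/5000 (new design) vs 150/5000 (old design)
count = [185, 150]   # successes: treatment first, then control
nobs = [5000, 5000]  # visitors per group

# alternative="larger" tests whether the first proportion exceeds the second
stat, pvalue = proportions_ztest(count, nobs, alternative="larger")
print(f"z = {stat:.3f}, p = {pvalue:.4f}")
```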
A/B Test Analysis
Use this to analyze experiment results.
Analyze this A/B test and determine if results are significant.
=== EXPERIMENT DETAILS ===
Test name:
# REPLACE: Name of the test
Test duration:
# REPLACE: Start and end dates
Hypothesis:
# REPLACE: What change will improve what metric
=== METRICS ===
Primary metric:
# REPLACE: e.g., Conversion rate
Secondary metrics:
# REPLACE: e.g., Average order value, Bounce rate
=== DATA ===
# REPLACE: Data for each variant
# Control (A):
# - Sample size: 10,000
# - Conversions: 250
# - Revenue: $12,500
#
# Treatment (B):
# - Sample size: 10,200
# - Conversions: 290
# - Revenue: $15,300
=== GENERATE A/B TEST ANALYSIS ===
Python code for:
1. **Test Summary**
| Metric | Control | Treatment | Difference | % Lift |
|--------|---------|-----------|------------|--------|
2. **Statistical Significance**
- Chi-square test (for proportions)
- T-test (for continuous)
- P-value
- Required sample size vs actual
3. **Confidence Intervals**
- 95% CI for each metric
- CI for the difference
4. **Effect Size**
- Practical significance
- Minimum detectable effect achieved?
5. **Power Analysis**
- Post-hoc power
- Was test adequately powered?
6. **Segmentation Check**
- Results by key segments
- Any surprising segment behavior
7. **Recommendation**
- Ship / Don't ship / Keep testing
- Justification
- Confidence level
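For the power-analysis step, a quick sample-size calculation looks like this (the baseline and target conversion rates are illustrative):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative: baseline 2.5% conversion, hoping to detect a lift to 3.0%
effect = proportion_effectsize(0.030, 0.025)

# Standard 5% significance, 80% power, equal group sizes
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
)
print(f"Required sample size per variant: {n_per_variant:,.0f}")
```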
Regression Analysis
Use this for regression modeling.
Perform regression analysis on my data.
=== OBJECTIVE ===
# REPLACE: What are you trying to predict/explain?
# Example: Predict customer lifetime value based on demographic
# and behavioral features
=== DATA ===
Target variable:
# REPLACE: Column name and description
Feature variables:
# REPLACE: List features to consider
# - age (continuous)
# - gender (categorical)
# - tenure_days (continuous)
# - purchase_count (discrete)
Sample data:
# REPLACE: Paste sample or describe dataset
=== REGRESSION TYPE ===
# REPLACE: Linear / Logistic / Other
=== GENERATE ANALYSIS ===
Python code for:
1. **Data Preparation**
- Handle categoricals (encoding)
- Check for multicollinearity
- Train/test split
2. **Model Building**
- Baseline model
- Feature selection (if needed)
- Final model
3. **Assumption Checking** (for linear regression)
- Linearity
- Homoscedasticity
- Normality of residuals
- Independence
4. **Model Evaluation**
- R² / Adjusted R²
- RMSE, MAE (for regression)
- AUC, Accuracy, F1 (for classification)
- Cross-validation results
5. **Coefficient Interpretation**
| Feature | Coefficient | Std Error | P-value | Interpretation |
|---------|-------------|-----------|---------|----------------|
6. **Visualizations**
- Actual vs Predicted
- Residual plots
- Feature importance
7. **Business Insights**
- Key drivers
- Actionable recommendations
- Limitations
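For linear regression, the model-building core often reduces to a few lines of statsmodels (column names below follow the example features; lifetime_value is a hypothetical target):

```python
import pandas as pd
import statsmodels.api as sm

# Assumes df with the example features; one-hot encode categoricals
X = pd.get_dummies(
    df[["age", "tenure_days", "purchase_count", "gender"]], drop_first=True
).astype(float)
X = sm.add_constant(X)
y = df["lifetime_value"]  # hypothetical target column

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, std errors, p-values, R²
```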
Machine Learning Prompts
Model Selection Helper
Use this to choose the right model for your problem.
Help me select the right machine learning model.
=== PROBLEM TYPE ===
# REPLACE: What are you trying to do?
# - Predict a continuous value (regression)
# - Classify into categories (classification)
# - Find natural groupings (clustering)
# - Detect anomalies
# - Recommend items
=== DATA CHARACTERISTICS ===
Size:
# REPLACE: Number of rows and features
Target distribution:
# REPLACE: Balanced/imbalanced, continuous range
Feature types:
# REPLACE: Mostly numeric / categorical / mixed / text / images
Missing values:
# REPLACE: Percentage
=== CONSTRAINTS ===
# REPLACE: Any constraints
# - Need interpretability
# - Limited compute resources
# - Must run in real-time
# - Need probability outputs
=== GENERATE RECOMMENDATIONS ===
Provide:
1. **Recommended Models** (ranked)
| Rank | Model | Why | Pros | Cons |
|------|-------|-----|------|------|
| 1 | | | | |
2. **For Each Recommended Model:**
- When to use it
- Python implementation code
- Key hyperparameters to tune
- Expected performance range
3. **Model Comparison Strategy**
- Cross-validation approach
- Metrics to compare
- Baseline model
4. **Quick Start Code**
```python
# Complete working code for top model
```
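A common way to act on the comparison strategy in step 3 is to score every candidate against a baseline with identical CV splits and the same metric; here's a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Same 5-fold splits and metric for every candidate, baseline included
for name, model in [
    ("baseline", DummyClassifier(strategy="most_frequent")),
    ("logistic", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```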
Feature Engineering
Use this to create better features.
Generate feature engineering ideas and code.
=== PROBLEM ===
# REPLACE: What are you trying to predict?
=== CURRENT FEATURES ===
# REPLACE: List your current features
# - user_id
# - purchase_date
# - purchase_amount
# - product_category
# - customer_tenure_days
=== DOMAIN KNOWLEDGE ===
# REPLACE: Any domain knowledge to leverage
# - Seasonality is important
# - Recent behavior is more predictive than old
# - Certain product combinations are meaningful
=== GENERATE FEATURE IDEAS ===
Provide:
1. **Feature Engineering Ideas**
| Category | Feature | Description | Expected Value |
|----------|---------|-------------|----------------|
| Aggregations | | | |
| Time-based | | | |
| Interactions | | | |
| Transformations | | | |
2. **Implementation Code**
For each feature category:
```python
def create_[category]_features(df):
    # Feature engineering code
    return df
```
3. **Time-Based Features**
- Recency, frequency, monetary (RFM)
- Rolling windows
- Lag features
- Seasonality encoding
4. **Categorical Engineering**
- Target encoding
- Frequency encoding
- Combinations
5. **Feature Selection**
- Correlation with target
- Feature importance
- Recommendations
6. **Complete Pipeline**
```python
# End-to-end feature engineering pipeline
```
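As a concrete instance of the RFM idea in step 3, here's a short sketch (assumes a transactions DataFrame df with the example columns above, purchase_date already parsed as datetime):

```python
# Reference date for recency: the most recent purchase in the data
snapshot = df["purchase_date"].max()

# One row per user: recency (days), frequency (count), monetary (total spend)
rfm = df.groupby("user_id").agg(
    recency_days=("purchase_date", lambda d: (snapshot - d.max()).days),
    frequency=("purchase_date", "count"),
    monetary=("purchase_amount", "sum"),
)
print(rfm.head())
```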
Model Evaluation and Interpretation
Use this to evaluate and explain your model.
Evaluate and interpret my machine learning model.
=== MODEL ===
Model type:
# REPLACE: e.g., Random Forest Classifier
Task:
# REPLACE: Classification / Regression
=== DATA ===
# REPLACE: Train/test split info, class distribution
=== CURRENT RESULTS ===
# REPLACE: Paste your current metrics or predictions
=== GENERATE EVALUATION ===
Python code for:
1. **Performance Metrics**
For Classification:
- Accuracy, Precision, Recall, F1
- AUC-ROC, AUC-PR
- Confusion Matrix
- Classification Report
For Regression:
- RMSE, MAE, MAPE
- R², Adjusted R²
- Residual analysis
2. **Model Diagnostics**
- Learning curves
- Overfitting check
- Cross-validation scores
3. **Feature Importance**
- Built-in importance (if available)
- Permutation importance
- SHAP values
4. **Error Analysis**
- Where does the model fail?
- Patterns in errors
- Segment performance
5. **Model Comparison** (if applicable)
- Benchmark against baseline
- Comparison visualization
6. **Interpretation Summary**
- Key predictive features
- How features influence predictions
- Business recommendations
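For step 3, permutation importance is a model-agnostic option; here's a minimal sketch (assumes a fitted model and a held-out X_test/y_test, with X_test as a DataFrame):

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature and measure the score drop on held-out data
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"{X_test.columns[i]}: "
          f"{result.importances_mean[i]:.4f} ± {result.importances_std[i]:.4f}")
```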
Data Visualization Prompts
Create Visualization for Dataset
Use this to generate appropriate charts.
Create visualizations for my dataset.
=== DATASET ===
# REPLACE: Describe your dataset
=== KEY VARIABLES ===
# REPLACE: Variables you want to visualize
# - revenue (continuous)
# - region (categorical, 5 categories)
# - date (datetime, 2 years of data)
# - customer_segment (categorical)
=== PURPOSE ===
# REPLACE: What story are you trying to tell?
# - Show revenue trends over time
# - Compare performance across regions
# - Identify patterns and outliers
=== AUDIENCE ===
# REPLACE: Who will see these? (Executives / Technical team / Clients)
=== GENERATE VISUALIZATIONS ===
Python code (matplotlib/seaborn/plotly) for:
1. **Recommended Charts**
| Variable(s) | Chart Type | Purpose |
|-------------|------------|---------|
| | | |
2. **Single Variable Visualizations**
- Distributions
- Time series
- Category counts
3. **Multi-Variable Visualizations**
- Relationships
- Comparisons
- Compositions
4. **Dashboard Layout**
- Suggested arrangement
- Key metrics to highlight
5. **Presentation-Ready Code**
- Proper titles and labels
- Color palette
- Annotations
- Export settings
6. **Interactive Version** (if applicable)
- Plotly/Bokeh code
- Drill-down capabilities
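Here's a small example of the presentation-ready style step 5 asks for (assumes a df with the date, revenue, and region columns from the example above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(data=df, x="date", y="revenue", hue="region", ax=ax)

# Presentation touches: clear title, labeled units, minimal chrome
ax.set_title("Monthly Revenue by Region")
ax.set_xlabel("")
ax.set_ylabel("Revenue (USD)")
sns.despine()
fig.tight_layout()
fig.savefig("revenue_by_region.png", dpi=200)
```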
Build Dashboard Metrics
Use this to define dashboard KPIs.
Help me design dashboard metrics and visualizations.
=== BUSINESS CONTEXT ===
# REPLACE: What does your team/business do?
=== AUDIENCE ===
# REPLACE: Who will use this dashboard?
=== AVAILABLE DATA ===
# REPLACE: What data sources do you have?
=== KEY QUESTIONS ===
# REPLACE: What questions should the dashboard answer?
# - How are we performing vs target?
# - What's trending up or down?
# - Where should we focus?
=== GENERATE DASHBOARD DESIGN ===
1. **KPI Definitions**
| KPI | Definition | Formula | Target | Cadence |
|-----|------------|---------|--------|---------|
| | | | | |
2. **Dashboard Layout**
[Sketch of dashboard layout]
- Top: Summary KPIs
- Middle: Trend charts
- Bottom: Detail tables
3. **Visualization Specifications**
For each chart:
- Chart type
- Data source
- Filters
- Interactivity
4. **SQL Queries for Metrics**
(One query per metric)
5. **Python/BI Tool Code**
- Code to generate visualizations
- Update/refresh logic
6. **Recommendations**
- Refresh frequency
- Alert thresholds
- Future enhancements
Communication Prompts
Explain Analysis to Non-Technical Audience
Use this to translate findings.
Explain this analysis for a non-technical audience.
=== TECHNICAL FINDINGS ===
# REPLACE: Paste your technical results
# Example:
# - Logistic regression model with 0.85 AUC
# - Top features: tenure (0.45), purchase_frequency (0.32), support_tickets (0.28)
# - Customers with tenure < 60 days and >2 support tickets have 65% churn probability
=== AUDIENCE ===
# REPLACE: Who are you presenting to?
# - VP of Marketing
# - Non-technical stakeholders
# - C-suite
=== BUSINESS CONTEXT ===
# REPLACE: What decision does this inform?
=== GENERATE EXPLANATION ===
Provide:
1. **One-Sentence Summary**
(The key insight in plain English)
2. **What We Did**
(Non-technical description of methodology)
3. **What We Found**
- Key finding 1 (with business implication)
- Key finding 2 (with business implication)
- Key finding 3 (with business implication)
4. **What This Means**
(Business recommendations)
5. **Visualizations to Include**
(Describe simple, intuitive charts)
6. **Confidence and Limitations**
(Honest assessment in plain language)
7. **Recommended Actions**
| Action | Expected Impact | Effort |
|--------|-----------------|--------|
8. **Appendix for Questions**
(Talking points for common questions)
Write Analysis Report
Use this to structure a complete report.
Write a data analysis report.
=== ANALYSIS TOPIC ===
# REPLACE: What did you analyze?
=== KEY FINDINGS ===
# REPLACE: Summarize main findings
=== DATA USED ===
# REPLACE: Data sources and timeframes
=== METHODOLOGY ===
# REPLACE: Brief description of methods
=== AUDIENCE ===
# REPLACE: Who will read this?
=== GENERATE REPORT ===
## [Report Title]
### Executive Summary
(1 paragraph: key findings and recommendations)
### Background
- Business context
- Why this analysis was conducted
- Questions we sought to answer
### Data & Methodology
- Data sources
- Time period
- Analytical approach
- Key assumptions
### Key Findings
**Finding 1: [Title]**
- What we found
- Supporting evidence (metrics, charts)
- Business implication
**Finding 2: [Title]**
...
### Detailed Analysis
(Technical details for those who want more depth)
### Recommendations
| Priority | Recommendation | Expected Impact | Owner | Timeline |
|----------|----------------|-----------------|-------|----------|
### Limitations & Caveats
- Data limitations
- Assumptions made
- Areas for future analysis
### Appendix
- Detailed tables
- Methodology notes
- Data dictionary
Data Storytelling Narrative
Use this to craft compelling data stories.
Create a data storytelling narrative.
=== KEY INSIGHT ===
# REPLACE: The main insight you want to communicate
=== SUPPORTING DATA ===
# REPLACE: Data points that support the insight
=== AUDIENCE ===
# REPLACE: Who needs to act on this insight?
=== DESIRED ACTION ===
# REPLACE: What do you want them to do?
=== GENERATE NARRATIVE ===
Create a compelling story arc:
1. **Hook**
(Opening that grabs attention)
2. **Context**
(Background needed to understand)
3. **Tension**
(The problem or challenge revealed by data)
4. **Data Evidence**
- Key metric 1: [number] - what it means
- Key metric 2: [number] - what it means
- Visualization description
5. **Resolution**
(The recommendation or solution)
6. **Call to Action**
(Specific next steps)
7. **Presentation Slide Outline**
| Slide | Title | Content | Visual |
|-------|-------|---------|--------|
8. **Speaking Notes**
(Talking points for each section)
Python Code Prompts
Debug Python Data Code
Use this to fix Python errors in data work.
Help me debug this Python data analysis code.
=== ERROR ===
# REPLACE: Paste the full error message and traceback
=== CODE ===
# REPLACE: Paste your code
=== DATA (sample) ===
# REPLACE: Sample of your data or df.head() output
=== WHAT I'M TRYING TO DO ===
# REPLACE: Explain your goal
Please:
1. Explain what the error means
2. Identify the root cause
3. Provide corrected code
4. Explain the fix
5. Suggest how to prevent this error
Convert Analysis to Production Code
Use this to refactor analysis code.
Convert this analysis code to production-quality code.
=== CURRENT CODE ===
# REPLACE: Paste your notebook/analysis code
=== REQUIREMENTS ===
# REPLACE: What needs to be productionized?
# - Run as scheduled job
# - Handle new data
# - Error handling
# - Logging
=== GENERATE PRODUCTION CODE ===
Provide:
1. **Refactored Code**
- Functions with docstrings
- Type hints
- Error handling
- Logging
2. **Configuration**
- Config file structure
- Environment variables
3. **Testing**
- Unit tests
- Data validation
4. **Documentation**
- README
- Function documentation
5. **Deployment Notes**
- Dependencies (requirements.txt)
- Run instructions
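To give a feel for the target style, here's a tiny sketch of what "production-quality" means in practice: typed, documented, validated, and logged (the function name and columns are illustrative):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def load_and_validate(path: str, required_cols: list[str]) -> pd.DataFrame:
    """Load a CSV and fail fast if expected columns are missing."""
    df = pd.read_csv(path)
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    logger.info("Loaded %d rows from %s", len(df), path)
    return df
```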
Quick Reference
| Need | Prompt to Use |
|---|---|
| SQL | |
| Write SQL from description | Generate SQL Query |
| Make query faster | Optimize Slow SQL Query |
| Fix SQL error | Debug SQL Error |
| Advanced analytics query | Complex SQL with CTEs |
| EDA | |
| Start exploring new data | Initial Dataset Exploration |
| Analyze specific columns | Analyze Specific Variables |
| Find variable relationships | Correlation and Relationship Analysis |
| Data Cleaning | |
| Handle missing data | Handle Missing Values |
| Find and fix outliers | Detect and Handle Outliers |
| Fix data types | Data Type and Format Cleaning |
| Statistics | |
| Run significance test | Hypothesis Testing |
| Analyze A/B test | A/B Test Analysis |
| Build regression model | Regression Analysis |
| Machine Learning | |
| Pick the right model | Model Selection Helper |
| Create better features | Feature Engineering |
| Evaluate model | Model Evaluation and Interpretation |
| Visualization | |
| Create charts | Create Visualization for Dataset |
| Design dashboard | Build Dashboard Metrics |
| Communication | |
| Explain to non-technical | Explain Analysis to Non-Technical Audience |
| Write formal report | Write Analysis Report |
| Tell data story | Data Storytelling Narrative |
| Code | |
| Fix Python errors | Debug Python Data Code |
| Make code production-ready | Convert Analysis to Production Code |
Tips for Better Data Science Prompts
1. Include Data Schema
❌ "Write a query to find top customers"
✅ "Write a SQL query for PostgreSQL. Tables:
- customers (customer_id, name, signup_date)
- orders (order_id, customer_id, amount, created_at)
Find top 10 customers by total order value in the last 90 days."
2. Specify Your Tools
❌ "Show me how to plot this"
✅ "Using Python with pandas and seaborn, create a visualization showing..."
3. Include Sample Data
"Here's the first 5 rows:
customer_id,revenue,segment
1,150.00,Premium
2,45.50,Standard
..."
4. State Your Objective
❌ "Analyze this data"
✅ "Analyze this data to identify which customer segments
have the highest churn risk, so we can prioritize retention efforts."
5. Specify Output Format
"Output as Python code with comments"
"Provide both SQL and equivalent pandas code"
"Include visualizations with Plotly"
6. Ask for Interpretation
"What do these results mean for the business?"
"Explain how to present this to a non-technical audience"
"What are the limitations of this analysis?"
7. Iterate with Context
"The query runs but returns 0 rows. Here's the data sample..."
"Good, but also add handling for NULL values in column X"
"Can you make this more efficient for 100M rows?"
What’s Next
- 📚 AI Prompts for Software Developers — Work with your engineering team
- 📚 AI Prompts for Business Analysts — Partner with your BA team
- 📚 Prompt Engineering Fundamentals — Master prompt techniques
- 🛠️ SQL Formatter Tool — Format your SQL queries
Found these prompts helpful? Share them with your data team!