AI Prompts for Data Scientists & Analysts: 40+ Ready-to-Use Templates

Copy-paste AI prompts for EDA, SQL queries, statistical analysis, machine learning, data visualization, and reporting. Works with ChatGPT, Claude, and Gemini.

Rajesh Praharaj

Oct 7, 2025 · Updated Dec 29, 2025

TL;DR - Best AI Prompts for Data Scientists & Analysts

Looking for ready-to-use AI prompts for data science? This guide contains 40+ copy-paste prompts that work with ChatGPT, Claude, and Gemini. Each prompt includes placeholders you can customize for your specific analysis needs. For foundational prompting skills, see the Prompt Engineering Fundamentals guide.

What’s included:

  • SQL Prompts — Generate complex queries, optimize performance, and debug errors
  • EDA Prompts — Explore datasets, generate summary statistics, and find patterns
  • Data Cleaning Prompts — Handle missing values, outliers, and data quality issues
  • Statistical Analysis Prompts — Hypothesis testing, A/B tests, regression analysis
  • Machine Learning Prompts — Model selection, feature engineering, evaluation
  • Visualization Prompts — Create charts, dashboards, and presentation-ready graphics
  • Communication Prompts — Write reports, explain findings, and tell data stories

💡 Pro tip: Include your data schema, sample data, and business context. The more specific you are about your data and objectives, the better the output. For advanced prompting techniques, see the Advanced Prompt Engineering guide.


How to Use These Data Science Prompts

Each prompt below is ready to copy and paste. Here’s how they work:

  1. Copy the entire prompt from the code block
  2. Replace the placeholder comments (lines starting with # REPLACE:) with your actual content
  3. Paste into ChatGPT, Claude, or your preferred AI
  4. Get your result and iterate if needed

Adding Context for Better Results

Data science prompts work best with rich context:

=== DATA CONTEXT (include in any prompt) ===
Dataset: [Name and description]
Size: [Rows x columns, file size]
Key Columns: [Important columns with types]
Business Context: [What business problem are you solving]
Tools: [Python/R, specific libraries]
Output: [Code only / Code + explanation / Explanation only]
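
For example, a filled-in context block might look like this (all dataset details here are hypothetical):

=== DATA CONTEXT ===
Dataset: Q3 e-commerce transactions export
Size: 1.2M rows x 14 columns, ~250 MB CSV
Key Columns: order_id (int), amount (float, USD), status (str), created_at (datetime)
Business Context: Which discount campaigns drive repeat purchases?
Tools: Python, pandas, seaborn
Output: Code + explanation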

SQL Query Prompts

Generate SQL Query from Natural Language

Use this to convert plain English descriptions into SQL.

Write a SQL query for the following request.

=== DATABASE ===
Database type: 
# REPLACE: PostgreSQL / MySQL / SQL Server / BigQuery / Snowflake

=== SCHEMA ===
# REPLACE: Describe or paste your table schemas
# Example:
# users (user_id INT PK, email VARCHAR, created_at TIMESTAMP, plan VARCHAR)
# orders (order_id INT PK, user_id INT FK, amount DECIMAL, status VARCHAR, created_at TIMESTAMP)
# products (product_id INT PK, name VARCHAR, category VARCHAR, price DECIMAL)

=== REQUEST ===
# REPLACE: Describe what you need in plain English
# Example: Get monthly revenue by product category for the last 12 months, 
# showing month-over-month growth percentage

=== REQUIREMENTS ===
# REPLACE: Any specific requirements
# - Must handle NULLs
# - Needs pagination
# - Performance is critical (table has 10M+ rows)

=== OUTPUT ===
1. The SQL query with comments explaining each section
2. Sample output showing expected columns
3. Suggested indexes for performance (if applicable)
4. Alternative approaches (if any)
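
For the example request and schema above, the returned query might look roughly like this PostgreSQL sketch, wrapped in pandas so you can inspect the result. The connection string is a placeholder, and note that a per-category breakdown would need a join key between orders and products, which the example schema doesn't define:

```python
import pandas as pd
import sqlalchemy

# Illustrative only: one plausible shape of the generated query,
# monthly revenue for the last 12 months with month-over-month growth.
SQL = """
WITH monthly AS (
    SELECT date_trunc('month', created_at) AS month,
           SUM(amount)                     AS revenue
    FROM orders
    WHERE created_at >= NOW() - INTERVAL '12 months'
    GROUP BY 1
)
SELECT month,
       revenue,
       ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
             / NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 2) AS mom_growth_pct
FROM monthly
ORDER BY month;
"""

engine = sqlalchemy.create_engine("postgresql://user:pass@host/db")  # placeholder DSN
df = pd.read_sql(SQL, engine)
print(df.head())
```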

Optimize Slow SQL Query

Use this to optimize query performance.

Optimize this SQL query for better performance.

=== CURRENT QUERY ===
# REPLACE: Paste your slow query

=== CONTEXT ===
Database: 
# REPLACE: PostgreSQL / MySQL / etc.

Table sizes:
# REPLACE: Approximate row counts
# - users: 1M rows
# - orders: 50M rows

Current execution time: 
# REPLACE: e.g., 45 seconds

Target execution time: 
# REPLACE: e.g., under 5 seconds

Existing indexes:
# REPLACE: List current indexes or "Unknown"

=== ANALYZE AND OPTIMIZE ===

Provide:
1. Analysis of why it's slow
2. Optimized query with explanation
3. Recommended indexes (CREATE INDEX statements)
4. Query execution plan considerations
5. Alternative approaches if applicable

Debug SQL Error

Use this when you have a SQL error to fix.

Help me debug this SQL error.

=== ERROR MESSAGE ===
# REPLACE: Paste the full error message

=== QUERY ===
# REPLACE: Paste your query

=== SCHEMA (relevant tables) ===
# REPLACE: Table definitions

=== DATABASE ===
# REPLACE: Database type and version

Please:
1. Explain what the error means
2. Identify the exact cause
3. Provide the corrected query
4. Explain how to avoid this in the future

Complex SQL with CTEs and Window Functions

Use this for advanced analytical queries.

Write an advanced SQL query using CTEs and window functions.

=== ANALYSIS NEEDED ===
# REPLACE: Describe the analysis
# Example: Calculate running total of revenue per customer, 
# rank customers by lifetime value, and identify customers 
# whose spending dropped more than 50% month-over-month

=== SCHEMA ===
# REPLACE: Relevant tables

=== REQUIREMENTS ===
# REPLACE: Specific requirements
# - Use CTEs for readability
# - Include window functions where appropriate
# - Handle edge cases (new customers, null values)

=== OUTPUT FORMAT ===
| Column | Description |
|--------|-------------|
# REPLACE: Expected output columns

Generate query with:
1. CTEs for each logical step
2. Appropriate window functions
3. Comments explaining the logic
4. Sample output

Exploratory Data Analysis Prompts

Initial Dataset Exploration

Use this when you first receive a new dataset. For more on how AI handles data, see the Tokens, Context Windows & Parameters guide.

Generate Python code for initial exploratory data analysis.

=== DATASET ===
Name: 
# REPLACE: Dataset name

Description: 
# REPLACE: What does this data represent?

Format: 
# REPLACE: CSV / Parquet / JSON / Database table

=== SAMPLE DATA (optional) ===
# REPLACE: Paste first few rows if available

=== COLUMNS (if known) ===
# REPLACE: List columns with expected types
# - customer_id: int (unique identifier)
# - purchase_amount: float (in USD)
# - purchase_date: datetime

=== GENERATE EDA CODE ===

Provide Python code (using pandas, numpy, matplotlib, seaborn) for:

1. **Data Loading & First Look**
   - Load data
   - .head(), .info(), .describe()
   - Column types and memory usage

2. **Missing Values Analysis**
   - Count and percentage by column
   - Visualize missing patterns

3. **Univariate Analysis**
   - Distributions for numerical columns
   - Value counts for categorical columns
   - Statistical summaries

4. **Bivariate Analysis**
   - Correlation matrix
   - Key relationships
   - Scatter plots for important pairs

5. **Data Quality Checks**
   - Duplicates
   - Outliers
   - Inconsistent values

6. **Initial Insights**
   - Key observations
   - Potential issues
   - Recommended next steps
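
To make the expected output concrete, here is a minimal first-look sketch of the kind of code this prompt should return; the file name and columns are placeholders:

```python
import pandas as pd

# First look: load, inspect, and profile missingness ("data.csv" is a placeholder)
df = pd.read_csv("data.csv")

print(df.head())
df.info()  # column dtypes and memory usage
print(df.describe(include="all"))

# Missing values: count and percentage per column
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean().mul(100).round(1),
}).sort_values("pct_missing", ascending=False)
print(missing)

print(f"Duplicate rows: {df.duplicated().sum()}")
```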

Analyze Specific Variables

Use this to deep-dive into specific columns.

Analyze these specific variables in my dataset.

=== VARIABLES TO ANALYZE ===
# REPLACE: List variables with their types
# - revenue (continuous, numeric)
# - customer_segment (categorical)
# - signup_date (datetime)

=== ANALYSIS GOALS ===
# REPLACE: What questions are you trying to answer?
# - Is revenue normally distributed?
# - How do segments differ in behavior?
# - Are there seasonal patterns?

=== SAMPLE DATA ===
# REPLACE: Paste sample or describe data characteristics

=== GENERATE ANALYSIS CODE ===

Python code for:

1. **Numerical Variables**
   - Distribution (histogram, KDE)
   - Summary statistics (mean, median, std, quartiles)
   - Skewness and kurtosis
   - Outlier detection (IQR, Z-score)
   - Transformation suggestions

2. **Categorical Variables**
   - Value counts and proportions
   - Bar charts
   - Chi-square test for independence (if applicable)
   - Cardinality assessment

3. **Datetime Variables**
   - Time range and gaps
   - Patterns (daily, weekly, monthly, yearly)
   - Trend decomposition

4. **Variable Relationships**
   - Cross-tabulations
   - Group comparisons
   - Visualizations by segment

5. **Summary of Findings**
   - Key insights
   - Anomalies detected
   - Recommendations
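
As a sketch of the numeric branch, a deep-dive on the example revenue column might look like this; df is assumed to be an already-loaded DataFrame:

```python
from scipy import stats

# Distribution shape and outliers for one numeric column ('revenue' from the example)
s = df["revenue"].dropna()
print(s.describe())
print(f"skewness={s.skew():.2f}, kurtosis={s.kurt():.2f}")

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(f"{len(outliers)} IQR outliers ({len(outliers) / len(s):.1%})")

# Normality check on a capped sample (Shapiro-Wilk degrades above ~5,000 points)
stat, p = stats.shapiro(s.sample(min(len(s), 5000), random_state=0))
print(f"Shapiro-Wilk p-value: {p:.4f}")
```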

Correlation and Relationship Analysis

Use this to understand variable relationships.

Analyze correlations and relationships in my dataset.

=== DATASET DESCRIPTION ===
# REPLACE: Describe your dataset and key columns

=== TARGET VARIABLE (if applicable) ===
# REPLACE: What are you trying to predict/explain?

=== SAMPLE DATA ===
# REPLACE: Paste sample data or schema

=== GENERATE ANALYSIS ===

Python code for:

1. **Correlation Matrix**
   - Pearson correlation for numeric variables
   - Heatmap visualization
   - Highlight strong correlations (|r| > 0.7)

2. **Multicollinearity Check**
   - VIF (Variance Inflation Factor)
   - Recommendations for feature selection

3. **Target Relationships**
   - Correlation with target variable
   - Top predictive features
   - Scatter plots with trend lines

4. **Non-Linear Relationships**
   - Spearman correlation
   - Visual inspection for curves
   - Polynomial relationship detection

5. **Categorical Associations**
   - Cramér's V for categorical variables
   - Point-biserial for mixed types

6. **Insights Summary**
   - Key relationships found
   - Potential issues (multicollinearity)
   - Feature recommendations
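
A condensed version of steps 1 and 2 might look like the following sketch; df is assumed loaded, and the 0.7 threshold mirrors the prompt:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor

num = df.select_dtypes("number")
corr = num.corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Pearson correlation")
plt.show()

# Strong pairs (|r| > 0.7), upper triangle only to avoid duplicates
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().loc[lambda r: r.abs() > 0.7])

# VIF per feature as a multicollinearity check (rows with NaNs dropped first)
X = num.dropna()
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns, name="VIF",
)
print(vif.sort_values(ascending=False))
```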

Data Cleaning Prompts

Handle Missing Values

Use this to address missing data.

Generate code to handle missing values in my dataset.

=== DATASET ===
# REPLACE: Dataset description

=== MISSING VALUE PATTERN ===
# REPLACE: Describe what you know about missing values
# - Column X: 15% missing, likely MCAR
# - Column Y: 30% missing, possibly related to Column Z
# - Column W: Completely missing for certain dates

=== CONSTRAINTS ===
# REPLACE: Any constraints on imputation
# - Cannot remove rows (need all records)
# - Must handle categorical and numeric differently
# - Need to preserve distributions

=== GENERATE SOLUTION ===

Python code for:

1. **Missing Value Assessment**
   - Missing counts and percentages
   - Missing patterns (MCAR, MAR, MNAR analysis)
   - Visualization of missingness

2. **Handling Strategy by Column**
   
   For each column with missing values:
   - **Numeric columns:**
     - Mean/median imputation
     - KNN imputation
     - Regression imputation
   
   - **Categorical columns:**
     - Mode imputation
     - Create "Unknown" category
     - Predictive imputation
   
   - **Time-based:**
     - Forward/backward fill
     - Interpolation

3. **Imputation Code**
   - Reusable functions
   - Before/after comparison
   - Validation

4. **Documentation**
   - What was done to each column
   - Justification for approach
   - Potential impact on analysis
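
A mixed-strategy sketch of the kind of code this prompt should produce; every column name here is illustrative:

```python
from sklearn.impute import KNNImputer

num_cols = ["age", "tenure_days", "purchase_count"]

# Record missingness flags first so the signal isn't lost after imputation
for col in num_cols + ["income", "segment"]:
    df[f"{col}_was_missing"] = df[col].isna()

df["segment"] = df["segment"].fillna("Unknown")            # categorical: explicit category
df["income"] = df["income"].fillna(df["income"].median())  # skewed numeric: median
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])  # related numerics
```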

Detect and Handle Outliers

Use this for outlier analysis.

Analyze and handle outliers in my dataset.

=== DATASET ===
# REPLACE: Describe dataset

=== COLUMNS TO CHECK ===
# REPLACE: List numerical columns to check for outliers
# - purchase_amount
# - session_duration
# - age

=== BUSINESS CONTEXT ===
# REPLACE: What constitutes a valid extreme value?
# - Purchase amounts over $10K are possible but rare
# - Negative ages are errors
# - Session durations over 24 hours are logging errors

=== GENERATE ANALYSIS ===

Python code for:

1. **Outlier Detection**
   - IQR method (1.5 * IQR rule)
   - Z-score method (|z| > 3)
   - Isolation Forest (for multivariate)
   - Visual detection (box plots, scatter)

2. **Outlier Report**
   | Column | Method | # Outliers | % of Data | Min/Max Outlier |
   
3. **Outlier Investigation**
   - Are outliers errors or valid extremes?
   - Pattern analysis (when do they occur?)
   - Source investigation

4. **Handling Strategies**
   - Remove (with justification)
   - Cap/Winsorize (to percentile)
   - Transform (log, sqrt)
   - Keep (if valid business cases)
   - Flag (create indicator variable)

5. **Implementation Code**
   - Functions for each strategy
   - Before/after visualizations
   - Impact assessment
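
A sketch of the detection-and-report piece using the IQR rule from step 1, with one handling strategy (winsorizing) at the end; the column names come from the example above:

```python
import pandas as pd

def iqr_outlier_report(df: pd.DataFrame, cols: list, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    rows = []
    for col in cols:
        s = df[col].dropna()
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
        rows.append({"column": col, "n_outliers": int(mask.sum()),
                     "pct_of_data": round(100 * mask.mean(), 2)})
    return pd.DataFrame(rows)

print(iqr_outlier_report(df, ["purchase_amount", "session_duration", "age"]))

# One handling strategy: cap (winsorize) at the 99th percentile instead of deleting
df["purchase_amount"] = df["purchase_amount"].clip(
    upper=df["purchase_amount"].quantile(0.99)
)
```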

Data Type and Format Cleaning

Use this for data type issues.

Clean data types and formats in my dataset.

=== DATA ISSUES ===
# REPLACE: Describe your data type issues
# - date_column is string format "MM/DD/YYYY"
# - price has $ symbol and commas
# - boolean stored as "Yes"/"No"
# - phone numbers inconsistent formats

=== SAMPLE DATA ===
# REPLACE: Paste sample showing the issues

=== TARGET FORMAT ===
# REPLACE: What formats do you need?
# - Dates: datetime64
# - Prices: float
# - Booleans: True/False

=== GENERATE CLEANING CODE ===

Python code for:

1. **Data Type Assessment**
   - Current types
   - Suggested types
   - Conversion risks

2. **Cleaning Functions**
   
   For each issue:
   ```python
   def clean_[column](value):
       # Handle the specific issue
       # Return cleaned value
   ```

3. **Validation**
   - Check for conversion errors
   - Handle edge cases
   - Log problematic values

4. **Pipeline**
   - Complete cleaning function
   - Error handling
   - Before/after summary
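
For the example issues above, the cleaning code might boil down to a few vectorized conversions; the boolean column name is hypothetical:

```python
import pandas as pd

df["date_column"] = pd.to_datetime(df["date_column"], format="%m/%d/%Y", errors="coerce")
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
df["is_active"] = df["is_active"].map({"Yes": True, "No": False})  # hypothetical column

# errors="coerce" turns unparseable dates into NaT; log them rather than lose them silently
print(f"Unparseable dates: {df['date_column'].isna().sum()}")
```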

Statistical Analysis Prompts

Hypothesis Testing

Use this to conduct statistical tests. For understanding AI limitations in analysis, see the Understanding AI Safety, Ethics, and Limitations guide.

Conduct a hypothesis test for my analysis.

=== RESEARCH QUESTION ===
# REPLACE: What are you trying to determine?
# Example: Is there a significant difference in conversion rates 
# between the new and old landing page designs?

=== HYPOTHESIS ===
H0 (null): 
# REPLACE: e.g., There is no difference in conversion rates

H1 (alternative): 
# REPLACE: e.g., The new design has a higher conversion rate

=== DATA ===
# REPLACE: Describe your data
# - Group A (old design): 5000 visitors, 150 conversions
# - Group B (new design): 5000 visitors, 185 conversions

=== GENERATE ANALYSIS ===

Python code for:

1. **Test Selection**
   - Appropriate test based on:
     - Data type (continuous, categorical)
     - Number of groups
     - Assumptions (normality, independence)
   - Justification for chosen test

2. **Assumption Checks**
   - Normality (Shapiro-Wilk, Q-Q plots)
   - Homogeneity of variance (Levene's)
   - Independence
   - Sample size adequacy

3. **Conduct Test**
   - Calculate test statistic
   - P-value
   - Effect size (Cohen's d, odds ratio, etc.)
   - Confidence interval

4. **Results Interpretation**
   - Statistical conclusion (reject/fail to reject H0)
   - Practical significance
   - Limitations

5. **Report Format**
   - APA-style result statement
   - Visualization
   - Business recommendation
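
For the example data above (150/5,000 vs 185/5,000 conversions), the test itself reduces to a two-proportion z-test; this sketch uses statsmodels (the CI helper requires a reasonably recent version):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

conversions = np.array([185, 150])  # treatment (new design) first
visitors = np.array([5000, 5000])

# One-sided test: H1 says the new design converts better
z, p = proportions_ztest(conversions, visitors, alternative="larger")
print(f"z = {z:.3f}, one-sided p = {p:.4f}")

# 95% CI for the difference in conversion rates (treatment minus control)
low, high = confint_proportions_2indep(conversions[0], visitors[0],
                                       conversions[1], visitors[1])
print(f"95% CI for lift: [{low:.4f}, {high:.4f}]")
```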

A/B Test Analysis

Use this to analyze experiment results.

Analyze this A/B test and determine if results are significant.

=== EXPERIMENT DETAILS ===
Test name: 
# REPLACE: Name of the test

Test duration: 
# REPLACE: Start and end dates

Hypothesis: 
# REPLACE: What change will improve what metric

=== METRICS ===
Primary metric: 
# REPLACE: e.g., Conversion rate

Secondary metrics: 
# REPLACE: e.g., Average order value, Bounce rate

=== DATA ===
# REPLACE: Data for each variant
# Control (A):
#   - Sample size: 10,000
#   - Conversions: 250
#   - Revenue: $12,500
#
# Treatment (B):
#   - Sample size: 10,200
#   - Conversions: 290
#   - Revenue: $15,300

=== GENERATE A/B TEST ANALYSIS ===

Python code for:

1. **Test Summary**
   | Metric | Control | Treatment | Difference | % Lift |
   
2. **Statistical Significance**
   - Chi-square test (for proportions)
   - T-test (for continuous)
   - P-value
   - Required sample size vs actual

3. **Confidence Intervals**
   - 95% CI for each metric
   - CI for the difference

4. **Effect Size**
   - Practical significance
   - Minimum detectable effect achieved?

5. **Power Analysis**
   - Post-hoc power
   - Was test adequately powered?

6. **Segmentation Check**
   - Results by key segments
   - Any surprising segment behavior

7. **Recommendation**
   - Ship / Don't ship / Keep testing
   - Justification
   - Confidence level
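
Step 5 (power) is the piece most often skipped; a post-hoc check for the example numbers might look like this statsmodels sketch:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_control, p_treatment = 250 / 10_000, 290 / 10_200
effect = proportion_effectsize(p_treatment, p_control)  # Cohen's h

analysis = NormalIndPower()
power = analysis.power(effect_size=effect, nobs1=10_000, alpha=0.05,
                       ratio=10_200 / 10_000)
print(f"Achieved power: {power:.2f}")

# Sample size per arm needed to detect this effect with 80% power
n_needed = analysis.solve_power(effect_size=effect, power=0.8, alpha=0.05)
print(f"~{n_needed:,.0f} visitors per arm for 80% power")
```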

Regression Analysis

Use this for regression modeling.

Perform regression analysis on my data.

=== OBJECTIVE ===
# REPLACE: What are you trying to predict/explain?
# Example: Predict customer lifetime value based on demographic 
# and behavioral features

=== DATA ===
Target variable: 
# REPLACE: Column name and description

Feature variables:
# REPLACE: List features to consider
# - age (continuous)
# - gender (categorical)
# - tenure_days (continuous)
# - purchase_count (discrete)

Sample data:
# REPLACE: Paste sample or describe dataset

=== REGRESSION TYPE ===
# REPLACE: Linear / Logistic / Other

=== GENERATE ANALYSIS ===

Python code for:

1. **Data Preparation**
   - Handle categoricals (encoding)
   - Check for multicollinearity
   - Train/test split

2. **Model Building**
   - Baseline model
   - Feature selection (if needed)
   - Final model

3. **Assumption Checking** (for linear regression)
   - Linearity
   - Homoscedasticity
   - Normality of residuals
   - Independence

4. **Model Evaluation**
   - R² / Adjusted R²
   - RMSE, MAE (for regression)
   - AUC, Accuracy, F1 (for classification)
   - Cross-validation results

5. **Coefficient Interpretation**
   
   | Feature | Coefficient | Std Error | P-value | Interpretation |
   
6. **Visualizations**
   - Actual vs Predicted
   - Residual plots
   - Feature importance

7. **Business Insights**
   - Key drivers
   - Actionable recommendations
   - Limitations
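
For a continuous target like the CLV example, the core of the output might be a statsmodels fit, which gives coefficients, standard errors, and p-values in one summary; the target column name is hypothetical:

```python
import pandas as pd
import statsmodels.api as sm

features = ["age", "tenure_days", "purchase_count", "gender"]
X = pd.get_dummies(df[features], drop_first=True, dtype=float)  # encode categoricals
X = sm.add_constant(X)
y = df["lifetime_value"]  # hypothetical target column

model = sm.OLS(y, X, missing="drop").fit()
print(model.summary())  # coefficients, std errors, p-values, R-squared in one table
```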

Machine Learning Prompts

Model Selection Helper

Use this to choose the right model for your problem.

Help me select the right machine learning model.

=== PROBLEM TYPE ===
# REPLACE: What are you trying to do?
# - Predict a continuous value (regression)
# - Classify into categories (classification)
# - Find natural groupings (clustering)
# - Detect anomalies
# - Recommend items

=== DATA CHARACTERISTICS ===
Size: 
# REPLACE: Number of rows and features

Target distribution: 
# REPLACE: Balanced/imbalanced, continuous range

Feature types: 
# REPLACE: Mostly numeric / categorical / mixed / text / images

Missing values: 
# REPLACE: Percentage

=== CONSTRAINTS ===
# REPLACE: Any constraints
# - Need interpretability
# - Limited compute resources
# - Must run in real-time
# - Need probability outputs

=== GENERATE RECOMMENDATIONS ===

Provide:

1. **Recommended Models** (ranked)
   
   | Rank | Model | Why | Pros | Cons |
   |------|-------|-----|------|------|
   | 1 | | | | |

2. **For Each Recommended Model:**
   - When to use it
   - Python implementation code
   - Key hyperparameters to tune
   - Expected performance range

3. **Model Comparison Strategy**
   - Cross-validation approach
   - Metrics to compare
   - Baseline model

4. **Quick Start Code**
   ```python
   # Complete working code for top model
   ```
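
As a sketch of what that quick start might contain, here is a baseline-versus-candidates comparison with scikit-learn cross-validation; df and the churned target column are hypothetical, and the features are assumed to be numeric already:

```python
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = df.drop(columns=["churned"]), df["churned"]  # hypothetical data

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Compare every candidate against a trivial baseline on the same folds
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```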

Feature Engineering

Use this to create better features.

Generate feature engineering ideas and code.

=== PROBLEM ===
# REPLACE: What are you trying to predict?

=== CURRENT FEATURES ===
# REPLACE: List your current features
# - user_id
# - purchase_date
# - purchase_amount
# - product_category
# - customer_tenure_days

=== DOMAIN KNOWLEDGE ===
# REPLACE: Any domain knowledge to leverage
# - Seasonality is important
# - Recent behavior is more predictive than old
# - Certain product combinations are meaningful

=== GENERATE FEATURE IDEAS ===

Provide:

1. **Feature Engineering Ideas**

   | Category | Feature | Description | Expected Value |
   |----------|---------|-------------|----------------|
   | Aggregations | | | |
   | Time-based | | | |
   | Interactions | | | |
   | Transformations | | | |

2. **Implementation Code**

   For each feature category:
   ```python
   def create_[category]_features(df):
       # Feature engineering code
       return df
   ```

3. **Time-Based Features**
   - Recency, frequency, monetary (RFM)
   - Rolling windows
   - Lag features
   - Seasonality encoding

4. **Categorical Engineering**
   - Target encoding
   - Frequency encoding
   - Combinations

5. **Feature Selection**
   - Correlation with target
   - Feature importance
   - Recommendations

6. **Complete Pipeline**
   ```python
   # End-to-end feature engineering pipeline
   ```
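
As a concrete instance of the time-based bucket, RFM-style features built from the example columns might look like this; the user-level grain and df are assumptions:

```python
import pandas as pd

snapshot = df["purchase_date"].max()

# Recency / frequency / monetary per user
rfm = (df.groupby("user_id")
         .agg(recency_days=("purchase_date", lambda d: (snapshot - d.max()).days),
              frequency=("purchase_date", "count"),
              monetary=("purchase_amount", "sum"))
         .reset_index())

# Recent behavior weighted more heavily: spend in the last 90 days
recent = (df[df["purchase_date"] >= snapshot - pd.Timedelta(days=90)]
          .groupby("user_id")["purchase_amount"].sum()
          .rename("spend_last_90d")
          .reset_index())

features = rfm.merge(recent, on="user_id", how="left").fillna({"spend_last_90d": 0})
```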

Model Evaluation and Interpretation

Use this to evaluate and explain your model.

Evaluate and interpret my machine learning model.

=== MODEL ===
Model type: 
# REPLACE: e.g., Random Forest Classifier

Task: 
# REPLACE: Classification / Regression

=== DATA ===
# REPLACE: Train/test split info, class distribution

=== CURRENT RESULTS ===
# REPLACE: Paste your current metrics or predictions

=== GENERATE EVALUATION ===

Python code for:

1. **Performance Metrics**
   
   For Classification:
   - Accuracy, Precision, Recall, F1
   - AUC-ROC, AUC-PR
   - Confusion Matrix
   - Classification Report
   
   For Regression:
   - RMSE, MAE, MAPE
   - R², Adjusted R²
   - Residual analysis

2. **Model Diagnostics**
   - Learning curves
   - Overfitting check
   - Cross-validation scores

3. **Feature Importance**
   - Built-in importance (if available)
   - Permutation importance
   - SHAP values

4. **Error Analysis**
   - Where does the model fail?
   - Patterns in errors
   - Segment performance

5. **Model Comparison** (if applicable)
   - Benchmark against baseline
   - Comparison visualization

6. **Interpretation Summary**
   - Key predictive features
   - How features influence predictions
   - Business recommendations
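
A condensed sketch of steps 1 and 3 for a classifier; model, X_test (a DataFrame), and y_test are assumed to exist:

```python
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.inspection import permutation_importance

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.3f}")

# Permutation importance is model-agnostic and computed on held-out data
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:10]:
    print(f"{X_test.columns[i]}: {result.importances_mean[i]:.4f}")
```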

Data Visualization Prompts

Create Visualization for Dataset

Use this to generate appropriate charts.

Create visualizations for my dataset.

=== DATASET ===
# REPLACE: Describe your dataset

=== KEY VARIABLES ===
# REPLACE: Variables you want to visualize
# - revenue (continuous)
# - region (categorical, 5 categories)
# - date (datetime, 2 years of data)
# - customer_segment (categorical)

=== PURPOSE ===
# REPLACE: What story are you trying to tell?
# - Show revenue trends over time
# - Compare performance across regions
# - Identify patterns and outliers

=== AUDIENCE ===
# REPLACE: Who will see these? (Executives / Technical team / Clients)

=== GENERATE VISUALIZATIONS ===

Python code (matplotlib/seaborn/plotly) for:

1. **Recommended Charts**
   
   | Variable(s) | Chart Type | Purpose |
   |-------------|------------|---------|
   | | | |

2. **Single Variable Visualizations**
   - Distributions
   - Time series
   - Category counts

3. **Multi-Variable Visualizations**
   - Relationships
   - Comparisons
   - Compositions

4. **Dashboard Layout**
   - Suggested arrangement
   - Key metrics to highlight

5. **Presentation-Ready Code**
   - Proper titles and labels
   - Color palette
   - Annotations
   - Export settings

6. **Interactive Version** (if applicable)
   - Plotly/Bokeh code
   - Drill-down capabilities
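
For the example variables above, a presentation-ready trend chart might look like this seaborn sketch; df and its columns are assumed:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Monthly revenue by region ("MS" resamples to month start)
monthly = (df.set_index("date")
             .groupby("region")["revenue"]
             .resample("MS").sum()
             .reset_index())

fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(data=monthly, x="date", y="revenue", hue="region", ax=ax)
ax.set_title("Monthly Revenue by Region")
ax.set_ylabel("Revenue (USD)")
ax.set_xlabel("")
fig.tight_layout()
fig.savefig("revenue_by_region.png", dpi=150)  # export for slides
```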

Build Dashboard Metrics

Use this to define dashboard KPIs.

Help me design dashboard metrics and visualizations.

=== BUSINESS CONTEXT ===
# REPLACE: What does your team/business do?

=== AUDIENCE ===
# REPLACE: Who will use this dashboard?

=== AVAILABLE DATA ===
# REPLACE: What data sources do you have?

=== KEY QUESTIONS ===
# REPLACE: What questions should the dashboard answer?
# - How are we performing vs target?
# - What's trending up or down?
# - Where should we focus?

=== GENERATE DASHBOARD DESIGN ===

1. **KPI Definitions**
   
   | KPI | Definition | Formula | Target | Cadence |
   |-----|------------|---------|--------|---------|
   | | | | | |

2. **Dashboard Layout**
   [Sketch of dashboard layout]
   - Top: Summary KPIs
   - Middle: Trend charts
   - Bottom: Detail tables

3. **Visualization Specifications**
   For each chart:
   - Chart type
   - Data source
   - Filters
   - Interactivity

4. **SQL Queries for Metrics**
   (One query per metric)

5. **Python/BI Tool Code**
   - Code to generate visualizations
   - Update/refresh logic

6. **Recommendations**
   - Refresh frequency
   - Alert thresholds
   - Future enhancements

Communication Prompts

Explain Analysis to Non-Technical Audience

Use this to translate findings.

Explain this analysis for a non-technical audience.

=== TECHNICAL FINDINGS ===
# REPLACE: Paste your technical results
# Example:
# - Logistic regression model with 0.85 AUC
# - Top features: tenure (0.45), purchase_frequency (0.32), support_tickets (0.28)
# - Customers with tenure < 60 days and >2 support tickets have 65% churn probability

=== AUDIENCE ===
# REPLACE: Who are you presenting to?
# - VP of Marketing
# - Non-technical stakeholders
# - C-suite

=== BUSINESS CONTEXT ===
# REPLACE: What decision does this inform?

=== GENERATE EXPLANATION ===

Provide:

1. **One-Sentence Summary**
   (The key insight in plain English)

2. **What We Did**
   (Non-technical description of methodology)

3. **What We Found**
   - Key finding 1 (with business implication)
   - Key finding 2 (with business implication)
   - Key finding 3 (with business implication)

4. **What This Means**
   (Business recommendations)

5. **Visualizations to Include**
   (Describe simple, intuitive charts)

6. **Confidence and Limitations**
   (Honest assessment in plain language)

7. **Recommended Actions**
   | Action | Expected Impact | Effort |
   |--------|-----------------|--------|

8. **Appendix for Questions**
   (Talking points for common questions)

Write Analysis Report

Use this to structure a complete report.

Write a data analysis report.

=== ANALYSIS TOPIC ===
# REPLACE: What did you analyze?

=== KEY FINDINGS ===
# REPLACE: Summarize main findings

=== DATA USED ===
# REPLACE: Data sources and timeframes

=== METHODOLOGY ===
# REPLACE: Brief description of methods

=== AUDIENCE ===
# REPLACE: Who will read this?

=== GENERATE REPORT ===

## [Report Title]

### Executive Summary
(1 paragraph: key findings and recommendations)

### Background
- Business context
- Why this analysis was conducted
- Questions we sought to answer

### Data & Methodology
- Data sources
- Time period
- Analytical approach
- Key assumptions

### Key Findings

**Finding 1: [Title]**
- What we found
- Supporting evidence (metrics, charts)
- Business implication

**Finding 2: [Title]**
...

### Detailed Analysis
(Technical details for those who want more depth)

### Recommendations

| Priority | Recommendation | Expected Impact | Owner | Timeline |
|----------|----------------|-----------------|-------|----------|

### Limitations & Caveats
- Data limitations
- Assumptions made
- Areas for future analysis

### Appendix
- Detailed tables
- Methodology notes
- Data dictionary

Data Storytelling Narrative

Use this to craft compelling data stories.

Create a data storytelling narrative.

=== KEY INSIGHT ===
# REPLACE: The main insight you want to communicate

=== SUPPORTING DATA ===
# REPLACE: Data points that support the insight

=== AUDIENCE ===
# REPLACE: Who needs to act on this insight?

=== DESIRED ACTION ===
# REPLACE: What do you want them to do?

=== GENERATE NARRATIVE ===

Create a compelling story arc:

1. **Hook**
   (Opening that grabs attention)

2. **Context**
   (Background needed to understand)

3. **Tension**
   (The problem or challenge revealed by data)

4. **Data Evidence**
   - Key metric 1: [number] - what it means
   - Key metric 2: [number] - what it means
   - Visualization description

5. **Resolution**
   (The recommendation or solution)

6. **Call to Action**
   (Specific next steps)

7. **Presentation Slide Outline**
   | Slide | Title | Content | Visual |
   |-------|-------|---------|--------|

8. **Speaking Notes**
   (Talking points for each section)

Python Code Prompts

Debug Python Data Code

Use this to fix Python errors in data work.

Help me debug this Python data analysis code.

=== ERROR ===
# REPLACE: Paste the full error message and traceback

=== CODE ===
# REPLACE: Paste your code

=== DATA (sample) ===
# REPLACE: Sample of your data or df.head() output

=== WHAT I'M TRYING TO DO ===
# REPLACE: Explain your goal

Please:
1. Explain what the error means
2. Identify the root cause
3. Provide corrected code
4. Explain the fix
5. Suggest how to prevent this error

Convert Analysis to Production Code

Use this to refactor analysis code.

Convert this analysis code to production-quality code.

=== CURRENT CODE ===
# REPLACE: Paste your notebook/analysis code

=== REQUIREMENTS ===
# REPLACE: What needs to be productionized?
# - Run as scheduled job
# - Handle new data
# - Error handling
# - Logging

=== GENERATE PRODUCTION CODE ===

Provide:

1. **Refactored Code**
   - Functions with docstrings
   - Type hints
   - Error handling
   - Logging

2. **Configuration**
   - Config file structure
   - Environment variables

3. **Testing**
   - Unit tests
   - Data validation

4. **Documentation**
   - README
   - Function documentation

5. **Deployment Notes**
   - Dependencies (requirements.txt)
   - Run instructions
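
As a small sketch of the refactoring direction (docstrings, type hints, logging, explicit errors); the path and cleaning steps are placeholders:

```python
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)

def load_and_clean(path: Path) -> pd.DataFrame:
    """Load a raw CSV and apply basic cleaning steps."""
    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {path}")
    df = pd.read_csv(path)
    before = len(df)
    df = df.drop_duplicates()
    logger.info("Loaded %d rows, dropped %d duplicates", before, before - len(df))
    return df

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    df = load_and_clean(Path("data/input.csv"))  # placeholder path
```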

Quick Reference

| Category | Need | Prompt to Use |
|----------|------|---------------|
| SQL | Write SQL from description | Generate SQL Query from Natural Language |
| SQL | Make query faster | Optimize Slow SQL Query |
| SQL | Fix SQL error | Debug SQL Error |
| SQL | Advanced analytics query | Complex SQL with CTEs and Window Functions |
| EDA | Start exploring new data | Initial Dataset Exploration |
| EDA | Analyze specific columns | Analyze Specific Variables |
| EDA | Find variable relationships | Correlation and Relationship Analysis |
| Data Cleaning | Handle missing data | Handle Missing Values |
| Data Cleaning | Find and fix outliers | Detect and Handle Outliers |
| Data Cleaning | Fix data types | Data Type and Format Cleaning |
| Statistics | Run significance test | Hypothesis Testing |
| Statistics | Analyze A/B test | A/B Test Analysis |
| Statistics | Build regression model | Regression Analysis |
| Machine Learning | Pick the right model | Model Selection Helper |
| Machine Learning | Create better features | Feature Engineering |
| Machine Learning | Evaluate model | Model Evaluation and Interpretation |
| Visualization | Create charts | Create Visualization for Dataset |
| Visualization | Design dashboard | Build Dashboard Metrics |
| Communication | Explain to non-technical | Explain Analysis to Non-Technical Audience |
| Communication | Write formal report | Write Analysis Report |
| Communication | Tell data story | Data Storytelling Narrative |
| Code | Fix Python errors | Debug Python Data Code |
| Code | Make code production-ready | Convert Analysis to Production Code |

Tips for Better Data Science Prompts

1. Include Data Schema

❌ "Write a query to find top customers"
✅ "Write a SQL query for PostgreSQL. Tables:
    - customers (customer_id, name, signup_date)
    - orders (order_id, customer_id, amount, created_at)
    Find top 10 customers by total order value in the last 90 days."

2. Specify Your Tools

❌ "Show me how to plot this"
✅ "Using Python with pandas and seaborn, create a visualization showing..."

3. Include Sample Data

"Here's the first 5 rows:
customer_id,revenue,segment
1,150.00,Premium
2,45.50,Standard
..."

4. State Your Objective

❌ "Analyze this data"
✅ "Analyze this data to identify which customer segments 
    have the highest churn risk, so we can prioritize retention efforts."

5. Specify Output Format

"Output as Python code with comments"
"Provide both SQL and equivalent pandas code"
"Include visualizations with Plotly"

6. Ask for Interpretation

"What do these results mean for the business?"
"Explain how to present this to a non-technical audience"
"What are the limitations of this analysis?"

7. Iterate with Context

"The query runs but returns 0 rows. Here's the data sample..."
"Good, but also add handling for NULL values in column X"
"Can you make this more efficient for 100M rows?"

Found these prompts helpful? Share them with your data team!
