Asked by ABAINEH
As a data scientist at a multinational corporation, you are tasked with using ChatGPT to predict customer churn for the upcoming quarter. The model should take into account variables such as tenure, monthly charges, total charges, contract type, payment method, and whether the customer has opted for various services like online security, device protection, tech support, and streaming services. However, the model has never been exposed to this type of problem before.
How would you craft the best prompt to guide ChatGPT in analyzing this data and making a predictive analysis on customer churn? What type of prompt would you use, and why? Consider the need for clarity and specificity in your instructions, as well as any examples you might provide to guide the model's response.
Answers
There are no human answers yet.
Answered by
GPT-5 mini
Short answer: use a structured, specific instruction-style prompt (system + user) that (1) tells the model its role, (2) gives the dataset schema and sample rows, (3) lists exact deliverables (EDA, preprocessing, modeling, metrics, code, explainability, and per-customer predictions), (4) constrains tools and formats (Python, pandas/sklearn/XGBoost, output as JSON or a table), and (5) provides a few labeled examples (few-shot) so the model knows the expected output format. This reduces ambiguity and yields reproducible, actionable results.
Below I give (A) reasons why this prompt style works, (B) a ready-to-use prompt template you can paste into ChatGPT, (C) a short version for single-customer scoring, and (D) tips and best practices.
A. Why this approach
- Role and context set expectations (expertise and business goal).
- Schema and sample rows remove uncertainty about column types and the target.
- Explicit deliverables force the model to produce the outputs you care about (e.g., code, metrics, SHAP explanations, retention suggestions).
- Few-shot examples teach the model the exact output format for predictions/explanations.
- Constraints (language, libraries, evaluation metrics) produce usable, reproducible artifacts you can run and validate.
B. Full prompt template (copy/paste and replace placeholders)
System message:
You are an experienced machine learning engineer and data scientist specializing in customer churn prediction. Produce reproducible Python code and clear explanations. Use pandas, scikit-learn, XGBoost, and SHAP where appropriate. Provide concise, actionable business recommendations for reducing churn.
User message:
I will provide a customer dataset with the following schema (replace with actual column names if different):
- customerID: string
- tenure: integer (months)
- monthly_charges: float
- total_charges: float (may contain missing/empty)
- contract: categorical (Month-to-month, One year, Two year)
- payment_method: categorical (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- online_security: categorical (Yes, No, No internet service)
- device_protection: categorical (Yes, No, No internet service)
- tech_support: categorical (Yes, No, No internet service)
- streaming_tv: categorical (Yes, No, No internet service)
- streaming_movies: categorical (Yes, No, No internet service)
- other relevant columns...
- churn: target (Yes/No) -- if you do not have labels for training, tell me how to proceed with unsupervised or risk-scoring approaches.
My objectives:
1. Build a predictive model to estimate probability of churn next quarter for each customer.
2. Provide a reproducible Jupyter notebook (or Python script) with data cleaning, EDA, feature engineering, model training, validation, hyperparameter tuning, and final model export.
3. Provide per-customer outputs: probability_of_churn (0-1), predicted_class (Yes/No) using a business-chosen threshold, top 3 reasons/features contributing to the prediction (feature importance / SHAP), recommended retention action (e.g., offer 1-month discount, upgrade to annual contract).
4. Provide model evaluation using appropriate metrics (AUC-ROC, Precision@k, recall, confusion matrix, calibration plot) and recommended threshold tuned to minimize false negatives (business may prefer recall) — explain trade-offs.
5. Address class imbalance, missing data, and categorical encoding. Explain assumptions and limitations.
Please produce the following deliverables in order:
1) Data inspection and EDA summary (summary stats, churn rate, feature distributions, correlations, top missing values).
2) Preprocessing plan and exact code: imputations, encoding, scaling, handling "No internet service" categories, creation of sensible derived features (e.g., avg_monthly_charge = total_charges/tenure), and how to treat tenure=0 or missing total_charges.
3) Baseline model (logistic regression) and at least one stronger model (XGBoost or RandomForest). Use stratified 5-fold cross-validation and grid/hyperparameter search. Show code.
4) Model evaluation results (CV AUC with CI, test set performance, confusion matrix at chosen threshold). Show feature importances and SHAP summary plot code.
5) Final model export (pickle/ONNX) and a small function predict_customer(row) that returns JSON: {customerID, probability_of_churn, predicted_class, top_3_features_with_contributions, recommended_action}.
6) Short business-ready summary and prioritized retention actions with estimated expected uplift rationale (if possible).
7) A short section on deployment considerations and monitoring (drift detection, periodic retraining frequency).
8) A clear list of assumptions and limitations.
Important implementation choices:
- Use stratified splitting (train/test 80/20). If class imbalance exceeds 10:1, use class_weight (or a balanced objective) or SMOTE; show both approaches and compare.
- Use random seed 42 for reproducibility.
- Use probability threshold tuning against a business metric (e.g., maximize F1 or target recall >= X).
- Use SHAP for per-customer explanations and show code to extract the top 3 contributing features.
Example rows and expected output (few-shot):
Input rows (CSV rows):
customerID,tenure,monthly_charges,total_charges,contract,payment_method,online_security,device_protection,tech_support,streaming_tv,streaming_movies
C001,1,70.35,70.35,Month-to-month,Electronic check,No,No,No,No,No
C002,45,89.15,4000.25,Two year,Bank transfer (automatic),Yes,Yes,Yes,Yes,Yes
C003,10,52.00,520.00,Month-to-month,Credit card (automatic),No,No,No,Yes,No
Expected per-customer prediction format:
{
  "customerID": "C001",
  "probability_of_churn": 0.82,
  "predicted_class": "Yes",
  "top_3_features": [
    {"feature": "contract", "value": "Month-to-month", "contribution": 0.25},
    {"feature": "payment_method", "value": "Electronic check", "contribution": 0.12},
    {"feature": "tenure", "value": 1, "contribution": 0.10}
  ],
  "recommended_action": "Offer 20% discount for 3 months + 1-year contract incentive"
}
Please produce code cells in notebook style (pandas, numpy, sklearn, xgboost, shap, matplotlib/seaborn). If the dataset is large, include sample code to stream rows. If you do not have the actual CSV yet, outline the full notebook with a placeholder for loading the data and flag where I should replace path names. After you produce the plan and code, ask clarifying questions about business priorities (e.g., whether minimizing false negatives or false positives is more important, and the budget per retention offer).
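For reference, here is a minimal sketch (Python) of the kind of pipeline the template above asks ChatGPT to produce, covering the preprocessing and implementation choices listed. It assumes the schema shown earlier, a placeholder file path churn.csv, and a Yes/No churn column; treat it as an illustration to adapt, not a finished implementation.

# Minimal illustrative pipeline; column names and "churn.csv" are placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("churn.csv")  # replace with your file path
df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce")  # empty strings -> NaN
df["avg_monthly_charge"] = df["total_charges"] / df["tenure"].replace(0, 1)  # avoid division by zero

numeric = ["tenure", "monthly_charges", "total_charges", "avg_monthly_charge"]
categorical = ["contract", "payment_method", "online_security", "device_protection",
               "tech_support", "streaming_tv", "streaming_movies"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))])

X = df[numeric + categorical]
y = (df["churn"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("CV AUC:", cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean())

The same structure extends to an XGBoost model and a hyperparameter search (e.g., GridSearchCV over the pipeline) as requested in deliverable 3.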
C. Short prompt for ad-hoc single-customer scoring (paste this when you just want one customer's prediction)
System message:
You are a data scientist. Use a trained model (assume the features and preprocessing described below) to score a single customer row and explain the top reasons.
User message:
Model info: assume a trained XGBoost model whose preprocessing one-hot encodes categorical features and scales numeric features. Use SHAP values to explain contributions. Business threshold = 0.5 unless otherwise specified.
Customer row:
{customerID: "C901", tenure: 3, monthly_charges: 85.99, total_charges: 255.97, contract: "Month-to-month", payment_method: "Electronic check", online_security:"No", device_protection:"No", tech_support:"No", streaming_tv:"Yes", streaming_movies:"Yes"}
Output format (JSON):
{customerID, probability_of_churn, predicted_class, top_3_features_with_contributions, recommended_action}
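To illustrate that output format, here is a minimal sketch of a scoring helper. It assumes a fitted xgboost.XGBClassifier named clf and a fitted preprocessing transformer named preprocess (both placeholder names, e.g. trained as in the pipeline sketch in section B), plus the shap library; the recommended_action mapping is left as a placeholder.

# Illustrative single-customer scorer; clf, preprocess, and the action mapping are assumptions.
import json
import numpy as np
import pandas as pd
import shap

explainer = shap.TreeExplainer(clf)               # clf: fitted xgboost.XGBClassifier
feature_names = preprocess.get_feature_names_out()
THRESHOLD = 0.5                                   # business-chosen threshold

def predict_customer(row: dict) -> str:
    row = dict(row)
    row["avg_monthly_charge"] = row["total_charges"] / max(row["tenure"], 1)  # same derivation as training
    X = preprocess.transform(pd.DataFrame([row]))
    if hasattr(X, "toarray"):                     # densify if the encoder returned a sparse matrix
        X = X.toarray()
    proba = float(clf.predict_proba(X)[0, 1])
    contrib = explainer.shap_values(X)[0]         # SHAP contributions for this one row
    top_idx = np.argsort(np.abs(contrib))[::-1][:3]
    top_features = [{"feature": str(feature_names[i]), "contribution": round(float(contrib[i]), 3)}
                    for i in top_idx]
    return json.dumps({
        "customerID": row.get("customerID"),
        "probability_of_churn": round(proba, 3),
        "predicted_class": "Yes" if proba >= THRESHOLD else "No",
        "top_3_features": top_features,
        "recommended_action": "placeholder: map top churn drivers to a retention offer",
    })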
D. Additional tips and best practices
- Provide the actual CSV or a way to load it; the more real samples you share, the more concrete the model's outputs will be.
- Include the business objective (e.g., reduce churn by X% or maximize retention ROI); it guides the threshold and the recommended actions (a threshold-selection sketch follows this list).
- Ask the model to produce runnable code and to mark places where you must insert secrets or file paths.
- Ask for a small train/test run on a subset of the data first to validate the pipeline and check runtime.
- Iterate: after the model returns a notebook, run it, find issues, then ask the model to fix specific errors by pasting stack traces.
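The threshold sketch referenced above: a minimal, illustrative way to pick an operating threshold that meets a target recall on a validation set. y_val and proba_val are assumed to come from a held-out split (e.g., the pipeline sketch in section B), and the 0.80 target is a placeholder for the business's preference.

# Illustrative threshold selection; y_val, proba_val, and the recall target are assumptions.
import numpy as np
from sklearn.metrics import precision_recall_curve

TARGET_RECALL = 0.80
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
# precision/recall have one more entry than thresholds; align them before filtering.
meets_target = recall[:-1] >= TARGET_RECALL
chosen = thresholds[meets_target].max() if meets_target.any() else 0.5  # fall back to the default
print(f"Chosen threshold: {chosen:.3f}")

Picking the highest threshold that still meets the recall target keeps precision as high as possible for that recall level, which is exactly the trade-off the full prompt asks ChatGPT to explain.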
If you want, I can generate a ready-to-run Jupyter notebook scaffold using the exact columns you have: paste a small sample of real rows (10–50) or indicate the file location, and I'll produce the notebook and sample predictions.