Methodology: 6-Phase Analysis Pipeline
- Unit of analysis: Individual survey response
- Prevalence: Unique PIDs with 2+ model agreement (majority vote)
- Multi-coding: One response can be assigned multiple themes
- Coverage: Percentage of responses assigned at least one theme
- IRR: All responses coded by all 3 models; Krippendorff's α computed per theme
- Convergence: How many of the 3 models independently discovered a theme during Phase 1
Multi-LLM Qualitative Analysis Methodology
Analyzing 860 Microsoft developer survey responses on AI automation preferences using triangulated thematic analysis
Source survey: "AI Where It Matters: Where, Why, and How Developers Want AI Support in Daily Work" (Choudhuri et al., 2025)
This document describes the full methodology for a multi-model qualitative analysis pipeline that identifies research opportunities (what developers want AI to do) and design constraints (what developers do not want AI to handle) from open-ended survey responses. Three frontier LLMs serve as independent coders, with inter-rater reliability calculated via Krippendorff's alpha and consensus reached through majority vote. A human review gate validates codebooks before systematic coding begins.
Research Questions & Data
Survey Questions (per category)
- "Want" question (opportunity track): open-ended responses capturing desired capabilities and unmet needs.
- "Not want" question (constraint track): open-ended responses capturing guardrails, no-go zones, and boundary conditions.
Unit of Analysis
Each unit is a single respondent's open-ended answer to one of the two questions within a category. Respondents answered about 2–3 categories each, and a single response may be assigned multiple theme codes.
Task Categories & Response Counts
| Category | Respondents | Tasks Covered |
|---|---|---|
| Development | 816 | Coding, Bug Fixing, Perf Optimization, Refactoring, AI Development |
| Design & Planning | 548 | System Architecture, Requirements Gathering, Project Planning |
| Meta-Work | 532 | Documentation, Communication, Mentoring, Learning, Research |
| Quality & Risk | 401 | Testing & QA, Code Review / PRs, Security & Compliance |
| Infrastructure & Ops | 283 | DevOps / CI-CD, Environment Setup, Monitoring, Customer Support |
Data Quality Note
Approximately 11% of responses contain data quality issues detected during coding: misplaced answers (a "want" answer written in the "not want" field or vice versa), back-references to prior answers that are unintelligible on their own, and terse non-responses. These are flagged with ISSUE_* codes rather than discarded, avoiding pre-filter bias (see Methodological Controls).
Pipeline Overview
The analysis runs in two stages with a human review gate between them. Both the opportunity and constraint tracks follow identical process steps but with track-specific prompts and codebooks.
Stage 1: Theme Discovery & Reconciliation
Phase 1: Independent 3-Model Discovery
Each model receives all responses for a given category and independently proposes 4–15 themes with supporting evidence (PIDs). The prompt instructs models to create specific, actionable, problem-focused themes and to allow multi-coding.
| Model | Provider | Thinking Mode | Role |
|---|---|---|---|
| GPT-5.2 | OpenAI | reasoning_effort="high" | Independent coder & reconciler |
| Gemini 3.1 Pro | Google | thinking_level="HIGH" | Independent coder |
| Claude Opus 4.6 | Anthropic | thinking: adaptive, effort: high | Independent coder |
Inputs
- Open-ended survey responses with PIDs (e.g., 816 Development responses or 548 Design & Planning responses)
- Category name and context description
Outputs (per category, per model)
- Theme codebook: code, name, description, supporting PIDs
- Per-response codings: PID → [theme_code_1, theme_code_2, ...]
- Files: `{category}_themes_{model}.json` (15 opportunity files) and `{category}_constraint_themes_{model}.json` (15 constraint files)
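Under the file naming above, the three-coder fan-out can be sketched as follows. `call_model(model, prompt) -> dict` is a hypothetical hook standing in for each provider's SDK call, which this document does not specify.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MODELS = ["gpt", "gemini", "opus"]

def run_discovery(category: str, prompt: str, call_model, out_dir: Path) -> dict:
    """Phase 1 sketch: each coder independently proposes themes for one category.

    call_model is a hypothetical hook wrapping each provider's API client.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # The three coders are independent, so their calls can run in parallel.
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {m: pool.submit(call_model, m, prompt) for m in MODELS}
        results = {m: f.result() for m, f in futures.items()}
    # One codebook file per category per model, matching the artifact inventory.
    for model, themes in results.items():
        (out_dir / f"{category}_themes_{model}.json").write_text(
            json.dumps(themes, indent=2))
    return results
```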
Phase 2: GPT-5.2 Reconciliation
A single reconciliation model (GPT-5.2) receives all three models' theme sets and produces a unified codebook per category by:
- Identifying overlapping themes across models (same concept, different names)
- Merging overlapping themes into single unified entries
- Retaining single-model themes only if substantive (≥3 PIDs)
- Dropping themes that are too vague or have very few supporting responses
- Targeting 5–10 unified themes per category
Each unified theme records its source_models (which of the three models independently proposed it) and source_codes (original model-specific code names), providing full provenance.
Outputs
- `consolidated_codebook.json` — all 5 category codebooks (opportunity track)
- `constraint_codebook.json` — all 5 category codebooks (constraint track)
Human Review Gate Required
The pipeline pauses for researcher review before systematic coding begins. The researcher:
- Reviews each proposed theme and reads sample supporting responses
- Checks themes for specificity, granularity, and completeness
- Can keep, rename, merge, split, or remove any theme
- Can add themes the models missed
- Documents rationale for all changes
Systematic coding (Stage 2) does not proceed until the codebook is explicitly approved.
Stage 2: Systematic Coding & Analysis
Coding Protocol
All three models independently re-code every response against the finalized codebook. Key protocol elements:
| Parameter | Value | Rationale |
|---|---|---|
| Batch size | 20 responses per API call | Balances context window usage against API call count |
| Rationale-first | Model writes rationale before assigning codes | Improves accuracy via chain-of-thought; enables auditability |
| Cross-response context | Each response shown alongside opposite-question answer | Enables misresponse detection (ISSUE codes) |
| Multi-coding | 0, 1, or many themes per response | Captures full semantic content |
| Codebook-only | Only codebook codes or ISSUE_* codes allowed | Prevents code drift across batches |
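The batching and cross-response-context parameters above can be sketched as follows; the helper names are illustrative, not the pipeline's actual code.

```python
BATCH_SIZE = 20  # responses per API call, per the coding protocol

def batched(responses, size=BATCH_SIZE):
    """Yield successive fixed-size batches of (pid, text) responses."""
    for i in range(0, len(responses), size):
        yield responses[i:i + size]

def build_batch_payload(batch, opposite_answers):
    """Attach the same respondent's opposite-question answer as context,
    enabling ISSUE_WRONG_FIELD detection during coding."""
    return [
        {"pid": pid, "response": text, "context": opposite_answers.get(pid, "")}
        for pid, text in batch
    ]
```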
ISSUE Code System
During systematic coding, models flag data quality problems rather than silently discarding responses:
| Code | Meaning | Example |
|---|---|---|
| `ISSUE_WRONG_FIELD` | Respondent answered the opposite question | Describing constraints in the "want" field |
| `ISSUE_BACK_REFERENCE` | References a prior answer; unintelligible alone | "Same as before", "see above" |
| `ISSUE_NON_RESPONSE` | Terse non-answer with no analyzable content | "N/A", "none", "no" |
Models may create additional ISSUE_* codes if they encounter other data quality problems. The ISSUE prefix ensures these are never confused with substantive themes.
Inter-Rater Reliability (IRR)
Agreement between the three LLM coders is measured per theme using Krippendorff's alpha (α), the standard multi-rater reliability coefficient for qualitative research. For each theme, a binary (present/absent) coding matrix is built across all responses, and α is calculated at the nominal level.
| Range | Interpretation |
|---|---|
| α ≥ 0.80 | Excellent agreement — publishable |
| α ≥ 0.67 | Acceptable agreement — tentative conclusions |
| α ≥ 0.50 | Moderate agreement — use with caution |
| α < 0.50 | Poor agreement — unreliable for this theme |
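As a reference for the α computation, here is a minimal nominal-level implementation for complete ratings (no missing data); the pipeline's actual IRR code is not shown in this document.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Nominal-level Krippendorff's alpha for complete ratings.

    `ratings` is a list of units; each unit lists the category labels
    (e.g., 0/1 for theme absent/present) assigned by the raters.
    """
    coincidence = Counter()  # weighted counts of ordered value pairs
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue  # a unit needs at least two raters to be pairable
        for a, b in permutations(unit, 2):
            coincidence[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _), w in coincidence.items():
        totals[a] += w
    n = sum(totals.values())
    d_observed = sum(w for (a, b), w in coincidence.items() if a != b)
    d_expected = sum(totals[a] * totals[b]
                     for a in totals for b in totals if a != b) / (n - 1)
    return 1.0 if d_expected == 0 else 1 - d_observed / d_expected
```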
Additionally, pairwise Cohen's kappa (κ) is calculated for each model pair (GPT–Gemini, GPT–Opus, Gemini–Opus) and 3-rater percent agreement (all three models assign the same code) is reported per theme.
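The pairwise κ and 3-rater percent agreement can likewise be sketched for binary present/absent vectors (function names are illustrative):

```python
def cohens_kappa(a, b):
    """Pairwise Cohen's kappa for two binary (0/1) coding vectors."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_a, p_b = sum(a) / n, sum(b) / n                    # marginal rates
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)              # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def percent_agreement(a, b, c):
    """Fraction of responses where all three coders assign the same value."""
    return sum(x == y == z for x, y, z in zip(a, b, c)) / len(a)
```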
Consensus Voting
Final theme assignments use a majority vote: 2 of 3 models must agree for a theme to be assigned to a response. This is applied independently per response and per theme code.
ISSUE code handling
If 2+ models flag any ISSUE code for a response (regardless of which specific ISSUE code), the response receives a generic ISSUE marker and is excluded from substantive analysis. This prevents a single aggressive model from filtering out too many responses.
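A minimal sketch of this voting rule, assuming codings are keyed by model and PID (the pipeline's actual data structures are not specified here):

```python
from collections import Counter

def majority_vote(codings_by_model, quorum=2):
    """codings_by_model: {model: {pid: [codes]}} -> {pid: consensus codes}.

    A theme needs `quorum` of the 3 coders; if `quorum` coders flag any
    ISSUE_* code, the response gets a generic ISSUE marker instead.
    """
    all_pids = set().union(*(c.keys() for c in codings_by_model.values()))
    consensus = {}
    for pid in all_pids:
        votes, issue_flags = Counter(), 0
        for model_codings in codings_by_model.values():
            codes = model_codings.get(pid, [])
            if any(c.startswith("ISSUE_") for c in codes):
                issue_flags += 1
            votes.update(c for c in codes if not c.startswith("ISSUE_"))
        if issue_flags >= quorum:
            consensus[pid] = ["ISSUE"]  # excluded from substantive analysis
        else:
            consensus[pid] = sorted(c for c, n in votes.items() if n >= quorum)
    return consensus
```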
Rich Opportunity Cards
For the top 5 themes per category (by prevalence), all three models independently generate detailed opportunity cards including:
- Problem statement and proposed capability description
- Required context sources and capability steps
- Impact description with supporting evidence quotes
- Success criteria (qualitative and quantitative measures)
- Constraints and guardrails drawn from the constraint track
- Prevalence data and quantitative signals (AI preference, usage gap)
Cards from the three models are merged using a union-and-deduplicate strategy: longest title wins, context sources are combined (max 7), capability steps use the longest sequence (max 6), and constraints are deduplicated (max 4).
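The merge strategy can be sketched as follows; the card field names are illustrative assumptions, not the pipeline's actual schema:

```python
def merge_cards(cards):
    """Union-and-deduplicate merge of one theme's cards from the three models."""
    def dedupe(items):
        seen = []
        for item in items:
            if item not in seen:
                seen.append(item)
        return seen

    return {
        # Longest title wins.
        "title": max((c["title"] for c in cards), key=len),
        # Context sources combined across models, capped at 7.
        "context_sources": dedupe(s for c in cards for s in c["context_sources"])[:7],
        # Capability steps: longest single sequence, capped at 6.
        "capability_steps": max((c["capability_steps"] for c in cards), key=len)[:6],
        # Constraints deduplicated, capped at 4.
        "constraints": dedupe(g for c in cards for g in c["constraints"])[:4],
    }
```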
Constraint Maps & Design Principles
Constraint-track prevalence is calculated identically to the opportunity track. The top no-go zones per category are documented with:
- Zone name, description, and prevalence count
- Up to 10 supporting respondent quotes
- 3–6 synthesized design principles per category (generated by GPT-5.2)
- Each principle includes implementation guidance and derivation provenance
Methodological Controls
The pipeline incorporates several controls designed to increase rigor beyond what a single-model analysis can provide.
| Control | Mechanism | What It Mitigates |
|---|---|---|
| Multi-LLM triangulation | 3 frontier models from different families code independently | Single-model bias, training-data artifacts, idiosyncratic interpretations |
| Rationale-first coding | Models write reasoning before assigning codes | Snap-judgment errors; enables post-hoc audit of coding decisions |
| Cross-response context | Both "want" and "not want" answers shown to coder | Misresponse blindness; enables ISSUE_WRONG_FIELD detection |
| ISSUE code system | Flag quality problems in-band rather than pre-filtering | Pre-filter bias from silently dropping ambiguous responses |
| Idempotent checkpointing | Staleness detection skips phases whose inputs haven't changed | Wasted computation; ensures reproducible reruns |
| Consensus merging | Majority vote (2/3) for codes; union-and-deduplicate for synthesis | Noise from single-model outlier codes; incomplete synthesis from any single model |
Design Decisions & Trade-offs
| Decision | Rationale | Trade-off |
|---|---|---|
| 3 models, not 2 or 5 | Minimum for meaningful IRR (Krippendorff's α); covers 3 major LLM families | Higher API cost (∼3×); manageable with batch parallelism |
| HIGH thinking for all models | Qualitative coding benefits from extended reasoning; reduces surface-level pattern matching | Slower inference, higher token cost (thinking tokens billed at output rate) |
| Batch size of 20 | Enough responses for cross-response pattern recognition; fits comfortably in context windows | More API calls than larger batches; but avoids context truncation risks |
| Majority vote (2/3) | Balances sensitivity and specificity; equivalent to >50% agreement threshold | May miss themes where only one model sees a valid pattern |
| Human gate before coding | Prevents systematic errors from propagating through the entire coding phase | Introduces a manual pause in an otherwise automated pipeline |
| No pre-filtering of responses | ISSUE codes capture quality problems without discarding data points | Models must process noisy responses; ISSUE detection is itself imperfect |
| GPT-5.2 as sole reconciler | Reconciliation requires structured comparison rather than independent generation; one model suffices | Reconciliation may inherit GPT-specific biases in theme naming |
| Streaming for Claude Opus | Avoids 10-minute HTTP timeout on long-running inference | More complex error handling; no retry on partial stream failures |
Limitations & Mitigations
| Limitation | Impact | Mitigation |
|---|---|---|
| LLM nondeterminism | Exact codings may vary across runs even with identical inputs | 3-model triangulation smooths out individual variance; IRR quantifies remaining disagreement; idempotent checkpointing ensures reproducible runs when inputs are stable |
| LLM rationalization | Models may construct plausible but incorrect rationales | Multi-model disagreement surfaces cases where rationalization diverges; majority vote filters single-model confabulations |
| Prompt sensitivity | Different prompt wording could yield different themes | Codebook-anchored coding constrains coder freedom; prompts are documented and versioned for replication |
| Not replacing human qualitative research | LLM coders lack lived experience; may miss cultural nuances | Human review gate validates codebook; methodology is positioned as accelerating qualitative work, not replacing it; all outputs include supporting quotes for human verification |
| Survey sample | 860 Microsoft developers may not represent the broader industry | Out of scope for the analysis methodology itself; noted as a limitation of the source data |
| LLM knowledge contamination | Models may have been trained on similar survey analyses | Codebook-first design constrains output to researcher-approved themes; verbatim quotes provide verifiable evidence independent of model knowledge |
Artifacts & Replication
Artifact Inventory
| Phase | File Pattern | Count | Description |
|---|---|---|---|
| Data | {category}_responses.json | 5 | Extracted open-ended responses with PIDs |
| Data | {category}_quantitative.json | 5 | Aggregated Likert scale metrics per task |
| Data | {category}_do_not_want_responses.json | 5 | Extracted constraint responses with PIDs |
| Stage 1 | {category}_themes_{model}.json | 15 | Independent opportunity theme discoveries |
| Stage 1 | {category}_constraint_themes_{model}.json | 15 | Independent constraint theme discoveries |
| Stage 1 | consolidated_codebook.json | 1 | Unified opportunity codebook (all categories) |
| Stage 1 | constraint_codebook.json | 1 | Unified constraint codebook (all categories) |
| Stage 2 | {category}_phase4_codings.json | 5 | 3-model systematic codings with rationales |
| Stage 2 | phase5_irr_results.json | 1 | Krippendorff's α, Cohen's κ, agreement % |
| Stage 2 | phase6_prevalence_results.json | 1 | Majority-vote consensus and theme prevalence |
| Stage 2 | phase6_rich_opportunities.json | 1 | Top-5 opportunity cards per category (3-model synthesis) |
| Stage 2 | constraint_maps.json | 1 | No-go zones and design principles |
Dependency Chain
Staleness Detection
Every pipeline phase checks whether its output is stale relative to its inputs by comparing file modification times. If all inputs are older than the output, the phase is skipped. If any input is newer, the output is regenerated. This enables:
- Incremental reruns: updating one category's theme discovery only regenerates downstream outputs for that category
- Safe restarts: if the pipeline crashes mid-phase, only the incomplete phase reruns
- Force override: `--force` flag bypasses staleness checks for full regeneration
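The mtime comparison described above can be sketched as (function names are illustrative):

```python
from pathlib import Path

def is_stale(output: Path, inputs: list) -> bool:
    """A phase output is stale if it is missing or any input is newer."""
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(inp.stat().st_mtime > out_mtime for inp in inputs)

def maybe_run(phase_fn, output: Path, inputs: list, force: bool = False):
    """Skip the phase when its output is up to date, unless --force is given."""
    if force or is_stale(output, inputs):
        phase_fn()
```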
How to Rerun
- Ensure API keys are set in `.env` for OpenAI, Google, and Anthropic
- Install dependencies: `uv sync`
- Run full pipeline: `bash run_full_pipeline.sh`
- Pipeline pauses after Stage 1 for human codebook review
- After approval, Stage 2 runs automatically
- To force regeneration: `bash run_full_pipeline.sh --force`
- To rerun a single category: `uv run phase4_systematic_coding.py design_planning`
Appendix
Opportunity Codebook (All 5 Categories, 48 Themes)
Unified codebook produced by GPT-5.2 reconciliation of themes independently discovered by all three models. Each theme lists which models independently identified it.
Development 10 themes
| Code | Theme | Models |
|---|---|---|
refactoring_modernization | Automated Refactoring, Modernization & Tech-Debt Reduction | GPT, Gemini, Opus |
boilerplate_scaffolding_feature_codegen | Boilerplate, Scaffolding & Routine Feature Code Generation | GPT, Gemini, Opus |
automated_testing_validation | Automated Test Generation, Coverage & Change Validation | GPT, Gemini, Opus |
debugging_root_cause_fixing | Debugging, Root Cause Analysis & Bug Fix Assistance | GPT, Gemini, Opus |
repo_wide_context_dependency_awareness | Repo-Wide Context, Dependency Awareness & Safe Multi-File Changes | GPT, Gemini, Opus |
code_quality_review_security_compliance | Code Quality, Review Automation, Standards & Security/Compliance Guidance | GPT, Gemini, Opus |
performance_profiling_optimization | Performance Profiling & Optimization Suggestions | GPT, Gemini, Opus |
architecture_design_planning_support | Architecture, Design Brainstorming & Planning Support | GPT, Gemini, Opus |
devops_ci_cd_iac_workflow_automation | DevOps, CI/CD, IaC & Engineering Workflow Automation | GPT, Gemini, Opus |
documentation_knowledge_retrieval_onboarding | Documentation Generation, Knowledge Retrieval & Onboarding/Learning Support | GPT, Gemini, Opus |
Design & Planning 10 themes
| Code | Theme | Models |
|---|---|---|
requirements_gathering_synthesis | Requirements Gathering, Synthesis & Clarification | GPT, Gemini, Opus |
architecture_design_generation | Architecture & System Design Generation/Iteration | GPT, Gemini, Opus |
interactive_brainstorming_design_partner | Interactive Brainstorming & Design Copilot | GPT, Gemini, Opus |
tradeoff_decision_support_simulation | Trade-off Analysis, What-if Simulation & Decision Support | GPT, Gemini, Opus |
design_validation_risk_edge_cases | Design Validation, Risk Assessment & Edge-Case Discovery | GPT, Gemini, Opus |
project_planning_tasking_status_automation | Project Planning, Ticket/Task Breakdown & Status Automation | GPT, Gemini, Opus |
documentation_spec_diagram_generation | Documentation, Specs & Diagram/Artifact Generation | GPT, Gemini, Opus |
context_retrieval_codebase_and_institutional_memory | Context Retrieval: Codebase Understanding & Institutional Memory | GPT, Gemini, Opus |
research_and_information_synthesis | Research, Information Gathering & Knowledge Synthesis | GPT, Gemini, Opus |
trustworthy_outputs_with_citations | Trustworthy Outputs: Higher Accuracy & Verifiable Citations | GPT, Gemini, Opus |
Quality & Risk 9 themes
| Code | Theme | Models |
|---|---|---|
automated_test_generation_and_quality_gates | Automated Test Generation, Maintenance & Quality Gates | GPT, Gemini, Opus |
intelligent_pr_code_review | Intelligent PR/Code Review Assistant | GPT, Gemini, Opus |
security_vulnerability_detection_and_fix_guidance | Security Vulnerability Detection & Fix Guidance | GPT, Gemini, Opus |
compliance_and_audit_automation | Compliance, Standards & Audit Process Automation | GPT, Gemini, Opus |
proactive_risk_monitoring_and_prediction | Proactive Risk Monitoring, Prediction & Anomaly Detection | GPT, Gemini, Opus |
debugging_root_cause_and_failure_triage | Debugging, Root Cause Analysis & Failure Triage | GPT, Gemini, Opus |
knowledge_retrieval_and_standards_guidance | Knowledge Retrieval, Summarization & Standards Guidance | GPT, Gemini, Opus |
agentic_workflow_automation_and_remediation | Agentic Workflow Automation & Automated Remediation | GPT, Gemini, Opus |
ai_driven_exploratory_chaos_and_fuzz_testing | AI-Driven Exploratory, Chaos & Fuzz Testing | Opus only |
Infrastructure & Ops 10 themes
| Code | Theme | Models |
|---|---|---|
intelligent_monitoring_alerting_anomaly_detection | Intelligent Monitoring, Alerting & Anomaly Detection | GPT, Gemini, Opus |
incident_response_rca_mitigation_self_heal | Incident Response Automation (Triage, RCA, Mitigation, Self-Heal) | GPT, Gemini, Opus |
cicd_pipeline_and_deployment_automation | CI/CD Pipeline & Deployment Automation | GPT, Gemini, Opus |
infrastructure_provisioning_and_iac_generation | Automated Environment Setup & IaC Generation | GPT, Gemini, Opus |
infrastructure_maintenance_upgrades_security_cost_optimization | Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization | GPT, Gemini, Opus |
customer_support_triage_and_autoresponse | Customer Support Triage & Auto-Response | GPT, Gemini, Opus |
knowledge_management_doc_search_and_system_context | Knowledge Management, Documentation Search & System Context | GPT, Gemini, Opus |
ops_toil_automation_and_script_generation | Ops Toil Automation & Script Writing/Debugging | GPT, Gemini, Opus |
testing_quality_validation_and_safe_deploy | Testing, Quality Validation & Safer Releases | GPT, Gemini, Opus |
ai_tooling_ux_accuracy_and_cohesive_workflows | Better AI Tooling UX (Accuracy, Control & Cohesive Workflows) | GPT, Gemini, Opus |
Meta-Work 9 themes
| Code | Theme | Models |
|---|---|---|
automated_documentation | Automated Documentation Generation & Maintenance | GPT, Gemini, Opus |
knowledge_search_and_discovery | Project Knowledge Search & Discovery (with Traceable Sources) | GPT, Gemini, Opus |
brainstorming_and_solution_exploration | Brainstorming, Option Generation & Rapid Exploration | GPT, Gemini, Opus |
personalized_learning_and_upskilling | Personalized Learning for New Technologies | GPT, Gemini, Opus |
team_onboarding_and_mentoring | Team Onboarding, Mentoring & Institutional Knowledge Transfer | GPT, Gemini, Opus |
stakeholder_communication_support | Stakeholder/Client Communication Drafting & Translation | GPT, Gemini, Opus |
meeting_assistance | Meeting Scheduling, Notes, Summaries & Action Items | GPT, Gemini, Opus |
planning_prioritization_and_status_tracking | Planning, Prioritization, Blocker Detection & Status Reporting | GPT, Gemini, Opus |
proactive_personal_agent_and_admin_automation | Proactive Personal Agent & Routine Admin Automation | GPT, Gemini, Opus |
Constraint Codebook (All 5 Categories, 50 Themes)
Unified constraint codebook produced by GPT-5.2 reconciliation. Captures what developers do not want AI to handle.
Development 10 themes
| Code | Theme | Models |
|---|---|---|
no_autonomous_architecture_system_design | No Autonomous Architecture or System Design Decisions | GPT, Gemini, Opus |
no_large_unscoped_refactors | No Large, Unscoped, or Sweeping Codebase Changes | GPT, Gemini, Opus |
no_autonomous_execution_merge_deploy_or_agentic_control | No Autonomous Execution, Merging/Deploying, or Agentic Control | GPT, Gemini, Opus |
no_complex_debugging_or_critical_bug_fixes | No AI Ownership of Complex Debugging or Critical Bug Fixes | GPT, Gemini, Opus |
no_security_privacy_secrets_handling | No Security/Privacy-Sensitive Work or Secrets Handling | GPT, Gemini, Opus |
no_autonomous_performance_optimization | No Autonomous Performance Optimization | GPT, Gemini, Opus |
no_ai_deciding_requirements_business_logic_or_api_ux | No AI-Led Requirements, Core Business Logic, or API/UX Decisions | GPT, Gemini |
preserve_developer_agency_learning_and_job_ownership | Preserve Developer Agency, Learning, and Ownership | GPT, Gemini, Opus |
avoid_ai_when_unreliable_contextless_hard_to_verify_or_intrusive | Avoid AI Output That Is Unreliable, Contextless, Hard to Verify, or Intrusive | GPT, Gemini, Opus |
no_constraints_open_to_ai_help | No Specific No-Go Zones (Open to AI Help) | GPT, Gemini |
Design & Planning 10 themes
| Code | Theme | Models |
|---|---|---|
human_accountability_final_decisions | No AI Final Decision-Making (Human Accountability Required) | GPT, Gemini, Opus |
human_led_architecture_design | No AI as Primary System Architect / High-Level Designer | GPT, Gemini, Opus |
no_ai_project_management_task_assignment | No AI Running Project Management | GPT, Gemini, Opus |
no_ai_requirements_stakeholder_elicitation | No AI-Led Requirements Gathering or Stakeholder Alignment | GPT, Gemini, Opus |
no_ai_empathy_team_dynamics | No Replacement of Human Empathy, Collaboration, or Interpersonal Dynamics | GPT, Gemini, Opus |
ai_assistant_human_in_loop | No Autopilot: AI Should Assist with Human-in-the-Loop Oversight | GPT, Gemini, Opus |
trust_accuracy_and_context_limitations | Avoid AI for High-Stakes Work Due to Reliability & Missing Context | GPT, Gemini, Opus |
privacy_confidentiality_ip_and_message_control | No AI Handling Sensitive/Confidential Data or Uncontrolled Messaging | GPT, Gemini, Opus |
no_ai_vision_strategy_creativity_taste | No AI Owning Product Vision, Strategy, or Creative Judgments | GPT, Gemini |
no_constraints_or_unsure | No Constraints Stated / Welcome Full AI Involvement | GPT, Gemini, Opus |
Quality & Risk 10 themes
| Code | Theme | Models |
|---|---|---|
human_final_decision_and_accountability | Humans Must Make Final High-Stakes Decisions | GPT, Gemini, Opus |
no_autonomous_code_or_production_actions | No Autonomous Code/Repo/Production Actions Without Approval | GPT, Gemini, Opus |
human_code_review_gate_required | Human Code Review / PR Approval Must Remain the Gate | GPT, Gemini, Opus |
security_and_compliance_must_be_human_led | Security, Compliance, and Threat Modeling Must Be Human-Led | GPT, Gemini, Opus |
no_sensitive_data_or_credentials_access | Do Not Give AI Access to Sensitive/Customer Data or Credentials | GPT, Gemini, Opus |
ai_outputs_must_be_verifiable_and_not_self_validated | AI Must Be Reliable, Verifiable, and Not Self-Validated | GPT, Gemini, Opus |
humans_own_requirements_architecture_and_tradeoffs | Humans Must Own Requirements, Architecture, and Trade-Offs | GPT, Gemini, Opus |
human_led_test_strategy_intent_and_signoff | Test Strategy and Sign-Off Must Be Human-Led | GPT only |
preserve_human_ethics_empathy_and_human_centric_work | Preserve Human Ethics, Empathy, and Human-Centric Work | GPT, Gemini |
no_constraints_stated | No Specific No-Go Areas Stated | GPT, Opus |
Infrastructure & Ops 10 themes
| Code | Theme | Models |
|---|---|---|
no_direct_customer_interaction | No Direct AI-to-Customer Interaction | GPT, Gemini, Opus |
no_autonomous_production_changes | No Autonomous Production Deployments or Changes | GPT, Gemini, Opus |
human_approval_before_consequential_actions | Human Approval Required Before Consequential Actions | GPT, Opus |
no_security_permissions_secrets_management | No AI Management of Security, Access, Permissions, or Secrets | GPT, Gemini, Opus |
no_autonomous_incident_response_or_overrides | No Autonomous Incident Response or Critical Overrides | GPT, Gemini, Opus |
avoid_ai_for_high_precision_deterministic_work | Avoid AI for High-Precision/Deterministic Work | GPT, Gemini, Opus |
no_full_autonomy_for_environment_setup_maintenance | No Full Autonomy for Environment Setup and Maintenance | GPT, Gemini |
preserve_human_learning_and_accountability | Preserve Human Learning, System Understanding, and Accountability | GPT, Gemini, Opus |
no_ai_initiated_irreversible_or_destructive_data_actions | No AI-Initiated Irreversible/Destructive Data Operations | GPT, Gemini, Opus |
no_constraints_expressed_or_pro_automation | No Constraints Expressed / Comfortable with Broad Automation | GPT, Gemini, Opus |
Meta-Work 10 themes
| Code | Theme | Models |
|---|---|---|
human_led_mentoring_onboarding | Keep mentoring and onboarding human-led | GPT, Gemini, Opus |
human_authored_communication | Keep interpersonal communications human-authored | GPT, Gemini, Opus |
human_review_required_before_sending_or_publishing | No autonomous sending/publishing without human review | GPT, Gemini, Opus |
no_confidential_or_sensitive_data | Keep AI away from confidential or sensitive information | GPT, Gemini, Opus |
preserve_hands_on_learning | Don't outsource learning and skills development to AI | GPT, Gemini, Opus |
preserve_human_research_and_ideation | Keep research/brainstorming primarily human | GPT, Gemini, Opus |
human_accountability_for_high_stakes_decisions | High-stakes decisions must remain human-led | GPT, Gemini, Opus |
avoid_unvetted_documentation | AI-generated documentation must be vetted | GPT, Gemini, Opus |
ai_outputs_not_trustworthy_as_primary_source | Don't treat AI output as trustworthy/authoritative | Opus only |
no_constraints_or_unsure | No constraints stated / unsure | GPT, Gemini, Opus |
Coding Prompt: Opportunity Track (Phase 4)
You are a qualitative research coder. Your task is to systematically code
each "WANT" response using ONLY the themes from the provided codebook.
Each response is shown alongside the same respondent's answer to a related
question about what they do NOT want AI to handle, for additional context.
CODEBOOK:
{codebook themes listed here}
ISSUE CODES (assign when a response has data quality issues):
- ISSUE_WRONG_FIELD: The respondent appears to have answered the other question
- ISSUE_BACK_REFERENCE: Response references a prior answer and is unintelligible
on its own
- ISSUE_NON_RESPONSE: Terse non-answer with no analyzable content
- You may create other ISSUE_* codes if you encounter a different type of data
quality problem
INSTRUCTIONS:
1. Read each response carefully
2. For each response, write a brief rationale
3. Then assign ALL applicable theme codes from the codebook
4. A response can have 0, 1, or multiple themes
5. Only use codes from the codebook or ISSUE codes
6. If no themes apply, return an empty array
RESPONSES TO CODE:
[Batches of 20 responses, each with context from opposite question]
OUTPUT FORMAT:
[
{"pid": 8, "rationale": "...", "themes": ["theme_code_1", "theme_code_2"]},
{"pid": 11, "rationale": "...", "themes": ["ISSUE_BACK_REFERENCE"]}
]
Return ONLY the JSON array, no other text.
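A downstream parser would typically enforce rule 5 (codebook-only) when ingesting this output; a minimal sketch, with `parse_codings` as a hypothetical helper:

```python
import json

def parse_codings(raw: str, codebook_codes: set) -> list:
    """Parse one batch's JSON output and enforce the codebook-only rule:
    every assigned code must be a codebook code or an ISSUE_* code."""
    codings = json.loads(raw)
    for entry in codings:
        for code in entry["themes"]:
            if code not in codebook_codes and not code.startswith("ISSUE_"):
                raise ValueError(f"Code drift for PID {entry['pid']}: {code}")
    return codings
```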
Coding Prompt: Constraint Track
Identical structure to the opportunity track prompt, but with:
- Constraint codebook themes replacing opportunity themes
- "NOT WANT" responses as the primary coding target
- "WANT" responses shown as cross-response context
- Same ISSUE code system applies
Theme Discovery Prompt Template
You are analyzing open-ended survey responses from software developers about
where they want AI assistance in their work. Your task is to identify themes
in these responses.
Guidelines for theme creation:
- Themes should be SPECIFIC and ACTIONABLE
(e.g., "Automated test generation for edge cases" not just "Testing")
- Themes should be PROBLEM-FOCUSED (describe the pain point, not a solution)
- A response can belong to MULTIPLE themes
- Aim for 4-15 themes that capture the major patterns
For each response, return:
{
"pid": <participant ID>,
"themes": ["theme_code_1", "theme_code_2", ...]
}
Also provide a theme codebook:
{
"themes": [
{
"code": "snake_case_theme_code",
"name": "Human-Readable Theme Name",
"description": "What this capability means and why developers want it",
"pids": [list of PIDs expressing this]
}
],
"codings": [
{"pid": 123, "themes": ["theme_code_1", "theme_code_2"]}
]
}
RESPONSES TO ANALYZE:
[All responses for the category]
Theme Reconciliation Prompt Template
You are a qualitative research analyst performing theme reconciliation.
CONTEXT: Three independent AI models analyzed survey responses from the
"[category]" category. Each identified opportunity themes. Your job is to
reconcile these into a unified codebook.
--- GPT-5.2 THEMES ---
[All GPT themes with names, descriptions, PIDs]
--- GEMINI THEMES ---
[All Gemini themes]
--- OPUS THEMES ---
[All Opus themes]
TASK: Create a unified codebook by:
1. Identifying themes that overlap across models (same concept, different names)
2. Merging overlapping themes into single unified themes
3. Keeping single-model themes IF substantive (≥3 PIDs)
4. Dropping themes that are too vague or have very few supporting responses
5. Aim for 5-10 unified themes per category
For each unified theme, provide:
- code: snake_case identifier
- name: Human-readable name
- description: Clear description of the desired capability
- source_models: which models identified it (["gpt", "gemini", "opus"])
- source_codes: the original codes from each model
Return ONLY valid JSON, no other text.
ISSUE Code Taxonomy
| Code | Definition | Detection Signal | Consensus Rule |
|---|---|---|---|
| ISSUE_WRONG_FIELD | Respondent answered the opposite question (e.g., wrote constraints in the "want" field) | Cross-response context reveals contradictory intent | 2+ models flag any ISSUE_* → generic ISSUE marker applied |
| ISSUE_BACK_REFERENCE | Response references a prior answer ("same as before", "see above") and is unintelligible alone | Short response with deictic language | Same 2/3 majority rule |
| ISSUE_NON_RESPONSE | Terse reply with no analyzable content | "N/A", "none", "no", single punctuation | Same 2/3 majority rule |
| ISSUE_* (custom) | Models may create additional issue codes for novel quality problems | Varies | Same 2/3 majority rule; prefix matching ensures grouping |
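The 2-of-3 consensus rule with ISSUE_* prefix matching can be sketched in a few lines. This is an illustrative sketch, not the pipeline's actual code; the function name `issue_consensus` and the input shape are assumptions.

```python
def issue_consensus(codings_by_model: dict[str, list[str]]) -> bool:
    """Return True when 2+ models assign any ISSUE_* code to a response.

    Prefix matching groups custom codes (ISSUE_WRONG_FIELD,
    ISSUE_BACK_REFERENCE, model-invented ISSUE_* variants) under
    one generic ISSUE marker, per the consensus rule above.
    """
    models_flagging = sum(
        1
        for themes in codings_by_model.values()
        if any(t.startswith("ISSUE_") for t in themes)
    )
    return models_flagging >= 2


# Two models flag different ISSUE_* codes; consensus still holds:
codings = {
    "gpt": ["ISSUE_NON_RESPONSE"],
    "gemini": ["context_retrieval"],
    "opus": ["ISSUE_BACK_REFERENCE"],
}
assert issue_consensus(codings)  # 2 of 3 models flagged an ISSUE_* code
```

Because matching is by prefix, a novel model-created code such as `ISSUE_DUPLICATE_TEXT` would count toward the same generic marker without any taxonomy update.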
Output JSON Schemas
Theme Discovery Output
{
"model": "gpt-5.2",
"category": "design_planning",
"category_name": "Design & Planning",
"response_count": 223,
"timestamp": "ISO-8601",
"themes": [
{
"code": "string",
"name": "string",
"description": "string",
"pids": [integer]
}
],
"codings": [
{ "pid": integer, "themes": ["string"] }
]
}
Consolidated Codebook
{
"metadata": {
"phase": "Opportunity Theme Reconciliation",
"timestamp": "ISO-8601",
"reconciliation_model": "gpt-5.2",
"discovery_models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
},
"categories": {
"category_key": {
"category": "string",
"category_name": "string",
"theme_count": integer,
"models_reconciled": ["gpt", "gemini", "opus"],
"themes": [
{
"code": "string",
"name": "string",
"description": "string",
"source_models": ["gpt", "gemini", "opus"],
"source_codes": {
"gpt": ["string"],
"gemini": ["string"],
"opus": ["string"]
}
}
]
}
}
}
Systematic Codings (Phase 4)
{
"category": "string",
"phase": "Phase 4 - Systematic Coding",
"timestamp": "ISO-8601",
"models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
"codebook": [ { "code": "string", "name": "string", "description": "string" } ],
"response_count": integer,
"codings": {
"gpt": [ { "pid": integer, "rationale": "string", "themes": ["string"] } ],
"gemini": [ ... ],
"opus": [ ... ]
},
"cost": { ... }
}
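Phase 4 codings feed Phase 5 reliability by being pivoted into one binary matrix per theme (row per model, column per PID). A minimal sketch, assuming the `codings` shape above; the function name `binary_matrix` is illustrative:

```python
def binary_matrix(codings: dict, theme: str, pids: list[int]) -> list[list[int]]:
    """One row per model, one column per PID; 1 if that model applied `theme`."""
    rows = []
    for model in ("gpt", "gemini", "opus"):
        assigned = {c["pid"] for c in codings[model] if theme in c["themes"]}
        rows.append([1 if pid in assigned else 0 for pid in pids])
    return rows


codings = {
    "gpt":    [{"pid": 1, "themes": ["arch_ideation"]}, {"pid": 2, "themes": []}],
    "gemini": [{"pid": 1, "themes": ["arch_ideation"]}, {"pid": 2, "themes": []}],
    "opus":   [{"pid": 1, "themes": []},                {"pid": 2, "themes": []}],
}
matrix = binary_matrix(codings, "arch_ideation", [1, 2])
# matrix == [[1, 0], [1, 0], [0, 0]]
```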
IRR Results (Phase 5)
{
"phase": "Phase 5 - Inter-Rater Reliability",
"methodology": {
"raters": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
"metrics": ["Krippendorff's Alpha", "Cohen's Kappa (pairwise)", "Percent Agreement"]
},
"overall_statistics": {
"mean_krippendorff_alpha": float,
"mean_percent_agreement": float,
"interpretation": "string"
},
"category_results": {
"category_key": {
"krippendorff_alpha": { "theme_code": float },
"percent_agreement": { "theme_code": float },
"pairwise_kappa": {
"gpt_vs_gemini": { "theme_code": float },
"gpt_vs_opus": { "theme_code": float },
"gemini_vs_opus": { "theme_code": float }
},
"code_frequencies": { "gpt": {}, "gemini": {}, "opus": {} }
}
}
}
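The pairwise κ values in the schema above come from Cohen's kappa over each model pair's binary codings for a theme. A hand-rolled sketch of the two-rater binary case (the real pipeline may use a library implementation instead):

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary raters; a sketch for the 0/1 theme codings."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n            # each rater's P(code = 1)
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)       # chance agreement
    if pe == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (po - pe) / (1 - pe)


assert cohens_kappa([1, 0, 1, 0], [1, 0, 1, 0]) == 1.0   # perfect agreement
assert cohens_kappa([1, 0, 1, 0], [1, 0, 0, 0]) == 0.5   # one disagreement
```

Note the chance-correction: two raters who each code 50% of responses positive would hit 50% agreement by luck alone, which is why percent agreement alone overstates reliability.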
Prevalence Results (Phase 6)
{
"methodology": {
"consensus_method": "majority_vote",
"threshold": "2+ of 3 models must agree"
},
"category_results": {
"category_key": {
"theme_prevalence": [
{
"code": "string",
"count": integer,
"percentage": float,
"pids": [integer]
}
],
"consensus_codings": { "pid": ["theme1", "theme2"] }
}
}
}
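The majority-vote consensus and prevalence numbers above can be computed with a simple tally. This sketch simplifies the per-model codings to `pid -> themes` dicts; the function name `consensus_prevalence` is an assumption, not pipeline code:

```python
from collections import Counter


def consensus_prevalence(codings_by_model: dict, total: int) -> dict:
    """A PID counts toward a theme when 2+ of the 3 models assigned it."""
    votes = Counter()
    for model_codings in codings_by_model.values():
        for pid, themes in model_codings.items():
            for theme in set(themes):
                votes[(pid, theme)] += 1

    by_theme: dict[str, set] = {}
    for (pid, theme), n in votes.items():
        if n >= 2:  # majority of 3 models
            by_theme.setdefault(theme, set()).add(pid)

    return {
        theme: {
            "count": len(pids),
            "percentage": round(100 * len(pids) / total, 1),
            "pids": sorted(pids),
        }
        for theme, pids in by_theme.items()
    }


codings = {
    "gpt":    {101: ["planning"], 102: ["planning", "docs"]},
    "gemini": {101: ["planning"], 102: ["docs"]},
    "opus":   {101: [],           102: ["docs"]},
}
result = consensus_prevalence(codings, total=2)
# "planning" reaches 2/3 consensus only for PID 101; "docs" only for PID 102
```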
Rich Opportunity Card
{
"rank": integer,
"theme_code": "string",
"category": "string",
"title": "string",
"problem_statement": "string",
"proposed_capability": {
"summary": "string",
"context_sources_needed": ["string"],
"capability_steps": ["string"]
},
"impact": {
"description": "string",
"evidence_quotes": [ { "pid": integer, "quote": "string" } ]
},
"success_definition": {
"qualitative_measures": ["string"],
"quantitative_measures": ["string"]
},
"constraints_and_guardrails": [
{
"constraint": "string",
"supporting_quote": { "pid": integer, "quote": "string" }
}
],
"who_it_affects": {
"prevalence_count": integer,
"prevalence_percentage": float,
"description": "string",
"signals": ["string"]
},
"models_consulted": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
}
IRR Interpretation Guide
Why Krippendorff's Alpha?
- Designed for multi-rater reliability (3+ raters)
- Handles missing data gracefully (if one model fails on a batch)
- Supports nominal-level measurement (categorical theme codes)
- Does not assume a fixed rater set
- More conservative than simple percent agreement, adjusting for chance
How It's Calculated
For each theme code, a binary matrix is constructed:
import krippendorff  # pip install krippendorff

# Row per model, column per response
#          PID_1 PID_2 PID_3 PID_4 ...
# GPT:    [  1,    0,    1,    0, ...]
# Gemini: [  1,    0,    1,    0, ...]
# Opus:   [  1,    0,    0,    0, ...]
alpha = krippendorff.alpha(
    reliability_data=matrix,
    level_of_measurement="nominal"
)
Reporting
- Per-theme α values identify which themes models agree/disagree on
- Themes with α < 0.67 are flagged for potential human adjudication
- Overall mean α provides a summary reliability score
- Pairwise κ identifies whether specific model pairs diverge
- Code frequency counts reveal systematic over/under-coding by individual models
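The α < 0.67 adjudication flag in the reporting step reduces to a simple filter. A minimal sketch; the helper name `flag_for_adjudication` is illustrative:

```python
def flag_for_adjudication(alphas: dict[str, float], threshold: float = 0.67) -> list[str]:
    """Return theme codes whose per-theme alpha falls below the threshold."""
    return sorted(code for code, a in alphas.items() if a < threshold)


flagged = flag_for_adjudication({"project_planning": 0.94, "vague_theme": 0.41})
assert flagged == ["vague_theme"]  # only the low-reliability theme is flagged
```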
What IRR Tells Us (and Doesn't)
High α means the three models consistently apply the same theme to the same responses—the codebook is operationally clear and the models "understand" it similarly. Low α on a specific theme may indicate the theme definition is ambiguous, the theme requires human judgment the models handle differently, or the theme captures a rare pattern where base-rate effects inflate disagreement.
IRR does not tell us whether the codes are correct—only that the coders agree. This is why the human review gate exists: to ensure the codebook itself captures meaningful, well-defined themes before reliability is measured.
Model Configuration & Cost Tracking
Model Parameters
| Model | Thinking Config | Temperature | Streaming |
|---|---|---|---|
| GPT-5.2 | reasoning_effort="high" | 1 | No |
| Gemini 3.1 Pro | ThinkingConfig(thinking_level="HIGH") | Default | No |
| Claude Opus 4.6 | thinking: adaptive, effort: high | Default | Yes (timeout avoidance) |
Token Pricing (per 1M tokens)
| Model | Input | Output | Notes |
|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | Thinking tokens billed at output rate |
| Gemini 3.1 Pro | $2.00 | $12.00 | Thinking tokens billed at output rate |
| Claude Opus 4.6 | $5.00 | $25.00 | Thinking tokens billed at output rate |
The CostTracker class in llm.py tracks input, output, and thinking tokens separately per API call, with phase-level summaries printed to console.
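Per-call cost follows directly from the pricing table, with thinking tokens billed at the output rate. A minimal sketch consistent with that table; the real CostTracker in llm.py tracks more detail:

```python
# (input $/1M tokens, output $/1M tokens); thinking tokens billed at output rate
PRICING_PER_M = {
    "gpt-5.2": (1.75, 14.00),
    "gemini-3.1-pro-preview": (2.00, 12.00),
    "claude-opus-4-6": (5.00, 25.00),
}


def call_cost(model: str, input_tok: int, output_tok: int, thinking_tok: int) -> float:
    """Dollar cost of one API call under the pricing table above."""
    in_rate, out_rate = PRICING_PER_M[model]
    return (input_tok * in_rate + (output_tok + thinking_tok) * out_rate) / 1_000_000


# 100k input + 10k output + 5k thinking tokens on GPT-5.2:
cost = call_cost("gpt-5.2", 100_000, 10_000, 5_000)
# 100_000 * 1.75 + 15_000 * 14.00 = 385_000 micro-dollars → $0.385
```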
Design & Planning
223 responses | 7 themes
Design & Planning Codebook
Themes identified from "What do you want AI to handle?" responses.
Project Planning, Ticket/Task Breakdown & Status Automation
Architecture Ideation & Interactive Design Copilot
Design Review, Risk Assessment & Trade-off Decision Support
Context Retrieval & Knowledge Synthesis (Internal + External)
Documentation, Specs & Diagram/Artifact Generation
Requirements Gathering, Synthesis & Clarification
Trustworthy Outputs: Higher Accuracy & Verifiable Citations
Project Description
Turn an approved brief into a first-pass execution package: hierarchical work items, a dependency map, estimate ranges calibrated to the team’s own delivery history, and a draft sprint or milestone sequence. Once the team edits and accepts that draft, the assistant can sync the approved deltas into the tracker and assemble recurring status digests directly from tracker activity, CI signals, and open blockers.
- Accepted feature briefs, one-pagers, and design/spec documents
- Current backlog, sprint history, and cross-team links from the work tracking system
- Historical cycle time, throughput, and estimate-to-actual data for the team
- Repository/service ownership boundaries and dependency metadata
- CI/CD and incident signals that indicate progress, regressions, and release readiness
- Start from user-selected accepted scope rather than open-ended prompting, so the plan is anchored to approved work and can explicitly mark assumptions or out-of-scope items.
- Parse the design or spec and generate a work breakdown with task titles, descriptions, and acceptance checks tied to the team’s planning template.
- Infer dependencies, likely blockers, and estimate ranges using linked work items, service boundaries, and comparable historical tasks from the same team.
- Draft milestone or sprint sequencing from estimates, capacity, and calendar constraints, surfacing trade-offs among scope, date, and staffing instead of silently choosing priorities.
- Show tracker diffs before any write: proposed backlog items, hierarchy, links, and estimate fields, while preserving existing manual edits as the source of truth.
- From approved tracker activity, delivery signals, and unresolved blockers, compose periodic status summaries, stale-item alerts, and re-planning suggestions for the lead to edit and send.
Who It Affects
68 of 223 respondents (30.5%) were coded to this theme with strong inter-rater reliability (α = 0.94). These are developers, tech leads, and engineering managers who regularly translate accepted designs and specs into executable plans, maintain backlogs across sprint cycles, and produce recurring status updates; 67.7% want High or Very High AI support for this work.
- 67.7% of respondents in this theme want High or Very High AI support for project planning tasks
- Average AI Preference: 3.94/5
- Average AI Usage: 2.57/5
- Preference-Usage Gap: 1.38
- IRR agreement α = 0.94, confirming strong coder consensus on theme boundaries
Impact
If this assistant works, an approved design or feature brief becomes a usable draft execution plan instead of a manual sequence of backlog grooming, estimation, and tracker maintenance. Routine status communication is generated from work and delivery signals rather than copied by hand into emails or meeting artifacts, while blockers and cross-team dependencies are surfaced earlier. The main outcome is reclaiming engineering time for strategy and implementation, not replacing human project leadership.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Teams report that the generated work-item breakdowns are 'close enough' to use as starting points (requiring only minor edits, not rewrites)
- Developers report that sprint planning meetings shifted from data-entry exercises to strategic discussions about priority and scope
- Tech leads say they no longer manually compile status reports or copy-paste from work trackers into emails and slide decks
- Teams report improved coordination because cross-team dependencies are surfaced early with actionable next steps, without the tool autonomously contacting people
- Developers express that they retain full control and agency — the tool proposes, they decide
Quantitative Measures
- 50% reduction in time spent manually creating work items from specs (measured via before/after time-tracking studies)
- 80% of generated work-item trees accepted with fewer than 25% of items requiring substantive edits (title, scope, or acceptance criteria changes)
- Effort estimate accuracy within 30% of actuals for 70%+ of tasks after 3 sprints of team-specific calibration
- Reduce weekly time spent on status reporting/admin by 30%+ (self-reported + calendar/usage telemetry opt-in)
- 30% reduction in the number of recurring status meetings per cross-team project, replaced by automated digest consumption
Theme Evidence
Turn requirements or designs into executable plans: break down work into epics, stories, or tasks; generate Jira/ADO items; estimate effort; plan milestones and sprints; track dependencies; and automate recurring project admin such as status updates, progress summaries, and coordination. This theme...
Project Description
An interactive architecture studio that keeps an explicit design state—goals, constraints, assumptions, open questions, and option history—while pulling in local precedents from ADRs, service topologies, and analogous code. It generates several materially different architectures, lets the team perturb constraints and compare diffs, and can export the chosen path as a draft decision record once the humans settle on a direction.
- Requirement docs, one-pagers, and user stories for the feature under discussion
- Prior ADRs, design docs, and architecture diagrams for related systems
- Service and repository catalogs with dependency and integration maps
- Deployment manifests and infrastructure topology for the existing system
- Organization-specific standards, approved components, and platform guardrails
- Analogous implementations in the team’s own repositories
- Seed a structured design state from the initial brief, explicitly separating supplied requirements from assumptions and unknowns.
- Ask targeted questions only where the missing information affects architectural shape—scale, data sensitivity, failure model, ownership, or integration boundaries—and record the answers in the design state.
- Retrieve local precedent from ADRs, service catalogs, repository structure, and infrastructure manifests so the option space reflects what the organization already runs, not just public pattern libraries.
- Generate 3–5 distinct architecture candidates with components, data flow, deployment shape, assumptions, pros and cons, and named precedents or patterns for each.
- Let the team change a constraint, pin a decision driver, or branch the conversation, then regenerate only the affected parts and show what changed in each option.
- Preserve session history so the team can compare forks over several turns instead of losing context to one-shot prompting.
- When a direction is chosen, export an ADR-style draft that captures context, alternatives considered, rationale, consequences, and unresolved questions.
Who It Affects
62 of 223 developers (27.8%) in the design-planning category explicitly asked for AI assistance with architecture ideation and interactive design collaboration, making it the second-most prevalent theme. Demand is strong but unmet: 67.7% want High or Very High support, average preference is 3.94/5, current usage is 2.57/5, and the preference-usage gap is 1.38.
- 62 of 223 design-planning respondents (27.8%) explicitly mentioned architecture ideation and interactive design collaboration.
- 67.7% of respondents in this theme want High or Very High AI support.
- Average AI preference is 3.94/5 while average current usage is 2.57/5.
- Preference-usage gap is 1.38, indicating substantial unmet demand in this workflow.
Impact
Instead of starting from a blank page or repeatedly re-prompting a general assistant, developers could quickly compare several context-grounded architectures before formal design review. This should reduce premature commitment to one familiar pattern, surface trade-offs and edge cases earlier, and shorten the path from requirements to a viable design direction.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that the tool surfaces architecture options they had not previously considered, rather than echoing back their own ideas.
- Developers describe the interaction as a productive back-and-forth conversation rather than a one-shot prompt-response pattern.
- Users report options are specific to their constraints and existing ecosystem, not generic templates.
- Developers trust recommendations because the tool cites internal precedent or named patterns and explicitly flags uncertainty.
- Developers feel they remain accountable and in control of design direction.
Quantitative Measures
- Reduce average time from requirements-to-first-viable-architecture-proposal by 40% (measured via session timestamps).
- Median time-to-3-viable-architecture-candidates (from session start) reduced by 50% versus baseline manual process (measured via user study).
- Increase the average number of distinct architecture alternatives considered per design decision from 1-2 to 3-4 (measured via session logs).
- Decrease average number of user turns needed to reach a stable candidate architecture by 30% (conversation analytics).
- Achieve >80% of generated recommendations traceable to a cited source (internal precedent or named industry pattern) rather than unsourced assertions.
Theme Evidence
Generate and iteratively refine system architectures and technical designs from stated requirements, constraints, and existing ecosystem context, recommending patterns, components, infrastructure choices, and integration approaches. Operate as a conversational partner that asks clarifying...
Project Description
Analyze a proposed design the way a strong reviewer would: reconstruct the component model from docs and diagrams, check requirement and NFR coverage, search for hidden dependencies, run what-if scenarios, and turn the result into a trade-off matrix plus a draft decision record. Every critique should point back to its source—diagram elements, requirements text, catalog entries, incident history, or explicit assumptions.
- Design docs, architecture diagrams, and competing options under review
- Requirements, acceptance criteria, and non-functional checklists for the project
- Internal service catalogs, interface definitions, and ownership metadata
- Historical incidents, outages, and prior review artifacts for similar components
- Cost, capacity, and load assumptions relevant to the design
- Ingest the review package and parse it into components, data flows, integrations, trust boundaries, explicit decisions, and open questions.
- Map the parsed design to stated functional and non-functional requirements, marking each item as covered, partially covered, unaddressed, or unclear with supporting snippets.
- Cross-check the design against internal catalogs and interface definitions to expose hidden dependencies, undocumented integrations, ownership gaps, and likely cross-team blockers.
- Generate design-specific failure modes and parameterized what-if scenarios so reviewers can see how a change in assumptions affects complexity, performance, cost, recovery targets, or delivery risk.
- When multiple options are supplied, compare them across user-chosen criteria and explain each cell in the matrix instead of collapsing the output to a single winner.
- Draft a decision record that captures options considered, assumptions, risk concentrations, downstream consequences, and unresolved questions for the review meeting.
Who It Affects
46 of 223 respondents (20.6%) were coded to this theme with very high inter-rater reliability (α = 0.944). These are engineers involved in preparing or reviewing designs, comparing options, and reasoning about risks, assumptions, dependencies, and downstream impacts. Demand is strong but underserved: 67.7% want high or very high AI support for this activity, while current usage remains well below preference.
- 46 of 223 respondents (20.6%) were coded to this theme
- Inter-rater reliability α = 0.944 across 3 independent coders
- 67.7% of developers want High/Very High AI support for this area
- Average AI preference score of 3.94/5 vs. average usage of 2.57/5 — a 1.38-point gap indicating significant unmet need
Impact
Teams would enter design reviews with a structured analysis rather than starting from a blank debate: a requirement and non-functional coverage check, a prioritized risk register, a trade-off matrix for competing options, and a decision impact trail showing how one choice propagates through the architecture. This should surface blockers and assumption failures earlier, shorten circular review cycles, and leave reusable decision records that make later reviews and design changes easier to justify.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that the tool catches gaps, edge cases, or dependency risks they had not previously considered in their designs.
- Design review meetings become shorter and more focused because participants arrive with pre-analyzed trade-off matrices rather than debating from scratch.
- Users trust outputs because assumptions, evidence, and confidence are explicit, and the tool is seen as an assistant rather than a decision-maker.
- Teams attach generated trade-off analyses and draft decision records to formal design review workflows.
- Teams report reduced frequency of late-stage architecture pivots caused by risks that should have been caught during planning.
Quantitative Measures
- Reduce median time from first design draft to design approval by 20–30% (measured via timestamps on review artifacts)
- Increase the number of risks and gaps identified during the planning phase by at least 40% compared to manual-only reviews
- Reduce post-implementation architectural rework incidents (tracked via work items tagged as 'design change' or 'architecture pivot') by 25% within the first year
- Close the preference-usage gap from 1.38 to below 0.5 within 12 months of launch, indicating that developers who want this capability are actually using it
- At least 80% of generated trade-off analyses rated as 'useful' or 'very useful' by the reviewing engineer in post-review feedback
Theme Evidence
Critically evaluate proposed or competing designs to validate feasibility and alignment with requirements and non-functional needs (security, compliance, reliability, scalability, operability). Surface gaps, hidden dependencies, and edge cases early, and assess their impact on cost, complexity,...
Project Description
Create a project memory layer that ingests code history, docs, tickets, notes, and opted-in conversations as dated facts and decisions rather than as isolated text chunks. The interface should answer questions like 'why was this done?', 'what changed after the RPO target moved?', or 'which later components depend on that decision?' by showing a timeline, the decision lineage, and the exact source passages behind each claim.
- Source repositories, commit history, and pull-request discussions
- Design docs, specs, ADRs, and architecture diagrams
- Work items, comments, and linked approval records
- Meeting notes plus opted-in chat or email threads
- API/interface definitions and service or component catalogs
- Incident records and postmortems tied to architectural components
- Let teams define project boundaries and access policies across repositories, documents, work items, and conversation sources before ingestion.
- Normalize those artifacts into time-stamped records with stable deep links, document authority markers, and freshness metadata.
- Extract components, requirements, decisions, alternatives, owners, and rationale, while building a glossary of project-specific terminology and aliases.
- Link parent and child decisions over time so users can trace how an upstream call—such as a reliability target or integration choice—propagated into later implementation and operational decisions.
- Answer questions with structured sections such as current facts, decisions made, reasons cited, open conflicts, and missing evidence, rather than a single flattened paragraph.
- Expose disagreements, stale context, and inaccessible sources instead of collapsing them into one confident answer.
- Generate reusable briefings for onboarding, design reviews, or incident retrospectives that remain drillable to exact snippets, timestamps, code locations, and prior discussions.
Who It Affects
44 of 223 respondents (19.7%) were coded to this theme with high inter-rater reliability (α = 0.897). Those affected range from engineers joining unfamiliar codebases to senior engineers trying to preserve institutional memory as ownership changes. The theme cuts across design planning, architecture, requirements, and onboarding because context retrieval is an upstream bottleneck for all of them.
- 67.7% of respondents in this theme want High or Very High AI support for context retrieval tasks
- Average AI Preference: 3.94/5
- Average AI Usage: 2.57/5, creating a 1.38-point preference–usage gap
Impact
If this capability exists, developers can reconstruct system context and design history from one place instead of manually searching across fragmented artifacts. The main gains are faster onboarding to unfamiliar codebases, better continuity when owners leave, and fewer repeated investigations or re-debates because prior rationale is retrievable and cited. The result is not autonomous planning, but a trustworthy project memory that helps humans make better-informed design decisions.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that onboarding onto unfamiliar codebases takes significantly less time because the tool provides accurate, cited explanations of system structure and past decisions
- Teams report fewer instances of re-debating previously settled design decisions because the rationale and context are readily accessible
- New team members and engineers returning from leave report they can reconstruct project context without scheduling 'brain dump' meetings with colleagues
- Developers trust the tool's outputs because every summary includes verifiable citations and clearly flags information staleness or conflicts
- Design reviews more consistently reference prior decisions and their rationale instead of re-investigating from scratch
Quantitative Measures
- Decrease average time for a developer to answer a 'why was this decision made?' question from >30 minutes of manual searching to <3 minutes via the query interface
- At least 90% of generated summaries include citations for all key claims
- Detect and surface 'stale context' for at least 70% of docs or decision records that reference changed APIs or components, with a false-positive rate under 20%
- Reduce onboarding time for engineers joining a new codebase by at least 30% as measured by time-to-first-meaningful-commit on the new project
- Increase the proportion of design documents that reference prior decisions or decision records by at least 40%
Theme Evidence
Retrieve and synthesize relevant context from large codebases and organizational knowledge (docs, APIs, prior decisions, chats, meetings) alongside external sources (standards, OSS options, best practices, comparable solutions). Consolidate fragmented references into clear, up-to-date summaries...
Project Description
Use an interactive authoring workspace to turn notes, transcripts, rough outlines, and selected code diffs into sectioned design docs, specs, and editable diagrams. Each paragraph, table, and box should carry its provenance so the author can see which meeting note, requirement, or code change it came from and refine only the part that needs work.
- Team document templates and required review sections
- User-selected notes, transcripts, outlines, and dictated explanations
- Linked requirements and prior design documents relevant to the artifact
- Selected code diffs or proof-of-concept branches
- Organization naming, security, and privacy checklists used during design review
- Let the author choose the artifact type and explicitly select which inputs are allowed to shape the draft.
- Map the source material to required sections such as goals, non-goals, requirements, risks, rollout, and open questions, and highlight gaps before drafting.
- Draft the document section by section, labeling each statement as directly supported by a source, inferred from several sources, or waiting on author confirmation.
- Generate diagrams in both rendered form and structured text notation so edits can round-trip instead of forcing redraws.
- Support targeted regeneration: the author can rewrite only the rollout section, only the dependency table, or only one diagram lane without losing manual edits elsewhere.
- When linked requirements or code change later, detect which sections and diagram elements are now stale and suggest patch-style updates.
Who It Affects
34 of 223 respondents (15.2%) were coded to this theme with near-perfect inter-rater reliability (α = 0.965), spanning developers who draft design docs, create architecture diagrams, write specs from meeting context, build proposals, and maintain as-built documentation. They consistently describe artifact production as repetitive work that pulls time and attention away from actual design thinking.
- 67.7% of respondents coded to this theme want High or Very High AI support for documentation and artifact generation
- Average AI Preference of 3.94/5 vs. Average AI Usage of 2.57/5 reveals a 1.38-point gap — the largest unmet demand signal
- Theme achieved highest inter-rater reliability in the category (α = 0.965), confirming clear, unambiguous developer need
Impact
If this capability exists, developers start from a grounded first draft instead of a blank page: rough notes, meeting context, and code changes become structured documents and diagrams aligned to team conventions. This shifts effort away from formatting and manual diagram drawing toward design reasoning and review, while also making as-built documentation easier to keep useful for onboarding and cross-team understanding.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that generated first drafts are shareable with minor edits rather than major rewrites.
- Developers describe the workflow as conversational and steerable, with targeted updates instead of full regeneration.
- Developers confirm that generated diagrams accurately reflect their described architecture and are usable in reviews without manual redrawing.
- Developers report spending more time on design thinking and less time on formatting, rewriting, and manual diagramming.
- Junior engineers report that refreshed as-built documentation reduces the effort required to understand a system.
Quantitative Measures
- Reduce median time-to-first-draft of a design doc or proposal by 50%+ compared with baseline practice.
- Achieve >80% of generated document sections requiring no more than minor edits, measured by edit distance between AI draft and final published version.
- >= 70% of generated diagrams are exported without being fully redrawn, measured by edit distance or replacement events.
- >= 25% of artifacts opt into change-tracked update suggestions, with >= 60% acceptance rate of suggested patches after review.
- Reduce the number of stale-documentation incidents reported per quarter by 40%, measured through doc freshness audits or retrospective tracking.
Theme Evidence
Draft and maintain design documents, specifications, and proposals, and generate supporting artifacts such as templates, slides, UML diagrams, data-flow diagrams, flowcharts, and Visio-style diagrams. Includes converting meeting notes or rough outlines into structured, presentable documentation....
Development
353 responses | 7 themes
Development Codebook
Themes identified from "What do you want AI to handle?" responses.
Code Generation, Refactoring & Modernization Automation
Code Quality, Review Automation, Automated Testing & Security/Compliance Guidance
Debugging, Root Cause Analysis & Bug Fix Assistance
Codebase Context, Knowledge Capture & Safe Cross-File Changes
Architecture, Design Brainstorming & Planning Support
Performance Profiling & Optimization Suggestions
DevOps, CI/CD, IaC & Engineering Workflow Automation
Project Description
Turn a change request—API rename, dependency upgrade, framework migration, or boilerplate expansion—into a sequence of small diffs that match the repository’s naming, file layout, error-handling, and test patterns. The system should stop and surface the plan as soon as the change grows beyond the intended modules, public APIs, or review size budget.
- Source tree structure and analogous implementations already in the codebase
- Dependency manifests, lockfiles, migration guides, and release notes
- Build, lint, format, and test configuration from the repository
- API schemas, interface definitions, and changed call sites
- Recent version-control history for similar migrations or refactors
- Accept the transformation intent together with explicit file, directory, or module boundaries and whether behavior must remain unchanged.
- Inspect analogous implementations and impacted symbols to understand how the repository currently handles naming, layering, error paths, and tests.
- Produce a change plan that estimates files touched, likely API impact, and where the request may spill outside the approved area.
- Generate atomic diffs under a configurable size threshold so a rename across 40 files becomes several digestible patches instead of one opaque rewrite.
- Run formatter, linter, build, static analysis, and tests after each patch and halt with diagnostics when a step fails or confidence drops.
- Assemble the accepted patches into a branch and PR summary that explains each transformation step in plain language.
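The scope-and-size discipline described above can be sketched in a few lines. This is an illustrative Python sketch, not a proposed implementation: `ScopePolicy`, `plan_patches`, and the 200-line default budget are hypothetical names chosen to mirror the quantitative targets stated later in this project.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

@dataclass
class ScopePolicy:
    # Hypothetical policy object: approved directories plus a per-patch size budget.
    allowed_roots: tuple
    max_lines_per_patch: int = 200

    def in_scope(self, path: str) -> bool:
        p = PurePosixPath(path)
        return any(p == PurePosixPath(root) or PurePosixPath(root) in p.parents
                   for root in self.allowed_roots)

def plan_patches(changes: dict, policy: ScopePolicy) -> list:
    """Split per-file line-change counts into patches under the size budget.
    Halts when any file falls outside the approved scope, surfacing the plan
    instead of silently widening it."""
    out_of_scope = sorted(f for f in changes if not policy.in_scope(f))
    if out_of_scope:
        raise ValueError(f"scope violation, stop and surface plan: {out_of_scope}")
    patches, current, size = [], [], 0
    for f, lines in sorted(changes.items()):
        if current and size + lines > policy.max_lines_per_patch:
            patches.append(current)
            current, size = [], 0
        current.append(f)
        size += lines
    if current:
        patches.append(current)
    return patches
```

A rename touching three files at 120, 150, and 60 changed lines would come back as three separate patches under a 200-line budget, each reviewable on its own.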
Who It Affects
177 of 353 respondents (50.1%) were coded to this theme with high inter-rater reliability (α=0.894), making it the single most prevalent developer need in the survey. The 1.14-point gap between preference (4.21/5) and current usage (3.07/5) shows substantial unmet demand for help with repetitive code generation and transformation tasks.
- 177 of 353 respondents (50.1%) mentioned concrete code generation or transformation tasks, the highest prevalence of any theme
- 77.6% of respondents in this theme want High or Very High AI support for development tasks
- Average AI Preference: 4.21/5 versus actual Usage: 3.07/5
- Preference–Usage Gap: 1.14
Impact
If this capability exists, developers can delegate repetitive multi-file code changes and receive a staged sequence of small, reviewable diffs that follow local patterns and pass existing checks before review. That reduces maintenance toil, makes modernization work less disruptive, and frees developers to spend more time on business logic, design, and other higher-cognitive-load work.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers describe generated changesets as small enough to review confidently and easy to stop, correct, or approve step by step.
- Developers report that generated code follows existing repository patterns instead of generic AI style.
- Refactors and upgrades feel boring but easy to delegate rather than risky and time-consuming.
- Developers report spending more time on business logic and design and less on repetitive maintenance work.
- Teams trust the capability for production modernization tasks, not only for disposable or throwaway code.
Quantitative Measures
- ≥50% reduction in manual file edits for common mechanical migrations such as renames, signature changes, and deprecation replacements
- User-configured scope violations: <2% of tasks attempt to touch out-of-scope files, and those attempts are blocked and logged
- 100% of presented changesets pass existing linter and test suites before developer review
- Average changeset size stays below 200 lines changed, with 90th percentile below 400 lines
- Reduce median time-to-draft-PR for refactor and upgrade tasks by ≥30%
- Reduce average PR cycle time for refactoring-tagged work items by 40%
Theme Evidence
AI that generates boilerplate/scaffolding, ports or converts code, writes small scripts, and implements well-scoped features directly from requirements or specs to reduce repetitive implementation work. It also performs behavior-preserving refactors, migrations, and framework/library/dependency...
Project Description
Embed a high-precision quality pass in the editor and pre-commit flow. It should look only at the current diff plus the nearby tests, rules, and interfaces that give that diff meaning, combine deterministic analyzers with targeted model reasoning, and suggest only the smallest missing tests or fixes worth a developer’s attention.
Context Inputs
- Changed functions, their direct callers, and touched interfaces
- Nearby unit and integration tests plus local test helpers and assertion styles
- Repository lint, format, compiler, type-check, and secure-coding rules
- Coverage deltas for changed lines and branches
- Dependency manifests and vulnerability registries
- Historical PR comments that encode recurring team-specific feedback
Proposed Capabilities
- On save or stage, compute the minimal impacted area from changed functions, adjacent call sites, touched interfaces, and affected modules.
- Run compiler, type, lint, basic static, and dependency/security checks first and normalize those findings into a common model.
- Compare the change against nearby implementation and testing patterns to flag likely correctness, readability, performance, or security issues with line-anchored explanations.
- When behavior is ambiguous, ask a short question tied to the affected branch or interface instead of guessing what the code is supposed to do.
- Generate small, change-focused tests that match existing repository style and cover happy path, boundary, error, and regression cases for the altered behavior.
- Before the PR is opened, publish a compact quality summary of unresolved findings, added tests, coverage movement, and security-sensitive surfaces.
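The "normalize findings into a common model" step above can be sketched minimally. The `Finding` record and severity scale are illustrative assumptions; real analyzers emit richer payloads, but the dedup-by-location-and-category idea is the same:

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2}

@dataclass(frozen=True)
class Finding:
    # Common model that deterministic analyzers and model output both map into.
    tool: str
    file: str
    line: int
    category: str   # e.g. "correctness", "security", "readability"
    severity: str
    message: str

def normalize(findings: list) -> list:
    """Collapse findings that point at the same line and category, keeping the
    highest-severity report so the developer sees one item, not three tools' worth."""
    best = {}
    for f in findings:
        key = (f.file, f.line, f.category)
        if key not in best or SEVERITY_ORDER[f.severity] > SEVERITY_ORDER[best[key].severity]:
            best[key] = f
    return sorted(best.values(), key=lambda f: (f.file, f.line))
```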
Who It Affects
98 of 353 developers (27.8%) explicitly described wanting AI assistance with code review, test generation, security scanning, or standards enforcement. This theme had near-perfect inter-rater reliability (α=0.94), and the category-wide preference-usage gap of 1.14 indicates clear unmet demand for higher-signal quality support.
- Preference-Usage Gap: 1.14
- Average AI Preference: 4.21/5
- Average AI Usage: 3.07/5
- 77.6% of respondents in this theme want High or Very High AI support
- Inter-rater reliability α=0.94 showing extremely clear and consistent expression of this need across respondents
Impact
If this exists, developers receive small, evidence-backed review findings and targeted test suggestions while they code, rather than discovering routine correctness, standards, or security issues only during code review or later testing. That shifts defect detection earlier, reduces back-and-forth on basic review comments, and improves confidence that code changes will not break expected behavior.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that inline findings match the type and quality of feedback they receive from experienced human reviewers.
- Generated tests are accepted and kept by developers rather than deleted or heavily rewritten, indicating that they are meaningful and repository-conformant.
- Developers say suggestions are repository-aware and evidence-backed rather than generic automated comments.
- Developers report catching bugs and insecure patterns during editing that would previously have surfaced during code review or later testing.
- Developers report that the tool is non-blocking and easy to dismiss when its suggestions are not relevant.
Quantitative Measures
- Reduce average pull request review turnaround time by 30%+ as fewer issues are surfaced for the first time during review
- Increase repository test coverage by 15-25% within 6 months of adoption through generated test suggestions
- Reduce regression-related bug reports by 20%+ through pre-commit change validation and targeted regression test generation
- Reduce security-related findings in dedicated security review or penetration testing by 40%+ due to shift-left detection
- Achieve >70% acceptance rate on generated test suggestions (tests kept without major modification)
- Achieve >60% acceptance rate on inline correctness and standards suggestions
Theme Evidence
AI that acts as an intelligent quality gate by providing real-time code review feedback, enforcing style/standards, and flagging correctness issues and bad practices. It generates meaningful unit/integration/E2E tests, identifies edge cases and coverage gaps, supports TDD workflows, and validates...
Project Description
Start from a bug report, stack trace, failing test, or incident alert and assemble a debug case file: correlated logs and traces, the most likely regression window, similar historical failures, competing root-cause hypotheses, and—when verification succeeds—a candidate patch plus regression test. The emphasis is investigative workflow, not generic code completion.
Context Inputs
- Logs, traces, and metrics tied to the failing request or time window
- Stack traces, crash dumps, compiler errors, and CI failure artifacts
- Recent commits, deploys, config flips, and feature-flag changes
- Historical bug reports, incidents, and linked remediation commits
- Source code plus dependency relationships for implicated modules
- Sandboxed build and reproduction environments matching the failing configuration
Proposed Capabilities
- Normalize the trigger into a failure signature and derive the minimum data collection plan needed for that signature.
- Pull correlated runtime signals by time, correlation ID, service boundary, and deployment version to build a failure timeline instead of a raw log dump.
- Rank recent commits, configuration changes, and rollout events by how well they explain the observed failure path.
- Retrieve similar past bugs and incidents, including their fixes, validation steps, and modules touched, to seed plausible hypotheses.
- Reproduce the issue in an isolated environment using the closest matching build and config snapshot, then capture enriched diagnostics from the repro attempt.
- Generate several root-cause hypotheses, each with a causal story, supporting artifacts, confidence estimate, and a short disproof checklist.
- If one hypothesis survives verification, generate a narrow patch and regression test and present them alongside the full investigation record.
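The commit-ranking step above can be illustrated with a toy scoring function. `rank_suspects` and its overlap-times-recency heuristic are assumptions for illustration only; a production system would also weigh deploy events, config flips, and code ownership.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    sha: str
    files: set        # files touched by the commit
    age_hours: float  # time since the commit landed

def rank_suspects(commits: list, implicated_files: set) -> list:
    """Toy suspect ranking: overlap between the failure's implicated files and
    each commit's touched files, discounted by commit age (recent changes first)."""
    scored = []
    for c in commits:
        overlap = len(c.files & implicated_files) / max(len(implicated_files), 1)
        recency = 1.0 / (1.0 + c.age_hours / 24.0)  # decays over days
        scored.append((c.sha, round(overlap * recency, 3)))
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

A commit from twelve hours ago that touched every implicated file outranks an older or narrower one, seeding the hypothesis list rather than deciding it.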
Who It Affects
69 of 353 respondents (19.5%) explicitly described wanting AI assistance with reactive debugging, root cause analysis, log correlation, bug triage, or fix suggestion, making this the third most prevalent theme in the category. 77.6% of these respondents want High or Very High AI support, yet current usage averages only 3.07/5, indicating tools are not meeting the need.
- 69 responses coded to this theme across 353 total, representing 19.5% prevalence
- 77.6% want High/Very High AI support for debugging and root cause analysis
- Average AI Preference of 4.21/5 vs. Usage of 3.07/5 — a 1.14 gap indicating strong unmet demand
- Inter-rater reliability α = 0.958, indicating near-perfect agreement among coders that these responses describe reactive debugging
Impact
If this workbench exists, the first phase of debugging shifts from manual hunting to reviewing an investigation brief. Instead of bouncing across logs, telemetry, version history, and bug databases, engineers receive a structured packet with the failure timeline, implicated code, likely regression window, relevant prior incidents, and—when verification succeeds—a small patch plus regression test. This reduces cognitive load during incidents and makes senior engineers reviewers of investigations rather than collectors of artifacts.
Constraints & Guardrails
Success Definition
Qualitative Measures
- On-call engineers report reduced cognitive load during incident response, citing the structured brief as a useful starting point rather than a distraction
- Developers report that the investigation brief accurately identifies the root cause on first attempt for the majority of bugs they investigate
- Developers working in unfamiliar codebases report the tool makes them effective at diagnosing bugs they otherwise could not have tackled independently
- Developers trust the confidence calibration — when the tool says it is uncertain, they find that accurate, and when it says high confidence, the diagnosis is usually correct
- Developers report less time spent gathering evidence and logging in to multiple tools ("it collected what I would have hunted for")
Quantitative Measures
- Mean time from bug report filed to root cause identified reduced by 40% or more for bugs where the tool is invoked
- Reduce MTTR for eligible incidents/bugs by 25% (end-to-end from detection to merged fix) compared with baseline
- Regression rate of AI-proposed fixes (fixes that introduce new test failures) below 5%, measured against CI pipeline results
- At least 50% of suggested patches that pass verification are accepted (merged) after human review
- Historical bug match accuracy: at least 70% of surfaced similar-bug links rated as relevant by the investigating developer
Theme Evidence
AI that accelerates debugging by analyzing stack traces, logs, telemetry, and failing tests; reproduces issues; identifies root causes; suggests fixes; and helps triage incidents/regressions. This theme applies when the respondent describes diagnosing or fixing specific bugs or incidents that have...
Project Description
Maintain a continuously updated map of symbols, calls, types, modules, interfaces, tests, and historical discussions so developers can ask 'where should this change go?' and 'what breaks if I touch this?' before editing. The emphasis is system-level context retention across sessions, not one-off code generation.
Context Inputs
- AST- and symbol-level index of the full repository across supported languages
- Build definitions, module boundaries, dependency graphs, and test relationships
- Version-control history, blame, and pull-request discussions
- Linked ADRs, design notes, bug reports, and incident postmortems
- Current lint/style rules and naming conventions from the repository
Proposed Capabilities
- On repository onboarding, parse source and build metadata into a persistent graph of symbols, types, calls, modules, APIs, and tests, then update that graph incrementally as code changes.
- Attach rationale and historical context to code regions using blame, PR review discussions, ADRs, and linked bugs or incidents.
- Infer a convention profile from the existing repository so file placement, naming, abstractions, and testing style reflect local norms rather than generic habits.
- Answer repo-specific how, why, and where questions with citations to code, docs, review threads, and historical changes.
- For a proposed change, compute a ripple report that lists affected files, interfaces, modules, tests, and likely risk categories before any patch is drafted.
- Generate a multi-file patch plan grouped into small steps that match repository conventions and can be validated against the build and test suite.
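The ripple report above is essentially reachability over a reverse-dependency graph. A minimal sketch, assuming such a graph is already maintained as adjacency sets (the `reverse_deps` shape and depth cap are hypothetical):

```python
from collections import deque

def ripple_report(reverse_deps: dict, changed: set, max_depth: int = 3) -> dict:
    """BFS over a reverse-dependency graph: which modules a change can reach,
    and at what distance, before any patch is drafted."""
    seen = {m: 0 for m in changed}
    queue = deque(changed)
    while queue:
        mod = queue.popleft()
        depth = seen[mod]
        if depth == max_depth:
            continue  # cap the blast-radius walk
        for dependent in reverse_deps.get(mod, set()):
            if dependent not in seen:
                seen[dependent] = depth + 1
                queue.append(dependent)
    # Report only the ripple, not the changed modules themselves.
    return {m: d for m, d in seen.items() if d > 0}
```

Distance in the report is a rough risk proxy: direct dependents usually need review, while depth-3 modules may only need a test run.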
Who It Affects
65 of 353 developers (18.4%) explicitly described needing AI to understand, retain, and safely act on repository-scale context across files and modules. This theme had strong inter-rater reliability (α=0.884), and the category-wide preference-usage gap of 1.14 indicates a coherent, high-priority need that current AI support does not meet.
- Inter-rater reliability α=0.884 indicating high agreement that these responses describe a coherent need
- Preference-Usage Gap: 1.14
- Average AI Preference: 4.21/5
- Average AI Usage: 3.07/5
- 77.6% of respondents want High/Very High AI support for this area
Impact
With a persistent, traceable model of the repository, developers could ask where a change belongs, see the ripple effects before editing, and receive small reviewable multi-file diffs that fit existing patterns and stay within scope. This would reduce time spent re-establishing context, hunting for prior fixes and design rationale, and manually checking downstream breakage, especially in unfamiliar parts of the codebase.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report they no longer need to re-explain repository context at the start of each session.
- Developers describe the tool's how/why/where answers as accurate and traceable to source artifacts.
- Developers voluntarily use ripple-effect analysis before making cross-cutting changes.
- Developers trust cross-file edit proposals enough to review and accept them instead of rewriting from scratch.
- Developers stop reporting that AI duplicates code or breaks existing functionality when extending unfamiliar areas.
Quantitative Measures
- Reduce the preference-usage gap from 1.14 to below 0.5 within 12 months of deployment.
- Increase acceptance rate of cross-file edit proposals to 70%+ (vs. baseline of single-file suggestions).
- ≥80% of tool-generated cross-file patch sets pass build and unit tests on first execution in CI (within the repo's existing test suite).
- Reduce average number of back-and-forth iterations to complete a cross-file refactor/rename/migration by ≥30%.
- Achieve 90%+ citation accuracy on how/why answers (verified by developer feedback on whether cited sources were relevant and correct).
Theme Evidence
AI that builds a reliable, repo-wide understanding of large/legacy systems, including dependencies, conventions, and cross-module/service relationships, to navigate code and make coordinated multi-file edits with awareness of ripple effects. It retrieves and synthesizes codebase- and...
Quality & Risk
155 responses | 8 themes
Quality & Risk Codebook
Themes identified from "What do you want AI to handle?" responses.
Automated Test Generation, Maintenance & Quality Gates
Intelligent PR/Code Review Assistant
Security Vulnerability Detection & Fix Guidance
Compliance, Standards & Audit Process Automation
Proactive Risk Monitoring, Prediction & Anomaly Detection
Agentic Workflow Automation & Automated Remediation
Knowledge Retrieval, Summarization & Standards Guidance
Debugging, Root Cause Analysis & Failure Triage
Project Description
On each pull request, derive a behavior-to-test map from the diff, linked requirements, existing suites, and coverage deltas; generate missing unit, integration, or end-to-end tests; run them; and publish a compact gate report showing exactly what changed behavior remains untested. For workflow-heavy changes, the same system can switch into a release-rehearsal mode and simulate the affected pipeline steps in a sandbox so config and CI logic are validated alongside the code.
Context Inputs
- PR diff, touched interfaces, and dependency impact map
- Existing tests, helpers, and coverage deltas for changed code
- Linked requirements, acceptance criteria, and bug/regression history for touched areas
- Build and test configuration plus flaky-test history
- Pipeline or workflow definitions when CI/CD files are part of the change
- Approved synthetic fixtures and test-data templates
Proposed Capabilities
- Ingest the diff and compute an impact map of changed functions, interfaces, workflows, and downstream dependents instead of treating the PR as an undifferentiated blob.
- Recover expected behavior from linked requirements, existing tests, and recent bugs; when intent is missing, ask only for the behavior that cannot be inferred from those sources.
- Create an intent-to-test matrix that ties each proposed check to a changed branch, interface, workflow, or stated requirement.
- Generate or update repository-style unit, integration, and end-to-end tests, reusing existing helpers and suppressing redundant cases.
- Run the new tests together with impacted existing suites; if workflow files changed, simulate the affected pipeline stages in a sandbox with stubbed secrets and non-production targets.
- Publish a gate report that shows what changed behavior was exercised, where changed-line or changed-branch coverage is still thin, and which gaps matter for critical paths.
- Apply configurable gates to changed code and high-risk workflows while preserving auditable human overrides.
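A changed-line coverage gate, the core of the gate report above, can be sketched as follows. `gate`, the 0.8 default threshold, and the input shapes are illustrative assumptions, not a proposed API:

```python
def changed_line_coverage(changed: dict, covered: dict) -> float:
    """Fraction of changed lines exercised by the new plus impacted test run.
    `changed` and `covered` map file -> set of line numbers."""
    total = sum(len(lines) for lines in changed.values())
    if total == 0:
        return 1.0  # nothing behavior-changing to cover
    hit = sum(len(lines & covered.get(f, set())) for f, lines in changed.items())
    return hit / total

def gate(changed: dict, covered: dict, threshold: float = 0.8) -> bool:
    """Configurable gate on changed-line coverage; humans retain auditable overrides."""
    return changed_line_coverage(changed, covered) >= threshold
```

Gating on changed lines rather than whole-repo coverage is what keeps the check focused on the behavior this PR actually altered.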
Who It Affects
69 of 155 respondents (44.5%) explicitly asked for AI help with test generation, coverage analysis, quality gates, or test automation—the largest theme in the quality-risk category. Coding agreement was near-perfect (α = 0.974), and the gap between high preference and lower current usage indicates a broad unmet need rather than a niche request.
- 81% of respondents in this theme want High or Very High AI support for testing tasks
- Average AI Preference: 4.32/5
- Average AI Usage: 2.75/5
- Preference-Usage Gap: 1.57
Impact
If successful, developers would move from writing most tests from scratch to reviewing a change-specific test plan and candidate tests that already cover happy paths, edge cases, and regressions. Pull requests would include concrete evidence of what was exercised, what changed-code coverage is still missing, and whether minimum gates are met, giving teams more confidence in new features and legacy-code changes without handing final judgment to the tool.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that writing tests has shifted from a write-from-scratch task to a review-and-refine task on typical changes.
- Generated tests are described as readable, human-editable, and aligned with repository conventions rather than generic boilerplate.
- Teams say the system reliably surfaces edge cases and changed-code gaps they would otherwise miss.
- Engineers trust CI gate outcomes because they include concrete evidence of executed tests, coverage deltas, and artifacts for audited workflows.
- Teams working in legacy code report higher confidence making changes because untested areas are made explicit and easier to address.
Quantitative Measures
- Reduce median time-to-add-tests per change by 30% (measured from first pipeline run to tests accepted in the repository).
- Increase coverage on changed lines or branches by +15% within 3 months of adoption for onboarded repositories (coverage delta focused on touched code, not overall).
- Reduce escaped regression defects linked to insufficient test coverage by 20% (based on bug or incident tagging and change linkage).
- Decrease occurrences of behavior-changing pull requests with no new tests added by 25% (measured via gate outcomes).
- Keep false-positive gate failures under 5% of runs (developer-labeled unhelpful or incorrect gate outcomes).
Theme Evidence
Generate and maintain meaningful unit, integration, and E2E tests from requirements, code changes, UI/workflow context, and repo history; propose edge cases and test data; identify coverage gaps and regressions; and enforce quality checks in CI/CD to prevent low-quality changes from shipping. This...
Project Description
Add a bounded first-pass analyst to the PR workflow. It should explain what changed, trace which modules and interfaces are most affected, and attach line-specific findings on likely defects, readability problems, performance issues, missing tests, or policy violations—together with what it checked and what it did not check.
Context Inputs
- Pull request diff, description, commit messages, and linked work items
- Semantic code index for the target branch, including symbols and call relationships
- Team review rules, architectural constraints, and contribution guidelines from the repository
- Historical PR discussions and resolved review comments from the same codebase
- CI outputs such as static analysis, type checks, tests, and build results
Proposed Capabilities
- Ingest the PR diff and metadata, then compute the reachable impact set across changed symbols, public interfaces, and dependency-relevant context.
- Generate a layered summary of the change: intent, modules touched, interface movement, risky hotspots, and likely downstream impact.
- Analyze the changed code for likely correctness, performance, readability, maintainability, and policy issues using the repository’s own review norms and historical comment patterns.
- Attach each finding to exact files or lines with severity, confidence, and a short rationale; when intent is ambiguous, turn the finding into a clarifying question instead of a false assertion.
- Publish both inline annotations and a top-level summary that explicitly lists what categories were checked and where the analysis boundary stopped.
- Use accepted and dismissed findings to adapt to repo-specific norms and reduce repeat noise over time.
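The rule above, a clarifying question instead of a false assertion, can be expressed as a small triage pass. `ReviewItem` and the 0.6 confidence bar are hypothetical names and defaults used only to make the idea concrete:

```python
from dataclasses import dataclass

@dataclass
class ReviewItem:
    file: str
    line: int
    severity: str      # "low" | "medium" | "high"
    confidence: float  # 0..1, the assistant's own calibration
    text: str
    kind: str = "finding"

def triage(items: list, min_confidence: float = 0.6) -> list:
    """Below the confidence bar, a finding is downgraded to a clarifying
    question; everything is then ordered for reviewer attention."""
    for it in items:
        if it.confidence < min_confidence:
            it.kind = "question"
            it.text = f"Is this intended? {it.text}"
    rank = {"high": 0, "medium": 1, "low": 2}
    return sorted(items, key=lambda i: (rank[i.severity], -i.confidence))
```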
Who It Affects
This affects teams that use pull requests as a core quality checkpoint and struggle with reviewer load, large diffs, and shallow automated feedback. In the survey, 36 of 155 developers (23.2%) who answered the 'want AI help' question explicitly asked for AI help with PR/code review tasks, and demand is strong despite low current usage.
- 36 of 155 respondents (23.2%) were coded to the Intelligent PR/Code Review Assistant theme
- 81.0% of respondents want High or Very High AI support in this area
- Average AI Preference: 4.32/5
- Average AI Usage: 2.75/5
- Preference-Usage Gap: 1.57
Impact
If successful, the assistant gives reviewers a reliable first pass on each pull request: a concise explanation of what changed, which parts of the codebase are most affected, and evidence-backed findings on likely defects, anti-patterns, readability problems, and performance concerns. This should reduce the time humans spend orienting to large diffs and repetitive low-level checks, while helping them focus their attention on design intent, special-case business logic, and other judgments the team does not want automated away.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Reviewers report that summaries help them understand large or complex pull requests faster.
- Developers say the assistant's comments are specific to their repository and team norms rather than generic best-practice boilerplate.
- Human reviewers report spending less time on mechanical checking and more time on design intent and change-specific judgment.
- Users say they trust the assistant more because each finding includes evidence, confidence, and explicit coverage limits.
Quantitative Measures
- At least 80% of reviewed pull requests have the assistant's summary rated as useful or very useful in a post-merge survey.
- Reduce median time-to-first-human-review on pull requests using the assistant by 20-30%.
- Reduce review iterations by 10-20% for pull requests larger than 500 changed lines.
- Increase acceptance rate of assistant findings to more than 40% while keeping false-positive dismissals below 30%.
- Reduce post-merge defects attributable to covered categories by 10-15%.
Theme Evidence
Act as a context-aware reviewer that understands the codebase and team norms to summarize large diffs, flag likely bugs and anti-patterns, improve readability and maintainability, suggest refactors, and surface performance concerns, reducing reviewer load and catching issues earlier. The...
Project Description
Run a pre-merge security pass that ties findings to concrete code paths, dependency uses, and configuration choices in the current change. For high-confidence cases, draft the smallest plausible remediation patch together with the specific checks—build, unit, route-level tests, or security scans—that show whether the fix actually works.
Context Inputs
- PR diff and surrounding source context for touched handlers, routes, and middleware
- Call graph and dataflow through changed auth, input, crypto, or secret-handling paths
- Dependency manifests, lockfiles, SBOM data, and advisory feeds
- Security-relevant configuration such as route policy, middleware, and access-control files
- Past security findings, suppressions, and remediation commits in the same repo
- Build, test, and security scan outputs for the candidate patch
Proposed Capabilities
- Localize the analysis to changed code and nearby security-sensitive surfaces such as auth checks, input handling, secrets, crypto, filesystem access, and network-facing paths.
- Run static, pattern, taint-style, and dependency advisory checks, then correlate the results into findings tied to exact files, lines, source-to-sink paths, or dependency chains.
- Explain each issue in developer terms: how it manifests here, what makes it risky, and which repository patterns or policies suggest the safer form.
- Generate minimal remediation options, including a conservative patch and, when relevant, a slightly broader but more future-proof alternative.
- Validate the chosen candidate against build, unit, and relevant security checks before surfacing it in the PR.
- Record dismissals, deferrals, and false positives with rationale so the assistant can reduce repeat noise per repository.
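Dependency-advisory correlation, one of the checks listed above, reduces at its simplest to comparing locked versions against fixed-in versions. A deliberately naive sketch; the advisory shape is invented for illustration, and real matching needs full version-range and ecosystem-specific semantics:

```python
def parse(v: str) -> tuple:
    # Naive dotted-version parse; real matching needs a proper version library.
    return tuple(int(p) for p in v.split("."))

def vulnerable_deps(lockfile: dict, advisories: list) -> list:
    """Match locked package versions against advisories of the hypothetical
    form {'package', 'fixed_in', 'id'}: vulnerable when locked < fixed_in."""
    hits = []
    for adv in advisories:
        locked = lockfile.get(adv["package"])
        if locked is not None and parse(locked) < parse(adv["fixed_in"]):
            hits.append({"package": adv["package"], "locked": locked,
                         "advisory": adv["id"], "upgrade_to": adv["fixed_in"]})
    return hits
```

Each hit already names the upgrade target, which is exactly the input the conservative-patch step needs.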
Who It Affects
34 of 155 developers (21.9%) explicitly requested AI assistance with security vulnerability detection and fix guidance. Demand is both strong and clear: this theme has near-perfect coding agreement (α = 0.938), 81.0% of respondents wanting high AI support, and a large preference-usage gap of 1.57 (4.32/5 desired vs. 2.75/5 current usage), indicating that existing tools are not meeting need.
- 34 of 155 responses coded to this theme (21.9%)
- 81.0% of respondents in this theme want high or very high AI support for security vulnerability detection
- Average AI preference: 4.32/5
- Average AI usage: 2.75/5
- Preference-Usage Gap: 1.57
- α = 0.938
Impact
Developers get security feedback at the point of code change, with evidence of exploitability and a repository-specific fix they can review, instead of late or non-actionable scanner output. This shortens the path from finding to remediation for both code and dependency issues, reduces back-and-forth with security reviewers, and helps teams catch auth/authz gaps and vulnerable libraries before merge.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that security findings are actionable because explanations tie the issue to local code and dependency usage.
- Developers report that remediation suggestions are written in plain language rather than security jargon.
- Developers trust the tool because it shows evidence, confidence, and validation results and does not overclaim.
- Security reviewers report fewer back-and-forth cycles because pull requests arrive with clearer fixes and rationale.
Quantitative Measures
- Increase pre-merge security issue detection rate (share of security issues found before merge) by 20+ percentage points.
- Reduce high/critical security findings detected after merge by 30-50% within 2 quarters of rollout.
- False-positive rate below 10% (share of findings later marked "not applicable") after 4 weeks of tuning per repository.
- Median time-to-remediate for dependency vulnerabilities reduced by 25-40%.
- Preference-usage gap reduction: increase average usage from 2.75/5 toward preference (4.32/5) over 6 months.
Theme Evidence
Proactively scan code, PRs, and dependencies to identify security vulnerabilities (e.g., auth/authz gaps, insecure patterns, risky libraries, weak cryptography) and provide actionable remediation guidance or suggested patches before merge or deployment. This theme applies when the response...
Project Description
Translate a selected control set into an applicability matrix, fetch the required artifacts from engineering systems, and draft questionnaire answers with citations, timestamps, and explicit gaps. This is a workflow engine for evidence collection and policy-to-engineering translation, not a system that pronounces a service compliant.
Context Inputs
- Control catalogs, policy text, and questionnaire templates with stable IDs and versions
- Service and repository inventory, including ownership and data classification
- CI results, scanner outputs, and cloud policy state from read-only connectors
- Work items, waivers, approvals, and remediation history
- Prior submissions and stored evidence packets for the same control families
Proposed Capabilities
- Resolve the target system and parse the selected control set into required evidence types, applicability rules, and allowed answer formats.
- Collect a minimal set of scoping facts that determine which controls apply and record the reasoning for each applicable or non-applicable judgment.
- Generate an evidence plan per control that specifies where the artifact should come from, how fresh it must be, and whether human input is still required.
- Harvest available artifacts into an evidence ledger with immutable references, timestamps, and provenance instead of copying raw screenshots and fragments into free text.
- Draft questionnaire answers control by control, citing the evidence used, marking low-confidence fields, and creating remediation tasks where the evidence is missing or points to a gap.
- Assemble an audit packet containing the applicability matrix, evidence ledger, drafted responses, unresolved issues, and the final edits made by human reviewers.
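The applicability matrix and evidence ledger above can be sketched with two small structures. `EvidenceEntry`, `applicability_matrix`, and the `requires` field are illustrative assumptions about shape, not a schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EvidenceEntry:
    # One ledger row: an immutable artifact reference with provenance,
    # not a screenshot pasted into free text.
    control_id: str
    source: str        # read-only connector the artifact came from
    artifact_ref: str
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def applicability_matrix(controls: dict, facts: dict) -> dict:
    """Decide per control whether it applies, recording the reasoning.
    Each control carries a hypothetical 'requires' list of scoping-fact names."""
    matrix = {}
    for cid, control in controls.items():
        missing = [f for f in control["requires"] if not facts.get(f, False)]
        matrix[cid] = {
            "applicable": not missing,
            "reason": "all scoping facts hold" if not missing
                      else f"not applicable: {', '.join(missing)} false/unknown",
        }
    return matrix
```

Recording the reason alongside each applicable/non-applicable verdict is what lets a human reviewer audit the scoping step instead of re-deriving it.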
Who It Affects
22 of 155 respondents (14.2%) explicitly referenced compliance processes, audit readiness, or standards-enforcement workflows in their "want help" responses. This theme had very high inter-rater reliability (α=0.949), indicating a clear and consistently expressed practitioner pain point rather than isolated complaints.
- Average AI Preference: 4.32/5
- Average AI Usage: 2.75/5
- Preference-Usage Gap: 1.57
- 81.0% of respondents in this theme want High or Very High AI support for compliance tasks
Impact
Instead of spending hours hunting through repositories, pipeline outputs, scanners, and documents to answer long compliance questionnaires, developers would receive a reviewable draft with cited evidence, explicit gaps, and plain-language action items. This shifts effort from repetitive retrieval and form-filling to checking edge cases, reduces incomplete submissions and back-and-forth with reviewers, and moves teams closer to continuous audit readiness.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that compliance requirements are translated into clear, actionable steps rather than dense jargon.
- Developers trust drafted answers because every control status includes evidence links, timestamps, and an explanation of what was checked.
- Reviewers report more complete submissions with fewer back-and-forth revision cycles.
- Teams feel audit-ready earlier in the process rather than scrambling only at formal review time.
Quantitative Measures
- Reduce median time to complete a compliance/security review questionnaire by 50% (from first open to submitted draft).
- Reduce number of manual evidence lookups per review by 40% (tracked via workflow events and connector usage).
- Auto-collect and attach evidence for at least 60% of applicable controls.
- Increase first-pass acceptance rate of compliance submissions by 30% (fewer returns for missing evidence or unclear answers).
- Maintain low error rate: fewer than 5% of drafted questionnaire answers require major correction due to factual inaccuracies.
Theme Evidence
Reduce compliance toil by interpreting internal/external standards and policies, translating them into actionable developer steps, checking whether compliance bars are met, automating evidence collection and form/questionnaire filling, and improving audit readiness (e.g., SFI, S360, security review...
Project Description
Continuously score recent deployments, config flips, and feature-flag changes against service-specific baselines and historical failure patterns, then issue short risk briefs that explain why a change looks dangerous, which signals are drifting, and what blast radius is most plausible. The output should read like an early warning note, not an opaque verdict.
- Deployment, configuration, and feature-flag events with service and environment identifiers
- Service metrics, traces, and representative anomalous log signatures
- Historical incidents and postmortems for similar services or change types
- Dependency or service-topology relationships that shape blast radius
- CI quality signals and change metadata for the rollout under evaluation
- Assign a canonical ChangeID to each deployment or config event so telemetry can be tied back to a specific rollout or flag flip.
- Learn service-specific healthy baselines and detect low-severity deviations in latency, error rate, crash patterns, saturation, or log templates.
- Correlate deviations to recent changes using timing, topology, rollout strategy, and historical incident similarity.
- Compute a risk score with explicit drivers such as blast radius, prior instability, limited test coverage, risky dependency edges, or unusual config movement.
- Generate a short risk brief for the owning team that includes the top drivers, before/after signal comparisons, relevant past incidents, and suggested next checks.
- Use engineer feedback and post-incident outcomes to recalibrate thresholds and highlight recurring risk motifs across services.
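The "risk score with explicit drivers" step can be sketched as a weighted combination that also reports *why* the score is high, so the output reads like a brief rather than a verdict. The driver names and weights below are illustrative assumptions, not calibrated values.

```python
# Illustrative weights; a real system would recalibrate from feedback.
DRIVER_WEIGHTS = {
    "blast_radius": 0.30,       # downstream services reachable from the change
    "prior_instability": 0.25,  # incident history for this service/change type
    "low_test_coverage": 0.20,
    "risky_dependency_edge": 0.15,
    "unusual_config_delta": 0.10,
}

def score_change(signals):
    """Combine normalized driver signals (0-1) into a score plus top drivers."""
    contributions = {d: DRIVER_WEIGHTS[d] * signals.get(d, 0.0)
                     for d in DRIVER_WEIGHTS}
    score = sum(contributions.values())
    # Surface the drivers, not just the number, for the risk brief.
    top = sorted(contributions, key=contributions.get, reverse=True)[:3]
    return {"risk_score": round(score, 3), "top_drivers": top}

brief = score_change({"blast_radius": 0.8,
                      "prior_instability": 0.6,
                      "low_test_coverage": 0.9})
```

Keeping per-driver contributions explicit is what lets the brief say "this looks dangerous because of blast radius and thin test coverage" instead of emitting an opaque number.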
Who It Affects
18 of 155 respondents (11.6%) explicitly asked for AI support with proactive risk monitoring, prediction, or anomaly detection using operational signals beyond code review. This theme has the highest preference-usage gap among all themes, indicating strong unmet demand for systems that correlate production signals and change history to surface risk early.
- Average AI Preference: 4.32/5
- Average AI Usage: 2.75/5
- Preference-Usage Gap: 1.57
- 81% of respondents in this theme want High or Very High AI support
- High inter-rater reliability (α = 0.894)
Impact
If this exists, teams get early, evidence-backed warnings about risky changes while problems are still low severity instead of discovering them during firefighting. Developers spend less time stitching together dashboards, logs, and rollout history during triage, and more time acting on a prioritized view of likely causes and impact. Over time, repeated alerts can also reveal recurring risk drivers so teams can address root causes rather than repeatedly patching symptoms.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report fewer surprise regressions after rollout because alerts arrive early with enough evidence to investigate.
- On-call engineers say alerts are understandable and clearly linked to likely related changes.
- Developers describe the risk brief and drill-down view as reducing manual log-hunting and cross-signal correlation during triage.
- Teams report that repeated alerts help them identify recurring risk drivers and prioritize root-cause fixes instead of patching symptoms.
Quantitative Measures
- Reduce mean time to detect regressions after a change by at least 30% within 2 quarters for onboarded services.
- Increase the percentage of incidents where a pre-incident risk signal was surfaced from less than 10% to more than 50% within 12 months.
- Achieve a false-positive rate below 15% for high-severity risk alerts after an initial calibration period.
- Attach a risk score and risk brief to more than 90% of production changes for onboarded services.
- Reduce average time spent on manual post-incident root cause correlation by at least 30%.
Theme Evidence
Use telemetry, logs, configurations, and historical change data to predict high-risk changes, detect regressions and anomalies early, track risk trends across services, assess likely impact, and generate prioritized risk reports or alerts so teams can mitigate issues before incidents escalate. The...
Infrastructure & Ops
101 responses | 8 themes
Infrastructure & Ops Codebook
Themes identified from "want help" responses.
Observability & Incident Response Automation (Monitoring, Triage, RCA, Mitigation, Self-Heal)
CI/CD, Deployment & Infrastructure Provisioning Automation (Pipelines + IaC)
Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization
Customer Support Triage & Auto-Response
Testing, Quality Validation & Safer Releases
Knowledge Management, Documentation Search & System Context
Ops Toil Automation & Script Writing/Debugging
Better AI Tooling UX (Accuracy, Control & Cohesive Workflows)
Project Description
Operate as a read-mostly observability assistant that tunes monitors from historical alert quality and, when an alert fires, assembles an incident brief from logs, traces, recent deploys, dependency topology, similar incidents, and runbook fragments. The assistant should shorten the orientation phase of incident response, not impersonate an incident commander.
- Metrics, logs, and traces from production telemetry systems
- Alert rules, page history, acknowledgement data, and routing policies
- Service dependency graph, ownership metadata, and criticality tags
- Deployment, configuration, and feature-flag events
- Incident records, postmortems, and mitigation runbooks
- Audit logs for assistant reads and proposed actions
- Normalize telemetry, alert, and change events into a common schema with stable service IDs and aligned timestamps.
- Refresh the service dependency graph from metadata and observed traces so the assistant can separate upstream causes from downstream symptoms.
- Analyze historical alert quality to propose threshold, deduplication, and missing-monitor changes with explicit trade-offs between noise and missed detection.
- When an alert fires, gather a compact evidence bundle from correlated logs, metrics, traces, neighboring services, and recent deploy or config events.
- Generate a structured incident brief with suspected user impact, event timeline, top candidate causes, and clearly stated unknowns.
- Retrieve similar prior incidents and related runbook steps, annotating each mitigation option with expected effect, risk, and rollback guidance.
- Auto-enrich the incident record with the evidence bundle, chosen mitigation, and outcome so later alerts can benefit from the resolved case.
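The evidence-gathering step above hinges on correlating an alert with recent changes on the alerting service and its declared upstreams. A minimal sketch, assuming a flat event list and a fixed lookback window (the field names and 60-minute default are hypothetical):

```python
from datetime import datetime, timedelta

def build_incident_brief(alert, change_events, window_minutes=60):
    """Bundle an alert with deploys/config flips that landed just before it.

    A change is a suspect if it touched the alerting service or a declared
    upstream dependency inside the lookback window.
    """
    window = timedelta(minutes=window_minutes)
    related = {alert["service"], *alert.get("upstream", [])}
    suspects = [c for c in change_events
                if c["service"] in related
                and timedelta(0) <= alert["fired_at"] - c["at"] <= window]
    suspects.sort(key=lambda c: alert["fired_at"] - c["at"])  # newest first
    return {"alert": alert["name"],
            "suspect_changes": [c["id"] for c in suspects],
            # Unknowns are stated explicitly rather than papered over.
            "unknowns": [] if suspects else ["no recent change correlates"]}

alert = {"name": "checkout-5xx", "service": "checkout",
         "upstream": ["payments"],
         "fired_at": datetime(2025, 1, 10, 12, 0)}
changes = [
    {"id": "deploy-41", "service": "payments",
     "at": datetime(2025, 1, 10, 11, 50)},
    {"id": "deploy-40", "service": "search",
     "at": datetime(2025, 1, 10, 11, 55)},
]
brief = build_incident_brief(alert, changes)
```

Note the empty-suspects branch: stating "no recent change correlates" is part of the brief, matching the requirement for clearly stated unknowns.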
Who It Affects
41 of 101 respondents (40.6%) were coded to this theme — the largest theme in infrastructure operations, with near-perfect inter-rater reliability (α = 0.973). These are on-call engineers and service owners responsible for monitoring production health, triaging alerts, and coordinating incident response across distributed systems.
- Theme prevalence: 41/101 (40.6%) requests
- 74.5% of respondents in this category want High or Very High AI support for infrastructure operations tasks
- Average AI preference: 4.11/5 vs current usage: 2.42/5 (gap: 1.69)
Impact
Instead of starting from raw alerts, scattered dashboards, and manual log queries, on-call engineers would receive fewer low-signal alerts and an incident brief that already correlates likely impact, recent changes, and candidate causes. Repeated incidents would become faster to diagnose because the assistant surfaces similar prior incidents and relevant runbook steps, letting engineers focus on deciding and executing the response rather than assembling context.
Constraints & Guardrails
Success Definition
Qualitative Measures
- On-call engineers report that incident summaries are accurate and useful, reducing the need to manually query logs during triage.
- On-call engineers report fewer low-signal pages and faster understanding of what is broken and why.
- Engineers trust the system's root cause hypotheses enough to use them as a starting point rather than investigating from scratch.
- Teams report that monitoring coverage gaps are caught proactively, with fewer incidents caused by missing or misconfigured monitors.
- Past-incident matching and runbook suggestions are perceived as relevant and actionable.
- Incident commanders say the generated timelines and impact summaries reduce coordination overhead and improve handoffs between responders.
Quantitative Measures
- Reduce mean time-to-triage (alert fire to root cause hypothesis) by 50% within 6 months of adoption.
- Reduce alert/page volume per service by 20–40% while maintaining or improving incident detection coverage (no increase in escaped incidents).
- Increase monitoring coverage (% of services with health monitors) from baseline by 30% through gap detection and auto-suggested monitors.
- Achieve 80%+ accuracy for top-3 root cause hypotheses as validated against post-mortem confirmed root causes.
- Improve mean time to mitigate/restore (MTTR) by 10–25% through faster correlation, timeline generation, and mitigation guidance.
- Reduce manual incident record enrichment time by 70% through auto-generated summaries and evidence bundles.
Theme Evidence
AI that continuously analyzes telemetry (metrics, logs, traces) to set up and tune monitoring, detect anomalies, predict failures, and generate higher-signal alerts to reduce noise and missed conditions. When incidents occur, it correlates signals across systems, summarizes impact and timeline,...
Project Description
Author, migrate, explain, and debug delivery definitions by reading the repository’s build targets, workflow files, IaC modules, and platform templates, then producing file-by-file patches and diagnostic notes rather than opaque tips. The same assistant should be able to render the current pipeline topology in plain language so engineers can see how jobs, artifacts, environments, and gates actually fit together.
- Existing workflow files, build scripts, deployment manifests, and IaC modules
- Pipeline execution logs, failed run metadata, and deployment events
- Organization pipeline templates, policy rules, and approved base images
- Read-only environment inventory and desired-state data
- Repository structure, test targets, and dependency manifests
- Version-control history for delivery definition files
- Parse the repository into a delivery profile that identifies runtime, build, test, packaging, deployment targets, and the files that currently implement them.
- Construct a topology map of stages, jobs, gates, artifacts, environments, and dependencies, and explain that map in plain language for the team.
- Generate baseline CI/CD or environment skeletons for new repos or services by combining the project profile with organization templates and policy constraints.
- For legacy definitions, produce minimal migration diffs with staged rollout notes and rollback guidance instead of full rewrites.
- When a build or deployment fails, trace the error back to the pipeline step, configuration stanza, script line, or resource definition most likely responsible.
- Validate generated changes against platform schemas and organization policies, then emit a safety packet summarizing changed files, risky operations, compliance results, and recommended smoke tests.
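The topology-map step can be illustrated with a small dependency walk over CI jobs. The `needs` field mimics common workflow syntax, but the job names and the arrow-joined summary format are assumptions for the sketch:

```python
def explain_topology(jobs):
    """Render a plain-language ordering of pipeline jobs from their
    'needs' edges. `jobs` has the shape {job_name: {"needs": [...]}}."""
    order, seen = [], set()

    def visit(name, stack=()):
        if name in stack:
            # A dependency cycle means the pipeline definition is broken.
            raise ValueError(f"cycle through {name}")
        if name in seen:
            return
        for dep in jobs[name].get("needs", []):
            visit(dep, stack + (name,))
        seen.add(name)
        order.append(name)

    for name in jobs:
        visit(name)
    return " -> ".join(order)

pipeline = {
    "build": {},
    "test": {"needs": ["build"]},
    "deploy": {"needs": ["test"]},
}
summary = explain_topology(pipeline)
```

A real assistant would attach artifacts, environments, and gates to each node; the point here is that the topology is derived from the files themselves, not from tribal knowledge.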
Who It Affects
34 of 101 respondents (33.7%) were coded to this theme, making it the second-largest demand cluster after observability/incident response. These respondents described work spanning environment provisioning, pipeline authoring, migration, and deployment troubleshooting; coding agreement was strong across three independent coders (α = 0.91).
- 34 responses (33.7%) coded to this theme, making it the second most prevalent 'want' category
- 74.5% of respondents coded to this theme want High or Very High AI support
- Average AI Preference of 4.11/5 vs. Average AI Usage of 2.42/5, a 1.69-point gap
- Inter-rater reliability α = 0.91 across 3 independent coders
Impact
If this capability existed, teams could bootstrap new pipelines and environments from repository context instead of hand-assembling them from poor documentation, while still receiving reviewable diffs rather than autonomous changes. The same system would shorten build and deployment debugging by mapping failures back to specific pipeline or infrastructure lines, and it would make existing delivery topologies understandable enough for onboarding, migration, and change planning. The result is less prerequisite toil and fewer hours lost to trial-and-error in delivery infrastructure.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers report that setting up a new project's CI/CD pipeline and infrastructure templates feels like filling in the blanks on a well-structured draft rather than starting from scratch.
- Developers report they can understand how their pipeline and environments are wired end-to-end via the generated explanations and topology maps.
- On failures, developers report the triage output is actionable and points to specific files, lines, or steps instead of forcing trial-and-error log reading.
- Developers trust the system because every output is a reviewable change-set with rationale and validation evidence rather than an autonomous action.
Quantitative Measures
- Reduce median time-to-first-successful CI pipeline for a new repo by 50%.
- Reduce median time-to-provision a new non-production environment (from infrastructure-as-code) by 30–60%.
- Reduce mean time-to-diagnosis for build and deployment failures by 30%.
- 75% of generated pipeline definitions and infrastructure templates pass organizational compliance validation on first generation (before human edits).
- Decrease re-run count per failed pipeline run by 20%.
- Safety: 0 autonomous production changes; 100% of production-impacting suggestions accompanied by a blast-radius report and policy evaluation artifact.
Theme Evidence
AI that creates, explains, migrates, reviews, and maintains CI/CD pipelines and deployment workflows, including automating releases and troubleshooting build/deploy failures. It also reduces toil in provisioning environments by generating or updating infrastructure-as-code (e.g., Bicep/ARM/EV2) and...
Project Description
Consolidate deprecations, security findings, runtime drift, platform notices, and cost anomalies into a maintenance agenda that already says what to change, why it matters now, how to verify it, and who owns the missing context. The assistant’s job is backlog synthesis and task preparation, not silent remediation.
- Service catalog with ownership, criticality, and environment mappings
- Runtime, package, image, and configuration inventory per environment
- Vulnerability, compliance, and platform lifecycle findings
- API deprecation notices and upgrade requirements from internal platforms
- Cloud utilization and billing data for optimization candidates
- Existing runbooks, escalation contacts, and backlog history
- Map each service to its environments, repositories, owners, runtimes, and deployed artifacts so findings land in the right operational context.
- Normalize upkeep signals—security/compliance findings, unsupported versions, deprecations, upgrade requirements, and cost anomalies—into a common remediation queue.
- Deduplicate repeated alerts across tools and environments by collapsing them into root remediation actions rather than one ticket per scanner result.
- Prioritize the queue using severity, criticality, blast radius, estimated effort, due dates, and potential cost savings.
- Draft backlog items that answer the practitioner’s preferred rubric: what to do, why now, detailed steps, validation checks, rollback notes, and who to contact if assumptions do not line up.
- After teams accept or complete items, rescan the relevant signals to attach closure evidence and suppress already-resolved or duplicate work.
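The deduplicate-then-prioritize steps can be sketched as collapsing per-scanner findings into one action per (service, fix) pair and ordering the result by severity and due date. The field names, severity scale, and sample findings are illustrative assumptions:

```python
from collections import defaultdict

SEVERITY_RANK = {"critical": 3, "high": 2, "medium": 1, "low": 0}

def build_agenda(findings):
    """Collapse repeated scanner findings into root remediation actions,
    then order by severity and due date."""
    actions = defaultdict(list)
    for f in findings:
        actions[(f["service"], f["remediation"])].append(f)
    agenda = []
    for (service, remediation), group in actions.items():
        agenda.append({
            "service": service,
            "action": remediation,
            # Keep the worst severity and earliest due date in the group.
            "severity": max((f["severity"] for f in group),
                            key=SEVERITY_RANK.get),
            "due": min(f["due"] for f in group),
            "sources": sorted({f["tool"] for f in group}),
        })
    agenda.sort(key=lambda a: (-SEVERITY_RANK[a["severity"]], a["due"]))
    return agenda

findings = [
    {"service": "billing", "remediation": "upgrade openssl",
     "severity": "high", "tool": "scanner-a", "due": "2025-02-01"},
    {"service": "billing", "remediation": "upgrade openssl",
     "severity": "critical", "tool": "scanner-b", "due": "2025-01-15"},
    {"service": "search", "remediation": "rotate key",
     "severity": "medium", "tool": "scanner-a", "due": "2025-01-10"},
]
agenda = build_agenda(findings)
```

Two scanners flagging the same OpenSSL gap yield one agenda item with both sources cited, which is exactly the "one root action, not one ticket per scanner result" behavior described above.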
Who It Affects
17 of 101 "want" responses (16.8%) described developers responsible for ongoing service ownership after launch: maintaining environments, upgrading systems, closing security/compliance findings, and managing resource efficiency. This theme had very high inter-rater reliability (α = 0.954), indicating a clear and repeated practitioner need.
- 17/101 want responses (16.8%) were coded to this theme with α = 0.954
- Average AI preference: 4.11/5
- Average AI usage: 2.42/5
- Preference-usage gap: 1.69
- 74.5% of respondents want High or Very High AI support for infrastructure ops tasks
Impact
Developers stop manually chasing scattered upkeep signals across scanners, advisories, deprecation notices, and billing reports. Instead, service owners get a prioritized daily or weekly agenda of proposed work items—each with what to do, why it matters, detailed execution steps, validation evidence, and escalation guidance—so upkeep becomes a predictable routine rather than ad hoc fire drills. This should reduce the engineering burden of service ownership, especially around security/compliance remediation, while also surfacing upgrade and cost-saving work earlier.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Service owners report that each proposed item clearly answers what to do next, why it matters, and how to verify completion.
- Engineers trust recommendations because each item cites the underlying finding, affected resources, and supporting policy or advisory text.
- Teams report fewer missed upkeep tasks when developers shift attention back to feature work.
- Security/compliance stakeholders report less back-and-forth because maintenance items include acceptance criteria and audit-ready evidence.
Quantitative Measures
- Reduction in median time-to-remediate for security/compliance findings
- Decrease in overdue security/compliance findings per service
- Increase in patch and upgrade compliance rate across environments
- Reduction in manual effort to create maintenance tickets, measured by accepted drafted items versus tickets created from scratch
- Measured cost savings from accepted optimization recommendations
- Lower rate of generated upkeep items closed as duplicate, irrelevant, or not applicable
Theme Evidence
AI that plans and drives routine operational upkeep of already-running services -- upgrades/patching, dependency and API/workflow migrations, security/compliance posture management (e.g., SFI, S360 remediation), and resource/cost optimization -- by generating actionable work items and...
Project Description
Screen incoming tickets, match them to known issue patterns, run only pre-approved telemetry lookups keyed by safe identifiers, and draft a responder-facing triage card plus a customer-safe reply. The assistant should act like a fast intake specialist sitting next to the support engineer, not like a wall between the customer and a human.
- Incoming support tickets, attachments, and verified account metadata
- Approved knowledge-base articles, troubleshooting guides, and known-issue records
- Anonymized resolved tickets and resolution codes for similar cases
- Service ownership maps and escalation policies
- Read-only telemetry query templates keyed by safe identifiers
- Communication, privacy, and confidentiality rules for support responses
- Normalize the request into a common schema, redact sensitive fields from free text, and attach the relevant privacy and communication policies before retrieval begins.
- Classify the ticket by likely product area, issue type, severity, and owner queue, and show rationale snippets so the assigned responder can see why it was routed that way.
- Retrieve similar resolved cases and approved knowledge sources to ground the triage in prior outcomes rather than in the current ticket alone.
- When a verified identifier is present, run only allowed telemetry or log queries and summarize recent errors, anomalous states, or health signals relevant to the report.
- Draft a triage card and a customer-facing reply that state what was observed, what steps are recommended, what information is still missing, and where confidence is low.
- Route the case to the appropriate human queue with the context pack attached and learn from overrides, edits, and escalations.
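The classify-with-rationale step can be sketched as keyword routing that redacts obvious secrets first and flags low confidence instead of guessing. The queues, keyword lists, and the 20-character redaction heuristic are all illustrative assumptions; a production system would use a learned classifier and proper secret detection.

```python
import re

ROUTES = {
    "auth":    ["login", "password", "token", "permission"],
    "billing": ["invoice", "charge", "refund"],
}

def triage(ticket_text):
    """Draft a triage card: redact likely secrets, pick a queue with
    rationale snippets, and flag low confidence rather than guessing."""
    # Crude stand-in for real secret detection: long alphanumeric runs.
    redacted = re.sub(r"\b[A-Za-z0-9]{20,}\b", "[REDACTED]", ticket_text)
    hits = {q: [kw for kw in kws if kw in redacted.lower()]
            for q, kws in ROUTES.items()}
    queue, rationale = max(hits.items(), key=lambda kv: len(kv[1]))
    if not rationale:
        return {"queue": "human-review", "rationale": [],
                "low_confidence": True, "text": redacted}
    return {"queue": queue, "rationale": rationale,
            "low_confidence": len(rationale) < 2, "text": redacted}

card = triage("Login fails after password reset; "
              "token abcdefghij1234567890XYZ rejected")
```

The rationale list is what lets the assigned responder see *why* the ticket was routed, matching the requirement above.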
Who It Affects
12 of 101 respondents (11.9%) explicitly asked for AI help with customer- or user-facing support interactions. This theme showed strong unmet demand—average AI preference 4.11/5 versus usage 2.42/5, a 1.69-point gap, with 74.5% wanting High or Very High support—while also drawing at least 15 'do not want' mentions, indicating that affected teams face real support volume but want assistance designed as human-in-the-loop rather than full automation.
- Average AI Preference: 4.11/5
- Average AI Usage: 2.42/5
- Preference-Usage Gap: 1.69
- 74.5% of respondents in this theme want High or Very High AI support
- At least 15 'do not want' respondents explicitly mentioned customer support
Impact
For support engineers and developers, common tickets no longer begin with manual screening, queue selection, log hunting, and drafting a first reply from scratch. The assistant surfaces the likely category, relevant diagnostic signals, similar resolved cases, and a cited draft response so humans can review and respond quickly; novel cases arrive to specialists with context already assembled. This shifts time away from repetitive support toil and toward mitigation and product work, while giving customers faster and more specific first responses.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Support responders report less time spent screening tickets, deciding routing, and manually digging through logs for common issues.
- Responders trust the triage cards and drafts because they include citations, show what evidence was checked, and clearly flag uncertainty and missing information.
- Engineers receiving escalations report that tickets arrive with clearer reproduction details and more relevant diagnostic context.
- Customers do not perceive a degradation in support quality, and escalation to a human remains straightforward when needed.
Quantitative Measures
- Reduce median time-to-first-response by 25–40% for supported ticket categories.
- Reduce average handle time for common issues such as permissions, configuration problems, and known errors by 20–35%.
- Increase correct initial routing rate to the right queue or owner by 15–30%.
- Achieve ≥80% classification accuracy on the top-10 most common ticket categories, validated by agent override rate <20%.
- Agent draft-acceptance rate for knowledge-base-matched tickets reaches ≥60% within 6 months.
- Maintain a low customer-facing error rate: <1% of AI-assisted responses later marked incorrect or misleading by human audit.
- Privacy/security compliance: 0 incidents of secret or PII leakage in drafts as measured by automated scanners and audits.
Theme Evidence
AI that screens and buckets customer or user support requests, correlates user-reported issues with telemetry/logs, drafts responses from known solutions/knowledge bases, and escalates appropriately -- reducing repetitive support workload. This theme applies when the respondent explicitly mentions...
Meta-Work
157 responses | 8 themes
Meta-Work Codebook
Themes identified from "want help" responses.
Automated Documentation Generation & Maintenance
Onboarding, Mentoring & Personalized Upskilling
Project Knowledge Search & Discovery (with Traceable Sources)
Stakeholder/Client Communication Drafting & Translation
Brainstorming, Option Generation & Rapid Exploration
Meeting Scheduling, Notes, Summaries & Action Items
Proactive Personal Agent & Routine Admin Automation
Planning, Prioritization, Blocker Detection & Status Reporting
Project Description
Watch code changes, infer which docs they invalidate, and propose small patches—README paragraphs, API notes, runbook steps, diagrams, docstrings—linked directly to the code symbols, tests, and schemas that justify them. The unit of output should be a precise doc patch, not a page of generic prose.
- PR diffs, changed symbols, and repository structure
- Existing documentation corpus, including READMEs, docs folders, runbooks, and docstrings
- Tests and coverage changes that clarify current behavior
- API, schema, and configuration definitions for changed interfaces
- Architecture diagrams or dependency graphs for structural changes
- CI validation outputs for examples, snippets, and references
- Inventory the repository’s documentation surfaces and conventions so the assistant knows which files, doc styles, and diagram formats actually matter here.
- For each code change, compute a documentation impact map that links changed interfaces, commands, configs, and behaviors to the docs most likely to drift.
- Pull facts only from authoritative sources such as symbols, tests, schemas, and existing docs, storing line-level citations for every generated claim.
- Generate minimal patches for the affected sections instead of rewriting whole documents, and regenerate only diagram elements that the structural change actually touched.
- Validate links, code examples, schema snippets, and referenced symbols in CI wherever possible.
- Surface stale-document risk for modules outside the current PR when code churn or validation failures suggest hidden drift.
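The documentation impact map at the heart of these steps can be sketched as linking each changed symbol to the doc files that mention it, so patches stay narrow. This version matches raw text; the file names and symbols are hypothetical, and a real system would parse symbol tables rather than substring-match.

```python
def doc_impact_map(changed_symbols, docs):
    """Link each changed symbol to the doc files that mention it, so the
    assistant patches only the sections likely to drift.

    `docs` maps a doc path to its text.
    """
    impact = {}
    for symbol in changed_symbols:
        hits = [path for path, text in docs.items() if symbol in text]
        if hits:
            impact[symbol] = hits
    return impact

docs = {
    "README.md": "Call `create_user()` to provision accounts.",
    "runbook.md": "If `purge_cache()` fails, restart the worker.",
}
# `delete_user` changed but is undocumented, so it produces no patch target.
impact = doc_impact_map(["create_user", "delete_user"], docs)
```

Symbols with no documentation hits simply fall out of the map; whether that absence should itself be surfaced as a coverage gap is a policy choice for the team.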
Who It Affects
72 of 157 respondents (45.9%) were coded to the automated documentation theme, making it the most prevalent meta-work request in the survey; inter-rater reliability was near-perfect (α = 0.966). The need appears across repositories with dense code, missing documentation, public interfaces, and onboarding materials that quickly drift out of date.
- 72 of 157 respondents (45.9%) were coded to this theme, the most prevalent meta-work request
- 72.6% of respondents in this theme want High or Very High AI support for documentation tasks
- Average AI Preference: 4.07/5
- Average AI Usage: 2.58/5, creating a 1.49-point preference-usage gap
Impact
If this exists, documentation work shifts from a separate, often-skipped task to a lightweight review activity attached to normal code change flow. Teams get targeted updates when interfaces or behavior change, stale docs are detected before drift accumulates, and onboarding improves because README-level and architecture-level documentation stays aligned with the live system instead of depending on heroic manual maintenance.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Reviewers trust the system because each generated statement is traceable to code, tests, or schemas.
- Developers report that documentation for new and changed code 'just exists' without separate documentation sprints.
- Teams report that stale documentation is identified within days of code changes rather than months later.
- Developers describe generated outputs as concise, targeted patches rather than low-value bulk text.
- New team members report clearer README-level and architecture-level documentation during onboarding.
Quantitative Measures
- Increase in the percentage of code-changing pull/merge requests with accepted documentation updates when public interfaces or behavior change.
- Increase in documentation coverage for public interfaces from baseline (target: >80% coverage within 6 months of adoption).
- Reduction in documentation drift, measured as days between a code change and an update to the impacted documentation (target: <14 days drift).
- Reduction in documentation validation failures and stale-document findings per week.
- Higher reviewer acceptance rate for generated documentation patches and lower average number of manual edits per accepted patch.
- Preference-usage gap for documentation assistance narrows from 1.49 to below 0.5 within 12 months of deployment.
Theme Evidence
Generate, update, and validate documentation artifacts directly from code, PRs, specs, and tests (e.g., READMEs, inline comments, API docs, architecture overviews/diagrams). This includes producing new documentation, maintaining accuracy as code evolves, and identifying gaps or staleness. Assign...
Project Description
Compose a role-specific ramp-up plan from the actual repos, docs, ADRs, setup scripts, and examples the learner will touch, then teach through sequenced exercises, answerable questions, and source-linked explanations tailored to that engineer’s background. It should feel less like a chatbot and more like a personalized, traceable onboarding syllabus for this team.
- Learner profile, ramp-up goal, and time budget
- Target repositories, READMEs, examples, and setup scripts
- Internal docs, ADRs, and service catalog entries for the target system
- Resolved internal Q&A threads or support tickets for common setup and domain questions
- Approved external docs and version-specific SDK or framework references
- Senior engineer notes or narrated knowledge dumps approved for onboarding use
- Capture the learner’s goal, starting skill level, time constraints, preferred learning style, and accessible source set.
- Generate an onboarding playbook with prerequisites, setup checks, success criteria, and explicit human touchpoints such as mentor meetings or pairing sessions.
- Produce a codebase map and curated reading path that identifies entry points, core modules, dependencies, and architecture concepts in the order they matter for the learner’s goal.
- Create hands-on exercises tied to the actual target repo or technology and adapt the next step based on completed work and self-reported confidence.
- Answer questions with exact citations to files, docs, or examples, while surfacing permission gaps, conflicting sources, and uncertainty instead of bluffing.
- Turn senior engineers’ notes into draft FAQs, walkthroughs, and onboarding modules, and suggest updates when key setup or architecture files change.
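The sequencing logic behind the playbook and reading path can be sketched as a prerequisite walk truncated by the learner's time budget. The module names, hours, and `prereqs` field are illustrative assumptions:

```python
def build_ramp_plan(goal_modules, modules, hours_budget):
    """Order modules so prerequisites come first, then truncate the plan
    at the learner's time budget."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for prereq in modules[name].get("prereqs", []):
            visit(prereq)
        order.append(name)

    for m in goal_modules:
        visit(m)

    plan, spent = [], 0
    for name in order:
        if spent + modules[name]["hours"] > hours_budget:
            break  # stop at the budget; later modules wait for the next block
        plan.append(name)
        spent += modules[name]["hours"]
    return {"plan": plan, "hours": spent}

modules = {
    "repo-setup":   {"hours": 2},
    "service-arch": {"hours": 3, "prereqs": ["repo-setup"]},
    "first-fix":    {"hours": 4, "prereqs": ["service-arch"]},
}
ramp = build_ramp_plan(["first-fix"], modules, hours_budget=6)
```

With a 6-hour budget the plan covers setup and architecture and defers the hands-on exercise, which is the adapt-to-time-constraints behavior the steps above describe.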
Who It Affects
44 of 157 developers (28.0%) explicitly requested AI help with onboarding, mentoring, or learning new technologies. Demand is both strong and clear: 72.6% of respondents in this theme want high or very high AI support, average preference is 4.07/5 versus current usage of 2.58/5 (a 1.49-point gap), and inter-rater reliability was very high (α = 0.948).
- Average AI Preference: 4.07/5
- Average AI Usage: 2.58/5
- Preference-Usage Gap: 1.49
- 72.6% want High or Very High AI support
- Inter-Rater Reliability: α = 0.948
Impact
If this exists, developers get a repository- and role-specific ramp-up path instead of a pile of disconnected resources: a traceable onboarding checklist, a map of the system, guided readings, and hands-on exercises that lead toward a first meaningful contribution. Newcomers gain a private place to ask basic or context-specific questions without social friction, while senior engineers can turn repeated explanations into reviewed onboarding modules. The net effect is faster ramp-up on both new technologies and existing systems, with human mentors freed to focus on judgment, relationships, and team culture rather than repetitive setup and factual questions.
Constraints & Guardrails
Success Definition
Qualitative Measures
- New joiners report they can ask 'embarrassing' questions safely and get understandable explanations with examples tailored to their context.
- Mentors report fewer repetitive onboarding questions and more time for relationship-building and nuanced guidance.
- Senior engineers report that knowledge-capture workflows reduce time spent repeatedly packaging the same onboarding knowledge.
- Teams report that generated onboarding modules feel accurate, concise, and owned rather than generic or spammy.
- Developers say they trust the tool because each answer shows sources and clearly signals uncertainty.
Quantitative Measures
- Reduce median time-to-first-successful-local-build/run for new joiners.
- Reduce median time-to-first-merged change for new team members.
- Increase completion rate of onboarding checklists and hands-on exercises.
- Decrease repeated onboarding questions in internal Q&A or support channels for the same repository or system.
- Maintain content accuracy ratings of 90%+ in developer spot-checks of source attribution and technical correctness.
- Reduce the theme's preference-usage gap from the 1.49 baseline by increasing sustained use of the ramp-up tool.
Theme Evidence
Act as an adaptive tutor or mentor that tailors explanations, examples, and learning plans to help developers learn new languages, frameworks, and APIs, or ramp up on unfamiliar systems and domains. Includes generating onboarding guides/checklists, packaging institutional knowledge into training...
Project Description
Assemble a context pack from user-selected tickets, commits, incidents, prior messages, and meeting notes, then draft audience-specific updates, replies, rewrites, or proofreading suggestions that preserve the engineer’s intent and voice. The useful contribution here is careful audience adaptation with traceable claims, not generic business prose.
- User-selected work items, status changes, blockers, and owners
- Selected commits, PR notes, incident records, and technical docs
- Prior thread messages and stakeholder questions for the same topic
- Meeting notes or transcripts explicitly chosen by the user
- Stakeholder profile fields such as role, technical fluency, language, and formality expectations
- Approved terminology, translation memory, and confidentiality rules
- Start from a drafting or review action where the user specifies the audience, channel, intent, language, and source pack.
- Extract the facts, deltas since the last message, open risks, dependencies, and asks from the chosen artifacts, and identify what information is still missing for this audience.
- Generate one or more drafts or edit suggestions at the right technical depth, simplifying jargon where appropriate without flattening important nuance.
- For multilingual communication, offer either full translation or targeted language coaching on register and phrasing while preserving the author’s meaning.
- Show sentence-level source links and warnings for unsupported claims, likely overcommitments, or wording that leaks internal detail.
- Run confidentiality and external-communication checks before exporting an editable draft back to the user.
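The traceability and confidentiality checks in the steps above can be sketched as a pre-export gate. All names here (`Claim`, `check_draft`) are hypothetical; a real implementation would attach source links during generation rather than after the fact:

```python
# Hypothetical sketch of a pre-export check: flag sentences with no source
# link and sentences that mention confidential terms.
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    sources: list = field(default_factory=list)  # IDs of selected artifacts

def check_draft(claims, confidential_terms=()):
    """Return warnings for unsupported claims and confidentiality leaks."""
    warnings = []
    for c in claims:
        if not c.sources:
            warnings.append(f"UNSUPPORTED: {c.text!r}")
        for term in confidential_terms:
            if term.lower() in c.text.lower():
                warnings.append(f"CONFIDENTIAL TERM {term!r}: {c.text!r}")
    return warnings

draft = [
    Claim("Rollout finished on the staging cluster.", sources=["PR-412"]),
    Claim("We expect zero downtime next week."),  # no supporting artifact
]
for w in check_draft(draft, confidential_terms=["staging cluster"]):
    print(w)
```

A draft with any warnings would be returned to the user for edits or an explicit override, consistent with the human approval gate described later in this project's success measures.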
Who It Affects
20 of 157 respondents (12.7%) were coded to this theme with high inter-rater reliability (α = 0.93). The requests span communication with internal stakeholders, customers, and external partners; 72.6% wanted high support, yet average usage is only 2.58/5 versus 4.07/5 preference, indicating a substantial unmet need in current communication-support tools.
- Average AI Preference: 4.07/5
- Average AI Usage: 2.58/5
- Preference-Usage Gap: 1.49
- 72.6% want High or Very High AI support
- Inter-Rater Reliability: α = 0.93
Impact
If this capability exists, developers no longer start stakeholder communication from a blank page or manually restate the same technical work for each audience. Instead, they assemble a trusted context pack and receive editable drafts or review suggestions that explain status, risks, and answers at the right level of detail, while preserving intended meaning across languages and reducing repeated explanations in ongoing threads.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Users report that stakeholder messages start from a trustworthy draft or review pass rather than a blank page
- Users report that outputs preserve their authentic voice instead of sounding generic or synthetic
- Users trust the tool because every important claim can be traced to selected source artifacts and unsupported claims are clearly flagged
- Cross-language users report increased confidence in formality and register choices without loss of intended meaning
- Stakeholders report no degradation in clarity, accuracy, or personal touch compared to prior communication practices
Quantitative Measures
- Reduce median time-to-first-draft for stakeholder updates by 50%
- Decrease average number of manual revisions per message by 25% before export
- At least 30% reduction in post-send corrective follow-up messages for the same topic
- <= 5% of exported drafts contain unsupported claims without explicit user override acknowledgment
- Zero autonomous sends: 100% of outbound communications pass through a human approval gate
- >= 60% weekly active usage among users who try the feature at least once
Theme Evidence
Draft, rewrite, proofread, and tailor communications (emails, updates, status messages, explanations) for stakeholders, clients, or other audiences, including simplifying technical details for non-technical readers and adjusting tone or language. Assign when the respondent wants AI to help compose...
Project Description
Use an interactive exploration board for early technical discovery: widen the option set, score competing approaches against user-chosen criteria, and spin up disposable prototype spikes or diagrams for the options worth testing. This is the fuzzy front end of technical research, not a substitute for a formal architecture review.
- Problem statement, constraints, non-goals, and risk tolerance supplied by the user
- Selected repository modules, interfaces, and dependency manifests relevant to the question
- Existing design docs, ADRs, and past decisions for similar problems
- Operational artifacts such as incidents, runbooks, and service objectives when they bear on the trade-off
- Approved technology lists and engineering standards
- Start a session from a design or research question and capture the constraints, assumptions, and evaluation criteria the user wants to hold fixed.
- Generate 3–6 option cards, including at least one deliberately contrasting alternative so the first idea does not dominate the session.
- Attach local precedent, prerequisites, and explicit assumptions to each option using the selected codebase and design artifacts.
- Let the user reweight criteria, eliminate options, add their own candidate, or ask the system to challenge the current set with more extreme or conservative alternatives.
- For selected options, create lightweight spike artifacts such as skeleton code, interface stubs, sample config, or quick diagrams that test the risky part of the idea rather than the whole system.
- Export a shareable comparison log that records the options considered, trade-offs, user edits, prototype results, and open questions.
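The reweightable trade-off matrix implied by the steps above can be sketched as a simple weighted scoring function; the criteria, weights, and scores are illustrative user inputs, not part of the proposed system:

```python
# Hypothetical sketch: rank option cards by user-weighted criteria scores.
def rank_options(scores, weights):
    """scores: {option: {criterion: 1-5 rating}}; weights: {criterion: float}.
    Returns (weighted score, option) pairs, best first."""
    total_w = sum(weights.values())
    ranked = []
    for option, by_criterion in scores.items():
        weighted = sum(weights[c] * s for c, s in by_criterion.items()) / total_w
        ranked.append((round(weighted, 2), option))
    return sorted(ranked, reverse=True)

weights = {"operational risk": 3.0, "delivery speed": 2.0, "team familiarity": 1.0}
scores = {
    "extend existing service": {"operational risk": 4, "delivery speed": 3, "team familiarity": 5},
    "new sidecar component":   {"operational risk": 2, "delivery speed": 4, "team familiarity": 2},
}
print(rank_options(scores, weights))
```

Letting the user edit `weights` and rerun the ranking is the "reweight criteria" interaction; adding a row to `scores` is the "add your own candidate" interaction.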
Who It Affects
18 of 157 developers (11.5%) explicitly asked for AI help with research, brainstorming, option generation, or rapid prototyping, indicating a clear need among those doing early design and technical discovery work.
- 72.6% want High or Very High AI support for this theme
- Average AI Preference: 4.07/5
- Average AI Usage: 2.58/5
- Preference-Usage Gap: 1.49
Impact
If this capability exists, developers can move from a vague design question to a reviewable set of distinct options, explicit assumptions, and lightweight prototype spikes in hours instead of days. The main benefit is reduced tunnel vision: teams consider more than one plausible approach, make trade-offs visible, and validate risky ideas earlier without handing final decisions to the tool.
Constraints & Guardrails
Success Definition
Qualitative Measures
- Developers describe the system as a technical sounding board or thought partner rather than an answer generator.
- Users report that the tool surfaced at least one plausible option they had not previously considered.
- Teams report that trade-off matrices, assumptions, and unknowns make early design discussions easier to review and align around.
- Developers say prototype spikes helped validate risky assumptions before committing to full implementation.
Quantitative Measures
- Reduce time from initial problem statement to a shareable option set by 50%
- Increase average number of distinct options considered per design discussion from 1–2 to 4–6
- Reduce time to first runnable spike or prototype for selected options by 30%
- Achieve >70% of exploration sessions where the developer uses steer, expand, challenge, or add-option actions beyond the initial generation
- Reduce the preference-usage gap for this theme from 1.49 to below 0.5 in a follow-up deployment study
Theme Evidence
Serve as a technical sounding board to expand solution options, propose architectures or approaches, compare tradeoffs, and rapidly explore directions (including lightweight prototypes/mockups). Assign when the respondent wants AI to help generate, evaluate, or iterate on ideas and design...