Developer AI Survey: Thematic Analysis
989 Microsoft developer responses across 5 engineering workflow categories
  • 989 responses
  • 5 categories
  • 38 themes
  • 93.3% coverage
  • Krippendorff's α = 0.911

Each theme includes developer quotes with PID attribution.

Methodology: 6-Phase Analysis Pipeline

  1. Initial Coding: 3 models independently
  2. Theme Consolidation: cross-model merge
  3. Human Review: expert reconciliation
  4. Systematic Coding: all responses, all models
  5. Inter-Rater Reliability: Krippendorff's α
  6. Prevalence & Synthesis: majority vote

Models: GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6
Methods & Validity
  • Unit of analysis: Individual survey response
  • Prevalence: Unique PIDs with 2+ model agreement (majority vote)
  • Multi-coding: One response can be assigned multiple themes
  • Coverage: Percentage of responses assigned at least one theme
  • IRR: All responses coded by all 3 models; Krippendorff's α computed per theme
  • Convergence: How many of the 3 models independently discovered a theme during Phase 1
  • 2 analysis tracks
  • 5 task categories
  • 3 LLM coders
  • Batch size: 20
  • IRR metric: Krippendorff's α

This document describes the full methodology for a multi-model qualitative analysis pipeline that identifies research opportunities (what developers want AI to do) and design constraints (what developers do not want AI to handle) from open-ended survey responses. Three frontier LLMs serve as independent coders, with inter-rater reliability calculated via Krippendorff's alpha and consensus reached through majority vote. A human review gate validates codebooks before systematic coding begins.

Research Questions & Data

Survey Questions (per category)

Opportunity track: "Where do you want AI to play the biggest role in [category] activities?"
Open-ended responses capturing desired capabilities and unmet needs.
Constraint track: "What aspects do you NOT want AI to handle and why?"
Open-ended responses capturing guardrails, no-go zones, and boundary conditions.

Unit of Analysis

Each unit is a single respondent's open-ended answer to one of the two questions within a category. Respondents answered about 2–3 categories each, and a single response may be assigned multiple theme codes.

Task Categories & Response Counts

Category | Respondents | Tasks Covered
Development | 816 | Coding, Bug Fixing, Perf Optimization, Refactoring, AI Development
Design & Planning | 548 | System Architecture, Requirements Gathering, Project Planning
Meta-Work | 532 | Documentation, Communication, Mentoring, Learning, Research
Quality & Risk | 401 | Testing & QA, Code Review / PRs, Security & Compliance
Infrastructure & Ops | 283 | DevOps / CI-CD, Environment Setup, Monitoring, Customer Support

Data Quality Note

Approximately 11% of responses contain data quality issues detected during coding: misplaced answers (respondent wrote a "want" answer in the "not want" field or vice-versa), back-references to prior answers that are unintelligible on their own, or terse non-responses. These are flagged with ISSUE_* codes rather than discarded, avoiding pre-filter bias (see Methodological Controls).

Pipeline Overview

The analysis runs in two stages with a human review gate between them. Both the opportunity and constraint tracks follow identical process steps but with track-specific prompts and codebooks.

Opportunity Track
  1. Theme Discovery: 3 models discover "want" themes independently
  2. Reconciliation: GPT-5.2 merges themes into unified codebook

Constraint Track
  1. Theme Discovery: 3 models discover "not want" themes independently
  2. Reconciliation: GPT-5.2 merges constraint themes into codebook

Human Gate: Author Review & Codebook Approval
  Researcher validates themes, merges/splits/renames as needed, and approves before coding proceeds.

Both tracks, after approval
  3. Systematic Coding: 3 models code every response (batch=20)
  4. Triangulation: IRR Calculation & Majority-Vote Consensus (Krippendorff's α per theme, 2-of-3 majority vote, ISSUE code aggregation)

Outputs
  • Rich Opportunity Cards: top-5 per category, 3-model synthesis
  • Constraint Maps & Design Principles: no-go zones with guardrail guidance

Stage 1: Theme Discovery & Reconciliation

Phase 1: Independent 3-Model Discovery

Each model receives all responses for a given category and independently proposes 4–15 themes with supporting evidence (PIDs). The prompt instructs models to create specific, actionable, problem-focused themes and to allow multi-coding.

Model | Provider | Thinking Mode | Role
GPT-5.2 | OpenAI | reasoning_effort="high" | Independent coder & reconciler
Gemini 3.1 Pro | Google | thinking_level="HIGH" | Independent coder
Claude Opus 4.6 | Anthropic | thinking: adaptive, effort: high | Independent coder

Inputs

  • Open-ended survey responses with PIDs (e.g., 816 Development responses or 548 Design & Planning responses)
  • Category name and context description

Outputs (per category, per model)

  • Theme codebook: code, name, description, supporting PIDs
  • Per-response codings: PID → [theme_code_1, theme_code_2, ...]
  • Files: {category}_themes_{model}.json (15 opportunity files + 15 constraint files)

Phase 2: GPT-5.2 Reconciliation

A single reconciliation model (GPT-5.2) receives all three models' theme sets and produces a unified codebook per category by:

  1. Identifying overlapping themes across models (same concept, different names)
  2. Merging overlapping themes into single unified entries
  3. Retaining single-model themes only if substantive (≥3 PIDs)
  4. Dropping themes that are too vague or have very few supporting responses
  5. Targeting 5–10 unified themes per category

Each unified theme records its source_models (which of the three models independently proposed it) and source_codes (original model-specific code names), providing full provenance.

Outputs

  • consolidated_codebook.json — all 5 category codebooks (opportunity track)
  • constraint_codebook.json — all 5 category codebooks (constraint track)

Human Review Gate Required

The pipeline pauses for researcher review before systematic coding begins. The researcher:

  • Reviews each proposed theme and reads sample supporting responses
  • Checks themes for specificity, granularity, and completeness
  • Can keep, rename, merge, split, or remove any theme
  • Can add themes the models missed
  • Documents rationale for all changes

Systematic coding (Stage 2) does not proceed until the codebook is explicitly approved.

Stage 2: Systematic Coding & Analysis

Coding Protocol

All three models independently re-code every response against the finalized codebook. Key protocol elements:

Parameter | Value | Rationale
Batch size | 20 responses per API call | Balances context window usage against API call count
Rationale-first | Model writes rationale before assigning codes | Improves accuracy via chain-of-thought; enables auditability
Cross-response context | Each response shown alongside opposite-question answer | Enables misresponse detection (ISSUE codes)
Multi-coding | 0, 1, or many themes per response | Captures full semantic content
Codebook-only | Only codebook codes or ISSUE_* codes allowed | Prevents code drift across batches
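
The codebook-only rule can be enforced mechanically when parsing each batch's JSON output. A minimal sketch, assuming the output format shown in the Phase 4 prompt below; function and variable names are illustrative, not the pipeline's actual code:

```python
import json

def parse_batch_output(raw, allowed_codes):
    """Parse one batch's JSON output and reject codes outside the codebook.

    Codes must come from the approved codebook or carry the ISSUE_ prefix;
    anything else is stripped and reported, preventing code drift.
    """
    codings, rejected = [], []
    for entry in json.loads(raw):
        themes = []
        for code in entry["themes"]:
            if code in allowed_codes or code.startswith("ISSUE_"):
                themes.append(code)
            else:
                rejected.append((entry["pid"], code))
        codings.append({"pid": entry["pid"],
                        "rationale": entry.get("rationale", ""),
                        "themes": themes})
    return codings, rejected
```

Rejected codes can then be logged or sent back for a retry, rather than silently entering the coding matrix.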

ISSUE Code System

During systematic coding, models flag data quality problems rather than silently discarding responses:

Code | Meaning | Example
ISSUE_WRONG_FIELD | Respondent answered the opposite question | Describing constraints in the "want" field
ISSUE_BACK_REFERENCE | References a prior answer; unintelligible alone | "Same as before", "see above"
ISSUE_NON_RESPONSE | Terse non-answer with no analyzable content | "N/A", "none", "no"

Models may create additional ISSUE_* codes if they encounter other data quality problems. The ISSUE prefix ensures these are never confused with substantive themes.

Inter-Rater Reliability (IRR)

Agreement between the three LLM coders is measured per theme using Krippendorff's alpha (α), the standard multi-rater reliability coefficient for qualitative research. For each theme, a binary (present/absent) coding matrix is built across all responses, and α is calculated at the nominal level.

Range | Interpretation
α ≥ 0.80 | Excellent agreement: publishable
α ≥ 0.67 | Acceptable agreement: tentative conclusions
α ≥ 0.50 | Moderate agreement: use with caution
α < 0.50 | Poor agreement: unreliable for this theme

Additionally, pairwise Cohen's kappa (κ) is calculated for each model pair (GPT–Gemini, GPT–Opus, Gemini–Opus) and 3-rater percent agreement (all three models assign the same code) is reported per theme.
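
A concrete sketch of the per-theme computations described above, for binary present/absent data with complete codings from all three raters (pure Python; function names are illustrative):

```python
def krippendorff_alpha_binary(units):
    """Nominal Krippendorff's alpha for one theme's present/absent codings.

    `units` holds one list per response, e.g. [1, 0, 1] for the three
    raters' judgments. Uses alpha = 1 - D_o / D_e via the coincidence matrix.
    """
    units = [u for u in units if len(u) >= 2]   # need pairable values
    n = sum(len(u) for u in units)              # total pairable values
    ones = sum(sum(u) for u in units)
    zeros = n - ones
    expected = 2 * ones * zeros                 # sum over c != k of n_c * n_k
    if n <= 1 or expected == 0:
        return 1.0                              # no variation to disagree on
    observed = sum(                             # off-diagonal coincidences
        2 * sum(u) * (len(u) - sum(u)) / (len(u) - 1) for u in units
    )
    return 1.0 - (n - 1) * observed / expected

def cohens_kappa_binary(a, b):
    """Pairwise Cohen's kappa for two raters' binary codings of one theme."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p1a, p1b = sum(a) / n, sum(b) / n
    p_e = p1a * p1b + (1 - p1a) * (1 - p1b)          # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

For the pipeline's per-theme matrices, `units` would contain one row per response with one entry per model; responses with missing codings are dropped from the pairable set.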

Consensus Voting

Final theme assignments use a majority vote: 2 of 3 models must agree for a theme to be assigned to a response. This is applied independently per response and per theme code.

ISSUE code handling

If 2+ models flag any ISSUE code for a response (regardless of which specific ISSUE code), the response receives a generic ISSUE marker and is excluded from substantive analysis. This prevents a single aggressive model from filtering out too many responses.
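
Both rules fit in a few lines. A minimal sketch, with illustrative names (the pipeline's actual function may differ):

```python
from collections import Counter

def consensus(codings_by_model):
    """Majority-vote consensus across three model codings for one response.

    codings_by_model: e.g. {"gpt": [...], "gemini": [...], "opus": [...]}.
    Returns ("ISSUE", []) if 2+ models flagged any ISSUE_* code,
    otherwise ("OK", [theme codes with 2-of-3 agreement]).
    """
    # ISSUE aggregation: any ISSUE_* code counts, regardless of which one.
    issue_votes = sum(
        1 for codes in codings_by_model.values()
        if any(c.startswith("ISSUE_") for c in codes)
    )
    if issue_votes >= 2:
        return ("ISSUE", [])
    # Count each substantive theme at most once per model, then keep 2+ votes.
    votes = Counter(
        c for codes in codings_by_model.values()
        for c in set(codes) if not c.startswith("ISSUE_")
    )
    return ("OK", sorted(t for t, n in votes.items() if n >= 2))
```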

Rich Opportunity Cards

For the top 5 themes per category (by prevalence), all three models independently generate detailed opportunity cards including:

  • Problem statement and proposed capability description
  • Required context sources and capability steps
  • Impact description with supporting evidence quotes
  • Success criteria (qualitative and quantitative measures)
  • Constraints and guardrails drawn from the constraint track
  • Prevalence data and quantitative signals (AI preference, usage gap)

Cards from the three models are merged using a union-and-deduplicate strategy: longest title wins, context sources are combined (max 7), capability steps use the longest sequence (max 6), and constraints are deduplicated (max 4).
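
The merge strategy above can be sketched as follows; the field names are illustrative placeholders, not the pipeline's actual schema:

```python
def merge_cards(cards):
    """Union-and-deduplicate merge of one theme's cards from the 3 models.

    Longest title wins; context sources are unioned (capped at 7); the
    longest capability-step sequence is kept (capped at 6); constraints
    are deduplicated in first-seen order (capped at 4).
    """
    def dedupe(items, cap):
        seen, out = set(), []
        for item in items:
            if item not in seen:
                seen.add(item)
                out.append(item)
        return out[:cap]

    return {
        "title": max((c["title"] for c in cards), key=len),
        "context_sources": dedupe(
            [s for c in cards for s in c["context_sources"]], 7),
        "capability_steps": max(
            (c["capability_steps"] for c in cards), key=len)[:6],
        "constraints": dedupe(
            [g for c in cards for g in c["constraints"]], 4),
    }
```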

Constraint Maps & Design Principles

Constraint-track prevalence is calculated identically to the opportunity track. The top no-go zones per category are documented with:

  • Zone name, description, and prevalence count
  • Up to 10 supporting respondent quotes
  • 3–6 synthesized design principles per category (generated by GPT-5.2)
  • Each principle includes implementation guidance and derivation provenance

Methodological Controls

The pipeline incorporates several controls designed to increase rigor beyond what a single-model analysis can provide.

Control | Mechanism | What It Mitigates
Multi-LLM triangulation | 3 frontier models from different families code independently | Single-model bias, training-data artifacts, idiosyncratic interpretations
Rationale-first coding | Models write reasoning before assigning codes | Snap-judgment errors; enables post-hoc audit of coding decisions
Cross-response context | Both "want" and "not want" answers shown to coder | Misresponse blindness; enables ISSUE_WRONG_FIELD detection
ISSUE code system | Flag quality problems in-band rather than pre-filtering | Pre-filter bias from silently dropping ambiguous responses
Idempotent checkpointing | Staleness detection skips phases whose inputs haven't changed | Wasted computation; ensures reproducible reruns
Consensus merging | Majority vote (2/3) for codes; union-and-deduplicate for synthesis | Noise from single-model outlier codes; incomplete synthesis from any single model

Design Decisions & Trade-offs

Decision | Rationale | Trade-off
3 models, not 2 or 5 | Minimum for meaningful IRR (Krippendorff's α); covers 3 major LLM families | Higher API cost (∼3×); manageable with batch parallelism
HIGH thinking for all models | Qualitative coding benefits from extended reasoning; reduces surface-level pattern matching | Slower inference, higher token cost (thinking tokens billed at output rate)
Batch size of 20 | Enough responses for cross-response pattern recognition; fits comfortably in context windows | More API calls than larger batches, but avoids context truncation risks
Majority vote (2/3) | Balances sensitivity and specificity; equivalent to >50% agreement threshold | May miss themes where only one model sees a valid pattern
Human gate before coding | Prevents systematic errors from propagating through the entire coding phase | Introduces a manual pause in an otherwise automated pipeline
No pre-filtering of responses | ISSUE codes capture quality problems without discarding data points | Models must process noisy responses; ISSUE detection is itself imperfect
GPT-5.2 as sole reconciler | Reconciliation requires structured comparison rather than independent generation; one model suffices | Reconciliation may inherit GPT-specific biases in theme naming
Streaming for Claude Opus | Avoids 10-minute HTTP timeout on long-running inference | More complex error handling; no retry on partial stream failures

Limitations & Mitigations

Limitation | Impact | Mitigation
LLM nondeterminism | Exact codings may vary across runs even with identical inputs | 3-model triangulation smooths out individual variance; IRR quantifies remaining disagreement; idempotent checkpointing ensures reproducible runs when inputs are stable
LLM rationalization | Models may construct plausible but incorrect rationales | Multi-model disagreement surfaces cases where rationalization diverges; majority vote filters single-model confabulations
Prompt sensitivity | Different prompt wording could yield different themes | Codebook-anchored coding constrains coder freedom; prompts are documented and versioned for replication
Not replacing human qualitative research | LLM coders lack lived experience and may miss cultural nuances | Human review gate validates codebook; methodology is positioned as accelerating qualitative work, not replacing it; all outputs include supporting quotes for human verification
Survey sample | 860 Microsoft developers may not represent the broader industry | Out of scope for the analysis methodology itself; noted as a limitation of the source data
LLM knowledge contamination | Models may have been trained on similar survey analyses | Codebook-first design constrains output to researcher-approved themes; verbatim quotes provide verifiable evidence independent of model knowledge

Artifacts & Replication

Artifact Inventory

Phase | File Pattern | Count | Description
Data | {category}_responses.json | 5 | Extracted open-ended responses with PIDs
Data | {category}_quantitative.json | 5 | Aggregated Likert-scale metrics per task
Data | {category}_do_not_want_responses.json | 5 | Extracted constraint responses with PIDs
Stage 1 | {category}_themes_{model}.json | 15 | Independent opportunity theme discoveries
Stage 1 | {category}_constraint_themes_{model}.json | 15 | Independent constraint theme discoveries
Stage 1 | consolidated_codebook.json | 1 | Unified opportunity codebook (all categories)
Stage 1 | constraint_codebook.json | 1 | Unified constraint codebook (all categories)
Stage 2 | {category}_phase4_codings.json | 5 | 3-model systematic codings with rationales
Stage 2 | phase5_irr_results.json | 1 | Krippendorff's α, Cohen's κ, agreement %
Stage 2 | phase6_prevalence_results.json | 1 | Majority-vote consensus and theme prevalence
Stage 2 | phase6_rich_opportunities.json | 1 | Top-5 opportunity cards per category (3-model synthesis)
Stage 2 | constraint_maps.json | 1 | No-go zones and design principles

Dependency Chain

data.xlsx
  ↓
{cat}_responses.json, {cat}_do_not_want_responses.json, {cat}_quantitative.json
  ↓
{cat}_themes_{model}.json (3×5 = 15 files), {cat}_constraint_themes_{model}.json (15 files)
  ↓
consolidated_codebook.json, constraint_codebook.json
  ↓
■ HUMAN REVIEW GATE
  ↓
{cat}_phase4_codings.json (5 files, 3 models each)
  ↓
phase5_irr_results.json, phase6_prevalence_results.json
  ↓
phase6_rich_opportunities.json, constraint_maps.json

Staleness Detection

Every pipeline phase checks whether its output is stale relative to its inputs by comparing file modification times. If all inputs are older than the output, the phase is skipped. If any input is newer, the output is regenerated. This enables:

  • Incremental reruns: updating one category's theme discovery only regenerates downstream outputs for that category
  • Safe restarts: if the pipeline crashes mid-phase, only the incomplete phase reruns
  • Force override: --force flag bypasses staleness checks for full regeneration

How to Rerun

  1. Ensure API keys are set in .env for OpenAI, Google, and Anthropic
  2. Install dependencies: uv sync
  3. Run full pipeline: bash run_full_pipeline.sh
  4. Pipeline pauses after Stage 1 for human codebook review
  5. After approval, Stage 2 runs automatically
  6. To force regeneration: bash run_full_pipeline.sh --force
  7. To rerun a single category: uv run phase4_systematic_coding.py design_planning

Appendix

Opportunity Codebook (All 5 Categories, 48 Themes)

Unified codebook produced by GPT-5.2 reconciliation of themes independently discovered by all three models. Each theme lists which models independently identified it.

Development 10 themes

Code | Theme | Models
refactoring_modernization | Automated Refactoring, Modernization & Tech-Debt Reduction | GPT, Gemini, Opus
boilerplate_scaffolding_feature_codegen | Boilerplate, Scaffolding & Routine Feature Code Generation | GPT, Gemini, Opus
automated_testing_validation | Automated Test Generation, Coverage & Change Validation | GPT, Gemini, Opus
debugging_root_cause_fixing | Debugging, Root Cause Analysis & Bug Fix Assistance | GPT, Gemini, Opus
repo_wide_context_dependency_awareness | Repo-Wide Context, Dependency Awareness & Safe Multi-File Changes | GPT, Gemini, Opus
code_quality_review_security_compliance | Code Quality, Review Automation, Standards & Security/Compliance Guidance | GPT, Gemini, Opus
performance_profiling_optimization | Performance Profiling & Optimization Suggestions | GPT, Gemini, Opus
architecture_design_planning_support | Architecture, Design Brainstorming & Planning Support | GPT, Gemini, Opus
devops_ci_cd_iac_workflow_automation | DevOps, CI/CD, IaC & Engineering Workflow Automation | GPT, Gemini, Opus
documentation_knowledge_retrieval_onboarding | Documentation Generation, Knowledge Retrieval & Onboarding/Learning Support | GPT, Gemini, Opus

Design & Planning 10 themes

Code | Theme | Models
requirements_gathering_synthesis | Requirements Gathering, Synthesis & Clarification | GPT, Gemini, Opus
architecture_design_generation | Architecture & System Design Generation/Iteration | GPT, Gemini, Opus
interactive_brainstorming_design_partner | Interactive Brainstorming & Design Copilot | GPT, Gemini, Opus
tradeoff_decision_support_simulation | Trade-off Analysis, What-if Simulation & Decision Support | GPT, Gemini, Opus
design_validation_risk_edge_cases | Design Validation, Risk Assessment & Edge-Case Discovery | GPT, Gemini, Opus
project_planning_tasking_status_automation | Project Planning, Ticket/Task Breakdown & Status Automation | GPT, Gemini, Opus
documentation_spec_diagram_generation | Documentation, Specs & Diagram/Artifact Generation | GPT, Gemini, Opus
context_retrieval_codebase_and_institutional_memory | Context Retrieval: Codebase Understanding & Institutional Memory | GPT, Gemini, Opus
research_and_information_synthesis | Research, Information Gathering & Knowledge Synthesis | GPT, Gemini, Opus
trustworthy_outputs_with_citations | Trustworthy Outputs: Higher Accuracy & Verifiable Citations | GPT, Gemini, Opus

Quality & Risk 9 themes

Code | Theme | Models
automated_test_generation_and_quality_gates | Automated Test Generation, Maintenance & Quality Gates | GPT, Gemini, Opus
intelligent_pr_code_review | Intelligent PR/Code Review Assistant | GPT, Gemini, Opus
security_vulnerability_detection_and_fix_guidance | Security Vulnerability Detection & Fix Guidance | GPT, Gemini, Opus
compliance_and_audit_automation | Compliance, Standards & Audit Process Automation | GPT, Gemini, Opus
proactive_risk_monitoring_and_prediction | Proactive Risk Monitoring, Prediction & Anomaly Detection | GPT, Gemini, Opus
debugging_root_cause_and_failure_triage | Debugging, Root Cause Analysis & Failure Triage | GPT, Gemini, Opus
knowledge_retrieval_and_standards_guidance | Knowledge Retrieval, Summarization & Standards Guidance | GPT, Gemini, Opus
agentic_workflow_automation_and_remediation | Agentic Workflow Automation & Automated Remediation | GPT, Gemini, Opus
ai_driven_exploratory_chaos_and_fuzz_testing | AI-Driven Exploratory, Chaos & Fuzz Testing | Opus only

Infrastructure & Ops 10 themes

Code | Theme | Models
intelligent_monitoring_alerting_anomaly_detection | Intelligent Monitoring, Alerting & Anomaly Detection | GPT, Gemini, Opus
incident_response_rca_mitigation_self_heal | Incident Response Automation (Triage, RCA, Mitigation, Self-Heal) | GPT, Gemini, Opus
cicd_pipeline_and_deployment_automation | CI/CD Pipeline & Deployment Automation | GPT, Gemini, Opus
infrastructure_provisioning_and_iac_generation | Automated Environment Setup & IaC Generation | GPT, Gemini, Opus
infrastructure_maintenance_upgrades_security_cost_optimization | Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization | GPT, Gemini, Opus
customer_support_triage_and_autoresponse | Customer Support Triage & Auto-Response | GPT, Gemini, Opus
knowledge_management_doc_search_and_system_context | Knowledge Management, Documentation Search & System Context | GPT, Gemini, Opus
ops_toil_automation_and_script_generation | Ops Toil Automation & Script Writing/Debugging | GPT, Gemini, Opus
testing_quality_validation_and_safe_deploy | Testing, Quality Validation & Safer Releases | GPT, Gemini, Opus
ai_tooling_ux_accuracy_and_cohesive_workflows | Better AI Tooling UX (Accuracy, Control & Cohesive Workflows) | GPT, Gemini, Opus

Meta-Work 9 themes

Code | Theme | Models
automated_documentation | Automated Documentation Generation & Maintenance | GPT, Gemini, Opus
knowledge_search_and_discovery | Project Knowledge Search & Discovery (with Traceable Sources) | GPT, Gemini, Opus
brainstorming_and_solution_exploration | Brainstorming, Option Generation & Rapid Exploration | GPT, Gemini, Opus
personalized_learning_and_upskilling | Personalized Learning for New Technologies | GPT, Gemini, Opus
team_onboarding_and_mentoring | Team Onboarding, Mentoring & Institutional Knowledge Transfer | GPT, Gemini, Opus
stakeholder_communication_support | Stakeholder/Client Communication Drafting & Translation | GPT, Gemini, Opus
meeting_assistance | Meeting Scheduling, Notes, Summaries & Action Items | GPT, Gemini, Opus
planning_prioritization_and_status_tracking | Planning, Prioritization, Blocker Detection & Status Reporting | GPT, Gemini, Opus
proactive_personal_agent_and_admin_automation | Proactive Personal Agent & Routine Admin Automation | GPT, Gemini, Opus

Constraint Codebook (All 5 Categories, 50 Themes)

Unified constraint codebook produced by GPT-5.2 reconciliation. Captures what developers do not want AI to handle.

Development 10 themes

Code | Theme | Models
no_autonomous_architecture_system_design | No Autonomous Architecture or System Design Decisions | GPT, Gemini, Opus
no_large_unscoped_refactors | No Large, Unscoped, or Sweeping Codebase Changes | GPT, Gemini, Opus
no_autonomous_execution_merge_deploy_or_agentic_control | No Autonomous Execution, Merging/Deploying, or Agentic Control | GPT, Gemini, Opus
no_complex_debugging_or_critical_bug_fixes | No AI Ownership of Complex Debugging or Critical Bug Fixes | GPT, Gemini, Opus
no_security_privacy_secrets_handling | No Security/Privacy-Sensitive Work or Secrets Handling | GPT, Gemini, Opus
no_autonomous_performance_optimization | No Autonomous Performance Optimization | GPT, Gemini, Opus
no_ai_deciding_requirements_business_logic_or_api_ux | No AI-Led Requirements, Core Business Logic, or API/UX Decisions | GPT, Gemini
preserve_developer_agency_learning_and_job_ownership | Preserve Developer Agency, Learning, and Ownership | GPT, Gemini, Opus
avoid_ai_when_unreliable_contextless_hard_to_verify_or_intrusive | Avoid AI Output That Is Unreliable, Contextless, Hard to Verify, or Intrusive | GPT, Gemini, Opus
no_constraints_open_to_ai_help | No Specific No-Go Zones (Open to AI Help) | GPT, Gemini

Design & Planning 10 themes

Code | Theme | Models
human_accountability_final_decisions | No AI Final Decision-Making (Human Accountability Required) | GPT, Gemini, Opus
human_led_architecture_design | No AI as Primary System Architect / High-Level Designer | GPT, Gemini, Opus
no_ai_project_management_task_assignment | No AI Running Project Management | GPT, Gemini, Opus
no_ai_requirements_stakeholder_elicitation | No AI-Led Requirements Gathering or Stakeholder Alignment | GPT, Gemini, Opus
no_ai_empathy_team_dynamics | No Replacement of Human Empathy, Collaboration, or Interpersonal Dynamics | GPT, Gemini, Opus
ai_assistant_human_in_loop | No Autopilot: AI Should Assist with Human-in-the-Loop Oversight | GPT, Gemini, Opus
trust_accuracy_and_context_limitations | Avoid AI for High-Stakes Work Due to Reliability & Missing Context | GPT, Gemini, Opus
privacy_confidentiality_ip_and_message_control | No AI Handling Sensitive/Confidential Data or Uncontrolled Messaging | GPT, Gemini, Opus
no_ai_vision_strategy_creativity_taste | No AI Owning Product Vision, Strategy, or Creative Judgments | GPT, Gemini
no_constraints_or_unsure | No Constraints Stated / Welcome Full AI Involvement | GPT, Gemini, Opus

Quality & Risk 10 themes

Code | Theme | Models
human_final_decision_and_accountability | Humans Must Make Final High-Stakes Decisions | GPT, Gemini, Opus
no_autonomous_code_or_production_actions | No Autonomous Code/Repo/Production Actions Without Approval | GPT, Gemini, Opus
human_code_review_gate_required | Human Code Review / PR Approval Must Remain the Gate | GPT, Gemini, Opus
security_and_compliance_must_be_human_led | Security, Compliance, and Threat Modeling Must Be Human-Led | GPT, Gemini, Opus
no_sensitive_data_or_credentials_access | Do Not Give AI Access to Sensitive/Customer Data or Credentials | GPT, Gemini, Opus
ai_outputs_must_be_verifiable_and_not_self_validated | AI Must Be Reliable, Verifiable, and Not Self-Validated | GPT, Gemini, Opus
humans_own_requirements_architecture_and_tradeoffs | Humans Must Own Requirements, Architecture, and Trade-Offs | GPT, Gemini, Opus
human_led_test_strategy_intent_and_signoff | Test Strategy and Sign-Off Must Be Human-Led | GPT only
preserve_human_ethics_empathy_and_human_centric_work | Preserve Human Ethics, Empathy, and Human-Centric Work | GPT, Gemini
no_constraints_stated | No Specific No-Go Areas Stated | GPT, Opus

Infrastructure & Ops 10 themes

Code | Theme | Models
no_direct_customer_interaction | No Direct AI-to-Customer Interaction | GPT, Gemini, Opus
no_autonomous_production_changes | No Autonomous Production Deployments or Changes | GPT, Gemini, Opus
human_approval_before_consequential_actions | Human Approval Required Before Consequential Actions | GPT, Opus
no_security_permissions_secrets_management | No AI Management of Security, Access, Permissions, or Secrets | GPT, Gemini, Opus
no_autonomous_incident_response_or_overrides | No Autonomous Incident Response or Critical Overrides | GPT, Gemini, Opus
avoid_ai_for_high_precision_deterministic_work | Avoid AI for High-Precision/Deterministic Work | GPT, Gemini, Opus
no_full_autonomy_for_environment_setup_maintenance | No Full Autonomy for Environment Setup and Maintenance | GPT, Gemini
preserve_human_learning_and_accountability | Preserve Human Learning, System Understanding, and Accountability | GPT, Gemini, Opus
no_ai_initiated_irreversible_or_destructive_data_actions | No AI-Initiated Irreversible/Destructive Data Operations | GPT, Gemini, Opus
no_constraints_expressed_or_pro_automation | No Constraints Expressed / Comfortable with Broad Automation | GPT, Gemini, Opus

Meta-Work 10 themes

Code | Theme | Models
human_led_mentoring_onboarding | Keep mentoring and onboarding human-led | GPT, Gemini, Opus
human_authored_communication | Keep interpersonal communications human-authored | GPT, Gemini, Opus
human_review_required_before_sending_or_publishing | No autonomous sending/publishing without human review | GPT, Gemini, Opus
no_confidential_or_sensitive_data | Keep AI away from confidential or sensitive information | GPT, Gemini, Opus
preserve_hands_on_learning | Don't outsource learning and skills development to AI | GPT, Gemini, Opus
preserve_human_research_and_ideation | Keep research/brainstorming primarily human | GPT, Gemini, Opus
human_accountability_for_high_stakes_decisions | High-stakes decisions must remain human-led | GPT, Gemini, Opus
avoid_unvetted_documentation | AI-generated documentation must be vetted | GPT, Gemini, Opus
ai_outputs_not_trustworthy_as_primary_source | Don't treat AI output as trustworthy/authoritative | Opus only
no_constraints_or_unsure | No constraints stated / unsure | GPT, Gemini, Opus

Coding Prompt: Opportunity Track (Phase 4)
You are a qualitative research coder. Your task is to systematically code
each "WANT" response using ONLY the themes from the provided codebook.

Each response is shown alongside the same respondent's answer to a related
question about what they do NOT want AI to handle, for additional context.

CODEBOOK:
{codebook themes listed here}

ISSUE CODES (assign when a response has data quality issues):
- ISSUE_WRONG_FIELD: The respondent appears to have answered the other question
- ISSUE_BACK_REFERENCE: Response references a prior answer and is unintelligible
  on its own
- ISSUE_NON_RESPONSE: Terse non-answer with no analyzable content
- You may create other ISSUE_* codes if you encounter a different type of data
  quality problem

INSTRUCTIONS:
1. Read each response carefully
2. For each response, write a brief rationale
3. Then assign ALL applicable theme codes from the codebook
4. A response can have 0, 1, or multiple themes
5. Only use codes from the codebook or ISSUE codes
6. If no themes apply, return an empty array

RESPONSES TO CODE:
[Batches of 20 responses, each with context from opposite question]

OUTPUT FORMAT:
[
  {"pid": 8, "rationale": "...", "themes": ["theme_code_1", "theme_code_2"]},
  {"pid": 11, "rationale": "...", "themes": ["ISSUE_BACK_REFERENCE"]}
]

Return ONLY the JSON array, no other text.
Coding Prompt: Constraint Track

Identical structure to the opportunity track prompt, but with:

  • Constraint codebook themes replacing opportunity themes
  • "NOT WANT" responses as the primary coding target
  • "WANT" responses shown as cross-response context
  • Same ISSUE code system applies
Theme Discovery Prompt Template
You are analyzing open-ended survey responses from software developers about
where they want AI assistance in their work. Your task is to identify themes
in these responses.

Guidelines for theme creation:
- Themes should be SPECIFIC and ACTIONABLE
  (e.g., "Automated test generation for edge cases" not just "Testing")
- Themes should be PROBLEM-FOCUSED (describe the pain point, not a solution)
- A response can belong to MULTIPLE themes
- Aim for 4-15 themes that capture the major patterns

For each response, return:
{
  "pid": <participant ID>,
  "themes": ["theme_code_1", "theme_code_2", ...]
}

Also provide a theme codebook:
{
  "themes": [
    {
      "code": "snake_case_theme_code",
      "name": "Human-Readable Theme Name",
      "description": "What this capability means and why developers want it",
      "pids": [list of PIDs expressing this]
    }
  ],
  "codings": [
    {"pid": 123, "themes": ["theme_code_1", "theme_code_2"]}
  ]
}

RESPONSES TO ANALYZE:
[All responses for the category]
Theme Reconciliation Prompt Template
You are a qualitative research analyst performing theme reconciliation.

CONTEXT: Three independent AI models analyzed survey responses from the
"[category]" category. Each identified opportunity themes. Your job is to
reconcile these into a unified codebook.

--- GPT-5.2 THEMES ---
[All GPT themes with names, descriptions, PIDs]

--- GEMINI THEMES ---
[All Gemini themes]

--- OPUS THEMES ---
[All Opus themes]

TASK: Create a unified codebook by:
1. Identifying themes that overlap across models (same concept, different names)
2. Merging overlapping themes into single unified themes
3. Keeping single-model themes IF substantive (≥3 PIDs)
4. Dropping themes that are too vague or have very few supporting responses
5. Aim for 5-10 unified themes per category

For each unified theme, provide:
- code: snake_case identifier
- name: Human-readable name
- description: Clear description of the desired capability
- source_models: which models identified it (["gpt", "gemini", "opus"])
- source_codes: the original codes from each model

Return ONLY valid JSON, no other text.
ISSUE Code Taxonomy
Code | Definition | Detection Signal | Consensus Rule
ISSUE_WRONG_FIELD | Respondent answered the opposite question (e.g., wrote constraints in the "want" field) | Cross-response context reveals contradictory intent | 2+ models flag any ISSUE_* → generic ISSUE marker applied
ISSUE_BACK_REFERENCE | Response references a prior answer ("same as before", "see above") and is unintelligible alone | Short response with deictic language | (same 2/3 rule)
ISSUE_NON_RESPONSE | Terse reply with no analyzable content | "N/A", "none", "no", single punctuation | (same 2/3 rule)
ISSUE_* (custom) | Models may create additional issue codes for novel quality problems | Varies | Same 2/3 majority rule; prefix matching ensures grouping
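A minimal sketch of the 2-of-3 consensus rule with prefix matching, assuming codings arrive as per-model lists of codes (the function name and data shape are illustrative, not taken from the actual pipeline):

```python
# Illustrative sketch of the 2-of-3 ISSUE_* consensus rule; names and data
# shapes are assumptions, not the pipeline's actual code.

def consensus_issue_flag(model_codes: dict[str, list[str]]) -> bool:
    """True when 2+ models assigned any ISSUE_* code to a response.

    Prefix matching groups custom codes (e.g. a model-invented
    ISSUE_GIBBERISH) with the built-in ones, so novel issue codes
    still count toward the majority.
    """
    flagging = [m for m, codes in model_codes.items()
                if any(c.startswith("ISSUE_") for c in codes)]
    return len(flagging) >= 2

# GPT and Opus flag different issue codes; the generic marker still applies.
codings = {
    "gpt": ["ISSUE_NON_RESPONSE"],
    "gemini": ["requirements_gathering_synthesis"],
    "opus": ["ISSUE_BACK_REFERENCE"],
}
assert consensus_issue_flag(codings)
```

Matching on the `ISSUE_` prefix rather than an enumerated list is what lets model-created custom codes participate in the same majority rule.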
Output JSON Schemas

Theme Discovery Output

{
  "model": "gpt-5.2",
  "category": "design_planning",
  "category_name": "Design & Planning",
  "response_count": 223,
  "timestamp": "ISO-8601",
  "themes": [
    {
      "code": "string",
      "name": "string",
      "description": "string",
      "pids": [integer]
    }
  ],
  "codings": [
    { "pid": integer, "themes": ["string"] }
  ]
}

Consolidated Codebook

{
  "metadata": {
    "phase": "Opportunity Theme Reconciliation",
    "timestamp": "ISO-8601",
    "reconciliation_model": "gpt-5.2",
    "discovery_models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
  },
  "categories": {
    "category_key": {
      "category": "string",
      "category_name": "string",
      "theme_count": integer,
      "models_reconciled": ["gpt", "gemini", "opus"],
      "themes": [
        {
          "code": "string",
          "name": "string",
          "description": "string",
          "source_models": ["gpt", "gemini", "opus"],
          "source_codes": {
            "gpt": ["string"],
            "gemini": ["string"],
            "opus": ["string"]
          }
        }
      ]
    }
  }
}

Systematic Codings (Phase 4)

{
  "category": "string",
  "phase": "Phase 4 - Systematic Coding",
  "timestamp": "ISO-8601",
  "models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
  "codebook": [ { "code": "string", "name": "string", "description": "string" } ],
  "response_count": integer,
  "codings": {
    "gpt": [ { "pid": integer, "rationale": "string", "themes": ["string"] } ],
    "gemini": [ ... ],
    "opus": [ ... ]
  },
  "cost": { ... }
}

IRR Results (Phase 5)

{
  "phase": "Phase 5 - Inter-Rater Reliability",
  "methodology": {
    "raters": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
    "metrics": ["Krippendorff's Alpha", "Cohen's Kappa (pairwise)", "Percent Agreement"]
  },
  "overall_statistics": {
    "mean_krippendorff_alpha": float,
    "mean_percent_agreement": float,
    "interpretation": "string"
  },
  "category_results": {
    "category_key": {
      "krippendorff_alpha": { "theme_code": float },
      "percent_agreement": { "theme_code": float },
      "pairwise_kappa": {
        "gpt_vs_gemini": { "theme_code": float },
        "gpt_vs_opus": { "theme_code": float },
        "gemini_vs_opus": { "theme_code": float }
      },
      "code_frequencies": { "gpt": {}, "gemini": {}, "opus": {} }
    }
  }
}

Prevalence Results (Phase 6)

{
  "methodology": {
    "consensus_method": "majority_vote",
    "threshold": "2+ of 3 models must agree"
  },
  "category_results": {
    "category_key": {
      "theme_prevalence": [
        {
          "code": "string",
          "count": integer,
          "percentage": float,
          "pids": [integer]
        }
      ],
      "consensus_codings": { "pid": ["theme1", "theme2"] }
    }
  }
}
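The majority-vote consensus behind these prevalence figures can be sketched in a few lines; function and variable names are illustrative, not the pipeline's actual code:

```python
# Illustrative sketch of Phase 6 majority voting: a theme is assigned to a
# PID when 2+ of the 3 models coded it. Names are assumptions.
from collections import defaultdict

def consensus_codings(codings_by_model: dict[str, list[dict]]) -> dict[int, list[str]]:
    votes = defaultdict(lambda: defaultdict(int))  # pid -> theme -> votes
    for codings in codings_by_model.values():
        for coding in codings:
            for theme in coding["themes"]:
                votes[coding["pid"]][theme] += 1
    return {pid: sorted(t for t, n in themes.items() if n >= 2)
            for pid, themes in votes.items()}

per_model = {
    "gpt":    [{"pid": 11, "themes": ["a", "b"]}],
    "gemini": [{"pid": 11, "themes": ["a"]}],
    "opus":   [{"pid": 11, "themes": ["b", "c"]}],
}
# "a" and "b" each receive 2 votes; "c" has only 1 and is dropped.
assert consensus_codings(per_model) == {11: ["a", "b"]}
```

Theme prevalence then follows directly: `count` is the number of PIDs whose consensus coding contains the theme, and `percentage` is `count / response_count * 100`.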

Rich Opportunity Card

{
  "rank": integer,
  "theme_code": "string",
  "category": "string",
  "title": "string",
  "problem_statement": "string",
  "proposed_capability": {
    "summary": "string",
    "context_sources_needed": ["string"],
    "capability_steps": ["string"]
  },
  "impact": {
    "description": "string",
    "evidence_quotes": [ { "pid": integer, "quote": "string" } ]
  },
  "success_definition": {
    "qualitative_measures": ["string"],
    "quantitative_measures": ["string"]
  },
  "constraints_and_guardrails": [
    {
      "constraint": "string",
      "supporting_quote": { "pid": integer, "quote": "string" }
    }
  ],
  "who_it_affects": {
    "prevalence_count": integer,
    "prevalence_percentage": float,
    "description": "string",
    "signals": ["string"]
  },
  "models_consulted": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
}
IRR Interpretation Guide

Why Krippendorff's Alpha?

  • Designed for multi-rater reliability (3+ raters)
  • Handles missing data gracefully (if one model fails on a batch)
  • Supports nominal-level measurement (categorical theme codes)
  • Does not assume a fixed rater set
  • More conservative than simple percent agreement, adjusting for chance

How It's Calculated

For each theme code, a binary matrix is constructed:

import krippendorff  # pip install krippendorff

# One row per model, one column per response (1 = theme assigned, 0 = not)
#                 PID_1  PID_2  PID_3  PID_4
matrix = [
    [1, 0, 1, 0],  # GPT
    [1, 0, 1, 0],  # Gemini
    [1, 0, 0, 0],  # Opus
]

alpha = krippendorff.alpha(
    reliability_data=matrix,
    level_of_measurement="nominal",
)

Reporting

  • Per-theme α values identify which themes models agree/disagree on
  • Themes with α < 0.67 are flagged for potential human adjudication
  • Overall mean α provides a summary reliability score
  • Pairwise κ identifies whether specific model pairs diverge
  • Code frequency counts reveal systematic over/under-coding by individual models
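The simpler reported metrics can be sketched over the same binary matrix: percent agreement, plus the α < 0.67 adjudication flag (illustrative code, not the pipeline's):

```python
# Illustrative: percent agreement across rater rows, plus the adjudication
# flag at Krippendorff's conventional 0.67 cutoff. Not the pipeline's code.

def percent_agreement(rows: list[list[int]]) -> float:
    """Fraction of responses on which all raters gave the same rating."""
    n = len(rows[0])
    unanimous = sum(1 for i in range(n) if len({r[i] for r in rows}) == 1)
    return unanimous / n

def needs_adjudication(alpha: float, threshold: float = 0.67) -> bool:
    return alpha < threshold

matrix = [
    [1, 0, 1, 0],  # GPT
    [1, 0, 1, 0],  # Gemini
    [1, 0, 0, 0],  # Opus
]
assert percent_agreement(matrix) == 0.75  # raters split only on PID_3
assert needs_adjudication(0.55) and not needs_adjudication(0.90)
```

Note that percent agreement alone is inflated for rare themes (raters mostly agree on 0s), which is exactly why the chance-corrected α is the headline metric.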

What IRR Tells Us (and Doesn't)

High α means the three models consistently apply the same theme to the same responses—the codebook is operationally clear and the models "understand" it similarly. Low α on a specific theme may indicate the theme definition is ambiguous, the theme requires human judgment the models handle differently, or the theme captures a rare pattern where base-rate effects inflate disagreement.

IRR does not tell us whether the codes are correct—only that the coders agree. This is why the human review gate exists: to ensure the codebook itself captures meaningful, well-defined themes before reliability is measured.

Model Configuration & Cost Tracking

Model Parameters

Model | Thinking Config | Temperature | Streaming
GPT-5.2 | reasoning_effort="high" | 1 | No
Gemini 3.1 Pro | ThinkingConfig(thinking_level="HIGH") | Default | No
Claude Opus 4.6 | thinking: adaptive, effort: high | Default | Yes (timeout avoidance)

Token Pricing (per 1M tokens)

Model | Input | Output | Notes
GPT-5.2 | $1.75 | $14.00 | Thinking tokens billed at output rate
Gemini 3.1 Pro | $2.00 | $12.00 | Thinking tokens billed at output rate
Claude Opus 4.6 | $5.00 | $25.00 | Thinking tokens billed at output rate

The CostTracker class in llm.py tracks input, output, and thinking tokens separately per API call, with phase-level summaries printed to console.
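A minimal sketch of the per-call arithmetic, using the rates from the table above with thinking tokens billed at the output rate (the real CostTracker in llm.py may be structured differently):

```python
# Illustrative cost arithmetic from the pricing table; thinking tokens are
# billed at the output rate. The real CostTracker may differ.
PRICING = {  # model -> (input $/1M tokens, output $/1M tokens)
    "gpt-5.2": (1.75, 14.00),
    "gemini-3.1-pro-preview": (2.00, 12.00),
    "claude-opus-4-6": (5.00, 25.00),
}

def call_cost(model: str, input_tokens: int,
              output_tokens: int, thinking_tokens: int) -> float:
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate
            + (output_tokens + thinking_tokens) * out_rate) / 1_000_000

# 100k input + 5k output + 20k thinking tokens on Opus costs $1.125.
assert call_cost("claude-opus-4-6", 100_000, 5_000, 20_000) == 1.125
```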

Design & Planning

223 responses | 7 themes

View Codebook
Research Projects
Ranked by prevalence and multi-model consensus
#1
From Accepted Scope to Sprint Plan
The hard part here is not deciding what to build; it is the administrative grind that starts after the design is already accepted. Teams...
68 responses (30.5%), α=0.94
#2
Architecture Studio for Requirements and Constraints
This project tackles the moment before a design review exists: the requirements are partial, the non-functional constraints are only half...
62 responses (27.8%), α=0.93
#3
Design Review Analyzer for Trade-off and Risk
Design reviews break down less from lack of opinions than from missing a specific consequence: the unstated dependency, the non-functional...
46 responses (20.6%), α=0.94
#4
Project Memory and Decision Lineage Graph
Long-lived projects rarely suffer from a total lack of information. The failure mode is chronology: the key assumption was said in a...
44 responses (19.7%), α=0.90
#5
Rough-to-Refined Design Doc and Diagram Workspace
Design artifacts rarely start as clean prose. They start as bullet lists from a meeting, a verbal walkthrough, a proof-of-concept branch, a...
34 responses (15.2%), α=0.97
Theme Prevalence
(majority vote: assigned when 2+ of 3 models agree)
Project Planning, Ticket/Task Breakdown & Status Automation | 68 (30.5%) | α=0.94 | 3/3
Architecture Ideation & Interactive Design Copilot | 62 (27.8%) | α=0.93 | 3/3
Design Review, Risk Assessment & Trade-off Decision Support | 46 (20.6%) | α=0.94 | 3/3
Context Retrieval & Knowledge Synthesis (Internal + External) | 44 (19.7%) | α=0.90 | 3/3
Documentation, Specs & Diagram/Artifact Generation | 34 (15.2%) | α=0.97 | 3/3
Requirements Gathering, Synthesis & Clarification | 33 (14.8%) | α=0.95 | 3/3
Trustworthy Outputs: Higher Accuracy & Verifiable Citations | 13 (5.8%) | α=0.87 | 3/3
Key Constraints & Guardrails (30)
Must not own the final decision or plan; accountability stays with the human team.
"Making decisions. I want to use AI to help distill and gather information. Sifting through the noise is the hard part for both, but the end decision and plan should be my responsibility" (PID 11)
From: Project #1
Must never make or present the final architectural decision; it should surface options and trade-offs, but the human remains accountable for the choice.
"I don't want AI to own critical architectural decisions or trade-offs, as these require deep domain knowledge, long-term vision, and accountability that only experienced engineers can provide." (PID 249)
From: Project #3
Must provide transparent reasoning and evidence for every critique so users can judge whether it is correct.
"I want AI to assist. I don't want AI to give me answers without context. If I can't tell why its suggesting I do something one way then I also can't tell if its hallucinating or not." (PID 483)
From: Project #3
Must require a human decision at pivotal points so someone can stand behind the outcome later.
"Any pivotal decision point needs to be made by a human so that someone can stand behind it if it is questioned in the future." (PID 166)
From: Project #3
Must not make final decisions or own the plan; it should distill and retrieve context, while humans remain responsible for judgment and trade-offs.
"Making decisions. I want to use AI to help distill and gather information. Sifting through the noise is the hard part for both, but the end decision and plan should be my responsibility" (PID 11)
From: Project #4

Design & Planning Codebook

Requirements Gathering, Synthesis & Clarification (requirements_gathering_synthesis) | α=0.95 | 3/3
Help capture requirements from scattered sources (docs, chats, meetings, stakeholder conversations), consolidate and de-duplicate them, surface ambiguities and inconsistencies, identify missing requirements, and translate business goals into actionable user stories or acceptance criteria so teams don't lose intent over time. This theme covers the intake and synthesis of what needs to be built and why. It does NOT cover generating the technical design that satisfies those requirements (that is architecture_ideation_design_copilot), nor retrieving general technical knowledge or codebase context to inform design choices (that is context_retrieval_and_knowledge_synthesis).
Architecture Ideation & Interactive Design Copilot (architecture_ideation_design_copilot) | α=0.93 | 3/3
Generate and iteratively refine system architectures and technical designs from stated requirements, constraints, and existing ecosystem context, recommending patterns, components, infrastructure choices, and integration approaches. Operate as a conversational partner that asks clarifying questions, bounces ideas, and explores alternative designs in real time, supporting early-stage exploratory thinking rather than one-shot output. This theme is NOT about evaluating or critiquing an already-proposed design for risks, trade-offs, or feasibility (that is design_review_risk_tradeoff_decisions), nor about producing the written document or diagram that communicates the design (that is documentation_spec_diagram_generation).
Design Review, Risk Assessment & Trade-off Decision Support (design_review_risk_tradeoff_decisions) | α=0.94 | 3/3
Critically evaluate proposed or competing designs to validate feasibility and alignment with requirements and non-functional needs (security, compliance, reliability, scalability, operability). Surface gaps, hidden dependencies, and edge cases early, and assess their impact on cost, complexity, performance, maintainability, and delivery risk. Support evidence-based decisions by running what-if/scenario reasoning and producing clear, defensible trade-off analyses rather than generic recommendations. This theme is not about generating the initial design or brainstorming new approaches (that is architecture_ideation_design_copilot), nor about merely retrieving background information that informs the evaluation (that is context_retrieval_and_knowledge_synthesis). Apply when the emphasis is on judgment, critique, comparison, or risk identification of design options, not on creating them.
Project Planning, Ticket/Task Breakdown & Status Automation (project_planning_tasking_status_automation) | α=0.94 | 3/3
Turn requirements or designs into executable plans: break down work into epics, stories, or tasks; generate Jira/ADO items; estimate effort; plan milestones and sprints; track dependencies; and automate recurring project admin such as status updates, progress summaries, and coordination. This theme focuses on the mechanics of project execution. It does NOT cover capturing or clarifying what needs to be built (that is requirements_gathering_synthesis), nor producing design documents or diagrams (that is documentation_spec_diagram_generation). The boundary with requirements: translating goals into user stories is requirements gathering; breaking accepted stories into implementation tasks with estimates and schedules is project planning.
Documentation, Specs & Diagram/Artifact Generation (documentation_spec_diagram_generation) | α=0.97 | 3/3
Draft and maintain design documents, specifications, and proposals, and generate supporting artifacts such as templates, slides, UML diagrams, data-flow diagrams, flowcharts, and Visio-style diagrams. Includes converting meeting notes or rough outlines into structured, presentable documentation. This theme covers the written or visual artifact itself as the primary output. It does NOT cover the intellectual work of deciding what the design should be (that is architecture_ideation_design_copilot), nor synthesizing information from scattered sources to understand the current state (that is context_retrieval_and_knowledge_synthesis).
Context Retrieval & Knowledge Synthesis (Internal + External) (context_retrieval_and_knowledge_synthesis) | α=0.90 | 3/3
Retrieve and synthesize relevant context from large codebases and organizational knowledge (docs, APIs, prior decisions, chats, meetings) alongside external sources (standards, OSS options, best practices, comparable solutions). Consolidate fragmented references into clear, up-to-date summaries that explain 'what we know,' 'what we decided,' and 'why,' so teams can make informed design choices without re-discovering information. Preserve long-lived project memory and rationale. This theme is about finding, consolidating, and preserving information as a distinct need. It does not cover the downstream activities that consume that information: generating a design is architecture_ideation_design_copilot; evaluating trade-offs is design_review_risk_tradeoff_decisions. Do not infer this theme solely because another task would implicitly benefit from better context.
Trustworthy Outputs: Higher Accuracy & Verifiable Citations (trustworthy_outputs_with_citations) | α=0.87 | 3/3
Improve reliability for design and planning tasks by reducing hallucinations and providing verifiable grounding (citations, links, traceability to sources, clear statements of uncertainty) so teams can safely act on AI suggestions without introducing costly mistakes. Assign when a respondent names accuracy, hallucination reduction, trustworthiness, or source citation as a distinct concern. This theme is not a catch-all for responses expressing general quality expectations about AI output -- there should be a clear reference to reliability, grounding, or verifiability.

Themes identified from "What do you NOT want AI to handle?" responses.

No AI as Primary System Architect / High-Level Designer (human_led_architecture_design) | 3/3
Developers do not want AI to independently create, select, or drive end-to-end system architecture or high-level design. Concerns include missing context, generic or stale patterns, maintainability/ownership over time, and the need for experienced engineering judgment.
No AI Running Project Management (Planning/Estimation/Task Assignment) (no_ai_project_management_task_assignment) | 3/3
Developers do not want AI to autonomously manage projects or Agile processes (planning, estimation, staffing, task assignment, coordination). These activities require situational awareness of people, shifting priorities, and team-owned allocation decisions that AI is seen as unable to reliably handle.
No AI-Led Requirements Gathering or Stakeholder Alignment (no_ai_requirements_stakeholder_elicitation) | 3/3
Developers do not want AI to directly elicit, define, or assume requirements, or to be the primary agent interacting with customers/stakeholders for alignment. This work is viewed as requiring nuanced human interpretation, negotiation, and shared understanding; AI may help organize/refine after humans gather inputs.
No Replacement of Human Empathy, Collaboration, or Interpersonal Dynamics (no_ai_empathy_team_dynamics) | 3/3
Developers do not want AI to lead or replace work that depends on empathy, trust, interpersonal communication, or navigating team dynamics and politics. This includes sensitive conversations and collaboration where emotional intelligence and human relationships are central.
No Autopilot or Final Calls: Human-in-the-Loop Accountability (no_ai_autonomy_human_accountability) | 3/3
Developers do not want AI to operate autonomously or make final, consequential decisions in design and planning. AI may propose options, summarize trade-offs, and surface assumptions, but it should ask clarifying questions and require explicit human steering, review, and sign-off before outputs are treated as final or acted upon. Humans must retain judgment, ownership, and accountability for defendable decisions and outcomes.
Avoid AI for High-Stakes Work Due to Reliability, Hallucinations, and Missing Context (trust_accuracy_and_context_limitations) | 3/3
Developers want to restrict AI use in design/planning when correctness is critical or the system is complex, citing hallucinations, inaccuracies, outdated knowledge, and insufficient grounding in organization/domain context. AI should not be relied upon as the source of truth for consequential outputs until reliability is demonstrably high.
No AI Handling Sensitive/Confidential Data or Uncontrolled External Messaging (privacy_confidentiality_ip_and_message_control) | 3/3
Developers do not want AI to process proprietary, confidential, or sensitive information (including product ideas/IP) due to privacy and security risks. They also want control over any outbound communications, avoiding AI sending messages or sharing information without explicit human approval.
No AI Owning Product Vision, Strategy, or Creative/Taste Judgments (no_ai_vision_strategy_creativity_taste) | 2/3
Developers do not want AI to set product vision, business strategy, or creative direction, or to make subjective taste-based choices. AI can support ideation, but humans should define purpose, values, differentiation, and creative intent.

Project Planning, Ticket/Task Breakdown & Status Automation

project_planning_tasking_status_automation
Turn requirements or designs into executable plans: break down work into epics, stories, or tasks; generate Jira/ADO items; estimate effort; plan milestones and sprints; track dependencies; and automate recurring project admin such as status updates, progress summaries, and coordination. This theme focuses on the mechanics of project execution. It does NOT cover capturing or clarifying what needs to be built (that is requirements_gathering_synthesis), nor producing design documents or diagrams (that is documentation_spec_diagram_generation). The boundary with requirements: translating goals into user stories is requirements gathering; breaking accepted stories into implementation tasks with estimates and schedules is project planning.
0.943
Krippendorff's α (Excellent)
68
Responses (30.5%)
3/3
Model Convergence
Prevalence
68 of 223 responses (30.5%)
Source Codes (3/3 models converged)
gpt: project_planning_work_management
gemini: work_item_generation_and_task_breakdown, project_management_and_status_reporting
opus: project_planning_management, status_reporting_communication
Developer Quotes

Architecture Ideation & Interactive Design Copilot

architecture_ideation_design_copilot
Generate and iteratively refine system architectures and technical designs from stated requirements, constraints, and existing ecosystem context, recommending patterns, components, infrastructure choices, and integration approaches. Operate as a conversational partner that asks clarifying questions, bounces ideas, and explores alternative designs in real time, supporting early-stage exploratory thinking rather than one-shot output. This theme is NOT about evaluating or critiquing an already-proposed design for risks, trade-offs, or feasibility (that is design_review_risk_tradeoff_decisions), nor about producing the written document or diagram that communicates the design (that is documentation_spec_diagram_generation).
0.926
Krippendorff's α (Excellent)
62
Responses (27.8%)
3/3
Model Convergence
Prevalence
62 of 223 responses (27.8%)
Source Codes (3/3 models converged)
gpt: architecture_design_assistance
gemini: architecture_and_design_generation, interactive_design_partner
opus: architecture_design_exploration
Developer Quotes

Design Review, Risk Assessment & Trade-off Decision Support

design_review_risk_tradeoff_decisions
Critically evaluate proposed or competing designs to validate feasibility and alignment with requirements and non-functional needs (security, compliance, reliability, scalability, operability). Surface gaps, hidden dependencies, and edge cases early, and assess their impact on cost, complexity, performance, maintainability, and delivery risk. Support evidence-based decisions by running what-if/scenario reasoning and producing clear, defensible trade-off analyses rather than generic recommendations. This theme is not about generating the initial design or brainstorming new approaches (that is architecture_ideation_design_copilot), nor about merely retrieving background information that informs the evaluation (that is context_retrieval_and_knowledge_synthesis). Apply when the emphasis is on judgment, critique, comparison, or risk identification of design options, not on creating them.
0.944
Krippendorff's α (Excellent)
46
Responses (20.6%)
3/3
Model Convergence
Prevalence
46 of 223 responses (20.6%)
Source Codes (3/3 models converged)
gpt: decision_support_simulation_risk
gemini: architecture_review_and_risk_assessment, design_tradeoffs_and_alternatives
opus: architecture_design_exploration, risk_assessment_validation, tradeoff_decision_support
Developer Quotes
Creation of tracking design decisions through the architecture easier. eg. If I choose to have a 1-minute Recovery Point Objective (RPO), what subsequent decisions were made because of this, and what child decisions were made because of those subsequent decisions. I would like to reason over the choices that I made, the impact they had, and how small changes to initial design decisions could have significant impact/improvement on the architecture (be it complexity, performance, or cost).
PID 2
I’d like AI to play a key role in scenario simulation, effort estimation, and design validation during the planning phase. These areas often involve uncertainty and assumptions. AI can help model outcomes, surface risks early, and guide more informed, data-driven decisions.
PID 10

Context Retrieval & Knowledge Synthesis (Internal + External)

context_retrieval_and_knowledge_synthesis
Retrieve and synthesize relevant context from large codebases and organizational knowledge (docs, APIs, prior decisions, chats, meetings) alongside external sources (standards, OSS options, best practices, comparable solutions). Consolidate fragmented references into clear, up-to-date summaries that explain 'what we know,' 'what we decided,' and 'why,' so teams can make informed design choices without re-discovering information. Preserve long-lived project memory and rationale. This theme is about finding, consolidating, and preserving information as a distinct need. It does not cover the downstream activities that consume that information: generating a design is architecture_ideation_design_copilot; evaluating trade-offs is design_review_risk_tradeoff_decisions. Do not infer this theme solely because another task would implicitly benefit from better context.
0.897
Krippendorff's α (Excellent)
44
Responses (19.7%)
3/3
Model Convergence
Prevalence
44 of 223 responses (19.7%)
Source Codes (3/3 models converged)
gpt: context_retrieval_system_understanding
gemini: internal_context_and_codebase_understanding, meeting_and_decision_context_tracking
opus: codebase_context_understanding, research_information_gathering
Developer Quotes
I feel it will be most useful in analyzing new codebases that an engineer is jumping into, which is usually a very daunting step since it requires a lot of time and discipline to understand what's going on in a given codebase.
PID 17

Documentation, Specs & Diagram/Artifact Generation

documentation_spec_diagram_generation
Draft and maintain design documents, specifications, and proposals, and generate supporting artifacts such as templates, slides, UML diagrams, data-flow diagrams, flowcharts, and Visio-style diagrams. Includes converting meeting notes or rough outlines into structured, presentable documentation. This theme covers the written or visual artifact itself as the primary output. It does NOT cover the intellectual work of deciding what the design should be (that is architecture_ideation_design_copilot), nor synthesizing information from scattered sources to understand the current state (that is context_retrieval_and_knowledge_synthesis).
0.965
Krippendorff's α (Excellent)
34
Responses (15.2%)
3/3
Model Convergence
Prevalence
34 of 223 responses (15.2%)
Source Codes (3/3 models converged)
gpt: documentation_artifact_generation
gemini: documentation_and_diagram_generation
opus: documentation_generation
Developer Quotes
Help crafting design documents along with building architecture diagrams from the description of the features
PID 57
Would love for AI to take a proof-of-concept code change and generate a design doc based on it
PID 124

Requirements Gathering, Synthesis & Clarification

requirements_gathering_synthesis
Help capture requirements from scattered sources (docs, chats, meetings, stakeholder conversations), consolidate and de-duplicate them, surface ambiguities and inconsistencies, identify missing requirements, and translate business goals into actionable user stories or acceptance criteria so teams don't lose intent over time. This theme covers the intake and synthesis of what needs to be built and why. It does NOT cover generating the technical design that satisfies those requirements (that is architecture_ideation_design_copilot), nor retrieving general technical knowledge or codebase context to inform design choices (that is context_retrieval_and_knowledge_synthesis).
0.953
Krippendorff's α (Excellent)
33
Responses (14.8%)
3/3
Model Convergence
Prevalence
33 of 223 responses (14.8%)
Source Codes (3/3 models converged)
gpt: requirements_gathering_synthesis
gemini: requirements_gathering_and_analysis
opus: requirements_gathering_analysis
Developer Quotes

Trustworthy Outputs: Higher Accuracy & Verifiable Citations

trustworthy_outputs_with_citations
Improve reliability for design and planning tasks by reducing hallucinations and providing verifiable grounding (citations, links, traceability to sources, clear statements of uncertainty) so teams can safely act on AI suggestions without introducing costly mistakes. Assign when a respondent names accuracy, hallucination reduction, trustworthiness, or source citation as a distinct concern. This theme is not a catch-all for responses expressing general quality expectations about AI output -- there should be a clear reference to reliability, grounding, or verifiability.
0.873
Krippendorff's α (Excellent)
13
Responses (5.8%)
3/3
Model Convergence
Prevalence
13 of 223 responses (5.8%)
Source Codes (3/3 models converged)
gpt: trust_accuracy_citations
opus: reduce_hallucinations_improve_accuracy
Developer Quotes
Can provide more details and the source of the solution (URL)
PID 72
Reducing hallucinations. That's my only complaint with AI today, but it's still a huge problem.
PID 101
#1
From Accepted Scope to Sprint Plan
The hard part here is not deciding what to build; it is the administrative grind that starts after the design is already accepted. Teams still spend hours turning a settled brief into backlog structure, estimate ranges, dependency ordering, tracker updates, and the same weekly status narratives for different audiences. Cross-team work makes that worse, because the real overhead is not one planning meeting but the repeated translation of the same project state into boards, emails, newsletters, and coordination rituals.

Project Description

Turn an approved brief into a first-pass execution package: hierarchical work items, a dependency map, estimate ranges calibrated to the team’s own delivery history, and a draft sprint or milestone sequence. Once the team edits and accepts that draft, the assistant can sync the approved deltas into the tracker and assemble recurring status digests directly from tracker activity, CI signals, and open blockers.

Relevant Context Sources:
  • Accepted feature briefs, one-pagers, and design/spec documents
  • Current backlog, sprint history, and cross-team links from the work tracking system
  • Historical cycle time, throughput, and estimate-to-actual data for the team
  • Repository/service ownership boundaries and dependency metadata
  • CI/CD and incident signals that indicate progress, regressions, and release readiness
Capability Steps:
  1. Start from user-selected accepted scope rather than open-ended prompting, so the plan is anchored to approved work and can explicitly mark assumptions or out-of-scope items.
  2. Parse the design or spec and generate a work breakdown with task titles, descriptions, and acceptance checks tied to the team’s planning template.
  3. Infer dependencies, likely blockers, and estimate ranges using linked work items, service boundaries, and comparable historical tasks from the same team.
  4. Draft milestone or sprint sequencing from estimates, capacity, and calendar constraints, surfacing trade-offs among scope, date, and staffing instead of silently choosing priorities.
  5. Show tracker diffs before any write: proposed backlog items, hierarchy, links, and estimate fields, while preserving existing manual edits as the source of truth.
  6. From approved tracker activity, delivery signals, and unresolved blockers, compose periodic status summaries, stale-item alerts, and re-planning suggestions for the lead to edit and send.

Who It Affects

68 responses (30.5%), α=0.943

68 of 223 respondents (30.5%) were coded to this theme with strong inter-rater reliability (α = 0.94). These are developers, tech leads, and engineering managers who regularly translate accepted designs and specs into executable plans, maintain backlogs across sprint cycles, and produce recurring status updates; 67.7% want High or Very High AI support for this work.

Quantitative Signals:
  • 67.7% of respondents in this theme want High or Very High AI support for project planning tasks
  • Average AI Preference: 3.94/5
  • Average AI Usage: 2.57/5
  • Preference-Usage Gap: 1.38
  • IRR agreement α = 0.94, confirming strong coder consensus on theme boundaries
Being able to accurately turn a design + architecture into a project plan, with realistic timelines, and milestones.
PID 28
It would be helpful if AI could take high level features and break them down into discrete implementation tasks, and then plan those tasks out.
PID 380
I think helping me break down a large scope project into discrete tasks is the most important to me.
PID 704

Impact

If this assistant works, an approved design or feature brief becomes a usable draft execution plan instead of a manual sequence of backlog grooming, estimation, and tracker maintenance. Routine status communication is generated from work and delivery signals rather than copied by hand into emails or meeting artifacts, while blockers and cross-team dependencies are surfaced earlier. The main outcome is reclaiming engineering time for strategy and implementation, not replacing human project leadership.

Evidence
I’d like AI to play a bigger role in automating repetitive planning tasks... so I can focus more on strategy and decision-making.
PID 179
Entirely automating status update meetings. AI agents should be doing this. No more status update meetings and status report work for employees.
PID 5

Constraints & Guardrails

Must not own the final decision or plan; accountability stays with the human team.
"Making decisions. I want to use AI to help distill and gather information. Sifting through the noise is the hard part for both, but the end decision and plan should be my responsibility" (PID 11)

Success Definition

Qualitative Measures

  • Teams report that the generated work-item breakdowns are 'close enough' to use as starting points (requiring only minor edits, not rewrites)
  • Developers report that sprint planning meetings shifted from data-entry exercises to strategic discussions about priority and scope
  • Tech leads say they no longer manually compile status reports or copy-paste from work trackers into emails and slide decks
  • Teams report improved coordination because cross-team dependencies are surfaced early with actionable next steps, without the tool autonomously contacting people
  • Developers express that they retain full control and agency — the tool proposes, they decide

Quantitative Measures

  • 50% reduction in time spent manually creating work items from specs (measured via before/after time-tracking studies)
  • 80% of generated work-item trees accepted with fewer than 25% of items requiring substantive edits (title, scope, or acceptance criteria changes)
  • Effort estimate accuracy within 30% of actuals for 70%+ of tasks after 3 sprints of team-specific calibration
  • Reduce weekly time spent on status reporting/admin by 30%+ (self-reported + calendar/usage telemetry opt-in)
  • 30% reduction in the number of recurring status meetings per cross-team project, replaced by automated digest consumption

Theme Evidence

Project Planning, Ticket/Task Breakdown & Status Automation
project_planning_tasking_status_automation
68 responses (30.5%) | α = 0.943 | 3/3 convergence

Turn requirements or designs into executable plans: break down work into epics, stories, or tasks; generate Jira/ADO items; estimate effort; plan milestones and sprints; track dependencies; and automate recurring project admin such as status updates, progress summaries, and coordination. This theme...

#2
Architecture Studio for Requirements and Constraints
This project tackles the moment before a design review exists: the requirements are partial, the non-functional constraints are only half stated, and the current system has already narrowed the option space in ways nobody wrote down. Existing assistants usually collapse that ambiguity into one generic architecture answer. What teams actually need is a design studio that can surface real alternatives, force the missing questions into the open, and show where the codebase and prior decisions are already biasing the solution.

Project Description

An interactive architecture studio that keeps an explicit design state—goals, constraints, assumptions, open questions, and option history—while pulling in local precedents from ADRs, service topologies, and analogous code. It generates several materially different architectures, lets the team perturb constraints and compare diffs, and can export the chosen path as a draft decision record once the humans settle on a direction.

Relevant Context Sources:
  • Requirement docs, one-pagers, and user stories for the feature under discussion
  • Prior ADRs, design docs, and architecture diagrams for related systems
  • Service and repository catalogs with dependency and integration maps
  • Deployment manifests and infrastructure topology for the existing system
  • Organization-specific standards, approved components, and platform guardrails
  • Analogous implementations in the team’s own repositories
Capability Steps:
  1. Seed a structured design state from the initial brief, explicitly separating supplied requirements from assumptions and unknowns.
  2. Ask targeted questions only where the missing information affects architectural shape—scale, data sensitivity, failure model, ownership, or integration boundaries—and record the answers in the design state.
  3. Retrieve local precedent from ADRs, service catalogs, repository structure, and infrastructure manifests so the option space reflects what the organization already runs, not just public pattern libraries.
  4. Generate 3–5 distinct architecture candidates with components, data flow, deployment shape, assumptions, pros and cons, and named precedents or patterns for each.
  5. Let the team change a constraint, pin a decision driver, or branch the conversation, then regenerate only the affected parts and show what changed in each option.
  6. Preserve session history so the team can compare forks over several turns instead of losing context to one-shot prompting.
  7. When a direction is chosen, export an ADR-style draft that captures context, alternatives considered, rationale, consequences, and unresolved questions.
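The "design state" the steps above keep returning to can be made concrete with a small sketch. Everything here is illustrative: the field names and the idea that each option records which constraints it depends on are assumptions about how such a studio might be built, not a description of an existing tool.

```python
def new_design_state(brief_requirements):
    # Explicitly separate what the brief supplied from what the tool inferred.
    return {
        "requirements": list(brief_requirements),  # stated by the team
        "assumptions": [],                         # inferred, pending confirmation
        "open_questions": [],
        "options": {},  # option_id -> {"depends_on": [...], "stale": False}
    }

def perturb_constraint(state, constraint):
    """Change one constraint; mark only the dependent options stale (step 5)."""
    affected = []
    for opt_id, opt in state["options"].items():
        if constraint in opt["depends_on"]:
            opt["stale"] = True
            affected.append(opt_id)
    return affected  # only these need regeneration
```

Tracking dependencies per option is what lets step 5 regenerate "only the affected parts" instead of rerunning the whole session.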

Who It Affects

62 responses (27.8%) | α = 0.926

62 of 223 developers (27.8%) in the design-planning category explicitly asked for AI assistance with architecture ideation and interactive design collaboration, making it the second-most prevalent theme. Demand is strong but unmet: 67.7% want High or Very High support, average preference is 3.94/5, current usage is 2.57/5, and the preference-usage gap is 1.38.

Quantitative Signals:
  • 62 of 223 design-planning respondents (27.8%) explicitly mentioned architecture ideation and interactive design collaboration.
  • 67.7% of respondents in this theme want High or Very High AI support.
  • Average AI preference is 3.94/5 while average current usage is 2.57/5.
  • Preference-usage gap is 1.38, indicating substantial unmet demand in this workflow.
I want AI to assist with exploring design alternatives, identifying edge cases, and helping align architecture with evolving business goals. It should act as a thoughtful collaborator during planning, not just a generator.
PID 249
Act as a thinking partner to survey existing infrastructure/design patterns. Propose some fruit for thoughts and work through design together.
PID 391
Currently we are getting only one completion or main answer with a few alterative suggestions. It would be nice to have multiple alternatives to choose from instead of zeroing on in one solution would greatly improve the design phase. Many times, there are multiple ways to do something and the statistically significant way is not the best and over time create a bias towards suboptimal solutions.
PID 312

Impact

Instead of starting from a blank page or repeatedly re-prompting a general assistant, developers could quickly compare several context-grounded architectures before formal design review. This should reduce premature commitment to one familiar pattern, surface trade-offs and edge cases earlier, and shorten the path from requirements to a viable design direction.

Evidence
I want AI to assist with exploring design alternatives, identifying edge cases, and helping align architecture with evolving business goals. It should act as a thoughtful collaborator during planning, not just a generator.
PID 249
Currently we are getting only one completion or main answer with a few alterative suggestions. It would be nice to have multiple alternatives to choose from instead of zeroing on in one solution would greatly improve the design phase. Many times, there are multiple ways to do something and the statistically significant way is not the best and over time create a bias towards suboptimal solutions.
PID 312
AI should be suggesting optimal architectural patterns and recommending infrastructure choices based on projected load and cost constraints, significantly reducing costly rework down the line.
PID 322

Constraints & Guardrails

Success Definition

Qualitative Measures

  • Developers report that the tool surfaces architecture options they had not previously considered, rather than echoing back their own ideas.
  • Developers describe the interaction as a productive back-and-forth conversation rather than a one-shot prompt-response pattern.
  • Users report options are specific to their constraints and existing ecosystem, not generic templates.
  • Developers trust recommendations because the tool cites internal precedent or named patterns and explicitly flags uncertainty.
  • Developers feel they remain accountable and in control of design direction.

Quantitative Measures

  • Reduce average time from requirements-to-first-viable-architecture-proposal by 40% (measured via session timestamps).
  • Median time-to-3-viable-architecture-candidates (from session start) reduced by 50% versus baseline manual process (measured via user study).
  • Increase the average number of distinct architecture alternatives considered per design decision from 1-2 to 3-4 (measured via session logs).
  • Decrease average number of user turns needed to reach a stable candidate architecture by 30% (conversation analytics).
  • Achieve >80% of generated recommendations traceable to a cited source (internal precedent or named industry pattern) rather than unsourced assertions.

Theme Evidence

Architecture Ideation & Interactive Design Copilot
architecture_ideation_design_copilot
62 responses (27.8%) | α = 0.926 | 3/3 convergence

Generate and iteratively refine system architectures and technical designs from stated requirements, constraints, and existing ecosystem context, recommending patterns, components, infrastructure choices, and integration approaches. Operate as a conversational partner that asks clarifying...

#3
Design Review Analyzer for Trade-off and Risk
Design reviews break down less from lack of opinions than from missing a specific consequence: the unstated dependency, the non-functional requirement nobody checked, the failure mode nobody simulated, or the cost implication buried two decisions downstream. Teams do this reasoning manually, inconsistently, and often without a reusable trail. The result is circular meetings, shallow comparisons, and architecture choices whose real trade-offs are only discovered after implementation starts.

Project Description

Analyze a proposed design the way a strong reviewer would: reconstruct the component model from docs and diagrams, check requirement and NFR coverage, search for hidden dependencies, run what-if scenarios, and turn the result into a trade-off matrix plus a draft decision record. Every critique should point back to its source—diagram elements, requirements text, catalog entries, incident history, or explicit assumptions.

Relevant Context Sources:
  • Design docs, architecture diagrams, and competing options under review
  • Requirements, acceptance criteria, and non-functional checklists for the project
  • Internal service catalogs, interface definitions, and ownership metadata
  • Historical incidents, outages, and prior review artifacts for similar components
  • Cost, capacity, and load assumptions relevant to the design
Capability Steps:
  1. Ingest the review package and parse it into components, data flows, integrations, trust boundaries, explicit decisions, and open questions.
  2. Map the parsed design to stated functional and non-functional requirements, marking each item as covered, partially covered, unaddressed, or unclear with supporting snippets.
  3. Cross-check the design against internal catalogs and interface definitions to expose hidden dependencies, undocumented integrations, ownership gaps, and likely cross-team blockers.
  4. Generate design-specific failure modes and parameterized what-if scenarios so reviewers can see how a change in assumptions affects complexity, performance, cost, recovery targets, or delivery risk.
  5. When multiple options are supplied, compare them across user-chosen criteria and explain each cell in the matrix instead of collapsing the output to a single winner.
  6. Draft a decision record that captures options considered, assumptions, risk concentrations, downstream consequences, and unresolved questions for the review meeting.
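Step 2's coverage check is mechanically simple once the design has been parsed. A minimal sketch, assuming a parsed-claims structure (`design_claims`) that maps each requirement to the snippets supporting it; the names and shapes are hypothetical:

```python
def coverage_report(requirements, design_claims):
    """Mark each requirement covered / partial / unaddressed, keeping the
    supporting snippets so reviewers can check the evidence themselves.
    design_claims: requirement -> list of (snippet, fully_satisfies)."""
    report = {}
    for req in requirements:
        claims = design_claims.get(req, [])
        if not claims:
            report[req] = ("unaddressed", [])
        elif any(full for _, full in claims):
            report[req] = ("covered", [s for s, _ in claims])
        else:
            report[req] = ("partial", [s for s, _ in claims])
    return report
```

Keeping the snippet list attached to each verdict is what makes the critique "point back to its source" as the project description requires.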

Who It Affects

46 responses (20.6%) | α = 0.944

46 of 223 respondents (20.6%) were coded to this theme with very high inter-rater reliability (α = 0.944). These are engineers involved in preparing or reviewing designs, comparing options, and reasoning about risks, assumptions, dependencies, and downstream impacts. Demand is strong but underserved: 67.7% want high or very high AI support for this activity, while current usage remains well below preference.

Quantitative Signals:
  • 46 of 223 respondents (20.6%) were coded to this theme
  • Inter-rater reliability α = 0.944 across 3 independent coders
  • 67.7% of developers want High/Very High AI support for this area
  • Average AI preference score of 3.94/5 vs. average usage of 2.57/5 — a 1.38-point gap indicating significant unmet need
I would like to reason over the choices that I made, the impact they had, and how small changes to initial design decisions could have significant impact/improvement on the architecture (be it complexity, performance, or cost).
PID 2
I'd like AI to play a key role in scenario simulation, effort estimation, and design validation during the planning phase. These areas often involve uncertainty and assumptions. AI can help model outcomes, surface risks early, and guide more informed, data-driven decisions.
PID 10
If it can understand complex system architectures, and provides insights into gaps in your design, that would be extremely helpful.
PID 83

Impact

Teams would enter design reviews with a structured analysis rather than starting from a blank debate: a requirement and non-functional coverage check, a prioritized risk register, a trade-off matrix for competing options, and a decision impact trail showing how one choice propagates through the architecture. This should surface blockers and assumption failures earlier, shorten circular review cycles, and leave reusable decision records that make later reviews and design changes easier to justify.

Evidence
Sometimes we go in circles or miss edge cases, and it would be great if AI could track the logic, highlight trade-offs, and even reuse parts of earlier reasoning when we hit similar problems again. That would save time and reduce errors.
PID 313
AI can help model outcomes, surface risks early, and guide more informed, data-driven decisions.
PID 10
If I choose to have a 1-minute Recovery Point Objective (RPO), what subsequent decisions were made because of this, and what child decisions were made because of those subsequent decisions.
PID 2

Constraints & Guardrails

Must never make or present the final architectural decision; it should surface options and trade-offs, but the human remains accountable for the choice.
"I don't want AI to own critical architectural decisions or trade-offs, as these require deep domain knowledge, long-term vision, and accountability that only experienced engineers can provide." (PID 249)
Must provide transparent reasoning and evidence for every critique so users can judge whether it is correct.
"I want AI to assist. I don't want AI to give me answers without context. If I can't tell why its suggesting I do something one way then I also can't tell if its hallucinating or not." (PID 483)
Must require a human decision at pivotal points so someone can stand behind the outcome later.
"Any pivotal decision point needs to be made by a human so that someone can stand behind it if it is questioned in the future." (PID 166)

Success Definition

Qualitative Measures

  • Developers report that the tool catches gaps, edge cases, or dependency risks they had not previously considered in their designs.
  • Design review meetings become shorter and more focused because participants arrive with pre-analyzed trade-off matrices rather than debating from scratch.
  • Users trust outputs because assumptions, evidence, and confidence are explicit, and the tool is seen as an assistant rather than a decision-maker.
  • Teams attach generated trade-off analyses and draft decision records to formal design review workflows.
  • Teams report reduced frequency of late-stage architecture pivots caused by risks that should have been caught during planning.

Quantitative Measures

  • Reduce median time from first design draft to design approval by 20–30% (measured via timestamps on review artifacts)
  • Increase the number of risks and gaps identified during the planning phase by at least 40% compared to manual-only reviews
  • Reduce post-implementation architectural rework incidents (tracked via work items tagged as 'design change' or 'architecture pivot') by 25% within the first year
  • Close the preference-usage gap from 1.38 to below 0.5 within 12 months of launch, indicating that developers who want this capability are actually using it
  • At least 80% of generated trade-off analyses rated as 'useful' or 'very useful' by the reviewing engineer in post-review feedback

Theme Evidence

Design Review, Risk Assessment & Trade-off Decision Support
design_review_risk_tradeoff_decisions
46 responses (20.6%) | α = 0.944 | 3/3 convergence

Critically evaluate proposed or competing designs to validate feasibility and alignment with requirements and non-functional needs (security, compliance, reliability, scalability, operability). Surface gaps, hidden dependencies, and edge cases early, and assess their impact on cost, complexity,...

#4
Project Memory and Decision Lineage Graph
Long-lived projects rarely suffer from a total lack of information. The failure mode is chronology: the key assumption was said in a meeting, the workaround landed in code months later, the real decision got captured only in a PR comment, and the person who knew why has since moved on. Engineers need more than search hits. They need a usable reconstruction of what was decided, when, on what basis, and what later choices depended on it.

Project Description

Create a project memory layer that ingests code history, docs, tickets, notes, and opted-in conversations as dated facts and decisions rather than as isolated text chunks. The interface should answer questions like 'why was this done?', 'what changed after the RPO target moved?', or 'which later components depend on that decision?' by showing a timeline, the decision lineage, and the exact source passages behind each claim.

Relevant Context Sources:
  • Source repositories, commit history, and pull-request discussions
  • Design docs, specs, ADRs, and architecture diagrams
  • Work items, comments, and linked approval records
  • Meeting notes plus opted-in chat or email threads
  • API/interface definitions and service or component catalogs
  • Incident records and postmortems tied to architectural components
Capability Steps:
  1. Let teams define project boundaries and access policies across repositories, documents, work items, and conversation sources before ingestion.
  2. Normalize those artifacts into time-stamped records with stable deep links, document authority markers, and freshness metadata.
  3. Extract components, requirements, decisions, alternatives, owners, and rationale, while building a glossary of project-specific terminology and aliases.
  4. Link parent and child decisions over time so users can trace how an upstream call—such as a reliability target or integration choice—propagated into later implementation and operational decisions.
  5. Answer questions with structured sections such as current facts, decisions made, reasons cited, open conflicts, and missing evidence, rather than a single flattened paragraph.
  6. Expose disagreements, stale context, and inaccessible sources instead of collapsing them into one confident answer.
  7. Generate reusable briefings for onboarding, design reviews, or incident retrospectives that remain drillable to exact snippets, timestamps, code locations, and prior discussions.
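The lineage query in step 4 (the kind PID 2 asks for: "what child decisions were made because of those subsequent decisions") reduces to a graph walk. A minimal sketch over a plain adjacency map, standing in for whatever store the memory layer actually uses:

```python
def downstream_decisions(children, root):
    """Walk the decision lineage (parent -> child decisions) to answer
    'which later decisions depended on this one?'."""
    seen, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.append(child)
                stack.append(child)
    return seen
```

For example, a 1-minute RPO decision might link to a synchronous-replication choice, which in turn links to a dual-region cost decision; the walk surfaces both as consequences of the original target.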

Who It Affects

44 responses (19.7%) | α = 0.897

44 of 223 respondents (19.7%) were coded to this theme with high inter-rater reliability (α = 0.897). Those affected range from engineers joining unfamiliar codebases to senior engineers trying to preserve institutional memory as ownership changes. The theme cuts across design planning, architecture, requirements, and onboarding because context retrieval is an upstream bottleneck for all of them.

Quantitative Signals:
  • 67.7% of respondents in this theme want High or Very High AI support for context retrieval tasks
  • Average AI Preference: 3.94/5
  • Average AI Usage: 2.57/5, creating a 1.38-point preference–usage gap
I feel it will be most useful in analyzing new codebases that an engineer is jumping into, which is usually a very daunting step since it requires a lot of time and discipline to understand what's going on in a given codebase.
PID 17
Creation of tracking design decisions through the architecture easier. eg. If I choose to have a 1-minute Recovery Point Objective (RPO), what subsequent decisions were made because of this, and what child decisions were made because of those subsequent decisions. I would like to reason over the choices that I made, the impact they had, and how small changes to initial design decisions could have significant impact/improvement on the architecture (be it complexity, performance, or cost).
PID 2
There can be a substantial time gap between when a stakeholder mentions an important detail and the documentation is produced and reviewed, and with the volume of topics discussed and documented, it's possible for things to get lost inadvertently. Not lost conceptually, just not stated... Having AI help keep track of things would be really helpful.
PID 40

Impact

If this capability exists, developers can reconstruct system context and design history from one place instead of manually searching across fragmented artifacts. The main gains are faster onboarding to unfamiliar codebases, better continuity when owners leave, and fewer repeated investigations or re-debates because prior rationale is retrievable and cited. The result is not autonomous planning, but a trustworthy project memory that helps humans make better-informed design decisions.

Evidence
I feel it will be most useful in analyzing new codebases that an engineer is jumping into, which is usually a very daunting step since it requires a lot of time and discipline to understand what's going on in a given codebase.
PID 17
There can be a substantial time gap between when a stakeholder mentions an important detail and the documentation is produced and reviewed, and with the volume of topics discussed and documented, it's possible for things to get lost inadvertently. Not lost conceptually, just not stated... Having AI help keep track of things would be really helpful.
PID 40

Constraints & Guardrails

Must not make final decisions or own the plan; it should distill and retrieve context, while humans remain responsible for judgment and trade-offs.
"Making decisions. I want to use AI to help distill and gather information. Sifting through the noise is the hard part for both, but the end decision and plan should be my responsibility" (PID 11)
Must not provide opaque answers; every summary should expose why a claim was surfaced and where it came from.
"I want AI to assist. I don't want AI to give me answers without context. If I can't tell why its suggesting I do something one way then I also can't tell if its hallucinating or not." (PID 483)

Success Definition

Qualitative Measures

  • Developers report that onboarding onto unfamiliar codebases takes significantly less time because the tool provides accurate, cited explanations of system structure and past decisions
  • Teams report fewer instances of re-debating previously settled design decisions because the rationale and context are readily accessible
  • New team members and engineers returning from leave report they can reconstruct project context without scheduling 'brain dump' meetings with colleagues
  • Developers trust the tool's outputs because every summary includes verifiable citations and clearly flags information staleness or conflicts
  • Design reviews more consistently reference prior decisions and their rationale instead of re-investigating from scratch

Quantitative Measures

  • Decrease average time for a developer to answer a 'why was this decision made?' question from >30 minutes of manual searching to <3 minutes via the query interface
  • At least 90% of generated summaries include citations for all key claims
  • Detect and surface 'stale context' for at least 70% of docs or decision records that reference changed APIs or components, with a false-positive rate under 20%
  • Reduce onboarding time for engineers joining a new codebase by at least 30% as measured by time-to-first-meaningful-commit on the new project
  • Increase the proportion of design documents that reference prior decisions or decision records by at least 40%

Theme Evidence

Context Retrieval & Knowledge Synthesis (Internal + External)
context_retrieval_and_knowledge_synthesis
44 responses (19.7%) | α = 0.897 | 3/3 convergence

Retrieve and synthesize relevant context from large codebases and organizational knowledge (docs, APIs, prior decisions, chats, meetings) alongside external sources (standards, OSS options, best practices, comparable solutions). Consolidate fragmented references into clear, up-to-date summaries...

#5
Rough-to-Refined Design Doc and Diagram Workspace
Design artifacts rarely start as clean prose. They start as bullet lists from a meeting, a verbal walkthrough, a proof-of-concept branch, a few architecture boxes on a whiteboard, and a template someone promises to fill in later. Teams want help with the conversion work: turning that rough material into an actual design document and keeping the diagram and text aligned as the idea sharpens.

Project Description

Use an interactive authoring workspace to turn notes, transcripts, rough outlines, and selected code diffs into sectioned design docs, specs, and editable diagrams. Each paragraph, table, and box should carry its provenance so the author can see which meeting note, requirement, or code change it came from and refine only the part that needs work.

Relevant Context Sources:
  • Team document templates and required review sections
  • User-selected notes, transcripts, outlines, and dictated explanations
  • Linked requirements and prior design documents relevant to the artifact
  • Selected code diffs or proof-of-concept branches
  • Organization naming, security, and privacy checklists used during design review
Capability Steps:
  1. Let the author choose the artifact type and explicitly select which inputs are allowed to shape the draft.
  2. Map the source material to required sections such as goals, non-goals, requirements, risks, rollout, and open questions, and highlight gaps before drafting.
  3. Draft the document section by section, labeling each statement as directly supported by a source, inferred from several sources, or waiting on author confirmation.
  4. Generate diagrams in both rendered form and structured text notation so edits can round-trip instead of forcing redraws.
  5. Support targeted regeneration: the author can rewrite only the rollout section, only the dependency table, or only one diagram lane without losing manual edits elsewhere.
  6. When linked requirements or code change later, detect which sections and diagram elements are now stale and suggest patch-style updates.
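Step 6's staleness detection follows directly from the per-paragraph provenance in step 3. A minimal sketch, assuming each drafted section simply records the source ids it was derived from (the section and source names are made up):

```python
def stale_sections(sections, changed_sources):
    """sections: section name -> list of source ids (meeting notes,
    requirement ids, commit ids) it was derived from. Returns only the
    sections that cite a changed source and so need a patch-style update."""
    changed = set(changed_sources)
    return [name for name, srcs in sections.items() if changed & set(srcs)]
```

Because provenance is tracked at section granularity, a changed requirement flags only the rollout section that cites it, leaving the author's manual edits elsewhere untouched.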

Who It Affects

34 responses (15.2%) | α = 0.965

34 of 223 respondents (15.2%) were coded to this theme with near-perfect inter-rater reliability (α = 0.965), spanning developers who draft design docs, create architecture diagrams, write specs from meeting context, build proposals, and maintain as-built documentation. They consistently describe artifact production as repetitive work that pulls time and attention away from actual design thinking.

Quantitative Signals:
  • 67.7% of respondents coded to this theme want High or Very High AI support for documentation and artifact generation
  • Average AI Preference of 3.94/5 vs. Average AI Usage of 2.57/5 reveals a 1.38-point gap — the largest unmet demand signal
  • Theme achieved highest inter-rater reliability in the category (α = 0.965), confirming clear, unambiguous developer need
Help crafting design documents along with building architecture diagrams from the description of the features
PID 57
I want AI to cover trivial and repetitive work (creating documents, slides, ....) so that I can focus on what really matters.
PID 75
Would love for AI to take a proof-of-concept code change and generate a design doc based on it
PID 124

Impact

If this capability exists, developers start from a grounded first draft instead of a blank page: rough notes, meeting context, and code changes become structured documents and diagrams aligned to team conventions. This shifts effort away from formatting and manual diagram drawing toward design reasoning and review, while also making as-built documentation easier to keep useful for onboarding and cross-team understanding.

Evidence
Essentially, supplement my ability to get my ideas documented in a consumable way for others. This would really help and reduce a lot of time that feels wasted to me.
PID 483
I want AI to cover trivial and repetitive work (creating documents, slides, ....) so that I can focus on what really matters.
PID 75
templatization and summary from notes and meeting context into a skeleton design with high accuracy would be a major step forward. Intuitive application of security and privacy standards in this phase would be ideal.
PID 387

Constraints & Guardrails

The tool must not make the final design or planning decisions; it should distill and draft, while the developer retains responsibility for the plan.
"Making decisions. I want to use AI to help distill and gather information. Sifting through the noise is the hard part for both, but the end decision and plan should be my responsibility" (PID 11)
The tool must expose source context for its suggestions so users can verify why content appears and whether it is trustworthy.
"I want AI to assist. I don't want AI to give me answers without context. If I can't tell why its suggesting I do something one way then I also can't tell if its hallucinating or not." (PID 483)

Success Definition

Qualitative Measures

  • Developers report that generated first drafts are shareable with minor edits rather than major rewrites.
  • Developers describe the workflow as conversational and steerable, with targeted updates instead of full regeneration.
  • Developers confirm that generated diagrams accurately reflect their described architecture and are usable in reviews without manual redrawing.
  • Developers report spending more time on design thinking and less time on formatting, rewriting, and manual diagramming.
  • Junior engineers report that refreshed as-built documentation reduces the effort required to understand a system.

Quantitative Measures

  • Reduce median time-to-first-draft of a design doc or proposal by 50%+ compared with baseline practice.
  • >80% of generated document sections require no more than minor edits, measured by edit distance between the AI draft and the final published version.
  • ≥70% of generated diagrams are exported without being fully redrawn, measured by edit distance or replacement events.
  • ≥25% of artifacts opt into change-tracked update suggestions, with a ≥60% acceptance rate of suggested patches after review.
  • Reduce the number of stale-documentation incidents reported per quarter by 40%, measured through doc freshness audits or retrospective tracking.
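As a rough illustration of the edit-distance measure above, `difflib.SequenceMatcher` can classify a draft section as "minor edits only" when its similarity to the published version stays above a threshold. This is a minimal sketch: the 0.9 cutoff and the `(draft, final)` pair format are assumptions, not values from the survey.

```python
import difflib

def minor_edit_fraction(pairs, threshold=0.9):
    """Fraction of (ai_draft, final_text) section pairs whose similarity
    ratio meets the minor-edits-only threshold (cutoff is an assumption)."""
    keep = sum(
        difflib.SequenceMatcher(None, draft, final).ratio() >= threshold
        for draft, final in pairs
    )
    return keep / len(pairs)
```

Here `ratio()` is a normalized similarity in [0, 1]; a true Levenshtein edit distance normalized by section length would serve the same purpose.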

Theme Evidence

Documentation, Specs & Diagram/Artifact Generation
documentation_spec_diagram_generation
34 responses (15.2%) | α=0.965 | 3/3 convergence

Draft and maintain design documents, specifications, and proposals, and generate supporting artifacts such as templates, slides, UML diagrams, data-flow diagrams, flowcharts, and Visio-style diagrams. Includes converting meeting notes or rough outlines into structured, presentable documentation....

Development

353 responses | 7 themes

View Codebook
Research Projects
Ranked by prevalence and multi-model consensus
#1
Incremental PR Builder for Mechanical Code Changes
The highest-leverage coding automation is not open-ended feature work. It is the backlog of tedious, well-bounded transformations:...
177 responses (50.1%) | α=0.89
#2
Inline Quality Gate for Code Review and Test Generation
The best code review comment is the one the developer sees before the code ever leaves the editor. Teams want help with bug spotting,...
98 responses (27.8%) | α=0.94
#3
Trace-to-Patch Root Cause Workbench
Debugging is still dominated by evidence collection. Before an engineer can even test a theory, they have to find the right logs, line up...
69 responses (19.5%) | α=0.96
#4
Repository Context Graph for Cross-File Changes
Cross-file changes fail when the engineer cannot see the ripples. Changing one type, interface, or contract in a large codebase means...
65 responses (18.4%) | α=0.88
Theme Prevalence
(majority vote: assigned when 2+ of 3 models agree)
Code Generation, Refactoring & Modernization Automation
177 (50.1%)
α=0.89 | 3/3
Code Quality, Review Automation, Automated Testing & Security/Compliance Guidance
98 (27.8%)
α=0.94 | 3/3
Debugging, Root Cause Analysis & Bug Fix Assistance
69 (19.5%)
α=0.96 | 3/3
Codebase Context, Knowledge Capture & Safe Cross-File Changes
65 (18.4%)
α=0.88 | 3/3
Architecture, Design Brainstorming & Planning Support
38 (10.8%)
α=0.84 | 3/3
Performance Profiling & Optimization Suggestions
33 (9.3%)
α=0.98 | 3/3
DevOps, CI/CD, IaC & Engineering Workflow Automation
23 (6.5%)
α=0.89 | 3/3
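The majority-vote rule used throughout (a theme is assigned when at least 2 of the 3 model coders apply it) can be sketched as follows. The `codings` data shape is an assumption for illustration only.

```python
from collections import Counter

def majority_vote(codings, min_agree=2):
    """codings: {pid: {model_name: set_of_theme_ids}}.
    Returns {pid: themes applied by at least min_agree of the models}."""
    assigned = {}
    for pid, by_model in codings.items():
        counts = Counter(t for themes in by_model.values() for t in themes)
        assigned[pid] = {t for t, c in counts.items() if c >= min_agree}
    return assigned
```

Because one response can carry multiple themes, each theme's vote is counted independently; a theme coded by only one model is dropped.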
Key Constraints & Guardrails (24)
Must not submit or approve code review artifacts; a human remains accountable for final review and submission.
"I don't want AI to handle submitting or approving pull requests. There should be a human who reviews all changes that AI suggests to prevent the introduction of issues." (PID 125)
From: Project #2

Development Codebook

Code Generation, Refactoring & Modernization Automation codegen_refactor_modernization α=0.89 3/3
AI that generates boilerplate/scaffolding, ports or converts code, writes small scripts, and implements well-scoped features directly from requirements or specs to reduce repetitive implementation work. It also performs behavior-preserving refactors, migrations, and framework/library/dependency upgrades to modernize legacy systems and reduce technical debt. This theme covers the act of producing or transforming code itself. It does not cover reviewing or testing that code for quality/security (see quality_review_testing_security), nor the architectural reasoning or design decisions that precede implementation (see architecture_design_planning_support). This theme requires a concrete generation or transformation task, not just a wish for better AI output quality.
Debugging, Root Cause Analysis & Bug Fix Assistance debugging_root_cause_fixing α=0.96 3/3
AI that accelerates debugging by analyzing stack traces, logs, telemetry, and failing tests; reproduces issues; identifies root causes; suggests fixes; and helps triage incidents/regressions. This theme applies when the respondent describes diagnosing or fixing specific bugs or incidents that have already manifested. The key distinction is reactive (investigating a known failure or incident) vs. proactive (preventing bugs through review and testing). Proactive bug detection belongs under quality_review_testing_security.
Codebase Context, Knowledge Capture & Safe Cross-File Changes codebase_context_knowledge_safe_changes α=0.88 3/3
AI that builds a reliable, repo-wide understanding of large/legacy systems -- including dependencies, conventions, and cross-module/service relationships -- to navigate code and make coordinated multi-file edits with awareness of ripple effects. It retrieves and synthesizes codebase- and API-specific knowledge with traceable sources, answers 'how/why' questions about existing code, maintains long-running context across sessions, and keeps documentation/comments and onboarding materials up to date. Assign when the respondent describes needing the AI to understand, retain knowledge about, or navigate a codebase or system. Do not assign solely because the described task happens to occur in a large codebase. This also does not cover general external knowledge lookup unrelated to a specific codebase.
Code Quality, Review Automation, Automated Testing & Security/Compliance Guidance quality_review_testing_security α=0.94 3/3
AI that acts as an intelligent quality gate by providing real-time code review feedback, enforcing style/standards, and flagging correctness issues and bad practices. It generates meaningful unit/integration/E2E tests, identifies edge cases and coverage gaps, supports TDD workflows, and validates changes to prevent regressions. It also detects security vulnerabilities and provides secure-by-default and compliance-aware guidance. Assign when the response describes a review, testing, or security-checking activity. Wanting AI-generated code to be better quality does not, by itself, warrant this theme -- there should be an identifiable quality assurance action described. This does not cover test orchestration or CI pipeline execution (see devops_ci_cd_iac_workflow_automation).
Performance Profiling & Optimization Suggestions performance_profiling_optimization α=0.98 3/3
AI that identifies performance bottlenecks, assists with profiling and interpretation of performance signals, and suggests (or safely implements) optimizations for runtime efficiency and resource usage. Code this theme when the respondent explicitly requests performance profiling, analysis, or optimization as a distinct activity. It does NOT apply when 'performant' is mentioned only as a desired attribute of AI-generated code without requesting a specific performance analysis or optimization task.
Architecture, Design Brainstorming & Planning Support architecture_design_planning_support α=0.84 3/3
AI that helps with solution/architecture decisions by proposing options with tradeoffs, recommending patterns, translating requirements into designs, and supporting planning/triage/prioritization as a 'thinking partner.' Code this theme when the respondent describes wanting AI to help reason about how to approach a problem, evaluate design alternatives, or make architectural choices. It does NOT apply when the respondent simply wants AI to understand a use case as a prerequisite to writing code (see codegen_refactor_modernization) -- the distinction is whether the respondent is asking for the design reasoning itself vs. asking for implementation with implicit design understanding.
DevOps, CI/CD, IaC & Engineering Workflow Automation devops_ci_cd_iac_workflow_automation α=0.89 3/3
AI that automates non-coding engineering workflows such as CI/CD setup and troubleshooting, deployments, infrastructure-as-code generation/fixes, build failure diagnosis, and toolchain/PR workflow automation to reduce operational toil. Assign when the respondent names DevOps, CI/CD, IaC, deployment, or pipeline tasks. When a response mentions automating 'repetitive tasks' or 'maintenance' without specifying operational or infrastructure context, default to codegen_refactor_modernization. Fixing issues within IaC or pipelines belongs here rather than under debugging_root_cause_fixing.

Themes identified from "What do you NOT want AI to handle?" responses.

No AI-Led Architecture Decisions or Sweeping Refactors no_ai_led_architecture_or_sweeping_refactors 3/3
Developers do not want AI to define or significantly alter system architecture, high-level design, or major technical trade-offs, since these require deep domain context, long-term thinking, and cross-team alignment. They also do not want AI to perform large, unscoped refactors or broad multi-file rewrites that expand scope or reshape structure in one shot. Such changes are difficult to review, validate, and safely integrate without careful human-led planning and incremental scoping.
No Autonomous Execution, Merging/Deploying, or Agentic Control no_autonomous_execution_merge_deploy_or_agentic_control 3/3
Developers want humans to remain in control of actions with real impact: running commands, modifying files without explicit instruction, approving/reviewing, committing, merging, releasing, or deploying. AI should not operate as a fully autonomous agent; any significant action should require explicit human confirmation and final responsibility.
No AI Ownership of Complex Debugging or Critical Bug Fixes no_complex_debugging_or_critical_bug_fixes 3/3
Developers do not want AI to lead complex debugging, root-cause analysis, or high-stakes bug fixes (especially cross-system or production-critical issues). They cite lack of runtime/context, confident-but-wrong fixes, and regression risk that is hard to detect.
No Security/Privacy-Sensitive Work or Secrets Handling no_security_privacy_secrets_handling 3/3
Developers do not want AI to implement or modify security-critical code (authn/authz, crypto, vulnerability fixes), handle credentials/secrets, or work with sensitive/regulated data (e.g., PII). They worry about subtle vulnerabilities, compliance exposure, and the high cost of mistakes.
No Autonomous Performance Optimization no_autonomous_performance_optimization 3/3
Developers do not want AI to independently change code/architecture for performance (latency, throughput, memory, scalability) without careful measurement and context. Optimization is scenario-dependent and mistakes can silently degrade performance or reliability.
No AI-Led Requirements, Core Business Logic, or API/UX Decisions no_ai_deciding_requirements_business_logic_or_api_ux 2/3
Developers do not want AI to interpret ambiguous requirements, decide product behavior, implement core business rules without guidance, or make API/UX trade-offs. These areas depend on stakeholder intent, nuanced domain knowledge, and consistency with existing product decisions.
Preserve Developer Agency, Learning, and Ownership preserve_developer_agency_learning_and_job_ownership 3/3
Developers want to remain the primary driver of development work, preserving hands-on learning, creativity, problem-solving satisfaction, and ownership/accountability. They resist AI taking over the “interesting” parts of engineering or creating dependency/deskilling or job-replacement concerns.
Avoid AI Output That Is Unreliable, Contextless, Hard to Verify, or Intrusive avoid_ai_when_unreliable_contextless_hard_to_verify_or_intrusive 3/3
Developers restrict AI use when it lacks repo/domain context, hallucinates, produces non-compiling/incorrect code, violates conventions, or generates changes that are time-consuming to validate. This also includes disruptive assistance patterns (e.g., aggressive autocompletion, unprompted edits/formatting/imports) that break flow and create cleanup work.

Code Generation, Refactoring & Modernization Automation

codegen_refactor_modernization
AI that generates boilerplate/scaffolding, ports or converts code, writes small scripts, and implements well-scoped features directly from requirements or specs to reduce repetitive implementation work. It also performs behavior-preserving refactors, migrations, and framework/library/dependency upgrades to modernize legacy systems and reduce technical debt. This theme covers the act of producing or transforming code itself. It does not cover reviewing or testing that code for quality/security (see quality_review_testing_security), nor the architectural reasoning or design decisions that precede implementation (see architecture_design_planning_support). This theme requires a concrete generation or transformation task, not just a wish for better AI output quality.
0.894
Krippendorff's α (Excellent)
177
Responses (50.1%)
3/3
Model Convergence
Prevalence
177 of 353 responses (50.1%)
Source Codes (3/3 models converged)
gpt: boilerplate_and_feature_codegen | gpt: refactoring_modernization | gemini: boilerplate_and_repetitive_tasks | gemini: refactoring_and_modernization | opus: boilerplate_and_repetitive_automation | opus: refactoring_and_maintenance
Developer Quotes
Refactoring for sure. It's often the most boring task for me, and being able to mostly automate it would be amazing. Current progress is already great for a lot of use cases.
PID 1

Code Quality, Review Automation, Automated Testing & Security/Compliance Guidance

quality_review_testing_security
AI that acts as an intelligent quality gate by providing real-time code review feedback, enforcing style/standards, and flagging correctness issues and bad practices. It generates meaningful unit/integration/E2E tests, identifies edge cases and coverage gaps, supports TDD workflows, and validates changes to prevent regressions. It also detects security vulnerabilities and provides secure-by-default and compliance-aware guidance. Assign when the response describes a review, testing, or security-checking activity. Wanting AI-generated code to be better quality does not, by itself, warrant this theme -- there should be an identifiable quality assurance action described. This does not cover test orchestration or CI pipeline execution (see devops_ci_cd_iac_workflow_automation).
0.939
Krippendorff's α (Excellent)
98
Responses (27.8%)
3/3
Model Convergence
Prevalence
98 of 353 responses (27.8%)
Source Codes (3/3 models converged)
gpt: automated_testing_and_validation | gpt: quality_review_standards_and_security | gemini: automated_test_generation | gemini: code_quality_and_reviews | opus: automated_test_generation | opus: code_review_and_quality
Developer Quotes
Over the next 1-3 years, I’d like AI to play a bigger role in intelligent code reviews, automated test generation, and architectural decision support. These areas often consume significant time and require deep context. AI could help accelerate delivery while maintaining high quality and consistency.
PID 10

Debugging, Root Cause Analysis & Bug Fix Assistance

debugging_root_cause_fixing
AI that accelerates debugging by analyzing stack traces, logs, telemetry, and failing tests; reproduces issues; identifies root causes; suggests fixes; and helps triage incidents/regressions. This theme applies when the respondent describes diagnosing or fixing specific bugs or incidents that have already manifested. The key distinction is reactive (investigating a known failure or incident) vs. proactive (preventing bugs through review and testing). Proactive bug detection belongs under quality_review_testing_security.
0.958
Krippendorff's α (Excellent)
69
Responses (19.5%)
3/3
Model Convergence
Prevalence
69 of 353 responses (19.5%)
Source Codes (3/3 models converged)
gpt: debugging_and_incident_rca | gemini: debugging_and_bug_fixing | opus: bug_fixing_and_debugging
Developer Quotes

Codebase Context, Knowledge Capture & Safe Cross-File Changes

codebase_context_knowledge_safe_changes
AI that builds a reliable, repo-wide understanding of large/legacy systems -- including dependencies, conventions, and cross-module/service relationships -- to navigate code and make coordinated multi-file edits with awareness of ripple effects. It retrieves and synthesizes codebase- and API-specific knowledge with traceable sources, answers 'how/why' questions about existing code, maintains long-running context across sessions, and keeps documentation/comments and onboarding materials up to date. Assign when the respondent describes needing the AI to understand, retain knowledge about, or navigate a codebase or system. Do not assign solely because the described task happens to occur in a large codebase. This also does not cover general external knowledge lookup unrelated to a specific codebase.
0.884
Krippendorff's α (Excellent)
65
Responses (18.4%)
3/3
Model Convergence
Prevalence
65 of 353 responses (18.4%)
Source Codes (3/3 models converged)
gpt: codebase_context_and_dependency_awareness | gemini: documentation_and_knowledge_retrieval | gemini: large_codebase_context | opus: codebase_understanding_and_context | opus: documentation_and_knowledge
Developer Quotes
Generation & dealing with boiler-plate code. More cross-codebase awareness, so I can make a single change in a small part of the codebase (eg changing a datatype in a model, or changing nullability of a field), and AI will help implement those changes across the rest of the codebase. Supporting re-architecture of my codebase.
PID 2

Architecture, Design Brainstorming & Planning Support

architecture_design_planning_support
AI that helps with solution/architecture decisions by proposing options with tradeoffs, recommending patterns, translating requirements into designs, and supporting planning/triage/prioritization as a 'thinking partner.' Code this theme when the respondent describes wanting AI to help reason about how to approach a problem, evaluate design alternatives, or make architectural choices. It does NOT apply when the respondent simply wants AI to understand a use case as a prerequisite to writing code (see codegen_refactor_modernization) -- the distinction is whether the respondent is asking for the design reasoning itself vs. asking for implementation with implicit design understanding.
0.843
Krippendorff's α (Excellent)
38
Responses (10.8%)
3/3
Model Convergence
Prevalence
38 of 353 responses (10.8%)
Source Codes (3/3 models converged)
gpt: architecture_design_brainstorming_and_planning | gemini: architecture_and_design_support | opus: architecture_and_design
Developer Quotes
Over the next 1-3 years, I’d like AI to play a bigger role in intelligent code reviews, automated test generation, and architectural decision support. These areas often consume significant time and require deep context. AI could help accelerate delivery while maintaining high quality and consistency.
PID 10

Performance Profiling & Optimization Suggestions

performance_profiling_optimization
AI that identifies performance bottlenecks, assists with profiling and interpretation of performance signals, and suggests (or safely implements) optimizations for runtime efficiency and resource usage. Code this theme when the respondent explicitly requests performance profiling, analysis, or optimization as a distinct activity. It does NOT apply when 'performant' is mentioned only as a desired attribute of AI-generated code without requesting a specific performance analysis or optimization task.
0.977
Krippendorff's α (Excellent)
33
Responses (9.3%)
3/3
Model Convergence
Prevalence
33 of 353 responses (9.3%)
Source Codes (3/3 models converged)
gpt: performance_optimization_and_profiling | gemini: performance_optimization | opus: performance_optimization
Developer Quotes

DevOps, CI/CD, IaC & Engineering Workflow Automation

devops_ci_cd_iac_workflow_automation
AI that automates non-coding engineering workflows such as CI/CD setup and troubleshooting, deployments, infrastructure-as-code generation/fixes, build failure diagnosis, and toolchain/PR workflow automation to reduce operational toil. Assign when the respondent names DevOps, CI/CD, IaC, deployment, or pipeline tasks. When a response mentions automating 'repetitive tasks' or 'maintenance' without specifying operational or infrastructure context, default to codegen_refactor_modernization. Fixing issues within IaC or pipelines belongs here rather than under debugging_root_cause_fixing.
0.893
Krippendorff's α (Excellent)
23
Responses (6.5%)
3/3
Model Convergence
Prevalence
23 of 353 responses (6.5%)
Source Codes (3/3 models converged)
gpt: devops_ci_cd_and_workflow_automation | gemini: devops_and_infrastructure | opus: devops_and_infrastructure
Developer Quotes
#1
Incremental PR Builder for Mechanical Code Changes
The highest-leverage coding automation is not open-ended feature work. It is the backlog of tedious, well-bounded transformations: framework migrations, API renames, dependency bumps, nullability changes, generated scaffolding, and behavior-preserving refactors that touch many files but involve very little product judgment. Developers avoid these jobs because they are boring to author and painful to review when tools overreach.

Project Description

Turn a change request—API rename, dependency upgrade, framework migration, or boilerplate expansion—into a sequence of small diffs that match the repository’s naming, file layout, error-handling, and test patterns. The system should stop and surface the plan as soon as the change grows beyond the intended modules, public APIs, or review size budget.

Relevant Context Sources:
  • Source tree structure and analogous implementations already in the codebase
  • Dependency manifests, lockfiles, migration guides, and release notes
  • Build, lint, format, and test configuration from the repository
  • API schemas, interface definitions, and changed call sites
  • Recent version-control history for similar migrations or refactors
Capability Steps:
  1. Accept the transformation intent together with explicit file, directory, or module boundaries and whether behavior must remain unchanged.
  2. Inspect analogous implementations and impacted symbols to understand how the repository currently handles naming, layering, error paths, and tests.
  3. Produce a change plan that estimates files touched, likely API impact, and where the request may spill outside the approved area.
  4. Generate atomic diffs under a configurable size threshold so a rename across 40 files becomes several digestible patches instead of one opaque rewrite.
  5. Run formatter, linter, build, static analysis, and tests after each patch and halt with diagnostics when a step fails or confidence drops.
  6. Assemble the accepted patches into a branch and PR summary that explains each transformation step in plain language.
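Step 4's size-budgeted splitting could, for instance, reduce to a greedy grouping of per-file diff stats into patches that stay under the review budget. This is a minimal sketch: the `(path, lines_changed)` input shape, the file names, and the 200-line default budget are illustrative assumptions.

```python
def split_into_patches(file_diffs, budget=200):
    """Greedily group (path, lines_changed) stats into patches whose
    total stays under the budget; oversized files get their own patch."""
    patches, current, size = [], [], 0
    for path, lines in sorted(file_diffs, key=lambda d: -d[1]):
        if lines >= budget:              # too large to combine with anything
            patches.append([path])
            continue
        if current and size + lines > budget:
            patches.append(current)      # close the current patch
            current, size = [], 0
        current.append(path)
        size += lines
    if current:
        patches.append(current)
    return patches
```

A production system would additionally keep semantically coupled files (a type and its call sites) in the same patch; size alone is only the simplest constraint.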

Who It Affects

177 responses (50.1%) | α=0.894

177 of 353 respondents (50.1%) were coded to this theme with high inter-rater reliability (α=0.894), making it the single most prevalent developer need in the survey. The 1.14-point gap between preference (4.21/5) and current usage (3.07/5) shows substantial unmet demand for help with repetitive code generation and transformation tasks.

Quantitative Signals:
  • 177 of 353 respondents (50.1%) mentioned concrete code generation or transformation tasks, the highest prevalence of any theme
  • 77.6% of respondents in this theme want High or Very High AI support for development tasks
  • Average AI Preference: 4.21/5 versus actual Usage: 3.07/5
  • Preference–Usage Gap: 1.14
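For reference, the per-theme reliability figures reported throughout (α=0.894 here) follow Krippendorff's alpha. A minimal nominal-data version for the fully coded case, where every coder labels every unit, looks roughly like this; binary theme labels are assumed.

```python
from collections import defaultdict

def krippendorff_alpha(units):
    """Nominal Krippendorff's alpha. units: list of per-response tuples,
    one label per coder (every coder codes every unit). Assumes at least
    two distinct labels appear in the data."""
    o = defaultdict(float)                       # coincidence matrix
    for vals in units:
        m = len(vals)
        for i, a in enumerate(vals):
            for j, b in enumerate(vals):
                if i != j:
                    o[(a, b)] += 1.0 / (m - 1)
    n_c = defaultdict(float)                     # marginal label totals
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_obs = sum(w for (a, b), w in o.items() if a != b)
    d_exp = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 - d_obs / d_exp
```

With three coders and one binary label per theme, each theme's α is computed independently over all responses in the category.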
Refactoring for sure. It's often the most boring task for me, and being able to mostly automate it would be amazing.
PID 1
Generation & dealing with boiler-plate code. More cross-codebase awareness, so I can make a single change in a small part of the codebase (eg changing a datatype in a model, or changing nullability of a field), and AI will help implement those changes across the rest of the codebase.
PID 2

Impact

If this capability exists, developers can delegate repetitive multi-file code changes and receive a staged sequence of small, reviewable diffs that follow local patterns and pass existing checks before review. That reduces maintenance toil, makes modernization work less disruptive, and frees developers to spend more time on business logic, design, and other higher-cognitive-load work.

Evidence
AI can reduce time spent on low level dev tasks by generating baseline code for the human engineer to develop off of.
PID 17
As much as I love coding, there is simply too much work to do to want to keep doing it myself. Ideally I could delegate most coding tasks, or even log analysis tasks, to agents while I focus on the things which AI currently cannot do well (Cross team objective setting, product decision making, high level design, etc)
PID 56
Probably, it should play major role in reducing manual tasks or mundane tasks, so that developer can focus solving actual business usecase rather than in regular maintenance.
PID 305

Constraints & Guardrails

Success Definition

Qualitative Measures

  • Developers describe generated changesets as small enough to review confidently and easy to stop, correct, or approve step by step.
  • Developers report that generated code follows existing repository patterns instead of generic AI style.
  • Refactors and upgrades feel boring but easy to delegate rather than risky and time-consuming.
  • Developers report spending more time on business logic and design and less on repetitive maintenance work.
  • Teams trust the capability for production modernization tasks, not only for disposable or throwaway code.

Quantitative Measures

  • ≥50% reduction in manual file edits for common mechanical migrations such as renames, signature changes, and deprecation replacements
  • User-configured scope violations: <2% of tasks attempt to touch out-of-scope files, and those attempts are blocked and logged
  • 100% of presented changesets pass existing linter and test suites before developer review
  • Average changeset size stays below 200 lines changed, with 90th percentile below 400 lines
  • Reduce median time-to-draft-PR for refactor and upgrade tasks by ≥30%
  • Reduce average PR cycle time for refactoring-tagged work items by 40%
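The changeset-size targets above (mean below 200 changed lines, 90th percentile below 400) could be tracked with a simple nearest-rank percentile. A minimal sketch; the input shape is an assumption.

```python
import math

def changeset_stats(sizes):
    """Return (mean, p90) of changeset sizes in lines changed,
    using the nearest-rank definition of the 90th percentile."""
    ordered = sorted(sizes)
    p90 = ordered[math.ceil(0.9 * len(ordered)) - 1]
    return sum(ordered) / len(ordered), p90
```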

Theme Evidence

Code Generation, Refactoring & Modernization Automation
codegen_refactor_modernization
177 responses (50.1%) | α=0.894 | 3/3 convergence

AI that generates boilerplate/scaffolding, ports or converts code, writes small scripts, and implements well-scoped features directly from requirements or specs to reduce repetitive implementation work. It also performs behavior-preserving refactors, migrations, and framework/library/dependency...

#2
Inline Quality Gate for Code Review and Test Generation
The best code review comment is the one the developer sees before the code ever leaves the editor. Teams want help with bug spotting, standards enforcement, insecure patterns, and missing tests at the moment of authorship, but they do not want yet another stream of generic lint-like noise. The practical challenge is precision: comments must reflect this codebase, this diff, and this testing style, or developers will tune them out.

Project Description

Embed a high-precision quality pass in the editor and pre-commit flow. It should look only at the current diff plus the nearby tests, rules, and interfaces that give that diff meaning, combine deterministic analyzers with targeted model reasoning, and suggest only the smallest missing tests or fixes worth a developer’s attention.

Relevant Context Sources:
  • Changed functions, their direct callers, and touched interfaces
  • Nearby unit and integration tests plus local test helpers and assertion styles
  • Repository lint, format, compiler, type-check, and secure-coding rules
  • Coverage deltas for changed lines and branches
  • Dependency manifests and vulnerability registries
  • Historical PR comments that encode recurring team-specific feedback
Capability Steps:
  1. On save or stage, compute the minimal impacted area from changed functions, adjacent call sites, touched interfaces, and affected modules.
  2. Run compiler, type, lint, basic static, and dependency/security checks first and normalize those findings into a common model.
  3. Compare the change against nearby implementation and testing patterns to flag likely correctness, readability, performance, or security issues with line-anchored explanations.
  4. When behavior is ambiguous, ask a short question tied to the affected branch or interface instead of guessing what the code is supposed to do.
  5. Generate small, change-focused tests that match existing repository style and cover happy path, boundary, error, and regression cases for the altered behavior.
  6. Before the PR is opened, publish a compact quality summary of unresolved findings, added tests, coverage movement, and security-sensitive surfaces.
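Step 1's "minimal impacted area" could be approximated by walking a reverse call graph outward from the changed functions. This is only a sketch: the `callers` map and the function names in the example are purely illustrative, and a real implementation would derive the graph from compiler or language-server data.

```python
def impacted_area(changed, callers, depth=1):
    """Expand a set of changed functions to include callers up to
    `depth` hops away in a reverse call graph ({fn: [caller, ...]})."""
    seen = set(changed)
    frontier = set(changed)
    for _ in range(depth):
        frontier = {c for fn in frontier for c in callers.get(fn, [])} - seen
        seen |= frontier
    return seen
```

Keeping `depth` small is what bounds the analysis to "the current diff plus the nearby tests, rules, and interfaces," rather than the whole repository.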

Who It Affects

98 responses (27.8%) | α=0.939

98 of 353 developers (27.8%) explicitly described wanting AI assistance with code review, test generation, security scanning, or standards enforcement. This theme had near-perfect inter-rater reliability (α=0.94) and a preference-usage gap of 1.14, indicating clear unmet demand for higher-signal quality support.

Quantitative Signals:
  • Preference-Usage Gap: 1.14
  • Average AI Preference: 4.21/5
  • Average AI Usage: 3.07/5
  • 77.6% of respondents in this theme want High or Very High AI support
  • Inter-rater reliability of α=0.94, indicating the three model coders applied this theme with near-perfect consistency
I want AI to spot bugs while I'm writing the code! I want it to give me code review feedback in real time.
PID 174
Testing. It would be nice to have AI to generate test cases and propose gaps on code coverage. Recently I was trying to improve the coverage on my project and it was super slow to come up with things from scratch. AI could speed up the process.
PID 366
Unit testing and discovering edge cases. I feel like it will help out code to become much better.
PID 386

Impact

If this exists, developers receive small, evidence-backed review findings and targeted test suggestions while they code, rather than discovering routine correctness, standards, or security issues only during code review or later testing. That shifts defect detection earlier, reduces back-and-forth on basic review comments, and improves confidence that code changes will not break expected behavior.

Evidence
Thinking bigger picture though, you can avoid much of that if you leverage AI upstream and catch bugs during development or in the lowest possible test environment, before they affect any human users.
PID 120
I want AI to spot bugs while I'm writing the code! I want it to give me code review feedback in real time.
PID 174
Help write unit tests, ensure that changes will not break anything.
PID 213

Constraints & Guardrails

Must not submit or approve code review artifacts; a human remains accountable for final review and submission.
"I don't want AI to handle submitting or approving pull requests. There should be a human who reviews all changes that AI suggests to prevent the introduction of issues." (PID 125)

Success Definition

Qualitative Measures

  • Developers report that inline findings match the type and quality of feedback they receive from experienced human reviewers.
  • Generated tests are accepted and kept by developers rather than deleted or heavily rewritten, indicating that they are meaningful and repository-conformant.
  • Developers say suggestions are repository-aware and evidence-backed rather than generic automated comments.
  • Developers report catching bugs and insecure patterns during editing that would previously have surfaced during code review or later testing.
  • Developers report that the tool is non-blocking and easy to dismiss when its suggestions are not relevant.

Quantitative Measures

  • Reduce average pull request review turnaround time by 30%+ as fewer issues are surfaced for the first time during review
  • Increase repository test coverage by 15-25% within 6 months of adoption through generated test suggestions
  • Reduce regression-related bug reports by 20%+ through pre-commit change validation and targeted regression test generation
  • Reduce security-related findings in dedicated security review or penetration testing by 40%+ due to shift-left detection
  • Achieve >70% acceptance rate on generated test suggestions (tests kept without major modification)
  • Achieve >60% acceptance rate on inline correctness and standards suggestions

Theme Evidence

Code Quality, Review Automation, Automated Testing & Security/Compliance Guidance
quality_review_testing_security
98 responses (27.8%) | α=0.939 | 3/3 convergence

AI that acts as an intelligent quality gate by providing real-time code review feedback, enforcing style/standards, and flagging correctness issues and bad practices. It generates meaningful unit/integration/E2E tests, identifies edge cases and coverage gaps, supports TDD workflows, and validates...

#3
Trace-to-Patch Root Cause Workbench
Debugging is still dominated by evidence collection. Before an engineer can even test a theory, they have to find the right logs, line up the request trace with the release that introduced the problem, inspect failing artifacts, and reconstruct enough runtime context to reproduce the fault. The bottleneck is not just code understanding; it is assembling a credible case file from scattered operational data.

Project Description

Start from a bug report, stack trace, failing test, or incident alert and assemble a debug case file: correlated logs and traces, the most likely regression window, similar historical failures, competing root-cause hypotheses, and—when verification succeeds—a candidate patch plus regression test. The emphasis is investigative workflow, not generic code completion.

Relevant Context Sources:
  • Logs, traces, and metrics tied to the failing request or time window
  • Stack traces, crash dumps, compiler errors, and CI failure artifacts
  • Recent commits, deploys, config flips, and feature-flag changes
  • Historical bug reports, incidents, and linked remediation commits
  • Source code plus dependency relationships for implicated modules
  • Sandboxed build and reproduction environments matching the failing configuration
Capability Steps:
  1. Normalize the trigger into a failure signature and derive the minimum data collection plan needed for that signature.
  2. Pull correlated runtime signals by time, correlation ID, service boundary, and deployment version to build a failure timeline instead of a raw log dump.
  3. Rank recent commits, configuration changes, and rollout events by how well they explain the observed failure path.
  4. Retrieve similar past bugs and incidents, including their fixes, validation steps, and modules touched, to seed plausible hypotheses.
  5. Reproduce the issue in an isolated environment using the closest matching build and config snapshot, then capture enriched diagnostics from the repro attempt.
  6. Generate several root-cause hypotheses, each with a causal story, supporting artifacts, confidence estimate, and a short disproof checklist.
  7. If one hypothesis survives verification, generate a narrow patch and regression test and present them alongside the full investigation record.
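The ranking in step 3 can be sketched as a simple heuristic that scores recent changes by overlap with the implicated modules and by recency inside the suspected regression window. Everything below is illustrative: the `Change` record, its field names, and the 0.5 recency weight are assumptions, not a description of any real tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical change record; field names are illustrative only.
@dataclass
class Change:
    sha: str
    merged_at: datetime
    touched_modules: set[str] = field(default_factory=set)

def rank_suspect_changes(changes, failing_modules, failure_time,
                         window=timedelta(days=7)):
    """Rank changes by how well they explain the observed failure path:
    overlap with the failing modules, plus a small bonus for recency."""
    def score(c: Change) -> float:
        # Changes outside the regression window cannot explain the failure.
        if not (failure_time - window <= c.merged_at <= failure_time):
            return 0.0
        overlap = len(c.touched_modules & failing_modules)
        # Newer changes inside the window are weighted slightly higher.
        recency = 1 - (failure_time - c.merged_at) / window
        return overlap + 0.5 * recency
    return sorted((c for c in changes if score(c) > 0),
                  key=score, reverse=True)
```

A real implementation would fold in config flips, feature-flag changes, and rollout events as additional candidates, but the shape of the computation is the same: filter to the window, score against the failure signature, sort.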

Who It Affects

69 responses (19.5%) | α=0.958

69 of 353 respondents (19.5%) explicitly described wanting AI assistance with reactive debugging, root cause analysis, log correlation, bug triage, or fix suggestion — making this the second most prominent theme. 77.6% of these respondents want High or Very High AI support, yet current usage averages only 3.07/5, indicating tools are not meeting the need.

Quantitative Signals:
  • 69 responses coded to this theme across 353 total, representing 19.5% prevalence
  • 77.6% want High/Very High AI support for debugging and root cause analysis
  • Average AI Preference of 4.21/5 vs. Usage of 3.07/5 — a 1.14 gap indicating strong unmet demand
  • Inter-rater reliability α = 0.958, indicating near-perfect agreement among coders that these responses describe reactive debugging
Correlating with logs for self-debugging
PID 26
1, Naming Variables 2. Cache Invalidation 3. Root causing live site incidents 4. Auto-refactoring
PID 83
...leverage AI to do a lot of the investigation into bug reports for us, and if it confirms there is a bug or a recent regression was introduced in source control, it can suggest what the fix should be...
PID 120

Impact

If this workbench exists, the first phase of debugging shifts from manual hunting to reviewing an investigation brief. Instead of bouncing across logs, telemetry, version history, and bug databases, engineers receive a structured packet with the failure timeline, implicated code, likely regression window, relevant prior incidents, and—when verification succeeds—a small patch plus regression test. This reduces cognitive load during incidents and makes senior engineers reviewers of investigations rather than collectors of artifacts.

Evidence
This would free us up, as the experienced human engineers to focus on 'do we agree with the Agent's assessment / investigation and proposed fix?' and we could collectively churn through our bugs much faster this way.
PID 120
Hope AI to improve on bug fixing and debugging area since there are lots of logistics needs to be done like getting log, setting up debugger, and etc. Those are ingested by AI, and provide summary, and environment to jump in would be great win.
PID 156
Ideally I could delegate most coding tasks, or even log analysis tasks, to agents while I focus on the things which AI currently cannot do well (Cross team objective setting, product decision making, high level design, etc)
PID 56

Constraints & Guardrails

Success Definition

Qualitative Measures

  • On-call engineers report reduced cognitive load during incident response, citing the structured brief as a useful starting point rather than a distraction
  • Developers report that the investigation brief accurately identifies the root cause on first attempt for the majority of bugs they investigate
  • Developers working in unfamiliar codebases report the tool makes them effective at diagnosing bugs they otherwise could not have tackled independently
  • Developers trust the confidence calibration — when the tool says it is uncertain, they find that accurate, and when it says high confidence, the diagnosis is usually correct
  • Developers report less time spent logging in to multiple tools and hunting for relevant evidence ("it collected what I would have hunted for")

Quantitative Measures

  • Mean time from bug report filed to root cause identified reduced by 40% or more for bugs where the tool is invoked
  • Reduce MTTR for eligible incidents/bugs by 25% (end-to-end from detection to merged fix) compared with baseline
  • Regression rate of AI-proposed fixes (fixes that introduce new test failures) below 5%, measured against CI pipeline results
  • At least 50% of suggested patches that pass verification are accepted (merged) after human review
  • Historical bug match accuracy: at least 70% of surfaced similar-bug links rated as relevant by the investigating developer

Theme Evidence

Debugging, Root Cause Analysis & Bug Fix Assistance
debugging_root_cause_fixing
69 responses (19.5%) | α=0.958 | 3/3 convergence

AI that accelerates debugging by analyzing stack traces, logs, telemetry, and failing tests; reproduces issues; identifies root causes; suggests fixes; and helps triage incidents/regressions. This theme applies when the respondent describes diagnosing or fixing specific bugs or incidents that have...

#4
Repository Context Graph for Cross-File Changes
Cross-file changes fail when the engineer cannot see the ripples. Changing one type, interface, or contract in a large codebase means understanding downstream callers, tests, ownership boundaries, duplicated patterns, and the unwritten reasons code ended up where it did. Current assistants lose that thread after a few turns and fall back to local edits that look plausible in isolation but are wrong in the whole system.

Project Description

Maintain a continuously updated map of symbols, calls, types, modules, interfaces, tests, and historical discussions so developers can ask 'where should this change go?' and 'what breaks if I touch this?' before editing. The emphasis is system-level context retention across sessions, not one-off code generation.

Relevant Context Sources:
  • AST- and symbol-level index of the full repository across supported languages
  • Build definitions, module boundaries, dependency graphs, and test relationships
  • Version-control history, blame, and pull-request discussions
  • Linked ADRs, design notes, bug reports, and incident postmortems
  • Current lint/style rules and naming conventions from the repository
Capability Steps:
  1. On repository onboarding, parse source and build metadata into a persistent graph of symbols, types, calls, modules, APIs, and tests, then update that graph incrementally as code changes.
  2. Attach rationale and historical context to code regions using blame, PR review discussions, ADRs, and linked bugs or incidents.
  3. Infer a convention profile from the existing repository so file placement, naming, abstractions, and testing style reflect local norms rather than generic habits.
  4. Answer repo-specific how, why, and where questions with citations to code, docs, review threads, and historical changes.
  5. For a proposed change, compute a ripple report that lists affected files, interfaces, modules, tests, and likely risk categories before any patch is drafted.
  6. Generate a multi-file patch plan grouped into small steps that match repository conventions and can be validated against the build and test suite.
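The ripple report in step 5 is, at its core, a bounded walk over a reverse-dependency graph. A minimal sketch, assuming the graph has already been extracted into a plain dict mapping each symbol to the set of symbols that depend on it (the graph contents and the depth limit are illustrative):

```python
from collections import deque

def ripple_report(reverse_deps: dict[str, set[str]],
                  changed: set[str], max_depth: int = 3) -> dict[str, int]:
    """Breadth-first walk outward from the changed symbols, recording
    each affected symbol with the distance at which it was first reached."""
    affected: dict[str, int] = {}
    queue = deque((symbol, 0) for symbol in changed)
    while queue:
        symbol, depth = queue.popleft()
        if depth >= max_depth:
            continue  # cap the blast radius we report
        for dependent in reverse_deps.get(symbol, set()):
            if dependent not in changed and dependent not in affected:
                affected[dependent] = depth + 1
                queue.append((dependent, depth + 1))
    return affected
```

The distance annotation matters for the risk categories mentioned in step 5: direct callers (distance 1) are near-certain breakage, while deeper dependents are candidates for targeted test selection rather than mandatory edits.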

Who It Affects

65 responses (18.4%) | α=0.884

65 of 353 developers (18.4%) explicitly described needing AI to understand, retain, and safely act on repository-scale context across files and modules. This theme had strong inter-rater reliability (α=0.884) and the highest preference-usage gap (1.14), indicating a coherent, high-priority need that current AI support does not meet.

Quantitative Signals:
  • Inter-rater reliability α=0.884 indicating high agreement that these responses describe a coherent need
  • Preference-Usage Gap: 1.14
  • Average AI Preference: 4.21/5
  • Average AI Usage: 3.07/5
  • 77.6% of respondents want High/Very High AI support for this area
More cross-codebase awareness, so I can make a single change in a small part of the codebase (eg changing a datatype in a model, or changing nullability of a field), and AI will help implement those changes across the rest of the codebase.
PID 2
Refactoring would be the biggest help! It's almost always tedious and well-defined tasks. Frequently it needs more than just find-replace though. Almost always it is over large file and multi-file workflows.
PID 27

Impact

With a persistent, traceable model of the repository, developers could ask where a change belongs, see the ripple effects before editing, and receive small reviewable multi-file diffs that fit existing patterns and stay within scope. This would reduce time spent re-establishing context, hunting for prior fixes and design rationale, and manually checking downstream breakage, especially in unfamiliar parts of the codebase.

Evidence
More cross-codebase awareness, so I can make a single change in a small part of the codebase (eg changing a datatype in a model, or changing nullability of a field), and AI will help implement those changes across the rest of the codebase.
PID 2
keep track of how changes ripple across modules, that would save me hours and reduce risk.
PID 313
AI should know which files are related to the code.
PID 346

Constraints & Guardrails

Success Definition

Qualitative Measures

  • Developers report they no longer need to re-explain repository context at the start of each session.
  • Developers describe the tool's how/why/where answers as accurate and traceable to source artifacts.
  • Developers voluntarily use ripple-effect analysis before making cross-cutting changes.
  • Developers trust cross-file edit proposals enough to review and accept them instead of rewriting from scratch.
  • Developers stop reporting that AI duplicates code or breaks existing functionality when extending unfamiliar areas.

Quantitative Measures

  • Reduce the preference-usage gap from 1.14 to below 0.5 within 12 months of deployment.
  • Increase acceptance rate of cross-file edit proposals to 70%+ (vs. baseline of single-file suggestions).
  • >= 80% of tool-generated cross-file patch sets pass build + unit tests on the first execution in CI (within repo's existing test suite).
  • Reduce average number of back-and-forth iterations to complete a cross-file refactor/rename/migration by >= 30%.
  • Achieve 90%+ citation accuracy on how/why answers (verified by developer feedback on whether cited sources were relevant and correct).

Theme Evidence

Codebase Context, Knowledge Capture & Safe Cross-File Changes
codebase_context_knowledge_safe_changes
65 responses (18.4%) | α=0.884 | 3/3 convergence

AI that builds a reliable, repo-wide understanding of large/legacy systems -- including dependencies, conventions, and cross-module/service relationships -- to navigate code and make coordinated multi-file edits with awareness of ripple effects. It retrieves and synthesizes codebase- and...

Quality & Risk

155 responses | 8 themes

View Codebook
Research Projects
Ranked by prevalence and multi-model consensus
#1
Change-Aware Test Generation and Quality Gates
This project starts at the pull request, not the release meeting: its unit of analysis is the changed code and the behavior that diff is...
69 responses (44.5%) | α=0.97
#2
Context-Aware Pull Request Review Assistant
Large pull requests waste reviewer time before they hurt quality. Reviewers first have to reconstruct intent, infer which interfaces or...
36 responses (23.2%) | α=0.88
#3
Pre-Merge Security Advisor with Patch Suggestions
Security tooling already tells teams that something is wrong. The real delay comes after that, when developers have to figure out whether...
34 responses (21.9%) | α=0.94
#4
Compliance Evidence Auto-Collection and Questionnaire Drafting Assistant
Compliance work is dominated by translation. Engineers have to decode policy language, determine whether a control applies to their system,...
22 responses (14.2%) | α=0.95
#5
Change Risk Radar for Early Regression Warning
Regression risk often shows up first as a weak signal spread across rollout events, telemetry drift, and patterns from prior incidents....
18 responses (11.6%) | α=0.89
Theme Prevalence
(majority vote: assigned when 2+ of 3 models agree)
Automated Test Generation, Maintenance & Quality Gates
69 (44.5%)
α=0.97 | 3/3
Intelligent PR/Code Review Assistant
36 (23.2%)
α=0.88 | 3/3
Security Vulnerability Detection & Fix Guidance
34 (21.9%)
α=0.94 | 3/3
Compliance, Standards & Audit Process Automation
22 (14.2%)
α=0.95 | 3/3
Proactive Risk Monitoring, Prediction & Anomaly Detection
18 (11.6%)
α=0.89 | 3/3
Agentic Workflow Automation & Automated Remediation
15 (9.7%)
α=0.62 | 3/3
Knowledge Retrieval, Summarization & Standards Guidance
14 (9.0%)
α=0.81 | 3/3
Debugging, Root Cause Analysis & Failure Triage
8 (5.2%)
α=0.95 | 3/3
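The two consensus computations behind these numbers (majority-vote prevalence and nominal Krippendorff's α over three coders) can be sketched as follows. The data shape is illustrative: each unit is a list of binary codings, one per model, for a single theme.

```python
from collections import Counter

def majority_vote_prevalence(ratings, total_responses):
    """A response is assigned the theme when 2+ of its 3 codings agree."""
    assigned = sum(1 for unit in ratings if sum(unit) >= 2)
    return assigned, assigned / total_responses

def krippendorff_alpha_nominal(ratings):
    """Nominal Krippendorff's alpha from per-unit label lists, e.g.
    [[1, 1, 0], [0, 0, 0], ...]. Assumes at least two distinct
    categories appear somewhere across the units."""
    o = Counter()  # coincidence matrix: o[(c, k)] = weighted value pairs
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue  # units with a single coding are not pairable
        counts = Counter(unit)
        for c in counts:
            for k in counts:
                pairs = counts[c] * (counts[k] - (1 if c == k else 0))
                o[(c, k)] += pairs / (m - 1)
    marginals = Counter()
    for (c, _k), v in o.items():
        marginals[c] += v
    n = sum(marginals.values())
    observed = sum(v for (c, k), v in o.items() if c != k)
    expected = sum(marginals[c] * marginals[k]
                   for c in marginals for k in marginals if c != k) / (n - 1)
    return 1.0 - observed / expected
```

Note that α and prevalence answer different questions: prevalence counts units where at least two models agreed the theme applies, while α measures agreement beyond chance across all units, which is why a low-prevalence theme can still score a high α.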
Key Constraints & Guardrails (30)
Must not decide what the requirements or validation criteria should be; engineers must be able to edit the intent-to-test mapping.
"Deciding what the requirements should be. For example, with writing tests I want to be the one controlling the criteria we're testing for. Help writing the tests is fantastic, but it should be easy and standard for me to initiate changes in criteria/requirements, since those are the high level goals humans have to define." (PID 171)
From: Project #1
Must attach auditable proof of work for any exploratory or end-to-end activity; the tool cannot claim testing occurred without evidence.
"It's fine to delegate many tasks to AI agents such as testing code or end-user products, but we always need a person to be accountable for verifying: "what testing did the AI do?", "do we have a report with screenshots, etc, proving that the AI actually did what it claims, and it's not lying to us?", etc That is: somebody has to be checking up on the AI agents and validating their actions." (PID 120)
From: Project #1
Must not become the sole authority for security decisions; developers remain responsible for validating every suggested finding and fix.
"I do not want AI to be solely responsible for quality and security; a developer needs to validate the results of any generated code without leaning on it entirely." (PID 42)
From: Project #3
Must provide verifiable evidence of what it analyzed and tested, rather than unsupported claims.
"we always need a person to be accountable for verifying: "what testing did the AI do?", "do we have a report with screenshots, etc, proving that the AI actually did what it claims, and it's not lying to us?", etc" (PID 120)
From: Project #3
The tool must not make final ship/no-ship or risk-acceptance decisions for high-risk changes; those decisions remain human responsibilities.
"I don't want AI to handle final decision-making on high-risk changes, because human judgment is essential for understanding nuanced context, stakeholder impact, and accountability." (PID 703)
From: Project #5

Quality & Risk Codebook

Automated Test Generation, Maintenance & Quality Gates automated_test_generation_and_quality_gates α=0.97 3/3
Generate and maintain meaningful unit, integration, and E2E tests from requirements, code changes, UI/workflow context, and repo history; propose edge cases and test data; identify coverage gaps and regressions; and enforce quality checks in CI/CD to prevent low-quality changes from shipping. This theme covers the substance of what tests to write and what quality bars to enforce. It does not cover reviewing code for bugs or anti-patterns in a PR (that is intelligent_pr_code_review), nor the multi-step orchestration of running tests across tools or auto-creating fix PRs (that is agentic_workflow_automation_and_remediation). When a response mentions 'quality' generically, assign this theme when the response references testing, test coverage, test case generation, or CI/CD validation gates.
Intelligent PR/Code Review Assistant intelligent_pr_code_review α=0.88 3/3
Act as a context-aware reviewer that understands the codebase and team norms to summarize large diffs, flag likely bugs and anti-patterns, improve readability and maintainability, suggest refactors, and surface performance concerns -- reducing reviewer load and catching issues earlier. The distinguishing feature is evaluation and critique of existing or proposed code for correctness, style, and design quality. It does NOT cover generating tests or identifying coverage gaps (that is automated_test_generation_and_quality_gates), detecting security vulnerabilities specifically (that is security_vulnerability_detection_and_fix_guidance), or predicting which changes are high-risk based on telemetry (that is proactive_risk_monitoring_and_prediction). When a response mentions 'catching errors,' assign this theme only if the mechanism is code review or inspection.
Security Vulnerability Detection & Fix Guidance security_vulnerability_detection_and_fix_guidance α=0.94 3/3
Proactively scan code, PRs, and dependencies to identify security vulnerabilities (e.g., auth/authz gaps, insecure patterns, risky libraries, weak cryptography) and provide actionable remediation guidance or suggested patches before merge or deployment. This theme applies when the response specifically requests finding or fixing security flaws in code or dependencies. It does NOT cover interpreting security policies and compliance standards (that is compliance_and_audit_automation), nor broad risk monitoring via telemetry/logs (that is proactive_risk_monitoring_and_prediction). When a response mentions security alongside compliance, assign this theme only for the vulnerability-detection portion.
Compliance, Standards & Audit Process Automation compliance_and_audit_automation α=0.95 3/3
Reduce compliance toil by interpreting internal/external standards and policies, translating them into actionable developer steps, checking whether compliance bars are met, automating evidence collection and form/questionnaire filling, and improving audit readiness (e.g., SFI, S360, security review workflows). This theme applies when the response references a compliance process, audit workflow, or standards-enforcement procedure. It does NOT cover simply retrieving and summarizing policy documents for informational purposes without reference to a compliance workflow (that is knowledge_retrieval_and_standards_guidance). It also does NOT cover detecting security vulnerabilities in code (that is security_vulnerability_detection_and_fix_guidance).
Proactive Risk Monitoring, Prediction & Anomaly Detection proactive_risk_monitoring_and_prediction α=0.89 3/3
Use telemetry, logs, configurations, and historical change data to predict high-risk changes, detect regressions and anomalies early, track risk trends across services, assess likely impact, and generate prioritized risk reports or alerts so teams can mitigate issues before incidents escalate. The distinguishing feature is continuous or systematic monitoring of signals beyond the code itself to predict and surface risks proactively. It does NOT cover reviewing code in a PR for bugs (that is intelligent_pr_code_review), nor identifying test coverage gaps (that is automated_test_generation_and_quality_gates). When a response mentions 'identifying risks,' assign this theme only if the mechanism involves monitoring data signals or analyzing historical patterns, not inspecting code directly in a review.
Debugging, Root Cause Analysis & Failure Triage debugging_root_cause_and_failure_triage α=0.95 3/3
Accelerate diagnosis of complex failures by triaging test/CI failures and incidents, analyzing variants and signals across systems, identifying likely root causes, and suggesting next-best debugging steps -- reducing firefighting and post-incident investigation time. This theme applies when a response describes investigating why something failed or broke. It does NOT cover predicting which changes might fail in the future (that is proactive_risk_monitoring_and_prediction), nor reviewing code for potential bugs before they manifest (that is intelligent_pr_code_review). The key distinction is reactive investigation of an existing failure versus proactive detection of potential problems.
Knowledge Retrieval, Summarization & Standards Guidance knowledge_retrieval_and_standards_guidance α=0.81 3/3
Find and synthesize relevant code, documentation, and policies quickly; summarize long or complex guidance; explain ambiguous requirements; and provide up-to-date, org-specific best practices -- helping developers apply standards correctly without deep manual searching. This theme applies when a response describes retrieving, synthesizing, or explaining information to inform the developer's own decisions. It does NOT cover automating compliance workflows, checking whether bars are met, or filling out audit forms (that is compliance_and_audit_automation). It also does NOT cover scanning code for security vulnerabilities (that is security_vulnerability_detection_and_fix_guidance).
Agentic Workflow Automation & Automated Remediation agentic_workflow_automation_and_remediation α=0.62 3/3
Execute multi-step quality/risk tasks autonomously across tools (repos, CI, ticketing, scanners) such as running validations, opening or updating PRs, fixing low-hanging issues (including dependency and security updates), and applying repetitive repairs -- reducing manual overhead and speeding up remediation loops. The distinguishing feature is autonomous, multi-step execution across systems, not just suggesting or identifying what to do. This does not cover generating test cases (that is automated_test_generation_and_quality_gates), reviewing code for quality (that is intelligent_pr_code_review), or detecting security vulnerabilities (that is security_vulnerability_detection_and_fix_guidance). Assign when the response emphasizes AI taking action end-to-end; general automation wishes without reference to cross-system orchestration are better captured by the domain-specific themes.

Themes identified from "What do you NOT want AI to handle?" responses.

Human Approval and Accountability Required for High-Stakes Decisions and Actions human_approval_and_accountability_for_high_stakes_actions 3/3
Respondents want AI to surface evidence, risks, and recommendations, but not to make final high-stakes decisions (e.g., ship/no-ship, risk acceptance, incident severity/escalation) or be accountable for outcomes. They also oppose AI autonomously executing actions with real-world blast radius—editing/committing code, merging, deploying, or changing production/cloud configuration—without explicit human review and approval. A human must retain final judgment, authorization, and clear responsibility for the consequences.
Human Code Review / PR Approval Must Remain the Gate human_code_review_gate_required 3/3
Developers do not want AI to replace peer code review or be the sole reviewer/approver for pull requests. AI may assist (e.g., flag issues, summarize diffs), but meaningful human review is required for domain context, intentional design choices, and accountability at the last quality gate before CI/CD.
Security, Compliance, and Threat Modeling Must Be Human-Led security_and_compliance_must_be_human_led 3/3
Respondents do not trust AI to independently assure security/compliance, author threat models autonomously, or apply security fixes without rigorous human oversight. They cite catastrophic consequences of mistakes, rapidly changing threats/regulations, and the risk of false confidence from AI-generated security judgments.
Do Not Give AI Access to Sensitive/Customer Data or Credentials no_sensitive_data_or_credentials_access 3/3
Developers want strict limits on AI handling or accessing sensitive data (customer/user data, PII, confidential internal information) and secrets (keys, certificates, credentials). The core constraint is avoiding leakage, privacy breaches, and compliance violations.
AI Must Be Reliable, Verifiable, and Not Responsible for Its Own Validation ai_outputs_must_be_verifiable_and_not_self_validated 3/3
Respondents reject using AI for quality/risk work when outputs are hallucinatory, noisy, slow, or unverifiable, or when AI is expected to validate its own correctness. They want AI to provide grounded evidence, abstain when uncertain, and remain subject to human validation rather than being treated as an authoritative verifier.
Humans Must Own Requirements, Architecture, and Complex Trade-Off Reasoning humans_own_requirements_architecture_and_tradeoffs 3/3
Developers do not want AI to define requirements, infer intent, set priorities/criteria, or own complex architecture/integration decisions and advanced optimization. These tasks require holistic system context, nuanced business logic, originality, and careful trade-off reasoning that respondents believe current AI cannot reliably provide.
Test Intent/Strategy and Test-Plan Sign-Off Must Be Human-Led human_led_test_strategy_intent_and_signoff 1/3
AI can help generate or prioritize tests, but respondents want humans to define test intent, coverage criteria, and acceptance thresholds—and to sign off on test plans (including security testing). Concerns include AI generating misleading or harmful tests, missing intended behavior, or creating false confidence.
Preserve Human Ethics, Empathy, Communication, and Human-Centric Work preserve_human_ethics_empathy_and_human_centric_work 2/3
Some respondents resist AI taking over work that depends on empathy, ethical judgment, stakeholder/crisis communication, or that preserves human agency and growth (e.g., learning opportunities, creativity, personally meaningful tasks). The constraint is that these areas should remain primarily human-driven rather than delegated to AI.

Automated Test Generation, Maintenance & Quality Gates

automated_test_generation_and_quality_gates
Generate and maintain meaningful unit, integration, and E2E tests from requirements, code changes, UI/workflow context, and repo history; propose edge cases and test data; identify coverage gaps and regressions; and enforce quality checks in CI/CD to prevent low-quality changes from shipping. This theme covers the substance of what tests to write and what quality bars to enforce. It does not cover reviewing code for bugs or anti-patterns in a PR (that is intelligent_pr_code_review), nor the multi-step orchestration of running tests across tools or auto-creating fix PRs (that is agentic_workflow_automation_and_remediation). When a response mentions 'quality' generically, assign this theme when the response references testing, test coverage, test case generation, or CI/CD validation gates.
0.974
Krippendorff's α (Excellent)
69
Responses (44.5%)
3/3
Model Convergence
Prevalence
69 of 155 responses (44.5%)
Source Codes (3/3 models converged)
gpt: automated_test_generation gpt: test_coverage_and_quality_gates gemini: automated_test_generation opus: automated_test_generation
Developer Quotes

Intelligent PR/Code Review Assistant

intelligent_pr_code_review
Act as a context-aware reviewer that understands the codebase and team norms to summarize large diffs, flag likely bugs and anti-patterns, improve readability and maintainability, suggest refactors, and surface performance concerns -- reducing reviewer load and catching issues earlier. The distinguishing feature is evaluation and critique of existing or proposed code for correctness, style, and design quality. It does NOT cover generating tests or identifying coverage gaps (that is automated_test_generation_and_quality_gates), detecting security vulnerabilities specifically (that is security_vulnerability_detection_and_fix_guidance), or predicting which changes are high-risk based on telemetry (that is proactive_risk_monitoring_and_prediction). When a response mentions 'catching errors,' assign this theme only if the mechanism is code review or inspection.
0.880
Krippendorff's α (Excellent)
36
Responses (23.2%)
3/3
Model Convergence
Prevalence
36 of 155 responses (23.2%)
Source Codes (3/3 models converged)
gpt: pr_code_review_assistant gemini: intelligent_code_review opus: intelligent_code_review
Developer Quotes

Security Vulnerability Detection & Fix Guidance

security_vulnerability_detection_and_fix_guidance
Proactively scan code, PRs, and dependencies to identify security vulnerabilities (e.g., auth/authz gaps, insecure patterns, risky libraries, weak cryptography) and provide actionable remediation guidance or suggested patches before merge or deployment. This theme applies when the response specifically requests finding or fixing security flaws in code or dependencies. It does NOT cover interpreting security policies and compliance standards (that is compliance_and_audit_automation), nor broad risk monitoring via telemetry/logs (that is proactive_risk_monitoring_and_prediction). When a response mentions security alongside compliance, assign this theme only for the vulnerability-detection portion.
0.938
Krippendorff's α (Excellent)
34
Responses (21.9%)
3/3
Model Convergence
Prevalence
34 of 155 responses (21.9%)
Source Codes (3/3 models converged)
gpt: security_risk_detection gemini: security_vulnerability_detection opus: security_vulnerability_detection
Developer Quotes
I want AI to proactively evaluate contributions for potential security issues before I merge changes.
PID 42

Compliance, Standards & Audit Process Automation

compliance_and_audit_automation
Reduce compliance toil by interpreting internal/external standards and policies, translating them into actionable developer steps, checking whether compliance bars are met, automating evidence collection and form/questionnaire filling, and improving audit readiness (e.g., SFI, S360, security review workflows). This theme applies when the response references a compliance process, audit workflow, or standards-enforcement procedure. It does NOT cover simply retrieving and summarizing policy documents for informational purposes without reference to a compliance workflow (that is knowledge_retrieval_and_standards_guidance). It also does NOT cover detecting security vulnerabilities in code (that is security_vulnerability_detection_and_fix_guidance).
0.949
Krippendorff's α (Excellent)
22
Responses (14.2%)
3/3
Model Convergence
Prevalence
22 of 155 responses (14.2%)
Source Codes (3/3 models converged)
gpt: compliance_policy_automation · gemini: compliance_and_standards · opus: compliance_process_automation
Developer Quotes
I’d like AI to play a major role in real-time risk prediction, compliance monitoring, and automated root cause analysis. These capabilities can help catch issues early, ensure standards are met continuously, and reduce the time spent on post-incident investigations.
PID 10
Code review feedback including security and compliance checks. Specifically, accessibility, localization, and security issues.
PID 74

Proactive Risk Monitoring, Prediction & Anomaly Detection

proactive_risk_monitoring_and_prediction
Use telemetry, logs, configurations, and historical change data to predict high-risk changes, detect regressions and anomalies early, track risk trends across services, assess likely impact, and generate prioritized risk reports or alerts so teams can mitigate issues before incidents escalate. The distinguishing feature is continuous or systematic monitoring of signals beyond the code itself to predict and surface risks proactively. It does NOT cover reviewing code in a PR for bugs (that is intelligent_pr_code_review), nor identifying test coverage gaps (that is automated_test_generation_and_quality_gates). When a response mentions 'identifying risks,' assign this theme only if the mechanism involves monitoring data signals or analyzing historical patterns, not inspecting code directly in a review.
0.894
Krippendorff's α (Excellent)
18
Responses (11.6%)
3/3
Model Convergence
Prevalence
18 of 155 responses (11.6%)
Source Codes (3/3 models converged)
gpt: real_time_risk_monitoring_analytics · gemini: proactive_risk_anomaly_detection · opus: proactive_risk_prediction
Developer Quotes
Should be able to detect high risk changes & derisk them.
PID 47

Agentic Workflow Automation & Automated Remediation

agentic_workflow_automation_and_remediation
Execute multi-step quality/risk tasks autonomously across tools (repos, CI, ticketing, scanners) such as running validations, opening or updating PRs, fixing low-hanging issues (including dependency and security updates), and applying repetitive repairs -- reducing manual overhead and speeding up remediation loops. The distinguishing feature is autonomous, multi-step execution across systems, not just suggesting or identifying what to do. This does not cover generating test cases (that is automated_test_generation_and_quality_gates), reviewing code for quality (that is intelligent_pr_code_review), or detecting security vulnerabilities (that is security_vulnerability_detection_and_fix_guidance). Assign when the response emphasizes AI taking action end-to-end; general automation wishes without reference to cross-system orchestration are better captured by the domain-specific themes.
0.620
Krippendorff's α (Moderate)
15
Responses (9.7%)
3/3
Model Convergence
Prevalence
15 of 155 responses (9.7%)
Source Codes (3/3 models converged)
gpt: agentic_task_automation_integration · gemini: automated_remediation_and_fixes · gemini: general_automation_and_assistance · opus: general_quality_automation
Developer Quotes
More automated QA checks of documents, videos, decks, etc assets. Even better is if AI could execute low hanging fruit repairs rather than merely listing them.
PID 147

Knowledge Retrieval, Summarization & Standards Guidance

knowledge_retrieval_and_standards_guidance
Find and synthesize relevant code, documentation, and policies quickly; summarize long or complex guidance; explain ambiguous requirements; and provide up-to-date, org-specific best practices -- helping developers apply standards correctly without deep manual searching. This theme applies when a response describes retrieving, synthesizing, or explaining information to inform the developer's own decisions. It does NOT cover automating compliance workflows, checking whether bars are met, or filling out audit forms (that is compliance_and_audit_automation). It also does NOT cover scanning code for security vulnerabilities (that is security_vulnerability_detection_and_fix_guidance).
0.813
Krippendorff's α (Excellent)
14
Responses (9.0%)
3/3
Model Convergence
Prevalence
14 of 155 responses (9.0%)
Source Codes (3/3 models converged)
gpt: knowledge_search_summarization · gemini: documentation_and_information_retrieval · opus: standards_knowledge_advisory
Developer Quotes
I find AI is good for scanning and summarizing massive amounts of information. So having an AI agent well-versed in conventions and standards could be very helpful for drafting code in environments with which I am less familiar.
PID 21

Debugging, Root Cause Analysis & Failure Triage

debugging_root_cause_and_failure_triage
Accelerate diagnosis of complex failures by triaging test/CI failures and incidents, analyzing variants and signals across systems, identifying likely root causes, and suggesting next-best debugging steps -- reducing firefighting and post-incident investigation time. This theme applies when a response describes investigating why something failed or broke. It does NOT cover predicting which changes might fail in the future (that is proactive_risk_monitoring_and_prediction), nor reviewing code for potential bugs before they manifest (that is intelligent_pr_code_review). The key distinction is reactive investigation of an existing failure versus proactive detection of potential problems.
0.954
Krippendorff's α (Excellent)
8
Responses (5.2%)
3/3
Model Convergence
Prevalence
8 of 155 responses (5.2%)
Source Codes (3/3 models converged)
gpt: debugging_root_cause_triage · gemini: debugging_and_root_cause_analysis · opus: root_cause_debugging
Developer Quotes
I’d like AI to play a major role in real-time risk prediction, compliance monitoring, and automated root cause analysis. These capabilities can help catch issues early, ensure standards are met continuously, and reduce the time spent on post-incident investigations.
PID 10
I would like to have AI investigate variants and help in debugging
PID 183
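The prevalence counts shown on each theme above follow the majority-vote rule from the methodology: a response counts toward a theme only when at least two of the three model coders assigned it. A minimal Python sketch of that rule; the model names match the pipeline, but the PIDs and the resulting counts are illustrative, not taken from the survey data:

```python
from collections import Counter

def prevalence(codings, total_responses, min_agreement=2):
    """Count unique PIDs that >= min_agreement model coders
    assigned to a theme, and report the percentage of all
    responses (the majority-vote prevalence rule)."""
    votes = Counter()
    for pids in codings.values():
        for pid in pids:
            votes[pid] += 1
    agreed = {pid for pid, n in votes.items() if n >= min_agreement}
    return len(agreed), 100 * len(agreed) / total_responses

# Illustrative per-model codings for one theme (PIDs made up):
codings = {
    "gpt":    {10, 42, 74, 120},
    "gemini": {10, 42, 74, 212},
    "opus":   {10, 42, 225},
}
count, pct = prevalence(codings, total_responses=155)
# PIDs 10 and 42 get 3 votes, PID 74 gets 2; the single-model
# codes (120, 212, 225) are dropped, so count == 3.
```

The same structure extends naturally to multi-coding, since each theme's vote tally is computed independently and a PID may clear the threshold for several themes at once.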
#1
Change-Aware Test Generation and Quality Gates
This project starts at the pull request, not the release meeting: its unit of analysis is the changed code and the behavior that diff is supposed to preserve. Teams already know testing is tedious; the harder problem is tying candidate tests and gate failures to the exact behaviors, branches, interfaces, and workflows the change actually disturbed. In legacy codebases especially, developers need a change-by-change account of what was exercised, what is still risky, and why.

Project Description

On each pull request, derive a behavior-to-test map from the diff, linked requirements, existing suites, and coverage deltas; generate missing unit, integration, or end-to-end tests; run them; and publish a compact gate report showing exactly what changed behavior remains untested. For workflow-heavy changes, the same system can switch into a release-rehearsal mode and simulate the affected pipeline steps in a sandbox so config and CI logic are validated alongside the code.

Relevant Context Sources:
  • PR diff, touched interfaces, and dependency impact map
  • Existing tests, helpers, and coverage deltas for changed code
  • Linked requirements, acceptance criteria, and bug/regression history for touched areas
  • Build and test configuration plus flaky-test history
  • Pipeline or workflow definitions when CI/CD files are part of the change
  • Approved synthetic fixtures and test-data templates
Capability Steps:
  1. Ingest the diff and compute an impact map of changed functions, interfaces, workflows, and downstream dependents instead of treating the PR as an undifferentiated blob.
  2. Recover expected behavior from linked requirements, existing tests, and recent bugs; when intent is missing, ask only for the behavior that cannot be inferred from those sources.
  3. Create an intent-to-test matrix that ties each proposed check to a changed branch, interface, workflow, or stated requirement.
  4. Generate or update repository-style unit, integration, and end-to-end tests, reusing existing helpers and suppressing redundant cases.
  5. Run the new tests together with impacted existing suites; if workflow files changed, simulate the affected pipeline stages in a sandbox with stubbed secrets and non-production targets.
  6. Publish a gate report that shows what changed behavior was exercised, where changed-line or changed-branch coverage is still thin, and which gaps matter for critical paths.
  7. Apply configurable gates to changed code and high-risk workflows while preserving auditable human overrides.
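The "changed behavior remains untested" portion of the gate report (step 6) reduces to a set difference between the lines a diff touched and the lines the test run executed. A minimal sketch under that framing; the file names and data are hypothetical, and a real system would parse these sets from the diff and a coverage report:

```python
def changed_line_gaps(changed, covered):
    """For each file, return the diff-touched lines that no
    test executed -- the changed-line coverage gaps a gate
    report would surface.

    changed: dict file -> set of line numbers touched by the diff
    covered: dict file -> set of line numbers executed by tests
    """
    gaps = {}
    for path, lines in changed.items():
        missing = lines - covered.get(path, set())
        if missing:
            gaps[path] = sorted(missing)
    return gaps

# Hypothetical diff and coverage data:
changed = {"billing.py": {10, 11, 42}, "util.py": {5}}
covered = {"billing.py": {10, 42}, "util.py": {5}}
changed_line_gaps(changed, covered)
# -> {"billing.py": [11]}
```

Changed-branch gaps work the same way with (line, branch) pairs as set elements; the gate then decides whether the remaining gaps touch a critical path.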

Who It Affects

69 responses (44.5%) · α = 0.974

69 of 155 respondents (44.5%) explicitly asked for AI help with test generation, coverage analysis, quality gates, or test automation—the largest theme in the quality-risk category. Coding agreement was near-perfect (α = 0.974), and the gap between high preference and lower current usage indicates a broad unmet need rather than a niche request.

Quantitative Signals:
  • 81.0% of respondents in this theme want High or Very High AI support for testing tasks
  • Average AI Preference: 4.32/5
  • Average AI Usage: 2.75/5
  • Preference-Usage Gap: 1.57
Over the next 1–3 years, I want AI to play a major role in proactively detecting regressions, analyzing test coverage gaps, and predicting high-risk areas in code changes so we can shift from reactive firefighting to preventive quality assurance.
PID 59

Impact

If successful, developers would move from writing most tests from scratch to reviewing a change-specific test plan and candidate tests that already cover happy paths, edge cases, and regressions. Pull requests would include concrete evidence of what was exercised, what changed-code coverage is still missing, and whether minimum gates are met, giving teams more confidence in new features and legacy-code changes without handing final judgment to the tool.

Evidence
Given the amount of testing gaps, particularly in old code bases, having AI take a role in helping ensure our code's ongoing quality is probably most ideal.
PID 153
I want it to enhance my testing code. I can provide happy cases and I'd like it to see my change and create additional tests for edge cases.
PID 286

Constraints & Guardrails

Must not decide what the requirements or validation criteria should be; engineers must be able to edit the intent-to-test mapping.
"Deciding what the requirements should be. For example, with writing tests I want to be the one controlling the criteria we're testing for. Help writing the tests is fantastic, but it should be easy and standard for me to initiate changes in criteria/requirements, since those are the high level goals humans have to define." (PID 171)
Must attach auditable proof of work for any exploratory or end-to-end activity; the tool cannot claim testing occurred without evidence.
"It's fine to delegate many tasks to AI agents such as testing code or end-user products, but we always need a person to be accountable for verifying: "what testing did the AI do?", "do we have a report with screenshots, etc, proving that the AI actually did what it claims, and it's not lying to us?", etc That is: somebody has to be checking up on the AI agents and validating their actions." (PID 120)

Success Definition

Qualitative Measures

  • Developers report that writing tests has shifted from a write-from-scratch task to a review-and-refine task on typical changes.
  • Generated tests are described as readable, human-editable, and aligned with repository conventions rather than generic boilerplate.
  • Teams say the system reliably surfaces edge cases and changed-code gaps they would otherwise miss.
  • Engineers trust CI gate outcomes because they include concrete evidence of executed tests, coverage deltas, and artifacts for audited workflows.
  • Teams working in legacy code report higher confidence making changes because untested areas are made explicit and easier to address.

Quantitative Measures

  • Reduce median time-to-add-tests per change by 30% (measured from first pipeline run to tests accepted in the repository).
  • Increase coverage on changed lines or branches by +15% within 3 months of adoption for onboarded repositories (coverage delta focused on touched code, not overall).
  • Reduce escaped regression defects linked to insufficient test coverage by 20% (based on bug or incident tagging and change linkage).
  • Decrease occurrences of behavior-changing pull requests with no new tests added by 25% (measured via gate outcomes).
  • Keep false-positive gate failures under 5% of runs (developer-labeled unhelpful or incorrect gate outcomes).

Theme Evidence

Automated Test Generation, Maintenance & Quality Gates
automated_test_generation_and_quality_gates
69 responses (44.5%) · α = 0.974 · 3/3 convergence

Generate and maintain meaningful unit, integration, and E2E tests from requirements, code changes, UI/workflow context, and repo history; propose edge cases and test data; identify coverage gaps and regressions; and enforce quality checks in CI/CD to prevent low-quality changes from shipping. This...

#2
Context-Aware Pull Request Review Assistant
Large pull requests waste reviewer time before they hurt quality. Reviewers first have to reconstruct intent, infer which interfaces or modules are actually at risk, and separate mechanical churn from meaningful behavior changes. Generic automated review comments make that worse by pointing out style trivia while missing the repo-specific mistakes humans actually care about.

Project Description

Add a bounded first-pass analyst to the PR workflow. It should explain what changed, trace which modules and interfaces are most affected, and attach line-specific findings on likely defects, readability problems, performance issues, missing tests, or policy violations—together with what it checked and what it did not check.

Relevant Context Sources:
  • Pull request diff, description, commit messages, and linked work items
  • Semantic code index for the target branch, including symbols and call relationships
  • Team review rules, architectural constraints, and contribution guidelines from the repository
  • Historical PR discussions and resolved review comments from the same codebase
  • CI outputs such as static analysis, type checks, tests, and build results
Capability Steps:
  1. Ingest the PR diff and metadata, then compute the reachable impact set across changed symbols, public interfaces, and dependency-relevant context.
  2. Generate a layered summary of the change: intent, modules touched, interface movement, risky hotspots, and likely downstream impact.
  3. Analyze the changed code for likely correctness, performance, readability, maintainability, and policy issues using the repository’s own review norms and historical comment patterns.
  4. Attach each finding to exact files or lines with severity, confidence, and a short rationale; when intent is ambiguous, turn the finding into a clarifying question instead of a false assertion.
  5. Publish both inline annotations and a top-level summary that explicitly lists what categories were checked and where the analysis boundary stopped.
  6. Use accepted and dismissed findings to adapt to repo-specific norms and reduce repeat noise over time.
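Step 4 above implies a concrete shape for each finding: a file-and-line anchor plus severity, confidence, and rationale, with ambiguous cases downgraded to clarifying questions. A minimal sketch of that record; the field set and example values are assumptions, not a fixed schema from the survey:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One line-anchored review finding. When intent cannot be
    confirmed, it is emitted as a clarifying question rather
    than a false assertion (per step 4)."""
    file: str
    line: int
    category: str      # e.g. "correctness", "readability"
    severity: str      # e.g. "low", "medium", "high"
    confidence: float  # 0.0 - 1.0
    rationale: str
    is_question: bool = False

def render(f: Finding) -> str:
    """Format a finding as an inline PR annotation."""
    prefix = "Question" if f.is_question else f"{f.severity.title()} {f.category}"
    return f"{f.file}:{f.line} [{prefix}, conf {f.confidence:.0%}] {f.rationale}"

f = Finding("api/handlers.py", 88, "correctness", "medium", 0.7,
            "Is the empty-list case intentional here?", is_question=True)
render(f)
# -> "api/handlers.py:88 [Question, conf 70%] Is the empty-list case intentional here?"
```

Keeping severity and confidence as separate fields also gives step 6 something to learn from: dismissals of high-confidence findings are the strongest signal that a repo norm differs from the default.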

Who It Affects

36 responses (23.2%) · α = 0.880


This affects teams that use pull requests as a core quality checkpoint and struggle with reviewer load, large diffs, and shallow automated feedback. In the survey, 36 of the 155 developers (23.2%) who answered the 'want AI help' question explicitly asked for assistance with PR/code review tasks, and stated demand is strong despite low current usage.

Quantitative Signals:
  • 36 of 155 respondents (23.2%) were coded to the Intelligent PR/Code Review Assistant theme
  • 81.0% of respondents want High or Very High AI support in this area
  • Average AI Preference: 4.32/5
  • Average AI Usage: 2.75/5
  • Preference-Usage Gap: 1.57
Better automated code reviewing would be fantastic. Pointing to locations in code with likely bugs, anti-patterns, confusing code that should be documented/written better.
PID 171
Doing code reviews in context based manner rather than simply looking for best practices
PID 264
It would be a nice to have for AI to help with code reviews. Summarize tidbits, it would be more of a time-saver tool (or understanding very complex code). Especially beneficial for abnormally large PRs with few thousands of lines of code.
PID 754

Impact

If successful, the assistant gives reviewers a reliable first pass on each pull request: a concise explanation of what changed, which parts of the codebase are most affected, and evidence-backed findings on likely defects, anti-patterns, readability problems, and performance concerns. This should reduce the time humans spend orienting to large diffs and repetitive low-level checks, while helping them focus their attention on design intent, special-case business logic, and other judgments the team does not want automated away.

Evidence
It would be a nice to have for AI to help with code reviews. Summarize tidbits, it would be more of a time-saver tool (or understanding very complex code). Especially beneficial for abnormally large PRs with few thousands of lines of code.
PID 754
Look for all common coding misses in Code reviews and mention in a comment that it has reviewed "xyz" aspects of the change and the confidence of the review so that human reviewers can look at other aspects of the code changes.
PID 429
Better automated code reviewing would be fantastic. Pointing to locations in code with likely bugs, anti-patterns, confusing code that should be documented/written better.
PID 171

Constraints & Guardrails

Success Definition

Qualitative Measures

  • Reviewers report that summaries help them understand large or complex pull requests faster.
  • Developers say the assistant's comments are specific to their repository and team norms rather than generic best-practice boilerplate.
  • Human reviewers report spending less time on mechanical checking and more time on design intent and change-specific judgment.
  • Users say they trust the assistant more because each finding includes evidence, confidence, and explicit coverage limits.

Quantitative Measures

  • At least 80% of reviewed pull requests have the assistant's summary rated as useful or very useful in a post-merge survey.
  • Reduce median time-to-first-human-review on pull requests using the assistant by 20-30%.
  • Reduce review iterations by 10-20% for pull requests larger than 500 changed lines.
  • Increase acceptance rate of assistant findings to more than 40% while keeping false-positive dismissals below 30%.
  • Reduce post-merge defects attributable to covered categories by 10-15%.

Theme Evidence

Intelligent PR/Code Review Assistant
intelligent_pr_code_review
36 responses (23.2%) · α = 0.880 · 3/3 convergence

Act as a context-aware reviewer that understands the codebase and team norms to summarize large diffs, flag likely bugs and anti-patterns, improve readability and maintainability, suggest refactors, and surface performance concerns -- reducing reviewer load and catching issues earlier. The...

#3
Pre-Merge Security Advisor with Patch Suggestions
Security tooling already tells teams that something is wrong. The real delay comes after that, when developers have to figure out whether the finding is real in their code path, whether the vulnerable package is actually reachable, and what a safe local fix looks like in this repository. A useful assistant here has to connect scanner output to concrete handlers, middleware, dependency chains, and configuration choices in the current change.

Project Description

Run a pre-merge security pass that ties findings to concrete code paths, dependency uses, and configuration choices in the current change. For high-confidence cases, draft the smallest plausible remediation patch together with the specific checks—build, unit, route-level tests, or security scans—that show whether the fix actually works.

Relevant Context Sources:
  • PR diff and surrounding source context for touched handlers, routes, and middleware
  • Call graph and dataflow through changed auth, input, crypto, or secret-handling paths
  • Dependency manifests, lockfiles, SBOM data, and advisory feeds
  • Security-relevant configuration such as route policy, middleware, and access-control files
  • Past security findings, suppressions, and remediation commits in the same repo
  • Build, test, and security scan outputs for the candidate patch
Capability Steps:
  1. Localize the analysis to changed code and nearby security-sensitive surfaces such as auth checks, input handling, secrets, crypto, filesystem access, and network-facing paths.
  2. Run static, pattern, taint-style, and dependency advisory checks, then correlate the results into findings tied to exact files, lines, source-to-sink paths, or dependency chains.
  3. Explain each issue in developer terms: how it manifests here, what makes it risky, and which repository patterns or policies suggest the safer form.
  4. Generate minimal remediation options, including a conservative patch and, when relevant, a slightly broader but more future-proof alternative.
  5. Validate the chosen candidate against build, unit, and relevant security checks before surfacing it in the PR.
  6. Record dismissals, deferrals, and false positives with rationale so the assistant can reduce repeat noise per repository.
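Step 2's "findings tied to dependency chains" requires showing how a flagged package is actually reached from the project's direct dependencies. A breadth-first walk over the lockfile graph is the simplest way to recover one such chain per advisory; the package names and graph below are hypothetical:

```python
from collections import deque

def advisory_paths(direct_deps, dep_graph, vulnerable):
    """Return one dependency chain from a direct dependency to
    each reachable vulnerable package, so a finding can cite
    the exact chain. Packages that never appear in any path are
    the easy 'advisory does not apply here' case.

    dep_graph: dict package -> list of its dependencies
    """
    paths = {}
    for root in direct_deps:
        queue = deque([[root]])
        seen = {root}
        while queue:
            path = queue.popleft()
            pkg = path[-1]
            if pkg in vulnerable and pkg not in paths:
                paths[pkg] = path
            for dep in dep_graph.get(pkg, []):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(path + [dep])
    return paths

# Hypothetical lockfile graph and advisory:
graph = {"webapp-core": ["json-parse"], "json-parse": ["old-crypto"]}
advisory_paths(["webapp-core"], graph, vulnerable={"old-crypto"})
# -> {"old-crypto": ["webapp-core", "json-parse", "old-crypto"]}
```

Reachability in the dependency graph is only the first filter; the taint-style checks in step 2 would still need to confirm that the vulnerable code path is exercised before the finding is surfaced as high confidence.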

Who It Affects

34 responses (21.9%) · α = 0.938

34 of 155 developers (21.9%) explicitly requested AI assistance with security vulnerability detection and fix guidance. Demand is both strong and clearly expressed: the theme has near-perfect coding agreement (α = 0.938), 81.0% of respondents want high or very high AI support, and the preference-usage gap is large (4.32/5 desired vs. 2.75/5 current usage), indicating that existing tools are not meeting the need.

Quantitative Signals:
  • 34 of 155 responses coded to this theme (21.9%)
  • 81.0% of respondents in this theme want high or very high AI support for security vulnerability detection
  • Average AI preference: 4.32/5
  • Average AI usage: 2.75/5
  • Preference-Usage Gap: 1.57
  • α = 0.938
I want AI to proactively evaluate contributions for potential security issues before I merge changes.
PID 42
it could also be helpful in testing backend APIs, like trying to sniff out any authentication or authorization gaps in our web services.
PID 120

Impact

Developers get security feedback at the point of code change, with evidence of exploitability and a repository-specific fix they can review, instead of late or non-actionable scanner output. This shortens the path from finding to remediation for both code and dependency issues, reduces back-and-forth with security reviewers, and helps teams catch auth/authz gaps and vulnerable libraries before merge.

Evidence
I want AI to proactively evaluate contributions for potential security issues before I merge changes.
PID 42
There are lots of security requirements being pushed down to each team. They know the answer and how it should be fixed, but lack the local context. It would be nice to have an AI understand the local context for a security issue, and make a PR, or other reviewable action to take.
PID 212
Flagging security bugs/risks in libraries, updating libraries, updating outdated code
PID 225

Constraints & Guardrails

Must not become the sole authority for security decisions; developers remain responsible for validating every suggested finding and fix.
"I do not want AI to be solely responsible for quality and security; a developer needs to validate the results of any generated code without leaning on it entirely." (PID 42)
Must provide verifiable evidence of what it analyzed and tested, rather than unsupported claims.
"we always need a person to be accountable for verifying: "what testing did the AI do?", "do we have a report with screenshots, etc, proving that the AI actually did what it claims, and it's not lying to us?", etc" (PID 120)

Success Definition

Qualitative Measures

  • Developers report that security findings are actionable because explanations tie the issue to local code and dependency usage.
  • Developers report that remediation suggestions are written in plain language rather than security jargon.
  • Developers trust the tool because it shows evidence, confidence, and validation results and does not overclaim.
  • Security reviewers report fewer back-and-forth cycles because pull requests arrive with clearer fixes and rationale.

Quantitative Measures

  • Increase pre-merge security issue detection rate (share of security issues found before merge) by 20+ percentage points.
  • Reduce high/critical security findings detected after merge by 30-50% within 2 quarters of rollout.
  • False-positive rate (share of findings later marked "not applicable") below 10% after 4 weeks of tuning per repository.
  • Median time-to-remediate for dependency vulnerabilities reduced by 25-40%.
  • Preference-usage gap reduction: increase average usage from 2.75/5 toward preference (4.32/5) over 6 months.

Theme Evidence

Security Vulnerability Detection & Fix Guidance
security_vulnerability_detection_and_fix_guidance
34 responses (21.9%) · α = 0.938 · 3/3 convergence

Proactively scan code, PRs, and dependencies to identify security vulnerabilities (e.g., auth/authz gaps, insecure patterns, risky libraries, weak cryptography) and provide actionable remediation guidance or suggested patches before merge or deployment. This theme applies when the response...

#4
Compliance Evidence Auto-Collection and Questionnaire Drafting Assistant
Compliance work is dominated by translation. Engineers have to decode policy language, determine whether a control applies to their system, hunt down the right artifacts across scanners and repos, and restate those findings in the format an auditor expects. The final attestation is not the main burden; the burden is the evidence chase and the form-filling that turns everyday engineering facts into defensible compliance answers.

Project Description

Translate a selected control set into an applicability matrix, fetch the required artifacts from engineering systems, and draft questionnaire answers with citations, timestamps, and explicit gaps. This is a workflow engine for evidence collection and policy-to-engineering translation, not a system that pronounces a service compliant.

Relevant Context Sources:
  • Control catalogs, policy text, and questionnaire templates with stable IDs and versions
  • Service and repository inventory, including ownership and data classification
  • CI results, scanner outputs, and cloud policy state from read-only connectors
  • Work items, waivers, approvals, and remediation history
  • Prior submissions and stored evidence packets for the same control families
Capability Steps:
  1. Resolve the target system and parse the selected control set into required evidence types, applicability rules, and allowed answer formats.
  2. Collect a minimal set of scoping facts that determine which controls apply and record the reasoning for each applicable or non-applicable judgment.
  3. Generate an evidence plan per control that specifies where the artifact should come from, how fresh it must be, and whether human input is still required.
  4. Harvest available artifacts into an evidence ledger with immutable references, timestamps, and provenance instead of copying raw screenshots and fragments into free text.
  5. Draft questionnaire answers control by control, citing the evidence used, marking low-confidence fields, and creating remediation tasks where the evidence is missing or points to a gap.
  6. Assemble an audit packet containing the applicability matrix, evidence ledger, drafted responses, unresolved issues, and the final edits made by human reviewers.
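The evidence ledger in step 4 needs entries that are citable and tamper-evident rather than pasted fragments: a content hash, a timestamp, and the provenance of the artifact. A minimal sketch of one ledger entry; the control ID and connector name are hypothetical placeholders:

```python
import hashlib
import json
from datetime import datetime, timezone

def ledger_entry(control_id, source, artifact_bytes):
    """Record one harvested artifact with an immutable content
    hash, a UTC timestamp, and provenance, so drafted
    questionnaire answers can cite evidence precisely."""
    return {
        "control_id": control_id,
        "source": source,  # read-only connector that produced it
        "sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }

entry = ledger_entry(
    control_id="AC-2.3",            # hypothetical control ID
    source="ci-scanner-connector",  # hypothetical connector name
    artifact_bytes=b'{"scan": "clean"}',
)
json.dumps(entry)  # serializable for inclusion in the audit packet
```

Because the answer drafts in step 5 reference entries by hash rather than by copied text, a reviewer can later verify that the cited evidence is byte-identical to what was collected.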

Who It Affects

22 responses (14.2%) · α = 0.949

22 out of 155 "want help" responses explicitly referenced compliance processes, audit readiness, or standards-enforcement workflows. This theme had very high inter-rater reliability (α=0.949), indicating a clear and consistently expressed practitioner pain point rather than isolated complaints.

Quantitative Signals:
  • Average AI Preference: 4.32/5
  • Average AI Usage: 2.75/5
  • Preference-Usage Gap: 1.57
  • 81.0% of respondents in this theme want High or Very High AI support for compliance tasks
AI can take on the repetitive overhead involved in the tasks. E.g. a security review today requires a hundred question survey, much of which is querying and fetching data through a time-consuming process. AI should be able to handle a first-draft and information retrieval.
PID 210

Impact

Instead of spending hours hunting through repositories, pipeline outputs, scanners, and documents to answer long compliance questionnaires, developers would receive a reviewable draft with cited evidence, explicit gaps, and plain-language action items. This shifts effort from repetitive retrieval and form-filling to checking edge cases, reduces incomplete submissions and back-and-forth with reviewers, and moves teams closer to continuous audit readiness.

Evidence
AI can take on the repetitive overhead involved in the tasks. E.g. a security review today requires a hundred question survey, much of which is querying and fetching data through a time-consuming process. AI should be able to handle a first-draft and information retrieval.
PID 210

Constraints & Guardrails

Success Definition

Qualitative Measures

  • Developers report that compliance requirements are translated into clear, actionable steps rather than dense jargon.
  • Developers trust drafted answers because every control status includes evidence links, timestamps, and an explanation of what was checked.
  • Reviewers report more complete submissions with fewer back-and-forth revision cycles.
  • Teams feel audit-ready earlier in the process rather than scrambling only at formal review time.

Quantitative Measures

  • Reduce median time to complete a compliance/security review questionnaire by 50% (from first open to submitted draft).
  • Reduce number of manual evidence lookups per review by 40% (tracked via workflow events and connector usage).
  • Auto-collect and attach evidence for at least 60% of applicable controls.
  • Increase first-pass acceptance rate of compliance submissions by 30% (fewer returns for missing evidence or unclear answers).
  • Maintain low error rate: fewer than 5% of drafted questionnaire answers require major correction due to factual inaccuracies.

Theme Evidence

Compliance, Standards & Audit Process Automation
compliance_and_audit_automation
22 responses (14.2%) · α = 0.949 · 3/3 convergence

Reduce compliance toil by interpreting internal/external standards and policies, translating them into actionable developer steps, checking whether compliance bars are met, automating evidence collection and form/questionnaire filling, and improving audit readiness (e.g., SFI, S360, security review...

#5
Change Risk Radar for Early Regression Warning
Regression risk often shows up first as a weak signal spread across rollout events, telemetry drift, and patterns from prior incidents. Humans are poor at seeing that composite picture early because the clues live in different systems and look harmless on their own. Teams need a first-pass risk lens that connects a new deployment or config flip to the operational signatures that usually precede trouble.

Project Description

Continuously score recent deployments, config flips, and feature-flag changes against service-specific baselines and historical failure patterns, then issue short risk briefs that explain why a change looks dangerous, which signals are drifting, and what blast radius is most plausible. The output should read like an early warning note, not an opaque verdict.

Relevant Context Sources:
  • Deployment, configuration, and feature-flag events with service and environment identifiers
  • Service metrics, traces, and representative anomalous log signatures
  • Historical incidents and postmortems for similar services or change types
  • Dependency or service-topology relationships that shape blast radius
  • CI quality signals and change metadata for the rollout under evaluation
Capability Steps:
  1. Assign a canonical ChangeID to each deployment or config event so telemetry can be tied back to a specific rollout or flag flip.
  2. Learn service-specific healthy baselines and detect low-severity deviations in latency, error rate, crash patterns, saturation, or log templates.
  3. Correlate deviations to recent changes using timing, topology, rollout strategy, and historical incident similarity.
  4. Compute a risk score with explicit drivers such as blast radius, prior instability, limited test coverage, risky dependency edges, or unusual config movement.
  5. Generate a short risk brief for the owning team that includes the top drivers, before/after signal comparisons, relevant past incidents, and suggested next checks.
  6. Use engineer feedback and post-incident outcomes to recalibrate thresholds and highlight recurring risk motifs across services.
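Steps 4 and 5 above could be sketched as a transparent weighted score with ranked drivers. This is a minimal illustration only: every driver name, weight, and input value below is hypothetical, and in practice the weights would be recalibrated from engineer feedback and post-incident outcomes (step 6).

```python
from dataclasses import dataclass

# Hypothetical per-change driver signals, each normalized to [0, 1].
# Names and weights are illustrative, not part of the surveyed proposal.
DRIVER_WEIGHTS = {
    "blast_radius": 0.30,          # downstream services reachable from the change
    "prior_instability": 0.25,     # incident history for this service/change type
    "test_coverage_gap": 0.20,     # 1 - coverage of the changed code paths
    "risky_dependency_edges": 0.15,
    "unusual_config_movement": 0.10,
}

@dataclass
class RiskBrief:
    change_id: str
    score: float          # weighted sum in [0, 1]
    top_drivers: list     # (driver, weighted contribution), descending

def score_change(change_id: str, drivers: dict) -> RiskBrief:
    """Compute a weighted risk score and rank the drivers that explain it."""
    contributions = {
        name: DRIVER_WEIGHTS[name] * drivers.get(name, 0.0)
        for name in DRIVER_WEIGHTS
    }
    score = round(sum(contributions.values()), 3)
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return RiskBrief(change_id=change_id, score=score, top_drivers=ranked[:3])

# A wide-blast-radius deploy with weak test coverage scores high, and the
# brief can name blast radius as the leading driver rather than emitting
# an opaque verdict.
brief = score_change("deploy-2024-10-07-a", {
    "blast_radius": 0.9,
    "prior_instability": 0.4,
    "test_coverage_gap": 0.7,
})
```

Keeping the drivers explicit is what lets the output "read like an early warning note": the brief can state which contributions dominated instead of reporting a bare number.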

Who It Affects

18 responses (11.6%) | α=0.894

18 of 155 respondents (11.6%) explicitly asked for AI support with proactive risk monitoring, prediction, or anomaly detection using operational signals beyond code review. This theme has the highest preference-usage gap among all themes, indicating strong unmet demand for systems that correlate production signals and change history to surface risk early.

Quantitative Signals:
  • Average AI Preference: 4.32/5
  • Average AI Usage: 2.75/5
  • Preference-Usage Gap: 1.57
  • 81% of respondents in this theme want High or Very High AI support
  • High inter-rater reliability (α = 0.894)
I'd like AI to play a major role in real-time risk prediction, compliance monitoring, and automated root cause analysis. These capabilities can help catch issues early, ensure standards are met continuously, and reduce the time spent on post-incident investigations.
PID 10
Should be able to detect high risk changes & derisk them.
PID 47
Over the next 1–3 years, I want AI to play a major role in proactively detecting regressions, analyzing test coverage gaps, and predicting high-risk areas in code changes so we can shift from reactive firefighting to preventive quality assurance.
PID 59

Impact

If this exists, teams get early, evidence-backed warnings about risky changes while problems are still low severity instead of discovering them during firefighting. Developers spend less time stitching together dashboards, logs, and rollout history during triage, and more time acting on a prioritized view of likely causes and impact. Over time, repeated alerts can also reveal recurring risk drivers so teams can address root causes rather than repeatedly patching symptoms.

Evidence
AI shows promise in detecting deviations from a known good baseline. I want these agents to alert us to potential issues while they're still low-sev, before they've had a chance to blow up.
PID 174
Over the next 1–3 years, I want AI to help most with automating quality checks and surfacing hidden risks early. If it can scan logs, configs, or workflow patterns and flag things that might break later – especially the ones I'd miss in manual review – that would save me a ton of time and prevent surprises. I also want it to track risk trends across projects, so I can focus on fixing root causes instead of just patching symptoms.
PID 313
We can shift from reactive incident handling to real time mitigation and smarter decision making.
PID 331

Constraints & Guardrails

The tool must not make final ship/no-ship or risk-acceptance decisions for high-risk changes; those decisions remain human responsibilities.
"I don't want AI to handle final decision-making on high-risk changes, because human judgment is essential for understanding nuanced context, stakeholder impact, and accountability." (PID 703)

Success Definition

Qualitative Measures

  • Developers report fewer surprise regressions after rollout because alerts arrive early with enough evidence to investigate.
  • On-call engineers say alerts are understandable and clearly linked to likely related changes.
  • Developers describe the risk brief and drill-down view as reducing manual log-hunting and cross-signal correlation during triage.
  • Teams report that repeated alerts help them identify recurring risk drivers and prioritize root-cause fixes instead of patching symptoms.

Quantitative Measures

  • Reduce mean time to detect regressions after a change by at least 30% within 2 quarters for onboarded services.
  • Increase the percentage of incidents where a pre-incident risk signal was surfaced from less than 10% to more than 50% within 12 months.
  • Achieve a false-positive rate below 15% for high-severity risk alerts after an initial calibration period.
  • Attach a risk score and risk brief to more than 90% of production changes for onboarded services.
  • Reduce average time spent on manual post-incident root cause correlation by at least 30%.

Theme Evidence

Proactive Risk Monitoring, Prediction & Anomaly Detection
proactive_risk_monitoring_and_prediction
18 responses (11.6%) | α=0.894 | 3/3 convergence

Use telemetry, logs, configurations, and historical change data to predict high-risk changes, detect regressions and anomalies early, track risk trends across services, assess likely impact, and generate prioritized risk reports or alerts so teams can mitigate issues before incidents escalate. The...

Infrastructure & Ops

101 responses | 8 themes

View Codebook
Research Projects
Ranked by prevalence and multi-model consensus
#1
Telemetry Correlation Assistant for Alert Tuning and Incident Triage
On-call pain is usually front-loaded. The first fifteen minutes of an incident are spent deciding whether the page is noise, figuring out...
41 responses (40.6%) | α=0.97
#2
CI/CD and Infrastructure-as-Code Blueprint Builder with Failure Triage
Pipeline and infrastructure definition work has an awkward profile: it is repetitive enough to template, brittle enough to fear, and...
34 responses (33.7%) | α=0.91
#3
Service Upkeep Backlog Generator
Service ownership accumulates as a hundred small obligations: patch the base image, retire a deprecated API, close a scanner finding,...
17 responses (16.8%) | α=0.95
#4
Support Triage and Drafting Assistant
Support teams lose time twice: first deciding what category a ticket belongs in, then repeating the same telemetry lookup and...
12 responses (11.9%) | α=1.00
Theme Prevalence
(majority vote: assigned when 2+ of 3 models agree)
Observability & Incident Response Automation (Monitoring, Triage, RCA, Mitigation, Self-Heal)
41 (40.6%) | α=0.97 | 3/3
CI/CD, Deployment & Infrastructure Provisioning Automation (Pipelines + IaC)
34 (33.7%) | α=0.91 | 3/3
Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization
17 (16.8%) | α=0.95 | 3/3
Customer Support Triage & Auto-Response
12 (11.9%) | α=1.00 | 3/3
Testing, Quality Validation & Safer Releases
8 (7.9%) | α=1.00 | 3/3
Knowledge Management, Documentation Search & System Context
7 (6.9%) | α=0.78 | 3/3
Ops Toil Automation & Script Writing/Debugging
7 (6.9%) | α=0.95 | 3/3
Better AI Tooling UX (Accuracy, Control & Cohesive Workflows)
7 (6.9%) | α=0.80 | 3/3
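The majority-vote assignment and per-theme reliability computation described in the methodology can be sketched as follows. This is a minimal illustration of Krippendorff's α for nominal (binary) data, assuming all three model coders coded every response with no missing values; the toy codings below are hypothetical, not survey data.

```python
from collections import Counter

def majority_vote(labels):
    """Assign a theme when 2+ of the 3 model coders agree (binary labels)."""
    return sum(labels) >= 2

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data with no missing codings.

    `units` is a list of per-response label lists, e.g. [1, 1, 0] holds
    three coders' binary judgements for one theme on one response.
    """
    m = len(units[0])                      # coders per unit (3 here)
    n_total = Counter()                    # category counts over all values
    disagree = 0
    for labels in units:
        counts = Counter(labels)
        n_total.update(counts)
        # nominal disagreement within a unit: mismatched coder pairs
        disagree += sum(c * (m - c) for c in counts.values())
    n = len(units)
    pm = n * m                             # total pairable values
    d_o = disagree / (n * m * (m - 1))     # observed disagreement
    d_e = sum(c * (pm - c) for c in n_total.values()) / (pm * (pm - 1))
    return 1.0 - d_o / d_e                 # α = 1 - D_o / D_e

# Toy example: 3 model coders, binary "theme assigned?" per response.
codings = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 1]]
assigned = [majority_vote(c) for c in codings]  # prevalence counts the True PIDs
```

Note that [1, 1, 0] still yields an assignment under majority vote even though it lowers α, which is why prevalence and reliability are reported separately per theme.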
Key Constraints & Guardrails (24)
Do not execute any production-affecting remediation or recovery action without explicit human oversight and approval.
"Executing operations against production resources to try and resolve incidents without human oversight, because some of our systems are very interconnected and I would not expect an AI agent to have enough background/historical context to make the correct decision." (PID 639)
From: Project #1
Operate with read-only permissions by default; do not treat the assistant as a system administrator.
"Agentic AI is not a system administrator. It does not get any permissions to do anything but read and alert. This is a rubicon I shan't cross." (PID 476)
From: Project #1
Do not autonomously perform production rollbacks or security policy changes.
"I wouldn't want AI to handle critical production rollbacks or security policy changes autonomously, because these actions carry high risk and require human judgment to weigh context, trade-offs, and potential impact." (PID 706)
From: Project #1
Must not create or approve change requests without a human intermediary.
"I don't want AI to handle creating or approving change requests without a human intermediary because it's important that a human familiar with the relevant services ensures that changes proposed by AI won't introduce issues" (PID 125)
From: Project #2
Must not hold live write permissions; it should operate read-only against infrastructure and generate artifacts rather than execute changes.
"Agentic AI is not a system administrator. It does not get any permissions to do anything but read and alert. This is a rubicon I shan't cross." (PID 476)
From: Project #2

Infrastructure & Ops Codebook

Observability & Incident Response Automation (Monitoring, Triage, RCA, Mitigation, Self-Heal) observability_and_incident_response_automation α=0.97 3/3
AI that continuously analyzes telemetry (metrics, logs, traces) to set up and tune monitoring, detect anomalies, predict failures, and generate higher-signal alerts to reduce noise and missed conditions. When incidents occur, it correlates signals across systems, summarizes impact and timeline, identifies likely root causes, and proposes mitigations/runbooks while enriching incident records. Where safe, it can execute automated recovery/self-healing actions. This theme covers the real-time detection-through-mitigation lifecycle. It does NOT cover longer-term operational upkeep (upgrades, patching, security posture, cost optimization) -- that belongs under infrastructure_maintenance_upgrades_security_cost_optimization. When monitoring surfaces a security or health issue, the detection belongs here but the planned remediation work belongs under maintenance.
CI/CD, Deployment & Infrastructure Provisioning Automation (Pipelines + IaC) cicd_deployment_and_iac_automation α=0.91 3/3
AI that creates, explains, migrates, reviews, and maintains CI/CD pipelines and deployment workflows, including automating releases and troubleshooting build/deploy failures. It also reduces toil in provisioning environments by generating or updating infrastructure-as-code (e.g., Bicep/ARM/EV2) and assisting with setup, configuration, and environment migrations. This theme covers the creation and structural maintenance of pipelines, deployment workflows, and environment definitions. It does NOT cover routine ongoing operational upkeep such as patching, dependency upgrades, or security posture work on already-running environments -- that belongs under infrastructure_maintenance_upgrades_security_cost_optimization.
Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization infrastructure_maintenance_upgrades_security_cost_optimization α=0.95 3/3
AI that plans and drives routine operational upkeep of already-running services -- upgrades/patching, dependency and API/workflow migrations, security/compliance posture management (e.g., SFI, S360 remediation), and resource/cost optimization -- by generating actionable work items and recommendations to keep services healthy. This theme captures the ongoing 'keep the lights on' lifecycle. It does NOT cover the initial creation or provisioning of infrastructure and pipelines (that is cicd_deployment_and_iac_automation), nor real-time anomaly detection or incident triage (that is observability_and_incident_response_automation).
Customer Support Triage & Auto-Response customer_support_triage_and_autoresponse α=1.00 3/3
AI that screens and buckets customer or user support requests, correlates user-reported issues with telemetry/logs, drafts responses from known solutions/knowledge bases, and escalates appropriately -- reducing repetitive support workload. This theme applies when the respondent explicitly mentions customer-facing or user-facing support interactions. It does NOT cover internal engineering incident triage or on-call alert handling, which belongs under observability_and_incident_response_automation even when the incident was initially reported by a customer.
Knowledge Management, Documentation Search & System Context knowledge_management_doc_search_and_system_context α=0.78 3/3
AI that captures and organizes tribal knowledge, retrieves relevant docs with citations, surfaces related prior incidents and decisions, and builds a holistic understanding of system topology and dependencies to speed troubleshooting, onboarding, and decision-making. Apply when the respondent's core ask is about building, searching, or maintaining the knowledge system itself. Do not assign when poor documentation is mentioned only as background context motivating a different operational request.
Ops Toil Automation & Script Writing/Debugging ops_toil_automation_and_script_generation α=0.95 3/3
AI that eliminates repetitive manual infrastructure/ops work by generating, debugging, and maintaining reliable automation scripts (e.g., Bash, PowerShell) and run-anywhere task automations, including cross-platform compatibility and reduced drudgery. This theme applies when the emphasis is on general-purpose scripting or eliminating undifferentiated manual toil not specific to another theme. It does NOT cover environment provisioning or setup better described as IaC or CI/CD pipeline work (see cicd_deployment_and_iac_automation). Code this theme for the scripting/automation mechanism itself, not when the toil happens to involve a domain already covered by another theme.
Testing, Quality Validation & Safer Releases testing_quality_validation_and_safe_deploy α=1.00 3/3
AI that improves delivery quality by generating unit/integration tests, validating changes and quality gates, enabling local workflow/pipeline testing, and helping determine safe/unsafe deployment timing to reduce regressions and failed releases. This theme applies when the response explicitly discusses test generation, quality validation, or deployment-safety assessment. It does NOT cover anomaly detection in production systems (that is observability_and_incident_response_automation) unless the respondent specifically frames the ask as a pre-release or release-gating check. It also does NOT cover analyzing CI/CD logs to debug build failures (that is cicd_deployment_and_iac_automation).
Better AI Tooling UX (Accuracy, Control & Cohesive Workflows) ai_tooling_ux_accuracy_and_cohesive_workflows α=0.80 3/3
Developers want AI assistance that is accurate and trustworthy, integrated across related infra tasks (not fragmented), provides usable UI/controls, and supports human understanding/learning rather than opaque automation. This theme applies when a response is primarily about the quality, reliability, usability, or integration of AI tooling itself rather than about a specific infrastructure task the AI would perform. It does not apply to responses expressing only generic positive sentiment toward AI without identifying a concrete tooling-quality concern.

Themes identified from "What do you NOT want AI to handle?" responses.

No Direct AI-to-Customer Interaction no_direct_customer_interaction 3/3
AI should not directly interact with customers or replace human-led customer support/customer-facing communications. Respondents cite lack of empathy/nuance, increased customer frustration, and reputational/brand trust risk; AI is more acceptable as behind-the-scenes support (drafting, summarizing, suggesting).
No Autonomous Production Deployments or Production Changes no_autonomous_production_changes 3/3
AI should not independently deploy to production, change production configuration/infrastructure, or execute production-affecting mitigations/rollbacks without human control. The blast radius, outage/revenue risk, and cascading dependencies make unsupervised production actions unacceptable.
Human Trigger Required for Incident/Emergency and Other Consequential Actions human_trigger_required_for_consequential_actions 3/3
AI should not autonomously execute consequential actions—especially live incident response, emergency mitigations, or critical overrides—without an explicit human trigger/approval. It can detect issues, triage, propose actions, and prepare workflows or messages, but a human must review context, weigh trade-offs, and decide when to act. This preserves accountability and avoids brittle end-to-end “auto-fix” pipelines that can amplify errors under pressure.
No AI Management of Security, Access, Permissions, or Secrets no_security_permissions_secrets_management 3/3
AI should not manage security-sensitive configurations (IAM/permissions, access controls, secret/key management, compliance/security policy changes, sensitive data handling) because mistakes or hallucinations can cause severe security exposure and governance/compliance violations. Elevated-privilege operations should remain human-controlled.
Avoid AI for High-Precision/Deterministic, High-Cost-of-Error Work avoid_ai_for_high_precision_deterministic_work 3/3
AI is seen as insufficiently reliable/deterministic for operational tasks requiring high precision and correctness guarantees (due to hallucinations, edge cases, and opaque reasoning). Where correctness must be assured, respondents prefer traditional deterministic automation or strict verification with humans responsible for final correctness.
No Full Autonomy for Environment Setup and Ongoing Maintenance no_full_autonomy_for_environment_setup_maintenance 2/3
AI should not fully own foundational environment setup/configuration or ongoing maintenance (especially superuser/admin tasks). Respondents worry about hidden drift, hard-to-debug states, and long-term maintainability/understandability if AI changes core environments without strong oversight and determinism.
Preserve Human Learning, System Understanding, and Accountability preserve_human_learning_and_accountability 3/3
AI should not replace work that builds engineers' foundational understanding (especially for juniors) or create over-reliance that weakens hands-on expertise. Respondents also want clear human responsibility for outcomes rather than delegating blame/ownership to AI.
No AI-Initiated Irreversible/Destructive Data Operations no_ai_initiated_irreversible_or_destructive_data_actions 3/3
AI should not perform irreversible or destructive operations (e.g., deleting databases/data, destructive migrations, non-rollbackable changes) because the asymmetric cost of mistakes (easy to execute, extremely costly to recover) demands explicit human control and safeguards.

Observability & Incident Response Automation (Monitoring, Triage, RCA, Mitigation, Self-Heal)

observability_and_incident_response_automation
AI that continuously analyzes telemetry (metrics, logs, traces) to set up and tune monitoring, detect anomalies, predict failures, and generate higher-signal alerts to reduce noise and missed conditions. When incidents occur, it correlates signals across systems, summarizes impact and timeline, identifies likely root causes, and proposes mitigations/runbooks while enriching incident records. Where safe, it can execute automated recovery/self-healing actions. This theme covers the real-time detection-through-mitigation lifecycle. It does NOT cover longer-term operational upkeep (upgrades, patching, security posture, cost optimization) -- that belongs under infrastructure_maintenance_upgrades_security_cost_optimization. When monitoring surfaces a security or health issue, the detection belongs here but the planned remediation work belongs under maintenance.
0.973
Krippendorff's α (Excellent)
41
Responses (40.6%)
3/3
Model Convergence
Prevalence
41 of 101 responses (40.6%)
Source Codes (3/3 models converged)
gpt: incident_response_rca_remediation, gpt: smart_monitoring_alerts, gemini: incident_response_and_rca, gemini: intelligent_monitoring_and_alerting, opus: incident_response_root_cause, opus: intelligent_monitoring_alerting
Developer Quotes

CI/CD, Deployment & Infrastructure Provisioning Automation (Pipelines + IaC)

cicd_deployment_and_iac_automation
AI that creates, explains, migrates, reviews, and maintains CI/CD pipelines and deployment workflows, including automating releases and troubleshooting build/deploy failures. It also reduces toil in provisioning environments by generating or updating infrastructure-as-code (e.g., Bicep/ARM/EV2) and assisting with setup, configuration, and environment migrations. This theme covers the creation and structural maintenance of pipelines, deployment workflows, and environment definitions. It does NOT cover routine ongoing operational upkeep such as patching, dependency upgrades, or security posture work on already-running environments -- that belongs under infrastructure_maintenance_upgrades_security_cost_optimization.
0.910
Krippendorff's α (Excellent)
34
Responses (33.7%)
3/3
Model Convergence
Prevalence
34 of 101 responses (33.7%)
Source Codes (3/3 models converged)
gpt: cicd_deployment_automation, gpt: infra_environment_provisioning_iac, gemini: ci_cd_pipeline_automation, gemini: infra_environment_setup, opus: cicd_pipeline_automation, opus: environment_setup_provisioning
Developer Quotes

Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization

infrastructure_maintenance_upgrades_security_cost_optimization
AI that plans and drives routine operational upkeep of already-running services -- upgrades/patching, dependency and API/workflow migrations, security/compliance posture management (e.g., SFI, S360 remediation), and resource/cost optimization -- by generating actionable work items and recommendations to keep services healthy. This theme captures the ongoing 'keep the lights on' lifecycle. It does NOT cover the initial creation or provisioning of infrastructure and pipelines (that is cicd_deployment_and_iac_automation), nor real-time anomaly detection or incident triage (that is observability_and_incident_response_automation).
0.954
Krippendorff's α (Excellent)
17
Responses (16.8%)
3/3
Model Convergence
Prevalence
17 of 101 responses (16.8%)
Source Codes (3/3 models converged)
gpt: proactive_maintenance_security_cost_optimization gemini: routine_maintenance_and_upgrades opus: infra_maintenance_upgrades
Developer Quotes
Setting up infra, migrating infra, maintaining/updating infra.
PID 3
If AI can do all infra setup, troubleshooting, and maintenance, and monitoring and alerting, that would be great.
PID 175

Customer Support Triage & Auto-Response

customer_support_triage_and_autoresponse
AI that screens and buckets customer or user support requests, correlates user-reported issues with telemetry/logs, drafts responses from known solutions/knowledge bases, and escalates appropriately -- reducing repetitive support workload. This theme applies when the respondent explicitly mentions customer-facing or user-facing support interactions. It does NOT cover internal engineering incident triage or on-call alert handling, which belongs under observability_and_incident_response_automation even when the incident was initially reported by a customer.
1.000
Krippendorff's α (Excellent)
12
Responses (11.9%)
3/3
Model Convergence
Prevalence
12 of 101 responses (11.9%)
Source Codes (3/3 models converged)
gpt: customer_support_automation gemini: customer_support_automation opus: customer_support_triage
Developer Quotes
If an AI agent could actually look at customer support request text and logs and make screening, bucketing, and triage decisions based on the content, that would be super helpful.
PID 92
customer support. we got a lot of repetitive customer question, like permission issue. I hope AI can detect common pattern of our customer request triaging and reply to customer directly.
PID 117

Testing, Quality Validation & Safer Releases

testing_quality_validation_and_safe_deploy
AI that improves delivery quality by generating unit/integration tests, validating changes and quality gates, enabling local workflow/pipeline testing, and helping determine safe/unsafe deployment timing to reduce regressions and failed releases. This theme applies when the response explicitly discusses test generation, quality validation, or deployment-safety assessment. It does NOT cover anomaly detection in production systems (that is observability_and_incident_response_automation) unless the respondent specifically frames the ask as a pre-release or release-gating check. It also does NOT cover analyzing CI/CD logs to debug build failures (that is cicd_deployment_and_iac_automation).
1.000
Krippendorff's α (Excellent)
8
Responses (7.9%)
3/3
Model Convergence
Prevalence
8 of 101 responses (7.9%)
Source Codes (3/3 models converged)
gpt: testing_quality_validation
Developer Quotes
I would like AI analysis of metrics and logs to accelerate root cause analysis of incidents, highlighting interesting things in dashboards, and discovering anomalies in prerelease deployments.
PID 56
Coding: Be a coding assistant CI/CD: Review code/configuration; Detect safe/unsafe time to deploy Debugging: Assist in analyzing logs, metrics, test errors Monitoring: Detect anomalies, predict failures Customer service: Self-service for well-known issues; Be an assistant for customer service reps.
PID 202

Knowledge Management, Documentation Search & System Context

knowledge_management_doc_search_and_system_context
AI that captures and organizes tribal knowledge, retrieves relevant docs with citations, surfaces related prior incidents and decisions, and builds a holistic understanding of system topology and dependencies to speed troubleshooting, onboarding, and decision-making. Apply when the respondent's core ask is about building, searching, or maintaining the knowledge system itself. Do not assign when poor documentation is mentioned only as background context motivating a different operational request.
0.776
Krippendorff's α (Acceptable)
7
Responses (6.9%)
3/3
Model Convergence
Prevalence
7 of 101 responses (6.9%)
Source Codes (3/3 models converged)
gpt: knowledge_management_doc_search_context gemini: contextual_knowledge_and_documentation opus: knowledge_documentation_context
Developer Quotes
Organizing "tribal knowledge" for easy transfer of context.
PID 64

Ops Toil Automation & Script Writing/Debugging

ops_toil_automation_and_script_generation
AI that eliminates repetitive manual infrastructure/ops work by generating, debugging, and maintaining reliable automation scripts (e.g., Bash, PowerShell) and run-anywhere task automations, including cross-platform compatibility and reduced drudgery. This theme applies when the emphasis is on general-purpose scripting or eliminating undifferentiated manual toil not specific to another theme. It does NOT cover environment provisioning or setup better described as IaC or CI/CD pipeline work (see cicd_deployment_and_iac_automation). Code this theme for the scripting/automation mechanism itself, not when the toil happens to involve a domain already covered by another theme.
0.951
Krippendorff's α (Excellent)
7
Responses (6.9%)
3/3
Model Convergence
Prevalence
7 of 101 responses (6.9%)
Source Codes (3/3 models converged)
gpt: scripting_task_automation, opus: automate_toil_repetitive, opus: script_writing_debugging
Developer Quotes
Help debug scripts used for infra tasks.
PID 22
to be able to make bash scripts which are compatible with different unix os like osx, ubuntu or centos, azurelinux etc.
PID 165

Better AI Tooling UX (Accuracy, Control & Cohesive Workflows)

ai_tooling_ux_accuracy_and_cohesive_workflows
Developers want AI assistance that is accurate and trustworthy, integrated across related infra tasks (not fragmented), provides usable UI/controls, and supports human understanding/learning rather than opaque automation. This theme applies when a response is primarily about the quality, reliability, usability, or integration of AI tooling itself rather than about a specific infrastructure task the AI would perform. It does not apply to responses expressing only generic positive sentiment toward AI without identifying a concrete tooling-quality concern.
0.796
Krippendorff's α (Acceptable)
7
Responses (6.9%)
3/3
Model Convergence
Prevalence
7 of 101 responses (6.9%)
Source Codes (3/3 models converged)
gpt: tooling_experience_accuracy_learning
Developer Quotes
I want AI to enhance human engineer's ability WHILE increasing the rate at which human engineers learn and develop. The human should always be paramount, and AI should be a tool to enhance what humans can understand and apply to their work, not a crutch to replace learning and thinking and therefore stunting human learning.
PID 17
I would like it to answer questions more accurately
PID 133
#1
Telemetry Correlation Assistant for Alert Tuning and Incident Triage
On-call pain is usually front-loaded. The first fifteen minutes of an incident are spent deciding whether the page is noise, figuring out which service actually moved first, and rediscovering the same query patterns under pressure. The issue is not just alert volume; it is the lack of a synthesized incident picture at the moment a responder needs one.

Project Description

Operate as a read-mostly observability assistant that tunes monitors from historical alert quality and, when an alert fires, assembles an incident brief from logs, traces, recent deploys, dependency topology, similar incidents, and runbook fragments. The assistant should shorten the orientation phase of incident response, not impersonate an incident commander.

Relevant Context Sources:
  • Metrics, logs, and traces from production telemetry systems
  • Alert rules, page history, acknowledgement data, and routing policies
  • Service dependency graph, ownership metadata, and criticality tags
  • Deployment, configuration, and feature-flag events
  • Incident records, postmortems, and mitigation runbooks
  • Audit logs for assistant reads and proposed actions
Capability Steps:
  1. Normalize telemetry, alert, and change events into a common schema with stable service IDs and aligned timestamps.
  2. Refresh the service dependency graph from metadata and observed traces so the assistant can separate upstream causes from downstream symptoms.
  3. Analyze historical alert quality to propose threshold, deduplication, and missing-monitor changes with explicit trade-offs between noise and missed detection.
  4. When an alert fires, gather a compact evidence bundle from correlated logs, metrics, traces, neighboring services, and recent deploy or config events.
  5. Generate a structured incident brief with suspected user impact, event timeline, top candidate causes, and clearly stated unknowns.
  6. Retrieve similar prior incidents and related runbook steps, annotating each mitigation option with expected effect, risk, and rollback guidance.
  7. Auto-enrich the incident record with the evidence bundle, chosen mitigation, and outcome so later alerts can benefit from the resolved case.
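Step 1 of the capability list might look like the following sketch. The common-schema fields, raw record layout, and `normalize_deploy_event` helper are hypothetical illustrations, not drawn from any actual telemetry system; the point is only that every source maps onto stable service IDs and timezone-aligned timestamps before correlation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class NormalizedEvent:
    service_id: str        # stable ID shared across telemetry, alerts, deploys
    kind: str              # "metric" | "alert" | "deploy" | "config" | "flag"
    timestamp: datetime    # always timezone-aware UTC for cross-source alignment
    payload: dict          # source-specific details, kept for the evidence bundle

def normalize_deploy_event(raw: dict) -> NormalizedEvent:
    """Map one (hypothetical) deploy-system record into the common schema."""
    # Align timestamps: convert the source's epoch milliseconds to UTC.
    ts = datetime.fromtimestamp(raw["epoch_ms"] / 1000, tz=timezone.utc)
    return NormalizedEvent(
        service_id=raw["service"].lower(),   # canonicalize the service ID
        kind="deploy",
        timestamp=ts,
        payload={"build": raw.get("build_id"), "env": raw.get("environment")},
    )

evt = normalize_deploy_event({
    "service": "Checkout-API",
    "epoch_ms": 1700000000000,
    "build_id": "b123",
    "environment": "prod",
})
```

With events in this shape, steps 2 through 5 can join alerts, deploys, and config flips by `service_id` and time window instead of reconciling source-specific formats at incident time.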

Who It Affects

41 responses (40.6%) | α = 0.973

41 of 101 respondents (40.6%) were coded to this theme — the largest theme in infrastructure operations, with near-perfect inter-rater reliability (α = 0.973). These are on-call engineers and service owners responsible for monitoring production health, triaging alerts, and coordinating incident response across distributed systems.

Quantitative Signals:
  • Theme prevalence: 41/101 (40.6%) requests
  • 74.5% of respondents in this category want High or Very High AI support for infrastructure operations tasks
  • Average AI preference: 4.11/5 vs current usage: 2.42/5 (gap: 1.69)
Alert and error triage. Finding root causes.
PID 52
I would like AI analysis of metrics and logs to accelerate root cause analysis of incidents, highlighting interesting things in dashboards, and discovering anomalies in prerelease deployments.
PID 56

Impact

Instead of starting from raw alerts, scattered dashboards, and manual log queries, on-call engineers would receive fewer low-signal alerts and an incident brief that already correlates likely impact, recent changes, and candidate causes. Repeated incidents would become faster to diagnose because the assistant surfaces similar prior incidents and relevant runbook steps, letting engineers focus on deciding and executing the response rather than assembling context.

Evidence
Automating queries in logs, making alarms smarter, reducing manual intervention by getting to root cause faster
PID 83
collect information from different component and help me understand the live incident, and give me some suggestions
PID 254

Constraints & Guardrails

Do not execute any production-affecting remediation or recovery action without explicit human oversight and approval.
"Executing operations against production resources to try and resolve incidents without human oversight, because some of our systems are very interconnected and I would not expect an AI agent to have enough background/historical context to make the correct decision." (PID 639)
Operate with read-only permissions by default; do not treat the assistant as a system administrator.
"Agentic AI is not a system administrator. It does not get any permissions to do anything but read and alert. This is a rubicon I shan't cross." (PID 476)
Do not autonomously perform production rollbacks or security policy changes.
"I wouldn't want AI to handle critical production rollbacks or security policy changes autonomously, because these actions carry high risk and require human judgment to weigh context, trade-offs, and potential impact." (PID 706)

Success Definition

Qualitative Measures

  • On-call engineers report that incident summaries are accurate and useful, reducing the need to manually query logs during triage.
  • On-call engineers report fewer low-signal pages and faster understanding of what is broken and why.
  • Engineers trust the system's root cause hypotheses enough to use them as a starting point rather than investigating from scratch.
  • Teams report that monitoring coverage gaps are caught proactively, with fewer incidents caused by missing or misconfigured monitors.
  • Past-incident matching and runbook suggestions are perceived as relevant and actionable.
  • Incident commanders say the generated timelines and impact summaries reduce coordination overhead and improve handoffs between responders.

Quantitative Measures

  • Reduce mean time-to-triage (alert fire to root cause hypothesis) by 50% within 6 months of adoption.
  • Reduce alert/page volume per service by 20–40% while maintaining or improving incident detection coverage (no increase in escaped incidents).
  • Increase monitoring coverage (% of services with health monitors) from baseline by 30% through gap detection and auto-suggested monitors.
  • Achieve 80%+ accuracy for top-3 root cause hypotheses as validated against post-mortem confirmed root causes.
  • Improve mean time to mitigate/restore (MTTR) by 10–25% through faster correlation, timeline generation, and mitigation guidance.
  • Reduce manual incident record enrichment time by 70% through auto-generated summaries and evidence bundles.

Theme Evidence

Observability & Incident Response Automation (Monitoring, Triage, RCA, Mitigation, Self-Heal)
observability_and_incident_response_automation
41 responses (40.6%) | α = 0.973 | 3/3 convergence

AI that continuously analyzes telemetry (metrics, logs, traces) to set up and tune monitoring, detect anomalies, predict failures, and generate higher-signal alerts to reduce noise and missed conditions. When incidents occur, it correlates signals across systems, summarizes impact and timeline,...

#2
CI/CD and Infrastructure-as-Code Blueprint Builder with Failure Triage
Pipeline and infrastructure definition work has an awkward profile: it is repetitive enough to template, brittle enough to fear, and organization-specific enough that public examples often do more harm than good. Engineers want help with the parts nobody enjoys—wiring a new repo into the delivery stack, migrating an old pipeline format, debugging opaque build or deployment logs—without turning the assistant into a deployment operator.

Project Description

Author, migrate, explain, and debug delivery definitions by reading the repository’s build targets, workflow files, IaC modules, and platform templates, then producing file-by-file patches and diagnostic notes rather than opaque tips. The same assistant should be able to render the current pipeline topology in plain language so engineers can see how jobs, artifacts, environments, and gates actually fit together.

Relevant Context Sources:
  • Existing workflow files, build scripts, deployment manifests, and IaC modules
  • Pipeline execution logs, failed run metadata, and deployment events
  • Organization pipeline templates, policy rules, and approved base images
  • Read-only environment inventory and desired-state data
  • Repository structure, test targets, and dependency manifests
  • Version-control history for delivery definition files
Capability Steps:
  1. Parse the repository into a delivery profile that identifies runtime, build, test, packaging, deployment targets, and the files that currently implement them.
  2. Construct a topology map of stages, jobs, gates, artifacts, environments, and dependencies, and explain that map in plain language for the team.
  3. Generate baseline CI/CD or environment skeletons for new repos or services by combining the project profile with organization templates and policy constraints.
  4. For legacy definitions, produce minimal migration diffs with staged rollout notes and rollback guidance instead of full rewrites.
  5. When a build or deployment fails, trace the error back to the pipeline step, configuration stanza, script line, or resource definition most likely responsible.
  6. Validate generated changes against platform schemas and organization policies, then emit a safety packet summarizing changed files, risky operations, compliance results, and recommended smoke tests.
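Step 2's topology map is essentially a dependency graph over pipeline jobs. Here is a minimal sketch using Kahn's algorithm, assuming the `needs`-style dependency lists have already been parsed out of the workflow files (the job names below are hypothetical):

```python
from collections import defaultdict

def build_topology(jobs: dict[str, list[str]]) -> list[str]:
    """Topologically order pipeline jobs from a needs-graph (step 2).

    `jobs` maps job name -> list of jobs it depends on. A cycle in a
    delivery definition is a misconfiguration worth surfacing, so it
    raises rather than returning a partial order.
    """
    indegree = {j: 0 for j in jobs}
    dependents = defaultdict(list)
    for job, needs in jobs.items():
        for dep in needs:
            indegree[job] += 1
            dependents[dep].append(job)
    ready = sorted(j for j, d in indegree.items() if d == 0)
    order = []
    while ready:
        job = ready.pop(0)
        order.append(job)
        for nxt in dependents[job]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(jobs):
        raise ValueError("cycle detected in pipeline definition")
    return order

stages = build_topology({
    "build": [],
    "test": ["build"],
    "package": ["build"],
    "deploy-staging": ["test", "package"],
})
```

The resulting order is what a plain-language explanation would walk through: build first, then test and package in parallel, then the gated deploy.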

Who It Affects

34 responses (33.7%) | α = 0.910

34 of 101 respondents (33.7%) were coded to this theme, making it the second-largest demand cluster after observability/incident response. These respondents described work spanning environment provisioning, pipeline authoring, migration, and deployment troubleshooting; coding agreement was strong across three independent coders (α = 0.91).

Quantitative Signals:
  • 34 responses (33.7%) coded to this theme, making it the second most prevalent 'want' category
  • 74.5% of respondents in this category want High or Very High AI support
  • Average AI Preference of 4.11/5 vs. Average AI Usage of 2.42/5 (gap: 1.69 points)
  • Inter-rater reliability α = 0.91 across 3 independent coders
reduction in hours spent (re)building environments. reduction on toil hours spent for monitoring/alerting.
PID 2
Troubleshooting deployment issues.
PID 105

Impact

If this capability existed, teams could bootstrap new pipelines and environments from repository context instead of hand-assembling them from poor documentation, while still receiving reviewable diffs rather than autonomous changes. The same system would shorten build and deployment debugging by mapping failures back to specific pipeline or infrastructure lines, and it would make existing delivery topologies understandable enough for onboarding, migration, and change planning. The result is less prerequisite toil and fewer hours lost to trial-and-error in delivery infrastructure.

Evidence
reduction in hours spent (re)building environments. reduction on toil hours spent for monitoring/alerting.
PID 2
I don't want to have to do all the setup and tedious things that are not really a part of my day-to-day job but are pre-requisites for everything I do.
PID 177

Constraints & Guardrails

Must not create or approve change requests without a human intermediary.
"I don't want AI to handle creating or approving change requests without a human intermediary because it's important that a human familiar with the relevant services ensures that changes proposed by AI won't introduce issues" (PID 125)
Must not hold live write permissions; it should operate read-only against infrastructure and generate artifacts rather than execute changes.
"Agentic AI is not a system administrator. It does not get any permissions to do anything but read and alert. This is a rubicon I shan't cross." (PID 476)

Success Definition

Qualitative Measures

  • Developers report that setting up a new project's CI/CD pipeline and infrastructure templates feels like filling in the blanks on a well-structured draft rather than starting from scratch.
  • Developers report they can understand how their pipeline and environments are wired end-to-end via the generated explanations and topology maps.
  • On failures, developers report the triage output is actionable and points to specific files, lines, or steps instead of forcing trial-and-error log reading.
  • Developers trust the system because every output is a reviewable change-set with rationale and validation evidence rather than an autonomous action.

Quantitative Measures

  • Reduce median time-to-first-successful CI pipeline for a new repo by 50%.
  • Reduce median time-to-provision a new non-production environment (from infrastructure-as-code) by 30–60%.
  • Reduce mean time-to-diagnosis for build and deployment failures by 30%.
  • 75% of generated pipeline definitions and infrastructure templates pass organizational compliance validation on first generation (before human edits).
  • Decrease re-run count per failed pipeline run by 20%.
  • Safety: 0 autonomous production changes; 100% of production-impacting suggestions accompanied by a blast-radius report and policy evaluation artifact.

Theme Evidence

CI/CD, Deployment & Infrastructure Provisioning Automation (Pipelines + IaC)
cicd_deployment_and_iac_automation
34 responses (33.7%) | α = 0.910 | 3/3 convergence

AI that creates, explains, migrates, reviews, and maintains CI/CD pipelines and deployment workflows, including automating releases and troubleshooting build/deploy failures. It also reduces toil in provisioning environments by generating or updating infrastructure-as-code (e.g., Bicep/ARM/EV2) and...

#3
Service Upkeep Backlog Generator
Service ownership accumulates as a hundred small obligations: patch the base image, retire a deprecated API, close a scanner finding, update enrollment, right-size a cluster, rotate a workflow. None of these tasks is usually difficult in isolation. The friction comes from the fact that the signals live in different systems and arrive as alerts, advisories, scan results, and platform notices that someone still has to consolidate into actual work.

Project Description

Consolidate deprecations, security findings, runtime drift, platform notices, and cost anomalies into a maintenance agenda that already says what to change, why it matters now, how to verify it, and who owns the missing context. The assistant’s job is backlog synthesis and task preparation, not silent remediation.

Relevant Context Sources:
  • Service catalog with ownership, criticality, and environment mappings
  • Runtime, package, image, and configuration inventory per environment
  • Vulnerability, compliance, and platform lifecycle findings
  • API deprecation notices and upgrade requirements from internal platforms
  • Cloud utilization and billing data for optimization candidates
  • Existing runbooks, escalation contacts, and backlog history
Capability Steps:
  1. Map each service to its environments, repositories, owners, runtimes, and deployed artifacts so findings land in the right operational context.
  2. Normalize upkeep signals—security/compliance findings, unsupported versions, deprecations, upgrade requirements, and cost anomalies—into a common remediation queue.
  3. Deduplicate repeated alerts across tools and environments by collapsing them into root remediation actions rather than one ticket per scanner result.
  4. Prioritize the queue using severity, criticality, blast radius, estimated effort, due dates, and potential cost savings.
  5. Draft backlog items that answer the practitioner’s preferred rubric: what to do, why now, detailed steps, validation checks, rollback notes, and who to contact if assumptions do not line up.
  6. After teams accept or complete items, rescan the relevant signals to attach closure evidence and suppress already-resolved or duplicate work.

Who It Affects

17 responses (16.8%) | α = 0.954

17 of 101 "want" responses (16.8%) described developers responsible for ongoing service ownership after launch: maintaining environments, upgrading systems, closing security/compliance findings, and managing resource efficiency. This theme had very high inter-rater reliability (α = 0.954), indicating a clear and repeated practitioner need.

Quantitative Signals:
  • 17/101 want responses (16.8%) were coded to this theme with α = 0.954
  • Average AI preference: 4.11/5
  • Average AI usage: 2.42/5
  • Preference-usage gap: 1.69
  • 74.5% of respondents want High or Very High AI support for infrastructure ops tasks
AI should help in maintenance of services, making sure that the lights are kept on when the developers move on to new features
PID 269

Impact

Developers stop manually chasing scattered upkeep signals across scanners, advisories, deprecation notices, and billing reports. Instead, service owners get a prioritized daily or weekly agenda of proposed work items—each with what to do, why it matters, detailed execution steps, validation evidence, and escalation guidance—so upkeep becomes a predictable routine rather than ad hoc fire drills. This should reduce the engineering burden of service ownership, especially around security/compliance remediation, while also surfacing upgrade and cost-saving work earlier.

Evidence
The rubric should be: - What to do - Why I'm doing it (short) - How to do it, in fine detail - Why I'm doing it (long) - Who to bother if something doesn't line up
PID 476
AI should help in maintenance of services, making sure that the lights are kept on when the developers move on to new features
PID 269

Constraints & Guardrails

Must operate with read-only access to production infrastructure and no default administrative permissions.
"Agentic AI is not a system administrator. It does not get any permissions to do anything but read and alert. This is a rubicon I shan't cross." (PID 476)
Must not create or approve authoritative change requests without a human intermediary.
"I don't want AI to handle creating or approving change requests without a human intermediary because it's important that a human familiar with the relevant services ensures that changes proposed by AI won't introduce issues" (PID 125)
Must not autonomously modify security configurations, access controls, or permissions.
"I don't want AI to handle critical infrastructure changes, security configurations, or access controls, as mistakes in these areas could cause serious outages or security risks." (PID 294)

Success Definition

Qualitative Measures

  • Service owners report that each proposed item clearly answers what to do next, why it matters, and how to verify completion.
  • Engineers trust recommendations because each item cites the underlying finding, affected resources, and supporting policy or advisory text.
  • Teams report fewer missed upkeep tasks when developers shift attention back to feature work.
  • Security/compliance stakeholders report less back-and-forth because maintenance items include acceptance criteria and audit-ready evidence.

Quantitative Measures

  • Reduction in median time-to-remediate for security/compliance findings
  • Decrease in overdue security/compliance findings per service
  • Increase in patch and upgrade compliance rate across environments
  • Reduction in manual effort to create maintenance tickets, measured by accepted drafted items versus tickets created from scratch
  • Measured cost savings from accepted optimization recommendations
  • Lower rate of generated upkeep items closed as duplicate, irrelevant, or not applicable

Theme Evidence

Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization
infrastructure_maintenance_upgrades_security_cost_optimization
17 responses (16.8%) | α = 0.954 | 3/3 convergence

AI that plans and drives routine operational upkeep of already-running services -- upgrades/patching, dependency and API/workflow migrations, security/compliance posture management (e.g., SFI, S360 remediation), and resource/cost optimization -- by generating actionable work items and...

#4
Support Triage and Drafting Assistant
Support teams lose time twice: first deciding what category a ticket belongs in, then repeating the same telemetry lookup and knowledge-base search for issue patterns they handled last week. The interesting problem is not making AI the public face of support. It is cutting the repetitive intake and fact-gathering work so specialists see novel cases with context already assembled and routine cases get a faster, more specific first draft.

Project Description

Screen incoming tickets, match them to known issue patterns, run only pre-approved telemetry lookups keyed by safe identifiers, and draft a responder-facing triage card plus a customer-safe reply. The assistant should act like a fast intake specialist sitting next to the support engineer, not like a wall between the customer and a human.

Relevant Context Sources:
  • Incoming support tickets, attachments, and verified account metadata
  • Approved knowledge-base articles, troubleshooting guides, and known-issue records
  • Anonymized resolved tickets and resolution codes for similar cases
  • Service ownership maps and escalation policies
  • Read-only telemetry query templates keyed by safe identifiers
  • Communication, privacy, and confidentiality rules for support responses
Capability Steps:
  1. Normalize the request into a common schema, redact sensitive fields from free text, and attach the relevant privacy and communication policies before retrieval begins.
  2. Classify the ticket by likely product area, issue type, severity, and owner queue, and show rationale snippets so the assigned responder can see why it was routed that way.
  3. Retrieve similar resolved cases and approved knowledge sources to ground the triage in prior outcomes rather than in the current ticket alone.
  4. When a verified identifier is present, run only allowed telemetry or log queries and summarize recent errors, anomalous states, or health signals relevant to the report.
  5. Draft a triage card and a customer-facing reply that state what was observed, what steps are recommended, what information is still missing, and where confidence is low.
  6. Route the case to the appropriate human queue with the context pack attached and learn from overrides, edits, and escalations.
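A minimal sketch of the redaction pass in step 1, using two illustrative regex patterns. These patterns and the example ticket text are assumptions for the sketch; a production system would rely on a vetted PII-detection service rather than hand-rolled expressions:

```python
import re

# Hypothetical redaction pass: scrub obvious identifiers from free
# text before it reaches retrieval or a model prompt (step 1).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "GUID": re.compile(
        r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"
    ),
}

def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with typed placeholders; return per-type counts
    so the triage card can note what was removed."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts

clean, counts = redact(
    "User jane.doe@example.com reports tenant "
    "3f2b8a1c-0000-4000-8000-aaaaaaaaaaaa cannot sign in."
)
```

Keeping typed placeholders (rather than deleting the spans) preserves sentence structure for classification while keeping raw identifiers out of prompts and retrieval indexes.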

Who It Affects

12 responses (11.9%) | α = 1.000

12 of 101 respondents (11.9%) explicitly asked for AI help with customer- or user-facing support interactions. This theme showed strong unmet demand—average AI preference 4.11/5 versus usage 2.42/5, a 1.69-point gap, with 74.5% wanting High or Very High support—while also drawing at least 15 'do not want' mentions, indicating that affected teams face real support volume but want assistance designed as human-in-the-loop rather than full automation.

Quantitative Signals:
  • Average AI Preference: 4.11/5
  • Average AI Usage: 2.42/5
  • Preference-Usage Gap: 1.69
  • 74.5% of respondents in this category want High or Very High AI support
  • At least 15 'do not want' respondents explicitly mentioned customer support
If an AI agent could actually look at customer support request text and logs and make screening, bucketing, and triage decisions based on the content, that would be super helpful.
PID 92
customer support. we got a lot of repetitive customer question, like permission issue. I hope AI can detect common pattern of our customer request triaging and reply to customer directly.
PID 117
Customer service: Self-service for well-known issues; Be an assistant for customer service reps.
PID 202

Impact

For support engineers and developers, common tickets no longer begin with manual screening, queue selection, log hunting, and drafting a first reply from scratch. The assistant surfaces the likely category, relevant diagnostic signals, similar resolved cases, and a cited draft response so humans can review and respond quickly; novel cases arrive to specialists with context already assembled. This shifts time away from repetitive support toil and toward mitigation and product work, while giving customers faster and more specific first responses.

Evidence
If an AI agent could actually look at customer support request text and logs and make screening, bucketing, and triage decisions based on the content, that would be super helpful.
PID 92
customer support. we got a lot of repetitive customer question, like permission issue. I hope AI can detect common pattern of our customer request triaging and reply to customer directly.
PID 117
It would be great to have an AI agent triage issues from users and reply with knowledge base information or escalate appropriately.
PID 450

Constraints & Guardrails

Do not give the assistant permission to make production, infrastructure, or customer-environment changes while handling support; it must stay read-only.
"Agentic AI is not a system administrator. It does not get any permissions to do anything but read and alert. This is a rubicon I shan't cross." (PID 476)

Success Definition

Qualitative Measures

  • Support responders report less time spent screening tickets, deciding routing, and manually digging through logs for common issues.
  • Responders trust the triage cards and drafts because they include citations, show what evidence was checked, and clearly flag uncertainty and missing information.
  • Engineers receiving escalations report that tickets arrive with clearer reproduction details and more relevant diagnostic context.
  • Customers do not perceive a degradation in support quality, and escalation to a human remains straightforward when needed.

Quantitative Measures

  • Reduce median time-to-first-response by 25–40% for supported ticket categories.
  • Reduce average handle time for common issues such as permissions, configuration problems, and known errors by 20–35%.
  • Increase correct initial routing rate to the right queue or owner by 15–30%.
  • Achieve ≥80% classification accuracy on the top-10 most common ticket categories, validated by agent override rate <20%.
  • Agent draft-acceptance rate for knowledge-base-matched tickets reaches ≥60% within 6 months.
  • Maintain a low customer-facing error rate: <1% of AI-assisted responses later marked incorrect or misleading by human audit.
  • Privacy/security compliance: 0 incidents of secret or PII leakage in drafts as measured by automated scanners and audits.

Theme Evidence

Customer Support Triage & Auto-Response
customer_support_triage_and_autoresponse
12 responses (11.9%) | α = 1.000 | 3/3 convergence

AI that screens and buckets customer or user support requests, correlates user-reported issues with telemetry/logs, drafts responses from known solutions/knowledge bases, and escalates appropriately -- reducing repetitive support workload. This theme applies when the respondent explicitly mentions...

Meta-Work

157 responses | 8 themes

View Codebook
Research Projects
Ranked by prevalence and multi-model consensus
#1
DocSync: Continuous Documentation Synchronization from Code Changes
The documentation problem is not only first-draft authoring. It is silent drift. Interfaces change, commands move, flags are renamed, test...
72 responses (45.9%) | α = 0.97
#2
Traceable Personalized Developer Ramp-Up Coach
New engineers do not need more information. They need the right first path through too much information. Onboarding stalls because the...
44 responses (28.0%) | α = 0.95
#4
Stakeholder Communication Drafting Workbench
The tedious part of stakeholder communication is not writing sentences. It is translating the same technical fact pattern into five...
20 responses (12.7%) | α = 0.93
#5
Solution-Space Explorer for Trade-off Analysis and Prototype Spikes
Before a team commits to architecture, there is a fuzzier phase that looks more like technical research than design review. Someone has an...
18 responses (11.5%) | α = 0.94
Theme Prevalence
(majority vote: assigned when 2+ of 3 models agree)
Automated Documentation Generation & Maintenance
72 (45.9%) | α = 0.97 | 3/3
Onboarding, Mentoring & Personalized Upskilling
44 (28.0%) | α = 0.95 | 3/3
Project Knowledge Search & Discovery (with Traceable Sources)
42 (26.8%) | α = 0.88 | 3/3
Stakeholder/Client Communication Drafting & Translation
20 (12.7%) | α = 0.93 | 3/3
Brainstorming, Option Generation & Rapid Exploration
18 (11.5%) | α = 0.94 | 3/3
Meeting Scheduling, Notes, Summaries & Action Items
15 (9.6%) | α = 0.95 | 3/3
Proactive Personal Agent & Routine Admin Automation
15 (9.6%) | α = 0.88 | 3/3
Planning, Prioritization, Blocker Detection & Status Reporting
13 (8.3%) | α = 0.82 | 3/3
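The majority-vote rule used for these prevalence counts ("assigned when 2+ of 3 models agree") is simple enough to state in code; the model and theme names below are illustrative:

```python
def majority_vote(codes_by_model: dict[str, set[str]], quorum: int = 2) -> set[str]:
    """Assign a theme to one response when at least `quorum` of the
    model coders independently applied it (majority vote across 3)."""
    tally: dict[str, int] = {}
    for themes in codes_by_model.values():
        for theme in themes:
            tally[theme] = tally.get(theme, 0) + 1
    return {theme for theme, votes in tally.items() if votes >= quorum}

# One response, coded by three models: only the theme with 2+ votes survives.
assigned = majority_vote({
    "model_a": {"automated_documentation", "meeting_assistance"},
    "model_b": {"automated_documentation"},
    "model_c": {"automated_documentation", "knowledge_search_and_discovery"},
})
```

Prevalence for a theme is then the count of unique PIDs whose consensus set contains it, which is why a single response can contribute to multiple themes without double-counting within any one theme.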
Key Constraints & Guardrails (24)
The tool must not handle sensitive, relationship-critical stakeholder communications where empathy, trust-building, or nuanced trade-off discussion is central.
"I don’t want AI to handle sensitive stakeholder communications or prioritization trade-offs, as these require empathy, trust-building, and nuanced understanding of team dynamics." (PID 179)
From: Project #4
The tool must not outsource creativity; it should explicitly encourage users to challenge the initial options, add their own alternatives, and avoid treating the first generation as sufficient.
"AI should also not be relied upon in a way that reduces creativity. Research and brainstorming are exercises it can help ask questions to facilitate, but we shouldn't go with its first ideas to solve a problem." (PID 225)
From: Project #5

Meta-Work Codebook

Automated Documentation Generation & Maintenance automated_documentation α=0.97 3/3
Generate, update, and validate documentation artifacts directly from code, PRs, specs, and tests (e.g., READMEs, inline comments, API docs, architecture overviews/diagrams). This includes producing new documentation, maintaining accuracy as code evolves, and identifying gaps or staleness. Assign when the respondent wants AI to create or maintain documentation. It does NOT cover searching or navigating existing documentation to find answers (that is knowledge_search_and_discovery), nor capturing meeting discussions (that is meeting_assistance).
Project Knowledge Search & Discovery (with Traceable Sources) knowledge_search_and_discovery α=0.88 3/3
Act as a smart, project-aware search and Q&A layer that retrieves, aggregates, and summarizes information from existing sources (repos, internal docs, tickets, tools, external references) and provides links/citations for traceability. Assign when the respondent wants AI to find, retrieve, or synthesize existing information to answer questions or reduce time spent hunting. This theme is NOT about generating new documentation (automated_documentation), exploring novel solution options (brainstorming_and_solution_exploration), or learning a new technology through structured instruction (onboarding_mentoring_and_upskilling) -- though a single response may invoke multiple themes when 'research' involves both gathering information and evaluating approaches.
Brainstorming, Option Generation & Rapid Exploration brainstorming_and_solution_exploration α=0.94 3/3
Serve as a technical sounding board to expand solution options, propose architectures or approaches, compare tradeoffs, and rapidly explore directions (including lightweight prototypes/mockups). Assign when the respondent wants AI to help generate, evaluate, or iterate on ideas and design alternatives rather than merely retrieve known information. This theme is NOT about finding existing documentation or answers to factual questions (knowledge_search_and_discovery). When 'research' means gathering facts, code knowledge_search; when it means exploring possible approaches, code brainstorming.
Onboarding, Mentoring & Personalized Upskilling onboarding_mentoring_and_upskilling α=0.95 3/3
Act as an adaptive tutor or mentor that tailors explanations, examples, and learning plans to help developers learn new languages, frameworks, and APIs, or ramp up on unfamiliar systems and domains. Includes generating onboarding guides/checklists, packaging institutional knowledge into training materials, and providing judgment-free learning support. Assign when the respondent's primary intent is skill acquisition or ramp-up, not merely getting an answer to a one-off question. General-purpose Q&A where the goal is an answer rather than learning belongs under knowledge_search_and_discovery.
Stakeholder/Client Communication Drafting & Translation stakeholder_communication_support α=0.93 3/3
Draft, rewrite, proofread, and tailor communications (emails, updates, status messages, explanations) for stakeholders, clients, or other audiences, including simplifying technical details for non-technical readers and adjusting tone or language. Assign when the respondent wants AI to help compose or refine a message directed at another person or group. It does NOT cover the automated scheduling or note-taking of meetings (meeting_assistance), nor the proactive triaging/filtering of incoming communications (proactive_personal_agent_and_admin_automation).
Meeting Scheduling, Notes, Summaries & Action Items meeting_assistance α=0.95 3/3
Reduce meeting overhead by scheduling, transcribing, capturing structured notes, summarizing discussions, and extracting decisions/action items (including recaps for missed meetings). Assign when the respondent explicitly references meetings, meeting notes, meeting summaries, scheduling, or recaps. It does NOT cover general note-taking or rote administrative work that is not meeting-specific (proactive_personal_agent_and_admin_automation), nor writing communications to stakeholders about meeting outcomes (stakeholder_communication_support).
Planning, Prioritization, Blocker Detection & Status Reporting planning_prioritization_and_status_tracking α=0.82 3/3
Support meta-work coordination by breaking down goals into tasks, prioritizing work, planning timelines and dependencies, tracking progress and decisions, identifying blockers, and generating status or progress reports. Assign when the respondent wants AI to help with project structure, prioritization, tracking, or reporting. The distinction from proactive_personal_agent_and_admin_automation is that this theme covers structured planning and tracking activities, while the personal agent theme covers always-on, initiative-taking assistance with general admin tasks. Tailoring status updates for specific audiences is stakeholder_communication_support.
Proactive Personal Agent & Routine Admin Automation · proactive_personal_agent_and_admin_automation · α=0.88 · 3/3
Act as a persistent, context-aware personal assistant that remembers prior work and commitments, proactively flags items needing attention (e.g., unanswered threads, upcoming deadlines, stale action items), and automates low-value administrative chores such as triaging and filtering email/messages, handling routine replies, and scheduling. This theme captures the always-on, agent-like qualities -- memory, initiative, and delegation with human oversight -- rather than one-off planning artifacts (covered by planning_prioritization_and_status_tracking) or drafting specific communications (covered by stakeholder_communication_support). Assign when the respondent wants an AI that acts on its own initiative.

Themes identified from "What do you NOT want AI to handle?" responses.

Keep mentoring and onboarding human-led · human_led_mentoring_onboarding · 3/3
Developers do not want AI to directly mentor, onboard, or integrate new team members. These activities are viewed as fundamentally interpersonal and culture-bearing (trust, empathy, relationship building). AI may assist with logistics or rote steps, but a human should lead and own the experience.
Keep communications human-authored and human-approved · human_reviewed_communications · 3/3
Developers prefer that interpersonal and stakeholder-facing communications remain authored by humans to preserve authenticity, nuance, and trust; AI may offer suggestions but should not be the primary voice. They also do not want AI to autonomously send, post, or publish messages or shared content (e.g., emails, chat replies, tickets, announcements) without explicit human review and approval. This constraint aims to prevent tone/culture missteps, misinformation, and reputational or operational harm.
Keep AI away from confidential or sensitive information · no_confidential_or_sensitive_data · 3/3
Developers do not want AI tools to handle confidential data, private/sensitive communications, or restricted internal/customer information due to privacy, security, and access-control risks.
Don’t outsource learning and skills development to AI · preserve_hands_on_learning · 3/3
Developers want learning new technologies to remain a hands-on, learn-by-doing process. They worry that relying on AI for learning erodes understanding, weakens skill development, and introduces unverified or outdated guidance; AI can supplement but should not replace genuine learning work.
Keep research/brainstorming and ideation primarily human · preserve_human_research_and_ideation · 3/3
Developers resist AI driving research, deep thinking, brainstorming, or idea generation. Concerns include reduced creativity, derivative outputs, interruption of human thought processes, and loss of system/technical understanding; AI may assist but should not lead.
High-stakes decisions must remain human-led and accountable · human_accountability_for_high_stakes_decisions · 3/3
AI should not be the final authority for consequential judgment calls (e.g., architecture/design, strategy, prioritization trade-offs, evaluations). Developers emphasize the need for contextual judgment, human factors awareness, and clear accountability when outcomes materially affect products or people.
Don’t treat AI outputs (incl. documentation) as authoritative; require vetting · ai_output_requires_human_verification · 3/3
Developers do not want AI output—especially AI-generated or AI-filled documentation—to be treated as a primary or authoritative source. Because models can hallucinate, be outdated, miss internal context/permissions, and present confident but wrong answers, any AI-provided facts, guidance, or docs must be explicitly verified by humans. They also worry about accumulating low-signal “AI slop” that pollutes knowledge bases when AI text is accepted uncritically.

Automated Documentation Generation & Maintenance

automated_documentation
Generate, update, and validate documentation artifacts directly from code, PRs, specs, and tests (e.g., READMEs, inline comments, API docs, architecture overviews/diagrams). This includes producing new documentation, maintaining accuracy as code evolves, and identifying gaps or staleness. Assign when the respondent wants AI to create or maintain documentation. It does NOT cover searching or navigating existing documentation to find answers (that is knowledge_search_and_discovery), nor capturing meeting discussions (that is meeting_assistance).
0.966
Krippendorff's α (Excellent)
72
Responses (45.9%)
3/3
Model Convergence
Prevalence
72 of 157 responses (45.9%)
Source Codes (3/3 models converged)
gpt: documentation_automation · gemini: automated_documentation · opus: automated_documentation
Developer Quotes
Having AI help create and maintain documentation based on checked-in code, PRs, tests, etc would be a game changer for documentation
PID 28

Onboarding, Mentoring & Personalized Upskilling

onboarding_mentoring_and_upskilling
Act as an adaptive tutor or mentor that tailors explanations, examples, and learning plans to help developers learn new languages, frameworks, and APIs, or ramp up on unfamiliar systems and domains. Includes generating onboarding guides/checklists, packaging institutional knowledge into training materials, and providing judgment-free learning support. Assign when the respondent's primary intent is skill acquisition or ramp-up, not merely getting an answer to a one-off question. General-purpose Q&A where the goal is an answer rather than learning belongs under knowledge_search_and_discovery.
0.948
Krippendorff's α (Excellent)
44
Responses (28.0%)
3/3
Model Convergence
Prevalence
44 of 157 responses (28.0%)
Source Codes (3/3 models converged)
gpt: learning_onboarding_mentoring · gemini: learning_and_onboarding · opus: learning_new_technologies · opus: onboarding_mentoring
Developer Quotes
Learning new technologies - exercises to learn new coding languages and tech within AI agents instead of reading books or online tutorials.
PID 74

Project Knowledge Search & Discovery (with Traceable Sources)

knowledge_search_and_discovery
Act as a smart, project-aware search and Q&A layer that retrieves, aggregates, and summarizes information from existing sources (repos, internal docs, tickets, tools, external references) and provides links/citations for traceability. Assign when the respondent wants AI to find, retrieve, or synthesize existing information to answer questions or reduce time spent hunting. This theme is NOT about generating new documentation (automated_documentation), exploring novel solution options (brainstorming_and_solution_exploration), or learning a new technology through structured instruction (onboarding_mentoring_and_upskilling) -- though a single response may invoke multiple themes when 'research' involves both gathering information and evaluating approaches.
0.884
Krippendorff's α (Excellent)
42
Responses (26.8%)
3/3
Model Convergence
Prevalence
42 of 157 responses (26.8%)
Source Codes (3/3 models converged)
gpt: research_knowledge_discovery · gemini: research_and_knowledge_discovery · opus: information_discovery
Developer Quotes
AI is great for research, as long as it cites sources. It's more like talking to someone who knows about a subject than having to read through documentation.
PID 21

Stakeholder/Client Communication Drafting & Translation

stakeholder_communication_support
Draft, rewrite, proofread, and tailor communications (emails, updates, status messages, explanations) for stakeholders, clients, or other audiences, including simplifying technical details for non-technical readers and adjusting tone or language. Assign when the respondent wants AI to help compose or refine a message directed at another person or group. It does NOT cover the automated scheduling or note-taking of meetings (meeting_assistance), nor the proactive triaging/filtering of incoming communications (proactive_personal_agent_and_admin_automation).
0.928
Krippendorff's α (Excellent)
20
Responses (12.7%)
3/3
Model Convergence
Prevalence
20 of 157 responses (12.7%)
Source Codes (3/3 models converged)
gpt: communication_stakeholder_support · gemini: stakeholder_communication · opus: stakeholder_communication
Developer Quotes
I would like AI to help with tailoring stakeholder communications to different stakeholders. Right now so much team meta-effort goes into preparing tailored communications.
PID 40
It would be great if I can delegate the task of updating my stakeholders to an AI assistant and it can keep them informed of my recent work, and summarize any questions to me that they have.
PID 120
rephrase words to stakeholders or client - chat message or emails. help to explain the tech details or issue/concerns of technique in an easy understandable way
PID 130

Brainstorming, Option Generation & Rapid Exploration

brainstorming_and_solution_exploration
Serve as a technical sounding board to expand solution options, propose architectures or approaches, compare tradeoffs, and rapidly explore directions (including lightweight prototypes/mockups). Assign when the respondent wants AI to help generate, evaluate, or iterate on ideas and design alternatives rather than merely retrieve known information. This theme is NOT about finding existing documentation or answers to factual questions (knowledge_search_and_discovery). When 'research' means gathering facts, code knowledge_search; when it means exploring possible approaches, code brainstorming.
0.938
Krippendorff's α (Excellent)
18
Responses (11.5%)
3/3
Model Convergence
Prevalence
18 of 157 responses (11.5%)
Source Codes (3/3 models converged)
gpt: brainstorming_ideation_prototyping · gemini: research_and_knowledge_discovery · opus: research_brainstorming
Developer Quotes

Meeting Scheduling, Notes, Summaries & Action Items

meeting_assistance
Reduce meeting overhead by scheduling, transcribing, capturing structured notes, summarizing discussions, and extracting decisions/action items (including recaps for missed meetings). Assign when the respondent explicitly references meetings, meeting notes, meeting summaries, scheduling, or recaps. It does NOT cover general note-taking or rote administrative work that is not meeting-specific (proactive_personal_agent_and_admin_automation), nor writing communications to stakeholders about meeting outcomes (stakeholder_communication_support).
0.953
Krippendorff's α (Excellent)
15
Responses (9.6%)
3/3
Model Convergence
Prevalence
15 of 157 responses (9.6%)
Source Codes (3/3 models converged)
gpt: meeting_assistance · gemini: meeting_management_and_summarization · opus: meeting_management
Developer Quotes
Meeting scheduling. Here is a list of names put a meeting on the calendar with a room and Teams link.
PID 52
I want AI to support meta work by streamlining documentation, meeting prep, and cross-functional alignment, surfacing relevant context, summarizing discussions, and tracking decisions over time. It should reduce cognitive load and help maintain clarity across fast-moving, complex initiatives
PID 179

Proactive Personal Agent & Routine Admin Automation

proactive_personal_agent_and_admin_automation
Act as a persistent, context-aware personal assistant that remembers prior work and commitments, proactively flags items needing attention (e.g., unanswered threads, upcoming deadlines, stale action items), and automates low-value administrative chores such as triaging and filtering email/messages, handling routine replies, and scheduling. This theme captures the always-on, agent-like qualities -- memory, initiative, and delegation with human oversight -- rather than one-off planning artifacts (covered by planning_prioritization_and_status_tracking) or drafting specific communications (covered by stakeholder_communication_support). Assign when the respondent wants an AI that acts on its own initiative.
0.880
Krippendorff's α (Excellent)
15
Responses (9.6%)
3/3
Model Convergence
Prevalence
15 of 157 responses (9.6%)
Source Codes (3/3 models converged)
gpt: proactive_personal_agent · gemini: general_omnipresent_assistance · opus: admin_automation
Developer Quotes
I would like the AI to 1) automatically generate progress reports using my ADOs and send me a draft for proof reading, 2) notify me when a new technology related with my daily work or my interests comes out (with a brief summary with references), 3) automatically generating documentation from my code.
PID 75
It would be great if I can delegate the task of updating my stakeholders to an AI assistant and it can keep them informed of my recent work, and summarize any questions to me that they have.
PID 120

Planning, Prioritization, Blocker Detection & Status Reporting

planning_prioritization_and_status_tracking
Support meta-work coordination by breaking down goals into tasks, prioritizing work, planning timelines and dependencies, tracking progress and decisions, identifying blockers, and generating status or progress reports. Assign when the respondent wants AI to help with project structure, prioritization, tracking, or reporting. The distinction from proactive_personal_agent_and_admin_automation is that this theme covers structured planning and tracking activities, while the personal agent theme covers always-on, initiative-taking assistance with general admin tasks. Tailoring status updates for specific audiences is stakeholder_communication_support.
0.817
Krippendorff's α (Excellent)
13
Responses (8.3%)
3/3
Model Convergence
Prevalence
13 of 157 responses (8.3%)
Source Codes (3/3 models converged)
gpt: planning_prioritization_status_reporting · gemini: project_management_and_task_prioritization · opus: task_planning_prioritization
Developer Quotes
I would like the AI to 1) automatically generate progress reports using my ADOs and send me a draft for proof reading, 2) notify me when a new technology related with my daily work or my interests comes out (with a brief summary with references), 3) automatically generating documentation from my code.
PID 75
Create roadmap and breakdown of tasks when given high level idea of the next projects.
PID 164
#1
DocSync: Continuous Documentation Synchronization from Code Changes
The documentation problem is not only first-draft authoring. It is silent drift. Interfaces change, commands move, flags are renamed, test behavior shifts, and the README or runbook keeps saying the old thing because no one noticed the mismatch during normal development. Teams need a documentation system that treats drift as a change-detection problem tied to code review, not as a separate writing sprint.

Project Description

Watch code changes, infer which docs they invalidate, and propose small patches—README paragraphs, API notes, runbook steps, diagrams, docstrings—linked directly to the code symbols, tests, and schemas that justify them. The unit of output should be a precise doc patch, not a page of generic prose.

Relevant Context Sources:
  • PR diffs, changed symbols, and repository structure
  • Existing documentation corpus, including READMEs, docs folders, runbooks, and docstrings
  • Tests and coverage changes that clarify current behavior
  • API, schema, and configuration definitions for changed interfaces
  • Architecture diagrams or dependency graphs for structural changes
  • CI validation outputs for examples, snippets, and references
Capability Steps:
  1. Inventory the repository’s documentation surfaces and conventions so the assistant knows which files, doc styles, and diagram formats actually matter here.
  2. For each code change, compute a documentation impact map that links changed interfaces, commands, configs, and behaviors to the docs most likely to drift.
  3. Pull facts only from authoritative sources such as symbols, tests, schemas, and existing docs, storing line-level citations for every generated claim.
  4. Generate minimal patches for the affected sections instead of rewriting whole documents, and regenerate only diagram elements that the structural change actually touched.
  5. Validate links, code examples, schema snippets, and referenced symbols in CI wherever possible.
  6. Surface stale-document risk for modules outside the current PR when code churn or validation failures suggest hidden drift.
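Step 2's documentation impact map can be sketched as a cross-reference between the symbols a diff touches and the doc files that mention them. This is a minimal illustration, not the proposed implementation: the regex-based symbol extraction and Markdown-only search are simplifying assumptions (a real tool would parse the language's AST and index every documentation surface inventoried in step 1).

```python
import re
from pathlib import Path

def changed_symbols(diff_text: str) -> set[str]:
    """Rough sketch: pull function/class names from added or removed
    diff lines. A real tool would use the language's AST, not a regex."""
    pattern = re.compile(r"^[+-]\s*(?:def|class)\s+(\w+)", re.MULTILINE)
    return set(pattern.findall(diff_text))

def doc_impact_map(diff_text: str, docs_root: Path) -> dict[str, list[str]]:
    """Map each changed symbol to the doc files that mention it, i.e.
    the docs most likely to drift when this change merges."""
    symbols = changed_symbols(diff_text)
    impact: dict[str, list[str]] = {s: [] for s in symbols}
    for doc in docs_root.rglob("*.md"):
        text = doc.read_text(encoding="utf-8", errors="ignore")
        for sym in symbols:
            if sym in text:
                impact[sym].append(str(doc))
    return impact
```

The resulting map is exactly the artifact step 4 needs: for each changed symbol, a short list of candidate sections to patch rather than whole documents to rewrite.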

Who It Affects

72 responses (45.9%) · α=0.966

72 of 157 respondents (45.9%) were coded to the automated documentation theme, making it the most prevalent meta-work request in the survey; inter-rater reliability was near-perfect (α = 0.966). The need spans dense codebases, missing or incomplete documentation, public interfaces, and onboarding materials that quickly drift out of date.

Quantitative Signals:
  • 72 of 157 respondents (45.9%) were coded to this theme, the most prevalent meta-work request
  • 72.6% of respondents in this category want High or Very High AI support
  • Average AI Preference: 4.07/5
  • Average AI Usage: 2.58/5, creating a 1.49-point preference-usage gap
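The prevalence and gap figures above follow the methodology stated earlier: a respondent counts toward a theme when at least 2 of the 3 model coders assign it (majority vote), and the preference-usage gap is mean preference minus mean usage. A minimal sketch of both computations, with made-up data:

```python
def majority_vote_prevalence(codes_by_model: dict[str, set[int]],
                             n_total: int) -> tuple[int, float]:
    """A PID counts toward a theme when at least 2 of the 3 model
    coders assigned it (majority vote, as in the methodology)."""
    all_pids = set().union(*codes_by_model.values())
    agreed = {p for p in all_pids
              if sum(p in pids for pids in codes_by_model.values()) >= 2}
    return len(agreed), len(agreed) / n_total

def preference_usage_gap(preference: list[float], usage: list[float]) -> float:
    """Mean preference rating minus mean usage rating (1-5 scales)."""
    return sum(preference) / len(preference) - sum(usage) / len(usage)
```

For example, if gpt coded PIDs {1, 2, 3}, gemini {2, 3}, and opus {3, 4} out of 10 responses, only PIDs 2 and 3 have 2+ agreement, giving a prevalence of 2 (20%).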
Having AI help create and maintain documentation based on checked-in code, PRs, tests, etc would be a game changer for documentation
PID 28
I think the biggest role that AI can assist in is maintaining documentation to be accurate and up-to-date. Missing or incomplete or outdated documentation is arguably the biggest pitfall of development.
PID 129
Analyze the code I just wrote and write documentation that is relevant and adds the relevant context.
PID 105

Impact

If this exists, documentation work shifts from a separate, often-skipped task to a lightweight review activity attached to normal code change flow. Teams get targeted updates when interfaces or behavior change, stale docs are detected before drift accumulates, and onboarding improves because README-level and architecture-level documentation stays aligned with the live system instead of depending on heroic manual maintenance.

Evidence
Having AI help create and maintain documentation based on checked-in code, PRs, tests, etc would be a game changer for documentation
PID 28
I think the biggest role that AI can assist in is maintaining documentation to be accurate and up-to-date. Missing or incomplete or outdated documentation is arguably the biggest pitfall of development.
PID 129
Auto-documentation of "what" would be great. AI models aren't smart enough to completely explain the "why"s yet, but it could be a great tool for creating first draft documentation. Hardly any engineers write documentation, or enjoy doing it, so making that easier could be a nice win.
PID 171

Constraints & Guardrails

Success Definition

Qualitative Measures

  • Reviewers trust the system because each generated statement is traceable to code, tests, or schemas.
  • Developers report that documentation for new and changed code 'just exists' without separate documentation sprints.
  • Teams report that stale documentation is identified within days of code changes rather than months later.
  • Developers describe generated outputs as concise, targeted patches rather than low-value bulk text.
  • New team members report clearer README-level and architecture-level documentation during onboarding.

Quantitative Measures

  • Increase in the percentage of code-changing pull/merge requests with accepted documentation updates when public interfaces or behavior change.
  • Increase in documentation coverage for public interfaces from baseline (target: >80% coverage within 6 months of adoption).
  • Reduction in documentation drift, measured as days between a code change and an update to the impacted documentation (target: <14 days drift).
  • Reduction in documentation validation failures and stale-document findings per week.
  • Higher reviewer acceptance rate for generated documentation patches and lower average number of manual edits per accepted patch.
  • Preference-usage gap for documentation assistance narrows from 1.49 to below 0.5 within 12 months of deployment.

Theme Evidence

Automated Documentation Generation & Maintenance
automated_documentation
72 responses (45.9%) · α=0.966 · 3/3 convergence

Generate, update, and validate documentation artifacts directly from code, PRs, specs, and tests (e.g., READMEs, inline comments, API docs, architecture overviews/diagrams). This includes producing new documentation, maintaining accuracy as code evolves, and identifying gaps or staleness. Assign...

#2
Traceable Personalized Developer Ramp-Up Coach
New engineers do not need more information. They need the right first path through too much information. Onboarding stalls because the learner cannot tell which repo entry points matter, which setup steps are still current, which ADRs explain today’s architecture, or which basic questions feel too embarrassing to ask yet again in a public channel.

Project Description

Compose a role-specific ramp-up plan from the actual repos, docs, ADRs, setup scripts, and examples the learner will touch, then teach through sequenced exercises, answerable questions, and source-linked explanations tailored to that engineer’s background. It should feel less like a chatbot and more like a personalized, traceable onboarding syllabus for this team.

Relevant Context Sources:
  • Learner profile, ramp-up goal, and time budget
  • Target repositories, READMEs, examples, and setup scripts
  • Internal docs, ADRs, and service catalog entries for the target system
  • Resolved internal Q&A threads or support tickets for common setup and domain questions
  • Approved external docs and version-specific SDK or framework references
  • Senior engineer notes or narrated knowledge dumps approved for onboarding use
Capability Steps:
  1. Capture the learner’s goal, starting skill level, time constraints, preferred learning style, and accessible source set.
  2. Generate an onboarding playbook with prerequisites, setup checks, success criteria, and explicit human touchpoints such as mentor meetings or pairing sessions.
  3. Produce a codebase map and curated reading path that identifies entry points, core modules, dependencies, and architecture concepts in the order they matter for the learner’s goal.
  4. Create hands-on exercises tied to the actual target repo or technology and adapt the next step based on completed work and self-reported confidence.
  5. Answer questions with exact citations to files, docs, or examples, while surfacing permission gaps, conflicting sources, and uncertainty instead of bluffing.
  6. Turn senior engineers’ notes into draft FAQs, walkthroughs, and onboarding modules, and suggest updates when key setup or architecture files change.

Who It Affects

44 responses (28.0%) · α=0.948

44 of 157 developers (28.0%) explicitly requested AI help with onboarding, mentoring, or learning new technologies. Demand is both strong and clear: 72.6% of respondents in this category want High or Very High AI support, average preference is 4.07/5 versus current usage of 2.58/5 (a 1.49-point gap), and inter-rater reliability was very high (α = 0.948).

Quantitative Signals:
  • Average AI Preference: 4.07/5
  • Average AI Usage: 2.58/5
  • Preference-Usage Gap: 1.49
  • 72.6% want High or Very High AI support
  • Inter-Rater Reliability: α = 0.948
Helping me gather the resources to learn more effectively on the job.
PID 26
Learning new technologies - exercises to learn new coding languages and tech within AI agents instead of reading books or online tutorials.
PID 74
Learning New Technologies - when learning new technologies, I generally have very specific questions that I might feel embarrassed to ask a more knowledgeable person. Being able to ask a specific question about a new technology without self-concern and get an expert and specific response back is transformative in the learning process.
PID 125

Impact

If this exists, developers get a repository- and role-specific ramp-up path instead of a pile of disconnected resources: a traceable onboarding checklist, a map of the system, guided readings, and hands-on exercises that lead toward a first meaningful contribution. Newcomers gain a private place to ask basic or context-specific questions without social friction, while senior engineers can turn repeated explanations into reviewed onboarding modules. The net effect is faster ramp-up on both new technologies and existing systems, with human mentors freed to focus on judgment, relationships, and team culture rather than repetitive setup and factual questions.

Evidence
AI can help a lot in onboarding new people into a team, and help in speeding up the time for them to start contributing to the project.
PID 781
Auto-generated training plan for new onboardings
PID 281
Learning New Technologies - when learning new technologies, I generally have very specific questions that I might feel embarrassed to ask a more knowledgeable person. Being able to ask a specific question about a new technology without self-concern and get an expert and specific response back is transformative in the learning process.
PID 125

Constraints & Guardrails

Success Definition

Qualitative Measures

  • New joiners report they can ask 'embarrassing' questions safely and get understandable explanations with examples tailored to their context.
  • Mentors report fewer repetitive onboarding questions and more time for relationship-building and nuanced guidance.
  • Senior engineers report that knowledge-capture workflows reduce time spent repeatedly packaging the same onboarding knowledge.
  • Teams report that generated onboarding modules feel accurate, concise, and owned rather than generic or spammy.
  • Developers say they trust the tool because each answer shows sources and clearly signals uncertainty.

Quantitative Measures

  • Reduce median time-to-first-successful-local-build/run for new joiners.
  • Reduce median time-to-first-merged change for new team members.
  • Increase completion rate of onboarding checklists and hands-on exercises.
  • Decrease repeated onboarding questions in internal Q&A or support channels for the same repository or system.
  • Maintain content accuracy ratings of 90%+ in developer spot-checks of source attribution and technical correctness.
  • Reduce the theme's preference-usage gap from the 1.49 baseline by increasing sustained use of the ramp-up tool.

Theme Evidence

Onboarding, Mentoring & Personalized Upskilling
onboarding_mentoring_and_upskilling
44 responses (28.0%) · α=0.948 · 3/3 convergence

Act as an adaptive tutor or mentor that tailors explanations, examples, and learning plans to help developers learn new languages, frameworks, and APIs, or ramp up on unfamiliar systems and domains. Includes generating onboarding guides/checklists, packaging institutional knowledge into training...

#4
Stakeholder Communication Drafting Workbench
The tedious part of stakeholder communication is not writing sentences. It is translating the same technical fact pattern into five different levels of detail without dropping nuance, overpromising, or leaking something internal. Engineers repeatedly have to reshape status, risk, and incident information for audiences who differ in technical fluency, language, and prior context, and they often have to do it inside ongoing threads where yesterday’s wording still matters.

Project Description

Assemble a context pack from user-selected tickets, commits, incidents, prior messages, and meeting notes, then draft audience-specific updates, replies, rewrites, or proofreading suggestions that preserve the engineer’s intent and voice. The useful contribution here is careful audience adaptation with traceable claims, not generic business prose.

Relevant Context Sources:
  • User-selected work items, status changes, blockers, and owners
  • Selected commits, PR notes, incident records, and technical docs
  • Prior thread messages and stakeholder questions for the same topic
  • Meeting notes or transcripts explicitly chosen by the user
  • Stakeholder profile fields such as role, technical fluency, language, and formality expectations
  • Approved terminology, translation memory, and confidentiality rules
Capability Steps:
  1. Start from a drafting or review action where the user specifies the audience, channel, intent, language, and source pack.
  2. Extract the facts, deltas since the last message, open risks, dependencies, and asks from the chosen artifacts, and identify what information is still missing for this audience.
  3. Generate one or more drafts or edit suggestions at the right technical depth, simplifying jargon where appropriate without flattening important nuance.
  4. For multilingual communication, offer either full translation or targeted language coaching on register and phrasing while preserving the author’s meaning.
  5. Show sentence-level source links and warnings for unsupported claims, likely overcommitments, or wording that leaks internal detail.
  6. Run confidentiality and external-communication checks before exporting an editable draft back to the user.
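Step 5's unsupported-claim warning can be sketched as a crude lexical check: a draft sentence with too little content-word overlap with any selected source artifact gets flagged before export. A production system would use semantic matching and sentence-level citations; the stopword list and overlap threshold here are arbitrary assumptions:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "we", "to", "of", "and", "in", "on"}

def content_words(text: str) -> set[str]:
    """Lowercased word set minus a tiny stopword list -- a stand-in
    for real term extraction."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOPWORDS}

def flag_unsupported(draft_sentences: list[str], sources: list[str],
                     min_overlap: int = 2) -> list[str]:
    """Return draft sentences that share fewer than `min_overlap`
    content words with every source artifact -- candidates for the
    'unsupported claim' warning before export."""
    source_sets = [content_words(s) for s in sources]
    return [s for s in draft_sentences
            if all(len(content_words(s) & src) < min_overlap
                   for src in source_sets)]
```

Sentences that survive the check would still carry their sentence-level source links (step 5); only the flagged remainder needs an explicit user override before the export in step 6.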

Who It Affects

20 responses (12.7%) · α=0.928

20 of 157 respondents (12.7%) were coded to this theme with high inter-rater reliability (α = 0.93). The requests span communication with internal stakeholders, customers, and external partners. Across the category, 72.6% of respondents want High or Very High AI support, yet average usage is only 2.58/5 against a 4.07/5 preference, indicating a substantial unmet need in current communication-support tools.

Quantitative Signals:
  • Average AI Preference: 4.07/5
  • Average AI Usage: 2.58/5
  • Preference-Usage Gap: 1.49
  • 72.6% wanting high support
  • Inter-rater reliability α = 0.93
I would like AI to help with tailoring stakeholder communications to different stakeholders. Right now so much team meta-effort goes into preparing tailored communications.
PID 40
It would be great if I can delegate the task of updating my stakeholders to an AI assistant and it can keep them informed of my recent work, and summarize any questions to me that they have.
PID 120
rephrase words to stakeholders or client - chat message or emails. help to explain the tech details or issue/concerns of technique in an easy understandable way
PID 130

Impact

If this capability exists, developers no longer start stakeholder communication from a blank page or manually restate the same technical work for each audience. Instead, they assemble a trusted context pack and receive editable drafts or review suggestions that explain status, risks, and answers at the right level of detail, while preserving intended meaning across languages and reducing repeated explanations in ongoing threads.

Evidence
I would like AI to help with tailoring stakeholder communications to different stakeholders. Right now so much team meta-effort goes into preparing tailored communications.
PID 40
rephrase words to stakeholders or client - chat message or emails. help to explain the tech details or issue/concerns of technique in an easy understandable way
PID 130
I want AI suggestions for communication with external partners, particularly when I'm writing in Japanese rather than English. I don't want wholesale retranslation, and certainly not re-interpretation of my original meaning, but I would greatly appreciate focused comments on correct use of 丁寧語、尊敬語 and 謙譲語 when writing to external partners, or senior members of the company. For example, cases where different verb forms should be used, or (indeed) different verbs entirely.
PID 182

Constraints & Guardrails

The tool must not handle sensitive, relationship-critical stakeholder communications where empathy, trust-building, or nuanced trade-off discussion is central.
"I don’t want AI to handle sensitive stakeholder communications or prioritization trade-offs, as these require empathy, trust-building, and nuanced understanding of team dynamics." (PID 179)

Success Definition

Qualitative Measures

  • Users report that stakeholder messages start from a trustworthy draft or review pass rather than a blank page
  • Users report that outputs preserve their authentic voice instead of sounding generic or synthetic
  • Users trust the tool because every important claim can be traced to selected source artifacts and unsupported claims are clearly flagged
  • Cross-language users report increased confidence in formality and register choices without loss of intended meaning
  • Stakeholders report no degradation in clarity, accuracy, or personal touch compared to prior communication practices

Quantitative Measures

  • Reduce median time-to-first-draft for stakeholder updates by 50%
  • Decrease average number of manual revisions per message by 25% before export
  • At least 30% reduction in post-send corrective follow-up messages for the same topic
  • <= 5% of exported drafts contain unsupported claims without explicit user override acknowledgment
  • Zero autonomous sends: 100% of outbound communications pass through a human approval gate
  • >= 60% weekly active usage among users who try the feature at least once
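The last two quantitative gates above (no unacknowledged unsupported claims, no autonomous sends) are mechanically checkable before export. A minimal sketch of such an export gate, assuming a claim-level trace model; all names and fields are hypothetical, not part of the proposal:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    source_id: Optional[str] = None   # ID of a context-pack artifact backing the claim
    user_acknowledged: bool = False   # explicit user override for an unsupported claim

def can_export(claims):
    """Export gate: every claim must be traced to a source artifact or
    explicitly acknowledged. Sending remains a separate human action."""
    return all(c.source_id is not None or c.user_acknowledged for c in claims)

draft = [
    Claim("Rollout finished on schedule", source_id="status-report-12"),
    Claim("No customer impact expected"),  # unsupported and not yet acknowledged
]
print(can_export(draft))  # -> False
```

The gate only blocks export; the "zero autonomous sends" measure is satisfied by keeping the send action outside the tool entirely.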

Theme Evidence

Stakeholder/Client Communication Drafting & Translation
stakeholder_communication_support
20 responses (12.7%) · α=0.928 · 3/3 convergence

Draft, rewrite, proofread, and tailor communications (emails, updates, status messages, explanations) for stakeholders, clients, or other audiences, including simplifying technical details for non-technical readers and adjusting tone or language. Assign when the respondent wants AI to help compose...

#5
Solution-Space Explorer for Trade-off Analysis and Prototype Spikes
Before a team commits to architecture, there is a fuzzier phase that looks more like technical research than design review. Someone has an instinct, a few constraints, and maybe one favored approach; what is missing is a structured way to widen the option set, compare genuinely different paths, and test the risky assumptions quickly. Current tools tend to hand back polished answers too early, which narrows thinking instead of broadening it.

Project Description

An interactive exploration board for early technical discovery: it widens the option set, scores competing approaches against user-chosen criteria, and spins up disposable prototype spikes or diagrams for the options worth testing. This is the fuzzy front end of technical research, not a substitute for a formal architecture review.

Relevant Context Sources:
  • Problem statement, constraints, non-goals, and risk tolerance supplied by the user
  • Selected repository modules, interfaces, and dependency manifests relevant to the question
  • Existing design docs, ADRs, and past decisions for similar problems
  • Operational artifacts such as incidents, runbooks, and service objectives when they bear on the trade-off
  • Approved technology lists and engineering standards
Capability Steps:
  1. Start a session from a design or research question and capture the constraints, assumptions, and evaluation criteria the user wants to hold fixed.
  2. Generate 3–6 option cards, including at least one deliberately contrasting alternative so the first idea does not dominate the session.
  3. Attach local precedent, prerequisites, and explicit assumptions to each option using the selected codebase and design artifacts.
  4. Let the user reweight criteria, eliminate options, add their own candidate, or ask the system to challenge the current set with more extreme or conservative alternatives.
  5. For selected options, create lightweight spike artifacts such as skeleton code, interface stubs, sample config, or quick diagrams that test the risky part of the idea rather than the whole system.
  6. Export a shareable comparison log that records the options considered, trade-offs, user edits, prototype results, and open questions.
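Steps 2-4 amount to maintaining a set of option cards scored against user-weighted criteria, where reweighting or eliminating options simply re-ranks the set. A minimal sketch under that reading; the option names, criteria, and weights are illustrative, not from the survey:

```python
from dataclasses import dataclass

@dataclass
class OptionCard:
    name: str
    assumptions: list[str]            # explicit assumptions attached in step 3
    scores: dict[str, float]          # criterion -> score on a 1-5 scale

def rank_options(options, weights):
    """Rank option cards by weighted score; `weights` is the
    user-adjustable criterion -> weight mapping from step 4."""
    total = sum(weights.values())
    def weighted(card):
        return sum(w * card.scores.get(c, 0) for c, w in weights.items()) / total
    return sorted(options, key=weighted, reverse=True)

# Illustrative session: two deliberately contrasting options (step 2).
options = [
    OptionCard("managed queue", ["vendor lock-in is acceptable"],
               {"time_to_ship": 5, "ops_burden": 5, "cost": 2}),
    OptionCard("self-hosted broker", ["team can operate the broker"],
               {"time_to_ship": 2, "ops_burden": 2, "cost": 4}),
]
weights = {"time_to_ship": 3, "ops_burden": 2, "cost": 1}  # user reweights freely
print([card.name for card in rank_options(options, weights)])
# -> ['managed queue', 'self-hosted broker']
```

Because the ranking is a pure function of the weights, "challenge the current set" (step 4) and the exported comparison log (step 6) can both replay the same cards under different weightings.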

Who It Affects

18 responses (11.5%) · α=0.938

18 of 157 developers (11.5%) explicitly asked for AI help with research, brainstorming, option generation, or rapid prototypes, signaling a clear need among those doing early design and technical discovery work.
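Per the methodology, a response counts toward a theme's prevalence only when at least 2 of the 3 model coders assign it (majority vote over unique PIDs). A minimal sketch of that rule, with hypothetical codings rather than real survey data:

```python
from collections import Counter

def theme_prevalence(codings, theme, quorum=2):
    """Count unique PIDs where at least `quorum` coders assigned `theme`.
    `codings` maps coder name -> {pid: set of assigned themes}."""
    votes = Counter()
    for per_coder in codings.values():
        for pid, themes in per_coder.items():
            if theme in themes:
                votes[pid] += 1
    return sum(1 for n in votes.values() if n >= quorum)

# Hypothetical codings: three coders, three PIDs, one theme of interest.
codings = {
    "coder_a": {40: {"brainstorming"}, 151: {"brainstorming"}, 196: set()},
    "coder_b": {40: {"brainstorming"}, 151: set(), 196: {"brainstorming"}},
    "coder_c": {40: set(), 151: {"brainstorming"}, 196: {"brainstorming"}},
}
print(theme_prevalence(codings, "brainstorming"))  # -> 3 (each PID has 2/3 votes)
```

Note that multi-coding is allowed (a PID can hold several themes), so prevalence percentages across themes need not sum to 100%.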

Quantitative Signals:
  • 72.6% want High or Very High AI support for this theme
  • Average AI Preference: 4.07/5
  • Average AI Usage: 2.58/5
  • Preference-Usage Gap: 1.49
AI as a brainstorming tool, or a first pass of ideas, is a great way to easily expand the number of ideas being considered. Too often we just have one idea - hey, I want people's opinions, I'll send out an online survey - without considering other options. Sometimes our first thought is the best option, sometimes it's not.
PID 457
An AI that can help with research and brainstorming of technical things would be very useful. I don't think current models have the specialised knowledge to do that.
PID 196

Impact

If this capability exists, developers can move from a vague design question to a reviewable set of distinct options, explicit assumptions, and lightweight prototype spikes in hours instead of days. The main benefit is reduced tunnel vision: teams consider more than one plausible approach, make trade-offs visible, and validate risky ideas earlier without handing final decisions to the tool.

Evidence
AI as a brainstorming tool, or a first pass of ideas, is a great way to easily expand the number of ideas being considered. Too often we just have one idea - hey, I want people's opinions, I'll send out an online survey - without considering other options. Sometimes our first thought is the best option, sometimes it's not.
PID 457
Find issues in documentation. Help with research and creating working prototypes.
PID 151
AI guided demo on a new technology might be interesting. Something like having AI act as a solution engineer to do an initial demo/mock-up solution to some customer problem. AI assistants in documentation/note taking in customer interactions might reduce human biases.
PID 384

Constraints & Guardrails

The tool must not outsource creativity; it should explicitly encourage users to challenge the initial options, add their own alternatives, and avoid treating the first generation as sufficient.
"AI should also not be relied upon in a way that reduces creativity. Research and brainstorming are exercises it can help ask questions to facilitate, but we shouldn't go with its first ideas to solve a problem." (PID 225)

Success Definition

Qualitative Measures

  • Developers describe the system as a technical sounding board or thought partner rather than an answer generator.
  • Users report that the tool surfaced at least one plausible option they had not previously considered.
  • Teams report that trade-off matrices, assumptions, and unknowns make early design discussions easier to review and align around.
  • Developers say prototype spikes helped validate risky assumptions before committing to full implementation.

Quantitative Measures

  • Reduce time from initial problem statement to a shareable option set by 50%
  • Increase average number of distinct options considered per design discussion from 1-2 to 4-6
  • Reduce time to first runnable spike or prototype for selected options by 30%
  • Achieve >70% of exploration sessions where the developer uses steer, expand, challenge, or add-option actions beyond the initial generation
  • Reduce the preference-usage gap for this theme from 1.49 to below 0.5 in a follow-up deployment study

Theme Evidence

Brainstorming, Option Generation & Rapid Exploration
brainstorming_and_solution_exploration
18 responses (11.5%) · α=0.938 · 3/3 convergence

Serve as a technical sounding board to expand solution options, propose architectures or approaches, compare tradeoffs, and rapidly explore directions (including lightweight prototypes/mockups). Assign when the respondent wants AI to help generate, evaluate, or iterate on ideas and design...