  • Analysis Tracks: 2
  • Task Categories: 5
  • LLM Coders: 3
  • Batch Size: 20
  • IRR Metric: Krippendorff's α

This document describes the full methodology for a multi-model qualitative analysis pipeline that identifies research opportunities (what developers want AI to do) and design constraints (what developers do not want AI to handle) from open-ended survey responses. Three frontier LLMs serve as independent coders, with inter-rater reliability calculated via Krippendorff's alpha and consensus reached through majority vote. A human review gate validates codebooks before systematic coding begins.

Research Questions & Data

Survey Questions (per category)

Opportunity track: "Where do you want AI to play the biggest role in [category] activities?"
Open-ended responses capturing desired capabilities and unmet needs.
Constraint track: "What aspects do you NOT want AI to handle and why?"
Open-ended responses capturing guardrails, no-go zones, and boundary conditions.

Unit of Analysis

Each unit is a single respondent's open-ended answer to one of the two questions within a category. Respondents answered about 2–3 categories each, and a single response may be assigned multiple theme codes.

Task Categories & Response Counts

Category | Respondents | Tasks Covered
Development | 816 | Coding, Bug Fixing, Perf Optimization, Refactoring, AI Development
Design & Planning | 548 | System Architecture, Requirements Gathering, Project Planning
Meta-Work | 532 | Documentation, Communication, Mentoring, Learning, Research
Quality & Risk | 401 | Testing & QA, Code Review / PRs, Security & Compliance
Infrastructure & Ops | 283 | DevOps / CI-CD, Environment Setup, Monitoring, Customer Support

Data Quality Note

Approximately 11% of responses contain data quality issues detected during coding: misplaced answers (respondent wrote a "want" answer in the "not want" field or vice-versa), back-references to prior answers that are unintelligible on their own, or terse non-responses. These are flagged with ISSUE_* codes rather than discarded, avoiding pre-filter bias (see Methodological Controls).

Pipeline Overview

The analysis runs in two stages with a human review gate between them. Both the opportunity and constraint tracks follow identical process steps but with track-specific prompts and codebooks.

  1. Theme Discovery
     • Opportunity track: 3 models discover "want" themes independently
     • Constraint track: 3 models discover "not want" themes independently
  2. Reconciliation
     • Opportunity track: GPT-5.2 merges themes into a unified codebook
     • Constraint track: GPT-5.2 merges constraint themes into a codebook
  3. Human Gate: Author Review & Codebook Approval
     • Researcher validates themes, merges/splits/renames as needed, and approves before coding proceeds
  4. Systematic Coding
     • Opportunity track: 3 models code every response (batch = 20)
     • Constraint track: 3 models code every constraint response
  5. Triangulation: IRR Calculation & Majority-Vote Consensus
     • Krippendorff's α per theme, 2-of-3 majority vote, ISSUE code aggregation
  6. Outputs
     • Opportunity track: Rich Opportunity Cards (top-5 per category, 3-model synthesis)
     • Constraint track: Constraint Maps & Design Principles (no-go zones with guardrail guidance)

Stage 1: Theme Discovery & Reconciliation

Phase 1: Independent 3-Model Discovery

Each model receives all responses for a given category and independently proposes 4–15 themes with supporting evidence (PIDs). The prompt instructs models to create specific, actionable, problem-focused themes and to allow multi-coding.

Model | Provider | Thinking Mode | Role
GPT-5.2 | OpenAI | reasoning_effort="high" | Independent coder & reconciler
Gemini 3.1 Pro | Google | thinking_level="HIGH" | Independent coder
Claude Opus 4.6 | Anthropic | thinking: adaptive, effort: high | Independent coder

Inputs

  • Open-ended survey responses with PIDs (e.g., 816 Development responses or 548 Design & Planning responses)
  • Category name and context description

Outputs (per category, per model)

  • Theme codebook: code, name, description, supporting PIDs
  • Per-response codings: PID → [theme_code_1, theme_code_2, ...]
  • Files: {category}_themes_{model}.json (15 opportunity files + 15 constraint files)

Phase 2: GPT-5.2 Reconciliation

A single reconciliation model (GPT-5.2) receives all three models' theme sets and produces a unified codebook per category by:

  1. Identifying overlapping themes across models (same concept, different names)
  2. Merging overlapping themes into single unified entries
  3. Retaining single-model themes only if substantive (≥3 PIDs)
  4. Dropping themes that are too vague or have very few supporting responses
  5. Targeting 5–10 unified themes per category

Each unified theme records its source_models (which of the three models independently proposed it) and source_codes (original model-specific code names), providing full provenance.

Outputs

  • consolidated_codebook.json — all 5 category codebooks (opportunity track)
  • constraint_codebook.json — all 5 category codebooks (constraint track)

Human Review Gate Required

The pipeline pauses for researcher review before systematic coding begins. The researcher:

  • Reviews each proposed theme and reads sample supporting responses
  • Checks themes for specificity, granularity, and completeness
  • Can keep, rename, merge, split, or remove any theme
  • Can add themes the models missed
  • Documents rationale for all changes

Systematic coding (Stage 2) does not proceed until the codebook is explicitly approved.

Stage 2: Systematic Coding & Analysis

Coding Protocol

All three models independently re-code every response against the finalized codebook. Key protocol elements:

Parameter | Value | Rationale
Batch size | 20 responses per API call | Balances context window usage against API call count
Rationale-first | Model writes rationale before assigning codes | Improves accuracy via chain-of-thought; enables auditability
Cross-response context | Each response shown alongside opposite-question answer | Enables misresponse detection (ISSUE codes)
Multi-coding | 0, 1, or many themes per response | Captures full semantic content
Codebook-only | Only codebook codes or ISSUE_* codes allowed | Prevents code drift across batches
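The batching and cross-response-context elements of the protocol can be sketched in a few lines of Python. This is illustrative only: the field names (`pid`, `text`) and the pairing helper are assumptions, not the pipeline's actual code.

```python
def build_batches(responses, opposite_by_pid, size=20):
    """Group responses into batches of `size` (one API call each),
    attaching the same respondent's opposite-question answer as
    cross-response context. Field names are illustrative."""
    items = [
        {
            "pid": r["pid"],
            "response": r["text"],
            # opposite-question answer enables ISSUE_WRONG_FIELD detection
            "opposite_answer": opposite_by_pid.get(r["pid"], ""),
        }
        for r in responses
    ]
    return [items[i:i + size] for i in range(0, len(items), size)]
```

For 45 responses this yields three batches of 20, 20, and 5, so no response is silently dropped at the tail.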

ISSUE Code System

During systematic coding, models flag data quality problems rather than silently discarding responses:

Code | Meaning | Example
ISSUE_WRONG_FIELD | Respondent answered the opposite question | Describing constraints in the "want" field
ISSUE_BACK_REFERENCE | References a prior answer; unintelligible alone | "Same as before", "see above"
ISSUE_NON_RESPONSE | Terse non-answer with no analyzable content | "N/A", "none", "no"

Models may create additional ISSUE_* codes if they encounter other data quality problems. The ISSUE prefix ensures these are never confused with substantive themes.
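Because only approved codebook codes or ISSUE_*-prefixed flags are legal, drift can be caught mechanically after each batch. A minimal sketch (function and variable names are illustrative):

```python
def enforce_codebook(assigned, codebook_codes):
    """Split model-assigned codes into valid entries (codebook codes or
    ISSUE_* flags) and drifted codes; drifted codes are dropped but
    returned so they can be audited."""
    valid, drifted = [], []
    for code in assigned:
        if code in codebook_codes or code.startswith("ISSUE_"):
            valid.append(code)
        else:
            drifted.append(code)
    return valid, drifted
```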

Inter-Rater Reliability (IRR)

Agreement between the three LLM coders is measured per theme using Krippendorff's alpha (α), the standard multi-rater reliability coefficient for qualitative research. For each theme, a binary (present/absent) coding matrix is built across all responses, and α is calculated at the nominal level.

Range | Interpretation
α ≥ 0.80 | Excellent agreement — publishable
α ≥ 0.67 | Acceptable agreement — tentative conclusions
α ≥ 0.50 | Moderate agreement — use with caution
α < 0.50 | Poor agreement — unreliable for this theme
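For the binary present/absent matrices described above, nominal-level α can be computed directly from the coincidence-matrix definition. A self-contained sketch, assuming no missing ratings (the production pipeline may use a library instead):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of rater values per response, e.g. [[1, 1, 0], [0, 0, 0]].
    Returns nominal-level Krippendorff's alpha (no missing data assumed)."""
    o = Counter()    # coincidence counts over ordered value pairs
    n_c = Counter()  # marginal frequency of each value
    n = 0            # total pairable values
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # units rated by fewer than 2 coders are unpairable
        n += m
        for v in ratings:
            n_c[v] += 1
        for a, b in permutations(range(m), 2):
            o[(ratings[a], ratings[b])] += 1 / (m - 1)
    # observed vs. chance-expected disagreement
    d_o = sum(cnt for (c, k), cnt in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

With three raters per response, `[[1, 1, 0], [0, 0, 0], [1, 1, 1]]` yields α = 0.6; perfect agreement yields α = 1.0.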

Additionally, pairwise Cohen's kappa (κ) is calculated for each model pair (GPT–Gemini, GPT–Opus, Gemini–Opus) and 3-rater percent agreement (all three models assign the same code) is reported per theme.
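Pairwise κ on the same binary theme vectors reduces to a few lines; a sketch for two raters:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' parallel binary codings of one theme."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # per-rater base rates for "present"
    p_e = pa * pb + (1 - pa) * (1 - pb)          # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```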

Consensus Voting

Final theme assignments use a majority vote: 2 of 3 models must agree for a theme to be assigned to a response. This is applied independently per response and per theme code.

ISSUE code handling

If 2+ models flag any ISSUE code for a response (regardless of which specific ISSUE code), the response receives a generic ISSUE marker and is excluded from substantive analysis. This prevents a single aggressive model from filtering out too many responses.
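The majority vote and the ISSUE aggregation rule combine into one small function; a sketch whose input/output structure is illustrative:

```python
from collections import Counter

def consensus_for_response(codes_by_model):
    """codes_by_model: {"gpt": [...], "gemini": [...], "opus": [...]}.
    Applies the 2-of-3 majority vote per theme; 2+ models flagging any
    ISSUE_* code (regardless of which one) excludes the response."""
    issue_votes = sum(
        any(c.startswith("ISSUE_") for c in codes)
        for codes in codes_by_model.values()
    )
    if issue_votes >= 2:
        return {"themes": [], "issue": True}
    # set() per model so a duplicated code can't double-vote
    votes = Counter(c for codes in codes_by_model.values() for c in set(codes))
    themes = sorted(
        c for c, v in votes.items() if v >= 2 and not c.startswith("ISSUE_")
    )
    return {"themes": themes, "issue": False}
```

A single model flagging an ISSUE code does not exclude the response; its substantive codes still participate in the vote.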

Rich Opportunity Cards

For the top 5 themes per category (by prevalence), all three models independently generate detailed opportunity cards including:

  • Problem statement and proposed capability description
  • Required context sources and capability steps
  • Impact description with supporting evidence quotes
  • Success criteria (qualitative and quantitative measures)
  • Constraints and guardrails drawn from the constraint track
  • Prevalence data and quantitative signals (AI preference, usage gap)

Cards from the three models are merged using a union-and-deduplicate strategy: longest title wins, context sources are combined (max 7), capability steps use the longest sequence (max 6), and constraints are deduplicated (max 4).
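The union-and-deduplicate strategy can be sketched as follows; the card field names are illustrative, while the caps come from the rules above:

```python
def merge_cards(cards):
    """Merge one theme's per-model opportunity cards: longest title wins,
    context sources union to max 7, longest step sequence capped at 6,
    constraints deduplicated to max 4. Field names are illustrative."""
    def dedupe(seq):
        out = []
        for x in seq:
            if x not in out:
                out.append(x)  # order-preserving deduplication
        return out

    return {
        "title": max((c["title"] for c in cards), key=len),
        "context_sources": dedupe(s for c in cards for s in c["context_sources"])[:7],
        "steps": max((c["steps"] for c in cards), key=len)[:6],
        "constraints": dedupe(g for c in cards for g in c["constraints"])[:4],
    }
```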

Constraint Maps & Design Principles

Constraint-track prevalence is calculated identically to the opportunity track. The top no-go zones per category are documented with:

  • Zone name, description, and prevalence count
  • Up to 10 supporting respondent quotes
  • 3–6 synthesized design principles per category (generated by GPT-5.2)
  • Each principle includes implementation guidance and derivation provenance

Methodological Controls

The pipeline incorporates several controls designed to increase rigor beyond what a single-model analysis can provide.

Control | Mechanism | What It Mitigates
Multi-LLM triangulation | 3 frontier models from different families code independently | Single-model bias, training-data artifacts, idiosyncratic interpretations
Rationale-first coding | Models write reasoning before assigning codes | Snap-judgment errors; enables post-hoc audit of coding decisions
Cross-response context | Both "want" and "not want" answers shown to coder | Misresponse blindness; enables ISSUE_WRONG_FIELD detection
ISSUE code system | Flag quality problems in-band rather than pre-filtering | Pre-filter bias from silently dropping ambiguous responses
Idempotent checkpointing | Staleness detection skips phases whose inputs haven't changed | Wasted computation; ensures reproducible reruns
Consensus merging | Majority vote (2/3) for codes; union-and-deduplicate for synthesis | Noise from single-model outlier codes; incomplete synthesis from any single model

Design Decisions & Trade-offs

Decision | Rationale | Trade-off
3 models, not 2 or 5 | Minimum for meaningful IRR (Krippendorff's α); covers 3 major LLM families | Higher API cost (∼3×); manageable with batch parallelism
HIGH thinking for all models | Qualitative coding benefits from extended reasoning; reduces surface-level pattern matching | Slower inference, higher token cost (thinking tokens billed at output rate)
Batch size of 20 | Enough responses for cross-response pattern recognition; fits comfortably in context windows | More API calls than larger batches; but avoids context truncation risks
Majority vote (2/3) | Balances sensitivity and specificity; equivalent to >50% agreement threshold | May miss themes where only one model sees a valid pattern
Human gate before coding | Prevents systematic errors from propagating through the entire coding phase | Introduces a manual pause in an otherwise automated pipeline
No pre-filtering of responses | ISSUE codes capture quality problems without discarding data points | Models must process noisy responses; ISSUE detection is itself imperfect
GPT-5.2 as sole reconciler | Reconciliation requires structured comparison rather than independent generation; one model suffices | Reconciliation may inherit GPT-specific biases in theme naming
Streaming for Claude Opus | Avoids 10-minute HTTP timeout on long-running inference | More complex error handling; no retry on partial stream failures

Limitations & Mitigations

Limitation | Impact | Mitigation
LLM nondeterminism | Exact codings may vary across runs even with identical inputs | 3-model triangulation smooths out individual variance; IRR quantifies remaining disagreement; idempotent checkpointing ensures reproducible runs when inputs are stable
LLM rationalization | Models may construct plausible but incorrect rationales | Multi-model disagreement surfaces cases where rationalization diverges; majority vote filters single-model confabulations
Prompt sensitivity | Different prompt wording could yield different themes | Codebook-anchored coding constrains coder freedom; prompts are documented and versioned for replication
Not replacing human qualitative research | LLM coders lack lived experience; may miss cultural nuances | Human review gate validates codebook; methodology is positioned as accelerating qualitative work, not replacing it; all outputs include supporting quotes for human verification
Survey sample | 860 Microsoft developers may not represent the broader industry | Out of scope for the analysis methodology itself; noted as a limitation of the source data
LLM knowledge contamination | Models may have been trained on similar survey analyses | Codebook-first design constrains output to researcher-approved themes; verbatim quotes provide verifiable evidence independent of model knowledge

Artifacts & Replication

Artifact Inventory

Phase | File Pattern | Count | Description
Data | {category}_responses.json | 5 | Extracted open-ended responses with PIDs
Data | {category}_quantitative.json | 5 | Aggregated Likert scale metrics per task
Data | {category}_do_not_want_responses.json | 5 | Extracted constraint responses with PIDs
Stage 1 | {category}_themes_{model}.json | 15 | Independent opportunity theme discoveries
Stage 1 | {category}_constraint_themes_{model}.json | 15 | Independent constraint theme discoveries
Stage 1 | consolidated_codebook.json | 1 | Unified opportunity codebook (all categories)
Stage 1 | constraint_codebook.json | 1 | Unified constraint codebook (all categories)
Stage 2 | {category}_phase4_codings.json | 5 | 3-model systematic codings with rationales
Stage 2 | phase5_irr_results.json | 1 | Krippendorff's α, Cohen's κ, agreement %
Stage 2 | phase6_prevalence_results.json | 1 | Majority-vote consensus and theme prevalence
Stage 2 | phase6_rich_opportunities.json | 1 | Top-5 opportunity cards per category (3-model synthesis)
Stage 2 | constraint_maps.json | 1 | No-go zones and design principles

Dependency Chain

data.xlsx
  ↓
{cat}_responses.json, {cat}_do_not_want_responses.json, {cat}_quantitative.json
  ↓
{cat}_themes_{model}.json (3×5 = 15 files), {cat}_constraint_themes_{model}.json (15 files)
  ↓
consolidated_codebook.json, constraint_codebook.json
  ↓
■ HUMAN REVIEW GATE
  ↓
{cat}_phase4_codings.json (5 files, 3 models each)
  ↓
phase5_irr_results.json, phase6_prevalence_results.json
  ↓
phase6_rich_opportunities.json, constraint_maps.json

Staleness Detection

Every pipeline phase checks whether its output is stale relative to its inputs by comparing file modification times. If all inputs are older than the output, the phase is skipped. If any input is newer, the output is regenerated. This enables:

  • Incremental reruns: updating one category's theme discovery only regenerates downstream outputs for that category
  • Safe restarts: if the pipeline crashes mid-phase, only the incomplete phase reruns
  • Force override: --force flag bypasses staleness checks for full regeneration
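The mtime comparison behind staleness detection is straightforward; a sketch (the pipeline's actual helper may differ):

```python
import os

def is_stale(output_path, input_paths):
    """A phase output is stale if it does not exist or any input is newer."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_mtime for p in input_paths)
```

A phase runner calls this once per phase, skipping execution when it returns False (unless `--force` is set).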

How to Rerun

  1. Ensure API keys are set in .env for OpenAI, Google, and Anthropic
  2. Install dependencies: uv sync
  3. Run full pipeline: bash run_full_pipeline.sh
  4. Pipeline pauses after Stage 1 for human codebook review
  5. After approval, Stage 2 runs automatically
  6. To force regeneration: bash run_full_pipeline.sh --force
  7. To rerun a single category: uv run phase4_systematic_coding.py design_planning

Appendix

Opportunity Codebook (All 5 Categories, 48 Themes)

Unified codebook produced by GPT-5.2 reconciliation of themes independently discovered by all three models. Each theme lists which models independently identified it.

Development 10 themes

Code | Theme | Models
refactoring_modernization | Automated Refactoring, Modernization & Tech-Debt Reduction | GPT, Gemini, Opus
boilerplate_scaffolding_feature_codegen | Boilerplate, Scaffolding & Routine Feature Code Generation | GPT, Gemini, Opus
automated_testing_validation | Automated Test Generation, Coverage & Change Validation | GPT, Gemini, Opus
debugging_root_cause_fixing | Debugging, Root Cause Analysis & Bug Fix Assistance | GPT, Gemini, Opus
repo_wide_context_dependency_awareness | Repo-Wide Context, Dependency Awareness & Safe Multi-File Changes | GPT, Gemini, Opus
code_quality_review_security_compliance | Code Quality, Review Automation, Standards & Security/Compliance Guidance | GPT, Gemini, Opus
performance_profiling_optimization | Performance Profiling & Optimization Suggestions | GPT, Gemini, Opus
architecture_design_planning_support | Architecture, Design Brainstorming & Planning Support | GPT, Gemini, Opus
devops_ci_cd_iac_workflow_automation | DevOps, CI/CD, IaC & Engineering Workflow Automation | GPT, Gemini, Opus
documentation_knowledge_retrieval_onboarding | Documentation Generation, Knowledge Retrieval & Onboarding/Learning Support | GPT, Gemini, Opus

Design & Planning 10 themes

Code | Theme | Models
requirements_gathering_synthesis | Requirements Gathering, Synthesis & Clarification | GPT, Gemini, Opus
architecture_design_generation | Architecture & System Design Generation/Iteration | GPT, Gemini, Opus
interactive_brainstorming_design_partner | Interactive Brainstorming & Design Copilot | GPT, Gemini, Opus
tradeoff_decision_support_simulation | Trade-off Analysis, What-if Simulation & Decision Support | GPT, Gemini, Opus
design_validation_risk_edge_cases | Design Validation, Risk Assessment & Edge-Case Discovery | GPT, Gemini, Opus
project_planning_tasking_status_automation | Project Planning, Ticket/Task Breakdown & Status Automation | GPT, Gemini, Opus
documentation_spec_diagram_generation | Documentation, Specs & Diagram/Artifact Generation | GPT, Gemini, Opus
context_retrieval_codebase_and_institutional_memory | Context Retrieval: Codebase Understanding & Institutional Memory | GPT, Gemini, Opus
research_and_information_synthesis | Research, Information Gathering & Knowledge Synthesis | GPT, Gemini, Opus
trustworthy_outputs_with_citations | Trustworthy Outputs: Higher Accuracy & Verifiable Citations | GPT, Gemini, Opus

Quality & Risk 9 themes

Code | Theme | Models
automated_test_generation_and_quality_gates | Automated Test Generation, Maintenance & Quality Gates | GPT, Gemini, Opus
intelligent_pr_code_review | Intelligent PR/Code Review Assistant | GPT, Gemini, Opus
security_vulnerability_detection_and_fix_guidance | Security Vulnerability Detection & Fix Guidance | GPT, Gemini, Opus
compliance_and_audit_automation | Compliance, Standards & Audit Process Automation | GPT, Gemini, Opus
proactive_risk_monitoring_and_prediction | Proactive Risk Monitoring, Prediction & Anomaly Detection | GPT, Gemini, Opus
debugging_root_cause_and_failure_triage | Debugging, Root Cause Analysis & Failure Triage | GPT, Gemini, Opus
knowledge_retrieval_and_standards_guidance | Knowledge Retrieval, Summarization & Standards Guidance | GPT, Gemini, Opus
agentic_workflow_automation_and_remediation | Agentic Workflow Automation & Automated Remediation | GPT, Gemini, Opus
ai_driven_exploratory_chaos_and_fuzz_testing | AI-Driven Exploratory, Chaos & Fuzz Testing | Opus only

Infrastructure & Ops 10 themes

Code | Theme | Models
intelligent_monitoring_alerting_anomaly_detection | Intelligent Monitoring, Alerting & Anomaly Detection | GPT, Gemini, Opus
incident_response_rca_mitigation_self_heal | Incident Response Automation (Triage, RCA, Mitigation, Self-Heal) | GPT, Gemini, Opus
cicd_pipeline_and_deployment_automation | CI/CD Pipeline & Deployment Automation | GPT, Gemini, Opus
infrastructure_provisioning_and_iac_generation | Automated Environment Setup & IaC Generation | GPT, Gemini, Opus
infrastructure_maintenance_upgrades_security_cost_optimization | Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization | GPT, Gemini, Opus
customer_support_triage_and_autoresponse | Customer Support Triage & Auto-Response | GPT, Gemini, Opus
knowledge_management_doc_search_and_system_context | Knowledge Management, Documentation Search & System Context | GPT, Gemini, Opus
ops_toil_automation_and_script_generation | Ops Toil Automation & Script Writing/Debugging | GPT, Gemini, Opus
testing_quality_validation_and_safe_deploy | Testing, Quality Validation & Safer Releases | GPT, Gemini, Opus
ai_tooling_ux_accuracy_and_cohesive_workflows | Better AI Tooling UX (Accuracy, Control & Cohesive Workflows) | GPT, Gemini, Opus

Meta-Work 9 themes

Code | Theme | Models
automated_documentation | Automated Documentation Generation & Maintenance | GPT, Gemini, Opus
knowledge_search_and_discovery | Project Knowledge Search & Discovery (with Traceable Sources) | GPT, Gemini, Opus
brainstorming_and_solution_exploration | Brainstorming, Option Generation & Rapid Exploration | GPT, Gemini, Opus
personalized_learning_and_upskilling | Personalized Learning for New Technologies | GPT, Gemini, Opus
team_onboarding_and_mentoring | Team Onboarding, Mentoring & Institutional Knowledge Transfer | GPT, Gemini, Opus
stakeholder_communication_support | Stakeholder/Client Communication Drafting & Translation | GPT, Gemini, Opus
meeting_assistance | Meeting Scheduling, Notes, Summaries & Action Items | GPT, Gemini, Opus
planning_prioritization_and_status_tracking | Planning, Prioritization, Blocker Detection & Status Reporting | GPT, Gemini, Opus
proactive_personal_agent_and_admin_automation | Proactive Personal Agent & Routine Admin Automation | GPT, Gemini, Opus

Constraint Codebook (All 5 Categories, 50 Themes)

Unified constraint codebook produced by GPT-5.2 reconciliation. Captures what developers do not want AI to handle.

Development 10 themes

Code | Theme | Models
no_autonomous_architecture_system_design | No Autonomous Architecture or System Design Decisions | GPT, Gemini, Opus
no_large_unscoped_refactors | No Large, Unscoped, or Sweeping Codebase Changes | GPT, Gemini, Opus
no_autonomous_execution_merge_deploy_or_agentic_control | No Autonomous Execution, Merging/Deploying, or Agentic Control | GPT, Gemini, Opus
no_complex_debugging_or_critical_bug_fixes | No AI Ownership of Complex Debugging or Critical Bug Fixes | GPT, Gemini, Opus
no_security_privacy_secrets_handling | No Security/Privacy-Sensitive Work or Secrets Handling | GPT, Gemini, Opus
no_autonomous_performance_optimization | No Autonomous Performance Optimization | GPT, Gemini, Opus
no_ai_deciding_requirements_business_logic_or_api_ux | No AI-Led Requirements, Core Business Logic, or API/UX Decisions | GPT, Gemini
preserve_developer_agency_learning_and_job_ownership | Preserve Developer Agency, Learning, and Ownership | GPT, Gemini, Opus
avoid_ai_when_unreliable_contextless_hard_to_verify_or_intrusive | Avoid AI Output That Is Unreliable, Contextless, Hard to Verify, or Intrusive | GPT, Gemini, Opus
no_constraints_open_to_ai_help | No Specific No-Go Zones (Open to AI Help) | GPT, Gemini

Design & Planning 10 themes

Code | Theme | Models
human_accountability_final_decisions | No AI Final Decision-Making (Human Accountability Required) | GPT, Gemini, Opus
human_led_architecture_design | No AI as Primary System Architect / High-Level Designer | GPT, Gemini, Opus
no_ai_project_management_task_assignment | No AI Running Project Management | GPT, Gemini, Opus
no_ai_requirements_stakeholder_elicitation | No AI-Led Requirements Gathering or Stakeholder Alignment | GPT, Gemini, Opus
no_ai_empathy_team_dynamics | No Replacement of Human Empathy, Collaboration, or Interpersonal Dynamics | GPT, Gemini, Opus
ai_assistant_human_in_loop | No Autopilot: AI Should Assist with Human-in-the-Loop Oversight | GPT, Gemini, Opus
trust_accuracy_and_context_limitations | Avoid AI for High-Stakes Work Due to Reliability & Missing Context | GPT, Gemini, Opus
privacy_confidentiality_ip_and_message_control | No AI Handling Sensitive/Confidential Data or Uncontrolled Messaging | GPT, Gemini, Opus
no_ai_vision_strategy_creativity_taste | No AI Owning Product Vision, Strategy, or Creative Judgments | GPT, Gemini
no_constraints_or_unsure | No Constraints Stated / Welcome Full AI Involvement | GPT, Gemini, Opus

Quality & Risk 10 themes

Code | Theme | Models
human_final_decision_and_accountability | Humans Must Make Final High-Stakes Decisions | GPT, Gemini, Opus
no_autonomous_code_or_production_actions | No Autonomous Code/Repo/Production Actions Without Approval | GPT, Gemini, Opus
human_code_review_gate_required | Human Code Review / PR Approval Must Remain the Gate | GPT, Gemini, Opus
security_and_compliance_must_be_human_led | Security, Compliance, and Threat Modeling Must Be Human-Led | GPT, Gemini, Opus
no_sensitive_data_or_credentials_access | Do Not Give AI Access to Sensitive/Customer Data or Credentials | GPT, Gemini, Opus
ai_outputs_must_be_verifiable_and_not_self_validated | AI Must Be Reliable, Verifiable, and Not Self-Validated | GPT, Gemini, Opus
humans_own_requirements_architecture_and_tradeoffs | Humans Must Own Requirements, Architecture, and Trade-Offs | GPT, Gemini, Opus
human_led_test_strategy_intent_and_signoff | Test Strategy and Sign-Off Must Be Human-Led | GPT only
preserve_human_ethics_empathy_and_human_centric_work | Preserve Human Ethics, Empathy, and Human-Centric Work | GPT, Gemini
no_constraints_stated | No Specific No-Go Areas Stated | GPT, Opus

Infrastructure & Ops 10 themes

Code | Theme | Models
no_direct_customer_interaction | No Direct AI-to-Customer Interaction | GPT, Gemini, Opus
no_autonomous_production_changes | No Autonomous Production Deployments or Changes | GPT, Gemini, Opus
human_approval_before_consequential_actions | Human Approval Required Before Consequential Actions | GPT, Opus
no_security_permissions_secrets_management | No AI Management of Security, Access, Permissions, or Secrets | GPT, Gemini, Opus
no_autonomous_incident_response_or_overrides | No Autonomous Incident Response or Critical Overrides | GPT, Gemini, Opus
avoid_ai_for_high_precision_deterministic_work | Avoid AI for High-Precision/Deterministic Work | GPT, Gemini, Opus
no_full_autonomy_for_environment_setup_maintenance | No Full Autonomy for Environment Setup and Maintenance | GPT, Gemini
preserve_human_learning_and_accountability | Preserve Human Learning, System Understanding, and Accountability | GPT, Gemini, Opus
no_ai_initiated_irreversible_or_destructive_data_actions | No AI-Initiated Irreversible/Destructive Data Operations | GPT, Gemini, Opus
no_constraints_expressed_or_pro_automation | No Constraints Expressed / Comfortable with Broad Automation | GPT, Gemini, Opus

Meta-Work 10 themes

Code | Theme | Models
human_led_mentoring_onboarding | Keep mentoring and onboarding human-led | GPT, Gemini, Opus
human_authored_communication | Keep interpersonal communications human-authored | GPT, Gemini, Opus
human_review_required_before_sending_or_publishing | No autonomous sending/publishing without human review | GPT, Gemini, Opus
no_confidential_or_sensitive_data | Keep AI away from confidential or sensitive information | GPT, Gemini, Opus
preserve_hands_on_learning | Don't outsource learning and skills development to AI | GPT, Gemini, Opus
preserve_human_research_and_ideation | Keep research/brainstorming primarily human | GPT, Gemini, Opus
human_accountability_for_high_stakes_decisions | High-stakes decisions must remain human-led | GPT, Gemini, Opus
avoid_unvetted_documentation | AI-generated documentation must be vetted | GPT, Gemini, Opus
ai_outputs_not_trustworthy_as_primary_source | Don't treat AI output as trustworthy/authoritative | Opus only
no_constraints_or_unsure | No constraints stated / unsure | GPT, Gemini, Opus

Coding Prompt: Opportunity Track (Phase 4)
You are a qualitative research coder. Your task is to systematically code
each "WANT" response using ONLY the themes from the provided codebook.

Each response is shown alongside the same respondent's answer to a related
question about what they do NOT want AI to handle, for additional context.

CODEBOOK:
{codebook themes listed here}

ISSUE CODES (assign when a response has data quality issues):
- ISSUE_WRONG_FIELD: The respondent appears to have answered the other question
- ISSUE_BACK_REFERENCE: Response references a prior answer and is unintelligible
  on its own
- ISSUE_NON_RESPONSE: Terse non-answer with no analyzable content
- You may create other ISSUE_* codes if you encounter a different type of data
  quality problem

INSTRUCTIONS:
1. Read each response carefully
2. For each response, write a brief rationale
3. Then assign ALL applicable theme codes from the codebook
4. A response can have 0, 1, or multiple themes
5. Only use codes from the codebook or ISSUE codes
6. If no themes apply, return an empty array

RESPONSES TO CODE:
[Batches of 20 responses, each with context from opposite question]

OUTPUT FORMAT:
[
  {"pid": 8, "rationale": "...", "themes": ["theme_code_1", "theme_code_2"]},
  {"pid": 11, "rationale": "...", "themes": ["ISSUE_BACK_REFERENCE"]}
]

Return ONLY the JSON array, no other text.

Coding Prompt: Constraint Track

Identical structure to the opportunity track prompt, but with:

  • Constraint codebook themes replacing opportunity themes
  • "NOT WANT" responses as the primary coding target
  • "WANT" responses shown as cross-response context
  • Same ISSUE code system applies

Theme Discovery Prompt Template
You are analyzing open-ended survey responses from software developers about
where they want AI assistance in their work. Your task is to identify themes
in these responses.

Guidelines for theme creation:
- Themes should be SPECIFIC and ACTIONABLE
  (e.g., "Automated test generation for edge cases" not just "Testing")
- Themes should be PROBLEM-FOCUSED (describe the pain point, not a solution)
- A response can belong to MULTIPLE themes
- Aim for 4-15 themes that capture the major patterns

For each response, return:
{
  "pid": <participant ID>,
  "themes": ["theme_code_1", "theme_code_2", ...]
}

Also provide a theme codebook:
{
  "themes": [
    {
      "code": "snake_case_theme_code",
      "name": "Human-Readable Theme Name",
      "description": "What this capability means and why developers want it",
      "pids": [list of PIDs expressing this]
    }
  ],
  "codings": [
    {"pid": 123, "themes": ["theme_code_1", "theme_code_2"]}
  ]
}

RESPONSES TO ANALYZE:
[All responses for the category]

Theme Reconciliation Prompt Template
You are a qualitative research analyst performing theme reconciliation.

CONTEXT: Three independent AI models analyzed survey responses from the
"[category]" category. Each identified opportunity themes. Your job is to
reconcile these into a unified codebook.

--- GPT-5.2 THEMES ---
[All GPT themes with names, descriptions, PIDs]

--- GEMINI THEMES ---
[All Gemini themes]

--- OPUS THEMES ---
[All Opus themes]

TASK: Create a unified codebook by:
1. Identifying themes that overlap across models (same concept, different names)
2. Merging overlapping themes into single unified themes
3. Keeping single-model themes IF substantive (≥3 PIDs)
4. Dropping themes that are too vague or have very few supporting responses
5. Aim for 5-10 unified themes per category

For each unified theme, provide:
- code: snake_case identifier
- name: Human-readable name
- description: Clear description of the desired capability
- source_models: which models identified it (["gpt", "gemini", "opus"])
- source_codes: the original codes from each model

Return ONLY valid JSON, no other text.

ISSUE Code Taxonomy

Code | Definition | Detection Signal | Consensus Rule
ISSUE_WRONG_FIELD | Respondent answered the opposite question (e.g., wrote constraints in the "want" field) | Cross-response context reveals contradictory intent | 2+ models flag any ISSUE_* → generic ISSUE marker applied
ISSUE_BACK_REFERENCE | Response references a prior answer ("same as before", "see above") and is unintelligible alone | Short response with deictic language | (same)
ISSUE_NON_RESPONSE | Terse reply with no analyzable content | "N/A", "none", "no", single punctuation | (same)
ISSUE_* (custom) | Models may create additional issue codes for novel quality problems | Varies | Same 2/3 majority rule; prefix matching ensures grouping

Output JSON Schemas

Theme Discovery Output

{
  "model": "gpt-5.2",
  "category": "design_planning",
  "category_name": "Design & Planning",
  "response_count": 223,
  "timestamp": "ISO-8601",
  "themes": [
    {
      "code": "string",
      "name": "string",
      "description": "string",
      "pids": [integer]
    }
  ],
  "codings": [
    { "pid": integer, "themes": ["string"] }
  ]
}

Consolidated Codebook

{
  "metadata": {
    "phase": "Opportunity Theme Reconciliation",
    "timestamp": "ISO-8601",
    "reconciliation_model": "gpt-5.2",
    "discovery_models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
  },
  "categories": {
    "category_key": {
      "category": "string",
      "category_name": "string",
      "theme_count": integer,
      "models_reconciled": ["gpt", "gemini", "opus"],
      "themes": [
        {
          "code": "string",
          "name": "string",
          "description": "string",
          "source_models": ["gpt", "gemini", "opus"],
          "source_codes": {
            "gpt": ["string"],
            "gemini": ["string"],
            "opus": ["string"]
          }
        }
      ]
    }
  }
}

Systematic Codings (Phase 4)

{
  "category": "string",
  "phase": "Phase 4 - Systematic Coding",
  "timestamp": "ISO-8601",
  "models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
  "codebook": [ { "code": "string", "name": "string", "description": "string" } ],
  "response_count": integer,
  "codings": {
    "gpt": [ { "pid": integer, "rationale": "string", "themes": ["string"] } ],
    "gemini": [ ... ],
    "opus": [ ... ]
  },
  "cost": { ... }
}

IRR Results (Phase 5)

{
  "phase": "Phase 5 - Inter-Rater Reliability",
  "methodology": {
    "raters": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
    "metrics": ["Krippendorff's Alpha", "Cohen's Kappa (pairwise)", "Percent Agreement"]
  },
  "overall_statistics": {
    "mean_krippendorff_alpha": float,
    "mean_percent_agreement": float,
    "interpretation": "string"
  },
  "category_results": {
    "category_key": {
      "krippendorff_alpha": { "theme_code": float },
      "percent_agreement": { "theme_code": float },
      "pairwise_kappa": {
        "gpt_vs_gemini": { "theme_code": float },
        "gpt_vs_opus": { "theme_code": float },
        "gemini_vs_opus": { "theme_code": float }
      },
      "code_frequencies": { "gpt": {}, "gemini": {}, "opus": {} }
    }
  }
}

Prevalence Results (Phase 6)

{
  "methodology": {
    "consensus_method": "majority_vote",
    "threshold": "2+ of 3 models must agree"
  },
  "category_results": {
    "category_key": {
      "theme_prevalence": [
        {
          "code": "string",
          "count": integer,
          "percentage": float,
          "pids": [integer]
        }
      ],
      "consensus_codings": { "pid": ["theme1", "theme2"] }
    }
  }
}
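The majority-vote consensus and prevalence numbers in this schema can be derived directly from the Phase 4 `codings` structure. A self-contained sketch (function and variable names are assumptions, not the pipeline's actual code):

```python
from collections import defaultdict

# Phase 6 consensus: a theme is assigned to a response when 2+ of the 3
# models coded it. Prevalence = consensus count / total responses.

def consensus_and_prevalence(codings, n_responses, threshold=2):
    """codings: {model: [{"pid": int, "themes": [str]}]} (Phase 4 shape)."""
    votes = defaultdict(lambda: defaultdict(int))  # pid -> theme -> votes
    for model_codings in codings.values():
        for entry in model_codings:
            for theme in entry["themes"]:
                votes[entry["pid"]][theme] += 1
    consensus = {pid: sorted(t for t, v in theme_votes.items() if v >= threshold)
                 for pid, theme_votes in votes.items()}
    supporting_pids = defaultdict(list)
    for pid, themes in consensus.items():
        for theme in themes:
            supporting_pids[theme].append(pid)
    prevalence = {t: {"code": t,
                      "count": len(pids),
                      "percentage": 100.0 * len(pids) / n_responses,
                      "pids": sorted(pids)}
                  for t, pids in supporting_pids.items()}
    return consensus, prevalence
```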

Rich Opportunity Card

{
  "rank": integer,
  "theme_code": "string",
  "category": "string",
  "title": "string",
  "problem_statement": "string",
  "proposed_capability": {
    "summary": "string",
    "context_sources_needed": ["string"],
    "capability_steps": ["string"]
  },
  "impact": {
    "description": "string",
    "evidence_quotes": [ { "pid": integer, "quote": "string" } ]
  },
  "success_definition": {
    "qualitative_measures": ["string"],
    "quantitative_measures": ["string"]
  },
  "constraints_and_guardrails": [
    {
      "constraint": "string",
      "supporting_quote": { "pid": integer, "quote": "string" }
    }
  ],
  "who_it_affects": {
    "prevalence_count": integer,
    "prevalence_percentage": float,
    "description": "string",
    "signals": ["string"]
  },
  "models_consulted": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
}
IRR Interpretation Guide

Why Krippendorff's Alpha?

  • Designed for multi-rater reliability (3+ raters)
  • Handles missing data gracefully (if one model fails on a batch)
  • Supports nominal-level measurement (categorical theme codes)
  • Does not assume a fixed rater set
  • More conservative than simple percent agreement, adjusting for chance

How It's Calculated

For each theme code, a binary matrix is constructed with one row per model and one column per response (1 = theme applied, 0 = not applied), then passed to the krippendorff package:

import krippendorff

#                 PID_1  PID_2  PID_3  PID_4
matrix = [
    [1, 0, 1, 0],   # GPT
    [1, 0, 1, 0],   # Gemini
    [1, 0, 0, 0],   # Opus
]

alpha = krippendorff.alpha(
    reliability_data=matrix,
    level_of_measurement="nominal",
)

Reporting

  • Per-theme α values identify which themes models agree/disagree on
  • Themes with α < 0.67 are flagged for potential human adjudication
  • Overall mean α provides a summary reliability score
  • Pairwise κ identifies whether specific model pairs diverge
  • Code frequency counts reveal systematic over/under-coding by individual models
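The pairwise κ values above are Cohen's kappa computed on the binary presence/absence vectors of a single theme for two models. A minimal sketch of the arithmetic (the pipeline could equally use `sklearn.metrics.cohen_kappa_score`; this version just makes the chance correction explicit):

```python
# Cohen's kappa for two raters on one binary theme:
# kappa = (observed agreement - expected agreement) / (1 - expected agreement)

def cohens_kappa(a, b):
    """a, b: equal-length binary vectors (1 = theme applied to response)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0, independently.
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:  # both raters constant: kappa is undefined
        return float("nan")
    return (observed - expected) / (1 - expected)
```

κ = 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement between the pair.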

What IRR Tells Us (and Doesn't)

High α means the three models consistently apply the same theme to the same responses—the codebook is operationally clear and the models "understand" it similarly. Low α on a specific theme may indicate the theme definition is ambiguous, the theme requires human judgment the models handle differently, or the theme captures a rare pattern where base-rate effects inflate disagreement.

IRR does not tell us whether the codes are correct—only that the coders agree. This is why the human review gate exists: to ensure the codebook itself captures meaningful, well-defined themes before reliability is measured.

Model Configuration & Cost Tracking

Model Parameters

  • GPT-5.2: reasoning_effort="high"; temperature 1; streaming off
  • Gemini 3.1 Pro: ThinkingConfig(thinking_level="HIGH"); default temperature; streaming off
  • Claude Opus 4.6: thinking: adaptive, effort: high; default temperature; streaming on (timeout avoidance)

Token Pricing (per 1M tokens)

  • GPT-5.2: $1.75 input / $14.00 output
  • Gemini 3.1 Pro: $2.00 input / $12.00 output
  • Claude Opus 4.6: $5.00 input / $25.00 output

Thinking tokens are billed at the output rate for all three models.

The CostTracker class in llm.py tracks input, output, and thinking tokens separately per API call, with phase-level summaries printed to console.
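The per-call arithmetic is simple: input tokens at the input rate, output and thinking tokens together at the output rate. A minimal sketch consistent with the pricing table above (the actual CostTracker in llm.py may differ in interface; this illustrates the calculation):

```python
# Illustrative per-call cost accounting. Prices are USD per 1M tokens,
# taken from the pricing table; thinking tokens bill at the output rate.

PRICING = {  # model -> (input rate, output rate)
    "gpt-5.2": (1.75, 14.00),
    "gemini-3.1-pro-preview": (2.00, 12.00),
    "claude-opus-4-6": (5.00, 25.00),
}

class CostTracker:
    def __init__(self):
        self.calls = []

    def record(self, model, input_tokens, output_tokens, thinking_tokens=0):
        in_rate, out_rate = PRICING[model]
        cost = (input_tokens * in_rate
                + (output_tokens + thinking_tokens) * out_rate) / 1_000_000
        self.calls.append({"model": model, "cost": cost})
        return cost

    def total(self):
        return sum(call["cost"] for call in self.calls)
```

For example, a GPT-5.2 call with 1M input, 100K output, and 50K thinking tokens costs $1.75 + 0.15 × $14.00 = $3.85.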