Multi-LLM Qualitative Analysis Methodology
Analyzing 860 Microsoft developer survey responses on AI automation preferences using triangulated thematic analysis
Source survey: "AI Where It Matters: Where, Why, and How Developers Want AI Support in Daily Work" (Choudhuri et al., 2025)
This document describes the full methodology for a multi-model qualitative analysis pipeline that identifies research opportunities (what developers want AI to do) and design constraints (what developers do not want AI to handle) from open-ended survey responses. Three frontier LLMs serve as independent coders, with inter-rater reliability calculated via Krippendorff's alpha and consensus reached through majority vote. A human review gate validates codebooks before systematic coding begins.
Research Questions & Data
Survey Questions (per category)
- Want: Open-ended responses capturing desired capabilities and unmet needs.
- Do not want: Open-ended responses capturing guardrails, no-go zones, and boundary conditions.
Unit of Analysis
Each unit is a single respondent's open-ended answer to one of the two questions within a category. Respondents answered about 2–3 categories each, and a single response may be assigned multiple theme codes.
Task Categories & Response Counts
| Category | Respondents | Tasks Covered |
|---|---|---|
| Development | 816 | Coding, Bug Fixing, Perf Optimization, Refactoring, AI Development |
| Design & Planning | 548 | System Architecture, Requirements Gathering, Project Planning |
| Meta-Work | 532 | Documentation, Communication, Mentoring, Learning, Research |
| Quality & Risk | 401 | Testing & QA, Code Review / PRs, Security & Compliance |
| Infrastructure & Ops | 283 | DevOps / CI-CD, Environment Setup, Monitoring, Customer Support |
Data Quality Note
Approximately 11% of responses contain data quality issues detected during coding: misplaced answers (a "want" answer written in the "not want" field or vice versa), back-references to prior answers that are unintelligible on their own, and terse non-responses. These are flagged with ISSUE_* codes rather than discarded, avoiding pre-filter bias (see Methodological Controls).
Pipeline Overview
The analysis runs in two stages with a human review gate between them. Both the opportunity and constraint tracks follow identical process steps but with track-specific prompts and codebooks.
Stage 1: Theme Discovery & Reconciliation
Phase 1: Independent 3-Model Discovery
Each model receives all responses for a given category and independently proposes 4–15 themes with supporting evidence (PIDs). The prompt instructs models to create specific, actionable, problem-focused themes and to allow multi-coding.
| Model | Provider | Thinking Mode | Role |
|---|---|---|---|
| GPT-5.2 | OpenAI | reasoning_effort="high" | Independent coder & reconciler |
| Gemini 3.1 Pro | Google | thinking_level="HIGH" | Independent coder |
| Claude Opus 4.6 | Anthropic | thinking: adaptive, effort: high | Independent coder |
Inputs
- Open-ended survey responses with PIDs (e.g., 816 Development responses or 548 Design & Planning responses)
- Category name and context description
Outputs (per category, per model)
- Theme codebook: code, name, description, supporting PIDs
- Per-response codings: PID → [theme_code_1, theme_code_2, ...]
- Files: `{category}_themes_{model}.json` (15 opportunity files) and `{category}_constraint_themes_{model}.json` (15 constraint files)
Phase 2: GPT-5.2 Reconciliation
A single reconciliation model (GPT-5.2) receives all three models' theme sets and produces a unified codebook per category by:
- Identifying overlapping themes across models (same concept, different names)
- Merging overlapping themes into single unified entries
- Retaining single-model themes only if substantive (≥3 PIDs)
- Dropping themes that are too vague or have very few supporting responses
- Targeting 5–10 unified themes per category
Each unified theme records its source_models (which of the three models independently proposed it) and source_codes (original model-specific code names), providing full provenance.
Outputs
- `consolidated_codebook.json` — all 5 category codebooks (opportunity track)
- `constraint_codebook.json` — all 5 category codebooks (constraint track)
Human Review Gate Required
The pipeline pauses for researcher review before systematic coding begins. The researcher:
- Reviews each proposed theme and reads sample supporting responses
- Checks themes for specificity, granularity, and completeness
- Can keep, rename, merge, split, or remove any theme
- Can add themes the models missed
- Documents rationale for all changes
Systematic coding (Stage 2) does not proceed until the codebook is explicitly approved.
Stage 2: Systematic Coding & Analysis
Coding Protocol
All three models independently re-code every response against the finalized codebook. Key protocol elements:
| Parameter | Value | Rationale |
|---|---|---|
| Batch size | 20 responses per API call | Balances context window usage against API call count |
| Rationale-first | Model writes rationale before assigning codes | Improves accuracy via chain-of-thought; enables auditability |
| Cross-response context | Each response shown alongside opposite-question answer | Enables misresponse detection (ISSUE codes) |
| Multi-coding | 0, 1, or many themes per response | Captures full semantic content |
| Codebook-only | Only codebook codes or ISSUE_* codes allowed | Prevents code drift across batches |
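The batching and codebook-only rules above reduce to a few lines of code. A minimal sketch (function names are ours, not the pipeline's):

```python
def batched(responses, size=20):
    """Split responses into fixed-size batches (20 per API call, per the protocol)."""
    for i in range(0, len(responses), size):
        yield responses[i:i + size]

def enforce_codebook_only(themes, codebook_codes):
    """Keep only finalized codebook codes or ISSUE_* flags, preventing code drift."""
    return [t for t in themes if t in codebook_codes or t.startswith("ISSUE_")]
```

Filtering (rather than erroring) on out-of-codebook codes is one possible policy; a stricter variant could reject the whole batch and retry.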
ISSUE Code System
During systematic coding, models flag data quality problems rather than silently discarding responses:
| Code | Meaning | Example |
|---|---|---|
| ISSUE_WRONG_FIELD | Respondent answered the opposite question | Describing constraints in the "want" field |
| ISSUE_BACK_REFERENCE | References a prior answer; unintelligible alone | "Same as before", "see above" |
| ISSUE_NON_RESPONSE | Terse non-answer with no analyzable content | "N/A", "none", "no" |
Models may create additional ISSUE_* codes if they encounter other data quality problems. The ISSUE prefix ensures these are never confused with substantive themes.
Inter-Rater Reliability (IRR)
Agreement between the three LLM coders is measured per theme using Krippendorff's alpha (α), the standard multi-rater reliability coefficient for qualitative research. For each theme, a binary (present/absent) coding matrix is built across all responses, and α is calculated at the nominal level.
| Range | Interpretation |
|---|---|
| α ≥ 0.80 | Excellent agreement — publishable |
| α ≥ 0.67 | Acceptable agreement — tentative conclusions |
| α ≥ 0.50 | Moderate agreement — use with caution |
| α < 0.50 | Poor agreement — unreliable for this theme |
Additionally, pairwise Cohen's kappa (κ) is calculated for each model pair (GPT–Gemini, GPT–Opus, Gemini–Opus) and 3-rater percent agreement (all three models assign the same code) is reported per theme.
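For reference, nominal Krippendorff's α over a binary present/absent matrix can be computed from the coincidence matrix as follows. This is a self-contained sketch for complete data (no missing ratings); production runs may rely on a library implementation instead:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha at the nominal level.

    ratings: list of units, each a list of rater values (e.g. 0/1 for
    theme absent/present), with no missing data.
    """
    coincidence = Counter()  # ordered value-pair counts, weighted by 1/(m-1)
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue  # a unit rated by fewer than 2 raters contributes nothing
        for c, k in permutations(unit, 2):
            coincidence[(c, k)] += 1 / (m - 1)
    totals = Counter()       # marginal totals n_c per value
    for (c, _k), w in coincidence.items():
        totals[c] += w
    n = sum(totals.values())
    if n <= 1:
        return 1.0
    d_obs = 1 - sum(coincidence[(c, c)] for c in totals) / n
    d_exp = 1 - sum(v * (v - 1) for v in totals.values()) / (n * (n - 1))
    return 1.0 if d_exp == 0 else 1 - d_obs / d_exp
```

With three raters, each theme contributes one column of 0/1 values per response; α is computed per theme over those columns.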
Consensus Voting
Final theme assignments use a majority vote: 2 of 3 models must agree for a theme to be assigned to a response. This is applied independently per response and per theme code.
ISSUE code handling
If 2+ models flag any ISSUE code for a response (regardless of which specific ISSUE code), the response receives a generic ISSUE marker and is excluded from substantive analysis. This prevents a single aggressive model from filtering out too many responses.
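The voting and ISSUE rules above can be sketched in a few lines (a minimal illustration; the data shapes mirror the Phase 4 codings, but the function is ours):

```python
from collections import Counter

def consensus_vote(codings, min_votes=2):
    """Majority-vote consensus over per-model codings.

    codings: {model: {pid: [theme codes]}}. A theme is assigned when at least
    min_votes models coded it. If min_votes models flag any ISSUE_* code
    (regardless of which one), the response gets a generic ISSUE marker
    and no substantive themes.
    """
    models = list(codings)
    pids = set().union(*(codings[m].keys() for m in models))
    consensus = {}
    for pid in pids:
        per_model = [set(codings[m].get(pid, [])) for m in models]
        issue_flags = sum(any(c.startswith("ISSUE_") for c in codes)
                          for codes in per_model)
        if issue_flags >= min_votes:
            consensus[pid] = ["ISSUE"]  # excluded from substantive analysis
            continue
        votes = Counter(c for codes in per_model for c in codes
                        if not c.startswith("ISSUE_"))
        consensus[pid] = sorted(c for c, n in votes.items() if n >= min_votes)
    return consensus
```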
Rich Opportunity Cards
For the top 5 themes per category (by prevalence), all three models independently generate detailed opportunity cards including:
- Problem statement and proposed capability description
- Required context sources and capability steps
- Impact description with supporting evidence quotes
- Success criteria (qualitative and quantitative measures)
- Constraints and guardrails drawn from the constraint track
- Prevalence data and quantitative signals (AI preference, usage gap)
Cards from the three models are merged using a union-and-deduplicate strategy: longest title wins, context sources are combined (max 7), capability steps use the longest sequence (max 6), and constraints are deduplicated (max 4).
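The union-and-deduplicate strategy can be sketched as follows (field names are illustrative, not the pipeline's schema):

```python
def merge_cards(cards):
    """Merge one theme's per-model opportunity cards.

    Longest title wins; context sources are the deduplicated union (max 7);
    capability steps take the longest sequence (max 6); constraints are
    deduplicated (max 4).
    """
    def dedup(items, cap):
        seen = []
        for x in items:
            if x not in seen:
                seen.append(x)
        return seen[:cap]

    return {
        "title": max((c["title"] for c in cards), key=len),
        "context_sources": dedup(
            (s for c in cards for s in c.get("context_sources", [])), 7),
        "capability_steps": max(
            (c.get("capability_steps", []) for c in cards), key=len)[:6],
        "constraints": dedup(
            (g for c in cards for g in c.get("constraints", [])), 4),
    }
```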
Constraint Maps & Design Principles
Constraint-track prevalence is calculated identically to the opportunity track. The top no-go zones per category are documented with:
- Zone name, description, and prevalence count
- Up to 10 supporting respondent quotes
- 3–6 synthesized design principles per category (generated by GPT-5.2)
- Each principle includes implementation guidance and derivation provenance
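Prevalence for either track reduces to counting consensus assignments per theme. A sketch matching the Phase 6 prevalence fields (`code`, `count`, `percentage`, `pids`):

```python
def theme_prevalence(consensus_codings, total_responses):
    """Compute per-theme prevalence from majority-vote consensus codings."""
    by_theme = {}
    for pid, themes in consensus_codings.items():
        for code in themes:
            by_theme.setdefault(code, []).append(pid)
    return sorted(
        ({"code": code,
          "count": len(pids),
          "percentage": round(100 * len(pids) / total_responses, 1),
          "pids": sorted(pids)}
         for code, pids in by_theme.items()),
        key=lambda row: -row["count"],
    )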
Methodological Controls
The pipeline incorporates several controls designed to increase rigor beyond what a single-model analysis can provide.
| Control | Mechanism | What It Mitigates |
|---|---|---|
| Multi-LLM triangulation | 3 frontier models from different families code independently | Single-model bias, training-data artifacts, idiosyncratic interpretations |
| Rationale-first coding | Models write reasoning before assigning codes | Snap-judgment errors; enables post-hoc audit of coding decisions |
| Cross-response context | Both "want" and "not want" answers shown to coder | Misresponse blindness; enables ISSUE_WRONG_FIELD detection |
| ISSUE code system | Flag quality problems in-band rather than pre-filtering | Pre-filter bias from silently dropping ambiguous responses |
| Idempotent checkpointing | Staleness detection skips phases whose inputs haven't changed | Wasted computation; ensures reproducible reruns |
| Consensus merging | Majority vote (2/3) for codes; union-and-deduplicate for synthesis | Noise from single-model outlier codes; incomplete synthesis from any single model |
Design Decisions & Trade-offs
| Decision | Rationale | Trade-off |
|---|---|---|
| 3 models, not 2 or 5 | Minimum for meaningful IRR (Krippendorff's α); covers 3 major LLM families | Higher API cost (∼3×); manageable with batch parallelism |
| HIGH thinking for all models | Qualitative coding benefits from extended reasoning; reduces surface-level pattern matching | Slower inference, higher token cost (thinking tokens billed at output rate) |
| Batch size of 20 | Enough responses for cross-response pattern recognition; fits comfortably in context windows | More API calls than larger batches; but avoids context truncation risks |
| Majority vote (2/3) | Balances sensitivity and specificity; equivalent to >50% agreement threshold | May miss themes where only one model sees a valid pattern |
| Human gate before coding | Prevents systematic errors from propagating through the entire coding phase | Introduces a manual pause in an otherwise automated pipeline |
| No pre-filtering of responses | ISSUE codes capture quality problems without discarding data points | Models must process noisy responses; ISSUE detection is itself imperfect |
| GPT-5.2 as sole reconciler | Reconciliation requires structured comparison rather than independent generation; one model suffices | Reconciliation may inherit GPT-specific biases in theme naming |
| Streaming for Claude Opus | Avoids 10-minute HTTP timeout on long-running inference | More complex error handling; no retry on partial stream failures |
Limitations & Mitigations
| Limitation | Impact | Mitigation |
|---|---|---|
| LLM nondeterminism | Exact codings may vary across runs even with identical inputs | 3-model triangulation smooths out individual variance; IRR quantifies remaining disagreement; idempotent checkpointing ensures reproducible runs when inputs are stable |
| LLM rationalization | Models may construct plausible but incorrect rationales | Multi-model disagreement surfaces cases where rationalization diverges; majority vote filters single-model confabulations |
| Prompt sensitivity | Different prompt wording could yield different themes | Codebook-anchored coding constrains coder freedom; prompts are documented and versioned for replication |
| Not replacing human qualitative research | LLM coders lack lived experience; may miss cultural nuances | Human review gate validates codebook; methodology is positioned as accelerating qualitative work, not replacing it; all outputs include supporting quotes for human verification |
| Survey sample | 860 Microsoft developers may not represent the broader industry | Out of scope for the analysis methodology itself; noted as a limitation of the source data |
| LLM knowledge contamination | Models may have been trained on similar survey analyses | Codebook-first design constrains output to researcher-approved themes; verbatim quotes provide verifiable evidence independent of model knowledge |
Artifacts & Replication
Artifact Inventory
| Phase | File Pattern | Count | Description |
|---|---|---|---|
| Data | {category}_responses.json | 5 | Extracted open-ended responses with PIDs |
| Data | {category}_quantitative.json | 5 | Aggregated Likert scale metrics per task |
| Data | {category}_do_not_want_responses.json | 5 | Extracted constraint responses with PIDs |
| Stage 1 | {category}_themes_{model}.json | 15 | Independent opportunity theme discoveries |
| Stage 1 | {category}_constraint_themes_{model}.json | 15 | Independent constraint theme discoveries |
| Stage 1 | consolidated_codebook.json | 1 | Unified opportunity codebook (all categories) |
| Stage 1 | constraint_codebook.json | 1 | Unified constraint codebook (all categories) |
| Stage 2 | {category}_phase4_codings.json | 5 | 3-model systematic codings with rationales |
| Stage 2 | phase5_irr_results.json | 1 | Krippendorff's α, Cohen's κ, agreement % |
| Stage 2 | phase6_prevalence_results.json | 1 | Majority-vote consensus and theme prevalence |
| Stage 2 | phase6_rich_opportunities.json | 1 | Top-5 opportunity cards per category (3-model synthesis) |
| Stage 2 | constraint_maps.json | 1 | No-go zones and design principles |
Dependency Chain
Staleness Detection
Every pipeline phase checks whether its output is stale relative to its inputs by comparing file modification times. If all inputs are older than the output, the phase is skipped. If any input is newer, the output is regenerated. This enables:
- Incremental reruns: updating one category's theme discovery only regenerates downstream outputs for that category
- Safe restarts: if the pipeline crashes mid-phase, only the incomplete phase reruns
- Force override: the `--force` flag bypasses staleness checks for full regeneration
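The staleness rule is a plain mtime comparison. A minimal sketch (names are ours):

```python
from pathlib import Path

def is_stale(output: Path, inputs: list[Path], force: bool = False) -> bool:
    """Rerun a phase if forced, if its output is missing, or if any input is newer."""
    if force or not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(p.stat().st_mtime > out_mtime for p in inputs)
```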
How to Rerun
- Ensure API keys are set in
.envfor OpenAI, Google, and Anthropic - Install dependencies:
uv sync - Run full pipeline:
bash run_full_pipeline.sh - Pipeline pauses after Stage 1 for human codebook review
- After approval, Stage 2 runs automatically
- To force regeneration:
bash run_full_pipeline.sh --force - To rerun a single category:
uv run phase4_systematic_coding.py design_planning
Appendix
Opportunity Codebook (All 5 Categories, 48 Themes)
Unified codebook produced by GPT-5.2 reconciliation of themes independently discovered by all three models. Each theme lists which models independently identified it.
Development 10 themes
| Code | Theme | Models |
|---|---|---|
| refactoring_modernization | Automated Refactoring, Modernization & Tech-Debt Reduction | GPT, Gemini, Opus |
| boilerplate_scaffolding_feature_codegen | Boilerplate, Scaffolding & Routine Feature Code Generation | GPT, Gemini, Opus |
| automated_testing_validation | Automated Test Generation, Coverage & Change Validation | GPT, Gemini, Opus |
| debugging_root_cause_fixing | Debugging, Root Cause Analysis & Bug Fix Assistance | GPT, Gemini, Opus |
| repo_wide_context_dependency_awareness | Repo-Wide Context, Dependency Awareness & Safe Multi-File Changes | GPT, Gemini, Opus |
| code_quality_review_security_compliance | Code Quality, Review Automation, Standards & Security/Compliance Guidance | GPT, Gemini, Opus |
| performance_profiling_optimization | Performance Profiling & Optimization Suggestions | GPT, Gemini, Opus |
| architecture_design_planning_support | Architecture, Design Brainstorming & Planning Support | GPT, Gemini, Opus |
| devops_ci_cd_iac_workflow_automation | DevOps, CI/CD, IaC & Engineering Workflow Automation | GPT, Gemini, Opus |
| documentation_knowledge_retrieval_onboarding | Documentation Generation, Knowledge Retrieval & Onboarding/Learning Support | GPT, Gemini, Opus |
Design & Planning 10 themes
| Code | Theme | Models |
|---|---|---|
| requirements_gathering_synthesis | Requirements Gathering, Synthesis & Clarification | GPT, Gemini, Opus |
| architecture_design_generation | Architecture & System Design Generation/Iteration | GPT, Gemini, Opus |
| interactive_brainstorming_design_partner | Interactive Brainstorming & Design Copilot | GPT, Gemini, Opus |
| tradeoff_decision_support_simulation | Trade-off Analysis, What-if Simulation & Decision Support | GPT, Gemini, Opus |
| design_validation_risk_edge_cases | Design Validation, Risk Assessment & Edge-Case Discovery | GPT, Gemini, Opus |
| project_planning_tasking_status_automation | Project Planning, Ticket/Task Breakdown & Status Automation | GPT, Gemini, Opus |
| documentation_spec_diagram_generation | Documentation, Specs & Diagram/Artifact Generation | GPT, Gemini, Opus |
| context_retrieval_codebase_and_institutional_memory | Context Retrieval: Codebase Understanding & Institutional Memory | GPT, Gemini, Opus |
| research_and_information_synthesis | Research, Information Gathering & Knowledge Synthesis | GPT, Gemini, Opus |
| trustworthy_outputs_with_citations | Trustworthy Outputs: Higher Accuracy & Verifiable Citations | GPT, Gemini, Opus |
Quality & Risk 9 themes
| Code | Theme | Models |
|---|---|---|
| automated_test_generation_and_quality_gates | Automated Test Generation, Maintenance & Quality Gates | GPT, Gemini, Opus |
| intelligent_pr_code_review | Intelligent PR/Code Review Assistant | GPT, Gemini, Opus |
| security_vulnerability_detection_and_fix_guidance | Security Vulnerability Detection & Fix Guidance | GPT, Gemini, Opus |
| compliance_and_audit_automation | Compliance, Standards & Audit Process Automation | GPT, Gemini, Opus |
| proactive_risk_monitoring_and_prediction | Proactive Risk Monitoring, Prediction & Anomaly Detection | GPT, Gemini, Opus |
| debugging_root_cause_and_failure_triage | Debugging, Root Cause Analysis & Failure Triage | GPT, Gemini, Opus |
| knowledge_retrieval_and_standards_guidance | Knowledge Retrieval, Summarization & Standards Guidance | GPT, Gemini, Opus |
| agentic_workflow_automation_and_remediation | Agentic Workflow Automation & Automated Remediation | GPT, Gemini, Opus |
| ai_driven_exploratory_chaos_and_fuzz_testing | AI-Driven Exploratory, Chaos & Fuzz Testing | Opus only |
Infrastructure & Ops 10 themes
| Code | Theme | Models |
|---|---|---|
| intelligent_monitoring_alerting_anomaly_detection | Intelligent Monitoring, Alerting & Anomaly Detection | GPT, Gemini, Opus |
| incident_response_rca_mitigation_self_heal | Incident Response Automation (Triage, RCA, Mitigation, Self-Heal) | GPT, Gemini, Opus |
| cicd_pipeline_and_deployment_automation | CI/CD Pipeline & Deployment Automation | GPT, Gemini, Opus |
| infrastructure_provisioning_and_iac_generation | Automated Environment Setup & IaC Generation | GPT, Gemini, Opus |
| infrastructure_maintenance_upgrades_security_cost_optimization | Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization | GPT, Gemini, Opus |
| customer_support_triage_and_autoresponse | Customer Support Triage & Auto-Response | GPT, Gemini, Opus |
| knowledge_management_doc_search_and_system_context | Knowledge Management, Documentation Search & System Context | GPT, Gemini, Opus |
| ops_toil_automation_and_script_generation | Ops Toil Automation & Script Writing/Debugging | GPT, Gemini, Opus |
| testing_quality_validation_and_safe_deploy | Testing, Quality Validation & Safer Releases | GPT, Gemini, Opus |
| ai_tooling_ux_accuracy_and_cohesive_workflows | Better AI Tooling UX (Accuracy, Control & Cohesive Workflows) | GPT, Gemini, Opus |
Meta-Work 9 themes
| Code | Theme | Models |
|---|---|---|
| automated_documentation | Automated Documentation Generation & Maintenance | GPT, Gemini, Opus |
| knowledge_search_and_discovery | Project Knowledge Search & Discovery (with Traceable Sources) | GPT, Gemini, Opus |
| brainstorming_and_solution_exploration | Brainstorming, Option Generation & Rapid Exploration | GPT, Gemini, Opus |
| personalized_learning_and_upskilling | Personalized Learning for New Technologies | GPT, Gemini, Opus |
| team_onboarding_and_mentoring | Team Onboarding, Mentoring & Institutional Knowledge Transfer | GPT, Gemini, Opus |
| stakeholder_communication_support | Stakeholder/Client Communication Drafting & Translation | GPT, Gemini, Opus |
| meeting_assistance | Meeting Scheduling, Notes, Summaries & Action Items | GPT, Gemini, Opus |
| planning_prioritization_and_status_tracking | Planning, Prioritization, Blocker Detection & Status Reporting | GPT, Gemini, Opus |
| proactive_personal_agent_and_admin_automation | Proactive Personal Agent & Routine Admin Automation | GPT, Gemini, Opus |
Constraint Codebook (All 5 Categories, 50 Themes)
Unified constraint codebook produced by GPT-5.2 reconciliation. Captures what developers do not want AI to handle.
Development 10 themes
| Code | Theme | Models |
|---|---|---|
| no_autonomous_architecture_system_design | No Autonomous Architecture or System Design Decisions | GPT, Gemini, Opus |
| no_large_unscoped_refactors | No Large, Unscoped, or Sweeping Codebase Changes | GPT, Gemini, Opus |
| no_autonomous_execution_merge_deploy_or_agentic_control | No Autonomous Execution, Merging/Deploying, or Agentic Control | GPT, Gemini, Opus |
| no_complex_debugging_or_critical_bug_fixes | No AI Ownership of Complex Debugging or Critical Bug Fixes | GPT, Gemini, Opus |
| no_security_privacy_secrets_handling | No Security/Privacy-Sensitive Work or Secrets Handling | GPT, Gemini, Opus |
| no_autonomous_performance_optimization | No Autonomous Performance Optimization | GPT, Gemini, Opus |
| no_ai_deciding_requirements_business_logic_or_api_ux | No AI-Led Requirements, Core Business Logic, or API/UX Decisions | GPT, Gemini |
| preserve_developer_agency_learning_and_job_ownership | Preserve Developer Agency, Learning, and Ownership | GPT, Gemini, Opus |
| avoid_ai_when_unreliable_contextless_hard_to_verify_or_intrusive | Avoid AI Output That Is Unreliable, Contextless, Hard to Verify, or Intrusive | GPT, Gemini, Opus |
| no_constraints_open_to_ai_help | No Specific No-Go Zones (Open to AI Help) | GPT, Gemini |
Design & Planning 10 themes
| Code | Theme | Models |
|---|---|---|
| human_accountability_final_decisions | No AI Final Decision-Making (Human Accountability Required) | GPT, Gemini, Opus |
| human_led_architecture_design | No AI as Primary System Architect / High-Level Designer | GPT, Gemini, Opus |
| no_ai_project_management_task_assignment | No AI Running Project Management | GPT, Gemini, Opus |
| no_ai_requirements_stakeholder_elicitation | No AI-Led Requirements Gathering or Stakeholder Alignment | GPT, Gemini, Opus |
| no_ai_empathy_team_dynamics | No Replacement of Human Empathy, Collaboration, or Interpersonal Dynamics | GPT, Gemini, Opus |
| ai_assistant_human_in_loop | No Autopilot: AI Should Assist with Human-in-the-Loop Oversight | GPT, Gemini, Opus |
| trust_accuracy_and_context_limitations | Avoid AI for High-Stakes Work Due to Reliability & Missing Context | GPT, Gemini, Opus |
| privacy_confidentiality_ip_and_message_control | No AI Handling Sensitive/Confidential Data or Uncontrolled Messaging | GPT, Gemini, Opus |
| no_ai_vision_strategy_creativity_taste | No AI Owning Product Vision, Strategy, or Creative Judgments | GPT, Gemini |
| no_constraints_or_unsure | No Constraints Stated / Welcome Full AI Involvement | GPT, Gemini, Opus |
Quality & Risk 10 themes
| Code | Theme | Models |
|---|---|---|
| human_final_decision_and_accountability | Humans Must Make Final High-Stakes Decisions | GPT, Gemini, Opus |
| no_autonomous_code_or_production_actions | No Autonomous Code/Repo/Production Actions Without Approval | GPT, Gemini, Opus |
| human_code_review_gate_required | Human Code Review / PR Approval Must Remain the Gate | GPT, Gemini, Opus |
| security_and_compliance_must_be_human_led | Security, Compliance, and Threat Modeling Must Be Human-Led | GPT, Gemini, Opus |
| no_sensitive_data_or_credentials_access | Do Not Give AI Access to Sensitive/Customer Data or Credentials | GPT, Gemini, Opus |
| ai_outputs_must_be_verifiable_and_not_self_validated | AI Must Be Reliable, Verifiable, and Not Self-Validated | GPT, Gemini, Opus |
| humans_own_requirements_architecture_and_tradeoffs | Humans Must Own Requirements, Architecture, and Trade-Offs | GPT, Gemini, Opus |
| human_led_test_strategy_intent_and_signoff | Test Strategy and Sign-Off Must Be Human-Led | GPT only |
| preserve_human_ethics_empathy_and_human_centric_work | Preserve Human Ethics, Empathy, and Human-Centric Work | GPT, Gemini |
| no_constraints_stated | No Specific No-Go Areas Stated | GPT, Opus |
Infrastructure & Ops 10 themes
| Code | Theme | Models |
|---|---|---|
| no_direct_customer_interaction | No Direct AI-to-Customer Interaction | GPT, Gemini, Opus |
| no_autonomous_production_changes | No Autonomous Production Deployments or Changes | GPT, Gemini, Opus |
| human_approval_before_consequential_actions | Human Approval Required Before Consequential Actions | GPT, Opus |
| no_security_permissions_secrets_management | No AI Management of Security, Access, Permissions, or Secrets | GPT, Gemini, Opus |
| no_autonomous_incident_response_or_overrides | No Autonomous Incident Response or Critical Overrides | GPT, Gemini, Opus |
| avoid_ai_for_high_precision_deterministic_work | Avoid AI for High-Precision/Deterministic Work | GPT, Gemini, Opus |
| no_full_autonomy_for_environment_setup_maintenance | No Full Autonomy for Environment Setup and Maintenance | GPT, Gemini |
| preserve_human_learning_and_accountability | Preserve Human Learning, System Understanding, and Accountability | GPT, Gemini, Opus |
| no_ai_initiated_irreversible_or_destructive_data_actions | No AI-Initiated Irreversible/Destructive Data Operations | GPT, Gemini, Opus |
| no_constraints_expressed_or_pro_automation | No Constraints Expressed / Comfortable with Broad Automation | GPT, Gemini, Opus |
Meta-Work 10 themes
| Code | Theme | Models |
|---|---|---|
| human_led_mentoring_onboarding | Keep mentoring and onboarding human-led | GPT, Gemini, Opus |
| human_authored_communication | Keep interpersonal communications human-authored | GPT, Gemini, Opus |
| human_review_required_before_sending_or_publishing | No autonomous sending/publishing without human review | GPT, Gemini, Opus |
| no_confidential_or_sensitive_data | Keep AI away from confidential or sensitive information | GPT, Gemini, Opus |
| preserve_hands_on_learning | Don't outsource learning and skills development to AI | GPT, Gemini, Opus |
| preserve_human_research_and_ideation | Keep research/brainstorming primarily human | GPT, Gemini, Opus |
| human_accountability_for_high_stakes_decisions | High-stakes decisions must remain human-led | GPT, Gemini, Opus |
| avoid_unvetted_documentation | AI-generated documentation must be vetted | GPT, Gemini, Opus |
| ai_outputs_not_trustworthy_as_primary_source | Don't treat AI output as trustworthy/authoritative | Opus only |
| no_constraints_or_unsure | No constraints stated / unsure | GPT, Gemini, Opus |
Coding Prompt: Opportunity Track (Phase 4)
You are a qualitative research coder. Your task is to systematically code
each "WANT" response using ONLY the themes from the provided codebook.
Each response is shown alongside the same respondent's answer to a related
question about what they do NOT want AI to handle, for additional context.
CODEBOOK:
{codebook themes listed here}
ISSUE CODES (assign when a response has data quality issues):
- ISSUE_WRONG_FIELD: The respondent appears to have answered the other question
- ISSUE_BACK_REFERENCE: Response references a prior answer and is unintelligible
on its own
- ISSUE_NON_RESPONSE: Terse non-answer with no analyzable content
- You may create other ISSUE_* codes if you encounter a different type of data
quality problem
INSTRUCTIONS:
1. Read each response carefully
2. For each response, write a brief rationale
3. Then assign ALL applicable theme codes from the codebook
4. A response can have 0, 1, or multiple themes
5. Only use codes from the codebook or ISSUE codes
6. If no themes apply, return an empty array
RESPONSES TO CODE:
[Batches of 20 responses, each with context from opposite question]
OUTPUT FORMAT:
[
{"pid": 8, "rationale": "...", "themes": ["theme_code_1", "theme_code_2"]},
{"pid": 11, "rationale": "...", "themes": ["ISSUE_BACK_REFERENCE"]}
]
Return ONLY the JSON array, no other text.
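Because the prompt demands a bare JSON array, each returned batch can be parsed and checked against the output contract before acceptance. A sketch (the validation rules here are ours, inferred from the prompt):

```python
import json

def parse_coding_batch(raw, expected_pids, allowed_codes):
    """Parse one batch of coder output and enforce the output contract."""
    rows = json.loads(raw)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array")
    for row in rows:
        if not {"pid", "rationale", "themes"} <= set(row):
            raise ValueError(f"missing keys in {row}")
        if row["pid"] not in expected_pids:
            raise ValueError(f"unexpected pid {row['pid']}")
        bad = [t for t in row["themes"]
               if t not in allowed_codes and not t.startswith("ISSUE_")]
        if bad:
            raise ValueError(f"codes outside codebook: {bad}")
    return rows
```

A failed check can trigger a retry of that batch rather than silently accepting malformed codings.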
Coding Prompt: Constraint Track
Identical structure to the opportunity track prompt, but with:
- Constraint codebook themes replacing opportunity themes
- "NOT WANT" responses as the primary coding target
- "WANT" responses shown as cross-response context
- Same ISSUE code system applies
Theme Discovery Prompt Template
You are analyzing open-ended survey responses from software developers about
where they want AI assistance in their work. Your task is to identify themes
in these responses.
Guidelines for theme creation:
- Themes should be SPECIFIC and ACTIONABLE
(e.g., "Automated test generation for edge cases" not just "Testing")
- Themes should be PROBLEM-FOCUSED (describe the pain point, not a solution)
- A response can belong to MULTIPLE themes
- Aim for 4-15 themes that capture the major patterns
For each response, return:
{
"pid": <participant ID>,
"themes": ["theme_code_1", "theme_code_2", ...]
}
Also provide a theme codebook:
{
"themes": [
{
"code": "snake_case_theme_code",
"name": "Human-Readable Theme Name",
"description": "What this capability means and why developers want it",
"pids": [list of PIDs expressing this]
}
],
"codings": [
{"pid": 123, "themes": ["theme_code_1", "theme_code_2"]}
]
}
RESPONSES TO ANALYZE:
[All responses for the category]
Theme Reconciliation Prompt Template
You are a qualitative research analyst performing theme reconciliation.
CONTEXT: Three independent AI models analyzed survey responses from the
"[category]" category. Each identified opportunity themes. Your job is to
reconcile these into a unified codebook.
--- GPT-5.2 THEMES ---
[All GPT themes with names, descriptions, PIDs]
--- GEMINI THEMES ---
[All Gemini themes]
--- OPUS THEMES ---
[All Opus themes]
TASK: Create a unified codebook by:
1. Identifying themes that overlap across models (same concept, different names)
2. Merging overlapping themes into single unified themes
3. Keeping single-model themes IF substantive (≥3 PIDs)
4. Dropping themes that are too vague or have very few supporting responses
5. Aim for 5-10 unified themes per category
For each unified theme, provide:
- code: snake_case identifier
- name: Human-readable name
- description: Clear description of the desired capability
- source_models: which models identified it (["gpt", "gemini", "opus"])
- source_codes: the original codes from each model
Return ONLY valid JSON, no other text.
ISSUE Code Taxonomy
| Code | Definition | Detection Signal | Consensus Rule |
|---|---|---|---|
| ISSUE_WRONG_FIELD | Respondent answered the opposite question (e.g., wrote constraints in the "want" field) | Cross-response context reveals contradictory intent | 2+ models flag any ISSUE_* → generic ISSUE marker applied |
| ISSUE_BACK_REFERENCE | Response references a prior answer ("same as before", "see above") and is unintelligible alone | Short response with deictic language | |
| ISSUE_NON_RESPONSE | Terse reply with no analyzable content | "N/A", "none", "no", single punctuation | |
| ISSUE_* (custom) | Models may create additional issue codes for novel quality problems | Varies | Same 2/3 majority rule; prefix matching ensures grouping |
Output JSON Schemas
Theme Discovery Output
{
"model": "gpt-5.2",
"category": "design_planning",
"category_name": "Design & Planning",
"response_count": 223,
"timestamp": "ISO-8601",
"themes": [
{
"code": "string",
"name": "string",
"description": "string",
"pids": [integer]
}
],
"codings": [
{ "pid": integer, "themes": ["string"] }
]
}
Consolidated Codebook
{
"metadata": {
"phase": "Opportunity Theme Reconciliation",
"timestamp": "ISO-8601",
"reconciliation_model": "gpt-5.2",
"discovery_models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
},
"categories": {
"category_key": {
"category": "string",
"category_name": "string",
"theme_count": integer,
"models_reconciled": ["gpt", "gemini", "opus"],
"themes": [
{
"code": "string",
"name": "string",
"description": "string",
"source_models": ["gpt", "gemini", "opus"],
"source_codes": {
"gpt": ["string"],
"gemini": ["string"],
"opus": ["string"]
}
}
]
}
}
}
Systematic Codings (Phase 4)
{
"category": "string",
"phase": "Phase 4 - Systematic Coding",
"timestamp": "ISO-8601",
"models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
"codebook": [ { "code": "string", "name": "string", "description": "string" } ],
"response_count": integer,
"codings": {
"gpt": [ { "pid": integer, "rationale": "string", "themes": ["string"] } ],
"gemini": [ ... ],
"opus": [ ... ]
},
"cost": { ... }
}
IRR Results (Phase 5)
{
"phase": "Phase 5 - Inter-Rater Reliability",
"methodology": {
"raters": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
"metrics": ["Krippendorff's Alpha", "Cohen's Kappa (pairwise)", "Percent Agreement"]
},
"overall_statistics": {
"mean_krippendorff_alpha": float,
"mean_percent_agreement": float,
"interpretation": "string"
},
"category_results": {
"category_key": {
"krippendorff_alpha": { "theme_code": float },
"percent_agreement": { "theme_code": float },
"pairwise_kappa": {
"gpt_vs_gemini": { "theme_code": float },
"gpt_vs_opus": { "theme_code": float },
"gemini_vs_opus": { "theme_code": float }
},
"code_frequencies": { "gpt": {}, "gemini": {}, "opus": {} }
}
}
}
Prevalence Results (Phase 6)
{
"methodology": {
"consensus_method": "majority_vote",
"threshold": "2+ of 3 models must agree"
},
"category_results": {
"category_key": {
"theme_prevalence": [
{
"code": "string",
"count": integer,
"percentage": float,
"pids": [integer]
}
],
"consensus_codings": { "pid": ["theme1", "theme2"] }
}
}
}
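The majority-vote consensus and per-theme prevalence behind this schema can be sketched as below. The helper `consensus_codings` and its input shape are hypothetical, but its output mirrors the `theme_prevalence` and `consensus_codings` fields above.

```python
from collections import Counter

def consensus_codings(codings_by_model, n_responses, threshold=2):
    """Majority-vote consensus: a theme is assigned to a response when at
    least `threshold` of the models coded it (2 of 3 by default).

    codings_by_model: {model: {pid: set_of_theme_codes}}
    Returns (consensus, prevalence) matching the Phase 6 schema fields.
    """
    votes = Counter()
    for codings in codings_by_model.values():
        for pid, themes in codings.items():
            for theme in themes:
                votes[(pid, theme)] += 1

    # Consensus codings: themes that cleared the vote threshold, per response
    consensus = {}
    for (pid, theme), n in votes.items():
        if n >= threshold:
            consensus.setdefault(pid, []).append(theme)

    # Prevalence: how many responses carry each consensus theme
    theme_pids = {}
    for pid, themes in consensus.items():
        for theme in themes:
            theme_pids.setdefault(theme, []).append(pid)
    prevalence = [
        {
            "code": theme,
            "count": len(pids),
            "percentage": round(100 * len(pids) / n_responses, 1),
            "pids": sorted(pids),
        }
        for theme, pids in sorted(theme_pids.items())
    ]
    return consensus, prevalence
```

Note that `percentage` is computed against all responses in the category, not only those that received a consensus theme.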
Rich Opportunity Card
{
"rank": integer,
"theme_code": "string",
"category": "string",
"title": "string",
"problem_statement": "string",
"proposed_capability": {
"summary": "string",
"context_sources_needed": ["string"],
"capability_steps": ["string"]
},
"impact": {
"description": "string",
"evidence_quotes": [ { "pid": integer, "quote": "string" } ]
},
"success_definition": {
"qualitative_measures": ["string"],
"quantitative_measures": ["string"]
},
"constraints_and_guardrails": [
{
"constraint": "string",
"supporting_quote": { "pid": integer, "quote": "string" }
}
],
"who_it_affects": {
"prevalence_count": integer,
"prevalence_percentage": float,
"description": "string",
"signals": ["string"]
},
"models_consulted": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
}
IRR Interpretation Guide
Why Krippendorff's Alpha?
- Designed for multi-rater reliability (3+ raters)
- Handles missing data gracefully (if one model fails on a batch)
- Supports nominal-level measurement (categorical theme codes)
- Does not assume a fixed rater set
- More conservative than simple percent agreement, adjusting for chance
How It's Calculated
For each theme code, a binary matrix is constructed (one row per model, one column per response) and passed to the krippendorff package:
import numpy as np
import krippendorff

#            PID_1  PID_2  PID_3  PID_4
matrix = np.array([
    [1, 0, 1, 0],  # GPT
    [1, 0, 1, 0],  # Gemini
    [1, 0, 0, 0],  # Opus
])

alpha = krippendorff.alpha(
    reliability_data=matrix,
    level_of_measurement="nominal"
)
Reporting
- Per-theme α values identify which themes models agree/disagree on
- Themes with α < 0.67 are flagged for potential human adjudication
- Overall mean α provides a summary reliability score
- Pairwise κ identifies whether specific model pairs diverge
- Code frequency counts reveal systematic over/under-coding by individual models
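For the pairwise κ values, Cohen's kappa can be computed directly from two models' binary coding vectors for a theme. A minimal sketch (the helper `pairwise_kappa` is illustrative, not the pipeline's actual code):

```python
def pairwise_kappa(a, b):
    """Cohen's kappa for two binary coding vectors (one entry per response).

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each coder's base rates.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_a1 = sum(a) / n  # coder A's rate of applying the theme
    p_b1 = sum(b) / n  # coder B's rate of applying the theme
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)
```

Comparing the three pairwise values per theme (gpt_vs_gemini, gpt_vs_opus, gemini_vs_opus) shows whether low α is driven by one outlier model or by general disagreement.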
What IRR Tells Us (and Doesn't)
High α means the three models consistently apply the same theme to the same responses—the codebook is operationally clear and the models "understand" it similarly. Low α on a specific theme may indicate the theme definition is ambiguous, the theme requires human judgment the models handle differently, or the theme captures a rare pattern where base-rate effects inflate disagreement.
IRR does not tell us whether the codes are correct—only that the coders agree. This is why the human review gate exists: to ensure the codebook itself captures meaningful, well-defined themes before reliability is measured.
Model Configuration & Cost Tracking
Model Parameters
| Model | Thinking Config | Temperature | Streaming |
|---|---|---|---|
| GPT-5.2 | reasoning_effort="high" | 1 | No |
| Gemini 3.1 Pro | ThinkingConfig(thinking_level="HIGH") | Default | No |
| Claude Opus 4.6 | thinking: adaptive, effort: high | Default | Yes (timeout avoidance) |
Token Pricing (per 1M tokens)
| Model | Input | Output | Notes |
|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | Thinking tokens billed at output rate |
| Gemini 3.1 Pro | $2.00 | $12.00 | Thinking tokens billed at output rate |
| Claude Opus 4.6 | $5.00 | $25.00 | Thinking tokens billed at output rate |
The CostTracker class in llm.py tracks input, output, and thinking tokens separately for each API call and prints phase-level cost summaries to the console.
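The actual CostTracker implementation is not reproduced here; a minimal sketch consistent with the pricing table above (per-1M-token rates, thinking tokens billed at the output rate) might look like:

```python
# Per-1M-token rates from the pricing table; thinking tokens bill at the output rate.
PRICING = {
    "gpt-5.2":                {"input": 1.75, "output": 14.00},
    "gemini-3.1-pro-preview": {"input": 2.00, "output": 12.00},
    "claude-opus-4-6":        {"input": 5.00, "output": 25.00},
}

class CostTracker:
    def __init__(self):
        self.calls = []

    def record(self, model, input_tokens, output_tokens, thinking_tokens=0):
        """Record one API call and return its cost in dollars."""
        rates = PRICING[model]
        cost = (
            input_tokens * rates["input"]
            + (output_tokens + thinking_tokens) * rates["output"]
        ) / 1_000_000
        self.calls.append({"model": model, "cost": cost})
        return cost

    def total(self):
        """Running total across all recorded calls (e.g., for a phase summary)."""
        return sum(call["cost"] for call in self.calls)
```

For example, a GPT-5.2 call with 1M input, 100K output, and 50K thinking tokens costs (1,000,000 × $1.75 + 150,000 × $14.00) / 1,000,000 = $3.85.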