Multi-LLM Qualitative Analysis Methodology
Analyzing 860 Microsoft developer survey responses on AI automation preferences using triangulated thematic analysis
Source survey: "AI Where It Matters: Where, Why, and How Developers Want AI Support in Daily Work" (Choudhuri et al., 2025)
This document describes the full methodology for a multi-model qualitative analysis pipeline that identifies research opportunities (what developers want AI to do) and design constraints (what developers do not want AI to handle) from open-ended survey responses. Three frontier LLMs serve as independent coders, with inter-rater reliability calculated via Krippendorff's alpha and consensus reached through majority vote. A human review gate validates codebooks before systematic coding begins.
Research Questions & Data
Survey Questions (per category)
- Want: Open-ended responses capturing desired capabilities and unmet needs.
- Do not want: Open-ended responses capturing guardrails, no-go zones, and boundary conditions.
Unit of Analysis
Each unit is a single respondent's open-ended answer to one of the two questions within a category. Respondents answered about 2–3 categories each, and a single response may be assigned multiple theme codes.
Task Categories & Response Counts
| Category | Respondents | Tasks Covered |
|---|---|---|
| Development | 816 | Coding, Bug Fixing, Perf Optimization, Refactoring, AI Development |
| Design & Planning | 548 | System Architecture, Requirements Gathering, Project Planning |
| Meta-Work | 532 | Documentation, Communication, Mentoring, Learning, Research |
| Quality & Risk | 401 | Testing & QA, Code Review / PRs, Security & Compliance |
| Infrastructure & Ops | 283 | DevOps / CI-CD, Environment Setup, Monitoring, Customer Support |
Data Quality Note
Approximately 11% of responses contain data quality issues detected during coding: misplaced answers (a "want" answer written in the "not want" field or vice versa), back-references to prior answers that are unintelligible on their own, and terse non-responses. These are flagged with ISSUE_* codes rather than discarded, avoiding pre-filter bias (see Methodological Controls).
Pipeline Overview
The analysis runs in two stages with a human review gate between them. Both the opportunity and constraint tracks follow identical process steps but with track-specific prompts and codebooks.
Stage 1: Theme Discovery & Reconciliation
Phase 1: Independent 3-Model Discovery
Each model receives all responses for a given category and independently proposes 4–15 themes with supporting evidence (PIDs). The prompt instructs models to create specific, actionable, problem-focused themes and to allow multi-coding.
| Model | Provider | Thinking Mode | Role |
|---|---|---|---|
| GPT-5.2 | OpenAI | reasoning_effort="high" | Independent coder & reconciler |
| Gemini 3.1 Pro | Google | thinking_level="HIGH" | Independent coder |
| Claude Opus 4.6 | Anthropic | thinking: adaptive, effort: high | Independent coder |
Inputs
- Open-ended survey responses with PIDs (e.g., 816 Development responses or 548 Design & Planning responses)
- Category name and context description
Outputs (per category, per model)
- Theme codebook: code, name, description, supporting PIDs
- Per-response codings: PID → [theme_code_1, theme_code_2, ...]
- Files: `{category}_themes_{model}.json` (15 opportunity files) and `{category}_constraint_themes_{model}.json` (15 constraint files)
Phase 2: GPT-5.2 Reconciliation
A single reconciliation model (GPT-5.2) receives all three models' theme sets and produces a unified codebook per category by:
- Identifying overlapping themes across models (same concept, different names)
- Merging overlapping themes into single unified entries
- Retaining single-model themes only if substantive (≥3 PIDs)
- Dropping themes that are too vague or have very few supporting responses
- Targeting 5–10 unified themes per category
Each unified theme records its source_models (which of the three models independently proposed it) and source_codes (original model-specific code names), providing full provenance.
Outputs
- `consolidated_codebook.json` — all 5 category codebooks (opportunity track)
- `constraint_codebook.json` — all 5 category codebooks (constraint track)
Human Review Gate Required
The pipeline pauses for researcher review before systematic coding begins. The researcher:
- Reviews each proposed theme and reads sample supporting responses
- Checks themes for specificity, granularity, and completeness
- Can keep, rename, merge, split, or remove any theme
- Can add themes the models missed
- Documents rationale for all changes
Systematic coding (Stage 2) does not proceed until the codebook is explicitly approved.
Stage 2: Systematic Coding & Analysis
Coding Protocol
All three models independently re-code every response against the finalized codebook. Key protocol elements:
| Parameter | Value | Rationale |
|---|---|---|
| Batch size | 20 responses per API call | Balances context window usage against API call count |
| Rationale-first | Model writes rationale before assigning codes | Improves accuracy via chain-of-thought; enables auditability |
| Cross-response context | Each response shown alongside opposite-question answer | Enables misresponse detection (ISSUE codes) |
| Multi-coding | 0, 1, or many themes per response | Captures full semantic content |
| Codebook-only | Only codebook codes or ISSUE_* codes allowed | Prevents code drift across batches |
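The batching and codebook-only rules above reduce to a few lines of code. A minimal sketch (function names are ours, not the pipeline's):

```python
def batched(responses, size=20):
    """Split responses into fixed-size batches (20 per API call, per the protocol)."""
    for i in range(0, len(responses), size):
        yield responses[i:i + size]

def enforce_codebook_only(themes, codebook_codes):
    """Keep only finalized codebook codes or ISSUE_* flags, preventing code drift."""
    return [t for t in themes if t in codebook_codes or t.startswith("ISSUE_")]
```

Filtering (rather than erroring) on out-of-codebook codes is one possible policy; a stricter variant could reject the whole batch and retry.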
ISSUE Code System
During systematic coding, models flag data quality problems rather than silently discarding responses:
| Code | Meaning | Example |
|---|---|---|
| ISSUE_WRONG_FIELD | Respondent answered the opposite question | Describing constraints in the "want" field |
| ISSUE_BACK_REFERENCE | References a prior answer; unintelligible alone | "Same as before", "see above" |
| ISSUE_NON_RESPONSE | Terse non-answer with no analyzable content | "N/A", "none", "no" |
Models may create additional ISSUE_* codes if they encounter other data quality problems. The ISSUE prefix ensures these are never confused with substantive themes.
Inter-Rater Reliability (IRR)
Agreement between the three LLM coders is measured per theme using Krippendorff's alpha (α), the standard multi-rater reliability coefficient for qualitative research. For each theme, a binary (present/absent) coding matrix is built across all responses, and α is calculated at the nominal level.
| Range | Interpretation |
|---|---|
| α ≥ 0.80 | Excellent agreement — publishable |
| α ≥ 0.67 | Acceptable agreement — tentative conclusions |
| α ≥ 0.50 | Moderate agreement — use with caution |
| α < 0.50 | Poor agreement — unreliable for this theme |
Additionally, pairwise Cohen's kappa (κ) is calculated for each model pair (GPT–Gemini, GPT–Opus, Gemini–Opus) and 3-rater percent agreement (all three models assign the same code) is reported per theme.
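For reference, nominal Krippendorff's α over a binary present/absent matrix can be computed from the coincidence matrix as follows. This is a self-contained sketch for complete data (no missing ratings); production runs may rely on a library implementation instead:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha at the nominal level.

    ratings: list of units, each a list of rater values (e.g. 0/1 for
    theme absent/present), with no missing data.
    """
    coincidence = Counter()  # ordered value-pair counts, weighted by 1/(m-1)
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue  # a unit rated by fewer than 2 raters contributes nothing
        for c, k in permutations(unit, 2):
            coincidence[(c, k)] += 1 / (m - 1)
    totals = Counter()       # marginal totals n_c per value
    for (c, _k), w in coincidence.items():
        totals[c] += w
    n = sum(totals.values())
    if n <= 1:
        return 1.0
    d_obs = 1 - sum(coincidence[(c, c)] for c in totals) / n
    d_exp = 1 - sum(v * (v - 1) for v in totals.values()) / (n * (n - 1))
    return 1.0 if d_exp == 0 else 1 - d_obs / d_exp
```

With three raters, each theme contributes one column of 0/1 values per response; α is computed per theme over those columns.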
Consensus Voting
Final theme assignments use a majority vote: 2 of 3 models must agree for a theme to be assigned to a response. This is applied independently per response and per theme code.
ISSUE code handling
If 2+ models flag any ISSUE code for a response (regardless of which specific ISSUE code), the response receives a generic ISSUE marker and is excluded from substantive analysis. This prevents a single aggressive model from filtering out too many responses.
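The voting and ISSUE rules above can be sketched in a few lines (a minimal illustration; the data shapes mirror the Phase 4 codings, but the function is ours):

```python
from collections import Counter

def consensus_vote(codings, min_votes=2):
    """Majority-vote consensus over per-model codings.

    codings: {model: {pid: [theme codes]}}. A theme is assigned when at least
    min_votes models coded it. If min_votes models flag any ISSUE_* code
    (regardless of which one), the response gets a generic ISSUE marker
    and no substantive themes.
    """
    models = list(codings)
    pids = set().union(*(codings[m].keys() for m in models))
    consensus = {}
    for pid in pids:
        per_model = [set(codings[m].get(pid, [])) for m in models]
        issue_flags = sum(any(c.startswith("ISSUE_") for c in codes)
                          for codes in per_model)
        if issue_flags >= min_votes:
            consensus[pid] = ["ISSUE"]  # excluded from substantive analysis
            continue
        votes = Counter(c for codes in per_model for c in codes
                        if not c.startswith("ISSUE_"))
        consensus[pid] = sorted(c for c, n in votes.items() if n >= min_votes)
    return consensus
```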
Rich Opportunity Cards
For the top 5 themes per category (by prevalence), all three models independently generate detailed opportunity cards including:
- Problem statement and proposed capability description
- Required context sources and capability steps
- Impact description with supporting evidence quotes
- Success criteria (qualitative and quantitative measures)
- Constraints and guardrails drawn from the constraint track
- Prevalence data and quantitative signals (AI preference, usage gap)
Cards from the three models are merged using a union-and-deduplicate strategy: longest title wins, context sources are combined (max 7), capability steps use the longest sequence (max 6), and constraints are deduplicated (max 4).
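The union-and-deduplicate strategy can be sketched as follows (field names are illustrative, not the pipeline's schema):

```python
def merge_cards(cards):
    """Merge one theme's per-model opportunity cards.

    Longest title wins; context sources are the deduplicated union (max 7);
    capability steps take the longest sequence (max 6); constraints are
    deduplicated (max 4).
    """
    def dedup(items, cap):
        seen = []
        for x in items:
            if x not in seen:
                seen.append(x)
        return seen[:cap]

    return {
        "title": max((c["title"] for c in cards), key=len),
        "context_sources": dedup(
            (s for c in cards for s in c.get("context_sources", [])), 7),
        "capability_steps": max(
            (c.get("capability_steps", []) for c in cards), key=len)[:6],
        "constraints": dedup(
            (g for c in cards for g in c.get("constraints", [])), 4),
    }
```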
Constraint Maps & Design Principles
Constraint-track prevalence is calculated identically to the opportunity track. The top no-go zones per category are documented with:
- Zone name, description, and prevalence count
- Up to 10 supporting respondent quotes
- 3–6 synthesized design principles per category (generated by GPT-5.2)
- Each principle includes implementation guidance and derivation provenance
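Prevalence for either track reduces to counting consensus assignments per theme. A sketch matching the Phase 6 prevalence fields (`code`, `count`, `percentage`, `pids`):

```python
def theme_prevalence(consensus_codings, total_responses):
    """Compute per-theme prevalence from majority-vote consensus codings."""
    by_theme = {}
    for pid, themes in consensus_codings.items():
        for code in themes:
            by_theme.setdefault(code, []).append(pid)
    return sorted(
        ({"code": code,
          "count": len(pids),
          "percentage": round(100 * len(pids) / total_responses, 1),
          "pids": sorted(pids)}
         for code, pids in by_theme.items()),
        key=lambda row: -row["count"],
    )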
Methodological Controls
The pipeline incorporates several controls designed to increase rigor beyond what a single-model analysis can provide.
| Control | Mechanism | What It Mitigates |
|---|---|---|
| Multi-LLM triangulation | 3 frontier models from different families code independently | Single-model bias, training-data artifacts, idiosyncratic interpretations |
| Rationale-first coding | Models write reasoning before assigning codes | Snap-judgment errors; enables post-hoc audit of coding decisions |
| Cross-response context | Both "want" and "not want" answers shown to coder | Misresponse blindness; enables ISSUE_WRONG_FIELD detection |
| ISSUE code system | Flag quality problems in-band rather than pre-filtering | Pre-filter bias from silently dropping ambiguous responses |
| Idempotent checkpointing | Staleness detection skips phases whose inputs haven't changed | Wasted computation; ensures reproducible reruns |
| Consensus merging | Majority vote (2/3) for codes; union-and-deduplicate for synthesis | Noise from single-model outlier codes; incomplete synthesis from any single model |
Design Decisions & Trade-offs
| Decision | Rationale | Trade-off |
|---|---|---|
| 3 models, not 2 or 5 | Minimum for meaningful IRR (Krippendorff's α); covers 3 major LLM families | Higher API cost (∼3×); manageable with batch parallelism |
| HIGH thinking for all models | Qualitative coding benefits from extended reasoning; reduces surface-level pattern matching | Slower inference, higher token cost (thinking tokens billed at output rate) |
| Batch size of 20 | Enough responses for cross-response pattern recognition; fits comfortably in context windows | More API calls than larger batches; but avoids context truncation risks |
| Majority vote (2/3) | Balances sensitivity and specificity; equivalent to >50% agreement threshold | May miss themes where only one model sees a valid pattern |
| Human gate before coding | Prevents systematic errors from propagating through the entire coding phase | Introduces a manual pause in an otherwise automated pipeline |
| No pre-filtering of responses | ISSUE codes capture quality problems without discarding data points | Models must process noisy responses; ISSUE detection is itself imperfect |
| GPT-5.2 as sole reconciler | Reconciliation requires structured comparison rather than independent generation; one model suffices | Reconciliation may inherit GPT-specific biases in theme naming |
| Streaming for Claude Opus | Avoids 10-minute HTTP timeout on long-running inference | More complex error handling; no retry on partial stream failures |
Limitations & Mitigations
| Limitation | Impact | Mitigation |
|---|---|---|
| LLM nondeterminism | Exact codings may vary across runs even with identical inputs | 3-model triangulation smooths out individual variance; IRR quantifies remaining disagreement; idempotent checkpointing ensures reproducible runs when inputs are stable |
| LLM rationalization | Models may construct plausible but incorrect rationales | Multi-model disagreement surfaces cases where rationalization diverges; majority vote filters single-model confabulations |
| Prompt sensitivity | Different prompt wording could yield different themes | Codebook-anchored coding constrains coder freedom; prompts are documented and versioned for replication |
| Not replacing human qualitative research | LLM coders lack lived experience; may miss cultural nuances | Human review gate validates codebook; methodology is positioned as accelerating qualitative work, not replacing it; all outputs include supporting quotes for human verification |
| Survey sample | 860 Microsoft developers may not represent the broader industry | Out of scope for the analysis methodology itself; noted as a limitation of the source data |
| LLM knowledge contamination | Models may have been trained on similar survey analyses | Codebook-first design constrains output to researcher-approved themes; verbatim quotes provide verifiable evidence independent of model knowledge |
Artifacts & Replication
Artifact Inventory
| Phase | File Pattern | Count | Description |
|---|---|---|---|
| Data | {category}_responses.json | 5 | Extracted open-ended responses with PIDs |
| Data | {category}_quantitative.json | 5 | Aggregated Likert scale metrics per task |
| Data | {category}_do_not_want_responses.json | 5 | Extracted constraint responses with PIDs |
| Stage 1 | {category}_themes_{model}.json | 15 | Independent opportunity theme discoveries |
| Stage 1 | {category}_constraint_themes_{model}.json | 15 | Independent constraint theme discoveries |
| Stage 1 | consolidated_codebook.json | 1 | Unified opportunity codebook (all categories) |
| Stage 1 | constraint_codebook.json | 1 | Unified constraint codebook (all categories) |
| Stage 2 | {category}_phase4_codings.json | 5 | 3-model systematic codings with rationales |
| Stage 2 | phase5_irr_results.json | 1 | Krippendorff's α, Cohen's κ, agreement % |
| Stage 2 | phase6_prevalence_results.json | 1 | Majority-vote consensus and theme prevalence |
| Stage 2 | phase6_rich_opportunities.json | 1 | Top-5 opportunity cards per category (3-model synthesis) |
| Stage 2 | constraint_maps.json | 1 | No-go zones and design principles |
Dependency Chain
Staleness Detection
Every pipeline phase checks whether its output is stale relative to its inputs by comparing file modification times. If all inputs are older than the output, the phase is skipped. If any input is newer, the output is regenerated. This enables:
- Incremental reruns: updating one category's theme discovery only regenerates downstream outputs for that category
- Safe restarts: if the pipeline crashes mid-phase, only the incomplete phase reruns
- Force override: the `--force` flag bypasses staleness checks for full regeneration
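The staleness rule is a plain mtime comparison. A minimal sketch (names are ours):

```python
from pathlib import Path

def is_stale(output: Path, inputs: list[Path], force: bool = False) -> bool:
    """Rerun a phase if forced, if its output is missing, or if any input is newer."""
    if force or not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(p.stat().st_mtime > out_mtime for p in inputs)
```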
How to Rerun
- Ensure API keys are set in
.envfor OpenAI, Google, and Anthropic - Install dependencies:
uv sync - Run full pipeline:
bash run_full_pipeline.sh - Pipeline pauses after Stage 1 for human codebook review
- After approval, Stage 2 runs automatically
- To force regeneration:
bash run_full_pipeline.sh --force - To rerun a single category:
uv run phase4_systematic_coding.py design_planning
Appendix
Opportunity Codebook (All 5 Categories, 48 Themes)
Unified codebook produced by GPT-5.2 reconciliation of themes independently discovered by all three models. Each theme lists which models independently identified it.
Development 10 themes
| Code | Theme | Models |
|---|---|---|
| refactoring_modernization | Automated Refactoring, Modernization & Tech-Debt Reduction | GPT, Gemini, Opus |
| boilerplate_scaffolding_feature_codegen | Boilerplate, Scaffolding & Routine Feature Code Generation | GPT, Gemini, Opus |
| automated_testing_validation | Automated Test Generation, Coverage & Change Validation | GPT, Gemini, Opus |
| debugging_root_cause_fixing | Debugging, Root Cause Analysis & Bug Fix Assistance | GPT, Gemini, Opus |
| repo_wide_context_dependency_awareness | Repo-Wide Context, Dependency Awareness & Safe Multi-File Changes | GPT, Gemini, Opus |
| code_quality_review_security_compliance | Code Quality, Review Automation, Standards & Security/Compliance Guidance | GPT, Gemini, Opus |
| performance_profiling_optimization | Performance Profiling & Optimization Suggestions | GPT, Gemini, Opus |
| architecture_design_planning_support | Architecture, Design Brainstorming & Planning Support | GPT, Gemini, Opus |
| devops_ci_cd_iac_workflow_automation | DevOps, CI/CD, IaC & Engineering Workflow Automation | GPT, Gemini, Opus |
| documentation_knowledge_retrieval_onboarding | Documentation Generation, Knowledge Retrieval & Onboarding/Learning Support | GPT, Gemini, Opus |
Design & Planning 10 themes
| Code | Theme | Models |
|---|---|---|
| requirements_gathering_synthesis | Requirements Gathering, Synthesis & Clarification | GPT, Gemini, Opus |
| architecture_design_generation | Architecture & System Design Generation/Iteration | GPT, Gemini, Opus |
| interactive_brainstorming_design_partner | Interactive Brainstorming & Design Copilot | GPT, Gemini, Opus |
| tradeoff_decision_support_simulation | Trade-off Analysis, What-if Simulation & Decision Support | GPT, Gemini, Opus |
| design_validation_risk_edge_cases | Design Validation, Risk Assessment & Edge-Case Discovery | GPT, Gemini, Opus |
| project_planning_tasking_status_automation | Project Planning, Ticket/Task Breakdown & Status Automation | GPT, Gemini, Opus |
| documentation_spec_diagram_generation | Documentation, Specs & Diagram/Artifact Generation | GPT, Gemini, Opus |
| context_retrieval_codebase_and_institutional_memory | Context Retrieval: Codebase Understanding & Institutional Memory | GPT, Gemini, Opus |
| research_and_information_synthesis | Research, Information Gathering & Knowledge Synthesis | GPT, Gemini, Opus |
| trustworthy_outputs_with_citations | Trustworthy Outputs: Higher Accuracy & Verifiable Citations | GPT, Gemini, Opus |
Quality & Risk 9 themes
| Code | Theme | Models |
|---|---|---|
| automated_test_generation_and_quality_gates | Automated Test Generation, Maintenance & Quality Gates | GPT, Gemini, Opus |
| intelligent_pr_code_review | Intelligent PR/Code Review Assistant | GPT, Gemini, Opus |
| security_vulnerability_detection_and_fix_guidance | Security Vulnerability Detection & Fix Guidance | GPT, Gemini, Opus |
| compliance_and_audit_automation | Compliance, Standards & Audit Process Automation | GPT, Gemini, Opus |
| proactive_risk_monitoring_and_prediction | Proactive Risk Monitoring, Prediction & Anomaly Detection | GPT, Gemini, Opus |
| debugging_root_cause_and_failure_triage | Debugging, Root Cause Analysis & Failure Triage | GPT, Gemini, Opus |
| knowledge_retrieval_and_standards_guidance | Knowledge Retrieval, Summarization & Standards Guidance | GPT, Gemini, Opus |
| agentic_workflow_automation_and_remediation | Agentic Workflow Automation & Automated Remediation | GPT, Gemini, Opus |
| ai_driven_exploratory_chaos_and_fuzz_testing | AI-Driven Exploratory, Chaos & Fuzz Testing | Opus only |
Infrastructure & Ops 10 themes
| Code | Theme | Models |
|---|---|---|
| intelligent_monitoring_alerting_anomaly_detection | Intelligent Monitoring, Alerting & Anomaly Detection | GPT, Gemini, Opus |
| incident_response_rca_mitigation_self_heal | Incident Response Automation (Triage, RCA, Mitigation, Self-Heal) | GPT, Gemini, Opus |
| cicd_pipeline_and_deployment_automation | CI/CD Pipeline & Deployment Automation | GPT, Gemini, Opus |
| infrastructure_provisioning_and_iac_generation | Automated Environment Setup & IaC Generation | GPT, Gemini, Opus |
| infrastructure_maintenance_upgrades_security_cost_optimization | Proactive Maintenance, Upgrades, Security/Compliance & Cost Optimization | GPT, Gemini, Opus |
| customer_support_triage_and_autoresponse | Customer Support Triage & Auto-Response | GPT, Gemini, Opus |
| knowledge_management_doc_search_and_system_context | Knowledge Management, Documentation Search & System Context | GPT, Gemini, Opus |
| ops_toil_automation_and_script_generation | Ops Toil Automation & Script Writing/Debugging | GPT, Gemini, Opus |
| testing_quality_validation_and_safe_deploy | Testing, Quality Validation & Safer Releases | GPT, Gemini, Opus |
| ai_tooling_ux_accuracy_and_cohesive_workflows | Better AI Tooling UX (Accuracy, Control & Cohesive Workflows) | GPT, Gemini, Opus |
Meta-Work 9 themes
| Code | Theme | Models |
|---|---|---|
| automated_documentation | Automated Documentation Generation & Maintenance | GPT, Gemini, Opus |
| knowledge_search_and_discovery | Project Knowledge Search & Discovery (with Traceable Sources) | GPT, Gemini, Opus |
| brainstorming_and_solution_exploration | Brainstorming, Option Generation & Rapid Exploration | GPT, Gemini, Opus |
| personalized_learning_and_upskilling | Personalized Learning for New Technologies | GPT, Gemini, Opus |
| team_onboarding_and_mentoring | Team Onboarding, Mentoring & Institutional Knowledge Transfer | GPT, Gemini, Opus |
| stakeholder_communication_support | Stakeholder/Client Communication Drafting & Translation | GPT, Gemini, Opus |
| meeting_assistance | Meeting Scheduling, Notes, Summaries & Action Items | GPT, Gemini, Opus |
| planning_prioritization_and_status_tracking | Planning, Prioritization, Blocker Detection & Status Reporting | GPT, Gemini, Opus |
| proactive_personal_agent_and_admin_automation | Proactive Personal Agent & Routine Admin Automation | GPT, Gemini, Opus |
Constraint Codebook (All 5 Categories, 50 Themes)
Unified constraint codebook produced by GPT-5.2 reconciliation. Captures what developers do not want AI to handle.
Development 10 themes
| Code | Theme | Models |
|---|---|---|
| no_autonomous_architecture_system_design | No Autonomous Architecture or System Design Decisions | GPT, Gemini, Opus |
| no_large_unscoped_refactors | No Large, Unscoped, or Sweeping Codebase Changes | GPT, Gemini, Opus |
| no_autonomous_execution_merge_deploy_or_agentic_control | No Autonomous Execution, Merging/Deploying, or Agentic Control | GPT, Gemini, Opus |
| no_complex_debugging_or_critical_bug_fixes | No AI Ownership of Complex Debugging or Critical Bug Fixes | GPT, Gemini, Opus |
| no_security_privacy_secrets_handling | No Security/Privacy-Sensitive Work or Secrets Handling | GPT, Gemini, Opus |
| no_autonomous_performance_optimization | No Autonomous Performance Optimization | GPT, Gemini, Opus |
| no_ai_deciding_requirements_business_logic_or_api_ux | No AI-Led Requirements, Core Business Logic, or API/UX Decisions | GPT, Gemini |
| preserve_developer_agency_learning_and_job_ownership | Preserve Developer Agency, Learning, and Ownership | GPT, Gemini, Opus |
| avoid_ai_when_unreliable_contextless_hard_to_verify_or_intrusive | Avoid AI Output That Is Unreliable, Contextless, Hard to Verify, or Intrusive | GPT, Gemini, Opus |
| no_constraints_open_to_ai_help | No Specific No-Go Zones (Open to AI Help) | GPT, Gemini |
Design & Planning 10 themes
| Code | Theme | Models |
|---|---|---|
| human_accountability_final_decisions | No AI Final Decision-Making (Human Accountability Required) | GPT, Gemini, Opus |
| human_led_architecture_design | No AI as Primary System Architect / High-Level Designer | GPT, Gemini, Opus |
| no_ai_project_management_task_assignment | No AI Running Project Management | GPT, Gemini, Opus |
| no_ai_requirements_stakeholder_elicitation | No AI-Led Requirements Gathering or Stakeholder Alignment | GPT, Gemini, Opus |
| no_ai_empathy_team_dynamics | No Replacement of Human Empathy, Collaboration, or Interpersonal Dynamics | GPT, Gemini, Opus |
| ai_assistant_human_in_loop | No Autopilot: AI Should Assist with Human-in-the-Loop Oversight | GPT, Gemini, Opus |
| trust_accuracy_and_context_limitations | Avoid AI for High-Stakes Work Due to Reliability & Missing Context | GPT, Gemini, Opus |
| privacy_confidentiality_ip_and_message_control | No AI Handling Sensitive/Confidential Data or Uncontrolled Messaging | GPT, Gemini, Opus |
| no_ai_vision_strategy_creativity_taste | No AI Owning Product Vision, Strategy, or Creative Judgments | GPT, Gemini |
| no_constraints_or_unsure | No Constraints Stated / Welcome Full AI Involvement | GPT, Gemini, Opus |
Quality & Risk 10 themes
| Code | Theme | Models |
|---|---|---|
| human_final_decision_and_accountability | Humans Must Make Final High-Stakes Decisions | GPT, Gemini, Opus |
| no_autonomous_code_or_production_actions | No Autonomous Code/Repo/Production Actions Without Approval | GPT, Gemini, Opus |
| human_code_review_gate_required | Human Code Review / PR Approval Must Remain the Gate | GPT, Gemini, Opus |
| security_and_compliance_must_be_human_led | Security, Compliance, and Threat Modeling Must Be Human-Led | GPT, Gemini, Opus |
| no_sensitive_data_or_credentials_access | Do Not Give AI Access to Sensitive/Customer Data or Credentials | GPT, Gemini, Opus |
| ai_outputs_must_be_verifiable_and_not_self_validated | AI Must Be Reliable, Verifiable, and Not Self-Validated | GPT, Gemini, Opus |
| humans_own_requirements_architecture_and_tradeoffs | Humans Must Own Requirements, Architecture, and Trade-Offs | GPT, Gemini, Opus |
| human_led_test_strategy_intent_and_signoff | Test Strategy and Sign-Off Must Be Human-Led | GPT only |
| preserve_human_ethics_empathy_and_human_centric_work | Preserve Human Ethics, Empathy, and Human-Centric Work | GPT, Gemini |
| no_constraints_stated | No Specific No-Go Areas Stated | GPT, Opus |
Infrastructure & Ops 10 themes
| Code | Theme | Models |
|---|---|---|
| no_direct_customer_interaction | No Direct AI-to-Customer Interaction | GPT, Gemini, Opus |
| no_autonomous_production_changes | No Autonomous Production Deployments or Changes | GPT, Gemini, Opus |
| human_approval_before_consequential_actions | Human Approval Required Before Consequential Actions | GPT, Opus |
| no_security_permissions_secrets_management | No AI Management of Security, Access, Permissions, or Secrets | GPT, Gemini, Opus |
| no_autonomous_incident_response_or_overrides | No Autonomous Incident Response or Critical Overrides | GPT, Gemini, Opus |
| avoid_ai_for_high_precision_deterministic_work | Avoid AI for High-Precision/Deterministic Work | GPT, Gemini, Opus |
| no_full_autonomy_for_environment_setup_maintenance | No Full Autonomy for Environment Setup and Maintenance | GPT, Gemini |
| preserve_human_learning_and_accountability | Preserve Human Learning, System Understanding, and Accountability | GPT, Gemini, Opus |
| no_ai_initiated_irreversible_or_destructive_data_actions | No AI-Initiated Irreversible/Destructive Data Operations | GPT, Gemini, Opus |
| no_constraints_expressed_or_pro_automation | No Constraints Expressed / Comfortable with Broad Automation | GPT, Gemini, Opus |
Meta-Work 10 themes
| Code | Theme | Models |
|---|---|---|
| human_led_mentoring_onboarding | Keep mentoring and onboarding human-led | GPT, Gemini, Opus |
| human_authored_communication | Keep interpersonal communications human-authored | GPT, Gemini, Opus |
| human_review_required_before_sending_or_publishing | No autonomous sending/publishing without human review | GPT, Gemini, Opus |
| no_confidential_or_sensitive_data | Keep AI away from confidential or sensitive information | GPT, Gemini, Opus |
| preserve_hands_on_learning | Don't outsource learning and skills development to AI | GPT, Gemini, Opus |
| preserve_human_research_and_ideation | Keep research/brainstorming primarily human | GPT, Gemini, Opus |
| human_accountability_for_high_stakes_decisions | High-stakes decisions must remain human-led | GPT, Gemini, Opus |
| avoid_unvetted_documentation | AI-generated documentation must be vetted | GPT, Gemini, Opus |
| ai_outputs_not_trustworthy_as_primary_source | Don't treat AI output as trustworthy/authoritative | Opus only |
| no_constraints_or_unsure | No constraints stated / unsure | GPT, Gemini, Opus |
Coding Prompt: Opportunity Track (Phase 4)
You are a qualitative research coder. Your task is to systematically code
each "WANT" response using ONLY the themes from the provided codebook.
Each response is shown alongside the same respondent's answer to a related
question about what they do NOT want AI to handle, for additional context.
CODEBOOK:
{codebook themes listed here}
ISSUE CODES (assign when a response has data quality issues):
- ISSUE_WRONG_FIELD: The respondent appears to have answered the other question
- ISSUE_BACK_REFERENCE: Response references a prior answer and is unintelligible
on its own
- ISSUE_NON_RESPONSE: Terse non-answer with no analyzable content
- You may create other ISSUE_* codes if you encounter a different type of data
quality problem
INSTRUCTIONS:
1. Read each response carefully
2. For each response, write a brief rationale
3. Then assign ALL applicable theme codes from the codebook
4. A response can have 0, 1, or multiple themes
5. Only use codes from the codebook or ISSUE codes
6. If no themes apply, return an empty array
RESPONSES TO CODE:
[Batches of 20 responses, each with context from opposite question]
OUTPUT FORMAT:
[
{"pid": 8, "rationale": "...", "themes": ["theme_code_1", "theme_code_2"]},
{"pid": 11, "rationale": "...", "themes": ["ISSUE_BACK_REFERENCE"]}
]
Return ONLY the JSON array, no other text.
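Because the prompt demands a bare JSON array, each returned batch can be parsed and checked against the output contract before acceptance. A sketch (the validation rules here are ours, inferred from the prompt):

```python
import json

def parse_coding_batch(raw, expected_pids, allowed_codes):
    """Parse one batch of coder output and enforce the output contract."""
    rows = json.loads(raw)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array")
    for row in rows:
        if not {"pid", "rationale", "themes"} <= set(row):
            raise ValueError(f"missing keys in {row}")
        if row["pid"] not in expected_pids:
            raise ValueError(f"unexpected pid {row['pid']}")
        bad = [t for t in row["themes"]
               if t not in allowed_codes and not t.startswith("ISSUE_")]
        if bad:
            raise ValueError(f"codes outside codebook: {bad}")
    return rows
```

A failed check can trigger a retry of that batch rather than silently accepting malformed codings.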
Coding Prompt: Constraint Track
Identical structure to the opportunity track prompt, but with:
- Constraint codebook themes replacing opportunity themes
- "NOT WANT" responses as the primary coding target
- "WANT" responses shown as cross-response context
- Same ISSUE code system applies
Theme Discovery Prompt Template
You are analyzing open-ended survey responses from software developers about
where they want AI assistance in their work. Your task is to identify themes
in these responses.
Guidelines for theme creation:
- Themes should be SPECIFIC and ACTIONABLE
(e.g., "Automated test generation for edge cases" not just "Testing")
- Themes should be PROBLEM-FOCUSED (describe the pain point, not a solution)
- A response can belong to MULTIPLE themes
- Aim for 4-15 themes that capture the major patterns
For each response, return:
{
"pid": <participant ID>,
"themes": ["theme_code_1", "theme_code_2", ...]
}
Also provide a theme codebook:
{
"themes": [
{
"code": "snake_case_theme_code",
"name": "Human-Readable Theme Name",
"description": "What this capability means and why developers want it",
"pids": [list of PIDs expressing this]
}
],
"codings": [
{"pid": 123, "themes": ["theme_code_1", "theme_code_2"]}
]
}
RESPONSES TO ANALYZE:
[All responses for the category]
Theme Reconciliation Prompt Template
You are a qualitative research analyst performing theme reconciliation.
CONTEXT: Three independent AI models analyzed survey responses from the
"[category]" category. Each identified opportunity themes. Your job is to
reconcile these into a unified codebook.
--- GPT-5.2 THEMES ---
[All GPT themes with names, descriptions, PIDs]
--- GEMINI THEMES ---
[All Gemini themes]
--- OPUS THEMES ---
[All Opus themes]
TASK: Create a unified codebook by:
1. Identifying themes that overlap across models (same concept, different names)
2. Merging overlapping themes into single unified themes
3. Keeping single-model themes IF substantive (≥3 PIDs)
4. Dropping themes that are too vague or have very few supporting responses
5. Aim for 5-10 unified themes per category
For each unified theme, provide:
- code: snake_case identifier
- name: Human-readable name
- description: Clear description of the desired capability
- source_models: which models identified it (["gpt", "gemini", "opus"])
- source_codes: the original codes from each model
Return ONLY valid JSON, no other text.
ISSUE Code Taxonomy
| Code | Definition | Detection Signal | Consensus Rule |
|---|---|---|---|
| ISSUE_WRONG_FIELD | Respondent answered the opposite question (e.g., wrote constraints in the "want" field) | Cross-response context reveals contradictory intent | 2+ models flag any ISSUE_* → generic ISSUE marker applied |
| ISSUE_BACK_REFERENCE | Response references a prior answer ("same as before", "see above") and is unintelligible alone | Short response with deictic language | |
| ISSUE_NON_RESPONSE | Terse reply with no analyzable content | "N/A", "none", "no", single punctuation | |
| ISSUE_* (custom) | Models may create additional issue codes for novel quality problems | Varies | Same 2/3 majority rule; prefix matching ensures grouping |
Output JSON Schemas
Theme Discovery Output
{
"model": "gpt-5.2",
"category": "design_planning",
"category_name": "Design & Planning",
"response_count": 223,
"timestamp": "ISO-8601",
"themes": [
{
"code": "string",
"name": "string",
"description": "string",
"pids": [integer]
}
],
"codings": [
{ "pid": integer, "themes": ["string"] }
]
}
Consolidated Codebook
{
"metadata": {
"phase": "Opportunity Theme Reconciliation",
"timestamp": "ISO-8601",
"reconciliation_model": "gpt-5.2",
"discovery_models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
},
"categories": {
"category_key": {
"category": "string",
"category_name": "string",
"theme_count": integer,
"models_reconciled": ["gpt", "gemini", "opus"],
"themes": [
{
"code": "string",
"name": "string",
"description": "string",
"source_models": ["gpt", "gemini", "opus"],
"source_codes": {
"gpt": ["string"],
"gemini": ["string"],
"opus": ["string"]
}
}
]
}
}
}
Systematic Codings (Phase 4)
{
"category": "string",
"phase": "Phase 4 - Systematic Coding",
"timestamp": "ISO-8601",
"models": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
"codebook": [ { "code": "string", "name": "string", "description": "string" } ],
"response_count": integer,
"codings": {
"gpt": [ { "pid": integer, "rationale": "string", "themes": ["string"] } ],
"gemini": [ ... ],
"opus": [ ... ]
},
"cost": { ... }
}
IRR Results (Phase 5)
{
"phase": "Phase 5 - Inter-Rater Reliability",
"methodology": {
"raters": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"],
"metrics": ["Krippendorff's Alpha", "Cohen's Kappa (pairwise)", "Percent Agreement"]
},
"overall_statistics": {
"mean_krippendorff_alpha": float,
"mean_percent_agreement": float,
"interpretation": "string"
},
"category_results": {
"category_key": {
"krippendorff_alpha": { "theme_code": float },
"percent_agreement": { "theme_code": float },
"pairwise_kappa": {
"gpt_vs_gemini": { "theme_code": float },
"gpt_vs_opus": { "theme_code": float },
"gemini_vs_opus": { "theme_code": float }
},
"code_frequencies": { "gpt": {}, "gemini": {}, "opus": {} }
}
}
}
Prevalence Results (Phase 6)
{
"methodology": {
"consensus_method": "majority_vote",
"threshold": "2+ of 3 models must agree"
},
"category_results": {
"category_key": {
"theme_prevalence": [
{
"code": "string",
"count": integer,
"percentage": float,
"pids": [integer]
}
],
"consensus_codings": { "pid": ["theme1", "theme2"] }
}
}
}
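The majority-vote consensus and per-theme prevalence behind this schema can be sketched as below. The helper `consensus_codings` and its input shape are hypothetical, but its output mirrors the `theme_prevalence` and `consensus_codings` fields above.

```python
from collections import Counter

def consensus_codings(codings_by_model, n_responses, threshold=2):
    """Majority-vote consensus: a theme is assigned to a response when at
    least `threshold` of the models coded it (2 of 3 by default).

    codings_by_model: {model: {pid: set_of_theme_codes}}
    Returns (consensus, prevalence) matching the Phase 6 schema fields.
    """
    votes = Counter()
    for codings in codings_by_model.values():
        for pid, themes in codings.items():
            for theme in themes:
                votes[(pid, theme)] += 1

    # Consensus codings: themes that cleared the vote threshold, per response
    consensus = {}
    for (pid, theme), n in votes.items():
        if n >= threshold:
            consensus.setdefault(pid, []).append(theme)

    # Prevalence: how many responses carry each consensus theme
    theme_pids = {}
    for pid, themes in consensus.items():
        for theme in themes:
            theme_pids.setdefault(theme, []).append(pid)
    prevalence = [
        {
            "code": theme,
            "count": len(pids),
            "percentage": round(100 * len(pids) / n_responses, 1),
            "pids": sorted(pids),
        }
        for theme, pids in sorted(theme_pids.items())
    ]
    return consensus, prevalence
```

Note that `percentage` is computed against all responses in the category, not only those that received a consensus theme.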
Rich Opportunity Card
{
"rank": integer,
"theme_code": "string",
"category": "string",
"title": "string",
"problem_statement": "string",
"proposed_capability": {
"summary": "string",
"context_sources_needed": ["string"],
"capability_steps": ["string"]
},
"impact": {
"description": "string",
"evidence_quotes": [ { "pid": integer, "quote": "string" } ]
},
"success_definition": {
"qualitative_measures": ["string"],
"quantitative_measures": ["string"]
},
"constraints_and_guardrails": [
{
"constraint": "string",
"supporting_quote": { "pid": integer, "quote": "string" }
}
],
"who_it_affects": {
"prevalence_count": integer,
"prevalence_percentage": float,
"description": "string",
"signals": ["string"]
},
"models_consulted": ["gpt-5.2", "gemini-3.1-pro-preview", "claude-opus-4-6"]
}
IRR Interpretation Guide
Why Krippendorff's Alpha?
- Designed for multi-rater reliability (3+ raters)
- Handles missing data gracefully (if one model fails on a batch)
- Supports nominal-level measurement (categorical theme codes)
- Does not assume a fixed rater set
- More conservative than simple percent agreement, adjusting for chance
How It's Calculated
For each theme code, a binary matrix is constructed (one row per model, one column per response) and passed to the krippendorff package:
import numpy as np
import krippendorff

#            PID_1  PID_2  PID_3  PID_4
matrix = np.array([
    [1, 0, 1, 0],  # GPT
    [1, 0, 1, 0],  # Gemini
    [1, 0, 0, 0],  # Opus
])

alpha = krippendorff.alpha(
    reliability_data=matrix,
    level_of_measurement="nominal"
)
Reporting
- Per-theme α values identify which themes models agree/disagree on
- Themes with α < 0.67 are flagged for potential human adjudication
- Overall mean α provides a summary reliability score
- Pairwise κ identifies whether specific model pairs diverge
- Code frequency counts reveal systematic over/under-coding by individual models
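For the pairwise κ values, Cohen's kappa can be computed directly from two models' binary coding vectors for a theme. A minimal sketch (the helper `pairwise_kappa` is illustrative, not the pipeline's actual code):

```python
def pairwise_kappa(a, b):
    """Cohen's kappa for two binary coding vectors (one entry per response).

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each coder's base rates.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_a1 = sum(a) / n  # coder A's rate of applying the theme
    p_b1 = sum(b) / n  # coder B's rate of applying the theme
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)
```

Comparing the three pairwise values per theme (gpt_vs_gemini, gpt_vs_opus, gemini_vs_opus) shows whether low α is driven by one outlier model or by general disagreement.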
What IRR Tells Us (and Doesn't)
High α means the three models consistently apply the same theme to the same responses—the codebook is operationally clear and the models "understand" it similarly. Low α on a specific theme may indicate the theme definition is ambiguous, the theme requires human judgment the models handle differently, or the theme captures a rare pattern where base-rate effects inflate disagreement.
IRR does not tell us whether the codes are correct—only that the coders agree. This is why the human review gate exists: to ensure the codebook itself captures meaningful, well-defined themes before reliability is measured.
Model Configuration & Cost Tracking
Model Parameters
| Model | Thinking Config | Temperature | Streaming |
|---|---|---|---|
| GPT-5.2 | reasoning_effort="high" | 1 | No |
| Gemini 3.1 Pro | ThinkingConfig(thinking_level="HIGH") | Default | No |
| Claude Opus 4.6 | thinking: adaptive, effort: high | Default | Yes (timeout avoidance) |
Token Pricing (per 1M tokens)
| Model | Input | Output | Notes |
|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | Thinking tokens billed at output rate |
| Gemini 3.1 Pro | $2.00 | $12.00 | Thinking tokens billed at output rate |
| Claude Opus 4.6 | $5.00 | $25.00 | Thinking tokens billed at output rate |
The CostTracker class in llm.py tracks input, output, and thinking tokens separately for each API call and prints phase-level cost summaries to the console.
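The actual CostTracker implementation is not reproduced here; a minimal sketch consistent with the pricing table above (per-1M-token rates, thinking tokens billed at the output rate) might look like:

```python
# Per-1M-token rates from the pricing table; thinking tokens bill at the output rate.
PRICING = {
    "gpt-5.2":                {"input": 1.75, "output": 14.00},
    "gemini-3.1-pro-preview": {"input": 2.00, "output": 12.00},
    "claude-opus-4-6":        {"input": 5.00, "output": 25.00},
}

class CostTracker:
    def __init__(self):
        self.calls = []

    def record(self, model, input_tokens, output_tokens, thinking_tokens=0):
        """Record one API call and return its cost in dollars."""
        rates = PRICING[model]
        cost = (
            input_tokens * rates["input"]
            + (output_tokens + thinking_tokens) * rates["output"]
        ) / 1_000_000
        self.calls.append({"model": model, "cost": cost})
        return cost

    def total(self):
        """Running total across all recorded calls (e.g., for a phase summary)."""
        return sum(call["cost"] for call in self.calls)
```

For example, a GPT-5.2 call with 1M input, 100K output, and 50K thinking tokens costs (1,000,000 × $1.75 + 150,000 × $14.00) / 1,000,000 = $3.85.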