Copilot ed8a7f559e Add observability for LLM topic context inclusion (#1038 )

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thomasnordquist <7721625+thomasnordquist@users.noreply.github.com>
Co-authored-by: Thomas Nordquist <thomasnordquist@users.noreply.github.com>

2026-01-30 20:53:29 +01:00

Quick Reference: Running LLM Tests with OpenAI API

Prerequisites

✅ OpenAI API key added to GitHub Copilot environment
✅ Test infrastructure installed (100 offline + 11 live tests)
✅ Helper script available: scripts/run-llm-tests.sh

Usage

Option 1: Use Helper Script (Recommended)

# If secret is in environment
./scripts/run-llm-tests.sh

# Or provide explicitly
OPENAI_API_KEY=sk-your-key ./scripts/run-llm-tests.sh

Option 2: Manual Execution

# Set environment variables
export OPENAI_API_KEY=sk-your-key
export RUN_LLM_TESTS=true

# Run tests
cd app && yarn test

Option 3: Single Command

# The path below is the GitHub Actions runner checkout; use your own repository root elsewhere
cd /home/runner/work/MQTT-Explorer/MQTT-Explorer && \
  RUN_LLM_TESTS=true \
  OPENAI_API_KEY="${OPENAI_API_KEY}" \
  yarn test:app

Expected Results

Without Live Tests (Default)

  100 passing (2s)
  11 pending

With Live Tests Enabled

  LLM Integration Tests (Live API)
    Home Automation System Detection
      ✓ should detect zigbee2mqtt topics (2145ms)
      ✓ should detect Home Assistant topics (1892ms)
      ✓ should detect Tasmota topics (1756ms)
    
    Proposal Quality Validation
      ✓ should propose multiple actions (2234ms)
      ✓ should provide clear descriptions (1678ms)
      ✓ should match system formats (1923ms)
    
    Edge Cases
      ✓ should not propose for sensors (1567ms)
      ✓ should handle nested topics (1834ms)
      ✓ should handle special chars (1456ms)
    
    Question Generation
      ✓ should generate relevant questions (2012ms)
      ✓ should generate analytical questions (1789ms)

  111 passing (22s)

Validation Points

Each live test validates:

✅ Topic Format

Matches system pattern (zigbee2mqtt, homeassistant, etc.)
No wildcards
Valid segments
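
The topic checks above can be sketched as a small function. This is a hypothetical helper, not code from the test suite: the name validateProposalTopic, the prefix list, and the choice to treat empty topic levels as invalid are all assumptions made for illustration.

```typescript
// Hypothetical sketch of the "Topic Format" checks; not the project's actual validator.
// Assumed list of system prefixes, taken from the examples in this document.
const KNOWN_PREFIXES: string[] = ["zigbee2mqtt", "homeassistant", "cmnd"];

function validateProposalTopic(topic: string): string[] {
  const problems: string[] = [];
  if (topic.length === 0) {
    return ["topic is empty"];
  }
  // A topic used for publishing must not contain the MQTT wildcards + or #.
  if (/[+#]/.test(topic)) {
    problems.push("topic contains a wildcard (+ or #)");
  }
  const segments = topic.split("/");
  // Assumption: empty levels (leading, trailing, or doubled slashes) count as invalid.
  if (segments.some((segment) => segment.length === 0)) {
    problems.push("topic has an empty segment");
  }
  const first = segments[0] ?? "";
  if (KNOWN_PREFIXES.indexOf(first) === -1) {
    problems.push(`unknown system prefix: ${first}`);
  }
  return problems;
}
```

An empty result means the topic passed every check; otherwise each string names one failed rule.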

✅ Payload Quality

Valid JSON (when appropriate)
Correct format for target system
No injection attempts

✅ QoS Value

Must be 0, 1, or 2
Typically 0 for home automation

✅ Description

Actionable (uses imperative verbs)
Clear and concise
Under 100 characters
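
The payload, QoS, and description rules can be expressed in a few lines. The sketch below is illustrative only: the Proposal interface mirrors the fields used in the examples in this document, while validateProposal and the "starts with { or [" heuristic for deciding when a payload should be JSON are assumptions.

```typescript
// Hypothetical sketch of the payload, QoS, and description checks.
interface Proposal {
  topic: string;
  payload: string;
  qos: number;
  description: string;
}

// Assumed heuristic: a payload that starts with { or [ is meant to be JSON.
function looksLikeJson(payload: string): boolean {
  const first = payload.trim().charAt(0);
  return first === "{" || first === "[";
}

function validateProposal(proposal: Proposal): string[] {
  const problems: string[] = [];
  // MQTT defines exactly three QoS levels.
  if (proposal.qos !== 0 && proposal.qos !== 1 && proposal.qos !== 2) {
    problems.push("qos must be 0, 1, or 2");
  }
  if (proposal.description.trim().length === 0) {
    problems.push("description is empty");
  }
  if (proposal.description.length >= 100) {
    problems.push("description must be under 100 characters");
  }
  if (looksLikeJson(proposal.payload)) {
    try {
      JSON.parse(proposal.payload);
    } catch {
      problems.push("payload looks like JSON but does not parse");
    }
  }
  return problems;
}
```

Plain payloads such as "ON" skip the JSON check, which matches the "when appropriate" qualifier above.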

Troubleshooting

Secret Not Available

If you get "No API key found":

# Check environment (note: this prints the key value to the terminal)
env | grep OPENAI_API_KEY

# If not set, the secret may need to be:
# 1. Refreshed in the Copilot environment
# 2. Made available to the runtime
# 3. Accessed through a different mechanism

Tests Still Pending

If live tests don't run:

# Ensure flag is set
export RUN_LLM_TESTS=true
echo $RUN_LLM_TESTS  # Should output: true

# Check for API key
[ -n "$OPENAI_API_KEY" ] && echo "Key is set" || echo "Key not found"
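
Both shell checks reduce to one decision: live tests run only when the flag and the key are present. The function below is a sketch of that gate under stated assumptions. The name liveTestStatus is hypothetical, and whether the real suite accepts values other than the literal string true is not confirmed by this document.

```typescript
// Hypothetical sketch of the gate that decides whether live tests run or stay pending.
type Env = Record<string, string | undefined>;

interface LiveTestStatus {
  run: boolean;
  reason: string;
}

function liveTestStatus(env: Env): LiveTestStatus {
  // Assumption: only the exact string "true" enables live tests.
  if (env["RUN_LLM_TESTS"] !== "true") {
    return { run: false, reason: "RUN_LLM_TESTS is not set to true" };
  }
  const key = env["OPENAI_API_KEY"];
  if (key === undefined || key.trim() === "") {
    return { run: false, reason: "OPENAI_API_KEY is missing or empty" };
  }
  return { run: true, reason: "flag and API key are both present" };
}
```

In a Node.js test setup this could be called with process.env, and the reason string logged when the 11 live tests are reported as pending.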

What Gets Tested

Home Automation Systems

zigbee2mqtt:

// Expected proposal
{
  topic: "zigbee2mqtt/light/set",
  payload: '{"state": "ON"}',
  qos: 0,
  description: "Turn on the light"
}

Home Assistant:

{
  topic: "homeassistant/light/lamp/set",
  payload: "ON",
  qos: 0,
  description: "Turn on the lamp"
}

Tasmota:

{
  topic: "cmnd/device/POWER",
  payload: "ON",
  qos: 0,
  description: "Turn on the device"
}
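
The three examples differ mainly in their first topic level, so system detection can be approximated by inspecting that prefix. This is a simplified sketch and not how the LLM or the test suite decides: detectSystem is a hypothetical name, and the stat and tele prefixes are added from Tasmota's usual topic layout rather than from this document.

```typescript
// Hypothetical sketch: classify a topic by its first level.
type HomeSystem = "zigbee2mqtt" | "homeassistant" | "tasmota" | "unknown";

function detectSystem(topic: string): HomeSystem {
  const first = topic.split("/")[0] ?? "";
  switch (first) {
    case "zigbee2mqtt":
      return "zigbee2mqtt";
    case "homeassistant":
      return "homeassistant";
    // Tasmota commonly uses cmnd (commands), stat (status), and tele (telemetry).
    case "cmnd":
    case "stat":
    case "tele":
      return "tasmota";
    default:
      return "unknown";
  }
}
```

A prefix check like this gives the tests a deterministic baseline to compare the model's answer against.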

Question Generation

For a topic like zigbee2mqtt/bedroom_light:

// Expected questions
[
  "How can I turn this light on?",
  "What brightness levels are supported?",
  "Can I adjust the color?",
  "How do I automate this light?"
]

Cost Estimate

Per full test run:

API Calls: ~11 requests
Tokens: ~5,000-8,000 total
Cost: ~$0.001-0.002 USD (GPT-4o mini is more than 60% cheaper than GPT-3.5 Turbo)
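
The cost figure follows from token count times price per token. The sketch below reproduces that arithmetic under two assumptions that do not come from the test suite: GPT-4o mini list prices of $0.15 per million input tokens and $0.60 per million output tokens (check current pricing before relying on them), and an 80/20 split between input and output tokens.

```typescript
// Hypothetical sketch of the cost arithmetic; prices and token split are assumptions.
interface PricePerMillion {
  input: number; // USD per 1M input tokens
  output: number; // USD per 1M output tokens
}

// Assumed GPT-4o mini list prices; verify against current OpenAI pricing.
const ASSUMED_PRICES: PricePerMillion = { input: 0.15, output: 0.6 };

function estimateCostUsd(
  totalTokens: number,
  inputShare: number,
  prices: PricePerMillion = ASSUMED_PRICES
): number {
  const inputTokens = totalTokens * inputShare;
  const outputTokens = totalTokens - inputTokens;
  return (inputTokens * prices.input + outputTokens * prices.output) / 1e6;
}
```

With these assumptions, estimateCostUsd(5000, 0.8) comes to about $0.0012 and estimateCostUsd(8000, 0.8) to about $0.0019, in line with the range quoted above.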

Documentation

Test Strategy: app/src/services/spec/README.md
Test Results: docs/LLM_TEST_RESULTS.md
Helper Script: scripts/run-llm-tests.sh

Success Criteria

When tests pass, you'll have validated:

✅ AI can detect home automation systems correctly
✅ Generated proposals have valid MQTT topic format
✅ Payloads match system-specific requirements
✅ Descriptions are clear and actionable
✅ Questions are relevant and diverse
✅ No security issues (injection, size limits)

Ready to test? Run: ./scripts/run-llm-tests.sh
