01
AgentCon 2026  ·  Lecture Theatre 2  ·  25 min

Kill Your Selenium Scripts

AI-Driven E2E Testing · Live Demo of SkyTest Agent

Ben Cheng  ·  Oursky  ·  Cantonese Open Source
● TALK main
1 / 10
02
skytest@agentcon:~/e2e$ cat PROBLEM.md

# The Problem with E2E Testing

● TALK the-problem
2 / 10
03
skytest@agentcon:~/e2e$ skytest eval --approaches=all

# Why Not Just Use Existing AI?

✗ Browser MCP

Feels magical — drives a browser like a human via computer use

Burns context tokens magically too. Pass/Fail signal is unreliable. Slow.

✗ Browser Agents

Raw Claude/GPT with browser access — completes flows end-to-end

Hard to control. No reliable Pass/Fail. Can't assert confidently.

✗ Agentic Code Gen

AI generates Playwright scripts fast — looks great on paper

You still have code to maintain. The dream is something fault-tolerant like a human tester, not more code.

→ SkyTest sits in the gap: plain-English test cases + structured Pass/Fail results
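
To make the gap concrete, here is a minimal sketch of the two artifacts this implies: a plain-English test case in, a structured verdict out. The field names and shapes are illustrative assumptions, not SkyTest's real schema.

```python
# Hypothetical sketch (illustrative field names, not the real schema):

# 1. A test case is plain English — no selectors, no Playwright code.
test_case = {
    "name": "login-shows-dashboard",
    "steps": [
        "Open the app and log in as a regular user",
        "After login, the dashboard page should load",
    ],
}

# 2. A run produces a structured, machine-checkable verdict —
#    the reliable Pass/Fail that raw browser agents lack.
result = {
    "test": "login-shows-dashboard",
    "status": "PASS",
    "evidence": ["screenshot-01.png", "run.log"],
}

assert result["status"] in {"PASS", "FAIL"}
```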

● TALK approach-eval
3 / 10
04
skytest@agentcon:~/skytest$ skytest run --demo "Log in and confirm dashboard loads"

# How SkyTest Works

● TALK how-it-works
4 / 10
05
skytest@agentcon:~/models$ skytest models --show-reasoning

# The Model Story

Current: Qwen3.5-27B via midscene — planning + vision
Cost: < $0.4 / M input tokens — cheap enough to run in CI
Candidates: Gemini Flash, Gemma, Qwen3.6 — we keep swapping to find the best
Strategy: Quality / Cost / Speed — it's a moving target

Key insight: cheap vision + reasoning models are now good enough for this job. The economics finally work.
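
A back-of-envelope check of those economics, assuming (as a simplification) that the ~$2-per-15-tests figure from slide 08 is dominated by input tokens at the price above:

```python
# Rough token budget implied by the numbers on this deck.
# Assumption: run cost is dominated by input tokens.
PRICE_PER_M_INPUT = 0.40   # USD per 1M input tokens (slide 05)
RUN_COST = 2.00            # USD per run (slide 08)
TESTS_PER_RUN = 15

cost_per_test = RUN_COST / TESTS_PER_RUN
tokens_per_test = cost_per_test / PRICE_PER_M_INPUT * 1_000_000

print(f"~${cost_per_test:.3f} per test ≈ {tokens_per_test:,.0f} input tokens")
# → ~$0.133 per test ≈ 333,333 input tokens
```

That is a generous budget per test at these prices, which is what "the economics finally work" cashes out to.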

● TALK model-selection
5 / 10
06
skytest@agentcon:~/mobile$ skytest run --device "Pixel 7" --app com.example.app

# Mobile Testing

● TALK mobile-runner
6 / 10
07
skytest@agentcon:~/src$ git log --author="joyz" --oneline | wc -l  → 847 commits, 0 lines written by hand

# 100% Vibe-Coded

    How it was built
  • Built entirely by a QA person, not a programmer. Zero lines of code written by hand.
  • Double-agent PR review — Codex vs Claude review each other's PRs. Both approve = ship it.
  • ! Infra was the hardest part — not the code. Had to learn enough architecture to not break prod.
    The meta-skill loop
  • Built a set of SkyTest skills — AI generates and fixes its own test cases
  • When tests fail: skytest-4-fix skill → diagnoses via Chrome DevTools → tweaks wording → reruns
  • After fixing: agent reviews the conversation and revises the skills — the system gets smarter over time
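
The fix loop above can be sketched as plain control flow. This is a toy model, not the skytest-4-fix implementation: run_test, diagnose, and reword are hypothetical stand-ins for the rerun, the Chrome DevTools diagnosis, and the wording tweak.

```python
# Minimal sketch of the self-fixing loop (hypothetical stand-in functions).
def fix_loop(test, run_test, diagnose, reword, max_attempts=3):
    """Rerun a failing plain-English test, rewording it after each diagnosis."""
    for _ in range(max_attempts):
        if run_test(test) == "PASS":
            return test, "PASS"           # fixed: keep the revised wording
        finding = diagnose(test)          # e.g. inspect via Chrome DevTools
        test = reword(test, finding)      # tweak the plain-English steps
    return test, "FAIL"                   # give up after max_attempts

# Toy run: the test passes once its wording tells the agent to wait.
attempts = []
def run_test(t):
    attempts.append(t)
    return "PASS" if "wait" in t else "FAIL"

fixed, verdict = fix_loop(
    "Log in and check the dashboard",
    run_test,
    diagnose=lambda t: "dashboard loads asynchronously",
    reword=lambda t, finding: t + " (wait for it to load)",
)
print(verdict)  # → PASS, after one reword
```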
● TALK vibe-coded
7 / 10
08
skytest@agentcon:~/results$ skytest report --honest

# What Works / What Doesn't

✓ Works well
  • Regression from bug reports — if the Linear bug has clear steps, test generation is nearly effortless. Rerun anytime.
  • Economics — ~$2 USD to run 15 test cases. Reliable results, good coverage per feature.
  • Assertions — generic descriptions of expected state are reliably evaluated (>80% accuracy)
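
One way the assertion bullet above could work in practice: hand the screenshot plus the plain-English expectation to a vision model and demand a strict PASS/FAIL answer. This is a guess at the shape of the mechanism, not SkyTest's or midscene's actual API; ask_vision_model is hypothetical.

```python
# Hedged sketch: evaluating a generic expected-state description.
# ask_vision_model is a hypothetical stand-in for a vision-model call.
def evaluate_assertion(screenshot_png, expected_state, ask_vision_model):
    prompt = (
        "Does this screenshot satisfy the expectation below? "
        "Answer exactly PASS or FAIL.\n"
        f"Expectation: {expected_state}"
    )
    answer = ask_vision_model(image=screenshot_png, prompt=prompt).strip().upper()
    return answer if answer in ("PASS", "FAIL") else "FAIL"  # fail closed

# Toy model that "sees" the dashboard in the screenshot bytes.
verdict = evaluate_assertion(
    b"<png bytes>",
    "The dashboard is visible with the user's name in the header",
    lambda image, prompt: "pass",
)
print(verdict)  # → PASS
```

Failing closed on any non-conforming answer is what keeps the signal usable in CI.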
! Not there yet
  • ! The bottleneck has flipped: it's now QA, not the engineers. We create ~15 reliable tests/day, but developers ship features faster than ever.
  • ! Need 20+ tests/day to keep pace with 4 features/sprint × 10-15 test cases each
  • ! UI/UX bugs still need human QA — agents consistently miss visual and UX issues
● TALK results
8 / 10
09
skytest@agentcon:~/roadmap$ skytest plan --next-quarter

# What's Next

● TALK roadmap
9 / 10
10
Live Demo

Write a test.
Watch it run.
Get results.

Write test in plain English
Watch it execute across browsers + mobile
Get screenshots, logs, Pass/Fail

謝謝 (Thank you)  ·  Questions?

● DEMO live
10 / 10