01
AgentCon 2026  ·  Lecture Theatre 2  ·  25 min

Kill Your Selenium Scripts

AI-Driven E2E Testing · Live Demo of SkyTest Agent

Ben Cheng  ·  Oursky  ·  Cantonese Open Source
● TALK main
1 / 10
02
skytest@agentcon:~/e2e$ cat PROBLEM.md

# The Problem with E2E Testing

● TALK the-problem
2 / 10
03
skytest@agentcon:~/e2e$ skytest eval --approaches=all

# Why Not Just Use Existing AI?

✗ Browser MCP

Feels magical — drives a browser like a human via computer use

Burns context tokens magically too. Pass/Fail signal is unreliable. Slow.

✗ Browser Agents

Raw Claude/GPT with browser access — completes flows end-to-end

Hard to control. No reliable Pass/Fail. Can't assert confidently.

✗ Agentic Code Gen

AI generates Playwright scripts fast — looks great on paper

You still have code to maintain. The dream is something fault-tolerant like a human tester, not more code.

→ SkyTest sits in the gap: plain-English test cases + structured Pass/Fail results
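
To make the gap concrete, here is a minimal sketch of the two artifacts this implies: a plain-English test case in, a structured verdict out. The field names and shapes are illustrative assumptions, not SkyTest's real schema.

```python
# Hypothetical sketch (illustrative field names, not the real schema):

# 1. A test case is plain English — no selectors, no Playwright code.
test_case = {
    "name": "login-shows-dashboard",
    "steps": [
        "Open the app and log in as a regular user",
        "After login, the dashboard page should load",
    ],
}

# 2. A run produces a structured, machine-checkable verdict —
#    the reliable Pass/Fail that raw browser agents lack.
result = {
    "test": "login-shows-dashboard",
    "status": "PASS",
    "evidence": ["screenshot-01.png", "run.log"],
}

assert result["status"] in {"PASS", "FAIL"}
```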

● TALK approach-eval
3 / 10
04
skytest@agentcon:~/skytest$ skytest run --demo "Log in and confirm dashboard loads"

# How SkyTest Works

● TALK how-it-works
4 / 10
05
skytest@agentcon:~/models$ skytest models --show-reasoning

# The Model Story

Current: Qwen3.5-27B via midscene — planning + vision
Cost: < $0.4 / M input tokens — cheap enough to run in CI
Candidates: Gemini Flash, Gemma, Qwen3.6 — we keep swapping to find the best
Strategy: Quality / Cost / Speed — it's a moving target

Key insight: cheap vision + reasoning models are now good enough for this job. The economics finally work.
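
A back-of-envelope check of those economics, assuming (as a simplification) that the ~$2-per-15-tests figure from slide 08 is dominated by input tokens at the price above:

```python
# Rough token budget implied by the numbers on this deck.
# Assumption: run cost is dominated by input tokens.
PRICE_PER_M_INPUT = 0.40   # USD per 1M input tokens (slide 05)
RUN_COST = 2.00            # USD per run (slide 08)
TESTS_PER_RUN = 15

cost_per_test = RUN_COST / TESTS_PER_RUN
tokens_per_test = cost_per_test / PRICE_PER_M_INPUT * 1_000_000

print(f"~${cost_per_test:.3f} per test ≈ {tokens_per_test:,.0f} input tokens")
# → ~$0.133 per test ≈ 333,333 input tokens
```

That is a generous budget per test at these prices, which is what "the economics finally work" cashes out to.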

● TALK model-selection
5 / 10
06
skytest@agentcon:~/mobile$ skytest run --device "Pixel 7" --app com.example.app

# Mobile Testing

● TALK mobile-runner
6 / 10
07
skytest@agentcon:~/src$ git log --author="joyz" --oneline | wc -l  → 847 commits, 0 lines written by hand

# 100% Vibe-Coded

    How it was built
  • Built entirely by a QA person, not a programmer. Zero lines of code written by hand.
  • Double-agent PR review — Codex vs Claude review each other's PRs. Both approve = ship it.
  • ! Infra was the hardest part — not the code. Had to learn enough architecture to not break prod.
    The meta-skill loop
  • Built a set of SkyTest skills — AI generates and fixes its own test cases
  • When tests fail: skytest-4-fix skill → diagnoses via Chrome DevTools → tweaks wording → reruns
  • After fixing: agent reviews the conversation and revises the skills — the system gets smarter over time
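
The fix loop above can be sketched as plain control flow. This is a toy model, not the skytest-4-fix implementation: run_test, diagnose, and reword are hypothetical stand-ins for the rerun, the Chrome DevTools diagnosis, and the wording tweak.

```python
# Minimal sketch of the self-fixing loop (hypothetical stand-in functions).
def fix_loop(test, run_test, diagnose, reword, max_attempts=3):
    """Rerun a failing plain-English test, rewording it after each diagnosis."""
    for _ in range(max_attempts):
        if run_test(test) == "PASS":
            return test, "PASS"           # fixed: keep the revised wording
        finding = diagnose(test)          # e.g. inspect via Chrome DevTools
        test = reword(test, finding)      # tweak the plain-English steps
    return test, "FAIL"                   # give up after max_attempts

# Toy run: the test passes once its wording tells the agent to wait.
attempts = []
def run_test(t):
    attempts.append(t)
    return "PASS" if "wait" in t else "FAIL"

fixed, verdict = fix_loop(
    "Log in and check the dashboard",
    run_test,
    diagnose=lambda t: "dashboard loads asynchronously",
    reword=lambda t, finding: t + " (wait for it to load)",
)
print(verdict)  # → PASS, after one reword
```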
● TALK vibe-coded
7 / 10
08
skytest@agentcon:~/results$ skytest report --honest

# What Works / What Doesn't

✓ Works well
  • Regression from bug reports — if the Linear bug has clear steps, test generation is nearly effortless. Rerun anytime.
  • Economics — ~$2 USD to run 15 test cases. Reliable results, good coverage per feature.
  • Assertions — generic descriptions of expected state are reliably evaluated (>80% accuracy)
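
One way the assertion bullet above could work in practice: hand the screenshot plus the plain-English expectation to a vision model and demand a strict PASS/FAIL answer. This is a guess at the shape of the mechanism, not SkyTest's or midscene's actual API; ask_vision_model is hypothetical.

```python
# Hedged sketch: evaluating a generic expected-state description.
# ask_vision_model is a hypothetical stand-in for a vision-model call.
def evaluate_assertion(screenshot_png, expected_state, ask_vision_model):
    prompt = (
        "Does this screenshot satisfy the expectation below? "
        "Answer exactly PASS or FAIL.\n"
        f"Expectation: {expected_state}"
    )
    answer = ask_vision_model(image=screenshot_png, prompt=prompt).strip().upper()
    return answer if answer in ("PASS", "FAIL") else "FAIL"  # fail closed

# Toy model that "sees" the dashboard in the screenshot bytes.
verdict = evaluate_assertion(
    b"<png bytes>",
    "The dashboard is visible with the user's name in the header",
    lambda image, prompt: "pass",
)
print(verdict)  # → PASS
```

Failing closed on any non-conforming answer is what keeps the signal usable in CI.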
! Not there yet
  • ! The bottleneck has flipped: it's now QA, not the engineers. We create ~15 reliable tests/day, but developers ship features faster than ever.
  • ! Need 20+ tests/day to keep pace with 4 features/sprint × 10-15 test cases each
  • ! UI/UX bugs still need human QA — agents consistently miss visual and UX issues
● TALK results
8 / 10
09
skytest@agentcon:~/roadmap$ skytest plan --next-quarter

# What's Next

● TALK roadmap
9 / 10
10
Live Demo

Write a test.
Watch it run.
Get results.

Write test in plain English
Watch it execute across browsers + mobile
Get screenshots, logs, Pass/Fail

謝謝 (Thank you)  ·  Questions?

● DEMO live
10 / 10