
Intro
Tomislav started with our usual participant introductions. There were more than ten participants, so that part was quick. Tomislav was disappointed with the turnout, but I reminded him:
“Quantity over quality, choose not. Quality, the true path is.”
We had one participant who is a mechanical engineer responsible for testing oil refineries before production. And yes, he is a big fan of vibe coding.
The Talk
It seems that Leon could talk for years about this topic—this is how much he knows about LLMs and vibe coding. He started his talk with the evolution of the Pythagora product through its history.
They first created their own LLM model for HTML and CSS. Once they had a working proof of concept, they moved on to unit-test generation for JavaScript. Armed with that knowledge, they created GPT Pilot, which could generate apps of up to 300 lines of code (LOC). GPT Pilot uses existing LLMs for code generation.
An additional value of GPT Pilot is its agent system, where each agent has its own context. At that time, prompt-tuning was a big deal. Each input to an LLM has three parts: the system part that sets the context, the user part with the actual question, and the assistant part representing the answer. This structure is formatted as JSON. What needed to be tweaked was the system part of the prompt and the temperature attribute. The temperature defines the LLM’s creativity—meaning that higher values produce more non-deterministic answers.
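The three-part prompt structure and the temperature setting described above can be sketched as a small helper. This is a minimal illustration, assuming the common OpenAI-style chat JSON layout; the system text, user question, and temperature values are made-up examples, not Pythagora's actual prompts.

```python
import json

def build_prompt(system_text, user_text, temperature=0.7):
    """Assemble a chat-style request body.

    temperature: higher values make sampling more random (more "creative",
    less deterministic); lower values make answers more reproducible.
    """
    return {
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system_text},  # sets the context
            {"role": "user", "content": user_text},      # the actual question
            # the model's reply comes back with role "assistant"
        ],
    }

request = build_prompt(
    "You are a senior JavaScript developer. Write concise, tested code.",
    "Generate a unit test for an add(a, b) function.",
    temperature=0.2,  # low temperature for stable code generation
)
print(json.dumps(request, indent=2))
```

Prompt-tuning in this setup means iterating on the system message and the temperature while keeping the user question fixed.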
To test the prompts, they created an evaluation tool. It had a UI for easier creation of testing prompts, but LLM answers were evaluated by another LLM model. Initially, each model was evaluated against itself, but they figured out that each model was biased toward its own output, so they switched to using a single external model for evaluation. They stopped prompt testing when Claude Sonnet 3.5 arrived—this model is so good at code generation that there’s no longer a need for prompt tweaking.
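The "single external judge" idea can be sketched like this. It is a toy version only: `judge()` is a stand-in stub with a fake scoring rule, where a real implementation would send the question and each candidate answer to one fixed external LLM and parse a score from its reply.

```python
def judge(question, answer):
    # Placeholder "judge model": scores an answer 1.0 if it contains
    # actual code. A real judge would be a single external LLM call,
    # kept the same for every model under test to avoid self-bias.
    return 1.0 if "def " in answer else 0.0

def evaluate(question, answers_by_model):
    """Score each model's answer with the same external judge."""
    return {model: judge(question, ans) for model, ans in answers_by_model.items()}

scores = evaluate(
    "Write a Python function that adds two numbers.",
    {
        "model-a": "def add(a, b):\n    return a + b",
        "model-b": "Use the + operator.",
    },
)
print(scores)
```

The key design point from the talk is that the judge is one model, external to all candidates, so no candidate gets to grade its own homework.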
The current state of Pythagora is a law-enforcement app with 50,000 lines of code running in production. Manual QA is currently their main focus. They have two types of bugs—technical and logical. Logical ones are harder to spot and still require humans to find. Pythagora’s users are their best QA testers, especially those at the junior or mid-level of coding knowledge. They’re also hiring for QA, so if you’re interested, ping them on LinkedIn.
LLMs are getting better and better. For example, the new “thinking” feature is a loop in which the LLM feeds its own previous output back in as context for the next iteration. Current LLM speed is around 300 tokens per second, with a context window of one million tokens (the number of tokens the model can “see” at one time in a single prompt).
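That thinking loop can be sketched in a few lines. This is a rough illustration of the feedback idea only: `call_llm()` is a stand-in stub that fakes a model reply, where a real version would call an actual LLM API.

```python
def call_llm(prompt):
    # Placeholder "model": just emits the next numbered reasoning step
    # based on how many steps already appear in the prompt.
    steps = prompt.count("Step")
    return f"Step {steps + 1}: refine the answer"

def think(question, max_steps=3):
    """Iteratively append the model's own output to its context."""
    context = question
    for _ in range(max_steps):
        thought = call_llm(context)
        context += "\n" + thought  # previous output becomes new context
    return context

print(think("How do I structure a Node.js backend?"))
```

Each pass sees everything produced so far, which is why long context windows matter for this style of multi-step reasoning.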
Yes, Pythagora is using Pythagora for internal app development. We saw in action an AWS admin dashboard with various automation cases related to orchestrating EC2 instances. For marketing, they have a Hacktober YouTube channel featuring vibe-coding videos of various apps.
Pythagora gives the user a full development and deployment environment. The output is React on the frontend, MongoDB as the database, and Node.js on the backend.
Takeaways
The talk lasted around 30 minutes, followed by a discussion that went on for an hour! This is what you missed at this meetup. We eventually stopped because the food arrived—but the conversation continued.
One participant came prepared—she had already started a vibe-coding CV generator app the day before and continued her session with Leon’s support!
I personally learned a lot. On a daily basis, I’m learning about LLMs, using Cursor in my work, and experimenting with CrimeBeats.app, our small proof of concept showing how to use LLMs in real applications.
Next week is Michael Bolton time—hope to see you in Ljubljana!
I finished my evening with a visit to the Ambasada beer store, where I had one excellent Triple IPA.