TL;DR
Last week we visited Ljubljana. The reason was Michael Bolton, who was doing consultancy work in the Slovenian capital and generously gave his talk, “What Are We Thinking in the Age of AI?” He showed an example of how to apply Rapid Software Testing in the context of AI.
The Host
The host of this event was Celtra d.o.o., whose office is in the heart of Ljubljana. Many thanks to them for organizing this meetup and providing food and drinks for the mingle session. There were over 50 attendees.
The Talk

Michael started with the statement that he is often labeled an AI and automation “hater.” The reason is his blog posts and his collaboration with James Bach. But if you understand what software testing is about, you can conclude for yourself that this is not true.
What he does hate in software development is the following:
- recklessness
- bullshit
- fakery
- hype
- marginalization
- human negligence
- stock market obsession
- parasites
- Elon Musk
Michael continued with important AI issues, such as various biases: gender, racial, religious, and others. We must not forget that software testing is socially challenging: testers have to deliver information (usually bad news) to their teammates very carefully. Software testers think about what could go wrong with the software in order to make it better and prevent it from causing harm to people.
Software testing is a product-learning activity involving questioning, study, modeling (simplified models), observation, inference, risk analysis, and critical thinking. This is also what we do with AI.
We must think about the basis of AI claims. Michael provided examples such as Shazam and no-code automation testing apps (which have a lot of if statements under the hood). All AI apps look like magic at first glance. But Shazam is an algorithm, and that algorithm can be wrong in some cases. Testers must be able to find those blind spots.
To be able to find blind spots, we must learn about the “magic.” Here are some books recommended by Michael:
- Understanding Deep Learning
- The Master Algorithm
- Rebooting AI
- What Is ChatGPT Doing … and Why Does It Work?
- A Conceptual Guide to Transformers
- On the Dangers of Stochastic Parrots
Here is Michael’s list of how we should use LLMs:
- To help with making queries, rather than for controlling things (our computers being one example)
- When a human can supervise the LLM’s outcome
- A qualified person should make the decision about which LLM answer to use
- To help with inspiration and creativity (by raising the LLM temperature; see the sketch after this list)
- When variability in answers is expected
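Temperature is the knob that trades repeatability for variety in an LLM’s answers. As a rough illustration of the last two points (my own sketch, not something from the talk; it assumes the OpenAI Python client as one possible backend, and the model name and prompt are placeholders):

```python
# A rough illustration (not from the talk) of how temperature trades repeatability
# for variety. Assumes the OpenAI Python client (pip install openai); the model
# name and prompt are placeholders, and any LLM client with a temperature
# parameter behaves similarly.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Suggest a playful name for a new software testing heuristic."

for temperature in (0.0, 1.2):
    print(f"--- temperature={temperature} ---")
    for _ in range(3):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        # Low temperature tends to repeat the same wording; high temperature varies more.
        print(response.choices[0].message.content)
```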
Michael did his homework to understand how LLMs work, and here is his simplified definition of an LLM:
An LLM just scrambles up training data. With variation, the LLM jiggles the results in an attempt to mirror humans.
As testers, we should ask the following questions about LLM-powered applications:
- How does it work?
- How is it trained?
- Is it biased?
- Is it consistent in providing answers?
- Is it randomly wrong?
- Is it fast?
- Is the training data legal?
And I especially like this analogy by Michael:
An LLM is like a Petri dish with various samples.
Michael gave two good real-world examples of LLM bias. One was when Amazon used an LLM to find out what the perfect fit for an Amazon employee would be. The answer was a white male between 25 and 35 years old. The trick was that this simply reflected the current situation at Amazon: the LLM just provided an answer based on its training data.
The second example was with Air Canada’s customer chatbot. Michael had to travel urgently due to a loss in his family. He asked the bot if he was eligible for an emergency discount, and the bot confirmed it. The problem was that no such emergency discount existed.
The application domain is also important, as always in software testing. Michael did consultancy work on how to test an LLM-based product for the Slovenian Supreme Court. The biggest risk was data safety: what data might end up in prompts sent to ChatGPT?
Another problem is algorithmic obscurity: we are expected to test the LLM’s seemingly human-like thinking, yet we cannot see how it arrives at its answers. On top of that, the LLM will try to please us as much as possible with its famous answer: “You are absolutely right!”
LLMs are expensive to test, run, train, and fix. For example, somebody once asked ChatGPT to describe how Bolton and Bach would test an application. ChatGPT took 30 minutes to write that blog post, but James and Michael then spent 30 hours analyzing the information in it, because the LLM was mostly wrong in what it generated. James and Michael would never test in the way described in that blog post.
How do Michael and James test an LLM? One example is the First Hurdle heuristic: check whether the product can even get off the starting block. Michael asked ChatGPT for a table with numbers in the first column and the English word for each number in the second column, sorted by the second column. James devised the LARC report (LLM Aggregated Retrieval Consistency): pick a text that is part of the prompt or the training data, ask for all examples of something in that text (e.g. noun phrases), and repeat the same prompt several times. For each returned result, ask again whether it is a valid example that actually appears in the text. After N repetitions we should get N identical lists.
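To make the LARC idea concrete, here is my own minimal sketch of such a consistency check. It is not Michael and James’s actual tooling; it assumes the OpenAI Python client, and the source text, model name, and prompt wording are placeholders.

```python
# My own minimal sketch of a LARC-style consistency check, not Michael and James's
# actual tooling. Assumes the OpenAI Python client (pip install openai); the source
# text, model name, and prompt wording are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SOURCE_TEXT = "Put a text here that is part of the prompt (or known training data)."
EXTRACTION_PROMPT = (
    "List every noun phrase that appears in the following text, "
    "one per line, with no commentary:\n\n" + SOURCE_TEXT
)
N = 5  # how many times to repeat the same prompt

def ask_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

def larc_check() -> None:
    runs = []
    for _ in range(N):
        reply = ask_llm(EXTRACTION_PROMPT)
        items = tuple(line.strip() for line in reply.splitlines() if line.strip())
        runs.append(items)

        # Follow-up question: does the model confirm that each of its own answers
        # actually appears in the source text?
        for item in items:
            verdict = ask_llm(
                f"Is '{item}' a noun phrase that appears in this text? "
                f"Answer only yes or no.\n\n{SOURCE_TEXT}"
            )
            if "yes" not in verdict.lower():
                print(f"Model does not stand behind its own answer: {item!r}")

    # If retrieval were consistent, all N runs would return identical lists.
    distinct = Counter(runs)
    if len(distinct) == 1:
        print(f"All {N} runs returned the same list of {len(runs[0])} items.")
    else:
        print(f"Got {len(distinct)} different lists across {N} runs.")

if __name__ == "__main__":
    larc_check()
```

The point of the exercise is the same as in the talk: we are not checking whether any single answer is right, but whether the model can even agree with itself across repetitions of the same prompt.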
Conclusion
In the era of vibe coding, programmers are promoted to the role of vibe code testers. Should we learn from an LLM? We should analyze each LLM response, not blindly trust it. A better option is to use the LLM to help us construct input that we can then use for a Google search.
After the talk and questions, it was mingle time. Michael said that he started programming with Ruby, and that Everyday Scripting with Ruby was the book that got him started. We commented on Python syntax, since Python is the language where most AI coding is done. Michael’s comment on Python was:
No language should depend on invisible things.
Jerry Weinberg was mentioned with his books Perfect Software and Other Illusions About Testing and An Introduction to General Systems Thinking. Based on the latter book, James Lyndsay created his Simple Systems.