Cloud Native Testing Podcast
The Cloud Native Testing Podcast, sponsored by Testkube, brings you insights from engineers navigating testing in cloud-native environments.
Hosted by Ole Lensmar, it explores test automation, CI/CD, Kubernetes, shifting left, scaling right, and reliability at scale through conversations with testing and cloud native experts.
Learn more about Testkube at http://testkube.io
Cloud Native Testing and AI - Why Tests Provide the Context
In this episode, Ole Lensmar is joined by Richard Li, founder of Polar Sky, to explore how testing strategies must evolve for the age of AI. They discuss the journey of cloud-native tools like Telepresence and Richard's key insight for agentic coding: using automated tests as the "context" to guide AI behavior, rather than relying on complex prompts to describe system rules.
The conversation also digs into the mechanics of building AI-driven software, emphasizing why "evals" are critical for measuring success. Richard shares practical strategies for treating evaluations as data management problems and explains how his team uses simple AI agents to "babysit" CI pipelines and automatically resolve flaky tests.
Topics discussed:
- The Evolution of Telepresence: Moving from local debugging to headless execution in CI pipelines.
- Tests as Context: Why running a test suite is more effective than complex prompting for AI agents.
- Agentic Coding Strategy: Shifting focus from unit tests to integration and behavioral verification.
- AI Evals: Why evaluation is the most critical aspect of building reliable AI products.
- Babysitting CI: Using simple AI agents to identify and retry flaky tests automatically.
Ole Lensmar: Okay, hello, welcome to another episode of the Cloud Native Testing Podcast. I am so happy to be joined today by Richard Li, who I've had the great privilege of working with for a couple of months at Ambassador Labs. That was like four and a half years ago, almost five. Richard, welcome.
Richard Li: Ole, great to be here. Thanks for having me.
Ole Lensmar: It's, I mean, I've been following you from a distance ever since I left Ambassador and it seems to me you're doing some super interesting stuff in the AI space. I'd love to talk about that, but please, why don't you just start off by introducing yourself and kind of telling us what we need to know.
Richard Li: Yeah, so I call myself an entrepreneur. I'm a B2B software entrepreneur, to be specific. So I met Ole many years ago. I'd started a Kubernetes company called Ambassador Labs, and we created two fairly popular CNCF projects. One was Telepresence, the other one was the Ambassador API Gateway. We just sold the company a few months ago, but about a couple of years ago, I actually became very interested in AI. And so I spent the last few years learning a lot about AI. And most recently, I just started a new company in cybersecurity and AI called Polar Sky. And for all of you who are out there and listening, if you are in the market for a job, please visit our website, polarsky.ai, and we are looking for engineers. So.
Ole Lensmar: PolarSky.ai. Don't forget to go over there. Okay, great. Thank you. That's super exciting. And congrats on the sale of Ambassador, then. But let's go back a little bit. In the context of cloud-native testing, which is obviously the umbrella I want to stay under for a little bit, Telepresence, to me, from what I remember, was a tool that did kind of help testing, and that was one of the use cases that...
Richard Li: Thank you.
Ole Lensmar: it was used for or kind of described for. Can you just elaborate a little bit on that? Where is it now, do you know? And how does it fit into those kinds of use cases today?
Richard Li: Yeah. Yeah, so I think when we started Telepresence, this microservices thing where you have multiple services was new. And so essentially what Telepresence did was it was sort of like a fancy domain-specific VPN. So it let you connect your local workstation, where you had all your code, with a remote Kubernetes cluster, where you had all the dependent services. So I think the initial use case was actually development, because if you have 50 databases or services you're depending on, it was very hard to develop without something like Telepresence. And lots of people use mocks, but mocks have all sorts of problems in this scenario, because you have to maintain them and then you have to get all these mocks running, and this is a much simpler approach. Very quickly, after the development use case, we realized that people really wanted a headless mode, because they want to run Telepresence in CI. And the reason for that is because if you want to run an automated test and it has, you know, 50 different dependent services, it's actually very useful to have sort of this VPN thing where, when you make a code change, you're not rebuilding and redeploying an entire Kubernetes cluster of services. You're just redeploying one service, but you're able to test it against a dependent Kubernetes cluster. And so that became, I think, the driving factor for Telepresence in more testing-type use cases. And obviously developers could run their tests locally as well, but CI and headless mode was, I think, the driving use case around testing and Kubernetes and CI.
Ole Lensmar: That's super interesting. I mean, we see a lot of people wanting to spin up ephemeral environments as part of their CI/CD. They're using vCluster or something like that where they provision the entire application. But with Telepresence, it's kind of a hybrid approach where you're basically plugging the service that you're testing into the dependencies running elsewhere, thanks to Telepresence. Is that correctly understood? Yeah.
Richard Li: Exactly, yep, exactly. And so it's sort of a little bit more efficient. I mean, you're setting up an ephemeral cluster, which doesn't take that much time if you know what you're doing, but then you have to deploy all the stuff that goes into the cluster, and your data. That's what gets a little more complex and time consuming. And so this lets you have a simpler sort of setup. Yeah.
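As a rough illustration of that headless mode, here is a minimal sketch of a CI step that uses Telepresence to run a local test suite against dependencies in a shared cluster. The `telepresence connect` and `telepresence quit` subcommands exist in the Telepresence CLI, but the pytest target and overall wiring shown here are assumptions, not the project's documented recipe.

```python
# Minimal sketch: run integration tests against in-cluster dependencies via
# Telepresence in headless/CI mode. The pytest path is illustrative.
import subprocess
import sys

def run(cmd: list[str]) -> int:
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=False).returncode

def main() -> int:
    # Connect the CI runner's network to the remote Kubernetes cluster so the
    # tests can reach dependent services by their cluster DNS names.
    if run(["telepresence", "connect"]) != 0:
        return 1
    try:
        # The service under test runs locally (e.g. started earlier in the CI
        # job); its many dependencies stay deployed in the shared cluster.
        return run(["pytest", "tests/integration", "-q"])
    finally:
        # Always tear the session down so the runner is left clean.
        run(["telepresence", "quit"])

if __name__ == "__main__":
    sys.exit(main())
```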
Ole Lensmar: Okay, okay. Have you seen, I mean, this was a couple of years ago, has Telepresence evolved, or how has that kind of-
Richard Li: It's still going. I mean, there's still quite a bit of development happening in the AI realm. There's now an MCP server that was just part of the last release. So now you can actually use Telepresence as a tool from your AI. So we're sort of slowly moving Telepresence into this brave new world of agentic coding and development.
Ole Lensmar: Yes, we're all rushing in there. Just off the top of my head, what would the use cases be for MCP, or like an agent interacting with Telepresence? Would it be to kind of provision or set up Telepresence, basically, or how would...
Richard Li: Yeah, to set up Telepresence or set up the different routes. It turns out there are a lot of options around how you want to route traffic. And so being able to manage all of that from, say, a Claude or Codex actually turns out to be kind of an interesting use case. It's very new, so I don't think there's a lot of usage, and MCP certainly has some of its problems, especially in terms of token consumption. But this is the standard we have today. I 100% believe agentic coding is actually the future, and exposing interfaces to AI tools is, I think, a big part of what we're doing with Telepresence.
Ole Lensmar: Okay, okay. And so you said agentic coding is the future, and I guess hand in hand with that there's agentic testing, or I'd like to think there is. So how do you, I mean, there's a lot to be said around testing for AI. I was just at a conference last week, TestCon, where there were a lot of talks around how testing keeps up with AI: both kind of MCP servers for testing tools, or how do we ensure that we're testing all this code that's being generated by AI and does that introduce new bottlenecks, and then how do we test AI itself, like evals and those kinds of things. I'd like to talk about all of them, but are any of those closer to your heart than another, or can you relate?
Richard Li: Well, so I think the most important thing is actually evals. So if you're using AI in your product, for example, you can't actually know if your AI is any good without a good eval. You know, TDD was very popular and then it sort of waned; you cannot actually develop an AI without evals. So I would say, especially when you're developing AI, you have to have a very robust eval suite, and you're never done. And I could talk for a long time, if you're interested, around different eval strategies. But essentially, you do have to take this sort of test-first or eval-first approach. Because otherwise, when you make a tweak to your prompt or you switch out your AI, you have no idea if things are better or worse. You just know for that one prompt, and you really need to have the full body of evaluation. So I think that's really important. And then in terms of agentic coding: agents actually make mistakes, they hallucinate, you know, we all know this, right? And I think one of the key guardrails you have when you're writing code with an agent is that you do need to have some tests. But I think the shape of those tests is actually a little bit different, because instead of the tests being of a more functional nature, because the AI in my experience generally gets the syntax pretty good and, functionally, it will deliver, you're really trying to create a testing infrastructure that does a better job capturing the intent of what you're trying to do. So that means more integration testing, more end-to-end testing, more behavior testing, and less around classic unit tests, because in my experience the agent just doesn't mess up the unit tests. My unit tests always pass. My integration tests sometimes fail.
Ole Lensmar: Because it hasn't captured the high-level, like, the desired behavior of what you're trying to do, or...
Richard Li: Yeah, or you say, I have this feature and I wanna code this feature, and it codes the feature, but it does so in a way that sort of breaks the core workflow, because it doesn't totally, quote unquote, understand what the product is actually supposed to do, right? So I think it's really important to capture the core functions of the product in tests, so that when you add a new feature, you're not suddenly breaking whatever your core function actually is. So you do have to adopt a little bit of a different mindset.
Ole Lensmar: Yes, so to me it feels like the quality mindset is even more important, or just as important, with AI-driven development, or AI-enhanced or whatever you want to call it. The fact that it's easy now to generate code and add new features, etc., just doesn't mean that we have to test less. Just as before, when we should be writing tests for the code we write, we should also be writing tests, or creating the applicable evals or testing, for the code we generate. So if you have a quality mindset when you're building code, and if you have a mindset of, I write tests, or I write code for testability, and you maintain that when you're starting to use AI, hopefully that would also result in you doing the things that you just mentioned. But people who are initially maybe not so used to writing tests, or creating intent tests or unit tests, they probably won't do it with the AI either. So that just multiplies, it leads to the same problems, but at a higher occurrence.
Richard Li: Yeah, I agree. And I think, you know, now that you have AI, there's really no excuse not to write tests, right? Because you could just tell the AI to write tests, right? And so one of the patterns we use is we basically ask, what's the core behavior of the system that we don't want to change? Let's just say your system is, I don't know, an e-commerce marketplace with a shopping cart. Well, then you need to be able to put stuff into the shopping cart and check out, right? You don't want to change that. So you can literally just tell your agent, hey, write me an automated test that takes something from the catalog, sticks it into the shopping cart, and does checkout. And then you can run that test as part of your test suite. That way, when you add more features, whenever it runs the tests, you know that your shopping cart, in theory, should work, right? And if you have an e-commerce site, that's pretty important, right? So I think what's really important is to capture these core things that are part of your application and codify them into a set of tests, so that way, when your agent's running... Because all of our agents, I mean, it's a little bit different when you're running a multi-agent case, but essentially whenever we're delegating a task to an agent, we're always making sure one of the acceptance criteria is that all the regressions pass, so.
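As a rough illustration of that shopping-cart example, here is a minimal behavioral regression test sketch. The HTTP endpoints, payloads, and the `SHOP_BASE_URL` variable are hypothetical; the point is to capture the catalog-to-checkout workflow as a test an agent must keep green.

```python
# Sketch of a behavioral regression test for the "core workflow" of a
# hypothetical e-commerce API: pick an item from the catalog, add it to the
# cart, and check out. Endpoints and payloads are illustrative only.
import os
import requests

BASE_URL = os.environ.get("SHOP_BASE_URL", "http://localhost:8080")

def test_catalog_to_checkout_flow():
    # 1. Something is available in the catalog.
    catalog = requests.get(f"{BASE_URL}/catalog", timeout=10).json()
    assert catalog, "catalog should not be empty"
    item = catalog[0]

    # 2. It can be added to a cart.
    cart = requests.post(
        f"{BASE_URL}/cart", json={"item_id": item["id"], "qty": 1}, timeout=10
    )
    assert cart.status_code == 201
    cart_id = cart.json()["cart_id"]

    # 3. The cart can be checked out.
    order = requests.post(f"{BASE_URL}/checkout", json={"cart_id": cart_id}, timeout=10)
    assert order.status_code == 200
    assert order.json()["status"] == "confirmed"
```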
Ole Lensmar: And could you also capture those kinds of behavioral requirements at the prompt level? So when you generate new code, like, I want to implement this feature, can you already at that level say, hey, you know, make sure that the shopping cart doesn't break, or take all the other high-level behaviors or requirements that we have in the system into account? Or would that make the context too big?
Richard Li: I think it just makes the context way too big and it starts to lose focus. And one of the patterns, I don't know if you've tried using Claude Code, is they actually use multi-agent. So you can actually create a plan, and typically when you use multi-agent, you end up with a builder agent and a tester agent, is sort of how it shakes out in my experience. So you can do that, but the reality is, the way I think about it, the tests are the context for verifying behavior. So I think it's just easier, instead of saying, hey, these are the 50 behaviors, as part of your prompt, to just say, run all the tests, right? And that is actually much more specific than trying to write down in text what the 50 core behaviors are, which no one ever remembers. So I definitely think an automated test suite that captures the core behaviors and functionality of your product is absolutely essential if you're using agentic coding.
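A minimal sketch of the "tests are the context" idea, assuming a pytest suite: run the regression suite after the agent's change and hand any failure output back to the agent, rather than enumerating behaviors in the prompt. The `send_to_agent` hook is a placeholder for however your agent loop accepts new context.

```python
# Sketch: use the regression suite as the agent's acceptance gate and context.
# After the agent's change, run the tests; if they fail, the failure output
# becomes the next piece of context handed back to the agent.
import subprocess

def run_regression_suite() -> tuple[bool, str]:
    """Run the test suite; return (passed, combined output)."""
    proc = subprocess.run(
        ["pytest", "tests/", "-q", "--maxfail=5"],
        capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def accept_or_reject(send_to_agent) -> bool:
    """Acceptance criterion: all regressions pass. `send_to_agent` is a stub
    for however you feed context back to your coding agent."""
    passed, output = run_regression_suite()
    if not passed:
        send_to_agent(
            "The change breaks core behavior. Fix the code so these tests pass:\n"
            + output
        )
    return passed
```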
Ole Lensmar: Okay, cool. Thanks. So I'd like to go back to the evals, though. Do you see, I mean, I'm guessing there are frameworks out there to help you automate evals, or write eval tests and then automate those as part of your pipeline. Are these just another type of test, just like there are load tests and functional tests, there are eval tests, and you run them every time you commit or whenever you think it's reasonable to do it? Or should we be thinking about them differently?
Richard Li: I think, I mean, there are different types of tests, exactly, that you should run. What I have found is that the biggest challenge with evals is that it turns out to be a data management problem, right? Because essentially, with an eval, whether you have a relational database or not, you basically have a huge corpus of data, which is your input, and then you have your expected output, and you're scoring on this in some way. And so you need some way to actually manage the input and output data in a structured way. And that turns out to almost certainly be a bespoke kind of workflow, because everyone's data looks a little bit different, right? So let's say you're building an AI that's classifying road signs. Well, then you need a giant database of road signs. And it's not a very complicated database, but you've got one column, which is the blob, the image, and then you have to have your label. So it turns out that eval is actually a data management problem, plus a lightweight harness to feed the data one row at a time from your database into your AI, and then the AI responds, and then you have to figure out some way to score it, which is sort of another set of problems. Yeah.
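Here is a minimal sketch of that shape, assuming the eval cases live in a SQLite table of inputs and expected outputs. The `call_model` function and the exact-match scorer are placeholders; as Richard notes, scoring is usually the harder part in practice.

```python
# Sketch: an eval as a data management problem plus a lightweight harness.
# Rows of (input, expected) live in SQLite; each row is fed to the model and
# scored. `call_model` and the exact-match scorer are stand-ins.
import sqlite3

def call_model(prompt: str) -> str:
    """Stand-in for your AI call (API request, local model, agent, ...)."""
    raise NotImplementedError

def score(expected: str, actual: str) -> float:
    """Simplest possible scorer: exact match. Real evals often need rubric
    or model-graded scoring here."""
    return 1.0 if expected.strip() == actual.strip() else 0.0

def run_eval(db_path: str = "evals.db") -> float:
    """Feed every eval case to the model and return the mean score."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT input, expected FROM eval_cases").fetchall()
    conn.close()
    if not rows:
        return 0.0
    total = sum(score(expected, call_model(input_text)) for input_text, expected in rows)
    return total / len(rows)
```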
Ole Lensmar: Okay, and so I've been playing around with the tools out there like DeepEval and others. Are those things that are applicable in these use cases or do you feel like...
Richard Li: Yeah, I mean, I would say the biggest one is this thing called Braintrust, which is an open source eval solution, I think, maybe I'm getting that wrong. We don't actually use any of them, just because the shape of our data is actually kind of awkward. So I just use an agent, and over the last month I've built this thing that sort of solves our problem for now. But yeah, there are some open source eval studios, is really what they are. Essentially, they apply basically XML, I kid you not, labels to all this stuff to kind of help you structure it. Other people say they get a lot of value out of it. I'm like, well, the problem with that is...
Ole Lensmar: Really?
Richard Li: Codex and Claude can't manage that thing for me, so then I have to click around the UI. So therefore I'm like, I'd rather just deal with code, so I just have Codex and Claude code this thing for me. Your mileage may vary, so.
Ole Lensmar: Okay, interesting. Okay, and the last thing, so that was the evals, and then the other one was just using, I guess you touched on that a little bit, AI to generate the tests themselves, and different types of tests. So you mentioned end-to-end tests. Is that something that you've done successfully, like, based on the requirements or the desired behavior, generate, I don't know, Playwright tests or something like that? How well do you think that's working, and is that something you can rely on? Or, how much can you rely on those tests actually verifying what you asked them to verify?
Richard Li: So I definitely think Playwright or something like that is a great idea. And here's the thing. The big problem with this stuff is test flakes in CI. But again, you have AI. So like 30%, 40% of the time, you can resolve a flake by just restarting the job. So all you do is you code up an agent that talks to your CI. And whenever it fails, you just have the agent look at the log failures, and you give the agent like three tools, right? And one of the tools is just restart the job. And you just tell the agent, if you think restarting the job might solve the problem, try it. Just don't restart the job more than once, or something, right? And it turns out, if you do something pretty simple like that, and then you can make the agent fancier over time, just by telling the agent to babysit CI, you can reduce your CI flakes by 30%, right? Just like that. Because 30% of the time, you just get a CI flake and you just restart the CI. I mean, honestly, if you got a CI flake notification, or a CI failure notification, what do you do? You restart the job, and then you only look at it the second time, if it fails again with the same error. So it's really not that hard to program an AI agent to take that very simple human heuristic and babysit it, and then suddenly you have a CI where, if it fails, it's like, well, there's probably a good reason it fails. So yeah, it completely changes how I think about flakes, refactoring, all this stuff, because I'm just not scared anymore. I'm just like, well, we have to change from this framework to this framework, or we've got to upgrade from Java 12 to Java 14 or whatever, and I don't even know anything about that. It's just not scary. Just go do it.
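A minimal sketch of that babysitting pattern. The CI calls (`get_job_log`, `restart_job`, `notify_human`) are hypothetical stand-ins for your CI system's API, and the simple log check stands in for the judgment an agent would make from the failure log.

```python
# Sketch of the "babysit CI" pattern: on failure, decide whether a single
# restart is worth trying. The CI helpers passed in are hypothetical; an
# agent reading the log would replace `looks_like_flake` in Richard's setup.
RETRIED: set[str] = set()   # job ids we've already restarted once

def looks_like_flake(log: str) -> bool:
    # Placeholder judgment: common transient-failure signals in the log.
    flaky_signals = ("connection reset", "timeout", "econnrefused", "429")
    return any(signal in log.lower() for signal in flaky_signals)

def handle_failed_job(job_id: str, get_job_log, restart_job, notify_human) -> None:
    log = get_job_log(job_id)
    if job_id not in RETRIED and looks_like_flake(log):
        RETRIED.add(job_id)          # never restart the same job more than once
        restart_job(job_id)
    else:
        notify_human(job_id, log)    # a real failure: escalate with the log attached
```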
Ole Lensmar: So one interesting thing there is it feels like we're kind of veering toward using AI for very simple heuristics, to your point, right? The test fails, we rerun it. Are we kind of overusing AI? Because that's a heuristic that you could just program into your CI/CD, you could configure something, right? And it's something that we're looking at for, you know, Testkube: do we just build in agents that can do all these things, like detect, to your point, flaky tests and rerun them, or do we add a feature flag that says rerun the test if it fails the first time? And the allure, of course, of the agents is that they could do anything, but for the most common use cases, even that will consume some tokens and it'll be slower than at least having something hard coded. So where do we draw the line between what we build agents for and what we actually build non-intelligent heuristics or functionality for? Does that question make sense?
Richard Li: Yeah, so that's a great question. And it's actually a very active debate. Depending on who you talk to, the debate can be characterized as the one-big-model versus the one-big-workflow debate. OpenAI is sort of the proponent of the one big model: we're gonna build the cheapest, best AI, it's like a human brain, so just shove everything into it and it'll figure it out. So that's sort of OpenAI. At the other end of the spectrum, you have more of the workflow vendors, like LangChain or Akka or Temporal, which are a little bit more focused around, hey look, if you can create deterministic rules, you should do that, because it's just a lot cheaper and very easy to debug, and then use the AI for the stuff where rules don't make sense, right? Anthropic sort of sits in the middle. So it's an active debate, and I don't think there's a clear answer. I would say I lean a little bit more toward the workflow side, just because of the debuggability. So to your point, on the CI flake thing, I would probably code it as a deterministic heuristic, but I would put it as part of an AI agent as well, and see how well the deterministic heuristic, you know, three lines of Python, works. I think the value of having an agent look at it first would be that the agent could, in theory, be like, well, you know, this error doesn't look like a flake. So it can add some level of judgment. You know, I think you kind of have to experiment, but there's actually been many a Twitter thread or X thread on this question. Yeah.
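A sketch of that layering, deterministic rule first and agent judgment second, might look like the following; `ask_agent` is a hypothetical hook into whatever model or agent framework you use.

```python
# Sketch: "workflow first, model second". A cheap deterministic rule caps the
# retries; the agent is only consulted to veto retries that clearly are not
# flakes (e.g. a compile error), so tokens are spent only on judgment.
def deterministic_retry_rule(attempt: int, max_retries: int = 1) -> bool:
    # The "three lines of Python": retry once, never more.
    return attempt < max_retries

def should_restart(job_log: str, attempt: int, ask_agent) -> bool:
    if not deterministic_retry_rule(attempt):
        return False  # hard cap, no model call needed
    # Optional judgment layer: the agent answers a yes/no question about the log.
    return ask_agent(
        "Does this CI failure look like a transient flake worth retrying? Log:\n"
        + job_log
    )
```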
Ole Lensmar: Okay. Okay, well, thanks for broadening my horizons on that, because I can definitely see sometimes, like, it could just do everything. Why do we add features at all? We could just have an agent and ask it to do everything for us. And like you said, why build deterministic heuristics around something? Okay, that's super interesting. So for the product that you're building, it sounds like testing is top of mind for you, is that true?
Richard Li: Well, I mean, because we're building an AI cybersecurity product, it's only as good as the evals, right? So basically, most of our testing effort has been invested in evals. And so even though we're a very young company, we have almost a thousand eval cases now, right? Because that's kind of what you need in order to actually figure out if your AI is any good. So yeah, I would say every single AI company that's legitimate and has aspirations, they've got to have an eval system. Otherwise, they're just blowing smoke, because there's no way they could be correct. Yeah.
Ole Lensmar: Okay. Okay. Awesome. Richard, thank you so much. It was fascinating. I learned a lot, and it was great talking to you. Thank you so much for sharing everything. Okay. Thanks. Thanks. Take care. Bye-bye.
Richard Li: Yeah. All right. Thanks, Ole. See you later. Bye.