**kaoudis** @kaoudis@infosec.exchange · Sep 18, 2024, 08:51

**kaoudis** @kaoudis@infosec.exchange · Sep 18, 2024, 08:51

kaoudis @kaoudis@infosec.exchange

Sep 18, 2024, 08:51

If you’re generating code, and you’re *not* doing it with an LLM, is it reasonable to use metrics like F1 and recall to measure how well the tools you use are doing? This is bothering me because it feels a bit weird to apply metrics like this to static analyses, build tooling frameworks, or things that just plain don’t have any recall to begin with.

**Simon** @spoltier@qoto.org · 2024-09-19T08:51:17Z

Simon @spoltier@qoto.org

@kaoudis I'm not too familiar with the development/testing process for such tools. I would say if you have enough* representative* data, why not?

*depends on use case, target audience etc. of course.

Sep 19, 2024, 08:51 · · Tusky · · ·

**kaoudis** @kaoudis@infosec.exchange · Sep 19, 2024, 09:23

**kaoudis** @kaoudis@infosec.exchange · Sep 19, 2024, 09:23

Sep 19, 2024, 09:23

kaoudis @kaoudis@infosec.exchange

@spoltier like I think it “works” if you want to say something like “tools that aren’t models don’t have any recall” but I don’t think it works if you want to say “objectively how did our method do at deduplicating test cases versus other types of approaches, and what types of tools can make the most unique test cases”. I think they were aiming to use data to say the former, but I don’t think it’s sufficient justification for using an LLM to do something is why it bugged me 😅

**kaoudis** @kaoudis@infosec.exchange · Sep 19, 2024, 09:36

**kaoudis** @kaoudis@infosec.exchange · Sep 19, 2024, 09:36

Sep 19, 2024, 09:36

kaoudis @kaoudis@infosec.exchange

@spoltier the thing that I didn’t like, dependent on my understanding of basic LLM evaluation (which could!!! be wrong), is that metrics like recall are about how well the tool being measured did at producing information that aligns with ground truth information from a reference dataset. If there was no training of the tool that could take place since the tool was not a model, the tool doesn’t have a dataset to draw from to compare its output from to that ground truth.

Trending now

Resources

Developers

What is Mastodon?

qoto.org

More…