**Simon Gray** @simongray@indieweb.social · Oct 06, 2023, 08:49

I'm writing a paper with some colleagues where the idea is to construct a test dataset for #LLMs (such as those provided through #ChatGPT ) that specifically tests their knowledge of #Danish.

We are generating many of the tests from data found in the Danish #WordNet (#Dannet) which I maintain.

I am writing the #SPARQL queries that get the data out, and it's really turning into a test of DanNet itself, since the strength of the various (underspecified) RDF predicates varies quite a bit.

**Simon Gray** @simongray@indieweb.social · Oct 06, 2023, 08:54

In some cases, the relations created by the predicates are quite strong and consistent, so generating sentences from them will be an easy task.

In most cases, however, I need to add various guard filters to my SPARQL queries: the relations are entirely logical within their own contexts, but the contexts differ too much to reliably generate sentences from. For example, a "fillet" is part of a "body" in a gastronomical context, but it gets weird when mixed with traditional body parts.

**Simon Gray** @simongray@indieweb.social · Oct 06, 2023, 09:03

And it's really putting the weakness of synsets as a concept on display (synsets are sets of synonyms which constitute the basic components of meaning in a WordNet).

For example, based on part-whole relationships we can generate the following sentence:

"et system kan have en krummerik"
("a system can have a bent, erect penis" 😅)

This comes about since one meaning of "system" is in the same synset as "body", while "krummerik" (bent, erect penis) is in a part-whole relationship with body. 🤷‍♂️
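
To make the mechanism concrete, here is a minimal Python sketch (not the author's code, which is Clojure and SPARQL). The synset names, the lemma lists, and the `part_whole_sentences` helper are all invented for illustration and are not actual DanNet data. It shows how a part-whole relation defined between synsets gets expanded to every lemma in the "whole" synset, including the ones where it makes no sense.

```python
# Toy data, invented for illustration; not actual DanNet content.
# Each synset maps to its (article, lemma) members.
SYNSET_LEMMAS = {
    "body": [("en", "krop"), ("et", "system")],
    "penis": [("en", "krummerik")],
}

# Part-whole relations hold between synsets, not between individual lemmas.
PART_OF = [("penis", "body")]  # (part synset, whole synset)


def part_whole_sentences(synset_lemmas, part_of):
    """Generate one sentence per (whole lemma, part lemma) combination."""
    sentences = []
    for part, whole in part_of:
        for whole_article, whole_lemma in synset_lemmas[whole]:
            for part_article, part_lemma in synset_lemmas[part]:
                sentences.append(
                    f"{whole_article} {whole_lemma} kan have "
                    f"{part_article} {part_lemma}"
                )
    return sentences


sentences = part_whole_sentences(SYNSET_LEMMAS, PART_OF)
for sentence in sentences:
    print(sentence)
```

Because the relation is attached to the synset as a whole, every lemma in it inherits the relation, and the sentence about "system" comes out alongside the sensible one about "krop".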

**Simon Gray** @simongray@indieweb.social · Oct 06, 2023, 09:47

Other than writing increasingly complex SPARQL queries, I'm also having to apply a general filter to every result row.

I use a #Clojure transducer for this task, comprising an ever-expanding set of lemma filters.

21fabd82ed950f92.png
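
The actual transducer is only visible in the attached image, so the following is a rough Python analogue of the idea rather than a reconstruction of it. The row shape (`part`/`whole` keys), the blocked lemma set, and the individual predicates are assumptions made up for the example. Python generators are not true transducers, but composing lazy filter stages gives a similar "one pass, many filters" structure.

```python
def drop_if(pred):
    """Return a stage that lazily removes every row matching pred."""
    def stage(rows):
        return (row for row in rows if not pred(row))
    return stage


def compose(*stages):
    """Chain stages left to right into a single row filter."""
    def pipeline(rows):
        for stage in stages:
            rows = stage(rows)
        return rows
    return pipeline


# Hypothetical filters; a real list would grow as new oddities show up.
BLOCKED_LEMMAS = {"krummerik"}


def has_blocked_lemma(row):
    return any(lemma in BLOCKED_LEMMAS for lemma in row.values())


def is_multiword(row):
    return any(" " in lemma for lemma in row.values())


def same_lemma(row):
    return row["part"] == row["whole"]


row_filter = compose(
    drop_if(has_blocked_lemma),
    drop_if(is_multiword),
    drop_if(same_lemma),
)

rows = [
    {"part": "finger", "whole": "hånd"},
    {"part": "krummerik", "whole": "system"},
    {"part": "hånd", "whole": "hånd"},
    {"part": "sort hul", "whole": "univers"},
]
kept = list(row_filter(rows))
print(kept)
```

Adding a new lemma filter then only means writing one more predicate and appending it to the `compose` call.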

**Simon Gray** @simongray@indieweb.social · Oct 06, 2023, 18:09

Another consideration I've had is what to do about #reproducibility.

Currently, my code generates a dataset consisting of results based on a wide range of random samples. Ideally someone who loads the same graph and runs the same experiment should produce exactly the same output.

This is what I have currently (#clojure), but it's not completely done.

58a8e532d8c7233b.png
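
One common way to get there, sketched in Python under stated assumptions (the attached Clojure code is not reproduced here, and `reproducible_sample` is a hypothetical helper): put the result rows into a fixed order before sampling, since SPARQL does not guarantee result order without `ORDER BY`, and draw from a random generator with an explicit seed.

```python
import random


def reproducible_sample(rows, k, seed):
    """Sample up to k rows so that the same rows and seed give the same sample."""
    # Fix the order first: the sample must not depend on the order
    # in which the query engine happened to return the rows.
    ordered = sorted(rows)
    # A dedicated generator avoids interference from other random calls.
    rng = random.Random(seed)
    return rng.sample(ordered, min(k, len(ordered)))


# Invented (part, whole) rows for illustration.
rows = [("finger", "hånd"), ("tå", "fod"), ("blad", "træ"), ("hjul", "bil")]

first = reproducible_sample(rows, 2, seed=42)
second = reproducible_sample(list(reversed(rows)), 2, seed=42)
print(first == second)
```

This only holds for the same graph and the same runtime; a different library version may implement sampling differently, so recording the version alongside the seed is a reasonable precaution.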

**(webdev Tory)** @worldsendless@qoto.org · Oct 06, 2023, 19:02

@simongray what a cool project! I have done some work with Wordnet (English), but that was in C++ I think. But I am very interested in its non-big-data uses for language work. I love seeing what you are working on in Clojure.
