Leshem Choshen

A new benchmark for data 📚
Rather than testing whether a model is good,
it tests whether you can filter data
360 languages

They also share metrics for data redundancy if you want just those
arxiv.org/abs/2311.06440
github.com/toizzy/
#data #preprocessing #dedup #enough2skim #NLP #NLProc

weakmath

So this is the #unofficial #opening of #ICFCA2023 with a talk by Johannes Hirth on #preprocessing and #scaling contextual data.

Open Art Data

Extremely noticeable #KNOWLEDGE GAPS of ChatGPT in the #history of #Holocaust-related art claims make it clearer than ever how urgent it is to understand the data #pipelines that feed the #AI language model.

What #filters are used in #OpenAI's data #preprocessing to EXCLUDE information? Who decides which information to exclude? What triggers exclusion?

#ChatGPT fills gaps with plausible-sounding disinformation, which is a disaster.

#EHRI #YadVashem #memory #looted #histodon #tech #FAIR