Here are the slides for my #PyDataLondon keynote on LLMs from prototype to production ✨

Including:
◾ visions for NLP in the age of LLMs
◾ a case for LLM pragmatism
◾ solutions for structured data
◾ spaCy LLM + prodi.gy

speakerdeck.com/inesmontani/la

🎊🎁Big release of dirty-cat
dirty-cat.github.io/stable/

Broader focus: simplifying the preparation of non-curated dataframes for machine learning.
🔸Encoding of messy dataframes: a strong baseline for easy machine learning
🔸fuzzy_join: joining dataframes (pd.merge) despite typos
🔸Deduplication: matching categories with typos (see the sketch below)
🔸Feature augmentation: joining on an external data source to enrich tabular data
🔸Embeddings of cities, companies, locations...
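
As a taste of the deduplication entry, a minimal sketch using dirty-cat's `deduplicate` helper; the exact defaults and return type are assumptions, so check the release docs linked above:

```python
from dirty_cat import deduplicate

# Hand-typed categories with typos and spelling variants
dirty = ["online course", "online courses", "onlin course",
         "book", "books", "boook"]

# Clusters similar strings (character n-gram similarity) and maps each
# entry to a representative spelling of its cluster (assumed behavior)
clean = deduplicate(dirty)
```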

Tabular data can benefit from merging external sources of information.

The FeatureAugmenter is a scikit-learn transformer that augments a given dataframe with joins on reference tables.
dirty-cat.github.io/stable/gen
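
A minimal sketch of that pattern, assuming the `tables` / `main_key` parameters from the dirty-cat docs; the GDP table and its values are made up for illustration:

```python
import pandas as pd
from dirty_cat import FeatureAugmenter

# Main table to enrich
main = pd.DataFrame({"Country": ["France", "Germany", "Italy"]})

# Reference table (hypothetical numbers)
gdp = pd.DataFrame({"Country": ["France", "Germany", "Italy"],
                    "GDP_trillion_usd": [2.78, 4.07, 2.01]})

# Fuzzy-joins each (table, key) pair onto the main dataframe,
# usable as a step inside a scikit-learn Pipeline
fa = FeatureAugmenter(tables=[(gdp, "Country")], main_key="Country")
augmented = fa.fit_transform(main)  # main columns + GDP_trillion_usd
```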

fuzzy_join makes the join robust to vocabulary mismatches. Hyperparameter optimization can tune the matching for prediction.
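
A sketch of that, assuming `fuzzy_join`'s `on` and `return_score` parameters as shown in the dirty-cat docs; the exact output columns are an assumption:

```python
import pandas as pd
from dirty_cat import fuzzy_join

left = pd.DataFrame({"Country": ["France", "Italia", "Spanish"]})
right = pd.DataFrame({"Country": ["France", "Italy", "Spain"],
                      "Capital": ["Paris", "Rome", "Madrid"]})

# Unlike an exact pd.merge, the join matches nearest-neighbor strings,
# so "Italia" and "Spanish" still find their rows despite the mismatch
joined = fuzzy_join(left, right, on="Country", return_score=True)
# The returned match score can be thresholded, and that threshold tuned
# by hyperparameter optimization against downstream prediction quality
```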

For such external information, dirty-cat can download embeddings of Wikipedia data on millions of entities: companies, cities, geographic locations...
dirty-cat.github.io/stable/aut
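
A hedged sketch of pulling those entity embeddings; the helper name and the `types` filter follow my reading of the dirty-cat docs and may differ, so treat the whole call as an assumption:

```python
# Helper name per the linked docs at the time; verify in the API reference
from dirty_cat.datasets import get_ken_embeddings

# Download pre-computed Wikipedia (KEN) entity embeddings, filtered to
# companies (the `types` argument is an assumption)
companies = get_ken_embeddings(types="company")

# One row per entity plus its embedding dimensions, ready to fuzzy-join
# onto a tabular dataset as extra features
print(companies.head())
```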

Productive weekend! Just added 4 new Q&A's!

- Multi-GPU Training Paradigms
- The Distributional Hypothesis
- "Self"-Attention
- Training & Test Set Discordance

And "Machine Learning Q and AI" just crossed the 50% milestone! 🎉

PS: I included the Multi-GPU Training Paradigms section in the free preview at
leanpub.com/machine-learning-q

CALL FOR PAPERS: Research and Innovation Track

We welcome papers on novel scientific research and/or innovations relevant to #SemanticWeb, #KnowledgeGraphs, #AI, #ML, #NLP and more

Deadlines:
🗓️Abstracts: May 09
🗓️Papers: May 16

For more info: 🌐2023-eu.semantics.cc/page/cfp_

On our own behalf: heise online is moving to its own Mastodon instance

The chaos at Twitter continues, and Mastodon keeps benefiting from it. Heise Medien now runs its own instance on the Fediverse network.

heise.de/news/Twitter-Alternat

#Fediverse #Heise #Mastodon #SocialMedia #Twitter #TwitterÜbernahme #heiseonline

Do you love #selfhosting? What about providing service to the public via #Codeberg?

We are looking for maintainers to take on adding code search to our #Forgejo instance, reducing the load on the existing infrastructure team and bringing this project forward.

Please see codeberg.org/Codeberg/Communit if you are interested.

We look forward to your contributions. Thank you very much!

Code Search: Looking for maintainers

An existing issue (#379) tracks the feature to enable code search on Codeberg. To move this forward, I'd like to put together a team of people available to experiment with it and maintain the setup in the future. The plan could look as follows:

- discuss the resource allocation (memory, CPU, and disk storage) with the Codeberg infrastructure team
- create an LXC container on our infrastructure matching the above results
- investigate the use of [ZincSearch](https://github.com/zinclabs/zinc) vs. OpenSearch
- configure Forgejo / Gitea to connect to the setup using a test instance (see the sketch below)
- adapt and finish [this pull request](https://codeberg.org/Codeberg/forgejo/pulls/47) to enable code search per repository (opt-in)
- enable code search on the production instance and continue maintaining the setup inside the LXC container

If this sounds interesting to you, please reach out here. We're available to collaborate closely with you, but it would be great if a dedicated team could push the effort and iterate independently. Thank you very much!

Useful links:
- Forgejo config: https://docs.gitea.io/en-us/config-cheat-sheet/#indexer-indexer
- #379 with experiences
- connect Forgejo to ZincSearch: https://github.com/zinclabs/zinc/issues/538#issuecomment-1251748395
- consider Sourcegraph as an alternative to Forgejo's built-in search: https://about.sourcegraph.com/
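
For the "configure Forgejo / Gitea" step, a hedged sketch of the `[indexer]` section documented in the config cheat sheet linked above; the ZincSearch host, port, and `/es` path are assumptions based on its Elasticsearch-compatible API, not tested values:

```ini
; app.ini on the test instance — placeholder values
[indexer]
REPO_INDEXER_ENABLED = true
; ZincSearch speaks an Elasticsearch-compatible protocol, so the
; elasticsearch indexer type is the natural candidate to point at it
REPO_INDEXER_TYPE = elasticsearch
; assumed ZincSearch endpoint (default port 4080, ES-compatible /es path)
REPO_INDEXER_CONN_STR = http://localhost:4080/es
```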


The most interesting thing about #ChatGPT that no one is talking about is how the future will be systems talking to each other over imprecise protocols while still being able to understand each other

And the year has barely started!

RT @MishaLaskin@twitter.com

In-context RL at scale. After online pre-training, the agent solves new tasks entirely in-context like an LLM and works in a complex domain. One of the most interesting RL results of the year. twitter.com/FeryalMP/status/16

🐦🔗: twitter.com/MishaLaskin/status
