Llama 3.2 is out, and it's a much more substantial release than the 3.1 to 3.2 version bump might indicate.
Four new models, including Meta's first two vision models (11B and 90B) and two new text-only small models (1B and 3B).
My notes so far: https://simonwillison.net/2024/Sep/25/llama-32/
RTO or GTFO? Last week, Amazon announced that starting in 2025, all employees will be in a physical office 5 days a week. While putatively intended to strengthen culture (as "the world's largest startup"), the move is proving unpopular with many. Will other companies follow suit? On today's episode, @ahl and I are going to talk about remote work, RTO mandates -- and the physicality of both innovation and organization. Join us and share your perspective, 5p Pacific!
The latest xkcd made me wonder about dials being counterclockwise (as they are in the comic) vs. clockwise as I'm used to seeing. Is this a US thing?
@TomF I'm right-handed and started having minor issues with my wrist while using a mouse, so about 15 years ago I switched to mousing left-handed on my work machine, while (back when we still went to offices) my home machine stayed mouse-right.
It took a few weeks to really get used to it, but it wasn't bad. And now I'm mouse-ambidextrous and generally free of wrist troubles.
In the course of my unachievable hobby project on SAT model counting, I've figured out a neat trick: a reasonably efficient method for doing set operations on SAT problems in CNF form (without increasing the number of variables), e.g. taking the intersection or set difference of two SAT problems.
Is this a known thing? I've not seen it before but it seems like it must be.
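(Aside, and not the trick being referred to above: the intersection case at least is standard. With clauses written as lists of DIMACS-style signed literals, conjoining two formulas' clause lists intersects their sets of satisfying assignments, with no new variables needed. A minimal sketch:)

```typescript
// A CNF formula: a conjunction of clauses, each clause a disjunction of
// literals. Literals use the DIMACS convention: 3 means x3, -3 means NOT x3.
type Clause = number[];
type CNF = Clause[];

// Intersection of model sets: an assignment satisfies the combined formula
// iff it satisfies both inputs, so simply conjoining the clause lists
// intersects the solution sets without adding any variables.
function intersect(a: CNF, b: CNF): CNF {
  return [...a, ...b];
}

// Example: models(x1) ∩ models((x1 OR x2) AND NOT x2) = { x1=true, x2=false }
const left: CNF = [[1]];
const right: CNF = [[1, 2], [-2]];
console.log(intersect(left, right)); // [[1], [1, 2], [-2]]
```

Set difference is the interesting part: negating a CNF formula naively either blows up or needs fresh Tseitin variables, so presumably that's where the trick does its work.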
"The fairy photon enters the aperture and sings as it makes its way to the CMOS chip.
What's that? The permission troll just crawled out from under the Broadcom WiFi chip where it spends most of its time punching mean Internet traffic, and swatted the fairy photon away!
Oh no!
Halide needs your help, fair wise iPhone owner.
Close your eyes and make a wish and tap your heels together three times.
Tell the permission troll to stay away! Let the fairy photons come home to CMOS."
Here’s a statement from Bloom’s 1986 /The Closing of the American Mind/, with which I’m unhealthily obsessed:
“The best point of entry into the very special world inhabited by today’s students is the astonishing fact that they usually do not, in what were once called love affairs, say, ‘I love you.’”
Were you a student in 1986? Did you, at that time, say “I love you” to a romantic partner?
Ok I made this into a blog post, you're welcome
https://hazelweakly.me/blog/cache-me-not-cache-me-cache-me-not/
But seriously, caching is hard, really hard, but you can make life WAY easier for yourself when building an SPA if you do this super simple thing.
Break down all your content along two axes:
1. push vs pull
2. owned vs user
Whenever possible, turn your pull assets into push assets and all of your user assets into owned assets.
What do I mean by that? Let's break it down further 👇
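(A toy sketch of the idea, mine rather than from the linked post: classify each asset along the two axes and let that classification decide how aggressively it can be cached. The specific Cache-Control values are just illustrative.)

```typescript
// Toy illustration (not from the linked post): classify assets along the two
// axes and derive a caching policy from the classification.
type Delivery = "push" | "pull";   // produced ahead of time vs fetched on demand
type Ownership = "owned" | "user"; // controlled by the site vs per-user content

interface Asset {
  url: string;
  delivery: Delivery;
  ownership: Ownership;
}

// Hypothetical policy: push + owned assets can sit in shared caches almost
// forever; user content should never be shared; everything else gets a
// short, revalidating lifetime.
function cacheControl(asset: Asset): string {
  if (asset.delivery === "push" && asset.ownership === "owned") {
    return "public, max-age=31536000, immutable";
  }
  if (asset.ownership === "user") {
    return "private, no-store";
  }
  return "public, max-age=60, stale-while-revalidate=300";
}
```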
@spoltier the thing that I didn’t like, based on my understanding of basic LLM evaluation (which could!!! be wrong), is that metrics like recall are about how well the tool being measured did at producing information that aligns with ground-truth information from a reference dataset. Since the tool isn’t a model and no training took place, it has no dataset to draw from when comparing its output to that ground truth.
@spoltier like I think it “works” if you want to say something like “tools that aren’t models don’t have any recall”, but I don’t think it works if you want to say “objectively, how did our method do at deduplicating test cases versus other types of approaches, and what types of tools can make the most unique test cases?”. I think they were aiming to use data to say the former, but I don’t think that’s sufficient justification for using an LLM to do it, which is why it bugged me 😅
If you’re generating code, and you’re *not* doing it with an LLM, is it reasonable to use metrics like F1 and recall to measure how well the tools you use are doing? This is bothering me because it feels a bit weird to apply metrics like this to static analyses, build tooling frameworks, or things that just plain don’t have any recall to begin with.
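(For reference, the textbook definitions being argued over, as a quick sketch; the example counts are made up.)

```typescript
// Standard definitions: recall and F1 are computed against a labelled
// reference ("ground truth") dataset, which is the crux of the objection.
interface Counts {
  tp: number; // true positives
  fp: number; // false positives
  fn: number; // false negatives
}

const precision = ({ tp, fp }: Counts) => tp / (tp + fp);
const recall = ({ tp, fn }: Counts) => tp / (tp + fn);
const f1 = (c: Counts) =>
  (2 * precision(c) * recall(c)) / (precision(c) + recall(c));

// Example: 8 correct findings, 2 spurious, 4 missed.
console.log(recall({ tp: 8, fp: 2, fn: 4 })); // 0.666...
console.log(f1({ tp: 8, fp: 2, fn: 4 }));     // ~0.727
```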
@kaoudis I'm not too familiar with the development/testing process for such tools. I would say if you have enough* representative data, why not?
*depends on use case, target audience, etc., of course.
Earlier this year, I worked on a side project to hack a car in JavaScript and finally found the energy to write the blog post about it! 🚙 📡
https://charliegerard.dev/blog/replay-attacks-javascript-hackrf
I have played a little bit with OpenAI's new iteration of GPT, GPT-o1, which performs an initial reasoning step before running the LLM. It is certainly a more capable tool than previous iterations, though still struggling with the most advanced research mathematical tasks.
Here are some concrete experiments (with a prototype version of the model that I was granted access to). In https://chatgpt.com/share/2ecd7b73-3607-46b3-b855-b29003333b87 I repeated an experiment from https://mathstodon.xyz/@tao/109948249160170335 in which I asked GPT to answer a vaguely worded mathematical query which could be solved by identifying a suitable theorem (Cramer's theorem) from the literature. Previously, GPT was able to mention some relevant concepts but the details were hallucinated nonsense. This time around, Cramer's theorem was identified and a perfectly satisfactory answer was given. (1/3)
"I don't want to live in a world where five companies dictate everything we do."
#Nextcloud founder kicking off the conference