uspol, doubting one's sanity, empiricism, wasting resources?
So yesterday I ended up in a situation where I was in disagreement about what I thought I could clearly hear in a video. Since it sounded perfectly clear to me, and the topic of the related discussion was politically charged, _and_ I have no reason to doubt the other participant's honesty about what they say they are hearing, this is pretty concerning. I see three options:
1. I am so influenced by propaganda my basic senses are broken.
2. The above, but for the other participant.
3. This specific video is an auditory case of blue/black vs white/gold dress.
I think the odds are about 5/80/15. I kind of hope it's 3, though; it would mean the propaganda is not strong enough to warp the minds of intelligent people that badly. If it is 1, I obviously need to at least make a drastic change in the media I am consuming, and probably re-evaluate a lot of stuff.
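To make the 5/80/15 framing concrete, here is a toy Bayesian update. The likelihood numbers are invented purely for illustration (they are not estimates of anything); the point is only the mechanics of how test results would move the odds.

```python
# Toy Bayesian update of the 5/80/15 prior over the three hypotheses.
# The likelihoods are MADE UP for illustration, not measurements.

def update(prior, likelihood):
    """Multiply each prior by its likelihood and renormalize."""
    posterior = [p * l for p, l in zip(prior, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]

# P(1: my senses broken), P(2: theirs broken), P(3: dress-like ambiguity)
prior = [0.05, 0.80, 0.15]

# Hypothetical observation: independent listeners mostly disagree with
# each other -- an outcome much more likely under hypothesis 3.
likelihood = [0.2, 0.2, 0.9]

posterior = update(prior, likelihood)
print([round(p, 3) for p in posterior])  # → [0.033, 0.525, 0.443]
```

Even a strongly hypothesis-3-flavored observation leaves 2 in the lead here, which is why several independent tests are planned.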
This toot is mostly a pre-commitment, so that I follow up on my attempt to settle this. My plan is as follows, mostly in order of effort needed:
0. Look at the auto-generated captions on the YT video. If this confirms what I hear this would be _extremely weak_ evidence against 1. There might not even be auto-captions enabled for the video and I am not sure if manual captions can be distinguished from automatic ones.
1. Extract the crucial part of the sound from the video and re-upload it to YT with no real visuals attached and no suggestive title. Check the auto-captions there. This could be weak to moderate evidence for any of the above.
2. Same but with a different system than YT. I'll probably pick a couple options from this page: https://fosspost.org/open-source-speech-recognition/ . They all would be weak to moderate evidence for any of the above, in aggregate they are strong if in agreement.
3. Use Mechanical Turk to ask people about what they hear. **If anyone knows a reasonable non-amazon alternative, let me know.** This would be strong evidence towards something, with the possibility of bias due to people being familiar with the content.
4. Same as above, but cut the audio into separate words to limit bias.
If too many of the steps fail (producing no reasonable output) I can fall back on using the single words to ask friends who are hopefully unfamiliar with the context, but this would be kind of weak. I might skip some later steps if previous steps produce sufficient agreement or if they turn out to be too expensive (I don't really know the rates on mturk...).
Crucially, here are my specific claims about what I clearly hear (which are incompatible with what the other person hears), in order of how confident I am in them:
1. The second word starts with an 'm', not a 'w'.
2. The first word ends with a consonant, most likely an 'ng' sound.
3. The first word starts with 'ha'.
4. The second word starts with a 'my' sound.
This might take a couple of days...
uspol, doubting one's sanity, empiricism, wasting resources?
# Test 0
No captions on the original video. Not a huge disappointment, it wouldn't have been strong evidence anyway.
Before I get to Test 1, I wanted to point out that if it correctly reconstructs the given name present in the chant this would be _weaker_ evidence of whatever gets recognized, because it might suggest the captioning system recognized the chant and assigned known captions to it (I don't know whether anything like that actually happens). Something like "Hang my pants!" (which is actually what I heard before I corrected for context) would be stronger evidence. Thankfully this won't be an issue in Test 2.
uspol, doubting one's sanity, empiricism, wasting resources?
# Test 1
Let's document this one properly.
## Preparation
Downloaded the video using `youtube-dl`.
Extracted the relevant part of the sound, from the moment it becomes clear (IMO) to when the video cuts to another part of the crowd.
```
# extract the audio track from the downloaded video without re-encoding
ffmpeg -i Rioters\ chant\ \'hang\ Mike\ Pence\'\ as\ they\ breach\ Capitol-ba0UR7gITrU.mp4 -vn -acodec copy chant.aac
```
Created a video out of the sound file with an irrelevant name and the least political picture I could find on short notice (a drawing of a mathematical pun in Polish).
```
# loop a single still image for the duration of the audio track
ffmpeg -loop 1 -y -i ../kurakLematowskiegoZorna.jpg -i chant.aac -shortest -acodec copy -vcodec libx264 sillyTestVideo.avi
```
Uploaded the result to YT; as of now there are no auto-generated captions present, but the instructions suggest this might take a while.
uspol, doubting one's sanity, empiricism, wasting resources?
On second look, if I'm understanding the UI correctly it generated captions already, but they are _empty_. There is a warning it might not generate proper captions if there are multiple people speaking, so maybe that's a problem. That would make the results inconclusive again. Oh well, I can wait just to make sure before declaring that.
uspol, doubting one's sanity, empiricism, wasting resources?
Well, that ended up silly. YT managed to autogenerate captions, but not for the chant; it picked up some barely audible person talking close to the person recording. And all the words it identified were "el bote no". Waiting for Q theories about how this proves these were Mexican antifa who entered the Capitol by ship and had problems escaping.
At least this is a very clear inconclusive result. I'll continue tomorrow with the other tests, but the odds of me needing to use actual money on this are rising.
uspol, doubting one's sanity, empiricism, wasting resources?
# Test 2
Apparently speech-to-text is something usually done only by professionals, because the tools I managed to find are not especially easy to use. For now I got julius working, following the instructions on its GitHub and substituting the file I wanted for the test file. It needed to be converted as follows:
```
# resample to 16 kHz and split the stereo track into separate mono files
ffmpeg -i chant.aac -ar 16000 -map_channel 0.0.0 chantL.wav -ar 16000 -map_channel 0.0.1 chantR.wav
```
The two channels were actually indistinguishable as far as I (and julius, btw) can tell. Unfortunately all it recognized was "details had", which means it probably also picked up some random person talking, treating the chant as background noise.
I'll try cutting the file into smaller bits (one per word, cut where I think they are clearest), since I will need to do this for further steps anyway, and check whether this helps.
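Since per-word clips are needed for the later steps anyway, here is a stdlib-only sketch of the slicing. The segment boundaries in the example are placeholders, not the real word timings; `ffmpeg` with `-ss`/`-t` on the original file would do the same job.

```python
import wave

def cut_wav(src_path, segments, prefix="word"):
    """Slice a WAV file into clips given (start, end) times in seconds.
    Writes prefix1.wav, prefix2.wav, ... and returns the paths."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # bytes per frame = sample width * number of channels
        width = src.getsampwidth() * src.getnchannels()
        frames = src.readframes(src.getnframes())
    out_paths = []
    for i, (start, end) in enumerate(segments, 1):
        a = int(start * rate) * width
        b = int(end * rate) * width
        path = f"{prefix}{i}.wav"
        with wave.open(path, "wb") as dst:
            dst.setparams(params)  # nframes is corrected on close
            dst.writeframes(frames[a:b])
        out_paths.append(path)
    return out_paths

# Placeholder boundaries, NOT the real word timings:
# cut_wav("chantL.wav", [(0.0, 0.4), (0.4, 0.9), (0.9, 1.5)])
```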
uspol, doubting one's sanity, empiricism, wasting resources?
tl;dr Did not help.
The first word is recognized as "oh", the second as "five", the last as "but added". These are so nonsensical (especially the last one) that I believe they provide no evidence one way or another ("five" kinda sounds like "Mike"? pfffft), except for julius being terrible at transcribing chants. _Maaaybe_ this is tiny evidence towards 3., since a chant that's incomprehensible to programs might also be incomprehensible to humans.
I'll try at least one more tool of this kind, but at this point I believe mturk will be necessary.
uspol, doubting one's sanity, empiricism, wasting resources!
Next I used this: https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr/tutorial
It did not detect any words in the first clip, and the word "one" in both of the other clips. This suggests it was again picking up on noise other than the chant. It also didn't detect anything in the full chant.
Finally I tried Vosk. Did not detect anything on any file.
Welp, MTurk it is. But not today.
uspol, doubting one's sanity, empiricism, wasting resources!
There we go: I sent both the full chant (without repetitions; I just picked the IMO clearest-sounding instance) and single words (cut from the full chant). I requested 20 answers for every clip, which should be enough for reasonable evidence (unless the responses are of atrocious quality). I expect some people will be familiar with the full chant, so answers which correctly identify the given name present there are weaker evidence. With the single words this problem should be somewhat mitigated.
Still hoping I'm not the only one hearing "Hang my pants!" when ignoring the context.
uspol, doubting one's sanity, empiricism, wasting resources!
So let's start with the predictably most disappointing part, the full chant. Four people were clearly familiar with the chant, divided equally between the "standard" interpretations. One more person had an interpretation that was not exactly one of the standard ones, but close enough to make me suspect they were also familiar. Two further people had interpretations that were clearly made through careful listening, somewhat phonetically close to one of the standard interpretations (one each, lol). Ten people did a terrible job and returned nonsense, wild guesses, or just claims that it was unintelligible. Three people had tried, but their interpretations are not close to either of the "standard" ones, and phonetic similarities are unclear.
This is relatively strong evidence for 3, and against both 1 and 2.
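For the record, the grouping above can be tallied mechanically; the labels and counts here are just the ones from this toot, transcribed.

```python
from collections import Counter

# Counts for the 20 full-chant responses, as grouped in the toot above.
groups = Counter({
    "familiar, standard interpretation": 4,
    "probably familiar, near-standard interpretation": 1,
    "careful listen, phonetically close to a standard one": 2,
    "genuine attempt, close to neither standard one": 3,
    "nonsense / wild guess / unintelligible": 10,
})

total = sum(groups.values())
for label, n in groups.most_common():
    print(f"{n:2d}/{total} ({n/total:.0%})  {label}")
```

Half the responses carrying no usable signal is itself informative: it caps how much weight any single interpretation can get.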
uspol, doubting one's sanity, empiricism, wasting resources!
Before I get to the first word, I need to point out that some people who listened to the full chant were also guessing single words. Among them, only one managed to identify that one of the clips was a word from the full chant and assigned it the same word as in their full-chant answer. Because of that I decided not to remove these people from the results (including the one who managed to guess).
uspol, doubting one's sanity, empiricism, wasting resources!
So, first word. Here I must confess I did not manage to cut the word off properly: I was trying very hard not to include the part I perceived as "m" in the clip, and I also cut off the part that I perceived as the "ng" sound (or maybe they were both just the same sound? considering 3 is a real option this might be the case). Anyway, here we had more interesting results. Two people heard something beginning with "w", one of them even the "we" sound (although a different word). Nine people heard something starting with the "he" sound (ugh, not "he" as in "he", but "he" as in "hello", so what I meant by the "ha" sound in the original post; bloody English phonetics), and three further people heard things that contained this sound.
It's worth mentioning that one person claimed to hear the exact word I heard, but followed by another word, which seems like guessing, because I see no way they could have heard so much in this short clip. They were also one of the people who were aware of the chant (although strangely they claimed to hear the version I _did not_ hear in the full chant?!). And since I'm mentioning this, the person who claimed to have heard the "we" sound was also among the ones familiar with the chant. It is also worth pointing out that many (5/9 or 7/12) of the "he" guesses were variants of "hey" or "hello", towards which people might be biased.
The remaining six people heard mostly nonsense, although three of them also heard variants with some "h" sound.
The wide prevalence of the "h"/"he" sound is moderate evidence against 1, and about equally for 2 and 3.
uspol, doubting one's sanity, empiricism, wasting resources!
The second word is apparently the least intelligible. Three people explicitly complained about unintelligibility, six made suggestions that made no sense. Four suggestions contained the "w" sound, but none of them in a way that would be compatible with one of the standard interpretations, and the only one beginning with "w" was someone incorrectly trying to guess which word from the full chant it was (lol). Three people heard the "n" sound – that's not "m", but phonetically very close, and it was always at the beginning of the word. Four people heard the "eye/I" sound as the vowel – interestingly this group is exclusive with the previous one.
This is mostly evidence for 3., but also against 1., since the phonetics align somewhat more with what I heard.
uspol, doubting one's sanity, empiricism, wasting resources!
The third word is the strangest. Three people guessed that it was the third word from the chant (I missed them in the original toot where I claimed there was only one, oops... >.<), but they are almost the only people who noticed the "(t)s" sound at the end – only two more heard that sound. There is one explicit "unintelligible" and two terrible guesses. The remaining 12 people all correctly identified the vowel in the word we expect, but 9 of them also heard some variant of "hey", which should influence the evaluation of the first word (at least the "hey"/"hello" variants might be almost pure bias; note that there were still many others which had different h-words).
Since the "standard" interpretations agreed as to what word this is, people not recognizing it is definitely evidence for 3. On the other hand, this is not really evidence for 3 as applied to the previous two words, they might have been perfectly legible and this word only guessed from context, so it's not very strong evidence for 3 as applied to the problem we are trying to solve... And correctly identifying the vowel seems to be a pattern too...
uspol, doubting one's sanity, empiricism, wasting resources!
Anyway, it seems that mostly I should assign much more weight to 3, bloody meatmech really needs an upgrade. On the other hand when the phonetic evidence pointed in a direction it was more in agreement with my interpretation. This is some evidence against me being insane, but it might also be the result of how I cut the chant into pieces?
uspol, doubting one's sanity, empiricism, wasting resources!
To finish this up, the raw results I got:
* https://wuatek.tk/~timorl/dlaludzi/inne/transcripts_grouped.csv are the transcripts grouped by me as described in the above toots.
* https://wuatek.tk/~timorl/dlaludzi/inne/transcripts_random.csv are in random order; they went through `shuf`, because I had already modified the grouped file by the time I noticed this version would be useful <_<"
The links inside should be pointing to the files I used for the MTurk questions.
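In case anyone wants the random ordering reproducible, here is a seeded stdlib equivalent of the `shuf` step (my actual run was plain `shuf`, so unseeded).

```python
import random

def shuffle_lines(lines, seed=None):
    """Return the lines in random order (what `shuf` does); pass a seed
    to make the shuffle reproducible. The input list is left untouched."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled

# Roughly: read transcripts_grouped.csv, shuffle_lines(rows, seed=...),
# write the result out as transcripts_random.csv.
```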
@freemo If you want to look through this yourself, you definitely want the **second** file. I know this is obvious, but stressing it to lower the chances of you accidentally biasing yourself more than needed by first looking at the first one.
Oh, and if you want to help me with all this investigation it might also be useful if you cut the chant into words the way you perceive them, preferably **before** listening to the cuts I made. I have a slight suspicion word boundaries might help me hear your version? I know this would take time though, so no pressure if you have better things to do.
uspol, doubting one's sanity, empiricism, wasting resources!
@timorl Ok so not a big enough sample for me personally to do much with, but it's a start. I would only use the full words.mp3 myself, as a person needs context to pull out phrases in general.
How much did this cost you? I might be willing to throw in some money (if it's cheap enough) to get larger sample data on just the words.mp3
uspol, doubting one's sanity, empiricism, wasting resources!
A bit more than 5 dollars, MTurk is terrifyingly cheap.
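For anyone budgeting a similar run, the arithmetic looks roughly like this; the per-assignment reward and the fee fraction below are assumptions for illustration, since I'm not reconstructing my exact rates here.

```python
# Back-of-the-envelope MTurk cost with ASSUMED rates -- only the ~$5
# total is real; the reward and fee below are illustrative guesses.
clips = 4              # full chant + three word clips
answers_per_clip = 20  # assignments requested per clip
reward = 0.05          # assumed reward per assignment, in dollars
fee = 0.20             # assumed MTurk fee as a fraction of the reward

total = clips * answers_per_clip * reward * (1 + fee)
print(f"${total:.2f}")  # → $4.80
```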
I kind of disagree about the context helping, I would assume it increases biases (which means maybe I actually agree with your general point that it's easier for people to hear something specific with context?). I tried analyzing phonetics in the responses to specific words, which I thought from the start would give more useful results (see first toot, and some following ones where I explain parts of this reasoning).
On the other hand I really would like to hear your version of cutting the chant into separate words -- the only way I see context helping without just relying on increased biases is by making it easier to cut the phrase in a specific way, so if you suspect that is the case, then this could really help me hear your version.
uspol, doubting one's sanity, empiricism, wasting resources!
@timorl The only way I could see it increasing bias is if it causes someone to identify the scene it came from and remember it. But we should already be ruling that out by asking explicitly if that is the case. Presuming, however, that they are unable to identify the setting in which the chant occurred and never heard about it or saw the video, then I can't see why it would increase bias.
But our brains need context to understand what is being said even in the best of circumstances. The "Chinese telephone" example that is often done in schools is a great example of that: single words become distorted very quickly as they are passed on, but phrases tend to remain intact and are understood, and rarely become changed into something reflecting one's biases (though I'm sure it does happen).
uspol, doubting one's sanity, empiricism, wasting resources!
@freemo There are pretty clear biases, for example expecting chants to start in a specific way (e.g. "We want..." is probably a more common way to start a chant than "Hang...", in a vacuum?) or expecting phrases to go together (e.g. maybe "Mike Pence" being more expected than "want Pence" in a vacuum?). And I still think we are in agreement here, just for some reason getting to different conclusions. It **is** easier for someone to feel they understood a phrase as a whole than a separate word, but this is precisely because of the biases they employ in interpreting the phrase. They have more context, and possibly more noise, to feed these biases, so they act strongly. In the case of separate words, the biases are still present and it's much less likely someone will be able to correctly recognize a word, but the biases will have to be based more closely on specific phonemes present, so they provide stronger evidence for these phonemes.
In other words in a longer phrase one catches a couple phonemes and treats the rest as noise to be filled by biases/priors, and the same happens in single words, but there are fewer phonemes so a smaller portion is ignored, because biases need _some_ input.
Also, if you don't want to do the cutting up of the chant (which is, as I said previously, definitely an understandable position!) I'd be grateful if you told me explicitly, so I might try finding someone else who hears what you hear and would be willing to do that.
uspol, doubting one's sanity, empiricism, wasting resources!
@timorl Yes, but those biases don't work against what you are testing, they work for it. I remind you what you're testing for, from your OP:
> 1. I am so influenced by propaganda my basic senses are broken.
> 2. The above, but for the other participant.
> 3. This specific video is an auditory case of blue/black vs white/gold dress.
So you are trying to determine whether we were hearing what we were hearing because it's what the chant **sounded like** (not what it actually was), or due to our **personal political biases**.
If you eliminate people who already know the context of the chant, then you eliminate any chance that what someone hears is due to political bias, as they have no context for that. Therefore what you get when people hear the whole phrase will be what the average person, not viewing it within any political framework, is likely to hear: "hang mike pence" or "we want pence".
If, as you say, people who hear any chant are more likely to lean towards "we want pence" simply because that is a more common wording for a chant, then you are arguing that me hearing "we want pence" is more likely **not** to be due to political bias than you hearing "hang mike pence" would be (since the latter is not what a non-political context bias would lean towards).
So in fact, while the biases you describe are valid, they are also entirely in line with what you were testing for, and not biases you would even want to eliminate.
uspol, doubting one's sanity, empiricism, wasting resources!
@freemo I understand your reasoning for this approach, but I would argue it does not solve the basic problem I am trying to solve, which is about perceiving reality. There is an underlying reality in the words the people are chanting, and it consists of phonemes which I am trying to determine. A politically unbiased person being more likely to hear one interpretation than the other is mostly evidence that the "default" (i.e. non-political in this case) bias points towards that phrase. This might mean that a person hearing a different thing is politically biased, but it might also mean they overcame the "default" bias and heard something closer to reality. In general, if possible, determining reality is the most useful thing one can do. I am arguing that the single word approach is better at that.
Although this whole reasoning suggests that even if they actually chanted "Hang Mike Pence" then _you_ might not be politically biased much and just hearing what you hear due to the "default" bias. This would be something that could conceivably be a useful result if the full chant approach showed that most people hear something starting with "We want...".
uspol, doubting one's sanity, empiricism, wasting resources!
@freemo It would, although the best results are from both tests combined -- as I said the full chant test also has some merit in restricting the hypothesis space. That's why I did both, and not just the words test, on MTurk. Hearing something different from what was actually said is some evidence of bias, political or otherwise, and hearing something different from what most people hear is evidence of a _different bias than most people_ maybe in political ways, maybe in others. Any of them separately should change your belief in the amount of political bias you are under and combined they provide even more evidence. If I'm interested in overcoming my biases, then the words test is more interesting to me, because it is closer to reality, but that is all.
uspol, doubting one's sanity, empiricism, wasting resources!
@freemo After getting some proper sleep I think I can express this more clearly.
In my original statement you focused on the "propaganda" part, while I was mostly worried about the "my basic senses are broken" part. If political propaganda strongly moved me away from the general consensus, but at the same time brought me closer to reality, that would be the opposite of a problem with my politics.
Obviously in practice this is rather unlikely, and estimating the strength of political bias compared to "default" bias will be a useful thing to do. But the reasoning highlights the fact that this approach is just a proxy for what I really want, which is having my beliefs aligned with reality. It's always better to check that as directly as possible.