🗣💬 Major update: the corpus collection tool BlueskyScraper now comes with a live stream module which lets you collect posts in real time, either randomly or through filters, then download the results as a file in the format of your choice.
🔗 Info: https://fmoncomble.github.io/blueskyscraper/
🙏 As usual, sharing is caring, and feedback, bug reports and feature requests are welcome!
@f_moncomble @linguistics And I'm sure all users gave their conscious consent to having their data processed automatically?
On the streaming side of things, the app does nothing that any other Bluesky client doesn't do: it taps into the Bluesky firehose. Users, when they join, consent to their posts being displayed publicly.
The storage and use of the streamed data, on the other hand, are subject to data privacy laws, such as GDPR in the EU, which under certain conditions require anonymisation and limit how long the data can be kept. This applies regardless of how the data is collected.
@LupinoArts @f_moncomble I guess one possible approach is to anonymise the data: for corpus analysis you don't even need to link different posts to one another. For discourse analysis you could pseudonymise instead.
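To make the difference concrete, here is a minimal Python sketch, not anything BlueskyScraper is known to do. The record layout (`author`, `text`), the function names and the key are all assumptions made up for illustration. Anonymising drops the author, so posts can no longer be linked; pseudonymising swaps the author for a keyed hash, so posts by the same person stay linked without exposing the handle.

```python
import hashlib
import hmac


def anonymise(post: dict) -> dict:
    """Drop the author entirely: posts can no longer be linked to each other."""
    return {"text": post["text"]}


def pseudonymise(post: dict, secret_key: bytes) -> dict:
    """Replace the author with a keyed hash: same author, same pseudonym."""
    digest = hmac.new(
        secret_key, post["author"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return {"author": "user_" + digest[:12], "text": post["text"]}


# Hypothetical sample records; the field names are assumptions.
posts = [
    {"author": "alice.example", "text": "first post"},
    {"author": "bob.example", "text": "a reply"},
    {"author": "alice.example", "text": "second post"},
]

# Illustrative key only: a real one would be stored apart from the corpus.
key = b"keep-this-secret-and-separate-from-the-corpus"

anonymous = [anonymise(p) for p in posts]
pseudonymous = [pseudonymise(p, key) for p in posts]
```

Neither function touches the post text itself, which may still contain handles or other identifying details, so this only covers the author field.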