🗣💬 Major update: the corpus collection tool BlueskyScraper now comes with a live stream module which lets you collect posts in real time, either randomly or through filters, then download the results as a file in the format of your choice.

🔗 Info: fmoncomble.github.io/blueskysc

🙏 As usual, sharing is caring, and feedback, bug reports and feature requests are welcome!

@linguistics #linguistics

@f_moncomble @linguistics And I'm sure all users gave their conscious consent to their data being processed automatically?

@LupinoArts

On the streaming side of things, the app does nothing that any other Bluesky client doesn't do: it taps into the Bluesky firehose. Users, when they join, consent to their posts being displayed publicly.

The storage and use of the streamed data, on the other hand, is subject to data privacy laws such as the GDPR in the EU, which under certain conditions require anonymisation and limit how long the data may be kept. This applies regardless of how the data is collected.
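
To make the retention point concrete, here is a minimal sketch of how a user of collected data might enforce a time limit on it. This is not something BlueskyScraper is known to do: the record layout (a `collected_at` timestamp per record), the function name `purge_expired` and the 30-day window are all assumptions for illustration, and the appropriate window depends on the project and the applicable law.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; the right value depends on the project
# and on the applicable data privacy rules.
RETENTION = timedelta(days=30)


def purge_expired(records, now=None):
    """Return only the records collected within the retention window.

    Each record is assumed to be a dict with a timezone-aware
    ``collected_at`` datetime (an assumption made for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    return [r for r in records if now - r["collected_at"] <= RETENTION]
```

Running something like this on a schedule, and overwriting the stored file with its result, would keep a local corpus from outliving the chosen window.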

@f_moncomble If users later delete their posts for whatever reason, the posts vanish from the (public part of the) platform, but not from your scrapyard or from the hard drives of people using your stream. Or do you account for those cases?


@LupinoArts @f_moncomble I guess one possible approach is to anonymise the data: for corpus analysis you don't even need to relate different posts to one another. For discourse analysis you could pseudonymise instead.
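
A rough sketch of the difference, assuming each collected post is a dict with `author` and `text` keys (that layout, the function names and the `user_` prefix are made up for illustration, not taken from BlueskyScraper). Pseudonymising swaps each handle for a stable keyed hash, so posts by the same author stay linkable. Anonymising drops the author altogether.

```python
import hashlib
import hmac


def pseudonymise(posts, secret_key):
    """Replace each author handle with a stable keyed hash.

    The same handle always maps to the same pseudonym, so posts by one
    author can still be related, but the handle cannot be read back
    without the secret key (which must be bytes).
    """
    out = []
    for post in posts:
        digest = hmac.new(
            secret_key, post["author"].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        out.append({"author": "user_" + digest[:12], "text": post["text"]})
    return out


def anonymise(posts):
    """Drop the author field so posts can no longer be related by author."""
    return [{"text": post["text"]} for post in posts]
```

Either way, the post text itself may still identify its author (names, quoted handles, searchable phrasing), so removing or hashing the handle is only a first step.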

Qoto Mastodon
