#DailyBloggingChallenge (157/200)
When actively scraping, the main starting function is document.querySelectorAll(). This returns a NodeList, which one typically iterates over with a for-loop. On each item, either querySelector or querySelectorAll is applied recursively until all the specific data instances are extracted.
This data is then saved in various formats depending on future processing, either as an object in an array or as a string, which is then stored in localStorage, sessionStorage, or IndexedDB, or downloaded via a temporary link.
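A minimal sketch of this flow. The selectors ('.item', '.title'), the storage key, and the filename are hypothetical placeholders, not taken from any specific site:

```javascript
// Sketch of the active-scraping flow: loop over a NodeList, apply
// querySelector recursively, then persist or download the result.
function scrapeTitles(root) {
  const titles = [];
  for (const node of root.querySelectorAll('.item')) {
    const title = node.querySelector('.title'); // recursive, per-item query
    if (title) titles.push(title.textContent.trim());
  }
  return titles;
}

// Persist as a string: localStorage survives reloads,
// sessionStorage only lives for the tab session.
function saveResults(results) {
  localStorage.setItem('scrape-results', JSON.stringify(results));
}

// Download via a temporary object URL attached to an <a> element.
function downloadResults(results, filename = 'results.json') {
  const blob = new Blob([JSON.stringify(results, null, 2)], { type: 'application/json' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = filename;
  a.click();
  URL.revokeObjectURL(url); // release the temporary link
}
```

In the browser console one would then run something like saveResults(scrapeTitles(document)).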
#DailyBloggingChallenge (156/200)
The question persists: why should one learn how to scrape? The obvious answer is to get data from a webpage. Further reasons are to learn how to evaluate a website and then build extensions to present the page to one’s liking.
Although web scraping might have a negative connotation, how different is it from skimming literature and picking out specific patterns? And with AI/LLMs on the rise, one can now evaluate texts even quicker.
This is a majestic view of Mt. Fuji surmounted by lenticular clouds while reflected in a lake.
[📸 Taitan21] #Japan #MtFuji #photography
#DailyBloggingChallenge (155/200)
To actively scrape a #website one either employs an extension or uses the console.
The difference here is where the code lives and who maintains it. The benefit of using the #console is that one stays browser-agnostic and can still keep a level of anonymity, whereas an extension could be used as a fingerprinting marker.
E.g. when using the #Tor browser one should not diverge from the pre-installed extensions, since one will be identified more easily compared to the herd. Using the console would be preferred in this case.
On the flip side, using an extension avoids the need to copy and paste the code into the console every time.
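When pasting into the console repeatedly, one common pattern (a sketch, not tied to any specific site) is to wrap the snippet in an IIFE, so re-pasting it does not trigger "identifier has already been declared" errors for top-level const/let bindings:

```javascript
// Hypothetical console snippet: the IIFE keeps bindings local, and the
// result is stashed on globalThis for later inspection in the console.
(() => {
  const items = [];
  // ...site-specific scrape logic would go here, e.g. pushing
  // textContent from document.querySelectorAll(...) into items...
  globalThis.__scraped = items;
})();
```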
#DailyBloggingChallenge (154/200)
To passively scrape a webpage one uses automation tools, ideally driving a headless browser with #Selenium or #Puppeteer. Of course one can use any tool that is typically used for #e2e testing in the #browser.
The biggest obstacles to passive scraping are dealing with #captcha or #cloudflare.
There are options to use captcha farms for a small monetary fee. And Cloudflare can be overcome by IP hopping.
In general, passive scraping only works on websites that were poorly configured.
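A minimal passive-scraping sketch with Puppeteer, assuming it has been installed via npm; the URL and selector in the usage comment are placeholders:

```javascript
// Sketch: drive a headless browser to a page and extract text.
// Requires `npm install puppeteer`; the require is kept inside the
// function so this file still parses without the package installed.
async function passiveScrape(url, selector) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // $$eval runs the callback in the page context over all matches
    return await page.$$eval(selector, (els) => els.map((el) => el.textContent.trim()));
  } finally {
    await browser.close();
  }
}

// Hypothetical usage:
// passiveScrape('https://example.com', 'h1').then(console.log);
```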
Android Auto support for our sandboxed Google Play compatibility layer has been merged into GrapheneOS and should be available in the next release. It's currently going through final review and internal testing leading up to being able to make a public Alpha channel release.
#DailyBloggingChallenge (153/200)
There are two main ways to #scrape a #website, either actively or passively.
Active scraping is the process of using a trigger to scrape an already loaded webpage.
Passive scraping is the process of having a tool navigate to the webpage and scrape it.
The main difference is how one is getting to the loaded #webpage.
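As a sketch of what such a trigger for active scraping could look like (the keyboard shortcut and handler are hypothetical, not from the post):

```javascript
// Hypothetical trigger: Ctrl+Shift+S runs a scrape function against
// the page that is already loaded in the browser.
function makeScrapeTrigger(scrapeFn) {
  return (event) => {
    if (event.ctrlKey && event.shiftKey && event.key === 'S') scrapeFn();
  };
}

// In the browser one would register it on the loaded page:
// document.addEventListener('keydown', makeScrapeTrigger(() => { /* scrape */ }));
```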
#DailyBloggingChallenge (152/200)
Hardware is not the only concern; internet speed is too. Lots of websites use some kind of media like images or videos, and many don’t convert these into sizes friendly to slow connections.
For images, WebP suffices; for videos, a bitrate of about 8 Mbit/s.
#DailyBloggingChallenge (151/200)
Lots of websites these days are first built on the client. This can easily be checked: the downloaded #website (the served source) does not align with the #HTML shown in the inspector.
This has the benefit for the provider of saving transfer cost; on the flip side, the client needs a certain amount of #hardware to successfully render the site.
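A console sketch of that check, as a heuristic rather than a definitive test: fetch the served source of the current page and compare its size to the live DOM serialization.

```javascript
// Heuristic: if the rendered DOM is much larger than the served HTML,
// the page is probably built on the client.
function renderRatio(servedLength, renderedLength) {
  return renderedLength / servedLength;
}

// Browser-only part: compare the raw source with the live DOM.
async function checkClientSideRendering() {
  const served = await (await fetch(location.href)).text();
  const rendered = document.documentElement.outerHTML;
  return renderRatio(served.length, rendered.length); // >> 1 hints at client-side rendering
}
```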
#DailyBloggingChallenge (150/200)
Designing themes with #Vuetify is fairly straightforward; the difficulty is creating a #design or color palette in the first place.
In this approach the “import full palette” method was chosen. This consists of importing the color palette and assigning each color a unique identifier. The type ThemeDefinition exists to help with naming conventions. The additional name to add is accent, which should fit well with the primary and secondary colors.
Later when the #theme is being built one can directly choose from the palette.
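A sketch of that setup for Vuetify 3. The palette names and hex values are made up; the object's shape follows Vuetify's ThemeDefinition type, and the registration call in the comment is the assumed standard setup:

```javascript
// Hypothetical imported palette, one unique identifier per color.
const palette = {
  deepIndigo: '#1A237E',
  skyBlue: '#42A5F5',
  amberGlow: '#FFC107',
};

// Shape follows Vuetify's ThemeDefinition; accent is the extra name
// added alongside primary and secondary.
const myTheme = {
  dark: false,
  colors: {
    primary: palette.deepIndigo,
    secondary: palette.skyBlue,
    accent: palette.amberGlow,
  },
};

// Assumed registration when creating the Vuetify instance:
// createVuetify({ theme: { defaultTheme: 'myTheme', themes: { myTheme } } });
```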
#DailyBloggingChallenge (149/200)
#TIL that the female #human has, on average, proportionally shorter legs with respect to body height compared to their male counterparts.
This could also explain why, on average, the female human is biologically predisposed to be able to touch their feet with their hands while keeping their legs straight.
#DailyBloggingChallenge (148/200)
Currently just imagining the idea of surpassing the #weeklyOSM blog count $w$ with the DailyBloggingChallenge $d$.
To calculate the number of weeks $x$, one sets $w(x) = d(x)$ with $w(x) = 700 + x$ and $d(x) = 148 + 7x$. Solving $700 + x = 148 + 7x$ gives $6x = 552$, so $x = 92$ weeks, or 644 days, until it is surpassed.
So just before the #weeklyOSM800 edition (relatively speaking), the DailyBloggingChallenge would overtake it.
Well, before fantasizing about hypothetical goals, I should stick with the current goal of 200.
#DailyBloggingChallenge (147/200)
As an active participant in the @weeklyOSM project, which is celebrating its 700th weekly news update, one gets to admire a long-lasting community project.
This would put the first edition almost 14 years ago, which is only a couple of years after the #OpenStreetMap project started.
There are a lot of people working behind the scenes: gathering the news stories, writing up small summaries, translating these into the various languages, proofreading, and finally publishing at the end of the week.
#DailyBloggingChallenge (146/200)
One downside of #Croatia opening its borders as part of #Schengen is that when flying into #Zagreb with #RyanAir, there is a high chance the plane will continue its flight outside of Schengen. This means it will park at the international terminal, making it more convenient for the upcoming passengers, though the current passengers get conveyed to the terminal by bus.
On the flip side, flying out to the Schengen area is similar to before, with access to most amenities.
They put border control right in front of the gate, instead of right after security, which is what most airports do.
#DailyBloggingChallenge (145/200)
The TagTable has a photo_ids_list which contains each photo or video carrying the specific tag. Further, each id is prefixed with thumb for photos and video- for videos. The id itself is saved as a hexadecimal number.
Once one has the ids as decimal numbers, one can use them to search through either the PhotoTable or the VideoTable.
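A sketch of the id conversion described above. The entry format (prefix plus hexadecimal id) is taken from this post, not verified against Shotwell's actual schema:

```javascript
// Parse one entry of the tag's id list: "thumb" + hex for photos,
// "video-" + hex for videos, as described in the post.
function parseShotwellId(entry) {
  const match = /^(thumb|video-)([0-9a-fA-F]+)$/.exec(entry);
  if (!match) return null;
  return {
    kind: match[1] === 'thumb' ? 'photo' : 'video',
    id: parseInt(match[2], 16), // decimal id for PhotoTable / VideoTable lookups
  };
}
```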
#DailyBloggingChallenge (144/200)
One downside of using #Shotwell as a photo and video manager is that extracting tags from videos isn’t as straightforward as it is with photos. With photos one can grab the data directly with an #exif tool.
After going through the source code and battling with #sqlite and #bash, I got a script that extracts tags from both media formats.
The nice thing is that it is more performant than the previous script I was using, which only worked for images.
I am a strong proponent of leaving this planet behind better than when I arrived on it. Thus, to get the most bang for a lifetime, my key focus is #longevity, which I attempt to achieve through #nutrition, specifically #plantbased.
Longevity is good and all, as long as you are not frail and weak. Ideal would be to die young at an old age. Thus I incorporate tactics from #biohacking and #primalfitness. Additionally, I am an advocate of #wildcrafting, which is a superset of #herbalism.
I studied many fields of science, like maths and statistics, though the constant was always computer science.
Currently I work as a fullstack web developer, though I prefer to call myself a #SoftwareCrafter.
The goal of my side projects is to practice #GreenDevelopment, meaning to create mainly static websites: the way the internet was intended to be.
On the artistic side, all content is released under a Creative Commons license. Thereby, ideally, only tools and resources that are #FLOSS #OpenSource are used. #nobot