#DailyBloggingChallenge (153/200)
There are two main ways to #scrape a #website, either actively or passively.
Active scraping is the process of using a trigger to scrape a webpage that is already loaded in the browser.
Passive scraping is the process of having a tool navigate to the webpage and scrape it automatically.
The main difference is how one gets to the loaded #webpage.
#DailyBloggingChallenge (154/200)
To passively scrape a webpage one uses automation tools, ideally browser-automation frameworks like #Selenium or #Puppeteer driving a headless browser. Of course, one can use any tool that is typically used for #e2e testing in the #browser.
The biggest obstacle to passive scraping is dealing with either #captcha or #cloudflare.
There are options to use captcha farms for a small monetary fee, and Cloudflare can be overcome by IP hopping.
In general, passive scraping only works on poorly configured websites.
#DailyBloggingChallenge (155/200)
To actively scrape a #website one either employs a browser extension or uses the console.
Here the difference is where the code lives and who maintains it. The benefit of using the #console is that it is browser agnostic and one can still keep a level of anonymity, whereas an extension could be used as a fingerprinting marker.
E.g. when using the #Tor browser one should not diverge from the pre-installed extensions, since one would be identified more easily compared to the herd. Using the console would be preferred in this case.
On the flip side, using an extension avoids having to copy and paste the code into the console every time.
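A small sketch of the console approach: the helper name is my own, and in a real session one would paste this into the loaded page's console and pass its `document`.

```javascript
// Hedged sketch of active scraping via the browser console.
// `extractLinks` is a hypothetical helper; pass the page's `document`.
function extractLinks(doc) {
  // Collect every anchor that has an href, keeping its text and target.
  return Array.from(doc.querySelectorAll('a[href]')).map((a) => ({
    text: a.textContent.trim(),
    href: a.href,
  }));
}

// In the console: console.table(extractLinks(document));
```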
#DailyBloggingChallenge (156/200)
The question persists: why should one learn how to scrape? The obvious answer is to get data from the webpage. Further reasons are to learn how to evaluate a website and then to build extensions that present the page to one's liking.
Although web scraping might have a negative connotation, how different is it really from skimming literature and picking out specific patterns? And with AI/LLMs on the rise, one can now evaluate texts even quicker.
#DailyBloggingChallenge (158/200)
One option for further processing is opening the scraped data in a new tab as an #HTML page.
This has the benefit that the header details stay constant, meaning calls to media like images aren't blocked by #CORS. Further, one can highlight the details that one deems important, compared to the original creator's layout.
One builds the HTML page as a string, just as one typically would. The only difference is that the file extension is *.js instead of *.html.
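A sketch of how the built string can then be opened in a new tab; the helper name is my own assumption. Note that some browsers restrict navigating to data: URLs directly, in which case a Blob URL via URL.createObjectURL is an alternative.

```javascript
// Hedged sketch: turn an HTML string into a data URL that can be opened
// in a new tab. `htmlToDataUrl` is a hypothetical helper name.
function htmlToDataUrl(html) {
  return 'data:text/html;charset=utf-8,' + encodeURIComponent(html);
}

// Browser-only usage (not runnable in Node):
// window.open(htmlToDataUrl('<h1>Scraped results</h1>'), '_blank');
```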
#DailyBloggingChallenge (162/200)
One inconvenience of single-file scripting is that eventually the overview becomes hard to manage. Thus one realizes that the file will need to be split up into multiple files.
#Webpack offers a solution for bundling multiple files into one. Plus, if one takes the extra effort of setting up #TypeScript, one gets the benefits of type safety.
@barefootstache why are you worrying about IE quirks?
@Neblib one never knows who will use this…
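A minimal sketch of such a setup, assuming webpack and ts-loader are installed; the entry path and output name are illustrative, not from the original post.

```javascript
// webpack.config.js — minimal sketch bundling TypeScript into one file.
// Paths and names are illustrative assumptions.
const path = require('path');

module.exports = {
  entry: './src/index.ts',
  module: {
    // Run every .ts file through ts-loader for type checking and transpilation.
    rules: [{ test: /\.ts$/, use: 'ts-loader', exclude: /node_modules/ }],
  },
  resolve: { extensions: ['.ts', '.js'] },
  output: {
    filename: 'bundle.js',
    path: path.resolve(__dirname, 'dist'),
  },
};
```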
#DailyBloggingChallenge (159/200)
This #TypeScript function builds a website from scratch, with the body parameter being the only necessary input. This can be used as a way to display the scraped data.
#WebDev