#DailyBloggingChallenge (153/200)
There are two main ways to #scrape a #website, either actively or passively.
Active scraping is the process of using a trigger to extract data from a webpage that is already loaded in the browser.
Passive scraping is the process of having a tool navigate to the webpage and scrape it.
The main difference is how one gets to the loaded #webpage.
#DailyBloggingChallenge (154/200)
To passively scrape a webpage one uses browser automation tools like #Selenium or #Puppeteer, ideally driving a headless browser. Of course, one can use any tool that is typically used for #e2e testing in the #browser.
The biggest obstacle for passively scraping is dealing with either #captcha or #cloudflare.
There are options to use captcha farms for a small monetary fee, and Cloudflare can be overcome by IP hopping.
In general, passive scraping only works on poorly configured websites.
#DailyBloggingChallenge (156/200)
The question persists: why should one learn how to scrape? The obvious answer is to get data from the webpage. Further reasons are to learn how to evaluate a website and then build extensions to present the page to one’s liking.
Although web scraping might have a negative connotation, how different is it really from skimming literature and picking out specific patterns? And with AI/LLMs on the rise, one can now evaluate texts even quicker.
#DailyBloggingChallenge (158/200)
One option for future processing is opening a new tab as an #HTML page.
This has the benefit that the header details stay constant, meaning that loading media like images isn’t blocked by #CORS. Further, one can highlight the details that one deems important, in contrast to the original creator’s presentation.
One builds the HTML page as a string, just as one typically would. The only difference is that the file extension is *.js instead of *.html.
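As a sketch of the idea, a hypothetical `buildHtmlPage` helper could assemble the page string; the name, signature, and defaults here are illustrative, not from the original posts:

```javascript
// Hypothetical helper: assembles a full HTML document as a string.
// Name, parameters, and defaults are illustrative assumptions.
function buildHtmlPage(body, style = '', title = 'new display', script = '') {
  return [
    `<html><head><title>${title}</title>`,
    `<style>${style}</style>`,
    '</head><body>',
    body,
    `<script>${script}</scr` + 'ipt>', // split so the literal survives inline embedding
    '</body></html>',
  ].join('');
}

// Example: wrap one scraped detail and highlight it.
const page = buildHtmlPage(
  '<mark>scraped detail</mark>',
  'mark { background: yellow; }'
);
```

The resulting string can then be handed to a new tab or saved to disk for later viewing.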
#DailyBloggingChallenge (159/200)
This #TypeScript function builds a website from scratch, with the body parameter being the only necessary input.
```typescript
/**
 * Opens a new window with a 'title'.
 *
 * @param body - the body of the HTML page
 * @param style - the style of the HTML page
 * @param title - the title of the HTML page
 * @param script - the javascript of the HTML page
 */
static openNewWindow(body: string, style = '', title = 'new display', script = ''): true {
  const mywindow = window.open('', '_blank') as Window;
  mywindow.document.write(`<html><head><title>${title}</title>`);
  mywindow.document.write(`<style>${style}</style>`);
  mywindow.document.write('</head><body>');
  mywindow.document.write(body);
  mywindow.document.write('<script>');
  mywindow.document.write(script);
  mywindow.document.write('</script>');
  mywindow.document.write('</body></html>');
  mywindow.document.close(); // necessary for IE >= 10
  mywindow.focus(); // necessary for IE >= 10
  return true;
}
```
This can be used as a way to display the scraped data.
#DailyBloggingChallenge (162/200)
One inconvenience of single-file scripting is that eventually the overview becomes hard to manage. Thus one realizes that the file will need to be split up into multiple files.
#Webpack offers a solution for bundling multiple files into one. Plus, if one takes the extra effort of setting up #TypeScript, one gets the benefits of type safety.
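A minimal sketch of such a setup might look like the following `webpack.config.js`; the file paths, `ts-loader`, and output names are assumptions, not details from the post:

```javascript
// webpack.config.js -- minimal sketch, assuming webpack, typescript,
// and ts-loader are installed. All paths here are illustrative.
const path = require('path');

module.exports = {
  entry: './src/index.ts', // the split-up source files start here
  module: {
    rules: [
      { test: /\.ts$/, use: 'ts-loader', exclude: /node_modules/ },
    ],
  },
  resolve: { extensions: ['.ts', '.js'] },
  output: {
    filename: 'bundle.js', // the single file one loads for scraping
    path: path.resolve(__dirname, 'dist'),
  },
};
```

Running `npx webpack` would then emit the one bundled file.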
@barefootstache why are you worrying about IE quirks?
@Neblib one never knows who will use this…
#DailyBloggingChallenge (157/200)
When actively scraping, the main starting function is `document.querySelectorAll`. This returns a `NodeList`, which one typically loops over with a for-loop. On each item, either `querySelector` or `querySelectorAll` is applied recursively until all specific data instances are extracted. This data is then saved in various formats depending on future processing, either as objects in an array or as a string, which is then saved to `localStorage`, `sessionStorage`, `IndexedDB`, or downloaded via a temporary link.

#WebScraping #VanillaJS #WebDev
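The loop-and-extract step can be sketched as a pure function; the `.title`/`.price` selectors are illustrative assumptions, and the DOM call is kept in a comment so the extraction logic stands on its own:

```javascript
// Extraction step: turns element-like items into plain objects.
// The '.title' and '.price' selectors are illustrative assumptions --
// adapt them to the page being scraped.
function pluck(nodes) {
  const rows = [];
  for (const node of nodes) { // for-loop over the NodeList (or any iterable)
    rows.push({
      title: node.querySelector('.title')?.textContent?.trim() ?? '',
      price: node.querySelector('.price')?.textContent?.trim() ?? '',
    });
  }
  return rows;
}

// In the browser one would feed it real elements and persist the result:
//   const data = pluck(document.querySelectorAll('.item'));
//   localStorage.setItem('scrape', JSON.stringify(data));
```

Keeping `pluck` free of direct `document` access also makes it easy to unit-test with stubbed elements.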