(153/200)

There are two main ways to scrape a webpage: actively or passively.

Active scraping is the process of using a trigger to actively scrape the already loaded webpage.

Passive scraping is the process of having the tool navigate to the webpage and scrape it.

The main difference is how one gets to the loaded webpage.

(154/200)

To passively scrape a webpage, one uses automation tools, ideally headless browsers. Of course, one can use any tool that is typically used for automated testing.

The biggest obstacle for passive scraping is dealing with either captchas or Cloudflare.

There are options to use captcha farms for a small monetary fee, and Cloudflare can be overcome by IP hopping.

In general, passively scraping only works on websites that were poorly configured.

(155/200)

To actively scrape a webpage, one either employs an extension or uses the console.

Here the difference is where the code lives and who maintains it. The benefit of using the console is that one is browser agnostic and can still keep a level of anonymity, whereas an extension could be used as a fingerprint marker.

E.g. if using the browser, one should not diverge from the installed extensions, since one will be more easily identified compared to the herd. Using the console would be preferred in this case.

On the flip side, using an extension removes the need to copy and paste the code into the console every time.

(156/200)

The question remains: why should one learn how to scrape? The obvious answer is to get data from a webpage. Further reasons are to learn how to evaluate a website and then build extensions to present the page to one’s liking.

Although web scraping might have a negative connotation, how different is it from skimming literature and picking out specific patterns? And with AI/LLMs on the rise, one can now evaluate texts even quicker.

(157/200)

When actively scraping, the main starting function is

document.querySelectorAll()

This returns a NodeList, which one typically iterates over with a for-loop.

On each item either the querySelector or querySelectorAll will be applied recursively until all specific data instances are extracted.
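The loop described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the names QueryNode, FieldSelectors, ScrapedRow and scrapeRows are made up for this example, and the selectors in the usage comment are placeholders that depend entirely on the page being scraped.

```typescript
// Hypothetical minimal view of the DOM API used here. Real Document and
// Element objects should satisfy it, and so do plain test doubles.
interface QueryNode {
  textContent: string | null;
  querySelector(selector: string): QueryNode | null;
  querySelectorAll(selector: string): ArrayLike<QueryNode>;
}

// Maps an output field name to the CSS selector that locates it inside an item.
type FieldSelectors = Record<string, string>;
type ScrapedRow = Record<string, string>;

// Selects every item below `root`, then runs one querySelector per field on
// each item and collects the trimmed text into one object per item.
function scrapeRows(
  root: QueryNode,
  itemSelector: string,
  fields: FieldSelectors
): ScrapedRow[] {
  const rows: ScrapedRow[] = [];
  const items = root.querySelectorAll(itemSelector);
  for (let i = 0; i < items.length; i++) {
    const item = items[i];
    const row: ScrapedRow = {};
    for (const name of Object.keys(fields)) {
      const node = item.querySelector(fields[name]);
      // A missing element yields an empty string instead of throwing.
      row[name] = (node?.textContent ?? "").trim();
    }
    rows.push(row);
  }
  return rows;
}

// Possible use in the console (selectors are assumptions about the page):
// scrapeRows(document, "article", { title: "h2", author: ".author" });
```

Passing the root in as a parameter, instead of reading the global document inside the function, keeps the extraction logic usable on a sub-tree and easy to try out without a browser.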

This data is then saved in a format that depends on future processing, either as an object in an array or as a string, which is then saved to localStorage, sessionStorage, or IndexedDB, or downloaded via a temporary link.
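As a rough sketch of the two shapes mentioned here, the snippet below keeps the scraped data as an array of objects, turns it into a CSV string, and stores it under a key. The names ScrapedRow, KeyValueStore, toCsv, saveRows and loadRows are invented for this illustration. The KeyValueStore interface only mirrors the getItem and setItem methods of the Web Storage API, so localStorage or sessionStorage could be passed in directly. Downloading via a temporary link is left out, since it needs a live page.

```typescript
type ScrapedRow = Record<string, string>;

// Hypothetical subset of the Web Storage API (getItem / setItem).
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Serializes rows into a CSV string. Every value is quoted, and embedded
// double quotes are doubled, which is the usual CSV escaping rule.
function toCsv(rows: ScrapedRow[], columns: string[]): string {
  const quote = (value: string): string => `"${value.replace(/"/g, '""')}"`;
  const lines = [columns.map(quote).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => quote(row[column] ?? "")).join(","));
  }
  return lines.join("\n");
}

// Stores the array of objects as a JSON string under the given key.
function saveRows(store: KeyValueStore, key: string, rows: ScrapedRow[]): void {
  store.setItem(key, JSON.stringify(rows));
}

// Reads the rows back; an unknown key yields an empty array.
function loadRows(store: KeyValueStore, key: string): ScrapedRow[] {
  const raw = store.getItem(key);
  return raw === null ? [] : (JSON.parse(raw) as ScrapedRow[]);
}

// Possible use in the console (the key name is an assumption):
// saveRows(localStorage, "scrape-result", rows);
```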

(158/200)

One option for future processing is opening a new tab as an HTML page.

This has the benefit that the header details stay constant, meaning that requests for media such as images are not blocked. Furthermore, one can highlight the details that one deems important, rather than those the original creator chose.

One builds the HTML page as a string, just as one typically would do. The only difference is that the file extension is *.js instead of *.html.
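Building the page body as a string might look like the sketch below, which renders scraped rows as a table. The names ScrapedRow, escapeHtml and renderTable are invented for this illustration, and the table layout is just one possible choice. Scraped text is escaped before it is placed in the markup, so stray angle brackets in the data cannot break the page.

```typescript
type ScrapedRow = Record<string, string>;

// Escapes the characters that have a special meaning in HTML text and
// double-quoted attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Renders the rows as an HTML table string with one column per entry in `columns`.
function renderTable(rows: ScrapedRow[], columns: string[]): string {
  const head = columns.map((column) => `<th>${escapeHtml(column)}</th>`).join("");
  const body = rows
    .map((row) => {
      const cells = columns
        .map((column) => `<td>${escapeHtml(row[column] ?? "")}</td>`)
        .join("");
      return `<tr>${cells}</tr>`;
    })
    .join("");
  return `<table><thead><tr>${head}</tr></thead><tbody>${body}</tbody></table>`;
}
```

The returned string can then serve as the body of the new page.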

(159/200)

This function builds a website from scratch with the body parameter being the only necessary input.

  /**
   * Opens a new window with a 'title'.
   *
   * @param body - the body of the HTML page
   * @param style - the style of the HTML page
   * @param title - the title of the HTML page
   * @param script - the javascript of the HTML page
   */
  static openNewWindow(body: string, style = '', title = 'new display', script = ''): true {
    const mywindow = window.open('', '_blank') as Window;
    mywindow.document.write(`<html><head><title>${title}</title>`);
    mywindow.document.write(`<style>${style}</style>`);
    mywindow.document.write('</head><body>');
    mywindow.document.write(body);
    mywindow.document.write('<script>');
    mywindow.document.write(script);
    mywindow.document.write('</script>');
    mywindow.document.write('</body></html>');
    mywindow.document.close(); // necessary for IE >= 10
    mywindow.focus(); // necessary for IE >= 10
    return true;
  }

This can be used as a way to display the scraped data.

(162/200)

One inconvenience of single-file scripting is that the overview eventually becomes hard to manage. Thus one realizes that the file needs to be split up into several files.

A bundler gives a solution for building multiple files into one. Plus, if one takes the extra effort of setting up TypeScript, one gets the benefits of type safety.
