#DailyBloggingChallenge (153/200)
There are two main ways to #scrape a #website, either actively or passively.
Active scraping is the process of using a trigger to extract data from a webpage that is already loaded in the browser.
Passive scraping is the process of having a tool navigate to the webpage and scrape it.
The main difference is how one gets to the loaded #webpage.
#DailyBloggingChallenge (154/200)
To passively scrape a webpage one uses browser automation tools like #Selenium or #Puppeteer, ideally driving a headless browser. Of course, one can use any tool that is typically used for #e2e testing in the #browser.
The biggest obstacle for passively scraping is dealing with either #captcha or #cloudflare.
There are options to use captcha farms for a small monetary fee, and Cloudflare can be overcome by IP hopping.
In general, passive scraping only works on poorly configured websites.
#DailyBloggingChallenge (156/200)
The question persists: why should one learn how to scrape? The obvious answer is to get data from the webpage. Further reasons are to learn how to evaluate a website and then build extensions to present the page to one’s liking.
Although web scraping might have a negative connotation, how different is it really from skimming literature and picking out specific patterns? And with AI/LLMs on the rise, one can now evaluate texts even quicker.
#DailyBloggingChallenge (158/200)
One option for future processing is opening a new tab as an #HTML page.
This has the benefit that the header details stay constant, meaning that loading media like images isn’t blocked by #CORS. Further, one can highlight the details that one deems important, in contrast to the original creator’s presentation.
One builds the HTML page as a string, just as one typically would. The only difference is that the file extension is *.js instead of *.html.
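As a sketch of the idea, a hypothetical `buildHtmlPage` helper could assemble the page string; the name, signature, and defaults here are illustrative, not from the original posts:

```javascript
// Hypothetical helper: assembles a full HTML document as a string.
// Name, parameters, and defaults are illustrative assumptions.
function buildHtmlPage(body, style = '', title = 'new display', script = '') {
  return [
    `<html><head><title>${title}</title>`,
    `<style>${style}</style>`,
    '</head><body>',
    body,
    `<script>${script}</scr` + 'ipt>', // split so the literal survives inline embedding
    '</body></html>',
  ].join('');
}

// Example: wrap one scraped detail and highlight it.
const page = buildHtmlPage(
  '<mark>scraped detail</mark>',
  'mark { background: yellow; }'
);
```

The resulting string can then be handed to a new tab or saved to disk for later viewing.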
#DailyBloggingChallenge (159/200)
This #TypeScript function builds a website from scratch, with the body parameter being the only necessary input.
```typescript
/**
 * Opens a new window with a 'title'.
 *
 * @param body - the body of the HTML page
 * @param style - the style of the HTML page
 * @param title - the title of the HTML page
 * @param script - the javascript of the HTML page
 */
static openNewWindow(body: string, style = '', title = 'new display', script = ''): true {
  const mywindow = window.open('', '_blank') as Window;
  mywindow.document.write(`<html><head><title>${title}</title>`);
  mywindow.document.write(`<style>${style}</style>`);
  mywindow.document.write('</head><body>');
  mywindow.document.write(body);
  mywindow.document.write('<script>');
  mywindow.document.write(script);
  mywindow.document.write('</script>');
  mywindow.document.write('</body></html>');
  mywindow.document.close(); // necessary for IE >= 10
  mywindow.focus(); // necessary for IE >= 10
  return true;
}
```
This can be used as a way to display the scraped data.
#DailyBloggingChallenge (162/200)
One inconvenience of single-file scripting is that eventually the overview becomes hard to manage. Thus one realizes that the file will need to be split up into multiple files.
#Webpack offers a solution for bundling multiple files into one. Plus, if one takes the extra effort of setting up #TypeScript, one gets the benefits of type safety.
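A minimal sketch of such a setup might look like the following `webpack.config.js`; the file paths, `ts-loader`, and output names are assumptions, not details from the post:

```javascript
// webpack.config.js -- minimal sketch, assuming webpack, typescript,
// and ts-loader are installed. All paths here are illustrative.
const path = require('path');

module.exports = {
  entry: './src/index.ts', // the split-up source files start here
  module: {
    rules: [
      { test: /\.ts$/, use: 'ts-loader', exclude: /node_modules/ },
    ],
  },
  resolve: { extensions: ['.ts', '.js'] },
  output: {
    filename: 'bundle.js', // the single file one loads for scraping
    path: path.resolve(__dirname, 'dist'),
  },
};
```

Running `npx webpack` would then emit the one bundled file.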
@barefootstache why are you worrying about IE quirks?
@Neblib one never knows who will use this…
#DailyBloggingChallenge (157/200)
When actively scraping, the main starting function is `document.querySelectorAll`. This returns a `NodeList`, which one typically loops over with a for-loop. On each item, either `querySelector` or `querySelectorAll` is applied recursively until all specific data instances are extracted. This data is then saved in various formats depending on future processing, either as objects in an array or as a string, which is then saved to `localStorage`, `sessionStorage`, `IndexedDB`, or downloaded via a temporary link.

#WebScraping #VanillaJS #WebDev
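The loop-and-extract step can be sketched as a pure function; the `.title`/`.price` selectors are illustrative assumptions, and the DOM call is kept in a comment so the extraction logic stands on its own:

```javascript
// Extraction step: turns element-like items into plain objects.
// The '.title' and '.price' selectors are illustrative assumptions --
// adapt them to the page being scraped.
function pluck(nodes) {
  const rows = [];
  for (const node of nodes) { // for-loop over the NodeList (or any iterable)
    rows.push({
      title: node.querySelector('.title')?.textContent?.trim() ?? '',
      price: node.querySelector('.price')?.textContent?.trim() ?? '',
    });
  }
  return rows;
}

// In the browser one would feed it real elements and persist the result:
//   const data = pluck(document.querySelectorAll('.item'));
//   localStorage.setItem('scrape', JSON.stringify(data));
```

Keeping `pluck` free of direct `document` access also makes it easy to unit-test with stubbed elements.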