**barefootstache** @barefootstache@qoto.org · Feb 01, 2025, 03:56

Currently trying to extract and format data from PDFs using #python #PyMuPDF.

Initially I used the `get_text(value)` method with the `"text"` value, only to learn that I could potentially have saved time by using the `"html"` value directly, since I have been writing pattern matchers to format the text into #HTML.

After investigating, it turns out that although the `"html"` option exists, post-processing its output is more strenuous than the initial approach.

What fascinates me about the `get_text(value)` method is that each value packages the data differently. Whereas `"html"` puts the text in `<p><span>text</span></p>`, `"xhtml"` puts it in `<h1>text</h1>` instead.
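
A minimal sketch of how the modes could be compared side by side, assuming PyMuPDF is installed and importable as `pymupdf`. The function names and the `example.pdf` path are made up for illustration; only `get_text()` and its `"text"`, `"html"` and `"xhtml"` values come from the library.

```python
import re
from collections import Counter


def extract_modes(pdf_path, page_number=0, modes=("text", "html", "xhtml")):
    """Return {mode: output} for one page, one entry per get_text() mode."""
    import pymupdf  # imported lazily; older releases expose it as `fitz`

    with pymupdf.open(pdf_path) as doc:
        page = doc[page_number]
        return {mode: page.get_text(mode) for mode in modes}


def tag_counts(markup):
    """Count opening tags by name, as a rough fingerprint of the markup."""
    return Counter(re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", markup))


# Hypothetical usage ("example.pdf" is a placeholder path):
#   for mode, output in extract_modes("example.pdf").items():
#       print(mode, dict(tag_counts(output)))
```

Counting tags per mode is only a quick way to see how differently each value wraps the same page; it says nothing about which output is easier to post-process.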

**barefootstache** @barefootstache@qoto.org · Feb 01, 2025, 04:13

A follow-up on trying to extract and format data from PDFs using #python #PyMuPDF.

I was trying to create a perfect chain of functions that would format all the edge cases into the final desired #HTML format. That is when I quickly realized that running every tweaked version of the functions on the 100-page PDF is quite time-consuming.

Instead I can run the extraction once and save the results in a #sqlite database, then write #sql queries to post-process the edge cases. This also gives me a good-enough way to inspect the contents of each page, compared with the previous method of printing the output to the #terminal and scrolling to the desired page. And in the end, I am one step closer to having the data in a #csv file, which is easily exported with #Dbeaver.
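
A rough sketch of that run-once-then-query idea, assuming PyMuPDF is importable as `pymupdf`. The `pages` table, its columns and all function names are assumptions for illustration, not the schema actually used.

```python
import sqlite3


def extract_pages(pdf_path):
    """Yield (page_number, text) pairs; this is the slow step to run once."""
    import pymupdf  # imported lazily; older releases expose it as `fitz`

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            yield page.number, page.get_text("text")


def store_pages(conn, pages):
    """Save the raw per-page text so later tweaks never touch the PDF."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (page INTEGER PRIMARY KEY, content TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (page, content) VALUES (?, ?)", pages
    )
    conn.commit()


def pages_matching(conn, pattern):
    """Return page numbers whose text matches a SQL LIKE pattern."""
    rows = conn.execute(
        "SELECT page FROM pages WHERE content LIKE ? ORDER BY page", (pattern,)
    )
    return [page for (page,) in rows]
```

With something like `store_pages(sqlite3.connect("pages.db"), extract_pages("example.pdf"))` run once (both file names are placeholders), each edge case becomes a query against the database instead of another pass over the PDF.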

**John BS** @johnabs@qoto.org · Feb 02, 2025, 04:31

@barefootstache Have you tried using pdftotext with layout mode and some regex? That worked for me with a 600 page database schema :)
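
For reference, a minimal sketch of the layout-mode-plus-regex approach when driven from Python, assuming poppler's `pdftotext` is on the `PATH`. The helper names and the two-or-more-spaces column rule are assumptions, not necessarily what was used on the 600-page schema.

```python
import re
import subprocess


def pdftotext_layout(pdf_path):
    """Run poppler's pdftotext in layout mode and return the text."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_columns(line):
    """Split one layout-mode line on runs of two or more spaces."""
    return re.split(r"\s{2,}", line.strip())
```

Layout mode keeps columns visually aligned with padding spaces, which is why splitting on wide whitespace runs can recover table-like rows; cells that themselves contain double spaces would need a stricter pattern.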

**barefootstache** @barefootstache@qoto.org · Feb 02, 2025, 04:37

@johnabs no, I haven't tried many other libraries. I am currently looking into whether there are better options, though I will also need image extraction capabilities.

I have tried PyPDF, though I wasn't happy with the results.

**John BS** @johnabs@qoto.org · Feb 02, 2025, 04:38

@barefootstache pdftotext allows image extraction apparently :3

https://github.com/pmdunggh/pdftotext

**barefootstache** @barefootstache@qoto.org · Feb 02, 2025, 04:40

@johnabs I was looking at

https://github.com/jalan/pdftotext

**John BS** @johnabs@qoto.org · Feb 02, 2025, 04:43

@barefootstache Nah, the php CLI version is way better than the python version. (In fact, python is a bad language, and you should use a better one 😉 😂 )

In fact, if you have poppler installed, you likely already have pdftotext installed as well.

**barefootstache** @barefootstache@qoto.org · Feb 02, 2025, 05:05

@johnabs thanks for pointing out that poppler might already be installed. That brought me to the pdftohtml CLI tool (after going through the man page of pdftotext), which has most of what I am currently looking for.
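
A small sketch of calling `pdftohtml` from Python, assuming poppler is installed. The wrapper names are made up, and the flags (`-noframes`, `-f`, `-l`) are the ones I understand from the man page, so check `man pdftohtml` on your own system before relying on them.

```python
import subprocess


def pdftohtml_command(pdf_path, out_prefix, first=None, last=None):
    """Build the argument list for a single-file (no frames) HTML export."""
    cmd = ["pdftohtml", "-noframes"]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]  # last page to convert
    return cmd + [pdf_path, out_prefix]


def run_pdftohtml(pdf_path, out_prefix, first=None, last=None):
    """Run the export; raises CalledProcessError if pdftohtml fails."""
    subprocess.run(pdftohtml_command(pdf_path, out_prefix, first, last), check=True)
```

Limiting the page range keeps each trial run short while working on a specific edge case.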

**John BS** @johnabs@qoto.org · Feb 04, 2025, 20:56

@barefootstache Perfect, glad I could contribute a bit :3
