Currently trying to extract and format data from PDFs using #python #PyMuPDF.
Initially used the `get_text(value)` method with the `"text"` value, only to learn that I could potentially have saved time by using the `"html"` value directly, since I have been creating pattern matchers to format the text into #HTML.
After investigating: although the `"html"` option exists, post-processing its output is more strenuous than my initial approach.
My fascination with the `get_text(value)` method is that each value packages the data differently. Whereas `"html"` puts the text in `<p><span>text</span></p>`, `"xhtml"` puts it instead in `<h1>text</h1>`.
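A minimal sketch for eyeballing how the `get_text()` values differ on the same page. The helper names `collect_formats` and `preview` are made up for illustration, and the import assumes a recent PyMuPDF release (older ones are imported as `fitz`).

```python
# The three get_text() values mentioned in this thread.
FORMATS = ("text", "html", "xhtml")


def collect_formats(page, formats=FORMATS):
    """Return {format: output} for any page object that exposes get_text()."""
    return {fmt: page.get_text(fmt) for fmt in formats}


def preview(pdf_path, page_index=0, width=300):
    """Print the first `width` characters of each format for one page."""
    # Lazy import so the helper above works without PyMuPDF installed.
    # Assumption: a recent PyMuPDF; on older releases use `import fitz` instead.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        page = doc[page_index]
        for fmt, output in collect_formats(page).items():
            print(f"--- {fmt} ---")
            print(output[:width])
```

Calling `preview("some.pdf")` (the file name is a placeholder) prints the three variants one after another, which makes it easier to judge how much post-processing each one would need.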
@barefootstache Have you tried using pdftotext with layout mode and some regex? That worked for me with a 600 page database schema :)
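A rough sketch of what "layout mode and some regex" could look like, assuming poppler's `pdftotext` is on the PATH. The function names and the two-or-more-spaces column rule are illustrative guesses, not the exact approach used for the schema mentioned above.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return the text it writes to stdout."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Assumption: in layout mode, columns are separated by runs of 2+ spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(line):
    """Split one layout-preserved line into its column values."""
    return COLUMN_GAP.split(line.strip())
```

For table-like pages, something like `[split_columns(l) for l in pdf_to_layout_text(path).splitlines() if l.strip()]` gives a list of rows to work with.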
@barefootstache pdftotext allows image extraction apparently :3
@johnabs I was looking at
@barefootstache Nah, the php CLI version is way better than the python version. (In fact, python is a bad language, and you should use a better one 😉 😂 )
In fact, if you have poppler installed, you likely already have pdftotext installed as well.
@johnabs thanks for pointing out that poppler is likely already installed, which brought me to the pdftohtml CLI tool (after going through the man page of pdftotext), which has most of what I am currently looking for.
@barefootstache Perfect, glad I could contribute a bit :3
@johnabs no, I haven't tried out many other libraries. I am currently searching for better options, though at the same time I will need image extraction capabilities.
Have tried PyPDF, though I wasn't happy with the results.
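On the image extraction requirement: a minimal sketch of how it might be done with PyMuPDF itself, using `Page.get_images()` and `Document.extract_image()`. The helper name `save_images` and the file naming scheme are made up for illustration.

```python
from pathlib import Path


def save_images(doc, out_dir):
    """Write every image referenced by the pages of an opened document to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for page_index in range(len(doc)):
        for img in doc[page_index].get_images(full=True):
            xref = img[0]  # first tuple item is the image's xref number
            info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
            target = out_dir / f"page{page_index + 1}_xref{xref}.{info['ext']}"
            target.write_bytes(info["image"])
            saved.append(target)
    return saved
```

With a document opened via `pymupdf.open(path)` (or `fitz.open(path)` on older releases), `save_images(doc, "images")` returns the written paths. An image that is reused on several pages is written once per page in this sketch.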