


OK, that particular file is not as easy as it looks or as might be expected (with or without Python), since its many variably shaped voids cause problems.

I tried several one-line methods to get a good pre-processed input and this was the cleanest, but there are still extras; even after import to Excel there will need to be some minor edits to tidy double blank lines.

Anyway, the Windows command was (you can call it from Python; it uses the poppler utils):

poppler-22.04.0\Library\bin>pdftotext -fixed 4 -nopgbrk in2.pdf temp.txt & type temp.txt |find /V "NSS" |find /V "F-" |find /V "code" |find /V "(7)" >out.txt

Then you can parse that in different ways, but I personally would import it into Excel for the cleaning and export as CSV using buttons or VBA rather than Python.
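If you would rather stay in Python for this step, here is a minimal sketch of the same pipeline. It assumes `pdftotext` is on your PATH (otherwise pass the full path to the poppler binary) and writes to stdout by giving `-` as the output file. The function names `pdf_to_text`, `drop_unwanted` and `squeeze_blank_lines` are made up for illustration, and the blank-line tidy is an extra that the one-liner does not do.

```python
import subprocess

# Substrings taken from the find /V filters in the Windows command.
UNWANTED = ("NSS", "F-", "code", "(7)")


def pdf_to_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext with the same flags and return its stdout as text."""
    result = subprocess.run(
        [pdftotext, "-fixed", "4", "-nopgbrk", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def drop_unwanted(text, unwanted=UNWANTED):
    """Keep lines containing none of the substrings (case-sensitive, like find /V)."""
    return [
        line for line in text.splitlines()
        if not any(marker in line for marker in unwanted)
    ]


def squeeze_blank_lines(lines):
    """Collapse runs of blank lines into one; leading blank lines are dropped."""
    out = []
    for line in lines:
        if line.strip() or (out and out[-1].strip()):
            out.append(line)
    return out
```

Typical use would be `lines = squeeze_blank_lines(drop_unwanted(pdf_to_text("in2.pdf")))`, then writing `lines` to `out.txt`. The filters are as blunt as the originals: any data row that happens to contain one of those substrings is dropped too.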

That should be a doddle for experienced "Field Staff", so just program it the same way. The novice needs to note that the headers are the same on each page, so they are not needed once memorized from the first page; the rows are all similar, so we only need the bits between the top matter and the bottom matter. A PDF has no white space, just space that is white, so we extract with padding as best we can, and pdftotext can isolate and pad all in one line of code.

Then we have our spatial CSV (space-character-separated values), laid out exactly the way the field staff take it in, and Excel can accept that as input, no problem.
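For anyone who wants a real comma-separated file instead of letting Excel split the padded text, a rough sketch follows. It assumes columns are separated by runs of at least two spaces while single spaces belong inside a field; that is a guess about the padded output, not something the file guarantees, and an empty cell will shift the columns to its right. The names `padded_lines_to_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io
import re


def padded_lines_to_rows(lines, min_gap=2):
    """Split each non-blank line on runs of min_gap or more spaces."""
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in lines:
        stripped = line.strip()
        if stripped:  # skip blank lines entirely
            rows.append(gap.split(stripped))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, quoting fields only where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

If the gaps in your output turn out wider or narrower, adjust `min_gap`; where the columns sit at fixed character positions, slicing by position would be more reliable than splitting on gaps.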
