PDF to Text (HTML, XML, or anything structured)

NodeJS
PDF.js (Mozilla) https://github.com/mozilla/pdf.js/
Really good. It generates HTML, but the problem in my case is that the output is not structured, and I need to be able to parse it and extract information.
PDF2JSON https://github.com/modesty/pdf2json

PHP
http://www.pdfparser.org/
http://www.tcpdf.org/doc/code/classTCPDF__PARSER.html

PYTHON
http://www.unixuser.org/~euske/python/pdfminer/
Really good. It generates HTML, but the problem in my case is that the output is not structured, and I need to be able to parse it and extract information.
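
One way to get some structure back out of that kind of HTML is to use the position of each text fragment. The sketch below is only an illustration, using the Python standard library. It assumes the converter emits absolutely positioned div elements with left/top pixel values in the style attribute. The sample markup and the names SAMPLE_HTML, PositionedTextParser and html_to_rows are made up for this example and are not taken from any of the libraries listed here.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical sample: real converter output will differ in detail.
SAMPLE_HTML = """
<div style="position:absolute; left:50px; top:100px;"><span>Name</span></div>
<div style="position:absolute; left:200px; top:100px;"><span>Amount</span></div>
<div style="position:absolute; left:50px; top:120px;"><span>Alice</span></div>
<div style="position:absolute; left:200px; top:120px;"><span>42</span></div>
"""


class PositionedTextParser(HTMLParser):
    """Collect (top, left, text) for text inside positioned divs."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._pos = None  # (top, left) of the div currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        style = dict(attrs).get("style") or ""
        left = re.search(r"left:\s*(\d+)px", style)
        top = re.search(r"top:\s*(\d+)px", style)
        if left and top:
            self._pos = (int(top.group(1)), int(left.group(1)))

    def handle_endtag(self, tag):
        # Simplification: assumes positioned divs are not nested.
        if tag == "div":
            self._pos = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._pos is not None:
            self.fragments.append((*self._pos, text))


def html_to_rows(html):
    """Group fragments sharing a top value into rows, ordered left to right."""
    parser = PositionedTextParser()
    parser.feed(html)
    parser.close()
    rows = defaultdict(list)
    for top, left, text in parser.fragments:
        rows[top].append((left, text))
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items())]


if __name__ == "__main__":
    for row in html_to_rows(SAMPLE_HTML):
        print(row)
```

Grouping on the exact top value is the simplest possible rule. Real documents usually need a tolerance of a few pixels before two fragments count as the same row.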

JAVA
http://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox
http://www.idrsolutions.com/java-code-examples-for-pdf-files/

Tabular data
https://github.com/jazzido/tabula
http://opendata.stackexchange.com/questions/127/extracting-tables-from-multiple-pdfs
Ruby? https://github.com/jazzido/tabula-extractor

Compilation
https://github.com/okfn/ideas/issues/52
