Parsing PDF

Parsing PDFs in Node.js
Chibuike Nwachukwu, LogRocket Blog, Mar 4, 2024 - 7 min read
https://blog.logrocket.com/parsing-pdfs-node-js/

Parsing PDF files is essential in various applications, particularly those involved with document processing and data extraction. There is a plethora of tools available for PDF parsing, so deciding which package to use can be a daunting task.

This article provides a guide to PDF parsing in Node.js, covering the integration of Node packages like pdf-parse and pdfreader. We'll highlight the benefits, usage, and challenges each package presents. We'll also look at why developers might prefer to construct custom parsers, tailoring them to the needs of their projects.

Setting up the Node.js project

To start, we'll create a simple Node.js application. Head over to the terminal and run the following command:

    npm init -y

Next, create a folder called uploads, which will be used to store the sample PDF files used to test the packages.

Two PDF files are used for testing. One contains only a page of text and can be found in uploads/test.pdf. The second file contains text with a table in it and can be found in uploads/table.pdf.

Here is what the test.pdf file looks like:

    Typescript, Serverless.
And here is what the table.pdf file looks like:

    Example table
    This is an example of a data table.

    Blind       5  1  4  34.5%, n=1                 1199 sec, n=1
    Low Vision  5  2  3  98.3%, n=2 (97.7%, n=3)    1716 sec, n=3 (1934 sec, n=2)
    Dexterity   5  4  1  98.3%, n=4                 1672.1 sec, n=4
    Mobility    3  3  0  95.4%, n=3                 1416 sec, n=3

Popular parsing libraries

Let's explore some of the most popular open source Node packages for parsing PDF files.

pdf-parse

pdf-parse is a popular parsing package among developers for its user-friendly interface. Its stability stems from its independence from other parser frameworks, which results in fewer bugs.

Install the package with npm (the sample below imports it as pdf-parse-debugging-disabled), then create a file named pdf-parse.mjs in the project root and add the following:

    import pdf from "pdf-parse-debugging-disabled";

    const data = await pdf("./uploads/test.pdf");
    console.log(data);

This is a basic setup to demonstrate how to integrate pdf-parse into your workflow. Calling the pdf() function and passing it the path to the PDF is all that's needed to process the contents of the file.

The sample result looks like this:

    {
      numpages: 1,
      numrender: 1,
      info: {
        PDFFormatVersion: '1.4',
        IsAcroFormPresent: false,
        IsXFAPresent: false,
        Title: 'Test',
        Producer: 'Skia/PDF m123 Google Docs Renderer'
      },
      metadata: null,
      text: '\n\nTypescript,Serverless.',
      version: '1.10.100'
    }

The numpages property shows the number of pages in the PDF. In our example, it is 1. text contains the contents of the PDF. info contains more information about the document, such as its title, the PDF format version, and how the PDF was produced. metadata contains the PDF's metadata.
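To make these properties concrete, here is a minimal sketch that condenses a pdf-parse result into a short summary. The summarizeResult helper is hypothetical and not part of pdf-parse, and the sampleResult object is copied from the sample output so the snippet runs without a PDF on disk:

```javascript
// Sample result copied from the article's output for uploads/test.pdf.
const sampleResult = {
  numpages: 1,
  numrender: 1,
  info: {
    PDFFormatVersion: "1.4",
    IsAcroFormPresent: false,
    IsXFAPresent: false,
    Title: "Test",
    Producer: "Skia/PDF m123 Google Docs Renderer",
  },
  metadata: null,
  text: "\n\nTypescript,Serverless.",
  version: "1.10.100",
};

// Hypothetical helper (not a pdf-parse API): picks out the fields
// discussed in the article and trims the leading newlines from the text.
function summarizeResult(result) {
  const text = result.text.trim();
  return {
    title: result.info?.Title ?? "(untitled)",
    pages: result.numpages,
    hasMetadata: result.metadata !== null,
    characters: text.length,
    preview: text.slice(0, 40),
  };
}

console.log(summarizeResult(sampleResult));
```

In a real script, you would pass the object returned by pdf() instead of sampleResult.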
Output for PDF with tables

Now, let's take a look at an example of a PDF file that has a table:

    import pdf from "pdf-parse-debugging-disabled";

    const data = await pdf("./uploads/table.pdf");
    console.log(data);

The resulting payload looks like this:

    {
      numpages: 1,
      numrender: 1,
      info: {
        PDFFormatVersion: '1.6',
        IsAcroFormPresent: false,
        IsXFAPresent: false,
        Author: 'Mary',
        Creator: 'Acrobat PDFMaker 9.0 for Word',
        Producer: 'Adobe PDF Library 9.0',
        CreationDate: "D:20110123144232-05'00'",
        ModDate: "D:20140304212414-05'00'"
      },
      metadata: Metadata {
        _metadata: [Object: null prototype] {
          'xmp:modifydate': '2014-03-04T21:24:14-05:00',
          'xmp:createdate': '2011-01-23T14:42:32-05:00',
          'xmp:metadatadate': '2014-03-04T21:24:14-05:00',
          'xmp:creatortool': 'Acrobat PDFMaker 9.0 for Word',
          'xmpmm:documentid': 'uuid:4a18570c-d5bf-445d-9e0e-2efeb989eeb1',
          'xmpmm:instanceid': 'uuid:813474a4-22b0-4180-9415-bb67674d2b7b',
          'xmpmm:subject': '3',
          'dc:format': 'application/pdf',
          'dc:creator': 'Mary',
          'pdf:producer': 'Adobe PDF Library 9.0',
          'pdfx:sourcemodified': 'D:20110123172633'
        }
      },
      text: '\n' +
        '\n' +
        'Example table \n' +
        'This is an example of a data table. \n' +
        'Disability \n' +
        'Accuracy Time to \n' +
        'complete \n' +
        'Blind 5 1 4 34.5%, n=1 1199 sec, n=1 \n' +
        'Low Vision 5 2 3 98.3% n=2 \n' +
        '(97.7%, n=3) \n' +
        '1716 sec, n=3 \n' +
        '(1934 sec, n=2) \n' +
        'Dexterity 5 4 1 98.3%, n=4 1672.1 sec, n=4 \n' +
        'Mobility 3 3 0 95.4%, n=3 1416 sec, n=3 \n' +
        ' \n',
      version: '1.10.100'
    }

As we can observe, pdf-parse doesn't preserve the table structure. Instead, it flattens the table into plain lines of text, and a single table row can end up split across several lines. The pdf-parse package is useful if you only intend to extract text from the PDF and aren't worried about the file's structure.
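If the rows you need follow a predictable shape, a small amount of post-processing can recover some structure from the flattened text. The sketch below is one possible approach, not a pdf-parse feature: extractRows and the ROW pattern are hypothetical names, and the pattern assumes that a row fits on a single line, as three of the four rows do in the sample output.

```javascript
// Lines copied from the `text` property of the sample pdf-parse output.
const sampleText = [
  "Example table ",
  "This is an example of a data table. ",
  "Blind 5 1 4 34.5%, n=1 1199 sec, n=1 ",
  "Low Vision 5 2 3 98.3% n=2 ",
  "(97.7%, n=3) ",
  "1716 sec, n=3 ",
  "(1934 sec, n=2) ",
  "Dexterity 5 4 1 98.3%, n=4 1672.1 sec, n=4 ",
  "Mobility 3 3 0 95.4%, n=3 1416 sec, n=3 ",
].join("\n");

// Assumed row shape: label, three integers, "<pct>%, n=<k>", "<secs> sec, n=<k>".
const ROW =
  /^(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+%),\s*n=\d+\s+([\d.]+) sec,\s*n=\d+$/;

// Hypothetical helper: keeps only the lines that match the assumed row shape.
function extractRows(rawText) {
  return rawText
    .split("\n")
    .map((line) => line.trim().match(ROW))
    .filter(Boolean)
    .map(([, label, a, b, c, accuracy, seconds]) => ({
      label,
      counts: [Number(a), Number(b), Number(c)],
      accuracy,
      seconds: Number(seconds),
    }));
}

console.log(extractRows(sampleText));
```

The Low Vision row is skipped because pdf-parse split it across four lines, which illustrates how fragile this kind of pattern matching is.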
pdf2json

pdf2json is a module that transforms PDF files from binary to JSON format, using pdf.js for its core functionality. It also incorporates support for interactive form elements, enhancing its utility in processing and interpreting PDF content.

To install pdf2json, run the following command:

    npm i pdf2json

Next, create a file called pdf2json.mjs in the root folder of the project and insert the following:

    import fs from "fs";
    import PDFParser from "pdf2json";

    // The second argument (1) asks the parser to keep the raw text content
    const pdfParser = new PDFParser(this, 1);
    const filename = "./uploads/table.pdf";

    pdfParser.on("pdfParser_dataError", (errData) =>
      console.error(errData.parserError)
    );
    pdfParser.on("pdfParser_dataReady", (pdfData) => {
      console.log(pdfData);
    });

    pdfParser.loadPDF(filename);

This is a basic setup to show how to integrate this package into your workflow. The pdfParser_dataReady listener is called when the parser has finished processing the PDF contents.

The sample result looks like this:

    {
      Transcoder: 'pdf2json@<version> [https://github.com/modesty/pdf2json]',
      Meta: {
        PDFFormatVersion: '1.6',
        IsAcroFormPresent: false,
        IsXFAPresent: false,
        Author: 'Mary',
        Creator: 'Acrobat PDFMaker 9.0 for Word',
        Producer: 'Adobe PDF Library 9.0',
        CreationDate: "D:20110123144232-05'00'",
        ModDate: "D:20140304212414-05'00'",
        Metadata: {
          'xmp:modifydate': '2014-03-04T21:24:14-05:00',
          'xmp:createdate': '2011-01-23T14:42:32-05:00',
          'xmp:metadatadate': '2014-03-04T21:24:14-05:00',
          'xmp:creatortool': 'Acrobat PDFMaker 9.0 for Word',
          'xmpmm:documentid': 'uuid:4a18570c-d5bf-445d-9e0e-2efeb989eeb1',
          'xmpmm:instanceid': 'uuid:813474a4-22b0-4180-9415-bb67674d2b7b',
          'xmpmm:subject': '3',
          'dc:format': 'application/pdf',
          'dc:creator': 'Mary',
          'pdf:producer': 'Adobe PDF Library 9.0',
          'pdfx:sourcemodified': 'D:20110123172633'
        }
      },
      Pages: [
        {
          Width: 38.25,
          Height: 49.5,
          HLines: [],
          VLines: [],
          Fills: [Array],
          Texts: [Array],
          Fields: [],
          Boxsets: []
        }
      ]
    }

The Pages property contains the contents of the PDF, while Meta contains the PDF metadata.

To display the text contents of the file, point filename at ./uploads/test.pdf and replace the pdfParser_dataReady listener with the following:

    pdfParser.on("pdfParser_dataReady", (pdfData) => {
      console.log({ textContent: pdfParser.getRawTextContent() });
    });

Here's the output:

    {
      textContent: 'Typescript,Serverless.\r\n----------------Page (0) Break----------------\r\n'
    }

Next, let's update the sample PDF to one that has a table in it. Replace the above code with the following:

    import fs from "fs";
    import PDFParser from "pdf2json";

    const pdfParser = new PDFParser(this, 1);
    const filename = "./uploads/table.pdf";

    pdfParser.on("pdfParser_dataError", (errData) =>
      console.error(errData.parserError)
    );
    pdfParser.on("pdfParser_dataReady", (pdfData) => {
      console.log({ textContent: pdfParser.getRawTextContent() });
    });

    pdfParser.loadPDF(filename);

Here's the output data in the console:

    {
      textContent: 'Example table \r\n' +
        'This is an example of a data table. \r\n' +
        'Disability \r\n' +
        'Participants \r\n' +
        ...
        'Terminated \r\n' +
        'Results \r\n' +
        'Accuracy Time to \r\n' +
        'complete \r\n' +
        'Blind 5 1 4 34.5%, n=1 1199 sec, n=1 \r\n' +
        'Low Vision 5 2 3 98.3% n=2 \r\n' +
        '(97.7%, n=3) \r\n' +
        '1716 sec, n=3 \r\n' +
        '(1934 sec, n=2) \r\n' +
        'Dexterity 5 4 1 98.3%, n=4 1672.1 sec, n=4 \r\n' +
        'Mobility 3 3 0 95.4%, n=3 1416 sec, n=3 \r\n' +
        '\r\n' +
        ...
    }

With pdf2json, the raw text output treats a PDF with tables no differently from one without: the table structure is lost, just as it was with pdf-parse.

pdfreader

pdfreader is another tool that converts PDFs from binary to JSON format. Underneath, it uses pdf2json. Unlike the packages we have seen so far, which don't support tabular data, this package does so with automatic column detection and rule-based parsing.

To install pdfreader, run the following command:

    npm i pdfreader

Next, create a file called pdfreader.mjs in the root folder of the project and insert the following:

    import { PdfReader } from "pdfreader";

    const filename = "./uploads/test.pdf";
    var rows = {}; // indexed by y-position

    function printRows() {
      Object.keys(rows) // => array of y-positions (type: float)
        .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // sort float positions
        .forEach((y) => console.log((rows[y] || []).join("")));
    }

    new PdfReader().parseFileItems(filename, function (err, item) {
      if (!item || item.page) {
        // end of file, or page
        printRows();
        item?.page && console.log("PAGE:", item.page);
        rows = {}; // clear rows for next page
      } else if (item.text) {
        // accumulate text items into rows object, per line
        (rows[item.y] = rows[item.y] || []).push(item.text);
      }
    });

The parseFileItems method invokes the callback once for every item it parses from the file. In the code example above, the callback receives each parsed item as its item parameter.
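The row-grouping step in the pdfreader snippet is easier to follow in isolation. The sketch below applies the same idea (bucket text items by their y position, then sort the buckets from top to bottom) to a few hypothetical items, so it runs without pdfreader installed. Sorting each row by x is an extra step that the original snippet does not perform; it is added here on the assumption that items may arrive out of order.

```javascript
// Hypothetical items shaped like pdfreader text items ({ text, x, y }).
const sampleItems = [
  { text: "world", x: 5, y: 2.5 },
  { text: "Hello ", x: 1, y: 2.5 },
  { text: "Title", x: 1, y: 1.0 },
];

// Groups items that share a y-position into one line of text.
function groupIntoLines(items) {
  const rows = {}; // indexed by y-position, as in the pdfreader snippet
  for (const item of items) {
    (rows[item.y] = rows[item.y] || []).push(item);
  }
  return Object.keys(rows)
    .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // top of page first
    .map((y) =>
      [...rows[y]]
        .sort((a, b) => a.x - b.x) // assumption: order cells left to right
        .map((item) => item.text)
        .join("")
    );
}

console.log(groupIntoLines(sampleItems));
```

Keying rows on the exact y value only works when every item on a visual line reports the same coordinate; items that differ slightly in y would need to be rounded or bucketed first.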
An item object can match one of the following:

- null: This means that parsing is over or an error has occurred.
- File metadata: Object sample: {file: {path: string}}. This occurs when a PDF is being opened, and is always the first item.
- Page metadata: Object sample: {page: integer, width: float, height: float}. This indicates that a new page is being parsed. It provides the page number, starting at index 1.
- Text: Object sample: {text: string, x: float, y: float, w: float, ...}. This contains the text property and floating 2D AABB coordinates on the page.

The result looks like this:

    PAGE: 1
    Typescript,Serverless.

Sample payload for a PDF with tables

Next, we'll replace the PDF we used above with a PDF file that includes a table. Replace the above code with the following:

    import { PdfReader, TableParser } from "pdfreader";

    const filename = "./uploads/table.pdf";
    const nbCols = 2; // assumed: not visible in the source; 2 matches the pdfreader README example
    const cellPadding = 40; // assumed: truncated after "4" in the source; 40 matches the pdfreader README example
    const columnQuantitizer = (item) => parseFloat(item.x) >= 20;

    // polyfill for String.prototype.padEnd()
    // https://github.com/uxitten/polyfill/blob/master/string.polyfill.js
    // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/G
    if (!String.prototype.padEnd) {
      String.prototype.padEnd = function padEnd(targetLength, padString) {
        targetLength = targetLength >> 0; // floor if number or convert non-number to 0
        padString = String(padString || " ");
        if (this.length > targetLength) {
          return String(this);
        } else {
          targetLength = targetLength - this.length;
          if (targetLength > padString.length) {
            padString += padString.repeat(targetLength / padString.length);
          }
          return String(this) + padString.slice(0, targetLength);
        }
      };
    }

    const padColumns = (array, nb) =>
      Array.apply(null, { length: nb }).map((val, i) => array[i] || []);
    const mergeCells = (cells) => (cells || []).map((cell) => cell.text).join("");
    const formatMergedCell = (mergedCell) =>
      mergedCell.substr(0, cellPadding).padEnd(cellPadding, " ");
    const renderMatrix = (matrix) =>
      (matrix || [])
        .map(
          (row, y) =>
            "| " +
            padColumns(row, nbCols)
              .map(mergeCells)
              .map(formatMergedCell)
              .join(" | ") +
            " |"
        )
        .join("\n");

    var table = new TableParser();

    new PdfReader().parseFileItems(filename, function (err, item) {
      if (err) console.error(err);
      else if (!item || item.page) {
        // end of file, or page: print the accumulated table and start a new one
        console.log(renderMatrix(table.getMatrix()));
        item?.page && console.log("PAGE:", item.page);
        table = new TableParser();
      } else if (item.text) {
        table.processItem(item, columnQuantitizer(item));
      }
    });

Looking at the logged data in the console, this is what we see:

    (Console output: the table contents rendered as rows of cells delimited by | characters.)

pdfreader stands apart from the other parsers we covered above in its ability to recognize table structures. While it may not achieve perfect accuracy, as we can see in the output above, it offers a more effective solution for handling complex document layouts.

Comparing the parsing packages

Each package we covered above has its strengths and drawbacks.
Let's compare them side by side so you can easily determine the most suitable choice for your project:

pdf-parse
    Node.js version: v14 and above
    Capabilities: Reads from a specified file path (supports both local and external sources) and from a memory buffer

pdfreader
    Node.js version: v14 and above
    Capabilities: Supports password-protected files, file paths and file buffers, PDFs with tables, and CLI usage
    Dependencies: Depends on pdf2json

pdf2json
    Node.js version: v14 and above
    Capabilities: Reads from a file path or memory buffer
    Dependencies: Depends on pdf.js

The need for custom parsers

Often, project requirements call for something beyond what most packages can provide, hence the need to build a custom parser. One example we discussed in this article is the need to accurately parse tables within PDFs. Given that the evaluated packages fall short in this area, it might become essential to develop a custom parser for your project. Ideally, this parser would be built on top of an existing open source parser rather than from scratch.

Conclusion

In this article, we explored how to parse PDF files in Node.js using multiple npm packages. We further expanded our knowledge by comparing the packages and exploring the challenges they encounter.

You could also further expand your understanding by modifying the sample code provided to discover new applications for these packages. Hopefully, you enjoyed this article and have learned a new way of processing your PDF file contents. Thanks for reading!