BizTalk PDF2Xml Pipeline Component for PDF Conversion

I just updated my BizTalk Pipeline Components Extensions Utility Pack project, available on GitHub, with two new components. The first is the Archive Pipeline Component for BizTalk Server, which I blogged about on the BizTalk360 blog. The second, which I cover here, is the BizTalk PDF2Xml Pipeline Component.

For those who are not familiar with it, this project is a set of custom pipeline components (libraries) that can be used in receive and send pipelines, extending BizTalk’s out-of-the-box pipeline capabilities.

📝 One-Minute Brief

A custom BizTalk pipeline decode component that extracts text from PDF files and converts it into XML so BizTalk Server can process the content. Built on the iTextSharp library and extended from an old CodePlex project, the component can output HTML, raw XML, or XSLT‑transformed XML, giving you flexibility when integrating PDF‑based business documents.

BizTalk PDF2Xml Pipeline Component

BizTalk PDF2Xml Pipeline Component is, as the name suggests, a decode component that transforms the content of a PDF document into an XML message that BizTalk can understand and process. The component uses the iTextSharp library to extract the PDF content. The original source code was available on CodePlex (pdf2xmlbiztalk.codeplex.com), although I couldn’t validate who the original creator was. In that original version, the component first transforms the PDF content to HTML and then, using an external XSLT, applies a transformation to convert the HTML into a known XML document that BizTalk Server can process.

My team and I kept that behavior, but we also extended the component so that, by default, it converts the PDF content to a well-known XML format without requiring you to apply an XSLT transformation in the pipeline.
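To make the idea of wrapping extracted PDF text in XML more concrete, here is a minimal sketch in Python. It is not the component's code (the component is a .NET library built on iTextSharp), and the element names PdfDocument, Page and Line are hypothetical, chosen only for illustration. It assumes the text has already been extracted as a list of pages, each one a list of lines.

```python
import xml.etree.ElementTree as ET


def pages_to_xml(pages):
    """Wrap already-extracted PDF text in a simple XML envelope.

    `pages` is a list of pages, each page being a list of text lines.
    The element names used here are hypothetical, not the schema
    produced by the real component.
    """
    root = ET.Element("PdfDocument")
    for page_number, lines in enumerate(pages, start=1):
        page = ET.SubElement(root, "Page", number=str(page_number))
        for line in lines:
            text = line.strip()
            if text:  # skip blank lines so they do not become empty elements
                ET.SubElement(page, "Line").text = text
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data standing in for text extracted from a PDF.
sample_pages = [
    ["Invoice 1001", "", "Total: 42.00 EUR"],
    ["Thank you & goodbye"],
]
xml_text = pages_to_xml(sample_pages)
print(xml_text)
```

Building the document with an XML library rather than string concatenation means special characters such as & are escaped automatically, which matters when the text comes from arbitrary PDF files.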

How does this component work?

This is the list of properties that you can set up on the PDF2XML pipeline component:

Property Name	Description	Sample Values
InternalProcessToHTML	Value to decide if you want the component to transform the PDF content to HTML or XML	True/False
IsToApplyTrasnformation	Value to decide if you want to apply a transformation on the pipeline component or not	True/False
XsltFilePath	Path to an XSLT transformation file	C:\transf\mymap.xslt

Once you pass the PDF through this component, and depending on how you configure it, the outcome can be:

All PDF content is in an HTML format.
All PDF content is in an XML format.
Part of the PDF content in an XML format (if you apply a transformation).
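The relationship between the three properties and the three outcomes above can be sketched as a small decision function. This Python sketch is an assumption about how the settings combine, not the component's actual logic: it assumes that a requested transformation takes precedence, that it needs a non-empty XsltFilePath, and that InternalProcessToHTML set to True means HTML output. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pdf2XmlSettings:
    """Hypothetical mirror of the three pipeline component properties."""
    internal_process_to_html: bool = False   # InternalProcessToHTML
    apply_transformation: bool = False       # the "apply transformation" flag
    xslt_file_path: Optional[str] = None     # XsltFilePath


def expected_output(settings: Pdf2XmlSettings) -> str:
    """Return a label for the outcome, under the assumptions stated above."""
    if settings.apply_transformation:
        # Assumption: a transformation cannot run without an XSLT file.
        if not settings.xslt_file_path:
            raise ValueError("Transformation requested but no XSLT path set")
        return "transformed-xml"
    if settings.internal_process_to_html:
        return "html"
    return "xml"


# Example: the default configuration in this sketch yields plain XML.
print(expected_output(Pdf2XmlSettings()))
print(expected_output(Pdf2XmlSettings(internal_process_to_html=True)))
print(expected_output(Pdf2XmlSettings(apply_transformation=True,
                                      xslt_file_path=r"C:\transf\mymap.xslt")))
```

Check the component's source on GitHub for the real precedence rules before relying on any particular combination.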

Unfortunately, in my initial tests, this component works well with some PDF files, but with others it simply ignores their content. Nevertheless, I am making it available as a proof-of-concept.

Download

THIS COMPONENT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND.

You can download BizTalk PDF2Xml Pipeline Component from GitHub here:

Download BizTalk PDF2Xml Pipeline Component

Hope you find this helpful! If you liked the content or found it useful and would like to support me in writing more, consider buying (or helping to buy) a Star Wars Lego set for my son.

Author: Sandro Pereira

Sandro Pereira lives in Portugal and works as a consultant at DevScope. In the past years, he has been working on implementing Integration scenarios both on-premises and cloud for various clients, each with different scenarios from a technical point of view, size, and criticality, using Microsoft Azure, Microsoft BizTalk Server and different technologies like AS2, EDI, RosettaNet, SAP, TIBCO etc. He is a regular blogger, international speaker, and technical reviewer of several BizTalk books all focused on Integration. He is also the author of the book “BizTalk Mapping Patterns & Best Practices”. He has been awarded MVP since 2011 for his contributions to the integration community.
