BizTalk PDF2Xml Pipeline Component

I just updated my BizTalk Pipeline Components Extensions Utility Pack project available on GitHub with two new components. The first one was the Archive Pipeline Component for BizTalk Server that I blogged on the BizTalk360 blog, and this new one that I will address here is the BizTalk PDF2Xml Pipeline Component.

For those who are pt familiar, this project is a set of custom pipeline components (libraries) with several custom pipeline components that can be used in received and sent pipelines, which will extend BizTalk’s out-of-the-box pipeline capabilities.

BizTalk PDF2Xml Pipeline Component

BizTalk PDF2Xml Pipeline Component is, as the name mentioned, a decode component that transforms the content of a PDF document to an XML message that BizTalk can understand and process. The component uses the itextsharp library to extract the PDF content. The original source code was available on the CodePlex (pdf2xmlbiztalk.codeplex.com). Still, I couldn’t validate who was the original creator. So, the component first transforms the PDF content to HTML, and then using an external XSLT, will apply a transformation to convert the HTML into a know XML document that BizTalk Server can process. 

My team and I kept that behavior, but we extended this component and added the capability also to, by default, convert it to a well know XML without the need for you to use an XSTL transformation directly on the pipeline.

How does this component work?

This is the list of properties that you can set up on the PDF2XML pipeline component:

Property NameDescriptionSample Values
InternalProcessToHTMLValue to decide if you want the component to transform the PDF content to HTML or XMLTrue/False
IsToApplyTrasnformationValue to decide if you want to apply a transformation on the pipeline component or notTrue/False
XsltFilePathPath to an XSLT transformation fileC:\transf\mymap.xslt

Once you pass the PDF by this component and depending on how you configure it, the outcome can be:

  • All PDF content in an HTML format;
  • All PDF content in an XML format;
  • Part of the PDF content on an XML format (if you apply a transformation)

Unfortunately, on my initial tests, this component works well with some PDF files, but others simply ignore its content. Nevertheless, I make it available as a prof-of-concept.

Download

THIS COMPONENT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND.

You can download BizTalk PDF2Xml Pipeline Component from GitHub here:

Author: Sandro Pereira

Sandro Pereira lives in Portugal and works as a consultant at DevScope. In the past years, he has been working on implementing Integration scenarios both on-premises and cloud for various clients, each with different scenarios from a technical point of view, size, and criticality, using Microsoft Azure, Microsoft BizTalk Server and different technologies like AS2, EDI, RosettaNet, SAP, TIBCO etc. He is a regular blogger, international speaker, and technical reviewer of several BizTalk books all focused on Integration. He is also the author of the book “BizTalk Mapping Patterns & Best Practices”. He has been awarded MVP since 2011 for his contributions to the integration community.

Leave a Reply

Your email address will not be published. Required fields are marked *

turbo360

Back to Top