BizTalk PDF2Xml Pipeline Component

  • Sandro Pereira
  • Mar 22, 2022
  • 3 min read

I just updated my BizTalk Pipeline Components Extensions Utility Pack project, available on GitHub, with two new components. The first is the Archive Pipeline Component for BizTalk Server, which I blogged about on the BizTalk360 blog, and the second, which I will address here, is the BizTalk PDF2Xml Pipeline Component.

For those who are not familiar, this project is a set of custom pipeline components (libraries) that can be used in receive and send pipelines, extending BizTalk’s out-of-the-box pipeline capabilities.

BizTalk Pipeline Components Extensions Utility Pack

📝 One-Minute Brief

A custom BizTalk pipeline decode component that extracts text from PDF files and converts it into XML so BizTalk Server can process the content. Built on the iTextSharp library and extended from an old CodePlex project, the component can output HTML, raw XML, or XSLT‑transformed XML, giving you flexibility when integrating PDF‑based business documents.

BizTalk PDF2Xml Pipeline Component

The BizTalk PDF2Xml Pipeline Component is, as the name suggests, a decode component that transforms the content of a PDF document into an XML message that BizTalk can understand and process. The component uses the iTextSharp library to extract the PDF content. The original source code was available on CodePlex (pdf2xmlbiztalk.codeplex.com), but I couldn’t confirm who the original creator was. That original component first transforms the PDF content to HTML and then, using an external XSLT, converts the HTML into a known XML document that BizTalk Server can process.

My team and I kept that behavior, but we extended the component so that, by default, it converts the PDF content to a well-known XML format without requiring you to apply an XSLT transformation in the pipeline.

How does this component work?

This is the list of properties that you can set up on the PDF2XML pipeline component:

| Property Name | Description | Sample Values |
|---|---|---|
| InternalProcessToHTML | Value to decide if you want the component to transform the PDF content to HTML or XML | True/False |
| IsToApplyTrasnformation | Value to decide if you want to apply a transformation on the pipeline component or not | True/False |
| XsltFilePath | Path to an XSLT transformation file | C:\transf\mymap.xslt |

PDF to XML

Once you pass the PDF through this component, and depending on how you configure it, the outcome can be:

  • All PDF content is in an HTML format.
  • All PDF content is in an XML format.
  • Part of the PDF content in an XML format (if you apply a transformation).
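
To make the combinations easier to picture, here is a minimal Python sketch of how the three properties might map to those three outcomes. It is not the component's code (the component itself is a .NET library). The class name `Pdf2XmlSettings`, the function `expected_outcome`, the outcome labels, the default values, the precedence given to the transformation flag, and the error raised for a missing path are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pdf2XmlSettings:
    """Hypothetical mirror of the three design-time properties in the table."""
    internal_process_to_html: bool = False      # InternalProcessToHTML
    is_to_apply_transformation: bool = False    # IsToApplyTrasnformation (name as published)
    xslt_file_path: Optional[str] = None        # XsltFilePath


def expected_outcome(settings: Pdf2XmlSettings) -> str:
    """Return a label for the expected output format.

    Assumed semantics, not taken from the component's source:
    a requested transformation takes precedence over the HTML/XML choice.
    """
    if settings.is_to_apply_transformation:
        # Assumption: a transformation cannot run without an XSLT file path.
        if not settings.xslt_file_path:
            raise ValueError("XsltFilePath is required when a transformation is requested")
        return "transformed-xml"
    if settings.internal_process_to_html:
        return "html"
    return "xml"


if __name__ == "__main__":
    print(expected_outcome(Pdf2XmlSettings()))
    print(expected_outcome(Pdf2XmlSettings(internal_process_to_html=True)))
    print(expected_outcome(Pdf2XmlSettings(
        is_to_apply_transformation=True,
        xslt_file_path=r"C:\transf\mymap.xslt",
    )))
```

Treat this only as a way to reason about a configuration before deploying it; check the component's source on GitHub for the actual behavior when the flags are combined.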

Unfortunately, in my initial tests, this component worked well with some PDF files, but with others it simply ignored their content. Nevertheless, I am making it available as a proof of concept.

Download

THIS COMPONENT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND.

You can download BizTalk PDF2Xml Pipeline Component from GitHub here:

Hope you find this helpful! If you liked the content or found it useful and would like to support me in writing more, consider buying (or helping to buy) a Star Wars Lego set for my son. 

Author: Sandro Pereira

Sandro Pereira lives in Portugal and works as a consultant at DevScope. In the past years, he has been working on implementing Integration scenarios both on-premises and cloud for various clients, each with different scenarios from a technical point of view, size, and criticality, using Microsoft Azure, Microsoft BizTalk Server and different technologies like AS2, EDI, RosettaNet, SAP, TIBCO etc. He is a regular blogger, international speaker, and technical reviewer of several BizTalk books all focused on Integration. He is also the author of the book “BizTalk Mapping Patterns & Best Practices”. He has been awarded MVP since 2011 for his contributions to the integration community.
