How to convert HTML to PDF using iTextSharp

Best answer

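First, HTML and PDF are not related, although they were created around the same time. HTML is intended to convey higher-level information such as paragraphs and tables. Although there are ways to control it, it is ultimately up to the browser to draw these higher-level concepts. PDF is intended to convey documents, and the documents must "look" the same wherever they are rendered.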

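In an HTML document you might have a paragraph that's 100% wide, and depending on the width of your monitor it might take 2 lines or 10 lines; when you print it, it might be 7 lines, and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.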

Because of the musts above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes and images. (There are some other things, like annotations and movies, but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say, "draw this text at this exact X,Y location using this exact font and don't worry, I've previously calculated the width of the text so I know it will all fit on this line". You also don't say "here's a table" but instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated so I know it will appear to be around the text".
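To make the "I've previously calculated the width" part concrete, here is a minimal sketch of the kind of layout work a PDF producer has to do before it can emit "draw this text at X,Y". It is not iText code. It assumes a hypothetical fixed-width font in which every character is charWidth points wide (real producers measure glyphs from font metrics), and the names LayoutSketch, PlacedLine and Wrap are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LayoutSketch
{
    // One line of text together with the exact position it will be drawn at.
    public sealed class PlacedLine
    {
        public string Text;
        public double X;
        public double Y;
    }

    // Greedy word wrap: breaks the text into lines that fit maxWidth and gives
    // each line a fixed X,Y position. Assumes every character is charWidth wide
    // (a simplification; real fonts need per-glyph metrics).
    public static List<PlacedLine> Wrap(string text, double maxWidth, double charWidth,
                                        double left, double top, double lineHeight)
    {
        var lines = new List<PlacedLine>();
        var current = "";
        double y = top;

        foreach (var word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var candidate = current.Length == 0 ? word : current + " " + word;

            // An overlong word still gets a line of its own rather than being dropped.
            if (candidate.Length * charWidth <= maxWidth || current.Length == 0)
            {
                current = candidate;
            }
            else
            {
                lines.Add(new PlacedLine { Text = current, X = left, Y = y });
                y -= lineHeight; // PDF's Y axis grows upward, so the next line sits lower
                current = word;
            }
        }

        if (current.Length > 0)
            lines.Add(new PlacedLine { Text = current, X = left, Y = y });

        return lines;
    }
}
```

Calling Wrap with the same text and two different maxWidth values gives two different line counts, which is exactly the decision a browser makes at display time and a PDF producer has to make once, up front.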

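Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc. are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, etc., which are all framework-specific abstractions. It is your responsibility to get the HTML from your framework of choice; iText won't help you. If you get an exception saying "The document has no pages", or you think that "iText isn't parsing my HTML", it is almost certain that you don't actually have HTML, you only think you do.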
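One cheap way to catch the "you only think you have HTML" case early is to check the string before handing it to any parser. The sketch below is only a heuristic and is not part of iTextSharp; HtmlGuard and RequireHtml are hypothetical names chosen for illustration.

```csharp
using System;

public static class HtmlGuard
{
    // Throws early, with a clearer message, when the "HTML" about to be parsed
    // is empty or contains no tag at all. This is a heuristic sanity check,
    // not a validator.
    public static string RequireHtml(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
            throw new ArgumentException(
                "No HTML was produced; check how the markup is rendered.", nameof(html));

        int open = html.IndexOf('<');
        if (open < 0 || html.IndexOf('>', open) < 0)
            throw new ArgumentException(
                "The input contains no tags, so it is probably not HTML.", nameof(html));

        return html;
    }
}
```

Wrapping whatever your framework rendered in HtmlGuard.RequireHtml(...) just before creating the StringReader turns a puzzling parser failure into an error that points at the rendering step.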

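Third, the built-in class that has been around for years is HTMLWorker; however, it has been replaced by XMLWorker (Java / .Net). No work is being done on HTMLWorker, which doesn't support CSS files, has only limited support for the most basic CSS properties, and actually breaks on certain tags. If you do not see an HTML attribute or a CSS property and value in the HTMLWorker source, it is probably not supported by HTMLWorker. XMLWorker can sometimes be more complicated, but those complications also make it more extensible.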

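Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document you are working on. C# and Java are very similar, so it should be relatively easy to convert. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported, class="headline" gets ignored, but everything else should work. Example #2 is the same as the first except that it uses XMLWorker. Example #3 also parses the simple CSS example.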

//Namespaces used below: System, System.IO, iTextSharp.text, iTextSharp.text.pdf

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }

            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);

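2017 update

There is good news for HTML-to-PDF needs. As this answer showed, the W3C standard css-break-3 (https://www.w3.org/TR/css-break-3/) will solve the problem... It is a Candidate Recommendation that is planned to become a final Recommendation this year, after testing.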

As less standard solutions, there are plugins for C#, as shown by print-css.rocks.

Answer

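@Chris Haas has explained very well how to use iTextSharp to convert HTML to PDF, which was very helpful.
My addition is:
By using HtmlTextWriter and putting the HTML tags inside an HTML table with inline CSS, I got the PDF I wanted without using XMLWorker.
Edit: adding sample code:
ASPX page: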

<asp:Panel runat="server" ID="PendingOrdersPanel">
    <!-- to be shown on PDF-->
    <table style="border-spacing: 0;border-collapse: collapse;width:100%;display:none;" >
        <tr><td><img src="abc.com/webimages/logo1.png" style="display: none;" width="230" /></td></tr>
        <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
        <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
        <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
        <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
        <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:11px;color:#10466E;padding:0px;text-align:center;"><i>blablabla</i> Pending orders report<br /></td></tr>
    </table>
    <asp:GridView runat="server" ID="PendingOrdersGV" RowStyle-Wrap="false" AllowPaging="true" PageSize="10" Width="100%" CssClass="Grid" AlternatingRowStyle-CssClass="alt" AutoGenerateColumns="false"
        PagerStyle-CssClass="pgr" HeaderStyle-ForeColor="White" PagerStyle-HorizontalAlign="Center" HeaderStyle-HorizontalAlign="Center" RowStyle-HorizontalAlign="Center" DataKeyNames="Document#"
        OnPageIndexChanging="PendingOrdersGV_PageIndexChanging" OnRowDataBound="PendingOrdersGV_RowDataBound" OnRowCommand="PendingOrdersGV_RowCommand">
        <EmptyDataTemplate><div style="text-align:center;">no records found</div></EmptyDataTemplate>
        <Columns>
            <asp:ButtonField CommandName="PendingOrders_Details" DataTextField="Document#" HeaderText="Document #" SortExpression="Document#" ItemStyle-ForeColor="Black" ItemStyle-Font-Underline="true"/>
            <asp:BoundField DataField="Order#" HeaderText="order #" SortExpression="Order#"/>
            <asp:BoundField DataField="Order Date" HeaderText="Order Date" SortExpression="Order Date" DataFormatString="{0:d}"></asp:BoundField>
            <asp:BoundField DataField="Status" HeaderText="Status" SortExpression="Status"></asp:BoundField>
            <asp:BoundField DataField="Amount" HeaderText="Amount" SortExpression="Amount" DataFormatString="{0:C2}"></asp:BoundField>
        </Columns>
    </asp:GridView>
</asp:Panel>

C# code:

protected void PendingOrdersPDF_Click(object sender, EventArgs e)
{
    if (PendingOrdersGV.Rows.Count > 0)
    {
        //Turn paging off and restyle the grid so that every row is exported
        PendingOrdersGV.HeaderStyle.ForeColor = System.Drawing.Color.Black;
        PendingOrdersGV.BorderColor = System.Drawing.Color.Gray;
        PendingOrdersGV.Font.Name = "Tahoma";
        PendingOrdersGV.DataSource = clsBP.get_PendingOrders(lbl_BP_Id.Text);
        PendingOrdersGV.AllowPaging = false;
        PendingOrdersGV.Columns[0].Visible = false; //export won't work if there's a link in the gridview
        PendingOrdersGV.DataBind();

        //to PDF code --Sam
        string attachment = "attachment; filename=report.pdf";
        Response.ClearContent();
        Response.AddHeader("content-disposition", attachment);
        Response.ContentType = "application/pdf";

        //Render the panel (hidden table + GridView) to an HTML string
        StringWriter stw = new StringWriter();
        HtmlTextWriter htextw = new HtmlTextWriter(stw);
        htextw.AddStyleAttribute("font-size", "8pt");
        htextw.AddStyleAttribute("color", "Grey");
        PendingOrdersPanel.RenderControl(htextw); //Name of the Panel

        //A4 page with left, right, top and bottom margins of 5, 5, 15 and 5 points
        Document document = new Document(PageSize.A4, 5, 5, 15, 5);
        //Note: the font returned here is discarded, so this call does not change the output
        FontFactory.GetFont("Tahoma", 50, iTextSharp.text.BaseColor.BLUE);
        PdfWriter.GetInstance(document, Response.OutputStream);
        document.Open();

        //Parse the rendered HTML into the document
        StringReader str = new StringReader(stw.ToString());
        HTMLWorker htmlworker = new HTMLWorker(document);
        htmlworker.Parse(str);

        //Closing the document flushes the PDF to Response.OutputStream.
        //End the response here instead of calling Response.Write(document),
        //which would append the document's ToString() text after the PDF bytes.
        document.Close();
        Response.End();
    }
}

And of course include the iTextSharp references in the .cs file:

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;

Hope this helps!
Thank you

Answer

This is the link I used as a guide. Hope this helps!

Converting HTML to PDF using ITextSharp

protected void Page_Load(object sender, EventArgs e)
{
    try
    {
        //HTML file path
        string htmlFileName = Server.MapPath("~") + "\\files\\" + "ConvertHTMLToPDF.htm";
        //PDF file path
        string pdfFileName = Request.PhysicalApplicationPath + "\\files\\" + "ConvertHTMLToPDF.pdf";

        //Read the HTML markup from the file
        string strHtml = File.ReadAllText(htmlFileName);

        strHtml = strHtml.Replace("\r\n", "");
        strHtml = strHtml.Replace("\0", "");

        CreatePDFFromHTMLFile(strHtml, pdfFileName);

        Response.Write("PDF created successfully with password");
    }
    catch (Exception ex)
    {
        Response.Write(ex.Message);
    }
}

public void CreatePDFFromHTMLFile(string HtmlStream, string FileName)
{
    //Build an unencrypted PDF from the HTML.
    //HtmlToPdfBuilder and HtmlPdfPage live in the TestPDF namespace, which is not part of
    //iTextSharp (presumably they ship with the downloadable sample).
    TestPDF.HtmlToPdfBuilder builder = new TestPDF.HtmlToPdfBuilder(iTextSharp.text.PageSize.A4);
    TestPDF.HtmlPdfPage first = builder.AddPage();
    first.AppendHtml(HtmlStream);
    byte[] file = builder.RenderPdf();
    File.WriteAllBytes(FileName, file);

    //Encrypt into a temporary file next to the target,
    //e.g. "ConvertHTMLToPDF.pdf" -> "ConvertHTMLToPDF1.pdf"
    string tempFileName = FileName.Insert(FileName.Length - 4, "1");
    string password = "password";
    iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(FileName);

    //FileMode.Create (not Append) so a leftover temp file cannot corrupt the output
    using (FileStream output = new FileStream(tempFileName, FileMode.Create))
    {
        iTextSharp.text.pdf.PdfEncryptor.Encrypt(reader, output, iTextSharp.text.pdf.PdfWriter.STRENGTH128BITS, password, "", iTextSharp.text.pdf.PdfWriter.AllowPrinting);
    }
    reader.Close();

    //Replace the unencrypted file with the encrypted one.
    //Exceptions propagate to the caller with their original stack trace.
    File.Delete(FileName);
    File.Move(tempFileName, FileName);
}

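You can download the sample file. Just place the HTML you want to convert in the files folder and run it. It will automatically generate the PDF file and place it in the same folder. In your case, you can specify your own HTML path in the htmlFileName variable.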

Answer

As of 2018, there is also iText7 (the next iteration of the old iTextSharp library) and its HTML to PDF package: itext7.pdfhtml (https://www.nuget.org/package/itext7.pdfhtml/)

Usage is straightforward:

HtmlConverter.ConvertToPdf(
    new FileInfo(@"Path\to\Html\File.html"),
    new FileInfo(@"Path\to\Pdf\File.pdf")
);

The method has many more overloads.
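As one example of those overloads, the sketch below converts an HTML string held in memory instead of a file. It assumes the itext7.pdfhtml package is installed and uses the HtmlConverter.ConvertToPdf(string, Stream) overload; PdfHtmlSketch and HtmlToPdfBytes are illustrative names, not part of the library.

```csharp
using System.IO;
using iText.Html2pdf;

public static class PdfHtmlSketch
{
    // Converts an HTML string to PDF bytes entirely in memory.
    public static byte[] HtmlToPdfBytes(string html)
    {
        using (var ms = new MemoryStream())
        {
            // The converter writes the finished PDF into the stream.
            HtmlConverter.ConvertToPdf(html, ms);

            // MemoryStream.ToArray() still works if the stream has been closed.
            return ms.ToArray();
        }
    }
}
```

The returned bytes can then be written to disk or sent in an HTTP response, in the same way as the byte array in the iTextSharp examples above.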

Update: the iText* family of products has a dual licensing model: free for open source, paid for commercial use.

Answer

I use the following code to create a PDF:

//ITextEvents, TableProcessor, CSSSource, HTMLSource, OutputFileName, OutputPath and base.View
//are not iTextSharp members; they are defined elsewhere in the surrounding class.
protected void CreatePDF(Stream stream)
{
    using (var document = new Document(PageSize.A4, 40, 40, 40, 30))
    {
        var writer = PdfWriter.GetInstance(document, stream);
        writer.PageEvent = new ITextEvents();
        document.Open();

        // instantiate custom tag processor and add to `HtmlPipelineContext`.
        var tagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
        tagProcessorFactory.AddProcessor(
            new TableProcessor(),
            new string[] { HTML.Tag.TABLE }
        );

        //Register Fonts.
        XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
        fontProvider.Register(HttpContext.Current.Server.MapPath("~/Content/Fonts/GothamRounded-Medium.ttf"), "Gotham Rounded Medium");
        CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);

        var htmlPipelineContext = new HtmlPipelineContext(cssAppliers);
        htmlPipelineContext.SetTagFactory(tagProcessorFactory);

        var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
        var htmlPipeline = new HtmlPipeline(htmlPipelineContext, pdfWriterPipeline);

        // get an ICssResolver and add the custom CSS
        var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
        cssResolver.AddCss(CSSSource, "utf-8", true);
        var cssResolverPipeline = new CssResolverPipeline(
            cssResolver, htmlPipeline
        );

        var worker = new XMLWorker(cssResolverPipeline, true);
        var parser = new XMLParser(worker);
        using (var stringReader = new StringReader(HTMLSource))
        {
            parser.Parse(stringReader);
            document.Close();

            //The MIME type must not contain a space
            HttpContext.Current.Response.ContentType = "application/pdf";
            if (base.View)
                HttpContext.Current.Response.AddHeader("content-disposition", "inline;filename=\"" + OutputFileName + ".pdf\"");
            else
                HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=\"" + OutputFileName + ".pdf\"");
            HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
            HttpContext.Current.Response.WriteFile(OutputPath);
            HttpContext.Current.Response.End();
        }
    }
}

Answer

For iTextSharp.tool.xml.XMLWorkerHelper you need to install the following package: ITextSharp.xmlworker