Getting the number of pages in a PDF document


Best answer

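A simple command-line executable called pdfinfo.

It is available for download for Linux and Windows. You download a compressed archive containing several small PDF-related programs. Extract it somewhere.

One of those files is pdfinfo (or pdfinfo.exe on Windows). Here is an example of the data it returns when run on a PDF document: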

Title:          test1.pdf
Author:         John Smith
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 9.2.0 (Windows)
CreationDate:   01/09/13 19:46:57
ModDate:        01/09/13 19:46:57
Tagged:         yes
Form:           none
Pages:          13    <-- This is what we need
Encrypted:      no
Page size:      2384 x 3370 pts (A0)
File size:      17569259 bytes
Optimized:      yes
PDF version:    1.6

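So far I have not seen a PDF document for which it returned a wrong page count. It is also very fast: even for large documents of 200+ MB, the response time is a few seconds or less.

There is an easy way to extract the page count from the output, here in PHP: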

// Make a function for convenience
function getPDFPages($document)
{
    // Set the path for your platform (keep only one of these two lines)
    $cmd = "/path/to/pdfinfo";               // Linux
    // $cmd = "C:\\path\\to\\pdfinfo.exe";   // Windows

    // Run pdfinfo and collect every output line in $output
    // Surround with double quotes if file name has spaces
    exec("$cmd \"$document\"", $output);

    // Iterate through lines
    $pagecount = 0;
    foreach ($output as $op)
    {
        // Extract the number
        if (preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
        {
            $pagecount = intval($matches[1]);
            break;
        }
    }

    return $pagecount;
}

// Use the function
echo getPDFPages("test 1.pdf");  // Output: 13

Of course this command-line tool can be used from any other language that can parse the output of an external program, but I use it in PHP.
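As an illustration of using the tool outside PHP, here is a minimal shell sketch that picks the page count out of pdfinfo-style output. The function name `pages_from_pdfinfo` is hypothetical, and the demo feeds it a captured sample rather than running pdfinfo, so it only assumes the output format shown above.

```shell
# Hypothetical helper: reads pdfinfo-style output on stdin and prints the page count.
pages_from_pdfinfo() {
    # Select the line whose first field is exactly "Pages:" and print the second field.
    # "Page size:" is not selected because its first field is "Page".
    awk '$1 == "Pages:" { print $2; exit }'
}

# Demo on a captured sample of the output format
printf 'Title:          test1.pdf\nPages:          13\nEncrypted:      no\n' | pages_from_pdfinfo
# prints 13
```

With pdfinfo installed, the same helper could be fed directly, for example `pdfinfo "test 1.pdf" | pages_from_pdfinfo`.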

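I know it is not pure PHP, but external programs are far better at PDF handling (as the question shows).

I hope this helps people, because I spent a lot of time trying to find a solution to this problem. I have seen many questions about PDF page counts in which I did not find the answer I was looking for. That is why I asked this question and answered it myself.

Security note: if the document name comes from user input or a file upload, use escapeshellarg on $document.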


If you cannot install any additional packages, you can use this simple one-liner:

foundPages=$(strings < $PDF_FILE | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)
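To make the one-liner easier to follow, here is a small sketch that runs the same sed/sort/head pipeline on hand-written sample lines instead of a real PDF. The function name `max_count` and the sample dictionaries are made up for illustration. The idea is that the root node of the page tree carries the total in its `/Count` entry, while intermediate nodes carry smaller values, so the largest number is taken.

```shell
# Hypothetical helper: prints the largest number that follows "Count " on stdin.
max_count() {
    # sed keeps only the digits after "Count " (an optional minus sign is skipped),
    # sort -rn puts the largest value first, head keeps that one line.
    sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1
}

# Sample page-tree dictionaries: one root node and one intermediate node
printf '%s\n' '<< /Type /Pages /Kids [4 0 R 9 0 R] /Count 13 >>' \
              '<< /Type /Pages /Kids [5 0 R 6 0 R 7 0 R] /Count 3 >>' \
              '<< /Type /Page /Parent 4 0 R >>' | max_count
# prints 13
```

Two limitations are worth keeping in mind: other dictionaries (outlines, for example) also use a `/Count` key and could produce a larger number, and `strings` cannot see a page tree that is stored inside a compressed object stream.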


Here is an R function that reports the number of pages in a PDF file using the pdfinfo command.

pdf.file.page.number <- function(fname) {
  # shQuote() protects file names that contain spaces or shell metacharacters
  a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
  page.number <- as.numeric(readLines(a))
  close(a)
  page.number
}
if (FALSE) {
  pdf.file.page.number("a.pdf")
}


Here is a Windows command script that uses gsscript (Ghostscript) to report the number of pages in a PDF file.

@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem


:vars
set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
set __lastpagenumber__=1
set __pdffile__="%~1"
set __pdffilename__="%~n1"
set __datetime__=%date%%time%
set __datetime__=%__datetime__:.=%
set __datetime__=%__datetime__::=%
set __datetime__=%__datetime__:,=%
set __datetime__=%__datetime__:/=%
set __datetime__=%__datetime__: =%
set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"


:check
if %__pdffile__%=="" goto error1
if not exist %__pdffile__% goto error2
if not exist %__gs__% goto error3


:main
%__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE  -sstdout=%__tmpfile__%  %__pdffile__%
rem read the Ghostscript output back from the temp file written above
FOR /F "tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
set __lastpagenumber__=%__lastpagenumber__: =%
if exist %__tmpfile__% del %__tmpfile__%


:output
echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
goto end


:error1
echo no pdf file selected
echo usage: %~n0 PDFFILE
goto end


:error2
echo no pdf file found
echo usage: %~n0 PDFFILE
goto end


:error3
echo.can not find the ghostscript bin file
echo.   %__gs__%
echo.please download it from:
echo.   http://www.ghostscript.com/download/
echo.and install to "C:\prg\ghostscript"
goto end


:end
exit /b


The simplest of all is to use ImageMagick.

Here is some sample code:

$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();

Otherwise, you can also use a PDF library for PHP such as MPDF or TCPDF.


The R package pdftools and its function pdf_info() provide information on the number of pages in a PDF.

library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages


$pages
[1] 65


This seems to work quite well, without the need for special packages or for parsing command output.

<?php


$target_pdf = "multi-page-test.pdf";
$cmd = sprintf("identify %s", $target_pdf);
exec($cmd, $output);
$pages = count($output);


If you have access to a shell, the simplest approach (although it does not work for 100% of PDFs) is to use grep.

This should return just the number of pages:

grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf

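Example: https://regex101.com/r/BrUTKn/1

Description of the switches:

-m 1 is necessary because some files can contain more than one match for the regex pattern (volunteer needed to replace this with a regex that matches only the first occurrence)
-a is necessary to treat the binary file as text
-o shows only the match
-P uses Perl regular expressions

Explanation of the regex:

Starting "delimiter": (?<=\/N ), a lookbehind for /N followed by a space (nb. the space character is not visible here)
Actual result: \d+, any number of digits
Ending "delimiter": (?=\/), a lookahead for /

Note: if no match is found in some case, it is safe to assume that only 1 page exists.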
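The command and the fallback rule from the note can be combined in a short sketch. The function name `pages_from_linearized` is hypothetical, GNU grep is assumed (for `-P`), and the sample line imitates the parameter dictionary of a linearized ("fast web view") PDF, where the `/N` entry holds the page count. Files that are not linearized have no such entry, which is one likely reason the approach does not cover every PDF.

```shell
# Hypothetical helper: page count from the /N entry, or 1 when nothing matches.
pages_from_linearized() {
    # grep exits non-zero when nothing matches; fall through with an empty value.
    n=$(grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' "$1" | head -n 1) || n=''
    # Apply the "assume 1 page" rule from the note above.
    printf '%s\n' "${n:-1}"
}

# Demo on a hand-written sample instead of a real PDF
tmp=$(mktemp)
printf '%s\n' '%PDF-1.6' \
    '<</Linearized 1/L 17569259/O 15/E 98765/N 13/T 17568900/H [ 500 150]>>' > "$tmp"
pages_from_linearized "$tmp"   # prints 13
rm -f "$tmp"
```

The extra `head -n 1` is there because `-m 1` stops after the first matching line, but `-o` still prints every match found on that line.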


Since command-line utilities are an option, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in a PDF:

cpdf.exe -pages "my file.pdf"


If a file file_name.pdf has 100 pages,

$ qpdf --show-npages file_name.pdf
100


Based on @Richard's answer, I created a wrapper class for pdfinfo, in case it is useful to anyone:

/**
 * Wrapper for pdfinfo program, part of xpdf bundle
 * http://www.xpdfreader.com/about.html
 *
 * this will put all pdfinfo output into keyed array, then make them accessible via getValue
 */
class PDFInfoWrapper {

    const PDFINFO_CMD = 'pdfinfo';

    /**
     * keyed array to hold all the info
     */
    protected $info = array();

    /**
     * raw output in case we need it
     */
    public $raw = "";

    /**
     * Constructor
     * @param string $filePath - path to file
     */
    public function __construct($filePath) {
        exec(self::PDFINFO_CMD . ' "' . $filePath . '"', $output);

        //loop each line and split into key and value
        foreach($output as $line) {
            $colon = strpos($line, ':');
            if($colon) {
                $key = trim(substr($line, 0, $colon));
                $val = trim(substr($line, $colon + 1));

                //use strtolower to make case insensitive
                $this->info[strtolower($key)] = $val;
            }
        }

        //store the raw output
        $this->raw = implode("\n", $output);
    }

    /**
     * get a value
     * @param string $key - key name, case insensitive
     * @returns string value
     */
    public function getValue($key) {
        return @$this->info[strtolower($key)];
    }

    /**
     * list all the keys
     * @returns array of key names
     */
    public function getAllKeys() {
        return array_keys($this->info);
    }

}


Here is a simple example of getting the number of pages in a PDF using PHP.

<?php


function count_pdf_pages($pdfname) {
    $pdftext = file_get_contents($pdfname);
    $num = preg_match_all("/\/Page\W/", $pdftext, $dummy);

    return $num;
}

$pdfname = 'example.pdf'; // Put your PDF path
$pages = count_pdf_pages($pdfname);

echo $pages;


?>


This simple one-liner seems to do the job well:

strings $path_to_pdf | grep Kids | grep -o R | wc -l

There is a block in the PDF file that details the number of pages in this funky string:

/Kids [3 0 R 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 23 0 R 24 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R 38 0 R 39 0 R 40 0 R 41 0 R]

The number of 'R' characters is the number of pages.
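The counting step can be tried on hand-written sample lines without a real PDF. In this sketch the function name `count_kids_refs` is hypothetical and the input stands in for the output of `strings`. The second call shows a case where the count is too high: when the page tree is nested, intermediate nodes have their own `/Kids` arrays, and every reference in every array is counted.

```shell
# Hypothetical helper: counts the letter R on every line that mentions Kids.
count_kids_refs() {
    # tr strips the padding that some wc implementations add around the number.
    grep Kids | grep -o R | wc -l | tr -d ' '
}

# Flat page tree: one /Kids array listing three pages
printf '%s\n' '/Kids [3 0 R 4 0 R 5 0 R]' | count_kids_refs
# prints 3

# Nested page tree: a root with two intermediate nodes holding three pages in total
printf '%s\n' '/Kids [2 0 R 3 0 R]' '/Kids [4 0 R 5 0 R]' '/Kids [6 0 R]' | count_kids_refs
# prints 5, although this tree describes only 3 pages
```

So the one-liner is best treated as a quick check for simple files rather than a general solution.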


You can use mutool.

mutool show FILE.pdf trailer/Root/Pages/Count

mutool is part of the MuPDF software package.


You often come across the regex /\/Page\W/, but it does not work for several of my PDF files. Here is another regex that works for me.

$pdf = file_get_contents($path_pdf);
// Search the file contents ($pdf), not the path string
return preg_match_all("/[<|>][\r\n|\r|\n]*\/Type\s*\/Page\W/", $pdf, $dummy);
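For comparison, the same idea of counting `/Type /Page` entries can be sketched in shell. This is a simplified variant, not a translation of the regex above: it drops the leading `<`/`>` check and works line by line, so it assumes `/Type` and `/Page` sit on the same line. The function name `count_page_objects` and the sample objects are made up for illustration.

```shell
# Hypothetical helper: counts "/Type /Page" entries that are not "/Type /Pages".
count_page_objects() {
    # -a treats the file as text, -o prints each match on its own line,
    # and the trailing group rejects "/Pages" by requiring a non-word character or end of line.
    grep -aoE '/Type[[:space:]]*/Page([^[:alnum:]_]|$)' "$1" | wc -l | tr -d ' '
}

# Demo on a hand-written sample: one page-tree node and two page objects
tmp=$(mktemp)
printf '%s\n' '1 0 obj << /Type /Pages /Kids [2 0 R 3 0 R] /Count 2 >> endobj' \
              '2 0 obj << /Type /Page /Parent 1 0 R >> endobj' \
              '3 0 obj << /Type/Page/Parent 1 0 R >> endobj' > "$tmp"
count_page_objects "$tmp"   # prints 2
rm -f "$tmp"
```

Like the PHP versions, this only sees page objects stored as plain text. Pages kept inside compressed object streams are not counted.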

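Getting the number of pages in a PDF document

This question is for reference and comparison. The solution is the accepted answer below.

Using Imagick (a PHP extension)

Using FPDI (a PHP library)

Opening a stream and searching with a regular expression:

So, what works reliably and accurately?

A simple command-line executable called pdfinfo.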
