I need to parse a PDF file that contains tabular data. I'm using PDFBox to extract the text of the file as a String, which I parse afterwards. The problem is that text extraction doesn't work as I expected for tabular data. For example, I have a file containing a table like this (7 columns: the first two always have data, exactly one of the three Complexity columns has data, and exactly one of the two Financing columns has data):
+-----+-------+---------------------------------+------------------+
| AIH | Value |           Complexity            |    Financing     |
|     |       | Medium | High  | Not applicable | MAC/Other | FAE  |
+-----+-------+--------+-------+----------------+-----------+------+
| xyz | 12.43 | 12.34  |       |                | 12.34     |      |
+-----+-------+--------+-------+----------------+-----------+------+
| abc | 1.56  |        | 1.56  |                |           | 1.56 |
+-----+-------+--------+-------+----------------+-----------+------+
Then I use PDFBox:
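PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper stripper = new PDFTextStripper();
String content = stripper.getText(document); // plain text only, no column information
document.close(); // release the file once the text has been extracted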
The two data rows are extracted like this:
xyz 12.43 12.4312.43
abc 1.56 1.561.56
There is no whitespace between the last two numbers, but that is not the biggest problem. The real problem is that I can't tell which columns the last two numbers belong to: Medium, High, or Not applicable? MAC/Other or FAE? The extracted text carries no relation between the numbers and their columns.
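For the missing whitespace alone, a workaround along these lines might do. This is only a sketch: it assumes every amount has exactly two decimal places, and the class and method names (FusedNumberSplitter, splitAmounts) are made up for illustration. It does nothing for the column question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FusedNumberSplitter {
    // ASSUMPTION: every amount is written with exactly two decimal places,
    // so "12.4312.43" can only be read as "12.43" followed by "12.43".
    private static final Pattern AMOUNT = Pattern.compile("\\d+\\.\\d{2}");

    // Returns all amounts found in one extracted line, in reading order.
    static List<String> splitAmounts(String line) {
        List<String> amounts = new ArrayList<>();
        Matcher matcher = AMOUNT.matcher(line);
        while (matcher.find()) {
            amounts.add(matcher.group());
        }
        return amounts;
    }

    public static void main(String[] args) {
        // The two extracted lines shown above.
        System.out.println(splitAmounts("xyz 12.43 12.4312.43"));
        System.out.println(splitAmounts("abc 1.56 1.561.56"));
    }
}
```

Each line then yields three separate amounts, but still without any hint of which column they came from.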
I am not required to use PDFBox, so a solution based on another library is fine. What I want is to parse the file and know which column each parsed number belongs to.
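To illustrate the result I am after, here is a minimal sketch in plain Java. It assumes the extractor can report the x coordinate of every text fragment, which the String returned by getText does not give me. With PDFBox, such coordinates might come from subclassing PDFTextStripper and reading the TextPosition objects passed to an overridden writeString, but I have not verified that against my file. The column edges, the sample coordinates, and the names ColumnLocator, columnFor, and mapRow are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ColumnLocator {
    // Column names taken from the table header above.
    static final String[] NAMES = {
        "AIH", "Value",
        "Complexity: Medium", "Complexity: High", "Complexity: Not applicable",
        "Financing: MAC/Other", "Financing: FAE"
    };

    // ASSUMPTION: made-up left edges of the seven columns, in ascending order.
    // Real values would have to be measured from the actual PDF.
    static final float[] LEFT_EDGES = {50f, 100f, 160f, 220f, 280f, 380f, 460f};

    // Returns the right-most column whose left edge is at or before x,
    // or null if x lies to the left of the first column.
    static String columnFor(float x) {
        String name = null;
        for (int i = 0; i < LEFT_EDGES.length; i++) {
            if (x >= LEFT_EDGES[i]) {
                name = NAMES[i];
            }
        }
        return name;
    }

    // Pairs each text fragment of one table row with its column name.
    static Map<String, String> mapRow(String[] texts, float[] xs) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < texts.length; i++) {
            row.put(columnFor(xs[i]), texts[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // Hypothetical fragments for the "abc" row with invented x coordinates.
        String[] texts = {"abc", "1.56", "1.56", "1.56"};
        float[] xs = {52f, 104f, 223f, 463f};
        System.out.println(mapRow(texts, xs));
    }
}
```

With these invented coordinates, the abc row would come out labelled as AIH, Value, Complexity: High, and Financing: FAE, which is exactly the relation that is lost in the plain extracted text.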