Reply To: Perfect PDF 9 Editor / Sep 1 2020

Sep 19, 2020 at 11:16 am #16519263

Gary

Guest

[@Peter Blaise]

>”Line breaks mid sentence, and page numbers mid sentence, are only one part of the PDF-to-text presentation I shared, the most critical challenge to understand was the tab-delimiting between words.”

When a sentence starts on one page and ends on the next, where do you think the line break should be placed in the exported text output? Should it be before the start of the sentence on the starting page? Or should it be after the end of the sentence on the following page? Either way, you could not go on that rule alone. What if you choose to place the linefeed before the start of the sentence, but the start takes up a long part of its page while the following page has only a short part to finish the sentence? Do you instead place it at the start if the start is shorter, but at the end if the end is shorter? What if the sentence carries over multiple pages? Do you accumulate linefeeds, and do the same for page numbers? The simple text export places the linefeeds and page numbers in the output in the sequence they are found. It does not make judgments.

What you want is a conversion program that reproduces the layout of the source, only with the output as plain text characters.

The exported text does not carry any kind of formatting, as you seem to expect, in spite of my reiterating that the exported text output is nothing more than the text characters in the same sequence in which they are encountered.

>”Whether “tab” is a character or is a presentation effect accomplished by any other means does not concern me.”

The tab character is not a common, dependable method of formatting text in programs such as MS Word. It is used mostly in text documents to line things up or to form indentation. Perfect PDF 9 does not try to format the exported text.

>”No, the characters were not separated as if 2 bytes instead of one, the words were separated as if laid out in spreadsheet column.”

Correct, the characters were not separated. They were continuous, only that it took 2 bytes for each character. In 2-byte Unicode (UTF-16), characters of US English do not need the high-order eight bits, so that byte is zero, and some viewers show it as what appear to be spaces. They are not.
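To make the 2-byte point concrete, here is a minimal Python sketch. It assumes the export is UTF-16 little-endian, which is what Windows tools usually mean by "Unicode"; I have not confirmed which byte order Perfect PDF 9 writes, and the sample string is made up.

```python
text = "Hi"

# UTF-16 little-endian (assumed): every character takes two bytes.
encoded = text.encode("utf-16-le")
print(encoded.hex(" "))  # 48 00 69 00

# A viewer that treats each byte as its own character sees a NUL
# after every letter, which many programs draw as a gap.
misread = encoded.decode("latin-1")
print(repr(misread))  # 'H\x00i\x00'
```

The zero bytes are part of each character, not spaces or tabs inserted between them.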

>”Note, I shared text output for all to inspect – look at it, it’s just cut-and-paste – and I cited the source, so go ahead and dive in, anyone who wants to explore.”

>”it’s just cut-and-paste – and I cited the source”

To be able to paste, you have to copy something to the clipboard first. You left out one important part of what you presented in your comment. You did not cite the source program displaying the data you copied from. You only cited the original book PDF, which was then exported to Unicode text.
Next, you viewed that Unicode text using some program but you have not said what program it was.
Whatever that program was can affect how the text appears, and if it manipulates the data to allow you to see the text characters in some manner, then you are not copying the original Unicode data. Therefore, pasting that into a comment is meaningless to convince anyone what the original Unicode contained. That is what magicians do to convince the viewer of something. They do not tell you all of the middle steps they took.

What program were you using to display the text that you copied from?

As I pointed out, depending on which encoding mode Microsoft Notepad was last used in, it can display the contents of the same Unicode file with the characters right next to each other (1 byte) or appearing to be separated (2 bytes). You cannot go on what you see, even with the simplest tool such as Notepad. You mentioned using a “Word-equivalent” program. That is a poor choice for viewing the exported Unicode text data.

As I mentioned, I exported “The Boy Electrician” to the Unicode text, and received what I expected. You may be wondering why I did not post any of the output. Even though I might have Unicode text that has not been altered, the data would get manipulated as soon as I attempted to post it into a comment. Ever heard of script injection/code injection?

When typing or pasting anything into a web comment form, do you think webpages just accept any data from the forms? For a page to accept anything placed in a form is an excellent way of allowing any user to attack the server. There are many warnings online about these dangers, and how web developers need to handle unknown input. It is normal behavior to strip the entered data down to basic characters (especially bytes of unprintable characters, or the high bits in Unicode 2-byte characters for English), so what the user sees will never be the same as the Unicode it came from. Likewise, what we see from what you pasted into the comment form is not the exported Unicode text it supposedly started from.

All your effort was a moot point from the start. Either you didn’t have a clue about what you were seeing or doing, or you might be trying to scam the reader. You may be convincing to the inexperienced, but don’t expect that to work with everyone.

You cannot determine if the exported text is correct or not by looking at it using any tool. You can only use a tool to tell you what encoding it is. Then, if you have the means (or tool) to interrogate the data byte by byte (e.g., hex editor), only then can you see what the contents are. Unicode files should have a few bytes at the start of the file that identify what type of encoding the data is in. As soon as you copy any characters or bytes further on, you then only have a string of characters. They alone cannot be used to judge or determine what type of encoding was used without some other interrogation. In some cases, it is impossible to determine what encoding the string of characters came from. If the program you used to copy from was not made to handle Unicode text, you could have started copying in the middle of a 2-byte character. Without knowing what program you used to view the Unicode file, and where you started your copy, we have no idea what that content would be.
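Those identifying bytes at the start of the file are the byte order mark (BOM). A rough Python sketch of checking for one follows. The function name `sniff_bom` and the sample data are my own for illustration, and a file with no BOM simply returns None, which tells you nothing either way.

```python
import codecs

# Longer signatures first: the UTF-32 LE mark begins with the UTF-16 LE mark.
# (A UTF-16 LE file whose first character is U+0000 would fool this order.)
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

whole_file = codecs.BOM_UTF16_LE + "Contents".encode("utf-16-le")
copied_part = whole_file[2:]  # the same text with the leading mark lost

print(sniff_bom(whole_file))   # utf-16-le
print(sniff_bom(copied_part))  # None
```

Once the first bytes are gone, as they are in anything copied out of the middle of a file, nothing in the remaining bytes announces the encoding.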

>”I note that tab-equivalent-delimited results are unrelated to Unicode, that is, the spaces between words are variable, but exactly what is needed to align the next word with the previous row of word’s alignment, that is, a column, a spreadsheet, tab-delimited, by spaces or tabs or whatever it takes to line up the best word regardless of the length of the current word, but hey …”

Export to text is not for the purpose of making anything line up or be formatted. If the result does line up, then it is either a coincidence due to the original PDF having those characters in the same order, or the tool you used has manipulated the text to have tabs.

__________

>”I used a PDF of “An Elegant Defense” book by Matt Richtel”
>”I shared a source that was free to me, and not available in other formats.”

The book is available on Amazon as a hardcover, paperback, audio CD, Audiobook, and Kindle so it is available in different formats. It is NOT available from Amazon as a PDF, so I am pretty sure the PDF edition you used was “generated” by someone.

>”you try text-exporting a PDF book with a numbered table of contents yourself, and see if Soft-Xpansion Perfect PDF 9 Editor decides it’s a spreadsheet and separates every word with tab-equivalent spaces, and share the results as I have”

I did do an export and it did not separate any words with a tab (what you are calling tab-delimited spaces). In the entire output, there is not a single tab.
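Anyone can run the same check without trusting a viewer. This is a small Python sketch, assuming the export was saved as UTF-16 with a byte order mark; the function name `count_tabs` and the sample strings are hypothetical.

```python
def count_tabs(data: bytes, encoding: str = "utf-16") -> int:
    """Decode the exported bytes and count literal tab characters (U+0009)."""
    return data.decode(encoding).count("\t")

# In-memory stand-ins for an exported file; "utf-16" writes a BOM on encode
# and reads it back on decode.
with_tab = "Chapter I\tMagnets\n".encode("utf-16")
without_tab = "Chapter I Magnets\n".encode("utf-16")

print(count_tabs(with_tab))     # 1
print(count_tabs(without_tab))  # 0
```

For a real export, read the file in binary mode and pass the bytes in unchanged, so that no editor or clipboard gets a chance to alter them first.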

I did not attempt to show it in a comment because I know better. Apparently, you don’t even realize that what you did is meaningless.

>”“The Boy Electrician” book by Alfred Powell Morganis is freely available side by side with other freely available file formats.”

That is correct. I did not choose the book based on whether there were other formats available, same for your choice, whether you knew it or not (“An Elegant Defense” by Matt Richtel is also available in other formats). Being available in other formats has nothing to do with the issue of whether Soft Xpansion Perfect PDF 9 can export a book with a table of contents, page numbers, etc., and how it handles it. I simply was using a freely available “PDF book with a numbered table of contents,” in order to have a source that had the same things you mentioned were in your book. Therefore, if Perfect PDF 9 exports the table of contents in columnar form, it would show up. It didn’t. Since you did not use a tool that can view Unicode accurately, that is likely the reason you have tabs. Use a hex editor/viewer and show us the source. If the source has tabs, then they simply were passed on to the output, but with no intention of formatting that output.

>”The “The Boy Electrician” PDF is not original from the publisher, but was generated, …”

What? … what kind of curveball is that? What PDF files are not generated? You act as if the book you used was not generated? I have never heard of a PDF book that was not generated. It would mean the book was composed entirely in a PDF editor instead of something like MS Word. Possible, but the formatting abilities of PDF editors have not quite caught up to the same as MS Word. Even if created entirely in a PDF editor, the final step is to “GENERATE THE PDF” so where are the PDF books that are not generated hiding?

Most likely the author used something like Microsoft Word, or a “writer’s tool” to compose the book. It probably went through several iterations of editing and proofreading before they decided the book was in a publishable state. After it was prepared for typesetting, they also made digital versions to send to the printer. They may have even generated a PDF to send to the printer. Either way, at some point, someone made a PDF version. Whether that was done by the original publisher or not is not known, nor does it matter when it was “generated” or by whom.

If you look at the PDF edition of The Boy Electrician, the pages look like the original book. That is because the original book was OCRed, then the output of the OCR was corrected and formatted to create a source that could be used to typeset/print as near a copy of the original book as possible. The PDF is just one of the end results of those steps. The book has the same qualities that your book had and that you specified as being critical parts to see what the Perfect PDF 9 export to text looked like. Again, whether other formats are available or not has nothing to do with the issue being debated.

I used Perfect PDF 9 to export “The Boy Electrician” to text, and it worked as expected. It was all in Unicode, and by the way, there was not a single tab character in the entire output.

>”the PDF is not a product of original publisher, ”

Correct, and again that has nothing to do with it. We are testing a PDF with a table of contents; it does not matter if the book is the latest on the market or from 1913. For both, the PDF editions are recent. There was no need for the PDF to be a product of the original publisher (your PDF may not have been either). That was not in your original stipulation, and if it was, I would have a good laugh. Perfect PDF 9 does not know whether it is a product of the original publisher, nor does it care. That has nothing to do with whether Perfect PDF 9 can export a PDF with a table of contents with or without tabs.

>”and there is no need to export text from it.”

Again, where are these odd curveballs coming from? What does “need” have to do with it? Did you forget the issue is whether Perfect PDF 9 is doing the right thing when exporting to a text file? “Need” has nothing to do with it.

If you do not like using “The Boy Electrician” pick something else, as long as it is freely available to all, and easy to download. That way, anyone can do the same test.

You mention you get PDFs from different sources and that there are a gazillion places on the Internet that have PDFs available, so use several of those other PDFs and see what you get. Don’t base your entire judgment on one example. You owe it to yourself to see if Perfect PDF 9 handles all of the books the same way. When you can prove that all PDFs with a table of contents come out the same way, then you have something. As it is now, you have nothing.

>”As mentioned, an original publisher probably can purposefully insert non-printable codes that sully the PDF’s auto-conversion to text, I do not know, nor do I care.”

No, an original publisher probably CANNOT purposefully insert non-printable codes that sully the PDF’s auto-conversion to text. It wouldn’t matter if they insert any characters or not. It doesn’t matter what the original contained besides the actual text content because only the text is exported; other characters are ignored. Perfect PDF 9 does not claim to be attempting to produce a formatted version of the original PDF.

It seems like you were expecting text output that you could open with something like MS Word and have a source file that could be used to produce a PDF that looks the same as what the exported text came from. Perfect PDF 9 does not do that. When asked, they answer that it is not for that purpose.

PDF tools are not limited in what they can do, so some PDF editors do export to MS Word format. I am sure in the future, there will be more and more universal PDF tools that do it all.

>”My point, and I do have one, is that Soft-Xpansion Perfect PDF 9 Editor seems to scrutinize the internal coding structures, and miss the human-interpretable presentation.”

It doesn’t matter what it “seems like” they are doing; that is something you evolved in your head. What is important is what they claim to do and what they actually do. They do not claim to attempt generating formatted text; they merely export the text characters found. And in every case I have tried, that is exactly what they have done.

They are not scrutinizing the internal coding structure to produce a human-interpretable presentation and missing. They are merely collecting the text characters found in the original, and dropping the rest. That’s what “export to text” means in its simplest form. Therefore, they are not missing something they are not trying to do in the first place. It is you who has evolved the thought in your mind that they are attempting to create a formatted output, similar to what a conversion to MS Word would look like, only that it would be plain characters. They absolutely have stated that they are not doing that.

>”Who knows, maybe the PDF I started with has a code between words, and Soft-Xpansion honored that … it does not matter, really.”

No, Perfect PDF 9 is simply writing out the characters found in the source.

>”Who knows”

Rather than continuing to guess, look at your source with a hex editor and see what is there. If you want to paste the hex codes you find into a comment, that should work. If you use a hex display that shows the actual characters plus the hex values, it isn’t going to work. Paste only the representation of the hex values (simple ASCII text).
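If you do not have a hex editor handy, a few lines of Python can produce hex values as plain ASCII that will survive being pasted into a comment. This is only a sketch; the function name `hex_lines` and the sample bytes are made up for illustration.

```python
def hex_lines(data: bytes, width: int = 16):
    """Return rows of 'offset  hex bytes' using only ASCII characters."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        rows.append(f"{offset:08x}  {chunk.hex(' ')}")
    return rows

# "1", a tab, then "A", encoded as UTF-16 little-endian.
sample = "1\tA".encode("utf-16-le")

for row in hex_lines(sample):
    print(row)  # 00000000  31 00 09 00 41 00
```

In output like this, a real tab in 2-byte little-endian text shows up as the byte pair 09 00, and an ordinary space as 20 00, so there is no need to guess which one is between the words.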

I don’t know why you based your entire claim on exporting one PDF file when you have plenty of other PDFs to see if they all do the same. Anyone claiming to be a “computer tech” would have done that instantly instead of blaming a company and their product that has been around for years. It would be bad enough to come up with such an untested theory without saying anything publicly. To put it out there so everyone can see speaks volumes.