In our previous testing from the trenches post, we presented you with a real testing problem. It was about finding errors in Book Manuscript For Asimov Foundations. Here is our Test Strategy for this testing problem.
Six W Test Strategy
The Five Ws technique is about six questions: Who, What, When, Where, Why, and HoW.
That would be me. But as I am not important to Archive.org as an individual, let’s generalize this.
Archive.org users who downloaded EPUB manuscripts that were automatically generated from the original manuscript.
Because so many people read on mobile devices, the number of Archive.org users who would download the EPUB manuscript version is significant.
EPUB Archive.org manuscript format that is auto-generated from the original text manuscript.
“This book was produced in EPUB format by the Internet Archive.
The book pages were scanned and converted to EPUB format automatically. This process relies on optical character recognition and is somewhat susceptible to errors. The book may not offer the correct reading sequence, and there may be weird characters, non-words, and incorrect guesses at the structure. Some page numbers and headers or footers may remain from the scanned page. The process that identifies images might have found stray marks on the page, not actual images from the book. The hidden page numbering which may be available to your ereader corresponds to the numbered pages in the print edition but is not an exact match; page numbers will increment at the same rate as the corresponding print edition, but we may have started numbering before the print book’s visible page numbers. The Internet Archive is working to improve the scanning process and resulting books, but in the meantime, we hope that this book will be useful to you.”
Our only goal is to find out whether there is any difference in the text. While reading the original manuscript, I noticed that some words are missing letters.
- Converting EPUB to TXT is error-prone.
- The comparison itself is error-prone.
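To make the first risk concrete, here is a minimal sketch of an EPUB to TXT conversion in Python. It is not the tool used for this post. It assumes an EPUB is a ZIP archive whose content lives in (X)HTML files, and the names `epub_to_text` and `_TextExtractor` are hypothetical. The sketch reads files in sorted name order instead of the reading order declared inside the EPUB, which is exactly the kind of shortcut that can introduce differences during conversion.

```python
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def epub_to_text(epub_path):
    """Return the concatenated text of every (X)HTML file in the EPUB.

    Assumption: files are read in sorted name order, not in the reading
    order declared by the EPUB itself, so the sequence may be off.
    """
    chunks = []
    with zipfile.ZipFile(epub_path) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                markup = archive.read(name).decode("utf-8", errors="replace")
                parser = _TextExtractor()
                parser.feed(markup)
                parser.close()
                chunks.append("".join(parser.parts))
    return "\n".join(chunks)
```

Any text produced this way should itself be spot-checked against the EPUB before it is used as input to a comparison.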
Compare the manuscript chapter by chapter. Each individual chapter can be compared faster than the whole manuscript. With fewer differences per comparison, it is easier to spot the real ones and identify false positives.
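The chapter-by-chapter comparison could be sketched as follows, using Python's standard `difflib` module. This is an illustration under stated assumptions, not the exact procedure from this post: it assumes both versions are already plain text and that each chapter starts on a line beginning with the word "Chapter". The names `split_into_chapters` and `compare_chapters` are hypothetical.

```python
import difflib
import re
from itertools import zip_longest

# Assumption: a chapter starts on a line such as "CHAPTER 1" or "Chapter IV".
CHAPTER_HEADING = re.compile(r"^\s*chapter\s+\w+", re.IGNORECASE)


def split_into_chapters(text):
    """Split text into lists of lines, starting a new list at each heading.

    Index 0 holds any front matter that precedes the first heading,
    or the first chapter itself when the text opens with a heading.
    """
    chapters = [[]]
    for line in text.splitlines():
        if CHAPTER_HEADING.match(line) and chapters[-1]:
            chapters.append([])
        chapters[-1].append(line)
    return chapters


def compare_chapters(original_text, converted_text):
    """Return {chapter_index: unified diff lines} for chapters that differ."""
    original = split_into_chapters(original_text)
    converted = split_into_chapters(converted_text)
    report = {}
    # A chapter missing on one side is compared against an empty chapter.
    pairs = zip_longest(original, converted, fillvalue=[])
    for index, (left, right) in enumerate(pairs):
        diff = list(
            difflib.unified_diff(
                left, right, fromfile="original", tofile="converted", lineterm=""
            )
        )
        if diff:
            report[index] = diff
    return report
```

Chapters that match produce no entry in the report, so only the chapters with differences need a manual review for false positives.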