<div dir="auto"><div><div class="gmail_extra"><div class="gmail_quote"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="elided-text">Give ten kids 20 pages each and diff the results...:o)<br></div>

<font color="#888888">    /Bernie\<br></font></blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">We did exactly that 39 years ago, with n=2, for survey data that was not received on op scan media.</div><div dir="auto"><br></div><div dir="auto">It was my first paid coding job. I was supposedly one of the data entry undergrads, but I took over support of the Fortran IV and PL/I diff and column-swap utilities.</div><div dir="auto"><br></div><div dir="auto"> (I also took op scan sheets to the batch window and ran the op-scanned card deck through offline accounting machines, because my time on the offline sorter was cheaper than a mainframe sort job. I hope I'm the youngest person who can say that!)</div><div dir="auto"><br></div><div dir="auto">With natural-language (NL) text, you might want a format-forgiving diff (for example, one that ignores whitespace and line-break differences) unless you set rigid guidelines.</div><div dir="auto"><br></div><div dir="auto">Having been researching in Google Books and Archive.org/PG PDFs of early- and mid-20th-century publications for a project unrelated to IH (but the WWW is saving me N-1 trips to NARA), I'll concur that scan quality is very important for OCR, and not just resolution: contrast, focus, and alignment all matter, especially with elegant, light "Modern" typefaces. An adaptive black-and-white scan is particularly helpful. </div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><font color="#888888"></font></blockquote></div></div></div></div>