| Author |
Message |
Richard Evans Guest
|
Posted: Sat Feb 10, 2007 6:42 am Post subject: Scanning to searchable PDF |
|
|
I need to scan 8.5 x 11 loose sheets into searchable PDFs in large
batches: 400 - 600 at a time.
I want a sheet feed scanner with a decent input tray (50+ pages) that
will scan to searchable PDFs quickly, accurately, and with a minimum
of fuss. Resolution is not terribly important. 600 dpi would do. Color
is not important.
I'm looking online at a Xerox 152 that seems to do what I want, but I
can't tell from the picture how big the feed tray is.
Any thoughts on the Xerox? Any other models that might suit me?
The Xerox is $600. I might go as high as $800. Of course, cheaper is
better. |
|
|
|
Dances With Crows Guest
|
Posted: Mon Feb 12, 2007 9:00 am Post subject: Re: Scanning to searchable PDF |
|
|
On Sun, 11 Feb 2007 20:12:44 GMT, Don staggered into the Black Sun and
said:
| Quote: | Dances With Crows wrote:
Don wrote:
Richard Evans wrote
[I'd like to scan paper documents to] searchable PDF
The first few lines of results appear to [show] me that OCR is
[being used]. Does OCR with this new Xerox run on [its] own? Or is
it like every other OCR [which] requires manual corrections?
A perfect OCR engine doesn't exist. OCR is a difficult problem [...]
So if you need 100% accuracy, you've gotta proof the results. If you
don't proof them, you get things like M turning into IVI,
Also, if you have to proof OCRed documents to 100% accuracy, there's
really no reason to keep the image (unless there are graphics on the
page). Text is smaller and easy to convert to other formats. PDFs
are write-once.
I've likely OCR'd more [documents] [than] you'll ever consider.
|
O RLY? From 2000..2005, I was the principal tester and fixer on a very
large document conversion project, and had to do code-monkey things on
that project as well. Hundreds of thousands of pages from the NYT, WSJ,
Boston Globe, Washington Post, and tons of smaller academic journals
were processed through code I was responsible for. And I had to
spot-check far too many of those pages for various operator errors.
And I've reverse-engineered large chunks of the file format of a certain
OCR engine for company purposes. Mostly extracting info that the engine
stores but DDE doesn't make available, but whatever. So I'd say I have
a fair idea of the ways that OCR engines can fail, and a lot more
experience than you credit me with.[0]
| Quote: | It's a much simpler and a less time consuming task to OCR "properly"
into a text editor or word [processing] software[,] as compared to
adding text behind an image.
|
If the first option is easier than the second option, then the software
you're using to do the second option is poorly designed. I worked with
another guy to modify the company's conversion software so that it could
produce PDFs with text-behind-image. It worked reasonably well. Too
bad the clients decided they didn't want that feature.
| Quote: | most beginners [in] scanning are under the impression that OCR and
most scanning is [a] uniform task[,] and nothing could be [further]
from the truth. Far too many [strange] circumstances exist on each
project. Each new and unrelated document may require a new
[configuration] or [a] revision of [your] [configuration].
|
If you can get a decent scan at 300 DPI with good contrast, the scan
isn't skewed, the fonts used are sane, there are no graphics or weird
layouts, and there's no page curl or broken type, OCR just might get 98
or 99% accuracy without much effort on your part. YDocumentsMV.
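To put a figure like "98 or 99%" in perspective, here is a minimal sketch of one common way to measure character accuracy: compare the OCR output against a hand-proofed copy using edit distance. The function names and the 3000-characters-per-page figure are assumptions made for illustration, not anything taken from a particular OCR engine.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Fraction of the proofed text recovered correctly (hypothetical helper)."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Rough number of wrong characters per page at a given accuracy."""
    return chars_per_page * (1.0 - accuracy)


if __name__ == "__main__":
    # An "M" misread as "IVI" costs three edits on its own.
    print(edit_distance("M", "IVI"))
    # Assumed page size of 3000 characters, for illustration only.
    print(round(expected_errors(3000, 0.98)))
```

Under that assumed page size, 98% accuracy still leaves roughly 60 wrong characters per page, which is why proofing matters if you need clean text.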
| Quote: | Simply piling a load of paper onto a sheet feeder and going about
other [tasks], while your scanner [proceeds] both un-monitored and
operating itself, will generally result in a pile of crap
|
True dat. OCR engines have improved a bit, but you still need human
intervention to get really good data. People are still much better at
grokking malformed text than computers are (as shown by the "captcha"
thing some webforums use).
[0] Does "Proquest" ring a bell?
--
The squirrels now live in fear of the magic exploding compost heap.
--Dan Holdsworth
Matt G|There is no Darkness in Eternity/But only Light too dim for us to see |
|