The University of Arizona

Modern OCR and large-scale text retrieval from images

Series: TRIPODS Seminar
Location: ENR2 S210
Presenter: Marek Rychlik, University of Arizona, Mathematics
Automated Optical Character Recognition (OCR) has a long history, dating back to the 19th century. Since the invention of the digital scanner in 1957 and its considerable improvement by Ray Kurzweil in the 1970s, OCR technology has been perfected for Latin-like alphabets, e.g., that of English. However, other languages, in particular Traditional Chinese and languages of the Middle East written in cursive scripts, present significant challenges.
Modern OCR is understood broadly as the task of retrieving text from images. Cursive scripts and handwritten text do not separate characters, so units larger than characters must be considered, e.g., ligatures, words, or entire lines of text. Special challenges are created by tables, math formulas, newsprint with embedded images, and complicated formatting. Another challenge is the need for large-scale processing involving millions of pages.
The overarching goal is to bring accuracy well above 90%, measured as the share of individual characters recognized correctly. Accuracy below this level, e.g., the 60-70% common for the texts we consider, is not useful.
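The abstract does not say how character accuracy is computed. One common convention, assumed here purely for illustration, is one minus the character error rate: the edit distance between the OCR output and a reference transcription, divided by the reference length. The function names below are hypothetical and are not taken from the project's code.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions separating the two strings."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from output
                current[j - 1] + 1,      # extra character in output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this convention, a 10-character reference with one misrecognized character scores 0.9, while a page at 60-70% would need roughly every third character corrected by hand.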
The talk will be illustrated with examples from the ongoing development of a real OCR system at the Department of Mathematics, sponsored by the National Endowment for the Humanities (NEH). One goal of the project is to perform OCR on 1.5 million pages of the unique Afghan document collection at the University of Arizona, which contains a mixture of Pashto and Persian, often in the same document. Another goal is to develop the capability to process documents in Traditional Chinese, featuring 65,000 distinct characters. This calls for methods from data science and machine learning.
(Pizza, coffee & tea will be provided at 11:20am)