Reconocimiento de Escritura

Daniel Keysers

Image Understanding and Pattern RecognitionGerman Research Center for Artificial Intelligence (DFKI)

Kaiserslautern, Germany

Mar-2007

Keysers: RES-07 1 Mar-2007

Outline

OCR - Introduction

OCR fonts

Tesseract

Sources of OCR Errors

Outline

OCR - Introduction

OCR fonts

Tesseract

I OCR = Optical Character Recognition

I steady progress in OCR since the mid-fifties

I 1975: IBM Optical Page Reader cost over 3,000,000 US$ anddisplaced several dozen keypunch operators

I applicationsI home or office useI forms processingI address readingI conversion of large archives of text to computer-readable form

Although researchers have worked on the problem of OCR for atleast thirty years, there has been a renewed interest in OCRtechnology in the recent years. This is partly due to

I the increasing need for efficient information storage andretrieval,

I the increasing need for cross-language information access, and

I the dramatic drop in scanner prices

I [large scale digitization projects]

(Kanungo, 1998)

OCR Paradigms

building blocks of an OCR system:

I noise removal

I skew estimation

I page segmentation & line detection (= layout analysis)

I segmentation

I character classifier

I language modeling

What are the main differences to a handwriting recognition system?

even 99% accuracy= 30 errors on a typical printed page of 3000 characters

“in almost every application, either the OCR results must becorrected by a human operator or a significant fraction of thedocuments are rejected in favor of operator entry” (Nagy+ 2000)

Comercial OCR

Commercial OCR

Systems

commercial:

I Abbyy (FineReader)

I Nuance/Scansoft (OmniPage)

I Oce (RecoStar)

I Iris (ReadIris)

I many smaller vendors

open source:

I ocrad

I gocr

I Tesseract

I OCRopus

Outline

OCR - Introduction

OCR fonts

Tesseract

OCR fontsIf you can, try to make the problem easier. This is usually easierthan improving the classifier.

OCR fonts

two groups of OCR fonts

I Magnetic Ink Character Recognition (MICR)

I Optical Character Recognition (OCR)

I artificial distinction

history:

I financial world

I bar-code not easily readable by humans

I magnetic ink preferred

(source: D. Winter)

OCR fonts - E13-B

I first font used in automated banking

I digits and four special symbols used in banking””dash”, ”amount”, ”on*us” and ”transit”

I The font was specially designed so that magnetic pulses wouldbe read unambiguously. That is the reason for some of theheavy black features of some symbols. For this purpose a gridof 7 by 11 squares was used that either had to be white orblack. The filling was done so that a magnetic scan wouldgive a pulse signal that was very distinctive.

OCR fonts CMC-7

I digits, the letters, and five special symbols

I can also be seen as a barcode:each symbol encoded by seven vertical bars separated by smallor large spaces

OCR fonts - OCR-A

I end of the sixties: full character recognition

I first font: OCR-A

OCR fonts - OCR-B

I accompanies EAN barcodes

I more symbols exist

Output Representation: hOCR Format

I proposed open standard for representing OCR resultsI motivation: existing formats have limitations in

I multi lingual capabilitiesI typographic phenomenaI separate formats for intermediate and final results

I goal: reuse as much existing technology as possibleI main idea: represent various aspects of OCR output

I logical structuringI typesettingI character informationI etc.

I HTML microformat

Assessment of OCR Error Rate

I usually: edit- or Levenshtein distance (cp. SpeechRecognition)

I in the presence of reading order errors: allow blockmovements at certain cost (cp. Machine Translation)

I interesting problem: predict OCR accuracy from a set ofdocument image measurements

OCRopus Open Source OCR System

I Layout Analysis → see previous lectureI Character Recognition Engine:

I based on Tesseract → discussed hereI based on segmentation → see lecture on handwriting

recognition

I flexible architecture

OCRopus Open Source OCR System

I ‘OCRopus’ is the DFKI open source OCR system

I framework for layout analysis and OCR integration

I DFKI layout analysis + DFKI or HP-labs ‘Tesseract’ OCR

I preliminary evaluation on 18 documents, 15K words

I goals: improving adaptivity to new fonts and layouts

OCRopus Open Source OCR System — Release

I first version (technology preview) to be released soonI other application: screen OCR

screen OCR challenges

Screen OCR

I motivation:

I image-based HTML analysisI image-based cut and paste

I character recognizers:

I HMMsI Tesseract

Outline

OCR - Introduction

OCR fonts

Tesseract

History

R. Smith: ‘An overview of the Tesseract OCR Engine’, submittedto ICDAR 2007, personal communication.

I developed at HP between 1984 and 1994

I obtained good results at the 1995 UNLV Annual Test of OCRAccuracy

I then, development was stopped while other commercial OCRengines improved

I In late 2005, HP released Tesseract for open source.

Architecture

I Layout Analysis was separate, therefore not included

I (OCRopus closes this gap by including the possibility to useDFKI layout analysis with the Tesseract recognition engine)

I only supports US-ASCII

I connected component analysis

I stored as outlines (‘blobs’)→ enable recognition of inverse text easily

I blobs → text-lines

Chopping into Words

distinguish fixed pitch and proportional text

fixed-pitch: chopping simple

proportional: Measure gaps in a limited vertical range between thebaseline and mean line. Spaces close to the threshold are madefuzzy, so that a final decision can be made after word recognition.

Adaptation Using Two Passes

first pass:

I attempt to recognize each word

I ‘good’ words are used as adaptation data

second pass:

I recognize words that were not recognized well again

I use adaptation data

Line Finding

I does not need de-skewing

I filter large and small blobs out

I fit remaining blobs to parallel text line model

I re-assign left out blobs

I fit baselines as splines

Segmentation of Touching Characters

While the result from a word is unsatisfactory, Tesseract attemptsto improve the result by chopping the blob with worst confidence.

Candidate chop points are found from concave vertices of apolygonal approximation of the outline, and may have eitheranother concave vertex opposite, or a line.

Combine Broken Characters

A* search of the segmentation graph of possible combinations ofthe maximally chopped blobs into candidate characters withoutactually building the segmentation graph, but instead maintaininga hash table of visited states

character classifier can recognize broken characters directly

Character Classifier

I match outlines of test many-to-one

I test features 3-dimensional, (x, y position, angle),typically 50-100

I prototype features are 4-dimensional (x, y, position, angle,length), typically 10-20 features in a prototype

I hierarchical classification and use of lookup-tables for speed-up

Normalization

baseline and moment normalization used inadaptive and static classifier

Training Data

I no need for broken characters in training

I 20 samples of 94 characters from 8 fonts in a single size, butwith 4 attributes (normal, bold, italic, bold italic), making atotal of 60,160 training samples

I other classifiers: often more than 1,000,000 training samples

Linguistic Analysis

very basic linguistic analysis:look-up in different dictionaries

Outline

OCR - Introduction

OCR fonts

Tesseract

Major Classes of OCR Error Sources

(Nagy+ 2000)

Examples

Potential Sources of Improvement

(Nagy+, 2000)

I improved image processing

I adaptation of the classifier to the current document

I multi-character recognition

I increased use of context

I [combination of multiple OCR results]

Adaptation

I OCR systems generally optimized for average performanceover large sets of characters of different

I fontsI sizesI scan qualities

I improve performance of character recognizer by using styleinformation

Adaptation Approaches

I modeling the sample distribution [Breuel 2001]I each document contains only a single styleI modeling sample distribution as a mixture of Gaussians

I hierarchical Bayesian approach [Mathis et al. 2002]I each document contains a small number of fontsI estimation of prior distributions of a style variable

I style constrained classifiers [Sarkar et al. 2005]I style consistency constraint is hidden variableI combination of class and style represented by Gaussian mixture

modelI each mixture component trained by estimating it directly from

set of samples from a specific class and style

I maximum likelihood linear regression [Senior et al. 1997]I based on work on speaker adaptation [Legetter 1995]I adaptation using linear transformations

Reconocimiento de Escritura

Documents

MATEMÁTICA 4 - sdfab594926e91009.jimcontent.com · Fracciones equivalentes. Reconocimiento y escritura de números mixtos. Resolución de situaciones en las que es necesario el

eBeam Education Suite Version 2.4 Spanish - Españollcf1.s3.amazonaws.com/.../eBeamInteract_2_4_sp.pdf · Herramienta Reconocimiento de escritura manual 35 Herramienta Teclado en

Escritura reconocimiento luis albeiro quintero por albeiro orjuela

Reconocimiento de Escritura Manuscrita. ~ Caso Práctico: Un reconocedor de vocales ~

CREACION ESCRITURA TRADUCCION RE-ESCRITURA · vez". La diferencia con la escritura es que ahí se escribe de nuevo. En el proceso "regular" de la escritura, la re-escritura, la manera

Códices Apropiación formal. Sistemas de escritura Escritura olmeca Escritura zapoteca Escritura epi-olmeca o istmiana Escritura de la cultura de Izapa

Escritura académica y escritura creativa

El sistema informático - rorroo.files.wordpress.com · como agenda electrónica) con un sistema de reconocimiento de escritura. Hoy en día estos dispositivos, pueden realizar muchas

Aprender la escritura, olvidar la escritura · 2021. 5. 4. · Aprender la escritura, olvidar la escritura: nuevas perspectivas sobre la historia de la escritura en el Occidente romano

Escritura de Tradición Mixteca-Puebla. La Escritura Mexica

Reconocimiento de Patrones patrones Reconocimiento de ...Culcyt/ /Reconocimiento de Patrones patrones Reconocimiento de objetos en una plataforma robótica móvil Lorenzo García

revistaloyola.files.wordpress.com€¦ · Web viewEscribe correctamente dictados con las consonantes en estudio para mejorar su escritura y el reconocimiento de las palabras.

Educación - SciELO Colombia · reconocimiento de signos - letras - y su decodificación; y una mirada crítica, que reconoce la lectura y la escritura como aprendizajes anclados

Unidad nº 1 - Las vocales, reconocimiento y escritura

EL GRADO CERO DE LA ESCRITURA · PDF fileel grado cero de la escritura prologo parte i ¿que es la escritura? escrituras politicas la escritura de la novela ¿existe una escritura

Escritura Informal / Formal. Escritura ¿Cuándo usas la escritura fuera del campo académico?

Formulario de inscripción Escritura Terapéutica y … · EL TRAPECIO VOLANTE …Escritura en Movimiento! Talleres de Escritura Creativa, Escritura Terapéutica y Cuerpo Creativo,

ISFD MARIANO MORENO PROFESORADO DE LENGUA Y … · Debido a este no reconocimiento social de la clase baja por la clase alta ... Escritura de textos narrativos, poéticos, dramáticos,

Manual projection 2.1 - gva.escefire.edu.gva.es/pluginfile.php/277769/mod_resource/content/1/Uni… · 1.1.2 Modo ratón 8 1.1.3 Teclado en pantalla y Reconocimiento de escritura

IDENTIFICACIÓN DE CRISIS EPILÉPTICAS, RECONOCIMIENTO Manejo de... · Parciales complejas Parciales complejas. Reconocimiento de desencadenantes. Reconocimiento de desencadenantes