First Author*, Second Author**, Third Author**
*Department, Institute Name
**Department, Institute Name, if any

Abstract– OCR stands for Optical Character Recognition, also known as Optical Character Reader. It is used to collect information from handwritten documents or manuscripts and from printed paper records.
It is a common method of digitizing handwritten manuscripts so that they can be modified or used digitally in modern technology.

Index Terms – Acknowledgement, Conclusion, OCR Procedure, Text Processing.

I. INTRODUCTION

THE recognition and conversion of text from images has always been a challenging task for automatic data processing and for information retrieval services. In particular, the task of scanning human handwriting and making it not only digitally readable, but also searchable and digitally editable, is important for retrieving and collecting information.

In this way, the information in old manuscripts or any handwritten document can be a valuable and interesting source for building a strong and complete information network.
Different organizations are interested in the mass-scale digitization of historic manuscripts or handwritten documents, with a focus on offering improved full-text searching.

II. OCR PROCEDURES

Different digital collections and information systems digitize handwritten documents, such as old manuscripts of historic civilizations or other very old handwritten documents. However, optical character recognition of old handwritten manuscripts often poses different challenges. In the following, we summarize the main issues.

III. OCR Problems of Handwritten Documents

Working with handwritten documents, we face different problems.

Image problems: One of the major problems is the quality of the original manuscript and the quality of the scan.
This includes issues such as curled pages, blurred fonts, or manually edited pages (e.g. stamps or hand-written notes).

Font type and layout: Human handwriting is not supported by standard OCR software. The shapes of the characters change from person to person. Old manuscripts are even harder to recognize, because they use writing styles that are unfamiliar to modern readers. The spacing between words and characters is also often not consistent. Additionally, historic papers often use different and inconsistent layout structures.
Missing knowledge base: Traditional OCR software uses knowledge bases built on contemporary dictionaries and grammatical structures to enhance the OCR procedure, and these do not cover manuscript documents. Additionally, historic manuscripts often do not follow specific orthographic structures and rules, thus words can be written differently in the same text.

Every manuscript and every handwriting is different. Thus, very specific and unique problems can occur for every project.
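The missing-knowledge-base problem can be illustrated with a small sketch. It is not taken from any particular OCR product. The word lists, the `classify` helper, and the similarity cutoff of 0.8 are all assumptions chosen for illustration, and the example uses older English spellings only because they are easy to read. The point is that a contemporary dictionary tends to "correct" a legitimate historic spelling, while a lexicon extended for the project keeps it.

```python
import difflib

# Hypothetical word lists, chosen only for illustration.
CONTEMPORARY = {"the", "old", "town", "music", "public"}
PROJECT_VARIANTS = {"olde", "towne", "musick", "publick"}


def classify(token, lexicon, cutoff=0.8):
    """Look a token up in a lexicon.

    Returns ("known", word) if the word is in the lexicon,
    ("suggest", closest_word) if a similar entry exists,
    and ("unknown", None) otherwise.
    """
    word = token.lower()
    if word in lexicon:
        return ("known", word)
    # get_close_matches ranks candidates by difflib's similarity ratio.
    matches = difflib.get_close_matches(word, sorted(lexicon), n=1, cutoff=cutoff)
    if matches:
        return ("suggest", matches[0])
    return ("unknown", None)


if __name__ == "__main__":
    # With only a contemporary dictionary, the historic spelling is "corrected".
    print(classify("musick", CONTEMPORARY))
    # With the project-specific entries added, the spelling is accepted as is.
    print(classify("musick", CONTEMPORARY | PROJECT_VARIANTS))
```

In a real project the second lexicon would have to be built from the manuscripts themselves, since spellings can vary even within a single text.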
In the next section, we take a closer look at the overall OCR process and at possibilities to improve the different steps.

IV. OCR Process

As technology develops, the need to record data in handwritten form decreases. Nowadays records are kept as digitized text documents or media, so it is necessary to focus on these specific problems in the OCR process. In the following section, we describe the OCR procedure with a focus on creating a learning / feature base for handwritten documents, which can be used for improving machine learning algorithms.
To improve the accuracy of the OCR process, different actions can be taken in every single step of the process.

· Scanning: The first phase of Optical Character Recognition is the scanning phase. This phase is one of the most important phases. If possible, scans should be made of well-preserved and clean originals. The scanning resolution should be at least 300 dpi and the output image should be in a lossless image format (e.g. TIFF).
· Pre-processing: This is the second phase of Optical Character Recognition. In this step, the scanned document can be manually optimized for the OCR process. This includes image editing processes such as increasing the contrast, reducing noise, or simplifying the colors.
· OCR process: In this phase of Optical Character Recognition, the chosen OCR system reads the images and applies an algorithm to recognize the characters. It is crucial to choose OCR software that fits the current problem and supports a training/learning algorithm.
· Create learning base: To improve the OCR process, it is very important to create and improve the learning base for training the OCR system. This base consists of a dictionary fitting the document and improved character patterns.
· Post-processing: In this phase, knowledge can be applied which had not yet been available to the OCR system. In a final step, the output can be corrected manually.

V. Conclusion

In this paper, we tried to shed light on the process of OCR for scanned old documents, historic books, and very old manuscripts.
To compare the accuracy of OCR methods, a normalized version of the Levenshtein distance can be used. Since every historic book is different and poses its own new challenges, the most important step of an OCR process is building a learning base. The main contribution of this work is a model for OCR processes of historic books with old fonts. With such a model, with respect to pre- and post-processing, the accuracy of OCR of manuscripts or human handwriting can be improved significantly, compared to related approaches.

VI. ACKNOWLEDGMENT

This work would not have been possible without the help of Prof. Sukanya Roy, Lecturer at University of Engineering & Management, Kolkata. We would also like to thank the other group members for their hard work and dedication in completing the case study on Optical Character Recognition in Handwriting Analysis.

VII. REFERENCES

Holley, R., “How good can it get? Analyzing and improving OCR accuracy in large scale historic newspaper digitization programs.” D-Lib Magazine 15.3/4 (2009).
“The challenges of historical materials and an overview on the technical solutions in IMPACT” [Online]. Available: https://impactocr.wordpress.com/2010/05/07/anoverview-of-technical-solutions-in-impact/
Mori, S., Suen, C. Y., and Yamamoto, K., “Historical review of OCR research and development.” Proceedings of the IEEE 80.7 (1992): 1029-1058.
Feng, S., and Manmatha, R., “A hierarchical, HMM based automatic evaluation of OCR accuracy for a digital library of books.” Digital Libraries, 2006. JCDL’06. Proceedings of the 6th ACM/IEEE-CS Joint Conference on. IEEE, 2006.
Gupta, M. R., Jacobson, N. P., and Garcia, E. K., “OCR binarization and image pre-processing for searching historical documents.” Pattern Recognition 40.2 (2007): 389-397.
Wikipedia, Basic Idea of Optical Character Recognition, [Online]
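The normalized Levenshtein distance mentioned in the conclusion can be sketched as follows. The paper does not state which normalization it has in mind. Dividing the edit distance by the length of the longer string is one common choice and is an assumption here, as are the function names. A value of 0.0 means the OCR output matches the reference transcription exactly, and 1.0 means every character differs.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(ocr_text: str, ground_truth: str) -> float:
    """Edit distance divided by the length of the longer string
    (an assumed normalization), giving a value between 0.0 and 1.0."""
    longest = max(len(ocr_text), len(ground_truth))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(ocr_text, ground_truth) / longest


if __name__ == "__main__":
    reference = "handwritten manuscript"
    ocr_output = "handwriten manuscrlpt"  # hypothetical OCR result
    print(normalized_levenshtein(ocr_output, reference))
```

Because the score is relative to text length, it allows results on pages of different lengths to be compared, which a raw edit count would not.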