|Full name:||META-CONTENTUM R&D Project|
|Start date:||June 2006|
|End date:||October 2007|
The goal of the project
Freesoft Plc. has made considerable efforts to improve the quality of its products and services by building on the results of R&D projects. Its Contentum content management product, a market leader among the file and document management systems used in Hungarian public administration, aims to better support the search and retrieval functions of these systems and to reduce the high cost of digitization.
Since the ‘almost paperless office’ can be achieved by digitizing paper documents after the fact, more precisely by scanning and OCR, a search engine with high fault tolerance would make the resulting texts more suitable for search and retrieval and more usable in practice. It could also considerably reduce the cost of digitization, because manual post-processing to correct OCR errors would no longer be needed.
Within the framework of the GVOP 3.3.3 program, the project develops specialized, fault-tolerant full-text search methods, as well as a solution for automatic data extraction and document classification.
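To illustrate what fault tolerance means for full-text search, the following is a minimal sketch and not the method developed in the project. It assumes a toy in-memory inverted index and uses the similarity matching from Python's standard `difflib` module; the function names, the sample documents and the `cutoff` value are all hypothetical.

```python
import difflib
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def fuzzy_search(index, query, cutoff=0.75):
    """Return ids of documents that contain a token similar to the query.

    The cutoff is an assumed similarity threshold, not a tuned value.
    """
    candidates = difflib.get_close_matches(
        query.lower(), list(index), n=10, cutoff=cutoff
    )
    hits = set()
    for token in candidates:
        hits |= index[token]
    return hits

# Made-up sample documents; document 2 simulates OCR output with lost accents.
docs = {
    1: "határozat a kérelem elbírálásáról",
    2: "hatarozat a kerelem elbiralasarol",
    3: "számla befizetése",
}
index = build_index(docs)
print(sorted(fuzzy_search(index, "határozat")))
```

An exact-match lookup for "határozat" would find only document 1; the tolerant lookup also returns document 2, whose accents were lost, without any manual correction of the text.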
The Department of Distributed Systems of SZTAKI analyzes the different kinds of errors that emerge during the digitization of Hungarian documents, and examines how these errors affect the searchability of the digitized items. The project aims to define a metric for the errors introduced during the OCR process, particularly those that result in the loss or change of characters or accents, and to build a robust search index for digital repositories containing automatically digitized, error-prone documents.
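One possible way to quantify such errors, shown here only as a sketch under stated assumptions and not as the metric defined by the project, is a character error rate based on edit distance, computed once on the raw text and once after accents have been folded away. The difference between the two values indicates how much of the error is due to lost or changed accents. The function names below are hypothetical.

```python
import unicodedata

def fold_accents(text):
    """Strip combining marks so that, for example, 'á' and 'a' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def error_rates(reference, ocr_output):
    """Character error rate with and without accent differences (assumed metric)."""
    # Normalize to NFC so that each accented letter counts as one character.
    reference = unicodedata.normalize("NFC", reference)
    ocr_output = unicodedata.normalize("NFC", ocr_output)
    n = max(len(reference), 1)
    total = edit_distance(reference, ocr_output)
    non_accent = edit_distance(fold_accents(reference), fold_accents(ocr_output))
    return {"cer": total / n, "cer_ignoring_accents": non_accent / n}

# Made-up example: the OCR output has lost every accent but no other character.
rates = error_rates("kérelem elbírálása", "kerelem elbiralasa")
print(rates)
```

The same folding function could also serve as a normalization step when building a search index, so that accented and unaccented variants of a word map to the same key; whether that trade-off is acceptable for Hungarian, where accents distinguish words, is exactly the kind of question the error analysis has to answer.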
The description of the testbed for the evaluation of digitization error-types and the statistics of the actual findings may be read here.