OCR & TRANSCRIPTION

Get the text out of your digitized documents.

Integrate DSpace with external Optical Character Recognition software;

Process hOCR output for full-text indexing in Solr and for text overlay on images;

Support a large set of languages out of the box, and add more with custom training files;

Work online on manual transcription;

And much more!


Description of the available code

The OCR module integrates DSpace with external Optical Character Recognition software. Out of the box, the module supports the open-source Tesseract OCR engine (https://github.com/tesseract-ocr). A curation task extracts the text of each image in hOCR format, which is then used for full-text indexing in Solr. Tesseract supports a very large set of languages, including Italian, French, Spanish, German, Arabic, Simplified and Traditional Chinese, and many others (https://github.com/tesseract-ocr/langdata). The OCR engine can also be given custom training files to recognize specific fonts and languages.
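To illustrate what the hOCR output carries, here is a minimal Python sketch, not the module's own code, that reads word text and bounding boxes from an hOCR page and derives the plain text that could be sent to a full-text index. The sample markup, the class name HocrWordParser and the helper names are assumptions made for this illustration; only the hOCR conventions (the ocrx_word class and the bbox entry in the title attribute) come from the format itself.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR fragment, similar in shape to what Tesseract emits
# when run as: tesseract page.png page -l eng hocr
SAMPLE_HOCR = """<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 36 92 300 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Digital</span>
  <span class='ocrx_word' title='bbox 110 92 200 116; x_wconf 91'>library</span>
 </span>
</div>"""


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span in an hOCR page."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word currently being read, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            self._bbox = tuple(int(g) for g in m.groups()) if m else ()
            self._buf = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._buf).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr(hocr):
    """Return the recognized words with their pixel bounding boxes."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def plain_text(words):
    """Join the words into the text a full-text index would receive."""
    return " ".join(text for text, _ in words)


words = parse_hocr(SAMPLE_HOCR)
print(plain_text(words))
```

Keeping the bounding boxes next to the text is what makes both uses possible: the joined text feeds the index, while the coordinates allow the text to be overlaid on the image.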

When the IIIF Image Viewer module is also installed, the OCR module supports the IIIF Search API through a server component, subject to the same terms as the module license. The IIIF Search API enables search inside the IIIF viewer: searching within images, navigating through the results, and highlighting on the image the OCR text that matches the search terms. The internal search engine also suggests search terms as the user types.
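As a rough sketch of how OCR word positions can drive highlighting in a viewer, the Python below turns matching words into an annotation list shaped after the simple response of the IIIF Content Search API 1.0, where each hit targets a canvas region through an xywh fragment. It is an illustration under stated assumptions, not the module's server component: the URIs, the WORDS sample and the function names are hypothetical, and the exact response fields should be checked against the specification.

```python
# Hypothetical OCR output for one page: (text, (x0, y0, x1, y1)) in pixels.
WORDS = [
    ("Digital", (36, 92, 96, 116)),
    ("library", (110, 92, 200, 116)),
    ("Library", (36, 140, 130, 164)),
]


def search_annotations(words, query, canvas_uri, search_uri):
    """Build an annotation list with one annotation per word matching the query."""
    q = query.casefold()
    resources = []
    for i, (text, (x0, y0, x1, y1)) in enumerate(words):
        if q in text.casefold():
            resources.append({
                "@id": f"{search_uri}/annotation/{i}",  # hypothetical URI scheme
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {"@type": "cnt:ContentAsText", "chars": text},
                # The viewer highlights this region: x, y, width, height.
                "on": f"{canvas_uri}#xywh={x0},{y0},{x1 - x0},{y1 - y0}",
            })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{search_uri}?q={query}",
        "@type": "sc:AnnotationList",
        "resources": resources,
    }


def suggest(words, prefix):
    """Suggest distinct indexed terms that start with the typed prefix."""
    p = prefix.casefold()
    return sorted({text.casefold() for text, _ in words if text.casefold().startswith(p)})


result = search_annotations(
    WORDS, "library",
    canvas_uri="https://example.org/canvas/1",
    search_uri="https://example.org/search",
)
for annotation in result["resources"]:
    print(annotation["on"])
print(suggest(WORDS, "li"))
```

The suggest helper mirrors the term suggestion described above in the simplest possible way; a real deployment would more likely delegate both search and suggestions to the search engine.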

 


Take a quick look at it


Check our services

The new features we could develop with your support

A dedicated UI will allow automatically generated OCR files to be replaced;

When the IIIF Image Server module is also installed, the system will allow the OCR to be edited on the image, capturing the positional information. OCR editing will also be available when no initial OCR file exists, allowing texts to be transcribed online;

An approval workflow for transcripts will allow decentralized but controlled process management.

To access the code and start using the module: €3,000
You can express your preference on the functionality you would like us to develop first.

Make IT open!

Target budget: €75,000

 


Other modules

IIIF Image Viewer

Use an international standard to work with image collections.

CKAN Integration

Add Research Data Management features to your DSpace.

Document Viewer

View, access and work on full documents.

Video/Audio Streaming

Simplify access and reuse of audio/video content.

To request a free demo presentation or for any other questions, write to info@4science.com.


4Science S.p.A.

VAT no. IT02451840397
Business Register R.E.A. MI n° 2100307
LEI code no. 8156005C8604D23F7649
D.U.N.S. no. 433959163

CONTACTS
HEADQUARTERS:

Viale A.Papa 30, 20149 Milan, Italy

OTHER OFFICES:
Urban Places, Via Tiburtina 652/A, 00159 Rome
Via L.Braille 15, 48124 Ravenna

Phone: +39.02.3971.0430

info@4science.com
www.4science.com

DSpace Certified Partner
ISO 9001:2015 Certified Company
Certified ORCID Service Provider
Registered DataCite Service Provider
CSA Solution Provider