Recognize Text in Scanned Documents

Optical Character Recognition (OCR)

What is OCR?

OCR stands for Optical Character Recognition. It is used to recognize text in rasterized documents (e.g. scanned images or scanned text pages).

The OCR process from SEAL Systems works for raster and vector data and can be integrated into automated processes. OCR makes text that exists only as pixel patterns machine-readable, and therefore automatically searchable. For large file collections, search engines can additionally index the files in advance, so that searching the entire file stock is very fast.

Who needs OCR?

The use cases for OCR are numerous. For example, OCR can help with the digital archiving of old documents.

Have you relied for many years on the once widespread TIFF format for archiving? Was the text in your documents digitally encoded and machine-readable, only to be lost when the documents were converted to the pure raster format TIFF? Then we have good news: we can at least recover the text for you!

Do your suppliers deliver production-relevant documents as scans? These, too, are raster images from which we can extract the text and store it as searchable text in the PDF.

FAQs on OCR

We would like to integrate OCR into our document processing. At which points is this suitable?

We recommend including OCR in your processes at the following steps:

  • During document release
  • During a file conversion
  • Before check-in to the DMS
  • During conversion of legacy data to PDF or PDF/A

Not every file is necessarily run through OCR, however. Either the system itself detects whether OCR is useful, or the OCR process is invoked specifically for raster files only.
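As an illustration only, the decision of whether a file should be sent to OCR could be sketched as follows. This is not the SEAL Systems implementation: the function name `needs_ocr`, the list of raster extensions, and the `has_text_layer` flag are assumptions made for this example, and a real system would inspect the file content rather than rely on the file name.

```python
from pathlib import Path

# Hypothetical set of raster formats; adjust to the formats in your own archive.
RASTER_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp"}


def needs_ocr(filename: str, has_text_layer: bool = False) -> bool:
    """Return True if the file should be routed through OCR.

    has_text_layer is assumed to be determined elsewhere, for example
    by a PDF analysis step that checks for existing searchable text.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in RASTER_EXTENSIONS:
        # Pure raster files never contain searchable text.
        return True
    if suffix == ".pdf":
        # A PDF only needs OCR if it has no searchable text yet.
        return not has_text_layer
    # Other formats (e.g. office documents) are left untouched in this sketch.
    return False
```

In this sketch, a scanned TIFF is always routed to OCR, while a PDF is routed to OCR only if no searchable text was found in it.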

We have PDF files with visible text, but the text cannot be searched. What can be done?

This can have several reasons:

  • PDF files created by scanning initially consist only of pixels. A person can read the text, but the computer cannot find it at first. Scanners often have integrated OCR, but it may not be very effective.
  • CAD systems often render text in their output only as line drawings. This happens when the CAD system does not work with standard fonts, so the special fonts used for the screen display may not be available in the output.
  • Images embedded in the PDF can themselves contain text that you want to have recognized and found.

What advantages do files with searchable text create?

Information can be found more quickly if the search is not limited to keywords assigned in the DMS, but can also look for relevant terms directly in the files. To do this, however, the visible text must be searchable. Because data is exchanged along supply chains, documents cannot always be managed via the DMS alone. Files become significantly more usable if the keywords needed to classify them can be taken directly from the file.

We would like to convert our legacy data from TIFF to PDF/A. Is this possible?

OCR makes sense here, too! PDF/A is increasingly replacing the raster format TIFF as the archive format. Existing TIFF files and scanned originals can be converted to PDF particularly easily. Without additional OCR processing, however, this conversion brings no added value: the resulting PDF contains nothing but a raster image. Only enriching it with text elements provides an additional benefit.
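To make the idea concrete, here is a minimal sketch that uses the open-source command-line tool OCRmyPDF as a stand-in. It is not the SEAL Systems product, and the helper name `build_ocr_command`, the default language and the default resolution are assumptions for this example. The function only assembles the command that would convert a single-page TIFF into a PDF/A with a searchable text layer; it does not run it.

```python
from pathlib import Path
from typing import List


def build_ocr_command(tiff_path: str, language: str = "eng", dpi: int = 300) -> List[str]:
    """Assemble an OCRmyPDF call that turns a TIFF into a searchable PDF/A.

    The output file is placed next to the input with a .pdf suffix
    (an assumption of this sketch, not a requirement of the tool).
    """
    source = Path(tiff_path)
    target = source.with_suffix(".pdf")
    return [
        "ocrmypdf",
        "--output-type", "pdfa",   # write PDF/A instead of plain PDF
        "--language", language,    # OCR language passed to the OCR engine
        "--image-dpi", str(dpi),   # resolution to assume for image input
        str(source),
        str(target),
    ]
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where OCRmyPDF is installed. Multi-page TIFF files may need to be converted to PDF first, so treat this as a starting point rather than a complete migration routine.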

Intrigued?

Request further information without obligation!

Conversion of Legacy Data to PDF/A

OCR is particularly useful when archiving documents and files. However, we not only help our customers make archived raster formats readable; we can also convert them into the appropriate file format. PDF/A is particularly suitable for secure long-term archiving. Learn more about the advantages of PDF/A: