From 4ae7bcc7d0eca1b7494c12d7cdd5d273c0bb1564 Mon Sep 17 00:00:00 2001
From: Chris Puttick
Date: Wed, 1 Apr 2020 13:33:01 +0100
Subject: [PATCH] Add files via upload

---
 original workflow.svg | 3 +++
 1 file changed, 3 insertions(+)
 create mode 100644 original workflow.svg

diff --git a/original workflow.svg b/original workflow.svg
new file mode 100644
index 0000000..effd5f3
--- /dev/null
+++ b/original workflow.svg
@@ -0,0 +1,3 @@
+
+
+
Document (trad file)
Document (scanned image)
Pass to OCR
(via deskew tools/page tidying etc. as needed [various])
Convert/copy to ODT (preserve original document, creating clean copy [LibreOffice, MAT2, !])
Extract text with formatting (as near as practical to the original, to ease reading and comparison)
Extract images

(direct jar access)
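A minimal sketch of what "direct jar access" could look like for this step, assuming the clean copy is an ODT file. An ODF document is a ZIP package, and embedded images are conventionally stored under `Pictures/`. The function name `extract_images` and the output directory layout are illustrative assumptions, not part of the workflow.

```python
import zipfile
from pathlib import Path


def extract_images(odt_path, out_dir):
    """Copy every file under Pictures/ in an ODF package into out_dir.

    Hypothetical helper: the name and return value are assumptions
    made for illustration only.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(odt_path) as package:
        for name in package.namelist():
            # Skip anything outside Pictures/ and any directory entries.
            if not name.startswith("Pictures/") or name.endswith("/"):
                continue
            # Keep only the base name so an entry cannot escape out_dir.
            target = out / Path(name).name
            target.write_bytes(package.read(name))
            written.append(target)
    return written
```

Images that are linked rather than embedded would not appear in the package, so this sketch covers embedded images only.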
Generate combined md file
[LibreOffice, pandoc]
!
Pass text to search and displace
[sift, ?!]
Marked up text to UI
[ProseMirror]
Accept/action as required
Initial search/
displace parameters
OCR Toolchain
If good
If bad
optimise scan
[unpaper]
[tesseract]
Manual-ish optimisers [OpenCV, ImageMagick, deskew GUI]
preview
[tbc]
Use of [ ] denotes (probable) tool(s) used within step.
A red ! indicates that a notable challenge is expected; ? indicates that we believe a suitable tool either exists or can be adapted, but has not yet been identified.
!
Output or
re-search
additional search/
displace parameters
(includes "not" from actions)
Output PDF/ODF/hybrid (formatted as near as possible to original, based on first convert/copy/OCR)
[Pandoc, LibreOffice, !]
For proper results with digital originals, this will involve overwriting the text as needed inside the ODF jar, sticking within style sections etc. This doesn't resolve changes to line wraps or page breaks, or cases where the original font is not available, but should otherwise preserve the original look and feel of the document. PDF (of the particular type wanted/needed) would then be a secondary output stage, chosen by and invisible to the user.
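As a rough sketch of overwriting text inside the ODF jar, the package can be copied entry by entry with only `content.xml` changed. This assumes `content.xml` is UTF-8 encoded. The name `rewrite_content` and the `transform` callback are hypothetical, introduced only for illustration; the actual replacement logic would come from the search and displace step.

```python
import zipfile


def rewrite_content(src_path, dst_path, transform):
    """Copy an ODF package to dst_path, passing content.xml through transform.

    Hypothetical helper: transform takes the XML as a str and returns a str.
    """
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                # Assumption: content.xml is UTF-8 encoded.
                data = transform(data.decode("utf-8")).encode("utf-8")
            # ODF packaging expects "mimetype" to be stored uncompressed.
            # Entry order is preserved, so it stays first if it was first.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)
```

A naive call might be `rewrite_content("in.odt", "out.odt", lambda xml: xml.replace("Alice", "[REDACTED]"))`. Plain string replacement will miss text that is split across formatting spans and could touch markup as well as text, so a real implementation would need to work on the parsed text nodes.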
Extract layout info
[!]
Option to output as marked up text for other processing
Direct to action for bulk processing
\ No newline at end of file