Dealing with metadata #16

Looking at the provisional test set, I've noticed that some docs that are prima facie anonymised are not in fact anonymised, because they contain metadata with identifiers. The original tool chain contained a stage to deal with this during initial upload, so:

(a) is that still the case and,

(b) do we have a way to ingest new metadata as part of the output?

(a) It is still the case, because the current tool chain does not read the metadata at input and the file is discarded once the content is sent for processing, so the output file does not include any of the initial metadata.

However, there is no search and displace process that would take in a document, strip the metadata and return the document without it. I'm not sure this is within the scope of search and displace.

(b) There isn't a way. How do you see this being added to the workflow? Present all available metadata fields to the user in the interface so they can set them, and add parameters to the CLI tool for each possible metadata field? Again, this sounds like it's outside the scope of search and displace.
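If this were ever added to the CLI, one way to avoid a separate parameter per metadata field would be a single repeatable option that takes KEY=VALUE pairs. The sketch below is only an illustration of that idea: the tool name `sd-example`, the `--meta` option and the helper names are all hypothetical and not part of the existing S&D CLI.

```python
import argparse


def parse_meta(pair):
    """Turn 'key=value' into a (key, value) tuple, rejecting malformed input."""
    key, sep, value = pair.partition("=")
    if not sep or not key.strip():
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {pair!r}")
    return key.strip(), value.strip()


def build_parser():
    # "sd-example" is a made-up program name used only for this sketch.
    parser = argparse.ArgumentParser(prog="sd-example")
    parser.add_argument("input", help="document to process")
    parser.add_argument(
        "--meta",
        action="append",      # the option may be given several times
        type=parse_meta,      # each occurrence is validated and split
        default=[],
        metavar="KEY=VALUE",
        help="metadata field to write to the output (repeatable)",
    )
    return parser


def parse_args(argv=None):
    """Parse the command line and collapse the --meta pairs into a dict."""
    args = build_parser().parse_args(argv)
    args.meta = dict(args.meta)
    return args
```

With this shape, something like `sd-example doc.odt --meta title=Report --meta language=en` would yield `{"title": "Report", "language": "en"}`, and the set of accepted fields could be restricted later without changing the command-line interface.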

Right now we are not using a metadata remover, because we delete the original document as soon as we have read its text contents and/or images, or if something goes wrong during the S&D process (which includes the ingest). The processed contents are currently put in a new Markdown file, which can be converted to a new ODT document via the interface.

If we need to remove the metadata from the original document we can use the MAT2 library, which was mentioned in the forum.
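For ODT files specifically, the idea can be illustrated without MAT2. An ODT is a ZIP archive whose document properties live in `meta.xml`, so a rough sketch, using only the Python standard library, is to copy the archive and swap in an empty `meta.xml`. This is not how MAT2 works and it is much less thorough: it does not touch metadata inside embedded images, tracked changes or comments. The function names here are made up for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Minimal meta.xml with an empty <office:meta> element (no author, dates, etc.).
EMPTY_META = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    f'<office:document-meta xmlns:office="{OFFICE_NS}" office:version="1.2">'
    "<office:meta/></office:document-meta>"
)


def read_odt_metadata(path):
    """Return the fields found in meta.xml as {local tag name: text}."""
    with zipfile.ZipFile(path) as odt:
        root = ET.fromstring(odt.read("meta.xml"))
    meta = root.find(f"{{{OFFICE_NS}}}meta")
    if meta is None:
        return {}
    # Drop the XML namespace so "dc:creator" is reported as "creator".
    return {child.tag.split("}")[-1]: (child.text or "") for child in meta}


def strip_odt_metadata(src_path, dst_path):
    """Copy an ODT file, replacing meta.xml with an empty metadata document."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for item in src.infolist():  # keeps the original entry order
            if item.filename == "meta.xml":
                data = EMPTY_META.encode("utf-8")
            else:
                data = src.read(item.filename)
            # A fresh ZipInfo uses the default 1980-01-01 timestamp, so the
            # original per-entry modification times are not carried over.
            info = zipfile.ZipInfo(item.filename)
            # ODF expects "mimetype" to be stored uncompressed.
            compression = (
                zipfile.ZIP_STORED if item.filename == "mimetype"
                else zipfile.ZIP_DEFLATED
            )
            dst.writestr(info, data, compress_type=compression)
```

Calling `read_odt_metadata` before and after `strip_odt_metadata` is a quick way to check which identifiers a document was carrying and that they are gone from the copy.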

The idea of holding a clean version of the original doc (metadata and content stripped) was related to how to rebuild the doc at the end so that it is broadly similar in look and feel to the unprocessed doc. This only matters for docs intended for human consumption, but it's definitely wanted in that case. (Obviously, holding the doc framework to later inject the modified content back into is only one solution, so if there's a better way...)