Dealing with metadata #16

Open
opened 3 years ago by putt1ck · 3 comments
putt1ck commented 3 years ago

Looking at the provisional test set, I've noticed that some docs that are prima facie anonymised are not in fact anonymised, because they contain metadata with identifiers. The original tool chain contained a stage to deal with this during initial upload, so:

(a) is that still the case, and

(b) do we have a way to ingest new metadata as part of the output?
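
To make the problem concrete, here is a minimal sketch of how one might check whether an ODT file still carries identifying metadata. It assumes the docs in question are ODT files (zip archives with a `meta.xml` member); the function name `list_odt_metadata` is hypothetical and not part of the tool chain.

```python
import zipfile
import xml.etree.ElementTree as ET


def list_odt_metadata(odt_file):
    """Return {field_name: text} for every non-empty field in meta.xml.

    `odt_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        if "meta.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("meta.xml"))

    fields = {}
    for element in root.iter():
        text = (element.text or "").strip()
        if text:
            # ElementTree names tags as "{namespace}local"; keep the local part.
            fields[element.tag.split("}")[-1]] = text
    return fields
```

Fields such as `creator` or `initial-creator` showing up in the result would indicate that the doc is not fully anonymised. This only looks at `meta.xml`; identifiers can also sit in tracked changes, comments or embedded images.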
putt1ck added the enhancement label 3 years ago
Owner

(a) It is still the case, because the current tool chain does not read the metadata at input and the file is discarded once the content is sent for processing, so the output file does not include any of the initial metadata.

However, there is no search and displace process that would take in a document, strip the metadata and return the document without it. I'm not sure this is within the scope of search and displace.

(b) There isn't a way. How do you see this being added to the workflow? Present all available metadata fields to the user in the interface so they can set them, and add parameters to the CLI tool for each possible metadata field? Again, that sounds like it's outside the scope of search and displace.
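
If this were added to the CLI, one option that avoids a separate parameter per field is a single repeatable `KEY=VALUE` option. The sketch below is purely illustrative: the `--meta` flag, the `sd-example` program name and the helper names are assumptions, not existing search and displace options.

```python
import argparse


def parse_meta_pair(raw):
    """Turn 'key=value' into a (key, value) tuple, rejecting malformed input."""
    key, separator, value = raw.partition("=")
    if not separator or not key.strip():
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {raw!r}")
    return key.strip(), value.strip()


def build_parser():
    # Hypothetical CLI: one repeatable flag instead of one flag per field.
    parser = argparse.ArgumentParser(prog="sd-example")
    parser.add_argument(
        "--meta",
        action="append",
        default=[],
        type=parse_meta_pair,
        metavar="KEY=VALUE",
        help="metadata field to write to the output document (repeatable)",
    )
    return parser


def collect_metadata(argv):
    """Parse argv and return the requested metadata as a dict."""
    args = build_parser().parse_args(argv)
    return dict(args.meta)
```

For example, `collect_metadata(["--meta", "creator=Anonymised"])` would yield a dict with a single `creator` entry, which a later stage could write into the output file.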
Collaborator

Right now we are not using a metadata remover because we delete the original document as soon as we have read its text contents and/or images, or if something goes wrong during the S&D process (which includes the ingest). The processed contents are currently put in a new Markdown file, which can be converted to a new ODT document via the interface.

If we need to remove the metadata from the original document, we can use the MAT2 library, which was mentioned in the [forum](https://forum.searchanddisplace.com/t/existing-tools-which-could-be-used-or-adapted-for-use-in-the-toolchain/19/6).
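
For illustration only, this is roughly what stripping metadata from an ODT while keeping the rest of the document could look like using just the Python standard library. It is not MAT2 and not part of the current S&D code; the function name `strip_odt_metadata` and the placeholder `EMPTY_META` are assumptions, and a dedicated tool would cover more cases (tracked changes, comments, embedded image metadata).

```python
import zipfile

# Placeholder meta.xml with no fields (an assumption for this sketch).
EMPTY_META = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<office:document-meta '
    'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    'office:version="1.2"><office:meta/></office:document-meta>'
)


def strip_odt_metadata(source, destination):
    """Copy an ODT archive, replacing meta.xml with an empty placeholder.

    `source` and `destination` may be paths or binary file-like objects.
    """
    with zipfile.ZipFile(source) as src, zipfile.ZipFile(destination, "w") as dst:
        for item in src.infolist():
            if item.filename == "meta.xml":
                data = EMPTY_META.encode("utf-8")
            else:
                data = src.read(item.filename)
            # ODF expects the "mimetype" member to be stored uncompressed.
            if item.filename == "mimetype":
                compression = zipfile.ZIP_STORED
            else:
                compression = zipfile.ZIP_DEFLATED
            # Writing by name also resets each member's timestamp.
            dst.writestr(item.filename, data, compress_type=compression)
```

Members are copied in their original order, so `mimetype` stays first if it was first in the source archive.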
Poster

The idea of holding a clean version of the original doc (metadata *and* content stripped) was related to how to rebuild the doc at the end so that it is broadly similar in look/feel to the unprocessed doc. This only matters for docs intended for human consumption, but it's definitely wanted then (obvs holding the doc framework to later inject the modified content back is only one solution, so if there's a better way...)