Repository for the Search and Displace ingest module, which takes ODF, DOCX and PDF files and converts them to Markdown (.md) for use in search and displace operations.
Alex Puiu 50ab7a6333 Send document back to core in HTML format 2 years ago

README.md

Search and Displace Ingest

🌀 Server Requirements:

Built with:

  • Laravel Framework ^6.2

🚀 Installation

Ubuntu Packages

# LibreOffice
apt-add-repository ppa:libreoffice/ppa
apt-get update
apt-get install libreoffice

# Python
apt-get update
apt-get install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get install supervisor python3.8 python3.8-dev

# Redis
apt-get install redis-server

# PDF Converter
apt-get install libpoppler-cpp-dev
apt-get install poppler-utils

# Tesseract OCR
add-apt-repository ppa:alex-p/tesseract-ocr-devel
apt-get update
apt-get install tesseract-ocr

# Unpaper
apt-get install unpaper

# DOCX to PDF Converter
apt-get install unoconv

# Pandoc
apt-get install pandoc
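
Once the packages above are installed, a quick check that the expected binaries are on the PATH can save debugging later. The following is a minimal sketch, not part of the original project: the function name `missing_commands` is hypothetical, and the binary names (for example `soffice` for LibreOffice and `pdftotext` for poppler-utils) are assumptions about what the Ubuntu packages install.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed binary names for the packages listed in this README.
missing=$(missing_commands soffice python3.8 redis-server pdftotext \
    tesseract unpaper unoconv pandoc)

if [ -n "$missing" ]; then
    echo "Missing commands:"
    echo "$missing"
else
    echo "All required commands found."
fi
```

If a binary is reported missing, re-run the matching `apt-get install` line or adjust the name to the one your distribution uses.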

Library Packages

# Pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
rm -rf get-pip.py
pip install --upgrade pip

# Pdftotext
pip install pdftotext

# Supervisor
pip install supervisor
systemctl enable supervisor
mkdir /var/log/amqp
mkdir /var/log/queue

# Deskew
cd DESKEW_INSTALLATION_DIRECTORY
cd Bin
./deskew INPUT OUTPUT

# Dewarp
pip3 install opencv-python

cd DEWARP_INSTALLATION_DIRECTORY
pip3 install -r requirements.txt

Queues Supervisor config

Add a new Supervisor config file under "/etc/supervisor/conf.d", as in the example below:

Config file path: /etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf

[program:queue-worker-search-and-displace-ingest-production]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/searchanddisplace-ingest/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/queue-worker-search-and-displace-ingest-production.log

The value of the 'command' key must point to the app's artisan file (in the example above, the app is installed in "/var/www/html/searchanddisplace-ingest").

The 'stdout_logfile' value is the path of the log file. All of its parent directories must already exist.
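
Because Supervisor needs the log directory to exist already, it can help to create it before loading the config. A minimal sketch, assuming a POSIX shell; the helper name `ensure_log_dir` is hypothetical and not part of the project:

```shell
#!/bin/sh
# Create the parent directory of a log file path if it does not exist yet.
# The log file itself is left for Supervisor to create.
ensure_log_dir() {
    log_dir=$(dirname "$1")
    mkdir -p "$log_dir"
}

# Path taken from the example config; adjust it to your own setup.
# Writing under /var/log typically needs root, so the call is commented out:
# ensure_log_dir /var/log/queue/queue-worker-search-and-displace-ingest-production.log
```

Using `mkdir -p` makes the helper safe to call repeatedly, since it succeeds when the directory is already present.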

Install app

# Generate environment file
cp .env.example .env

# Install backend packages
composer install

# Generate app key
php artisan key:generate

# In .env, set QUEUE_CONNECTION to redis, if it is not set already

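# Deploy supervisor (reread and update so a newly added config file is picked up)
supervisorctl reread
supervisorctl update
supervisorctl start all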
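
The queue workers only use Redis when QUEUE_CONNECTION is set accordingly, so it may be worth verifying the value before starting them. The sketch below assumes a plain `KEY=value` .env file without quoting; the helper name `env_value` is hypothetical.

```shell
#!/bin/sh
# Print the value of a key from a dotenv-style file (last assignment wins).
# Surrounding quotes, if any, are not stripped.
env_value() {
    # $1 = file, $2 = key
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Warn when the app's .env exists but does not select the redis connection.
if [ -f .env ] && [ "$(env_value .env QUEUE_CONNECTION)" != "redis" ]; then
    echo "QUEUE_CONNECTION is not set to redis in .env" >&2
fi
```

Run it from the app's root directory so that `.env` resolves to the file generated from `.env.example`.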

Search and Displace Core Setup

  • Install the Search and Displace Core app, found at https://git.law/newroco/searchanddisplace-core
  • Set the WEBHOOK_CORE_URL variable in .env to the URL of the Search and Displace Core app
  • Set the WEBHOOK_CORE_SECRET variable in .env to the same value as WEBHOOK_CLIENT_SECRET in the Search and Displace Core app's .env file
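
The two .env values above can be edited by hand or set from a script. As a rough sketch, assuming a plain `KEY=value` .env file: the helper `set_env_value` is hypothetical, and the URL and secret shown in the comments are placeholders, not real values.

```shell
#!/bin/sh
# Set KEY=VALUE in a dotenv-style file: drop any existing assignment
# of the key, then append the new one at the end of the file.
set_env_value() {
    # $1 = file, $2 = key, $3 = value
    touch "$1"
    tmp_file=$(mktemp)
    # grep exits non-zero when no line is kept; that is not an error here.
    grep -v "^$2=" "$1" > "$tmp_file" || true
    printf '%s=%s\n' "$2" "$3" >> "$tmp_file"
    # Write back through the original path to keep its permissions.
    cat "$tmp_file" > "$1"
    rm -f "$tmp_file"
}

# Placeholder values; replace them with the ones from your Core install.
# set_env_value .env WEBHOOK_CORE_URL "https://core.example.test"
# set_env_value .env WEBHOOK_CORE_SECRET "change-me"
```

The secret must match WEBHOOK_CLIENT_SECRET in the Core app's .env file exactly. If the two values differ, the Core app will presumably reject the webhook calls.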

PHP Packages