Repository for the Search and Displace ingest module, which converts ODF, DOCX and PDF files to Markdown (.md) for use with search and displace operations.
## Search and Displace Ingest
## :cyclone: Server Requirements:
- php7.4 [https://www.php.net] [LICENSE](https://www.php.net/license/index.php)
- apache [https://httpd.apache.org] [LICENSE](https://www.apache.org/licenses/LICENSE-2.0)
- python 3.8 [https://www.python.org/] [LICENSE](https://docs.python.org/3/license.html)
- composer [https://getcomposer.org/] [LICENSE](https://github.com/composer/composer/blob/main/LICENSE)

## :zap: Built with:
- Laravel Framework ^6.2
## :rocket: Installation
### Ubuntu Packages
```bash
# LibreOffice
apt-get install python-software-properties
apt-add-repository ppa:libreoffice/ppa
apt-get update
apt-get install libreoffice

# Python
apt-get update
apt-get install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get install supervisor python3.8 python3.8-dev

# Redis
apt-get install redis-server

# PDF Converter
apt-get install libpoppler-cpp-dev
apt-get install poppler-utils

# Tesseract OCR
add-apt-repository ppa:alex-p/tesseract-ocr-devel
apt-get update
apt-get install tesseract-ocr

# Unpaper
apt-get install unpaper

# DOCX to PDF Converter
apt-get install unoconv
```
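
After installing the packages above, it can help to confirm that the expected commands are actually on the `PATH`. The following is a minimal sketch, not part of this repository: `missing_tools` is a hypothetical helper, and the binary names passed to it are assumptions about what the packages install (for example `soffice` for LibreOffice and `pdftotext` for poppler-utils).

```bash
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH. Prints nothing if all are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed binary names for the packages installed above.
missing_tools soffice python3.8 redis-server supervisorctl \
  pdftotext tesseract unpaper unoconv
```

Any name printed by the last command points to a package that still needs to be installed or added to the `PATH`.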
### Libraries Packages
```bash
# Pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
rm -rf get-pip.py
pip install --upgrade pip

# Pdftotext
pip install pdftotext

# Supervisor
pip install supervisor
systemctl enable supervisor
mkdir /var/log/amqp
mkdir /var/log/queue

# Deskew
cd DESKEW_INSTALLATION_DIRECTORY
cd Bin
./deskew INPUT OUTPUT

# Dewarp
pip3 install opencv-python
cd DEWARP_INSTALLATION_DIRECTORY
pip3 install -r requirements.txt
```
### Install app
```bash
# Generate environment file
cp .env.example .env

# Install backend packages
composer install

# Generate app key
php artisan key:generate

# Change the value for the QUEUE_CONNECTION to redis, if it is not set already

# Deploy supervisor
php artisan queue:deploy-supervisor
supervisorctl start all
```
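
The `QUEUE_CONNECTION` step can be done by hand in an editor, or scripted. The sketch below is one possible way to do it and is not part of this repository: `set_env_value` is a hypothetical helper that replaces an existing `KEY=...` line or appends one if the key is missing. It assumes the value contains no `|` or `&` characters, since those are special in the `sed` expression.

```bash
# Hypothetical helper: set KEY=VALUE in an env file.
# Replaces the existing line for KEY, or appends one if KEY is absent.
# Assumes VALUE contains no "|" or "&" characters.
set_env_value() {
  file="$1"; key="$2"; value="$3"
  if grep -q "^${key}=" "$file"; then
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" \
      && mv "$file.tmp" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Only touch .env if it already exists (created by `cp .env.example .env`).
if [ -f .env ]; then
  set_env_value .env QUEUE_CONNECTION redis
fi
```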
### Search and Displace Core Setup
- Install the `Search and Displace Core` app, found here https://git.law/newroco/searchanddisplace-core
- Get the URL of the `Search and Displace Core` app and add it to the `WEBHOOK_CORE_URL` variable in `.env`
- Add in `.env` the `WEBHOOK_CORE_SECRET` value which needs to be the same value as the `WEBHOOK_CLIENT_SECRET` in the `Search and Displace Core` app's `.env` file
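
To check that both webhook variables from the steps above ended up in `.env` with non-empty values, something like the following could be used. This is a sketch only: `missing_env_keys` is a hypothetical helper, not a command provided by this app, and it does not verify that the secret matches the one configured in the Core app.

```bash
# Hypothetical helper: print every key that is absent from the env file
# or present with an empty value. Prints nothing if all keys are set.
missing_env_keys() {
  file="$1"; shift
  for key in "$@"; do
    # A key counts as set only if at least one character follows "=".
    grep -q "^${key}=." "$file" || printf '%s\n' "$key"
  done
}

if [ -f .env ]; then
  missing_env_keys .env WEBHOOK_CORE_URL WEBHOOK_CORE_SECRET
fi
```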
## PHP Packages
- cebe/markdown [LICENSE](https://github.com/cebe/markdown/blob/master/LICENSE)
- fideloper/proxy [LICENSE](https://github.com/fideloper/TrustedProxy/blob/master/LICENSE.md)
- laravel/framework [LICENSE](https://github.com/laravel/framework/blob/7.x/LICENSE.md)
- laravel/tinker [LICENSE](https://github.com/laravel/tinker/blob/2.x/LICENSE.md)
- league/html-to-markdown [LICENSE](https://github.com/thephpleague/html-to-markdown/blob/master/LICENSE)
- phpoffice/phpword [LICENSE](https://github.com/PHPOffice/PHPWord/blob/0.17.0/LICENSE)
- predis/predis [LICENSE](https://github.com/php-enqueue/amqp-bunny/blob/master/LICENSE)
- spatie/laravel-webhook-server [LICENSE](https://github.com/spatie/laravel-webhook-server/blob/master/LICENSE.md)
- spatie/pdf-to-text [LICENSE](https://github.com/spatie/pdf-to-text/blob/main/LICENSE.md)
- thiagoalessio/tesseract_ocr [LICENSE](https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/MIT-LICENSE)