Repository for the Search and Displace ingest module, which takes ODF, DOCX and PDF files and converts them to Markdown (.md) for use with search and displace operations.
## Search and Displace Ingest
## :cyclone: Server Requirements:
- php7.4 [https://www.php.net] [LICENSE](https://www.php.net/license/index.php)
- apache [https://httpd.apache.org] [LICENSE](https://www.apache.org/licenses/LICENSE-2.0)
- python 3.8 [https://www.python.org/] [LICENSE](https://docs.python.org/3/license.html)
- composer [https://getcomposer.org/] [LICENSE](https://github.com/composer/composer/blob/main/LICENSE)
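As a quick sanity check before installing, the sketch below compares the installed PHP and Python versions against the versions listed above. It treats those versions as minimums, which is an assumption (the list names php7.4 and python 3.8 specifically), and the helper names `version_ge` and `check_min` are illustrative only. It relies on GNU `sort -V`, which is available on Ubuntu.

```bash
# version_ge A B: succeeds when version A >= version B (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME INSTALLED MINIMUM: print one status line for a tool.
check_min() {
    if version_ge "$2" "$3"; then
        echo "OK   $1 $2 (>= $3)"
    else
        echo "FAIL $1 $2 (< $3)"
    fi
}

# Only query tools that are actually present on this machine.
if command -v php >/dev/null 2>&1; then
    check_min php "$(php -r 'echo PHP_VERSION;')" 7.4
fi
if command -v python3 >/dev/null 2>&1; then
    check_min python3 "$(python3 -c 'import platform; print(platform.python_version())')" 3.8
fi
```

A newer major version passing this check does not guarantee compatibility with the app; it only flags versions that are too old.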
## :zap: Built with:
- Laravel Framework ^6.2
## :rocket: Installation
**NOTE**
The installation steps below were tested on an Ubuntu 20.04 LTS machine. All commands assume they are run with sudo unless specified otherwise, and they should be adapted to each specific environment.
Disk size for this service should be at least 10GB.
---
### Update package repository
```
apt-get update -y
```
### Install Apache2
```
apt-get -y install \
    apache2 \
    apache2-doc \
    apache2-utils \
    libapache2-mod-fcgid
```
### Install PHP and the required extensions
```
apt-get -y install software-properties-common && \
add-apt-repository ppa:ondrej/php -y && \
apt-get update -y && \
apt-get -y install \
    php7.4 \
    php7.4-common \
    php7.4-fpm \
    php7.4-mbstring \
    php7.4-sqlite3 \
    php7.4-xml \
    php7.4-zip
```
### Configure Apache2 and PHP
```
a2enmod \
    rewrite \
    actions \
    fcgid \
    alias \
    proxy_fcgi \
    remoteip && \
sed -i "s/DocumentRoot \/var\/www\/html/DocumentRoot \/var\/www\/html\/searchanddisplace-ingest\/public/g" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ <FilesMatch \.php\$>" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ SetHandler \"proxy:unix:\/var\/run\/php\/php7.4-fpm.sock|fcgi:\/\/localhost\"" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ <\/FilesMatch>" /etc/apache2/sites-available/000-default.conf && \
bash -c 'echo "RemoteIPHeader X-Forwarded-For" >> /etc/apache2/apache2.conf' && \
sed -i "s/LogFormat \"%v:%p %h/LogFormat \"%v:%p %a/g" /etc/apache2/apache2.conf && \
sed -i "s/LogFormat \"%h/LogFormat \"%a/g" /etc/apache2/apache2.conf && \
chown -R www-data /var/www/html && \
chmod -R 755 /var/www/html && \
sed -i "s/AllowOverride None/AllowOverride All/g" /etc/apache2/apache2.conf && \
systemctl restart apache2
```
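Because the `sed` edits above change the default virtual host in place, it can help to confirm that they actually landed. The following is a minimal sketch, not part of the original setup: the helper `check_vhost` is a hypothetical name, and it only looks for the `DocumentRoot` and the PHP-FPM `SetHandler` line that the commands above are meant to produce.

```bash
# check_vhost FILE APP_PUBLIC_DIR: succeed only if FILE sets the expected
# DocumentRoot and hands PHP files to the PHP-FPM socket.
check_vhost() {
    grep -Eq "^[[:blank:]]*DocumentRoot[[:blank:]]+$2[[:blank:]]*\$" "$1" || return 1
    grep -Fq 'proxy:unix:/var/run/php/php7.4-fpm.sock|fcgi://localhost' "$1" || return 1
}

VHOST=/etc/apache2/sites-available/000-default.conf
APP_PUBLIC=/var/www/html/searchanddisplace-ingest/public

# Skip quietly on machines where Apache is not installed yet.
if [ -f "$VHOST" ]; then
    if check_vhost "$VHOST" "$APP_PUBLIC"; then
        echo "vhost looks configured"
    else
        echo "vhost is missing the DocumentRoot or PHP-FPM handler" >&2
    fi
fi
```

Running `apache2ctl configtest` afterwards is a complementary check, since it validates the syntax of the whole Apache configuration rather than these two lines.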
### Install Composer
`apt-get -y install composer`
### Ubuntu Packages
```
# LibreOffice
apt-get update -y && \
apt-add-repository -y ppa:libreoffice/ppa && \
apt-get update -y && \
apt-get install -y libreoffice
```

```
# Python
apt-get update -y && \
apt-get install -y software-properties-common && \
add-apt-repository -y ppa:deadsnakes/ppa && \
apt-get install -y \
    build-essential \
    libpoppler-cpp-dev \
    pkg-config \
    supervisor \
    python3 \
    python3-dev
```

```
# Redis
apt-get install -y redis-server
```

```
# PDF Convertor
apt-get install -y \
    libpoppler-cpp-dev \
    poppler-utils
```

```
# Tesseract OCR
add-apt-repository -y ppa:alex-p/tesseract-ocr-devel && \
apt-get update -y && \
apt-get install -y tesseract-ocr
```

```
# Unpaper
apt-get install -y unpaper
```

```
# DOCX to PDF Convertor
apt-get install -y unoconv
```

```
# Pandoc
apt-get install -y pandoc
```
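After installing the packages above, a short check can confirm that their command-line tools are on the `PATH`. This is a sketch under assumptions: `missing_commands` is an illustrative helper, and the binary names are the ones these Ubuntu packages normally provide (`soffice` for LibreOffice, `pdftotext` for poppler-utils, `tesseract` for tesseract-ocr).

```bash
# missing_commands NAME...: print each command that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

missing="$(missing_commands soffice redis-server pdftotext tesseract unpaper unoconv pandoc)"
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are printed on one line.
    echo "Not found on PATH:" $missing >&2
else
    echo "All conversion tools found"
fi
```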
### Libraries Packages
```
# Pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
python3 get-pip.py && \
rm -rf get-pip.py && \
pip3 install --upgrade pip
```

```
# Pdftotext
pip3 install pdftotext
```

```
# Supervisor
pip3 install supervisor && \
systemctl enable supervisor && \
mkdir /var/log/amqp && \
mkdir /var/log/queue
```
### Queues Supervisor config
Config file path: **/etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf**
```bash
[program:queue-worker-search-and-displace-ingest-production]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/searchanddisplace-ingest/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/queue-worker-search-and-displace-ingest-production.log
```
The value for the **command** key should reflect the app path (in the example above the app's path is **/var/www/html/searchanddisplace-ingest**).
The **stdout_logfile** value is the log file. All parent directories must already exist.
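Since the log file's parent directories have to exist before the worker starts, one option is to read the path from the config and create the directory from it. The sketch below assumes the config file path given above; `logfile_from_conf` and `ensure_log_dir` are hypothetical helper names, and on a real server this needs root privileges.

```bash
# logfile_from_conf CONF: print the stdout_logfile value from a supervisor config.
logfile_from_conf() {
    sed -n 's/^[[:blank:]]*stdout_logfile[[:blank:]]*=[[:blank:]]*//p' "$1" | head -n 1
}

# ensure_log_dir CONF: create the parent directory of that log file if missing.
ensure_log_dir() {
    logfile="$(logfile_from_conf "$1")"
    [ -n "$logfile" ] || return 1
    mkdir -p "$(dirname "$logfile")"
}

CONF=/etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf
if [ -f "$CONF" ]; then
    ensure_log_dir "$CONF" && echo "log directory ready"
fi
```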
### Install app
- Download app
```
cd /var/www/html && \
git clone https://git.law/newroco/searchanddisplace-ingest.git && \
chown -R www-data:www-data searchanddisplace-ingest && \
cd searchanddisplace-ingest
```
- Install Dewarp
```
# Dewarp
cd /var/www/html/searchanddisplace-ingest/resources/python/dewarp && \
pip3 install opencv-python
```
- Install and configure app
```bash
# Generate environment file
cp .env.example .env

# Install backend packages
composer install

# Generate app key
php artisan key:generate

# Change in .env the value for the QUEUE_CONNECTION to redis, if it is not set already

# Deploy supervisor
supervisorctl reread
supervisorctl update
supervisorctl start all
```
- Check the queue worker is running
```
supervisorctl status
```
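The `QUEUE_CONNECTION` change in the configuration step above can be scripted rather than edited by hand. This is a minimal sketch: `set_env_var` is an illustrative helper, it relies on GNU `sed -i` as shipped with Ubuntu, and it assumes the value contains no `|` or `&` characters, which would need escaping.

```bash
# set_env_var FILE KEY VALUE: replace KEY's line in FILE, or append it if absent.
# Assumes VALUE has no '|' or '&', which are special in the sed replacement.
set_env_var() {
    if grep -q "^$2=" "$1"; then
        sed -i "s|^$2=.*|$2=$3|" "$1"
    else
        printf '%s=%s\n' "$2" "$3" >> "$1"
    fi
}

# Run from the app directory; does nothing if .env has not been created yet.
if [ -f .env ]; then
    set_env_var .env QUEUE_CONNECTION redis
fi
```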
### Search and Displace Core Setup
- Install the `Search and Displace Core` app, found here https://git.law/newroco/searchanddisplace-core
- Get the URL of the `Search and Displace Core` app and add it to the `WEBHOOK_CORE_URL` variable in `.env`
- Add in `.env` the `WEBHOOK_CORE_SECRET` value which needs to be the same value as the `WEBHOOK_CLIENT_SECRET` in the `Search and Displace Core` app's `.env` file
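A mismatch between the two secrets is easy to miss, so the sketch below compares them when both `.env` files are readable from the same machine. The helper names `env_value` and `secrets_match` are illustrative, and the Core app path is a hypothetical example; if the Core app runs on another server, the comparison has to be done by other means.

```bash
# env_value FILE KEY: print the value of KEY from a .env file,
# with one pair of surrounding quotes stripped.
env_value() {
    sed -n "s/^$2=//p" "$1" | head -n 1 | sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# secrets_match INGEST_ENV CORE_ENV: succeed when both secrets are set and equal.
secrets_match() {
    ingest="$(env_value "$1" WEBHOOK_CORE_SECRET)"
    core="$(env_value "$2" WEBHOOK_CLIENT_SECRET)"
    [ -n "$ingest" ] && [ "$ingest" = "$core" ]
}

INGEST_ENV=/var/www/html/searchanddisplace-ingest/.env
CORE_ENV=/var/www/html/searchanddisplace-core/.env   # hypothetical location

if [ -f "$INGEST_ENV" ] && [ -f "$CORE_ENV" ]; then
    if secrets_match "$INGEST_ENV" "$CORE_ENV"; then
        echo "webhook secrets match"
    else
        echo "webhook secrets differ or are unset" >&2
    fi
fi
```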
## PHP Packages
- cebe/markdown [LICENSE](https://github.com/cebe/markdown/blob/master/LICENSE)
- fideloper/proxy [LICENSE](https://github.com/fideloper/TrustedProxy/blob/master/LICENSE.md)
- laravel/framework [LICENSE](https://github.com/laravel/framework/blob/7.x/LICENSE.md)
- laravel/tinker [LICENSE](https://github.com/laravel/tinker/blob/2.x/LICENSE.md)
- league/html-to-markdown [LICENSE](https://github.com/thephpleague/html-to-markdown/blob/master/LICENSE)
- phpoffice/phpword [LICENSE](https://github.com/PHPOffice/PHPWord/blob/0.17.0/LICENSE)
- predis/predis [LICENSE](https://github.com/predis/predis/blob/main/LICENSE)
- spatie/laravel-webhook-server [LICENSE](https://github.com/spatie/laravel-webhook-server/blob/master/LICENSE.md)
- spatie/pdf-to-text [LICENSE](https://github.com/spatie/pdf-to-text/blob/main/LICENSE.md)
- thiagoalessio/tesseract_ocr [LICENSE](https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/MIT-LICENSE)