Repository for the Search and Displace ingest module, which takes ODF, DOCX and PDF files and converts them to Markdown (.md) for use in search and displace operations.

Search and Displace Ingest

🌀 Server Requirements:

Built with:

  • Laravel Framework ^6.2

🚀 Installation

NOTE

The installation steps below were tested on an Ubuntu 20.04 LTS machine. All commands assume they are run with sudo unless specified otherwise, and should be adapted to each specific environment.

The disk for this service should be at least 10 GB.
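
As a quick pre-flight check, the free space on the target filesystem can be compared against the 10 GB figure. This is only a minimal sketch: the helper name `has_min_free_gb` is an assumption made for illustration, and it measures free space rather than total disk size.

```shell
# has_min_free_gb PATH MIN_GB
# Succeeds when the filesystem holding PATH has at least MIN_GB gigabytes free.
# Hypothetical helper, not part of the repository.
has_min_free_gb() {
    path="$1"
    min_gb="$2"
    # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is "Available".
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Usage:
#   has_min_free_gb /var 10 || echo "Warning: less than 10 GB free on /var"
```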


Update package repository

apt-get update -y

Install Apache2

apt-get -y install \
	apache2 \
	apache2-doc \
	apache2-utils \
	libapache2-mod-fcgid

Install PHP and the required extensions

apt-get -y install software-properties-common && \
add-apt-repository ppa:ondrej/php -y && \
apt-get update -y && \
apt-get -y install \
	php7.4 \
	php7.4-common \
	php7.4-fpm \
	php7.4-mbstring \
	php7.4-sqlite3 \
	php7.4-xml \
	php7.4-zip
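
After the install it can be useful to confirm that the extensions are actually loaded. The sketch below assumes the PHP CLI is available as `php` and that the module names reported by `php -m` are `mbstring`, `sqlite3`, `xml` and `zip`; the helper name `missing_php_modules` is hypothetical.

```shell
# missing_php_modules MODULE...
# Reads a module list (one name per line, as printed by `php -m`) on stdin
# and prints each requested MODULE that is not in that list.
# Hypothetical helper, not part of the repository.
missing_php_modules() {
    loaded=$(cat)
    for module in "$@"; do
        # -i: ignore case, -x: match the whole line, -F: literal string
        printf '%s\n' "$loaded" | grep -qixF -- "$module" || echo "$module"
    done
}

# Usage:
#   php -m | missing_php_modules mbstring sqlite3 xml zip
```

No output means every listed module was found. Note that the CLI and FPM can load different configurations, so this checks the CLI only.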

Configure Apache2 and PHP

a2enmod \
	rewrite \
	actions \
	fcgid \
	alias \
	proxy_fcgi \
	remoteip && \
sed -i "s/DocumentRoot \/var\/www\/html/DocumentRoot \/var\/www\/html\/searchanddisplace-ingest\/public/g" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\    <FilesMatch \.php\$>" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\      SetHandler \"proxy:unix:\/var\/run\/php\/php7.4-fpm.sock|fcgi:\/\/localhost\"" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\    </FilesMatch>" /etc/apache2/sites-available/000-default.conf && \
bash -c 'echo "RemoteIPHeader X-Forwarded-For" >> /etc/apache2/apache2.conf' && \
sed -i "s/LogFormat \"%v:%p %h/LogFormat \"%v:%p %a/g" /etc/apache2/apache2.conf && \
sed -i "s/LogFormat \"%h/LogFormat \"%a/g" /etc/apache2/apache2.conf && \
chown -R www-data /var/www/html && \
chmod -R 755 /var/www/html && \
sed -i "s/AllowOverride None/AllowOverride All/g" /etc/apache2/apache2.conf && \
systemctl restart apache2
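
Because the sed commands above edit the virtual host in place, it may be worth checking that the expected directives ended up in the file. The following is a minimal sketch under that assumption; `check_vhost` is a hypothetical helper, and it only looks for the two directives the commands are meant to produce.

```shell
# check_vhost CONF_FILE
# Reports which of the expected directives are missing from CONF_FILE.
# Returns 0 when both are present. Hypothetical helper.
check_vhost() {
    conf="$1"
    status=0
    # -F treats the pattern as a literal string, so "|" and "." are not special.
    grep -qF 'DocumentRoot /var/www/html/searchanddisplace-ingest/public' "$conf" \
        || { echo "DocumentRoot not updated"; status=1; }
    grep -qF 'SetHandler "proxy:unix:/var/run/php/php7.4-fpm.sock|fcgi://localhost"' "$conf" \
        || { echo "PHP-FPM handler missing"; status=1; }
    return "$status"
}

# Usage:
#   check_vhost /etc/apache2/sites-available/000-default.conf
```

Running `apache2ctl configtest` afterwards checks the syntax of the whole Apache configuration.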

Install Composer

apt-get -y install composer

Ubuntu Packages

# LibreOffice
apt-get update -y && \
apt-add-repository -y ppa:libreoffice/ppa && \
apt-get update -y && \
apt-get install -y libreoffice
# Python
apt-get update -y && \
apt-get install -y software-properties-common && \
add-apt-repository -y ppa:deadsnakes/ppa && \
apt-get install -y \
    build-essential \
    libpoppler-cpp-dev \
    pkg-config \
    supervisor \
    python3 \
    python3-dev
# Redis
apt-get install -y redis-server
# PDF converter
apt-get install -y \
    libpoppler-cpp-dev \
    poppler-utils
# Tesseract OCR
add-apt-repository -y ppa:alex-p/tesseract-ocr-devel && \
apt-get update -y && \
apt-get install -y tesseract-ocr
# Unpaper
apt-get install -y unpaper
# DOCX to PDF converter
apt-get install -y unoconv
# Pandoc
apt-get install -y pandoc
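
Once the packages are installed, a short check can confirm that the command-line tools are on the PATH. This is a sketch only: `missing_commands` is a hypothetical helper, and the binary names in the usage line are my mapping from the package names above (for example `soffice` for LibreOffice and `pdftotext` for poppler-utils), so adjust them if your system differs.

```shell
# missing_commands NAME...
# Prints every NAME that cannot be found as a command on the PATH.
# Hypothetical helper, not part of the repository.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Usage:
#   missing_commands soffice python3 redis-server pdftotext tesseract \
#       unpaper unoconv pandoc supervisorctl
```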

Library Packages

# Pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
python3 get-pip.py && \
rm -rf get-pip.py && \
pip3 install --upgrade pip
# Pdftotext
pip3 install pdftotext
# Supervisor
pip3 install supervisor && \
systemctl enable supervisor && \
mkdir -p /var/log/amqp && \
mkdir -p /var/log/queue

Queues Supervisor config

Config file path: /etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf

[program:queue-worker-search-and-displace-ingest-production]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/searchanddisplace-ingest/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/queue-worker-search-and-displace-ingest-production.log

The command value must point to the app's artisan file. In the example above the app is installed in /var/www/html/searchanddisplace-ingest.

The stdout_logfile value is the path to the worker's log file. All of its parent directories must already exist (the Supervisor step above creates /var/log/queue).
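
If the app lives somewhere other than the example path, the config can be generated rather than edited by hand. The sketch below assumes only the app path changes; `render_worker_conf` is a hypothetical helper and every other setting is copied from the example above.

```shell
# render_worker_conf APP_PATH
# Prints the worker program section with APP_PATH substituted into `command`.
# Hypothetical helper, not part of the repository.
render_worker_conf() {
    app_path="$1"
    name="queue-worker-search-and-displace-ingest-production"
    # Unquoted heredoc: only ${...} is expanded, the %(...) placeholders
    # are passed through unchanged for Supervisor to interpret.
    cat <<EOF
[program:${name}]
process_name=%(program_name)s_%(process_num)02d
command=php ${app_path}/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/${name}.log
EOF
}

# Usage:
#   render_worker_conf /var/www/html/searchanddisplace-ingest \
#       > /etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf
```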

Install app

  • Download app
cd /var/www/html && \
git clone https://git.law/newroco/searchanddisplace-ingest.git && \
chown -R www-data:www-data searchanddisplace-ingest && \
cd searchanddisplace-ingest
  • Install Dewarp
# Dewarp
cd /var/www/html/searchanddisplace-ingest/resources/python/dewarp && \
pip3 install opencv-python
  • Install and configure app
# Generate environment file (run from the app root)
cd /var/www/html/searchanddisplace-ingest && \
cp .env.example .env

# Install backend packages
composer install

# Generate app key
php artisan key:generate

# In .env, set QUEUE_CONNECTION to redis if it does not already have that value

# Deploy supervisor
supervisorctl reread
supervisorctl update
supervisorctl start all
  • Check the queue worker is running
supervisorctl status
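
With numprocs=3 in the Supervisor config, three worker processes should be listed as RUNNING. The status check can be scripted roughly as follows; `count_running` is a hypothetical helper, and it assumes the process state is the second whitespace-separated column of the `supervisorctl status` output.

```shell
# count_running
# Reads `supervisorctl status` output on stdin and prints how many
# processes report the RUNNING state. Hypothetical helper.
count_running() {
    awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}

# Usage:
#   [ "$(supervisorctl status | count_running)" -eq 3 ] \
#       || echo "Not all queue workers are running"
```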

Search and Displace Core Setup

  • Install the Search and Displace Core app, available at https://git.law/newroco/searchanddisplace-core
  • Add the URL of the Search and Displace Core app to the WEBHOOK_CORE_URL variable in .env
  • In .env, set WEBHOOK_CORE_SECRET to the same value as WEBHOOK_CLIENT_SECRET in the Search and Displace Core app's .env file
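
The .env changes can be made in any editor; for scripted deployments, a small helper along these lines may be convenient. It is a sketch under assumptions: `set_env_value` is hypothetical, values are written unquoted, and the URL and secret in the usage lines are placeholders rather than real settings.

```shell
# set_env_value FILE KEY VALUE
# Sets KEY=VALUE in FILE, replacing any existing assignment of KEY.
# The assignment is moved to the end of the file. Hypothetical helper.
set_env_value() {
    file="$1"
    key="$2"
    value="$3"
    tmp=$(mktemp) || return 1
    # Keep every line except an existing assignment of this key.
    grep -v "^${key}=" "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back through the original file to keep its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage (placeholder values):
#   set_env_value .env WEBHOOK_CORE_URL "https://core.example.com"
#   set_env_value .env WEBHOOK_CORE_SECRET "same-value-as-WEBHOOK_CLIENT_SECRET"
```

The same helper would also cover the QUEUE_CONNECTION change, for example `set_env_value .env QUEUE_CONNECTION redis`.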

PHP Packages