Repository for the Search and Displace ingest module, which converts ODF, DOCX and PDF files to Markdown (.md) for use in search and displace operations.

## Search and Displace Ingest

## :cyclone: Server Requirements:

- php7.4 [https://www.php.net] [LICENSE](https://www.php.net/license/index.php)
- apache [https://httpd.apache.org] [LICENSE](https://www.apache.org/licenses/LICENSE-2.0)
- python 3.8 [https://www.python.org/] [LICENSE](https://docs.python.org/3/license.html)
- composer [https://getcomposer.org/] [LICENSE](https://github.com/composer/composer/blob/main/LICENSE)
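
A quick way to confirm that the requirements above are present is a small shell check. This is a minimal sketch and not part of the repository: the function name `missing_tools` is hypothetical, and the binary names passed to it are assumed to match a typical Ubuntu install.

```shell
# Print every command from the argument list that is not found on PATH.
# Returns 0 when all commands are found, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `missing_tools php apache2 python3 composer` prints nothing when all four binaries are installed and lists the missing ones otherwise.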

## :zap: Built with:

- Laravel Framework ^6.2

## :rocket: Installation

**NOTE**

The installation steps below were tested on an Ubuntu 20.04 LTS machine. All commands assume sudo is used unless specified otherwise, and should be adapted for each specific environment.

Allocate at least 10 GB of disk space for this service.

---

### Update package repository

```
apt-get update -y
```

### Install Apache2

```
apt-get -y install \
    apache2 \
    apache2-doc \
    apache2-utils \
    libapache2-mod-fcgid
```

### Install PHP and the required extensions

```
apt-get -y install software-properties-common && \
add-apt-repository ppa:ondrej/php -y && \
apt-get update -y && \
apt-get -y install \
    php7.4 \
    php7.4-common \
    php7.4-fpm \
    php7.4-mbstring \
    php7.4-sqlite3 \
    php7.4-xml \
    php7.4-zip
```

### Configure Apache2 and PHP

```
a2enmod \
    rewrite \
    actions \
    fcgid \
    alias \
    proxy_fcgi \
    remoteip && \
sed -i "s/DocumentRoot \/var\/www\/html/DocumentRoot \/var\/www\/html\/searchanddisplace-ingest\/public/g" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ <FilesMatch \.php\$>" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ SetHandler \"proxy:unix:\/var\/run\/php\/php7.4-fpm.sock|fcgi:\/\/localhost\"" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ </FilesMatch>" /etc/apache2/sites-available/000-default.conf && \
bash -c 'echo "RemoteIPHeader X-Forwarded-For" >> /etc/apache2/apache2.conf' && \
sed -i "s/LogFormat \"%v:%p %h/LogFormat \"%v:%p %a/g" /etc/apache2/apache2.conf && \
sed -i "s/LogFormat \"%h/LogFormat \"%a/g" /etc/apache2/apache2.conf && \
chown -R www-data /var/www/html && \
chmod -R 755 /var/www/html && \
sed -i "s/AllowOverride None/AllowOverride All/g" /etc/apache2/apache2.conf && \
systemctl restart apache2
```
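
The `sed -i` commands above edit the live Apache files in place, so it can help to preview a substitution first. The sketch below runs the same `DocumentRoot` rewrite without `-i`, which prints the result and leaves the file untouched. The function name `preview_docroot_edit` is hypothetical, and `|` is used as the `sed` delimiter only to avoid escaping the slashes.

```shell
# Print the given vhost file with the DocumentRoot rewrite applied.
# No -i flag is used, so the file itself is not modified.
preview_docroot_edit() {
    sed "s|DocumentRoot /var/www/html|DocumentRoot /var/www/html/searchanddisplace-ingest/public|g" "$1"
}
```

One possible use is `preview_docroot_edit /etc/apache2/sites-available/000-default.conf | diff /etc/apache2/sites-available/000-default.conf -`. Note that the pattern also matches a line that has already been rewritten, so applying the in-place command twice appends the `searchanddisplace-ingest/public` suffix a second time.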

### Install Composer

`apt-get -y install composer`

### Ubuntu Packages

```
# LibreOffice
apt-get update -y && \
apt-add-repository -y ppa:libreoffice/ppa && \
apt-get update -y && \
apt-get install -y libreoffice
```

```
# Python
apt-get update -y && \
apt-get install -y software-properties-common && \
add-apt-repository -y ppa:deadsnakes/ppa && \
apt-get install -y \
    build-essential \
    libpoppler-cpp-dev \
    pkg-config \
    supervisor \
    python3 \
    python3-dev
```

```
# Redis
apt-get install -y redis-server
```

```
# PDF Converter
apt-get install -y \
    libpoppler-cpp-dev \
    poppler-utils
```

```
# Tesseract OCR
add-apt-repository -y ppa:alex-p/tesseract-ocr-devel && \
apt-get update -y && \
apt-get install -y tesseract-ocr
```

```
# Unpaper
apt-get install -y unpaper
```

```
# DOCX to PDF Converter
apt-get install -y unoconv
```

```
# Pandoc
apt-get install -y pandoc
```

### Library Packages

```
# Pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
python3 get-pip.py && \
rm -f get-pip.py && \
pip3 install --upgrade pip
```

```
# Pdftotext
pip3 install pdftotext
```

```
# Supervisor
pip3 install supervisor && \
systemctl enable supervisor && \
mkdir -p /var/log/amqp && \
mkdir -p /var/log/queue
```

### Queues Supervisor config

Config file path: **/etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf**

```ini
[program:queue-worker-search-and-displace-ingest-production]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/searchanddisplace-ingest/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/queue-worker-search-and-displace-ingest-production.log
```

The value for the **command** key must reflect the app path (in the example above, the app's path is **/var/www/html/searchanddisplace-ingest**).

The **stdout_logfile** value is the path of the log file. All of its parent directories must already exist.
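
Since the parent directories of the log file must exist before the worker starts, they can be created directly from the config file. The following is a minimal sketch under the assumption that each `stdout_logfile` entry sits on its own line with no spaces around `=`, as in the example above. The function name `ensure_log_dirs` is hypothetical.

```shell
# Create the parent directory of every stdout_logfile listed in a
# Supervisor config file, so each worker can open its log on start.
ensure_log_dirs() {
    conf="$1"
    grep '^stdout_logfile=' "$conf" | cut -d= -f2- | while IFS= read -r logfile; do
        mkdir -p "$(dirname "$logfile")"
    done
}
```

For the config above this would be called as `ensure_log_dirs /etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf`, which creates **/var/log/queue** if it is missing.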

### Install app

- Download app

```
cd /var/www/html && \
git clone https://git.law/newroco/searchanddisplace-ingest.git && \
chown -R www-data:www-data searchanddisplace-ingest && \
cd searchanddisplace-ingest
```

- Install Dewarp

```
# Dewarp
cd /var/www/html/searchanddisplace-ingest/resources/python/dewarp && \
pip3 install opencv-python
```

- Install and configure app

```bash
# Generate environment file
cp .env.example .env
# Install backend packages
composer install
# Generate app key
php artisan key:generate
# In .env, set QUEUE_CONNECTION to redis if it is not already set
# Deploy supervisor
supervisorctl reread
supervisorctl update
supervisorctl start all
```

- Check the queue worker is running

```
supervisorctl status
```
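
The `QUEUE_CONNECTION` change in the configuration step is a manual edit of `.env`, and it can also be scripted. The helper below is a minimal sketch, not part of the app: `set_env_var` is a hypothetical name, and it assumes simple unquoted `KEY=value` lines. It drops any existing line for the key and appends the new one, so the key may move to the end of the file.

```shell
# Set KEY=VALUE in an env file, replacing any existing entry for KEY.
set_env_var() {
    file="$1"; key="$2"; value="$3"
    tmp="$(mktemp)"
    # Keep every line except an existing KEY=... entry.
    # grep -v exits non-zero when nothing is left, hence "|| true".
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back through the original file to keep its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

From the app directory, `set_env_var .env QUEUE_CONNECTION redis` would apply the setting.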

### Search and Displace Core Setup

- Install the `Search and Displace Core` app, found at https://git.law/newroco/searchanddisplace-core
- Add the URL of the `Search and Displace Core` app to the `WEBHOOK_CORE_URL` variable in `.env`
- In `.env`, set `WEBHOOK_CORE_SECRET` to the same value as `WEBHOOK_CLIENT_SECRET` in the `Search and Displace Core` app's `.env` file
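
Because the two secrets live in different `.env` files, a mismatch is easy to miss. The sketch below compares them, assuming unquoted `KEY=value` lines. The function names `env_value` and `secrets_match` are hypothetical, and the location of the core app's `.env` file depends on where that app is installed.

```shell
# Print the value of KEY ($2) from an env file ($1); the last entry wins.
env_value() {
    grep "^$2=" "$1" | tail -n 1 | cut -d= -f2-
}

# Succeed only when WEBHOOK_CORE_SECRET in the ingest env file ($1) is
# non-empty and equal to WEBHOOK_CLIENT_SECRET in the core env file ($2).
secrets_match() {
    ingest_secret="$(env_value "$1" WEBHOOK_CORE_SECRET)"
    core_secret="$(env_value "$2" WEBHOOK_CLIENT_SECRET)"
    [ -n "$ingest_secret" ] && [ "$ingest_secret" = "$core_secret" ]
}
```

A call such as `secrets_match /var/www/html/searchanddisplace-ingest/.env "$CORE_ENV" && echo "secrets match"` would report the result, where `CORE_ENV` is a placeholder variable holding the path to the core app's `.env` file.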

## PHP Packages

- cebe/markdown [LICENSE](https://github.com/cebe/markdown/blob/master/LICENSE)
- fideloper/proxy [LICENSE](https://github.com/fideloper/TrustedProxy/blob/master/LICENSE.md)
- laravel/framework [LICENSE](https://github.com/laravel/framework/blob/7.x/LICENSE.md)
- laravel/tinker [LICENSE](https://github.com/laravel/tinker/blob/2.x/LICENSE.md)
- league/html-to-markdown [LICENSE](https://github.com/thephpleague/html-to-markdown/blob/master/LICENSE)
- phpoffice/phpword [LICENSE](https://github.com/PHPOffice/PHPWord/blob/0.17.0/LICENSE)
- predis/predis [LICENSE](https://github.com/predis/predis/blob/main/LICENSE)
- spatie/laravel-webhook-server [LICENSE](https://github.com/spatie/laravel-webhook-server/blob/master/LICENSE.md)
- spatie/pdf-to-text [LICENSE](https://github.com/spatie/pdf-to-text/blob/main/LICENSE.md)
- thiagoalessio/tesseract_ocr [LICENSE](https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/MIT-LICENSE)