Repository for the Search and Displace ingest module, which takes ODF, DOCX and PDF files and transforms them into Markdown (.md) for use with search and displace operations.

## Search and Displace Ingest

## :cyclone: Server Requirements:
- PHP 7.4 [https://www.php.net](https://www.php.net) [LICENSE](https://www.php.net/license/index.php)
- Apache [https://httpd.apache.org](https://httpd.apache.org) [LICENSE](https://www.apache.org/licenses/LICENSE-2.0)
- Python 3.8 [https://www.python.org/](https://www.python.org/) [LICENSE](https://docs.python.org/3/license.html)
- Composer [https://getcomposer.org/](https://getcomposer.org/) [LICENSE](https://github.com/composer/composer/blob/main/LICENSE)

## :zap: Built with:
- Laravel Framework ^6.2

## :rocket: Installation

### Ubuntu Packages
```bash
# LibreOffice
# add-apt-repository is provided by software-properties-common
# (python-software-properties on older Ubuntu releases)
apt-get install software-properties-common
apt-add-repository ppa:libreoffice/ppa
apt-get update
apt-get install libreoffice

# Python
apt-get update
apt-get install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get update
apt-get install supervisor python3.8 python3.8-dev

# Redis
apt-get install redis-server

# PDF converter
apt-get install libpoppler-cpp-dev
apt-get install poppler-utils

# Tesseract OCR
add-apt-repository ppa:alex-p/tesseract-ocr-devel
apt-get update
apt-get install tesseract-ocr

# Unpaper
apt-get install unpaper

# DOCX to PDF converter
apt-get install unoconv
```
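After installing the packages above, a quick check that the expected commands are on `PATH` can catch a failed install early. The sketch below is only an illustration: the helper `check_tools` is hypothetical, and the binary names (`soffice`, `pdftotext`, `tesseract` and so on) are assumptions about what each package installs.

```bash
# Print the name of every given command that is not found on PATH, one per line.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed binary names for the packages installed above.
missing="$(check_tools soffice python3.8 redis-server pdftotext tesseract unpaper unoconv supervisorctl)"
if [ -n "$missing" ]; then
  echo "Missing tools:"
  echo "$missing"
else
  echo "All required tools found."
fi
```

Any name printed under "Missing tools:" points at a package that needs to be installed again or at a binary that goes by a different name on your system.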

### Library Packages
```bash
# Pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
# Use the Python 3.8 interpreter (for example python3.8) if `python` points elsewhere
python get-pip.py
rm -f get-pip.py
pip install --upgrade pip

# Pdftotext
pip install pdftotext

# Supervisor
pip install supervisor
systemctl enable supervisor
# Log directories (-p: no error if they already exist)
mkdir -p /var/log/amqp
mkdir -p /var/log/queue

# Deskew (INPUT and OUTPUT are placeholders for image file paths)
cd DESKEW_INSTALLATION_DIRECTORY
cd Bin
./deskew INPUT OUTPUT

# Dewarp
pip3 install opencv-python
cd DEWARP_INSTALLATION_DIRECTORY
pip3 install -r requirements.txt
```

### Queue Supervisor config
Add a new Supervisor config file to `/etc/supervisor/conf.d`, as in the example below.

Config file path: `/etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf`
```ini
[program:queue-worker-search-and-displace-ingest-production]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/searchanddisplace-ingest/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/queue-worker-search-and-displace-ingest-production.log
```
The value of the `command` key must reflect the app path (in the example above, the app is installed at `/var/www/html/searchanddisplace-ingest`).
The `stdout_logfile` value is the path of the log file. All of its parent directories must already exist.
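If the app lives somewhere other than the example path, the same config can be generated from the app path and the log directory instead of being edited by hand. This is a minimal sketch: the function name `render_worker_conf` is hypothetical, and the other values are copied from the example config above.

```bash
# Print a Supervisor program section for the queue worker.
#   $1: absolute path of the app (the directory that contains artisan)
#   $2: directory for the worker log file (must already exist)
render_worker_conf() {
  app_path="$1"
  log_dir="$2"
  name="queue-worker-search-and-displace-ingest-production"
  cat <<EOF
[program:${name}]
process_name=%(program_name)s_%(process_num)02d
command=php ${app_path}/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=${log_dir}/${name}.log
EOF
}

# Print the config for the example paths used in this README.
render_worker_conf /var/www/html/searchanddisplace-ingest /var/log/queue
```

Redirect the output to the config file path given above (as root) once it looks right.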

### Install app
```bash
# Generate environment file
cp .env.example .env
# Install backend packages
composer install
# Generate app key
php artisan key:generate
# In .env, set QUEUE_CONNECTION=redis if it is not set already
# Deploy supervisor: load the new config file, then start the workers
supervisorctl reread
supervisorctl update
supervisorctl start all
```
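The `QUEUE_CONNECTION` change can also be scripted. The helper below is a sketch, not part of the app: `set_env_var` is a hypothetical name, and it assumes the value contains no `|`, `&` or backslash characters, which are special to `sed`. The demo runs on a throwaway file so the real `.env` is untouched.

```bash
# Set KEY=VALUE in an env file: replace an existing assignment or append a new one.
# Assumes VALUE contains no '|', '&' or backslash (they are special to sed).
set_env_var() {
  file="$1"; key="$2"; value="$3"
  if grep -q "^${key}=" "$file"; then
    tmp="$(mktemp)"
    # Write back with cat so the file keeps its owner and permissions.
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Demo on a temporary file.
demo_env="$(mktemp)"
printf 'APP_NAME=Laravel\nQUEUE_CONNECTION=sync\n' > "$demo_env"
set_env_var "$demo_env" QUEUE_CONNECTION redis
cat "$demo_env"
rm -f "$demo_env"
```

For the real app, the call would be `set_env_var .env QUEUE_CONNECTION redis`, run from the app directory.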

### Search and Displace Core Setup
- Install the `Search and Displace Core` app, available at https://git.law/newroco/searchanddisplace-core
- Get the URL of the `Search and Displace Core` app and add it to the `WEBHOOK_CORE_URL` variable in `.env`
- In `.env`, set `WEBHOOK_CORE_SECRET` to the same value as `WEBHOOK_CLIENT_SECRET` in the `Search and Displace Core` app's `.env` file
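Because the two secrets live in different `.env` files, a mismatch is easy to miss. The sketch below compares them. The helper names `env_value` and `secrets_match` are hypothetical, and the comparison is purely textual, so a value quoted in one file but not in the other counts as different. A fresh secret could be generated with, for example, `openssl rand -hex 32`.

```bash
# Print the value assigned to KEY ($2) in an env file ($1); the last assignment wins.
env_value() {
  grep "^$2=" "$1" | tail -n 1 | cut -d= -f2-
}

# Succeed only if both files define their secret and the two values are identical.
#   $1: the ingest app's .env    $2: the core app's .env
secrets_match() {
  ingest_secret="$(env_value "$1" WEBHOOK_CORE_SECRET)"
  core_secret="$(env_value "$2" WEBHOOK_CLIENT_SECRET)"
  [ -n "$ingest_secret" ] && [ "$ingest_secret" = "$core_secret" ]
}

# Demo with throwaway files; pass the two real .env paths instead.
ingest_env="$(mktemp)"; core_env="$(mktemp)"
echo 'WEBHOOK_CORE_SECRET=example-secret' > "$ingest_env"
echo 'WEBHOOK_CLIENT_SECRET=example-secret' > "$core_env"
if secrets_match "$ingest_env" "$core_env"; then
  echo "Webhook secrets match."
else
  echo "Webhook secrets differ or are missing."
fi
rm -f "$ingest_env" "$core_env"
```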

## PHP Packages
- cebe/markdown [LICENSE](https://github.com/cebe/markdown/blob/master/LICENSE)
- fideloper/proxy [LICENSE](https://github.com/fideloper/TrustedProxy/blob/master/LICENSE.md)
- laravel/framework [LICENSE](https://github.com/laravel/framework/blob/7.x/LICENSE.md)
- laravel/tinker [LICENSE](https://github.com/laravel/tinker/blob/2.x/LICENSE.md)
- league/html-to-markdown [LICENSE](https://github.com/thephpleague/html-to-markdown/blob/master/LICENSE)
- phpoffice/phpword [LICENSE](https://github.com/PHPOffice/PHPWord/blob/0.17.0/LICENSE)
- predis/predis [LICENSE](https://github.com/predis/predis/blob/main/LICENSE)
- spatie/laravel-webhook-server [LICENSE](https://github.com/spatie/laravel-webhook-server/blob/master/LICENSE.md)
- spatie/pdf-to-text [LICENSE](https://github.com/spatie/pdf-to-text/blob/main/LICENSE.md)
- thiagoalessio/tesseract_ocr [LICENSE](https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/MIT-LICENSE)