Repository for the Search and Displace ingest module, which takes ODF, DOCX and PDF files and converts them to Markdown (.md) for use in search and displace operations.

## Search and Displace Ingest

## :cyclone: Server Requirements:

- PHP 7.4 [https://www.php.net] [LICENSE](https://www.php.net/license/index.php)
- Apache [https://httpd.apache.org] [LICENSE](https://www.apache.org/licenses/LICENSE-2.0)
- Python 3.8 [https://www.python.org/] [LICENSE](https://docs.python.org/3/license.html)
- Composer [https://getcomposer.org/] [LICENSE](https://github.com/composer/composer/blob/main/LICENSE)

## :zap: Built with:

- Laravel Framework ^6.2

## :rocket: Installation

### Ubuntu Packages

```bash
# LibreOffice
apt-add-repository ppa:libreoffice/ppa
apt-get update
apt-get install libreoffice
# Python
apt-get update
apt-get install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get install supervisor python3.8 python3.8-dev
# Redis
apt-get install redis-server
# PDF Convertor
apt-get install libpoppler-cpp-dev
apt-get install poppler-utils
# Tesseract OCR
add-apt-repository ppa:alex-p/tesseract-ocr-devel
apt-get update
apt-get install tesseract-ocr
# Unpaper
apt-get install unpaper
# DOCX to PDF Convertor
apt-get install unoconv
```
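
After the packages are installed, it can help to confirm that the expected commands are actually on the `PATH` before setting up the app. The following is a minimal sketch, not part of this repository: `missing_tools` is a hypothetical helper, and the command names passed to it are assumptions about what each package installs.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): print the name of every
# command in the argument list that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v succeeds when the name resolves to something runnable.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, `missing_tools soffice python3.8 redis-server pdftotext tesseract unpaper unoconv supervisorctl` prints nothing when every tool is available, assuming those are the binary names provided by the packages above.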

### Library Packages

```bash
# Pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
rm -rf get-pip.py
pip install --upgrade pip
# Pdftotext
pip install pdftotext
# Supervisor
pip install supervisor
systemctl enable supervisor
# Log directories (-p avoids an error if they already exist)
mkdir -p /var/log/amqp
mkdir -p /var/log/queue
# Deskew (replace the upper-case names with your own paths)
cd DESKEW_INSTALLATION_DIRECTORY
cd Bin
./deskew INPUT OUTPUT
# Dewarp
pip3 install opencv-python
cd DEWARP_INSTALLATION_DIRECTORY
pip3 install -r requirements.txt
```

### Queue Supervisor config

Add a new Supervisor config file under `/etc/supervisor/conf.d`, as in the example below.

Config file path: `/etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf`

```ini
[program:queue-worker-search-and-displace-ingest-production]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/searchanddisplace-ingest/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/queue-worker-search-and-displace-ingest-production.log
```

The `command` value must reflect the app path (in the example above, the app's path is `/var/www/html/searchanddisplace-ingest`).

The `stdout_logfile` value is the path of the log file. All of its parent directories must already exist.
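
Because the app path and the program name are the only parts of the config that change between deployments, the file can be generated rather than edited by hand. The sketch below assumes a hypothetical helper named `render_worker_conf` (it is not part of this repository) that prints the same config as above for a given app path.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): print a Supervisor program
# section for the queue worker.
#   $1 = app path, $2 = optional program name (defaults to the name used above)
render_worker_conf() {
  local app_path="$1"
  local name="${2:-queue-worker-search-and-displace-ingest-production}"
  # Unquoted heredoc: only ${...} is expanded; Supervisor's %(...)s
  # placeholders pass through unchanged.
  cat <<EOF
[program:${name}]
process_name=%(program_name)s_%(process_num)02d
command=php ${app_path}/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/${name}.log
EOF
}
```

A possible use is `render_worker_conf /var/www/html/searchanddisplace-ingest > /etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf`, followed by `supervisorctl reread` and `supervisorctl update` so that Supervisor picks up the new file.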

### Install app

```bash
# Generate environment file
cp .env.example .env
# Install backend packages
composer install
# Generate app key
php artisan key:generate
# Set QUEUE_CONNECTION=redis in .env if it is not already set
# Load the new Supervisor config, then start the workers
supervisorctl reread
supervisorctl update
supervisorctl start all
```
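
The `QUEUE_CONNECTION` step above is a manual edit of `.env`. If it needs to be scripted, one option is a small helper that replaces an existing `KEY=value` line or appends one when the key is missing. This is a minimal sketch: `set_env_value` is a hypothetical name, not part of this repository, and it assumes plain `KEY=value` lines whose key and value contain no `|`, `&` or backslash characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): set KEY=VALUE in an env file.
#   $1 = env file, $2 = key, $3 = value
set_env_value() {
  local file="$1" key="$2" value="$3" tmp
  if grep -q "^${key}=" "$file"; then
    # Key exists: rewrite that line via a temp file, then copy the result
    # back so the original file keeps its owner and permissions.
    tmp="$(mktemp)"
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    # Key missing: append it as a new line.
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

For example, `set_env_value .env QUEUE_CONNECTION redis` would cover the step described in the comment above.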

### Search and Displace Core Setup

- Install the `Search and Displace Core` app, found here https://git.law/newroco/searchanddisplace-core
- Get the URL of the `Search and Displace Core` app and add it to the `WEBHOOK_CORE_URL` variable in `.env`
- Add the `WEBHOOK_CORE_SECRET` value to `.env`; it must be the same value as `WEBHOOK_CLIENT_SECRET` in the `Search and Displace Core` app's `.env` file
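
Since the two secrets live in different `.env` files, a mismatch is easy to miss. The check can be sketched as below. Both function names are hypothetical (not part of either app), and the comparison assumes plain, unquoted `KEY=value` lines in both files.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of either app).

# Print the value of KEY from an env FILE (first match, text after "=").
env_value() {
  local file="$1" key="$2"
  # "|| true" keeps a missing key from counting as a failure; it just
  # yields an empty value.
  { grep -m1 "^${key}=" "$file" || true; } | cut -d= -f2-
}

# Succeed only when both secrets are set and identical.
#   $1 = ingest app .env, $2 = core app .env
secrets_match() {
  local ingest_secret core_secret
  ingest_secret="$(env_value "$1" WEBHOOK_CORE_SECRET)"
  core_secret="$(env_value "$2" WEBHOOK_CLIENT_SECRET)"
  [ -n "$ingest_secret" ] && [ "$ingest_secret" = "$core_secret" ]
}
```

For example, `secrets_match .env /path/to/core/.env && echo "secrets match"`, where `/path/to/core` is a placeholder for wherever the core app is installed.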

## PHP Packages

- cebe/markdown [LICENSE](https://github.com/cebe/markdown/blob/master/LICENSE)
- fideloper/proxy [LICENSE](https://github.com/fideloper/TrustedProxy/blob/master/LICENSE.md)
- laravel/framework [LICENSE](https://github.com/laravel/framework/blob/7.x/LICENSE.md)
- laravel/tinker [LICENSE](https://github.com/laravel/tinker/blob/2.x/LICENSE.md)
- league/html-to-markdown [LICENSE](https://github.com/thephpleague/html-to-markdown/blob/master/LICENSE)
- phpoffice/phpword [LICENSE](https://github.com/PHPOffice/PHPWord/blob/0.17.0/LICENSE)
- predis/predis [LICENSE](https://github.com/predis/predis/blob/main/LICENSE)
- spatie/laravel-webhook-server [LICENSE](https://github.com/spatie/laravel-webhook-server/blob/master/LICENSE.md)
- spatie/pdf-to-text [LICENSE](https://github.com/spatie/pdf-to-text/blob/main/LICENSE.md)
- thiagoalessio/tesseract_ocr [LICENSE](https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/MIT-LICENSE)