Repo for the Search and Displace ingest module, which takes ODF, DOCX and PDF documents and transforms them into .md files for use with search and displace operations.

## Search and Displace Ingest

## :cyclone: Server Requirements:
- php7.4 [https://www.php.net] [LICENSE](https://www.php.net/license/index.php)
- apache [https://httpd.apache.org] [LICENSE](https://www.apache.org/licenses/LICENSE-2.0)
- python 3.8 [https://www.python.org/] [LICENSE](https://docs.python.org/3/license.html)
- composer [https://getcomposer.org/] [LICENSE](https://github.com/composer/composer/blob/main/LICENSE)

## :zap: Built with:
- Laravel Framework ^6.2
## :rocket: Installation
### Ubuntu Packages
```bash
# LibreOffice
# software-properties-common provides add-apt-repository
# (older Ubuntu releases shipped it as python-software-properties)
apt-get install software-properties-common
apt-add-repository ppa:libreoffice/ppa
apt-get update
apt-get install libreoffice
# Python
apt-get update
apt-get install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
# Refresh the package index so the PPA's packages are visible
apt-get update
apt-get install supervisor python3.8 python3.8-dev
# Redis
apt-get install redis-server
# PDF Converter
apt-get install libpoppler-cpp-dev
apt-get install poppler-utils
# Tesseract OCR
add-apt-repository ppa:alex-p/tesseract-ocr-devel
apt-get update
apt-get install tesseract-ocr
# Unpaper
apt-get install unpaper
# DOCX to PDF Converter
apt-get install unoconv
```
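After installing the packages above, it can help to confirm that the expected command-line tools are on the `PATH`. The following is a minimal sketch and not part of this repository: `missing_tools` is a hypothetical helper, and the binary names in the last line are assumptions about what each package installs.

```bash
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Binary names are assumptions about what the packages above provide.
missing_tools soffice python3.8 redis-server pdftotext tesseract unpaper unoconv
```

Empty output means every listed tool was found; any name printed still needs to be installed or added to the `PATH`.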
### Library Packages
```bash
# Pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
rm -rf get-pip.py
pip install --upgrade pip
# Pdftotext
pip install pdftotext
# Supervisor
pip install supervisor
systemctl enable supervisor
# Log directories (-p keeps the commands safe to re-run)
mkdir -p /var/log/amqp
mkdir -p /var/log/queue
# Deskew (DESKEW_INSTALLATION_DIRECTORY, INPUT and OUTPUT are placeholders)
cd DESKEW_INSTALLATION_DIRECTORY
cd Bin
./deskew INPUT OUTPUT
# Dewarp (DEWARP_INSTALLATION_DIRECTORY is a placeholder)
pip3 install opencv-python
cd DEWARP_INSTALLATION_DIRECTORY
pip3 install -r requirements.txt
```
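A quick way to check that the Python libraries installed above are importable is sketched below. It is an illustration only: `missing_python_modules` and the `PYTHON` override are hypothetical names, not something this repository provides.

```bash
# Hypothetical helper: print each Python module that fails to import.
# PYTHON defaults to python3; set it to check a different interpreter.
missing_python_modules() {
    for module in "$@"; do
        "${PYTHON:-python3}" -c "import $module" >/dev/null 2>&1 || printf '%s\n' "$module"
    done
}

# pdftotext is imported as `pdftotext`; opencv-python is imported as `cv2`.
missing_python_modules pdftotext cv2
```

To check the interpreter installed from the deadsnakes PPA, the same call can be run as `PYTHON=python3.8 missing_python_modules pdftotext cv2`.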
### Install app
```bash
# Generate environment file
cp .env.example .env
# Install backend packages
composer install
# Generate app key
php artisan key:generate
# Change the value for the QUEUE_CONNECTION to redis, if it is not set already
# Deploy supervisor
php artisan queue:deploy-supervisor
supervisorctl start all
```
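The `QUEUE_CONNECTION` change mentioned in the comments above can be made by hand in any editor. As a sketch of a scripted alternative, the hypothetical helper `set_env_value` below drops any existing assignment of a key from an env file and appends the new one at the end; it is not part of this repository.

```bash
# Hypothetical helper: set KEY=VALUE in an env file. Any existing assignment
# of KEY is dropped and the new line is appended at the end of the file.
set_env_value() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    # Copy every line except a current assignment of the key.
    grep -v "^${key}=" "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage, from the application root after `cp .env.example .env`:
#   set_env_value .env QUEUE_CONNECTION redis
```

The helper avoids `sed -i`, whose syntax differs between GNU and BSD systems, at the cost of moving the key to the end of the file.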
### Search and Displace Core Setup
- Install the `Search and Displace Core` app, found at https://git.law/newroco/searchanddisplace-core
- Set the `WEBHOOK_CORE_URL` variable in `.env` to the URL of the `Search and Displace Core` app
- Set `WEBHOOK_CORE_SECRET` in `.env` to the same value as `WEBHOOK_CLIENT_SECRET` in the `Search and Displace Core` app's `.env` file
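The webhook settings above could be prepared along the following lines. This is a sketch under assumptions: `generate_secret` is a hypothetical helper, the URL is a placeholder for wherever the Core app is actually deployed, and the README does not prescribe any particular secret format.

```bash
# Hypothetical helper: print 64 hexadecimal characters of random data.
generate_secret() {
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

secret=$(generate_secret)

# Lines for this app's .env; the URL is a placeholder, not a real endpoint.
printf 'WEBHOOK_CORE_URL=%s\n' 'https://core.example.test'
printf 'WEBHOOK_CORE_SECRET=%s\n' "$secret"

# Line for the Core app's .env; both secrets must be identical.
printf 'WEBHOOK_CLIENT_SECRET=%s\n' "$secret"
```

Generating the value once and pasting it into both `.env` files keeps the two secrets in sync.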
## PHP Packages
- cebe/markdown [LICENSE](https://github.com/cebe/markdown/blob/master/LICENSE)
- fideloper/proxy [LICENSE](https://github.com/fideloper/TrustedProxy/blob/master/LICENSE.md)
- laravel/framework [LICENSE](https://github.com/laravel/framework/blob/7.x/LICENSE.md)
- laravel/tinker [LICENSE](https://github.com/laravel/tinker/blob/2.x/LICENSE.md)
- league/html-to-markdown [LICENSE](https://github.com/thephpleague/html-to-markdown/blob/master/LICENSE)
- phpoffice/phpword [LICENSE](https://github.com/PHPOffice/PHPWord/blob/0.17.0/LICENSE)
- predis/predis [LICENSE](https://github.com/predis/predis/blob/main/LICENSE)
- spatie/laravel-webhook-server [LICENSE](https://github.com/spatie/laravel-webhook-server/blob/master/LICENSE.md)
- spatie/pdf-to-text [LICENSE](https://github.com/spatie/pdf-to-text/blob/main/LICENSE.md)
- thiagoalessio/tesseract_ocr [LICENSE](https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/MIT-LICENSE)