Repository for the Search and Displace ingest module, which takes ODF, DOCX and PDF files and converts them to Markdown (`.md`) for use in search and displace operations.
## Search and Displace Ingest
## :cyclone: Server Requirements:
  - PHP 7.4
  - Apache
  - Python 3.8
  - Composer
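
Before installing, it can help to confirm that the required tools are on the `PATH`. The snippet below is a minimal sketch: the helper name `missing_tools` is hypothetical, and the binary names in the usage line (`php`, `apache2`, `python3.8`, `composer`) are assumptions about a typical Ubuntu setup. It checks only that each command exists, not which version is installed.

```bash
# Print the name of every command in the argument list that is not on PATH.
# "missing_tools" is an illustrative helper, not part of this repository.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools php apache2 python3.8 composer` prints nothing when all four commands are available.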
## :zap: Built with:
- Laravel Framework ^6.2
## :rocket: Installation
### Ubuntu Packages
```bash
# LibreOffice
apt-add-repository ppa:libreoffice/ppa
apt-get update
apt-get install libreoffice
# Python
apt-get update
apt-get install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get install supervisor python3.8 python3.8-dev
# Redis
apt-get install redis-server
# PDF Convertor
apt-get install libpoppler-cpp-dev
apt-get install poppler-utils
# Tesseract OCR
add-apt-repository ppa:alex-p/tesseract-ocr-devel
apt-get update
apt-get install tesseract-ocr
# Unpaper
apt-get install unpaper
# DOCX to PDF Convertor
apt-get install unoconv
```
### Libraries Packages
```bash
# Pip
curl -o
python
rm -rf
pip install --upgrade pip
# Pdftotext
pip install pdftotext
# Supervisor
pip install supervisor
systemctl enable supervisor
mkdir /var/log/amqp
mkdir /var/log/queue
# Deskew
cd Bin
./deskew INPUT OUTPUT
# Dewarp
pip3 install opencv-python
pip3 install -r requirements.txt
```
### Queues Supervisor config
Add a new Supervisor config file under `/etc/supervisor/conf.d`, following the example below.

Config file path: `/etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf`
```ini
[program:queue-worker-search-and-displace-ingest-production]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/searchanddisplace-ingest/artisan queue:listen --queue=sd_ingest,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/queue-worker-search-and-displace-ingest-production.log
```
The `command` value must point to the app's `artisan` file at the path where the app is installed (in the example above the app's path is `/var/www/html/searchanddisplace-ingest`).

The `stdout_logfile` value is the path of the log file. All of its parent directories must already exist.
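
Because the log file's parent directories must exist before Supervisor starts the workers, one option is to create them straight from the config file. This is a minimal sketch: the helper name `ensure_log_dirs` is hypothetical, and it assumes each `stdout_logfile=` entry is written without spaces around the `=`, as in the example above.

```bash
# Create the parent directory of every stdout_logfile listed in a
# Supervisor config file. "ensure_log_dirs" is an illustrative helper.
ensure_log_dirs() {
    conf_file=$1
    # Extract the path after "stdout_logfile=" and create its directory.
    sed -n 's/^stdout_logfile=//p' "$conf_file" | while IFS= read -r log_file; do
        mkdir -p "$(dirname "$log_file")"
    done
}
```

A typical call would be `ensure_log_dirs /etc/supervisor/conf.d/queue-worker-search-and-displace-ingest-production.conf`. After adding or changing a config file, `supervisorctl reread` followed by `supervisorctl update` makes Supervisor pick it up.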
### Install app
```bash
# Generate environment file
cp .env.example .env
# Install backend packages
composer install
# Generate app key
php artisan key:generate
# In .env, set QUEUE_CONNECTION to redis, if it is not set already
# Start the Supervisor programs
supervisorctl start all
```
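
The `QUEUE_CONNECTION` step above is a manual edit of `.env`. If you prefer to script it, the following is one possible sketch. The helper name `set_env_value` is hypothetical, and it assumes the value contains no `|` or `&` characters, which would confuse the `sed` expression.

```bash
# Set KEY=VALUE in an env file: replace the line if KEY is already present,
# append a new line otherwise. "set_env_value" is an illustrative helper.
set_env_value() {
    env_file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$env_file"; then
        # Write to a temporary file first, since "sed -i" is not portable.
        sed "s|^${key}=.*|${key}=${value}|" "$env_file" > "${env_file}.tmp" &&
            mv "${env_file}.tmp" "$env_file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$env_file"
    fi
}
```

Calling `set_env_value .env QUEUE_CONNECTION redis` from the app directory leaves every other line of `.env` unchanged.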
### Search and Displace Core Setup
- Install the `Search and Displace Core` app, found here
- Get the URL of the `Search and Displace Core` app and add it to the `WEBHOOK_CORE_URL` variable in `.env`
- Add in `.env` the `WEBHOOK_CORE_SECRET` value which needs to be the same value as the `WEBHOOK_CLIENT_SECRET` in the `Search and Displace Core` app's `.env` file
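
To confirm that both webhook variables from the steps above are present before starting the workers, a small check such as the following can be used. It is a sketch only: the helper name `missing_env_keys` is hypothetical, and it treats a key as set when anything at all follows the `=`, so a quoted empty string would pass.

```bash
# Print each required key that is missing or empty in the given env file.
# "missing_env_keys" is an illustrative helper, not part of this repository.
missing_env_keys() {
    env_file=$1
    shift
    for key in "$@"; do
        # "..*" requires at least one character after the "=" sign.
        grep -q "^${key}=..*" "$env_file" || printf '%s\n' "$key"
    done
}
```

Running `missing_env_keys .env WEBHOOK_CORE_URL WEBHOOK_CORE_SECRET` prints nothing when both variables have a value. It does not verify that `WEBHOOK_CORE_SECRET` matches `WEBHOOK_CLIENT_SECRET` in the Core app; that comparison still has to be done by hand.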
## PHP Packages
  - cebe/markdown
  - fideloper/proxy
  - laravel/framework
  - laravel/tinker
  - league/html-to-markdown
  - phpoffice/phpword
  - predis/predis
  - spatie/laravel-webhook-server
  - spatie/pdf-to-text
  - thiagoalessio/tesseract_ocr