Repo for the search and displace core module including the interface to select files and search and displace operations to run on them. https://searchanddisplace.com
# Search and Displace Core

---
**NOTE**

The installation steps below were tested on an Ubuntu 20.04 LTS machine. All commands assume `sudo` unless specified otherwise and should be adapted to each specific environment.

The disk for this service should be at least 15 GB.

---
## Install
### Update package repository
```
apt-get update -y
```
### Install Apache2
```
apt-get -y install \
    apache2 \
    apache2-doc \
    apache2-utils \
    libapache2-mod-fcgid
```
### Install PHP and the required extensions
```
apt-get -y install software-properties-common && \
add-apt-repository ppa:ondrej/php -y && \
apt-get update -y && \
apt-get -y install \
    php7.4 \
    php7.4-calendar \
    php7.4-common \
    php7.4-fileinfo \
    php7.4-ftp \
    php7.4-fpm \
    php7.4-gettext \
    php7.4-iconv \
    php7.4-json \
    php7.4-mbstring \
    php7.4-opcache \
    php7.4-pdo \
    php7.4-phar \
    php7.4-posix \
    php7.4-readline \
    php7.4-sockets \
    php7.4-sqlite3 \
    php7.4-tokenizer \
    php7.4-xml
```
### Configure Apache2 and PHP
```
a2enmod \
    rewrite \
    actions \
    fcgid \
    alias \
    proxy_fcgi \
    remoteip && \
sed -i "s/DocumentRoot \/var\/www\/html/DocumentRoot \/var\/www\/html\/searchanddisplace-core\/public/g" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ <FilesMatch \.php\$>" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ SetHandler \"proxy:unix:\/var\/run\/php\/php7.4-fpm.sock|fcgi:\/\/localhost\"" /etc/apache2/sites-available/000-default.conf && \
sed -i "/^[[:blank:]]ErrorLog/i\ <\/FilesMatch>" /etc/apache2/sites-available/000-default.conf && \
bash -c 'echo "RemoteIPHeader X-Forwarded-For" >> /etc/apache2/apache2.conf' && \
sed -i "s/LogFormat \"%v:%p %h/LogFormat \"%v:%p %a/g" /etc/apache2/apache2.conf && \
sed -i "s/LogFormat \"%h/LogFormat \"%a/g" /etc/apache2/apache2.conf && \
chown -R www-data /var/www/html && \
chmod -R 755 /var/www/html && \
sed -i "s/AllowOverride None/AllowOverride All/g" /etc/apache2/apache2.conf && \
systemctl restart apache2
```
### Install Composer
```
apt-get -y install composer
```
### Install NodeJS 16 LTS and npm
```
curl -s https://deb.nodesource.com/setup_16.x | sudo bash
```
```
apt-get -y install \
    nodejs \
    yarn
```
### Install and Configure the app
- Download the app
```
cd /var/www/html && \
git clone https://git.law/newroco/searchanddisplace-core.git && \
chown -R www-data:www-data searchanddisplace-core && \
cd searchanddisplace-core
```
- Create the `.env` file by copying the contents of the `.env.example` file.
`cp .env.example .env`
- For the `QUEUE_CONNECTION` variable in `.env` you can use either `sync` or `redis` (recommended). If you choose `redis`, make sure it is installed on your machine.
```
apt-get -y install redis-server
```
- Install the `Search and Displace Ingest` app, found here https://git.law/newroco/searchanddisplace-ingest
- Get the URL of the `Search and Displace Ingest` app and add it to the `SD_INGEST_URL` variable in `.env`
- Add the `WEBHOOK_CLIENT_SECRET` value to `.env`. It must be the same value as `WEBHOOK_CORE_SECRET` in the `.env` file of the `Search and Displace Ingest` app
- Add the `SD_DUCKLING_URL` value to `.env`; the default is `http://0.0.0.0:8000/parse`. Details about installing Facebook Duckling are in a section below.
- Install composer dependencies
`composer install`
- Install npm dependencies
```
rm -rf node_modules && \
npm install
```
- Compile frontend assets
`npm run production`
- Generate the app key by running the following command:
`php artisan key:generate`
- Migrate DB tables
```
touch ./database/database.sqlite
chown www-data:www-data ./database/database.sqlite
php artisan migrate
```
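Before moving on, it can help to confirm that the `.env` variables named in the steps above are actually set. The following is a minimal sketch and not part of the app: `check_env_file` is a hypothetical helper, and it only checks that each key has some value after the `=` sign.

```shell
# Hypothetical helper (not shipped with the app): list the variables from the
# steps above that are missing or empty in the given .env file.
check_env_file() {
    env_file="$1"
    missing=0
    for key in QUEUE_CONNECTION SD_INGEST_URL WEBHOOK_CLIENT_SECRET SD_DUCKLING_URL; do
        # A key counts as set when a line reads KEY= followed by at least one character.
        if ! grep -Eq "^${key}=.+" "$env_file"; then
            echo "missing or empty: ${key}"
            missing=1
        fi
    done
    # Return 0 when every key is set, 1 otherwise.
    return "$missing"
}
```

For example, `check_env_file /var/www/html/searchanddisplace-core/.env` prints nothing and returns success when all four variables are present.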
### Queues Supervisor config
- Install supervisor
`apt-get install supervisor -y`
- Config file path: **/etc/supervisor/conf.d/queue-worker-search-and-displace-core-production.conf**
```ini
[program:queue-worker-search-and-displace-core-production]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/searchanddisplace-core/artisan queue:listen --queue=sd_core,default --tries=2 --timeout=180
autostart=true
autorestart=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/queue/queue-worker-search-and-displace-core-production.log
```
The value for the **command** key should reflect the app path (in the example above the app's path is **/var/www/html/searchanddisplace-core**).
The **stdout_logfile** value is the path of the log file. All of its parent directories must already exist, so create them first:
`mkdir -p /var/log/queue`
- Start Supervisor (after adding the Supervisor configs detailed below)
`supervisorctl start all`
- (Optional) Restart Supervisor after a config file update
```
supervisorctl reread
supervisorctl update
supervisorctl restart <name>
```
### Facebook Duckling
```
# Note: `stack exec duckling-example-exe` starts the Duckling HTTP server in the
# foreground, so the final `stack test` step only runs once the server is stopped.
apt-get -y install \
    libpcre3 \
    libpcre3-dev \
    pkg-config && \
cd /var/www/html && \
git clone https://github.com/facebook/duckling.git fb-duckling && \
cd fb-duckling && \
curl -sSL https://get.haskellstack.org/ | sh && \
stack build && \
stack exec duckling-example-exe && \
stack test
```
### Facebook Duckling Supervisor config
Config file path: **/etc/supervisor/conf.d/duckling-worker-search-and-displace-core-production.conf**
```ini
[program:duckling-worker-search-and-displace-core-production]
process_name=%(program_name)s_%(process_num)02d
directory=/var/www/html/fb-duckling
command=sudo -S stack exec duckling-example-exe
autostart=true
autorestart=true
user=root
numprocs=1
redirect_stderr=true
stdout_logfile=/var/log/queue/duckling-worker-search-and-displace-core-production.log
```
The value for the **directory** key should reflect the Facebook Duckling app path (in the example above the path is **/var/www/html/fb-duckling**).
The **stdout_logfile** value is the path of the log file. All of its parent directories must already exist.
### Start the queue worker and Facebook Duckling with Supervisor
```
supervisorctl reread
supervisorctl update
supervisorctl start all
```
- Check they are running
```
supervisorctl status
```
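Beyond `supervisorctl status`, one way to confirm that Duckling is answering is to post a short text to its parse endpoint. This is a sketch under assumptions: `duckling_parse` is a hypothetical wrapper, the endpoint defaults to the `SD_DUCKLING_URL` default given earlier, and the `en_GB` locale is only an illustrative choice.

```shell
# Hypothetical wrapper: send a text to Duckling's /parse endpoint with curl.
#   $1  text to parse
#   $2  endpoint URL (optional; defaults to the SD_DUCKLING_URL default above)
duckling_parse() {
    curl -s -X POST "${2:-http://0.0.0.0:8000/parse}" \
        --data-urlencode "locale=en_GB" \
        --data-urlencode "text=$1"
}
```

Running `duckling_parse "tomorrow at eight"` against a live server should return a JSON description of the entities Duckling recognised; an empty reply or a connection error suggests the worker is not up.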
### Converting documents
```
# LibreOffice
apt-get install -y software-properties-common && \
apt-add-repository -y ppa:libreoffice/ppa && \
apt-get update && \
apt-get install -y libreoffice libreoffice-writer2xhtml
```
# Searchers
There are 2 types of searchers: basic and compounded.
## Basic searcher
There are 2 types of basic searchers: native and custom.
### Native basic searcher
Searchers of this type are built into the app and cannot be edited or deleted.
- Amount of Money
- Credit Card Number
- Distance
- Duration
- Email
- Numeral
- Ordinal
- Phone Numbers
- Quantity
- Temperature
- Time
- Url
- Volume
### Custom basic searcher
You can add a custom basic searcher by clicking the 'Add regex' button in the navbar.
A custom basic searcher is a regular expression.
Example: `[\d]{4}-[\d]{3}-[\d]{3}` finds every text string in the document that consists of
4 digits, a dash, 3 digits, a dash, and 3 more digits; `1234-123-123` is a match.
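To try such a pattern outside the app, it can be run through `grep`. This is only a sketch: `find_codes` is a hypothetical name, and because POSIX extended regular expressions have no `\d` shorthand, the digit class is written as `[0-9]` here.

```shell
# Print every substring made of 4 digits, a dash, 3 digits, a dash, 3 digits.
# POSIX ERE has no \d shorthand, so [0-9] stands in for it.
find_codes() {
    grep -oE '[0-9]{4}-[0-9]{3}-[0-9]{3}'
}

printf 'Ref 1234-123-123 is valid, 12-123-123 is not.\n' | find_codes
# prints: 1234-123-123
```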
## Compounded searcher
A compounded searcher contains one or more searchers, each of which can be either basic or compounded.
The searchers are arranged in rows and columns: each column in a row extends the search criteria,
and each row filters the results of the previous row.

As an example, take the following searcher. The first row has two columns: the first holds the 'Email'
native basic searcher and the second holds a custom basic searcher that matches text strings with a
leading '#' character. The second row has a single column, holding a custom basic searcher that matches
text strings containing the '@' character.

When the Search&Displace is executed, the first row is applied to the initial document content and finds
all email addresses and all text strings with a leading '#' character. The searchers in the first row are
applied independently, each column extending the search criteria.
The second row is then applied to the results of the first row, that is, to the email addresses and the
text strings with a leading '#' character. In other words, each row filters the results of the previous row.
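The row and column behaviour described above can be imitated with a small pipeline. This is a sketch of one possible reading, not the app's implementation: the email pattern is a simplified stand-in for the native 'Email' searcher, and all function and variable names are hypothetical. Columns in a row are combined as alternatives, and the next row is a filter over the previous row's matches.

```shell
# Simplified stand-in for the native 'Email' searcher (assumption, not the app's pattern).
EMAIL_RE='[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'
# Custom searcher: text strings with a leading '#'.
HASH_RE='#[^[:space:]]+'

# Row 1: two columns, each extending the criteria (a match of either pattern counts).
row_one() { grep -oE "${EMAIL_RE}|${HASH_RE}"; }
# Row 2: one column, keeping only the row-1 results that contain '@'.
row_two() { grep -F '@'; }

run_searcher() { row_one | row_two; }

printf 'Contact jane@example.com about #invoice42 or #team@hq today.\n' | run_searcher
# prints:
# jane@example.com
# #team@hq
```

Here row one finds `jane@example.com`, `#invoice42` and `#team@hq`; row two drops `#invoice42` because it contains no '@'.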
# Demo Version
A demo is available at https://demo.searchanddisplace.com/
No authentication is required.
# Demo Steps
- Select and upload a document file (supported files: .docx, .pdf, .odt, .txt)
- After the file is uploaded and processed you will see its contents on the page
- Select searchers by clicking the 'List' button on the right. For each searcher you can enter a replace value: for example, if you select the 'Email' searcher and enter 'EMAIL' as the replace value, every email address found in the document is replaced with the text EMAIL
- When you are done selecting searchers you can hide the panel by clicking the 'List' button again
- Execute the Search&Displace by clicking the 'Run filters' button
- After the processing is done you will see the resulting document in the right panel, side by side with the initial document
- You can highlight the found and replaced items by toggling the 'Highlight differences' button
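The effect of the 'Email' searcher with the replace value 'EMAIL' can be approximated on plain text with `sed`. This is a rough sketch, not what the app runs: the pattern is a simplified stand-in for the native searcher, `displace_emails` is a hypothetical name, and the replace value is assumed to contain no `/`, `&` or backslash.

```shell
# Replace every email-like string on stdin with the given value (default: EMAIL).
# The pattern is a simplified assumption, not the app's native 'Email' searcher.
displace_emails() {
    sed -E "s/[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}/${1:-EMAIL}/g"
}

printf 'Write to jane@example.com or bob@example.org.\n' | displace_emails
# prints: Write to EMAIL or EMAIL.
```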