...
Go to Build > Data sources > External integrations.
Click the Add external integrations button;
Next, click the Web crawler (Scrapy) tile (the Website card) and provide the following information:
Name your data source;
Select the credentials to use. By default, this data source does not need any credentials so you can select No connection.
Add a description for the data source;
Add the URL of the website you want to retrieve the content from, e.g. https://www.konverso.ai/;
Specify the Language of the website content as a 2-character code, e.g. en for English;
Add a URL filter: type the URLs that you specifically want to retrieve the content from. If you do not specify them, the content of all pages reached through embedded external links is retrieved as well. For example, if you only want to retrieve the content from the konverso.ai website and not that of the external links, add konverso.ai to this filter;
(Optional) XPath of the site title: specify the XPath used to extract the titles of the pages containing articles. If not defined, the default CSS title value is used;
Set the Maximum number of stored pages. If set to 0, all pages will be crawled until there are no more URLs to crawl;
Specify the List of URL regexes to include in crawl. They are used to filter the URLs that will be stored;
Specify the List of URL regexes to exclude from crawl. They are used to filter the URLs that will not be stored. Note that these regexes are checked when an include regex has matched a page, that is, for pages that would otherwise be stored. Note also that each regex must match the whole URL;
(Optional) Switch on the Download file URLs button to download file URLs.
Specify the Downloadable file extensions. This allows you to crawl attached files, such as PDF files. In that case, you can specify the file extensions to download, e.g. .pdf, etc.
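The include and exclude regex settings described above can be easier to reason about with a small example. The following Python sketch is only an illustration of the rules as stated in this section (whole-URL matching, exclude patterns checked after an include pattern has matched). It is not the crawler's actual implementation: the function name should_store and the sample patterns are hypothetical, and the sketch assumes at least one include pattern is provided.

```python
import re


def should_store(url, include_patterns, exclude_patterns):
    """Return True if the URL would be stored under the assumed rules."""
    # Assumption: a URL is a candidate only if an include regex matches
    # the whole URL (re.fullmatch), not just a part of it.
    included = any(re.fullmatch(p, url) for p in include_patterns)
    if not included:
        return False
    # Assumption: exclude regexes are consulted only for candidates,
    # and they must also match the whole URL.
    return not any(re.fullmatch(p, url) for p in exclude_patterns)


# Hypothetical patterns, for illustration only.
include = [r"https://www\.konverso\.ai/.*"]
exclude = [r".*/tag/.*"]

urls = [
    "https://www.konverso.ai/",
    "https://www.konverso.ai/blog/tag/news",
    "https://example.com/page",
]
stored = [u for u in urls if should_store(u, include, exclude)]
print(stored)
```

Because each regex must match the whole URL, a pattern such as konverso\.ai on its own would not match https://www.konverso.ai/ in this sketch; it needs leading and trailing wildcards, e.g. .*konverso\.ai.*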
...