Website

An agent can leverage the content of specific websites by creating a Website data source, which retrieves that content with a web crawler.

What is web crawling?

Web crawling is the automated process of browsing websites and indexing their pages for search purposes. To crawl a website, you need a web crawler: a program that automatically indexes content from websites.

In the context of the Agentic platform, the Website data source allows you to retrieve the content of a given website using the Scrapy web crawler and leverage that content in your agents.
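At its core, a crawler maintains a queue of URLs, fetches each page, stores its content, and queues the links it finds on that page. The sketch below illustrates that loop, along with behavior similar to the URL filter and Maximum number of stored pages settings described later on this page. It runs against a toy in-memory site and is purely illustrative; it is not the platform's or Scrapy's actual implementation.

```python
from collections import deque

# Toy in-memory "website": URL -> (page title, linked URLs).
# Stands in for the real HTTP fetching a crawler such as Scrapy performs.
FAKE_SITE = {
    "https://example.com/": ("Home", ["https://example.com/a", "https://other.org/x"]),
    "https://example.com/a": ("Page A", ["https://example.com/b"]),
    "https://example.com/b": ("Page B", []),
    "https://other.org/x": ("External", []),
}

def crawl(start_url, url_filter=None, max_pages=0):
    """Breadth-first crawl: fetch a page, store its title, queue its links.

    url_filter: optional substring; only URLs containing it are stored and
    followed, loosely mirroring the Website data source's URL filter.
    max_pages: stop after this many stored pages (0 = no limit).
    """
    queue = deque([start_url])
    seen = set()
    stored = {}  # url -> title
    while queue:
        url = queue.popleft()
        if url in seen or url not in FAKE_SITE:
            continue
        seen.add(url)
        if url_filter and url_filter not in url:
            continue  # outside the filter: skip this (external) page
        title, links = FAKE_SITE[url]
        stored[url] = title
        if max_pages and len(stored) >= max_pages:
            break  # reached the maximum number of stored pages
        queue.extend(links)
    return stored
```

With `url_filter="example.com"`, the external other.org page is discovered but never stored, which mirrors the effect of the URL filter setting.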

Adding a Website data source

To add a new Website data source:

  • Go to Build > Data sources > External integrations.

  • Click the Add external integration button.

  • Next, click the Website card and provide the following information:

    • Name your data source.

    • Select the credentials to use. By default, this data source does not need any credentials, so you can select No connection.

    • Add a description for the data source.

    • Add the URL of the website you want to retrieve the content from, e.g. https://www.konverso.ai/.

    • Specify the Language of the website content as a two-character code, e.g. en for English.

    • Add a URL filter: enter the URLs from which you specifically want to retrieve content. If you leave this filter empty, the crawler also follows embedded external links and retrieves the content of their pages. For example, to retrieve content only from the konverso.ai website and not from external sites, add konverso.ai to this filter.

    • (Optional) XPath of the site title: specify the XPath used to extract the titles of the pages containing articles. If not defined, the default CSS title value is used.

    • Set the Maximum number of stored pages. If set to 0, all pages are crawled until there are no more URLs to crawl.

    • Specify the List of URL regexes to include in the crawl. Only URLs that match one of these regexes are stored.

    • Specify the List of URL regexes to exclude from the crawl. URLs that match one of these regexes are not stored; they are checked when an include regex matches a page that would otherwise be stored. Note that each regex must match the whole URL.

    • (Optional) Switch on the Download file URLs option to download the files linked from crawled pages.

    • Specify the Downloadable file extensions. This allows you to crawl attached files, such as PDF files; specify the extensions of the files to download, e.g. .pdf, pdf, etc.

Finally, click the Add external integration button to add this website to your data source repository.
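The include and exclude regex settings behave like whole-URL matches. The following sketch shows that filtering logic, assuming Python-style regexes; the should_store helper and the sample patterns are hypothetical illustrations, not the platform's code.

```python
import re

def should_store(url, include_regexes, exclude_regexes):
    """Decide whether a crawled URL is stored.

    A URL is stored if it fully matches at least one include regex
    (an empty include list stores everything) and matches no exclude
    regex. re.fullmatch enforces that each regex matches the whole URL.
    """
    included = not include_regexes or any(
        re.fullmatch(p, url) for p in include_regexes
    )
    if not included:
        return False
    return not any(re.fullmatch(p, url) for p in exclude_regexes)

include = [r"https://www\.konverso\.ai/.*"]
exclude = [r".*\.(png|jpg)"]

print(should_store("https://www.konverso.ai/blog/post", include, exclude))   # True
print(should_store("https://www.konverso.ai/img/logo.png", include, exclude))  # False: excluded
print(should_store("https://example.com/page", include, exclude))            # False: no full match
```

Because of the whole-URL match, a pattern like `konverso\.ai` alone matches nothing; to cover every page of the site you need a pattern such as `https://www\.konverso\.ai/.*`.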

What’s next?

Now that the data source has been created, you can select it when creating an agent.

Find more information about how to create an agent by reading this page: Build your own agent.