...


What is web crawling?

Web crawling is the automated process of browsing websites and indexing their pages for search purposes. To crawl a website, you need a web crawler: a program that automatically indexes content from various websites.

In the context of the Agentic platform, we offer the Website data source, which lets you retrieve the content of a given website using the Scrapy web crawler and make that content available to your agents.
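Conceptually, a crawler like Scrapy fetches a page, extracts its links, and repeats the process until no new URLs remain (or a page limit is reached). The following is an illustrative sketch of that loop using only the Python standard library; it is not the platform's or Scrapy's actual implementation, and the `fetch` callable stands in for a real HTTP request:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=0):
    """Breadth-first crawl starting at start_url.

    `fetch` maps a URL to its HTML (in production this would be an HTTP
    request). max_pages=0 means no limit, mirroring the "Maximum number
    of stored pages" setting described below.
    """
    seen, queue, pages = set(), deque([start_url]), {}
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if max_pages and len(pages) >= max_pages:
            break
        parser = LinkExtractor(url)
        parser.feed(html)
        queue.extend(parser.links)
    return pages
```

Note how the crawl follows every link it finds, including links to other domains; this is why the URL filter described below matters if you only want one site's content.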

Adding a Website data source

...

  • Go to Build > Data sources > External integrations.

  • Click the Add external integration button;

  • Next, click the Website tile card and provide the following information:

    • Name your data source;

    • Select the credentials to use. By default, this data source does not need any credentials so you can select No connection.

    • Add a description for the data source;

    • Add the URL of the website you want to retrieve the content from, e.g. https://www.konverso.ai/;

    • Specify the Language of the website content as a 2-character code, e.g. en for English;

    • Add a URL filter: Type the URLs you specifically want to retrieve content from. If you do not specify any links, content from all the pages reached via embedded external links will be retrieved as well. For example, if you only want to retrieve content from the konverso.ai website and not from the external links it contains, you need to add it to this filter;

    • (Optional) XPath of the site title: specify the XPath used to extract the titles of the pages containing articles. If not defined, the default CSS title value is used;

    • Set the Maximum number of stored pages. If set to 0 (default), all pages will be crawled until there are no more URLs to crawl;

    • Specify the List of URL regexes to include in crawl. These regexes determine which URLs are stored;

    • Specify the List of URL regexes to exclude from crawl. These regexes filter out URLs so they are not stored; they are checked even when an include regex matches the page, and the exclusion takes precedence. Note that each regex must match the whole URL.

    • (Optional) Switch on the Download file URLs option to download file URLs.

    • Specify the Downloadable file extensions. This allows you to crawl attached files, such as PDF files. In that case, specify the file extensions to download, e.g. .pdf or pdf.

    • Select who to share the credentials with. This will determine who can use your credentials in the data sources. If you want to keep these credentials private, select Only me. If you want to share them with other builders, select Builders.

    • Choose whether to limit pages to Only child pages or the Entire website.

Finally, click the Add external integration button to add this website to your data source repository.
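The include/exclude regex settings above can be pictured with a short sketch. This is an illustrative example, not the platform's code (the function name `should_store` is hypothetical); it assumes whole-URL matching, as the exclude setting requires, and that an exclude match wins over an include match:

```python
import re


def should_store(url, include_regexes, exclude_regexes):
    """Decide whether a crawled URL should be stored.

    Hypothetical sketch of the include/exclude filters described above:
    a regex must match the *whole* URL (re.fullmatch), and an exclude
    match takes precedence over an include match.
    """
    if include_regexes and not any(re.fullmatch(rx, url) for rx in include_regexes):
        return False
    return not any(re.fullmatch(rx, url) for rx in exclude_regexes)
```

Because the whole URL must match, a pattern like `https://www\.konverso\.ai` will not match `https://www.konverso.ai/blog/post`; you would write `https://www\.konverso\.ai/.*` instead.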
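The Downloadable file extensions setting accepts extensions with or without a leading dot (e.g. .pdf or pdf). A small sketch of how such a check might normalize both forms (the function name `is_downloadable` is hypothetical, and this is not the platform's implementation):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_downloadable(url, extensions):
    """Check a URL's file extension against a configured list.

    Hypothetical sketch: extensions may be given as '.pdf' or 'pdf',
    mirroring the Downloadable file extensions setting above.
    """
    normalized = {ext.lower().lstrip(".") for ext in extensions}
    # Take the extension from the URL path, ignoring any query string.
    suffix = PurePosixPath(urlparse(url).path).suffix
    return suffix.lstrip(".").lower() in normalized
```

Parsing the URL first means query strings such as `?version=2` do not confuse the extension check.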

What’s next?

Now that the data source has been created, you can select it when creating an agent.

...