Crawl settings
These options affect how the crawler visits pages, and changing them can have a dramatic effect on the number of pages scanned. If crawls have stopped working, you can restore factory settings using the Reset button in Scan Options.
Blocking links
You can block specific links by opening the Blocks tab in Scan Options and adding URLs to the Blocked links box. These can be either full URLs or wildcard patterns (a matching sketch follows the examples below):
- http://www.example.com/copyright.htm blocks a single page
- http://www.example.com/legal/* blocks all pages in the subdirectory called “legal”
- http:* blocks all HTTP links
- *.pdf blocks links to all pages with .PDF extension
- *print_friendly.htm blocks links to URLs ending in print_friendly.htm
- *action=edit* blocks links containing action=edit
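The patterns behave like shell-style wildcards matched against the whole URL. As a rough illustration of that idea (not SortSite's internal code), Python's fnmatch gives the same kind of matching; the pattern list below is hypothetical:

    from fnmatch import fnmatch

    # Hypothetical blocked-links list, one pattern per line as in the Blocked links box
    BLOCKED = [
        "http://www.example.com/copyright.htm",  # a single page
        "http://www.example.com/legal/*",        # everything under /legal/
        "*.pdf",                                 # any URL ending in .pdf
        "*action=edit*",                         # any URL containing action=edit
    ]

    def is_blocked(url):
        # True if the URL matches any blocked pattern (shell-style wildcards)
        return any(fnmatch(url, pattern) for pattern in BLOCKED)

    print(is_blocked("http://www.example.com/legal/terms.htm"))  # True
    print(is_blocked("http://www.example.com/docs/manual.pdf"))  # True
    print(is_blocked("http://www.example.com/index.htm"))        # False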
Robots.txt
The robots.txt file is a digital “Keep Out” sign that limits access to specific pages or blocks certain crawlers. We strongly advise keeping this option checked.
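For reference, this is roughly what obeying robots.txt means in practice. A minimal sketch using Python's urllib.robotparser with a hypothetical rule set (not SortSite's implementation):

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt rules: keep every crawler out of /private/
    robots_lines = ["User-agent: *", "Disallow: /private/"]

    parser = RobotFileParser()
    parser.parse(robots_lines)

    # A crawler that obeys robots.txt checks each URL before fetching it
    print(parser.can_fetch("*", "http://www.example.com/private/report.htm"))  # False
    print(parser.can_fetch("*", "http://www.example.com/index.htm"))           # True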
Including other domains in the scan
By default only pages from a single host name are scanned. This can be changed using the Links tab in Scan Options.
- Follow links to related domains: if unchecked, only pages on a single host name are visited. If checked, peer host names and subdomains are also visited. For example, checking this box visits pages on www2.example.com and support.example.com if the start page is on www.example.com (see the sketch after this list).
- Follow links to additional domains: add any additional domains you want visited during a scan (one per line).
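The related-domains setting amounts to a host name check before a link is queued. A rough sketch of the idea, assuming “related” means peer hosts and subdomains sharing the start page's parent domain (an illustration, not SortSite's exact rule):

    from urllib.parse import urlsplit

    START_HOST = "www.example.com"
    FOLLOW_RELATED = True                  # "Follow links to related domains"
    ADDITIONAL = {"docs.example.org"}      # hypothetical "additional domains" entries

    def in_scope(url):
        host = urlsplit(url).hostname or ""
        if host == START_HOST or host in ADDITIONAL:
            return True
        if FOLLOW_RELATED:
            # Peer host names and subdomains share the parent domain (example.com)
            parent = START_HOST.split(".", 1)[1]
            return host == parent or host.endswith("." + parent)
        return False

    print(in_scope("http://support.example.com/faq.htm"))  # True when the box is checked
    print(in_scope("http://www.other-site.com/"))          # False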
Controlling link depth
Link depth can be changed using the Links tab in Scan Options.
- Link depth controls how many clicks the scanner will follow from the start page (a depth-limited crawl is sketched after this list). The default setting is to scan your entire site, but you can restrict this to the top-level pages of your site (e.g. visit up to 3 clicks from the start page).
- External links controls how many links to each external site are checked. You may want to restrict this if you have many links to a single site on every page (e.g. social media sharing links).
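Link depth works like a breadth-first traversal limit: the start page is depth 0, pages one click away are depth 1, and so on. A minimal sketch of depth-limited crawling, using a hypothetical fetch_links helper in place of real page fetching:

    from collections import deque

    def crawl(start_url, max_depth, fetch_links):
        # Visit pages up to max_depth clicks from start_url.
        # fetch_links(url) is a hypothetical helper returning the URLs linked from a page.
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            print("depth", depth, url)
            if depth == max_depth:
                continue  # don't follow links any deeper than this
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))

    # Tiny in-memory "site" standing in for real pages
    site = {"/": ["/a", "/b"], "/a": ["/a/deep"], "/b": [], "/a/deep": ["/a/deeper"]}
    crawl("/", max_depth=2, fetch_links=lambda url: site.get(url, []))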
Timeouts
You can change how quickly the crawler requests pages, and how long it waits for each page before timing out, using the Crawler tab in Scan Options.
- Server Load lets you set a delay between loading pages to avoid placing undue load on a server.
- Page Timeout controls the maximum time spent loading a page; the default setting is 180 seconds. Both settings are sketched below.
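Both settings correspond to familiar HTTP client behaviour: a pause between requests and a per-request timeout. A rough sketch with Python's requests library (the one-second delay is a placeholder; the 180-second timeout matches the default mentioned above):

    import time
    import requests

    PAGE_TIMEOUT = 180   # seconds to wait for a page before giving up
    SERVER_DELAY = 1.0   # hypothetical pause between pages to keep server load down

    def fetch_all(urls):
        for url in urls:
            try:
                response = requests.get(url, timeout=PAGE_TIMEOUT)
                print(url, response.status_code)
            except requests.Timeout:
                print(url, "timed out")
            time.sleep(SERVER_DELAY)  # throttle the crawl between pages

    fetch_all(["http://www.example.com/", "http://www.example.com/about.htm"])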
User agent
The User-Agent HTTP header can be set using the Crawler tab in Scan Options. This only needs to be changed if you’re scanning a site that does User-Agent detection.
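The User-Agent value is just an HTTP request header, so the effect of changing it looks like this. A minimal sketch using the requests library (the header string is an example, not the value SortSite actually sends):

    import requests

    # Hypothetical User-Agent string; match whatever the target site's detection expects
    headers = {"User-Agent": "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"}

    response = requests.get("http://www.example.com/", headers=headers)
    print(response.status_code)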
HTTP authentication
If you need to scan sites using HTTP authentication, you can enable this using the Crawler tab in Scan Options.
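With HTTP authentication enabled, the crawler sends credentials with each request. A minimal sketch of HTTP Basic authentication using the requests library (the username and password are placeholders):

    import requests
    from requests.auth import HTTPBasicAuth

    # Placeholder credentials; the real ones are entered in the Crawler tab
    response = requests.get(
        "http://www.example.com/protected/",
        auth=HTTPBasicAuth("username", "password"),
        timeout=30,
    )
    print(response.status_code)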