Normally, the default WebSite Auditor setup works perfectly well to crawl a website in full and gather all of its pages. In certain cases, though, you may need a bit of fine-tuning to collect all of the site's resources.
If you are struggling to collect all the pages of your website with WebSite Auditor, follow the steps below:
* Even if your actual homepage is, say, domain.com/index.php, make sure you create the project for domain.com. Otherwise, WebSite Auditor will only treat the pages that contain the Project URL in full (domain.com/index.php/...) as the website's own pages.
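To illustrate the point above, here is a minimal Python sketch of prefix-style URL scoping (not WebSite Auditor's actual code; the function name and URLs are made up). With a too-specific Project URL, a legitimate page of the site falls out of scope:

```python
# A sketch of prefix-based URL scoping (assumed behavior, per the note
# above): a page counts as the website's own only if its URL contains
# the Project URL in full.
def is_own_page(url: str, project_url: str) -> bool:
    return url.startswith(project_url)

# Project created for the bare domain: the blog page is in scope.
print(is_own_page("https://domain.com/blog/post", "https://domain.com"))
# -> True

# Project created for domain.com/index.php: the same page is skipped.
print(is_own_page("https://domain.com/blog/post", "https://domain.com/index.php"))
# -> False
```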
* By default, WebSite Auditor follows robots.txt rules (the default SEO PowerSuite bot is a generic one).
If some of the pages or resources are restricted from indexing (with robots.txt, a noindex meta tag, or an X-Robots-Tag rule in the HTTP header), their URLs won't be crawled further and will land under All Resources with the corresponding label in the Robots.txt Instructions column.
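For reference, here is what each of the three restriction signals typically looks like (generic, standard-syntax examples; the paths and values are hypothetical):

```
robots.txt (disallows crawling of a path):
    User-agent: *
    Disallow: /private/

noindex meta tag in the page's <head>:
    <meta name="robots" content="noindex">

X-Robots-Tag rule in the HTTP response header:
    X-Robots-Tag: noindex
```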
To collect all the pages regardless of their indexation rules, disable the Follow Robots.txt Instructions option under Expert Options when creating or rebuilding the project:
* WebSite Auditor crawls websites starting from the homepage (the project domain URL) and following each and every internal link. If some pages are not linked to from elsewhere on the website, they won't be found when the site is crawled directly.
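To see why such pages stay invisible, here is a minimal Python sketch of link-following crawling (illustrative only, not WebSite Auditor's implementation; the link graph is made up): the crawler can only reach pages that some already-discovered page links to.

```python
from collections import deque

# Hypothetical site structure: page -> pages it links to.
links = {
    "/": ["/about", "/blog"],
    "/about": [],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": [],
    "/old-landing": [],  # orphan: no page on the site links to it
}

# Breadth-first crawl starting from the homepage.
found, queue = {"/"}, deque(["/"])
while queue:
    for url in links[queue.popleft()]:
        if url not in found:
            found.add(url)
            queue.append(url)

print(sorted(found))       # '/old-landing' is never discovered
print(set(links) - found)  # {'/old-landing'}, the orphan page
```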
To find such 'orphans' and ensure proper internal linking, enable the Search for Orphan pages option under Expert Options (search across Google's Index, your Sitemap, or both):
Orphans will have the corresponding Orphan and Source labels in the Tags column.
* Some of the pages may not be collected if you've set a Scan Depth limit. The app will only crawl and collect the pages that are within x clicks of the homepage (Project URL).
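A depth-limited variant of the crawling sketch above (again illustrative, with a made-up link graph) shows how the limit cuts the crawl short:

```python
from collections import deque

links = {"/": ["/blog"], "/blog": ["/blog/post-1"], "/blog/post-1": []}
SCAN_DEPTH = 1  # hypothetical limit: only pages 1 click from the homepage

found, queue = {"/"}, deque([("/", 0)])
while queue:
    page, depth = queue.popleft()
    if depth >= SCAN_DEPTH:
        continue  # links on pages at the depth limit are not followed
    for url in links[page]:
        if url not in found:
            found.add(url)
            queue.append((url, depth + 1))

print(sorted(found))  # ['/', '/blog']; '/blog/post-1' is 2 clicks away
```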
You can check/reset the Scan Depth limit under Expert Options:
* If your website is built with Ajax or features any JavaScript-generated content, make sure to enable the Execute JavaScript option so the app can access and crawl the content in full.
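As a generic illustration (hypothetical markup, not tied to any particular site or framework): a crawler that doesn't execute JavaScript sees only the empty container below and never finds the link that the script injects.

```html
<div id="products"></div>
<script>
  /* This link exists only after the script runs, so a crawler
     that doesn't execute JavaScript will never follow it. */
  document.getElementById("products").innerHTML =
    '<a href="/products/red-widget">Red widget</a>';
</script>
```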
* By default, WebSite Auditor will gather pages with certain (most commonly used) URL parameters, but there's also the option to Ignore All. In that case, the app will treat pages that differ only in their URL parameters as the same page and will collect only the 'naked' URL.
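Conceptually, ignoring URL parameters means collapsing every parameterized variant into one 'naked' URL, as in this minimal Python sketch (illustrative only, not the app's actual logic; the URLs are made up):

```python
from urllib.parse import urlsplit, urlunsplit

def naked_url(url: str) -> str:
    # Drop the query string (and fragment) so that every
    # parameterized variant maps to the same 'naked' URL.
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

urls = [
    "https://domain.com/shoes?color=red",
    "https://domain.com/shoes?color=blue&sort=price",
    "https://domain.com/shoes",
]
print({naked_url(u) for u in urls})
# -> {'https://domain.com/shoes'}; all three count as one page
```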
Navigate to URL Parameters under Expert Options, and set the app to Ignore Custom URL Parameters (optionally, upload your site's custom parameters list directly from Google Search Console):