Hello everybody,
I installed Asqatasun from the tar file on Ubuntu 18.04 on our company network at the end of May.
On June 5 we performed a Site audit of https://amadeus.com/en (our corporate site), and the result was 900 pages tested out of 1000 browsed. But now, if we run the same audit, only 1 page gets tested.
I wonder what the cause may be. I get the same result on some other sites, but not all of them (e.g. https://www.travelclick.com/ works fine).
I have already activated DEBUG mode in the crawler and services logs, and also set bypassUrlCheck=true.
You may increase the crawler's log verbosity to troubleshoot this. To do so:
Edit /var/lib/tomcat8/webapps/asqatasun/WEB-INF/classes/log4j.properties (provided you're on Ubuntu 18.04; otherwise <TOMCAT_ASQATASUN_WEBAPP>/WEB-INF/classes/log4j.properties)
Set log4j.logger.org.asqatasun.crawler=DEBUG (the default value is INFO).
Restart Tomcat with systemctl restart tomcat8.service. Restarting Tomcat may take some time, possibly minutes; systemctl status tomcat8.service should show some kind of progress.
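The change itself is a single line; here is a minimal sketch of the relevant fragment, assuming the Ubuntu 18.04 / Tomcat 8 layout mentioned above:

```properties
# /var/lib/tomcat8/webapps/asqatasun/WEB-INF/classes/log4j.properties
# Raise the crawler's logger from the default INFO to DEBUG:
log4j.logger.org.asqatasun.crawler=DEBUG
```

Remember that the new level only takes effect after the Tomcat restart.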
As the crawler doesn't understand JavaScript, it can't go further.
Moreover, it appears your site is running the Incapsula CDN, which prevents crawlers from doing their job (that's a choice you, your company, or Incapsula made).
As a workaround, you may ask (if possible) Incapsula to allow the user agent "asqatasun" (the name used by our crawler). You could also audit the site from "inside your company", I mean without the CDN: you should have a way to access it internally, on a pre-prod environment or something like that.
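One quick way to see whether the site answers differently depending on the user agent (a sketch; the curl flags are standard, but whether Incapsula actually filters on the UA is site-specific) is to request the page twice and compare the HTTP status codes:

```shell
#!/bin/sh
# Compare how the site answers a browser-like user agent vs. the crawler's
# "asqatasun" user agent. Differing status codes suggest UA-based filtering.

compare_status() {
  # $1 = status for a browser UA, $2 = status for the "asqatasun" UA
  if [ "$1" = "$2" ]; then
    echo "same status for both user agents ($1)"
  else
    echo "different statuses ($1 vs $2): the user agent is likely filtered"
  fi
}

# Live usage (requires network; URL taken from this thread):
#   browser=$(curl -s -o /dev/null -w '%{http_code}' -A 'Mozilla/5.0' https://amadeus.com/en)
#   crawler=$(curl -s -o /dev/null -w '%{http_code}' -A 'asqatasun'   https://amadeus.com/en)
#   compare_status "$browser" "$crawler"

# Offline demonstration with made-up status codes:
compare_status 200 403
# → different statuses (200 vs 403): the user agent is likely filtered
```

If both requests return the same status, the CDN is probably not filtering on the user agent and the cause lies elsewhere.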
To be sure, you could verify whether the CDN was set up between your last audit of 1000 pages and now.
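One way to check for Incapsula (a sketch, based on the fact that Incapsula/Imperva typically sets cookies whose names start with visid_incap_ or incap_ses_) is to grep the response headers:

```shell
#!/bin/sh
# Sketch: detect Incapsula by looking for its characteristic cookie names
# (visid_incap_* / incap_ses_*) in the HTTP response headers.

has_incapsula() {
  # Reads raw HTTP headers on stdin; exits 0 if Incapsula cookies are present.
  grep -Eqi '^set-cookie: *(visid_incap|incap_ses)'
}

# Live check (requires network; URL from this thread):
#   curl -sI https://amadeus.com/en | has_incapsula && echo "Incapsula detected"

# Offline demonstration with a canned header:
printf 'Set-Cookie: visid_incap_123=abc; path=/\n' | has_incapsula \
  && echo "Incapsula detected"
# → Incapsula detected
```

Running the live check against both the public site and an internal/pre-prod URL would tell you which one actually bypasses the CDN.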
As a side note, you launched the audit against the AccessiWeb 2.2 (aw22) referential, which has been deprecated for 5 years (it is kept for historical purposes and will be removed in the next major version). You should use RGAA instead.
Hello @mfaure,
sorry to reopen the topic, but I was able to get access to a non-prod environment for the site, so there is no web firewall anymore.
However, I still get only one page analyzed.
I tried with HTTPie as you suggested, and I can see the source of the site's home page, but I don't know what could be stopping the crawler from proceeding.
Analyzing the HTML, I can see that the URLs in the meta tags are "fake"; could that be the reason?
What other information does the crawler use?
As indicated by @mfaure, the crawler does not understand JavaScript.
Perhaps the best option is to use page auditing, which is more effective at analyzing HTML code and does understand JavaScript. You can audit up to 10 pages at a time.
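Regarding "what other information does the crawler use": since it does not execute JavaScript, it can only follow plain <a href> links present in the delivered HTML, not links that scripts build at runtime, and not the meta-tag URLs. A quick sketch to see what static links the crawler actually has to work with (assuming you saved the page source, e.g. with HTTPie):

```shell
#!/bin/sh
# Sketch: list the static <a href="..."> links the crawler can actually see.
# If the page's navigation is built by JavaScript, this list will be empty
# (or nearly so), which would explain why only the home page gets audited.

extract_links() {
  # Reads HTML on stdin, prints one href value per line.
  grep -Eo '<a[^>]+href="[^"]*"' | sed -E 's/.*href="([^"]*)".*/\1/'
}

# Live usage (requires network; URL from this thread):
#   http https://amadeus.com/en > home.html
#   extract_links < home.html

# Offline demonstration:
printf '<body><a href="/en/about">About</a><div id="nav"></div></body>\n' \
  | extract_links
# → /en/about
```

If that list is empty or only contains off-site URLs, the crawler has nothing it can follow from the home page.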
Hi @fabrice,
the file was only an extract of the entire home page, which was huge.
Do you think you need the entire source code? As you can see in the head, there are URLs in the meta tags like https://stage.amadeusrail.net/de, but these URLs don't exist.
Thanks
Cesare