Site Audit on a specific site is not running anymore

Hello everybody,
I installed Asqatasun from the tar file on Ubuntu 18.04 on our company network at the end of May.
On June 5 we performed a site audit of https://amadeus.com/en (our corporate site), and the result was 900 pages tested out of 1000 browsed. But now, if we run the same audit, we can only test 1 page.

I wonder what the cause may be. I have the same result on other sites, but not on all of them (e.g. https://www.travelclick.com/ works fine).
I already activated DEBUG mode in the crawler and services logs, and also set bypassUrlCheck=true.

I hope you can help me solve this issue.

Thanks in advance
regards,
Cesare

Hi @csoprana,

Could you provide us with some logs?

Hi @csoprana. I confirm I can reproduce your behaviour. This seems related to the https://amadeus.com/en website.

You may increase the logs for the crawler to troubleshoot it. To do so:

  • Edit /var/lib/tomcat8/webapps/asqatasun/WEB-INF/classes/log4j.properties (provided you’re on Ubuntu 18.04; otherwise <TOMCAT_ASQATASUN_WEBAPP>/WEB-INF/classes/log4j.properties)
  • Set log4j.logger.org.asqatasun.crawler=DEBUG (the default value is INFO).
  • Restart Tomcat with systemctl restart tomcat8.service. Restarting Tomcat may take some time, possibly minutes; systemctl status tomcat8.service should show some kind of progress.
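If it helps, the whole procedure boils down to something like this (a minimal sketch, assuming the stock Ubuntu 18.04 / Tomcat 8 paths and that the logger line is present with its default INFO value; adapt the webapp path otherwise):

# Switch the Asqatasun crawler logger from INFO to DEBUG
# (assumes the line exists as "...=INFO"; otherwise edit the file by hand)
sudo sed -i 's/^log4j.logger.org.asqatasun.crawler=INFO$/log4j.logger.org.asqatasun.crawler=DEBUG/' \
  /var/lib/tomcat8/webapps/asqatasun/WEB-INF/classes/log4j.properties

# Restart Tomcat (this may take minutes) and watch the progress
sudo systemctl restart tomcat8.service
systemctl status tomcat8.service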

For your information, here is our dev doc: debugging Asqatasun.

Please share your logs with us :slight_smile:

Hello @fabrice and @mfaure,
Here is the log extract of an audit at DEBUG level.
asqatasun.log.2020-07-17.txt (19.5 KB)

Can I provide you with any other logs?

Thank you very much

Cesare

OK, I got it. It was obvious, but I didn’t see it at first glance.

Here is the content of the only page the crawler grabs:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>

You can verify it on the command line with HTTPie (apt-get install httpie):

http https://amadeus.com/
http https://amadeus.com/en/

As the crawler doesn’t understand JavaScript, it can’t go further.

Moreover, it appears your site is running the Incapsula CDN, which prevents crawlers from doing their job (that’s a choice you, your company, or Incapsula made).

As a workaround, you may ask Incapsula (if possible) to allow the user-agent “asqatasun” (the name used by our crawler). You may also audit the site from “inside your company”, that is, without the CDN: you should have a way to access it internally, on a pre-prod environment or something like that.
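If you want to check quickly whether the CDN treats our crawler differently, you can compare the responses from the command line (a minimal sketch using HTTPie’s Header:value syntax; “asqatasun” is the user-agent string mentioned above):

# Fetch the page as a regular client
http https://amadeus.com/en/

# Fetch it again, presenting the crawler's user-agent
http https://amadeus.com/en/ 'User-Agent:asqatasun'

If both requests return the small Incapsula page shown above, the block is not user-agent based; if only the second one does, allowing the user-agent should be enough.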

To be sure, you could verify that the CDN was set up between your last audit of 1000 pages and now.

As a side note, you launched the audit against the AccessiWeb 2.2 (aw22) referential, which has been deprecated for 5 years (it is kept for historical purposes and will be removed in the next major version). You should use RGAA instead :slight_smile:

Hope this helps!


Hi,
Thank you very much @mfaure, I will investigate internally and ask for what you suggest in order to complete the audit.
Thanks again for your help

Cesare


Hello @mfaure,
sorry to open the topic again, but I was able to get access to a non-prod environment for the site, so there is no more web firewall.
However, I still only get one page analyzed.
I tried with HTTPie as you suggested, and I can see the source of the site’s home page, but I don’t know what could stop the crawler from proceeding.
Analyzing the HTML, I can see that the URLs in the meta tags are “fake”; could that be the reason?
What other information does the crawler use?

Thanks
Cesare

As indicated by @mfaure, the crawler does not understand JavaScript.

Perhaps the best option is to use page auditing, which is more efficient at analyzing HTML code and does understand JavaScript. You can audit up to 10 pages at the same time.
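To see concretely what the crawler has to work with, you can list the plain <a href> links in the raw HTML, since that is essentially all it can follow without executing JavaScript (a rough sketch; https://stage.example/ is a hypothetical placeholder for your staging URL):

# Dump the raw HTML and extract the anchors the crawler could follow
# (https://stage.example/ is a hypothetical placeholder URL)
http https://stage.example/ | grep -oiE '<a[^>]+href="[^"]+"'

If that list is empty, or only contains URLs that don’t resolve, the crawler has nowhere to go after the first page.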

Hi @fabrice,
in that case the HTML is not the same; I am not blocked by the web firewall.

This is the head of the home page HTML: homepageSTAGE.txt (10.8 KB)

Thanks
Cesare

The homepageSTAGE.txt file ends with the following HTML code, and this is not normal:

</head>
  <body>

Is it possible to test the page audit, instead of the audit via the crawler?

Hi @fabrice,
the file was only an extract of the entire home page, which is huge.
Do you think you need the entire source code? As you can see, in the head there are URLs in the meta tags, like https://stage.amadeusrail.net/de, but these URLs don’t exist.
Thanks
Cesare