Site audit error

Hello,

My page audit works fine, as do the scenario and file audits, but the site audit does not work and I don't understand the error.

I looked in the docs, but there isn't much detail. Can anyone help me, please? :slight_smile:

I tried two different sites here: Wikipedia, which served as a reference, and the wikiHow site, with no login involved.

18-05-2016 13:07:49:474 8833149 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Launching audit site on http://en.wikipedia.org/
18-05-2016 13:07:51:899 8835574 INFO  org.asqatasun.service.command.SiteAuditCommandImpl  - Launching crawler for page http://en.wikipedia.org/
18-05-2016 13:07:52:990 8836665 INFO  org.asqatasun.crawler.CrawlerImpl  - Rel canonical pages are kept for ref Rgaa30
18-05-2016 13:07:58:050 8841725 INFO  org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawljob is running
18-05-2016 13:08:04:017 8847692 WARN  org.asqatasun.service.AuditServiceImpl  - Audit has no content
18-05-2016 13:08:04:136 8847811 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONTENT_ADAPTING was required
18-05-2016 13:08:04:157 8847832 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whilePROCESSING was required
18-05-2016 13:08:04:206 8847881 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONSOLIDATION was required
18-05-2016 13:08:04:208 8847883 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileANALYSIS was required
18-05-2016 13:08:04:230 8847905 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - failure email sent to [dgfip_asqatasun@email.com] on audit n° 37
18-05-2016 13:08:04:234 8847909 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Audit site terminated on http://en.wikipedia.org/

18-05-2016 13:15:19:739 9283414 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Launching audit site on http://fr.wikihow.com/Accueil
18-05-2016 13:15:20:153 9283828 INFO org.asqatasun.service.command.SiteAuditCommandImpl  - Launching crawler for page http://fr.wikihow.com/Accueil
18-05-2016 13:15:20:247 9283922 INFO  org.asqatasun.crawler.CrawlerImpl  - Rel canonical pages are kept for ref Rgaa22
18-05-2016 13:15:21:688 9285363 INFO  org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawljob is running
18-05-2016 13:15:27:351 9291026 WARN  org.asqatasun.service.AuditServiceImpl  - Audit has no content
18-05-2016 13:15:27:371 9291046 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONTENT_ADAPTING was required
18-05-2016 13:15:27:383 9291058 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whilePROCESSING was required
18-05-2016 13:15:27:393 9291068 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONSOLIDATION was required
18-05-2016 13:15:27:395 9291070 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileANALYSIS was required
18-05-2016 13:15:27:828 9291503 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - failure email sent to [dgfip_asqatasun@email.com] on audit n° 39
18-05-2016 13:15:27:837 9291512 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Audit site terminated on http://fr.wikihow.com/Accueil

Hello,

@koj, any idea what the problem is,
or how to go about finding the cause?

Ideally, @vivileds, we should determine whether the problem comes from your installation or from the network environment. To do that, you can run the Docker image on the same machine where you installed Asqatasun and see whether the site audit works. If you need to modify Asqatasun's configuration file inside the Docker container (for the proxy, for example), I can give you the command lines to use.

@vivileds, while your problem is being sorted out, I suggest reading the following discussions, which cover the differences in analysis and results between a site audit and page / scenario audits:

To find the cause, you would need to switch the Asqatasun logs to debug.

log4j.logger.org.asqatasun.crawler=DEBUG
log4j.logger.org.asqatasun.service=DEBUG

in /var/lib/tomcat7/webapps/asqatasun/WEB-INF/classes/log4j.properties

In the meantime, my bet would be on the robots.txt, which seems to exclude a lot of bots. (http://fr.wikihow.com/robots.txt)

Can you try again with the asqatasun.org site, for example?

Hoping this helps move things forward

Koj
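
The robots.txt hypothesis above can be checked without launching an audit. Below is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the user-agent names (`BadBot`, `SomeCrawler`) and the `example.org` URLs are made up for illustration; they are not the real wikiHow file nor the user agent Asqatasun's crawler actually sends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real file from fr.wikihow.com.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Compare a bot that is explicitly excluded with one that is not.
    for agent in ("BadBot", "SomeCrawler"):
        allowed = is_allowed(SAMPLE_ROBOTS_TXT, agent, "http://example.org/page")
        print(agent, "allowed" if allowed else "blocked")
```

To test a real site, paste the content of its robots.txt into the string and use the user agent your crawler is configured with. If the crawler is blocked for the start URL, an empty crawl would not be surprising.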

thanks @koj for these details

in addition, @vivileds, you can take a look here:

Hello

@koj thanks for your reply. Below is the Asqatasun log from the site audit on asqatasun.org

23-05-2016 09:42:44:818 1533737 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Launching audit site on http://asqatasun.org/
23-05-2016 09:42:45:431 1534350 INFO  org.asqatasun.service.command.SiteAuditCommandImpl  - Launching crawler for page http://asqatasun.org/
23-05-2016 09:42:46:869 1535788 INFO  org.asqatasun.crawler.CrawlerImpl  - Rel canonical pages are kept for ref Rgaa30
23-05-2016 09:42:52:603 1541522 INFO  org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawljob is running
23-05-2016 09:42:58:243 1547162 WARN  org.asqatasun.service.AuditServiceImpl  - Audit has no content
23-05-2016 09:42:58:304 1547223 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONTENT_ADAPTING was required
23-05-2016 09:42:58:331 1547250 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whilePROCESSING was required
23-05-2016 09:42:58:372 1547291 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONSOLIDATION was required
23-05-2016 09:42:58:374 1547293 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileANALYSIS was required
23-05-2016 09:42:58:550 1547469 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - failure email sent to [dgfip_asqatasun@email.com] on audit n° 39
23-05-2016 09:42:58:662 1547581 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Audit site terminated on http://asqatasun.org/

Thanks

I just launched a site audit on asqatasun.org in advanced debug mode, but I can't quite pinpoint the problem.

23-05-2016 09:50:45:935 11903 DEBUG org.asqatasun.service.command.factory.AuditCommandFactoryImpl  - AuditPageWithCrawler false
23-05-2016 09:50:45:936 11904 DEBUG org.asqatasun.service.command.factory.AuditCommandFactoryImpl  - CleanUpRelatedContent true
23-05-2016 09:50:49:118 15086 INFO  org.springframework.web.servlet.DispatcherServlet  - FrameworkServlet 'tgol-web-app': initialization started
23-05-2016 09:51:00:317 26285 INFO  org.springframework.web.servlet.DispatcherServlet  - FrameworkServlet 'tgol-web-app': initialization completed in 11199 ms
23-05-2016 09:57:24:701 410669 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Launching audit site on http://asqatasun.org/
23-05-2016 09:57:24:703 410671 DEBUG org.asqatasun.service.AuditServiceImpl  - auditSite
23-05-2016 09:57:24:736 410704 DEBUG org.asqatasun.service.AuditServiceThreadQueueImpl  - auditCommand polled
23-05-2016 09:57:24:740 410708 DEBUG org.asqatasun.service.AuditServiceThreadQueueImpl  - AuditServiceThread created from auditCommand
23-05-2016 09:57:24:746 410714 DEBUG org.asqatasun.service.AuditServiceThreadQueueImpl  - AuditServiceThread started
23-05-2016 09:57:27:361 413329 INFO  org.asqatasun.service.command.SiteAuditCommandImpl  - Launching crawler for page http://asqatasun.org/
23-05-2016 09:57:27:471 413439 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - Directory: /var/tmp/asqatasun/crawl-1463990247456 created
23-05-2016 09:57:27:472 413440 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawlConfigFilePath: /var/lib/tomcat7/webapps//asqatasun/WEB-INF/conf/crawler/ for copy
23-05-2016 09:57:27:518 413486 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - filepath : /var/lib/tomcat7/webapps//asqatasun/WEB-INF/conf/crawler//asqatasun-crawler-beans-site.xml
23-05-2016 09:57:27:518 413486 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - DEPTH 20
23-05-2016 09:57:27:518 413486 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - SCREEN_WIDTH 1920
23-05-2016 09:57:27:519 413487 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - INFORMATIVE_IMAGE_MARKER 
23-05-2016 09:57:27:519 413487 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - COMPLEX_TABLE_MARKER 
23-05-2016 09:57:27:519 413487 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - LEVEL Rgaa30;LEVEL_2
23-05-2016 09:57:27:519 413487 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - CONSIDER_COOKIES true
23-05-2016 09:57:27:519 413487 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - ALTERNATIVE_CONTRAST_MECHANISM false
23-05-2016 09:57:27:520 413488 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - MAX_DOCUMENTS 1000
23-05-2016 09:57:27:520 413488 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - SCREEN_HEIGHT 1080
23-05-2016 09:57:27:520 413488 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - MAX_DURATION 86400
23-05-2016 09:57:27:520 413488 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - PRESENTATION_TABLE_MARKER 
23-05-2016 09:57:27:520 413488 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - DATA_TABLE_MARKER 
23-05-2016 09:57:27:520 413488 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - EXCLUSION_REGEXP 
23-05-2016 09:57:27:521 413489 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - INCLUSION_REGEXP 
23-05-2016 09:57:27:521 413489 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - DECORATIVE_IMAGE_MARKER 
23-05-2016 09:57:27:658 413626 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - Modifier found for value http://asqatasun.org/
23-05-2016 09:57:27:807 413775 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - 20 DEPTH
23-05-2016 09:57:27:808 413776 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - Modifier found for value 20
23-05-2016 09:57:27:830 413798 DEBUG org.asqatasun.crawler.util.HeritrixAttributeValueModifier  - Update maxHops attribute of bean tooManyHopsDecideRule with value 20
23-05-2016 09:57:27:831 413799 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - 1920 SCREEN_WIDTH
23-05-2016 09:57:27:831 413799 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  -  INFORMATIVE_IMAGE_MARKER
23-05-2016 09:57:27:832 413800 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  -  COMPLEX_TABLE_MARKER
23-05-2016 09:57:27:832 413800 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - Rgaa30;LEVEL_2 LEVEL
23-05-2016 09:57:27:832 413800 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - true CONSIDER_COOKIES
23-05-2016 09:57:27:833 413801 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - Modifier found for value true
23-05-2016 09:57:27:851 413819 DEBUG org.asqatasun.crawler.util.HeritrixInverseBooleanAttributeValueModifier  - Update ignoreCookies attribute of bean fetchHttp with value false
23-05-2016 09:57:27:852 413820 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - false ALTERNATIVE_CONTRAST_MECHANISM
23-05-2016 09:57:27:852 413820 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - 1000 MAX_DOCUMENTS
23-05-2016 09:57:27:852 413820 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - Modifier found for value 1000
23-05-2016 09:57:27:866 413834 DEBUG org.asqatasun.crawler.util.HeritrixAttributeValueModifier  - Update maxDocumentsDownload attribute of bean crawlLimiter with value 1000
23-05-2016 09:57:27:867 413835 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - 1080 SCREEN_HEIGHT
23-05-2016 09:57:27:867 413835 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - 86400 MAX_DURATION
23-05-2016 09:57:27:867 413835 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - Modifier found for value 86400
23-05-2016 09:57:27:883 413851 DEBUG org.asqatasun.crawler.util.HeritrixAttributeValueModifier  - Update maxTimeSeconds attribute of bean crawlLimiter with value 86400
23-05-2016 09:57:27:883 413851 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  -  PRESENTATION_TABLE_MARKER
23-05-2016 09:57:27:883 413851 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  -  DATA_TABLE_MARKER
23-05-2016 09:57:27:884 413852 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  -  EXCLUSION_REGEXP
23-05-2016 09:57:27:884 413852 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - Modifier found for value 
23-05-2016 09:57:27:897 413865 DEBUG org.asqatasun.crawler.util.HeritrixParameterValueModifier  - [list: null] value 
23-05-2016 09:57:27:906 413874 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  -  INCLUSION_REGEXP
23-05-2016 09:57:27:907 413875 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  - Modifier found for value 
23-05-2016 09:57:27:919 413887 DEBUG org.asqatasun.crawler.util.HeritrixParameterValueModifier  - [list: null] value 
23-05-2016 09:57:27:925 413893 DEBUG org.asqatasun.crawler.util.CrawlConfigurationUtils  -  DECORATIVE_IMAGE_MARKER
23-05-2016 09:57:27:946 413914 DEBUG org.asqatasun.crawler.util.HeritrixAttributeValueModifierAndEraser  - Update httpProxyHost attribute of bean fetchHttp with value proxy.infra.dgfip
23-05-2016 09:57:27:953 413921 DEBUG org.asqatasun.crawler.util.HeritrixAttributeValueModifierAndEraser  - Update httpProxyPort attribute of bean fetchHttp with value 3128
23-05-2016 09:57:27:971 413939 DEBUG org.asqatasun.crawler.util.HeritrixAttributeValueModifierAndEraser  - Delete httpProxyUser attribute of bean fetchHttp because of null or empty value 
23-05-2016 09:57:27:996 413964 DEBUG org.asqatasun.crawler.util.HeritrixAttributeValueModifierAndEraser  - Delete httpProxyPassword attribute of bean fetchHttp because of null or empty value 
23-05-2016 09:57:28:296 414264 INFO  org.asqatasun.crawler.CrawlerImpl  - Rel canonical pages are kept for ref Rgaa30
23-05-2016 09:57:28:297 414265 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawljob is launchable
23-05-2016 09:57:29:685 415653 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - Job validated
23-05-2016 09:57:29:686 415654 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - Starting context
23-05-2016 09:57:31:354 417322 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - Context started
23-05-2016 09:57:31:355 417323 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - Request crawl start
23-05-2016 09:57:31:362 417330 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - CrawlJob changes state to PREPARING
23-05-2016 09:57:31:411 417379 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - PREPARING
23-05-2016 09:57:31:411 417379 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawl start requested
23-05-2016 09:57:32:364 418332 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - CrawlJob changes state to RUNNING
23-05-2016 09:57:32:364 418332 INFO  org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawljob is running
23-05-2016 09:57:32:368 418336 DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? http://asqatasun.org/ with mime type unknown false
23-05-2016 09:57:32:502 418470 DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
23-05-2016 09:57:33:506 419474 DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
23-05-2016 09:57:34:511 420479 DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
23-05-2016 09:57:34:516 420484 DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? http://asqatasun.org/ with mime type unknown false
23-05-2016 09:57:35:372 421340 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - CrawlJob changes state to STOPPING
23-05-2016 09:57:35:374 421342 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - CrawlJob changes state to EMPTY
23-05-2016 09:57:37:532 423500 DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - CrawlJob changes state to FINISHED
23-05-2016 09:57:38:080 424048 DEBUG org.asqatasun.crawler.CrawlerImpl  - remove Orphan related contents  0 elements
23-05-2016 09:57:38:091 424059 DEBUG org.asqatasun.crawler.CrawlerImpl  - remove Orphan SSPs  0 elements
23-05-2016 09:57:38:302 424270 DEBUG org.asqatasun.service.CrawlerServiceImpl  - Number Of SSP From WebResource http://asqatasun.org/ : 0
23-05-2016 09:57:38:336 424304 DEBUG org.asqatasun.service.CrawlerServiceImpl  - Number Of Related Content From WebResource?http://asqatasun.org/ : 0
23-05-2016 09:57:38:385 424353 WARN  org.asqatasun.service.AuditServiceImpl  - Audit has no content
23-05-2016 09:57:38:561 424529 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONTENT_ADAPTING was required
23-05-2016 09:57:38:662 424630 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whilePROCESSING was required
23-05-2016 09:57:38:757 424725 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONSOLIDATION was required
23-05-2016 09:57:38:760 424728 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileANALYSIS was required
23-05-2016 09:57:39:240 425208 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - failure email sent to [dgfip_asqatasun@email.com] on audit n° 40
23-05-2016 09:57:39:328 425296 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Audit site terminated on http://asqatasun.org/

From what I can see, there is an error related to the mime type, but however hard I look, I can't see what is wrong.

crawler.processor.AsqatasunWriterProcessor  - should process? http://asqatasun.org/ with mime type unknown false
crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
crawler.processor.AsqatasunWriterProcessor  - should process? http://asqatasun.org/ with mime type unknown false

I checked the /etc/mime.types file, but that got me nowhere.

Does anyone have an idea?
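
When a DEBUG log gets this long, it can help to extract only the crawler's "should process?" decisions and count them. The sketch below does that with a regular expression. The pattern is inferred from the log lines quoted in this thread, not taken from Asqatasun's source, so it may need adjusting for other versions.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted above (an assumption, not an official format).
PATTERN = re.compile(r"should process\? (\S+) with mime type (\S+) (true|false)")

# Message parts of the five lines quoted in the post above.
SAMPLE_LINES = [
    "should process? http://asqatasun.org/ with mime type unknown false",
    "should process? dns:asqatasun.org with mime type text/dns false",
    "should process? dns:asqatasun.org with mime type text/dns false",
    "should process? dns:asqatasun.org with mime type text/dns false",
    "should process? http://asqatasun.org/ with mime type unknown false",
]


def summarize(log_lines):
    """Count (uri, mime type, decision) triples found in the given log lines."""
    counts = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    for (uri, mime, decision), n in summarize(SAMPLE_LINES).items():
        print(f"{n}x {uri} mime={mime} processed={decision}")
```

On this excerpt the summary shows that nothing was accepted for processing: the page URL only ever appears with an unknown mime type, and the `dns:` entry is repeated three times. One possible reading is that the crawler never received a usable response for the page, so /etc/mime.types is probably not where the problem lies.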

@vivileds, since you have configured a proxy in asqatasun.conf,
can you try with one of the sites listed in the
proxyExclusionUrl configuration variable?

A complementary test would be the one already mentioned above:

Hello,

I tested the Docker image and the page audit works fine. I went with version v4.0.1.

Here are the logs produced by the Docker image when I launch the site audit

Tomcat log

May 24, 2016 7:08:08 AM org.archive.crawler.framework.CrawlJob instantiateContainer
INFO: Job instantiated
May 24, 2016 7:08:08 AM org.archive.spring.PathSharingContext initLaunchId
INFO: launch id 20160524070808
May 24, 2016 7:08:08 AM org.archive.crawler.framework.CrawlJob onApplicationEvent
INFO: PREPARING 20160524070808
==> /var/log/asqatasun/asqatasun.log <==
24-05-2016 07:08:08:178 423077 INFO  org.asqatasun.service.command.SiteAuditCommandImpl  - Launching crawler for page http://fr.wikihow.com/Accueil
24-05-2016 07:08:08:273 423172 INFO  org.asqatasun.crawler.CrawlerImpl  - Rel canonical pages are kept for ref Rgaa30
==> /var/log/tomcat7/catalina.out <==
May 24, 2016 7:08:09 AM org.archive.crawler.framework.CrawlController noteFrontierState
INFO: Crawl running.
May 24, 2016 7:08:09 AM org.archive.crawler.framework.CrawlJob onApplicationEvent
INFO: RUNNING 20160524070808
==> /var/log/asqatasun/asqatasun.log <==
24-05-2016 07:08:09:500 424399 INFO  org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawljob is running
==> /var/log/tomcat7/catalina.out <==
May 24, 2016 7:08:12 AM org.archive.crawler.framework.CrawlController noteFrontierState
INFO: Crawl empty.
May 24, 2016 7:08:12 AM org.archive.crawler.framework.CrawlJob onApplicationEvent
INFO: STOPPING 20160524070808
May 24, 2016 7:08:12 AM org.archive.crawler.framework.CrawlJob onApplicationEvent
INFO: EMPTY 20160524070808
May 24, 2016 7:08:14 AM org.archive.crawler.framework.CheckpointService stop
INFO: Cleaned up Checkpoint TimerThread.
May 24, 2016 7:08:14 AM org.archive.crawler.framework.CrawlJob onApplicationEvent
INFO: FINISHED 20160524070808

Asqatasun log:

24-05-2016 07:08:08:178 423077 INFO  org.asqatasun.service.command.SiteAuditCommandImpl  - Launching crawler for page http://fr.wikihow.com/Accueil
24-05-2016 07:08:08:273 423172 INFO  org.asqatasun.crawler.CrawlerImpl  - Rel canonical pages are kept for ref Rgaa30
24-05-2016 07:08:09:500 424399 INFO  org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawljob is running
24-05-2016 07:08:15:315 430214 WARN  org.asqatasun.service.AuditServiceImpl  - Audit has no content
24-05-2016 07:08:15:379 430278 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONTENT_ADAPTING was required
24-05-2016 07:08:15:402 430301 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whilePROCESSING was required
24-05-2016 07:08:15:435 430334 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileCONSOLIDATION was required
24-05-2016 07:08:15:437 430336 WARN  org.asqatasun.service.command.AuditCommandImpl  - Audit status isERROR whileANALYSIS was required
24-05-2016 07:08:15:869 430768 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - failure email sent to [me@my-email.org] on audit n? 6
24-05-2016 07:08:15:978 430877 INFO  org.asqatasun.webapp.orchestrator.AsqatasunOrchestratorImpl  - Audit site terminated on http://fr.wikihow.com/Accueil

Indeed, I am behind a proxy, but the page audit works, so the problem does not come from the proxy. The problem is that Asqatasun won't let me enter an address that doesn't contain .com. The site I want to test is: http://venea.apli.dgfip/

Thanks

@vivileds, adding a URL without a valid TLD (.com, .fr, .paris, …)
for site audits is indeed not possible (except for localhost)
-> tested on v4.0.1 and the develop branch

If you like, you can create an issue on GitHub for this problem.
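
To make the behaviour described above concrete, here is a rough sketch of a check that would reject http://venea.apli.dgfip/ while accepting localhost. It is an illustration only: `KNOWN_TLDS` is a tiny hypothetical sample and `passes_tld_check` is a made-up name; this is not Asqatasun's actual validation code.

```python
from urllib.parse import urlparse

# Tiny illustrative sample, NOT the real list of valid TLDs.
KNOWN_TLDS = {"com", "org", "fr", "paris"}


def passes_tld_check(url: str) -> bool:
    """Accept localhost, or any host whose last label is in KNOWN_TLDS."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    # Split on the last dot: "venea.apli.dgfip" -> ("venea.apli", ".", "dgfip").
    _, dot, tld = host.rpartition(".")
    return bool(dot) and tld in KNOWN_TLDS


if __name__ == "__main__":
    for url in ("http://asqatasun.org/", "http://localhost:8080/", "http://venea.apli.dgfip/"):
        print(url, "accepted" if passes_tld_check(url) else "rejected")
```

A check of this kind treats an intranet suffix such as `.dgfip` as invalid even though the host is reachable on the internal network, which would match the behaviour reported here.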

To sum up, you have an installed Asqatasun (on an Ubuntu machine)
and an Asqatasun running from the Docker image, and both show the same problem:

  • Page audit → OK
  • Site audit → FAIL

Yet the Docker image normally works perfectly well for site audits.
We can deduce that the site audit does something at the network level
(which the page audit does not)
that does not get through on the network you are on.

No lead for solving your problem right now… :frowning:

@koj @mfaure any idea?

Ultra-quick answer: page audits and site audits do not use the same components to load the page (Firefox ESR and Heritrix, respectively). You need to check that the proxy settings are actually passed to Heritrix. Right now I can't remember how to check that; you would have to look at the options in the .conf.

Hello

I checked my conf file and I don't see any anomaly. Could you be a bit more specific? There may be things I need to change.

Thanks

Hello,

While browsing the docs I came across how to configure Heritrix:
http://doc.asqatasun.org/en/30_Contributor_doc/Engine/Heritrix_configuration.html

After reading it, I can't find the configuration file.
Can anyone help me?

Thanks

Hi @vivileds

The Heritrix configuration doc won't help you here (it is developer documentation).

However, to move forward we need to diagnose your problem. I strongly encourage you to make the changes suggested by @koj, namely:

  1. edit the file /var/lib/tomcat7/webapps/asqatasun/WEB-INF/classes/log4j.properties
  2. and add to it the two log4j DEBUG lines he quoted earlier in this thread

Then run a site audit on https://fr.wikipedia.org/ and share the logs here

(don't give up! :slight_smile: )

@mfaure, it looks like @vivileds has already enabled
DEBUG-level logging and posted the result above:

here is the excerpt of the log that seems to show the problem:

INFO  org.asqatasun.crawler.framework.AsqatasunCrawlJob  - crawljob is running
DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? http://asqatasun.org/ with mime type unknown false
DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? dns:asqatasun.org with mime type text/dns false
DEBUG org.asqatasun.crawler.processor.AsqatasunWriterProcessor  - should process? http://asqatasun.org/ with mime type unknown false
DEBUG org.asqatasun.crawler.framework.AsqatasunCrawlJob  - CrawlJob changes state to STOPPING

@mfaure, there is a bypassUrlCheck option in asqatasun.conf
that is not listed in the documentation.

Wouldn't this option be useful
for @vivileds's network configuration?

# bypass initial check of URL before effective launch of audit.   
# CAUTION : bypassing this control may lead to test error pages. 
# Only use for debug purpose when setting network properties.     
# The value MUST be equals to false or true    
bypassUrlCheck=false
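
Since several settings in this thread live in asqatasun.conf, a quick way to review them is to list every proxy-related key together with `bypassUrlCheck`. The sketch below assumes the file uses plain `key=value` lines with `#` comments, as in the excerpt above. `SAMPLE_CONF` is a made-up excerpt (including `someOtherKey`), not a real configuration.

```python
def proxy_related_settings(conf_text: str) -> dict:
    """Return the key/value pairs whose key mentions 'proxy', plus bypassUrlCheck."""
    settings = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if "proxy" in key.lower() or key == "bypassUrlCheck":
            settings[key] = value.strip()
    return settings


# Hypothetical excerpt for illustration; replace with the content of your own file.
SAMPLE_CONF = """\
# The value MUST be equals to false or true
bypassUrlCheck=false
proxyExclusionUrl=
someOtherKey=42
"""

if __name__ == "__main__":
    for key, value in proxy_related_settings(SAMPLE_CONF).items():
        print(f"{key} = {value!r}")
```

To run it against a real installation, read the file with `open(path).read()`, where `path` is wherever your asqatasun.conf is stored, and pass the text to `proxy_related_settings`.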

Thanks @fabrice, and sorry @vivileds, I read too quickly yesterday.

OK, I understand the problem: DNS in a proxy context. What I don't understand is that I thought we had already fixed this. I'm investigating and will keep you posted.

Matthieu
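
A simple way to test the "DNS in a proxy context" diagnosis is to check whether the machine running Asqatasun can resolve the audited host name by itself. Behind a proxy, name resolution for external sites is often left to the proxy, so direct lookups may fail even though browsing works. The sketch below uses the standard `socket.getaddrinfo`. The `resolver` parameter is only there so the function can be exercised without network access.

```python
import socket


def can_resolve(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if hostname resolves locally, False on a DNS failure."""
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Host names taken from the audits attempted in this thread.
    for host in ("asqatasun.org", "fr.wikihow.com", "en.wikipedia.org"):
        status = "resolves" if can_resolve(host) else "does NOT resolve"
        print(f"{host}: {status} from this machine")
```

If these names do not resolve on the audit machine while page audits through the proxy still work, that would be consistent with the repeated `dns:asqatasun.org` entries in the DEBUG log.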
