青鬼院蜻蛉的血统:对isURLVisited和politeness的分析

来源：百度文库编辑：九乡新闻网时间：2024/07/14 02:08:04

对isURLVisited和politeness的分析

分析isURLVisited

这主要是在Frontier里实现的，当一个链接要进入等待队列时需要先判断是否已经被抓取过，如果已经抓取过则不进入，否则进入。

分析isURLVisited就要从分析存储已抓取url的结构说起。Heritrix在内部使用了BerkeleyDB(Database)。BerkeleyDB就是一个HashTable，它能够按“key/value”方式来保存数据。它是一套开放源代码的嵌入式数据库，为应用程序提供可伸缩的、高性能的、有事务保护功能的数据管理服务。BerkeleyDB就是一个Hash Table，它能够按“key/value”方式来保存数据。使用BerkeleyDB时，数据库和应用程序在相同的地址空间中运行，所以数据库操作不需要进程间的通讯。另外，BerkeleyDB中的所有操作都使用一组API接口。因此，不需要对某种查询语言（比如SQL）进行解析，也不用生成执行计划，这就大大提高了运行效率。解决了多线程访问、超大容量的问题。

       Heritrix中涉及存储url的主要的类有UriUniqFilter、FPMergeUriUniqFilte、SetBasedUriUniqFilter、BdbUriUniqFilter、BloomUriUniqFilter、emFPMergeUriUniqFilter和DiskFPMergeUriUniqFilter。
   用户可以在创建一个爬取任务时选择BdbUriUniqFilter, BloomUriUniqFilter,emFPMergeUriUniqFilter和DiskFPMergeUriUniqFilter中的一种。默认是BdbUriUniqFilter。而且这也是在Heritrix抓取过程中使用的唯一一种方式。
     Heritrix中涉及存储url的主要的类分布在org.archive.crawler.util包下，之间的继承关系如下图：

这里存储已经处理过的url的数据结构是Berkeley Database，叫做alreadySeen。

Java代码：

protected transientDatabase alreadySeen = null;

为了节省存储空间，alreadySeenUrl中存储的并不是url，而是url的fingerprint(64位)。为了不破坏url的局部性，分别对url的主机名和整个url计算fingerprint，然后把24位的主机名fingerprint和40位的url的fingerprint连接起来得到最后的64位的fingerprint。计算fingerprint是在createKey函数中实现。代码如下：

    publicstatic long createKey(CharSequence uri) {
       String url = uri.toString();
       int index = url.indexOf(COLON_SLASH_SLASH);
       if (index > 0) {
           index = url.indexOf('/', index + COLON_SLASH_SLASH.length());
       }
       CharSequence hostPlusScheme = (index == -1)? url:url.subSequence(0, index);
       long tmp = FPGenerator.std24.fp(hostPlusScheme);

return tmp | (FPGenerator.std40.fp(url)>>> 24);
}

setAdd函数把uri加入到数据库中，如果已经存在，则返回false，否则返回true。关键代码如下：（根据自己的理解加入了注释）
Java代码

protected boolean setAdd(CharSequence uri) {
       DatabaseEntry key = new DatabaseEntry();
       LongBinding.longToEntry(createKey(uri),key);//将uri的fingerprint从long类型转换成DatabaseEntry类型，以便于Database进行存储。
       long started = 0;

       OperationStatus status = null;
       try {
           if (logger.isLoggable(Level.INFO)) {
               started = System.currentTimeMillis();
           }
           status = alreadySeen.putNoOverwrite(null, key,ZERO_LENGTH_ENTRY);//检查是否已经被抓取过，并返回状态给status
           if (logger.isLoggable(Level.INFO)) {
               aggregatedLookupTime +=
                   (System.currentTimeMillis() - started);
           }
       } catch (DatabaseException e) {
           logger.severe(e.getMessage());
       }
       if (status == OperationStatus.SUCCESS) {
           count++;
           if (logger.isLoggable(Level.INFO)) {
               final int logAt = 10000;
               if (count > 0 &&((count % logAt) == 0)) {
                   logger.info("Average lookup " +
                       (aggregatedLookupTime / logAt) + "ms.");
                   aggregatedLookupTime = 0;
               }
           }
       }
       if(status == OperationStatus.KEYEXIST) {//是否已经探测过
           return false;
       } else {
           return true;
       }
    }

分析politeness

   每个时间只有一个面向服务器的连接(one connection at a time)
   Heritrix的礼貌性主要在Frontier中实现：一次对一个服务器只开一个链接，并且保证uri按一定速率处理，从而不会给被爬取的服务器造成负担。
   爬虫采用广度优先遍历，使用FIFO的队列来存储待爬取的URL。因为网页的局部性，队列中相邻的URL很可能是相同主机名的，这样爬取会给服务器造成很大负担。如果用很多队列来存放URL，每个队列中URL的主机名相同，同一时间里，只允许队列中一个URL被爬取，就能避免上述问题了。
    heritrix中主机名相同的URL队列是用WorkQueue来实现的，一个WorkQueue就是一个具有相同主机名的队列。在Heritrix中，还有其他的队列，代码如下：(在org.archive.crawler.frontier.WorkQueueFrontier.java中)
Java代码

protectedtransientObjectIdentityCacheallQueues = null;
// ofclassKey -> ClassKeyQueue
Java代码

    protectedvoid initQueuesOfQueues() {
       // small risk of OutOfMemoryError: if 'hold-queues' is false,
       // readyClassQueues may grow in size without bound
       readyClassQueues = newLinkedBlockingQueue();
       // risk of OutOfMemoryError: in large crawls,
       // inactiveQueues may grow in size without bound
       inactiveQueues = newLinkedBlockingQueue();
       // risk of OutOfMemoryError: in large crawls with queuemax-budgets,
       // inactiveQueues may grow in size without bound
       retiredQueues = newLinkedBlockingQueue();
       // small risk of OutOfMemoryError: in large crawls with many
       // unresponsive queues, an unbounded number of snoozed queues
       // may exist
       snoozedClassQueues = Collections.synchronizedSortedSet(newTreeSet());
    }

在子类BdbFrontier中的初始化过程如下：

public void initialize(CrawlController c)
    throwsFatalConfigurationException, IOException {
       this.controller = c;
       // fill in anything from a checkpoint recovery first (because
       // usual initialization will skip initQueueOfQueues incheckpoint)
       if (c.isCheckpointRecover()) {
           // If a checkpoint recover, copy old values from serialized
           // instance into this Frontier instance. Do it this waybecause
           // though its possible to serialize BdbFrontier, its currentlynot
           // possible to set/remove frontier attribute plugging the
           // deserialized object back into the settings system.
           // The below copying over is error-prone because its easy
           // to miss a value. Perhaps there's a betterway? Introspection?
           BdbFrontier f = null;
           try {
               f = (BdbFrontier)CheckpointUtils.
                   readObjectFromFile(this.getClass(),
                       c.getCheckpointRecover().getDirectory());
           } catch (FileNotFoundException e) {
               throw new FatalConfigurationException("Failed checkpoint " +
                   "recover: " + e.getMessage());
           } catch (IOException e) {
               throw new FatalConfigurationException("Failed checkpoint " +
                   "recover: " + e.getMessage());
           } catch (ClassNotFoundException e) {
               throw new FatalConfigurationException("Failed checkpoint " +
                   "recover: " + e.getMessage());
           }

           this.nextOrdinal = f.nextOrdinal;
           this.totalProcessedBytes = f.totalProcessedBytes;
           this.liveDisregardedUriCount = f.liveDisregardedUriCount;
           this.liveFailedFetchCount = f.liveFailedFetchCount;
           this.processedBytesAfterLastEmittedURI =
               f.processedBytesAfterLastEmittedURI;
           this.liveQueuedUriCount = f.liveQueuedUriCount;
           this.liveSucceededFetchCount = f.liveSucceededFetchCount;
           this.lastMaxBandwidthKB = f.lastMaxBandwidthKB;
           this.readyClassQueues = f.readyClassQueues;
           this.inactiveQueues =reinit(f.inactiveQueues,"inactiveQueues");//inactiveQueues的初始化
           this.retiredQueues =reinit(f.retiredQueues,"retiredQueues");//retiredQueues的初始化
           this.snoozedClassQueues =f.snoozedClassQueues;//snoozedClassQueues的初始化
           this.inProcessQueues = f.inProcessQueues;
           super.initialize(c);
           wakeQueues();
       } else {
           // perform usual initialization
           super.initialize(c);
       }
    }

readyClassQueues存储着已经准备好被爬取的队列的key；
inactiveQueues存储着所有非活动状态的url队列的key；
retiredQueues存储着不再激活的url队列的key。
snoozedClassQueues：存储着所有休眠的url队列的key，它们都按唤醒时间排序；

线程返回readyClassQueues和snoozedClassQueues中已经到唤醒时间的队列中第一个url，下载相应的文档，完成之后从队列中移除该url。每爬取到一个url都需要判断应该加入哪个队列中。首先根据url的主机名判断是否存在该主机名的队列，如果不存在就新建一个队列。然后判断该队列是否在生命周期内，如果不在就设置为在生命周期内。如果队列需要保持不激活状态或者活动队列的数量超过设定的阈值，就把该队列放入inactiveQueues中，否则放在readyClassQueues中。
另外，heritrix还设定了很多参数来限制对服务器的访问频率。如最长等待时间max-delay-ms，默认30秒；重连同一服务器至少等待时间min-delay-ms，默认是3秒，重连同一服务器要等待上次连接间隔的几倍delay-factor，默认是5。

(2) robots.txt
robots.txt称为机器人协议，放在网站的根目录下。在这个文件中声明该网站中不想被robot访问的部分，或者指定搜索引擎只收录指定的内容。这是一个君子协定，爬虫可以不遵守，但是出于礼貌最好遵守。
heritrix在预处理阶段处理robots.txt。它把针对每个user-agent的allow和disallow封装为一个RobotsDirectives类，整个robots.txt用一个Robotstxt对象来存储。
heritrix处理robots.txt有五种方法，都封装在RobotsHonoringPolicy中。这五种方法分别是：
Classic：遵守robots.txt对当前user-agent的第一部分指令。
Ignore：忽略robots.txt。
Custom：遵守robots.txt中特定操作的指令。
Most-favored：遵守最宽松的指令。
Most-favored-set：给定一些user-agent格式的集合，遵守最宽松的限制。

当策略是Most-favored或Most-favored-set时，可以选择是否伪装成另一个user agent。
RobotsExlusionPolicy类中包含heritrix最终处理robots.txt的方法，disallows用来判断userAgent能否访问某个url。它完全依据用户在新建一个爬虫任务时设置的处理robots.txt的策略来实现。

在源代码中的反应如下：
RobotsDirectives.java
Robotstxt.java
RobotsHonoringPolicy.java
RobotsExclusionPolicy.java包都存放在org.archive.crawler.datamodel包下。而且通过查看源文件即可看到类的注释。分别如下：
Java代码：

public class RobotsDirectives implements Serializable{
...
}

Java代码：

public class Robotstxt implements Serializable{
...
}

Java代码：

public class RobotsHonoringPolicy extendsModuleType{
...
}

Java代码：

public class RobotsExclusionPolicy implements Serializable{
...
}

对isURLVisited和politeness的分析对钱云会案证据的分析和初步判断银监会2010年对经济的展望和分析对传统社会和现代社会的四种分析角度对中国房地产的分析对现状的分析对当今社会的分析（体制的腐败和社会的堕落）有良心的学者对高铁最理性客观的分析和预警对财务分析的理解对男人的极品分析对现金流量表的分析对男人的精辟分析专家对芽苗菜的分析佛教对心理现象的分析对“金三角血案”的分析对我国当前物价上涨原因和相关矛盾的分析蜗居中关于宋思明对海藻和家庭的感情分析1 对我国当前物价上涨原因和相关矛盾的分析对当今社会的分析（体制的腐败和社会的堕落） (世外桃源大中国的分类目录 [我的图书馆]) 转型期中国的社会冲突——对当代工农维权抗争活动的观察和分析对现在中国股市的精辟分析！芒格:对实践思维的实践分析？一位MM对男人的极品分析对中国股市的精辟分析