HBase Architecture 101 - Write-ahead-Log

What is the Write-ahead-Log you ask? In my previous post we had a look at the general storage architecture of HBase. One thing that was mentioned is the Write-ahead-Log, or WAL. This post explains how the log works in detail, but bear in mind that it describes the current version, which is 0.20.3. I will address the various plans to improve the log for 0.21 at the end of this article. For the term itself please read here.

Big Picture
The WAL is the lifeline that is needed when disaster strikes. Similar to the binary log in MySQL it records all changes to the data. This is important in case something happens to the primary storage. So if the server crashes it can effectively replay that log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails the whole operation must be considered a failure.

Let's look at the high level view of how this is done in HBase. First the client initiates an action that modifies data. This is currently a call to put(Put), delete(Delete) or incrementColumnValue() (abbreviated as "incr" here at times). Each of these modifications is wrapped into a KeyValue object instance and sent over the wire using RPC calls. The calls are sent (ideally batched) to the HRegionServer that serves the affected regions. Once it arrives, the payload, the said KeyValue, is routed to the HRegion that is responsible for the affected row. The data is written to the WAL and then put into the MemStore of the actual Store that holds the record. And that also pretty much describes the write-path of HBase.
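
To make the write-path a bit more tangible, here is a minimal client-side sketch. It assumes the 0.20-era client API; the table, row and column names are made up for illustration.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePathExample {
      public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        HTable table = new HTable(conf, "testtable");

        // Each of these calls is wrapped into KeyValue instances, shipped via RPC
        // to the hosting HRegionServer, appended to the WAL and then added to the
        // MemStore of the affected Store.
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("colfam"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
        table.put(put);

        Delete delete = new Delete(Bytes.toBytes("row-2"));
        table.delete(delete);

        table.incrementColumnValue(Bytes.toBytes("row-3"),
            Bytes.toBytes("colfam"), Bytes.toBytes("counter"), 1L);
      }
    }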

Eventually, when the MemStore gets to a certain size or after a specific time, the data is asynchronously persisted to the file system. In between that timeframe the data is held only in volatile memory. And if the HRegionServer hosting that memory crashes the data is lost... but for the existence of what is the topic of this post, the WAL!

We now have a look at the various classes or "wheels" working the magic of the WAL. First up is one of the main classes of this contraption.

HLog

The class which implements the WAL is called HLog. What you may have read in my previous post and is also illustrated above is that there is only one instance of the HLog class, which is one per HRegionServer. When a HRegion is instantiated the single HLog is passed on as a parameter to the constructor of HRegion.

Central part to HLog's functionality is the append() method, which internally eventually calls doWrite(). It is what is called when the above mentioned modification methods are invoked... or is it? One thing to note here is that for performance reasons there is an option for put(), delete(), and incrementColumnValue() to be called with an extra parameter set: setWriteToWAL(boolean). If you invoke this method while setting up for example a Put() instance then the writing to the WAL is forfeited! That is also why the downward arrow in the big picture above is done with a dashed line to indicate the optional step. By default you certainly want the WAL, no doubt about that. But say you run a large bulk import MapReduce job that you can rerun at any time. You gain extra performance but need to take extra care that no data was lost during the import. The choice is yours.
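
For illustration, skipping the WAL for a single Put would look like this, assuming the same hypothetical table as in the earlier sketch.

    // Bulk-import style write that can be replayed from the source data if it fails.
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("colfam"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
    put.setWriteToWAL(false);  // the edit goes straight to the MemStore, no WAL entry
    table.put(put);            // lost if the region server crashes before a flush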

Another important feature of the HLog is keeping track of the changes. This is done by using a "sequence number". It uses an AtomicLong internally to be thread-safe and is either starting out at zero - or at that last known number persisted to the file system. So as the region is opening its storage files, it reads the highest sequence number, which is stored as a meta field in each HFile, and sets the HLog sequence number to that value if it is higher than what has been recorded before. So at the end of opening all storage files the HLog is initialized to reflect where persisting has ended and where to continue. You will see in a minute where this is used.
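
The following is a simplified sketch, not the actual HLog code, of how such a thread-safe sequence number could be kept and re-initialized from the meta fields of the opened store files.

    import java.util.concurrent.atomic.AtomicLong;

    class SequenceTrackerSketch {
      private final AtomicLong logSeqNum = new AtomicLong(0);

      // Called while a region opens its store files; each HFile carries the
      // highest sequence number it contains as a meta field.
      void noteStoreFileSequenceId(long maxSeqIdInFile) {
        long current;
        while ((current = logSeqNum.get()) < maxSeqIdInFile) {
          if (logSeqNum.compareAndSet(current, maxSeqIdInFile)) break;
        }
      }

      // Called for every edit appended to the log.
      long nextSequenceNumber() {
        return logSeqNum.incrementAndGet();
      }
    }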

The image to the right shows three different regions, each of them covering a different row key range. As mentioned above, each of these regions shares the same single instance of HLog. What that means in this context is that the data, as it arrives at each region, is written to the WAL in an unpredictable order. We will address this further below.

Finally the HLog has the facilities to recover and split a log left by a crashed HRegionServer. These are invoked by the HMaster before regions are deployed again.

HLogKey

Currently the WAL is using a Hadoop SequenceFile, which stores records as sets of key/values. For the WAL the value is simply the KeyValue sent from the client. The key is represented by an HLogKey instance. If you may recall from my first post in this series, the KeyValue only represents the row, column family, qualifier, timestamp, and value, as well as the "Key Type". Last time I did not address that field since there was no context. Now we have one, because the Key Type is what identifies what the KeyValue represents, a "put" or a "delete" (where there are a few more variations of the latter to express what is to be deleted: value, column family or a specific column).

What we are missing though is where the KeyValue belongs to, i.e. the region and the table name. That is stored in the HLogKey. What is also stored is the above sequence number. With each record that number is incremented to be able to keep a sequential order of edits. Finally it records the "Write Time", a time stamp to record when the edit was written to the log.
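
Put differently, a stripped-down HLogKey would carry roughly these fields; the real class also implements Writable so it can be stored as the key of the SequenceFile record.

    class SimplifiedHLogKey {
      byte[] regionName;  // which region the wrapped KeyValue belongs to
      byte[] tableName;   // which table that region is part of
      long   logSeqNum;   // the monotonically increasing sequence number
      long   writeTime;   // when the edit was written to the log
    }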

LogFlusher

As mentioned above, as data arrives at a HRegionServer in form of KeyValue instances it is written (optionally) to the WAL. And as mentioned as well it is then written to a SequenceFile. While this seems trivial, it is not. One of the base classes in Java IO is the Stream. Especially streams writing to a file system are often buffered to improve performance, as the OS is much faster writing data in batches, or blocks. If you wrote records separately IO throughput would be really bad. But in the context of the WAL this is causing a gap where data is supposedly written to disk but in reality it is in limbo. To mitigate the issue the underlying stream needs to be flushed on a regular basis. This functionality is provided by the LogFlusher class and thread. It simply calls HLog.optionalSync(), which checks if the hbase.regionserver.optionallogflushinterval, set to 10 seconds by default, has been exceeded and if that is the case invokes HLog.sync(). The other place invoking the sync method is HLog.doWrite(). Once it has written the current edit to the stream it checks if the hbase.regionserver.flushlogentries parameter, set to 100 by default, has been exceeded and calls sync as well.
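
Condensed into code, the two sync triggers look roughly like the sketch below. This is only the decision logic, not the real HLog/LogFlusher classes; the state kept here is a placeholder.

    import org.apache.hadoop.conf.Configuration;

    class SyncPolicySketch {
      private final long optionalFlushInterval;  // hbase.regionserver.optionallogflushinterval
      private final int flushLogEntries;         // hbase.regionserver.flushlogentries
      private long lastSyncTime = System.currentTimeMillis();
      private int unflushedEntries = 0;

      SyncPolicySketch(Configuration conf) {
        this.optionalFlushInterval =
            conf.getLong("hbase.regionserver.optionallogflushinterval", 10 * 1000L);
        this.flushLogEntries = conf.getInt("hbase.regionserver.flushlogentries", 100);
      }

      // Trigger 1: the LogFlusher thread calls this periodically (HLog.optionalSync()).
      void optionalSync() {
        if (System.currentTimeMillis() - lastSyncTime > optionalFlushInterval) {
          sync();
        }
      }

      // Trigger 2: called after each edit has been appended (HLog.doWrite()).
      void editAppended() {
        if (++unflushedEntries >= flushLogEntries) {
          sync();
        }
      }

      private void sync() {
        // Here the real code flushes the underlying SequenceFile stream.
        lastSyncTime = System.currentTimeMillis();
        unflushedEntries = 0;
      }
    }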

Sync itself invokes HLog.Writer.sync() and is implemented in SequenceFileLogWriter. For now we assume it flushes the stream to disk and all is well. That in reality this is all a bit more complicated is discussed below.

LogRoller

Obviously it makes sense to have some size restrictions related to the logs written. Also we want to make sure a log is persisted on a regular basis. This is done by the LogRoller class and thread. It is controlled by the hbase.regionserver.logroll.period parameter in the $HBASE_HOME/conf/hbase-site.xml file. By default this is set to 1 hour. So every 60 minutes the log is closed and a new one started. Over time we are gathering that way a bunch of log files that need to be maintained as well. The HLog.rollWriter() method, which is called by the LogRoller to do the above rolling of the current log file, is taking care of that as well by calling HLog.cleanOldLogs() subsequently. It checks what the highest sequence number written to a storage file is, because up to that number all edits are persisted. It then checks if there is a log left that has edits all less than that number. If that is the case it deletes said logs and leaves just those that are still needed.
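
The pruning rule itself can be sketched like this. It is not HLog.cleanOldLogs() verbatim; the map of rolled logs and the deleteLogFile() helper are stand-ins for the real bookkeeping.

    import java.util.Iterator;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    class OldLogPrunerSketch {
      // Maps the highest sequence number contained in each rolled log file to its name.
      private final SortedMap<Long, String> rolledLogs = new TreeMap<Long, String>();

      // A log file can go once every edit it contains has a sequence number at or
      // below the highest number already persisted to a storage file.
      void pruneUpTo(long highestPersistedSeqNum) {
        Iterator<Map.Entry<Long, String>> it = rolledLogs.entrySet().iterator();
        while (it.hasNext()) {
          Map.Entry<Long, String> entry = it.next();
          if (entry.getKey() <= highestPersistedSeqNum) {
            deleteLogFile(entry.getValue());  // hypothetical helper removing the file
            it.remove();
          } else {
            break;  // sorted map: everything after this still holds unpersisted edits
          }
        }
      }

      private void deleteLogFile(String name) { /* hypothetical */ }
    }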

This is a good place to talk about the following obscure message you may see in your logs:

2009-12-15 01:45:48,427 INFO org.apache.hadoop.hbase.regionserver.HLog: Too
many hlogs: logs=130, maxlogs=96; forcing flush of region with oldest edits:
foobar,1b2dc5f3b5d4,1260083783909


It is printed because the number of log files that are required to be kept, because they still contain outstanding edits that have not yet been persisted, exceeds the configured maximum number of log files. The main reason I saw this being the case is when you stress out the file system so much that it cannot keep up persisting the data at the rate new data is added. Otherwise log flushes should take care of this. Note though that when this message is printed the server goes into a special mode trying to force flushing out edits to reduce the number of logs required to be kept.
The other parameters controlling the log rolling are hbase.regionserver.hlog.blocksize and hbase.regionserver.logroll.multiplier, which are set by default to rotate logs when they are at 95% of the blocksize of the SequenceFile, typically 64M. So the logs are switched out either when they are considered full or when a certain amount of time has passed, whichever comes first.
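
As a quick back-of-the-envelope check, assuming the typical 64 MB block size and the default 0.95 multiplier, the size trigger fires at roughly 60.8 MB:

    public class RollThresholdSketch {
      public static void main(String[] args) {
        long blocksize = 64L * 1024 * 1024;  // hbase.regionserver.hlog.blocksize
        float multiplier = 0.95f;            // hbase.regionserver.logroll.multiplier
        long threshold = (long) (blocksize * multiplier);
        System.out.println(threshold);       // roughly 60.8 MB, i.e. 95% of 64 MB
      }
    }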

Replay

Once a HRegionServer starts and is opening the regions it hosts, it checks if there are some left over log files and applies those all the way down in Store.doReconstructionLog(). Replaying a log is simply done by reading the log and adding the contained edits to the current MemStore. At the end an explicit flush of the MemStore (note, this is not the flush of the log!) helps writing those changes out to disk.
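
In essence the replay boils down to something like the following sketch. It is not the actual Store.doReconstructionLog() code; maxSeqIdAlreadyPersisted, applyToMemStore() and flushMemStore() are hypothetical stand-ins for the real region state, and I assume here that the log is a SequenceFile of HLogKey/KeyValue pairs as described above, with the HLogKey exposing its sequence number.

    void replayLog(FileSystem fs, Path reconstructionLog, Configuration conf,
                   long maxSeqIdAlreadyPersisted) throws IOException {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, reconstructionLog, conf);
      HLogKey key = new HLogKey();
      KeyValue value = new KeyValue();
      while (reader.next(key, value)) {
        if (key.getLogSeqNum() <= maxSeqIdAlreadyPersisted) {
          continue;  // this edit already made it into a store file
        }
        applyToMemStore(value);  // hypothetical helper: add the edit to the MemStore
      }
      reader.close();
      flushMemStore();  // hypothetical helper: the explicit flush mentioned above
    }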

The old logs usually come from a previous region server crash. When the HMaster is started or detects that a region server has crashed it splits the log files belonging to that server into separate files and stores those in the region directories on the file system they belong to. After that the above mechanism takes care of replaying the logs. One thing to note is that regions from a crashed server can only be redeployed if the logs have been split and copied. Splitting itself is done in HLog.splitLog(). The old log is read into memory in the main thread (meaning single threaded) and then written out to all region directories using a pool of threads, one thread for each region.
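
A very condensed sketch of that split scheme could look like the following; readAllEditsGroupedByRegion() and writeEditsToRegionDirectory() are hypothetical helpers standing in for the real reading and writing code, and the edit type is left as a placeholder.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class LogSplitSketch {
      void split(String oldLogPath) throws InterruptedException {
        // Single-threaded: read the whole old log and group its edits by region.
        final Map<String, List<Object>> editsByRegion = readAllEditsGroupedByRegion(oldLogPath);

        // Multi-threaded: one writer thread per region directory.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, editsByRegion.size()));
        for (final Map.Entry<String, List<Object>> entry : editsByRegion.entrySet()) {
          pool.execute(new Runnable() {
            public void run() {
              writeEditsToRegionDirectory(entry.getKey(), entry.getValue());
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
      }

      Map<String, List<Object>> readAllEditsGroupedByRegion(String path) { /* hypothetical */ return null; }
      void writeEditsToRegionDirectory(String region, List<Object> edits) { /* hypothetical */ }
    }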

Issues

As mentioned above all edits are written to one HLog per HRegionServer. You might ask why that is the case. Why not write all edits for a specific region into its own log file? Let's quote the BigTable paper once more:

If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS. Depending on the underlying file system implementation on each GFS server, these writes could cause a large number of disk seeks to write to the different physical log files.

HBase followed that principle for pretty much the same reasons. As explained above you end up with many files since logs are rolled and kept until they are safe to be deleted. If you do this for every region separately this would not scale well - or at least be an itch that sooner or later causes pain.

So far that seems to be no issue. But again, it causes problems when things go wrong. As long as you have applied all edits in time and persisted the data safely, all is well. But if you have to split the log because of a server crash then you need to divide it into suitable pieces, as described above in the "replay" paragraph. But as you have seen above as well, all edits are intermingled in the log and there is no index of what is stored at all. For that reason the HMaster cannot redeploy any region from a crashed server until it has split the logs for that very server. And that can be quite a number if the server was behind applying the edits.

Another problem is data safety. You want to be able to rely on the system to save all your data, no matter what newfangled algorithms are employed behind the scenes. As far as HBase and the log is concerned you can turn down the log flush times to as low as you want - you are still dependent on the underlying file system as mentioned above; the stream used to store the data is flushed, but is it written to disk yet? We are talking about fsync style issues. Now for HBase we are most likely talking about Hadoop's HDFS as being the file system that is persisted to.

Up to this point it should be abundantly clear that the log is what keeps data safe. For that reason a log could be kept open for up to an hour (or more if configured so). As data arrives a new key/value pair is written to the SequenceFile and occasionally flushed to disk. But that is not how Hadoop was set out to work. It was meant to provide an API that allows you to open a file, write data into it (preferably a lot) and close it right away, leaving an immutable file for everyone else to read many times. Only after a file is closed is it visible and readable to others. If a process dies while writing the data the file is pretty much considered lost. What is required is a feature that allows reading the log up to the point where the crashed server has written it (or as close as possible).

Interlude: HDFS append, hflush, hsync, sync... wth?

It all started with HADOOP-1700, reported by HBase lead Michael Stack. It was committed in Hadoop 0.19.0 and meant to solve the problem. But that was not the case. So the issue was tackled again in HADOOP-4379 aka HDFS-200, which implemented syncFs() and was meant to help syncing changes to a file to be more reliable. For a while we had custom code (see HBASE-1470) that detected a patched Hadoop that exposed that API. But again this did not solve the issue entirely.

Then came HDFS-265, which revisits the append idea in general. It also introduces a Syncable interface that exposes hsync() and hflush().

Lastly, SequenceFile.Writer.sync() is not the same as the above: it simply writes a synchronization marker into the file that helps reading it later - or to recover data if broken.

While append for HDFS in general is useful, it is not used in HBase, but the hflush() is. What it does is write out everything to disk as the log is written. In case of a server crash we can safely read that "dirty" file up to the last edits. The append in Hadoop 0.19.0 was so badly suited that a hadoop fsck / would report the DFS being corrupt because of the open log files HBase kept.
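
For completeness, here is a minimal sketch of what hflush() provides on an HDFS output stream, using the Syncable API introduced by HDFS-265 (i.e. Hadoop 0.21 and later); the path and payload are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HflushExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/tmp/wal-example"));
        out.writeBytes("edit-1\n");
        // Push the buffered bytes out to the datanodes so that a reader (for example
        // the log splitting code after a crash) can see them without the file being
        // closed first.
        out.hflush();
        out.close();
      }
    }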

Bottom line is, without Hadoop 0.21.0 you can very well face data loss. With Hadoop 0.21.0 you have a state-of-the-art system.

Planned Improvements

For HBase 0.21.0 there are quite a few things lined up that affect the WAL architecture. Here are some of the noteworthy ones.

SequenceFile Replacement

One of the central building blocks around the WAL is the actual storage file format. The used SequenceFile has quite a few shortcomings that need to be addressed. One for example is the suboptimal performance, as all writing in SequenceFile is synchronized, as documented in HBASE-2105.

As with HFile replacing MapFile in HBase 0.20.0 it makes sense to think about a complete replacement. A first step was done to make the HBase classes independent of the underlying file format. HBASE-2059 made the class implementing the log configurable.

Another idea is to change to a different serialization altogether. HBASE-2055 proposes such a format using Hadoop's Avro as the low level system. Avro is also slated to be the new RPC format for Hadoop, which does help as more people are familiar with it.

Append/Sync

Even with hflush() we have a problem that calling it too often may cause the system to slow down. Previous tests using the older syncFs() call did show that calling it for every record slows down the system considerably. One step to help is to implement a "Group Commit", done in HBASE-1939. It flushes out records in batches. In addition HBASE-1944 adds the notion of a "deferred log flush" as a parameter of a Column Family. If set to true it leaves the syncing of changes to the log to the newly added LogSyncer class and thread. Finally HBASE-2041 sets the flushlogentries to 1 and optionallogflushinterval to 1000 msecs. The .META. table is always synced for every change; user tables can be configured as needed.
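
To illustrate the group commit idea, here is a very simplified sketch, not the actual HBASE-1939 implementation: writer threads queue their edits, a short-lived "leader" waits a moment for others to pile on, and one sync then covers the whole batch.

    import java.util.ArrayList;
    import java.util.List;

    class GroupCommitSketch {
      private final List<byte[]> pending = new ArrayList<byte[]>();

      // Called by many handler threads; returns once the edit is safely synced.
      synchronized void appendAndSync(byte[] edit) throws InterruptedException {
        pending.add(edit);
        if (pending.size() == 1) {
          // First writer in the batch: wait briefly so other threads can join,
          // then sync everything that accumulated in the meantime.
          wait(2);  // arbitrary tiny window for other threads to add their edits
          syncAll(pending);
          pending.clear();
          notifyAll();  // release the threads that joined this batch
        } else {
          while (!pending.isEmpty()) {
            wait();  // our edit is covered by the leader's sync
          }
        }
      }

      private void syncAll(List<byte[]> edits) {
        // Here the real code would flush the log stream once for the whole batch.
      }
    }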

Distributed Log Splitting

As remarked, splitting the log is an issue when regions need to be redeployed. One idea is to keep a list of regions with edits in ZooKeeper. That way at least all "clean" regions can be deployed instantly. Only those with edits need to wait then until the logs are split.

What is left is to improve how the logs are split to make the process faster. Here is how BigTable addresses the issue:
One approach would be for each new tablet server to read this full commit log file and apply just the entries needed for the tablets it needs to recover. However, under such a scheme, if 100 machines were each assigned a single tablet from a failed tablet server, then the log file would be read 100 times (once by each server).
and further
We avoid duplicating log reads by first sorting the commit log entries in order of the keys (table, row name, log sequence number). In the sorted output, all mutations for a particular tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read. To parallelize the sorting, we partition the log file into 64 MB segments, and sort each segment in parallel on different tablet servers. This sorting process is coordinated by the master and is initiated when a tablet server indicates that it needs to recover mutations from some commit log file.
This is where it's at. As part of the HMaster rewrite (see HBASE-1816) the log splitting will be addressed as well. HBASE-1364 wraps the splitting of logs into one issue. But I am sure that will evolve in more sub tasks as the details get discussed.

Posted by Lars George at 3:45 PM