Teragen performance with hServer

Alix · Apr 18, 2013

Hello,

I am currently evaluating hServer with my Hadoop installation. The first thing I tried was to modify Hadoop standard TeraGen step of TeraSort benchmark to use GridOutputFormat to skip writing data to disk and put it directly to hServer. So, according to tutorial I just changed the task output format to GridOutputFormat and started TeraGen to generate 10GB input.

Results were pretty unexpected to me: in soss server statistics I see ~10k creates/sec throughput which is way slower then HDFS. This number does not change significantly with number of reduce tasks (i.e. number of threads inserting records per server)

I suspect there is something wrong with my configuration, so is there any configuration parameter I should set/verify they are set to speed up insertion rate?

I am running 3x8 core boxes with 96GB ram, 10Gbit network, linux.

Thanks.

admin · Apr 18, 2013

Directly storing individual, very small key/value pairs (such as the ones generated by TeraGen) is outside the scope of this release of ScaleOut hServer. The purpose of the GridOutputFormat is to store/update objects whose characteristics are suitable for use by an in-memory data grid (IMDG) and simultaneously accessed by another application updating these objects in a live scenario. An IMDG incorporates different tradeoffs from a distributed file system, such as HDFS, and has a different usage model. BTW, this is not the case for ScaleOut hServer’s DatasetInputIFormat used for HDFS caching, as explained below.

Unlike HDFS, ScaleOut hServer’s IMDG supports highly efficient random reads and updates of objects identified by unique keys, and it implements object-level replication on an update by update basis. Objects have rich semantics, such as distributed locking, queryable properties, backing store access, and timeouts. All of this comes with a price of additional memory and performance overheads, and so IMDG objects are typically much larger (in the range of 10-300 KB) than TeraGen key/value pairs.

This overhead has an impact on throughput when sequentially streaming objects to the IMDG, becoming more evident with very small objects. When objects are committed to the IMDG sequentially, this creates a delay due to replication of each object, which requires a network round trip to complete the update. This round-trip time is pretty small in absolute terms, but it limits streaming throughput if the data set is divided into a large number of very small objects. (In typical IMDG use cases, this overhead is not significant.)

For these reasons, we discourage directly storing very small objects in the IMDG and instead advise the use of a chunking strategy (several hundred kilobytes is a good starting point for a chunk size). A chunking optimization is used in ScaleOut hServer’s DatasetInputFormat for HDFS caching. The key/value pairs emitted by the record reader are accumulated in a queue of buffers which are written out to the server in an overlapped manner. This approach yields much better performance than streaming individual objects to the IMDG.

All that said, we plan to add specific optimizations to the GridOutputFormat for supporting large numbers of small objects in the IMDG with automatic chunking and ordering. This feature was not ready for this first release of ScaleOut hServer.

Teragen performance with hServer

Alix

New member

admin

Administrator