Saturday, 13 April 2013


Using Hadoop Archives

A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it.
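As a minimal sketch (the /my paths here are illustrative, not part of the original post), the archive tool is invoked like this:

hadoop archive -archiveName files.har -p /my files /my

Here files.har is the name of the archive, -p /my gives the parent path that the source directory (files) is resolved against, and the final argument is the directory the archive is written to, so this produces /my/files.har.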
Limitations
There are a few limitations to be aware of with HAR files. Creating an archive makes a copy of the original files, so you need at least as much free disk space as the files you are archiving (although you can delete the originals once the archive has been created). There is currently no support for archive compression, although the files that go into the archive can themselves be compressed (HAR files are like tar files in this respect). Archives are immutable once they have been created: to add or remove files, you must recreate the archive. In practice, this is not a problem for files that don't change after being written, since they can be archived in batches on a regular basis, such as daily or weekly.
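To make this concrete, the commands below show how you might inspect the archive and then reclaim the space taken by the originals. The paths continue the earlier sketch, and the -lsr and -rmr forms are the Hadoop 1.x shell commands current when this was written:

hadoop fs -ls /my                    # the archive sits alongside the originals: files and files.har
hadoop fs -lsr har:///my/files.har   # browse the archive's contents through the har filesystem
hadoop fs -rmr /my/files             # delete the originals once the archive checks out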
As noted earlier, HAR files can be used as input to MapReduce. However, there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split, so processing lots of small files, even in a HAR file, can still be inefficient.
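A HAR file can still be handed to a job through the har URI scheme like any other input path; the example jar and paths below are assumptions for illustration:

hadoop jar hadoop-examples.jar wordcount har:///my/files.har/files /my/output

Note that paths inside the archive are relative to the -p parent given when it was created, which is why the input path here ends in /files. Each small file in the archive still gets its own map task, which is exactly the inefficiency described above.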
