Tuesday, July 14, 2015

Hadoop I/O

Data Integrity
Since Hadoop operates on large sets of data, there is always a possibility of data corruption. For example, if you write a file to HDFS and some of the bytes get modified while travelling over the network, you end up with corrupt data on the datanode. To avoid such problems, Hadoop offers a facility called checksums to validate the data.

Hadoop uses client-side checksums, which means that if you write a file HelloWord.txt, it will create a hidden checksum file .HelloWord.txt.crc in the same directory, and when you read the file back it verifies the data against the checksum stored in .HelloWord.txt.crc.
We can disable checksum verification by using RawLocalFileSystem in place of LocalFileSystem, or by calling setVerifyChecksum(false) on the FileSystem before reading the file.
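The write-time/read-time checksum idea can be sketched in plain Java. Note this is only an illustration: Hadoop actually computes a CRC-32C checksum for every 512-byte chunk of data (configurable via dfs.bytes-per-checksum), whereas here we use java.util.zip.CRC32 over a whole buffer.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC-32 checksum over the given bytes.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "Hello HDFS".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(original);           // computed at write time

        // At read time, recompute and compare; a mismatch signals corruption.
        byte[] corrupted = original.clone();
        corrupted[0] ^= 1;                          // simulate a flipped bit in transit
        System.out.println(stored == checksum(original));   // true
        System.out.println(stored == checksum(corrupted));  // false
    }
}
```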
Compression

Just like the zip or rar files we use to save space and speed up data transfer, Hadoop has its own set of compression formats that can be used to compress data files. The supported compression formats are:
  • DEFLATE (.deflate)
  • gzip (.gz)
  • bzip2 (.bz2, splittable)
  • LZO (.lzo)
  • Snappy (.snappy)
How to compress a file?
Ex: gzip -1 HelloWord.txt
* In the above command, -1 optimizes for speed and -9 optimizes for space.
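The same speed-versus-space trade-off is visible from Java's built-in DEFLATE support. This is a stand-alone sketch using java.util.zip (not the Hadoop codec API); the input text is an arbitrary repetitive example:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressDemo {
    // Compress data with DEFLATE at the given level (1 = fastest, 9 = smallest).
    static byte[] deflate(byte[] data, int level) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(level);
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos, deflater)) {
            out.write(data);
        }
        deflater.end();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive input compresses well at any level.
        byte[] data = "HelloWord ".repeat(1000).getBytes(StandardCharsets.UTF_8);
        System.out.println("original size: " + data.length);
        System.out.println("level 1 size:  " + deflate(data, 1).length);
        System.out.println("level 9 size:  " + deflate(data, 9).length);
    }
}
```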

Codecs are the implementations of the compression and decompression algorithms. In Hadoop, each codec is a class that implements the CompressionCodec interface, whose two key methods are:
createOutputStream(OutputStream out): wraps the given stream so that data written through it is compressed before it reaches the disk
createInputStream(InputStream in): wraps the given stream so that compressed data read through it is decompressed
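The wrap-a-stream pattern behind these two methods can be sketched with the JDK's GZIP streams; this is an analogy, not the Hadoop CompressionCodec API itself:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CodecRoundTrip {
    // Analogous to codec.createOutputStream(out): writes pass through a compressor.
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    // Analogous to codec.createInputStream(in): reads pass through a decompressor.
    static byte[] decompress(byte[] gz) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "compress me, then read me back".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compress(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```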
How to identify the codec?
To identify the codec, simply look at the extension of the file; for example, if the file extension is .gz, the GzipCodec will be used. Hadoop's CompressionCodecFactory performs this extension-to-codec mapping.
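The extension lookup can be sketched with a small table; the table below is hypothetical and only mirrors the idea behind CompressionCodecFactory, which resolves extensions against the codec classes configured in Hadoop:

```java
import java.util.Map;

public class CodecLookup {
    // Hypothetical extension-to-codec table for illustration.
    static final Map<String, String> CODECS = Map.of(
            ".deflate", "DeflateCodec",
            ".gz", "GzipCodec",
            ".bz2", "BZip2Codec",
            ".lzo", "LzoCodec",
            ".snappy", "SnappyCodec");

    // Return the codec name for a filename, or null if none matches.
    static String codecFor(String filename) {
        int dot = filename.lastIndexOf('.');
        return dot < 0 ? null : CODECS.get(filename.substring(dot));
    }

    public static void main(String[] args) {
        System.out.println(codecFor("HelloWord.txt.gz"));   // GzipCodec
        System.out.println(codecFor("data.bz2"));           // BZip2Codec
    }
}
```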

If you are compressing and decompressing frequently, it is advisable to use a CodecPool. It is similar to a database connection pool: it saves the cost of repeatedly creating compressor and decompressor objects.
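The pooling idea can be sketched with a minimal pool of reusable Deflater instances; this is an illustration of the pattern, not Hadoop's CodecPool class (which exposes getCompressor/returnCompressor along the same lines):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.Deflater;

public class DeflaterPool {
    // Released compressors waiting to be reused.
    private final Deque<Deflater> pool = new ArrayDeque<>();

    // Hand out a pooled instance if one exists; otherwise pay the creation cost once.
    public synchronized Deflater get() {
        Deflater d = pool.poll();
        return d != null ? d : new Deflater();
    }

    // Reset the compressor's state and return it to the pool for reuse.
    public synchronized void release(Deflater d) {
        d.reset();
        pool.push(d);
    }

    public static void main(String[] args) {
        DeflaterPool pool = new DeflaterPool();
        Deflater first = pool.get();       // pool empty: creates a new instance
        pool.release(first);
        Deflater second = pool.get();      // reuses the released instance
        System.out.println(first == second);   // true
    }
}
```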
