Open Source Programming: Hadoop

Hadoop I/O

Data Integrity

Because Hadoop operates on large data sets, there is a real possibility of data corruption. For example, if you write a file to HDFS and some of the data is modified in transit over the network, a corrupt block ends up stored on a datanode. To guard against this, Hadoop uses checksums to validate the data.
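The checksum idea can be sketched in a few lines of Python. This is only an illustration using the standard library's CRC-32, not Hadoop's own implementation; the helper names checksum and verify are made up for this example.

```python
import zlib

def checksum(data: bytes) -> int:
    """Return the CRC-32 of data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF

def verify(data: bytes, expected: int) -> bool:
    """True when data still matches the checksum recorded at write time."""
    return checksum(data) == expected

original = b"Hello, Hadoop!"
stored_crc = checksum(original)      # computed when the data is written

corrupted = bytearray(original)
corrupted[0] ^= 0x01                 # flip one bit, as a faulty link might

print(verify(original, stored_crc))          # intact data passes the check
print(verify(bytes(corrupted), stored_crc))  # corrupted data fails the check
```

The writer stores the checksum next to the data, and every reader recomputes it and compares the two values.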

LocalFileSystem

LocalFileSystem performs client-side checksumming: when you write a file named HelloWord.txt, it transparently creates a hidden checksum file, .HelloWord.txt.crc, in the same directory. When you later read the file, the data is verified against that checksum.
Checksums can be disabled by using RawLocalFileSystem in place of LocalFileSystem.

Compression
Just as we use zip or rar files to save space and speed up data transfer, Hadoop supports its own set of compression formats for data files. The compression types and their file extensions are listed below : -

DEFLATE (.deflate)
gzip (.gz)
bzip2 (.bz2, splittable)
LZO (.lzo)
Snappy (.snappy)

How to compress a file?
Ex: - gzip -1 HelloWord.txt
* In the above command, -1 optimizes for speed; use -9 to optimize for space
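The same speed versus space trade-off can be tried from Python's standard gzip module, whose compresslevel argument corresponds to the -1 through -9 options. The sample data below is made up for illustration.

```python
import gzip

# Hypothetical sample data: repetitive text that compresses well.
data = b"".join(b"record %d: hello hadoop\n" % i for i in range(5000))

fast = gzip.compress(data, compresslevel=1)   # like `gzip -1`: favors speed
small = gzip.compress(data, compresslevel=9)  # like `gzip -9`: favors space

# Print original size, then the size at each level, to compare them.
print(len(data), len(fast), len(small))

# Either level decompresses back to exactly the original bytes.
restored_fast = gzip.decompress(fast)
restored_small = gzip.decompress(small)
print(restored_fast == data, restored_small == data)
```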

Codecs
A codec is the implementation of a compression and decompression algorithm. In Hadoop, each codec is represented by an implementation of the CompressionCodec interface, which provides these methods
createOutputStream(OutputStream out) : - Wraps the output stream so that data written to it is compressed before it reaches the disk
createInputStream(InputStream in) : - Wraps the input stream so that compressed data is decompressed as it is read
How to identify the codec?
To identify the codec, you simply look at the extension of the file. For example, if the file extension is .gz, GzipCodec will be used. Hadoop's CompressionCodecFactory performs this mapping from extension to codec.
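The extension lookup described above can be sketched in plain Python. This is not the Hadoop API: the CODECS table and the codec_for helper are hypothetical names for this example, and the table covers only the two formats that Python's standard library handles directly.

```python
import bz2
import gzip
import os

# Hypothetical table: file extension -> (name, compress, decompress).
CODECS = {
    ".gz": ("gzip", gzip.compress, gzip.decompress),
    ".bz2": ("bzip2", bz2.compress, bz2.decompress),
}

def codec_for(filename):
    """Return the codec entry matching the file extension, or None if unknown."""
    _, ext = os.path.splitext(filename)
    return CODECS.get(ext.lower())

name, compress, decompress = codec_for("HelloWord.txt.gz")
payload = b"hello hadoop"
round_trip = decompress(compress(payload))

print(name, round_trip == payload)
print(codec_for("HelloWord.txt"))  # no compression extension, so no codec
```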

CodecPool
If you compress and decompress data frequently, it is advisable to use the CodecPool. It works much like a database connection pool: compressors and decompressors are reused, which saves the cost of creating new objects each time.
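The pooling idea itself is small enough to sketch. The class below is a generic illustration, not Hadoop's CodecPool; SimplePool, borrow and give_back are invented names, and a bytearray stands in for an object that is costly to create.

```python
class SimplePool:
    """A tiny object pool: hand out idle objects before creating new ones."""

    def __init__(self, factory):
        self._factory = factory   # called only when no idle object is available
        self._idle = []
        self.created = 0          # how many objects were actually constructed

    def borrow(self):
        if self._idle:
            return self._idle.pop()
        self.created += 1
        return self._factory()

    def give_back(self, obj):
        self._idle.append(obj)

pool = SimplePool(lambda: bytearray(64 * 1024))  # stand-in for a costly object

first = pool.borrow()
pool.give_back(first)
second = pool.borrow()   # reuses the object that was just returned

print(second is first, pool.created)
```

As with a connection pool, the caller is responsible for returning each object once it is done with it.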

Hadoop : Shell Commands

Note : Every Hadoop file system command starts with 'hadoop fs', where fs stands for file system

1. hadoop fs -ls /
This command is similar to the Linux 'ls' command and lists the files in the Hadoop file system.
The parameter '/' means that the files are listed from the root of the Hadoop file system

2. -mkdir
This command is used to create directories in HDFS (Hadoop File System)
Example : -
hadoop fs -mkdir /user/hadoop/test (This command will create the 'test' folder inside the /user/hadoop)

3. -count
This command is used to count the number of directories, files and bytes under the given path
Example : -
hadoop fs -count /user

4. -touchz
This command is used to create a file of 0 length. It is similar to the Unix 'touch' command
Example : -
hadoop fs -touchz /user/hadoop/test/test.txt (This command will create a file 'test.txt' inside the directory '/user/hadoop/test')

5. -cp and -mv
These commands operate like regular Unix commands to copy and rename a file.

6. -put and -copyFromLocal
As the names suggest, these commands are used to copy files from the local hard disk (storage device) to HDFS (Hadoop File System)
The only difference between -put and -copyFromLocal is that -put can also read its input from stdin, whereas -copyFromLocal accepts only local files

Example : -
$ hadoop fs -copyFromLocal /home/rahul/Hadoop-Script/rahul.txt /user/hadoop/test
$ hadoop fs -ls /user/hadoop/test
Found 2 items
-rw-r--r--   1 rahul supergroup          0 2014-07-27 18:36 /user/hadoop/test/rahul.txt
-rw-r--r--   1 rahul supergroup          0 2014-07-27 18:27 /user/hadoop/test/test.txt
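Each line of the listing has eight columns: permissions, replication factor, owner, group, size in bytes, modification date, modification time and path. A small Python sketch for splitting such a line is shown here; the Entry field names and the parse_ls_line helper are labels chosen for this example, not part of Hadoop.

```python
from collections import namedtuple

# Field names are this example's own labels for the listing columns.
Entry = namedtuple(
    "Entry", "permissions replication owner group size date time path"
)

def parse_ls_line(line):
    """Split one listing line into its eight whitespace-separated columns."""
    parts = line.split(None, 7)
    if len(parts) != 8:
        return None   # e.g. the "Found 2 items" header line
    return Entry(parts[0], parts[1], parts[2], parts[3],
                 int(parts[4]), parts[5], parts[6], parts[7])

line = ("-rw-r--r--   1 rahul supergroup          0 "
        "2014-07-27 18:36 /user/hadoop/test/rahul.txt")
entry = parse_ls_line(line)

print(entry.owner, entry.size, entry.path)
print(parse_ls_line("Found 2 items"))
```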

7. -rm
This command is used to delete (remove) a file
Example : -
$ hadoop fs -rm /user/hadoop/test/test.txt
Deleted hdfs://localhost:9000/user/hadoop/test/test.txt

8. -get and -copyToLocal
These commands are used to copy a file from HDFS to the local system.
Example : -
$ hadoop fs -copyToLocal /user/hadoop/test/rahul.txt /home/rahul/Hadoop-Script/rahul.txt

9. hadoop fs -rmr /user
This command removes a directory and everything inside it from HDFS. In newer Hadoop releases -rmr is deprecated in favor of -rm -r

Hadoop : Configuring the Pseudo-distributed mode

Before starting the configuration, go to the conf directory of your Hadoop distribution and have a look at the files. In the conf directory you will find the following files : -

core-site.xml
hdfs-site.xml
mapred-site.xml

We need to edit these files one by one to make Hadoop work in pseudo-distributed mode
core-site.xml : - Modify the content of core-site.xml so that it looks like the code below

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop</value>
</property>
</configuration>

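Properties related to core-site.xml
fs.default.name : - It contains the location (URI) of the Namenode
hadoop.tmp.dir : - The base directory for Hadoop's working files; by default the HDFS metadata and data directories are created beneath it
dfs.name.dir : - The directory where the Namenode stores its metadata. It is not set here and defaults to a path under hadoop.tmp.dir (will see it later)
dfs.data.dir : - The directory where the Datanode stores HDFS data. It is not set here and defaults to a path under hadoop.tmp.dir (will see it later)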

hdfs-site.xml : - Modify the content of hdfs-site.xml so that it looks like the code below

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

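dfs.replication : - How many times each HDFS block should be replicated. A value of 1 suits pseudo-distributed mode, where there is only one Datanode
mapred-site.xml : - It holds the MapReduce settings; here mapred.job.tracker gives the host and port of the JobTracker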

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
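All three files share the same property/name/value layout, so it is easy to check what a file actually sets. The following Python sketch reads such a file into a dictionary using only the standard library; read_properties is a helper name invented for this example, and the XML text repeats the core-site.xml values shown earlier.

```python
import xml.etree.ElementTree as ET

# Same content as the core-site.xml example, inlined so the sketch is runnable.
SITE_XML = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hadoop</value>
  </property>
</configuration>"""

def read_properties(xml_text):
    """Map each property name in a Hadoop site file to its value."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }

props = read_properties(SITE_XML)
for key, value in props.items():
    print(key, "=", value)
```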

After completing the configuration in the previous steps, there is one last step: format the HDFS filesystem before starting it. Use the following command to format the HDFS filesystem : -

$ hadoop namenode -format

Tuesday, July 14, 2015
