Tuesday, July 14, 2015

HDFS – Hadoop Distributed File System

HDFS
 
HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. "Very large" here means files that are hundreds of megabytes, gigabytes or terabytes in size.
HDFS stores data in blocks, each of which is 64 MB by default. The block is large compared to a disk block (typically 512 bytes) in order to minimize the cost of seeks. Ex – if a file were split into a huge number of tiny blocks, the reader would spend much of its time seeking to the start of each block instead of transferring data, so the total read time would increase.
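As a rough illustration, the block size can be inspected, and even overridden per file, through the Hadoop Java API. This is only a minimal sketch: the class name, the path and the 128 MB figure are my own examples, not values required by HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");  // hypothetical path

        // Default block size configured for the cluster (64 MB on older releases)
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));

        // A larger block size can also be requested for an individual file
        long blockSize = 128L * 1024 * 1024;            // 128 MB, purely illustrative
        fs.create(file, true, 4096, fs.getDefaultReplication(file), blockSize).close();
    }
}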

Namenode (master) and Datanodes (slaves)
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories. If the namenode fails, the filesystem becomes unusable, because only the namenode knows how to reconstruct files from the blocks stored on the datanodes.
Datanodes work as slaves: they store the blocks and periodically report back to the namenode with the list of blocks they are holding.
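You can see this division of labour from the client side: the FileSystem API asks the namenode where the blocks of a file live, and the answer lists the datanodes holding each block. A minimal sketch, assuming a hypothetical file path:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log")); // hypothetical file

        // The namenode returns one entry per block, naming the datanodes that hold it
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + Arrays.toString(block.getHosts()));
        }
    }
}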

SPOF (Single Point of Failure)
The namenode is a single point of failure. If the namenode is down, every client, including running MapReduce jobs, fails to read or write data. This is very costly in a production environment because the whole system stops and is non-functional until the namenode is recovered.
To avoid this situation, Hadoop 0.23 introduced HA (High Availability): a second namenode is kept in standby mode and takes over in case the active namenode goes down.

Failover and Fencing
Failover is the act of switching from the active namenode to the standby, and it is managed by an entity called the failover controller.
Fencing is making sure the previously active namenode cannot do any damage after a failover. For example, a slow network can trigger a failover even though the old active node is still running; fencing cuts that node off so it cannot keep serving clients or writing to shared storage.

STONITH (Shoot The Other Node In The Head)
A fencing technique in which the other node is forcibly shut down, typically by using a specialised power distribution unit that lets the administrator (or the failover controller) cut its power.

Data Flow for File Read
  1. First the client calls the open() method on HDFS [ open() -> HDFS ]
  2. Second, HDFS asks the Namenode via RPC for the locations of the file's blocks [block locations -> Namenode]
  3. The Namenode returns the addresses of the Datanodes holding each block
  4. Once the addresses are returned, the client connects to the closest Datanode for the first block
  5. It then performs a continuous read operation until the end of the block is reached, moving on to the best Datanode for the next block
  6. Finally, when the client has finished reading, the close() method is called (a short code sketch follows below)
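From the client's point of view the whole read sequence is hidden behind open(), read() and close() on the FileSystem object. A minimal sketch, assuming a hypothetical input path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the namenode for the block locations behind the scenes
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) { // hypothetical path
            byte[] buffer = new byte[4096];
            int bytesRead;
            // read() streams from the closest datanode, moving on as each block ends
            while ((bytesRead = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, bytesRead);
            }
        } // close() is called automatically by try-with-resources
    }
}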
Data Flow for File Write
  1. The client creates a file by calling create() on HDFS
  2. HDFS then makes an RPC call to the Namenode to create the file in the namespace
  3. The Namenode performs various checks on the request, such as whether the file already exists and whether the client has permission to write. If all the checks pass, the write can begin
  4. The data is written as a stream of packets, which are pushed to a pipeline of Datanodes allocated by the Namenode for each block
  5. After the packets are stored, acknowledgement packets are sent back up the pipeline to the client
  6. Finally close() is called to flush the remaining packets and end the write process (see the sketch below)
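The same steps look like this in client code. Again a minimal sketch, with the path and file contents being my own examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() makes the RPC call to the namenode and runs the permission checks
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) { // hypothetical path
            // write() splits the data into packets and pushes them down the datanode pipeline
            out.write("hello hdfs\n".getBytes("UTF-8"));
        } // close() flushes the remaining packets and completes the file
    }
}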
distcp
distcp is used to copy large amounts of data between Hadoop filesystems in parallel. For example, a command like hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar copies the /foo directory from one cluster to /bar on another.
