hadoop Blocks and Splits HDFS


  1. Block Size and Blocks in HDFS : HDFS has the concept of storing data in blocks whenever a file is loaded. Blocks are the physical partitions of data in HDFS ( or in any other filesystem, for that matter ).

    Whenever a file is loaded onto the HDFS, it is splitted physically (yes, the file is divided) into different parts known as blocks. The number of blocks depend upon the value of dfs.block.size in hdfs-site.xml

    Ideally, the block size is set to a large value such as 64/128/256 MBs (as compared to 4KBs in normal FS). The default block size value on most distributions of Hadoop 2.x is 128 MB. The reason for a higher block size is because Hadoop is made to deal with PetaBytes of data with each file ranging from a few hundred MegaBytes to the order of TeraBytes.

    Say for example you have a file of size 1024 MBs. if your block size is 128 MB, you will get 8 blocks of 128MB each. This means that your namenode will need to store metadata of 8 x 3 = 24 files (3 being the replication factor).

    Consider the same scenario with a block size of 4 KBs. It will result in 1GB / 4KB = 250000 blocks and that will require the namenode to save the metadata for 750000 blocks for just a 1GB file. Since all these metadata related information is stored in-memory, larger block size is preferred to save that bit of extra load on the NameNode.

    Now again, the block size is not set to an extremely high value like 1GB etc because, ideally, 1 mapper is launched for each block of data. So if you set the block size to 1GB, you might lose parallelism which might result in a slower throughput overall.

2.) Split Size in HDFS : Splits in Hadoop Processing are the logical chunks of data. When files are divided into blocks, hadoop doesn't respect any file bopundaries. It just splits the data depending on the block size. Say if you have a file of 400MB, with 4 lines, and each line having 100MB of data, you will get 3 blocks of 128 MB x 3 and 16 MB x 1. But when input splits are calculated while the prceossing of data, file/record boundaries are kept in mind and in this case we will have 4 input splits of 100 MB each, if you are using, say, NLineInputFormat.

Split Size can also be set per job using the property mapreduce.input.fileinputformat.split.maxsize

A very good explanation of Blocks vs Splits can be found in this SO Answer/