Solving "Small file problem"

Article ID: KB0082647

Products Versions
Spotfire Data Science 6.x

Description

Question: How do I specify the number of mappers?

I have hundreds or even thousands of small files generated by MapReduce jobs. I'd like to use these files for further analysis, but there are too many files!
Is there any way to consume these many small files as a few larger files?
By default MapReduce will spawn as many mappers as input splits. We'd like to use fewer mappers. Please help!

Issue/Introduction

Hadoop handles large numbers of small files poorly, because by default each small file becomes its own input split and therefore its own mapper. This article shows how to work around the problem in TIBCO Spotfire Data Science.

Resolution

Solution for Pig Operators
  1. Choose a dataset that contains many "part-" files.
    My dataset has 112 "part-" files.
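  2. Calculate the total size of your dataset using this command.
    hadoop fs -du -h /path/to/folder
    In the following example, my folder uses 1.8 MB, or approx. 1887436.8 bytes (1.8 x 1024 x 1024). The first column is the data size; the second is the disk space consumed including replication.
    $ hadoop fs -du -h /path/to/folder
    1.8 M    5.4 M    /path/to/folder
  3. Calculate the value for this parameter:
    pig.maxCombinedSplitSize = total file size in bytes / (desired number of mappers)
    For example, with 37 desired mappers:
    pig.maxCombinedSplitSize = 1887436.8 / 37 = approx. 51012
    This example uses the slightly smaller value 49152 (48 x 1024), which gives 1887436.8 / 49152 = approx. 38.4, or 39 mappers after rounding up.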
  4. In the datasource connection, specify this parameter
    pig.maxCombinedSplitSize = 49152 
  5. Drag the "dataset" to the canvas.
  6. Connect it to the Column Filter operator, which is a Pig operator.
  7. Check the Pig job (Column Filter) in the ResourceManager (RM) to determine the number of mappers being used. In this case: 39.
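
The arithmetic in steps 2 to 4 can be sketched in Python as follows. This is only an illustration: the function names are hypothetical and not part of Spotfire Data Science or Pig, and the mapper estimate is a rough upper-bound calculation, since the actual count depends on how Pig packs the individual files into combined splits.

```python
import math


def max_combined_split_size(total_bytes: float, desired_mappers: int) -> int:
    """Return a value for pig.maxCombinedSplitSize (in bytes).

    Hypothetical helper: total dataset size divided by the desired mapper count.
    """
    if desired_mappers < 1:
        raise ValueError("desired_mappers must be at least 1")
    # Round up so that total_bytes / split_size does not exceed desired_mappers.
    return math.ceil(total_bytes / desired_mappers)


def estimated_mappers(total_bytes: float, split_size: int) -> int:
    """Rough estimate: one mapper per combined split of at most split_size bytes."""
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    return math.ceil(total_bytes / split_size)


# 1.8 M as reported by `hadoop fs -du -h`, converted to bytes.
total = 1.8 * 1024 * 1024

print(max_combined_split_size(total, 37))   # 51012
print(estimated_mappers(total, 49152))      # 39
```

With the 49152 value used in step 4, the estimate of 39 matches the mapper count observed in step 7.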

Solution for MapReduce operators
  1. For MapReduce operators, such as Alpine Forest, first pass the dataset through a "Column Filter" as above, selecting all columns, so that the MapReduce operator reads the Column Filter output rather than the original small files.
  2. Notice that the MapReduce operator (Alpine Forest) now uses 39 mappers. 38 part-* files and one metadata file will be generated.

Reference: Which operators use Pig?

Explore Operators 

Bar Chart
Box Plot
Frequency
Histogram
Scatter Plot Matrix

Transform Operators

Aggregation
Column Filter
Join*
Null Value Replacement
Row Filter
Variable

Tools Operators

Pig Execute

(*) Also available with MR implementation.
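
As a convenience, the reference list above can be captured in a small lookup, for example to decide whether an operator honors pig.maxCombinedSplitSize directly or should be preceded by a Column Filter. This is a minimal sketch: the names PIG_OPERATORS and uses_pig are hypothetical, and the set contains only the operators listed in this article.

```python
# Operator names copied from the reference list in this article.
PIG_OPERATORS = {
    # Explore operators
    "Bar Chart", "Box Plot", "Frequency", "Histogram", "Scatter Plot Matrix",
    # Transform operators ("Join" is also available with an MR implementation)
    "Aggregation", "Column Filter", "Join", "Null Value Replacement",
    "Row Filter", "Variable",
    # Tools operators
    "Pig Execute",
}


def uses_pig(operator_name: str) -> bool:
    """Return True if the operator appears in the article's list of Pig operators."""
    # Drop surrounding whitespace and the trailing "*" footnote marker.
    name = operator_name.strip().rstrip("*").strip()
    return name in PIG_OPERATORS


print(uses_pig("Column Filter"))   # True
print(uses_pig("Alpine Forest"))   # False
```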