I have hundreds or even thousands of small files generated by MapReduce jobs. I'd like to use these files for further analysis, but there are far too many of them! Is there any way to consume all of these small files as a few larger files? By default, MapReduce spawns as many mappers as there are input splits. We'd like to use fewer mappers. Please help!
Issue/Introduction
Dealing with a large number of small files is a well-known problem in Hadoop. This article shows how TIBCO Spotfire Data Science handles it.
Resolution
Solution for Pig Operators
Choose a dataset that contains many "part-*" files. The dataset in this example has 112 part-* files.
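If you are not sure how many part-* files a dataset contains, you can count them from the command line, as in this sketch (using the same /path/to/folder placeholder as below):
$ hadoop fs -ls /path/to/folder | grep -c 'part-'
112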
Calculate the total size of your dataset using the command shown below. In this example, the folder holds 1.8 MB, or approximately 1887436.8 bytes:
$ hadoop fs -du -s -h /path/to/folder
1.8 M 5.4 M /path/to/folder
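The -h flag prints human-readable sizes. Since the next step needs the size in bytes, you can drop -h; the first column of the summary is the total size in bytes, and the second is the space consumed across all replicas:
$ hadoop fs -du -s /path/to/folder
With this example's data, the first column would show roughly 1887437 (the 1.8 MB figure above).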
Calculate the value for this parameter: pig.maxCombinedSplitSize = total dataset size in bytes / desired number of mappers. For example, 1887436.8 bytes / 37 mappers is approximately 51012 bytes; rounding down to the convenient value 49152 (48 KB) gives approximately 1887436.8 / 49152 = 38-39 combined splits.
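The same arithmetic can be scripted; here is a quick sketch using ordinary shell variables:
$ # TOTAL_BYTES and DESIRED_MAPPERS are illustrative names, not Spotfire Data Science settings
$ TOTAL_BYTES=$(hadoop fs -du -s /path/to/folder | awk '{print $1}')
$ DESIRED_MAPPERS=37
$ echo $(( TOTAL_BYTES / DESIRED_MAPPERS ))
With the figures above, this prints a value of about 51000, which is then rounded to 49152 as described in the previous step.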
In the data source connection, specify this parameter: pig.maxCombinedSplitSize = 49152.
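Outside of the data source connection, the same Pig property can be set when running a script directly. This is only a sketch of the generic Pig mechanism (Pig combines small input splits when pig.splitCombination is enabled, which is its default, and pig.maxCombinedSplitSize caps the combined split size); my_script.pig is a placeholder:
$ # my_script.pig is a placeholder for your own Pig script
$ pig -Dpig.maxCombinedSplitSize=49152 -f my_script.pig
Equivalently, a script can include "set pig.maxCombinedSplitSize 49152;". Note that -D properties generally need to appear before other pig arguments.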
Drag the "dataset" to the canvas.
Connect it to the Column Filter operator, which is a Pig operator.
Check the Resource Manager (RM) for the Pig job (Column Filter) to determine the number of mappers being used. In this case: 39.
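If you prefer the command line to the RM web UI, the number of launched map tasks for a finished job can also be read from its counters. This is a sketch; the job ID is a placeholder for the real ID shown in the RM:
$ # replace job_xxxxxxxxxxxxx_xxxx with the actual job ID from the Resource Manager
$ mapred job -counter job_xxxxxxxxxxxxx_xxxx org.apache.hadoop.mapreduce.JobCounter TOTAL_LAUNCHED_MAPS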
Solution for MapReduce Operators
For MapReduce operators, such as Alpine Forest, you can pass the dataset through a Column Filter operator as above, selecting all columns.
Notice the MapReduce operator (Alpine Forest) uses 39 mappers. 38 part-* files and one metadata file will be generated.
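To double-check the consolidation from the command line (assuming the operator wrote its results to the hypothetical HDFS path /path/to/output):
$ # /path/to/output is a placeholder for the operator's actual output location
$ hadoop fs -ls /path/to/output | grep -c 'part-'
38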
Reference: Which operators use Pig?
Explore Operators
Bar Chart, Box Plot, Frequency, Histogram, Scatter Plot Matrix
Transform Operators
Aggregation, Column Filter, Join*, Null Value Replacement, Row Filter, Variable