I have hundreds or even thousands of small files generated by MapReduce jobs. I'd like to use these files for further analysis, but there are far too many of them! Is there any way to consume all of these small files as a few larger files? By default, MapReduce spawns as many mappers as there are input splits. We'd like to use fewer mappers. Please help!
Issue/Introduction
Dealing with a large number of small files is a well-known problem in Hadoop. This article shows how TIBCO Spotfire Data Science handles it.
Resolution
Solution for Pig Operators
Choose a dataset that contains many "part-*" files. The dataset in this example has 112 part-* files.
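If you are not sure how many part-* files a dataset contains, you can count them from the command line, as in this sketch (using the same /path/to/folder placeholder as below):
$ hadoop fs -ls /path/to/folder | grep -c 'part-'
112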
Calculate the total size of your dataset using the command shown below. In this example, the folder holds 1.8 MB, or approximately 1887436.8 bytes:
$ hadoop fs -du -s -h /path/to/folder
1.8 M 5.4 M /path/to/folder
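The -h flag prints human-readable sizes. Since the next step needs the size in bytes, you can drop -h; the first column of the summary is the total size in bytes, and the second is the space consumed across all replicas:
$ hadoop fs -du -s /path/to/folder
With this example's data, the first column would show roughly 1887437 (the 1.8 MB figure above).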
Calculate the value for this parameter: pig.maxCombinedSplitSize = total dataset size in bytes / desired number of mappers. For example, 1887436.8 bytes / 37 mappers is approximately 51012 bytes; rounding down to the convenient value 49152 (48 KB) gives approximately 1887436.8 / 49152 = 38-39 combined splits.
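The same arithmetic can be scripted; here is a quick sketch using ordinary shell variables:
$ # TOTAL_BYTES and DESIRED_MAPPERS are illustrative names, not Spotfire Data Science settings
$ TOTAL_BYTES=$(hadoop fs -du -s /path/to/folder | awk '{print $1}')
$ DESIRED_MAPPERS=37
$ echo $(( TOTAL_BYTES / DESIRED_MAPPERS ))
With the figures above, this prints a value of about 51000, which is then rounded to 49152 as described in the previous step.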
In the data source connection, specify this parameter: pig.maxCombinedSplitSize = 49152.
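Outside of the data source connection, the same Pig property can be set when running a script directly. This is only a sketch of the generic Pig mechanism (Pig combines small input splits when pig.splitCombination is enabled, which is its default, and pig.maxCombinedSplitSize caps the combined split size); my_script.pig is a placeholder:
$ # my_script.pig is a placeholder for your own Pig script
$ pig -Dpig.maxCombinedSplitSize=49152 -f my_script.pig
Equivalently, a script can include "set pig.maxCombinedSplitSize 49152;". Note that -D properties generally need to appear before other pig arguments.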
Drag the "dataset" to the canvas.
Connect it to the Column Filter operator, which is a Pig operator.
Check the Resource Manager (RM) for the Pig job (Column Filter) to determine the number of mappers being used. In this case: 39.
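If you prefer the command line to the RM web UI, the number of launched map tasks for a finished job can also be read from its counters. This is a sketch; the job ID is a placeholder for the real ID shown in the RM:
$ # replace job_xxxxxxxxxxxxx_xxxx with the actual job ID from the Resource Manager
$ mapred job -counter job_xxxxxxxxxxxxx_xxxx org.apache.hadoop.mapreduce.JobCounter TOTAL_LAUNCHED_MAPS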
Solution for MapReduce Operators
For MapReduce operators, such as Alpine Forest, you can pass the dataset through a Column Filter operator as above, selecting all columns.
Notice the MapReduce operator (Alpine Forest) uses 39 mappers. 38 part-* files and one metadata file will be generated.
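To double-check the consolidation from the command line (assuming the operator wrote its results to the hypothetical HDFS path /path/to/output):
$ # /path/to/output is a placeholder for the operator's actual output location
$ hadoop fs -ls /path/to/output | grep -c 'part-'
38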
Reference: Which operators use Pig?
Explore Operators
Bar Chart, Box Plot, Frequency, Histogram, Scatter Plot Matrix
Transform Operators
Aggregation, Column Filter, Join*, Null Value Replacement, Row Filter, Variable