Thinking this over a bit, it may make sense to build a little background before jumping directly to map joins.
Imagine you have to write a program (assume we are in 1990, before big data existed) that takes two files, reads them, joins them, and writes the result to another file. There may be better ways, but the most common approach that comes to mind is:
- Sort file A
- Sort file B
- Read A and B simultaneously, compare the keys, advance the side whose key is lagging (the two files may not have the same record count, like customers and orders), and keep writing to an output file as keys match.
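The steps above can be sketched in Python. This is a minimal illustration, not the exact 1990 program: the files are replaced by in-memory (key, value) lists, and the names (`merge_join`, the sample customer/order data) are made up for the example.

```python
def merge_join(sorted_a, sorted_b):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Advances ("stalls on") whichever side has the lagging key, the way the
    two-file program would. The right side may repeat keys (customer -> orders).
    """
    out = []
    i, j = 0, 0
    while i < len(sorted_a) and j < len(sorted_b):
        ka, va = sorted_a[i]
        kb, _ = sorted_b[j]
        if ka < kb:
            i += 1          # left side is lagging: advance it
        elif ka > kb:
            j += 1          # right side is lagging: advance it
        else:
            # keys match: emit every right-side record with this key
            k = j
            while k < len(sorted_b) and sorted_b[k][0] == ka:
                out.append((ka, va, sorted_b[k][1]))
                k += 1
            i += 1
    return out

# Step 1 and 2: sort file A and file B; step 3: merge them.
customers = sorted([(2, "bob"), (1, "ann")])
orders = sorted([(1, "o-10"), (2, "o-11"), (2, "o-12"), (3, "o-13")])
joined = merge_join(customers, orders)
# joined == [(1, "ann", "o-10"), (2, "bob", "o-11"), (2, "bob", "o-12")]
```

Note that order key 3 produces no output row: it has no matching customer, which is exactly the case the "stall on the lagging key" step handles.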
The key point is that unless the files are sorted, the join is not possible. This also extends to the big data world: unless the records are sorted on the join key, the join simply cannot happen.
Remember: underneath, Hive is MapReduce.
Let's think of the above scenario in the MapReduce world. A schematic representation would look like the one below.
If you look at the diagram above, the mappers deal only with their individual files; the sort-and-shuffle phase is where the records come together, get sorted, and have their keys matched. This is where it gets expensive.
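The same flow can be simulated in a few lines of Python. This is a toy model of the common (reduce-side) join, not real Hadoop code: the source tags "C"/"O" and the function name are assumptions for illustration.

```python
from collections import defaultdict

def reduce_side_join(customers, orders):
    """Simulate a reduce-side (common) join.

    Map phase: each mapper reads only its own file and tags every record
    with its source. Shuffle phase: ALL records from both inputs are moved
    and sorted by key -- the expensive part. Reduce phase: the two sides
    are joined per key.
    """
    # map phase: tag each record with the table it came from
    mapped = [(k, ("C", v)) for k, v in customers]
    mapped += [(k, ("O", v)) for k, v in orders]

    # sort-and-shuffle phase: every record is sorted and grouped by key
    groups = defaultdict(list)
    for k, tagged in sorted(mapped):
        groups[k].append(tagged)

    # reduce phase: join the customer side with the order side per key
    out = []
    for k, records in groups.items():
        left = [v for tag, v in records if tag == "C"]
        right = [v for tag, v in records if tag == "O"]
        for lv in left:
            for rv in right:
                out.append((k, lv, rv))
    return out
```

Notice that the full contents of both inputs pass through the sorted/grouping step, which is what the map join below tries to avoid.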
A map join is basically a hint to do things a little differently, and it is situational: when one file is small enough to be saved to the distributed cache, the join can happen in the map phase. Confusing?... the following diagram shows how it would look.
The small table in the distributed cache is held as a hash map, which supports fast lookups. Mappers 1 and 2 use that hash map to look up values and join the associated attributes to their own dataset before going through the sort and shuffle. This reduces the load on the sort-and-shuffle phase.
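In the same toy model, the map join looks like this. Again a sketch with assumed names, not Hadoop API code: the Python dict stands in for the hash map each mapper builds from the distributed cache.

```python
def map_side_join(small_table, big_table_stream):
    """Simulate a map join.

    The small table is broadcast (via the distributed cache) to every
    mapper as an in-memory hash map; each mapper then joins its own split
    of the big table with a plain lookup. No sort-and-shuffle is needed
    to perform the join itself.
    """
    cache = dict(small_table)      # the "distributed cache" hash map
    out = []
    for key, value in big_table_stream:
        if key in cache:           # O(1) lookup instead of a sorted merge
            out.append((key, cache[key], value))
    return out

stores = [(1, "NYC"), (2, "LA")]                 # small table
sales = [(1, 9.99), (2, 4.50), (3, 2.25)]        # one mapper's split
result = map_side_join(stores, sales)
# result == [(1, "NYC", 9.99), (2, "LA", 4.5)]
```

The trade-off is memory: the whole small table must fit in each mapper's memory, which is why the hint only makes sense when one side is genuinely small.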
That's a 2,000-foot view of map joins.
Important Note:
Recent versions of Hive know to do automatic map joins when table statistics are available to the optimizer. However, if statistics are not available, the optimizer won't know which table can be used as the hash map. Also, for subqueries and intermediate datasets, the optimizer won't know during the optimization phase how to map join (since the intermediate datasets are only created at run time). This is where a human touch of giving map join hints can save a lot of time.
The following is an example of a map join:
SELECT /*+ MAPJOIN(STORETABLE) */ COUNT(*) FROM
SALESTABLE JOIN STORETABLE ON (SALESTOREKEY = STOREKEY)
MAPJOIN is a hint keyword, similar to optimizer hints in the RDBMS world. If you are interested in understanding the concepts in more detail, it may be worthwhile to write a MapReduce job yourself that uses the distributed cache with a hash map.