The Hadoop distribution from MapR is a bit different and takes some tweaking to get used to. The distributed filesystem is replaced with MapR’s own C++ implementation. It is proprietary but has some nice features. HDFS with CDH, where all the meta information lives in the name node, is pretty scary. We had multiple incidents where the name node did not checkpoint for some reason, and we had to change the code to overcome the problem, with the risk of losing data in the cluster.
Currently there are multiple Hadoop clusters in our infrastructure; the biggest one is about 200 nodes and about 1 petabyte. The nicest thing about MapR is the ability to have multiple redundant name nodes and job trackers.
Oh wow, that’s cool. My job found the nam0001 tracker and resumed right
where it left off.
On 5/24/12 3:25 PM, “John Doe” <email@example.com> wrote:
>Jobtracker seems down.
If you design it correctly, you will also place these redundant nodes in different racks to maximize redundancy.
Having the ability to use NFS also helps in terms of easy access to the data. The idea here is for all engineers to be curious about the data in multiple unstructured repositories using Hadoop.
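To give a feel for what NFS access means in practice, here is a minimal sketch that treats the cluster as an ordinary directory tree and counts files and bytes under it with plain Python file calls. The mount path `/mapr/my.cluster.com/data`, the `CLUSTER_NFS_ROOT` environment variable, and the `summarize_tree` helper are all assumptions made up for illustration; substitute whatever path your cluster is actually mounted at.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip anything that is not a regular file (e.g. broken symlinks).
            if os.path.isfile(path):
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


if __name__ == "__main__":
    # Hypothetical mount point; this path is an assumption, not taken
    # from any real cluster configuration.
    root = os.environ.get("CLUSTER_NFS_ROOT", "/mapr/my.cluster.com/data")
    if os.path.isdir(root):
        count, size = summarize_tree(root)
        print(f"{root}: {count} files, {size} bytes")
    else:
        print(f"{root} is not mounted on this machine")
```

Nothing here is Hadoop-specific, which is the point: once the data is reachable over NFS, any engineer can poke at it with the usual tools instead of writing a job first.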
Using M3 versus M5:
- M3 is not really high availability, but as a distribution it is very stable, and it is free.
- M5 is subject to licensing with MapR and is not free, but you probably do not need a team hand-holding the system to keep it up.
Volumes are a bit of a pain, since you have to use them intelligently to maximize speed in the filesystem.
The GUI is really nice and has APIs for driving it; installing Hive and other things is also easy.
MapR is helpful in assisting with configurations, and you should use their support to avoid known issues.