VOOZH about

URL: https://www.javacodegeeks.com/2013/10/amazon-elastic-map-reduce-to-compute-recommendations-with-apache-mahout.html

⇱ Amazon Elastic Map Reduce to compute recommendations with Apache Mahout


Apache Mahout is a “scalable machine learning library” which, among others, contains implementations of various single-node and distributed recommendation algorithms. In my last blog post I described how to implement an on-line recommender system processing data on a single node. What if the data is too large to fit into memory (>100M preference data points)? Then we have no choice, but to take a look at Mahout’s distributed recommenders implementation!

The distributed recommender is based on Apache Hadoop; it’s a job which takes as input a list of user preferences, computes an item co-occurence matrix, and outputs top-K recommendations for each user. For an introductory blog on how this works and how to run it locally, see for example this blog post.

We can of course run this job on a custom Hadoop cluster, but it’s much faster (and less painful) to just use a pre-configured one, like EMR. However, there’s a slight problem. The latest Hadoop version that is available on EMR is 1.0.3, and it contains jars for Apache Lucene 2.9.4. However, the recommender job depends on Lucene 4.3.0, which results in the following beautiful stack trace:

2013-10-04 11:05:03,921 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.NoSuchMethodError: org.apache.lucene.util.PriorityQueue.<init>(I)V
 at org.apache.mahout.math.hadoop.similarity.cooccurrence.TopElementsQueue.<init>(TopElementsQueue.java:33)
	at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$UnsymmetrifyMapper.
map(RowSimilarityJob.java:405)
	at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$UnsymmetrifyMapper.
map(RowSimilarityJob.java:389)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

How to solve this? Well, we “just” need to update Lucene in the EMR Hadoop installation. We can use a bootstrap action for that. Here are the exact steps:

  1. Download lucene-4.3.0.tgz (e.g. from here) and upload it into a S3 bucket; make the file public.
  2. Upload this script to the bucket as well; call it e.g. update-lucene.sh:
    #!/bin/bash
    cd /home/hadoop
    wget https://s3.amazonaws.com/bucket_name/bucket_path/lucene-4.3.0.tgz
    tar -xzf lucene-4.3.0.tgz
    cd lib
    rm lucene-*.jar
    cd ..
    cd lucene-4.3.0
    find . | grep lucene- | grep jar$ | xargs -I {} cp {} ../lib

    This script will be run on the Hadoop nodes and will update the Lucene version. Make sure to change the script and enter the correct bucket name and bucket path, so that it points to the public Lucene archive.

And that’s it! You can now create and run the job flow, and after a couple of minutes/hours/days you’ll have the results waiting on S3.
 

Do you want to know how to develop your skillset to become a Java Rockstar?
Subscribe to our newsletter to start Rocking right now!
To get you started we give you our best selling eBooks for FREE!
1. JPA Mini Book
2. JVM Troubleshooting Guide
3. JUnit Tutorial for Unit Testing
4. Java Annotations Tutorial
5. Java Interview Questions
6. Spring Interview Questions
7. Android UI Design
and many more ....
I agree to the Terms and Privacy Policy

Thank you!

We will contact you soon.

👁 Photo of Adam Warski
Adam Warski
October 17th, 2013Last Updated: October 17th, 2013
0 115 1 minute read

Adam Warski

Adam is one of the co-founders of SoftwareMill, a company specialising in delivering customised software solutions. He is also involved in open-source projects, as a founder, lead developer or contributor to: Hibernate Envers, a Hibernate core module, which provides entity versioning/auditing capabilities; ElasticMQ, an SQS-compatible messaging server written in Scala; Veripacks, a tool to specify and verify inter-package dependencies, and others.
Subscribe

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Oldest
Newest Most Voted
Back to top button
Close
wpDiscuz