Apache Mahout: Getting started

URL: https://www.javacodegeeks.com/2012/02/apache-mahout-getting-started.html

Recently I got an interesting problem to solve: how to automatically classify text from different sources. Some time ago I read about a project that does this, along with many other text analysis tasks: Apache Mahout. Though it’s not very mature yet (the current version is 0.4), it’s very powerful and scalable. Built on top of another excellent project, Apache Hadoop, it’s capable of analyzing huge data sets.

So I did a small project to understand how Apache Mahout works. I decided to use Apache Maven 2 to manage all dependencies, so I will start with the POM file.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.acme</groupId>
  <artifactId>mahout</artifactId>
  <version>0.94</version>
  <name>Mahout Examples</name>
  <description>Scalable machine learning library examples</description>
  <packaging>jar</packaging>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <apache.mahout.version>0.4</apache.mahout.version>
  </properties>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <encoding>UTF-8</encoding>
          <source>1.6</source>
          <target>1.6</target>
          <optimize>true</optimize>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>${apache.mahout.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>${apache.mahout.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-utils</artifactId>
      <version>${apache.mahout.version}</version>
    </dependency>

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.6.0</version>
    </dependency>

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-jcl</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
</project>

Then I looked into the Apache Mahout examples and the algorithms available for text classification. The simplest one that is still accurate is the Naive Bayes classifier. Here is a code snippet:

package org.acme;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.TrainClassifier;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;

public class Starter {
    public static void main( final String[] args ) {
        // Settings shared by training and classification
        final BayesParameters params = new BayesParameters();
        params.setGramSize( 1 );
        params.set( "verbose", "true" );
        params.set( "classifierType", "bayes" );
        params.set( "defaultCat", "OTHER" );
        params.set( "encoding", "UTF-8" );
        params.set( "alpha_i", "1.0" );
        params.set( "dataSource", "hdfs" );
        params.set( "basePath", "/tmp/output" );

        BufferedReader reader = null;
        try {
            // Train the model from the labeled files in /tmp/input
            Path input = new Path( "/tmp/input" );
            TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );

            // Load the trained model into memory
            Algorithm algorithm = new BayesAlgorithm();
            Datastore datastore = new InMemoryBayesDatastore( params );
            ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
            classifier.initialize();

            // Classify the file given as the first argument, one line at a time
            reader = new BufferedReader( new FileReader( args[ 0 ] ) );
            String entry = reader.readLine();

            while( entry != null ) {
                List< String > document = new NGrams( entry,
                    Integer.parseInt( params.get( "gramSize" ) ) )
                    .generateNGramsWithoutLabel();

                ClassifierResult result = classifier.classifyDocument(
                    document.toArray( new String[ document.size() ] ),
                    params.get( "defaultCat" ) );

                // Print the predicted category and its score for this line
                System.out.println( result.getLabel() + " (" + result.getScore() + "): " + entry );

                entry = reader.readLine();
            }
        } catch( final IOException ex ) {
            ex.printStackTrace();
        } catch( final InvalidDatastoreException ex ) {
            ex.printStackTrace();
        } finally {
            // Close the input file even if classification fails
            if( reader != null ) {
                try {
                    reader.close();
                } catch( final IOException ex ) {
                    ex.printStackTrace();
                }
            }
        }
    }
}
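To give a feel for what a Naive Bayes text classifier does, here is a minimal, self-contained sketch of a plain multinomial Naive Bayes over unigrams with additive smoothing. It is not Mahout's implementation (Mahout 0.4 also produces TF-IDF and normalizer data during training), and the class and method names (NaiveBayesSketch, train, classify, tokenize) are made up for illustration. The alpha value only loosely mirrors the "alpha_i" setting above, and the default label mirrors "defaultCat".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial Naive Bayes with additive smoothing; an illustration, not Mahout's code. */
class NaiveBayesSketch {
    private final double alpha; // smoothing constant (assumed to play the role of "alpha_i")
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs;

    NaiveBayesSketch(double alpha) {
        this.alpha = alpha;
    }

    /** Lower-cases the text and splits it into unigrams on anything but letters, digits and apostrophes. */
    static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("[^\\p{L}\\p{Nd}']+");
    }

    /** Adds one labeled example to the per-label token counts. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-probability, or defaultLabel if nothing was trained. */
    String classify(String text, String defaultLabel) {
        String best = defaultLabel;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log P(label) + sum over tokens of log P(token | label)
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            double denominator = tokensPerLabel.getOrDefault(label, 0) + alpha * vocabulary.size();
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                score += Math.log((counts.getOrDefault(token, 0) + alpha) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch classifier = new NaiveBayesSketch(1.0);
        classifier.train("SUGGESTION", "That's a great suggestion");
        classifier.train("QUESTION", "Do you sell Microsoft Office?");
        System.out.println(classifier.classify("Do you sell laptops?", "OTHER")); // prints QUESTION
    }
}
```

With only two training lines the result is fragile, which is exactly why the amount of training data matters so much.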

There is one important note here: the system must be trained before it can classify anything. To do so, you need to provide examples (the more, the better) of text for each category. These should be simple files in which each line starts with the category, separated by a tab from the text itself. For example:

SUGGESTION	That's a great suggestion
QUESTION	Do you sell Microsoft Office?
...

The more files you provide, the more precise the classification will be. All files must be placed in the ‘/tmp/input’ folder; they will be processed by Apache Hadoop first. :)
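Since a single malformed line can quietly skew the training data, it may help to check the files before copying them to the input folder. Below is a small sketch of such a check, assuming the "category, tab, text" layout described above. The class and method names (TrainingFileCheck, countByCategory) are hypothetical and not part of Mahout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts training examples per category in "CATEGORY<TAB>text" lines. */
class TrainingFileCheck {

    /** Reads all lines and returns the number of examples per category, sorted by category name. */
    static Map<String, Integer> countByCategory(Reader source) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (line.trim().isEmpty()) {
                continue; // blank lines are skipped
            }
            int tab = line.indexOf('\t');
            // Reject lines with no tab, an empty category, or no text after the tab
            if (tab <= 0 || line.substring(tab + 1).trim().isEmpty()) {
                throw new IllegalArgumentException(
                    "Line " + lineNumber + " is not CATEGORY<TAB>text: " + line);
            }
            counts.merge(line.substring(0, tab), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String sample = "SUGGESTION\tThat's a great suggestion\n"
                      + "QUESTION\tDo you sell Microsoft Office?\n";
        System.out.println(countByCategory(new StringReader(sample)));
        // prints {QUESTION=1, SUGGESTION=1}
    }
}
```

To check a real file, pass a java.io.FileReader instead of the StringReader; a heavily unbalanced count per category is a hint that more examples are needed.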

Reference: Getting started with Apache Mahout from our JCG partner Andrey Redko at the Andriy Redko {devmind}.

Andrey Redko

Andriy is a well-grounded software developer with more than 12 years of practical experience using Java/EE, C#/.NET, C++, Groovy, Ruby, functional programming (Scala), databases (MySQL, PostgreSQL, Oracle) and NoSQL solutions (MongoDB, Redis).

10 Comments

Ali

12 years ago

Hi, nice tutorial.
I am able to run the code. I tested it with a sample file that contains a series of QUESTION and SUGGESTION lines, and I gave it a test file consisting of question and suggestion sentences without any label.
In the output directory I get three folders: “trainer-tfIdf”, “trainer-thetaNormalizer” and “trainer-weights”.

How can I see the output?

Can you please help?

Andriy Redko

12 years ago

Reply to Ali

Hi Ali,

Thank you for your comment. The variable ‘result’ of type ‘ClassifierResult’ contains the classification (including scores) for a particular text or message. You can print it to the console or write it to another file. Please note that the post targeted version 0.4 of Apache Mahout. The current version is 0.7 and unfortunately the two are not compatible at all.

Please let me know if it’s helpful.
Thank you.

Best Regards,
Andriy Redko

aparnesh gaurav

12 years ago

Thanks for sharing this.

aparnesh gaurav

12 years ago

Hi,

Q1. Does the above algorithm work on a distributed framework? (Assuming that we are keeping the input file in HDFS.)
Q2. Is the output folder referred to here in HDFS?
Q3. I don’t see any map-reduce code here, so shall I assume that only HDFS is used here, with no parallel processing, because no map-reduce code is written here?

Regards,
Aparnesh

Andriy Redko

12 years ago

Reply to aparnesh gaurav

Hi Aparnesh,

Before answering your questions I would like to mention one important detail: at the time the post was published, Mahout was at version 0.4. The current stable version of Mahout is 0.7 and it’s very, very different from the old one. So with respect to Mahout 0.4:

Q1 -> No, it won’t.
Q2 -> Yes, ‘basePath’ could point to an HDFS location.
Q3 -> Right, M/R is not a part of this example.

Mahout 0.7 would give you a full M/R processing pipeline, but the code from the post won’t work or compile anymore. This blog post can give more insights:…

aparnesh gaurav

12 years ago

Reply to Andriy Redko

Thanks for your prompt reply. And yes, Chimpler has rich information regarding this.

http://chimpler.wordpress.com/2013/02/20/playing-with-the-mahout-recommendation-engine-on-a-hadoop-cluster/

Regards,
Aparnesh

Amitesh

9 years ago

Thank you for sharing the example. I am new to Apache Mahout. I tried to use your code in my environment, but I am facing issues. I know that the post is old, and you may not reply to my query, but I am writing because I am stuck. I tried to configure Maven in my environment, but due to company policy I was not able to configure it successfully, so I decided to resolve the dependencies myself and installed all the required dependencies one by one. But I am not able to find one last library, I tried looking on…

Andriy Redko

9 years ago

Reply to Amitesh

Hi Amitesh,

Exceptions like this are an indication of a Hadoop version mismatch, unfortunately. I would suggest you look at the recent Apache Mahout documentation (https://mahout.apache.org/); a LOT of things have changed since the blog post was published. The good news is that you may get the desired results much, much faster :)

Thank you.

Best Regards,
Andriy Redko

Amitesh

9 years ago

Reply to Andriy Redko

Thank you for your reply, and I will look into the version part for sure. However, I would like to bring to your notice that my Eclipse is on Windows and my Mahout is installed on Linux. I didn’t run the code on my Mahout box yet; I executed it inside Eclipse on my Windows machine. Since I was facing issues, I have not touched Linux until now.

Andriy Redko

9 years ago

Reply to Amitesh

Hi Amitesh,

Yes, I understand that you run everything from your Eclipse. The issue though is still caused by Java libraries. I see at least mahout-core0.7.jar and mahout-core0.8.jar in the list, which are conflicting versions. For the example of the article you need 0.7 only. Thank you.

Best Regards,
Andriy Redko
