This is the second post in a three-part series on the current state of play for machine learning in Hadoop. Part One is here. In this post, we cover open source options.

As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines. This post focuses on machine learning, but it’s worth noting that open source options exist for the other categories as well: Impala and Shark for fast queries; Storm, S4 and Spark Streaming for streaming analytics; and Giraph, GraphLab and Spark GraphX for graph engines.

Tools for machine learning in Hadoop can be classified into two main categories:

  • Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
  • Fully distributed machine learning software that integrates with Hadoop

There are two major open source projects in the first…
