This is the second of a three-part series on the current state of play for machine learning in Hadoop. Part One is here. In this post, we cover open source options.
As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines. This post will focus on machine learning, but it’s worth noting that open source options for fast queries include Impala and Shark; for streaming analytics, Storm, S4 and Spark Streaming; and for graph engines, Giraph, GraphLab and Spark GraphX.
Tools for machine learning in Hadoop can be classified into two main categories:
- Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
- Fully distributed machine learning software that integrates with Hadoop
There are two major open source projects in the first…
In this post we discuss what HBase users should know about one of the internal parts of HBase: the Memstore. Understanding the underlying processes related to the Memstore will help you configure an HBase cluster for better performance.
Let’s take a look at the write and read paths in HBase to understand what the Memstore is, and where, how and why it is used.
(picture was taken from Intro to HBase Internals and Schema Design presentation)
When a RegionServer (RS) receives a write request, it directs the request to a specific Region. Each Region stores a set of rows. Row data can be split across multiple column families (CFs). The data of a particular CF is stored in an HStore, which consists of a Memstore and a set of HFiles. The Memstore is kept in the RS main memory, while HFiles are written to HDFS. When a write request is processed, data is first written into the…
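The containment hierarchy described so far (RegionServer, Region, HStore, Memstore plus HFiles) can be pictured with a small toy model. This is only an illustrative sketch, not HBase code: the class names mirror the terms used in the excerpt, but `locate_store`, `contains`, and the idea of modelling a Region as a simple `[start_row, end_row)` key range are assumptions made for this example. It covers only how a request is routed to the right HStore, nothing beyond that.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HStore:
    """Toy stand-in for the storage of one column family in a Region."""
    column_family: str
    # Memstore: kept in RegionServer main memory (modelled here as a dict).
    memstore: Dict[str, bytes] = field(default_factory=dict)
    # HFiles: persisted on HDFS (modelled here as a list of file names).
    hfiles: List[str] = field(default_factory=list)


@dataclass
class Region:
    """Toy Region: a set of rows, assumed here to be the range [start_row, end_row)."""
    start_row: str
    end_row: Optional[str]  # None means "no upper bound" in this sketch
    stores: Dict[str, HStore] = field(default_factory=dict)

    def contains(self, row: str) -> bool:
        # Hypothetical helper: does this Region hold the given row key?
        return self.start_row <= row and (self.end_row is None or row < self.end_row)


@dataclass
class RegionServer:
    """Toy RegionServer that directs a request to a Region, then to an HStore."""
    regions: List[Region] = field(default_factory=list)

    def locate_store(self, row: str, column_family: str) -> HStore:
        # Hypothetical helper: find the Region for the row, then the HStore for the CF.
        for region in self.regions:
            if region.contains(row):
                return region.stores[column_family]
        raise KeyError(f"no region serves row {row!r}")


# Usage: two Regions, each with one column family named "cf1" (made-up names).
rs = RegionServer(regions=[
    Region("", "m", {"cf1": HStore("cf1")}),
    Region("m", None, {"cf1": HStore("cf1")}),
])
store = rs.locate_store("apple", "cf1")
```

The point of the sketch is the shape of the lookup: row key selects the Region, column family selects the HStore, and each HStore owns exactly one Memstore alongside its HFiles.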
“Real or near-real time information delivery is one of the defining characteristics of big data analytics.”