The Shape of Data

I’m going to start this post with a confession: Up until a few days ago, the only thing I knew about p-values was that Randall Munroe didn’t seem to like them. My background is in geometry, not statistics, even though I occasionally try to fake it. But it turns out that a lot of other people don’t like p-values either, such as the journal Basic and Applied Social Psychology which recently banned them. So I decided to do some reading (primarily Wikipedia) and it turns out, like most things in the world of data, there’s some very interesting geometry involved, at least if you know where to look.

View original post 1,682 more words

Machine Learning in Hadoop: Part Two


This is the second of a three-part series on the current state of play for machine learning in Hadoop.  Part One is here.  In this post, we cover open source options.

As we noted in Part One, machine learning is one of several technologies for analytics; the broader category also includes fast queries, streaming analytics and graph engines.   This post will focus on machine learning, but it’s worth nothing that open source options for fast queries include Impala and Shark; for streaming analytics Storm, S4 and Spark Streaming; for graph engines Giraph, GraphLab and Spark GraphX.

Tools for machine learning in Hadoop can be classified into two main categories:

  • Software that enables integration between legacy machine learning tools and Hadoop in a “run-beside” architecture
  • Fully distributed machine learning software that integrates with Hadoop

There are two major open source projects in the first…

View original post 739 more words

Configuring HBase Memstore: What You Should Know

In this post we discuss what HBase users should know about one of the internal parts of HBase: the Memstore. Understanding underlying processes related to Memstore will help to configure HBase cluster towards better performance.

HBase Memstore

Let’s take a look at the write and read paths in HBase to understand what Memstore is, where how and why it is used.

Memstore Usage in HBase Read/Write Paths Memstore Usage in HBase Read/Write Paths

(picture was taken from Intro to HBase Internals and Schema Design presentation)

When RegionServer (RS) receives write request, it directs the request to specific Region. Each Region stores set of rows. Rows data can be separated in multiple column families (CFs). Data of particular CF is stored in HStore which consists of Memstore and a set of HFiles. Memstore is kept in RS main memory, while HFiles are written to HDFS. When write request is processed, data is first written into the…

View original post 1,535 more words

Basic XML processing with Scala


Topics: XML, Scala XML API, XML literals, marshalling


Pretty much everybody knows what XML is: it is a structured, machine-readable text format for representing information that can be easily checked for the “grammaticality” of the tags, attributes, and their relationship to each other (e.g. using DTD’s). This contrasts with HTML, which can have elements that don’t close (e.g. <p>foo<p>bar rather than <p>foo</p><p>bar</p>) and still be processed. XML was only ever meant to be a format for machines, but it morphed into a data representation that many people ended up (unfortunately, for them) editing by hand. However, even as a machine readable format it has problems, such as being far more verbose than is really required, which matters quite a bit when you need to transfer lots of data from machine to machine — in the next post, I’ll discuss JSON and Avro, which can be viewed as…

View original post 1,574 more words