If it won't be simple, it simply won't be. By Miki Tebeka, CEO, 353Solutions

Friday, September 14, 2012

Using Hadoop Streaming With Avro

One of the ways to use Python with Hadoop is via Hadoop Streaming. However, it's geared mostly toward text-based formats, and at work we mostly use Avro.

It took me a while to figure out the magic, but here it is. Note that the input to the mapper is one JSON object per line.
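To make the "one JSON object per line" part concrete, here is a minimal sketch of what the Python side of such a mapper could look like. It is not the original script: the `name` field and the count-per-key output are made up for illustration, and the only thing it relies on is that every line on stdin is a JSON-encoded record.

```python
"""Sketch of a Hadoop Streaming mapper that reads one JSON object per line."""
import json
import sys


def map_lines(lines, key_field="name"):
    """Yield (key, 1) for every JSON record found in `lines`.

    `key_field` is a hypothetical field name, used only for illustration.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # only handle JSON objects, not bare values
        yield record.get(key_field), 1


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Write tab-separated key/value pairs, the usual streaming output."""
    for key, count in map_lines(stdin):
        stdout.write("%s\t%d\n" % (key, count))
```

To use this as the actual mapper script, call `main()` at the bottom of the file. Skipping lines that fail to parse is a design choice for the sketch; depending on the job you might prefer to count them or fail loudly instead.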

Note that it's a bit old (Avro is now at 1.7.4); it's originally from here.

2 comments:

Alex said...

I've been trying to do something like this. But the JSON that comes out is 10% of the time mangled. Have you had this experience?

Miki Tebeka said...

No I haven't. Probably in the near future I'll have more experience with this method and we'll see.

Did you try a newer avro release? (it's currently at 1.7.4)
