If it won't be simple, it simply won't be. [source code] by Miki Tebeka, CEO, 353Solutions

Wednesday, December 19, 2012

Timing Your Code

It's a good idea to time portions of your code and have some metric you monitor. This way you can see trends and solve bottlenecks before someone notices (hopefully). Timing functions is easy with decorators, but sometimes you want to time a portion of a function. For this you can use a context manager.
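Here is a minimal sketch of such a context manager, assuming Python 3. The names `timed` and `process` and the `report` parameter are made up for illustration; in practice `report` would send the number to whatever metric system you monitor.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name, report=print):
    """Time the enclosed block and pass a message to `report`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failures are timed too.
        elapsed = time.perf_counter() - start
        report('{}: {:.6f} seconds'.format(name, elapsed))


def process(items):
    total = sum(items)  # this part of the function is not timed
    with timed('squares'):  # only this portion is timed
        squares = [i * i for i in items]
    return total, squares
```

Because the timing lives in a `with` block, you can wrap any portion of a function without splitting it into smaller functions.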

Tuesday, December 11, 2012

Tuesday, November 20, 2012

Last Letter Frequency

I was playing a game with my child where you say a word, then the other person needs to say a word that starts with the last letter of your word, then you need to say a word starting with their last letter ...

We noticed that many words end with S and E, which made me curious about the frequency of the last letter in English words. matplotlib makes it super easy to visualize the results.
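The counting part can be sketched in a few lines. This is only an illustration: the word list below is a made-up sample, and a real run would read a full word list (for example a system dictionary file, if your machine has one).

```python
from collections import Counter


def last_letter_freq(words):
    """Return a Counter mapping last letter -> number of words ending in it."""
    counts = Counter()
    for word in words:
        word = word.strip().lower()
        if word and word[-1].isalpha():  # skip blanks and non-letter endings
            counts[word[-1]] += 1
    return counts


words = ['apples', 'house', 'cats', 'tree', 'dog']  # made-up sample
freq = last_letter_freq(words)
for letter, count in freq.most_common():
    print(letter, count)
```

From here, plotting is a matter of passing the letters and counts to a matplotlib bar chart.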

Friday, November 16, 2012

Python For Data Analysis

Just finished reading Python For Data Analysis. It's a great book with lots of practical examples. Highly recommended.

Thursday, October 25, 2012

Mocking HTTP Servers

Sometimes, httpbin is not enough, and you need your own custom HTTP server for testing.
Here's a small example of how to do that using the built-in SimpleHTTPServer (thanks @noahsussman for reminding me).
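A minimal sketch of the idea, assuming Python 3, where the SimpleHTTPServer functionality lives in the `http.server` module. The handler, the canned JSON reply and the helper names (`MockHandler`, `start_mock_server`, `fetch`) are all made up for illustration.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockHandler(BaseHTTPRequestHandler):
    """Reply to every GET with a canned JSON document."""

    def do_GET(self):
        body = json.dumps({'path': self.path, 'ok': True}).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the test output quiet


def start_mock_server():
    """Start the server on a free port in a background thread."""
    server = HTTPServer(('127.0.0.1', 0), MockHandler)  # port 0: OS picks one
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    return server


def fetch(server, path):
    """GET `path` from the mock server, return (status, decoded JSON)."""
    host, port = server.server_address[:2]
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request('GET', path)
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read().decode('utf-8'))
    finally:
        conn.close()
```

In a test you would call `start_mock_server()` during setup, point the client under test at `server.server_address`, and call `server.shutdown()` when done.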

Monday, October 15, 2012

http://httpbin.org

Sometimes you need to write an HTTP server to debug the client you are writing.

One quick way to avoid this is to use http://httpbin.org/. It supports most of the common HTTP verbs and mostly returns the variables you send in.

For example (note the args field in the reply):

$ curl -i 'http://httpbin.org/get?x=1&y=2'
HTTP/1.1 200 OK
Content-Type: application/json
Date: Mon, 15 Oct 2012 21:50:27 GMT
Server: gunicorn/0.13.4
Content-Length: 386
Connection: keep-alive

{
  "url": "http://httpbin.org/get?x=1&y=2",
  "headers": {
    "Content-Length": "",
    "Connection": "keep-alive",
    "Accept": "*/*",
    "User-Agent": "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3",
    "Host": "httpbin.org",
    "Content-Type": ""
  },
  "args": {
    "y": "2",
    "x": "1"
  },
  "origin": "75.82.8.111"
}

Friday, October 05, 2012

Cleanup After Your Tests - But Be Lazy

It's good practice to clean up after your tests. It's good for various reasons, such as disk space and a "pure" execution environment.

However, if you clean up too eagerly it'll make your debugging much harder. The data just won't be there to see what went wrong.

The solution we found is pretty simple:
  • Try to place all your test output in one location
  • Nuke this location when starting the tests
This way all the information is available after an error, and you don't accumulate too much junk (just one test run junk at a time).
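The two steps above can be sketched like this. The directory name and the helper names are made up; any single directory dedicated to test output will do.

```python
import os
import shutil
import tempfile

# Hypothetical location for everything the tests write.
OUTPUT_DIR = os.path.join(tempfile.gettempdir(), 'myproject-test-output')


def reset_output_dir(path=OUTPUT_DIR):
    """Delete output left by the previous run and create an empty directory."""
    shutil.rmtree(path, ignore_errors=True)  # nuke the previous run
    os.makedirs(path)
    return path


def output_path(name, root=OUTPUT_DIR):
    """Path for a test artifact inside the shared output directory."""
    return os.path.join(root, name)
```

Call `reset_output_dir()` once when the test run starts. There is deliberately no teardown step, so the files from a failed run stay around until the next run begins.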

Thursday, September 20, 2012

Data Wrangling With Python

I just gave a talk at work called "Data Wrangling With Python" which gives an overview on the scientific Python ecosystem. You can view it here.

Friday, September 14, 2012

Using Hadoop Streaming With Avro

One of the ways to use Python with Hadoop is via Hadoop Streaming. However, it's geared mostly toward text-based formats, and at work we mostly use Avro.

Took me a while to figure out the magic, but here it is. Note that the input to the mapper is one JSON object per line.

Note it's a bit old (Avro is now at 1.7.4), originally from here.
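To illustrate just the mapper side, here is a rough sketch of a streaming mapper that consumes one JSON object per line. It is not the original script: the `user` field and the function names are assumptions, and the Hadoop command line that feeds Avro to the mapper is not shown.

```python
import json
import sys


def map_records(lines, key_field='user'):
    """Yield (key, 1) for each JSON record; `key_field` is a made-up field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)  # one JSON object per line
        yield record.get(key_field), 1


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming expects key<TAB>value lines on standard output.
    for key, value in map_records(stdin):
        stdout.write('{}\t{}\n'.format(key, value))
```

A real mapper script would end with a call to `main()`, so that it reads from standard input when Hadoop launches it.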

Friday, September 07, 2012

Setting Up Machine Learning on OSX

Setting up machine learning tools (numpy, scipy, matplotlib, scikit-learn, ...) can be a pain (why can't they just use a decent OS? :).

We are lucky to have Ben Kim now with us at Adconion, and he posted the following:


Mac OS X Lion Software Installs
  1. Install compilers
    1. Install XCode 4.x from the App Store
      1. Install Command Line Tools in Preferences/Download
    2. Install gcc, g++, and gfortran compilers
      1. Download tar file
      2. Extract to /
        1. tar -xvf abc.tar -C /
    3. Reference http://sites.google.com/site/dwhipp/tutorials/mac_compilers
  2. Install Homebrew
    1. Run the install command using ruby
      1. ruby <(curl -fsSkL raw.github.com/mxcl/homebrew/go)
    2. brew doctor
      1. chown /usr/local folders listed
      2. Place /usr/local/bin before /usr/bin in path
    3. Reference https://github.com/mxcl/homebrew/wiki/installation
  3. Install python using brew
    1. brew install readline sqlite gdbm pkg-config
    2. brew install python
    3. Note: Mac OS X Lion comes with old version 2.7.1 of python (python --version) 
  4. Set PATH in .bash_profile
    1. vim ~/.bash_profile
      1. export PATH=/usr/local/share/python:/usr/local/bin:$PATH
  5. Create symlinks
    1. Within /(System/)?Library/Frameworks/Python.framework/Versions, sudo rm Current
    2. Within the above directories, ln -s /usr/local/Cellar/python/2.7.3 Current
  6. Install pip, if necessary, using easy_install
    1. sudo easy_install pip
  7. Using pip (sudo pip install [--upgrade] abc)
    1. Install nose
    2. Install numpy
    3. Install scipy with environmental variables settings
      1. sudo CC=clang CXX=clang FFLAGS=-ff2c pip install [--upgrade] scipy
    4. Install scikit-learn
    5. Install pandas
  8. Install matplotlib
    1. Download source from repo: https://github.com/matplotlib/matplotlib
    2. cd ~/Downloads/matplotlib-*
    3. python setup.py build
    4. python setup.py install
  9. Install VW (Vowpal Wabbit)
    1. Install boost
      1. Download tar file
      2. mv boost extracted folder to /usr/local
      3. export BOOST_ROOT environmental variable
      4. cd to boost directory
      5. make and install
        1. sudo ./bootstrap
        2. sudo ./bjam install
      6. Download bjam
      7. mv to directory in PATH
        1. mv bjam /usr/local/bin
      8. Set bjam toolset to darwin
        1. bjam toolset=darwin stage
      9. Reference http://www.boost.org/doc/libs/1_41_0/more/getting_started/unix-variants.html#expected-build-output
    2. cd to VW directory
      1. make and test
        1. make
        2. make test

Saturday, July 07, 2012

Show Dependencies of Azkaban Jobs

We're using Azkaban at work to schedule Hadoop jobs. It's hard to view the job dependencies without deploying, so here's a little script that will show you job dependencies as an image. It uses dot (from Graphviz) to produce the image.
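As a rough sketch of the approach, assuming each job is a `.job` properties file with an optional comma-separated `dependencies=` line. The function names are made up, and this is an illustration rather than the original script.

```python
import os
from glob import glob


def job_dependencies(path):
    """Return the list of dependencies declared in one .job file."""
    with open(path) as fo:
        for line in fo:
            key, _, value = line.partition('=')
            if key.strip() == 'dependencies':
                return [dep.strip() for dep in value.split(',') if dep.strip()]
    return []


def jobs_to_dot(directory):
    """Build a Graphviz dot description of the jobs found in `directory`."""
    lines = ['digraph jobs {']
    for path in sorted(glob(os.path.join(directory, '*.job'))):
        job = os.path.splitext(os.path.basename(path))[0]
        lines.append('    "{}";'.format(job))  # show jobs without edges too
        for dep in job_dependencies(path):
            lines.append('    "{}" -> "{}";'.format(job, dep))
    lines.append('}')
    return '\n'.join(lines)
```

Write the returned text to a file, say `jobs.dot`, and run `dot -Tpng -o jobs.png jobs.dot` to get the image.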

Tuesday, June 26, 2012

Python Based Assembler

Got reminded of a project I did a while back. It's a Python-based assembler. The main idea is that the assembly file is actually a Python file with pre-set functions (the assembly instructions). In this manner, I managed to skip lexing, parsing and other things and deliver a working assembler in two days. You can view the presentation I gave on this here.
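The core trick can be sketched as follows. The instruction set (`load`, `add`, `store`) and register names here are invented for the example, and the sketch collects tuples where a real assembler would emit machine code.

```python
def assemble(source):
    """Run `source` as Python and collect the instructions it calls."""
    program = []

    def instruction(opcode):
        def emit(*operands):
            program.append((opcode,) + operands)
        return emit

    # Hypothetical instruction set, exposed to the source as plain functions.
    namespace = {name: instruction(name) for name in ('load', 'add', 'store')}
    namespace.update(r0=0, r1=1)  # register names are plain Python values
    exec(source, namespace)
    return program


SOURCE = '''
load(r0, 10)
for _ in range(2):   # all of Python is available as a macro language
    add(r0, r1)
store(r0, 0x20)
'''

program = assemble(SOURCE)
```

Since the input is ordinary Python, loops, functions and constants come for free, which is what makes skipping the lexer and parser possible.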

The Assembler


Example Input

Friday, May 04, 2012

Using travis-ci with bitbucket

travis-ci is a great service. My problem is that it works only with github while I mainly use bitbucket (and please, let's not get into the hg/git debate - hg is way better :).

The way I found to make this work is to mirror my bitbucket projects on github using hg-git. Below is an example from fastavro.

First, you need to install hg-git. It's available from PyPI, so "pip install hg-git" will do the trick (or "easy_install hg-git" if you don't have pip).

Then create a repository on github to mirror the one on bitbucket. After that tell travis-ci to watch this repository.

Next step is to enable hg-git in your repository, edit .hg/hgrc and add the following:
[extensions]
hgext.bookmarks =
hggit =

Then "bootstrap" it with the following command:
hg bookmark -r default master

The next step is to create .travis.yml. For fastavro I list both Python 2.7 and 3.2.

The last step is to make sure that every time we push to bitbucket, the changes are pushed to github as well. This is done with an outgoing hook in .hg/hgrc:
[hooks]
outgoing = hg push git+ssh://git@github.com/tebeka/fastavro.git || true

(The || true is there since hg push sometimes exits with a non-zero value.)

That's all. Now fastavro has continuous integration that runs both on Python 2.7 and 3.2.

Monday, April 23, 2012

Twitter Post Frequency

Sometimes I see interesting new people on Twitter. However, before adding them I'd like to know their post frequency so I won't get spammed. Below is a simple script to do that:
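As a rough sketch of the calculation only: fetching the tweets is left out, and the function name and sample dates are made up. Given the timestamps of someone's recent posts, the frequency is the post count divided by the time span they cover.

```python
from datetime import datetime


def posts_per_day(timestamps):
    """Average number of posts per day over the span the timestamps cover."""
    if len(timestamps) < 2:
        return float(len(timestamps))
    span = max(timestamps) - min(timestamps)
    # Treat anything shorter than a day as one day to avoid inflated rates.
    days = max(span.total_seconds() / 86400.0, 1.0)
    return len(timestamps) / days


sample = [datetime(2012, 4, day) for day in (1, 2, 3, 4, 5)]  # made-up data
rate = posts_per_day(sample)
```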

Tuesday, March 27, 2012

A lambda Gotcha

Quick, what is the output of the following?

In [1]: callbacks = [lambda: i for i in range(10)]
In [2]: [c() for c in callbacks]

The right answer is:
Out[2]: [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]

This is due to the fact that i is bound to the same variable in all the lambdas, and has the final value of 9.

There are two ways to overcome this. The first is to use the fact that default arguments are evaluated at function creation time (which is another known gotcha).

In [3]: callbacks = [lambda i=i: i for i in range(10)]
In [4]: [c() for c in callbacks]
Out[4]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The second is to create a function generator function:
In [5]: def make_callback(i):
   ...:     return lambda: i
   ...:
In [6]: callbacks = [make_callback(i) for i in range(10)]
In [7]: [c() for c in callbacks]
Out[7]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Wednesday, March 14, 2012

Reading Avro Files Faster than Java

At work, we use a lot of Avro. One of the problems we faced was that the Python Avro package is very slow compared to the Java one. The goal then was to write fastavro, which is a subset of the avro package and should be at least as fast as Java. In this post I'll show how fastavro became faster than Java and also Python 3 compatible.

Going Fast
The Python avro package uses classes and properties heavily. This might allow for a nice design, but since function calls in Python are expensive, it has a cost. The approach was to strip down most of the code in the avro package to one simple module, eliminating as many function calls as possible along the way and using only built-in types. After some tweaking, fastavro was churning through the 10K records benchmark in about 2.6 seconds (compared to 13.9 seconds for the avro package). It was a nice speedup, but the goal was to be as fast as Java (which was doing about 1.8 seconds).


Going Faster
Enter Cython. fastavro compiles the Python code without any specific Cython code. This way, users on machines that do not have a compiler can still use fastavro. This complicated the build process a bit, since the C extension is now generated using Cython in an external Makefile. The code in fastavro first tries to import the C extension and, if that fails, imports the pure Python one.

This approach gave a 2x speedup (the 10K records benchmark finished in 1.5 seconds). Again, this is without any Cython-specific code.
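The "try the C extension, fall back to pure Python" import can be sketched like this. The module names are placeholders, not fastavro's real ones; `json` merely stands in for a pure Python module that is always importable.

```python
import importlib


def import_first(*names):
    """Import and return the first module in `names` that is available."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not built on this machine, try the next candidate
    raise ImportError('none of {} could be imported'.format(', '.join(names)))


# Compiled extension first, pure Python fallback second (placeholder names).
reader = import_first('_hypothetical_c_reader', 'json')
```

The rest of the package then uses `reader` without caring which implementation it got.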

Python 3 Support
The initial Python 3 support was written on the first day of PyCon. However after hearing Robert Brewer's excellent talk I decided to take his advice and write a small compatibility layer (six was not used for various reasons).

As Robert said, this approach made fastavro better with strings, unicode and other things which were glossed over in the 2.X-only code. The build system was simplified a lot compared to the one with the initial Python 3 support.

End Result
The end result is a package that reads Avro faster than Java and supports both Python 2 and Python 3. Using Cython and a little bit of work, this was achieved without too much effort.

As usual, the code can be found on bitbucket.

Thursday, February 23, 2012

Super Simple Mocking

There are many mocking libraries for Python out there. Due to the dynamic nature of Python I find them an overkill. Below is a super simple mocking library that works for me.


Note that for some types (such as C extensions, objects with __slots__ ...) this will not work since they do not have a __dict__.

EDIT: Following HackerNews comments, I've changed the interface to mock(obj, **kw).
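To give a feel for how little code this takes, here is a sketch of a `mock(obj, **kw)` context manager. It is an illustration under my own assumptions, not the library's actual code. It writes to `obj.__dict__` directly, so it covers modules and plain instances; classes expose a read-only `__dict__` and would need `setattr` instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def mock(obj, **kw):
    """Temporarily set attributes on `obj`, restoring them on exit."""
    missing = object()  # marks attributes that did not exist before
    saved = {name: obj.__dict__.get(name, missing) for name in kw}
    obj.__dict__.update(kw)
    try:
        yield obj
    finally:
        for name, value in saved.items():
            if value is missing:
                del obj.__dict__[name]  # attribute was added by the mock
            else:
                obj.__dict__[name] = value  # put the original back


# Example: freeze time.time() for the duration of the block.
with mock(time, time=lambda: 1234.5):
    frozen = time.time()
```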

Sunday, February 05, 2012

Adding Key Navigation to Your Web Site

I don't like using the mouse and am thankful for every web site that adds keyboard navigation (like Google Reader). Be nice to your users and add some yourself.

jQuery makes it super easy to add keyboard navigation to your site. Below is a simple example adding keyboard navigation: "k/j" for up/down, "o" for opening an item and "?" for toggling help. JavaScript code starts at line 90.


Wednesday, January 11, 2012

fastavro with Cython

Added an optional step of compiling fastavro with Cython. Just doing that, with no Cython-specific code, reduced the time of processing 10K records from 2.9 seconds to 1.7 seconds. Not bad for that little work.

Also added a __main__.py so you can use fastavro to process Avro files:

  • python -m fastavro weather.avro # Dump records in JSON format
  • python -m fastavro --schema weather.avro # Dump schema

Friday, January 06, 2012

fastavro

Just released fastavro to PyPI. It has far fewer features than the official avro package, but according to my tests it's about 5 times faster.
