If it won't be simple, it simply won't be. By Miki Tebeka, CEO, 353Solutions

Friday, January 18, 2008

Simple Text Summarizer

Comments:
  • About 50 lines of code
  • Gives reasonable results (try it out)
  • tokenize needs much more work (better token detection, stop words, ...)
  • split_to_sentences needs much more work (it should handle cases such as "3.2" and "Mr. Smith")
  • In real life you'll need to "clean" the text first (ads, credits, ...)
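
To illustrate the split_to_sentences comment, here is a minimal sketch of a splitter that leaves decimals such as "3.2" and a few titles such as "Mr." intact. It is not the summarizer's actual code: the ABBREVIATIONS set and the boundary regular expression are assumptions made for illustration, and real text will need a much longer list.

```python
import re

# Hypothetical abbreviation list, not taken from the original summarizer
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof."}

# Candidate boundary: '.', '!' or '?' followed by whitespace.
# "3.2" never matches because the dot is followed by a digit.
_BOUNDARY = re.compile(r"[.!?]\s+")

def split_to_sentences(text):
    """Split text into sentences, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        end = match.start() + 1  # include the punctuation mark
        words = text[start:end].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # "Mr. Smith" is not a sentence boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Calling split_to_sentences("Mr. Smith paid 3.2 dollars. He left! Why?") yields three sentences, with "Mr. Smith" and "3.2" kept whole.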

Tuesday, January 15, 2008

attrgetter is fast

#!/usr/bin/env python

from operator import attrgetter
from random import shuffle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def sort1(points):
    points.sort(key = lambda p: p.x)

def sort2(points):
    points.sort(key = attrgetter("x"))

if __name__ == "__main__":
    from timeit import Timer

    points1 = [Point(x, 2 * x) for x in range(100)]
    points2 = points1[:]

    num_times = 10000

    t1 = Timer("sort1(points1)", "from __main__ import sort1, points1")
    print t1.timeit(num_times)

    t2 = Timer("sort2(points2)", "from __main__ import sort2, points2")
    print t2.timeit(num_times)


$ ./attr.py
0.492087125778
0.29891705513
$
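
The first number is the lambda version and the second is the attrgetter version. The listing imports shuffle but sorts lists that are already ordered by x, so it mostly measures the cost of calling the key function. As a rough sketch of a variant that sorts shuffled input, the code below uses sorted() on the same shuffled list for both candidates. The names sort_lambda, sort_attrgetter and make_points are made up for this example, and no timings are claimed for it.

```python
from operator import attrgetter
from random import shuffle
from timeit import timeit

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

def sort_lambda(points):
    # sorted() returns a new list, so the input stays shuffled between runs
    return sorted(points, key=lambda p: p.x)

def sort_attrgetter(points):
    return sorted(points, key=attrgetter("x"))

def make_points(n=100):
    # Hypothetical helper: same points as the original script, in random order
    points = [Point(x, 2 * x) for x in range(n)]
    shuffle(points)
    return points

if __name__ == "__main__":
    points = make_points()
    for func in (sort_lambda, sort_attrgetter):
        seconds = timeit(lambda: func(points), number=10000)
        print("%s: %.3f" % (func.__name__, seconds))
```

Both functions see the identical shuffled list on every run, so any difference in the printed times comes from the key function rather than from the input order.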

Friday, January 04, 2008

Faster and Shorter "dot" using itertools

Let's calculate the dot product of two vectors:

from itertools import starmap, izip
from operator import mul

def dot1(v1, v2):
    # Explicit loop: index into v2 while iterating over v1
    result = 0
    for i, value in enumerate(v1):
        result += value * v2[i]
    return result

def dot2(v1, v2):
    # izip pairs the elements lazily, starmap multiplies each pair
    return sum(starmap(mul, izip(v1, v2)))

if __name__ == "__main__":
    from timeit import Timer

    num_times = 1000
    v1 = range(100)
    v2 = range(100)

    t1 = Timer("dot1(%s, %s)" % (v1, v2), "from __main__ import dot1")
    print t1.timeit(num_times) # 0.038722038269

    t2 = Timer("dot2(%s, %s)" % (v1, v2), "from __main__ import dot2")
    print t2.timeit(num_times) # 0.0260770320892

dot2 is faster and shorter, although dot1 is more readable. My vote goes to dot2.
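
For comparison, here is a sketch of two more spellings of the same computation. They are not from the original post and were not part of the timings above; the names dot3 and dot4 are made up for this example. Both avoid izip, which no longer exists in Python 3, where the built-in zip is already lazy.

```python
from operator import mul

def dot3(v1, v2):
    """Dot product using a generator expression over paired elements."""
    return sum(a * b for a, b in zip(v1, v2))

def dot4(v1, v2):
    """Same idea as dot2: map feeds mul one element from each vector."""
    return sum(map(mul, v1, v2))
```

dot3 keeps the pairing explicit, which may read more easily than starmap, while dot4 stays closest to dot2.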
