Tuesday, December 27, 2011

Leysin Winter Sprint

PyPy Leysin Winter Sprint: 15-22nd January 2012

The next PyPy sprint will be in Leysin, Switzerland, for the eighth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

Goals and topics of the sprint

  • Py3k: work towards supporting Python 3 in PyPy
  • NumPyPy: work towards supporting the numpy module in PyPy
  • JIT backends: integrate tests for ARM; look at the PowerPC 64; maybe try again to write an LLVM- or GCC-based one
  • STM and STM-related topics; or the Concurrent Mark-n-Sweep GC
  • And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for skiing.

Exact times

The work days should be 15-21 January 2012 (Sunday-Saturday). The official plans are for people to arrive on the 14th or the 15th, and to leave on the 22nd.

Interested? Read more...

Thursday, December 22, 2011

Come see us at PyCon 2012

PyCon 2012 is coming up in just a few short months, and PyPy will be well
represented there. We'll be delivering a tutorial and two talks, and we'll
also be around for the sprints.

Here are the abstracts for the tutorial and talks:

  • How to get the most out of your PyPy, by Maciej Fijalkowski, Alex Gaynor
    and Armin Rigo: For many applications PyPy can provide performance benefits
    right out of the box. However, little details can push your application to
    perform much better. In this tutorial we'll give you insights on how to push
    PyPy to its limits. We'll focus on understanding the performance
    characteristics of PyPy, and learning the analysis tools in order to maximize
    your applications' performance. This is the tutorial.
  • Why PyPy by example, by Maciej Fijalkowski, Alex Gaynor and Armin Rigo:
    One of the goals of PyPy is to make existing Python code faster; however an
    even broader goal was to make it possible to write things in Python that
    previously would have needed to be written in C or another low-level language. This
    talk will show examples of this, and describe how they represent the
    tremendous progress PyPy has made, and what it means for people looking at
    using PyPy.
  • How the PyPy JIT works, by Benjamin Peterson: The Python community is
    abuzz about the major speed gains PyPy can offer for pure Python code. But how
    does the PyPy JIT actually work? This talk will discuss how the PyPy JIT is
    implemented. It will include descriptions of the tracing, optimization, and
    assembly generation phases. I will demonstrate each step with an example loop.

If you have any questions let us know! We look forward to seeing people at
PyCon and chatting about PyPy and the entire Python ecosystem.

See you there,
Maciej Fijalkowski, Alex Gaynor, Benjamin Peterson, Armin Rigo, and the entire PyPy team

Thursday, December 8, 2011

Plotting using matplotlib from PyPy

Big fat warning: This is just a proof of concept. It barely works. There are missing pieces left and right, which were replaced with hacks so I could get this to run and prove it's possible. Don't try this at home, especially your home. You have been warned.

There has been a lot of talk about PyPy not integrating well with the current scientific Python ecosystem, and numpypy (a NumPy reimplementation on top of PyPy) was dubbed "a fancy array library". I'm going to show that integration with this ecosystem is possible with our design.

First, the demo:

#!/usr/bin/env pypy

# numpy, pypy version
import numpypy as numpy
# DRAGONS LIVE THERE (fortunately hidden)
from embed.emb import import_mod

pylab = import_mod('matplotlib.pylab')

if __name__ == '__main__':
    a = numpy.arange(100, dtype=int)
    b = numpy.sin(a)
    pylab.plot(a, b)
    pylab.show()

And you get:

Now, how to reproduce it:

  • You need a PyPy without cpyext; I did not find a linker that supports overriding symbols. Right now there are no nightlies like this, so you have to compile it yourself, like:

    ./translate.py -Ojit targetpypystandalone.py --withoutmod-cpyext
    

    That would give you a PyPy that's unable to load some libraries like PIL, but works perfectly otherwise.

  • Speaking of which, you need a reasonably recent PyPy.

  • The approach is generally portable, but the implementation has been tested only on 64-bit Linux. A few tweaks might be required.

  • You need to install python2.6, the python2.6 development headers, and have numpy and matplotlib installed on that python.

  • You need a checkout of my hacks directory and to put embedded on your PYTHONPATH; your pypy checkout also has to be on the PYTHONPATH.

Er wait, what happened?

What didn't happen is that we did not reimplement matplotlib on top of PyPy. What did happen is that we embedded CPython inside PyPy using ctypes: we instantiate it and follow the embedding tutorial for CPython. Since numpy arrays are not movable, we're able to pass around an integer that represents the memory address of the array data and reconstruct the array in the embedded interpreter. Hence with relatively little effort we managed to reuse the same array data on both sides to plot an array. Easy, no?
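
To make the trick concrete, here is a minimal sketch, on the CPython side only, of rebuilding an array from a raw memory address; the helper name array_from_address is ours, not part of the actual embed hacks:

import ctypes
import numpy

def array_from_address(addr, shape, dtype):
    # build a ctypes buffer over the existing memory and view it as an array
    size = int(numpy.prod(shape)) * numpy.dtype(dtype).itemsize
    buf = (ctypes.c_char * size).from_address(addr)
    return numpy.frombuffer(buf, dtype=dtype).reshape(shape)

a = numpy.arange(10, dtype=numpy.int64)
addr = a.ctypes.data                 # integer address of the array data
b = array_from_address(addr, a.shape, a.dtype)
assert (a == b).all()                # b is a view over the same memory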

This approach can be extended to support anything that's not too tied to Python objects. SciPy and matplotlib both fall into this category, but probably the same strategy can be applied to anything, like GTK or Qt. It's just a matter of extending a hack into a working library.

To summarize, while we're busy making numpypy better and faster, it seems that all external libraries on the C side can be used through an embedded Python interpreter with relatively little effort. To get to that point, I spent a day and a half learning how to embed CPython, with very little prior experience with the CPython APIs. Of course you should still keep as much as possible in PyPy to make it nice and fast :)

Cheers, fijal

Tuesday, November 29, 2011

PyPy 1.7 on Win32

Hi all,

We have fixed _continuation on Win32 (thanks Stakkars), so we now have a Win32 version of PyPy 1.7.

Monday, November 21, 2011

PyPy 1.7 - widening the sweet spot

We're pleased to announce the 1.7 release of PyPy. As has become a habit, this release brings a lot of bugfixes and performance improvements over the 1.6 release. However, unlike the previous releases, the focus has been on widening the "sweet spot" of PyPy: the range of Python code that PyPy can greatly speed up has been vastly broadened with this release. You can download the 1.7 release here:

http://pypy.org/download.html

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 1.7 and cpython 2.7.1 performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 32/64 or Windows 32. Windows 64 work is ongoing, but not yet natively supported.

The main topic of this release is widening the range of code which PyPy can greatly speed up. On average on our benchmark suite, PyPy 1.7 is around 30% faster than PyPy 1.6 and up to 20 times faster on some benchmarks.

Highlights

  • Numerous performance improvements. There are too many examples of Python constructs that now behave faster to list them all.

  • Bugfixes and compatibility fixes with CPython.

  • Windows fixes.

  • PyPy now comes with stackless features enabled by default. However, any loop using stackless features will interrupt the JIT for now, so there is no real performance improvement for stackless-based programs yet. Contact pypy-dev for information on how to help remove this restriction.

  • NumPy effort in PyPy was renamed numpypy. In order to try using it, simply write:

    import numpypy as numpy
    

    at the beginning of your program. There has been huge progress on numpy in PyPy since 1.6, the main new feature being the implementation of dtypes.

  • The JSON encoder (but not the decoder) has been replaced with a new one. This one is written in pure Python, but is known to outperform CPython's C extension by up to 2 times in some cases. It's about 20 times faster than the one that we had in 1.6.

  • The memory footprint of some of our RPython modules has been drastically improved. This should benefit any application that uses, for example, cryptography, like tornado.

  • There was some progress in exposing even more of the CPython C API via cpyext.

Things that didn't make it, expect them in 1.8 soon

There is ongoing work which, while it didn't make it into this release, is probably worth mentioning here. This is what you should probably expect in 1.8 some time soon:

  • Specialized list implementation. There is a branch that implements lists of integers/floats/strings as compactly as array.array. This should drastically improve the performance and memory footprint of some applications.
  • NumPy effort is progressing forward, with multi-dimensional arrays coming soon.
  • There are two brand new JIT assembler backends, notably for the PowerPC and ARM processors.

Fundraising

It's maybe worth mentioning that we're running fundraising campaigns for the NumPy effort in PyPy and for Python 3 in PyPy. In case you want to see either of those happen faster, we urge you to donate to the numpy proposal or the py3k proposal. In case you want PyPy to progress, but you trust us with the general direction, you can always donate to the general pot.

Cheers,
Maciej Fijałkowski, Armin Rigo and the entire PyPy team

Monday, November 14, 2011

Gothenburg sprint report

In the past week, we have been busy hacking on PyPy at the Gothenburg sprint, the second of 2011. The sprint was held at Laura's and Jacob's place, and here is a brief report of what happened.


In the first day we welcomed Mark Pearse, who was new to PyPy and at his first sprint. Mark worked the whole sprint on the new SpecialisedTuple branch, whose aim is to have a special implementation for small 2-item and 3-item tuples of primitive types (e.g., ints or floats) to save memory. Mark paired with Antonio for a couple of days, then he continued alone and did an amazing job. He even learned how to properly do Test Driven Development :-).

Antonio spent a couple of days investigating whether it is possible to use application checkpoint libraries such as BLCR and DMTCP to save the state of the PyPy interpreter between subsequent runs, thus saving also the JIT-compiled code to reduce the warmup time. The conclusion is that these are interesting technologies, but more work would be needed (either on the PyPy side or on the checkpoint library side) before they can be of practical use for PyPy users.

Then, Antonio spent most of the rest of the sprint working on his ffistruct branch, whose aim is to provide a very JIT-friendly way to interact with C structures, and eventually implement ctypes.Structure on top of that. The "cool part" of the branch is already done, and the JIT can already compile set/get of fields into a single fast assembly instruction, about 400 times faster than the corresponding ctypes code. What is still left to do is to add a nicer syntax (which is easy) and to implement all the ctypes peculiarities (which is tedious, at best :-)).

As usual, Armin did tons of different stuff, including fixing a JIT bug, improving the performance of file.readlines() and working on the STM branch (for Software Transactional Memory), which is now able to run RPython multithreaded programs using software transactions (as long as they don't fill up all the memory, because support for the GC is still missing :-)). Finally, he worked on improving the Windows version of PyPy. While doing so he discovered together with Anto a terrible bug which led to a continuous leak of stack space because the JIT called some functions using the wrong calling convention.

Håkan, with some help from Armin, worked on the jit-targets branch, whose goal is to heavily refactor the way the traces are internally represented by the JIT, so that in the end we can produce (even :-)) better code than what we do nowadays. More details in this mail.

Andrew Dalke worked on a way to integrate PyPy with FORTRAN libraries, and in particular the ones which are wrapped by Numpy and Scipy: in doing so, he wrote f2pypy, which is similar to the existing f2py but instead of producing a CPython extension module it produces a pure Python module based on ctypes. More work is needed before it can be considered complete, but f2pypy is already able to produce a wrapper for BLAS which passes most of the tests under CPython, although there's still work left to get it working for PyPy.

Armin and Håkan with Laura's "5x faster" cake
Christian Tismer worked the whole sprint on the branch to make PyPy compatible with Windows 64 bit. This needs a lot of work because a lot of PyPy is written under the assumption that the long type in C has the same bit size as void*, which is not true on Win64. Christian says that at the previous Genova-Pegli sprint he completed 90% of the work, and in this sprint he did the other 90% of the work. Obviously, what is left to complete the task is the third 90% :-). More seriously, he estimated a total of 2-4 person-weeks of work to finish it.

But, all in all, the best part of the sprint has been the cake that Laura baked to celebrate the "5x faster than CPython" achievement. Well, actually our speed page reports "only" 4.7x, but that's because in the meantime we switched from comparing against CPython 2.6 to comparing against CPython 2.7, which is slightly faster. We are confident that we will reach the 5x goal again, and that will be the perfect excuse to eat another cake :-)

Thursday, October 27, 2011

Speeding up JSON encoding in PyPy

Hi

Recently I spent a bit of effort speeding up JSON in PyPy. I started with writing a benchmark, which is admittedly not a very good one, but it's better than nothing (suggestions on how to improve it are welcome!).

For this particular benchmark, the numbers are as follows. Note that CPython by default uses the optimized C extension, while PyPy uses the pure Python one. PyPy trunk contains another pure Python version which has been optimized specifically for the PyPy JIT. Detailed optimizations are described later in this post.

The number reported is the time taken for the third run, when things are warmed up. Full session here.

CPython 2.6                  22s
CPython 2.7                  3.7s
CPython 2.7 no C extension   44s
PyPy 1.5                     34s
PyPy 1.6                     22s
PyPy trunk                   3.3s

Lessons learned:

Expectations are high

A lot of performance-critical stuff in the Python world is already written in hand-optimized C. Writing C (especially when you interface with the CPython C API) is ugly and takes significant effort. This approach does not scale well when there is a lot of code to be written or when there is a very tight coupling between the part to be rewritten and the rest of the code. Still, people expect PyPy to be better at "tasks", not just at running equivalent code, hence a comparison between the C extension and the pure Python version is sound. Fortunately it's possible to outperform the C extension, but it requires a bit of effort on the programmer's side as well.

Often the interface between the C and Python parts is ugly

This is very clear if you look at the json module as implemented in CPython's standard library. Not everything is in C (it would probably be just too much effort) and the interface to what is in C is guided by profiling, not by what kind of interface makes sense. This is especially evident when comparing CPython 2.6 to 2.7: just adapting the code to an interface with C made the Python version slower. Removing this clutter improves the readability a lot and improves PyPy's version a bit, although I don't have hard numbers.

JitViewer is crucial

In case you're fighting with PyPy's performance, jitviewer is worth a shot. While it's not completely trivial to understand what's going on, it'll definitely show you what kind of loops got compiled and how.

No nice and fast way to build strings in Python

PyPy has a custom thing called __pypy__.builders.StringBuilder. It has a few features that make it much easier to optimize than other ways like str.join() or cStringIO (a short usage sketch follows the list).

  • You can specify the start size, which helps a lot if you can provide even a rough estimate of the size of the string (less copying)
  • Only append and build are allowed. While the string is being built you can't seek or do anything else. After it's built you can never append any more.
  • A Unicode version is available as well: __pypy__.builders.UnicodeBuilder.
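
Here is a minimal usage sketch; it runs only on PyPy, and the start size passed to the constructor is just a rough guess of ours:

from __pypy__.builders import StringBuilder

def join_numbers(n):
    b = StringBuilder(n * 4)      # rough size estimate: less copying
    for i in xrange(n):
        b.append(str(i))
        b.append(' ')
    return b.build()              # once built, no more appends are allowed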

Method calls are ok, immutable globals are ok

PyPy's JIT seems to be good enough for at least the simple cases. Calling methods on common infrastructure or loading globals (instead of rebinding them as locals) is fast enough and improves code readability.
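
For instance, the classic CPython trick of rebinding a global as a local before a hot loop should not be needed; here is a sketch (ours) of the kind of loop this refers to:

import math

def total_sqrt(values):
    total = 0.0
    for v in values:
        # global lookup plus call: the JIT makes this cheap, so there is
        # no need for a "sqrt = math.sqrt" line above the loop
        total += math.sqrt(v)
    return total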

String copying is expensive

Edit: see the comment at the end

If you use re.sub, the current implementation will always create a copy of the string even if there was no match to replace. If you know your regexp is simple, first try to check if there is anything to replace. This is a pretty hard optimization to do automatically -- simply matching the regular expression can be too costly for it to make sense. In our particular example however, the regexp is really simple, checking ranges of characters. It also seems that this is by far the fastest way to escape characters as of now.
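
As an illustration, here is a hedged sketch of the check-before-substituting pattern; the pattern and the replacement are made up for this example:

import re

ESCAPE = re.compile(r'[\\"\x00-\x1f]')   # a simple character-range regexp

def escape(s):
    # re.sub would copy the string even when nothing matches; check first
    if ESCAPE.search(s) is None:
        return s
    return ESCAPE.sub(lambda m: '\\x%02x' % ord(m.group(0)), s)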

Generators are slower than they should be

I changed the entire thing to simply call builder.append instead of yielding to the main loop where it would be gathered. It is kind of a PyPy bug that using generators extensively is slower, and a bit hard to fix. Especially in cases where there is relatively little data being passed around (a few bytes), it makes sense to gather it first. If I were to implement an efficient version of iterencode, I would probably handle chunks of a predetermined size, about 1000 bytes, instead of yielding data every few bytes; see the sketch below.
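
Here is a rough sketch of what such a chunking iterencode could look like; encode_pieces is a stand-in of ours for whatever produces the small string fragments:

def encode_pieces(obj):
    # stand-in fragment producer: pretend each character arrives separately
    for ch in str(obj):
        yield ch

def iterencode_chunked(obj, chunk_size=1000):
    # gather tiny fragments and yield them in roughly 1000-byte chunks
    buf, size = [], 0
    for piece in encode_pieces(obj):
        buf.append(piece)
        size += len(piece)
        if size >= chunk_size:
            yield ''.join(buf)
            buf, size = [], 0
    if buf:
        yield ''.join(buf)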

I must admit I worked around PyPy's performance bug

For obscure (although eventually fixable) reasons, this:

for c in s: # s is string
  del c

is faster than:

for c in s:
  pass

This is a PyPy performance bug and should be fixed, but on a different branch ;-)

PyPy's JIT is good

I was pretty surprised, but the JIT actually did make stuff work nicely. The changes that were done were relatively minor and straightforward, once the module was cleaned up to a normal "pythonic" state. It is worth noting that it's possible to write code in Python and make it run really fast, but you have to be a bit careful. Again, jitviewer is your friend when determining why things are slow. I hope we can write more tools in the future that would more automatically guide people through potential performance pitfalls.

Cheers, fijal

Edit: I was wrong about re.sub. It just seems that the JIT handles match better than sub; this will be fixed soon.

Monday, October 17, 2011

PyPy Göteborg Post-Hallowe'en Sprint Nov 2nd - Nov 9th

The next PyPy sprint will be in Gothenburg, Sweden. It is a public sprint, suitable for newcomers. We'll focus on making a public kickoff for both the numpy/pypy integration project and the Py3k support project, as well as whatever interests the Sprint attendees. Since both of these projects are very new, there will be plenty of work suitable for newcomers to PyPy.

Other topics might include:

  • Helping people get their code running with PyPy
  • work on a FSCons talk?
  • state of the STM Vinnova project (we will most likely, but not certainly, know whether or not we are approved by this date)

Other Useful dates

GothPyCon - Saturday Oct 29.

FSCONS Friday Nov 11 - Sunday Nov 12.

Location

The sprint will be held in the apartment of Laura Creighton and Jacob Hallén which is at Götabergsgatan 22 in Gothenburg, Sweden. Here is a map. This is in central Gothenburg. It is between the tram stops of Vasaplatsen and Valand (a distance of 4 blocks), which many lines serve -- the 2, 3, 4, 5, 7, 10 and 13.

Probably the cheapest option, and not too far away, is to book accommodation at SGS Veckobostader. The Elite Park Avenyn Hotel is a luxury hotel just a few blocks away. There are scores of hotels a short walk away from the sprint location, suitable for every budget, desire for luxury, and desire for the unusual. You could, for instance, stay on a boat. Options are too numerous to go into here. Just ask in the mailing list or on the blog.

Hours will be from 10:00 until people have had enough. It's a good idea to arrive a day before the sprint starts and leave a day later. In the middle of the sprint there usually is a break day and it's usually ok to take half-days off if you feel like it. Of course, many of you may be interested in sticking around for FSCons, held the weekend after the sprint.

Good to Know

Sweden is not part of the Euro zone. One SEK (krona in singular, kronor in plural) is roughly 1/10th of a Euro (9.36 SEK to 1 Euro).

The venue is central in Gothenburg. There is a large selection of places to get food nearby, from edible-and-cheap to outstanding. We often cook meals together, so let us know if you have any food allergies, dislikes, or special requirements.

Sweden uses the same kind of plugs as Germany. 230V AC.

Getting Here

If you are coming by train, you will arrive at the Central Station. It is about 12 blocks to the site from there, or you can take a tram.

There are two airports which are local to Göteborg, Landvetter (the main one) and Gothenburg City Airport (where some budget airlines fly). If you arrive at Landvetter, the airport bus stops right downtown at the Elite Park Avenyn Hotel (the second stop, 4 blocks from the sprint site) and at the end of the line, the Central Station. If you arrive at Gothenburg City Airport, take the bus to the end of the line; you will be at the Central Station.

You can also arrive by ferry, from either Kiel in Germany or Frederikshavn in Denmark.

Who's Coming?

If you'd like to come, please let us know when you will be arriving and leaving, as well as your interests. We'll keep a list of people, which we'll update (and which you can update yourself if you have bitbucket pypy commit rights).

Wednesday, October 12, 2011

Numpy funding and status update

Hi everyone,

It's been a little while since we wrote about NumPy on PyPy, so we wanted to give everyone an update on what we've been up to, and what's up next for us.

We would also like to note that we're launching a funding campaign for NumPy support in PyPy. Details can be found on the donation page.

Some of the things that have happened since last we wrote are:

  • We added dtype support, meaning you can now create arrays of a bunch of different types, including bools, ints of various sizes, and floats.
  • More array methods and ufuncs, including things like comparison methods (==, >, etc.)
  • Support for more and more argument types; for example, you can index by a tuple now (this only works with tuples of length one, since we only have single-dimension arrays thus far). A short usage sketch follows this list.
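
As a small illustration of the features above (assuming a PyPy build that ships the numpypy module):

import numpypy as numpy

a = numpy.array([1.0, 2.0, 3.0])                  # floats via dtype support
b = numpy.array([True, False, True], dtype=bool)  # bool dtype
mask = a > 1.5                                    # comparison ufuncs
first = a[(0,)]                                   # tuple index of length one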

Some of the things we're working on at the moment:

  • More dtypes, including complex values and user-defined dtypes.
  • Subscripting arrays by other arrays as indices, and by bool arrays as masks.
  • Starting to reuse Python code from the original numpy.

Some of the things on the near horizon are:

  • Better support for scalar data, for example did you know that numpy.array([True], dtype=bool)[0] doesn't return a bool object? Instead it returns a numpy.bool_.
  • Multi-dimensional array support.

If you're interested in helping out, we always love more contributors.

Alex, Maciej, Justin, and the whole PyPy team

Tuesday, October 11, 2011

More Compact Lists with List Strategies

Since we are coming closer to merging the list-strategy branch, I want to try to explain this memory optimization today.

Datatypes in PyPy are stored as W_<type>Objects (e.g. W_StringObject to represent strings, W_IntObject to represent ints). This is necessary due to the dynamic nature of Python. So the actual value (e.g. string, integer) is stored inside that box, resulting in an indirection. When there is a large number of such boxed objects, for example in a list, the wasted memory can become quite large.

If you have a closer look at such lists, you will see that in many of them only one type of data is stored, and only a few (and usually smaller) lists store mixed types. Another thing to observe is that those lists rarely change the types of the objects they contain at runtime. For instance a list of a million integers is very unlikely to suddenly get a string appended to it.

List Strategies

The goal of this work is to write an optimization that exploits this behaviour. Instead of wrapping all items in a list, we implement lists in a way that is optimized for storing certain (primitive) datatypes. These implementations store the content of the list in unwrapped form, getting rid of the extra indirection and the wrapper objects.

One approach would be to add a level of indirection, making each W_ListObject instance point to another object that stores the actual content. For this other object, several implementations would exist, for every datatype we want to store without wrapping it (as well as a general one that deals with arbitrary content). The data layout would look something like this:

This approach has the problem that we need two indirections to get to the data and that the implementation instances need memory themselves.

What we would like to do is to make the W_ListObject point to an RPython list directly, that contains either wrapped or unwrapped data. This plan has the problem that storing different unwrapped data is not directly possible in RPython.

To solve the problem, we use the rerased RPython library module. It allows us to erase the type of an object, in this case a list, and returns something similar to void* in C, or Object in Java. This object is then stored on the W_ListObject in the field storage. If we want to work with the list, for example to append or delete items, we need to unerase the storage again.

Example for rerase:

storage = erase([1, 2, 3, 4])
# storage is an opaque object that you can do nothing with
...
l = unerase(storage)
l.append(5)   # now we have a normal list again and can use it

Now that we know how to make the W_ListObject point directly to wrapped or unwrapped data, we need to find out how to actually do any operations on this data. This can be accomplished by adding another field to our W_ListObject. This field points to a ListStrategy object. The actual implementation of W_ListObject is now deferred to those ListStrategy classes. For instance, a W_ListObject which holds only integers will use the IntegerListStrategy.

When the type of content is being changed, we need to change the used strategy as well as the storage in compatible ways. For example when we add a string to the list of integers we need to switch to the ObjectListStrategy and change the storage to be a list of wrapped objects. Thus the currently used strategy always knows what to do with what is currently in the storage.
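
The following toy sketch (plain Python rather than RPython, with the erasing left out) shows the shape of this delegation and of the strategy switch:

class W_IntObject(object):          # toy box for an int
    def __init__(self, value):
        self.value = value

class IntegerListStrategy(object):
    def append(self, w_list, w_item):
        if isinstance(w_item, W_IntObject):
            w_list.storage.append(w_item.value)   # store unwrapped
        else:
            # a non-integer arrives: wrap everything, switch strategy
            w_list.storage = [W_IntObject(v) for v in w_list.storage]
            w_list.strategy = OBJECT_STRATEGY
            w_list.append(w_item)

class ObjectListStrategy(object):
    def append(self, w_list, w_item):
        w_list.storage.append(w_item)             # store wrapped

INT_STRATEGY = IntegerListStrategy()              # strategies are singletons
OBJECT_STRATEGY = ObjectListStrategy()

class W_ListObject(object):
    def __init__(self):
        self.strategy = INT_STRATEGY              # optimistic: start with ints
        self.storage = []

    def append(self, w_item):
        self.strategy.append(self, w_item)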

As you can see, we now save one level of indirection by storing some of the data unwrapped. Of course each operation on a list needs to go via the strategy, but since we save one indirection for each element stored in that list, and the strategy classes are singletons, the benefits outweigh the costs.

Currently there are only strategies for integers and strings, since many lists seem to have these datatypes. Other strategies, e.g. for floats and unicode strings, are planned. We also implemented two special strategies for empty lists and range-lists. The EmptyListStrategy's storage is None. If objects are added to the list we just switch to the appropriate strategy (determined by the item's type). The RangeListStrategy does not store any items at all. Instead it only stores values describing the range of the list, i.e. start, step and length. On any operation that changes the data of the list we switch to the IntegerListStrategy.

A nice side-effect of storing unwrapped datatypes is that we can implement optimized methods for certain cases. For instance, since comparison of unwrapped integers is now much faster than comparison between arbitrary objects, we can rewrite the sorting methods for lists containing integers.

Microbenchmarks

Finally here is an early overview of the memory consumption of different Python implementations: CPython, PyPy and PyPy-list, which uses list strategies. To demonstrate how powerful list strategies can be in the best case, we wrote benchmarks that create a list of integers, a list of strings and a range-list, each with one million elements, and then read out the heap size of the process as reported by the OS.

The results are as follows:

The savings on integers and strings in this ideal case are quite big.

The benchmark for range-lists is a little unfair, since in CPython one could accomplish the same memory behaviour using xrange. However, in PyPy users won't notice that internally the list does not store all items, making it still possible to use all list methods, such as append or delete.

Conclusion

We hope that list strategies bring memory savings for applications that use homogeneous lists of primitive types. Furthermore, operations on such lists tend to be somewhat faster as well. This also integrates well with the JIT. The list strategies optimization will be merged into PyPy's default branch at some point in the next few months. An equivalent optimization for dictionaries has already been merged (and is part of PyPy 1.6); one for sets is coming in the future.

Lukas Diekmann and Carl Friedrich Bolz

Wednesday, September 21, 2011

Py3k for PyPy fundraiser

Hi,

We would like to announce a donation campaign for implementing Python 3 in PyPy.
Please read our detailed plan for all the details and donate using the
button on that page!

Thanks,
The PyPy Team

Tuesday, August 30, 2011

Wrapping C++ Libraries with Reflection — Status Report One Year Later

Well over a year ago, work was started on the cppyy module which lives in the reflex-support branch. Since then, work has progressed at a varying pace and has included a recent sprint in Düsseldorf, last July.

Let's first take a step back and recap why we're interested in doing this, given that it is perfectly possible to use C++ through generated bindings and cpyext. cppyy makes use of reflection information generated for the C++ classes of interest, and has that reflection information available at run time. Therefore, it is able to open up complex C++ types to the JIT in a manner conceptually similar to the way simple types are open to it. This means that it is possible to get rid of a lot of the marshalling layers when making cross-language calls, resulting in much lower call overhead than is possible when going through the CPython API, or other methods of wrapping.

There are two problems that need to be solved: C++ language constructs need to be presented on the Python side in a natural way; and cross-language impedance mismatches need to be minimized, with some hints from the user if need be. For the former, the list of mapped features has grown to a set that is sufficient to do real work. There is now support for:

  • builtin, pointer, and array types
  • namespaces, classes, and inner classes
  • global functions, global data
  • static/instance data members and methods
  • default arguments, object return by value
  • single and multiple (virtual) inheritance
  • templated classes
  • basic STL support and pythonizations
  • basic (non-global) operator mapping

The second problem is harder and will always be an on-going process. But one of the more important issues has been solved at the recent Düsseldorf sprint, namely, that of reclaiming C++ objects instantiated from the Python side by the garbage collector.

Performance has also improved, especially that of the nicer "pythonized" interface that the user actually sees, although it still misses out on about a factor of 2.5 in comparison to the lower-level interface (which has gotten uglier, so you really don't want to use that). Most of this improvement is due to restructuring so that it plays nicer with the JIT and libffi, both of which themselves have seen improvements.

Work is currently concentrated on the back-ends: a CINT back-end is underway and an LLVM/Clang pre-compiled headers (PCH) back-end is planned. The latter is needed for this code to be released in the wild, rather than just used in high energy physics (HEP), as it would be easier to support. Also, within HEP, Clang's PCH is foreseen to be the future format of reflection information.

At the end of the Düsseldorf sprint, we tried a little code that did something actually "useful," namely the filling of a histogram with some random values. We did get it to work, but trying cppyy on a large class library showed that a good warning system for such things like missing classes was sorely needed. That has been added since, and revisiting the histogram example later, here is an interesting note: the pypy-c run takes 1.5x the amount of time of that of the compiled, optimized, C++ code. The run was timed start to finish, including the reflection library loading and JIT warm-up that is needed in the case of Python, but not for the compiled C++ code. However, in HEP, scientists run many short jobs while developing their analysis codes, before submitting larger jobs on the GRID to run during lunch time or overnight. Thus, a more realistic comparison is to include the compilation time needed for the C++ code and with that, the Python code needs only 55% of the time required by C++.

The choice of a programming language is often a personal one, and arguments such as the idea that C++ is hard to use typically do not carry much weight with the in-crowd that studies quantum field dynamics for fun. However, getting the prompt back faster with your analysis results is a sure winner. We hope that cppyy will soon have progressed far enough to make it useful, first to particle physicists and then for other uses of wrapping C++ libraries.

Wim Lavrijsen, Carl Friedrich Bolz, Armin Rigo

Tuesday, August 23, 2011

We need Software Transactional Memory

Hi all. Here is (an extract of) a short summary paper about my current position on Software Transactional Memory as a general tool in the implementation of Python or Python-like languages. Thanks to people on IRC for discussion on making this blog post better (lucian, Alex Gaynor, rguillebert, timonator, Da_Blitz). For the purpose of the present discussion, we are comparing Java with Python when it comes to multi-threading.

The problem in complex high-level languages

Like Java, the Python language gives guarantees: it is not acceptable for the Python virtual machine to crash due to incorrect usage of threads. A primitive operation in Java is something like reading or writing a field of an object; the corresponding guarantees are along the lines of: if the program reads a field of an object, and another thread writes to the same field of the same object, then the program will see either the old value, or the new value, but not something else entirely, and the virtual machine will not crash.

Higher-level languages like Python differ from Java by the fact that a "primitive operation" is far more complex. It may for example involve looking in several hash maps, perhaps doing updates. In general, it is completely impossible to map every operation that must be atomic to a single processor instruction.

Jython: fine-grained locking

This problem has been solved "explicitly" in the Jython interpreter that runs on top of Java. The solution is explicit in the following sense: throughout the Jython interpreter, every single operation makes careful use of Java-level locking mechanisms. This is an application of "fine-grained locking". For example, operations like attribute lookup, which need to perform look-ups in a number of hash maps, are protected by acquiring and releasing locks (in __getattribute__).

A drawback of this solution is the attention to detail required. If even one place misses a lock, then there is either a bug --- and such bugs occur in cases that are increasingly rare and hard to debug as the previous bugs are fixed --- or we just file it under "differences from CPython". There is however the risk of deadlock, if two threads attempt to lock the same objects in different order.

In practice, the situation is actually not as bad as I may paint it: the number of locks in Jython is reasonable, and allows for all the "common cases" to work as expected. (For the uncommon cases, see below.)

Performance-wise, the Java virtual machine itself comes with locks that have been heavily optimized over a long period of time, so the performance is acceptable. However if this solution were coded in C, it would need a lot of extra work to optimize the locks manually (possibly introducing more of the subtle bugs).

CPython: coarse-grained locking

CPython, the standard implementation of Python in C, took a different and simpler approach: it has a single global lock, called the Global Interpreter Lock (GIL). It uses "coarse-grained locking": the lock is acquired and released around the whole execution of one bytecode (or actually a small number of bytecodes, like 100). This solution is enough to ensure that no two operations can conflict with each other, because the two bytecodes that invoke them are themselves serialized by the GIL. It is a solution which avoids --- unlike Jython --- writing careful lock-acquiring code all over the interpreter. It also offers even stronger guarantees: every bytecode runs entirely atomically.

Nowadays, the drawback of the GIL approach is obvious on multi-core machines: by serializing the execution of bytecodes, starting multiple threads does not actually let the interpreter use more than one core.

PyPy, the Python implementation in Python, takes the same approach so far.

Existing usage

As we have seen, we have the following situation: the existing Python language, as CPython implements it, offers very strong guarantees about multi-threaded usage. It is important to emphasize that most existing multi-threaded Python programs actually rely on such strong guarantees. This can be seen for example in a program that takes a populated list and, in several threads, does:

next_item = global_list.pop()

This implicitly relies on the fact that pop() will perform atomic removal from the list. If two threads try to pop() from the same list at the same time, then the two operations will occur in one order or the other; but they will not e.g. return the same object to both threads or mess up the internal state of the list object.
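
As a small sketch of how programs rely on this in practice (our own illustration, plain CPython): several threads can drain a shared list without any explicit lock, and every item is handed out exactly once:

import threading

global_list = range(10000)
results = []

def worker():
    while True:
        try:
            item = global_list.pop()    # atomic removal under the GIL
        except IndexError:
            break                       # the list is exhausted
        results.append(item)            # list.append is atomic as well

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == range(10000)  # nothing lost, no duplicates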

With such an example in mind, it should be clear that we do not want a solution to the multi-core issue that involves dropping these strong guarantees. It is ok however to lower the barrier, as Jython does; but any Python implementation must offer some guarantees, or not offer multi-threading at all. This includes the fact that a lot of methods on built-in types are supposed to be atomic.

(It should be noted that not offering multi-threading at all is actually also a (partial) solution to the problem. Recently, several "hacks" have appeared that give a programmer more-or-less transparent access to multiple independent processes (e.g. multiprocessing). While these provide appropriate solutions in some contexts, they are not as widely applicable as multi-threading. As a typical example, they fail to apply when the multiple cores need to process information that cannot be serialized at all --- a requirement for any data exchange between several processes.)

Here is an example of how Jython's consistency is weaker than CPython's GIL. It takes uncommon examples to show it, and the fact that it does not work the way a CPython programmer expects is generally considered an implementation detail. Consider:

Thread 1:  set1.update(set2)
Thread 2:  set2.update(set3)
Thread 3:  set3.update(set1)

Each operation is atomic in the case of CPython, but decomposed into two steps (which can each be considered atomic) in the case of Jython: reading from the argument, and then updating the target set. Suppose that initially set1 = {1}, set2 = {2}, set3 = {3}. On CPython, independently of the order in which the threads run, we will end up with at least one of the sets being {1, 2, 3}. On Jython, it is possible that all three sets end up containing two items only. The example is a bit far-fetched but should show that CPython's consistency is strictly stronger than Jython's.

PyPy

PyPy is a Python interpreter much like CPython or Jython, but the way it is produced is particular. It is an interpreter written in RPython, a subset of Python, which gets turned into a complete virtual machine (as generated C code) automatically by a step called the "translation". In this context, the trade-offs are different from the ones in CPython and in Jython: it is possible in PyPy, and even easy, to apply arbitrary whole-program transformations to the interpreter at "translation-time".

With this in mind, it is possible to imagine a whole-program transformation that would add locking on every object manipulated in RPython by the interpreter. This would end up in a situation similar to Jython. However, it would not automatically solve the issue of deadlocks, which is avoided in the case of Jython by careful manual placement of the locks. (In fact, being deadlock-free is a global program property that cannot be automatically ensured or verified; any change to Jython can in theory break this property, and thus introduce subtle deadlocks. The same applies to non-atomicity.)

In fact, we can easily check that if the interpreter accesses (for both reading and writing) objects A and B in a bytecode of thread 1, and objects B and A (in the opposite order) in a bytecode of thread 2 --- and moreover if you need to have accessed the first object before you can decide that you will need to access the second object --- then there is no way (apart from the GIL) to avoid a deadlock while keeping the strong guarantee of atomicity. Indeed, if both threads have progressed to the middle of the execution of their bytecode, then A has already been mutated by thread 1 and similarly B has already been mutated by thread 2. It is not possible to successfully continue running the threads in that case.

Using Software Transactional Memory

Software Transactional Memory (STM) is an approach that gives a solution to precisely the above problem. If a thread ended up in a situation where continuing to run it would be wrong, then we can abort and rollback. This is similar to the notion of transaction on databases. In the above example, one or both threads would notice that they are about to run into troubles and abort. This means more concretely that they need to have a way to restart execution at the start of the bytecode, with all the side-effects of what they did so far being either cancelled or just not committed yet.

We think that this capacity to abort and rollback is the missing piece of the puzzle of multi-threaded implementations of Python. Actually, according to the presentation of the problem given above, it is unavoidable that any solution that wants to offer the same level of consistency and atomicity as CPython would involve the capacity of aborting and rolling back --- which means precisely that STM cannot be avoided.

Ok, but why not settle for Jython's approach and put careful locks left and right throughout the interpreter? Because (1) we would have to consider every operation's atomicity, make decisions (or steal Jython's) and document them; (2) it would also be really a lot of work to optimize these locks, e.g. with the JIT, as well as the JVM does; and (3) it is not the PyPy way to require manually tweaking your code everywhere for a feature that should be orthogonal. Point (3) is probably the most important here: you need to redo the work for every language you implement in PyPy. It also implies my own point (4): it is not fun :-)

In more detail, the process would work as follows. (This gives an overview of one possible model; it is possible that a different model will end up being better. A toy sketch follows the list.) In every thread:

  • At the start of a bytecode, we start a "transaction". This means setting up a thread-local data structure to record a log of what occurs in the transaction.
  • We record in the log all objects that are read, as well as the modifications that we would like to make.
  • During this time, we detect "read" inconsistencies, shown by the object's "last-modified" timestamp being later than the start time of the current transaction, and abort. This prevents the rest of the code from running with inconsistent values.
  • If we reach the end of the bytecode without a "read" inconsistency, then we atomically check for "write" inconsistencies. These are inconsistencies which arise from concurrent updates to objects in the other threads --- either our "write" objects, or our "read" objects.
  • If no inconsistency is found, we "commit" the transaction by copying the delayed writes from the log into main memory.
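
As a toy sketch of that per-bytecode transaction loop (all the helper names --- TransactionLog, execute_with_log, ReadInconsistency --- are hypothetical, invented for illustration):

def run_one_bytecode(frame, bytecode):
    while True:
        log = TransactionLog()              # start a transaction
        try:
            # records every read; buffers writes instead of applying them
            execute_with_log(frame, bytecode, log)
        except ReadInconsistency:
            continue                        # abort and retry the bytecode
        if log.commit():                    # atomic write-conflict check
            return                          # buffered writes are now visible
        # commit failed: another thread won the race, so retry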

The points at which a transaction starts or ends are exactly the points at which, in CPython, the Global Interpreter Lock is respectively acquired and released. If we ignore the fact that (purely for performance) CPython acquires and releases the GIL only every N bytecodes, then this means:

  1. Before any bytecode we acquire the GIL (start a transaction), and after the bytecode we release it (end the transaction); and
  2. Before doing an external call to the C library or the OS we release the GIL (end the transaction) and afterwards re-acquire it (start the next transaction).

So in particular this model is well suited to the STM condition that we cannot do anything in a transaction that cannot be rolled back, like --- precisely --- system calls. Indeed, by construction, these system calls occur outside a transaction, because in CPython they occur with the GIL released.

Performance

A large number of implementation details are still open for now. From a user's point of view (i.e. the programmer using Python), the most relevant one is the overall performance impact. We cannot give precise numbers so far, and we expect the initial performance to be abysmally bad (maybe 10x slower); however, with successive improvements to the locking mechanism, to the global program transformation inserting the locks, to the garbage collector (GC), and to the Just-in-Time (JIT) compiler, we believe that it should be possible to get a roughly reasonable performance (up to maybe 2x slower). For example, the GC can maintain flags on the objects to know that they did not escape their creation thread, and do not need any logging; and the JIT compiler can aggregate several reads or writes to an object into one. We believe that these are the kind of optimizations that can give back a lot of the performance lost.

The state of STM

Transactional Memory is itself a relatively old idea, originating from a 1986 paper by Tom Knight. At first based on hardware support, the idea of software-only transactional memory (STM) was popularized in 1995 and has recently been the focus of intense research.

The approach outlined above --- using STM to form the core of the implementation of a language --- is new, as far as we know. So far, most implementations provide STM as a library feature. It requires explicit usage, often in the form of explicitly declaring which objects must be protected by STM (object-based STMs). It is only recently that native STM support has started to appear, notably in the Clojure language.

STM is described on Wikipedia as an approach that "greatly simplifies conceptual understanding of multithreaded programs and helps make programs more maintainable by working in harmony with existing high-level abstractions such as objects and modules." We actually think that these benefits are important enough to warrant being exposed to the Python programmer as well, instead of being used only internally. This would give the Python programmer a very simple interface:

with atomic:
    <these operations are executed atomically>

(This is an old idea. Funny how back in 2003 people, including me, thought that this was a hack. Now I'm writing a blog post to say "it was not a hack; it's explicitly using locks that is a hack." I'm buying the idea of composability.)

From a practical point of view, I started looking seriously at the University of Rochester STM (RSTM), a C++ library that has been a focus of --- and a collection of results from --- recent research. One particularly representative paper is A Comprehensive Strategy for Contention Management in Software Transactional Memory by Michael F. Spear, Luke Dalessandro, Virendra J. Marathe and Michael L. Scott.

Conclusion

Taking these ideas and applying them in the context of an implementation of a complex high-level language like Python comes with its own challenges. In this context, using PyPy makes sense as both an experimentation platform and as a platform that is recently gaining attention for its performance. The alternatives are unattractive: doing it in CPython for example would mean globally rewriting the interpreter. In PyPy instead, we write it as a transformation that is applied systematically at translation-time. Also, PyPy is a general platform for generating fast interpreters for dynamic languages; the STM implementation in PyPy would work out of the box for other language implementations as well, instead of just for Python.


Update:

  • This is mostly me (Armin Rigo) ranting aloud and trying experiments; this post should not be taken to mean that the whole PyPy team will now spend the next years working on it full-time. As I said it is orthogonal to the actual Python interpreter, and it is in any case a feature that can be turned on or off during translation; I know that in many or most use cases, people are more interested in getting a fast PyPy rather than one which is twice as slow but scales well.
  • Nothing I said is really new. For proof, see Riley and Zilles (2006) as well as Tabba (2010) who both experimented with Hardware Transactional Memory, turning CPython or PyPy interpreter's GIL into start/end transactions, as I describe here.

Thursday, August 18, 2011

PyPy 1.6 - kickass panda

We're pleased to announce the 1.6 release of PyPy. This release brings a lot of bugfixes and performance improvements over 1.5, and improves support for Windows 32bit and OS X 64bit. This version fully implements Python 2.7.1 and has beta level support for loading CPython C extensions. You can download it here:

http://pypy.org/download.html

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.1. It's fast (pypy 1.6 and cpython 2.6.2 performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64 or Mac OS X. Windows 32 is beta (it roughly works but a lot of small issues have not been fixed so far). Windows 64 is not yet supported.

The main topics of this release are speed and stability: on average on our benchmark suite, PyPy 1.6 is between 20% and 30% faster than PyPy 1.5, which was already much faster than CPython on our set of benchmarks.

The speed improvements have been made possible by optimizing many of the layers which compose PyPy. In particular, we improved: the Garbage Collector, the JIT warmup time, the optimizations performed by the JIT, the quality of the generated machine code and the implementation of our Python interpreter.

Highlights

  • Numerous performance improvements, overall giving considerable speedups:
    • better GC behavior when dealing with very large objects and arrays
    • fast ctypes: now calls to ctypes functions are seen and optimized by the JIT, and they are up to 60 times faster than PyPy 1.5 and 10 times faster than CPython
    • improved generators (1): simple generators are now inlined into the caller loop, making performance up to 3.5 times faster than PyPy 1.5.
    • improved generators (2): thanks to other optimizations, even generators that are not inlined are between 10% and 20% faster than PyPy 1.5.
    • faster warmup time for the JIT
    • JIT support for single floats (e.g., for array('f'))
    • optimized dictionaries: the internal representation of dictionaries is now dynamically selected depending on the type of the stored objects (for example, dictionaries whose keys are all strings, or all integers), resulting in faster code and a smaller memory footprint. Other dictionaries are also smaller due to bugfixes.
  • JitViewer: this is the first official release which includes the JitViewer, a web-based tool which helps you to see which parts of your Python code have been compiled by the JIT, down to the assembler. The jitviewer 0.1 has already been released and works well with PyPy 1.6.
  • The CPython extension module API has been improved and now supports many more extensions. For information on which ones are supported, please refer to our compatibility wiki.
  • Multibyte encoding support: this was one of the last areas in which we were still behind CPython, but now we fully support them.
  • Preliminary support for NumPy: this release includes a preview of a very fast NumPy module integrated with the PyPy JIT. Unfortunately, this does not mean that you can expect to take an existing NumPy program and run it on PyPy, because the module is still unfinished and supports only some of the numpy API. However, barring some details, what works should be blazingly fast :-)
  • Bugfixes: since the 1.5 release we fixed 53 bugs in our bug tracker, not counting the numerous bugs that were found and reported through other channels than the bug tracker.

Cheers,

Hakan Ardo, Carl Friedrich Bolz, Laura Creighton, Antonio Cuni, Maciej Fijalkowski, Amaury Forgeot d'Arc, Alex Gaynor, Armin Rigo and the PyPy team

Friday, August 12, 2011

Visualization of JITted code

Hello.

We're proud to announce the first public release of the jitviewer. As of now, the jitviewer is a somewhat internal tool that helps you understand how your Python source code is compiled by PyPy's JIT all the way down to machine code.

To install it, you need a very recent version of PyPy (newer than 9th of August), for example one of the nightly builds:

  • install pip and distribute either by creating a PyPy virtualenv or by following the installation instructions.
  • make sure to have a source code checkout of PyPy and put it in your PYTHONPATH.
  • pip install jitviewer. Note that you need to run the pip executable which belongs to PyPy, not the globally installed one.

Have a look at the README for how to start it, or try the online demo if you just want to play with it.

The jitviewer is a web application written with flask and jinja2. If you have experience with web development and you want to help PyPy, don't hesitate to contact us, there are plenty of things to improve in it :-).

What does the jitviewer really do?

At the top of the page, you will see the list of pieces of code which have been compiled by the JIT. You will see entries for both normal loops and for "entry bridges". This is not the right place to discuss the difference between those, but you most probably want to look at loops, because usually that is where most of the time is spent.

Note that for each loop, you will see the name of the function which contains the first instruction of the loop. However, thanks to the inlining done by the JIT, it will also contain the code of other functions.

Once you select a loop, the jitviewer shows how the JIT has compiled the Python source code into assembler in a hierarchical way. It displays four levels:

  • Python source code: only the lines shown in azure have been compiled for this particular loop; the ones in gray have not.

  • Python bytecode, the one you would get by doing:

    def f(a, b):
       return a + b
    
    import dis
    dis.dis(f)
    

    The opcodes are e.g. LOAD_FAST, LOAD_GLOBAL etc. The opcodes which are not in bold have been completely optimized away by the JIT.

  • Intermediate representation of jit code (IR). This is a combination of operations (like integer addition, reading fields out of structures) and guards (which check that the assumptions we made are actually true). Guards are in red. These operations are "at the same level as C": so, for example, + takes two unboxed integers which can be stored into the register of the CPU.

  • Assembler: you can see it by clicking on "Show assembler" in the menu on the right.

Sometimes you'll find that a guard fails often enough that a new piece of assembler needs to be compiled. This is an alternative path through the code, called a bridge. You can see bridges in the jitviewer when there is a link next to a guard. For more information about their purpose, look at the JIT documentation.

I'm still confused

The jitviewer is not perfect when it comes to explaining what's going on. Feel free to pop up on IRC or send a mail to the mailing list; we'll try to explain and/or improve the situation. Consult the contact page for details.

Cheers,
fijal & antocuni

Tuesday, August 2, 2011

PyPy is faster than C, again: string formatting

String formatting is probably something you do just about every day in Python, and never think about. It's so easy, just "%d %d" % (i, i) and you're done. No thinking about how to size your result buffer, whether your output has an appropriate NULL byte at the end, or any other details. A C equivalent might be:

char x[44];
sprintf(x, "%d %d", i, i);

Note that we had to stop for a second and consider how big the numbers might get, and overestimate the size: 44 = (20 digits for the biggest 64-bit number + 1 for the sign) * 2, plus 1 for the space and 1 for the NUL byte. It took the authors of this post, fijal and alex, 3 tries to get the math right on this :-)
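
As a sanity check, here is the same arithmetic spelled out in Python (a hedged illustration; the variable names are ours):

digits = len(str(2**64 - 1))     # 20: digits in the largest unsigned 64-bit value
size = (digits + 1) * 2 + 1 + 1  # a sign per number, one space, one NUL byte
assert size == 44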

This is fine, except you can't even return x from this function; a fairer comparison might be:

char *x = malloc(44 * sizeof(char));
sprintf(x, "%d %d", i, i);

x is slightly overallocated in some situations, but that's fine.

But we're not here to just discuss the implementation of string formatting; we're here to discuss how blazing fast PyPy is at it, with the new unroll-if-alt branch. Given the Python code:

def main():
    for i in xrange(10000000):
        "%d %d" % (i, i)

main()

and the C code:

#include <stdio.h>
#include <stdlib.h>


int main() {
    int i = 0;
    char x[44];
    for (i = 0; i < 10000000; i++) {
        sprintf(x, "%d %d", i, i);
    }
}

The Python code was run under PyPy at the head of the unroll-if-alt branch, and the C code was compiled with GCC 4.5.2 at -O4 (other optimization levels were tested; this one produced the best performance). It took 0.85 seconds to execute under PyPy and 1.63 seconds with the compiled binary. We think this demonstrates the incredible potential of dynamic compilation: GCC is unable to inline or unroll the sprintf call, because it sits inside of libc.

Benchmarking the C code:

#include <stdio.h>
#include <stdlib.h>


int main() {
    int i = 0;
    for (i = 0; i < 10000000; i++) {
        char *x = malloc(44 * sizeof(char));
        sprintf(x, "%d %d", i, i);
        free(x);
    }
}

This version, which as discussed above is more comparable to the Python, gives a result of 1.96 seconds.

Summary of performance:

Platform               Time     Relative to C (stack)
GCC (stack)            1.63s    1x
GCC (malloc)           1.96s    0.83x
CPython                10.2s    0.16x
PyPy (unroll-if-alt)   0.85s    1.9x

Overall PyPy is almost 2x faster. This is clearly a win for dynamic compilation over static compilation: the sprintf function lives in libc and so cannot be specialized for a constant format string, which therefore has to be parsed every time it is executed. In the case of PyPy, we specialize the assembler if we detect the left-hand side of the modulo operator to be a constant string.
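
To illustrate the idea, here is a sketch of the concept in plain Python (not PyPy's actual implementation; the helper names are ours):

# What a generic implementation does on every iteration:
# walk the format string, dispatching on each directive.
def generic_format(fmt, args):
    out, it = [], iter(args)
    i = 0
    while i < len(fmt):
        if fmt[i] == '%' and fmt[i + 1] == 'd':
            out.append(str(next(it)))
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return ''.join(out)

# What the specialized code for the constant "%d %d" is equivalent to:
# the parsing has already been done, once, at compile time.
def specialized_format(i):
    return str(i) + ' ' + str(i)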

Cheers,
alex & fijal

Thursday, July 7, 2011

Realtime image processing in Python

Image processing is notoriously a CPU-intensive task. To do it in realtime, you need to implement your algorithm in a fast language, hence trying to do it in Python is foolish: Python is clearly not fast enough for this task. Is it? :-)
Actually, it turns out that the PyPy JIT compiler produces code which is fast enough to do realtime video processing using two simple algorithms implemented by Håkan Ardö.
sobel.py implements a classical way of locating edges in images, the Sobel operator. It is an approximation of the magnitude of the image gradient. The processing time is spent on two convolutions between the image and 3x3 kernels.
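
For illustration, the per-pixel computation is roughly the following (a sketch with hypothetical names, not the actual demo code):

def sobel_magnitude(img, x, y):
    # horizontal gradient: convolution with the 3x3 Sobel x-kernel
    dx = (-1.0 * img[x-1, y-1] + 1.0 * img[x+1, y-1]
          - 2.0 * img[x-1, y]  + 2.0 * img[x+1, y]
          - 1.0 * img[x-1, y+1] + 1.0 * img[x+1, y+1])
    # vertical gradient: convolution with the 3x3 Sobel y-kernel
    dy = (-1.0 * img[x-1, y-1] - 2.0 * img[x, y-1] - 1.0 * img[x+1, y-1]
          + 1.0 * img[x-1, y+1] + 2.0 * img[x, y+1] + 1.0 * img[x+1, y+1])
    # approximate magnitude of the gradient
    return (dx * dx + dy * dy) ** 0.5
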
magnify.py implements a pixel coordinate transformation that rearranges the pixels in the image to form a magnifying effect in the center. It consists of a single loop over the pixels in the output image copying pixels from the input image.
You can try it yourself by downloading the appropriate demo.
To run the demo, you need to have mplayer installed on your system. The demo has been tested only on linux; it might (or might not) work on other systems:
$ pypy pypy-image-demo/sobel.py

$ pypy pypy-image-demo/magnify.py
By default, the two demos use an example AVI file. To have more fun, you can use your webcam by passing the appropriate mplayer parameters to the scripts, e.g.:
$ pypy demo/sobel.py tv://
By default magnify.py uses nearest-neighbor interpolation. By adding the option -b, bilinear interpolation will be used instead, which gives a smoother result:
$ pypy demo/magnify.py -b
There is only a single implementation of the algorithm in magnify.py. The two different interpolation methods are implemented by subclassing the class used to represent images and embedding the interpolation within the pixel access method. PyPy is able to achieve good performance with this kind of abstraction because it can inline the pixel access method and specialize the implementation of the algorithm. In C++, that kind of pixel access method would be virtual, and you'd need to use templates to get the same effect without incurring runtime overhead.
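
A minimal sketch of this pattern (the class names are hypothetical, not the demo's actual code):

class Image(object):
    def __init__(self, width, height, data):
        self.width, self.height, self.data = width, height, data

    def __getitem__(self, (x, y)):
        # nearest-neighbor access: just truncate the coordinates
        return self.data[int(y) * self.width + int(x)]

class BilinearImage(Image):
    def __getitem__(self, (x, y)):
        # override the pixel access method: weighted average
        # of the four surrounding pixels
        x0, y0 = int(x), int(y)
        dx, dy = x - x0, y - y0
        p00 = Image.__getitem__(self, (x0, y0))
        p10 = Image.__getitem__(self, (x0 + 1, y0))
        p01 = Image.__getitem__(self, (x0, y0 + 1))
        p11 = Image.__getitem__(self, (x0 + 1, y0 + 1))
        return (p00 * (1 - dx) * (1 - dy) + p10 * dx * (1 - dy)
                + p01 * (1 - dx) * dy + p11 * dx * dy)

Because the algorithm only ever does img[x, y], the JIT can inline whichever access method is in use and generate code specialized for it.
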
The video above shows PyPy and CPython running sobel.py side by side (PyPy taking input from the webcam, CPython from the test file). Alternatively, to get a feeling for how much faster PyPy is than CPython, try to run the demo with the latter. These are the average fps (frames per second) that I get on my machine (Ubuntu 64 bit, Intel i7 920, 4GB RAM) when processing the default test.avi video and using the prebuilt PyPy binary found in the full tarball linked above. For sobel.py:
  • PyPy: ~47.23 fps
  • CPython: ~0.08 fps
For magnify.py:
  • PyPy: ~26.92 fps
  • CPython: ~1.78 fps
This means that on sobel.py, PyPy is 590 times faster. On magnify.py the difference is much less evident and the speedup is "only" 15x.
It must be noted that this is an extreme example of what PyPy can do. In particular, you cannot expect (yet :-)) PyPy to be fast enough to run an arbitrary video processing algorithm in real time, but the demo still proves that PyPy has the potential to get there.

Wednesday, June 29, 2011

Global Interpreter Lock, or how to kill it

People that listened to my (Armin Rigo) lightning talk at EuroPython know that suddenly, we have a plan to remove the Global Interpreter Lock --- the infamous GIL, the thing in CPython that prevents multiple threads from actually running in your Python code in parallel.

That's not actually new, because Jython has been doing it all along. Jython works by very carefully adding locks to all the mutable built-in types, and by relying on the underlying Java platform to be efficient about them (so that the result is faster than, say, very carefully adding similar locks in CPython). By "very carefully", I mean really really carefully; for example, 'dict1.update(dict2)' needs to lock both dict1 and dict2, but if you do it naively, then a parallel 'dict2.update(dict1)' might cause a deadlock.

All of PyPy, CPython and IronPython have a GIL. But for PyPy we are considering a quite different approach than Jython's, based on Software Transactional Memory. This is a recent development in computer science, and it gives a nicer solution than locking. Here is a short introduction to it.

Say you want to atomically pop an item from 'list1' and append it to 'list2':

def f(list1, list2):
    x = list1.pop()
    list2.append(x)

This is not safe in multithreaded cases (even with the GIL). Say that you call f(l1, l2) in thread 1 and f(l2, l1) in thread 2. What you want is that it has no effect at all (x is moved from one list to the other, then back). But what can occur is that instead the tops of the two lists are swapped, depending on timing issues.
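
For example, one problematic interleaving looks like this (a hypothetical schedule, written as comments):

# thread 1: x1 = list1.pop()     # takes the top of list1
# thread 2: x2 = list2.pop()     # takes the top of list2
# thread 1: list2.append(x1)
# thread 2: list1.append(x2)     # net effect: the two tops are swapped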

One way to fix it is with a global lock:

def f(list1, list2):
    global_lock.acquire()
    x = list1.pop()
    list2.append(x)
    global_lock.release()

A finer way to fix it is with locks that come with the lists:

def f(list1, list2):
    acquire_all_locks(list1.lock, list2.lock)
    x = list1.pop()
    list2.append(x)
    release_all_locks(list1.lock, list2.lock)

The second solution is a model for Jython's, while the first is a model for CPython's. Indeed, in CPython's interpreter, we acquire the GIL, then we do one bytecode (or actually a number of them, like 100), then we release the GIL; and then we proceed to the next bunch of 100.

Software Transactional Memory (STM) gives a third solution:

def f(list1, list2):
    while True:
        t = transaction()
        x = list1.pop(t)
        list2.append(t, x)
        if t.commit():
            break

In this solution, we make a transaction object and use it in all reads and writes we do to the lists. There are actually several different models, but let's focus on one of them. During a transaction, we don't actually change the global memory at all. Instead, we use the thread-local transaction object. We store in it which objects we read from, which objects we write to, and what values we write. It is only when the transaction reaches its end that we attempt to "commit" it. Committing might fail if other commits have occurred in between, creating inconsistencies; in that case, the transaction aborts and must restart from the beginning.
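
As a toy illustration of this model (hypothetical code, not PyPy's implementation; in a real system the commit step itself must be made atomic, e.g. with a short internal lock or compare-and-swap):

class Transaction(object):
    def __init__(self, memory):
        self.memory = memory   # shared dict: key -> (value, version)
        self.reads = {}        # key -> version observed during the transaction
        self.writes = {}       # key -> new value, buffered locally

    def read(self, key):
        if key in self.writes:            # we see our own writes
            return self.writes[key]
        value, version = self.memory[key]
        self.reads[key] = version         # record what we depended on
        return value

    def write(self, key, value):
        self.writes[key] = value          # not visible to other threads yet

    def commit(self):
        # abort if anything we read has changed since we read it
        for key, version in self.reads.items():
            if self.memory[key][1] != version:
                return False
        # publish the buffered writes, bumping the versions
        for key, value in self.writes.items():
            _, version = self.memory.get(key, (None, 0))
            self.memory[key] = (value, version + 1)
        return True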

In the same way as the previous two solutions are models for CPython and Jython, the STM solution looks like it could be a model for PyPy in the future. In such a PyPy, the interpreter would start a transaction, do one or several bytecodes, and then end the transaction; and repeat. This is very similar to what is going on in CPython with the GIL. In particular, it means that it gives programmers all the same guarantees as the GIL does. The only difference is that it can actually run multiple threads in parallel, as long as their code does not interfere with each other. (In particular, if you need not just the GIL but actual locks in your existing multi-threaded program, then this will not magically remove the need for them. You might get an additional built-in module that exposes STM to your Python programs, if you prefer it over locks, but that's another question.)

Why not apply that idea to CPython? Because we would need to change everything everywhere. In the example above, you may have noted that I no longer call 'list1.pop()', but 'list1.pop(t)'; this is a way to tell that the implementation of all the methods needs to be changed, in order to do their work "transactionally". This means that instead of really changing the global memory in which the list is stored, it must instead record the change in the transaction object. If our interpreter is written in C, as CPython is, then we need to write it explicitly everywhere. If it is written instead in a higher-level language, as PyPy is, then we can add this behavior as a set of translation rules, and apply them automatically wherever it is necessary. Moreover, it can be a translation-time option: you can either get the current "pypy" with a GIL, or a version with STM, which would be slower due to the extra bookkeeping. (How much slower? I have no clue, but as a wild guess, maybe between 2 and 5 times slower. That is fine if you have enough cores, as long as it scales nicely :-)

A final note: as STM research is very recent (it started around 2003), there are a number of variants around, and it's not clear yet which one is better in which cases. As far as I can tell, the approach described in "A Comprehensive Strategy for Contention Management in Software Transactional Memory" seems to be one possible state-of-the-art; it also seems to be "good enough for all cases".

So, when will it be done? I cannot say yet. It is still at the idea stage, but I think that it can work. How long would it take us to write it? Again no clue, but we are looking at many months rather than many days. This is the sort of thing that I would like to be able to work on full time after the Eurostars funding runs out on September 1. We are currently looking at ways to use crowdfunding to raise money so that I can do exactly that. Expect a blog post about that very soon. But this looks like a perfect candidate for crowdfunding -- there are at least thousands of you who would be willing to pay 10s of Euros to Kill the GIL. Now we only have to make this happen.

Wednesday, June 8, 2011

Report back from our survey

Hi all,

I'm here to report back the results of our survey. First, we're very pleased to report that a number of you are happily running PyPy in production! Most (97%) of the respondents using PyPy are using it because it's faster, but a further 26% (respondents could choose multiple answers) are using it because of lower memory usage. Of users who aren't using PyPy, the most common reason was C extensions, followed by "Other".

From reading the extra comments section there are a few things we've learned:

  1. Google docs needs a better UI for this stuff
  2. A huge number of people want NumPy and SciPy; it was easily the most requested C extension (25% of respondents said something about NumPy). We've already blogged on the topic of our plans for NumPy.
  3. Having packages in the various OS's repositories would be a big help in getting users up and running.

A huge thanks to everyone who responded! Finally, if you're using PyPy in production we'd love to get a testimonial from you, if you're willing to spare a few minutes to give us a quote or two please get in contact with us via our mailing list.

Thanks, Alex

Monday, May 23, 2011

PyPy Genova-Pegli Post-EuroPython Sprint June 27 - July 2 2011

The next PyPy sprint will be in Genova-Pegli, Italy, the week after EuroPython (which is in Florence, about 3h away by train). This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

Goals and topics of the sprint

  • Now that we have released 1.5, the sprint itself is going to be mainly working on fixing issues reported by various users. Possible topics include, but are not limited to:
    • fixing issues in the bug tracker
    • improve cpyext, the C-API compatibility layer, to support more extension modules
    • finish/improve/merge jitypes2, the branch which makes ctypes JIT friendly
    • general JIT improvements
    • improve our tools, like the jitviewer or the buildbot infrastructure
    • make your favorite module/application work on PyPy, if it doesn't yet
  • Of course this does not prevent people from showing up with a more precise interest in mind. If there are newcomers, we will gladly give introduction talks.
  • Since we are almost on the beach, we can take one day off for summer relaxation and/or tourist visits nearby :-).

Exact times

The work days should be 27 June - 2 July 2011. People may arrive on the 26th already and/or leave on the 3rd.

Location & Accommodation

Both the sprint venue and the lodging will be at Albergo Puppo in Genova-Pegli, Italy. Pegli is a nice and peaceful little quarter of Genova, and the hotel is directly on the beach, making it a perfect place for those who want to enjoy the sea in the middle of the Italian summer, as a quick search on Google Images shows :-)

The place has a good ADSL Internet connection with wireless installed. You can of course arrange your own lodging anywhere, but I definitely recommend lodging there too.
Please confirm that you are coming so that we can adjust the reservations as appropriate. The prices are as follows, and they include breakfast and a parking place for the car, in case you need it:
  • single room: 70 €
  • double room: 95 €
  • triple room: 105 €
Please register by hg:
http://bitbucket.org/pypy/extradoc/src/extradoc/sprintinfo/genova-pegli-2011/people.txt
or on the pypy-dev mailing list if you do not yet have check-in rights:
http://mail.python.org/mailman/listinfo/pypy-dev
In case you want to share a room with someone else but you don't know who, please let us know (either by writing it directly in people.txt or by writing on the mailing list) and we will try to arrange it.

Monday, May 16, 2011

PyPy Usage Survey

We've been working on PyPy for a long time. But readers of this blog will know that in the past year something has changed: we think PyPy is production ready. And it's not just us: this week LWN.net wrote an article about how PyPy sped up one of their scripts by a factor of three, noting that "plans are to run gitdm under PyPy from here on out". All in all we think PyPy is pretty great, but not everyone is using it yet, and we want to know why. We want your feedback on why PyPy isn't ready to be your only Python yet, and how we can improve it to make that happen.

Therefore, we've put together a quick survey. Whether you're using PyPy or not, if you could take a few minutes to fill it out and let us know how we're doing, we'd really appreciate it. You can find the form here.

Thanks, The PyPy team

Sunday, May 15, 2011

Server migration in progress

Hi all,

We are in the process of migrating the hosting machine for PyPy, moving away from codespeak.net and towards a mixture of custom servers (e.g. for buildbot.pypy.org) and wide-scale services (e.g. for the docs, now at readthedocs.org).

When this is done, a proper announcement will be posted here. In the meantime, we have already moved the mailing lists, now hosted on python.org. The subscribers' lists have been copied, so if you didn't notice anything special for the past week, then everything works fine :-) This concerns pypy-dev, pypy-issue and pypy-commit. Two notes:

  • Some settings have not been copied, notably if you used to disable mail delivery. Sorry about that; you have to re-enter such settings.
  • Following the move, about 50 addresses have been dropped for being invalid. I'm unsure why they were not dropped earlier, but if sending mail to you from python.org instead of codespeak.net fails, then you have been dropped from the mailing lists, and you need to subscribe again.

Wednesday, May 11, 2011

Playing with Linear Programming on PyPy

Fancy high-level interfaces often come with a high runtime overhead that makes them slow. Here is an experiment with building such an interface using constructions that PyPy should be good at optimizing. The idea is to allow the JIT in PyPy to remove the overhead introduced by using a fancy high-level Python interface on top of a low-level C interface. The application considered is linear programming, a tool used to solve linear optimization problems. It can for example be used to find the nonnegative values x, y and z that give the maximum value of 10x + 6y + 4z
without violating the constraints x + y + z <= 100, 10x + 4y + 5z <= 600 and 2x + 2y + 6z <= 300.
There exist general-purpose solvers for this kind of problem that are very fast and can literally handle millions of variables. To use them, however, the problem has to be transformed into some specific matrix form, and the coefficients of all the matrices have to be passed to the solver using some API. This transformation is a tedious and error-prone step that forces you to work with matrix indexes instead of readable variable names. It also makes maintaining an implementation hard, since any modification has to be transformed too.

The example above comes from the manual of the glpk library. That manual continues by describing how to convert this problem into the standard form of glpk (which involves introducing three new variables) and then gives the C code needed to call the library. Relating that C code to the problem above without the intermediate explanation of the manual is not easy. A common solution here is to build a high-level interface that allows a more natural way of defining the matrices and/or allows the equations to be entered symbolically. Unfortunately, such interfaces often become slow. For the benchmark below for example, cvxopt requires 20 minutes to set up a problem that takes 9.43 seconds to solve (this seems a bit extreme, am I doing something wrong?).
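
For comparison, the matrix form that a low-level solver API expects looks roughly like this (a sketch of the data layout, not glpk's actual API):

# maximize c^T x  subject to  A x <= b  and  x >= 0
c = [10, 6, 4]            # objective coefficients
A = [[ 1, 1, 1],          # one row of coefficients per constraint
     [10, 4, 5],
     [ 2, 2, 6]]
b = [100, 600, 300]       # right-hand sides

Every coefficient is identified only by its position in the matrices, which is exactly the kind of error-prone bookkeeping the symbolic interface below avoids.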

The high-level interface I constructed on top of the glpk library is pplp, and it allows the equations to be entered symbolically. The above problem can be solved like this:

    lp = LinearProgram()
    x, y, z = lp.IntVar(), lp.IntVar(), lp.IntVar()
    lp.objective = 10*x + 6*y + 4*z
    lp.add_constraint( x + y + z <= 100 )
    lp.add_constraint( 10*x + 4*y + 5*z <= 600 )
    lp.add_constraint( 2*x + 2*y + 6*z <= 300 )
    lp.add_constraint( x >= 0 )
    lp.add_constraint( y >= 0 )
    lp.add_constraint( z >= 0 )

    maxval = lp.maximize()
    print maxval
    print x.value, y.value, z.value

To benchmark the API I used it to solve a minimum-cost flow problem with 154072 nodes and 390334 arcs. The C library needs 9.43 s to solve this; the pplp interface adds another 5.89 s under PyPy and 28.17 s under CPython. A large amount of time is still spent setting up the problem, but it's a significant improvement over the 20 minutes required on CPython by cvxopt (which is probably not designed to be fast on this kind of benchmark; I have not been able to get cvxopt to work under PyPy). The benchmark used is available here.