Enzo 2.0 has just been released to its new Google Code website. This release features preliminary support for inline Python analysis, using yt.
In the Enzo documentation there's a brief section on how to use yt for inline analysis. As it stands, many features are not fully functional, but things like phase plots, profiles, derived quantities, and slices all work. This functionality is currently untested at large processor counts (> 128), but for small runs -- particularly debugging runs! -- it works nicely. (Projections do not work yet, but they will.)
Email the mailing list or leave a comment here if you run into any trouble, or if you want to share success stories!
Just sharing a video here that showcases some improvements I've made to the kD-tree rendering that will be making its way to yt for the 2.0 release. You can download it here.
Just to be clear, this is showing the rendering of a cosmology simulation with a 64^3 root grid + 6 AMR levels in real time on 8 processors. The script is run in parallel, with the root processor displaying the results once each frame is finished. The viewpoint is being randomized, showing the power of a kD-tree homogenization that allows a fast back-to-front sorting algorithm for each brick. The big key here is that each processor keeps the data associated with its volume in memory, so that a new viewpoint doesn't require additional file I/O.
To help get an idea of what the load balancing is doing, I figured out a way to plot the outline of the bricks with a color corresponding to each processor. This is using the breadth-first load balancing, where the top N subtrees are distributed across the N processors. Some of the colors overlap in an odd way because of the order in which they are drawn, but you can get the general idea.
There are a few more improvements on the way such as parallel kD-tree construction which should lower the overhead for this method by quite a lot, so keep an eye out!
The last few days I've spent some time looking at how parallelism in yt performs. I ran four different tests, all operating on the 512^3, 7-level 'Santa Fe Light Cone' dataset RD0036, which has 5.5e8 total cells and in the neighborhood of 380,000 grids: a 1D profile of the entire dataset, a 2D profile of the entire dataset, and projections of both 'Density' (requires IO) and 'Ones' (doesn't require IO). For the purposes of these tests I excised any time spent in creating the hierarchy and affiliated grid objects. The results are shown here:
One thing I found rather astonishing about this is how poorly the 1D profile performs. It's not incomprehensibly slow, but it is much slower than the 2D profile. I investigated why, and I believe it comes down to a fundamental difference in the implementation. The 2D profile's binning algorithm is implemented in C, whereas the binning algorithm in the 1D profile is implemented in Python, such that it scales quite poorly with the number of bins. Since I was taking a 128-bin profile, this compounded the problem substantially. Regardless, while a new 1D profiling implementation probably won't be written for yt-2.0 (as it's not so slow as to be prohibitively costly), it will be replaced in the future.
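To illustrate the difference -- a hedged sketch, not yt's actual profiling code; the function names and shapes here are invented -- compare a per-bin Python loop against a single vectorized binning pass:

```python
import numpy as np

def bin_profile_python(x, weights, edges):
    """Per-bin Python loop: one boolean-mask pass over the data
    for every bin, so cost grows with n_bins * n_cells."""
    result = np.zeros(len(edges) - 1)
    for i in range(len(edges) - 1):
        mask = (x >= edges[i]) & (x < edges[i + 1])
        result[i] = weights[mask].sum()
    return result

def bin_profile_vectorized(x, weights, edges):
    """Single digitize pass; cost is nearly independent of bin count."""
    idx = np.digitize(x, edges) - 1
    good = (idx >= 0) & (idx < len(edges) - 1)
    return np.bincount(idx[good], weights=weights[good],
                       minlength=len(edges) - 1)

x = np.random.random(100000)
w = np.ones_like(x)
edges = np.linspace(0.0, 1.0, 129)  # 128 bins, as in the test above
a = bin_profile_python(x, w, edges)
b = bin_profile_vectorized(x, w, edges)
print(np.allclose(a, b))  # True -- same answer, very different scaling
```

The first version does 128 full passes over the data; the second does one, which is roughly the difference between the two profile implementations described above.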
I was, however, quite pleased to see the parallelism scale so nicely for the profiles. Projections had a bit of a different story, in that their strong scaling saturated at around 64 processors; I believe this is probably due simply to the projection algorithm's cost/benefit ratio. I would point out that in my tests there was some noise at the high end -- in one of the runs, projecting on 128 processors took a mere 6 seconds. Then again, since you only need to project once anyway (and then spend quite a while writing a paper!) the time difference between 30 seconds and 6 seconds is pretty much a wash.
Now, before I get too enthusiastic about this, I did find a couple of very disturbing problems with the parallelism in yt. Both were within the derived quantities -- these are quantities that hang off an object, like sphere.quantities['Extrema']('Density') and so on. (In fact, that's the one that brought the problem to light.) I was performing a call to 'Extrema' before conducting the profiling step, and I saw that it took orders of magnitude more time than the profiling itself! But why should this be, when they both touch the disk roughly the same amount? So I dug in a bit further. Derived quantities are calculated in two steps. The first is to calculate reduced products on every grid object: for the 'Extrema' example above, this would be the min/max within a given grid. All of the intermediate results then get fed into a single 'combine' operation, which calculates the final result. This enables better parallelism as well as grid-by-grid calculation -- much easier on the memory!
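The two-step pattern can be sketched like so (a toy model; the "grids" here are plain arrays rather than yt grid objects):

```python
import numpy as np

# Hypothetical stand-in for a hierarchy of grids: each "grid" is just an array.
grids = [np.random.random(1000) for _ in range(16)]

# Step 1: a cheap per-grid reduction -- only one grid's data needs to be in
# memory at a time (and in parallel, each task only handles its own grids).
intermediates = [(g.min(), g.max()) for g in grids]

# Step 2: a single 'combine' operation over the small intermediate results.
def combine_extrema(pairs):
    mins, maxs = zip(*pairs)
    return min(mins), max(maxs)

lo, hi = combine_extrema(intermediates)

# Sanity check against a brute-force global reduction:
all_data = np.concatenate(grids)
print(lo == all_data.min() and hi == all_data.max())  # True
```

The intermediates are tiny compared to the raw data, which is what makes both the parallel communication and the memory footprint manageable.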
Now, in parallel we usually preload all the data that is going to be accessed by a given task. For instance, if we're calculating the extrema of density, each MPI task will read from disk all the densities in the grids it's going to encounter. (In serial this is much less wise, and so we avoid it -- although batch-preloading could be a useful enhancement in the future.) However, what I discovered was that in fact the entire process of preloading was disabled by default. So instead of conducting maybe 10-30 IO operations, each processor was conducting more like 10,000. This was why it was so slow! I've since changed the default for parallel runs to be to indeed preload data.
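As a toy model of why this matters -- FakeGrid, read_field, and Preloader are invented stand-ins, not yt's actual I/O layer -- compare the I/O counts with and without preloading:

```python
# Each call to read_field models one trip to disk.
class FakeGrid:
    def __init__(self, gid):
        self.id = gid

io_calls = [0]
def read_field(grid, field):
    io_calls[0] += 1
    return [float(grid.id)]  # dummy data

class Preloader:
    """Read every (grid, field) pair once, up front, into a cache."""
    def __init__(self, grids, fields):
        self.cache = {}
        for g in grids:          # in practice this would be one batched read
            for f in fields:
                self.cache[(g.id, f)] = read_field(g, f)
    def get(self, grid, field):
        return self.cache[(grid.id, field)]

grids = [FakeGrid(i) for i in range(100)]
pre = Preloader(grids, ["Density"])
# Ten passes over the data now cost zero additional I/O operations:
for _ in range(10):
    for g in grids:
        pre.get(g, "Density")
print(io_calls[0])  # 100, not 1000
```

Without the cache, every pass over the grids would hit the disk again; with it, the I/O count is fixed at one read per grid per field.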
The other issue was with the communication. Derived quantities were still using a very old, primitive version of parallelism that I implemented more than two years ago. In this primitive version, the root processor would turn-by-turn communicate with each of the other processors, and then finally broadcast a result back to all of them. I have since changed this to use an Allgather operation, a collective in which every processor receives the combined results at once. This sped things up by a factor of 4, although I am not sure that in the long run this will be wise at high processor counts. (As a side note, in converting the derived quantities to Allgather, I also converted projections to use Allgatherv instead of Alltoallv. This should dramatically improve communication in projections, but communication was never the weak spot in projections anyway.)
There are still many places parallelism could be improved, but with just these two changes to the derived quantity parallelism, I see speeds that are much more in line with the amount of IO and processing that occurs.
The current method for projections in yt is based on a home-grown algorithm for calculating grid overlap and joining points. I've always been pretty proud of it -- it gave good results, and it succeeded at the project-once-make-many-images philosophy that went into its design. Rather than storing a 2D array of pixels, projections and slices in yt store flat arrays of image plane coordinates and cell widths. This means that there's an additional step of pixelization to create an image, but it also means that arbitrary images can be made from a single projection or slice operation.
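The pixelization step might be sketched like this (a deliberately naive version; yt's actual pixelizer is implemented in compiled code and handles edge cases this one ignores):

```python
import numpy as np

def pixelize(px, py, pdx, pdy, values, bounds, nx, ny):
    """Deposit variable-size cells (centers px,py; half-widths pdx,pdy)
    into an nx x ny image buffer covering `bounds` = (x0, x1, y0, y1)."""
    x0, x1, y0, y1 = bounds
    img = np.zeros((ny, nx))
    for cx, cy, hx, hy, v in zip(px, py, pdx, pdy, values):
        # Convert the cell's footprint to pixel indices, clipped to the image.
        i0 = max(int((cx - hx - x0) / (x1 - x0) * nx), 0)
        i1 = min(int(np.ceil((cx + hx - x0) / (x1 - x0) * nx)), nx)
        j0 = max(int((cy - hy - y0) / (y1 - y0) * ny), 0)
        j1 = min(int(np.ceil((cy + hy - y0) / (y1 - y0) * ny)), ny)
        img[j0:j1, i0:i1] = v
    return img

# One cell covering the left half of the domain, one the right half:
px = np.array([0.25, 0.75]); py = np.array([0.5, 0.5])
pdx = np.array([0.25, 0.25]); pdy = np.array([0.5, 0.5])
vals = np.array([1.0, 2.0])
img = pixelize(px, py, pdx, pdy, vals, (0, 1, 0, 1), 8, 8)
print(img[0, 0])  # 1.0
print(img[0, 7])  # 2.0
```

Because the flat cell arrays carry their own widths, the same projection data can be re-pixelized at any resolution or zoom level without redoing the projection itself.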
Over the years, I've spent a lot of time shaving time off of the projection mechanism. But, it was still slow on some types of data -- and the parallelism was never all that great, either. Because of the algorithm, the parallelism was exclusively based on image-plane decomposition, which means that it was difficult to get good load balancing on nested grid simulations and that it was simply impossible to run inline unless massive amounts of data were to be passed over the wire between processes. This ultimately meant that embedded analysis could not generate adaptive projections, only fixed resolution projections.
A couple months ago, I tried my hand at implementing a new mechanism for projecting based on quad trees. The advantage with quad trees would be that they could be decomposed independent of the image plane; this would enable the projections to be made inline, without shipping data around between nodes. The initial results were extremely promising -- it was able to project the density field for a 1024^3, 7-level dataset (1.6 billion cells) in around 2300 seconds in serial, with peak memory usage of 2.7 gigabytes, easily fitting within a single node. I don't know how long it would take to project this in serial with the old method, because I killed the task after a few hours.
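The core idea can be sketched with a toy quad tree (this is illustrative only, not the actual AMRQuadProj implementation): cells from any AMR level deposit into a node keyed by level and image-plane index, and reading out a finest-level pixel just sums its own node plus all of its coarser ancestors.

```python
class QuadTreeProj:
    def __init__(self, max_level):
        self.max_level = max_level
        self.nodes = {}  # (level, i, j) -> accumulated column value

    def add_cell(self, level, i, j, value):
        """Deposit one cell's contribution (e.g. density * path length)
        into the node at its own refinement level."""
        key = (level, i, j)
        self.nodes[key] = self.nodes.get(key, 0.0) + value

    def value_at(self, i, j):
        """Total column value for a finest-level image cell (i, j):
        its own node plus every coarser ancestor covering it."""
        total = 0.0
        for level in range(self.max_level, -1, -1):
            shift = self.max_level - level
            total += self.nodes.get((level, i >> shift, j >> shift), 0.0)
        return total

qt = QuadTreeProj(max_level=2)
qt.add_cell(0, 0, 0, 1.0)   # a coarse cell covering the whole quadrant
qt.add_cell(2, 3, 3, 0.5)   # a fine cell refined inside it
print(qt.value_at(3, 3))    # 1.5 -- coarse plus fine contributions
print(qt.value_at(0, 0))    # 1.0 -- coarse contribution only
```

Because deposition only depends on each cell's own level and index, grids can be handled in any order on any processor, which is exactly what makes the decomposition independent of the image plane.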
For various reasons, I did not have time to implement this in the main yt code until very recently. It now lives in the main 'yt' branch as the AMRQuadProj class, hanging off the hierarchy as .quad_proj. Once it has been parallelized, it will replace the old .proj method, and will be used natively. One of the unfortunate side effects of inserting it into the full yt machinery is that projections have to satisfy a number of API requirements -- they have to handle field cuts, arbitrary data objects, and several other reductions that slow things down. However, even with this added baggage, in most cases the new quad tree projection will be about twice as fast as the old method for projections. If you want to give it a go before it becomes the default, you can access it directly by calling .quad_proj on the hierarchy.
Hopefully I'll be able to parallelize it soon, so that this can become the default method -- and so that variable resolution projections can make their way into embedded analysis!
The Parallel Merger Tree in yt, like most of the code in yt, has a rich history of changes and upgrades. One of the most significant upgrades was a change in the way the SQLite database file is updated during the course of building the merger tree. Briefly, the database contains all the information about the merger tree of the halos, as well as the specifics of each halo, such as the mass, position or bulk velocity. Originally, the database was built by first filling in all the initially available information about each halo: the data that comes from the halo finder. Then, as the merger tree ran and identified relationships between halos, the entries for each halo would be modified to reflect the new information using the SQL "UPDATE" command. Below is a simplified example of how this worked, using random numbers instead of halo data:
import sqlite3 as sql
import numpy as np
import time, random
# How many random numbers we'll be working with.
number = int(5e5)
# Create and connect to the database.
conn = sql.connect("test.db")
cursor = conn.cursor()
# Make the table we'll be using. The first column will automatically
# generate the integer index for each row, starting at 1.
line = "CREATE TABLE randoms (count INTEGER PRIMARY KEY, value FLOAT);"
cursor.execute(line)
# Make the random numbers.
rnum = np.random.random(number)
t1 = time.time()
# Insert the values the first time.
for r in rnum:
    line = "INSERT INTO randoms VALUES (null, %1.10e);" % r
    cursor.execute(line)
t2 = time.time()
print "First insertion took %1.2e seconds" % (t2-t1)
print "Now let's update the information."
# Make some more random numbers.
rnum2 = np.random.random(number)
# This makes an array of the integers [1:number], with the positions
# randomly placed.
order = np.array(random.sample(xrange(1, number + 1), number))
t3 = time.time()
# Update the entries.
for pair in zip(order, rnum2):
    line = "UPDATE randoms SET value = %1.10e WHERE count = %d;" % \
        (pair[1], pair[0])
    cursor.execute(line)
t4 = time.time()
print "Update took %1.2e seconds" % (t4-t3)
# Close up properly.
cursor.close()
conn.close()
Running this on my two-year-old MacBook Pro gives me:
First insertion took 1.79e+01 seconds
Now let's update the information.
Update took 4.58e+01 seconds
Remember that the second operation is putting no more information in the database; it is just changing the data. However, it's much slower because it must search the database for each entry with the current "count" value. In effect, with each "UPDATE", the whole database is being read and every entry is being compared to "count" to see if it needs to be changed. There is a "LIMIT" keyword that can be part of an "UPDATE" statement that tells SQLite to affect only a certain number of rows. However, this is not a good solution, for two reasons. First, this capability needs to be enabled when the SQLite library is compiled, and it often isn't. Second, this only reduces the average number of rows searched from N to N/2, which is not much of an improvement.
In the Merger Tree, this "UPDATE" step became the largest single factor in the runtime. This is very bad because in the Parallel Merger Tree, all SQLite writes are done with the root processor, which means all the other tasks sat idle waiting on the root task. A better way was needed!
As it turns out, there is a much faster way. Instead of using "UPDATE", a second, temporary database is created. As the example above shows, "INSERT" statements are relatively fast, so we would like to use them instead! As the first database is read in, its values are modified, and then the data is added to the second. After all is said and done, the second database replaces the first, and the overall effect is the same.
import sqlite3 as sql
import numpy as np
import time, random
# How many random numbers we'll be working with.
number = int(5e5)
# Create and connect to the database.
conn = sql.connect("test.db")
cursor = conn.cursor()
# Make the table we'll be using.
line = "CREATE TABLE randoms (count INTEGER PRIMARY KEY, value FLOAT);"
cursor.execute(line)
# Make the random numbers.
rnum = np.random.random(number)
t1 = time.time()
# Insert the values the first time.
for r in rnum:
    line = "INSERT INTO randoms VALUES (null, %1.10e);" % r
    cursor.execute(line)
t2 = time.time()
print "First insertion took %1.2e seconds" % (t2-t1)
print "Now let's update the information in a faster way."
# Make some more random numbers.
rnum2 = np.random.random(number)
t3 = time.time()
# Make a second database.
conn2 = sql.connect("test2.db")
cursor2 = conn2.cursor()
# We need to make the table, again.
line = "CREATE TABLE randoms (count INTEGER PRIMARY KEY, value FLOAT);"
cursor2.execute(line)
# Update the entries by reading in from the first, writing to the second.
line = "SELECT * FROM randoms ORDER BY count;"
cursor.execute(line)
results = cursor.fetchone()
mark = 0
while results:
    line = "INSERT INTO randoms VALUES (%d, %1.10e);" % \
        (results[0], rnum2[mark])
    cursor2.execute(line)
    mark += 1
    results = cursor.fetchone()
t4 = time.time()
print "Update took %1.2e seconds" % (t4-t3)
# Close up properly.
cursor.close()
conn.close()
cursor2.close()
conn2.close()
And the timings:
First insertion took 1.55e+01 seconds
Now let's update the information in a faster way.
Update took 1.82e+01 seconds
Of course there will always be some variability in the timings, but the relative difference between the two methods is very clear. In this second method, the update takes just a bit longer than the initial insertion, whereas the first method took several times longer, for the same overall result. Note that in the second example the "count" column is specified explicitly; if instead a new entry is made with a "null" value, SQLite will correctly fill in the next larger integer with no problem.
The only caveat is that this method will temporarily require roughly double the amount of disk space. This is a minor concern because modern computers have far more disk space than a SQLite database (or two) could ever fill up in a reasonable application.
For about a year and a half now, most of the unstable development of yt has occurred inside a mercurial repo. Mercurial is a distributed version control system, not unlike git or bzr, where each checkout brings with it the entire history of the repository and enables full development. Each individual can commit changes to their own repository, while still accepting changes from others. It also makes it much easier to submit patches upstream.
All in all, mercurial is a much better development environment than subversion, and I think that's been borne out by the changes in the development style of both yt and Enzo, which has also recently moved to mercurial.
For a while we'd been keeping a mixed model: subversion was regarded as more stable, slower moving, and the mercurial development was less stable and faster; things were worked on in hg and then backported to svn. However, because of the recent reorganization efforts to clean up the code in preparation for yt-2.0, it became clear that backporting these changes to svn would be a nightmare. So we decided we would move exclusively to hg for development.
However, this brought with it some unfortunate problems: when I initially created the hg repository in February of 2009, I only imported the history from SVN revision 1100 onwards. And because of the way hg works, I was unable to get any newly reimported history to connect with the old history. But if we were going to move to hg exclusively, I wanted all that old history.
Thanks to the convert extension to mercurial, however, I was able to import all the old history and splice the two together. Unfortunately, this meant that all the changeset hashes were different! So with the conversion to mercurial we had to force everyone who already had an hg repo to re-clone and re-initialize their own repository. This turned out not to be so bad, but because of some decisions I made during the splicing, the graph now looks a bit crazy in the middle! I'm still thinking about ways to change that without changing any changeset hashes, but we'll just have to see; maybe it's not so bad that we have so many crazy paths and branches.
We're gently migrating to the hg repositories exclusively; the installation scripts have mostly been updated, but not completely rolled out. Hopefully the end result will be worth it -- I've written up some developer documentation on how to use hg and how to use it to develop yt, and I hope this will be the start of a more open development process!
Last week I was approached by a friend and collaborator to prepare some large volume renderings using the software volume renderer in yt. In the past we've successfully made very, very large image renderings using yt -- Sam's even made one at 8192^2 -- although at resolutions that extreme the lack of fidelity in the underlying volume renderer sometimes shows up, and occasionally even artifacts at the AMR grid boundaries, though that's less common. Making the very large volume renderings isn't too bad: the cost scales roughly with the number of pixels, and we can dispatch many frames to be rendered at once on a cluster. There are a couple of other, more important things to consider when making the big volume renderings. For starters, the entire structure of volume rendering in yt was not really created to generate a series of images -- only a single image. The idea was that you would prepare a specific image, make it, and move on. However, for this project, I want to do a zoom-in, or possibly a more complicated camera path.
Additionally, one of the first things that we did with the volume rendering was silly: we applied no normalization to the output images. That was a mistake, I see now. Part of the reason for this was uncertainty about the correct normalization -- the bias that the user wanted to apply may not be the natural bias of the image. But more than that, because the rendering algorithm itself was somewhat heuristically settled upon (the original implementation, which we used for shell-style renderings, was not a "correct" implementation of alpha blending), a natural mechanism for scaling did not immediately present itself. One likely exists, possibly dependent on the field of view; I simply do not yet know it. This will have to be rectified, because the mechanism used for scaling a set of images will have to be different from the mechanism for scaling an image in isolation, or else frames will jump in brightness over the course of the movie.
The final thing that I wanted to change was to add support for stereo rendering. Rather than repeat any of the amazing discussion from Paul Bourke's website, I'll simply direct you there. Everything you ever wanted to know about stereo rendering. (When I was a first year grad student, we actually bought a copy of his site to use locally -- it was our way of showing support for him putting it online, and it also came with a bunch of source code for example applications.) I first attempted to apply the correct method for stereo, where the view direction is parallel and the total view frustum is shifted.
This did not work. In fact, it made me realize that all this time, the yt method for volume rendering is in fact ... not really a volume rendering method inasmuch as it is a planar ray-casting method. Typically when doing volume rendering, there's a perspective applied to the image: the rays all emanate from a single place, creating a frustum. But for yt, we actually set up a single plane of vectors at the back of the volume and advance that forward across the image. This is good and bad; it's good in the sense that it's more clear precisely what is going on. But it's bad in the sense that correct stereo is more difficult. (Of course, on Bourke's page he has a workaround that may work for this, but I have not yet attempted it.) Here's a rough depiction of the difference between the two methods.
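Here's a rough sketch of the two ray setups (the helper names and fixed orientation vectors are invented for illustration; yt's camera builds its axes from its own orientation vectors): planar casting gives every ray the same direction, while a perspective frustum aims every ray from a single eye point.

```python
import numpy as np

def planar_rays(center, normal, width, n):
    """yt-style: a plane of parallel rays all sharing one direction."""
    east = np.array([1.0, 0.0, 0.0])   # assumed image-plane axes
    north = np.array([0.0, 1.0, 0.0])
    u, v = np.meshgrid(np.linspace(-width/2, width/2, n),
                       np.linspace(-width/2, width/2, n))
    origins = center + u[..., None] * east + v[..., None] * north
    directions = np.broadcast_to(normal, origins.shape)
    return origins, directions

def perspective_rays(eye, points_on_plane):
    """Conventional: every ray emanates from the eye, forming a frustum."""
    directions = points_on_plane - eye
    directions /= np.linalg.norm(directions, axis=-1, keepdims=True)
    return np.broadcast_to(eye, directions.shape), directions

center = np.array([0.0, 0.0, 0.0])
normal = np.array([0.0, 0.0, 1.0])
o, d = planar_rays(center, normal, 2.0, 4)
print(np.allclose(d, d[0, 0]))   # True: all ray directions identical
eye = np.array([0.0, 0.0, -3.0])
_, dp = perspective_rays(eye, o)
print(np.allclose(dp, dp[0, 0])) # False: directions fan out from the eye
```

Shifting the frustum for correct stereo is natural in the perspective setup; with a plane of parallel rays there is no eye point to shift, which is exactly the difficulty described above.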
The upshot is that stereo doesn't seem to work unless you go with the "toe-in" method that can cause eyestrain after a long time and shows visible parallax at the edges. I'm not sure if this is going to be a problem, but because I am not right now eager to rewrite the rendering backend, this is the way it is for the moment.
To set up the stereo rendering, I separated out the rendering mechanism from the objects to be rendered. Previously, there was a single VolumeRendering object that you could create, raycast through, then discard. I created a new camera object that accepted a homogenized volume and would call "traverse" on that volume, feeding a back and a front point. The Volume is then responsible for passing off fixed-resolution grids to the camera, which accumulates an image buffer by calling the ray traversal functions. The front and back points are essentially the only thing needed to know this order, but the camera also stores its three orientation vectors and its position that describe it in 3D space. By separating out these two conceptual objects, we undo some of the "single, carefully constructed image" bias that was in the original volume renderer. (And, we open ourselves up to being able to use the Camera with a hardware volume renderer, should that day ever come.)
So now we have a camera, and it makes images like this:
It's a little dim, but that's a task for another day. The next step is taking that perspective and turning it into a set of stereo images. To do that, I added a new class called StereoPairCamera. It accepts a Camera object and turns it into two camera objects, where the final interocular distance is calculated relative to the image plane width. As I mentioned above, this only operates via toe-in stereo, so it does this in the simplest manner possible: it moves each of the Left and Right cameras by half of the interocular distance away from the original location, and then recalculates a normal vector to point back at the original center. Now we can generate left and right images. Unfortunately, on my laptop (which is my primary/exclusive work computer) I don't have the ability to view these pairs. To get around that, I wrote a simple stereo pair image viewer in OpenGL and imposed upon my friends who do have a stereo viz wall to test it out -- and after some fiddling with the interocular distance, we got what appeared to be workable stereo pairs.
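The toe-in construction can be sketched as follows (a hedged approximation of what StereoPairCamera does; the function name and argument conventions here are invented, and the real class differs in its details):

```python
import numpy as np

def toe_in_pair(position, focus, up, interocular):
    """Offset each eye by half the interocular distance along the
    camera's 'east' vector, then re-aim both eyes at the original
    focal point (the 'toe-in' part)."""
    normal = focus - position
    east = np.cross(normal, up)
    east /= np.linalg.norm(east)
    cameras = []
    for side in (-0.5, +0.5):
        pos = position + side * interocular * east
        n = focus - pos                 # recomputed normal: aims at focus
        cameras.append((pos, n / np.linalg.norm(n)))
    return cameras

pos = np.array([0.0, 0.0, -10.0])
focus = np.zeros(3)
up = np.array([0.0, 1.0, 0.0])
(lp, ln), (rp, rn) = toe_in_pair(pos, focus, up, 0.6)
print(round(float(np.linalg.norm(lp - rp)), 6))  # 0.6: eye separation
```

Because each eye's view direction is recomputed toward the focus rather than kept parallel, the two image planes are not coplanar -- which is where the vertical parallax at the image edges comes from.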
The full code for generating camera paths as well as stereo pairs is already in yt, but the documentation is still being written; I might also clean up the interface a bit. Additionally, at some point in the future, the issue of toe-in stereo versus correct parallel-frustum stereo will need to be dealt with; the last thing I really want to do is force people to only use a bad method for generating stereo pairs. Hopefully that is something that can be dealt with at a later time.
Thanks to the wealth of resources out there for making this a relatively easy task: the aforementioned Paul Bourke website on stereo pairs, the PyOpenGL and PIL teams for making the image pair viewer easy, and everyone else whose work I've built on to make things like this.
Matt just set up this blog to document some of the developments that are going on in yt, and he asked me to share some of the work we've recently done on a kD-Tree decomposition primarily used for the volume rendering, so here goes.
This semester I took a High Performance Scientific Computing course here at Boulder where a large fraction of our work revolved around a final project. Luckily for me, I was able to work on a new parallel decomposition framework for the volume renderer inside yt. What resulted was primarily a kD-tree decomposition of our AMR datasets. While kD-trees are usually used to organize point-like objects, we modified ours to separate regions of space.
Our kD-tree has 2 primary objects: dividing nodes and leaf nodes. Each dividing node has a position, a splitting axis, and pointers to left and right children. The children are then either additional dividing nodes or leaf nodes. The dividing nodes therefore recursively split up the domain until the 'bottom' is composed of all leaf nodes. Each leaf node represents a brick of data that has the same resolution and is covered by a single AMR grid. A leaf doesn't have to (and rarely does) encompass an entire grid. A leaf contains various information such as what AMR grid it resides in, what the start and end indices are in that grid, the estimated cost to render the leaf, and the left and right corners of the space it covers.
So what does this do for us? Well, the kD-Tree decomposition allows a unique back-to-front ordering of all the data. This ordering is determined on the fly by comparing the viewpoint with each dividing node's position and splitting axis. By comparing these two, we are able to navigate all the way to the 'back' of the data and recursively step through it. This decomposition also provides at least two different ways of load-balancing the rendering. By estimating the cost of each leaf, we can subdivide the back-to-front ordering of grids into Nprocs chunks and have each processor work only on its part. This subdivision is viewpoint-dependent, such that the first processor always gets the furthest 1/Nth of the data. It is done on each processor, and no communication between them is needed. The second method is to implement a 'breadth-first' traversal of the tree, where we first create Nprocs sub-trees that are not as well load-balanced as in the first method but are viewpoint-independent. Each processor is then handed one of the sub-trees to completely render. This time, when the viewpoint changes, the regions that each processor 'owns' stay the same.
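Here's a minimal sketch of the structure and the viewpoint-dependent traversal (class and function names are invented for illustration, not taken from the yt code):

```python
# A dividing node stores a split axis/position; leaves hold brick payloads.
class Divide:
    def __init__(self, axis, split, left, right):
        self.axis, self.split = axis, split
        self.left, self.right = left, right  # left: below split, right: above

class Leaf:
    def __init__(self, name):
        self.name = name

def back_to_front(node, viewpoint, out):
    """Append leaves in back-to-front order for this viewpoint."""
    if isinstance(node, Leaf):
        out.append(node.name)
        return
    # The child on the same side of the split as the viewer is nearest,
    # so visit the far child first.
    if viewpoint[node.axis] < node.split:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    back_to_front(far, viewpoint, out)
    back_to_front(near, viewpoint, out)

tree = Divide(0, 0.5,
              Divide(1, 0.5, Leaf("A"), Leaf("B")),
              Leaf("C"))
order = []
back_to_front(tree, viewpoint=(0.1, 0.9, 0.0), out=order)
print(order)  # ['C', 'A', 'B']
```

Moving the viewpoint only changes which branch is taken at each dividing node, so a new ordering costs one tree walk and no data movement at all.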
Now one important part in both of these methods is how we composite the individual images back into a single image. This has been studied a great deal in image processing, and is what I believe is known as the 'over' operation. For us, all this means is that we combine each pair of images using their alpha channels correctly to blend the images together. This is essentially the same operation that is done during the actual ray-casting. This composition is simple for the first decomposition method because we know exactly which order each processor's image goes in, and we construct a binary tree reduction onto the root processor. In the second case, where there is a more arbitrary ordering of the images, we use the same knowledge from the kD-Tree to combine each pair of images in the correct order. For now, this implementation is living in the kd-render branch of the yt mercurial repository, but I imagine that after some more testing and trimming it will make its way back into the main trunk. If you are interested in using it, you mainly have to change two calls in the rendering:
When instantiating the volume rendering, you need to add the kd=True and pf=pf keywords.
vp = pf.h.volume_rendering(L, W, c, (SIZE,SIZE), planck, fields=['Temperature', 'RelativeDensity'], pf=pf, kd=True, l_max=None, memory_factor=0.5)
Then, instead of vp.ray_cast() we use either the back-to-front partitioning:
vp.kd_ray_cast(memory_per_process=2**30) # 1GB per processor
or the breadth first:
vp.kd_breadth_ray_cast(memory_per_process=2**30) # 1GB per processor
There are two additional keywords shown above that are somewhat useful for large datasets. Because several leaves can use the vertex-centered data of a single grid, it is important to keep as much of that information in memory as possible so that disk I/O doesn't hinder the performance. To do this, we've implemented a rudimentary caching system that stores as much of the vertex-centered data as possible. The memory_factor keyword sets what fraction of the memory per processor to use, and memory_per_process tells us how much memory each processor has access to. This is fairly experimental, and if you run out of memory, I'd suggest cutting one of these two options by a factor of 5-10.
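Going back to the image composition step for a moment, the 'over' operation can be sketched like this (a toy version with premultiplied-alpha RGBA arrays; the actual compositing code differs):

```python
import numpy as np

def over(front, back):
    """Porter-Duff 'over': composite a premultiplied-alpha RGBA image
    in front of another. Applied to all four channels, including alpha."""
    a_front = front[..., 3:4]
    return front + (1.0 - a_front) * back

# Three partial renders in known front-to-back order. Because 'over' is
# associative, folding pairwise from the back gives the same answer as a
# binary-tree reduction over the same ordering -- which is what lets us
# composite the per-processor images in parallel.
imgs = [np.random.random((4, 4, 4)) * 0.3 for _ in range(3)]
back_first = over(imgs[0], over(imgs[1], imgs[2]))
tree_style = over(over(imgs[0], imgs[1]), imgs[2])
print(np.allclose(back_first, tree_style))  # True
```

The only requirement is that the pairings respect the back-to-front ordering, which is exactly the information the kD-tree already provides.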
Anyways, that's about it for the discussion of the new kD-Tree decomposition for volume rendering. As we make improvements some of this may change, so keep an eye out for those. Let me just mention that this isn't the only thing this kD-Tree can be used for, so if you have any thoughts or questions, feel free to email me (firstname.lastname@example.org) or perhaps comment on this post.