The Joys Of Programming

Hypatian,

The only software tools I have used for code reviews have been Code Collaborator and Jupiter. CC is a pretty simple web application that you upload files or diffs of files to. Then you give out an ID# to everyone you want to look at the stuff. They log in to the site and leave comments on the files you've uploaded, similar to tracked changes in a Word document. Then you either get everyone in a room to go over the comments and decide on responses, or just assume that the developer will see the comments and integrate them accordingly (bad idea). It works pretty well, and it has a command-line interface and Eclipse plug-ins. The only real downside is that you're looking at the files one at a time in a web page with code highlighting, instead of in your IDE.

The other is the Jupiter plug-in for Eclipse. I only used it once, but you create a Jupiter review record of some kind and send it around to your developers. They look at the code in Eclipse and leave comments on it there. Again, you can get together and review all of the comments made. The nice thing about Jupiter is that it works off of your Eclipse project, so you can navigate around files and code just like you normally would. In theory you could run the code or tests, too, and make sure they behave as expected. The downside is that it's a little more complicated to set up and use than Code Collaborator.

Two jobs ago I just did pair code reviews over VNC and Skype: one person looking over the changes/diffs on-screen and chatting with the dev about what they did and why.

Have you guys settled on a continuous integration build server, yet? I have had a lot of success with TeamCity and Jenkins.

Jenkins has been janky for us, but YMMV.

Yeah, we're an almost entirely Atlassian company where I work and it's not a great experience. I'm sure there's a lot of power under the hood in JIRA, but it's hidden behind such an unhelpful UI.

Jenkins has been pretty good overall, probably the most reliable bit of infrastructure.

Yeah, I think we have a different sort of idea of code review going here. Integration with our VCS seems like a good place to try it, because it provides a reasonable unit of review. (Individual commit or push.) Integration with our IDE would be a non-starter, because we don't all use one IDE.

For automated build and testing stuff, well... the tests we'd really like to automate (because they're a pain to run but are the most likely to tell us when something important breaks) involve firing up a bunch of VMs to have them talk to each other. Really automating that will take some doing. In short: the build and unit test parts are easy enough in comparison to what we actually have to do that we'd almost be happy to roll our own using shell scripts.

Theoretically, builds and testing are another team's sandbox. Yet somehow we still end up doing everything ourselves. Whee.

I've been using Jenkins for years, I'm a fan.

I <3 TeamCity myself.

Here's an opinion piece about the field I work in

http://madhadron.com/?p=263

Can't say I'm completely unsympathetic to his POV.

http://support.microsoft.com/kb/183116

Suffice to say, what this article says specifically not to do, our code does, in many places, to the most active UI thread we have. The chances of it happening are very slim, but it's happened twice this month on two different computers. This renders the kiosk-style application unusable until someone (myself or my coworker) remotes in and kills the application. The best part is that this generally wouldn't be a problem, but the recently installed anti-virus on those machines has been nearly maxing out the memory on the computer, causing those window operations to take just a bit longer and increasing the chances that the program eats a bullet. That's right, the code is basically playing Russian roulette. Yay for finding skeletons in the closet!

DanB wrote:

Here's an opinion piece about the field I work in

http://madhadron.com/?p=263

Can't say I'm completely unsympathetic to his POV.

Garbage in, garbage out. No big surprise there. I am fairly surprised that scientists have been able to use social engineering skills to keep their funding. That's unexpected.

tboon wrote:

I <3 TeamCity myself.

Eagerly seconded!

Mixolyde wrote:
DanB wrote:

Here's an opinion piece about the field I work in

http://madhadron.com/?p=263

Can't say I'm completely unsympathetic to his POV.

Garbage in, garbage out. No big surprise there.

And how.

Mixolyde wrote:

I am fairly surprised that scientists have been able to use social engineering skills to keep their funding. That's unexpected.

Science is all politics at the strategic level; there's a level of seniority where you stop doing science and it's all about positioning to get that grant money. Much as you should take that rant from a disaffected PhD student with a massive pinch of salt, there is a kernel of truth in there: you get grant money to do novel things. There is almost no money floating around to give you the time to do things properly or maintain things (such as software), and that inevitably leads to some very brittle infrastructure. And it means that if you want to do things properly, it's on your own head (and personal time) to go the extra mile and sort it out.

I sent that bioinformatics link to one of the smartest people I know, who is currently getting a PhD in the field. This is a funny chunk of her rant that came back:

I do see that there are a lot of bioinformatics tools and most are crap. Or at least are terribly overcomplicated and buggy. This isn't a conspiracy to obfuscate the tools; it's just what you get when PhDs think they can code and that software development is the same thing as hacking together a script. All that tells me is that the field is still new and they need some real software people to show them how to write software that people can actually use. I have no problem explaining at length to people that they are doing things the wrong way.

DanB wrote:

She's totally right that this is what you get when you ask academics to make software because as a general rule academics of all stripes seldom have any formal training in software engineering. But the solution isn't going to be some maturation of the field because almost all the grant money available is for novel research projects and not earmarked for development/engineering. And of course there is strong career and funding body pressure to publish work, not to spend 6 months polishing some piece of software. You can just about count on the fingers of one hand the number of bioinformatics institutes around the world which have dedicated dev/engineering teams and it's no coincidence that's where actually good tools tend to come from or eventually migrate to.

I work as an academic in software engineering, and I can tell you even the individuals responsible for providing formal training in software engineering generally produce tools which are sh*t. There's simply almost no incentive to do otherwise, and thus outside of stuff produced by Microsoft Research most of the tools are at best marginally useful. Those individuals that do care about producing good tools typically move on to make them in industry where that drive is rewarded... even if they want to stay in academia, they can't outpublish those folks dedicated to cobbling together stuff as fast as possible. But that's modern science for you.

Mixolyde wrote:

I sent that bioinformatics link to one of the smartest people I know, who is currently getting a PhD in the field. This is a funny chunk of her rant that came back:

I do see that there are a lot of bioinformatics tools and most are crap. Or at least are terribly overcomplicated and buggy. This isn't a conspiracy to obfuscate the tools; it's just what you get when PhDs think they can code and that software development is the same thing as hacking together a script. All that tells me is that the field is still new and they need some real software people to show them how to write software that people can actually use. I have no problem explaining at length to people that they are doing things the wrong way.

I'd be a little more sympathetic to this view if the field really were new in any sense.

She's totally right that this is what you get when you ask academics to make software because as a general rule academics of all stripes seldom have any formal training in software engineering. But the solution isn't going to be some maturation of the field because almost all the grant money available is for novel research projects and not earmarked for development/engineering. And of course there is strong career and funding body pressure to publish work, not to spend 6 months polishing some piece of software. You can just about count on the fingers of one hand the number of bioinformatics institutes around the world which have dedicated dev/engineering teams and it's no coincidence that's where actually good tools tend to come from or eventually migrate to.

Staats wrote:

even if they want to stay in academia, they can't outpublish those folks dedicated to cobbling together stuff as fast as possible. But that's modern science for you.

It's like some kind of object lesson in the problems of making publication the only yardstick for academic assessment!

Fun with language implementation... I have my inference engine mostly working. (It doesn't support recursive types yet--have to write the code to find the minimal cycle and break it and unify infinite type expressions.)

INPUT:
  val fact = fun n ⇒ case n of 0 ⇒ 1 | _ ⇒ * n (fact (- n 1))
  type t = #num
  val add = fun a b ⇒ + a b
  val id = fun (type α) x ⇒ (x : α)
  val try_id = id (type #num)
  val try_id' = id (type #num) 1
  val i = 3
  val i' = (i : t)

INFERRED TYPES:
  val fact : fun #num → #num
  type t = #num
  val add : fun #num #num → #num
  val id : fun (type α) α → α
  val try_id : fun #num → #num
  val try_id' : #num
  val i : #num
  val i' : t

I... wow I don't get any of that. What is that and where can I read a primer on the subject?

It's just function definitions in a language I'm implementing (syntax is still up in the air, this is just the current display format). It's in prefix notation for everything at the moment, no infix operators. So "+ x y" means "add x and y together". In the style of ML and Haskell, function application is by putting expressions adjacent to each other, so "f x" means "apply function f to argument x". "+" is treated like a function like any other.

The first part, the input, is a set of definitions in my language:

  • Define "fact" as a function that takes an argument n. If n is 0, it returns 1. If n is anything else, multiply n by fact called on n - 1. (Pretty standard factorial function.)
  • Define a type alias "t" that is the same as "#num". (My base types are "#blah" right now.)
  • Define "add" as a function that takes two arguments a and b, and returns a + b.
  • Define "id" as a polymorphic function ("(type α)" is a type parameter, which currently has to be given explicitly. That means this function works for any type at all), a value x , and return x. The type of x is constrained to be equal to the type argument α. This is the polymorphic identity function: for any type "α", it takes an argument of that type and returns it unchanged.
  • Define "try_id" as the result of applying id to the type #num. So "try_id" is the identity function on numbers. This tests specialization of a polymorphic function to a specific type.
  • Define "try_id'" as the result of applying id to yhe type #num and the value 1. So this tests application of a polymorphic function to a type and a value of that type.
  • Define "i" as the number 3.
  • Define "i'" as the number 3, having type t (which was defined above as being the same as #num.)

The second part is the output of running type inference over those definitions--and the meanings of those types should be pretty clear from this point. Since the types are left out of the definitions (except in cases of polymorphism, mostly), type inference is used to analyze the definitions to determine that they're well-typed, and what the types of the values are.

So for "fact", for example, it's trying to figure out what type of argument n is, and what the return type of the function is. In order to do so, it says "okay, n is used in a case expression, and is matched against a number. So it must be a number. Also, it's used in "(- 1 n)", and - takes two numbers, so that's another reason it must be a number. "*" takes two numbers and returns a number, and since it's applied to the result of fact, the result of fact must be a number. Since the result is returned from fact, the result of fact must be a number. (Which matches up.) Also, the result of the 0 case is a number, and that matches up, too. Therefore, the type of fact must be "fun #num → #num", a function from numbers to numbers.

The other stuff works the same way. Essentially, it builds up a set of constraints for the various parts of the program. Like the type of the case statement is unknown, so we'll call it ?5 for the fifth unknown, and then we add equations "?5 = #num" as we discover that value is used in specific ways. At the end of the day, we take all of the constraints together and determine if there's a proper solution to all of those equations.
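
If that's too abstract, here's a toy OCaml sketch (OCaml being what I'm implementing this in) of the "gather equations and solve them" idea. To be clear, this is not the actual engine, just the flavor of it: types are base types, unknowns, or function types, and solving is plain first-order unification over a list of equations, with recursive types simply rejected.

  (* Toy sketch of constraint solving, not the real engine: types are base
     types, unknowns ("?n"), or function types, and solving is first-order
     unification over a list of equations. Recursive types are rejected. *)
  type ty =
    | Base of string          (* e.g. Base "#num" *)
    | Unknown of int          (* ?1, ?2, ... *)
    | Fun of ty * ty

  let rec subst id t ty =
    match ty with
    | Unknown i when i = id -> t
    | Fun (a, b) -> Fun (subst id t a, subst id t b)
    | _ -> ty

  let rec occurs id = function
    | Unknown i -> i = id
    | Fun (a, b) -> occurs id a || occurs id b
    | Base _ -> false

  (* Turn a list of equations into a substitution [(unknown, type); ...] *)
  let rec solve = function
    | [] -> []
    | (a, b) :: rest when a = b -> solve rest
    | (Unknown i, t) :: rest | (t, Unknown i) :: rest ->
        if occurs i t then failwith "recursive type"
        else
          let rest' =
            List.map (fun (x, y) -> (subst i t x, subst i t y)) rest in
          (i, t) :: solve rest'
    | (Fun (a1, b1), Fun (a2, b2)) :: rest ->
        solve ((a1, a2) :: (b1, b2) :: rest)
    | _ -> failwith "type error"

  let () =
    (* fact gets type ?1 -> ?2; these are the equations its body generates *)
    let constraints =
      [ (Unknown 1, Base "#num");   (* n is matched against 0 *)
        (Unknown 2, Base "#num") ]  (* the result is produced by multiplication *)
    in
    List.iter
      (fun (i, t) ->
         match t with
         | Base s -> Printf.printf "?%d = %s\n" i s
         | _ -> Printf.printf "?%d = <type>\n" i)
      (solve constraints)

Run on the two equations gathered for fact's argument and result, it prints ?1 = #num and ?2 = #num, which is exactly where the "fun #num → #num" in the inferred output comes from.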

This kind of type system is used in languages like ML, Haskell, F#, and the like. What I'm aiming for eventually is to add Scala-style implicit arguments (and use those for type arguments, so that you can just say "id 1" and have it figure out that this expands to "id (type #num) 1"), and end up with a language that provides the power of system F. (Which probably doesn't mean anything to you, but I linked a paper a while back.) I also want to include polymorphic (extensible) sum and product types, and see if a language with the above features is nice to use--and if I can use it to implement something like Haskell type classes.

Anyway, the spot I'm in now, you don't have to give the explicit types of variables, because it can infer those for you, except in the case of polymorphic values. That means that with the language as it currently exists, I have a fully-featured strongly typed language with type inference. It's lacking some amenities, but it works. Once I add a few more amenities, I intend to make it self-hosting (i.e. I can write it in itself), and add a compiler to Javascript code. (Which is one reason I have a generic "number" type, to match Javascript's core number type.)

If you want to get a feel for a language with features like the one I'm playing with, check out "Try OCaml" for an interactive intro to the OCaml language, or "Learn You a Haskell for Great Good" for a nice intro to Haskell. Both of those languages have a lot of features in common with what I'm working on. My current implementation is in OCaml.

The Try OCaml site works using a tool called "js_of_ocaml", which converts OCaml bytecode to Javascript, so the whole language compiler and top-level runtime is running on your browser in Javascript. It's pretty fast, too. Very nifty. (OCaml also compiles to native machine code, with quite good performance.)

Oh, and if you want to learn about the programming language theory behind things like this, consider looking at [em]Using, Understanding, and Unraveling The OCaml Language: From Practice to Theory and vice versa[/em]. (PDF, 174 page book.)

Also, f*cking monads man. They are awesome.
(Category theory is insanity in the best way)

boogle wrote:

Also, f*cking monads man. They are awesome.
(Category theory is insanity in the best way)

They are. Playing with Haskell?

Hypatian wrote:

Okay, so that's the setup. We've been evaluating Atlassian stuff, since some of that is in use in other places, including the JIRA that we never use, and the JIRA that that other team does use. We looked at FishEye and Crucible, and... they seem really really sh*tty. Like, FishEye doesn't really provide anything more than a simple hgweb setup (aside from minor integration with JIRA--although it tends to pick up random comments in our commits as JIRA issues, which is rather unhelpful. Thing-3.5.1 isn't issue [Thing-3].5.1, but thanks for the link. Very helpful.) Crucible doesn't seem to provide a code review mechanism that actually matches what we want to do. You can put comments in a commit to say "Hey, review this please", but the act of actually reviewing something doesn't do anything useful. Neither FishEye nor Crucible provides any repo-management features over the basic stuff, and in fact they: a) add an extra step of "register the repo with Atlassian stuff after putting it on the system", and b) don't support the idea that a single "project" could be associated with multiple repositories, either.

In short, the Atlassian stuff just doesn't seem like a good fit for us. We have a history of hate with the Atlassian tools we deal with so far, and go out of our way to avoid them because they get in the way so much. There's some potential that Confluence and JIRA could be made to suck less with more direct control, but these additional tools don't seem to provide anything new that we actually want. In short, Atlassian stuff seems cheap, but is overall crappy.

--

Hmm, we have been using Jira for about 4 years now, and it's helped our department out immensely. Got Crucible/Fisheye about 2 years ago and it's working fine. We use Confluence just for department documentation and it's OK for that.

I'm actually kind of down on monads right now. The way they encapsulate computation is powerful, but the inability to propagate choices made in bound/mapped functions back to the outside context frustrates me. I've been thinking about applicative functors, and whether it would be possible to add a bit of syntax sugar to make it very very convenient to work with them. It might require using only non-strict constructions in that context, but I think it could be pretty cool if I can make it work. (Capsule summary: What if "a b" style application were actually sugar for a "$" application operator, so "a b c d" => "a $ b $ c $ d", and what if you had the ability to re-define the $ operator? So you could do something like "myAppFunctor.(a b c d)", and the application inside that context is lifted into the functor instead of being the normal function application. This could make use of applicative functors more light-weight. Combine that with an "if-then-else" operation that's a function, as you can do in a non-strict context, and you can lift conditionals into the functor as well. That would allow the lifted value to have global knowledge of the computation. As an example, this could allow the functor to re-order operations to be more efficient. In a monadic context, you can't do that, because as soon as you hit that first lambda abstraction or branch, you can no longer get value information about what's happening inside, only static type information.)
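
To make that "global knowledge" point a little more concrete, here's a toy OCaml sketch of the kind of applicative I mean. No syntax sugar, and not anything I've actually built; it just shows that the shape of the whole computation is visible before anything runs, which is exactly what bind's opaque function argument takes away from you.

  (* A toy "static" applicative: every value carries a cost (here, a count of
     applications) that is known from the structure alone, before running
     anything. With a monad, the rest of the computation hides inside the
     closure passed to bind, so nothing like this is visible up front. *)
  module Static = struct
    type 'a t = { cost : int; run : unit -> 'a }
    let pure v = { cost = 0; run = (fun () -> v) }
    let ( <*> ) f x =
      { cost = f.cost + x.cost + 1;
        run = (fun () -> (f.run ()) (x.run ())) }
  end

  let () =
    let open Static in
    let prog = pure ( + ) <*> pure 2 <*> pure 3 in
    Printf.printf "planned applications: %d\n" prog.cost;  (* known before running *)
    Printf.printf "result: %d\n" (prog.run ())

An interpreter for something like this can look at the whole plan up front and count it, reorder it, or batch it, instead of discovering it one bind at a time.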

Regarding Jira: I think the key problem is that it feels really really heavy-weight. The UI is not well designed for "get in, do one thing, get out"--it seems to be more built around the idea that people are using Jira constantly. And that means that in a group like ours, where we have many projects each of which has one or two people working on it, and none of which gets more than one or two bug reports or feature requests a week, the pain of getting into Jira and entering the right things feels very very high. It looks like the latest versions have RESTful APIs, so we might be able to roll up some command-line tools to simplify the stuff we do most often.

Another big annoyance with Atlassian stuff is that their web UIs tend to do some pretty heinous things. Most of us use Macs for development, and almost all Mac text entry UIs allow emacs-style keybindings. (So things like control-a to go to the start of the line, control-e to go to the end, etc.) At some point, Confluence added keybindings that conflict with this. So C-e would suddenly go to a different page and lose everything. Again, it's the kind of thing that if we were using it all the time we'd probably get used to it. But when we just jump in once every week or so, we run into annoyances like that every time we use things.

It looks like we're definitely going to be mandated to use Jira for bug tracking, despite our history of "try using it, bail and go back to using email" the last several times we've been asked to use it (over many years). So we'll see how that goes, I guess. Having admin rights so that we don't have to put in IT ticket requests every time we want to add a new version to a project will probably help.

Can anyone tell me why a mark & sweep GC has to stop the world to get the roots? 'Cause if it doesn't have to...
Anybody know how to write a paper? First I have to get Mono compiled so I can implement it, but I have an idea for a wait-free, multi-threaded garbage collector that is concurrent with the mutator (at no point does it stop the world).

Also, we use Bugzilla for actual bugs, Jira for development.

RolandofGilead wrote:

Can anyone tell me why a mark & sweep GC has to stop-the-world to get the roots? Cause if it doesn't have to..

I don't know if it has to. There are some incremental algorithms, but IIRC their throughput is not good, slower than stop-the-world. There were also some gnarly concurrency issues involving the allocation of new objects. But it's been a while since I looked at this. Might have been solved by now.

I love you, Python. In the span of 2 hours today, I went from zero to an executable one of my product managers can use to generate DDL (or just plain build the DB) from an Excel file structured in a specific way, complete with primary keys, foreign keys, and indices. It handles dependency ordering, and py2exe makes it self-contained, so he doesn't even have to install Python, though I'm not sure why he wouldn't want to.

Now he can stop complaining about the lack of tools and get me my schema.

tboon wrote:

I <3 TeamCity myself.

New job is using TeamCity. Hoping to find out why you love it.

On a related note (if I remember right), I'm switching to IntelliJ from Eclipse and couldn't be happier.

I haven't really read up on things, but the most obvious difficulties with doing GC in a multi-threaded environment are:

1) GC roots in CPU registers. If other code is allowed to keep running, you can't assume that everything is in memory--and if you require the call stack/continuation to be kept up to date, that kind of defeats the point of having registers.

2) Uninitialized memory. If you do GC in the calling thread at allocation time, and lock to prevent more than one thread from allocating at a time, and make sure all memory is initialized before returning it to the caller, then you can be sure during GC that no memory that you expect to have a pointer in it contains garbage. Otherwise, all bets are off. (You could also achieve a similar effect by having the allocator initialize, and having the in-progress block be counted as an uninitialized block to the GC.)

3) Of course you have to do some sync around memory management no matter what, if you allow threads to share memory at all.

As an added note, any good GC is going to have to do copying at some point. Mark/sweep has a primary problem that it's O(m+n) where n is the size of live memory and m is the size of garbage. A copying collector is O(n). A typical generational collector will use copying for the youngest generations where there's a lot of infanticide, and then maybe switch to m/s for stuff that survives longer.
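
To put a toy behind that complexity claim (arrays standing in for a heap, nothing remotely like a real collector), here's a mark/sweep pass: the mark phase does work proportional to the live cells, but the sweep loop has to touch every cell, live or garbage.

  (* Toy model of the cost claim: mark/sweep touches every cell in the heap
     (live + garbage), while a copying collector only touches live cells.
     The "heap" is just an array of cells with child indices; this is an
     illustrative simulation, not a real collector. *)
  type cell =
    | Obj of int list          (* indices of this object's children *)
    | Free

  let mark_and_sweep (heap : cell array) (roots : int list) =
    let marked = Array.make (Array.length heap) false in
    (* mark phase: work proportional to the live cells *)
    let rec mark i =
      if not marked.(i) then begin
        marked.(i) <- true;
        (match heap.(i) with
         | Obj kids -> List.iter mark kids
         | Free -> ())
      end
    in
    List.iter mark roots;
    (* sweep phase: work proportional to the whole heap, garbage included *)
    Array.iteri (fun i _ -> if not marked.(i) then heap.(i) <- Free) heap

  let () =
    (* cell 0 points to 1 (both live); 2 points to 3 (both garbage) *)
    let heap = [| Obj [1]; Obj []; Obj [3]; Obj []; Free |] in
    mark_and_sweep heap [0];
    Array.iteri
      (fun i c ->
         Printf.printf "%d: %s\n" i
           (match c with Free -> "free" | Obj _ -> "live"))
      heap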

My suspicion is that if you haven't read all of the literature on GC and you're not doing research in the field (and I expect you'd have read most of the literature in that case), then you're unlikely to have come up with an easy solution nobody has thought of before.

But it's been known to happen. Go do some reading and then let us know if it nets you a Ph.D.

Hypatian wrote:

2) Uninitialized memory. If you do GC in the calling thread at allocation time, and lock to prevent more than one thread from allocating at a time, and make sure all memory is initialized before returning it to the caller, then you can be sure during GC that no memory that you expect to have a pointer in it contains garbage. Otherwise, all bets are off. (You could also achieve a similar effect by having the allocator initialize, and having the in-progress block be counted as an uninitialized block to the GC.)

What do you mean by "memory that you expect to have a pointer in it"?

As an added note, any good GC is going to have to do copying at some point. Mark/sweep has a primary problem that it's O(m+n) where n is the size of live memory and m is the size of garbage. A copying collector is O(2n).

FTFY

A typical generational collector will use copying for the youngest generations where there's a lot of infanticide, and then maybe switch to m/s for stuff that survives longer.

Yep, modern gc seems to be done that way. It makes sense. I still don't know how to prevent stop-the-world when using a copying collector, but I saw one paper that made that claim.

My suspicion is that if you haven't read all of the literature on GC and you're not doing research in the field (and I expect you'd have read most of the literature in that case), then you're unlikely to have come up with an easy solution nobody has thought of before.

But it's been known to happen. Go do some reading and then let us know if it nets you a Ph.D. :)

Indeed. Of course with mine there is a small space penalty per object and per thread and a time penalty. Performance depends on algorithms and hardware living in harmony.

RolandofGilead wrote:

What do you mean by "memory that you expect to have a pointer in it"?

If you have a 16-byte chunk of memory that's occupied by two 8-byte pointers, then you have to follow each pointer. If you have a 16-byte chunk of memory that has one 8-byte pointer and two 4-byte unsigned integers, you only have to look at the pointer. If you have a register that contains an integer, you don't need to follow it, but if it contains a pointer you do. If you have a freshly allocated 8-byte space that [em]should[/em] contain a pointer and is currently reachable, then you need to follow it. But if it hasn't been initialized yet, you can't follow it because it could be pointing at anything. So you need to make sure it's zeroed out (assuming the GC doesn't follow null pointers) or set to a valid pointer before the GC ever sees it. That means there's a period in which it has been allocated from the heap (so its memory is not available) and is known not to be garbage (so it doesn't get collected prematurely), but it's still not quite a root because you shouldn't follow the pointers in it. One very simple way to accomplish that is to not allow new allocation while the GC is running. (i.e. a global lock on allocations while the GC is in progress.) A more fine-grained approach is possible, but involves more complexity and more synchronization between the GC thread and other threads which are allocating memory (i.e. all of them).

RolandofGilead wrote:
As an added note, any good GC is going to have to do copying at some point. Mark/sweep has a primary problem that it's O(m+n) where n is the size of live memory and m is the size of garbage. A copying collector is O(2n).

FTFY

That's a constant factor. O(n) = O(2n). Also, I can't think of any context where a "2" would come into play here. You can do it while only walking the live pointers once (unlike mark/sweep, where live blocks get visited twice.) I imagine you must mean "read" + "write", which really doesn't matter because read and write operations don't cost the same.

The constant factors in copying are much higher than that, of course, because you have to copy every byte of live memory, not just the words containing pointers. However, if you produce a lot of garbage (which you should expect in any GC setting--if allocating memory is expensive enough that you have to think about it (e.g. Java object creation overhead), you have a problem), the expense of having mark/sweep iterate over the entire allocation space each time through GC will come to dominate.
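
And here's the copying side of the same toy, since I claimed you only walk the live data once: a Cheney-style pass where the scan pointer chases the allocation pointer through to-space, so the work done is proportional to the live cells and nothing else. Again, arrays standing in for a heap, not a real collector.

  (* Companion toy to the mark/sweep sketch above: a Cheney-style copy that
     only ever touches live cells. Purely a simulation over arrays. *)
  type obj = Obj of int array                  (* children, as heap indices *)
  type slot = Live of obj | Forwarded of int   (* a from-space slot *)

  let copy_collect (from_space : slot array) (roots : int array) =
    let to_space = Array.make (Array.length from_space) (Obj [||]) in
    let alloc = ref 0 in
    (* move one object into to-space, leaving a forwarding pointer behind *)
    let evacuate i =
      match from_space.(i) with
      | Forwarded j -> j
      | Live o ->
          let j = !alloc in
          to_space.(j) <- o;
          incr alloc;
          from_space.(i) <- Forwarded j;
          j
    in
    let new_roots = Array.map evacuate roots in
    (* scan: fix up children of everything copied so far; evacuating a child
       may copy more objects, which this same loop then reaches *)
    let scan = ref 0 in
    while !scan < !alloc do
      let (Obj kids) = to_space.(!scan) in
      to_space.(!scan) <- Obj (Array.map evacuate kids);
      incr scan
    done;
    (Array.sub to_space 0 !alloc, new_roots)

  let () =
    (* 0 -> 1 is live, 2 -> 3 is garbage: only two cells ever get touched *)
    let from_space =
      [| Live (Obj [| 1 |]); Live (Obj [||]);
         Live (Obj [| 3 |]); Live (Obj [||]) |] in
    let live, new_roots = copy_collect from_space [| 0 |] in
    Printf.printf "copied %d cells, root now at index %d\n"
      (Array.length live) new_roots.(0)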