The Joys Of Programming

Well, if he's got gigabit networking on good hardware, pushing 5 gigs to a client should only take about a minute (assuming about 80 MB/s of real-world transfer; the theoretical max is about 125 MB/s). However, it's a minute per client, so if you have a lot of clients, that could turn into a problem. You can't just run the copies in parallel from the server, because a single copy is already sucking up most of the server's bandwidth.

If you went to 10GigE on the server, you could provision ten clients at once, assuming that you can fit the entire dataset in the RAM cache. If you have to actually pull from the drives, it will take a fast RAID setup to push 10 gigabits for more than a quick burst.

After thinking about it a minute, I realized I'm being dumb; I'm imposing serial thinking on a parallel architecture. Once the data file is on a client, that client can provision other clients just as well as the master server can. You could write a little replication job whose aggregate speed would double every time a copy completed, because each finished node becomes another source.

In other words, you'd provision the first client, it would provision the second, they'd both provision the third and fourth, all four would provision 5 through 8, and so on. Assuming you're at any kind of reasonable cluster size, you'd be able to get your data onto every node in ten minutes or less.
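To put rough numbers on that doubling (purely a back-of-the-envelope sketch; the 1000-node cluster, the one-minute-per-copy figure, and the assumption that every seeded node immediately starts copying to another node are all just illustrative):

public class FanOut {
    public static void main(String[] args) {
        int clusterSize = 1000;        // hypothetical number of client nodes
        double minutesPerCopy = 1.0;   // ~5 GB over gigabit, per the estimate above

        int seeded = 1;                // the master starts with the data
        int rounds = 0;
        while (seeded < clusterSize + 1) {
            seeded *= 2;               // every node with a copy seeds one more node
            rounds++;
        }
        System.out.printf("%d clients seeded in %d rounds (~%.0f minutes)%n",
                clusterSize, rounds, rounds * minutesPerCopy);
        // Prints: 1000 clients seeded in 10 rounds (~10 minutes)
    }
}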

You'd need switches that could handle a hurricane of data, but anything that's of reasonable quality should be fine in that circumstance. Just about all business-class switches can drive all their ports at full wire speed, simultaneously.

Note that some Dell switches, at least from a few years ago, are known to choke up when used that heavily. As far as I know, the bug was never fixed. I don't know if newer Dells still do that or not.

SixteenBlue wrote:

Are you sure the time to copy the files locally isn't going to be longer than the time to read them from HDFS? Will you be running the jobs multiple times, so you're trying to limit the data transfer to just the first time? That's what I assumed but wanted to make sure.

Essentially, yes. I don't really want to copy the data every time a map task starts, because I don't want to add 1-10 minutes to the start of each map.

As for the hardware this will run on: who knows. We're building this package with a view to releasing it into the wild for other researchers to run on the cluster of their choice, so we have no idea how many compute nodes or what network hardware it might end up running on. Our build target is "make sure it runs on EC2", but beyond that, who knows.

Keep going and you'll reinvent bittorrent soon enough :P. See Twitter's deployment tool for example.

(Edit: comment aimed at Malor)

Anyone have any experience with RavenDB? That's what we're using, so I've been spending the past week flushing 15 years of relational thinking out of my mind. I've been doing Map/Reduce for years, it's just hard to give up stuff like joins, when those seem so natural to me.

Bonus_Eruptus wrote:

Anyone have any experience with RavenDB? That's what we're using, so I've been spending the past week flushing 15 years of relational thinking out of my mind. I've been doing Map/Reduce for years, it's just hard to give up stuff like joins, when those seem so natural to me.

Too bad, from now on everything is a list! All hail the list.

Just got to use Hypatian's ascending 2D array search in O(N) time on a candidate. Took him a while, but with some gentle prodding, he figured it out. The rest of my team thought I was mean for asking it, but it was for a senior-enough role, and he had enough public experience that I thought Fibonacci might be insulting.
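For reference, a sketch of the usual answer to that style of question, assuming the "rows and columns both sorted ascending" variant (the exact problem statement isn't in the thread, so the names and signature here are just illustrative):

public class SortedMatrixSearch {
    // Start at the top-right corner; each comparison discards a whole row
    // or a whole column, so the walk is O(rows + cols).
    static int[] find(int[][] m, int target) {
        int row = 0;
        int col = m[0].length - 1;
        while (row < m.length && col >= 0) {
            int v = m[row][col];
            if (v == target) return new int[] { row, col };
            if (v > target) col--;   // everything below in this column is larger still
            else row++;              // everything to the left in this row is smaller still
        }
        return null;                 // target not present
    }

    public static void main(String[] args) {
        int[][] m = {
            { 1,  4,  7, 11 },
            { 2,  5,  8, 12 },
            { 3,  6,  9, 16 },
            { 10, 13, 14, 17 }
        };
        int[] pos = find(m, 9);
        System.out.println(pos[0] + "," + pos[1]);   // prints 2,2
    }
}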

Huh, I've never really seen that. We write/overwrite files to HDFS all the time.

SixteenBlue wrote:

Are you sure the time to copy the files locally isn't going to be longer than the time to read them from HDFS? Will you be running the jobs multiple times, so you're trying to limit the data transfer to just the first time? That's what I assumed but wanted to make sure.

Ok next Hadoop question that has been puzzling me greatly.

I have 2 Hadoop packages that perform 2 different data analysis jobs. Both essentially go Map1 -> Map2 -> Reduce. The first maps in both packages are similar but produce somewhat different outputs, and I'm having trouble testing these. In both packages, Map1 reads in a series of files (called test1.fa, test2.fa, and so on, the digit incrementing per file), analyses each one, and outputs a matrix of data to a file (test1.mtx). Having produced the matrix, I copy the file from the compute node back into HDFS, to a directory called matrices/.

I tested package1 and package2 running Hadoop in embedded mode while writing the code, and everything was fine and worked as expected. When I came to test in pseudo-distributed mode, package1 ran fine. But when I came to run package2, it failed trying to write its version of test1.mtx to matrices/, throwing a checksum error.

So I googled a bit and found some suggestions that this happens sometimes and that I should just try again. I tried multiple times; same error over and over. A bit more googling turned up information suggesting that the first time you write a file to HDFS, the namenode creates a checksum for it, and that any future copy from local of a different file with the same name to the same location will then fail the checksum. I've tested this with different matrix files using

hadoop dfs -copyFromLocal test1.mtx matrices/

And this seems to check out.

So my question is: how do I flush the cache of checksums? I have tried namenode -format. I have tried deleting the contents of /tmp/hadoop-USER/, and I've tried deleting the logs. I've even tried completely deleting all the files, including the hadoop directory, and downloading everything and starting from scratch. I am at a loss.

Any ideas?

SixteenBlue wrote:

Huh, I've never really seen that. We write/overwrite files to HDFS all the time.

If I change the name of the file locally from test1.mtx to testing1.mtx, it will copy over fine, obviously.

DanB wrote:
SixteenBlue wrote:

Huh, I've never really seen that. We write/overwrite files to HDFS all the time.

If I change the name of the file locally from test1.mtx to testing1.mtx, it will copy over fine, obviously.

OK, I figured this out. When writing files on the LocalFileSystem, by default the compute daemon/client writes a CRC checksum as a hidden file in the directory you're writing to. That's a one-time write, done the first time a file of a given name is created. As far as I can tell, if you overwrite that file the system won't overwrite the previous checksum; then, when you try to copy the file to HDFS, it verifies that everything matches on the local fs before copying, and fails. Deleting those hidden .crc files appears to have solved the problem. You can disable the checksumming behaviour by using RawLocalFileSystem() instead of the default FileSystem().
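For anyone who hits the same thing later, a minimal sketch of that workaround using Hadoop's Java API (the paths and the surrounding program are made up; the point is just that RawLocalFileSystem skips the hidden .crc sidecar files that the default checksummed LocalFileSystem writes):

import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class NoLocalCrc {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The default local filesystem is checksummed and drops a hidden
        // .test1.mtx.crc next to anything it writes; RawLocalFileSystem is
        // the same filesystem without the CRC layer.
        RawLocalFileSystem rawLocal = new RawLocalFileSystem();
        rawLocal.initialize(URI.create("file:///"), conf);

        // Write the matrix through the raw filesystem, so no .crc file that
        // could later go stale is ever created.
        Path localMatrix = new Path("/tmp/test1.mtx");
        try (OutputStream out = rawLocal.create(localMatrix)) {
            out.write("matrix data goes here\n".getBytes("UTF-8"));
        }

        // Then copy it up to HDFS as usual.
        FileSystem hdfs = FileSystem.get(conf);
        hdfs.copyFromLocalFile(localMatrix, new Path("matrices/"));
    }
}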

RavenDB question, if anyone's used it, or any other document-based map/reduce NoSQL DB:

We've got the following models:

public class Server
{
    public string ServerId { get; set; }
    public string ServerName { get; set; }
    public string CabinetId { get; set; }
}

public class Cabinet
{
    public string CabinetId { get; set; }
    public string CabinetName { get; set; }
    public string DatacenterId { get; set; }
}

public class Datacenter
{
    public string DatacenterId { get; set; }
    public string DatacenterName { get; set; }
}

public class ServerModelView
{
    public string ServerId { get; set; }
    public string ServerName { get; set; }
    public string CabinetId { get; set; }
    public string DatacenterId { get; set; }
    public string CabinetName { get; set; }
    public string DatacenterName { get; set; }
}

and the following multi-map index:

AddMap(datacenters => from dc in datacenters
    select new ServerModelView()
    {
        DatacenterId = dc.Id,
        DatacenterName = dc.Name,
        CabinetId = (string)null,
        CabinetName = (string)null,
        ServerId = (string)null,
        ServerName = (string)null
    });

AddMap(cabinet => from cab in cabinet
    select new ServerModelView()
    {
        DatacenterId = cab.Datacenter.Id,
        DatacenterName = (string)null,
        CabinetId = cab.Id,
        CabinetName = cab.Name,
        ServerId = (string)null,
        ServerName = (string)null
    });

AddMap(servers => from s in servers
    select new ServerModelView()
    {
        DatacenterId = (string)null,
        DatacenterName = (string)null,
        CabinetId = s.Cabinet.Id,
        CabinetName = (string)null,
        ServerId = s.Id,
        ServerName = s.Name
    });

Any idea how to write a Reduce or TransformResults function that can give me essentially a list of Datacenters, with each one having a list of Cabinets, and each of those having a list of Servers?

Essentially, in SQL, it would be a three-way join, but I'm still new to the no-joins paradigm in NoSQL.

Has anyone here had a chance to look at the (brand new) book "Python For Kids"? It appears to be out for Kindle only at this point, with a dead-tree edition releasing next week. I'm considering buying it for my kid, but really want to avoid a dud. Someone on GWJ had previously recommended Head First Java as a good starter book, and we got it out of the library (and renewed it a couple of times), but he hasn't found a programming project that has really caught his attention yet.

This one seems game-oriented, which may help. (And you really can't have too many reference books around, can you?)

Katy wrote:

Has anyone here had a chance to look at the (brand new) book "Python For Kids"? It appears to be out for Kindle only at this point, with a dead-tree edition releasing next week. I'm considering buying it for my kid, but really want to avoid a dud. Someone on GWJ had previously recommended Head First Java as a good starter book, and we got it out of the library (and renewed it a couple of times), but he hasn't found a programming project that has really caught his attention yet.

This one seems game-oriented, which may help. (And you really can't have too many reference books around, can you?)

The book that inspired this thread is actually a game-oriented Python book, free to download but also available in dead-tree format. It's aimed at 10-12 year olds; it might be worth a look.

Light Table's awesome, and I think they hit their stretch goal on Kickstarter, which means after Clojure, they'll have Python support.

It actually was inspired by Bret Victor's Inventing on Principle, which has an awesome demo based on Braid about 11 minutes in. The whole thing's worth watching, though.

Bonus_Eruptus wrote:

Light Table's awesome, and I think they hit their stretch goal on Kickstarter, which means after Clojure, they'll have Python support.

It actually was inspired by Bret Victor's Inventing on Principle, which has an awesome demo based on Braid about 11 minutes in. The whole thing's worth watching, though.

Just looking through Bret's web site and found this awesome blog post on learning programming and related stuff.

Just an FYI that JetBrains is having a 75% off End of the World sale, today only and for personal licenses. ReSharper, RubyMine, etc.

http://bit.ly/qotSNA

Damn! Just paid for WebStorm a couple of weeks ago.

But $50 for a personal license of IntelliJ IDEA? That's damn tempting.

I love these guys, they make great tools.

So this is only sort of programming-based, as technically speaking it's hardware... but it's programmable hardware, so I'll soldier on.

I recently ordered myself a Papilio Pro board. This little guy is intended to be the Arduino of FPGA boards. For those of you who've never heard of FPGAs, they're basically an IC crammed to the gills with logic gates and switches that you can program to be any type of logic circuit you want, from a simple AND gate all the way up to full-on microcontrollers complete with RAM. You use a hardware description language such as VHDL or Verilog to describe the functionality and I/O ports of the hardware; this is then synthesized and programmed onto the chip. The board has a Xilinx Spartan 6 LX FPGA, 64Mb of SDRAM, and 48 digital I/O lines broken out to six 8-bit headers.

There are also a series of "wing" boards that let you connect all kinds of cool peripherals like VGA ports, DACs, ADCs, serial ports, etc. There are even larger "MegaWings" that provide many of these peripherals all on one board. The one I got to go with my board was the "RetroCade Synth." Along with high-quality audio ports, it has MIDI ports, an LCD screen, and tons of ADC and DAC pins for adding sliders and switches.

There is also an "Arcade Megawing" that provides a VGA port, mouse and keyboard, and serial ports for joysticks. Several classic arcade boards have been implemented on this system including Pac-Man, Space Invaders, Galaxian, and Frogger. What's cool about this is that these aren't emulations, but the equivalent of having the original hardware boards. See the video below for a demonstration.

My plans for this board are twofold.

1) I want to try my hand at building a full CPU core from the ground up. This may sound quite complex, but I've done it before back in college, and I just found all my textbooks, lab books, and notes in storage, so I won't be going in cold. This time I'm doing it for fun and not for a grade, so hopefully I won't have to pull as many all-nighters. The intention of this exercise is to start small, reorient myself with the tools and VHDL, and slowly build up to something quite complex and downright cool.

2) I want to be able to program old sound chips onto it on the fly and play them via a MIDI keyboard. This is why I got the RetroCade board along with the main board. There are already implementations of the SID and Amiga chips, and I've seen a project to clone the Master System that contains an implementation of the SN76489 sound chip. I also want to try some of the projects at FPGA-Synt.net, as well as some of my own synth designs.

Pictures:

Main Board
IMAGE(http://papilio.cc/uploads/Papilio/ppro.jpg)

RetroCade wing
IMAGE(http://retrocade.gadgetfactory.net/uploads/Main/retrocadeMW.png)

Arcade wing:
IMAGE(http://papilio.cc/uploads/Papilio/callouts.jpg)

Videos:

PACMAN!

Chiptune Times

Heh, looks like JetBrains' site is buckling under the load.

I got a new job in October, going from a position where I did C# programming half or 3/4 time (with other duties competing for my time) to my new employer doing C# full-time. It's been a great change. My old job was spent with a govt contractor writing desktop apps with no network connectivity. Now it's still working on desktop apps, but ones that download financial data, talk to databases, etc. I've learned more in the past couple months than I probably have in the last few years. It's a good feeling to know your career is actually going somewhere.

On the other hand, I'm working with financial data sent in binary format, and it's tedious as hell to parse. Not difficult, really. Just tedious.

beanman101283 wrote:

Heh, looks like JetBrains' site is buckling under the load.

Yeah, I've got a support renewal for RubyMine which was $9.75 instead of the standard $39 - figured might as well buy it at that price. Finally got my order through, but haven't gotten any email response from them, and their site is very, very slow right now.

beanman101283 wrote:

On the other hand, I'm working with financial data sent in binary format, and it's tedious as hell to parse. Not difficult, really. Just tedious.

I love the Erlang binary data parsing, it's simple and smooth.

I tried it out and found it a bit twitchy/crashy on Windows, but I'm sure that's not the target platform. Once I did get it stable and had some Clojure code for it to mess with, it was pretty cool. Made me wish I knew Clojure better so I could really dig in. I can't see any good reason why something similar wouldn't work for Erlang, which is also a module-based functional language.

Huh, those Spartan chips seem pretty small -- am I reading the datasheet correctly that the LX9 on that board only comes with 9152 components? Or does "logic cell" mean something different?

If it IS only 9152 components, how on earth are they doing a Z80?

Date manipulation in a database is really hard. Apparently.

We had a minor bug with date comparisons. @date1 >= @date2 doesn't quite work the way you'd like when one or both of the dates have time information and the date parts are the same (e.g. '2012-12-21 00:00:00' >= '2012-12-21 08:30:00' is false, even though both fall on the same day). Considering the cases that actually happened, it seemed kind of minor to me, but a dev on another team "fixed" it, and apparently it wasn't properly code reviewed.

First, converting the datetime to date seems like a good way to handle it, but we have to support MSSQL 2005, which doesn't have the date datatype, so they decided to convert the dates to strings and do a string comparison. Yuck. Technically, it would have worked if they had used the YYYY-MM-DD format (CONVERT(CHAR(10),@date,120)), since that sorts lexically in date order, although it's ugly as hell and I would have brought it up in a code review. Instead, they used the DD/MM/YYYY format (CONVERT(CHAR(10),@date,103)) for the comparison, which compares day-of-month before month and year, so it doesn't sort chronologically at all. OK, fine, spotting the difference between 103 and 120 is tough unless you memorize what they mean, but I guess it wasn't actually tested, or it happened to work with the data they were using.

I had to google for the docs, because I knew that SQL Server has date manipulation functions but needed a reminder of what they were and how to use them. Unfortunately, googling for "sql server date comparison without time", the official docs are 7th in the results, and the first 6 are a mix of good and mostly bad advice, including one poster in a forum who knew about the built-in date functions and basically asked whether he should use the native API or roll his own (citing performance concerns).

The simple way to fix this bug is DATEDIFF(day, table.date_field, @date_arg) >= 0. This (as I just learned) has the problem of not being "sargable" (i.e., it can't use an index) on date_field, because the column is wrapped in a function. The closest I could find is DATEADD(dd, DATEDIFF(dd,0,@date_arg), 0), which looks confusing (get the number of days since Day Zero, then add that many days back to Day Zero :-?). In our case, the table in question will always be small (ref data that the customer cannot change), and that form only handles @date_arg having time data, not table.date_field having it, so I'm going with the simple DATEDIFF version.

Date manipulation is hard regardless of the stack. Most high level languages have some nice wrappers around it now, but dates down close to the metal are a huge PITA.

In SQL Server 2005, for a predicate to be "sargable", the functions need to stay on one side of the predicate while the columns stay on the other.

So say I was using the SalesOrderHeader table from adventure works and wanted to find sales on a given order date in a time insensitive manner.

I could do this:

-------------------------------------------------------------
declare @CompareDate datetime
set @CompareDate = '2006-01-01 5:37:00'

select * from sales.SalesOrderHeader
where DATEDIFF(day, OrderDate, @CompareDate) = 0

-------------------------------------------------------------

That of course would ignore any index, because it would have to run the comparison on each record in the table to evaluate the expression.

A better way to write this query to take advantage of an index on order date would be as follows:

-------------------------------------------------------------
select * from sales.SalesOrderHeader
WHERE
OrderDate >= DATEADD(dd,(DATEDIFF(dd,0,@CompareDate)),0) and
OrderDate < DATEADD(dd,(DATEDIFF(dd,0,@CompareDate))+1,0)
-------------------------------------------------------------

Here the T-SQL parser recognizes that the right-hand expression is static, since it no longer uses the OrderDate column, so it evaluates it once and then scans the index for matches.

Hopefully that makes sense.

So, I have two jobs open in Cleveland. Nice small company that does training and consulting work. They will spend the first 3 months paying for you to get your Microsoft Certified Trainer cert. After that you will spend some weeks training and some weeks out at clients.

What they're looking for is someone with mid-level C# and SQL Server (T-SQL script writing) who has a personality conducive to being a trainer.

Salary looks to be low to mid 80s, which in Cleveland affords you a pretty comfortable lifestyle.

I have to say if you have the aptitude the life of a trainer is a good one. Work 9-4 with 90 minutes of breaks. Go home at the end of the day, not on call...

So I'm entering the last term of my CS degree, and I think the only thing I've learned from the CS department is bad habits. Indeed, this past term I discovered that our university is now known (at least locally) for turning out crap coders, and businesses actively avoid hiring them when given a choice. -_-

I don't actually want to be a coder, but I'm aware of the fact that I may need to be. I gather that the best way to fix this is probably to sit down and crunch some practical rather than academic code, but finding time to do that amid working full time and the busy work imposed by full-time school is a bit much.

If I do somehow manage to go that route what sort of things are ideal in a coding portfolio? You'd think they would have covered this in my degree at some point but not so much.

Are there other more time efficient ways to improve the situation, perhaps certificates to stick on the resume along with the degree?

krev82 wrote:

So I'm entering the last term of my CS degree, and I think the only thing I've learned from the CS department is bad habits. Indeed, this past term I discovered that our university is now known (at least locally) for turning out crap coders, and businesses actively avoid hiring them when given a choice. -_-

I don't actually want to be a coder, but I'm aware of the fact that I may need to be. I gather that the best way to fix this is probably to sit down and crunch some practical rather than academic code, but finding time to do that amid working full time and the busy work imposed by full-time school is a bit much.

If I do somehow manage to go that route what sort of things are ideal in a coding portfolio? You'd think they would have covered this in my degree at some point but not so much.

Are there other more time efficient ways to improve the situation, perhaps certificates to stick on the resume along with the degree?

CS isn't good at creating good business coders, for a lot of reasons. You are right that the best way to learn is to do, but without guidance from a good mentor you will likely struggle. If you don't want to be a coder, though, I'm not sure why you'd be concerned about this.

@krev82, I suggest that you find a project that you'll enjoy working on and make a significant contribution. You could contribute to an open source project or make something of your own. If you decide to make something on your own it will show initiative and perseverance, even if it doesn't implement any impressive algorithms. Mobile apps are pretty easy to self-publish but don't really make you stand out anymore. I haven't found many organizations that are impressed by certifications for new hires.