The Joys Of Programming

Are there any NoSQL databases out there yet that can actually out-scale a good RDBMS? I know that Mike Stonebraker and some others are also working on NewSQL systems that are supposed to scale like crazy while retaining the ACID guarantees of SQL, but I haven't checked into the details yet.

Hypatian wrote:
bandit0013 wrote:

At least the penalty for a bad data model in nosql is pretty minimal since there's no enforced schema.

Or, contrariwise, the penalty for a bad data model in nosql is huge because you may have to substantially re-write your code if you decide you need to change the data layout. (That would be why the world invented RDBMSs in the first place--because back then everything was "nosql", so everybody knew why it sucked.) (And that's not to say that people don't do a lot of stuff with RDBMSs that really don't [em]need[/em] the capabilities they provide, nor that people don't frequently use RDBMSs poorly and get both the rewrite pain and the overhead at the same time. Just don't imagine that you're doing anything except trading one flavor of pain for another. :))

Yes. Thanks for posting this. The NoSQL movement is interesting, but as someone who has been around the business for 19 years I am kind of flummoxed as to why it's the new hotness when there were distinct reasons RDBMSs were used in the first place.

I work on systems that have to support a high level of traffic. Not Amazon.com high or eBay high, but pretty high. So I've been thinking about this stuff a lot more recently than I have had to for most of my career. The database is definitely a bottleneck for us, but more than that, the entire architecture of the system is the bottleneck.

By that I mean that if I started my entire project greenfield tomorrow, I'm sure it could be redone in a manner that would allow the RDBMS to be scalable. Things like...

1 - Does everything need to be dynamic? There was a time when all web pages for everything became dynamic. There is some content that just doesn't need to be driven from a database, especially large chunks of text that don't change more frequently than every day. These could be stored on the filesystem and written out by a batch job periodically, rolling across servers (a rough sketch follows this list).

2 - How many disparate systems make database calls to the same tables, each using a unique connection pool of its own? Not only could you free up connections by dividing things into a more SOA-style architecture, but you could let the database cache frequently used queries by limiting the number of servers accessing it.
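
For item 1, here's a rough, minimal sketch of the "batch job writes static pages" idea in Python -- every name in it (fetch_articles, the output directory, the markup) is made up for illustration rather than taken from any particular framework:

[code]
# Periodic batch job: render rarely-changing content to plain files so the web
# tier can serve it without ever touching the database.
import html
import pathlib

OUTPUT_DIR = pathlib.Path("static/articles")

def fetch_articles():
    # Stand-in for the real database query; returns (slug, title, body) rows.
    return [
        ("welcome", "Welcome", "This page changes maybe once a day."),
        ("faq", "FAQ", "Long, mostly-static text lives happily on disk."),
    ]

def render(title, body):
    return "<html><body><h1>{}</h1><p>{}</p></body></html>".format(
        html.escape(title), html.escape(body))

def main():
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for slug, title, body in fetch_articles():
        # Write to a temp file, then rename over the old page so readers never
        # see a half-written file.
        target = OUTPUT_DIR / (slug + ".html")
        tmp = OUTPUT_DIR / (slug + ".html.tmp")
        tmp.write_text(render(title, body), encoding="utf-8")
        tmp.replace(target)

if __name__ == "__main__":
    main()
[/code]

Run it from cron every few hours and the web servers just serve files.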

That's just a short list of large enterprise things I've been thinking about lately. Then there are all the obvious issues of how SQL queries are written, etc. If you're highly transactional, in most cases you probably work for a large company. And unless that company has organized itself well, there are probably people making crazy SQL calls from some corner of the company, possibly for reporting.

Actually, that's one place where new databases can have some real improvements. Distributed transactions are a tremendous pain, performance-wise--and you can potentially do a lot better if instead of going for straight up atomic transactions you aim for different consistency and atomicity guarantees. I wouldn't be surprised to see capabilities along those lines start appearing in some RDBMSs in the near future--although I'm not sure what the right way to fit things in would be. (It kind of goes against the "standard" model of consistency in that context.)

bandit0013: Well, the kind of problem I was hinting towards is somewhat more subtle. What do you do if you realize you need to switch from one way of linking your "objects" together to another one? You have an old property in your stuff, and code that depends on that property being there, and you want to switch everything over to using a new property (or even a completely new collection of stuff, if you've had to go from "property on X points directly to thing Y" to "Xs and Ys are related by some other mechanism".) Now you have to find every piece of code that uses that property and replace it with code that uses the new mechanism. If you made that abstract in the first place, that might reduce the number of places you have to look. You definitely still do have to write code to either go and change every object in your database to use the new format, or code that every time you touch an object checks which format it is and converts it if it hasn't been converted yet.
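
To make that concrete, here's a minimal sketch of the "convert on touch" approach in Python. The dict standing in for the document store, the schema_version field, and the property names are all invented for illustration -- this isn't any particular database's API:

[code]
# Documents carry a schema_version; every read path upgrades old documents
# before handing them to application code, and writes the upgraded form back.

store = {
    "user:1": {"schema_version": 1, "name": "Ann", "manager": "user:7"},    # old layout
    "user:2": {"schema_version": 2, "name": "Bob", "manager_ids": ["user:7", "user:9"]},
}

def upgrade_v1_to_v2(doc):
    # v1 had a single 'manager' pointer; v2 keeps a list of manager ids instead.
    manager = doc.pop("manager", None)
    doc["manager_ids"] = [manager] if manager else []
    doc["schema_version"] = 2
    return doc

UPGRADES = {1: upgrade_v1_to_v2}

def load(key):
    doc = store[key]
    # Lazily migrate: keep applying upgrades until the document is current.
    while doc["schema_version"] in UPGRADES:
        doc = UPGRADES[doc["schema_version"]](doc)
        store[key] = doc   # write the converted form back so it only happens once
    return doc

print(load("user:1"))  # {'schema_version': 2, 'name': 'Ann', 'manager_ids': ['user:7']}
[/code]

The annoying part isn't this function -- it's finding every piece of code that still reads the old property.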

I will say that "random grab-bag of properties" tends to fit into SQL poorly. There are well-known ways to handle it (skinny tables), but they can be very unwieldy. Postgres has the hstore extension to make it a lot easier to do that sort of thing. Either way, one real danger, and the downfall of most systems I've seen use skinny tables, is the temptation to put [em]everything[/em] in that format--and some things start to break down when you handle them that way. (Although partial indexes and functional indexes both help with that.)
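
For anyone who hasn't bumped into the skinny-table pattern, here's a toy version using the standard-library sqlite3 module as a stand-in (table and column names invented for illustration). Postgres's hstore gives you roughly the same per-row grab-bag without the extra table, plus better indexing options:

[code]
# "Skinny table": arbitrary per-item properties stored as entity/attribute/value rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE item_attr (
        item_id INTEGER NOT NULL REFERENCES item(id),
        attr    TEXT    NOT NULL,
        value   TEXT,
        PRIMARY KEY (item_id, attr)
    );
""")
conn.execute("INSERT INTO item (id, name) VALUES (1, 'widget')")
conn.executemany(
    "INSERT INTO item_attr (item_id, attr, value) VALUES (?, ?, ?)",
    [(1, "color", "red"), (1, "weight_kg", "0.4")],
)

# Reassembling the grab-bag means pivoting rows back into a property map --
# which is exactly the part that gets unwieldy once everything lives this way.
props = dict(conn.execute("SELECT attr, value FROM item_attr WHERE item_id = ?", (1,)))
print(props)  # {'color': 'red', 'weight_kg': '0.4'}
[/code]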

In short: There are no short-cuts. There are methods of design that make certain sorts of tasks easier or harder, but always with drawbacks. It's good to experiment with everything in order to learn what those drawbacks are.

I work for someone currently working with Oracle Coherence to try and split the difference. It's a data cache where you can store portions of your database in memory. It's not terribly clear when to use it and when not to, though, or when to persist data to the cache and when not to. Oftentimes the answer is to persist in both places. Ideally, if you're spending big money on big iron, you'd like a database that provides an in-memory cache of common queries (or queries you designate) and handles it for you.

Most of the big players that I'm aware of have chosen the Third Way on the issue of data storage: use both. Relational databases when you have a well-defined structure and need full ACID compliance; NoSQL systems when data is unstructured (or not yet standardized) and your use case is specialized (graph traversal, high write throughput e.g. logging, infrequently updated but frequently read information, etc.).

To somewhat echo what others have said: The hstore datatype in PostgreSQL is definitely becoming more and more attractive for companies that can't justify fully investing in a secondary data storage system. NewSQL contenders look promising, but I don't know of any project that has gotten mainstream acceptance yet. And yes, you need to have at least a partial schema definition managed in your application code if you give up the strict enforcement offered by an RDBMS.

@complexmath: The question of "out-scaling" is a bit too vague to answer. Pretty much every significant NoSQL solution can scale quite well. Again it's a question of trade-offs, in this case mostly in terms of money. I'm not as much of a hardware/devops guy, so I couldn't price it out too well. But yeah... unless you're examining a fixed ratio of physical resources cost, total storage, response time, maintenance, etc., the best answer I could give you is "it depends".

This seems on topic and interesting.

http://www.infoq.com/news/2011/09/java-memcached-rise

Yeah--I was thinking more along the lines of massively distributed stuff. With things like Google's and Amazon's cloud stuff, you can have much more massively distributed setups than you can with a typical RDBMS. And a lot of that comes down to how you handle consistency and redundancy. With massive distribution, what you care about is "eventual consistency" and the particular consistency guarantees each piece of data actually needs.

Like web search doesn't really require consistency--you get stuff in, and you want to make sure that eventually it reaches the end user, and you don't let old data override new data. But if you double up on writes, it's no big deal, and if you miss a write, it's also no big deal. That does require a great deal of care with the use case and the data model: you need to know what the data implies, and what you can get away with losing and so on.

Similarly, you could imagine a web forum not needing perfect instant consistency. Each post is somewhat separate. People are posting past each other all the time, so it wouldn't matter if occasionally post A "really" came in before post B, or if post C happened before a user opened the page but hadn't quite shown up yet. As long as it doesn't take more than a minute or two to get to everybody, you're fine. If you want to work that way, you need to design a data model that doesn't give posts a sequential number (since that's a single synchronization point everybody has to agree on), but that's doable. (It's just not how people usually code web forums.) And, of course, a web forum generally doesn't need super massive scaling.
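
A rough sketch of what "no sequential post number" might look like -- the id format here (millisecond timestamp plus a random suffix) is just one assumption for illustration, not how any real forum software does it:

[code]
# Each node mints post ids from its own clock plus random bits, so there's no
# global counter every node has to agree on. Ordering becomes approximate,
# which is fine for a forum.
import time
import uuid

def make_post_id():
    # Timestamp up front so a lexicographic sort is roughly chronological;
    # random suffix so two nodes minting ids at the same instant don't collide.
    return "{:013d}-{}".format(int(time.time() * 1000), uuid.uuid4().hex[:8])

posts = [{"id": make_post_id(), "body": body}
         for body in ["first!", "no, first!", "third, probably"]]

# Display order is just a sort on the id; replicas that received the posts in
# different orders still converge on the same rendering.
for post in sorted(posts, key=lambda p: p["id"]):
    print(post["id"], post["body"])
[/code]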

In any case, these are examples of ways that a good design can avoid needless synchronization. That can speed things up even on a single DB instance, and it speeds things up dramatically on a distributed setup--to the point where you can suddenly scale up to hundreds and thousands of nodes at different locations rather than just a few closely-located nodes.

These are things that traditional RDBMSs don't really do, because they're built at the very core with an old-school "everything is consistent all the time" model more well suited to bank transactions and the like. I think the RDBMS model allows for it, but people have to do some experimenting and figure out the best ways to achieve that. That may take a while, though, because it's not a feature that can be added superficially--it has to be built in from the ground up to make sense. And that means writing a [em]new[/em] RDBMS, which is a hell of a lot of work. Somebody's probably already working on this sort of thing in a research project, so I'm sure we'll see the fruits within twenty years or so. ;>

And of course, you can layer a cache on top of any of these things. (And you probably need to, with a massively distributed system--in order to provide additional coherence to the end user. It doesn't matter much if user A sees user B's posts a minute late. But user A would absolutely be disturbed if they did not see their *own* posts immediately.)
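
Here's a tiny sketch of that "overlay your own writes" idea. LaggyStore is a made-up stand-in for an eventually consistent backend (reads lag writes by a couple of seconds), not a real client library:

[code]
import time

class LaggyStore:
    """Pretend eventually-consistent store: writes become visible after LAG seconds."""
    LAG = 2.0

    def __init__(self):
        self._writes = []  # (visible_at, post)

    def write(self, post):
        self._writes.append((time.time() + self.LAG, post))

    def read_all(self):
        now = time.time()
        return [post for visible_at, post in self._writes if visible_at <= now]

class Session:
    def __init__(self, store):
        self.store = store
        self.own_recent = []

    def post(self, text):
        post = {"text": text}
        self.store.write(post)
        self.own_recent.append(post)   # overlay: visible to this user immediately

    def timeline(self):
        seen = self.store.read_all()
        # Globally visible posts, plus this session's own writes that haven't
        # propagated yet.
        return seen + [p for p in self.own_recent if p not in seen]

store = LaggyStore()
alice = Session(store)
alice.post("hello, eventually-consistent world")
print(alice.timeline())   # Alice sees her own post right away
print(store.read_all())   # still empty here; everyone else sees it once the lag passes
[/code]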

I see it as a boon for the super high performance tier, like web applications that need to stay responsive for mobile/AJAX clients.

Look at it this way: say you have a blog engine in an RDBMS. You have a users table, a posts table, an authors table, a comments table, and so on. To display a blog page with comments you are typically doing multiple lookups, and each of those lookups is scanning an index and doing subsequent lookups to find each "row" of data, as well as joining things together so all the comments have user names, etc. in human readable format. When you think on it... the application displaying the blog post wants all of this information together pretty much all the time, so all those joins and lookups are highly inefficient. In something like CouchDB you could store the whole blog post and comments as one document, making those lookups highly efficient.
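
To make the document-store version concrete, here's roughly what that denormalized post might look like. The field names and the fake round-trip through json are just for illustration, not CouchDB's actual API:

[code]
# One key lookup returns everything the blog page needs: author names and
# comments are embedded rather than joined in at read time.
import json

blog_post = {
    "_id": "post:2011-09-14-why-nosql",
    "title": "Why is NoSQL the new hotness?",
    "author": {"id": "user:42", "display_name": "alice"},
    "body": "Lorem ipsum...",
    "comments": [
        {"author": {"id": "user:7", "display_name": "bob"},
         "body": "There are no short-cuts."},
        {"author": {"id": "user:13", "display_name": "carol"},
         "body": "Nice post."},
    ],
}

doc = json.loads(json.dumps(blog_post))   # pretend round-trip through the store
print(doc["title"], "by", doc["author"]["display_name"])
for c in doc["comments"]:
    print(" -", c["author"]["display_name"] + ":", c["body"])

# The trade-off: if bob renames himself, his display_name is now embedded in
# every document he ever commented on, and updating those copies is your problem.
[/code]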

Given the comment that forums don't need to be massively scalable... on its surface Facebook is pretty much a massive forum with 3rd party apps attached.

So I am starting to see it more as a question of data "grouping" and performance vs... reporting? RDBMSs do a great job at data warehousing and slicing across data in various ways... if your use case says "this set of data goes together 95% of the time," then key/value may be for you, especially if high performance is a key concern.

Hypatian wrote:

In any case, these are examples of ways that a good design can avoid needless synchronization. That can speed things up even on a single DB instance, and it speeds things up dramatically on a distributed setup--to the point where you can suddenly scale up to hundreds and thousands of nodes at different locations rather than just a few closely-located nodes.
..
These are things that traditional RDBMSs don't really do, because they're built at the very core with an old-school "everything is consistent all the time" model more well suited to bank transactions and the like.
..
And of course, you can layer a cache on top of any of these things. (And you probably need to, with a massively distributed system--in order to provide additional coherence to the end user. It doesn't matter much if user A sees user B's posts a minute late. But user A would absolutely be disturbed if they did not see their *own* posts immediately.)

I am completely stuck in that old-school way of thinking about how computers work. What textbook can I read to grok this whole "whaddya mean eventually it'll get updated, Does Not Compute!".
(Ironically I started reading the books on the Actor model and Pi-Calculus a few months ago, need to go back, but it's not the mathematical abstract level I need, I need the system-level cause I still think of those things happening on a single computer albeit with parallel hardware.)

Another thought: if you've ever been in a high performance RDBMS environment you will quickly find that scaling out in database land is exceptionally expensive and highly difficult. The NewSQL databases scale out fantastically well.

RolandofGilead wrote:

(Ironically I started reading the books on the Actor model and Pi-Calculus a few months ago, need to go back, but it's not the mathematical abstract level I need, I need the system-level cause I still think of those things happening on a single computer albeit with parallel hardware.)

Try thinking in terms of chemical reactions. Both your data (message ready to send) and your computations (receive + do some stuff + message send) are reagents floating around in water. When data and a compatible computation meet, they react (the message is consumed, transformed by the computation, and a new message is produced). It's all non-deterministic--it's just which compatible reagents happen to meet up. And it can be in parallel--multiple reactions can happen simultaneously.

In terms of an actual computer, things run sequentially on individual cores, but it stays non-deterministic: you don't know which things are going to happen when. When possible, you want to keep computing on the same core: if you produce a piece of data, then compute on it to produce more, and so on, and stay on the same core you have nice cache locality. But if there's a lot more work to do than one core can deal with, it's easy to see how you could spill onto a second core. Take some of the reagents and throw them over onto that core and start it going. And likewise, you can do the same thing over the network--migrate the data over to another host. Each step, of course, is more costly, so you don't do it until you have to. On the other side, when a core or machine isn't doing much work, stop some cores/machines and migrate their data to another lightly loaded one. (You can be smarter about those and actually work at putting things together that are compatible to improve efficiency, but you don't have to--you can just do it randomly.)

It's a fun computational model, with a lot of promise, especially for certain sorts of applications.

A simplified version of this is the idea of 'Tuple Spaces' (see also "JavaSpaces"). In that scenario, computations and data are separate (as opposed to the pi-calculus where there's really no difference). Code can say "put data X into the space", and "give me a piece of data that looks like Y, because I want to handle that". So here, you have an explicit set of threads sifting through the data. That makes things a little bit more deterministic, but it's still limited by the fact that data is not guaranteed to come out in anything like the order it went in. In fact, the only operation for fetching data is "give me a random data that matches this pattern, and remove it from the space".

An example of how you could use this is for an email processing system. Say you have one thread listening for messages. When a message arrives, it gets wrapped up as a tuple value and thrown into the pool. Then let's say you have two more threads. One looks for messages that haven't been virus scanned yet, and virus scans them--and then it puts the message in the pool, annotated with its virus status (and therefore the scanner won't pick it up again). The other looks for messages that haven't been checked for spam, same thing. Then you have two final threads. One looks for messages marked as spam and sends out bounces. One looks for messages marked as checked for viruses and spam and delivers them (or possibly a virus warning instead of the original message if there was a virus).
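
Here's a toy version of that pipeline in Python: a hand-rolled tuple space with put() and a blocking, pattern-matched take(), plus one thread per role. Field names, thread counts, and the random "scan results" are all invented for illustration -- real tuple space implementations (JavaSpaces and friends) are much richer than this:

[code]
import random
import threading
import time

class TupleSpace:
    def __init__(self):
        self._items = []
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(dict(item))
            self._cond.notify_all()

    def take(self, **pattern):
        # Block until some tuple matches every key/value in pattern, then
        # remove it from the space and return it.
        with self._cond:
            while True:
                for i, item in enumerate(self._items):
                    if all(item.get(k) == v for k, v in pattern.items()):
                        return self._items.pop(i)
                self._cond.wait()

space = TupleSpace()

def virus_scanner():
    while True:
        msg = space.take(virus_checked=False)
        time.sleep(0.01)                      # pretend to scan
        msg["virus_checked"] = True
        msg["has_virus"] = random.random() < 0.1
        space.put(msg)                        # annotated, so it won't be picked up again

def spam_checker():
    while True:
        msg = space.take(spam_checked=False)
        time.sleep(0.01)
        msg["spam_checked"] = True
        msg["is_spam"] = random.random() < 0.3
        space.put(msg)

def bouncer():
    while True:
        msg = space.take(is_spam=True)
        print("bounce:", msg["subject"])

def deliverer():
    while True:
        msg = space.take(virus_checked=True, spam_checked=True, is_spam=False)
        note = " (delivering a virus warning instead)" if msg["has_virus"] else ""
        print("deliver:", msg["subject"] + note)

for worker in (virus_scanner, spam_checker, bouncer, deliverer):
    threading.Thread(target=worker, daemon=True).start()

for n in range(5):
    space.put({"subject": "mail %d" % n, "virus_checked": False, "spam_checked": False})

time.sleep(1)   # let the worker threads churn; a real system would run forever
[/code]

Scaling the thing up is just more threading.Thread(...) calls (or more machines) pointed at the same space.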

So some things to note about this example: You can't decide ahead of time whether a message will be virus checked or spam checked first. You can be sure that no message will be virus checked or spam checked more than once. A message that's spam checked and rejected could be bounced without being virus checked. Or it could be virus checked and then bounced.

And each of these operations could be handled by multiple threads at once, possibly on multiple physical systems. Each one would just have its local pool of messages. You could vary the number of threads running the various activities based on your actual load. If you wanted to try a new alternative virus checker, you could just add it as another thread talking to the same space. You wouldn't be able to be sure which virus checker was to be run on any individual message, but you could audit things to see what was going through the new checker, and how long it was taking, and whether the results were good. (And in fact, you could use that for a benefit on the spam side. Let's say you have one spam algorithm that's super good but super slow--and that its results can be used to train the weaker, faster one. Run one thread of the super slow one, and a few threads of the fast one. So most messages go through the fast one, and the slow one does as many as it can manage, and then its data is used to train the others.)

So that's sort of a similar system, in that it has that "non-deterministic reaction" sort of thing going on. But, the individual pieces are a bit more set in stone and comprehensible. (With pi-calculus you *could* easily build tuple spaces by having a fixed set of channels and a fixed number of processes that never fork, but you could also do much more interesting and mind-bending things.)

Hypatian wrote:

Cool Stuff

I just got done reading the chapter that touches on this in The Pragmatic Programmer. Really cool to see stuff you recognize so soon after.

RolandofGilead wrote:

I am completely stuck in that old-school way of thinking about how computers work. What textbook can I read to grok this whole "whaddya mean eventually it'll get updated, Does Not Compute!".

I don't know of any textbooks off the top of my head. If you want to read some touchstone articles, Google's Bigtable paper and Amazon's Dynamo paper are two of the most commonly cited works in this area. Brush up on the CAP theorem as you do so.

A concrete way of looking at it which might help is the quorum behavior in replica sets. For simplicity's sake, imagine you have 7 machines, distributed across the globe, each with a full copy of your data. You could choose to accept a performance hit to ensure that all 7 machines are kept perfectly in sync [strict consistency]. Alternatively, you could design your application to consider a read or write operation successful if it reaches 4 of the 7 machines (in this case, probably by geographical proximity); this is referred to as obtaining a quorum. With this setup, it is impossible for a read operation not to return the latest view of the data: 4 + 4 > 7, meaning at least 1 machine that participated in a given write of data will be contacted for reading that data (the inequality is generally formulated as W + R > N, and there's also a reason behind having W > N/2). The cost of obtaining a reliable write quorum goes up as the data gets out of sync, though -- so you allow the machines to play catch-up whenever there are spare cycles. If the access patterns of your system indicate that "4 writes + 4 reads" is preferable to "7 writes + 1 read", you could be in a position to take advantage of eventual consistency.
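
If it helps to see the arithmetic in motion, here's a small simulation of that setup (N=7, W=4, R=4). The random subsets stand in for "the 4 geographically closest machines", and the version counter stands in for whatever timestamping a real system would use:

[code]
import random

N, W, R = 7, 4, 4
replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version):
    # The write succeeds once it reaches any W replicas.
    for i in random.sample(range(N), W):
        replicas[i] = {"version": version, "value": value}

def read():
    # Contact any R replicas and trust the one with the highest version.
    contacted = [replicas[i] for i in random.sample(range(N), R)]
    return max(contacted, key=lambda r: r["version"])

write("balance=4000", version=1)
write("balance=3000", version=2)
for _ in range(5):
    print(read())   # always version 2: since W + R > N, every read set overlaps the last write set
[/code]

(The W > N/2 part, as I understand it, is what keeps two concurrent write quorums from both succeeding without at least one replica seeing both.)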

ADD: Thought of another concrete way of looking at it when I woke up this morning. Let's say your application is a global social network. Your users only interact with a portion of the data; in order to have a good experience they just need their working set to reflect 1) their own actions with strict consistency, 2) the actions of people in their network with eventual [but fast] consistency, and maybe 3) the actions of friends-of-friends with eventual consistency. With those needs in mind, you can design a system which ensures that users read their own data from the machine(s) to which they wrote and which localizes data as much as possible in order to keep the number of on-demand reads and writes as low as possible.

Both scenarios rely on some mechanism for playing catch-up behind the scenes. I don't know the internals of these mechanisms very well, apart from the fact that they generally use timestamp information to collate the atomic operations and benefit from relative formulations of writes more than absolute ones (i.e. "Set Jim's account balance to $5000" is more likely to cause conflicts or errors than "Subtract $1000 from Jim's account balance"). In the event of conflicts, IIRC the system will generally return a predictably-structured response indicating the conflict and let the application logic determine the proper resolution (in some cases, this could even be propagated out to the end user). I seem to recall Cassandra and MongoDB having some good information available on how their systems function internally; might be worth a look.
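
A quick sketch of why the relative formulation plays nicer with catch-up: deltas commute, so two replicas that see the same operations in different orders still agree, while absolute writes make the outcome depend on arrival order. (Toy example, obviously -- real systems attach timestamps and do something smarter.)

[code]
def apply_ops(start, ops):
    balance = start
    for kind, amount in ops:
        balance = balance + amount if kind == "delta" else amount
    return balance

deltas = [("delta", -1000), ("delta", +250)]
print(apply_ops(6000, deltas))           # 5250
print(apply_ops(6000, deltas[::-1]))     # 5250 -- same answer in either order

absolutes = [("set", 5000), ("set", 5250)]
print(apply_ops(6000, absolutes))        # 5250
print(apply_ops(6000, absolutes[::-1]))  # 5000 -- order now changes the final state
[/code]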

Cyranix wrote:

Most of the big players that I'm aware of have chosen the Third Way on the issue of data storage: use both.

I <3 Postgres and Redis.

Having recommended Seven Languages in Seven Weeks earlier in this thread, I will now add a recommendation for its companion volume, Seven Databases in Seven Weeks. It starts with a representative traditional SQL RDBMS (PostgreSQL) and then surveys six popular NoSQL databases of various kinds (Riak, HBase, MongoDB, CouchDB, Neo4j, and Redis), exploring the different strategies they use and their individual pros and cons. There's also a chapter on the CAP theorem. Excellent background reading here.

DSGamer wrote:

I like that book so far. Random thing. I met the guy who wrote that book while waiting in line for brunch at a restaurant here in Portland. Nice guy. We talked at length about NoSQL databases (I didn't know he had a book, we were just making small talk with this couple and it branched into our jobs and tech), the theory, etc. I picked up his book later and it's the main reference I'm suing outside of those whitepapers.

You're suing the guy?! Didn't he write the book before he met you?

Cyranix wrote:

@complexmath: The question of "out-scaling" is a bit too vague to answer. Pretty much every significant NoSQL solution can scale quite well. Again it's a question of trade-offs, in this case mostly in terms of money. I'm not as much of a hardware/devops guy, so I couldn't price it out too well. But yeah... unless you're examining a fixed ratio of physical resources cost, total storage, response time, maintenance, etc., the best answer I could give you is "it depends".

Classic SQL databases are fantastic for vertical scaling, but once you want distributed transactions or replication things start to fall over. In theory, NoSQL is much better for this provided you don't have atomicity requirements, but all of the discussion I've seen regarding the performance and scalability of publicly available NoSQL databases tops out at thresholds that are orders of magnitude less than I'd actually need. MongoDB, for example, used to do only a few hundred transactions per second (I don't know the numbers now), so even if it scales well horizontally, the hardware required to match the capability of a single SQL server is considerable.

Real use of NoSQL is interesting though. Consider Amazon. They want to guarantee rapid transaction completion, but global atomicity isn't actually required. So what they'll do is give you a fairly recent count of how many items are in stock and let you buy one. Then the order is processed asynchronously, and it's possible that with a large number of concurrent purchases of the same item, you end up on a backorder list instead. So they might use a NoSQL database for initial transaction processing and a SQL database for the final reconciliation. I'd be curious to see, from a user perspective, how they handle it when something you bought turns out to be backordered. I don't think I've actually had this happen to me.
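
Here's my guess at that flow as a tiny sketch -- the stock numbers, the queue, and the split between a soft front-end check and a later reconciliation are all invented; I have no idea what Amazon actually does internally:

[code]
import queue

approx_stock = 3             # the possibly-stale count shown on the product page
actual_stock = 2             # what the warehouse really has
orders = queue.Queue()

# Front end: every purchase "succeeds" immediately against the soft count.
for customer in ["ann", "bob", "cho", "dee"]:
    if approx_stock > 0:     # no global lock on inventory
        approx_stock -= 1
        orders.put(customer)
# dee is told "out of stock"; ann, bob, and cho all got instant confirmations.

# Later, asynchronous reconciliation against the real inventory.
shipped, backordered = [], []
while not orders.empty():
    customer = orders.get()
    if actual_stock > 0:
        actual_stock -= 1
        shipped.append(customer)
    else:
        backordered.append(customer)

print("shipped:", shipped)           # ['ann', 'bob']
print("backordered:", backordered)   # ['cho']
[/code]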

ibdoomed wrote:
DSGamer wrote:

I like that book so far. Random thing. I met the guy who wrote that book while waiting in line for brunch at a restaurant here in Portland. Nice guy. We talked at length about NoSQL databases (I didn't know he had a book, we were just making small talk with this couple and it branched into our jobs and tech), the theory, etc. I picked up his book later and it's the main reference I'm using outside of those whitepapers.

You're suing the guy?! Didn't he write the book before he met you?

Hopefully my edit represents DSGamer's intent.

S0LIDARITY wrote:
ibdoomed wrote:
DSGamer wrote:

I like that book so far. Random thing. I met the guy who wrote that book while waiting in line for brunch at a restaurant here in Portland. Nice guy. We talked at length about NoSQL databases (I didn't know he had a book, we were just making small talk with this couple and it branched into our jobs and tech), the theory, etc. I picked up his book later and it's the main reference I'm using outside of those whitepapers.

You're suing the guy?! Didn't he write the book before he met you?

Hopefully my edit represents DSGamer's intent.

Yes.

Funny typo.

CaptainCrowbar wrote:

Having recommended Seven Languages in Seven Weeks earlier in this thread, I will now add a recommendation for its companion volume, Seven Databases in Seven Weeks. It starts with a representative traditional SQL RDBMS (PostgreSQL) and then surveys six popular NoSQL databases of various kinds (Riak, HBase, MongoDB, CouchDB, Neo4j, and Redis), exploring the different strategies they use and their individual pros and cons. There's also a chapter on the CAP theorem. Excellent background reading here.

I like that book so far. Random thing. I met the guy who wrote that book while waiting in line for brunch at a restaurant here in Portland. Nice guy. We talked at length about NoSQL databases (I didn't know he had a book, we were just making small talk with this couple and it branched into our jobs and tech), the theory, etc. I picked up his book later and it's the main reference I'm using outside of those whitepapers.

There is some really interesting stuff in this thread, though. I guess I should push past the preachiness of the early parts of the Pragmatic Programmer to get to the aforementioned architecture talk.

What? No love for MS Access?

MS Access is fantastic for single-user applications, and pretty nice as a front-end for a remote SQL database. I wouldn't try to get any fancier with it than that though.

@CaptainCrowbar: Thanks for the tip, that book sounds excellent.

@complexmath: I admit that my knowledge on scaling is mostly second-hand. You could very well be right in saying that the cost difference might be staggering. If that's the case, then it probably reinforces the idea noted previously: NoSQL solutions generally shine when they can perform some specialized data-handling task, which makes them more effective as a complement to an RDBMS (which should be the default assumption), as in your Amazon example.

complexmath wrote:

MS Access is fantastic for single-user applications, and pretty nice as a front-end for a remote SQL database. I wouldn't try to get any fancier with it than that though.

I was joking, Access is the devil!

Dr.Ghastly wrote:
complexmath wrote:

MS Access is fantastic for single-user applications, and pretty nice as a front-end for a remote SQL database. I wouldn't try to get any fancier with it than that though.

I was joking, Access is the devil!

At least Access is easy to use. Not like Google Forms.

Dr.Ghastly wrote:

I was joking, Access is the devil!

There are a lot of times where I want to do something simple and data-oriented where a spreadsheet just doesn't cut it. Access is perfect for this. I'm mostly a Mac person these days though, and the comparable Mac option, FoxPro, is laughable.

complexmath wrote:
Dr.Ghastly wrote:

I was joking, Access is the devil!

There are a lot of times where I want to do something simple and data-oriented where a spreadsheet just doesn't cut it. Access is perfect for this. I'm mostly a Mac person these days though, and the comparable Mac option, FoxPro, is laughable.

This is how I got started writing python scripts.

complexmath wrote:
Dr.Ghastly wrote:

I was joking, Access is the devil!

There are a lot of times where I want to do something simple and data-oriented where a spreadsheet just doesn't cut it. Access is perfect for this. I'm mostly a Mac person these days though, and the comparable Mac option, FoxPro, is laughable.

I thought the Mac option was FileMaker? That's what we use and I love it compared to Access.

Oops, you're right. FoxPro is Microsoft's old DB thing. For single-table stuff, FileMaker is passable, but it's terrible at doing multi-table stuff and the reporting support is surprisingly weak. The last time I used it I wanted to report on a single table with grouped subtotals, and as far as I can tell this is impossible in FileMaker.

F**k it.

We'll do it in HyperCard.

IMAGE(http://images.granneman.com/tech-history/1987_HyperCard.jpg)