The thrill of a new technology: CouchDB
Posted by Giovanni Intini | Filed under Programming
It has been a long time since I fell in love with a new technology (maybe Ruby on Rails when it first came out?), and the time I am spending with CouchDB lately has been a nice return to the pleasure of programming.
I’ll probably write some in-depth articles on Document based DBs later, but let me show you how complex operations on relational DBs can become really simple with a document based approach.
Let’s say you have a db of tagged objects, and you want to get a report with a “per tag” distribution of your objects. In a relational environment you would need at least three tables (objects, tags and a join table) and a select operation involving two joins to get the results you need:
SELECT tags.name, group_concat(people.name)
FROM tags
INNER JOIN tags_people ON tags.id = tags_people.tag_id
INNER JOIN people ON people.id = tags_people.person_id
GROUP BY tags.name
This isn’t rocket science, but I guess not everyone knows about group_concat.
In a document based environment I would just have a document for each person, with a tags property that holds an array of tags. The map and reduce functions would look like this:
// map
function(doc) {
  if (doc['couchrest-type'] == 'Person' && doc.tags) {
    // emit one row per tag, keyed by the tag, with the person's name as value
    doc.tags.forEach(function(tag) {
      emit(tag, doc.name);
    });
  }
}

// reduce
function(keys, values, rereduce) {
  return values;
}
Using Javascript to manipulate data may seem counterintuitive, but as you can see we have a much simpler datastore (just people documents thrown in the db) and an easy way of manipulating them to extract the info we need.
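To see what the map step produces without a running CouchDB, here is a minimal local simulation. It is only a sketch: the sample documents and the runView helper are made up for illustration, and emit is passed in as a parameter because CouchDB normally supplies it to the view function itself.

```javascript
// Local simulation only. In CouchDB, emit() is provided by the view server.
// The sample documents and the runView helper are hypothetical.
const docs = [
  { 'couchrest-type': 'Person', name: 'Ann', tags: ['programmer', 'blogger'] },
  { 'couchrest-type': 'Person', name: 'Bob', tags: ['wrestler', 'blogger'] },
  { 'couchrest-type': 'Note', title: 'not a person', tags: ['blogger'] }
];

// Same logic as the map function in the post, with emit injected.
function mapPerson(doc, emit) {
  if (doc['couchrest-type'] == 'Person' && doc.tags) {
    doc.tags.forEach(function (tag) {
      emit(tag, doc.name);
    });
  }
}

// Run the map over every document and group the emitted values by key,
// roughly what a grouped view query would hand back.
function runView(documents, map) {
  const grouped = {};
  documents.forEach(function (doc) {
    map(doc, function (key, value) {
      (grouped[key] = grouped[key] || []).push(value);
    });
  });
  return grouped;
}

const perTag = runView(docs, mapPerson);
console.log(perTag);
```

The Note document is skipped by the type check, so only people show up under each tag.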
I love CouchDB.
March 30th, 2009 at 3:34 pm
I’m not so sure this is an improvement at all:
Moving from 178 characters -> to 219 characters is better?
Moving from 1 declarative statement -> 3 functions (being used in a non explicit way) is better?
Knowing that the schema guarantees the correct results -> not actually knowing that this code will run is better?
Being able to query the tags -> not being able to get at all your tags except through walking the docs is better?
Every single identifier in the SQL statement is still present in the javascript code (there is a logical difference however as multiple tags could have the same name, though they are getting splatted in your group, meaning the tags table is probably redundant in your model). Perhaps the inner joins are offensive for their explicit value references, though those could be replaced by natural joins. Just because you are stating structure explicitly in SQL doesn’t make it harder than the implicit statements like “doc.tags”.
Map reduce is a great tool for making parallel programs, but this sort of problem is SQL’s wheelhouse.
March 30th, 2009 at 3:46 pm
JKF, thanks for the comment, but I disagree with what you said. Just not having to split your data in three tables is worth the admission price IMHO.
I can understand how Javascript isn’t the most wonderful thing to look at (I also prefer SQL syntax), but the paradigm shift is incredible in my opinion.
March 30th, 2009 at 4:07 pm
Hey, thanks for the reply. I want to try to get to the bottom of why these data stores are seen as leaps forward in data modelling (I can certainly accept the scalability argument).
What is wrong with “having to split your data in three tables”?
I know that creating new tables feels “heavy” but it is logically no more complex than having an object with a “.” chain. In fact it can be even simpler to think about because it relies entirely on the equality of values, instead of an entity with an identity and a reference pointer.
The act of breaking data into tables is no less work than deciding to nest tag objects inside the person objects conceptually. For example:
{ person_name: "Mr Happy", Tags: [ {tag_name: "Handsome"}, {tag_name: "Honest"} ] }
This still contains all the same concepts as the 3 tables. The person structure is still there, the tag structure is still there, and the fact that tags is an array, not a single value, is still there. We haven’t actually saved any of the work, so where is the paradigm shift?
March 30th, 2009 at 4:13 pm
“In a document based environment I would just have a document for each person, with a tags property that holds an array of tags”
If you like that, why do you normalize it in the SQL version? The tags_people could hold the tags as strings too.
Where is the Javascript executed? On the client? Then you have to send more data over the net.
March 30th, 2009 at 4:18 pm
Let’s say we’re modeling people, SQL case:
1) You think in advance about the possible fields you will need.
2) You create the table(s)
3) If you later decide to add fields you have to change the schema for ALL the people you have in your db.
Document based:
{ "name": "Giovanni", "surname": "Intini", "tags": ["programmer", "blogger"] }
Then you add another person with different attributes:
{ "name": "Hulk", "surname": "Hogan", "tags": ["wrestler"], "song": "I'm a Real American" }
You can now query the db for songs, tags, people, and whatever extra property you want to add to the people you’re tracking without needing to add dozens of null fields to those people who don’t have a theme song (or a favorite sport, or whatever), because you don’t depend on a schema.
IMHO that’s a paradigm shift.
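As a rough sketch of what “querying for songs” could look like, here is a map function that indexes only the people who actually have a song. The function name and the row format are assumptions for illustration, and emit is injected so the snippet runs in plain Node rather than inside CouchDB.

```javascript
// Sketch of a CouchDB-style map function: only documents with a song emit a row.
// emit is passed in here; inside CouchDB it is supplied by the view server.
function mapSongs(doc, emit) {
  if (doc.song) {
    emit(doc.song, doc.name + ' ' + doc.surname);
  }
}

const songPeople = [
  { name: 'Giovanni', surname: 'Intini', tags: ['programmer', 'blogger'] },
  { name: 'Hulk', surname: 'Hogan', tags: ['wrestler'], song: "I'm a Real American" }
];

// Collect the emitted rows the way a view index would.
const songRows = [];
songPeople.forEach(function (doc) {
  mapSongs(doc, function (key, value) {
    songRows.push({ key: key, value: value });
  });
});

console.log(songRows);
```

The person without a song simply produces no row, so no null placeholder is needed anywhere.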
March 30th, 2009 at 4:30 pm
I don’t think you can truly compare document based DBs to relational DBs. It’s like comparing apples and oranges.
The reason? Both have fundamentally different goals:
Document based: Basically key-value stores – allowing you to store unstructured data in an unstructured way, giving you the ability to easily scale (as all you have is a glorified hash)
Relationally based: a store that uses logical rules to ensure a consistent state such that logical requests of the data will always be satisfied
For example (using the typical suppliers-parts model), in a document based DB you can’t be certain that a supplier-part has an associated supplier or part.
Note: this isn’t to say that there is no place for document based DBs (for example, if you don’t know your schema ahead of time)
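To make the suppliers-parts point concrete, here is a small sketch of the kind of check an application would have to run itself when the store does not enforce references. The document shapes and field names (type, supplier_id, part_id) are assumptions for illustration, not something from a real schema.

```javascript
// Hypothetical suppliers-parts documents. Nothing stops sp2 from pointing
// at a supplier that does not exist.
const spDocs = [
  { _id: 's1', type: 'supplier', name: 'Smith' },
  { _id: 'p1', type: 'part', name: 'Nut' },
  { _id: 'sp1', type: 'supplier_part', supplier_id: 's1', part_id: 'p1', qty: 300 },
  { _id: 'sp2', type: 'supplier_part', supplier_id: 's9', part_id: 'p1', qty: 100 }
];

// Return the ids of supplier_part documents whose supplier or part is missing.
function findDanglingLinks(documents) {
  const supplierIds = new Set();
  const partIds = new Set();
  documents.forEach(function (doc) {
    if (doc.type === 'supplier') supplierIds.add(doc._id);
    if (doc.type === 'part') partIds.add(doc._id);
  });
  return documents
    .filter(function (doc) { return doc.type === 'supplier_part'; })
    .filter(function (doc) {
      return !supplierIds.has(doc.supplier_id) || !partIds.has(doc.part_id);
    })
    .map(function (doc) { return doc._id; });
}

const dangling = findDanglingLinks(spDocs);
console.log(dangling);
```

In a relational schema a foreign key constraint would reject sp2 at insert time. Here the problem only surfaces if someone remembers to look for it.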
March 30th, 2009 at 4:31 pm
@derique: I’m storing an array with the tags, not a string. Moreover javascript is just the language CouchDB uses for MapReduce. It’s executed inside the db server, not on the client.
March 30th, 2009 at 4:32 pm
I’m sorry that you’re going to feel “ambushed” by the RDBMS horde asking these questions and usually acting very arrogant and seeming to “not get ‘it’”, but please don’t think it’s because the RDBMS advocates are afraid of new ideas or don’t really understand the hierarchical model. It’s because they are seeing a return to old, failed ideas, which the RDBMS was invented to fix. They will seem so cranky because so much literature on this issue exists, but when you come at data from the programmer’s perspective you’re forced to see it with the wrong paradigm. This is why you’ll see so many people question the couchDB “innovation” in an apparently rude way. It’s more frustration than anything. I’ll try and respond to your point though:
The issue with adding new attributes to some people and not others is that your program logic that accesses this data now has to check for the existence of these tags each time it deals with the person. This is incredibly difficult to manage in a large program. Your program logic will become filled with these checks, and your old program logic may actually end up breaking (because your new tags may change the meaning of what your person object represents).
Those new checks for the tags are logically equivalent to adding another table “PersonSong”. You can now add your person song data into this table, which explicitly describes the fact that it may or may not exist. The relational model is no less flexible in this way. Data may or may not exist? Add a table. Must exist? Add a column. This stops you from writing code that will likely end up corrupting your data a couple years down the road.
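A small sketch of the two shapes being compared here, using made-up data: in the document style every reader guards against a missing field, while in the relational style the optional fact lives in its own “PersonSong” table (modelled below as a Map) and absence is just a missing row. The names are assumptions for illustration.

```javascript
// Hypothetical data: the field names (id, name, song) are assumptions.
const persons = [
  { id: 1, name: 'Giovanni' },
  { id: 2, name: 'Hulk', song: "I'm a Real American" }
];

// Document style: every piece of code that reads a person has to guard
// against the song being absent.
function songFromDocument(person) {
  return typeof person.song === 'string' ? person.song : null;
}

// Relational style: the optional fact is a row in a separate "PersonSong"
// table, keyed by person id. No row means no song.
const personSong = new Map([[2, "I'm a Real American"]]);

function songFromTable(personId) {
  return personSong.has(personId) ? personSong.get(personId) : null;
}

persons.forEach(function (person) {
  console.log(person.name, songFromDocument(person), songFromTable(person.id));
});
```

Both lookups answer the same question. The difference is where the “may not exist” knowledge is written down: in each reader’s code, or once in the schema.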
March 30th, 2009 at 4:34 pm
couch db is old technology (seen in lotus notes/domino)
move along …. nothing to see here
March 30th, 2009 at 4:35 pm
@JKF before we go on, I am all for RDBMS
I’m just saying that for a certain class of applications the Document based approach is far superior.
March 30th, 2009 at 4:47 pm
I think CouchDB is the best choice for applications such as a wiki, a CRM or other applications based on documents, like CouchDB itself.
Sometimes you don’t need a schema, and a relational database is restraining.
March 30th, 2009 at 4:50 pm
Hey Giovanni,
No doubt couchDB is better for a certain class of apps. Though it is usually better for less apps than I see it advocated for: basically anything where you care about the data’s integrity. CouchDB usually solves business scalability concerns, not logical concerns.
A really useful article imo would be an article outlining exactly which apps are better off in the distributed hierarchical model. Perhaps that’s the best way to advocate the technology (you won’t have us annoying RDBMS fundamentalists all over you then!). The problem is that couchDB type tech is great in the very small and the very large, but 90% of apps don’t play in that space.
March 30th, 2009 at 4:53 pm
@JKF at Mikamai (http://www.mikamai.com) we’re rewriting our internal CRM switching from a MySQL store to a CouchDB store.
We have just started and CouchDB has already solved lots of implementation issues that needed lots and lots of code and strange conventions to be implemented in the first place (like tag algebra).
March 30th, 2009 at 5:10 pm
I am starting a new system and we have decided to go with a quasi approach…
So we have an RDBMS with relational data and a documents table which is updated periodically.
It gives us the strength of data integrity, and the documents are used for the web client code… it also allows us to separate the db into a working db and a reading db, where all writes (apart from document updates) go into the relational structure and all of the reads for the client happen on the documents table.
your thoughts?
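One way to picture the periodic document update is a job that flattens the relational rows into one read-only document per person. This is only a sketch under assumed table shapes (people, tags, tags_people, borrowed from the SQL example in the post). The buildDocuments helper is hypothetical and is not the commenter’s actual code.

```javascript
// Hypothetical rows as they might come back from the relational tables.
const peopleRows = [{ id: 1, name: 'Ann' }, { id: 2, name: 'Bob' }];
const tagRows = [{ id: 10, name: 'ruby' }, { id: 11, name: 'couchdb' }];
const tagsPeopleRows = [
  { person_id: 1, tag_id: 10 },
  { person_id: 1, tag_id: 11 },
  { person_id: 2, tag_id: 10 }
];

// Build one denormalized document per person for the read side.
function buildDocuments(people, tags, tagsPeople) {
  const tagNames = new Map(tags.map(function (t) { return [t.id, t.name]; }));
  return people.map(function (p) {
    return {
      id: p.id,
      name: p.name,
      tags: tagsPeople
        .filter(function (tp) { return tp.person_id === p.id; })
        .map(function (tp) { return tagNames.get(tp.tag_id); })
    };
  });
}

const readDocs = buildDocuments(peopleRows, tagRows, tagsPeopleRows);
console.log(JSON.stringify(readDocs));
```

Writes keep going through the constrained tables, and the documents can be thrown away and rebuilt at any time because they hold no information of their own.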
March 30th, 2009 at 5:13 pm
Well I wish you all the luck with that transition, I really hope it works for you, but the best I can do is warn you: What feels like flexibility now in getting rid of the impedance mismatch in your “store” will constrain your ability to organize, aggregate, and adapt your data in the future.
Even for web2.0 companies, your life will be your data, and thinking about how you deal with data first, instead of how to just store the artifacts of your programming is how you will build lasting value. Your code no matter how lovely will rot and be thrown away in some future version, but your data should be forever in an information organization. Never forget that.
Best of luck, and hopefully be aware of what you’re giving up while moving forward into that schema-less world.
March 30th, 2009 at 5:13 pm
I think sometimes flexibility is what we need, and relational databases are not the right solution in some cases.
Think about this case. You wanna map all of the devices in your home. In this scenario you can have different devices, and each one can have different properties. Flexibility in this case is necessary, as you can’t really decide beforehand which fields you will have.
March 30th, 2009 at 5:16 pm
@jkf
“Though it is usually better for less apps than I see it advocated for: basically anything where you care about the data’s integrity.”
This is a false choice. Data integrity is not, and should not, be an either or proposition for any serious app.
March 30th, 2009 at 5:33 pm
This looks like something that’s useful on smaller projects where a relational db would be overkill. However, I don’t think it would be appropriate in an enterprise environment.
Also, data storage is a hard problem, people. There is no one-solution-that-fits-all.
March 30th, 2009 at 9:03 pm
What is the point of the reduce function in your example?
// reduce
function(keys, values, rereduce) {
return values;
}
March 30th, 2009 at 11:10 pm
@bennyb it’s used to return an array of all the people tagged with a certain tag in a single document.
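One caveat worth sketching: when CouchDB re-reduces, the values argument holds the results of earlier reduce calls, so returning values unchanged would give an array of arrays. A variant that flattens in the rereduce case might look like the following. The function name and sample inputs are made up, and the CouchDB documentation generally advises against reduce functions whose output grows with the input, so treat this as an illustration rather than a recommendation.

```javascript
// Sketch: collect names per tag, flattening partial results on rereduce.
function reduceNames(keys, values, rereduce) {
  if (rereduce) {
    // values is an array of arrays returned by earlier reduce calls
    return values.reduce(function (all, part) {
      return all.concat(part);
    }, []);
  }
  // first pass: values are the names emitted by the map function
  return values;
}

// Simulate two first-pass reductions followed by a rereduce.
// On the first pass keys are [key, docid] pairs; on rereduce keys is null.
const firstPass = reduceNames([['blogger', 'id1'], ['blogger', 'id2']], ['Ann', 'Bob'], false);
const secondPass = reduceNames([['blogger', 'id3']], ['Cid'], false);
const combined = reduceNames(null, [firstPass, secondPass], true);

console.log(combined);
```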
March 31st, 2009 at 10:18 am
I was impressed reading this work:
“How FriendFeed uses MySQL to store schema-less data”
http://bret.appspot.com/entry/how-friendfeed-uses-mysql
title is self-explanatory
maybe it’s interesting in this discussion
March 31st, 2009 at 1:46 pm
@emaaaa I read it and I really liked the article, but I think that building a document based store on top of mysql is a very (too) conservative approach. If the right tool for a job exists, I’m happy to use it.
It worked for them though, so I can’t say anything.
March 31st, 2009 at 1:51 pm
@emaaaa that’s exactly the article we are using. I know Giovanni thinks it’s conservative, but for us it’s a perfect transition, i.e. we are implementing a system where the previous implementation used relational data. However I believe for the client facing logic the documents are the best…
April 10th, 2009 at 12:24 am
I wrote the original CRM system Giovanni is talking about. It’s based on a piece of RoR software that uses a relational model to handle tag-based Notes. I also implemented a tag algebra to handle tags, metatags and such.. so that I could write things such as: “ruby -city:NY on:tomorrow” to find all events related to ruby, not taking place in NY tomorrow.
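For readers wondering what such a query string breaks down into, here is a guess at a minimal tokenizer. It is purely hypothetical: the real tag algebra is not shown in this thread, and the parseTagQuery name and the negated/key/value shape are assumptions made for illustration.

```javascript
// Hypothetical tokenizer for queries like "ruby -city:NY on:tomorrow".
// A leading "-" marks an excluded term; "key:value" marks a metatag.
function parseTagQuery(query) {
  return query
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map(function (token) {
      const negated = token.charAt(0) === '-';
      const body = negated ? token.slice(1) : token;
      const colon = body.indexOf(':');
      return {
        negated: negated,
        key: colon === -1 ? null : body.slice(0, colon),
        value: colon === -1 ? body : body.slice(colon + 1)
      };
    });
}

const parsedQuery = parseTagQuery('ruby -city:NY on:tomorrow');
console.log(parsedQuery);
```

Each term then becomes a simple filter on a document’s tags, which is the part that was awkward to express as joins.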
Although I tried to build nice abstractions I had to break them continuously to handle ugly performance problems. Also the queries felt unnatural, awkwardly mapping to the domain I was trying to describe. My business logic also got badly cluttered with small variations of the same code. I could refactor that only to a point..
I hope that CouchDB can provide some solution. The code is certainly much much more understandable as there is a pretty cozy paradigm fit.
We’ll let you know how it goes..