Let’s say you start a new project and need to store some data (e.g., users, orders, settings, comments).
10+ years ago, there were few options. Likely, you would pick some SQL database technology such as Oracle, MySQL, PostgreSQL, SQLite, Microsoft SQL Server, IBM DB2… Conceptually they are similar, so it was not such a big deal.
Right now, there are way more choices. Not Only SQL movement represents quite a lot of very diverse technologies. So how to survive the paradox of choices?
Scenario: I do a side-project/open-source/startup
Most likely, this is not your most important problem. Just pick something that you feel the most comfortable with and most productive doing. It should be simple to figure out and use, you will find it in the language/framework tutorial.
“Premature optimization is the root of all evil” - Donald Knuth
Premature scaling is the source of most startup failures (Startup Genome Research)
Scenario: I develop a project for non-technical company
Background: They need something that will work for a long time and it is very unlikely that they will change it.
Big company and/or already talking about Oracle/MS SQL Server
Convince them to use MySQL/PostgreSQL instead. It is fairly similar. Commercial SQL database technologies tend to cost a lot and can be a perfect example of vendor lock-in. Split saved money between you and your client. Both you and your buyer deserve a bonus.
A small business
Likely managing a database is too much of a pain. Consider something as a service:
- Amazon DynamoDB
Some points from “startup” scenario also apply, but you would not be able to change it. PostgreSQL/MySQL should work for you.
Scenario: I have big data
Do you have more than 1 GB new data a day? If not, you are likely not in this scenario. It is very common for people to use big data solutions for problems that can be solved in a much simpler way. Do not be that guy/girl.
I just need to run batch jobs (e.g. analytics or non-interactive)
Just use some data warehousing solutions. Common choices:
- map-reduce style processing (e.g., Hadoop)
- column-oriented database (e.g. Amazon Redshift, HBase)
I need to access all data, all the day, I need a web scale database
First, read about CAP theorem. Either on Wikipedia or this example which describes CAP in plain English. Long story short, you have to pick between sacrificing either perfect consistency or perfect availability.
Thumb rule: What sounds better, if the database gets busy/or is in failure mode:
- You will wait longer, performance will suffer.
- You will get results, but sometimes they may be older; this may lead to data loss if developers are not careful enough.
Ad 1. You need consistency, but can relax on availability (CP systems). Good choice for most use cases.
Ad 2. You need availability, but can relax on consistency (AP systems). Good choice for analytics (e.g., collecting how many users visited the site, event data, etc.) where performance is king.
When in doubt, go with consistency. At worst you will just lose some performance/availability.
Second, things get super complicated at this scale, so as to remain sane you can likely use giant sorted key-value dictionaries:
Ad 1. examples: HBase, Amazon DynamoDB
Ad 2. examples: Apache Cassandra
<disclaimer> The world is much more complicated and messy than this over-simplified article. There are many details/specialized needs that make choices different. If serious money is at stake, consult with good software engineers. </disclaimer>
<warning> Databases are hard, do not reinvent them! </warning>