NoSQL
book-review
engineering
In most of my experience as a Data Science at a large Bank, I have retrieved data from relational databases. More recently, we have built a new platform, where I have been working with Cassandra for timeseries on a Kubernetes cluster. I happened to watch this 1-hour presentation on NoSQL databases, which furthered my motivation to understand the topic, so I read a textbook on NoSQL by Pramod Sadalage and Martin Fowler. Below are my notes.
- Polyglot Persistence is the idea of using different data stores in different circumstances. The term borrows from the term Polyglot Programming referring to multiple computer programming languages within an application.
- NoSQL is a movement, not a technology.
- Relational databases are not designed to run on clusters and thus scaling presents a challenge. Amazon (Dynamo paper) and Google (BigTable paper) were very influential in setting the direction for the resolution.
- Relational databases often work very well. Migrating away from this framework should be motivated by a specific objective (e.g. running on clusters).
- Integration databases are a single source of data for multiple applications. The alternative paradigm is an Application database, which has a one to one relationship between storage and application.
- Application databases are the paradigm of NoSQL. The application “knows” the database structure. A schemaless db shifts the schema into the application that accesses it. This type of db is more forgiving for evolving needs.
- Relational databases are good for analyzing data. NoSQL databases are not flexible for querying.
- There are 4 types of NoSQL dbs, the first 3 are called aggregate oriented data models.
- Key-Value: Redis, Riak, Dynamo; value is opaque
- Document: MongoDB, Couch; value has structure.
- Column-family: Cassandra, HBase
- Graph: NodeJs
- An aggregate is a collection of related objects that are treated as an object. They form the boundaries of an ACID operation. It is central to running on a cluster.
- Distribution Model. Demonstrates the trade-off between consistency and availability.
- Single-server
- Sharding: different parts of data onto different servers. Each server is a single source of a subset of data.
- Master/slave replication: Replicating data across multiple nodes with one node the authority (master). Helps read scalability.
- Peer-to-peer replication: Replicating data across all nodes, no authority. Helps write scalability. Linear scaling because no master (Cassandra is example).
- Consistency:
- Conflicts occur when clients try to write the same data at the same time (write-write) or one client reads inconsistent data during another’s write (read-write).
- Pessimistic approach lock data to prevent conflicts, optimistic detects conflicts and fixes.
- To get good consistency, many nodes should be involved but the reduces latency.
- CAP Theorem: when you create partitions, you trade-off consistency with availability.