NoSQL

In most of my experience as a Data Scientist at a large Bank, I have retrieved data from relational databases.  More recently, we have built a new platform, where I have been working with Cassandra for timeseries on a Kubernetes cluster.  I happened to watch this 1-hour presentation on NoSQL databases, which furthered my motivation to understand the topic, so I read a textbook on NoSQL by Pramod Sadalage and Martin Fowler.  Below are my notes: 
 
  • Polyglot Persistence is the idea of using different data stores in different circumstances.  The term borrows from the term Polyglot Programming referring to multiple computer programming languages within an application.
  • NoSQL is a movement, not a technology.
  • Relational databases are not designed to run on clusters and thus scaling presents a challenge.  Amazon (Dynamo paper) and Google (BigTable paper) were very influential in setting the direction for the resolution.
  • Relational databases often work very well.  Migrating away from this framework should be motivated by a specific objective (e.g. running on clusters).
  • Integration databases are a single source of data for multiple applications.  The alternative paradigm is an Application database, which has a one to one relationship between storage and application.
  • Application databases are the paradigm of NoSQL.  The application “knows” the database structure.  A schemaless db shifts the schema into the application that accesses it.  This type of db is more forgiving for evolving needs.
  • Relational databases are good for analyzing data.  NoSQL databases are not flexible for querying.
  • There are 4 types of NoSQL dbs, the first 3 are called aggregate oriented data models.
    • Key-Value: Redis, Riak, Dynamo; value is opaque
    • Document: MongoDB, Couch; value has structure.
    • Column-family: Cassandra, HBase
    • Graph: NodeJs
  • An aggregate is a collection of related objects that are treated as an object.  They form the boundaries of an ACID operation.  It is central to running on a cluster.
  • Distribution Model.  Demonstrates the trade-off between consistency and availability.
    • Single-server
    • Sharding: different parts of data onto different servers.  Each server is a single source of a subset of data.
    • Master/slave replication: Replicating data across multiple nodes with one node the authority (master).  Helps read scalability.
    • Peer-to-peer replication: Replicating data across all nodes, no authority.  Helps write scalability.  Linear scaling because no master (Cassandra is example).  
  • Consistency: 
    • Conflicts occur when clients try to write the same data at the same time (write-write) or one client reads inconsistent data during another’s write (read-write).
    • Pessimistic approach lock data to prevent conflicts, optimistic detects conflicts and fixes.
    • To get good consistency, many nodes should be involved but the reduces latency.  
  • CAP Theorem: when you create partitions, you trade-off consistency with availability.