Notes from the field: CockroachDB Cluster Setup

Download the CockroachDB Binary

Perform on each node.

wget https://binaries.cockroachdb.com/cockroach-latest.linux-amd64.tgz
tar xvzf cockroach-latest.linux-amd64.tgz
mv cockroach-latest.linux-amd64/cockroach /usr/bin/
chmod +x /usr/bin/cockroach
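
A quick sanity check that the binary is on the PATH and executable (the version string will depend on which release “latest” resolved to):

cockroach version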

Create cockroach user and directories

Perform on each node.

groupadd cockroach
useradd -r -m -d /home/cockroach -g cockroach cockroach
su - cockroach
cd /home/cockroach
mkdir -p certs my-safe-directory cockroach_db

Check ntp status

Check that NTP is running and configured correctly. CockroachDB relies on synchronised clocks to function correctly.

systemctl status ntpd
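
To go a step further and confirm the clock is actually synchronised (assuming the standard NTP tooling is installed on these hosts), the peer list should show a selected peer with a small offset:

ntpq -p
timedatectl status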

Secure the Cluster

Perform on each node.

Generate the CA and client certificates on the first host and copy them to the other hosts, but regenerate the node certificate (the create-node command below) on each host. For further details see Secure a Cluster.

cockroach cert create-ca --certs-dir=certs --ca-key=my-safe-directory/ca.key # These keys, in both dirs, need to be copied to each host
ls -l certs
cockroach cert create-node localhost $(hostname) --certs-dir=certs --ca-key=my-safe-directory/ca.key
ls -l certs
cockroach cert create-client root --certs-dir=certs --ca-key=my-safe-directory/ca.key --overwrite
ls -l certs
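
As a rough sketch of the copy step (run from the first host; the node2/node3 hostnames and the /home/cockroach paths are assumptions based on this setup), the CA certificate/key and root client certificate/key created above can be pushed out with scp, after which each host generates its own node certificate with create-node:

scp certs/ca.crt certs/client.root.crt certs/client.root.key cockroach@node2:/home/cockroach/certs/
scp my-safe-directory/ca.key cockroach@node2:/home/cockroach/my-safe-directory/
scp certs/ca.crt certs/client.root.crt certs/client.root.key cockroach@node3:/home/cockroach/certs/
scp my-safe-directory/ca.key cockroach@node3:/home/cockroach/my-safe-directory/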

Start the nodes

node1

su - cockroach
cockroach start --background --host=node1 --http-host=node1 --port=26257 --http-port=8080 --store=/home/cockroach/cockroach_db --certs-dir=/home/cockroach/certs;

node2

su - cockroach
cockroach start --background --host=node2 --http-host=node2 --port=26257 --http-port=8080 --store=/home/cockroach/cockroach_db --join=node1:26257 --certs-dir=/home/cockroach/certs

node3

su - cockroach
cockroach start --background --host=node3 --http-host=node3 --port=26257 --http-port=8080 --store=/home/cockroach/cockroach_db --join=node1:26257 --certs-dir=/home/cockroach/certs

Check the status of the cluster

sudo su - cockroach
cockroach node ls --certs-dir=certs --host node1
cockroach node status --certs-dir=certs --host node1

Create a cron job to start CockroachDB on boot

Create the file /etc/cron.d/cockroach_start with the below cron command for each node…

node1

@reboot cockroach       cockroach start --background --host=node1 --http-host=node1 --port=26257 --http-port=8080 --store=/home/cockroach/cockroach_db --join="node2:26257,node3:26257" --certs-dir=/home/cockroach/certs;

node2

@reboot cockroach       cockroach start --background --host=node2 --http-host=node2 --port=26257 --http-port=8080 --store=/home/cockroach/cockroach_db --join="node1:26257,node3:26257" --certs-dir=/home/cockroach/certs;

node3

@reboot cockroach       cockroach start --background --host=node3 --http-host=node3 --port=26257 --http-port=8080 --store=/home/cockroach/cockroach_db --join="node1:26257,node2:26257" --certs-dir=/home/cockroach/certs;

Reboot the nodes to ensure CockroachDB comes up and all nodes join the cluster successfully.
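
After the reboot, a quick way to confirm the process came back and all three nodes rejoined (same paths and hostnames as above):

sudo su - cockroach
ps -u cockroach -o pid,cmd
cockroach node status --certs-dir=certs --host node1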


MongoDB: Making the most of a 2 Data-Centre Architecture

There’s a big initiative at my employer to improve the uptime of the services we provide. The goal is 100% uptime as perceived by the customer. There’s obviously a certain level of flexibility one could take in interpreting this; I choose to be as strict as I can about it to avoid any disappointments! I’ve decided to work on this in the context of our primary MongoDB cluster. Here is a logical view of the current architecture, spread over two data centres:

[Figure: MongoDB cluster architecture across two data centres]

What happens with this architecture?

If DC1 goes down, shard0 and shard2 both become read-only while shard1 remains read/write. Only a single config server remains (in DC2), so some metadata operations will be unavailable. If DC2 goes down, shard0 and shard2 remain read/write while shard1 becomes read-only. Two config servers are hosted in DC1, so cluster metadata operations remain available.
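
Taking shard0 as an example, and assuming a typical three-member replica set with two data-bearing members in DC1 and one in DC2 (the hostnames below are purely illustrative), the quorum problem is easy to see: lose DC1 and only one of three voting members remains, so no primary can be elected and the shard drops to read-only. A minimal sketch of such a replica set:

mongo --host shard0-dc1-a.example.com --eval '
  rs.initiate({
    _id: "shard0",
    members: [
      { _id: 0, host: "shard0-dc1-a.example.com:27017" },  // DC1
      { _id: 1, host: "shard0-dc1-b.example.com:27017" },  // DC1
      { _id: 2, host: "shard0-dc2-a.example.com:27017" }   // DC2
    ]
  })
'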

What are the options we can consider when working within the constraints of a two data-centre architecture?

  1. Do nothing and depend on the failed data-centre coming back online quickly. Obviously not an ideal solution; if either data-centre goes down we suffer some form of impairment to the service the customer sees.
  2. Nominate one site as the PRIMARY data-centre which will contain enough shard members, for each shard, to achieve quorum should the secondary site go down. Under this architecture we remain fully available assuming the PRIMARY data centre remains up. The entire cluster becomes read-only if the main data-centre goes down. To achieve our goal, of 100% uptime, we have to hope our PRIMARY Data centre never goes down. We could request that application developers make some changes to cope with short periods of write unavailability.
  3. A third option, which would actually work combined with both of the above approaches, would be to force a replica set to accept writes when only a single server is available. You need to be careful here and be sure the other nodes in the replica set aren’t receiving any writes. We can follow the procedure Reconfigure a Replica Set with Unavailable Members to make this happen (a minimal sketch follows this list). This would require manual intervention, and there would be some work to reconfigure the replica set when the offline nodes returned.
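
A minimal sketch of that forced reconfig, run against the surviving member (the hostname is a placeholder, and which member index survives depends on your outage; be certain the removed members really are offline):

mongo --host surviving-member.example.com --port 27017 --eval '
  var cfg = rs.conf();
  // keep only the member(s) still reachable; index 0 is just an example
  cfg.members = [cfg.members[0]];
  rs.reconfig(cfg, { force: true });
'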

Clearly none of these options are ideal. In the event of a data-centre outage we are likely to suffer some form of impairment to service. The only exception here is if the SECONDARY data-centre goes down in #2, so that strategy depends, to some extent, on luck. With fully documented, and tested, failover procedures we can minimise this downtime. The goal of 100% uptime seems pretty tricky to achieve without moving to a three data-centre architecture.

 


MongoDB and the occasionally naughty query

It’s no secret that databases like uniqueness and high cardinality. Low cardinality columns do not make good candidates for indexes. A recent issue I had with MongoDB proved that NoSQL is no different in this regard.

The MongoDB Query Planner is quite simple and works as follows (simplified)…

  • If there are no usable indexes; perform a collection scan.
  • If there is one usable index; use that index.
  • If there is more than one usable index; generate candidate plans, score plans and cache them, use the plan with the best score.
  • Periodically or due to certain actions (creating, dropping indexes); purge plans from cache and re-evaluate possible plans again.

This fairly simple method generally performs well, but I have observed occasional problems resulting from it.

Assuming the following document schema;

{
     "_id": XXXXXXXXX,
     "email1": "fakeemail@here.com",
     "email2": "fakeemail@here.com",
     "identifier": "XXXXXXXX"
}

Assume the following indexes are defined;

{ "email1": 1, "email2": 1 }
{ "identifier": 1 }

The value for “identifier” was originally specified to be unique but was changed after developer feedback as they said this will “not always” be the case.

Let’s assume we have the following query;

db.collection.find({ "email1": XXXXXX, "email2": XXXXXX, "identifier": XXXXXX });

For several months this query performed well, always selecting the “identifier” index. As time went on we started to notice that some slow instances of this query were creating higher load. Upon initial investigation we discovered that MongoDB was selecting the index on (email1, email2) when the index on “identifier” was clearly a better choice. Often there would be tens of thousands of documents for a combination of (email1, email2) and just a single document for the “identifier” value. Why was MongoDB making this bad choice?
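
The way to spot this is simply to ask the planner for its winning plan with explain. A quick sketch, using placeholder database/collection names matching the example schema above:

mongo mydb --eval '
  var exp = db.collection.find(
    { "email1": "fakeemail@here.com", "email2": "fakeemail@here.com", "identifier": "XXXXXXXX" }
  ).explain("queryPlanner");
  // winningPlan shows which index (if any) was chosen for this query shape
  printjson(exp.queryPlanner.winningPlan);
'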

After further investigation we discovered queries like this…

db.collection.find({ "email1": XXXXXX, "email2": XXXXXX, "identifier": "welcome-message" })

The key here was the value of “identifier”. There were a few hundred thousand documents with the value “welcome-message” in the system. When one of these queries, carrying the “welcome-message” value, happened to arrive just as MongoDB decided to re-evaluate plans for this query shape, MongoDB sometimes decided that the “identifier” index was not the best choice. While that was certainly true for that instance of the query, it wasn’t for the vast majority of them. So 99% of these queries, which used to scan only one index key / document, now scanned hundreds of thousands!
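
As an immediate mitigation (a standard MongoDB facility rather than something specific to this incident), you can throw away the bad cached plan and force the planner to re-evaluate by clearing the collection’s plan cache:

mongo mydb --eval 'db.collection.getPlanCache().clear()'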

So how did we fix it? We thought about changing our indexes to…

{ "email1": 1, "email2": 1, "identifier": 1 }

But that would have taken us too long to deploy across our multiple shards and been quite disruptive. We could also have implemented index filters but decided against it as they don’t persist through restarts. As a short-term fix we deleted the documents with the “welcome-message” identifier and the developers agreed to append a random code to the end of the identifier, e.g. “welcome-message-s6geR4tgds36”. This quickly resolved the problem and we’ve had no repeat instances since. Databases, including NoSQL ones, love uniqueness!

In the future MongoDB may implement something like the column statistics many traditional databases employ to solve these problems. In the meantime we all need to be a little more aware of our indexes and the data being pushed into our systems.


A few Splunk queries for MongoDB logs

Here are a few Splunk queries I’ve used to supply data for a dashboard that helps me manage a MongoDB cluster.

Election events

If any MongoDB elections happen at 3AM on a Wednesday night I want to know about it. This query, added to a single-value panel, allows me to do this easily…

host=mongo* source=/var/log/mongo*.log "Starting an election" | stats count

Rollbacks

I also want to know about any rollbacks that happen during an election…

host=mongo* source=/var/log/mongo*.log "beginning rollback" | stats count

Log message with severity ERROR

Count log messages with ERROR severity

host=mongo* source="/var/log/mongodb/*.log" | rex "(?<timestamp>^\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d.\d\d\d\+\d\d\d\d) (?<severity>.) (?<component>\S*) " | where severity="E" | stats count

Chunk moves initiated

Have any chunks moved

host=mongo* source="/var/log/mongodb/*.log" "moving chunk" | stats count

State changes

How many state changes, e.g. PRIMARY -> SECONDARY, in the period…

host=mongo* "is now in state" | stats count

Getting started with CockroachDB

I’ve been quite interested in CockroachDB as it claims to be “almost impossible to take down”.

Here’s a quick example of setting up a CockroachDB cluster. This was done on a Mac but should work with no, or minimal, modifications on *nix.

First, download the binary and set the PATH…

wget https://binaries.cockroachdb.com/cockroach-latest.darwin-10.9-amd64.tgz
tar xvzf cockroach-latest.darwin-10.9-amd64.tgz
PATH="$PATH:/Users/rhys1/cockroach-latest.darwin-10.9-amd64";
export PATH;

Setup the cluster directories…

mkdir -p cockroach_cluster_tmp/node1;
mkdir -p cockroach_cluster_tmp/node2;
mkdir -p cockroach_cluster_tmp/node3;
mkdir -p cockroach_cluster_tmp/node4;
mkdir -p cockroach_cluster_tmp/node5;
cd cockroach_cluster_tmp

Fire up 5 CockroachDB nodes…

cockroach start --background --cache=50M --store=./node1;
cockroach start --background --cache=50M --store=./node2 --port=26258 --http-port=8081 --join=localhost:26257;
cockroach start --background --cache=50M --store=./node3 --port=26259 --http-port=8082 --join=localhost:26257;
cockroach start --background --cache=50M --store=./node4 --port=26260 --http-port=8083 --join=localhost:26257;
cockroach start --background --cache=50M --store=./node5 --port=26261 --http-port=8084 --join=localhost:26257;

You should now be able to access the cluster web console at http://localhost:8080 (or via any node’s --http-port, e.g. 8084 for node 5).

Command-line access is achieved with…

cockroach sql;

Those familiar with SQL will be comfortable…

root@:26257/> CREATE DATABASE rhys;
root@:26257/> SHOW DATABASES;
root@:26257/> CREATE TABLE rhys.test (id SERIAL PRIMARY KEY, text VARCHAR(100) NOT NULL);
root@:26257/> INSERT INTO rhys.test(text) VALUES ('Hello World');
root@:26257/> SELECT * FROM rhys.test;

Any data you insert should be replicated to all nodes. You can check this with…

cockroach sql --port 26257 --execute "SELECT COUNT(*) FROM rhys.test";
cockroach sql --port 26258 --execute "SELECT COUNT(*) FROM rhys.test";
cockroach sql --port 26259 --execute "SELECT COUNT(*) FROM rhys.test";
cockroach sql --port 26260 --execute "SELECT COUNT(*) FROM rhys.test";
cockroach sql --port 26261 --execute "SELECT COUNT(*) FROM rhys.test";

We can also insert into any of the nodes…

cockroach sql --port 26257 --execute "INSERT INTO rhys.test (text) VALUES ('Node 1')";
cockroach sql --port 26258 --execute "INSERT INTO rhys.test (text) VALUES ('Node 2')";
cockroach sql --port 26259 --execute "INSERT INTO rhys.test (text) VALUES ('Node 3')";
cockroach sql --port 26260 --execute "INSERT INTO rhys.test (text) VALUES ('Node 4')";
cockroach sql --port 26261 --execute "INSERT INTO rhys.test (text) VALUES ('Node 5')";

Check the counts again…

cockroach sql --port 26257 --execute "SELECT COUNT(*) FROM rhys.test";
cockroach sql --port 26258 --execute "SELECT COUNT(*) FROM rhys.test";
cockroach sql --port 26259 --execute "SELECT COUNT(*) FROM rhys.test";
cockroach sql --port 26260 --execute "SELECT COUNT(*) FROM rhys.test";
cockroach sql --port 26261 --execute "SELECT COUNT(*) FROM rhys.test";

Check how the data looks on each node…

cockroach sql --port 26261 --execute "SELECT * FROM rhys.test";
+--------------------+-------------+
|         id         |    text     |
+--------------------+-------------+
| 226950927534555137 | Hello World |
| 226951064182259713 | Hello World |
| 226951080098856961 | Hello World |
| 226952456016003073 | Node 1      |
| 226952456149368834 | Node 2      |
| 226952456292663299 | Node 3      |
| 226952456455684100 | Node 4      |
| 226952456591376389 | Node 5      |
+--------------------+-------------+
(8 rows)
cockroach sql --port 26260 --execute "SELECT * FROM rhys.test";
+--------------------+-------------+
|         id         |    text     |
+--------------------+-------------+
| 226950927534555137 | Hello World |
| 226951064182259713 | Hello World |
| 226951080098856961 | Hello World |
| 226952456016003073 | Node 1      |
| 226952456149368834 | Node 2      |
| 226952456292663299 | Node 3      |
| 226952456455684100 | Node 4      |
| 226952456591376389 | Node 5      |
+--------------------+-------------+
(8 rows)

To clean up…

# clean up (gets rid of all processes and data!)
cockroach quit --port=26257
cockroach quit --port=26258
cockroach quit --port=26259
cockroach quit --port=26260
cockroach quit --port=26261
cd;
rm -Rf cockroach_cluster_tmp;

I’ll probably continue playing with CockroachDB. As usual, resources will be available on my GitHub.