This guest post was written by Reto. He works at Bluevalor, Nelmio’s very first client. Reto is an ace at analyzing economic data with MATLAB and Python, but he cannot help himself and sometimes enjoys working on our infrastructure as well.
When we started our project with Nelmio, Pierre proposed using CouchDB as a container for the high-dimensional data items we receive from third parties. We have been happy with CouchDB so far. One of its nice features is its reliance on sequence IDs, which makes synchronisation between different CouchDB instances very easy. It is even possible to use these sequence IDs to set up synchronisation between, say, an SQL database and CouchDB, since there is a nice API to query the CouchDB server for changes.
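As a rough illustration of that changes API, here is a minimal sketch of how a sync job might build a `_changes` request and read the response. The helper names (`changes_url`, `summarize_changes`), the host and the sample payload are made up for this example; only the `_changes` endpoint, its `since` parameter and the `results`/`last_seq` fields come from CouchDB itself.

```python
import json


def changes_url(server, db, since=0):
    """Build the URL of the _changes feed, starting after sequence `since`."""
    return "%s/%s/_changes?since=%s" % (server.rstrip("/"), db, since)


def summarize_changes(payload):
    """Split a parsed _changes response into updated ids, deleted ids, last_seq."""
    updated, deleted = [], []
    for row in payload["results"]:
        if row.get("deleted"):
            deleted.append(row["id"])
        else:
            updated.append(row["id"])
    return updated, deleted, payload["last_seq"]


# Hand-written sample in the shape of a _changes response (not real server output).
SAMPLE = json.loads("""
{"results": [
    {"seq": 2, "id": "doc-a", "changes": [{"rev": "2-aaa"}]},
    {"seq": 3, "id": "doc-b", "changes": [{"rev": "2-bbb"}], "deleted": true}
 ],
 "last_seq": 3}
""")

print(changes_url("http://localhost:5984/", "db1", since=1))
print(summarize_changes(SAMPLE))
```

A sync job along these lines would store the returned `last_seq` and pass it as `since` on the next run, so that only newer changes are fetched.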
A very convenient way to set up a backup of the data is to configure a second CouchDB on another machine and replicate the data onto that machine. There is a feature called “continuous replication”, which seems to imply that you only have to set up the replication once. However, there is quite a big drawback as of CouchDB 1.2: if the server is restarted, the replications are not re-initiated. Even worse, replications sometimes just break down for no apparent reason.
Update: Setting up the replication via the _replicator database fixes the restart issue.
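For reference, a replication stored in the `_replicator` database is just a JSON document. The sketch below builds one; the helper name `replication_doc`, the document id and the URLs are placeholders invented for this example, and the commented-out save call assumes the couchdb-python library and a reachable server, so it has not been run here.

```python
import json


def replication_doc(source_url, target_db, doc_id):
    """Build a document describing a continuous replication for _replicator."""
    return {
        "_id": doc_id,          # any unique name for this replication
        "source": source_url,   # full URL of the database to pull from
        "target": target_db,    # local database name on the backup server
        "continuous": True,     # keep replicating as new changes arrive
    }


doc = replication_doc("http://admin:pwd@host:5984/db1", "db1", "backup_db1")
print(json.dumps(doc, indent=2, sort_keys=True))

# Assumed usage with couchdb-python (untested sketch):
#   import couchdb
#   server = couchdb.client.Server("http://admin:pwd@backuphost:5984")
#   server["_replicator"].save(doc)
```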
In short: CouchDB’s “continuous replication” is not reliable enough as a backup system.
I’ve written a small Python script that you can run as a cronjob to check whether a replication exists for each database in a list of CouchDB databases. As a little bonus, I added email notification in case something is wrong, so you can sleep well knowing your CouchDB backup is still working. With this script, it should be viable to back up your CouchDB databases via replication. I’ve attached the code after my little fanboy praise of Python.
I studied finance and basically taught myself programming for scientific purposes. I’m trying really hard to write good code, but sometimes I lack experience because I do not have a true programming background. If you’d like to point out things I could do better in terms of form, structure or function, please comment!
I’ve worked extensively with MATLAB so far. However, I recently stumbled upon Python as a language for scientific computing and I’m absolutely loving it, so I would like to take the opportunity to praise Python a little:
There are various reasons to use Python for scientific computing:
- high level language (good productivity, easy to learn for people like me)
- general purpose and object oriented (can interface with everything, bigger projects possible)
- beautiful, easy to read syntax
- ability to interface with low level languages if speed is first priority
- very rich libraries that support scientific computing needs
- open source (the MATLAB commercial license is 10-20k CHF depending on toolboxes)
With the open source packages numpy, scipy, ipython and pandas, Python pretty much trumps every other scientific toolbox (R, MATLAB, Mathematica) while remaining super easy to use.
Pandas in particular (an open source library that was developed at a hedge fund, true story!) makes handling time series data tenfold easier. I truly believe that if you need to do research with time series data, Python with pandas is the future.
So if you ever run into a problem where you need to do a lot of data cleaning and wrangling, look at pandas. There is a very good book called Python for Data Analysis written by Wes McKinney, the main developer of pandas (albeit only released as an “early release”).
Now to the code: Note that you will need the couchdb library to make this code work, so either install couchdb-python in your Python folder, or simply put it into the folder of the script. Note that I’ve only tested it with Python 2.7. You need to configure the “CONFIG” part of the script, and you should be all set.
import couchdb
import datetime
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText

#-----CONFIG------------------------------------------------------
#source & target addresses
SOURCE = 'http://admin:pwd@host:5984'
TARGET = 'http://admin:pwd@host:5984'
#list of dbs (must have equal length)
SOURCE_DBS = ['db1', 'db2']
TARGET_DBS = SOURCE_DBS
#email credentials
GMAIL_USER = "gmail_user"
GMAIL_PWD = "gmail_pwd"
TO = "your email"
# set to False if no email desired
SEND_MAIL = True
#-----------------------------------------------------------------


class CheckReplicator(object):
    """
    Checks if a replication for a list of dbs on the target exists between two
    CouchDB instances.
    Input a connection string for the source and the target of the
    replication and provide two lists with the names of the databases you want
    to have replicated. (source_dbs[0] will be replicated to target_dbs[0] etc)
    Note that this class prints to console, so if you want to log progress,
    print output to file in cronjob.
    Note that all the dbs need to be created. It is smart to initiate the
    first continuous sync via futon or http api!
    """

    def __init__(self, source, target, source_dbs, target_dbs):
        db_equality = len(source_dbs) == len(target_dbs)
        assert db_equality, "source length must equal target length"
        self.source = couchdb.client.Server(source)
        self.target = couchdb.client.Server(target)
        self.source_string = source
        self.target_string = target
        self.desired_reps = zip(source_dbs, target_dbs)
        self._check_connections()
        self.active_reps = self._get_active_reps_on_target()

    def check(self):
        if self._check_if_all_desired_reps_exist():
            print str(datetime.datetime.now())[:-7] + " ok"
        else:
            self._fix_replications()

    def _check_if_all_desired_reps_exist(self):
        res = True
        for d in self.desired_reps:
            if d not in self.active_reps:
                res = False
        return res

    def _fix_replications(self):
        for d in self.desired_reps:
            if d not in self.active_reps:
                source_str = self._build_source_string(d[0])
                self.target.replicate(source_str, d[1], continuous=True)
        self.active_reps = self._get_active_reps_on_target()
        if not self._check_if_all_desired_reps_exist():
            raise EmailError("""
                could not replicate all targets. Please
                check if the couch instances are running
                and all the dbs are created!
                """, SEND_MAIL)
        else:
            print str(datetime.datetime.now())[:-7] + " replicators created"

    def _build_source_string(self, db):
        string = self.source_string
        if string[-1] == '/':
            string = string + db
        else:
            string = string + '/' + db
        return string

    def _get_active_reps_on_target(self):
        tasks = self.target.tasks()
        #parse source and target of the task string
        #from the replication information
        active_reps = list()
        replications = [t['task'] for t in tasks if t['type'] == 'Replication']
        for r in replications:
            first_split = r.split('/ -> ')
            target = first_split[-1]
            second_split = first_split[-2].split('/')
            source = second_split[-1]
            active_reps.append((source, target))
        return active_reps

    def _check_connections(self):
        try:
            self.source.version()
        except:
            raise EmailError('could not connect to source', SEND_MAIL)
        try:
            self.target.version()
        except:
            raise EmailError('could not connect to target', SEND_MAIL)


class EmailError(Exception):

    def __init__(self, value, send_mail=False):
        self.value = value
        if send_mail:
            self._mail('Watchman Error', value)

    def __str__(self):
        return repr(self.value)

    def _mail(self, subject, text):
        msg = MIMEMultipart()
        msg['From'] = GMAIL_USER
        msg['To'] = TO
        msg['Subject'] = subject
        msg.attach(MIMEText(text))
        mailServer = smtplib.SMTP("smtp.gmail.com", 587)
        mailServer.ehlo()
        mailServer.starttls()
        mailServer.ehlo()
        mailServer.login(GMAIL_USER, GMAIL_PWD)
        mailServer.sendmail(GMAIL_USER, TO, msg.as_string())
        # Should be mailServer.quit(), but that crashes...
        mailServer.close()


#run it!
if __name__ == '__main__':
    try:
        check_replicator = CheckReplicator(SOURCE, TARGET, SOURCE_DBS, TARGET_DBS)
        check_replicator.check()
    except:
        raise EmailError('program code failed', SEND_MAIL)
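The string parsing in `_get_active_reps_on_target` and the comparison against the desired replications can be tried out without a running server. The sketch below repeats that logic on a hand-written task string; the sample string is an assumption about what the `task` field looks like and may differ between CouchDB versions, and the function names are made up for this example.

```python
def parse_replication_task(task):
    """Extract (source_db, target_db) from a replication task string."""
    first_split = task.split('/ -> ')
    target = first_split[-1]
    source = first_split[-2].split('/')[-1]
    return (source, target)


def missing_replications(desired, active):
    """Return the desired (source, target) pairs that are not active."""
    return [pair for pair in desired if pair not in active]


# Assumed sample of a task string; not copied from a real server.
SAMPLE_TASK = "a1b2c3: http://admin:pwd@host:5984/db1/ -> db1"

active = [parse_replication_task(SAMPLE_TASK)]
desired = list(zip(['db1', 'db2'], ['db1', 'db2']))

print(active)
print(missing_replications(desired, active))
```

Note that a task string without the `'/ -> '` separator would make `first_split[-2]` raise an `IndexError`, so it is worth checking the output of `_active_tasks` on your own CouchDB version before relying on this parsing.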