Sometime in the next couple of weeks I will be performing a major software upgrade on the server that hosts this blog, as well as a number of services I host on behalf of my clients.

What this means to you

Hopefully nothing.

All being well, there should be no significant downtime and the services hosted by this server will continue uninterrupted.

If you are a client of mine, I will be contacting you directly in the next few days with more details about when the upgrade will be performed and how it might affect you.

I apologise in advance for any possible inconvenience this may cause.

NoSQL is the name given to a collection of newer database storage systems, which, among other things, don’t require a database schema to be defined ahead of time. They have become increasingly popular in recent years, and a large part of the reason is that they offer a number of significant scalability advantages over traditional relational database systems, especially when deployed in modern distributed web architectures.

When Elgg was coded, all those years ago, the standard web application environment was LAMP, where the M of course meant MySQL. This was fine for the time, but things have moved on, and I have been getting an increasing number of queries from people asking me how they might go about migrating Elgg over to NoSQL, so I thought it’d be worth writing up some of my thoughts on the subject.

I caveat all of this heavily by saying that, whatever you do, migrating Elgg over to NoSQL is going to be a big job, and additionally I’ve not actually tried to do it (and I’m not likely to, unless someone persuades me). However, the following should give you a place to start…

The Object Model

The good news is that Elgg’s object model, together with it’s key -> value metadata system, is actually pretty well suited to NoSQL. Additionally, the fact that every entity in Elgg has a globally unique identifier, which can canonically identify an object, means that you should run into fewer issues when you come to scale.

Obtaining this guid (and in fact any identifier – metadata ids, annotation ids etc etc) presents you with your first major issue.

Currently, Elgg uses the MySQL’s auto_increment value in the table. This was simple, and writing a table and receiving the ID is an atomic operation, meaning you don’t have to lock the table or do any other fancy stuff to ensure that the ID you receive is the correct ID for the record you’ve just written. It does however introduce a limit in how much you can scale out, since you always must have one canonical write database in order to get IDs that are unique globally throughout the system.

Were I to write Elgg today, I would not have done it this way.

A starting point to addressing this issue would be to look at using something like Twitter Snowflake. Snowflake is a server process which returns algorithmically generated identifiers which are unique, and incremental over time. How important this is in practice is up for debate, since most native operations base sort on a separate time_created field.

One assumption that is made quite widely throughout Elgg (and also a fare few plugins), however, is that GUIDs are integer values. There’s no getting around that this going to cause a fair amount of pain.

Objects and Functions

Once the data model has been migrated over to NoSQL, you’re going to have to modify the Elgg core database retrieval functions.

For the really low level get_entity() method, and similar functions which return individual records, this should be fairly straightforward. For the more involved get_entities*, you’re going to have to get a little bit more creative, especially since 1.8, Elgg allows you to specify custom JOIN and WHERE clauses, so these are going to have to be remapped.

It is possible that there are some libraries or DB front end layers available to simplify this process significantly, but I’m not currently aware of any.

Plugins

Migration of plugins is going to either be really easy, or really hard, depending on how they’ve been written. If they are using core Elgg function and are not making too many assumptions, you should in theory be able to virtually drop them in and hit go (after maybe changing any occurrence of the plugin casting GUIDs to an integer, if you’re using Snowflake).

Plugins which make their own DB queries (there shouldn’t be any, but those that are around are fewer in number) will obviously cause you a bit of a headache.

Anyway, those are my first thoughts on the matter. I’d be interested to hear from anybody who’s tried this!

If you run a web server and you take a look at the logs, you will likely have seen something like this appearing:

xxx.xxx.xxx.xxx - - [01/May/2013:18:32:36 +0100] "GET /w00tw00t.at.ISC.SANS.DFind:)
HTTP/1.1" 400 320 "-" "-"

This is the product of a tool called w00tw00t, which is used by nefarious script kiddies to probe and attempt to compromise your server. If your server is up-to-date, there is probably not too much to worry about (without wanting to jinx it), however since a strength in depth approach is always the best plan when it comes to security, it is probably a good idea if we could deploy some additional countermeasures.

A common tactic is to use a tool like fail2ban to monitor your logs and then firewall off the offenders IP address, and there are filters out there to do this.

However, like a lot of people, I use the caching proxy Squid, in reverse proxy mode, to help handle high load on a web server. Since, in this configuration, Apache (and therefore fail2ban’s standard w00tw00t rules) will see these requests as coming from the cache machine, we need to take another approach.

One option is to modify the Apache log format to use the X-Forwarded-For variable instead (details of how can be found here), and thus preserving the original IP address in the logs. However, this would require me to modify a number of vhosts, and it seemed simpler to monitor the one squid access log.

So, I wrote a quick fail2ban filter to catch w00tw00t scans and block the offending IP address.

In jail.local

[squid-w00tw00t]
enabled = true
filter = squid-w00tw00t
port = all
logpath = /var/log/squid/access.log
maxretry = 1

In filter.d/squid-w00tw00t.conf

# Fail2Ban configuration file to catch w00tw00t scans on squid reverse proxy settings
#
# Author: Marcus Povey
#

[INCLUDES]

# Read common prefixes. If any customizations available -- read them from
# common.local
before = common.conf

[Definition]

_daemon = squid

# Option: failregex
# Notes.: regex to match the password failures messages in the logfile. The
# host must be matched by a group named "host". The tag "" can
# be used for standard IP/hostname matching and is only an alias for
# (?:::f{4,6}:)?(?P<host>[\w\-.^_]+)
# Values: TEXT
#
failregex = <HOST> TCP_.*http.*/(w00tw00t|wootwoot|WootWoot|WooTWooT).*$

# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
ignoreregex =

» Visit the project on Github…