from json to sqlite and custom sql function

Added 19 Jul 2015, 4:42 p.m. edited 18 Jun 2023, 1:12 a.m.
Elite Dangerous has a fairly decent sized data set, and players have done (great work!) a number of different things to scrape data from the application, mainly to help with the woefully poor trading information the application provides you with... One of the better databases has a web presence at eddb.io, where the data is available in json format. This has two drawbacks: first, ascii isn't the most compact of media, and second, as is often the case with json, data that should be in different tables is munged together into individual responses.

Take for example the stations json: this is an array of all the json responses you'd get from querying each station in turn (assuming there was a live json interface!). This is great, except that with each station you also get a list of its commodity prices - data that really should be in its own response. Ideally you'd query for station details, then query for an array of commodities at that station, rather than receive all this data bundled together - especially as currently it's only available as a data snapshot...

I looked at quite a number of libraries in a number of different languages, and I have to say I was less than impressed with the quality and lack of robustness of a number of the solutions (to say the least). By far the easiest to use and by far the most robust was Python, and I guess I shouldn't be surprised that for chomping through data and rapid development Python really is a great tool. The script I ended up creating takes 140mb of json, creates 3 tables and a couple of support tables, and bundles them into just over 35mb of sqlite database (in a little over 10 seconds on my machine).

I chose sqlite as it's compact, and from sqlite it's easy to spew output either as data in a new format or for presentation. While technically a binary format, it's very open, very accessible and fairly data dense too. You might tell I'm a bit of a fan - it's kind of a data swiss army knife! Creating a python object from a json file is really (very!) simple:
commodities = json.load(open("commodities.json"))
assuming that the json is an array of responses (a fairly typical case), you end up with an iterable list of dict objects, each dict being in effect the fields of one response or record. While iterating each item it's fairly easy to pick out foreign table data to store in another table later, while updating the main table from the json file.
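Judging by the fields the import code below picks out, each commodity dict looks roughly like this (a sketch - the values here are invented, and the real feed carries more fields than shown):

# illustrative shape only - field names taken from the import code below
{
    "id": 1,
    "name": "Explosives",
    "average_price": 250,
    "category_id": 1,
    "category": {"id": 1, "name": "Chemicals"}
}

Looking at the whole code for creating the commodities table...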
import json
import sqlite3

conn = sqlite3.connect("eddb.sqlite")   # output database - the file name is arbitrary
eddb = conn.cursor()

commodities = json.load(open("commodities.json"))
categories = {}

eddb.execute('create table commodities (id,name,ave_price,catid)')

for c in commodities:
    # main table row, straight from the response fields
    eddb.execute('insert into commodities values (?,?,?,?)', [c.get("id"),c.get("name"),c.get("average_price"),c.get("category_id")])
    # stash the nested category data for the support table
    cat = c.get("category")
    categories[cat.get("id")] = cat.get("name")

conn.commit()

eddb.execute('create table commod_cats (id,name)')
for c in categories:
    eddb.execute('insert into commod_cats values (?,?)',[c,categories[c]])

conn.commit()
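With both tables in place, the data is immediately queryable - a quick sanity check along these lines (not part of the import script itself) joins each commodity back to its category name:

# hypothetical sanity check - list a few commodities with their categories
for row in eddb.execute(
        "select c.name, k.name, c.ave_price "
        "from commodities c join commod_cats k on c.catid = k.id "
        "limit 5"):
    print(row)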
Looking at the first loop, it's clear that I'm collecting commodity category data as the commodities table is being populated. While overwriting the same category id potentially every loop is far from ideal, the impact seems minimal - indeed, checking whether the id already exists and doing nothing could potentially take the same time or more! The second loop is then just a case of throwing a small amount of data into the support table...

As you can see from the code fragment, the syntax of Python is very clear, aided no end by one of its best assets - formatting defines code blocks... this forces even sloppy coders to be neat! (probably why some people hate it!) Python is also scary fast, much faster than you'd expect it to be, but then it's much more than just a "simple" script interpreter - if you're interested this is a great post. Of course the one disadvantage of making a custom script is that it will immediately break given any change of data structure, but then it would have to be a very powerful importer that could intelligently analyse changing data structures and produce something suitable for a typical relational database...

One of the immediate questions I had for the newly parsed data involved distances in three dimensions. While some SQL guru could probably work out a pure SQL solution (actually probably not, as there is little other than basic maths in sqlite), I now really do have an excuse to make a custom function (YAY!). While I initially made a prototype in C, I could have as easily started in Java, as we'll see later... The most complex part of a custom function is the actual function itself - actually integrating and using it is really trivial:
#include <math.h>
#include <sqlite3.h>

static void distance3d(sqlite3_context *context, int argc, sqlite3_value **argv) {
    if (argc == 6) {
        double x1 = sqlite3_value_double(argv[0]);
        /* ...four similar lines here, for x2, y1, y2 and z1... */
        double z2 = sqlite3_value_double(argv[5]);

        /* straight-line distance between the two points */
        double result = sqrt( pow(fabs(x1-x2),2) + pow(fabs(y1-y2),2) + pow(fabs(z1-z2),2) );

        sqlite3_result_double(context, result);
        return;
    }
    /* wrong number of arguments - return NULL */
    sqlite3_result_null(context);
}
I've removed some of the lines to reduce the space this fragment takes up; this is the basic minimum you can get away with, and if you're dealing with strings you'll have to pay a lot more attention to validating inputs and ensuring you don't leak (I could see it being all too easy to cause problems!). The sqlite3_ functions are basically self explanatory and are very nicely documented on the sqlite website. To tell sqlite about the function:
sqlite3_create_function(db, "distance3d", 6, SQLITE_UTF8, NULL, &distance3d, NULL, NULL);
using it is then just like any other sqlite sql function...
select name,round(distance3d(75.75,48.75,70.75,x,y,z),2) as distance,x,y,z from systems where distance<7
The coordinates used here are for a little known system called Lave... the query will pull out all systems within 7 light years of Lave. I wonder if you can guess which system is at coordinates 0,0,0? ;)

Quite why I assumed a Java sqlite library wouldn't support custom functions, I don't know I'm sure... I used sqlite-jdbc; there are other solutions, including a pure java port, that I really must investigate at a later date... Java custom functions are more object oriented - you make a class that extends Function, and override / use its methods to achieve your evil designs... for example the value_double(int param) method will get a function parameter as a double value, while the result methods will do just that with return values!... Integrating the function is as simple as calling a static method of Function:
Function.create(connection, distance3d.class.getSimpleName(), new distance3d());
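For completeness, Python's built-in sqlite3 module offers the same facility through its create_function method - here's a quick sketch of the same function there (not something from the original script, and the database file name is assumed):

import math
import sqlite3

def distance3d(x1, y1, z1, x2, y2, z2):
    # same maths as the C version - straight line distance in 3D
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)

conn = sqlite3.connect("eddb.sqlite")               # assumed file name
conn.create_function("distance3d", 6, distance3d)

# all systems within 7 light years of Lave, as before
query = ("select name, round(distance3d(75.75,48.75,70.75,x,y,z),2) as distance "
         "from systems where distance < 7")
for name, distance in conn.execute(query):
    print(name, distance)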
So the lesson to take from this, I guess, is that it is very easy - almost trivial - to extend sqlite via custom functions, and because of its lightweight nature it's very quick. Although it's not a server architecture, if you're mainly reading it can be used even in volume web scenarios - of course there are plenty of people who would throw their hands in the air at the mention of a server-less data back-end, but then you have to closely analyse your requirements in every case anyhow...

All in all sqlite has a lot going for it, especially in its intended sphere on the desktop. One place where it especially shines is for data transfer to unknown destinations: you could make data available in csv or json format, but then equally why triple your file size and therefore your bandwidth requirements? While you can compress csv or json effectively (and smaller than sqlite), neither is a ready to go format - whereas you can download a sqlite file, open it in your favourite SQL environment and begin asking questions of the data - immediately.