When Google took over the search scene years ago, they did something that the likes of Dogpile, Metacrawler, AltaVista and Yahoo didn't. They didn't have a better-looking website that users preferred, better customer service, or even a better social media presence with cat memes. They had quicker lookup times and more relevant results.
Clearly the technology was the edge. But how did they do it? Well, they built it. They built a non-SQL, non-relational kind of database - an inverted index.
Inverted indexes work by storing data based on its content. Documents get tokenized and those tokens get indexed. When you want all documents matching, say, "car", the index already has that term mapped for quick lookup. This differs from relational databases, which have to scan through the rows to find matches (SQL databases do have indexes, but a typical column index doesn't compete with a proper inverted index for this kind of search).
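To make that concrete, here's a rough sketch of the idea in Java (illustrative only, nothing to do with any real engine's internals): each token maps to the set of document IDs containing it, so a lookup for "car" is a single map access rather than a scan.

```java
import java.util.*;

// Minimal inverted index sketch: token -> set of document IDs.
public class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize a document and record which doc ID each token appears in.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
        }
    }

    // Lookup is a single map access - no scan over the documents themselves.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "The red car drove past");
        index.addDocument(2, "A car and a bike");
        index.addDocument(3, "Nothing relevant here");
        System.out.println(index.search("car")); // doc IDs 1 and 2
    }
}
```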
Several years later an open source product called Lucene was released and it was brilliant. You could look up terms in sub-second time, and honestly, why would you use SQL for this any more? I used Lucene a lot (and still do) and it was great for most situations, but when the number of documents got past a few million and real-time indexing was constant, it would slow down.
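For anyone who hasn't used it, this is roughly what indexing and searching with Lucene looks like (a sketch against the classic API; exact class names such as ByteBuffersDirectory vary between Lucene versions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory; use FSDirectory for disk
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "the red car drove past", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search it back.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("car");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("hits: " + hits.totalHits);
        }
    }
}
```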
So I thought about making my own. You could split/shard Lucene, but I figured optimising on a single machine would be very beneficial and cost efficient. I dug through the Lucene code and found the relevancy scorer - basically a way to rank how good a match a search phrase is for each particular document. I then set off on a journey deep into file systems, IO buffering, compression, serialization, code testing and so on.
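Lucene's classic scorer is TF-IDF based, and the core idea fits in a few lines. This is a stripped-down illustration of that kind of relevancy score (my own sketch, not Lucene's actual scorer or BFR's): frequent occurrences in a document push the score up, while terms that appear everywhere are worth less.

```java
import java.util.*;

// Stripped-down TF-IDF style relevancy scorer.
public class TfIdfSketch {

    // tf: how often the term occurs in the document, dampened with a square root.
    static double termFrequency(String term, List<String> docTokens) {
        long count = docTokens.stream().filter(term::equals).count();
        return Math.sqrt(count);
    }

    // idf: terms that appear in few documents are worth more.
    static double inverseDocFrequency(String term, List<List<String>> allDocs) {
        long docsWithTerm = allDocs.stream().filter(d -> d.contains(term)).count();
        return 1.0 + Math.log((double) allDocs.size() / (docsWithTerm + 1));
    }

    // Score a document against a multi-term query by summing tf * idf per term.
    static double score(List<String> query, List<String> doc, List<List<String>> allDocs) {
        double total = 0.0;
        for (String term : query) {
            total += termFrequency(term, doc) * inverseDocFrequency(term, allDocs);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("red", "car", "fast", "car"),
            List.of("slow", "bike"),
            List.of("car", "review"));
        List<String> query = List.of("fast", "car");
        for (List<String> doc : docs) {
            System.out.println(doc + " -> " + score(query, doc, docs));
        }
    }
}
```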
Java has the APIs for building just about anything, and I spent a long time researching and testing various IO classes for the best performance. Overall I wanted something that could handle many reads/writes a second. I looked into disk performance (solid state was king) and file system types (ReiserFS was king, although it wasn't well supported). For those who aren't aware, a file system is itself a database, and modern databases are abstractions built on top of filesystems. I eventually had a fair idea of the design of the index and started to code.
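The kind of testing involved is easy to reproduce. Here's a rough micro-benchmark along the lines of what I mean, comparing an unbuffered stream, a buffered stream and an NIO FileChannel (illustrative only - a serious measurement needs warm-up runs, repeats and fsync control):

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Rough comparison of three common Java write paths.
public class IoComparison {
    static final int RECORDS = 100_000;
    static final byte[] RECORD = "token:docId:position\n".getBytes();

    interface Writer { void run(Path path) throws IOException; }

    public static void main(String[] args) throws IOException {
        time("unbuffered stream", IoComparison::unbufferedStream);
        time("buffered stream", IoComparison::bufferedStream);
        time("file channel", IoComparison::fileChannel);
    }

    static void time(String label, Writer writer) throws IOException {
        Path path = Files.createTempFile("io-test", ".dat");
        long start = System.nanoTime();
        writer.run(path);
        System.out.printf("%s: %d ms%n", label, (System.nanoTime() - start) / 1_000_000);
        Files.delete(path);
    }

    static void unbufferedStream(Path path) throws IOException {
        try (OutputStream out = new FileOutputStream(path.toFile())) {
            for (int i = 0; i < RECORDS; i++) out.write(RECORD);
        }
    }

    static void bufferedStream(Path path) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(path.toFile()), 1 << 16)) {
            for (int i = 0; i < RECORDS; i++) out.write(RECORD);
        }
    }

    static void fileChannel(Path path) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 16);
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
            for (int i = 0; i < RECORDS; i++) {
                if (buf.remaining() < RECORD.length) { buf.flip(); ch.write(buf); buf.clear(); }
                buf.put(RECORD);
            }
            buf.flip();
            ch.write(buf);
        }
    }
}
```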
Theoretical maximum IO throughput was one of the main considerations. Imagine a document with 1,000 tokens and 1,000 of these documents being written every second - that's a million token writes a second. Even an SSD would struggle. Add to that, writes need to stay in sync with reads, so buffers had to be used efficiently. In fact, the sync between the read and write buffers was a light bulb moment: maybe the write and read threads could communicate and share? So that's what I did. Data didn't even need to be flushed to disk; the read thread already knew about it and would search it as soon as it arrived. These buffers could be variably sized based on available hardware resources, so performance could be leveraged. CI/CD TDD indeed. And very "agile".
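I'm not going to reproduce BFR's internals here, but the shape of the idea is simple enough to sketch (again, an illustration, not the real code): the indexing thread writes postings into an in-memory segment that the search thread reads directly, so a document is searchable the moment it arrives, long before anything is flushed to disk.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a shared read/write buffer: the writer appends postings into an
// in-memory segment that the reader searches immediately, no flush required.
public class SharedSegment {
    // token -> doc IDs, safe for one writer and many readers.
    private final ConcurrentMap<String, Set<Integer>> inMemory = new ConcurrentHashMap<>();

    public void index(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            inMemory.computeIfAbsent(token, t -> ConcurrentHashMap.newKeySet()).add(docId);
        }
    }

    public Set<Integer> search(String term) {
        // A real engine would also consult the on-disk segments here.
        return inMemory.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) throws Exception {
        SharedSegment segment = new SharedSegment();
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Writer thread: documents stream in.
        pool.submit(() -> {
            for (int id = 0; id < 1_000; id++) {
                segment.index(id, "breaking news story number " + id);
            }
        });

        // Reader thread: searches see new documents without waiting for a flush.
        pool.submit(() -> {
            for (int i = 0; i < 5; i++) {
                System.out.println("hits so far: " + segment.search("news").size());
            }
        });

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```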
The performance was really good and the only theoretical limit was disk space. Data would come in and be instantly searchable. I set up a multi-threaded news web scraper that sent the documents to BFR (over a serialized socket), and on the HTML frontend there were over 50 docs displayed as "0 seconds ago". In fact, the first several pages were constantly 0 seconds ago. I knew it wouldn't hit any unforeseen errors because I knew the code and the infrastructure, and even if it did hit runtime errors I could fix them more easily than you could with off-the-shelf software.
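The wire format was nothing exotic: serialized document objects pushed over a plain socket. Something along these lines, where the document class and port number are hypothetical stand-ins rather than the actual BFR protocol:

```java
import java.io.*;
import java.net.*;
import java.time.Instant;

// Sketch of a scraper pushing a document to the indexer over a serialized socket.
// ScrapedDoc and the port are hypothetical, not the real protocol.
public class ScraperClient {

    // A scraped article: title, body and the moment it was scraped.
    record ScrapedDoc(String title, String body, Instant scrapedAt) implements Serializable {}

    public static void main(String[] args) throws IOException {
        ScrapedDoc doc = new ScrapedDoc(
            "Example headline",
            "Full article text goes here...",
            Instant.now());

        // Hand the document to the indexer; its read thread can make the document
        // searchable as soon as it is deserialized on the other end.
        try (Socket socket = new Socket("localhost", 9090);
             ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream())) {
            out.writeObject(doc);
            out.flush();
        }
    }
}
```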
This ran for months and resulted in a searchable index of several hundred gigabytes. All searches remained sub-second (often less than 10ms - and this is on a bog-standard server, by the way). Eventually I had to shut the news scraper down for bandwidth reasons, and because Google News owns that space, but I still use BFR today, both as a learning tool and in production.