Monday, August 17, 2015

Configuring Wildfly for HTTPS in a post-POODLE world

If you have ever run into “ssl_error_no_cypher_overlap” errors trying to configure Wildfly to use HTTPS, then you have probably cursed the lack of decent documentation for configuring Wildfly now that browsers have disabled a lot of insecure SSL ciphers. This is how I got around the problem.

First you need a self-signed certificate in a Java keystore. This can be created with the command:

keytool -genkey -alias mycert -keyalg RSA -sigalg SHA256withRSA -keystore my.jks -storepass secret  -keypass secret -validity 9999
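
You can double check the generated certificate (the alias, validity and signature algorithm) with:

keytool -list -v -keystore my.jks -storepass secret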

Then you need to configure Wildfly to accept a list of known ciphers. Mozilla maintains a nice list of cipher suites for high security, compatibility etc. on its Server Side TLS wiki page.

The problem is that this list uses the OpenSSL cipher names, and Wildfly needs the RFC names, so you need to map one to the other using the name mapping table in the OpenSSL ciphers documentation.

What I ended up with was this list, defined in an enabled-cipher-suites attribute. This list is the Modern Compatibility list, and it is added to the <subsystem xmlns="urn:jboss:domain:undertow:2.0"> element.
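
A sketch of the resulting listener configuration (only the first few RFC suite names are shown here as an example; use the full mapped Modern list, and note the security realm is defined in the next step):

<subsystem xmlns="urn:jboss:domain:undertow:2.0">
    <server name="default-server">
        <https-listener name="https" socket-binding="https" security-realm="UndertowRealm"
                        enabled-cipher-suites="TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,..."/>
        ...
    </server>
</subsystem>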


You then need to configure Wildfly to use the self-signed key. The configuration goes under the <management><security-realms> element.

<security-realm name="UndertowRealm">
    <server-identities>
        <ssl>
            <keystore path="my.jks" relative-to="jboss.server.config.dir" keystore-password="secret" alias="mycert" key-password="secret"/>
        </ssl>
    </server-identities>
</security-realm>

Wednesday, July 15, 2015

Authenticating via Kerberos with Keycloak and Windows 2008 Active Directory

The following instructions show you how to configure Keycloak with Windows AD in order to use Kerberos authentication.


These instructions make two assumptions:
  1. The Kerberos realm is VIRTUAL.LOCAL.
  2. The hostname used to access Keycloak is virtual.local. This just means we are running Keycloak on the domain controller. In production virtual.local would be replaced with a real hostname, something like sso.example.com, giving you an SPN of HTTP/sso.example.com.


The setup steps are as follows:
  1. Create a Windows domain account called Keycloak.
  2. Run the following command to assign a SPN to the user and generate a keytab file:
    ktpass -out keycloak.keytab -princ HTTP/virtual.local@VIRTUAL.LOCAL -mapUser Keycloak@VIRTUAL.LOCAL -pass password1! -kvno 0 -ptype KRB5_NT_PRINCIPAL -crypto RC4-HMAC-NT
  3. Verify the SPN has been assigned to the user with the command:
    setspn -l Keycloak
  4. Configure the LDAP settings in Keycloak's user federation. Since we are running Keycloak on the domain controller, we reference LDAP via the local loopback address (e.g. ldap://127.0.0.1:389). Obviously this would change in most production environments.
  5. Configure the Kerberos integration on the same federation provider, supplying the Kerberos realm (VIRTUAL.LOCAL), the server principal (HTTP/virtual.local@VIRTUAL.LOCAL) and the path to the keytab file generated above.

Configuring Java

You will need to install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files for your version of Java. Oracle provides separate downloads for Java 7 and Java 8, so make sure you grab the right one.
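
A quick way to verify that the policy files have been picked up is to print the maximum allowed key length:

import javax.crypto.Cipher;

public class JcePolicyCheck {
    public static void main(final String[] args) throws Exception {
        // Prints 2147483647 with the unlimited policy installed, 128 with the default policy
        System.out.println("Max AES key length: " + Cipher.getMaxAllowedKeyLength("AES"));
    }
}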


Don't include ports in the SPN. By default Keycloak will run on port 8080, but the port must not be added to the SPN, even though SPNs can contain port details.

To make use of Kerberos authentication from a Windows client, the Keycloak server has to be in an Internet Zone that has User Authentication -> Logon -> Automatic logon with current user name and password enabled. Chrome and IE both respect this setting.

To enable Kerberos authentication in Firefox, you need to add the Keycloak domain to the network.negotiate-auth.trusted-uris setting in about:config. The Mozilla documentation for this setting has some useful information that can be used to debug exceptions.


The following page can be used to quickly test the ability of a client to log in.

Get the keycloak.js file from the Keycloak JavaScript Adapter download.

Get the keycloak.json file from Keycloak itself.

<script src="keycloak.js"></script>
var keycloak = new Keycloak('keycloak.json');
keycloak.init({ onLoad: 'login-required' }).success(function(authenticated) {
    keycloak.loadUserProfile().success(function(profile) {
        alert('Hello ' + profile.username); // or do something more useful with the profile
    });
});

Some notes on Java 8

Builds of Java 8 above 1.8.0_31 have a bug that will throw a "Defective token detected (Mechanism level: GSSHeader did not find the right tag)" exception.

Thursday, April 09, 2015

Remember me not - avoiding the Australian metadata dragnet with Tor and Asus

So it is official. As an internet user in the great country of Australia my actions online are now tracked and recorded by the government. And that doesn't sit so well with me.

But rather than complain, I decided to take action and install an Asus RT-N66U router as the gateway on my home network. The router had generally positive reviews online, but I was mostly interested in the fact that it supported third party firmware, like the popular ones provided by a developer calling himself Merlin.

One of the big benefits provided by third party firmware releases is that you get early access to some cool new features. One such feature that caught my eye was the introduction of Tor into the router.

I have used Tor sporadically in the past. While I have to commend the Tor developers for making it easy to install Tor and browse anonymously, the reality is that running an additional piece of software was kind of a pain. It was a mental jump to go from "always online" to "first make sure you have opened the Tor software, wait for it to connect, then go online", especially when the process was different depending on whether you are on a desktop PC or a tablet.

The end result was that I would usually spend a few hours on Tor only to eventually put it in the "too inconvenient" basket and give up.

Which is why having Tor enabled at the router level is so nice. It's a one-off setup, after which you never have to think about Tor again.

So how well does Tor on the N66U work?

First of all, the configuration is a breeze. Well, a breeze for someone who is knowledgeable enough to be flashing their home router with custom firmware. The process is not going to appeal to those who don't know what a router is, but anyone with a bit of technical knowledge, or a nerdy friend, can be up and running without too much trouble.

Enabling Tor is simply a matter of selecting "Enable" in a drop-down list, and deciding whether to protect the entire network or just a few select devices.

Once Tor is enabled, there is nothing more that needs to be done on the devices that you have selected. There is no software to be installed, and no process required to enable Tor.

This convenience comes at a cost though. Tor is only as secure as those applications whose data it is transferring, and one of the benefits of the Tor bundle is a browser that has disabled a number of plugins that are known to leak identifiable information.

However, my goal is to frustrate the government metadata dragnet that has been cast over this country's internet rather than achieve anonymity. Now that the average Australian citizen has to treat their ISP as a Man-In-The-Middle attacker, I'm willing to trade a little less anonymity for the convenience of having the packets that make up my web browsing obscured in a way that makes the government's logging efforts mostly meaningless.

Performance wise, there is a noticeable lag as you open web pages, and complex pages (i.e. 90% of the pages you'll view online) take noticeably longer to load. You'll also notice that the browser continues to show pages as loading long after the page has been displayed to you. This is most likely because analytics and advertising scripts are still being downloaded in the background after the main content of the page has been displayed. Initially it is a little disconcerting to see pages taking so long to completely load, but you quickly learn to ignore the spinning icon on the browser tab and jump in once the page is rendered.

Overall, while this performance lag is noticeable, I never felt that it prevented me from doing what I needed to do online.

The biggest downside to using Tor is the number of CAPTCHAs you'll need to fill out to visit everyday sites. Google is especially bad, depending on the Tor exit node you happen to be assigned.

You may find that Google's overzealous bot protection renders the search box on your mobile device useless. Apparently there is no facility in the native Android search widget to prove that you are in fact a human and not a bot scraping Google's search results.

You'll also notice that the ads you are served are now probably in another language. Given that I have clicked on maybe 4 ads in my entire life, this is amusing, and not particularly troublesome.

I even managed to stream videos at 720p. 1080p was asking just a little too much, but if I'm being honest, I don't see much difference between 720p and 1080p anyway.

There are a few things that I didn't try to do over Tor.

The first was streaming geo-locked content (i.e. Netflix). I have a dedicated media PC that uses a VPN (the VPN is also configured through the N66U) rather than Tor. I imagine that it would be quite difficult to use Tor with something like Netflix, given that you don't have a lot of control over where your traffic exits the Tor network. One minute you may appear to be in the US, the next in Europe.

I also don't use applications like Skype, which uses UDP and won't be routed over Tor anyway.

Even though Tor on the N66U is still in beta, configuration is relatively easy, and I never found myself having to reset the router or perform any kind of troubleshooting. Performance is adequate for day to day browsing, and the ability to only route traffic from specific devices over Tor means that your more bandwidth sensitive devices (like your media PC) are not affected.

Overall, I am quite impressed with how Asus and Merlin have integrated Tor into the N66U. A Tor enabled router is a great way to effortlessly opt out of the government's metadata dragnet.

Sunday, March 08, 2015

Fixing OpenVPN "Authenticate/Decrypt packet error: cipher final failed"

When connecting to a VPN I was constantly getting the error

Mar  8 09:29:27 openvpn[1696]: Authenticate/Decrypt packet error: cipher final failed

I had imported the supplied ovpn file and had followed all the other configuration steps, so this was quite frustrating. Then I saw this in the logs:

Mar  8 09:31:07 openvpn[1790]: WARNING: 'cipher' is used inconsistently, local='cipher BF-CBC', remote='cipher AES-256-CBC'

Changing my client to use "cipher AES-256-CBC" instead of the default (which apparently was cipher BF-CBC) fixed the issue.
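
For reference, the fix is a single line in the client config:

# match the cipher the server is actually using
cipher AES-256-CBC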

Sunday, December 07, 2014

A XSS filter for Java EE web apps

Cross Site Scripting, or XSS, is a fairly common vector used to attack web sites. It involves user generated code being redisplayed by a website with all the privileges and security rights that a browser assigns to code originating from the current host. If the user code is something like <script>doEvil();</script>, then you have a problem.

OWASP is an organisation that provides guidance on web security, and they have a page that provides a suggested method for avoiding XSS in Java EE web apps. You can read this document on the OWASP website.

The library being demonstrated here is based on the ideas presented in that article, but fleshed out to be more flexible and easier to deploy. We call this library the (unimaginatively named) Parameter Validation Filter, or PVF.

PVF is implemented as a Servlet filter that intercepts requests to web pages, runs submitted parameters through a configurable sequence of validation rules, and either sanitises the parameters before they are sent through to the web application, or returns an HTTP error code if validation errors were detected.

We have made the following assumptions when developing this library:
  • Client side validation will prevent legitimate users from submitting invalid data.
  • The PVF library should prevent further processing if invalid data is submitted in the majority of cases.
  • Occasionally it might be appropriate to sanitise submitted data, but any sanitisation should be trivial (like the removal of whitespace).
To make use of the PVF library, you’ll need to add it to your project. This artifact is currently in the Sonatype staging repo, so you'll need to add that repo to your Maven config; see the Sonatype documentation for details.


The filter then needs to be added to the web.xml file with the following settings. You may want to configure the url-pattern to match the pages that you actually want to protect.
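
The entries look something like this (the fully qualified filter class name below is a placeholder; take the real one from the PVF documentation):

<filter>
    <filter-name>ParameterValidationFilter</filter-name>
    <!-- Placeholder class name - use the actual class from the PVF library -->
    <filter-class>com.example.pvf.ParameterValidationFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>ParameterValidationFilter</filter-name>
    <!-- Adjust to match the pages you actually want to protect -->
    <url-pattern>/*</url-pattern>
</filter-mapping>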


Finally you need to create a file called WEB-INF/xml/pvf.xml. This file defines the custom validation rules applied to the parameters being sent to your web applications.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- ParameterValidationChainDatabase is always the document element -->
<ParameterValidationChainDatabase>

    <!-- Enforcing mode needs to be set to true to return a HTTP error code if validation fails.
         If set to false, validation errors are logged but ignored.
         (Element layout below is reconstructed from the comments; fully qualified class names are
         abbreviated to their simple names. Check the PVF sample config for the exact schema.) -->
    <EnforcingMode>true</EnforcingMode>

    <!-- We always have a single ParameterValidationChains element under the parent -->
    <ParameterValidationChains>

        <!-- Each chain of validation rules is contained in a ParameterValidationDefinition element.
             This chain applies some global validation rules: if anyone supplies encoded params, or
             params with HTML characters, it will fail. -->
        <ParameterValidationDefinition>

            <!-- This is the list of validation classes that should be applied to matching parameters.
                 Each validationRuleName is the fully qualified name of the class used to apply the rule. -->

            <!-- All input fields are to be trimmed of excess whitespace -->
            <validationRuleName>TrimTextValidationRule</validationRuleName>
            <!-- No parameters are expected to already be encoded -->
            <validationRuleName>FailIfNotCanonicalizedValidationRule</validationRuleName>
            <!-- No parameters are expected to contain html -->
            <validationRuleName>FailIfContainsHTMLValidationRule</validationRuleName>

            <!-- This is a regex that defines which parameters will be validated by the classes above -->
            <paramNamePatternString>.*</paramNamePatternString>
            <!-- Setting this to false means the paramNamePatternString has to match the param name.
                 Setting it to true would mean that paramNamePatternString would have to *not* match the param name. -->

            <!-- This is a regex that defines which URLs will be validated by the classes above -->
            <requestURIPatternString>.*</requestURIPatternString>
            <!-- Setting this to false means the requestURIPatternString has to match the URI.
                 Setting it to true would mean that requestURIPatternString would have to *not* match the URI. -->

        </ParameterValidationDefinition>
    </ParameterValidationChains>
</ParameterValidationChainDatabase>

The XML has been commented to make it easier to understand, but there are a few interesting elements:
  • paramNamePatternString, which has been configured to enable the validation chain to match all parameters
  • requestURIPatternString, which has been configured to enable the chain to match all URIs
  • The three elements called validationRuleName, which reference the full class name of the validation rules that will be applied to each parameter passed into our web application
Although this is a simple example, the three validation rules that have been implemented (TrimTextValidationRule, FailIfNotCanonicalizedValidationRule and FailIfContainsHTMLValidationRule) are quite effective at preventing a malicious user from submitting parameters that contain XSS code.

The first rule, TrimTextValidationRule, simply strips away any whitespace on either side of the parameter. This uses the trim() function any developer should be familiar with.

The second rule, FailIfNotCanonicalizedValidationRule, will prevent further processing if the supplied parameter has already been encoded. No legitimate user will have a need to supply text like %3Cscript%3EdoEvil()%3B%3C%2Fscript%3E, so any time encoded text is found we simply return with a HTTP 400 error code. This rule makes use of the ESAPI library supplied by OWASP.
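
Conceptually the check works something like this sketch (my illustration using the ESAPI encoder, not the library's actual source):

import org.owasp.esapi.ESAPI;

public class CanonicalizationCheck {
    /**
     * Returns true if the value is already in canonical form. Encoded input like
     * %3Cscript%3E changes when decoded, so it fails the check.
     */
    public static boolean isCanonical(final String value) {
        return ESAPI.encoder().canonicalize(value).equals(value);
    }
}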

Like the second rule, the third rule will prevent further processing if the supplied parameter has any special HTML characters. If you would like your customers to be able to pass through characters like &, this rule is too broad. However, it is almost always valid to block special HTML characters.

If you want to see how effective this simple validation chain is, check out the live demo linked from the project's GitHub repo. The OWASP XSS Filter Evasion Cheat Sheet is a good place to find XSS patterns that are often used to bypass XSS filters.

Moving forward we will be looking to implement more targeted validation rules, especially those that can’t be easily implemented as regex matches (like making sure a date is after today, or that a number is between two values etc).

If you have any suggestions, or find any bugs, feel free to fork the code from our GitHub repo. We do hope to get some public feedback in order to make this library as robust as it can be.

Saturday, July 19, 2014

Three Daily Things

Three Daily Things is a new, free website that I have put together based on a motivation app that I have had great personal success with.

The reason why I wrote this app is best explained with a story about myself.

Like most people, I want to be fit and healthy. To achieve this, I sign up to my local gym. But it doesn't take long for the unknowns to start rattling around in my brain. How many reps and sets should I do? Should I do cardio before or after resistance training? Should I be taking supplements? Is it best to work out in the morning or afternoon? Which exercises are best?

All these unknowns start to weigh heavily in my mind. I begin to wonder if I am wasting my time. I skip a few sessions, and before I know it I haven't visited the gym in weeks.

So many beneficial aspirations in my life have followed this path. I like the idea, I try it out, I get overwhelmed by the unknowns and eventually I give up.

So I asked myself, what was it that I really wanted out of my gym membership? The answer is right there in the first sentence: to be fit and healthy. And I realized that I can achieve this goal without actually knowing the best ratio of sets to reps or the most effective exercises, and without taking supplements. All I need to do is walk through the gym door. Simply being in the gym day after day and doing some sort of exercise achieves my goal of being fit and healthy.

In a nutshell, my problem is that in seeking perfection I often end up with nothing. The solution was to convince myself that continuing to focus on these beneficial activities, even if I do them imperfectly, will eventually get me the result I am looking for.

That is why I created Three Daily Things. It is a way to remind myself that any time I spend working on the three things that I feel would most improve my life is worthwhile.

The app is deliberately simple. You list the three things that you feel would most improve your life, and when you spend any amount of time working on them, you give yourself a tick. The report allows you to quickly gauge how often you have made time (any time) for these three things over the last week.

For such a simple idea, my personal results were quite incredible. Prior to writing this app, I would estimate that I would have spent time on my three most important things maybe six or seven times a week, giving me a weekly score of around 30%. Using the app daily, my weekly score usually sits between 60% and 80%. That means I have easily doubled my focus on those three things that I know are improving my life.

This started out as a personal motivational tool, but it has worked so well that I decided to make the web site free to everyone in the hopes that others would find Three Daily Things as simple and effective as I have.

Friday, June 13, 2014

RHEL 7 VirtualBox Guest Additions Patched

If you have tried to use RHEL 7 in VirtualBox, and run into the issue with the guest additions not compiling, you can download this tar file, extract it, and run

sudo ./

The tar file is just the guest additions pre patched to work with RHEL 7.

Friday, January 24, 2014

Scroll to the bottom of log files in web pages

We use supervisord on our systems, which has a handy browser based 'tail -f' feature. The problem is that the end of the log file appears off the bottom of the screen, and the browser won't scroll to the bottom automatically. This bookmarklet will keep the end of the page in view.
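
A minimal version looks something like this (a sketch, not necessarily the original code; the one second interval is arbitrary):

javascript:void(setInterval(function(){window.scrollTo(0,document.body.scrollHeight);},1000));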


Saturday, January 11, 2014

If you are a knowledge gatekeeper, the game is changing

"We take comfort from Charles Darwin's observation that it's not the strongest species that survives, nor the most intelligent, but the ones most responsive to change. We just need to be adaptable." - Gary Pruitt, then the CEO of McClatchy Newspapers and now CEO of the Associated Press (

I recently spent some time looking into the stats for web content we were responsible for and noticed two things. The first was that the majority of visitors were viewing chunked versions of our books, where each individual web page aligned roughly with an entry in the table of contents. The second was that most visitors were spending only a couple of minutes reading the content in any given session.

This observation aligned very closely with my own information gathering process. Once or twice a year I'll make time to sit down and read a technical book from start to finish. This is usually because I am studying for a test or need to familiarise myself with a new technology. This compares to the dozens of web searches I do every single day to find answers to various issues that I run into as part of my daily job.

My productivity depends in large part on my ability to find answers to specific and usually isolated problems. I imagine a lot of information workers are in the same boat.

Right now the process of finding answers is to do a search in Google and then skim over the resulting links to try and find an answer. Or in other words, Googling.

The fact that Googling involves viewing the original content is very important to those that create, collate and profit from factual digital content (think wikis, knowledgebases, books, articles, blogs etc), because it means that they are the gatekeepers to their digital kingdoms. The answers people seek can be presented, restricted and sold in any way that best suits the knowledge gatekeepers.

If everything goes according to IBM’s plan, Watson will help kick off what CEO Ginni Rometty refers to as a third era in computing. The 19th century saw the rise of a “tabulating” era: the birth of machines designed to count. In the latter half of the 20th century, developers and scientists initiated the “programmable” era—resulting in PCs, mobile devices, and the Internet. The third (potential) era is “cognitive,” in which computers become adept at understanding and solving, in a very human way, some of society’s largest problems.

Knowledge is about to become commoditized. Since most people want answers to questions, this third era of computing is going to demote knowledge gatekeepers to funnels into systems like Watson. At best your original content will be presented as a footnote against the answer provided by Watson. At worst, no one will care where the answer came from, only that it is accurate.

Today it takes teams of highly skilled employees to create, curate, manage and navigate digital knowledge repositories designed to support vertical markets. Stick a support contract in front of that and you have the makings of a billion dollar business. In the not too distant future, some kid in a dorm room is going to be able to scrape every forum, wiki, book, bug report, chat log and code base that relates to your business or product, funnel it into a system like Watson, and produce a “Ask Watson” style web page that renders your once valuable digital knowledge repositories obsolete.
And if you think that your content is so engrained, valuable or unique as to be immune from disruption from this third era of computing, I have shares in Encyclopaedia Britannica, print based news media and bricks and mortar book shops that I would like to sell you.

Adaptation is the key to surviving the transition from valuable knowledge gatekeeper to anonymous information source. Because make no mistake, change is upon us right now, and history is littered with the corpses of businesses that missed the signs.

Sunday, January 05, 2014

Your grandkids are going to laugh at the notion of "searching the web"

In the next 5 - 10 years our good friend and faithful digital companion, the search bar, is going to ride off into the sunset.

Googling has forever changed how we find answers to almost anything we could ever want to know, because the answer is most likely there buried in the first 10 links brought up in response to anything you type into the search bar.

When you think about it though, having to trawl through those 10 links is actually not a great way to answer a question. If you were to ask your mechanic what was making that noise in your engine, you would want the answer, not rambling descriptions and discussions on engine design and maintenance. But that is exactly what you get if you ask Google the same question: pages of forum posts when you only want the last one that actually has the answer; a blog post where that engine noise is coincidentally mentioned in the comments; or a 10 minute YouTube video where the information you want is 9 minutes in.

That is all about to change. In the not too distant future you will type your query into a question bar, and the result will be a plain English answer. No more "this search returned 10000000000000 results"; no more clicking through web pages to find the information you are looking for; and, if Google isn't careful, potentially no more Googling.

We have seen a glimpse of this move from searching to asking with Wolfram Alpha. You enter some search criteria, Alpha identifies real world entities or concepts in the search query, and then displays a bunch of cross referenced information. It's a great tool if you are looking for something that is easily identifiable, like the manufacturer of your car, and possibly even your car's model. But Alpha is not going to tell you what that sound in your engine could be.

That's not to say that Wolfram is giving up. He has made some vague reference to a new knowledge based programming language:
We call it the Wolfram Language because it is a language. But it’s a new and different kind of language. It’s a general-purpose knowledge-based language. That covers all forms of computing, in a new way. 
Apple has made some bold moves away from searching with Siri. Granted, Siri is often just a speech to text converter for plain old web searches, but it is not hard to imagine Siri connecting the dots between the questions you asked about "that noise in my engine" and post number 13 in a forum conversation that mentions a fan belt, with the same ease with which she connects your request to set an alarm with the settings in your iPhone alarm clock app.

Microsoft is also making a move towards asking questions:
"Our vision of search is 'just ask'. Search is the ultimate human interface. It should be able to cope with any input."
And then there is IBM, which bested some of the brightest minds ever to grace the game show Jeopardy with its system Watson. Watson is probably the best public example of a system that can answer general questions with plain English responses, and that power is going to be available to developers:
Developers who want to incorporate Watson’s ability to understand natural language and provide answers need only have their applications make a REST API call to IBM’s new Watson Developers Cloud. “It doesn’t require that you understand anything about machine learning other than the need to provide training data,” Rob High, IBM’s CTO for Watson, said in a recent interview about the new platform.
Google has been fairly quiet about this post web search world, probably because it is of no benefit to them to suggest that searching is going anywhere, but their acquisition of data sources like Freebase, the sidebar that displays relevant information about any entities in your search query, and Google Now all indicate that they are looking to move beyond plain old web searches too.

So when my grandkids ask me what the internet was like when I was their age, I will tell them of the great information hunts I used to go on, carefully crafting my search queries and then stalking the answer through pages of unrelated and often unhelpful information. And they will laugh.

Friday, November 15, 2013

Three.js Tutorials

A tutorial series looking at the JavaScript 3D engine three.js.

Three.js - 1 - Getting Started Part One
Three.js - 2 - Getting Started Part Two

Three.js - 2 - Getting Started Part Two


In the last tutorial we got a very basic three.js application up and running. This tutorial will continue on from that, adding a stats counter and dealing with window resizing.

Stats Counter

A number of the example three.js applications have a neat little stats counter showing the number of frames per second, and displaying a graph of the frame rate over the last few seconds. This information is invaluable when testing the performance of your 3D application.

The JavaScript library responsible for this stats counter is not actually part of three.js. It is a separate library, stats.js, from the same author, and it can be found on GitHub. The file that we need to include in our project is the minified stats.min.js. This JavaScript file is included like the others in the <head> element.

  <title>Three.js - 2 - Getting Started 2</title>
  <link rel="stylesheet" type="text/css" href="css/style.css">
  <script type="application/javascript" src="javascript/stats.min.js"></script>
  <script type="application/javascript" src="javascript/three.min.js"></script>
  <script type="application/javascript" src="javascript/threejs-2.js"></script>
If you look at the code above, you'll notice that we have also included a link to a CSS file. The DOM element created by the stats.js library has the ID of stats. Knowing this, we can create a CSS rule to position the stats counter in the top left hand corner of the page.

#stats { position: absolute; top:0; left: 0 }  
With the stats.js library loaded and the DOM element styled, we now create the stats counter and add it to the HTML page in JavaScript. The following lines are added to the end of the init() function. Note that we are appending the stats element directly to the body of the HTML page, just like we did with the renderer in the last tutorial. Because we specified absolute positioning in the style that is applied to the stats element, it will not interfere with the position of the renderer.

 stats = new Stats();
 document.body.appendChild( stats.domElement );  
The final step is to update the stats counter every time a frame is rendered. This is done with the following code added to the end of the animate() function.

 stats.update();
Window Resizing

Take a moment to run the example from the last tutorial, and after it has been loaded, resize the window. Notice that the 3D cube does not change its position relative to the top left hand corner of the screen. Obviously this is not ideal.

Fixing this requires the camera's aspect ratio and projection matrix, and the renderer's size, to be updated if the window is resized. We can specify a function to be run in the event that the window is resized by adding an event listener to the window's resize event.

 window.addEventListener( 'resize', onWindowResize, false );  
The onWindowResize() function is where we update the camera's aspect ratio and projection matrix, and the renderer's size.

function onWindowResize() {
 camera.aspect = window.innerWidth / window.innerHeight;
 camera.updateProjectionMatrix();
 renderer.setSize( window.innerWidth, window.innerHeight );
}
When running the example supplied with this tutorial, it is possible to resize the window and still have the cube placed in the middle of the screen.


If you click on the stats counter, you will switch between two views. The first displays the frames per second, while the second displays the milliseconds per frame.

Thursday, November 14, 2013

MinHash for dummies

Duplicate document detection is becoming increasingly important for businesses that have huge collections of email, web pages and documents with multiple copies that may or may not be kept up to date.

MinHash is a fairly simple algorithm that from all my Googling has been explained very poorly in blogs or in the kind of mathematical terms that I forgot long ago. So in this article I will attempt to explain how MinHash works at a practical code level.

Before I start, please take a look at the "Finding Similar Items" chapter of the freely available book Mining of Massive Datasets. That document goes into a lot of theory, and was ultimately where my understanding of MinHash came from. It approaches the algorithm from a theoretical standpoint, but if I gloss over some aspect of the MinHash algorithm here, you will almost certainly find a fuller explanation there.

I'll also be using pseudo Java in these examples instead of traditional math. This means that when I use terms like Set, I am referring to the group of classes that implement a Set in Java. Of course a set in Java shares a lot of properties with a set in maths, but keep in mind that this is written from a developer's point of view.

What does it mean when we talk about document similarity?  

The whole point of MinHash is to efficiently answer the question "How similar is this document to that document".

Before we can answer that question, we need to look at a simpler example of two sets of strings and determine how similar they are.

Here are our two sets:

Set a = new Set(["chair", "desk", "rug", "keyboard", "mouse"]);
Set b = new Set(["chair", "rug", "keyboard"]);

A measure of how similar these two sets are is known as the Jaccard Coefficient. It is calculated as the number of common elements / (total number of elements - number of common elements).

In our case we have 3 common elements: "chair", "rug" and "keyboard". We have a total of 8 elements, so the Jaccard Coefficient = 3 / (8 - 3) = 0.6, or 60%. There are online Jaccard calculators that you can play around with to determine the similarity of two sets.
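
Here is that calculation as runnable Java, using the example sets above:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Jaccard {
    /** Number of common elements / (total number of elements - number of common elements). */
    public static double coefficient(final Set<String> a, final Set<String> b) {
        final Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / (a.size() + b.size() - common.size());
    }

    public static void main(final String[] args) {
        final Set<String> a = new HashSet<>(Arrays.asList("chair", "desk", "rug", "keyboard", "mouse"));
        final Set<String> b = new HashSet<>(Arrays.asList("chair", "rug", "keyboard"));
        System.out.println(coefficient(a, b)); // prints 0.6
    }
}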

Documents as sets

While a document can be thought of as a giant set of words, we don't just break a document down into individual words, place them in a set and calculate the similarity, because that loses the importance of the order of the words. The sets

Set a = new Set(["I", "went", "to", "work", "today"]);
Set b = new Set(["today", "I", "went", "to", "work"]);

would be considered 100% similar, even though they clearly are not.

Instead we break a document down into what are known as shingles. Each shingle contains a set number of words, and a document is broken down into (total words - shingle length + 1) shingles. So if a document contained a single sentence of "The quick brown fox jumps over the lazy dog", it would be broken down into the following 5 word long shingles:
  1. The quick brown fox jumps
  2. quick brown fox jumps over
  3. brown fox jumps over the
  4. fox jumps over the lazy
  5. jumps over the lazy dog
So our document as a set looks like:

Set a = new Set(["The quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", "fox jumps over the lazy", "jumps over the lazy dog"]);

These sets of shingles can then be compared for similarity using the Jaccard Coefficient.
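
Here is a sketch of the shingling step in Java (splitting on whitespace for simplicity):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Shingles {
    /** Breaks a document into its (total words - shingle length + 1) overlapping shingles. */
    public static Set<String> shingles(final String document, final int shingleLength) {
        final String[] words = document.trim().split("\\s+");
        final Set<String> result = new HashSet<>();
        for (int i = 0; i <= words.length - shingleLength; ++i) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + shingleLength)));
        }
        return result;
    }
}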

Optimizing the process

So we now have a way to compare two documents for similarity, but it is not an efficient process. To find documents similar to document A in a directory of 10000 documents, we need to compare document A against every other document individually. This is obviously not going to scale well.

What we can do to reduce some cycles is compare sets of randomly selected shingles from two documents. So for a document that is 10000 words long, we break it down into 9996 shingles, and randomly select say 200 of those to represent the document. If the second document is 20000 words long, we boil that down from 19996 shingles to another set of 200 randomly selected shingles.

Now instead of matching 9996 shingles to 19996 other shingles, we are comparing 200 shingles to another 200 shingles. We have reduced our workload at the expense of accuracy, but this is still good enough for most scenarios.

So just how much accuracy did we sacrifice by comparing only a small subset of randomly selected shingles? I actually don't know, but I assume it sits on a bell curve where the number of random selections improves the accuracy to a point where more random selections won't provide any meaningful increase in accuracy. If you know the math behind this, please leave a comment.

I chose 200 random selections after reading around; it seems to be a commonly used figure in practice.

Computing and saving the random shingle selections

So this is where MinHash comes in.

We could save the 200 randomly selected shingles, but storing variable length strings kinda sucks. It complicates database design and comparing strings is slow.

The clever part about MinHash is that it allows us to save integers instead of strings, and also takes the hassle out of randomly selecting shingles. It works like this.

1. Break down the document into a set of shingles.
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 min hash values.

At this point instead of 200 randomly selected shingles, we have 200 integer hash values. This is effectively a random selection of shingles, because a good hash algorithm is supposed to generate a number that is as likely to be large as it is likely to be small. This kind of distribution of hash codes is what hashing is all about, so selecting the smallest hash is, for all intents and purposes, a random selection of a shingle.

Um, 199 more hash algorithms? WTF!

So we have the String.hashCode() method, but where do the other 199 hash codes come from? This is the step that most explanations of the MinHash algorithm conveniently leave out.

The short answer is that you XOR the value returned by String.hashCode() with 199 random numbers to generate the 199 other hash code values. Just make sure that you are using the same 199 random numbers across all the documents.
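
Putting the four steps together, a minimal sketch of the signature calculation looks like this:

import java.util.Arrays;
import java.util.Random;
import java.util.Set;

public class MinHashSignature {
    private static final int HASH_COUNT = 200;
    private final int[] seeds = new int[HASH_COUNT];

    public MinHashSignature() {
        // seeds[0] stays 0 so the first "algorithm" is the raw hashCode; the other 199 are
        // simulated by XORing with fixed random numbers. The seeds must be identical for
        // every document, hence the fixed seed on Random.
        final Random random = new Random(42);
        for (int i = 1; i < HASH_COUNT; ++i) {
            seeds[i] = random.nextInt();
        }
    }

    public int[] signature(final Set<String> shingles) {
        final int[] minHashes = new int[HASH_COUNT];
        Arrays.fill(minHashes, Integer.MAX_VALUE);
        for (final String shingle : shingles) {
            final int hash = shingle.hashCode();
            for (int i = 0; i < HASH_COUNT; ++i) {
                // keep the smallest value seen for each simulated hash algorithm
                minHashes[i] = Math.min(minHashes[i], hash ^ seeds[i]);
            }
        }
        return minHashes;
    }
}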

If you want some more detail on why this works, the Mining of Massive Datasets chapter mentioned at the start of this article covers it.

Great, but I still have to compare every document to every other document. Can this be optimized?

This is where the second neat trick of MinHash comes in. 

Locality sensitive hashing (LSH) involves generating a hash code such that similar items will tend to get similar hash codes. This is the opposite of what .hashCode() does.

LSH allows you to precompute a hash code that is then quickly and easily compared to another precomputed LSH hash code to determine if two objects should be compared in more detail or quickly discarded.

We don't actually calculate a LSH hash code as such, but the idea of boiling down a complex object (like our collection of minhash codes) into something that is quickly and easily compared with other complex objects is still applicable.

What we do is take the collection of minhash values and categorize them into bands and rows. For the sake of simplicity we'll assume each document has 20 minhash values. We will break them down into 5 bands with 4 rows each. Also for the sake of simplicity we'll assume that .hashCode() (and any number XORed with it) results in a number between 0 and 9.

So conceptually the minhash values from 5 documents in the first of the 5 bands and its rows looks like this.

Band One
                 MinHash Alg One   MinHash Alg Two   MinHash Alg Three   MinHash Alg Four
Document One            1                 3                  6                   0
Document Two            2                 3                  1                   0
Document Three          1                 3                  6                   0
Document Four           2                 1                  3                   1
What we are looking for are rows within a band that are the same between documents. Document One and Document Three in this case both have rows with values 1, 3, 6, 0. This tells us that these two documents should be compared for their similarity.

Although it is not shown here, you should look through Bands Two - Five looking for documents that share rows. Any documents that share rows in any bands should be compared for their similarity. 
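
The SQL example below stores each row value in its own column, but another common approach is to hash each band down to a single key that can be indexed; a sketch:

public class BandKeys {
    /**
     * Collapses a 20 value MinHash signature into 5 keys, one per band of 4 rows.
     * Documents that share a key in any band become candidates for a full comparison.
     */
    public static int[] bandKeys(final int[] signature) {
        final int bands = 5;
        final int rows = 4;
        final int[] keys = new int[bands];
        for (int band = 0; band < bands; ++band) {
            int key = 17;
            for (int row = 0; row < rows; ++row) {
                key = 31 * key + signature[band * rows + row];
            }
            keys[band] = key;
        }
        return keys;
    }
}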

This is useful when you have a document, and you want to know which other documents to compare it to for similarity. I'll assume that the minhash values are stored in SQL, so to get the candidates for comparison you would write:

SELECT *
FROM Documents
WHERE (
    BandOneMinHashAlgorithmOne = 1 AND
    BandOneMinHashAlgorithmTwo = 3 AND
    BandOneMinHashAlgorithmThree = 6 AND
    BandOneMinHashAlgorithmFour = 0
) OR (
    BandTwoMinHashAlgorithmOne = # AND
    BandTwoMinHashAlgorithmTwo = # AND
    BandTwoMinHashAlgorithmThree = # AND
    BandTwoMinHashAlgorithmFour = #
) OR (
    BandThreeMinHashAlgorithmOne = # AND
    BandThreeMinHashAlgorithmTwo = # AND
    BandThreeMinHashAlgorithmThree = # AND
    BandThreeMinHashAlgorithmFour = #
) OR (
    BandFourMinHashAlgorithmOne = # AND
    BandFourMinHashAlgorithmTwo = # AND
    BandFourMinHashAlgorithmThree = # AND
    BandFourMinHashAlgorithmFour = #
) OR (
    BandFiveMinHashAlgorithmOne = # AND
    BandFiveMinHashAlgorithmTwo = # AND
    BandFiveMinHashAlgorithmThree = # AND
    BandFiveMinHashAlgorithmFour = #
)

This explanation completely glosses over why this works and what combination of bands and rows you should construct when looking for similar documents. If you want to get into that detail, take a look at the Mining of Massive Datasets chapter mentioned at the start of this article.

Sunday, November 10, 2013

Three.js 1 - Getting Started

After Apple's war on Flash, relying on plugins in your web pages is just not a viable option. Flash (and maybe, thanks to Netflix, Silverlight) were the last of a dying breed of browser plugins. In their place are a host of new technologies being standardised as part of HTML5.

WebGL is one of these technologies, and is supported by all major browsers. Microsoft is even getting on board with IE 11. Combine that with the ever increasing performance of JavaScript engines and it is now possible to create some impressive 3D applications in a web browser without the aid of plugins.

Three.js is one of the more powerful and mature JavaScript 3D engines available. This article will show you how to get a simple 3D application running with three.js.

What to download

Using three.js requires only a single JavaScript file called three.min.js. You can grab the latest version of this file from GitHub.

The HTML page

A three.js application renders into a DOM element that can be appended as a child of an element in your HTML page. This means that you can embed a 3D application in a web page like it was an image or video file. This gives you a great deal of flexibility when presenting your 3D application.

However, for this example we will create a very basic HTML page and add the three.js application as the only child. While basic, this configuration is quite suitable for full screen games or simulations.

As you can see, this is a very simple HTML page. The only code of importance here is the references to the two JavaScript files: the three.js library (three.min.js), and the JavaScript file that will hold our application (threejs-1.js).

  <title>Three.js - 1 - Getting Started</title>
  <script type="application/javascript" src="javascript/three.min.js"></script>
  <script type="application/javascript" src="javascript/threejs-1.js"></script>

The Components of a 3D Application

There are five important classes in three.js that make up a basic 3D application: the camera, the scene, the meshes, the geometries and the materials.

The Camera

This is exactly what it sounds like. The 3D world created by three.js is viewed as if through the lens of a video camera. In fact the camera has methods like .setLens() to set properties like focal length and frame size, which relate to a camera's field of view.

The Scene

The scene is the world that you will populate with 3D objects, lights and cameras.

The Meshes

A mesh defines the individual characteristics of a 3D object. These include things like the position, rotation and scale of the object. A mesh will reference a geometry to define the shape of the 3D object, and a material to define the appearance of the 3D object.

What is important to note about meshes is that they can share geometries and materials. For example, to render a forest with a thousand 3D trees, you would need a thousand meshes, but potentially only 3 or 4 geometries (to represent the different types of trees) and 1 material.

The Geometries

The shape of a 3D object is defined by a geometry. Geometries define the various vertices and faces that make up a 3D model. Simple applications can make use of the built-in geometry classes that describe a sphere, cube, cylinder and other common shapes. More complex 3D applications will load geometries created in 3D modelling applications.

The Materials

While a geometry defines the shape of a 3D object, the material defines the appearance. Materials can be as simple as a plain colour, or they can be complex shaders with reflections or animated appearances.

Putting it all Together

We start by assigning a function to the window.onload event. This is good practice when writing any JavaScript that manipulates the HTML DOM, as it is possible to start playing with the DOM before it is completely loaded. If you use jQuery, the ready() function is even better as it will allow your code to manipulate the DOM without having to wait for images to download. Since our page has no images, and we are not using jQuery, the window.onload event will work just fine.

The function assigned to the window.onload event will call two additional functions: init() and animate().

window.onload = function(){
 init();
 animate();
};

The init() function is where we setup the components that make up the 3D application.

We create the camera, which is an instance of the PerspectiveCamera class. The 4 parameters supplied to the constructor are

fov - Camera frustum vertical field of view.
aspect - Camera frustum aspect ratio.
near - Camera frustum near plane.
far - Camera frustum far plane.
We then set the position of the camera down the z axis. This will allow us to position a 3D object at the origin and be able to view it.

camera = new THREE.PerspectiveCamera( 75, window.innerWidth / window.innerHeight, 1, 10000 );
 camera.position.z = 1000;    

Next we create some geometry. Three.js provides a number of geometry classes that define a number of common shapes. Here we use the CubeGeometry class to create the geometry for a cube. The supplied parameters define the size of the cube along the three axes.

 geometry = new THREE.CubeGeometry( 200, 200, 200 );   

In order for the 3D object to be visible, it needs a material. The MeshBasicMaterial class allows us to create a simple material without the use of any external image files. The MeshBasicMaterial constructor takes an object whose properties define the appearance of the material. In this case we have set the colour to red, and specified that the material should display as a wireframe.

 material = new THREE.MeshBasicMaterial( { color: 0xff0000, wireframe: true } );  

The mesh uses the geometry and material to create the 3D object that we will see on the screen.

mesh = new THREE.Mesh( geometry, material );   

Now that we have a 3D object to view, we need to create the scene that will hold it. The mesh is then added to the scene.

 scene = new THREE.Scene();
 scene.add( mesh );

Three.js can use a number of technologies to draw the 3D application within the browser. Each is provided by a renderer class. These different renderers are provided because not all browsers support all the HTML technologies that are available to libraries like three.js. For example, Internet Explorer won't support WebGL until version 11.

We have used the WebGL renderer here. This code will fail if it is run on any version of IE before 11, and future tutorials will provide workarounds for older browsers. For now though make sure you run the demo on Firefox or Chrome.

Because our application will take up the full browser window, we set the size of the renderer to the size of the browser window.

 renderer = new THREE.WebGLRenderer();
 renderer.setSize( window.innerWidth, window.innerHeight );   

The final step is to attach the renderer to the HTML DOM. In this case the renderer is attached directly as a child of the HTML body.

 document.body.appendChild( renderer.domElement );   

At this point we have created all the objects required by the 3D application. The next step is to animate the scene.

Just like a flip book, animating a 3D application involves quickly taking snapshots of the 3D objects and displaying those snapshots on the screen.

HTML pages used to be fairly static, and only updated in response to some kind of user interaction (like clicking a button or submitting a form). This model doesn't work well for applications like the one we are creating, which expect to be able to update the browser page with each frame, independently of any explicit user interaction.

Modern browsers support this kind of application through the requestAnimationFrame function. This function takes another function as a parameter, and will call the supplied function when the browser page is available to be refreshed. The requestAnimationFrame function is not supported by all browsers, but three.js includes a shim (or workaround) for older browsers, saving us from having to deal with these incompatibilities.

So the animate() function is first called when the window is loaded, and is then scheduled to be recursively called each time we render a frame.

 requestAnimationFrame( animate );   

So that we have something to look at, the cube is rotated slightly with each frame.

 mesh.rotation.x += 0.01;
 mesh.rotation.y += 0.02;  

Finally, the renderer takes the scene and the camera, and renders the current frame to the screen.

 renderer.render( scene, camera );  
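
Putting those pieces together, the complete animate() function looks like this:

function animate() {
    requestAnimationFrame( animate );

    mesh.rotation.x += 0.01;
    mesh.rotation.y += 0.02;

    renderer.render( scene, camera );
}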

Final Words

The end result of this tutorial has been to display a single spinning 3D cube in a web page. While this is quite a trivial example of the three.js library, it does cover all the basic elements of a 3D application, and provides a useful starting point for building more complex 3D applications.

Monday, September 23, 2013

Stopping emails in the new Gmail tabs from showing up in new mail counts

I love the new tabbed interface in Gmail. To be blunt, email clients have done a very poor job of dealing with the increasing volume and variety of email that people get nowadays, and it is nice to see Gmail implementing simple and effective solutions.

The one problem I have with the new interface is that email that was previously just going to spam is now being categorised and filed under these tabs, and its presence contributes to the number I see in my Chrome Gmail monitoring extension. It became impossible to know when I had mail that I was actually interested in, because the little icon in my browser was always showing 100+ new emails.

I initially tried to setup a filter, but nothing I could type into the new filter dialog would allow me to set emails sent to the categories to be marked as read.

But I soon twigged that you can create a filter from a search, and you can search based on categories.

Do a search with the category: prefix (for example, category:updates).

Drop the search menu down again, and the Create filter with this search link is now active!
And with that, any email sent to one of the new tabs can be marked as read, leaving your mail monitoring extensions showing only the actual inbox count.