Sunday, December 07, 2014

An XSS filter for Java EE web apps

Cross Site Scripting, or XSS, is a fairly common vector used to attack web sites. It involves user generated code being redisplayed by a website with all the privileges and security rights that a browser assigns to code originating from the current host. If the user code is something like <script>doEvil();</script>, then you have a problem.

OWASP is an organisation that provides guidance on web security, and it has a page suggesting a method for avoiding XSS in a Java EE web app. You can read this document at https://www.owasp.org/index.php/How_to_add_validation_logic_to_HttpServletRequest.

The library being demonstrated here is based on the ideas presented in that article, but fleshed out to be more flexible and easier to deploy. We call this library the (unimaginatively named) Parameter Validation Filter, or PVF.

PVF is implemented as a Servlet filter that intercepts requests to web pages, runs the submitted parameters through a configurable sequence of validation rules, and either sanitises the parameters before they are passed on to the web application, or returns an HTTP error code if validation errors were detected.

We have made the following assumptions when developing this library:
  • Client side validation will prevent legitimate users from submitting invalid data.
  • The PVF library should prevent further processing if invalid data is submitted in the majority of cases.
  • Occasionally it might be appropriate to sanitise submitted data, but any sanitisation should be trivial (like the removal of whitespace).
To make use of the PVF library, you’ll need to add it to your project. This artifact is currently in the Sonatype staging repo, so you'll need to add that repo to your Maven config. See http://stackoverflow.com/questions/13945757/how-do-you-import-a-maven-dependency-from-sonatype-org  for details.

<dependency>
  <groupId>com.matthewcasperson</groupId>
  <artifactId>parameter_validation_filter</artifactId>
  <version>LATEST</version>
</dependency>

The filter then needs to be added to the web.xml file with the following settings. You may want to configure the url-pattern to match the pages that you actually want to protect.


<filter>
  <filter-name>ParameterValidationFilter</filter-name>
  <filter-class>com.matthewcasperson.validation.filter.ParameterValidationFilter</filter-class>
  <init-param>
    <param-name>configFile</param-name>
    <param-value>/WEB-INF/xml/pvf.xml</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>ParameterValidationFilter</filter-name>
  <url-pattern>*.jsp</url-pattern>
</filter-mapping>

Finally, you need to create a file at /WEB-INF/xml/pvf.xml. This file defines the custom validation rules applied to the parameters being sent to your web application.


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- ParameterValidationChainDatabase is always the document element -->
<ParameterValidationChainDatabase>
   <!--
        Enforcing mode needs to be set to true to return an HTTP error code if validation fails.
        If set to false, validation errors are logged but ignored.
    -->
    <EnforcingMode>true</EnforcingMode>

<!-- We always have a single ParameterValidationChains element under the parent -->

    <ParameterValidationChains>

        <!-- Each chain of validation rules is contained in a ParameterValidationDefinition element -->
        <!--
            This chain applies some global validation rules. If anyone supplies encoded parameters or
            parameters containing HTML characters, validation will fail.
         -->
        <ParameterValidationDefinition>
            <!-- This is the list of validation classes that should be applied to matching parameters -->
            <ParameterValidationRuleList>
                <ParameterValidationRule>
                    <!-- This is the fully qualified name of the class used to apply the validation rule -->
                    <!-- All input fields are to be trimmed of excess whitespace -->
                    <validationRuleName>com.matthewcasperson.validation.ruleimpl.TrimTextValidationRule</validationRuleName>
                </ParameterValidationRule>
                <ParameterValidationRule>
                    <!-- No parameters are expected to already be encoded -->
                    <validationRuleName>com.matthewcasperson.validation.ruleimpl.FailIfNotCanonicalizedValidationRule</validationRuleName>
                </ParameterValidationRule>
                <ParameterValidationRule>
                    <!-- No parameters are expected to contain html -->
                    <validationRuleName>com.matthewcasperson.validation.ruleimpl.FailIfContainsHTMLValidationRule</validationRuleName>
                </ParameterValidationRule>
            </ParameterValidationRuleList>
            <!-- This is a regex that defines which parameters will be validated by the classes above -->
            <paramNamePatternString>.*</paramNamePatternString>
            <!-- This is a regex that defines which URLs will be validated by the classes above -->
            <requestURIPatternString>.*</requestURIPatternString>
            <!--
                 Setting this to false means the paramNamePatternString has to match the param name.
                 Setting it to true would mean that paramNamePatternString would have to *not* match the param name.
             -->          
            <paramNamePatternNegated>false</paramNamePatternNegated>
            <!--
                 Setting this to false means the requestURIPatternString has to match the URI.
                 Setting it to true would mean that requestURIPatternString would have to *not* match the URI.
             -->
            <requestURIPatternNegated>false</requestURIPatternNegated>
        </ParameterValidationDefinition>        
    </ParameterValidationChains>
</ParameterValidationChainDatabase>

The XML has been commented to make it easier to understand, but there are a few interesting elements:
  • paramNamePatternString, which has been configured to enable the validation chain to match all parameters
  • requestURIPatternString, which has been configured to enable the chain to match all URIs
  • The three elements called validationRuleName, which reference the full class name of the validation rules that will be applied to each parameter passed into our web application
Although this is a simple example, the three validation rules that have been implemented (TrimTextValidationRule, FailIfNotCanonicalizedValidationRule and FailIfContainsHTMLValidationRule) are quite effective at preventing a malicious user from submitting parameters that contain XSS code.

The first rule, TrimTextValidationRule, simply strips away any whitespace on either side of the parameter. This uses the trim() function that any Java developer will be familiar with.

The second rule, FailIfNotCanonicalizedValidationRule, will prevent further processing if the supplied parameter has already been encoded. No legitimate user has a need to supply text like %3Cscript%3EdoEvil()%3B%3C%2Fscript%3E, so any time encoded text is found we simply return an HTTP 400 error code. This rule makes use of the ESAPI library supplied by OWASP.

Like the second rule, the third rule, FailIfContainsHTMLValidationRule, will prevent further processing if the supplied parameter contains any special HTML characters. If you need your customers to be able to submit characters like &, this rule will be too broad, but in most applications blocking special HTML characters outright is a reasonable default.
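To make the behaviour of the chain a little more concrete, here is a rough sketch of the kind of checks these three rules perform on a single parameter value. This is not the PVF source code (the real canonicalization check is done with ESAPI), and the class and method names are my own invention; it is only an illustration of the technique.

import java.util.regex.Pattern;

public class ValidationSketch {

    // Roughly what TrimTextValidationRule does: strip leading and trailing whitespace.
    static String trim(final String value) {
        return value == null ? null : value.trim();
    }

    // Roughly what FailIfNotCanonicalizedValidationRule guards against: values that
    // appear to be URL encoded already. The real rule uses ESAPI canonicalization;
    // this sketch just looks for percent escapes.
    private static final Pattern PERCENT_ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    static boolean looksEncoded(final String value) {
        return value != null && PERCENT_ESCAPE.matcher(value).find();
    }

    // Roughly what FailIfContainsHTMLValidationRule guards against: characters
    // with special meaning in HTML.
    static boolean containsHtmlCharacters(final String value) {
        return value != null && value.matches(".*[<>\"'&].*");
    }

    public static void main(final String[] args) {
        final String param = trim("  %3Cscript%3EdoEvil()%3B%3C%2Fscript%3E  ");
        if (looksEncoded(param) || containsHtmlCharacters(param)) {
            System.out.println("Reject the request with an HTTP 400");
        }
    }
}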

If you want to see how effective this simple validation chain is, check out the live demo at http://pvftest-matthewcasperson.rhcloud.com/. You may want to take a look at https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet to find some XSS patterns that are often used to bypass XSS filters.

Moving forward we will be looking to implement more targeted validation rules, especially those that can’t be easily implemented as regex matches (like making sure a date is after today, or that a number is between two values, etc.).

If you have any suggestions, or find any bugs, feel free to fork the code from our GitHub repo. We hope to get some public feedback in order to make this library as robust as it can be.

Saturday, July 19, 2014

Three Daily Things


Three Daily Things is a new, free website that I have put together based on a motivation app that I have had great personal success with.

The reason why I wrote this app is best explained with a story about myself.

Like most people, I want to be fit and healthy. To achieve this, I sign up at my local gym. But it doesn't take long for the unknowns to start rattling around in my brain. How many reps and sets should I do? Should I do cardio before or after resistance training? Should I be taking supplements? Is it best to work out in the morning or afternoon? Which exercises are best?

All these unknowns start to weigh heavily in my mind. I begin to wonder if I am wasting my time. I skip a few sessions, and before I know it I haven't visited the gym in weeks.

So many beneficial aspirations in my life have followed this path. I like the idea, I try it out, I get overwhelmed by the unknowns and eventually I give up.

So I asked myself, what was it that I really wanted out of my gym membership? The answer is right there in the first sentence: to be fit and healthy. And I realize that I can achieve this goal without actually knowing the best ratio of sets to reps, the most effective exercises and without taking supplements. All I need to do is walk through the gym door. Simply being in the gym day after day and doing some sort of exercise achieves my goal of being fit and healthy.

In a nutshell, my problem is that in seeking perfection I often end up with nothing. The solution was to convince myself that continuing to focus on these beneficial activities, even if I do them imperfectly, will eventually get me the result I am looking for.

That is why I created Three Daily Things. It is a way to remind myself that any time I spend working on the three things that I feel would most improve my life is worthwhile.

The app is deliberately simple. You list the three things that you feel would most improve your life, and when you spend any amount of time working on them, you give yourself a tick. The report allows you to quickly gauge how often you have made time (any time) for these three things over the last week.

For such a simple idea, my personal results were quite incredible. Prior to writing this app, I would estimate that I would have spent time on my three most important things maybe six or seven times a week, giving me a weekly score of around 30%. Using the app daily, my weekly score usually sits between 60% and 80%. That means I have easily doubled my focus on those three things that I know are improving my life.

This started out as a personal motivational tool, but it has worked so well that I decided to make the web site free to everyone in the hopes that others would find Three Daily Things as simple and effective as I have.




Friday, June 13, 2014

RHEL 7 VirtualBox Guest Additions Patched

If you have tried to use RHEL 7 in VirtualBox and run into the issue with the guest additions not compiling (see https://www.virtualbox.org/ticket/12638 for details), you can download this tar file, extract it, and run

sudo ./install.sh

The tar file is just the guest additions pre-patched to work with RHEL 7.

Friday, January 24, 2014

Scroll to the bottom of log files in web pages

We use supervisord on our systems, which has a handy browser based 'tail -f' feature. The problem is that the end of the log file appears off the bottom of the screen, and the browser won't scroll to the bottom automatically. This bookmarklet will keep the end of the page in view.

javascript:scroll=function(){setTimeout(function(){window.scrollTo(0,document.body.scrollHeight);scroll();},100);};scroll();

Saturday, January 11, 2014

If you are a knowledge gatekeeper, the game is changing

"We take comfort from Charles Darwin's observation that it's not the strongest species that survives, nor the most intelligent, but the ones most responsive to change. We just need to be adaptable." - Gary Pruitt, then the CEO of McClatchy Newspapers and now CEO of the Associated Press (http://www.slate.com/blogs/future_tense/2012/11/12/google_ad_revenue_tops_entire_us_print_media_industry_chart.html).

I recently spent some time looking into the stats for web content we were responsible for and noticed two things. The first was that the majority of visitors were viewing chunked versions of our books, where each individual web page aligned roughly with an entry in the table of contents. The second was that most visitors were spending only a couple of minutes reading the content in any given session.

This observation aligned very closely with my own information gathering process. Once or twice a year I'll make time to sit down and read a technical book from start to finish. This is usually because I am studying for a test or need to familiarise myself with a new technology. This compares to the dozens of web searches I do every single day to find answers to various issues that I run into as part of my daily job.

My productivity depends in large part on my ability to find answers to specific and usually isolated problems. I imagine a lot of information workers are in the same boat.

Right now the process of finding answers is to do a search in Google and then skim over the resulting links to try and find an answer. Or in other words, Googling.

The fact that Googling involves viewing the original content is very important to those that create, collate and profit from factual digital content (think wikis, knowledgebases, books, articles, blogs etc), because it means that they are the gatekeepers to their digital kingdoms. The answers people seek can be presented, restricted and sold in any way that best suits the knowledge gatekeepers.


If everything goes according to IBM’s plan, Watson will help kick off what CEO Ginni Rometty refers to as a third era in computing. The 19th century saw the rise of a “tabulating” era: the birth of machines designed to count. In the latter half of the 20th century, developers and scientists initiated the “programmable” era—resulting in PCs, mobile devices, and the Internet. The third (potential) era is “cognitive,” in which computers become adept at understanding and solving, in a very human way, some of society’s largest problems.

Knowledge is about to become commoditized. Since most people want answers to questions, this third era of computing is going to demote knowledge gatekeepers to funnels into systems like Watson. At best your original content will be presented as a footnote against the answer provided by Watson. At worst, no one will care where the answer came from, only that it is accurate.

Today it takes teams of highly skilled employees to create, curate, manage and navigate digital knowledge repositories designed to support vertical markets. Stick a support contract in front of that and you have the makings of a billion dollar business. In the not too distant future, some kid in a dorm room is going to be able to scrape every forum, wiki, book, bug report, chat log and code base that relates to your business or product, funnel it into a system like Watson, and produce a “Ask Watson” style web page that renders your once valuable digital knowledge repositories obsolete.
And if you think that your content is so ingrained, valuable or unique as to be immune from disruption by this third era of computing, I have shares in Encyclopaedia Britannica, print-based news media and bricks-and-mortar book shops that I would like to sell you.

Adaptation is the key to surviving the transition from valuable knowledge gatekeeper to anonymous information source. Because make no mistake, change is upon us right now, and history is littered with the corpses of businesses that missed the signs.

Sunday, January 05, 2014

Your grandkids are going to laugh at the notion of "searching the web"

In the next 5 to 10 years our good friend and faithful digital companion, the search bar, is going to ride off into the sunset.

Googling has forever changed how we find answers to almost anything we could ever want to know, because the answer is most likely there buried in the first 10 links brought up in response to anything you type into the search bar.

When you think about it though, having to trawl through those 10 links is actually not a great way to answer a question. If you were to ask your mechanic what was making that noise in your engine, you would want the answer, not rambling descriptions and discussions of engine design and maintenance. But that is exactly what you get if you ask Google the same question: pages of forum posts when you only want the last one that actually has the answer; a blog post where that engine noise is coincidentally mentioned in the comments; or a 10 minute YouTube video where the information you want is 9 minutes in.

That is all about to change. In the not too distant future you will type your query into a question bar, and the result will be a plain English answer. No more "this search returned 10000000000000 results"; no more clicking through web pages to find the information you are looking for; and, if Google isn't careful, potentially no more Googling.

We have seen a glimpse of this move from searching to asking with Wolfram Alpha. You enter some search criteria, Alpha identifies real world entities or concepts in the search query, and then displays a bunch of cross referenced information. It's a great tool if you are looking for something that is easily identifiable, like the manufacturer of your car, and possibly even your car's model. But Alpha is not going to tell you what that sound in your engine could be.

That's not to say that Wolfram is giving up. He has made some vague reference to a new knowledge based programming language:
We call it the Wolfram Language because it is a language. But it’s a new and different kind of language. It’s a general-purpose knowledge-based language. That covers all forms of computing, in a new way. 
Apple has made some bold moves away from searching with Siri. Granted, Siri is often just a speech-to-text converter for plain old web searches, but it is not hard to imagine Siri connecting the dots between the questions you asked about "that noise in my engine" and post number 13 in a forum conversation that mentions a fan belt, with the same ease with which she connects your request to set an alarm with the settings in your iPhone alarm clock app.

Microsoft is also making a move towards asking questions:
"Our vision of search is 'just ask'. Search is the ultimate human interface. It should be able to cope with any input."
And then there is IBM, which bested some of the brightest minds ever to grace the game show Jeopardy with its system Watson. Watson is probably the best public example of a system that can answer general questions with plain English responses, and that power is going to be available to developers:
Developers who want to incorporate Watson’s ability to understand natural language and provide answers need only have their applications make a REST API call to IBM’s new Watson Developers Cloud. “It doesn’t require that you understand anything about machine learning other than the need to provide training data,” Rob High, IBM’s CTO for Watson, said in a recent interview about the new platform.
Google has been fairly quiet about this post web search world, probably because it is of no benefit to them to suggest that searching is going anywhere, but their acquisition of data sources like Freebase, the sidebar that displays relevant information about any entities in your search query, and Google Now all indicate that they are looking to move beyond plain old web searches too.


So when my grandkids ask me what the internet was like when I was their age, I will tell them of the great information hunts I used to go on, carefully crafting my search queries and then stalking the answer through pages of unrelated and often unhelpful information. And they will laugh.

Friday, November 15, 2013

Three.js Tutorials

A tutorial series looking at the JavaScript 3D engine three.js.

Three.js - 1 - Getting Started Part One
Three.js - 2 - Getting Started Part Two

Three.js - 2 - Getting Started Part Two


Introduction

In the last tutorial we got a very basic three.js application up and running. This tutorial will continue on from that, adding a stats counter and dealing with window resizing.

Stats Counter

A number of the example three.js applications have a neat little stats counter showing the number of frames per second, and displaying a graph of the frame rate over the last few seconds. This information is invaluable when testing the performance of your 3D application.

The JavaScript library responsible for this stats counter is not actually part of three.js. It is a separate library that can be found on GitHub at https://github.com/mrdoob/stats.js/. The file that we need to include in our project can be found at https://github.com/mrdoob/stats.js/blob/master/build/stats.min.js. This JavaScript file is included like the others in the <head> element.

 <head>
  <title>Three.js - 2 - Getting Started 2</title>
  <link rel="stylesheet" type="text/css" href="css/style.css">
  <script type="application/javascript" src="javascript/stats.min.js"></script>
  <script type="application/javascript" src="javascript/three.min.js"></script>
  <script type="application/javascript" src="javascript/threejs-2.js"></script>
 </head>  
  
If you look at the code above, you'll notice that we have also included a link to a CSS file. The DOM element created by the stats.js library has the ID of stats. Knowing this, we can create a CSS rule to position the stats counter in the top left hand corner of the page.

#stats { position: absolute; top:0; left: 0 }  
  
With the stats.js library loaded and the DOM element styled, we now create the stats counter and add it to the HTML page in JavaScript. The following lines are added to the end of the init() function. Note that we are appending the stats element directly to the body of the HTML page, just like we did with the renderer in the last tutorial. Because we specified absolute positioning in the style that is applied to the stats element, it will not interfere with the position of the renderer.

 stats = new Stats();
 document.body.appendChild( stats.domElement );  
  
The final step is to update the stats counter every time a frame is rendered. This is done with the following code added to the end of the animate() function.

 stats.update();    

Window Resizing

Take a moment to run the example from the last tutorial, and after it has been loaded, resize the window. Notice that the 3D cube does not change its position relative to the top left hand corner of the screen. Obviously this is not ideal.

Fixing this requires the camera's aspect ratio and projection matrix, and the renderer's size, to be updated if the window is resized. We can specify a function to be run in the event that the window is resized by adding an event listener to the window's resize event.

 window.addEventListener( 'resize', onWindowResize, false );  
  
The onWindowResize() function is where we update the camera's aspect ratio and projection matrix, and the renderer's size.

function onWindowResize() {
 camera.aspect = window.innerWidth / window.innerHeight;
 camera.updateProjectionMatrix();
 renderer.setSize( window.innerWidth, window.innerHeight );
}  
  
When running the example supplied with this tutorial, it is possible to resize the window and still have the cube placed in the middle of the screen.

Tips

If you click on the stats counter, you will switch between two views. The first displays the frames per second, while the second displays the milliseconds per frame.

Thursday, November 14, 2013

MinHash for dummies

Duplicate document detection is becoming increasingly important for businesses that have huge collections of email, web pages and documents with multiple copies that may or may not be kept up to date.

MinHash is a fairly simple algorithm that, from all my Googling, has been explained very poorly in blogs, or else in the kind of mathematical terms that I forgot long ago. So in this article I will attempt to explain how MinHash works at a practical code level.

Before I start, please take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf. That document goes into a lot of theory, and was ultimately where my understanding of MinHash came from. It approaches the algorithm from a theoretical standpoint, so if I gloss over some aspect of the MinHash algorithm here, you will almost certainly find a fuller explanation in the PDF.

I'll also be using pseudo Java in these examples instead of traditional maths notation. This means when I use terms like Set, I am referring to the group of classes that implement a Set in Java. Of course a set in Java shares a lot of properties with a set in maths, but keep in mind that this is written from a developer's point of view.

What does it mean when we talk about document similarity?  

The whole point of MinHash is to efficiently answer the question "How similar is this document to that document".

Before we can answer that question, we need to look at a simpler example of two sets of strings and determine how similar they are.

Here are our two sets:

Set a = new Set(["chair", "desk", "rug", "keyboard", "mouse"]);
Set b = new Set(["chair", "rug", "keyboard"]);

A measure of how similar these two sets are is known as the Jaccard Coefficient. It is calculated as the number of common elements / (total number of elements - number of common elements), which is just the size of the intersection divided by the size of the union.

In our case we have 3 common elements: "chair", "rug" and "keyboard". We have a total of 8 elements. So the Jaccard Coefficient  = 3 / (8 - 3) = 0.6, or 60%. http://www.planetcalc.com/1664/ is an online calculator that you can play around with to determine the similarity of two sets.
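In real Java (as opposed to the pseudo Java above) the calculation looks something like this; the class name is my own, and the intersection divided by the union gives the same 0.6 result:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardExample {

    // Jaccard Coefficient = size of the intersection / size of the union
    static double jaccard(final Set<String> a, final Set<String> b) {
        final Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        final Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(final String[] args) {
        final Set<String> a = new HashSet<String>(Arrays.asList("chair", "desk", "rug", "keyboard", "mouse"));
        final Set<String> b = new HashSet<String>(Arrays.asList("chair", "rug", "keyboard"));
        System.out.println(jaccard(a, b)); // prints 0.6
    }
}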

Documents as sets

While a document can be thought of as a giant set of words, we don't just break a document down into individual words, place them in a set and calculate the similarity, because that loses the importance of the order of the words. The sets

Set a = new Set(["I", "went", "to", "work", "today"]);
Set b = new Set(["today", "I", "went", "to", "work"]);

would be considered 100% similar, even though they clearly are not.

Instead we break a document down into what are known as shingles. Each shingle contains a set number of words, and a document is broken down into (total words - shingle length + 1) shingles. So if a document contained a single sentence of "The quick brown fox jumps over the lazy dog", that would be broken down into the following 5 word long shingles:
  1. The quick brown fox jumps
  2. quick brown fox jumps over
  3. brown fox jumps over the
  4. fox jumps over the lazy
  5. jumps over the lazy dog
So our document as a set looks like:

Set a = new Set(["The quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", "fox jumps over the lazy", "jumps over the lazy dog"]);

These sets of shingles can then be compared for similarity using the Jaccard Coefficient.
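A quick sketch of the shingling step in Java (again, the class and method names are mine) might look like this:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ShingleExample {

    // Break a document into shingles of shingleLength consecutive words.
    static Set<String> shingles(final String document, final int shingleLength) {
        final String[] words = document.split("\\s+");
        final Set<String> result = new LinkedHashSet<String>();
        for (int i = 0; i <= words.length - shingleLength; ++i) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + shingleLength)));
        }
        return result;
    }

    public static void main(final String[] args) {
        // Prints the five shingles listed above
        System.out.println(shingles("The quick brown fox jumps over the lazy dog", 5));
    }
}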

Optimizing the process

So we now have a way to compare two documents for similarity, but it is not an efficient process. To find documents similar to document A in a directory of 10000 documents, we need to compare each pair individually. This is obviously not going to scale well.

What we can do to reduce some cycles is compare sets of randomly selected shingles from two documents. So for a document that is 10000 words long, we break it down into 9996 shingles, and randomly select say 200 of those to represent the document. If the second document is 20000 words long, we boil that down from 19996 shingles to another set of 200 randomly selected shingles.

Now instead of matching 9996 shingles to 19996 other shingles, we are comparing 200 shingles to another 200 shingles. We have reduced our workload at the expense of accuracy, but this is still good enough for most scenarios.

NOTE
So just how much accuracy did we sacrifice by comparing only a small subset of randomly selected shingles? I actually don't know, but I assume it sits on a bell curve where the number of random selections improves the accuracy to a point where more random selections won't provide any meaningful increase in accuracy. If you know the math behind this, please leave a comment.

I chose 200 random selections after reading http://blog.cluster-text.com/tag/minhash/.

Computing and saving the random shingle selections

So this is where MinHash comes in.

We could save the 200 randomly selected shingles, but storing variable length strings kinda sucks. It complicates database design and comparing strings is slow.

The clever part about MinHash is that it allows us to save integers instead of strings, and also takes the hassle out of randomly selecting shingles. It works like this.

1. Break down the document into a set of shingles.
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 min hash values.

At this point instead of 200 randomly selected shingles, we have 200 integer hash values. This is effectively a random selection of shingles, because a good hash algorithm is supposed to generate a number that is as likely to be large as it is likely to be small. This kind of distribution of hash codes is what hashing is all about, so selecting the smallest hash is, for all intents and purposes, a random selection of a shingle.

Um, 199 more hash algorithms? WTF!

So we have the String.hashCode() method, but where do these other 199 hash codes come from? This is the step that most explanations of the MinHash algorithm conveniently leave out.

The short answer is that you XOR the value returned by String.hashCode() with 199 random numbers to generate the 199 other hash code values. Just make sure that you are using the same 199 random numbers across all the documents.

If you want some more detail on why this works, see http://stackoverflow.com/a/19711615/157605.
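Putting steps 1 to 4 together with the XOR trick, the signature calculation might look something like this. This is only a sketch, and the class and method names are my own:

import java.util.Arrays;
import java.util.Random;
import java.util.Set;

public class MinHashSignature {

    private static final int HASH_FUNCTIONS = 200;
    private final int[] randoms = new int[HASH_FUNCTIONS];

    MinHashSignature(final long seed) {
        // The same random numbers must be reused for every document,
        // so generate them from a fixed seed.
        final Random random = new Random(seed);
        randoms[0] = 0; // XOR with 0 leaves the plain hashCode() as the first hash function
        for (int i = 1; i < HASH_FUNCTIONS; ++i) {
            randoms[i] = random.nextInt();
        }
    }

    // One minimum hash value per hash function.
    int[] signature(final Set<String> shingles) {
        final int[] minHashes = new int[HASH_FUNCTIONS];
        Arrays.fill(minHashes, Integer.MAX_VALUE);
        for (final String shingle : shingles) {
            final int hash = shingle.hashCode();
            for (int i = 0; i < HASH_FUNCTIONS; ++i) {
                minHashes[i] = Math.min(minHashes[i], hash ^ randoms[i]);
            }
        }
        return minHashes;
    }
}

The similarity of two documents can then be estimated by counting how many of the 200 positions in their signatures hold the same value.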

Great, but I still have to compare every document to every other document. Can this be optimized?

This is where the second neat trick of MinHash comes in. 

Locality sensitive hashing (LSH) involves generating a hash code such that similar items will tend to get similar hash codes. This is the opposite of what .hashCode() does.

LSH allows you to precompute a hash code that is then quickly and easily compared to another precomputed LSH hash code to determine if two objects should be compared in more detail or quickly discarded.

We don't actually calculate an LSH hash code as such, but the idea of boiling down a complex object (like our collection of minhash codes) into something that is quickly and easily compared with other complex objects is still applicable.

What we do is take the collection of minhash values and categorize them into bands and rows. For the sake of simplicity we'll assume each document has 20 minhash values. We will break them down into 5 bands with 4 rows each. Also for the sake of simplicity we'll assume that .hashCode() (and any number XORed with it) results in a number between 0 and 9.

So conceptually the minhash values from a few documents in the first of the 5 bands and its rows look like this.

Band One

                  MinHash Algorithm One   MinHash Algorithm Two   MinHash Algorithm Three   MinHash Algorithm Four
Document One                1                       3                        6                         0
Document Two                2                       3                        1                         0
Document Three              1                       3                        6                         0
Document Four               2                       1                        3                         1
What we are looking for are rows within a band that are the same between documents. Document One and Document Three in this case both have rows with values 1, 3, 6, 0. This tells us that these two documents should be compared for their similarity.

Although it is not shown here, you should look through Bands Two - Five looking for documents that share rows. Any documents that share rows in any bands should be compared for their similarity. 
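If the signatures are held in memory rather than in SQL, grouping documents into candidate buckets per band could be sketched like this (names are mine; any bucket holding more than one document name yields the pairs that should be compared with the Jaccard Coefficient):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BandingExample {

    // Bucket documents by the contents of each band. Documents that end up in
    // the same bucket for any band are candidates for a full comparison.
    static Map<String, List<String>> bucketsByBand(final Map<String, int[]> signatures,
            final int bands, final int rowsPerBand) {
        final Map<String, List<String>> buckets = new HashMap<String, List<String>>();
        for (final Map.Entry<String, int[]> document : signatures.entrySet()) {
            for (int band = 0; band < bands; ++band) {
                final int[] bandRows = Arrays.copyOfRange(
                        document.getValue(), band * rowsPerBand, (band + 1) * rowsPerBand);
                final String key = band + ":" + Arrays.toString(bandRows);
                buckets.computeIfAbsent(key, k -> new ArrayList<String>()).add(document.getKey());
            }
        }
        return buckets;
    }
}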

This is useful when you have a document and you want to know which other documents should be compared to it for similarity. I'll assume that the minhash values are stored in SQL, so to get our candidates for comparison you would write:

SELECT DocName 
FROM Documents
WHERE (
BandOneMinHashAlgorithmOne = 1 AND
BandOneMinHashAlgorithmTwo = 3 AND
BandOneMinHashAlgorithmThree = 6 AND
BandOneMinHashAlgorithmFour = 0
) OR
(
BandTwoMinHashAlgorithmOne = # AND
BandTwoMinHashAlgorithmTwo = # AND
BandTwoMinHashAlgorithmThree = # AND
BandTwoMinHashAlgorithmFour = #
) OR (
BandThreeMinHashAlgorithmOne = # AND
BandThreeMinHashAlgorithmTwo = # AND
BandThreeMinHashAlgorithmThree = # AND
BandThreeMinHashAlgorithmFour = #
) OR (
BandFourMinHashAlgorithmOne = # AND
BandFourMinHashAlgorithmTwo = # AND
BandFourMinHashAlgorithmThree = # AND
BandFourMinHashAlgorithmFour = #
) OR (
BandFiveMinHashAlgorithmOne = # AND
BandFiveMinHashAlgorithmTwo = # AND
BandFiveMinHashAlgorithmThree = # AND
BandFiveMinHashAlgorithmFour = #
)

This explanation completely glosses over why this works and what combination of bands and rows you should construct when looking for similar documents. If you want to get into that detail, take a look at http://infolab.stanford.edu/~ullman/mmds/ch3.pdf.

Sunday, November 10, 2013

Three.js 1 - Getting Started


THREE.JS TUTORIALS

VIEW THE DEMO

DOWNLOAD THE SOURCE CODE

Introduction

After Apple's war on Flash, relying on plugins in your web pages is just not a viable option. Flash (and maybe, thanks to Netflix, Silverlight) were the last of a dying breed of browser plugins. In their place are a host of new technologies being standardised as part of HTML5.

WebGL is one of these technologies, and is supported by all major browsers. Microsoft is even getting on board with IE 11. Combine that with the ever increasing performance of JavaScript engines and it is now possible to create some impressive 3D applications in a web browser without the aid of plugins.

Three.js is one of the more powerful and mature JavaScript 3D engines available. This article will show you how to get a simple 3D application running with three.js.

What to download

Using three.js requires only a single JavaScript file called three.min.js. You can grab the latest version of the file from GitHub.

The HTML page

A three.js application is a DOM element that can be appended as a child of an element in your HTML page. This means that you can embed a 3D application in a web page like it was an image or video file. This gives you a great deal of flexibility when presenting your 3D application.

However, for this example we will create a very basic HTML page and add the three.js application as the only child. While basic, this configuration is quite suitable for full screen games or simulations.

As you can see, this is a very simple HTML page. The only code of importance here is the references to the two JavaScript files: the three.js library (three.min.js), and the JavaScript file that will hold our application (threejs-1.js).

<html>
 <head>
  <title>Three.js - 1 - Getting Started</title>
  <script type="application/javascript" src="javascript/three.min.js"></script>
  <script type="application/javascript" src="javascript/threejs-1.js"></script>
 </head>
 <body>
  
 </body>
</html>  

The Components of a 3D Application

There are five important classes in three.js that make up a basic 3D application: the camera, the scene, the meshes, the geometries and the materials.

The Camera

This is exactly what it sounds like. The 3D world created by three.js is viewed as if through the lens of a video camera. In fact the camera has methods like .setLens() to set properties like focal length and frame size, which relate to a camera's field of view.

The Scene

The scene is the world that you will populate with 3D objects, lights and cameras.

The Meshes

A mesh defines the individual characteristics of a 3D object. These include things like the position, rotation and scale of the object. A mesh will reference a geometry to define the shape of the 3D object, and a material to define its appearance.

What is important to note about meshes is that they can share geometries and materials. For example, to render a forest with a thousand 3D trees, you would need a thousand meshes, but potentially only 3 or 4 geometries (to represent the different types of trees) and 1 material.

The Geometries

The shape of a 3D object is defined by a geometry. Geometries define the various vertices and faces that make up a 3D model. Simple applications can make use of the built-in geometry classes that describe a sphere, cube, cylinder and other common shapes. More complex 3D applications will load geometries created in 3D modelling applications.

The Materials

While a geometry defines the shape of a 3D object, the material defines the appearance. Materials can be as simple as a plain colour, or they can be complex shaders with reflections or animated appearances.

Putting it all Together

We start by assigning a function to the window.onload event. This is good practice when writing any JavaScript that manipulates the HTML DOM, as it is possible to start playing with the DOM before it is completely loaded. If you use jQuery, the ready() function is even better as it will allow your code to manipulate the DOM without having to wait for images to download. Since our page has no images, and we are not using jQuery, the window.onload event will work just fine.

The function assigned to the window.onload event will call two additional functions: init() and animate().

window.onload = function(){
 init();
 animate();
};  

The init() function is where we set up the components that make up the 3D application.

We create the camera, which is an instance of the PerspectiveCamera class. The 4 parameters supplied to the constructor are

fov - Camera frustum vertical field of view.
aspect - Camera frustum aspect ratio.
near - Camera frustum near plane.
far - Camera frustum far plane.
We then set the position of the camera down the z axis. This will allow us to position a 3D object at the origin and be able to view it.

 camera = new THREE.PerspectiveCamera( 75, window.innerWidth / window.innerHeight, 1, 10000 );
 camera.position.z = 1000;

Next we create some geometry. Three.js provides a number of geometry classes that define a number of common shapes. Here we use the CubeGeometry class to create the geometry for a cube. The supplied parameters define the size of the cube along the three axes.

 geometry = new THREE.CubeGeometry( 200, 200, 200 );   

In order for the 3D object to be visible, it needs a material. The MeshBasicMaterial class allows us to create a simple material without the use of any external image files. The MeshBasicMaterial constructor takes an object whose properties define the appearance of the material. In this case we have set the colour to red, and specified that the material should display as a wireframe.

 material = new THREE.MeshBasicMaterial( { color: 0xff0000, wireframe: true } );  

The mesh uses the geometry and material to create the 3D object that we will see on the screen.

mesh = new THREE.Mesh( geometry, material );   

Now that we have a 3D object to view, we need to create the scene that will hold it. The mesh is then added to the scene.

 scene = new THREE.Scene();
 scene.add( mesh );

Three.js can use a number of technologies to draw the 3D application within the browser. Each is provided by a renderer class. These different renderers are provided because not all browsers support all the HTML technologies that are available to libraries like three.js. For example, Internet Explorer won't support WebGL until version 11.

We have used the WebGL renderer here. This code will fail if it is run on any version of IE before 11, and future tutorials will provide workarounds for older browsers. For now though make sure you run the demo on Firefox or Chrome.

Because our application will take up the full browser window, we set the size of the renderer to the size of the browser window.

 renderer = new THREE.WebGLRenderer();
 renderer.setSize( window.innerWidth, window.innerHeight );   

The final step is to attach the renderer to the HTML DOM. In this case the renderer is attached directly as a child of the HTML body.

 document.body.appendChild( renderer.domElement );   

At this point we have created all the objects required by the 3D application. The next step is to animate the scene.

Just like a flip book, animating a 3D application involves quickly taking snapshots of the 3D objects and displaying those snapshots on the screen.

HTML pages used to be fairly static, and only updated in response to some kind of user interaction (like clicking a button or submitting a form). This model doesn't work well for applications like the one we are creating, which expect to be able to update the browser page with each frame, independently of any explicit user interaction.

Modern browsers support this kind of application through the requestAnimationFrame function. This function takes another function as a parameter, and will call the supplied function when the browser page is available to be refreshed. The requestAnimationFrame function is not supported by all browsers, but three.js includes a shim (or workaround) for older browsers, saving us from having to deal with these incompatibilities.

So the animate() function is first called when the window is loaded, and is then scheduled to be recursively called each time we render a frame.

 requestAnimationFrame( animate );   


So that we have something to look at, the cube is rotated slightly with each frame.

 mesh.rotation.x += 0.01;
 mesh.rotation.y += 0.02;  


Finally, the renderer takes the scene and the camera, and renders the current frame to the screen.

 renderer.render( scene, camera );  
 

Final Words

The end result of this tutorial has been to display a single spinning 3D cube in a web page. While this is quite a trivial example of the three.js library, it does cover all the basic elements of a 3D application, and provides a useful starting point for building more complex 3D applications.

Monday, September 23, 2013

Stopping emails in the new Gmail tabs from showing up in new mail counts

I love the new tabbed interface in Gmail. To be blunt, email clients have done a very poor job of dealing with the increasing volume and variety of email that people get nowadays, and it is nice to see Gmail implementing simple and effective solutions.

The one problem I have with the new interface is that email that previously just went to spam is now being categorised and filed under these tabs, and its existence is contributing to the number I see in my Chrome Gmail monitoring extension. It became impossible to know when I had mail that I was actually interested in, because the little icon in my browser was always showing 100+ new emails.

I initially tried to set up a filter, but nothing I could type into the new filter dialog would allow me to mark emails sent to the categories as read.

Nothing you can type in here will let you create a filter based on the categories.
But I soon twigged that you can create a filter from a search, and you can search based on categories.

Do a search with the categories: prefix.

Drop the search menu down again, and the Create filter with this search link is now active!
And with that, any email sent to one of the new tabs can be marked as read, leaving your mail monitoring extensions showing only the actual inbox count.



Friday, July 05, 2013

Exporting from MySQL to H2

When it comes to testing, you can't beat the convenience of an embedded database. It can be created, populated and cleaned up without any dependencies on outside infrastructure, making it ideal for self contained test suites.

However, there is a very good chance that you don't use an embedded database in your production environment. In my case, I have a MySQL database in production, and want to use H2 in testing.

Getting data from MySQL into H2 is not as straightforward as you might think. You might have some luck with CopyDB, Scriptella, or OpenDBCopy. But in the end I simply used mysqldump and a couple of hacks to get data from MySQL to H2.

Dump the data

Dump the data from MySQL with the following command:

mysqldump --compatible=ansi,no_table_options,no_field_options,no_key_options --hex-blob --skip-opt -uroot -p Database > importansi.sql

This will strip out most of the MySQL specific syntax and give you something that you can clean up for H2 to import.


Fix up single quotes

MySQL will escape a single quote with a backslash, like \'. H2 expects to have single quotes escaped with another single quote, like ''. There has been a request to have this kind of escaping included as an option in MySQL, but it has not been implemented. So you'll have to do a find and replace for yourself.

Fix up hex numbers

H2 doesn't import hex numbers properly. To fix that up, you'll need to find any hex numbers and replace them with strings.

If you have a text editor that can find and replace with regular expressions, this can be done by finding 0x([A-F0-9]*) and replacing it with '$1'.

Fix up bits

The default value for a bit in MySQL will be represented as b'0'. Replace this with a plain 0 for H2.
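The quote, hex and bit fixes above are mechanical enough to script. Here is a minimal Java sketch, assuming the dump fits in memory; the input file name comes from the mysqldump command above, while the output file name and class name are made up:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpFixer {
    public static void main(final String[] args) throws IOException {
        String sql = new String(Files.readAllBytes(Paths.get("importansi.sql")), StandardCharsets.UTF_8);

        // Escape single quotes the H2 way ('' instead of \')
        sql = sql.replace("\\'", "''");

        // Replace hex literals like 0xABCD with quoted strings
        sql = sql.replaceAll("0x([A-F0-9]+)", "'$1'");

        // Replace bit defaults like b'0' with a plain number
        sql = sql.replaceAll("b'([01])'", "$1");

        Files.write(Paths.get("import-h2.sql"), sql.getBytes(StandardCharsets.UTF_8));
    }
}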

Don't include ranges in keys

MySQL has the ability to define a range on a key, like KEY "ATextFieldKey" ("ATextField"(255)). This doesn't work in H2, so replace it with KEY "ATextFieldKey" ("ATextField").


Remove character sets

Remove any character set information on a field like CHARACTER SET latin1.


Remove COLLATE settings

Remove any collate information on a field like COLLATE utf8_unicode_ci.


Remove indexes on BLOBS, CLOBS and TEXT fields

H2 does not support indexes on these fields. Queries on these fields will be slower, but you should be able to take that into account in your tests.


Make all index names unique

MySQL only requires that an index name be unique within a table. H2 will require that it be unique within the database. So change any duplicated index names before importing into H2.


Use the MySQL compatibility mode

Even though we have dumped the MySQL database using the ANSI compatibility mode, we still need to enable the MySQL compatibility mode in H2. This is done through a connection URL that looks like jdbc:h2:~/test;MODE=MySQL.

Reorder the table creation sequence 

MySQL will dump tables in alphabetical order. When it imports a database dump, commands embedded in comments instruct MySQL to ignore relationships, allowing it to create tables with foreign keys to tables that have not yet been created.

H2 does not have the same ability to ignore relationships (I tried SET REFERENTIAL_INTEGRITY FALSE, but it didn't work), and so you'll have to do some copying and pasting to make sure that any table that is pointed to in a relationship exists before the relationship is created.

If you have circular dependencies, you'll have to remove the foreign key creation from the CREATE TABLE command, and create the foreign keys only after both tables have been created.

For those that want to automate this process, fork the repo holding this script. It has saved us quite a few hours worth of work.

Wednesday, July 03, 2013

Setting up a MariaDB Galera cluster in Centos 6

 Synopsis

One of the big draw cards of MariaDB over MySQL is open source support for clustering. While the open source version of MySQL allows for replication amongst a number of read only nodes, MariaDB has true multi-master replication.

If you have read the documentation at https://kb.askmonty.org/en/getting-started-with-mariadb-galera-cluster/, you could be forgiven for believing that setting up a cluster is a trivial process. Unfortunately those instructions leave out a few vital details that will prevent your cluster from working. These are the steps I used to get a MariaDB Galera Cluster running in Centos 6.


Configure the YUM repo


Go to https://downloads.mariadb.org/mariadb/repositories/ and select Centos as the OS, and 5.5 as the version. Galera (the clustering component of MariaDB) is only available with version 5.5 at this time.

I have the 64 bit version of Centos, so the resulting file looks like this:

# MariaDB 5.5 CentOS repository list - created 2013-07-03 04:40 UTC
# http://mariadb.org/mariadb/repositories/
[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/5.5/centos6-amd64
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1


The file for the 32 bit version of Centos looks like:

# MariaDB 5.5 CentOS repository list - created 2013-07-03 04:46 UTC
# http://mariadb.org/mariadb/repositories/
[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/5.5/centos6-x86
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1


You'll want to save this file into /etc/yum.repo.d/MariaDB.repo.


Install MariaDB


MariaDB is a drop in replacement for MySQL. If you have installed the desktop packages for Centos, chances are you have MySQL installed already. To uninstall it, run the following command:

yum erase *mysql*


Installing MariaDB Galera is done with the following command:

yum install MariaDB-Galera-server MariaDB-client setroubleshoot

You will notice that we are also installing the SELinux Troubleshooter. This will come in handy later.


Initial Database Configuration


We need to start the MariaDB service. Because it is a drop in replacement for MySQL, you use the same command to start the server as you would to start MySQL.

service mysql start

Upgrading the databases fixes an issue where MySQL Workbench will simply say "fetching..." and will never list the tables available in the database.

mysql_upgrade

It is good practice to secure the new installation. MariaDB provides a script that will run you through the process.

mysql_secure_installation

We also need to setup some passwords. To do that we load the mysql command line client, and log into our database.

mysql -hlocalhost -uroot

The following commands will set passwords on the root account, both when accessed locally, and when accessed from a remote system. Obviously you can change the password in these commands to something a little more secure, but for this tutorial we'll use 'password'.

GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost' IDENTIFIED BY 'password' WITH GRANT OPTION;

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;


FLUSH PRIVILEGES;

exit


The last step is to shutdown MariaDB

service mysql stop


Open the network ports


A MariaDB cluster requires ports 3306 and 4567 to be open. If you use rsync to do the state transfers, you'll also need port 4444 open. You can open them by running the command:

system-config-firewall-tui

In the other ports section, you need to open up 3306, 4444 and 4567, all with the tcp protocol.





Fix SELinux Issues


The quick way to fix this is to disable SELinux with the command:

setenforce 0

If you want to leave SELinux enabled, you'll need to restart Centos to allow the SELinux Troubleshooter we installed earlier to pick up on any violations.

When you attempt to run MariaDB in the steps below, SELinux Troubleshooter will report three errors.








Clicking the Details button on each of these alerts provides the commands you need to run to allow MariaDB to run with SELinux enabled.








Start the cluster


MariaDB requires 3 nodes to effectively provide fault tolerance. We'll get two running, but the process is the same for any additional nodes.

Start the first server with the command

mysqld --wsrep_cluster_address=gcomm:// --user=mysql --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep_sst_auth=root:password

The --wsrep_cluster_address=gcomm:// option defines this as the initial node in the cluster.

The --user=mysql option tells MariaDB to start under the mysql user

The --wsrep_provider=/usr/lib64/galera/libgalera_smm.so option tells MariaDB where the Galera library is.

The --wsrep_sst_auth=root:password option provides the credentials that MariaDB requires when syncing the initial state of the cluster to newly joined nodes. Change the password to reflect the password you set earlier.

Remember that if you have SELinux enabled, it is at this point that the SELinux Troubleshooter should be notifying you of issues. If not, make sure you have restarted Centos to allow the troubleshooter to catch violations.


Joining the cluster


With the first node up, run this command on the second node to join it:

mysqld --wsrep_cluster_address=gcomm://192.168.0.1 --user=mysql --wsrep_provider=/usr/lib64/galera/libgalera_smm.so  --wsrep_sst_auth=root:password

The only change here is the --wsrep_cluster_address=gcomm://192.168.0.1 option. We have assumed that the first node has an IP address of 192.168.0.1, and have requested that this second node connect to the first node.

NOTE
One way to avoid editing my.cnf in order to remove gcomm:// is to start all cluster nodes with the following URL: gcomm://node1,node2:port2,node3?pc.wait_prim=no&...


(note the 'pc.wait_prim=no' option, which makes the node wait for the primary component indefinitely.) Then bootstrap the primary component by running SET GLOBAL wsrep_provider_options="pc.bootstrap=1"; on any one of the nodes.

Testing the cluster


At this point you should be able to make changes to the database tables on one server and see them reflected on the second server. Your cluster is now up and running!

Making the changes permanent


All the configuration has been defined in command line options, which means that if one of the servers went down, it would not reconnect to the cluster.

To save the settings we supplied on the command line, edit the /etc/my.cnf file, and add the following lines to the end of it. This configuration allows the first node to reconnect to the second.

[mariadb]

wsrep_cluster_address=gcomm://192.168.0.2

wsrep_provider=/usr/lib64/galera/libgalera_smm.so

wsrep_sst_auth=root:password

wsrep_node_address=192.168.0.1

log-error=/var/log/mysql.log


The wsrep_node_address option defines the IP address of the local host. This is optional, but when I ran MariaDB on my VM, it reported that it could not automatically determine the IP address, which is why it is hard coded in this option.

The log-error option defines the log file that MariaDB will write to.

All the other options are exactly the same as what was supplied on the command line.

With this configuration in place you can kill the mysql instance that was started from the command line. The second node is still up, and so the cluster is still up. The first node can then be restarted with the command:

service mysql start

This will start MariaDB and reconnect it to the second node, and thus to the cluster.

We can then follow the same process for the second node. Of course you'll need to change the IP addresses from those listed above. The configuration to add to the /etc/my.cnf file for the second node looks like:

[mariadb]

wsrep_cluster_address=gcomm://192.168.0.1

wsrep_provider=/usr/lib64/galera/libgalera_smm.so

wsrep_sst_auth=root:password

wsrep_node_address=192.168.0.2

log-error=/var/log/mysql.log


All we have done is change the node it connects to, and the IP address of the local host.

The mysql instance started from the command line for the second node can then be killed and restarted as a service. It will then rejoin the first node.

Restarting the cluster 

As long as one node is up, the cluster will remain active, and the node that is down will have something to connect to when it starts up. But what happens when all nodes go down?

At that point you'll have to restart the cluster on the node that has the version of the database you want to use. So if node 1 went down, and then a day later node 2 went down, it is most likely that node 2 will have the copy of the database you want to keep, and so node 2 will be used to restart the cluster.

Restarting the cluster is best done from the command line. Since all our settings are now stored in the /etc/my.cnf file, restarting the node as the initial node in the cluster can be done with the following command: 

mysqld --wsrep_cluster_address=gcomm:// --user=mysql

Just like before, an empty address in the wsrep_cluster_address option defines this node as the initial node in the cluster.

Node 1 can then be started as a service, and it will connect to node 2, and the cluster will be running once again.

You probably don't want to have a terminal window open with node 2 running, so once node 1 is running you can kill node 2, and restart it as a service. Node 2 will connect to node 1, and the cluster is back in action. 


Troubleshooting

Name Resolution


If the node trying to connect to an established cluster is reporting the error:

130307 15:14:52 [Warning] IP address '192.168.0.1' could not be resolved: Name or service not known

And the node it is trying to connect to is displaying the error:

'wsrep_sst_mysqldump --user 'root' --password 'password' --host '192.168.0.2' --port '3306' --local-port '3306' --socket '/var/lib/mysql/mysql.sock' --gtid '4c754641-e45a-11e2-0800-425dfc14f8f4:390'' failed: 1 (Operation not permitted)

Then try adding the following option to the nodes' configuration files:

skip-name-resolve

Be aware that by doing this, MariaDB will no longer use credentials with host names. That means you'll have to configure the password on root@127.0.0.1 instead of root@localhost.

Credentials

If the node trying to connect to an established cluster is reporting the error:

WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():719: Will never receive state. Need to abort.

And the node it is trying to connect to is displaying the error:

[ERROR] WSREP: Try 1/3: 'wsrep_sst_mysqldump --user 'root' --password 'password' --host '192.168.0.1' --port '3306' --local-port '3306' --socket '/var/lib/mysql/mysql.sock' --gtid '4c754641-e45a-11e2-0800-425dfc14f8f4:420'' failed: 1 (Operation not permitted)
ERROR 1045 (28000): Access denied for user 'root'@'192.168.0.2' (using password: YES)


Make sure the account on the node trying to connect to the cluster is correct. The established node that is being connected to is trying to send the database state to the connecting node, and is failing to do so because of an authentication error.

Rsync

If you see the error on the established node of

rsync: failed to connect to 192.168.0.2: No route to host (113)

It means that the default state transfer method is Rsync, not mysqldump. To use mysqldump, add the setting

wsrep_sst_method=mysqldump

Or you can open firewall port 4444.

Rsync Wan

If you see the error

sh: wsrep_sst_rsync_wan: command not found

in the log files of the node trying to connect to a cluster, run

cd /usr/bin
ln -s wsrep_sst_rsync wsrep_sst_rsync_wan


File Access

If you get the error File '/var/lib/mysql/aria_log_control' not found, make sure that the file is owned by the user mysql. I ran into this issue after running mysql_install_db as root, which meant that the file was owned by root.

Tips and Tricks

Initial Setup

You can find out how large your SQL database is with the command:

SELECT table_schema,
  sum(data_length) / 1024 / 1024 "data",
  sum(index_length) / 1024 / 1024 "index",
  sum( data_length + index_length ) / 1024 / 1024 "total"
FROM information_schema.TABLES
GROUP BY table_schema;


This value includes all the indexes, and represents the amount of data MariaDB will transfer to an empty node when it joins the cluster.

To save yourself some time, initialise the nodes that will join the cluster by dumping the SQL data into a file with the command:

mysqldump -u root -ppassword --all-databases > all-database.sql

And then restoring it on the node to be connected with

mysql -u root -ppassword < all-database.sql 

Because the backup will not contain the indexes, and can be compressed, you'll find that you'll be able to get a remote node up to a relatively recent initial state far more quickly than attempting to sync the entire database through the MariaDB state transfer. In my case a 5GB database could be backed up into a 600 MB compressed SQL file.

It doesn't matter if the database changes between when you backed it up and when the node connects, because the state transfer will take care of that for you.

Save Bandwidth

Use the setting

wsrep_sst_method=rsync_wan

to save some bandwidth for state transfers over a WAN connection.