Simple Data Science in Java

Fri 05 January 2018
Canada Open Data

Canada has published thousands of datasets on the Canada Open Data Portal. The site holds a lot of useful data, and its page on Application Programming Interfaces explains how to use it. There is also an Apps Gallery that lists apps built using the Open Data.

For Java developers, here are some tools and techniques that make working with the data much more productive. We will mostly use Morpheus DataFrames plus some custom code to explore the data.

Infrastructure Projects

Let's get the list of Infrastructure Projects. This is just a CSV, and it is fairly straightforward to get and print using a Morpheus DataFrame.
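As a library-free illustration of what "get and print a CSV" involves, here is a minimal sketch using only the JDK. It is not the Morpheus API: the class name `CsvPeek`, the `readRows` helper, and the sample rows are all hypothetical, and the naive comma split assumes no quoted commas in the data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvPeek {

    /** Reads every line of a CSV into a list of cells (naive split, no quoted commas). */
    public static List<List<String>> readRows(Reader reader) {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // The -1 limit keeps trailing empty cells instead of dropping them
                rows.add(Arrays.asList(line.split(",", -1)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; a real run would wrap the portal's CSV stream in a Reader
        String csv = "project,province,cost\nBridge repair,ON,1200\nTransit hub,BC,3400\n";
        List<List<String>> rows = readRows(new StringReader(csv));
        System.out.println("columns: " + rows.get(0));
        System.out.println("data rows: " + (rows.size() - 1));
    }
}
```

A DataFrame library adds typed columns, quoting rules and pretty printing on top of this, which is why it is the better choice for real datasets.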

Ontario Tax Brackets

This data includes two extra rows before the actual data headers so it requires some filtering before passing to the DataFrame constructor.

We want a method that filters rows from an InputStream and returns another InputStream. So we build the functions IO.toString(InputStream inputStream, Predicate<String> lineFilter) and IO.filter(InputStream inputStream, String lineRegex):

    /**
     * Converts an InputStream to a string, keeping only the lines accepted by lineFilter
     * @return the InputStream converted to a string
     */
    public static String toString(InputStream inputStream, Predicate<String> lineFilter){
        return new BufferedReader(new InputStreamReader(inputStream))
                .lines().filter(lineFilter)
                .collect(Collectors.joining("\n"));
    }

    /**
     * Filters an InputStream
     * @return a new InputStream with the lines that match
     */
    public static InputStream filter(InputStream inputStream, String lineRegex){
        final String filtered = toString(inputStream, (line) -> line.matches(lineRegex));
        return new ByteArrayInputStream(filtered.getBytes());
    }

Now we can write code to filter for only the rows we want:
           "^ +?(\\d|Age|Under).*" ))
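To see what that regex keeps, here is a small self-contained sketch that applies it to a handful of lines. The sample lines are invented to mimic the shape described above (two extra rows, then a header and data), and the class and method names are hypothetical. Note that `String.matches` must match the whole line, and that ` +?` requires at least one leading space.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RowRegexDemo {

    // The pattern from the text: one or more leading spaces, then a digit, "Age" or "Under"
    public static final String ROW_REGEX = "^ +?(\\d|Age|Under).*";

    /** Keeps only the lines that match the regex in full, as String.matches does. */
    public static List<String> keep(List<String> lines, String regex) {
        return lines.stream()
                .filter(line -> line.matches(regex))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical lines, not the real dataset
        List<String> lines = Arrays.asList(
                "Ontario sample table",   // extra row: dropped
                "",                       // extra row: dropped
                " Age,Income",            // header: kept
                " Under 20,100",          // data: kept
                " 20 to 24,200",          // data: kept
                "Age,Income");            // no leading space: dropped
        keep(lines, ROW_REGEX).forEach(System.out::println);
    }
}
```

If the real file has rows without a leading space, the pattern would need ` *` in place of ` +?`; that is a guess about the data and worth checking against the actual download.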

US States Visited By Canadians

This resource is a zip file, but let's not panic. We want to unzip the InputStream and return the CSV inside as an InputStream, so let's write an unzip function:

    /**
     * Unzips an InputStream, reading every entry fully into memory
     * @return an array of InputStream inside this zip
     */
    public static InputStream[] unzip(InputStream zippedInput){
        byte[] buffer = new byte[1024];
        List<InputStream> streams = new ArrayList<>();

        ZipInputStream zis = new ZipInputStream(zippedInput);

        try {
            ZipEntry zipEntry = zis.getNextEntry();
            while(zipEntry != null){
                ByteArrayOutputStream baos = new ByteArrayOutputStream(1024);
                int len;
                // read the current entry until it is exhausted
                while ((len = zis.read(buffer)) > 0) {
                    baos.write(buffer, 0, len);
                }
                streams.add(new ByteArrayInputStream(baos.toByteArray()));
                zipEntry = zis.getNextEntry();
            }
            return streams.toArray(new InputStream[]{});
        } catch (IOException e) {
           throw new RuntimeException(e);
        }
    }
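One way to check the unzip approach without downloading anything is a round trip: build a tiny zip in memory, unzip it, and compare. The sketch below is a variant rather than the function above: it returns the raw bytes of each entry and uses try-with-resources so the ZipInputStream is closed. `UnzipRoundTrip`, `zipOf` and `unzipToBytes` are hypothetical names, and the CSV content is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnzipRoundTrip {

    /** Builds a one-entry zip in memory (helper for the demo only). */
    public static byte[] zipOf(String entryName, String content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The zip is complete only after the ZipOutputStream has been closed
        return out.toByteArray();
    }

    /** Reads each zip entry fully into memory and closes the stream afterwards. */
    public static List<byte[]> unzipToBytes(InputStream zippedInput) {
        List<byte[]> entries = new ArrayList<>();
        byte[] buffer = new byte[1024];
        try (ZipInputStream zis = new ZipInputStream(zippedInput)) {
            while (zis.getNextEntry() != null) {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                int len;
                while ((len = zis.read(buffer)) > 0) {
                    baos.write(buffer, 0, len);
                }
                entries.add(baos.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        byte[] zip = zipOf("visits.csv", "STATE,Value\nNew York,1\n");
        List<byte[]> entries = unzipToBytes(new ByteArrayInputStream(zip));
        System.out.println(new String(entries.get(0), StandardCharsets.UTF_8));
    }
}
```

Both versions hold every entry in memory, which is fine for small open-data files but would need streaming for large archives.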

Now we can get the data using the function we just wrote. For fun and profit, we'll create a chart instead of just printing to the console.

 DataFrame<?,String> visits =
            .select( row -> row.getValue("VISITS").toString().startsWith("Visits"));

        Array<Double> values = visits.col("Value").toArray();
        DataFrame stateVisits = DataFrame.ofDoubles(
                value -> values.getDouble(value.rowOrdinal())

    Chart.create().asHtml().withBarPlot(stateVisits, false, chart -> {
            chart.plot().axes().range(0).label().withText("Visits (1000s)");
            chart.title().withText("US States Visited By Canadians in 2014");

That code downloads and unzips the data, filters for visits and outputs the following chart.

So here we have done some simple data science using Java.

Category: Java Tagged: programming Java

Simple Sql Templates

Tue 02 January 2018

When you have a large code base you also need a large number of tests. Anything you can do to reduce the work required to create and maintain tests will bring you a lot of leverage. Shave an hour's development time off the creation of a single test multiplied by …

Category: Java


Build a Deep Learning Library

Tue 12 December 2017

In this fantastic bit of code-as-entertainment, author Joel Grus walks through the building of a functional deep learning library. The video runs in under an hour and I was able to follow and code along in around two hours. He builds an actual library complete with loss and activation functions …

Category: Data Tagged: programming python machine learning


Beijing Men's 4x400

Sat 05 September 2015

The Favorites

USA were the favorites to win the gold medal in the Men's 4x400 relay final at the Beijing World Championships. Surprisingly, Jamaica was a very close second-best team on paper when we sum the season-best 400m times of each team's legs - with Trinidad in …

Category: Track and Field Tagged: athletics jamaica
