Simple Data Science in Java

Canada Open Data

Canada has published thousands of datasets on the Canada Open Data Portal. Much of this data is useful in its own right, and the portal's page on Application Programming Interfaces explains how to access it. There is also an Apps Gallery that lists apps built using the Open Data.

For Java developers, here are some tools and techniques that make working with this data much more productive. We will mostly use Morpheus DataFrames plus some custom code to explore the data.


Infrastructure Projects

Let's get the list of Infrastructure Projects. This is just a CSV file, and it is fairly straightforward to fetch and print using a Morpheus DataFrame.

    DataFrame.read()
        .csv("http://infrastructure.gc.ca/alt-format/opendata/project-list-with-forcast-dates-liste-de-projets-avec-dates-prevu-en.csv")
        .out().print(10);


Ontario Tax Brackets

This data includes two extra rows before the actual header row, so it requires some filtering before it is passed to the DataFrame reader.

We want a method that filters rows from an InputStream and returns another InputStream. So we build the functions IO.toString(InputStream inputStream, Predicate<String> lineFilter) and IO.filter(InputStream inputStream, String lineRegex):

    // Needs: java.io.*, java.util.function.Predicate, java.util.stream.Collectors

    /**
     * Converts an InputStream to a string, keeping only the lines accepted by lineFilter.
     * Bytes are decoded with the platform default charset.
     * @return the kept lines joined with newlines
     */
    public static String toString(InputStream inputStream, Predicate<String> lineFilter){
        return new BufferedReader(new InputStreamReader(inputStream))
                .lines()
                .filter(lineFilter)
                .collect(Collectors.joining("\n"));
    }

    /**
     * Filters an InputStream line by line
     * @return a new InputStream containing only the lines that fully match lineRegex
     */
    public static InputStream filter(InputStream inputStream, String lineRegex){
        final String filtered = toString(inputStream, (line) -> line.matches(lineRegex));
        return new ByteArrayInputStream(filtered.getBytes());
    }
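
To see the line filtering in isolation, here is a small self-contained sketch that runs the same idea on an in-memory string instead of a download. The class name LineFilterDemo, the helper keepLines, and the sample rows are all invented for illustration, and the regex uses " *" so that the sample needs no leading spaces.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class LineFilterDemo {

    /** Joins the lines accepted by keep with newlines (same idea as IO.toString). */
    static String keepLines(InputStream in, Predicate<String> keep) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
                .lines()
                .filter(keep)
                .collect(Collectors.joining("\n"));
    }

    /** Same idea as IO.filter: keep only lines that fully match lineRegex. */
    static InputStream filter(InputStream in, String lineRegex) {
        String kept = keepLines(in, line -> line.matches(lineRegex));
        return new ByteArrayInputStream(kept.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Invented sample: two preamble rows, then a header row and two data rows.
        String raw = "Title row\nNote row\nAge,Count\nUnder 20,5\n20 to 24,7\n";
        InputStream in = new ByteArrayInputStream(raw.getBytes(StandardCharsets.UTF_8));
        // " *" tolerates zero leading spaces, unlike the " +?" in the article's regex.
        InputStream filtered = filter(in, "^ *(\\d|Age|Under).*");
        // Print what survived the filter: the header row and the two data rows.
        System.out.println(keepLines(filtered, line -> true));
    }
}
```

Because String.matches tests the whole line, the trailing ".*" is what lets the rest of each row through.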

Now we can write code that keeps only the rows we want:

    // Keep only lines that start with at least one space followed by a digit, "Age" or "Under"
    DataFrame.read().csv(
        IO.filter(
            IO.get("http://www.cra-arc.gc.ca/gncy/stts/itstb-sipti/2015/tbl4f-eng.csv"),
            "^ +?(\\d|Age|Under).*"))
        .out().print(10);
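
The IO.get helper used above is not shown in this post. A minimal sketch of what such a helper might look like, assuming it does nothing more than open a stream on the URL, is given below. The class name IoGetSketch and the roundTrip helper are hypothetical; roundTrip reads from a temporary local file so the example runs without network access.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class IoGetSketch {

    /** Hypothetical stand-in for IO.get: opens a stream on the given URL. */
    static InputStream get(String url) {
        try {
            return URI.create(url).toURL().openStream();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text to a temp file and reads it back through get() via a file: URL. */
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("io-get-sketch", ".csv");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                try (InputStream in = get(tmp.toUri().toString())) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[1024];
                    int len;
                    // Copy the stream into memory until end of input.
                    while ((len = in.read(buffer)) > 0) {
                        out.write(buffer, 0, len);
                    }
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("a,b\n1,2\n"));
    }
}
```

Wrapping the checked IOException keeps call sites like the one above free of try/catch blocks, at the cost of deferring error handling to the caller.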


US States Visited By Canadians

This resource is a zip file, but let's not panic. We want to unzip the InputStream and return the CSV inside as an InputStream, so let's write an unzip function:

    /**
     * Reads every entry of a zip archive fully into memory
     * @return an array with one InputStream per entry inside this zip
     */
    public static InputStream[] unzip(InputStream zippedInput){
        byte[] buffer = new byte[1024];
        List<InputStream> streams = new ArrayList<>();

        // try-with-resources closes the ZipInputStream even if reading fails
        try (ZipInputStream zis = new ZipInputStream(zippedInput)) {
            ZipEntry zipEntry = zis.getNextEntry();
            while(zipEntry != null){
                ByteArrayOutputStream baos = new ByteArrayOutputStream(1024);
                int len;
                // copy the current entry into the in-memory buffer
                while ((len = zis.read(buffer)) > 0) {
                    baos.write(buffer, 0, len);
                }
                streams.add(new ByteArrayInputStream(baos.toByteArray()));
                zipEntry = zis.getNextEntry();
            }
            return streams.toArray(new InputStream[0]);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
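
A quick way to check an unzip routine like this without downloading anything is to build a small archive in memory and read it back. The sketch below does that with java.util.zip only. The class name UnzipDemo, the entry name visits.csv, and the sample content are invented for illustration, and the method returns strings rather than streams to keep the check short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class UnzipDemo {

    /** Builds a one-entry zip archive in memory (fixture for the demo only). */
    static byte[] zipOf(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads every entry fully into memory and decodes it as UTF-8 text. */
    static List<String> unzipToStrings(InputStream zipped) {
        List<String> entries = new ArrayList<>();
        byte[] buffer = new byte[1024];
        try (ZipInputStream zis = new ZipInputStream(zipped)) {
            while (zis.getNextEntry() != null) {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                int len;
                // read() returns -1 at the end of the current entry
                while ((len = zis.read(buffer)) > 0) {
                    baos.write(buffer, 0, len);
                }
                entries.add(new String(baos.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        byte[] zip = zipOf("visits.csv", "TRAV,Value\nNew York,100\n");
        System.out.println(unzipToStrings(new ByteArrayInputStream(zip)));
    }
}
```

Buffering each entry in memory is fine for small open-data files; very large archives would call for streaming the entry directly instead.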

Now we can get the data using the new function we just wrote. For fun and profit, we’ll create a chart instead of just printing to the console:

    // Read the first CSV inside the zip and keep only rows whose VISITS column starts with "Visits"
    DataFrame<?,String> visits = DataFrame.read().csv(
            IO.unzip(
                IO.get("http://www20.statcan.gc.ca/tables-tableaux/cansim/csv/04270009-eng.zip"))[0])
        .rows()
        .select(row -> row.getValue("VISITS").toString().startsWith("Visits"));

    // Build a one-column frame keyed by the TRAV column, copying each row's Value by ordinal
    Array<Double> values = visits.col("Value").toArray();
    DataFrame stateVisits = DataFrame.ofDoubles(
        visits.col("TRAV").toArray(),
        Array.of("Value"),
        value -> values.getDouble(value.rowOrdinal())
    );

    // Render the result as a horizontal bar chart
    Chart.create().asHtml().withBarPlot(stateVisits, false, chart -> {
        chart.plot().orient().horizontal();
        chart.plot().axes().range(0).label().withText("Visits (1000s)");
        chart.plot().axes().domain().label().withText("State");
        chart.title().withText("US States Visited By Canadians in 2014");
        chart.legend().on();
        chart.show();
    });

That code downloads and unzips the data, filters for visits and outputs the following chart.
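
As a side note on the selection step in the chart code above, the same filter-and-reshape idea can be pictured in plain Java collections, without Morpheus. This is only a sketch: the class name VisitsSelectionSketch, the assumed column order (TRAV, VISITS, Value), and the sample rows and numbers are all invented and are not taken from the real dataset.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class VisitsSelectionSketch {

    /** Keeps rows whose VISITS label starts with "Visits" and maps TRAV to Value. */
    static Map<String, Double> visitsByState(List<String[]> rows) {
        // LinkedHashMap preserves the row order, like the row keys of a frame.
        Map<String, Double> result = new LinkedHashMap<>();
        for (String[] row : rows) {
            // Assumed column order for this sketch: TRAV, VISITS, Value
            if (row[1].startsWith("Visits")) {
                result.put(row[0], Double.parseDouble(row[2]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented rows; the real file has more columns and different values.
        List<String[]> rows = Arrays.asList(
            new String[]{"State A", "Visits (thousands)", "1.5"},
            new String[]{"State A", "Nights (thousands)", "9"},
            new String[]{"State B", "Visits (thousands)", "2"});
        System.out.println(visitsByState(rows));
    }
}
```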

So here we have done some simple data science using Java.