Java machine learning in a Python world

Submitted by Xilodyne on Sun, 07/17/2016 - 09:45
Nothing is impossible

Having dived into my first Udacity machine learning introductory course in May 2016, I was suddenly confronted with a complete Python machine learning ecosystem.  It is difficult enough overcoming Python's propensity for not declaring anything beforehand: without type declarations, figuring out the structure of a returned variable means digging through a ton of documentation, library code, or testing.  On top of that come the numpy, sklearn, matplotlib and pylab libraries that the Udacity courses leverage, each with its own learning curve.  At this point everything in Python feels like a black box.  It is one thing to finish an assignment and another to completely understand what is going on in the background.

In an effort to understand exactly what the Python routines were doing, I thought it would be interesting to view them from a Java perspective.  Digging through StackOverflow, GitHub and SourceForge, I found Java-based solutions that solve the same algorithmic problems, but nothing that implements the same numpy methods with the same numpy results.  The idea is that I could take my Python code and, using the same methods and data structures, have something in Java producing the same results, and vice versa.
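To make the kind of one-to-one mapping I'm after concrete, here is a minimal sketch: the numpy call appears as a comment, and a plain-Java method returns the same result.  The class and method names (`NumpyStyle`, `mean`, `argmax`) are my own, not from any of the libraries discussed.

```java
import java.util.Arrays;

// Sketch of a one-to-one numpy-to-Java mapping: the numpy call in
// the comment, a Java method with the same result below it.
public class NumpyStyle {

    // np.mean(a)
    static double mean(double[] a) {
        return Arrays.stream(a).average().orElse(Double.NaN);
    }

    // np.argmax(a) -- index of the first maximum element
    static int argmax(double[] a) {
        int best = 0;
        for (int i = 1; i < a.length; i++)
            if (a[i] > a[best]) best = i;
        return best;
    }

    public static void main(String[] args) {
        double[] a = {3.0, 1.0, 4.0, 1.0, 5.0};
        System.out.println(mean(a));    // 2.8
        System.out.println(argmax(a));  // 4
    }
}
```

Trivial cases like these are easy; the hard part is getting identical behavior on edge cases (empty arrays, NaN handling), which is exactly where numpy's documented semantics matter.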

For straight-on Java implementations of numpy there are ND4J and Numpy4J.

ND4J (N-Dimensional Arrays for Java) is sponsored by deeplearning4j (DL4J, Deep Learning for Java).  DL4J looks comprehensive and seems to offer results very similar to what I'm seeing in my Python programming.  It does this by leveraging JavaCPP (yes, JNI-style access, which requires an MSVC compiler) and OpenBLAS.  Unfortunately it isn't as easy to get DL4J/ND4J running as the documentation suggests.  There are so many required and interrelated packages that are not up-to-date in the Maven repository, plus libnd4j to tackle, that I could never get anything to work.  Even compiling from source, my usual method for overcoming environment and library issues, became an exercise in futility and seems to be discouraged by the main coders.

Numpy4J looks promising, except that it hasn't been updated in a few years.  I'm currently using Java 8, and it appears Numpy4J is coded against earlier JDKs, which have different C header file signatures.  I haven't previously spent much time on JNI, so this might be trivial to fix without having to install older JDKs.  I'll probably swing around to this package again if I hit roadblocks with other Java numpy solutions.

Numpy takes advantage of BLAS (Basic Linear Algebra Subprograms):

"The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example."
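To see what the three levels mean in practice, here is a plain-Java sketch of one representative operation per level.  These are naive loops for illustration only; the BLAS routine names in the comments (ddot, dgemv, dgemm) are the real ones, but actual implementations like OpenBLAS are heavily optimized, which is the whole point of calling into them rather than writing loops like these.

```java
// Naive plain-Java illustrations of the three BLAS levels.
// Real BLAS implementations (OpenBLAS, ATLAS) are far faster.
public class BlasLevels {

    // Level 1: vector-vector, e.g. dot product (BLAS ddot)
    static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) sum += x[i] * y[i];
        return sum;
    }

    // Level 2: matrix-vector product, y = A * x (BLAS dgemv)
    static double[] gemv(double[][] a, double[] x) {
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++) y[i] = dot(a[i], x);
        return y;
    }

    // Level 3: matrix-matrix product, C = A * B (BLAS dgemm)
    static double[][] gemm(double[][] a, double[][] b) {
        int n = a.length, m = b[0].length, k = b.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int p = 0; p < k; p++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    public static void main(String[] args) {
        double[] x = {1, 2}, y = {3, 4};
        System.out.println(dot(x, y));  // 11.0
        double[][] a = {{1, 0}, {0, 2}};
        System.out.println(java.util.Arrays.toString(gemv(a, x)));
        System.out.println(java.util.Arrays.deepToString(gemm(a, a)));
    }
}
```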

The question I was wondering about: if BLAS is such an old library (initially written in Fortran), have these routines been translated to Java as a package?  Is it necessary to use BLAS directly?  I'm still digging into it.  There is a Fortran-to-Java bytecode translator, F2J (neat concept!), which can access BLAS, but it hasn't been maintained in some time.  An interesting, and current, library is netlib-java, another BLAS wrapper.

I have started using vectorz as a Java NDArray (a basic construct in numpy) and I'm finding it pretty easy to use.  Previously I was using javatuples to load my ndarray-type data.  However, vectorz has functionality more similar to numpy's, including the ability to shape the array (though I haven't verified that it produces the same type of shape result).
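To verify whether a Java library's shape operation matches numpy's, it helps to pin down what numpy actually does: by default, reshape flattens the data in row-major (C) order and refills it into the new shape.  Here is a plain-Java sketch of those semantics (the `ReshapeDemo` class is my own illustration, not vectorz's API), which gives a reference result to compare any library against.

```java
import java.util.Arrays;

// Plain-Java sketch of numpy's default reshape semantics:
// flatten in row-major (C) order, refill into the new shape.
public class ReshapeDemo {

    // Reshape a 1-D array of length rows*cols into rows x cols.
    static double[][] reshape(double[] flat, int rows, int cols) {
        if (flat.length != rows * cols)
            throw new IllegalArgumentException("cannot reshape length "
                    + flat.length + " into " + rows + "x" + cols);
        double[][] out = new double[rows][cols];
        for (int i = 0; i < flat.length; i++)
            out[i / cols][i % cols] = flat[i];
        return out;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6};
        // np.array([1,2,3,4,5,6]).reshape(2, 3) -> [[1, 2, 3], [4, 5, 6]]
        System.out.println(Arrays.deepToString(reshape(data, 2, 3)));
    }
}
```

Note that numpy also supports Fortran (column-major) order via `order='F'`; if a Java library fills column-first, its "same shape" will hold different element positions.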

Setting aside my need for a direct Java implementation of numpy and sklearn, there are a number of comprehensive Java machine learning frameworks.  I've looked at a few of them, and they all seem to have a learning curve in figuring out how to get them to do what I need for the Udacity courses.  Below are a few that I've found interesting.

Of course Google's TensorFlow, released to the public, seems to be emerging as the de facto go-to implementation.  But as I'm still getting my feet wet in ML, I'm focusing on getting some ML basics down first.

CoreNLP from Stanford is up-to-date and covers a wide range of algorithms.  Unfortunately, since I do not have the math expressions beforehand, and the Java classes neither reference the source of each algorithm nor explain in the code why certain calculations are made, it makes a pretty difficult learning tool.  Perhaps, as this framework is the basis of the courses, all of that is explained in their textbooks.

aima-java is another comprehensive reference: it is the code base accompanying the Russell/Norvig textbook "Artificial Intelligence: A Modern Approach", which I've mentioned in an earlier blog.  It is a very handy reference for looking at code implementations, but I do struggle at times to understand exactly what the text means and to follow that in the code.

Other frameworks that I have found and occasionally reference are the Datumbox Machine Learning Framework, the OpenML project with their code reference, and the University of Waikato's implementation, Weka.  A nice one that works out of the box is Neuroph.