PKL to ARFF

Submitted by Xilodyne on Sat, 12/10/2016 - 12:16
Pickle to Attribute-Relation File Format (PKL to ARFF)

Java source code for converting PKL files to ARFF is at the bottom of this blog post.  The process is:  use the Jython pickle API to convert the PKL data into text files laid out in the directory structure that the Weka TextDirectoryLoader expects, run the Weka TextDirectoryLoader routine, then write the result out as ARFF.

The Python pickle module is used for "serializing and de-serializing a Python object structure", otherwise known as "pickling" and "unpickling"; the resulting files usually carry a .pkl extension.  As the Udacity pkl files are in ASCII and not binary, I thought it would be fairly easy to write a Java routine to convert the pkl files into Weka ARFF, until I tried to figure out exactly how a pickle file is formatted.  It turns out there isn't a reference document, though some industrious souls have made a good effort at describing the format and point to pickletools as the best way to understand what pickle does.  It was too complicated to be worth re-inventing the wheel.

It turns out there is an easy Plan B:  Jython (Python for the Java Platform).  Just add the standalone jar and you have access to a pickle API.  On GitHub I found a Java PickleLoader that uses the API.

The one problem I had is that when running sshipway's PickleLoader, I would get this error:

SEVERE: Cannot unpickle! Err: java.lang.ClassCastException java.lang.ClassCastException: org.python.core.PyList cannot be cast to org.python.core.PyDictionary at org.steveshipway.dynmap.PickleLoader.getDataFileStream2(PickleLoader.java:266)

The structure of the pkl file I am reading (word_data.pkl):

(lp0
S' sbaile2 nonprivilegedpst susan pleas send the forego list to richard thank   enron wholesal servic 1400 smith street eb3801a houston tx 77002 ph 713 8535620 fax 713 6463490'
p1
aS' sbaile2 nonprivilegedpst 1 txu energi trade compani 2 bp capit energi fund lp may be subject to mutual termin 2 nobl gas market inc 3 puget sound energi inc 4 virginia power energi market inc 5 t boon picken may be subject to mutual termin 5 neumin product co 6 sodra skogsagarna ek for probabl an ectric counterparti 6 texaco natur gas inc may be book incorrect for texaco inc financi trade 7 ace capit re oversea ltd 8 nevada power compani 9 prior energi corpor 10 select energi inc origin messag from tweed sheila sent thursday januari 31 2002 310 pm to   subject pleas send me the name of the 10 counterparti that we are evalu thank'
p2
...

The PickleLoader from sshipway builds a Map of (key, value) pairs and assumes the pkl file holds a Dictionary object, whereas word_data.pkl holds a list, hence the ClassCastException.  With a loader that casts to PyList instead, everything loads correctly.  The key is to know whether the file is a list of strings or a list of integers.  In this case I simply assume that the data are strings and the labels are integers.
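Rather than assuming, the first two lines of an ASCII pickle already hint at what it holds. The following is only a rough sketch based on the dump above: it is not a pickle parser and not part of the code in this post, and the class name PickleListPeek and its Kind enum are made up for illustration. It assumes the protocol-0 text layout seen above, where a list opens with "(l", a string item starts with "S" and an integer item starts with "I".

```java
public class PickleListPeek {

    /** What the sketch thinks the pickled list contains. */
    public enum Kind { STRINGS, INTEGERS, EMPTY, UNKNOWN }

    /**
     * Guesses the content of an ASCII (protocol 0) pickle from its first
     * two lines. Anything that does not open with "(l" (for example a
     * dictionary, which opens with "(d") is reported as UNKNOWN.
     */
    public static Kind peek(String pickleText) {
        String[] lines = pickleText.split("\n");
        if (lines.length < 2 || !lines[0].startsWith("(l")) {
            return Kind.UNKNOWN;
        }
        if (lines[1].isEmpty()) {
            return Kind.UNKNOWN;
        }
        // The first character of an item line is its opcode.
        switch (lines[1].charAt(0)) {
            case 'S': return Kind.STRINGS;   // S'some text'
            case 'I': return Kind.INTEGERS;  // I42
            case '.': return Kind.EMPTY;     // STOP right after the list opener
            default:  return Kind.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        String sample = "(lp0\nS'first item'\np1\naS'second item'\np2\na.";
        System.out.println(peek(sample)); // prints STRINGS
    }
}
```

A check like this could run before choosing between a string loader and an integer loader; for anything beyond these simple lists, the Jython pickle API remains the safer route.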

Once the data has been loaded, how do we get it into Weka's ARFF format?  Weka has a very nice text categorization API, TextDirectoryLoader.  It requires a directory with one sub-directory per class, each holding the text files that belong to that class.  The PickleLoader writes the data out into this structure.
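As a rough illustration of that layout, here is a minimal sketch that only computes where each item would go, without touching the disk. The class name TextDirLayout and the method plan are hypothetical; the "class" + label folder names and "text-" + index + ".txt" file names follow the naming used in the PickleLoader listing further down.

```java
import java.util.ArrayList;
import java.util.List;

public class TextDirLayout {

    /**
     * Returns the relative path each item would be written to:
     * one folder per label, one text file per item.
     */
    public static List<String> plan(int[] labels) {
        List<String> paths = new ArrayList<>();
        for (int index = 0; index < labels.length; index++) {
            paths.add("class" + labels[index] + "/text-" + index + ".txt");
        }
        return paths;
    }

    public static void main(String[] args) {
        // Three items: the first and third belong to class 0, the second to class 1.
        for (String path : plan(new int[] {0, 1, 0})) {
            System.out.println(path);
        }
        // prints:
        // class0/text-0.txt
        // class1/text-1.txt
        // class0/text-2.txt
    }
}
```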

It is then a matter of running the Weka TextDirectoryLoader to load the data into a Weka Instances object, which can then be written out as ARFF.
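To give an idea of what ends up in such a file, here is a minimal hand-rolled sketch of an ARFF document with one string attribute and one nominal class attribute. This is not how Weka produces its output, and the code in this post does not use it: the class name ArffSketch, the method names and the attribute names text and label are assumptions chosen for illustration, and Weka's own output may differ in naming and quoting.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ArffSketch {

    /** Wraps a value in single quotes, escaping backslashes and single quotes. */
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Builds ARFF text: header (relation, attributes) followed by one data row per item. */
    public static String toArff(String relation, String[] texts, int[] labels) {
        if (texts.length != labels.length) {
            throw new IllegalArgumentException("texts and labels must have the same length");
        }
        // Collect the distinct class names in first-seen order.
        Set<String> classes = new LinkedHashSet<>();
        for (int label : labels) {
            classes.add("class" + label);
        }

        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(quote(relation)).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute label {").append(String.join(",", classes)).append("}\n\n");
        sb.append("@data\n");
        for (int i = 0; i < texts.length; i++) {
            sb.append(quote(texts[i])).append(",class").append(labels[i]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] texts = {"pleas send the list", "it's a second mail"};
        int[] labels = {0, 1};
        System.out.print(toArff("enron sample", texts, labels));
    }
}
```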


Run_PickleLoader.java

import java.util.logging.Logger;

import weka.core.Instances;
import xilodyne.util.io.CreateARFF;
import xilodyne.util.io.PickleLoader;


public class Run_PickleLoader {
 //private static final Logger log = Logger.getLogger( Run_PickleLoader.class.getName() );
 
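 // Steps:
 // 1. load the pickle files
 // 2. write each string to a text file under its class-name folder
 // 3. load the text files with the Weka TextDirectoryLoader
 // 4. convert to TF-IDF (optional, commented out below)
 // 5. write the ARFF file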
    public static void main(String[] args) {
    
     String label_file = "./data/email_authors.sample_10.pkl";
     String data_file = "./data/word_data.sample_10.pkl";
    
     String output_dir_name = "./data/enron_sample_10";
     String output_arff_asText = "./data/enron_text_sample_10.arff";
//     String output_arff_asTdIDF = "./data/enron_TdIDF_sample_10.arff";
    
/*     String label_file = "./data/email_authors.pkl";
     String data_file = "./data/word_data.pkl";
    
     String output_dir_name = "./data/enron_sample_chrissara";
     String output_arff_asText = "./data/enron_text_chrissara.arff";
*/

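         int[] labels = PickleLoader.getIntegerData(label_file);
         String[] data = PickleLoader.getStringData(data_file);

         // both loaders return null on failure, and every data item needs a label
         if (labels == null || data == null || labels.length != data.length) {
          System.err.println("Could not load matching label and data files.");
          return;
         }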

         //write pickle data as text file
         PickleLoader.writeToWeka_TextDirectoryLoader_format(data, labels, output_dir_name);
        
         Instances newData = CreateARFF.runWekaTextDirectoryLoader(output_dir_name);
         CreateARFF.writeToARFF(output_arff_asText,  newData);
        
 //        Instances dataFiltered = CreateARFF.convertStringToNumbers(newData);
 //        CreateARFF.writeToARFF(output_arff_asTdIDF,  dataFiltered);
     }
}

 


PickleLoader.java

package xilodyne.util.io;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;

import org.python.core.PyException;
import org.python.core.PyFile;
import org.python.core.PyList;
import org.python.modules.cPickle;

/**
 * Load python *List* pickle files using the Jython standalone jar.
 * If looking to load a Dictionary pkl file, see:
 * https://github.com/sshipway/dynmap-mcdungeon/blob/master/src/org/steveshipway/dynmap/PickleLoader.java
 *
 * @author Austin Davis Holiday, aholiday@xilodyne.com
 * @version 0.2
 */


public class PickleLoader {

 
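 /**
  * Read in a pkl file holding a list of strings.
  *
  * @param filename path to the pkl file
  * @return the list items as strings, or null if the file cannot be read or unpickled
  */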
 public static String[] getStringData(String filename) {
  String[] data = null;

  PyList listHash = null;
  // try-with-resources closes the stream even if unpickling fails
  try (InputStream fs = new FileInputStream(new File(filename))) {
   listHash = (PyList) cPickle.load(new PyFile(fs));
  } catch (FileNotFoundException e) {
   System.out.println("File <" + filename + "> not found");
   return null;
  } catch (Exception e) {
   // covers PyException, an IOException on close, and a
   // ClassCastException if the pickle is not a list
   e.printStackTrace();
   return null;
  }

  data = new String[listHash.size()];
  for (int i = 0; i < listHash.size(); i++) {
   String sVal = listHash.get(i).toString();
   data[i] = sVal;
  }
  return data;
 }

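 /**
  * Read in a pkl file holding a list of integers.
  *
  * @param filename path to the pkl file
  * @return the list items as ints, or null if the file cannot be read or unpickled
  */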
 public static int[] getIntegerData(String filename) {
  int[] data = null;

  PyList listHash = null;
  // try-with-resources closes the stream even if unpickling fails
  try (InputStream fs = new FileInputStream(new File(filename))) {
   listHash = (PyList) cPickle.load(new PyFile(fs));
  } catch (FileNotFoundException e) {
   System.out.println("File <" + filename + "> not found");
   return null;
  } catch (Exception e) {
   // covers PyException, an IOException on close, and a
   // ClassCastException if the pickle is not a list
   e.printStackTrace();
   return null;
  }

  data = new int[listHash.size()];
  for (int i = 0; i < listHash.size(); i++) {
   int iVal = Integer.valueOf(listHash.get(i).toString());
   data[i] = iVal;
  }
  return data;
 }

 /**
  * Write the feature_data and the label_data in a directory structure
  * to match:  https://weka.wikispaces.com/Text+categorization+with+WEKA
  *
  * @param data
  * @param labels
  * @param newDirectoryName
  */
 public static void writeToWeka_TextDirectoryLoader_format(String[] data, int[] labels, String newDirectoryName) {
  // data and labels come from the already loaded pickle files:
  // data is assumed to be one string per item,
  // labels is assumed to be one integer per item, in the same order

  // System.out.println("passing in: " + ArrayUtils.print1DArray(data));
  // create the directory
  File dir = new File(newDirectoryName);
  dir.mkdirs(); // also creates any missing parent directories
  System.out.println("\nPickleLoader...");
  System.out.println("Creating directory: " + dir.getPath());

  // get unique class names
  // create a folder for each class
  HashSet<Integer> uniqueInt = new HashSet<Integer>();
  for (int i : labels) {
   // add() returns true only the first time a label is seen
   if (uniqueInt.add(i)) {
    File subdir = new File(newDirectoryName + File.separator + "class" + i);
    subdir.mkdir();
    System.out.println("Creating directory: " + subdir.getPath());
   }
  }

  // for each string, create file in appropriate class
  System.out.print("Creating " + data.length + " .txt files...");
  for (int index = 0; index < data.length; index++) {
   int i = labels[index];
   // System.out.println("writing... " + data[index]);

   File file = new File(newDirectoryName + File.separator + "class" + i + File.separator + "text-" + index
     + ".txt");
   // try-with-resources flushes and closes the writer, even on error
   try (BufferedWriter textWriter = new BufferedWriter(new FileWriter(file))) {
    textWriter.write(data[index]);
    textWriter.newLine();
   } catch (IOException e) {
    e.printStackTrace();
   }

  }
  System.out.println(" Done.");
 }
}


CreateARFF.java

package xilodyne.util.io;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

/**
 * Given a Weka TextDirectoryLoader directory structure, read in the data and
 * export to ARFF https://weka.wikispaces.com/Text+categorization+with+WEKA
 *
 * @author Austin Davis Holiday, aholiday@xilodyne.com
 * @version 0.2
 *
 */
public class CreateARFF {

 /**
  * @param directoryName root of the TextDirectoryLoader directory structure
  * @return the loaded Instances, or null if loading fails
  */
 public static Instances runWekaTextDirectoryLoader(String directoryName) {
  TextDirectoryLoader loader = new TextDirectoryLoader();

  System.out.println("\nCreateARFF...");
  System.out.println("Reading in structure from " + directoryName);

  Instances dataRaw = null;

  try {
   // only read the data set if the directory could be set
   loader.setDirectory(new File(directoryName));
   dataRaw = loader.getDataSet();
  } catch (IOException e) {
   e.printStackTrace();
  }

  // System.out.println("dataRaw: " + dataRaw.toString());

  return dataRaw;

 }

 public static Instances convertStringToNumbers(Instances data) {
  StringToWordVector filter = new StringToWordVector();
  Instances dataFiltered = null;

  try {
   // the input format must be set before the filter is applied
   filter.setInputFormat(data);
   dataFiltered = Filter.useFilter(data, filter);
  } catch (Exception e) {
   e.printStackTrace();
  }
  return dataFiltered;
 }

 public static void writeToARFF(String filename, Instances data) {

  File outFile = new File(filename);
  System.out.print("Creating file: " + outFile.getPath() + ", " + data.numClasses() + " classes, " + data.size()
    + " data lines... ");
  // try-with-resources flushes and closes the writer, even on error
  try (BufferedWriter outARFF = new BufferedWriter(new FileWriter(outFile))) {
   outARFF.write(data.toString());
   outARFF.newLine();
  } catch (IOException e) {
   e.printStackTrace();
  }
  System.out.println("Done.");

 }
}