Archive for September, 2011

Quick intro to Weka

September 25, 2011 – 12:15 pm

Weka (http://www.cs.waikato.ac.nz/ml/weka/) is data mining software from the University of Waikato. In Slovenia, the Bioinformatics Laboratory has developed the well-known Orange (http://orange.biolab.si/). Both tools have a GUI and a library for programmatic access. The main difference is that Weka is Java-based and Orange is Python-based.

Here I will give a short example of how to use Weka from Java. The Java file is accessible here: Weka.java. All you need to do is put weka.jar on the classpath, then compile and run Weka.java (of course you need to have a C:\temp folder or choose another one).

For classification problems we normally have to identify features first. In Weka the standard attribute types are numeric, nominal, string, date and relational. A relational attribute can hold a whole dataset. Some functions for data preprocessing are also available. Here we define some attributes:

//1.ATTRIBUTES
//numeric
Attribute attr = new Attribute("my-numeric");
System.out.println(attr.isNumeric());
 
//nominal
FastVector myNomVals = new FastVector();
for (int i=0; i<10; i++)
	myNomVals.addElement("value_"+i);
Attribute attr1 = new Attribute("my-nominal", myNomVals);
System.out.println(attr1.isNominal());
 
//string
Attribute attr2 = new Attribute("my-string", (FastVector)null);
System.out.println(attr2.isString());
 
//date
Attribute attr3 = new Attribute("my-date", "dd-MM-yyyy");
System.out.println(attr3.isDate());
 
//whole relation can also be an attr
//Attribute attr4 = new Attribute("my-relation", new Instances(...));

Once we have the attributes, we can form the dataset, a.k.a. relation (reading from and writing to files comes later):

//2.create dataset
FastVector attrs = new FastVector();
    attrs.addElement(attr);
    attrs.addElement(attr1);
    attrs.addElement(attr2);
    attrs.addElement(attr3);
Instances dataset = new Instances("my_dataset", attrs, 0);

Now we have defined the structure of the relation. There are several ways to fill the dataset, and here we present a few of them:

//3.add instances
//first instance
double[] attValues = new double[dataset.numAttributes()];
	attValues[0] = 55;
	attValues[1] = dataset.attribute("my-nominal").indexOfValue("value_5");
	attValues[2] = dataset.attribute("my-string").addStringValue("Slavko");
	attValues[3] = dataset.attribute("my-date").parseDate("7-6-1987");
dataset.add(new Instance(1.0, attValues));
 
//second instance
attValues = new double[dataset.numAttributes()];
	attValues[0] = Instance.missingValue();
	attValues[1] = dataset.attribute(1).indexOfValue("value_9");
	attValues[2] = dataset.attribute(2).addStringValue("Marinka");
	attValues[3] = dataset.attribute(3).parseDate("23-4-1989");
dataset.add(new Instance(1.0, attValues));
 
//third instance
Instance example = new Instance(4);
	example.setValue(attr, 16);
	example.setValue(attr1, "value_7");
	example.setValue(attr2, "Mirko");
	example.setValue(attr3, attr3.parseDate("1-1-1988"));
dataset.add(example);
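
The first two instances above are filled through a double[]: numeric values go in directly, while a nominal value is stored as the index of its label (this is what indexOfValue returns). Here is a minimal plain-Java sketch of that idea. It does not use Weka, and ValueEncodingSketch and its methods are made-up names for illustration only:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; not part of Weka.
final class ValueEncodingSketch {

    // Position of the label in the list of allowed nominal values, or -1 if it is absent.
    static int indexOfValue(List<String> nominalValues, String label) {
        return nominalValues.indexOf(label);
    }

    // Encodes one row: the numeric value as-is, the nominal label as its index.
    static double[] encodeRow(double numeric, List<String> nominalValues, String label) {
        return new double[] { numeric, indexOfValue(nominalValues, label) };
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("value_0", "value_1", "value_2");
        // prints [55.0, 2.0]
        System.out.println(Arrays.toString(encodeRow(55, values, "value_2")));
    }
}
```

This is also why a misspelled label is easy to miss: the lookup gives back an index, not an error, so it is worth checking the result before adding the instance.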

At this point we have the dataset in memory. We can use it (the class attribute still needs to be set), print it to stdout or save it to a file:

//4.output dataset
System.out.println(dataset);
 
//5.save dataset
String file = "C:\\temp\\weka_test.arff";
ArffSaver saver = new ArffSaver();
saver.setInstances(dataset);
saver.setFile(new File(file));
saver.writeBatch();
 
//6.read dataset
ArffLoader loader = new ArffLoader();
loader.setFile(new File(file));
dataset = loader.getDataSet();

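As we have one string attribute, we need to preprocess it, because very few classifiers support strings. We can do this with filters, for example StringToWordVector, which replaces the string attribute with word-occurrence attributes (use StringToNominal if you want a nominal attribute instead):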

//7.preprocess strings (almost no classifier supports them)
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataset);
dataset = Filter.useFilter(dataset, filter);
System.out.println(dataset);
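
To give a feeling for what a string-to-word-vector conversion does in its simplest form, here is a plain-Java sketch: one 0/1 value per distinct word. This is not Weka's implementation; the class name WordVectorSketch, the whitespace tokenization and the sample sentences are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Plain-Java illustration only; not part of Weka.
final class WordVectorSketch {

    // Splits a text into lowercase tokens on whitespace.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!w.isEmpty()) result.add(w);
        }
        return result;
    }

    // Sorted list of distinct words over all texts; each word becomes one attribute.
    static List<String> vocabulary(List<String> texts) {
        TreeSet<String> words = new TreeSet<>();
        for (String text : texts) words.addAll(tokens(text));
        return new ArrayList<>(words);
    }

    // 1.0 where the vocabulary word occurs in the text, 0.0 otherwise.
    static double[] toVector(String text, List<String> vocabulary) {
        List<String> present = tokens(text);
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = present.contains(vocabulary.get(i)) ? 1.0 : 0.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("Weka is Java", "Orange is Python");
        List<String> vocab = vocabulary(texts);
        System.out.println(vocab); // [is, java, orange, python, weka]
        System.out.println(Arrays.toString(toVector(texts.get(0), vocab))); // [1.0, 1.0, 0.0, 0.0, 1.0]
    }
}
```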

We have the data. The next step is building a classifier. Weka contains a lot of well-known classifiers such as naive Bayes, decision trees, perceptrons, etc. I like SVMs and I use LibSVM with Weka. Weka already has a built-in LibSVM wrapper, so all you need to do is add libsvm.jar to the classpath and use LibSVM as the classifier instance. The example below uses a J48 decision tree, which needs no extra jar.

Saving and loading classifiers is also very easy. The only thing to be aware of is the class index: you must set it before training the classifier. Best practice is to always make the class attribute the last one.

//8.build classifier
dataset.setClassIndex(1);
Classifier classifier = new J48();
classifier.buildClassifier(dataset);
 
//9.save classifier (to its own file, so the ARFF file from step 5 is not overwritten)
String modelFile = "C:\\temp\\weka_test.model";
OutputStream os = new FileOutputStream(modelFile);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(os);
objectOutputStream.writeObject(classifier);
objectOutputStream.close();
 
//10. read classifier back
InputStream is = new FileInputStream(modelFile);
ObjectInputStream objectInputStream = new ObjectInputStream(is);
classifier = (Classifier) objectInputStream.readObject();
objectInputStream.close();

Usually we need to know how good the classifications are. Weka supports a number of evaluation tools, such as cross-validation and various measures. Here we will resample our dataset, split it into a train and a test set and output some results.

//11.evaluate
 
//resample if needed (samples with replacement; use dataset.randomize(new Random(42)) to only shuffle)
dataset = dataset.resample(new Random(42));
 
//split to 70:30 learn and test set
double percent = 70.0;
int trainSize = (int) Math.round(dataset.numInstances() * percent / 100);
int testSize = dataset.numInstances() - trainSize;
Instances train = new Instances(dataset, 0, trainSize);
Instances test = new Instances(dataset, trainSize, testSize);
train.setClassIndex(1);
test.setClassIndex(1);
 
//do eval: retrain on the train split only, so the test set stays unseen
classifier.buildClassifier(train);
Evaluation eval = new Evaluation(train); //trainset
eval.evaluateModel(classifier, test); //testset
System.out.println(eval.toSummaryString());
System.out.println(eval.weightedFMeasure());
System.out.println(eval.weightedPrecision());
System.out.println(eval.weightedRecall());
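
To make the "weighted" measures less of a black box, here is a plain-Java sketch that computes a weighted F-measure from a confusion matrix. It assumes that each class's F-measure is weighted by the number of instances that actually belong to that class, which is my understanding of what Weka reports. The class WeightedMetricsSketch and the sample matrix are made up for illustration:

```java
// Plain-Java illustration only; not part of Weka.
final class WeightedMetricsSketch {

    // confusion[actual][predicted] holds instance counts.
    static double weightedFMeasure(int[][] confusion) {
        int n = confusion.length;
        double total = 0;
        double weightedSum = 0;
        for (int c = 0; c < n; c++) {
            int truePositives = confusion[c][c];
            int actual = 0;    // instances that really belong to class c (row sum)
            int predicted = 0; // instances predicted as class c (column sum)
            for (int j = 0; j < n; j++) {
                actual += confusion[c][j];
                predicted += confusion[j][c];
            }
            double precision = predicted == 0 ? 0 : (double) truePositives / predicted;
            double recall = actual == 0 ? 0 : (double) truePositives / actual;
            double f = (precision + recall) == 0 ? 0 : 2 * precision * recall / (precision + recall);
            weightedSum += f * actual; // weight by class size
            total += actual;
        }
        return total == 0 ? 0 : weightedSum / total;
    }

    public static void main(String[] args) {
        int[][] confusion = { { 4, 1 }, { 2, 3 } };
        // per-class F-measures are 8/11 and 2/3, so the weighted value is 46/66
        System.out.println(weightedFMeasure(confusion));
    }
}
```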

When classifying new instances, remember to map the classifier’s result back to a class attribute value: classifyInstance returns only the index of the value (as a double)!

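//12.classify
//result (index of the predicted class value)
double predicted = classifier.classifyInstance(dataset.firstInstance());
System.out.println(predicted);
//classified result value (label of the predicted class)
System.out.println(dataset.attribute(dataset.classIndex()).value((int) predicted));
//class probability distribution (printed as values, not as an array reference)
System.out.println(java.util.Arrays.toString(classifier.distributionForInstance(dataset.firstInstance())));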
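
The mapping from index to label, and from a probability distribution to the most likely class, can be sketched in plain Java like this. It is independent of Weka; PredictionSketch, its methods and the sample numbers are hypothetical:

```java
// Plain-Java illustration only; not part of Weka.
final class PredictionSketch {

    // Index of the largest probability in the distribution (first one wins on ties).
    static int argMax(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) best = i;
        }
        return best;
    }

    // Turns the double returned by a classifier into the label it stands for.
    static String label(String[] classValues, double predictedIndex) {
        return classValues[(int) predictedIndex];
    }

    public static void main(String[] args) {
        String[] classValues = { "value_0", "value_1", "value_2" };
        double[] distribution = { 0.1, 0.7, 0.2 };
        int predicted = argMax(distribution);
        System.out.println(predicted);                     // 1
        System.out.println(label(classValues, predicted)); // value_1
    }
}
```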

I hope this example was useful to you. I tried to show how to use Weka for some quick tasks.

ESSIR 2011, Videolectures

September 25, 2011 – 10:17 am

As I have already mentioned, the Videolectures.net team recorded the main track of the summer school. You can access their material at:

http://videolectures.net/essir2011_koblenz/

ESSIR 2011, Day 6

September 3, 2011 – 5:07 am

On the last day of the summer school, Peter Ingwersen gave a double lecture on information seeking. He introduced some theoretical models, their variables and their uses. These models are meant for the architecture of a search engine and are independent of the index implementation. In the last part, he covered some experiments they were conducting and problems we must be aware of when experimenting. On improving results, he gave an example of two models with around 5% difference in F-score. The point was that everybody can be happy and your paper may be accepted for a giant improvement, but in practice the improvement is not noticeable.

After the second lecture ddr. Sergej Sizov officially ended the summer school:

Then we went for a lunch in the university’s canteen.

In the evening we met with the participants who were still in Koblenz for a few more drinks:

Tomorrow (today as of the time of writing) Kaja and I are going to Frankfurt by train and then by plane to Slovenia.

ESSIR 2011, Day 5

September 2, 2011 – 3:57 pm

Yesterday the first two lessons were held by Stefan Rueger from the Knowledge Media Institute. He talked about multimedia search (images, videos), indexing etc. There were some similarities to Wednesday’s symposium on bias. The main problem he addressed is the semantic gap in image recognition. We have methods to extract data or features from images, but it is still hard to classify the image as a whole, for example as a victory or a championship event.

 

In the afternoon, Ricardo Baeza-Yates talked about crawling, comparing algorithms and some recent results from articles. He also spoke about the idea of searching without explicitly providing a query to an information retrieval system. Then he talked about distributed web search and the ’09 article on star topology. The interesting part was that the paper was rejected twice, but then it got the best paper award at the CIKM09 conference.

At around 4 p.m. we went on a boat cruise with dinner on the Rhine and Moselle. We sailed up to the Lorelei and then back to Koblenz.

When we got back at 10 p.m., we attended a party in a local club:


ESSIR 2011, Day 4

September 1, 2011 – 12:16 am

On the fourth day of ESSIR we had either PhD symposiums or presentations of the Livingknowledge project's work.

I chose the Livingknowledge project's Bias and diversity sessions.

In the first part, Richard Johannson from Trento presented opinion extraction, from coarse to fine-grained methods. For coarse classification (useful for larger text chunks) they developed SentiWordnet, inspired by Wordnet. For example, the word “boring” is not problematic, but “good” has a few senses and the system must decide which one to use. The sequencers they have developed use the standard IOB scheme. The recent research from LK is about temporal fact extraction, disambiguation and evolution, for example extracting the roles of Arnold Schwarzenegger through time. Their system, Prospera, works similarly to IOBIE, but learns rules iteratively, like Sofie. The system was trained on the ClueWeb09 corpus (around 500 million English web pages) on a Hadoop cluster with 10×16 cores and 10×48 GB RAM.

Then, in the second part, Jonathan from Southampton gave us a brief introduction to extracting information from images. In his research he found that the best feature is a histogram of changing pixels for angles around pixels. He also compared IE from images with IE from text. At the end he showed us open-source tools such as OpenIMAJ, a library for IE on images, and ImageTerrier, an extension for Terrier that also indexes images.

After this, Stefan Siersdorfer presented IE from multimedia content on the social web. As I expected, he relies on a graph representation. The thing I find most interesting is the wisdom of the crowd: collective intelligence is better than that of a single person. The first research on this was published in Nature on 7 March 1907. A researcher asked people “How much does a bull weigh?”, and the average of the answers was very close to the real weight. After this introduction he focused more on photos and videos.

In the last part, Michael Mathews from Yahoo! Labs Barcelona guided us through the practical use of their Diversity engine on some datasets. The tutorial we went through is presented on the following slides:

From 4 to 6 p.m. we had poster sessions. My poster was printed by the organizers; I just needed to paste it on the roll-up. During the presentation I got some useful tips and found out that one group is working on something very similar to IOBIE. In the second hour I watched others’ poster presentations and ate some finger food.

My poster:

This evening there was no organized event, so we walked to a nearby castle. Some pictures are published in a gallery below: