Research

Only posts from the Research category are shown on this page.

 

 

Natural Language Processing in Slovene language

I have checked the tools available for text processing in the Slovene language. As far as I could find, there are no published named entity recognizers, relation extractors or co-reference resolution systems.

The good news is that some work has been done on lemmatization (sl. lematizacija) and POS tagging (sl. oblikoslovno označevanje), which will be very important for text preprocessing. The entry point for more information is http://bos.zrc-sazu.si/. It publishes some theory, references to scientific articles and research projects. For me, the important resources are the tagged datasets and learned models.

For Slovene, there are three main datasets worth mentioning (tagged according to the TEI P5 standard):

  • JOS datasets: These consist of two datasets, jos100k and jos1M, with 100,000 and 1,000,000 hand-checked linguistic annotations.
  • FidaPLUS dataset: It contains a part of the MULTEXT-East texts and is the reference dataset for the Slovene language, as it is very representative.
  • MULTEXT-East: This is a spin-off of the MULTEXT project. It contains morphological annotations for eastern languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene and Ukrainian.

Learned models for the lemmatizer and POS tagger can be found here: http://oznacevalnik.slovenscina.eu/Vsebine/Sl/ProgramskaOprema/Oblikoslovni.aspx. Unfortunately (I am a Unix user now :( ) the software is written in .NET, but it contains training data and some examples. Maybe I could get the source code from the developers and rewrite it in Java, or I will do it on my own. On videolectures.net there is a nice presentation of the work done on this project. Another lemmatiser can be found here: http://lemmatise.ijs.si/Software – it may be the same as the previous one, as it is written by researchers from the same department.

For Slovene there is also a WordNet-like lexicon named sloWNet 2.2. Currently it contains 20,000 synsets along with 17,000 literals.

More information can also be found on the site of the research group at the Jožef Stefan Institute: http://nl.ijs.si/. Commercial tools are available from the company Amebis. Most of their products are rule-based and, from my point of view, can only be used as helpers for research.

Quick intro to Weka

Weka (http://www.cs.waikato.ac.nz/ml/weka/) is data mining software from the University of Waikato. In Slovenia, the Bioinformatics Laboratory has also developed the well-known software Orange (http://orange.biolab.si/). Both tools have a GUI and a library for programmatic access. The main difference is that Weka is Java-based and Orange is Python-based.

Here I will give a short example of how to use Weka from Java. The Java file is accessible here: Weka.java. All you need to do is put weka.jar on the classpath, then compile and run Weka.java (of course you need to have a C:\temp folder or choose another one).

For classification problems we normally have to identify features. In Weka the standard attribute types are numeric, nominal, string, date and relational. A relational attribute can represent a whole dataset. There are also some functions available for data preprocessing. Here we define some attributes:

//1.ATTRIBUTES
//numeric
Attribute attr = new Attribute("my-numeric");
System.out.println(attr.isNumeric());
 
//nominal
FastVector myNomVals = new FastVector();
for (int i=0; i<10; i++)
	myNomVals.addElement("value_"+i);
Attribute attr1 = new Attribute("my-nominal", myNomVals);
System.out.println(attr1.isNominal());
 
//string
Attribute attr2 = new Attribute("my-string", (FastVector)null);
System.out.println(attr2.isString());
 
//date
Attribute attr3 = new Attribute("my-date", "dd-MM-yyyy");
System.out.println(attr3.isDate());
 
//whole relation can also be an attr
//Attribute attr4 = new Attribute("my-relation", new Instances(...));

Once we have the attributes, we can form the dataset, a.k.a. relation (reading from and writing to files will come later):

//2.create dataset
FastVector attrs = new FastVector();
attrs.addElement(attr);
attrs.addElement(attr1);
attrs.addElement(attr2);
attrs.addElement(attr3);
Instances dataset = new Instances("my_dataset", attrs, 0);

Now we have defined the relation structure. There are several ways to fill the dataset; here we present a few of them:

//3.add instances
//first instance
double[] attValues = new double[dataset.numAttributes()];
attValues[0] = 55;
attValues[1] = dataset.attribute("my-nominal").indexOfValue("value_5");
attValues[2] = dataset.attribute("my-string").addStringValue("Slavko");
attValues[3] = dataset.attribute("my-date").parseDate("7-6-1987");
dataset.add(new Instance(1.0, attValues));
 
//second instance
attValues = new double[dataset.numAttributes()];
attValues[0] = Instance.missingValue();
attValues[1] = dataset.attribute(1).indexOfValue("value_9");
attValues[2] = dataset.attribute(2).addStringValue("Marinka");
attValues[3] = dataset.attribute(3).parseDate("23-4-1989");
dataset.add(new Instance(1.0, attValues));
 
//third instance
Instance example = new Instance(4);
example.setValue(attr, 16);
example.setValue(attr1, "value_7");
example.setValue(attr2, "Mirko");
example.setValue(attr3, attr3.parseDate("1-1-1988"));
dataset.add(example);

Up to here we have the dataset in memory. We can use it (the class attribute still needs to be set), print it to stdout, or save it to a file and read it back:

//4.output dataset
System.out.println(dataset);
 
//5.save dataset
String file = "C:\\temp\\weka_test.arff";
ArffSaver saver = new ArffSaver();
saver.setInstances(dataset);
saver.setFile(new File(file));
saver.writeBatch();
 
//6.read dataset
ArffLoader loader = new ArffLoader();
loader.setFile(new File(file));
dataset = loader.getDataSet();

As we have one string attribute, we need to preprocess it properly, because very few classifiers support string attributes. We can accomplish this with filters, for example by converting it into word-vector attributes:

//7.preprocess strings (almost no classifier supports them)
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataset);
dataset = Filter.useFilter(dataset, filter);
System.out.println(dataset);
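
To make the idea behind the filter more concrete, here is a minimal plain-Java sketch of a bag-of-words conversion. It does not use Weka at all: the class name BagOfWords, the whitespace tokenization, the lowercasing and the 0/1 presence encoding are assumptions chosen for illustration, and the real StringToWordVector filter has many more options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Weka: turns strings into 0/1 word vectors.
class BagOfWords {

    // Collect the sorted set of distinct lowercased words over all documents.
    static List<String> vocabulary(String[] docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            for (String word : doc.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    vocab.add(word);
                }
            }
        }
        return new ArrayList<>(vocab);
    }

    // One position per vocabulary word: 1 if the word occurs in the document, else 0.
    static int[] vectorize(String doc, List<String> vocab) {
        Set<String> words = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            vector[i] = words.contains(vocab.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] docs = {"Slavko likes Weka", "Marinka likes Orange"};
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String doc : docs) {
            System.out.println(Arrays.toString(vectorize(doc, vocab)));
        }
    }
}
```

Each string becomes a fixed-length numeric vector, which is the kind of input most classifiers can work with.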

We have the data. The next thing is building a classifier. Weka contains a lot of well-known classifiers such as naive Bayes, decision trees, perceptrons, etc. I like SVMs and I use LibSVM with Weka. Weka already has a built-in LibSVM wrapper, so the only thing you need to do is add libsvm.jar to the classpath and use LibSVM as the classifier instance.

Saving classifiers and reading them back is also very easy. The only thing to be aware of is the class index! You must set it before training the classifier. Best practice is to always make the class attribute the last one.

//8.build classifier
dataset.setClassIndex(1);
Classifier classifier = new J48();
classifier.buildClassifier(dataset);
 
//9.save classifier (to its own file, so the ARFF file is not overwritten)
String modelFile = "C:\\temp\\weka_test.model";
OutputStream os = new FileOutputStream(modelFile);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(os);
objectOutputStream.writeObject(classifier);
objectOutputStream.close();
 
//10. read classifier back
InputStream is = new FileInputStream(modelFile);
ObjectInputStream objectInputStream = new ObjectInputStream(is);
classifier = (Classifier) objectInputStream.readObject();
objectInputStream.close();
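
The save and load steps above are plain Java serialization, so the same pattern can be sketched without Weka. In this minimal sketch, ThresholdModel is a hypothetical stand-in for a trained classifier and the temporary file replaces the fixed path; try-with-resources makes sure the streams are closed even if an exception is thrown.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

class ModelRoundTrip {

    // Hypothetical stand-in for a trained classifier; any Serializable object behaves the same.
    static class ThresholdModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;

        ThresholdModel(double threshold) {
            this.threshold = threshold;
        }

        int classify(double x) {
            return x >= threshold ? 1 : 0;
        }
    }

    // Write the model; the stream is closed automatically.
    static void save(Serializable model, Path path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(path))) {
            out.writeObject(model);
        }
    }

    // Read the model back; the caller casts it to the expected type.
    static Object load(Path path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(path))) {
            return in.readObject();
        }
    }

    // Save a model to a temporary file, restore it and classify one value with the restored copy.
    static int roundTripAndClassify(double threshold, double x) {
        try {
            Path path = Files.createTempFile("model", ".bin");
            try {
                save(new ThresholdModel(threshold), path);
                ThresholdModel restored = (ThresholdModel) load(path);
                return restored.classify(x);
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripAndClassify(0.5, 0.7));
    }
}
```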

Usually we need to know how good the classifications are. Weka supports a number of evaluation tools, such as cross-validation (CV) and various measures. Here we will resample our dataset, split it into a train and a test set, and output some results.

//11.evaluate
 
//resample if needed
dataset = dataset.resample(new Random(42));
 
//split to 70:30 learn and test set
double percent = 70.0;
int trainSize = (int) Math.round(dataset.numInstances() * percent / 100);
int testSize = dataset.numInstances() - trainSize;
Instances train = new Instances(dataset, 0, trainSize);
Instances test = new Instances(dataset, trainSize, testSize);
train.setClassIndex(1);
test.setClassIndex(1);
 
//retrain on the learn set only, so the test set stays unseen
classifier.buildClassifier(train);
 
//do eval
Evaluation eval = new Evaluation(train); //trainset
eval.evaluateModel(classifier, test); //testset
System.out.println(eval.toSummaryString());
System.out.println(eval.weightedFMeasure());
System.out.println(eval.weightedPrecision());
System.out.println(eval.weightedRecall());
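
To see what the weighted measures mean, here is a small plain-Java sketch that computes them from a confusion matrix. It assumes that "weighted" means averaging the per-class values, with each class weighted by the number of instances that actually belong to it, which is my understanding of Weka's behaviour. The class WeightedMetrics and the example matrix are made up for illustration.

```java
class WeightedMetrics {

    // Per-class measure computed from confusion[actual][predicted] counts.
    interface PerClass {
        double of(int[][] confusion, int k);
    }

    // Share of the instances predicted as class k that really are class k.
    static double precision(int[][] c, int k) {
        int predicted = 0;
        for (int[] row : c) {
            predicted += row[k];
        }
        return predicted == 0 ? 0.0 : (double) c[k][k] / predicted;
    }

    // Share of the instances of class k that were predicted as class k.
    static double recall(int[][] c, int k) {
        int actual = 0;
        for (int count : c[k]) {
            actual += count;
        }
        return actual == 0 ? 0.0 : (double) c[k][k] / actual;
    }

    // Harmonic mean of precision and recall for class k.
    static double fMeasure(int[][] c, int k) {
        double p = precision(c, k);
        double r = recall(c, k);
        return (p + r) == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Average the per-class measure, weighting each class by its number of actual instances.
    static double weighted(int[][] c, PerClass measure) {
        double total = 0;
        double sum = 0;
        for (int k = 0; k < c.length; k++) {
            int actual = 0;
            for (int count : c[k]) {
                actual += count;
            }
            total += actual;
            sum += actual * measure.of(c, k);
        }
        return total == 0 ? 0.0 : sum / total;
    }

    public static void main(String[] args) {
        // Made-up confusion matrix: rows are actual classes, columns are predicted classes.
        int[][] confusion = {{8, 2}, {1, 9}};
        System.out.println(weighted(confusion, WeightedMetrics::fMeasure));
        System.out.println(weighted(confusion, WeightedMetrics::precision));
        System.out.println(weighted(confusion, WeightedMetrics::recall));
    }
}
```

With this weighting, the weighted recall is simply the overall accuracy, because each class contributes its correctly classified instances divided by the total.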

When classifying new instances, we must remember to map the classifier’s result to the class attribute value: it returns only the index of the value (for classification problems)!

//12.classify
//result: index of the predicted class value
double predicted = classifier.classifyInstance(dataset.firstInstance());
System.out.println(predicted);
//predicted class value (label)
System.out.println(dataset.classAttribute().value((int) predicted));
//class probability distribution
System.out.println(java.util.Arrays.toString(classifier.distributionForInstance(dataset.firstInstance())));

I hope this example was useful to you. I tried to show how to use Weka for some quick tasks.

ESSIR 2011, Videolectures

As I have already mentioned, the Videolectures.net team recorded the main track of the summer school. You can access their material at:

http://videolectures.net/essir2011_koblenz/

ESSIR 2011, Day 6

On the last day of the summer school, Peter Ingwersen gave a double lecture on information seeking. He introduced some theoretical models, variables and their uses. These models are meant for the architecture of a search engine and are independent of the index implementation. In the last part, he covered some experiments they were conducting and problems we must be aware of when experimenting. On the topic of improving results, he gave an example of two models with around a 5% difference in F-score. The point was that everybody can be happy and your paper can be accepted for a giant improvement, but in practice the improvement is not noticeable.

After the second lecture ddr. Sergej Sizov officially ended the summer school:

Then we went for lunch in the university’s canteen.

In the evening we met the participants who were still in Koblenz for a few more drinks:

Tomorrow (today as of the time of writing) Kaja and I are going to Frankfurt by train and then by plane to Slovenia.

ESSIR 2011, Day 5

Yesterday the first two lectures were given by Stefan Rueger from the Knowledge Media Institute. He talked about multimedia search (images, videos), indexing etc. There were some similarities to Wednesday’s symposium on bias. The main problem he addressed is the semantic gap in image recognition. We have methods to extract data or features from images, but it is still hard to classify the whole image, for example as a victory or championship event.

 

In the afternoon, Ricardo Baeza-Yates talked about crawling, comparing algorithms and some recent results from articles. He also spoke about the idea of searching without explicitly providing a query to an information retrieval system. Then he talked about distributed web search and the ’09 article on star topology. The interesting thing here was that the paper was rejected twice, but then it got the best paper award at the CIKM09 conference.

At around 4 p.m. we went on a boat cruise with dinner on the Rhein and Mosel. We sailed up to the Lorelei and then back to Koblenz.

When we got back at 10 p.m., we attended a party in a local club:


ESSIR 2011, Day 4

On the fourth day of ESSIR we could choose between PhD symposiums and presentations of the LivingKnowledge project work.

I chose the LivingKnowledge project: Bias and diversity sessions.

In the first part, Richard Johansson from Trento presented opinion extraction, from coarse to fine-grained methods. For coarse classification (useful for larger text chunks) they developed SentiWordNet, inspired by WordNet. For example, the word “boring” is not problematic, but “good” has several senses and the system must decide which one to use. The sequence labelers they have developed use the standard IOB scheme. The recent research from LK is about temporal fact extraction, disambiguation and evolution, for example extracting the roles of Arnold Schwarzenegger through time. Their system, Prospera, works similarly to IOBIE, but learns rules like Sofie, in an iterative way. The system was trained on the ClueWeb09 corpus (around 500 million English web pages) on a Hadoop cluster of 10×16 cores with 10×48 GB RAM.
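
As a reminder of how the IOB scheme mentioned above works: each token is tagged B-X (beginning of an entity of type X), I-X (inside the same entity) or O (outside any entity). The following minimal Java sketch turns such a tag sequence back into entities. The class name IobDecoder, the tag types PER and LOC and the example sentence are my own assumptions, not something shown in the talk.

```java
import java.util.ArrayList;
import java.util.List;

class IobDecoder {

    // Group tokens into "TYPE:text" entities according to their IOB tags.
    static List<String> decode(String[] tokens, String[] tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = null;
        String currentType = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            boolean continues = tag.startsWith("I-")
                    && current != null
                    && tag.substring(2).equals(currentType);
            if (continues) {
                // Same entity goes on: append the token.
                current.append(' ').append(tokens[i]);
                continue;
            }
            if (current != null) {
                // The open entity ends here.
                entities.add(currentType + ":" + current);
                current = null;
                currentType = null;
            }
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                // B- opens a new entity; a stray I- is treated as a beginning too.
                current = new StringBuilder(tokens[i]);
                currentType = tag.substring(2);
            }
        }
        if (current != null) {
            entities.add(currentType + ":" + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] tokens = {"Arnold", "Schwarzenegger", "was", "governor", "of", "California"};
        String[] tags = {"B-PER", "I-PER", "O", "O", "O", "B-LOC"};
        System.out.println(decode(tokens, tags));
    }
}
```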

Then in the second part, Jonathan from Southampton gave us a brief introduction to extracting information from images. From his research he found out that the best feature is a histogram of pixel changes over the angles around each pixel. He also compared IE from images with text IE. At the end he showed us open-source tools such as OpenIMAJ, a library for working with IE on images, and ImageTerrier, an extension for Terrier that also indexes images.

After this, Stefan Siersdorfer presented IE from multimedia content on the social web. As I had imagined, he relies on a graph representation. The thing I find most interesting is the wisdom of the crowd: collective intelligence is better than that of a single person. The first research on this appeared on 7 March 1907 in an article published in Nature. A researcher asked people “How much does a bull weigh?”. The average of the answers was very close to the real weight. After this introduction he focused more on photos and videos.

In the last part Michael Mathews from Yahoo! Labs Barcelona guided us through practical use of their Diversity engine on some datasets. The tutorial we went through is presented on the following slides:

From 4 to 6 p.m. we had poster sessions. My poster was printed by the organizers; I just needed to stick it on the roll-up. During the presentation I got some useful tips and found out that one group is working on something very similar to IOBIE. In the second hour I was looking at others’ poster presentations and eating some finger snacks.

My poster:

This evening there was no organized event, so we walked to a nearby castle. Some pictures are published in the gallery below:


ESSIR 2011, Day 3

Today Ricardo Baeza-Yates was the keynote speaker. He is a very famous researcher in the field of Information Retrieval and leads the Yahoo! Research Labs. The keynote was mostly about web-scale algorithms and some interesting IR insights. One of them is the notion of an adversarial level of access for the internet, which seems to be more than just public. Another interesting thing is that this year’s internet usage research showed there are about 400 million servers and 800 million clients, so there exists 1 server for every 2 clients. Another phenomenon is that many people think Google PageRank is still very important, but it is only one of many features in the Learning to Rank task.

Keynote presentation:

The next speaker, Andreas Hotho, talked about IR in social media. He focused on collaborative tagging, folksonomies and some network properties (for example on Del.icio.us). On their site http://www.bibsonomy.org/ documents are tagged for bookmark sharing or scientific use. According to Golder & Huberman, tags can be clustered into 7 groups. To me the more interesting notions are the definitions of folksonomies, “Folksonomies allow users to assign tags to resources”, and logsonomies, “Logsonomies allow users to assign resources to query terms”. At the end he also mentioned his algorithm FolkRank (its purpose can be guessed from the name) and some recommendations for folksonomies.

Another interesting product he mentioned is Piggy Bank, a Firefox plugin developed at MIT that is intended to extract data from the web using manually predefined site scrapers.

Andreas’ presentations:

The last lecture was given by Tran Duc Thanh, who recently got his PhD and is a very successful scientist. He lectured about semantic search. Around slide 22 there are some ideas identical to our IOBIE system. The majority of the talk was then about matching in three dimensions and top-K algorithms for ranking. ObjectRanking (slide 63) could be useful for ranking recognized entities in IOBIE.

The last session was the Plenary Discussion: sharing PhD experiences and recommendations from the ESSIR lecturers. Some interesting answers and recommendations (mostly by Ricardo) were:

  • “Research, you have to live it”
  • “Do not think what you have to do – just do it!”
  • “Why do you need PhD to start the company?”
  • “Do PhD or not?”, “Depends on passion. If not sure, do not do it.”
  • For PhD research you need a life-label: “Invent problems, …, incremental research may not be appropriate”
  • “Go to as many seminars as you can, even if you do not understand the title.”

The one I liked most was: “If you get through PhD, you are emotionally robust!”

Our ESSIR lecturers during the discussion:


In the evening we went on an informal walk around Koblenz, guided by dr. Sergej Sizov, the main organizer of ESSIR 2011.

 

ESSIR 2011, Day 2

Yesterday, on Monday, the first lectures began. At 9 a.m. we first got our IDs. Then I “outsourced” the printing of my poster to the organization team at the info-point ;) .

After some coffee, the keynote by Nigel Shadbolt started. Nigel is a professor of Artificial Intelligence (AI) at the University of Southampton. As I had expected, Tim Berners-Lee and his proposal of the internet were mentioned. Then he presented an overview of the Information Extraction field, as is customary for opening talks. He also introduced his view on the importance of people over algorithms, which we tend to forget. The next big words I remember were that the subfields of IR are not connected to traditional Artificial Intelligence; it may be better to say Augmented Intelligence.

As the keynote presentation is too big, you can retrieve it here: http://zitnik.si/temp/essir2011/01_NigelShatbolt_keynote.pdf

The next two-part lecture, Foundations: Models and Methods, was given by Hinrich Schuetze. He is one of the authors of the book Introduction to Information Retrieval, which I read before writing my bachelor thesis. That is why I was very familiar with the presentation. His presentations and links are also available at http://informationretrieval.org/essir2011. He gave an introduction to the Boolean model (mostly emphasizing feast or famine), the vector model, ranking (interesting points were the order in which users look at the results, that 30% of users will click even a nonrelevant result, a nicely presented difference between Euclidean and cosine similarity, pivot normalization, …), probabilistic models, language models and finally Learning to Rank.
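
The difference between Euclidean distance and cosine similarity is easy to reproduce with a few lines of Java. The following is only a minimal sketch with made-up term-count vectors (the class name SimilarityDemo is mine): a long document that repeats a short one ten times is far away in Euclidean terms, yet points in exactly the same direction.

```java
class SimilarityDemo {

    // Straight-line distance between two term vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two term vectors; 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] shortDoc = {1, 2};    // made-up term counts
        double[] longDoc = {10, 20};   // same term proportions, ten times longer
        double[] otherDoc = {2, 1};    // similar length, different term proportions

        System.out.println("euclidean(short, long)  = " + euclidean(shortDoc, longDoc));
        System.out.println("euclidean(short, other) = " + euclidean(shortDoc, otherDoc));
        System.out.println("cosine(short, long)     = " + cosine(shortDoc, longDoc));
        System.out.println("cosine(short, other)    = " + cosine(shortDoc, otherDoc));
    }
}
```

Euclidean distance ranks the other document as closer to the short one, while cosine similarity prefers the long document, because it ignores document length and compares only the term proportions.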

Schuetze’s presentation:

The last lecture was given by Vladimir Batagelj, the “godfather” of Pajek. He presented the program Pajek and many interesting properties of network analysis, which I had also heard about in his course on large network analysis during my PhD study.

In the evening beach volleyball was planned, but because of the wind we went to the Mexican restaurant Enchilada. There Lars showed us a game with glasses (with it you may separate stupid and smart people – maybe :) :), I successfully passed the test).

Before going back to hotel Sholz we had to go to the German Corner to take some pictures:


ESSIR 2011, Day 1

On the 28th of August 2011, Kaja Vidmar and I went to Koblenz to attend the 8th European Summer School in Information Retrieval – ESSIR 2011.

First we flew from Brnik to Frankfurt at 8 a.m. In Frankfurt we bought tickets for the regional train to Koblenz at 10:37. As there was no information about the route, we somehow managed to get some from a sour lady at the DBahn info-point. At around 1 p.m. we got to the Sholz hotel, where we checked in and first unpacked our laptops.

After some time, we went out. We walked over almost the whole of Koblenz, ate at McDonald’s and got acquainted with the city. At 7:40 p.m. we entered the Gecko lounge, where there was a get-together event for the participants. We got to know some nice people, mostly younger researchers, mainly from the fields of Information Retrieval, Data Mining, Social Network Analysis and also Software Engineering.

At around midnight we arrived at the Sholz hotel, where I am writing this post.


Starting …

“Sometimes a research is a lot of hard work in looking for the easy way.”