
Semantic Web: Wikipedia and Natural Language Processing


Malvina and Zvi after the semantic web panel at Wikimania 2006.

Zvi and Malvina discuss fine points after the panel.
Malvina [right] was one of the panelists.

Suppose that you, like me, are a new Wikipedian. You’ve learned the wiki markup, which is not a big deal – rather easy compared to HTML, but it still takes non-zero time. You’ve learned some of the conventions of the culture. You put “your” page together and put it up on the ‘pedia. What happens then? Well, if you, like me, didn’t read ALL the conventions of the culture, you will come back some time later and find “your” page emblazoned with banners informing you of the conventions that you didn’t read. One of these might be that you forgot to assign “your” page a category. So you then need to spend a chunk of time reading the tree of available categories. It’s not hard to find one or two quickly, but how do you know you’ve found the best categories? How do you know you’ve found all the relevant ones? It’s a barrier to entry for new Wikipedians and a problem even for some experienced Wikipedians.

Natural Language Processing (NLP) could automate this process to some extent. Programs can read batches of already-categorized articles and build a ‘signature’ for each category, which could then be matched against new articles to suggest categories for them. This could be done now. The Wikipedians are discussing whether it should be done now.
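To make the idea concrete, here is a minimal sketch of signature-based category suggestion. The category names, example texts, and function names are all my own invention for illustration; a real system would use far richer features than raw word counts, but the matching idea is the same.

```python
import math
from collections import Counter

def signature(text):
    """A crude 'signature': lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count signatures."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical signatures learned from already-categorized articles.
corpus = {
    "Cities": "population area mayor city founded river capital",
    "Mathematicians": "theorem proof born mathematics university professor",
}

def suggest_categories(article_text, corpus, top_n=1):
    """Rank known categories by similarity to a new article."""
    sig = signature(article_text)
    scored = sorted(
        ((cosine(sig, signature(txt)), cat) for cat, txt in corpus.items()),
        reverse=True,
    )
    return [cat for score, cat in scored[:top_n] if score > 0]

print(suggest_categories(
    "The city was founded on the river and its population grew", corpus))
```

Running this suggests “Cities” for the sample sentence, since it shares several words with that category’s signature and none with the other.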

On the one hand it would make creating new articles easier. Jimmy Wales mentioned in the morning plenary that with over 1,000,000 articles in the English Wikipedia, quality of existing articles is a higher priority than creating new ones. But NLP techniques can help here too. For example, a tool that can identify population numbers could check that a given city has the same population everywhere in the ‘pedia.
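A consistency checker of the kind described above could be sketched as follows. The city name, the snippets, and the regular expression are made up for illustration; real Wikipedia text is far messier, but the principle of extracting the same fact from several pages and comparing the values holds.

```python
import re

# Hypothetical snippets from different pages mentioning one city's population.
snippets = {
    "Springfield": "Springfield has a population of 52,000 residents.",
    "List of cities": "... Springfield (population 52,000) ...",
    "Springfield County": "The county seat, Springfield, population 48,500, ...",
}

# Matches phrases like "population of 52,000" or "population 52,000".
POP_RE = re.compile(r"population(?:\s+of)?\s+([\d,]+)", re.IGNORECASE)

def extract_population(text):
    """Return the first population figure in the text, or None."""
    m = POP_RE.search(text)
    return int(m.group(1).replace(",", "")) if m else None

def check_consistency(snippets):
    """Extract the figure from each page and report whether they all agree."""
    values = {page: extract_population(text) for page, text in snippets.items()}
    distinct = {v for v in values.values() if v is not None}
    return values, len(distinct) <= 1

values, consistent = check_consistency(snippets)
print(consistent)  # the 48,500 figure disagrees with the others, so False
```

A tool like this would not fix the discrepancy itself; it would flag the disagreeing pages for an editor to reconcile.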

On the other hand, NLP systems are complex and consume a lot of computing resources. They are ‘heavy’. Wikipedia is currently ‘light’, i.e. simple and fast, and the Wikipedians would like to keep it that way. NLP techniques will be introduced cautiously.

Why have I said “your” page throughout? That’s another aspect of Wikipedia culture. Articles do not belong to the originator, the most prolific contributor, or anyone else. It’s free content, baby! It belongs to the world.

