I’ve been in Berlin at Wikimedia Deutschland’s 5th Birthday (5 Jahre, if you speak the local parlance).
At this party, I stayed with Daniel Kinzler (Duesentrieb) of Toolserver and Wikimedia DE fame. Daniel and I had lengthy conversations about plans for implementing some sort of system for bringing Wikimedia to the semantic web.
Here are some of the things we talked about:
- The “identity crisis” of semantic data. When we add an “author” tag to a page about a book, are we talking about the article itself, or about the subject of the article? For example, should the tag on http://en.wikipedia.org/wiki/Mein_Kampf be “Wikimedia Community”, or should it be “Adolf Hitler”? While this is an interesting challenge, Daniel and I figured that the answer only needs to be well-defined, and then it is dealt with.
- The need for integration with existing article wikitext. It is no good to implement a brand new system that requires automatic (or even manual) conversion of millions of existing articles. A good way to do this would be to modify the underlying infobox templates, so that they participate in the experiment.
- The need to balance flexibility with performance in the query infrastructure.
- The storage location for semantic data. We pretty much agreed that the canonical version should be stored in the wikitext itself for versioning purposes, and that a searchable, normalised store of the current version of the data should be stored elsewhere in the database (or in a different database), much as category, page and template links are stored.
- The relationship between infobox usability and the entry of semantic data. I’ll discuss this below.
So I do have an idea of sorts as to how we could implement this really well. Like most good suggestions, it is integrated with the infobox infrastructure. I think this is necessary, because unless the data is user-visible, it will not be maintained very much. Infoboxes make the information user-visible in a very obvious and simple way, and coupled with a client-side infobox editor, it could become very easy and rewarding to edit and maintain semantic data. Infoboxes also tie in neatly with the Wikipedia Usability Initiative, which plans (as far as I am aware) to clean up the interface for editing infoboxes, so that it doesn’t involve dealing with horrible template syntax.
So here’s my crazy idea for killing both birds with one stone:
Step 1: Create an Infobox generator/editor
This would involve:
- Defining the different “templates” (i.e. “person”, “US Government Agency”, “1980s singer’s dog” (Daniel’s suggestion) ). Each template would include the fields expected for that kind of subject, and what data types they were.
- Selecting the template to use when creating each article. This could be done with a nice AJAX-y search interface. In turn, you are given a simple interface for entering the data. Adding the data type gives lots of niceties; for example, if the data type is a page or image name, you get pretty AJAX search and preview. Daniel noted that we’d need an extra field for auxiliary commentary and sources on this data, population being the classic example (what is the data’s currency and source?).
- Somehow exporting this data to wikitext. This could be done as a template, or (my preferred solution) a parser function or tag extension. I’ll discuss this in the next section.
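To make Step 1 a little more concrete, here is a rough Python sketch of what a typed template definition, a validation pass and a wikitext export might look like. Everything in it is an assumption for illustration: the names `FieldSpec`, `InfoboxTemplate`, `validate` and `to_wikitext`, the four data types, and especially the `{{#infobox:...}}` parser function, which does not exist in MediaWiki.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FieldSpec:
    """One expected field of an infobox template (hypothetical structure)."""
    name: str
    data_type: str  # one of the keys of VALIDATORS below
    required: bool = False


@dataclass(frozen=True)
class InfoboxTemplate:
    """A named kind of subject, e.g. "person", with its expected fields."""
    name: str
    fields: tuple


def _is_page_title(value):
    # Rough check only: non-empty and free of characters that would break
    # the surrounding wikitext. Real title rules are stricter.
    return (isinstance(value, str) and value.strip() != ""
            and not any(ch in value for ch in "[]{}|#<>"))


# Assumed set of data types; a real editor would need more (image, URL, ...).
VALIDATORS = {
    "string": lambda v: isinstance(v, str) and v.strip() != "",
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": lambda v: isinstance(v, date),
    "page": _is_page_title,
}


def validate(template, values):
    """Return a list of human-readable problems; an empty list means valid."""
    known = {spec.name for spec in template.fields}
    errors = [f"unknown field: {key}" for key in values if key not in known]
    for spec in template.fields:
        if spec.name not in values:
            if spec.required:
                errors.append(f"missing required field: {spec.name}")
            continue
        if not VALIDATORS[spec.data_type](values[spec.name]):
            errors.append(f"bad {spec.data_type} value for field: {spec.name}")
    return errors


def to_wikitext(template, values):
    """Export the data as a call to a hypothetical {{#infobox:}} function."""
    lines = ["{{#infobox:" + template.name]
    for spec in template.fields:  # template order, not entry order
        if spec.name in values:
            value = values[spec.name]
            text = value.isoformat() if isinstance(value, date) else str(value)
            lines.append(f"| {spec.name} = {text}")
    lines.append("}}")
    return "\n".join(lines)


# Example template definition for the "person" kind of subject.
PERSON = InfoboxTemplate("person", (
    FieldSpec("name", "string", required=True),
    FieldSpec("born", "date"),
    FieldSpec("birthplace", "page"),
))
```

The point of keeping the field types in the template definition is that the editor can pick its input widget from them (page search for "page", a date picker for "date"), and the same definition can reject bad data before it ever reaches the wikitext. Daniel's extra field for commentary and sources is left out here to keep the sketch short.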
Step 2: Generate infoboxes from semantic data at a software level
The templates we currently use for infoboxes are terrible hacks produced because nobody stepped up and provided a better way of generating them. My proposal is to replace infobox templates with wrappers around a tag and/or parser-function-based solution.
This makes things much cooler. The semantic data would all be stored in a normalised way in the database (generated from the wikitext), and the infoboxes would be just one front-end interface to this data (albeit the most used and user-friendly). Another interface could be an RDF (or similar machine-readable format) output format.
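As a hedged illustration of the machine-readable front end, the sketch below serialises one article's infobox data as N-Triples, one of the simpler RDF syntaxes. The property namespace `http://example.org/infobox/property/` is a placeholder I made up, the function names are hypothetical, and every value is emitted as a plain string literal, which a real exporter would want to refine.

```python
from urllib.parse import quote

ARTICLE_BASE = "http://en.wikipedia.org/wiki/"
# Placeholder namespace for infobox field names; purely an assumption.
PROPERTY_BASE = "http://example.org/infobox/property/"


def escape_literal(text):
    """Escape the characters that N-Triples requires escaping in literals."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(article_title, values):
    """Emit one N-Triples line per infobox field, sorted by field name."""
    subject = ARTICLE_BASE + quote(article_title.replace(" ", "_"))
    lines = []
    for key in sorted(values):
        predicate = PROPERTY_BASE + quote(key)
        literal = escape_literal(str(values[key]))
        lines.append(f'<{subject}> <{predicate}> "{literal}" .')
    return "\n".join(lines)
```

For example, `to_ntriples("Berlin", {"country": "Germany"})` gives a single line whose subject is the article URL, which matches the storage-level choice of treating the article as the subject.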
Step 3: Store normalised data in the database for queries
I’ve touched on this before, but not addressed it rigorously. A huge part of the utility of storing semantic data in the database is that you can emulate many of the ad-hoc, manually maintained categories with very simple normalised storage and retrieval of the semantic data.
So, the most sensible storage system seems to be the tried and tested semantic triplet – subject, relation, object. The subject is, of course, the article (for storage purposes, this isn’t related to the “identity crisis” issue discussed above). That leaves the relation and object as the metadata key and value respectively.
The raison d’être of this storage is the query interface. I won’t dwell on it, because it’s not a particularly original idea.
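For what it's worth, here is a minimal sketch of the triple store and the kind of category-like query I have in mind, using Python's built-in `sqlite3` purely for illustration. The table layout and the function names are my assumptions, not an existing MediaWiki schema. The store is treated as derived data: each time an article is saved, its rows are thrown away and regenerated from the wikitext.

```python
import sqlite3


def open_store():
    """Create an in-memory triple store (subject, relation, object)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE triples (
               subject  TEXT NOT NULL,  -- the article title
               relation TEXT NOT NULL,  -- the metadata key
               object   TEXT NOT NULL,  -- the metadata value
               PRIMARY KEY (subject, relation, object)
           )"""
    )
    # Queries go from (relation, object) to subjects, so index that way.
    conn.execute("CREATE INDEX triples_by_value ON triples (relation, object)")
    return conn


def replace_article_triples(conn, article, values):
    """Regenerate the rows for one article, as would happen on every save."""
    with conn:  # one transaction: old rows out, new rows in
        conn.execute("DELETE FROM triples WHERE subject = ?", (article,))
        conn.executemany(
            "INSERT INTO triples (subject, relation, object) VALUES (?, ?, ?)",
            [(article, key, str(value)) for key, value in values.items()],
        )


def articles_where(conn, **conditions):
    """Return the articles matching every relation=object condition given."""
    if not conditions:
        raise ValueError("at least one condition is required")
    query = " INTERSECT ".join(
        "SELECT subject FROM triples WHERE relation = ? AND object = ?"
        for _ in conditions
    )
    params = [part for pair in conditions.items() for part in pair]
    rows = conn.execute(query + " ORDER BY 1", params)
    return [row[0] for row in rows]
```

With rows for a few city articles loaded, `articles_where(conn, country="Germany", kind="capital")` plays the role of a manually maintained "Capitals in Germany" style category, except that nobody has to maintain it by hand.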
Advantages
The main advantages of this system are practical. There isn’t that much extra software to be implemented, except the infobox editor. The infobox editor, of course, is planned to be implemented by the Usability Initiative anyway! What remains isn’t too difficult to throw together in a few weeks, and could very quickly change the nature of Wikipedia’s reuse.
Please give me your thoughts on these musings and this proposal!
This sounds like a parallel version of Semantic MediaWiki. What is the point, when one already exists, with superior features?
As for infobox editing, a gadget already exists: http://de.wikipedia.org/wiki/Benutzer:Revvar/VM
It could be improved in a lot of ways, but it is already pretty helpful.
As for infobox templates, I think the template system is not too bad for that (from a template creator point of view; I don’t know how efficient it is with the system resources), except the mess with the table rows in if templates using {{!}} and similar hacks. Conditional rows would be a nice addition to the current table syntax: something like
|- #if="{{{area|}}}"
| Area: || {{{area}}}
|- …
is easy to understand and much more readable than
-->{{#if:{{{area|}}}|
{{!-}}
{{!}} Area: {{!!}} {{{area}}}
}}<!--
not to mention all sorts of unexpected behavior, with extra newlines left after transclusion turning into empty tags and creating ugly spaces around the infobox.
Link | June 16th, 2009 at 9:07 pm
It would certainly be irresponsible to continue with Semantic data without doing a comprehensive review of what Semantic MediaWiki has to offer. With that said, my general impression is that Semantic MediaWiki does not have the focus on infoboxes that this approach would have. My impression of the configuration of Semantic MediaWiki is that it is flexible, yet frustrating to use. It also seems quite feature-heavy, and would need severe paring down, and it does not include one of the most critical elements of my proposal – the AJAX infobox editor.
These impressions may be wrong. At the end of the day, it may make sense to use Semantic MediaWiki as the backend for steps 2 and 3, but step 1 (by far the most development-heavy) is the lynchpin of the proposal, and it is not addressed by Semantic MediaWiki to my knowledge.
Some ideas and/or code could be used from this gadget, but if it were up to the task, then we would already be using it. It’s certainly worth keeping on the table for the usability folks to take a peek at.
I'd really like to see this abstracted away to the software. Table wikicode is hideous except for very simple tables.
Interesting comments, particularly on Semantic MediaWiki, thanks for contributing!
Link | June 17th, 2009 at 1:21 am
I was looking at Semantic MediaWiki and how it might be used for Wikipedia, and I think that in practice our data is too complicated to put in a triple.
Berlin IS CAPITAL OF Germany.
Start date?
Finish Date?
Preceded by?
Succeeded by?
Which Germany do you mean? (Nazi? DDR? BRD?)
Source of the information?
Infoboxes would be a much better way to structure this information.
Suggestions:
INTERNATIONALISATION
I would like the infobox data stored in Commons or a new Infobase and imported into the articles in every language. With a minimum of infobox localisation, any small Wikipedia can make all the info in the Infobase available (and editable) in its language.
INFOBOXES FOR NON-NOTABLES
The compromise that the Deletionists and the Inclusionists have arrived at is that where characters are not notable in themselves, they can still be included in an ‘also starring’ list/summary page. This means we lose the one-to-one correspondence between the page and the person, which Semantic MediaWiki depends on. (SMW triples are actually doubles in the form
“The subject of this page IS CAPITAL OF [[Germany]]”)
MAKE WIKISPECIES A PILOT PROJECT
Wikispecies could be a pilot for the Infobase. At the moment (so I’ve been told) the info in Wikispecies is less complete than in the English Wikipedia; however, if there was a sound basis for sharing and reusing info, it would be interesting to see what becomes of it. If that works, then I would suggest Commons as the next application, making it easier to enter and search metadata for files there.
ATTRIBUTION
As this information is to be so widespread (Microsoft Bing is already doing something like this with existing infobox data), it becomes even more important that the information is correct. Every datum and fact needs to be referenced back to a source, and the form for entering data needs to include space for these references. This could be partially automated: enter the ISBN, and AJAX fills in the name and author if we already have them in the database.
Link | June 18th, 2009 at 1:51 am
and another thing
Make the infobox data, and the infobox itself, easily exportable and importable. Wikipedia will probably never have an infobase of every CD, movie, book and video game, but Wikipedia data would be a useful start for such a collection.
Make the forms easy for people to use on their own sites. If an author puts up infoboxes for herself and her books on her site, that is then easy for us (or Google) to import.
LOCALISATION
Make localisation possible in languages which don’t have their own Wikipedias. Think how happy the Conservapedians will be when they can have personal infoboxes with headings for “Christian Name” and “Surname” and dates in AD and BC.
Link | June 19th, 2009 at 6:32 pm