Sunday, January 22, 2006

Copying from Blogger to TWiki with Perl

Blogs are a great tool, but sometimes you want to integrate them with other bodies of work. In my case I have this blog, hosted on Blogger, and on a private network I have a TWiki system (a flavor of wiki) where I would also like the articles to appear.

In the new "Web 2.0" world I would probably do this by plopping some JavaScript in the wiki and having it dynamically render the feed from my blog. There are several issues with doing this, the most important being that because the JavaScript renders the content dynamically, the content is not searchable.

So the solution, for me, was to write a program that could read in an Atom feed and update the wiki accordingly. Perl seemed to be the obvious choice for the language to write the tool in, and as expected there were modules for both Atom, a feed format, and the TWiki API. Besides these, the only other modules needed are for parsing and formatting dates. Below is the start of the code.


#!/usr/bin/perl
 
use strict;
use WWW::Mechanize::TWiki;
use XML::Atom::Client;
use Date::Parse;
use Date::Format;


Now comes the hard part, which is trying to get the modules installed. I had a lot of problems here, partly because I had old versions of some modules and the installers didn't warn me of this. It is hard to predict what problems you might have here, but hopefully the installation of missing modules will go a lot smoother for you. Note that if you need to connect to an SSL server, you will need to install the Crypt::SSLeay module. Some of these modules also have system library requirements, but with luck you already have those installed.
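
Since outdated or missing modules were the main source of trouble, it can help to check what is actually installed before running anything. The following is only a minimal sketch: module_version() is a helper name I made up for illustration, and it relies on nothing beyond core Perl's require and the universal VERSION method. It reports the installed version of each module the script uses, or flags the module as missing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: try to load a module by name. Returns its
# version, the string 'unknown' if it loads but declares no
# $VERSION, or undef if it cannot be loaded at all.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
    return undef unless eval { require $file; 1 };
    my $version = eval { $module->VERSION };
    return defined $version ? $version : 'unknown';
}

# Modules used by the script in this article, plus the optional SSL one.
my @modules = qw(
    WWW::Mechanize::TWiki
    XML::Atom::Client
    Date::Parse
    Date::Format
    Crypt::SSLeay
);

for my $module (@modules) {
    my $version = module_version($module);
    printf "%-24s %s\n", $module,
        defined $version ? "ok ($version)" : 'MISSING';
}
```

Anything reported as MISSING needs to be installed, and a version number that looks old is a hint to upgrade before blaming the script.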

Assuming that you have everything installed, tested, and working, it's time to write some code. First we need to set up our TWiki agent. The TWiki module uses a REST-style API, which means that it just pretends to be a browser, using simple HTTP requests.


my $mech = WWW::Mechanize::TWiki->new(
    agent => 'BlogAgent/0.1', autocheck => 1 )
    or die $!;
$mech->cgibin(
    'https://twiki.internal.com/cgi-bin/twiki',
    {scriptSuffix => ''});


If you need to log in to your TWiki, you will also need to make use of the $mech->credentials() method to set your username and password.

Next we need to connect to our Atom feed and retrieve the entries. This is pretty simple: just specify the feed URL and fetch the entries.


my $api = XML::Atom::Client->new;
my $feed = $api->getFeed(
'http://roberthanson.blogspot.com/atom.xml');
my @entries = $feed->entries;


The easy part is done. Now we need to loop through the entries in the Atom feed, extract the data, and create TWiki entries. This loop references two functions that we need to have, one called toWikiWord() and one called isSame(). toWikiWord() will convert blog titles to a "wiki word" (TWiki uses CamelCase), prefixed with BlogArticle. isSame() is a tool that will compare two pieces of text, not including whitespace, and return true if they are the same. We use this function so that we don't update TWiki pages unless there was a change in the Atom entry.


sub toWikiWord {
    my $t = shift;
    $t = "blog article $t";   # add prefix
    $t = lc($t);
    $t =~ s/\d//g;            # remove numbers
    $t =~ s/\b(\w)/\u$1/g;    # convert to camel case
    $t =~ s/[^a-z]//ig;       # strip non-alpha chars

    return $t;
}

sub isSame {
    my ($s, $t) = @_;

    # remove all whitespace before comparison
    $s =~ s/\s//g;
    $t =~ s/\s//g;

    return $s eq $t;
}
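
To see what these two helpers actually produce, here is a small standalone sketch. It repeats the two functions so that it runs on its own, and the sample titles are made up for illustration. One thing it shows is that because digits and punctuation are stripped, titles that differ only in a number end up with the same page name, which you may need to account for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copies of the two helpers, so this sketch is self-contained.
sub toWikiWord {
    my $t = shift;
    $t = lc("blog article $t");   # add prefix, lower-case everything
    $t =~ s/\d//g;                # remove numbers
    $t =~ s/\b(\w)/\u$1/g;        # upper-case the first letter of each word
    $t =~ s/[^a-z]//ig;           # strip non-alpha chars
    return $t;
}

sub isSame {
    my ($s, $t) = @_;
    s/\s//g for $s, $t;           # ignore all whitespace in both strings
    return $s eq $t;
}

print toWikiWord('Copying from Blogger to TWiki with Perl'), "\n";
# prints BlogArticleCopyingFromBloggerToTwikiWithPerl

# Only the digit differs, so both titles map to the same page name.
print toWikiWord('Release Notes, Part 1'), "\n";   # BlogArticleReleaseNotesPart
print toWikiWord('Release Notes, Part 2'), "\n";   # BlogArticleReleaseNotesPart

# Line breaks and indentation do not count as a change.
print isSame("Title: Foo\n  By: Bar", 'Title:Foo By:Bar')
    ? "same\n" : "different\n";                    # same
```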


Below is the loop which processes each entry in order from the Atom feed. It is a little big, so I have added comments inside the code to explain each part instead of trying to explain each small section.


for (@entries) {

    # We need to create the page name for the wiki.
    # We will use the toWikiWord() function that we
    # already discussed, and prefix it with the name
    # of the TWiki "web name". For me, I am placing
    # the articles in a web called "PS".

    my $page = 'PS.' . toWikiWord($_->title);

    # This is where our TWiki agent actually connects
    # to the TWiki system and edits the page where the
    # article will live. If the page doesn't exist,
    # this will also create it. I am setting the
    # topic parent to my main page; you should change
    # this to a suitable page.

    $mech->edit($page, {
        topicparent => 'Main.RobertHanson',
    });
 
    # We need to extract the URL from the Atom feed.
    # The feed may contain more than one URL,
    # especially if your blog supports Atom admin
    # extensions, like Blogger does. The loop
    # below reads through each link, and grabs
    # the first that is labeled as an "alternate".

    my $url = '';
    for my $link ($_->link) {
        if ($link->get('rel') eq 'alternate') {
            $url = $link->get('href');
            last;
        }
    }
 
    # Next we grab the created and modified dates for
    # the blog entry. We want to include the created
    # date on the TWiki page, and optionally include
    # the modified date if the date is different.
    # In the code below we are only getting the date,
    # and not the time that the entry was created.
    # You may want to change the date formats to suit
    # your needs.

    my $c_time = str2time(
        $_->get('http://purl.org/atom/ns#',
            'created'));
    my $m_time = str2time(
        $_->get('http://purl.org/atom/ns#',
            'modified'));

    my $c_long = time2str('%B %e, %Y', $c_time);
    my $m_long = time2str('%B %e, %Y', $m_time);

    my $date_str = $c_long;
    $date_str .= " (updated $m_long)"
        if $c_long ne $m_long;
 
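    # Now we need to create the text for the TWiki
    # entry. We print the title, author, date, blog
    # entry URL, and article text. The format needs
    # one %s for each of the five values passed in,
    # otherwise the date lands on the "From" line
    # and the article text is dropped. You will want
    # to change this to suit your needs.

    my $text = sprintf('
Title: %s <br>
By: %s <br>
Date: %s <br>
From: %s <br>
<p> %s </p>',
        $_->title,
        $_->author->name,
        $date_str,
        $url,
        $_->content->body);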
 
    # Now we compare the text that we just created
    # against the current page in the TWiki. If the
    # contents are the same we move on to the next
    # entry. This is very important if we run this
    # often, as we don't want the TWiki to create
    # new versions of the article if it didn't
    # actually change.

    if (isSame($mech->field('text'), $text)) {
        print "No change required to $page\n";
        next;
    }

    # Almost done. All we need to do is set the
    # text, and save it.

    $mech->field(text => $text);
    $mech->click_button(value => 'Save');

    print "Created/Updated $page\n";
}


From here you have the tools you need to copy the articles to your TWiki system. If you don't use TWiki, you can still use this code by changing the TWiki-specific lines to use an API that is compatible with your wiki. You may also want to add additional functionality to this; for example, in my code I also created an article index page.
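
I haven't shown my index page code, but the idea can be sketched roughly as follows. This is not the code I actually run: the buildIndexText() function, the record layout, and the heading text are all made up for illustration. It just turns a list of articles into TWiki markup, using a "---+" heading, TWiki's bullet syntax (three spaces, an asterisk, and a space), and [[Topic][label]] links.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the TWiki markup for an index topic
# from a list of { page => ..., title => ..., date => ... } records.
sub buildIndexText {
    my @articles = @_;

    my $text = "---+ Blog Articles\n\n";
    for my $article (@articles) {
        # One bullet per article: a link to the topic, labeled with
        # the blog title, followed by the date in parentheses.
        $text .= sprintf("   * [[%s][%s]] (%s)\n",
            $article->{page}, $article->{title}, $article->{date});
    }

    return $text;
}

print buildIndexText(
    {
        page  => 'PS.BlogArticleCopyingFromBloggerToTwikiWithPerl',
        title => 'Copying from Blogger to TWiki with Perl',
        date  => 'January 22, 2006',
    },
);
```

To wire this in, you could collect one record per entry inside the main loop, then after the loop edit an index topic of your choosing and save the generated text the same way the articles are saved, using isSame() again to skip the save when nothing changed.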

Happy coding.

2 comments:

Unknown said...

Hi,

do you have a link to TWiki API please?

Zdenek

Robert Hanson said...

http://search.cpan.org/~wbniv/WWW-Mechanize-TWiki-0.12/