Comments (11)

va_dev January 28 2012, 20:50:20 UTC
There is an API that you can use: http://www.livejournal.com/doc/server/index.html. This requires some scripting/coding. Does this make sense?

kypeli January 28 2012, 21:06:06 UTC
Thanks for your reply! Does it make sense for my use case? I am not sure :)

If I understood the documentation correctly, those API calls would let me work with my own blog entries after authenticating with the server. But I could not find anything on that page about how to take an arbitrary public blog on livejournal.com and download its content. Maybe I missed something, or just didn't understand how to read the docs?

va_dev January 28 2012, 21:15:01 UTC
The best way I know of is the XML-RPC protocol. There are existing implementations in various programming languages, but you can also write your own. This page lists the methods that can help you query anything you need from a journal: http://www.livejournal.com/doc/server/ljp.csp.xml-rpc.protocol.html. In your particular case you can use the getevents method in combination with others. The catch is that the number of returned events (entries) per query is limited to 50, but you can fetch all blog entries step by step through the API.
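
The 50-entry limit described above turns the fetch into a paging loop. Here is a minimal Python sketch of that loop, assuming the endpoint URL, the "lastn"/"beforedate" parameter combination, and the simplified plain-password auth shown here (the docs linked above also describe challenge-response auth). Treat it as an outline, not a tested client.

```python
import xmlrpc.client

# Assumed endpoint and page size -- check the protocol docs linked above.
LJ_ENDPOINT = "http://www.livejournal.com/interface/xmlrpc"
PAGE_SIZE = 50  # the per-query limit mentioned in the thread

def fetch_all_events(call, username, password):
    """Fetch events page by page until a short page signals the end.

    `call` is any callable with the getevents signature, so the paging
    logic can be exercised without touching the network.
    """
    events = []
    before = None  # on each pass, fetch only entries older than this
    while True:
        params = {
            "username": username,
            "password": password,   # simplified; real clients use challenge auth
            "selecttype": "lastn",
            "howmany": PAGE_SIZE,
            "lineendings": "unix",
        }
        if before is not None:
            params["beforedate"] = before
        page = call(params).get("events", [])
        events.extend(page)
        if len(page) < PAGE_SIZE:
            break  # short page: nothing older is left
        before = page[-1]["eventtime"]  # oldest entry in this page
    return events

def live_call(params):
    """Real XML-RPC call; only works with valid credentials for the journal."""
    server = xmlrpc.client.ServerProxy(LJ_ENDPOINT)
    return server.LJ.XMLRPC.getevents(params)
```

Decoupling the paging loop from the transport (`call` vs. `live_call`) also makes it easy to add a polite delay between requests.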

int January 28 2012, 21:17:30 UTC
You could do it via the LJ protocol and syncitems/getevents, then output it all in whatever format you want. However, you would need a user's username/password to get their items, which I'm guessing you don't want, since you mentioned pulling public entries.

kypeli January 28 2012, 21:37:57 UTC
That is correct. I am interested in analyzing certain (public) blogs and their content but I am not the admin of these blogs.

So basically there isn't really a way to do what I would like to do?

andy January 29 2012, 07:30:06 UTC
Scraping HTML is the way to do it; LJ is fine with that, assuming your system behaves itself and doesn't create too much strain on the servers. This Perl script used to be able to save a given journal to a set of disk files: http://pastebin.com/1CaVmEij. I haven't checked if it still works, but reviewing it may give you some ideas.
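
For comparison with the Perl script, the scraping approach can be sketched in Python using only the standard library. The entry-permalink pattern (`/NNNN.html`) and the URL handling below are assumptions based on common LJ journal layouts, so verify them against the actual journal pages before relying on them.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Assumed permalink shape for LJ entries, e.g. http://user.livejournal.com/1234.html
ENTRY_LINK = re.compile(r"/\d+\.html$")

class EntryLinkParser(HTMLParser):
    """Collect hrefs that look like LJ entry permalinks."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if ENTRY_LINK.search(href):
            self.links.append(href)

def entry_links(html):
    """Return all entry-permalink hrefs found in an HTML page."""
    parser = EntryLinkParser()
    parser.feed(html)
    return parser.links

def fetch_page(url):
    """Download one page; identify yourself and don't hammer the servers."""
    req = urllib.request.Request(url, headers={"User-Agent": "lj-archiver-sketch"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

From there, the loop is: fetch a recent-entries or archive page, extract the permalinks, fetch each entry page, and sleep between requests so the system "behaves itself" as described above.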

kypeli January 29 2012, 07:44:58 UTC
Thanks! I was afraid it would come down to scraping HTML, but that Perl script should be very helpful. Cheers!

kypeli January 29 2012, 13:02:26 UTC
Thanks again for the Perl script! It worked perfectly!

andy January 29 2012, 13:08:58 UTC
I'm glad I was able to help!
