Title: Web crawling, libraries/APIs

Date: 24.10.2008, 17:00

People: Rasmus Buchmann (vorname.nachname(at)

      Oleksandr Druzhynin (druzhynin at

Tutor: Avaré Stewart



Some Questions:

Do you need content from a single site; does the site provide an API?

Do you need content from multiple sites; is there an API?

Does the website offer an RSS Feed?

What type of web pages do you have?

Do you have many pages with the *same* structure?

Do you have many pages with *different* structures?

Do you have to be selective about the content that you extract?

Do you have to preserve the structure/type of content on the page: timestamp, tags, etc.?

What other criteria are important in selecting a tool?

Some Tools

Site-Specific APIs:


Digg's API

Multi-Site API:

Spinn3r: “We crawl the web, so you don't have to”

Road Runner: Abstract from HTML and create XML with text marked up using a single tag name

WWW::Mechanize: Emulate a Browser
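WWW::Mechanize itself is a Perl module. As a rough illustration of what "emulating a browser" involves, here is a minimal sketch using only the Python standard library instead: it keeps cookies between requests and sends a browser-like User-Agent header. The names `USER_AGENT`, `make_browser` and `fetch`, and the header value, are assumptions made for this example and are not part of any of the tools listed here.

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical User-Agent string; a real crawler should identify itself honestly.
USER_AGENT = "SeminarCrawler/0.1 (example)"


def make_browser():
    """Build an opener that remembers cookies and sends a fixed User-Agent."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener, jar


def fetch(opener, url, timeout=10):
    """Fetch a page as text; cookies set by the server are stored in the jar."""
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    browser, cookies = make_browser()
    # Network access is needed for this call; the URL is a placeholder.
    page = fetch(browser, "http://example.org/")
    print(len(page), "characters,", len(cookies), "cookies")
```

Form filling and link following, which WWW::Mechanize also offers, are not covered by this sketch.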

HTML Parser: Fine-grained manipulation of HTML markup
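To make "fine-grained" concrete, the following is a small sketch of selective extraction with Python's standard `html.parser` module (not the HTML Parser library named above). It collects only the links and their visible text from a page and ignores everything else. The class `LinkExtractor`, the function `extract_links` and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE_HTML = """<html><body>
<h1>News</h1>
<p>Read <a href="/story/1">the <b>first</b> story</a> or
<a href="/story/2">the second</a>.</p>
</body></html>"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, "->", text)
```

The same pattern extends to other selective extraction tasks, for example collecting only timestamps or tags, by reacting to different tag names or attributes.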

RSS: Automatically collect XML-formatted data from the web page and store it locally (for example, with Curn)
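The "collect and store locally" step for RSS could look roughly like the sketch below. It assumes an RSS 2.0 feed and uses only the Python standard library. The sample feed, the function names `parse_items` and `store_items`, and the JSON output format are assumptions made for this example. Fetching the feed over the network is left out so that the sketch runs offline.

```python
import json
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed standing in for a downloaded one.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title><link>http://example.org/1</link>
<pubDate>Fri, 24 Oct 2008 17:00:00 GMT</pubDate></item>
<item><title>Second post</title><link>http://example.org/2</link></item>
</channel></rss>"""


def parse_items(feed_xml):
    """Return one dict per <item>, keeping title, link and timestamp."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


def store_items(items, path):
    """Write the extracted items to a local JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(items, fh, indent=2)


if __name__ == "__main__":
    entries = parse_items(SAMPLE_FEED)
    store_items(entries, "feed_items.json")
    print("stored", len(entries), "items")
```

Because a feed already separates title, link and timestamp, this route preserves structure that would otherwise have to be recovered from the HTML.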

A variety of tools, depending on purpose


