Tuesday, 17 September 2013

Extracting semi-structured user generated content from web pages using Python

I am working on a project for which I need to extract the chords played over
song lyrics. The goal is to find which part of the lyrics is sung under
which chord. I'm using web pages containing guitar chords from
ultimate-guitar.com (I chose this site because it seems to have the largest
collection of transcribed songs).
The typical structure of such a web page is shown below. For example:
http://tabs.ultimate-guitar.com/p/poets_of_the_fall/carnival_of_rust_crd.htm
Snippet:
As you can see, the chords are written on the line before the lyrics, and the
relative position from the left margin decides which chord is played over
which words. The page source for the above song looks like:
My strategy to accomplish the task:
1) Find the relevant portion of the web page (ignoring ads and indexes)
using Beautiful Soup.
2) Read this portion line by line.
3) Use the <span> tags to identify which lines contain chords.
4) Assume that the line following a line with <span> tags contains the
lyrics.
5) Find the relative position of each chord, store it, and compare it to
the positions of the words in the line below to find out which chords are
played over which words.
6) Store this data in a dictionary with the chord name as the key and the
list of phrases played over that chord as the value.
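The column-alignment part of the strategy (steps 5 and 6) can be sketched as
follows. This is a minimal illustration, assuming the chord line and its lyric
line have already been extracted as plain text; the function name and the
regex-based tokenization are my own, not taken from any existing library.

```python
import re
from collections import defaultdict

def align_chords(chord_line, lyric_line):
    """Map each chord to the phrase sung under it, using column offsets.

    Assumes chords sit directly above the words they are played over,
    as in the ultimate-guitar layout described above.
    """
    # Find each chord token and the column where it starts.
    positions = [(m.start(), m.group()) for m in re.finditer(r"\S+", chord_line)]
    mapping = defaultdict(list)
    for i, (col, chord) in enumerate(positions):
        # The phrase runs from this chord's column to the next chord's column.
        end = positions[i + 1][0] if i + 1 < len(positions) else len(lyric_line)
        phrase = lyric_line[col:end].strip()
        if phrase:
            mapping[chord].append(phrase)
    return dict(mapping)
```

For instance, `align_chords("C       G", "Hello   world")` yields
`{"C": ["Hello"], "G": ["world"]}`. Slicing strictly by column can cut a word
in half when a chord change falls mid-word, so a fuller version would snap the
boundary to the nearest word break.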
The above implementation works fine in some cases, but since the pages follow
no strictly defined structure, it fails miserably whenever the assumed
structure is not followed.
For example (source:
http://tabs.ultimate-guitar.com/k/kate_voegele/all_i_see_crd.htm):

Here there are unexpected `<pre><i></i>` tags before the chord, and now my
key is stored as "`<pre><i></i>D`" instead of just "D".
And there are many such errors in my parsed data because of this unexpected
variation in page structure. Any ideas on how these kinds of cases could be
handled, or is there a better way to accomplish this task?
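One possible way to handle such stray markup, sketched here as an assumption
rather than a tested fix, is to sanitize each extracted token before using it
as a dictionary key: strip anything tag-like, then keep only the part that
looks like a chord name. The `clean_chord` helper and its chord pattern are
hypothetical; the pattern is a rough heuristic, not a complete chord grammar.

```python
import re

# Rough pattern for chord names like "D", "F#m", "Cadd9", or "G/B".
CHORD_RE = re.compile(r"[A-G][#b]?[a-zA-Z0-9]*(?:/[A-G][#b]?)?")

def clean_chord(token):
    """Reduce a token such as '<pre><i></i>D' to the chord itself.

    Drops anything tag-like, then returns the first substring that looks
    like a chord name, or None if nothing chord-like remains.
    """
    token = re.sub(r"</?[a-zA-Z][^>]*>", "", token)  # strip HTML tags
    m = CHORD_RE.search(token)
    return m.group() if m else None
```

So `clean_chord("<pre><i></i>D")` returns `"D"`. An alternative would be to
let Beautiful Soup itself strip the markup (e.g. via `get_text()`) before the
line-by-line pass, so tags never reach the chord dictionary in the first place.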
