Extract text from HTML pages, separate into articles, extract keywords, tag with keywords, import to WordPress

Microsoft Excel
We run a niche B2B news service. Every day we send out an email containing between 10 and 30 stories. This email is then turned into an HTML page that includes all the stories and is uploaded to our website. This has been going on for around 6 years, so I reckon there are about 1,500 individual HTML pages, containing somewhere between 15,000 and 40,000 stories.

I've been tasked with creating a new site using WordPress. I'm fine with this, but I'm stuck when it comes to putting all the old archive stories into WordPress as separate articles. Here's what I need to do:

1. Scrape all the text from the HTML pages.
2. Separate the text into individual articles with the following fields: article name, article body, date, article position. Article position refers to whether the story is first, second, third, and so on, on the HTML page.
3. Parse/spider the article titles and article bodies to create a list of possible tags, then match these against each article to generate suggested tags. If the tagging isn't possible, it would be acceptable for the articles just to be tagged with "archive" or something similar, so they can be tagged manually as and when possible.
4. Save the output data (article title, article body, date, position, tag1, tag2, tag3) in whatever form makes sense: Excel, XML, whatever.
5. Import the data into WordPress.

I have a good understanding of what needs to be done and a rough idea of how to do it; I just lack the detailed technical knowledge to put it together. Any suggestions or help would be much appreciated!

Answer Wiki


There are a couple of good scripts posted at http://www.biterscripting.com/SS_WebPageToText.html and
http://www.biterscripting.com/SS_URLs.html . They show how to parse web pages. To see how they work:

1. Install biterscripting free from http://www.biterscripting.com .
2. Install the above and all other sample scripts with the following command.

<pre>script "http://www.biterscripting.com/Download/SS_AllSamples.txt"</pre>

You can then modify these scripts to suit your requirements. Many other biterscripting scripts are also posted around the web.
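
If a general-purpose language is an option, the splitting, tagging and saving steps from the question can also be sketched in Python using only the standard library. This is a rough sketch under stated assumptions, not a finished tool: it assumes every story title sits in an `<h2>` element and that everything up to the next `<h2>` is the story body, which may not match the real pages. The names `StorySplitter`, `split_page`, `suggest_tags` and `write_csv` are made up for illustration, the date is assumed to be supplied by the caller (for example, taken from the file name), and the tagging is a naive word-frequency count rather than real keyword extraction.

```python
import csv
import io
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: each story title is an <h2>; adjust to match the real pages.
TITLE_TAG = "h2"
# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"said", "that", "this", "with", "from", "have", "will", "were", "been"}


class StorySplitter(HTMLParser):
    """Collects title/body text for each story on one archive page."""

    def __init__(self):
        super().__init__()
        self.stories = []        # one dict per story, in page order
        self._in_title = False
        self._skip_depth = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == TITLE_TAG:
            self.stories.append({"title": "", "body": ""})
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == TITLE_TAG:
            self._in_title = False

    def handle_data(self, data):
        # Ignore script/style content and any text before the first title.
        if self._skip_depth or not self.stories:
            return
        key = "title" if self._in_title else "body"
        self.stories[-1][key] += data


def split_page(html, date):
    """Return one row per story with title, body, date and page position."""
    parser = StorySplitter()
    parser.feed(html)
    parser.close()
    return [
        {
            "title": " ".join(story["title"].split()),  # collapse whitespace
            "body": " ".join(story["body"].split()),
            "date": date,
            "position": position,
        }
        for position, story in enumerate(parser.stories, start=1)
    ]


def suggest_tags(rows, max_tags=3):
    """Add tag1..tagN: the most frequent longer words, else 'archive'."""
    for row in rows:
        text = (row["title"] + " " + row["body"]).lower()
        words = re.findall(r"[a-z]{4,}", text)  # words of 4+ letters only
        counts = Counter(w for w in words if w not in STOPWORDS)
        tags = [w for w, _ in counts.most_common(max_tags)] or ["archive"]
        tags += [""] * (max_tags - len(tags))   # pad to a fixed column count
        for i, tag in enumerate(tags, start=1):
            row[f"tag{i}"] = tag
    return rows


def write_csv(rows, out):
    """Write the rows to an open text file object as CSV."""
    fields = ["title", "body", "date", "position", "tag1", "tag2", "tag3"]
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    sample = ("<html><body><h2>Acme buys Widgets</h2>"
              "<p>Acme said the Widgets deal closed.</p></body></html>")
    buffer = io.StringIO()
    write_csv(suggest_tags(split_page(sample, "2010-03-01")), buffer)
    print(buffer.getvalue())
```

To process the whole archive you would loop over the HTML files, call `split_page` on each one and write all the rows to a single CSV. Getting that CSV into WordPress is a separate step that is not shown here; it would need some import mechanism, such as a CSV importer plugin.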

