We run a niche B2B news service; every day we send out an email with between 10 and 30 stories in it. This email is then turned into an HTML page including all the stories, and uploaded to our website. This has been going on for around 6 years, so I reckon there are about 1500 individual HTML pages, containing somewhere between 15,000 and 40,000 stories.
I've been tasked with creating a new site, using WordPress. I'm fine with this, but I'm stuck when it comes to putting all the old archive stories into WordPress as separate articles.
Here's what I need to do:
Scrape all the text from the HTML pages.
Take the text and separate it into individual articles with the following fields: article name, article body, date, article position. Article position refers to whether the story is first, second, third or whatever on the HTML page.
Parse/spider the article titles and article bodies to create a list of possible tags, and then match these against each article to generate suggested tags.
If the tagging bit isn't possible, then it would be acceptable for them just to be tagged with "archive" or something, and they can be manually tagged as and when possible.
Save the output (article title, article body, date, position, tag1, tag2, tag3) in whatever format makes sense: Excel, XML or whatever.
Import the data into WordPress.
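To make the steps above concrete, here is a rough Python sketch of the scrape, split, tag and save stages, using only the standard library. It rests on assumptions that would need checking against the real pages: that every story title is an `<h2>` and its body is the `<p>` tags that follow, and that the date is known per page (for example from the filename). The names `StoryParser`, `split_stories`, `suggest_tags` and `write_csv` are made up for illustration, and the tagging is a deliberately naive word count, not a proper keyword extractor.

```python
import csv
import io
import re
from collections import Counter
from html.parser import HTMLParser


class StoryParser(HTMLParser):
    """Collects stories, assuming <h2> = title and following <p> = body."""

    def __init__(self):
        super().__init__()
        self.stories = []     # one dict per story, in page order
        self._current = None  # tag being captured: "h2", "p" or None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            # A heading starts a new story; position is 1-based.
            self.stories.append(
                {"title": "", "body": "", "position": len(self.stories) + 1}
            )
            self._current = "h2"
        elif tag == "p" and self.stories:
            self._current = "p"

    def handle_endtag(self, tag):
        if tag == self._current:
            if tag == "p":
                self.stories[-1]["body"] += "\n"  # keep paragraph breaks
            self._current = None

    def handle_data(self, data):
        if self._current == "h2":
            self.stories[-1]["title"] += data
        elif self._current == "p":
            self.stories[-1]["body"] += data


def split_stories(html, date):
    """Return the stories found on one archive page, stamped with its date."""
    parser = StoryParser()
    parser.feed(html)
    parser.close()
    for story in parser.stories:
        story["title"] = story["title"].strip()
        story["body"] = story["body"].strip()
        story["date"] = date
    return parser.stories


STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "has", "its"}


def words(text):
    """Lower-case words of 3+ letters, minus a small stopword list."""
    return [w for w in re.findall(r"[a-z]{3,}", text.lower()) if w not in STOPWORDS]


def suggest_tags(stories, min_stories=2, max_tags=3):
    """Candidate tags are words seen in at least min_stories stories."""
    doc_freq = Counter()
    for story in stories:
        doc_freq.update(set(words(story["title"] + " " + story["body"])))
    vocabulary = {w for w, n in doc_freq.items() if n >= min_stories}
    for story in stories:
        text = story["title"] + " " + story["body"]
        counts = Counter(w for w in words(text) if w in vocabulary)
        tags = [w for w, _ in counts.most_common(max_tags)]
        story["tags"] = tags or ["archive"]  # fallback tag for manual tagging


def write_csv(stories, out):
    """Write one row per story: title, body, date, position, tag1..tag3."""
    writer = csv.writer(out)
    writer.writerow(["title", "body", "date", "position", "tag1", "tag2", "tag3"])
    for story in stories:
        tags = (story["tags"] + ["", "", ""])[:3]  # pad to exactly three columns
        writer.writerow(
            [story["title"], story["body"], story["date"], story["position"], *tags]
        )
```

In use, this would loop over the archive files, call `split_stories` on each, run `suggest_tags` once over the combined list so the candidate tags come from the whole archive, and then call `write_csv` with an open file. For the final step, as far as I know WordPress's own importer expects its WXR export format (an XML file), so a CSV like this would need either a CSV import plugin or a conversion step.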
I have a good understanding of what needs to be done and a rough idea of how to do it; I just lack the detailed technical knowledge to put it together.
Any suggestions/help would be much appreciated!