Since I haven’t done much in-depth PHP work recently, I was looking for a personal project to work on. After thinking it over I came up with a good, practical one. The sole purpose was to practice my skills a bit and to end up with nicely presented content. Sit tight, this entry is a long one.
I’m from Aruba, and there is currently only one fully active online news website. I can’t stand that website, for several reasons. For one, it’s full of ad banners. It’s common knowledge that news websites need ad banners to generate income, but 98% of these are Flash banners, 12 ads on every page to be exact.
It would be better to have unobtrusive banners and ads. Secondly, a great part of the visitors are Arubans living abroad, and for them these ads have no value whatsoever. Companies based in Aruba pay a fixed price to place their ad on this news site, so the ads only have an impact on the locals. On top of that the site keeps reloading every 30 seconds or so, which I’m guessing is due to the ad system.
As a web developer it also bothers me to spend time on a site with bad code and markup, the product of using Joomla with a generic template instead of having a template built from scratch to suit this online news source.
This is where PHP Simple HTML DOM Parser comes in. At first I just wanted to use the site’s RSS feed, but it only provides a few lines of content per news item. So to get my news fix from Aruba I needed a website scraper, and the Simple HTML DOM Parser offers an elegant solution.
With a simple function you can pull the content from a given website:
function scraping_24() {
    // always return an array, even when no article is found
    $ret = array();
    // create HTML DOM (newer versions of the parser return false when the page can't be loaded)
    $html = file_get_html('http://www.24ora.com/index.php');
    if (!$html) {
        return $ret;
    }
    // get news block
    foreach ($html->find('table.contentpaneopen') as $article) {
        // start each article with a fresh item
        $item = array();
        // get title
        $item['title'] = trim($article->find('td.contentheading', 0)->plaintext);
        // get date
        $item['date'] = trim($article->find('td.createdate', 0)->plaintext);
        // get intro content
        $item['content'] = trim($article->find('span', 1)->plaintext);
        // get photo URL
        $item['photo'] = trim($article->find('img', 0)->src);
        // get permalink hardcoded
        $item['permacoded'] = trim($article->find('div.show-linkmore a', 0)->href);
        // get comment link (full HTML of the anchor)
        $item['comment'] = trim($article->find('div.show-comment a', 0)->outertext);
        $ret[] = $item;
    }
    // clean up memory
    $html->clear();
    unset($html);
    return $ret;
}
As you can see, I had to weed through the mess of code on the news site and figure out which HTML tags enclose the content I need. The catch is that the code structure of the news site isn’t consistent, so when printing the results, content can be missing for some news items on the news index.
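A minimal sketch of one way to cope with the missing elements, assuming that find() with an index returns null when nothing matches (which is how the parser behaves as far as I can tell). The node_value() helper is hypothetical and not part of PHP Simple HTML DOM Parser:

```php
<?php
// Hypothetical helper: return the trimmed value of a property on a node,
// or a default when the node itself is missing.
function node_value($node, $property = 'plaintext', $default = '')
{
    // Assumption: find($selector, $index) gives null when nothing matches
    if ($node === null) {
        return $default;
    }
    return trim((string) $node->$property);
}

// Possible use inside the foreach loop of scraping_24():
//   $item['title'] = node_value($article->find('td.contentheading', 0));
//   $item['photo'] = node_value($article->find('img', 0), 'src');
```

This way a missing tag yields an empty string instead of a notice about reading a property of a non-object.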
Once the scraping worked, the next step was presenting the content. The news site uses Lightbox for the photo gallery on individual news articles, but their version doesn’t work well. So I just scrape the whole gallery table, and since the images already contain rel="lightbox", I only had to add jQuery and a lightbox script. If the article doesn’t have an image gallery, the PHP returns a message:
// the fallback message is Papiamento for "This article has no photo"
if ($photos != '') { echo $photos; } else { echo "E articulo aki no tin potret"; }
For the image thumbnails on the index I used a jQuery-based preview script. It’s not that elegant, but it does the job for now. Since the site uses full-sized images in different formats, the resized previews are of poor quality. My next step is to resize the preview by percentage to retain the image quality.
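The percentage idea could be sketched in PHP like this. It only does the arithmetic. The scale_dimensions() helper is hypothetical, and the original width and height are assumed to come from somewhere like getimagesize():

```php
<?php
// Hypothetical helper: scale a width/height pair by a percentage.
// Both sides use the same factor, so the aspect ratio is kept.
function scale_dimensions($width, $height, $percent)
{
    $factor = $percent / 100;
    return array(
        // never go below 1 pixel
        'width'  => max(1, (int) round($width * $factor)),
        'height' => max(1, (int) round($height * $factor)),
    );
}

// Example: a 1200x800 photo shown at 25% becomes 300x200.
$size = scale_dimensions(1200, 800, 25);
```

The resulting values can go into the width and height attributes of the preview image. The browser still downloads the full-sized file, so this only fixes the proportions, not the load time.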
The parser scrapes the news content without the HTML tags, so it returns one block of text. To fix this when displaying it, I use PHP Markdown:
$content = $ret[1]['content'];
$content_html = Markdown($content);
I’m a big fan of Markdown and have been using it since John Gruber released it. Markdown turns blocks of text separated by blank lines into paragraphs rather than using the break tag. Of course Markdown is more powerful than this, but for the rest you have to write your content in Markdown syntax. It handles this one issue nicely, and I’m sure I could use other methods too, like Textile.
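For comparison, the paragraph part alone could be done in plain PHP. This is a rough sketch and not how PHP Markdown works internally. The text_to_paragraphs() helper is hypothetical, and it assumes the scraped text separates paragraphs with blank lines:

```php
<?php
// Hypothetical helper: wrap blank-line-separated blocks of text in <p> tags.
function text_to_paragraphs($text)
{
    // normalize Windows and old Mac line endings to \n
    $text = str_replace(array("\r\n", "\r"), "\n", trim($text));
    if ($text === '') {
        return '';
    }
    // split wherever one or more blank lines occur
    $blocks = preg_split('/\n\s*\n/', $text);
    $html = '';
    foreach ($blocks as $block) {
        // escape special characters, but leave existing entities alone
        $safe = htmlspecialchars(trim($block), ENT_QUOTES, 'UTF-8', false);
        $html .= '<p>' . $safe . "</p>\n";
    }
    return $html;
}
```

Markdown does a lot more than this, so the helper only makes sense when paragraphs are the only thing needed.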
Summing up, I spent quite a few hours perfecting the scrape results. Multidimensional arrays are quite a pain here because the news site’s HTML structure isn’t consistent. Each time an array item came back empty, it threw the whole output off. So I had to add if statements to skip empty arrays, and create dummy arrays from the few consistent tags to keep the output stable.
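The skip-and-fill approach can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used on the site. The normalize_items() helper is hypothetical, and the keys are the ones from scraping_24():

```php
<?php
// Hypothetical helper: give every scraped item the same set of keys and
// drop items that have neither a title nor content.
function normalize_items(array $items)
{
    // the "dummy" values used when a field is missing
    $defaults = array(
        'title' => '', 'date' => '', 'content' => '',
        'photo' => '', 'permacoded' => '', 'comment' => '',
    );
    $clean = array();
    foreach ($items as $item) {
        // treat null and whitespace-only values as missing
        $item = array_filter((array) $item, function ($value) {
            return trim((string) $value) !== '';
        });
        // skip items with nothing useful to show
        if (!isset($item['title']) && !isset($item['content'])) {
            continue;
        }
        // fill the gaps so every item has every key
        $clean[] = array_merge($defaults, $item);
    }
    return $clean;
}
```

With something like normalize_items(scraping_24()), the template can loop over the list without checking each field first.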
While working on this I read up on jQuery. I only use JavaScript when it’s a must, but JavaScript really is an essential part of any website, so I’m digging deeper to get a stronger grasp of it. As for why jQuery, I can’t really say. Maybe because WordPress uses it, but also because it’s one of the more popular JavaScript libraries out there, so there are a lot of resources available.
And for a little fun I added a browser detection script on the side. It tells you which browser and version you are using, and if you are using Internet Explorer, no matter what version, it suggests that you switch to Firefox or Safari. The idea came up because I shared this site with some friends on Facebook, and I know most of them are not aware of the differences between browsers. So the message just says that with Firefox or Safari you can experience the newest website technologies.
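The same kind of check can also be done on the server. This is a rough PHP sketch of that alternative, not the script used on the site. The is_internet_explorer() helper is hypothetical, and User-Agent sniffing like this is never fully reliable:

```php
<?php
// Hypothetical helper: rough Internet Explorer check on a User-Agent string.
function is_internet_explorer($userAgent)
{
    // older Opera versions also put "MSIE" in their User-Agent, so rule them out
    if (stripos($userAgent, 'Opera') !== false) {
        return false;
    }
    // "MSIE" shows up in IE up to version 10, "Trident/" in IE 11
    return stripos($userAgent, 'MSIE') !== false
        || stripos($userAgent, 'Trident/') !== false;
}

// Possible use in a template:
//   $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
//   if (is_internet_explorer($ua)) { /* show the "try Firefox or Safari" note */ }
```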
And finally for the simplistic design I used a template based on Helvetireader by Jon Hicks.
End result: News in Helvetica
This project was fun to work on and left me wanting to do more. I just need to think of something. One idea I have is to redesign and code an existing website using HTML 5. I know it’s not officially supported yet, but it will be the next big thing, and one might say it’s already becoming one.
