Reduce RSS Noise With Yahoo! Pipes


A very simple Weekly Challenge to get me started. The task: create a series of aggregators of RSS and Twitter feeds, and a simple HTML scraper, using Yahoo! Pipes. [Update: I found Yahoo! Pipes totally unreliable and stopped using it. Most of the pipes described here no longer seem to work.]

About Yahoo! Pipes

Pipes is a way to create aggregators relatively quickly and without knowing how to program. That’s the theory at least - I doubt a non-programmer would be able to find their way around. Anyway, you sign up and start creating pipes, which you do by dragging different types of boxes (modules) onto a canvas and connecting them to each other - scroll down for some screenshots. The modules are quite powerful, although a bit fiddly, and I managed to create a few pipes fairly quickly. Here’s the Pipes Documentation.

Pipe 1: HackerNews without start-ups

I like Hacker News, but there is quite a lot of noise from start-up fetishists which I don’t want to waste time reading.

Expressing the user story for this pipe in BDD style:

As a developer reading Hacker News
I don't want to read about start-ups, politics, or Steve Jobs,
so that I waste less time.

This requires a simple pipe, with one Fetch Feed module to get the feed, and a Filter module to block items that match any of the blacklisted words.
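
For anyone who prefers code to boxes, here is a rough Python sketch of what that Filter module does. It is not how Pipes works internally. It assumes the feed has already been fetched and parsed into a list of dictionaries with a "title" key, and the function name filter_items is made up for illustration.

```python
def filter_items(items, blacklist):
    """Drop every item whose title contains a blacklisted word.

    `items` is assumed to be a list of dicts with a "title" key,
    standing in for the output of the Fetch Feed module.
    """
    lowered = [word.lower() for word in blacklist]
    return [
        item
        for item in items
        if not any(word in item.get("title", "").lower() for word in lowered)
    ]


if __name__ == "__main__":
    sample = [
        {"title": "My start-up raised a seed round"},
        {"title": "A faster regex engine"},
        {"title": "What Steve Jobs taught me"},
    ]
    for item in filter_items(sample, ["start-up", "politics", "Steve Jobs"]):
        print(item["title"])
```

The match is a plain case-insensitive substring test, so "start-up" will not catch "startup"; each spelling needs its own blacklist entry.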

Get the filtered HackerNews RSS feed.

Pipe 2: Mashable key words

Mashable is a good technical resource which just publishes much more than I can cope with. Therefore I need a way to scan through the titles more quickly.

As a developer reading Mashable
I want to scan through headlines quickly,
so that I waste less time.

In this case I opted for replacing the title with a list of keywords, since that’s what I do mentally anyway. It will probably be a bit too abstract to read quickly, but time will tell. [Update: it was too abstract to scan]

I started off with a Fetch Feed again. That is piped into a Term Extractor module, which summarises the description into a list of keywords. Note that the Term Extractor needs to be embedded within a Loop. The ‘for each’ parameter is set to item.description, the Term Extractor ‘emit result’ is set to ‘single item’, and the loop output is set to ‘assign all results’ to item.title. What that does is to replace the title with an object, containing a summary of the description.
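
The loop is easier to see in code. The sketch below is only an approximation: the real Term Extractor is a Yahoo! service, and here it is swapped for a naive word-frequency counter (extract_terms, a made-up helper) just to have something runnable. Items are assumed to be dictionaries with "title" and "description" keys.

```python
import re
from collections import Counter

# A tiny stop-word list, purely illustrative.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def extract_terms(text, limit=5):
    """Naive stand-in for the Term Extractor: the most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


def keywords_as_titles(items, limit=5):
    """For each item, replace the title with keywords from the description.

    The result is wrapped in a {"content": ...} object to mimic what the
    Loop module assigns to item.title.
    """
    result = []
    for item in items:
        terms = extract_terms(item.get("description", ""), limit)
        new_item = dict(item)
        new_item["title"] = {"content": ", ".join(terms)}
        result.append(new_item)
    return result
```

The original items are left untouched and copies are returned, which roughly mirrors how each module in a pipe hands a new feed to the next one.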

Almost there, but the title is replaced by an object, not a string - in JS notation that would be { content: "the string I am after" }. Why Pipes works like that I am not sure, but to get around it I added another module, Create RSS, which basically echoes all the parameters of the incoming feed and replaces the title object with the string it contains.
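
In Python terms, the unwrapping step amounts to something like this. It is a sketch under the assumption that each item is a dictionary and that the title is either a plain string or an object with a "content" key; unwrap_title is a name invented for the example.

```python
def unwrap_title(item):
    """Return a copy of the item whose title is a plain string.

    If the title is an object like {"content": "..."}, pull the string
    out; every other field is echoed unchanged.
    """
    title = item.get("title")
    if isinstance(title, dict):
        title = title.get("content", "")
    return {**item, "title": title}


if __name__ == "__main__":
    wrapped = {"title": {"content": "the string I am after"}, "link": "http://example.com/"}
    print(unwrap_title(wrapped)["title"])
```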

After that, a small filter, and I’m good to go.

Get the filtered RSS feed for Mashable.

Pipe 3: BoingBoing, no fluff no steam punk

BoingBoing has gone seriously downhill, but there are still some interesting posts every so often. What I wanted to do there was to block stories by authors I don’t rate, or in categories I am not interested in. In other words, another simple filtering job.

As a BoingBoing reader
I want to avoid posts by writers (...list) or on subject (...list)
so that I can focus on the good ones.

This was straightforward, as both author and categories have their own fields (item.dc:creator.content and item.category.content respectively), so it was just a matter of filtering those.
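
A Python sketch of the same idea follows. It assumes the parsed items are dictionaries that mirror the field paths above, with "dc:creator" holding an object and "category" holding either one object or a list of them. The author and category names in the usage example are placeholders, not my actual lists.

```python
def is_blocked(item, blocked_authors, blocked_categories):
    """True if the item's author or any of its categories is blocked."""
    authors = {name.lower() for name in blocked_authors}
    topics = {name.lower() for name in blocked_categories}

    author = item.get("dc:creator", {}).get("content", "").lower()

    categories = item.get("category", [])
    if isinstance(categories, dict):  # a single category, not a list
        categories = [categories]
    item_topics = {c.get("content", "").lower() for c in categories}

    return author in authors or bool(item_topics & topics)


def filter_posts(items, blocked_authors, blocked_categories):
    """Keep only the items that are not blocked."""
    return [
        item
        for item in items
        if not is_blocked(item, blocked_authors, blocked_categories)
    ]


if __name__ == "__main__":
    posts = [
        {"title": "Brass goggles", "dc:creator": {"content": "Author A"},
         "category": [{"content": "Steampunk"}]},
        {"title": "A good one", "dc:creator": {"content": "Author B"},
         "category": {"content": "Science"}},
    ]
    for post in filter_posts(posts, ["Author C"], ["steampunk"]):
        print(post["title"])
```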

Get the filtered RSS feed for BoingBoing

Pipe 4: Jay Is Online Games

Casual Gameplay is packed with interesting links, but I only really have time for online and mobile games.

As a Casual Gameplay reader
I only want to read posts about online and mobile games
so that I can try them out there and then.

That was slightly more challenging, as the posts are not categorized. Posts are tagged, but the tags are buried within the main text. Eventually, by poking around, I found a couple of strings I could filter by. Note the two filters back to back, one to permit items, the other to block them.
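
The two back-to-back filters could be sketched in Python like this. The items are assumed to be dictionaries with the post text under "description", and the strings in the usage example are invented placeholders, not the ones I actually filtered by.

```python
def contains_any(item, needles, field="description"):
    """Case-insensitive check for any of the strings in the given field."""
    text = item.get(field, "").lower()
    return any(needle.lower() in text for needle in needles)


def permit(items, needles):
    """First filter: keep only items that mention one of the strings."""
    return [item for item in items if contains_any(item, needles)]


def block(items, needles):
    """Second filter: drop items that mention one of the strings."""
    return [item for item in items if not contains_any(item, needles)]


if __name__ == "__main__":
    posts = [
        {"title": "Game one", "description": "Tags: browser, puzzle"},
        {"title": "Game two", "description": "Tags: download, puzzle"},
        {"title": "Game three", "description": "Tags: browser, download"},
    ]
    # Permit first, then block, the same order as in the pipe.
    for post in block(permit(posts, ["browser", "mobile"]), ["download"]):
        print(post["title"])
```

Chaining the functions in that order means an item has to pass the permit list before the block list gets a say, which is the same behaviour as wiring one Filter module into the next.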

Get the feed for Online and Mobile Games from Casual Gameplay

Pipe 5: Tweeting geeks

I follow a fair number of geeks on Twitter, mostly looking out for the interesting links they post. However, there is a large amount of general chatter I can do without. Additionally, I want to be able to manage the list of users from within Twitter, without having to change the pipe every time. [Update: this is no longer functional since Twitter introduced their API changes]

Pipe 6: Jakob Nielsen

Jakob Nielsen, a well-known if irritating industry expert, only offers articles in email newsletters rather than RSS feeds. The last pipe needs to scrape the site for new articles, and create an RSS feed from the links. [Update: JN has totally changed the site, so this isn’t working any more]
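
Since the pipe itself is gone, here is a minimal Python sketch of the scrape-and-republish idea using only the standard library. It is not a reconstruction of the original pipe: the class and function names are invented, the HTML is passed in as a string rather than fetched, and every anchor on the page is treated as an article link, which a real scraper would need to narrow down.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_to_rss(html, feed_title, feed_link):
    """Build a bare-bones RSS 2.0 document from the links in the HTML."""
    parser = LinkCollector()
    parser.feed(html)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_title

    for href, text in parser.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
        ET.SubElement(item, "link").text = href

    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    page = '<ul><li><a href="/a.html">First</a></li><li><a href="/b.html">Second</a></li></ul>'
    print(links_to_rss(page, "Articles", "http://example.com/"))
```

Relative links are copied as they are; turning them into absolute URLs (for instance with urllib.parse.urljoin) would be the obvious next step for a feed that readers can actually follow.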

Challenge Completed 100%

So this was easy enough, and was completed within a week.