MediaSift / DataSift



Just when I thought Gnip (or Ping backwards!) was the dog's b*llocks, I discover MediaSift, which on paper would seem to have even bigger b*llocks!

Based in the UK (hooray!), MediaSift is a British technology startup and was formed by CEO Nick Halstead (@nickhalstead / @nik) in 2007. In 2008 they launched TweetMeme and their next generation platform is called DataSift.

DataSift will allow customers to search (sift) the full Twitter firehose based on the data in a standard Twitter JSON object (see below) along with the addition of data enrichment (augmentation) from third party services including Klout (influence), PeerIndex (influence), InfoChimps, Qwerly (people search) and Lexalytics (text and sentiment analysis).


A nice feature of DataSift is the Qwerly integration. This will allow a Twitter account to be linked to the same person's other social media accounts such as LinkedIn.

If we believe the hype then it should be possible to ‘sift’ the Twitter firehose for “Football fans in Manchester with over 50 followers who have mentioned Ryan Giggs in a negative way with no swearing in the past day!”
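To make that query concrete, here is a rough Python sketch of the same filter written as a plain predicate over an already enriched Tweet. This is not CSDL and not how DataSift evaluates anything. The field names (`sentiment`, `user.description`, `user.location`), the swear word list and the use of a `datetime` for `created_at` are all my own assumptions for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical swear-word list, purely for illustration.
SWEAR_WORDS = {"damn", "bloody"}


def matches(tweet: dict, now: datetime) -> bool:
    """Return True if an enriched Tweet satisfies every clause of the example query.

    All field names are assumptions; a real enriched Tweet may look different.
    """
    text = tweet["text"].lower()
    words = set(re.findall(r"[a-z']+", text))
    user = tweet["user"]
    return (
        "football" in user["description"].lower()        # football fan
        and user["location"].lower() == "manchester"     # in Manchester
        and user["followers_count"] > 50                 # over 50 followers
        and "ryan giggs" in text                         # mentions Ryan Giggs
        and tweet["sentiment"] < 0                       # negative (assumed score)
        and not (words & SWEAR_WORDS)                    # no swearing
        and now - tweet["created_at"] <= timedelta(days=1)  # in the past day
    )


now = datetime(2011, 6, 1, 12, 0, tzinfo=timezone.utc)
example = {
    "text": "Ryan Giggs was poor today",
    "sentiment": -0.6,
    "created_at": now - timedelta(hours=2),
    "user": {
        "description": "Lifelong football fan",
        "location": "Manchester",
        "followers_count": 120,
    },
}
print(matches(example, now))
```

Each clause maps to one part of the sentence, which is roughly what a stream definition has to express.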

So how does it compare to Gnip?


Firstly, Gnip services are available now. DataSift entered private alpha testing in Q4 2010 (08/12/2010), started beta testing in May 2011 (13/05/2011) and will not officially launch until Q3 2011.

Gnip runs each customer's data collector on its own Amazon EC2 instance. I believe DataSift uses a cloud-based architecture running a custom engine named “Pickle” (Nick, please DM or email me if I am wrong).


Cost-wise, for the Premium Twitter Power Track feed, Gnip charges $2000 / month to rent a data collector and then $0.10 per 1k Tweets delivered. Since Twitter charges all companies $0.10 per 1k Tweets, Gnip is effectively just charging a fixed monthly rental for the collector (Amazon EC2 instance).

The DataSift model is based on a Pay per Use subscription model with processing tiers. As such there is no fixed monthly cost. A user can set an upper limit on the amount of money they are willing to spend per month on DataSift Stream data.

If that upper limit is reached, the user will automatically be disconnected from all their chargeable Streams until the monthly spend amount is increased, or a new month starts.
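The spend cap behaviour described above can be sketched in a few lines of Python. This is only my reading of the rule, not DataSift's implementation; the class and method names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpendCap:
    """Hypothetical model of a monthly spend limit on chargeable streams."""

    monthly_limit: float
    spent: float = 0.0

    @property
    def connected(self) -> bool:
        # Streams stay connected only while spend is below the limit.
        return self.spent < self.monthly_limit

    def charge(self, amount: float) -> bool:
        """Record a charge if still connected; return the connection state."""
        if self.connected:
            self.spent += amount
        return self.connected

    def raise_limit(self, new_limit: float) -> None:
        # Increasing the monthly spend amount reconnects the streams.
        self.monthly_limit = new_limit

    def new_month(self) -> None:
        # A new month resets the running total.
        self.spent = 0.0


cap = SpendCap(monthly_limit=10.0)
cap.charge(6.0)   # still connected
cap.charge(6.0)   # limit reached, streams disconnected
print(cap.connected)
```

Once disconnected, nothing further is charged until either `raise_limit` or `new_month` is called.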

Basically, custom streams require processing power, which is split into three tiers: the more complex the search definition, the higher the tier and the cost. There are then additional publisher costs on top.

As a cost comparison, if I assume 20 working days per month, i.e. $100 per day, what can I get for my $2000? Using the DataSift pricing calculator, the answer is 23,000 Tweets (interactions) per hour using a highly complex search definition. An additional $55.20 per day of publisher costs ($0.10 per 1k Tweets from Twitter, for 24 hours at that rate) would apply to both Gnip and DataSift. Hence, a total cost of $0.28 per 1k Tweets.
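For anyone who wants to check the sums, here is the arithmetic as a small Python sketch. The 24 hours per day figure is my assumption; it is the value that reproduces the $55.20 publisher cost. The function name is just for illustration.

```python
def cost_per_thousand(
    monthly_budget: float = 2000.0,
    working_days: int = 20,
    tweets_per_hour: int = 23_000,
    hours_per_day: int = 24,          # assumption: stream runs around the clock
    publisher_rate_per_1k: float = 0.10,
) -> dict:
    """Work out the daily publisher cost and the all-in cost per 1k Tweets."""
    daily_budget = monthly_budget / working_days
    thousands_per_day = tweets_per_hour * hours_per_day / 1000
    publisher_cost = thousands_per_day * publisher_rate_per_1k
    total_per_1k = (daily_budget + publisher_cost) / thousands_per_day
    return {
        "daily_budget": daily_budget,
        "publisher_cost_per_day": round(publisher_cost, 2),
        "total_per_1k": round(total_per_1k, 2),
    }


print(cost_per_thousand())
```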


But hang on, 23K Tweets Per Hour (TPH)! For normal commercial applications we are probably talking an absolute *maximum* of 1000 filtered TPH (which is a bit less than the 12.4M TPH reported during Bin Laden’s death). This is further backed up by an analysis of the #CustServ hashtag, which averages 1000 TPH on Tuesdays.

Using the DataSift pricing calculator again, and this time assuming 1000 TPH over a 12 hour working day with a medium-complexity search definition, we have a total cost of $13.80 per day, which over 20 working days is less than $300/month.


Unless I am very wrong (please email me to let me know if I am), this means that Gnip is about 7 times more expensive than DataSift. Yikes, time for a new business model!
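Here is that comparison as a quick Python sketch. I am assuming the $13.80 per day DataSift figure already includes publisher costs, that the volume is 1000 TPH for 12 hours over 20 working days, and that Gnip's bill is the $2000 rental plus $0.10 per 1k Tweets. The function names are mine.

```python
WORKING_DAYS = 20
TWEETS_PER_HOUR = 1000
HOURS_PER_DAY = 12
PUBLISHER_RATE_PER_1K = 0.10


def gnip_monthly(rental: float = 2000.0) -> float:
    """Fixed collector rental plus per-Tweet publisher cost (assumed model)."""
    thousands = TWEETS_PER_HOUR * HOURS_PER_DAY * WORKING_DAYS / 1000
    return rental + thousands * PUBLISHER_RATE_PER_1K


def datasift_monthly(daily_total: float = 13.80) -> float:
    """Daily total from the pricing calculator, assumed to include publisher costs."""
    return daily_total * WORKING_DAYS


ratio = gnip_monthly() / datasift_monthly()
print(round(gnip_monthly(), 2), round(datasift_monthly(), 2), round(ratio, 1))
```

Under those assumptions the ratio comes out a little over 7, in line with the figure above.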

Data Enrichment

Gnip provides data enrichment capabilities such as URL expansion and mapping, and duplicate exclusion. Gnip can also append Klout Scores to Tweets and filter for Tweets by users who have Klout Scores within a specified range.

Similarly, DataSift provides Augmentation services. These services include Influence Analysis (Social Authority) from Klout and PeerIndex, Natural Language Processing (NLP) using Lexalytics Salience for Sentiment analysis, and Social Identity Aggregation (People Search) using Qwerly.


DataSift supports CSDL (Curated Stream Definition Language), which was previously called FSDL (Filtered Stream Definition Language). CSDL is a powerful search language used to define the complex rules that filter and curate streams. This is similar to Gnip rules. However, the capabilities provided by DataSift are much more comprehensive than Gnip's.

DataSift sources of data are called “targets” or “input services”. Custom streams are defined in CSDL using targets in the “My Streams” section of the DataSift dashboard. CSDL also provides access to augmentation (data enrichment) targets through services such as Lexalytics Salience, TweetMeme, PeerIndex, Klout and InfoChimps that allow streams to be augmented with third party data.

Regular expressions are supported in CSDL via the Google RE2 regular expression engine.

Here are some CSDL examples:


Slightly worryingly, augmentation targets as specified in CSDL seem to be tied to data in JSON objects as exposed by third party APIs. I wonder if this could cause problems moving forward if and when these APIs change.

Once a stream has been built it can also be used in the definition of another user stream, which in turn can be used in another stream, and so on.


Interestingly, DataSift is encouraging users to build public streams that are discoverable and accessible to other users by providing a number of options on the stream page such as tagging, sharing, comments and visits. The most commented and top rated streams are also featured on the home section of the DataSift dashboard. Cool!


API and data formats

Both Gnip and DataSift provide HTTP Streaming APIs using Basic rather than OAuth authentication. Gnip supports output in XML, JSON and Activity Streams formats. DataSift only supports streaming output in the JSON format. Once again there is a concern that the DataSift API might change in the future, which is where Gnip has an advantage since it supports Activity Streams.
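As a rough idea of what the client side of such an API involves, here is a minimal Python sketch of the two generic pieces: building an HTTP Basic authentication header and reading JSON objects from a stream. I am assuming one JSON object per line with blank keep-alive lines, which neither vendor's documentation is quoted as confirming here, and the helper names are my own.

```python
import base64
import json
from typing import Iterable, Iterator


def basic_auth_header(username: str, password: str) -> dict:
    """Build the standard HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


def iter_json_objects(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line (assumed line-delimited JSON)."""
    for raw in lines:
        line = raw.strip()
        if not line:  # skip blank keep-alive lines
            continue
        yield json.loads(line)


# Simulated stream body; a real one would come from the HTTP response.
sample = [b'{"text": "hello"}\n', b"\n", b'{"text": "world"}\n']
for obj in iter_json_objects(sample):
    print(obj["text"])
```

With the standard library, the header can be passed to `urllib.request.Request(url, headers=...)` and the response object iterated line by line, so the same `iter_json_objects` loop would work on a live connection.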

In addition to the Streams API, DataSift also provides a Data API which makes backfilling an application on startup easy!

Gnip provides APIs to add or delete rules on a data collector. DataSift provides APIs to comment on and manage streams (get, create, update, duplicate, rate, delete, browse, search and compile). In addition to being able to modify CSDL definitions this API also allows for the public discovery of existing public streams.

Finally, DataSift provides a Recording API which allows a stream to be recorded and subsequently retrieved and analysed offline.


Time to write another piece of integration code into Genesys Social Engagement!