ASR Tuning post Release 2 Go-Live

As I posted last week, we went live on Monday (27/06/2011) with our Release 2 solution and I am pleased to report that everything went very smoothly (for once!). Of course we had a number of minor issues, which the team has worked hard to resolve this week.

The Release 2 solution includes the rollout of Nuance Speech Recognition (ASR) for existing Customer identification. This is based on the caller saying their postcode and then the first line of their address. I have been buried in Nuance ASR logs all week while reviewing the associated recorded utterances. In fact, I analysed a total of 40,000 utterances from Monday and 4 hours of utterance audio from 2 of the 9 Nuance Recognizer ASR servers!

As a result the following tuning recommendations have been made:

  • Increase the confidence level on postcode recognition from 0 to 4. This is because we were getting false positives on postcodes and then asking the Customer to match against a list of addresses that could never match
  • Change the wording on the address prompt to include the house number or name. This is because we observed that Customers were just saying a street name, which would never match against a full address line

We have also identified a problem with invalid grammars when the address line contains a 4-digit house number (e.g. 1234 SOME ROAD), when house numbers are prefixed with a zero (e.g. 01 SOME ROAD), and when the address line also contains contact details such as a telephone number. The result is that Customers are transferred directly to an advisor after giving a valid postcode.


Release 2 Go-Live

Some frantic activity over the last few weeks trying to close things down for Release 2 Go-Live, which is now scheduled for 27/06/2011. Release 2 adds some core solution functionality, including voice self service using speech recognition (ASR), Kofax non-voice channel integration and integration with SAP Web IC.

As usual we have found some “magic” settings at the last minute to fix a couple of critical issues:

IVR interface performance

We have developed a custom C# .NET application which provides the interfaces between the IVR applications (VoiceXML) and back end systems. Although we had performance-tested these interfaces in isolation, we hit a concurrency problem in final testing.

The solution was to set the .NET option “maxconnection” so that the .NET runtime can open more than the default of 2 concurrent web service connections per host (beyond which subsequent requests block):

Image
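For reference, here is a minimal sketch of the change. The value of 12 is illustrative rather than the figure from our deployment; the same limit can also be set programmatically, which is what the snippet shows:

```csharp
using System.Net;

static class ConnectionLimit
{
    public static void Apply()
    {
        // Lift the default of 2 concurrent HTTP connections per host.
        // The app.config equivalent is:
        //   <system.net><connectionManagement>
        //     <add address="*" maxconnection="12" />
        //   </connectionManagement></system.net>
        // 12 is an illustrative value, not the figure from our deployment.
        ServicePointManager.DefaultConnectionLimit = 12;
    }
}
```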

HTTPS with GVP 7.6

Since we will process payments in Release 2 IVR applications we need to enable HTTPS on the connection between each IPCS (Page Collector) and the IVR (VoiceXML) application servers.

However, enabling HTTPS resulted in intermittent IPCS page fetch errors, especially under load conditions. The solution to this is buried in solution search here:

http://solutionsearch.genesyslab.com/selfservice/dynamickc.do?cmd=show&forward=nonthreadedKC&docType=kc&externalId=15264&sliceId=1

  1. Create the following Registry entry as a DWORD value: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\MaxUserPort
  2. Set it to the value of 65534 (decimal). The default is 5000.
  3. Create the following Registry entry as a DWORD value: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\TcpTimedWaitDelay
  4. Set it to 60 or lower (decimal). The default is 240.
  5. Reboot the host.
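If you would rather script the change than edit the registry by hand, a few lines of C# (run as administrator) do the same thing; this is just a sketch of the steps above, not a Genesys-supplied utility:

```csharp
using Microsoft.Win32;

static class TcpTuning
{
    const string TcpipParameters =
        @"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters";

    public static void Apply()
    {
        // Steps 1 and 2: raise the ephemeral port ceiling (default is 5000).
        Registry.SetValue(TcpipParameters, "MaxUserPort", 65534, RegistryValueKind.DWord);

        // Steps 3 and 4: release closed sockets after 60 seconds (default is 240).
        Registry.SetValue(TcpipParameters, "TcpTimedWaitDelay", 60, RegistryValueKind.DWord);

        // Step 5: a reboot is still required for the settings to take effect.
    }
}
```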

GVP 7.6.470.xx MCU Core Dumps

We had been struggling with IPCS core dumps since Release 1 Go-Live and this was resulting in GVP ports getting stuck and needing to be taken out of service manually. The problem seemed to be related to prompt recording and playback in Virtual Hold (VHT). A long running ticket with Genesys support was eventually resolved this week after a couple of diagnostic builds provided by Genesys and tested by the team.

Release Number 7.6.470.17 [06/24/11] – Hot Fix

The IPCS MCU process no longer terminates unexpectedly at the end of recording. Previously under certain conditions, some internal C++ Standard Template Library (STL) lists would become corrupted at the end of a recording, causing the MCU to terminate unexpectedly. (ER# 269811868)

Well done team – we got there in the end!


Bitcoin block backlog problems

As I mentioned in a recent blog post, for security reasons I do not recommend running Bitcoin clients online 24×7. However, the net effect of this is a block backlog and a longer time to clear new transactions. This happened to me earlier in the week when I managed to get 1,000 blocks behind (not much really, since around 6 blocks normally get created per hour).

When processing the backlog my client seemed to get stuck on one block for a long time, and I wondered whether this was related to the number of transactions over the last week (the Mt.Gox saga!) and hence more verification work per block.

If this is the case, i.e. blocks with a large number of transactions take longer to verify and hence to clear from a backlog, there may be trouble ahead! Imagine banks changing their clearing times due to transaction volume.


Bitcoin Theft – The Top Ten Threats (Updated 23/06/2011)

FOREWORD

I have been working on this article for a couple of weeks now and have not finished my research. However, given the latest developments I feel as though I should share my initial findings with you ASAP. I have my own Bitcoins (oops – that breaks one of my rules) and hope that the advice at the end of this post will help the Bitcoin community in general.

http://news.yahoo.com/s/afp/20110617/ts_alt_afp/usitcrimecomputersecurityinternetbitcoin

Firstly, let me say that I am not a hacker and that I have no intention of pursuing any of this other than in an academic capacity!

Given the “theft” this week of 25,000 Bitcoins ($500K) from someone’s wallet (http://thenextweb.com/industry/2011/06/15/close-to-us500k-stolen-in-first-major-bitcoin-theft/), I started to ask myself: how secure is the Bitcoin system?

Let’s work through some scenarios ….

Select a victim

The Bitcoin “Rich List” would seem like the logical starting point – http://bitcoinreport.appspot.com/

Image

Here I select address “1GQnMbeEmUA9g6iypZYhUg4PKsZEswzAYy” as the notional victim as they seem to have 6592 lovely Bitcoins for me to steal!

THREAT 1: The open and distributed nature of Bitcoins means that everybody can read and analyse the block chain as does the Bitcoin Report. This is equivalent to banks giving out details (anonymously) of everybody’s current balance!

Identify the victim

Here I pick a pseudo-random post from the Bitcoin forum (http://forum.bitcoin.org/) because:

  • They are a Hero member (and therefore should have quite a few Bitcoins)
  • They give me a Bitcoin address
  • They give me their nickname – “xf2_org”
  • They give me their real name

Image

Given the above information, a very quick bit of Googling provides further information about this person:

  • A website
  • A link to an email address
  • Another website
  • A contact number
  • Another email address
  • A Twitter address

Image

23/06/2011

Additional information removed from original post as requested

As you can see, I can quickly find a lot of information to tie to a Bitcoin address in order to launch a targeted attack.

I could take it further via email and Twitter to try to get an IP address for the conversation, just in case they were running the Bitcoin client on the same machine they use to interact with me. Let’s not forget that when you receive an email you receive more than just the message: an email comes with headers that contain information about where the email was sent from and possibly the IP address of the sender.

THREAT 2: Bitcoin addresses and the Internet allow me to link real people to Bitcoin wallets. Yes, I know that you can have multiple addresses per Bitcoin client install, but human nature means that in reality only one address will be used to receive transactions. With Bitcoins, this information is the equivalent of somebody’s bank account number and sort code, which we all know real-world hackers are very interested in ….

Using Block Explorer (http://blockexplorer.com) I can dig a bit more into a Bitcoin address and determine useful information such as the number of transactions, the amount of BTC sent and received, when the address was first and last used, etc. In the case of the address above:

  • First Seen: 2011-05-18 16:42:43
  • Received transactions: 5
  • Received BTC: 3.4601
  • Sent transactions: 3
  • Sent BTC: 0.4501
  • Last Used: Block 130904 (2011-06-15 03:30:05)

Note: While the last “balance” is the accurate number of bitcoins available to an address, it is likely not the balance available to this person (victim!). Every time a transaction is sent, some bitcoins are usually sent back to a new (change) address, which makes the balance of a single address misleading.

What I also find are two sent transactions to two other Bitcoin addresses – I wonder if these addresses are being used for a deposit wallet? [See later]

THREAT 3: There is a lot of useful information that can be determined through analysis of the open block chain!

Locate the victim’s wallet(s)

Previously I mentioned one possible way to try to find the IP address of the machine where a Bitcoin client may be running (and hence where a wallet could be located) in order to launch a targeted attack.

However, the Bitcoin architecture is based on Peer to Peer (P2P) technology, and all P2P networks have a bootstrapping problem: without central servers, nodes (Bitcoin clients) on the network need to be able to find each other.

Bitcoin solves this using three mechanisms (with some old discussion here – http://forum.bitcoin.org/?topic=84.0):

  • By default, Bitcoin clients join an Internet Relay Chat (IRC) channel and watch for the IP addresses and ports of other clients joining that channel
  • There is a list of “well known” Bitcoin nodes compiled into the software in case the IRC chat server is unreachable for some reason
  • You can manually add (via configuration file or command-line option) IP addresses of other machines running Bitcoin to connect to. Some people use the fallback node list here: https://en.bitcoin.it/wiki/Fallback_Nodes

Each Bitcoin client connects to IRC and stays connected. Using the peer IP addresses found via IRC messages, the client then connects to those peers on port 8333.

Once a client is connected to the Bitcoin network, other Bitcoin nodes (machines) send it messages containing the IP addresses (and ports) of other nodes they know about. Internally, Bitcoin clients communicate with each other and broadcast new nodes via the Bitcoin protocol on port 8333. However, they also remain online in IRC.

An example of the IRC bootstrapping can be found in the “debug.log” file from the Bitcoin client:

Image

The official Bitcoin source code is available here (https://github.com/bitcoin/bitcoin) and an open source Java client is available here (http://code.google.com/p/bitcoinj/).

Using a standard IRC client (mIRC, KVIrc, XChat, Colloquy etc.) it is very easy to connect to the Bitcoin bootstrapping IRC channel. The bootstrapping IRC server is LFNet (irc.lfnet.org) and the channel is #bitcoin. The username and nickname are initially set to a random number prefixed with “x”, e.g. “x781631660”. The nickname is later changed to an encoded form of the client’s external IP address (as seen by the IRC server), prefixed with “u”, before the client performs a “JOIN” and then a “WHO” on the #bitcoin channel. Of course the decode function is also available in the source code, so we can turn any such nickname back into an IP address. In fact, this is exactly what a Bitcoin client does – it sits there listening for “JOIN” and “WHO” messages and, if the message relates to a nickname prefixed with “u”, it decodes the nickname to get the external IP address of that Bitcoin client.
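To illustrate the idea, here is a simplified C# sketch of the decode: it reverses a base58-style encoding of a 32-bit value back into an IPv4 address. Treat this as an approximation for illustration only – the real encode/decode functions live in the Bitcoin client’s irc.cpp and differ in the details (for example, how the address is packed):

```csharp
using System;
using System.Net;
using System.Numerics;

static class NickDecoder
{
    // Bitcoin-style base58 alphabet (no 0, O, I or l).
    const string Alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

    // Simplified sketch: decode a "u"-prefixed nickname into an IPv4 address,
    // assuming the payload is a base58-encoded 32-bit value in network byte order.
    public static IPAddress Decode(string nickname)
    {
        string payload = nickname.StartsWith("u") ? nickname.Substring(1) : nickname;

        BigInteger value = 0;
        foreach (char c in payload)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0) throw new FormatException("Not a base58 character: " + c);
            value = value * 58 + digit;
        }

        byte[] bytes = new byte[4];
        for (int i = 3; i >= 0; i--)
        {
            bytes[i] = (byte)(value & 0xFF);
            value >>= 8;
        }
        return new IPAddress(bytes);
    }
}
```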

Alternatively, we can of course just send a “WHO” or “USERHOST” request to get IP address information using an IRC client:

Image

THREAT 4: Based on the source code of the Bitcoin client I could create a “spoof” Bitcoin node which joins the network and then waits for connections from other (genuine) Bitcoin clients. Since the connection is two-way I am then effectively connected to a machine which in all probability also has a Bitcoin wallet stored on its hard disk. If I could exploit this connection in some way (like a good old fashioned TCP buffer overflow exploit – Intrusion.Win.NETAPI.buffer-overflow.exploit) I could take control of the machine and copy their wallet. This is a bit like the ATM network being connected via IRC!

But guess what – life is even easier for a thief who is trying to find the IP addresses of all machines where a Bitcoin client may be running (and hence where a wallet could be located). This is because each Bitcoin client stores the IP addresses of all previously discovered peers in a file named “addr.dat”, and the chances are that Bitcoin will be running on each of these machines from time to time and listening on port 8333.

THREAT 5: Based on the source code of the Bitcoin client I could read all of the IP addresses stored in the address database and then sequentially try to connect to the Bitcoin client likely to be running on each machine using port 8333. IMHO it is only a matter of time before such an exploit becomes reality.

17/06/2011

I am still researching further into this. The JSON-RPC interface on port 8332 is the biggest threat since there is a “nice” method called “sendtoaddress”. The interface is secured in various ways and the TCP bind address is set to the loopback address unless the option “rpcallowip” is set. If it is set then the bind is on any IP address, which means that a direct network connection IS possible. There are additional checks behind it, such as authentication (including measures to deter brute-forcing of short passwords of less than 15 characters), but BEFORE the authentication is checked the HTTP headers are processed.

The standard Bitcoin protocol port 8333 is another threat and is implicit in the P2P architecture. There could be the possibility of socket exploits here, trying either to shell out to copy the wallet or even to access the contents of the wallet in memory, e.g. using Bitcoin commands “getdata”, “getblocks” and “getheaders” with invalid payloads to reveal the contents of the wallet and/or call “SendMoneyToBitcoinAddress”.

23/06/2011

Having been through the source code in some detail I have now convinced myself that the code is written in a very professional manner and that all reasonable steps have been taken to protect the interfaces exposed by the Bitcoin client. This does not of course mean that bugs cannot introduce vulnerabilities in future versions.

On the balance of probability I think that an exploit would come as the result of standard Remote Access Trojans (RATs) using the IP addresses exposed as an inherent part of the P2P network and the IRC bootstrapping process. Therefore, I would still recommend running with the “noirc=1” option and only connecting to trusted peers.

THREAT 6: Running the Bitcoin client permanently in the background is an inherent part of the Bitcoin Peer to Peer (P2P) architecture. There is a financial incentive to do so in terms of Transaction Fees. Therefore, many people leave the machine hosting their Bitcoin client (and also their wallet) running 24 hours a day. This is even more so if they are Bitcoin mining at the same time. This means that their wallet is also open to attack 24 hours a day ….

Please see the end of this post for ways to protect yourself from the threats outlined above!

Access and steal the wallet(s)

THREAT 7: Everybody knows that the weak point in the system is an individual’s wallet which is stored unencrypted on the hard disk where the Bitcoin client is installed. On a Windows 7 install this is located in C:\Users\<user>\AppData\Roaming\Bitcoin\wallet.dat. Steal this file and steal the contents of the wallet.

http://news.yahoo.com/s/afp/20110617/ts_alt_afp/usitcrimecomputersecurityinternetbitcoin

THREAT 8: Although it is possible to have multiple wallets, human nature again means that in reality most people only use a single Bitcoin wallet rather than keeping a separate, more secure and, importantly, offline deposit wallet, which is inherently less prone to theft.

Access to copy (steal) the wallet can be achieved by getting the user to download and install some malicious application, such as the notorious Bitcoin wallet backup utility or something disguised as a Bitcoin mining application ….

THREAT 9: In the rush for gold everybody is keen to download and install the latest and best Bitcoin miner software without even thinking twice as to whether it could contain a Trojan.

Alternatively, I could take remote control of the machine running the Bitcoin client software covertly using standard Remote Access Trojans (RAT).

THREAT 10: Anonymity. At the end of the day, as a decentralised network with no authority and no identities attached to the addresses used to send and receive Bitcoins, once Bitcoins are stolen they’re as good as gone. Although there is an alerts mechanism built into the Bitcoin client, it does not seem to be used for much at present.

And finally ….

THREAT 11: The Bitcoin network is open to attack in general. The longest block chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power. Effectively, the majority decision is represented by the longest chain, which has the greatest proof-of-work effort invested in it. Proof-of-work is essentially one-CPU-one-vote.

Therefore, so long as a majority of CPU power is controlled by nodes that are not cooperating to attack the network, they’ll generate the longest block chain and outpace attackers. Hence, the Bitcoin system is secure as long as honest nodes collectively control more CPU power than any cooperating group of attacker nodes. Given the vast amount of hardware being thrown into Bitcoin mining, what happens to this hardware once mining is no longer cost effective? Could this redundant hardware be pooled into an attack such that honest nodes are overrun by attacker nodes?

The Bitcoin whitepaper says:

“The incentive (to mine blocks and generate coins) may help encourage nodes to stay honest. If a greedy attacker is able to assemble more CPU power than all the honest nodes, he would have to choose between using it to defraud people by stealing back his payments or using it to generate new coins. He ought to find it more profitable to play by the rules, such rules that favour him with more new coins than everyone else combined, than to undermine the system and the validity of his own wealth.”

Mmmm … if this ever did happen a lot of elements of (dare I say it) a Ponzi scheme (http://en.wikipedia.org/wiki/Ponzi_scheme) would come into play …

Hopefully this post has given you food for thought and you will now take the necessary actions to secure your wallet(s)!

My recommendations are:

  • Don’t go broadcasting one of your receiving Bitcoin addresses looking for donations. Make sure you use a different Bitcoin address for different transactions
  • Be careful what you post on the forums
  • Run your Bitcoin client with your daily working wallet with the options “noirc=1” and “nolisten=1” set. Also consider adding “connect=” lines rather than “addnode=” lines so that you only connect to trusted Bitcoin peers (see the sample bitcoin.conf after this list). These settings ensure that a) you do not broadcast your IP address and b) you do not accept connections from the outside
  • DO NOT set the option “rpcallowip”
  • Keep an offline deposit wallet with the majority of your Bitcoins in it. Make sure that this is backed up somewhere safe
  • Be very careful what software you download and install – especially for Bitcoin mining. Do not run mining software on any machine on which a wallet is installed or accessible
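To make the client-side recommendations concrete, a daily working wallet’s bitcoin.conf might look something like the sketch below. The peer address is a documentation placeholder – substitute the IP addresses of nodes you actually trust:

```
# bitcoin.conf for the daily working wallet (illustrative values only)

# Do not announce our IP address via the IRC bootstrap channel
noirc=1

# Do not accept inbound connections on port 8333
nolisten=1

# Only ever connect out to trusted peers (placeholder address shown)
connect=192.0.2.10

# Note: rpcallowip is deliberately NOT set, so the JSON-RPC interface on
# port 8332 stays bound to the loopback address
```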

I hope this post has been of some use to you. If my blog gets hit by a DDoS attack you know I have been talking too close to the mark! Back up your wallet, sleep well and keep trading!

PS: Donations accepted to:

1L5GQ6JiUkSeLBgsdoy9e9wGKqbvn2mgUH

Yes I know this breaks one of my recommendations but I had already published this address! Don’t worry, only my daily working wallet is online.

 


Bitcoins – A Crypto Currency for a Social World?!

There has been a lot of noise recently about Bitcoins especially given the way the “markets” (http://bitcoincharts.com/markets/) have reacted over the last few months. The value of a Bitcoin (BTC) is shown on the right in USD:

Image

I am not going to get into the detail here as there are plenty of other articles out there both on how Bitcoins work and also on Bitcoin Mining. A good starting point for further research is here: https://en.bitcoin.it/wiki/Main_Page

One key to the Bitcoin currency is the limited and predictable rate at which coins are added to the system. This control is achieved by tying the generation of one block of Bitcoins to the solving of a computationally difficult problem: solve the problem, get a block of Bitcoins. Generating (solving) a block results in a reward which is currently 50 Bitcoins (BTC).

The distributed Bitcoin network adjusts the difficulty of this problem after every 2016 blocks are generated. The difficulty is adjusted so that, on average, 6 blocks are generated every hour by users around the world. The incentive for devoting your computer’s power and time to solving the difficult computational problem is that the user whose computer solves the problem (i.e. whose block gets accepted by the network) is awarded that block and the 50 BTC reward.

Mining Bitcoins is equivalent to finding a number that hashes (using SHA-256) to a value that is less than a particular 256-bit target. The target is a 256-bit number (extremely large) that all Bitcoin clients share. The SHA-256 hash of a block’s header must be lower than or equal to the current target for the block to be accepted by the network. The lower the target, the more difficult it is to generate a block.

Adjusting that target value is how the distributed Bitcoin network adjusts the difficulty of block generation. Moving the target lower makes for a more difficult computational problem: it will take more random tries to find a value whose hash is less than the target.

Each hash basically gives you a random number between 0 and the maximum value of a 256-bit number (which is huge). If your hash is below the target, then you win. If not, you increment the nonce (completely changing the hash) and try again.
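To make the “lottery” concrete, here is a stripped-down C# sketch of the inner mining loop: hash a candidate header twice with SHA-256, treat the digest as a 256-bit number, and check whether it is below the target. The header bytes and target below are placeholders – real mining serialises the previous block hash, merkle root, timestamp and difficulty bits into the 80-byte header, and the real target is vastly lower:

```csharp
using System;
using System.Linq;
using System.Numerics;
using System.Security.Cryptography;

class MiningSketch
{
    static void Main()
    {
        byte[] header = new byte[80];               // placeholder header, not a real block header
        BigInteger target = BigInteger.Pow(2, 248); // artificially easy target so the loop finishes

        using (SHA256 sha = SHA256.Create())
        {
            for (uint nonce = 0; nonce < uint.MaxValue; nonce++)
            {
                // The nonce occupies the last 4 bytes of the 80-byte header.
                BitConverter.GetBytes(nonce).CopyTo(header, 76);

                // Bitcoin applies SHA-256 twice to the header.
                byte[] hash = sha.ComputeHash(sha.ComputeHash(header));

                // Interpret the digest as an unsigned 256-bit number; lower than the target wins.
                BigInteger value = new BigInteger(hash.Concat(new byte[] { 0 }).ToArray());
                if (value <= target)
                {
                    Console.WriteLine("'Solved' with nonce {0}", nonce);
                    break;
                }
            }
        }
    }
}
```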

Every 210,000 blocks, the Bitcoin reward per block is cut in half. Right now, the payout is 50 BTC per block. Sometime soon, the payout will halve to 25 BTC per block:

Image
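The halving schedule itself is simple enough to express in a couple of lines; a quick sketch:

```csharp
static class Subsidy
{
    // 50 BTC initially, halved every 210,000 blocks.
    public static decimal RewardAtHeight(int height)
    {
        decimal reward = 50m;
        for (int halvings = height / 210000; halvings > 0; halvings--)
            reward /= 2m;
        return reward;
    }
}

// Subsidy.RewardAtHeight(130904) == 50m (today); Subsidy.RewardAtHeight(210000) == 25m
```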

Because nodes have no obligation to include transactions in the blocks they generate, Bitcoin senders may voluntarily pay a transaction fee. Doing so will speed up the transaction and provide incentive for users to run nodes, especially as the reward per block amount decreases over time. Nodes collect the transaction fees associated with all transactions included in blocks they solve.

My Mining Experiences!

OK, so it sounds easy enough but how long will it take me to generate a block? There is a calculator here:

https://en.bitcoin.it/wiki/Generation_Calculator

The diagram below shows an estimate of the amount of time, on average, that you will need to do mining at the specified hash rate before you will generate a block (and earn 50 BTC):

Image

Mmmm … seems like a long time.

However, it is important to realise that block generation is not a long, set problem (like doing a million hashes), but more like a lottery. There’s no such thing as being 1% towards solving a block. You don’t make progress towards solving it. After working on it for 24 hours, your chances of solving it are equal to what your chances were at the start or at any moment.

With this in mind I started mining (just in case I got lucky). Remember 1 block currently gets a 50 BTC reward which (at the time I started) could be worth 50 x $30 = $1500!

On my Windows workstation I get 1.3 Mhash/s, i.e. I generate 1.3 million hash attempts per second trying to solve the current target and get a block accepted by the network. On my MacBook Pro I get 8.0 Mhash/s. The difference between the two is that one uses CPU hashing and the other uses GPU (graphics card) hashing, but I won’t bore you with the details here.
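For a rough sanity check on those hash rates, the expected time to solve a block on your own is approximately difficulty × 2³² / hashrate seconds. The difficulty figure in the example below is illustrative, not the network difficulty at the time of writing:

```csharp
using System;

static class MiningOdds
{
    // Expected time (in days) to solve a block solo:
    // roughly difficulty * 2^32 hashes are needed on average.
    public static double ExpectedDays(double difficulty, double hashesPerSecond)
    {
        double seconds = difficulty * Math.Pow(2, 32) / hashesPerSecond;
        return seconds / 86400.0;
    }
}

// Example with an illustrative difficulty of 1,500,000:
//   MiningOdds.ExpectedDays(1500000, 8.0e6) -> roughly 9,300 days at 8 Mhash/s
//   MiningOdds.ExpectedDays(1500000, 1.3e6) -> roughly 57,000 days at 1.3 Mhash/s
```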

Straight away I struck fool’s gold!

Image

I checked the debug log for my Bitcoin client and there is no mention of “generated” (which there should be). DiabloMiner on my MacBook equally has not produced anything so far (which is to be expected). Mining in a pool has resulted in 47 shares (shares of a block reward) worth 0.004 BTC!

Image

I will keep running for a bit longer and see what happens. In the meantime if you want to donate some coins to me my address is: 1L5GQ6JiUkSeLBgsdoy9e9wGKqbvn2mgUH

PS: There is a much wider discussion to be had here about the rise of Crypto Currencies in terms of money laundering and illegal activities – see here:

http://www.techdirt.com/articles/20110605/22322814558/senator-schumer-says-bitcoin-is-money-laundering.shtml

All of this will have an effect on the “markets” (http://bitcoincharts.com/markets/) which is what really interests me.

 


DataSift integration with Genesys Social Engagement

Here we go again … time to develop a custom application to take a DataSift stream and integrate it into Genesys Social Engagement (http://genesysguru.com/blog/blog/2011/04/08/genesys-social-engagement/).

The first step was to sign up as an alpha tester. Within 24 hours I received my invite:

Image

Now time to create some custom streams!

Configuration

1. Log in to DataSift at http://datasift.net

Image

2. Click on “Settings” to get my API key which I will need later to access my custom streams through the API:

Image

3. Click on “My Licenses” to setup Twitter stream licensing:

Image

3. Click on “My Licenses” to set up Twitter stream licensing:

Image

5. Create a stream definition (filtering rules) for this Stream using CSDL (Curated Stream Definition Language). Clicking on the Code Toolbox makes this relatively easy but don’t expect a full IDE like Visual Studio!

Image

Image

When I click on Save I get some important information – the unique key that will be used to access the stream later through the API:

Image

6. Having created a custom stream I can now click on the “Live” tab and see all live Tweets that match the CSDL definition I just created:

Image

7. If I click on “Use” I can get an estimated cost to consume this stream. In this case $0.35 / hour. Also note in the screenshot below that my stream definition is versioned:

Image

8. Finally, clicking on dashboard I can see all of my streams as well as the public streams created by other users:

Image

Image

Ok, so far so good!

All of that took less than 10 minutes. The web based GUI worked fine in Firefox (unlike Gnip) and was easy and intuitive to use. What I *really* like about this GUI is that it is simple enough for Business users to create and modify stream definitions and to see the results in realtime.

Versioning means that if Mr. Cockup is at home we can recover the situation! Also, the estimated cost to consume the stream means that budgets can be kept under control.

Right, back to techie land and a bit of C# coding to consume the stream via the DataSift API.
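As a starting point, my consumer will look roughly like the sketch below. The username and API key come from the Settings page and the hash from the saved stream definition, but note that the streaming endpoint URL shown here is my assumption about the format rather than something taken from the DataSift documentation – check the API docs for the real one:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class DataSiftStreamReader
{
    static void Main()
    {
        string username = "myusername";      // placeholder
        string apiKey = "MY_API_KEY";        // placeholder, from the Settings page
        string streamHash = "STREAM_HASH";   // placeholder, shown when the stream is saved

        // ASSUMPTION: this endpoint format is a guess, not from the DataSift docs.
        string url = "http://stream.datasift.net/" + streamHash;

        var request = (HttpWebRequest)WebRequest.Create(url);

        // Basic authentication with username and API key.
        string auth = Convert.ToBase64String(Encoding.ASCII.GetBytes(username + ":" + apiKey));
        request.Headers["Authorization"] = "Basic " + auth;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // Assuming one JSON interaction per line; hand each one off to the
            // Genesys Social Engagement integration (just printed here).
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Trim().Length > 0)
                    Console.WriteLine(line);
            }
        }
    }
}
```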

To be completed …

27/05/2011: DataSift Twitter feed has been down for 24 hours so development work stopped for now.

04/06/2011: Integration completed. Contacts and Interactions being created automatically. Just need to hook up a strategy to test out some auto reply functionality and then finish my custom social desktop application which uses the PSDK. Will post again with a demo ASAP.


MediaSift / DataSift

http://mediasift.com/
http://datasift.net/

Image

Image

Just when I thought Gnip (or Ping backwards!) was the dogs b*llocks I discover MediaSift which on paper would seem to have even bigger b*llocks!

Based in the UK (hooray!), MediaSift is a British technology startup and was formed by CEO Nick Halstead (@nickhalstead / @nik) in 2007. In 2008 they launched TweetMeme and their next generation platform is called DataSift.

DataSift will allow customers to search (sift) the full Twitter firehose based on the data in a standard Twitter JSON object (see below) along with the addition of data enrichment (augmentation) from third party services including Klout (influence), PeerIndex (influence), InfoChimps, Qwerly (people search) and Lexalytics (text and sentiment analysis).

Image

A nice feature of DataSift is the Qwerly integration. This will allow the linkage between Twitter and other linked social media accounts such as LinkedIn.

If we believe the hype then it should be possible to ‘sift’ the Twitter firehose for “Football fans in Manchester with over 50 followers who have mentioned Ryan Giggs in a negative way with no swearing in the past day!”

So how does it compare to Gnip?

Availability

Firstly, Gnip services are available now. DataSift has been in private alpha testing since Q4 2010 (08/12/2010) and will not officially launch until Q3 2011. In May 2011 (13/05/2011) DataSift started beta testing.

Gnip runs on individual Amazon EC2 instances per customer data collector. I believe DataSift uses a cloud based architecture running a custom engine named “Pickle” (Nick, please DM or email me if I am wrong).

Cost

Cost wise, for the Premium Twitter Power Track feed, Gnip charges $2000 / month to rent a data collector and then $0.10 per 1k Tweets delivered. Since Twitter charges all companies $0.10 per 1k Tweets, effectively Gnip are just charging a fixed monthly rental for the collector (Amazon EC2 instance).

The DataSift model is based on a Pay per Use subscription model with processing tiers. As such there is no fixed monthly cost. A user can set an upper limit on the amount of money they are willing to spend per month on DataSift Stream data.

If that upper limit is reached, the user will automatically be disconnected from all their chargeable Streams until the monthly spend amount is increased, or a new month starts.

Basically, custom streams require processing power, which is split into three tiers: the more complex the search definition, the higher the tier and the cost. There are then additional publisher costs on top.

As a cost comparison, if I assume 20 working days per month, or $100 per day, what can I get for my $2000? Using the DataSift pricing calculator (http://datasift.net/pricing) the answer is 23,000 Tweets (interactions) per hour using a highly complex search definition. That is roughly 552,000 Tweets per day, so an additional $55.20 per day of publisher costs ($0.10 per 1k Tweets from Twitter) would apply to both Gnip and DataSift. Hence a total of around $155 per day, or roughly $0.28 per 1k Tweets.

Image

But hang on – 23K Tweets Per Hour (TPH)! For normal commercial applications we are probably talking an absolute *maximum* of 1000 filtered TPH (which is a bit less than the 12.4M TPH reported during Bin Laden’s Death – http://mashable.com/2011/05/02/bin-laden-death-twitter/). This is further backed up by an analysis of the #CustServ hashtag which averages 1000 TPH on Tuesdays.

Using the DataSift pricing calculator again and this time assuming a 12 hour working day using a medium complex search definition we have a total cost of $13.80 per day which is less than $300/month.

Image

Unless I am very wrong (please email me to let me know if I am), this means that Gnip is x7 more expensive than DataSift. Yikes – time for a new business model!

Data Enrichment

Gnip provides data enrichments capabilities such as URL expansion and mapping and duplicate exclusion. Gnip can also append Klout Scores to Tweets and filter for Tweets by users who have Klout Scores within a specified range.

Similarly, DataSift provides Augmentation services. These services include Influence Analysis (Social Authority) from Klout and PeerIndex, Natural Language Processing (NLP) using Lexalytics Salience for Sentiment analysis, and Social Identity Aggregation (People Search) using Qwerly.

Search

DataSift supports CSDL (Curated Stream Definition Language), which was previously called FSDL (Filtered Stream Definition Language). CSDL is a powerful search language used to define complex rules which filter and curate streams. This is similar to Gnip rules. However, the capabilities provided by DataSift are much more comprehensive than Gnip’s.

DataSift sources of data are called “targets” or “input services”. Custom streams are defined in CSDL using targets in the “My Streams” section of the DataSift dashboard. CSDL also provides access to augmentation (data enrichment) targets through services such as Lexalytics Salience, TweetMeme, Peer Index, Klout and InfoChimps that allow streams to be augmented with third party data.

Regular expressions are supported in CSDL via the Google RE2 regular expression engine.

Here are some CSDL examples:

Image

Slightly worryingly, augmentation targets as specified in CSDL seem to be tied to data in the JSON objects exposed by third party APIs. I wonder if this could cause problems moving forward if and when these APIs change.

Once a stream has been built it can also be used in the definition of another stream, and that stream in another, and so on.

Image

Interestingly, DataSift is encouraging users to build public streams that are discoverable and accessible to other users by providing a number of options on the stream page such as tagging, sharing, comments and visits. Also, most commented and top rated streams are also featured on the home section of the DataSift dashboard. Cool!

Image

API and data formats

Both Gnip and DataSift provide HTTP Streaming APIs using Basic rather than OAuth authentication. Gnip supports output in XML, JSON and Activity Streams formats. DataSift only supports streaming output in the JSON format. Once again there is a concern that the DataSift format might change in the future, which is where Gnip has an advantage since it supports Activity Streams.

In addition to the Streams API, DataSift also provides a Data API which makes application backfilling on startup easy!

Gnip provides APIs to add or delete rules on a data collector. DataSift provides APIs to comment on and manage streams (get, create, update, duplicate, rate, delete, browse, search and compile). In addition to being able to modify CSDL definitions this API also allows for the public discovery of existing public streams.

Finally, DataSift provides a Recording API which allows a stream to be recorded and subsequently retrieved and analysed offline.

Conclusion

Time to write another piece of integration code into Genesys Social Engagement!


Gnip integration with Genesys Social Engagement (Part 1)

As mentioned in a previous post (http://genesysguru.com/blog/blog/2011/05/20/tweets-10-per-1k-social-engagement-and-api-rate-limits/), for commercial applications which require deep historical search, analysis and data enrichment, what Twitter really wants is for developers to use Gnip (http://gnip.com/) who are a commercial reseller of Twitter data.

Image

With Gnip, Twitter data is made available in either Twitter’s native JSON format or a Gnip-provided JSON Activity Streams format. Any data enrichment that Gnip adds to tweets is only available in the Activity Streams format (e.g. unwound URLs, Klout reputation scores, etc). For this reason, it is recommended to use the Activity Streams format.

With this in mind I set off to develop a custom application to take a Gnip activity stream and integrate it into Genesys Social Engagement (http://genesysguru.com/blog/blog/2011/04/08/genesys-social-engagement/).

The first step was to sign up for a free 72-hour Gnip trial (https://try.gnip.com/) and then configure my data collector:

Configuration

1. Login to Gnip:

Image

2. The main dashboard shows each feed into the Data Collector and the health / performance of each feed:

Image

3. Click on “edit data feed” to edit the parameters associated with a feed:

Image

4. Define any rules for filtering the stream:

Image

5. Select the output format and any data enrichments. Here I select the output format as a JSON Activity Stream and also add some data enrichment to expand shortened (bit.ly) URLs:

Image

6. Select the data delivery format. Here I select a HTTP stream:

Image

7. Once the feed is configured, click on the feed to display activity:

Image

8. The Overview tab provides a high level overview of the queries performed on the Twitter Stream including the number of polls performed and the number of activities returned:

Image

9. The Data tab shows the data returned from each query. It also includes details of the HTTP stream e.g. https://trial66.gnip.com/data_collectors/11/stream.xml

Image

10. The Rules tab shows the metrics associated with each rule defined to filter the stream:

Image

Ok, so far so good! Next, a bit of C# coding to consume the activity stream via the Gnip Activities API which takes a number of parameters:

  • max: The maximum number of activities to return, capped at 10,000
  • since_date: Only return activities since the given date, in UTC, in the format “YYYYmmddHHMMSS”
  • to_date: Only return activities before the given date, in UTC, in the format “YYYYmmddHHMMSS”
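A first cut of the consumer looks roughly like this. The stream URL is the one shown on the Data tab above and Gnip uses Basic authentication; how the max/since_date/to_date parameters are appended to the URL is my assumption rather than something confirmed by the Gnip documentation:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class GnipActivityReader
{
    static void Main()
    {
        // Stream URL from the data collector's Data tab (trial collector shown above).
        // ASSUMPTION: parameters are passed as a query string.
        string url = "https://trial66.gnip.com/data_collectors/11/stream.xml"
                   + "?max=1000&since_date=20110601000000";

        var request = (HttpWebRequest)WebRequest.Create(url);

        // Gnip uses HTTP Basic authentication (placeholder credentials).
        string auth = Convert.ToBase64String(Encoding.ASCII.GetBytes("user:password"));
        request.Headers["Authorization"] = "Basic " + auth;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // Read each activity and hand it to the Genesys Social Engagement
            // integration (just printed here).
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Console.WriteLine(line);
            }
        }
    }
}
```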

However, during my testing I could only get a JSON activity stream using the all activities stream. Even then all activities on this stream seemed to be for the Facebook feed. My Twitter stream would only return activities in an XML format:

Image

After a quick email to Gnip support it turns out that the trial streams are only available in XML format. However, they did offer me a free 2 week trial of their Power Track product which puts me back in business!

To be completed … but there are some new kids on the block … DataSnip.

 


Tweets $.10 per 1k – Social Engagement and API Rate Limits

It would seem to me that at present one of the biggest limitations of any Social Engagement solution is the rate limits imposed on APIs to social networks such as Twitter and Facebook. For example, what happens to a Customer Tweet when the hourly API rate limit has been exceeded and I cannot retrieve my new Followers, Mentions or Retweet timelines? How long should the Customer wait before receiving even an automated response?

For Twitter, the default rate limit depends on the authorisation method being used and whether the method itself requires authentication. Anonymous REST API calls are based on the IP address of the host and are limited to 150 requests per hour. Authenticated (OAuth) calls are limited to 350 requests per hour. However, there are additional Search rate limits, Feature rate limits, Account rate limits as well as unpublished “Twitter Limits”. If you are really naughty and do not honour the rate limit, your IP address might get Blacklisted!

Obviously I can get around per account (authenticated) rate limits by using multiple accounts. However, this just adds system management, configuration overhead and complexity.

The other “unknown” is how these rate limits will change in the future – is it too dangerous to build an application on an API you can’t control? With any public API, application developers are always at risk that their efforts will simply be erased by some unpredictable move on the part of the company that controls the API. Twitter says “… we will monitor how things are going and if necessary reduce the rate further“. Oh dear!

At present the best advice from Twitter is “it is best practice for applications to monitor their current rate limit status and dynamically throttle requests if necessary“. In other words, either a) develop highly complicated strategies to manage the rate limits, at the risk of these changing at any time, or b) risk missing, or not being able to respond to, an important Customer Tweet!
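As a minimal example of the “monitor and throttle” approach, the v1 REST API exposes a rate limit status method that can be polled before each batch of calls; a sketch (unauthenticated, so it reports the per-IP limit):

```csharp
using System.IO;
using System.Net;

static class RateLimitMonitor
{
    // Fetch the current rate limit status from the Twitter v1 REST API.
    // Unauthenticated calls report the per-IP limit; authenticate to see the
    // per-account limit. Inspect fields such as remaining_hits and hourly_limit
    // in the returned JSON to decide whether to throttle.
    public static string GetStatusJson()
    {
        string url = "http://api.twitter.com/1/account/rate_limit_status.json";
        using (var response = WebRequest.Create(url).GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```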

Given that not all Twitter REST API methods are rate limited (I can always update my status using statuses/update, for example), maybe I am worrying too much?

If you are serious about Sentiment and Influence analysis as part of Customer Service then I think not. This is because sentiment and influence analysis cannot be performed on a single Tweet. What matters is the sentiment across the whole Tweet thread and not just the Tweet in isolation. How many times has the Tweet been Retweeted (RT) and by whom? What is the sentiment associated with any Replies? What is the Klout or other influence score associated with the key actors in the thread? This sort of analysis will inevitably eat into REST API rate limits.

So where does this leave us?

Well in the past high volume users such as Klout were on a Whitelist which allowed them to make a much higher number of API requests per hour – 20,000 compared to 350. However, in February 2011 Twitter announced:

“Twitter will no longer grant whitelisting requests. Twitter whitelisting was originally created … at a time when the API had few bulk request options and the Streaming API was not yet available. Since then, we’ve added new, more efficient tools for developers, including lookups, ID lists, authentication and the Streaming API. Instead of whitelisting, developers can use these tools to create applications and integrate with the Twitter platform.”

The Streaming API would seem to be the only way forward then, since it is designed to “allow high-throughput near-realtime access to subsets of public and protected Twitter data“.

There are 3 Twitter Streaming “products”: The Streaming API, User Streams and Site Streams. The user stream is intended to provide all the update data required for a desktop application after startup once a REST API backfill has been completed. This includes protected statuses such as followings and direct messages.

The main one of interest to us is the Streaming API since this provides filtered public statuses (including replies and mentions) from all users.

Even then, the Streaming API is only part of the equation for a couple of reasons:

  • Status quality metrics and the data access level limits (Spritzer, Gardenhose, Firehose etc) are applied. This means that some statuses will be filtered out automatically.
  • Duplicate messages can be delivered on the stream.
  • The Streaming API Quality of Service (QoS) is “Best-effort and unordered”. This means that “on rare occasion and without notice, statuses may be missing from the delivered stream“.

For commercial applications which require deep historical search, analysis and data enrichment, what Twitter really wants is for developers to use Gnip (http://gnip.com/) who are a commercial reseller of Twitter data.

There is an interesting article on Gnip here: http://www.readwriteweb.com/hack/2011/02/twitter-sets-a-price-for-tweet.php

“Last week at Strata, Gnip released a new set of features for its social-stream processing platform. Called Power Track, the new layer allows customers to set up complex search queries and receive a stream of all the Twitter messages that match the criteria. Unlike existing ways of filtering the firehose, there are no limits on how many keywords or results you can receive.

On top of the standard $2,000 a month to rent a Gnip collector it will cost 10 cents for every thousand Twitter messages delivered.

For clients that want *every* Tweet for a keyword, it supplies a comprehensive solution, rather than trying to work around the traditional Twitter search APIs that have restrictions on volume and content. ”

All good stuff and starts to put a cost basis to Tweets – get used to them costing $.10 per 1k to receive. The Gnip data enrichments capabilities e.g. URL expansion and mapping and duplicate exclusion are also noteworthy. Gnip can even append Klout Scores to Tweets and filter for Tweets by users who have Klout Scores within a specified range – nice!

If you have read this far then thanks for reading! If you want to know why I am so interested in this then please check back over the next couple of weeks for further posts on a “Social Project” that I am working on.
