Monday, August 27, 2012

Zoosk - The Engineering behind Real Time Communications

This is a guest post by Peter Offringa, VP of engineering at Zoosk. Zoosk is a 50 million member romantic social network.

Our members get the most rewarding experience from Zoosk when they can interact in real-time. After all, a future relationship is potentially at the other end of every connection a user makes. The excitement and richness of this situation can only be fully realized in real-time. The suite of Zoosk services that facilitates these interactions is referred to collectively as real-time communications (RTC). These communications are delivered using the XMPP protocol, which also powers other popular instant messaging products. Zoosk members experience real-time communications through three distinct interactions:

  • Presence.  When a member is actively connected to the Zoosk RTC infrastructure, their public status appears as ‘available’. If they are idle for a period of time, their state transitions to ‘away’. Their presence automatically changes to ‘offline’ when they close or disconnect their client application. A member can also opt to appear “invisible” to other users. This option allows them to remain on the Zoosk service and see other online members, but not appear as such in other users’ rosters.
  • Notifications.  Significant interactions are packaged visually as ‘toasts’ accompanied by short messages. Toasts represent events to a user such as receiving a flirt, having their profile viewed, or being matched with another user. The Zoosk service utilizes these notification packets to tell the client applications to update the value of UI-related badges, such as the number of unread messages from another user.
  • Messaging.  If two users are online simultaneously, they can send messages to each other in a familiar ‘instant messaging’ chat format. These messages are transmitted through the RTC infrastructure in real-time. Message content is also persisted to a database so that message history can be retrieved if the user later reconnects using a different client application.
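
The presence rules described above can be sketched as a small function. This is an illustration only: the 300-second idle threshold, the choice to show invisible members as offline, and every name in the snippet are assumptions, not values taken from Zoosk or Tigase.

```python
from enum import Enum


class Presence(Enum):
    AVAILABLE = "available"
    AWAY = "away"
    OFFLINE = "offline"


# Assumed idle threshold; the article does not state the real value.
IDLE_THRESHOLD_SECONDS = 300


def public_presence(connected: bool, idle_seconds: float,
                    invisible: bool = False) -> Presence:
    """Return the status that other members would see for this member."""
    if not connected or invisible:
        # Assumption: an invisible member stays connected but is shown
        # as offline in other users' rosters.
        return Presence.OFFLINE
    if idle_seconds >= IDLE_THRESHOLD_SECONDS:
        return Presence.AWAY
    return Presence.AVAILABLE


print(public_presence(connected=True, idle_seconds=12).value)
```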

These communications are currently delivered to users on all major Zoosk products – the Zoosk.com site and Facebook app through a web browser, the iPhone app, iPad, Android, and a downloadable desktop application.

RTC Infrastructure

These RTC services are delivered through a highly performant and scalable XMPP-based infrastructure. The chat server, powered by the open source Jabber server Tigase, is the heart of this service. Tigase is written in Java, and our Platform team has created a number of custom extensions which handle Zoosk-specific business logic.

Tigase is deployed on standard 8 CPU, Linux-based application server class machines. The Tigase servers are configured in paired clusters, with a primary and secondary node managed through a load balancer. At any given time, all connections are directed to the primary node. If the service check to the primary server fails, the load balancer will immediately begin re-directing user traffic to the secondary server.
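
The primary/secondary routing rule can be expressed in a few lines. The sketch below is not the load balancer's actual configuration; the host names and the `is_healthy` callback are hypothetical stand-ins for whatever service check the load balancer runs.

```python
from typing import Callable, Optional


def choose_node(primary: str, secondary: str,
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Send traffic to the primary while its check passes, else the secondary."""
    if is_healthy(primary):
        return primary
    if is_healthy(secondary):
        return secondary
    return None  # neither node passed its service check


# Hypothetical host names; simulate a failed check on the primary.
failed = {"chat01-a.example.com"}
target = choose_node("chat01-a.example.com", "chat01-b.example.com",
                     lambda host: host not in failed)
print(target)
```

Keeping the secondary idle until the primary fails means no session state has to be shared between the two nodes during normal operation, at the cost of reconnecting clients on failover.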

There are 18 of these paired clusters, each handling 4,000 to 8,000 connections at any time. In addition to socket connections for transmitting XMPP traffic, Tigase also includes a service for supporting BOSH connections over HTTP.
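
As a quick back-of-the-envelope check, the figures above imply the following range of concurrent connections across the whole tier. The snippet only multiplies the numbers quoted in this post.

```python
CLUSTERS = 18
MIN_PER_CLUSTER = 4_000
MAX_PER_CLUSTER = 8_000


def total_connections(per_cluster: int, clusters: int = CLUSTERS) -> int:
    """Total concurrent connections if every cluster carries the same load."""
    return per_cluster * clusters


low = total_connections(MIN_PER_CLUSTER)
high = total_connections(MAX_PER_CLUSTER)
print(f"{low:,} to {high:,} concurrent connections")
# prints "72,000 to 144,000 concurrent connections"
```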

BOSH is the protocol by which web browsers on Zoosk.com and our Facebook app maintain a persistent connection to Tigase. Our desktop application and mobile apps use standard TCP/IP socket connections.
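
For readers unfamiliar with BOSH (XEP-0124), a browser client opens a session by POSTing an XML `<body/>` wrapper over HTTP and then keeps long-lived requests open to receive data. The sketch below builds such a session-creation body with a subset of the attributes defined by the specification. The domain, the request id, and the `wait`/`hold` values are illustrative assumptions, not Zoosk's settings.

```python
import xml.etree.ElementTree as ET

BOSH_NS = "http://jabber.org/protocol/httpbind"


def bosh_session_request(domain: str, rid: int,
                         wait: int = 60, hold: int = 1) -> str:
    """Build a minimal BOSH session-creation <body/> element as a string."""
    body = ET.Element("body", {
        "xmlns": BOSH_NS,
        "to": domain,         # XMPP domain the client wants to reach
        "rid": str(rid),      # request id, incremented on each request
        "wait": str(wait),    # longest time (seconds) the server may hold a request
        "hold": str(hold),    # number of requests the server may keep open
        "ver": "1.6",         # BOSH protocol version supported by the client
    })
    return ET.tostring(body, encoding="unicode")


# Hypothetical domain and request id.
print(bosh_session_request("chat.example.com", rid=1573741820))
```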


A user’s online state is tracked in real-time by the Tigase servers via persistent connections between Tigase and the client applications (web browser, mobile device, desktop application).  Many core Zoosk product features, including search results, profile views and messaging, require ensuring that this state is reflected in near real-time on all client applications. To keep this state consistent throughout the rest of the Zoosk infrastructure, the user’s record in the user database is updated to reflect their current online state including a timestamp of their most recent online transition.

The user’s online state is also stored in cache on our search infrastructure, so that search results can take online state into account. Zoosk search functionality is powered by a tier of SOLR servers. We have extended each SOLR server to include an ehcache instance to store those users who are online currently. This cache of online state is updated in real-time through a dedicated Tigase instance referred to as the Online State Manager (OSM).

The OSM receives custom XMPP packets indicating the user’s online state from the primary Tigase chat servers and then makes a network call to update the ehcache instance on each of the SOLR servers. There are roughly 8,000 of these online state transitions a minute during peak traffic.  Maintaining this cache outside of the SOLR index allows the user’s presence state to be updated in real-time, separate from the periodic index replication snaps from master to slave. The user’s presence state is then combined with search results at query time to either filter or rank results based on whether the user is online currently. The search algorithm prefers users who are online, as this encourages real-time communication and provides a richer experience for other users.
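
The fan-out and query-time combination described above can be sketched as follows. This is a simplified in-memory model, not the SOLR or ehcache integration itself: `OnlineCache` stands in for the cache on each search server, and all class and function names are hypothetical.

```python
from typing import Iterable, List, Set


class OnlineCache:
    """Stand-in for the per-server cache of currently online user ids."""

    def __init__(self) -> None:
        self.online: Set[int] = set()

    def apply(self, user_id: int, is_online: bool) -> None:
        if is_online:
            self.online.add(user_id)
        else:
            self.online.discard(user_id)


def broadcast(caches: Iterable[OnlineCache], user_id: int,
              is_online: bool) -> None:
    """Push one online-state transition to every search server's cache."""
    for cache in caches:
        cache.apply(user_id, is_online)


def rank_online_first(results: List[int], cache: OnlineCache) -> List[int]:
    # sorted() is stable, so the original relevance order is preserved
    # within the online group and within the offline group.
    return sorted(results, key=lambda uid: uid not in cache.online)


def filter_online(results: List[int], cache: OnlineCache) -> List[int]:
    return [uid for uid in results if uid in cache.online]


caches = [OnlineCache(), OnlineCache()]
broadcast(caches, 7, True)
broadcast(caches, 9, True)
broadcast(caches, 9, False)  # user 9 went offline again
print(rank_online_first([3, 7, 9, 5], caches[0]))  # prints [7, 3, 9, 5]
```

At the quoted peak of roughly 8,000 transitions a minute, this amounts to about 133 updates per second per search server, which is why keeping the state in a cache rather than in the replicated index is attractive.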
 

User interactions with the Zoosk service outside of the core RTC features can also trigger business logic that generates a real-time notification to a connected user. For example, if another user views our user’s profile, or accepts our user’s friend request, we want to notify our user of that action immediately. The PHP-based web application will trigger an asynchronous job that opens a network connection to a Tigase server and passes an XMPP data packet to the server, with a custom message payload providing the data for the notification. This packet is processed by Tigase and routed to the client application from which the user is currently connected.

The user’s client application then processes this custom packet and displays the appropriate “toast” to the user or updates a “badge” reflecting the current value of a particular feature indicator (number of profile views, unread messages, etc.). If the user is offline at the time, Tigase will store the packet until the user reconnects, at which point it will pass the custom packet to the user’s client application.
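
The notification flow, including the store-until-reconnect behavior, might look roughly like the sketch below. The payload namespace, the element and attribute names, and the `Router` class are all hypothetical; the post does not describe Zoosk's actual packet format, and Tigase's offline storage is far more involved than an in-memory queue.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, deque
from typing import Deque, Dict, List

NOTIFY_NS = "urn:example:notification"  # hypothetical payload namespace


def build_notification(to_jid: str, kind: str, badge_count: int) -> str:
    """Build a <message/> stanza carrying a custom notification payload."""
    message = ET.Element("message", {"to": to_jid, "type": "headline"})
    ET.SubElement(message, "notification", {
        "xmlns": NOTIFY_NS,
        "kind": kind,               # e.g. "profile_view" (assumed name)
        "badge": str(badge_count),  # new value for the UI badge
    })
    return ET.tostring(message, encoding="unicode")


class Router:
    """Deliver stanzas to connected users, queueing them for offline users."""

    def __init__(self) -> None:
        self.sessions: Dict[str, List[str]] = {}  # jid -> client inbox
        self.offline: Dict[str, Deque[str]] = defaultdict(deque)

    def route(self, jid: str, stanza: str) -> None:
        inbox = self.sessions.get(jid)
        if inbox is None:
            self.offline[jid].append(stanza)  # hold until reconnect
        else:
            inbox.append(stanza)

    def connect(self, jid: str) -> List[str]:
        inbox: List[str] = []
        self.sessions[jid] = inbox
        while self.offline[jid]:
            inbox.append(self.offline[jid].popleft())  # flush in arrival order
        return inbox


router = Router()
router.route("alice@example.com",
             build_notification("alice@example.com", "profile_view", 3))
print(len(router.connect("alice@example.com")))  # prints 1
```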

Monitoring and Testing

The Zoosk technical operations team has built a number of ways to test and monitor the health of the RTC infrastructure to ensure responsiveness and availability. These tests primarily involve various mechanisms to gather performance data from Tigase servers, or to simulate real user interactions. If a particular health check fails or performance data falls outside of established thresholds, our Nagios installation will generate an alert.

  • Tigase Monitor - This is a script that runs on cron every 10 minutes. It logs into all primary chat servers and tests connections and presence transmission.  It records the results of these tests and sends updates to Nagios to determine whether to generate an alert.
  • Performance Metrics for Tigase - These cover a variety of internal Tigase measures, including times to perform key functions, message counts, queue sizes, memory consumption, etc. These values are collected every 2 minutes by an ad hoc stats command through the XMPP Admin interface.  These metrics are then passed to Ganglia for graphing.
  • Business Intelligence Reports - Every hour, a script checks the number of active connections to each primary Tigase server and the number of messages it has passed in the prior hour. This data is loaded into a database. A customized Excel report can connect to this data source and provide a summarized view of the data with easily comparable historical trending.
  • Tigase Test Suite - This is a headless XMPP client that logs into each Tigase server and simulates real interactions.  TTS will then record the results of its functional tests for the team to review.
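
A threshold-based health check of the kind described above can be sketched as follows. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL); the metric names and threshold values are made up for illustration and are not Zoosk's.

```python
from typing import Dict, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_threshold(value: float, warn: float, crit: float) -> int:
    """Map one metric value to a Nagios status code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def evaluate(metrics: Dict[str, float],
             thresholds: Dict[str, Tuple[float, float]]) -> int:
    """Return the worst status across all metrics that have thresholds."""
    return max(
        (check_threshold(metrics[name], warn, crit)
         for name, (warn, crit) in thresholds.items() if name in metrics),
        default=OK,
    )


# Hypothetical metric names and limits.
thresholds = {"queue_size": (1_000, 5_000), "heap_used_pct": (80, 95)}
sample = {"queue_size": 1_200, "heap_used_pct": 60}
print(evaluate(sample, thresholds))  # prints 1
```

A cron-driven script would typically end by passing the result to `sys.exit()` so that Nagios can read it as the plugin status.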

What’s Next

Looking forward, we will continue to actively explore new ways to leverage the real-time experience for Zoosk members. We will be rolling out RTC support to our mobile web application (Touch) in the next month.  Other devices or mediums that deliver the Zoosk application will similarly be connected in real-time. As our members increase the amount of time they are actively connected to Zoosk applications, we plan to enhance our RTC-based features to facilitate easier discovery and communication between members.

 

Reader Comments (4)

The graphic here (perhaps unintentionally) leaks the number of active users on Zoosk.com.

August 27, 2012 | Unregistered CommenterMatt Basta

Actually, what it leaks is the number of active users at any one time. Matt is assuming that the same people stay logged in all day long. But Zoosk isn't too tight lipped when it comes to scaling out their real-time platform.

August 28, 2012 | Unregistered CommenterGlenn

How come this load balancing of only 150k simultaneous users between the 18 servers ? Wouldn't just one tigase instance suffice such a low volume ? Or maybe this is just for an overrealistic redundancy factor ?

December 28, 2012 | Unregistered Commenterkellogs

I could so work for a company like this!

July 14, 2013 | Unregistered CommenterFelicia Lira-Benedit
