Ask HS: Design and Implementation of scalable services?
We have written agents that are deployed across the network. Each agent sends data every 15 seconds, possibly as often as every 5 seconds. We are working on a service to which all agents can post data/tuples with a small payload. A drop rate of up to 5% is acceptable. Ultimately the data will be segregated and stored in a DBMS (currently we are using MySQL).
Questions I am looking for answers to:
1. Client/Server Communication: Agents can post data. The status of each send is not that important. But there is a remote case where agents need to be notified if the server-side system generates an event based on the data sent.
- A lot of advice on the internet suggests using a message bus (ActiveMQ) for async communication. Multicast and UDP are the alternatives.
2. Persistence: After some evaluation, the data is to be stored in a DBMS.
- The end result of processing is an aggregated record, for which MySQL looks scalable. But the volume of data grows exponentially, so we are considering HBase as an option.
I would like to know whether there are any alternatives for the above two scenarios, and to get expert advice.
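To make scenario 1 concrete, here is a minimal sketch of the UDP alternative, where an agent posts a tuple and never waits for a status. It is only an illustration under assumptions: the JSON field names, the function names, and the loopback demo are all hypothetical, and it does not cover the server-to-agent notification path.

```python
import json
import socket
import time


def encode_tuple(agent_id: str, metric: str, value: float, ts: float) -> bytes:
    """Pack one reading into a small JSON datagram (field names are hypothetical)."""
    record = {"agent": agent_id, "metric": metric, "value": value, "ts": ts}
    return json.dumps(record).encode("utf-8")


def decode_tuple(payload: bytes) -> dict:
    """Inverse of encode_tuple, used on the server side."""
    return json.loads(payload.decode("utf-8"))


def send_tuple(sock: socket.socket, server_addr: tuple, payload: bytes) -> None:
    """Fire and forget: no acknowledgement and no retry, so a lost datagram is simply dropped."""
    sock.sendto(payload, server_addr)


def receive_tuple(sock: socket.socket, max_size: int = 2048) -> dict:
    """Block until one datagram arrives (or the socket times out) and decode it."""
    data, _sender = sock.recvfrom(max_size)
    return decode_tuple(data)


def loopback_demo() -> dict:
    """Send one tuple from an 'agent' socket to a 'server' socket on localhost."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.settimeout(2.0)
    agent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        send_tuple(agent, server.getsockname(), encode_tuple("agent-1", "cpu", 0.42, time.time()))
        return receive_tuple(server)
    finally:
        agent.close()
        server.close()
```

Calling `loopback_demo()` sends a single tuple over localhost and returns the decoded record. Because nothing is acknowledged, the sender cannot tell which tuples were lost, so the 5% budget would have to be checked on the server, for example by counting the tuples received per agent per interval.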
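For scenario 2, the aggregation step before persistence could look roughly like the sketch below. The tuple shape `(agent_id, timestamp, value)`, the 60-second window, and the choice of count/min/max/avg are assumptions made for illustration, not a description of the actual system.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical tuple shape: (agent_id, unix_timestamp, value)
Reading = Tuple[str, float, float]


def aggregate(readings: Iterable[Reading], window_secs: int = 60) -> List[dict]:
    """Collapse raw readings into one aggregated record per agent per time window."""
    buckets: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for agent_id, ts, value in readings:
        # Align each timestamp to the start of its window.
        window_start = int(ts // window_secs) * window_secs
        buckets[(agent_id, window_start)].append(value)

    records = []
    for (agent_id, window_start), values in sorted(buckets.items()):
        records.append({
            "agent": agent_id,
            "window_start": window_start,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        })
    return records
```

With one reading every 15 seconds and a 60-second window, each agent produces one stored row where it would otherwise produce four, which is the kind of reduction that decides whether the aggregated table stays comfortable in MySQL.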
Reader Comments (4)
You might want to use Flume-based agents on every client that is trying to send data, and then have Flume agent to Flume agent communication via Avro serialization so that you have minimal overhead... We have this in production and it is working fine without any issues.
An important thing to keep in mind is that Flume will be listening to a log file...
I think this type of question falls into the category of premature optimization. You are looking for advice on technology without even knowing your bottlenecks. It reminds me of a quote from a book Carlos Bueno wrote for developers at Facebook [1].
[1] https://www.facebook.com/notes/facebook-engineering/the-mature-optimization-handbook/10151784131623920
If you are looking at HBase, then you should also consider Cassandra. That is what we used for web-scale persistent notifications, and it turned out to scale very well. Choose Cassandra if you are going to be doing more writes than reads.
First off, Quora is a great forum for these types of questions, and based on the number of replies you have gotten here, it might even get you better feedback.
Second, I would strongly suggest column-oriented databases for data storage. Vertica is the leader in the space IMHO, but you can also look at Infobright, MonetDB, LucidDB, InfiniDB, ParAccel, or ParStream.
Third, you will certainly want some queuing system for buffering. ActiveMQ is a strong solution in the Java space. If you are in another language, RabbitMQ tends to be a favorite. That said, Kafka is worth checking out. It doesn't have all the features of JMS or AMQP, but it has performance in spades.
Finally, I assume you are not running this in the cloud. If you are, though, you should probably be using SQS and Redshift in AWS, potentially with Kinesis. Even if you are not in the cloud, you can look to these services for inspiration. You can also look at stream processing platforms like Storm for inspiration.
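To illustrate the buffering point, here is a rough in-process sketch of what a queue buys you between ingestion and the database writer: producers never block, overflow is dropped and counted, and the writer drains in batches. The `DropBuffer` class and its method names are hypothetical, and this is a stand-in for the idea only, not the API of ActiveMQ, RabbitMQ, Kafka, or SQS.

```python
import queue
from typing import List


class DropBuffer:
    """Bounded buffer that drops new items when full instead of blocking producers.

    The counters are plain ints, which is fine for a single producer; a
    multi-threaded producer side would need a lock around them.
    """

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[dict]" = queue.Queue(maxsize=capacity)
        self.accepted = 0
        self.dropped = 0

    def offer(self, item: dict) -> bool:
        """Try to enqueue without waiting; return False if the item was dropped."""
        try:
            self._queue.put_nowait(item)
        except queue.Full:
            self.dropped += 1
            return False
        self.accepted += 1
        return True

    def drain(self, max_items: int) -> List[dict]:
        """Remove and return up to max_items, e.g. for one batched database insert."""
        batch: List[dict] = []
        while len(batch) < max_items:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def drop_rate(self) -> float:
        """Fraction of offered items that were dropped so far."""
        total = self.accepted + self.dropped
        return self.dropped / total if total else 0.0
```

A writer loop would call `drain()` periodically and insert the batch in one statement, while `drop_rate()` gives a number to compare against the 5% loss the original poster says is acceptable.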