Entries in General Discussion (161)

Monday
Jun 08, 2009

Distribution of queries per second

We need to measure the number of queries-per-second our site gets for capacity planning purposes.

Obviously, we need to provision the site based on the peak QPS, not average QPS. There will always be some spikes in traffic, though, where for one particular second we get a really huge number of queries. It's ok if site performance slightly degrades during that time. So what I'd really like to do is estimate the *near* peak QPS based on average or median QPS. Near peak might be defined as the QPS that I get at the 95th percentile of the busiest seconds during the day.

My guess is that this is similar to what ISPs do when they measure your bandwidth usage and then charge for usage over the 95th percentile.

We analyzed our logs, counted the queries executed in each second of the day, sorted the seconds from busiest to least busy, and graphed the result. The resulting graph declines steeply and flattens out near zero.

Does anyone know if there is a mathematical formula that describes this distribution?

I'd like to say with some certainty that the second at the 95th percentile will get X times the average or median QPS.

(Experimentally, our data shows, over a six week period, an avg QPS of 7.3, a median of 4, and a 95th percentile of 27. But I want a better theoretical basis for claiming that we need to be able to handle 4x the average amount of traffic.)
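As a rough illustration of the measurement described above, here is a minimal Python sketch. It assumes the request timestamps have already been parsed out of the logs as Unix seconds; the function name `qps_summary` and its parameters are hypothetical, and the nearest-rank percentile is only one common definition. Seconds with no queries are counted as zeros, since leaving them out would inflate the average and median.

```python
import math
from collections import Counter
from statistics import mean, median


def qps_summary(timestamps, start, end, pct=95):
    """Summarize per-second query counts over the window [start, end)."""
    # Count how many queries landed in each whole second of the window.
    counts = Counter(int(t) for t in timestamps if start <= t < end)
    # Idle seconds are included as zeros so they weigh on avg and median.
    per_second = sorted(counts.get(s, 0) for s in range(start, end))
    # Nearest-rank percentile: smallest count with pct% of seconds at or below it.
    rank = max(1, math.ceil(pct * len(per_second) / 100))
    near_peak = per_second[rank - 1]
    avg = mean(per_second)
    return {
        "avg": avg,
        "median": median(per_second),
        "near_peak": near_peak,
        "peak": per_second[-1],
        "ratio": near_peak / avg if avg else float("nan"),
    }
```

With the figures quoted above, 27 / 7.3 is about 3.7, which is where the rough 4x comes from. The sketch only measures that ratio on real data; it does not supply the theoretical distribution being asked about.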

Saturday
Jun 06, 2009

Graph server

I've seen it mentioned a few times that sites like Digg or LinkedIn use graph servers to hold their social graphs. But the only open source graph server I've found is http://neo4j.org/ .

Can anyone recommend an open source graph server?

Thanks
Aaron

Friday
Jun 05, 2009

SSL RPC API Scalability

Hi all!

So nice to start discussing cool things in this even cooler forum :)

I am having a problem which I believe is already solved, but I would love to hear from someone with actual experience of the same topic.

We are building a client/server architecture, consisting of a web server part and many clients.
The transport will be XML-RPC, SOAP, or JSON, or all three at once.
All of the communication has to be encrypted and sent over SSL3.

We expect a high load when the application starts (> 2000 concurrent requests).
Combine this with XML parsing for the RPC API and things really look ugly :)
So it's a big mess :)

The API will not be heavily database bound: mostly files will be transferred from the server to the clients, plus a simple API for control.

So it's pretty much a matter of 'what-to-do-with-ssl'.

I was thinking of hardware - NetApp or a similar application accelerator.
Can anyone give examples of a hardware appliance that combines a load balancer and an SSL accelerator?

I have also been reading about open source software load balancers, but I really doubt they would meet our needs. Does anyone have (or has anyone had) the same experience? :)

Thanks, all!

Sunday
May 31, 2009

Need help on Site loading & database optimization - URGENT

Hi Friends,

I need some help making my site load faster. On average my site gets 2,500 hits per day, but on 16th May it had 60,000 hits. On that day the site loaded very slowly and some requests even timed out. I checked the running processes with the "top" command, and it showed that MySQL was taking too much load.

There are around 166 tables (including the phpBB forum) in my database. All content on the site is fetched from the database. I have also added indexes to the tables where required. The site is written in plain PHP/HTML.

Technology:

PHP -- 5.2
MySQL -- 5.0
Apache -- 2.0
Linux

Here are the server details for my site:

CPU : Single Socket Dual Core AMD Opteron 1212HE
Memory: 2GB DDR RAM
Hard Drive: 250GB SATA
Ethernet: 100Mb Primary Ethernet Card

(/var/log) # uname -a
Linux 2.6.9-67.0.15.ELsmp #1 SMP Tue Apr 22 13:50:33 EDT 2008 i686 athlon i386 GNU/Linux

kernel version:
2.6.9-67.0.15.ELsmp

(/var/log) # free -m
             total       used       free     shared    buffers     cached
Mem:          2026       1976         49          0        143       1474
-/+ buffers/cache:        359       1667
Swap:         1027          0       1027

RAM: 2 G

(/var/log) # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda5             227G   20G  196G  10% /
/dev/sda1              99M   12M   82M  13% /boot
none                 1014M     0 1014M   0% /dev/shm
/dev/sda2             2.0G  196M  1.7G  11% /tmp

Disk usage: 10% used/ 196 G available.

It's a dedicated server and only one website is hosted on it.

Can anybody please suggest how I can optimize the site so that it does not go down when traffic increases?

Thanks
Sandy

Monday
May 25, 2009

non-sequential, unique identifier, strategy question

(Please bear with me, I'm a new, passionate, confident and terrified programmer :D )

Background: I'm pre-launch and 1 year into the development of my application. My target is to be able to eventually handle millions of registered users with 5-10% of them concurrent. Up to this point I've used auto-increment to assign unique identifiers to rows. I am now considering switching to a non-sequential strategy. Oh, I'm using the LAMP configuration.

My reasons for avoiding auto-increment:
1. Complicates replication when scaling horizontally. Risk of collision is significant (when running multiple masters). Note: I've read the other entries in this forum that relate to ID generation and there have been some great suggestions -- including a strategy that uses auto-increment in a way that avoids this pitfall... That said, I'm still nervous about it.
2. Potential bottleneck when retrieving/assigning IDs -- IDs assigned at the database.

My reasons for being nervous about non-sequential IDs:
1. To guarantee uniqueness, the IDs are going to be much larger -- potentially affecting performance significantly.

My New Strategy: (I haven't started to implement this... I'm waiting for someone smarter than me to steer me in the right direction)
1. Generate a guaranteed-unique ID by concatenating the user id (1-9 digits) and the UNIX timestamp (10 digits).
2. Convert the resulting 11-19 digit number to base 36. The resulting string will be alphanumeric and 6-10 characters long. This is, of course, much shorter (at least with regard to characters) than the standard GUID hash.
3. Pass the new identifier to a column in the database that is type CHAR() set to binary.

My Questions:
1. Is this a valid strategy? Is my logic sound or flawed? Should I go back to being a graphic designer?
2. What is the potential hit to performance?
3. Is an 11-19 digit number (base 10) actually any larger (in terms of bytes) than its base-36 equivalent?

I appreciate your insights... and High Scalability for supplying this resource!
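To make question 3 concrete, here is a minimal Python sketch of the proposed scheme. The helper names (`to_base36`, `make_id`, `storage_sizes`) and the sample timestamp are hypothetical, and the sketch assumes a 10-digit timestamp and a lowercase 0-9a-z alphabet. It compares the character count of the decimal and base-36 forms with the number of bytes the same value needs as a raw integer.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols: 0-9 then a-z


def to_base36(n):
    """Encode a non-negative integer as a base-36 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, rem = divmod(n, 36)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


def make_id(user_id, unix_ts):
    """Concatenate the decimal digits of user_id and unix_ts, then encode."""
    return to_base36(int(f"{user_id}{unix_ts}"))


def storage_sizes(n):
    """Sizes of n as a decimal string, a base-36 string, and a raw integer."""
    return {
        "decimal_chars": len(str(n)),
        "base36_chars": len(to_base36(n)),
        "binary_bytes": (n.bit_length() + 7) // 8,
    }


# Smallest and largest cases for a 1-9 digit user id and a sample timestamp.
smallest = storage_sizes(int("1" + "1243209600"))
largest = storage_sizes(int("999999999" + "1243209600"))
```

Under these assumptions the encoded IDs come out at 7 to 13 characters rather than 6 to 10, while the same values need only 5 to 8 bytes as raw integers, so the largest case still fits in an unsigned 64-bit integer. Note also that two rows created by the same user within the same second would receive the same ID under this scheme.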


Friday
May 22, 2009

Distributed content system with bandwidth balancing

I am looking for a way to distribute files over servers in different physical locations. My main concern is that I have bandwidth limitations at each location, and wish to spread the bandwidth load evenly. At the moment I just have 1:1 copies of the files on all servers, and have the application pick a random server to serve the file as a temporary fix...

It's a small video streaming service. I want to spoonfeed the stream to the client with a maximum bandwidth output, and support seek. At present I use PHP to limit the network stream, and read the file at a given offset sent as a GET parameter from the player for seek. It's pseudo streaming, but it works.

I have been looking at MogileFS, which would solve the storage part. With MogileFS I can make use of my current PHP solution as it supports lighttpd and Apache (with mod_rewrite or similar). However, I don't see how I can get MogileFS to check for bandwidth % usage. Any recommendations for how I can solve this?
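One possible alternative to picking a purely random server is to weight the choice by spare bandwidth. The following is only a sketch in Python (not PHP, and not a MogileFS feature): it assumes the application can obtain each server's current and maximum throughput from some monitoring source, and the function name `pick_server` and the server names are hypothetical.

```python
import random


def pick_server(servers, rng=random):
    """Pick a server with probability proportional to its spare bandwidth.

    `servers` maps a server name to (current_mbps, capacity_mbps).
    """
    # Spare capacity per server; a saturated server gets zero weight.
    spare = {name: max(cap - used, 0.0) for name, (used, cap) in servers.items()}
    candidates = [name for name, s in spare.items() if s > 0]
    if not candidates:
        raise RuntimeError("all servers are at their bandwidth limit")
    weights = [spare[name] for name in candidates]
    # Weighted random choice keeps the load spread out instead of sending
    # every new request to the single least-loaded server.
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical usage: "oslo" is saturated, so only "bergen" can be chosen.
chosen = pick_server({"oslo": (100.0, 100.0), "bergen": (20.0, 100.0)})
```

Because every server holds a full copy of the files, any server returned here can serve the request; the existing throttled, offset-based delivery would stay unchanged.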


Tuesday
May 12, 2009

P2P server technology?

Is there any type of server technology that allows visitors to a website to become part of the server? With BitTorrent, users share some of their bandwidth. Would the same be possible with web servers, where a person goes to a website, then downloads and runs software that makes their internet connection, CPU and HDD part of the web server?


Friday
May 08, 2009

Publish/subscribe model does not scale?

On Wiki someone posted: "...For relatively small installations, pub/sub provides the opportunity for better scalability than traditional client-server, through parallel operation, message caching, tree-based or network-based routing, etc. However, as systems scale up to become datacenters with thousands of servers sharing the pub/sub infrastructure, this benefit is often lost; in fact, scalability for pub/sub products under high load in large deployments is very much a research challenge."

Does anyone have something to say regarding scaling Publish/subscribe models?


Wednesday
May 06, 2009

Guinness Book of World Records Anyone?

We are planning to be the first company to do a one million user load test and are looking for someone willing to be the first to have been subjected to such a test! Is YOUR site scalable enough? How do you KNOW? http://capcalblog.blogspot.com. Randy Hayes CapCal


Monday
Apr 27, 2009

Some Questions from a newbie

Hello highscalability world. I just discovered this site yesterday in a search for a scalability resource and was very pleased to find such useful information. I have some questions regarding distributed caching that I was hoping the scalability intelligentsia trafficking this forum could answer. I apologize for my lack of technical knowledge; I'm hoping this site will increase said knowledge! Feel free to answer all or as much as you want. Thank you in advance for your responses and thank you for a great resource!

1.) What are the standard benchmarks used to measure the performance of memcached or MySQL/memcached working together (from web 2.0 companies etc)?

2.) The little research I've conducted on this site suggests that most web 2.0 companies use a combination of MySQL and a hacked memcached (and potentially sharding). Does anyone know if any of these companies use an enterprise vendor for their distributed caching layer? (At this point in time I've only heard of Jive Software using Coherence.)

3.) In terms of a web 2.0 oriented startup, what are the database/distributed caching requirements typically needed to get off the ground and grow at a fairly rapid pace?

4.) Given the major players in the web 2.0 industry (Facebook, Twitter, MySpace, PoF, Flickr etc; I'm ignoring Google/Amazon here because they have a proprietary caching layer), what is the most common, scalable back-end setup (MySQL/memcached/sharding etc)? What are its limitations/problems? What features does said setup lack that it really needs?

Thank you so much for your insight!
