PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute

There has been a lot of speculation about whether PRISM exists and whether it would be cost-effective. I don't know whether it exists, but I can tell you whether it could be built.
Short answer: It can.
If you believe it would be impossible for someone with access to a social "data pool" to find out more about you (if they really want to track you down) in the tsunami of data, you need to think again.
Devices, apps and websites are transmitting data. Lots of data. The questions are whether that data could be compiled and searched, and how costly it would be to search for your targeted data. (Hint: it is not $4.56 trillion.)
Let's experiment and try to build PRISM by ourselves with a few assumptions:
- Assumption 1: We have all the appropriate "data connectors" that will provide us with data.
- Assumption 2: These connectors provide direct access to social networks, emails, mobile traffic etc.
- Assumption 3: Even though there are commercially available solutions that might perform better for data analysis, we are going to rely mostly on open source tools.
With those assumptions, how much would it cost us to have PRISM up and running and to find information about a person in less than a minute?
Let’s begin with what data is generated every month that might contain information about you.
DATA
Facebook: 500 TB/day * 30 ≈ 15 PB/month (source)
Twitter: 8 TB/day * 30 = 240 TB/month (source)
Email/other info: ≈ 193 PB/month. Google said 24 PB per day in 2008; five years later, let's assume this is 8 times bigger = 192 PB. Of that, real user information is roughly a third ≈ 64 PB/day (source)
Mobile traffic / machine-to-machine exchanges / vehicles etc.: ≈ 4,000 TB per day ≈ 117 PB/month (source)
Total data: ≈ 325 PB/month
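The tally is quick to reproduce. A minimal sketch, using the article's per-source estimates as given (they are assumptions, not measurements):

```python
# Back-of-the-envelope tally of the monthly data estimates above, in TB.
# All per-source figures are the article's assumptions.
TB_PER_PB = 1000  # decimal units, as used in the article

sources_tb = {
    "facebook": 500 * 30,            # 500 TB/day over 30 days = 15 PB
    "twitter": 8 * 30,               # 8 TB/day over 30 days = 240 TB
    "email_other": 193 * TB_PER_PB,  # the article's ~193 PB/month estimate
    "mobile_m2m": 117 * TB_PER_PB,   # ~4,000 TB/day, quoted as ~117 PB/month
}

total_pb = sum(sources_tb.values()) / TB_PER_PB
print(f"Total: ~{total_pb:.0f} PB/month")
```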
Hardware Costs
The prices below correspond to renting off-the-shelf servers from commercial high-end datacenters (assuming the data is stored in a distributed filesystem such as HDFS). This is a worst-case scenario: it does not include the potential discounts for renting such a high volume of hardware and traffic, or for buying the hardware outright (a higher initial investment but lower recurring costs). The hardware configuration used for the cost calculations in this case study is a 2U chassis with dual Intel hexa-core processors, 16 GB of RAM, and 30 TB of usable space with hardware-level redundancy (RAID5).
We'll need about 20,000 servers, housed in 320 46U racks. The server hardware comes to about €7.5M/month (including servers for auxiliary services); racks, electricity and traffic add about €0.5M/month (including auxiliary devices and networking equipment).
Total hardware cost: ≈ €8M/month, i.e. ≈ €96M per year, for ~600 PB of usable storage (storing a full year of raw intake, several exabytes, would need compression, sampling or a shorter retention window).
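As a sanity check on the sizing, here is the same arithmetic spelled out, using the article's own assumptions (20,000 rented 2U servers with 30 TB usable each; the monthly prices are the article's estimates, not vendor quotes):

```python
# Sanity check of the hardware sizing, using the article's assumptions.
servers = 20_000
usable_tb_each = 30              # per 2U box, after RAID5

capacity_pb = servers * usable_tb_each / 1000
print(f"Usable capacity: {capacity_pb:.0f} PB")

monthly_eur = 7.5e6 + 0.5e6      # servers + racks/power/traffic
print(f"Hardware: ~€{monthly_eur / 1e6:.0f}M/month, "
      f"~€{monthly_eur * 12 / 1e6:.0f}M/year")
print(f"Cost per usable PB: ~€{monthly_eur / capacity_pb:,.0f}/month")
```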
Development Costs
- 3 (top-notch) developers: €1.5M per year
- 5 administrators: €1.5M per year
- 8 supporting developers: €2M per year
- Total developer costs: ≈ $5M per year (assuming average fully-loaded developer pay of $500k per year) ≈ €3.74M
Total personnel costs: ≈ €4M per year
Total hardware & personnel costs: ≈ €100M per year (roughly $130M per year)
Software
On the software side, the two main components necessary are:
- A stream (in-memory) database, to alert on specific events or patterns in real time, and to perform aggregations and correlations.
- A MapReduce system (such as Hadoop) to analyze the data further.
Now that we know the cost of finding anything about you, how would it be done?
The data is "streamed" into the stream database from the data connectors (social networks, emails, etc.), aggregated, and saved to HDFS so that a MapReduce system can analyze it offline.
(Bugsense does exactly the same thing with crashes coming from 520M devices around the globe, using fewer than 10 servers running LDB, so we know this is both feasible and cost-efficient. Yup, 10 servers for 520M devices. In real time.)
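The flow just described (stream in, aggregate in memory, flush to the distributed filesystem for offline MapReduce) can be sketched roughly as follows. The class and field names are hypothetical, not a real connector API:

```python
# Minimal sketch of the ingest path described above: events stream in from
# data connectors, get aggregated in memory, and are periodically flushed
# as batches for offline MapReduce analysis. All names are illustrative.
from collections import defaultdict

class StreamAggregator:
    def __init__(self, flush_every=10_000):
        self.counts = defaultdict(int)   # e.g. events seen per user
        self.seen = 0
        self.flush_every = flush_every
        self.flushed_batches = []        # stand-in for writes to HDFS

    def on_event(self, event):
        """Called for every incoming event from a data connector."""
        self.counts[event["user"]] += 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        # A real system would append this batch to HDFS for MapReduce jobs.
        self.flushed_batches.append(dict(self.counts))
        self.counts.clear()

agg = StreamAggregator(flush_every=3)
for user in ["alice", "bob", "alice"]:
    agg.on_event({"user": user})
print(agg.flushed_batches)   # [{'alice': 2, 'bob': 1}]
```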
Next, we'd run a new search query on the ~325 PB monthly dataset. How long would that take?
We could use Hive to run a more SQL-like query on our dataset, but this might take a lot of time, because data "jobs" need to be mapped, the data needs to be read and processed, and results need to be sent back and "reduced"/aggregated on the main machine.
To speed this up, we can write a small program that stores the data in columnar format in a radix tree (as KDB+ and Dremel do), so searching is much faster. How much faster? Probably less than 10 seconds for 400 TB for simple queries. That translates (very naively) to less than 10 seconds to find information about you.
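The sub-10-second claim can be sanity-checked with simple arithmetic. The scan rate and the fraction of bytes a columnar layout has to touch are assumed ballpark figures here, not benchmarks:

```python
# Back-of-the-envelope check of the "<10 seconds over 400 TB" claim,
# assuming the scan is spread over every server and a columnar layout
# means each node reads only the relevant column. Figures are assumptions.
dataset_tb = 400
servers = 20_000
column_fraction = 0.05      # columnar: touch ~5% of the bytes per row
scan_rate_gb_s = 0.5        # sustained sequential read per node

per_server_gb = dataset_tb * 1000 / servers      # 20 GB per node
seconds = per_server_gb * column_fraction / scan_rate_gb_s
print(f"~{seconds:.0f} s to scan the relevant column in parallel")
```

Even with pessimistic assumptions (whole rows read, slower disks), the parallel scan stays within tens of seconds, which is the point of spreading the data over so many nodes.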
Do you think that PRISM can be built using a different tech stack?
Reader Comments (15)
Dual Intel 8-core, 16 GB of RAM, 30 TB of usable space with hardware-level redundancy (RAID5) for $375/month. Please show me where I can get prices like that!!! An hs1.8xlarge instance on AWS costs 4 times more and has similar specs. Also, RAID5 is not acceptable storage for any serious load. I think your calculations are an order of magnitude lower than they should be.
I like your free, infinite bandwidth.
If you guys have any questions, we are more than happy to answer them! Ping me at @jonromero!
[/home/snowden]
# cat prismdata.txt | grep "$targetname"
---
that could work..
I think the development costs are way underestimated. I have experience with projects that were probably much smaller and less complex, and there were dozens of developers working just on the integration (ESB and that sort of stuff), plus multiple DBAs and database developers, and it wasn't even as volume-intensive as the potential PRISM (hundreds of GB a month). And I'm not even talking about UI, reporting, BI and so on. Now add that you are doing all of this in an extremely constrained environment due to security and other reasons; recruiting is going to be extremely slow and difficult (they apparently failed in this part).
Could you please advise on where the avg developer makes 500k/y?
Nice work on the calculations. I am sure the hardware, software and development costs will be much higher than the given assumptions.
Are you kidding with those data estimates????!!!!
The technology was built with spying in mind from the ground up.
Do a Google search for "amdocs phone records"; this has been going on for years.
Do a lil research, and as the smart people I know you to be, you will see what's going on.. =) Have a nice day.
Nice article.
Who cares if the costs are off or not? The premise was whether it was possible, and given the information, it's certainly possible. As for the cost: let's not forget that these people printed more than a trillion dollars of debt (as they "borrow" from the FED to print the money) per year. So even if the figure were $1 billion, it's still well within the capacity of the US government.
They had the means, the opportunity and certainly the motive. And there is a witness. The only thing left is the smoking gun.
The problem is that the suspect is investigating himself and has decided that the witness is the guilty party. You have more chance of a meteor hitting the White House than of ever proving what really happened (or not).
Hey Curious,
That's not too far off the cost of a dedicated server - the following is £329 per month
2 x 8 core Intel 2.66
128GB RAM
36TB storage (12 x 3TB SATA 3)
There are some really cheap options and you can always talk providers into extras. Also you can talk them into discounts/custom builds - if you're spending hundreds per month they look after you. If you are spending millions per month they will do anything for you :)
$500k per year for a developer seems a little high!
If everything else is similarly up in price, then it becomes something any country (or large company) could afford.
The big problem is still pulling the real information out.
For example, email/text:
'The main course is at 8pm, make sure you cover the sweet trolley.'
In open text, does this mean the evening meal starts at 8, or does it confirm the time for the attack to start ('make sure you cover our exit')?
Meaning is everything; you still need the context of the information. You still can't beat a real agent on the ground!
Nice article. Couple of other points to consider.
This doesn't take encrypted communications into consideration. I imagine there would be a similarly sized cluster (maybe a couple of supercomputers) for breaking encryption.
The prices are quite low as well. This program was top secret; there is automatically an inflated cost for everything needed to store and process data on top-secret networks.
I'd imagine that the NSA would not be comfortable renting components and would opt to perform the majority of the work in-house (or at least through contractors).
Backup/retention has not been considered.
Dev/test environments have not been considered.
Hey guys, thanks for the replies!
First of all, these are not AWS servers. And yes, the sane thing to do is to build a datacenter, probably in Utah. I wanted to demonstrate how someone who doesn't want to build a datacenter could do it. Trust me, you can get a great discount if you start renting all these machines, as many people mentioned.
About the salaries: for a great (and also "trusted") engineer, $500k is nothing once you consider insurance, taxes and the other costs a company needs to pay. At the end of the day, maybe around $250k will end up in his pocket (people on Hacker News suggest even less).
Thanks again for reading this and commenting! Feel free to "bug" me at @jonromero!
The NSA released Accumulo. Might be related :)