There has been a lot of speculation and assumption around whether PRISM exists and whether it is cost-effective. I don't know whether it exists or not, but I can tell you whether it could be built.
Short answer: It can.
If you believe that the sheer tsunami of data would make it impossible for someone with access to a social "datapool" to find out more about you (if they really wanted to track you down), you need to think again.
Devices, apps and websites are transmitting data. Lots of data. The questions are whether that data could be compiled and searched, and how costly it would be to search for data about a specific target. (Hint: it is not $4.56 trillion.)
Let's experiment and try to build PRISM ourselves, making a few assumptions along the way.
With those assumptions, how much would it cost us to have PRISM up and running and to find information about a person in less than a minute?
Let’s begin with what data is generated every month that might contain information about you.
Facebook: 500 TB/day * 30 = 15 PB/month (source)
Twitter: 8 TB/day * 30 = 240 TB/month (source)
Email/other info: Google said it processed 24 PB per day back in 2008; five years later, let's assume this is 8 times bigger = 192 PB/day. If real user information is roughly 1/30 of what gets processed, that's ~6.4 PB/day, or ~193 PB/month (source)
Mobile traffic / machine-to-machine exchanges / vehicles etc.: ~4 PB/day, or ~117 PB/month (source)
Total data: ~325 PB/month
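As a sanity check, here is the arithmetic behind that total in a few lines of Python (the per-source figures are just the estimates from the list above):

```python
# Quick sanity check of the monthly volumes above (all figures in PB).
facebook = 500 * 30 / 1000   # 500 TB/day * 30 days = 15 PB/month
twitter = 8 * 30 / 1000      # 0.24 PB/month
email = 193                  # ~6.4 PB/day from the Google estimate above
mobile = 117                 # ~4 PB/day of mobile / M2M traffic

total_pb = facebook + twitter + email + mobile
print(f"~{total_pb:.0f} PB of new data per month")   # ~325 PB/month
```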
The prices below correspond to renting off-the-shelf servers from commercial high-end datacenters (assuming the data is stored in a distributed filesystem architecture such as HDFS). This is a worst-case scenario that does not include potential discounts for renting such a high volume of hardware and traffic, or for buying the hardware outright (which means a higher initial investment but lower recurring costs). The hardware configuration used for the cost calculations is a 2U chassis with dual Intel hexa-core processors, 16 GB of RAM, and 30 TB of usable space combined with hardware-level redundancy (RAID-5).
We'll need about 20K servers, housed in 320 46U racks. The server hardware works out to about €7.5M/month (including servers for auxiliary services). Racks, electricity and traffic come to about €0.5M/month (including auxiliary devices and networking equipment).
Total hardware cost per year for ~3.9 EB of data storage: €168M
Total personnel costs: €4M
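For what it's worth, here is one way the ~20K-server figure could fall out of these numbers. The 2x HDFS replication factor is my assumption, not something stated above (RAID-5 already covers disk-level failures, so full 3x replication may be unnecessary):

```python
# One plausible route to the ~20K-server estimate, under assumed inputs.
monthly_ingest_tb = 325_000        # ~325 PB/month, from the totals above
usable_tb_per_server = 30          # per 2U server, after RAID-5
hdfs_replication = 2               # assumed; RAID-5 handles disk failures

servers = monthly_ingest_tb / usable_tb_per_server * hdfs_replication
print(f"~{servers:,.0f} servers")  # ~21,667 -> on the order of 20K
```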
On the software side, the two main components necessary are:
A stream database that ingests and aggregates the incoming data in real time
A MapReduce system running on top of HDFS for offline analysis and querying
Now that we know the cost of finding anything about you, how would it be done?
The data is "streamed" into the stream database from the data connectors (social networks, email services etc.), aggregated, and saved to HDFS so that a MapReduce system can analyze it offline.
(Bugsense is doing exactly the same thing with crashes coming from 520M devices around the globe, with fewer than 10 servers, using LDB, so we know this is both feasible and cost-efficient. Yup, 10 servers for 520M devices. In real time.)
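To make the pipeline concrete, here is a toy sketch of that ingest path. The event shape, flush threshold and the local files standing in for HDFS are all illustrative assumptions, not how LDB or any real connector works:

```python
# Toy sketch of the ingest path: events arrive from "connectors", get
# aggregated per user in a stream layer, and are periodically flushed to
# batch files (standing in for HDFS) for offline MapReduce jobs.
import json
from collections import defaultdict

FLUSH_EVERY = 10_000  # events per batch; a real system flushes by size/time

class StreamAggregator:
    def __init__(self):
        self.buffer = defaultdict(list)  # user_id -> list of events
        self.count = 0
        self.batch_no = 0

    def ingest(self, event):
        # event = {"user": ..., "source": "facebook"/"twitter"/..., "payload": ...}
        self.buffer[event["user"]].append(event)
        self.count += 1
        if self.count >= FLUSH_EVERY:
            self.flush()

    def flush(self):
        # In production this would be an HDFS write; a local file stands in.
        path = f"batch-{self.batch_no:06d}.json"
        with open(path, "w") as f:
            json.dump(self.buffer, f)
        self.buffer.clear()
        self.count = 0
        self.batch_no += 1
```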
Next, we'd run a new search query on the ~325 PB dataset. How long would that take?
We could use Hive to run a more SQL-like query on our dataset, but this might take a lot of time, because data "jobs" need to be mapped, the data needs to be read and processed, and the results need to be sent back and "reduced"/aggregated on the main machine.
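For illustration, this is the kind of Hive query we might issue. The table and column names are invented for this sketch; Hive's `-e` flag runs a query string directly from the command line:

```python
# Submit a hypothetical HiveQL query over the HDFS dataset. Hive compiles
# it into MapReduce jobs, which is where the map/read/reduce latency
# described above comes from.
import subprocess

query = """
SELECT source, ts, payload
FROM   user_events                 -- hypothetical table over the HDFS data
WHERE  user_id = 'target@example.com'
ORDER  BY ts DESC
LIMIT  100;
"""

subprocess.run(["hive", "-e", query], check=True)
```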
To speed this up, we can write a small program that saves the data in columnar format in a radix tree (as KDB and Dremel do), so searching is much faster. How much faster? Probably less than 10 seconds over 400 TB for simple queries. That translates (very naively) to less than 10 seconds to find information about you.
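Here is a minimal sketch of that idea, using a simple prefix tree over one column as a stand-in for the radix-tree indexes those systems use. Point lookups walk the tree instead of scanning every row:

```python
# Columnar storage plus a prefix-tree index over one column, so lookups
# avoid scanning rows. A simplified stand-in for real radix-tree indexes.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.row_ids = []    # rows whose value ends exactly here

class ColumnIndex:
    def __init__(self, column):
        self.root = TrieNode()
        for row_id, value in enumerate(column):
            node = self.root
            for ch in value:
                node = node.children.setdefault(ch, TrieNode())
            node.row_ids.append(row_id)

    def lookup(self, value):
        node = self.root
        for ch in value:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.row_ids

# Columnar layout: each column is its own array; rows share an index.
user_col = ["alice", "bob", "alice", "carol"]
event_col = ["login", "post", "message", "login"]

idx = ColumnIndex(user_col)
print([event_col[i] for i in idx.lookup("alice")])  # ['login', 'message']
```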
Do you think that PRISM can be built using a different tech stack?