Stuff The Internet Says On Scalability For October 4th, 2021
Monday, October 4, 2021 at 9:11AM
Hey, HighScalability is here again!
The circulatory system of the internet. @tylermorganwall
Love this Stuff? I need your support on Patreon to keep this stuff going.
Sorry for the long gap in posting, but I’ve been building a new app. I’m looking for testers for my new iOS fitness app: Max reHIT Workout. It guides you through proven reduced-exertion high-intensity interval workouts. If that interests you, please give it a try through TestFlight. I’d appreciate any feedback and suggestions for improvement. Thanks!
Number Stuff:
- 233 billion: transactions processed by Amazon Aurora on Prime Day. 1,595 terabytes of data stored and 615 terabytes of data transferred. Due to increased efficiency 6,000 fewer physical servers were used compared to 2020 at a 20% lower cost. CloudFront: peak load of over 290 million HTTP requests per minute, for a total of over 600 billion HTTP requests. SQS: 47.7 million messages per second at the peak. EBS: 1.1 trillion requests per day and transferred 614 petabytes per day. DynamoDB: 89.2 million requests per second.
- Megabits/second: current fastest quantum networks.
- 319 Tb/s: new world record transmission over 3,001 km with 4-core fiber.
- 1 Billion: TikTok MAU. Facebook has ~ 2.89 billion monthly active users.
- 66: zero-days have been found in use this year, highest number ever, almost double the total for 2020, exploits carry price tags north of $1 million.
- 7500x: reduction in the cost of satellite bandwidth from $300,000,000/Gbps to $40,000/Gbps. ARK's research suggests that it could fall another 40-fold to ~$1,000/Gbps with Starlink+Starship.
- 34%: better price/performance with AWS Lambda Functions powered by AWS Graviton2 processor
- 70%: of App Store revenue is generated by less than 10% of all App Store consumers. Over 98% of Apple’s in-app purchase revenue came from games.
- @jamesurquhar: All of this excites the hell out of me. We are inexorably marching towards a world where the trust boundary between business logic and the required tools and infrastructure to run it is simply a service API. Lambda and its ilk are almost there.
- 25%: of internet traffic is generated by bad bots.
- 200: world-wide distributed network of time balls for time signaling clock towers.
- 5.1M: IOPS Per-Core With AMD Zen 3 + Intel Optane.
- 59.5%: increase in mean mobile download speeds over the last year (55.07 Mbps). Mean fixed broadband increased 31.9% (107.50 Mbps).
- 4.398 trillion: number of parameters to answer the question of life, the universe and everything.
- 95%: of Cockroach Labs Kubernetes Adoption Trends Report respondents handle their cloud infrastructure themselves. 94% run k8s in production. 88% embracing serverless.
- $40 billion: spent in Apple’s App Store in the first six months of 2021. 22.05% increase.
- $1.2 million: average ransomware policy claim in 2021. Up from $450,000.
- 17.2 million request-per-second: DDoS attack thwarted by Cloudflare. 3x larger than the previous largest attack. They serve 25 million HTTP requests per second on average.
- $3B: committed Pinterest spend on AWS through 2029. ~$400M/year. ~$5-10 billion at retail pricing. (@QuinnyPig). ~$0.88 per year per user (jgalt212)
- $30 billion: payout from YouTube to creators in the past three years from ads, merchandising and other service features.
- 62.8tn: a new record for calculating the number of digits in pi. It took 108 days and nine hours. ~2x as fast as Google’s previous record. Though I still like cake better.
- 13.89 Mbps: Starlink’s median upload speed. 45 ms median latency. But like for real estate, what matters is location, location, location.
- 181,464: Backblaze drives. HGST had the lowest annual failure rate at 0.44%.
- $44 billion: revenue for both discrete and embedded emerging memory technologies by 2031. The bulk of the market is expected to be in SoCs and 3D XPoint memory.
- 60+ million: new EC2 instances spun up every single day
Quotable Stuff:
- Simon Wardley: There's a gameplay known as ILC, innovate, leverage, commoditize and it's a simple way of basically you take something which is a product, turn it into a utility, allow everybody else to build on top of it, mine the metadata to spot future patterns which you commoditize to new component services and that way, you just move up the stack. Same gameplay the Chinese government has been doing since the 1970s and they're about to dominate the world and everything else. Amazon does this pretty much in the commercial world.
- @GergelyOrosz: When you ask "Why did Company build 7 of the same products that all failed?", it always starts with the current solution struggling. This is an opportunity. Not to fix - which doesn't get you promoted - but to start anew. An all too real story about Promotion Driven Development: Opportunity…Proposal…Funding… Hiring…Building…Launch…Promotion season…Iterating…More funding…More promotions…Growth slowing…Founding team members leaving…More team members leaving…A struggling team…The speculation…The realization…[Opportunity]
- arcbyte: As a Dev, I loved SAFE. It changed everything about how our program operated to the point our delivery became extremely predictable and we still had every 9th and 10th week to tinker on new ideas or do refactoring and cleanup where we wanted. That program purred like nothing else I've ever been a part of. Senior leaders sat with devs, started talking to everyone about their priorities - people we had never seen before we started doing PI events. Antagonism dropped. We went from not being trusted to deploy during an annual heavy use period to being totally trusted to deploy and given additional contracts for maintenance.
- Microsoft: However, the cost seems to be a loss of sense of purpose, which at work, is largely driven through strong and cohesive relationships and seeing how your tasks have impact on others
- Matt Rickard: "Cloud prem" (cloud + on-premise) is a deployment pattern that's becoming more and more common with companies. Vendors deploy software to their client's cloud provider under an isolated account or a separate VPC
- UWE FRIEDRICHSEN: For 10 systems involved in a transaction, you need to calculate 99.5% ^ 10 which results in ~95% success probability for your transaction. This means if you need to update 10 systems together in an ACID fashion, 1 out of 20 transactions will fail on average over time.
- Alex Hudson: I’ll still continue to use Azure, but the proposition is becoming weaker. Are Microsoft attempting to add too many products to keep up with AWS, or releasing things before they’re ready? I don’t know. Azure often has a feeling of being held together with a lot more sticky tape behind the scenes than I would want to know about.
- @doctorow: Spotify uses its industry dominance to extract heavy fees from the labels - creaming 30% of the total revenue generated by a typical track. Big Three monopolists with fat margins can absorb this. Indies? Not so much.
- Zababa: In a way, services and microservices are OO programming in the Alan Kay sense: late binding, message passing, local retention and hiding of state.
- omegalulw: The biggest downsides of (modularized) monoliths lie in 1) scaling - e.g., you can independently scale and load balance each service, 2) releases - you can have independent releases (at the risk of maintaining version compatibility).
- @ftrain: My spouse asked me to fix the Roomba because it kept turning on all day, and I had to admit that I was starting it up from the app at work because I like when it vacuums twice, and I couldn't figure out why it kept pausing.
- Chris Watson: The most important thing is what is on the plate.
- @shreyas: Since time immemorial, when a CEO asks a PM at Product Review, “what do you need to 10X users/revenue?”, “what will make you go faster?”, etc., the PM steadfastly responds “We need [N] more engineers.” The Eng Mgr nods approvingly. A story thread, with some hard truths to swallow:
- Sourcegraph: Putting it all together, the new memory profile looks like this: a 5x reduction overall, which means we can serve search queries for five times more repositories without requiring any more servers. We went from 1400KB of RAM per repo to 310KB with no measurable latency changes.
- @vgill: 100% of the self-caused outages could be predicted by five (5) metrics: 1. Thread-pool usage 2. Memory + heap usage/GC times 3. CPU load 4. Networking failures/connection count 5. Slow DB queries
- @silvexis: We run 100% on demand for exactly that reason. The payoff to reach perfection is just not worth it and auto-scaling adds operational complexity that I will gladly pay to remove. I’m talking about calculating that break even point for steady (or near steady) state workloads
- Gualdrapo: Art is about oneself. Design is about others.
- Corey Quinn: Here's an example: If cross-AZ data transfer were free instead of costing $20,000 per petabyte transferred, what would happen? Customers would have more architectures that span multiple availability zones, meaning that there would be more storage and compute provisioned. Traffic over the (presumably) dedicated fiber between AZs would increase — but seeing as AWS almost certainly owns everything between the two AZs, it would manifest as a potential one-time upgrade investment. Then, of course, AWS would take a hit to its next quarterly earnings as that change manifested across its customers. Yes, I do believe that charge to be significant enough to materially change quarterly earnings.
- calmlynarczyk: If you want to talk systemic AWS mistakes you can make, we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours. You can accidentally create this issue across lots of different AWS services if you don't verify you haven't created any loops between resources and don't configure scaling limitations where available. "Infinite" scaling is great until you do it when you didn't mean to.
- @johncutlefish: So: ...performance is not a linear thing...strengths have corresponding challenges...today is the sum of countless days...a company is not a monolithic "culture" ...paradoxes abound ...puzzles abound ...messes abound
- Lorin Hochstein: This is why “root cause of failure” doesn’t make sense in the context of complex systems failure, because a collection of control processes keep the system up and running. A system failure is a failure of this overall set of processes. It’s just not meaningful to single out a problem with one of these processes after an incident, because that process is just one of many, and it failing alone couldn’t have brought down the system.
- Jesse Duffield: That is, if two pieces of code change for completely different reasons, you should not only separate them but also minimise the dependencies in the code between them. Conversely, if two pieces of code change for the exact same reasons, you should not only move them close together, but also represent their interdependence in the domain with interdependence in the code, whether through sharing some common interface, calling each other, or factoring out common code.
- Alexander Amini~ We have a ton of bias in the training data we collect [for autonomous vehicles]. Probably 90% of the data we collect for autonomous vehicles is taken under very sunny road conditions, very clean camera conditions. When we deploy autonomous vehicles we want it to perform well on the other 10% of the data, the data that is under nighttime, bad visibility, maybe it’s raining, maybe there are some very tight turns and curves. All this data is underrepresented.
- @ben11kehoe: I'm not sure how folks come to believe things like this. Sure, reduced cost is the sole driver for some small amount of serverless adoption, but in most cases it's a combination of cost and the reduction of operational burden, and even increased cost is sometimes acceptable.
- @aortenzi: In the end, someone sets up all of the networking and servers and provisioning infrastructure and scaling and you don't have to have any idea what the hell spanning tree is. You don't need a networking person, a server tech, a DBA, an LDAP infrastructure. Focus on value, not cost
- Bay Area Belletrist: One callout: there’s always room to be inspired by those around you. The line between inspiration and envy is a thin one. Make sure you’re on the right side of it.
- reureu: I worked with a behavioral economist, and we started running RCTs looking at different approaches to sharing data, and found that dashboards led to less engagement, when there was engagement it was more likely to drive ineffective interventions, and generally our dashboard groups had worse patient outcomes. Non-dashboard approaches had 20x better engagement and anywhere from 20-100% better patient outcomes (depending on the condition).
- Dr. Ian Cutress: In the Q&A following the session, Dr. Christian Jacobi (Chief Architect of Z) said that the system is designed to keep track of data on a cache miss, uses broadcasts, and memory state bits are tracked for broadcasts to external chips. These go across the whole system, and when data arrives it makes sure it can be used and confirms that all other copies are invalidated before working on the data.
- Tom Spring: Super-secure air-gapped computers are vulnerable to a new type of attack that can turn a PC’s memory module into a modified Wi-Fi radio, which can then transmit sensitive data at 100 bits-per-second wirelessly to nearly six feet away.
- Simon Wardley: the future is worrying about things like capital flow through your applications, where money is actually being spent and what functions, monitoring that capital flow, tying it to the actual value you're creating. You look at companies like iRobot or Liberty Mutual and what they've done with serverless. iRobot provide those Roombas, there's 10s of millions out there and the entire thing is run by six people.
- Matthew Hutson: The entirety of perceptual experience is a neuronal fantasy that remains yoked to the world through a continuous making and remaking of perceptual best guesses, of controlled hallucinations. You could even say that we’re all hallucinating all the time. It’s just that when we agree about our hallucinations, that’s what we call reality.
- @swardley: i.e. you should be working on the software supply chain in your serverless architecture, you should be tying capital flow to user value, you should ... lots of things. You should not be getting dragged into container orchestration or building clusters or stuff that's low level.
- @houlihan_rick: I have actually never used TransactWrite API because the isolation is low and it is possible to read uncommitted data. For Amazon services we used Streams->Lambda for guaranteed update processing and transaction Items for multi-phase commits when needed.
- Fredrik Holmqvist: So, if I had to pick one language/runtime to program server systems in, it would be Erlang[6]. It enables me to solve otherwise hard problems, where the majority of my time is spent on the problem rather than the scaffolding, as it’s already there. There’s no non-exhaustive list of hot frameworks to learn, the runtime is strong enough to do most of the lifting up front. Feeling that the language provides just the abstractions that I need out of the box is something I haven’t found anywhere else, and puts a smile on my face whenever I’m working with it.
- NICK THIEME: The researchers found gradient descent to be a natural PLS in the PPAD-complete problem. Columbia University's Tim Roughgarden said, "[The nature of computation] is something that we as a species should try to understand deeply in all of its many forms. And I think that should be reason enough to be excited about this result."
- @StewartMorganv1: Quietly reduced infrastructure costs by 28% just by moving web apps from ECS to Cloudfront. Moving APIs from containers to Lambda/APIGateway will reduce it by a further 44%
- gagejustins: Broadly we've seen this pattern with infrastructure in general – it's a lot easier to set up a server than it used to be, all things considered. Now obviously if you're a tiny startup, you're more comfortable outsourcing everything to Heroku, and if you're a hyperscale enterprise, you probably want more control on exactly what your database is doing. The thesis here is that on the tail end (hyper scale), things are getting more under control and predictable, and developers there want the same "nice things" you get with platforms like Heroku. Elsewhere in the ecosystem, more and more parts of the stack are getting turned into "simple APIs" (Stripe for payments, Twilio for comms, etc.). And perhaps most interestingly, as serverless for compute seems kind of stuck (maybe?), it may be the case that serverless for databases – whatever that ends up meaning – is actually an easier paradigm for application developers to work with.
- Marc Brooker: The bottom line is that high-percentile latency is a bad way to measure efficiency, but a good (leading) indicator of pending overload. If you must use latency to measure efficiency, use mean (avg) latency. Yes, average latency
- Marc Brooker: In the context of distributed systems, caches are a powerful and useful tool. Unfortunately, applied incorrectly, caching can introduce some highly undesirable system behaviors. Applied incorrectly, caches can make your system unstable. Or worse, metastable. To understand why that is, we need to understand a bit about how systems scale…Good caches have feedback loops. Like back pressure, and limited concurrency. Bad caches are typically open-loop.
- cbushko: The only datapoint I can give is that moving from AWS to GCP reduced our bill by 2/3rds. That isn't a 1 to 1 comparison though as we did a re-platforming at the same time. Moving from AWS ECS to GCP Kubernetes was probably the biggest money saver. It reduced the amount of compute needed by a huge amount. Being able to safely use pre-emptive VMs in Kubernetes also leads to huge savings.
- Geoff Huston: There was a time, and it was not so long ago, that the telcos dominated the communications sector. In many national environments they were the largest employer, the largest revenue earner, even the largest credit provider. Their position as a domestic monopoly created a secure revenue base that was immune from competitive pressures and the privileged position was all but unassailable. Deregulation opened up this sector to initial waves of like-for-like competition, but this has been quickly followed by a new wave of substitution competition where the rise of Internet-based services has shifted the nexus of value generation from carriage to content and service. The result was initially the commoditisation of the shared public carriage role, but the process of time and the successive improvements of the digital capacity of physical transmission systems has meant that dedicated transmission systems are now accessible to the large cloud and content providers and shared carriage systems have become a niche activity rather than the mainstream of the carriage sector.
- Christof Koch: We know of no fundamental law or principle operating in this universe that forbids the existence of subjective feelings in artifacts designed or evolved by humans.
- @denis_makogon: The reason why I love and hate #Kubernetes is that it is actually a good platform to host scalable apps. But the road from developing the app to making it scalable is so painful. Here are few-reasons-why-thread:
- @etherealmind: OH: "I’ll buy networking from anyone that has things in stock." Welcome to half way through 2021.
- Ilya Sutskever: This is the centrality of prediction, good enough predictions give you everything you can dream about.
- @tmclaughbos: Spent time this weekend on my very long neglected side project. Decided to switch to Lambda destinations and EventBridge. Still cannot get over how much I like this. Adjusting to the approach but I love not even needing to think about writing code to get my data to the next hop
- Sandy Munro: Robots are blind one armed idiots.
- @zackkanter: Kubernetes is a novel techno-human organizational productivity virus developed by Google to destroy startups.
- tcherasaro: Why aren’t there any open FPGAs or at least more Open FPGA tools? Why are vendor FPGA tools so huge, terrible and hard to use?...FPGA development is microchip development. It requires a great deal of tacit knowledge, experience and a background in electrical engineering and digital design and a number of other skills to do well. In order to develop open tools for these devices most of the devs in the community would have to have a lot of this knowledge and experience and most if not all of the FPGA vendor’s IP on all of their devices in order to work on an open tool chain.
- @levi_mccormick: This week, I condensed five tables into a single table design inspired by @houlihan_rick and @alexbdebrie, with some daily encouragement from watching @bryson3gps succeed with the pattern. It has reduced the time to rebuild and push game state to clients by 60%.
- Niklas Begley: Rust outperforms Elixir by 2 orders of magnitude. It usually parses the Markdown into HTML in a matter of microseconds
- @anothercohen: Congrats on your $2m seed round. Your small team now only has 3 months of runway after paying for: - Zendesk - Carta - AWS - Slack - Figma - Zoom - LaunchDarkly - JIRA - Amplitude - Quickbooks - Twilio - Sendgrid - Snowflake - Gsuite - 1password
- @gunnarmorling: Regular reminder: when it comes to persistent state, think about scaling up before scaling out. Set up an RDBMS on a decently sized machine, have someone make sure the queries you run aren't too bad, and you'll get quite far without the headaches of distributed state.
- 2021 Cloud Report: AWS offered the most cost-efficient machine on the OLTP benchmark, Cockroach Labs Derivative TPC-C, with the c5a.4xlarge machine coming in 12% lower than GCP and 35% lower than Azure’s most cost-efficient machines. AWS has also provided the best network latencies three years in a row. Finally, AWS machines with the custom-built Graviton2 processor performed best in the multi-core CPU benchmark. However, AWS demonstrated the lowest performance on seven of the 12 benchmarks, including all storage I/O benchmarks and the single-core CPU benchmark.
- J. Doyne Farmer~ Telos is an emergent property...if there's a key thing that needs to be done for something to survive and propagate a telos will emerge to do that thing. It plays a central role in nature...if you have a system where things are competing with each other and where they have limited capabilities and they are evolving by changing their programs or genomes or strategies and they can make them a little better and survive a little better because they're being selected by that, the ones that survive better are going to propagate more, because of specialization you're going to get an ecology of competing specialists. They also may be cooperating at times but overall there's a kind of competition for who's going to be there. Understanding how they affect each other is key to understanding the system dynamics and it's key to understanding evolution.
- @kelseyhightower: Go 1.17 is released! - 5% performance improvement - 2% smaller amd64 binaries
- @hadip: The Internet Explorer team was the hardest-working team I’ve ever been on. And I’ve worked at multiple start-ups. It was a sprint, not a marathon. We ate every meal at the office. We often held foosball tournaments at 2 am, just to get the team energy back up to continue working!
- Cory Doctorow: But Uber is nearly broke. A true accounting of the last quarter has Uber losing 38 cents on every dollar it took in. $3.7b of its assets are actually worthless paper from failing overseas ridehail companies.
- Pxtl: I've said it before, I'll say it again: Crypto currency is about tearing down the old rules and institutions of banking and finding out one failure at a time why each rule and institution exists.
- boulos: I'd suggest planning more around operating TiloDB as a managed service. You happen to just need lambda, s3, and dynamo today, but the "capturable" value for many customers will be if you also manage it all for them (especially upgrades). You can still offer the open core and let folks run their own, but it sounds like a lot of the goodness comes from the way you run it. Having said that, licensing is currently fraught in this space. Each major "database" vendor (Elastic, Redis Labs, Confluent) is basically trying to find a way to figure out how to avoid AWS (and other clouds) from just taking their code and operating it as a service.
- @kmmccauley: I had a customer who had intermittent packet drops and their IT could not figure out what was causing it. Remote troubleshooting could not find any causes. They flew me out and I find their WAN router plugged into cheap power strip with the office copier😩.
- @swardley: X : You're not a fan of private cloud or kubernetes? Me : They have their uses. X : Really? Me : Yes. Private cloud and kubernetes matter if you’ve got legacy systems and you need a place to strangle them. But it’s not the future and you don’t build new there.
- @alioth: Today I finished migrating a 40k core, single threaded, monolithic python app into Kubernetes. It only took ~8 months and a third of my sanity.
- zz865: Our big project has moved from physical servers to Openshift. It's taken a lot of work, much more than expected. The best thing is that developers like it on their resume, which is a bigger benefit than you'd think as we've kept some good people on the team. For users I see zero benefit. CI pipeline is just more complicated and probably slower. Cost wise it was cheaper for a while but now RedHat are bumping up licensing costs so now I think is about the same costs. Overall it seems like a waste of time, but has been interesting.
- @wycats: It feels like StackOverflow has developed a fatal flaw with regard to web content. Almost every popular question has a top-voted and approved answer that is out of date. In most cases, the answer is totally wrong and no decent developer would recommend it anymore.
- Geoff Huston: Firstly, while it's remarkably easy to select one CDN provider and use it for the entirety of one’s online content and service portfolio, it can be more challenging to select two or more such CDN providers and use them together in a self-healing mutual backup setup. For many online service enterprises, it's a case of “It’s way easier to pick just one CDN. Choose wisely!” From that point on the enterprise is fate sharing with the CDN provider.
- Geoff Huston: From the point of view of the client, an IP address is merely a conversation token, to be borrowed and used for a conversation with a server and then passed back into a shared pool. This form of sharing a pool of endpoint addresses between clients was implemented by Network Address Translators (NATs) and it is a foundational technology of today's Internet.
- Ably: Ably is a public cloud customer. Our entire production environment exists on AWS and currently nowhere else. We run on EC2 instances. The total number of machines fluctuates with autoscaling throughout the day, but is always at least many thousands, across ten AWS regions. These machines do run Docker, and most of our software is deployed in containers. What we don’t use is any of the well-known runtime orchestration layers. When created, each instance already knows which containers to run based on which autoscaling group it is in.
- David Bollier: But it’s time to face the facts: the commons constitutes a versatile system for organizing reliable flows of productive, creative social energy.
- James S. A. Corey: “Things get lost,” Avasarala said. “There was one time during the finance crisis that we found a whole auditing division that no one remembered. Because that’s how you do it. You take part of a problem and you put it somewhere, get some people working on it, and then you get another part of the problem and get other people working on that. And pretty soon you have seven, eight, a hundred different little boxes with work going on, and no one talking to anyone because it would break security protocol.”
- Andrew Theken: The migration of Postmark to AWS was a huge milestone that allowed us to modernize core services and positioned Postmark for growth over the years to come. As a result of the migration, we can dynamically scale many of our services and data stores, and this has led to even higher reliability and availability of Postmark endpoints and processing, as well as more consistent and predictable performance for customers.
- Netflix: The overall ProRes video processing speed is increased from 50GB/Hour to 300GB/Hour. From a different perspective, the processing time to movie runtime ratio is reduced from 6:1 to about 1:1. This significantly improves the Studio high quality proxy generation efficiency and workflow latency.
- Oxford Protein Informatics Group: This time, however, DeepMind decided to develop an end-to-end model. Instead of using the MSA to predict constraints, they produced a deep learning architecture that takes the MSA as input (plus some template information, but that is another story) and outputs a full structure at the end. Their motivation is simple. Given that the ~170,000 structures available in the PDB constitute a small dataset, they want to make the most of inductive bias — introducing constraints into the model architecture that ensure that the information is assimilated rapidly and efficiently.
- @danielgross: Tried Github Copilot; fascinating. You still need to code, but a new skill of "goading" is required. Your mind starts modeling OpenAI's mind and you're trying to learn how to best express needs to the computer. Maybe engineering is now equal parts coding and teaching ability.
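Uwe Friedrichsen's arithmetic in the list above is easy to check. A minimal sketch, assuming every system fails independently and has the same 99.5% availability (both are simplifications, and the function name is made up for illustration):

```python
def transaction_success_probability(per_system_availability: float, systems: int) -> float:
    """Chance that every system in the transaction is available at once.

    Assumes failures are independent and each system has the same
    availability; real systems often violate both assumptions.
    """
    return per_system_availability ** systems


p = transaction_success_probability(0.995, 10)
print(f"success probability: {p:.3f}")         # 0.951
print(f"failures: about 1 in {1 / (1 - p):.0f}")  # about 1 in 20
```

The result drops quickly as systems are added, which is the point of the quote: the exponent is the number of participants, so each extra system multiplies in another chance to fail.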
Useful Stuff:
- AWS’ egress margin is Cloudflare’s opportunity. Will Cloudflare challenge the gods and win? Or will they end up like Prometheus? Announcing Cloudflare R2 Storage: Rapid and Reliable Object Storage, minus the egress fees:
- Architecture structure always follows cost structure. So what can you build when bandwidth becomes as close to free as it gets? What a nice thought.
- We all want and need a competitor to the cloud oligopoly that has become even more entrenched over the last several years.
- How can R2 work? The network is a fixed rather than variable cost for Cloudflare. So saith the CEO.
- Ben Thompson thinks Cloudflare’s Disruption of AWS is happening, but from a developer perspective that is less clear. If you are starting a new project AWS will still win most checklist bake-offs. Workers, etc. aren't disruptive. Lambda wins on features and capabilities. R2, however, creates a new pricing blackhole for sucking in certain kinds of projects. Data gravity is real, what happens when data anti-gravity becomes real too? They still need RDS, search, event bridge, ML, etc. But that’s what disruption is all about.
- @QuinnyPig: Let's tie this together. I can pay 2.3¢ per GB plus a whopping 9¢ per GB of transfer, *OR* I can pay 2.75¢ per GB to keep it in both places, secure in the knowledge that my egress traffic is a one-time 9¢ charge, the end…One final point: Now let’s remember that the internet is 1-to-many. If 1 million people download that 1GB this month, my cost with @cloudflare R2 this way rounds up to 13¢. With @awscloud S3 it’s $59,247.52. THAT is why people are losing their minds over this.
- eastdakota: Bandwidth (ingress and egress) always free, regardless of volume. Transactions free at low volume (~<1/sec) but we’ll charge at higher volumes. Storage we charge for. For both transactions and storage, we aim to be at least 10% less expensive than S3. And, again, for the sake of absolute clarity: egress/ingress always free.
- sudhirj: To be honest, I think the biggest draw will be for companies (like where I work) that put large objects on S3 and distribute it to hundreds / thousands / millions of customers. The egress direct from S3 is on the order of 8 cents a GB, and with CloudFront in front of it it’s a few cents lower, and you can negotiate pricing a little lower if you’re big enough. But not an order of magnitude. We’d stick R2 in front of an S3 bucket and wipe off the biggest portion of our bill.
- Corey: Do you know that it costs more for me to move data between an availability zone in one region to an availability zone in that same region than it costs me to move that data from Virginia to Ohio or vice versa? It costs twice as much to do that. And that seems like a really weird thing. Their data transfer pricing, their egress pricing specifically, has not materially moved in ages, and it still feels like 1998 pricing. It starts at nine cents a gigabyte to cents something externally to the internet.
- kondro: Durable Objects Storage is $1/million writes at 4KB and $0.20/million reads[2], which is pretty good. But you can also only access it via a Durable Object which limits its usage (especially as neither Durable Objects nor its storage are in a production state with final pricing yet) and means you need to also pay for the Durable Object runtime on top of the Storage pricing whenever you want to access that data…Durable Objects' Store isn't a particularly good alternative to just using S3 or DynamoDB unless you're already using Workers/Durable Objects…I love DynamoDB for a lot of things, but as a store for objects much larger than 1KiB it's not very cost-effective.
- kkielhofner: My contention, as I've commented here before, is that AWS is taking advantage of a new "cloud native" generation of developers, startup founders, C-suite, etc who have never purchased bandwidth/transit, run an AS (autonomous system), etc. As mentioned in the twitter thread Amazon is essentially charging 1998 prices for bandwidth. I wouldn't be surprised at all if many of their services are essentially loss-leaders and Amazon more than makes up for it with their ridiculous markup on bandwidth. This "cloud native" generation can wrap their heads around instances, request rates, storage volume, K8s, containers, functions, etc but the pricing on bandwidth is very nebulous, notoriously difficult to estimate/control, and poorly understood. Without organizational experience and knowledge of the real bandwidth costs in 2021 the assumption is likely "well, that's just what bandwidth costs".
- ItsTobias: I manage the hosting infrastructure for a fairly high traffic site. (Millions of users, terabytes of bandwidth per day kind of scale). We use the Cloudflare business plan we pay in the region of hundreds of dollars per month for our cloudflare service. Cloudflare have been trying to get us on to their enterprise plan for the last 12 months or so. Every quote they have provided us so far though is considerably more expensive than comparative services on AWS. We are most likely pushing the boundary fairly hard on what is acceptable usage for their business plan, but their enterprise plans offer virtually no benefit over their business plans but are 10 to 20 times the total cost.
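The egress arithmetic in the R2 discussion above can be sketched roughly as follows. This is a minimal sketch, assuming a flat per-GB egress rate with no volume tiers, so it will not reproduce the $59,247.52 figure in the quote. The function names and the flat-rate model are illustrative assumptions; the per-GB prices are the ones mentioned in the @QuinnyPig quote.

```python
# Flat-rate sketch of the egress math; real cloud pricing is tiered.
# Function names are hypothetical; prices are taken from the quote above.

def direct_cost(size_gb, downloads, storage_per_gb, egress_per_gb):
    """Serve every download straight from the origin bucket:
    pay storage once, plus egress on every single download."""
    return size_gb * storage_per_gb + size_gb * downloads * egress_per_gb


def fronted_cost(size_gb, both_storage_per_gb, origin_egress_per_gb):
    """Keep a copy in a zero-egress store in front of the origin:
    pay storage in both places, plus a one-time egress from the origin."""
    return size_gb * both_storage_per_gb + size_gb * origin_egress_per_gb


if __name__ == "__main__":
    # 1 GB object, 1 million downloads in a month.
    direct = direct_cost(1, 1_000_000, 0.023, 0.09)
    fronted = fronted_cost(1, 0.0275, 0.09)
    print(f"direct:  ${direct:,.2f}")
    print(f"fronted: ${fronted:,.2f}")
```

The point of the sketch is the shape of the two formulas: the direct cost grows linearly with the number of downloads, while the fronted cost does not depend on the download count at all.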
- EuroBSDcon videos are now available. You might like Serving Netflix Video at 400Gbps on FreeBSD by Drew Gallatin. Here are the slides.
- drewg123: But note that these are flash servers; they serve the most popular content we have. We also have "storage" servers with huge numbers of spinning drives that serve the longer tail. They are constrained by spinning rust speeds, and can't serve this fast.
- drewg123: It's important to consider that we've poured man years into this workload on FreeBSD. Just off the top of my head, we've worked on in house, and/or contributed to or funded, or encouraged vendors to pursue: async sendfile (so sendfile does not block, and you don't need thread pools or AIO); RACK and BBR TCP in FreeBSD (for good QoE); kTLS (so you can keep using sendfile with TLS, saves ~60% CPU over reading data into userspace and encrypting there); NUMA; kTLS offload (to save memory bandwidth by moving crypto to the NIC). Not to mention tons of VM system and scheduler improvements which have been motivated by our workload. FreeBSD itself has improved tremendously over the last few releases in terms of scalability.
- stingraycharles: Basically, everything is about “pinning” the hardware on a low level. You pin the network adapters to be handled by a specific set of cores. You pin the filesystem to handle specific sets of files on specific cores. You then ensure the router in the rack distributes the HTTP requests to the network ports exactly in the way that they always arrive on the network adapters that have those files pinned. It’s not much different from partitioning / sharding and “smart” load balancing a cluster of servers, it’s just on a lower level of abstraction.
- beefstake: Large servers with multiple CPUs behave more like a very fast local area network than they do one machine when you start needing to go really fast. Netflix is showing an architecture here that tries to use each processor and its attached resources as single units and avoid sending large amounts of data between CPUs in the same server. End result is much better performance because the internal network between the CPUs is removed as a bottleneck.
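The "sharding, but inside one server" idea in the comments above can be illustrated with a toy sketch. This is only the partitioning idea expressed in user-space Python, not how Netflix or FreeBSD implement it; the function names and the NIC labels are hypothetical, and using a hash to assign files to NUMA nodes is an assumption made for illustration.

```python
import hashlib

# Toy model: each NUMA node owns a slice of the files and has its own NIC.
# A request is routed to the NIC attached to the node that owns the file,
# so the data never has to cross the interconnect between CPUs.

def numa_node_for(file_id, num_nodes):
    """Map a file to a NUMA node with a stable hash (same file, same node)."""
    digest = hashlib.sha256(file_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def route(file_id, nics_by_node):
    """Pick the NIC attached to the node that owns the file."""
    return nics_by_node[numa_node_for(file_id, len(nics_by_node))]


if __name__ == "__main__":
    nics = ["nic0", "nic1", "nic2", "nic3"]  # hypothetical, one per node
    for name in ["title-a.mp4", "title-b.mp4", "title-c.mp4"]:
        print(name, "->", route(name, nics))
```

The only property that matters here is stability: the same file always lands on the same node, which is what lets the rack router and the server agree on where a request should arrive.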
- How to Create a Very Inexpensive Serverless Database. File systems as databases are as old as either, and it works right up until you want to search, query, and simultaneously write/read from the same file. But it is a good overview of your various cloud storage options.
- TacoBell is serverless-first. Unfortunately, AWS doesn’t have a $5 box deal. #56: Serverless at TacoBell.
- 7,000 restaurants, millions of pricing points, menus, tax rules, store hours.
- 99% reduction in costs by adopting serverless compared to having EC2 servers running 24 hours a day.
- After a pilot project they liked serverless, so decided to go serverless-first. On their new order middleware platform they decided against using any of that last gen tech like EC2 and containers.
- Lots of Step Function love. Step Functions are used to process orders, apply rules, and EventBridge to drive async flows like the delivery process. Easy to onboard new integrations and Step Functions make it easy to know what’s going on. Step Functions make it easy to add retries, error handlers, circuit breaker patterns. It was easy to mock downstream and upstream APIs to load test the system.
- The negative about Step Functions is the cost.
- Serverless removes a lot of the work that doesn’t produce business value. Can immediately start working on business value. Gives the team that feeling they can build anything. You think in terms of composing services.
- Developers have fun using serverless. They want to do more.
- Develop a Community of Practice. Have a few evangelists for serverless. Have example repos that show how to do things in the new environment. Have a Slack channel where the team can exchange knowledge. Have experts in important topics like DynamoDB and AppSync. Biweekly meeting to demo new code and practices.
- It’s a paradigm shift, so it’s hard to get everybody on board across the organization.
- Not really writing that much complex code. It’s less about writing huge algorithms and reinventing the wheel and more about understanding the services and the best way to use them.
- It’s easier to pick a cloud and use it completely rather than have to worry about being cloud agnostic. Took the full step into using AWS, serverless, DynamoDB, EventBridge.
- In praise of not copying, but copying. Reverse Engineering Greatness.
- Copying or over relying on someone else's proven formula is a losing strategy. It's likely they have some characteristics that enabled them to succeed with their strategy that you don’t have. Audience expectations change over time so you can't just copy.
- Completely new often fails because we don’t trust the new. Optimal newness is a minor dose of novelty. Don't overwhelm with originality. Combine what they've experienced before with something slightly new.
- Copying makes you more creative. It reveals to you what masters do.
- Take risks, but minimize the impact if they fail.
- Veritas: Architecting a Global File System with AWS Storage and InfoScale. In the UK you have to be able to failover to on-prem, to the cloud, and back again.
- Your mental model of a computer is so last century. In the age of SoCs everything is different. USENIX ATC '21/OSDI '21 Joint Keynote Address-It's Time for Operating Systems to Rediscover Hardware.
- [Linux] is not an operating system that has been designed, this is an operating system that has been congealed. Modern SoCs contain many many processors and Linux as an OS only runs on a small subset of those processors. The rest of the OS is what is “congealed.”
- The argument is we actually need to create an OS that runs the entirety of the SoC. Linux is not solving these problems. It’s not doing security, boot loading, power management, etc. It’s a security nightmare because Linux thinks it’s the only thing running on the machine. It’s not.
- Those other processors have their own operating systems that violate what Linux thinks and so those processors are capable of compromising the system. This is what’s called a Cross SoC attack.
- We need a way to manage the complex, asymmetric, heterogeneous, non-uniform multiprocessors that are modern computers. More and more of these functions are being moved into hardware and are being hidden from the OS. People don’t know what modern hardware looks like, modern hardware has changed, so all problems are poorly solved with Linux. We have Linux blinders. There’s OS denial and ignorance. We need to fix this problem.
- The Universe is Hostile to Computers. Lesson: Don't build with radioactive materials. 1 alpha particle is all that’s needed to flip a bit in your DRAM.
- Algorithms are everywhere. How Carburetors are Made (Basically Magic). The brain of a carburetor is an algorithm etched in metal.
- People just love building databases. Building PlanetScale with PlanetScale.
- Our mission at PlanetScale is to build the best database for developers
- leerob: moved my personal site from using Firebase/Redis to just MySQL with PlanetScale[2] and the results have been solid. Averaging 128ms on function response times with a p95 of 250ms.
- sorenbs: PlanetScale is the serverless relational database we have been waiting for all this time. It's truly excellent, and being based on the same tech that runs youtube, we know it will scale well.
- kmavm: The assumption that tenants are perfectly isolated is actually the original sin of early Slack infrastructure that we adopted Vitess to migrate away from. From some earlier features in the Enterprise product (which joins lots of "little Slacks" into a corporate-wide entity) to more post-modern features like Slack Connect or Network Shared Channels, the idea that each tenant is fully isolated was increasingly false.
- Also, Tilo: A Novel Approach to Entity-Resolution using Serverless Technology
- Do you have a whiskey room? DRUNK: Can Alcohol Make You More Creative, Sociable, and Attractive?. There’s a connection between just enough alcohol and creativity. The idea is that there's a sweet spot of inebriation where you regain the cognitive flexibility that children have. For example, as a team when you can’t solve a problem instead of sitting in front of your computer and trying to grind through it, go have a drink and talk it through. Often with an informal exchange with your cognitive control gently down regulated, a solution can be found.
- Is there any place for monoliths in 2021? One of the cons listed under monolith is "High coupling between components," which is simply not true. Components inside a monolith invoke each other via interfaces. Properly done, those components can be compiled together in one executable or distributed across a cluster. A monolith is not just a big library. It has as much structure as you put into it.
- Zillow: Near Real-Time Natural Language Processing (NLP) for Customer Interactions. Zillow said they saved 75% on compute costs by not using k8s or individual services.
- Consider that something being hard is an anti-pattern. A signal that you should step back and try a different approach. EFFORTLESS: Embrace the Easy Option: try the easier path; make failure cheap; make learning-sized mistakes and you'll accelerate far faster; perfectionism is the enemy of the done; large working things start from small working things; you don't have to decide everything right now. The Extreme Programming brigade should adopt this book as their new mascot.
- Consistency, convergence, and confluence are not the same. Don't Get Stuck in the CON Game (V3): Convergence is a property of an object. It means you have a merge algorithm that allows you to smash divergent replicas together and get the identical result. Eventual convergence is the same as plain-old convergence but sounds cooler and may help you be successful when meeting new people in a bar. A dataflow component is confluent if it produces the same set of outputs for all orderings of its inputs. Consistent: The word consistent is not consistent. I pretty much think it means nothing useful unless preceded by strong, sequential, or causal. Eventual consistency means nothing both now AND later. It does, however, confuse the heck out of a lot of people.
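The two definitions above, convergence and confluence, can be made concrete with a small sketch. This is a brute-force illustration under stated assumptions, not anything taken from the paper: set union stands in for the merge algorithm, and the helper names (`converges`, `is_confluent`, `evens`, `first_seen`) are hypothetical. Checking every permutation only shows the property for the given inputs; it is not a proof.

```python
from functools import reduce
from itertools import permutations


def merge(a, b):
    """Set union: a commutative, associative, idempotent merge."""
    return a | b


def converges(replicas):
    """True if merging the replicas in every order gives the identical result."""
    results = {reduce(merge, order) for order in permutations(replicas)}
    return len(results) == 1


def is_confluent(component, inputs):
    """True if the component emits the same set of outputs
    for all orderings of its inputs."""
    outputs = {frozenset(component(list(order))) for order in permutations(inputs)}
    return len(outputs) == 1


def evens(stream):
    """Filter: the set of outputs does not depend on arrival order."""
    return [x for x in stream if x % 2 == 0]


def first_seen(stream):
    """Emits whatever arrived first: depends on arrival order."""
    return stream[:1]


if __name__ == "__main__":
    replicas = [frozenset({1}), frozenset({2}), frozenset({1, 3})]
    print(converges(replicas))
    print(is_confluent(evens, [1, 2, 3, 4]))
    print(is_confluent(first_seen, [1, 2, 3]))
```

Note the difference in what is being tested: convergence is about the state of an object after merging replicas, while confluence is about the outputs of a component under reordered inputs.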
- New platforms here, get your new platforms.
- Zoho Catalyst: a highly scalable serverless platform that lets developers build and deploy world-class solutions without managing servers. Even better, you pay nothing till you deploy the project to production. Get a free, full-featured sandbox and up to 125 million free invocations. It has serverless functions, an RDBMS, file storage, authentication, orchestration, ML, OCR, and Push Notifications.
- Fastly Compute@Edge: allows you to build high scale, globally distributed applications and execute code at the edge — without having to manage the underlying infrastructure. Fastly in their marketing seems to be taking a benefits driven approach with their copy rather than a technology driven approach, so it’s hard to know what it actually does.
- Wasm Cloud: platform for writing portable business logic that can run anywhere from the edge to the cloud, that boasts a secure-by-default, boilerplate-free developer experience with rapid feedback loop. Not all that helpful, and to learn more you have to sign in.
- The good thing about serverless is it’s much easier to build platforms. But I think they really need to connect with developers in a way that makes it clear how they solve their problems. I know this kind of messaging is difficult, but it would make a big difference in adoption.
- A Deep Dive into Airbnb’s Server-Driven UI System. While it’s true a massive amount of brain cycles have been spent on tricking consumers to click on ads, directly after that brain cycle expenditure in size and dubiosity is making one UI to rule them all. Though what Airbnb did here is great, there are only so many ways late binding and a specification language can be combined. Leaky abstractions and the eventual need to make the native platform sing, make fools of us all.
- Shopify Capacity Planning at Scale:
- From an investment perspective, planning for the largest scale scenario means spending a lot of money very quickly to handle sales that might not happen. Alternatively, not deploying enough machines means having too little computing power and putting our merchant storefronts at risk of outages.
- Our Google Cloud resource needs depend on how much traffic our merchants see during BFCM. We worked with our data scientists to forecast traffic levels and set those levels as a bar for our platform to scale to. Additionally, we looked into historical numbers, applied a safety margin, and projected how many buyers would check out or view online stores.
- We created a master resourcing plan for our Google Cloud implementation and estimated how things like CPUs and storage would scale to BFCM traffic levels. Owners for our top 10 or so resource areas were tasked to estimate what they needed for BFCM. These estimates were detailed breakdowns of the machine types, geographic locations, and quantities of resources like CPUs. We also added buffers to our overall estimates to allow flexibility to change our resourcing needs, move machines across projects, or failover traffic to different regions if we needed to. What also helps is that we partition each component into a separate GCP project, which makes it a lot easier to think of quotas per every project.
- We’ve started the practice of regular scale-up testing at Shopify. Investment in our internal load testing tooling over the years is fundamental to our ability to run such large scale, platform-wide load tests.
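The planning steps described above (forecast the traffic, apply a safety margin, then add a buffer on top of the resource estimate) can be sketched as simple arithmetic. Shopify does not publish a formula, so this is a minimal sketch under assumptions: the function names, the per-machine throughput model, and every number in the example are hypothetical. Percentages are passed as integers so the rounding stays exact.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding surprises."""
    return -(-a // b)


def machines_needed(forecast_peak_rps, safety_margin_pct, rps_per_machine, buffer_pct):
    """Estimate a machine count from a traffic forecast.

    1. Scale the forecast by the safety margin to get the target load.
    2. Divide by per-machine capacity, rounding up.
    3. Add a buffer on top for failover or moving capacity around.
    """
    base = ceil_div(forecast_peak_rps * (100 + safety_margin_pct),
                    100 * rps_per_machine)
    return ceil_div(base * (100 + buffer_pct), 100)


if __name__ == "__main__":
    # Hypothetical inputs: 100k requests/s forecast, 25% safety margin,
    # 500 requests/s per machine, 10% buffer.
    print(machines_needed(100_000, 25, 500, 10))
```

In practice each resource area (CPUs, storage, and so on) would get its own estimate of this kind, which is what the per-owner breakdowns in the source describe.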
- The Cost of Cloud, a Trillion Dollar Paradox. My wife and I pay over $20,000 a year on health insurance. That’s with high copays and doesn’t count the cost of specialist care. Once upon a time health-care used to be at the edge. We used to manage our own health care in either the home or the village. Over time as knowledge, technology, and infrastructure developed, health care moved into the health-care cloud. Now there’s a growing sense of awareness about the long-term cost implications of the health-care cloud as it starts to contribute significantly to our total cost of revenue. We’re considering the dramatic step of repatriating the majority of health-care or perhaps adopting a hybrid approach. Those who have done this have reported significant cost savings. Yet it’s a lot of effort. And the health-care cloud is great. Though I do shudder at home chemotherapy treatments. And I don’t have room for an MRI machine. But at our scale, the calculus changes. Our profit margins would increase. Our market-cap would skyrocket, and you know how that’s good, for, well, someone. I think we might even have time to do our normal jobs after all the health-care work gets done. MRI machines require a lot of maintenance. But there’s so much market cap to reclaim from health-care cloud optimization. I hear some of you saying: does it make sense to become your own hospital-cloud and still work at the jobs that actually bring in revenue? Pssh. Anyone can create, run, and maintain their own health-care cloud. It’s not like it’s a specialized skill. These days it’s all on YouTube. When you consider the true cost of the health-care cloud, how could you do otherwise?
Soft Stuff:
- dgraph-io/badger: an embeddable, persistent and fast key-value (KV) database written in pure Go. It is the underlying database for Dgraph, a fast, distributed graph database. It's meant to be a performant alternative to non-Go-based key-value stores like RocksDB.
Video Stuff:
- EMEA 2021 Success Stories - Spike-based neuromorphic computing for the extreme edge: IMEC develops compute architectures for event-based sensors, using neurons with co-located memory and processing.
- Learn about the cloud for free: over 700 Azure Friday video episodes.
Pub Stuff:
- How Fast Do Algorithms Improve?: We find enormous heterogeneity in algorithmic progress, with nearly half of algorithm families experiencing virtually no progress, while 14% experienced improvements orders of magnitude larger than hardware improvement (including Moore’s law). Overall, we find that algorithmic progress for the median algorithm family increased substantially but by less than Moore’s law for moderate-sized problems and by more than Moore’s law for big data problems.
- Programmable DNA-Based Boolean Logic Microfluidic Processing Unit: In the MPU controlled via a personal computer or smartphone application, the molecules with two input DNAs and a logic template DNA were reacted for the basic AND and OR operations. Furthermore, the DNA molecules reacted in a cascading manner for combinational AND and OR operations. Finally, we demonstrated a 2-to-1 multiplexer and the XOR operation with a three-step cascade reaction using the simple DNA-based MPU, which can perform Boolean logic operations (AND, OR, and NOT).
- Spire: A Cooperative, Phase-Symmetric Solution to Distributed Consensus: This paper presents a new consensus algorithm named Spire, characterised by a phase-symmetric, cooperative structure. Processes do not contend for leadership; instead, they collude to iteratively establish a dominant value and may do so concurrently without conflicting. Each successive iteration is structured identically to the previous, employing the same messages and invoking the same behaviour.
- Applications of Deep Neural Networks with Keras: 500+ page textbook for an Applications of Deep Learning course.
- FoundationDB: A Distributed Unbundled Transactional Key Value Store: The paper discusses a distributed key value store that Apple, Snowflake, and VMWare (among others) run core services on at immense scale. (Apple’s CloudKit is built on FoundationDB, in addition to other services, as described in their SIGMOD’21 announcement. Snowflake’s usage of FoundationDB is explained in this great talk.) Unlike other large-scale data stores that forego implementing transactions in order to simplify scaling, FoundationDB was designed with strictly serializable transactions from the ground up. (Strict serializability means that transactions can be given a definite order. Achieving strict serializability is easy on a single node database, but is difficult to scale to an enormous distributed database, which is part of why the paper is so interesting. For background on the topic, I would recommend Peter Bailis’s blog.)
Reader Comments (3)
Cockroach labs report is bogus.
They derive clickbait headlines based on specific restricted configs.
E.g. the report shows AWS around 8.5 Gbps and GCP around 25 Gbps.
Whereas AWS also offers 100G and 400G networking and people have confirmed that 100Gbps is attainable:
https://medium.com/faun/100g-networking-in-aws-a-network-performance-deep-dive-d569c675c6c3
So either CockroachDB engineers are incompetent or their sales is too eager.
CockroachDB must be shabby tech if their sales needs to resort to these kind of tactics.
Why is this website not yet HTTPS?
Because squarespace does not support HTTPS for SS5 websites. And no, cloudflare doesn't work because SS redirects to http so you get a redirect loop.