Entries in Database (63)

Wednesday
Jul 16, 2008

The Mother of All Database Normalization Debates on Coding Horror

Jeff Atwood started a barn burner of a conversation in Maybe Normalizing Isn't Normal on how to create a fast scalable tagging system. Jeff eventually asks that terrible question: which is better -- a normalized database, or a denormalized database? And all hell breaks loose. I know, it's hard to imagine database debates becoming contentious, but it does happen :-) It's lucky developers don't have temporal power or rivers of blood would flow. Here are a few of the pithier points (summarized):

  • Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. (Jeff)
  • Normalize until it hurts, denormalize until it works. (Jeff)
  • Use materialized views, which are tables created and maintained by your RDBMS. A materialized view will act exactly like a de-normalized table would - except you keep your original normalized structure, and any change to the original data will propagate to the view automatically. (Goran)
  • According to Codd and Date table names should be singular, but what did they know. (Pablo, LOL)
  • Denormalization is something that should only be attempted as an optimization when EVERYTHING else has failed. Denormalization brings with it its own set of problems. You have to deal with the increased set of writes to the system (which increases your I/O costs), you have to make changes in multiple places when data changes (which means either taking giant locks - ugh - or accepting that there might be temporary or permanent data integrity issues), and so on. (Dare Obasanjo)
  • What happens, is that people see "Normalisation = Slow", that makes them assume that normalisation isn't needed. "My data retrieval needs to be fast, therefore I am not going to normalise!" (Tubs)
  • You can read fast and store slow or you can store fast and read slow. The biggest performance killer is the so-called physical read. Finding and accessing data on disk is the slowest operation. Unless the child table has a clustered index and you're using that index in the join, you will be making lots of small random-access reads on the disk to find and access the child table data. This will be slow. (Goran)
  • The biggest scalability problems I face are with human processes, not computer processes. (John)
  • Don't forget that the fastest database query is the one that doesn't happen, i.e. caching is your friend. (Chris)
  • Normalization is about design, denormalization is about optimization. (Peter Becker)
  • You're just another knucklehead. (BuggyFunBunny)
  • Let's unroll our loops next. RDBMS is about shared *transactional data*. If you really don't care about keeping the data right all the time, then how you store it doesn't matter. (Christog)
  • Jeff, are you awake? (wiggle)
  • Denormalization may be all well and good, when you need the performance and your system is STABLE enough to support it. Doing this in a business environment is a recipe for disaster, ask anyone who has spent weeks digging through thousands of lines of legacy code, making sure to support the new and absolutely required affiliation_4. Then do the whole thing over again 3 months later when some crazy customer has five affiliations. (Sean)
  • Do you sex a cat, or do you gender it? (Simon)
  • This is why this article is wrong, Jeff. This is why you're an idiot, in case the first statement wasn't clear enough. You just gave an excuse to be lazy to someone who doesn't have a clue. (Marcel)
  • This is precisely why you never optimize until *after* you profile (find objectively where the bottlenecks are). (TED)
  • Another great way to speed things up is to do more processing outside of the database. Instead of doing 6 joins in the database, have a good caching plan and do some simple joining in your application. (superjason)
  • Lastly - no one seems to have mentioned that a decently normalized db greatly reduces application refactoring time. As new requirements come along, you don't have to keep pulling stuff apart and putting it back together in new configurations of the db. (Ed)
  • Keep a de-normalized replica for expensive operations (e.g. reports), Cache Results for repeat queries (Memcache), Partition the database for scalability (vertical or horizontal) (Gareth)
  • Speaking from long experience, if you don't normalize, you will have duplicates. If you don't have data constraints, you will have invalid data. If you don't have database relational integrity, you will have orphan "child" records, etc. Everybody says "we rely on the application to maintain that", and it never, never does. (A. Lloyd Flanagan)
  • I don't think you can make any blanket statements on normal vs. non-normal form. Like everything else in programming, it all depends on requirements and intended goals. (Wayne)
  • De-normalization is for reporting, not for OLTP. (Eric)
  • Your six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need far fewer joins, because the data you need is often already present as foreign keys. (Leandro)
  • What I think is funny is the number of people who think that because they use LINQ or Hibernate they aren't affected by these issues. (Sam)
  • You miss the point of normalization entirely. Normalization is about optimizing large numbers of small CRUD operations, and retrieving small sets of data in order to support those CRUD operations. Think about people modifying their profiles, recording cash register operations, recording bank deposits and withdrawals. Denormalization is about optimizing retrieval of large sets of data. Choosing an efficient database design is about understanding which of those operations is more important. (RevMike)
  • Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data. (Chris)
  • Cache the generated view pages first. Then cache the data. You have to think about your content: very infrequently will anyone be updating it, it's all inserts. So you don't have to worry about normalization too much. (Matt)
  • I wonder if one factor at play here is that it's very easy to write queries for de-normalized data, but it's relatively hard to write a query that scales well to a large data set. (Thomi)
  • Denormalization is the last possible step to take, yet the first to be suggested by fools. (Jeremy)
  • There is a simple alternative to denormalisation here -- to ensure that the values for a particular user_id are physically clustered. (David)
  • The whole issue is pretty simple 99% of the time - normalized databases are write optimized by nature. Since writing is slower than reading and most databases are used for CRUD, normalizing makes sense, unless you are doing *a lot* more reading than writing. Then, if all else fails (indexes, etc.), create de-normalized tables (in addition to the normalized ones). (Rob)
  • Don't fear normalization. Embrace it. (Charles)
  • I read on and discovered all the loons and morons who think they know a lot more than they do about databases. (Paul)
  • Put all the indexable stuff into the users table, including zipcode, email_address, even mobile_phone -- hey, this is the future!; Put the rest of the info into a TEXT variable, like "extra_info", in JSON format. This could be educational history, or anything else that you would never search by; If you have specific applications (think facebook), create separate tables for them and join them onto the user table whenever you need. (Greg)
  • How is the data being used? Rapid inserts like Twitter? New user registration? Heavy reporting? How one stores data vs. how one uses data vs. how one collects data vs. how timely new data must be visible to the world vs. whether OLTP data should be put into an OLAP cube each night, etc. are all factors that matter. (Steve)
  • It might be possible to overdo it, but trust me, I have had 20 times the problems with denormalized data than with normalized. (PRMAN)
  • Your LOGICAL model should *always* be fully normalized. After all, it is the engine that you derive everything else from. Your PHYSICAL model may be denormalized or use system specific tools (materialized views, cache tables, etc) to improve performance, but such things should be done *after* the application level solutions are exhausted (page caching, data caches, etc.) (Jonn)
  • For very large scale applications I have found that application partitioning can go a long way to giving levels of scalability that monolithic systems fail to provide: each partition is specialized for the function it provides and can be optimized heavily, and when you need the parts to co-operate you bind the partitions together in higher level code. (John)
  • People don't care what you put into a database. They care what you can get out of it. They want freedom and flexibility to grow their business beyond "3 affiliations". (PRMan)
  • I normalise, then have distinct (conceptually transient) denormalised cache tables which are hammered by the front-end. I let my model deal with the nitty-gritty of the fix-ups where appropriate (handy beginUpdate/endUpdate-type methods mean the denormalised views don't get rebuilt more than necessary). (Mo)
  • Stop playing with mySQL. (Jonathan)
  • What a horrible, cowboy attitude to DB design. I hope you don't design any real databases. (Dave)
  • IOW, scalability is not a problem, until it is. Strip away the scatological reference, and all you have is a boring truism. (Yawn)
  • Is my Site OLTP? If the answer is yes then Normalize. Is my site OLAP? If the answer is yes then De-Normalize! (WeAreJimbo)
  • This is a dangerous article, or perhaps you just haven't seen the number of horrific "denormalised" databases I have. People use these posts as excuses to build some truly horrific crimes against sanity. (AbGenFac)
  • Be careful not to confuse a denormalised database with a non-normalised database. The former exists because a previously normalised database needed to be 'optimised' in some way. The latter exists because it was 'designed' that way from scratch. The difference is subtle, but important. (Bob)

OK, more than a few quotes. There's certainly no lack of passion on the issue! One thing I would add is to organize your application around an application level service layer rather than allowing applications to access the database directly at a low level. Amazon is a good example of this approach. Many of the denormalization comments have to do with the problems of data inconsistency, which is of course true, because that's why normalization exists. Many of these problems can be reduced if there's a single service access point over which to get data. It's when data access is spread throughout an application that we see serious problems.
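
To make the service layer idea concrete, here is a minimal sketch. The names, schema, and caching policy are all hypothetical, and sqlite3 merely stands in for whatever RDBMS you actually run; the point is the shape. Every read and write goes through one access point, so the storage underneath can be normalized, denormalized, or cached without callers ever noticing.

    import sqlite3

    class UserService:
        """Single data access point; callers never touch tables directly."""
        def __init__(self, db_path=":memory:"):
            self.db = sqlite3.connect(db_path)
            self.db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
            self.db.execute("CREATE TABLE IF NOT EXISTS affiliations (user_id INTEGER, name TEXT)")
            self._cache = {}  # repeat reads are served from memory, not the database

        def add_user(self, user_id, name, affiliations=()):
            with self.db:  # one transaction keeps the normalized tables consistent
                self.db.execute("INSERT INTO users VALUES (?, ?)", (user_id, name))
                self.db.executemany("INSERT INTO affiliations VALUES (?, ?)",
                                    [(user_id, a) for a in affiliations])
            self._cache.pop(user_id, None)  # invalidate rather than patch the cached copy

        def get_user(self, user_id):
            if user_id not in self._cache:
                name = self.db.execute(
                    "SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()[0]
                affs = [row[0] for row in self.db.execute(
                    "SELECT name FROM affiliations WHERE user_id = ?", (user_id,))]
                self._cache[user_id] = {"id": user_id, "name": name, "affiliations": affs}
            return self._cache[user_id]

    svc = UserService()
    svc.add_user(1, "Alice", ["ACM", "IEEE"])
    print(svc.get_user(1))  # later storage changes stay hidden behind get_user

Whether the tables behind the service stay normalized or grow denormalized copies then becomes an implementation detail of UserService, which is the point most of the thread above circles around.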

    Related Articles

  • Denormalization Patterns by Kenneth Downs
  • When Databases Lie: Consistency vs. Availability in Distributed Systems by Dare Obasanjo
  • Stored procedure reporting & scalability by Jason Young
  • When Not to Normalize your SQL Database by Dare Obasanjo


    Tuesday
    May 27, 2008

    How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale

    Update 3: ReadWriteWeb says Google App Engine Announces New Pricing Plans, APIs, Open Access. Pricing is specified but I'm not sure what to make of it yet. An image manipulation library is added (thus the need to pay for more CPU :-) and memcached support has been added. Memcached will help resolve the can't-write-for-every-read problem that pops up when keeping counters. Update 2: onGWT.com threw a GAE load party and a lot of people came. The results are at Load test: Google App Engine = 1, Community = 0. GAE handled a peak of 35 requests/second and a sustained 10 requests/second. Some think performance was good, others not so good. My GMT watch broke and I was late to arrive. Maybe next time. Also added a few new design rules from the post. Update: Added a few new rules gleaned from the GAE Meetup: Design By Explicit Cost Model and Puts are Precious.

    How do you structure your database using a distributed hash table like BigTable? The answer isn't what you might expect. If you were thinking of translating relational models directly to BigTable then think again. The best way to implement joins with BigTable is: don't. You--pause for dramatic effect--duplicate data instead of normalizing it. *shudder*

    Flickr anticipated this design in their architecture when they chose to duplicate comments in both the commenter and the commentee user shards rather than create a separate comment relation. I don't know how that decision was made, but it must have gone against every fiber in their relational bones... But Flickr's reasoning was genius. To scale you need to partition. User data must spread across the shards. So where do comments belong in a scalable architecture? From one world view comments logically belong to a relation binding comments and users together. But if your unit of scalability is the user shard there is no separate relation space. So you go against all your training and decide to duplicate the comments. Nerd heroism at its best. Let inductive rules derived from observation guide you rather than deductions from arbitrarily chosen first principles. Very Enlightenment era thinking. Voltaire would be proud.

    In a relational world duplication is removed in order to prevent update anomalies. Error prevention is the driving force in relational modeling. Normalization is a kind of ethical system for data. What happens, for example, if a comment changes? Both copies of the comment must be updated. That leads to errors because who can remember where all the data is stored? A severe ethical violation may happen. Go directly to relational jail :-)

    BigTable data ethics are more Mardi Gras than dinner with the in-laws. Data just wants to have fun. BigTable won't stop you from hurting yourself. And to get the best results you may have to engage in some conventionally risky behaviors. But if those are the glass bead necklaces you have to give for a peek at scalability, why not take a walk on the wild side?

    For a more modern post-relational discussion of data ethics I'm using as my primary source a thread of conversations from JA Robson, Ben the Indefatigable, Michael Brunton-Spall, and especially Brett Morgan. According to our new Voltaire, Locke, Bacon, and Newton, here's what it takes to act ethically in a BigTable world:
  • Don't bother with BigTable unless your goal is to create a web site that scales to millions of users. The techniques for building scalable read-mostly web applications are difficult and require a radical mindset change. Standard relational techniques work very well until you scale to huge numbers of users. It is at that point you need to break the rules and do something counter-intuitively different. More of the same will not work. If you don't plan to get to that point it may not be worth the effort to change. BigTable is targeted at building web applications. Its nature makes it a poor match for OLAP, data warehousing, data mining, and other applications performing complex data manipulations.
  • Assume slower random data access rather than fast sequential access. Every get of an entity could be from a different disk block on a different machine in a cluster. Calculating, for example, the average over a column in SQL can be efficient because data is stored together on disk. In BigTable data can be anywhere so iterating over every value in a column is expensive. Each read is potentially a random block from anywhere, which means the average retrieval time can be relatively high. The implication is that to use BigTable you must adopt some unfamiliar and unintuitive strategies in order to deal with such a different performance profile. Using a relational database we are used to writing applications against fast, highly performant databases. With BigTable you have to become familiar with the rules for developing against a slower but more scalable database. Neither approach is better for all purposes, but BigTable has the edge for high scalability.
  • Group data for concurrent reads. Given the high cost of reading data from BigTable your application will not scale if every page requires a large number of reads. The solution: denormalize. Store data in the same entity based on what data needs to be read concurrently. Relational modeling groups data together based on the “minimize problems” rule. BigTable’s new rule is “maximize concurrent reads” which implies denormalization. Store entities so they can be read in one access rather than performing a join requiring multiple reads. Instead of storing attributes in separate entities in order to remove duplication, duplicate the attributes and store them where they need to be used. Following this rule minimizes the number of reads required to return an entity.
  • Disk and CPU are cheap so stop worrying about them and scale. A criticism of denormalization is that storing duplicate data wastes disk space. Google's architecture trades disk space for better performance. Disk is (relatively) cheap, so don't fight it. On the CPU front a data center's worth of CPU is at your service. As long as you structure your application in the way GAE forces you to, your application can scale as large as it needs to simply by running on more machines. All scalability bottlenecks have been removed.
  • Structure data around how it will be used. Trade SQL sets for application based entities. Queries are slow, so the closer data is to the format in which it will be used, the faster pages will render. It's like the database model becomes the model previously used at the caching layer. Complete entities tend to be cached, not low level detail rows. That's what BigTable models should look like because that's how concurrent reads are maximized. This isn't the same as an object oriented database because the behavior is provided by applications; behavior is not bound to the entity, so multiple applications can read the same entities yet implement very different behaviors.
  • Compute attributes at write time. Since looping over large columns of data is inefficient with BigTable the idea is to calculate values at write time instead of read time. For example, instead of calculating an average by reading an entire column at read time, track the total number and the total value at write time so the average can be calculated with one read on page display. Programmer effort is made up front at write time to minimize the work needed at read time. Preventing applications from iterating over huge data is key for making applications scale. Given the limitations of GAE transactions and quotas, GAE may not be appropriate for business applications that need exact summary statistics. Warning: if the summary stat is written on every read request then this approach will not scale as writes don't scale.
  • Create large entities with optional fields. Normalization creates lots of small entities. Instead, create larger entities with optional parts so you can do one read and then determine what's present at run time. This shifts work from the database to the CPU while minimizing the number of joins.
  • Define schemas in models. Denormalization requires user developed code to properly keep data consistent across multiple entities. The database won’t do it for you anymore. Schemas are really defined in code because it’s only code that can track all the relationships and maintain correctness. All database access must go through the models or otherwise the much feared inconsistency problems will result.
  • Hide updates using Ajax. Updates are slow so big bang updates of many entities will appear slow to users. Instead, use Ajax to update the database in little increments. As a user enters form data, update the database so the update cost is amortized over many calls rather than one big call at the end. The result is a good user experience and a more scalable app.
  • Puts are Precious. Updating entities in large batches, say even 200 at a time, isn't part of the BigTable model. Entity attributes are automatically and synchronously indexed on writes. Indexing is an expensive operation that accumulates a lot of CPU time, so the number of updates that can be performed in one query is quite limited. The workaround is to perform updates in smaller batches driven by an external CPU. Even when GAE provides the ability to run batches within GAE, the programming model for writes needs to be accounted for in a design.
  • Design By Explicit Cost Model. If you are going to be charged for an operation GAE wants you to explicitly ask for it. This is why some automatic navigation between objects isn't provided because that will force an explicit query to be written. Writing an explicit query is a sort of EULA for being charged. Click OK in the form of a query and you've indicated that you are prepared to pay for a database operation.
  • Place a many-to-many relation in the entity with the fewest number of elements. One way to create a many-to-many relationship is to have a list property that contains keys to the other related entities. A Company entity, for example, could contain a list of keys to Contact entities, or a Contact entity could contain a list of keys to Company entities. Since it's likely a Contact is associated with fewer Companies, the list should be contained in the Contact. The reasoning is that maintaining large lists is relatively inefficient, so you want to minimize the number of items in a list as much as possible (a minimal sketch appears after the schema example below).
  • Avoid unbounded queries. Large queries don't scale. Consider showing only the most recent 10 or so values from an attribute.
  • Avoid contention on datastore entities. If every request to your app reads or writes a particular entity, latency will increase as your traffic goes up because reads and writes on a given entity are sequential. One example construct you should avoid at all costs is the global counter, i.e. an entity that keeps track of a count and is updated or read on every request.
  • Avoid large entity groups. Any two entities that share a common ancestor belong to the same entity group. All writes to an entity group are sequential, so large entity groups can bog down popular apps quickly if there are a lot of writes to that group. Instead, use small, localized groups in your design.
  • Shard counters. Increment one of N counters and sum those N counters on the read side. This avoids the dreaded write bottleneck. See Efficient Global Counters by App Engine Fan for more details; a minimal sketch appears after the schema example below. An excellent example showing some of these principles in action can be found in this GQL thread. Take this nicely normalized schema:
    Customer: 
     - Name 
     - Country 
    Product: 
    - Code 
    - Name 
    - Description 
    Purchases: 
    - Reference to Product Entity 
    - Reference to Customer Entity 
    - Date of order 
    
    Anyone from a relational background would look at this schema and give it a big thumbs up. With a little effort we can imagine the original physical purchase order that has now been normalized into three different tables. To recreate the original purchase order a join on purchases, product, and customer is needed. Read speed is not optimized, safety is optimized. Here's what the same schema looks like optimized for reading:
    Purchase: 
    - Customer Name 
    - Customer Country 
    - Product Code 
    - Product Name 
    - Purchase Order Number 
    - Date Of Order
    
    The three original tables have been folded into one entity. Now a purchase order can be read in one get operation. No join necessary. Notice how the entity looks more like an original purchase order. It is also what would probably be cached and is what our model would probably look like. But what if you want to update a product name or a customer name? Those attributes are duplicated in all entities. Here’s where the protection offered by the relational model comes in. Only one entity needs updating in a normalized model. In BigTable you have to remember everywhere a customer name and product name and change every instance to new values. It’s not a simple, safe, or reliable approach. But it does optimize for read speed and scalability. For an application with a high proportion of updates to reads this approach wouldn’t make sense. But on the web reads usually dominate. How often do you really change a customer name or a product name? Seldom. How often do you read them? All the time. Designing to scale for reads and taking the pain on writes takes some getting used to. It’s a massive change to standard relational tactics. But this is what it takes to scale web applications, even if it feels a little strange at first.
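
    The "compute attributes at write time" and "shard counters" rules above combine naturally. Below is a minimal sketch written against the original Python runtime's google.appengine.ext.db API (the model name, shard count, and key naming are illustrative, not prescriptive): each order updates one of N shards inside a transaction, and an average is produced by one small read-side query instead of a scan over every order.

        import random
        from google.appengine.ext import db

        NUM_SHARDS = 20

        class OrderStatsShard(db.Model):
            """One of N shards; together they hold a running order count and total."""
            order_count = db.IntegerProperty(default=0)
            order_total = db.FloatProperty(default=0.0)

        def record_order(amount):
            """Write path: bump one randomly chosen shard so writes don't contend."""
            index = random.randint(0, NUM_SHARDS - 1)
            def txn():
                shard = OrderStatsShard.get_by_key_name("shard-%d" % index)
                if shard is None:
                    shard = OrderStatsShard(key_name="shard-%d" % index)
                shard.order_count += 1
                shard.order_total += amount
                shard.put()
            db.run_in_transaction(txn)

        def average_order_value():
            """Read path: sum a handful of shards instead of scanning every order."""
            count = total = 0
            for shard in OrderStatsShard.all():
                count += shard.order_count
                total += shard.order_total
            return total / count if count else 0.0

    If the average is shown on every page view, cache the result in memcached so the read path doesn't even touch the datastore.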
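
    And here is a similarly hedged sketch of the many-to-many rule, again against the original db API: keep the list of keys on the side with fewer elements (the Company/Contact names simply echo the example in the rule above).

        from google.appengine.ext import db

        class Company(db.Model):
            name = db.StringProperty()

        class Contact(db.Model):
            name = db.StringProperty()
            company_keys = db.ListProperty(db.Key)  # the smaller side carries the list

        def companies_for(contact):
            """Forward direction: one batch get, no join table and no join query."""
            return db.get(contact.company_keys)

        def contacts_for(company):
            """Reverse direction: list properties can be filtered by membership."""
            return Contact.all().filter("company_keys =", company.key()).fetch(100)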

    Related Articles

  • ER-Modeling with Google App Engine (updated)
  • Tips on writing scalable apps


    Monday
    Apr 07, 2008

    Rumors of Signs and Portents Concerning Freeish Google Cloud

    Update 2: Rumor no more. Google Jumps Head First Into Web Services With Google App Engine. The quick and dirty of it: developers simply upload their Python code to Google, launch the application, and can monitor usage and other metrics via a multi-platform desktop application. There were 10,000 developer slots open and of course I was too late. More as the cobra strikes. Update: TechCrunch reports Google To Launch BigTable As Web Service next week. It competes with Amazon's SimpleDB, though it won't be truly comparable until they also release an EC2 and S3 equivalent. An internet hit for each data access is a little painful. As Jimmy says in Goodfellas, "That's the way. You don't take no sh*t from nobody."

    First Dave Winer hallucinates a pig on the mean streets of Walnut Creek that told him Google's long foretold cloud offering will be free for bloggers of "modest needs." GigaOM then says a free cloud service is how Google could eat Amazon's bacon for lunch. The reason for this free cloud buffet is said to be the easier integration of acquisitions, who must presumably be in the Google cloud to be taken out. All the free stuff Google offers earns almost no money. They make money on search. Hosting every last CPU cycle on earth has to be costly. What's the return? Cheaper integration of new startups that will also provide no new revenue? Perhaps I am simply not clever enough to see the revolutionary brilliance in this line of thought. Though I would be quite pleased to have Google shareholders subsidize my projects.

    Folknologist thinks Google may keep costs down by requiring developers to code to a Cloud Virtual Machine based on Java byte codes... Applications would be built using G-ROR, a javascript style RoR framework. Revenue generation would come from an upsell of more memory and CPU. But aren't VMs already the perfect encapsulation from the cloud provider perspective? They just load 'em and run 'em. Seems cost effective enough. For the developer VMs also allow all required flexibility. You don't need to be locked into one environment. You can pick from a large number of operating systems and an even wider variety of frameworks. Why lock in? If the model is to treat the cloud like one giant Tomcat application server so you can squeeze more users on the same amount of hardware, then Google would just be the world's largest shared hosting company. Not a cloud at all. And multi-tenant execution of applications in the same application server was always a really bad idea given how one badly programmed app can bring down the whole bunch. Not to mention security concerns. VMs offer better control, manageability, and security.

    I could see an Adoption Led market angle for Google. You could start small in a shared container and then as you grow move your app without change to a larger, more powerful, unshared container. We certainly do need a better way to create, deploy, and manage applications across VMs and data centers, but I don't quite see how this allows Google to make money offering an expensive service any better than the current VM approach. Though with all their cash maybe they plan to just wait it out until all the others bash themselves apart on the rocky shores of free. Just in case this is an April Fools' joke, I already know I am an idiot, so no harm done.


    Sunday
    Mar 30, 2008

    Scaling Out MySQL

    This post covers two main options for scaling out MySQL and compares them. The first is based on database clustering and the second is based on in-memory clustering, a.k.a. a Data Grid. Special emphasis is given to a pattern that shows how to scale an existing database without changing it, through a combination of a Data Grid and the database as a background service. This pattern is referred to as Persistency as a Service (PaaS). The post also addresses many of the frequently asked questions about how performance, reliability, and scalability are achieved with this pattern.
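
    As a rough illustration of the Persistency as a Service idea, here is a minimal write-behind sketch. A plain dict stands in for the distributed in-memory data grid and sqlite3 stands in for the existing MySQL database; the real pattern uses a clustered grid with batched, reliable persistence, but the flow is the same: the application reads and writes the grid, and the database is updated in the background.

        import queue
        import sqlite3
        import threading

        class WriteBehindStore:
            """Reads and writes hit memory; the database is persisted to asynchronously."""
            def __init__(self, db_path=":memory:"):
                self.cache = {}                 # stand-in for the in-memory data grid
                self.pending = queue.Queue()    # updates waiting to be persisted
                self.db = sqlite3.connect(db_path, check_same_thread=False)
                self.db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
                threading.Thread(target=self._flush_loop, daemon=True).start()

            def put(self, key, value):
                self.cache[key] = value         # synchronous write to the grid
                self.pending.put((key, value))  # database write happens in the background

            def get(self, key):
                return self.cache.get(key)      # reads never touch the database

            def _flush_loop(self):
                while True:                     # background persistence; would be batched in practice
                    key, value = self.pending.get()
                    self.db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
                    self.db.commit()

        store = WriteBehindStore()
        store.put("user:42", "Alice")
        print(store.get("user:42"))             # served from memory; persisted shortly after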


    Tuesday
    Mar 18, 2008

    Database War Stories #3: Flickr

    [Tim O'Reilly] Continuing my series of queries about how "Web 2.0" companies used databases, I asked Cal Henderson of Flickr to tell me "how the folksonomy model intersects with the traditional database. How do you manage a tag cloud?"


    Monday
    Mar 17, 2008

    Microsoft's New Database Cloud Ready to Rumble with Amazon

    Update: Zdnet says Ozzie signals Microsoft’s surrender to the cloud. CD ROMs are to the internet as the internet is to the cloud and Microsoft aims to scratch and claw its way into this paradigm shift as well. The gloves are off. The tag line for Microsoft's new SQL Server Data Service is Your Data, Any Place, Any Time. Thems fighten' words. Microsoft is itch'n for a fight! Who will be Amazon's second? The service description: SQL Server Data Services (SSDS) are highly scalable, on-demand data storage and query processing utility services. Built on robust SQL Server database and Windows Server technologies, these services provide high availability, security and support standards-based web interfaces for easy programming and quick provisioning. Sounds like a fast uppercut aimed squarely at SimpleDB's jaw. As a developer what do you need to know?

  • Highly available and highly scalable.
  • Targeted at applications that can tolerate high internet latencies. Primarily read-mostly data sets "storing data that is naturally partitionable into disjoint data sets requiring little or no cross-correlations."
  • Pricing isn't yet available.
  • No announced SLA, though you will be notified of downtime.
  • Access is via REST or SOAP.
  • Data is redundantly stored and backed up.
  • Geo-redundant data copies are maintained. Data is stored in large storage clusters in various Microsoft data centers located across North America. Plan to expand to more areas later.
  • The data model is flexible. No schemas required. "Simply add new attributes to your data set when needed, and the system will automatically store, index, and query your data accordingly."
  • String, numeric, datetime, and boolean data types are supported. All attributes are indexed.
  • The data model is: Customer → SSDS account (1..N) → Authority (1..N) → Container (0..N) → Entity (0..N).
    - Authorities give billing entities a way to organize their usage for accounting, security and co-location purposes. All containers under a single authority are provisioned within the same data center. As such, authorities are the unit of geo-scale and geo-location. For example: Seattle or San Francisco.
    - Containers create contexts and scope for entity storage and query. For example, within its authorities, operations could choose to assign each member their own container, intended to contain a set of personal data for that member. Containers are the unit of consistency in the Microsoft SSDS service. For example: Autos for Sale, Services Offered.
    - Entities are the fundamental unit of storage in the system. Entities are a bag of scalars with no enforced type. For example, an individual member's jobs, educational institutions, contacts, recommendations, etc. could all be modeled as entities.
  • Data manipulation operations include:
    - Creation and deletion of containers. There are no updatable container properties.
    - Creation, replacement, and deletion of entities.
    - Retrieval of a single container in a serialized format.
    - Retrieval of a single entity in a serialized format.
  • A text-based query language is supported that follows the LINQ pattern for C#. Only complete entities are returned. For example, a query addressed to a container to retrieve all entities in that container having a "City" property equal to "Seattle" and a "State" property equal to "WA" would be written as follows:
    from e in entities where e["City"] == "Seattle" && e["State"] == "WA" select e
  • Resource queries are supported. This URI returns all the entities in that container (a minimal sketch of such a request appears after this list): http://mydomain.ssds.microsoft.com/ChildrensBooksContainer1
  • Customers will be able to associate entities with blobs which could be accessed as URL addressable resources.
  • SSDS is not just a hosted version of SQL Server. It's SQL Server deconstructed, running on blade servers with SATA drives.
  • A local on-premise version of SSDS will also be available to "allow users to better synchronize between the enterprise and the cloud, especially when handling reporting, analytics and business-intelligence tasks."
  • Not sure if you can do a full text search on a string property.
  • Not sure how large attributes can be. Should I store my 25MB documents as a string field so they can be searched? Or is that too expensive? Or is it even supported?

    Where Microsoft is behind Amazon is that they don't have EC2 and S3. Microsoft is having you traverse the internet for each database access. Now let's say you have chunks of large storage stored off in a storage cloud. That's a lot of overhead. A major benefit of AWS is the free bandwidth and higher performance within the AWS cloud. If your application is completely hosted within the cloud the only slow part is from the server to the user's browser. All in all SSDS seems very comparable to and competitive with SimpleDB. Hard to say without solid pricing information. But it will be interesting to see how Microsoft's cloud strategy evolves and how we can all benefit from the competition.
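
    For a feel of the REST side, here is a minimal sketch that fetches every entity in the container from the resource-query bullet above. It assumes only what the bullets state: the container is URI addressable and returns its entities in a serialized format. Authentication and the exact wire format aren't described in the announcement, so they are left out and the raw response is simply printed.

        import urllib.request

        # The container URI from the resource-query bullet above (illustrative hostname).
        CONTAINER_URI = "http://mydomain.ssds.microsoft.com/ChildrensBooksContainer1"

        with urllib.request.urlopen(CONTAINER_URI) as response:
            print(response.read().decode("utf-8"))  # serialized entities, format per the SSDS docs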

    Related Articles

  • Home Page for SSDS
  • The Current Pros and Cons List for SimpleDB
  • Microsoft SQL Server Data Services
  • SQL Server Data Services FAQ
  • Microsoft launches its alternative to Amazon’s SimpleDB


    Monday
    Feb 18, 2008

    limit on the number of databases open

    I have a few doubts; here are the questions: 1) Is there any limit on the number of databases that can be accessed simultaneously (MySQL)? 2) Will it be a problem to scale in the future if there are a large number of small databases (2-5 MB each)?


    Tuesday
    Jan 29, 2008

    Too many databases

    Hi, I am using Drupal for my clients' websites, and I was wondering whether it is possible to host all of them (about 500) on the same server (maybe a VPS or a dedicated box). Here is the situation: each client's website has a database with about 50 tables, and all the databases are small, about 2-5 MB each. The websites are low-traffic, with around 50 hits/day on average, which works out to about 2000 queries/db/day (an average of 40 queries per hit). I wanted to know whether it is possible to have so many databases, about 500, on the same server, and what things I should look into to make this happen.


    Sunday
    Dec 02, 2007

    Database-Clustering: a8cjdbc - update: version 1.3

    The new version of a8cjdbc removes some limitations: Clobs and Blobs are now supported, and there are fixes for handling binary data. The version was also fully tested with Postgres and MySQL. Since version 1.3 there is also a free trial version available for download. Check it out and test it yourself... Take a look at: http://www.activ8.at/homepage/en/a8cjdbc.php I've downloaded the latest version and set up an environment with one virtual database and two database backends. I tried to make a "non real life scenario": the first backend was a Postgres node, the second was a MySQL node. Everything works fine - failover, recovery log, etc. - with two different backend database types. So check out the trial version, test the clustered driver yourself, and let me know your results and experience with a8cjdbc. As I only tested MySQL and Postgres (and the non-real-life scenario with two different backend types), maybe someone else has experience with other databases? Greetings, Wolfgang

