posted Jul 20, 2011, 5:03 PM by Kuwon Kang
MONDAY, DECEMBER 6, 2010 AT 9:34AM 
It's a truism that we should choose the right tool for the job. Everyone says that. And who can disagree? The problem is that this is not helpful advice unless you can answer more specific questions: What jobs are the tools good at? Will they work on jobs like mine? Is it worth the risk to try something new when all my people know something else and we have a deadline to meet? How can I make all the tools work together?

In the NoSQL space this kind of real-world data is still a bit vague. When asked, vendors tend to give very general answers, like NoSQL is good for BigData or key-value access. What does that mean for the developer in the trenches, faced with the task of solving a specific problem, a dozen confusing choices, and no obvious winner? Not a lot. It's often hard to take that next step and imagine how their specific problems could be solved in a way that's worth the trouble and risk.

Let's change that. What problems are you using NoSQL to solve? Which product are you using? How is it helping you? Yes, this is part of the research for my webinar on December 14th, but I'm a huge believer that people learn best by example, so if we can come up with real, specific examples I think that will really help people visualize how they can make the best use of all these new product choices in their own systems.

Here's a list of use cases I came up with after some trolling of the interwebs. The sources are so varied I can't attribute every one; I'll put a list at the end of the post. Please feel free to add your own. I separated out the use cases for a few specific products simply because I had a lot of use cases for them and they were clearer on their own. This is not meant as an endorsement of any sort. Here's a master list of all the NoSQL products. If you would like to provide a specific set of use cases for a product I'd be more than happy to add that in.

General Use Cases

These are the general kinds of reasons people throw around for using NoSQL. Probably nothing all that surprising here.

- Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space.
- Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month. Twitter, for example, has the problem of storing 7 TB of data per day, with the prospect of this requirement doubling multiple times per year. This is the "data is too big to fit on one node" problem. At 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest (a back-of-envelope sketch follows this list). For faster writes, in-memory systems can be used.
- Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mindset. When latency is important it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access; some are more about reliability, for example. But what people have wanted for a long time was a better memcached, and many NoSQL systems offer that.
- Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly datatypes like JSON.
- Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
- Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.
- Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.
- No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.
- Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
- Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc.; programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue: if it costs a lot to scale a product, won't you go with the cheaper product that you control, that's easier to use, and that's easier to scale?
- Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and a solution.
- Avoid hitting the wall. Many projects hit some type of wall. They've exhausted all options to make their system scale or perform properly and are wondering: what next? It's comforting to select a product and an approach that can jump over the wall by scaling linearly with incrementally added resources. At one time this wasn't possible; it took custom-built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.
- Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focused on scale, tend to exploit partitions and tend not to use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.
- Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency, which means they can't tolerate a partition failure. In the end this is a business decision and should be decided on a case-by-case basis. Does your app even care about consistency? Is it OK to drop a few writes? Does your app need strong or weak consistency? Is availability more important, or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice (a quorum sketch also follows this list).
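As a sanity check on the massive-write bullet, here is a back-of-envelope sketch in Python. The 7 TB/day figure is from the post; the 80 MB/s per-node write rate and the 3x peak-to-average headroom are illustrative assumptions, not benchmarks.

```python
# Back-of-envelope numbers for the massive-write problem.

TB = 10**12
MB = 10**6
SECONDS_PER_DAY = 24 * 60 * 60

daily_volume = 7 * TB                      # bytes per day (from the post)
node_rate = 80 * MB                        # bytes/second per node (assumed)

hours_on_one_node = daily_volume / node_rate / 3600
print(f"one node needs {hours_on_one_node:.1f} hours per day of data")
# -> ~24.3 hours: a single node can never catch up, so writes must be
#    spread over a cluster.

avg_rate = daily_volume / SECONDS_PER_DAY  # ~81 MB/s average inbound
headroom = 3                               # assumed peak/average ratio
nodes = int(-(-avg_rate * headroom // node_rate))  # ceiling division
print(f"with {headroom}x peak headroom: at least {nodes} write nodes")
```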
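And to make the CAP-slider bullet concrete: in Dynamo-style stores the slider is usually expressed as per-request replica counts. A minimal sketch of the quorum-overlap rule, with N=3 as an assumed example setup:

```python
# Dynamo-style consistency slider: with N replicas per key, reading R of
# them and writing W of them gives strong consistency only when the read
# and write quorums are forced to overlap, i.e. R + W > N.

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum."""
    return r + w > n

N = 3
for r, w in [(1, 1), (2, 2), (1, 3), (3, 1)]:
    mode = "strong" if is_strongly_consistent(N, r, w) else "eventual"
    print(f"N={N} R={r} W={w} -> {mode}")
# N=3 R=1 W=1 -> eventual  (fast and available, may read stale data)
# N=3 R=2 W=2 -> strong    (balanced quorums)
# N=3 R=1 W=3 -> strong    (cheap reads, expensive writes)
# N=3 R=3 W=1 -> strong    (cheap writes, expensive reads)
```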
More Specific Use Cases

- Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
- Syncing online and offline data. This is a niche CouchDB has targeted.
- Fast response times under all loads.
- Avoiding heavy joins when the query load for complex joins becomes too large for an RDBMS.
- Soft real-time systems where low latency is critical. Games are one example.
- Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads/50% writes, 95% writes, or 95% reads: read-only applications needing extreme speed and resiliency, with simple queries, that can tolerate slightly stale data; applications requiring moderate performance, read/write access, simple queries, and completely authoritative data; read-only applications with complex query requirements.
- Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
- Real-time inserts, updates, and queries.
- Hierarchical data like threaded discussions and parts explosion.
- Dynamic table creation.
- Two tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
- Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
- Slicing off part of a service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
- Caching. A high performance caching tier for web sites and other applications. An example is a cache for the Data Aggregation System used by the Large Hadron Collider.
- Voting.
- Real-time page view counters.
- User registration, profile, and session data.
- Document, catalog management and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
- Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.
- Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
- Working with heterogeneous types of data, for example, different media types at a generic level.
- Embedded systems. They don’t want the overhead of SQL and servers, so they use something simpler for storage.
- A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.
- JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3.
- Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.
- Fraud detection by comparing transactions to known patterns in real-time.
- Helping diagnose the typology of tumors by integrating the history of every patient.
- In-memory database for high update situations, like a web site that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 seconds, then you will pretty much be at your limit with about 5,000 simultaneous users (see the arithmetic after this list).
- Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.
- Priority queues.
- Running calculations on cached data, using a program-friendly interface, without having to go through an ORM.
- Uniquing a large dataset using simple key-value columns.
- To keep querying fast, values can be rolled up into different time slices (a Redis-based sketch follows this list).
- Computing the intersection of two massive sets, where a join would be too slow.
- A timeline à la Twitter.
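The 5,000-user figure in the "last active" bullet above is easy to check. A minimal sketch, assuming a disk-bound database tops out at a couple of hundred random writes per second (a ballpark for a single 7200 RPM spindle, not a measurement):

```python
# Arithmetic behind the "last active" bullet: a modest-sounding chat site
# already saturates a seek-bound database.

users = 5000
update_interval = 30                  # seconds between activities per user
writes_per_sec = users / update_interval
print(f"{writes_per_sec:.0f} updates/sec")    # ~167 updates/sec

disk_iops = 175                       # assumed random-write IOPS, one spindle
print("disk-bound DB at its limit:", writes_per_sec >= disk_iops * 0.9)
# An in-memory store doing tens of thousands of updates/sec barely
# notices the same load.
```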
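And a sketch of the time-slice roll-up bullet above, using redis-py. The key scheme (views:&lt;page&gt;:&lt;YYYYMMDDHH&gt;) is a made-up convention for illustration, not anything a particular product mandates:

```python
# Roll up page views into hourly buckets so queries sum a few small
# counters instead of scanning raw events.
import time
import redis

r = redis.Redis()

def record_view(page_id):
    hour = time.strftime("%Y%m%d%H", time.gmtime())
    r.incr(f"views:{page_id}:{hour}")        # one atomic bump per event

def views_last_hours(page_id, hours=24):
    now = time.time()
    keys = [f"views:{page_id}:" + time.strftime("%Y%m%d%H", time.gmtime(now - h * 3600))
            for h in range(hours)]
    return sum(int(v) for v in r.mget(keys) if v is not None)
```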
Redis Use Cases

Redis is unique in the repertoire as it is a data structure server, with many fascinating use cases that people are excited to share.

- Calculating whose friends are online using sets.
- Memcached on steroids.
- Distributed lock manager for process coordination.
- Full text inverted index lookups.
- Tag clouds.
- Leaderboards. Sorted sets for maintaining high score tables (a sketch follows this list).
- Circular log buffers.
- Database for university course availability information. If the set contains the course ID it has an open seat. Data is scraped and processed continuously and there are ~7200 courses.
- Server-backed sessions. A random cookie value, which is then associated with a larger chunk of serialized data on the server, is a very poor fit for relational databases. Sessions are often created for every visitor, even those who stumble in from Google and then leave, never to return again. They then hang around for weeks taking up valuable database space. They are never queried by anything other than their primary key.
- Fast, atomically incremented counters are a great fit for offering real-time statistics.
- Polling the database every few seconds. Cheap in a key-value store. If you're sharding your data you'll need a central lookup service for quickly determining which shard is being used for a specific user's data. A replicated Redis cluster is a great solution here - GitHub uses exactly that to manage sharding their many repositories between different backend file servers.
- Transient data. Any transient data used by your application is also a good fit for Redis. CSRF tokens (to prove a POST submission came from a form you served up, and not a form on a malicious third party site) need to be stored for a short while, as does handshake data for various security protocols.
- Incredibly easy to set up and ridiculously fast (30,000 reads or writes a second on a laptop with the default configuration).
- Share state between processes. Run a long-running batch job in one Python interpreter (say loading a few million lines of CSV into a Redis key/value lookup table) and run another interpreter to play with the data that’s already been collected, even as the first process is streaming data in. You can quit and restart your interpreters without losing any data.
- Create heat maps of the BNP’s membership list for the Guardian.
- Redis semantics map closely to Python native data types, so you don’t have to think for more than a few seconds about how to represent data.
- A simple capped log implementation (similar to a MongoDB capped collection): push items onto the tail of a 'log' key and use LTRIM to retain only the last X items (a sketch follows this list). You could use this to keep track of what a system is doing right now without having to worry about storing ever-increasing amounts of logging information.
- An interesting example of an application built on Redis is Hurl, a tool for debugging HTTP requests built in 48 hours by Leah Culver and Chris Wanstrath.
- It’s common to use MySQL as the backend for storing and retrieving what are essentially key/value pairs. I’ve seen this over and over when someone needs to maintain a bit of state, session data, counters, small lists, and so on. When MySQL isn’t able to keep up with the volume, we often turn to memcached as a write-thru cache. But there’s a bit of a mismatch at work here.
- With sets, we can also keep track of ALL of the IDs that have been used for records in the system.
- Quickly pick a random item from a set.
- API limiting. This is a great fit for Redis as a rate limiting check needs to be made for every single API hit, which involves both reading and writing short-lived data (a sketch follows this list).
- A/B testing is another perfect task for Redis - it involves tracking user behaviour in real-time, making writes for every navigation action a user takes, storing short-lived persistent state and picking random items.
- Implementing the inbox method with Redis is simple: each user gets a queue (a capped queue if you're worried about memory running out) to work as their inbox and a set to keep track of the other users who are following them. Ashton Kutcher has over 5,000,000 followers on Twitter - at 100,000 writes a second it would take less than a minute to fan a message out to all of those inboxes.
- Publish/subscribe is perfect for broadcasting updates (such as election results) to hundreds of thousands of simultaneously connected users. Blocking queue primitives mean message queues without polling.
- Have workers periodically report their load average into a sorted set.
- Redistribute load. When you want to issue a job, grab the three least loaded workers from the sorted set and pick one of them at random (to avoid the thundering herd problem).
- Multiple GIS indexes.
- Recommendation engine based on relationships.
- Web-of-things data flows.
- Social graph representation.
- Dynamic schemas so schemas don't have to be designed up-front. Building the data model in code, on the fly by adding properties and relationships, dramatically simplifies code.
- Reducing the impedance mismatch because the data model in the database can more closely match the data model in the application.
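A minimal version of the leaderboard bullet above, with redis-py 3.x (where ZADD takes a mapping); the key name "leaderboard" is arbitrary:

```python
# Leaderboard on a Redis sorted set: ZADD keeps members ordered by score,
# ZREVRANGE reads the top of the table in one call.
import redis

r = redis.Redis()

def record_score(player, score):
    # Overwrites any previous score; keeping only a personal best would
    # need a read-compare (or the ZADD GT flag on newer Redis).
    r.zadd("leaderboard", {player: score})

def top(n=10):
    return r.zrevrange("leaderboard", 0, n - 1, withscores=True)

record_score("alice", 3200)
record_score("bob", 4100)
print(top())   # [(b'bob', 4100.0), (b'alice', 3200.0)]
```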
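The capped log bullet above translates almost directly into two commands; a sketch, with MAX_ITEMS as an assumed cap:

```python
# Capped log: append to the tail, trim to the last MAX_ITEMS, so the key
# never grows without bound.
import redis

r = redis.Redis()
MAX_ITEMS = 1000

def log(message):
    pipe = r.pipeline()                  # both steps in one round trip
    pipe.rpush("log", message)
    pipe.ltrim("log", -MAX_ITEMS, -1)    # keep only the newest MAX_ITEMS
    pipe.execute()

def tail(n=50):
    return r.lrange("log", -n, -1)       # the n most recent entries
```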
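And a sketch of the API limiting bullet: a fixed-window rate limiter costing one INCR per hit, plus one EXPIRE on the first hit of each window. The limit, window size, and key scheme are assumptions for illustration:

```python
# Fixed-window API rate limiter: count hits per (key, window) bucket.
import time
import redis

r = redis.Redis()
LIMIT = 100     # requests allowed...
WINDOW = 60     # ...per this many seconds

def allowed(api_key):
    bucket = f"rate:{api_key}:{int(time.time() // WINDOW)}"
    hits = r.incr(bucket)
    if hits == 1:
        r.expire(bucket, WINDOW * 2)     # garbage-collect stale buckets
    return hits <= LIMIT
```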
VoltDB Use Cases

VoltDB, as a relational database, is not traditionally thought of as in the NoSQL camp, but I feel its design perspective is so radically far from Oracle-type systems that it is much more in the NoSQL tradition.

- Application: Financial trade monitoring
- Data source: Real-time markets
- Partition key: Market symbol (ticker, CUSIP, SEDOL, etc.)
- High-frequency operations: Write and index all trades, store tick data (bid/ask)
- Lower-frequency operations: Find trade order detail based on any of 20+ criteria, show TraderX's positions across all market symbols
- Application: Web bot vulnerability scanning (SaaS application)
- Data source: Inbound HTTP requests
- Partition key: Customer URL
- High-frequency operations: Hit logging, analysis and alerting
- Lower-frequency operations: Vulnerability detection, customer reporting
- Application: Online gaming leaderboard
- Data source: Online game
- Partition key: Game ID
- High-frequency operations: Rank scores based on defined intervals and player personal best
- Lower-frequency transactions: Leaderboard lookups
- Application: Package tracking (logistics)
- Data source: Sensor scan
- Partition key: Shipment ID
- High-frequency operations: Package location updates
- Lower-frequency operations: Package status report (including history), lost package tracking, shipment rerouting
- Application: Ad content serving
- Data source: Website or device, user or rule triggered
- Partition key: Vendor/ad ID composite
- High-frequency operations: Check vendor balance, serve ad (in target device format), update vendor balance
- Lower-frequency operations: Report live ad view and click-thru stats by device (vendor-initiated)
- Application: Telephone exchange call detail record (CDR) management
- Data source: Call initiation request
- Partition key: Caller ID
- High-frequency operations: Real-time authorization (based on plan and balance)
- Lower-frequency operations: Fraud analysis/detection
- Application: Airline reservation/ticketing
- Data source: Customers (web) and airline (web and internal systems)
- Partition key: Customer (flight info is replicated)
- High-frequency operations: Seat selection (lease system), add/drop seats, baggage check-in
- Lower-frequency operations: Seat availability/flight, flight schedule changes, passenger re-bookings on flight cancellations
Analytics Use Cases

Kevin Weil at Twitter is great at providing Hadoop use cases. At Twitter this includes counting big data with standard counts, min, max, std dev; correlating big data with probabilities, covariance, influence; and research on big data. Hadoop is on the fringe of NoSQL, but it's very useful to see what kind of problems are being solved with it (a toy map/reduce sketch follows this list).

- How many requests do we serve each day?
- What is the average latency? The 95th-percentile latency?
- Grouped by response code: what is the hourly distribution?
- How many searches happen each day at Twitter?
- Where do they come from?
- How many unique queries?
- How many unique users?
- Geographic distribution?
- How does usage differ for mobile users?
- How does usage differ for 3rd party desktop client users?
- Cohort analysis: all users who signed up on the same day—then see how they differ over time.
- Site problems: what goes wrong at the same time?
- Which features get users hooked?
- Which features do successful users use often?
- Search corrections and suggestions (not done now at Twitter, but coming in the future).
- What can we tell about a user from their tweets?
- What can we tell about you from the tweets of those you follow?
- What can we tell about you from the tweets of your followers?
- What can we tell about you from the ratio of your followers/following?
- What graph structures lead to successful networks? (Twitter’s graph structure is interesting since it’s not two-way)
- What features get a tweet retweeted?
- When a tweet is retweeted, how deep is the corresponding retweet tree?
- Long-term duplicate detection (short term for abuse and stopping spammers)
- Machine learning. About not quite knowing the right questions to ask at first. How do we cluster users?
- Language detection (contact mobile providers to get SMS deals for users—focusing on the most popular countries at first).
- How can we detect bots and other non-human tweeters?
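To make the Hadoop-style counting concrete, here is the first question above ("how many requests do we serve each day?") as a toy map/shuffle/reduce pass simulated in plain Python. The log format is invented; a real job would of course run over log files in HDFS, or be expressed in Pig or Hive:

```python
# Toy map/shuffle/reduce for per-day request counts. Hadoop distributes
# these same three phases across a cluster; the logic is identical.
from collections import defaultdict

logs = [
    "2010-12-06T09:00:01 GET /timeline 200",
    "2010-12-06T09:00:02 GET /search 200",
    "2010-12-07T11:30:00 POST /tweet 201",
]

def mapper(line):
    day = line.split("T")[0]
    yield day, 1                          # emit (day, 1) for every request

def reducer(day, counts):
    return day, sum(counts)               # total requests for that day

grouped = defaultdict(list)               # the "shuffle": group by key
for line in logs:
    for key, value in mapper(line):
        grouped[key].append(value)

print(dict(reducer(d, c) for d, c in grouped.items()))
# {'2010-12-06': 2, '2010-12-07': 1}
```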
Poor Use Cases

- OLTP. Outside VoltDB, complex multi-object transactions are generally not supported. Programmers are supposed to denormalize, use documents, or use other complex strategies like compensating transactions.
- Data integrity. Most of the NoSQL systems rely on applications to enforce data integrity where SQL uses a declarative approach. Relational databases are still the winner for data integrity.
- Data independence. Data outlasts applications. In NoSQL, applications drive everything about the data. One argument for the relational model is as a repository of facts that can last for the entire lifetime of the enterprise, far past the expected lifetime of any individual application.
- SQL. If you require SQL then very few NoSQL systems will provide a SQL interface, but more systems are starting to provide SQLish interfaces.
- Ad-hoc queries. If you need to answer real-time questions about your data that you can’t predict in advance, relational databases are generally still the winner.
- Complex relationships. Some NoSQL systems support relationships, but a relational database is still the winner at relating.
- Maturity and stability. Relational databases still have the edge here. People are familiar with how they work, what they can do, and have confidence in their reliability. There are also more programmers and toolsets available for relational databases. So when in doubt, this is the road that will be traveled.
posted Jul 20, 2011, 5:03 PM by Kuwon Kang
WEDNESDAY, JUNE 15, 2011 AT 8:08AM 
You need answers, I know, but all I have here are some questions to consider when thinking about which database to use. These are taken from my webinar What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications. It's a companion article to What The Heck Are You Actually Using NoSQL For? Actually, I don't even know if there are 101 questions, but there are a lot, way too many. You might want to use these questions as a kind of NoSQL I Ching, guiding your way through the immense possibility space of options in front of you. Nothing is fated, all is interpreted, but it might just trigger a new insight or two along the way.

Where Are You Starting From?

- A can-do-anything green-field application?
- In the middle of a project and worried about hitting bottlenecks?
- Worried about hitting the scaling wall once you deploy?
- Adding a separate loosely coupled service to an existing system?
- What are your resources? Expertise? Budget?
- What are your pain points? What's so important that if it fails you will fail? What forces are pushing you?
- What are your priorities? Prioritize them. What is really important to you, what must get done?
- What are your risks? Prioritize them. Is the risk of being unavailable more important than being inconsistent?
What Are You Trying To Accomplish?

- What are you trying to accomplish?
- What's the delivery schedule?
- Do the research to be specific, like Facebook did with their messaging system:
Facebook chose HBase because they monitored their usage and figured out what was needed: a system that could handle two types of data patterns.
Things To Consider...Your Problem

- Do you need to build a custom system?
- Access patterns: 1) a short set of temporal data that tends to be volatile; 2) an ever-growing set of data that rarely gets accessed; 3) high write loads; 4) high throughput; 5) sequential; 6) random.
- Requires scalability?
- Is availability more important than consistency, or is it latency, transactions, durability, performance, or ease of use?
- Cloud or colo? Hosted services? Resources like disk space?
- Can you find people who know the stack?
- Tired of the data transformation (ORM) treadmill?
- Store data that can be accessed quickly and is used often?
- Would like a high level interface like PaaS?
Things To Consider...Money

- Cost? With money you have different options than if you don't. You can probably make the technologies you know best scale.
- Inexpensive scaling?
- Lower operations cost?
- No sysadmins?
- Type of license?
- Support costs?
Things To Consider...Programming

- Flexible datatypes and schemas?
- Support for which language bindings?
- Web support: JSON, REST, HTTP, JSON-RPC
- Built-in stored procedure support? Javascript?
- Platform support: mobile, workstation, cloud
- Transaction support: key-value, distributed, ACID, BASE, eventual consistency, multi-object ACID transactions.
- Datatype support: graph, key-value, row, column, JSON, document, references, relationships, advanced data structures, large BLOBs.
- Prefer the simplicity of a transaction model where you can just update and be done with it? In-memory makes it fast enough, and big systems can fit on just a few nodes.
Things To Consider...Performance

- Performance metrics: IOPS, reads, writes, streaming?
- Support for your access pattern: random read/write; sequential read/write; large or small or whatever chunk size you use.
- Are you storing frequently updated bits of data?
- High Concurrency vs High Performance?
- Problems that limit the type of work load you care about?
- Peak QPS on highly-concurrent workloads?
- Test your specific scenarios?
Things To Consider...Features

- Spooky scalability at a distance: support across multiple data-centers?
- Ease of installation, configuration, operations, development, deployment, support, management, upgrades, etc.
- Data Integrity: In DDL, Stored Procedure, or App
- Persistence design: Memtable/SSTable; append-only B-tree; B-tree; on-disk linked lists; in-memory replicated; in-memory snapshots; in-memory only; hash; pluggable.
- Schema support: none, rigid, optional, mixed
- Storage model: embedded, client/server, distributed, in-memory
- Support for search, secondary indexes, range queries, ad-hoc queries, MapReduce?
- Hitless upgrades?
Things To Consider...More Features

- Tunability of consistency models?
- Tools availability and product maturity?
- Expand rapidly? Develop rapidly? Change rapidly?
- Durability? On power failure?
- Bulk import? Export?
- Materialized views for rollups of attributes?
- Built-in web server support?
- Authentication, authorization, validation?
- Continuous write-behind for system sync?
- What is the story for availability, data-loss prevention, backup and restore?
- Automatic load balancing, partitioning, and repartitioning?
- Live addition and removal of machines?
Things To Consider...The Vendor

- Viability of the company?
- Future direction?
- Community and support list quality?
- Support responsiveness?
- How do they handle disasters?
- Quality and quantity of partnerships developed?
- Customer support: enterprise-level SLA, paid support, none
posted Jul 19, 2011, 5:04 PM by Kuwon Kang
One of the major concerns of many IT organizations is cloud vendor lock-in. This concern was expressed recently in "Banks fear cloud vendor lock-in," from IT Wire: The onset of cloud computing gives vendors the chance to lock customers in to their infrastructure, using proprietary protocols to ensure they’re on the monthly billing cycle as long as possible.
The OpenStack project emerged with a mission to address this concern by creating a community-led open source project enabling any organization to create and offer cloud computing services running on standard hardware. GigaSpaces joined the OpenStack project with the mission to enable any organization to build its own Platform-as-a-Service (PaaS), with its own choice of language and best-of-breed middleware stack. In this post, I’ll try to provide more insight into our current and future plans around OpenStack, and more specifically our joint collaboration with the Citrix OpenCloud initiative.

GigaSpaces OpenStack Explained

- What does GigaSpaces' OpenStack support mean?
One of the goals for our second-generation PaaS/SaaS enablement platform was to enable smooth migration between different cloud providers. We were able to achieve this goal through the use of our own abstraction (the Scaling Handler) and through integration with the JClouds project, which provides a common abstraction over most of the existing cloud providers. With that, we can ensure that any application can be moved from the likes of Amazon to OpenStack or to an organization's own private cloud with zero changes to the application code or configuration. The only change involved is setting the user/key of the specific cloud.
By adding support for OpenStack, we now enable users to safely move to an OpenStack-based cloud when they're ready and with little effort, yet they gain all the benefits that come with it in terms of cost, openness, etc.
- Can I use the OpenStack integration outside of GigaSpaces' platform?
Yes. The OpenStack integration was developed in close collaboration with GridDynamics and Adrian Cole from JClouds. We intend to make the integration with OpenStack available to the entire Java community through the open source JClouds project. In this way, any Java application can easily run on OpenStack.
- What specific application stack is currently supported?
Today, there are a wide variety of tools and frameworks available throughout the Java, .Net, and Ruby ecosystems, and more. We feel that limiting the platform to a specific stack that we control -- as in the case of Google App Engine -- is too restrictive. Instead, we wanted to create a foundation that enables users of the platform to easily host any service or framework they choose, and to do so in a way that provides consistent behavior in terms of deployment, monitoring, elasticity, and scaling. Therefore, we developed a Universal Service Manager and Service Orchestration framework. The framework makes it easy for users to plug in their own choice of the services that comprise the application stack (Tomcat, MySQL, NoSQL through Cassandra, Hadoop, Ruby...). In addition, we will provide a set of built-in recipes. At first we will be targeting Big Data stacks, which include integration with NoSQL and Hadoop, as well as reporting tools, to make the deployment of Big Data applications significantly simpler. The integration with OpenStack will make it possible to bring these benefits to any application running on OpenStack.
- How is this related to the previously announced Open Elastic Platform with Citrix?
The new platform builds on the collaboration announced in October 2010 by Citrix and GigaSpaces, highlighting the integration between Citrix® NetScaler® and Citrix® XenServer® -- both part of the Citrix OpenCloud solution -- and the GigaSpaces eXtreme Application Platform (XAP). In this new release we are adding the following new developments:

Greater Openness

- Openness at the IaaS layer: The specific integration with NetScaler (load balancer) and XenServer will be done through more open interfaces provided through the Citrix/OpenStack contribution, which means that OpenStack users can use those interfaces to plug in any hypervisor or load balancer.
- Openness at the application layer: The previous version of GigaSpaces XAP offered a limited set of application stack support features, mostly geared to services that we control and mostly in Java. With the new platform we offer significantly greater flexibility on various fronts:
- Open to any application stack and language: As I mentioned earlier, users can now easily host any application or service and build their stack of choice, while at the same time managing and controlling their application in a consistent way in terms of deployment, monitoring, elasticity, etc.
- Users can choose their container of choice: The new platform will include support for Tomcat in addition to our existing Jetty support.
- Users can choose standard APIs to scale their data: In most cases users looking for data scalability have had to rewrite their applications. This is still the case with most of the NoSQL and in-memory data grid solutions. Through standard JPA support in our platform, we can finally reduce that lock-in barrier.
Better Application Monitoring, Specifically Geared for PaaS
One of the challenges with many existing monitoring systems is that they were not geared for PaaS-based deployments. In this new offering we provide PaaS-driven monitoring that is tightly integrated with the platform, and which can interact with the platform to handle failure or scaling events without any human intervention. In addition, it provides a holistic view that spans both the application and the infrastructure. Furthermore, the new monitoring was designed to integrate with the existing set of data-center monitoring tools.
Better Performance and Scalability
By leveraging the Citrix OpenCloud solution, GigaSpaces will provide tighter integration between the platform and infrastructure layers to enhance the services offered through OpenStack clouds. In this way, we leverage the years of experience of both companies in delivering high performance and low latency to mission-critical applications, and remove the hassle involved in tuning the entire stack. We believe that through this joint work we can bring much of that experience into the OpenStack community. We can also now offer more fine-grained multi-tenancy support by combining process-based and VM-based multi-tenancy. This makes it possible to achieve significantly higher density and utilization of existing resources, and reduces the amount of resources -- and therefore the cost -- associated with serving a particular application load.
- What does this mean for Enterprises, ISV/SaaS, and Cloud/IaaS providers?
Any user of this new offering will benefit from the greater openness and reduced lock-in concerns. They won't be relying on a single vendor for their future. This also means that they will have better control over their own stack and cost margins, and will therefore gain the ability to offer prices that are competitive with equivalent cloud providers. More specifically:
- Enterprises can use this offering to build their own enterprise PaaS layer that is specifically geared for big-data analytics, e-commerce, and financial applications.
- SaaS ISVs can use this offering to SaaS-enable their applications and provide the same solution both on- and off-premises.
- Infrastructure/cloud providers can use this offering to deliver an Amazon-like services stack -- including RDS, SimpleDB, SQS, CloudWatch, and Elastic Beanstalk -- as a pre-packaged solution.
Final Words
I'm very excited about all of these new developments. OpenStack fills in the missing piece in the cloud: as its name suggests, it unlocks the cloud by providing a truly open cloud stack, an essential substrate that drives innovation and collaboration in a way that couldn't be done before. It's interesting to compare the level of effort we had to invest in our first integration with Citrix to the effort required for OpenStack. OpenStack enables us to work completely in parallel, making substantial progress without much coordination. Moreover, every delivery of our work can be relevant to a wider spectrum of users. It also enabled us to join forces with other members of the community, such as Adrian Cole from JClouds and GridDynamics, simply because we share the same goal of supporting the OpenStack open cloud mission. This is only the beginning. I hope that with this new development we can continue to work toward greater openness at the application platform layer, together with Citrix and the rest of the OpenStack community. I'm going to give a talk on the subject (PaaS on OpenStack) during the upcoming OpenStack Design Summit on April 26th, 2011, in Santa Clara, California, where I hope to share some of these thoughts and hopefully get more of this work happening within the community.
Availability
I will also be announcing the first GigaSpaces code contribution to OpenStack at the Design Summit. Note that the integration with OpenStack will be made available to the entire Java community through the open source JClouds project. You can also get an early look at our upcoming 2nd-generation CEAP release by registering for the Early Access Program.
posted Jul 19, 2011, 5:03 PM by Kuwon Kang
PaaS on OpenStack
In my last post (GigaSpaces OpenStack Explained) I introduced our plan to add support for OpenStack in our platform: One of the goals for our second-generation PaaS/SaaS enablement platform was to enable smooth migration between different cloud providers. We were able to achieve this goal through the use of our own abstraction (the Scaling Handler) and through the integration with the JClouds project, which provides a common abstraction over most of the existing cloud providers. With that, we can ensure that any application can be moved from the likes of Amazon to OpenStack, or to an organization's own private cloud, with zero changes to the application code or configuration. The only change involved is setting the user/key of the specific cloud.
By adding support for OpenStack, we now enable users to safely move to an OpenStack-based cloud when they're ready and with little effort, while gaining all the benefits that come with it in terms of cost, openness, etc.
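In JClouds terms, here is a hedged sketch of what that "zero changes" claim looks like: the application holds a single ComputeServiceContext, and only the provider id, endpoint, and credentials differ between clouds. The ids and credentials below are placeholders:

```java
import org.jclouds.ContextBuilder;
import org.jclouds.compute.ComputeServiceContext;

public class CloudSwitchSketch {
    // The application code behind this context is identical on both clouds;
    // only the provider id, endpoint, and credentials change.
    static ComputeServiceContext connect(boolean useOpenStack) {
        if (useOpenStack) {
            return ContextBuilder.newBuilder("openstack-nova")
                    .endpoint("http://keystone.example.org:5000/v2.0/") // placeholder
                    .credentials("tenant:user", "password")             // placeholder
                    .buildView(ComputeServiceContext.class);
        }
        return ContextBuilder.newBuilder("aws-ec2")
                .credentials("accessKeyId", "secretKey")                // placeholder
                .buildView(ComputeServiceContext.class);
    }
}
```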
Yesterday, I had a session during the OpenStack Summit where I tried to present a more general view on how we should be thinking about PaaS in the context of OpenStack. The key takeaway: the main goal of PaaS is to drive productivity into the process by which we deliver new applications. Most of the existing PaaS solutions take a fairly extreme approach with their abstraction of the underlying infrastructure, and therefore fit a fairly small number of extremely simple applications, thus missing the real promise of PaaS. Amazon's Elastic Beanstalk took a more bottom-up approach, giving us a better set of tradeoffs between abstraction and control, which makes it more broadly applicable to a larger set of applications. The fact that OpenStack is open source allows us to think differently about what we can do at the platform layer. We can create a tighter integration between the PaaS and IaaS layers and thus come up with a better set of tradeoffs in the way we drive productivity without giving up control. Specifically, that means:
- Anyone should be able to:
- Build their own PaaS in a snap
- Run on any cloud (public/private)
- Gain multi-tenancy, elasticity, and more, without code changes
- Provide a significantly higher degree of control, without adding substantial complexity, over:
- Language choice
- Operating system
- Middleware stack
- Should come pre-integrated with a popular stack:
- Spring, Tomcat, DevOps, NoSQL, Hadoop...
- Designed to run the most demanding mission-critical apps
The slides from my presentation at the Summit give a very high-level overview of the problems, as well as how OpenStack can help provide a solution. We've also recorded a short demo that shows how all the different pieces actually work together in the context of OpenStack. Note that since the demo was mainly targeted at illustrating the OpenStack integration through the JClouds/OpenStack provider, it doesn't cover much of the feature set -- such as multi-tenancy, fine-grained monitoring, or fail-over -- nor does it show deploying full-blown web apps, big-data apps, and so on. The actual code for the JClouds/OpenStack provider should be available through the JClouds project shortly.
Call for Action
Today, there is a lot of work and interesting innovation being done in the PaaS world by different providers. Unfortunately, much of that work is done with very little collaboration. The OpenStack community can be a great environment for turning all those great ideas into something meaningful and open. I hope that our initial joint work with OpenStack, Citrix, JClouds, and GridDynamics can be a good start in that direction. I'm trying to figure out the right way to establish a more formalized Open-PaaS group as part of the OpenStack community. Any ideas/help would be greatly appreciated...
posted Jul 19, 2011, 4:31 PM by Kuwon Kang
GigaSpaces Citrix integration on top of OpenStack
In one of my previous posts (GigaSpaces OpenStack Explained) I made a reference to the joint work we are doing with Citrix through the integration of our new PaaS API: The specific integration with NetScaler (load balancer) and XenServer will be achieved through more open interfaces provided through the Citrix/OpenStack contribution, which means that OpenStack users can use those interfaces to plug in any hypervisor or load balancer.
In this post I'd like to elaborate more specifically on the current and planned integration work.
GigaSpaces Citrix integration on top of OpenStack
The block diagram from the original post (not reproduced here) describes the main layers that comprise the joint GigaSpaces/Citrix integration. The OpenStack layers make these integration points more open, as we will be using the OpenStack API for Compute (Nova) and the Load Balancer API instead of using the Citrix API directly.
OpenStack Compute (Nova)
The GigaSpaces integration with the OpenStack Nova API is done through the previously announced JClouds provider contribution. This integration enables you to run Citrix Xen VMs and manage them through the Citrix management console. Similarly, you can use this same integration to plug in other VMs.
OpenStack Load Balancer
The OpenStack Load Balancer API is a slight variation on the current Rackspace Load Balancer API. This integration follows the exact path as the Nova integration, i.e., we will use the JClouds load balancer abstraction and plug in an OpenStack load balancer provider as one of the providers' plug-ins. This integration enables you to run the Citrix NetScaler load balancer through the OpenStack API, letting you leverage the performance and scaling benefits of NetScaler while keeping an open interface to plug in other load balancers.
Citrix certified version of OpenStack
OpenStack enables you to download the full source code of the project and build an Amazon-like cloud infrastructure in your local environment. As with many other open-source offerings (Linux is a good example), most enterprises don't have the skills or the time to go through the process of building their own cloud from source. These organizations need a pre-packaged version of OpenStack that comes with built-in support, production management tools, etc. The Citrix certified version of OpenStack is geared specifically for this purpose. The solution will be comprised of two primary components: a Citrix-certified version of OpenStack and a cloud-optimized version of XenServer. The product will be sold with Rackspace Cloud Builders, who will provide deployment services, training, and ongoing support for customer clouds. Customers can get started building their clouds today with the Early Access Program. The program will provide access to the software (Citrix), a reference architecture and PowerEdge C server platforms (Dell), and the services (Rackspace Cloud Builders) they need to begin building their cloud. However, as Open Cloud and OpenStack are about openness, customers can work with any group of providers/partners to build their cloud, and will be supported in the Early Access Program. The end result is open source technologies delivering a massively scalable cloud operating system.
Citrix OpenStack virtual appliance project
The accompanying video demonstrates the deployment and management of an OpenStack cloud, with the software packaged as a virtual appliance. This allows service providers to track, install, and upgrade their cloud as a single virtual machine image, avoiding the complexity of deploying directly from packages. The video shows a complete installation from bare metal to a working cloud, using Citrix's packaged solution. It then goes on to show how the cloud can be scaled up to add new nodes in one click.
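For a sense of what the load balancer abstraction looks like from code, here is a hedged sketch against the JClouds LoadBalancerService interface. The provider id "openstack-lb" is hypothetical (the real plug-in name would come from the contribution described above), and the credentials and node set are placeholders:

```java
import java.util.Set;
import org.jclouds.ContextBuilder;
import org.jclouds.compute.domain.NodeMetadata;
import org.jclouds.loadbalancer.LoadBalancerService;
import org.jclouds.loadbalancer.LoadBalancerServiceContext;
import org.jclouds.loadbalancer.domain.LoadBalancerMetadata;

public class LoadBalancerSketch {
    // Registers a set of already-provisioned nodes behind a new load balancer.
    // "openstack-lb" is a hypothetical provider id standing in for the
    // OpenStack load balancer plug-in described above.
    static LoadBalancerMetadata balance(Set<? extends NodeMetadata> nodes) {
        LoadBalancerServiceContext context = ContextBuilder.newBuilder("openstack-lb")
                .credentials("tenant:user", "apiKey")   // placeholder credentials
                .buildView(LoadBalancerServiceContext.class);
        try {
            LoadBalancerService lb = context.getLoadBalancerService();
            // null location = provider default; HTTP on port 80, front and back.
            return lb.createLoadBalancerInLocation(null, "demo-lb", "HTTP", 80, 80, nodes);
        } finally {
            context.close();
        }
    }
}
```

Because the same abstraction fronts NetScaler or any other plug-in, swapping load balancers is again a provider change rather than an application change.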