Moving data to Cloud–Considerations and thoughts August 20th, 2014

Vinod Kumar

Being a data person and meeting customers almost every single day on considerations of movement to Azure (Cloud), I take the opportunity to demystify a lot basic concepts with customers. Post the session, it is about thinking in the right direction while taking data to the cloud. This blog reflects the lessons learned best practices, approaches and understanding from real world experiences. There are not written on stone sort of recommendations, but these approaches I highly recommend customers to set their minds at while taking data to cloud.

Know your workload

The basic conversation here is to identify if the application is a “Green Field” project or an “Existing Application”. In a “Green Field” project, we will be exposing data via a service for the first time. Like a literal green field, we are dealing with an undeveloped environment and don’t have any expectations, limitations, dependencies or other constraints that would result from having an existing service.

For existing applications, services may reside on our on-premise servers, partner hosted environment or third-party cloud environments. In such scenario, there may be a need to move, scale, or provide a redundant environment for these services. When such applications come, I look at asking if the end-user open on opening up just a subset of of existing data as a new service on cloud.

When working movement to Cloud, I generally ask if the customer wants to migrate as-is their current deployment or migrate data / services as an opportunity to enhance it providing a super set of functionality found in the original application.

If you already have an existing application, but it is reaching the limits of scale and is difficult or impossible due to technology or financial based constraints to scale – one option is to use the cloud as a “scale layer”. In this scenario, cloud infrastructure can be put in front of your existing data services to provide significant scale. This is yet another workload to look at.

Exposing your data

There can be different levels of motivations when exposing your data via API’s. I generally quiz customers around this – is it compliance requirement, is it self-motivated for others to integrate or is it a requirement from consumers. Based on this, the discussion will decide if it is need to have, want to have or nice to have.

Irrespective how the data gets exposed by applications, there are few more analysis we do, these are some typical queries we ask:

  1. Data Location: Are there constraints in location of where data can be stored? Is geography a contraint?
  2. Data type and Format: Are there any standards which mandate data to be transmitted in XML, JASON, Binary, Text, OData, DOM, Zip, Images format etc?
  3. Sharing: Is there a specific API to be implemented like WebAPI, WCF, Web Services, FTP, SAPI etc.
  4. Consumption: Are there requirements to retrieve the data in its entirety or be able to query the data and retrieve only a subset? Is queryability a required and do we need to certify the same?
  5. Commercial Use: It is critical to understand if it is internal systems consuming the data or available for commercial use? Is there a need to monetize the access of data and services delivered?
  6. Data discoverability: Is there a need to expose data and advertised somehow? Should it integrate with gateways, govt services or third party data services like Azure DataMarket?

Know your Data

There are multiple technologies which can be used to host data, each with their own pros, cons, and pricing. Data size is relevant as there can be key factors when determining which technology should be used to host the data. In addition to understanding the current data size, it’s important to also understand the data growth rate. Evaluating the data growth rate can be used to extrapolate storage needs in future years and help scope the technology used.

Know how often the data gets updated. If customer wants to expose data to public, then there maybe sync or out-of-sync updates to source data and data exposed via services.

Putting a copy of your data into the cloud requires the actual transfer of data from your location to the datacenter of cloud hosting provider. These transfers require time and typically have associated bandwidth charges that should be considered – both for the initial transfer of the current state of your data set and the periodic refreshes of data.

Sharding is something that most cloud storage mechanisms support. When looking to move large data sets to the cloud, consider sharding and think about the optimal ways to partition data for optimized for query performance. Examples of different types of sharding include sharding by region, state, by time period (month, year etc.), postal code, customer’s name etc.

Query pattern is yet another dimension we need to keep in mind. Since you pay based on bandwidth utilization, it is critical to understand the query patterns and how queries come to access data. Since we are paying for the data usage, it is important to understand the amount of data we transfer in every request, the number of columns queried etc.

As we build our applications, it is important to understand if there are any Exporting requirements for other systems to consume. In an hybrid scenario, we will need to run queries that result in the export of files to the local file system and then those must be transported and imported into a cloud store. Depending on the cloud technology being used for storage, the import may be trivial or it may require the creation of custom code or scripts to export the data from the source and import it into the cloud.

Are you into data selling business?

There have been very less customers who are in a B2C scenario, but for those who are on this business – need to understand there is new costs to handle while servicing data as a service. It is no more a freebie. Even though cloud computing can greatly reduce costs, there are still costs involved to host, load and support a data service.

As these costs add up, companies look at a minimum of Cost Recovery rather than Profit in the first run. They look at optimizing it later to make profit by volumes rather than profits out of single transactions. I have seen customers adopt different strategies for pricing in a Data as a Service or Software as a Service strategy. There is no one right way or the other –

  1. Tiered Pricing – This is like mobile plans, you charge by standard transactions rates per month, time, data etc.
  2. Pay as you go pricing – this is something to borrow from any cloud vendor charging.
  3. Peak Pricing and Non-Peak Pricing – We can have different pricing for different times of the day.
  4. One time usage Subscription – It is something like a trial costing.

There are many of these mechanisms customers adopt in their strategy, but these are just representative methods one might use.

Conclusion

I seriously hope I have outlined some thoughts in your mind as you migrate your workload to cloud or enable your data on cloud. Cloud is no different than what you do on-premise – just that we need to be double careful and take few extra steps as we enable our customers. In future posts, there will be more thoughts that I will try to bring out.

Tags: , , , , , , , , , ,

This entry was posted on Wednesday, August 20th, 2014 at 08:30 and is filed under Technology. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Leave a Reply