MongoDB MapReduce vs. SQL Server group by – Which is faster?

I decided to try out a well-written sample on the MongoVue site to compare how MapReduce in MongoDB and a good old T-SQL group by really perform.  I had to make a couple of tweaks to what they had written because:

1) There was one error in the MapReduce code example.  I point this out in the screencast.

2) I wanted to use SQL Server rather than MySQL for the RDBMS test.

I made two short screencasts showing exactly how I proceeded.  For completeness on the SQL Server side I did two comparisons.  The first one was over a heap table.  Next I took the recommendations of the query optimizer, which were to create not only a clustered index but also a non-clustered index with included columns for the lat/long values.  Below is the 5-minute video showing the details.
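For readers without a SQL Server instance handy, here is a minimal sketch of the same before/after idea.  It uses Python's built-in sqlite3 module as a stand-in, so it is an analogy and not the T-SQL from the screencast.  SQLite has no clustered index or INCLUDE clause, so a covering index on the grouped and aggregated columns plays that role here.  The table name, column names and generated rows are all hypothetical.

```python
import sqlite3

# Hypothetical query shape: group by a region column, aggregate lat/long.
QUERY = "SELECT region, AVG(lat), AVG(lon) FROM places GROUP BY region"


def build_demo():
    """Create an in-memory table with no indexes (a rough analog of a heap)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (id INTEGER, region TEXT, lat REAL, lon REAL)")
    rows = [(i, f"r{i % 5}", i * 0.1, i * -0.1) for i in range(1000)]
    conn.executemany("INSERT INTO places VALUES (?, ?, ?, ?)", rows)
    return conn


def plan_details(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = build_demo()
before = plan_details(conn, QUERY)

# The index holds every column the query touches, so SQLite can answer
# the query from the index alone without visiting the table.
conn.execute("CREATE INDEX ix_places_region ON places (region, lat, lon)")
after = plan_details(conn, QUERY)

print("before:", before)
print("after: ", after)
```

The plan text differs between SQLite versions, but the second plan should mention a covering index, which is the same effect the included columns give you on SQL Server.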

Then I ran the MapReduce job as written in the example on their blog using the MongoVue interface.  It took me a bit of fiddling to get the sample into MongoVue.  One tip if you plan to try this: use the output console (named ‘Learn Shell’) at the bottom of the MongoVue interface to verify that you’ve entered the MapReduce code correctly and into the correct section of the GUI.  Finally, the ‘In & Out’ section of the MapReduce interface wasn’t very well explained on the MongoVue site.  Take a look at my screencast to see how I chose to work with that section.

It was interesting to note a couple of things through this process:

1) T-SQL proved to be the fastest solution for this problem for a data set of this size, even BEFORE I did any optimization.  The T-SQL group by query on the view against the heap table ran in 5 seconds on the 37,000+ records.  After optimization (adding the indexes to the base table and recreating the view), that same query ran in 3 seconds.

2) The MapReduce took 11 seconds to run on my instance of MongoDB. My instance has no sharding and no replicas.
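To make the comparison concrete, here is a small sketch of why the two approaches give the same answer.  It is written in Python, with the built-in sqlite3 module standing in for SQL Server and plain functions standing in for MongoDB's map and reduce steps.  It is not the MongoVue sample or my T-SQL, and the rows and the region column are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (region, lat, lon). Not the data from the post.
ROWS = [
    ("north", 47.6, -122.3),
    ("north", 45.5, -122.7),
    ("south", 34.1, -118.2),
    ("south", 32.7, -117.2),
    ("south", 33.8, -117.9),
]


def group_by_sql(rows):
    """Count rows per region with SQL GROUP BY (sqlite3 as a stand-in RDBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (region TEXT, lat REAL, lon REAL)")
    conn.executemany("INSERT INTO places VALUES (?, ?, ?)", rows)
    result = dict(conn.execute("SELECT region, COUNT(*) FROM places GROUP BY region"))
    conn.close()
    return result


def map_fn(row):
    """Map step: emit one (key, value) pair per input record."""
    region, _lat, _lon = row
    yield region, 1


def reduce_fn(key, values):
    """Reduce step: fold all values emitted for one key into a single value."""
    return sum(values)


def group_by_mapreduce(rows):
    """Run map over every row, collect values per key, then reduce each key."""
    emitted = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            emitted[key].append(value)
    return {key: reduce_fn(key, values) for key, values in emitted.items()}


print(group_by_sql(ROWS))
print(group_by_mapreduce(ROWS))
```

Both functions return the same per-region counts.  The difference is in how the work is expressed: the database engine plans the group by for you, while with MapReduce you write the emit and fold steps yourself.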

If you’d like to try this out as well, I’ve zipped the source data and sample code and made it available for download here.

Posted in Big Data, noSQL, SQL Server 2012

MongoDB for the .NET Developer

I’ve been spending some time taking a closer look at MongoDB, in conjunction with a consulting job that I’ve been working on for a start-up.  The developers had been unhappy with their experience trying to develop on an RDBMS for the cloud and they asked me to propose alternatives.  While NoSQL is NOT the be-all and end-all for every start-up, there are certain types of data that lend themselves well to this model.  To that end, I’ve created a presentation which I’ll be sharing at technical events this year.

Posted in Agile, Big Data, Cloud, noSQL

AWS RDS SQL Server vs. SQL Azure Smackdown – Importing Data

This is the first in a series of comparisons between Amazon Web Services RDS SQL Server and SQL Azure. It is useful for me to understand exactly which features and tools work with cloud-deployed instances of SQL Server. In this screencast I take a look at common methods to import data. These include backup/restore, DACPAC and other tools such as the SQL Azure Migration Wizard (available from CodePlex).

Do the tools work? How well? Watch the video and find out.

Posted in AWS, Azure, Cloud, Microsoft, SQL Azure, SQL Server 2008, SQL Server 2012

Hadoop on Azure – Deck and Screencasts

Enjoy this deck (and linked screencasts) covering Hadoop on Azure.

 

Posted in Azure, Big Data, Cloud, Hadoop, Microsoft, noSQL

First Look – ASP.NET on Amazon Web Services Elastic Beanstalk

Here’s part two of my look at the big AWS announcement yesterday – full support for ASP.NET and SQL Server on AWS Elastic Beanstalk and AWS RDS.  I found it remarkably easy to get up and going with ASP.NET on AWS Elastic Beanstalk.  My short screencast shows me working first directly in the AWS administrative console (setting up a sample application).  Here I can monitor and configure my ASP.NET applications.

It’s important to understand that when you test this functionality, you are spinning up EC2 instances, and as long as you leave them running, you will be charged for compute cycles.  As with any other EC2 instance, you can, of course, stop (‘pause’) these instances to reduce costs while you are testing.  You will still be charged for storage, but that cost is so small (around $5.00/month for the smallest-sized instances) that ‘pausing’ is a good strategy to test out the functionality at minimal cost.  When you are done, be sure to terminate the instances in the EC2 console.

In the second part of the screencast, I test out deploying an ASP.NET application from Visual Studio to AWS Elastic Beanstalk (and also deploying an upgrade).  I used the AWS Visual Studio toolkit to quickly complete the deployment and upgrade.  All in all, the process was simple and powerful.  I am really excited to explore more of this new functionality and will blog as I continue learning.

Enjoy the screencast too.

Posted in AWS, Cloud, Microsoft

First Look – SQL Server on Amazon Web Services RDS

Of course I had to try it out!  Here’s the announcement from @Werner – and the documentation from the AWS site on the new support for SQL Server in AWS RDS.  There is a free usage tier as follows:

“If you are a new Amazon RDS customer, you can get started with Amazon RDS for SQL Server with a Free Usage Tier, which includes 750 hours per month of Amazon RDS micro instances with SQL Server Express Edition, 20GB of database storage and 10 million I/O requests per month.”

Particularly of note is the announcement that in addition to supporting all editions of SQL Server 2008 R2 (i.e. Express, Developer, Standard and Enterprise), AWS intends to support SQL Server 2012 on RDS later this year as well.

I recorded a short screencast to show you start-to-connecting in SSMS with SQL Server on AWS RDS — enjoy!

BTW…take a look at my new class ‘No SQL for the SQL Server Pro’ (May 22 in Anaheim, CA, at a reduced price too!)

Posted in AWS, Cloud, Microsoft, SQL Server 2008

Using Variety to get schema info for MongoDB

I tried out a utility on GitHub called Variety for MongoDB.  As the creator (@JamesCropcho) says of Variety:

“This lightweight tool helps you get a sense of your application’s schema, as well as any outliers to that schema. Particularly useful when you inherit a codebase with data dump and want to quickly learn how the data’s structured. Also useful for finding rare keys.”

I used this tool on my imported ‘NotAdvWorks’ database to take a look at the structure of the Person.Person collection.  I also used the MongoVue tool to verify the output from Variety, which was stored as a new database called ‘varietyResults’.
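If you are wondering what this kind of schema summary boils down to, here is a rough Python sketch of the idea: walk every document, note which keys it has, and report how often each key shows up.  This is not Variety's code or its output format, just my own illustration of the concept.  The function names and sample documents are hypothetical and only loosely modeled on a Person collection.

```python
from collections import Counter

# Hypothetical documents, not the actual NotAdvWorks data.
PEOPLE = [
    {"_id": 1, "FirstName": "Ken", "LastName": "Sanchez"},
    {"_id": 2, "FirstName": "Terri", "LastName": "Duffy", "Title": "Ms."},
    {"_id": 3, "FirstName": "Rob", "Contact": {"Email": "rob@example.com"}},
    {"_id": 4, "FirstName": "Gail", "LastName": "Erickson"},
]


def key_paths(doc, prefix=""):
    """Yield every key in a document, using dotted paths for nested keys."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        yield path
        if isinstance(value, dict):
            yield from key_paths(value, prefix=f"{path}.")


def key_frequencies(documents):
    """Map each key path to (document count, percentage of documents)."""
    counts = Counter()
    for doc in documents:
        counts.update(key_paths(doc))
    total = len(documents)
    return {path: (n, 100.0 * n / total) for path, n in counts.most_common()}


for path, (n, pct) in key_frequencies(PEOPLE).items():
    print(f"{path:15} {n:3} {pct:6.1f}%")
```

Keys that appear in only a small share of documents are the outliers and rare keys that the quote above mentions.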

As I am working on my first couple of production projects with MongoDB, I am really working to get my head around just what ‘schema-less database’ really means for the applications I design. I am also offering a one-day class on May 22, in Anaheim, CA (‘NoSQL for SQL Server Pros’) to share what I am learning as I work with NoSQL in the real world with my ‘relational’ audience.

I recorded a short YouTube video of working with Variety – enjoy.

Posted in Big Data, Cloud, noSQL