Stumbling toward Ha(doop)piness: A Data Analyst's Journey to Making Hadoop Useful

For the past several years, Hadoop has been one of the big things in the “big data” world, but how useful is it for an analyst? This is the story of one analyst who was given responsibility for the care and feeding of an in-house Cloudera Hadoop Distribution cluster, and of his journey toward making it useful for himself and others. He came from a world of analysis where his most common tools were Excel, Access, SQL Server, and Oracle, with a bit of a programming background. His challenge was to figure out how to make the cluster work, and how to do it without screwing up any of the data. This talk will go through some of those challenges and his solutions for them:

  • working with large data sets, small data sets, messy data sets;
  • loading data from text files, from databases, from the web;
  • cleaning up data before it gets on the cluster and afterward;
  • making the data and Hadoop cluster useful for other analysts.

What won’t this talk be about? It won’t be about the “best” way to do something, or the “best” way to set up a Hadoop cluster, or, really, the “best” anything. It will be about the practical difficulties one person experienced and some practical solutions to them.

What Will Attendees Learn from the Presentation?

  • Some of the pitfalls of working with structured data on Hadoop using Hive and Impala
  • Ways to avoid those pitfalls
  • Some techniques for making it easier to load data into a Hadoop table
  • Some techniques for cleaning up data once it’s in Hadoop to make analyses work better

View the presentation, Stumbling toward Ha(doop)piness: A Data Analyst’s Journey to Making Hadoop Useful