My objective as a coach at the SkillsUP Lab is always to help our students tell a story with tidy data. The story, be it a solution to a real-world problem or an answer to a specific question, starts with data.
In the academic world, data is spoon-fed to students to demonstrate how to implement a specific method. This creates the misconception that any data in CSV format is immediately usable for data analysis. How many times have you seen this in a presentation about data?
What Is a Tidy Dataset?
Adding the latest popular off-the-shelf intelligent data visualization tool is like rubbing salt in the wound. Feed the data into a black box that produces charts instantly, and all seems well: insights can apparently be drawn from the charts without breaking a sweat. You can visualize anything. That does not mean it is relevant to the question you are trying to answer.
In my experience, there is no such thing as a “tidy and clean dataset” in a real-world dataset. When I say “tidy and clean dataset”, I mean one that is trustworthy, usable and relevant.
The article Tidy Data (PDF download), written by Hadley Wickham and published in the Journal of Statistical Software, describes “Tidy Data” as follows:
"Tidy datasets are all alike but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning)."
What About "Clean Data" Then?
To answer that, we just have to define “dirty data” and reverse the definition to get clean data. Wikipedia describes “dirty data” as “inaccurate, incomplete or inconsistent data”. On top of that, I would add aggregate data that is mixed in with the detail data. It is worth mentioning that a data structure that facilitates reporting is different from one that supports analysis. Depending on what one wants to achieve, one has to restructure one’s data accordingly.
Data downloaded from the internet (especially from CBS) often contains rows of aggregate data. I have repeatedly seen the consequences of using these datasets without understanding how they were structured. Used directly from the source in pivot tables or charts, such data can have disastrous consequences for the business, or at the very least unintended ones.
I will not quote any statistics here because the effort depends on each project’s scope and starting point. If you do not have data to start with, then your effort will go into collecting data. Regardless of your project scope, a clean and trustworthy dataset is a MUST have. This is where you should make a serious investment. Getting to clean and trustworthy data requires data professionals who combine technical skills with common sense, curiosity, attention to detail and critical thinking.
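The risk with aggregate rows can be sketched in a few lines of plain Python. The figures, the column names and the `"Total"` label below are made up for illustration; real downloads mark their aggregate rows in their own way, so the set of labels is an assumption you would have to verify against the source.

```python
# Hypothetical download: the first row is an aggregate of the rows below it.
rows = [
    {"region": "Total", "population": 300},
    {"region": "North", "population": 100},
    {"region": "South", "population": 200},
]

# Assumption: aggregate rows can be recognized by this label.
AGGREGATE_LABELS = {"Total"}


def split_aggregates(rows, key, labels):
    """Separate detail rows from aggregate rows based on a label column."""
    detail = [r for r in rows if r[key] not in labels]
    aggregates = [r for r in rows if r[key] in labels]
    return detail, aggregates


# Summing everything, as a pivot table would, counts each person twice.
naive_sum = sum(r["population"] for r in rows)

detail, aggregates = split_aggregates(rows, "region", AGGREGATE_LABELS)
clean_sum = sum(r["population"] for r in detail)

# Sanity check: the detail rows should add up to the published total.
assert clean_sum == aggregates[0]["population"]
print(naive_sum, clean_sum)
```

Keeping the aggregate rows aside instead of discarding them gives you a cheap check that the detail rows are complete.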
How You Learn 'Tidy Data' at SkillsUP Lab
SkillsUP Lab is a structured, well laid out program. Its objective is for students to gain knowledge of Data Science tools and methodology and to apply that knowledge to bring value to their workplace. You can check our SkillsUP Lab program for details. The core components of SkillsUP Lab are:
- Theoretical and knowledge base: the students are introduced to the theoretical side of Data Science.
- Data Experience Lab (DexLab): the students work through simulated real-life situations to learn how to apply their theoretical knowledge and to sharpen the personality traits mentioned above.
- PowerSkills: the students learn Power Skills to build the team-building, leadership, presentation, conflict resolution and adaptive skills they need for success in the workplace.
I frequently mention that “it is not the garden, but gardening that counts”. Hence, I don’t support the notion that anyone is qualified to be called a Data Scientist just because they have a “Data Science” certificate.
I do agree with the statement “Data Science is what Data Scientist does”. I also agree that those who have earned a “Data Science” certificate have developed skills to work with data. We describe the graduates of the SkillsUP Lab as “Data Professionals” because of the skills they have learned in cleaning, analyzing, visualizing and modelling data. These skills, combined with the knowledge they have gained in other sectors, make them more than capable of applying these highly valuable data skills across many types of organizations and functions.