Channel: Nectarine Imp» big data

Data Scientists; Thy Name Bewilders Some, Still

Data Science and underground fighting clubs

There is an odd and perplexing problem we all face, sometimes daily: what name to give to something. The perplexity comes when that thing blurs normal boundaries and is neither fish nor fowl. Today on the Interwebs we have various folks wondering if Data Scientist isn’t just a souped-up name for statistician or programmer. Having now had a year’s experience wearing the hat named Data Scientist, I shall let you in on the secrets of our guild so that you may more fully appreciate whether we are indeed something new or just a relabeling of old dog food. Yes, I am breaking the first rule of the fisticuffs guild, oh, uh, I mean the Club of Data Scientists.

“You can’t just arrogantly assume you have a solution without figuratively getting your hands stained with the ink of data.”

The main measures that differentiate a Statistician from a Programmer are presumed to be knowledge and practical experience of statistics on one side and knowledge and practical experience of software engineering on the other. Both are diverse disciplines of study, and many people who claim to be one or the other are poor at their skill. However, being employed in such a position, they have the unassailable proof of credentials to such a title, so we can forget that for a moment. There is often a connection between the two disciplines in the science of Math (though some suggest that even Math itself doesn’t exist). However, others argue that while both sides use Math, very few make statements upon Math. In other words, they don’t extend the science. Fair point, and one that few people really appreciate. However, we can clearly state that, regardless of possible connections, the two disciplines are completely different. A statistician who uses R to create programs is not a programmer, any more than a programmer who links their program to org.apache.commons/commons-math3 to calculate a cumulative probability is a statistician.
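To make that last point concrete, here is a minimal sketch of how little code a cumulative probability takes. It is written in Python with only the standard library rather than the Java library named above, and the function name `normal_cdf` and its parameters are illustrative assumptions, not anything taken from commons-math3. It covers only the normal distribution.

```python
import math


def normal_cdf(x, mean=0.0, std_dev=1.0):
    """Return P(X <= x) for a normal distribution (hypothetical helper)."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    # Standard identity: CDF(x) = 0.5 * (1 + erf((x - mean) / (std_dev * sqrt(2))))
    z = (x - mean) / (std_dev * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


if __name__ == "__main__":
    # Half of the probability mass lies at or below the mean.
    print(normal_cdf(0.0))
    # Cumulative probability for a test score of 110 when mean=100, std_dev=15.
    print(normal_cdf(110.0, mean=100.0, std_dev=15.0))
```

Calling the function is the easy part. Knowing whether a normal distribution is a reasonable model for the data in the first place is where the statistics comes in.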

I have the pleasure of working with genius software engineers. I don’t say that lightly. These folks are the best engineers I’ve worked with. And yet they are as intimidated by the application of probability theory as I am by setting up Datomic. I myself have a lot of experience, but I am a slow programmer. I would be a B+ student at best if evaluated for my programming skills, especially compared to my team. However, I started off as a Physics major back in the day. While I ultimately decided I disliked the academic world and left science for consulting, I’ve always been rooted in a math-focused world. My work has been about understanding data for decades now. It started with databases. Then OLAP data mining. Then web-scale data mining. When I got recruited to work with the Army on advanced intelligence projects, the focus was the modern trend of statistical AI. I’ve been very comfortable with it, and in that position I finally returned to my roots as a scientist. However, I had drifted as a software engineer. When I went out on my own, I rediscovered my engineering roots, but I also found that this new enlightenment I had received on Big Data was incredibly hard to explain to the IT crowd. The reason was that they just were not used to discussing things in terms of heuristics or probabilities, nor did they otherwise have much use for machine learning and ontologies. Many of them fully understood Taxonomy as a concept, and many programs have de facto taxonomies as part of their design. However, non-deterministic systems were generally alien concepts.

Flash forward to today: any business that has big data problems to solve, or that is trying to get a grasp of the ever-unfolding growth of unstructured text, is looking to hire people who specifically understand the problems and their solutions. Experience is key. A proven base of experience in this field is essential, not just to be hired but to have success. I have found that big data problems do not yield to guesswork and gut instinct. You can’t just arrogantly assume you have a solution without figuratively getting your hands stained with the ink of data. Once in, you need the grounding in statistical science to understand what potential solutions there are, and then the software engineering experience to think of how to solve the problem with computation.

And that is just for times when problems are handed to you. A good data scientist knows how to identify issues that the uninformed just don’t see. I work with some of the brightest lights in the field of legal discovery. It’s a daily process I go through, helping them understand the art of the possible in terms of applying machine learning and statistical analysis to data to make the job of legal discovery better (easier, more precise and less reliant on brilliant guesswork). I have this vision only because I’ve already been faced with some really tough problems to solve in the past 10 years. It helps that I already had a habit of applying scientific process. My approach is different from the one they would take. My understanding of how machine learning works comes from having seen it in action and having read a lot about it. It’s not needed to set up a server or a web site or to program an iOS game, and so most programmers think it’s neat but don’t actually use it or know much about it.

So, no, I strongly reject the notion that all data scientists are just souped-up statisticians or programmers. Of course there are many who claim to be data scientists but really are not. If you think you are one and believe the Harvard Business Review’s opinion that it makes you sexy, you are simply deluded. A real data scientist realizes their contribution is only one part of a fully functional system and that its application is for a small class of problems. It’s a tricky position to be in. You require skills and process from two very different disciplines in order to be good. You require what I consider to be a deep theoretical background (and a deep experiential background helps!) in order to see the problems and persistently be able to at least attempt solutions to them. The scientist in you recognizes that your first ideas may not work. While there are solutions to problems that no one else is thinking of, you probably could not do what is necessary to implement everything needed as elegantly as is possible with a team. My co-workers would be hobbled having to catch up on all the theoretical knowledge I have, while I would be hobbled trying to perfect my practical ability to their level. Would it help to have an actual statistician on our team? Most likely, because they would have even deeper theoretical knowledge on how to prove our models are working, but they would need the background knowledge to speak to the engineers to get their solutions turned into software… and if they had it, they would be a data scientist! Would just a Statistician or just an Ace Programmer be able to create what we are creating as a team? Absolutely not. So, my initial distrust of the concept of Data Scientist has given way to the realization that I am in a niche that is broadly defined but which is required to create a new class of application for new methodologies supporting existing endeavors or making new endeavors possible. Not every person with the Data Scientist hat is going to pass everyone’s sniff test for what they think a data scientist is. However, the discipline exists and the need for it exists. There are thousands of other disciplines that are a mix of two singular disciplines: Computational Biologist, Dog Behaviour Specialist, Paranormal Investigator. Ok, I can’t come up with any good ones right now, but there are a lot of situations where dual disciplines are necessary and often they must be embodied in one specialist. I’ll leave it up to YOU, dear reader, to fill the comments with better sample evidence. Damn, I’m lazy.

My conclusion to all of this is: if you think you need to hire a data scientist, consider what it is about what you are doing that requires machine learning, statistics and programming. It is not always easy to know that that is the most efficient approach. If you are a statistician or an ace programmer and you want to be hired as a Data Scientist, consider the implications of what you will need in terms of programming skills, theoretical background, statistical skills and an appreciation of the actual problems you might face in that role. I LOVE my job. However, it isn’t for everyone. I love theory and reading papers and trying to solve problems with machine learning methods. I think that passion is needed. That’s true for most things. Passion for continuous learning of statistical techniques and machine learning is what I am talking about here. If you are unemployed, don’t even bother. No one is hiring unemployed data scientists. However, that doesn’t mean you can’t start developing the skills you need and transition from another job.

