Data Science is a young field and it means different things to different people. Here I’ll give my take on what it is and isn’t. My working definition that I’ll try out is:

Data Science is the practice of using data and algorithms to make predictions or prescriptions in an organisation

I’ll flesh that out a bit and give a bit of background.

First I’ll start off with a well-worn Venn diagram and an early attempt to define the field.

A well-worn Venn diagram

[Figure: the data science Venn diagram]

This diagram was made by New York-based data scientist Drew Conway in 2010. It is supposed to represent the three core sets of skills that make up the discipline:

  • Hacking skills - skills in coding and software development, particularly around handling big datasets
  • Maths and statistics knowledge - particularly around machine learning algorithms but covering many aspects of applied maths and statistics
  • Substantive expertise - This term is a bit vague, but basically means some knowledge about the context of the data

According to Conway, and others since, somebody with some combination of all three of these skills could be a data scientist.

At the time this diagram was made there were only a handful of people in the world with the job title ‘Data Scientist’; they mostly worked at the big internet companies in the USA. Over the past eight years their number has increased and the job has evolved.

How to define a field?

I have some issues with that description. It is more a description of a set of skills than a definition of a profession. We don’t describe astrophysics as ‘the intersection of maths skills, telescopes and physics’; it is ‘the branch of astronomy concerned with the physical nature of stars and other celestial bodies’. Likewise, medicine is not ‘the intersection of biology, chemistry and people’ or whatever, but ‘the science or practice of the diagnosis, treatment, and prevention of disease’. A better description of data science would describe the scope and nature of its pursuit.

Here is a rough list of some of the things that I think make up data science:

  • Its subject is human organisations (often businesses, but these could be governments or non-profits)
  • It is concerned with using data to improve these organisations and the products and services they provide
  • It is all about solving problems using data
  • The data analysed is often generated for some other purpose
  • The data is often big, diverse and messy
  • The data is analysed using computer programming
  • The outputs of data science should influence decisions
  • The outputs are generally either predictions or prescriptions (suggestions for courses of action)
  • The outputs can also be insights and analyses
  • The outputs are often software (data products)
  • The outputs are usually the result of some mathematical model

So a key element is the context - we are working in an applied setting, using data and maths to influence decisions and solve problems. An important aspect is that it takes place in a (loosely defined) business context. Data scientists do not generally study natural phenomena or do research purely for its own sake. Many other sciences already exist that do this. The practice is applied - it is directed toward solving business problems. Another key part here is that the outputs of data science should not be passive - they should support a course of action. And it tends to be prospective - it is about solving future problems rather than understanding what happened in the past. The data tends to be big and messy, meaning that data scientists must be skilled in storing, cleaning and handling data.

Very few of these elements are hard and fast and there is a lot of overlap with existing fields. However, a useful way to sharpen the definition is to contrast with other fields:

  • Statistics (tends to be retrospective and concerned with quantifying and modelling uncertainty)
  • Business Intelligence & analytics (tends to be retrospective and concerned with accurately quantifying and presenting business data)
  • Operational research (tends to not be concerned with modelling large diverse messy data and tends to be focused on physical operations)
  • Software development (tends to be prospective - designing and building tools for users, not always using data)
  • Machine Learning (tends to be mainly focused on the algorithm rather than the context and application)
  • Artificial Intelligence (a slippery term, but tends to be semi-autonomous systems comprising sensing, goal seeking and action)

Data science uses elements of all of these fields but is distinct from them, which leads to my definition:

Data Science is the practice of using data and algorithms to make predictions or prescriptions in an organisation

I think that the key part is around making predictions or prescriptions. Data science is forward looking and its outputs ought to be data ‘products’ that combine data in novel ways to automatically suggest optimal courses of action. Producing simple retrospective analysis is business intelligence.
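
To make the distinction concrete, here is a minimal Python sketch of the three kinds of output on the same data. Everything in it is an assumption made for illustration: the sales figures are made up, the function names are hypothetical, and a straight-line trend stands in for whatever model a real project would use.

```python
from statistics import mean

# Hypothetical weekly unit sales for one product (made-up numbers).
weekly_sales = [100, 110, 120, 130, 140]


def retrospective_summary(sales):
    """Business-intelligence style output: describes what already happened."""
    return {"total": sum(sales), "average": mean(sales)}


def predict_next(sales):
    """Prediction: fit a least-squares straight line and extrapolate one week ahead."""
    n = len(sales)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(sales)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def prescribe_order(sales, stock_on_hand):
    """Prescription: turn the forecast into a suggested action (units to reorder)."""
    forecast = predict_next(sales)
    return max(0, round(forecast - stock_on_hand))


print(retrospective_summary(weekly_sales))  # looks backward
print(predict_next(weekly_sales))           # looks forward
print(prescribe_order(weekly_sales, 90))    # suggests what to do
```

The first function only reports on the past, the second estimates next week's sales, and the third converts that estimate into a recommended action. Under the definition above, only the last two count as data science outputs.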

What I’ve jotted down here is really my working definition and probably reflects my biases and background. Please leave comments below with any thoughts on what I might have missed or got wrong.