Background Research

 

What is data science?

Data science is the discipline of creating value from the data one has. Traditional statistics pays little attention to how data are stored and retrieved, so machine learning offers an alternative approach, one based on predictive modeling rather than on making inferences [1].

How does the Data Science process work in today's world?

The activities of Greater Data Science are classified into 6 divisions:

1. Data Exploration and Preparation

Exploration: Data scientists spend much of their time and effort exploring data, both to sanity-check its most basic properties and to uncover surprising features [1]. Such detective work contributes important insights to every data-driven effort [2].
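
For instance, a first exploratory pass in Python with pandas might look like the following minimal sketch; the file name and column names here are hypothetical.

    import pandas as pd

    # Load a (hypothetical) dataset and sanity-check its basic properties.
    df = pd.read_csv("measurements.csv")

    print(df.shape)         # number of rows and columns
    print(df.dtypes)        # column types: are numbers really numeric?
    print(df.describe())    # ranges and summary statistics of numeric columns
    print(df.isna().sum())  # missing values per column

    # Look for surprising features, e.g. unexpected categories or heavy hitters.
    print(df["sensor_id"].value_counts().head(10))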

Preparation: This stage is nowadays often called data cleaning. Many datasets contain anomalies and artifacts [3]. To avoid such issues, values are typically recoded and reformatted, and preprocessing steps such as creating subsets are carried out [1].
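
A minimal cleaning sketch in the same vein, again with hypothetical column names: recoding inconsistent labels, reformatting types, and creating a subset.

    import pandas as pd

    df = pd.read_csv("measurements.csv")  # hypothetical dataset

    # Recode inconsistent category labels into one convention.
    df["status"] = df["status"].str.strip().str.lower().replace({"n/a": None})

    # Reformat: parse timestamps and coerce a numeric column read in as text.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df["reading"] = pd.to_numeric(df["reading"], errors="coerce")

    # Create a subset for analysis: keep only rows whose key fields parsed.
    clean = df.dropna(subset=["timestamp", "reading"])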

2. Data Representation and Transformation

Data arrive from many different sources and under many hardware and software constraints, so a data scientist should know a wide range of formats and be able to adapt among them. Data scientists develop the skill of transforming the original data into new, more revealing forms [1].

Data Scientists develop skills in two specific areas:

  • Modern Databases. Data come in many representations, such as SQL and NoSQL databases, distributed databases, and live data streams, and data scientists need to know the structures, transformations, and algorithms involved in each of these representations [1].

  • Mathematical Representations. Mathematical structures are also used to represent data such as images, sensor readings, and networks. For example, to extract features from acoustic data, one often transforms to the cepstrum or the Fourier transform; for image and sensor data, one uses the wavelet transform or some other multiscale transform (e.g., pyramids in deep learning). Data scientists develop skill with these tools and sound judgement about when to apply each of them.
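
As an illustration, extracting frequency-domain features with the Fourier transform takes only a few lines in NumPy; the signal below is synthetic, and the sampling rate and tone frequencies are arbitrary.

    import numpy as np

    # Synthetic acoustic-style signal: two tones plus noise, sampled at 1 kHz.
    fs = 1000
    t = np.arange(0, 1, 1 / fs)
    signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
    signal += 0.1 * np.random.randn(t.size)

    # Transform to the frequency domain; the spectrum reveals the two tones.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

    # The dominant frequencies can serve as features for downstream modeling.
    peaks = freqs[np.argsort(spectrum)[-2:]]
    print(sorted(peaks))  # approximately [50.0, 120.0]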

3. Computing with Data

Data scientists need to know and use several languages for data analysis and data processing, including popular languages such as R and Python. They develop workflows that divide the work into many jobs to be run sequentially or across many machines, and workflows that document the steps of an individual data analysis or research project. Finally, they develop packages that abstract commonly used pieces of workflow and store them for use in future projects.
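
A toy sketch of such a workflow in Python, dividing the work across processes; the chunk files and the per-chunk summary are hypothetical.

    from multiprocessing import Pool

    import pandas as pd

    def summarize_chunk(path):
        # One job in the workflow: load one chunk and compute a partial summary.
        df = pd.read_csv(path)
        return df["reading"].mean()

    if __name__ == "__main__":
        # Hypothetical file chunks; each is processed as an independent job.
        paths = [f"chunk_{i}.csv" for i in range(8)]
        with Pool(processes=4) as pool:
            partial_means = pool.map(summarize_chunk, paths)
        # Overall mean, assuming equal-sized chunks.
        print(sum(partial_means) / len(partial_means))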

4. Data Modeling

Data scientists use tools and viewpoints from two broad modeling traditions:

  • Generative modeling. One proposes a stochastic model that could have generated the data, and derives methods to reason about the properties of the underlying generative mechanism. This matches traditional academic statistics and its offshoots [4].

  • Predictive modeling. One constructs methods that predict well over some given data universe. This is roughly equivalent to modern machine learning and its industrial offshoots [5].
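
The contrast between the two viewpoints shows up even on a tiny synthetic dataset: a generative analysis fits an assumed stochastic model and reads off inferences about its parameters, while a predictive analysis is judged only by held-out accuracy. A minimal sketch with SciPy and scikit-learn:

    import numpy as np
    from scipy import stats
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 2.0 * x + rng.normal(size=200)  # the assumed generative mechanism

    # Generative viewpoint: posit y = a*x + b + noise and infer the mechanism.
    fit = stats.linregress(x, y)
    print(fit.slope, fit.pvalue)  # estimated slope and its significance

    # Predictive viewpoint: only accuracy on unseen data matters.
    X = x.reshape(-1, 1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))  # R^2 on held-out data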

5. Data Visualization and Presentation

Data scientists design simple plots, using color or symbols to bring out a crucial new factor, and they often crystallize their understanding of a dataset by developing a new plot that codifies it. They also build dashboards for monitoring data processing pipelines that access widely distributed data. Finally, they create visualizations that present the conclusions of a modeling exercise.
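
A minimal matplotlib sketch of the first kind of plot described above, where color encodes a factor that would otherwise be easy to miss; the data are synthetic.

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(1)
    group = rng.integers(0, 2, size=100)  # a hidden two-level factor
    x = rng.normal(size=100)
    y = x + 2 * group + rng.normal(scale=0.5, size=100)

    # Without color the two regimes blur together; coloring by the factor
    # codifies the analyst's understanding of the dataset.
    plt.scatter(x, y, c=group, cmap="coolwarm")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.colorbar(label="group")
    plt.show()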

6. Science about Data Science

Data analysis is arguably the most complicated of all the sciences. 'Science about Data Science' examines what data analysts actually do; for example, it has been argued that the true effectiveness of a tool equals the probability it is deployed times the probability of effective results once deployed [6]. As data analysis and predictive modeling become ever more globally distributed activities, 'Science about Data Science' will clearly grow in significance [1], studying how data analysis as actually practiced affects 'all of science'.

The Next 50 Years of Data Science / The Steps Taken for Responsible Data Science

There are significant clues that let us predict where data science is headed. One problem today is the lack of a standard, common way to share code and data; many ongoing development efforts aim to build standard tools that enable reproducibility [7, 8, 9]. In addition, obtaining data today is often painful: it may involve reading individual papers and manually extracting and compiling results, or web scraping and data cleaning, and all of these strategies are error-prone and time-consuming [1]. This is starting to change, however: the code and data behind individual computational results can be made universally citable and programmatically retrievable, pointing toward a new future world [1]. One proposal is that metadata describing the provenance of a result, permanently associated with a URL, makes the result permanently citable and programmatically retrievable [10, 11]. Over the next 50 years, a wide range of data will be available for evaluating the performance of algorithms across whole classes of situations. Rather than deriving optimal procedures under idealized assumptions within mathematical models, performance will be evaluated empirically, over the entire scientific literature or appropriate subsets of it [1].
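
A minimal sketch of this idea using only the Python standard library: the bytes of a result file are hashed, and the hash plus provenance metadata are bound to a URL under which the result could be cited and retrieved; the file names and resolver URL here are hypothetical.

    import hashlib
    import json

    # Hash the bytes of a result file so that a citation of it is verifiable.
    with open("figure3_results.csv", "rb") as f:  # hypothetical result file
        digest = hashlib.sha256(f.read()).hexdigest()

    # Provenance metadata permanently associated with a URL, in the spirit
    # of [10, 11]; the resolver domain is hypothetical.
    record = {
        "identifier": digest,
        "url": f"https://results.example.org/{digest}",
        "script": "analysis.py",     # code that produced the result
        "data": "measurements.csv",  # input data behind the result
    }
    print(json.dumps(record, indent=2))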


References

[1] Donoho, D. (2015, September). 50 years of Data Science. Tukey Centennial Workshop, Princeton, NJ.

[2] Madigan, D., Stang, P. E., Berlin, J. A., Schuemie, M., Overhage, J. M., Suchard, M. A., … & Ryan, P. B. (2014). A systematic statistical approach to evaluating evidence from observational studies. Annual Review of Statistics and Its Application, 1, 11-39.

[3] Marchi, M., & Albert, J. (2013). Analyzing baseball data with R. CRC Press.

[4] McNutt, M. (2014). Reproducibility. Science, 343(6168), 229.

[5] Pan, Z., Trikalinos, T. A., Kavvoura, F. K., Lau, J., & Ioannidis, J. P. (2005). Local literature bias in genetic epidemiology: an empirical evaluation of the Chinese literature. PLoS Medicine, 2(12), e334.

[6] Peng, R. D. (2009). Reproducible research and biostatistics. Biostatistics, 10(3), 405-408.

[7] Stodden, V., & Miguez, S. (2013). Best practices for computational science: Software infrastructure and environments for reproducible and extensible research.

[8] Stodden, V., Leisch, F., & Peng, R. D. (Eds.). (2014). Implementing reproducible research. CRC Press.

[9] Freire, J., Bonnet, P., & Shasha, D. (2012, May). Computational reproducibility: state-of-the-art, challenges, and database research opportunities. In Proceedings of the 2012 ACM SIGMOD international conference on management of data (pp. 593-596). ACM.

[10] Gavish, M., & Donoho, D. (2012). Three dream applications of verifiable computational results. Computing in Science & Engineering, 14(4), 26-31.

[11] Gavish, M., & Donoho, D. (2011). A universal identifier for computational results. Procedia Computer Science, 4, 637-647.