Data science is an interdisciplinary field that uses mathematics, statistics, scientific computing, the scientific method, engineering processes and algorithms to obtain (collect or extract), process, analyze and report on noisy, structured and unstructured data.[1] It is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow or a profession.[2]
Data science integrates knowledge of the underlying application domain (e.g., applied economics, marketing research, finance, operations research, medicine, information technology, natural sciences)[3] with statistics, data analysis, computer science, mathematics and their related methods in order to understand and analyze "real" phenomena with data.[4] It uses techniques and theories drawn from many fields, including mathematics, statistics, computer science, information science, information technology and domain knowledge.[5] However, data science is distinct from computer science, statistics and information science. Turing Award winner Jim Gray envisioned data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and stated that "everything about science is changing due to the impact of information technology" and the deluge of data.[6][7]
A data scientist is a professional who, by writing and applying code and drawing on knowledge of statistics, works on data collection, data cleaning, data exploration, data modeling, data visualization, the implementation of machine learning solutions and the interpretation of results.[8] Data scientists come from a range of professions and backgrounds: mathematicians, engineers, economists, actuaries, physicists, chemists, and sometimes fields that may seem more distant, such as medicine.
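The stages listed above (collection, cleaning, exploration, modeling) can be illustrated with a minimal sketch. It uses only the Python standard library; the records, the field names `ad_spend` and `sales`, and the helper functions are hypothetical and chosen purely for illustration, and the simple least-squares line stands in for whatever model a real project would call for.

```python
import statistics

# Hypothetical raw records, as they might arrive from a collection step.
raw_records = [
    {"ad_spend": "10", "sales": "25"},
    {"ad_spend": "20", "sales": "45"},
    {"ad_spend": "", "sales": "30"},       # missing value
    {"ad_spend": "30", "sales": "65"},
    {"ad_spend": "40", "sales": "n/a"},    # unparseable value
    {"ad_spend": "50", "sales": "105"},
]


def clean(records):
    """Data cleaning: keep only rows whose two fields parse as numbers."""
    rows = []
    for rec in records:
        try:
            rows.append((float(rec["ad_spend"]), float(rec["sales"])))
        except ValueError:
            continue  # drop rows with missing or malformed values
    return rows


def explore(rows):
    """Data exploration: basic summary statistics for each column."""
    xs, ys = zip(*rows)
    return {
        "n": len(rows),
        "mean_ad_spend": statistics.mean(xs),
        "mean_sales": statistics.mean(ys),
    }


def fit_line(rows):
    """Data modeling: least-squares fit of sales = slope * ad_spend + intercept."""
    xs, ys = zip(*rows)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rows = clean(raw_records)
summary = explore(rows)
slope, intercept = fit_line(rows)
print(summary)
print(f"sales ~ {slope:.2f} * ad_spend + {intercept:.2f}")
```

In this toy data set the two malformed rows are discarded during cleaning, and the remaining four points lie exactly on a line, so the fitted model reproduces them without error. Real data would rarely be this tidy, which is why interpretation of results is listed as a separate step.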
History
In 1962, John W. Tukey anticipated the term "data science" in his article "The Future of Data Analysis", which described an evolution of mathematical statistics. In it, he defined data analysis as: "Procedures for analyzing data, techniques for interpreting the results of said procedures, ways of planning the collection of data to make its analysis easier, more precise or accurate, and all the machinery and results of mathematical statistics that are applied to data analysis."
Many regard data science as a recently created discipline, but the term was first used by the Danish scientist Peter Naur in the 1960s as a substitute for computer science. In 1974 he published the book Concise Survey of Computer Methods,[10] in which the concept of data science is used extensively and which allowed the term to be used more freely in the academic world.
In 1977, the International Association for Statistical Computing (IASC) was established as a section of the International Statistical Institute (ISI). "It is the mission of the IASC to relate traditional statistical methodology, modern computer technology, and expert knowledge of the subject, to convert data into information and knowledge."[11]
In 1996 the term "data science" appeared for the first time in the title of a conference, "Data Science, Classification and Related Methods", held at a meeting of the International Federation of Classification Societies (IFCS) in Kobe, Japan.[11] In 1997, C. F. Jeff Wu gave a talk called "Statistics = Data Science?", in which he described statistical work as a trilogy made up of data collection, data analysis and modeling, and decision making, and called for statistics to be renamed data science and statisticians to be called data scientists.[12]
In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to include advances in computing with data in his article "Data science: an action plan for expanding the technical areas of the field of statistics." Cleveland established six technical areas that he believed would make up the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.[13]
In April 2002, the International Council for Science's Committee on Data for Science and Technology (CODATA) began publishing the Data Science Journal,[14] focused on problems such as the description of data systems, their publication on the Internet, their applications and their legal issues. Shortly after, in January 2003, Columbia University began publishing The Journal of Data Science,[15] which offered a platform for all data professionals to present their perspectives and exchange ideas.
In 2005, the National Science Board published "Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century", defining data scientists as "computer and information scientists, database and software programmers, and disciplinary experts, [...] who are crucial to the successful management of a digital collection of data, whose primary activity is to conduct creative research and analysis."[16]
In 2008, Jeff Hammerbacher and DJ Patil reused the term to describe their own work at Facebook and LinkedIn, respectively.[17]
In 2009, researchers Yangyong Zhu and Yun Xiong of the Research Center for Dataology and Data Science published "Introduction to Dataology and Data Science", in which they state that "unlike natural sciences and social sciences, Dataology and Data Science take data on the Internet as their object of study."[11]
In 2013 the IEEE Task Force on Data Science and Advanced Analytics was launched,[18] and the first IEEE International Conference on Data Science and Advanced Analytics was held in 2014.[19] In 2015, Springer launched the International Journal on Data Science and Analytics to publish original work in data science and big data analytics.[20]
Applications
Marketing
In September 1994, BusinessWeek published the article "Database Marketing", which stated that companies collect large amounts of information about customers and analyze it to predict how likely they are to purchase a product. According to the article, this knowledge is used to craft a precisely calibrated marketing message aimed at getting the individual to buy. It also explains that, in the 1980s, the enthusiasm sparked by the spread of barcode readers ended in widespread disappointment, as many companies were too overwhelmed by the sheer volume of data to do anything useful with their customers' information. Even so, many companies believed that they had no choice but to venture into the database marketing frontier and further develop the necessary technologies.[21]
In 2014, the Swedish music streaming company Spotify purchased The Echo Nest, a company specialized in music data science. The Echo Nest is now in charge of storing and analyzing the information of Spotify's 170 million users.[22] With its help, in 2015 Spotify launched Discover Weekly, a personalized service that each week recommends a selection of songs that might interest the user, using algorithms that analyze the music listened to and the search history of the past week. The service was generally well received[23] and is now regarded as a strong selling point against the company's competitors.[24]
Netflix, the North American streaming company, offers its more than 120 million users a platform that algorithmically analyzes their consumption habits to identify the content they are looking for and determine what new content may interest them. Todd Yellin, vice president of product at Netflix, explained that the stored data ranges from the time of day at which users connect and how much time they spend on the platform to their list of recently viewed content, including the specific order in which it was viewed. All of the stored information is analyzed to learn about the user and provide accurate recommendations.[25]
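The general idea of recommending content from consumption histories can be illustrated with a small item co-occurrence sketch. This is not the method used by Spotify or Netflix, whose systems are not described in the sources cited here; the user names, item identifiers and the functions `build_cooccurrence` and `recommend` are hypothetical and serve only to show how past behavior can be turned into a ranked list of suggestions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical consumption histories: user -> set of item identifiers.
histories = {
    "ana": {"song_a", "song_b", "song_c"},
    "ben": {"song_a", "song_b"},
    "cho": {"song_b", "song_c", "song_d"},
    "dev": {"song_a", "song_d"},
}


def build_cooccurrence(histories):
    """Count how often each ordered pair of items appears in the same history."""
    counts = Counter()
    for items in histories.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(user, histories, counts, k=2):
    """Rank items the user has not consumed by co-occurrence with consumed ones."""
    seen = histories[user]
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    # Sort by score (descending), then by name so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


counts = build_cooccurrence(histories)
print(recommend("ben", histories, counts))  # ['song_c', 'song_d']
```

Here "ben" has listened to song_a and song_b, and song_c appears alongside those two items more often than song_d does, so it is ranked first. Production systems combine many more signals, such as the time of day and viewing order mentioned above.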
Governance
In Latin America, the Inter-American Development Bank (IDB) has carried out exploratory studies on the use of data science in the design and implementation of public policies in the region, drawing on cases from countries such as Argentina and Brazil and presenting recommendations for their implementation and maintenance.
The studies cover topics such as sustainable urban mobility, smart cities, security, and data ownership and privacy. One of the suggestions presented in the research is to achieve "public value intelligence", which "has the potential to be a strategic component for decision-making and the design, implementation and evaluation of public policies." Another is to use the field to improve the accountability of governments to citizens and to promote progress in data curation within public institutions.[26]
Data science and Big data
Literally, big data refers to volumes of data so large that they cannot be processed effectively with the traditional applications currently in use.[27] According to the Amazon Web Services guide, big data is a collection of data that is difficult to store in traditional databases, to process on standard servers and to analyze with common applications.
The term is usually linked to data science because big data is usually its source of information for analysis: data science analyzes large sets of messy and incomplete data to arrive at findings that drive decisions about operations and products.
Data Scientist
People who work in data science are known as data scientists. The Master in Data Science project defines the data scientist as a mix of statistician, computer scientist, mathematician and creative thinker, with the following skills:
The process a data scientist follows to answer the questions posed can be summarized in these steps:
Nathan Yau, who holds a PhD in statistics, put it as follows: the data scientist is a statistician who should learn application programming interfaces (APIs), databases and data extraction; a designer who should learn to program; and a computer scientist who should know how to analyze data and find meaning in it.[29]
In his doctoral thesis, Benjamin Fry explained that the process of understanding data begins with a set of numbers and the goal of answering questions about the data. Each phase of the process he proposes (acquire, parse, filter, mine, represent, refine and interact) requires different specialized approaches that contribute to a better understanding of the data. The specialists Fry mentions include systems engineers, mathematicians, statisticians, graphic designers, information visualization specialists and specialists in human-computer interaction (HCI). Fry also argued that having separate specialized approaches, far from solving the problem of understanding data, becomes part of the problem: each specialty tackles the problem in isolation, and something can be lost on the way to the solution at each transition in the process.[30]
On his website, Drew Conway uses a Venn diagram to explain the main skills that give data science its shape, and how those skill sets relate to one another.
The importance of a data scientist
Data science has recently become very important as an emerging discipline and profession (data scientist), and has become the focus of attention of more and more organizations worldwide. Google's chief economist Hal Varian remarked that "The sexiest job in the next 10 years will be being a statistician", words on which Thomas H. Davenport reflected in his 2012 article "Data Scientist: The Sexiest Job of the 21st Century",[31] where he describes the profile of the data scientist as a hybrid of data hacker, analyst, communicator and trusted advisor, an extremely powerful and rare combination. Davenport also points out that the data scientist is not comfortable "on a short leash", as the colloquial expression goes; that is, data scientists must have the freedom to experiment and explore possibilities. In the same article, Davenport presents a decalogue on how to find the data scientist that an organization needs (see page 74 of the article).
The report published by McKinsey in 2011[32] estimated that, in a world of big data, demand for expert data analysis talent could reach 440,000 to 490,000 jobs by 2018.
Among the technological challenges we face we highlight:
[3] Danyluk, A.; Leidig, P. (2021). "Computing Competencies for Undergraduate Data Science Curricula". ACM Data Science Task Force Final Report. https://dstf.acm.org/DSTF_Final_Report.pdf
[4] Hayashi, Chikio (1 January 1998). "What is Data Science? Fundamental Concepts and a Heuristic Example". In Hayashi, Chikio; Yajima, Keiji; Bock, Hans-Hermann; Ohsumi, Noboru; Tanaka, Yutaka; Baba, Yasumasa, eds. Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer Japan. pp. 40-51. ISBN 9784431702085. doi:10.1007/978-4-431-65950-1_3. https://www.springer.com/book/9784431702085
[9] Tukey, John W. (March 1962). "The Future of Data Analysis". The Annals of Mathematical Statistics 33 (1): 1-67. ISSN 0003-4851. doi:10.1214/aoms/1177704711. Retrieved 1 October 2018. https://projecteuclid.org/euclid.aoms/1177704711
[10] Naur, Peter (1974). Concise Survey of Computer Methods. Petrocelli Books. ISBN 91-44-07881-1.
[13] Cleveland, W. S. (2001). "Data science: an action plan for expanding the technical areas of the field of statistics". International Statistical Review / Revue Internationale de Statistique. pp. 21-26.
[14] "Data Science Journal". Available Volumes. Japan Science and Technology Information Aggregator, Electronic. April 2012. http://www.jstage.jst.go.jp/browse/dsj/_vols
[16] National Science Board (2005). "US NSF - NSB-05-40, Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century". www.nsf.gov. National Science Foundation. Retrieved 3 February 2017. http://www.nsf.gov/pubs/2005/nsb0540/
[26] "El uso de datos masivos y sus técnicas analíticas para el diseño e implementación de políticas públicas en Latinoamérica y el Caribe (2017)". Inter-American Development Bank. Retrieved 29 November 2018. https://publications.iadb.org/handle/11319/8485
[30] Fry, Benjamin (April 2004). "Thesis proposal: Computational Information Design". Retrieved 24 September 2015. http://benfry.com/phd/dissertation-110323c.pdf
[31] Davenport, Thomas H.; Patil, D. J. (2012). "Data Scientist: The Sexiest Job of the 21st Century". Harvard Business Review.