Technology
There are a great many tools for handling big data. Some examples include Apache Hadoop, NoSQL, Apache Cassandra, business intelligence, machine learning, and MapReduce. These tools deal with some of the three types of big data:[45]
• Structured data: data whose length and format are well defined, such as dates, numbers, or character strings. It is stored in tables. Examples are relational databases and data warehouses.
• Unstructured data: data kept in the raw format in which it was collected, lacking a specific format. It cannot be stored in a table because its information cannot be broken down into basic data types. Examples include PDFs, multimedia documents, emails, and text documents.
• Semi-structured data: data that is not confined to fixed fields but contains markers to separate its elements. The information is too irregular to be managed in a standard way. Such data carries its own semi-structured metadata[46] describing the objects and the relationships between them, which may end up being accepted by convention. Examples include spreadsheets and HTML, XML, or JSON files.
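The distinction between the three types can be sketched in a few lines. The CSV row below stands in for structured data (fixed schema, table-shaped), while the JSON document stands in for semi-structured data, whose keys act as the self-describing markers mentioned above. The field names and values are illustrative only.

```python
import csv
import io
import json

# Structured data: fixed schema, fits naturally into a table.
structured = io.StringIO("id,name,signup_date\n1,Ana,2024-01-15\n2,Luis,2024-02-03\n")
rows = list(csv.DictReader(structured))

# Semi-structured data: no fixed fields, but markers (keys) separate the elements,
# and nesting depth can vary from record to record.
semi = json.loads('{"user": {"name": "Ana", "tags": ["vip", "beta"]}}')

print(rows[0]["name"])       # field defined by the table schema
print(semi["user"]["tags"])  # field defined by the document itself
```

Unstructured data (a PDF or a video) would fit neither pattern: it has no markers that map its contents onto basic data types.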
A 2011 report by the McKinsey Global Institute characterizes the main components and the big-data ecosystem as follows:[47]
• Techniques for analyzing data, such as A/B testing, machine learning, and natural language processing.
• Big-data technologies, such as business intelligence, cloud computing, and databases.
• Visualization, such as charts, graphs, and other displays of the data.
Multidimensional big data can also be represented as data cubes or, mathematically, as tensors. Array database systems have set out to provide storage and high-level query support for this type of data. Additional technologies applied to big data include efficient tensor-based computation,[48] such as multilinear subspace learning,[49] massively parallel processing (MPP) databases, search-based applications, data mining,[50] distributed file systems, distributed databases, cloud and HPC-based infrastructure (applications, storage, and computing resources),[51] and the Internet. Although many approaches and technologies have been developed, carrying out machine learning on big data remains difficult.[52]
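A data cube is simply a measure indexed by several dimensions; a roll-up query aggregates along one of them. The sketch below uses a plain dictionary as a toy three-dimensional cube (the dimension names and values are hypothetical), which is the kind of operation an array database would execute at scale.

```python
from collections import defaultdict

# A tiny "data cube": one measure indexed by three dimensions.
# All keys and values here are illustrative, not real data.
cube = {
    ("2024-Q1", "north", "widget"): 120,
    ("2024-Q1", "south", "widget"): 80,
    ("2024-Q2", "north", "gadget"): 50,
}

# Roll-up: aggregate the measure along the "quarter" dimension,
# collapsing the other two axes of the cube.
by_quarter = defaultdict(int)
for (quarter, region, product), value in cube.items():
    by_quarter[quarter] += value

print(dict(by_quarter))
```

The same cube could equally be stored as a dense 3-D array (a rank-3 tensor), which is what makes tensor-based methods such as multilinear subspace learning applicable.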
Some MPP relational databases can store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of large data tables in the RDBMS.[53]
DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi.[54]
Practitioners of big-data analytics processes are generally hostile to slower shared storage,[55] preferring direct-attached storage (DAS) in its various forms, from solid-state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures, the storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are inconsistent with big-data analytics systems, which thrive on system performance, commodity infrastructure, and low cost.
Delivering information in real or near-real time is one of the defining characteristics of big-data analytics, so latency is avoided whenever possible. Data in memory is good; data on a spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques.
There are advantages and disadvantages to shared storage in big-data analytics, but as of 2011 big-data analytics practitioners did not favor it.
Capture
Where does all this data come from? We generate it, directly and indirectly, second after second. An iPhone today has more computing capacity than NASA had when humans reached the Moon,[56] so the amount of data generated per person per unit of time is very large. We catalog the origin of the data according to the following categories:[57]
• Generated by people themselves. Sending emails or WhatsApp messages, posting a status on Facebook, publishing work relationships on LinkedIn, tweeting content, or answering a street survey are things we do daily that create new data and metadata that can be analyzed. It is estimated that every minute more than 200 million emails are sent, more than 700,000 pieces of content are shared on Facebook, two million searches are performed on Google, and 48 hours of video are uploaded to YouTube.[58] Tracking usage in an ERP system, adding records to a database, or entering information into a spreadsheet are other ways of generating this data.
• Obtained from transactions. Billing, loyalty cards, telephone calls, cell-tower connections, public Wi-Fi access, credit-card payments, and transfers between bank accounts all generate information that, once processed, can become relevant data. Take bank transactions: what the user sees as a deposit of X euros, the system captures as an action carried out on a specific date and time, in a specific place, between registered users, and with certain metadata.
• Electronic and web marketing. A large amount of data is generated while browsing the Internet. With Web 2.0, the webmaster-content-reader paradigm has been broken, and users themselves become content creators through their interaction with the site. Many tracking tools are used, mostly for marketing and business analysis. Mouse movements are recorded in heat maps, and there is a record of how long we spent on each page and when we visited it.
• Obtained from machine-to-machine (M2M) interactions. This is data collected as metrics from devices (meters; temperature, light, height, pressure, and sound sensors...) that transform physical or chemical magnitudes into data. Such devices have existed for decades, but the arrival of wireless communications (Wi-Fi, Bluetooth, RFID, etc.) has revolutionized the world of sensors. Examples include GPS in the automotive industry, vital-signs sensors (very useful for life insurance), festival wristbands,[59] monitors of car operation and driving (which yield very useful information for insurers),[60] and smartphones (which are location sensors).
• Biometric data. It generally comes from security, defense, and intelligence services.[61] This is data generated by biometric readers such as retinal scanners, fingerprint readers, or DNA sequence readers. Its purpose is to provide security mechanisms, and it is usually guarded by defense ministries and intelligence departments. One application is matching DNA from a crime-scene sample against a sample in a database.
Transformation
Once the sources of the necessary data have been found, we will most likely have countless unrelated source tables. The next objective is to gather the data in one place and give it an appropriate format.
This is where extract, transform, and load (ETL) platforms come into play. Their purpose is to extract data from different sources and systems, perform transformations (data conversions, cleaning of dirty data, format changes, etc.), and finally load the data into the specified database or data warehouse.[62] An example of an ETL platform is Pentaho Data Integration, more specifically its Spoon application.
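The three ETL stages can be sketched end to end with the standard library. This is a minimal illustration, not how Pentaho works internally: an in-memory CSV stands in for the source system, SQLite stands in for the data warehouse, and the field names and values are invented for the example.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (an in-memory CSV with one dirty row).
raw = io.StringIO("name,amount\nAna, 10 \nLuis,\nEva,7\n")
rows = list(csv.DictReader(raw))

# Transform: trim whitespace, drop rows with missing amounts, convert types.
clean = [(r["name"].strip(), int(r["amount"]))
         for r in rows if r["amount"].strip()]

# Load: insert into the target store (SQLite stands in for the warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (name TEXT, amount INTEGER)")
db.executemany("INSERT INTO payments VALUES (?, ?)", clean)
total = db.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)
```

Real ETL platforms add scheduling, error handling, and connectors for many sources, but the extract-transform-load skeleton is the same.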
NoSQL Storage
The term NoSQL stands for Not Only SQL and refers to storage systems that do not follow the entity-relationship schema.[63] They provide a much more flexible and concurrent storage model and allow large amounts of information to be manipulated much faster than relational databases.
We distinguish four large groups of NoSQL databases:
• Key-value storage: data is stored much like maps or dictionaries, where data is accessed by a unique key.[64] The values (data) are isolated and independent of one another and are not interpreted by the system. They can be simple variables such as integers or characters, or objects. This storage system lacks a fixed, established data structure, so it does not require very strict data formatting.[65]
They are useful for simple key-based operations. One example is speeding up the loading of a website that serves different user profiles, with the files to be included mapped by user ID and computed in advance. Apache Cassandra is the key-value storage technology best known to users.[66]
• Document storage: document databases closely resemble key-value databases, differing in the data they store. Where the former required no specific data structure, here we store semi-structured data.[66] This data is called documents and can be formatted as XML, JSON, Binary JSON, or whatever the database accepts.
CouchDB and MongoDB[66] are perhaps the best known. Special mention goes to MapReduce, a Google technology initially designed for its PageRank algorithm, which lets you select a subset of data, group or reduce it, and load it into another collection; and to Hadoop, an Apache technology designed to store and process large amounts of data.
• Graph storage: graph databases break with the idea of tables and are based on graph theory, where the information is the nodes and the relationships between pieces of information are the edges,[66] somewhat like the relational model. Their greatest use is in relating large amounts of highly variable data. For example, nodes can contain objects, variables, and attributes that differ from one another. JOIN operations are replaced by traversals through the graph, and an adjacency list of the nodes is kept.[64] Social networks provide an example: on Facebook each user is a node, which can have friendship edges with other users or publication edges with content nodes. Solutions such as Neo4J and GraphDB[66] are the best known graph databases.
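The adjacency list and the traversal-instead-of-JOIN idea can be shown in a few lines. The sketch below models a toy friendship graph (the user names are invented) and answers "who is reachable from this user?" with a breadth-first traversal, the kind of operation a graph database optimizes.

```python
from collections import deque

# Adjacency list, as a graph database stores it: each node maps to its edges.
# The users and friendships here are purely illustrative.
friends = {
    "ana":  ["luis", "eva"],
    "luis": ["ana"],
    "eva":  ["ana", "marc"],
    "marc": ["eva"],
}

def reachable(graph, start):
    """Breadth-first traversal: the graph-database counterpart of chained JOINs."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(sorted(reachable(friends, "ana")))
```

In a relational database the same query would need one self-JOIN per hop; the traversal handles an arbitrary number of hops naturally.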
Data analysis
Analysis lets you look at the data and explain what is happening. With the necessary data stored across different storage technologies, we will find that we need different data-analysis techniques, such as the following:
• Association: finds relationships between different variables.[67] Under the premise of causality, the aim is to predict the behavior of other variables. Such relationships drive, for example, cross-selling systems in e-commerce.
• Data mining: aims to find predictive behaviors. It encompasses the set of techniques that combine statistical and machine-learning methods with database storage.[68] It is closely related to the models used to discover patterns in large amounts of data.
• Clustering: cluster analysis is a type of data mining that divides large groups of individuals into smaller groups whose similarity was unknown before the analysis.[68] The purpose is to find similarities between these groups and to discover new ones, learning which qualities define them. It is an appropriate methodology for finding relationships among results and making a preliminary assessment of the structure of the analyzed data. There are different clustering techniques and algorithms.[69]
• Text analytics: much of the data people generate is text, such as emails, web searches, or content. This methodology extracts information from that data in order to model topics and issues or predict words.[70]
• Topological data analysis (TDA): aims to analyze the geometric and topological structure of the data. It has been developed since the 1990s using algebraic-topology tools such as persistent homology.[71] It has proven useful for clustering certain data and for analyzing oncological data, being capable of predicting treatment response and generating diagnoses.[72]
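Clustering is the most self-contained of the techniques above, so here is a minimal sketch of one classic algorithm, k-means, restricted to one dimension to keep it short. The data points and starting centroids are invented; real cluster analysis works on many dimensions and uses tuned libraries rather than this loop.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its group, and repeat."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Recompute centroids from the groups (skip any that went empty).
        centroids = [sum(g) / len(g) for g in clusters.values() if g]
    return sorted(centroids)

# Two obvious groups, unknown to the algorithm beforehand.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
result = kmeans_1d(points, [0.0, 5.0])
print(result)  # two centroids, one near each group
```

The output illustrates the point made above: the algorithm recovers group structure we did not label in advance, which is why clustering is used for preliminary exploration of a data set.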
Data visualization
As Spain's National Institute of Statistics says in its tutorials, "a picture is worth a thousand words, or a thousand data points."[73] The mind appreciates a well-structured presentation of statistical results in graphs or maps far more than tables of numbers and conclusions. In big data we go one step further: paraphrasing Edward Tufte, one of the world's most recognized experts in data visualization, "the world is complex, dynamic, multidimensional; paper is static and flat. How are we to represent the rich visual experience of the world on the mere flat plane?"
Mondrian[74] is a platform for visualizing information through the analyses carried out on the data we hold. With this platform we aim at a more specific audience and a more limited use, such as an organization's comprehensive scorecard. In recent years other platforms, such as Tableau, Power BI, and Qlik, have become widespread.[75]
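The core idea that a structured graphic is read faster than a table of numbers needs nothing more than a text bar chart to demonstrate. The labels and values below are invented for the illustration; dedicated platforms of the kind just mentioned produce far richer, interactive versions of the same mapping from values to marks.

```python
# Hypothetical aggregated results; labels and values are illustrative only.
data = {"north": 120, "south": 80, "east": 50}

# Render a text bar chart: one '#' per 10 units, label right-aligned.
lines = [f"{label:>5} | {'#' * (value // 10)} ({value})"
         for label, value in data.items()]
print("\n".join(lines))
```

Even in this crude form, the relative magnitudes are visible at a glance, which is exactly what a table of the same three numbers does not give you.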
Infographics, meanwhile, have become a viral phenomenon: they collect the results of different analyses of our data in attractive, entertaining, and simplified material for mass audiences.[76]