Technology
There are a great many tools for handling big data. Some examples include Apache Hadoop, NoSQL, Apache Cassandra, business intelligence, machine learning, and MapReduce. These tools deal with some of the three types of big data:[45]
• Structured data: data whose length and format are well defined, such as dates, numbers, or character strings. It is stored in tables. Examples are relational databases and data warehouses.
• Unstructured data: data kept in the raw format in which it was collected, lacking a specific format. It cannot be stored in a table because its information cannot be broken down into basic data types. Examples include PDFs, multimedia documents, emails, and text documents.
• Semi-structured data: data that is not confined to fixed fields but contains markers to separate its elements. The information is too irregular to be managed in a standard way. Such data carries its own semi-structured metadata[46] describing the objects and the relationships between them, which may end up being accepted by convention. Examples include spreadsheets and HTML, XML, or JSON files.
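The distinction between the three types can be sketched in a few lines. The CSV row below stands in for structured data (fixed schema, table-shaped), while the JSON document stands in for semi-structured data, whose keys act as the self-describing markers mentioned above. The field names and values are illustrative only.

```python
import csv
import io
import json

# Structured data: fixed schema, fits naturally into a table.
structured = io.StringIO("id,name,signup_date\n1,Ana,2024-01-15\n2,Luis,2024-02-03\n")
rows = list(csv.DictReader(structured))

# Semi-structured data: no fixed fields, but markers (keys) separate the elements,
# and nesting depth can vary from record to record.
semi = json.loads('{"user": {"name": "Ana", "tags": ["vip", "beta"]}}')

print(rows[0]["name"])       # field defined by the table schema
print(semi["user"]["tags"])  # field defined by the document itself
```

Unstructured data (a PDF or a video) would fit neither pattern: it has no markers that map its contents onto basic data types.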
A 2011 report by the McKinsey Global Institute characterizes the main components and the big-data ecosystem as follows:[47]
• Techniques for analyzing data, such as A/B testing, machine learning, and natural language processing.
• Big-data technologies, such as business intelligence, cloud computing, and databases.
• Visualization, such as charts, graphs, and other displays of the data.
Multidimensional big data can also be represented as data cubes or, mathematically, as tensors. Array database systems have set out to provide storage and high-level query support for this type of data. Additional technologies applied to big data include efficient tensor-based computation,[48] such as multilinear subspace learning,[49] massively parallel processing (MPP) databases, search-based applications, data mining,[50] distributed file systems, distributed databases, cloud and HPC-based infrastructure (applications, storage, and computing resources),[51] and the Internet. Although many approaches and technologies have been developed, carrying out machine learning on big data remains difficult.[52]
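A data cube is simply a measure indexed by several dimensions; a roll-up query aggregates along one of them. The sketch below uses a plain dictionary as a toy three-dimensional cube (the dimension names and values are hypothetical), which is the kind of operation an array database would execute at scale.

```python
from collections import defaultdict

# A tiny "data cube": one measure indexed by three dimensions.
# All keys and values here are illustrative, not real data.
cube = {
    ("2024-Q1", "north", "widget"): 120,
    ("2024-Q1", "south", "widget"): 80,
    ("2024-Q2", "north", "gadget"): 50,
}

# Roll-up: aggregate the measure along the "quarter" dimension,
# collapsing the other two axes of the cube.
by_quarter = defaultdict(int)
for (quarter, region, product), value in cube.items():
    by_quarter[quarter] += value

print(dict(by_quarter))
```

The same cube could equally be stored as a dense 3-D array (a rank-3 tensor), which is what makes tensor-based methods such as multilinear subspace learning applicable.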
Some MPP relational databases can store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of large data tables in the RDBMS.[53]
DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi.[54]
Practitioners of big-data analytics processes are generally hostile to slower shared storage,[55] preferring direct-attached storage (DAS) in its various forms, from solid-state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures, the storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are inconsistent with big-data analytics systems, which thrive on system performance, commodity infrastructure, and low cost.
Delivering information in real or near-real time is one of the defining characteristics of big-data analytics, so latency is avoided whenever possible. Data in memory is good; data on a spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques.
There are advantages and disadvantages to shared storage in big-data analytics, but as of 2011 big-data analytics practitioners did not favor it.
Capture
Where does all this data come from? We generate it, directly and indirectly, second after second. An iPhone today has more computing capacity than NASA had when humans reached the Moon,[56] so the amount of data generated per person per unit of time is very large. We catalog the origin of the data according to the following categories:[57]
• Generated by people themselves. Sending emails or WhatsApp messages, posting a status on Facebook, publishing work relationships on LinkedIn, tweeting content, or answering a street survey are things we do daily that create new data and metadata that can be analyzed. It is estimated that every minute more than 200 million emails are sent, more than 700,000 pieces of content are shared on Facebook, two million searches are performed on Google, and 48 hours of video are uploaded to YouTube.[58] Tracking usage in an ERP system, adding records to a database, or entering information into a spreadsheet are other ways of generating this data.
• Obtained from transactions. Billing, loyalty cards, telephone calls, cell-tower connections, public Wi-Fi access, credit-card payments, and transfers between bank accounts all generate information that, once processed, can become relevant data. Take bank transactions: what the user sees as a deposit of X euros, the system captures as an action carried out on a specific date and time, in a specific place, between registered users, and with certain metadata.
• Electronic and web marketing. A large amount of data is generated while browsing the Internet. With Web 2.0, the webmaster-content-reader paradigm has been broken, and users themselves become content creators through their interaction with the site. Many tracking tools are used, mostly for marketing and business analysis. Mouse movements are recorded in heat maps, and there is a record of how long we spent on each page and when we visited it.
• Obtained from machine-to-machine (M2M) interactions. This is data collected as metrics from devices (meters; temperature, light, height, pressure, and sound sensors...) that transform physical or chemical magnitudes into data. Such devices have existed for decades, but the arrival of wireless communications (Wi-Fi, Bluetooth, RFID, etc.) has revolutionized the world of sensors. Examples include GPS in the automotive industry, vital-signs sensors (very useful for life insurance), festival wristbands,[59] monitors of car operation and driving (which yield very useful information for insurers),[60] and smartphones (which are location sensors).
• Biometric data. It generally comes from security, defense, and intelligence services.[61] This is data generated by biometric readers such as retinal scanners, fingerprint readers, or DNA sequence readers. Its purpose is to provide security mechanisms, and it is usually guarded by defense ministries and intelligence departments. One application is matching DNA from a crime-scene sample against a sample in a database.
Transformation
Once the sources of the necessary data have been found, we will most likely have countless unrelated source tables. The next objective is to gather the data in one place and give it an appropriate format.
This is where extract, transform, and load (ETL) platforms come into play. Their purpose is to extract data from different sources and systems, perform transformations (data conversions, cleaning of dirty data, format changes, etc.), and finally load the data into the specified database or data warehouse.[62] An example of an ETL platform is Pentaho Data Integration, more specifically its Spoon application.
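The three ETL stages can be sketched end to end with the standard library. This is a minimal illustration, not how Pentaho works internally: an in-memory CSV stands in for the source system, SQLite stands in for the data warehouse, and the field names and values are invented for the example.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (an in-memory CSV with one dirty row).
raw = io.StringIO("name,amount\nAna, 10 \nLuis,\nEva,7\n")
rows = list(csv.DictReader(raw))

# Transform: trim whitespace, drop rows with missing amounts, convert types.
clean = [(r["name"].strip(), int(r["amount"]))
         for r in rows if r["amount"].strip()]

# Load: insert into the target store (SQLite stands in for the warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (name TEXT, amount INTEGER)")
db.executemany("INSERT INTO payments VALUES (?, ?)", clean)
total = db.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)
```

Real ETL platforms add scheduling, error handling, and connectors for many sources, but the extract-transform-load skeleton is the same.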
NoSQL Storage
The term NoSQL stands for Not Only SQL and refers to storage systems that do not follow the entity-relationship schema.[63] They provide a much more flexible and concurrent storage model and allow large amounts of information to be manipulated much faster than relational databases.
We distinguish four large groups of NoSQL databases:
• Key-value storage: data is stored much like maps or dictionaries, where data is accessed by a unique key.[64] The values (data) are isolated and independent of one another and are not interpreted by the system. They can be simple variables such as integers or characters, or objects. This storage system lacks a fixed, established data structure, so it does not require very strict data formatting.[65]
They are useful for simple key-based operations. One example is speeding up the loading of a website that serves different user profiles, with the files to be included mapped by user ID and computed in advance. Apache Cassandra is the key-value storage technology best known to users.[66]
• Document storage: document databases closely resemble key-value databases, differing in the data they store. Where the former required no specific data structure, here we store semi-structured data.[66] This data is called documents and can be formatted as XML, JSON, Binary JSON, or whatever the database accepts.
CouchDB and MongoDB[66] are perhaps the best known. Special mention goes to MapReduce, a Google technology initially designed for its PageRank algorithm, which lets you select a subset of data, group or reduce it, and load it into another collection; and to Hadoop, an Apache technology designed to store and process large amounts of data.
• Graph storage: graph databases break with the idea of tables and are based on graph theory, where the information is the nodes and the relationships between pieces of information are the edges,[66] somewhat like the relational model. Their greatest use is in relating large amounts of highly variable data. For example, nodes can contain objects, variables, and attributes that differ from one another. JOIN operations are replaced by traversals through the graph, and an adjacency list of the nodes is kept.[64] Social networks provide an example: on Facebook each user is a node, which can have friendship edges with other users or publication edges with content nodes. Solutions such as Neo4J and GraphDB[66] are the best known graph databases.
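The adjacency list and the traversal-instead-of-JOIN idea can be shown in a few lines. The sketch below models a toy friendship graph (the user names are invented) and answers "who is reachable from this user?" with a breadth-first traversal, the kind of operation a graph database optimizes.

```python
from collections import deque

# Adjacency list, as a graph database stores it: each node maps to its edges.
# The users and friendships here are purely illustrative.
friends = {
    "ana":  ["luis", "eva"],
    "luis": ["ana"],
    "eva":  ["ana", "marc"],
    "marc": ["eva"],
}

def reachable(graph, start):
    """Breadth-first traversal: the graph-database counterpart of chained JOINs."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(sorted(reachable(friends, "ana")))
```

In a relational database the same query would need one self-JOIN per hop; the traversal handles an arbitrary number of hops naturally.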
Data analysis
Analysis lets you look at the data and explain what is happening. With the necessary data stored across different storage technologies, we will find that we need different data-analysis techniques, such as the following:
• Association: finds relationships between different variables.[67] Under the premise of causality, the aim is to predict the behavior of other variables. Such relationships drive, for example, cross-selling systems in e-commerce.
• Data mining: aims to find predictive behaviors. It encompasses the set of techniques that combine statistical and machine-learning methods with database storage.[68] It is closely related to the models used to discover patterns in large amounts of data.
• Clustering: cluster analysis is a type of data mining that divides large groups of individuals into smaller groups whose similarity was unknown before the analysis.[68] The purpose is to find similarities between these groups and to discover new ones, learning which qualities define them. It is an appropriate methodology for finding relationships among results and making a preliminary assessment of the structure of the analyzed data. There are different clustering techniques and algorithms.[69]
• Text analytics: much of the data people generate is text, such as emails, web searches, or content. This methodology extracts information from that data in order to model topics and issues or predict words.[70]
• Topological data analysis (TDA): aims to analyze the geometric and topological structure of the data. It has been developed since the 1990s using algebraic-topology tools such as persistent homology.[71] It has proven useful for clustering certain data and for analyzing oncological data, being capable of predicting treatment response and generating diagnoses.[72]
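Clustering is the most self-contained of the techniques above, so here is a minimal sketch of one classic algorithm, k-means, restricted to one dimension to keep it short. The data points and starting centroids are invented; real cluster analysis works on many dimensions and uses tuned libraries rather than this loop.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its group, and repeat."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Recompute centroids from the groups (skip any that went empty).
        centroids = [sum(g) / len(g) for g in clusters.values() if g]
    return sorted(centroids)

# Two obvious groups, unknown to the algorithm beforehand.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
result = kmeans_1d(points, [0.0, 5.0])
print(result)  # two centroids, one near each group
```

The output illustrates the point made above: the algorithm recovers group structure we did not label in advance, which is why clustering is used for preliminary exploration of a data set.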
Data visualization
As Spain's National Institute of Statistics says in its tutorials, "a picture is worth a thousand words, or a thousand data points."[73] The mind appreciates a well-structured presentation of statistical results in graphs or maps far more than tables of numbers and conclusions. In big data we go one step further: paraphrasing Edward Tufte, one of the world's most recognized experts in data visualization, "the world is complex, dynamic, multidimensional; paper is static and flat. How are we to represent the rich visual experience of the world on the mere flat plane?"
Mondrian[74] is a platform for visualizing information through the analyses carried out on the data we hold. With this platform we aim at a more specific audience and a more limited use, such as an organization's comprehensive scorecard. In recent years other platforms, such as Tableau, Power BI, and Qlik, have become widespread.[75]
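The core idea that a structured graphic is read faster than a table of numbers needs nothing more than a text bar chart to demonstrate. The labels and values below are invented for the illustration; dedicated platforms of the kind just mentioned produce far richer, interactive versions of the same mapping from values to marks.

```python
# Hypothetical aggregated results; labels and values are illustrative only.
data = {"north": 120, "south": 80, "east": 50}

# Render a text bar chart: one '#' per 10 units, label right-aligned.
lines = [f"{label:>5} | {'#' * (value // 10)} ({value})"
         for label, value in data.items()]
print("\n".join(lines))
```

Even in this crude form, the relative magnitudes are visible at a glance, which is exactly what a table of the same three numbers does not give you.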
Infographics, meanwhile, have become a viral phenomenon: they collect the results of different analyses of our data in attractive, entertaining, and simplified material for mass audiences.[76]