208 Big Data terms from A-Z: The updated glossary of Big data definitions
Industries and corporations are already busy crunching it! Larry Page called it the next cool thing. So, if your radar is scanning for the latest Big data terms and jargon, look no further. We have curated an updated glossary of Big data definitions just for you.
Buckle up and let’s get going!
The ACID test is applied to data transactions to ensure four major attributes: ‘Atomicity,’ ‘Consistency,’ ‘Isolation,’ and ‘Durability,’ hence the acronym ‘ACID.’ These attributes are explained below.
Atomicity – A transaction may involve two or more pieces of information. Either all of the pieces are committed, or none of them are.
Consistency – A successful data transaction creates a new and valid state of data. If a failure occurs during the transaction, all the actions are rolled back, and the data is restored to the state it was in before the transaction began.
Isolation – A data change or transaction that is not yet committed must remain detached or ‘isolated’ from any other transaction. Isolation ensures the sanctity of a validated transaction.
Durability – Once a data transaction is committed, the data remains available in its new and correct state, even if the system fails or reboots. Committed transactions are hence preserved by the system.
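The ‘all or nothing’ behavior can be tried out with a short sketch. The example below uses Python's built-in sqlite3 module; the accounts table, the names and the amounts are hypothetical and serve only as an illustration. A transfer that would push a balance below zero violates a CHECK constraint, so the whole transaction is rolled back.

```python
import sqlite3

# Hypothetical two-account ledger in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so neither update is kept.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok_small = transfer(conn, "alice", "bob", 30)    # valid transfer, committed
failed_big = transfer(conn, "alice", "bob", 500)  # would overdraw alice, rolled back
print(balances(conn))
```

In the failed transfer, the credit to the second account is applied first and then undone by the rollback, which is the atomicity and consistency described above.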
Simply put, searching, collecting and presenting data in an organized manner is called aggregation.
An algorithm is a set of coded instructions, often built around a mathematical formula, that software uses to perform data analysis. Commonly used data processing algorithms include regression, clustering, recommendation and classification algorithms.
Data analytics is the processing of raw data to extract useful information, patterns and insights. Analytics focuses on drawing conclusions and inferences from bulk datasets. The three major types of data analytics are as follows:
Descriptive analytics: It is the initial stage of data processing that creates the summary of useful information from the raw data. Think of descriptive analytics as a ‘summary’ of the ‘story data has to tell’. It prepares the bulk data for further analysis.
Predictive analytics: Predictive analytics predicts the ‘most likely’ future event, based on historical and recent data. Such predictions are not 100% sure to occur but are the ones most ‘likely’ to occur next in the series of events.
Prescriptive analytics: In short, prescriptive analytics is used to make decisions or decide the course of action, once the prediction for a likely future event has been made. Prescriptive analytics is mainly used in business analytics to drive decision making.
‘Outlier detection’ or ‘anomaly detection’ is the identification of observations, items or events, which do not match or conform to a projected pattern or other items in the database. Such anomalies or outliers can provide critical information about a rare event, or just be a contaminant.
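One simple way to flag such observations is a z-score check, sketched below. The readings and the threshold of 2 standard deviations are assumptions chosen for the illustration; real systems often use more robust methods.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 95]
outliers = find_outliers(readings)
print(outliers)
```

Whether the flagged value is a rare event or a contaminant still has to be decided by looking at where the data came from.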
Anonymization is the destruction of links and data points in the database that could reveal the identity of people. It ensures the privacy of individuals and protects information that could lead to their identification.
An application is software, powered by algorithms, that executes certain tasks and processes related to the data.
Apache Flink is an open-source framework written in Java and Scala. It is used for scalable streaming and batch data processing. It is developed by the Apache Software Foundation.
Apache Hadoop facilitates the processing and storage of massive datasets across a distributed computing network. It is an open-source framework written in Java.
Apache Kafka is an open-source stream processing platform written in Java and Scala. It provides a unified, high-throughput, low-latency platform and is used for reliable and robust real-time handling of data feeds.
Apache NiFi is an open-source ‘data logistics’ framework for facilitating the flow of data across systems. It is written in Java and employs ‘flow-based’ programming for real-time data flow management.
Apache Spark is an open-source engine that runs atop Apache Hadoop or in the cloud. It can access data from Hadoop, Cassandra, etc. Apache Spark is designed specifically to handle ‘big data’ and its analytics.
Artificial intelligence or AI refers to ‘machines’ acting with apparent intelligence. Modern AI employs statistical and predictive analysis of large amounts of data to ‘train’ computer systems to make decisions that appear intelligent.
Automatic Identification and Data Capture or AIDC
AIDC refers to numerous technologies through which data identification and data collection of various objects, individuals, audios or images are automatically executed, without the need for manual database entries. AIDC systems have wide applications, including but not limited to inventory management, security, logistics & retail industries.
Behavioral analytics is the identification of patterns and insights corresponding to ‘human behavior’ from the data. The focus of behavioral analytics is understanding the intentions of users, using the ‘data trails’ and information they generate online. It allows for mapping trends and possible actions by them in the future.
Big data represents massive data sets that can be computationally analyzed to reveal insights, patterns, and trends. It is analyzed through statistical analysis and predictive analytics. Big data analysis can reveal information hidden from general human intelligence and predict possible future events based on the analysis.
Big data scientist
A big data scientist is a professional who analyses big data through mathematical algorithms and extracts useful information from it.
Biometrics is the statistical analysis of the ‘physical and behavioral characteristics’ of humans. For example, physical characteristics may include fingerprint scans, retina scans, etc., while behavioral characteristics can include tone, personality, gestures, etc.
A Binary Large Object or ‘BLOB’ is unstructured data stored ‘as a collection of binary data’ in a database management system, often one hosted in the cloud. ‘Blobs’ are typically multimedia object files.
A Brontobyte is an informal unit of data size used to express very large amounts of data. 1 Brontobyte equals 1,000,000,000,000,000,000,000,000,000 or 10^27 bytes. It equals 1,000 Yottabytes.
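The arithmetic behind these units is easy to check. The sketch below treats each unit as a power of ten; the table of unit names and the convert helper are illustrative assumptions, not part of any standard library.

```python
from fractions import Fraction

# Decimal byte units expressed as powers of ten.
# "brontobyte" is the informal name used in this glossary.
UNIT_EXPONENTS = {
    "byte": 0,
    "kilobyte": 3,
    "megabyte": 6,
    "gigabyte": 9,
    "terabyte": 12,
    "petabyte": 15,
    "exabyte": 18,
    "zettabyte": 21,
    "yottabyte": 24,
    "brontobyte": 27,
}


def convert(amount, from_unit, to_unit):
    """Convert between decimal byte units using exact fractions."""
    shift = UNIT_EXPONENTS[from_unit] - UNIT_EXPONENTS[to_unit]
    return Fraction(amount) * Fraction(10) ** shift


print(convert(1, "brontobyte", "yottabyte"))  # yottabytes in one brontobyte
print(convert(1, "exabyte", "gigabyte"))      # gigabytes in one exabyte
```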
Business intelligence or BI
Business intelligence consists of analyzing and visualizing business data through various technologies and applications to drive better business decision making.
Cascading is a software abstraction layer for Apache Hadoop and Apache Flink. Using any JVM-based language (Java, JRuby, Clojure, etc.), one can create and execute complex data processing workflows on a Hadoop cluster. Cascading lowers the skills barrier to creating complex applications by hiding the complexity of the underlying ‘MapReduce jobs.’
Call detail record analysis or CDRA
Telecommunication companies structure the details of each call into ‘call detail records’ (CDRs). These records include information such as the time, duration and location of a call. Hence, CDRs prove to be useful in various analysis applications.
Cassandra is an open-source distributed NoSQL database management system, designed to facilitate handling of large distributed data across commodity servers. Structured as ‘key-value,’ Cassandra is a high-performance system without any single point of failure.
Chukwa is a subproject of Hadoop, developed for large-scale log collection and analysis. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Chukwa allows for displaying, monitoring and analysis of the results.
Classification analysis can be regarded as the process of collection of ‘summary’ of the given data. It collects and analyzes relevant information about the data through ‘metadata,’ which is nothing but the description of the given data.
Clickstream analytics is the process of analyzing and reporting the aggregate data about a user’s Web activity. It can reveal which page a user visits on the website, in what order, his/her interests, etc.
Clojure is a general-purpose functional programming language, developed as a dialect of Lisp. It runs on the Java Virtual Machine (JVM). Clojure emphasizes recursive iteration and is well suited to concurrent data operations.
Cloud computing involves ‘the usage of a network of remote servers’ for storing, managing and processing the data. Simply put, it is the practice of data management over a distributed network, rather than using local or personal servers.
Clusters represent subsets of the data that have similar characteristics. A cluster can also refer to a group of machines working together on a network for data mining, processing, etc.
Identification of the information or items in data that share common attributes or characteristics, and grouping or ‘clustering’ them together is called clustering analysis.
Cold data storage
Cold data storage refers to keeping inactive data that is rarely used or accessed on low-power servers. Retrieving cold data takes a long time. Typically, cold data is important for long-term compliance purposes.
A columnar database or column-oriented database stores data by columns rather than by rows. For example, all the dates are stored together under the ‘dates’ column, names under the ‘names’ column and so on. A columnar database provides faster access for queries that read a few columns across many rows.
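The difference between the two layouts can be sketched with plain Python structures. The records below are made up for illustration, and a real columnar database adds compression and on-disk storage that this sketch ignores.

```python
# Row-oriented layout: every record keeps all of its fields together.
rows = [
    {"date": "2024-01-01", "name": "Asha", "amount": 120},
    {"date": "2024-01-02", "name": "Ben", "amount": 80},
    {"date": "2024-01-03", "name": "Chen", "amount": 200},
]


def to_columns(rows):
    """Pivot a list of row dictionaries into one list per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Column-oriented layout: all values of one column sit next to each other.
columns = to_columns(rows)

# A query that needs only one column never touches the others.
total_amount = sum(columns["amount"])
print(columns["date"], total_amount)
```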
Comparative analysis is a systematic procedure of comparisons and analysis on extensive datasets to detect patterns and insights.
Comparators compare keys. This can be achieved in either of two ways. First, keys can be compared as deserialized objects by implementing the interface. Second, a RawComparator interface can be implemented, and the keys can be compared using their raw bytes.
Complex event processing (CEP)
CEP involves combining data from multiple sources and inferring patterns or events suggesting complex circumstances. CEP monitors all the events across a systems network and provides necessary information for action in real-time.
A decision that ‘appears’ to be made from the data, but that has actually been made by intuition or misinterpretation of it, is called confabulation.
Constrained Application Protocol or CoAP
CoAP is a specialized internet application protocol developed for ‘constrained devices’ to enable them to communicate with the wider internet. Here, constraints mean the limits of the devices and of the network’s resources. CoAP enables communication between devices on the ‘same’ constrained network, between devices and general nodes on the internet, and between devices on different constrained networks joined by the internet.
Complex structured data
Data comprising two or more complex and interrelated subsets that cannot be interpreted by standard machine languages and tools is called complex structured data.
Execution of processes and tasks in parallel at the same time is called concurrency.
Correlation analysis is the determination of ‘how closely the two data sets are correlated.’ Take for example two data sets, ‘subscriptions’ and ‘magazine ads.’ When more ads get displayed, more subscriptions for a magazine get added, i.e., these data sets correlate. A correlation coefficient of ‘1’ is a perfect correlation, 0.8 represents a strong correlation while a value of 0.12 represents weak correlation.
The correlation coefficient can also be negative. In the cases where data sets are inversely related to each other, a negative correlation might occur. For example, when ‘mileage’ goes up, the ‘fuel costs’ go down. A correlation coefficient of -1 is a perfect negative correlation.
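The coefficient discussed here is usually the Pearson correlation coefficient, which can be computed with a few lines of Python. The ad, subscription, mileage and fuel cost figures below are invented to reproduce the two examples.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data: more ads shown, more subscriptions added.
ads = [1, 2, 3, 4, 5]
subscriptions = [10, 20, 30, 40, 50]

# Hypothetical data: higher mileage, lower fuel cost.
mileage = [10, 20, 30, 40]
fuel_cost = [40, 30, 20, 10]

r_positive = pearson(ads, subscriptions)
r_negative = pearson(mileage, fuel_cost)
print(round(r_positive, 2), round(r_negative, 2))
```

Real data rarely falls on a perfect line, so the coefficient normally lands somewhere between these two extremes.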
Cross-channel analytics is a process within business analytics, where useful data sets from various sources or ‘channels’ are linked together. Analysis of such data reveals marketing insights and customer behavior trends, which can be useful for any enterprise.
A dashboard contains the visual or graphic representation of the analysis executed by the algorithms.
Data access refers to the retrieval of stored data.
A person authorized to access a database to maintain its structure, security, integrity, and content is a database administrator or DBA.
The collection of data from various channels or sources for analysis is called data aggregation.
The storage, arrangement, and integration of data in a given model, according to set policies, rules, and standards of an enterprise refers to the data architecture.
A digital structure around which the data is organized and is readily accessible is called a database. A database is managed by a ‘database management system’ or DBMS.
Database-as-a-service or DaaS
A cloud-hosted database that is sold as a commercial service is called DaaS. Users subscribe to the database in return for monthly or annual bills. For example, Amazon Web Services (AWS) offers databases as a service.
A data center is a physical location that contains server systems and data storage systems. A data center’s operations might be run by a single organization or leased out to other users as per the viability.
Data cleansing or data scrubbing is the removal of incorrect or improperly formatted database entries. It also includes amending the duplicate and incomplete pieces of information. Data scrubbing is achieved by data cleansing tools that make database management easier.
A professional responsible for database structure and storage management is known as a data custodian.
Data-directed decision making
Decisions taken on the basis of descriptive or predictive data analysis are called data-directed decisions. This type of decision making is typically seen in business intelligence.
The ‘trails’ or ‘byproducts’ of information generated through online or digital activities are known as ‘data exhaust.’ For example, web cookies, browsing history, call logs and temporary files are classified as data exhaust.
Data ethical guidelines
To ensure transparency, privacy, and security of data, organizations adhere to a set of guidelines, known as ethical data guidelines.
An ongoing stream of structured data from various sources or channels that provides the users with updated information is called a data feed. A person here receives the data in ‘streams’ as per their interests around a subject or a topic. Data feeds are typically known by their method of delivery. For example, RSS feeds and Twitter feeds are popular data feeds.
Data flow management
Data flow management is the management of inflow and outflow of huge amounts of data from ‘consumer’ and ‘producer’ devices. The collected raw data is then prepared for business analytics. Data flow management is achieved through aggregation, data stream analysis, schema translation, splitting, format conversion, etc.
The overall management of availability, usability, integrity, and security of the data used in an organization comes under data governance. A data governance program includes an administrator/governing body, procedures, and execution system for those procedures. The focus here is to maintain the integrity of the data with best possible data management practices.
Data integration collects and combines informational data from various channels and unifies it for the user. A single view of data from multiple sources allows for easy interpretation and preliminary analysis by the users.
The overall measure of the completeness, consistency, and accuracy of data makes up its ‘integrity.’ Data integrity can be regarded as the ‘measure’ of trust an organization has in its data content.
Data lake represents a storage system where data is kept in its raw or native format. Raw data is readily accessible from a data lake for further use.
Database management system or DBMS
A ‘DBMS’ allows for managing the database. The data content can be accessed systematically through a database management system.
Data mart or data marketplace
A subset or access layer of a data warehouse, oriented at providing data to individual users or businesses, is referred to as a data mart. A data mart provides the users with specific data, whereas a data warehouse contains broad and in-depth data content.
The process of transporting data between computer systems, storage systems or other formats is called data migration. Data migration becomes crucial when implementing or upgrading a system.
Generation of new information by analyzing large pre-existing databases is data mining.
A data model defines the structure of the data, how it will be stored, accessed, and communicated for functional and technical purposes. Data modeling is the first step while designing a database and object-oriented programming.
Data operationalization is the process of defining variables as measurable factors in order to build a functional data-based system.
A data point represents a discrete unit of information placed on a graph or a bar chart.
Preparing the data for analysis by aggregating, scrubbing or cleaning and consolidating it is called data preparation.
Retrieval, analysis, classification or transformation of information pieces by computer systems is called data processing.
Examination of data from an existing database and collection of descriptive summaries and statistics from it is known as data profiling. Data profiling is done to know whether the existing data can be applied for other purposes.
Data quality is a quantitative and qualitative measurement of the available data set’s ‘fitness’ or ‘worthiness’ for operations, decision making and planning.
Data replication is the copying of data from the storage of one computer or server to another database to ensure consistent levels of accessible information for all the users, without interfering with individual operations. A distributed data system is the result of data replication.
Database management operations that ensure the protection and integrity of the database and deny unauthorized access to it constitute data security.
Structural representation of data in tabular, columnar or rows format is a data set.
Lack of proper governance and management of a data lake might create a clutter of information known as a data swamp. The information subsets are lost amidst the vast pool of files and data retrieval becomes very tedious in such a case.
Examination of datasets for their quality, i.e., their arrangement, accuracy and integrity, is known as data validation. Data validation is a crucial step to follow before an individual/organization can analyze the data to make decisions.
Graphical representation of data for interpretation and analysis by humans, in the form of bar graphs, charts, etc., constitutes data visualization. It is done to achieve more effective and efficient communication of information.
A data warehouse is a storage system where vast amounts of structured data are collected. This data is usually derived from a data lake, processed and then transported to the data warehouse.
De-identification is the severing of all the links that can lead to the identification of a person. It is the same as ‘anonymization.’
A demographic data set defines the attributes or characteristics of a population. It might include information like gender, age, average income, geography, etc. for that population.
A ‘layer’ or ‘superficial’ cluster of all the devices like smartphones, sensors, gateways, electronic equipment, etc., that stream data according to their function and interaction with the environment forms the ‘device layer.’
Discriminant analysis is the statistical analysis for sorting the data into different groups or categories using the ‘discriminant function.’ In discriminant analysis, the pre-existing information about a few groups or ‘clusters’ is used to form the classification algorithm for that data set.
Just like a distributed data storage, a distributed cache spans across multiple servers that allow for an increased transactional processing capacity and continuously growing size, rather than being located within a single system.
Distributed objects in object-oriented programming are objects distributed across a network, with multiple distributed processes running over a network of computers. These objects function together by sharing data via the distributed network.
Distributed processing refers to any application run by more than one processor or computer via a distributed network. ‘Parallel processing’ is an example of distributed processing, wherein a computer employs more than one processor to execute the program operations.
Distributed file system
Simply put, a distributed file system or DFS is a data storage system, where the files are stored on a network of servers. A DFS system allows for an easier and faster access to the stored data and its processing.
Document store databases
A document store database is a specialized database for the quick storage, retrieval, and management of document-oriented data.
Drill is an open-source software framework developed by the Apache Software Foundation for the interactive analysis of large data sets.
Elasticsearch is an open-source Java-based search engine developed on top of ‘Apache Lucene.’ It can search and save files in diverse formats.
One exabyte represents one billion gigabytes or one million terabytes of data.
Exploratory analysis is the method of finding a data set’s major characteristics, by finding patterns without following standard procedures of analytics. It is carried out as a preliminary operation to gauge the given data’s nature.
External data is data located outside of a system. For example, data stored on a pen drive or a portable hard drive is external data.
Extract, transform and load or ETL
ETL is a data warehousing procedure whose name is self-explanatory. The data is first ‘extracted’ from various sources or channels, then ‘transformed’ by scrubbing and structuring it to fit the operational requirements, and finally ‘loaded’ into the database corresponding to that warehouse.
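The three steps can be sketched in a few lines using only Python's standard library. The CSV text, the column names and the signups table below are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with messy spacing, mixed case and a missing name.
RAW_CSV = """name,signup_date,plan
 Asha ,2024-01-05,PRO
Ben,2024-01-07,free
,2024-01-09,pro
"""


def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: scrub whitespace, normalize case, drop rows without a name."""
    cleaned = []
    for record in records:
        name = record["name"].strip()
        if not name:
            continue
        cleaned.append((name, record["signup_date"], record["plan"].strip().lower()))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups (name TEXT, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
loaded = warehouse.execute("SELECT name, plan FROM signups ORDER BY name").fetchall()
print(loaded)
```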
In case of the failure of a computer or a node, the system automatically switches to another one. This is known as failover.
A fault-tolerant design consists of ‘redundant nodes’ in a system that ensure its functioning even if certain points/nodes fail. A system becomes highly fault-tolerant if it has no single point of failure and switches over automatically at failover.
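A minimal sketch of failover across redundant nodes is given below. The Node class, the node names and the request string are hypothetical; production systems add health checks, timeouts and retries on top of this idea.

```python
class Node:
    """Hypothetical server node that either handles a request or is down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def send_with_failover(nodes, request):
    """Try each redundant node in order and return the first successful response."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # fail over to the next node
    raise RuntimeError("all nodes failed")


cluster = [Node("primary", healthy=False), Node("replica-1"), Node("replica-2")]
response = send_with_failover(cluster, "read:42")
print(response)
```

Because any healthy node can answer, the failed primary in this sketch is not a single point of failure.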
Apache Flume is a distributed service for moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It collects, sorts and transports the ‘logs’ of streaming data to populate Hadoop with it.
The application of typical gaming elements like points score, rules, and competition, in a ‘non-gaming’ context like Big data, refers to gamification. Gamification is used to encourage online user engagement, drive e-commerce analytics and analyze data stream flows.
A graphics processing unit or GPU-accelerated database is a data management system that uses graphics processors to speed up data processing. The data in such a case can also be ‘graphically’ visualized in an interactive manner. Popular examples of a GPU-database are Kinetica and MapD.
Graph analytics consists of methods for arranging and visualizing relations between different data points on a graph.
A graph database employs ‘graph structures’ for data storage and access. Here, ‘graphs’ consist of ‘nodes’ that represent entities, edges that represent the relationships between them, and properties that store the data about both. In a graph database, every element is directly linked to its neighboring elements.
Grid computing ‘pools in’ the computing resources, typically via a cloud network, to execute a common goal or task. In other words, grid computing combines the processing power from different machines connected to a network. ‘Mining pools’ for cryptocurrencies is an excellent example of grid computing.
Hadoop facilitates the processing and storage of massive datasets across a distributed computing network. Refer to ‘Apache Hadoop.’
Hama is a project developed by the Apache Software Foundation. It is a distributed computing framework designed to undertake heavy scientific computations using ‘bulk synchronous parallel computing’ operations. Such computations include network algorithms, graph processing, matrix operations, etc.
Developed by SAP, HANA is an in-memory database platform for application development and processing high-volume ‘real-time’ data transactions and analytics.
HBase is an open-source distributed database that runs on top of the Hadoop system. HBase provides the users with the ability to ‘update’ the Hadoop database regularly. It also enables the users to have quick ‘lookups’ in the Hadoop database.
HCatalog is another distributed system to complement the Hadoop network. HCatalog allows access to the ‘metadata’ for all the data present in Hadoop clusters. It also allows the users to analyze and process the data without knowing its actual location within the Hadoop cluster.
Hadoop distributed file system or HDFS
HDFS is a Java-based file system that represents the storage layer of the Hadoop distributed network. It stores large volumes of unstructured data and runs on commodity hardware.
Developed by Facebook, Hive is a data-warehousing service based on Hadoop. It was developed for SQL programmers to convert their programs into ‘MapReduce’ jobs easily. Hive employs a query language called HiveQL, similar to SQL. Hive programs can be easily integrated with business intelligence and visual analytics tools.
High-performance computing or HPC
HPC involves supercomputers for processing highly advanced and complex tasks.
Hadoop user experience or HUE
HUE is an open-source web-interface developed for users to access and work on Hadoop easily. It features various tools integrated to a dashboard such as HDFS browser, Oozie application for task management, MapReduce, Hive and Impala UI, Hadoop API, etc.
Impala is a tool developed by Cloudera. It provides quick and interactive SQL queries directly on the data stored in HDFS or HBase, using the same metadata, ODBC driver, HiveQL syntax and user interface as Apache Hive. Impala provides a unified platform for real-time or batch-oriented queries.
In-database analytics is analytics run by integrating data analytics into the data warehouse.
An in-memory database is a data management system that stores the data in the primary memory instead of secondary memory. This allows for faster ETL operations and processing of the data.
In-memory data grid or IMDG
Similar to a distributed network, an IMDG stores data in the memory of servers across a network. This allows for higher scalability and faster access to the data.
The inflow or intake of ‘loads of streaming data from multiple channels or sources’ by a database management system is termed as ‘Ingestion.’
Internet-of-things or IoT
IoT is the interconnection of various devices, at any given time, to the internet, where they continuously send and receive data. It includes the ‘device layer’ of the network which contains smartphones, cars, appliances, electronics, etc.
Juridical data compliance or JDC
Juridical data compliance is a term that comes into play when using distributed data management and storage systems such as a cloud network. JDC refers to compliance with the laws and regulations regarding data, which must be adhered to when the data is stored in a foreign country or region.
Key value stores
Key value stores remove the need for a fixed data model, by allowing the storage of data in a schema-less manner. In key-value stores, the data can be stored as the data type object of a programming language.
Key-value databases store data with a ‘unique primary key’ for identification of the information. This makes for easier and faster way to look up and access the data.
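The idea can be sketched with a thin wrapper around a Python dictionary. The KeyValueStore class and the ‘user:1’ style keys are illustrative assumptions; real key-value stores add persistence and distribution across servers.

```python
class KeyValueStore:
    """Minimal in-memory key-value store; values can have any shape."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store a value under a unique key, replacing any earlier value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Look up a value directly by its key."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key if it exists."""
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", {"name": "Asha", "plan": "pro"})  # a dictionary
store.put("user:2", ["any", "shape", 42])             # a list, no fixed schema
store.put("user:1", {"name": "Asha", "plan": "free"})  # same key, value replaced
print(store.get("user:1"))
```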
Latency is defined as the ‘delay in delivery or response of data from one point to another.’ In other words, latency denotes the ‘time-lag’ of a system.
Any computer system or technology that has become obsolete and is no longer supported by current tech platforms is a legacy system.
‘Linked data’ refers to the attributes or languages used to identify relationships between otherwise disparate sources of data.
Load balancing is a performance optimization technique where workload gets evenly distributed across the machines on a network.
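Round-robin assignment is one of the simplest ways to spread work evenly, and it can be sketched as follows. The server and task names are hypothetical, and real load balancers also take server health and current load into account.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(tasks, servers):
    """Pair each task with the next server in turn, wrapping around at the end."""
    return list(zip(tasks, cycle(servers)))


servers = ["server-a", "server-b", "server-c"]
tasks = [f"task-{i}" for i in range(6)]

assignment = assign_round_robin(tasks, servers)
load = Counter(server for _, server in assignment)
print(load)
```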
Location analytics allows for geospatial information like region, latitude and longitude to be arranged into the datasets. The data is collected via GPS on the devices.
A log file is a record file created by a system for its reference. It records ‘events’ that have occurred during any computer operation.
Machine generated data
Any data generated by a non-human source, like application processes, temporary files, etc., is machine-generated data.
The data transferred between two machines while they communicate with each other via a network is machine-to-machine (M2M) data.
Machine learning or ML involves the development of algorithms to figure out insights from extensive and vast data. ‘Learning’ refers to ‘refining’ of the models by supplying additional data, to make it perform better with each iteration.
MapReduce is a software framework developed by Apache that serves as the computing layer of Hadoop. It processes the data at node level by dividing the ‘query’ into multiple parts in a ‘map’ step, and then ‘reduces’ the partial results to output the ‘answer’ for the query. MapReduce also looks after scheduling and re-executing any failed operations.
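The map and reduce steps can be illustrated with the classic word count, sketched here in plain Python on a single machine. This is not the Hadoop API; the function names and sample documents are assumptions made for the illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


documents = ["big data is big", "data is everywhere"]
counts = word_count(documents)
print(counts)
```

In a real cluster, the map calls run on the nodes that hold the data, and the framework performs the shuffle between machines.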
Combinations of two or more ‘refined’ datasets into a unified application for a specific purpose is called a mashup. For example, the combination of a geolocation dataset with the demographic dataset to create an app for booking cabs.
Mahout is a data-mining library whose algorithms perform clustering, regression and statistical modeling of the data. These algorithms are implemented using MapReduce.
Metadata is the information summary of the given data. Metadata tells us ‘what the given data is all about.’
MongoDB is an open-source document-oriented NoSQL database. Here, data is saved as JSON-like documents with dynamic schemas, stored in the BSON format. MongoDB allows for easier and faster integration of data into the applications.
A multi-dimensional database is designed for data warehousing and online analytical applications (OLAP).
A multi-value database is a NoSQL database that can interpret three-dimensional data directly. It can also manipulate HTML and XML strings directly.
Natural language processing or NLP
Natural language processing is a collection of techniques to structure and process raw text from human languages to extract information.
Network analysis includes the analysis of nodes and their relationship with each other in a network.
A neural network uses algorithmic processes that mimic the human brain. It attempts to find insights and hidden patterns from vast data sets. A neural network runs on learning architectures and is ‘trained’ on large data sets to make such predictions.
NewSQL is a newer class of relational database systems that keeps SQL and transactional (ACID) guarantees while aiming for the scalability and performance associated with NoSQL databases.
‘Not only SQL’ or NoSQL covers numerous database management systems in which data can be stored and retrieved even if it is modeled in a format other than the tabular format. Tabular structures are characteristic of relational databases. NoSQL is not dependent on ‘tabular’ database architecture and does not necessarily use SQL for data manipulation.
Object databases employ a ‘query’ language to retrieve the data stored in them as ‘objects.’ Such databases differ from graph or relational databases simply because they store information as ‘object clusters.’
Object-based image analysis
Individual pixels can be used to analyze the digital images. Object-based image analysis exploits the same fact. A selected set of related pixels or ‘image objects’ can be analyzed using this technique. It forms the basis of visual or image recognition and classifying algorithms.
Online analytical processing or OLAP
OLAP uses three operations for the analysis of multidimensional data. These three operations are:
Consolidation: The aggregation of available data into a structured form.
Drill-down: An operation with which the users can access the details about the data.
Slice and dice: It provides the users with an ability to select subsets and analyze them from different perspectives.
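The three operations above can be sketched on a tiny in-memory data set. This is only an illustration under stated assumptions: the `SALES` rows and the function names are hypothetical, and a real OLAP engine works on pre-built cubes rather than Python lists.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount)
SALES = [
    ("North", "Laptop", "Q1", 100),
    ("North", "Phone", "Q1", 50),
    ("North", "Laptop", "Q2", 70),
    ("South", "Laptop", "Q1", 30),
    ("South", "Phone", "Q2", 20),
]

def consolidate(rows, key_index):
    """Consolidation: roll the amounts up along one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

def drill_down(rows, region):
    """Drill-down: return the detail rows behind one region's total."""
    return [row for row in rows if row[0] == region]

def slice_and_dice(rows, quarter, products):
    """Slice on one quarter, then dice by a chosen set of products."""
    return [row for row in rows if row[2] == quarter and row[1] in products]

by_region = consolidate(SALES, 0)                     # totals per region
north_detail = drill_down(SALES, "North")             # rows behind the North total
q1_laptops = slice_and_dice(SALES, "Q1", {"Laptop"})  # one subset, one perspective
print(by_region, len(north_detail), q1_laptops)
```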
Online transactional processing or OLTP
OLTP allows users to record and process large numbers of short ‘transactional’ operations, such as inserts, updates and deletes, in real time. Analysis and pattern recognition on such data is the job of OLAP instead.
Oozie is a workflow processing and management system that lets users define a series of tasks and execute them in an intelligent sequence. For example, a user can define tasks written for different tools, like Hive and MapReduce, and link them together. Oozie also allows the users to trigger a ‘query’ when certain conditions regarding the data are met.
OpenDremel is the open-source version of ‘Big-Query Java code’ by Google. It is under the process of integration with Apache Drill.
Open data center alliance or ODCA
ODCA is a global alliance of IT organizations that aims to accelerate the migration to cloud computing.
Operational data store or ODS
ODS stores the data from multiple sources and allows for online transactional processing on the data. In this manner, more operations can be performed on the data before it is sent to the warehouse for reporting.
Optimization analysis is the ‘optimization step’ during the encoding of an algorithm and the products based on it. This enables the developers to create different variations of the algorithm-based product and to test it against particular variables.
Observations that diverge far away from the overall pattern in a sample are called outliers. Detection of such anomalies by a system is referred to as ‘Outlier detection.’ An outlier might indicate an error or a rare event.
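One simple way to detect such observations is a z-score test, sketched below. The threshold of 2.0 and the sample `readings` are assumptions chosen for illustration; other detection methods exist and may suit skewed data better.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing diverges
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(readings)
print(outliers)
```

Whether the flagged value is an error or a genuine rare event still has to be decided by looking at where it came from.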
Parallel data analysis
Parallel data analysis involves fragmenting an analytical problem into smaller pieces, then processing the pieces simultaneously with algorithms. Parallel data analysis can also run across a cloud network.
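A minimal sketch of this split, process and combine pattern is shown below. It uses a thread pool from the Python standard library only to keep the example short; the helper names are hypothetical, the input is assumed to be non-empty, and a production workload would more likely run on separate processes or a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Fragment the data into at most n_chunks pieces of similar size."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def analyze_chunk(chunk):
    """Partial result for one piece: (sum, count)."""
    return sum(chunk), len(chunk)

def parallel_mean(data, n_chunks=4):
    """Compute the mean by analyzing the pieces concurrently and combining them."""
    chunks = split(data, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(analyze_chunk, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

result = parallel_mean(list(range(1, 101)))
print(result)
```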
Parallel method invocation or PMI
PMI enables a program code to call or ‘invoke’ multiple functions and run them in parallel for executing a task.
Parallel queries get processed by multiple system threads for a faster outcome, often over a network.
Pattern recognition is the classification or labeling of already recognized patterns by the system. Pattern recognition falls under machine learning.
Pentaho is open-source business intelligence (BI) software that provides users with OLAP, data integration, ETL (extract, transform, load) capability, dashboards, data mining and reporting services.
One petabyte equals one million gigabytes or approximately one-thousand terabytes.
Yahoo developed ‘Pig’ as a Hadoop-based language to overcome the limitations of SQL when writing ‘complex, deep and long data pipelines.’ It is easier to learn and work with.
Predictive analytics involves extraction of information from existing data sets to determine patterns and insights. These patterns and insights are used to predict future outcomes or event occurrences.
Predictive modeling utilizes predictive analysis algorithms to identify trends, patterns, and insights from large and structured datasets, and predict the ‘next most likely event to occur.’
Extensive data sets and public information that are aggregated with public funding or at public initiative are called public data.
Quantified-self or ‘Lifelogging’ is a concept which aims at acquiring data about an individual’s life patterns throughout the day. This is a technological movement whose goal is to intimately understand a person’s life by monitoring the quantitative inputs (like food consumption, finances, environmental quality, etc.), biological parameters (like blood-oxygen level), mental states (like mood) and performance parameters, through smart applications.
A query is a question posed in order to extract information. Here, the term is used in the context of databases.
Query analysis is the analysis conducted on an ‘input query’ from a user or a system in order to return the most relevant and optimum result.
R is an open-source programming language for statistical analysis and graph generation, available for different operating systems.
Radio frequency identification or RFID
A device that uses a defined radio-frequency for transmitting data wirelessly is an RFID enabled device.
In simple words, re-identification is the opposite of ‘anonymization.’ Here, several data sets are combined to track the identity of an individual within the anonymized data cluster.
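The sketch below illustrates how combining two data sets can undo anonymization. All records, names and the choice of `QUASI_IDENTIFIERS` are made up for the example; the point is only that fields left in an ‘anonymized’ set can act as a join key against a public one.

```python
# Hypothetical "anonymized" records: names removed, other attributes kept.
anonymized = [
    {"zip": "02138", "birth_year": 1965, "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1980, "gender": "M", "diagnosis": "diabetes"},
]
# Hypothetical public register that still carries names.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1965, "gender": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1980, "gender": "M"},
    {"name": "Carol Example", "zip": "02139", "birth_year": 1980, "gender": "F"},
]
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

def reidentify(anon_rows, public_rows):
    """Link anonymized rows to names when the shared attributes match exactly one person."""
    index = {}
    for person in public_rows:
        key = tuple(person[field] for field in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(person["name"])
    matches = {}
    for row in anon_rows:
        key = tuple(row[field] for field in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match reveals the identity
            matches[names[0]] = row["diagnosis"]
    return matches

linked = reidentify(anonymized, public)
print(linked)
```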
Data that is created, stored and analyzed by a DBMS within a fraction of a second is referred to as real-time data.
A recommendation engine tracks and learns from a user’s online habits, typically purchase preferences and items of interest. It then uses that data to ‘recommend’ items of interest to the users. This engine is a common feature of the e-commerce websites.
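A very small sketch of the idea follows. It assumes a made-up purchase history and a simple ‘people who bought what you bought also bought’ scoring rule; production engines use far richer signals and models.

```python
from collections import Counter

# Hypothetical purchase history per user.
purchases = {
    "ann": {"laptop", "mouse", "keyboard"},
    "ben": {"laptop", "mouse"},
    "cat": {"laptop", "monitor"},
    "dan": {"phone", "charger"},
}

def recommend(user, history, top_n=2):
    """Suggest items owned by users with overlapping purchases, weighted by the overlap."""
    owned = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(owned & items)
        if overlap == 0:
            continue  # no shared taste, so ignore this user
        for item in items - owned:
            scores[item] += overlap
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]

suggestions = recommend("ben", purchases)
print(suggestions)
```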
The reference data summarizes or describes a data-point or an object for the user.
Regression analysis aims at measuring how one dependent variable depends on one or more independent variables. It typically assumes a one-way effect from the independent variables to the dependent variable. Examples of regression techniques are linear regression, logistic regression, lasso regression, etc.
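For the simplest case, one independent variable and a straight line, the dependency can be estimated with ordinary least squares. The sketch below is a plain-Python illustration; the `hours` and `scores` figures are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (independent) vs. test score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # estimate for an unseen value of x
print(slope, intercept, predicted)
```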
Relational database management system (RDBMS)
An RDBMS stores data in tables, or ‘relations,’ made up of rows and columns, and lets users query datasets that are linked through shared features.
Resilient distributed dataset
A resilient distributed dataset represents data stored across multiple systems which have ‘no single point of failure.’ In other words, this data is fault-tolerant. Resilient distributed datasets are the primary data abstraction in Apache Spark.
This is the process of determining the quickest, shortest and most efficient pathway to transport entities. The aim here is to decrease costs and maximize output.
Scalability is the intrinsic property of a system that represents ‘its ability to cope’ as the workload increases. As the system ‘scales’ up to handle increased workload demand, it must neither fail nor compromise on performance.
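Finding the cheapest pathway is commonly modeled as a shortest-path problem on a weighted graph. The sketch below uses Dijkstra's algorithm, one standard choice among several; the `ROADS` map and its costs are hypothetical.

```python
import heapq

# Hypothetical road map: node -> {neighbor: travel cost}
ROADS = {
    "depot": {"a": 4, "b": 1},
    "b": {"a": 2, "customer": 7},
    "a": {"customer": 3},
    "customer": {},
}

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total cost, path), or None if unreachable."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None

cost, path = shortest_route(ROADS, "depot", "customer")
print(cost, path)
```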
The schema defines the structural organization of the data in a database.
Search data is the collection of ‘terms’ and keywords searched by people on a search engine within a defined time period. For example, Google Trends is powered by search data and its analysis.
Semi-structured data does not possess a formal structure to sort data; rather, it consists of tags that identify the records in a data cluster.
Sentiment analysis is a survey or analysis conducted to find out the feelings of people about certain items, products, services, etc. Such types of analytical surveys are regularly carried out on social media platforms and email services. For example, Google and Facebook conduct sentiment analysis regularly.
A server is a dedicated node or a computer system for a network, which carries out data transactions and delivers requests by the users over that network.
A shard refers to a discrete partition of a database.
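One common way to decide which shard a record belongs to is to hash its key, as sketched below. The key format, the shard count of 4 and the function names are assumptions for illustration; real systems may instead shard by key range or by a lookup directory.

```python
import hashlib

def shard_for(key, n_shards):
    """Map a record key to a shard number in a stable, repeatable way."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

def partition(keys, n_shards):
    """Group keys into discrete partitions, one list per shard."""
    shards = {i: [] for i in range(n_shards)}
    for key in keys:
        shards[shard_for(key, n_shards)].append(key)
    return shards

# Hypothetical user identifiers spread over four shards.
user_ids = [f"user-{i}" for i in range(100)]
shards = partition(user_ids, 4)
print({shard: len(members) for shard, members in shards.items()})
```

Because the hash is deterministic, the same key always lands on the same shard, which is what lets a lookup go straight to the right partition.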
Signal analysis consists of interpreting, processing, and reporting the ‘behaviour or features of a phenomenon’ through sensory devices. A signal can represent images, sounds, radiation, online data, biological parameters, etc.
A simulation represents the real-world processes or a system to study its behavior under the influence of various variables. Analysis conducted on such a simulation is simply called simulation analysis.
A smart grid employs sensors in an ‘energy grid’ to optimize energy transfer processes by analyzing real-time data.
The analysis of geographical and topological data to recognize patterns distributed within a geographic space is called spatial analysis.
Structured Query Language or SQL
A commonly used programming language to map and extract data from the relational databases.
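As a small illustration, the snippet below runs a few SQL statements against an in-memory SQLite database through Python's built-in `sqlite3` module. The `customers` table and its rows are hypothetical.

```python
import sqlite3

# An in-memory relational database, discarded when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ann", "Pune"), ("Ben", "Delhi"), ("Cat", "Pune")],
)

# Extract only the rows that match a condition, in a defined order.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Pune",)
).fetchall()
conn.close()
print(rows)
```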
Sqoop is a ‘transportation’ or connectivity tool for transferring information from non-Hadoop data stores into Hadoop. The user can specify a target location within Hadoop and then move the data from, say, Oracle to that target location.
Storm is a free, open-source system, released by Twitter, that is used for real-time distributed computing. Storm efficiently processes streams of unstructured data in ‘real-time.’
Taxonomy is the process of classifying (or labeling) the data according to a predetermined system. Using taxonomy, catalogs of the structured data can be formed for easy mapping and retrieval.
Telemetry is the acquisition of real-time (or lagging or delayed) information about an object or a situation via remote sensory devices. For example, live video feed by a drone, Skype call on a smartphone, etc. fall under telemetry.
One terabyte equals approximately one-thousand gigabytes of data.
A linguistic and statistical analysis of text-based data (typically generated by humans), which is employed in machine learning is called text analytics.
Thrift is a software framework used for building ‘cross-language’ services easily and effectively. A user can build a service via Thrift’s integrated code-generation engine and work smoothly between C++, Ruby, Java, etc. to create a cross-language service.
Time-series analysis consists of studying data recorded at fixed time intervals. Such data must be well-defined and measured at identical time intervals.
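A moving average is one of the simplest time-series techniques and shows why equal spacing matters: each window is assumed to cover the same span of time. The `daily_visits` figures below are invented for the sketch.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive, equally spaced observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and the series length")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily website visits, one reading per day.
daily_visits = [10, 20, 30, 40, 50]
smoothed = moving_average(daily_visits, 3)
print(smoothed)
```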
It is simply a form of data that is dynamic in nature and changes with time. For example, an online bank transaction, product shipping data, etc. represent transactional data.
An open declaration or ‘view’ about the processes and operations being done on the people’s data. It falls under the ethical regulations of data management, and the organizations have to be transparent about it. Transparency is mandatory for public data and consumer data services.
Any dataset that lacks a defined structure is called an unstructured dataset. ‘Text-rich’ and image data usually fall under this category.
The extensive study of ‘big data’ and the useful insights derived from it are beneficial for organizations and people. This benefit in turn ‘adds value’ to the lives of the beneficiaries. As businesses grow, so do industries and so do people. Consumers get exactly what they want because businesses understand them closely.
Variability defines the nature of the data whose meaning can change rapidly. Variability will eventually change the interpretation of the data as well.
The term velocity describes the ‘speed with which the data is generated, stored, processed and visualized’ by the system.
Veracity defines the accuracy and the correctness of the analyzed data. Organizations have to be sure that the processed data is reliable and veracity indicates its reliability.
The volume represents the size of the data expressed in formal units. The units of data volume can range from megabytes to petabytes. The brontobyte is a newer, informal term used to express massive amounts of data.
WebHDFS Apache Hadoop
WebHDFS is an Apache Hadoop service that provides access to HDFS without requiring the native libraries. It exposes a functional HTTP REST API for HDFS access.
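WebHDFS requests are plain HTTP calls of the form `/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such a URL and sends nothing over the network. The host name is hypothetical, and port 9870 is assumed to be the NameNode HTTP port, which depends on the Hadoop version and cluster configuration.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, hdfs_path, operation, **params):
    """Build a WebHDFS REST URL for one operation on one HDFS path."""
    query = urlencode({"op": operation, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"

# Hypothetical cluster address; LISTSTATUS lists the contents of a directory.
url = webhdfs_url("namenode.example.com", 9870, "/user/demo/logs", "LISTSTATUS")
print(url)
```

Any HTTP client can then issue the request, which is why no Hadoop libraries are needed on the calling side.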
An open-source service open for public and organizational use, that provides ‘real-time weather data’ to them. Real-time weather updates can be used for various purposes like logistics movement, energy-grid operations, event management, etc.
XML databases are directly linked to document-oriented databases and allow for storing the data in XML format. The data stored in XML format can be transformed into any format required.
One yottabyte equals one thousand zettabytes. One zettabyte equals one thousand exabytes. One exabyte equals one million terabytes. The present volume of our digital world is approximately one yottabyte, and will double every 18 months.
One zettabyte equals one thousand exabytes or one billion terabytes. Since 2016, global networks have witnessed the exchange of more than one zettabyte of data every year.
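The unit relationships in this glossary can be checked with a few lines of arithmetic. The sketch assumes decimal units, where each step up the scale is a factor of 1,000; the helper `convert` is hypothetical.

```python
# Decimal storage units in ascending order; each is 1,000 times the previous one.
UNITS = [
    "kilobyte", "megabyte", "gigabyte", "terabyte",
    "petabyte", "exabyte", "zettabyte", "yottabyte",
]
BYTES_PER_UNIT = {name: 1000 ** (i + 1) for i, name in enumerate(UNITS)}

def convert(amount, from_unit, to_unit):
    """Convert an amount of data between two decimal units."""
    return amount * BYTES_PER_UNIT[from_unit] / BYTES_PER_UNIT[to_unit]

print(convert(1, "petabyte", "gigabyte"))    # one million gigabytes
print(convert(1, "zettabyte", "terabyte"))   # one billion terabytes
print(convert(1, "yottabyte", "zettabyte"))  # one thousand zettabytes
```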
Zones are the well-defined areas within a data lake, tagged for a specific purpose.
There you go! You just swam through the updated glossary of Big-data terms. Think you have something to add to this list? Go ahead and share your knowledge with us in the comment box below.