Data Science Glossary

There has been much hype surrounding data science and deep learning in recent times, and with it has come a flood of terminology. This glossary collects the most common terms, from A/B testing to weights, with short working definitions. Many entries relate to neural networks, a cornerstone of deep learning: in simplest terms, a neural network is an interconnection of neurons, a concept borrowed from the biological neurons of the human brain.


A/B Testing: A method of comparing two versions of a technique or product to determine which one performs better.
Activation Function: A function that maps a neuron's input to its output. If the result of this function is above a certain threshold, the neuron is activated. The sigmoid function and ReLU are common examples of activation functions.
Artificial Intelligence: The mechanism by which a machine takes input from the environment via sensors, processes that input using the experience it has gained, and takes rational, intelligent actions on the environment via actuators, much like humans do.
AutoEncoders: A neural network whose output is trained to reproduce its input, the goal being that the hidden layers learn to represent the input in fewer dimensions by encoding it.


Backpropagation: A technique used in training deep learning networks to update the weights and biases by calculating the gradient so that the accuracy of the network can be improved iteratively.
Bayes’ Theorem: This theorem gives the probability of an event when we already have information about some other condition related to the event. The relation is given by P(A|B) = P(B|A) * P(A) / P(B)
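The theorem can be checked with a small worked example; the numbers below (a hypothetical diagnostic test) are made up purely for illustration:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical scenario: how likely is disease given a positive test?
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.99  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# Law of total probability gives P(B), the overall chance of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

Despite the accurate test, the posterior is only about 17%, because the disease itself is rare.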
Bias: A parameter that can be used to tweak the output values towards or away from the decision boundary.
Big Data: A volume of data so large that it cannot be processed by traditional data processing applications, and which can be analysed to generate valuable insights.


Classification: The process of assigning an individual entry to one of a discrete set of classes, typically by scoring the entry against each class.
Clustering: The method of grouping a set of data such that all elements of a particular group have some common parameters or characteristics.
Confusion Matrix: A matrix that describes the performance of a classification model on test data for which the true values are known. It uses the counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) to evaluate the performance.
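The four counts can be tallied directly from predicted and true labels; the labels below are invented for illustration (1 denotes the positive class):

```python
# Tallying the confusion matrix counts for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Common metrics derived from the four counts
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```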
Control Set: When testing a predictive or machine learning model using cross-validation, you train the model on the training set, and test its performance on the control set. 
Cost Function: A function for a model that determines the error in prediction of a dependent variable given an independent variable.
Cross-validation: The practice of splitting labelled data into a training set and a testing set; after the model has been trained on the training set, it is evaluated on the testing set, for which the expected outputs are already known. This shows how well the model performs on previously unseen data.
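A minimal sketch of k-fold splitting, assuming the data is indexed 0..n-1; libraries such as scikit-learn provide richer implementations with shuffling and stratification:

```python
# Generate (train_indices, test_indices) pairs for k-fold cross-validation.
# Each index appears in exactly one test fold.
def k_fold_indices(n, k):
    # Distribute n items across k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
```

In each round the model would be trained on the train indices and scored on the held-out test indices, and the k scores averaged.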


Data Mining: The process of gathering relevant data for the area of interest, generating a model to find patterns and relationships in the data, and presenting the derived information in an appropriate and useful form.
Data Science: The discipline of using appropriate tools and techniques to generate insights from a problem set that we have substantial expertise and data on.
Data Wrangling: The process of acquiring data from multiple sources, cleaning the data (removing/replacing missing/redundant data), combining the data to acquire only required fields and entries, and preparing the data for easy access and analysis.
Decision Boundary: A boundary that separates the elements of one class from the elements of another class. 
Decision Tree: A supervised learning algorithm that models a tree where every branch represents a set of alternatives and leaves represent the decisions. By taking a series of decisions along the branches, we ultimately reach the desired result at one of the leaves. 
Deep Learning (DL): A subfield of machine learning that uses the power of multi-layered neural networks to perform computations on huge amounts of data.
Dependent Variable: A variable that is under test and changes with respect to the change in an independent variable. In a housing price example, change in area results in change in price. Here, price is a dependent variable which depends on the independent variable area.
Dimensionality Reduction: The process of reducing the number of features or dimensions of a training set without losing much information from the data, thereby improving the model’s performance.


Exploratory Data Analysis: A data analysis approach used to discover insights in the data, often using graphical techniques.


Feature: The input variables in a problem set that can be measured to provide information.
Feature Selection: Selection of a subset of the most relevant attributes in a problem set. This is done for effective model construction.


Gradient Descent: An optimization process to minimize the cost function by finding optimum values for the parameters involved.
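The idea can be sketched on a one-dimensional toy cost function, f(x) = (x - 3)^2, whose minimum is at x = 3; the learning rate and step count here are arbitrary choices:

```python
# Gradient descent on f(x) = (x - 3)**2; the derivative is 2 * (x - 3).
def gradient_descent(lr=0.1, steps=100):
    x = 0.0  # arbitrary starting point
    for _ in range(steps):
        grad = 2 * (x - 3)  # gradient of the cost at the current x
        x -= lr * grad      # step in the direction that decreases the cost
    return x

x_min = gradient_descent()
```

Each step moves against the gradient, so x converges toward the minimizer at 3.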
Graphics Processing Unit (GPU): A chip that processes complex mathematical functions rapidly and is generally used for image rendering. Deep learning models generally require amounts of processing that are not feasible within time constraints on an ordinary CPU, hence GPUs are used.


Hyperparameter: A parameter in a model that cannot be learnt directly from the data and is instead chosen by trying different values to determine which works best for the model. The learning rate is an example of a hyperparameter.


Independent Variable: A variable whose value is varied in an experiment to observe its effect on a dependent variable. See dependent variable for an example.


Label: The output variable in a problem set, whose values are either provided or need to be predicted.
Latent Variable: A hidden variable that cannot be measured directly, but is inferred from the computation of other variables that can be measured directly and that are ultimately influenced by the hidden variable.
Learning Rate: A hyperparameter that scales how much the weights of a network are adjusted with respect to the gradient of the loss. A small learning rate can make the model take a long time to converge, whereas a large learning rate can prevent it from ever converging. So, an optimal learning rate must be found for best results.
Linear Regression: A regression technique where the linear relationship between a dependent variable and one or more independent variables is determined. The dependent and independent variables are continuous in nature.
Logistic Regression: A classification technique where the relationship between a binary dependent variable and one or more independent variables is determined.
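For a single independent variable, the least-squares line of linear regression can be computed in closed form as slope = cov(x, y) / var(x); the data points below are made up and lie exactly on y = 2x + 1:

```python
# Simple linear regression via the closed-form least-squares solution.
xs = [1.0, 2.0, 3.0, 4.0]  # independent variable (illustrative data)
ys = [3.0, 5.0, 7.0, 9.0]  # dependent variable, exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
```

Since the points fall exactly on a line, the fit recovers slope 2 and intercept 1.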


Machine Learning (ML): The process that uses data and statistics to learn and generate algorithms and models to perform intelligent action on previously unseen data.
Model Fitting: Process of checking the accuracy, performance or predictive power of a model, using metrics such as R-Squared, against your data set. Usually done in a cross-validation setting, comparing multiple models to choose the best one. 
Model Selection: The process of selecting a statistical model from a set of alternative models for a problem set that results in the right balance between approximation and estimation errors.
Model Tuning: The process of tweaking the hyperparameters in order to improve the efficiency of the model.


Natural Language Processing (NLP): A field of computer science associated with the study of how computers can understand and interact with humans using the natural language as spoken by humans.
Neural Network: A layered interconnection of neurons that receive input, process the input and use the activation function to generate an output. This output will be the input that is passed onto the next layer and so on until the final output layer. Also known as Artificial Neural Network, it is inspired from how the human brain works.


Outlier: Unusual observations in data that lie at an abnormal distance from where majority of the data points are located.
Overfitting: A model is overfitted when it fits the training data so closely that it learns even the noise in that data, and consequently does not work well on data it has not seen before.


Percentiles: The 20th percentile (say) is the value, for a random variable, such that 20% of the observations are less than that value. The 50th percentile is the median. Percentiles are used in robust statistical testing, for instance to compare two distributions.
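One common convention for computing a percentile is the nearest-rank method (libraries differ in how they interpolate between ranks); a sketch:

```python
import math

# Nearest-rank percentile: the smallest value whose rank covers p percent
# of the sorted observations.
def percentile(data, p):
    ordered = sorted(data)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

data = [15, 20, 35, 40, 50]  # illustrative sample
```

With this sample, the 50th percentile is 35 (the median) and the 100th percentile is the maximum, 50.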
Perceptron: A single layer neural network.
Predictive Analysis: A process of predicting unseen events using historical data and statistical techniques derived from data mining and machine learning.


Random Forest: A combination of many decision trees in a single model. Because predictions from multiple decision trees are averaged, a random forest’s predictions are more stable than those of any single tree.
Rectified Linear Unit (ReLU): An activation function whose output is 0 if the input is less than 0, and equal to the input otherwise, i.e. f(x) = max(0, x).
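The definition translates directly into code:

```python
# ReLU: zero for negative inputs, identity for non-negative inputs.
def relu(x):
    return max(0.0, x)
```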
Regression: A technique of measuring the relationship between the mean value of a dependent variable and one or more independent variables.
Reinforcement Learning: A machine learning technique where an agent learns by taking actions in an environment and observing the results. The agent can start with no knowledge of the environment and learns from experience.
Recurrent Neural Network (RNN): A recurrent neural network is a class of artificial neural network where connections between nodes form a directed graph along a sequence. This allows it to exhibit temporal dynamic behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state to process sequences of inputs.
R-Squared: Metric used to measure how close predicted values are to actual values. Equivalent to the correlation between the predicted values and the dependent variable, in a linear regression problem.
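One common way to compute it is as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot the total sum of squares; the sample numbers below are illustrative only:

```python
# R-squared = 1 - SS_res / SS_tot
def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)          # total variance
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted)) # residual error
    return 1 - ss_res / ss_tot

score = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
```

A score near 1 means the predictions track the actual values closely; here the near-perfect predictions give roughly 0.98.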


Sigmoid Function: An S-shaped activation function whose output approaches 0 as the input approaches -infinity and approaches 1 as the input approaches +infinity. It is given by sigmoid(x) = 1/(1 + e^(-x)).
Softmax: A function used in multinomial logistic regression that calculates the probability of each target class over all possible target classes. Each probability lies between 0 and 1, and the probabilities sum to 1.
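Both functions are short enough to write out directly; note that the softmax outputs always sum to 1:

```python
import math

# Sigmoid squashes any real input into the interval (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Softmax turns a vector of scores into a probability distribution.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative class scores
```

The largest score receives the largest probability, and sigmoid is just the two-class special case of softmax.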
Supervised Learning: A machine learning technique where we provide the model with both the input and the actual output data, and train the model such that the predicted output closely resembles the actual output so that it can later make predictions on unseen data.
Support Vector Machine (SVM): A supervised machine learning technique which outputs an optimal hyperplane that can classify new examples into multiple classes.


Testing Set: A portion of the available data held back to test the accuracy of the model after it has been trained on the training set. Reserving 20% of the total data for the test set is considered good practice, but the split may be changed to suit the model’s requirements.
Training Set: A portion of the available data used to learn the model’s parameters; the trained model may later be evaluated on the testing set. Using 80% of the total data for training is considered a fair split.
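A simple 80/20 split along the lines described above can be sketched as follows; shuffling first avoids order bias, and the fixed seed is only there to make the example reproducible:

```python
import random

# Split a dataset into training and testing portions.
def train_test_split(data, train_fraction=0.8, seed=42):
    shuffled = data[:]                       # leave the original untouched
    random.Random(seed).shuffle(shuffled)    # shuffle to avoid order bias
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
```

With 100 records and the default fraction, 80 land in the training set and 20 in the testing set, with no record in both.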
Time series: One-dimensional data indexed by time. Many models and statistical techniques have been developed to handle this kind of data, for instance auto-regressive processes. One important feature of a time series is its auto-correlation structure, which tells you what kind of model fits best with your data.


Underfitting: A model is underfitted when it fails to capture the underlying pattern in the given data and hence performs poorly even on the data it was trained on.
Unsupervised Learning: A machine learning technique where we provide the model with only the input data and train the model such that the model learns to determine the pattern in the data and later make predictions on unseen data.


Weight: The strength of connection between two neurons of two successive layers in a neural network.
