Neural Network & Machine Learning Notes
World → data → information → knowledge → wisdom
OSEMN: Obtaining data, Scrubbing data, Exploring data, Modelling data, Interpreting data
To succeed in machine learning: accessing & managing data, skill sets, disparate technologies
3-legged stool: domain expert, data expert & predictive modelling expert
Define data sources → Cross Industry Standard Process – Data Mining (CRISP-DM) → data exploration → prepare data → data aggregation, preprocessing & warehousing → sampling data → formulate hypothesis → design experiment → machine learning → clustering → anomaly detection (e.g. one-class classifier) → association-rule mining (e.g. do you want fries with that? 1. Frequent itemset. 2. Apriori algorithm) → building deployable models → prediction (e.g. propensity model & regression) → decision making → collect data → inference / conclusions → evaluating your model → updating your model
Learning rates (MLPs) → momentum (MLPs) → number of iterations/epochs → number of hidden layers and number of neurons in each hidden layer → stopping criteria → weight updates → quick propagation → resilient propagation (Rprop) → second-order methods
Feedforward Neural Network (FNN)
- Also known as Artificial Neural Network (ANN)
- Perceptron is a single layer neural network (binary)
- Also known as Multilayer Perceptron (MLP)
- Global basis functions (sigmoid)
- Requires non-linear regression, i.e. an optimization algorithm is used to minimize the error, which slows down the regression
- Non-linear regression introduces a little variation (results may differ slightly between runs)
- Better at approximating sparse data
- More accurate for non-uniform point distribution, again mainly because cross-validation works better for uniformly dense sets
- When using ensembles, typically about 4,500 neural networks must be computed individually (across the ensemble and hidden-node options), which could take days
- Σj WjXj + B, where W = weight, X = node, B = bias, k = period, l = layer, j = neuron
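As a rough illustration of the weighted-sum formula above, here is a minimal sketch (not from the original notes; the input values and weights are made up) of a single neuron computing Σj WjXj + B followed by a sigmoid activation:

```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of inputs plus bias (sum_j W_j * X_j + B), passed through a sigmoid."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))   # global sigmoid basis function

# Hypothetical example: 3 input nodes feeding a single hidden neuron
x = np.array([0.5, -1.2, 3.0])   # node values X
w = np.array([0.1, 0.4, -0.3])   # weights W
print(neuron_output(x, w, b=0.05))
```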
Radial Basis Function (RBF)
- Localised basis functions (e.g. Gaussian)
- Using linear regression (if the spread (S) and number of basis functions (m) is fixed)
- Cross-validation in the outer loop of the regression to determine m and S. Cross-validation means that we minimise the PRESS error over m and S to theoretically get the best predictor
- In uniformly dense sets it may be better since, theoretically, cross-validation should provide a more accurate response surface
- A single design point (i.e. a converged solution), using the default SRSM (sequential) approach with linear basis functions (the default approach) is still the best and cheapest
- E.g. when a power grid fails, it will find the most effective route to patch it back
- Φ(x,c) = Φ(‖x-c ‖)
- Any function that satisfies the above formula is an RBF; x is the node, c is the centre
- Φ (phi) denotes the basis function; ‖x − c‖ is the norm of the vector, i.e. the distance from the centre
- Φ(r) = e^(−(εr)²)
- This is the Gaussian model, with r = ‖x − xi‖ (xi being the i-th centre); e is Euler's number, 2.718; ε (epsilon) is a shape parameter controlling the width of the basis function
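A minimal sketch of the Gaussian basis function Φ(r) = e^(−(εr)²) with r = ‖x − c‖; the sample point, centre, and ε value are arbitrary illustrations:

```python
import numpy as np

def gaussian_rbf(x, c, eps):
    """Phi(r) = exp(-(eps * r)^2), where r = ||x - c|| is the distance from the centre c."""
    r = np.linalg.norm(x - c)
    return np.exp(-(eps * r) ** 2)

x = np.array([1.0, 2.0])       # input point (made up)
c = np.array([0.0, 0.0])       # centre of the localised basis function
print(gaussian_rbf(x, c, eps=0.5))   # eps is the shape parameter controlling the spread
```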
Kohonen Self-Organizing Map Neural Network (SOM)
- Finds the outline of an image and maps it to the nearest vector
- Does not handle categorical variables well, computationally expensive & potentially inconsistent solutions
- Each node’s weight is initialised
- A vector is chosen randomly from the training dataset
- The node whose weights are most like the input vector becomes the Best Matching Unit (BMU)
- The BMU's neighbourhood is calculated, and its radius decreases over time
- The BMU and its neighbours become more like the sample vector; nodes learn less the farther they are from the BMU
- Repeat step 2 for N iterations
- Wv(s+1) = Wv(s) + θ(u,v,s) · α(s) · (D(t) - Wv(s))
- Wv is the current weight vector of node v
- s is the current iteration
- u is the index of the best matching unit (BMU) in the map
- v is the index of the node in the map
- t is the index of the target input data vector in the input data set D
- θ(u,v,s) is a restraint due to distance from BMU, usually called the neighbourhood function
- α(s) is a learning restraint due to iteration progress
- D(t) is a target input data vector
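A minimal sketch of the update rule Wv(s+1) = Wv(s) + θ(u,v,s)·α(s)·(D(t) − Wv(s)), assuming a 1-D map, a Gaussian neighbourhood function, and linearly decaying α and neighbourhood radius (the map size and data are made up):

```python
import numpy as np

def som_step(W, x, s, n_iter, sigma0=2.0, alpha0=0.5):
    """One SOM iteration: find the BMU, then pull nearby nodes towards the sample x."""
    u = np.argmin(np.linalg.norm(W - x, axis=1))          # index of the best matching unit
    alpha = alpha0 * (1.0 - s / n_iter)                   # learning restraint alpha(s)
    sigma = sigma0 * (1.0 - s / n_iter) + 1e-9            # shrinking neighbourhood radius
    for v in range(len(W)):
        theta = np.exp(-((v - u) ** 2) / (2.0 * sigma ** 2))  # neighbourhood function theta(u, v, s)
        W[v] += theta * alpha * (x - W[v])                # Wv(s+1) = Wv(s) + theta * alpha * (D(t) - Wv(s))
    return W

rng = np.random.default_rng(0)
W = rng.random((10, 3))            # 10 map nodes with 3-dimensional weights, randomly initialised
for s in range(100):               # repeat for N iterations
    x = rng.random(3)              # a vector chosen randomly from the training data
    W = som_step(W, x, s, n_iter=100)
```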
Recurrent Neural Network (RNN)
- Commonly associated with Long Short-Term Memory (LSTM) (ht = ot · tanh(Ct))
- Elman and Jordan networks (yt = σy(Wy ht + by))
- Gated Recurrent Unit (GRU) (ht = (1 − zt) · ht−1 + zt · ĥt)
- Continuous-time (CTRNN) – uses differential equations to model the effect on a postsynaptic neuron of an incoming spike train
- Ideal for time-based data processing
- Also used for speech and translation
- Includes the brute force approach – predict and train a child network using all possible neural network configurations
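A minimal sketch of a single GRU time step implementing ht = (1 − zt)·ht−1 + zt·ĥt; the weight matrices are random and biases are omitted, so this only illustrates the data flow, not a trained network:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step: update gate z, reset gate r, candidate state, then blend."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate r_t
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate activation (h-hat_t)
    return (1.0 - z) * h_prev + z * h_cand           # h_t = (1 - z_t)*h_{t-1} + z_t*h-hat_t

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
Wz, Wr, Wh = (rng.normal(size=(d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(d_h, d_h)) for _ in range(3))
h = np.zeros(d_h)
for t in range(5):                                   # step through a short (made-up) time series
    h = gru_step(rng.normal(size=d_in), h, Wz, Uz, Wr, Ur, Wh, Uh)
print(h)
```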
Convolutional Neural Network
- Max pooling (sample-based discretization process; down-sampling). One property of max pooling is that it is non-convex
- Average pooling (instead of the highest probable nodes, it averages the nodes)
- A well-known CNN architecture is AlexNet
- Common image dataset can be found at Canadian Institute For Advanced Research (CIFAR)
- An improved version is called the Capsule Neural Network
- Pretrained image models: VGG, ResNet50 (Residual Network), Inception (GoogLeNet), Xception
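A minimal sketch of 2×2 max pooling and average pooling over a small feature map (the feature-map values are made up):

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Down-sample a 2-D feature map with non-overlapping 2x2 windows."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)   # made-up feature map
print(pool2x2(fm, "max"))       # keeps the strongest activation in each window
print(pool2x2(fm, "average"))   # averages each window instead of taking the maximum
```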
Modular Neural Network
- Multiple neural networks
Energy based Neural Network
- Traditional models use electrical pulses
- Ising model paradigm, generative model
- Undirected, e.g. Restricted Boltzmann Machine (RBM)
- Gating – maximum entropy to obtain energy-based models for binary gates
Autoencoder
- Learn efficient data coding in an unsupervised manner
- Compress data from the input layer into a short code, and then uncompress that code into something that closely matches the original data
- E.g. learning how to ignore noise
- z = σ(Wx + b)
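A minimal sketch of the encode/decode step z = σ(Wx + b) followed by a reconstruction; the weights are random and untrained, so the reconstruction error is only illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.random(8)                                       # original input (made up)

W_enc, b_enc = rng.normal(size=(3, 8)), np.zeros(3)     # compress 8 features into a 3-value code
W_dec, b_dec = rng.normal(size=(8, 3)), np.zeros(8)     # expand the code back to 8 features

z = sigmoid(W_enc @ x + b_enc)                          # encoder: z = sigma(Wx + b)
x_hat = sigmoid(W_dec @ z + b_dec)                      # decoder: reconstruction of the input
print(np.mean((x - x_hat) ** 2))                        # reconstruction error minimised during training
```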
3 Stages as compared to Human Neural Network
Dendrites (input)
- Model evaluation – how good the fit is between the model and the data
Confidence interval – how reliable a statistical estimate is
Confusion matrix – in the context of classification
Gain and lift chart – ratio of results with and without the predictive model
Kolmogorov-Smirnov chart – compares distributions (nonparametric)
Gini coefficient – classification problems
Cross validation – how the model will perform in the future
Predictive power – based on the concept of entropy or the Gini index
- Single-variable selection techniques
Chi-square test
CHAID stump using the Chi-square test
Association rules (confidence, 1 antecedent)
ANOVA
Kolmogorov-Smirnov (K-S) distance, two-sample tests
Linear regression forward selection (1 step)
Principal component analysis
Cell nucleus (nodes) × Synapse (weights)
- Transfer function
- Only have AND, NAND, OR, XOR, NOR, NOT
- Layers
- Softmax
- Activation functions (see the sketch after this list)
Sigmoid function – 0 to 1 | σ(z) = 1 / (1 + e^(−z))
Threshold (step) function – 0 or 1 | y = 0 if x < n; 1 if x ≥ n
Rectified Linear Unit (ReLU) function – non-linear rectification of the input | R(z) = max(0, z)
Leaky ReLU – fixes the dying-ReLU problem with a small negative slope | max(0.01x, x)
Hyperbolic tangent (tanh) function – −1 to 1 | tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
Exponential Linear Unit (ELU)
Deeply-Supervised Net (DSN) – under CNN
Network-in-Network (NiN) – micro network
Maxout – maximum of multiple possible outputs
Highway network – regulates information flow
Expected Signal Propagation (ESP) – technique in Bayesian machine learning
- Predictive modelling process: Sample, Explore, Modify, Model, Assess (SEMMA)
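A minimal sketch of the activation functions listed above (sigmoid, threshold/step, ReLU, leaky ReLU, tanh, softmax); the test values are arbitrary:

```python
import numpy as np

def sigmoid(z):                  # squashes to the range 0 to 1
    return 1.0 / (1.0 + np.exp(-z))

def step(z, n=0.0):              # threshold: 0 below n, 1 at or above n
    return np.where(z >= n, 1.0, 0.0)

def relu(z):                     # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # small negative slope avoids "dead" neurons
    return np.where(z > 0, z, alpha * z)

def tanh(z):                     # squashes to the range -1 to 1
    return np.tanh(z)

def softmax(z):                  # turns a score vector into probabilities that sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, -0.5, 0.0, 1.5])   # arbitrary test values
for f in (sigmoid, step, relu, leaky_relu, tanh, softmax):
    print(f.__name__, f(z))
```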
Human Brain
- Processing: neuron, nervous system
- Implicit memory: procedural (muscle memory), priming (external correlation), classical conditioning (internal correlation)
- Explicit memory: semantic (knowledge and facts), episodic (life's events)
- Stages: sensory (all the senses), working (short-term and temporary), long-term (storage)
Hyper intelligent
- Could be achieved with quantum processor (qubit)
- Computers could become so smart that they would regard us the way we regard cockroaches
General intelligence
- g factor (psychometric: cognitive abilities & human intelligence)
- Neuromorphic processor
- Turing test and Lovelace test
- 4th quadrant of math (highest point between abstraction and application)
Training & Machine Learning (ML)
- Epoch – one forward and one backward pass over all the training examples
Batch size – the number of training examples in one forward/backward pass
Iterations – the number of passes needed, each using [batch size] examples (e.g. 1,000 training examples with a batch size of 500 need 2 iterations to complete 1 epoch)
- Multi-Task Learning (MTL) – a subfield of ML in which multiple learning tasks are solved at the same time, with the need for explicit learning
Pre-training – use the weights saved from a previous network to initialise the new training
- Supervised learning – task driven; split, apply, combine (classification and regression)
E.g. students in a school, logistic regression
Learning rules (create rules with data and answers) / converting frequencies to probabilities
- Unsupervised learning – data driven (clustering, dimensionality reduction, recommendation)
Generative Adversarial Networks (GAN) – neural networks contesting with each other in a zero-sum game; require less data and fewer parameters; can be used for high-resolution image processing, text-to-image synthesis, training with less data, and predicting missing data
- Transfer learning – storing knowledge gained while solving one problem and applying it to a different one
- Reinforcement learning – algorithm learns to react to an environment (reward maximisation)
- Deep learning – the vanishing gradient problem (layers come with a cost)
- Principal component – a linear combination of the predictor variables
Loadings – the weights that transform the predictors into the components
Scree plot – shows the relative importance of the components
- K-means clustering – divides data into a chosen number of groups
Expectation Maximization (EM) – computes probabilities based on one or more probability distributions
Gaussian mixture – probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions
Cluster – a group of records that are similar
Cluster mean – the vector of variable means for the records in a cluster
Hierarchical clustering – more flexible than k-means and handles non-numerical variables better
Agglomerative algorithm – merges clusters bottom-up until 1 cluster is left, e.g. with Euclidean distance d(x,y) = √((x1−y1)² + (x2−y2)² + … + (xp−yp)²)
Divisive clustering – the opposite of agglomerative: top-down, start from one big cluster and subdivide into smaller ones
Dendrogram – a visual representation of the records and the hierarchy of clusters to which they belong
Distance – a measure of how close one record is to another
Dissimilarity – a measure of how close one cluster is to another
- Model-based clustering – groups of similar records that are not necessarily close to one another
Multivariate normal distribution – a generalization of the 1D normal distribution to higher dimensions
Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC)
- Markov model – a stochastic model used to model a randomly changing system (sequence data model)
Hidden Markov Model – a Markov model with unobserved states
Markov Chain Monte Carlo – for sampling from a probability distribution
Kalman Filter / Linear Quadratic Estimation (LQE) – produces accurate estimates of unknown variables
- Scaling and categorical variables under unsupervised learning
Scaling – squashing or expanding data (magnitude scaling, sigmoid, min-max normalization, z-score, rank scoring)
Normalization – subtracting the mean and dividing by the standard deviation
Gower's distance – brings all variables to a 0–1 range
- Sample types
Sample – a subset from a larger dataset
Population – the large dataset or the idea of a dataset
N (n) – the size of the population (sample)
Random sampling – drawing elements into a sample at random
Stratified sampling – dividing the population into strata
Simple random sample – sampling the population without stratifying
Sample bias – a sample that misrepresents the population
Bias – systematic error
Data snooping – extensive hunting through data in search of something interesting
Vast search effect – bias or non-reproducibility resulting from repeated data modelling
Bootstrap sample – a sample taken with replacement from an observed data set
Resampling – taking repeated samples from observed data
- Tree model, Classification and Regression Tree (CART), decision tree
Recursive partitioning – repeatedly dividing and subdividing the data until each partition is homogeneous
Split value – the predictor value that divides the records
Node – the graphical or rule representation of a split value
Leaf – the end of a set of if-then-else rules
Loss – the number of misclassifications at a stage in the splitting process
Impurity – the mix of classes found in a sub-partition of the data
Measurement: Gini impurity Gini(E) = 1 − Σj pj² and entropy H(E) = −Σj pj log pj (see the impurity sketch after this list)
Pruning – progressively cutting branches to reduce overfitting
- Ensemble – forming a prediction by using a collection of models (split and boost accuracy)
Bagging – a general technique to form a collection of models by bootstrapping the data
Random forest – a type of bagged estimate based on decision tree models
Variable importance – a measure of the importance of a predictor variable in the performance of the model
Flocking algorithm – self-propelled entities and collective animal behaviour
- Boosting – giving more weight to the records with large residuals in each successive round
AdaBoost – reweighting the data based on the residuals
Gradient boosting – minimizing a cost function
Stochastic gradient boosting – resampling records and columns in each round
Regularization – avoiding overfitting by adding a penalty term (and dropout)
Hyperparameters – parameters to be set before fitting the algorithm
XGBoost – open-source software for stochastic gradient boosting
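A minimal sketch of the two impurity measures from the tree-model notes above, Gini(E) = 1 − Σj pj² and entropy H(E) = −Σj pj log pj, applied to a made-up sub-partition of class labels:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_j p_j^2 over the class proportions in a partition."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_j p_j * log2(p_j)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = np.array(["yes", "yes", "no", "yes", "no"])   # class labels in one made-up sub-partition
print(gini(node), entropy(node))   # both are 0 for a pure node and grow as the mix increases
```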
Detections
- Edge detection: Canny, Deriche, Differential, Sobel, Prewitt, Roberts Cross
- Corner detection: Harris operator, Shi and Tomasi, level curve curvature, Hessian feature strength measures
- Blob detection: Laplacian of Gaussian (LoG), Determinant of Hessian (DoH), PCBR, Maximally Stable Extremal Regions (MSER), Difference of Gaussians (DoG)
- Ridge detection: Hough transform, Generalized Hough Transform
- Structure tensor: Generalized Structure Tensor; Affine invariant feature detection: Affine Shape Adaptation, Harris Affine, Hessian Affine; Feature description: Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), Gradient Location and Orientation Histogram (GLOH), GIST, Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP)
- Scale space: Scale-Space Axioms, Axiomatic Theory of Receptive Fields, Implementation Details, Pyramids
Common terms
- Cost function – measures how well the model fits the training data
- Forward propagation – get the output and compare with real value to get error (n+1) or Forward Pass – e.g. CNN & RNN
- Backpropagation – take the derivative of the error with respect to each weight and subtract this value from the weight
- Skip-Thought Vectors – distributed sentence encoder
- Text-guided Attention Model – understand an image and generate natural language of descriptions
- Ubiquitous Computing – computing to be made to appear anytime and anywhere
- Data wrangling (munging) – transforming and mapping raw data to rational data
Numerical Optimization / Recommendation
- Stochastic Gradient Descent (SGD) – iteratively steps toward the optimal minimum (see the sketch after this list)
Gradient descent has been claimed to write programming code better than humans
- Broyden-Fletcher-Goldfarb-Shanno (BFGS) – solving unconstrained nonlinear optimization problems
- Baseline model, low-rank matrix factorization, alternating least squares, contrastive divergence, Restricted Boltzmann Machine
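A minimal sketch of stochastic gradient descent stepping toward the minimum of a squared-error objective; the synthetic data, learning rate, and epoch count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                 # synthetic predictors
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)    # synthetic response

w = np.zeros(2)      # parameters to learn
lr = 0.05            # learning rate (assumed)
for epoch in range(50):
    for i in rng.permutation(len(X)):        # stochastic: one randomly ordered example at a time
        grad = (X[i] @ w - y[i]) * X[i]      # gradient of the squared error for this example
        w -= lr * grad                       # step towards the minimum
print(w)   # should approach the true coefficients [2, -1]
```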
Quantitative Finance
- Insertion sort – simple sorting algorithm
- Orthogonalization – linear algebra, finding a set of orthogonal vectors that span a subspace
- Binary search – search algorithm that finds the position of a target value within a sorted array
- Dickey-Fuller test – checks whether a time series is stationary (financial data is always changing)
- Sharpe Ratio – calculate risk-adjusted returns, can’t work with negative skewness
Alternatives: Sortino ratio, return over maximum drawdown (RoMaD), Treynor ratio
- Simpson's Paradox – a trend reverses when groups of variables are combined
- ANOVA – analysis of variance, difference of means, F-statistic
Regression (ŷ = b0 + b1X)
- Response – the variable we are trying to predict
Independent variable – the variable used to predict the response
Record – the vector of predictor and outcome values for a specific individual
Intercept – the predicted value when x = 0
Regression coefficient – the slope of the regression line
Fitted values – the estimates ŷi obtained from the regression line
Residuals – the differences between the observed and fitted values
Least squares – fit a regression by minimizing the sum of squared residuals
Maximum likelihood – a method of estimating the parameters of a statistical model, given observations
- Root mean squared error – the square root of the average squared error of the regression
Residual standard error – same as root mean squared error but adjusted for degrees of freedom
R-squared – proportion of variance explained by the model, from 0 to 1
Weighted regression – regression with records having different weights
Cross-validation – set aside 1/k → train on the remaining data → apply/score the model on the 1/k holdout → restore the first 1/k and set aside the next 1/k → repeat training until each record is used → average or combine the model assessment metrics (see the sketch after this list)
- Dummy variables – binary 0–1 variables derived by recoding factor data
Reference coding – one factor level is used as a reference and the other levels are compared to it
One hot encoder – all factor levels are retained
Deviation coding – compares each level against the overall mean
- Correlated variables – make it difficult to interpret individual coefficients
Multicollinearity – near-perfect correlation, rendering the regression unstable to compute
Collinearity – one predictor variable in a multiple regression model can be linearly predicted from the others
Main effects – the relationship between a predictor and the outcome variable, independent of other variables
Interactions – interdependent relationship between 2 or more predictors
Bayesian linear regression – regression model in which a prior distribution is assumed
- Standardized residuals – residuals divided by the standard error
Outliers – records that are distant from the rest of the data
Influential value – a value whose presence makes a big difference in the regression equation
Leverage – the degree of influence that a single record has
Non-normal residuals – can invalidate some technical requirements of the regression
Heteroskedasticity – some ranges of the outcome experience residuals with higher variance
Partial residual plots – diagnostic plot of the relationship between the outcome variable and a single predictor
- Polynomial regression – adds polynomial terms to a regression
Spline regression – fitting a smooth curve with a series of polynomial segments
Knots – values that separate spline segments
Generalized additive models – spline models with automated selection of knots
Lasso-based linear regression – performs both variable selection and regularization to enhance prediction accuracy
Advanced Regression Techniques
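A minimal sketch of the regression vocabulary above (least squares fit, fitted values, residuals, RMSE, R²) on hypothetical data; ŷ = b0 + b1X is fitted with numpy's polyfit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * X + rng.normal(scale=1.5, size=100)   # hypothetical data: y = b0 + b1*X + noise

b1, b0 = np.polyfit(X, y, deg=1)     # least squares: minimise the sum of squared residuals
fitted = b0 + b1 * X                 # fitted values (y-hat)
residuals = y - fitted               # observed minus fitted
rmse = np.sqrt(np.mean(residuals ** 2))    # root mean squared error
r2 = 1.0 - residuals.var() / y.var()       # proportion of variance explained by the model
print(f"intercept={b0:.2f} slope={b1:.2f} RMSE={rmse:.2f} R^2={r2:.2f}")
```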
- Adaptive regression, locally estimated scatterplot smoothing (LoESS), proportional hazard regression, quantile regression, robust regression
Data Analysis
- Mean – the average value (x̄ = Σx / n)
Trimmed mean – drop the p smallest and p largest values, then average the rest (x̄ = Σi=p+1…n−p x(i) / (n − 2p))
Weighted mean – adds a multiplier (weight) to each value (x̄w = Σi wi xi / Σi wi); see the sketch after this list
- Attitudinal data: how customer think or feel Behavioral data: how customer interact with the business Demographic data: information of customer base
- Labeled data – a group of samples that have been tagged with one or more labels
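A minimal sketch of the mean, trimmed mean, and weighted mean formulas from the data-analysis notes above, using made-up values with one outlier:

```python
import numpy as np

x = np.array([3.0, 1.0, 250.0, 4.0, 5.0, 2.0, 6.0])   # made-up values with one extreme outlier
w = np.array([1.0, 1.0, 0.1, 1.0, 1.0, 1.0, 1.0])     # made-up weights that down-weight the outlier

mean = x.mean()                        # x-bar = sum(x) / n
p = 1                                  # trim p values from each end
trimmed = np.sort(x)[p:-p].mean()      # trimmed mean over the middle n - 2p values
weighted = np.average(x, weights=w)    # sum(w_i * x_i) / sum(w_i)
print(mean, trimmed, weighted)
```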
Data Smoothing
- Reduction of the noise that causes the fluctuations
Methods: Random Walk, Moving Average, Exponential Smoothing
Recommender System
- Concepts: Collective Intelligence, Relevance, Star Ratings, Long Tail
- Methods and challenges: Cold Start, Collaborative Filtering, Dimensionality Reduction, Implicit Data Collection, Item-Item Collaborative Filtering, Matrix Factorization, Preference Elicitation, Principal Component Analysis (PCA), Similarity Search, Social Loafing, Topic Model
- Text recommendation algorithms: Singular Value Decomposition (SVD), M = UΣV* (see the sketch after this list); Latent Dirichlet Allocation (LDA)
- Gains and Lift Charts
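A minimal sketch of the SVD decomposition M = UΣV* on a made-up user-by-item rating matrix, followed by a low-rank reconstruction of the kind used in recommendation:

```python
import numpy as np

# A small user-by-item rating matrix (made-up values)
M = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U * Sigma * V*
M_rank2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]    # keep only the 2 largest singular values
print(np.round(M_rank2, 2))                        # low-rank approximation used in recommendation
```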
Reduce Dimensionality / Approximation Series
- Taylor network, Fourier Transform, Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), Singular Value Decomposition (SVD), Locally Linear Embedding (LLE), ISOMap, t-SNE, Multidimensional Scaling (MDS)
Classification
- Conditional probability – the probability of observing some event given some other event, P(Xi|Yi)
Posterior probability – the probability of an outcome after the predictor information has been incorporated
- Covariance – the extent to which one variable varies in concert with another
Discriminant function – the function that maximizes the separation of the classes
Discriminant weights – the scores used to estimate the probabilities of belonging to one or another class
Techniques: Linear Discriminant Analysis (LDA), covariance matrix, Fisher's linear discriminant
- Logit – the function that maps a class-membership probability to a range from ±∞
Odds – the ratio of success (1) to not success (0)
Log odds – the response in the transformed model, which is mapped back to a probability
Techniques: generalized linear models
- Accuracy – the percent of cases classified correctly
Confusion matrix – record counts by their predicted and actual classification status
Sensitivity – the percent of 1s correctly classified
Specificity – the percent of 0s correctly classified
Precision – the percent of predicted 1s that are actually 1s
Receiver Operating Characteristic (ROC) curve – a plot of sensitivity versus specificity
Area Under the Curve (AUC)
Lift – how effective the model is at different probability cutoffs
Percent Correct Classification (PCC) – measures overall accuracy
- Undersample – use fewer of the prevalent class records in the classification model
Oversample – use more of the rare class records in the classification model
Up weight or down weight – give more (or less) weight to the rare (or prevalent) class in the model
Data generation – each new bootstrapped record is slightly different from its source
K – the number of neighbours considered in the nearest-neighbour calculation
K-Nearest Neighbours (KNN) – hunts for records with similar features and classifies by the neighbourhood around them
Euclidean distance – the "ordinary" straight-line distance between two points in Euclidean space
Manhattan distance – distance between real vectors using the sum of their absolute differences (∑i |xi − yi|)
Hamming distance – for categorical variables, the number of positions at which values differ (DH = ∑i 1(xi ≠ yi)); see the distance sketch after this list
- Bayesian classification – predicts class membership probabilities, e.g. the probability that a given tuple belongs to a particular class
Naïve Bayes network – probability based on evidence
Bayesian nonparametric model – allows the data to determine the complexity of the model
- Support Vector Machine (SVM) – binary classifier
Kernel method – a class of algorithms for pattern analysis; the best-known member is the SVM
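A minimal sketch of the Euclidean, Manhattan, and Hamming distances and a tiny K-Nearest Neighbours vote; the training points and k value are made up:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))   # "ordinary" straight-line distance

def manhattan(x, y):
    return np.sum(np.abs(x - y))           # sum of absolute differences

def hamming(x, y):
    return np.sum(x != y)                  # number of positions where categories differ

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest (Euclidean) neighbours."""
    d = np.array([euclidean(x, x_new) for x in X_train])
    nearest = y_train[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])   # made-up training points
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))   # expected: class 0
```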
Distribution
- Standard distribution, normal (Gaussian) distribution, uniform distribution
Other distributions: Bernoulli, Beta, Binomial, Bivariate Normal, G-and-H, Burr, Bimodal, Heavy Tailed, Continuous Probability, Categorical, Yule-Simon, Lévy, Cauchy, Degenerate, Cumulative Frequency, Cumulative Geometric, Dirichlet, Extreme Value, Discrete Probability, Empirical, Gompertz, PERT, Erlang, Exponential, Generalized Error, Wishart, Weibull, T, Factorial, Hypergeometric, Folded / Half Normal, Wallenius, Fat Tail, F, J-Shaped, Inverse Normal, Inverse Gaussian, Von Mises, U-Shaped, Fisk, Laplace, Kumaraswamy, Multivariate Normal, Lindley, Lognormal, Kent, Long Tail, Marginal, Mixture, Multimodal, Multinomial, Rician, Nakagami, Normal, Negative Binomial, Open Ended, Pareto, Zeta, Pearson, Unimodal, Tukey Lambda, Power Law, Rayleigh, Poisson, Reciprocal, Uniform, Relative Frequency, Skewed, Triangular, Stable, Symmetric, Tweedie, Truncated Normal, Trapezoidal
Generalization error – training error / loss function
- Sample statistic – a metric calculated from a sample drawn from a larger population
Data distribution – the frequency distribution of individual values in a data set
Sampling distribution – the frequency distribution of a sample statistic over many samples
Central limit theorem – the tendency of the sampling distribution to take on a normal shape
Standard error – the variability of a sample statistic over many samples (SE = s/√n)
- Boxplot – a quick way to visualize data
Frequency table – a tally of the count of numeric data values that fall into a set of intervals
Histogram – a plot of the frequency table, with bins on the x-axis and counts on the y-axis
Density plot – a smoothed version of the histogram
- Confidence level – the percentage of confidence intervals expected to contain the statistic of interest
Interval endpoints – the top and bottom of the confidence interval
- Error – the difference between a data point and a predicted or average value
Standardize – subtract the mean and divide by the standard deviation
Z-score – the result of standardizing an individual data point
Standard normal – a normal distribution with mean = 0 and standard deviation = 1
QQ-Plot – a plot to visualize how close a sample distribution is to a normal distribution
Error is zero (0) – stay as it is
Error is positive (+) – increase the neurons and/or the weight
Error is negative (−) – decrease the neurons and/or the weight
- Trial – an event with a discrete outcome
Success – the outcome of interest for a trial
Binomial – having 2 outcomes; measure error at a 95% confidence level
Binomial trial – a trial with 2 outcomes
Binomial distribution – the distribution of the number of successes in x trials
- Lambda – the rate per unit of time or space at which events occur
Poisson distribution – the frequency distribution of the number of events in sampled units of time or space
Exponential distribution – the frequency distribution of the time or distance between events
Weibull distribution – a generalized version of the exponential in which the event rate is allowed to shift
- Skewness – a measurement of the asymmetry of a distribution
Kurtosis – the sharpness of the peak of a frequency-distribution curve
Tail – the long narrow portion of a frequency distribution
Skew – one tail of a distribution is longer than the other
Anscombe's Quartet – datasets that appear identical in summary statistics yet look very different when graphed
Fix positive skew – log transform log(x), multiplicative inverse 1/x, square root sqrt(x)
Fix negative skew – square (xⁿ), inverse log −log10(1 + |x|)
(a sketch of these skew-fixing transforms follows below)
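A minimal sketch of the skew-fixing transforms listed above, applied to made-up positively and negatively skewed samples:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 60.0])    # made-up positively skewed sample (long right tail)
log_x = np.log(x)          # log transform
inv_x = 1.0 / x            # multiplicative inverse
sqrt_x = np.sqrt(x)        # square root

y = np.array([-50.0, 1.0, 2.0, 2.5, 3.0, 4.0])   # made-up negatively skewed sample (long left tail)
sq_y = y ** 2                           # squaring stretches the right-hand side
invlog_y = -np.log10(1 + np.abs(y))     # inverse-log variant noted above
print(log_x, sq_y, sep="\n")
```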
Correlation
To measure: Spurious (not related), Crosstab (cross tabulation), Scatterplot
- Correlation coefficient – a metric that measures correlation from −1 to +1
Correlation matrix – a table of correlation values between variables, with 1s on the diagonal
Scatterplot – a plot with x and y axes
Pearson correlation coefficient (r) – linear correlation between two variables X and Y
- Contingency tables – a tally of counts between 2 or more categorical variables
Hexagonal binning – a plot of 2 numeric variables with the records binned into hexagons
Contour plots – a plot showing the density of 2 numeric variables like a topographical map
Violin plots – like a boxplot but showing the density estimate
- Time series characteristic – Hurst exponent, autocorrelation coefficient
Time series – a series of data points indexed (or listed or graphed) in time order
Testing
- Treatment – something to which a subject is exposed
Treatment group – a group of subjects exposed to a specific treatment
Control group – a group of subjects exposed to no or standard treatment
Randomization – randomly assigning subjects to treatments
Subjects – items that are exposed to treatments
Test statistic – metric used to measure the effect of the treatment
- Hypothesis testing - confirmatory data analysis
Null hypothesis – chance is to blame
Alternative hypothesis – counterpoint to the null as on what you hope to prove
One-way test – count chance result only in 1 direction
Two-way test – count chance result in 2 directions
- Permutation test – combining 2 or more samples together and randomly reallocating the observations
With or without replacement – an item is returned to the sample before the next draw
- P-value – the probability of obtaining results as unusual or extreme as the observed results, given a chance model
Alpha – probability threshold of unusualness
Type 1 error – mistakenly concluding an effect is real when it is due to chance
Type 2 error – mistakenly concluding an effect is due to chance when it is real
False discovery rate – the rate of making a type 1 error across multiple tests
Adjustment of p-value – doing multiple tests on the same data
Underfitting – lack of essential variables
Overfitting – fitting the noise
MCAR – missing completely at random
MAR – missing at random
MNAR – missing not at random
- T-test – to determine whether there is a significant difference between the means of two groups
Test statistic – metric for the difference or effect of interest
t-statistic – a standardized version of the test statistic
t-distribution – a reference distribution to which the observed t-statistic can be compared
- Pairwise comparison – hypothesis test between 2 groups among multiple groups
Omnibus test – single hypothesis test of the overall variance among multiple group means
Decomposition of variance – separation of components and contributing to an individual value
F-statistic – measure the extent among group means to a chance model
Sum of squares (SS) – deviations from some average value (SS = Σi (yi − ȳ)²)
- Chi-square statistic – a measure of the extent to which some observed data departs from expectation (χ² = Σr Σc R², where R is the Pearson residual)
Pearson residual - raw residual divided by the square root of the variance function
Expectation or expected – how we expect the data to turn out under some assumption
- Multi-arm bandit algorithm – allows explicit optimization and more rapid decision making
- Effect size – the size of the effect you hope to detect in a statistical test
Power – the probability of detecting a given effect size with a given sample size
Significance level – level at which the test will be conducted
Examples
Quantitative Neural Network
- Neural networks do not make an outright forecast; instead they predict the price data and uncover opportunities
- Example
Using an MLP on the OHLCV tuple to derive MACD, Ichimoku Cloud, RSI, and volatility features
Objective is to skip sideways movements
[(‘Total Return’, ‘1.66%’), (‘Sharpe Ratio’, ‘16.27’), (‘Max Drawdown’, ‘2.28%’), (‘Drawdown Duration’, ‘204’)]
Signals: 9 Orders: 9 Fills: 9
[(‘Total Return’, ‘3.07%’), (‘Sharpe Ratio’, ‘27.99’), (‘Max Drawdown’, ‘1.91%’), (‘Drawdown Duration’, ‘102’)]
Signals: 7 Orders: 7 Fills: 7
- Potential adjustments
- Forecast something different (e.g. volatility)
- Use multimodal learning on different data source
- Mix of different classical models
Identifying the image and learning to present it in a natural way
Natural Language Processing
STV – Stop Training with Validation
Opinion Finder – text mood detector (negative/positive)
Google Profile of Mood States (GPOMS) – another text mood detector
VIX Volatility Model – Gil Morales and Dr Chris Kacher
A self-learning/self-evolving algorithm that has successfully traded bull, bear, and sideways market
Back-tested results and live results after 2016:
2009 +178.7%
2010 +518.1%
2011 +274.5%
2012 +289.4%
2013 +103.1%
2014 +161.9%
2015 +589.7%
2016 +61.8% (real-time trading started uninterrupted on 11-8-16)
2017 +39.3% (as of 3-3-17, 1st quarter only)
- There have been AI winters and stall zones; however, the recoveries bloomed exponentially
- Part of the failure also due to companies putting fourth-generation machines into the third-generation business model
- Shouldn’t apply the same formula throughout – different industries have different melting points
- 2022 to 2030: machines will start large scale replacement for human workers
- Humans can stay advantageous if their job scope includes analytical thinking, social media, and fabrication skills
- Narrow AI: now (the first and second waves)
- General AI: strong AI (third wave)
- Super AI: unstoppable, hyperintelligence (fourth wave)
- Dystopian: we should fear our pending robot overlords
- Utopian: everything is awesome, and technology will always be the answer
- Pragmatist: the future can be as good if we make smart and practical decisions
- Benefits / advantages of AI trader
- No downtime and will not be tired
- No emotions
- Learn better and faster especially in the long run
- Oil vs. Data – data being the new raw material that is infinite and easier to acquire and process