Neural Network & Machine Learning Notes
World → data → information → knowledge → wisdom
OSEMN: Obtaining data, Scrubbing data, Exploring data, Modelling data, Interpreting data
To succeed in machine learning: accessing & managing data, skill sets, disparate technologies
3-legged stool: domain expert, data expert & predictive modelling expert
Define data sources → Cross Industry Standard Process – Data Mining (CRISP-DM) → data exploration → prepare data → data aggregation, preprocessing & warehousing → sampling data → formulate hypothesis → design experiment → machine learning → clustering → anomaly detection (e.g. one-class classifier) → association-rule mining (e.g. do you want fries with that? 1. Frequent itemset. 2. Apriori algorithm) → building deployable models → prediction (e.g. propensity model & regression) → decision making → collect data → inference / conclusions → evaluating your model → updating your model
Learning rates (MLPs) → momentum (MLPs) → number of iterations/epochs → number of hidden layers and number of neurons in each hidden layer → stopping criteria → weight updates → quick propagation → resilient propagation (Rprop) → second-order methods
Feedforward Neural Network (FNN)
- Also known as Artificial Neural Network (ANN)
- Perceptron is a single layer neural network (binary)
- Also known as Multilayer Perceptron (MLP)
- Global basis functions (sigmoid)
- Requires non-linear regression, i.e. an optimization algorithm is used to minimize the error, which slows down the regression
- Non-linear regression introduces a little variation (results may differ slightly between runs)
- Better at approximating sparse data
- More accurate for non-uniform point distribution, again mainly because cross-validation works better for uniformly dense sets
- When using ensembles, typically about 4,500 neural networks must be computed individually (across the ensemble and hidden-node options), which could take days
- Σj WjXj + B, where W = weight, X = node, B = bias, k = period, l = layer, j = neuron
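As a rough illustration of the weighted-sum formula above, here is a minimal sketch (not from the original notes; the input values and weights are made up) of a single neuron computing Σj WjXj + B followed by a sigmoid activation:

```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of inputs plus bias (sum_j W_j * X_j + B), passed through a sigmoid."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))   # global sigmoid basis function

# Hypothetical example: 3 input nodes feeding a single hidden neuron
x = np.array([0.5, -1.2, 3.0])   # node values X
w = np.array([0.1, 0.4, -0.3])   # weights W
print(neuron_output(x, w, b=0.05))
```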
Radial Basis Function (RBF)
- Localised basis functions (e.g. Gaussian)
- Using linear regression (if the spread (S) and number of basis functions (m) is fixed)
- Cross-validation in the outer loop of the regression to determine m and S. Cross-validation means that we minimise the PRESS error over m and S to theoretically get the best predictor
- In uniformly dense sets it may be better since, theoretically, cross-validation should provide a more accurate response surface
- A single design point (i.e. a converged solution), using the default SRSM (sequential) approach with linear basis functions (the default approach) is still the best and cheapest
- E.g. when a power grid fails, it will find the most effective route to patch it back
- Φ(x,c) = Φ(‖x-c ‖)
- Any function that satisfies the above formula is an RBF; x is the node, c is the centre
- Φ (phi) denotes the basis function; ‖x − c‖ is the norm of the vector, i.e. the distance from the centre
- Φ(r) = e^(−(εr)²)
- This is the Gaussian model, with r = ‖x − xi‖ (xi being the i-th centre); e is Euler's number, 2.718; ε (epsilon) is a shape parameter controlling the width of the basis function
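A minimal sketch of the Gaussian basis function Φ(r) = e^(−(εr)²) with r = ‖x − c‖; the sample point, centre, and ε value are arbitrary illustrations:

```python
import numpy as np

def gaussian_rbf(x, c, eps):
    """Phi(r) = exp(-(eps * r)^2), where r = ||x - c|| is the distance from the centre c."""
    r = np.linalg.norm(x - c)
    return np.exp(-(eps * r) ** 2)

x = np.array([1.0, 2.0])       # input point (made up)
c = np.array([0.0, 0.0])       # centre of the localised basis function
print(gaussian_rbf(x, c, eps=0.5))   # eps is the shape parameter controlling the spread
```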
Kohonen Self-Organizing Map Neural Network (SOM)
- Finds the outline of an image and maps it to the nearest vector
- Does not handle categorical variables well, computationally expensive & potentially inconsistent solutions
- Each node’s weight is initialised
- A vector is chosen randomly from the training dataset
- The node whose weights are most like the input vector becomes the Best Matching Unit (BMU)
- The BMU's neighbourhood is calculated, and its radius decreases over time
- The BMU and its neighbours become more like the sample vector; nodes learn less the farther they are from the BMU
- Repeat step 2 for N iterations
- Wv(s+1) = Wv(s) + θ(u,v,s) · α(s) · (D(t) - Wv(s))
- Wv is the current weight vector of node v
- s is the current iteration
- u is the index of the best matching unit (BMU) in the map
- v is the index of the node in the map
- t is the index of the target input data vector in the input data set D
- θ(u,v,s) is a restraint due to distance from BMU, usually called the neighbourhood function
- α(s) is a learning restraint due to iteration progress
- D(t) is a target input data vector
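A minimal sketch of the update rule Wv(s+1) = Wv(s) + θ(u,v,s)·α(s)·(D(t) − Wv(s)), assuming a 1-D map, a Gaussian neighbourhood function, and linearly decaying α and neighbourhood radius (the map size and data are made up):

```python
import numpy as np

def som_step(W, x, s, n_iter, sigma0=2.0, alpha0=0.5):
    """One SOM iteration: find the BMU, then pull nearby nodes towards the sample x."""
    u = np.argmin(np.linalg.norm(W - x, axis=1))          # index of the best matching unit
    alpha = alpha0 * (1.0 - s / n_iter)                   # learning restraint alpha(s)
    sigma = sigma0 * (1.0 - s / n_iter) + 1e-9            # shrinking neighbourhood radius
    for v in range(len(W)):
        theta = np.exp(-((v - u) ** 2) / (2.0 * sigma ** 2))  # neighbourhood function theta(u, v, s)
        W[v] += theta * alpha * (x - W[v])                # Wv(s+1) = Wv(s) + theta * alpha * (D(t) - Wv(s))
    return W

rng = np.random.default_rng(0)
W = rng.random((10, 3))            # 10 map nodes with 3-dimensional weights, randomly initialised
for s in range(100):               # repeat for N iterations
    x = rng.random(3)              # a vector chosen randomly from the training data
    W = som_step(W, x, s, n_iter=100)
```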
Recurrent Neural Network (RNN)
- Commonly associated with Long Short-Term Memory (LSTM) (ht = ot · tanh(Ct))
- Elman and Jordan networks (yt = σy(Wy ht + by))
- Gated Recurrent Unit (GRU) (ht = (1 − zt) · ht−1 + zt · ĥt)
- Continuous-time (CTRNN) – uses differential equations to model the effect on a postsynaptic neuron of an incoming spike train
- Ideal for time-based data processing
- Also used for speech and translation
- Includes the brute force approach – predict and train a child network using all possible neural network configurations
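A minimal sketch of a single GRU time step implementing ht = (1 − zt)·ht−1 + zt·ĥt; the weight matrices are random and biases are omitted, so this only illustrates the data flow, not a trained network:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step: update gate z, reset gate r, candidate state, then blend."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate r_t
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate activation (h-hat_t)
    return (1.0 - z) * h_prev + z * h_cand           # h_t = (1 - z_t)*h_{t-1} + z_t*h-hat_t

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
Wz, Wr, Wh = (rng.normal(size=(d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(d_h, d_h)) for _ in range(3))
h = np.zeros(d_h)
for t in range(5):                                   # step through a short (made-up) time series
    h = gru_step(rng.normal(size=d_in), h, Wz, Uz, Wr, Ur, Wh, Uh)
print(h)
```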
Convolutional Neural Network
- Max pooling (sample-based discretization process; down-sampling). One property of max pooling is that it is non-convex
- Average pooling (instead of the highest probable nodes, it averages the nodes)
- A well-known CNN architecture is AlexNet
- Common image dataset can be found at Canadian Institute For Advanced Research (CIFAR)
- An improved version is called the Capsule Neural Network
- Pretrained image models: VGG, ResNet50 (Residual Network), Inception (GoogLeNet), Xception
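A minimal sketch of 2×2 max pooling and average pooling over a small feature map (the feature-map values are made up):

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Down-sample a 2-D feature map with non-overlapping 2x2 windows."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)   # made-up feature map
print(pool2x2(fm, "max"))       # keeps the strongest activation in each window
print(pool2x2(fm, "average"))   # averages each window instead of taking the maximum
```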
Modular Neural Network
- Multiple neural networks
Energy based Neural Network
- Traditional models use electrical pulses
- Ising model paradigm, generative model
- Undirected, e.g. Restricted Boltzmann Machine (RBM)
- Gating – maximum entropy to obtain energy-based models for binary gates
Autoencoder
- Learn efficient data coding in an unsupervised manner
- Compress data from the input layer into a short code, and then uncompress that code into something that closely matches the original data
- E.g. learning how to ignore noise
- z = σ(Wx + b)
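A minimal sketch of the encode/decode step z = σ(Wx + b) followed by a reconstruction; the weights are random and untrained, so the reconstruction error is only illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.random(8)                                       # original input (made up)

W_enc, b_enc = rng.normal(size=(3, 8)), np.zeros(3)     # compress 8 features into a 3-value code
W_dec, b_dec = rng.normal(size=(8, 3)), np.zeros(8)     # expand the code back to 8 features

z = sigmoid(W_enc @ x + b_enc)                          # encoder: z = sigma(Wx + b)
x_hat = sigmoid(W_dec @ z + b_dec)                      # decoder: reconstruction of the input
print(np.mean((x - x_hat) ** 2))                        # reconstruction error minimised during training
```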
3 Stages as compared to Human Neural Network
Dendrites (input)
- Model evaluation – how good the fit is between the model and the data
Confidence interval – how reliable a statistical estimate is
Confusion matrix – in the context of classification
Gain and lift chart – ratio of results with and without the predictive model
Kolmogorov-Smirnov chart – compares distributions (nonparametric)
Gini coefficient – classification problems
Cross validation – how the model will perform in the future
Predictive power – based on the concept of entropy or the Gini index
- Single-variable selection techniques
Chi-square test
CHAID stump using the Chi-square test
Association rules (confidence, 1 antecedent)
ANOVA
Kolmogorov-Smirnov (K-S) distance, two-sample tests
Linear regression forward selection (1 step)
Principal component analysis
Cell nucleus (nodes) × Synapse (weights)
- Transfer function
- Only have AND, NAND, OR, XOR, NOR, NOT
- Layers
- Softmax
- Activation functions (see the sketch after this list)
Sigmoid function – 0 to 1 | σ(z) = 1 / (1 + e^(−z))
Threshold (step) function – 0 or 1 | y = 0 if x < n; 1 if x ≥ n
Rectified Linear Unit (ReLU) function – non-linear rectification of the input | R(z) = max(0, z)
Leaky ReLU – fixes the dying-ReLU problem with a small negative slope | max(0.01x, x)
Hyperbolic tangent (tanh) function – −1 to 1 | tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
Exponential Linear Unit (ELU)
Deeply-Supervised Net (DSN) – under CNN
Network-in-Network (NiN) – micro network
Maxout – maximum of multiple possible outputs
Highway network – regulates information flow
Expected Signal Propagation (ESP) – technique in Bayesian machine learning
- Predictive modelling process: Sample, Explore, Modify, Model, Assess (SEMMA)
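A minimal sketch of the activation functions listed above (sigmoid, threshold/step, ReLU, leaky ReLU, tanh, softmax); the test values are arbitrary:

```python
import numpy as np

def sigmoid(z):                  # squashes to the range 0 to 1
    return 1.0 / (1.0 + np.exp(-z))

def step(z, n=0.0):              # threshold: 0 below n, 1 at or above n
    return np.where(z >= n, 1.0, 0.0)

def relu(z):                     # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # small negative slope avoids "dead" neurons
    return np.where(z > 0, z, alpha * z)

def tanh(z):                     # squashes to the range -1 to 1
    return np.tanh(z)

def softmax(z):                  # turns a score vector into probabilities that sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, -0.5, 0.0, 1.5])   # arbitrary test values
for f in (sigmoid, step, relu, leaky_relu, tanh, softmax):
    print(f.__name__, f(z))
```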
Human Brain
- Processing: neuron, nervous system
- Implicit memory: procedural (muscle memory), priming (external correlation), classical conditioning (internal correlation)
- Explicit memory: semantic (knowledge and facts), episodic (life's events)
- Stages: sensory (all the senses), working (short-term and temporary), long-term (storage)
Hyper intelligent
- Could be achieved with quantum processor (qubit)
- Computers could become so smart that they would regard us the way we regard cockroaches
General intelligence
- g factor (psychometric: cognitive abilities & human intelligence)
- Neuromorphic processor
- Turing test and Lovelace test
- 4th quadrant of math (highest point between abstraction and application)
Training & Machine Learning (ML)
- Epoch – one forward and one backward pass over all the training examples
Batch size – the number of training examples in one forward/backward pass
Iterations – the number of passes needed, each using [batch size] examples (e.g. 1,000 training examples with a batch size of 500 need 2 iterations to complete 1 epoch)
- Multi-Task Learning (MTL) – a subfield of ML in which multiple learning tasks are solved at the same time, with the need for explicit learning
Pre-training – use the weights saved from a previous network to initialise the new training
- Supervised learning – task driven; split, apply, combine (classification and regression)
E.g. students in a school, logistic regression
Learning rules (create rules with data and answers) / converting frequencies to probabilities
- Unsupervised learning – data driven (clustering, dimensionality reduction, recommendation)
Generative Adversarial Networks (GAN) – neural networks contesting with each other in a zero-sum game; require less data and fewer parameters; can be used for high-resolution image processing, text-to-image synthesis, training with less data, and predicting missing data
- Transfer learning – storing knowledge gained while solving one problem and applying it to a different one
- Reinforcement learning – algorithm learns to react to an environment (reward maximisation)
- Deep learning – the vanishing gradient problem (layers come with a cost)
- Principal component – a linear combination of the predictor variables
Loadings – the weights that transform the predictors into the components
Scree plot – shows the relative importance of the components
- K-means clustering – divides data into a chosen number of groups
Expectation Maximization (EM) – computes probabilities based on one or more probability distributions
Gaussian mixture – probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions
Cluster – a group of records that are similar
Cluster mean – the vector of variable means for the records in a cluster
Hierarchical clustering – more flexible than k-means and handles non-numerical variables better
Agglomerative algorithm – merges clusters bottom-up until 1 cluster is left, e.g. with Euclidean distance d(x,y) = √((x1−y1)² + (x2−y2)² + … + (xp−yp)²)
Divisive clustering – the opposite of agglomerative: top-down, start from one big cluster and subdivide into smaller ones
Dendrogram – a visual representation of the records and the hierarchy of clusters to which they belong
Distance – a measure of how close one record is to another
Dissimilarity – a measure of how close one cluster is to another
- Model-based clustering – groups of similar records that are not necessarily close to one another
Multivariate normal distribution – a generalization of the 1D normal distribution to higher dimensions
Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC)
- Markov model – a stochastic model used to model a randomly changing system (sequence data model)
Hidden Markov Model – a Markov model with unobserved states
Markov Chain Monte Carlo – for sampling from a probability distribution
Kalman Filter / Linear Quadratic Estimation (LQE) – produces accurate estimates of unknown variables
- Scaling and categorical variables under unsupervised learning
Scaling – squashing or expanding data (magnitude scaling, sigmoid, min-max normalization, z-score, rank scoring)
Normalization – subtracting the mean and dividing by the standard deviation
Gower's distance – brings all variables to a 0–1 range
- Sample types
Sample – a subset from a larger dataset
Population – the large dataset or the idea of a dataset
N (n) – the size of the population (sample)
Random sampling – drawing elements into a sample at random
Stratified sampling – dividing the population into strata
Simple random sample – sampling the population without stratifying
Sample bias – a sample that misrepresents the population
Bias – systematic error
Data snooping – extensive hunting through data in search of something interesting
Vast search effect – bias or non-reproducibility resulting from repeated data modelling
Bootstrap sample – a sample taken with replacement from an observed data set
Resampling – taking repeated samples from observed data
- Tree model, Classification and Regression Tree (CART), decision tree
Recursive partitioning – repeatedly dividing and subdividing the data until each partition is homogeneous
Split value – the predictor value that divides the records
Node – the graphical or rule representation of a split value
Leaf – the end of a set of if-then-else rules
Loss – the number of misclassifications at a stage in the splitting process
Impurity – the mix of classes found in a sub-partition of the data
Measurement: Gini impurity Gini(E) = 1 − Σj pj² and entropy H(E) = −Σj pj log pj (see the impurity sketch after this list)
Pruning – progressively cutting branches to reduce overfitting
- Ensemble – forming a prediction by using a collection of models (split and boost accuracy)
Bagging – a general technique to form a collection of models by bootstrapping the data
Random forest – a type of bagged estimate based on decision tree models
Variable importance – a measure of the importance of a predictor variable in the performance of the model
Flocking algorithm – self-propelled entities and collective animal behaviour
- Boosting – giving more weight to the records with large residuals in each successive round
AdaBoost – reweighting the data based on the residuals
Gradient boosting – minimizing a cost function
Stochastic gradient boosting – resampling records and columns in each round
Regularization – avoiding overfitting by adding a penalty term (and dropout)
Hyperparameters – parameters to be set before fitting the algorithm
XGBoost – open-source software for stochastic gradient boosting
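A minimal sketch of the two impurity measures from the tree-model notes above, Gini(E) = 1 − Σj pj² and entropy H(E) = −Σj pj log pj, applied to a made-up sub-partition of class labels:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_j p_j^2 over the class proportions in a partition."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_j p_j * log2(p_j)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = np.array(["yes", "yes", "no", "yes", "no"])   # class labels in one made-up sub-partition
print(gini(node), entropy(node))   # both are 0 for a pure node and grow as the mix increases
```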
Detections
- Edge detection: Canny, Deriche, Differential, Sobel, Prewitt, Roberts Cross
- Corner detection: Harris operator, Shi and Tomasi, level curve curvature, Hessian feature strength measures
- Blob detection: Laplacian of Gaussian (LoG), Determinant of Hessian (DoH), PCBR, Maximally Stable Extremal Regions (MSER), Difference of Gaussians (DoG)
- Ridge detection: Hough transform, Generalized Hough Transform
- Structure tensor: Generalized Structure Tensor; Affine invariant feature detection: Affine Shape Adaptation, Harris Affine, Hessian Affine; Feature description: Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), Gradient Location and Orientation Histogram (GLOH), GIST, Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP)
- Scale space: Scale-Space Axioms, Axiomatic Theory of Receptive Fields, Implementation Details, Pyramids
Common terms
- Cost function – measures how well the model fits the training data
- Forward propagation – get the output and compare with real value to get error (n+1) or Forward Pass – e.g. CNN & RNN
- Backpropagation – take the derivative of the error with respect to each weight and subtract this value from the weight
- Skip-Thought Vectors – distributed sentence encoder
- Text-guided Attention Model – understand an image and generate natural language of descriptions
- Ubiquitous Computing – computing to be made to appear anytime and anywhere
- Data wrangling (munging) – transforming and mapping raw data to rational data
Numerical Optimization / Recommendation
- Stochastic Gradient Descent (SGD) – iteratively steps toward the optimal minimum (see the sketch after this list)
Gradient descent has been claimed to write programming code better than humans
- Broyden-Fletcher-Goldfarb-Shanno (BFGS) – solving unconstrained nonlinear optimization problems
- Baseline model, low-rank matrix factorization, alternating least squares, contrastive divergence, Restricted Boltzmann Machine
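A minimal sketch of stochastic gradient descent stepping toward the minimum of a squared-error objective; the synthetic data, learning rate, and epoch count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                 # synthetic predictors
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)    # synthetic response

w = np.zeros(2)      # parameters to learn
lr = 0.05            # learning rate (assumed)
for epoch in range(50):
    for i in rng.permutation(len(X)):        # stochastic: one randomly ordered example at a time
        grad = (X[i] @ w - y[i]) * X[i]      # gradient of the squared error for this example
        w -= lr * grad                       # step towards the minimum
print(w)   # should approach the true coefficients [2, -1]
```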
Quantitative Finance
- Insertion sort – simple sorting algorithm
- Orthogonalization – linear algebra, finding a set of orthogonal vectors that span a subspace
- Binary search – search algorithm that finds the position of a target value within a sorted array
- Dickey-Fuller test – checks whether a time series is stationary (financial data is always changing)
- Sharpe Ratio – calculate risk-adjusted returns, can’t work with negative skewness
Alternatives: Sortino ratio, return over maximum drawdown (RoMaD), Treynor ratio
- Simpson's Paradox – a trend reverses when groups of variables are combined
- ANOVA – analysis of variance, difference of means, F-statistic
Regression (ŷ = b0 + b1X)
- Response – the variable we are trying to predict
Independent variable – the variable used to predict the response
Record – the vector of predictor and outcome values for a specific individual
Intercept – the predicted value when x = 0
Regression coefficient – the slope of the regression line
Fitted values – the estimates ŷi obtained from the regression line
Residuals – the differences between the observed and fitted values
Least squares – fit a regression by minimizing the sum of squared residuals
Maximum likelihood – a method of estimating the parameters of a statistical model, given observations
- Root mean squared error – the square root of the average squared error of the regression
Residual standard error – same as root mean squared error but adjusted for degrees of freedom
R-squared – proportion of variance explained by the model, from 0 to 1
Weighted regression – regression with records having different weights
Cross-validation – set aside 1/k → train on the remaining data → apply/score the model on the 1/k holdout → restore the first 1/k and set aside the next 1/k → repeat training until each record is used → average or combine the model assessment metrics (see the sketch after this list)
- Dummy variables – binary 0–1 variables derived by recoding factor data
Reference coding – one factor level is used as a reference and the other levels are compared to it
One hot encoder – all factor levels are retained
Deviation coding – compares each level against the overall mean
- Correlated variables – make it difficult to interpret individual coefficients
Multicollinearity – near-perfect correlation, rendering the regression unstable to compute
Collinearity – one predictor variable in a multiple regression model can be linearly predicted from the others
Main effects – the relationship between a predictor and the outcome variable, independent of other variables
Interactions – interdependent relationship between 2 or more predictors
Bayesian linear regression – regression model in which a prior distribution is assumed
- Standardized residuals – residuals divided by the standard error
Outliers – records that are distant from the rest of the data
Influential value – a value whose presence makes a big difference in the regression equation
Leverage – the degree of influence that a single record has
Non-normal residuals – can invalidate some technical requirements of the regression
Heteroskedasticity – some ranges of the outcome experience residuals with higher variance
Partial residual plots – diagnostic plot of the relationship between the outcome variable and a single predictor
- Polynomial regression – adds polynomial terms to a regression
Spline regression – fitting a smooth curve with a series of polynomial segments
Knots – values that separate spline segments
Generalized additive models – spline models with automated selection of knots
Lasso-based linear regression – performs both variable selection and regularization to enhance prediction accuracy
Advanced Regression Techniques
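A minimal sketch of the regression vocabulary above (least squares fit, fitted values, residuals, RMSE, R²) on hypothetical data; ŷ = b0 + b1X is fitted with numpy's polyfit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * X + rng.normal(scale=1.5, size=100)   # hypothetical data: y = b0 + b1*X + noise

b1, b0 = np.polyfit(X, y, deg=1)     # least squares: minimise the sum of squared residuals
fitted = b0 + b1 * X                 # fitted values (y-hat)
residuals = y - fitted               # observed minus fitted
rmse = np.sqrt(np.mean(residuals ** 2))    # root mean squared error
r2 = 1.0 - residuals.var() / y.var()       # proportion of variance explained by the model
print(f"intercept={b0:.2f} slope={b1:.2f} RMSE={rmse:.2f} R^2={r2:.2f}")
```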
- Adaptive regression, locally estimated scatterplot smoothing (LoESS), proportional hazard regression, quantile regression, robust regression
Data Analysis
- Mean – the average value (x̄ = Σx / n)
Trimmed mean – drop the p smallest and p largest values, then average the rest (x̄ = Σi=p+1…n−p x(i) / (n − 2p))
Weighted mean – adds a multiplier (weight) to each value (x̄w = Σi wi xi / Σi wi); see the sketch after this list
- Attitudinal data: how customer think or feel Behavioral data: how customer interact with the business Demographic data: information of customer base
- Labeled data – a group of samples that have been tagged with one or more labels
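A minimal sketch of the mean, trimmed mean, and weighted mean formulas from the data-analysis notes above, using made-up values with one outlier:

```python
import numpy as np

x = np.array([3.0, 1.0, 250.0, 4.0, 5.0, 2.0, 6.0])   # made-up values with one extreme outlier
w = np.array([1.0, 1.0, 0.1, 1.0, 1.0, 1.0, 1.0])     # made-up weights that down-weight the outlier

mean = x.mean()                        # x-bar = sum(x) / n
p = 1                                  # trim p values from each end
trimmed = np.sort(x)[p:-p].mean()      # trimmed mean over the middle n - 2p values
weighted = np.average(x, weights=w)    # sum(w_i * x_i) / sum(w_i)
print(mean, trimmed, weighted)
```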
Data Smoothing
- Reduction of the noise that causes the fluctuations
Methods: Random Walk, Moving Average, Exponential Smoothing
Recommender System
- Concepts: Collective Intelligence, Relevance, Star Ratings, Long Tail
- Methods and challenges: Cold Start, Collaborative Filtering, Dimensionality Reduction, Implicit Data Collection, Item-Item Collaborative Filtering, Matrix Factorization, Preference Elicitation, Principal Component Analysis (PCA), Similarity Search, Social Loafing, Topic Model
- Text recommendation algorithms: Singular Value Decomposition (SVD), M = UΣV* (see the sketch after this list); Latent Dirichlet Allocation (LDA)
- Gains and Lift Charts
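A minimal sketch of the SVD decomposition M = UΣV* on a made-up user-by-item rating matrix, followed by a low-rank reconstruction of the kind used in recommendation:

```python
import numpy as np

# A small user-by-item rating matrix (made-up values)
M = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U * Sigma * V*
M_rank2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]    # keep only the 2 largest singular values
print(np.round(M_rank2, 2))                        # low-rank approximation used in recommendation
```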
Reduce Dimensionality / Approximation Series
- Taylor network, Fourier Transform, Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), Singular Value Decomposition (SVD), Locally Linear Embedding (LLE), ISOMap, t-SNE, Multidimensional Scaling (MDS)
Classification
- Conditional probability – the probability of observing some event given some other event, P(Xi|Yi)
Posterior probability – the probability of an outcome after the predictor information has been incorporated
- Covariance – the extent to which one variable varies in concert with another
Discriminant function – the function that maximizes the separation of the classes
Discriminant weights – the scores used to estimate the probabilities of belonging to one or another class
Techniques: Linear Discriminant Analysis (LDA), covariance matrix, Fisher's linear discriminant
- Logit – the function that maps a class-membership probability to a range from ±∞
Odds – the ratio of success (1) to not success (0)
Log odds – the response in the transformed model, which is mapped back to a probability
Techniques: generalized linear models
- Accuracy – the percent of cases classified correctly
Confusion matrix – record counts by their predicted and actual classification status
Sensitivity – the percent of 1s correctly classified
Specificity – the percent of 0s correctly classified
Precision – the percent of predicted 1s that are actually 1s
Receiver Operating Characteristic (ROC) curve – a plot of sensitivity versus specificity
Area Under the Curve (AUC)
Lift – how effective the model is at different probability cutoffs
Percent Correct Classification (PCC) – measures overall accuracy
- Undersample – use fewer of the prevalent class records in the classification model
Oversample – use more of the rare class records in the classification model
Up weight or down weight – give more (or less) weight to the rare (or prevalent) class in the model
Data generation – each new bootstrapped record is slightly different from its source
K – the number of neighbours considered in the nearest-neighbour calculation
K-Nearest Neighbours (KNN) – hunts for records with similar features and classifies by the neighbourhood around them
Euclidean distance – the "ordinary" straight-line distance between two points in Euclidean space
Manhattan distance – distance between real vectors using the sum of their absolute differences (∑i |xi − yi|)
Hamming distance – for categorical variables, the number of positions at which values differ (DH = ∑i 1(xi ≠ yi)); see the distance sketch after this list
- Bayesian classification – predicts class membership probabilities, e.g. the probability that a given tuple belongs to a particular class
Naïve Bayes network – probability based on evidence
Bayesian nonparametric model – allows the data to determine the complexity of the model
- Support Vector Machine (SVM) – binary classifier
Kernel method – a class of algorithms for pattern analysis; the best-known member is the SVM
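A minimal sketch of the Euclidean, Manhattan, and Hamming distances and a tiny K-Nearest Neighbours vote; the training points and k value are made up:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))   # "ordinary" straight-line distance

def manhattan(x, y):
    return np.sum(np.abs(x - y))           # sum of absolute differences

def hamming(x, y):
    return np.sum(x != y)                  # number of positions where categories differ

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest (Euclidean) neighbours."""
    d = np.array([euclidean(x, x_new) for x in X_train])
    nearest = y_train[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])   # made-up training points
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))   # expected: class 0
```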
Distribution
- Standard distribution, normal (Gaussian) distribution, uniform distribution
Other distributions: Bernoulli, Beta, Binomial, Bivariate Normal, G-and-H, Burr, Bimodal, Heavy Tailed, Continuous Probability, Categorical, Yule-Simon, Lévy, Cauchy, Degenerate, Cumulative Frequency, Cumulative Geometric, Dirichlet, Extreme Value, Discrete Probability, Empirical, Gompertz, PERT, Erlang, Exponential, Generalized Error, Wishart, Weibull, T, Factorial, Hypergeometric, Folded / Half Normal, Wallenius, Fat Tail, F, J-Shaped, Inverse Normal, Inverse Gaussian, Von Mises, U-Shaped, Fisk, Laplace, Kumaraswamy, Multivariate Normal, Lindley, Lognormal, Kent, Long Tail, Marginal, Mixture, Multimodal, Multinomial, Rician, Nakagami, Normal, Negative Binomial, Open Ended, Pareto, Zeta, Pearson, Unimodal, Tukey Lambda, Power Law, Rayleigh, Poisson, Reciprocal, Uniform, Relative Frequency, Skewed, Triangular, Stable, Symmetric, Tweedie, Truncated Normal, Trapezoidal
Generalization error – training error / loss function
- Sample statistic – a metric calculated from a sample drawn from a larger population
Data distribution – the frequency distribution of individual values in a data set
Sampling distribution – the frequency distribution of a sample statistic over many samples
Central limit theorem – the tendency of the sampling distribution to take on a normal shape
Standard error – the variability of a sample statistic over many samples (SE = s/√n)
- Boxplot – a quick way to visualize data
Frequency table – a tally of the count of numeric data values that fall into a set of intervals
Histogram – a plot of the frequency table, with bins on the x-axis and counts on the y-axis
Density plot – a smoothed version of the histogram
- Confidence level – the percentage of confidence intervals expected to contain the statistic of interest
Interval endpoints – the top and bottom of the confidence interval
- Error – the difference between a data point and a predicted or average value
Standardize – subtract the mean and divide by the standard deviation
Z-score – the result of standardizing an individual data point
Standard normal – a normal distribution with mean = 0 and standard deviation = 1
QQ-Plot – a plot to visualize how close a sample distribution is to a normal distribution
Error is zero (0) – stay as it is
Error is positive (+) – increase the neurons and/or the weight
Error is negative (−) – decrease the neurons and/or the weight
- Trial – an event with a discrete outcome
Success – the outcome of interest for a trial
Binomial – having 2 outcomes; measure error at a 95% confidence level
Binomial trial – a trial with 2 outcomes
Binomial distribution – the distribution of the number of successes in x trials
- Lambda – the rate per unit of time or space at which events occur
Poisson distribution – the frequency distribution of the number of events in sampled units of time or space
Exponential distribution – the frequency distribution of the time or distance between events
Weibull distribution – a generalized version of the exponential in which the event rate is allowed to shift
- Skewness – a measurement of the asymmetry of a distribution
Kurtosis – the sharpness of the peak of a frequency-distribution curve
Tail – the long narrow portion of a frequency distribution
Skew – one tail of a distribution is longer than the other
Anscombe's Quartet – datasets that appear identical in summary statistics yet look very different when graphed
Fix positive skew – log transform log(x), multiplicative inverse 1/x, square root sqrt(x)
Fix negative skew – square (xⁿ), inverse log −log10(1 + |x|)
(a sketch of these skew-fixing transforms follows below)
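A minimal sketch of the skew-fixing transforms listed above, applied to made-up positively and negatively skewed samples:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 60.0])    # made-up positively skewed sample (long right tail)
log_x = np.log(x)          # log transform
inv_x = 1.0 / x            # multiplicative inverse
sqrt_x = np.sqrt(x)        # square root

y = np.array([-50.0, 1.0, 2.0, 2.5, 3.0, 4.0])   # made-up negatively skewed sample (long left tail)
sq_y = y ** 2                           # squaring stretches the right-hand side
invlog_y = -np.log10(1 + np.abs(y))     # inverse-log variant noted above
print(log_x, sq_y, sep="\n")
```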
Correlation
To measure: Spurious (not related), Crosstab (cross tabulation), Scatterplot
- Correlation coefficient – a metric that measures correlation from −1 to +1
Correlation matrix – a table of correlation values between variables, with 1s on the diagonal
Scatterplot – a plot with x and y axes
Pearson correlation coefficient (r) – linear correlation between two variables X and Y
- Contingency tables – a tally of counts between 2 or more categorical variables
Hexagonal binning – a plot of 2 numeric variables with the records binned into hexagons
Contour plots – a plot showing the density of 2 numeric variables like a topographical map
Violin plots – like a boxplot but showing the density estimate
- Time series characteristic – Hurst exponent, autocorrelation coefficient
Time series – a series of data points indexed (or listed or graphed) in time order
Testing
- Treatment – something to which a subject is exposed
Treatment group – a group of subjects exposed to a specific treatment
Control group – a group of subjects exposed to no or standard treatment
Randomization – randomly assigning subjects to treatments
Subjects – items that are exposed to treatments
Test statistic – metric used to measure the effect of the treatment
- Hypothesis testing - confirmatory data analysis
Null hypothesis – chance is to blame
Alternative hypothesis – counterpoint to the null as on what you hope to prove
One-way test – count chance result only in 1 direction
Two-way test – count chance result in 2 directions
- Permutation test – combining 2 or more samples together and randomly reallocating the observations
With or without replacement – an item is returned to the sample before the next draw
- P-value – the probability of obtaining results as unusual or extreme as the observed results, given a chance model
Alpha – probability threshold of unusualness
Type 1 error – mistakenly concluding an effect is real when it is due to chance
Type 2 error – mistakenly concluding an effect is due to chance when it is real
False discovery rate – the rate of making a type 1 error across multiple tests
Adjustment of p-value – doing multiple tests on the same data
Underfitting – lack of essential variables
Overfitting – fitting the noise
MCAR – missing completely at random
MAR – missing at random
MNAR – missing not at random
- T-test – to determine whether there is a significant difference between the means of two groups
Test statistic – metric for the difference or effect of interest
t-statistic – a standardized version of the test statistic
t-distribution – a reference distribution to which the observed t-statistic can be compared
- Pairwise comparison – hypothesis test between 2 groups among multiple groups
Omnibus test – single hypothesis test of the overall variance among multiple group means
Decomposition of variance – separation of components and contributing to an individual value
F-statistic – measure the extent among group means to a chance model
Sum of squares (SS) – deviations from some average value (SS = Σi (yi − ȳ)²)
- Chi-square statistic – a measure of the extent to which some observed data departs from expectation (χ² = Σr Σc R², where R is the Pearson residual)
Pearson residual - raw residual divided by the square root of the variance function
Expectation or expected – how we expect the data to turn out under some assumption
- Multi-arm bandit algorithm – allows explicit optimization and more rapid decision making
- Effect size – the size of the effect you hope to detect in a statistical test
Power – the probability of detecting a given effect size with a given sample size
Significance level – level at which the test will be conducted
Examples
Quantitative Neural Network
- Neural networks do not make an outright forecast; instead they predict the price data and uncover opportunities
- Example
Using an MLP on the OHLCV tuple to derive MACD, Ichimoku Cloud, RSI, and volatility features
Objective is to skip sideways movements
[(‘Total Return’, ‘1.66%’), (‘Sharpe Ratio’, ‘16.27’), (‘Max Drawdown’, ‘2.28%’), (‘Drawdown Duration’, ‘204’)]
Signals: 9 Orders: 9 Fills: 9
[(‘Total Return’, ‘3.07%’), (‘Sharpe Ratio’, ‘27.99’), (‘Max Drawdown’, ‘1.91%’), (‘Drawdown Duration’, ‘102’)]
Signals: 7 Orders: 7 Fills: 7
- Potential adjustments
- Forecast something different (e.g. volatility)
- Use multimodal learning on different data source
- Mix of different classical models
Identifying the image and learning to present it in a natural way
Natural Language Processing
STV – Stop Training with Validation
Opinion Finder – text mood detector (negative/positive)
Google Profile of Mood States (GPOMS) – another text mood detector
VIX Volatility Model – Gil Morales and Dr Chris Kacher
A self-learning/self-evolving algorithm that has successfully traded bull, bear, and sideways market
Back-tested results and live results after 2016:
2009 +178.7%
2010 +518.1%
2011 +274.5%
2012 +289.4%
2013 +103.1%
2014 +161.9%
2015 +589.7%
2016 +61.8% (real-time trading started uninterrupted on 11-8-16)
2017 +39.3% (as of 3-3-17, 1st quarter only)
- There have been AI winters and stall zones; however, the recoveries bloomed exponentially
- Part of the failure also due to companies putting fourth-generation machines into the third-generation business model
- Shouldn’t apply the same formula throughout – different industries have different melting points
- 2022 to 2030: machines will start large scale replacement for human workers
- Humans can stay advantageous if their job scope includes analytical thinking, social media, and fabrication skills
- Narrow AI: now (the first and second waves)
- General AI: strong AI (third wave)
- Super AI: unstoppable, hyperintelligence (fourth wave)
- Dystopian: we should fear our pending robot overlords
- Utopian: everything is awesome, and technology will always be the answer
- Pragmatist: the future can be as good if we make smart and practical decisions
- Benefits / advantages of AI trader
- No downtime and will not be tired
- No emotions
- Learn better and faster especially in the long run
- Oil vs. Data – data being the new raw material that is infinite and easier to acquire and process