Archive for EN

VilniusR – R users group in Lithuania

Today is Lithuania’s independence day and I have created R user group in Lithuania – VilniusR. If you are near by please follow the link, sign up and I hope that we will have a meeting soon.

Comments

Vectorized R vs Rcpp

In my previous post, I tried to show, that Rcpp is 1000 faster than pure R and that generated the fuss in the comments. Being lazy, I didn’t vectorize R code and at the end I was comparing apples vs oranges.

To fix that problem, I built a new script, where I’m trying to compare apples against apples. First piece of code named “ifelse R” uses R “ifelse” function to vectorize code. Second piece of code is fully vectorized code written in R, third – pure C++ code and the last one is C++, where  Rcpp ”ifelse” function is used.

Photobucket

 

name seconds
ifelse R 27.50
vectorized R 10.40
pure C++ 0.44
vectorized C++ 2.24

Here we go – vectorization truly helps, but pure C++ code still 23 times faster. Of course you pay the price when writing it in C++.
I found a bit strange, that vectorized C++ code doesn’t perform that well…

You can get the code from github or review it below:

?View Code RSPLUS
#Author Dzidorius Martinaitis
#Date 2012-02-01
#Description http://www.investuotojas.eu/2012/02/01/vectorized-r-vs-rcpp
 
bid = runif(50000000,5,9)
ask = runif(50000000,5,9)
close = runif(50000000,5,9)
 
x=data.frame(bid=bid,ask=ask,last_price=close)
rez=0
 
###########    ifelse R  #################
answ=as.vector(system.time(
{
rez = ifelse(x$last_price>0,ifelse(x[, "bid"] > x[, "last_price"], x[, "bid"], ifelse((x[, "ask"] > 0) & (x[, "ask"] < x[, "last_price"]), x[, "ask"], x[, "last_price"])), 0.5*(x[, "ask"] + x[,"bid"]))
})[1])
###########   end ifelse R  #################
 
###########    vectorized R  #################
 
answ=append(answ,system.time(
{
lgt0 = x$last_price > 0
bgtl = x$bid > x$last_price
agt0 = x$ask > 0
altl = x$ask > x$last_price
rez = x$last_price
rez[lgt0 & agt0 & altl] = x$ask[lgt0 & agt0 & altl]
rez[lgt0 & bgtl] = x$bid[lgt0 & bgtl]
rez[!lgt0] = (x$ask[!lgt0]+x$bid[!lgt0])/2
}
)[1])
###########   end vectorized R  #################
 
#C++ code starts here
 
library(inline)
library(Rcpp)
 
###########    pure C++  #################
 
code='
NumericVector bid(bid_);NumericVector ask(ask_);NumericVector close(close_);
int bid_size = bid.size();
NumericVector ret(bid_size);
for(int i =0;i<bid_size;i++)
{
  if(close[i]>0)
  {
    if(bid[i]>close[i])
    {
      ret[i] = bid[i]; 
    }
    else if(ask[i]>0 && ask[i]<close[i])
    {
      ret[i] = ask[i];//
    }
    else
    {
      ret[i] = close[i];//
    }
  }
  else
  {
    ret[i]=(bid[i]+ask[i])/2;
  }
 
}
return ret;
'
getLastPrice <- cxxfunction(signature( bid_ = "numeric",ask_ = "numeric",close_="numeric"),body=code,plugin="Rcpp")
rez=0
answ=append(answ,system.time(
  {
    rez=getLastPrice(as.numeric(x$bid),as.numeric(x$ask),as.numeric(x$last_price))
  })[1])
 
###########   end pure C++  #################
 
#summary(rez)
 
 
###########    vectorized C++  #################
code='
NumericVector bid(bid_);NumericVector ask(ask_);NumericVector close(close_);
int bid_size = bid.size();
NumericVector ret=ifelse(close>0,ifelse(bid >close, bid, ifelse(ask > 0,ifelse(ask < close,ask, close),close)), 0.5*(ask + bid));
return ret;
'
getLastPrice <- cxxfunction(signature( bid_ = "numeric",ask_ = "numeric",close_="numeric"),body=code,plugin="Rcpp")
rez=0
answ=append(answ,system.time(
{
  rez=getLastPrice(as.numeric(x$bid),as.numeric(x$ask),as.numeric(x$last_price))
}
)[1])
 
###########   end vectorized C++  #################
 
#summary(rez)
names(answ)=c('ifelse R','vectorized R','pure C++','vectorized C++')
 
library(ggplot2)
a=data.frame(ind=1:4,val=answ)
ggplot(a,aes(ind,val))+geom_point(legend=F)+geom_text(aes(label=names(answ),hjust=c(-0.2,-0.2,-0.2,0.8),vjust=c(0,0,0,-1)),size=4)

Comments (9)

The power of Rcpp

While ago I built two R scripts to track OMX Baltic Benchmark Fund against the index. One script returns the deviation of  fund from the index and it works fast enough. The second calculates the value of the fund every minute and it used to take for while. For example, it spent 2 minutes or more to get the values for one day. Here is an example of the result:

Photobucket

Following piece of code was in question:

?View Code RSPLUS
for(y in 1:NROW(x))
 {
    z=x[y,]
    if(as.numeric(z$last_price>0))
    {
      if(as.numeric(z$bid>z$last_price))rez[y]=z$bid
      else if(as.numeric(z$ask)>0 &amp; as.numeric(z$ask)<z$last_price)rez[y]=z$ask
      else rez[y]=z$last_price
    }
    else
    {
      rez[y]=(z$ask+z$bid)/2
    }
 }

The code above loops over time series and based on set of rules tries to decide which price (bid, ask or previous one) to use for calculations. Pure R script used to take 100 seconds to derive the price.

During the weekend I found time to watch very interesting Rcpp presentation. To my surprise, there are numerous ways to seamlessly integrate C++ into R code. So, I decided to rewrite the code above in C++ (Rcpp and inline packages were used).

?View Code RSPLUS
#c++ code embed in code value
code='
NumericVector bid(bid_);NumericVector ask(ask_);NumericVector close(close_);NumericVector ret(ask_);
int bid_size = bid.size();
for(int i =0;i<bid_size;i++)
{
  if(close[i]>0)
  {
    if(bid[i]>close[i])
    {
      ret[i] = bid[i];
    }
    else if(ask[i]>0 &amp;&amp; ask[i]<close[i])
    {
      ret[i] = ask[i];//
    }
    else
    {
      ret[i] = close[i];//
    }
  }
  else
  {
    ret[i]=(bid[i]+ask[i])/2;
  }
 
}
return ret;
'
#a glue function between C++ and R
getLastPrice = cxxfunction(signature( bid_ = "numeric",ask_ = "numeric",close_="numeric"),body=code,plugin="Rcpp")
 
#and the call of the function
getLastPrice(as.numeric(x$bid),as.numeric(x$ask),as.numeric(x$last_price))

What did I get in return? Well, 0.1 of a second instead of 100 seconds!

Comments (10)

ai-class.com vs ml-class.com

For those who did not know, Stanford university offered free off charge 3 courses at beginning of the autumn. It is kind of shocking – US based institution offers education for free! Take any socialism oriented country and one of the promises is education for free. But it seems, that the argument loosing the power – Stanford, khanacademy and bunch of others offer high quality learning for everyone.

In January (scroll down to get full list), Stanford will provide more than 15 courses for free and I thought that I could provide my based opinion about the courses.

ml-class.com This course was perfect fit for my personality and I loved it. Every week there was video lessons about the topics like machine learning, datamining, and statistical pattern recognition, overview questions and programming exercises, which had to be completed in Octave/Matlab. The quality of the video was superb, the length of the lessons was 8-14 minutes and format of the lessons was great as well (Prof. Andrew Ng was seamlessly switching between the white board and talks).
This course inspired me to build anomaly detection system at my work, where we already spotted few anomalies. Now I’m working on  kind of “spam filter implementation” for text analysis.
For me, the practical part of the course is like the water for the fish – without it theoretical part is empty and to be forgotten within the hours.

ai-class.com This course gave to me a broad view about artificial intelligence: machine learning, robotics, natural language processing, computer vision, search algorithms and etc. I suppose, that because the topics are so different the course was align towards theoretical part – otherwise the practical parts would take forever. However, in the last part there was an optional exercise – to encrypt two texts, which I loved!
The instructors, namely Sebastian Thrun and Peter Norvig, recommend this book: Artificial Intelligence: A Modern Approach. I should say, that the book was very helpful during the course and but I won’t use it outside the course.
The courses have different evaluation systems. AI class will score your homework and exams, where the top 1% will be awarded with special paper and maybe a job offer, while ML class inclined towards delivering knowledge – almost everyone working hard could get 100% score without a penalty. I think, that based on such environments, different communities sprang up - aiqus.com forum is very harsh to any question, where the answers start by stating, like “I know the answer, but hey, I can’t tell you anything, because honor code doesn’t allow and I’m the smartest guy on the Earth”, while ml-class forum is more open minded – if you can’t crack the problem then other students will help you.
I was in light shock, when I saw the format of AI lectures first time – the instructors used real white board, namely paper and pencil and took me a while to get use it.

But overall, I really really enjoy both courses and special thanks to Stanford professors, concretely Andrew Ng, Sebastian Thrun and Peter Norvig!

Comments (1)

C++ is dead. Long live C++

During the summer I was contacted by a hedge fund from Bahamas. The fund was looking for someone with R language skills on-site and insisted for phone interview. Besides obvious questions about finance, statistics, coding and how many tennis balls can fit in Boeing 747 (ok, this question was omitted), they wanted to know if I code in C++. So, I told them true – the last time I wrote a line in C++ was more or less 10 years ago. Long story short – it made me thinking about existence of C++.

10 yeas ago I was told, that C++ is going to disappear soon and Java is the king. At that time neither HN nor stackoverflow existed (meaning, that I had to rely on limited source), so I took it for granted, so here I am.

What do we have 10 years later? Neither C++ is dead, nor Java is sexy anymore. Actually is opposite – if you use Java, then you are clumsy programmer with lack of imagination. Does it sounds offensive? Then read for example the comments of Scala vs Java article and you will get the same feeling. Replacement of  Sun with Oracle does not help either.

But lets go back to C++. Google trends says, that C++ enjoys either maturity or decline. However, if you concentrate on specific industry, you will have a different picture. Kernel development (C not C++), game industry, number crunching, data mining, finance – where C++ matters. I know, I know, that you can write a magic code with Ruby or Python and it will perform almost as C++. And I saw a video, where guys were claiming, that they tuned “a bit”  Java and now it is able to deal with more that 1 million requests a second. Only the thing they did was elimination of garbage collection. The question – is it really worth of doing that way?

Next thing is to check what is demand for C++. Time to time I scroll through HN to be millionaires list and strangely enough C++ is demanded for back end systems, where performance or data amount is an issue. However, if you are targeting finance industry exclusively, then you may find this discussion interesting. Basically, it says, that there is stable demand for C++. Worth to say nonetheless, that C++ is mostly used by front and middle offices (where quants live) and the demand diminish in back office.

With such subjective study in mind I purchased C++ Primer by Stanley B. Lippman, which I recommend to beginners or disillusioned users like me. So far I built 2 small projects and in which one of them I parse 1.5 million terms to get a list of most used terms. R language does that in minutes, C++ in seconds.

Welcome back C++. It seems, that I miss you.

Comments (4)

Trading volume forecast for an illiquid stock

When dealing with transaction cost analysis, a stock’s volume is assumed to be stable or foreseeable.  However, there is different picture, then we are dealing with an illiquid stock.

It is relatively easy to forecast the volume of a liquid stock, because trading volume has high autocorrelation – the volumes at t and t+1 are correlated. For example, let’s take a look at BPN Paribas stock  autocorrelation figure:

Photobucket

X axis shows the lag between the days and Y axis shows percentage of the correlation. For BNP Paribas stock we have 60 % correlation between t and t+1 and 30 % between t and t+2.

Now, let’s look what is autocorrelation  of an illiquid stock:

Photobucket

The figure above shows, that autocorrelation for this emerging market stock is zero. That means, that we can’t forecast tomorrow’s volume, based on today’s volume.

Imagine, that a portfolio manager has to liquidate 100 000 of illiquid stock where the median of daily volume is 4000 and participation rate has to be maximum 20 %. How many days it will take to liquidate the position?

Because the daily volume of the stock is very volatile, we need more randomness in our forecast. To do that, we can use bootstrap – let’s take the last 250 data-points of the volume and generate 10 000 or 100 000 new time series. Once the have a bunch of new time series, let’s check how many days it would take to liquidate 100 000 stock position in each. Finally, we collect the numbers of the days needed for liquidation  and form a new vector. The result is following:

Photobucket

The histogram above shows, that it will take maximum 80 days to liquidate 100 000 of illiquid stock with 95% confidence.

?View Code RSPLUS
#volume - vector of the stock's volume
#shareNumber - number of the shares to liquidate
#loop - number of the time-series to be created
#participationRate - what is going to be participation rate 
simNumberOfDays<-function(volume,shareNumber,loop=10000,participationRate=0.05)
{
  rez=matrix(nrow=loop)
  for(i in 1:loop)
  {
    x=nasa[sample(NROW(volume))]
    y=cumsum(x*participationRate)
    rez[i,]=which(y>shareNumber)[1]
  }
  return(rez)
}
rez=simNumberOfDays(illiquidStock,100000,10000,0.2)

Comments

How big block trades affect stock market prices?

I will be giving a presentation on “Optimal transaction cost” in Vilnius on  16  August. While preparing the presentation and looking for an optimal execution solution, a natural question arises: does the size of the trade affect stock market price? I’m sure, you would say 100 % yes. Well, you would be right, but what is the scale of such effect? Is it possible to profit from execution of the big block trades?

Such test is not trivial and to conduct it, you need high frequency data, which is messy in most of the cases. For testing purpose I chose BNP Paribas stock from February 2011 to May 2011. Initially, I had more than 460 k. trades and more than 320k. quotes. However, the data was filtered by buyers initiated trades. To find buyers initiated trades, I used Lee-Ready Rule – short description can be found here on page 2. I found about Lee – Ready rule while reading Maxdama last post and a damn good summary (check page 42).

The first chart below shows the average return  one trade later (within seconds in most of the cases), when big or small trade was done. X axis represents difference between the trade and following trade, Y axis represents the trade size and the dot size represents number of trades within that cluster of volume. As you can see, small trades add 0.0004% to the price, while big ones (more than 980 of shares) increase the price on average 0.0007%

Photobucket

The next figure shows average return one minute later. This time the different between small trades and big one are almost3 times!

Photobucket

While we can see, that stock market prices are affected by big blocks, there’s no easy way to profit from it. You have to take into account bid/ask spread, plus you are becoming liquidity demander when liquidity is dry. On other end, this test shows the cost for each volume cluster and this cost can be used when choosing an optimal strategy for portfolio/stock liquidation.

Comments

Plotting git statistics

Here’s a funny story – friend of my, avid gamer at that time, was going downhill on a bicycle when wonderful idea flashed his mind: I need to save the current status… Just in case if I crash, I will start again from the top of the hill.

If you are a developer (quantitative or software), then you can use such marvelous feature. I use GitHub for my software and data mining or quantitative projects. Yesterday I came up with an idea to check my statistics of git commits. You can easily find ready to use software, but I was eager to extend my knowledge about git features and keep my machine clean.

I built two scripts – one is Linux shell script to get the data and another one is to plot the data in R.
getstats.sh:

git log master --shortstat --pretty="format: %ai"|
sed -e 's/\+[0-9]*/,/g'|sed ':a;N;$!ba;s/ ,\n/,/g'|
sed 's/ files changed//g'|sed 's/ insertions(,)//g'|
sed 's/ deletions(-)//g' >gitstats.csv

This part of the code: git log master –shortstat –pretty=”format: %ai” dumps all necessary data and the rest of the code makes it ready for R consumption. I found this page helpful, when I tried to format the dump.

gitStats.R:

?View Code RSPLUS
require(ggplot2)
require(xts)
setwd('/home/git/Rproject/gitStats/') 
Sys.setenv(TZ="GMT")
tmp=as.matrix(read.table('gitstats.csv',sep=',',header=FALSE))
commits=xts(cbind(as.double(tmp[,2]),as.double(tmp[,3]),as.double(tmp[,4])),order.by=as.POSIXct(strptime(tmp[,1],'%Y-%m-%d %H:%M:%S')))
 
colnames(commits)=c('Changes','Insertion','Deletion')
tmp=data.frame(Date=as.Date(index(commits)),Changes=as.numeric(commits$Changes),Insertion=as.numeric(commits$Insertion),Deletion=as.numeric(commits$Deletion))
tmp=melt(tmp,id.vars=c('Date'))
png('gitStats.png',width=500)
print(ggplot(tmp,aes(Date,value,color=variable))+geom_jitter(alpha=.65,size=3))
dev.off()
 
#############daily aggregated data##############
factor=as.factor(format(index(commits),'%Y%m%d'))
tmp=cbind(as.numeric(aggregate(commits$Changes,factor,sum)),as.numeric(aggregate(commits$Insertion,factor,sum)),as.numeric(aggregate(commits$Deletion,factor,sum)))
tmp=data.frame(unique(as.Date(index(commits))),tmp)
colnames(tmp)=c('Date','Changes','Insertion','Deletion')
tmp=melt(tmp,id.vars=c('Date'))
png('gitStats2.png',width=500)
print(ggplot(tmp,aes(Date,value,color=variable))+geom_jitter(alpha=.65,size=3))
dev.off()

R script generates this nice plot below:

Photobucket

What does it shows? It shows my activity in master repository. There is two projects – one was suspended in March and another one is under heavy development. As you can see, there was a lot of insertion when the last project was committed and since then numbers of insertion declined. I will come back, when I generate more data.
Do you track your git activity?

Source code

Comments

Artificial intelligence in trading: k-means clustering

There is many flavors of artificial intelligence (AI), however I want to show practical example of the cluster analysis. It is very applicable in finance. For example, one of stylized facts of volatility is, that it moves in clusters, meaning that today’s volatility will be more likely as yesterday’s volatility. To gauge these moves you can use hidden Markov chain (complicated method) or k-means (probably to simplified). However, GARCH model successfully exploits this stylized fact to make prediction of tomorrow’s volatility (it takes into account another fact as well – volatility is mean reverting process).

K-means is based on unsupervised learning – you give the data and k-means decides how to classify it. The idea is to split data into clusters based on cluster center and assign each point to nearest center.  There is drawback with such approach – the algorithm tries to establish the centers of  clusters with initial data set. If the data is very noisy and the centers are not stable, then every try will give you different results.

As you probably know, the distribution of financial data is very unstable. How to tackle this problem? We should be looking at daily returns instead of prices. The figure below shows daily returns of SPY stock.

?View Code RSPLUS
setwd('/home/git/Rproject/kmeans/')
require(quantmod)
require(ggplot2)
Sys.setenv(TZ="GMT")
getSymbols('SPY',from='2000-01-01')
 
x=data.frame(d=index(Cl(SPY)),return=as.numeric(Delt(Cl(SPY))))
png('daily_density.png',width=500)
ggplot(x,aes(return))+stat_density(colour="steelblue", size=2, fill=NA)+xlab(label='Daily returns')
dev.off()

Photobucket

I was ready to show another trick – how to neutralize long tails by replacing existing distribution with uniform distribution, but quick test revealed, that this leads to uninterpretable results.

OK, lets move further – how many clusters should we have? Can AI give us a clue? Of course, but keep in mind that then your future decision will be anchored.

?View Code RSPLUS
nasa=tail(cbind(Delt(Op(SPY),Hi(SPY)),Delt(Op(SPY),Lo(SPY)),Delt(Op(SPY),Cl(SPY))),-1)
 
#optimal number of clusters
wss = (nrow(nasa)-1)*sum(apply(nasa,2,var))
for (i in 2:15) wss[i] = sum(kmeans(nasa, centers=i)$withinss)
wss=(data.frame(number=1:15,value=as.numeric(wss)))
 
png('numberOfClusters.png',width=500)
ggplot(wss,aes(number,value))+geom_point()+
  xlab("Number of Clusters")+ylab("Within groups sum of squares")+geom_smooth()
dev.off()

Photobucket

The figure above implies, that we should have more than 15 clusters for financial data. Well, for sake of simplicity and education purpose lets use only 5.

?View Code RSPLUS
kmeanObject=kmeans(nasa,5,iter.max=10)
kmeanObject$centers
autocorrelation=head(cbind(kmeanObject$cluster,lag(as.xts(kmeanObject$cluster),-1)),-1)
xtabs(~autocorrelation[,1]+(autocorrelation[,2]))
 
y=apply(xtabs(~autocorrelation[,1]+(autocorrelation[,2])),1,sum)
x=xtabs(~autocorrelation[,1]+(autocorrelation[,2]))
 
z=x
for(i in 1:5)
{
  z[i,]=(x[i,]/y[i])
}

The code above actually shows, how to run k-means clustering in R. The first line runs the sorting and the second shows clusters’ centroids:

High Low Close
1 0.0388 -0.0094 0.0313
2 0.0049 -0.0050 0.0006
3 0.0143 -0.0038 0.0106
4 0.0038 -0.0148 -0.0103
5 0.0053 -0.0348 -0.0280

So, we have 5 clusters: 1. extremely positive day, 2. flat day, 3. positive day and 4,5 are clusters with negative outcome.
The third and fourth lines in the code above checks and prints autocorrelation between today(N0) and tomorrow(N1):

1 2 3 4 5
1 11 24 29 21 12
2 16 991 288 351 42
3 17 338 144 168 28
4 27 310 202 207 32
5 26 24 33 31 23

If you prefer percentages instead of plain numbers, the following table gives you that:

1 2 3 4 5
1 0.11 0.25 0.30 0.22 0.12
2 0.01 0.59 0.17 0.21 0.02
3 0.02 0.49 0.21 0.24 0.04
4 0.03 0.40 0.26 0.27 0.04
5 0.19 0.18 0.24 0.23 0.17

How to read such tables? Lets take for example line 2. The first table says, that the centers of the cluster are following: 0.0049;-0.0050;0.0006, meaning that during such day, the price of the asset is moving in very narrow range. Now, the table 2 or 3 shows, what are the chances for the next day (N1). Here is only 1 % chance, that following day will be extremely negative or positive (1 and 5 columns), 59 % chance, that it will be as today (N0) or it will be mild volatility with positive or negative outcome (3 and 4 columns). Put it shortly – if volatility today is very low, then most likely it will be tomorrow.

For further research I would advise to increase the number of clusters and check what are the results. On the same vein IntelligentTradingTech made a post while back.

The source code can be found here.

Comments (4)

timezone issue in R

While investigating Intraday patterns in FX returns and order flow paper I have faced the problem with timezone. I had 3 data sources with different timezones (GMT, CET, CEST). Most confusing thing was, that I didn’t know, how to deal with summer time.
But why did I have the data with summer time in the first place?  Well, I use IBrokers package to get latest Forex data and turns out, that it is impossible to specify timezone parameter within reqHistoricalData function. Once the data is received, it assigns default timezone of R (in my case GMT), but the requested data comes with OS timezone (CET/CEST in my case). It took for while to realize that, but the real challenge was how to convert CET/CEST to GMT.

The answer – don’t use CET/CEST, but instead of that use something like ‘Europe/Paris’ or ‘Europe/Berlin’. This approach takes into account summer time issue and you don’t need to worry about it. Once you have the data in ‘Europe/Paris’ or ‘Europe/Berlin’ format you can easily convert it:

?View Code RSPLUS
Sys.setenv(TZ="Europe/Paris")
eur.usd=reqHistoricalData(tws,currency,whatToShow='MIDPOINT',barSize='1 hour')
Sys.setenv(TZ="GMT")
eur.usd=xts(Op(eur.usd),tz='GMT')

or you can change the index:

?View Code RSPLUS
 format(eur.usd, tz="GMT",usetz=TRUE)

Comments

Next Page »