My recent assignment was to analyze the slowness of a web service.
The service is an API layer with two dependencies. There are some calling details such as cookie validation, and there is a prerequisite step that fetches parameters for both dependencies. After that, the requests to dependencies A and B can be parallelized.
I put to use what I learned in the Coursera "dss" programs, and, as always, spent many hours on Google and StackOverflow.
At first I didn't know how to measure it. The HTTP calls are buried under many layers of function calls, so I added logging in each of them; later I had to remove all but the one closest to the HTTP calls and clean everything else out of the logs. Log4net is very handy for redirecting logs -- I abused logger names and finally got a "clean" log file: each HTTP call writes one line of log, with very little noise. With a little help from grep/awk, the file can be loaded into R or parsed in Excel.
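To make the later snippets concrete, here is a minimal sketch of loading such a cleaned log into R. The file name, the tab delimiter, and the three columns (timestamp, dependency, elapsed_s) are assumptions for illustration, not the actual log format.

```r
# Minimal sketch: load the grep/awk-cleaned log into a data frame.
# File name, delimiter and column layout are assumptions, not the real format.
samples <- read.table("http_calls.log", sep = "\t",
                      col.names = c("timestamp", "dependency", "elapsed_s"))
samples$timestamp  <- as.POSIXct(samples$timestamp)
samples$dependency <- factor(samples$dependency)
str(samples)
```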
Dependencies A and B have a similar request format: a key-value struct of data point Ids, plus a GUID or string Id for the subject. The first task is to learn which component has more impact on overall performance, and the second is to learn the response times of different data point Ids. For either task, the first step is to write code to collect a large number of samples. My code ran in a single thread for 12 hours, which gave me about 10k samples (say each call takes around 5 s).
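For what it is worth, here is a sketch of what such a single-threaded driver could look like in R; the real driver is not shown in this post, and the endpoint, query shape, and use of the httr package are all placeholders. The analysis below works off the server-side log instead.

```r
# Hypothetical single-threaded sampling driver; endpoint and query shape
# are placeholders for illustration only.
library(httr)

subject_ids <- readLines("subjects.txt")          # hypothetical list of GUIDs

collect_one <- function(id) {
  start <- Sys.time()
  resp  <- GET("https://example.internal/api", query = list(subject = id))
  data.frame(subject   = id,
             status    = status_code(resp),
             elapsed_s = as.numeric(difftime(Sys.time(), start, units = "secs")))
}

driver_log <- do.call(rbind, lapply(subject_ids, collect_one))
```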
The samples should be cleaned up. During those 12 hours a server restart happened, and the logs rotated at midnight; records from those periods are removed. Otherwise, the response time shows the same pattern over time. TBD: how do I actually test for that?
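One way, continuing with the samples data frame assumed above, is to drop the known-bad windows, bucket the rest by hour, and compare the hourly distributions, e.g. with a Kruskal-Wallis test (a non-parametric choice, given the long tail). The cut-off times below are placeholders, not the real restart window.

```r
# Drop records around the restart and the midnight log rotation;
# the exact timestamps are placeholders.
bad <- with(samples,
            (timestamp >= as.POSIXct("2014-03-01 02:10:00") &
             timestamp <= as.POSIXct("2014-03-01 02:25:00")) |
            format(timestamp, "%H:%M") %in% c("23:59", "00:00", "00:01"))
clean <- samples[!bad, ]

# Rough stability check: does the response-time distribution differ by hour?
clean$hour <- factor(format(clean$timestamp, "%H"))
kruskal.test(elapsed_s ~ hour, data = clean)
```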
The R functions and charts I found useful:
- scatterplot, correlations
xyplot() from package 'lattice'
scale()
The scatterplot is the easiest one to use and the first one I reached for, since I wanted to understand the character of the samples. First I plotted response time against clock time, then the response time of one request against the overall time (to look for correlation).
It is tempting to use xyplot() to put everything on the same chart, but that did not work well for me: such charts carry too much information. With xyplot(), it is still better to split into panels than to mix everything up.
A scatterplot that shows clusters is fine when the clusters do not overlap, but I still don't understand how to draw such a plot. It was not used in this task.
If two groups have different means, running scale() across both produces numbers that are hard to explain. Scaling has to be done per group so that the numbers are comparable.
xyplot() with "smooth" or lm overlays turned out to be hard to read and interpret.
TBD: I also want to show the mean and sd on each panel of xyplot() when scaled values are displayed. It would make it easier to see what three standard deviations actually looks like. When the mean is 3 s and the sd is also 3 s, some data points got a 6-sigma response (around 21 s), yet the chart does not convey how bad that is: there are just a few dots above the heavily concentrated "0" line.
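A rough sketch of what I have in mind, assuming the clean data frame from the earlier snippets (columns timestamp, dependency, elapsed_s); ave() is one way to get per-group z-scores, and the subscripts argument lets each panel look up its unscaled values for the annotation.

```r
library(lattice)

# Per-group z-scores so the A and B panels are on a comparable scale;
# ave() applies scale() within each dependency.
clean$z <- ave(clean$elapsed_s, clean$dependency,
               FUN = function(x) as.numeric(scale(x)))

# One panel per dependency; print the raw mean/sd in each panel so a
# "3 sigma" dot has a concrete number of seconds attached to it.
xyplot(z ~ timestamp | dependency, data = clean, alpha = 0.3,
       panel = function(x, y, subscripts, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(h = c(0, 3), lty = 2)
         raw <- clean$elapsed_s[subscripts]
         panel.text(min(x), max(y),
                    sprintf("mean %.1fs  sd %.1fs", mean(raw), sd(raw)),
                    adj = c(0, 1))
       })
```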
- frequency chart
quantile()
ecdf()
sm() from package "sm"
The "hist" chart is another easiest to use one. When scatterplot shows a long tail (especially when showing scaled values against time), to actually tell there is a long tail, one has to use such a chart. Either the frequency or the cumulative density will do.
Many charts look prettier when one or two extreme values are excluded. The task is not that stringent, and the removed values do not change my conclusions. With quantile() I can set a threshold for how many samples to drop, and it is also easy to run the same code against a band (value range) of values.
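For example, assuming the clean data frame and an "A" dependency level from the earlier snippets (both placeholders):

```r
# Tail view for one dependency: histogram of the bottom 99% (trimmed with
# quantile()) plus the full empirical CDF.
a <- subset(clean, dependency == "A")$elapsed_s
cutoff <- quantile(a, 0.99)
hist(a[a <= cutoff], breaks = 50,
     main = "Dependency A, bottom 99%", xlab = "response time (s)")
plot(ecdf(a), main = "Dependency A, empirical CDF", xlab = "response time (s)")
```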
- regression
scale()
Regression did not come up until I needed to explain the variation in response time. I still do not know whether I am doing it right: can I regress scaled values when the distribution has a long tail?
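One common workaround (a suggestion, not what I actually did) is to regress the log of the response time instead of z-scores, which tames the tail; the predictors here are just the columns assumed in the earlier snippets.

```r
# Sketch: explain response-time variation with a linear model on the log
# of the response time; predictors are the assumed columns from above.
fit <- lm(log(elapsed_s) ~ dependency + hour, data = clean)
summary(fit)
plot(fit, which = 1)   # residuals vs fitted, to eyeball what the tail leaves behind
```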
- heatmap (and other clustering charts)
transform()
t.test()
This is used when comparing the response times of two requests. The recommended way of showing dispersion is a boxplot, but since I had a pairwise t.test result, a heatmap also works. I had no idea why the chart was not symmetric, though.
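If the heatmap was drawn straight from pairwise.t.test()$p.value, the asymmetry is expected: that matrix only holds the lower triangle of the comparisons, with NA everywhere else. Here is a sketch of mirroring it into a full symmetric matrix first, assuming a data_point factor column of data point Ids (an assumption, like the other column names in these snippets).

```r
# pairwise.t.test() returns a (k-1) x (k-1) lower-triangular p-value matrix;
# mirror it into a full k x k symmetric matrix before drawing the heatmap.
# 'data_point' is an assumed factor column of data point Ids.
pw <- pairwise.t.test(clean$elapsed_s, clean$data_point)$p.value
lv <- c(colnames(pw)[1], rownames(pw))        # all k factor levels, in order
full <- matrix(NA_real_, length(lv), length(lv), dimnames = list(lv, lv))
full[rownames(pw), colnames(pw)] <- pw
full[colnames(pw), rownames(pw)] <- t(pw)
diag(full) <- 1
heatmap(full, Rowv = NA, Colv = NA, scale = "none")
```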
I have no idea how to use many of the cluster-related tools. Such functions usually require a distance matrix, and the number of clusters is hard to decide. I have a column of values and a column of factors; I thought some factor levels were closely related (say, four of the levels have a mean response time of 5 s), but the response time is also long-tailed, so clustering the raw values will not put those factor levels together. What should I do?
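One idea, under the same assumptions as above: cluster per-level summary statistics (say, median and 90th percentile) instead of raw samples, so a few extreme values do not pull related levels apart. hclust() then takes care of the distance matrix, and the dendrogram helps with choosing the number of clusters.

```r
# Summarise each data point Id by median and 90th percentile, then cluster
# the summaries; robust statistics blunt the effect of the long tail.
summ <- aggregate(elapsed_s ~ data_point, data = clean,
                  FUN = function(x) c(med = median(x),
                                      q90 = unname(quantile(x, 0.9))))
m <- as.matrix(summ$elapsed_s)                # two columns: med, q90
rownames(m) <- as.character(summ$data_point)
hc <- hclust(dist(scale(m)))
plot(hc)                                      # dendrogram suggests how many groups
cutree(hc, k = 3)                             # k = 3 is just an example
```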
The overall feeling is that I am a rookie hacker. While I have the most powerful tools at hand, someone else might use them far more elegantly. Whenever a problem requires more than one tool or more than one step, I hit a barrier and can never find the right combination (especially in the clustering case). An experienced person would have a clearer mind.
A top hacker would dismiss the entire problem.
All modern tools, starting from the TI-89, make me feel uneasy, because I am never sure which is the best way, or even whether I am using these power tools correctly. Not to mention that I forget them all in a short time.
Are you going to invest in the tools you use? How much can you contribute back to them? To me, R is like hacking: good for quick and dirty jobs. I love such tasks, but without a long-term commitment; I cannot imagine becoming truly good at R or at any of the shell scripting tools.