Friday, August 13, 2010

Collecting User Actions with GWT

While I was at one of the Google I/O GWT sessions (courtesy of Bizo), a Google presenter mentioned how one of their internal GWT applications tracks user actions.


The idea is really just a souped-up, AJAX version of server-side access logs: capturing, buffering, and sending fine-grained user actions up to the server for later analysis.


The Google team was using this data to make A/B-testing-style decisions about features–which ones were being used, not being used, tripping users up, etc.


I thought the idea was pretty nifty, so I flushed out an initial implementation in BizAds for Bizo’s recent hack day. And now I am documenting my wild success for Bizo’s first post-hack-day “beer & blogs” day.


No Access Logs


Traditional, page-based web sites typically use access logs for site analytics. For example, the user was on a.html, then b.html. Services like Google Analytics can then slice and dice your logs to tell you interesting things.


However, desktop-style one-page webapps don’t generate these access logs–the user is always on the first page–so they must rely on something else.


This is pretty normal for AJAX apps, and Google Analytics already supports it via its asynchronous API.


We had already been doing this from GWT with code like:


public native void trackInGA(final String pageName) /*-{
$wnd._gaq.push(['_trackPageview', pageName]);
}-*/;

And since we’re using a MVP/places-style architecture (see gwt-mpv), we just call this on each place change. Done.


Google Analytics is back in action, not a big deal.


Beyond Access Logs


What was novel, to me, about this internal Google application’s approach was how the tracked user actions were much more fine-grained than just “page” level.


For example, which buttons the user hovers over. Which ones they click (even if it doesn’t lead to a page load). What client-side validation messages are tripping them up. Any number of small “intra-page” things that are nonetheless useful to know.


Obviously there are a few challenges, mostly around not wanting to detract from the user experience:



  • How much data is too much?

    Tracking the mouse over of every element would be excessive. But the mouse over of key elements? Should be okay.



  • How often to send the data?

    If you wait too long while buffering user actions before uploading them to the server, the user may leave the page and you’ll lose them. (Unless you use a page unload hook, and the browser hasn’t crashed.)


    If you send data too often, the user might get annoyed.




The key to doing this right is having metrics in place to know whether you’re prohibitively affecting the user experience.


The internal Google team had these metrics for their application, and that allowed them to start out batch uploading actions every 30 seconds, then every 20 seconds, and finally every 3 seconds. Each time they could tell the users’ experience was not adversely affected.


Unfortunately, I don’t know what exactly this metric was (I should have asked), but I imagine it’s fairly application-specific–think of GMail and average emails read/minute or something like that.


Implementation


I was able to implement this concept rather easily, mostly by reusing existing infrastructure our GWT application already had.


When interesting actions occur, I have the presenters fire a generic UserActionEvent, which is generated using gwt-mpv-apt from this spec:


@GenEvent
public class UserActionEventSpec {
@Param(1)
String name;
@Param(2)
String value;
@Param(3)
boolean flushNow;
}

Initiating the tracking an action is now just as simple as firing an event:


UserActionEvent.fire(
eventBus,
"someAction",
"someValue",
false);

I have a separate, decoupled UserActionUploader, which is listening for these events and buffers them into a client-side list of UserAction DTOs:


private class OnUserAction implements UserActionHandler {
public void onUserAction(final UserActionEvent event) {
UserAction action = new UserAction();
action.user = defaultString(getEmailAddress(), "unknown");
action.name = event.getName();
action.value = event.getValue();
actions.add(action);
if (event.getFlushNow()) {
flush();
}
}
}

UserActionUploader sets a timer that every 3 seconds calls flush:


private void flush() {
if (actions.size() == 0) {
return;
}
ArrayList<UserAction> copy =
new ArrayList<UserAction>(actions);
actions.clear();
async.execute(
new SaveUserActionAction(copy),
new OnSaveUserActionResult());
}

The flush method uses gwt-dispatch-style action/result classes, also generated by gwt-mpv-apt, to the server via GWT-RPC:


@GenDispatch
public class SaveUserActionSpec {
@In(1)
ArrayList<UserAction> actions;
}

This results in SaveUserActionAction (okay, bad name) and SaveUserActionResult DTOs getting generated, with nice constructors, getters, setters, etc.


On the server-side, I was able to reuse an excellent DatalogManager class from one of my Bizo colleagues (unfortunately not open source (yet?)) that buffers the actions data on the server’s hard disk and then periodically uploads the files to Amazon’s S3.


Once the data is in S3, it’s pretty routine to setup a Hive job to read it, do any fancy reporting (grouping/etc.), and drop it into a CSV file. For now I’m just listing raw actions:


-- Pick up the DatalogManager files in S3
drop table dlm_actions;
create external table dlm_actions (
d map<string, string>
)
partitioned by (dt string comment 'yyyyddmmhh')
row format delimited
fields terminated by '\n' collection items terminated by '\001' map keys terminated by '\002'
location 's3://<actions-dlm-bucket>/<folder>/'
;

alter table dlm_actions recover partitions;

-- Make a csv destination also in S3
create external table csv_actions (
user string,
action string,
value string
)
row format delimited fields terminated by ','
location 's3://<actions-report-bucket/${START}-${END}/parts'
;

-- Move the data over (nothing intelligent yet)
insert overwrite table csv_actions
select dlm.d["USER"], dlm.d["ACTION"], dlm.d["VALUE"]
from dlm_actions dlm
where
dlm.dt >= '${START}00' and dlm.dt < '${END}00'
;

Then we use Hudson as a cron-with-a-GUI to run this Hive script as an Amazon Elastic Map Reduce job once per day.


Testing


Thanks to the awesomeness of gwt-mpv, the usual GWT widgets, GWT-RPC, etc., can be doubled-out and testing with pure-Java unit tests.


For example, a method from UserActionUploaderTest:


UserActionUploader uploader = new UserActionUploader(registry);
StubTimer timer = (StubTimer) uploader.getTimer();

@Test
public void uploadIsBuffered() {
eventBus.fireEvent(new UserActionEvent("someaction", "value1", false));
eventBus.fireEvent(new UserActionEvent("someaction", "value2", false));
assertThat(async.getOutstanding().size(), is(0)); // buffered

timer.run();
final SaveUserActionAction a1 = async.getAction(SaveUserActionAction.class);
assertThat(a1.getActions().size(), is(2));
assertAction(a1, 0, "anonymous", "someaction", "value1");
assertAction(a1, 1, "anonymous", "someaction", "value2");
}

The usual GWT timers are stubbed out by a StubTimer, which we can manually tick via timer.run() to deterministically test timer-delayed business logic.


That’s It


I can’t say we have made any feature-altering decisions for BizAds based on the data gathered from this approach yet–technically its not live yet. But it’s so amazing that surely we will. Ask me about it sometime in the future.

hackday: analog meters

For this last hackday, I decided to work on something more hardware hacking related. At this year's Maker Fair, I was really inspired by all the cool stuff people were building, so I picked up an arduino and started playing around with a couple of things.

I've always wanted to have some cool old-school analog VU type meters displaying web requests.

Here's my completed hackday project:



Here's a view of the components from the back:



It's battery operated and receives data wirelessly over RF from another arduino I have hooked up via serial to my laptop.

It's pretty simple, but I'm still totally psyched about how it came out.

The main components are some analog panel meters (kinda pricey, but awesome), and an RF receiver. The frame is a piece of scrap acrylic from TAP Plastics that I drilled and cut to size, and the stand is a piece of a wire clothes hanger bent to shape.

Connected to my computer is a another arduino (actually a volksduino) that receives updates over USB and sends the data out over RF:



You may be asking, why bother with wireless if you need a computer hooked up through serial anyway. Or you may ask why not just connect to a wireless network directly.

Well, I wanted the meters to be able to be moved around, or mounted on a wall... I wanted them wireless. But, it turns out that wireless and even ethernet solutions for connecting an arduino to the internet directly are comparatively pretty expensive. Even using bluetooth is expensive. My long term plan is to have a single arduino connected to the internet directly (via ethernet or wireless), and have it serve as a proxy over RF for the others... So this is a bit of work towards that.

I wrote a bit of Java code to connect to amazon's cloudwatch to pull the load balancer statistics for two of our services. I then discovered it's near impossible to connect to anything over USB in Java... It is ridiculous. Luckily, it's REALLY easy to do this with Processing, so I wrote a simple processing program that used my cloudwatch library and wrote it out to serial.

And that's really it. The arduino reads data over serial, and periodically sends it over RF. The arduino hooked up to the meters simply reads the values over RF and sets the meters to display a scaled version of the results. They're showing requests per second. We get a huge amount of requests per second with these services, so the numbers on the dial aren't actually correct (I need to make some custom faceplates). It also flashes an LED every time it gets a RF transmission.

Here's a quick video of it in action:




The one thing I'm not crazy about is that the maximum resolution you can get from cloudwatch is stats per minute, so the meters don't actually change as often as I would like.

Still, pretty cool. I'm looking forward to building some more displays like this in the future.

Thursday, July 29, 2010

Extending Hive with Custom UDTFs

Let’s take a look at the canonical word count example in Hive: given a table of documents, create a table containing each word and the number of times it appears across all documents.


Here’s one implementation from the Facebook engineers:



CREATE TABLE docs(contents STRING);

FROM (
MAP docs.contents
USING 'tokenizer_script'
AS
word,
cnt
FROM docs
CLUSTER BY word
) map_output
REDUCE map_output.word, map_output.cnt
USING 'count_script'
AS
word,
cnt
;

In this example, the heavy lifting is being done by calling out to two scripts, ‘tokenizer_script’ and ‘count_script’, that provide custom mapper logic and reducer logic.


Hive 0.5 adds User Defined Table-Generating Functions (UDTF), which offers another option for inserting custom mapper logic. (Reducer logic can be plugged in via a User Defined Aggregation Function, the subject of a future post.) From a user perspective, UDTFs are similar to User Defined Functions except they can produce an arbitrary number of output rows for each input row. For example, the built-in UDTF “explode(array A)” converts a single row of input containing an array into multiple rows of output, each containing one of the elements of A.


So, let’s implement a UDTF that does the same thing as the ‘tokenizer_script’ in the word count example. Basically, we want to convert a document string into multiple rows with the format (word STRING, cnt INT), where the count will always be one.


The Tokenizer UDTF


To start, we extend the org.apache.hadoop.hive.ql.udf.generic.GenericUDTF class. (There is no plain UDTF class.) We need to implement three methods: initialize, process, and close. To emit output, we call the forward method.


Adding a name and description:



@description(name = "tokenize", value = "_FUNC_(doc) - emits (token, 1) for each token in the input document")
public class TokenizerUDTF extends GenericUDTF {

You can add a UDTF name and description using a @description annotation. These will be available on the Hive console via the show functions and describe function tokenize commands.


The initialize method:



public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException

This method will be called exactly once per instance. In addition to performing any custom initialization logic you may need, it is responsible for verifying the input types and specifying the output types.


Hive uses a system of ObjectInspectors to both describe types and to convert Objects into more specific types. For our tokenizer, we want a single String as input, so we’ll check that the input ObjectInspector[] array contains a single PrimitiveObjectInspector of the STRING category. If anything is wrong, we throw a UDFArgumentException with a suitable error message.



if (args.length != 1) {
throw new UDFArgumentException("tokenize() takes exactly one argument");
}

if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
&& ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
throw new UDFArgumentException("tokenize() takes a string as a parameter");
}

We can actually use this object inspector to convert inputs into Strings in our process method. This is less important for primitive types, but it can be handy for more complex objects. So, assuming stringOI is an instance variable,



stringOI = (PrimitiveObjectInspector) args[0];

Similarly, we want our process method to return an Object[] array containing a String and an Integer, so we’ll return a StandardStructObjectInspector containing a JavaStringObjectInspector and a JavaIntObjectInspector. We’ll also supply names for these output columns, but they’re not really relevant at runtime since the user will supply his or her own aliases.



List fieldNames = new ArrayList(2);
List fieldOIs = new ArrayList(2);
fieldNames.add("word");
fieldNames.add("cnt");
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
}

The process method:



public void process(Object[] record) throws HiveException

This method is where the heavy lifting occurs. This gets called for each row of the input. The first task is to convert the input into a single String containing the document to process:



String document = (String) stringOI.getPrimitiveJavaObject(record[0]);

We can now implement our custom logic:



if (document == null) {
return;
}
String[] tokens = document.split(“\\s+”);
for (String token : tokens) {
forward(new Object[] { token, Integer.valueOf(1) });
}
}

The close method:



public void close() throws HiveException { }

This method allows us to do any post-processing cleanup. Note that the output stream has already been closed at this point, so this method cannot emit more rows by calling forward. In our case, there’s nothing to do here.


Packaging and use:


We deploy our TokenizeUDTF exactly like a UDF. We deploy the jar file to our Hive machine and enter the following in the console:



> add jar TokenizeUDTF.jar ;
> create temporary function tokenize as ’com.bizo.hive.udtf.TokenizeUDTF’ ;
> select tokenize(contents) as (word, cnt) from docs ;

This gives us the intermediate mapped data, ready to be reduced by a custom UDAF.


The code for this example is available in this gist.

Friday, June 25, 2010

Come work at Bizo

Want to work on some interesting problems with a great development team?

We're looking to hire a junior developer and a quantitative engineer.

Send us your resume! Or if you know someone in the Bay Area that you think might be a good fit, please forward the posting to them.

Tuesday, June 8, 2010

Accessing Bizo API using Ruby OAuth

During a recent HackDay, I wrote a search-engine-like frontend to our multi-dimensional business demographic database.




Among other things, the interface provides search term suggestions based on business title classification, using the Classify operation of the Bizo API.

I chose to use Ruby and the lightweight Sinatra web framework for fast prototyping and since the Bizo API uses OAuth for authentication, I reached out for the excellent OAuth gem.

Now, while the documentation is good it took me a little time to grok the OAuth API and figure out how to use it. The Bizo API does not use a RequestToken; instead we use an API key and a shared secret. Since the OAuth gem documentation didn't include any example for this use-case, I figured I'd post my code here as a starting point for other people to reuse.

Without further ado, here's the short code fragment:


require 'rubygems'
require 'oauth'
require 'oauth/consumer'
require 'json'

key = 'xxxxxxxx'
secret = 'yyyyyyyy'

consumer = OAuth::Consumer.new(key, secret, {
:site => "http://api.bizographics.com",
:scheme => :query_string,
:http_method => :get
})

title = "VP of Marketing"
path = URI.escape("/v1/classify.json?api_key=#{key}&title=#{title}")

response = consumer.request(:get, path)

# Display response
p JSON.parse(response.body)


If you're curious, here's the JSON response for the "VP of Market..." title classification,

{
"usage" => 1,
"bizographics" => {
"group" => { "name" => "High Net Worth", "code" => "high_net_worth" },
"functional_area" => [
{"name" => "Sales", "code" => "sales" },
{"name" => "Marketing", "code" => "marketing" }
],
"seniority" => {"name" => "Executives", "code" => "executive" }
}
}


Hopefully this is useful to Rubyists out there needing quick OAuth integration using HTTP GET and a query string and don't need to go through the token exchange process.

Tuesday, May 11, 2010

Hackday: dependency searching using scala, jersey, gxp, mongodb

For my hackday project, I thought I would try to build an internal tool to let us more easily search our dependency repository. We use ivy for dependency management, and maintain our own repository in s3. It can be kind of a pain to track down the latest version of library X, especially if you're not sure what the organization is, or maybe you know the org and not the name. It seemed like a fun, useful project that I could tackle in a day, and that would allow me to play around with a couple of things I was interested in. To build it, I used jersey, gxp, and mongoDB. The whole thing was written using scala.

I borrowed the main layout from the SpringSource Enterprise Bundle Repository. I'm pretty happy with the results:



And the detail view:



There's also a browse view.

I've been really happy using scala and jersey, and I wanted something simple and easy for this project, so I thought it was worth a shot. After adding GXP for templating support, I have to say the combination of scala/jersey/GXP makes a pretty compelling framework for simple web apps.

As an example, here's the beginning of my 'Browse' Controller:


@Path("/b")
class Browse {
val db = new RepoDB

@Path("/o")
@GET @Produces(Array("text/html"))
def browseOrg() = browseOrgLetter("A")

@Path("/o/{letter}")
@GET @Produces(Array("text/html"))
def browseOrgLetter(@PathParam("letter") letter : String) = {

val orgs = db.getOrgLetters

val results = db.findByOrgLetter(letter, 30)

BrowseView.getGxpClosure("Organization", "o", orgs, letter, results)
}
...


It's using nested paths, so /b/o is the main browse by organization page, /b/o/G would be all organizations starting with 'G'.

Then, I have a simple MessageBodyWriter that can render a GxpClosure:


@Provider
@Produces(Array("text/html"))
class GxpClosureWriter extends MessageBodyWriter[GxpClosure] {
val context = new GxpContext(Locale.US)

override def isWriteable(dataType: java.lang.Class[_], ...) = {
classOf[GxpClosure].isAssignableFrom(dataType)
}

override def writeTo(gxp: GxpClosure, ...) {
val out = new java.io.OutputStreamWriter(_out)
gxp.write(out, context)
}
...


And, that's really all there is to it. Nice, simple, and lightweight.

Last but not least, mongodb. It was probably overkill for this project, but I was looking for an excuse to play with it some more. I use it to store and index all of the repository information. I have a separate crawler process that lists everything in our repository s3 bucket, then stores an entry for each artifact. As part of this, it does some basic tokenizing of the organization and artifact names for searching. Searching like this was a little disappointing compared to lucene. Overall though, I'm pretty happy with it. Browsing and searching are both ridiculously fast. Like I said, it was probably overkill for the amount of data we have.... but it can never be too fast. speed is most definitely a feature.

Anyway, that's the wrap-up.

I'd be interested to other thoughts/experiences on mongodb from anyone out there.

Wednesday, May 5, 2010

Improving Global Application Performance, continued: GSLB with EC2

This is an unofficial continuation of Amazon's blog post on the use of Amazon CloudFront to improve application performance.

CloudFront is a great CDN to consider, especially if you're already an Amazon Web Services customer. Unfortunately, it can only be used for static content; the loading of dynamic content will still be slower for far-away users than for nearby ones. Simply put, users in India will still see a half-second delay when loading the dynamic portions of your US-based website. And a half-second delay has a measurable impact on revenue.

Let's talk about speeding up dynamic content, globally.

The typical EC2 implementation comprises instances deployed in a single region. Such a deployment may span several availability zones for redundancy, but all instances are in roughly the same place, geographically.

This is fine for EC2-hosted apps with nominal revenue or a highly localized user base. But what if your users are spread around the globe? The problem can't be solved by moving your application to another region - that would simply shift the extra latency to another group.

For a distributed audience, you need a distributed infrastructure. But you can't simply launch servers around the world and expect traffic to reach them. Enter Global Server Load Balancing (GSLB).

A primer on GSLB
Broadly, GSLB is used to intelligently distribute traffic across multiple datacenters based on some set of rules.

With GSLB, your traffic distribution can go from this:


To this:


GSLB can be implemented as a feature of a physical device (including certain high-end load balancers) or as a part of a DNS service. Since we EC2 users are clearly not interested in hardware, our focus is on the latter: DNS-based GSLB.

Standard DNS behavior is for an authoritative nameserver to, given queries for a certain record, always return the same result. A DNS-based implementation of GSLB would alter this behavior so that queries return context-dependent results.

Example:
User A queries DNS for gslb.example.com -- response: 10.1.0.1
User B queries DNS for gslb.example.com -- response: 10.2.0.1

But what context should we use? Since our goal is to reduce wire latency, we should route users to the closest datacenter. IP blocks can be mapped geographically -- by examining a requestor's IP address, a GSLB service can return a geo-targeted response.

With geo-targeted DNS, our example would be:
User A (in China) queries DNS for geo.example.com -- response: 10.1.0.1
User B (in Spain) queries DNS for geo.example.com -- response: 10.2.0.1

Getting started
At a high level, implementation can be broken down into two steps
1) Deploying infrastructure in other AWS regions
2) Configuring GSLB-capable DNS

Infrastructure configurations will vary from shop to shop, but as an example, a read-heavy EC2 application with a single master database for writes should:
- deploy application servers to all regions
- deploy read-only (slave) database servers and/or read caches to all regions
- configure application servers to use the slave database servers and/or read caches in their region for reads
- configure application servers to use the single master in the "main" region for writes

This is what such an environment would look like:


When configuring servers to communicate across regions (app servers -> master DB; slave DBs -> master DB), you will need to use IP-based rules for your security groups; traffic from the "app-servers" security group you set up in eu-west-1 is indistinguishable from other traffic to your DB server in us-east-1. This is because cross-region communication is done using external IP addresses. Your best bet is to either automate security group updates or use Elastic IPs.

Note on more complex configurations: distributed backends are hard (see Brewer's [CAP] theorem). Multi-region EC2 environments are much easier to implement if your application tolerates the use of 1) regional caches for reads; 2) centralized writes. If you have a choice, stick with the simpler route.

As for configuring DNS, several companies have DNS-based GSLB service offerings:
- Dynect - Traffic Management (A records only) and CDN Manager (CNAMEs allowed)
- Akamai - Global Traffic Management
- UltraDNS - Directional DNS
- Comwired/DNS.com - Location Geo-Targeting

DNS configuration should be pretty similar for the vendors listed above. Basic steps are:
1) set up regional CNAMEs (us-east-1.example.com, us-west-1.example.com, eu-west-1.example.com, ap-southeast-1.example.com)
2) set up a GSLB-enabled "master" CNAME (www.example.com)
3) define the GSLB rules:
- For users in Asia, return ap-southeast-1.example.com
- For users in Europe, return eu-west-1.example.com
- For users in Western US, return us-west-1.example.com
- ...
- For all other users, return us-east-1.example.com

If your application is already live, consider abstracting the DNS records by one layer: geo.example.com (master record); us-east-1.geo.example.com, us-west-1.geo.example.com, etc. (regional records). Bring the new configuration live by pointing www.example.com (CNAME) to geo.example.com.

Bizo's experiences
Several of our EC2 applications serve embedded content for customer websites, so it's critical we minimize load times. Here's the difference we saw on one app after expanding into new regions (from us-east-1 to us-east-1, us-west-1, and eu-west-1) and implementing GSLB (load times provided by BrowserMob):

Load times before GSLB:


Load times after GSLB:


Reduced load times for everyone far from us-east-1. Users are happy, customers are happy, we're happy. Overall, a success.

It's interesting to see how the load is distributed throughout the day. Here's one application's HTTP traffic, broken down by region (ELB stats graphed by cloudviz):


Note that the use of Elastic Load Balancers and Auto Scaling becomes much more compelling with GSLB. By geographically partitioning users, peak hours are much more localized. This results in a wider difference between peak and trough demand per region; Auto Scaling adjusts capacity transparently, reducing the marginal cost of expanding your infrastructure to multiple AWS regions.

For our GSLB DNS service, we use Dynect and couldn't be more pleased. Intuitive management interface, responsive and helpful support, friendly, no-BS sales. Pricing is based on number of GSLB-enabled domains and DNS query rate. Contact Dynect sales if you want specifics (we work with Josh Delisle and Kyle York - great guys). Note that those intending to use GSLB with Elastic Load Balancers will need the CDN Management service.

Closing remarks
Previously, operating a global infrastructure required significant overhead. This is where AWS really shines. Amazon now has four regions spread across three continents, and there's minimal overhead to distribute your platform across all of them. You just need to add a layer to route users to the closest one.

The use of Amazon CloudFront in conjunction with a global EC2 infrastructure is a killer combo for improving application performance. And with Amazon continually expanding with new AWS regions, it's only going to get better.

@mikebabineau