Developer Blog
Java/Scala and Highly Scalable Systems on AWS
Beware Java Enums in Spark
23 February 2014
A few days back I wrote a Spark job that runs an A/B test to compare the conversion rates between two groups of website visitors on one of our client's websites.
Why we chose not to git fast forward merge
14 February 2014
This post comes from an interesting email thread we had. In the thread, Stephen was answering various questions from the team about why it is so important to NOT use fast-forward merges.
SCM Migration
26 August 2013
We happily used Atlassian’s hosted OnDemand service for source code management with the following setup
Using AWS Custom SSL Domain Names for CloudFront
20 June 2013
AWS recently announced the limited availability of Custom SSL Domain Names for CloudFront. You have to request an invitation in order to start using it but I am guessing it won't be long until it has been rolled out to all customers.
Scala Command-Line Hacks
22 April 2013
Do you like command-line scripting and one-liners with Perl, Ruby and the like?
Efficiency & Scalability
19 April 2013
Software engineers know that distributed systems are often hard to scale, and many can intuitively point to reasons why this is the case by bringing up points of contention, bottlenecks and latency-inducing operations. Indeed, there exists a plethora of reasons and explanations as to why most distributed systems are inherently hard to scale, from the CAP theorem to scarcity of certain resources, e.g., RAM, network bandwidth ...
Sensible Defaults for Apache HttpClient
15 April 2013
Before coming to Bizo, I wrote a web service client that retrieved daily XML reports over HTTP using the Apache DefaultHttpClient. Everything went fine until one day the connection simply hung forever. We found this odd because we had set the connection timeout. It turned out we also needed to set the socket timeout (HttpConnectionParams.SO_TIMEOUT). The default for both connection timeout (max time to wait for a connection) and socket timeout (max time to wait between consecutive data packets) is infinity. The server was accepting the connection but then not sending any data so our client hung forever without even reporting any errors. Rookie mistake, but everyone is a rookie at least once. Even if you are an expert with HttpClient, chances are there will be someone maintaining your code in the future who is not.
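As a rough illustration of the same pitfall, here is a minimal sketch that sets both timeouts explicitly. It uses the JDK's built-in HttpURLConnection rather than Apache's DefaultHttpClient, so it is an analogy and not the code from the post: setConnectTimeout plays the role of the connection timeout and setReadTimeout plays the role of the socket timeout, and both default to 0 (wait forever) in the JDK as well. The class name, helper name, URL and timeout values are all made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class TimeoutDefaults {
    // Hypothetical helper: prepares (but does not yet open) an HTTP connection
    // with BOTH timeouts set, so neither is left at the infinite default.
    static HttpURLConnection openWithTimeouts(String url, int connectTimeoutMs, int readTimeoutMs) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Max time to wait for the TCP connection to be established.
            conn.setConnectTimeout(connectTimeoutMs);
            // Max time to wait for data once connected (the "socket timeout").
            conn.setReadTimeout(readTimeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed values: 5 s to connect, 30 s between reads.
        HttpURLConnection conn =
            openWithTimeouts("http://example.com/report.xml", 5_000, 30_000);
        System.out.println("connect timeout ms: " + conn.getConnectTimeout());
        System.out.println("read timeout ms: " + conn.getReadTimeout());
    }
}
```

With the read timeout set, a server that accepts the connection and then goes silent should cause a SocketTimeoutException on read instead of an indefinite hang.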
Map-side aggregations in Apache Hive
18 February 2013
When running large scale Hive reports, one error we occasionally run into is the following:
Reader Driven Development
15 February 2013
In this talk on Effective ML, Yaron Minsky talks about Reader Driven Development: writing your code with the reader in mind, and making decisions that will make the code easier for other developers to read and understand down the line.