Archive for the ‘Architecture’ Category

Experiences at Cloud Computing Conference Pune 2011

June 7, 2011 Comments off

I was one of the speakers at the second IndicThreads conference, held in Pune on 3–4 June 2011.

Sessions at the conference dealt with key topics like Cloud Security, Amazon Elastic Beanstalk, Legal Issues in Cloud Computing, OpenStack, Xen Cloud Platform, Rails and CouchDB on the cloud, Cloud Foundry, GigaSpaces PaaS, Monitoring Cloud Applications, ORM with Objectify-Appengine, Scalable Architecture on the Amazon AWS Cloud, Cloud Lock-in, Cloud Interoperability, Apache Hadoop, MapReduce, and Predictive Analysis.

My talk focused on managing persistence on Google App Engine (GAE). It dealt with the choices available to a developer and then showed how to do it with Objectify-Appengine.



The demo application is available here for download. For instructions on how to run it, please read the wiki.

Experiences of the First CSD Course in India, Bangalore, 2–4 Feb 2011

April 14, 2011 Comments off

It has been a bit more than two months since we conducted the first CSD course in India. Writing about our experiences of this first course somehow kept getting postponed for one reason or another. Today I decided to give it a go, believing “better late than never”. :-)

As I mentioned in my last post on this subject, Inphina is conducting the training under the umbrella of GoodAgile, in collaboration with my old colleague and good friend Srinivas Chillara from Scrum Coach. We decided to conduct our first course in Bangalore because of its heavy concentration of software professionals and a generally greater demand for this kind of training than anywhere else in India. We were pleasantly surprised by the response: we soon sold out all the seats and had to put several candidates on a wait-list.

Course Overview

Following are the broad agenda items we covered during the course:

  • Test Driven Development
  • Emergent Design and Agile Architecture
  • Refactoring
  • Collaboration
  • Continuous Integration

As the course centers on the “how” part of the Scrum methodology (delivering high-quality software, embracing change in the design of the system, cross-functional teams, fast and short delivery cycles, potentially shippable product increments, and high velocity on a regular basis), we designed the course in a very hands-on manner. Before the start of the course, we sent out a list of hardware and software prerequisites to all the participants so they could come prepared.

Running the Actual Course

We followed what we call a two-pass approach for the course. In the first pass, we provided a brief overview of each of the five areas to be covered during the training and followed it up with a discussion with the participants. This gave us more clarity on their expectations, along with specific areas that should either be stressed more or incorporated into the course. In the second pass, we delved into each of the areas in detail. For each topic, we used a combination of presentations and demos with code samples, followed by one or more hands-on exercises for the participants. Let’s quickly get into some of the significant details about each of the topics:
Read more…

Categories: Agile, Architecture

Build a simple web crawler with Scala and GridGain

March 10, 2011 4 comments

Inphina provides specialized Scala consulting, training and offshoring … Learn more about Scala@Inphina!

(Cross-posted on my personal blog)

Recently, as a proof of concept, I had to build a crawler. Of course, I cannot share many details about that project, other than to state that it’s an absolute privilege to be part of it. :-)

I set out to build this crawler.

Prior experience pairing with Narinder and Vikas had made me aware of distributed computing technologies such as Hadoop and GridGain, so I knew where to start. Based on past experience, I immediately picked GridGain over Hadoop, for fairly obvious reasons: more examples, better support, etc.

My next choice was a programming language. Java was the obvious choice, but I took a risk and chose Scala. GridGain’s support for Scala and the abundance of examples made this choice a bit easier. A quick, unofficial definition for those unaware: Scala is an object-functional programming language that is very attractive to programmers and has proven itself in high-scalability situations (Twitter, LinkedIn, Foursquare, etc.).

Note: I am new to Scala, and my Scala code may look more Java-like than functional. I’m still learning, and future examples should be better. “Awesomeness of Scala code” is not a valid parameter by which to judge this blog post!

Professional etiquette (and NDAs plus lawyers) will not allow me to share the exact details of this crawler; after all, it is not my intellectual property. But for the sake of this example, I will consider my target to be a simple web crawler of the kind a search engine would use to index content on the internet.

What would our web crawler do?

  1. Start at some base URL
  2. Index content of this URL
  3. Search for more URLs to index
  4. Repeat 2 & 3 for these new URLs

This blog post will not get into the operational logic of loading a URL, extracting keywords, adding to an index, extracting URLs, etc. That, I believe, has been done to death. Instead, I will look at how to scale up the crawling process using Scala and GridGain.

For those already familiar with GridGain: for the sake of this example, I would request that you merge the concepts of a GridTask and a GridJob. Here we will create custom GridTasks, each of which has exactly one corresponding custom GridJob.

Our GridTask-GridJob Pairs will be:

  • LoadUrlTask, LoadUrlJob
  • IndexKeywordsTask, IndexKeywordsJob

Much of the game is played in LoadUrlJob. Its role is envisioned as follows:

  1. Make HTTP request to URL
  2. Gather response data from URL
  3. Trigger IndexKeywordsTask for URL data
  4. Fetch new URLs from URL data
  5. Trigger LoadUrlTask for new URLs

While the rest have simple roles:

  • LoadUrlTask = returns one LoadUrlJob
  • IndexKeywordsTask = returns one IndexKeywordsJob
  • IndexKeywordsJob = parses data and indexes keywords

In other words, an IndexKeywordsJob indexes keywords and dies. In contrast, a LoadUrlJob triggers exactly one IndexKeywordsTask and potentially multiple further LoadUrlTasks.

Let’s look at the sources:
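// ----- LoadUrlTask.scala -----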

package net.srirangan.simplewebcrawler.tasks

import java.lang.String
import java.util.{List,ArrayList}
import org.gridgain.grid._
import net.srirangan.simplewebcrawler.jobs.LoadUrlJob

class LoadUrlTask extends GridTaskNoReduceSplitAdapter[String] {

  def split(gridSize:Int, url:String):List[GridJob] = {
    val jobs:List[GridJob] = new ArrayList[GridJob]()
    val job:GridJob = new LoadUrlJob(url)
    jobs.add(job)
    jobs
  }
  
}
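
// ----- LoadUrlJob.scala -----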
package net.srirangan.simplewebcrawler.jobs

import java.lang.String
import java.util.{List,ArrayList}
import org.gridgain.grid.GridJobAdapterEx
import org.gridgain.scalar.scalar._
import net.srirangan.simplewebcrawler.tasks.{LoadUrlTask,IndexKeywordsTask}

class LoadUrlJob(url:String) extends GridJobAdapterEx {
  def execute():Object = {
    println("load url for - " + url)

    val data:String = "this is data for " + url
    val urls:List[String] = new ArrayList[String]()

    //
    // .. actual parser logic comes here
    // .. data:String will contain the contents of url:String
    // .. urls:List is a list of all new URLs found in data:String
    //
    
    // Start indexing keywords for data:String from url:String
    grid.execute(classOf[IndexKeywordsTask], data).get
    
    // adding dummy url in urls:List
    urls.add(url + ".1")

    // start a LoadUrlTask for each URL in urls:List
    // (grab the iterator once; calling urls.iterator in the loop condition
    // would return a fresh iterator every pass and loop forever)
    val urlIterator = urls.iterator
    while (urlIterator.hasNext) {
      val nextUrl:String = urlIterator.next()
      grid.execute(classOf[LoadUrlTask], nextUrl).get
    }

    data
  }
}
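
// ----- IndexKeywordsTask.scala -----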
package net.srirangan.simplewebcrawler.tasks

import java.lang.String
import java.util.{List,ArrayList}
import org.gridgain.grid.GridJob
import org.gridgain.grid.GridTaskNoReduceSplitAdapter
import net.srirangan.simplewebcrawler.jobs.IndexKeywordsJob

class IndexKeywordsTask extends GridTaskNoReduceSplitAdapter[String] {

  protected def split(gridSize:Int, data:String):List[GridJob] = {
    val jobs:List[GridJob] = new ArrayList[GridJob]()
    val job:GridJob = new IndexKeywordsJob(data)
    jobs.add(job)
    jobs
  }
  
}
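
// ----- IndexKeywordsJob.scala -----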
package net.srirangan.simplewebcrawler.jobs

import java.lang.String
import org.gridgain.grid.GridJobAdapterEx
import org.gridgain.scalar.scalar._

class IndexKeywordsJob(data:String) extends GridJobAdapterEx {
  def execute():Object = {
    println(data)
    // .. actual indexing logic comes here
    null
  }
}

Complete Mavenized sources for the Scala GridGain SimpleWebCrawler can be found on GitHub: https://github.com/Srirangan/simplewebcrawler

A quick look at the role of LoadUrlJob shows that it needs to scale, and scale big. Here is a visualization showing three levels of LoadUrlJobs, wherein each LoadUrlJob spawns three other LoadUrlJobs and one IndexKeywordsJob.

GridGain takes care of this seamlessly and divides the tasks among the available nodes without any configuration or instruction. Here are screenshots showing three GridGain nodes: one inside my IDE and the other two on the console.

Is this a perfect web crawler? No, far from it. For one, you need to control its spawn rate, or else your machine will die. :-)

But it is an example that showcases the power of GridGain and the ease with which Scala / Scalar can leverage it.


Srirangan is a programmer / senior consultant with Inphina Technologies
Blog   GitHub   LinkedIn   Twitter

Servlet Unit Testing : Experiences with Mockrunner

February 13, 2011 1 comment

This is a follow-up to my previous post, where I provided different views on approaching servlet testing along with an overview of various frameworks for doing so. In this post, I will delve a bit into the details of my experiences unit-testing servlet code with Mockrunner, one of the frameworks mentioned in my last post.

Integration with Maven

I am using Maven (version 2.2.1) as the build-management tool for the project, so a publicly available artifact allowing easy integration into my project was my preference. Unfortunately, the Mockrunner team doesn’t publish its artifacts to a Maven repository, as is clear from this discussion thread on their user forum. But I did find a publicly available Mockrunner artifact to include in my pom.xml and decided to give it a try. As feared, I ran into issues because Mockrunner’s pom file was referring to a couple of dependencies that are not available in the public Maven repository. I am talking here about version 0.4 of Mockrunner, with com.mockrunner as the groupId and mockrunner-jdk1.6-j2ee1.3 as the artifactId. The problematic artifacts for me were cglib-nodep and jboss-jee. In case you want to integrate Mockrunner through Maven, here are a couple of steps to make it work:

  • Exclude the problematic dependencies inside your project’s pom.xml. Here is the snippet of my pom.xml that does it:
    		<dependency>
    			<groupId>com.mockrunner</groupId>
    			<artifactId>mockrunner-jdk1.6-j2ee1.3</artifactId>
    			<version>0.4</version>
    			<exclusions>
    				<exclusion>
    					<groupId>cglib-nodep</groupId>
    					<artifactId>cglib-nodep</artifactId>
    				</exclusion>
    				<exclusion>
    					<groupId>jboss</groupId>
    					<artifactId>jboss-jee</artifactId>
    				</exclusion>
    			</exclusions>
    		</dependency>
    
  • Download Mockrunner from the website. Copy both of the above-mentioned jars (cglib-nodep-2.2.jar and jboss-javaee-modified.jar) from the downloaded and exploded archive to a subdirectory inside your project. Then explicitly declare both of these jars as system dependencies in your pom.xml. A snippet of my pom.xml for your reference:
    		<dependency>
    			<groupId>cglib</groupId>
    			<artifactId>nodep</artifactId>
    			<version>2.2</version>
    			<scope>system</scope>
    			<systemPath>${basedir}/lib/cglib-nodep-2.2.jar</systemPath>
    		</dependency>
    		<dependency>
    			<groupId>jboss</groupId>
    			<artifactId>javaee-modified</artifactId>
    			<version>4.2.0</version>
    			<scope>system</scope>
    			<systemPath>${basedir}/lib/jboss-javaee-modified.jar</systemPath>
    		</dependency>
    

    This assumes that you copied the jars into the lib folder of the project.
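With the dependencies resolved, a first test can confirm that the setup works. Here is a minimal sketch of a Mockrunner servlet test built on BasicServletTestCaseAdapter; the HelloServlet under test is a hypothetical example, not code from my actual project.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.mockrunner.servlet.BasicServletTestCaseAdapter;

public class HelloServletTest extends BasicServletTestCaseAdapter {

    public void testDoGetWritesGreeting() {
        // Register the servlet under test with Mockrunner's simulated container
        createServlet(HelloServlet.class);

        // Prepare a request parameter on the mock request
        getWebMockObjectFactory().getMockRequest().setupAddParameter("name", "Inphina");

        // Run a GET request through the servlet
        doGet();

        // Assert on what the servlet wrote to the response
        assertEquals("Hello, Inphina!", getOutput().trim());
    }
}

// Hypothetical servlet under test (would normally live in its own file)
class HelloServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.getWriter().write("Hello, " + req.getParameter("name") + "!");
    }
}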

Read more…

Servlet Testing : Requirements and Options

January 14, 2011 1 comment

It may sound a little bizarre, in today’s world of multiple frameworks, that we would need to write servlets ourselves. Well, that happens to be the case for me at the moment. If you were in a similar situation, you would want a way to test these servlets as well. Here it gets interesting, as there are different opinions on what we should be testing and how we should handle it. One can get more fine-grained, but I would like to broadly classify the approaches into two categories:

  • Integration Testing
  • Functional Unit Testing

Integration Testing

Many people believe that servlets are like glue between your presentation layer and business layer. Most of the business logic is written in separate Java classes, which simply get called via the servlets. This business layer gets tested using appropriate testing frameworks like JUnit, DBUnit, mock objects, etc. Testing of servlets should then happen implicitly, through an integration-testing framework. In this case, we will typically be interacting with a servlet container. If that’s your thought or requirement, the following are some of the most popular test frameworks for the purpose:
Read more…

Using Scripted Data-Sources in BIRT and Deploying in Application Server

September 12, 2010 Comments off

We have been working on a requirement to render a complex financial report for users of our web application, and also to offer the possibility of exporting the HTML report in PDF format. We decided to use BIRT for the purpose because of its rich capabilities in terms of design and layout, along with its out-of-the-box support for exporting the generated report in different formats like PDF, PS, Excel, etc. Along with this functional requirement, we also wanted to use a pre-populated Java object graph as the data source for our report. In this post, I would like to talk about two things:

  • How to Use Java Objects as Data Source for BIRT Report
  • How to package everything and deploy in an Application Server

Using Java Objects as Data Source for BIRT Report

While working with different types of reporting tools, the commonly used approach is to fetch data from the underlying database through JDBC data sources and then use the output of the corresponding data sets to render different parts of the report in the form of tables, charts, cross-tabs, etc. In our case, the complexity was that the data set is quite big and generally comes from multiple sources, such as different databases, web services, and others. If we tried to achieve the same inside BIRT, it might, first, get quite complicated if not outright impossible, and secondly, we might suffer serious performance issues in terms of the time required to render the complete report. So we decided to do this information-gathering work in an earlier step and then use a pre-populated Java object graph as an already complete data source for our report. This also allowed us to reuse our existing components and domain classes rather than duplicate or rewrite code specifically for report design.
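To make this concrete, here is a minimal Java sketch of the runtime side: the pre-populated object graph is handed to the BIRT report engine through the task’s application context, and the report is then exported to PDF. The design-file name, the FinancialData context key, and the buildFinancialDataGraph() helper are hypothetical; inside the .rptdesign, a scripted data set would read the graph back via reportContext.getAppContext().

import java.util.HashMap;
import org.eclipse.birt.core.framework.Platform;
import org.eclipse.birt.report.engine.api.EngineConfig;
import org.eclipse.birt.report.engine.api.IReportEngine;
import org.eclipse.birt.report.engine.api.IReportEngineFactory;
import org.eclipse.birt.report.engine.api.IReportRunnable;
import org.eclipse.birt.report.engine.api.IRunAndRenderTask;
import org.eclipse.birt.report.engine.api.PDFRenderOption;

public class FinancialReportRunner {

    public void renderToPdf() throws Exception {
        // Boot the BIRT platform and create a report engine
        EngineConfig config = new EngineConfig();
        Platform.startup(config);
        IReportEngineFactory factory = (IReportEngineFactory) Platform
                .createFactoryObject(IReportEngineFactory.EXTENSION_REPORT_ENGINE_FACTORY);
        IReportEngine engine = factory.createReportEngine(config);

        // Open the report design (hypothetical file name)
        IReportRunnable design = engine.openReportDesign("financial-report.rptdesign");
        IRunAndRenderTask task = engine.createRunAndRenderTask(design);

        // Hand the pre-populated Java object graph to the report via the
        // application context; a scripted data set in the design reads it back
        // with reportContext.getAppContext().get("FinancialData")
        HashMap<String, Object> appContext = new HashMap<String, Object>();
        appContext.put("FinancialData", buildFinancialDataGraph()); // hypothetical helper
        task.setAppContext(appContext);

        // Export to PDF using the out-of-the-box renderer
        PDFRenderOption options = new PDFRenderOption();
        options.setOutputFormat("pdf");
        options.setOutputFileName("financial-report.pdf");
        task.setRenderOption(options);

        task.run();
        task.close();
        engine.destroy();
        Platform.shutdown();
    }

    private Object buildFinancialDataGraph() {
        // ... gather data from databases, web services, etc. in an earlier step
        return new Object();
    }
}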
Read more…

Inphina Dissects the Public Cloud Space

June 29, 2010 4 comments

Inphina recently conducted a study of the most significant offerings in the public cloud space. The four major public cloud vendors considered were Google, Salesforce, IBM, and Amazon.

Inphina categorized each of them on the basis of their offerings with respect to:

  1. Software as a Service (SaaS)
  2. Infrastructure as a Service (IaaS), and
  3. Platform as a Service (PaaS)

While Google and Salesforce took a combined lead in all three categories, IBM and Amazon have significant offerings in the IaaS and PaaS space. Check out the complete report with matrix comparisons and Inphina offerings in the cloud space.

Open Closed Principle in Action

June 10, 2010 Comments off

Ivar Jacobson stated that all software entities change during their life cycle, and this must be borne in mind when developing systems that are expected to last longer than the first version. A design principle that helps you build long-lasting software is the Open Closed Principle (OCP).

The Open Closed Principle was coined by Bertrand Meyer. It states that “software entities should be open for extension but closed for modification”.

When a single change to a program causes a cascade of changes to dependent modules, that is what we call a code smell, or bad software design. The Open Closed Principle attacks this in a straightforward way: it states that you should design modules that never change. However, when requirements change, our first reaction is to add or modify code. Right?

This seems odd, because new application requirements warrant code changes in modules. It seems bizarre that these seemingly opposite attributes can be made to work together.

Figure 1 shows a simple design that does not conform to the Open Closed Principle. Both the Client and Server classes are concrete, and the Client class uses the Server class. If tomorrow there is a need for the Client class to use a different server class, then the Client class must be changed to use the new Server class. This smells of tight coupling.

Figure 1

Figure 2, on the other hand, shows the corresponding design that conforms to the Open Closed Principle. In this case, AbstractServer is an abstract class, and the Client uses this abstraction. The Client class will now use derived classes of the AbstractServer class. If we want the Client to use a different server, a new derivative of the AbstractServer class can be created. The Client remains unchanged.
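For readers who prefer code to diagrams, here is a minimal Java sketch of the Figure 2 design. The Client and AbstractServer names follow the text; the two concrete derivatives are hypothetical examples.

// The abstraction the Client depends on: closed for modification
abstract class AbstractServer {
    abstract void serve();
}

// New behavior arrives as new derivatives: open for extension
class ServerA extends AbstractServer {
    void serve() { System.out.println("Served by A"); }
}

class ServerB extends AbstractServer {
    void serve() { System.out.println("Served by B"); }
}

// The Client knows only the abstraction, so it never changes when a
// new server variant is introduced
class Client {
    private final AbstractServer server;
    Client(AbstractServer server) { this.server = server; }
    void run() { server.serve(); }
}

public class OcpDemo {
    public static void main(String[] args) {
        new Client(new ServerA()).run();  // swap in ServerB without touching Client
    }
}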

Read more…

Combining CEP with JMS

April 29, 2010 1 comment

In recent posts, we have touched upon several areas related to CEP. Today, I will be talking about integrating a CEP engine with a JMS provider. The question that may come to your mind is why such a scenario would be required.

Need for JMS Integration with CEP Engine

  • Events are generated from multiple applications
  • Several instances of the same application are running on multiple nodes, and all these instances are generating different events

In both of the above scenarios, it will not be possible for us to embed the CEP engine as part of the application, because the power of a CEP engine lies in its ability to correlate multiple different events into a single complex event, which can then be processed further. The ideal solution would be to have a separate application running the CEP engine and receiving events from multiple event producers. It is here that we can place a JMS system between one or more event producers and the CEP engine. Now that we are convinced about the use case, let’s see how we can easily integrate the two systems. Broadly, the architecture of such a system would look like this:

For this post, I will be using Apache ActiveMQ as the JMS provider and Esper as the CEP engine, but one can use other implementations and things should still work fine.
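As a rough illustration of the bridge (not the exact setup from any project), here is a minimal Java sketch: a JMS listener consumes events from an ActiveMQ queue and pushes them into Esper, where an EPL statement correlates them. The PriceEvent class, queue name, and EPL query are hypothetical, and the code assumes the com.espertech.esper.client API of the Esper versions current at the time.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.ObjectMessage;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

public class JmsToEsperBridge {

    public static void main(String[] args) throws Exception {
        // 1. Configure Esper with the (hypothetical) event type
        Configuration config = new Configuration();
        config.addEventType("PriceEvent", PriceEvent.class);
        final EPServiceProvider epService = EPServiceProviderManager.getDefaultProvider(config);

        // 2. An EPL statement that correlates simple events into a complex one
        EPStatement stmt = epService.getEPAdministrator().createEPL(
                "select symbol, avg(price) as avgPrice from PriceEvent.win:time(30 sec) group by symbol");
        stmt.addListener(new UpdateListener() {
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                System.out.println(newEvents[0].get("symbol") + " -> " + newEvents[0].get("avgPrice"));
            }
        });

        // 3. Consume events from ActiveMQ and feed them into the CEP engine
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("cep.events"));
        consumer.setMessageListener(new MessageListener() {
            public void onMessage(Message message) {
                try {
                    // Producers publish serialized event objects as ObjectMessages
                    PriceEvent event = (PriceEvent) ((ObjectMessage) message).getObject();
                    epService.getEPRuntime().sendEvent(event);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        connection.start();
    }
}

// Hypothetical event published by the producers
class PriceEvent implements java.io.Serializable {
    private final String symbol;
    private final double price;
    public PriceEvent(String symbol, double price) { this.symbol = symbol; this.price = price; }
    public String getSymbol() { return symbol; }
    public double getPrice() { return price; }
}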

Read more…

Object Caching for Higher Performance

April 12, 2010 4 comments

While working on high-traffic websites, software developers are often confronted with the question of how to increase the performance of the application. I would like to talk about one commonly recurring scenario:

  • The application does a lot of computation to gather the requested information, and these computations are quite memory-intensive
  • The results are generally stored in a normalized manner in an RDBMS before being rendered to the caller
  • The same information is requested by multiple clients, resulting in repeated RDBMS hits and thus in performance deterioration

In this context, we thought of using a caching framework to help us significantly reduce the DB hits and thereby achieve a noticeable performance gain in terms of response times.

There are many frameworks available today that provide object-level caching. I will be sharing my experience with Memcached. As explained on its home page, it’s an open-source caching system that also provides distributed object caching support. The list of users relying on Memcached to alleviate database load is pretty impressive; please visit their website for more information.
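To illustrate the pattern we used, here is a minimal cache-aside sketch in Java with the spymemcached client, one of several Java clients for Memcached. The key scheme, the one-hour expiry, and the loadReportFromDb() helper are hypothetical stand-ins, not code from our application.

import java.io.IOException;
import java.io.Serializable;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class ReportCache {

    private final MemcachedClient client;

    public ReportCache() throws IOException {
        // Connect to a memcached node; passing several addresses here is what
        // gives you the distributed cache mentioned above
        client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
    }

    public Report getReport(String reportId) {
        String key = "report:" + reportId;          // hypothetical key scheme
        Report report = (Report) client.get(key);   // 1. try the cache first
        if (report == null) {
            report = loadReportFromDb(reportId);    // 2. expensive computation + DB hit
            client.set(key, 3600, report);          // 3. cache the result for an hour
        }
        return report;
    }

    private Report loadReportFromDb(String reportId) {
        // ... the memory-intensive gathering and normalization work goes here
        return new Report(reportId);
    }
}

// Cached values must be Serializable for the default transcoder
class Report implements Serializable {
    final String id;
    Report(String id) { this.id = id; }
}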

Read more…
