Sunday, April 20, 2014

How to configure Grails and Geb as a web-scraping tool, versus the PHP Simple HTML DOM library

I came across Geb because I was looking for a smart way to scrape web pages using Grails. Geb is a browser automation tool.

This definition is really important: it is not a library for parsing HTML, it is a browser automation tool. What does this mean? It means that it runs an external browser and executes all the operations you coded on the browser pages, as a human would.

So it will load pages, fill in form fields, simulate clicks and do everything else you would do by hand. You can also watch the browser while it works, because Geb runs a real external browser.

It is a very sophisticated framework, but I think it should be used primarily for test automation. You can run tests with different browsers such as Internet Explorer, Chrome or Safari and query for visual properties like a div's height or width. Obviously you can also query the page for specific elements (and this is the part where you may want to use it as a scraping tool).

This is the premise I wanted to share. I think Geb is a powerful tool, but it is really useful for automated tests rather than for scraping: it seems too complex for common batch operations like scraping, compared with an easy-to-use HTML parser library such as Simple HTML DOM.

But if you want to use Geb for scraping, here are my two cents on how to configure it with Grails version > 2.3.5 (which I think is useful, because a lot of the good material out there is outdated).

Pay attention: I'm sure this is not the best way to configure it, but it took me a while to get it running because the docs are not clear about how to configure it with Grails. This is just a basic configuration, to let you run it and experiment.
There is a Geb plugin for Grails, but it seems to be useless.

So, here is my recipe.

1) Download Chrome driver from here and save it locally.

2) Create a GebConfig.groovy file under the conf folder. This should be the code:

import org.openqa.selenium.chrome.ChromeDriver
driver = {
 System.setProperty('webdriver.chrome.driver', 'path/to/your/downloaded/chromedriver')
 new ChromeDriver() 
}

3) In BuildConfig, add this repository to the repositories section:

mavenRepo "https://oss.sonatype.org/content/repositories/releases/"

4) Add under dependencies:

compile "org.gebish:geb-core:0.9.2"
compile "org.seleniumhq.selenium:selenium-support:2.26.0"
compile "org.seleniumhq.selenium:selenium-chrome-driver:2.31.0"
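To show where steps 3 and 4 end up, here is a minimal sketch of the relevant part of BuildConfig.groovy. It assumes the standard Grails 2 layout of that file; whatever repositories and dependencies Grails already generated for your project stay where they are.

```groovy
// grails-app/conf/BuildConfig.groovy (only the relevant part is shown)
grails.project.dependency.resolution = {
    repositories {
        // keep the repositories Grails generated, then add:
        mavenRepo "https://oss.sonatype.org/content/repositories/releases/"
    }
    dependencies {
        // Geb itself plus the Selenium pieces needed to drive Chrome
        compile "org.gebish:geb-core:0.9.2"
        compile "org.seleniumhq.selenium:selenium-support:2.26.0"
        compile "org.seleniumhq.selenium:selenium-chrome-driver:2.31.0"
    }
}
```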

5) Now you can create a controller to test Geb. For example:

package gt

import geb.Browser

class GebController {

    def index() {
        Browser.drive {
            go "http://google.com/ncr"

            // make sure we actually got to the page
            assert title == "Google"

            // enter wikipedia into the search field
            $("input", name: "q").value("wikipedia")

            // wait for the change to results page to happen
            // (google updates the page dynamically without a new request)
            waitFor { title.endsWith("Google Search") }

            // is the first link to wikipedia?
            def firstLink = $("li.g", 0).find("a")
            assert firstLink.text().contains("Wikipedia")

            // click the link
            firstLink.click()

            // wait for Google's javascript to redirect to Wikipedia
            waitFor { title.startsWith("Wikipedia") }
        }

        // respond only once the browser session has finished: render must not be
        // the last statement of a waitFor block, whose result is the wait condition
        render "OK"
    }
}

6) Execute the controller: you will see a new Chrome browser running all the code.
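Once the controller above works, the same mechanism can be pointed at scraping. The following is only a sketch: scrapeLinks is a hypothetical helper name, and the URL and CSS selector are placeholders for whatever page you want to scrape. It assumes the GebConfig and dependencies from the previous steps are in place.

```groovy
import geb.Browser

// Hypothetical helper: collects the text and href of every element matching a CSS selector.
List<Map> scrapeLinks(String url, String selector) {
    def results = []
    Browser.drive {
        go url
        // wait until at least one matching element is present on the page
        waitFor { !$(selector).empty }
        // a Geb navigator is iterable, each item wraps a single element
        $(selector).each { link ->
            results << [text: link.text(), href: link.getAttribute("href")]
        }
    }
    return results
}
```

You would call it, for example, as scrapeLinks("http://example.com/products", "ul.products a"), where both arguments are made up for illustration.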


PHP and the Simple HTML DOM parsing library

I will not write much about how to use the PHP library, because it is super simple to understand. But how do you integrate PHP with Grails?

Well, starting an external process with Groovy is really simple:

def process = command.execute();
process.waitForOrKill(MY_TIMEOUT);

Where command is a string and MY_TIMEOUT is expressed in milliseconds. This starts the process and kills it if it has not finished within the timeout.
This makes it really easy to run an external PHP process that performs scraping operations using the Simple HTML DOM library.

Then it is easy to read the output of the PHP process and parse it.
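As a rough sketch of the two steps just described (run the process with a timeout, then parse what it printed), something like the following could work. It assumes the PHP script prints JSON on standard output; the function names runWithTimeout and parseScraperOutput are made up for this example.

```groovy
import groovy.json.JsonSlurper

// Runs an external command, kills it after timeoutMillis, and returns whatever it wrote to stdout.
String runWithTimeout(List<String> command, long timeoutMillis) {
    def process = command.execute()
    def stdout = new StringBuffer()
    def stderr = new StringBuffer()
    // drain both streams in background threads so the child never blocks on a full pipe
    Thread outReader = process.consumeProcessOutputStream(stdout)
    Thread errReader = process.consumeProcessErrorStream(stderr)
    process.waitForOrKill(timeoutMillis)
    // make sure the readers have finished before using the buffer
    outReader.join()
    errReader.join()
    return stdout.toString()
}

// Parses the JSON text printed by the (hypothetical) scraper script into maps and lists.
def parseScraperOutput(String json) {
    new JsonSlurper().parseText(json)
}
```

A call might look like parseScraperOutput(runWithTimeout(["php", "scrape.php"], 60000)), where scrape.php stands for your own script.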

As an example, in a project of mine I run external PHP processes (scheduled by a Grails application) which perform the scraping and, at the end, call a Grails controller, passing it a JSON result. The Grails controller persists the object and carries out further operations.
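The receiving side of that setup could look roughly like this. It is a sketch, not the code from my project: ScrapeResultController and the Product domain class (with name and price fields) are hypothetical, and it assumes the PHP script sends a POST with a JSON body and a Content-Type of application/json.

```groovy
package gt

// Hypothetical controller; Product is an assumed domain class with name and price fields.
class ScrapeResultController {

    // the PHP script pushes data, so only POST is accepted
    static allowedMethods = [save: 'POST']

    def save() {
        // request.JSON is the parsed JSON body of the request
        def payload = request.JSON
        def product = new Product(name: payload.name, price: payload.price)
        if (product.save(flush: true)) {
            render status: 201, text: "OK"
        } else {
            // validation failed, let the caller know
            render status: 422, text: "Invalid payload"
        }
    }
}
```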

I think this is a faster and cleaner approach to web scraping than using Geb.
I think Geb may be useful if you have to fill in a lot of forms and visit a lot of pages to perform the scraping. But if all you have to do is scrape lists (for example product lists) and navigate through pagination, then I think parsing with PHP is quicker and more productive.
Now that you have read my article, I would like to show you one more thing: I've developed an app to help increase customer registrations and conversions.

You can find it at appromocodes.com

Wednesday, April 2, 2014

Grails Spring Security Plugin manual login - registration - remember me

The Spring Security plugin is a powerful framework for managing authentication. It can be used out of the box by following its very clear documentation, but what if you want to implement manual login, registration and rememberMe functionality?
I found that login and registration are quite easy to understand from the plugin guide, while rememberMe was a little cryptic.

So this is the code I want to share: it is server-side example code (values are hard-coded for the sake of illustration). This code is for the spring-security-core:2.0-RC2 plugin.

Please refer to the plugin guide for installation and setup.


User registration


This code creates a User instance and saves it to the database with the password encrypted (for Role creation and assignment, please refer to step 7 in the tutorial):

def user = new User();
user.username = "john";
user.password = "secretpassword";
user.save()

User login


In your controller, define references to the springSecurityService and passwordEncoder beans via

def springSecurityService
def passwordEncoder

then in your code

def user = User.findByUsername("john")
// thepasswordtocheck is the plain-text password submitted by the user
def isLoggable = user && passwordEncoder.isPasswordValid(user.password, thepasswordtocheck, null)

// add this if you want to actually log the user in
if (isLoggable) {
    springSecurityService.reauthenticate(user.username)
}

Is the user logged in?


def isLoggedIn = springSecurityService.isLoggedIn();

RememberMe functionality


NOTE: with spring-security-core:2.0-RC2 the property namespace changed from grails.plugins.springsecurity to grails.plugin.springsecurity (note the missing 's' in the word plugin); a lot of examples out there refer to the old plugin version.

Set in your Config.groovy:

grails.plugin.springsecurity.rememberMe.alwaysRemember = true

I would also set

grails.plugin.springsecurity.rememberMe.cookieName = 'grails_remember_me'
grails.plugin.springsecurity.rememberMe.key = 'anewrandomkey'

to better secure the token, but that is optional.

Then in your code you have to define a reference to the rememberMeServices bean via

def rememberMeServices

Then you should call rememberMeServices.loginSuccess... well, that is the method specified by the interface, but it did not work for me! You have to call rememberMeServices.onLoginSuccess instead.
This was a critical step, and I had to dig a lot to find it!

So the code is:

def user = User.findByUsername("john");
springSecurityService.reauthenticate(user.username);
rememberMeServices.onLoginSuccess(request, response, springSecurityService.getAuthentication());
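Putting the login and rememberMe snippets together, a manual login action could be sketched as follows. ManualLoginController is a hypothetical name, and params.username and params.password are assumed to come from your own login form; the User class is the one created during the plugin setup.

```groovy
package gt

// Hypothetical controller combining the snippets above.
class ManualLoginController {

    def springSecurityService
    def passwordEncoder
    def rememberMeServices

    def login() {
        def user = User.findByUsername(params.username)
        // compare the stored hash with the submitted plain-text password
        if (!user || !passwordEncoder.isPasswordValid(user.password, params.password, null)) {
            render status: 401, text: "Invalid credentials"
            return
        }
        // put the user into the security context
        springSecurityService.reauthenticate(user.username)
        // write the remember-me cookie for the freshly created authentication
        rememberMeServices.onLoginSuccess(request, response, springSecurityService.authentication)
        render "OK"
    }
}
```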



That's all, I hope it helps someone!