Crawling a Web Page with Scrapy

Hi all, today I'm going to share how to crawl a web page with Scrapy. See my previous post for installation.

It's always best to create a separate directory when learning any language or framework.

Open a terminal (Ctrl+Alt+T)

mkdir (name of the directory)

cd (name of the directory)

To create a new project in Scrapy:

scrapy startproject sample

Now change into the project directory:

cd sample

Type the command ls:

sample   scrapy.cfg

Now cd sample again, and run ls once more:

__init__.py   items.py   pipelines.py   settings.py   spiders

Scrapy created these files inside the sample (name of the project) folder.

If you already know Django or Python, you can easily understand these. If not, don't worry, I will try to explain as much as I can.

  • scrapy.cfg: the project configuration file
  • sample/: the project’s python module, you’ll later import your code from here.
  • sample/items.py: the project’s items file.
  • sample/pipelines.py: the project’s pipelines file.
  • sample/settings.py: the project’s settings file.
  • sample/spiders/: a directory where you’ll later put your spiders.
  • sample/__init__.py: marks the directory as a Python package and holds any package initialization code.

Now it's time to get into Scrapy and crawl a page.

The things we have to do to crawl a web page are:

  • Define items in items.py
  • Write a spider to crawl a page and extract items
  • Write a pipeline in pipelines.py to store the extracted items

Things to do in items.py

from scrapy.item import Item, Field

class SampleItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()  # title of a page
    link = Field()   # links of the page
    desc = Field()   # description of a page

Things to do in spiders/

To create a spider, run scrapy genspider (name) (domain name), or simply create a .py file in the spiders folder yourself.

mydomain.py (I created mydomain.py in the spiders folder of the project)

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from sample.items import SampleItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = HtmlXPathSelector(response)
        sites = sel.select('//ul/li')  # every list entry on the page
        items = []
        for site in sites:
            item = SampleItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

Let me explain what I did above.

For a spider, three things are necessary:

  • name
  • start_urls
  • parse method

name must be unique for every spider in the project.

start_urls is the list of URLs which we want to crawl.

parse is the method that extracts data from the web page.

allowed_domains allows pages from the given domains and restricts pages from other domains.
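To make the domain restriction a bit more concrete, here is a minimal sketch of the idea in plain Python 3. It is only a simplified model and not Scrapy's actual implementation; the function name is_allowed is made up for this example.

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """True if the URL's host is one of the allowed domains or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

allowed = ["dmoz.org"]
print(is_allowed("http://www.dmoz.org/Computers/", allowed))   # True
print(is_allowed("http://example.com/page", allowed))          # False
```

The subdomain check uses "." + domain so that a host like notdmoz.org is not mistaken for dmoz.org.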

That's it. Now run the spider and see what happens.

To run a spider

scrapy crawl [name of the spider]

In our code the spider name is 'dmoz', so:

scrapy crawl dmoz

If your code is correct, I mean without any errors, you can see the spider opened, crawled and closed in the terminal like below

2013-11-27 16:09:20+0530 [dmoz] INFO: Spider opened
2013-11-27 16:09:20+0530 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-27 16:09:20+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-11-27 16:09:20+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-11-27 16:09:21+0530 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2013-11-27 16:09:21+0530 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\r\n\r\n '],
'images': [],
'link': [u'/'],
'title': [u'Top']}
2013-11-27 16:09:21+0530 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [], 'images': [], 'link': [u'/Computers/'], 'title': [u'Computers']}
2013-11-27 16:09:21+0530 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [],
'images': [],
'link': [u'/Computers/Programming/'],
'title': [u'Programming']}
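If you want to see why every field in the log is a list, the three XPath expressions of the parse method can be imitated with Python's standard library. This is only a rough sketch in Python 3: it uses xml.etree.ElementTree as a stand-in for Scrapy's selector, and the page fragment is hypothetical, so the real dmoz.org markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (an assumption for this sketch).
PAGE = """\
<html><body>
  <ul>
    <li><a href="/Computers/">Computers</a> - top-level category</li>
    <li><a href="/Computers/Programming/">Programming</a> - languages and tools</li>
  </ul>
</body></html>
"""

def parse(page):
    """Return one dict per <li>, with list-valued fields like the spider's items."""
    root = ET.fromstring(page)
    items = []
    for site in root.findall(".//ul/li"):            # plays the role of '//ul/li'
        links = site.findall("a")
        item = {
            "title": [a.text for a in links],        # like 'a/text()'
            "link": [a.get("href") for a in links],  # like 'a/@href'
            # Text directly inside <li>: before the first child and after each child.
            "desc": [t for t in [site.text] + [a.tail for a in links] if t],
        }
        items.append(item)
    return items

items = parse(PAGE)
for item in items:
    print(item)
```

Each expression can match zero, one or many nodes, which is why an item can have an empty list for a field.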

Want to store the crawled details as JSON or CSV?

To store the crawled details as JSON, run this command

scrapy crawl dmoz -o items.json -t json

Now you can see items.json in your project directory. For CSV, use -o items.csv -t csv instead.
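The other way to store items is the pipelines.py file mentioned earlier. A pipeline is a plain Python class with a process_item(item, spider) method. Below is a minimal sketch that writes one JSON object per line; the class name JsonLinesPipeline and the file name items.jl are my own choices for this example, and the pipeline would also have to be listed in the ITEM_PIPELINES setting in settings.py before Scrapy uses it.

```python
import json

class JsonLinesPipeline(object):
    """Write every scraped item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        # Output path; the default name is an assumption of this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w")

    def process_item(self, item, spider):
        # dict(item) works for Scrapy items as well as plain dicts.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Returning the item from process_item matters, because otherwise later pipelines would not receive it.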

Scrapy Shell

Scrapy comes with a built-in tool for debugging and testing our code, which makes the developer's job easier.

Inside the project directory run this command: scrapy shell

You will see something like this:

2013-11-27 16:23:04+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
[s] Available Scrapy objects:
[s] item {}
[s] settings <CrawlerSettings module=<module 'sample.settings' from '/home/bala/scrapycodes/sample/sample/settings.pyc'>>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser

Now let's check this

fetch("http://www.google.co.in")

2013-11-27 16:25:23+0530 [default] INFO: Spider opened
2013-11-27 16:25:29+0530 [default] DEBUG: Crawled (200) <GET http://www.google.co.in> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html itemscope="" itemtype="http://sche'>
[s] item {}
[s] request <GET http://www.google.co.in>
[s] response <200 http://www.google.co.in>  # Things to note
[s] settings <CrawlerSettings module=<module 'sample.settings' from '/home/bala/scrapycodes/sample/sample/settings.pyc'>>
[s] spider <BaseSpider 'default' at 0x3cde3d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser

hxs -> HTML XPath selector

Now we have the response of the requested URL.

Let's get the title of the page.

hxs.select("//title")

[<HtmlXPathSelector xpath='//title' data=u'<title>Google</title>'>]

It returns a list of selector elements (with HTML tags).

We want the text of the tag.

hxs.select("//title").extract()

[u'<title>Google</title>']

extract -> returns the matched elements from the page as unicode strings. Here the tags are still included.

Things to note: results are in a list, and the u prefix marks a unicode string.

To get only the text, try this:

hxs.select("//title/text()").extract()[0]
u'Google'
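One thing to watch out for: because the result is a list, [0] raises an IndexError when nothing matched. A small helper like the one below avoids that. It is just a sketch in plain Python and not part of Scrapy; the name first is made up.

```python
def first(values, default=None):
    """Return the first element of a list-like result, or default if it is empty."""
    return values[0] if values else default

print(first([u"Google"]))        # Google
print(first([], default=u""))    # prints an empty line instead of failing
```

In the shell it could be used as first(hxs.select("//title/text()").extract()).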

To learn XPath, use this: Xpath.

That's it. Happy coding.

Scrapy Beginning Tutorial

Hi to all. Today I started learning Scrapy. Here I'm going to start with Scrapy from the beginning.

What is Scrapy?

Scrapy is an application framework for crawling websites and extracting  structured data which can be used for a wide range of useful applications, like data mining, information processing etc.

Take a look at the Scrapy documentation for more information.

Scrapy is written in Python, so you need some knowledge of Python to work with Scrapy.

For those who are beginners in Python, I would suggest these books: "A Byte of Python" and "Learn Python the Hard Way", or "Dive Into Python".

For those who already know Python, remember this: Scrapy (as of 0.18, the version used here) supports Python 2.6 and 2.7. It doesn't support Python 3.

Let's see the installation.

Prerequisites:

  • Python 2.7
  • lxml
  • OpenSSL
  • pip or easy_install (Python package managers)

To install Scrapy, open a terminal (Ctrl+Alt+T) and type

$ pip install Scrapy

or

$ easy_install Scrapy

After installation, type $ scrapy

Scrapy 0.18.4 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

It shows the installed Scrapy version and other details.
That's it. From my next post we will get started with coding.

Happy times!!! Thanks!!!

 

The beginning of Django !!!

Hi all. It has been a long time for me; I had actually stopped learning Python. My first job was in Android. While I was working in Android app development, I felt something was missing, and moreover I wanted to continue learning Python and my other interests. Yes, the time has come for me. Now I have a job in my favorite language. This is the time I can learn the language I'm interested in and make use of this opportunity. So friends, I will try to explain Django (a web framework) from scratch in my upcoming posts.

Happy coding !!!!!

How to format pendrive or Memorycard or USB in ubuntu by Terminal

Hi all, today I tried to format my pen drive with the file manager. But sometimes it fails to format, so we have to format it from the terminal.

First plug in your device (USB drive, pen drive or memory card).

Open a terminal (Ctrl+Alt+T)

Type the command $ dmesg | tail

It will display something like this

[ 1749.108796] sd 9:0:0:0: [sdb] Write Protect is off
[ 1749.108813] sd 9:0:0:0: [sdb] Mode Sense: 43 00 00 00
[ 1749.109803] sd 9:0:0:0: [sdb] No Caching mode page present
[ 1749.109816] sd 9:0:0:0: [sdb] Assuming drive cache: write through
[ 1749.116895] sd 9:0:0:0: [sdb] No Caching mode page present
[ 1749.116906] sd 9:0:0:0: [sdb] Assuming drive cache: write through
[ 1749.118911]  sdb: sdb1    --> note this line: sdb1 is the partition name
[ 1749.122785] sd 9:0:0:0: [sdb] No Caching mode page present
[ 1749.122800] sd 9:0:0:0: [sdb] Assuming drive cache: write through
[ 1749.123771] sd 9:0:0:0: [sdb] Attached SCSI removable disk
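If you would rather not read the partition name off the screen, it can be picked out with a small regular expression. The sketch below works on a sample copied from the output above; the function name find_partitions is made up, and the pattern assumes the kernel log line has the same shape as on my machine.

```python
import re

# Sample lines in the same shape as the dmesg output above.
DMESG_TAIL = """\
[ 1749.116906] sd 9:0:0:0: [sdb] Assuming drive cache: write through
[ 1749.118911]  sdb: sdb1
[ 1749.123771] sd 9:0:0:0: [sdb] Attached SCSI removable disk
"""

# Matches lines such as "[ 1749.118911]  sdb: sdb1": a disk name, then its partitions.
PARTITION_LINE = re.compile(r"^\[\s*\d+\.\d+\]\s+(sd[a-z]+):((?:\s+sd[a-z]+\d+)+)\s*$")

def find_partitions(text):
    """Return /dev paths for the partitions announced in the kernel log text."""
    found = []
    for line in text.splitlines():
        match = PARTITION_LINE.match(line)
        if match:
            found.extend("/dev/" + name for name in match.group(2).split())
    return found

print(find_partitions(DMESG_TAIL))   # ['/dev/sdb1']
```

Always double-check the result by eye before formatting, because the wrong device name would erase the wrong disk.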

Then unmount your device with

$ sudo umount /dev/sdb1

Finally, enter this command to format your device (-n sets the volume label). Double-check the device name first, because formatting erases everything on it.

$ sudo mkfs.vfat -n 'Ubuntu' -I /dev/sdb1

That's it. Now unplug your device and plug it in again. Your device is now formatted and ready to use.

Happy coding!!!!

How to fix subtitles that are delayed or early in your movies with Python code

Hi all, today I was watching a movie with subtitles. The subtitles were delayed, so they didn't match the scenes of the movie.

I solved this issue with Python code. Subtitles are in .srt format, e.g. Pirates of the Caribbean -The Curse of the Black Pearl(2003).srt

I found that my .srt file was delayed by 2 minutes.

Library that I used: pysrt

Open your terminal (Ctrl+Alt+T)

Type python to enter the Python interpreter

import pysrt

If you get an error like "No module named pysrt", then you need to install pysrt. To install pysrt: sudo easy_install pysrt

For Python 3 users, it's available as pysrt3.

If you don't get any error, then you already have this library.

Now, if you want to shift the timing in an .srt file, open it and do like this

>>> subs = pysrt.open('path to ur location/filename.srt')
>>> subs.shift(seconds=-2) # Move all subs 2 seconds earlier
>>> subs.shift(minutes=1)  # Move all subs 1 minute later

Finally, save the file from your terminal with

>>> subs.save('path to ur location/newfilename.srt', encoding='utf-8')

My entire code, which solved the delay in my .srt file

#!/usr/bin/python
import pysrt

# Load the subtitle file
subs = pysrt.open("/home/bala/Pirates of the Caribbean -The Curse of the Black Pearl(2003).srt")
subs.shift(minutes=-2)  # Move all subs 2 minutes earlier
subs.save('/home/bala/new.srt', encoding='utf-8')  # Saves the result as new.srt in your home directory

This solved my problem.
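If you are curious what a shift does to each timestamp, here is a rough sketch with only the standard library. It is not pysrt's code; the helper names are made up, and clamping negative results to zero is an assumption of this sketch.

```python
from datetime import timedelta

def parse_timestamp(text):
    """Convert an .srt timestamp 'HH:MM:SS,mmm' into a timedelta."""
    clock, millis = text.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds,
                     milliseconds=int(millis))

def format_timestamp(delta):
    """Convert a timedelta back into 'HH:MM:SS,mmm', clamping negatives to zero."""
    total_ms = max(0, delta // timedelta(milliseconds=1))
    hours, rest = divmod(total_ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    seconds, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)

def shift(text, **offset):
    """Shift one timestamp by an offset such as minutes=-2 or seconds=5."""
    return format_timestamp(parse_timestamp(text) + timedelta(**offset))

print(shift("00:03:10,500", minutes=-2))   # 00:01:10,500
```

A subtitle file has a start and an end timestamp for every entry, and a shift applies the same offset to all of them.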
That's it. Happy coding!!!!

Android Splash Screen Activity

Hi to all, today I'm going to share about the Android splash screen activity.

Most applications now come with a splash screen. Before going into the session, make sure you have set up the development environment for Android.

The development tools needed for Android are

1. Eclipse

2. Java Runtime Environment

3. Android ADT and Android SDK

4. Android API

You can also develop applications in Android Studio. It's up to you. I'm not going to cover installation and setting up the environment; you can find a lot of good tutorials for that.

In this session I'm developing the app in Eclipse.

Open Eclipse and create a new Android Application Project.

The project name I gave here is FreeTamilEbooks

The package name for me here is com.ebook.freetamilebooks

Once you have created the project in Eclipse, you can see the file structure like this

Screenshot from 2013-10-01 21:02:16

In the src folder you can see the Java files.

In the layout folder you can see the XML files.

Things to remember here:

The UI part of the application is done in XML in the layout folder.

The coding part is done in Java in the src folder.

Confusing? Let me clear this up.

By UI I mean, for example, a button on a screen.

To create a button on a screen you use XML in Android (it can also be created dynamically in Java).

Actions like clicking a button have to be coded in Java.

I hope you got some idea about this.

Now open the Java file in the src folder. My Java code, SplashScreen.java

package com.ebook.freetamilebooks;

import android.os.Bundle;
import android.os.Handler;
import android.app.Activity;
import android.content.Intent;
import android.view.Menu;
import android.view.Window;

public class SplashScreen extends Activity {

	@Override
	protected void onCreate(Bundle savedInstanceState) {
		super.onCreate(savedInstanceState);
		// Hide the title bar; this must be called before setContentView()
		requestWindowFeature(Window.FEATURE_NO_TITLE);
		setContentView(R.layout.activity_splash_screen);
		// Show the splash screen for 3000 ms, then open Home and close this activity
		Handler handler = new Handler();
		handler.postDelayed(new Runnable() {
			@Override
			public void run() {
				Intent intent = new Intent(SplashScreen.this, Home.class);
				startActivity(intent);
				SplashScreen.this.finish();
			}
		}, 3000);
	}

	@Override
	public boolean onCreateOptionsMenu(Menu menu) {
		// Inflate the menu; this adds items to the action bar if it is present.
		getMenuInflater().inflate(R.menu.splash_screen, menu);
		return true;
	}

}

Java code for the second screen, Home.java

package com.ebook.freetamilebooks;

import android.app.Activity;
import android.os.Bundle;

public class Home extends Activity {
	@Override
	protected void onCreate(Bundle savedInstanceState) {
		// TODO Auto-generated method stub
		super.onCreate(savedInstanceState);
		setContentView(R.layout.home);
	}

}

My XML files are
activity_splash_screen.xml

<RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:background="@drawable/splash"
    tools:context=".SplashScreen" >
</RelativeLayout>

my home.xml

<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:orientation="vertical" >
    <TextView 
        android:id="@+id/welcome"
        android:layout_width="fill_parent"
         android:layout_height="fill_parent"
        android:text="Welcome to home "/>
</LinearLayout>

Finally, one important thing to do is with AndroidManifest.xml

This XML file is like a guide map. When our app is installed and launched, the system reads this file first to learn about the activities and other necessary things.

Android uses the DVM (Dalvik Virtual Machine) to run apps. The Java files are compiled to Java bytecode by the Java compiler, converted to Dalvik bytecode, and packaged together with the XML resources into the .apk file.

By default your very first activity is specified in this file. If you have more than one activity, you have to declare every activity in this file. Otherwise your application will crash when it tries to start an undeclared activity.
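The rule described above can be checked with a few lines of Python, which I use here only as a quick inspection tool outside the Android project. This is a sketch: the manifest text is a trimmed, hypothetical copy in the same shape as this project's manifest, and the function names are made up.

```python
import xml.etree.ElementTree as ET

# Attributes such as android:name live in this XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Trimmed sample manifest (an assumption for this sketch).
MANIFEST = """\
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.ebook.freetamilebooks">
    <application>
        <activity android:name="com.ebook.freetamilebooks.SplashScreen">
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />
                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
        <activity android:name="com.ebook.freetamilebooks.Home" />
    </application>
</manifest>
"""

def declared_activities(manifest_xml):
    """Return the android:name of every <activity> in document order."""
    root = ET.fromstring(manifest_xml)
    return [a.get(ANDROID_NS + "name") for a in root.iter("activity")]

def launcher_activity(manifest_xml):
    """Return the activity whose intent filter has MAIN and LAUNCHER, or None."""
    root = ET.fromstring(manifest_xml)
    for activity in root.iter("activity"):
        for intent_filter in activity.findall("intent-filter"):
            actions = {e.get(ANDROID_NS + "name") for e in intent_filter.findall("action")}
            categories = {e.get(ANDROID_NS + "name") for e in intent_filter.findall("category")}
            if ("android.intent.action.MAIN" in actions
                    and "android.intent.category.LAUNCHER" in categories):
                return activity.get(ANDROID_NS + "name")
    return None

print(declared_activities(MANIFEST))
print(launcher_activity(MANIFEST))
```

If an activity class from the src folder is missing from the first list, that is the one to add to the manifest.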

AndroidManifest.xml

<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.ebook.freetamilebooks"
    android:versionCode="1"
    android:versionName="1.0" >

    <uses-sdk
        android:minSdkVersion="8"
        android:targetSdkVersion="18" />

    <application
        android:allowBackup="true"
        android:icon="@drawable/ic_launcher"
        android:label="@string/app_name"
        android:theme="@style/AppTheme" >
        <activity
            android:name="com.ebook.freetamilebooks.SplashScreen"
            android:label="@string/app_name" >
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />

                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
        <activity
            android:name="com.ebook.freetamilebooks.Home"
            android:label="@string/app_name" >
        </activity>
    </application>

</manifest>

One last thing to do for the splash screen: we need an image to show on the very first screen of our app.

So choose your image. It should be in .png format.
Create a folder named drawable under the /res directory in Eclipse and put the image there.

Here I named the image splash.png. I used that image in activity_splash_screen.xml like this
android:background="@drawable/splash"

You can give any name to your image, but make sure you use the same name in the XML file too.

That's it. Now save your files and press Ctrl+F11

After Home is launched in the emulator, you can see the page like this

Screenshot from 2013-10-01 23:02:00

Thats it. Thanks for reading. Happy coding!!!

Android User Interface Design Basics

Hi to all, today I'm going to share about Android UI (User Interface) design.

UI:

Your UI plays a very important role in an Android application. The UI is what the user sees and uses to interact with your app.

Android provides some pre-built UI components such as structured layout objects.

What is a Layout?

A layout defines what is drawn on your screen. In other words, it defines the visual structure of a user interface.

What are the Layouts in Android?

  • Linear Layout
  • Relative Layout
  • Table Layout
  • Frame Layout
  • Grid View
  • List View
  • Scroll View 

Linear Layout:

Linear Layout arranges all its children in a single direction, either "vertical" or "horizontal". We can choose the orientation with android:orientation="vertical" or android:orientation="horizontal".

[Image: linear layout]
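To picture what the orientation does, here is a toy model in Python. It is only an illustration of the idea and not Android code: it ignores margins, padding, weights and gravity, and the function name linear_layout is made up.

```python
def linear_layout(children, orientation="vertical"):
    """Place (width, height) children one after another; return their (x, y) offsets."""
    if orientation not in ("vertical", "horizontal"):
        raise ValueError("orientation must be 'vertical' or 'horizontal'")
    positions = []
    x = y = 0
    for width, height in children:
        positions.append((x, y))
        if orientation == "vertical":
            y += height   # next child goes below this one
        else:
            x += width    # next child goes to the right of this one
    return positions

buttons = [(100, 40), (100, 40), (100, 40)]
print(linear_layout(buttons, "vertical"))     # [(0, 0), (0, 40), (0, 80)]
print(linear_layout(buttons, "horizontal"))   # [(0, 0), (100, 0), (200, 0)]
```

Only one coordinate ever changes, which is the "single direction" part of Linear Layout.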

Relative Layout:

Relative Layout displays child views in relative positions. A view's position can be specified relative to its siblings (for example to the left or right of a sibling, or below a particular sibling) or relative to the parent (for example centered).

Getting confused?

Let me make it clear with a sample design.

Take the example of the Gmail login page.

[Image: relative layout login example]

The Username label is to the left of its input field (we call this field an EditText in Android).

Similarly, the Password label is to the left of another field and also below the Username field.

Here the Sign in button is in the center of the screen.

Now I think you have a clear idea of how to use Relative Layout in Android.

List view:

List View is a view group that displays a list of scrollable items. Items are inserted into the list view using an Adapter. The source of an Adapter may be an array or a database query, and the Adapter puts the results into the list view.

We will look at List View in my upcoming posts.

Grid View:

Grid View is a view group in Android that displays items in a two-dimensional, scrollable grid. Items are inserted into the Grid View using a ListAdapter. See the image for Grid View below.

[Image: grid view]

Scroll View:

Scroll View allows the user to scroll the page or screen in Android. One of the most important things to remember about Scroll View is that it can have only one direct child.

We will look at Scroll View in my upcoming posts.

Table Layout:

A layout that arranges its children into rows and columns. A calculator app is the best example of Table Layout in Android.

Hope you liked it.. Thanks for reading…

Android Basics

Hi to all. I'm going to share my Android experience with you. I'm not an expert in Android, I'm just an Android application developer. I will try to explain everything in my posts so that it is easy for a beginner to understand.

Before going into the session, I want you to have some basic knowledge of Java or OOP, because in Android we are going to code in Java.

What is Android?

Android is a Linux-based operating system primarily designed for mobile devices such as smartphones and tablets.

Basic Components of Android:

  • ACTIVITY
  • SERVICES
  • INTENT
  • CONTENT PROVIDER

ACTIVITY:

An Activity is the building block of the UI (User Interface). Almost all activities interact with the user.

INTENT:

Intents are system messages that run around inside the device and notify applications of events, e.g. a new message arrived or an SD card was inserted. You can not only respond to intents, you can also create your own intents to launch other activities. I will explain this in my upcoming posts.

CONTENT PROVIDER:

A Content Provider provides a level of abstraction that lets our application data be accessed by multiple applications. In short, it makes our application data available to other applications. You can create your own Content Provider. I will explain this later. Android has some default Content Provider classes, e.g. the Contacts Content Provider:

import android.provider.ContactsContract.Contacts;

By importing this you can access your contacts (note: the read and write contacts permissions must be declared in order to use this in your application).

SERVICES:

Activities, intents and content providers are short-lived and can be shut down at any time. Services, however, are long-lived and run in the background.

E.g. you are playing songs in the music player on your mobile. Do you stay in the music player to listen to the songs? After starting a song we move to some other screen, but the player continues to play even though we are not on that screen. This is called a service in Android.

Thanks for reading ..

How to install/update firefox in ubuntu

Hi to all, today I updated my Firefox. It's so easy to update in Ubuntu.

Here are the steps.

Open a terminal and type

sudo add-apt-repository ppa:ubuntu-mozilla-security/ppa

sudo apt-get update

sudo apt-get install firefox

That's it. You have successfully installed/updated Firefox on your system. To check

$ firefox -v

Screenshot from 2013-01-23 09:16:23

This will show the version of Firefox you installed/updated.

 

Thanks …:)

 

 

 

 

Nice open source Applications created in Ruby on Rails

Hi, today I looked at some applications which are all created in Ruby on Rails.

Here are a few applications created with Ruby on Rails

1) Notes

Notes is an open source (Reciprocal Public License) Ruby on Rails application designed to be a simple and fast to-do and notes manager.

2)Tracks

A to-do-list manager with a clean interface. Tracks lets you categorize, prioritize, schedule & star items for better usability.

The interface is Ajax-based & many tasks are done by drag and drop. The application has multi-user support & comes with a built-in web server for users who want to install it on their own computers.

3) Mephisto

A widely used blogging engine that has ready-to-use plug-ins..

It uses Liquid templates for creating & editing themes. Mephisto also has a built-in caching system for faster loading.

4) Gallery

A simple but functional photo gallery built with Ruby on Rails.

It uses the lightweight mini_magick to resize photos, reducing the memory requirements of the full RMagick suite.

Photos can be sorted by drag’n drops, captions can be edited easily & the look/feel can be customized via CSS.

5) Spree

Spree is a highly extensible & customizable e-commerce application. Developers can easily override existing views, provide new ones or provide additional models, migrations and controllers

6) Ecompages

EcomPages is an e-commerce application which has most of the basic features of an e-store.

It has a good looking admin interface that makes managing products & orders easier.

The application is still being developed & is not feature-rich but can be a good base to start with & improve further.

7)  EchoWaves

This is a group chat social network application. You can start conversations and connect with other users while discussing.

It is also possible to make a conversation read-only in order to present content.

8) Communityengine

A plugin for Ruby on Rails applications that instantly adds the features of a social networking website.

Some great features include:

  • Authentication (sign up, log in), user search & user profiles
  • Blogs with tagging, categories and rich text editing
  • Photo uploading and tagging
  • Commenting, forums, friendship, activity feeds & more

9) Rubyurl

RubyURL is an online tool for converting long website addresses into short ones.

Besides the standard form input, it is possible to create short URLs via the REST API which supports both JSON & XML requests.

10) Mailr

Mailr is an open source webmail application that can be used with any IMAP server.

E-mails can be created both in HTML & plain text. E-mail addresses in the contact list are Ajax-auto-completed just like Gmail & more.

11)  Warehouse

This is a beautiful web-based subversion browser built with Ruby on Rails.

Multiple repositories can be managed from the same interface. Also, it is possible to add any number of users with different permissions.

12) typo

A Ruby on Rails blogging application that is developed continuously.

It comes with theming & plugins support for easier customization. Every part is planned for a better SEO, like friendly-URLs, ability to add keywords/description to every category/page & more.

Thanks for reading. Enjoy the framework with Rails …:)