Dictionary in Python3

Hi all. Today we are going to look at dictionaries in Python 3.

In Python 2.7:

d= {"a": "b", "b": "c", "c": "d", "d": "e", "e": "f"}

d.keys() —> [‘a’, ‘c’, ‘b’, ‘e’, ‘d’]

d.values() —> [‘b’, ‘d’, ‘c’, ‘f’, ‘e’]

Here d.keys() returns list of keys, d.values() returns list of values respectively.

In Python3 . These functions return dictkeys and dict values

d= {"a": "b", "b": "c", "c": "d", "d": "e", "e": "f"}
d.keys() ---> dict_keys(['d', 'c', 'b', 'e', 'a'])
d.values() ---> dict_values(['e', 'd', 'c', 'f', 'b'])

However, if we want the same list behaviour as in 2.7, we can wrap them in list():

list(d.keys()) ---> ['d', 'c', 'b', 'e', 'a']
list(d.values()) ---> ['e', 'd', 'c', 'f', 'b']

There is no dict.iteritems() or dict.has_key() in Python 3.

d.items() ---> dict_items([('d', 'e'), ('c', 'd'), ('b', 'c'), ('e', 'f'), ('a', 'b')])
list(d.items()) ---> [('d', 'e'), ('c', 'd'), ('b', 'c'), ('e', 'f'), ('a', 'b')]
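Since iteritems() is gone, items() is how you loop over key/value pairs in Python 3. A quick sketch using the dict above:

for key, value in d.items():
    print(key, value)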

Instead of has_key(), we can check key membership like this:

'a' in d.keys() --> True
'z' in d.keys() --> False

If the key doesn't exist, it returns False. (Writing 'a' in d without .keys() also works and is the more idiomatic form.)

 

d.pop('a') --> removes the item with key 'a' from the dict and returns its value

d.popitem() --> pops the most recently inserted item from the dict

d.clear() --> clears the entire dictionary

d.setdefault('z', 10) --> returns the value of 'z', inserting 10 first if 'z' is not already present

d.get('z') --> 10  # gets the value of key 'z'; returns None if the key doesn't exist
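get() also accepts a fallback value, which is handy when a key may be missing. A small sketch:

d.get('missing')      # returns None
d.get('missing', 0)   # returns 0, the supplied default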

 


How to get video file duration and resolution in Python

Hi all,

Today I needed to find a video file's duration, resolution and a few other details in Python.

On Debian-based systems, type this command in a terminal:

avconv

If this shows a version, like this:

avconv version 9.18-6:9.18-0ubuntu0.14.04.1, Copyright (c) 2000-2014 the Libav developers
built on Mar 16 2015 13:19:10 with gcc 4.8 (Ubuntu 4.8.2-19ubuntu1)
Hyper fast Audio and Video encoder
usage: avconv [options] [[infile options] -i infile]… {[outfile options] outfile}…

Use -h to get full help or, even better, run ‘man avconv’

If it says command or package not found, install it with:

sudo apt-get install libav-tools

After successful installation, run this in a terminal:

avconv -i "filepath"  # give path to your video file

You will see output like this
Metadata:
    major_brand     : mp42
    minor_version   : 1
    compatible_brands: mp42mp41
    creation_time   : 2015-10-21 16:41:08
  Duration: 00:00:30.00, start: 0.000000, bitrate: 748 kb/s
    Stream #0.0(eng): Video: h264 (Constrained Baseline), yuv420p, 480x240, 488 kb/s, 25 fps, 25 tbr, 25 tbn
    Metadata:
      creation_time   : 2015-10-21 16:41:08
    Stream #0.1(eng): Audio: aac, 48000 Hz, stereo, fltp, 255 kb/s
    Metadata:
      creation_time   : 2015-10-21 16:41:08

You can see the resolution and duration in the output above.

Now let's see how to use this in Python to get the video resolution and duration.

The idea is simple: run avconv with subprocess and parse the piped output.

Here is the Python code:


from subprocess import Popen, PIPE
import re

def getvideodetails(filepath):
    # avconv writes the media info to stderr, so capture both streams
    cmd = "avconv -i %s" % filepath
    p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    stdout, stderr = p.communicate()
    output = (stdout + stderr).decode("utf-8", "replace")
    duration = resolution = None
    for line in output.splitlines():
        if "Duration" in line:
            duration = re.findall(r"Duration: (\d+:\d+:[\d.]+)", line)[0]
        if "Video" in line:
            resolution = re.findall(r"(\d+x\d+)", line)[0]
    return duration, resolution

# call the function with the path to your video file
getvideodetails("filepath")
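Based on the avconv output shown earlier, this would return something like ('00:00:30.00', '480x240').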

This code can be found in gist too here

That's it.

Happy coding!!!
Happy times !!!

MongoDB Basics

Hi all. Today we are going to look at MongoDB basics.

What is it?

MongoDB is a NoSQL, document-oriented database (a document is close to a dictionary in Python or a hash in Ruby). You can read about it in the official documentation.

What is it used for?

It's used when you don't want to worry about a fixed schema. With MySQL we have to declare an id field as an integer, a name field as varchar and so on; here that's not needed.

Installation:

Open a terminal (Ctrl+Alt+T):

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

sudo apt-get update

sudo apt-get install -y mongodb-org

Hope you have successfully installed mongo in your system.

If you're having trouble, see the official site for troubleshooting here.

Time to play with it:

Type mongo in your shell and press Enter.

You will see something like this:

MongoDB shell version: 2.4.9
connecting to: test

>

To see all databases:

show dbs

It lists all databases. This is the equivalent of SHOW DATABASES in MySQL.

To select a particular database:

use databasename

To see all collections inside a database:

Note: in MongoDB, collections correspond to tables. We don't call them tables; they are collections here.

show collections

This is the equivalent of SHOW TABLES in MySQL.

Let's create a database called bala.

use bala

Unlike MySQL, it won't throw an 'unknown database' error.

To check which db you're in, issue the db command in the mongo shell.

It shows bala.

Now create a document in it:

db.mycollection.insert({"name": "bala"})

What did I do here?

db => bala

mycollection => our collection name

insert => operation name

{"name": "bala"} => the data we are inserting into our collection

To see that the document was inserted into our db, type this command:

> db.mycollection.find()
{ "_id" : ObjectId("55a6070cab3e0cf5c31434cf"), "name" : "bala" }
>

See, our document is inserted. find() returns all documents in a collection (table).

Have you noticed the _id in the above output? That's automatically created by Mongo. It's alphanumeric, similar to the id primary key in MySQL.

To see just one document in a collection:

> db.mycollection.findOne()
{ "_id" : ObjectId("55a6070cab3e0cf5c31434cf"), "name" : "bala" }
>

Now try inserting another document:

db.mycollection.insert({"name": "hari", "country": "india", "state": "Haryana", "age": 20})

What did we do?

We gave extra keys: country, state and age. What happened?

As I said, it's a schemaless database; it doesn't care about the schema.

Now let's go and check it out:

> db.mycollection.find()
{ "_id" : ObjectId("55a6070cab3e0cf5c31434cf"), "name" : "bala" }
{ "_id" : ObjectId("55a60976ab3e0cf5c31434d0"), "name" : "hari", "country" : "india", "state" : "Haryana", "age" : 20 }
>

See, it shows two documents in our collection. How cool is that?

To find documents by a condition (by name, age or any field):

Let's say I want to see all documents which have the name "hari":

> db.mycollection.find({"name": "hari"})
{ "_id" : ObjectId("55a60976ab3e0cf5c31434d0"), "name" : "hari", "country" : "india", "state" : "Haryana", "age" : 20 }
>

Here I have only one document with the name "hari"; if I had many, it would show all of them.

We can also pick out particular fields, with or without a condition.

Let's say I want to see only the names in my collection:

> db.mycollection.find({}, {"name": 1})
{ "_id" : ObjectId("55a6070cab3e0cf5c31434cf"), "name" : "bala" }
{ "_id" : ObjectId("55a60976ab3e0cf5c31434d0"), "name" : "hari" }
>

{} => an empty dict to select all documents

"name": 1 => select only the name field from all documents

We can include or exclude as many fields as we want:

fieldname: 1 => includes the field in the result.

fieldname: 0 => excludes the field from the result.

Here, instead of {}, we can also give a condition, as shown below.

Try those commands out too.
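For instance, a small sketch using the documents we inserted above, combining a condition with a projection (the exact output depends on your data):

> db.mycollection.find({"age": 20}, {"name": 1, "_id": 0})
{ "name" : "hari" }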

Let's update the document named "bala", adding an age field to it:

> db.mycollection.update({"name": "bala"}, {$set: {"age": 20}})

$set is used to set or update a field.

Let's see whether it's updated:

> db.mycollection.find({"name": "bala"})
{ "_id" : ObjectId("55a6070cab3e0cf5c31434cf"), "age" : 20, "name" : "bala" }
>

It's updated; the age field was added to it. Note that if we had many documents named "bala", the above command would by default update only the first matching document.
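To update all matching documents, pass the multi option; a quick sketch for this version of the shell:

> db.mycollection.update({"name": "bala"}, {$set: {"age": 20}}, {multi: true})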

Till now we have seen creation, insertion and updating.

Let’s move on to deletion of a document

To delete a document in a collection:

> db.mycollection.remove({"name": "bala"})
> db.mycollection.find({"name": "bala"})
>

The first command deletes the document; the second one verifies that we no longer have a document named "bala".

To remove all documents in a collection:

db.mycollection.remove()

This command removes all documents in the collection.

That's all about MongoDB basics (CRUD operations). Try playing with it; we will look at advanced commands and integration with a language (Python or Ruby) later.

 

Thanks for reading. Happy hacking.

String Module In Python3

Hi all. After a long time I'm writing a post. I finally started learning Python internals and playing with the CPython source code a little bit.

It's actually nice to read the source code of CPython. To get a copy of the source code, follow the CPython repository.

Hope you have a running copy of CPython. Inside the source directory, run ./python

I have version 3.5.0a0, like this:

Python 3.5.0a0 (default:5754f069b123, Dec 13 2014, 00:41:29)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Here I'm not going to talk about Python 3 in general. Since I started coding in it and am slowly moving from Python 2 to Python 3, I will explain it module by module.

Let's start with the string module. Inside the shell:

import string

s = "this is my test string"

print(string.capwords(s))

This Is My Test String

It capitalizes the first letter of every word in the given input string. The actual source code for capwords is:

def capwords(s, sep=None):
    return (sep or ' ').join(x.capitalize() for x in s.split(sep))

You can check it in the file cpython/Lib/string.py.

It just splits the string into words, capitalizes each one and joins them back together for the result.

If you didn't download the source code and are just working in IPython, bpython or the plain Python shell, don't worry: there is another option to see the source code. And not only for this module; you can see the source of any pure Python module this way.

Let's see that:

import inspect

inspect.getsource(string.capwords)

'def capwords(s, sep=None):\n    """capwords(s [,sep]) -> string\n\n    Split the argument into words using split, capitalize each\n    word using capitalize, and join the capitalized words using\n    join.  If the optional second argument sep is absent or None,\n    runs of whitespace characters are replaced by a single space\n    and leading and trailing whitespace are removed, otherwise\n    sep is used to split and join the words.\n\n    """\n    return (sep or \' \').join(x.capitalize() for x in s.split(sep))\n'

See, it returns the source code as a string. How cool is this inspect module? It has many more useful functions; use it.

Now coming back to our string module:

string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

string.ascii_lowercase

'abcdefghijklmnopqrstuvwxyz'

string.ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

I used to replace characters in strings like this:

k = "this is a test replace with b"

k.replace('a', 'b')

I found this is less efficient than this approach:

test = str.maketrans('a', 'b')

print(k.translate(test))

'this is b test replbce with b'

Both do the same job here, but maketrans/translate does it more efficiently: it just creates a mapping table and translates.

I tested with timeit; the second one takes less time than the first.

%timeit k.replace('a', 'b')
10000000 loops, best of 3: 167 ns per loop

%timeit k.translate(test)
10000000 loops, best of 3: 143 ns per loop
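maketrans also accepts a mapping, which is handy when you want to translate several characters in one pass. A small sketch (the example string and mapping are made up for illustration):

table = str.maketrans({"a": "4", "e": "3", "o": "0"})

print("a test of more code".translate(table))

'4 t3st 0f m0r3 c0d3'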

If you noticed above, I used str.maketrans(from, to).

In Python 2.7 you can use it like this instead:

test = string.maketrans('a', 'b')

print k.translate(test)

In Python 3, maketrans() is no longer a function of the string module; it is a static method of str (and bytes).

That is because in Python 3 strings are not bytes; strings are Unicode.

In Python 2, bytes was an alias for str.

You can check this in your shell

bytes is str   (In python2.7 shell)

True

bytes is str (In python3.5.0a0)

False

Hope you got something out of this. Happy coding. Thanks for reading. Feel free to share your suggestions.

 

Nginx to serve static files example

Hi all. Today we'll see how to use Nginx to serve static files.

What is Nginx?

Nginx is high-performance web server software. It's more flexible and lightweight than Apache, and it is very commonly used as a reverse proxy.

Wait a minute, what is meant by a reverse proxy?

Let's say x is a client and y is a server.

x->y

x sends a request straight to the server [a direct connection].

In a reverse proxy setup:

x sends a request, but the request first goes to an intermediate proxy; that proxy passes the request on to the server and relays the server's response back to the client. The client may not even be aware of this.

In short, requests are not processed directly by the server; some intermediate guy processes the requests and responses.

x->z->y   [here z is that intermediate guy]

Got it?

Let's see the installation of Nginx:

1. Open a terminal (Ctrl+Alt+T)

2. sudo apt-get install nginx

Let's play with Nginx now.

To play, let's create a directory:

mkdir /cache

cd /cache/

chmod 777 /cache/   

Let's create some dummy files in this directory:

cat >> hi.txt

hi this is test file

this is line

Ctrl+D  [press Ctrl+D to finish and save the file]

cat >> second.txt

this is my second file

this is testing second file

Press Ctrl+D to save this file.

Now it's time to create a sample configuration file to serve this directory.

Before we write it, it's useful to know about two important directories in Nginx:

1. sites-enabled

cd /etc/nginx/sites-enabled/  

This directory contains symlinks to the files in the sites-available directory.

[We can create files here directly, but don't; it's not good practice.]

2. sites-available

cd /etc/nginx/sites-available/

This is where all the site config files for Nginx live.

Make sure you're inside the sites-available directory, otherwise

do cd /etc/nginx/sites-available/

touch sample.conf

Open this sample.conf in your favorite editor and type the code below:

server {
    listen                *:9000;
    server_name          _;
    access_log            /var/log/sample.access.log;  # this line is optional for log purpose
    error_log             /var/log/sample.error.log;  # this line is optional

    location /test/ {
        alias /cache/;
        autoindex on; 
        expires off;
        default_type application/octet-stream;
   }
}

Let's go through it:

listen *:9000 --> the server listens on port 9000.

server_name -- optional [you can give your domain name, e.g. www.example.com].

The next two lines log who accesses the files and any error details.

location /test/ { --> tells Nginx to serve this location.

alias /cache/ --> the /cache directory can be accessed via /test/.

autoindex on --> lists all the files inside the directory [not recommended in production, because users will see every file in your directory].

expires off --> disables the caching (Expires) headers.

default_type application/octet-stream --> with this set, whenever the user visits a file it is served as a binary download [the browser asks whether to open or save it]; it is never rendered in the browser.

If you want to just view files in the browser instead of downloading them,

just comment out that last line containing octet-stream.

Save this file now.

Now let's create a symlink in sites-enabled.

It will be like this:

sudo ln -s   /etc/nginx/sites-available/sample.conf    /etc/nginx/sites-enabled/

Note: always use an absolute path. Don't use a relative path like sudo ln -s sample.conf /etc/nginx/sites-enabled/ -- that will not work.

OK, let's restart the Nginx server:

sudo service nginx restart

Open your favorite browser and visit localhost:9000/test/

You will see something like this:

[Screenshot: Nginx autoindex directory listing]

Click on any file [note: I commented out the octet-stream line].

[Screenshot: file content viewed in the browser]

 

If you want the file to be downloaded instead, uncomment that last line. Your browser will then ask whether the file should be opened or downloaded.
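You can also sanity-check from the terminal without a browser; assuming the config above is active, something like this should print the file contents along with the response headers:

curl -i http://localhost:9000/test/hi.txt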

 

That's it. Play with it. Happy coding! :)

Start Using TMUX

Hi all. Until a few days ago I used screen on my remote server. Now I have switched from screen to tmux. It has so many advantages and features compared to screen.

What is tmux?

Tmux is a terminal multiplexer: it multiplexes several virtual consoles, allowing us to use multiple terminal sessions inside a single terminal window.

Installation

sudo apt-get install tmux

Let's start using and playing with tmux.

To start a new session, open a terminal and type:

tmux

To attach to a session:

tmux attach-session -t [session name]
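If you start the session with a name, re-attaching later is easier. A quick sketch (the session name work is just an example):

tmux new -s work                # start a named session
tmux attach-session -t work     # re-attach to it later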

To split the window into panes (two terminal sessions in the same window):

Ctrl+b %  (splits the window vertically)

Ctrl+b "  (splits the window horizontally)

To move from one pane to another:

Ctrl+b o

To open a new window:

Ctrl+b c

To go to the next window:

Ctrl+b n

To go to the previous window:

Ctrl+b p

To kill the current pane or window:

Ctrl+b x

To close all other panes except the current one:

Ctrl+b !

To detach from the current session:

Ctrl+b d

To list windows:

Ctrl+b w

To go to a particular window (let's say window number #):

Ctrl+b #

To go to the last active window:

Ctrl+b l

To list all sessions:

tmux list-sessions

To rename the current window:

Ctrl+b ,

This is how I'm using tmux on my remote server: in a single window I'm running three terminals. See the screenshot below.

[Screenshot: tmux demo with three panes]

Here I'm using three terminal sessions in a single window.

In the first pane, I'm running IPython.

In the second pane, I'm running glances.

In the third pane, a MySQL shell.

The commands I used to achieve this are:

tmux          - to start a new tmux session

Ctrl+b %      - split vertically

Ctrl+b "      - split the first pane horizontally

Ctrl+b o      - to switch between panes

See how useful this tool is. Start using it and play with it.

Thanks for reading. Please feel free to share your suggestions.

Happy coding .!!!!

Hello World in Django !!!!

Hi. After a long time I'm blogging, this time about Django. Let's start our journey into Django. Before reading this, I would suggest getting some knowledge of Python.

My suggestions for learning Python are:

  1. A Byte of Python (PDF available)
  2. Dive into Python
  3. Learn Python the Hard Way

What is Django?

Django is a web framework written in Python.

What is meant by a framework?

A framework is similar to a library or a set of templates already written for you; you can reuse that code to build your system (i.e. a collection of code with some control mechanism).

Still didn't get it?

Let me explain it my own way; let's forget all the technical terms. Say you want to build a new house.

What are the things you need to build a new house?

  • money
  • stones
  • cement
  • empty ground (place)
  • water
  • sand and other stuff, etc.

Now the same thing applies to building your system (project).

A framework offers everything you need; the only thing you have to do is put the pieces in the right place (database details in the proper place, your logic, the representation of your data, etc.).

Hope you understand now!

Now coming to our Django framework: it follows the MVC pattern, like Ruby on Rails.

MVC: => Model View Controller

Model: for database access (this holds the database and table details).

Views: your logic goes here.

Templates: to present your data (HTML files).

Why MVC, or what is the advantage of it?

The biggest advantage of MVC is that a change in one part doesn't affect the others.

For example, if you want to change your database details (say the db name or table details), you only need to change the model file (no find-and-replace across the entire project), because everything is loosely coupled (independent). The same applies to views and templates.

Didn't get it?

I will explain it while we work through the sample code.

To install Django, see the installation docs.

Sorry, I'm not going to cover installation. Since I'm a Linux user I only know the Linux installation; you can find plenty of sources for your operating system.

The Django documentation is good enough for any OS.

After installing, let's create a sample project.

It's good practice to create a working directory when learning any language.

Open a terminal (Ctrl+Alt+T).

Create a directory: mkdir djcodes (here djcodes is my directory name; you can give any name).

cd djcodes (change to the working directory)

To start a new Django project (I'm using Django 1.5):

django-admin startproject sample # here sample is the project name

Change directory to sample:

cd sample/

Now issue the ls command:

manage.py      sample

You have a manage.py file and a sample folder inside the sample project.

manage.py -- a command-line utility that points to the settings file of your project. (These files are created automatically by Django.)

Inside the sample folder you have 4 files:

__init__.py -- marks the directory as a Python package.

urls.py -- the file that holds the URLs of your website, e.g. http://localhost/hello.

In order to use /hello in our project, you have to declare it in urls.py.

(See my explanation below for these concepts.)

settings.py -- the file that holds all the app and database settings of your project.

(If you open this file you can see the time zone, templates and more.) You will learn more about these files in my upcoming posts.

wsgi.py -- the WSGI entry point that handles our requests and responses (used by the Django development server).

OK. Let's start the server with:

./manage.py runserver  (note: run this inside the project directory)

It will show something like this:

Validating models…

0 errors found
Django version 1.4.5, using settings 'sample.settings'
Development server is running at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
[17/Jan/2014 11:45:00] "GET / HTTP/1.1" 200 1957

Open your favorite browser and see the link   http://127.0.0.1:8000/

It worked!

Congratulations on your first Django-powered page.

Of course, you haven’t actually done any work yet. Here’s what to do next:

  • If you plan to use a database, edit the DATABASES setting in sample/settings.py.
  • Start your first app by running python manage.py startapp [appname].

You’re seeing this message because you have DEBUG = True in your Django settings file and you haven’t configured any URLs. Get to work!

This is the default page in Django.

Our goal is a Hello World page in Django.

Let's start creating our app and display hello world.

To create an app (inside the project directory):

django-admin startapp hello

This will create a hello folder in the project directory. It has four files:

__init__.py -- marks your app as a Python package.

models.py -- the file that holds your database models.

views.py -- your functions that handle requests, logic, etc.

tests.py -- for testing purposes.

Three things to do for our task:

1. Add a URL in urls.py with its associated function.

2. Write the view function for that URL in views.py.

3. Return a response (here a plain HttpResponse; an HTML template would also work).

Let's add the URL in urls.py:

gedit sample/urls.py

Add your URL like this:

from django.conf.urls import patterns, include, url
from hello.views import myfunction

# Uncomment the next two lines to enable the admin:
# from django.contrib import admin
# admin.autodiscover()

urlpatterns = patterns('',
    # Examples:
    # url(r'^$', 'sample.views.home', name='home'),
    # url(r'^sample/', include('sample.foo.urls')),

    # Uncomment the admin/doc line below to enable admin documentation:
    # url(r'^admin/doc/', include('django.contrib.admindocs.urls')),

    # Uncomment the next line to enable the admin:
    # url(r'^admin/', include(admin.site.urls)),
    url(r'hello/$',myfunction),
)

My views.py (in hello/views.py):

# Create your views here.
from django.http import HttpResponse

def myfunction(request):
    return HttpResponse("Hello")
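If you would rather render an HTML file (as step 3 above mentions), a minimal sketch could look like this. It assumes you create a template at hello/templates/hello.html and add the hello app to INSTALLED_APPS in settings.py (both are assumptions, not something we did above):

from django.shortcuts import render

def myfunction(request):
    # render the assumed hello/templates/hello.html template
    return render(request, "hello.html")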

Save your files.
Now run the server: ./manage.py runserver

Now check in the browser by visiting http://localhost:8000/hello

This will show "Hello" on the page. That's it, our goal is done.

What happens behind the scenes? I will explain it shortly.

When you type localhost:8000/hello, the browser sends a request to Django.

Django reads urls.py and looks for a pattern that matches the URL; if it matches, it calls the associated function.

In our case it matches hello in urls.py and calls myfunction in views.py.

In views.py we have the code that returns "Hello" as an HttpResponse.

That's it. Thanks for reading. Happy coding!

Crawling a Web Page with Scrapy

Hi all, today I'm going to share how to crawl a web page with Scrapy. See my previous post for the installation.

It's always best to create a working directory when learning any language or tool.

Open a terminal (Ctrl+Alt+T)

mkdir (name of the directory)

cd directoryname

To create a new project in Scrapy:

scrapy startproject sample

Now change to the project directory:

cd sample

Type the command ls:

sample   scrapy.cfg

Now cd sample

Again, ls:

__init__.py   items.py   pipelines.py   settings.py   spiders

Scrapy created these files within the sample (project name) folder.

If you already know Django or Python, you can easily understand these. If not, don't worry; I will try to explain as much as I can.

  • scrapy.cfg: the project configuration file
  • sample/: the project’s python module, you’ll later import your code from here.
  • sample/items.py: the project’s items file.
  • sample/pipelines.py: the project’s pipelines file.
  • sample/settings.py: the project’s settings file.
  • sample/spiders/: a directory where you’ll later put your spiders.
  • __init__.py: initializes the directory as a Python package.

Now it's time to get into Scrapy and crawl a page.

The things we have to do to crawl a web page are:

  • Define items in items.py
  • Write a spider to crawl a page and extract items
  • Write a pipeline in pipelines.py to store the extracted items (a minimal sketch is shown near the end of this post)

Things to do in items.py:

from scrapy.item import Item, Field

class SampleItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()  # for the title of a page
    link = Field()   # for the links on the page
    desc = Field()   # for the description of a page

Things to do in spiders/:

To create a spider, run scrapy genspider <name> <domain>, or

simply create a file yourself in the spiders folder.

mydomain.py (I created mydomain.py in the spiders folder of the project):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from sample.items import SampleItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = HtmlXPathSelector(response)
        sites = sel.select('//ul/li')
        items = []
        for site in sites:
            item = SampleItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

Let me explain what I did above.

For spiders, three things are necessary:

  • name
  • start_urls
  • a parse method

name must be unique for every spider.

start_urls is the list of URLs we want to crawl.

parse is the method that extracts data from the web page.

allowed_domains restricts crawling to pages from the given domains.

That's it. Now let's run our spider and see what happens.

To run a spider:

scrapy crawl [name of the spider]

In our code the spider name is 'dmoz':

scrapy crawl dmoz

In the terminal you will see output like this.

If your code is correct (i.e. without any errors), you will see the spider opened, crawled and closed, like below:

2013-11-27 16:09:20+0530 [dmoz] INFO: Spider opened
2013-11-27 16:09:20+0530 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-27 16:09:20+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-11-27 16:09:20+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-11-27 16:09:21+0530 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2013-11-27 16:09:21+0530 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\r\n\r\n '],
'images': [],
'link': [u'/'],
'title': [u'Top']}
2013-11-27 16:09:21+0530 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [], 'images': [], 'link': [u'/Computers/'], 'title': [u'Computers']}
2013-11-27 16:09:21+0530 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [],
'images': [],
'link': [u'/Computers/Programming/'],
'title': [u'Programming']}

Want to store the crawled details as JSON or CSV?

To store the crawled details as JSON, run this command:

scrapy crawl dmoz -o items.json -t json

Now you can see items.json in your project directory.
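The third item in the checklist earlier mentioned pipelines.py. We did not need it here because -o writes the items for us, but a minimal sketch of an item pipeline (the file name items_pipeline.jl and the behaviour are just illustrative assumptions) could look like this:

# sample/pipelines.py
import json

class SamplePipeline(object):
    def open_spider(self, spider):
        self.file = open('items_pipeline.jl', 'w')

    def process_item(self, item, spider):
        # write each scraped item as one JSON line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

To enable it, add the pipeline class to ITEM_PIPELINES in sample/settings.py (the exact format depends on your Scrapy version).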

Scrapy Shell

Scrapy comes with an inbuilt tool for debugging and testing our code, which makes a developer's job easier.

Inside the project directory, run this command: scrapy shell

You will see something like this:

2013-11-27 16:23:04+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
[s] Available Scrapy objects:
[s] item {}
[s] settings <CrawlerSettings module=<module 'sample.settings' from '/home/bala/scrapycodes/sample/sample/settings.pyc'>>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser

Now let's check this:

fetch("http://www.google.co.in")

2013-11-27 16:25:23+0530 [default] INFO: Spider opened
2013-11-27 16:25:29+0530 [default] DEBUG: Crawled (200) <GET http://www.google.co.in> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html itemscope="" itemtype="http://sche'>
[s] item {}
[s] request <GET http://www.google.co.in>
[s] response <200 http://www.google.co.in>  # Things to note
[s] settings <CrawlerSettings module=<module 'sample.settings' from '/home/bala/scrapycodes/sample/sample/settings.pyc'>>
[s] spider <BaseSpider 'default' at 0x3cde3d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser

hxs -> HTML XPath selector

Now we have the response of the requested URL.

Let's get the title of the page:

hxs.select("//title")

[<HtmlXPathSelector xpath='//title' data=u'<title>Google</title>'>]

It returns a list of selector elements (with HTML tags).

We want the text of the tag:

hxs.select("//title").extract()

[u'<title>Google</title>']

extract() -> used to extract the matched content as strings.

Things to note: the results are in a list and the u'' prefix represents Unicode.

Let's try this:

hxs.select("//title/text()").extract()[0]
u'Google'

For learning XPath, use this XPath tutorial.

That's it. Happy coding!

Scrapy Beginning Tutorial

Hi all. Today I started learning Scrapy. Here I'm going to cover Scrapy from the beginning.

What is Scrapy?

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications like data mining and information processing.

Take a look at the Scrapy documentation for more information here.

Scrapy is written in Python, hence you need some knowledge of Python to work with Scrapy.

For those who are beginners in Python, I would suggest these books: "A Byte of Python", "Learn Python the Hard Way" or "Dive into Python".

For those who already know Python, remember this: at the time of writing, Scrapy supports Python 2.6 and 2.7; it doesn't support Python 3.

Let's see the installation here.

Prerequisites:

  • Python 2.7
  • lxml
  • OpenSSL
  • pip or easy_install (Python package managers)

To install Scrapy, open a terminal (Ctrl+Alt+T) and type:

$ pip install Scrapy

or

$ easy_install Scrapy

After installation, type $ scrapy

Scrapy 0.18.4 – no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

It shows the installed Scrapy version and other details.
That's it. In my next post we will get started with the coding.

Happy Times !!! Thanks !!!!!!!!

 

The beginning of Django !!!

Hi all. It has been a long time for me; I had actually stopped learning Python. My first job was in Android, and while working in Android app development I felt something was missing; moreover, I wanted to continue learning Python and my other interests. And yes, the time came: now I have a job in my favorite language. This is the time I can learn what interests me and make the most of this opportunity. So friends, I will try to explain Django (a web framework) from scratch in my upcoming posts.

Happy coding !!!!!