Scrapy Cheat Sheet



The best Python project structure


Scrapy is a framework for building web crawlers and includes an API that can be used directly from a python script. The framework includes many components and options that manage the details of requesting pages from websites and collecting and storing the desired data.

There are multiple ways to structure a Python project well, so I can only show the best way I have found so far.



You can find an example template with all the files and folders on this repository:



The files structure:

APP.py

  • README.md
  • requirements.txt
  • license.txt
  • /Data
    • config.json
    • config_test.json
    • /Files
  • /App
    • __init__.py
    • App.py
    • exception.py
    • /Test
      • App_test.py


README.md: A document that helps others understand and replicate your code. In my experience, it works best when it covers these points:

  1. Description
  2. Installation
  3. Usage
  4. Troubleshooting
  5. Disclaimer
  6. Help wanted
  7. Links

requirements.txt: This file lists all the libraries and packages needed to run the project. You can install everything in this file with the command pip install -r requirements.txt

An example of the file content:


license.txt: Another important document, in which you state how your app may be used. A nice template for an open-source app is the X11 license or the MIT license. I use a template like this for my open-source projects.


/Data: This folder contains all the output or configuration files needed for the project.

config.json: This file contains a dictionary with all the global variables needed to execute the app. This is a nice way to centralize all the variables in one place. Here is a small example:
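As a rough sketch of how such a file might be read, the snippet below writes a small config.json to a temporary folder and loads it back as a dictionary. The keys (app_name, output_dir, debug) and the load_config helper are hypothetical and only illustrate the idea; the real keys depend on your app.

```python
import json
import tempfile
from pathlib import Path


def load_config(path):
    """Read a JSON config file and return its contents as a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical settings, used for illustration only.
example = {"app_name": "App", "output_dir": "Data/Files", "debug": False}

# Write the example to a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=2), encoding="utf-8")
    config = load_config(config_path)

print(config["output_dir"])
```

Keeping a second file such as config_test.json with the same keys lets the tests run against different values without touching the code.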


/App: This folder contains all the code, scripts, and modules that run the app. It goes by many names: src, main, and so on.
Opinions on the folder name differ, but I always name it after the app.

The argument parser, a huge ally:

For me, the argparse module is one of the biggest allies when developing an app in Python. It has been part of the standard library since version 3.2, and it helps me in every new project I code.

The argparse module makes it easy to write user-friendly command-line interfaces. The program defines what arguments it requires, and argparse will figure out how to parse those out of sys.argv. The argparse module also automatically generates help and usage messages and issues errors when users give the program invalid arguments.
Here you can see an easy code template.

In the template, an (*) marks where the flag options go. Each flag can be configured with the following add_argument() parameters.

  • [action] > Action to be taken
  • [nargs] > number of arguments
  • [const] > A constant value required by some action and nargs selections
  • [default] > Value produced if the argument is absent
  • [type] > Type the argument should be converted to
  • [choices] > Allowable values
  • [required] > Whether the option may be omitted
  • [help] > A brief description of the arg.
  • [metavar] > A name for the argument in usage messages.
  • [dest] > The name of the attribute to be added to the object returned by parse_args().
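The parameters listed above can be sketched in a small parser like the one below. This is a minimal illustration, not the original template: the flag names (--config, --mode, --count, --verbose), the build_parser helper, and the sample arguments are all assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical app started through Main.py."""
    parser = argparse.ArgumentParser(
        prog="Main.py",
        description="Hypothetical example app.",
    )
    # dest, metavar, default and help on an optional flag.
    parser.add_argument(
        "-c", "--config", dest="config_path", metavar="FILE",
        default="Data/config.json", help="path to the JSON config file",
    )
    # choices restricts the allowed values; required makes the flag mandatory.
    parser.add_argument(
        "-m", "--mode", choices=["run", "test"], required=True,
        help="which mode to start the app in",
    )
    # type converts the raw string from the command line.
    parser.add_argument("-n", "--count", type=int, default=1,
                        help="number of repetitions")
    # action="store_true" turns the flag into a boolean switch.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print more details")
    # nargs="*" collects zero or more positional values into a list.
    parser.add_argument("inputs", nargs="*", help="input files to process")
    return parser


# Parse an explicit list instead of sys.argv so the sketch runs as-is.
args = build_parser().parse_args(["--mode", "test", "-n", "3", "a.txt", "b.txt"])
print(args)
```

Calling parse_args() without a list reads sys.argv instead, which is what a real entry script would normally do.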


With this module, you can show help and useful descriptions when calling the Main.py script to run the program. To do so, pass -h when executing Main.py:

And the output would be something like this.


Other references and links

  • Github repository: cheatsheets


Most important bash commands for managing processes, Git, Python, R, SQL/SQLite and LaTeX for researchers and data scientists. This cheat sheet only focusses on bash commands run from the terminal.

Table of Contents

  • Managing processes
    - First-aid procedure for killing a running process
  • Git
    - Clone repository from GitHub to local machine
  • Python
    - Virtual environments
  • R
    - Open new window
  • SQL and SQLite
    - Repair corrupt database
  • Text editing and LaTeX
    - Calculate the number of words in a Latex file

Managing processes

First-aid procedure for killing a running process

  • open new terminal window
  • type ps + enter
  • identify the PID (process ID) of the process
  • type kill -9 <PID>

OR:

  • control + C (twice if needed)
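For reference, the same kill -9 step can be done from Python. The sketch below starts a throwaway child process (a stand-in for the stuck process, not part of the original procedure), sends it SIGKILL by PID, and reads the exit status. It assumes a POSIX system, since signal.SIGKILL is not available on Windows.

```python
import os
import signal
import subprocess
import sys

# Start a throwaway child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("PID:", child.pid)

# Equivalent of typing: kill -9 <PID>
os.kill(child.pid, signal.SIGKILL)

# Wait for the process to be reaped; a negative return code means it
# was terminated by that signal number.
child.wait(timeout=10)
print("return code:", child.returncode)
```

SIGKILL cannot be caught, so the process gets no chance to clean up; sending signal.SIGTERM first is the gentler option.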

Cronjobs and Crontab

Schedule crontab task

  • If you want to run the cronjob on a server: enter the server
  • enter crontab -e in the terminal
  • enter the schedule as <minute> <hour> <day of month> <month> <day of week>, followed by the command to run
  • for example 6 0 * * 1-6 cd /home/annerose/Python/continuousscraper/ && python processcontrol.py
  • this means that the process will start Monday through Saturday at 6 minutes past midnight.
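To make the five schedule fields concrete, here is a small sketch that checks whether a given moment matches a schedule such as 6 0 * * 1-6. It is a simplified illustration, not how cron itself is implemented: the cron_matches helper is hypothetical, and it only understands *, single numbers, ranges, and comma lists (no step values, no month or weekday names, and no special handling when both day fields are restricted).

```python
from datetime import datetime


def _field_matches(field, value):
    """Return True if one cron field ('*', '6', '1-6', '1,3') matches value."""
    if field == "*":
        return True
    for part in field.split(","):
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(schedule, moment):
    """Check a five-field schedule against a datetime (simplified)."""
    minute, hour, day, month, weekday = schedule.split()
    # Python counts Monday as 0; cron counts Sunday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        _field_matches(minute, moment.minute)
        and _field_matches(hour, moment.hour)
        and _field_matches(day, moment.day)
        and _field_matches(month, moment.month)
        and _field_matches(weekday, cron_weekday)
    )


schedule = "6 0 * * 1-6"
print(cron_matches(schedule, datetime(2024, 1, 1, 0, 6)))  # a Monday
print(cron_matches(schedule, datetime(2024, 1, 7, 0, 6)))  # a Sunday
```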

For more information, see

  • http://www.everydaylinuxuser.com/2014/10/an-everyday-linux-user-guide-to.html

Kill an existing cronjob

  • enter ps -e in the terminal to see all existing processes.
  • determine the process ID of your process.
  • enter kill -9 <processid>

Tmux sessions

Tmux lets you keep processes running after ending an ssh session. For more detailed explanation, see here.

  • ssh into the remote machine
  • start tmux by typing tmux into the shell
  • start the process you want inside the started tmux session
  • leave/detach the tmux session by typing Ctrl+B and then D

You can now safely log off from the remote machine; your process will keep running inside tmux. When you come back and want to check the status of your process, use tmux attach to attach to your tmux session.

If you want to have multiple sessions running side by side, you should name each session using Ctrl+B and $. You can get a list of the currently running sessions using tmux list-sessions.

Some more useful tmux commands (see also this video):

Command                         Significance
Ctrl+B <command>                prefix that tells the shell the keystroke is for tmux and not for the normal shell
Ctrl+B p                        previous window
Ctrl+B n                        next window
Ctrl+B c                        create window
Ctrl+B w                        list windows
Ctrl+B %                        split window vertically into two parts (left and right)
Ctrl+B "                        split window horizontally into two parts (top and bottom)
tmux new -s <sessionname>       create a new tmux session
Ctrl+B x                        close (kill) tmux pane
Ctrl+B d                        detach from tmux session (without stopping the process)
tmux list-sessions              list all tmux sessions
tmux attach -t <sessionname>    attach to a certain tmux session
tmux attach                     attach to the most recently used tmux session

Bash profiles

Create bash profile

touch creates the file, so no need to run this command when the file already exists. Alternative:

To edit the .bash_profile, open it in a text editor. See here

Git

Clone repository from GitHub to local machine

  • create new repository on GitHub
  • go to the directory on your local machine where the cloned repository should be saved.
  • type git clone https://github.com/your-name/repository-name.git
  • the repository should now appear in the local folder on your machine.

Commit file from terminal

  • go to the directory of your repository inside the terminal
  • type git add . to stage all changes. This recurses into sub-directories. Alternative: git add <file>, or git commit -a to stage and commit all tracked files in one step
  • git commit -m "your commit message". Commit the changes.
  • git push. Push the changes.

To see the status of your repository: git status.

See this useful blog.

Managing branches

Branches are very important when you collaboratively work on Github.

This github page contains useful information on how to create a new branch and how to manage branches on github.

  • go to the directory of your repository inside the terminal
  • before creating a new branch, make sure all changes are pulled to your local repository
  • Create new branch by typing git checkout -b [name_of_your_new_branch]
  • Push the new branch to github by typing git push origin [name_of_your_new_branch]
  • Check out which branches exist for this repository: git branch. (If there is only the master branch, it will return * master.)
  • Add a new remote for your branch: git remote add [name_of_your_remote] [url_of_the_remote]. A remote (URL) is Git’s fancy way of saying “the place where your code is stored.” (see here)
  • Push changes from your commit into your branch (= into your remote): git push [name_of_your_remote] [name_of_your_branch]
  • Download the latest changes from the remote: git fetch [name_of_your_remote]. Note that git fetch only downloads the changes; it does not merge them into your branch.
  • To merge changes between your branch and the original (master) branch, you should first switch to master branch in your terminal: git checkout master. Then simply type git merge [name_of_your_branch].

global .gitignore file

See here

Create a global .gitignore file (file types to be excluded from every git project):

The file is found under Documents/Username (as a hidden file). Open it in a text editor to edit it and add files you don’t want to sync with git/GitHub.

local .gitignore file

In the terminal, go to the working directory of the project you want to commit to github.

The file is found locally in the working environment of the project. Open it in a text editor to edit it and add files.

How to prevent conflicts in a collaborative Github project

The following procedure should go a long way toward preventing conflicts in a collaborative Git and GitHub project.

Before you start working: pull

Once you’ve made any changes to the project:

  1. Commit
  2. Pull
  3. If you get an error message, clean the file, solve conflicts
  4. Push

To summarize: pull, commit, pull, clean, push

Solve conflict using VIM editor

See this Stackoverflow post: http://stackoverflow.com/questions/5599122/problems-with-entering-git-commit-message-with-vim

If there is a conflict between your local version of the project and the version on Github, a window of the VIM editor will open after you’ve tried to commit your local changes. In this case, you should proceed as follows:

  • type i into the VIM editor, which opens the editing (“insert”) mode
  • type your merge message
  • press Esc to be sure to have left insert mode
  • then type :wq followed by Enter, which writes the current file and then closes it.
  • your merge should now have been accepted.

Push commits from terminal with two-factor authentication

See this helpful page on how to push commits from the terminal when using two-factor authentication on GitHub:
https://gist.github.com/wikimatze/9790374

Important: You need to use your personal access token, not your GitHub password, to push commits from the terminal.

Python

Virtual environments

Change virtual environment:

How to set up and manage virtual environments in Ubuntu: http://askubuntu.com/questions/244641/how-to-set-up-and-use-a-virtual-python-environment-in-ubuntu

Configure Pycharm to use a virtual environment

See here.

Then set the shell Preferences->Tools->Terminal->Shell path to
/bin/bash --rcfile ~/.pycharmrc

Check which python packages are installed
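One way to get this list from inside Python itself is sketched below. It uses importlib.metadata from the standard library (Python 3.8 or later) and prints one name==version line per installed distribution in the currently active environment. The variable names are illustrative.

```python
from importlib.metadata import distributions

# Collect (name, version) pairs for every distribution visible to this
# interpreter; skip entries whose metadata has no name.
installed = sorted(
    {
        (dist.metadata["Name"], dist.version)
        for dist in distributions()
        if dist.metadata["Name"]
    },
    key=lambda item: item[0].lower(),
)

for name, version in installed:
    print(f"{name}=={version}")
```

Because the result depends on the active interpreter, running this inside a virtual environment lists only what that environment can see.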

Start scrapy project

Start a Scrapy project for web scraping: enter scrapy startproject <project_name> in the terminal (in the directory where you want to start your project).

R

Open new window


  • Open new RStudio window from terminal (e.g. when one RStudio needs to run for an extended period of time):
  • enter open -n -a 'rstudio' in terminal
  • How to add an RStudio project to Github: https://www.r-bloggers.com/rstudio-and-github/

Add R project to Github

Add the following commands in shell after having created the project in Github:


Markdown and R


Render/compile an R Markdown file from Terminal:

This resource on R Markdown is helpful.

An R Markdown cheatsheet is available from RStudio here.

Options settings

Set options, even options that aren’t defined by default. This can be useful for example for setting your consumer key, consumer secret etc. of your Twitter app:

SQL and SQLite

Repair corrupt database

How to repair db database: see stackoverflow

Merge two SQLite databases

Leaving out duplicates:
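A possible way to do this from Python is sketched below with the standard sqlite3 module: the source database is attached to the target, and INSERT OR IGNORE copies the rows while skipping any that would violate a PRIMARY KEY or UNIQUE constraint. The merge_table helper, the table name items, and the sample rows are hypothetical; the approach assumes both databases hold a table with the same name and columns, and that duplicates are defined by such a constraint.

```python
import sqlite3
import tempfile
from pathlib import Path


def merge_table(target_path, source_path, table):
    """Copy rows of `table` from source into target, skipping duplicates."""
    conn = sqlite3.connect(str(target_path))
    try:
        conn.execute("ATTACH DATABASE ? AS source_db", (str(source_path),))
        # Rows that clash with a PRIMARY KEY or UNIQUE constraint are ignored.
        conn.execute(
            f'INSERT OR IGNORE INTO main."{table}" '
            f'SELECT * FROM source_db."{table}"'
        )
        conn.commit()
        conn.execute("DETACH DATABASE source_db")
    finally:
        conn.close()


def make_db(path, rows):
    """Create a small demo database with a hypothetical `items` table."""
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as tmp:
    first, second = Path(tmp) / "first.db", Path(tmp) / "second.db"
    make_db(first, [(1, "a"), (2, "b")])
    make_db(second, [(2, "b"), (3, "c")])
    merge_table(first, second, "items")

    conn = sqlite3.connect(str(first))
    rows = conn.execute("SELECT id, text FROM items ORDER BY id").fetchall()
    conn.close()

print(rows)
```

If the table has no key or unique constraint, nothing counts as a duplicate for INSERT OR IGNORE, so duplicates would need to be filtered another way.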

Open new SQLiteBrowser window from terminal


SQLiteBrowser is well suited for viewing and editing database files compatible with SQLite.

If you want to view several databases side by side, you have to open a new SQLiteBrowser window from the terminal (it doesn’t seem to be possible to open a new window from within SQLiteBrowser). To do this, go to the directory where your applications are stored (on a Mac). Normally, this should be:

Thereafter, type the command to open a new SQLiteBrowser window:

Text editing and LaTeX


Calculate the number of words in a Latex file


  • Change the working directory of your terminal to where the LaTeX .tex file is located.
  • I use one of the two options: (1) detex or (2) texcount
    • detex:
      Enter detex <document_name>.tex | wc -w -c -l or just detex <document_name>.tex | wc
      To calculate word count in pdf document: pdftotext <document_name>.pdf - | wc -w
    • texcount:
      Enter texcount -1 <document_name>.tex
      texcount offers many options
      For example, for including the bibliography in the word count, use texcount -1 -incbib <document_name>.tex
      To include several documents in the word count (e.g. main paper and appendix), just add the different documents behind one another: texcount -1 -incbib <main_document>.tex <appendix>.tex
      For more information on texcount, see this website