Hello folks....!!!!
Today let us look at a very important module in Python's standard library, 'csv'.
What is csv ?
- CSV stands for Comma Separated Values (and sometimes Character Separated Values).
- CSV files are widely used in machine learning/deep learning projects.
- They serve as the datasets for classification/clustering/prediction problems.
- Opened in a spreadsheet program they look like Excel spreadsheets, but they are plain text files saved with the extension '.csv'.
- The values in these files are separated by a delimiting character, most often a comma (',') or a semicolon (';').
- Each value in a row corresponds to a column.
Standard sites
There are some standard sites from which you can download csv files for
processing.
- UCI repository : https://archive.ics.uci.edu/ml/index.php
- Kaggle : https://www.kaggle.com/
- Indian government's open datasets : https://data.gov.in/
- Quandl : https://www.quandl.com/search
- Google's public datasets : https://cloud.google.com/bigquery/public-data/
Some sites may ask you to create an account with them. You can create one
with your mail ID on the above websites; they are trustworthy.
How to download a csv file ?
You can download a csv file by following the steps given below and store it in
any location on your PC. I will show you an example using the UCI repository.
1. Open the link https://archive.ics.uci.edu/ml/index.php. You will see a window like the one given
below.
2. Now click on the data set you want to download. I am
downloading iris dataset now.
3. You will get a window like the one given below. If
you want details such as the number of attributes and the name and
explanation of each attribute, click on Data set description and it will be
downloaded. Open it with Notepad/WordPad or any other editor to view it.
4. To download the dataset, click on Data Folder. A
window like the one given below will appear.
The index link contains a downloadable file with information on the
creation of the dataset. bezdekIris.data and iris.data are the links to
download the dataset. The iris.names link also contains a downloadable file
with information that describes the dataset.
5. Click on iris.data. The file will be downloaded.
Open it with Notepad or any other editor to view the
contents. The downloaded file will look like
the one given below.
Explanation :
This dataset has 5 attributes. Now consider the first row. It can
be read as:
- sepal length in cm :5.1
- sepal width in cm :3.5
- petal length in cm :1.4
- petal width in cm :0.2
- class : Iris-setosa.
This means that a flower with the above properties (sepal length and width,
petal length and width) belongs to the class Iris-setosa.
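The mapping described above can be sketched in a few lines of Python. This is only an illustration: the variable names are made up for this example, and the column names are typed in by hand from the dataset description because iris.data itself has no header line.

```python
# Column names are not stored in iris.data; they are copied here from the
# dataset description (an assumption of this sketch).
attribute_names = [
    "sepal length in cm",
    "sepal width in cm",
    "petal length in cm",
    "petal width in cm",
    "class",
]

# The first row of the dataset, as a plain line of text.
first_row = "5.1,3.5,1.4,0.2,Iris-setosa"

# Split the line on the comma delimiter and pair each value with its column.
values = first_row.split(",")
record = dict(zip(attribute_names, values))

for name, value in record.items():
    print(f"{name}: {value}")
```

Splitting on ',' by hand works for this simple line, but it breaks as soon as a value itself contains a comma or quotes, which is why the csv package is used for real files.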
Reading csv files :
reader()
The csv files can be read using the reader provided by the csv package.
First import the package, then open the file in read mode. Look at code
snippet and its output given below. The dataset considered is the iris
dataset downloaded before and stored in C:\
The reader() returns a reader object, which is an iterable. It can
be imagined as a list of lists: each row in the file becomes a
separate list of strings. Thus, using a for loop, each row is printed out here.
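Here is a minimal sketch of reading a file with reader(). So that it runs anywhere, it first writes three sample lines in the iris.data format to a temporary file (that part is an assumption of this example). With the real download you would skip that step and pass your own file path to open().

```python
import csv
import os
import tempfile

# Three sample lines in the same format as iris.data (stand-in data so the
# sketch runs without the downloaded file).
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Write the sample to a temporary file; replace `path` with the location of
# your own file when working with the real dataset.
with tempfile.NamedTemporaryFile("w", suffix=".data", delete=False, newline="") as tmp:
    tmp.write(sample)
    path = tmp.name

rows = []
# Open in read mode; newline="" is what the csv documentation recommends.
with open(path, "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:  # each row is a list of strings
        print(row)
        rows.append(row)

os.remove(path)  # clean up the temporary file
```

Note that every value comes back as a string, so numbers such as '5.1' need float() before you can do arithmetic on them.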
Custom delimiters :
Some files do not use a comma (',') as the character that separates the
values. It may be a tab, a '|' or a ';'. The reader() can
be customized to read the file based on its delimiter. To specify the
delimiter, set the value of the optional parameter
'delimiter'.
Consider a dataset where the values are separated by a '|' (pipe symbol):
To read this csv, we can use the code snippet given below.
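A small sketch of reading pipe-separated data. The names and values are hypothetical stand-ins for the dataset, and io.StringIO is used in place of a file on disk so the example is self-contained. With a real file, pass the object returned by open() instead.

```python
import csv
import io

# Hypothetical pipe-separated data standing in for the file on disk.
data = "name|age|city\nAsha|29|Chennai\nRavi|34|Mumbai\n"

# delimiter tells the reader which character separates the values.
reader = csv.reader(io.StringIO(data), delimiter="|")
rows = list(reader)

for row in rows:
    print(row)
```

For tab-separated files the same idea applies with delimiter="\t".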
In some csv files you will notice a space after the delimiter (before each value). To remove that extra space, pass the optional parameter skipinitialspace. It takes a boolean value and the default is False. If it is set to True, the space immediately after the delimiter is ignored. For example, if the csv is as given below :
You can write skipinitialspace=True in the reader( ).
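The effect is easiest to see by reading the same data twice, once without and once with skipinitialspace. The data here is made up for the illustration and fed in through io.StringIO instead of a file.

```python
import csv
import io

# Hypothetical data with a space after every comma.
data = "name, age, city\nAsha, 29, Chennai\n"

# Default behaviour: the space stays attached to the value.
plain = list(csv.reader(io.StringIO(data)))

# With skipinitialspace=True the space right after the delimiter is dropped.
trimmed = list(csv.reader(io.StringIO(data), skipinitialspace=True))

print(plain[1])
print(trimmed[1])
```

Only the whitespace right after the delimiter is skipped. Spaces inside a value or at its end are left alone.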
Quoting - parameter:
Consider a csv file which contains the famous personalities and their
quotes.
Now if you read it normally , the reader() returns everything in the string
format. Thus all the values will be enclosed within the quotes as you see
below.
After reading the csv file, we get an output like this. An extra quote mark
can be seen here because our csv file itself contains quotation marks. To avoid
such a mess, we can pass the quoting parameter to the reader() function. The
quoting parameter takes one of the constants
csv.QUOTE_ALL, csv.QUOTE_MINIMAL, csv.QUOTE_NONE or
csv.QUOTE_NONNUMERIC. Mostly QUOTE_NONE is used with the reader() and the others are used
while writing csv files. If you don't want the double quotes to be treated
specially, then use QUOTE_NONE. Look at the example below.
If you get a Unicode error while reading the csv file, pass encoding='utf8' to open().
Now all the quotes appear as normal characters while reading.
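The following sketch compares the default quoting behaviour with QUOTE_NONE. The names and sayings are placeholders invented for this example, and io.StringIO stands in for the file. When reading a real file you would write something like open(path, encoding='utf8', newline='') instead.

```python
import csv
import io

# Placeholder data: the second value on each line is wrapped in double quotes.
data = 'Person A,"Keep it simple"\nPerson B,"Practice every day"\n'

# Default (QUOTE_MINIMAL): the reader treats the quotes as field wrappers
# and strips them from the value.
default_rows = list(csv.reader(io.StringIO(data)))

# QUOTE_NONE: the quote character gets no special treatment and stays in
# the value as an ordinary character.
raw_rows = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE))

print(default_rows[0])
print(raw_rows[0])
```

One thing to keep in mind: with QUOTE_NONE a comma inside the quoted text is treated as a delimiter, so a saying that contains a comma would be split into two values.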
Dialects:
So far you have seen a number of parameters passed to the reader()
function. Passing them again and again when working with a large number of files
makes the code ugly and reduces readability. To overcome this,
dialects are used. A dialect groups the formatting parameters under one name. Look at the
code below.
First we have to register the dialect with a name and the formatting
that has to be applied. Here I have registered two dialects with different
formatting styles. I have just passed the dialect parameter to the reader()
function instead of writing all the other formatting parameters every time.
This reduces the code and improves the readability. We can also reuse the
same dialect for reading another csv file.
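As a rough sketch of the idea, two dialects are registered below and then used by name. The dialect names 'pipes' and 'raw_quotes' and the sample data are invented for this example and are not necessarily the ones used in the original snippet.

```python
import csv
import io

# Register each dialect once, giving it a name and its formatting options.
csv.register_dialect("pipes", delimiter="|", skipinitialspace=True)
csv.register_dialect("raw_quotes", delimiter=",", quoting=csv.QUOTE_NONE)

# Hypothetical sample data for each style.
pipe_data = "name| age\nAsha| 29\n"
quote_data = 'Person A,"Keep it simple"\n'

# Pass only the dialect name instead of repeating every parameter.
pipe_rows = list(csv.reader(io.StringIO(pipe_data), dialect="pipes"))
quote_rows = list(csv.reader(io.StringIO(quote_data), dialect="raw_quotes"))

print(pipe_rows)
print(quote_rows)
```

csv.list_dialects() returns the names of all registered dialects, which is handy for checking what is available.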
DictReader()
It is used to read the contents of the csv file into dictionaries. It
creates an object that reads like a normal reader() but maps the values in each
row to the column names, which by default are taken from the first line of the
file. Look at the example below:
In Python 3.6 and 3.7 each row is an Ordered Dictionary (from Python 3.8
onwards it is a regular dict). If you just print an OrderedDict row, you will see
something like the one given below.
To view it more clearly, replace the
print(row) statement in the
for loop with print(dict(row)). Then you will get the output as given below.
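A minimal DictReader sketch with made-up data (io.StringIO again stands in for a file opened with open()):

```python
import csv
import io

# Hypothetical data; the first line supplies the column names.
data = "name,age,city\nAsha,29,Chennai\nRavi,34,Mumbai\n"

reader = csv.DictReader(io.StringIO(data))

records = []
for row in reader:
    # dict(row) gives a plain dictionary on every Python 3 version.
    print(dict(row))
    records.append(dict(row))

# The column names picked up from the first line.
print(reader.fieldnames)
```

If the file has no header line, as with iris.data, you can pass the column names yourself through the fieldnames parameter of DictReader().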