March 9, 2021 Yen Lily

Loading Machine Learning Data in Python

This article is for beginners who want to load their data properly in Python. We introduce several techniques so that you can get your machine learning project in Python started more easily.

Source: shanelynn.ie

Load Machine Learning Data

Before loading a data file, you need to identify a few critical parts of it. In machine learning, CSV (comma-separated values) is the most commonly used format. The critical parts of a CSV file of machine learning data are the file header, comments, the delimiter and quotes.

        • CSV File Header: The header in a CSV file is used to automatically assign names or labels to each column of your dataset. If your file doesn’t have a header, you will have to name your attributes manually.
        • Comments: By convention, a comment in a CSV file is a line that starts with a hash sign (#). Depending on the method you choose to load your data, you may have to state whether to expect comments and which character marks them.
        • Delimiter: A delimiter separates the values (fields) in each row, and the standard delimiter is the comma (,). The tab (\t) is another common delimiter, but you have to specify it explicitly when loading the file.
        • Quotes: If field values in your file contain spaces, these values are often quoted, and the default quote character is the double quotation mark ("). If your file uses a different character, you need to specify it when loading the file.
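
The four parts listed above map onto options of Python's built-in csv module. The sketch below is only an illustration: the sample text, the tab delimiter and the column names are made up for this example, and skipping lines that start with # is done by hand here, since csv.reader has no comment option.

```python
import csv
import io

# Made-up sample: a comment line, a header row, tab-delimited fields,
# and quoted values that contain a space.
sample = (
    "# hypothetical patient export\n"
    "name\tage\tclass\n"
    '"Jane Doe"\t50\t1\n'
    '"John Roe"\t31\t0\n'
)

def read_rows(text):
    """Return (header, rows) from tab-delimited text, skipping # comments."""
    # csv.reader has no comment option, so filter comment lines first.
    lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    reader = csv.reader(lines, delimiter="\t", quotechar='"')
    header = next(reader)   # first non-comment line holds the column names
    rows = list(reader)     # remaining lines are data, still as strings
    return header, rows

header, rows = read_rows(sample)
print(header)
print(rows)
```

Note that every value comes back as a string, including the numeric ones, so a conversion step is still needed before the data can be used for machine learning.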

Once you have identified these critical parts of your data file, we can look at the different methods of loading machine learning data in Python.

Load Data with Python Standard Library

To load your data with the Python Standard Library, you use the csv module and its reader() function. The reader returns each row as a list of strings, which you can then convert to a NumPy array for machine learning.

Below is an example. It is a short script that loads a dataset with no header and only numeric fields, then converts it to a NumPy array.

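# Load CSV using the Python standard library
import csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
# newline='' is what the csv module documentation recommends for files passed to csv.reader
with open(filename, 'rt', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)
# Rows arrive as lists of strings, so convert them to floats explicitly
data = numpy.array(x).astype('float')
print(data.shape)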

Put simply, this code creates a reader object that iterates over each row of the data, collects the rows in a list, and converts that list into a NumPy array. Running the sample code prints the shape of the array:

(768, 9)
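
To make the conversion step concrete, the sketch below runs the same idea on a few in-memory rows instead of the real file. The three rows are made-up values in the same nine-column layout, used only so that the example is self-contained.

```python
import csv
import io
import numpy

# Three made-up rows with nine numeric fields each (not real dataset values).
sample = (
    "2,100,70,20,0,25.0,0.5,30,0\n"
    "5,150,80,30,90,31.5,0.3,45,1\n"
    "1,90,60,15,50,22.0,0.2,25,0\n"
)

reader = csv.reader(io.StringIO(sample), delimiter=",", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Each field is a string at this point, not a number.
print(type(rows[0][0]))

# The explicit astype call is what turns the strings into floats.
data = numpy.array(rows).astype("float")
print(data.shape)
print(data.dtype)
```

If any field cannot be parsed as a number, the astype call raises a ValueError, which is why this approach suits files that contain only numeric fields.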

Load Data File With NumPy

Another way to load machine learning data in Python is by using NumPy and the numpy.loadtxt() function.

The sample code below shows how. The function assumes that your file has no header row and that all data use the same format. The example also assumes that the file pima-indians-diabetes.data.csv is stored in your current working directory.

# Load CSV using NumPy
import numpy
filename = 'pima-indians-diabetes.data.csv'
# loadtxt accepts a filename directly and closes the file itself
data = numpy.loadtxt(filename, delimiter=',')
print(data.shape)

Running the sample code above loads the file as a numpy.ndarray and prints the shape of the data:

(768, 9)
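
If a file does have a header row or comment lines, numpy.loadtxt() can skip them through its skiprows and comments parameters. The sketch below uses made-up in-memory text rather than the real file, so the values and the shortened column list are purely illustrative.

```python
import io
import numpy

# Made-up sample: a header row, then a comment line between the data rows.
sample = (
    "preg,plas,class\n"
    "2,100,0\n"
    "# hypothetical comment between rows\n"
    "5,150,1\n"
)

# skiprows=1 skips the header line; comments="#" drops the '#' line.
data = numpy.loadtxt(io.StringIO(sample), delimiter=",", comments="#", skiprows=1)
print(data.shape)
print(data)
```

Keep in mind that skiprows counts comment lines too, so a comment placed above the header would need skiprows=2.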

If your file can be retrieved from a URL, the code above can be changed to the following, which produces the same dataset:

# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
# urlopen returns a file-like object; the with block closes the connection
with urlopen(url) as raw_data:
    dataset = loadtxt(raw_data, delimiter=',')
print(dataset.shape)

Running the code prints the same shape of the data:

(768, 9)

Source: pythonbasics.org

Load Data File With Pandas

The third way to load your machine learning data is to use Pandas and the pandas.read_csv() function.

This function is very flexible and is often the most convenient way to load machine learning data. It returns a pandas.DataFrame that you can start summarizing and plotting immediately.

The sample code below assumes that the pima-indians-diabetes.data.csv file is stored in your current directory.

# Load CSV using Pandas
import pandas
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(filename, names=names)
print(data.shape)

The sample code explicitly specifies the name of each attribute in the DataFrame. Running it prints the following shape of the data:

(768, 9)
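
Because the result is a DataFrame, a first look at the data takes only a few more calls. The sketch below applies the same names list to three made-up rows held in memory (not real dataset values), then previews and summarizes them.

```python
import io
import pandas

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Three made-up rows in the same nine-column layout, for illustration only.
sample = (
    "2,100,70,20,0,25.0,0.5,30,0\n"
    "5,150,80,30,90,31.5,0.3,45,1\n"
    "1,90,60,15,50,22.0,0.2,25,0\n"
)

# names= supplies the column labels because the text has no header row.
data = pandas.read_csv(io.StringIO(sample), names=names)

print(data.head())              # first rows with their column labels
print(data.dtypes)              # numeric type inferred for each column
print(data['plas'].describe())  # count, mean, std, min, quartiles, max
```

Unlike the NumPy approaches, each column keeps its own inferred type, so integer columns and floating-point columns can sit side by side.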

If your file can be retrieved from a URL, the code above can be changed to the following, which produces the same dataset:

# Load CSV using Pandas from URL
import pandas
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
print(data.shape)

Running the sample code above will download a CSV file, parse it, and produce the following shape of the loaded DataFrame:

(768, 9)
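
pandas.read_csv() also exposes parameters for the file features described at the start of this article. The sketch below is an illustration on made-up, tab-delimited text: sep, comment, header and quotechar are real read_csv parameters, while the sample content and column names are invented for this example.

```python
import io
import pandas

# Made-up sample: a comment line, a header row, tab-delimited fields,
# and quoted values that contain a space.
sample = (
    "# hypothetical export\n"
    "name\tage\tclass\n"
    '"Jane Doe"\t50\t1\n'
    '"John Roe"\t31\t0\n'
)

data = pandas.read_csv(
    io.StringIO(sample),
    sep="\t",        # tab instead of the default comma
    comment="#",     # ignore lines that start with '#'
    header=0,        # first non-comment line holds the column names
    quotechar='"',   # default quote character, shown here for clarity
)
print(data)
```

When the file has a header row, as here, the names argument is not needed because the column labels come from the file itself.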

Conclusion

These are three different methods of importing your data into Python, and they cover only the basic workflow. You can choose the most suitable one to start your project.

iRender currently provides a GPU Cloud for AI/DL service so that users can train their models. On our high-configuration, high-performance machines, you can install any software you need. With just a few clicks, you can get access to a machine and take full control of it. Your model training can run 10 or even 50 times faster.

For more information, please sign up and try our services, or contact us via WhatsApp: (+84) 916806116 for advice and support.

