Loading Machine Learning Data in Python
This article is for beginners who want to load their data properly in Python. We introduce several techniques so that you can start your machine learning project in Python more easily.
Load Machine Learning Data
Before loading a data file, you need to identify a few critical parts of it. In machine learning, CSV (comma-separated values) is the most commonly used format. The critical parts of a CSV file of machine learning data are the file header, comments, the delimiter, and quotes.
- CSV File Header: The header in a CSV file is used to automatically assign names or labels to each column of your dataset. You will have to name your attributes manually if your file doesn’t have a header.
- Comments: Comments in a CSV file are indicated by a hash sign (#) at the start of a line. Depending on the method you choose to load your machine learning data, you will have to state whether to expect comments and which character identifies them.
- Delimiter: The delimiter separates the values in each row, and the standard delimiter is the comma (,). The tab (\t) is another delimiter that you can use, but you have to specify it explicitly when loading.
- Quotes: If field values in your file contain spaces, these values are often quoted, and the double quotation mark (") is the default quote character. If your file uses another character, you need to specify it when loading the file.
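To make these four parts concrete, here is a minimal sketch using Python's built-in csv module. The sample text, its column names, and the helper name read_rows are made up for illustration, and the sketch assumes that comments occupy whole lines starting with a hash sign.

```python
import csv
import io

# Hypothetical sample: a comment line, a header row, and quoted fields.
SAMPLE = '''# made-up sample data
name,age,city
"Ann Lee",34,"New York"
Bob,29,Boston
'''

def read_rows(text, delimiter=",", quotechar='"', comment="#"):
    """Return (header, rows) from CSV text, skipping whole-line comments."""
    # csv.reader has no comment option, so drop comment lines before parsing.
    lines = (line for line in io.StringIO(text) if not line.startswith(comment))
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar)
    header = next(reader)  # first non-comment line holds the column names
    rows = list(reader)    # remaining lines are the data records (as strings)
    return header, rows

header, rows = read_rows(SAMPLE)
print(header)
print(rows)
```

For a tab-separated file, the same helper could be called with delimiter="\t". Note that the csv module returns every value as a string, so numeric columns still need converting.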
Once you have identified these critical parts of your data file, you can move on to the different methods of loading machine learning data in Python.
Load Data with Python Standard Library
To load your data with the Python standard library, use the csv module and its reader() function. Once loaded, the CSV data can be converted to a NumPy array and used for machine learning.
Below is an example. When run, this short script loads a dataset that has no header and contains only numeric fields, then converts it to a NumPy array.
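# Load CSV using the Python standard library
import csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
# Open in text mode; the csv module recommends newline='' for its input files
with open(filename, 'rt', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)
# Convert the list of string rows to an array of floats
data = numpy.array(x).astype('float')
print(data.shape)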
In short, csv.reader() returns an object that iterates over each row of the data, and the resulting list of rows can easily be converted into a NumPy array. Running the sample code prints the following shape of the array:
(768, 9)
Load Data File With NumPy
Another way to load machine learning data in Python is by using NumPy and the numpy.loadtxt() function.
The sample code below shows this approach. The function assumes that your file has no header row and that all data use the same format. The example also assumes that the file pima-indians-diabetes.data.csv is stored in your current directory.
# Load CSV
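A minimal sketch of this approach might look like the following. It assumes a comma-delimited, all-numeric file with no header; the helper name load_csv_numpy and the tiny stand-in file with made-up values are illustrative only, used so the sketch runs without the real dataset.

```python
import os
import tempfile

import numpy as np

def load_csv_numpy(path):
    """Load a headerless, all-numeric, comma-delimited CSV into an ndarray."""
    return np.loadtxt(path, delimiter=",")

# Stand-in file with made-up values so the sketch runs anywhere.
# In the article's setting you would pass 'pima-indians-diabetes.data.csv'.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w") as f:
        f.write("6,148,72\n1,85,66\n")
    data = load_csv_numpy(sample_path)

print(type(data).__name__, data.shape)
```

With the two-row stand-in file the shape is (2, 3); the full dataset described in this article has the shape shown in the text.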
Running the sample code loads the file as a numpy.ndarray and prints the shape of the data:
(768, 9)
If your file can be retrieved using a URL, the above code can be changed to the following, while producing the same dataset:
# Load CSV from URL using NumPy
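One way to sketch the URL variant is to fetch the text with urllib.request.urlopen() and pass it to numpy.loadtxt(). The helper name load_csv_from_url is hypothetical, and because the dataset's URL is not given here, the sketch uses a local stand-in file with made-up values exposed as a file:// URL so that it runs offline.

```python
import io
import pathlib
import tempfile
from urllib.request import urlopen

import numpy as np

def load_csv_from_url(url):
    """Fetch a headerless numeric CSV from a URL and return an ndarray."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    # loadtxt accepts file-like objects, so wrap the downloaded text.
    return np.loadtxt(io.StringIO(text), delimiter=",")

# Stand-in: a local file with made-up values, addressed as a file:// URL.
# Replace it with the http(s) URL of your own dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = pathlib.Path(tmp) / "sample.csv"
    sample_path.write_text("6,148,72\n1,85,66\n")
    data = load_csv_from_url(sample_path.as_uri())

print(data.shape)
```

Reading the whole response into memory keeps the sketch simple; for very large files a streaming approach may be preferable.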
Running the code produces the same shape of the data:
(768, 9)
Load Data File With Pandas
The third way to load your machine learning data is using Pandas and the pandas.read_csv() function.
This approach is very flexible and is often the most convenient way to load your machine learning data. It returns a pandas.DataFrame, which you can start summarizing and plotting immediately.
The sample code below assumes that the pima-indians-diabetes.data.csv file is stored in your current directory.
# Load CSV using Pandas
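A minimal sketch of loading with pandas.read_csv() might look like this. The column names, the helper name load_csv_pandas, and the stand-in file with made-up values are all hypothetical, chosen only so the sketch runs without the real dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical column names for the stand-in file below.
names = ["col_a", "col_b", "label"]

def load_csv_pandas(path, names):
    """Load a headerless CSV into a DataFrame with explicit column names."""
    # Passing names tells pandas the file has no header row of its own.
    return pd.read_csv(path, names=names)

# Stand-in file with made-up values so the sketch runs anywhere.
# In the article's setting you would pass 'pima-indians-diabetes.data.csv'
# together with one name per column.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w") as f:
        f.write("6,148,1\n1,85,0\n")
    data = load_csv_pandas(sample_path, names)

print(data.shape)
```

From here, calls such as data.describe() give a quick summary of each column.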
In this example, the name of each attribute is passed explicitly to the DataFrame. When you run the sample code, the following shape of the data is printed:
(768, 9)
If your file can be retrieved using a URL, the above code can be changed to the following, while producing the same dataset:
# Load CSV using Pandas from URL
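Because pandas.read_csv() accepts a URL string directly, the URL variant can be sketched with very little extra code. As the dataset's URL is not given here, this sketch uses a local stand-in file with made-up values addressed as a file:// URL, and the column names are hypothetical.

```python
import pathlib
import tempfile

import pandas as pd

# Hypothetical column names for the stand-in file below.
names = ["col_a", "col_b", "label"]

# Stand-in: a local file with made-up values, addressed as a file:// URL
# so the sketch runs offline. Replace it with the http(s) URL of your data.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = pathlib.Path(tmp) / "sample.csv"
    sample_path.write_text("6,148,1\n1,85,0\n")
    url = sample_path.as_uri()
    # read_csv fetches and parses the URL in one call.
    data = pd.read_csv(url, names=names)

print(data.shape)
```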
Running the sample code above will download a CSV file, parse it, and produce the following shape of the loaded DataFrame:
(768, 9)
These are three basic methods of loading your data into Python. Choose the one that best suits your project.