Week 2 of Bytewise Fellowship: Experience with Machine Learning using Python
Hello all tech enthusiasts!
How are you all? It's my second week in the Bytewise Fellowship Program, and I am excited to share my journey with you all so far. This week, Ma'am Nimra Waqar gave us two exciting tasks to deepen our understanding of machine learning and data analytics.
Let’s see what I learned and accomplished this week!
Task 1: Machine Learning Basics
Machine learning is a branch of artificial intelligence that allows systems to learn and improve from experience without being explicitly programmed. It includes algorithms that can process data, recognize patterns, and make decisions with minimal human intervention.
Types of Machine Learning
Supervised Learning:
This involves training a model on labeled data where the input and output pairs are known. Examples include classification and regression tasks.
Unsupervised Learning:
This involves working with unlabeled data, where the model tries to recognize patterns and relationships on its own. Examples include clustering and dimensionality reduction.
Reinforcement Learning:
This involves training an agent to make a series of decisions by rewarding or punishing it based on its behavior.
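To make the first two types concrete, here is a minimal sketch contrasting supervised and unsupervised learning with scikit-learn on the Iris data (the classifier and clustering algorithm chosen here are just illustrative examples, not part of the task):
# Supervised vs. unsupervised learning on the same data
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: the model learns from labeled (X, y) pairs
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print("Supervised prediction:", clf.predict(X[:1]))

# Unsupervised: the model sees only X and groups similar samples
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
print("Cluster assignments:", km.labels_[:5])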
Basic Terminology
Dataset: A collection of data used to train and evaluate a model.
Feature: An independent variable or attribute in the data.
Label: A dependent variable or outcome in supervised learning.
Training: The process of teaching a model using a training data set.
Testing: Evaluating the model’s performance on a test data set.
Python Basics
Understanding the basics of Python is important to implement machine learning algorithms. This includes knowledge of data types, control structures, and functions.
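As a quick illustration of those basics (the values here are made up), this short snippet touches data types, a control structure, and a function:
# Python basics: data types, a for loop with a condition, and a function
measurements = [5.1, 4.9, 6.3]   # list of floats
species = "setosa"               # string

def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

for value in measurements:       # control structure: a for loop
    if value > 5.0:
        print(f"{value} is above 5.0")

print(f"Average for {species}: {average(measurements):.2f}")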
Introduction to NumPy and Pandas
NumPy and Pandas are important Python libraries for data manipulation and analysis. NumPy supports large multidimensional arrays and matrices, while Pandas provides the data structures and functions needed to manipulate tables of numbers and time series.
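Here is a tiny sketch of both libraries side by side (the numbers are toy values, not taken from any dataset):
# NumPy for array math, Pandas for labeled tabular data
import numpy as np
import pandas as pd

arr = np.array([[5.1, 3.5], [4.9, 3.0]])   # 2-D NumPy array
print(arr.mean(axis=0))                     # column-wise mean

df = pd.DataFrame(arr, columns=["sepal length", "sepal width"])
print(df.describe())                        # summary statistics per column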
Exploring the Iris Dataset
The Iris dataset is a classic dataset used in machine learning. It contains measurements of iris flowers and is often used for classification tasks.
Here’s a quick look at the code I used:
Important Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action="ignore", category=FutureWarning)
plt.rcParams["figure.figsize"] = [10,5]
#loading dataset
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
print(iris_df.head())
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(iris_df, iris.target, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape}, Testing set size: {X_test.shape}")
X = iris_df[['sepal length (cm)']]
y = iris.target
model = LinearRegression()
model.fit(X, y)
# Print model coefficients
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
You can find the full notebook on my GitHub account.
Task 2: Exploring the Titanic Dataset
Exploratory Data Analysis (EDA)
Exploring the Titanic dataset was an eye-opening experience. The dataset contains information about passengers such as age, gender, fare, survival status, etc. We conducted EDA to uncover patterns and insights.
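A typical first pass of EDA looks something like this (a minimal sketch, assuming the Titanic data is loaded from a CSV into a DataFrame named df; the file name is an assumption):
# First look at the Titanic data
import pandas as pd

df = pd.read_csv("titanic.csv")        # assumed file name
df.info()                              # column types and non-null counts
print(df.describe())                   # summary statistics for numeric columns
print(df.isnull().sum())               # missing values per column
print(df['Survived'].value_counts())   # class balance of the target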
Data Cleaning and Preprocessing
Cleaning the dataset includes handling missing values, fixing data types, and encoding categorical variables. Data preprocessing ensures that the data is in the right format for training the model.
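Here is a hedged sketch of the kind of cleaning steps involved, assuming the standard Kaggle Titanic column names (Age, Embarked, Sex); the exact choices in my notebook may differ:
# Common Titanic cleaning steps (illustrative, assuming Kaggle column names)
df['Age'] = df['Age'].fillna(df['Age'].median())                 # fill missing ages
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0]) # fill missing ports
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})              # encode a categorical column
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)   # one-hot encode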
Data Visualization
Data visualization helped me understand the distribution and relationships among the variables.
Here are some charts created with Seaborn and Matplotlib:
plt.figure(figsize=(10,5)) # Adjusting the size of the figure; you can change it
numeric_columns = df.select_dtypes(include='number') # Keeping only numeric columns for the correlation
correlation = numeric_columns.corr() # Calculating the correlation
sns.heatmap(correlation, cmap="BrBG", annot=True) # Displaying the correlation using a heat map (BrBG: Brown, Blue, Green)
survival_counts = df['Survived'].value_counts().sort_index()
survival_labels = ['Did Not Survive', 'Survived']
sns.barplot(x=survival_labels, y=survival_counts.values, palette='viridis')
# Add title and labels
plt.title('Number of Passengers Who Survived on the Titanic')
plt.xlabel('Survival Status')
plt.ylabel('Number of Passengers')
# Show the plot
plt.show()
sns.violinplot(x='Pclass', y='Age', data=df, palette='viridis')
plt.title('Violin Plot of Age by Class')
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.show()
sns.kdeplot(df['Age'], color='red', linewidth=2)
plt.show()
You can find the full notebook on my GitHub account.
That’s it for Week 2. Stay tuned for more exciting updates next week!
You can learn more about me by connecting with me on LinkedIn and following me on Instagram.
Thanks for reading!