After performing data analysis, the next step is to prepare the data for training a machine learning model. Data preparation involves handling missing values, encoding categorical features, and splitting the dataset into features (X) and the target variable (y). Here's how you can prepare the Titanic dataset for machine learning:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
train_df = pd.read_csv('train.csv')

# Drop columns that carry little predictive signal
drop_columns = ['PassengerId', 'Name', 'Ticket', 'Cabin']
train_df = train_df.drop(columns=drop_columns)

# Handle missing values (assignment avoids pandas' chained-assignment pitfalls)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])

# Split features (X) and the target variable (y)
X = train_df.drop(columns=['Survived'])
y = train_df['Survived']

# Split the dataset into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Preprocessing: ColumnTransformer to handle numeric and categorical features
numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(
    steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

# Combine preprocessing with a machine learning model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Train the model
model.fit(X_train, y_train)

# Evaluate the model
accuracy = model.score(X_valid, y_valid)
print(f'Model Accuracy: {accuracy:.2f}')
```
Here's a breakdown of the data preparation steps in the code:
Drop Unnecessary Columns: Remove columns that are not useful for modeling.
Handling Missing Values: Fill missing values in the "Age" and "Embarked" columns.
Splitting Data: Split the dataset into features (X) and the target variable (y), and further split into training and validation sets.
Preprocessing: Define a ColumnTransformer that applies scaling to numeric features and one-hot encoding to categorical features.
Model Pipeline: Create a pipeline that combines the preprocessing steps with a machine learning model (Random Forest in this case).
Train the Model: Fit the pipeline to the training data.
Evaluate the Model: Calculate and print the model accuracy on the validation set.
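A single train/validation split can give a noisy accuracy estimate; cross-validation averages over several splits for a more stable number. The sketch below shows the pattern with cross_val_score on small synthetic data (a toy target derived from the Sex column, so the score is expectedly high; it is illustrative, not the real Titanic data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with Titanic-like columns
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'Age': rng.normal(30, 10, n),
    'Fare': rng.exponential(30, n),
    'Sex': rng.choice(['male', 'female'], n),
})
y = (X['Sex'] == 'female').astype(int)  # toy target fully determined by Sex

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age', 'Fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
])
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

# 5-fold cross-validation: the whole pipeline is re-fit on each training fold
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}')
```

Because the pipeline bundles preprocessing with the classifier, cross_val_score re-fits the scaler and encoder on each training fold, keeping the evaluation leakage-free.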
This code provides a basic example of data preparation and model training for the Titanic challenge. You can further refine the preprocessing steps, experiment with different algorithms, tune hyperparameters, and explore more advanced techniques to improve your model's performance.
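For the hyperparameter tuning mentioned above, GridSearchCV can search over pipeline parameters using the step-name double-underscore convention (e.g. classifier__n_estimators). A minimal sketch, again on synthetic stand-in data rather than the real train.csv:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (toy target derived from Sex, for illustration)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'Age': rng.normal(30, 10, n),
    'Fare': rng.exponential(30, n),
    'Sex': rng.choice(['male', 'female'], n),
})
y = (X['Sex'] == 'female').astype(int)

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age', 'Fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
])
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Keys address pipeline steps: '<step name>__<parameter name>'
param_grid = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [3, None],
}
search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
search.fit(X, y)
print('Best params:', search.best_params_)
print(f'Best CV accuracy: {search.best_score_:.2f}')
```

The same pattern extends to preprocessing choices (e.g. trying different imputation strategies) since any pipeline parameter can appear in the grid.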