Data Engineering Summit 2022, presented by Google Cloud and organised by Analytics India Magazine, is India’s first conference dedicated to the high-demand and impactful field of data engineering. This virtual conference, to be held on April 30, 2022, will focus on data engineering innovation and give attendees direct access to top engineers and innovators working in leading tech companies.
This is an opportunity for attendees to learn, from leading practitioners in the field, about the software deployment architecture of machine learning systems and how to build current data frameworks and solutions for business use cases.
Data Engineering Championship by MachineHack
MachineHack is organising a data engineering hackathon for data scientists & data engineers to participate and win a chance to present at DES 2022.
Data engineering consists of collecting, provisioning and maintaining high-quality data to get insights. To do that, a data engineer needs to design and develop a scalable data architecture, set up processes that pool data from multiple sources, check the data quality, and eliminate corrupt data. In addition, exploratory data analysis (EDA) and extract, transform, and load (ETL) techniques are required to access the data and use it downstream to solve business problems.
START DATE: 13th April 2022, 6:00 PM
END DATE: 30th May 2022, 6:00 PM
All you need to know about the ‘Data Engineering Championship’
Using the dataset provided, participants need to analyse the data and create the following features.
- ‘DATE’: create the date from year, month and day of the week
- ‘LOW’: Lower value of DEP_TIME_BLK
- ‘HIGH’: Higher value of DEP_TIME_BLK
- ‘TIMESTAMP’: create a timestamp with date and lower value of DEP_TIME_BLK
- ‘WIND_CHILL’: the perceived temperature due to the cooling effect of wind
- ‘PRCP_SNOW_RATIO’: ratio of precipitation (PRCP) to snowfall (SNOW)
- ‘PLANE_AGE_AIRLINE_AIRPORT_FLIGHTS_MONTH_RATIO’: ratio of plane age (PLANE_AGE) to average monthly flights for the airline and airport (AIRLINE_AIRPORT_FLIGHTS_MONTH)
- ‘SEAT_DISTRIBUTION’: ratio of seats (NUMBER_OF_SEATS) to concurrent flights (CONCURRENT_FLIGHTS)
- ‘SEAT_DISTRIBUTION_NORMALISED’: normalised values of the ratio of seats to concurrent flights
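Several of the features above can be sketched with a few small helper functions. This is a minimal illustration under stated assumptions, not the organisers' reference solution: it assumes DEP_TIME_BLK values look like '0800-0859', that wind chill follows the US National Weather Service formula with TMAX in degrees Fahrenheit and AWND in miles per hour, and that normalisation means min-max scaling. The function names are hypothetical. DATE and TIMESTAMP are left out because the column list does not show where the year comes from.

```python
import re


def split_dep_time_blk(block):
    """Return (LOW, HIGH) as 'HH:MM' strings.

    Assumes the block is formatted like '0800-0859'.
    """
    match = re.fullmatch(r"(\d{2})(\d{2})-(\d{2})(\d{2})", block.strip())
    if match is None:
        raise ValueError(f"unexpected DEP_TIME_BLK value: {block!r}")
    lo_h, lo_m, hi_h, hi_m = match.groups()
    return f"{lo_h}:{lo_m}", f"{hi_h}:{hi_m}"


def wind_chill(temp_f, wind_mph):
    """Wind chill in deg F using the NWS formula (an assumption here).

    The formula is only defined for temperatures at or below 50 deg F and
    wind speeds above 3 mph; outside that range the air temperature is
    returned unchanged.
    """
    if temp_f > 50 or wind_mph <= 3:
        return temp_f
    v = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v + 0.4275 * temp_f * v


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0.

    Useful for ratios such as PRCP / SNOW, where SNOW is often 0.
    """
    return numerator / denominator if denominator else None


def min_max_normalise(values):
    """Scale values linearly to the range [0, 1] (assumed normalisation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

With pandas, the same helpers could be applied column-wise, for example through Series.map or DataFrame.apply. How zero denominators should be encoded in the final file is a decision the sketch leaves open by returning None.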
Evaluation
To determine the winners of the hackathon, the submissions will be evaluated using the mean absolute error. One can use sklearn.metrics.mean_absolute_error(y_true, y_pred) to calculate it.
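As a worked illustration of the metric, mean absolute error is the average of the absolute differences between the true and predicted values. The sketch below computes it in plain Python on made-up numbers; the helper name mae and the sample values are hypothetical and are not part of the competition data.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]

# Absolute errors are 0.5, 0.0 and 1.5, so the mean is 2.0 / 3.
score = mae(y_true, y_pred)
print(round(score, 4))
```

A lower value is better, and a score of 0 means every prediction matched exactly.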
This hackathon will support private and public leaderboards.
- The public leaderboard is evaluated on 30% of the dataset
- The private leaderboard, which is evaluated on 100% of the dataset, will be made available at the end of the hackathon
- The final score represents the score achieved based on the Best Score on the public leaderboard
How to generate a valid submission file?
Keep the following points in mind when preparing your submission file.
- Sklearn models should support the predict() method to generate the predicted values.
- The participant should submit a .csv file with exactly 2,00,00 rows with 9 columns. The submission will return an Invalid Score if you have extra rows or columns.
Points to note:
- One should not shuffle the sequence of the test series
- If you are using pandas, use the following submission code:
submission_df.to_csv('my_submission_file.csv', index=False)
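Because a file with extra rows or columns is scored as invalid, it may help to check the file's shape before uploading. The following is a minimal sketch using only the standard library. It assumes the first line of the file is a header row, which is what pandas writes by default; the function name check_submission and its parameters are hypothetical, and the expected row count is passed in by the caller rather than hard-coded.

```python
import csv


def check_submission(path, expected_rows, expected_columns=9):
    """Return a list of problems found in the CSV file (empty if none)."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # assumed header row
        if header is None:
            return ["file is empty"]
        if len(header) != expected_columns:
            problems.append(
                f"expected {expected_columns} columns, found {len(header)}"
            )
        n_rows = 0
        for line_no, row in enumerate(reader, start=2):
            n_rows += 1
            # Every data row should have as many fields as the header.
            if len(row) != len(header):
                problems.append(f"line {line_no}: {len(row)} fields")
        if n_rows != expected_rows:
            problems.append(f"expected {expected_rows} rows, found {n_rows}")
    return problems
```

An empty list means the row and column counts match what was requested. The check says nothing about whether the values themselves are correct or whether the row order was preserved.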
Dataset: 200000 rows x 26 columns
- MONTH: Month
- DAY_OF_WEEK: Day of Week
- DEP_DEL15: TARGET Binary of a departure delay over 15 minutes (1 is yes)
- DISTANCE_GROUP: Distance group to be flown by departing aircraft
- DEP_BLOCK: Departure block
- SEGMENT_NUMBER: The segment that this tail number is on for the day
- CONCURRENT_FLIGHTS: Concurrent flights leaving from the airport in the same departure block
- NUMBER_OF_SEATS: Number of seats on the aircraft
- CARRIER_NAME: Carrier
- AIRPORT_FLIGHTS_MONTH: Avg Airport Flights per Month
- AIRLINE_FLIGHTS_MONTH: Avg Airline Flights per Month
- AIRLINE_AIRPORT_FLIGHTS_MONTH: Avg Flights per month for Airline AND Airport
- AVG_MONTHLY_PASS_AIRPORT: Avg Passengers for the departing airport for the month
- AVG_MONTHLY_PASS_AIRLINE: Avg Passengers for the airline for the month
- FLT_ATTENDANTS_PER_PASS: Flight attendants per passenger for airline
- GROUND_SERV_PER_PASS: Ground service employees (service desk) per passenger for airline
- PLANE_AGE: Age of departing aircraft
- DEPARTING_AIRPORT: Departing Airport
- LATITUDE: Latitude of departing airport
- LONGITUDE: Longitude of departing airport
- PREVIOUS_AIRPORT: Previous airport that aircraft departed from
- PRCP: Inches of precipitation for the day
- SNOW: Inches of snowfall for the day
- SNWD: Inches of snow on the ground for the day
- TMAX: Max temperature for the day
- AWND: Max wind speed for the day
Prize
The three winners will get a chance to present their solution approaches at the Data Engineering Summit (DES 2022).
Submission deadline
If you want to be a part of this exciting hackathon, make sure to submit your entries by May 30, 2022, at 06:00 PM IST, as the private leaderboard will be frozen at that time.
Disqualification
- If any of the details entered are found incorrect, Analytics India Magazine reserves the right to disqualify any participant.
- Any external dataset usage is strictly prohibited. The participants will be disqualified if found using any external dataset.