Student Alcohol Consumption Dataset
People who contributed to this include Aaron Patrick Nathaniel and Lim Yue Hng (Neil).

Using Python to Analyze Secondary School Student Alcohol Consumption and Their Academic Performance. Poonam Kumari and Aditya Pratap, Research Scholars, Department of Computer Science, IITM Janakpuri, New Delhi, India.

You may want to explore combining the grades into one feature, since G3 is likely derived from G1 and G2. A lot of time is lost to alcohol consumption, so students put less time into their academic work. A student's relationship with his/her family is low because of their high level of alcohol consumption.

With the Student Alcohol Consumption data set from the UCI Machine Learning Archive (Fabio Pagnotta 2016), we thought it would be interesting to see which features are important in determining whether a student is a heavy drinker. For numeric data, correlations help determine whether we should join the information of highly correlated features. This helps you understand whether the distribution of a numeric variable differs significantly across the levels of the categorical target.

Section 2e. It does not state the level of intimacy between them.

We could perform this merge differently later by performing a full join and then dealing with the NA values, by performing the analysis on the individual sets, or by inner joining the two sets and just working with that data.
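The difference between the inner join and the full join mentioned above can be sketched in plain Python. This is only an illustration: the rows are made up, the key columns in `KEYS` are an assumed short list (the real analysis may match on more attributes), and the `_mat` / `_por` suffixes are hypothetical names for the Math and Portuguese copies of a column.

```python
# Toy rows standing in for student-mat.csv and student-por.csv (values are made up).
math_rows = [
    {"school": "GP", "sex": "F", "age": 18, "G3": 6},
    {"school": "GP", "sex": "M", "age": 17, "G3": 12},
]
por_rows = [
    {"school": "GP", "sex": "F", "age": 18, "G3": 11},
    {"school": "MS", "sex": "F", "age": 16, "G3": 14},
]

# Assumption: these columns identify a student; keys are unique within each table.
KEYS = ("school", "sex", "age")


def key_of(row):
    """Return the tuple of identifying values for one row."""
    return tuple(row[k] for k in KEYS)


def value_columns(rows):
    """List the non-key columns in first-seen order."""
    cols = []
    for row in rows:
        for col in row:
            if col not in KEYS and col not in cols:
                cols.append(col)
    return cols


def join(left, right, how="inner"):
    """Join two lists of dict rows on KEYS; unmatched sides are filled with None."""
    left_by_key = {key_of(r): r for r in left}
    right_by_key = {key_of(r): r for r in right}
    if how == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "full":
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError("how must be 'inner' or 'full'")

    sides = (
        ("_mat", left_by_key, value_columns(left)),
        ("_por", right_by_key, value_columns(right)),
    )
    joined = []
    for k in keys:
        out = dict(zip(KEYS, k))
        for suffix, table, cols in sides:
            src = table.get(k)
            for col in cols:
                out[col + suffix] = src[col] if src is not None else None
        joined.append(out)
    return joined


inner = join(math_rows, por_rows, "inner")  # only students present in both sets
full = join(math_rows, por_rows, "full")    # everyone, with None where a side is missing
missing = sum(1 for row in full for value in row.values() if value is None)
print(len(inner), len(full), missing)
```

The inner join keeps only students found in both sets, while the full join keeps every student and leaves the missing side empty, which is where the NA handling would come in.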
Therefore, researchers seek to rectify that lack by conducting a survey to obtain important raw data on alcohol consumption in a student environment, as well as their demographic information and other data that may be of some relevance. Our main goal is to use data mining to predict school student alcohol consumption and to find the significant factors. This analysis was done as part of fulfilling the Data Mining course in Multimedia University.

Since the dataset is called “Student Alcohol Consumption”, of course, we should do some analyses on it. For a student to pass the subject, there are a couple of factors that could be correlated with the outcome. Our explanation would be more focused on the final grade. Many students in college experiment with drugs and alcohol, and sometimes these two things negatively affect their academic performance.

YAML is a good tool for setting up configurations, but in this case we will set the configurations manually.

Many of the features are ordinal and were discretized from continuous values. Addressing skewness, the mean of absences is 4.4348659 and the median is 2, indicating that the data is right-skewed, and given the spread between the min and max, the skewness is significant. We remove skewness by applying a log, square root, and/or inverse transformation.

Dinescu, D., Turkheimer, E., Beam, C.R., Horn, E.E., Duncan, G., Emery, R.E. 2016. Is marriage a buzzkill?

Section 2d. Testing correlation between alcohol consumption and social, gender, study time, and grade attributes for each student.

This modification coincides with the original report, where the authors modified the target with the formula acl = (Dalc * 5 + Walc * 2) / 7 and then assumed that values of 3 or more were heavy drinkers.
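The weighted target quoted above, acl = (Dalc * 5 + Walc * 2) / 7 with a cutoff of 3, can be written out as a short sketch. The function names and the sample rows are illustrative assumptions; only the formula and the threshold come from the text.

```python
def alcohol_score(dalc, walc):
    """Weekly average consumption: 5 workdays weighted by Dalc, 2 weekend days by Walc."""
    return (dalc * 5 + walc * 2) / 7


def is_heavy_drinker(dalc, walc, threshold=3):
    """Flag a student whose weighted score is at or above the threshold."""
    return alcohol_score(dalc, walc) >= threshold


# Made-up students; Dalc and Walc are both on the 1-5 scale.
students = [
    {"Dalc": 1, "Walc": 1},
    {"Dalc": 2, "Walc": 5},
    {"Dalc": 4, "Walc": 5},
]
for student in students:
    student["acl"] = alcohol_score(student["Dalc"], student["Walc"])
    student["heavy"] = is_heavy_drinker(student["Dalc"], student["Walc"])
    print(student)
```

Because workdays carry five sevenths of the weight, a student with low workday consumption stays below the cutoff even with the highest weekend value.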
The datasets have a total of 33 attribute columns, from which we could do some column selection based on certain parameters. The primary reason for collecting this data was to see the effects of drinking on grades. Thus, their final grade would be the perfect measure of that particular student's success. However, the data reveals that a total of 382 students were in both datasets.

In our data set, many of the categorical features are numeric, but for this illustration we will continue treating them as categorical. There are two categorical columns, “Dalc” and “Walc”, showing consumption on workdays and weekends. In general, we would assume that for people who are not healthy, workday and/or weekend alcohol consumption would also be lower.

However, research conducted in the United States by Balsa (2011) showed that increases in levels of alcohol consumption only resulted in small reductions of GPA. In April 2016, 3000 undergraduate students were randomly selected to participate in the survey, and 802 undergraduate students responded to at least part of the survey.

Reddit, a popular community discussion site, has a section devoted to sharing interesting data sets. National Institute on Alcohol Abuse and Alcoholism, Alcohol Use and Consumption Tables: a large number of HTML and text files on alcohol use and consumption.

The types of columns are listed as follows:

studytime – weekly study time (numeric: 1 – <2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours, or 4 – >10 hours)
failures – number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup – extra educational support (binary: yes or no)
famsup – family educational support (binary: yes or no)
paid – extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities – extra-curricular activities (binary: yes or no)
nursery – attended nursery school (binary: yes or no)
higher – wants to take higher education (binary: yes or no)
internet – Internet access at home (binary: yes or no)
romantic – with a romantic relationship (binary: yes or no)
famrel – quality of family relationships (numeric: from 1 – very bad to 5 – excellent)
freetime – free time after school (numeric: from 1 – very low to 5 – very high)
goout – going out with friends (numeric: from 1 – very low to 5 – very high)
Dalc – workday alcohol consumption (numeric: from 1 – very low to 5 – very high)
Walc – weekend alcohol consumption (numeric: from 1 – very low to 5 – very high)
health – current health status (numeric: from 1 – very bad to 5 – very good)
absences – number of school absences (numeric: from 0 to 93)
G1 – first period grade (numeric: from 0 to 20)
G2 – second period grade (numeric: from 0 to 20)
G3 – final grade (numeric: from 0 to 20, output target)

We could join information from existing features (PCA is a common example, or some knowledge about how features are correlated). Depending on the model, we could also remove features that are not important to the model.

One way to get an idea about the structure of the data is to calculate basic statistics, such as the min, max, mean, median, and missing value counts. Besides family relationships, we can also try to find if there is a relationship between being single and consuming high levels of alcohol. However, if more elaborate data mining techniques were to be used, more features could be selected and used in order to predict if a student will get a passing grade based on the factors mentioned above.
Next Steps in Preparing the Data for a Model

https://archive.ics.uci.edu/ml/datasets/STUDENT%20ALCOHOL%20CONSUMPTION
http://www.who.int/substance_abuse/publications/global_alcohol_report/en/

Data Exploratory Analysis – Student Alcohol Consumption

school – student’s school (binary: ‘GP’ – Gabriel Pereira or ‘MS’ – Mousinho da Silveira)
sex – student’s sex (binary: ‘F’ – female or ‘M’ – male)
age – student’s age (numeric: from 15 to 22)
address – student’s home address type (binary: ‘U’ – urban or ‘R’ – rural)
famsize – family size (binary: ‘LE3’ – less or equal to 3 or ‘GT3’ – greater than 3)
Pstatus – parent’s cohabitation status (binary: ‘T’ – living together or ‘A’ – apart)
Medu – mother’s education (numeric: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu – father’s education (numeric: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob – mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

To test this, we will also apply the Box and Cox method to determine the parameter that indicates which transformation method is best.
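As a rough sketch of how the Box and Cox method picks a transformation, the snippet below searches a grid of lambda values and keeps the one with the highest profile log-likelihood. A lambda near 0 corresponds to a log transform, 0.5 to a square-root-like transform, and -1 to an inverse-like transform. The sample values, the grid, and the function names are assumptions made for illustration; the transform needs strictly positive data, so a feature with zeros such as absences would need a shift (for example adding 1) first.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def profile_log_likelihood(xs, lam):
    """Log-likelihood of lambda under a normal model for the transformed data."""
    n = len(xs)
    ys = [box_cox(x, lam) for x in xs]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    return -n / 2 * math.log(var) + (lam - 1) * sum(math.log(x) for x in xs)


def best_lambda(xs, grid=None):
    """Return the grid value of lambda with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # -2.0 to 2.0 in steps of 0.1
    return max(grid, key=lambda lam: profile_log_likelihood(xs, lam))


# Made-up, strongly right-skewed sample (each value doubles the previous one).
sample = [1, 2, 4, 8, 16, 32, 64]
lam = best_lambda(sample)
print("best lambda:", lam)
```

For this geometric sample the search lands at or next to lambda = 0, which points to the log transform. In practice a library routine such as `scipy.stats.boxcox` can do the same search more precisely.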
They are:

address - U/R for urban or rural respectively
famsize - LE3/GT3 for less than or equal to three, or greater than three, family members
Pstatus - T/A for living together or apart from parents, respectively
Medu - 0 (none) / 1 (primary-4th grade) / 2 (5th - 9th grade) / 3 (secondary) / 4 (higher) for mother's education
Fedu - 0 (none) / 1 (primary-4th grade) / 2 (5th - 9th grade) / 3 (secondary) / 4 (higher) for father's education
Mjob - 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' for the student's mother's job
Fjob - 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' for the student's father's job
reason - close to 'home', school 'reputation', 'course' preference or 'other' for the choice of school
guardian - mother/father/other as the student's guardian
traveltime - 1 (<15 mins) / 2 (15 - 30 mins) / 3 (30 mins - 1 hr) / 4 (>1 hr) for time from home to school
studytime - 1 (<2 hrs) / 2 (2 - 5 hrs) / 3 (5 - 10 hrs) / 4 (>10 hrs) for weekly study time
failures - 1-3/4 for number of class failures (if more than 3, then record 4)
schoolsup - yes/no for extra educational support
famsup - yes/no for family educational support
paid - yes/no for extra paid classes for Math or Portuguese
activities - yes/no for extra-curricular activities
nursery - yes/no for whether attended nursery school
higher - yes/no for desire to continue studies
internet - yes/no for internet access at home
romantic - yes/no for relationship status
famrel - 1-5 scale on quality of family relationships
freetime - 1-5 scale on how much free time after school
goout - 1-5 scale on how much student goes out with friends
Dalc - 1-5 scale on how much alcohol consumed on weekdays
Walc - 1-5 scale on how much alcohol consumed on weekend
absences - 0-93 amount of absences from school

We can use studytime (column 14), paid (column 18), activities (column 19), romantic (column 23), famrel (column 24), goout (column 26), Dalc (column 27), Walc (column 28). These columns describe:

the amount of time a student studies (studytime, column 14)
whether the student joins any extra paid classes (paid, column 18)
whether the student participates in any extra co-curricular activities (activities, column 19)
whether the student is involved in any romantic relationship (romantic, column 23)
the quality of the student's family relationships (famrel, column 24)
the tendency of the student to go out with friends (goout, column 26)
weekday alcohol consumption (Dalc, column 27)
weekend alcohol consumption (Walc, column 28)

This will be explained in the next section (Section C). This will be attempted in section E as part of the preprocessing before plotting the data for our exploratory data analysis.

Section 2c. One hypothesis concerns alcohol consumption (both columns 27 and 28) when famrel has a low value. Dinescu et al. (2016) studied the relationship between married couples and their single counterparts and found that if partners are more intimate, they will drink less. To obtain insights on this, we could refer to column 29 (health), column 27 (workday alcohol consumption) and/or column 28 (weekend alcohol consumption).

This data set contains survey information from a group of students in a secondary school. The original data comes from a survey conducted by a professor in Portugal. We would oversample since we have limited data.

The violin plot of absences shows more of a log normal distribution, and a large number of outliers lie well outside of the top whisker. We will take a closer look at the distribution of this feature. Depending on the model you choose, removing skewness could help improve the predictive ability of the model.

You can see the level of correlation by the degree of the ellipse. Fedu and Medu correlate more than some others, so we might want to combine the information.
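The remark above that Fedu and Medu correlate more than some other pairs can be checked with a correlation coefficient. The sketch below computes Pearson's r by hand; the sample values are made up, and treating these ordinal education codes as plain numbers is a simplifying assumption (a rank-based measure such as Spearman's may suit ordinal data better).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up education codes (0 = none ... 4 = higher) for eight students.
medu = [4, 1, 1, 4, 3, 4, 2, 4]
fedu = [4, 1, 1, 2, 3, 3, 2, 4]

r = pearson(medu, fedu)
print(f"Medu/Fedu correlation: {r:.3f}")
```

A value close to 1 would support combining the two features before modelling; a value near 0 would suggest they carry separate information.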
Generally, many models prefer using features that are independent of each other and have low correlations.

We assume that a father’s education level is similar to a mother’s education level, so let us visualize the association. The plot shows that the education levels of mother and father coincide fairly often, so we might want to explore this further or consider joining these features when preprocessing the data before model building. Essentially, the blue rectangles show that the observed counts and expected counts (derived from a loglinear model) coincide well, and since the rectangles are large, the confidence covers a majority of the observations.

Secondary school student alcohol consumption data with social, gender and study information.

Nicolas Raj.

The most recent statistics from the National Institute on Alcohol Abuse and Alcoholism (NIAAA) estimate that about 1,519 college students ages 18 to 24 die from alcohol-related unintentional injuries, including motor vehicle crashes. It could be alcohol poisoning or an alcohol-related injury or both. Alcohol Abuse and Dependence: Roughly 20 percent of college students meet the criteria for an alcohol use disorder in a given year (8 percent alcohol abuse, 13 percent alcohol dependence).

Balsa, A. I., Giuliano, L. M., & French, M. T. (2011). The effects of alcohol use on academic achievement in high school.

The scope of these data sets varies a lot, since they’re all user-submitted, but they tend to be very interesting and nuanced.

This information can give you a hint of the skewness and of possible outliers.
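A minimal sketch of the basic statistics that give this hint, using only the standard library. The sample values are made up and `None` stands in for a missing entry; the helper name `summarize` is hypothetical.

```python
import statistics


def summarize(values):
    """Return min, max, mean, median and the missing-value count of a column."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "missing": len(values) - len(present),
    }


# Made-up absence counts with one missing value.
absences = [0, 0, 2, 2, 2, 4, 6, None, 10, 30]
stats = summarize(absences)

# A mean well above the median suggests a long right tail.
skew_hint = "right-skewed" if stats["mean"] > stats["median"] else "not right-skewed"
print(stats, skew_hint)
```

Here the single large value pulls the mean above the median, the same pattern the text reports for absences, and the gap between the max and the median flags that value as a possible outlier.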
This Student Alcohol Consumption dataset is based on data collected in two secondary schools in Portugal. The dataset was originally designed for the estimation of high school students' performance, where alcohol consumption is used as one of the parameters. The number of mathematics students involved in the collection was 395, whereas 649 Portuguese Language students were recorded to have participated.

The original data contains the same attributes for both the student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets. The grades G1, G2 and G3 are related to the course subject, Math or Portuguese. Before exploration, we combine the rows of the two data sets and mark each instance with the class in which the survey was taken.

There are a few columns which we think could be further clarified or changed. Five columns play a major role in this, including column 27 (workday alcohol consumption) and/or column 28 (weekend alcohol consumption), column 31 (first period grade) and column 32 (second period grade). The final result is treated as pass/fail rather than as a discrete number, because it would be less accurate for the classification model to predict a numeric value ranging from 0-20.

It’s called the datasets subreddit, or /r/datasets.

From this analysis, what might we preprocess before creating the model? To get an idea of how features interact with each other, we can determine the rank associated with the features to a target, in this case the actual target or level of drinking. At an alcohol consumption level of 1, the median and 25th percentile of study time are the same value of 2 (studytime level 2, that is, 2 to 5 hours per week).

We test the null hypothesis (h0) that the numeric variable has the same mean values across the different levels of the categorical variable.
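One common way to test an equal-means hypothesis (h0) of this kind is a one-way ANOVA. The sketch below computes only the F statistic by hand; the grouped values are made up, and whether the original analysis used ANOVA or another test is an assumption.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of each value around its own group mean.
    ss_within = sum(
        (v - mean) ** 2
        for group, mean in zip(groups, group_means)
        for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up numeric values grouped by three levels of a categorical variable.
by_level = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7],
]
f_stat = anova_f(by_level)
print("F statistic:", f_stat)
```

A large F statistic means the group means differ more than the within-group spread would explain, which counts against h0. Turning F into a p-value needs the F distribution, which a routine such as `scipy.stats.f_oneway` provides.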
The students included in the survey were in the courses of mathematics and Portuguese.

Fabio Pagnotta, Hossain Mohammad Amran.

Short exploratory data analysis focusing on the alcohol variables from the Portuguese school dataset. Derived output: Alc = (Walc × 2 + Dalc × 5) / 7, again in the range of 1 – 5.

This helps you to understand the top dependent variables (grouped by numerical and categorical). The box plot portion of the graph also helps us identify outliers. We look a bit closer at the distribution of absences and test for normality.

For example, if there were a high correlation, say 0.9, between two numeric features, then the information provided to the model would be redundant and, depending on the model, could make the model more complex than it needs to be.

Section 3b. As we all know, human relationships play a major role in people's lives. According to the World Health Organization (Global Status Report on Alcohol and Health 2014), gender, family, and social factors affect alcohol consumption.

reason – reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian – student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime – home to school travel time (numeric: 1 – <15 min., 2 – 15 to 30 min., 3 – 30 min. to 1 hour, or 4 – >1 hour)
Core measures include: Baseline surveys included standard demographics, religiosity, current alcohol and drug diagnoses (DIS), ASI alcohol, drug and psychiatric problem severity, number of heavy drinkers in social networks, prior treatment utilization, and lifetime and past-year 12-step meeting attendance and involvement, Six- and 12-month surveys involved a subset of these …